Pattern Discovery from Unstructured and Scarcely Labeled Text Corpora

Open Access
Soleimani Bajestani, Hossein
Graduate Program: Electrical Engineering
Degree: Doctor of Philosophy
Document Type: Dissertation
Date of Defense: June 10, 2016
Committee Members:
  • David J. Miller, Dissertation Advisor
  • David J. Miller, Committee Chair
  • George Kesidis, Committee Member
  • Vishal Monga, Committee Member
  • C. Lee Giles, Outside Member
  • David Russell Hunter, Outside Member
Keywords:
  • Machine Learning
  • Topic Models
  • Anomaly Detection
  • Semi-supervised Learning
  • Text Modeling
  • Credit Attribution
In this dissertation, we propose probabilistic models for processing large collections of unstructured and sparingly labeled text. We develop methods for challenging tasks such as clustering, anomaly detection, classification, and credit attribution in text documents, a domain with a very high-dimensional feature space. We first focus on discovering interpretable topics that manifest on only a small subset of the high-dimensional feature space. We propose a parsimonious topic model that allows parameter sharing in high-dimensional discrete data such as text corpora. In related topic models such as Latent Dirichlet Allocation (LDA), all words are modeled topic-specifically, even though many words occur with similar frequencies across different topics. Our approach, in contrast, determines salient words for each topic, which have topic-specific probabilities, with the rest explained by a universal shared model. Moreover, unlike in LDA, where all topics are in principle present in every document with non-zero proportions, our model identifies a sparse subset of relevant topics for each document. We derive a Bayesian Information Criterion (BIC) that balances model complexity and goodness of fit. Unlike standard BIC, where all model parameters are penalized equally, we identify an effective sample size and a corresponding penalty specific to each parameter type in our model. We minimize BIC to jointly determine our entire model -- the topic-specific words, the document-specific topics, all model parameter values, and the total number of topics -- in a wholly unsupervised fashion. Experimental results show that our model achieves higher test set likelihood and better agreement with ground-truth class labels than several baseline topic models. We then adopt our parsimonious model for detecting anomalous groups and the atypical patterns they exhibit.
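The per-parameter-type penalty idea can be illustrated with a minimal toy sketch. This is not the dissertation's actual formulation; the function name, the parameter counts, and the effective sample sizes below are all hypothetical. The point is that each parameter type k contributes its own penalty 0.5 · |θ_k| · log(n_k), with n_k the effective sample size for that type, so a richer model must buy its extra topic-specific parameters with a sufficient gain in likelihood:

```python
import numpy as np

def bic_score(log_likelihood, param_counts, effective_sizes):
    """Toy BIC-style cost: each parameter *type* gets its own penalty
    based on a type-specific effective sample size, rather than one
    global penalty as in standard BIC. Lower is better."""
    penalty = sum(0.5 * k * np.log(n)
                  for k, n in zip(param_counts, effective_sizes))
    return -log_likelihood + penalty

# Two hypothetical candidates: a sparse model (50 topic-specific word
# parameters) vs. a rich one (500), each with 10 topic-proportion
# parameters; effective sample sizes differ by parameter type.
sparse = bic_score(log_likelihood=-1000.0,
                   param_counts=[50, 10], effective_sizes=[5000, 200])
rich = bic_score(log_likelihood=-990.0,
                 param_counts=[500, 10], effective_sizes=[5000, 200])
print(sparse < rich)  # → True: the small likelihood gain does not pay
                      #   for 450 extra topic-specific parameters
```

Minimizing such a score over candidate structures is what lets the topic-specific word sets, the per-document topic sets, and the number of topics all be chosen jointly and without supervision.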
We define an anomalous group (cluster) as a set of points that collectively exhibit abnormal patterns. In many applications, this can lead to a better understanding of the nature of the atypical behavior and to identifying the sources of the anomalies. In particular, we consider the case where the atypical patterns manifest on only a small (salient) subset of the very high-dimensional feature space. Individual anomaly detection techniques, as well as techniques that detect anomalies using all the features, typically fail to detect such anomalies; our method, in contrast, can detect such instances collectively, discover the shared anomalous patterns they exhibit, and identify the subsets of salient features. Experimental results show that our method can accurately detect anomalous topics and the salient features (words) under each such topic, achieving better performance than both standard group and individual anomaly detection techniques.

We then move from wholly unlabeled data to sparingly labeled text corpora and develop semi-supervised models for document classification and credit attribution. We first propose a semi-supervised hierarchical class-based mixture of topic models for classifying documents. Most topic models incorporate documents' class labels by generating them after generating the words. In these models, the ground-truth class labels have a small effect on the estimated topics, as they are dominated by a huge set of word features. In contrast, in our model, we generate the words in each document conditioned on the class label. We show that this generative process allows us to better incorporate the ground-truth labels. Within our framework, we also provide a principled mechanism to control the relative contributions of the class labels and the word space to the likelihood function. Experiments show that our approach achieves better classification accuracy than several standard semi-supervised and supervised topic models.
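The class-conditional direction of generation -- words generated given the class label, so an unlabeled document is assigned the class maximizing the joint likelihood -- can be sketched with a minimal naive-Bayes-style toy. The vocabulary, counts, priors, and smoothing below are illustrative assumptions only, not the dissertation's hierarchical mixture of topic models:

```python
import numpy as np

# Hypothetical per-class word counts over a tiny vocabulary
# [ball, vote, court], estimated from labeled documents.
word_counts = {
    "sports": np.array([30, 5, 1]),
    "politics": np.array([2, 25, 3]),
}
class_prior = {"sports": 0.5, "politics": 0.5}

def classify(doc_counts):
    """Assign the class c maximizing log p(c) + sum_w n_w log p(w | c),
    with Laplace-smoothed class-conditional word probabilities."""
    scores = {}
    for c, counts in word_counts.items():
        probs = (counts + 1.0) / (counts.sum() + len(counts))
        scores[c] = np.log(class_prior[c]) + doc_counts @ np.log(probs)
    return max(scores, key=scores.get)

print(classify(np.array([4, 0, 1])))  # → sports
```

Because the words are generated conditioned on the class, every word observation directly informs the class posterior, rather than the label being a single extra observation swamped by thousands of word features.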
Finally, we propose a semi-supervised multi-label topic model for jointly performing document classification and credit attribution to document sentences, i.e., labeling each sentence with the most appropriate label subset. Under our model, each sentence is associated with, i.e., explains, only a subset of the document's labels (possibly none of them), with the document's label set being the union of the labels of all of its sentences. Our model, in a semi-supervised fashion, discovers the topics present, learns associations between topics and class labels, predicts labels for unlabeled documents, and determines label associations for each sentence in every document. For learning, our model uses only labels provided at the document level. We develop a Hamiltonian Monte Carlo-based algorithm for efficiently sampling from the joint label distribution over all sentences, a very high-dimensional discrete space. We also propose an approximate extension of our model based on stochastic variational inference, which can scale to massive datasets by performing inference in parallel. Experiments show that our approach outperforms several baseline methods with respect to both document- and sentence-level classification, as well as test set log-likelihood.
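The credit-attribution constraint -- each sentence explains a (possibly empty) subset of the document's labels, and the document's label set is the union over its sentences -- can be sketched as follows. The sentence labels here are hypothetical placeholders, not output of the actual model:

```python
# Hypothetical per-sentence label subsets for one document.
sentence_labels = [
    {"economy"},            # sentence 1 explains one label
    set(),                  # sentence 2 explains no label
    {"economy", "policy"},  # sentence 3 explains two labels
]

def document_labels(per_sentence):
    """The document's label set is the union of its sentences' labels."""
    labels = set()
    for s in per_sentence:
        labels |= s
    return labels

print(sorted(document_labels(sentence_labels)))  # → ['economy', 'policy']
```

This union structure is what lets document-level supervision constrain sentence-level labels: any label observed for the document must be explained by at least one sentence, while a label absent from the document is forbidden for all of them.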