Improved generative modeling approaches for semi-supervised and domain adaptive classifier learning from labels and constraints

Open Access
Raghuram, Jayaram
Graduate Program:
Electrical Engineering
Degree:
Doctor of Philosophy
Document Type:
Date of Defense:
July 17, 2014
Committee Members:
  • David Jonathan Miller, Dissertation Advisor
  • David Jonathan Miller, Committee Chair
  • George Kesidis, Committee Member
  • Kenneth Jenkins, Committee Member
  • Dr Yu Zhang, Committee Member
Keywords:
  • semisupervised classification
  • semisupervised constraint based learning
  • classifier domain adaptation
  • machine learning
This dissertation contributes to three closely related problems in machine learning: {\em 1. Semi-supervised classification, 2. Semi-supervised learning with instance-level constraints, and 3. Semi-supervised domain adaptation of classifiers}. Semi-supervised learning, an active area of research for more than a decade, addresses the scarcity of labels (limited supervision) in practical machine learning and statistical modeling tasks such as classification and clustering. It exploits unlabeled data, which is usually abundant and easy to obtain, to improve on models that would otherwise be learned only from the few samples that have supervision. Domain adaptation of classifiers is a more recent area of research, where the goal is to leverage a large labeled database and/or an existing classifier model to learn a better classifier for a target domain whose underlying data distribution is different.
In the case of {\em semi-supervised classification}, where the partial supervision takes the form of class labels, we developed a method that improves the model of the class posterior probability given the feature vector in a generative mixture model based classifier. In particular, using novel stochastic data generation methods, we allow the class posterior probability model within each mixture component to be a non-trivial function of the feature vector. This addresses a significant limitation of existing methods, which allow only a single class per component or a single probability mass function per component that is independent of the feature vector. The result is more {\em fine-grained} component-conditional class modeling, which can lead to better classification performance, as we demonstrate on synthetic and real data sets in Chapter \ref{ss_fine_grained}.
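The difference between a fixed class probability per component and a class posterior that varies with the feature vector inside each component can be illustrated with a toy sketch. It is not the dissertation's method: the one-dimensional Gaussian components, the two-class setting, the logistic form of the within-component class model, and every parameter value below are assumptions chosen only for illustration. In both variants the overall class posterior is the mixture sum over components j of P(j | x) P(c | j, x).

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def responsibilities(x, weights, means, variances):
    """Component posteriors P(j | x) for a 1-D Gaussian mixture."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [p / total for p in joint]

def class_posterior(x, weights, means, variances, class_models):
    """P(c = 1 | x) = sum_j P(j | x) * P(c = 1 | j, x)."""
    resp = responsibilities(x, weights, means, variances)
    return sum(r * model(x) for r, model in zip(resp, class_models))

def logistic(slope, offset):
    """Hypothetical feature-dependent class model for one component."""
    return lambda x: 1.0 / (1.0 + math.exp(-(slope * x + offset)))

# Illustrative (assumed) mixture: two equally weighted unit-variance components.
weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]

# Coarse variant: one feature-independent class probability per component.
coarse_models = [lambda x: 0.9, lambda x: 0.2]

# Fine-grained variant: the class probability changes with x inside each component.
fine_models = [logistic(2.0, 0.0), logistic(-1.0, 3.0)]

for x in (-1.0, 0.0, 1.0):
    coarse = class_posterior(x, weights, means, variances, coarse_models)
    fine = class_posterior(x, weights, means, variances, fine_models)
    print(f"x={x:+.1f}  coarse={coarse:.3f}  fine={fine:.3f}")
```

With the coarse models, the class posterior can only change through the component responsibilities. With the feature-dependent models, it can also change within a region dominated by a single component.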
In the case of {\em semi-supervised constraint-based learning}, supervision takes the form of constraints on pairs of data samples, indicating whether the two samples are from the same underlying class (a must-link constraint) or from different underlying classes (a cannot-link constraint). We developed a method for grouping data samples into (unknown) classes that {\em not only} satisfies a majority of the constraints, but also consistently enforces the spatial implications of the constraints in the solution, leading to better groupings and better generalization on unseen data. Most prior work on this problem does not seriously address the requirement to enforce the spatial implications of the constraints. Prior methods also typically assume, unrealistically, that the number of underlying classes is known and provided as input. Our method does not require this knowledge and in fact provides an estimate of the number of classes as a by-product of model learning. In Chapter \ref{ss_constraint_based}, we demonstrate that our method can lead to significant performance improvements on a variety of synthetic and real data sets.
In the case of {\em semi-supervised domain adaptation of classifiers}, one data domain (the target domain) presents a scenario similar to semi-supervised learning, with scarce labeled data and relatively abundant unlabeled data. In addition, labeled data is readily available from a different domain (the source domain), whose underlying probability distribution over features and class labels may differ from that of the target domain. Under the assumption that the data distributions of the two domains are not {\em very} different, we take an existing generative mixture model based classifier, learned solely from the labeled source-domain data, and adapt its parameters using both the labeled and unlabeled data from the target domain. The formulation and solution approach of this method and of the semi-supervised constraint-based learning method share a motivation: the need to achieve label propagation (or constraint propagation) by imposing space-partitioning in the solution. In Chapter \ref{ss_domain_adapt}, using publicly available Internet packet-flow traffic data from different temporal and spatial domains, we demonstrate significant classification performance improvements in the setting of semi-supervised domain adaptation.
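The link between a partition of the feature space and the must-link and cannot-link constraints described above can be made concrete with a small sketch. It is only an illustration, not the algorithm developed in the dissertation: nearest-centroid regions stand in for the space partition, and the points, centroids, and constraint pairs are made-up values. Each sample is assigned to a region, and a constraint counts as satisfied when a must-link pair shares a region or a cannot-link pair does not.

```python
import math

def assign_region(point, centroids):
    """Index of the nearest centroid, i.e. the region containing the point."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))

def constraint_satisfaction(points, centroids, must_link, cannot_link):
    """Return the region of each point and the fraction of satisfied constraints."""
    regions = [assign_region(p, centroids) for p in points]
    satisfied = sum(regions[a] == regions[b] for a, b in must_link)
    satisfied += sum(regions[a] != regions[b] for a, b in cannot_link)
    total = len(must_link) + len(cannot_link)
    fraction = satisfied / total if total else 1.0
    return regions, fraction

# Illustrative (assumed) data: five 2-D samples and a two-region partition.
points = [(0.0, 0.0), (0.2, 0.1), (3.0, 3.0), (3.1, 2.9), (1.4, 1.4)]
centroids = [(0.0, 0.0), (3.0, 3.0)]

# Pairs are indices into `points`.
must_link = [(0, 1), (2, 3)]
cannot_link = [(0, 2), (1, 4)]

regions, fraction = constraint_satisfaction(points, centroids, must_link, cannot_link)
print("regions:", regions)
print("fraction of constraints satisfied:", fraction)
```

In this toy data the cannot-link pair (1, 4) falls inside a single region, so this particular partition violates it. Satisfying it would require moving a region boundary, which also changes the grouping of any nearby unconstrained samples; that is the sense in which a constraint has spatial implications.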