Mixture Modeling for Complex and Large-scale Data with Applications

Open Access
Qiao, Mu
Graduate Program:
Computer Science and Engineering
Doctor of Philosophy
Document Type:
Date of Defense:
September 14, 2012
Committee Members:
  • Jia Li, Dissertation Advisor
  • James Z Wang, Dissertation Advisor
  • Jia Li, Committee Chair
  • Daniel Kifer, Committee Chair
  • Jesse Louis Barlow, Committee Member
  • Xiaolong Zhang, Committee Member
  • mixture models
  • complex
  • large scale
  • high dimensional classification
  • clustering
  • visualization
Mixture models have been applied in a wide range of engineering and scientific fields. In practice, data are often complex and in large scale. The data may be high dimensional, have missing values, or contain objects that are not well defined in a vector space. The flexibility of mixture models enables us to model the distribution of complex and large-scale data and estimate their probability densities. In this dissertation, we present several new mixture models that extend such application. A two-way Gaussian mixture model (GMM) is proposed for classifying high dimensional data. This model regularizes the mixture component means by dividing variables into groups and then constraining the parameters for the variables in the same group to be identical. The grouping of the variables is not pre-determined, but rather it is optimized as part of model estimation. A dimension reduction property for a two-way mixture of distributions from a general exponential family is proved. The issue of missing values that tend to arise when the dimension is extremely high is addressed. Estimation methods for the two-way Gaussian mixture with or without missing data are derived. Experiments on several real data sets show that the parsimonious two-way mixture often outperforms a mixture model without variable grouping, and as a byproduct, significant dimension reduction is achieved. We propose a new approach for mixture modeling based only upon pairwise distances via the concept of hypothetical local mapping (HLM). This work is motivated by increasingly commonplace applications involving complex objects that cannot be effectively represented by tractable mathematical entities. The new modeling approach consists of two steps. A distance-based clustering algorithm is applied first. Then HLM takes as input the distances between the training data and their corresponding cluster centroids and estimates the model parameters. In the special case where all the training data are taken as cluster centroids, we obtain a distance-based kernel density. We have examined the classification performance of the mixture models on many data sets. Experimental comparisons have been made with other state-of-the-art distance-based classification methods, for instance, k-NN, variations of k-NN, and SVM based algorithms. It is found that HLM based algorithms are highly competitive in terms of classification accuracy, and in the mean time are computationally efficient during both training and testing. Furthermore, the HLM based modeling approach adapts readily to incremental learning, a valuable mechanism to achieve scalability for dynamic data arriving at a high velocity. We have developed two schemes of incremental learning and tested them on several data sets. Driven by demands in visual analytics, we investigate a GMM with component means constrained in a pre-selected subspace, which allows us to visualize the clustering or classification structure of high dimension data in a lower dimensional subspace. We prove that the subspace containing the component means of a GMM with a common covariance matrix also contains the modes of the density and the class means. This finding motivates us to identify a subspace by applying weighted principal component analysis to the modes of a kernel density and class means. For choosing the kernel bandwidth, we acquire multiple subspaces from the kernel densities based on a sequence of bandwidths. The GMM constrained by each subspace is estimated, and the model yielding the maximum likelihood is chosen. A dimension reduction property is proved in the sense of being informative for classification or clustering. Experiments on real and simulated data sets are conducted to examine several ways of determining the subspace and to compare with the reduced rank mixture discriminant analysis (MDA). Our new method with the simple technique of spanning the subspace only by class means often outperforms the reduced rank MDA when the subspace dimension is very low, making it particularly appealing for visualization. Finally, to bridge the computational gap between information visualization and data mining in visual analytics, we implement two parallel versions of hierarchical mode association clustering (HMAC), a previously proposed nonparametric method that groups data points into a cluster if they are associated with the same mode of a mixture-type density. Parallel HMAC runs on a cluster of compute nodes using the message passing interface (MPI) library and dramatically improves the speed of the original algorithm, making it feasible to perform clustering on large-scale data.