Variable Selection and Regularized Mixture Modeling for Clustering

Open Access
Author:
Lee, Hyang Min
Graduate Program:
Statistics
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
June 07, 2010
Committee Members:
  • Jia Li, Dissertation Advisor
  • Jia Li, Committee Chair
  • Bruce G Lindsay, Committee Member
  • Donald Richards, Committee Member
  • Peng Liu, Committee Member
Keywords:
  • Modal EM
  • Ridgeline EM
  • Wrapper method
  • Aggregated distinctiveness
  • Gaussian mixture models
  • Covariance shrinkage
  • BIC-type penalized log-likelihood
  • Mixture model based clustering
  • Model regularization
Abstract:
We introduce a new variable selection algorithm and a new regularized mixture modeling algorithm for clustering based on Gaussian mixture models.
First, we develop a variable selection algorithm for clustering based on mode association. In conventional mixture-model-based clustering, each mixture component is treated as one cluster, and the separation between clusters is usually measured by the ratio of between-component to within-component dispersion. We instead allow one cluster to contain several components, depending on whether they merge into a single mode, which improves the geometric characteristics of the resulting clusters. The extent of separation between clusters is quantified using critical points on the ridgeline between two modes, which reflects the exact geometry of the density function. The computational foundation for the new separability measure consists of a recently developed Modal EM (MEM) algorithm, which finds the modes of a density in the form of a Gaussian mixture, and a Ridgeline EM (REM) algorithm, which finds the ridgeline that passes through the critical points of the mixed density of two unimodal clusters. Forward selection is used to find a subset of variables that maximizes an aggregated index of pairwise separability between clusters. We present experimental results on simulated and real data sets.
Second, we propose a regularized mixture modeling approach for clustering based on covariance shrinkage. The covariance shrinkage method allows different components to have different levels of complexity: a complexity parameter is assigned to each component to determine the extent of shrinkage towards a diagonal or common covariance matrix. A BIC-type penalized log-likelihood is proposed to estimate the model parameters and the complexity parameters, and a generalized EM algorithm is developed for model estimation. Using simulated and real data sets, we compare the proposed method, in terms of likelihood and parameter accuracy, with covariance shrinkage using a single complexity parameter and with estimation without shrinkage. We also investigate the effect of this new mixture modeling technique on clustering.
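As an illustration of the mode-association idea in the abstract, the following is a minimal sketch of a Modal EM style iteration for a one-dimensional Gaussian mixture. It is not taken from the dissertation: the function names `normal_pdf` and `modal_em_1d`, the tolerance, and the restriction to one dimension are assumptions made for brevity. Each iteration computes the posterior component probabilities at the current point and then moves the point to the precision-weighted average of the component means, so the point climbs to a local mode of the mixture density.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def modal_em_1d(x0, weights, means, variances, tol=1e-10, max_iter=1000):
    """Climb from x0 to a local mode of a 1-D Gaussian mixture density.

    Hypothetical helper for illustration only. It assumes the mixture
    density at the starting point does not underflow to zero.
    """
    x = x0
    for _ in range(max_iter):
        # E-step: posterior probability of each component at the current point.
        dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(dens)
        post = [d / total for d in dens]
        # M-step: precision-weighted average of the component means.
        num = sum(p * m / v for p, m, v in zip(post, means, variances))
        den = sum(p / v for p, v in zip(post, variances))
        x_new = num / den
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x


# Two well-separated components: different starting points reach different modes.
separated = ([0.5, 0.5], [-3.0, 3.0], [1.0, 1.0])
left_mode = modal_em_1d(-2.0, *separated)
right_mode = modal_em_1d(2.0, *separated)

# Two overlapping components: both starting points reach the same single mode,
# so under mode association the two components would form one cluster.
merged = ([0.5, 0.5], [-0.5, 0.5], [1.0, 1.0])
merged_from_left = modal_em_1d(-2.0, *merged)
merged_from_right = modal_em_1d(2.0, *merged)
```

In this sketch, points (or component means) whose iterations end at the same mode would be assigned to the same cluster, which is how several components can share one cluster.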
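The per-component shrinkage described in the abstract can be pictured with a short sketch. The exact estimator used in the dissertation is not given here, so the code below assumes a common convex-combination form, in which each component covariance is pulled towards a target (its own diagonal, or a weighted common covariance) by a component-specific parameter in [0, 1]. The function names `shrink_covariance`, `diagonal_target`, and `common_target`, and the example numbers, are hypothetical.

```python
def shrink_covariance(sample_cov, target, lam):
    """Return (1 - lam) * sample_cov + lam * target for square nested lists.

    lam = 0 keeps the unrestricted covariance; lam = 1 replaces it by the target.
    This convex combination is an assumed form, not the dissertation's estimator.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    d = len(sample_cov)
    return [
        [(1.0 - lam) * sample_cov[i][j] + lam * target[i][j] for j in range(d)]
        for i in range(d)
    ]


def diagonal_target(cov):
    """Diagonal matrix that keeps only the variances of cov."""
    d = len(cov)
    return [[cov[i][j] if i == j else 0.0 for j in range(d)] for i in range(d)]


def common_target(covs, weights):
    """Weighted average of the component covariances (a common covariance)."""
    total = sum(weights)
    d = len(covs[0])
    return [
        [sum(w * c[i][j] for w, c in zip(weights, covs)) / total for j in range(d)]
        for i in range(d)
    ]


# Illustrative two-component example with made-up covariances and parameters.
covs = [
    [[2.0, 0.8], [0.8, 1.0]],
    [[1.0, -0.3], [-0.3, 3.0]],
]
lams = [0.5, 1.0]  # one complexity parameter per component

# Shrink each component towards its own diagonal by its own amount.
shrunk_diag = [shrink_covariance(c, diagonal_target(c), l) for c, l in zip(covs, lams)]

# Alternatively, shrink each component towards a common covariance.
common = common_target(covs, weights=[0.5, 0.5])
shrunk_common = [shrink_covariance(c, common, l) for c, l in zip(covs, lams)]
```

Giving each component its own parameter lets a well-supported component keep a nearly unrestricted covariance while a sparsely supported one is simplified. In the abstract these parameters are estimated through the BIC-type penalized log-likelihood rather than fixed by hand as in this sketch.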