Variable Selection and Regularized Mixture Modeling for Clustering
Open Access
- Author:
- Lee, Hyang Min
- Graduate Program:
- Statistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- June 07, 2010
- Committee Members:
- Jia Li, Dissertation Advisor/Co-Advisor
- Jia Li, Committee Chair/Co-Chair
- Bruce G Lindsay, Committee Member
- Donald Richards, Committee Member
- Peng Liu, Committee Member
- Keywords:
- Modal EM
- Ridgeline EM
- Wrapper method
- Aggregated distinctiveness
- Gaussian mixture models
- Covariance shrinkage
- BIC-type penalized log-likelihood
- Mixture model based clustering
- Model regularization
- Abstract:
- We introduce a new variable selection algorithm and a new regularized mixture modeling algorithm for clustering based on Gaussian mixture models. First, a new variable selection algorithm is developed for clustering based on mode association. In conventional mixture-model-based clustering, each mixture component is treated as one cluster, and the separation between clusters is usually measured by the ratio of between- to within-component dispersion. We allow one cluster to contain several components, depending on whether or not they merge into one mode. This new approach yields clusters with improved geometric characteristics. The extent of separation between clusters is quantified using critical points on the ridgeline between two modes, which reflects the exact geometry of the density function. The computational foundation for the new separability measure consists of the recently developed Modal EM (MEM) algorithm, which finds the modes of a density in the form of a Gaussian mixture, and the Ridgeline EM (REM) algorithm, which computes the ridgeline passing through the critical points of the mixed density of two unimodal clusters. Forward selection is used to find a subset of variables that maximizes an aggregated index of pairwise separability between clusters. We demonstrate experimental results on simulated and real data sets.
Second, a new regularized mixture modeling method for clustering is developed using covariance shrinkage. The covariance shrinkage method allows different components to have different levels of complexity. A complexity parameter is assigned to each component to determine the extent of shrinkage towards a diagonal or common covariance matrix. A BIC-type penalized log-likelihood is proposed to estimate the model parameters and the complexity parameters, and a generalized EM algorithm is developed for model estimation.
Using simulated and real data sets, we compare the proposed covariance shrinkage method, in terms of likelihood and parameter accuracy, with shrinkage using a single complexity parameter and with estimation without shrinkage. We also investigate the impact of this new mixture modeling technique on clustering.
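The Modal EM ascent mentioned in the abstract can be illustrated with a short sketch. This is a minimal NumPy implementation of the published MEM update for Gaussian mixtures (alternate a posterior step with a weighted-mean step until the iterate stops moving); the dissertation's own code and stopping criteria are not given here, so the tolerances and helper names below are illustrative assumptions:

```python
import numpy as np

def gauss_pdf(x, mu, cov):
    """Density of a multivariate normal N(mu, cov) evaluated at x."""
    d = len(mu)
    diff = x - mu
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ inv @ diff) / norm

def modal_em(x, weights, means, covs, tol=1e-8, max_iter=500):
    """Ascend from a starting point x to a local mode of a Gaussian mixture.

    Each iteration alternates an E-step (posterior component probabilities
    at the current point) with an M-step (precision-weighted combination of
    the component means); the mixture density is non-decreasing along the
    resulting path, so the iterates climb to a mode.
    """
    x = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        # E-step: posterior probability of each component at the current x
        p = np.array([w * gauss_pdf(x, m, c)
                      for w, m, c in zip(weights, means, covs)])
        p /= p.sum()
        # M-step: maximize sum_k p_k * log phi(x; mu_k, Sigma_k) over x,
        # which is a precision-weighted least-squares problem in x
        inv_covs = [np.linalg.inv(c) for c in covs]
        A = sum(pk * ic for pk, ic in zip(p, inv_covs))
        b = sum(pk * ic @ m for pk, ic, m in zip(p, inv_covs, means))
        x_new = np.linalg.solve(A, b)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```

Points started inside different attraction basins reach different modes, which is how several components can be associated with one cluster: if two components' density merges into a single mode, points from both ascend to the same place.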
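The per-component covariance shrinkage described in the abstract can likewise be sketched. The abstract does not specify the exact parameterization, so the convex combination below, with `gamma` standing in for the per-component complexity parameter and the diagonal of the matrix (or a supplied common covariance) as the shrinkage target, is a generic assumption rather than the dissertation's estimator:

```python
import numpy as np

def shrink_covariance(S, gamma, target=None):
    """Shrink a component covariance estimate toward a simpler target.

    gamma in [0, 1] plays the role of the complexity parameter assigned
    to the component: gamma = 0 keeps the unrestricted estimate S (most
    complex), gamma = 1 replaces it entirely by the target (least
    complex). By default the target is the diagonal of S; passing a
    common covariance matrix as `target` shrinks toward that instead.
    """
    S = np.asarray(S, dtype=float)
    if target is None:
        target = np.diag(np.diag(S))  # diagonal shrinkage target
    return (1.0 - gamma) * S + gamma * np.asarray(target, dtype=float)
```

Because the result is a convex combination of positive-definite matrices, it stays positive definite for any `gamma` in [0, 1]; in the penalized framework the abstract describes, each component's `gamma` would be chosen jointly with the model parameters via the BIC-type penalized log-likelihood.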