Distance-based Model-Selection with application to the Analysis of Gene Expression Data

Open Access
- Author:
- Ray, Surajit
- Graduate Program:
- Statistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- June 19, 2003
- Committee Members:
- Bruce G. Lindsay, Committee Chair/Co-Chair
Thomas P Hettmansperger, Committee Member
Francesca Chiaromonte, Committee Member
Benjamin Franklin Pugh, Committee Member
- Keywords:
- Mixture models
number of components
modality
Quadratic distance
Microarray
- Abstract:
- Multivariate mixture models provide a convenient method of density estimation and model-based clustering, as well as possible explanations for the actual data generation process. However, the problem of choosing the number of components ($g$) in a statistically meaningful way is still a subject of considerable research. Available methods for estimating $g$ include optimizing AIC and BIC, estimating the number through nonparametric maximum likelihood, hypothesis testing, and Bayesian approaches with entropy distances. In this research, we present several rules for selecting a finite mixture model, and hence $g$, based on estimation and inference using a quadratic distance measure. In one methodology, the goal is to find the minimal number of components needed to adequately describe the true distribution, based on a nonparametric confidence set for the true distribution. We also present results for selecting $g$ based on a risk analysis that includes a penalty for overfitting. Another, less formal methodology is based on the concordance measure, which is analogous to $R^2$ in regression. Moreover, we develop diagnostics for outlier detection. These diagnostics help to distinguish between outliers and true clusters, and they provide insight into the initial values for iterative estimation of additional components. In this dissertation we also develop tools for determining the number of modes in a mixture of multivariate normal densities. We use these criteria to select clusters that display distinct modes. Finally, we fine-tune our methods to analyze gene-expression data from microarrays and compare them with other competing methods.
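
The distinction the abstract draws between the number of components and the number of modes can be illustrated with a small sketch. This is not the dissertation's method: it is a brute-force grid search, restricted to the univariate case, and the function names `mixture_pdf` and `count_modes` are hypothetical. It only shows that a mixture with $g = 2$ components may have one mode or two, depending on how far apart the components are.

```python
import math


def mixture_pdf(x, weights, means, sds):
    """Density of a univariate normal mixture evaluated at x."""
    return sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for w, m, s in zip(weights, means, sds)
    )


def count_modes(weights, means, sds, n_grid=2001):
    """Count strict local maxima of the mixture density on an even grid.

    Assumption: a grid search stands in for the analytical tools the
    dissertation develops; it can miss modes closer together than the
    grid spacing.
    """
    # Cover every component out to 4 standard deviations.
    lo = min(m - 4 * s for m, s in zip(means, sds))
    hi = max(m + 4 * s for m, s in zip(means, sds))
    step = (hi - lo) / (n_grid - 1)
    dens = [mixture_pdf(lo + i * step, weights, means, sds) for i in range(n_grid)]
    return sum(
        1
        for i in range(1, n_grid - 1)
        if dens[i - 1] < dens[i] > dens[i + 1]
    )


# Two components, close together: the density has a single mode.
print(count_modes([0.5, 0.5], [-0.5, 0.5], [1.0, 1.0]))
# Two components, well separated: the density has two modes.
print(count_modes([0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]))
```

For an equal-weight mixture of two normals with a common standard deviation, the density is unimodal whenever the means differ by at most two standard deviations, which is why the first call reports one mode even though $g = 2$.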