Estimation and Model Selection for Block Clustering with Mixtures: A Composite Likelihood Approach
Open Access
- Author:
- Kuruppumullage Don, Prabhani
- Graduate Program:
- Statistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- May 20, 2014
- Committee Members:
- Bruce G Lindsay, Dissertation Advisor/Co-Advisor
Francesca Chiaromonte, Dissertation Advisor/Co-Advisor
Bruce G Lindsay, Committee Chair/Co-Chair
Francesca Chiaromonte, Committee Chair/Co-Chair
David Russell Hunter, Committee Member
Kateryna Dmytrivna Makova, Committee Member
Dr Aleksandra Slavkovic, Special Member
- Keywords:
- Block clustering
Composite Likelihood
EM algorithm
Mixture models
- Abstract:
- Clustering is the task of finding useful and meaningful groups in data, such that members of a group are more similar to each other than to members of other groups. Many well-established statistical methods are used for clustering; among these, mixture-based approaches have several advantages and have become increasingly popular. In this thesis, we introduce a mixture-based approach for block clustering (i.e., simultaneous clustering of the rows and columns of a data matrix). We discuss the computational challenges that prevent the use of traditional likelihood approaches in this setting, and provide an alternative: we build a composite likelihood that overcomes the computational burden, and devise a nested Expectation-Maximization (EM) algorithm to estimate the block mixture model. Moreover, we develop two tools for model selection in block clustering, which can be used, in particular, to determine the number of row and column groups. First, we discuss how the gradient function can be used to assess the lack of fit of a block mixture model and provide an EM gradient search algorithm to progress towards better-fitting models. Second, we develop a composite likelihood ratio test for comparing two block mixture models and incorporate it into a forward model selection method. We then apply our methods to two human genomics problems. In the first, we simultaneously cluster loci of enhanced microsatellite mutability and a large array of genomic features characterizing their environment. In the second, we perform the same type of analysis for so-called common fragile sites in the genome. Finally, we list some limitations of our methods, identifying challenges and discussing avenues for future developments.
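
To make the notion of block clustering concrete, the following minimal sketch shows how a row partition and a column partition jointly divide a data matrix into blocks, and summarizes each block by its mean. It is an illustration only, not the composite likelihood or nested EM procedure of the thesis: the function name `block_means`, the toy matrix, and the group labels are all hypothetical, and the labels are taken as given rather than estimated.

```python
from collections import defaultdict


def block_means(matrix, row_labels, col_labels):
    """Average the entries of `matrix` within each (row group, column group) block."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            block = (row_labels[i], col_labels[j])
            sums[block] += value
            counts[block] += 1
    return {block: sums[block] / counts[block] for block in sums}


# Toy 4x4 data matrix with an obvious 2x2 block structure (made-up numbers).
data = [
    [1.0, 1.0, 5.0, 5.0],
    [1.0, 1.0, 5.0, 5.0],
    [9.0, 9.0, 3.0, 3.0],
    [9.0, 9.0, 3.0, 3.0],
]
row_labels = [0, 0, 1, 1]  # hypothetical row-group assignment
col_labels = [0, 0, 1, 1]  # hypothetical column-group assignment

means = block_means(data, row_labels, col_labels)
for block in sorted(means):
    print(block, means[block])
```

In a mixture-based treatment such as the one the abstract describes, the row and column memberships would be unknown and estimated from the data; here they are fixed by hand purely to show what a block is.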