Estimation and Model Selection for Block Clustering with Mixtures: A Composite Likelihood Approach

Open Access
Author:
Kuruppumullage Don, Prabhani
Graduate Program:
Statistics
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
May 20, 2014
Committee Members:
  • Bruce G Lindsay, Dissertation Advisor
  • Francesca Chiaromonte, Dissertation Advisor
  • Bruce G Lindsay, Committee Chair
  • Francesca Chiaromonte, Committee Chair
  • David Russell Hunter, Committee Member
  • Kateryna Dmytrivna Makova, Committee Member
  • Dr Aleksandra Slavkovic, Special Member
Keywords:
  • Block clustering
  • Composite Likelihood
  • EM algorithm
  • Mixture models
Abstract:
Clustering is the task of finding useful and meaningful groups in data, such that members within a group are more similar to each other than to members of other groups. There are many well-established statistical methods for clustering; among these, mixture-based approaches have several advantages and have become increasingly popular. In this thesis, we introduce a mixture-based approach for block clustering (i.e., simultaneous clustering of the rows and columns of a data matrix). We discuss the computational challenges that prevent the use of traditional likelihood approaches in this setting, and provide an alternative. We build a composite likelihood that overcomes the computational burden, and devise a nested Expectation-Maximization (EM) algorithm to estimate the block mixture model. Moreover, we develop two useful tools for model selection in block clustering. These can be used, in particular, to determine the number of row and column groups. We discuss how the gradient function can be used to assess the lack of fit of a block mixture model and provide an EM gradient search algorithm to progress towards better-fitting models. Further, we develop a composite likelihood ratio test for comparing two block mixture models and incorporate it into a forward model selection method. We then apply our methods to two human genomics applications. In the first, we simultaneously cluster loci of enhanced microsatellite mutability and a large array of genomic features characterizing their environment. In the second, we perform the same type of analysis for so-called common fragile sites in the genome. Finally, we list some of the limitations of our methods, identifying challenges and discussing avenues for future developments.