Clustering Algorithms for Next-generation Sequencing Data from Heterogeneous Populations

Open Access
- Author:
- Prabhakara, Shruthi
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- August 31, 2012
- Committee Members:
- Raj Acharya, Dissertation Advisor/Co-Advisor
Webb Colby Miller, Committee Member
Mary Poss, Committee Member
Padma Raghavan, Committee Member - Keywords:
- Clustering
Metagenomics
Mixture-modeling
Algorithms - Abstract:
- Next Generation Sequencing (NGS) technologies generate data more efficiently, economically and with a greater depth than ever before. NGS has opened up an array of possibilities for many applications including whole-genome sequencing, epigenetics, metagenomics and characterization of pathogens. Of these, the characterization of diversity of heterogeneous microbial environments such as metagenomes and viral populations has recently gained significant interest. Although a host of methods for whole genome assembly have been developed, reconstruction of heterogeneous populations using NGS data still remains a challenge. As compared to existing technologies, reads produced by NGS are typically shorter and more error-prone. The growth in the size of the datasets is fast outpacing the computational power needed to analyze it. Thus, many computational challenges arise while analyzing deep sequence data from heterogeneous populations. The computational methods presented in this dissertation aim to analyze and quantify the genetic diversity within a heterogeneous population based on a set of deep sequencing reads. A major challenge facing metagenomics is the development of tools for the characterization of functional and taxonomic content of vast amounts of short metagenome reads. Unlike single genome sequencing, assembly of a metagenome is intractable and is by large, an unsolved mystery. A crucial step in metagenomics that is not required in single genome assembly, is binning the reads belonging to a species. Clustering methods aim to identify the species present in the sample, classify the sequences by their species of origin and quantify the abundance of each of these species. The efficacy of clustering methods depends on the number of reads in the dataset, the read length and relative abundances of source genomes in the microbial community. From clustering, to assembly, to annotation and function prediction, bioinformatics is posed with new challenges in handling noisy, huge and often partial sequence data. Until now, methods to classify short sequences generated by NGS technologies have been relatively inaccurate. Most methods use sequence homology to assign reads to common ancestors. However, most extant databases are highly biased in their representation of true diversity, such methods fail to find homologs for reads derived from novel species. To address the aforementioned problems, in the first part of dissertation, I have focused on methods to characterize and analyze the taxonomic content of vast amounts of short metagenome reads. The main contributions of the dissertation are: (i) A two-pass soft clustering semi-supervised method that is a hybrid of comparative and composition based methods. Such a hybrid approach is effective when evolutionarily close training genomes are available. (ii) An unsupervised Naive Bayes mixture model based on Poisson or Gaussian distributions to model reads within each bin. (iii) An unsupervised multivariate Bayesian mixture model based on Poisson and Multinomial distributions for clustering discrete sequence data. This method overcomes the bottleneck of above Naive Bayes by taking into account the conditional dependencies between the words within the reads. For the latter two methods, we present a two-way clustering approach to reduce the high-dimensionality and sparsity associated with the data. The method combines the words along the reads into word groups and constrains the parameters for words within the same group to be identical. High genetic variability in viral populations plays an important role in disease progression, pathogenesis and drug resistance. The last few years has seen significant progress in the development of methods for reconstruction of viral populations using data from NGS technologies. These methods identify the differences between individual haplotypes by mapping the short reads to the reference genome. Much less has been published about resolving the population structure when an assembled reference genome is not well-defined, which severely limits the number of populations that can take advantage of these new technologies. In the remainder of the dissertation, I have described a computational framework, called Mutant-Bin, for clustering individual haplotypes in a heterogeneous population and determining their prevalence, based on a set of deep sequencing reads. The main advantages of our method are that: (i) it enables determination of the population structure and haplotype frequencies when a reference genome is lacking; (ii) the method is unsupervised, the number of haplotypes does not have to be specified in advance; (iii) it identifies the polymorphic sites with derived nucleotides that co-occur in a subset of haplotypes and the frequency with which they appear in the viral population.