Statistical methods for the analysis of multi-condition large-scale genomic data

Open Access
- Author:
- Koch, Hillary
- Graduate Program:
- Statistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- August 20, 2020
- Committee Members:
- Qunhua Li, Dissertation Advisor/Co-Advisor
Qunhua Li, Committee Chair/Co-Chair
Francesca Chiaromonte, Committee Member
Lin Lin, Committee Member
Ross Cameron Hardison, Outside Member
Ephraim Mont Hanks, Program Head/Chair
Benjamin A Shaby, Special Member
- Keywords:
- large-scale inference
statistical genomics
mixture models
Gaussian process
composite likelihood
empirical Bayes
- Abstract:
- The comparison of data collected from several different biological conditions is a recurring task in genomic data analysis. High-throughput sequencing data present new challenges in this area, as biases inherent to these rapidly evolving technologies, in addition to complex dependencies arising from natural biological processes, complicate statistical modeling and inference. However, when leveraged correctly, the large scale of these data can reveal insights about genomic processes, how they work with one another, how these synergies are affected by treatments, and the relationship of these processes to phenotype. This dissertation aims to introduce statistical methods that improve statistical power over the state of the art to elucidate these relationships when data are collected from two or more biological conditions. The first method concerns T cell receptor repertoire sequencing data. These data permit a deeper study of immune response, but have a unique structure that makes their meaningful quantification challenging. As such, most analysis methods are limited to simple one-number summaries. We introduce a biologically interpretable model that captures the distribution of the T cell receptor frequencies and is robust to varying sequencing depth across experiments. We apply our method to several datasets and demonstrate its ability to tease out distinguishing features in the T cell receptor repertoire among sampled individuals from differing conditions. The second project addresses joint analysis for pattern detection across many biological conditions when the data can be summarized as scores. Joint analyses of genomic datasets obtained in multiple different conditions are essential for understanding the biological mechanism that drives tissue-specific features and cell differentiation, but still remain computationally challenging. 
To address this, we introduce a statistical methodology that learns patterns of condition-specificity present in the data while retaining tractability. Our approach provides a generic framework facilitating a host of downstream analyses, such as clustering genomic features sharing similar condition-specific patterns and identifying which of these features are involved in cell fate commitment. We illustrate our method's value on two sets of hematopoietic datasets, each of a different data type. Finally, we introduce a method specifically designed for differential analysis of Hi-C data collected from two experimental conditions. Hi-C, a high-throughput experimental technique, describes how chromosomes organize spatially within the nucleus of the cell. These data exhibit unique spatial structure, as genomic loci can show strong correlations when they are nearby not only in 3D space within the nucleus, but also in 1D space along the chromosome. Because of this complexity, a method that detects differences between Hi-C samples while controlling false discoveries has remained absent. The final project introduces a spatial model for sliding-window statistics that meets this need. We use polymer simulations and real data to show that our method has increased capacity over existing alternatives to identify differentially interacting genomic regions.