STATISTICAL METHODS FOR COMPARING NEXT-GENERATION SEQUENCING DATA REPRODUCIBILITY, SIMILARITY AND DIFFERENTIATION

Restricted (Penn State Only)
Author:
Yang, Tao
Graduate Program:
Bioinformatics and Genomics
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
November 16, 2018
Committee Members:
  • Qunhua Li, Dissertation Advisor
  • Qunhua Li, Committee Chair
  • Shaun Mahony, Committee Member
  • Yu Zhang, Committee Member
  • Ross Hardison, Outside Member
  • Feng Yue, Dissertation Advisor
Keywords:
  • sequencing
  • reproducibility
  • quality control
  • ChIP-seq
  • Hi-C
  • differential interaction
  • genomics
Abstract:
Next-generation sequencing technologies have stimulated numerous innovations in genomics studies during the past decade. Among them, Hi-C is a powerful technology for studying genome-wide chromatin interactions. However, current methods for assessing Hi-C data reproducibility between replicated experiments can produce misleading results because they ignore spatial features in Hi-C data, such as domain structure and distance dependence. An adequate statistical tool is also needed to estimate the similarity between Hi-C contact maps in comparative studies across cell types and conditions. As the first part of my thesis, I present a framework for assessing the reproducibility and similarity of Hi-C data that systematically accounts for these features. In particular, we introduce a novel similarity measure, the stratum-adjusted correlation coefficient (SCC), for quantifying the similarity between Hi-C interaction matrices. Not only does SCC provide a statistically sound and reliable evaluation of reproducibility, it can also be used to quantify differences between Hi-C contact matrices and to determine the optimal sequencing depth for a desired resolution. The measure consistently shows higher accuracy than existing approaches in distinguishing subtle differences in reproducibility and depicting interrelationships of cell lineages. The proposed measure is straightforward to interpret and easy to compute, making it well-suited for providing standardized, interpretable, automatable, and scalable quality control. We also developed the freely available R package HiCRep (Bioconductor) to perform this analysis.
One of the most interesting features of Hi-C data is the topologically associating domain (TAD): DNA sequences within a TAD physically interact with each other more frequently than with sequences outside the TAD. TADs are essential in constraining the activity of transcriptional regulatory elements. They function as isolated environments, such that gene regulation and interactions rarely extend beyond TAD boundaries. Previous studies have observed that changes in TAD structures are associated with altered transcriptional outcomes, suggesting that architectural changes may play an important role in regulating gene expression. Identification of differential TAD structures across conditions will provide insights into condition-specific regulatory mechanisms and help identify potential pharmacologic targets. However, so far little work has been done on detecting differential TAD structures. In the second part of the thesis, I present a novel statistical method that can accurately and quickly uncover differential TAD structures from Hi-C data. The method is not limited to detecting differential TAD regions; it can detect any regional changes in Hi-C contact maps. To validate the identifications, we applied our method to a Hi-C dataset obtained from a knockout experiment that depletes a critical transcription regulator that co-localizes with CTCF, and identified the changes in TAD structures between the wild type and the knockout. Our results show that the identified differential interacting genomic regions (DIGRs) correspond well with the depleted sites, confirming the biological relevance of our identifications. We further compared two cell lines in the hematopoietic lineage and studied gene activity within the DIGRs, which reveals interesting biological insights.
In the last part, I present a method for evaluating the reproducibility of enrichment-based chromatin profiling data, including ChIP-seq, RNA-seq, ATAC-seq and DNase-seq data. Enrichment-based chromatin profiling sequencing experiments have become essential tools for investigating the functional roles of genomic regions. Measuring reproducibility is central to data quality control and critical to ensuring the credibility of scientific discoveries. Evaluating the reproducibility of enrichment-based sequencing data is complicated by the variation in enrichment characteristics and the heterogeneous correlation structure between replicated samples. We present a model-based method to comprehensively assess the reproducibility between replicated samples. The method requires only minimal preprocessing of raw data and does not rely on peak calling. Thus, it involves less information loss than peak-level reproducibility measures. The model is designed to assess three aspects of data reproducibility: the dependence between the enriched signals, the bulk correlation across the whole range of signal values, and the degree of lack of enrichment. By combining these three quantities, our model can flexibly assess the reproducibility of data with different signal types (i.e., narrow-peak or broad-peak) and enrichment levels. We also demonstrate that our method is more accurate than existing measures. Our method is implemented in the freely available R package mTDR (GitHub).