Chromatin loop detection via machine learning on genome-wide contact data

Open Access
- Author:
- Salameh, Tarik
- Graduate Program:
- Bioinformatics and Genomics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- December 13, 2019
- Committee Members:
- Feng Yue, Dissertation Advisor/Co-Advisor
Robert G Levenson, Dissertation Advisor/Co-Advisor
Aron Eliot Lukacher, Outside Member
Robert G Levenson, Committee Chair/Co-Chair
Shaun Mahony, Committee Member
Gary A Thomas, Special Member
George H Perry, Program Head/Chair
Feng Yue, Committee Chair/Co-Chair
- Keywords:
- machine learning
chromatin looping
3dgenomics
supervised learning
genome-wide contact map
Hi-C
loop extrusion
- Abstract:
- The central dogma of molecular biology states that information flows from DNA via dedicated processes resulting in the production of proteins. The human genome encodes roughly 20,000 genes and 100,000 isoforms, yet these elements represent only ~2% of all DNA. As different cell types carry the same sequence information but differ in their expressed genes, epigenetic machinery and non-coding elements are critical in determining the flow of information from DNA to protein in a tissue-specific manner. The epigenetic programming of cells can be traced to the three-dimensional conformation of chromosomes, whereby genes are activated or silenced by interactions between distal elements of DNA sequence. In my work, I interrogate 3D structures of mammalian chromosomes through analysis of published genome-wide interaction experiments from a variety of sources. Emphasis is also placed on developing computational methods, specifically machine learning (ML) strategies that are fundamentally data-driven. Using a combination of genome-wide contact datasets collected from the GM12878 lymphoblastic cell line, I developed and tested the Peakachu software tool. Peakachu builds mathematical models learned from example datasets to accurately classify looping vs non-looping regions in a genome. After showing that this approach accurately recaptures known chromatin interactions in GM12878, I performed additional testing using a variety of cell types, read depths, and data sources. This ultimately resulted in a comprehensive list of intra-chromosomal loop coordinates from Hi-C data for more than 50 cell lines and tissue types, available for download at 3dgenome.org. Analysis of Peakachu-predicted loops showed that at least two classes exist for chromatin interactions shorter than one megabase. First, there is a large subset of loops that are generally longer-range and are anchored by CTCF proteins bound at distal sequences in a specific orientation.
Second, there are numerous shorter-range interactions that are less enriched for CTCF and show a high incidence of promoter-enhancer contacts. In the same datasets, statistical enrichment methods primarily detect the first class of loops, which are also the basis of our current theories on loop formation. The existence of the second class of loops implies a need to develop complementary theories of loop formation, other than CTCF extrusion, that result in spatially specific regulatory interactions. Additionally, detecting this second class enables the mapping of genes to their distal regulatory elements in a tissue-specific manner. Importantly, I also quantify how some recent experimental protocols, such as DNA SPRITE, resolve short-range interactions better than others, such as Hi-C. In addition to my work on genome-wide contact data, I contributed to the medical field through sequence analysis of patients with multiple sclerosis (MS). MS is a neurodegenerative inflammatory disorder with multiple genetic risk factors. The allele imparting the highest risk is HLA-DRB1*1501; I helped to show that distal but commonly co-inherited alleles may be responsible for reducing DRB1 expression in MS patients. This analysis combines large-scale association studies with genome-wide contact experiments and the aforementioned chromatin loop predictions, and it represents a template for discerning causal disease variants in non-coding DNA using readily available datasets. It is my hope that my work continues to aid future scientists in the quest to characterize complex diseases and further understand chromatin dynamics.