Exploring Sequence Architectures at Flanking Regions of Copy Number Variants

Open Access
- Author:
- Khan, Hossain
- Graduate Program:
- Bioinformatics and Genomics
- Degree:
- Master of Science
- Document Type:
- Master Thesis
- Date of Defense:
- May 30, 2019
- Committee Members:
- Santhosh Girirajan, Thesis Advisor/Co-Advisor
- Anton Nekrutenko, Committee Member
- Yifei Huang, Committee Member
- George H Perry, Program Head/Chair
- Keywords:
- CNVs
- Structural Variations
- Alu elements
- L1 LINE elements
- Flanking regions of CNVs
- Segmental Duplications
- Abstract:
- Copy number variations (CNVs) are a subtype of structural variation in which a portion of the genome is deleted or duplicated, changing the copy number of the genes within the region. These copy number changes arise from aberrant recombination between specific repeat architectures, facilitated by the sequence homology present within these repeats. In this study, a comprehensive catalog of 150,802 CNVs was used to explore the flanking regions of CNVs and enumerate the abundance of different subtypes of repeat architecture. Examining these flanking regions may indicate which repeat architecture is relatively more abundant and may therefore contribute to the formation of a larger fraction of CNVs. Alu elements (~67.12 percent) were the most abundant repeat architecture, followed by segmental duplications (~22.59 percent) and L1 LINE elements (~21.71 percent), in both upstream and downstream flanking regions. Furthermore, AluY, AluSx and AluSx1 were relatively more abundant in the flanking regions than the other 34 Alu subtypes. Interestingly, the newer AluY subfamilies were the least abundant in the flanking regions. For L1 elements, 114 different L1 LINE subtypes were present and showed a heterogeneous pattern in these flanking regions. Among those 114 subtypes, four (L1MB3, L1MB5, L1MB7 and L1MB8) were relatively more abundant than the rest. Strikingly, all four are relatively old L1s that existed before the evolution of primates. To assess pathogenicity, CNVs were screened for the presence of dosage-sensitive genes and of transcription factor specific regulatory elements. Only 3.61 percent and 46.85 percent of CNVs contained exons of dosage-sensitive genes and transcription factor specific regulatory elements, respectively.
Thus, the majority of the CNVs are likely to be benign, which is consistent with the dataset, since its CNVs are representative of the normal population. However, the tissue and cell-type specificity of regulatory elements was not considered in the screening, which is a major limitation. Next, CNVs with overlaps of 50 percent or more were grouped into copy number variable regions (CNVRs) to determine whether some regions on each chromosome receive more CNV hits than usual. The majority of CNVRs had 15 or fewer CNV hits, but a few had 100 or more. Surprisingly, some of the CNVRs with the most CNV hits on each chromosome were depleted in segmental duplications but had overrepresentations of different Alu subtypes. In these segmental duplication depleted regions, Alu elements are highly likely to be the main repeat architecture sensitizing the genome to further rearrangements, which in turn leads these CNVRs to accumulate 100 or more CNV hits.
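
The grouping of CNVs into CNVRs described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact overlap rule, so the criterion used here is an assumption: CNVs are sorted by start position, and a CNV joins the current region when its overlap with that region covers at least 50 percent of the CNV's own length. The function name `build_cnvrs`, the `min_frac` parameter, and the half-open coordinate convention are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def build_cnvrs(cnvs, min_frac=0.5):
    """Greedily group CNVs into copy number variable regions (CNVRs).

    cnvs: iterable of (chrom, start, end) tuples, half-open coordinates.
    A CNV joins the current region when the overlap covers at least
    min_frac of the CNV's own length (an assumed criterion, not
    necessarily the one used in the thesis).
    Returns a list of (chrom, region_start, region_end, n_hits).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in cnvs:
        by_chrom[chrom].append((start, end))

    regions = []
    for chrom in sorted(by_chrom):
        current = None  # [region_start, region_end, n_hits]
        for start, end in sorted(by_chrom[chrom]):
            if current is not None:
                overlap = min(end, current[1]) - max(start, current[0])
                if overlap >= min_frac * (end - start):
                    # Extend the region and count one more CNV hit.
                    current[1] = max(current[1], end)
                    current[2] += 1
                    continue
                # Not enough overlap: close the current region.
                regions.append((chrom, *current))
            current = [start, end, 1]
        if current is not None:
            regions.append((chrom, *current))
    return regions


if __name__ == "__main__":
    # Toy coordinates for illustration only, not data from the study.
    example = [
        ("chr1", 100, 200),
        ("chr1", 120, 220),
        ("chr1", 190, 400),
        ("chr1", 1000, 1100),
    ]
    for region in build_cnvrs(example):
        print(region)
```

Sorting the resulting regions by `n_hits` within each chromosome would surface the maximal-hit CNVRs that the abstract discusses. A reciprocal-overlap rule, or single-linkage clustering on pairwise overlaps, would be equally plausible readings of the 50 percent threshold and could yield different regions.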