Genome and transcriptome architecture shaped by short tandem repeats: comparison of RNA-DNA differences, reverse transcription errors, and sequencing errors in short tandem repeats

Open Access
- Author:
- Fungtammasan, Arkarachai
- Graduate Program:
- Integrative Biosciences
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- September 14, 2015
- Committee Members:
- Kateryna Dmytrivna Makova, Dissertation Advisor/Co-Advisor
- Ross Cameron Hardison, Committee Member
- Francesca Chiaromonte, Committee Member
- Kristin Ann Eckert, Committee Member
- Keywords:
- microsatellites
- common fragile sites
- error correction model
- sequencing error
- reverse transcription error
- RNA-DNA difference
- Abstract:
- Short Tandem Repeats (STRs) of 1–6 bp DNA motifs are prevalent in genomes and have several interesting properties, including high mutation rates and a tendency to form secondary non-B DNA structures. Because of their high variability, STRs are commonly used as markers in population genetics and forensic science. Several STR loci have medical implications, especially long trinucleotide STRs. To date, more than 40 neurological disorders have been reported to involve expansions or contractions of STRs. My dissertation focuses on three questions pertaining to the role of STRs in genome and transcriptome instability: 1) What is the contribution of STRs to genome instability when other genomic features are also considered? 2) How can genetic variation at STRs be distinguished from sequencing and bioinformatics errors? 3) What are the relative levels of RNA-DNA differences and transcription errors at STRs, and how can biological transcription variation be distinguished from technical errors?

To address these questions, I separated the thesis into three phases. First, I identified the genomic features that contribute to chromosome fragility and analyzed the relative contribution of STRs in the presence of other genomic features. I chose the aphidicolin-induced Common Fragile Sites (aCFSs), which are the largest class of human chromosome fragile sites available for such an analysis. aCFSs tend to be located in R-bands far from centromeres. They also have low CpG island density but high DNA flexibility, a high content of mononucleotide STRs, and a high content of Alu repeats. In addition, the fragility level of aCFSs increases with the density of evolutionarily conserved breakpoints.
Second, to profile STR length from Next Generation Sequencing (NGS) data, I developed the STR-FM pipeline (Short Tandem Repeat profiling using Flank-based Mapping approach) and used it to estimate sequencing errors in Illumina data from standard and PCR-free library preparation protocols. I found that STR sequencing errors increase exponentially with STR length and the number of repeats, and that they have a strong contraction bias. I built a genotyping model that takes sequencing errors into account. The model has high prediction accuracy in both diploid and genetically heterogeneous samples. I also used STR-FM and the genotyping model to estimate de novo germ-line mutation rates from a three-generation trio. In contrast to NGS sequencing errors, germ-line mutations show similar levels of STR expansion and contraction.

Third, I estimated the RNA-DNA Difference (RDD) rates and Reverse Transcription (RT) error rates using two different approaches. Initially, I created a Maximum Likelihood Model to calculate the most likely RDD and RT error rates and their directions using replicated cDNA sequencing data. The RT error rates are approximately one order of magnitude higher than the RDD rates. Next, I verified the RDD and RT error estimates using barcoded RNA sequencing data, which is the current gold standard for detecting RDDs and RT errors. The estimated RT error rates from both approaches are comparable, and both suggest that RT errors have an expansion bias, the opposite of the contraction bias seen in STR sequencing errors. In addition, I proposed an RNA inferring model to estimate the most likely RNA length profile after correcting for sequencing errors and RT errors.

This dissertation builds a platform that allows us to explore the biology of STRs from NGS data. It also illuminates the impact of STRs on genome and transcriptome stability.
Finally, I distributed the tools developed in this research through the Galaxy genomic portal and GitHub to promote reproducibility and the development of science.
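
As an illustration of the idea behind a genotyping model that accounts for sequencing errors, the following is a minimal sketch, not the STR-FM implementation. It assumes a hypothetical error model in which a read miscounts the repeat number by at most one unit, with contractions more likely than expansions, and it picks the diploid genotype that maximizes the likelihood of the observed repeat counts. The function names, the error rate, and the contraction share are all illustrative assumptions. The error rate is held constant here for simplicity, whereas the abstract reports that it grows with STR length.

```python
import math
from itertools import combinations_with_replacement


def error_model(true_len, obs_len, error_rate=0.05, contraction_share=0.8):
    """Hypothetical P(observed repeat count | true repeat count).

    Assumption: only single-unit errors are modelled, and contractions
    receive most of the error mass. Larger discrepancies get a small floor
    so the log-likelihood stays finite; the values are therefore only
    approximately normalized.
    """
    if obs_len == true_len:
        return 1.0 - error_rate
    if obs_len == true_len - 1:
        return error_rate * contraction_share
    if obs_len == true_len + 1:
        return error_rate * (1.0 - contraction_share)
    return 1e-6


def genotype_log_likelihood(genotype, observed):
    """Log-likelihood of the reads, each drawn from either allele with p = 0.5."""
    allele_a, allele_b = genotype
    total = 0.0
    for obs in observed:
        p = 0.5 * error_model(allele_a, obs) + 0.5 * error_model(allele_b, obs)
        total += math.log(p)
    return total


def call_genotype(observed):
    """Return the maximum-likelihood diploid genotype as a sorted pair."""
    # Candidate alleles: every repeat count within one unit of the observed range.
    candidates = range(min(observed) - 1, max(observed) + 2)
    pairs = combinations_with_replacement(candidates, 2)
    return max(pairs, key=lambda g: genotype_log_likelihood(g, observed))


if __name__ == "__main__":
    # 18 reads with 10 repeats plus 2 reads with 9 repeats.
    print(call_genotype([10] * 18 + [9] * 2))
    # An even mix of 10-repeat and 12-repeat reads.
    print(call_genotype([10] * 10 + [12] * 10))
```

Under these assumed parameters, a small minority of reads that are one repeat unit shorter is explained as contraction error rather than as a second allele, while an even split between two well-separated lengths is called heterozygous.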