ASSEMBLY ALGORITHMS FOR NEXT GENERATION SEQUENCE DATA
Open Access
- Author:
- Ratan, Aakrosh
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- October 07, 2009
- Committee Members:
- Webb Colby Miller, Dissertation Advisor/Co-Advisor
Webb Colby Miller, Committee Chair/Co-Chair
Piotr Berman, Committee Member
Stephan Schuster, Committee Member
Raj Acharya, Committee Member
- Keywords:
- next-generation sequencing
genome assembly
algorithms
- Abstract:
- Next-generation sequencing is revolutionizing genomics, promising higher coverage at a lower cost per base than Sanger sequencing. The shorter reads and higher error rates of these new instruments necessitate the development of new algorithms and software. This dissertation describes approaches to some problems in genome assembly with these short fragments. We describe YASRA (Yet Another Short Read Assembler), which performs comparative assembly of short reads using a reference genome that can differ substantially from the genome being sequenced. We explain the algorithm and present the results of assembling one ancient-mitochondrial and one plastid dataset. Comparing the performance of YASRA with the AMOScmp-shortReads and Newbler mapping assemblers (version 2.0.00.17) as template genomes are varied, we find that YASRA generates fewer contigs with higher coverage and fewer errors. We also analyze situations where comparative assembly outperforms de novo assembly, and vice versa, and compare the performance of YASRA with that of the Velvet (version 0.7.53) and Newbler (version 2.0.00.17) de novo assemblers. We utilize the concept of “overlap-graphs” from YASRA to find genetic differences within a target species. We describe a simple pipeline for deducing such locations of variation in the presence of a reference genome and then extend it to deduce polymorphisms in a species without the help of a reference genome. We describe our implementation of this algorithm, DIAL (De Novo Identification of Alleles). The method works even when the coverage is insufficient for de novo assembly and can be extended to determine small indels (insertions/deletions). We evaluate the effectiveness of the approach by detecting heterozygous locations in published Roche/454 sequence data of Dr. James Watson.
We also apply our approach to recent Illumina data from orangutan, in each case comparing our results to those from computational analysis that used a reference genome sequence.