Extended Safe Contigs in the Face of Incomplete Coverage

Open Access
- Author:
- Hutton, John
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Master of Science
- Document Type:
- Master Thesis
- Date of Defense:
- March 22, 2018
- Committee Members:
- Paul Medvedev, Thesis Advisor/Co-Advisor
- Keywords:
- contig assembly
bioinformatics
non-ideal
linear genome - Abstract:
- In bioinformatics, genome assembly is a problem that, as yet, has no direct answer. Due to the way we are able to read DNA, we cannot reconstruct an original genome with 100% accuracy. Instead, we must read shorter segments of nucleotides. From this set of reads we attempt to assemble nucleotide strings that we know will be in the original genome. These contigs can then be used to produce a candidate assembly of the original genome. This method implies that we would like to get the longest contigs possible to use in the assembly process. Tomescu and Medvedev [1] considered this for a circular genome and developed the omnitig algorithm, which provides all possible contigs for a circular genome. However, the work focuses on a perfect case (i.e. the entire genome is represented by error free reads with no coverage gaps). This thesis extends the omnitig algorithm to work on a linear genome. The modified algorithm is then tested along with omnitig and two other algorithms to see how they react to increasingly non-ideal data. We look at a linear genomes in cases when 100% coverage is not achieved and in cases when reads contain errors. We find that the modified algorithm outperforms the other algorithms to varying degrees in terms of contig length and genome coverage. Additionally, we see that the extent of the benefit from the new algorithm is somewhat dependent on the quality of the input data.