Extended Safe Contigs in the Face of Incomplete Coverage

Open Access
Hutton, John
Graduate Program:
Computer Science and Engineering
Master of Science
Document Type:
Master Thesis
Date of Defense:
March 22, 2018
Committee Members:
  • Paul Medvedev, Thesis Advisor
  • contig assembly
  • bioinformatics
  • non-ideal
  • linear genome
In bioinformatics, genome assembly is a problem that, as yet, has no direct answer. Due to the way we are able to read DNA, we cannot reconstruct an original genome with 100% accuracy. Instead, we must read shorter segments of nucleotides. From this set of reads we attempt to assemble nucleotide strings that we know will be in the original genome. These contigs can then be used to produce a candidate assembly of the original genome. This method implies that we would like to get the longest contigs possible to use in the assembly process. Tomescu and Medvedev [1] considered this for a circular genome and developed the omnitig algorithm, which provides all possible contigs for a circular genome. However, the work focuses on a perfect case (i.e. the entire genome is represented by error free reads with no coverage gaps). This thesis extends the omnitig algorithm to work on a linear genome. The modified algorithm is then tested along with omnitig and two other algorithms to see how they react to increasingly non-ideal data. We look at a linear genomes in cases when 100% coverage is not achieved and in cases when reads contain errors. We find that the modified algorithm outperforms the other algorithms to varying degrees in terms of contig length and genome coverage. Additionally, we see that the extent of the benefit from the new algorithm is somewhat dependent on the quality of the input data.