Accurate Measurement of Variants With Continuous Ranges of Frequencies With Next-Generation Sequencing

Open Access
- Author:
- Stoler, Nicholas
- Graduate Program:
- Integrative Biosciences
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- July 15, 2020
- Committee Members:
- Anton Nekrutenko, Dissertation Advisor/Co-Advisor
Anton Nekrutenko, Committee Chair/Co-Chair
Kateryna Dmytrivna Makova, Committee Member
Paul Medvedev, Committee Member
Francesca Chiaromonte, Outside Member
George H Perry, Program Head/Chair
Michael DeGiorgio, Committee Member
- Keywords:
- next generation sequencing
duplex sequencing
variant detection
- Abstract:
- The detection of genetic variants is central to the study of disease, evolution, and populations. Fortunately, next-generation sequencing has enabled genome-wide variant detection at affordable prices. However, detection of low-frequency variants, such as those involved in tumor evolution, mitochondrial disease, and antibiotic resistance, remains a challenge because of the low signal-to-noise ratio in standard sequencing technologies. For applications like these, the accuracy and quality of sequencing data become paramount. The genomics community has worked to address this need in two main ways. First, a great deal of effort has gone into understanding the quality of the raw data produced by current sequencing methods. Second, a series of innovative methods has been developed for improving on the raw data. Many studies have examined the error rates and sequence biases of contemporary sequencing platforms, but they examined small numbers of samples in controlled environments. In addition, manufacturers have introduced many new technologies in recent years, with potential effects on sequencing quality. In this dissertation I develop a method of identifying sequencing errors which can be easily automated and applied retroactively to existing datasets. I demonstrate its utility by performing a survey of 1,943 public datasets from the Sequence Read Archive. With this survey, I uncover differences in the error rates and biases of current Illumina sequencing platforms. I find that the error rates of public datasets from the more expensive, high-throughput instruments are lower and less variable than those of smaller-scale machines. But I also find great variation within each platform, especially the lower-end ones. To improve on these error rates, several groups have developed methods based on consensus sequencing. This approach uses DNA barcodes to combine multiple reads from the same molecule.
The highest-fidelity design, duplex sequencing, can improve on the accuracy of standard sequencing by four orders of magnitude. But the standard software for processing and combining the raw reads of duplex sequencing has limitations. Existing tools require a reference sequence to produce any consensus sequences from the reads. This limits analysis to systems with a suitable reference, and it can introduce reference bias into the consensus sequences. Another issue is the occurrence of errors in the barcodes used to identify reads originating from the same molecule. Standard duplex processing tools simply discard reads affected by barcode errors. Here, I present Du Novo, a tool built to process duplex sequencing reads without the need for a reference. Using real and simulated reads, I show that Du Novo provides nearly the same accuracy as the existing pipeline while yielding more data. In simulations, Du Novo detected 95% of variants at 0.01% minor allele frequency, with no false positives. I also describe substantial improvements over the first version of Du Novo. After several performance improvements, including the replacement of the core multiple sequence aligner, Du Novo 2.0 performs the alignments up to 10x faster. Another improvement is the addition of an error correction pipeline which can recover reads with errors in their barcodes. This feature increases the yield of final consensus sequences by up to 23%. Du Novo is the first tool able to perform reference-free processing of duplex sequencing data, and the first to correct barcode errors in the process. These features enable the analysis of more sample types, with greater accuracy and yield than before.
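The consensus idea described above can be illustrated with a minimal Python sketch. It is not Du Novo's implementation: it assumes reads in a family are already aligned and of equal length (Du Novo uses a multiple sequence aligner for this step), and the function names, the one-mismatch barcode tolerance, and the `min_reads` and `min_fraction` thresholds are hypothetical values chosen only for illustration.

```python
from collections import Counter, defaultdict


def group_by_barcode(reads):
    """Group (barcode, sequence) pairs into families sharing a barcode."""
    families = defaultdict(list)
    for barcode, seq in reads:
        families[barcode].append(seq)
    return dict(families)


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def correct_barcodes(families, max_dist=1):
    """Merge each family into the first larger family whose barcode is
    within max_dist mismatches (an assumed, simplified correction rule)."""
    # Visit barcodes from largest family to smallest; ties broken by name.
    ordered = sorted(families, key=lambda bc: (-len(families[bc]), bc))
    merged = {}
    for bc in ordered:
        target = next((k for k in merged if hamming(bc, k) <= max_dist), None)
        if target is None:
            merged[bc] = list(families[bc])
        else:
            merged[target].extend(families[bc])
    return merged


def consensus(seqs, min_reads=3, min_fraction=0.7):
    """Per-position majority vote over pre-aligned reads.
    Returns None if the family is too small; ambiguous positions become N."""
    if len(seqs) < min_reads:
        return None
    if any(len(s) != len(seqs[0]) for s in seqs):
        raise ValueError("reads must be aligned to equal length")
    out = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        out.append(base if count / len(column) >= min_fraction else "N")
    return "".join(out)


def duplex(strand_a, strand_b):
    """Combine the consensus sequences of the two strands of one molecule,
    keeping only the bases on which both strands agree."""
    return "".join(a if a == b else "N" for a, b in zip(strand_a, strand_b))


# Toy input: one read carries a barcode error (AAAT instead of AAAA).
reads = [
    ("AAAA", "ACGT"),
    ("AAAA", "ACGT"),
    ("AAAA", "ACTT"),
    ("AAAT", "ACGT"),
    ("CCCC", "GGGG"),
]
families = correct_barcodes(group_by_barcode(reads))
consensuses = {bc: consensus(seqs) for bc, seqs in families.items()}
```

In this toy input, the read with barcode `AAAT` is recovered into the `AAAA` family instead of being discarded, which is the kind of yield gain that barcode error correction is meant to provide. The single-read `CCCC` family is too small to produce a consensus under the assumed threshold.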