Compression Algorithms for de Bruijn Graphs and Uncovering Hidden Assembly Artifacts
Open Access
- Author:
- Rahman, Amatur
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- June 28, 2023
- Committee Members:
- Chitaranjan Das, Program Head/Chair
David Koslicki, Major Field Member
Paul Medvedev, Chair & Dissertation Advisor
Francesca Chiaromonte, Outside Unit & Field Member
Mingfu Shao, Major Field Member - Keywords:
- de Bruijn graph
compression
k-mer set
colored de Bruijn graph
assembly
unitig
unitig algorithm
assembly artifact
bidirected de Bruijn graph
palindromic unitig - Abstract:
- In this dissertation, I present four projects covering two main research objectives. The first objective of my dissertation is to optimize storage usage of sequence analysis tools and facilitate faster analysis of sequencing data. The second objective is to better understand the common algorithm-driven causes of assembly errors by investigating the unitig algorithm, which is a core algorithm at the heart of most assemblers. We achieve the first goal by developing space-efficient representation for the de Bruijn graph. There have been numerous indexing data structures proposed that allow to store de Bruijn graphs compactly while supporting fast query operations. However, disk compression algorithms, which do not need to support queries on the compressed data and can thus be more space-efficient, have received little attention. The dearth of specialized compression tools has been a detriment to tool developers, tool users, and reproducibility efforts. Our tools help in alleviating this problem. A de bruijn graph has as its vertices short substrings of fixed length (called k-mers) from original sequence data. We note that because the k-mers in the k-mer set are generated from a long string, they exhibit spectrum-like property and as a result there is inherent redundancy in them. By exploiting this redundancy, we develop compression algorithms which are lossless (i.e. contains the exact k-mers as in original data set) and efficient (i.e. compression for billions of k-mers possible in mid-range server) while achieving better compression ratios than available state-of-the-art methods. Our tool UST was the first one in the series of compression algorithm to outperform the then available compression tools for k-mer sets, which was then superseded by ESS by upto 10 to 27 percent. The final tool in the compression series is called ESS-color, which extends its support to compress colored de Bruijn graph, and achieves better compression than all evaluated tools and all datasets, with no other tool able to consistently achieve less than 44% space overhead. We believe that these tools and studies developed during my PhD studies will enhance the quality of k-mer-based bioinformatics softwares and large-scale data analysis by making k-mer sets readily available to tool developers. To achieve the second objective, which is to uncover hidden source of assembly errors, we investigate unitigs algorithm, a common denominator of de Bruijn graph based assembly. Recent assemblies by the T2T and VGP consortia have achieved significant accuracy but required a tremendous amount of effort and resources. More typical assembly efforts, on the other hand, still suffer both from mis-assemblies (joining sequences that should not be adjacent) and from under-assemblies (not joining sequences that should be adjacent). By investigating the unitig algorithm, we prove that, contrary to popular belief, even when there are no sequencing errors, unitigs are not always safe (i.e. they are not guaranteed to be substrings of the sequenced genome). We also prove that the unitigs of a bidirected de Bruijn graph are different from those of a doubled de Bruijn graph and, contrary to our expectations, result in under-assembly. Using experimental simulations, we then confirm that these two artifacts exist not only in theory but also in the output of widely used assemblers. In particular, when coverage is low then even error-free data results in unsafe unitigs; also, unitigs may unnecessarily split palindromes in half if special care is not taken. This work was the first to theoretically predict the existence of these assembler artifacts and confirm and measure the extent of their occurrence in practice. I believe our work contributes to better understanding of errors in assembly algorithms and has the potential to lead studies of improving assembly quality.