Advancing K-mer Methods for Metagenomic Research

Open Access
- Author:
- Liu, Shaopeng
- Graduate Program:
- Bioinformatics and Genomics (PhD)
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- June 25, 2024
- Committee Members:
- David Koslicki, Program Head/Chair
Erika Ganda, Outside Unit Member
David Koslicki, Chair & Dissertation Advisor
Paul Medvedev, Major Field Member
Qunhua Li, Minor & Outside Field Member
- Keywords:
- Metagenomics
MinHash
K-mer algorithm
Knowledge Graph
- Abstract:
- Metagenomics, the comprehensive study of genomic material extracted directly from environmental samples, presents complex challenges in bioinformatics. These challenges arise primarily from the vast diversity of microbial communities and the extensive volume of reference databases. K-mer-based algorithms have become a fundamental tool for addressing these obstacles, providing an efficient means to analyze and interpret metagenomic data. In this dissertation, we aim to advance the development, application, and effectiveness of k-mer-based methods in metagenomics, emphasize their crucial role in enhancing the precision and speed of metagenomic analyses, and facilitate the integration of computational resources across diverse applications.
In Chapter 1, we lay the groundwork by introducing the fundamental concepts of metagenomics and outlining the current challenges in metagenomic analysis. We then discuss frequently used k-mer-based methods in metagenomics, focusing on the MinHash algorithm and its applications. Lastly, we explore knowledge graphs, a promising approach that can be integrated with metagenomic analysis to enhance data mining capabilities in this field.
In Chapter 2, we present CMash, an implementation of the containment MinHash algorithm, which improves on the classic MinHash algorithm to provide more robust k-mer-based similarity estimates. CMash also incorporates multi-resolution capabilities through k-mer truncation, allowing similarities to be estimated simultaneously across a range of k values and significantly reducing the computational effort required when multiple k values are needed.
In Chapter 3, we extend the application of the FracMinHash algorithm, which has already been established and employed for taxonomic profiling. We adapt FracMinHash for functional profiling by integrating it with the Kyoto Encyclopedia of Genes and Genomes (KEGG) database, and we show that it outperforms DIAMOND, a dominant functional profiler in metagenomics. We have developed a functional profiling pipeline, which will be integrated with our metagenomic knowledge graph efforts for data mining purposes. This pipeline is designed for public use, enhancing the accessibility and utility of k-mer-based sketching methods in metagenomic analysis.
In Chapter 4, we continue the exploration of k-mer sketching methods by examining syncmers. We find that syncmers share similar characteristics with the FracMinHash sketch, offering equivalent k-mer-based similarity estimates. At the same time, syncmers offer an optimized sampling method that selects k-mers more evenly across the genome, making them advantageous as seeds for sequence matching. In practical applications, we demonstrate that syncmers can effectively replace the FracMinHash sketch for metagenomic comparisons while ensuring greater genomic conservation.
Chapter 5 focuses on multi-resolution k-mer methods. In CMash, we show the efficacy of multi-resolution k-mer-based similarity estimation by truncating the k-mer sketch. Building on this concept, we develop a generalized multi-resolution framework for additional k-mer methods, specifically strobemers and syncmers. This insight led to the development of multi-resolution syncmers and multi-resolution strobemers. By integrating these advancements, we have devised an innovative seeding method, ERS-mer, that could enhance the flexibility of the original seeding method in a state-of-the-art aligner, Strobealign.
Chapter 6 covers our efforts in metagenomic data mining. We integrate a knowledge graph (KG) with metagenomic analysis, where k-mer methods serve as a crucial bridge connecting the KG, reference databases, and sample analysis. This chapter details the development of a comprehensive Metagenomic Knowledge Graph (MKG), which integrates the k-mer-based sketching methods from our preceding work to map taxonomic and functional profiles, uncover hidden biological associations, and create connections within the broader realm of metagenomic knowledge. The MKG integrates microbe- and disease-relevant knowledge, including microbial taxonomies, genomes, genetic elements, pathways, drugs, and disease information, into a targeted microbial-specific knowledge graph. It is further integrated with the general-purpose biomedical knowledge graph RTX-KG2, enhancing biological interconnections. Metagenomic samples are profiled using the k-mer-based sketching methods outlined previously and then integrated into the MKG to derive sample-specific metagenomic signatures with an integrated functional hierarchy. Additionally, the graph’s topology is leveraged to explore unknowns, such as predicting potential pathogens, thereby unlocking new dimensions of information from metagenomic samples and expanding the frontiers of microbial research.
In conclusion, this dissertation underscores the advancements and applications of k-mer methods in metagenomics for extensive genomic analysis and exploration. It lays a foundation for future metagenomic research and emphasizes that the MKG will substantially enhance our comprehension of the intricate dynamics within metagenomic communities.
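For readers unfamiliar with the sketching idea the abstract relies on, the following is a minimal illustrative sketch of a FracMinHash-style sketch and a containment estimate. It is not the dissertation's implementation and is not taken from CMash or any other tool. The function names, the choice of BLAKE2b as the hash, and the `scaled` parameter name are assumptions made for illustration, and the code omits canonical (reverse-complement) k-mers and any bias correction.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space used below


def kmers(seq, k):
    """Return the set of distinct k-mers in seq (no canonicalization)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq, k, scaled):
    """Keep the k-mer hashes that fall in the lowest 1/scaled of the hash space."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash64, kmers(seq, k)) if h < threshold}


def containment(sketch_a, sketch_b):
    """Estimate |A ∩ B| / |A| from sketches built with the same k and scaled."""
    if not sketch_a:
        return 0.0
    return len(sketch_a & sketch_b) / len(sketch_a)


# Toy example: with scaled=1 every k-mer is kept, so the estimate is exact.
genome = "ACGTACGTGGCA"
read = "ACGTACGT"
sketch_genome = frac_minhash(genome, k=4, scaled=1)
sketch_read = frac_minhash(read, k=4, scaled=1)
print(containment(sketch_read, sketch_genome))  # 1.0: all read k-mers occur in the genome
print(containment(sketch_genome, sketch_read))  # 0.5: 4 of the 8 genome k-mers occur in the read
```

With a larger `scaled` value, each sketch retains roughly a 1/`scaled` fraction of the distinct k-mers, and the same ratio serves as an estimate of containment rather than an exact value.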