Advancing Genomic and Transcriptomic Knowledge Through Reproducible Bioinformatics Workflows

Open Access
- Author:
- Sebastian, Aswathy
- Graduate Program:
- Bioinformatics and Genomics (PhD)
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- October 13, 2023
- Committee Members:
- Andrew Patterson, Major Field Member
- Shaun Mahony, Chair of Committee
- Istvan Albert, Major Field Member & Dissertation Advisor
- Reka Albert, Outside Unit & Field Member
- David Koslicki, Program Head/Chair
- Keywords:
- Bioinformatics
- Workflows
- Genomics
- Transcriptomics
- Metabarcoding
- Abstract:
- Bioinformatics research involves managing various types of data and employing diverse computational methods. Workflows consolidate discrete data analysis tasks into a unified process. Automation, defined as using computational technologies with minimal human intervention, proves vital in addressing the increasing complexity of workflows and meeting scientific standards for reproducibility, scalability, and reusability. Nevertheless, integrating varied datasets and software tools remains challenging. This dissertation focuses on identifying suitable data analysis techniques, creating reusable and reproducible analysis pipelines, and using them for scientific discovery. The developed computational methods are applied specifically to genome assembly, transcriptome analysis, and metabarcoding studies. In Chapter 1, we provide an overview of bioinformatics data analysis and discuss the challenges of workflow creation. We examine the importance of automation in computational analyses and compare popular approaches to workflow automation. In Chapter 2, we develop and evaluate various genome assembly workflows that integrate multiple data sources. Using these workflows, we generate the first complete genomic assembly for a Plasmodium parasite, the Plasmodium yoelii 17XNL strain. Annotation of the assembly yields improved gene models that provide complete transcript information. Comparison between the 17XNL genome and its closest relative, 17X, reveals high similarity, with differences occurring in genes of functional importance. In Chapter 3, we perform an RNA-Seq analysis to investigate the role of the NOT1 gene in Plasmodium transmission. Differential expression analysis reveals multiple mRNAs involved in regulating RNA metabolism during host-to-vector transmission. The study employs the Transcript Integrity Number (TIN) metric as a selection criterion for differential expression.
In Chapter 4, we develop tincheck, a tool that computes the Transcript Integrity Number (TIN), a quantitative metric that gauges transcript coverage evenness. TIN is rarely used in RNA-Seq analysis due to the absence of user-friendly tools. tincheck enables seamless TIN calculation and outperforms existing alternatives. Our tests show that using TIN as a filtering criterion in differential expression analysis enhances the precision of subsequent functional annotation results. In Chapter 5, we develop a workflow for creating a DNA marker database suitable for metabarcoding analysis. The database accommodates user-specified taxonomic groups and marker sequences, enabling the use of taxa-specific marker references. We apply the workflow to create a fish-specific mitochondrial marker database that serves as the reference for multiple metabarcoding studies conducted by the U.S. Fish and Wildlife Service to assess and document fish diversity. This dissertation presents novel methods and tools used to generate biological insights while promoting reproducible research practices. The dissertation contributes to an improved understanding of genome assembly, transcriptome analysis, and metabarcoding studies, laying the groundwork for future scientific exploration and discoveries. The workflows and computational methods developed in this dissertation are available from the GitHub repository https://github.com/aswathyseb/bioinformatics_workflows.git