Scalable and Accurate Algorithms for Search in Genomic Big Data and Analysis of Genetic Variants

Open Access
- Author:
- Sun, Chen
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- August 24, 2018
- Committee Members:
- Paul Medvedev, Dissertation Advisor/Co-Advisor
Paul Medvedev, Committee Chair/Co-Chair
Daniel Kifer, Committee Member
Kamesh Madduri, Committee Member
Michael Degiorgio, Outside Member
- Keywords:
- Algorithm
Big Data
Database
Genomics
Probabilistic Data Structures
Data Mining
Parallel Computing
Bioinformatics
Genetic Variants
SNP
Computational Biology
- Abstract:
- Next-generation sequencing technology has been used extensively in biological and medical research. The abundance of genomic data it generates brings new challenges: the growing volume of genomic data pushes the boundaries of current search and analysis methods. Two of the main challenges for genomic big data are how to search efficiently to identify datasets of interest, and how to analyze the large volume of data in depth to discover new knowledge in genomics. In this dissertation, we present four research contributions that aim to tackle these two challenges. The AllSome Sequence Bloom Tree data structure and its associated search algorithms are introduced first, to help find datasets of interest, filter out irrelevant ones, and reduce the data size. To meet the demand for deeper analysis, several scalable algorithms for sequence analysis are then introduced. Based on them, a genetic variant analysis toolkit is developed; it contains three methods (ISVDA, VarGeno, and VarMatch) that address different directions of small genetic variant study. ISVDA is an iterative small variant discovery algorithm that can detect small genetic variants that were previously hard to detect. VarGeno is a fast and accurate single nucleotide polymorphism (SNP) genotyping tool. VarMatch finds high-confidence variants among multiple variant detection results and can also be used to evaluate variant calling results.