Bridging Paired-end RNA-seq reads

Open Access
- Author:
- Li, Xiang
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Master of Science
- Document Type:
- Master Thesis
- Date of Defense:
- March 17, 2021
- Committee Members:
- Mingfu Shao, Thesis Advisor/Co-Advisor
Chitaranjan Das, Program Head/Chair
Paul Medvedev, Committee Member - Keywords:
- RNA-seq
transcript assembly
de Bruijn graph - Abstract:
- The widely-used high-throughput RNA-sequencing technologies (RNA-seq) usually produce paired-end reads. We explore if full fragments can be computationally reconstructed from the sequenced two ends—a problem here we refer to as bridging. Solving this problem provides longer, more informative RNA-seq reads, and hence benefits downstream RNA-seq analysis such as transcriptome assembly and expression quantification. However, bridging is a challenging and complicated task owing to alternative splicing, transcript noises, and sequencing errors. It remains unclear if the data itself provides sufficient information for accurate bridging, let alone proper models and efficient algorithms that characterize and determine the true bridges. We proposed a novel mathematical formulation to seek a path in a compacted de Bruijn graph for bridging such that its bottleneck weight is maximized. This formulation characterizes true bridges and is efficient in filtering out false bridges. This formulation admits optimal substructure property, and hence efficient dynamic programming algorithms can be designed. We designed a new truncated Dijkstra’s algorithm for this problem and proposed a novel algorithm that reuses the shortest path tree to avoid running the truncated Dijkstra’s algorithm from scratch for all vertices for further speeding up. These innovative techniques result in scalable algorithms that can bridge all paired-end reads in a compacted de Bruijn graph with millions of vertices. Our experiments showed that paired-end RNA-seq reads can be accurately bridged to a large extend. High precision was observed with varied sensitivity in both simulation and real data.