Analysis of RNA-Seq Data with Excess of Zeros

Open Access
- Author:
- Nunes, Marcus Alexandre
- Graduate Program:
- Statistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- June 20, 2013
- Committee Members:
- James Landis Rosenberger, Dissertation Advisor/Co-Advisor
Yu Zhang, Committee Chair/Co-Chair
Qunhua Li, Committee Member
John Edward Carlson, Special Member
- Keywords:
- generalized linear models
RNA-Seq
design of experiments
optimal
- Abstract:
- Next-generation sequencing technologies are revolutionizing the analysis of genomics data. Also called massively parallel signature sequencing (MPSS), these methods produce large amounts of data by generating and identifying millions of short sequences of genetic code. These short sequences are aligned to a reference genome, and the number of reads that fall in each gene is counted. The counts obtained from this procedure define the digital expression of the genes. However, a large portion of these counts are zeros. The conventional Generalized Linear Models used to test differential expression in RNA-Seq data cannot deal with this issue satisfactorily. In this work we propose a method that handles this characteristic of genomic data by using the Hurdle model. We fit Hurdle models to count data from next-generation sequencing sources, and we develop a Likelihood Ratio Test that compares the fits of two of these models in order to decide which one better fits the data. We also derive near-optimal designs for these models, using a variation of the exchange algorithm. We present simulation results to demonstrate the performance of our proposed method and compare it to current methods. To gain an understanding of the method's characteristics, we analyze several cases with different parameters. To assess the power and the asymptotic behavior of our test, we simulate simple examples where the gene counts do not belong to a genomic dataset. To evaluate how our method performs in real-world applications, we simulate datasets that resemble real counts from RNA-Seq experiments. Moreover, we compare our method to a well-known differential gene expression method from the literature.
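
As a rough illustration of the idea behind a hurdle model and a likelihood ratio test on its fit, the following minimal sketch may help. It is not the dissertation's implementation: it assumes an intercept-only hurdle model with a zero-truncated Poisson count part (the abstract does not say which count distribution is used), it compares a pooled fit against separate fits for two groups of counts, and it relies on the asymptotic chi-square approximation with 2 degrees of freedom. The function names `fit_hurdle_poisson` and `hurdle_lrt` are hypothetical.

```python
import math


def fit_hurdle_poisson(counts):
    """Intercept-only hurdle Poisson fit; returns (p_zero, lam, loglik).

    Assumes `counts` is a non-empty list of non-negative integers.
    """
    n = len(counts)
    positives = [y for y in counts if y > 0]
    n_pos = len(positives)
    n_zero = n - n_pos
    p_zero = n_zero / n  # MLE of P(count == 0)

    # Zero part: Bernoulli log-likelihood, with the convention 0 * log(0) = 0.
    loglik = 0.0
    if n_zero:
        loglik += n_zero * math.log(p_zero)
    if n_pos:
        loglik += n_pos * math.log(1.0 - p_zero)

    # Count part: zero-truncated Poisson. Its MLE solves
    # lam / (1 - exp(-lam)) = mean of the positive counts.
    lam = 0.0
    mean_pos = sum(positives) / n_pos if n_pos else 0.0
    if mean_pos > 1.0:
        lo, hi = 1e-12, mean_pos  # the root lies between these bounds
        for _ in range(200):  # bisection on an increasing function
            mid = (lo + hi) / 2.0
            if mid / (-math.expm1(-mid)) < mean_pos:
                lo = mid
            else:
                hi = mid
        lam = (lo + hi) / 2.0
        for y in positives:
            loglik += (y * math.log(lam) - lam - math.lgamma(y + 1)
                       - math.log(-math.expm1(-lam)))
    # If every positive count equals 1, lam tends to 0 and the count part
    # contributes 0 to the log-likelihood, so nothing is added.
    return p_zero, lam, loglik


def hurdle_lrt(group_a, group_b):
    """Likelihood ratio test: one shared hurdle model vs. one per group."""
    ll_a = fit_hurdle_poisson(group_a)[2]
    ll_b = fit_hurdle_poisson(group_b)[2]
    ll_pooled = fit_hurdle_poisson(list(group_a) + list(group_b))[2]
    stat = max(2.0 * ((ll_a + ll_b) - ll_pooled), 0.0)
    # Four parameters vs. two gives 2 degrees of freedom; the chi-square
    # survival function with 2 df is exactly exp(-x / 2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value
```

Calling `hurdle_lrt(counts_condition_1, counts_condition_2)` on the read counts of a single gene under two conditions returns the test statistic and an approximate p-value. A small p-value suggests that the two conditions differ in the proportion of zeros, in the size of the positive counts, or in both. With few replicates the chi-square approximation may be poor.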