A Generalized Linear Model For Peak Calling in ChIP-Seq Data
Open Access
- Author:
- Xu, Jialin
- Graduate Program:
- Statistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- July 26, 2012
- Committee Members:
- Yu Zhang, Dissertation Advisor/Co-Advisor
Naomi S Altman, Committee Member
Debashis Ghosh, Committee Member
Ross Cameron Hardison, Committee Member
- Keywords:
- Generalized Linear Model
Negative binomial distribution
ChIP-Sequencing
Peak calling
FDR
- Abstract:
- Chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-Seq) has become a routine method for detecting genome-wide protein-DNA interactions. The success of ChIP-Seq data analysis depends heavily on the quality of peak calling, which detects peaks of tag counts at genomic locations and evaluates whether each peak corresponds to a real protein-DNA interaction event. The challenges in peak calling include 1) how to combine the forward and reverse strand tag data to improve the power of peak calling, 2) how to account for the variation of tag data observed across different genomic locations, and 3) how to use negative control data to reduce false positives caused by regional biases that might be generated by local structure. I introduce a new peak calling method based on a generalized linear model (GLMNB) that uses the negative binomial distribution to model tag count data and accounts for the variation of background tags, which may randomly bind to the DNA sequence at varying levels due to local genomic structure and sequence content. I allow local shifting of the peaks observed on the forward and reverse strands, such that at each potential binding site, a binding profile representing the pattern of a real peak signal is fitted by maximum likelihood to best explain the observed tag data. The method can also detect multiple peaks within a local region if the region contains multiple binding sites. I also extend the model to incorporate ChIP-Seq data with multiple tracks in order to answer broader scientific questions. Assuming there are k ChIP replicates and one negative control data set under c biological conditions, the extended model with a likelihood ratio test can be used to identify 1) binding events under at least one condition or 2) differential binding events under different biological conditions.
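
As a rough illustration of the two statistical ingredients named in the abstract, negative binomial tag counts and a likelihood ratio test, the following Python sketch tests one genomic window for enrichment over background. It is not the GLMNB method itself: it omits strand shifting, the binding profile, covariates, and replicates. The function names, the fixed dispersion parameter `size`, the assumption that the background mean is already known (for example, estimated from a negative control), and the chi-square reference distribution with one degree of freedom are all assumptions made for this example.

```python
import math


def nb_logpmf(y, mu, size):
    """Log pmf of a negative binomial with mean mu and dispersion `size`.

    Under this parameterization the variance is mu + mu**2 / size.
    """
    return (
        math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
        + size * math.log(size / (size + mu))
        + y * math.log(mu / (size + mu))
    )


def nb_loglik(counts, mu, size):
    """Log-likelihood of independent counts sharing one mean and dispersion."""
    return sum(nb_logpmf(y, mu, size) for y in counts)


def enrichment_lrt(chip_counts, background_mu, size):
    """Likelihood ratio test of 'mean == background_mu' vs 'mean > background_mu'.

    Hypothetical helper: background_mu (> 0) and size are treated as known.
    With size fixed, the maximum likelihood estimate of the mean is the
    sample mean; it is clamped at background_mu to keep the test one-sided.
    """
    mu_hat = max(sum(chip_counts) / len(chip_counts), background_mu)
    stat = 2.0 * (
        nb_loglik(chip_counts, mu_hat, size)
        - nb_loglik(chip_counts, background_mu, size)
    )
    stat = max(stat, 0.0)  # guard against tiny negative rounding error
    # Survival function of a chi-square with 1 degree of freedom (assumed
    # reference distribution): P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return mu_hat, stat, p_value


if __name__ == "__main__":
    # Made-up tag counts for one window; not data from the dissertation.
    window = [20, 25, 30, 22]
    mu_hat, stat, p_value = enrichment_lrt(window, background_mu=2.0, size=5.0)
    print(f"mu_hat={mu_hat:.2f} LRT={stat:.1f} p={p_value:.3g}")
```

In a genome-wide analysis, the p-values from many such windows would still need a multiple-testing correction, in line with the FDR keyword above.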