Computational Approaches for Integrative Detection of Protein-DNA Interactions

Open Access
Chen, Kuanbei
Graduate Program:
Computer Science and Engineering
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
October 08, 2013
Committee Members:
  • Webb Colby Miller, Dissertation Advisor
  • Webb Colby Miller, Committee Chair
  • Dr Yu Zhang, Dissertation Advisor
  • Ross Cameron Hardison, Committee Member
  • Padma Raghavan, Committee Member
  • Kamesh Madduri, Committee Member
Keywords:
  • Computational Biology
  • Bioinformatics
  • Peak-Calling
  • ChIP-seq
  • Gene regulation
  • Comparative Genomics
Gene regulation is a complex process that usually involves the cooperation of multiple transcription factors (TFs), which may bind to a regulatory module in DNA and form a complex that regulates the target gene's expression. Massively parallel second-generation sequencing of DNA samples highly enriched for TF occupancy (ChIP-seq) enables comprehensive and accurate mapping of epigenetic features at reasonably high resolution. As the volume of such data grows, one challenge is to develop computational methods that incorporate multiple relevant datasets to better understand gene regulatory mechanisms. In this dissertation, we describe two novel computational approaches that detect TF binding features by incorporating multiple relevant ChIP-seq datasets. The first method is called PASS2 (Poisson Approximation for Significance version 2). Whereas traditional methods use ChIP-seq data from only one experiment to detect binding occupancy, PASS2 combines relevant biological features (e.g., co-binding information) to detect binding of a target TF. The underlying idea is that the binding of a target TF can be partially learned from its co-binding proteins, so including the relevant data can improve the power of protein-binding detection. Beyond detecting protein binding occupancy, identifying differential binding regions across conditions (cell lines, time points, and individuals) also helps us understand gene regulation. Such approaches remain very limited, and thus we develop a second method, Cross-CaP (Cross Conditions and Proteins), which is designed to identify differential and condition-specific TF occupancy across multiple conditions and features (e.g., TFs). The method is general and can be applied to datasets with at least two conditions and one feature. It also works when only a few or no biological replicates are available.
We apply both methods to simulated and real datasets to demonstrate their robustness and the power of data integration.