Distinctive Genomic Features of Erythroid Cis-Regulatory Modules

Open Access
Zhang, Ying
Graduate Program:
Doctor of Philosophy
Document Type:
Date of Defense:
April 16, 2009
Committee Members:
  • Ross Cameron Hardison, Dissertation Advisor
  • Robert Paulson, Committee Chair
  • Ross Cameron Hardison, Committee Member
  • Douglas Cavener, Committee Member
  • Francesca Chiaromonte, Committee Member
  • Kateryna Dmytrivna Makova, Committee Member
Keywords:
  • Cis-Regulatory Modules
  • GATA1
  • Word Enumeration
  • ChIP
Understanding the regulation of gene expression is a major challenge in biology. My dissertation aims to improve our ability to reliably identify cis-regulatory modules (CRMs) in vertebrates. With the growing number of completed and high-quality draft sequences of vertebrate genomes, comparative genomics and other bioinformatics methods have become first-line approaches for predicting and analyzing CRMs. Recently, our lab reported two large-scale investigations of erythroid cis-regulatory modules. One used a systematic approach to predict and test erythroid CRMs (RP-based computational predictions followed by reporter-gene assays); the other used chromatin immunoprecipitation coupled with microarrays to identify sites occupied in vivo by GATA1. Together, these studies identified 42 functional CRMs and 63 sites occupied in vivo by GATA1. To improve the predictive power of the computational models and to investigate how well motifs predict occupancy, both conservation-based (ESPERR algorithm) and motif-based (direct enumeration of words) bioinformatic methods were applied to the current datasets in an attempt to decode the genomic and bioinformatic signals associated with active DNA fragments. ESPERR can distinguish known erythroid CRMs from neutral DNA, but it reached its limits when used to discriminate GATA1-occupied sites from unoccupied ones. Direct enumeration of words can identify motifs that are predictive of occupancy given the presence of WGATAR, but additional signals are needed to correctly identify the one real binding site among dozens of candidates. Repeated cycles of computational prediction and biological testing, with new knowledge incorporated into each successive model, should refine our ability to correctly identify cis-regulatory modules.
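As an illustration of what "direct enumeration of words given the presence of WGATAR" could look like in practice, the following is a minimal Python sketch and not the method used in the dissertation. It assumes two sets of DNA sequences (occupied and unoccupied), keeps only those containing the WGATAR motif on either strand, counts every k-mer ("word") once per sequence, and ranks words by a simple pseudocount-smoothed frequency ratio. The function names, the word length k=6, and the scoring ratio are all hypothetical choices made for this example.

```python
import re
from collections import Counter

# IUPAC codes: W = A or T, R = A or G, Y = C or T.
# WGATAR on the forward strand, or its reverse complement YTATCW.
WGATAR = re.compile(r"[AT]GATA[AG]|[CT]TATC[AT]")


def has_wgatar(seq):
    """Return True if the sequence contains WGATAR on either strand."""
    return WGATAR.search(seq.upper()) is not None


def count_words(seqs, k):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        words = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        # Skip words with ambiguous bases such as N.
        counts.update(w for w in words if set(w) <= set("ACGT"))
    return counts


def enriched_words(occupied, unoccupied, k=6, pseudocount=1.0):
    """Rank k-mers by smoothed frequency ratio (occupied / unoccupied).

    Only sequences that contain WGATAR are compared, so the ranking
    reflects signals beyond the core motif itself. The ratio is an
    assumption of this sketch, not a statistic taken from the source.
    """
    occ = [s for s in occupied if has_wgatar(s)]
    unocc = [s for s in unoccupied if has_wgatar(s)]
    occ_counts = count_words(occ, k)
    unocc_counts = count_words(unocc, k)
    scores = {}
    for word in set(occ_counts) | set(unocc_counts):
        f_occ = (occ_counts[word] + pseudocount) / (len(occ) + 2 * pseudocount)
        f_un = (unocc_counts[word] + pseudocount) / (len(unocc) + 2 * pseudocount)
        scores[word] = f_occ / f_un
    # Highest ratio first; ties broken alphabetically for reproducibility.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Calling enriched_words(occupied, unoccupied) returns a list of (word, ratio) pairs, with the words most over-represented in occupied WGATAR-containing sequences first. A real analysis would need a proper statistical test and correction for multiple comparisons, which this sketch omits.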