Evaluating Multiple Sequence Alignments Using ENCODE Data

Open Access
Author:
Wang, Qingyu
Graduate Program:
Integrative Biosciences
Degree:
Master of Science
Document Type:
Master Thesis
Date of Defense:
June 19, 2013
Committee Members:
  • Webb Colby Miller, Thesis Advisor
  • Ross Cameron Hardison, Thesis Advisor
  • Naomi S Altman, Thesis Advisor
Keywords:
  • Multiple Sequence Alignment
  • Motif-based evaluation
  • ChIP-seq
Abstract:
Comparative genomics has become increasingly prominent over the last decade as a key approach to decoding genomic sequences. A crucial prerequisite for comparative genomics is the alignment of multiple genomic sequences, especially whole-genome multiple sequence alignments (MSAs), on which various downstream analyses implicitly rely. Unfortunately, whole-genome MSAs still suffer from inadequate reliability, and the problem of assessing their quality has not yet been fully addressed. Whether we seek better alignment methods or want to select the most reliable of the available MSAs, a practical evaluation method for MSAs is imperative. We therefore propose a new method, MSAME (MSA Motif-based Evaluation), that quantifies the reliability of MSAs using ChIP-seq data produced by the ENCODE project. Our method is one of the first MSA evaluation methods based on experimental data. Instead of using simulations that rely on evolutionary models, we define a biological criterion inferred from different scenarios of motif shifting in ChIP-seq peak regions and evaluate MSAs from this functional perspective. Our method efficiently identifies two types of MSA errors and is robust to noise introduced by parameter changes. Applying our method to the high-coverage 11-way eutherian MultiZ and EPO alignments, we identify 7.9% and 6.7%, respectively, of the alignments in specific ChIP-seq bound regions as unreliable. As a by-product, the method also identifies putative evolutionary motif shifting. Finally, we further analyze the putative evolutionary motif shifting events detected at binding sites of the transcription factors GATA1, CTCF, and NRSF.
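To make the idea of motif shifting within an alignment concrete, the following is a minimal Python sketch. It is not the MSAME algorithm, whose criterion and scoring the abstract does not specify. The function names, the labels "aligned", "shifted", and "motif_absent", and the toy alignment rows are all hypothetical. The motif is the simplified GATA consensus WGATAR written as a regular expression, which is an assumption made for illustration only. The sketch locates motif occurrences in two rows of an alignment, maps them to alignment columns, and reports whether the occurrences fall in the same column.

```python
import re


def motif_columns(aligned_row, motif_regex):
    """Return the alignment column of each motif start in one aligned row.

    The motif is searched in the ungapped sequence, and each hit is mapped
    back to its column in the alignment. Gaps are assumed to be '-'.
    """
    # Column index of every non-gap character, in sequence order.
    col_of_pos = [i for i, ch in enumerate(aligned_row) if ch != "-"]
    ungapped = aligned_row.replace("-", "").upper()
    # A lookahead makes finditer report overlapping occurrences as well.
    pattern = re.compile("(?=(" + motif_regex + "))")
    return [col_of_pos[m.start()] for m in pattern.finditer(ungapped)]


def classify_motif_alignment(ref_row, other_row, motif_regex):
    """Label a pair of aligned rows by where the motif falls (hypothetical labels)."""
    if len(ref_row) != len(other_row):
        raise ValueError("aligned rows must have the same length")
    ref_cols = motif_columns(ref_row, motif_regex)
    other_cols = motif_columns(other_row, motif_regex)
    if not ref_cols or not other_cols:
        return "motif_absent"   # motif missing in at least one species
    if set(ref_cols) & set(other_cols):
        return "aligned"        # motif starts share an alignment column
    return "shifted"            # motif present in both rows, different columns


# Simplified GATA consensus (WGATAR); an assumption for illustration only.
GATA_MOTIF = "[AT]GATA[AG]"

# Toy rows, not real ENCODE or MultiZ/EPO data.
print(classify_motif_alignment("ACAGATAAGCTT--", "ACAGATAAGCTT--", GATA_MOTIF))
print(classify_motif_alignment("ACAGATAAGCTT--", "--ACAGATAAGCTT", GATA_MOTIF))
print(classify_motif_alignment("ACAGATAAGCTT--", "ACAGCTAAGCTT--", GATA_MOTIF))
```

A "shifted" label in this sketch cannot by itself distinguish an alignment error from a genuine evolutionary relocation of the binding site. Separating those scenarios is the kind of question the thesis addresses with its own criterion.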