Open Access
Song, Giltae
Graduate Program:
Computer Science and Engineering
Doctor of Philosophy
Document Type:
Date of Defense:
October 10, 2011
Committee Members:
  • Webb Colby Miller, Dissertation Advisor
  • Webb Colby Miller, Committee Chair
  • Raj Acharya, Committee Member
  • Padma Raghavan, Committee Member
  • Ross Cameron Hardison, Committee Member
  • Yu Zhang, Committee Member
Keywords:
  • duplication
  • evolution
  • orthology
  • gene clusters
  • conversion
Gene clusters containing multiple similar genomic regions in close proximity are of great interest for biomedical studies because of their associations with inherited diseases. However, such regions are difficult to analyze because of their structural complexity and their complicated evolutionary histories, which reflect a variety of large-scale mutational events. Current computational methods for extracting evolutionary information from sequence data for such clusters are suboptimal. We describe a new method called CAGE for reconstructing the recent evolutionary history of gene clusters, and evaluate its performance on both simulated data and actual human gene clusters.

Although our CAGE program for inferring the evolutionary history of gene clusters provides useful information, our analysis still encounters computational challenges. One of the major reasons for these difficulties is gene conversion among the duplicated regions of the cluster, which can obscure their true relationships. Many computational methods for detecting gene conversion events have been released, but their performance has not been assessed for wide deployment in evolutionary history studies because accurate evaluation methods have been lacking. We designed a new method that simulates gene cluster evolution, including large-scale events of duplication, deletion, and conversion as well as small mutations. We used the simulated data to evaluate several programs for detecting gene conversion events, and the evaluation identifies the strengths and weaknesses of each. Traditional methods for analyzing sequence data in gene clusters, such as construction of phylogenetic trees or multi-species alignments, generate distorted information when conversion has occurred. To correct this, we have developed an automated pipeline for detecting conversion events using the best-performing approach from our evaluation study.
This pipeline is available in our software suite called CHAP (Cluster History Analysis Package), and we used it to analyze the conversion events that affected two well-studied gene clusters (α-globin and β-globin) and three gene clusters for which comparative sequence data were generated from seven primate species: CCL (chemokine ligand), IFN (interferon), and CYP2abf (part of cytochrome P450 family 2).

One concept describing the evolutionary relationships in gene clusters is orthology. Orthologs derive from a common ancestor by speciation, and paralogs by duplication. Discriminating orthologs from paralogs is a necessary step in most multiple-species sequence analyses. Accurately mapping orthology, however, is complicated by conversion events. We propose a refined method of orthology assignment based on two paradigms for interpreting the definition: by position or by content. The former traces orthology resulting from speciation and duplication, while the latter also includes the influence of conversion events. We have developed a computational method for automatically mapping both types of orthology in gene clusters, and have extended our CHAP software package for analyzing cluster histories to include this new orthology pipeline; we call this new package CHAP 2. The results are visualized so that users can examine them easily. We evaluate this method using both simulated data and real gene clusters, including the well-known α-globin and β-globin clusters. We also use CHAP 2 to analyze four more loci: CCL (chemokine ligand), IFN (interferon), CYP2abf (part of cytochrome P450 family 2), and KIR (killer cell immunoglobulin-like receptors).

These new methods and results facilitate and extend our understanding of evolution at these and other loci by adding automated, accurate evolutionary inference to the biologist's toolkit. CHAP 2 is freely available at lab.
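
As an illustration of the ortholog/paralog definition given in the abstract, the following minimal Python sketch classifies two genes by the event at their most recent common ancestor in a labeled gene tree. It is not the CHAP 2 method: the `Node` class, the `relationship` function, and the toy tree are hypothetical names introduced here, the event labels are assumed to be known in advance, and gene conversion (the distinction between orthology by position and by content) is not modeled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Gene-tree node (hypothetical structure, for illustration only)."""
    name: str
    # "speciation" or "duplication" for internal nodes; None for leaves.
    event: Optional[str] = None
    children: list["Node"] = field(default_factory=list)


def path_to(root: Node, leaf_name: str) -> Optional[list[Node]]:
    """Return the nodes from root down to the named leaf, or None if absent."""
    if not root.children:
        return [root] if root.name == leaf_name else None
    for child in root.children:
        sub = path_to(child, leaf_name)
        if sub is not None:
            return [root] + sub
    return None


def relationship(root: Node, gene_a: str, gene_b: str) -> str:
    """Classify two distinct genes by the event at their last common ancestor."""
    if gene_a == gene_b:
        raise ValueError("need two distinct genes")
    path_a, path_b = path_to(root, gene_a), path_to(root, gene_b)
    if path_a is None or path_b is None:
        raise KeyError("gene not found in tree")
    lca = None
    # Walk both root-to-leaf paths in step; the last shared node is the LCA.
    for x, y in zip(path_a, path_b):
        if x is not y:
            break
        lca = x
    return "orthologs" if lca.event == "speciation" else "paralogs"


# Toy tree: an ancestral duplication produced copies A and B, and each copy
# was later split by the same speciation into a human and a chimp gene.
tree = Node("dup", "duplication", [
    Node("spec_A", "speciation", [Node("human_A"), Node("chimp_A")]),
    Node("spec_B", "speciation", [Node("human_B"), Node("chimp_B")]),
])

print(relationship(tree, "human_A", "chimp_A"))  # orthologs
print(relationship(tree, "human_A", "human_B"))  # paralogs
```

In this toy tree, `human_A` and `chimp_B` are also classified as paralogs, because their last common ancestor is the duplication node even though the genes come from different species.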