Algorithms for Aligning and Clustering Genomic Sequences that Contain Duplications

Open Access
- Author:
- Hou, Minmei
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- June 18, 2007
- Committee Members:
- Webb Colby Miller, Committee Chair/Co-Chair
Piotr Berman, Committee Member
Hongyuan Zha, Committee Member
Ross Cameron Hardison, Committee Member
- Keywords:
- algorithm
genomic sequence
duplication
alignment
homologous
orthologous
- Abstract:
- Genomes of advanced organisms contain numerous repeated sequences, including gene clusters, tandem repeats, interspersed repeats, and segmental duplications. Among these, gene clusters are the class most frequently of functional importance. Algorithmic processing of regions containing these clusters remains challenging in practice, and the lack of clean solutions has been a major obstacle to sequence analysis in bioinformatics. This thesis presents new methodologies for solving two sets of problems in processing the sequences of gene-cluster regions, particularly methods to properly align gene-cluster regions of multiple species. Similar sequences sharing the same evolutionary origin are homologous. Homologous sequences that differ by speciation are orthologous. One set of problems deals with aligning all and only orthologous sequences in a gene-cluster region, between two or more species. A two-way orthologous-sequence identification tool is developed to produce orthologous pairwise alignments. The results are evaluated based on the phylogenetic inference of gene sequences. High specificity is achieved without much loss of sensitivity. Two approaches are designed to create orthologous multi-species alignments. One uses a chosen species to guide the alignment process, and it has been successfully applied genome-wide. The other solves a more difficult formulation of the problem, where all species are treated equally. Its computational difficulty is discussed, and some initial experiments are reported. Another set of methods deals with the construction of all homologous groups within a single genome. Each homologous group is expected to contain precisely the genomic intervals that are homologous to each other. A mixture of algorithmic and heuristic procedures is designed to maintain a balance between the completeness and purity of each group. We verify the accuracy and efficiency of these methodologies.