Computational Approaches To Predict Phenotype Differences In Populations From High-throughput Sequencing Data

Open Access
Bedoya Reina, Oscar Camilo
Graduate Program:
Integrative Biosciences
Doctor of Philosophy
Document Type:
Date of Defense:
March 05, 2014
Committee Members:
  • Webb Colby Miller, Dissertation Advisor
  • Ross Cameron Hardison, Committee Member
  • George H Perry, Committee Member
  • Kamesh Madduri, Committee Member
Keywords:
  • Population genomics
  • bioinformatics
  • Galaxy tools
  • conservation genomics
  • biomedical informatics
High-throughput sequencing technologies are changing the world. They are revolutionizing the life sciences and will be the foundation of a promising century of innovations. In recent years, the development of new sequencing technologies has dramatically decreased the cost of genome sequencing. Less than twenty years ago, sequencing the human genome cost 3 billion dollars and took about a decade to complete. Today, high-quality 30X full-genome coverage can be obtained in just one day for US$5,000, while sequencing just the ~21,000 human genes to the same depth costs only about US$500. The latter is sufficient for detecting most of the rare variants, along with other sources of genetic variability such as indels, copy-number variations, and inversions that are characteristic of complex diseases. The enormous quantity of information produced by these sequencing technologies provides the raw material for understanding the diseases, diversity, and evolution of species, and likely the power to predict them. Nevertheless, the processing and analysis of these data remain challenging for all but the largest and most experienced research groups, and even the best results seem to lack predictive power for large populations. One of the challenges in this analysis is identifying the polymorphisms within species from the vast amount of raw data produced by the sequencing instruments. Fortunately, this and other data-processing steps are becoming more affordable as technology evolves. Once the gigabases of sequence data have been filtered to a few million intra-species DNA polymorphisms and a few thousand amino-acid variants, the computational requirements for their exploration are often relatively modest. As part of this research, we have developed a set of tools that run on the Galaxy web server to perform such analyses.
These tools facilitate understanding the population structure of focus species and developing testable hypotheses about the phenotypic consequences of genetic polymorphisms. In particular, new tools are introduced to 1) assess the over-representation of genes of interest in several phenotype categories, 2) calculate the impact of genes of interest on a given phenotype, and 3) predict gene networks based on different attributes. In addition to avoiding the need for research groups to download and install all relevant software, a major benefit of using Galaxy for these, or indeed other, analyses is that reproducibility of published results is often enhanced. Moreover, these tools can be applied to polymorphisms identified by technologies other than sequencing, such as SNP genotyping microarrays. We have applied these tools to the analysis of the genomes of several populations. We focused our analysis on genomes of endangered species, namely Polar bear, Bighorn sheep, Aye-aye, and Tasmanian devil. In Polar bears, hypothetical molecular adaptations to the extremes of life in the high Arctic were found. Examples include mutations in the gene BTN1A1, associated with the high fat content of milk, and others that may allow a fine-tuning of nitric oxide production to control the trade-offs between oxygen consumption, heat production, and energy production. Similarly, we identified hypothetical adaptations in Desert Bighorn sheep that may allow them to tolerate water scarcity. These included mutations in genes involved in water homeostasis and renal retention (i.e., LASS6, XPNPEP1, and XPNPEP3). In Aye-aye, we found mutations in a population that might be involved in muscular adaptations to exploit the niche variability produced by landscape variation. In Tasmanian devil, the use of these tools suggested four different hypothetical mechanisms that may produce or worsen the devastating Devil facial tumor disease (DFTD).
To validate our methods, we also applied our tools to model species or populations of commercial interest. For example, we detected a selective sweep in the genome of commercial broilers that overlaps the major QTL explaining differences in growth between broilers and layers. Other interesting findings include a 3' UTR mutation in the interleukin 6 receptor (IL6R), in proximity to a conserved miRNA binding site, in three patients with LGL leukemia. These changes may increase the translation of the IL6R and result in aberrant expression of the STAT3 gene. Also, we found that 10% of the Landrace pig genome is ultimately derived from Asian boars, and this portion is enriched in immune-related genes. These findings are exciting, yet we are only observing the tip of the iceberg.
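The abstract mentions tools that assess the over-representation of genes of interest in phenotype categories. As a rough illustration of what such an assessment can look like, the sketch below computes a one-sided hypergeometric tail probability, which is one common way to test over-representation. It is not taken from the dissertation or from the Galaxy tools; the function name `hypergeom_enrichment_p`, its parameters, and the toy numbers are all hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, category_genes, selected_genes, overlap):
    """Probability of seeing at least `overlap` category genes by chance.

    Hypothetical helper (not the dissertation's implementation):
      total_genes    -- number of genes in the background set
      category_genes -- background genes annotated to the phenotype category
      selected_genes -- number of genes of interest drawn from the background
      overlap        -- genes of interest that fall in the category
    """
    # The overlap cannot exceed either the category size or the selection size.
    upper = min(category_genes, selected_genes)
    # Number of ways to choose the genes of interest from the background.
    denominator = comb(total_genes, selected_genes)
    # Sum the hypergeometric probabilities for overlap, overlap + 1, ..., upper.
    tail = sum(
        comb(category_genes, k)
        * comb(total_genes - category_genes, selected_genes - k)
        for k in range(overlap, upper + 1)
    )
    return tail / denominator


if __name__ == "__main__":
    # Toy numbers chosen only for illustration, not results from the study.
    p_value = hypergeom_enrichment_p(
        total_genes=1000, category_genes=50, selected_genes=40, overlap=8
    )
    print(f"one-sided enrichment p-value: {p_value:.3g}")
```

A small p-value would suggest that the category contains more genes of interest than expected by chance. When many categories are tested, a multiple-testing correction would normally be applied as well.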