PheWAS AND BEYOND: APPROACHES TO ADDRESS CHALLENGES FOR IDENTIFYING ROBUST ASSOCIATIONS USING CLINICAL DATA

Open Access
- Author:
- Verma, Anurag
- Graduate Program:
- Integrative Biosciences
- Degree:
- Doctor of Education
- Document Type:
- Dissertation
- Date of Defense:
- December 14, 2017
- Committee Members:
- Marylyn Ritchie, Dissertation Advisor/Co-Advisor
Moriah Louise Szpara, Committee Chair/Co-Chair
Ross Cameron Hardison, Committee Member
N/A, Committee Member
Yu Zhang, Outside Member
- Keywords:
- PheWAS
EHR
GWAS
Genomics
Biobank
- Abstract:
- In an emerging approach called precision medicine, the primary focus is to utilize an individual’s clinical data along with genetic, environmental, and lifestyle information to tailor clinical care. The initial steps toward precision medicine involve enrolling individuals into studies to link their genotype and phenotype data. Patient data can be used to discover clinically relevant genetic associations. The most common methodology to identify genetic associations is called a genome-wide association study (GWAS), in which tests for association are performed between single-nucleotide polymorphisms (SNPs) across the genome (usually over 500,000 SNPs) and a single disease outcome or trait. There is now a growing body of evidence demonstrating the success of some of these genetic associations. However, the impact of GWAS has been limited by its focus on a single phenotype: the effect of a given SNP across multiple phenotypes cannot be explored. An alternative approach called a phenome-wide association study (PheWAS) has been successful in simultaneously scanning genome-wide significant variants over hundreds of phenotypes. Using this approach, we can identify genetic variants associated with a wide range of phenotypes, also referred to as cross-phenotype associations. Such findings have the potential to identify pleiotropy (where one variant affects two or more independent phenotypes through the same underlying biological mechanism) or an underlying genetic architecture of comorbidities. The majority of PheWAS have used data from de-identified electronic health records (EHRs) linked to genotype data, and a few have been performed in large-scale epidemiologic studies and clinical trials. While existing studies have demonstrated the development of PheWAS methodology, the focus has remained on a small set of genome-wide significant SNPs or a genomic region of interest.
With advances in genotyping and sequencing technologies, as well as in phenotype data collection, it is imperative to apply PheWAS on a genome-wide scale. Doing so will allow us to investigate genetic associations across all SNPs and phenotypes in a study population. However, expanding the current PheWAS approach to investigate associations across the genome poses many challenges. In this dissertation, I aim to address the following specific challenges regarding large-scale PheWAS analysis. 1) Evaluating heterogeneous groups simultaneously makes precision medicine impossible; stratifying samples based on context such as age, sex, or drugs can help to improve precision in identifying true genetic associations (Chapter 2). 2) The number of association tests raises the statistical threshold of significance to a level at which finding significant associations is difficult. Also, the impact of factors such as sample size, case-control ratio, and minor allele frequency on the statistical power to identify associations has not been explored (Chapter 3). 3) Integrating results from independent PheWAS using different types of data sets (e.g., clinical lab measures) within the EHRs has not been evaluated. We employed new strategies to integrate such results to add robustness to PheWAS associations (Chapter 4). 4) Large-scale PheWAS usually result in large quantities of significant associations. The majority of associations from common variants lie in non-coding regions of the genome. Hence, the interpretation of results is a challenging task. We developed a high-throughput strategy to prioritize associations based on their biological relevance (Chapter 5). 5) Few methods elucidate cross-phenotype connections with shared underlying genetic etiology. We evaluated network-based approaches to identify dynamic interconnections between diseases (Chapter 6).
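
The multiple-testing burden named in challenge 2 can be made concrete with a rough calculation. The sketch below assumes a plain Bonferroni correction, which the abstract does not specify and which may not be the correction the dissertation uses. The function name `bonferroni_threshold` and the count of 1,000 phenotypes are hypothetical and chosen only for illustration; the 500,000-SNP figure comes from the abstract.

```python
def bonferroni_threshold(n_snps: int, n_phenotypes: int, alpha: float = 0.05) -> float:
    """Per-test p-value cutoff under a Bonferroni correction (an assumed method)."""
    n_tests = n_snps * n_phenotypes  # one association test per SNP-phenotype pair
    if n_tests <= 0:
        raise ValueError("need at least one association test")
    return alpha / n_tests


# Single-phenotype GWAS at the 500,000-SNP scale mentioned in the abstract.
gwas_cutoff = bonferroni_threshold(500_000, 1)

# Genome-wide PheWAS over a hypothetical 1,000 phenotypes (illustrative count).
phewas_cutoff = bonferroni_threshold(500_000, 1_000)

print(f"GWAS per-test cutoff:   {gwas_cutoff:.1e}")
print(f"PheWAS per-test cutoff: {phewas_cutoff:.1e}")
```

Under these assumptions the per-test cutoff tightens by a factor equal to the number of phenotypes. Bonferroni treats all tests as independent, so it is conservative when SNPs or phenotypes are correlated.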