Methods for assessing replicability in GWAS and analyzing large-scale Electronic Health Record datasets.

Open Access
- Author:
- McGuire, Daniel
- Graduate Program:
- Biostatistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- May 20, 2021
- Committee Members:
- Arthur Berg, Program Head/Chair
Dajiang Liu, Chair & Dissertation Advisor
Qunhua Li, Outside Unit Member
Arthur Berg, Major Field Member
John Elfar, Outside Field Member
- Keywords:
- replicability
GWAMA
GWAS
EHR
mixed-model
mixture-model
spatial random effects
pollution
PheWAS
heritability
family study
meta-analysis
variance components
two-level mixture model
random effects
- Abstract:
- This thesis is a collection of two articles presented in two Chapters. While the topics of the Chapters are not strongly related in theme, they both present novel approaches for tackling large data problems in the field of Biostatistics. In the first Chapter, we develop a model-based approach for assessing replicability of associations within a genome-wide association meta-analysis (GWAMA). GWAMA is an effective approach to enlarge sample sizes and empower the discovery of novel associations between genotype and phenotype. Independent replication has been used as a gold standard for validating genetic associations. However, as current GWAMA often seeks to aggregate all available datasets, it becomes impossible to find a large enough independent dataset to replicate new discoveries. Here we introduce a method, MAMBA (Meta-Analysis Model-based Assessment of replicability), for assessing the "posterior-probability-of-replicability" for identified associations by leveraging the strength and consistency of association signals between contributing studies. We demonstrate using simulations that MAMBA is more powerful and robust than existing methods, and produces more accurate genetic effect estimates. We apply MAMBA to a large-scale meta-analysis of addiction phenotypes with 1.2 million individuals. In addition to accurately identifying replicable common variant associations, MAMBA also pinpoints novel replicable rare variant associations from imputation-based GWAMA and hence greatly expands the set of analyzable variants.
In the second Chapter, "Dissecting Genetic Heritability, Environmental Risk Components and Causal Effects of Air Pollution for Complex Human Diseases Using a Health Insurance Database of 50 Million Individuals", we use a large dataset of electronic health records (EHR) to decompose the contributions of genetic heritability, family environment, and community-level environment to 1,083 phenotypes, using the familial relationships and approximate geographic information embedded in the EHR. We also assess causal effects of pollution on those phenotypes using publicly available datasets of environmental exposures summarized at the county and metropolitan statistical area levels. To motivate this idea, we note that most complex diseases are jointly influenced by both genetics and environment. The advent of large national-level EHR datasets has offered new opportunities for disentangling the role of genes and environment through the deep phenotype information and approximate pedigree structures that EHR datasets provide. In this study, we made innovative use of the approximate geographical locations of patients to jointly model genetics and spatially correlated sources of environmental risk. Environmental risk factors (such as air pollution) are often shared across families living in similar locations but are typically insufficiently considered in traditional family-based variance components models, leading to biased estimates of genetic heritability. In this study, we extracted EHR from 257,620 quad-families (parents with two children) and analyzed 1,083 disease outcome measurements. We used approximate geographical locations embedded in the EHR to estimate community-level environment effects and quantified genetic heritability and environmental risk factors on disease phenotype variation. We found that jointly modeling both genetic and community-level environment effects improves both heritability and environmental variance component estimates.
We further augmented the EHR with publicly available environmental data, including levels of particulate matter (PM2.5), nitrogen dioxide (NO2), climate, and sociodemographic data. We used wind speed and direction as instrumental variables in regression models to assess the causal effects of air pollution on 1,083 diseases. While individual air pollutant levels often stem from common sources such as traffic pollution, we found PM2.5 and NO2 to have unique disease etiologies and affect biologically distinct disease categories. In total, we found that PM2.5 or NO2 have statistically significant putative causal effects on 135 diseases, including respiratory, musculoskeletal, digestive, metabolic, and sleep disorders. Many of these associations have been previously cited with plausible biological mechanisms, although some were reported only in small studies from heavily polluted areas. These analyses showcase several novel strategies for jointly modeling genetic and environmental effects on disease risk using large EHR datasets, and will benefit upcoming biobank studies in the era of precision medicine.
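The instrumental-variable idea mentioned above can be illustrated with a minimal two-stage least squares (2SLS) sketch. This is not the dissertation's actual model: the function name `two_stage_least_squares`, the simulated variables (`wind`, `confounder`, `pollution`, `outcome`), and all effect sizes are assumptions chosen purely for illustration. The sketch shows how an instrument that shifts exposure, but is unrelated to an unobserved confounder, can recover a causal slope that ordinary regression gets wrong.

```python
import numpy as np


def two_stage_least_squares(y, x, z):
    """Return the 2SLS slope of outcome y on exposure x, using instrument z."""
    n = len(y)
    # Stage 1: regress the exposure on the instrument (with an intercept).
    Z = np.column_stack([np.ones(n), z])
    gamma, *_ = np.linalg.lstsq(Z, x, rcond=None)
    x_hat = Z @ gamma
    # Stage 2: regress the outcome on the instrument-predicted exposure.
    X_hat = np.column_stack([np.ones(n), x_hat])
    beta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
    return beta[1]


# Simulated data (all values are hypothetical, not taken from the study).
rng = np.random.default_rng(0)
n = 20000
wind = rng.normal(size=n)         # instrument: affects exposure only
confounder = rng.normal(size=n)   # unobserved: affects exposure and outcome
pollution = 0.8 * wind + confounder + rng.normal(size=n)
outcome = 0.5 * pollution - 1.0 * confounder + rng.normal(size=n)

iv_estimate = two_stage_least_squares(outcome, pollution, wind)
ols_estimate = np.polyfit(pollution, outcome, 1)[0]  # naive regression slope

print(f"2SLS estimate: {iv_estimate:.3f}")
print(f"OLS estimate:  {ols_estimate:.3f}")
```

In this simulation the true effect of `pollution` on `outcome` is 0.5. The 2SLS estimate should land close to that value, while the naive slope is pulled away from it by the confounder. A real analysis would also need covariate adjustment, valid standard errors, and checks of instrument strength, none of which are shown here.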