Leveraging Big Genetic Data for Prediction in Multi-ethnic Studies: Applications to Tobacco Use Phenotypes

Open Access
- Author:
- Yang, Lina
- Graduate Program:
- Biostatistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- September 24, 2021
- Committee Members:
- Dajiang Liu, Co-Chair & Dissertation Advisor
Vernon Chinchilli, Co-Chair of Committee
David Mauger, Major Field Member
Bibo Jiang, Outside Field Member
Ian Paul, Outside Unit Member
Arthur Berg, Professor in Charge/Director of Graduate Studies
- Keywords:
- GWAS
meta-analysis
linkage disequilibrium
polygenic risk score
multi-ethnic population
tobacco usage
- Abstract:
- Large-scale genetic datasets have revolutionized human genetic research. As the cost of sequencing has fallen dramatically, biobank-scale datasets with millions of individuals and hundreds of millions of genetic variants have emerged. At this scale, querying and retrieving information from sequence datasets has become a central problem that precedes genetic analysis. We developed an R package, seqminer2, for efficiently querying and retrieving genetic variants in biobank-scale datasets. It implements a variant-based index and speeds up queries of sequence datasets by several orders of magnitude compared with other state-of-the-art tools. It also requires much less memory, making it feasible to read genetic data directly into R. It supports popular file formats for statistical genetic analysis, including VCF/BCF, BGEN, and PLINK. The improved efficiency and comprehensive file-format support have greatly facilitated our method development for risk prediction in multi-ethnic populations and will facilitate others' research in genetics and genomics as well.
With the help of seqminer2, we developed a novel meta-analysis approach to predict polygenic risk scores (PRS) in multi-ethnic samples. Constructing PRS for the diverse US and worldwide populations is currently challenging because most of the available training data come from European populations. When European samples are used as training data to predict PRS in non-European samples, prediction accuracy can be low because linkage disequilibrium (LD) patterns and genetic effects differ across ethnic populations. An alternative is to train the model on samples from the same population as the target population. However, for non-European target populations the available sample size can be much smaller, which results in worse prediction performance.
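The idea of a variant-based index mentioned above can be pictured with a minimal Python sketch. It is an illustration only and does not reproduce seqminer2's implementation or API: the class name `VariantIndex`, its methods, and the toy byte offsets are all hypothetical, and the records are assumed to be sorted by position within each chromosome.

```python
import bisect
from collections import defaultdict


class VariantIndex:
    """Toy position index (hypothetical, not the seqminer2 API).

    Maps each chromosome to a sorted list of variant positions and a
    parallel list of record offsets, so a range query needs only two
    binary searches instead of a scan over every record.
    """

    def __init__(self):
        self._positions = defaultdict(list)
        self._offsets = defaultdict(list)

    def build(self, records):
        # records: iterable of (chrom, pos, offset), assumed to be sorted
        # by position within each chromosome.
        for chrom, pos, offset in records:
            self._positions[chrom].append(pos)
            self._offsets[chrom].append(offset)

    def query(self, chrom, start, end):
        # Return offsets of all variants with start <= pos <= end.
        positions = self._positions.get(chrom)
        if not positions:
            return []
        lo = bisect.bisect_left(positions, start)
        hi = bisect.bisect_right(positions, end)
        return self._offsets[chrom][lo:hi]


# Illustrative records: (chromosome, position, made-up byte offset).
records = [("1", 100, 0), ("1", 250, 64), ("1", 900, 128), ("2", 50, 192)]
index = VariantIndex()
index.build(records)
hits = index.query("1", 200, 900)
```

With offsets in hand, a reader can seek straight to the matching records, which is the general reason an index avoids loading a whole file into memory.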
Our method integrates multi-ethnic studies as training data while accommodating heterogeneity in genetic effects and linkage disequilibrium patterns. It decomposes genetic effect heterogeneity into a fixed effect and the top principal components (PCs) of genetic variation, and it integrates the heterogeneous genetic effect estimates across ancestries to improve PRS prediction for individuals of diverse ancestries. In simulations covering a range of heterogeneity scenarios across populations, our method improved prediction accuracy for individuals of different ancestries. Applied to the GWAS and Sequencing Consortium of Alcohol and Nicotine use (GSCAN) dataset, it improved prediction of tobacco use phenotypes. Our approach facilitates stratifying the risk of smoking behaviors across ancestries and can contribute to quantifying nicotine dependence risk in diverse populations.
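The decomposition described above can be illustrated with a small Python sketch. The abstract does not give the estimator, so this is only one plausible reading under stated assumptions: a single variant, a single ancestry PC per study, and inverse-variance weighted least squares. The function names and the numbers are hypothetical and are not taken from the dissertation.

```python
def fit_effect_vs_pc(betas, ses, pcs):
    """Weighted least squares of per-study effects on one ancestry PC.

    Assumed model: beta_k = intercept + slope * pc_k, with weights
    1 / se_k**2. The intercept plays the role of the fixed effect and
    the slope captures ancestry-related heterogeneity.
    """
    weights = [1.0 / (se * se) for se in ses]
    total = sum(weights)
    pc_mean = sum(w * x for w, x in zip(weights, pcs)) / total
    beta_mean = sum(w * b for w, b in zip(weights, betas)) / total
    sxx = sum(w * (x - pc_mean) ** 2 for w, x in zip(weights, pcs))
    sxy = sum(
        w * (x - pc_mean) * (b - beta_mean)
        for w, x, b in zip(weights, pcs, betas)
    )
    slope = sxy / sxx
    intercept = beta_mean - slope * pc_mean
    return intercept, slope


def predict_effect(intercept, slope, target_pc):
    # Effect size predicted for a target population with the given PC value.
    return intercept + slope * target_pc


def polygenic_score(dosages, effects):
    # PRS as the sum of allele dosages weighted by per-variant effects.
    return sum(d * e for d, e in zip(dosages, effects))


# Made-up inputs: three studies with different ancestry PC values.
study_betas = [0.10, 0.20, 0.30]
study_ses = [0.05, 0.05, 0.05]
study_pcs = [0.0, 1.0, 2.0]

fixed_effect, pc_slope = fit_effect_vs_pc(study_betas, study_ses, study_pcs)
target_effect = predict_effect(fixed_effect, pc_slope, target_pc=1.5)
score = polygenic_score([2], [target_effect])
```

In this toy setup the predicted effect varies with the target population's position on the PC axis, so individuals of different ancestries receive different weights for the same variant.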