THE BAYESIAN LASSO, BAYESIAN SCAD AND BAYESIAN GROUP LASSO WITH APPLICATIONS TO GENOME-WIDE ASSOCIATION STUDIES

Open Access
Author:
Li, Jiahan
Graduate Program:
Statistics
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
April 13, 2011
Committee Members:
  • Rongling Wu, Dissertation Advisor
  • Rongling Wu, Committee Chair
  • Runze Li, Committee Chair
  • Bruce G Lindsay, Committee Member
  • Tao Yao, Committee Member
Keywords:
  • lasso
  • variable selection
  • Bayesian approach
  • high-dimensional data
Abstract:
Recently, genome-wide association studies (GWAS) have successfully identified genes that may affect complex traits or diseases. However, the standard approach of testing each single-nucleotide polymorphism (SNP) separately is too simple to give a comprehensive picture of the genetic architecture of phenotypes. A simultaneous analysis of a large number of SNPs, although statistically challenging, especially with a small number of samples, is crucial for genetic modeling. This is a variable selection problem for high-dimensional data, with SNPs as the predictors and phenotypes as the responses in our statistical model. In genome-wide association studies, phenotypic values are either collected at a single time point for each subject, or collected repeatedly over a period at subject-specific time points. When the response variable is univariate, we present two-stage procedures designed for problems where the number of predictors greatly exceeds the number of observations. At the first stage, we preprocess the data so that variable selection can be carried out accurately and efficiently. At the second stage, variable selection techniques based on penalized linear regression are applied to the preprocessed data. When the longitudinal phenotype of interest is measured at irregularly spaced time points, we develop a Bayesian regularized estimation procedure for variable selection in nonparametric varying-coefficient models. Our method simultaneously selects important predictors and estimates their time-varying effects. We approximate time-varying effects by Legendre polynomials, and present a Bayesian hierarchical model with group lasso penalties that encourage sparse solutions at the group level.
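The Legendre approximation of a time-varying effect mentioned above can be illustrated with a minimal sketch. It is not taken from the dissertation: the function name `legendre_design`, the default degree, the 0/1/2 genotype coding, and the rescaling of observation times to [-1, 1] are all assumptions made for illustration. Each SNP contributes one block of columns (one per polynomial order), and a group-level penalty would shrink or drop such a block as a unit.

```python
import numpy as np
from numpy.polynomial import legendre


def legendre_design(times, snp, degree=3):
    """Build the block of design columns for one SNP's time-varying effect.

    times : observation times, one per measurement (may be irregularly spaced)
    snp   : genotype value attached to each measurement (assumed coded 0/1/2)
    degree: highest Legendre order used in the approximation (hypothetical default)
    """
    t = np.asarray(times, dtype=float)
    g = np.asarray(snp, dtype=float)
    lo, hi = t.min(), t.max()
    if hi == lo:
        raise ValueError("need at least two distinct time points")
    # Legendre polynomials are orthogonal on [-1, 1], so rescale time first.
    u = 2.0 * (t - lo) / (hi - lo) - 1.0
    # Column k holds P_k(u); the result has shape (n_obs, degree + 1).
    basis = legendre.legvander(u, degree)
    # Multiplying by the genotype gives the columns whose coefficients
    # together describe this SNP's effect as a function of time.
    return basis * g[:, None]
```

The coefficients of the returned columns form one group; setting the whole group to zero removes the SNP from the model at every time point.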
In both scenarios, our models obviate the choice of the tuning parameters by imposing diffuse hyperpriors on them and estimating them along with other parameters, and provide not only point estimates but also interval estimates of all parameters. Markov chain Monte Carlo (MCMC) algorithms are developed to simulate the parameters from their posterior distributions. The proposed methods are illustrated with numerical examples and a real data set from the Framingham Heart Study.
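To make the idea of estimating the tuning parameter within MCMC concrete, here is a minimal sketch of a Gibbs sampler for the standard Bayesian lasso in the Park and Casella (2008) formulation. It is not the dissertation's own algorithm: the function name, the Gamma(r, delta) hyperprior values, the iteration counts, and the small numerical guards are assumptions chosen for illustration.

```python
import numpy as np


def bayesian_lasso_gibbs(X, y, n_iter=2000, burn_in=500, r=1.0, delta=0.1, seed=0):
    """Gibbs sampler for the Bayesian lasso with a Gamma(r, delta) prior on lambda^2.

    Returns the post-burn-in draws of the regression coefficients,
    shape (n_iter - burn_in, p). r and delta are hypothetical defaults.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    y = y - y.mean()  # centering stands in for a flat prior on the intercept
    XtX, Xty = X.T @ X, X.T @ y
    sigma2, lam2 = 1.0, 1.0
    inv_tau2 = np.ones(p)
    draws = np.empty((n_iter - burn_in, p))
    for it in range(n_iter):
        # beta | rest ~ N(A^{-1} X'y, sigma2 * A^{-1}), with A = X'X + diag(1/tau^2)
        A = XtX + np.diag(inv_tau2)
        L = np.linalg.cholesky(A)
        mean = np.linalg.solve(A, Xty)
        beta = mean + np.sqrt(sigma2) * np.linalg.solve(L.T, rng.standard_normal(p))
        # sigma2 | rest ~ Inverse-Gamma((n - 1 + p) / 2, (RSS + beta' D^{-1} beta) / 2)
        resid = y - X @ beta
        scale = (resid @ resid + beta @ (inv_tau2 * beta)) / 2.0
        sigma2 = 1.0 / rng.gamma((n - 1 + p) / 2.0, 1.0 / scale)
        # 1/tau_j^2 | rest ~ Inverse-Gaussian(sqrt(lam2 * sigma2 / beta_j^2), lam2)
        mu = np.sqrt(lam2 * sigma2 / np.maximum(beta ** 2, 1e-12))
        inv_tau2 = np.maximum(rng.wald(mu, lam2), 1e-10)  # guard against underflow
        # lambda^2 | rest ~ Gamma(p + r, rate = sum(tau_j^2) / 2 + delta)
        lam2 = rng.gamma(p + r, 1.0 / (np.sum(1.0 / inv_tau2) / 2.0 + delta))
        if it >= burn_in:
            draws[it - burn_in] = beta
    return draws
```

Given the returned array `draws`, `draws.mean(axis=0)` gives point estimates and `np.percentile(draws, [2.5, 97.5], axis=0)` gives 95% credible intervals, so no cross-validation over the penalty is needed.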