Feature Screening For Ultra-high Dimensional Longitudinal Data

Open Access
Chu, Wanghuan
Graduate Program:
Doctor of Philosophy
Document Type:
Date of Defense:
May 10, 2016
Committee Members:
  • Runze Li, Dissertation Advisor
  • Runze Li, Committee Chair
  • Matthew Reimherr, Committee Member
  • Lingzhou Xue, Committee Member
  • Donna Coffman, Outside Member
Keywords:
  • Feature screening
  • ultra-high dimensional data
  • longitudinal genetic study
Abstract:
High and ultrahigh dimensional data analysis is receiving increasing attention in many scientific fields. Various variable selection methods have been proposed for high dimensional data, where the feature dimension p increases with the sample size n at a polynomial rate. In the ultrahigh dimensional setting, p is allowed to grow with n at an exponential rate. Instead of jointly selecting active covariates, a more effective approach is to incorporate a screening rule that filters out unimportant covariates through marginal regression techniques. This thesis is concerned with feature screening methods for ultrahigh dimensional longitudinal data. Such data occur frequently in longitudinal genetic studies, where phenotypes and some covariates are measured repeatedly over a certain time period. Along with the genetic measurements, longitudinal genetic studies provide valuable resources for exploring the primary genetic and environmental factors that influence complex phenotypes over time. The statistical methods proposed in this work allow us not only to identify genetic determinants of common complex diseases, but also to understand at which stage of human life these genetic determinants become important.

In Chapter 3, we propose a new feature screening procedure for ultrahigh dimensional time-varying coefficient models. We present an effective screening rule based on marginal B-spline regression that incorporates time-varying variance and within-subject correlations. We show that, under certain conditions, this procedure possesses the sure screening property and the false selection rates can be controlled. Through Monte Carlo simulation studies, we demonstrate how within-subject variability can be harnessed to increase screening accuracy. Furthermore, we illustrate the proposed screening rule via an empirical analysis of the Childhood Asthma Management Program (CAMP) data. Our empirical analysis clearly shows that the proposed approach is especially useful for such studies, as children change quite extensively over a four-year period with highly nonlinear patterns.

In Chapter 4, we study screening rules for ultrahigh dimensional covariates that are potentially associated with random effects. Mixed effects models are popular for taking into account the dependence structure of longitudinal data, as subject-specific random effects can explicitly account for within-subject correlation. We propose a two-step screening procedure for generalized varying-coefficient mixed effects models. The two-step procedure screens fixed effects first and then random effects. We conduct simulation studies to assess the finite sample performance of this two-step screening approach for a continuous response with linear regression, a binary response with logistic regression, a count response with Poisson regression, and an ordinal response with a proportional-odds cumulative logit model. In a real data application, we apply this procedure to data from the Framingham Heart Study (FHS) and explore the genetic and environmental effects on body mass index (BMI), obesity and blood pressure in three separate analyses. Our results confirm some findings from previous studies, and also identify genetic markers with highly significant effects and interesting time-dependent patterns that are worth further exploration.
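
The marginal B-spline screening idea described for Chapter 3 can be illustrated with a minimal sketch. It is a simplified stand-in and not the procedure developed in the thesis: it fits each covariate by ordinary least squares under working independence, so it ignores the time-varying variance and within-subject correlation that the proposed rule incorporates. The function names (`bspline_basis`, `marginal_utilities`, `demo_top_features`), the utility definition (relative drop in residual sum of squares), and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def bspline_basis(t, n_basis=6, degree=3):
    """Clamped B-spline basis on [0, 1], built with the Cox-de Boor recursion."""
    t = np.asarray(t, dtype=float)
    n_interior = n_basis - degree - 1
    interior = np.linspace(0.0, 1.0, n_interior + 2)[1:-1]
    knots = np.concatenate([np.zeros(degree + 1), interior, np.ones(degree + 1)])
    # Degree-0 basis: indicators of the half-open knot intervals.
    B = np.zeros((t.size, len(knots) - 1))
    for j in range(len(knots) - 1):
        B[:, j] = (knots[j] <= t) & (t < knots[j + 1])
    # Put the right endpoint t == 1 into the last non-empty interval.
    B[t == 1.0, n_basis - 1] = 1.0
    for d in range(1, degree + 1):
        B_new = np.zeros((t.size, len(knots) - 1 - d))
        for j in range(len(knots) - 1 - d):
            left_den = knots[j + d] - knots[j]
            right_den = knots[j + d + 1] - knots[j + 1]
            if left_den > 0:
                B_new[:, j] += (t - knots[j]) / left_den * B[:, j]
            if right_den > 0:
                B_new[:, j] += (knots[j + d + 1] - t) / right_den * B[:, j + 1]
        B = B_new
    return B


def _rss(design, y):
    """Residual sum of squares from an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)


def marginal_utilities(y, X, t, n_basis=6):
    """Screening utility per covariate (assumed definition, for illustration).

    y, t have shape (N,) and X has shape (N, p), pooled over all
    subject-visit pairs. For covariate j, the marginal model
    y = beta_0(t) + beta_j(t) * x_j is fitted with B-spline coefficients,
    and the utility is the relative drop in RSS against the
    intercept-only model y = beta_0(t).
    """
    B = bspline_basis(t, n_basis)
    rss0 = _rss(B, y)
    utilities = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        design = np.hstack([B, B * X[:, [j]]])
        utilities[j] = (rss0 - _rss(design, y)) / rss0
    return utilities


def demo_top_features(seed=0, n_subjects=100, n_visits=5, p=50, k=2):
    """Simulate toy longitudinal data and return the k top-ranked covariates."""
    rng = np.random.default_rng(seed)
    N = n_subjects * n_visits
    t = rng.uniform(0.0, 1.0, size=N)
    X = rng.standard_normal((N, p))
    # A subject-level random intercept induces within-subject correlation.
    subject_effect = np.repeat(0.5 * rng.standard_normal(n_subjects), n_visits)
    y = (2.0 * np.sin(2 * np.pi * t) * X[:, 0]   # time-varying effect of x_0
         + (1.5 + t) * X[:, 1]                   # time-varying effect of x_1
         + subject_effect
         + rng.standard_normal(N))
    utilities = marginal_utilities(y, X, t)
    return [int(i) for i in np.argsort(utilities)[::-1][:k]]


if __name__ == "__main__":
    print(demo_top_features())
```

In this toy setup only the first two covariates have nonzero coefficient functions, so a reasonable screening rule should rank them ahead of the noise covariates. Retaining the top-ranked covariates and then applying a joint selection method is the usual way such a screening step is used.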