Statistical Methods for Different Ultrahigh Dimensional Models

Open Access
Author:
Liu, Jingyuan
Graduate Program:
Statistics
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
December 17, 2012
Committee Members:
  • Runze Li, Dissertation Advisor
  • Rongling Wu, Committee Chair
  • Vernon Michael Chinchilli, Committee Member
  • Dennis Kon Jin Lin, Committee Member
  • Liwang Cui, Special Member
Keywords:
  • ultrahigh dimension
  • varying coefficient model
  • partially linear model
  • feature screening
Abstract:
This thesis studies feature screening and variable selection procedures for ultrahigh dimensional varying coefficient models and partially linear models, and the extension of these methods to longitudinal data. A new independence screening procedure is proposed for varying coefficient models, based on the conditional correlation between each predictor and the response given the covariate on which the coefficients depend (CCIS, for short). We establish the ranking consistency and sure screening property of CCIS and demonstrate them empirically through simulations. Furthermore, an iterative screening procedure (ICCIS) is developed to enhance the finite sample performance. In the Framingham Heart Study (FHS) example, we develop a new two-stage approach to select significant single-nucleotide polymorphisms (SNPs) for explaining body mass index (BMI), where the effects of the SNPs may depend on the baseline age of patients. First, CCIS is applied to reduce the ultrahigh dimensionality to a scale below the sample size; second, several penalized regression techniques are modified for varying coefficient models to further select important variables and to estimate the coefficient functions.
Moreover, CCIS for varying coefficient models can be extended to longitudinal data. Consider the time-varying coefficient model as an example, where multiple response values are observed for every subject. In the first stage, we apply CCIS to the pooled sample, treating all observations as independent even though those from the same subject are actually correlated; the within-subject correlation is thus ignored in the screening stage. Nevertheless, the simulation studies show that the ranking consistency and sure screening property are not lost by doing so. In the real data example, we use a modified two-stage approach to restudy the effect of SNPs on BMI using the FHS data, considering the dynamic pattern of age instead of baseline age to reflect the longitudinal structure. If the efficiency of the coefficient function estimators is of interest, a weighted least squares step, incorporating a covariance matrix estimation procedure, can be added after the variable selection stage.
For partially linear models, another independence screening procedure is developed in this thesis based on the partial residual method (PRSIS, for short). The partially linear model can be converted to a linear model with transformed response and predictors, so that traditional screening methods for linear models, such as sure independence screening (SIS; Fan and Lv, 2008), can be applied. The desired theoretical properties are demonstrated through simulation studies. A soybean data analysis is provided to illustrate the two-stage approach based on PRSIS, through which the important markers are selected for explaining the dry biomass of soybean.
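The conditional-correlation screening idea in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the thesis: the Gaussian-kernel Nadaraya-Watson smoother, the fixed bandwidth `h`, the use of the average squared conditional correlation as the screening utility, the function names `kernel_weights` and `ccis_utilities`, and the simulated data.

```python
import numpy as np

def kernel_weights(u, h):
    """Row-normalized Gaussian kernel weights (Nadaraya-Watson), shape (n, n)."""
    d = (u[:, None] - u[None, :]) / h
    w = np.exp(-0.5 * d ** 2)
    return w / w.sum(axis=1, keepdims=True)

def ccis_utilities(X, y, u, h):
    """Illustrative screening utility: for each predictor, the average over
    observed u_i of the squared kernel estimate of corr(x_j, y | u = u_i)."""
    W = kernel_weights(u, h)
    ex, ey = W @ X, W @ y                      # E[x_j | u], E[y | u]
    cov = W @ (X * y[:, None]) - ex * ey[:, None]
    var_x = W @ (X ** 2) - ex ** 2
    var_y = W @ (y ** 2) - ey ** 2
    rho2 = cov ** 2 / (var_x * var_y[:, None])  # squared conditional correlation
    return rho2.mean(axis=0)

# Hypothetical simulated example: only predictors 0 and 1 are active,
# and the coefficient of predictor 0 varies with u.
rng = np.random.default_rng(0)
n, p = 400, 300
u = rng.uniform(size=n)
X = rng.standard_normal((n, p))
y = 4 * np.cos(2 * np.pi * u) * X[:, 0] + 3 * X[:, 1] + rng.standard_normal(n)

util = ccis_utilities(X, y, u, h=0.1)
top = np.argsort(util)[::-1][:2]               # two highest-ranked predictors
print(sorted(top.tolist()))
```

In this simulated setting the coefficient of predictor 0 averages to zero over `u`, so a purely marginal correlation could understate its importance, whereas a utility computed conditionally on `u` does not rely on the marginal signal.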
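The partial residual step for partially linear models can likewise be sketched in a few lines. This is a rough illustration under stated assumptions and not the thesis's implementation: the model is taken to be y = x'beta + g(u) + error, the conditional means given `u` are estimated with a Gaussian-kernel smoother at a fixed bandwidth `h`, screening uses absolute marginal correlations in the spirit of SIS, and the retained model size `d = floor(n / log n)` is one common choice. The function names and the simulated data are hypothetical.

```python
import numpy as np

def kernel_weights(u, h):
    """Row-normalized Gaussian kernel weights (Nadaraya-Watson), shape (n, n)."""
    d = (u[:, None] - u[None, :]) / h
    w = np.exp(-0.5 * d ** 2)
    return w / w.sum(axis=1, keepdims=True)

def partial_residuals(X, y, u, h):
    """Subtract kernel estimates of E[x_j | u] and E[y | u], which turns the
    partially linear model into an approximately linear one."""
    W = kernel_weights(u, h)
    return X - W @ X, y - W @ y

def sis_screen(X, y, d):
    """Rank predictors by absolute marginal correlation with y; keep the top d."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    omega = np.abs(Xs.T @ ys) / len(y)
    return np.argsort(omega)[::-1][:d], omega

# Hypothetical simulated example: predictors 0, 1, 2 are active and
# g(u) = 5 sin(2 pi u) is the nonparametric component.
rng = np.random.default_rng(0)
n, p = 400, 1000
u = rng.uniform(size=n)
X = rng.standard_normal((n, p))
y = 3 * X[:, :3].sum(axis=1) + 5 * np.sin(2 * np.pi * u) + rng.standard_normal(n)

X_res, y_res = partial_residuals(X, y, u, h=0.1)
d = int(n / np.log(n))                         # retained model size (assumed)
kept, omega = sis_screen(X_res, y_res, d)
print(d, sorted(kept[:3].tolist()))
```

A second-stage penalized regression on the `d` retained predictors would then complete a two-stage approach of the kind the abstract describes.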