Feature Screening and Variable Selection for Ultrahigh Dimensional Data Analysis

Open Access
Zhong, Wei
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
May 04, 2012
Committee Members:
  • Runze Li, Dissertation Advisor
  • Runze Li, Committee Chair
  • Bruce G Lindsay, Committee Member
  • Dennis Kon Jin Lin, Committee Member
  • Jingzhi Huang, Committee Member
Keywords:
  • ultrahigh dimensionality
  • distance correlation
  • feature screening
  • sure screening property
  • variable selection
This dissertation is concerned with feature screening and variable selection in ultrahigh dimensional data analysis, where the number of predictors p greatly exceeds the sample size n, that is, p >> n. Ultrahigh dimensional data analysis has become increasingly important in diverse scientific fields, such as genetics and finance. In Chapter 3, we develop a sure independence screening procedure based on distance correlation (DC-SIS, for short) to select important predictors for ultrahigh dimensional data. The DC-SIS can be implemented as easily as the sure independence screening procedure based on the Pearson correlation (SIS, for short) proposed by Fan and Lv (2008), yet it significantly improves upon the SIS. Fan and Lv (2008) established the sure screening property for the SIS under linear models: with a proper threshold, the procedure selects all important predictors with probability approaching one as n tends to infinity. We show that the sure screening property holds for the DC-SIS under more general settings that include linear models as a special case. Furthermore, implementing the DC-SIS requires no model specification (e.g., linear model or generalized linear model) for the responses or predictors, a very appealing property in ultrahigh dimensional data analysis. Moreover, the DC-SIS can be applied directly to grouped predictor variables and to multivariate responses. We conduct simulations to examine its finite sample performance, and propose an iterative procedure, the DC-ISIS, to further enhance that performance. Numerical comparisons indicate that the DC-SIS performs much better than the SIS across various models. We also illustrate the performance of the DC-SIS and DC-ISIS through two real data examples.
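At its core, the screening step described above ranks the predictors by their sample distance correlation with the response and retains the top d of them. A minimal sketch, assuming univariate predictors and response, the V-statistic form of the sample distance correlation, and the common threshold d = [n / log n]; function names are illustrative, and this is not the dissertation's implementation:

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between two univariate samples (V-statistic form)."""
    a = np.abs(x[:, None] - x[None, :])  # pairwise distance matrix for x
    b = np.abs(y[:, None] - y[None, :])  # pairwise distance matrix for y
    # Double centering: subtract row means and column means, add back the grand mean.
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()      # squared sample distance covariance (nonnegative)
    dvar_x = (A * A).mean()     # squared sample distance variance of x
    dvar_y = (B * B).mean()
    denom = np.sqrt(dvar_x * dvar_y)
    return np.sqrt(dcov2 / denom) if denom > 0 else 0.0

def dc_sis(X, y, d=None):
    """Rank predictors by distance correlation with y and keep the top d."""
    n, p = X.shape
    if d is None:
        d = int(n / np.log(n))  # an illustrative choice of screening threshold
    scores = np.array([distance_correlation(X[:, j], y) for j in range(p)])
    return np.argsort(scores)[::-1][:d], scores
```

Because the distance correlation is zero if and only if the two variables are independent, this ranking needs no model assumption on how y depends on the predictors, which is what makes the procedure model-free.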
In Chapter 4, we propose a two-stage feature screening and variable selection procedure for estimating the index parameter in heteroscedastic single-index models with ultrahigh dimensional covariates. In the screening stage, a robust independent ranking and screening (RIRS) procedure reduces the ultrahigh dimensionality of the covariates to a moderate scale. Aside from its computational simplicity, the RIRS procedure enjoys both the ranking consistency property and the sure screening property; in an asymptotic sense, it is therefore guaranteed to retain all truly active predictors, although some inactive predictors may be selected as well. In the cleaning stage, penalized linear quantile regression refines the selection from the RIRS stage and simultaneously estimates the direction of the index parameter. We establish the consistency and the oracle property of the resulting penalized estimator, and demonstrate through comprehensive numerical studies that the two-stage procedure is computationally expedient and exhibits outstanding finite sample performance.