Feature Screening in Ultra-high Dimensional Survival Data Analysis

Open Access
Sun, Wei
Graduate Program:
Degree:
Doctor of Philosophy
Document Type:
Date of Defense:
October 03, 2014
Committee Members:
  • Runze Li, Dissertation Advisor
  • Yu Zhang, Committee Member
  • Qunhua Li, Committee Member
  • Arthur Steven Berg, Special Member
Keywords:
  • Cox’s model
  • feature screening
  • SCAD
  • ultra-high dimensional survival data analysis
  • Gail-Simon test
Abstract:
For decades, much research has been devoted to developing variable selection methods, because high dimensional data arise in many scientific and technological fields. Continuous penalties such as the LASSO (Tibshirani, 1996) and the SCAD (Fan and Li, 2001) made it possible to cope with high dimensionality. Independence screening is a very useful tool for identifying all the important covariates at lower computational cost than traditional methods when the number of covariates grows at a non-polynomial rate in the sample size. When the response is a survival time, feature screening is more challenging because the responses are subject to censoring. In this thesis we propose a model-free independence feature screening procedure for ultra-high dimensional survival data. The new procedure can be applied directly to the most commonly used models in survival data analysis, such as Cox’s model, Cox’s frailty model, the additive Cox’s model, parametric, nonparametric and semiparametric proportional odds models, and accelerated failure time models. This model-free property is desirable because little prior information about the true model is usually available for ultra-high dimensional data. The proposed procedure is easy to implement and computationally efficient. We systematically studied the theoretical properties of the proposed procedure and established its sure screening property and its consistency in ranking property. Its performance is evaluated and compared with that of the existing procedure based on Cox’s model (Fan, Feng, & Wu, 2010) through extensive simulation studies and a real data analysis. Because the proposed procedure uses a marginal correlation utility measure, an inherent issue is that it cannot identify important features that are marginally independent of the response.
To resolve this issue, we propose an iterative procedure similar in spirit to the iterative sure independence screening procedure proposed by Fan and Lv (2008). The major challenge in developing the iterative procedure is the lack of a definition of residuals under the model-free framework for survival data analysis. The commonly used residuals, such as the martingale, Schoenfeld and deviance residuals, are all defined with respect to certain semiparametric models and are therefore not applicable in our model-free framework. We instead use the residuals from regressing the entire feature space on the previously selected active features. We also carefully studied the performance of the proposed iterative procedures. Our Monte Carlo simulation studies show that the proposed iterative procedures perform quite well with moderate sample sizes.
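
The residual-based iteration described above can be illustrated with a rough sketch. The abstract does not state the marginal utility used in the thesis, so the code below swaps in a simple concordance-based score (the distance of Harrell's C from 0.5) purely as a stand-in. The function names `concordance_utility`, `screen`, `residualize` and `iterative_screen`, and the parameters `d_per_step` and `n_steps`, are hypothetical and chosen only for illustration. What the sketch is meant to show is the structure of the iteration: rank features marginally, keep the top few, replace every feature by its least-squares residual on the selected features, and screen again.

```python
import numpy as np


def concordance_utility(x, time, event):
    """Stand-in marginal utility (an assumption, not the thesis's measure):
    |C - 0.5|, where C is Harrell's concordance between one feature and the
    censored survival time."""
    # Pair (i, j) is comparable if subject i has an observed event
    # and its time is shorter than subject j's observed time.
    comparable = (time[:, None] < time[None, :]) & event.astype(bool)[:, None]
    n_pairs = comparable.sum()
    if n_pairs == 0:
        return 0.0
    diff = x[:, None] - x[None, :]          # x_i - x_j
    concordant = (diff > 0) & comparable    # larger x paired with shorter time
    tied = (diff == 0) & comparable
    c = (concordant.sum() + 0.5 * tied.sum()) / n_pairs
    return abs(c - 0.5)


def screen(X, time, event, d, exclude=()):
    """Rank the columns of X by the stand-in utility and return the indices
    of the top d, skipping columns listed in `exclude`.
    Assumes d is smaller than the number of non-excluded columns."""
    scores = np.array(
        [concordance_utility(X[:, j], time, event) for j in range(X.shape[1])]
    )
    scores[np.asarray(list(exclude), dtype=int)] = -np.inf
    return np.argsort(scores)[::-1][:d]


def residualize(X, active):
    """Replace every column of X by its residual from a least-squares
    regression (with intercept) on the columns indexed by `active`."""
    A = np.column_stack([np.ones(X.shape[0]), X[:, active]])
    coef, *_ = np.linalg.lstsq(A, X, rcond=None)
    return X - A @ coef


def iterative_screen(X, time, event, d_per_step, n_steps):
    """Alternate marginal screening with residualization on the features
    selected so far; returns the list of selected column indices."""
    active = []
    X_work = X
    for _ in range(n_steps):
        new = screen(X_work, time, event, d_per_step, exclude=active)
        active.extend(int(j) for j in new)
        # Residuals do not depend on the survival model, only on the features.
        X_work = residualize(X, active)
    return active
```

Because the residuals are computed from the features alone, this step needs no model for the survival time, which is the point the abstract makes about avoiding martingale, Schoenfeld and deviance residuals. The sketch assumes the number of selected features stays well below the sample size; otherwise the regression fits the feature space exactly and the residuals carry no information.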