Model Misspecification and Feature Screening for Ultrahigh Dimensional Data

Open Access
Lin, Junyi
Graduate Program:
Doctor of Philosophy
Document Type:
Date of Defense:
October 03, 2011
Committee Members:
  • Runze Li, Committee Chair
  • Naomi Altman, Committee Member
  • Debashis Ghosh, Committee Member
  • David Hunter, Committee Member
  • Ping Xu, Committee Member
Keywords:
  • Covariate-adjusted approach; generalized linear re
The variance-bias trade-off has been partially discussed for linear and logistic regression models, but not for generalized linear models as a whole. In this dissertation, we derive the bias of the treatment effect estimate in covariate-unadjusted models when some important covariates are omitted. This result encourages the use of the covariate-adjusted approach in general. On the other hand, we show that for a broad class of generalized linear models, estimates of the treatment effect obtained from covariate-adjusted models have larger variances than those obtained from covariate-unadjusted models. This result reveals the potential loss of efficiency associated with the covariate-adjusted approach, particularly when the sample size is not large. These theoretical results are illustrated through examples, a simulation study and a real data example.

This dissertation is also concerned with feature screening for ultrahigh dimensional data. We propose two unified sure independence ranking and screening procedures based on conditional characteristic functions. The proposed procedures do not require specification of a regression function. In addition, they can be directly applied to univariate or multivariate continuous, discrete and categorical responses. We show that, with the number of predictors growing at an exponential rate of the sample size, these unified procedures possess both the ranking consistency and sure screening properties. The ranking consistency property ensures that all important features are asymptotically ranked above the unimportant ones, and the sure screening property guarantees that all important features are retained with overwhelming probability after screening. Both are desired properties in ultrahigh dimensional data analysis.
We study the finite-sample performance of our proposed independence ranking and screening procedures through simulations and illustrate the proposed procedures via an empirical analysis of a real-world data set.
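The bias-variance trade-off described in the first part of the abstract can be illustrated with a small simulation. The sketch below is not the dissertation's analysis; it is a minimal numpy example, under assumed effect sizes and sample sizes, showing the two phenomena for logistic regression (one member of the GLM class studied): omitting an important covariate attenuates the estimated treatment effect, while the covariate-adjusted estimate is closer to the truth but has a larger variance.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit a logistic regression by Newton-Raphson; returns the coefficients."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))       # fitted probabilities
        W = p * (1.0 - p)                          # IRLS weights
        H = X.T @ (X * W[:, None])                 # observed information
        g = X.T @ (y - p)                          # score vector
        beta += np.linalg.solve(H, g)
    return beta

# Hypothetical simulation settings (not from the dissertation).
rng = np.random.default_rng(0)
n, reps = 2000, 200
true_trt, true_cov = 1.0, 2.0                      # treatment and covariate effects
unadj, adj = [], []
for _ in range(reps):
    trt = rng.integers(0, 2, n)                    # randomized binary treatment
    z = rng.normal(size=n)                         # important covariate
    p = 1.0 / (1.0 + np.exp(-(true_trt * trt + true_cov * z)))
    y = rng.binomial(1, p)
    # Adjusted model includes z; unadjusted model omits it.
    adj.append(fit_logistic(np.column_stack([np.ones(n), trt, z]), y)[1])
    unadj.append(fit_logistic(np.column_stack([np.ones(n), trt]), y)[1])

print("mean unadjusted estimate:", np.mean(unadj))  # attenuated toward zero
print("mean adjusted estimate:  ", np.mean(adj))
print("variance, unadjusted:", np.var(unadj))
print("variance, adjusted:  ", np.var(adj))
```

With these settings the unadjusted estimate is biased toward zero even though treatment is randomized (non-collapsibility of the logistic link), while the adjusted estimate recovers the true effect at the cost of a larger sampling variance, matching the trade-off the abstract describes.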
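The second part of the abstract describes model-free sure independence screening. The dissertation's procedures are built on conditional characteristic functions; the sketch below substitutes empirical distance correlation as a related model-free marginal utility (a stand-in, not the proposed statistic), ranks every predictor by it, and retains the top n/log n features, a common screening cutoff. The dimensions, active set, and effect forms are hypothetical.

```python
import numpy as np

def dist_corr(x, y):
    """Empirical distance correlation between two 1-D samples (V-statistic form)."""
    def centered(v):
        d = np.abs(v[:, None] - v[None, :])        # pairwise distance matrix
        return d - d.mean(0) - d.mean(1)[:, None] + d.mean()
    A, B = centered(x), centered(y)
    dcov2 = (A * B).mean()                         # squared distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0

# Hypothetical ultrahigh dimensional setting: p >> n, only 3 active predictors,
# with linear, nonlinear-monotone, and non-monotone effects on the response.
rng = np.random.default_rng(1)
n, p = 200, 1000
X = rng.normal(size=(n, p))
y = X[:, 0] + 2.0 * np.sin(X[:, 1]) + X[:, 2] ** 2 + 0.3 * rng.normal(size=n)

# Rank predictors by the marginal utility; keep the top n/log(n).
util = np.array([dist_corr(X[:, j], y) for j in range(p)])
d = int(n / np.log(n))
kept = np.argsort(util)[::-1][:d]
print("screening cutoff d =", d)
print("highest-ranked predictors:", np.sort(kept[:10]))
```

Because the utility is computed without specifying a regression function, the quadratic effect of `X[:, 2]` (invisible to Pearson-correlation screening) can still be detected, which is the practical point of the model-free procedures the abstract proposes.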