Thresholded partial correlation approach for variable selection in linear models and partially linear models

Open Access
Lou, Lejia
Graduate Program:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
October 14, 2013
Committee Members:
  • Runze Li, Dissertation Advisor
  • Runze Li, Committee Chair
  • David Russell Hunter, Committee Member
  • Bing Li, Committee Member
  • Rongling Wu, Committee Member
Keywords:
  • Variable Selection
  • Linear Model
  • Partially Linear Model
  • Nonparametric Regression
This thesis is concerned with variable selection in linear models and partially linear models for high-dimensional data analysis. With the development of high-throughput technology, it is crucial to identify a small subset of covariates that exhibits the strongest relationship with the response. Researchers have devoted much effort to developing variable selection methodologies, such as regularization techniques, including the least absolute shrinkage and selection operator (LASSO; Tibshirani, 1996) and the penalized least squares estimate with the smoothly clipped absolute deviation penalty (SCAD; Fan and Li, 2001). Departing from these regularization methods for variable selection in linear models, Bühlmann et al. (2010) proposed the PC-simple algorithm to select significant variables. As they showed, under some conditions and with a proper choice of the significance level, the PC-simple algorithm consistently identifies the true active set with probability approaching $1$ when the response and covariates are jointly normally distributed.

In Chapter 3, we study the performance of the PC-simple algorithm under non-normal distributions. The PC-simple algorithm bases variable selection on the fact that Fisher's z-transforms of the sample marginal and partial correlations are asymptotically standard normal. This fact fails when the samples come from non-normal distributions, which is the drawback of the PC-simple algorithm. We therefore derive the asymptotic distributions of Fisher's z-transforms of the sample marginal and partial correlations under elliptical distributions, and we find that these asymptotic distributions depend on the kurtosis. According to the threshold we develop, the PC-simple algorithm tends to over-fit the model when the kurtosis is positive and to under-fit it when the kurtosis is negative.
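The thresholding idea above can be illustrated with a minimal sketch. Under normality, $\sqrt{n-3}\,|z(\hat r)|$ is compared to a standard normal quantile; under an elliptical distribution the asymptotic variance of the sample correlation is inflated by a kurtosis factor, which the adjusted statistic divides out. The function name `marginal_screen` and the specific form of the adjustment (dividing the statistic by $\sqrt{1+\hat\kappa}$, with $\hat\kappa$ an excess-kurtosis parameter) are illustrative assumptions, not the thesis's exact formulas.

```python
import numpy as np
from statistics import NormalDist


def fisher_z(r):
    """Fisher's z-transform of a correlation coefficient."""
    return 0.5 * np.log((1 + r) / (1 - r))


def marginal_screen(X, y, alpha=0.05, kappa=0.0):
    """Screen covariates by marginal correlation with the response.

    Under normality (kappa = 0), sqrt(n - 3) * z(r) is approximately
    N(0, 1).  Under an elliptical distribution, the asymptotic variance
    is inflated by a kurtosis factor; as an illustrative assumption we
    divide the statistic by sqrt(1 + kappa) to restore a N(0, 1) scale.
    """
    n = X.shape[0]
    crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided normal quantile
    keep = []
    for j in range(X.shape[1]):
        r = np.corrcoef(X[:, j], y)[0, 1]
        stat = np.sqrt(n - 3) * abs(fisher_z(r)) / np.sqrt(1.0 + kappa)
        if stat > crit:
            keep.append(j)
    return keep
```

Passing `kappa > 0` raises the effective threshold, which is exactly how the unadjusted PC-simple rule comes to over-fit for positively kurtotic samples: it uses `kappa = 0` and hence too small a cutoff. The full PC-simple algorithm goes on to test partial correlations of increasing order in the same way.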
The results of extensive simulation studies with elliptical distributions, including normal distributions and mixtures of normal distributions, are consistent with this analysis. With normal samples, the PC-simple algorithm and our proposal perform similarly, as both use essentially the same asymptotic distributions of the sample marginal and partial correlations. With mixtures of normal distributions, however, the PC-simple algorithm overfits the data, since the kurtosis of such mixtures is greater than 0, and our proposal outperforms the PC-simple algorithm in terms of the correct-fitting percentage. This underscores the need to adjust the asymptotic variance. Moreover, an application of our proposal to the cardiomyopathy microarray data suggests that it is comparable to the regularization approach with the SCAD penalty and outperforms the one with the LASSO penalty. Furthermore, by imposing some conditions on the partial correlations, we show that the proposed approach consistently identifies the true active set.

In Chapter 4, we study how to apply the thresholded partial correlation approach to select significant variables in partially linear models. First, we approximately transform the partially linear model into a linear model via the partial residuals technique. We then apply the thresholded partial correlation approach to the resulting linear model to obtain the estimated active set, and apply least squares to estimate the coefficients in the linear part. The estimate of the nonparametric function is obtained by substituting the estimate of the linear part into the original model. We call this approach the thresholded partial correlation on partial residuals (TPC-PR) approach. Similarly, the PC-simple algorithm can be applied to the partial residuals as if the samples were normally distributed; we call the resulting algorithm the PC-simple algorithm on partial residuals (PC-PR).
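The four-step pipeline just described can be sketched as follows for the model $y = X\beta + g(u) + \varepsilon$. This is a minimal illustration, not the thesis's implementation: the Nadaraya-Watson kernel smoother stands in for whatever local smoother the thesis uses, the bandwidth `h` is fixed rather than data-driven, and `select` is any linear-model selection routine (e.g. a thresholded-correlation screen).

```python
import numpy as np


def nw_smooth(u, v, h=0.1):
    """Nadaraya-Watson kernel estimate of E[v | u] at each sample point
    (an illustrative stand-in for the local smoother)."""
    d = (u[:, None] - u[None, :]) / h
    w = np.exp(-0.5 * d ** 2)          # Gaussian kernel weights
    return w @ v / w.sum(axis=1)


def tpc_pr(X, y, u, select, h=0.1):
    """Sketch of the TPC-PR pipeline for y = X beta + g(u) + noise:
    1. remove the nonparametric component via partial residuals,
    2. run a linear-model selection routine `select` on the residuals,
    3. refit the selected coefficients by least squares,
    4. recover g by smoothing y - X beta_hat against u."""
    # Step 1: partial residuals reduce the model to (approximately) linear.
    y_res = y - nw_smooth(u, y, h)
    X_res = X - np.column_stack(
        [nw_smooth(u, X[:, j], h) for j in range(X.shape[1])]
    )
    # Step 2: variable selection on the residualized linear model.
    active = select(X_res, y_res)
    # Step 3: least squares on the selected columns.
    beta = np.zeros(X.shape[1])
    beta[active] = np.linalg.lstsq(X_res[:, active], y_res, rcond=None)[0]
    # Step 4: smooth what the linear part leaves behind to estimate g.
    g_hat = nw_smooth(u, y - X @ beta, h)
    return active, beta, g_hat
```

Plugging the thresholded partial correlation screen in as `select` gives TPC-PR, while plugging in the unadjusted normal-theory screen gives PC-PR; the surrounding steps are identical in both cases.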
We establish the consistency of variable selection in the linear part and the asymptotic normality of the estimator of the nonparametric function. Simulation studies show that our proposal performs as well as the penalized approach on partial residuals with the SCAD penalty (Fan and Li, 2004) and outperforms the one with the LASSO penalty. A real data analysis also demonstrates that our proposal yields a parsimonious model.