Topics on Variable Selection in High-Dimensional Data
Open Access
- Author:
- Wang, Jia
- Graduate Program:
- Statistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- May 27, 2021
- Committee Members:
- Lingzhou Xue, Major Field Member
- David Hunter, Major Field Member
- Rongling Wu, Outside Unit & Field Member
- Runze Li, Chair & Dissertation Advisor
- Ephraim Hanks, Program Head/Chair
- Keywords:
- Bayesian variable selection
- High-dimensional data
- Network analysis
- Partially linear model
- Time-varying network model
- Abstract:
- Variable selection has been studied extensively over the last few decades because it provides a principled solution to the high dimensionality that arises in a broad spectrum of real applications, such as bioinformatics, health studies, social science, and econometrics. This dissertation is concerned with variable selection for ultrahigh-dimensional data, where the dimension is allowed to grow at an exponential rate with the sample size or the network size. We propose new Bayesian approaches to selecting variables under several model frameworks: (1) partially linear models, (2) static social network models with degree heterogeneity, and (3) time-varying network models. First, for partially linear models, we develop a procedure that employs the difference-based method to reduce the impact of estimating the nonparametric component and incorporates Bayesian subset modeling with diffusing prior (BSM-DP) to shrink the corresponding estimator in the linear component. Second, we consider a class of network models in which the connection probability depends on ultrahigh-dimensional nodal covariates (homophily) and node-specific popularity (degree heterogeneity). We propose a Bayesian method to select nodal features in both dense and sparse networks under a relaxed assumption on the popularity parameters. To alleviate the computational burden for large sparse networks, we develop an additional working model in which parameters are updated based on a dense subgraph at each step. Lastly, we extend the static model to the time-varying case, where the connection probability at time t is modeled using the nodal attributes observed at time t and node-specific continuous-time baseline functions evaluated at time t. These Bayesian proposals are shown to be analogous to a mixture of L0 and L2 penalized methods and to work well in the setting of highly correlated predictors.
The corresponding model selection consistency is studied for all of the aforementioned models, in the sense that the probability of selecting the true model converges to one asymptotically. The finite-sample performance of the proposed models is further examined through simulation studies and analyses of social-media and financial datasets.
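The difference-based idea mentioned in the abstract for partially linear models can be illustrated with a minimal sketch. It is not the dissertation's procedure: it assumes a model of the form y = X beta + g(u) + noise with a smooth g, uses simple first-order differencing after sorting by u, and substitutes ordinary least squares for the BSM-DP selection step. The function name `difference_based_fit`, the simulated data, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def difference_based_fit(y, X, u):
    """First-order difference estimate of beta in y = X @ beta + g(u) + noise.

    Assumption: g is smooth, so g(u) nearly cancels between observations
    that are adjacent after sorting by u.
    """
    order = np.argsort(u)              # sort observations by the nonparametric covariate
    y_sorted, X_sorted = y[order], X[order]
    dy = np.diff(y_sorted)             # differencing approximately removes g(u)
    dX = np.diff(X_sorted, axis=0)
    # Plain least squares on the differenced data; this stands in for the
    # Bayesian selection step, which is not reproduced here.
    beta_hat, *_ = np.linalg.lstsq(dX, dy, rcond=None)
    return beta_hat

# Hypothetical simulated data: three linear covariates, one of them inactive.
rng = np.random.default_rng(0)
n, p = 500, 3
beta_true = np.array([2.0, 0.0, -1.5])
X = rng.normal(size=(n, p))
u = rng.uniform(size=n)
y = X @ beta_true + np.sin(2 * np.pi * u) + 0.1 * rng.normal(size=n)

beta_hat = difference_based_fit(y, X, u)
print(beta_hat)
```

Because adjacent sorted values of u are close, the smooth component contributes very little to the differenced response, so the linear coefficients can be estimated without first fitting g.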