# NEW STATISTICAL ANALYTIC TOOLS FOR HIGH DIMENSIONAL DATA

Restricted (Penn State Only)
Author:
Yang, Songshan
Statistics
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
May 08, 2018
Committee Members:
Abstract:
This dissertation studies feature screening and two-sample mean testing procedures for high-dimensional data. First, a new feature screening procedure based on the joint quasi-likelihood is proposed for generalized varying coefficient models. Second, a new testing method that accounts for the correlation structure is proposed for high-dimensional mean vectors.

Generalized varying coefficient models are particularly useful for examining dynamic effects of covariates on a continuous, binary, or count response. The first part of this dissertation is concerned with feature screening for generalized varying coefficient models with ultrahigh-dimensional covariates. The proposed screening procedure is based on the joint quasi-likelihood of all predictors, which distinguishes it from the marginal screening procedures proposed in the literature. In particular, the proposed procedure can effectively identify active predictors that are jointly dependent on, but marginally independent of, the response. To carry out the proposed procedure, we propose an effective algorithm and establish its ascent property. We further prove that the proposed procedure possesses the sure screening property: with probability tending to one, the selected variable set includes the actual active predictors. We examine the finite-sample performance of the proposed procedure, compare it with existing procedures via Monte Carlo simulations, and illustrate it with a real data example.

Testing the population mean is fundamental in statistical inference. The traditional Hotelling's $T^2$ test becomes practically infeasible when the dimensionality of the data is larger than the sample size, because the sample covariance matrix is then singular. For a symmetric positive definite matrix $W$, we consider $T=(\mathbf{x}_1-\mathbf{x}_2)^T W (\mathbf{x}_1-\mathbf{x}_2)$ for the two-sample problem.
We first prove that, to maximize the asymptotic power of $T$, the weight matrix must take the form $W=\lambda \Sigma^{-1}$ for some positive constant $\lambda$. The goal is to model the correlation matrix and use the correlation to improve the power of the test. We consider linear structure models for the inverse of the correlation matrix, $\Omega \,\hat{=}\, R^{-1}$: $\Omega(\boldsymbol{\theta})= \theta_1G_1 + \sum_{l=2}^L \theta_l G_l$. An estimation procedure for $\boldsymbol{\theta}$ is proposed, and the asymptotic power of the proposed test, which incorporates correlation information, is demonstrated. We compare the performance of the proposed test with that of existing methods via Monte Carlo simulations, and a real data example is also given.
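
As a rough illustration of the quadratic-form statistic and the linear structure model described above, the following Python sketch evaluates $T$ for a given weight matrix and builds $\Omega(\boldsymbol{\theta})$ from a set of basis matrices. It is not the dissertation's implementation. Several things here are assumptions made only for the example: the two vectors in $T$ are read as the sample mean vectors of the two groups, the basis matrices (an identity and a first off-diagonal band) and the values of `theta` are hypothetical, the variables are taken to have unit variance so that $\Omega$ can stand in for $W$, and the function names are invented. The estimation procedure for $\boldsymbol{\theta}$ and the calibration of the test are not shown.

```python
import numpy as np


def weighted_two_sample_statistic(x1, x2, W):
    """Return d^T W d, where d is the difference of the two sample mean vectors.

    Reading the vectors in T as sample means is an assumption of this sketch.
    """
    d = np.asarray(x1).mean(axis=0) - np.asarray(x2).mean(axis=0)
    return float(d @ W @ d)


def linear_structure_omega(theta, basis):
    """Return Omega(theta) = sum_l theta_l * G_l for the given basis matrices G_l."""
    return sum(t * np.asarray(g) for t, g in zip(theta, basis))


rng = np.random.default_rng(0)
p, n1, n2 = 50, 20, 25  # dimension exceeds both sample sizes

# Hypothetical basis: identity plus the first off-diagonal band.
basis = [np.eye(p), np.eye(p, k=1) + np.eye(p, k=-1)]
theta = [1.0, -0.3]  # illustrative values, not estimates
omega = linear_structure_omega(theta, basis)

# Simulated groups whose means differ by 0.2 in every coordinate.
x1 = rng.normal(size=(n1, p))
x2 = rng.normal(size=(n2, p)) + 0.2

T_identity = weighted_two_sample_statistic(x1, x2, np.eye(p))
T_omega = weighted_two_sample_statistic(x1, x2, omega)
print(T_identity, T_omega)
```

Because the weight matrix is built from a structured model rather than by inverting a sample covariance matrix, the statistic stays well defined even though `p` is larger than `n1` and `n2` in this example.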