Statistical Inference with Corrupted Data

Open Access
- Author:
- Li, Mengyan
- Graduate Program:
- Statistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- April 28, 2020
- Committee Members:
- Yanyuan Ma, Dissertation Advisor/Co-Advisor
Yanyuan Ma, Committee Chair/Co-Chair
Michael G Akritas, Committee Member
Bing Li, Committee Member
Lan Kong, Outside Member
Runze Li, Dissertation Advisor/Co-Advisor
Runze Li, Committee Chair/Co-Chair
Ephraim Mont Hanks, Program Head/Chair
- Keywords:
- Measurement error model
Efficient score method
Semiparametric regression
High dimensional inference
Nonignorable missing data
Statistical inference
- Abstract:
- Corrupted data are ubiquitous in applications where measurement errors or missing data cannot be ignored. For example, measurement errors arise frequently in nutrition science and biomedical science, and missing data are common in research involving human subjects, such as health-related studies and sample surveys. Statistical inference with corrupted data is widely regarded as challenging, and improper treatment can lead to biased estimation and erroneous inference. Assumptions are typically imposed to guarantee model identifiability and to facilitate the establishment of theoretical results, but those assumptions can be too restrictive in some applications. Another issue in analyzing corrupted data is impaired estimation efficiency: extra noise or partial observation often leads to a lack of power. In this dissertation, we focus on developing robust or efficient methodologies for analyzing corrupted data and on establishing asymptotic properties of the newly proposed estimators to make statistical inference feasible.
In Chapter 3, we consider a given parametric regression model with a covariate measured with heteroscedastic error, where the error distribution can have an arbitrary form. Both the variance function of the measurement error and the distribution of the error-prone covariate are left completely unspecified. We avoid performing deconvolution, the standard treatment in the prior literature, by using a novel spline-assisted semiparametric approach. Its most distinctive feature is to embed a B-spline approximation of the variance function in a semiparametric treatment; this achieves robust estimation that allows misspecification of the covariate distribution. By combining B-spline techniques, integral equations, and semiparametric analysis, we establish our estimator’s theoretical properties.
In Chapter 4, we study statistical inference on parameters associated with a finite number of error-prone covariates in high-dimensional linear measurement error models. In high-dimensional settings, the main challenges posed by measurement errors are nonconvexity and the lack of closed-form solutions, which significantly complicate the analysis of standard regularization methods such as the Lasso and the Dantzig selector. To counteract the effect of high-dimensional nuisance parameters and correct the biases introduced by measurement errors, we propose a new corrected decorrelated score test and a corresponding one-step estimator. By adapting the bias-correction and decorrelation operations to our model, we show that our test statistic is asymptotically normal and retains power under local alternatives around zero. Further, our one-step estimator has significantly better convergence performance than other existing estimators, and it is semiparametrically efficient.
In Chapter 5, we consider data in which all the covariates are fully observed and the scalar response is subject to nonignorable missingness, i.e., the missingness mechanism depends on the missing values themselves. In such cases, model identifiability and model misspecification can be two critical problems. We assume a flexible semiparametric exponential tilting propensity in which the relationship between the missingness indicator and the response is left completely unspecified and estimated nonparametrically, while the relationship between the missingness indicator and the covariates is modeled parametrically. To guarantee that the model is identifiable, we model the fully observed part of the data parametrically. We devise two estimators for the parameter of interest in the parametric parts using a semiparametric treatment. The first is robust against misspecification of the distribution of the covariates, while the second is semiparametrically efficient.
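The abstract's point that improper treatment of measurement error biases estimation can be illustrated with a minimal simulation. This sketch is not the dissertation's method: it assumes the simplest textbook setting (one covariate, classical additive homoscedastic error with known variance), and every name and numeric value in it (`beta`, `sigma_u`, the sample size, the seed) is a hypothetical choice made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta = 2.0       # true slope (hypothetical value)
sigma_u = 1.0    # measurement-error standard deviation, assumed known

x = rng.normal(0.0, 1.0, n)              # true covariate, unobserved in practice
w = x + rng.normal(0.0, sigma_u, n)      # error-prone surrogate that is observed
y = beta * x + rng.normal(0.0, 0.5, n)   # response generated from the true covariate

# Center the observed variables so slopes can be computed without an intercept.
wc = w - w.mean()
yc = y - y.mean()

# Naive least squares of y on w: attenuated toward zero by the factor
# var(x) / (var(x) + sigma_u^2), which is 1/2 in this setup.
naive = (wc @ yc) / (wc @ wc)

# Moment-corrected slope: remove the known error variance from the denominator.
corrected = (wc @ yc) / (wc @ wc - n * sigma_u**2)

print(f"naive slope:     {naive:.3f}")
print(f"corrected slope: {corrected:.3f}")
```

In this setup the naive slope should land near half the true value, while the corrected slope should land near `beta`. The correction depends on knowing the error variance, which is the kind of assumption the dissertation's Chapter 3 avoids by leaving the variance function unspecified.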