New Statistical Procedures for Analysis of HIV Data and High Dimensional Data

Restricted (Penn State Only)
Ye, Jingyi
Graduate Program:
Doctor of Philosophy
Document Type:
Date of Defense:
October 03, 2017
Committee Members:
  • Runze Li, Dissertation Advisor
  • Runze Li, Committee Chair
  • Le Bao, Committee Member
  • Zhibiao Zhao, Committee Member
  • Lan Kong, Outside Member
Keywords:
  • HIV
  • prevalence and incidence
  • incidence assay data
  • refitted cross-validation
  • high dimension data
  • varying coefficient model
  • varying error variance
Abstract:
This dissertation consists of two parts. In the first part, we develop a new statistical procedure for analysing HIV data that improves the efficiency of parameter estimates by incorporating additional available information, and we use the procedure to study the impact of this additional information. In the second part, we develop a new error variance function estimation procedure for ultrahigh dimensional varying coefficient models.
The human immunodeficiency virus (HIV) is a lentivirus that causes HIV infection and acquired immunodeficiency syndrome (AIDS). Accurate estimation and prediction of HIV epidemics give people a better understanding of the epidemics and help governments make laws and formulate policies. Two key indicators, prevalence and incidence, are widely used to characterize HIV epidemics. HIV prevalence is the proportion of the general population that is HIV positive. HIV incidence is the proportion of new HIV infections among the general population. Antiretroviral treatment (ART) reduces AIDS-related deaths and substantially changes the AIDS-related mortality rate. According to the UNAIDS 2014 Gap Report, the number of people newly infected with HIV continues to decline in most countries and regions of the world, which suggests a slowdown of HIV epidemics. Traditionally, an increase in the HIV prevalence rate was mostly due to an increase in new infections. However, the reduction in AIDS-related deaths has become another important cause of increasing HIV prevalence. In this setting, knowing incidence helps people understand the HIV epidemic more fully. One of our goals is to utilize the newly available incidence assays in the estimation and projection of HIV epidemics, and to understand the contribution of such data in the presence of historical HIV prevalence data, which have been the main data source for estimating HIV epidemics.
The Susceptible-Infectious-Recovered (SIR) system is widely used in epidemiology. Under a Bayesian framework, Incremental Mixture Importance Sampling (IMIS) can be used to draw posterior samples. The current method for incorporating new incidence assay data into the SIR model is to refit the historical prevalence data and the new incidence data from scratch, even when fitted results for the prevalence data are already available. We propose a new method, Sequential IMIS, to estimate prevalence and incidence with assay data. Our method reduces the computing time in most scenarios and enables the study of the impact of incidence assay data in multiple scenarios. We also improve the stopping rule for IMIS to prevent the algorithm from stopping at a local maximum. Incidence assay data are affected by four parameters: prevalence, incidence, the false recent rate (FRR), and the mean duration of recent infection (MDRI). We use the proposed method to study how incorporating the new incidence assay data affects the estimated prevalence and incidence rates and their changes over time. The study considers both one-time data and time series data. Our research shows that in most countries, incidence assay data can significantly improve the accuracy of the incidence estimate.
In the second part, we propose a new procedure for estimating the error variance function in ultrahigh dimensional varying coefficient models (VCM). The low dimensional VCM was systematically introduced in Hastie and Tibshirani (1993) and is one of the most commonly used nonparametric regression models in statistics. Error variance function estimation plays an important role in constructing confidence intervals and in hypothesis testing for VCM, and is very challenging in the presence of ultrahigh dimensional covariates. A naive approach is to select variables first and then refit the model using only the low dimensional set of selected variables.
We first show, both theoretically and empirically, that this naive estimator significantly underestimates the error variance and may lead to an inconsistent estimate. We then propose a new estimation procedure for the error variance function that uses the group least absolute shrinkage and selection operator (LASSO) and refitted cross-validation (RCV) techniques. We study the asymptotic properties of the RCV estimate and compare it with the naive estimate. Our findings include that the RCV estimator is consistent and asymptotically normal with the smallest variance, and that it significantly improves on the naive estimator. We further conduct simulation studies to examine the finite sample performance of the RCV estimate.
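The contrast between the naive estimate and the RCV estimate can be illustrated with a small simulation. The sketch below is not the dissertation's procedure; it rests on several simplifying assumptions. It uses an ordinary linear model with constant error variance rather than a varying coefficient model with a variance function, and it uses greedy forward selection as a stand-in for group LASSO. All function names (`forward_select`, `naive_variance`, `rcv_variance`, `simulate`) and all simulation settings are hypothetical and chosen only for illustration.

```python
import numpy as np


def forward_select(X, y, k):
    """Greedy forward selection (an assumed stand-in for group LASSO).

    At each step, add the column most correlated with the current residual.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    norms = np.linalg.norm(Xc, axis=0)
    available = np.ones(p, dtype=bool)
    selected = []
    resid = y - y.mean()
    for _ in range(k):
        score = np.abs(Xc.T @ resid) / norms
        score[~available] = -np.inf  # never pick the same column twice
        j = int(np.argmax(score))
        selected.append(j)
        available[j] = False
        design = np.column_stack([np.ones(n), X[:, selected]])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ beta
    return selected


def ols_residual_variance(X, y, cols):
    """Degrees-of-freedom corrected residual variance of OLS on X[:, cols]."""
    n = len(y)
    design = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid / (n - design.shape[1]))


def naive_variance(X, y, k):
    """Naive estimate: select and refit on the same observations."""
    return ols_residual_variance(X, y, forward_select(X, y, k))


def rcv_variance(X, y, k, rng):
    """RCV estimate: select on one half, refit on the other, swap, average."""
    perm = rng.permutation(len(y))
    a, b = perm[: len(y) // 2], perm[len(y) // 2:]
    cols_a = forward_select(X[a], y[a], k)
    cols_b = forward_select(X[b], y[b], k)
    var_on_b = ols_residual_variance(X[b], y[b], cols_a)
    var_on_a = ols_residual_variance(X[a], y[a], cols_b)
    return 0.5 * (var_on_a + var_on_b)


def simulate(n=100, p=500, k=5, sigma2=1.0, reps=20, seed=0):
    """Average both estimates over repeated draws; the true variance is sigma2."""
    rng = np.random.default_rng(seed)
    naive, rcv = [], []
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        # Only the first covariate is active; the other p - 1 are pure noise.
        y = 2.0 * X[:, 0] + np.sqrt(sigma2) * rng.standard_normal(n)
        naive.append(naive_variance(X, y, k))
        rcv.append(rcv_variance(X, y, k, rng))
    return float(np.mean(naive)), float(np.mean(rcv))


if __name__ == "__main__":
    naive_mean, rcv_mean = simulate()
    print(f"true variance 1.0, naive mean {naive_mean:.3f}, RCV mean {rcv_mean:.3f}")
```

In this toy setting the naive estimate tends to fall below the true variance because the spurious covariates are chosen precisely for their sample correlation with the noise. Refitting on the half that was not used for selection removes that dependence, at the cost of a smaller refitting sample.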