A Latent-Class Selection Model for Nonignorably Missing Data
Open Access
- Author:
- Jung, Hyekyung
- Graduate Program:
- Statistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- June 21, 2007
- Committee Members:
- Joseph Francis Schafer, Committee Chair/Co-Chair
John Walter Graham, Committee Member
Aleksandra B Slavkovic, Committee Member
Bruce G. Lindsay, Committee Member
- Keywords:
- multivariate incomplete data
nonignorable nonresponse
latent variable
- Abstract:
- Most missing-data procedures assume that the missing values are ignorably missing or missing at random (MAR), which means that the probabilities of response do not depend on unseen quantities. Although this assumption is convenient, it is sometimes questionable. For example, questionnaire items pertaining to sensitive information (e.g., substance use, delinquency) may show high rates of missingness. Participants who fail to respond may do so for a variety of reasons, some of which could be strongly related to the underlying true values. Data are said to be nonignorably missing if the probabilities of missingness depend on unobserved quantities. Traditional selection models for nonignorable nonresponse are outcome-based, tying these probabilities to partially observed values directly (e.g., by a logistic regression). These methods are inherently unstable, because the relationship between a partially observed variable and its missingness indicator is understandably difficult to estimate. Moreover, with multivariate or longitudinal responses, the number of distinct missingness patterns becomes quite large, making traditional selection modeling even more unattractive. Information in the missing-data indicators is sometimes well summarized by a simple latent-class structure, suggesting that a large number of missing-data patterns may be reduced to just a few prototypes. In this thesis, we describe a new method for imputing missing values under a latent-class selection model (LCSM). In the LCSM, the response behavior is assumed to be related to the items in question, and to additional covariates, only through a latent class membership measured by the missingness indicators. We describe the LCSM and apply it to data from a school-based study of alcohol risk and exposure among adolescents in Pennsylvania, which has sensitive items with high rates of missingness.
We examine an alcohol risk index for students from 8 to 13 years old and compare our model's performance to that of an MAR-based alternative.
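The latent-class idea in the abstract can be illustrated with a small simulation. The sketch below is not the dissertation's estimation or imputation procedure. It only generates data in which missingness depends on the true value solely through a two-class latent membership, and then shows that a complete-case mean is biased. All names (`simulate`, `P_RESPOND`), the logistic link, and the numeric settings are hypothetical choices made for illustration.

```python
import math
import random

# Hypothetical per-class probability of answering an item (illustrative values only).
P_RESPOND = {0: 0.95, 1: 0.30}  # class 0 mostly responds, class 1 mostly skips


def simulate(n=20000, n_items=3, seed=0):
    """Draw (y, latent_class, indicators); missingness depends on y only via the class."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = rng.gauss(0.0, 1.0)  # underlying true value, e.g. a risk score
        # Assumed logistic link: higher y makes membership in class 1 more likely.
        p_class1 = 1.0 / (1.0 + math.exp(-(-1.0 + 1.5 * y)))
        cls = 1 if rng.random() < p_class1 else 0
        # Given the class, each response indicator is drawn without reference to y.
        r = [1 if rng.random() < P_RESPOND[cls] else 0 for _ in range(n_items)]
        rows.append((y, cls, r))
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


rows = simulate()
true_mean = mean(y for y, _, _ in rows)
# Complete-case mean: uses y only where the first indicator marks it as observed.
observed_mean = mean(y for y, _, r in rows if r[0] == 1)
# Within a single class, respondents and nonrespondents should have similar y.
gap_in_class0 = (
    mean(y for y, c, r in rows if c == 0 and r[0] == 1)
    - mean(y for y, c, r in rows if c == 0 and r[0] == 0)
)

print(f"true mean of y:            {true_mean:.3f}")
print(f"complete-case mean of y:   {observed_mean:.3f}")
print(f"respondent gap in class 0: {gap_in_class0:.3f}")
```

In this setup the complete-case mean falls below the true mean, because high values of `y` are more likely to land in the class that skips items, while the respondent gap within a class stays close to zero. In real data the class is unobserved and would have to be inferred from the pattern of missingness indicators, which this sketch does not attempt.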