Optimization and Statistical Estimation for the Post Randomization Method
Open Access
- Author:
- Woo, Yong
- Graduate Program:
- Statistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- June 07, 2013
- Committee Members:
- Aleksandra B Slavkovic, Dissertation Advisor/Co-Advisor
Donald Richards, Dissertation Advisor/Co-Advisor
Naomi S Altman, Committee Member
Stephanie Trea Lanza, Committee Member
- Keywords:
- Post Randomization Method
Statistical Disclosure Control
EM Algorithm
Generalized Linear Models
Constrained Optimization
- Abstract:
- The field of Statistical Disclosure Control (SDC) aims to develop methodology that balances two objectives: providing data for valid statistical inference and safeguarding confidential information. One of the SDC methods for categorical variables is the Post Randomization Method (PRAM). The basic idea underlying PRAM is to misclassify values of the categorical variables via a known probability mechanism captured by a PRAM matrix. This thesis focuses on three primary methodological developments that make PRAM a more theoretically and practically viable SDC method. First, we focus on the issue of obtaining valid statistical analysis with data subject to PRAM. The application of PRAM is known to produce biased parameter estimates in generalized linear models (GLMs). We develop and implement EM-type algorithms that take into account the effect of PRAM and obtain asymptotically unbiased estimators of parameters in GLMs when both covariates and response variables are subject to PRAM. The approach builds on the "EM by method of weights" in the missing data literature. Second, we extend the proposed methodology to deal with dependent covariates when estimating parameters in GLMs by relaxing the assumption of independence of covariates. This is done by modeling the distribution of the covariates subject to PRAM as a product of univariate conditional distributions. This approach advances the PRAM methodology by making it more applicable in practice, and it results in more accurate estimators of the regression parameters. Results from simulation studies and an application to the 1993 Current Population Survey are presented. Lastly, we address the issue of obtaining optimal PRAM matrices that produce safe files and maximize data utility with respect to a widely used utility measure for PRAM: entropy-based information loss (EBIL), a variant of Shannon's entropy. We show that for a certain class of PRAM matrices, EBIL displays monotonic properties, which implies that the minimum of EBIL occurs at an extreme point of the convex region that satisfies a pre-determined rule for safe files. Using these properties, we present an algorithm that obtains PRAM matrices that produce safe files with higher data utility than PRAM matrices obtained using built-in numerical methods and routines.
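As a rough illustration of the basic PRAM idea described in the abstract (misclassifying categorical values through a known transition matrix), the following minimal sketch may help. It is not taken from the dissertation: the function names, the integer coding of categories, and the example matrix are all assumptions made for illustration, and it does not implement the EM-type estimators or the EBIL-based optimization. It only shows the randomization step and the expected category proportions in the released file, which follow from the matrix being known.

```python
import random


def validate_pram_matrix(matrix, tol=1e-9):
    """Check that every row is a probability distribution over the categories."""
    k = len(matrix)
    for row in matrix:
        if len(row) != k:
            raise ValueError("PRAM matrix must be square")
        if any(p < 0 for p in row) or abs(sum(row) - 1.0) > tol:
            raise ValueError("each row must be non-negative and sum to 1")


def apply_pram(values, matrix, rng):
    """Randomly recode each value.

    Assumption for this sketch: categories are coded 0..k-1 and
    matrix[i][j] = P(released category = j | true category = i).
    """
    validate_pram_matrix(matrix)
    categories = range(len(matrix))
    return [rng.choices(categories, weights=matrix[v], k=1)[0] for v in values]


def expected_released_proportions(true_props, matrix):
    """Expected proportions after PRAM: sum over i of true_props[i] * matrix[i][j]."""
    validate_pram_matrix(matrix)
    k = len(matrix)
    return [sum(true_props[i] * matrix[i][j] for i in range(k)) for j in range(k)]


# Hypothetical 3-category example: each value is kept with probability 0.8.
EXAMPLE_MATRIX = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
```

Under these assumptions, calling `apply_pram(values, EXAMPLE_MATRIX, random.Random(0))` returns a perturbed copy of `values`, and `expected_released_proportions` shows how the released marginal distribution is shifted away from the true one. That shift is the kind of distortion an analysis of PRAM data has to account for.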