STATISTICAL DATA PRIVACY METHODS FOR INCREASING RESEARCH OPPORTUNITIES

Open Access
- Author:
- Snoke, Joshua Valor
- Graduate Program:
- Statistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- May 11, 2018
- Committee Members:
- Aleksandra B Slavkovic, Dissertation Advisor/Co-Advisor
Naomi S Altman, Committee Chair/Co-Chair
Matthew Logan Reimherr, Committee Member
Timothy Raymond Brick, Committee Member
Timothy Raymond Brick, Outside Member
- Keywords:
- Statistical Data Privacy
Disclosure Control
Multiparty Computation
Differential Privacy
Partitioned Data
Synthetic Data
- Abstract:
- In this dissertation, we develop statistical methods for providing access to sensitive data, with the goal of simultaneously protecting individuals’ privacy and enabling high-quality research. In addition to the theoretical contributions we provide to the area of statistical data privacy, our work is motivated by collaborations with practitioners and real policy problems, and as such is meant to be highly practical and easy to implement. We present two alternative paradigms for providing researchers access to sensitive data that build on ideas from statistical disclosure control (SDC) methodology, and on techniques of secure multiparty computation (SMPC) and differential privacy (DP) from computer science. First, under the SMPC framework we develop an algorithm for computing secure maximum likelihood estimates (MLE) over partitioned databases without sharing any data or intermediate statistics. This is motivated by the scenario where different entities (or individuals) hold separate partitions of data, and researchers wish to obtain model estimates or statistics utilizing all the data, which cannot be combined. We show that under a certain set of assumptions our method for estimation across these partitions achieves results identical to estimation with the full data, but without violating privacy. We demonstrate the utility of the algorithm through simulations and the estimation of structural equation models with real data, and point out that it is more widely applicable to factor models, linear regression, and PCA. Second, we provide new theoretical results for the utility evaluation of synthetic data based on its distributional similarity to the original data. The release of synthetic data is motivated by the desire for researchers to have downloadable microdata which they can use for exploration and model testing but that do not violate privacy.
We derive new theoretical results for the propensity score mean-squared-error (pMSE) utility measure, and demonstrate how its use can improve the choice of synthetic data models. We further combine the pMSE with differentially private methodology to produce synthetic data that maximize distributional similarity under the constraints of epsilon-DP. This ensures not only that we release synthetic data that offer high utility, but also that the release carries quantifiable and provable privacy protections for the individuals in the data.
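
As a rough illustration of the pMSE idea mentioned above, the sketch below stacks the original and synthetic records, fits a propensity model that tries to tell them apart, and averages the squared distance between each estimated propensity and the synthetic share c. It assumes the standard definition pMSE = (1/N) * sum((p_i - c)^2). The function names (`fit_logistic`, `propensity`, `pmse`), the main-effects logistic model, and the plain gradient-ascent fit are assumptions made for this example and are not taken from the dissertation.

```python
import math


def propensity(w, x):
    """Estimated probability that row x is synthetic, under weights w."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, n_iter=500):
    """Main-effects logistic regression fitted by simple gradient ascent.

    This is a stand-in propensity model chosen for brevity; any
    classifier that outputs probabilities could be substituted.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)  # w[0] is the intercept
    for _ in range(n_iter):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = yi - propensity(w, xi)
            grad[0] += err
            for j in range(d):
                grad[j + 1] += err * xi[j]
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w


def pmse(original, synthetic, **fit_kwargs):
    """Propensity score mean-squared error between two numeric datasets.

    Values near 0 mean the model cannot distinguish synthetic rows from
    original rows; larger values indicate lower distributional similarity.
    """
    X = original + synthetic
    y = [0] * len(original) + [1] * len(synthetic)  # 1 marks synthetic rows
    c = len(synthetic) / len(X)                     # share of synthetic rows
    w = fit_logistic(X, y, **fit_kwargs)
    return sum((propensity(w, xi) - c) ** 2 for xi in X) / len(X)
```

Under this sketch, a synthetic dataset whose distribution matches the original should score close to zero, while a poorly matched one scores higher, up to c(1 - c) when the two sets are perfectly separable. Comparing candidate synthesis models then amounts to comparing their pMSE values on the same original data.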