Privacy Preserving Methods in the Era of Big Data: New Methods and Connections

Open Access
- Author:
- Nixon, Michelle Anne
- Graduate Program:
- Statistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- October 07, 2020
- Committee Members:
- Aleksandra B Slavkovic, Dissertation Advisor/Co-Advisor
Aleksandra B Slavkovic, Committee Chair/Co-Chair
Naomi S Altman, Committee Member
Lingzhou Xue, Committee Member
Jennifer Lynne Van Hook, Outside Member
Ephraim Mont Hanks, Program Head/Chair
- Keywords:
- Statistical Data Privacy
Synthetic Data
Statistical Disclosure Control
Differential Privacy
Knockoffs
- Abstract:
- In this dissertation, we present three novel projects directly applicable to the data privacy literature. In our first project, we propose new methods for synthesizing data with heavy tails and heteroskedastic errors. These methods build on approaches from the general statistics literature for modeling heavy tails, including quantile regression, quantile random forests, mixture models, and composite models. We offer guidance for creating synthetic data products for heavy-tailed data and compare our proposed approaches to other methods for creating synthetic data via simulations and applications to the Census of Scotland and the Synthetic Longitudinal Business Database. In our second project, we propose a novel method for synthesizing differentially private contingency tables from a subset of summaries (such as all k-way marginals) using an augmented Bayesian latent class model. Our method achieves differential privacy by adding noise from the Geometric Mechanism to these summaries. Privacy is preserved because the remainder of our approach is a post-processing step: we model the underlying contingency table using a Bayesian latent class model but assume independence between counts to make estimation feasible, and we account for the additional privacy-preserving noise with a measurement error model. We apply the proposed methodology to a subset of the American Community Survey and compare it to two commonly used approaches from the literature. In our third project, we investigate connections between the knockoff method for false discovery rate control and synthetic data methods for data privacy. The knockoff method relies on specially constructed variables that obey specific relationships with the original data set. We propose that specific knockoff methods, such as fixed-X and model-X knockoffs, share an inherent connection with several previously proposed methods for creating synthetic data, and we provide both theoretical and empirical insights. To the best of our knowledge, these are the first such connections established in the literature.
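
The privacy guarantee in the second project rests on the Geometric Mechanism named in the abstract: integer summaries such as k-way marginal counts are released with two-sided geometric noise, and everything downstream (the latent class model and measurement error adjustment) is post-processing. As a point of reference only, below is a minimal Python sketch of that mechanism in its standard form for sensitivity-1 counting queries; the function names, parameters, and example counts are illustrative assumptions, not the dissertation's implementation, and the latent class post-processing step is not shown.

```python
import numpy as np

def two_sided_geometric(alpha, shape, rng):
    """Sample noise Z with P(Z = z) proportional to alpha**|z| (discrete Laplace)."""
    # Difference of two iid geometric(1 - alpha) variables, shifted to start at 0.
    g1 = rng.geometric(1.0 - alpha, size=shape) - 1
    g2 = rng.geometric(1.0 - alpha, size=shape) - 1
    return g1 - g2

def geometric_mechanism(counts, epsilon, sensitivity=1, seed=None):
    """Perturb integer counts (e.g., k-way marginals of a contingency table)
    so that their release satisfies epsilon-differential privacy."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=int)
    alpha = np.exp(-epsilon / sensitivity)  # smaller epsilon -> heavier noise
    return counts + two_sided_geometric(alpha, counts.shape, rng)

# Illustrative use: privatize a small vector of marginal counts (hypothetical values).
marginals = np.array([120, 45, 230, 18])
noisy_marginals = geometric_mechanism(marginals, epsilon=1.0, seed=0)
```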