Open Access
Wang, Yue
Graduate Program:
Computer Science and Engineering
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
February 26, 2019
Committee Members:
  • Daniel Kifer, Dissertation Advisor
  • Daniel Kifer, Committee Chair
  • Clyde Lee Giles, Committee Member
  • Sencun Zhu, Committee Member
  • Aleksandra B Slavkovic, Outside Member
Keywords:
  • Differential privacy
  • statistical data analyses
Differential privacy has been widely used for protecting sensitive data. When it comes to statistical hypothesis testing under differential privacy, earlier approaches either added too much noise, leading to a significant loss of power, or added a small amount of noise but failed to adjust the test to account for the added noise, resulting in unreliable results. We aim to conduct tests of independence, tests of sample proportions, and goodness-of-fit tests on tabular data that avoid these drawbacks while still providing valid results. Using an asymptotic regime better suited to privacy-preserving hypothesis testing, we showed a modified equivalence between the chi-squared and likelihood ratio tests, and used these tests for the three applications.

More generally, we studied the sampling distributions of statistics computed from data. In the non-private setting, when such statistics are used for hypothesis testing or confidence intervals, their true sampling distributions are often replaced by approximating distributions that are easier to work with (e.g., the Gaussian approximation justified by the Central Limit Theorem). When data are perturbed for differential privacy, the approximating distributions must be modified accordingly to account for the privacy noise. Prior works proposed various competing methods for creating such approximating distributions; although these worked well empirically, they lacked formal justification. We solved this problem by introducing a general asymptotic recipe for creating approximating distributions for differentially private statistics, providing finite-sample guarantees on the quality of the approximations as well as degradation results under postprocessing of the statistics. Beyond statistical analyses carried out on sensitive data, we also aimed to quantify the uncertainty of models trained on such data.
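The second failure mode described above, adding a small amount of noise but keeping the classical reference distribution, can be sketched as follows. This is an illustrative example only, not the dissertation's corrected test: the function name `naive_dp_chi2` and the Laplace noise scale are assumptions for this sketch.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def naive_dp_chi2(counts, probs, epsilon):
    # Perturb each histogram cell with Laplace noise; scale 2/epsilon
    # (L1 sensitivity 2 under a record swap) is an assumption here.
    noisy = counts + rng.laplace(scale=2.0 / epsilon, size=counts.shape)
    expected = counts.sum() * probs
    stat = np.sum((noisy - expected) ** 2 / expected)
    # Comparing against the classical chi-squared distribution ignores
    # the privacy noise -- exactly the "unreliable results" failure
    # mode, since the statistic is inflated but the threshold is not.
    p_naive = stats.chi2.sf(stat, df=len(counts) - 1)
    return stat, p_naive

counts = np.array([48.0, 26.0, 26.0])
stat, p = naive_dp_chi2(counts, np.array([0.5, 0.25, 0.25]), epsilon=1.0)
```

A corrected test, as pursued in the dissertation, would instead calibrate the rejection threshold to the distribution of the noisy statistic.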
With data protected by differential privacy, there are two sources of randomness: the randomness of the (non-private) data sampling process and the randomness of the privacy-preserving mechanism. We proposed a general framework for constructing confidence intervals for the model parameters of a variety of differentially private machine learning models, accounting for both sources of randomness. Specifically, we provided algorithms for models trained with objective perturbation and output perturbation. The algorithms work for both $\epsilon$-differential privacy and $\rho$-zCDP (zero-concentrated differential privacy). In another line of work, instead of focusing on application-specific adjustments needed to enforce privacy, we optimized an estimate computed from the perturbed data so that more accurate inferences can be drawn from it. Prior works showed that the accuracy of many queries can be improved by postprocessing the perturbed data to enforce consistency constraints known to hold for the original data; this problem is commonly formulated as a least squares minimization. However, such methods do not exploit the distribution of the noise used to perturb the data, so we instead applied maximum likelihood estimation with constraints to further improve performance. Moreover, we proposed a general framework based on the alternating direction method of multipliers (ADMM) to solve such formulations efficiently; it also allows re-use of existing efficient solvers for the least squares approach. We tested the proposed methods on a variety of datasets with extensive experiments, and pointed out their strengths as well as their limitations.
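A minimal sketch of the least-squares postprocessing idea, solved with ADMM, is shown below. This is not the dissertation's algorithm (which uses constrained maximum likelihood and a more general framework); the function name and the particular consistency constraints chosen here (nonnegative counts with a known total) are assumptions for illustration.

```python
import numpy as np

def postprocess_counts(noisy, total, rho=1.0, iters=200):
    """Project noisy counts onto {x >= 0, sum(x) = total} in the
    least-squares sense, via ADMM with splitting x = z."""
    k = len(noisy)
    z = np.maximum(noisy, 0.0)   # initial feasible-ish guess
    u = np.zeros(k)              # scaled dual variable
    for _ in range(iters):
        # x-update: minimize 0.5||x - noisy||^2 + (rho/2)||x - z + u||^2
        # subject to sum(x) = total (closed form: shift by a constant).
        v = (noisy + rho * (z - u)) / (1.0 + rho)
        x = v + (total - v.sum()) / k
        # z-update: projection onto the nonnegative orthant.
        z = np.maximum(x + u, 0.0)
        # dual update.
        u += x - z
    return z

noisy = np.array([30.2, -4.5, 70.1, 10.0])  # e.g., Laplace-perturbed counts
fixed = postprocess_counts(noisy, total=100.0)
```

At convergence the iterate satisfies both constraints, so negative noisy cells are zeroed out and the remaining mass is redistributed, which is the consistency enforcement described above.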