Open Access
Wang, Yue
Graduate Program:
Computer Science and Engineering
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
February 26, 2019
Committee Members:
  • Daniel Kifer, Dissertation Advisor
  • Daniel Kifer, Committee Chair
  • Clyde Lee Giles, Committee Member
  • Sencun Zhu, Committee Member
  • Aleksandra B Slavkovic, Outside Member
Keywords:
  • Differential privacy
  • statistical data analyses
Differential privacy has been widely used for protecting sensitive data. When it comes to statistical hypothesis testing under differential privacy, earlier approaches either added too much noise, leading to a significant loss of power, or added a small amount of noise but failed to adjust the test to account for the added noise, resulting in unreliable results. We aim to conduct tests of independence, tests of sample proportions, and goodness-of-fit tests on tabular data that avoid these drawbacks while still providing valid results. Using an asymptotic regime better suited to privacy-preserving hypothesis testing, we showed a modified equivalence between the chi-squared and likelihood ratio tests, and used these tests for the three applications.

More generally, we studied the sampling distributions of statistics computed from data. In the non-private setting, when such statistics are used for hypothesis testing or confidence intervals, their true sampling distributions are often replaced by approximating distributions that are easier to work with (e.g., the Gaussian approximation justified by the Central Limit Theorem). When data are perturbed for differential privacy, the approximating distributions must be modified accordingly to account for the privacy noise. Prior works proposed various competing methods for creating such approximating distributions; although these worked well empirically, they lacked formal justification. We solved this problem by introducing a general asymptotic recipe for creating approximating distributions for differentially private statistics, providing finite-sample guarantees on the quality of the approximations as well as degradation results under postprocessing of the statistics. Beyond statistical analyses carried out on sensitive data, we also aimed to quantify the uncertainty of models trained on such data.
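The second failure mode described above, adding a small amount of noise but keeping the classical reference distribution, can be sketched as follows. This is an illustrative example only, not the dissertation's corrected test: the function name `naive_dp_chi2` and the Laplace noise scale are assumptions for this sketch.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def naive_dp_chi2(counts, probs, epsilon):
    # Perturb each histogram cell with Laplace noise; scale 2/epsilon
    # (L1 sensitivity 2 under a record swap) is an assumption here.
    noisy = counts + rng.laplace(scale=2.0 / epsilon, size=counts.shape)
    expected = counts.sum() * probs
    stat = np.sum((noisy - expected) ** 2 / expected)
    # Comparing against the classical chi-squared distribution ignores
    # the privacy noise -- exactly the "unreliable results" failure
    # mode, since the statistic is inflated but the threshold is not.
    p_naive = stats.chi2.sf(stat, df=len(counts) - 1)
    return stat, p_naive

counts = np.array([48.0, 26.0, 26.0])
stat, p = naive_dp_chi2(counts, np.array([0.5, 0.25, 0.25]), epsilon=1.0)
```

A corrected test, as pursued in the dissertation, would instead calibrate the rejection threshold to the distribution of the noisy statistic.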
With data protected by differential privacy, there are two sources of randomness: the randomness of the (non-private) data sampling process and the randomness of the privacy-preserving mechanism. We proposed a general framework for constructing confidence intervals for the model parameters of a variety of differentially private machine learning models, accounting for both sources of randomness. Specifically, we provided algorithms for models trained with objective perturbation and output perturbation. The algorithms work for both $\epsilon$-differential privacy and $\rho$-zCDP (zero-concentrated differential privacy). In another line of work, instead of focusing on application-specific adjustments needed to enforce privacy, we optimized an estimate computed from the perturbed data so that more accurate inferences can be drawn from it. Prior works showed that the accuracy of many queries can be improved by postprocessing the perturbed data to enforce consistency constraints known to hold for the original data; this problem is commonly formulated as a least squares minimization. However, such methods do not exploit the distribution of the noise used to perturb the data, so we instead applied maximum likelihood estimation with constraints to further improve performance. Moreover, we proposed a general framework based on the alternating direction method of multipliers (ADMM) to solve such formulations efficiently; it also allows re-use of existing efficient solvers for the least squares approach. We tested the proposed methods on a variety of datasets with extensive experiments, and pointed out their strengths as well as their limitations.
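A minimal sketch of the least-squares postprocessing idea, solved with ADMM, is shown below. This is not the dissertation's algorithm (which uses constrained maximum likelihood and a more general framework); the function name and the particular consistency constraints chosen here (nonnegative counts with a known total) are assumptions for illustration.

```python
import numpy as np

def postprocess_counts(noisy, total, rho=1.0, iters=200):
    """Project noisy counts onto {x >= 0, sum(x) = total} in the
    least-squares sense, via ADMM with splitting x = z."""
    k = len(noisy)
    z = np.maximum(noisy, 0.0)   # initial feasible-ish guess
    u = np.zeros(k)              # scaled dual variable
    for _ in range(iters):
        # x-update: minimize 0.5||x - noisy||^2 + (rho/2)||x - z + u||^2
        # subject to sum(x) = total (closed form: shift by a constant).
        v = (noisy + rho * (z - u)) / (1.0 + rho)
        x = v + (total - v.sum()) / k
        # z-update: projection onto the nonnegative orthant.
        z = np.maximum(x + u, 0.0)
        # dual update.
        u += x - z
    return z

noisy = np.array([30.2, -4.5, 70.1, 10.0])  # e.g., Laplace-perturbed counts
fixed = postprocess_counts(noisy, total=100.0)
```

At convergence the iterate satisfies both constraints, so negative noisy cells are zeroed out and the remaining mass is redistributed, which is the consistency enforcement described above.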