Theoretical and Applied Problems in Partially Private Data

Open Access
- Author:
- Seeman, Jeremy
- Graduate Program:
- Statistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- February 06, 2023
- Committee Members:
- Murali Haran, Program Head/Chair
Bharath Kumar Sriperumbudur, Major Field Member
Matthew Reimherr, Co-Chair & Dissertation Advisor
Aleksandra Slavkovic, Co-Chair & Dissertation Advisor
Daniel Kifer, Outside Unit & Field Member
- Keywords:
- statistical data privacy
missing data
Bayesian computation
computational social science
science and technology studies
- Abstract:
- Research in statistical data privacy (SDP) has traditionally self-organized into two disjoint schools of thought: statistical disclosure limitation (SDL) and formal privacy (FP). The two perspectives rely on different units of analysis, measures of disclosure risk, and adversarial assumptions. Yet in recent years, differential privacy (DP), a particular variant of FP, has emerged as the methodologically preferred perspective because it analyzes release mechanisms and database schemas under the broadest possible adversarial assumptions. To do so, DP quantifies privacy loss by analyzing the noise injected into output statistics; for non-trivial statistics, this noise is necessary to ensure finite privacy loss. However, data curators frequently release collections of statistics in which only some are produced by DP mechanisms and the rest are released without additional randomized noise. This includes many cases where DP mechanisms are implemented in ways that depend on the confidential data, such as choosing the privacy loss parameter based on the confidential data (or on synthetic data highly correlated with it). Consequently, DP alone cannot characterize the privacy loss attributable to the entire joint collection of releases, nor to the decisions made in implementing the mechanism. Such problems raise questions that DP alone cannot answer and pose an existential threat to building DP systems in practice.

In this dissertation, we study the privacy and utility properties of ``partially private data'' (PPD), collections of statistics of which only some are released through DP mechanisms. In particular, we define the random variable $Z$ as ``public information'' not protected by DP. PPD is inherently statistical, as it relies on assumptions about the correlation structure between private and public information. We present a privacy formalism, $(\epsilon, \{ \Theta_z\}_{z \in \mathcal{Z}})$-Pufferfish ($\epsilon$-TP for short when $\{ \Theta_z\}_{z \in \mathcal{Z}}$ is implied), a collection of Pufferfish mechanisms indexed by realizations of $Z$. First, we prove that this definition has properties similar to those of DP. Next, we introduce two release mechanisms for publishing PPD that satisfy $\epsilon$-TP and prove their desirable properties. We additionally introduce perfect sampling algorithms to implement these mechanisms exactly, as well as approximate Bayesian computation (ABC) algorithms for sampling from the posterior of a parameter given PPD. We then compare this inference approach to the common alternative in which noisy statistics are deterministically combined with $Z$, and we derive mild conditions under which our algorithms offer both theoretical and computational improvements over that approach.

We demonstrate all of the above on two case studies: one on COVID-19 data and one on rural mortality data. Finally, we discuss the implications of this work from a social and legal perspective, with the end goal of using PPD to make FP technologies more accessible to essential social science data curators.
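To illustrate the inferential setting described above, the following is a minimal sketch (not the dissertation's algorithm) of ABC rejection sampling for a parameter given partially private data: one statistic released through the Laplace mechanism under $\epsilon$-DP alongside an exactly released public statistic $Z$. The toy data-generating model, the uniform prior, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical confidential dataset: n binary records with unknown rate theta.
n = 500
theta_true = 0.3
data = rng.binomial(1, theta_true, size=n)

# Public statistic Z released without noise (here: the record count).
Z = n

# DP statistic: the sum released via the Laplace mechanism (sensitivity 1).
epsilon = 1.0
noisy_sum = data.sum() + rng.laplace(scale=1.0 / epsilon)

def abc_posterior(noisy_sum, Z, epsilon, n_draws=200_000, tol=2.0):
    """Crude ABC rejection sampling: draw theta from the prior, replay the
    release process, and keep draws whose simulated noisy sum is close to
    the observed release."""
    theta = rng.uniform(0.0, 1.0, size=n_draws)              # uniform prior on theta
    sims = rng.binomial(Z, theta).astype(float)               # simulate confidential sums given Z
    sims += rng.laplace(scale=1.0 / epsilon, size=n_draws)    # replay the Laplace mechanism
    keep = np.abs(sims - noisy_sum) <= tol                    # accept draws near the observed value
    return theta[keep]

posterior_draws = abc_posterior(noisy_sum, Z, epsilon)
print(f"posterior mean ~ {posterior_draws.mean():.3f} "
      f"(true theta = {theta_true}, accepted {posterior_draws.size} draws)")
```

Because the sampler conditions on the exact public statistic $Z$ while treating the DP release as a noisy observation, it captures the basic flavor of inference from PPD; the dissertation's methods replace this crude rejection step with principled perfect sampling and ABC algorithms.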