A first principle approach toward data privacy and utility

Open Access
Lin, Bing-rong
Graduate Program:
Computer Science and Engineering
Doctor of Philosophy
Document Type:
Date of Defense:
March 19, 2014
Committee Members:
  • Daniel Kifer, Dissertation Advisor
  • Daniel Kifer, Committee Chair
  • Wang Chien Lee, Committee Member
  • Adam Smith, Committee Member
  • David Miller, Committee Member
Keywords:
Data Privacy; Anonymization; Utility
Individual data are collected by governments and by online service providers such as social networks, search engines, and shopping websites. These vast amounts of data about users' activities and personal information are analyzed to improve the quality of service or to support scientific research. However, individuals' privacy may not be protected. Protection of privacy relies on privacy definitions: a privacy definition acts as a contract that constrains how algorithms may process sensitive data and produce non-sensitive, sanitized outputs. Most privacy definitions are constructed from intuition, and intuition alone can lead us astray. Examples include the releases of the AOL search logs and the Netflix Prize dataset: the released data were anonymized based on intuition, but only days later journalists and researchers found ways to re-identify people in the datasets.

One goal of this thesis is to provide a systematic approach for extracting the semantic guarantees of privacy definitions. Because most privacy definitions are constructed from intuition, in most cases it is not clear what they actually guarantee. To the best of our knowledge, we present the first general framework for extracting semantic guarantees from privacy definitions. The guarantees we extract are Bayesian in nature and concern changes in an attacker's beliefs. The framework can be applied to privacy definitions, or even to individual algorithms, to identify the types of inferences they defend against. We illustrate the use of our framework with analyses of several definitions and algorithms for which we derive previously unknown semantics.

The other goal of this thesis is to systematically study utility measures via the same first-principles approach. Privacy definitions constrain the behavior of privacy algorithms; utility measures help determine which privacy algorithm should be used to process the sensitive data.
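As a concrete illustration of the kind of Bayesian guarantee discussed above (this sketch is not taken from the thesis; it uses the standard binary randomized-response mechanism as an assumed example), one can compute an attacker's posterior belief about a sensitive bit after seeing the mechanism's output and check how far it moves from the prior. For an epsilon-differentially-private randomized response, the posterior-to-prior odds ratio is bounded by e^epsilon:

```python
import math

def rr_posterior(prior, eps, report):
    """Posterior P(X=1 | Y=report) for binary randomized response,
    which reports the true bit with probability p = e^eps / (1 + e^eps)."""
    p = math.exp(eps) / (1.0 + math.exp(eps))
    like1 = p if report == 1 else 1.0 - p        # P(Y=report | X=1)
    like0 = 1.0 - p if report == 1 else p        # P(Y=report | X=0)
    return prior * like1 / (prior * like1 + (1.0 - prior) * like0)

eps = math.log(3)        # privacy parameter (assumed for the example)
prior = 0.10             # attacker's prior belief that X = 1
post = rr_posterior(prior, eps, report=1)

prior_odds = prior / (1.0 - prior)
post_odds = post / (1.0 - post)
# The attacker's odds shift by exactly the likelihood ratio p/(1-p) = e^eps,
# so the change in beliefs is capped by the privacy parameter.
print(round(post_odds / prior_odds, 6))   # 3.0, i.e. e^eps
```

The point of the example is that the semantic guarantee is a statement about inference (how much an attacker's beliefs may change), not merely about the mechanism's randomness.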
In statistical privacy, utility refers to two concepts: information preservation, i.e., how much statistical information is retained by a sanitizing algorithm, and usability, i.e., how (and with how much difficulty) one extracts this information to build statistical models, answer queries, etc. We analyze the information-preserving properties of utility measures against a set of utility axioms and study how violations of an axiom can be fixed. We show that the average error of Bayesian decision makers forms the unique class of utility measures that satisfies all of the axioms. Since the axioms are agnostic to Bayesian concepts such as subjective probabilities, this result strengthens support for Bayesian views in privacy research. It also connects information preservation to an aspect of usability: if utility is measured in a Bayesian manner, Bayesian decision theory is a natural tool for working with sanitized data. An application to the sorted histogram problem shows that our decision-theoretic post-processing algorithm can estimate the underlying sorted histogram with much greater accuracy than prior work.
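To make the sorted histogram problem concrete (the thesis's decision-theoretic algorithm is not reproduced here; this is a simpler, assumed baseline), a common post-processing step is to project noisy counts back onto the space of nondecreasing sequences via the pool-adjacent-violators algorithm, which computes the least-squares monotone fit:

```python
def pava(y):
    """Pool Adjacent Violators: least-squares nondecreasing fit to y.
    Maintains a stack of (mean, weight) blocks, merging adjacent
    blocks whenever they violate monotonicity."""
    blocks = []                      # each entry: [block mean, block size]
    for v in y:
        blocks.append([float(v), 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    out = []
    for m, w in blocks:
        out.extend([m] * w)          # expand each block back to its cells
    return out

# Hypothetical noisy counts of a histogram known to be sorted ascending
# (e.g., after Laplace noise was added for privacy):
noisy = [1.2, 0.8, 3.5, 3.0, 5.1]
print([round(v, 6) for v in pava(noisy)])   # [1.0, 1.0, 3.25, 3.25, 5.1]
```

Projections of this kind never increase squared error relative to the true sorted histogram, which is one reason post-processing noisy outputs can recover substantial accuracy.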