SINGLING-OUT VS. BLENDING-IN: OUTLIER DETECTION AND DIFFERENTIAL PRIVACY IN DATA

Restricted (Penn State Only)
Author:
Kuo, Yu Hsuan
Graduate Program:
Computer Science and Engineering
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
January 15, 2019
Committee Members:
  • Daniel Kifer, Dissertation Advisor
  • Daniel Kifer, Committee Chair
  • Clyde Lee Giles, Committee Member
  • Wang-Chien Lee, Committee Member
  • Zhenhui Li, Outside Member
Keywords:
  • outlier detection
  • outlier explanation
  • differential privacy
Abstract:
Advances in technology have enabled an explosion of data collection. Such datasets can be very noisy and often contain a large number of outliers, and a model trained on noisy data can be skewed by those outliers. In the first topic of this dissertation, we address the problem of detecting outliers in noisy, large-scale sensor datasets. Because sensor measurements are correlated, we formulate the task as a contextual outlier detection problem. The proposed solution is a robust regression model that explicitly models the outliers and detects them simultaneously with model fitting. It is also critical to provide human-interpretable rules that explain the outliers, because such rules can help people find the root cause of the outliers and fix the underlying problems. Our interpretation method uses many operations among attributes. It contains a type system that enforces the constraint that variables can only be combined in ways that result in meaningful units of measurement. The results show that our method is effective and could potentially facilitate the public use of large-scale sensor data. Many collected datasets involve personal data. Institutions such as governments and hospitals rely on analyzing such data to make decisions. However, analysis of sensitive data can raise privacy issues. The second part of this dissertation focuses on the publication of differentially private histograms. We consider the problem of releasing a class of queries named hierarchical count-of-counts histograms, which is motivated by the tables that are published in truncated form in Summary File 1 by the U.S. Decennial Census. The proposed solution uses isotonic regression for non-hierarchical count-of-counts histogram estimation. An optimal weighted matching algorithm is further used in publishing its hierarchical version. The experiments show that the performance of the proposed methods is data-dependent and that they are a good default choice.
We also empirically study algorithms for releasing unattributed histograms. Our evaluation results offer guidance on when each algorithm performs well and could thus help data owners select algorithms for publishing their data.
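
As an illustration of how isotonic regression can be applied to a noisy histogram, the sketch below releases an unattributed histogram (the sorted list of group counts) by adding Laplace noise and then projecting the result back onto non-decreasing sequences with the pool-adjacent-violators algorithm. It is a minimal sketch of a standard technique, not the dissertation's own algorithm. The function names, the example counts, and the assumption that the sorted count vector has L1 sensitivity 1 (each individual belongs to exactly one group) are all hypothetical choices made for this example.

```python
import random


def pava_nondecreasing(values):
    """Least-squares fit of a non-decreasing sequence (pool adjacent violators)."""
    blocks = []  # each block is [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge the last two blocks while their means violate the ordering.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)  # every element of a block gets the block mean
    return fitted


def private_sorted_counts(counts, epsilon, rng):
    """Noisy, order-consistent estimate of the sorted group counts.

    Assumption (not from the source): L1 sensitivity of the sorted counts is 1,
    so Laplace noise with scale 1/epsilon is added to each entry.
    """
    scale = 1.0 / epsilon
    noisy = [
        # The difference of two exponentials with mean `scale` is Laplace(0, scale).
        c + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        for c in sorted(counts)
    ]
    # Post-processing only: isotonic regression does not consume privacy budget.
    return pava_nondecreasing(noisy)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    print(private_sorted_counts([5, 1, 3, 3, 10], epsilon=1.0, rng=rng))
```

Because the true sorted counts are non-decreasing, enforcing that constraint on the noisy values can only move the estimate closer to them in squared error. A production release would also need a noise sampler that is safe against floating-point attacks, which this sketch does not attempt.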