Fusion-Aware Privacy and Warehousing for Healthcare Databases

Open Access
Ganta, Srivatsava Ranjit
Graduate Program:
Computer Science and Engineering
Doctor of Philosophy
Document Type:
Date of Defense:
October 21, 2008
Committee Members:
  • Raj Acharya, Dissertation Advisor
  • Raj Acharya, Committee Chair
  • Adam Smith, Committee Chair
  • Ganapati P Patil, Committee Member
  • Robert Collins, Committee Member
Keywords:
  • Information Fusion
  • Anonymization
  • Privacy
  • Auxiliary Information
Abstract:
In the current information era, data is being generated at an alarming rate, and the healthcare domain is no exception. Healthcare organizations collect and maintain data from every stage of care provided to each patient: general demographics and disease history at check-in, clinical and laboratory information during treatment, and finally follow-up and medical histories. In addition to patient data, they maintain large amounts of medical literature and genome-wide data such as DNA and protein sequences. Managing this volume of information is a daunting task for healthcare organizations. In this dissertation, we explore two key challenges faced in healthcare data management: 1. Data Privacy, and 2. Data Warehousing.
In the first and primary part, we take up the problem of data privacy. Data collected by healthcare organizations includes sensitive information, such as disease diagnoses, that should not be used or made available for non-medical purposes. However, healthcare organizations need to disseminate and distribute such data to promote research and disease studies. While this dissemination is beneficial, patient privacy is of foremost concern. Hence, privacy-preserving dissemination of information becomes an important problem. In this dissertation, we focus on privacy-preserving mechanisms for two specific dissemination scenarios: a) Data publishing, and b) Data sharing. In data publishing, we consider the challenge posed by what the research community has come to call auxiliary information. The problem occurs when sensitive data is published in anonymized form, and a potential adversary uses the published version to collect auxiliary information from other sources and then infers the hidden sensitive information.
Our contribution to this problem scenario is a new class of attacks based on auxiliary information, called information fusion based privacy attacks, in which an adversary fuses gathered auxiliary information with published anonymized releases to cause a privacy breach. We model two instances of such attacks: 1. Independent Release Based Attack, and 2. Web Based Attack. Our investigation of the effects of these attacks demonstrates that a large class of existing solutions is indeed vulnerable in such scenarios.
In the data sharing scenario, on the other hand, we consider the problem of privacy policy regulated data sharing among healthcare organizations. The problem here is that organizations may need to share sensitive data while following potentially conflicting privacy policies. We identify the properties of this problem with respect to the current healthcare system and design a solution based on the novel idea of sticky privacy policies.
In the second part, we take up the problem of data warehousing. Healthcare data is distributed among multiple independently controlled organizational entities such as hospitals, clinical labs, research centers, and government agencies. This limits the availability of such valuable data to researchers, who end up with only islands of data, and poses a serious challenge for a global study of a disease. Furthermore, data mining and knowledge discovery on these islands of data lead to limited results, because the data does not capture the intrinsic relationships among the inter-related sources. This brings out the need for data warehouse platforms that offer single-point access to patient, clinical, and genomic data from multiple sources, and for fusion based knowledge discovery tools that mine multiple inter-related data sets through information fusion.
In the final part of this dissertation, we present one such system, FUZEBASE, that delivers these functionalities for cancer research data as part of a consortium of cancer centers in the state of Pennsylvania.
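As a rough illustration of the independent release based attack mentioned in the abstract, the following Python sketch shows how an adversary might fuse two anonymized releases. It is not the model developed in the dissertation: the record layout, the generalization scheme (age ranges and ZIP prefixes), the toy data, and the helper names `matches` and `candidate_diagnoses` are all assumptions made for this example. It further assumes the adversary knows the victim's age and ZIP code, and knows that the victim appears in both releases with the same diagnosis.

```python
# Hypothetical sketch: fusing two independently anonymized releases.
# All field names, values, and helpers are illustrative assumptions.

def matches(record, victim):
    """True if the victim's exact attributes fall inside the record's generalized values."""
    low, high = record["age_range"]
    return (low <= victim["age"] <= high
            and victim["zip"].startswith(record["zip_prefix"]))

def candidate_diagnoses(release, victim):
    """Diagnoses the victim could have, judging from one release alone."""
    return {r["diagnosis"] for r in release if matches(r, victim)}

# Release published by one organization (ages generalized to decades).
release_a = [
    {"age_range": (30, 39), "zip_prefix": "168", "diagnosis": "flu"},
    {"age_range": (30, 39), "zip_prefix": "168", "diagnosis": "diabetes"},
    {"age_range": (30, 39), "zip_prefix": "168", "diagnosis": "asthma"},
    {"age_range": (40, 49), "zip_prefix": "168", "diagnosis": "cancer"},
]

# Release published independently by another organization,
# using different generalization boundaries.
release_b = [
    {"age_range": (35, 44), "zip_prefix": "16", "diagnosis": "diabetes"},
    {"age_range": (35, 44), "zip_prefix": "16", "diagnosis": "hepatitis"},
    {"age_range": (35, 44), "zip_prefix": "16", "diagnosis": "cancer"},
]

# Auxiliary information the adversary is assumed to hold about the victim.
victim = {"age": 37, "zip": "16801"}

from_a = candidate_diagnoses(release_a, victim)  # three candidates
from_b = candidate_diagnoses(release_b, victim)  # three candidates

# Fusion step: only diagnoses consistent with both releases remain.
fused = from_a & from_b
print(sorted(fused))
```

In this toy data each release on its own leaves three candidate diagnoses for the victim, while their intersection leaves only one, which is the kind of breach that anonymizing each release in isolation does not prevent.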