Secure Statistical Analysis on Vertically Distributed Databases

Open Access
- Author:
- Samizo, Yuji
- Graduate Program:
- Statistics
- Degree:
- Master of Science
- Document Type:
- Master Thesis
- Date of Defense:
- July 11, 2016
- Committee Members:
- Aleksandra B Slavkovic, Thesis Advisor/Co-Advisor
TBA1, Committee Member
TBA3, Committee Member
- Keywords:
- Statistical disclosure limitation
Distributed databases
Secure multiparty computation
Lasso regression
LARS algorithm
- Abstract:
- Integrating multiple databases that are distributed among different data owners can be beneficial in numerous contexts of statistical analysis. Unfortunately, the actual sharing of data is often impeded by concerns about data confidentiality. Such a situation requires tools that can produce correct results while minimizing the risk of disclosure. Over the past ten years, a number of "secure" protocols have been proposed to solve specific statistical problems, such as linear regression and classification, in a distributed setting. In this thesis, we first explore the disclosure risks associated with several existing protocols designed for the vertically partitioned database setting. We focus on the specific case where two parties are trying to perform logistic regression without actually combining their data. Although the protocols can be considered secure in the sense that there is no danger of either party's data being fully exposed, information leaks through the intermediate computations and through the estimated coefficients. We provide a detailed analysis of such cases. Second, we show how these previously proposed secure computation protocols can be applied to penalized regression methods, with a focus on the LARS algorithm used to perform Lasso regression. A protocol for the vertically partitioned database setting is described, along with a thorough discussion of possible disclosure risks and computation. We also provide a detailed description of how to perform model selection and of possible ways to extend our protocol to LARS-type algorithms for generalized linear models, such as logistic regression.