Sampling contingency tables given sets of marginals and/or conditionals in the context of statistical disclosure limitation
Open Access
- Author:
- Lee, Juyoun
- Graduate Program:
- Statistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- September 01, 2009
- Committee Members:
- Aleksandra B Slavkovic, Dissertation Advisor/Co-Advisor
Aleksandra B Slavkovic, Committee Chair/Co-Chair
Donald Richards, Committee Member
Murali Haran, Committee Member
Yuri Zarhin, Committee Member
- Keywords:
- Algebraic Statistics
Stochastic Computation and Simulation Methods (MCM, Monte Carlo)
Categorical Data Analysis
Contingency Tables
Statistical Disclosure Limitation Method
- Abstract:
- Federal agencies and other organizations often publish data summarized in arrays of non-negative integers, called contingency tables. When such data are released, it is necessary to prevent sensitive information pertaining to individuals from being disclosed. In statistical disclosure limitation, we must maintain a balance between disclosure risk and the data utility needed to make valid statistical inferences. One method for achieving this balance is to release partial information about the original data. In practice, many agencies release data summarized in the form of marginal sums or conditional probabilities. Sampling methods for multi-way contingency tables given a set of observed marginal sums have been studied in diverse ways; yet, there is almost no literature about sampling tables given a set of observed conditional probabilities. In this thesis, we focus on a set of conditional probabilities instead of marginal sums. We propose MCMC simulation schemes coupled with tools from algebraic statistics to sample tables from the sets of possible tables given observed conditional values. We also propose a simple extension to the case where a combination of observed marginal totals and conditional values is given. These algorithms can be used to compute posterior distributions and to assess data utility and disclosure risk in the context of statistical disclosure limitation. We demonstrate the proposed algorithms with simple examples and discuss their advantages and disadvantages. In addition, the proposed sampling algorithms can be used for releasing synthetic contingency tables. We study both the disclosure risk and data utility associated with the proposed synthetic tabular data releases.
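
For orientation, the well-studied marginal-sum case mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm for conditional probabilities. It assumes a two-way table with fixed row and column sums, a uniform target distribution over the feasible tables, and the standard "basic moves" that add +1/-1 in a 2x2 sub-rectangle. The function names `margins`, `mcmc_step`, and `sample_table` are hypothetical and chosen only for this example.

```python
import random


def margins(table):
    """Return (row sums, column sums) of a two-way table given as a list of rows."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return rows, cols


def mcmc_step(table, rng):
    """Apply one basic move in place; stay put if it would create a negative cell.

    The move adds eps at (i1, j1) and (i2, j2) and subtracts eps at
    (i1, j2) and (i2, j1), so every row sum and column sum is unchanged.
    """
    i1, i2 = rng.sample(range(len(table)), 2)
    j1, j2 = rng.sample(range(len(table[0])), 2)
    eps = rng.choice((1, -1))
    if eps == 1:
        decreased = (table[i1][j2], table[i2][j1])
    else:
        decreased = (table[i1][j1], table[i2][j2])
    if min(decreased) == 0:
        return  # proposal leaves the set of non-negative tables: reject
    table[i1][j1] += eps
    table[i2][j2] += eps
    table[i1][j2] -= eps
    table[i2][j1] -= eps


def sample_table(start, n_steps, seed=None):
    """Run the walk for n_steps from `start` and return the final table.

    The proposal is symmetric, so accepting every feasible move targets the
    uniform distribution on tables sharing the margins of `start`
    (an assumption of this sketch; other targets need a Metropolis ratio).
    """
    if len(start) < 2 or len(start[0]) < 2:
        raise ValueError("table must be at least 2 x 2")
    rng = random.Random(seed)
    table = [list(row) for row in start]  # copy so the input is not modified
    for _ in range(n_steps):
        mcmc_step(table, rng)
    return table


if __name__ == "__main__":
    observed = [[3, 1, 2], [0, 4, 1], [2, 2, 5]]
    draw = sample_table(observed, n_steps=5000, seed=1)
    print(draw, margins(draw) == margins(observed))
```

Repeated draws from such a walk give an empirical picture of which tables are consistent with the released margins, which is the kind of quantity used when assessing disclosure risk. Handling released conditional probabilities, as the abstract describes, requires a different set of moves that this sketch does not attempt.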