"efficient Combinatorial Methods in Sparsification, Summarization and Testing of Large Datasets"

Open Access
- Author:
- Yaroslavtsev, Grigory
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- December 09, 2013
- Committee Members:
- Sofya Raskhodnikova, Dissertation Advisor/Co-Advisor
Sofya Raskhodnikova, Committee Chair/Co-Chair
Adam Smith, Committee Member
Piotr Berman, Committee Member
Jason Ryder Morton, Special Member - Keywords:
- algorithms
big data
sublinear algorithms
graph theory
sparsification
summarization - Abstract:
- Increasingly large amounts of structured data are being collected by personal computers, mobile devices, personal gadgets, sensors, etc., and stored in data centers operated by the government and private companies. Processing of such data to extract key information is one of the main challenges faced by computer scientists. Developing methods for constructing compact representations of large data is a natural way to approach this challenge. This thesis is focused on rigorous mathematical and algorithmic solutions for sparsi- cation and summarization of large amounts of information using discrete combinatorial methods. Areas of mathematics most closely related to it are graph theory, information theory and analysis of real-valued functions over discrete domains. These areas, somewhat surprisingly, turn out to be related when viewed through the computational lens. In this thesis we illustrate the power and limitations of methods for constructing small representations of large data sets, such as graphs and databases, using a variety methods drawn from these areas. The primary goal of sparsication, summarization and sketching methods discussed here is to remove redundancy in distributed systems, reduce storage space and save other resources by compressing the representation, while preserving the key structural properties of the original data or system. For example, the size of the network can be reduced by removing redundant nodes and links, while (approximately) preserving the distances or connectivities between the nodes. This thesis serves as an overview of results in [31, 30, 152, 38, 171, 33]. It also contains an extension of results in [152] obtained jointly with David Woodruff.