"efficient Combinatorial Methods in Sparsification, Summarization and Testing of Large Datasets"

Open Access
Author:
Yaroslavtsev, Grigory
Graduate Program:
Computer Science and Engineering
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
December 09, 2013
Committee Members:
  • Sofya Raskhodnikova, Dissertation Advisor
  • Sofya Raskhodnikova, Committee Chair
  • Adam Smith, Committee Member
  • Piotr Berman, Committee Member
  • Jason Ryder Morton, Special Member
Keywords:
  • algorithms
  • big data
  • sublinear algorithms
  • graph theory
  • sparsification
  • summarization
Abstract:
Increasingly large amounts of structured data are being collected by personal computers, mobile devices, personal gadgets, sensors, etc., and stored in data centers operated by the government and private companies. Processing of such data to extract key information is one of the main challenges faced by computer scientists. Developing methods for constructing compact representations of large data is a natural way to approach this challenge. This thesis is focused on rigorous mathematical and algorithmic solutions for sparsi- cation and summarization of large amounts of information using discrete combinatorial methods. Areas of mathematics most closely related to it are graph theory, information theory and analysis of real-valued functions over discrete domains. These areas, somewhat surprisingly, turn out to be related when viewed through the computational lens. In this thesis we illustrate the power and limitations of methods for constructing small representations of large data sets, such as graphs and databases, using a variety methods drawn from these areas. The primary goal of sparsi cation, summarization and sketching methods discussed here is to remove redundancy in distributed systems, reduce storage space and save other resources by compressing the representation, while preserving the key structural properties of the original data or system. For example, the size of the network can be reduced by removing redundant nodes and links, while (approximately) preserving the distances or connectivities between the nodes. This thesis serves as an overview of results in [31, 30, 152, 38, 171, 33]. It also contains an extension of results in [152] obtained jointly with David Woodru ff.