Exploiting Sparsity, Structure, and Geometry for Knowledge Discovery

Open Access
Author:
Chatterjee, Anirban
Graduate Program:
Computer Science and Engineering
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
June 07, 2011
Committee Members:
  • Padma Raghavan, Dissertation Advisor
  • Padma Raghavan, Committee Chair
  • Mahmut Taylan Kandemir, Committee Member
  • Suzanne Michelle Shontz, Committee Member
  • Kateryna Dmytrivna Makova, Committee Member
Keywords:
  • sparse graph embedding
  • sparse graph partitioning
  • data mining
  • sparse linear solvers
Abstract:
Data-driven discovery seeks to obtain a computational model of the underlying process using observed data on a large number of variables. Observations can be viewed as points in a high-dimensional space with coordinates given by values of the variables. It is common for observations to have nonzero values in only a few dimensions, i.e., the data are sparse. We seek to exploit the sparsity of the data by interpreting the observations as a sparse graph and manipulating its geometry (or embedding) in a high-dimensional space. Our goal is to obtain algorithms that demonstrate high accuracy for key problems in data analysis and parallel scientific computing. The first part of this dissertation focuses on combining geometry and sparse graph structure to yield more accurate classification algorithms for data mining. We have developed a feature subspace transformation (FST) scheme that transforms the data iteratively and provides a better selection of features to improve unsupervised classification, also known as clustering. FST utilizes the combinatorial structure of the entity relationship graph and high-dimensional geometry of the sparse data to iteratively bring related entities closer, which enhances clustering. Our approach improves clustering quality relative to established schemes such as K-Means and multilevel K-Means (GraClus). Next we consider transformations to enhance supervised classification with prelabeled data. We obtain similarity graph neighborhoods (SGN) in the high-dimensional feature subspace of the training data and transform it by determining displacements for each entity. Our SGN classifier is a supervised learning scheme that is trained on these transformed data. The goal of our SGN transform is to increase the separation between means of different classes, such that the classifier learns a better boundary. Our results indicate that a linear discriminant and support vector machine classification on these SGN transformed data improves accuracy by 5.0% and 4.52%, respectively. The second part of this dissertation focuses on utilizing geometry and the structure of sparse graphs to enhance the quality and performance of algorithms for parallel scientific computing. We develop a parallel scheme, ScalaPart, for partitioning a large sparse graph into k subgraphs such that the number of cross edges is reduced. Our scheme combines a parallel graph embedding with a parallel geometric partitioning scheme to yield a scalable approach suitable for multicore-multiprocessors. Our analysis of ScalaPart demonstrates its scalability, and our empirical evaluation indicates that the performance of ScalaPart and the quality of cuts compares well with established schemes such as ParMetis and PT-Scotch. Next, we consider scalable sparse linear system solution which plays a key role in many applications including data mining with support vector machines and partial differential equation-based modeling and simulation. Our graph partitioning algorithm can be used to yield a nested dissection fill-reducing ordering and a tree for structuring numeric computations for the solution of sparse linear systems. We develop a hybrid linear solver that couples tree-structured direct and iterative solves for repeated right-hand side solutions. Our results indicate that our hybrid solver is 1.87 times faster than preconditioned conjugate gradients (PCG) based methods for achieving same levels of accuracy. In conclusion, this dissertation demonstrates that by combining geometric and combinatorial properties of sparse graphs and matrices, we can enhance algorithms that are commonly used in knowledge discovery through modeling and simulation.