Irregular Graph Algorithms on Modern Multicore, Manycore, and Distributed Processing Systems

Open Access
Slota, George Michael
Graduate Program:
Computer Science and Engineering
Doctor of Philosophy
Document Type:
Date of Defense:
February 22, 2016
Committee Members:
  • Kamesh Madduri, Dissertation Advisor
  • Kamesh Madduri, Committee Chair
  • Mahmut Taylan Kandemir, Committee Member
  • Wang Chien Lee, Committee Member
  • Soundar Kumara, Committee Member
  • Sivasankaran Rajamanickam, Special Member
  • graph mining
  • high performance computing
  • graph partitioning
  • subgraph isomorphism
Graph analysis is the study of real-world interaction data, be it through biological or chemical interaction networks, human social or communication networks, or other graph-representable datasets pervasive throughout the social and physical sciences. Due to increasing data sizes and complexities, it is important to develop efficient and scalable approaches for the algorithms, tools, and techniques used to study such data. Efficient utilization of the increasing heterogeneity and complexity of modern high performance computing systems is another major consideration for these efforts. The primary contributions of this thesis are as follows: First, parallel and scalable solutions to several basic graph analytics are presented. An implementation of the color-coding algorithm for subgraph isomorphism is introduced as FASCIA (Fast Approximate Subgraph Counting and Enumeration). Using several optimizations for work avoidance, memory usage reduction, and cache/data movement efficiency, FASCIA demonstrates up to a five orders-of-magnitude per-core speedup relative to prior art. FASCIA is able to calculate the counts of subgraphs up to 10 vertices on multi-billion edge graphs in minutes on a modest 16 node cluster and use these counts for a variety of analytics. Using FASCIA's baseline approach, FastPath is also introduced to find minimum weight paths in weighted networks. The Multistep method is next introduced as an approach for graph connectivity, weak connectivity, and strong connectivity, with a generalization of Multistep also presented for graph biconnectivity. The Multistep approaches are shown to demonstrate a 2-7x mean speedup relative to the prior state-of-the-art. A graph partitioner called PuLP (Partitioning using Label Propagation) is also introduced along with a general distributed graph layout strategy, DGL. PuLP was specifically designed to partition small-world graphs having skewed degree distributions, such as social interaction networks and web graphs. PuLP is able to partition such graphs an order of magnitude faster and with a fraction of the memory of other comparable partitioners (ParMETIS, PT-Scotch) while giving comparable partitions in terms of cut quality and balance. Additionally, this thesis presents how using techniques derived from these efforts, a suite of distributed graph analytics could be implemented and applied to the largest publicly-available web crawl of 3.5 billion pages and 130 billion links. End-to-end execution of analysis using these implementations completing in 20 minutes on only 256 nodes of the Blue Waters supercomputing system. Throughout this thesis, analyses of the algorithms and subroutines that comprise the Multistep, FASCIA/FastPath, and PuLP/DGL implementations is undertaken. Common optimizations are then identified (e.g., multiple levels of queues to match the memory hierarchy, techniques for non-blocking and asynchronous updates to shared data, efficient distributed communication patterns, among others) and their effects on performance are quantified. It is demonstrated how the optimization techniques can be utilized when processing under the higher degree of parallelism available in modern manycores (GPUs, Intel Xeon Phis) as well as how the techniques can be extended for more general-purpose graph processing in both the shared- and distributed-memory spaces. Also under consideration is the state of current hardware trends, with the goal of identifying how to modify and extend these general optimizations for forthcoming high performance computing architectures. Additionally, new optimizations and potential further research areas are introduced which might also be applicable for accelerating graph processing on these future systems.