RDF3X-MPI: A Partitioned RDF engine for Data-Parallel SPARQL Querying

Open Access
Author:
Chirravuri, Sai Krishnan
Graduate Program:
Computer Science and Engineering
Degree:
Master of Science
Document Type:
Master Thesis
Date of Defense:
July 28, 2014
Committee Members:
  • Kamesh Madduri, Thesis Advisor
  • Piotr Berman, Thesis Advisor
Keywords:
  • RDF3x
  • MPI
  • RDF
  • SPARQL
  • DISTRIBUTED
  • IN-MEMORY
Abstract:
The Semantic Web is a collection of technologies that facilitate universal access to linked data. The Resource Description Framework (RDF) model is one such technology that is being developed by the World Wide Web Consortium (W3C). A common representation of RDF data is as a set of triples. Each triple contains three fields: a subject, a predicate, and an object. A collection of triples can also be visualized as a directed graph, with subjects and objects as vertices in the graph, and predicates as edges connecting the vertices. When large collections of triples are aggregated, they form massive RDF graphs. Collections of RDF triple data sets have been growing over the past decade, and publicly available RDF data sets now have billions of triples. As data sizes continue to grow, the time to process and query large RDF data sets also continues to increase. This work presents RDF3x-MPI, a new scalable, parallel RDF data management and querying system based on the RDF3x data management system. RDF3x (RDF Triple eXpress) is a state-of-the-art RDF engine that has been shown to outperform alternatives by one or two orders of magnitude on several well-known benchmarks and in experimental studies. Our approach leverages all the data storage, indexing, and querying optimizations in RDF3x. We additionally partition input RDF data to support parallel data ingestion, and devise a methodology to execute SPARQL queries in parallel, with minimal inter-processor communication. Using our new approach, we demonstrate a performance improvement of up to 12.9× in query evaluation for the LUBM benchmark, using 32-way MPI task parallelism. This work also presents an in-depth characterization of SPARQL query execution times with RDF3x and RDF3x-MPI on several large-scale benchmark instances.
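The triple-and-graph view described in the abstract can be illustrated with a minimal sketch. The triples, the names `build_graph` and `vertices`, and the adjacency-list layout are all hypothetical choices made for illustration. None of them are taken from the thesis, and this does not reflect how RDF3x or RDF3x-MPI store or index data.

```python
from collections import defaultdict

# Hypothetical example triples (subject, predicate, object); not from the thesis.
triples = [
    ("alice", "advisor", "bob"),
    ("alice", "memberOf", "dept1"),
    ("bob", "worksFor", "dept1"),
]


def build_graph(triples):
    """Map each subject to its outgoing (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        # The predicate labels a directed edge from subject to object.
        graph[subject].append((predicate, obj))
    return graph


def vertices(triples):
    """Collect every subject and object: the vertex set of the graph."""
    return {s for s, _, _ in triples} | {o for _, _, o in triples}


graph = build_graph(triples)
```

Here each triple contributes one labeled, directed edge, so a data set with billions of triples corresponds to a graph with billions of edges.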