=Paper=
{{Paper
|id=Vol-1644/paper38
|storemode=property
|title=Query Planning for Evaluating SPARQL Property Paths
|pdfUrl=https://ceur-ws.org/Vol-1644/paper38.pdf
|volume=Vol-1644
|authors=Nikolay Yakovets,Parke Godfrey,Jarek Gryz
|dblpUrl=https://dblp.org/rec/conf/amw/YakovetsGG16
}}
==Query Planning for Evaluating SPARQL Property Paths==
Query Planning for Evaluating SPARQL Property Paths Nikolay Yakovets, Parke Godfrey, and Jarek Gryz Department of Computer Science and Engineering, York University, Canada {hush,godfrey,jarek}@cse.yorku.ca Abstract. The extension of SPARQL in version 1.1 with property paths offers a type of regular path query for RDF graph databases. Such queries are difficult to optimize and evaluate efficiently, however. We have em- barked on a project, Waveguide, to build a cost-based optimizer for SPARQL queries with property paths. Waveguide builds a query plan— which we call a waveplan (WP)—which guides the query evaluation. There are numerous choices in the construction of a plan, and a number of optimization methods, so the space of plans for a query can be quite large. Execution costs of plans for the same query can vary by orders of magnitude. A WP’s costs can be estimated, which opens the way to cost- based optimization. We demonstrate that the plan space of Waveguide properly subsumes existing techniques and that the new plans it adds are relevant. 1 Introduction & Motivation Graph data is becoming rapidly prevalent with the rise of the Semantic Web, social networks, and data-driven exploration in life sciences. There is a need for natural, expressive ways to query over these graphs. Standards are coming into place for this. The Resource Description Framework (RDF) [9] provides a data model for graph data. An RDF store is a set of triples that describes a directed, edge-labeled multi-graph. A triple, hs, r, oi, denotes an edge from node “s” (the subject) to node “o” (the object), with the edge labeled by “r” (the role, also called label or predicate). The SPARQL query language [8] correspondingly provides a formal means to query over RDF stores. A query defines sub-graph match criteria; its evaluation over an RDF store returns all embedded sub-graphs or variable bindings meeting the criteria. For example, the query ?friend :friendOf Charles . (Q1) evaluates to a list of people (nodes binding to variable “?friend”) who are friends of (role “:friendOf”) “Charles” (a named node, so a constant). In its latest version, 1.1, SPARQL’s expressiveness is extended with property paths [5]. This effectively introduces the concept of regular path queries (RPQs)— well studied before the advent of RDF and SPARQL—into the query language. Instead of specifying the path of interest explicitly between nodes, one may now specify it implicitly via a regular expression. For example, the query ?friend :friendOf+ Charles . (Q2) evaluates to a list of people who are friends of “Charles”, or friends of people who are friends of “Charles”, and so forth (that is, a transitive closure over “:friendOf”). While SPARQL provides the expressiveness we desire, such queries are more challenging to optimize well. Query Q1 could be evaluated just by extracting the triples with “r = :friendOf” and “o = Charles”. For even a slightly more complicated query, however, it may not be straightforward to find a plan to evaluate it efficiently. Q2 is more challenging in requiring transitive closure over “:friendOf” with respect to the graph. Property path evaluation is a tale of two methods: two quite different ap- proaches appear in the literature. For RPQs, the seminal work [7] which intro- duced the G+ query language shows how to use a finite state machine effectively as a plan to guide the graph walk for the query’s evaluation. We call this approach FA. Subsequent work on RPQs has followed this idea. SPARQL with property paths is much more recent. Systems for SPARQL query evaluation have followed the second approach, based primarily on the seminal work of [6]. These extend the relational algebra to accommodate the translation of a property path’s regular expression, and then use dynamic pro- gramming over the (extended) relational-algebra parse to devise a plan. Added is an “α” operator, which provides the transitive closure over a relation to ac- commodate regular expressions’ Kleene star. Thus, we call this approach α-RA, the relational algebra extended by “α” [1]. Which approach is better? We shall show that the effective “plan spaces” that result from FA and α-RA are incomparable. Sometimes, for a given query and graph, an FA plan will be the better choice. Other times, an α-RA plan will be. Our goal is to formalize the notion of plan space for both, to be able to choose the best plan. We shall show that a richer plan space can be had that properly subsumes FA and α-RA, and offers more plans existing in neither (“mixed” plans), which sometimes are the best plans. We have designed a system called Waveguide [12] with the goal to provide viable cost-based query optimization and evaluation for SPARQL over RDF stores that is on par with the state of the art for relational database systems. In [12], we address the first but critical step of this endeavor, defining a plan space—the space of query plans, waveplans (WPs)—for SPARQL queries. We focus on single-path, property-path queries, essentially the RPQ fragment of SPARQL 1.1. We consider a set semantics—the distinct directive in each query— and thus do not consider aggregation. In [12], our contributions are as follows. 1. plan space. (a) Summarize the state of the art for evaluation of RPQs and SPARQL property paths. (b) Establish why none suffices. (c) Devise Waveguide’s plan space, and demonstrate that it subsumes the state of the art, and extends beyond it. (d) Model the cost factors that determine the efficiency of plans, and present the powerful optimizations offered by Waveguide plans. 2. performance study. (a) Provide a microbenchmark over a pertinent RPQ template with realistic queries over real RDF stores / graphs. (b) Substantiate the optimizations of our approach. (c) Justify the necessity of the richer plan space. 2 Approach Work on property-path evaluation has been remiss in not drawing the connection to RPQs. How do the FA and α-RA approaches compare? Does one subsume the other? Or are they incomparable? If so, a combined approach might be superior. A generalized approach might offer new plans that neither FA nor α-RA can produce with superior performance. Both the FA and α-RA approaches effectively provide evaluation plans for property-path queries. However, the plan spaces that are implicit in these ap- proaches have not been considered. In FA, choosing a different (but still correct) automaton for the plan might offer a significantly more efficient plan. In sys- tems taking the α-RA approach, planning is done over the α-RA expression tree that results from the property path’s translation, but no planning specific to the semantics of property paths takes place. In [12], we show the approaches of FA and α-RA are, indeed, incomparable. We next describe Waveguide, which is a generalized approach to query plan- ning and evaluation for SPARQL property-path queries, and which properly subsumes the FA and α-RA approaches. We discuss Waveguide’s evaluation model, present Waveguide’s plans, and we show that Waveguide’s plan space properly subsumes FA ∪ α-RA. Waveguide’s evaluation strategy is based on an iterative search algorithm, and variations thereof. In Waveguide, we perform path search efficiently while simultaneously recognizing the path expressions. Waveguide’s input is a graph database G and a waveplan PQ which guides a number of search wavefronts that explore the given graph. We introduce the term wavefront to refer to a part of the plan that evaluates breadth-first during the evaluation. This graph exploration, driven by an iterative search procedure, is inspired by the semi-naı̈ve bottom-up strategy used in the evaluation of linear recursive expressions based on fixpoint, as is done for the α operator for α-RA. The key idea is, given a seed —a set of nodes in the graph from which this wavefront begins its search—as a start, to expand repeatedly the search wave- fronts until no new answers are produced; i.e., we reach a fixpoint. Each search wavefront is guided by an automaton in the plan, a finite state machine based on an NFA. This is akin to the FA approach. Different, though, from NFAs which are used as recognizers of regular expressions on strings, wavefront automata have features directed to the evaluation of regular expressions over graphs. In [12], we present a cost framework for Waveguide search, search cost factors that can magnify the cost (properties of the graph and of resulting pre- paths computed during evaluation), and optimization methods that are enabled by WPs which address the search factors, in turn. We prototype a system [11] that implements Waveguide methodology to benchmark waveplans to study their performance. In this Waveguide system, resource-intensive tasks are delegated to PostgreSQL via SQL and procedural SQL routines. We test our implementation of Waveguide by running a collection of real- istic path queries over real-world datasets YAGO2s [10] and DBPedia [3]. We generate path queries based on data patterns we identified in real-world graphs. The goal of these experiments is to verify the gains offered by Waveguide op- timizations, and show that they correspond to our cost framework and analysis. Further, we design a microbenchmark to analyze the performance of the Waveg- uide prototype in comparison with two RDF stores: Virtuoso [4] and Jena TDB [2]. References 1. R. Agrawal. Alpha: An extension of relational algebra to express a class of recursive queries. Software Engineering, IEEE Transactions on, 14(7):879–885, 1988. 2. J. J. Carroll, I. Dickinson, C. Dollin, D. Reynolds, A. Seaborne, and K. Wilkinson. Jena: implementing the semantic web recommendations. In Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters, pages 74–83. ACM, 2004. 3. The DBpedia knowledge base. http://dbpedia.org/. 4. O. Erling and I. Mikhailov. Virtuoso: RDF Support in Native RDBMS. Semantic Web Information Management, 1:501, 2010. 5. S. Harris and A. Seaborne. SPARQL 1.1 query language. W3C Recommendation. http://www.w3.org/TR/sparql11-query/, 2013. 6. K. Losemann and W. Martens. The complexity of evaluating path expressions in SPARQL. In Proceedings of the 31st symposium on Principles of Database Systems, pages 101–112. ACM, 2012. 7. A. Mendelzon and P. Wood. Finding regular simple paths in graph databases. SIAM Journal on Computing, 24(6):1235–1258, 1995. 8. E. Prud’Hommeaux, A. Seaborne, et al. SPARQL query language for RDF. W3C Recommendation, 15, 2008. 9. W3C: Resource Description Framework (RDF). http://www.w3.org/TR/rdf- concepts/, 2004. 10. YAGO2s: A high-quality knowledge base. http://yago-knowledge.org/resource/. Max Planck Institut Informatik. 11. N. Yakovets, P. Godfrey, and J. Gryz. WAVEGUIDE: evaluating SPARQL prop- erty path queries. In Proceedings of the 18th International Conference on Extending Database Technology, EDBT 2015, Brussels, Belgium, March 23-27, 2015., pages 525–528, 2015. 12. N. Yakovets, P. Godfrey, and J. Gryz. Evaluation of SPARQL Property Paths via Recursive SQL. In Proceedings of the annual ACM SIGMOD conference (SIG- MOD’16), San Francisco, USA, June 2016. ACM.