WIREFRAME: Two-phase, Cost-based Optimization for Conjunctive Regular Path Queries Parke Godfrey† Nikolay Yakovets‡ § Zahid Abul-Basher Mark Chignell§ † York University, Toronto, Canada godfrey@yorku.ca ‡ Eindhoven University of Technology, Eindhoven, The Netherlands n.yakovets@tue.nl § University of Toronto, Toronto, Canada {zahid,chignell}@mie.utoronto.ca 1 Introduction Graph-like data has come to the forefront with the advent of social networks and the Semantic Web. As methods and tools have been deployed for handling graph data, many others have found the paradigm to benefit their applications. Large biological corpora fit the model well. The W3C standards offer a graph data model via RDF, and a corresponding declarative query language for it via SPARQL. The queries are suited to sub- graph matching into a large data graph. Each subgraph answer itself is small, but there can be many answers. The query language encompasses the earlier de- fined concept of regular path queries (RPQs) and conjunctive regular path queries (CRPQs). An RPQ queries for pairs of nodes from the data graph which have a requisite path between them which matches the RPQ’s specification. CRPQ queries for tuples of nodes (subgraph matches) such that pairs within match by specified RPQs. This, of course, also entails that the pairs join up (conjoin) in the way the query asks. One can then call a CRPQ a “query graph”: its “nodes” are the query’s binding variables; and the “edges” between nodes are the constituent RPQs. Let these query edges be distinctly labeled. While graph databases and their applications are coming into wide use, we are only at the very beginning of understanding how to scale these systems well. Recent work has brought a cost-based optimization approach to RPQs [6]. We set out a framework herein which we call Wireframe for a two-phase, cost-based optimization for CRPQs. In Wireframe, CRPQ planning—and, likewise, evalu- ation—is separated into two phases. In the first phase, the plan is for evaluating the “answer graph”. In the second phase, a plan is posited for enumerating the subgraph-match answer tuples from this answer graph. In this brief paper, we provide the motivation for Wireframe, present its general framework, and lay out our agenda for, and the challenges in, realizing the approach fully. (a) Query graph symmetry. (b) Many-many join multiplicity. Fig. 1: Two query graphs: symmetry & join multiplicity. 2 WIREFRAME’s Approach Within SPARQL, a CRPQ query is with a fixed number of named node vari- ables,1 and a number of “edge” conditions defined via RPQs (property paths in SPARQL) between node variables. Let us define “closed SPARQL”, cSPARQL, to encompass this CRPQ fragment with the proviso that each query edge (RPQ) is distinctly labeled.2 A query in cSPARQL then can be considered as a labeled, directed multi-graph, the query graph, the same as the graph database itself (the data graph), albeit, significantly smaller, over which it is to be evaluated. Let the evaluation of a cSPARQL query also be a labeled, directed multi-graph, the answer graph. As with RDF, let the answer graph be represented as an answer set of triples: the “edges” are node pairs—with the nodes drawn from the data graph—labeled by the query graph’s edge labels. Each answer edge means that the pair of nodes is an answer to the corresponding RPQ. And each answer edge must participate in some subgraph match that answers the CRPQ. This treats graph query evaluation algebraically: the result of the evaluation is a graph itself. How to evaluate efficiently the answer-graph triples of a cSPARQL query with respect to a data graph is a worthwhile endeavor in its own right. This can also be used to evaluate the query with respect to SPARQL, to find the sub- graph match (SGM) tuples. The set of answer triples suffices to enumerate the SGM tuples. This approach to CRPQ evaluation for SPARQL then is two-phase. 1. optimization. How to evaluate the answer graph most efficiently. 2. enumeration. How to enumerate the SGM tuples most efficiently. The two phases have quite different optimization criteria; separating the steps allows for us to plan them “independently”. Division between the answer graph and the subgraph-match answers also allows for factorization, which escapes from the combinatoric complexity that arises from subgraph match symmetries during evaluation against the data graph. At the core of planning in each phase is the choice of a spanning tree in the query graph. For the answer-graph phase, this directs the graph walk during evaluation. For the subgraph-match phase, this directs the enumeration of the SGM answer tuples for the CRPQ. 1 Some may be free which are returned as part of an answer’s bindings, and the others existential which are not. Query “edge” conditions may be between any of the node variables. Let us consider here only queries with free node variables. 2 This has been introduced in SPARQL with the use of CONSTRUCT operator [3]. At one end of the spectrum, a CRPQ is simply sub-graph match template. Let each query edge be a data-graph label (the simplest RPQ). This type of SPARQL query has now been well studied [4]. The number of sub-graph matches can be extremely large, due to two causes: symmetry in the query graph; and multiplicity from “many-many joins”. For example, if h1, 2, 3i (on hA, B, Ci) is an answer tuple to query graph in Fig. 1a, then so is h3, 2, 1i. If sub-tuple h4, 5i on hD, Ei appears in m sub-tuple answers on hA, B, C, D, Ei and appears in n sub-tuple answers on hD, E, F, G, Hi to query graph in Fig. 1a, then h4, 5i appears in m × n answer tuples (on hA, B, C, D, E, F, G, Hi). Our two-phase approach will often result in better plans for pure sub-graph match queries. It improves extremely, however, for CRPQs where the query edges are complex RPQs. The graph walks for evaluating the RPQs are not effectively repeated due to multiplicities resulting from SGM matching. Given a CRPQ, let k be the number of node variables, n be the number of triples in its answer graph, and s be the number of its SGM tuples. Most always 3n  ks. This is akin to factorization in relational queries. In certain application scenarios, it has been demonstrated that solving projections of the query can offer immense performance improvement, as the “final” answers are many-many joins of the projected answer tuples [2]. This is always the scenario for CRPQs. The ultimate factorization of a CRPQ is down to node pairs—the answers to the individual RPQs—which are our very answer-graph triples. A way to evaluate the answer graph is to use a rooted spanning tree of the query graph. Evaluate the RPQs emanating from the root node. Only triples for each RPQ that join at the root may be expanded upon. Call query edges that are not in the spanning tree filter edges. On reaching a node with filter edges, we can only expand a node value (to find triples with respect to the RPQs in the spanning tree from this node) which matches via triples across the filters on the corresponding nodes. (This is sideways information passing.) This operation is quite similar to what is called a semi-join; optimizations for semi- join [5] can be employed. When a node value fails the filter test ultimately, those answer triples are removed, recursively up to the root. The evaluation proceeds recursively, expanding nodes down the spanning tree. Wireframe uses a cost- based planner to choose the most cost-effective spanning tree of the query graph for evaluating the answer graph. A way to enumerate the sub-graph match tuples given the answer graph is to use a rooted spanning tree of the query graph. Join answer triples by the root node to create sub-tuples. A sub-tuple covers partially the spanning tree from the root. Again, call those query edges not participating in the spanning tree filter edges. Whenever a sub-tuple covers a filter edge, check that the pair-projection of the tuple is an answer triple labeled by that edge. Proceed recursively. Wireframe uses a cost-based planner to choose the most cost-effective spanning tree of the query graph for enumerating the SGM tuples from the answer graph. Note that the spanning tree chosen for answer-graph evaluation is not likely to be the same spanning tree for SGM-tuple enumeration. The filter edges for the two are used in quite different ways. For answer-graph evaluation, an existence check for a node value must be made over the answer triples for the edge (RPQ). For SGM-tuple enumeration, an existence check for a node-pair value is made. In [6], Waveguide, a cost-based optimizer that enumerates through a plan space for RPQs, was presented. Wireframe employs Waveguide for planning the RPQs. As a CRPQ also encompasses joins between the RPQs, the results of one constrain the results of others. Choice of a spanning tree in the query graph specifies a topological sort of the RPQs. The cardinality reductions conferred by an earlier RPQ on latter ones can be passed to Waveguide for planning for those. (Choice of the spanning tree dictates that the remaining edges are filters.) Cost considerations can be additionally made on sharing parts of RPQ plans, as RPQs may overlap in their regular expressions, as is explored in [1]. The space of spanning trees can be enumerated to choose a best one. A cost-based enumeration of spanning trees can be also done to choose the best plan for enumerating the SGM tuples from the answer graph. The cost criteria here are that the multiplicity from the many-many join at each branch in the tree is minimized, while also maximizing the filtering by the filter edges. Essentially, this is the best plan for “defactorizing” the SGM tuples. 3 Next Steps & Challenges Wireframe offers a viable cost-based optimization methodology for CRPQs. Our next steps in the work are to implement cost metrics, to design a dynamic- programming enumeration for constructing the best spanning trees for both phases, and to devise efficient runtime execution. The challenges are to couple this more tightly with the Waveguide optimizer for the RPQ planning [6], while also taking advantage of commonalities across the RPQs [1]. References 1. Z. Abul-Basher, N. Yakovets, P. Godfrey, S. Ghajar-Khosravi, and M. H. Chignell. TASWEET: Optimizing Disjunctive Path Queries in Graph Databases. In EDBT, pp. 470–473, March 2017. 2. N. Bakibayev, D. Olteanu, and J. Závodnỳ. FDB: A Query Engine for Factorised Relational Databases. In VLDB, 5(11):1232–1243, 2012. 3. S. Harris, A. Seaborne, and E. Prudhommeaux. SPARQL 1.1 Query Language. W3C Recommendation, 21(10), 2013. 4. J. Pérez, M. Arenas, and C. Gutierrez. Semantics and Complexity of SPARQL. In ISWC, pp. 30–43. Springer, November 2006. 5. K. Stocker, D. Kossmann, R. Braumandi, and A. Kemper. Integrating Semi-join- reducers into State-of-the-art Query Processors. In ICDE, pp. 575–584. IEEE, April 2001. 6. N. Yakovets, P. Godfrey, and J. Gryz. Query Planning for Evaluating SPARQL Property Paths. In SIGMOD, pp. 1–15, ACM, June 2016.