How to bring some MAGIC to SPARQL Work in progress Paper Christina Ehrlinger University Passau Germany Christina.Ehrlinger@uni-passau.de ABSTRACT which focus on this like the RDF-3X triple store [7]. Second, For SPARQL, the query language to retrieve information we can rewrite the SPARQL query in order to optimize the from an RDF graph, the optimization of the order of the order of the triples. containing triples is a challenging topical issue. In this paper, The same kind of optimizations took place for SQL. There we show an approach which passes the information about the are highly performant database systems which also considers already found binding for a variable to all other occurrences the storage of the relations as well as techniques to optimize of this variable in combination with a cost model in order to the order of the joins, which have to be done in order to minimize the execution time. Different to other approaches execute SQL queries with more than one relation. our approach as described in this paper can be applied to a SPARQL query, which not only consists of basic graph The aim of this approach is not an optimization, which pattern, but also contains group graph pattern like FILTER perfectly fits into a self-designed triple store. It is a generic or OPTIONAL. Our experiments show the applicability of way how to reorder SPARQL queries in order to use their our approach and function as preliminary proof of concept. full potential according to their execution time. The rest of the paper is organized as follows. Section 2 1. INTRODUCTION describes the problem statement for the reordering of the The structure of the web changes from linked documents, triples and states a motivational example. Section 3 pres- which contain the information, to directly link the informati- ents our approach for using the sideways information pas- on using the keyword Linked Open Data. This direct linking sing first for a query containing a basic graph pattern and of information generates a huge number of quite big graphs. furthermore the adaption of the approach in order to use In order to handle these graphs, the Resource Descripti- other components besides to a basic graph pattern, for ex- on Framework RDF 1 is one of the used standard models ample FILTER. Section 4 shows the experiments which were for data interchange on the Web. It uses Uniform Resource executed so far. Section 5 discusses related work. Identifiers (URIs) to refer to resources in order to connect them. The resulting structure is a directed, labeled graph, 2. PROBLEM STATEMENT where the edges represent the connection between the re- In order to present our approach for reordering a SPAR- sources. These RDF graphs are generated very easily due to QL query, we first have a closer look at a simple SPARQL the absence of a schema. There exist specialized databases query and discuss the impact of different orderings. Consi- for storing these RDF graphs, so-called triple stores 2 . dering the SPARQL query in Listing 1. This query can be The SPARQL query language [8] is used to access the executed against a dataset generated using the Lehigh Uni- information stored in these RDF graphs. These SPARQL versity Benchmark (LUBM) data generator [4]. It consists queries must be performant in order to find the desired in- of 5 triples, whereas triple t5 also contains a literal. formation in these huge graphs. For optimizing the execution time of a SPARQL query we can consider two different ap- SELECT ∗ proaches. First, we can design a triple store, which optimizes WHERE { the storing or the usage of indices. There are also approaches ?p : t e a c h e r O f ? c . #t 1 ? s : takesCourse ?c . #t 2 1 https://www.w3.org/RDF/ ? s : a d v i s o r ?p . #t 3 2 https://www.w3.org/wiki/LargeTripleStores ? s : memberOf ?d . #t 4 ?d : name ’ Department0 ’ . #t 5 } Listing 1: Example for a SPARQL query 31st GI-Workshop on Foundations of Databases (Grundlagen von Daten- banken), 11.06.2019 - 14.06.2019, Saarburg, Germany. Copyright is held by the author/owner(s). While executing the query as shown in Listing 1 using In the following chapter, we present an approach in or- the common triple store GraphDB 3 , the query has an ave- der to use the sideways information passing for SPARQL rage execution time of 4606 ms while executing ten times. queries. If we slightly change the order of the triples, we get the query shown in Listing 2, whereas this query has an avera- 3. USING SIDEWAYS INFORMATION PAS- ge execution time of 12283 ms (executed ten times). Even for this simple query executed against a quite small dataset SING FOR SPARQL QUERIES of approximately 130 000 triples we see the huge impact of In order to optimize a SPARQL query regarding the SIP, reordering the triples of a SPARQL query. we can represent the query as a graph to visualize the possi- ble ways to pass the bindings. In Figure 1 we see the resulting SELECT ∗ graph for the SPARQL query in Listing 1. WHERE { Every node represents a triple pattern of the query. Two ? s : takesCourse ?c . #t 2 nodes are connected with an undirected edge if they share ?p : t e a c h e r O f ? c . #t 1 a common variable. Conceptually, the representation allows ? s : memberOf ?d . #t 4 us to determine the best order of the triples by transforming ? s : a d v i s o r ?p . #t 3 the undirected graph in a directed graph using the approach ?d : name ’ Department0 ’ . #t 5 described in this paper, whereas the directed edge between } node t1 and node t2 denotes that t1 is in the ranking order before t2. Listing 2: Reordered SPARQL query from Listing 1 We also find examples of reordering which lead to a high t2 improvement regarding the execution time. In Listing 3 we again have the same SPARQL query apart from the ordering t1 t4 t5 of the triples. This query has an average execution time of 325 ms. t3 SELECT ∗ WHERE { ?d : name ’ Department0 ’ . #t 5 ? s : memberOf ?d . #t 4 Figure 1: Query graph for query in Listing 1 ? s : a d v i s o r ?p . #t 3 ? s : takesCourse ?c . #t 2 Another way to visually represent the query is shown in ?p : t e a c h e r O f ? c . #t 1 Figure 2. Here we represent every subject and object posi- } tion as a node, no matter if this is a variable, an IRI or a blank node. For the triple pattern Listing 3: Reordered SPARQL query from Listing 1 ?s takesCourse ?c The mechanism of reordering to get a more efficient query we get two nodes representing the variables ?s and respec- is not a new idea. This was done in Datalog [5] while apply- tively ?c and they are connected with a directed edge, which ing the Magic Set Transformation [1] to a Datalog program represents the predicate takesCourse between these two va- in order to sort the subgoals of a rule and the same idea riables. The edge is directed from the node representing the was also used while applying the join order optimization to subject of the triple to the node representing the object of a SQL query in a relational database system [11]. the triple. For the triple ?s takesCourse ?c, the edge is di- rected from the node representing the variable ?s to the node Especially for Datalog and SPARQL, the ordering of sub- representing the variable ?c. goals or respectively triples has a huge impact on the va- riables. The term Sideways Information Passing, short SIP, was used for this in Datalog and it describes how the bin- ?c teacherOf ?p ’Department0’ dings of variables are passed from one subgoal to another subgoal. Depending on the order of the subgoals it results in name takesCourse advisor a variety of different SIP possibilities, called SIP strategies [10]. For our approach, we adopt this term for SPARQL. In the context of SPARQL, it describes the way the binding of ?s memberOf ?d the variables is passed between the triples. Consider the query from Listing 3 again. We start with the triple ?d :name ’Department0’. This generates bindings for Figure 2: SPARQL graph for query in Listing 1 the variable ?d, which are passed to the second triple ?s :memberOf ?d. Again this generates bindings for the varia- This kind of representation describes the pattern we are ble ?s, which is passed to the triples ?s :advisor ?p and looking for while executing the query against a dataset. ?s :takesCourse ?c. The approach described in this paper focuses on how to rewrite the query with a more efficient ordering of the triples. This is done based on the query graph in order to optimize 3 https://www.ontotext.com/products/graphdb/ the sideways information passing. It consists of the following four steps: a takesCourse b takesCourse c 1. Generate query graph teacherOf takesCourse takesCourse 2. Transform undirected graph into directed graph 3. Determine order of the nodes according to the directi- e d advisor on of the edges 4. Generate optimized SPARQL query takesCourse teacherOf advisor According to the different elements of a SPARQL query, we have a closer look at the rearrangement of a query only g f h consisting of a basic graph pattern and in a second step, we takesCourse also consider all possible group graph patterns of a SPARQL query. Figure 3: small sample instance graph as generated 3.1 Basic Graph Pattern Query by LUBM Basic graph patterns (BGP) are sets of triple patterns [8]. The most simple SPARQL query consists of exactly one triple pattern: for a node if this node has at least one outgoing edge of this property. In general, the outgoing value for property p ?s ?p ?o determines the average number of awaiting bindings for the One example of a query which consists only of a basic object position while the subject position already has a bin- graph pattern but more than one triple pattern is shown in ding, whereas the incoming value determines the average Listing 1. As described before, in the first step we generate number of awaiting bindings for the subject position while the corresponding query graph for the SPARQL query from the object position already has a binding. Listing 1 as seen in Figure 1. During the second step, we We also have to consider the cost model in order to choose convert the undirected query graph into a directed graph. the first node, if the query contains only triples of the form In the following, we describe how this is achieved. Similar to ?s p ?o, where only the predicate position p is bound and Datalog, while performing the Sideways Information Passing not a single literal or IRI is part of the query. as a preliminary step before the Magic Set Transformation In Figure 3 we see a small instance graph as an example of [1], we try to find Literals or IRIs in the query to get a the generated RDF graphs by LUBM. The shape of the no- starting point for generating bindings for the variables. In de represents the type of the node. A rectangle represents a general, we are looking for the node with the highest num- student, a triangle represents a professor and a circle repres- ber of bound positions (subject, predicate, object). In our ents a course. The outgoing and incoming values resulting example, the only literal is the name of the department, so from the small example can be seen in Table 1. the triple, which contains this literal, will be the first triple according to the ranking order. property outgoing incoming As explained before, this generates a binding for the va- advisor 1 2 riable ?d. This binding is passed to all triples, which also takesCourse 2 1.5 contain the variable ?d, no matter, if it occurs in the sub- teacherOf 2 1 ject or object position. In the query graph, this passing is visualized by adding an outgoing direction to all undirected Table 1: Outgoing and incoming values calculated edges of the current node, which corresponds to the current based on the instance graph in Figure 3 triple. The search for a node, where the corresponding triple has the highest number of bound positions and the subse- For a better understanding, we have a closer look at the quent adding of directions to edges are repeated until all calculations for the outgoing and incoming values. Consider edges are directed. the property takesCourse. We sum up all edges with the la- bel takesCourse. Node e has three edges, the node b has two This approach works fine for SPARQL queries, where the edges and the node g has one edge with the label takesCour- corresponding query graph is basically a path. In our exam- se. Overall we have six edges with this label. To determine ple query, we can follow the path until node t4. Now more the outgoing value we divide this sum by the number of no- than one node is affected by adding the direction of an ed- des, which have an outgoing edge with this label (in this ge and we have to decide which of the two nodes t2 and t3 example three nodes). Overall we get an outgoing value of 2 should be processed next. In order to solve this, we have in- for the property takesCourse. Similar calculations are done troduced a cost-based model to pick the best next node out for the incoming value for the property takesCourse. Now of all possible nodes. we consider all nodes, which have an incoming edge with The chosen cost model takes into account the average oc- this label. Node a has two edges as well as node c. Both currence of every property but in addition, also considers nodes f and h have one edge. Again the number of edges the position of the common variable. For every property p is divided by the number of nodes in order to calculate the we determine two different numbers, so-called outgoing and average amount. Overall we get an incoming value of 1.5 for incoming value. The naming of these values origins in the the property takesCourse. SPARQL graph representation. The outgoing value descri- So we have two different values for every property. Due to bes the average amount of outgoing edges of this property this distinction, we take into account the sideways informa- tion passing to the subject or rather the object position. t1 t2 Using this cost model we can now determine which node to choose in our example from Figure 1. After handling the nodes t5 and t4 we have to choose between the nodes t2 and t4 t5 t3. In both nodes the corresponding triple contains the al- ready bound variable in the subject position, so we compare the outgoing values of both predicates as seen in Table 1 and t6 t3 choose the smaller one in order to minimize the intermitted results. Because advisor has an outgoing value of 1 while takesCourse has an outgoing value of 2, we choose node t3 Figure 4: Query Graph for query in Listing 4 after to be the subsequent processed node. processing the nodes t5, t4 and t3 Based on the number of bound positions and if equivocal based on the presented cost model we are able to transform the undirected query graph into a directed graph. In the we can imagine that in general, it can not result in the very third step, we can use topological sorting for extracting the best execution time if the filter is placed at an arbitrary resulting order of the nodes from the directed graph. In the position. Based on the experiments in section 4 we were able last step, we map the node to the corresponding triple in to substantiate the assumption to place the filter after all order to get the optimized SPARQL query. triples, which have at least one common variable with the filter condition, were processed. Based on the query graph, 3.2 Handling FILTERs all edges of the consuming node have to be directed in order Until now we have considered SPARQL queries which con- to be the next node to be processed. sist of a basic graph pattern or in other words of a set of Considering our example query graph in Figure 4 we have triple patterns. On the one hand, these triple patterns gene- already processed the nodes t5, t4 and t3. In the current rate bindings for variables. On the other hand, they are also state we have to choose between node t2, node t1 or the consuming the binding of variables for the subject position consuming node t6. As described before, the consuming no- in order to generate a binding for the object position or vice de can only be a candidate, if all edges of the consuming versa. Overall a triple pattern can generate and can consu- node were turned into incoming edges in order to get the me bindings for variables. Things are slightly different as it bindings from all relevant triples. So we have to choose bet- comes to FILTERs in SPARQL. SPARQL FILTERs restrict ween t2 and t1. This is done using the approach presented solutions to those for which the filter expression evaluates before. In doing so, we first choose t2, afterwards t1. Now to TRUE [8]. So a FILTER can only consume bindings for all edges of the consuming node are directed and we can the variables, but can not generate any bindings. In order to add the filter to the order of the so far processed triples. If handle this, we have to slightly adapt our present approach. there would be any undirected edge in the query graph, we While generating the query graph, the approach stays the just continue using our approach. Overall this handling of same, so for every filter in the query, we add a new node a filter in a SPARQL query does not mean we always add to the query graph next to the nodes for every triple. In the filter at the end of the query. Considering the following order to distinguish the node for a triple and the node for filter regarding the variable ?s: a filter, the node for a filter is drawn with a bold line. For a better differentiation, we also name the node for the filter FILTER(?s != http://www.Department0. Uni- consuming node. The meaning of the edges stays the same. versity0.edu/GraduateStudent29) Consider the following SPARQL query, which is the same SPARQL query as in Listing 1 extended with one FILTER This filter would be added in the current order after we expression: have processed all triples containing the variable ?s. In summary, we have adapted our approach to handle the SELECT ∗ FILTER construct of a SPARQL query. WHERE { ?p t e a c h e r O f ? c . #t 1 3.3 Handling Group Graph Patterns next to ? s takesCourse ?c . #t 2 FILTER ? s a d v i s o r ?p . #t 3 Next to a FILTER, there are many more possibilities to ? s memberOf ?d . #t 4 write a group graph pattern in SPARQL, for example OP- ?d name ’ Department0 ’ . #t 5 TIONAL or SERVICE. Also these elements can consume FILTER( ? p != bindings similar to a FILTER, but considering only the trip- ) #t 6 Consider the SPARQL query in Listing 5 with a BGP of } three triples and an OPTIONAL clause with two triples. The idea is to handle those queries on different abstracti- Listing 4: extended SPARQL query from Listing 1 on levels. While reordering the triples which are contained directly in the WHERE clause, the group graph pattern is The corresponding query graph is shown in Figure 4. abstracted into one single node. On this basis, we can con- While transforming the query graph from an undirected sider the triple patterns and group graph patterns inside of graph into a directed graph, we use the same approach as the current group graph pattern. discussed before. The only change regards the handling of these consuming nodes. From a naive standpoint of view, SELECT ∗ the execution times regarding all permutations of the query WHERE { presented in Listing 1 depicted in Figure 6. For this example, ? s a d v i s o r ?p . #t 1 our approach achieves the best execution time by on average ? s memberOf ?d . #t 2 of 325 ms. All permutations were executed 10 times in order ?d name ’ Department0 ’ . #t 3 to reach a good average. OPTIONAL{ #O ? s : takesCourse ?c . #t 4 20 20 ?p : t e a c h e r O f ? c . #t 5 20 } numbers of permutations } 15 14 Listing 5: SPARQL query with OPTIONAL During the abstraction, the query graph for the first level contains four nodes. The nodes t1, t2 and t3 represent the 10 10 9 9 corresponding triple pattern while the node O represents the 8 OPTIONAL clause as shown in Figure 5. Based on this, 7 we consider a query graph containing the triples t4 and t5, 6 5 which are inside of the OPTIONAL. The incoming edges 5 4 for the second query graph represent the information about 3 3 2 already bound variables from outside the OPTIONAL as shown in Figure 5. 0 0 00 00 00 00 00 00 00 00 t1 t4 0-5 <20 <40 <60 <80 <100 <120 >150 t3 O execution time in ms t2 t5 Figure 6: Distribution of the execution times in re- lation to the permutations of the query in Listing 1 Figure 5: Query Graphs for the query in Listing 5 for the two different abstraction levels This test setup was executed with several different queries but the overall picture is always the same. In Table 2 we see Until now we have only considered queries, which have at a comparison between some of the LUBM queries regarding most one OPTIONAL group graph pattern. This approach the best possible execution time for one of the permutati- is also applicable if a SPARQL query has more than one ons compared with the execution time achieved while exe- group graph pattern. If this is the case, we generate a con- cuting the permutation derived from our approach. In the suming node for every group graph pattern and use the same last column, we listed the percentage of permutations of the approach as described before. We only have to extend the respective query, which had the same or smaller execution mechanism to handle several consuming nodes as potential time than the derived permutation from our approach. next nodes. In order to get the best possible order, we use a preference ranking for the different kind of group graph best achieved patterns. So for example, a FILTER is ranked above an OP- LUBM % of faster execution execution TIONAL. This approach is quite generic because a SPARQL Query permutations time time query can also be nested using an arbitrary number of le- 1 11 ms 11 ms - vels. Conceptually handling these elements like OPTIONAL 2 7 ms 16 ms 0,03056 should be possible using our presented approach as stated 7 15 ms 15 ms - before. As described in section 4 we will need more experi- 9 4 ms 7 ms 0,07778 ments and tests to show the applicability of our approach 12 3 ms 39 ms 0,58334 for queries with group graph patterns like OPTIONAL. 13 4 ms 4 ms - 4. EXPERIMENTS Table 2: Comparism between best possible executi- In this section, we present the experiments carried out on time and achieved execution time using our ap- to test our approach and to get an idea if this approach is proach for some LUBM queries promising. We used the LUBM data generator [4] to generate a dataset of approximately 130 000 triples. These triples For LUBM query 2 our approach as presented in this pa- are stored in GraphDB. One of the first experiments was to per achieves an execution time of 16 ms, whereas the best execute all possible permutations of the BGP of a SPARQL permutation achieves an execution time of 7 ms. Based on query to get an idea about the different execution times these two numbers, it seems that our approach does not per- regarding the different orders of the triples. In addition, the form well. If we add the information, that this query has 6 query was reordered using our presented approach. For every triples, so 720 possible permutations and only approximately query tested so far the reordered query was in the forefront 3% of these permutations have the same or better executi- of all execution times from the permuted queries. To give an on time, our approach is in a better position. For some of idea how the execution times can differ, the distribution of the queries, as LUBM query 7 with 4 triples, our approach derived the best possible permutation based on the executi- dataset showed the impact on the execution time based on on time. These experiments also showed some examples like the order of the triples. Also, these experiments functioned LUBM query 12, where our approach does not determine a as a preliminary proof of concept for our approach. While good permutation of the query. These queries will be used testing the execution time of all possible permutations of the to improve our cost model. triples, the order of triples, our approaches chooses, was in Based on the assumption the approach is using a good the forefront of the smallest execution times. heuristic, we have also examined the impact of the filter pla- As described in the title, this is a work in progress paper. cement. In order to do this, we used our approach to reorder Therefore these experiments are not the top of the flagpole. the BGP of the query, added the filter statement at every The next steps will include tests with bigger datasets genera- possible position and compared the execution times again. ted with the LUBM dataset generator as well as tests using Also for this test setup, the experiment matches with the data and test queries from the Berlin SPARQL Benchmark expectations about placing the filter after all triples which (BSBM) [2]. While the test queries from LUBM are quite contain at least one of the variables of the filter. In Table 3 short, the queries tested in BSBM contain more triples and we see the impact of the filter placement. also more complex patterns like FILTERs or OPTIONAL clauses. Another benchmark which provides more complex filter placement execution time test queries than LUBM is the SP 2 Bench [9], which we want before t5 289 ms to use as a source for test queries. after t5 279 ms In order to improve the reordering based on the current after t4 281 ms approach we want to take into account semantic information. after t3 277 ms This will be part of our future work. after t2 277 ms after t1 274 ms 7. REFERENCES [1] C. Beeri and R. Ramakrishnan. On the power of Table 3: Impact of the FILTER placement for the magic. In Proceedings of the Sixth ACM enlarged SPARQL query in Listing 4 SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS ’87, pages 269–284, New Using our approach for the query as shown in Listing 4 we York, NY, USA, 1987. ACM. place the filter after the triple t5, which is also the best po- [2] C. Bizer and A. Schultz. The berlin sparql benchmark. sition according to all possible positions as seen in Table 3. Int. J. Semantic Web Inf. Syst., 5:1–24, 2009. During our experiments, we have observed the impact of the [3] A. Gubichev and T. Neumann. Exploiting the query filter condition itself, but explicitly considering the conditi- structure for efficient join ordering in sparql queries. on of the filter will be part of our future research. Overall In EDBT, pages 439–450. OpenProceedings.org, 2014. our performed experiments showed the applicability of our [4] Y. Guo, Z. Pan, and J. Heflin. Lubm: A benchmark approach. for owl knowledge base systems. J. Web Sem., 3(2-3):158–182, 2005. 5. RELATED WORK [5] J. W. Lloyd. Foundations of Logic Programming. In general, query optimization is a well-established area Springer-Verlag, Berlin, Heidelberg, 1984. especially for SQL, but regarding SPARQL queries it is a [6] M. Meimaris and G. Papastefanatos. Distance-based quite current topic. In the following, we shortly describe so- triple reordering for SPARQL query optimization. In me different approaches, which use different methodologies 33rd IEEE International Conference on Data than our approach presented here. The Characteristic Set Engineering, ICDE 2017, San Diego, CA, USA, April approach [3] uses dynamic programming based algorithm on 19-22, 2017, pages 1559–1562. IEEE Computer a precomputed hierarchical structure, which allows determi- Society, 2017. ning the best order of the triples. In [6] they translate a query [7] T. Neumann and G. Weikum. Rdf-3x: a risc-style into a multidimensional vector space and perform distance- engine for rdf. PVLDB, 1(1):647–659, 2008. based optimization by considering the relative differences [8] E. Prud’hommeaux and A. Seaborne. SPARQL Query between the triple patterns. Comparing different selectivity Language for RDF. W3C Recommendation, 2008. estimations for a SPARQL query was done in [12], whereby [9] M. Schmidt, T. Hornung, G. Lausen, and C. Pinkel. this approach focusses only on BGP queries. It describes dif- Sp2bench: A SPARQL performance benchmark. ferent ways to compute heuristics for the optimization but CoRR, abs/0806.4627, 2008. does not consider the structure of the triple. Our approach [10] S. Sippu and E. Soisalon-Soininen. Multiple sip fills this gap between using statistical data and taking the strategies and bottom-up adorning in logic query structure of the query as well as the triple into account. optimization. In ICDT, 1990. [11] M. Steinbrunn, G. Moerkotte, and A. Kemper. 6. CONCLUSION AND FUTURE WORK Heuristic and randomized optimization for the join In this paper, we have presented our approach for opti- ordering problem. The VLDB Journal, 6(3):191–208, mizing a SPARQL query by means of using the sideways Aug. 1997. information passing. We show how to use our approach for [12] M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and a SPARQL query only consisting of a basic graph pattern. D. Reynolds. Sparql basic graph pattern optimization Based on this we showed how to adapt the approach in or- using selectivity estimation. In Proceedings of the 17th der to handle queries, which also contain FILTERs or other International Conference on World Wide Web, WWW group graph patterns. The experiments based on the LUBM ’08, pages 595–604, New York, NY, USA, 2008. ACM.