Query Optimization for Large Scale Clustered RDF Data

Ishaq Zouaghi (LIAS/ISAE-ENSMA, Chasseneuil-du-Poitou, France; LR-Sys'Com-ENIT/UTM, Tunis, Tunisia) ishaq.zouaghi@ensma.fr
Amin Mesmoudi (University of Poitiers, Poitiers, France) amin.mesmoudi@univ-poitiers.fr
Jorge Galicia (LIAS/ISAE-ENSMA, Chasseneuil-du-Poitou, France) jorge.galicia@ensma.fr
Ladjel Bellatreche (LIAS/ISAE-ENSMA, Chasseneuil-du-Poitou, France) bellatreche@ensma.fr
Taoufik Aguili (LR-Sys'Com-ENIT/UTM, Tunis, Tunisia) taoufik.aguili@enit.utm.tn

© Copyright 2020 for this paper held by its author(s). Published in the proceedings of DOLAP 2020 (March 30, 2020, Copenhagen, Denmark, co-located with EDBT/ICDT 2020) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
The popularity of the Resource Description Framework (RDF) and SPARQL has thrust the development of high-performance systems to manage data represented with this model. Former approaches adapted the well-established relational model, applying its storage, query processing, and optimization strategies. However, the techniques borrowed from the relational model are not universally applicable in the RDF context. First, the schema-free nature of RDF induces intensive join overheads. Also, optimization strategies trying to find the optimal join order rely on error-prone statistics unable to capture all the correlations among triples. Graph-based approaches keep the graph structure of RDF, representing the data directly as a graph. Their execution model leans on graph exploration operators to find subgraph matches to a query. Even if they have been shown to outperform relational-based systems on complex queries, they are barely scalable and their optimization techniques are completely system dependent. In this paper, we propose optimization strategies for graph-based RDF management systems. We intend to take the strengths of relational databases and propose logical structures generically depicting graph-based query execution. First, we define novel statistics collected for clusters of triples to better capture the dependencies found in the original graph. Second, we redefine an execution plan based on these logical structures. Finally, we introduce an algorithm for selecting the optimal execution plan based on a customized cost model.

KEYWORDS
Optimization, RDF, SPARQL, Cardinality Estimation, Cost Model.

1 INTRODUCTION
The versatility of the Resource Description Framework (RDF) has contributed to its rapid expansion, not only as a standard data model in the semantic Web but also as the preferred representation for data from diverse domains (e.g. genetics, biology). The RDF model uses triples consisting of a subject, a predicate and an object $\langle s, p, o \rangle$ to represent data, and SPARQL as its query language. Currently, public RDF data sets (known as knowledge bases) with billions of triples, such as DBpedia (https://wiki.dbpedia.org/) and Bio2RDF (https://bio2rdf.org/), are extensive sources of information, popularly queried and aggregated. As the volume of available RDF data grows, the need for high-performance RDF data management systems becomes more noticeable.

To cope with the proliferation of RDF datasets, early works used the well-established relational model as backend, storing RDF triples directly into tables (single-table [3, 10], vertical partitioning [1]). In these approaches, a SPARQL query, generally specified as a sequence of triple patterns (TPs), is mapped to a SQL statement. Although considerable research efforts were dedicated to these relational-based strategies, they rapidly suffered from the intensive join overheads induced by the schema-free nature of RDF. Some optimization strategies borrowed from the relational model strive to reduce the evaluation cost by finding an optimal execution order of a query expressed as a sequence of TPs, for example deciding the optimal join order for the TPs of the query shown in Figure 2 (e.g. $[(tp_1 \bowtie_{?f} tp_2) \bowtie_{?f} tp_3]$).

Furthermore, these strategies are not universally applicable in the RDF context. Firstly, they rely on statistics that are very error-prone since they are gathered on the entire collection of data (contrary to the relational model, where statistics are calculated per table entity). Capturing statistics for datasets without an explicit schema is not a simple task. Former approaches collected statistics at the predicate level and assumed independence between triples. However, as shown later by [6, 11], this assumption led to considerable underestimations since RDF triples are highly correlated. Capturing these correlations may prompt exponentially huge statistics whose maintenance is very complex. Although some heuristics have been introduced [20], they are still not sufficient to estimate the cardinalities of complex queries involving several joins. Moreover, these query optimization strategies do not tackle the leading issue, which is the intensive joins produced by the unsuited direct mapping of RDF to tables. Even with an optimal join order, the join operation would still be the bottleneck at query runtime, especially for complex queries (which are more and more frequent in SPARQL).

Graph-based processing systems (e.g. [22]) keep the graph structure of RDF, representing the data directly as a graph. In these systems, the graph essence of RDF is maintained and query processing is turned into a subgraph matching problem. They outperform relational-based systems when solving complex queries [9, 22]. However, as shown in [2, 9], they are less scalable since their processing is mostly based on main memory. There have not been generic optimization strategies specifically built for these systems. There is not a single logical layer enclosing their execution model (as a graph exploration) and their data organization; without a common scheme, optimization strategies will remain completely system dependent. Even though the relational-based optimization strategies were not ideal, they offered some comfort to the designer since they allowed organizing the execution regardless of how the data were stored on disk (i.e. whether the data are stored in a single table or in binary tables).

In this paper, we propose optimization strategies for graph-based RDF management systems. Our strategies fit both centralized and distributed approaches. We take the strengths of logical execution modeling from relational databases and first propose logical structures to portray the query execution based on the exploration of the query and input graphs. We redefine an execution plan based on these structures and present an algorithm that generates and picks, based on a cost model, the optimal execution plan for a given query. Our cost model relies on statistics, but in contrast to former approaches estimating statistics on the global graph, we collect them for clusters of triples (that we name graph fragments $Gf$) logically connected in the original graph. Our cost model considers the interactions between graph fragments to estimate the network and disk costs of a given query plan.

The contributions of the paper are summarized as follows:
(1) We formalize a logical model describing the query execution of RDF systems based on graph exploration methods.
(2) We present the essential statistics collected for each graph fragment.
(3) We detail what is, to the best of our knowledge, the first cost model to compare execution plans based on the disk and network costs in graph-based systems.
(4) We present a study of the problem of choosing the optimal execution plan. We prove its complexity and proffer a branch and bound like algorithm to efficiently explore and select the optimal execution plan in terms of our logical structures.

The rest of the paper is organized as follows. First, Section 2 presents the state of the art of the optimization strategies proposed for RDF systems. Then, Section 3 introduces the preliminary definitions used to describe the query execution based on graph exploration. Section 4 presents the cost model allowing to compare equivalent plans and introduces an algorithm to find the best execution plan. Next, Section 5 presents the experimental study. Finally, Section 6 summarizes the work and gives insights on future research.
2 RELATED WORK
In this section, we summarize the most relevant optimization strategies adopted by triple stores in the state of the art. We classify them according to whether the strategy is applied before or during the query execution (BQE and DQE respectively). In the first category we consider approaches for the organization, indexing and distribution of data; all of them have been implemented by different systems with the aim of finding the best query performance. The second category depicts optimization strategies at query runtime.

2.1 BQE strategies
2.1.1 Data organization. The earliest RDF processing systems adapted the prominent relational model. The naïve approach embraced by Sesame [3] stores the data in a single table of three columns (subject, predicate, object). Its major drawback is the processing of self-joins, which turns quite expensive when SPARQL queries become more complex. The property table approach (Jena2 [10]) reduces the number of self-joins by storing the data in a wider table whose dimensions correspond to the number of distinct subjects and predicates. The overheads of this approach are the great number of null values and the treatment of multi-valued properties. The vertical partitioning approach proposed in SW-Store [1] overcomes this drawback by storing the data in n binary tables, where n is the number of distinct predicates. Still, overheads exist when many predicates are involved in a single query. More recent approaches have tried to find the implicit schema of the data in an RDF dataset. These approaches use characteristic sets [11] to distinguish entities and store the data of similar entities together (e.g. EAGRE [21], [15]). Other approaches maintain the graph structure of RDF data, representing the data as adjacency lists (e.g. gStore [22]). The main disadvantage of these approaches is related to scalability to large RDF graphs [9].

2.2 DQE strategies
Before describing optimization strategies applied during query execution, let us classify SPARQL evaluation approaches. In centralized systems, the evaluation is either join-based (e.g. [10]) or graph-matching based (e.g. [14]). The first family comprises all systems translating each single graph pattern into SQL and combining the results on each iteration using the join operation (on a single table or on multiple tables). Indexing the data enhances the performance since the joins are performed as merge-joins (e.g. [12]). Graph matching approaches, on the other side, break a SPARQL query into subgraphs and keep only valid subgraph bindings at every iteration, avoiding invalid intermediate results.

Static query optimization strategies use maintained statistics about the data to determine an optimal query plan. The estimation of the cardinality is the base measure used to evaluate and compare execution plans for a specific query.

Cardinality Estimation in RDBMS. Cardinality estimation has long been seen as a key component of query optimization and is a well-established field in the relational database world [8]. It is usually solved by using various summarization techniques such as one-dimensional synopses (e.g. one-dimensional histograms [17]). Even if the cardinality estimation used in the relational model would seem useful for the semantic Web, its estimation has been less successful there due to the heterogeneous, string-oriented nature of RDF and to the fact that RDF queries contain many self-joins [16].

Cardinality Estimation in SPARQL. Currently, several studies have investigated cardinality estimation issues for SPARQL queries. A first line of work uses very simplistic models based on RDF-specific statistical synopses, including counters of frequent predicate sequences in paths of the data graph [13]. Similarly, other approaches use one-dimensional histograms and pre-compute the number of occurrences of all predicate pairs to estimate the selectivities of triple patterns and joined triple patterns [19]. The first approach was implemented in RDF-3X and the second in the Jena ARQ optimizer. The drawback of these approaches is that the formulas assume statistical independence between tuples, which produces large estimation errors [6]. A second line of work introduced a specific kind of summary based on a schema-level synopsis for RDF data, preserving as much of its structure as possible [18]. Finally, a third line of approaches collected statistics for tuple groups based on characteristic sets [6, 18], or summarized the graph into large entities [7].

Contrarily to existing techniques, the set of strategies proposed in this paper is independent of the physical storage and query evaluation models of any particular system. Our only assumptions are that the data are logically clustered based on the predicates and that the main query evaluation operator is based on a graph exploration strategy. We propose a set of techniques allowing to find the best way to explore the data graph in order to evaluate a SPARQL query. We rely on a novel cost model that takes into account the correlation between predicates and nodes. Our proposal is not only adapted to centralized systems but also to parallel systems that rely on graph exploration as their query evaluation technique. In this kind of system, our cost model considers the interactions between fragments to estimate the disk and network costs.
3 PRELIMINARIES
3.1 RDF and SPARQL
The Resource Description Framework (RDF) has been widely accepted as the data model for Linked Open Data and the semantic Web. The model uses triples consisting of a subject, a predicate and an object $\langle s, p, o \rangle$ as its main abstract structure. The model provides flexibility without explicitly enforcing a schema. A collection of interlinked RDF triples can be represented as a graph, as shown in Figure 1. The graph of the example contains data related to air traffic control. The RDF graph is formally defined in Definition 3.1.

Figure 1: RDF graph G (air-traffic example: airlines Air France and Iberia with has_flight edges to flights AF37, IB62 and IB83; flights with plane_model edges to A319 and B737, departure and arrival edges to the airports Paris Orly, El Prat, London Heathrow and Berlin Tegel; airports with iata_code literals such as "ORY" and "BCN")

Definition 3.1. (RDF Graph) An RDF graph is denoted as $G = \langle V_c, L_V, E, L_E \rangle$ where $V_c$ is a collection of vertices corresponding to all subjects and objects, $L_V$ is a collection of vertex labels, $E$ is a collection of directed edges that connect the corresponding subjects and objects, and $L_E$ is a collection of edge labels. Given an edge $e \in E$, its edge label is its property.

SPARQL is the most popular query language for RDF. A simple SPARQL query consists of a query form (e.g. SELECT in Figure 2), a Basic Graph Pattern (BGP) and a set of SPARQL operations (e.g. FILTER). A Basic Graph Pattern is composed of triple patterns (TPs). TPs are expressed in a triple form, with at least one of S, P, O being a variable. An example query with three TPs and its graph representation is shown in Figure 2. In our work, we consider only SPARQL queries with bounded predicates. A SPARQL query can also be represented as a graph, as described in Definition 3.2.

Figure 2: SPARQL query Q
SELECT ?c ?f ?m WHERE {
  ?c <:has_flight> ?f .        #tp1
  ?f <:departure> <:El_Prat> . #tp2
  ?f <:plane_model> ?m .       #tp3
}

Definition 3.2. (Graph Query) A Graph Query is denoted as $Q = (V, L_V, E, L_E)$, where $V = V_p \cup V_c$ is the union of the sets of variable and bounded vertices, $L_V$ is the set of vertex labels (the labels of variable vertices are distinguished with a leading question mark symbol), and $E$ and $L_E$ represent the set of directed edges between vertices and their labels respectively.

3.2 Overview of Query Evaluation
In this section, we discuss the main definitions allowing to model the logical organization of data in clusters while keeping its graph structure. Then, we detail the query evaluation operators based on graph exploration over the clustered data.

3.2.1 Graph storage. In contrast to several of the approaches mentioned in Section 2.1.1, in which the graph structure of the loaded RDF data is broken, we strive to preserve it. The storage model groups RDF data so that implicit structures within the data are automatically discovered. Data are first grouped into Data Stars (formally defined in Definition 3.3). Data stars allow identifying the triples related to a specific instance, grouping the data by subject (or object).

Definition 3.3. (Data Star) Given a node $x$ (named the data star head) in an RDF graph $G$, a Data Star $DS(x)$ is the set of triples related to node $x$ with a direct edge. We name Forward Data Star and Backward Data Star the sets $\overrightarrow{DS}(x) = \{(x, p, o) \mid \exists p, o : (x, p, o) \in G\}$ and $\overleftarrow{DS}(x) = \{(s, p, x) \mid \exists s, p : (s, p, x) \in G\}$ respectively.

Data Stars extend the notion of a record in the relational database model. The primary key of a $DS(x)$ corresponds to its head $x$. Records are grouped in tables in an RDBMS; following this logic, we group records describing similar entities into sets named Graph Fragments.

When building the Graph Fragments, ontologies could be applied since they intend to provide an overall schema of the data stored in an RDF graph. However, several studies show that there is still a very partial use of ontology classes, and sometimes subjects share triples with properties coming from different ontological sources [15]. Consequently, we decided to group Data Stars into Graph Fragments based on the combination of properties characterizing an entity, using Characteristic Sets [11].

Each subject $s$ in the graph $G$ has a characteristic set defined as $\overrightarrow{cs}(s) = \{p \mid \exists o : (s, p, o) \in G\}$. Similarly, for the objects we define $\overleftarrow{cs}(o) = \{p \mid \exists s : (s, p, o) \in G\}$. A Forward Graph Fragment $\overrightarrow{Gf}$ groups Forward Data Stars having the same characteristic set. Backward Graph Fragments $\overleftarrow{Gf}$ are formed similarly. The formal definition is given in Definition 3.4.

Definition 3.4. (Graph Fragment) A Graph Fragment is a set of Data Stars. It is named a Forward Graph Fragment $\overrightarrow{Gf}$ if it groups Forward Data Stars such that $\overrightarrow{Gf} = \{\overrightarrow{DS}(x_i) \mid \forall i, j\ \overrightarrow{cs}(x_i) = \overrightarrow{cs}(x_j)\}$. Likewise, a Backward Graph Fragment is defined as $\overleftarrow{Gf} = \{\overleftarrow{DS}(x_i) \mid \forall i, j\ \overleftarrow{cs}(x_i) = \overleftarrow{cs}(x_j)\}$.

It has been shown that indexing and compressing the data of fragments in B+Trees significantly improves the performance at query runtime [9, 12]. In the rest of the paper we assume that the data are indexed using this structure; however, the estimations and the cost model are easily generalized to other data structures. Additionally, the data could be stored as Forward Graph Fragments, Backward Graph Fragments or using both structures. In the definitions of the following section, we assume that both types of fragments are available, yet this is not a mandatory condition.

3.2.2 Query execution. In this section we formalize the logical structures used to describe the query evaluation. As previously mentioned, a SPARQL query can be represented as a directed graph whose nodes are either variables (e.g. ?f, ?m in Figure 2) or bounded values (e.g. <:El_Prat>). Let us first recall how a SPARQL query is evaluated in most state-of-the-art systems. Traditionally, a SPARQL query is evaluated in a TP-by-TP manner [4]. A query execution plan is then seen as a join of TPs on a variable. For example, an execution plan for the query of Figure 2, composed of three single triple patterns, is:

$$tp_1 \bowtie_{?f} tp_2 \bowtie_{?f} tp_3$$

This representation can become quite complex when several TPs are involved in the query. Optimization strategies for these approaches seek to find the optimal execution order of triple patterns according to pre-computed statistics.
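To make the fragment construction concrete, here is a minimal Python sketch (our own illustration, not the implementation of any particular system; all names are ours) that derives forward characteristic sets and groups forward data stars into forward graph fragments following Definitions 3.3 and 3.4:

```python
from collections import defaultdict

def forward_fragments(triples):
    """Group an RDF graph into forward graph fragments (Def. 3.4).

    triples: iterable of (subject, predicate, object) tuples.
    Returns {characteristic set: {head: forward data star}}.
    """
    # Forward data star of each subject: the triples it heads (Def. 3.3).
    data_stars = defaultdict(list)
    for s, p, o in triples:
        data_stars[s].append((s, p, o))

    # Characteristic set of a head = the set of its outgoing predicates;
    # heads sharing a characteristic set fall into the same fragment.
    fragments = defaultdict(dict)
    for head, star in data_stars.items():
        cs = frozenset(p for _, p, _ in star)
        fragments[cs][head] = star
    return fragments

# Toy triples loosely based on Figure 1: both flights share the same
# characteristic set, so they land in the same forward fragment.
G = [("AF37", ":plane_model", "A319"), ("AF37", ":departure", "Paris_Orly"),
     ("IB83", ":plane_model", "B737"), ("IB83", ":departure", "El_Prat")]
for cs, stars in forward_fragments(G).items():
    print(sorted(cs), "->", len(stars), "data stars")
```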
To shorten the logical query plan, TPs can be grouped if they share a common variable on the subject (or object). We name these structures Forward Query Stars and Backward Query Stars if they group the triples on subject or object respectively. Furthermore, as will be shown later, with these structures the query execution can be more easily tracked following a graph exploration approach. Both structures are formally described in Definition 3.5.

Definition 3.5. (Query Star) Let Q be the SPARQL query graph. A Forward Query Star $\overrightarrow{QS}(x)$ is the set of triple patterns such that $\overrightarrow{QS}(x) = \{(x, p, o) \mid \exists p, o : (x, p, o) \in Q\}$; $x$ is named the head of the Query Star. Likewise, a Backward Query Star is $\overleftarrow{QS}(x) = \{(s, p, x) \mid \exists s, p : (s, p, x) \in Q\}$. We use $\overrightarrow{QS}$, $\overleftarrow{QS}$ to denote the sets of forward and backward query stars and $qs$ to denote indistinctly a forward or backward query star.

The execution of a query can be expressed as a join of Query Stars. Since we consider two copies of the data (one copy stored as $\overrightarrow{Gf}$ and another as $\overleftarrow{Gf}$), the execution plans consider both types of Query Stars. An execution plan is composed of a sequence of joined Query Stars, as shown in Definition 3.6.

Definition 3.6. (Execution Plan) An execution plan is an order function applied on a set of Query Stars. The function denotes the order in which the mappings for each Query Star will be found. We denote by $P = [QS_1, QS_2, ..., QS_n]$ the plan formed by executing $QS_1$, then $QS_2$, ..., and finally $QS_n$.

Figure 3: Execution plan (query stars $\overleftarrow{QS}(?m)$, $\overrightarrow{QS}(?f)$ and $\overleftarrow{QS}(?f)$ evaluated over their matching graph fragments: $\overleftarrow{Gf}_{11}$, $\overleftarrow{Gf}_{12}$; $\overrightarrow{Gf}_{13}$, $\overrightarrow{Gf}_{14}$, $\overrightarrow{Gf}_{15}$; $\overleftarrow{Gf}_{16}$, $\overleftarrow{Gf}_{17}$, $\overleftarrow{Gf}_{18}$)

Let us consider for instance that the optimal query plan for the query of Figure 2 is $P_1 = [\overleftarrow{QS}(?m)_1, \overrightarrow{QS}(?f)_2, \overleftarrow{QS}(?f)_3]$. The execution engine starts processing $\overleftarrow{QS}(?m)_1$ by loading all the backward fragments whose characteristic set contains all the predicates of $\overleftarrow{QS}(?m)_1$ (in this case only the plane_model predicate). Assuming that the backward fragments $\overleftarrow{Gf}_{11}$ and $\overleftarrow{Gf}_{12}$ are the only backward fragments matching the predicates of the query star, the execution scans both fragments and finds the mappings to the variables ?m and ?f. These mappings are sent to the forward graph fragments whose characteristic sets match the predicates of the second Query Star $\overrightarrow{QS}(?f)_2$. In this way, only the pertinent mappings with respect to $\overleftarrow{QS}(?m)_1$ and $\overrightarrow{QS}(?f)_2$ are kept. The process continues similarly for the following query stars.

It is evident that an Execution Plan represents a way to explore the graph. Indeed, finding the optimal execution plan amounts to finding the best way to explore the data graph. In the next section we detail the optimization strategy followed to find the optimal query plans in terms of Query Stars. We define an acceptable query plan and then depict the algorithm used to generate such plans. Then we present the cost model, in terms of disk and network costs, applied to decide on the optimal plan.
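As an illustration of Definition 3.5, decomposing a BGP into query stars amounts to grouping triple patterns by subject and by object. A short Python sketch (names ours), applied to the query of Figure 2:

```python
from collections import defaultdict

def query_stars(triple_patterns):
    """Split a BGP into forward (grouped by subject) and backward
    (grouped by object) query stars (Def. 3.5)."""
    fwd, bwd = defaultdict(list), defaultdict(list)
    for s, p, o in triple_patterns:
        fwd[s].append((s, p, o))  # forward star headed by the subject
        bwd[o].append((s, p, o))  # backward star headed by the object
    return dict(fwd), dict(bwd)

# The three triple patterns of the query of Figure 2.
Q = [("?c", ":has_flight", "?f"),       # tp1
     ("?f", ":departure", ":El_Prat"),  # tp2
     ("?f", ":plane_model", "?m")]      # tp3
fwd, bwd = query_stars(Q)
print(list(fwd))  # heads of forward stars: ['?c', '?f']
print(list(bwd))  # heads of backward stars: ['?f', ':El_Prat', '?m']
```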
4 GRAPH-BASED QUERY OPTIMIZATION
In this section, we present our cost-based optimization strategy, which allows comparing execution plans (based on Query Stars). Firstly, we define an acceptable plan and we detail the statistics allowing to evaluate the cost of an execution plan based on the disk and network costs. Next we define the problem of finding an optimal plan and we describe a branch and bound based algorithm used to generate the list of candidate execution plans.

4.1 Acceptable Execution Plan AP
In the last section, we defined an Execution Plan P as an order function applied to a set of Query Stars. An execution plan P is called an Acceptable Execution Plan if it fulfills the following conditions:
(1) Coverage: all nodes and predicates of the given query are covered by the set of Query Stars of the plan. For example, for the query of Figure 2, the execution plan $[\overleftarrow{QS}(?m), \overleftarrow{QS}(?f)]$ is not a valid plan since the node ?c and the edge has_flight are not covered by the plan.
(2) Instantiated head: this condition guarantees that for a plan $P = [QS_1, ..., QS_n]$ and for all $i > 1$, the head of $QS_i$ is already instantiated. We use this condition to avoid a cartesian product when mappings are exchanged between two query stars. For example, the plan shown in Figure 4a is not acceptable since the head of the second query star ($\overrightarrow{QS}(?c)$) is not instantiated before finding the mappings for this query star. A plan in which the head has been instantiated is shown in Figure 4b.

Figure 4: Head instantiation examples: (a) $[\overrightarrow{QS}(?f), \overrightarrow{QS}(?c)]$, head not instantiated; (b) $[\overrightarrow{QS}(?f), \overleftarrow{QS}(?f)]$, head instantiated (edge labels abbreviate departure, has_flight and plane_model)

The formal definition of an Acceptable Plan is given in Proposition 4.1.

Proposition 4.1. (Acceptable Plan AP) Let us consider Q as a given query, $\overrightarrow{QS}$ and $\overleftarrow{QS}$ as the sets of forward and backward query stars respectively, T as the set of triple patterns, and the following functions:
- $Tr: q \cup \overrightarrow{QS} \cup \overleftarrow{QS} \to T$, returning the set of triple patterns of a graph.
- $Nd: q \cup \overrightarrow{QS} \cup \overleftarrow{QS} \to V$, returning the nodes of a graph.
- $Head: \overrightarrow{QS} \cup \overleftarrow{QS} \to V$, a function returning the head of a query star.

An acceptable plan AP is a tuple $\langle X, f \rangle$ where $X \subset \overrightarrow{QS} \cup \overleftarrow{QS}$ and $f: X \to \{1...|X|\}$ is the order function of query stars such that:
(1) $\bigcup_{QS \in X} Tr(QS) = Tr(q)$
(2) $\forall i \in \{2...|X|\},\ Head(f^{-1}(i)) \in \bigcup_{j=1}^{i-1} Nd(f^{-1}(j))$

4.2 Cost model
In this section, we present a novel cost model used to compare execution plans P. As for distributed databases, the cost is expressed with respect to the estimated total time. The total time is the sum of all time components (CPU, I/O, communication); however, the I/O and communication costs are generally the dominant factors. Our cost estimation considers both the disk and communication costs for each query star of the plan, as shown in Equation 1. The parameters $T_{I/O}$ and $T_{TR}$ are the time of a disk I/O and the time to transmit a data unit from one site to another respectively. The estimated numbers of I/Os and network packets transmitted are represented by DC and NC, and their calculation is described in the next sections.

$$Total\_Cost(P) = \sum_{qs \in P} T_{I/O} * DC(qs) + T_{TR} * NC(qs) \quad (1)$$

Both the disk and network costs are estimated based on statistical information about each graph fragment Gf (Definition 3.4). For the disk cost, the statistics allow estimating the number of Data Stars loaded from each fragment that contains potential query matches. Likewise, for the network cost, the number of intermediate results exchanged between fragments at different sites is estimated based on the same statistics. In the next section, we describe the statistical data collected for each graph fragment.

4.2.1 Fragment statistics. We rely on statistical data collected for each Graph Fragment $Gf_k$. The statistics collected for a graph fragment $Gf_k$ (forward or backward) whose characteristic set is $cs = \{p_1, ..., p_m\}$, with $p_i \in cs$, are summarized in Table 1. Some examples of statistics for the example graph of Figure 1 are given in Figure 5.

Table 1: Collected Statistics
(a) $dist(Gf_k)$: number of data stars in $Gf_k$.
(b) $count(p_i, Gf_k)$: number of edges with predicate $p_i$ in $Gf_k$.
(c) $dist\_NE(p_i, Gf_k)$: number of distinct nodes linked to the data star heads in $Gf_k$ with respect to $p_i$.
(d) $SF(Gf_k, Gf_j, p_i)$: proportion of the number of nodes in $Gf_k$ pointing to $Gf_j$ with respect to $p_i$.

Figure 5: Subset of statistics for G of Figure 1: (a) $\overleftarrow{Gf}_4$, (b) $\overrightarrow{Gf}_3$, (c) $\overleftarrow{Gf}_5$; each table lists, per predicate $p_i$, the values (a)-(d) of Table 1

Let us consider the statistics shown in Figure 5. The statistics for the backward graph fragment $\overleftarrow{Gf}_4$ are shown in Figure 5a. The value $dist(\overleftarrow{Gf}_4)$ is 3 since the fragment contains three data stars whose heads are AF37, IB62 and IB83. Both $count(p_i, Gf_k)$ and $dist\_NE(p_i, Gf_k)$ are calculated only for the predicate has_flight (identified as 1 in the figure) since it is the only predicate in the fragment's characteristic set. There are three edges in the fragment having has_flight as a predicate, therefore $count(p_i, Gf_k) = 3$. For this fragment, $dist\_NE(p_i, Gf_k) = 2$ since there are two distinct nodes (Air France and Iberia) linked to the data star heads of $\overleftarrow{Gf}_4$.

The selectivity factor $SF(Gf_k, Gf_j, p)$ between two graph fragments with respect to a predicate is the ratio of the number of edges in a node pointing to data star heads located in another fragment. For example, in Figure 5c, the selectivity factor $SF(\overleftarrow{Gf}_5, \overleftarrow{Gf}_4, 2)$ (2 is the id of the predicate arrival) is 2/2 since out of the 2 edges of the predicate arrival in $\overleftarrow{Gf}_5$, 2 nodes (AF37 and IB83) are the heads of data stars located in $\overleftarrow{Gf}_4$.

For simplicity and to keep the same notation, the selectivity of two graph fragments sharing the same head is represented as $SF(Gf_k, Gf_j, -1)$. For example, the selectivity factor $SF(\overrightarrow{Gf}_3, \overleftarrow{Gf}_5, -1)$ in Figure 5b is 1/2 since out of the two heads in $\overrightarrow{Gf}_3$, one of them is the head of a data star in the graph fragment $\overleftarrow{Gf}_5$ (shown in Figure 5c).

4.2.2 Disk cost. We present in this section the formulations that allow estimating the number of pages targeted by a plan. We assume that the data of each fragment are stored as a clustered B+Tree, as done in [9, 12]. Our disk cost is related to this data structure; however, if the fragments were stored using another data structure, only the parameters of the function calculating the number of disk pages ($N_{DP}$) would change. The disk cost for a single query star is given in Equation 2. In this equation, the $N_{DP}$ function estimates the number of pages targeted by a query star in a fragment, and the tuple $(Gf_j, k_j)$ represents the estimation of the number of data stars $k_j$ in the fragment $Gf_j$ involved in the evaluation of the query. In the following sections, we detail the function $N_{DP}(Gf, k)$ and we present the formulations to estimate the number of data stars read from each graph fragment ($k$).

$$DC(qs_i) = \sum_{(Gf_j, k_j) \in input_{qs_i}} N_{DP}(Gf_j, k_j) \quad (2)$$
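The following sketch shows one way the statistics of Table 1 and Equation 1 could be represented in code. It is a schematic Python illustration under our own naming; the unit costs are hypothetical, and DC/NC stand for the estimators developed in Sections 4.2.2 and 4.2.3:

```python
from dataclasses import dataclass, field

@dataclass
class FragmentStats:
    """Per-fragment statistics of Table 1."""
    dist: int                                    # (a) number of data stars
    count: dict = field(default_factory=dict)    # (b) predicate -> edge count
    dist_ne: dict = field(default_factory=dict)  # (c) predicate -> distinct linked nodes
    sf: dict = field(default_factory=dict)       # (d) (fragment, predicate) -> selectivity

T_IO, T_TR = 1.0, 0.1  # hypothetical unit times for one I/O and one transmitted unit

def total_cost(plan, DC, NC):
    """Equation 1: weighted sum of the disk and network estimates per query star."""
    return sum(T_IO * DC(qs) + T_TR * NC(qs) for qs in plan)
```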
The number of pages per fragment. Let us now detail the function $N_{DP}$, which estimates the number of pages read from disk given a graph fragment $Gf_j$ and a number of data stars $k$ as inputs:

$$N_{DP}(Gf_j, k) = \sum_{i=1}^{H_{Gf_j}} rf(i), \qquad rf(i) = \begin{cases} \frac{k}{N_{Gf_j}} * B_{Gf_j} & \text{if } i = H_{Gf_j} \\ r_i * rf(i+1) & \text{otherwise} \end{cases}$$

The parameters and properties of the previous function are defined as follows:
(1) $B_{Gf_j}$: number of disk pages in the last (leaf) level.
(2) $H_{Gf_j}$: number of levels.
(3) $r_i$: reduction factor for the $i$-th level. Given two levels $i$ and $i+1$, with $Z$ and $L$ the number of pages of levels $i$ and $i+1$ respectively, $r_i$ is defined as $Z/L$.
(4) $N_{Gf_j}$: number of data stars in the fragment $Gf_j$ (i.e., number of keys in the leaf level of the B+Tree).
(5) $rf(i)$: the number of pages to be manipulated at the $i$-th tree level.

As shown in the last equation, the number of pages $N_{DP}$ is the sum of the numbers of pages manipulated at all levels.
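A possible rendition of the $N_{DP}$ recursion in Python (a sketch under stated assumptions: the parameter names are ours, and rounding each level's page count up to whole pages is our choice):

```python
import math

def n_dp(k, n_keys, b_leaf, reductions):
    """Estimate pages read in a clustered B+Tree fragment (Sect. 4.2.2).

    k          : estimated number of data stars targeted
    n_keys     : N_Gf, number of keys in the leaf level
    b_leaf     : B_Gf, number of pages in the leaf level
    reductions : [r_1 .. r_{H-1}], page ratio between level i and i+1
    """
    H = len(reductions) + 1
    rf = [0.0] * (H + 1)          # rf[i] = pages touched at level i (1-based)
    rf[H] = k / n_keys * b_leaf   # leaf level
    for i in range(H - 1, 0, -1): # propagate upward with the reduction factors
        rf[i] = reductions[i - 1] * rf[i + 1]
    return sum(math.ceil(x) for x in rf[1:])

# e.g. a 3-level tree with 100 leaf pages, 10,000 keys, 20 data stars targeted
print(n_dp(k=20, n_keys=10_000, b_leaf=100, reductions=[0.01, 0.1]))
```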
The number of data stars k. In this part we explain the procedure used to obtain the number of data stars per query star (represented as $k$). To better understand the procedure, let us recall the graph exploration execution model illustrated in Figure 3. For a given plan, the execution is done by finding the mappings of one query star after another. This execution model guides the estimation of data stars per query star. We start by calculating the number of data stars for the first query star (Input). Then we estimate the number of data stars that remain after executing this query star (Valid Input). Next, we estimate the number of data stars sent to the next query star for each predicate (Output). Finally, using the selectivity factor and the output, we calculate the input of the next query star. This procedure continues until the last query star.

Let us introduce the estimation of the Input of the first query star. We distinguish two cases:
(i) If the head of the query star is a variable:
$$input\_DS_{qs_1} = \{(Gf_j, k) \mid Gf_j \models qs_1 \wedge k = dist(Gf_j)\}$$
(ii) If the head of the query star is a constant:
$$input\_DS_{qs_1} = \{(Gf_j, 1) \mid Head(qs_1) \in Gf_j \wedge Gf_j \models qs_1\}$$
The input of a query star is expressed as a set of tuples $(Gf_j, k_j)$: $Gf_j$ is a fragment satisfying the predicates of $qs_1$ and $k_j$ is the number of data stars targeted by the query.

The Valid input is calculated for each $(Gf_j, k) \in input\_DS_{qs_i}$ and is expressed as follows:
$$valid\_DS_{qs_i} = \{(Gf_j, k') \mid k' = \lceil k * \min_{e \in Edges(qs_i)} f(e) \rceil\}, \qquad f(e) = \begin{cases} \frac{1}{dist\_NE(e.label, Gf_j)} & \text{if } e.node \text{ is constant} \\ 1 & \text{otherwise} \end{cases}$$
In the $f(e)$ function, we calculate a reduction factor for each predicate ($e.label$) to estimate the number of data stars remaining in the fragment after the execution of the query star.

For each tuple $(Gf_j, k') \in valid\_DS_{qs_i}$, we calculate $output\_DS_{qs_i}$, expressed as a set of triplets $(Gf_j, p_i, k'')$ where $Gf_j$ is the fragment from the input, $p_i$ is a predicate of $qs_i$ and $k''$ is the number of distinct $e.node$ values related to $p_i$ with respect to the valid input. The output is calculated as follows:
$$output\_DS_{qs_i} = \{(Gf_j, p_i, k'') \mid p_i \in edges(qs_i) \wedge k'' = \lceil NDS_{p_i} \rceil\}, \qquad NDS_{p_i} = \begin{cases} 1 & \text{if } e.node \text{ is constant} \\ \frac{k'}{dist(Gf_j)} * dist\_NE(p_i, Gf_j) & \text{otherwise} \end{cases}$$
$NDS_{p_i}$ is the number of data star heads related to each predicate.

After computing the output of the first query star, we can compute the input of the second query star. For each star of order greater than one in the plan, the input is computed as follows:
$$input\_DS_{qs_i} = \{(Gf_j, k) \mid Gf_j \models qs_i \wedge k = \lceil NDS \rceil\}$$
To calculate this input, if the head of the query star is a variable, we consider two cases:
(i) Neighbor query stars: the last evaluated query star is a neighbor of the current one, therefore we can use the selectivity factors. The data star heads targeted in the fragment $Gf_j$ are computed as:
$$NDS = \sum_{(Gf_k, p, k'') \in output\_DS_{qs_{i-1}}} k'' * SF(Gf_k, Gf_j, p)$$
where $p$ is the edge between $qs_i$ and $qs_{i-1}$.
(ii) Same-head query stars: the current query star has the same head as the last evaluated one, so there is no edge linking the two query stars. In this case, the number of data stars is estimated from the valid input of the previous query star and the head-to-head selectivity towards the fragments satisfying $qs_i$:
$$NDS = \sum_{(Gf_k, k') \in valid\_DS_{qs_{i-1}}} k' * SF(Gf_k, Gf_j, -1)$$
If the head of the query star is a bounded value, the input is calculated similarly to the first query star:
$$input\_DS_{qs_i} = \{(Gf_j, 1) \mid Head(qs_i) \in Gf_j \wedge Gf_j \models qs_i\}$$

4.2.3 Network cost. We present in this section the formulations used to estimate the number of network packets exchanged between fragments on different machines. Unlike the disk cost, for which we estimate the number of data stars loaded from each graph fragment in a query star, the network cost aims to estimate the number of mappings sent from one graph fragment to another.

As a recall, a mapping is a binding between the nodes of the query star and the corresponding values in the input graph. For example, the mappings of the first query star of Figure 3 are the bindings for the variables ?m and ?f in the input graph (e.g. ?m → B737, ?f → IB83). As illustrated in Figure 3, these mappings are sent to the graph fragments of the following query stars. The total network cost is equivalent to the sum of the mappings exchanged between fragments located on different sites (illustrated with dotted arrows in Figure 3).

The network cost is shown in Equation 3 as the sum of exchanged packages between graph fragments. The function $N_{NP}$ returns the number of packages exchanged between two fragments given a number of mappings $M$ and their size:
$$NC(qs_i) = \sum_{(Gf_k, Gf_j, M) \in output\_MP_{qs_i}} N_{NP}(Gf_k, Gf_j, M) \quad (3)$$

Number of exchanged packages. The function $N_{NP}$ calculates the number of packages as the product between the number of mappings $M$, their size and the output of the function $loc(Gf_k, Gf_j)$, which returns 1 if the fragments are located on distinct sites and 0 otherwise:
$$N_{NP}(Gf_k, Gf_j, M) = (M * size_i * loc(Gf_k, Gf_j)) / size_{pack}$$
The parameter $size_i$ represents the size of a single mapping for the current query star together with all the previously found mappings, and $size_{pack}$ indicates the size of a single package transmitted over the network.

The estimation of the number of mappings $M$ sent from one query star to another is done at the graph fragment level of each query star. We start by calculating the number of mappings per graph fragment for the first query star ($input\_MP$). Then, based on this input, we estimate for each graph fragment the number of mappings produced after the execution of the first query star ($valid\_MP$). These valid mappings are multiplied by the selectivity factors between the graph fragments of the first and second query stars to obtain the output mapping for each pair of graph fragments. The output mapping of the first query star becomes the input of each fragment of the following query star. The next mappings are estimated following the same methodology (calculating the input, valid and output mappings for each fragment of the query stars). Next, we detail the formulas used to calculate the mappings at each step.

Input mapping: the input of the first query star is computed similarly to the input of the first query star for the disk cost, distinguishing two cases:
(i) If the head of the query star is a variable:
$$input\_MP_{qs_1} = \{(Gf_j, M) \mid Gf_j \models qs_1 \wedge M = dist(Gf_j)\}$$
(ii) If the head of the query star is a constant:
$$input\_MP_{qs_1} = \{(Gf_j, 1) \mid Head(qs_1) \in Gf_j \wedge Gf_j \models qs_1\} \cup \{(Gf_j, 0) \mid Head(qs_1) \in Gf_j \wedge Gf_j \not\models qs_1\}$$

Valid mapping: the valid mapping of a query star is computed from the input mapping calculated in the previous step; it estimates how many variable bindings exist after executing the current query star. For each input $(Gf_j, M)$ we calculate the product between the number of mappings in the input $M$ and the estimated number of mappings produced by the execution of the current query star:
$$valid\_MP_{qs_i} = \{(Gf_j, M') \mid M' = M * P(qs_i, Gf_j)\}$$
The number of mappings after the execution of the current query star is the product of the mappings for each edge of the query star. For each edge $e$ of the query star, we calculate a permutation between $n$ and $k$, where $n$ is the difference between the mean number of edges per data star of the fragment having the same predicate as $e.label$ and the number of edges of the query star pointing to a constant node, and $k$ equals the number of edges of the query star pointing to a variable node. More precisely,
$$P(qs_i, Gf_j) = \prod_{e \in Edges(qs_i)} \frac{n!}{(n-k)!}$$
where
$$n = count(e.label, Gf_j)/dist(Gf_j) - N\_const(e.label, qs_i), \qquad k = count(e.label, qs_i) - N\_const(e.label, qs_i)$$
and $N\_const$ is a function returning the number of edges labeled $e.label$ in the query star pointing to a bounded node.
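Since $n$ is derived from an average ($count/dist$), it is not necessarily an integer, so the permutation $n!/(n-k)!$ is naturally read as the falling factorial $n(n-1)\cdots(n-k+1)$. A sketch under that reading (names ours; FragmentStats refers to the container sketched after Equation 2):

```python
def falling_factorial(n, k):
    """Generalized n!/(n-k)! = n(n-1)...(n-k+1), valid for fractional n."""
    out = 1.0
    for t in range(k):
        out *= (n - t)
    return out

def p_factor(edges, frag):
    """P(qs_i, Gf_j): product of the per-edge permutation estimates.

    edges: [(predicate, n_const_for_predicate, n_var_for_predicate), ...]
           where n_const/n_var count query-star edges with that label
           pointing to a constant/variable node.
    frag : FragmentStats of Gf_j.
    """
    p = 1.0
    for pred, n_const, n_var in edges:
        n = frag.count[pred] / frag.dist - n_const  # mean edges per star minus bound edges
        k = n_var                                   # edges pointing to variables
        p *= falling_factorial(n, k)
    return p
```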
Output mapping: after computing the $valid\_MP$ of the first query star, we calculate the number of results exchanged between fragments using the selectivity factor. This value is calculated from each graph fragment of the current query star $qs_i$ to the graph fragments of the next query star $qs_{i+1}$. For each valid mapping found previously, the output mapping is a set of triples defined as follows:
$$output\_MP_{qs_i} = \{(Gf_k, Gf_j, M'') \mid Gf_j \models qs_{i+1} \wedge M'' = \lceil IntMP \rceil\}$$
The intermediate mappings function $IntMP$ returns the number of mappings sent to each graph fragment based on the selectivity factor:
$$IntMP = \begin{cases} M' * SF(Gf_k, Gf_j, p) & \text{if } qs_i \text{ has a neighbor} \\ M' * SF(Gf_k, Gf_j, -1) & \text{otherwise} \end{cases}$$
The input of the following query star is calculated as:
$$input\_MP_{qs_i} = \{(Gf_j, M) \mid Gf_j \models qs_i \wedge M = Nbr\_MP(Gf_j)\}$$
The total number of mappings for a single graph fragment is the sum of all the mappings received from all the graph fragments of the previous query star:
$$Nbr\_MP(Gf_j) = \sum_{(Gf_k, Gf_i, M'') \in output\_MP_{qs_{i-1}} \wedge Gf_i = Gf_j} M''$$
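Putting the pieces of Section 4.2.3 together, here is a sketch of $N_{NP}$ and Equation 3 (site placement, mapping size and package size are hypothetical parameters; names ours):

```python
import math

def n_np(m, size_map, same_site, size_pack):
    """N_NP: packages needed to ship m mappings of size size_map;
    zero when both fragments live on the same site (loc = 0)."""
    if same_site:
        return 0
    return math.ceil(m * size_map / size_pack)

def network_cost(output_mp, site_of, size_map, size_pack=1500):
    """Equation 3: sum of packages over the (Gf_k, Gf_j, M) triples
    of output_MP for a query star."""
    return sum(
        n_np(m, size_map, site_of[gf_k] == site_of[gf_j], size_pack)
        for gf_k, gf_j, m in output_mp
    )

# e.g. two fragments on different sites exchanging 10,000 mappings of 24 bytes
print(network_cost([("Gf13", "Gf16", 10_000)], {"Gf13": 0, "Gf16": 1}, size_map=24))
```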
4.3 Stars Ordering and Selection Problem
Several execution plans can be used to evaluate a given query. In Section 4.2 we developed a cost model allowing to compare equivalent plans for a query. Finding an optimal acceptable execution plan $AP^*$ consists in selecting, for a given query, the acceptable plan that minimizes a given cost function (defined in Equation 1). We name this problem the Stars Ordering and Selection problem, since we seek to find the optimal ordering of Query Stars in the plan. Its definition and complexity are given in Proposition 4.2 and Theorem 1 respectively.

Proposition 4.2. Stars Ordering and Selection (SOS) problem: given a query q, find an acceptable plan $P^*$ that minimizes $Total\_Cost(P^*)$ (Eq. 1).

Theorem 1. The Stars Ordering and Selection (SOS) problem is NP-Hard.

Theorem 1 is explained as follows: our problem is as difficult as the well-known problems belonging to the NP-Hard class, and there is no efficient (polynomial) algorithm that solves it exactly. We then face two options: either an exact but exponential algorithm, or a polynomial but inexact algorithm. In the next section we describe a branch and bound based algorithm allowing to find the optimal query plan based on some parameters. Due to the lack of space, the proof of this theorem is available online (Theorem 1's proof and experimental queries: https://www.lias-lab.fr/~amesmoudi/papers/dolap2020/SOS-NP-hardness-proof.pdf).

4.4 Optimal P Finding Algorithm
We present in this section our parametric algorithm allowing to find the best plan for a given query. Our algorithm relies on a branch and bound strategy to enumerate candidate solutions. To prune invalid execution plans it relies on the concepts of Allowed_Stars and Stars_Distance, defined next.

Allowed Stars. This concept guarantees that all the generated execution plans are acceptable plans (APs). For a given plan X, Allowed_Stars is the set containing the query stars (forward and backward) such that any of them can be added to X and produce an AP. For the example query of Figure 2, the Allowed_Stars set for the plan $[\overrightarrow{QS}(?c)]$ is $\{\overrightarrow{QS}(?f), \overleftarrow{QS}(El\ Prat)\}$ since both plans ($[\overrightarrow{QS}(?c), \overrightarrow{QS}(?f)]$ and $[\overrightarrow{QS}(?c), \overleftarrow{QS}(El\ Prat)]$) are acceptable. The formal definition is given in Proposition 4.3.

Proposition 4.3. (Allowed stars) Let X be a valid plan; the allowed stars set is defined as follows:
$$Allowed\_stars(X) = \{qs \mid qs \in \overrightarrow{QS} \cup \overleftarrow{QS} \text{ and } [X, qs] \text{ is an } AP\}$$

Stars Distance. This user-defined parameter allows skipping combinations that the user does not want to explore, based on the concept of distance. The distance between two stars in a query graph is the number of edges separating the heads of the query stars along the shortest path. For example, for the query in Figure 2, the distance between the query stars $\overrightarrow{QS}(?c)$ and $\overleftarrow{QS}(?m)$ is 2 since between both heads there are two predicates (has_flight and plane_model). The distance between stars allows considering only plans that privilege the evaluation of neighboring stars. The formal definition is given in Proposition 4.4.

Proposition 4.4. (Stars Distance) Given two query stars QS(x) and QS(y), the distance between both query stars is given by:
$$distance(QS(x), QS(y)) = |\{p \mid p \text{ is an edge on the shortest path between } x \text{ and } y\}|$$

Algorithm overview. The optimal execution plan discovery algorithm follows a branch and bound strategy in which the set of candidate plans is enumerated in a decision tree. The root of the tree contains the union of the sets of forward and backward query stars for a specific query. Each node of the decision tree contains a candidate plan, the Allowed_stars set for this plan and its cost (cost function described in Section 4.2). The children nodes are the execution plans resulting from adding a query star from the set of Allowed_stars to the execution plan of the parent node. The Allowed_stars set is empty if the node's plan is an AP. If the cost of the plan in the node is greater than the cost of the best plan (at that moment), then the node is not expanded to its children even if there are still query stars in the Allowed_stars set. The cost of the best plan is initialized to infinity, so the first AP generated becomes the best plan early in the exploration.

Figure 6: Decision tree for the query of Figure 2 (root []; first-level nodes $[\overrightarrow{QS}(?c)]$ with cost $c_1$ and $[\overrightarrow{QS}(?f)]$ with cost $c_4$; second-level nodes $[\overrightarrow{QS}(?c), \overrightarrow{QS}(?f)]$ with cost $c_2$, $[\overrightarrow{QS}(?f), \overleftarrow{QS}(El\ Prat)]$ with cost $c_3$ and $[\overrightarrow{QS}(?f), \overleftarrow{QS}(?f)]$ with cost $c_5$)

Let us consider the example tree shown in Figure 6, in which the first query star to be explored is $\overrightarrow{QS}(?c)$, with Allowed_Stars = $\{\overleftarrow{QS}(El\ Prat), \overrightarrow{QS}(?f)\}$ and cost $c_1$. Since $c_1$ is not greater than infinity, we continue expanding the node with the query stars in its Allowed_Stars set. The tree is expanded and the child node contains the plan $[\overrightarrow{QS}(?c), \overrightarrow{QS}(?f)]$, which is an AP because its Allowed_stars set is empty. Since the cost $c_2$ of the plan is lower than infinity, the plan of the node becomes the best query plan (at the moment). The tree continues to expand, creating the child $[\overrightarrow{QS}(?f), \overleftarrow{QS}(El\ Prat)]$. Assuming that $c_3 > c_2$, this node is not expanded even if query stars remain in its Allowed_stars set. The algorithm continues exploring similarly until no more query stars at the root can be expanded and the best possible plan has been found.

The tree exploration technique is given in Algorithm 1. The initialization is done in steps 1-3. We start by creating a node under the root of the decision tree (step 5). Each node of the decision tree is characterized by three elements: the plan, the allowed query stars and the cost. In steps 6-8 we initialize the elements of the node created in step 5. Then, in step 9, we call a function named Add_Query_Star for each query star. This function is described in Algorithm 2.

Algorithm 1: Optimal P Finding
INPUTS: d: stars distance; $\overrightarrow{QS}(x)$, $\overleftarrow{QS}(x)$: forward and backward query stars
OUTPUT: P: best plan
1:  P.Plan <- []
2:  P.Cost <- infinity
3:  T: decision tree
4:  for qs in $\overrightarrow{QS}(x) \cup \overleftarrow{QS}(x)$ do
5:      create a new node N in T
6:      N.Plan <- []
7:      N.Allowed_QSs <- []
8:      N.Cost <- 0
9:      Add_Query_Star(qs, N, P, d)
10: end for
11: return P

The Add_Query_Star function (Alg. 2) receives as parameters a query star, the node of the decision tree, the best plan (at the moment) and the stars distance. It starts by calculating the node's elements: it adds the query star to the plan (step 1), then calculates the allowed stars (step 2) and the cost as defined in Section 4.2 (step 3). Next, if the cost of the plan is smaller than the cost of the plan currently defined as best, we call the Enumerate_Child_Branch function (step 5), defined in Algorithm 3. If the cost is greater, the branch is abandoned (step 7).

Algorithm 2: Add_Query_Star
INPUTS: qs: query star; N: decision tree node; P: best plan; d: stars distance
1: N.Plan <- N.Plan + [qs]
2: N.Allowed_QSs <- Allowed_Stars(N.Plan)
3: N.Cost <- Total_Cost(N.Plan)      // defined in Sect. 4.2
4: if N.Cost < P.Cost then
5:     Enumerate_Child_Branch(N, P, d)
6: else
7:     EXIT (i.e., abandon this branch)
8: end if

The Enumerate_Child_Branch function (Alg. 3) has as inputs the node of the decision tree, the best plan at the moment and the stars distance. If the set of allowed stars is empty, we take the plan of this node and its cost as the new best plan and minimal cost respectively (steps 1-3). If there are query stars in the allowed stars set, then for each element that fulfills the distance constraint we call the Add_Query_Star function (steps 5-11).

Algorithm 3: Enumerate_Child_Branch
INPUTS: N: decision tree node; P: best plan; d: stars distance
1:  if N.Allowed_QSs is empty then
2:      P.Plan <- N.Plan
3:      P.Cost <- N.Cost
4:  else
5:      for qs in N.Allowed_QSs do
6:          qs_l <- last query star in N.Plan
7:          if distance(head(qs_l), head(qs)) <= d then
8:              N' <- a copy of N
9:              Add_Query_Star(qs, N', P, d)
10:         end if
11:     end for
12: end if
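For concreteness, here is a compact Python rendition of Algorithms 1-3 (a sketch, not the authors' code): recursion over list copies plays the role of the explicit decision-tree nodes, query star objects are assumed to expose a `head` attribute, and `allowed_stars`, `total_cost` and `distance` are assumed to implement Proposition 4.3, Equation 1 and Proposition 4.4 respectively.

```python
import math

def optimal_plan(stars, allowed_stars, total_cost, distance, d):
    """Branch-and-bound search of Algorithms 1-3 (a sketch, names ours)."""
    best = {"plan": [], "cost": math.inf}

    def add_query_star(plan, qs):                 # Algorithm 2
        plan = plan + [qs]
        cost = total_cost(plan)
        if cost >= best["cost"]:
            return                                # prune: abandon this branch
        allowed = allowed_stars(plan)
        if not allowed:                           # Algorithm 3, steps 1-3
            best["plan"], best["cost"] = plan, cost
        else:                                     # Algorithm 3, steps 5-11
            for nxt in allowed:
                if distance(plan[-1].head, nxt.head) <= d:
                    add_query_star(plan, nxt)

    for qs in stars:                              # Algorithm 1, main loop
        add_query_star([], qs)
    return best["plan"], best["cost"]
```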
Finally, we evaluated the 2: P.Cost ←− ∞ precision of the algorithm selecting the optimal execution time 3: T : decision tree based on a precision measure that we define in Sect 5.4. −→ ←− 4: for qs ∈ QS(x) ∪ QS(x) do 5: Create a new node N in T 5.1 Experimental setup 6: N .Plan ←− [ ] Hardware: We conducted all experiments on a machine with 7: N .Allowed_QSs ←− [ ] an Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz CPU, 1TB hard 8: N .Cost ←− 0 disk and 32GB of RAM. Ubuntu server 16.04 LTS is used as an 9: Add_Query_Star (qs, N , P, d) operating system. 10: end for Datasets: We created statistics and evaluate execution times ex- 11: Return P ecuting queries3 on real and synthetic datasets. We utilize the popular LUBM (16GB) and Watdiv (15GB) benchmarks as syn- Algorithm 2: Add_Query_Star thetic datasets and a Yago2 (41GB) and DBLP (32GB) as real datasets. INPUTS: qs: Query Star,N : decision Tree node,P: Best Plan, d: stars distance 5.2 Collection of statistics 1: N .Plan ←− N .Plan ∪ qs 2: N .Allowed_QSs ←− Allowed_Stars (N.Plan) The evaluation of the process of statistics generation are sum- 3: N.Cost←− Total_Cost(N.Plan) ◁ Defined in Sect. 4.2 marized in Table 2. As it is shown, the size of the statistics is 4: if N .Cost < P.Cost then very small compared to the actual size of the data, just a few 5: Enumerate_Child_Branch(N , P, d) MB for all the datasets. For example, in the Yago dataset (41GB) 6: else the statistics are stored in a file of only 82MB (0.2%). The time 7: EXIT (i.e., abandon this branch) in minutes to generate the statistics is shown in the column ST. 8: end if The time to generate the statistics is negligible compared to the loading times of the real database (in real datasets it was less than 10% of the loading time). In our case, we intentionally worked stars distance. If the set of allowed stars is empty, we consider the with a hardware with limited specifications to prove that the plan of this node and its cost as the new best plan and minimal generation of statistics is scalable. We are able to generate the cost respectively (steps 1-3). If there are query stars in the allowed statistics without loading the entire database to main memory. stars set, then for each element that fulfills the distance constraint we call the Add_Query_Star function (steps 5-11). Table 2: Size of statistics M: millions, ST: Statistics, LT: loading time Algorithm 3: Enumerate_Child_Branch −−→ ←−− INPUTS: N : decision Tree node, P: Best Plan, d: stars Dataset Triples |Gf | |Gf | ST % ST % (M) (MB) Size (min) LT. distance 1: if N .Allowed_QSs is empty then LUBM 19.9 11 13 0.008 0.00005 0.50 4.4 2: P.Plan ←− N .Plan Watdiv 109 39,855 1,181 198 0.2 84.2 71.6 3: P.Cost ←− N .Cost Yago 284 25,511 1,216 82 1.32 84.1 9.4 DBLP 207 247 26 0.196 0.0006125 4.7 3.6 4: else 5: for qs ∈ N .Allowed_QSs do 6: qs l ←− last query star in N .Plan 7: if distance(head(qs l ), head(qs) ≤ d) then 5.3 Evaluation of cost model 8: N’ ←− a copy of N We evaluated the estimations of our model measuring the relative 9: Add_Query_Star (qs, N ′, P) error in the estimation of data stars and mappings for the query 10: end if selected as best query. The results of these estimations are shown 11: end for in Figure 7 (plotted with logarithmic scale for readability). The 12: end if relative error is greater in queries that do not send back any result. However, as it is seen later, this estimation does not affect the choice of the best execution plan. 
5 EXPERIMENTAL EVALUATION We conducted our experiments in the QDAG system [9], storing 5.4 Optimal P Algorithm the data as graph fragments and solving queries using a graph- We defined a precision measure to evaluate the choice of the exploration approach. We evaluated firstly the time needed to plan made by the selection algorithm. We sorted the execution generate the proposed statistics. We do not give the total loading plans for each query based on their execution time. The precision times for the tested datasets (they are found in [5, 9]), instead we measures how far is the best plan proposed by the algorithm prove that the dimensions and generation time of the proposed compared to the actual best plan in terms of execution time. The 8 3 ∆DS ∆DS ∆DS 12 ∆DS ∆MP ∆MP ∆MP ∆MP 3 10 6 % Log Relative Error % Log Relative Error % Relative Error % Relative Error 2 8 2 4 6 1 4 2 1 2 0 0 0 0 1 2 3 4 1 2 3 4 5 1 2 3 4 5 6 1 2 3 4 Query Query Query Query (a) Watdiv (b) LUBM (c) DBLP (d) Yago Figure 7: Relative Error Estimation 1 1 VLDB J. 18, 2 (2009), 385–406. 0.8 0.8 [2] Ibrahim Abdelaziz, Razen Harbi, Zuhair Khayyat, and Panos Kalnis. 2017. A 0.6 0.6 Survey and Experimental Comparison of Distributed SPARQL Engines for Precision Precision Very Large RDF Data. PVLDB 10, 13 (2017), 2049–2060. 0.4 0.4 [3] Jeen Broekstra, Arjohn Kampman, and Frank van Harmelen. 2002. Sesame: A 0.2 0.2 Generic Architecture for Storing and Querying RDF and RDF Schema. In The Semantic Web - ISWC First International Semantic Web Conference, Sardinia, 0 1 2 3 4 0 1 2 3 4 5 Italy, June 9-12. 54–68. Query Query [4] Lei Gai, Xiaoming Wang, and Tengjiao Wang. 2018. ROSIE: Runtime Op- timization of SPARQL Queries over RDF Using Incremental Evaluation. In (a) Watdiv (b) LUBM 11th International Conference, KSEM 2018, Changchun, China, August 17-19. 1 1 117–131. 0.8 0.8 [5] Jorge Galicia, Amin Mesmoudi, and Ladjel Bellatreche. 2019. RDFPartSuite: Bridging Physical and Logical RDF Partitioning. In 21st International Confer- 0.6 0.6 ence, DaWaK 2019, Linz, Austria, August 26-29. 136–150. Precision Precision 0.4 0.4 [6] Andrey Gubichev and Thomas Neumann. 2014. Exploiting the query structure for efficient join ordering in SPARQL queries. In Proceedings of the 17th EDBT 0.2 0.2 2014, Athens, Greece, March 24-28. 439–450. 0 0 [7] Sairam Gurajada, Stephan Seufert, Iris Miliaraki, and Martin Theobald. [n.d.]. 1 2 3 4 5 6 1 2 3 4 Query Query TriAD: A Distributed Shared-nothing RDF Engine Based on Asynchronous Message Passing. In Proceedings of the 2014 ACM SIGMOD, year = 2014, location (c) DBLP (d) Yago = Snowbird, Utah, USA, pages = 289–300, numpages = 12. [8] Yannis E. Ioannidis. 2003. The History of Histograms (abridged). In Proceedings of 29th VLDB 2003, Berlin, Germany, September 9-12. 19–30. Figure 8: Precision of Best P Algorithm [9] Abdallah Khelil, Amin Mesmoudi, Jorge Galicia, and Mohamed Senouci. 2019. Should We Be Afraid of Querying Billions of Triples in a Graph-Based Cen- tralized System?. In Model and Data Engineering - 9th International Conference, MEDI 2019, Toulouse, France, October 28-31. 251–266. precision is defined as: [10] Brian McBride. 2002. Jena: A Semantic Web Toolkit. IEEE Internet Computing Precision(P) = (#Plans − Pos(P))/(#Plans − 1) 6, 6 (2002), 55–59. [11] Thomas Neumann and Guido Moerkotte. 2011. Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. 
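The precision measure reads directly in code; a short sketch (assuming `plans_by_time` is the list of plans sorted by measured execution time, fastest first):

```python
def precision(chosen_plan, plans_by_time):
    """Precision(P) = (#Plans - Pos(P)) / (#Plans - 1) with a 0-based rank;
    1.0 means the chosen plan is the actual fastest, 0.0 the slowest."""
    pos = plans_by_time.index(chosen_plan)
    return (len(plans_by_time) - 1 - pos) / (len(plans_by_time) - 1)
```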
6 CONCLUSION
In this paper, inspired by the relational model, we first provided logical structures to model the execution plan based on graph exploration techniques. Then, we proposed a novel cost model comparing equivalent logical execution plans based on statistics collected for clusters of triples (which we denoted graph fragments). The cost model estimates the disk and network interactions for a specific logical plan. Furthermore, we formally studied the complexity of the problem of choosing the best execution plan and we proposed a branch and bound like algorithm allowing to find the best plan for a specific query. For the experiments, we used synthetic and real datasets. The results showed that cardinality estimations based on our model are very precise even though the collected statistics' size is negligible. For future work, we plan to explore other optimization strategies such as real-time execution plan auto-adaptation; we also plan to use machine learning techniques on runtime logs.

REFERENCES
[1] Daniel J. Abadi, Adam Marcus, Samuel Madden, and Kate Hollenbach. 2009. SW-Store: a vertically partitioned DBMS for Semantic Web data management. VLDB J. 18, 2 (2009), 385–406.
[2] Ibrahim Abdelaziz, Razen Harbi, Zuhair Khayyat, and Panos Kalnis. 2017. A Survey and Experimental Comparison of Distributed SPARQL Engines for Very Large RDF Data. PVLDB 10, 13 (2017), 2049–2060.
[3] Jeen Broekstra, Arjohn Kampman, and Frank van Harmelen. 2002. Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. In The Semantic Web - ISWC 2002, First International Semantic Web Conference, Sardinia, Italy, June 9-12. 54–68.
[4] Lei Gai, Xiaoming Wang, and Tengjiao Wang. 2018. ROSIE: Runtime Optimization of SPARQL Queries over RDF Using Incremental Evaluation. In 11th International Conference, KSEM 2018, Changchun, China, August 17-19. 117–131.
[5] Jorge Galicia, Amin Mesmoudi, and Ladjel Bellatreche. 2019. RDFPartSuite: Bridging Physical and Logical RDF Partitioning. In 21st International Conference, DaWaK 2019, Linz, Austria, August 26-29. 136–150.
[6] Andrey Gubichev and Thomas Neumann. 2014. Exploiting the query structure for efficient join ordering in SPARQL queries. In Proceedings of the 17th EDBT 2014, Athens, Greece, March 24-28. 439–450.
[7] Sairam Gurajada, Stephan Seufert, Iris Miliaraki, and Martin Theobald. 2014. TriAD: A Distributed Shared-nothing RDF Engine Based on Asynchronous Message Passing. In Proceedings of the 2014 ACM SIGMOD, Snowbird, Utah, USA. 289–300.
[8] Yannis E. Ioannidis. 2003. The History of Histograms (abridged). In Proceedings of the 29th VLDB 2003, Berlin, Germany, September 9-12. 19–30.
[9] Abdallah Khelil, Amin Mesmoudi, Jorge Galicia, and Mohamed Senouci. 2019. Should We Be Afraid of Querying Billions of Triples in a Graph-Based Centralized System?. In Model and Data Engineering - 9th International Conference, MEDI 2019, Toulouse, France, October 28-31. 251–266.
[10] Brian McBride. 2002. Jena: A Semantic Web Toolkit. IEEE Internet Computing 6, 6 (2002), 55–59.
[11] Thomas Neumann and Guido Moerkotte. 2011. Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In Proceedings of the 27th ICDE 2011, April 11-16, Hannover, Germany. 984–994.
[12] Thomas Neumann and Gerhard Weikum. 2008. RDF-3X: a RISC-style engine for RDF. PVLDB 1, 1 (2008), 647–659.
[13] Thomas Neumann and Gerhard Weikum. 2010. The RDF-3X Engine for Scalable Management of RDF Data. The VLDB Journal (2010), 91–113.
[14] Peng Peng, Lei Zou, M. Tamer Özsu, Lei Chen, and Dongyan Zhao. 2016. Processing SPARQL queries over distributed RDF graphs. VLDB J. 25, 2 (2016), 243–268.
[15] Minh-Duc Pham, Linnea Passing, Orri Erling, and Peter A. Boncz. 2015. Deriving an Emergent Relational Schema from RDF Data. In Proceedings of the 24th WWW 2015, Florence, Italy, May 18-22. 864–874.
[16] Theoni Pitoura and Peter Triantafillou. 2008. Self-Join Size Estimation in Large-scale Distributed Data Systems. In Proceedings of the 24th ICDE 2008, April 7-12, Cancún, Mexico. 764–773.
[17] Viswanath Poosala, Yannis E. Ioannidis, Peter J. Haas, and Eugene J. Shekita. 1996. Improved Histograms for Selectivity Estimation of Range Predicates. In Proceedings of the 1996 ACM SIGMOD. 294–305.
[18] Giorgio Stefanoni, Boris Motik, and Egor V. Kostylev. 2018. Estimating the Cardinality of Conjunctive Queries over RDF Data Using Graph Summarisation. In Proceedings of the World Wide Web Conference, WWW 2018, Lyon, France, April 23-27. 1043–1052.
[19] Markus Stocker, Andy Seaborne, Abraham Bernstein, Christoph Kiefer, and Dave Reynolds. 2008. SPARQL basic graph pattern optimization using selectivity estimation. In Proceedings of the 17th WWW 2008, Beijing, China, April 21-25. 595–604.
[20] Petros Tsialiamanis, Lefteris Sidirourgos, Irini Fundulaki, Vassilis Christophides, and Peter A. Boncz. 2012. Heuristics-based query optimisation for SPARQL. In 15th EDBT 2012, Berlin, Germany, March 27-30. 324–335.
[21] Xiaofei Zhang, Lei Chen, Yongxin Tong, and Min Wang. 2013. EAGRE: Towards scalable I/O efficient SPARQL query evaluation on the cloud. In 29th IEEE ICDE 2013, Brisbane, Australia, April 8-12. 565–576.
[22] Lei Zou, M. Tamer Özsu, Lei Chen, Xuchuan Shen, Ruizhe Huang, and Dongyan Zhao. 2014. gStore: a graph-based SPARQL query engine. VLDB J. 23, 4 (2014), 565–590.