1. Introduction

Processing SPARQL TOP-k Queries Online with Web Preemption

JulienAimonier-Dava

HalaSkaf-Moll

iand PascalMoll

Semantic Web, SPARQL, Web Preemptiont,op- Queries

0 LS2N, University of Nantes , 2 rue de la Houssinière BP 92208, 44322 Nantes Cedex3 , France

Processingtop- queries on public online SPARQL endpoints often runs into fair use policy quotas and does not complete. Indeed, existing endpoints mainly follow the tradmitaitoenriaallize-and-sort strategy. Although restricted SPARQL servers ensure the terminattoipo-n oqfueries without quotas enforcement, they follow tmheaterialize-and-sort approach, resulting in high data transfer and poor performance. In this paper, we propose to extend the Web preemption model with a preemptable partiatlop- operator. This operator drastically reduces data transfer and significantly improves query execution time. Experimental results show a reduction in data transfer by a factor of 100 and a reduction of up to 39% in Wikidata query execution time.

1. Introduction

Context and motivation. In knowledge graphs,top- queries allow users to search for the largest cities in the world, the longest rivers, the highest mountains, the oldest softw1a]r.e, etc [ However, processingtop- queries on public SPARQL endpoints is challenging, mainly due to the fair-use policies of public endpoints that stop queries before termin2a, t3i]o.nFo[r instance, a query that just looks for the 10 oldest people on Wikidata runs out of time (c1fa.)fig. uSruech a problem is not restricted to Wikidata. The query that returns the ten first classes ordered by their name timeout both on Wikidata and DBpedia (cf. figu1br)e.

Why aretop- queries interrupted by quotas? Mainly because SPARQL engines follow the materialize-and-sort approach to answetrop- queries [ 4 ], i.e. engines first materialize all the query results, then sort them according to the ORDER-BY clause, and finally keep and return the first results. To return the 10 oldest people on Wikidata, the query depicted in1Faigure materializes more than 2,8M people.

Related works.

Many techniques have been proposed to proviedaerly termination orearly pruning when processingtop- queries [ 1 ]. In SPARQL, thanks to sorted access to data, a smart

SELECT ? humanLabel ( YEAR ( ? death ) − YEAR ( ? b i r t h ) AS ? l i f e s p a n ) WHERE { ?human wdt : P31 / wdt : P279 ∗ wd : Q5 . ?human wdt : P569 ? b i r t h . ?human wdt : P570 ? death .

SERVICE w i k i b a s e : l a b e l {

bd : se rv ic ePa ra m w i k i b a s e : language ” en ” .

} } ORDER BY DESC( ? l i f e s p a n ) LIMIT 10 (a) TOP 10 oldest people

SELECT DISTINCT ? c ? l WHERE { ? s a ? c .

? s r d f s : l a b e l ? l } ORDER BY ? l LIMIT 10 (b) TOP 10 classes engine can decide to stop the query execution earlier because the remaining mappings are guaranteed not to be part of the troepsults4[ , 5 ]. Although early termination or early pruning can drastically improve execution time, none of them guarantee that a query will terminate before quotas on a public SPARQL endpoint.

To ensure termination toofp- queries, existing approaches rely on servers that ensure a fair-use policy without quotas, i.e. fairness is guaranteed by a restricted SPARQL interface. Restricted SPARQL servers such as TP6F][, Web preemption 7[] or SmartKG8[] only support a restricted set of SPARQL operators that do not impact the responsiveness of the restricted server. Other SPARQL operators are supported by the client. To the best of our knowledge, eficient evaluation oftop- queries has not been studied in the context of restricted SPARQL servers.top- queries are evaluated following the traditmioantearlialize-and-sort approach, where materialization is done on the client-side. Even if queries are guaranteed to terminate, data transfer and execution time can be prohibitive.

The research question is to study how a restricted server can preoarvliydetermination or early pruning while ensuring the fairness of the server and the terminattioonp-ofqueries. Approach and Contributions. In this paper, we propose to extend Web preemption with a new preemptabletop- operator that ensures terminatiotnopo-f queries, while enabling the use of early pruning techniques. The contributions of the paper are the following: (1) We propose a preemptablteop- operator whose overhead does not depend o, ni.e. the server is fair whatever the value of tLhIeMIT clause. (2) We implement thteop- operators as an extension of the preemptabSlaeGe server. (3) The experimental results show that the new operators can improve query execution time by up to 60% and divide the data transfer by 100.

The remainder of this paper is organized as follows. Sec2tiionntroduces SPARQLtop- queries and briefly recalls the definitions related to our proposal. Sec3tipornesents the approach for processingtop- queries using preemptablteop- operators. Sectio4npresents our experimental results. Sect5iosnummarizes related works. Finally, conclusions and future work are outlined in Secti6o.n : a1 : c o n f e r e n c e :WWW . : a3 : c o n f e r e n c e : ISWC . : a5 : c o n f e r e n c e : ESWC . : a1 : p u b l i c a t i o n 2021 . : a3 : p u b l i c a t i o n 2022 . : a5 : p u b l i c a t i o n 2021 . : a1 : c i t a t i o n s 20 . : a3 : c i t a t i o n s 15 . : a5 : c i t a t i o n s 10 . : a2 : c o n f e r e n c e : ISWC . : a4 : c o n f e r e n c e : ISWC . : a6 : c o n f e r e n c e : ESWC . : a2 : p u b l i c a t i o n 2021 . : a4 : p u b l i c a t i o n 2022 . : a6 : p u b l i c a t i o n 2022 . : a2 : c i t a t i o n s 10 . : a4 : c i t a t i o n s 12 . : a6 : c i t a t i o n s 2 . :WWW : rank 1 . : ISWC : rank 2 (a) RDF Graph 1 . : ESWC : rank 3 .

SELECT ? a r t i c l e WHERE { ? conf : rank ? rank . #1 ? a r t i c l e : c o n f e r e n c e ? conf . #2 ? a r t i c l e : p u b l i c a t i o n ? data . #3 ? a r t c i l e : c i t a t i o n s ? c i t a t i o n s . #4 } ORDER BY ? rank , DESC( ? c i t a t i o n s ) LIMIT 2

(b) SPARQL query 1

2. Preliminaries and backgrounds

In this section, we briefly recall the definitions related to our proposal. We follow the notations from [9, 10] and consider three disjoint se t(sIRIs), (literals) an d (blank nodes). We denote the set of RDF terms a s∪ ∪ . An RDF triple(, , ) ∈ ( ∪ ) × × connects a subjec t through a predica teto an objec t. An RDF graph is a finite set of RDF triples. We assume the existence of an infinite set of variables, disjoint from the previous sets. A map pinfrgom to is a partial functi o∶n → . The domain of , denoted by() , is the subset of where is defined. A SPARQL graph pattern expressio n is defined recursively as follows: 1. A tuple from( ∪ ) × ( ∪ ) × ( ∪ ) is a triple graph pattern. 2. If 1 and 2 are graph patterns, then expressio n1s A(ND 2), ( 1 OPT 2) and ( 1 UNION 2) are graph patterns. 3. If is a graph pattern a ndis a SPARQL built-in condition, then the expressi onFI(LTER ) is a graph pattern.

The evaluation of a graph patteronver an RDF graph , denoted by , produces a multiset J K Ω of solution mappings.

2.1. TOP-k Queries

top- queries return the topresults according to a user-specified ranking (scoring) function. top- SPARQL queries can be expressed usingORDER BY and LIMIT clauses that first impose an order on the result set and then limit the number of results. Fol4l]o, wthinegOR[DER BY clause can be formulated as a ranking functℱiocnombining several ranking criter{ i1a, … , }. Given a graph patter n, a ranking criterio(n? 1, … , ? ) is a function defined over a set of variables, where∀∈[1,] ∶ ? ∈ () . The evaluation of a ranking criter ioonn a mapping , denoted by[] , is the substitution of all varia?b l e∈s by the corresponding values from the mapping, i.e(.? ). Given two mappings 1 and 2, 1 > 2 according to a ranking functionℱ ({ 1, … , }), denoted byℱ [ 1] > ℱ [ 2], if ∃∈[1,] ∶ [ 1] > [ 2] and ∀∈[1,[ ∶ [ 1] = [ 2].

In order to evaluate the ranking functℱio,anll the variables i(n) that contribute in the evaluation oℱf must be bounded. SinceOPTIONAL and UNION clauses can introduce unbound variables, we assume all the variable (s)in to becertain variables as defined in [ 11], i.e. variables that are certainly bounded for every mapping produ ce.dTbhye relaxation of the certain variables constraint is part of the future work.

2.2. Web Preemption and TOP-k Queries

Web preemption [7] is the capacity of a Web server to suspend a running SPARQL query after a ifxed quantum of time and resume the following waiting query. When suspending a qu e,ray preemptable server saves the internal state of all oper ationras osafved plan , which is sent to the client. The client can continue the executiobnyosfending back to the server. When reading , the server restarts qu eryfrom where it has been stopped. As a preemptable server can restart queries from where they have been stopped and makes a progress at each quantum, it eventually delivers complete results after a bounded number of quanta.

However, Web preemption comes with overheads. The time of the suspend and resume operations represents the overhead in time of a preemptable server. The s izreeporfesents the overhead in space of a preemptable server and may be transferred over the network each time the server suspends a query. To be tractable, a preemptable server has to minimize these overheads. For this purpose, a preemptable server only implements SPARQL operators that can be saved and resumed in bounded time, i.e. preemptable operators. Other operators are considered as not preemptable and implemented on the client. For instanScCeA, Nt,hJeOIN, UNION, LIMIT and FILTER operators just need to manage one mapping at a time, consequently, they can be saved and resumed in bounded time. On the other hand, some SPARQL operators require to materialize mappings, e.g. OtRhDeER BY operator requires materializing mappings before sorting, while thGeROUP BY operator requires materializing the group keys1(s2e])e. [ As materialization of mappings requires space resources that depends on the size of the data, they cannot be saved in bounded time.

Figure3 illustrates how Web preemption processesttohpe- query 1 over the RDF graph 1, both depicted in Figur2e. First, the client decomposes the qu er1yinto 1′ so that only preemptable operators are sent to the preemptable server,Oi.ReD. EtRheBY and LIMIT clauses are removed fro m 1. Let us suppose that quer y1′ requires 3 quanta to complete. At the end of each quantum , the client receives the mappingsand asks for the following results by sending the saved pla n of 1′ back to the server. Once all mappings are obtained, the client sorts them according to the ranking function defined by OtRhDeER BY clause, and apply the slice operator to keep only the top 2 mappings.

Althoughtop- queries are expected to transfer o nrlyesults, in the case of Web preemption, because theORDER BY operator is implemented on the client-side, all mappings will be transferred anyway Moreover, this execution strategy prevents us from taking advantage of possible optimizations when evaluating SPARtQoLp- queries, such as early termination or early pruning of partial mapping1s, 1[ 3 ]. To overcome this issue, one solution could be to apply state-of-art solutions on the client-side. However, this solution requires computing joins on the

3. Preemptable TOP-k Operator

Given a ranking functionℱ, the computation of the t opmappings only requires maintaining an ordered list ofmappings. If we set an upper bound for , a simple iterator as defined by Algorithm1 is preemptable. Algorith1mis a simple adaptation of the “Top-N heapsort” algorithm that is commonly used in the Postgres database. The idea of the Top-N heapsort algorithm is to maintain in a main-memory bufer only the bmesatppings seen so far. In a realworld configuration, the upper bound should be set by the SPARQL endpoint administrator.

Algorithm1 relies on a data structuℱrethat maintains a list of mapping1,s… , such that ∀∈[1,[ , ℱ [ ] ≤ ℱ [ +1 ]. First, the algorithm fillℱs with the first mappings drawn from the if =

Data: ℱ: ordered list of mappings accordingℱto, : last mapping rea d, : threshold to enter tthoep- . previous iteratoℐr. Then, each time a new mappin g is drawn fromℐ, is compared with the mapping with the smallest rank ℱ

in, i.e. 1 denoted by in Algorithm1. If ℱ [ ] > ℱ [ ], saving (resp. resuming) the iterator is((i,n )) , Algorithm1 is preemptable. replaces in the to p . Once ℐ exhausted, Algorithm1incrementally retur nℱs.

When the query is suspended (resp. resumed) by the preemptable SPARQL server, Algor1ithm just has to save ℱ in (resp. resume ℱ from ). As the time and space complexities of

However, Algorithm1 has two drawbacks. First, it imposes a maximum va luoen the number of results returned tboyp- queries. If users want to executetoap- query with > , they have to rely on the client-side solution, i.e. transferring and sorting all solutions of to end up keeping only solutions. Second, even if the size of tthoep- is bounded by , the overhead of saving and resuming thteop- can have a significant impact on the execution time and the data transfer (wh enis sent to the client), as demonstrated in our experimental study. Problem Statement. Is it possible to implement a preemptatbolpe- iterator whose complexity does not depend on?

3.1. Preemptable Partial TOP-k Iterator

The key idea to avoid depending on is to rely on partitaolp- . Using a preemptive Web server, the evaluation of a graph patteorvner an RDF graph naturally creates a partition of mappings 1, ..., over time, where is produced during quantu m . Intuitively, a partial top- is obtained by computing the topmappings on a partition of mappin gs .

Instead of computing the wholteop- query on the server, which requires saving and resuming it each time the preemption occurs, the server only computes ptaorpt-ialℱ1 , … , ℱ thetop- .

Data: : last mapping read. else maximum limi t fortop- queries, ℱ: empty list of ordered mappings accordingℱt,o : threshold to enter 19 21 23 25 18 Procedure Open(): 20 Procedure Load( ′): can be saved and resumed in bounded time.

To comply with the Web preemption mode7l],[the amount of data transferred to the client per quantum need to be bounded. Consequently, the size of a parttoipa-l If ℱ reaches

mappings, it will trigger the end of the quant u,mand ℱ will be sent to the client. Thus, end-users are no more limited b,ybut reachin g may degrade performance in ℱ is bounded by . terms of data transfer and execution time. When the query execution completes, the client just because the server does not remember thoep- between quanta, it may transfer up(,t )o needs to merge a llℱ1 , … , ℱ to compute the to psolution mappings ℱ of the query. However, mappings at each quantum, even if those mappings do not contribute ttootph- e. To avoid transferring too many useless mappings, the client sends to the server the mapping with the smallest ran k (for decreasing order), with the saved pl a.n Using as a threshold, the server can discard a part of the solution mappings that do not contributteotpo- .the mappings produced byℐ during a quantum . Algorithm3 merges partiatlop-

Figure5 illustrates how to compute quer1yover the RDF grap h 1 using Algorithm2 and 3. Algorithm2 implements a preemptable iterator that computes a tpaorpt- ialℱ based on the ℱ produced by the server until the query completes. Let us assume that q1uerreyquires 3 quanta1, 2 and 3 to complete, with two mappings produced per quantum. F1o,rmappings 1 and 2 are produced byℐ and inserted intoℱ1 . When preemption occurs, bot h and ℱ1 are sent to the client. Then, the client mer geℱ1s into ℱ and sends back to the server. Becau|s eℱ| = 2 = , the client also sends the mapping with the smallest ran k2,, in.ee.xt to . During quantum 2, mappings 3 and 4 are produced. Because both mappings have a higher rank th2a,nthey are both inserted int oℱ2 . However, if the server had known the current state otfotph-e, it would q1 q2 q3 ℱ = ⟨ 1, 2⟩

= 2 ℱ = ⟨ 1, 3⟩

= 3 ℱ = ⟨ 1, 3⟩

= 3 Ω = ℱ = ⟨ 1, 3⟩ 13 Function GetNext(): 14 15 16 if | ℱ| > 0 then

return ℱ.() 2 mappings of query 1 on 1. have known that4 does not contribute to tthoep- , but without remembering the whole top , like Algorithm1 in Figure4, the server cannot safely rejec4t.This scenario illustrates the overhead of relying on a parttiaolp- iterator. By knowing only, the server may transfer mappings that do not contribute tottohpe- , such as 4. At the end of quantu m2, ℱ2 and are sent to the client2, is merged int o ℱ, and 3 becomes the mapping with the lowest rank.

ℱ Finally, during quantum3, all generated mappings have a lower rank th3a,cnonsequently, no new mappings are transferred to the client. The query completes ℱ wi=th⟨ 1, 3⟩, the top

Data: ℱ: ranking function of , : limit of , ℱ: ordered list of mappings accordingℱto, : threshold to enter the Algorithm 3: Client-Sidetop- iterator.

Require: : SPARQL top- query.

top- . 1 Procedure Open(): 2 3 4 5 6 7 8 9 10 11 12 else

3.2. Early Pruning of Partial Solution Mappings

To improve query execution time, state-of-art approaches try to guarantee early termination of top- queries. Unfortunately, existing techniques rely on expensive sort operations that are not preemptable. Although early termination cannot be achieved using Web preemption, early pruning only requires knowing the mapping with the lowest rank among the top mappings [ 13, 4 ]. Intuitively, early pruning consists in pruning a partial mapping if its rank is smaller than the lowest of thsoe far computed mappings. For instance, let us suppose that query 1 is executed using a join order that corresponds to the order of triple patterns in the query. In this case, we can use early pruning to skip all the articles published at ESWC. Following the execution depicted in Figu5r,eit is known from the first quantum that none of the papers published at ESWC can enter the top-2, because the ESWC conference rank is higher than 2.

Early pruning can be very eficient but highly depends on the join order. Moreover, the opportunity for pruning arises only wh e(nor more) complete results have been produced.

4. Experimental Study

The purpose of the experimental study is to answer the following questions: (1) Does using a preemptabletop- operator improve query execution performance? (2) What is the impact of and the duration of a quantum on performance? (3) What is the impact of early pruning on performance? (4) How does this approach perform on real-wtoorpld- queries?

4.1. Experimental Setup

To implement our approach, we used thSaeGe query engine1. Following the Web preemption model,SaGe is divided into a Python preemptable Web server and a JavaScript Web client. top- iterators, i.e. Algorith1masnd 2, have been implemented as an extension of tShaeGe preemptable Web server. To make experiments simpler, we created three custom Python clients: • SaGe is ourbaseline, i.e. Web preemption withouttaop- iteratort.op- queries are executed using theORDER BY and LIMIT operators as explained in secti2o.2nand depicted in Figure3. • SaGe-top- executestop- queries using Algorith m1defined in section 3. This corresponds to the adaptation of the “Top-N heapsort” operator that is already used in the Postgres database. • SaGe-partial-top- executestop- queries using the partiatolp- iterator defined in section3.1. This corresponds to our contribution without early pruning enabled. In all experiments, the maximum li mi,tto be set by thSeaGe server administrator, was fixed at = 10000 . The maximum number of mappings that can be sent to the client per quantum was set at10000.

Data and Queries. We used the Waterloo SPARQL Diversity Benchmark (Wat2D)[i1v4]. We re-used the RDF graph and the SPARQL queries from tShaeGe [7] experimental study. The RDF graph contains107 triples, while the workload contains 193 queries. From these 193 queries, 1https://sage.univ-nantes.fr 2https://dsg.uwaterloo.ca/watdiv we randomly picked 20 SPARQL conjunctive queries with diferent shapes and up to 10 joins per query. Because queries do not havOeRDER BY clauses, we randomly generated OaRnDER BY clause for each query, with up to 2 variables per clause. For the choice of the variables, we privileged those which take Literals as value. FoLrItMhIeT clauses, we experiment diferent values of , namely 10, 100, 1000, 10000.

To test our approach on real queries, we extracted 20 queries from the Wikidata qu3e.ry logs Queries contain from 1 to 10 joins, aORnDER BY clause with a single variable, anLdIaMIT operator wit h∈ [ 1, 10000 ] . Queries are executed on the 2017-03-13 dump of Wikida4twaith 2262M triples.

All RDF graphs are stored using HD1T5[]. The HDT-backend of theSaGe query engine is used to query the RDF graphs. Code, configurations, queries, and RDF graphs can be found online for reproducible purpos5e.s Evaluation Metrics. (1) Execution Time is the time spent executing a query, from when the query is sent to the server until the client returns the last rDesautlatT. r(a2n)sfer is the number of bytes exchanged between the client and the server during the execution of a quNeruym.b(3e)r of HTTP Calls is the number of HTTP requests sent to the server during the execution of a query.

Hardware Setup. We ran all experiments on a local cloud instance with Ubuntu 20.04.4 LTS, a AMD EPYC 7513 32-Core processor (the VM was configured with 16 vCPUs and 1 thread per core), 1TB SSD (part of a storage pool), and 64GB of RAM. Both the client and the server were running on the same machine.

Software Setup. We used Python v3.9.7, Virtuoso Open Source Edition v7.2.6 aSnadGe as of July 24 2022 (7ea3468).

4.2. Experimental Results

We first ensured that ourtop- iterators yield complete and correct results. We ran both Virtuoso and the diferent approaches to check if they provide complete and accurate results for each query using Virtuoso results as the ground truth.

Does using a preemptable top- operator improve query execution performance? Figure 6 presents the performance achieved by the diferent approaches on the WatDiv queries. First, let us consider only the results obtained w=it1h0 , without enabling early pruning. We will then see what is the impact oafnd early pruning on performance. As expected, using a dedicatedtop- iterator significantly improves query execution performance comparSeadGteo. First, bothSaGe-top- and SaGe-partial-top- significantly reduce data transfer compared toSaGe that requires to transfer all mappings to the cSlaiGenet-t.op- only sends the top 3https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en 4https://www.rdfhdt.org/datasets 5https://github.com/momo54/sage-orderby-experiment (a) WatDiv queries with = 10 (b) WatDiv queries with = 100 (c) WatDiv queries with = 1000 (d) WatDiv queries with = 10000 mappings, whileSaGe-partial-top- only transfers the topmappings seen during a quantum that are greater than the threshold computed by the client. CompSarGeed, StaoGe-top- and SaGe-partial-top- require sending fewer HTTP calls. BecauSseaGe transfers more mappings, it more often triggers the end of quanta. Indeed, Web preemption limits the number of mappings sent to the client per quantu7m]. [When the limit is reached, it triggers the end of the quantum, even if there was time left, and the longer the duration of a quantum, the more likely it is to occur. In terms of execution time, SbaoGteh-top- and SaGe-partial-top- are close toSaGe. The slight diference can be explained becausSeaGe transfers more mappings, increasing serialization times.

What is the impact o f and the duration of a quantum? As depicted in Figure6, does not impact the number of HTTP calls. No matter the val u,eaollf approaches require seeing all solution mappings. Consequently, they need the same number of quanta to complete, regardless of .

While does not impact the number of HTTP calls, it drastically afects the execution time of SaGe-top- . Algorithm1 used by SaGe-top- requires to save (resp. load) the current troepsults each time a query is suspended (resp. resumed). Consequently, w heinncreases, the time spent suspending and resumingtop- queries increase accordingly. On the WatDiv queries, Figure7 presents the time spent executing, saving, and resuming queries, both foSratGhee-top- and theSaGe-partial-top- approaches. As expected, whe n increases,SaGe-top- makes Web preemption overheads increase drastically, to the point where the server spends more time saving and resuming queries than executing them. As a reSsauGlte,-top- can require more time thaSnaGe to complete. The durationof a quantum also plays an important role. When is small, queries are more often suspended (resp. resumed), increasing Web preemption overheads. Thus, it is tractable to useStahGee-top- approach, but and must be carefully configured by the SPARQL endpoint administrator. The higher the duration of a quantum, the higher could be. Compared toSaGe-top- , does not impactSaGe-partial-top- because the time complexity of saving and resuming a parttoiap-l iterator does not depend o n(see Figure7).

This property comes at the cost of an overhead on the data transfer compSaarGede-top- . The partiatlop- iterator does not remember the current statetoofpt- hebetween quanta. The only information it has is the threshold computed by the client. If the threshold becomes outdated during a quantum, the server may transfer useless solution mappings that do not contribute to thteop- . NeverthelessS,aGe-partial-top- is better thaSnaGe-top- in almost all tested configurations. BecauSseaGe-top- has to transfer the t orpesults each time the query is suspended (resp. resumed), the data transfer can increase rapidlisyliafrge or the query is suspended/resumed too often. In the worst casSea,Ge-top- can transfer more data thanSaGe, i.e. more than all the query solutions.

What is the impact of early pruning? As depicted in Figure6, early pruning has a significant impact on WatDiv queries. Whe n≤ 1000 , we observe a 60% reduction of the execution time when using SaGe-partial-top- (resp. SaGe-top- ) with early pruning enabled, compared toSaGe-partial-top- (resp. SaGe-top- ) without early pruning. Because the opportunity for pruning arises only when(or more) complete results have been produced, early pruning has less impact whe n increases. As a result, we observe a smaller reduction, around 15%, when = 10000 . Another critical factor is the join order. The earlier the variables used by the ranking function are evaluated, the more influential the pruning is likely to be. Finally, even if the threshold is only updated between quanta when uSsainGge-partial-top- , it does not significantly impact the eficiency of early pruning compared tSoaGe-top- .

How does this approach perform on realtop- queries? Figure8 presents the performance of the diferent approaches on Wikidata queries, with and without early pruning. Compared to SaGe, bothSaGe-top- and SaGe-partial-top- allow a significant reduction in terms of data transfer. HoweverS,aGe-top- requires and to be well configured to avoid increasing Web preemption overheads. As expected, all approaches behave in a similar way in terms of execution time. However, if early pruning is enabled when usiSnagGe-top- and SaGe-partial-top- , we observe a 35% reduction compared StaoGe.

5. Related Works

State-of-the-art SPARQL query engines, such as Virtuoso and Jena, relymoanteraializeand-sort [ 1 ] approach to process SPARQLtop- queries. This consists of computing all the matching solutions (e.g., thousands), even if only a limited nu mboefrsolutions (e.g., ten) is requested 1[ 6 ]. Consequently, these queries are often long-running queries that require a lot of CPU and memory resources to terminate. To ensure stable and responsive services to users, public SPARQL endpoints set up quotas, e.g. on the number of results returned to the client, the execution time, or the arrival rate. Therefore, tmoapn- yqueries cannot be executed online simply because they reach quotas of the fair-use polic7i,e1s7[ , 2 ]. Compared to existing public endpoints, our approach ensures thattaolpl- join queries terminate without materializing all results.

Early termination and early pruning are common techniques to speed-up the processing of top- queries in databases1][. Such techniques have been adapted for RDF and SPARQL in [ 18, 4, 5, 13 ]. Early termination allows deciding if it is possible to stop the execution of a runningtop- query. A top- query can be stopped as soon as the remaining mappings are guaranteed not to be part ofttohpe- results. Early termination relies both on sorted access to data and the computation of upper bounds to entetropt-h e. Magliacane et al4.][ propose a SPARQL engine-level implementation of rank-aware physical operators allowing early termination. The SPARQL-RANK execution model creates a rank-aware pipeline of operators that incrementally output a ranked set of mappings according to a user-defined scoring function. Unfortunately, rank-aware operators require to materialize the sorted mappings. Following the Web preemption model, rank-join operators are not preemptable because they cannot be saved and resumed in bounded time. Early termination can be further improved thanks to MS-tree indexes, allowing the computation of tighter upper bounds, as proposed by Dong 1e8t].al. [ Such an approach is eficient, but the maintenance of extra indexes is costly. Yang e1t9a]l. [ propose the STAR framework. SPARQL queries are decomposed into sets of star-shaped queries for which an eficient algorithm to compute tthoep- is provided. Thanks to a join algorithm and an upper bound schema that allows the computation of tight upper bounds, STAR can combine thetop- of star-shaped queries to compute tthoep- of any SPARQLtop- query. However, just like SPARQL-RANK, Yang et al. use non-preemptable algorithms due to their space complexities. Wagner et a5l]. [propose an approach to computteop- queries without complete result materialization that relies on a probabilistic early pruning method. Given a partial mapping , the algorithm predicts with bounded accuracy the probabi li tbyeoinf g part of thetop- results. Our approach makes use of early pruning, but we aimed to compute fully accurate results as 1in3][.

6. Conclusion

In this paper, we extended Web preemption with a parttoiap-l iterator, significantly improving queries performance. By relying on parttioapl- , we can computetop- queries with arbitrary large and small values o, fwithout increasing Web preemption overheads. Moreover, with a dedicatedtop- iterator, it becomes possible to use early pruning techniques to improve query execution time. In future work, we plan to work on a way to support early termination of top- queries in the context of Web preemption. Another path we would like to explore is the evaluation otfop- queries when the ranking function is defined over aggregates. The evaluation of SPARQL aggregate queries using Web preemption also rely on a partial preemptable iterator20[, 12]. If our partiatlop- iterator can be used ove2r0,[12], a dedicated approach may allow to significantly improve performance.

Acknowledgments

This work is supported by the ANR-19-CE23-0014 DeKaloG project (CE23 - Intelligence artificielle) and the CominLabs MiKroloG project. [7] T. Minier, H. Skaf-Molli, P. Molli, SaGe: Web Preemption for Public SPARQL Query Services, in: WWW ’19: The World Wide Web Conference, Association for Computing Machinery, New York, NY, United States, 2019, p. 1268–1278. [8] A. Azzam, J. D. Fernández, M. Acosta, M. Beno, A. Polleres, SMART-KG: Hybrid Shipping for SPARQL Querying on the Web, in: WWW ’20: Proceedings of The Web Conference 2020, Association for Computing Machinery, New York, NY, United States, 2020, pp. 984–994.

[9] S. Harris, A. Seaborne, SPARQL 1.1 Query Language, in: W3C Recommendation, 2013. [10] J. Pérez, M. Arenas, C. Gutiérrez, Semantics and complexity of SPARQL, ACM Transactions on Database Systems 34 (2009) 1–45. [11] M. Schmidt, M. Meier, G. Lausen, Foundations of SPARQL query optimization, in: Proceedings of the 13th international conference on database theory, 2010, p. 4–33. [12] J. Aimonier-Davat, H. Skaf-Molli, P. Molli, A. Grall, T. Minier, Online approximative SPARQL query processing for COUNT-DISTINCT queries with Web Preemption, Semantic Web Journal 13 (2022) 735–755. [13] A. Wagner, T. T. Duc, G. Ladwig, A. Harth, R. Studer, Processing SPARQL Property Path Queries Online with Web Preemption, in: 9th Extended Semantic Web Conference, ESWC 2012, Heraklion, Crete, Greece, May 27-31, 2012, Proceedings, Springer, 2012, pp. 56–71. [14] G. Aluç, O. Hartig, M. T. Özsu, K. Daudjee, Diversified Stress Testing of RDF Data Management Systems, in: 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part I, Springer, 2014, p. 197–212. [15] J. D. Fernández, M. A. Martínez-Prieto, C. Gutiérrez, A. Polleres, M. Arias, Binary RDF representation for publication and exchange (HDT), Journal of Web Semantics 19 (2013) 22–41. [16] S. Zahmatkesh, E. D. Valle, D. Dell’Aglio, A. Bozzon, Towards a top-K SPARQL query benchmark generator, in: OrdRing’14: Proceedings of the 3rd International Conference on Ordering and Reasoning, volume 1303, CEUR-WS.org, 2014, pp. 35–46. [17] C. Buil-Aranda, A. Polleres, J. Umbrich, Strategies for Executing Federated Queries in SPARQL 1.1, in: 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part II, Springer, 2014, p. 390–405. [18] W. Dong, Z. Lei, Z. Dongyan, Top-k queries on RDF graphs, Information Sciences: an

International Journal 316 (2015) 201–217. [19] Y. Shengqi, H. Fangqiu, W. Yinghui, Y. Xifeng, Fast top-k search in knowledge graphs, in: 2016 IEEE 32nd international conference on data engineering (ICDE), IEEE, 2016, pp. 990–1001. [20] A. Grall, T. Minier, H. Skaf-Molli, P. Molli, Processing SPARQL Aggregate Queries with Web Preemption, in: 17th International Conference, ESWC 2020, Heraklion, Crete, Greece, May 31–June 4, 2020, Proceedings, volume 12123 oLfecture Notes in Computer Science, Springer, 2020, pp. 235–251.

[1]

I. F.

Ilyas , G. Beskales,

M. A.

Soliman , A survey of top-k query processing techniques in relational database systems , ACM Computing Surveys 40 ( 2008 ) 1 - 58 .

[2]

Soulet ,

F. M.

Suchanek , Anytime Large-Scale Analytics of Linked Open Data , in: 18th International Semantic Web Conference, Auckland, New Zealand, October 26-30 , 2019 , Proceedings, Part

, volume 11778 oLfecture Notes in Computer Science, Springer, 2019 , p. 576 - 592 .

[3]

Hasnain ,

Mehmood , S. S. e Zainab, A. Hogan, SPORTAL: Profiling the Content of Public SPARQL Endpoints , International Journal on Semantic Web and Information Systems (IJSWIS) 12 ( 2016 ) 134 - 163 .

[4]

Magliacane ,

Bozzon ,

E. D.

Valle , Eficient Execution of Top-K SPARQL Queries , in: 11th International Semantic Web Conference, Boston, MA, USA, November 11 - 15 , 2012 , Proceedings, Part

, volume 7649 oLfecture Notes in Computer Science, Springer, 2012 , p. 344 - 360 .

[5]

Wagner ,

Bicer , T. Tran, Pay-as-you-go Approximate Join Top-k Processing for the Web of Data , in: 11th International Conference, ESWC 2014, Anissaras, Crete, Greece, May 25 -29, 2014 , Proceedings, Springer, 2014 , pp. 130 - 145 .

[6]

Verborgh ,

M. V.

Sande ,

Hartig ,

J. V.

Herwegen ,

L. D.

Vocht ,

B. D.

Meester , G. Haesendonck,

Colpaert , Triple Pattern Fragments: A low-cost knowledge graph interface for the Web , Journal of Web Semantics 37-38 ( 2016 ) 184 - 206 .