<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Processing SPARQL TOP-k Queries Online with Web Preemption</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>JulienAimonier-Dava</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>HalaSkaf-Moll</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>iand PascalMoll</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Semantic Web, SPARQL, Web Preemptiont,op- Queries</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LS2N, University of Nantes</institution>
          ,
          <addr-line>2 rue de la Houssinière BP 92208, 44322 Nantes Cedex3</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Processingtop- queries on public online SPARQL endpoints often runs into fair use policy quotas and does not complete. Indeed, existing endpoints mainly follow the tradmitaitoenriaallize-and-sort strategy. Although restricted SPARQL servers ensure the terminattoipo-n oqfueries without quotas enforcement, they follow tmheaterialize-and-sort approach, resulting in high data transfer and poor performance. In this paper, we propose to extend the Web preemption model with a preemptable partiatlop- operator. This operator drastically reduces data transfer and significantly improves query execution time. Experimental results show a reduction in data transfer by a factor of 100 and a reduction of up to 39% in Wikidata query execution time.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Context and motivation. In knowledge graphs,top- queries allow users to search for the
largest cities in the world, the longest rivers, the highest mountains, the oldest softw1a]r.e, etc [
However, processingtop- queries on public SPARQL endpoints is challenging, mainly due to
the fair-use policies of public endpoints that stop queries before termin2a, t3i]o.nFo[r instance,
a query that just looks for the 10 oldest people on Wikidata runs out of time (c1fa.)fig. uSruech
a problem is not restricted to Wikidata. The query that returns the ten first classes ordered by
their name timeout both on Wikidata and DBpedia (cf. figu1br)e.</p>
      <p>
        Why aretop- queries interrupted by quotas? Mainly because SPARQL engines follow the
materialize-and-sort approach to answetrop- queries [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], i.e. engines first materialize all the
query results, then sort them according to the ORDER-BY clause, and finally keep and return
the first  results. To return the 10 oldest people on Wikidata, the query depicted in1Faigure
materializes more than 2,8M people.
      </p>
      <p>Related works.</p>
      <p>
        Many techniques have been proposed to proviedaerly termination orearly
pruning when processingtop- queries [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In SPARQL, thanks to sorted access to data, a smart
      </p>
      <p>SELECT
? humanLabel
( YEAR ( ? death ) − YEAR ( ? b i r t h ) AS ? l i f e s p a n )
WHERE {
?human wdt : P31 / wdt : P279 ∗ wd : Q5 .
?human wdt : P569 ? b i r t h .
?human wdt : P570 ? death .</p>
      <p>SERVICE w i k i b a s e : l a b e l {</p>
      <p>bd : se rv ic ePa ra m w i k i b a s e : language ” en ” .</p>
      <p>}
} ORDER BY DESC( ? l i f e s p a n ) LIMIT 10
(a) TOP 10 oldest people</p>
      <p>SELECT DISTINCT ? c ? l
WHERE {
? s a ? c .</p>
      <p>
        ? s r d f s : l a b e l ? l
} ORDER BY ? l LIMIT 10
(b) TOP 10 classes
engine can decide to stop the query execution earlier because the remaining mappings are
guaranteed not to be part of the troepsults4[
        <xref ref-type="bibr" rid="ref5">, 5</xref>
        ]. Although early termination or early pruning
can drastically improve execution time, none of them guarantee that a query will terminate
before quotas on a public SPARQL endpoint.
      </p>
      <p>To ensure termination toofp- queries, existing approaches rely on servers that ensure a
fair-use policy without quotas, i.e. fairness is guaranteed by a restricted SPARQL interface.
Restricted SPARQL servers such as TP6F][, Web preemption 7[] or SmartKG8[] only support
a restricted set of SPARQL operators that do not impact the responsiveness of the restricted
server. Other SPARQL operators are supported by the client. To the best of our knowledge,
eficient evaluation oftop- queries has not been studied in the context of restricted SPARQL
servers.top- queries are evaluated following the traditmioantearlialize-and-sort approach,
where materialization is done on the client-side. Even if queries are guaranteed to terminate,
data transfer and execution time can be prohibitive.</p>
      <p>The research question is to study how a restricted server can preoarvliydetermination or
early pruning while ensuring the fairness of the server and the terminattioonp-ofqueries.
Approach and Contributions. In this paper, we propose to extend Web preemption with
a new preemptabletop- operator that ensures terminatiotnopo-f queries, while enabling
the use of early pruning techniques. The contributions of the paper are the following: (1) We
propose a preemptablteop- operator whose overhead does not depend o, ni.e. the server
is fair whatever the value of tLhIeMIT clause. (2) We implement thteop- operators as an
extension of the preemptabSlaeGe server. (3) The experimental results show that the new
operators can improve query execution time by up to 60% and divide the data transfer by 100.</p>
      <p>The remainder of this paper is organized as follows. Sec2tiionntroduces SPARQLtop-
queries and briefly recalls the definitions related to our proposal. Sec3tipornesents the
approach for processingtop- queries using preemptablteop- operators. Sectio4npresents
our experimental results. Sect5iosnummarizes related works. Finally, conclusions and future
work are outlined in Secti6o.n
: a1 : c o n f e r e n c e :WWW . : a3 : c o n f e r e n c e : ISWC . : a5 : c o n f e r e n c e : ESWC .
: a1 : p u b l i c a t i o n 2021 . : a3 : p u b l i c a t i o n 2022 . : a5 : p u b l i c a t i o n 2021 .
: a1 : c i t a t i o n s 20 . : a3 : c i t a t i o n s 15 . : a5 : c i t a t i o n s 10 .
: a2 : c o n f e r e n c e : ISWC . : a4 : c o n f e r e n c e : ISWC . : a6 : c o n f e r e n c e : ESWC .
: a2 : p u b l i c a t i o n 2021 . : a4 : p u b l i c a t i o n 2022 . : a6 : p u b l i c a t i o n 2022 .
: a2 : c i t a t i o n s 10 . : a4 : c i t a t i o n s 12 . : a6 : c i t a t i o n s 2 .
:WWW : rank 1
. : ISWC : rank 2
(a) RDF Graph  1
. : ESWC : rank 3
.</p>
      <p>SELECT ? a r t i c l e WHERE {
? conf : rank ? rank . #1
? a r t i c l e : c o n f e r e n c e ? conf . #2
? a r t i c l e : p u b l i c a t i o n ? data . #3
? a r t c i l e : c i t a t i o n s ? c i t a t i o n s . #4
} ORDER BY ? rank , DESC( ? c i t a t i o n s ) LIMIT 2</p>
      <p>(b) SPARQL query  1</p>
    </sec>
    <sec id="sec-2">
      <title>2. Preliminaries and backgrounds</title>
      <p>In this section, we briefly recall the definitions related to our proposal. We follow the notations
from [9, 10] and consider three disjoint se t(sIRIs),  (literals) an d (blank nodes). We denote
the set of RDF terms a s∪  ∪  . An RDF triple(, , ) ∈ ( ∪ ) ×  ×  connects a subjec t
through a predica teto an objec t. An RDF graph is a finite set of RDF triples. We assume
the existence of an infinite set of variables, disjoint from the previous sets. A map pinfrgom
 to is a partial functi o∶n  →  . The domain of , denoted by() , is the subset of
where is defined. A SPARQL graph pattern expressio n is defined recursively as follows:
1. A tuple from( ∪  ) × ( ∪  ) × ( ∪  ) is a triple graph pattern.
2. If  1 and  2 are graph patterns, then expressio n1s A(ND  2), ( 1 OPT  2) and ( 1 UNION
 2) are graph patterns.
3. If  is a graph pattern a ndis a SPARQL built-in condition, then the expressi onFI(LTER
 ) is a graph pattern.</p>
      <p>The evaluation of a graph patteronver an RDF graph , denoted by   , produces a multiset
J K
Ω of solution mappings.</p>
      <sec id="sec-2-1">
        <title>2.1. TOP-k Queries</title>
        <p>top- queries return the topresults according to a user-specified ranking (scoring) function.
top- SPARQL queries can be expressed usingORDER BY and LIMIT clauses that first impose
an order on the result set and then limit the number of results. Fol4l]o, wthinegOR[DER BY
clause can be formulated as a ranking functℱiocnombining several ranking criter{ i1a, … ,   }.
Given a graph patter n, a ranking criterio(n? 1, … , ?  ) is a function defined over a set
of  variables, where∀∈[1,] ∶ ?  ∈  () . The evaluation of a ranking criter ioonn a
mapping  , denoted by[] , is the substitution of all varia?b l e∈s  by the corresponding
values from the mapping, i.e(.?  ). Given two mappings 1 and  2,  1 &gt;  2 according to a
ranking functionℱ ({ 1, … ,   }), denoted byℱ [ 1] &gt; ℱ [ 2], if ∃∈[1,] ∶   [ 1] &gt;   [ 2] and
∀∈[1,[ ∶   [ 1] =   [ 2].</p>
        <p>In order to evaluate the ranking functℱio,anll the variables i(n) that contribute in the
evaluation oℱf must be bounded. SinceOPTIONAL and UNION clauses can introduce unbound
variables, we assume all the variable (s)in to becertain variables as defined in [ 11], i.e.
variables that are certainly bounded for every mapping produ ce.dTbhye relaxation of the
certain variables constraint is part of the future work.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Web Preemption and TOP-k Queries</title>
        <p>Web preemption [7] is the capacity of a Web server to suspend a running SPARQL query after a
ifxed quantum of time and resume the following waiting query. When suspending a qu e,ray
preemptable server saves the internal state of all oper ationras osafved plan  , which is
sent to the client. The client can continue the executiobnyosfending   back to the server.
When reading  , the server restarts qu eryfrom where it has been stopped. As a preemptable
server can restart queries from where they have been stopped and makes a progress at each
quantum, it eventually delivers complete results after a bounded number of quanta.</p>
        <p>However, Web preemption comes with overheads. The time of the suspend and resume
operations represents the overhead in time of a preemptable server. The s izreeporfesents
the overhead in space of a preemptable server and may be transferred over the network each
time the server suspends a query. To be tractable, a preemptable server has to minimize these
overheads. For this purpose, a preemptable server only implements SPARQL operators that
can be saved and resumed in bounded time, i.e. preemptable operators. Other operators are
considered as not preemptable and implemented on the client. For instanScCeA, Nt,hJeOIN,
UNION, LIMIT and FILTER operators just need to manage one mapping at a time, consequently,
they can be saved and resumed in bounded time. On the other hand, some SPARQL operators
require to materialize mappings, e.g. OtRhDeER BY operator requires materializing mappings
before sorting, while thGeROUP BY operator requires materializing the group keys1(s2e])e. [
As materialization of mappings requires space resources that depends on the size of the data,
they cannot be saved in bounded time.</p>
        <p>Figure3 illustrates how Web preemption processesttohpe- query 1 over the RDF graph
 1, both depicted in Figur2e. First, the client decomposes the qu er1yinto 1′ so that only
preemptable operators are sent to the preemptable server,Oi.ReD. EtRheBY and LIMIT clauses
are removed fro m 1. Let us suppose that quer y1′ requires 3 quanta to complete. At the end
of each quantum  , the client receives the mappingsand asks for the following results by
sending the saved pla n  of  1′ back to the server. Once all mappings are obtained, the client
sorts them according to the ranking function defined by OtRhDeER BY clause, and apply the
slice operator to keep only the top 2 mappings.</p>
        <p>
          Althoughtop- queries are expected to transfer o nrlyesults, in the case of Web
preemption, because theORDER BY operator is implemented on the client-side, all mappings will be
transferred anyway Moreover, this execution strategy prevents us from taking advantage of
possible optimizations when evaluating SPARtQoLp- queries, such as early termination or
early pruning of partial mapping1s, 1[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. To overcome this issue, one solution could be to apply
state-of-art solutions on the client-side. However, this solution requires computing joins on the
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Preemptable TOP-k Operator</title>
      <p>Given a ranking functionℱ, the computation of the t opmappings only requires maintaining
an ordered list ofmappings. If we set an upper bound for , a simple iterator as defined
by Algorithm1 is preemptable. Algorith1mis a simple adaptation of the “Top-N heapsort”
algorithm that is commonly used in the Postgres database. The idea of the Top-N heapsort
algorithm is to maintain in a main-memory bufer only the bmesatppings seen so far. In a
realworld configuration, the upper bound should be set by the SPARQL endpoint administrator.</p>
      <p>Algorithm1 relies on a data structuℱrethat maintains a list of mapping1,s… ,   such that
∀∈[1,[ , ℱ [  ] ≤ ℱ [ +1 ]. First, the algorithm fillℱs with the first mappings drawn from the
if   =</p>
      <p>Data:  ℱ: ordered list of mappings accordingℱto,  : last mapping rea d, : threshold to enter tthoep- .
previous iteratoℐr. Then, each time a new mappin g  is drawn fromℐ,   is compared with
the mapping with the smallest rank ℱ</p>
      <p>in, i.e.  1 denoted by  in Algorithm1. If ℱ [  ] &gt; ℱ [  ],
saving (resp. resuming) the iterator is((i,n ))
, Algorithm1 is preemptable.
  replaces  in the to p . Once ℐ exhausted, Algorithm1incrementally retur nℱs.</p>
      <p>When the query is suspended (resp. resumed) by the preemptable SPARQL server, Algor1ithm
just has to save ℱ in   (resp. resume ℱ from   ). As the time and space complexities of</p>
      <p>However, Algorithm1 has two drawbacks. First, it imposes a maximum va luoen the
number of results returned tboyp- queries. If users want to executetoap- query  with
 &gt;  , they have to rely on the client-side solution, i.e. transferring and sorting all solutions of
 to end up keeping only solutions. Second, even if the size of tthoep- is bounded by  , the
overhead of saving and resuming thteop- can have a significant impact on the execution time
and the data transfer (wh enis sent to the client), as demonstrated in our experimental study.

Problem Statement. Is it possible to implement a preemptatbolpe- iterator whose
complexity does not depend on?</p>
      <sec id="sec-3-1">
        <title>3.1. Preemptable Partial TOP-k Iterator</title>
        <p>The key idea to avoid depending on is to rely on partitaolp- . Using a preemptive Web
server, the evaluation of a graph patteorvner an RDF graph naturally creates a partition
of mappings  1, ...,   over time, where  is produced during quantu m . Intuitively, a partial
top- is obtained by computing the topmappings on a partition of mappin gs .</p>
        <p>Instead of computing the wholteop- query on the server, which requires saving and
resuming it each time the preemption occurs, the server only computes ptaorpt-ialℱ1 , … ,  ℱ
thetop- .</p>
        <p>Data:   : last mapping read.
else
maximum limi t fortop- queries, ℱ: empty list of ordered mappings accordingℱt,o  : threshold to enter
19
21
23
25
18 Procedure Open():
20 Procedure Load( ′):
can be saved and resumed in bounded time.</p>
        <p>To comply with the Web preemption mode7l],[the amount of data transferred to the client
per quantum need to be bounded. Consequently, the size of a parttoipa-l
If  ℱ reaches</p>
        <p>mappings, it will trigger the end of the quant u,mand  ℱ will be sent to the
client. Thus, end-users are no more limited b,ybut reachin g
may degrade performance in
ℱ is bounded by  .
terms of data transfer and execution time. When the query execution completes, the client just
because the server does not remember thoep- between quanta, it may transfer up(,t )o
needs to merge a llℱ1 , … ,  ℱ to compute the to psolution mappings ℱ of the query. However,
mappings at each quantum, even if those mappings do not contribute ttootph- e. To avoid
transferring too many useless mappings, the client sends to the server the mapping with the
smallest ran k (for decreasing order), with the saved pl a.n Using   as a threshold, the
server can discard a part of the solution mappings that do not contributteotpo- .the
mappings produced byℐ during a quantum  . Algorithm3 merges partiatlop-</p>
        <p>Figure5 illustrates how to compute quer1yover the RDF grap h 1 using Algorithm2 and 3.
Algorithm2 implements a preemptable iterator that computes a tpaorpt- ialℱ based on the
ℱ produced
by the server until the query completes. Let us assume that  q1uerreyquires 3 quanta1,  2
and  3 to complete, with two mappings produced per quantum.  F1o,rmappings  1 and  2 are
produced byℐ and inserted intoℱ1 . When preemption occurs, bot h and  ℱ1 are sent to the
client. Then, the client mer geℱ1s into ℱ and sends   back to the server. Becau|s eℱ| = 2 =  ,

the client also sends the mapping with the smallest ran k2,, in.ee.xt to  . During quantum 2,
mappings  3 and  4 are produced. Because both mappings have a higher rank th2a,nthey are
both inserted int oℱ2 . However, if the server had known the current state otfotph-e, it would
q1
q2
q3
 ℱ = ⟨ 1, 2⟩</p>
        <p>=  2
 ℱ = ⟨ 1, 3⟩</p>
        <p>=  3
 ℱ = ⟨ 1, 3⟩</p>
        <p>=  3
Ω =  ℱ = ⟨ 1, 3⟩
13 Function GetNext():
14
15
16
if | ℱ| &gt; 0 then</p>
        <p>return  ℱ.()
2 mappings of query 1 on  1.
have known that4 does not contribute to tthoep- , but without remembering the whole top
 , like Algorithm1 in Figure4, the server cannot safely rejec4t.This scenario illustrates the
overhead of relying on a parttiaolp- iterator. By knowing only, the server may transfer
mappings that do not contribute tottohpe- , such as  4. At the end of quantu m2,  ℱ2 and  

are sent to the client2, is merged int o ℱ, and  3 becomes the mapping with the lowest rank.</p>
        <p>ℱ
Finally, during quantum3, all generated mappings have a lower rank th3a,cnonsequently,
no new mappings are transferred to the client. The query completes ℱ
wi=th⟨ 1,  3⟩, the top</p>
        <p>Data: ℱ: ranking function of ,  : limit of  ,  ℱ: ordered list of mappings accordingℱto,  : threshold to enter the
Algorithm 3: Client-Sidetop- iterator.</p>
        <p>Require:  : SPARQL top- query.</p>
        <p>top- .
1 Procedure Open():
2
3
4
5
6
7
8
9
10
11
12
else</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Early Pruning of Partial Solution Mappings</title>
        <p>
          To improve query execution time, state-of-art approaches try to guarantee early termination
of top- queries. Unfortunately, existing techniques rely on expensive sort operations that
are not preemptable. Although early termination cannot be achieved using Web preemption,
early pruning only requires knowing the mapping with the lowest rank among the top
mappings [
          <xref ref-type="bibr" rid="ref4">13, 4</xref>
          ]. Intuitively, early pruning consists in pruning a partial mapping if its rank
is smaller than the lowest of  thsoe far computed mappings. For instance, let us suppose
that query 1 is executed using a join order that corresponds to the order of triple patterns in
the query. In this case, we can use early pruning to skip all the articles published at ESWC.
Following the execution depicted in Figu5r,eit is known from the first quantum that none of
the papers published at ESWC can enter the top-2, because the ESWC conference rank is higher
than 2.
        </p>
        <p>Early pruning can be very eficient but highly depends on the join order. Moreover, the
opportunity for pruning arises only wh e(nor more) complete results have been produced.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Study</title>
      <p>The purpose of the experimental study is to answer the following questions: (1) Does using a
preemptabletop- operator improve query execution performance? (2) What is the impact of
and the duration of a quantum on performance? (3) What is the impact of early pruning on
performance? (4) How does this approach perform on real-wtoorpld- queries?</p>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <p>To implement our approach, we used thSaeGe query engine1. Following the Web preemption
model,SaGe is divided into a Python preemptable Web server and a JavaScript Web client.
top- iterators, i.e. Algorith1masnd 2, have been implemented as an extension of tShaeGe
preemptable Web server. To make experiments simpler, we created three custom Python clients:
• SaGe is ourbaseline, i.e. Web preemption withouttaop- iteratort.op- queries are
executed using theORDER BY and LIMIT operators as explained in secti2o.2nand depicted
in Figure3.
• SaGe-top- executestop- queries using Algorith m1defined in section 3. This
corresponds to the adaptation of the “Top-N heapsort” operator that is already used in the
Postgres database.
• SaGe-partial-top- executestop- queries using the partiatolp- iterator defined in
section3.1. This corresponds to our contribution without early pruning enabled.
In all experiments, the maximum li mi,tto be set by thSeaGe server administrator, was fixed
at = 10000 . The maximum number of mappings that can be sent to the client per quantum
was set at10000.</p>
        <p>Data and Queries. We used the Waterloo SPARQL Diversity Benchmark (Wat2D)[i1v4]. We
re-used the RDF graph and the SPARQL queries from tShaeGe [7] experimental study. The RDF
graph contains107 triples, while the workload contains 193 queries. From these 193 queries,
1https://sage.univ-nantes.fr
2https://dsg.uwaterloo.ca/watdiv
we randomly picked 20 SPARQL conjunctive queries with diferent shapes and up to 10 joins
per query. Because queries do not havOeRDER BY clauses, we randomly generated OaRnDER
BY clause for each query, with up to 2 variables per clause. For the choice of the variables, we
privileged those which take Literals as value. FoLrItMhIeT  clauses, we experiment diferent
values of , namely 10, 100, 1000, 10000.</p>
        <p>
          To test our approach on real queries, we extracted 20 queries from the Wikidata qu3e.ry logs
Queries contain from 1 to 10 joins, aORnDER BY clause with a single variable, anLdIaMIT 
operator wit h∈ [
          <xref ref-type="bibr" rid="ref1">1, 10000</xref>
          ] . Queries are executed on the 2017-03-13 dump of Wikida4twaith
2262M triples.
        </p>
        <p>All RDF graphs are stored using HD1T5[]. The HDT-backend of theSaGe query engine is
used to query the RDF graphs. Code, configurations, queries, and RDF graphs can be found
online for reproducible purpos5e.s
Evaluation Metrics. (1) Execution Time is the time spent executing a query, from when the
query is sent to the server until the client returns the last rDesautlatT. r(a2n)sfer is the number
of bytes exchanged between the client and the server during the execution of a quNeruym.b(3e)r
of HTTP Calls is the number of HTTP requests sent to the server during the execution of a
query.</p>
        <p>Hardware Setup. We ran all experiments on a local cloud instance with Ubuntu 20.04.4 LTS,
a AMD EPYC 7513 32-Core processor (the VM was configured with 16 vCPUs and 1 thread per
core), 1TB SSD (part of a storage pool), and 64GB of RAM. Both the client and the server were
running on the same machine.</p>
        <p>Software Setup. We used Python v3.9.7, Virtuoso Open Source Edition v7.2.6 aSnadGe as of
July 24 2022 (7ea3468).</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental Results</title>
        <p>We first ensured that ourtop- iterators yield complete and correct results. We ran both
Virtuoso and the diferent approaches to check if they provide complete and accurate results for
each query using Virtuoso results as the ground truth.</p>
        <p>Does using a preemptable top- operator improve query execution performance? Figure 6
presents the performance achieved by the diferent approaches on the WatDiv queries. First,
let us consider only the results obtained w=it1h0 , without enabling early pruning. We
will then see what is the impact oafnd early pruning on performance. As expected, using a
dedicatedtop- iterator significantly improves query execution performance comparSeadGteo.
First, bothSaGe-top- and SaGe-partial-top- significantly reduce data transfer compared
toSaGe that requires to transfer all mappings to the cSlaiGenet-t.op- only sends the top
3https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en
4https://www.rdfhdt.org/datasets
5https://github.com/momo54/sage-orderby-experiment
(a) WatDiv queries with = 10
(b) WatDiv queries with = 100
(c) WatDiv queries with = 1000
(d) WatDiv queries with = 10000
mappings, whileSaGe-partial-top- only transfers the topmappings seen during a quantum
that are greater than the threshold computed by the client. CompSarGeed, StaoGe-top-
and SaGe-partial-top- require sending fewer HTTP calls. BecauSseaGe transfers more
mappings, it more often triggers the end of quanta. Indeed, Web preemption limits the number
of mappings sent to the client per quantu7m]. [When the limit is reached, it triggers the end of
the quantum, even if there was time left, and the longer the duration of a quantum, the more
likely it is to occur. In terms of execution time, SbaoGteh-top- and SaGe-partial-top- are
close toSaGe. The slight diference can be explained becausSeaGe transfers more mappings,
increasing serialization times.</p>
        <p>What is the impact o f and the duration of a quantum? As depicted in Figure6,  does not
impact the number of HTTP calls. No matter the val u,eaollf approaches require seeing all
solution mappings. Consequently, they need the same number of quanta to complete, regardless
of  .</p>
        <p>While does not impact the number of HTTP calls, it drastically afects the execution time of
SaGe-top- . Algorithm1 used by SaGe-top- requires to save (resp. load) the current
troepsults each time a query is suspended (resp. resumed). Consequently, w heinncreases, the time
spent suspending and resumingtop- queries increase accordingly. On the WatDiv queries,
Figure7 presents the time spent executing, saving, and resuming queries, both foSratGhee-top-
and theSaGe-partial-top- approaches. As expected, whe n increases,SaGe-top- makes
Web preemption overheads increase drastically, to the point where the server spends more
time saving and resuming queries than executing them. As a reSsauGlte,-top- can require
more time thaSnaGe to complete. The durationof a quantum also plays an important role.
When  is small, queries are more often suspended (resp. resumed), increasing Web preemption
overheads. Thus, it is tractable to useStahGee-top- approach, but and  must be carefully
configured by the SPARQL endpoint administrator. The higher the duration of a quantum, the
higher could be. Compared toSaGe-top- ,  does not impactSaGe-partial-top- because
the time complexity of saving and resuming a parttoiap-l iterator does not depend o n(see
Figure7).</p>
        <p>This property comes at the cost of an overhead on the data transfer compSaarGede-top- .
The partiatlop- iterator does not remember the current statetoofpt- hebetween quanta.
The only information it has is the threshold computed by the client. If the threshold becomes
outdated during a quantum, the server may transfer useless solution mappings that do not
contribute to thteop- . NeverthelessS,aGe-partial-top- is better thaSnaGe-top- in almost
all tested configurations. BecauSseaGe-top- has to transfer the t orpesults each time the
query is suspended (resp. resumed), the data transfer can increase rapidlisyliafrge or the
query is suspended/resumed too often. In the worst casSea,Ge-top- can transfer more data
thanSaGe, i.e. more than all the query solutions.</p>
        <p>What is the impact of early pruning? As depicted in Figure6, early pruning has a significant
impact on WatDiv queries. Whe n≤ 1000 , we observe a 60% reduction of the execution time
when using SaGe-partial-top- (resp. SaGe-top- ) with early pruning enabled, compared
toSaGe-partial-top- (resp. SaGe-top- ) without early pruning. Because the opportunity
for pruning arises only when(or more) complete results have been produced, early pruning
has less impact whe n increases. As a result, we observe a smaller reduction, around 15%,
when  = 10000 . Another critical factor is the join order. The earlier the variables used by the
ranking function are evaluated, the more influential the pruning is likely to be. Finally, even if
the threshold is only updated between quanta when uSsainGge-partial-top- , it does not
significantly impact the eficiency of early pruning compared tSoaGe-top- .</p>
        <p>How does this approach perform on realtop- queries? Figure8 presents the performance of
the diferent approaches on Wikidata queries, with and without early pruning. Compared to
SaGe, bothSaGe-top- and SaGe-partial-top- allow a significant reduction in terms of data
transfer. HoweverS,aGe-top- requires and  to be well configured to avoid increasing Web
preemption overheads. As expected, all approaches behave in a similar way in terms of execution
time. However, if early pruning is enabled when usiSnagGe-top- and SaGe-partial-top- ,
we observe a 35% reduction compared StaoGe.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Related Works</title>
      <p>
        State-of-the-art SPARQL query engines, such as Virtuoso and Jena,
relymoanteraializeand-sort [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] approach to process SPARQLtop- queries. This consists of computing all the
matching solutions (e.g., thousands), even if only a limited nu mboefrsolutions (e.g., ten) is
requested 1[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Consequently, these queries are often long-running queries that require a lot of
CPU and memory resources to terminate. To ensure stable and responsive services to users,
public SPARQL endpoints set up quotas, e.g. on the number of results returned to the client, the
execution time, or the arrival rate. Therefore, tmoapn- yqueries cannot be executed online
simply because they reach quotas of the fair-use polic7i,e1s7[
        <xref ref-type="bibr" rid="ref2">, 2</xref>
        ]. Compared to existing public
endpoints, our approach ensures thattaolpl- join queries terminate without materializing all
results.
      </p>
      <p>
        Early termination and early pruning are common techniques to speed-up the processing
of top- queries in databases1][. Such techniques have been adapted for RDF and SPARQL
in [
        <xref ref-type="bibr" rid="ref4 ref5">18, 4, 5, 13</xref>
        ]. Early termination allows deciding if it is possible to stop the execution of
a runningtop- query. A top- query can be stopped as soon as the remaining mappings
are guaranteed not to be part ofttohpe- results. Early termination relies both on sorted
access to data and the computation of upper bounds to entetropt-h e. Magliacane et al4.][
propose a SPARQL engine-level implementation of rank-aware physical operators allowing early
termination. The SPARQL-RANK execution model creates a rank-aware pipeline of operators
that incrementally output a ranked set of mappings according to a user-defined scoring function.
Unfortunately, rank-aware operators require to materialize the sorted mappings. Following the
Web preemption model, rank-join operators are not preemptable because they cannot be saved
and resumed in bounded time. Early termination can be further improved thanks to MS-tree
indexes, allowing the computation of tighter upper bounds, as proposed by Dong 1e8t].al. [
Such an approach is eficient, but the maintenance of extra indexes is costly. Yang e1t9a]l. [
propose the STAR framework. SPARQL queries are decomposed into sets of star-shaped queries
for which an eficient algorithm to compute tthoep- is provided. Thanks to a join algorithm
and an upper bound schema that allows the computation of tight upper bounds, STAR can
combine thetop- of star-shaped queries to compute tthoep- of any SPARQLtop- query.
However, just like SPARQL-RANK, Yang et al. use non-preemptable algorithms due to their
space complexities. Wagner et a5l]. [propose an approach to computteop- queries without
complete result materialization that relies on a probabilistic early pruning method. Given a
partial mapping , the algorithm predicts with bounded accuracy the probabi li tbyeoinf g
part of thetop- results. Our approach makes use of early pruning, but we aimed to compute
fully accurate results as 1in3][.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we extended Web preemption with a parttoiap-l iterator, significantly improving
queries performance. By relying on parttioapl- , we can computetop- queries with arbitrary
large and small values o, fwithout increasing Web preemption overheads. Moreover, with a
dedicatedtop- iterator, it becomes possible to use early pruning techniques to improve query
execution time. In future work, we plan to work on a way to support early termination of
top- queries in the context of Web preemption. Another path we would like to explore is
the evaluation otfop- queries when the ranking function is defined over aggregates. The
evaluation of SPARQL aggregate queries using Web preemption also rely on a partial preemptable
iterator20[, 12]. If our partiatlop- iterator can be used ove2r0,[12], a dedicated approach
may allow to significantly improve performance.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work is supported by the ANR-19-CE23-0014 DeKaloG project (CE23 - Intelligence
artificielle) and the CominLabs MiKroloG project.
[7] T. Minier, H. Skaf-Molli, P. Molli, SaGe: Web Preemption for Public SPARQL Query
Services, in: WWW ’19: The World Wide Web Conference, Association for Computing
Machinery, New York, NY, United States, 2019, p. 1268–1278.
[8] A. Azzam, J. D. Fernández, M. Acosta, M. Beno, A. Polleres, SMART-KG: Hybrid Shipping
for SPARQL Querying on the Web, in: WWW ’20: Proceedings of The Web Conference
2020, Association for Computing Machinery, New York, NY, United States, 2020, pp.
984–994.</p>
      <p>[9] S. Harris, A. Seaborne, SPARQL 1.1 Query Language, in: W3C Recommendation, 2013.
[10] J. Pérez, M. Arenas, C. Gutiérrez, Semantics and complexity of SPARQL, ACM Transactions
on Database Systems 34 (2009) 1–45.
[11] M. Schmidt, M. Meier, G. Lausen, Foundations of SPARQL query optimization, in:
Proceedings of the 13th international conference on database theory, 2010, p. 4–33.
[12] J. Aimonier-Davat, H. Skaf-Molli, P. Molli, A. Grall, T. Minier, Online approximative
SPARQL query processing for COUNT-DISTINCT queries with Web Preemption, Semantic
Web Journal 13 (2022) 735–755.
[13] A. Wagner, T. T. Duc, G. Ladwig, A. Harth, R. Studer, Processing SPARQL Property Path
Queries Online with Web Preemption, in: 9th Extended Semantic Web Conference, ESWC
2012, Heraklion, Crete, Greece, May 27-31, 2012, Proceedings, Springer, 2012, pp. 56–71.
[14] G. Aluç, O. Hartig, M. T. Özsu, K. Daudjee, Diversified Stress Testing of RDF Data
Management Systems, in: 13th International Semantic Web Conference, Riva del Garda,
Italy, October 19-23, 2014. Proceedings, Part I, Springer, 2014, p. 197–212.
[15] J. D. Fernández, M. A. Martínez-Prieto, C. Gutiérrez, A. Polleres, M. Arias, Binary RDF
representation for publication and exchange (HDT), Journal of Web Semantics 19 (2013)
22–41.
[16] S. Zahmatkesh, E. D. Valle, D. Dell’Aglio, A. Bozzon, Towards a top-K SPARQL query
benchmark generator, in: OrdRing’14: Proceedings of the 3rd International Conference
on Ordering and Reasoning, volume 1303, CEUR-WS.org, 2014, pp. 35–46.
[17] C. Buil-Aranda, A. Polleres, J. Umbrich, Strategies for Executing Federated Queries in
SPARQL 1.1, in: 13th International Semantic Web Conference, Riva del Garda, Italy,
October 19-23, 2014. Proceedings, Part II, Springer, 2014, p. 390–405.
[18] W. Dong, Z. Lei, Z. Dongyan, Top-k queries on RDF graphs, Information Sciences: an</p>
      <p>International Journal 316 (2015) 201–217.
[19] Y. Shengqi, H. Fangqiu, W. Yinghui, Y. Xifeng, Fast top-k search in knowledge graphs,
in: 2016 IEEE 32nd international conference on data engineering (ICDE), IEEE, 2016, pp.
990–1001.
[20] A. Grall, T. Minier, H. Skaf-Molli, P. Molli, Processing SPARQL Aggregate Queries with
Web Preemption, in: 17th International Conference, ESWC 2020, Heraklion, Crete, Greece,
May 31–June 4, 2020, Proceedings, volume 12123 oLfecture Notes in Computer Science,
Springer, 2020, pp. 235–251.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I. F.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          , G. Beskales,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Soliman</surname>
          </string-name>
          ,
          <article-title>A survey of top-k query processing techniques in relational database systems</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>40</volume>
          (
          <year>2008</year>
          )
          <fpage>1</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Soulet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          ,
          <article-title>Anytime Large-Scale Analytics of Linked Open Data</article-title>
          , in: 18th International Semantic Web Conference, Auckland, New Zealand,
          <source>October 26-30</source>
          ,
          <year>2019</year>
          , Proceedings,
          <string-name>
            <surname>Part</surname>
            <given-names>I</given-names>
          </string-name>
          , volume
          <volume>11778</volume>
          oLfecture Notes in Computer Science, Springer,
          <year>2019</year>
          , p.
          <fpage>576</fpage>
          -
          <lpage>592</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hasnain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Mehmood</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. S.</surname>
          </string-name>
          <article-title>e Zainab, A. Hogan, SPORTAL: Profiling the Content of Public SPARQL Endpoints</article-title>
          ,
          <source>International Journal on Semantic Web and Information Systems (IJSWIS) 12</source>
          (
          <year>2016</year>
          )
          <fpage>134</fpage>
          -
          <lpage>163</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Magliacane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bozzon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Valle</surname>
          </string-name>
          ,
          <article-title>Eficient Execution of Top-K SPARQL Queries</article-title>
          , in: 11th International Semantic Web Conference, Boston, MA, USA, November
          <volume>11</volume>
          -
          <issue>15</issue>
          ,
          <year>2012</year>
          , Proceedings,
          <string-name>
            <surname>Part</surname>
            <given-names>I</given-names>
          </string-name>
          , volume
          <volume>7649</volume>
          oLfecture Notes in Computer Science, Springer,
          <year>2012</year>
          , p.
          <fpage>344</fpage>
          -
          <lpage>360</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wagner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bicer</surname>
          </string-name>
          , T. Tran,
          <article-title>Pay-as-you-go Approximate Join Top-k Processing for the Web of Data</article-title>
          , in: 11th International Conference, ESWC 2014, Anissaras, Crete, Greece, May
          <volume>25</volume>
          -29,
          <year>2014</year>
          , Proceedings, Springer,
          <year>2014</year>
          , pp.
          <fpage>130</fpage>
          -
          <lpage>145</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. V.</given-names>
            <surname>Sande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hartig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. V.</given-names>
            <surname>Herwegen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Vocht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Meester</surname>
          </string-name>
          , G. Haesendonck,
          <string-name>
            <given-names>P.</given-names>
            <surname>Colpaert</surname>
          </string-name>
          , Triple Pattern Fragments:
          <article-title>A low-cost knowledge graph interface for the Web</article-title>
          ,
          <source>Journal of Web Semantics 37-38</source>
          (
          <year>2016</year>
          )
          <fpage>184</fpage>
          -
          <lpage>206</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>