<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Optimizing Approximate Membership Metadata in Triple Pattern Fragments for Clients and Servers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ruben Taelman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joachim Van Herwegen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miel Vander Sande</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruben Verborgh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IDLab, Department of Electronics and Information Systems, Ghent University - imec</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Depending on the HTTP interface used for publishing Linked Data, the effort of evaluating a SPARQL query can be redistributed differently between clients and servers. For instance, lower server-side CPU usage can be realized at the expense of higher bandwidth consumption. Previous work has shown that complementing lightweight interfaces such as Triple Pattern Fragments (TPF) with additional metadata can positively impact the performance of clients and servers. Specifically, Approximate Membership Filters (AMFs)-data structures that are small and probabilistic-in the context of TPF were shown to reduce the number of HTTP requests, at the expense of increasing query execution times. In order to mitigate this significant drawback, we have investigated unexplored aspects of AMFs as metadata on TPF interfaces. In this article, we introduce and evaluate alternative approaches for server-side publication and client-side consumption of AMFs within TPF to achieve faster query execution, while maintaining low server-side effort. Our alternative client-side algorithm and the proposed server configurations significantly reduce both the number of HTTP requests and query execution time, with only a small increase in server load, thereby mitigating the major bottleneck of AMFs within TPF. Compared to regular TPF, average query execution is more than 2 times faster and requires only 10% of the number of HTTP requests, at the cost of at most a 10% increase in server load. These findings translate into a set of concrete guidelines for data publishers on how to configure AMF metadata on their servers.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        SPARQL endpoints, that expose Linked Data on the Web through a query-based
interface, tend to suffer from availability issues [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In comparison to most other HTTP
servers, SPARQL endpoints require high-end computational resources due the high
complexity of SPARQL queries and can thus be difficult to sustain when a number of
concurrent clients request query execution. In order to cope with this problem, the
Linked Data Fragments (LDF) effort [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] has been initiated as a conceptual framework
to investigate alternative query interfaces to publish Linked Datasets, by redistributing
the effort of query evaluation between servers and clients.
      </p>
      <p>LDF interfaces allow some parts of the query to be performed on the server, and some
on the client, which leads to a redistribution of effort between server and client. This
redistribution requires queries to be decomposed into multiple smaller queries, which
typically leads to slower query execution due to the HTTP overhead of these
roundtrips, compared to fully server-side query execution. In order to reduce this
number of smaller queries, servers could send a pre-filter to the client, which could
potentially eliminate many of these queries. The focus of this work is investigating
such pre-filters.</p>
      <p>
        In recent years, different kinds of these LDF interfaces have been introduced, such as
Triple Pattern Fragments (TPF) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Bindings-Restricted Triple Pattern Fragments [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
SaGe [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and Smart-KG [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Each of these types of interfaces introduce their own
trade-offs in terms of server and client effort. Additionally, LDF interfaces can enable
feature-based extensibility, which allows servers to optionally expose certain features
as metadata through usage of self-descriptive hypermedia [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which can then be
detected automatically by supporting clients to enhance the query evaluation process.
Due to the extensibility of TPF, several interface features have already been proposed
for TPF [7, 8, 9]. One such feature is Approximate Membership Filter (AMF) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
metadata, which supporting clients can use to reduce the number of HTTP requests,
with only a slight increase in server cost. Unfortunately, this currently comes at the
cost of slower query execution, because the individual HTTP requests were larger and
more expensive to compute. Since TPF is quickly gaining adoption among
publishers [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], we focus on improving the performance of AMF with TPF in this
work. AMF could also be useful for other types of LDF interfaces, but we consider
this out of scope for this work.
      </p>
      <p>
        Even though the work on extending TPF with AMFs showed excessive overhead, we
claim that these problems can be resolved, and that AMFs can be used to lower
overall query execution times without significantly increasing server load. As such, the
goal of our work is to investigate what changes are required server-side and
clientside to optimize AMFs for TPF. Concretely, we introduce six dimensions through
which the AMF approach from Vander Sande et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] can be improved. One of these
dimensions involves the introduction of a new client-side algorithm to handle AMFs.
The other dimensions are related to the server-side handling of AMFs. The effects and
feasibility of each of these dimensions are evaluated and analyzed in detail. In
summary, our work brings a deeper understanding of the appliance and benefits of AMF
metadata for Linked Data interfaces, so that Linked Data publishers can expose their
Linked Datasets in a more efficient manner through TPF interfaces.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>In this section we cover the relevant existing research relating to our work. We start
by discussing the TPF interface. After that, we discuss different AMFs, followed by
their use in query evaluation, and their use for the TPF interface.</p>
      <sec id="sec-2-1">
        <title>2.1. Triple Pattern Fragments</title>
        <p>
          Linked Data Fragments (LDF) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] is a conceptual framework to study interfaces for
publishing Linked Data, by comparing server and client effort. During query
execution, some LDFs may require a low server effort, at the cost of increased client-side
querying effort (e.g. data dumps). while others require a high server effort, at the cost
of minimal client-side effort (e.g. SPARQL endpoint). The Triple Pattern Fragments
(TPF) interface [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] was introduced as a trade-off between those extremes, by
restricting the server interface to triple pattern queries, and leaving the remainder of query
evaluation to the client. Compared to SPARQL endpoints, TPF in general reduces the
required server-side capacity and load for query evaluation at the expense of more
bandwidth usage and slower query times. Results show that the number of HTTP
requests forms the primary bottleneck during querying.
        </p>
        <p>TPF follows the REST architectural style, and aims to be a fully self-descriptive API.
TPF achieves this by including metadata and declarative controls in all of its RDF
responses next to the main data. The metadata can contain anything that may be useful
for clients during query execution, such as cardinality estimates.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Approximate Membership Functions</title>
        <p>Approximate Membership Functions (AMFs) are probabilistic data structures that
efficiently can determine membership of a set, at the cost of false positives. They are
typically much smaller than a full dataset, making them a useful pre-filtering method.
When selecting among different AMF techniques, we need to take into account
tradeoffs between filter size and false-positive rate.</p>
        <p>
          Bloom filters [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and Golomb-coded sets (GCS) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] are examples of AMF
techniques. Both approaches guarantee a 100% recall, but not a 100% precision. A Bloom
filter is essentially a bitmap filled with the range of a predefined number of hash
functions. Elements are added to the filter by applying all hash functions, and OR-ing the
results into the bitmap. Afterwards, membership tests can be done by applying all
hash functions again, and performing a bit-wise AND to see if all results are possibly
present. GCS were introduced as an improvement to Bloom filters by using only a
single hash function. Furthermore, the range of the hash function is always a
uniformly distributed list instead of a bitmap, which allows for more efficiency compression
using geometric distributions [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Compared to Bloom filters, GCS achieve a higher
compression rate, at the cost of slower decompression.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Approximate Membership for Query Evaluation</title>
        <p>AMFs find their use in many areas related to RDF querying, such as join optimization
and source selection.</p>
        <p>
          AMFs have been proven to be a useful tool for improving the performance of graph
pattern joins. Bloom filters can therefore be used to efficiently group connected triple
patterns by frequency [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], to improve the efficiency of merge joins as a way of
representing equivalent classes [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], and for joining distributed and stored streams [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
Furthermore, Bloom filters are also used in the domain of federated querying to
optimize the process of source selection. Concretely, SPARQL’s boolean ASK response
can be enhanced with Bloom filters as a way of sharing a concise summary of the
matching results [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. This allows source selection algorithms to identify overlap
between different sources, and can either minimize the required number of requests, or it
can be used to retrieve as many results as possible.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Approximate Membership Metadata for TPF</title>
        <p>
          Pure TPF query plans typically produce a large number of membership requests [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ],
checking whether a specific triple (without variables) is present in a dataset. Due to
the significant number of HTTP requests that these membership requests require,
these can cause unacceptably high query execution times. The authors have shown
50% of all requests are membership requests for 1 in 3 queries, which indicates that
optimizing membership queries can have a positive effect on query evaluation.
In the spirit of LDF, servers can combine multiple interface features to assist
supporting clients with query evaluation. An interface feature with approximate membership
metadata for all variables in the requested triple patterns considerably reduced the
number of membership requests to a server [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. In order to reduce unneeded data
transfer to clients that are unable to handle AMF metadata, the actual binary AMFs
are included out-of-band behind a link in the metadata. Client-side query engines can
detect this AMF metadata, and use it to test the membership of triples. Clients can
skip many membership requests by ruling out true negatives (because of the 100%
recall of AMFs), leaving only HTTP requests to distinguish false from true positives
(because of the &lt;100% precision). More details on the exact representation of this
AMF metadata can be found on
https://github.com/comunica/Article-SSWS2020AMF/wiki/AMF-metadata.
        </p>
        <p>The results of this work show that there is a significant decrease in HTTP requests
when AMFs are used, at the cost of only a small increase in server load. However,
even though the number of HTTP requests was lower (reduction of 33%), the total
execution time increased for most queries, because of the long server delays when
generating AMFs. In this work, we aim to solve this problem of higher execution times.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Problem Statement</title>
      <p>
        The goal of our work is to optimize query execution over TPF interfaces using AMFs.
We build upon the work from Vander Sande et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], where the authors allowed the
number of HTTP requests to be reduced at the cost of slower query execution. Our
goal is to mitigate this major drawback, while retaining its advantages.
Vander Sande et al. introduced a number of follow-up questions that we use as a basis
for defining our research questions. Concretely, in order to mitigate the earlier
mentioned drawbacks, our research questions are defined as follows:
      </p>
      <sec id="sec-3-1">
        <title>1. Can query execution time be lowered by combining triple pattern AMFs client-side on larger Basic Graph Patterns (BGPs)?</title>
        <p>Earlier work focused on using AMF metadata from triple pattern queries to test
the membership of materialized triples, while there is potential for exploiting this
for other types of patterns in the query as well. For instance, combining multiple
AMFs at BGP-level by applying AMFs on triple patterns with shared variables.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2. To what extent do HTTP caching and AMFs speed up query execution?</title>
        <p>As Vander Sande et al. suggest that caching of AMFs reduce server delays, we
investigate the impact of caching HTTP requests and/or AMFs.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3. How does selectively enabling AMF impact server load and querying?</title>
        <p>Earlier work introduced AMF as a feature that was always enabled. However,
some specific AMFs may be too expensive for servers to calculate on the fly. As
such, it may be beneficial to only enable AMF for queries that have a result
count lower than a certain threshold.</p>
      </sec>
      <sec id="sec-3-4">
        <title>4. How does network bandwidth impact query performance with AMFs?</title>
        <p>In experiments by Vander Sande et al., the HTTP bandwidth was set to a realistic
1Mbps. However, there is still an open question as to what extent different rates
have an impact on the importance of AMF.</p>
      </sec>
      <sec id="sec-3-5">
        <title>5. How low can AMF false-positive probabilities become to still achieve decent client-side query performance?</title>
        <p>Based on their results, Vander Sande et al. have suggested that additional
experimentation is needed with regards to lower AMF false-positive probabilities, as
higher probabilities did not have a significant effect on query performance. Note
that query correctness is never affected, but rather the number of requests to the
server, since every positive match requires a request to verify whether it is a true
or false positive.</p>
        <p>An answer to these research questions will be formed using the experiments from
Section 5.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Client-side AMF Algorithms</title>
      <p>In this section, we explain the existing triple-based AMF algorithm, and we show
where it lacks. Following that, we introduce a new client-side BGP-based AMF
algorithm that solves these problems. For the reader’s convenience, detailed examples of
how these two algorithms work can be found on
https://github.com/comunica/ArticleSSWS2020-AMF/wiki/AMF-algorithm-examples. Finally, we introduce a heuristic
that determines whether or not the BGP-based algorithm is beneficial to use.
4.1. Triple-based AMF Algorithm
function hasTriple(triple, context) {
for position in ['subject', 'predicate', 'object']
if !context.amf[position].contains(triple[position])</p>
      <p>
        return false;
return super.hasTriple(triple, context);
}
Listing 1: Triple-based AMF algorithm by Vander Sande et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] as a pre-filtering
step for testing the membership of triples.
      </p>
      <p>
        Vander Sande et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] introduced an algorithm that acts as a cheap pre-processing
step for testing the membership of triples. This algorithm was used in combination
with the streaming greedy client-side TPF algorithm [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for evaluating SPARQL
queries. Listing 1 depicts this algorithm in pseudo-code.
      </p>
      <p>Concretely, every triple pattern that has all of its variables resolved to constants is run
through this function right before a more expensive HTTP request would be
performed. This function takes a triple and a query context containing the AMFs that
were detected during the last TPF response for that pattern. It will test the AMFs for
all triple components, and from the moment that a true negative is found, false will be
returned. Once all checks pass, the original HTTP-based membership logic will be
invoked.</p>
      <sec id="sec-4-1">
        <title>4.2. BGP-based AMF Algorithm</title>
        <p>Following the idea of the triple-based algorithm, we introduce an extension that
applies this concept for BGPs. This makes it possible to use AMFs not only for testing
the membership of triples, but also for using AMFs to test partially bound triple
patterns that may still have variables. In theory, this should filter (true negative) bindings
earlier in the query evaluation process.
function getBindings(triplePatterns, context) {
for ((triplePattern, amf) in (triplePatterns, context.amfs))
for position in ['subject', 'predicate', 'object']
if ((!triplePattern[position].isVariable()</p>
        <p>&amp;&amp; !amf[position].contains(triplePattern.subject))
return new EmptyStream();
return super.getBindings(triplePatterns, context);
}
Listing 2: BGP-based AMF algorithm as a pre-filtering step for BGP evaluation.
Listing 2 shows this algorithm in pseudo-code. Just like the triple-based algorithm, it
acts as a pre-processing step when BGPs are being processed. It takes a list of triple
patterns as input, and query context containing a list of corresponding AMFs that
were detected during the last TPF responses for each respective pattern. The
algorithm iterates over each pattern, and for each triple component that is not a variable, it
will run it through its AMF. Once a true negative is found, it will immediately return
an empty stream to indicate that this BGP definitely contains no results. If all checks
on the other hand pass, the original BGP logic will be invoked, which will down the
line invoke more expensive HTTP requests.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.3. Heuristic for Enabling the BGP Algorithm</title>
        <p>While our BGP-based algorithm may filter out true negative bindings sooner than the
the triple-based algorithm, it may lead to larger AMFs being downloaded, possibly
incurring a larger HTTP overhead. In some cases, this cost may become too high if
the number of bindings that needs to be tested is low, e.g. downloading an AMF of
10MB would be too costly when only a single binding needs to be tested. To cope
with these cases, we introduce a heuristic in Listing 3, that will estimate whether or
not the BGP-based algorithm will be cheaper in terms of HTTP overhead compared to
just executing the HTTP membership requests directly. Concretely, the heuristic
checks if the size of an AMF is lower than the size of downloading TPF responses.
This heuristic has been designed for fast calculation, with exactness as a lower
priority. Based on measurements, we set AMF_TRIPLE_SIZE to 2 bytes, and
TPF_BINDING_SIZE to 1000 bytes by default. In Section 5, we will evaluate the
effects for different TPF_BINDING_SIZE values. In future work, more exact
heuristics should be investigated that take perform live profiling of HTTP requests
and delays to avoid the need of these constants.
function preferAmfForBgp(bindingsCount, triplePatternsCardinality)
totalAmfsSize = triplePatternsCardinality.sum() * AMF_TRIPLE_SIZ
joinRequestData = (bindingsCount * triplePatternsCardinality.len
* TPF_BINDING_SIZE;
return totalAmfsSize &lt; joinRequestData;
}
Listing 3: Heuristic for checking if the BGP-based AMF algorithm should be
executed, where bindingsCount is the number of intermediary bindings for the
current BGP, and triplePatternsCardinality is an array of cardinality
estimates for each triple pattern in the BGP. AMF_TRIPLE_SIZE is a parameter
indicating the number of bytes required to represent a triple inside an AMF, and
TPF_BINDING_SIZE is the size in bytes of a single TPF response.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>The goal of this section is to answer the research questions from Section 3. First, we
briefly discuss the implementations of our algorithm. After that, we present our
experimental setup, and we present our results. All code and results results can be found on
GitHub (https://github.com/comunica/comunica-feature-amf).</p>
      <sec id="sec-5-1">
        <title>5.1. Implementation</title>
        <p>
          For implementing the client-side AMF algorithms, we make use of the
JavaScriptbased Comunica SPARQL querying framework [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. Since Comunica already fully
supports the TPF algorithm, we could implement our algorithms as fully standalone
plugins. Our algorithms are implemented in separate Comunica modules, and will be
available open-source on GitHub. Concretely, we implemented the original
triplebased AMF algorithm, our new BGP-based AMF algorithm (BGP Simple), and a
variant of this BGP-based algorithm (BGP Combined) that pre-fetches AMFs in parallel.
The original TPF server extension in the LDF server software by Vander Sande et
al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] allowed both Bloom filters and GCS to be created on the fly for any triple
pattern. To support our experiments, we extended this implementation with new features.
This implementation is available on GitHub
(https://github.com/LinkedDataFragments/Server.js/tree/feature-handlers-amf-2). In order to measure the server overhead
of large AMFs, we added a config option to dynamically enable AMFs for triple
patterns with number of matching triples below a given result count threshold. Next to
that, we implemented an optional file-based cache to avoid recomputing AMFs to
make pre-computation of AMFs possible.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Experimental Setup</title>
        <p>Based on our LDF server and Comunica implementations that were discussed in
Subsection 5.1, we defined five experiments, corresponding to our five research questions
from Section 3. These experiments are defined and executed using Comunica Bencher
(https://github.com/comunica/comunica-bencher), which is a Docker-based
benchmark execution framework for evaluating Linked Data Fragments. This enables
reproducibility of these experiments, as they can be re-executed with a single command.
The following experiments execute WatDiv with a dataset scale of 100 and a query
count of 5 for the default query templates, leading to a total of 100 queries. We only
report results for Bloom filters for experiments where no significant difference was
measured with GCS. Each experiment includes a warmup phase, and averages results
over 3 separate runs. During this warmup phase, the server caches all generated
AMFs. For each query, the client-side timeout was set to 5 minutes and, to enforce a
realistic Web bandwidth, the network delay was set to 1024Kbps. All experiments
were executed on a 64-bit Ubuntu 14.04 machine with 128 GB of memory and a
24core 2.40 GHz CPU—each Docker container was limited to one CPU core, behind an
NGINX HTTP cache.</p>
        <p>1. Client-side AMF Algorithms: In this experiment, we compare different
client-side algorithms (None, Triple, BGP Simple, BGP Combined, Triple with
BGP Combined) for using AMF metadata.
2. Caching: In this experiment, we evaluate the effects of caching all HTTP
requests combined with caching AMF filters server-side, both following the LRU
cache replacement strategy. We also compare the effects of using AMF metadata
client-side or not. Finally, we test the effects of warm and cold caches.
3. Dynamically Enabling AMF: In this experiment, we compare different result
count thresholds (0, 1.000, 10.000, 100.000, 1.000.000) with each other, with
either the server-side AMF filter cache enabled or not. We disable the HTTP cache
and warmup phase to evaluate a cold-start.
4. Network Bandwidths: Different network bandwidths (256kbps, 512kbps,
2048kbps, 4096kbps) are tested for evaluating network speedups, and their
effects or different AMF algorithms (None, Triple, BGP Combined) are tested.
5. False-positive Probabilities: In this final experiment, we compare different
AMF false-positive probabilities (1/4096, 1/1024, 1/64, 1/4, 1/2).</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Results</title>
        <p>In this section, we present the results for each of our experiments separately. We
analyzed our results statistically by comparing means using the Kruskal-Wallis test, and
report on their p-values (low values indicate non-equal means).</p>
      </sec>
      <sec id="sec-5-4">
        <title>Client-side AMF Algorithms</title>
        <p>reduce query evaluation times and outperforms the triple-based algorithm. Only for 5
of the 20 queries, evaluation times are higher or equal. Our combined BGP algorithm
is slightly faster than the simple BGP algorithm. By using both the combined
BGPbased and the triple-based algorithms, we can reduce evaluation times slightly further.
Fig. 2 shows the query result arrival times for query F3, and is similar to the arrival
times for other queries. This figure shows that the time-until-first-result is the highest
for BGP-based AMF algorithms. However, once this first result comes in, the arrival
rate becomes much higher compared to the other algorithms. This delay for the
BGPbased algorithms is caused by the higher download times for large AMFs, and
explains the higher or equal evaluation times for 5 of the 20 queries.</p>
        <p>Fig. 3 shows the BGP-based algorithms significantly lower the number of required
HTTP requests, which explains the significant reduction in query execution times.
This allows the NGINX cache hit rate to slightly increase compared to the regular and
triple-based TPF algorithms, since fewer requests are made, which lowers the number
of required cache evictions.</p>
        <p>Based on these results, there is no statistically significant difference between the
evaluation times of the triple-based AMF algorithm, and not using AMF metadata at all
(p-value: 0.93). The simple and combined BGP algorithms are significantly faster
than not using AMF metadata (p-values &lt; 0.01). Furthermore, the simple and
combined BGP algorithm are on average more than twice as fast as the triple-based
algorithm, which make them significantly faster (p-values &lt; 0.01). Furthermore,
combining our simple and combined BGP algorithm with the triple-based algorithms shows
no further statistically significant improvement (p-values: 0.95, 0.67).
In Fig. 4, we show the results where we apply the heuristic for dynamically disabling
the BGP heuristic based on different parameter values. On average, setting the request
size parameter value to 2000 has the lowest average evaluation time for this
experiment. This case achieves lower evaluation times for 19 of the 20 queries, which is an
improvement compared to not using the heuristic. This improvement by itself
however only small, and not statistically significant (p-value: 0.18).</p>
      </sec>
      <sec id="sec-5-5">
        <title>Caching</title>
        <p>evaluation times for caching of both HTTP requests and AMF filters compared to just
caching HTTP requests (p-value: 0.77). This shows that an HTTP cache achieves the
best results, and additionally caching AMF filters server-side is not worth the effort.
If we compare these results with the results for non-AMF-aware querying, we see that
if HTTP caching is disabled, query evaluation times for non-AMF-aware querying are
significantly lower than AMF-aware approaches (p-value: &lt; 0.01). On the other hand,
if HTTP caching is enabled, query evaluation times for non-AMF-aware querying are
significantly worse than with AMF-aware approaches (p-value: &lt; 0.01). While
caching is already very important for TPF-based querying, these results show that
caching becomes even more important when AMFs are being used.</p>
        <p>Finally, our results show that when our cache is warm, exposing Bloom filters instead
of GCS achieves faster query evaluation times. While there are a few outliers where
GCS is two to three times slower, the difference is only small in most cases (p-value:
0.18).</p>
      </sec>
      <sec id="sec-5-6">
        <title>Dynamically Enabling AMF</title>
        <p>Fig. 8: Average server CPU usage increases when AMF result count thresholds
increase when caching is disabled, but much slower if caching is enabled.
Fig. 6 shows lower server-side AMF result count thresholds lead to higher query
evaluation times when HTTP caching is enabled (p-value: &lt; 0.01). Fig. 7 shows that
AMF result count thresholds also have an impact on query evaluation times when
HTTP caching is disabled (p-value: &lt; 0.01), but it does not necessarily lower it. For
this experiment, setting the threshold to 10K leads to the lowest overall query
evaluation times.</p>
        <p>Fig. 8 shows that lower AMF result count thresholds lead to lower server loads when
HTTP caching is disabled (p-value: 0.03). On the other hand, if HTTP caching is
enabled, there is no correlation (Pearson) between AMF result count threshold and
server CPU usage (p-value: 0.46). This shows that if caching is enabled, dynamically
enabling AMFs based on the number of triples is not significantly important, and may
therefore be disabled to always expose AMFs.</p>
        <p>For this experiment, average CPU usage increased from 31.65% (no AMF) to 40.56%
(all AMF) when caching is enabled. Furthermore, when looking at the raw HTTP
logs, we observe that by always exposing AMFs, we use 28.66% of the total number
of HTTP requests compared to not exposing AMFs. As such, AMFs significantly
reduce the number of HTTP requests at the cost of ~10% more server load.</p>
      </sec>
      <sec id="sec-5-7">
        <title>Network Bandwidth</title>
        <p>Fig. 9, Fig. 10 and Fig. 11 show the effects of different bandwidths on query
evaluation times over different algorithms. We observe that when not using AMF, or using
the triple-level AMF algorithm, lower bandwidths lead to higher query evaluation
times. However, when bandwidths become much higher, query evaluation times
decrease at a lower rate. In contrast, the BGP-level AMF algorithm continuously
becomes faster when bandwidth increases. We do not measure any significant impact of
bandwidth on both non-AMF usage and triple-level AMF usage (p-values: 0.29,
0.23). For BGP-level AMF, we measure a significant impact (p-value: &lt; 0.01). This
shows that if BGP-level AMF is used, then higher bandwidths can be exploited more
for faster query evaluation.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>In this article, we introduced client-side and server-side improvements to the AMF
feature for TPF. The experimental results show that our client-side algorithms make
average query execution more than two times faster than with regular TPF while only
requiring 10% of the number of HTTP requests, at the cost of less than 10%
additional server CPU usage.</p>
      <p>We offer implementations of these algorithms and server enhancements, which means
that they can be used by any of the existing data publishers that are exposing their
data through a TPF interface, or any client that aims to query from them.
Hereafter, we conclude our findings with respect to our research questions, based on
the evaluation, and we introduce a set of recommendations for data publishers using
AMF with TPF.</p>
      <sec id="sec-6-1">
        <title>6.1. Research findings</title>
        <p>BGP-based Algorithms Improve Query Efficiency Results show that our new
client-side BGP-based algorithms that use AMF metadata significantly reduce query
evaluation times (Research Question 1). However, the are a few outliers where our
new algorithms perform worse than the triple-based algorithm. This is because AMFs
are sometimes very large, which has a significant impact on query execution times
when they have to be downloaded from the server. Our results have shown that a
heuristic that can decide whether or not to use the BGP-based algorithm can solve this
problem, but further research is needed to come up with a more general heuristic that
works in a variety of cases.</p>
      </sec>
      <sec id="sec-6-2">
        <title>BGP-based Algorithms Postpone Time to First Results Even though total query</title>
        <p>
          evaluation times for the AMF-aware algorithms are mostly lower, we observe that the
time-until-first-result is mostly higher. The reason for this is that the BGP-based
algorithms tends to use larger AMFs, which introduces a bottleneck when requesting them
over HTTP. Even though we have this overhead, the gains we get from this are
typically worth it, as results come in much faster once the AMFs have been downloaded.
This finding shows that dynamically switching between different algorithms may be
interesting to investigate in future work. Our bandwidth experiment results confirm
this bandwidth bottleneck when downloading large AMFs, and show that higher
bandwidths lead to even more performance gains for the BGP-level algorithms
(Research Question 4). This continuous efficiency can be investigated further in the future
using metrics such as diefficiency [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
        </p>
        <p>Pre-computation and Caching of AMFs is Essential Our results show that
AMFaware querying only has a positive impact on query evaluation times if the server can
deliver AMF filters sufficiently fast (Research Question 2) by for example caching
them. Furthermore, if no cache is active, AMF-aware querying performs worse than
non-AMF-aware querying. Ideally, all AMFs should be pre-computed, but due to the
large number of possible triple patterns in a dataset, this is not feasible. On the other
hand, our results have shown that server-side on the fly creation of AMFs only starts
to have a significant impact for sizes larger than 10.000 (Research Question 3).
On a low-end machine (2,7 GHz Intel Core i5, 8GB RAM), creation of AMFs takes
0.0125 msec per triple, or 0.125 seconds for AMF creation of size 10.000. As such,
AMFs of size 10.000 or less can be created on the fly with acceptable durations for
Web servers (after which they can still be cached).</p>
        <p>
          Fig. 14 shows that there is only a very small number of triple patterns with a very
large number of matches. When setting the WatDiv dataset to a size of 10M triples,
there are only 90 triple patterns with a size larger than 10.000. Setting that size to
100M triples, this number increases to 255, so this is not a linear increase. Due to this
low number of very large patterns, servers can easily pre-compute these offline before
dataset publication time. Since the WatDiv dataset achieves a high diversity of
structuredness, it is similar to real-world RDF datasets [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. As such, this behavior can be
generalized to other datasets with a similar structuredness.
        </p>
      </sec>
      <sec id="sec-6-3">
        <title>Bloom Filters are Preferred over GCS with Active Cache Results show that</title>
        <p>when AMFs are pre-computed, Bloom filters achieve faster query evaluation times
than GCS (Research Question 2). This is because Bloom filter creation requires less
effort client-side than GCS due to the simpler decompression, at the cost of more
server effort. However, this higher server effort is negligible if AMFs can be
pre-computed. As such, we recommend Bloom filters to always be preferred over GCS, unless
AMFs can not be cached.</p>
      </sec>
      <sec id="sec-6-4">
        <title>A Good Trade-off Between False-positive Probabilities and AMF Size Lowering</title>
        <p>the false-positive probability of an AMF increases its size. As we have seen that larger
AMFs have an impact on query evaluation times, we do not want AMFs to become
too large. On the other hand, we do not want the false-positive probabilities to become
too high, as that leads to more unneeded HTTP requests. Our results have shown that
a probability of 1/64 leads to an optimal balance for our experiments (Research
Question 5). However, further research is needed to investigate this trade-off for other
types of datasets and queries.</p>
      </sec>
      <sec id="sec-6-5">
        <title>6.2. Recommendations for Publishers</title>
        <p>Based on the conclusions of our experimental results, we derived the following
guidelines for publishers who aim to use the AMF feature:</p>
        <p>Enable HTTP caching with a tool such as NGINX.</p>
        <p>Pre-compute AMFs (or at least cache) AMFs of size 10.000 or higher.
If AMFs can be cached, prefer Bloom filters over GCS.</p>
        <p>Use a false-positive probability of 1/64.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Buil-Aranda</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vandenbussche</surname>
            , P.-Y.:
            <given-names>SPARQL</given-names>
          </string-name>
          <string-name>
            <surname>Web-Querying</surname>
            <given-names>Infrastructure</given-names>
          </string-name>
          :
          <article-title>Ready for Action? In: The Semantic Web-ISWC</article-title>
          <year>2013</year>
          . pp.
          <fpage>277</fpage>
          -
          <lpage>293</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Vander</given-names>
            <surname>Sande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Hartig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            ,
            <surname>Van Herwegen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>De Vocht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>De Meester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Haesendonck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Colpaert</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.</surname>
          </string-name>
          :
          <article-title>Triple Pattern Fragments: a Low-cost Knowledge Graph Interface for the Web</article-title>
          .
          <source>Journal of Web Semantics</source>
          . (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hartig</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buil-Aranda</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <string-name>
            <surname>Bindings-Restricted Triple Pattern</surname>
          </string-name>
          <article-title>Fragments</article-title>
          .
          <source>In: Proceedings of the 15th International Conference on Ontologies, DataBases, and Applications of Semantics</source>
          . pp.
          <fpage>762</fpage>
          -
          <lpage>779</lpage>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Minier</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Skaf-Molli</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Molli</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          : SaGe:
          <article-title>Web preemption for public SPARQL query services</article-title>
          .
          <source>In: The World Wide Web Conference</source>
          . pp.
          <fpage>1268</fpage>
          -
          <lpage>1278</lpage>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Amr</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernandez</surname>
            <given-names>Garcia</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J.D.</given-names>
            ,
            <surname>Maribel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Polleres</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          :
          <article-title>SMART-KG: Hybrid Shipping for SPARQL Querying on the Web</article-title>
          .
          <article-title>(</article-title>
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumontier</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>A Web API ecosystem through feature-based reuse</article-title>
          .
          <source>Internet Computing</source>
          .
          <volume>22</volume>
          ,
          <fpage>29</fpage>
          -
          <lpage>37</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Vander</given-names>
            <surname>Sande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Verborgh</surname>
          </string-name>
          , R., Van Herwegen,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Mannens</surname>
          </string-name>
          , E., Van de Walle, R.:
          <article-title>Opportunistic Linked Data querying through approximate membership metadata</article-title>
          .
          <source>In: International Semantic Web Conference</source>
          . pp.
          <fpage>92</fpage>
          -
          <lpage>110</lpage>
          . Springer (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Van</given-names>
            <surname>Herwegen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Verborgh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Mannens</surname>
          </string-name>
          , E., Van de Walle, R.:
          <article-title>Query Execution Optimization for Clients of Triple Pattern Fragments</article-title>
          .
          <source>In: The Semantic Web. Latest Advances and New Domains</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Taelman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Vander</given-names>
            <surname>Sande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Verborgh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Mannens</surname>
          </string-name>
          , E.:
          <article-title>Versioned Triple Pattern Fragments: A Low-cost Linked Data Interface Feature for Web Archives</article-title>
          .
          <source>In: Proceedings of the 3rd Workshop on Managing the Evolution and Preservation of the Data Web</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Verborgh</surname>
          </string-name>
          , R.:
          <article-title>DBpedia's triple pattern fragments: usage patterns and insights</article-title>
          .
          <source>In: European Semantic Web Conference</source>
          . pp.
          <fpage>431</fpage>
          -
          <lpage>442</lpage>
          . Springer (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Bloom</surname>
            ,
            <given-names>B.H.</given-names>
          </string-name>
          :
          <article-title>Space/time trade-offs in hash coding with allowable errors</article-title>
          .
          <source>Communications of the ACM</source>
          .
          <volume>13</volume>
          ,
          <fpage>422</fpage>
          -
          <lpage>426</lpage>
          (
          <year>1970</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Putze</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanders</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singler</surname>
          </string-name>
          , J.: Cache-,
          <string-name>
            <surname>hash-</surname>
          </string-name>
          ,
          <article-title>and space-efficient bloom filters</article-title>
          .
          <source>Journal of Experimental Algorithmics (JEA)</source>
          .
          <volume>14</volume>
          ,
          <issue>4</issue>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Gallager</surname>
          </string-name>
          , R.,
          <string-name>
            <surname>Van Voorhis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Optimal source codes for geometrically distributed integer alphabets (corresp</article-title>
          .).
          <source>IEEE Transactions on Information theory.</source>
          (
          <year>1975</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , Liu,
          <string-name>
            <surname>C.</surname>
          </string-name>
          :
          <article-title>Estimating selectivity for joined RDF triple patterns</article-title>
          .
          <source>In: Proceedings of the 20th ACM international conference on Information and knowledge management</source>
          . pp.
          <fpage>1435</fpage>
          -
          <lpage>1444</lpage>
          . ACM (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
          </string-name>
          , G.:
          <article-title>Scalable join processing on very large RDF graphs</article-title>
          .
          <source>In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data</source>
          . pp.
          <fpage>627</fpage>
          -
          <lpage>640</lpage>
          . ACM (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Dia</surname>
            ,
            <given-names>A.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aoul</surname>
            ,
            <given-names>Z.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Métais</surname>
          </string-name>
          , E.:
          <article-title>Fast SPARQL join processing between distributed streams and stored RDF graphs using bloom filters</article-title>
          .
          <source>In: 12th International Conference on Research Challenges in Information Science</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Hose</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schenkel</surname>
          </string-name>
          , R.:
          <article-title>Towards benefit-based RDF source selection for SPARQL queries</article-title>
          .
          <source>In: Proceedings of the 4th International Workshop on Semantic Web Information Management</source>
          . p.
          <fpage>2</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Taelman</surname>
            , R., Van Herwegen,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vander</surname>
            <given-names>Sande</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Verborgh</surname>
          </string-name>
          , R.:
          <article-title>Comunica: a Modular SPARQL Query Engine for the Web</article-title>
          .
          <source>In: Proceedings of the 17th International Semantic Web Conference</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Acosta</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vidal</surname>
          </string-name>
          , M.-E.,
          <string-name>
            <surname>Sure-Vetter</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Diefficiency metrics: measuring the continuous efficiency of query processing approaches</article-title>
          . In: International Semantic Web Conference. pp.
          <fpage>3</fpage>
          -
          <lpage>19</lpage>
          . Springer (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Duan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kementsietsidis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srinivas</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Udrea</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Apples and oranges: a comparison of RDF benchmarks and real RDF datasets</article-title>
          .
          <source>In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>