<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/3442381.3449911</article-id>
      <title-group>
        <article-title>Optimizing Traversal Queries of Sensor Data Using a Rule-Based Reachability Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bryan-Elliott Tam</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruben Taelman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julián Rojas Meléndez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pieter Colpaert</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IDLab, Department of Electronics and Information Systems, Ghent University - imec</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>12851</volume>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>Link Traversal queries face challenges in completeness and execution time due to the size of the Web. Reachability criteria define completeness by restricting the links followed by engines. However, the number of links to dereference remains the bottleneck of the approach. Web environments often have structures that query engines can exploit to prune irrelevant sources. Current criteria rely on information from the query definition and predefined predicates. However, it is difficult to use them to traverse environments where logical expressions indicate the location of resources. We propose a rule-based reachability criterion that captures logical statements expressed in hypermedia descriptions within linked data documents to prune irrelevant sources. In this poster paper, we show how the Comunica link traversal engine is modified to take hints from a hypermedia control vocabulary to prune irrelevant sources. Our preliminary findings show that, using this strategy, the query engine can significantly reduce the number of HTTP requests and the query execution time without sacrificing the completeness of results. Our work shows that the investigation of hypermedia controls for link pruning in traversal queries is a worthwhile effort for optimizing web queries over unindexed decentralized databases.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked data</kwd>
        <kwd>Link Traversal Query Processing</kwd>
        <kwd>Fragmented database</kwd>
        <kwd>Decentralized environments</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The increasing amount of available Linked Data on the Web [1] prompts the need for efficient query
interfaces. During a typical query execution in a SPARQL endpoint, the endpoint takes the whole
query load and delivers the results to the client. This paradigm can lead to high workloads, which
are partly responsible for the historically low availability of SPARQL endpoints [2]. Researchers and
practitioners have made efforts to introduce alternative Linked Data publication methods that enable
clients to participate in the query execution process [3]. The goal of those methods is to lower server-side
workloads while keeping query execution fast for the client [4]. The TREE hypermedia specification is an
effort in that direction [5, 6]: it introduces the concept of domain-oriented fragmentation of large RDF
datasets. For example, in the case of periodic measurements of sensor data, a fragmentation can be made
on the publication date of each data entity. A fragment can be considered an RDF document published on
a server. TREE aims to describe dataset fragmentation in ways that enable clients to easily fetch
query-relevant subsets. The data within a fragment are bound by constraints expressed through hypermedia
descriptions [7]. Each fragment contains relations to other pages, and those relations carry the
constraints on the data of every reachable fragment. In this paper, we refer to those constraints as
domain-specific expressions. They can be expressions such as ? &gt; 2022-09-01T00:00:00.000000 =⇒
ex:afterFirstSeptember, given that ? is the date of publication of sensor data and the implication pertains
to the location of the data respecting the constraint. In English, the expression means “the data produced
by the sensors after the first of September are stored at ex:afterFirstSeptember.” Because of the
hyperlinked nature of the document network, clients must traverse the documents to find the relevant data
to answer their queries. We propose to use Link Traversal Query Processing (LTQP) [8] as a query
mechanism to perform those queries.</p>
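      <p>The date-based fragmentation described above can be sketched as follows. This is a minimal illustration, assuming a single split point and hypothetical fragment IRIs (ex:beforeFirstSeptember, ex:afterFirstSeptember); the actual layout of a TREE dataset is defined by its publisher.</p>

```python
from datetime import datetime

# Hypothetical split point of the fragmentation; the fragment IRIs below
# are illustrative, mirroring "the data produced by the sensors after the
# first of September are stored at ex:afterFirstSeptember".
SPLIT = datetime(2022, 9, 1)

def fragment_for(timestamp: datetime) -> str:
    """Return the (illustrative) IRI of the fragment holding a measurement."""
    if timestamp > SPLIT:
        return "ex:afterFirstSeptember"
    return "ex:beforeFirstSeptember"

# Assign each sensor measurement (timestamp, value) to its fragment.
measurements = [
    (datetime(2022, 1, 3, 10, 57, 54), 21.5),
    (datetime(2022, 10, 2, 8, 0, 0), 19.0),
]
fragments = {}
for ts, value in measurements:
    fragments.setdefault(fragment_for(ts), []).append((ts, value))
```

      <p>A client that knows the constraint of each fragment can fetch only the fragments whose constraints its query can satisfy, which is the intuition behind the pruning approach of this paper.</p>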
      <p>LTQP starts by dereferencing a set of user-provided URLs [8]. From these dereferenced documents,
links to other documents are dereferenced recursively, and the retrieved triples are inserted into an
internal data store. LDQL [9] is a theoretical query language to define the traversal of LTQP queries.
However, LDQL is centered around nested regular expressions and is therefore not made to express the
traversal of links based on domain-specific expressions such as time relations. The subweb specifications
language (SWSL) [10] allows data providers to define traversal paths concerning the information they
publish. Thus, given that the query engine trusts the data publisher, it can adapt its traversal to follow
the paths given by the specification. Akin to LDQL, it is difficult with SWSL to express traversal using
domain-specific expressions, because its syntax is centered around the matching of triple patterns rather
than reasoning rules or the evaluation of literals. Furthermore, unlike LDQL, SWSL does not propose a
mechanism for using the query or input from the user to influence the source selection process. Given
those limitations, we propose to return to the more abstract concept of reachability criteria [11] to define
a mechanism of traversal centered around rules.</p>
      <p>In this paper, we propose to use a boolean solver as the main link pruning mechanism for a reachability
criterion to traverse TREE documents. The logical operators are defined by the TREE specification
(https://treecg.github.io/specification/). As a concrete use case, we consider the publication of (historical)
sensor data. An example query is presented in Figure 1, along with the triples representing the link
between two documents expressed using the TREE specification.</p>
      <p>Figure 1: Example query over sensor data.</p>
      <preformat>PREFIX xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt;
PREFIX saref: &lt;https://saref.etsi.org/core/&gt;
PREFIX dahcc: &lt;https://dahcc.idlab.ugent.be/Ontology/Sensors/&gt;

SELECT * WHERE {
  ?s saref:hasTimestamp ?t;
    saref:hasValue ?result;
    saref:measurementMadeBy ?sensor.
  ?sensor dahcc:analyseStateOf ?stateOf;
    saref:measuresProperty {:property}.
  FILTER(?t="2022-01-03T10:57:54.000000"^^xsd:dateTime)
}</preformat>
    </sec>
    <sec id="sec-2">
      <title>2. A Rule-Based Reachability Criterion</title>
      <p>Most research on LTQP is centered around query execution in Linked Open Data environments. Given
the pseudo-infinite number of documents on the Web, traversing all documents is practically
infeasible. To define completeness, different reachability criteria [11] were introduced to allow the
discrimination of links. Recently, an alternative direction was explored in which the query engine uses
the structure provided by the data publisher to guide itself towards relevant data sources [12, 13].</p>
      <p>We define our approach as a rule-based reachability criterion. Our approach builds upon the concept
of structural assumptions [12] to exploit the structural properties of TREE-annotated datasets. We
therefore interpret the hypermedia descriptions of constraints in TREE fragments as boolean expressions
(? &gt;= 2022-01-03T09:47:59.000000 in Figure 1). Upon discovery of a document, the query engine
gathers the relevant triples to form the boolean expression of the constraint on the data of reachable
fragments. After the parsing of the expression, the filter expression of the SPARQL query is pushed
down into the engine’s source selection component. The source selection component can be formalized
as a reachability criterion (we use a simplified formalization to illustrate the source selection mechanism
and to avoid introducing concepts unnecessary for the aim of this poster paper):</p>
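      <p>The transformation of a TREE relation into a boolean expression can be sketched as follows. This is a simplified, hypothetical representation: a real engine extracts the relation type, value, and path from the RDF triples of the fragment, and the TREE vocabulary defines more relation types than the three shown here.</p>

```python
import operator
from datetime import datetime
from typing import Callable

# Map (a subset of) TREE relation types to comparison operators.
COMPARATORS = {
    "tree:GreaterThanOrEqualToRelation": operator.ge,
    "tree:LessThanRelation": operator.lt,
    "tree:EqualToRelation": operator.eq,
}

def relation_to_expression(relation_type: str, value: datetime) -> Callable[[datetime], bool]:
    """Turn a parsed TREE relation into a boolean expression E over the
    variable targeted by the fragmentation path."""
    comparator = COMPARATORS[relation_type]
    return lambda x: comparator(x, value)

# E corresponds to the constraint "? >= 2022-01-03T09:47:59" of Figure 1.
E = relation_to_expression(
    "tree:GreaterThanOrEqualToRelation",
    datetime(2022, 1, 3, 9, 47, 59),
)
```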
      <p>
        c(l) → {true, false} (1)
where c is the reachability criterion and, when c(l) returns true, the target IRI l must be dereferenced.
Finally, the two boolean expressions are evaluated to determine their satisfiability. It can be formalized
as determining whether
        c(l) = ∃x | (E(x) ∧ F(x)) = true (2)
holds, where E is the boolean expression of the TREE relation, F is the filter expression of the SPARQL
query, x is the variable targeted by the fragmentation path, and l is the link towards the next fragment
(&lt;nextNode&gt; from “&lt;&gt; tree:node &lt;nextNode&gt;” in Figure 1). The targeted variable is given
by the RDF object ?target of the triple defining the fragmentation path, of the form “?s tree:path
?target” (saref:hasTimestamp in Figure 1). Upon satisfaction, the IRI targeting the next fragment is
added to the link queue; otherwise, the IRI is pruned. The process is schematized in Figure 2.
      </p>
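      <p>For the equality filter of Figure 1, the satisfiability test does not require a general-purpose boolean solver: since F(x) holds only at the filtered value, it suffices to evaluate E at that value. The sketch below illustrates this special case with hypothetical names; richer filter expressions would require an actual solving step.</p>

```python
from datetime import datetime
from typing import Callable

Expr = Callable[[datetime], bool]

def reachable(E: Expr, filter_value: datetime) -> bool:
    """Reachability criterion c(l) for an equality FILTER.

    With F(x) defined as x == filter_value, checking whether some x
    satisfies both E(x) and F(x) reduces to evaluating E(filter_value)."""
    return E(filter_value)

# E encodes the relation "?t >= 2022-01-03T09:47:59" guarding the link
# towards the next fragment; the filter value comes from FILTER(?t = ...).
E: Expr = lambda x: x >= datetime(2022, 1, 3, 9, 47, 59)
follow_link = reachable(E, datetime(2022, 1, 3, 10, 57, 54))
```

      <p>When reachable returns true, the link is added to the link queue; otherwise it is pruned.</p>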
      <p>Figure 2: Schematization of the rule-based pruning process. (1) The query engine dereferences a
tree:Node. (2) It transforms the tree:relation(s), such as saref:hasTimestamp &lt; 2022-01-03T09:47:59Z
and saref:hasTimestamp &gt;= 2022-01-03T09:47:59Z, into boolean expressions E1 and E2. (3) It collects
the filter expression segments F1 and F2 related to E1 and E2. (4) It evaluates the satisfiability of each
E and F pair. (5) It prunes the links related to unsatisfiable E and F couples.</p>
      <sec id="sec-2-1">
        <title>2.1. Preliminary Results</title>
        <p>We implemented our approach using the query engine Comunica [14]. For the evaluation, we
executed four queries similar to the one in Figure 1 (the implementation, the queries, and the evaluation
are available at https://github.com/constraintAutomaton/comunica-feature-link-traversal/tree/feature/time-filtering-tree-sparqleeimplementation
and https://github.com/TREEcg/TREE-Guided-Link-Traversal-Query-Processing-Evaluation/tree/main).
They were executed over the DAHCC participant 31 dataset [15] (487 MB) with a timeout of two minutes.
We fragmented the dataset according to the TREE specification, using a B-tree topology with a depth of 1
and either 100 or 1000 nodes.</p>
        <p>The queries were executed using two configurations. In the first configuration, we use a
predicate-based reachability criterion where the engine follows each link of the fragmented dataset
(the query engine always follows ex:nextNode from expressions following the schema "ex:currentNode
tree:relation [tree:node ex:nextNode]", regardless of the constraints). For the second one, we use our
rule-based reachability criterion. As shown in Table 1, no queries could be answered within the timeout
when following every fragment. A possible explanation is the high number of HTTP requests performed
[8], leading to non-relevant data sources. With our rule-based reachability criterion, the queries executed
over the 1000-node fragmentation perform better than the ones over 100 nodes: the query execution
time is reduced by 86% for Q1 and by 79% for Q2 compared to the 100-node fragmentation. For Q3, the
reduction is only 33%; this smaller performance gain might be caused by a six-fold increase in HTTP
requests. This is an interesting observation, because a reduction in HTTP requests does not always come
with a proportional reduction in execution time. Previous research has proposed that inefficient query
plans might be the bottleneck of some queries in structured environments [12, 16]. However, our results
seem to show that the size of the internal triple store might have a bigger impact on performance than
noted in previous studies. As large-scale link traversal over the Web results in the acquisition of a large
number of triples, an interesting future research direction would be to find ways to remove, from the
internal triple store, triples that are certain not to lead to a query result. Query Q4 could not be answered
with any setup, because it requires processing a larger number of fragments than the other queries.</p>
      </sec>
    </sec>
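      <p>The overall pruning process of Section 2 (dereference, transform, evaluate, prune) could be orchestrated roughly as below. The in-memory web of fragments, the link queue, and the relation representation are simplified stand-ins for the actual components of a link traversal engine such as Comunica, and the fragment IRIs are hypothetical.</p>

```python
import operator
from collections import deque
from datetime import datetime

CUTOFF = datetime(2022, 1, 3, 9, 47, 59)

# Illustrative web of TREE fragments: IRI -> (data, relations).
# A relation pairs a target IRI with a boolean expression E over the
# fragmentation-path variable; all names here are hypothetical.
WEB = {
    "ex:root": ([], [
        ("ex:before", lambda x: operator.lt(x, CUTOFF)),
        ("ex:after", lambda x: operator.ge(x, CUTOFF)),
    ]),
    "ex:before": (["older measurements"], []),
    "ex:after": (["newer measurements"], []),
}

def traverse(seed, filter_value):
    """Dereference fragments, pruning links whose expression E cannot be
    satisfied together with the query's equality filter F."""
    queue, seen, store = deque([seed]), set(), []
    while queue:
        iri = queue.popleft()
        if iri in seen:
            continue
        seen.add(iri)
        data, relations = WEB[iri]   # stands in for an HTTP dereference
        store.extend(data)
        for target, E in relations:
            if E(filter_value):      # satisfiable: keep the link
                queue.append(target)
            # otherwise the link is pruned from the traversal
    return store

store = traverse("ex:root", datetime(2022, 1, 3, 10, 57, 54))
```

      <p>In this sketch, only ex:after is dereferenced for the filter value of Figure 1; ex:before is pruned, saving one request.</p>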
    <sec id="sec-3">
      <title>3. Conclusion</title>
      <p>This paper reported on preliminary tests adding guided link traversal support to the Comunica
query engine using a rule-based reachability approach. A similar approach could be implemented
in other SPARQL query engines supporting Link Traversal Query Processing. Our preliminary
results show that our rule-based reachability criterion can significantly reduce the execution time of
queries aligned with hypermedia description constraints compared to predicate-based reachability,
opening the possibility of faster and more versatile traversal-based query execution over fragmented
RDF documents. Our experiment also highlights that the size of the internal data store might have
more impact on performance than noted in previous studies. In future work, we will perform more
exhaustive evaluations of other types of domain-oriented fragmentation strategies, such as string-based
and geospatial fragmentations, and investigate how to generalize our approach to support more
expressive online reasoning for source selection during traversal queries. Furthermore, we showed
there might still be room for optimization by researching ways to prune useless triples from the
internal triple store during the link traversal process.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I.</given-names>
            <surname>Ermilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <article-title>Linked open data statistics: Collection and exploitation</article-title>
          , in: P.
          <string-name>
            <surname>Klinov</surname>
          </string-name>
          , D. Mouromtsev (Eds.),
          <source>Knowledge Engineering and the Semantic Web</source>
          , Springer, Berlin, Heidelberg,
          <year>2013</year>
          , pp.
          <fpage>242</fpage>
          -
          <lpage>249</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>