=Paper=
{{Paper
|id=Vol-3828/ISWC2024_paper_6
|storemode=property
|title=Optimizing traversal queries of sensor data using a rule-based reachability approach
|pdfUrl=https://ceur-ws.org/Vol-3828/paper6.pdf
|volume=Vol-3828
|authors=Bryan-Elliott Tam,Ruben Taelman,Julián Rojas Meléndez,Pieter Colpaert
|dblpUrl=https://dblp.org/rec/conf/semweb/TamTMC24
}}
==Optimizing traversal queries of sensor data using a rule-based reachability approach==
Bryan-Elliott Tam¹·*, Ruben Taelman¹, Julián Rojas Meléndez¹ and Pieter Colpaert¹
¹ IDLab, Department of Electronics and Information Systems, Ghent University – imec
Abstract
Link traversal queries face completeness challenges and long execution times due to the size of the Web.
Reachability criteria define completeness by restricting the links followed by engines. However, the number of
links to dereference remains the bottleneck of the approach. Web environments often have structures exploitable
by query engines to prune irrelevant sources. Current criteria rely on information from the query definition
and predefined predicates. However, it is difficult to use them to traverse environments where logical expressions
indicate the location of resources. We propose to use a rule-based reachability criterion that captures logical
statements expressed in hypermedia descriptions within linked data documents to prune irrelevant sources. In
this poster paper, we show how the Comunica link traversal engine is modified to take hints from a hypermedia
control vocabulary to prune irrelevant sources. Our preliminary findings show that by using this strategy, the
query engine can significantly reduce the number of HTTP requests and the query execution time without
sacrificing the completeness of results. Our work shows that the investigation of hypermedia controls in link
pruning of traversal queries is a worthy effort for optimizing web queries of unindexed decentralized databases.
Keywords
Linked data, Link Traversal Query Processing, Fragmented databases, Decentralized environments
1. Introduction
The increasing amount of available Linked Data on the Web [1] prompts the need for efficient query
interfaces. During a typical query execution in a SPARQL endpoint, the endpoint takes the whole
query load and delivers the results to the client. This paradigm can lead to high workloads, which
are partly responsible for the historically low availability of SPARQL endpoints [2]. Researchers and
practitioners have made efforts to introduce alternative Linked Data publication methods that enable
clients’ participation in the query execution process [3]. The goal of those methods is to lower server-side
workloads while keeping query execution fast for the client [4]. The TREE hypermedia specification is an
effort in that direction [5, 6]; it introduces the concept of domain-oriented fragmentation of large RDF
datasets. For example, in the case of periodic measurements of sensor data, a fragmentation can be made
on the publication date of each data entity. A fragment can be considered an RDF document published on
a server. TREE aims to describe dataset fragmentation in ways that enable clients to easily fetch query-relevant
subsets. The data within a fragment are bound by constraints expressed through hypermedia
descriptions [7]. Each fragment contains relations to other pages, and those relations contain the
constraints of the data of every reachable fragment. In this paper, we refer to those constraints as
domain-specific expressions. They can be expressions such as ?𝑡 > 2022-09-01T00:00:00.000000 =⇒
ex:afterFirstSeptember, given that ?𝑡 is the date of publication of sensor data and the implication pertains
to the location of the data respecting the constraint. In English, the expression means “the data produced
by the sensors after the first of September are stored at ex:afterFirstSeptember.” Because of the
hyperlinked nature of the document network, clients must traverse the documents to find the relevant data
Posters, Demos, and Industry Tracks at ISWC 2024, November 13–15, 2024, Baltimore, USA
* Corresponding author.
bryanelliott.tam@ugent.be (B. Tam); ruben.taelman@ugent.be (R. Taelman); JulianAndres.RojasMelendez@UGent.be (J. Rojas Meléndez); pieter.colpaert@ugent.be (P. Colpaert)
https://www.rubensworks.net (R. Taelman); https://julianrojas.org (J. Rojas Meléndez); https://pietercolpaert.be (P. Colpaert)
ORCID: 0000-0003-3467-9755 (B. Tam); 0000-0001-5118-256X (R. Taelman); 0000-0002-6645-1264 (J. Rojas Meléndez); 0000-0001-6917-2167 (P. Colpaert)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
to answer their queries. We propose to use Link Traversal Query Processing (LTQP) [8] as a query
mechanism to perform those queries.
LTQP starts by dereferencing a set of user-provided URLs [8]. From these dereferenced documents,
links to other documents are extracted and dereferenced recursively, and the retrieved triples are inserted into an internal data store. LDQL [9]
is a theoretical query language to define the traversal of LTQP queries. However, LDQL is centered
around nested regular expressions and is thus not suited to express the traversal of links based on domain-specific
expressions such as time relations. The subweb specifications language (SWSL) [10] allows
data providers to define traversal paths concerning the information they publish. Thus, given that
the query engine trusts the data publisher, it can adapt its traversal to follow the paths given by the
specification. As with LDQL, it is difficult to express traversal using domain-specific expressions in
SWSL, because its syntax is centered around the matching of triple patterns rather than reasoning
rules or the evaluation of literals. Furthermore, unlike LDQL, SWSL does not propose a mechanism for using the query
or input from the user to influence the source selection process. Given those limitations, we
propose to return to the more abstract concept of reachability criteria [11] to define a traversal
mechanism centered around rules.
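Conceptually, LTQP with a reachability criterion amounts to a breadth-first traversal over documents in which the criterion filters the link queue. The following is a minimal, HTTP-free sketch of this loop; the function names and the stubbed dereferencing are our own illustrative assumptions, not Comunica's API.

```python
from collections import deque

def traverse(seeds, dereference, extract_links, reachable):
    """Minimal LTQP loop (sketch): `dereference(iri)` returns the triples of
    the document at `iri`, `extract_links(triples)` the IRIs it links to,
    and `reachable(iri)` is the pluggable reachability criterion c(i)."""
    store = set()          # internal data store of retrieved triples
    queue = deque(seeds)   # link queue, seeded with the user-provided URLs
    seen = set(seeds)
    while queue:
        iri = queue.popleft()
        triples = dereference(iri)
        store |= triples
        for link in extract_links(triples):
            # Follow only links the reachability criterion accepts.
            if link not in seen and reachable(link):
                seen.add(link)
                queue.append(link)
    return store
```

Any reachability criterion, including the rule-based one proposed below, plugs in as the `reachable` predicate.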
In this paper, we propose to use a boolean solver as the main link pruning mechanism for a reachability
criterion to traverse TREE documents. The logical operators are defined by the TREE specification.¹
As a concrete use case, we consider the publication of (historical) sensor data. An example query is
presented in Figure 1 along with the triples representing the link between two documents expressed
using the TREE specification.
SPARQL query (left):

    PREFIX xsd: <…>
    PREFIX etsi: <…>
    PREFIX dahcc: <…>
    PREFIX saref: <…>

    SELECT * WHERE {
      ?s saref:hasTimestamp ?t;
         saref:hasValue ?result;
         saref:measurementMadeBy ?sensor.
      ?sensor dahcc:analyseStateOf ?stateOf;
              saref:measuresProperty {:property}.
      FILTER(?t="2022-01-03T10:57:54.000000"^^xsd:dateTime)
    }

TREE hypermedia description (right):

    @prefix tree: <…> .
    @prefix xsd: <…> .
    @prefix ex: <…> .
    @prefix saref: <…> .

    <> tree:relation [
      a tree:GreaterThanOrEqualToRelation ;
      tree:node ex:nextNode ;
      tree:value "2022-01-03T09:47:59"^^xsd:dateTime ;
      tree:path saref:hasTimestamp
    ] .

Figure 1: On the left, a SPARQL query to get sensor measurements and information about the sensor. On the right, the hypermedia description of the location and constraint of the next fragment, located at ex:nextNode. The constraint describes publication times (?𝑡) where ?𝑡 >= 2022-01-03T09:47:59.000000.
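The relation on the right of Figure 1 can be turned into an executable boolean expression. Below is a minimal Python sketch; the dictionary keys and the OPS table are simplified assumptions of ours, not the TREE vocabulary's canonical parsing API.

```python
from datetime import datetime

# Illustrative in-memory form of the tree:relation of Figure 1; the keys
# are assumptions of this sketch, not a canonical TREE parsing API.
relation = {
    "type": "GreaterThanOrEqualToRelation",
    "node": "ex:nextNode",                     # IRI of the next fragment
    "path": "saref:hasTimestamp",              # fragmentation path
    "value": datetime(2022, 1, 3, 9, 47, 59),  # tree:value
}

# Mapping from TREE relation types to comparison operators.
OPS = {
    "GreaterThanOrEqualToRelation": lambda t, v: t >= v,
    "LessThanRelation": lambda t, v: t < v,
    "EqualToRelation": lambda t, v: t == v,
}

def to_expression(rel):
    """Build the boolean expression E over the variable bound to tree:path."""
    op = OPS[rel["type"]]
    value = rel["value"]
    return lambda t: op(t, value)

# E(t) is true exactly when a timestamp t may appear in the next fragment.
E = to_expression(relation)
```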
2. A Rule-Based Reachability Criterion
Most research on LTQP is centered around query execution in Linked Open Data environments. Given
the pseudo-infinite number of documents on the Web, traversing over all documents is practically
infeasible. To define completeness, different reachability criteria [11] were introduced to allow the
discrimination of links. Recently, an alternative direction was explored in which the query engine uses
the structure provided by the data publisher to guide itself towards relevant data sources [12, 13].
We define our approach as a rule-based reachability criterion. Our approach builds upon the concept
of structural assumptions [12] to exploit the structural properties of TREE annotated datasets. We
therefore interpret the hypermedia descriptions of constraints in TREE fragments as boolean expressions
𝐸 (?𝑡 >= 2022-01-03T09:47:59.000000 in Figure 1). Upon discovery of a document, the query engine
gathers the relevant triples to form the boolean expression of the constraint on the data of reachable
fragments. After the parsing of the expression, the filter expression 𝐹 of the SPARQL query is pushed
down into the engine’s source selection component. The source selection component can be formalized
as a reachability criterion²

    𝑐(𝑖) → {true, false}    (1)

where, when it returns true, the target IRI 𝑖 must be dereferenced. Finally, the two boolean expressions
are evaluated to determine their satisfiability. This can be formalized as determining whether

    𝑐(𝑖) = ∃𝑥. (𝐹(𝑥) ∧ 𝐸𝑖(𝑥)) = true    (2)

holds, given that 𝑥 is the variable targeted by 𝐸𝑖 and 𝑖 is the link towards the next fragment
(ex:nextNode from “<> tree:node ex:nextNode” in Figure 1). A variable targeted by 𝐸 is the RDF
object ?target of the triple defining the fragmentation path, in the form “?s tree:path ?target”
(saref:hasTimestamp in Figure 1). Upon satisfaction, the IRI targeting the next fragment is added to
the link queue; otherwise, it is pruned. The process is schematized in Figure 2.

¹ https://treecg.github.io/specification/
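For the equality filter of Figure 1, the satisfiability check of Equation (2) reduces to evaluating 𝐸𝑖 at the single value admitted by 𝐹. A hedged sketch of this special case follows; the function and variable names are ours, not Comunica's.

```python
from datetime import datetime

def satisfiable(eq_value, E):
    """Decide ∃x. F(x) ∧ E(x) when F is an equality filter (?t = eq_value):
    the conjunction is satisfiable iff E holds at that single value."""
    return E(eq_value)

def c(relation_value, eq_value):
    """Reachability criterion c(i) for a link i guarded by a
    tree:GreaterThanOrEqualToRelation: dereference i iff the fragment
    behind it may hold the value the filter asks for."""
    E = lambda t: t >= relation_value  # E_i from the hypermedia description
    return satisfiable(eq_value, E)

# Values from Figure 1: the filter value lies after the relation bound,
# so the link to ex:nextNode is kept in the link queue.
follow = c(datetime(2022, 1, 3, 9, 47, 59), datetime(2022, 1, 3, 10, 57, 54))
```

A general boolean solver takes the place of `satisfiable` when 𝐹 is a richer expression than a single equality.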
[Figure 2 diagram: a query engine holds the SPARQL query of Figure 1 and processes a TREE document in five steps: (1) dereference a tree:Node; (2) transform its tree:relation(s), saref:hasTimestamp < 2022-01-03T09:47:59Z and saref:hasTimestamp >= 2022-01-03T09:47:59Z, into boolean expressions 𝐸1 and 𝐸2; (3) collect the filter expression segments 𝐹1 and 𝐹2 related to 𝐸1 and 𝐸2; (4) evaluate the satisfiability of each 𝐸 and 𝐹 pair; (5) prune the link related to the unsatisfiable 𝐸 and 𝐹 couple.]
Figure 2: A schematization of our rule-based reachability criterion with a TREE document. First, a TREE node is
dereferenced; then the TREE relations are transformed into boolean expressions 𝐸, followed by the construction
of 𝐹 from the filter expression related to the path of 𝐸 (the variable ?𝑡 related to saref:hasTimestamp); then
the satisfiability of 𝐸 ∧ 𝐹 is determined, and finally links to non-query-relevant data are pruned.
2.1. Preliminary Results
We implemented our approach using the query engine Comunica [14]. For evaluation, we executed four
queries similar to the one in Figure 1.³ They were executed over the DAHCC participant 31 dataset [15]
(487 MB) with a timeout of two minutes. We fragmented the dataset according to the TREE specification,
using a B-tree topology with a depth of 1 and 100 and 1000 nodes (𝑛).
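To make this fragmentation concrete, the following sketches a depth-1 B-tree: a root node advertises one relation per leaf fragment, each leaf holding a contiguous slice of the time-sorted observations. This is a simplified illustration with hypothetical names, not the tooling used for the evaluation.

```python
def fragment(observations, n):
    """Split observations, sorted by timestamp, into (at most) n leaf
    fragments, and compute the timestamp bounds a root node would
    advertise as TREE-style relations, one per leaf."""
    size = -(-len(observations) // n)  # ceiling division: items per leaf
    fragments = [observations[i:i + size]
                 for i in range(0, len(observations), size)]
    relations = [
        {"node": f"fragment-{k}",  # illustrative leaf IRI
         "lower": leaf[0][0],      # >= bound: first timestamp in the leaf
         "upper": leaf[-1][0]}     # <= bound: last timestamp in the leaf
        for k, leaf in enumerate(fragments)
    ]
    return fragments, relations
```

With such bounds, an equality filter on the timestamp lets the engine prune all leaves but one.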
The queries were executed using two configurations. In the first configuration, we use a predicate-based
reachability criterion where the engine follows each link of the fragmented dataset.⁴ For the
² We use a simplified formalization to illustrate the source selection mechanism and to avoid introducing unnecessary concepts for the aim of this poster paper.
³ The implementation, the queries, and the evaluation are available at the following links: https://github.com/constraintAutomaton/comunica-feature-link-traversal/tree/feature/time-filtering-tree-sparqlee-implementation, https://github.com/TREEcg/TREE-Guided-Link-Traversal-Query-Processing-Evaluation/tree/main
⁴ The query engine will always follow ex:nextNode from expressions following the schema "ex:currentNode tree:relation [tree:node ex:nextNode]", regardless of the constraints.
n      Query   Time-predicate (ms)   Time-rule (ms)   HTTP-request-rule   Res-rule
100    Q1      x                     8,892            3                   0
100    Q2      x                     3,541            3                   1
100    Q3      x                     59,274           8                   8,166
1000   Q1      x                     1,171            3                   0
1000   Q2      x                     734              3                   1
1000   Q3      x                     39,987           51                  8,166
Table 1
The predicate-based (-predicate) reachability criterion is not able to execute the queries (x). The rule-based (-rule)
criterion performs better in terms of execution time (Time) with a larger number of fragments, even when
performing more HTTP requests (HTTP-request). Q4 is not depicted because its instances were not able to
terminate before the timeout.
second one, we use our rule-based reachability criterion. As shown in Table 1, no queries
could be answered within the timeout by following every fragment. A possible explanation is the high
number of HTTP requests performed [8], many of which lead to non-relevant data sources. With our rule-based
reachability criterion, the queries executed over the 1000-node fragmentation perform better than
the ones over 100 nodes. The query execution time is reduced by 86% with Q1 and by 79% with Q2
compared to the fragmentation with 100 nodes. With Q3, the reduction is only 33%; this smaller
performance gain might be caused by the sixfold increase in HTTP requests. This is an interesting
observation, because the reduction in execution time is not accompanied by a reduction in HTTP
requests. Previous research has proposed that inefficient query plans might be the bottleneck of some
queries in structured environments [12, 16]. However, our results seem to show that the size of the
internal triple store might have a bigger impact on performance than noted in previous studies. As
large-scale link traversal over the web results in the acquisition of a large number of triples, an
interesting future research direction would be to find ways to remove, from the internal triple store,
triples that are certain not to lead to a query result. Query Q4 could not be answered with any setup,
because it requires a larger number of fragments to be processed than the other queries.
3. Conclusion
This paper reported on preliminary tests to add guided link traversal support to the Comunica
query engine using a rule-based reachability approach. A similar approach could be implemented
in other SPARQL query engines supporting Link Traversal Query Processing. Our preliminary
results show that our rule-based reachability criterion can significantly reduce the execution time of
queries aligned with hypermedia description constraints compared to predicate-based reachability,
opening the possibility for faster and more versatile traversal-based query execution over fragmented
RDF documents. Our experiment also highlights that the size of the internal data store might have
more impact on performance than noted in previous studies. In future work, we will perform more
exhaustive evaluations of other types of domain-oriented fragmentation strategies, such as string-based
and geospatial fragmentations, and investigate how to generalize our approach to support more
expressive online reasoning for source selection during traversal queries. Furthermore, we showed
there might still be room for optimization by researching ways to prune useless triples from the
internal triple store during the link traversal process.
References
[1] I. Ermilov, M. Martin, J. Lehmann, S. Auer, Linked open data statistics: Collection and exploitation,
in: P. Klinov, D. Mouromtsev (Eds.), Knowledge Engineering and the Semantic Web, Springer,
Berlin, Heidelberg, 2013, pp. 242–249.
[2] C. Buil-Aranda, A. Hogan, J. Umbrich, P.-Y. Vandenbussche, SPARQL web-querying infrastructure:
ready for action?, in: Proceedings of the 12th ISWC, ISWC ’13, Springer-Verlag, Berlin, Heidelberg,
2013, pp. 277–293. URL: https://doi.org/10.1007/978-3-642-41338-4_18. doi:10.1007/
978-3-642-41338-4_18.
[3] R. Verborgh, M. V. Sande, O. Hartig, J. V. Herwegen, L. D. Vocht, B. D. Meester, G. Haesendonck,
P. Colpaert, Triple pattern fragments: A low-cost knowledge graph interface for the web, J. Web
Semant. 37-38 (2016) 184–206.
[4] A. Azzam, C. Aebeloe, G. Montoya, I. Keles, A. Polleres, K. Hose, Wisekg: Balanced access to
web knowledge graphs, in: Proceedings of the Web Conference 2021, WWW ’21, Association
for Computing Machinery, New York, NY, USA, 2021, p. 1422–1434. URL: https://doi.org/10.1145/
3442381.3449911. doi:10.1145/3442381.3449911.
[5] P. Colpaert, Building materializable querying interfaces with the tree hypermedia specification
(2022). URL: https://treecg.github.io/paper-materializable-interfaces/.
[6] D. Van Lancker, P. Colpaert, H. Delva, B. Van de Vyvere, J. Rojas Meléndez, R. Dedecker,
P. Michiels, R. Buyle, A. De Craene, R. Verborgh, Publishing base registries as linked data
event streams, Proceedings 21th ICWE (2021). URL: https://raw.githubusercontent.com/ddvlanck/
Publishing-Base-Registries-As-LDES/master/Linked-Data-Event-Streams.pdf.
[7] R. T. Fielding, Architectural Styles and the Design of Network-based Software Architectures, Ph.D.
thesis, University of California,Irvine, 2000.
[8] O. Hartig, M. T. Özsu, Walking without a map: Ranking-based traversal for querying linked data,
in: P. Groth, E. Simperl, A. Gray, M. Sabou, M. Krötzsch, F. Lecue, F. Flöck, Y. Gil (Eds.), The
Semantic Web – ISWC 2016, Springer International Publishing, Cham, 2016, pp. 305–324.
[9] O. Hartig, J. Pérez, LDQL: a query language for the web of linked data, Journal of Web Semantics 41
(2016) 9–29. URL: http://dx.doi.org/10.1016/j.websem.2016.10.001. doi:10.1016/j.websem.2016.
10.001.
[10] B. Bogaerts, B. Ketsman, Y. Zeboudj, H. Aamer, R. Taelman, R. Verborgh, Link traversal with dis-
tributed subweb specifications, in: S. Moschoyiannis, R. Peñaloza, J. Vanthienen, A. Soylu, D. Roman
(Eds.), Proceedings of the 5th International Joint Conference on Rules and Reasoning, volume 12851
of Lecture Notes in Computer Science, Springer, 2021, pp. 62–79. URL: https://www.bartbogaerts.
eu/articles/2021/005-RuleML-GuidedLink-SubwebSpec/SubwebSpecifications.pdf. doi:10.1007/
978-3-030-91167-6_5.
[11] O. Hartig, J.-C. Freytag, Foundations of traversal based query execution over linked data, in:
Conference on Hypertext and Social Media, HT ’12, ACM, New York, NY, USA, 2012, p. 43–52.
URL: https://doi.org/10.1145/2309996.2310005. doi:10.1145/2309996.2310005.
[12] R. Taelman, R. Verborgh, Link traversal query processing over decentralized environments with
structural assumptions, in: Proceedings of the 22nd International Semantic Web Conference, 2023.
URL: https://comunica.github.io/Article-ISWC2023-SolidQuery/.
[13] R. Verborgh, R. Taelman, Guided link-traversal-based query processing, 2020. URL: https://arxiv.
org/abs/2005.02239. doi:10.48550/ARXIV.2005.02239.
[14] R. Taelman, J. Van Herwegen, M. Vander Sande, R. Verborgh, Comunica: a modular SPARQL
query engine for the web, in: Proceedings of the 17th ISWC, 2018. URL: https://comunica.github.io/
Article-ISWC2018-Resource/.
[15] S. Bram, M. De Brouwer, M. Stojchevska, J. Van Der Donckt, J. Nelis, J. Ruyssinck, J. van der
Herten, K. Casier, J. Van Ooteghem, P. Crombez, F. De Turck, S. Van Hoecke, F. Ongenae,
Data Analytics for Health and Connected Care: Ontology, Knowledge Graph and Applications,
in: Proceedings of the 16th EAI Pervasive Healthcare Conference, Springer, 2022. URL:
https://dahcc.idlab.ugent.be.
[16] R. Eschauzier, R. Taelman, R. Verborgh, How does the link queue evolve during traversal-based
query processing?, in: Proceedings of the 7th Workshop on Storing, Querying and Benchmarking
Knowledge Graphs (QuWeDa) at ISWC 2023, 2023, pp. 26–33.