<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/3442381.3449911</article-id>
      <title-group>
        <article-title>Optimizing Traversal Queries of Sensor Data Using a Rule-Based Reachability Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bryan-Elliott Tam</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruben Taelman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julián Rojas Meléndez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pieter Colpaert</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IDLab, Department of Electronics and Information Systems, Ghent University - imec</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>12851</volume>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>Link Traversal queries face challenges in completeness and execution time due to the size of the Web. Reachability criteria define completeness by restricting the links followed by engines. However, the number of links to dereference remains the bottleneck of the approach. Web environments often have structures that query engines can exploit to prune irrelevant sources. Current criteria rely on information from the query definition and predefined predicates. However, it is difficult to use them to traverse environments where logical expressions indicate the location of resources. We propose a rule-based reachability criterion that captures logical statements expressed in hypermedia descriptions within linked data documents to prune irrelevant sources. In this poster paper, we show how the Comunica link traversal engine is modified to take hints from a hypermedia control vocabulary to prune irrelevant sources. Our preliminary findings show that, using this strategy, the query engine can significantly reduce the number of HTTP requests and the query execution time without sacrificing the completeness of results. Our work shows that the investigation of hypermedia controls for link pruning in traversal queries is a worthwhile effort for optimizing web queries over unindexed decentralized databases.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked data</kwd>
        <kwd>Link Traversal Query Processing</kwd>
        <kwd>Fragmented database</kwd>
        <kwd>Decentralized environments</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The increasing amount of available Linked Data on the Web [1] prompts the need for efficient query
interfaces. During a typical query execution in a SPARQL endpoint, the endpoint takes the whole
query load and delivers the results to the client. This paradigm can lead to high workloads, which
are partly responsible for the historically low availability of SPARQL endpoints [2]. Researchers and
practitioners have made efforts to introduce alternative Linked Data publication methods that enable
clients to participate in the query execution process [3]. The goal of those methods is to lower server-side
workloads while keeping query execution fast for the client [4]. The TREE hypermedia specification is an
effort in that direction [5, 6]: it introduces the concept of domain-oriented fragmentation of large RDF
datasets. For example, in the case of periodic measurements of sensor data, a fragmentation can be made
on the publication date of each data entity. A fragment can be considered an RDF document published on
a server. TREE aims to describe dataset fragmentation in ways that enable clients to easily fetch
query-relevant subsets. The data within a fragment are bound by constraints expressed through hypermedia
descriptions [7]. Each fragment contains relations to other pages, and those relations carry the
constraints on the data of every reachable fragment. In this paper, we refer to those constraints as
domain-specific expressions. They can be expressions such as ? &gt; 2022-09-01T00:00:00.000000 =⇒
ex:afterFirstSeptember, given that ? is the date of publication of sensor data and the implication pertains
to the location of the data respecting the constraint. In English, the expression means “the data produced
by the sensors after the first of September are stored at ex:afterFirstSeptember.” Because of the
hyperlinked nature of the document network, clients must traverse the documents to find the relevant data
to answer their queries. We propose to use Link Traversal Query Processing (LTQP) [8] as a query
mechanism to perform those queries.</p>
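      <p>The date-based fragmentation described above can be sketched as follows. This is a minimal illustration, assuming a single split point and hypothetical fragment IRIs (ex:beforeFirstSeptember, ex:afterFirstSeptember); the actual layout of a TREE dataset is defined by its publisher.</p>

```python
from datetime import datetime

# Hypothetical split point of the fragmentation; the fragment IRIs below
# are illustrative, mirroring "the data produced by the sensors after the
# first of September are stored at ex:afterFirstSeptember".
SPLIT = datetime(2022, 9, 1)

def fragment_for(timestamp: datetime) -> str:
    """Return the (illustrative) IRI of the fragment holding a measurement."""
    if timestamp > SPLIT:
        return "ex:afterFirstSeptember"
    return "ex:beforeFirstSeptember"

# Assign each sensor measurement (timestamp, value) to its fragment.
measurements = [
    (datetime(2022, 1, 3, 10, 57, 54), 21.5),
    (datetime(2022, 10, 2, 8, 0, 0), 19.0),
]
fragments = {}
for ts, value in measurements:
    fragments.setdefault(fragment_for(ts), []).append((ts, value))
```

      <p>A client that knows the constraint of each fragment can fetch only the fragments whose constraints its query can satisfy, which is the intuition behind the pruning approach of this paper.</p>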
      <p>LTQP starts by dereferencing a set of user-provided URLs [8]. From these dereferenced documents,
links to other documents are dereferenced recursively, and the retrieved triples are inserted into an
internal data store. LDQL [9] is a theoretical query language to define the traversal of LTQP queries.
However, LDQL is centered around nested regular expressions and is therefore not made to express the
traversal of links based on domain-specific expressions such as time relations. The subweb specifications
language (SWSL) [10] allows data providers to define traversal paths concerning the information they
publish. Thus, given that the query engine trusts the data publisher, it can adapt its traversal to follow
the paths given by the specification. Akin to LDQL, it is difficult with SWSL to express traversal using
domain-specific expressions, because its syntax is centered around the matching of triple patterns rather
than reasoning rules or the evaluation of literals. Furthermore, unlike LDQL, SWSL does not propose a
mechanism for using the query or input from the user to influence the source selection process. Given
those limitations, we propose to return to the more abstract concept of reachability criteria [11] to define
a mechanism of traversal centered around rules.</p>
      <p>In this paper, we propose to use a boolean solver as the main link pruning mechanism for a reachability
criterion to traverse TREE documents. The logical operators are defined by the TREE specification
(https://treecg.github.io/specification/). As a concrete use case, we consider the publication of (historical)
sensor data. An example query is presented in Figure 1, along with the triples representing the link
between two documents expressed using the TREE specification.</p>
      <p>Figure 1: Example query over sensor data.</p>
      <preformat>PREFIX xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt;
PREFIX saref: &lt;https://saref.etsi.org/core/&gt;
PREFIX dahcc: &lt;https://dahcc.idlab.ugent.be/Ontology/Sensors/&gt;

SELECT * WHERE {
  ?s saref:hasTimestamp ?t;
    saref:hasValue ?result;
    saref:measurementMadeBy ?sensor.
  ?sensor dahcc:analyseStateOf ?stateOf;
    saref:measuresProperty {:property}.
  FILTER(?t="2022-01-03T10:57:54.000000"^^xsd:dateTime)
}</preformat>
    </sec>
    <sec id="sec-2">
      <title>2. A Rule-Based Reachability Criterion</title>
      <p>Most research on LTQP is centered around query execution in Linked Open Data environments. Given
the pseudo-infinite number of documents on the Web, traversing all documents is practically
infeasible. To define completeness, different reachability criteria [11] were introduced to allow the
discrimination of links. Recently, an alternative direction was explored in which the query engine uses
the structure provided by the data publisher to guide itself towards relevant data sources [12, 13].</p>
      <p>We define our approach as a rule-based reachability criterion. Our approach builds upon the concept
of structural assumptions [12] to exploit the structural properties of TREE-annotated datasets. We
therefore interpret the hypermedia descriptions of constraints in TREE fragments as boolean expressions
(? &gt;= 2022-01-03T09:47:59.000000 in Figure 1). Upon discovery of a document, the query engine
gathers the relevant triples to form the boolean expression of the constraint on the data of reachable
fragments. After the parsing of the expression, the filter expression of the SPARQL query is pushed
down into the engine’s source selection component. The source selection component can be formalized
as a reachability criterion (we use a simplified formalization to illustrate the source selection mechanism
and to avoid introducing concepts unnecessary for the aim of this poster paper):</p>
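      <p>The transformation of a TREE relation into a boolean expression can be sketched as follows. This is a simplified, hypothetical representation: a real engine extracts the relation type, value, and path from the RDF triples of the fragment, and the TREE vocabulary defines more relation types than the three shown here.</p>

```python
import operator
from datetime import datetime
from typing import Callable

# Map (a subset of) TREE relation types to comparison operators.
COMPARATORS = {
    "tree:GreaterThanOrEqualToRelation": operator.ge,
    "tree:LessThanRelation": operator.lt,
    "tree:EqualToRelation": operator.eq,
}

def relation_to_expression(relation_type: str, value: datetime) -> Callable[[datetime], bool]:
    """Turn a parsed TREE relation into a boolean expression E over the
    variable targeted by the fragmentation path."""
    comparator = COMPARATORS[relation_type]
    return lambda x: comparator(x, value)

# E corresponds to the constraint "? >= 2022-01-03T09:47:59" of Figure 1.
E = relation_to_expression(
    "tree:GreaterThanOrEqualToRelation",
    datetime(2022, 1, 3, 9, 47, 59),
)
```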
      <p>
        c(l) → {true, false} (1)
where c is the reachability criterion and, when c(l) returns true, the target IRI l must be dereferenced.
Finally, the two boolean expressions are evaluated to determine their satisfiability. It can be formalized
as determining whether
        c(l) = ∃x | (E(x) ∧ F(x)) = true (2)
holds, where E is the boolean expression of the TREE relation, F is the filter expression of the SPARQL
query, x is the variable targeted by the fragmentation path, and l is the link towards the next fragment
(&lt;nextNode&gt; from “&lt;&gt; tree:node &lt;nextNode&gt;” in Figure 1). The targeted variable is given
by the RDF object ?target of the triple defining the fragmentation path, of the form “?s tree:path
?target” (saref:hasTimestamp in Figure 1). Upon satisfaction, the IRI targeting the next fragment is
added to the link queue; otherwise, the IRI is pruned. The process is schematized in Figure 2.
      </p>
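      <p>For the equality filter of Figure 1, the satisfiability test does not require a general-purpose boolean solver: since F(x) holds only at the filtered value, it suffices to evaluate E at that value. The sketch below illustrates this special case with hypothetical names; richer filter expressions would require an actual solving step.</p>

```python
from datetime import datetime
from typing import Callable

Expr = Callable[[datetime], bool]

def reachable(E: Expr, filter_value: datetime) -> bool:
    """Reachability criterion c(l) for an equality FILTER.

    With F(x) defined as x == filter_value, checking whether some x
    satisfies both E(x) and F(x) reduces to evaluating E(filter_value)."""
    return E(filter_value)

# E encodes the relation "?t >= 2022-01-03T09:47:59" guarding the link
# towards the next fragment; the filter value comes from FILTER(?t = ...).
E: Expr = lambda x: x >= datetime(2022, 1, 3, 9, 47, 59)
follow_link = reachable(E, datetime(2022, 1, 3, 10, 57, 54))
```

      <p>When reachable returns true, the link is added to the link queue; otherwise it is pruned.</p>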
      <p>Figure 2: Schematization of the rule-based pruning process. (1) The query engine dereferences a
tree:Node. (2) It transforms the tree:relation(s), such as saref:hasTimestamp &lt; 2022-01-03T09:47:59Z
and saref:hasTimestamp &gt;= 2022-01-03T09:47:59Z, into boolean expressions E1 and E2. (3) It collects
the filter expression segments F1 and F2 related to E1 and E2. (4) It evaluates the satisfiability of each
E and F pair. (5) It prunes the links related to unsatisfiable E and F couples.</p>
      <sec id="sec-2-1">
        <title>2.1. Preliminary Results</title>
        <p>We implemented our approach using the query engine Comunica [14]. For the evaluation, we
executed four queries similar to the one in Figure 1 (the implementation, the queries, and the evaluation
are available at https://github.com/constraintAutomaton/comunica-feature-link-traversal/tree/feature/time-filtering-tree-sparqleeimplementation
and https://github.com/TREEcg/TREE-Guided-Link-Traversal-Query-Processing-Evaluation/tree/main).
They were executed over the DAHCC participant 31 dataset [15] (487 MB) with a timeout of two minutes.
We fragmented the dataset according to the TREE specification, using a B-tree topology with a depth of 1
and either 100 or 1000 nodes.</p>
        <p>The queries were executed using two configurations. In the first configuration, we use a
predicate-based reachability criterion where the engine follows each link of the fragmented dataset
(the query engine always follows ex:nextNode from expressions following the schema "ex:currentNode
tree:relation [tree:node ex:nextNode]", regardless of the constraints). For the second one, we use our
rule-based reachability criterion. As shown in Table 1, no queries could be answered within the timeout
when following every fragment. A possible explanation is the high number of HTTP requests performed
[8], leading to non-relevant data sources. With our rule-based reachability criterion, the queries executed
over the 1000-node fragmentation perform better than the ones over 100 nodes: the query execution
time is reduced by 86% for Q1 and by 79% for Q2 compared to the 100-node fragmentation. For Q3, the
reduction is only 33%; this smaller performance gain might be caused by a six-fold increase in HTTP
requests. This is an interesting observation, because a reduction in HTTP requests does not always come
with a proportional reduction in execution time. Previous research has proposed that inefficient query
plans might be the bottleneck of some queries in structured environments [12, 16]. However, our results
seem to show that the size of the internal triple store might have a bigger impact on performance than
noted in previous studies. As large-scale link traversal over the Web results in the acquisition of a large
number of triples, an interesting future research direction would be to find ways to remove, from the
internal triple store, triples that are certain not to lead to a query result. Query Q4 could not be answered
with any setup, because it requires processing a larger number of fragments than the other queries.</p>
      </sec>
    </sec>
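      <p>The overall pruning process of Section 2 (dereference, transform, evaluate, prune) could be orchestrated roughly as below. The in-memory web of fragments, the link queue, and the relation representation are simplified stand-ins for the actual components of a link traversal engine such as Comunica, and the fragment IRIs are hypothetical.</p>

```python
import operator
from collections import deque
from datetime import datetime

CUTOFF = datetime(2022, 1, 3, 9, 47, 59)

# Illustrative web of TREE fragments: IRI -> (data, relations).
# A relation pairs a target IRI with a boolean expression E over the
# fragmentation-path variable; all names here are hypothetical.
WEB = {
    "ex:root": ([], [
        ("ex:before", lambda x: operator.lt(x, CUTOFF)),
        ("ex:after", lambda x: operator.ge(x, CUTOFF)),
    ]),
    "ex:before": (["older measurements"], []),
    "ex:after": (["newer measurements"], []),
}

def traverse(seed, filter_value):
    """Dereference fragments, pruning links whose expression E cannot be
    satisfied together with the query's equality filter F."""
    queue, seen, store = deque([seed]), set(), []
    while queue:
        iri = queue.popleft()
        if iri in seen:
            continue
        seen.add(iri)
        data, relations = WEB[iri]   # stands in for an HTTP dereference
        store.extend(data)
        for target, E in relations:
            if E(filter_value):      # satisfiable: keep the link
                queue.append(target)
            # otherwise the link is pruned from the traversal
    return store

store = traverse("ex:root", datetime(2022, 1, 3, 10, 57, 54))
```

      <p>In this sketch, only ex:after is dereferenced for the filter value of Figure 1; ex:before is pruned, saving one request.</p>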
    <sec id="sec-3">
      <title>3. Conclusion</title>
      <p>This paper reported on preliminary tests adding guided link traversal support to the Comunica
query engine using a rule-based reachability approach. A similar approach could be implemented
in other SPARQL query engines supporting Link Traversal Query Processing. Our preliminary
results show that our rule-based reachability criterion can significantly reduce the execution time of
queries aligned with hypermedia description constraints compared to predicate-based reachability,
opening the possibility of faster and more versatile traversal-based query execution over fragmented
RDF documents. Our experiment also highlights that the size of the internal data store might have
more impact on performance than noted in previous studies. In future work, we will perform more
exhaustive evaluations of other types of domain-oriented fragmentation strategies, such as string-based
and geospatial fragmentations, and investigate how to generalize our approach to support more
expressive online reasoning for source selection during traversal queries. Furthermore, we showed
there might still be room for optimization by researching ways to prune useless triples from the
internal triple store during the link traversal process.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I.</given-names>
            <surname>Ermilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <article-title>Linked open data statistics: Collection and exploitation</article-title>
          , in: P.
          <string-name>
            <surname>Klinov</surname>
          </string-name>
          , D. Mouromtsev (Eds.),
          <source>Knowledge Engineering and the Semantic Web</source>
          , Springer, Berlin, Heidelberg,
          <year>2013</year>
          , pp.
          <fpage>242</fpage>
          -
          <lpage>249</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>