Continuously Updating Query Results
over Real-Time Linked Data
Ruben Taelman, Ruben Verborgh, Pieter Colpaert,
Erik Mannens, and Rik Van de Walle
Data Science Lab (Ghent University - iMinds)
Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium
{firstname.lastname}@ugent.be
Abstract. Existing solutions to query dynamic Linked Data sources extend
the sparql language, and require continuous server processing for each query.
Traditional sparql endpoints accept highly expressive queries, contributing to
high server cost. Extending these endpoints for time-sensitive queries increases the
server cost even further. To make continuous querying over real-time Linked Data
more affordable, we extend the low-cost Triple Pattern Fragments (tpf) interface
with support for time-sensitive queries. In this paper, we discuss a framework on
top of tpf that allows clients to execute sparql queries with continuously updating
results. Our experiments indicate that this extension significantly lowers the server
complexity. The trade-off is an increase in the execution time per query. We prove
that by moving the complexity of continuously evaluating real-time queries over
Linked Data to the clients and thus increasing the bandwidth usage, the cost of
server-side interfaces is significantly reduced. Our results show that this solution
makes real-time querying more scalable in terms of cpu usage for a large number
of concurrent clients when compared to the alternatives.
Keywords: Linked Data, Linked Data Fragments, sparql, continuous querying,
real-time querying
1 Introduction
As the Web of Data is a dynamic dataspace, different results may be returned depending
on when a question is asked. End-users might be interested in seeing the query results
update over time, and would thus have to re-execute the entire query over and over again
(i.e., polling). This is, however, not very practical, especially if it is unknown beforehand
when data will change. An additional problem is that many public (even static) sparql
query endpoints suffer from a low availability [5]. The unrestricted complexity of sparql
queries [15] combined with the public character of sparql endpoints entails a high server
cost, which makes it expensive to host such an interface with high availability. Dynamic
sparql streaming solutions offer combined access to dynamic data streams and static
background data through continuously executing queries. Because of this continuous
querying, the cost for these servers is even higher than with static querying.
In this work, we therefore devise a solution that enables clients to continuously
evaluate queries that do not require high-frequency updates by polling specific fragments of the data. The
resulting framework performs this without the server needing to remember any client
state. Its mechanism requires the server to annotate its data so that the client can efficiently
determine when to retrieve fresh data. The generic approach in this paper is applied to
the use case of public transit route planning. It can be used in various other domains
with continuously updating data, such as smart city dashboards, business intelligence, or
sensor networks.
In the next section, we discuss related research on which our solution will be based.
In Section 3, we present a motivating use case. Section 4 discusses different techniques
to represent dynamic data, after which Section 5 gives an explanation of our proposed
query solution. Next, Section 6 shows an overview of our experimental setup and its
results. Finally, Section 7 discusses the conclusions of this work with further research
opportunities.
2 Related Work
In this section, we first explain techniques to perform rdf annotation, which will be used
to determine freshness. Then, we zoom in on possible representations of temporal data
in rdf. We finish by discussing existing sparql streaming extensions and a low-cost
(static) Linked Data publication technique.
2.1 rdf Annotations
Annotations allow us to attach metadata to triples. We might, for example, want to state
that a triple is only valid within a certain time interval, or that a triple is only valid in a
certain geographical area.
rdf 1.0 [11] allows triple annotation through reification. This mechanism uses the
subject, predicate and object of a triple as objects of dedicated reification predicates,
which allows annotations to be attached to such reified rdf statements. The downside of
this approach is that one triple is now transformed into multiple triples (one each for its
subject, predicate and object), which significantly increases the total number of triples.
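As a minimal sketch of this mechanism (the ex: namespace and the ex:validUntil annotation predicate are hypothetical), reifying a dynamic train-delay triple and annotating it with an expiration time could look as follows:

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/> .

# The original triple  ex:train1 ex:delay "PT60S"  becomes a reified statement:
ex:stmt1 a             rdf:Statement ;
         rdf:subject   ex:train1 ;
         rdf:predicate ex:delay ;
         rdf:object    "PT60S" ;
         ex:validUntil "2015-12-08T10:30:00"^^xsd:dateTime .  # the annotation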
Singleton Properties [14] create unique instances (singletons) of predicates, which
can then be used to further specify that relationship, for example by adding annotations.
New predicate instances are created by relating them to the original predicate through
the sp:singletonPropertyOf predicate. While this approach requires fewer triples than
reification to represent the same information, the original triple is still lost, because its
predicate is replaced in this approach.
With rdf 1.1 [6] came graph support, which allows triples to be encapsulated into
named graphs, which can themselves be annotated. Graph-based annotation requires fewer
triples than both reification and singleton properties when representing the same information:
it only requires the addition of a fourth element to the triple, which transforms it into a quad.
The annotations can then be attached to this fourth element, the graph.
2.2 Temporal data in the rdf model
Regular rdf triples cannot express the time and space in which the facts they describe
are true. In domains where data needs to be represented for certain times or time ranges,
these traditional representations should thus be extended. Time labeling [9] is the process
of annotating triples with their change time, as opposed to making completely new
snapshots of the graph every time a change occurs. Gutierrez et al. made a distinction
between point-based and interval-based labeling, which are interchangeable [8]. The
former states information about an element at a certain time instant, while the latter states
information at all possible times between two time instants.
The authors introduced a temporal vocabulary [8] for the discussed mechanisms,
which will be referred to as tmp in the remainder of this document. Its core predicates are:
tmp:interval This predicate can be used on a subject to make it valid in a certain time
interval. The range of this property is a time interval, which is represented by the
two mandatory properties tmp:initial and tmp:final.
tmp:instant Used on a subject to make it valid at a certain time instant as a point-based
time representation. The range of this property is xsd:dateTime.
tmp:initial and tmp:final The domain of these predicates is a time interval. Their
range is an xsd:dateTime, and they respectively indicate the start and the end of the
interval-based time representation.
In addition to these properties, we also introduce our own predicate tmp:expiration
with range xsd:dateTime, which indicates that the subject is only valid up until the given
time.
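As an illustration of both labeling types (the tmp: namespace uri and the ex: resources are assumed), a fact can be annotated as follows:

```turtle
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix tmp: <http://example.org/tmp#> .  # hypothetical namespace uri
@prefix ex:  <http://example.org/> .

# Interval-based: the subject is valid between two time instants.
ex:departure1 tmp:interval ex:interval1 .
ex:interval1  tmp:initial  "2015-12-08T10:00:00"^^xsd:dateTime ;
              tmp:final    "2015-12-08T11:00:00"^^xsd:dateTime .

# Point-based: the subject is valid up until the given time.
ex:departure2 tmp:expiration "2015-12-08T10:30:00"^^xsd:dateTime .
```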
2.3 sparql Streaming Extensions
Several sparql extensions exist that enable querying over data streams. These data
streams are traditionally represented as a monotonically non-decreasing stream of triples
that are annotated with their timestamp. These extensions require continuous processing [7]
of queries because of the constantly changing data.
c-sparql [4] is an approach to querying over both static and dynamic data. This system
requires the client to register a query with the server in an extended sparql syntax that
allows the use of windows over dynamic data. Clients must perform this query registration [3, 7]
to make sure that the streaming-enabled sparql endpoint can continuously
re-evaluate the query, as opposed to traditional endpoints where a query is evaluated
only once. c-sparql executes queries by combining a regular
sparql engine with a Data Stream Management System (dsms) [2]. The internal model
of c-sparql creates queries that distribute work between the dsms and the sparql engine,
which respectively process the dynamic and the static data.
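For our train use case, a registered c-sparql query could look roughly as follows (the stream uri and prefix are hypothetical, and the exact window syntax may differ between c-sparql versions):

```sparql
REGISTER QUERY TrainDelays AS
PREFIX t: <http://example.org/train#>
SELECT ?id ?delay
FROM STREAM <http://example.org/trains/stream> [RANGE 1m STEP 10s]
WHERE {
  ?id t:delay ?delay .
}
```

The window clause instructs the engine to re-evaluate the query every 10 seconds over the last minute of streamed triples.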
cqels [12] is a “white box” approach, as opposed to “black box” approaches like
c-sparql. This means that cqels natively implements all query operators without
transforming them to another language, removing the overhead of delegating them to another
system. According to previous research [12], cqels performs much better than c-sparql
for large datasets; for simple queries and small datasets the opposite holds.
2.4 Triple Pattern Fragments
Experiments have shown that more than half of public sparql endpoints have an
availability of less than 95% [5]. A possible contributor to this low availability is the
unrestricted complexity of sparql queries [15] combined with the unpredictable user and
query load of sparql endpoints. Clients can send arbitrarily complex sparql queries,
which could form a bottleneck in endpoints. Triple Pattern Fragments (tpf) [17] aim
to solve this issue of high interface cost by moving part of the query processing to the
client, which reduces the server load at the cost of increased query times and bandwidth.
The endpoints are limited to an interface through which only separate triple patterns can
be queried instead of full sparql queries. The client is then responsible for carrying out
the remaining evaluation of complex sparql queries asked by its users.
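For example, a tpf client evaluating the query of Listing 1.1 would request one fragment per triple pattern and join the bindings locally; a fragment uri typically encodes the pattern in its query string (the dataset uri below is hypothetical):

```sparql
# Pattern:   ?id t:delay ?delay .
# Fragment:  http://data.example.org/trains?predicate=t%3Adelay
# Pattern:   ?id t:platform ?platform .
# Fragment:  http://data.example.org/trains?predicate=t%3Aplatform
```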
3 Use Case
A guiding use case, based on public transport, will be referred to in the remainder of this
paper. When public transport route planning applications return dynamic data, they can
account for factors such as train delays as part of a continuously updating route plan. In
this use case, different clients need to obtain all train departure information for a certain
station. A static sparql query can be used to retrieve all information using this basic
data model; this query can be found in Listing 1.1. The first two triple patterns refer to
triples that contain dynamic data.
SELECT ?delay ?platform ?headSign ?routeLabel ?departureTime
WHERE {
_:id t:delay ?delay.
_:id t:platform ?platform.
_:id t:departureTime ?departureTime.
_:id t:headSign ?headSign.
_:id t:routeLabel ?routeLabel.
FILTER (?departureTime > "2015-12-08T10:20:00"^^xsd:dateTime).
FILTER (?departureTime < "2015-12-08T11:20:00"^^xsd:dateTime).
}
Listing 1.1: The basic sparql query for retrieving all upcoming train departure information in a
certain station. The first two triple patterns are dynamic, the last three are static.
4 Dynamic Data Representation
Our solution consists of a partial redistribution of query evaluation workload from the
server to the client, which requires the client to be able to access the server data. There
needs to be a distinction between regular static data and continuously updating dynamic
data in the server’s dataset. For this, we chose to define a certain temporal range in
which these dynamic facts are valid; as a consequence, the client knows when the data
becomes invalid and has to fetch new data to remain up-to-date. To capture the temporal
scope of data triples, we annotate this data with time. In this section, we discuss two
different types of time labeling, and different methods to annotate this data.
4.1 Time Labeling Types
We use interval-based labeling to indicate the start and end of the period during
which triples are valid. Point-based labeling is used to indicate expiration times.
With expiration times, we only save the latest version of a given fact in a dataset,
assuming that the old version can be removed when a newer one arrives. These expiration
times provide enough information to determine when a certain fact becomes invalid
in time. We use time intervals for storing multiple versions of the same fact, i.e., for
maintaining a history of facts. These time intervals must indicate a start and an end time
to make it possible to distinguish between different versions of a certain fact. These
intervals cannot overlap in time for the same fact.
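For example, two consecutive versions of the same fact (represented here by the hypothetical resources ex:delayV1 and ex:delayV2, which stand in for the annotated statements) could be stored with non-overlapping intervals:

```turtle
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix tmp: <http://example.org/tmp#> .  # hypothetical namespace uri
@prefix ex:  <http://example.org/> .

# Version 1: valid from 10:00 to 10:10.
ex:delayV1 tmp:interval [ tmp:initial "2015-12-08T10:00:00"^^xsd:dateTime ;
                          tmp:final   "2015-12-08T10:10:00"^^xsd:dateTime ] .
# Version 2: valid from 10:10 to 10:20; the intervals do not overlap.
ex:delayV2 tmp:interval [ tmp:initial "2015-12-08T10:10:00"^^xsd:dateTime ;
                          tmp:final   "2015-12-08T10:20:00"^^xsd:dateTime ] .
```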
4.2 Methods for Time Annotation
The two time labeling types introduced in the previous section can be applied to triples in
different ways. In Section 2.1 we discussed several methods for rdf annotation. We will
apply time labels to triples using the singleton properties, graphs and implicit graphs
annotation techniques.
Singleton Properties Singleton property annotation is done by creating a singleton
property for the predicate of each dynamic triple. Each of these singleton properties can
then be annotated with its time annotation, being either a time interval or an expiration time.
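A sketch of this method for a dynamic delay triple (the sp:, tmp: and ex: namespace uris are assumptions):

```turtle
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix sp:  <http://example.org/singleton#> .  # hypothetical namespace uri
@prefix tmp: <http://example.org/tmp#> .        # hypothetical namespace uri
@prefix ex:  <http://example.org/> .

# A unique singleton instance of ex:delay carries the time annotation.
ex:train1  ex:delay-1 "PT60S" .
ex:delay-1 sp:singletonPropertyOf ex:delay ;
           tmp:expiration "2015-12-08T10:30:00"^^xsd:dateTime .
```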
Graphs To time-annotate triples using graphs, we can encapsulate triples inside contexts,
and annotate each context graph with a time annotation.
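In TriG syntax, this could be sketched as follows (the graph and resource uris are hypothetical):

```trig
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix tmp: <http://example.org/tmp#> .  # hypothetical namespace uri
@prefix ex:  <http://example.org/> .

# The dynamic triple is encapsulated in a named graph ...
ex:graph1 {
  ex:train1 ex:delay "PT60S" .
}
# ... and the graph itself is annotated with an expiration time.
ex:graph1 tmp:expiration "2015-12-08T10:30:00"^^xsd:dateTime .
```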
Implicit Graphs A tpf interface gives a unique uri to each fragment corresponding
to a triple pattern, including patterns without variables, i.e., actual triples. Since Triple
Pattern Fragments [17] are the basis of our solution, we can interpret each fragment
as a graph. We will refer to these as implicit graphs. The fragment uri can then be used as
the graph identifier of the triple for attaching time information. For example, the uri for the
triple