=Paper=
{{Paper
|id=Vol-1870/paper-01
|storemode=property
|title=SLD Revolution: A Cheaper, Faster yet more Accurate Streaming Linked Data Framework
|pdfUrl=https://ceur-ws.org/Vol-1870/paper-01.pdf
|volume=Vol-1870
|authors=Marco Balduini,Emanuele Della Valle,Riccardo Tommasini
|dblpUrl=https://dblp.org/rec/conf/esws/BalduiniV017
}}
==SLD Revolution: A Cheaper, Faster yet more Accurate Streaming Linked Data Framework==
Marco Balduini, Emanuele Della Valle, Riccardo Tommasini
DEIB, Politecnico of Milano, Milano, Italy
Abstract. RDF Stream Processing (RSP) is gaining momentum. The RDF stream
data model is progressively being adopted, and many SPARQL extensions for continuous
querying are converging into a unified RSP query language. However, the RSP
community still has to investigate when transforming data streams into RDF streams
pays off. In this paper, we report on several experiments on a revolutionized version
of our Streaming Linked Data framework (namely, SLD Revolution). SLD Revolution
i) adopts Generic Programming, i.e., it operates on time-stamped generic
data items, and ii) applies a lazy-transformation approach, i.e., it postpones the
RDF stream transformation until it can benefit from it, processing data according
to their nature (event-, tuple-, tree- and graph-based). SLD Revolution proves to
be a cheaper (it uses less memory and has a smaller CPU load), faster (it reaches
a higher maximum input throughput) yet more accurate (it provides a smaller error
rate in the results) solution than its ancestor SLD.
1 Introduction
RDF Stream Processing (RSP) is gaining momentum. The RSP W3C community group¹
has just reached 100 members. It is actively working on a report that will present the
RDF Stream data model, the possible RDF Stream serializations and the syntax and
semantics of the RSP-QL query language.
We started using RSP in 2011, when we won the Semantic Web Challenge with
Bottari [1]. A key ingredient of our solution was the Streaming Linked Data framework
(SLD) [2]. SLD is a middleware that extends RSP engines with adapters, decorators
and publishers. All these components observe and push RDF streams into a central RDF
stream bus. The adapters ingest any kind of external data, transform them
into RDF streams modeled as time-stamped RDF graphs, and push them into the bus.
A network of C-SPARQL queries [3] analyzes the RDF streams in the bus, elaborates
them, and pushes the results into other internal streams. Decorators can semantically enrich
RDF streams using user-defined functions (e.g., in Bottari, to add the opinion that a social
media user expresses about a given named entity in a micro-post). Publishers push data
to the bus encoded in the Streaming Linked Data format [4].
SLD is currently a key component of SocialOmeter², a commercial solution for
monitoring topics on social media. Interested readers may take a look at the deployments for
the Italian edition of the popular TV show Masterchef, the MTV week 2016 in Milano
or the March 2015 eclipse in Europe. Those deployments process in real time up to 3000
microposts per minute on a 50 €/month machine in the cloud (8 GB of RAM and 4
cores), providing semantic analysis and sophisticated visualizations.
¹ https://www.w3.org/community/rsp/
² http://www.socialometers.com/
In five years of SLD usage, we learned that using RDF streams is valuable when
i) data are naturally represented as graphs, i.e., micro-posts in the larger social graph,
and when ii) the availability of popular vocabularies makes it easy to write adapters that
semantically annotate the incoming data, e.g., we wrote adapters that annotated streams
from the major social networks using SIOC [5].
However, we have also found out several weaknesses of the approach:
– RDF streams cannot be found in the wild, yet. JSON is largely used in practice (e.g.,
Twitter Streaming APIs³ and the W3C Activity Streams 2.0 working draft⁴).
– The results of C-SPARQL queries are often relational, and forcing them into an RDF
stream is not natural, i.e., a user would naturally use the REGISTER QUERY ...
AS SELECT ... form instead of the REGISTER STREAM ... AS CONSTRUCT ... one.
It takes three triples to state how many times a hashtag appears in the micro-posts
observed in 1 minute, while the tuple ⟨timestamp, hashtag, count⟩ is more succinct.
– It is harder to express some computations using C-SPARQL over RDF streams and
graphs than to write a path expression over JSON, an SQL query over
relations, or an EPL statement over events.
– SLD builds on the C-SPARQL engine and, thus, shares with it some shortcomings,
i.e., it gives incorrect answers when the engine is overloaded [6, 7].
In this paper, we challenge the hypothesis that RDF streams should play such a
central role in SLD. We investigate if i) using time-stamped generic data items (instead
of focusing only on time-stamped RDF graphs) and ii) processing them according to their
event-, tuple-, tree- and graph-based nature offer the opportunity to engineer a cheaper
(it uses less memory and CPU), faster (it reaches a higher maximum input throughput) yet
more accurate (i.e., with a smaller error in the results) version of SLD that we called
SLD Revolution. We bring experimental evidence that supports the design decision
of revolutionizing SLD in those two directions. Using our experience in social media
monitoring, we design a set of experiments to evaluate our hypothesis: we chose the
expected maximum rate of micro-posts per minute and the machine to deal with the
worst-case scenario; by reducing the memory and processor time that the system requires,
we push the overload status to a higher input rate.
The remainder of the paper is organized as follows. Section 2 introduces the state of
the art in stream processing with a special focus on RSP. Section 3 presents SLD Revolution
and its new processing model. Sections 4 and 5, respectively, describe the settings and
the results of the experiments we ran. Finally, in Section 6, we conclude and present our
future work.
2 State of the art
Data model. RSP extends the RDF data model and the SPARQL query model in order
to take into account the streaming nature of the data.
³ https://dev.twitter.com/streaming/overview
⁴ http://www.w3.org/TR/activitystreams-core/
Fig. 1. (a) The CQL model, cf. [8], and (b) its adaptation for RDF Stream Processing
A relational data stream [9] $S$ is defined as an unbounded sequence of time-stamped
data items $(d_i, t_i)$: $S = (d_1, t_1), (d_2, t_2), \ldots, (d_n, t_n), \ldots$, where $d_i$ is a relation
and $t_i \in \mathbb{N}$ is the associated time instant. Different approaches constrain $t_i$ so that it
holds either $t_i \le t_{i+1}$, i.e., stream items are in non-decreasing time order, or $t_i < t_{i+1}$,
i.e., stream items are in strictly increasing time order.
An RDF Stream is defined in the same way, but $d_i$ is either an RDF statement
(as done in most of the RSP approaches [3, 10–12]) or an RDF graph (as done in
SLD [2] and proposed by the RSP W3C community). An RDF statement is a triple
$(s, p, o) \in (I \cup B) \times I \times (I \cup B \cup L)$, where $I$ is the set of IRIs, $B$ is the set of blank
nodes and $L$ is the set of literals. An RDF graph is a set of RDF statements.
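The two ordering constraints and the two kinds of RDF stream items can be illustrated with a minimal Python sketch. The class name Item, the helper functions and the prefixed identifiers are assumptions made for illustration; they are not part of SLD or of any RSP engine.

```python
from dataclasses import dataclass
from typing import Any, Sequence


@dataclass(frozen=True)
class Item:
    """One time-stamped data item (d_i, t_i) of a stream."""
    data: Any        # d_i: a relation, an RDF statement, or an RDF graph
    timestamp: int   # t_i: a time instant, modeled as a natural number


def is_non_decreasing(stream: Sequence[Item]) -> bool:
    """True if t_i <= t_{i+1} for every pair of consecutive items."""
    return all(a.timestamp <= b.timestamp for a, b in zip(stream, stream[1:]))


def is_strictly_increasing(stream: Sequence[Item]) -> bool:
    """True if t_i < t_{i+1} for every pair of consecutive items."""
    return all(a.timestamp < b.timestamp for a, b in zip(stream, stream[1:]))


# An RDF statement is a triple (s, p, o); an RDF graph is a set of statements.
# The identifiers below are illustrative only.
graph = frozenset({
    ("post:2", "sioc:topic", "tag:3"),
    ("tag:3", "rdfs:label", "socialmedia"),
})

# Statement-based stream: two statements may share the same time instant.
statement_stream = [
    Item(("post:2", "sioc:topic", "tag:3"), 1),
    Item(("tag:3", "rdfs:label", "socialmedia"), 1),
]

# Graph-based stream: each item carries a whole graph.
graph_stream = [Item(graph, 1), Item(graph, 2)]
```

In this sketch the statement-based stream satisfies only the non-decreasing constraint, while the graph-based stream also satisfies the strict one.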
Processing Model. The processing model of RSP [3, 10, 11] also inherits from the
work done in the database community. In particular, it is inspired by the CQL stream
processing model (proposed by the DB group of Stanford University [8]), which
defines three classes of operators (Figure 1.a):
– stream-to-relation operators transform streams into relations. Since a
stream is a potentially infinite bag of time-stamped data items, those operators
extract finite bags of data, enabling query answering. One of the most studied
operators of this class is the sliding window, which chunks the incoming streams into
portions of length ω and slides by a length β.
– relation-to-relation operators transform relations into other relations. Relational
algebraic expressions are well-known cases of this class of operators.
– relation-to-stream operators are optional and allow outputting the results as a part
of a stream. Alternatively, a time-varying relation is provided.
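The three operator classes can be sketched in a few lines of Python. This is a simplified illustration, not CQL itself: it assumes that windows open at multiples of β starting from time 0 and cover the half-open interval [open, open + ω), and all function names are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Item = Tuple[str, int]  # (data item, timestamp)


def sliding_windows(stream: List[Item], omega: int, beta: int,
                    t_end: int) -> Iterator[Tuple[int, int, List[str]]]:
    """S2R: yield (open, close, content) for each window [open, open + omega)."""
    start = 0
    while start + omega <= t_end:
        content = [d for d, t in stream if start <= t < start + omega]
        yield start, start + omega, content
        start += beta  # the window slides by beta


def count(relation: List[str]) -> int:
    """R2R: a relational aggregate over one finite bag."""
    return len(relation)


def to_stream(results: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """R2S: stamp each result with the closing time of its window."""
    return [(value, close) for close, value in results]


stream = [("a", 1), ("b", 3), ("c", 4), ("d", 7)]
windows = list(sliding_windows(stream, omega=4, beta=2, t_end=8))
counts = to_stream((close, count(content)) for _, close, content in windows)
```

With ω = 4 and β = 2 the windows overlap; setting β = ω would give a tumbling window.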
Figure 1.b presents the CQL model adapted to the RSP case. The stream and the
relation concepts are mapped to RDF streams and to sets of mappings (using the SPARQL
algebra terminology), respectively. To highlight the similarity of the RSP operators [12]
to the CQL ones, similar names are used: S2R, R2R and R2S indicate the operators
respectively analogous to the stream-to-relation, relation-to-relation and relation-to-stream
operators.
RSP Middlewares. In order to ease the task of deploying RSP engines in real-world
applications, three middleware systems were designed: the Linked Stream Middleware
[13], a semantically enabled service architecture for mashups over streaming and stored
data [14] and our SLD [2].
The three approaches fulfill similar requirements for the end user. They offer extensible
means for real-time data collection, for publishing and querying collected
information as Linked Data, and for visualizing data and query results. They differ in the
approach. Our SLD and the Linked Stream Middleware both take a data-driven approach,
but they address the non-functional requirements in different ways: while SLD is an
in-memory solution for stream processing of RDF streams with limited support for static
information, the Linked Stream Middleware is a cloud-based infrastructure to integrate
time-dependent data with other Linked Data sources. The middleware described in [14],
instead, takes a service-oriented approach; thus, it also includes service discovery and
service composition among its features.
Fig. 2. The architecture of the Streaming Linked Data framework.
Figure 2 illustrates the architecture of SLD, which offers: (i) a set of adapters that
transform heterogeneous data streams into RDF streams, attaching to each received
element a time-stamp that identifies the ingestion time (e.g., a stream of microposts
in JSON as an RDF stream using the SIOC vocabulary [5] or a stream of weather
sensor observations in XML using the Semantic Sensor Network vocabulary [15]); (ii) a
publish/subscribe bus to internally manage RDF streams; (iii) some facilities to record
and replay RDF streams; (iv) a set of user-defined components to decorate an RDF stream
(e.g., adding sentiment annotations to microposts); (v) a wrapper for the C-SPARQL
Engine [3] that allows the creation of networks of C-SPARQL queries; and (vi) a linked data
server to publish results following the Streaming Linked Data Format [4].
3 SLD Revolution and its Processing Model
SLD Revolution adopts generic programming [16], where continuous processing operators
are expressed in terms of types to-be-specified-later. This idea, which was pioneered
in [17], can be adapted to stream processing by choosing to model the elements $d_i$ in
the stream $S$ as time-stamped generic data items that are instantiated when needed for
specific types provided as parameters.
Figure 3 illustrates the SLD Revolution processing model. The stream and the relation
concepts of CQL are mapped to generic data streams $S\langle T \rangle$ and to instantaneous
generic data items $I\langle T \rangle$.
Fig. 3. The processing model of the SLD Revolution framework. [Diagram: generic streams and generic instantaneous items, connected by the S2I, I2I and I2S operators.]
In line with CQL and RSP-QL, SLD Revolution proposes three classes of operators:
– The stream-to-instantaneous S2I$\langle T \rangle$ operators transform the infinite generic
data stream $S\langle T \rangle$ into a finite bag of instantaneous generic data items $I\langle T \rangle$.
– The instantaneous-to-instantaneous I2I$\langle T, T' \rangle$ operators transform instantaneous
generic data items $I\langle T \rangle$ into other instantaneous generic data items
$I\langle T' \rangle$, where $T$ and $T'$ can be of the same type or of different types. For instance, a
C-SPARQL query of the type REGISTER QUERY ... AS SELECT ... takes in input
time-stamped RDF graphs and generates as output time-stamped relations.
– The instantaneous-to-stream I2S$\langle T \rangle$ operators transform instantaneous generic
data items $I\langle T \rangle$ into a generic data stream $S\langle T \rangle$.
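As an illustration of how the three operator classes can be typed generically, the following Python sketch uses type variables for the payload. The names Stamped, s2i, i2i and i2s are hypothetical and do not reflect the actual SLD Revolution implementation; the payloads here are small JSON strings (tree-based) turned into integers (tuple-like values).

```python
import json
from dataclasses import dataclass
from typing import Callable, Generic, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


@dataclass(frozen=True)
class Stamped(Generic[T]):
    """A time-stamped generic data item with a payload of type T."""
    payload: T
    timestamp: int


def s2i(stream: Iterable[Stamped[T]], start: int, end: int) -> List[Stamped[T]]:
    """S2I<T>: select the finite bag of items with start <= timestamp < end."""
    return [x for x in stream if start <= x.timestamp < end]


def i2i(items: List[Stamped[T]], f: Callable[[T], U]) -> List[Stamped[U]]:
    """I2I<T, T'>: transform each payload, possibly changing its type."""
    return [Stamped(f(x.payload), x.timestamp) for x in items]


def i2s(items: List[Stamped[T]]) -> Iterator[Stamped[T]]:
    """I2S<T>: emit instantaneous items as a stream, in time-stamp order."""
    yield from sorted(items, key=lambda x: x.timestamp)


stream = [
    Stamped('{"totalItems": 2}', 5),
    Stamped('{"totalItems": 3}', 65),
    Stamped('{"totalItems": 1}', 70),
]
window = s2i(stream, 60, 120)                                   # Stamped[str]
counts = i2i(window, lambda raw: json.loads(raw)["totalItems"])  # Stamped[int]
out = list(i2s(counts))
```

The window operator never inspects the payload, so the same s2i function works for any payload type.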
Fig. 4. The architecture of the SLD Revolution framework. In gray, the components that were
redesigned to adopt the generic programming approach. [Diagram labels: Source, Streaming Linked Data Revolution Server, Sink, Stream Receiver, Generic Stream Bus, Translator, Recorder, Re-player, Processor, Decorator; HTTP links.]
SLD Revolution generalizes the SLD architecture (cf. Figure 4 with Figure 2). The
Generic Stream Bus replaces the RDF stream bus. The receivers replace the adapters.
Like the adapters, they ingest external data streams, but they no longer transform
the received events into time-stamped RDF graphs. Data items remain in their original form;
only the ingestion time is added, postponing the transformation to the moment when
it is required (we name this approach lazy transformation). The processors substitute
the C-SPARQL-based analyzers. The C-SPARQL engine remains as one of the possible
processors, but SLD Revolution can be extended with any continuous processing engine.
The current implementation includes the Complex Event Processor Esper, the SPARQL
Engine Jena-ARQ, which operates on time-stamped RDF graphs one at a time, and a custom
JSON path expression evaluation engine built on gson (https://github.com/google/gson).
Translators generalize publishers, which are specific to the Streaming Linked Data
format [4], allowing SLD Revolution to output in alternative formats.
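The lazy-transformation idea can be sketched as follows. This is only an illustration under stated assumptions: the classes Received and Translator, the receive function and the simplified JSON payload are hypothetical, and the translation step is reduced to extracting one kind of triple.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple

Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Received:
    raw: str          # payload kept exactly as it arrived
    ingested_at: int  # ingestion time in seconds, the only thing the receiver adds


class Translator:
    """Hypothetical JSON-to-triples step that counts how often it runs."""

    def __init__(self) -> None:
        self.calls = 0

    def to_triples(self, item: Received) -> List[Triple]:
        self.calls += 1
        doc = json.loads(item.raw)
        return [(post["object"]["@id"], "sioc:topic", tag["@id"])
                for post in doc["items"]
                for tag in post["object"].get("tag", [])]


def receive(raw: str, now: int) -> Received:
    """Receiver: no translation, only time-stamping."""
    return Received(raw, now)


RAW = ('{"totalItems": 1, "items": [{"object": {"@id": "post:2", '
       '"tag": [{"@id": "tag:3"}]}}]}')

translator = Translator()
bus = [receive(RAW, t) for t in (10, 30, 50)]

# A windower needs only the time-stamps, so nothing is translated here.
in_first_minute = [item for item in bus if 0 <= item.ingested_at < 60]
calls_after_windowing = translator.calls

# A hashtag-counting processor needs graph data, so it translates on demand.
triples = [t for item in in_first_minute for t in translator.to_triples(item)]
```

An eager adapter would instead call the translator inside receive, paying the translation cost for every item regardless of which processors consume it.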
4 Experimental Settings
In this section, we present the experimental settings of this work. As domain we chose
Social Network analysis, as done by the Linked Data Benchmark Council (LDBC) in
the SNBench⁵. We first explain the type of data we used for our experiments. Then,
we explain how the data were sent to SLD and SLD Revolution. We describe the two
continuous processing pipelines that we registered in SLD and in SLD Revolution.
Finally, we state which key performance indicators (KPIs) we measure and how.
⁵ http://www.ldbcouncil.org/benchmarks/snb
Input Data. SLD and SLD Revolution receive information in the same way: they
both connect to a web socket server and handle JSON-LD files.
1 {"@context": { ... }, "@type": "Collection", "totalItems": 1,
2 "prov:wasAssociatedWith": "sr:Twitter",
3 "items":[{
4 "@type":"Post",
5 "published":"2016-04-26T15:40:03.054+02:00",
6 "actor":{"@type":"Account", "@id":"user:1", "sioc:name":"@streamreasoning"},
7 "object":{
8 "@type":"Content", "@id":"post:2", "alias":"http://.../2",
9 "prov:wasAssociatedWith":"sr:Twitter",
10 "sioc:content":"You ARE the #socialmedia!",
11 "dct:language":"en",
12 "tag":[{ "@type":"Tag", "@id":"tag:3", "displayName":"socialmedia"}]}
13 }]}
Listing 1. JSON representation of a Twitter micro-post. Due to the lack of space we omitted
the context declaration that contains the namespace.
In Listing 1, we propose a JSON-LD serialization of the Activity Stream representation
of a tweet as it was injected during the experiments in both systems. The JSON-LD
representation of an Activity Stream is a Collection (specified by the @type property) composed
of one or more social media items. The Collection is described with two properties,
i.e., totalItems and prov:wasAssociatedWith, which tell, respectively, the number of items
and the provenance of the items. The collection in the example contains a Post created
on 2016-04-26 (published property) by an actor (Line 6) that produces the object
(Lines 7-12). The Actor has a unique identifier @id, a displayName, a sioc:name and an
alias. The Object has a sioc:content, a dct:language, zero or more tags, and optionally a
url and a to to represent, respectively, links to web pages and mentions of other actors.
a sma:Tweet ;
dcterms:created "2016-04-26T15:46:43.346000+02:00"^^xsd:dateTime ;
dcterms:language "en"^^xsd:string ;
sioc:content "You ARE the #socialmedia!"^^xsd:string ;
sioc:has_container "Twitter"^^xsd:string ;
sioc:has_creator ;
sioc:id "2"^^xsd:string ;
sioc:link "http://.../status/2"^^xsd:string ;
sioc:topic .
a sioct:Tag ;
rdfs:label "socialmedia"^^xsd:string .
a sioc:UserAccount ;
sioc:account_of "StreamReasoning"^^xsd:string ;
sioc:creator_of ;
sioc:id "1"^^xsd:string ;
sioc:name "@streamreasoning"^^xsd:string .
Listing 2. RDF N3 representation of a Twitter micro-post
Listing 2 shows the RDF produced by the SLD adapter in transforming the JSON-LD
in Listing 1. The translation operation exploits well-known vocabularies, in particular
sioc to represent the online community information, prov to track the provenance of an
item and dcterms to represent information about the object.
Fig. 5. SLD Pipeline
Sending data. A test consists in sending a constant amount of synthetic data using
the JSON-LD serialization presented in Listing 1. The data is sent in bunches 3 times
per minute (i.e., at the 10th, the 30th and the 50th second of the minute). Each bunch
contains the same number of posts. We tested the configuration for different rates: 1500
posts per minute (i.e., 3 bunches of 500 posts), 3000 posts per minute, 6000 posts per
minute, 9000 posts per minute, 12000 posts per minute and 18000 posts per minute.
The rates and the input methodology were chosen based on our experience in
social monitoring (see Section 1). They test a normal situation for SLD (1500 and 3000
posts per minute) as well as situations that we know to overload SLD (more than 6000
posts per minute).
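The sending schedule described above amounts to simple arithmetic, sketched here in Python. The names RATES, SEND_SECONDS and schedule are illustrative; the rates and the three send instants come from the text.

```python
from typing import List, Tuple

RATES = [1500, 3000, 6000, 9000, 12000, 18000]  # posts per minute, from the text
SEND_SECONDS = (10, 30, 50)                      # seconds of each minute at which a bunch is sent


def schedule(rate: int, minutes: int) -> List[Tuple[int, int]]:
    """Return (send time in seconds, posts in the bunch) for a whole test run."""
    bunch = rate // len(SEND_SECONDS)  # every bunch has the same size
    return [(m * 60 + s, bunch) for m in range(minutes) for s in SEND_SECONDS]


first_minute = schedule(1500, minutes=1)
bunch_sizes = {rate: rate // len(SEND_SECONDS) for rate in RATES}
```

For example, the 1500 posts/min rate yields three bunches of 500 posts, sent at seconds 10, 30 and 50 of each minute.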
Pipelines. We tested SLD and SLD Revolution with different pipelines:
– the area chart pipeline computes the number of tweets observed over time. It uses a
15-minute-long window that slides every minute. The results can be continuously
computed i) using a generic sliding window operator, which looks only at
the time-stamps of the data items in the generic stream, and ii) accessing with a path
expression the totalItems property in the JSON-LD file, i.e., the number of items in
the collection.
– the bar chart pipeline counts how often hashtags appear in the tweets received in
the last 15 minutes. As in the area chart pipeline, the window slides every minute.
In this second pipeline, RDF streams are adequate and it is convenient to write a
C-SPARQL query that counts the number of times each hashtag appears.
The two pipelines are coded in SLD and SLD Revolution in two different ways.
SLD performs the transformation of JSON-LD into RDF by default, on all the input data,
independently of the task to perform. SLD Revolution keeps the data in its original
format as long as possible, i.e., it performs lazy transformations.
SLD Pipelines. Figure 5 presents the two pipelines in SLD. The input data are
translated into RDF as soon as they enter the pipelines. The computations for the area
chart and for the bar chart (see the parts marked with A and B) are composed of the
same types of components and share the new RDF stream translated by the Adapter.
Pipeline A uses two C-SPARQL queries. The first (see Listing 3) applies a
tumbling window of 1 minute⁶ and counts the tweets.
The second aggregates the results from the first query using a 15-minute time
window that slides every minute (see Listing 4).
⁶ A tumbling window is a sliding window that slides by its own length.
REGISTER STREAM presocialstr AS
CONSTRUCT { ?id sma:twitterCount ?twitterC }
FROM STREAM [RANGE 1m STEP 1m]
WHERE { SELECT (uuid() AS ?id) ?twitterC
        WHERE { SELECT (COUNT (DISTINCT ?mp) AS ?twitterC)
                WHERE { ?mp a sma:Tweet } } }
Listing 3. C-SPARQL pre-query for the area chart that applies a tumbling window of 1
minute and counts the tweets.
REGISTER STREAM ac AS
CONSTRUCT { ?uid sma:twitterCount ?totTwitter ; sma:created_during ?unixTimeFrame }
FROM STREAM [RANGE 15m STEP 1m]
WHERE {
  SELECT (uuid() AS ?uid) ?unixTimeFrame (SUM(?twitter) AS ?totTwitter)
  WHERE { ?id sma:twitterCount ?twitter ; sma:created_during ?timeFrame .
          ?timeFrame a sma:15mTimeFrame ; sma:inUnixTime ?unixTimeFrame }
  GROUP BY ?unixTimeFrame }
Listing 4. C-SPARQL query for the area chart that aggregates the results from the query in
Listing 3 using a 15 minutes time window that slides every minute.
It is worth noting that the first query is an important optimization in terms of memory
consumption. It spares the engine from keeping 15 minutes of tweets only to count them. In
SLD we often use this design pattern; we call this first query a pre-query.
Pipeline B also exploits this design; it applies a pre-query to reduce the amount of
data and then a query to produce the final result.
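The pre-query pattern can be illustrated outside C-SPARQL with a short Python sketch. It is a simplification with hypothetical function names: the first stage reduces each minute of tweets to one integer, and the second stage keeps only the last 15 integers instead of 15 minutes of tweets.

```python
from collections import deque
from typing import Iterable, List


def pre_query(tweets_in_minute: Iterable[str]) -> int:
    """Stage 1 (tumbling, 1 minute): reduce the tweets of one minute to a count."""
    return len(set(tweets_in_minute))  # distinct tweets, as in COUNT(DISTINCT ?mp)


def area_chart(minute_counts: Iterable[int], width: int = 15) -> List[int]:
    """Stage 2 (15 minutes, sliding every minute): sum the last `width` counts."""
    last = deque(maxlen=width)  # holds at most `width` integers, never the tweets
    totals = []
    for c in minute_counts:
        last.append(c)
        totals.append(sum(last))
    return totals


# 17 minutes of synthetic input at 1500 tweets per minute (made-up identifiers).
minutes = [[f"tweet:{m}-{i}" for i in range(1500)] for m in range(17)]
totals = area_chart(pre_query(batch) for batch in minutes)
```

Once the window is full, the total stays at 15 times the per-minute count, which is why a constant input rate should produce a flat area chart.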
It is also worth noting that all the C-SPARQL queries use the form REGISTER
STREAM ... AS CONSTRUCT ..., because RDF streams are the only means of communication
between SLD components.
The last components of both pipelines are publishers that make the results available
to external processes outside SLD. In this case, the publisher writes JSON files on disk.
Fig. 6. SLD Revolution Pipeline
SLD Revolution Pipelines. Figure 6 presents the pipelines in SLD Revolution. As
for SLD, pipeline A is for the area chart, while B is for the bar chart. The first
component is no longer an adapter. The data directly enter SLD Revolution in JSON-LD
format. The first query is a generic 1-minute-long tumbling window implemented with
the EPL statement in Listing 5. FORCE_UPDATE and START_EAGER tell the stream
processing engine, respectively, to emit also empty reports and to start processing the
window as soon as the query is registered (i.e., without waiting for the first time-stamped
data item to arrive). It is worth noting that this query exploits the event-based nature of
the generic stream it is observing; it does not inspect the payload of the events, it only
uses their time-stamps.
select * from GenericEvent.win:time_batch(1 min, "FORCE_UPDATE, START_EAGER")
Listing 5. The generic window query in common to both the pipelines in SLD Revolution
As explained in Section 3, processors are the central components of SLD Revolution.
They can listen to one or more generic streams, compute different operations and push
out a generic stream. The types of the input and output streams can be different. The two
pipelines use different processors (e.g., RDF translator, windower and SPARQL).
SLD Revolution maintains the data format as long as possible in order to reduce the
overhead of the translations. As already said, SLD Revolution can exploit the tree-based
nature of JSON-LD. In pipeline A, it uses a path expression to extract totalItems,
i.e., the number of items in each collection, from the time-stamped JSON-LD items in
the generic stream it listens to. It outputs a tuple ⟨timeframe, count⟩ that is aggregated
every minute over a window of 15 minutes using an EPL statement.
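A path-expression step of this kind can be sketched in Python as follows. The dotted path syntax and the path function are assumptions made for this example and are not the syntax of the gson-based engine used in SLD Revolution; the JSON document is a shortened variant of Listing 1.

```python
import json
from typing import Any, List, Tuple


def path(doc: Any, expr: str) -> Any:
    """Resolve a dotted path; numeric steps index into JSON arrays."""
    node = doc
    for key in expr.split("."):
        node = node[int(key)] if isinstance(node, list) else node[key]
    return node


RAW = """{"@type": "Collection", "totalItems": 1,
  "items": [{"@type": "Post",
             "object": {"@id": "post:2",
                        "tag": [{"@id": "tag:3", "displayName": "socialmedia"}]}}]}"""

# (timeframe, raw JSON-LD item): three collections in timeframe 0, one in timeframe 1.
generic_stream = [(0, RAW), (0, RAW), (0, RAW), (1, RAW)]

# Tree-based payloads in, tuples (timeframe, count) out; no RDF is produced.
tuples: List[Tuple[int, int]] = [
    (timeframe, path(json.loads(raw), "totalItems"))
    for timeframe, raw in generic_stream
]
```

The same helper reaches nested values, e.g. the path items.0.object.tag.0.displayName selects the hashtag label of the first post.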
Pipeline B of SLD Revolution translates JSON-LD into RDF in order to extract
information about the hashtags. As for pipeline B of SLD, we use a pre-query design
pattern to reduce the amount of data. A SPARQL processor applies the SELECT query
in Listing 6 to every data item in the generic stream it listens to and pushes out a stream
of tuples ⟨hashtagLabel, count⟩. The relational stream is then aggregated by an Esper
processor with a 15-minute time window that slides every minute (see Listing 7).
SELECT ?htlabel (COUNT(DISTINCT(?mpTweet)) AS ?htTweetCount)
WHERE { ?mpTweet a sma:Tweet ; sioc:topic ?tweetTopic .
        ?tweetTopic a sioctypes:Tag ; rdfs:label ?htlabel }
GROUP BY ?htlabel
ORDER BY desc(?htTweetCount)
Listing 6. SPARQL pre-query for the bar chart
select htlabel, sum(count) as sumHt from HTCountEvent.win:time(15 min)
group by htlabel output snapshot every 1 min
Listing 7. EPL query for the bar chart
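The two stages of this pipeline can be mimicked in plain Python to make the data flow explicit. This is a rough sketch and not an implementation of SPARQL or EPL: triples are modeled as string tuples, rdf:type stands for the keyword a, time is counted in whole minutes, and the function names and identifiers are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def pre_query(graph: Iterable[Triple]) -> List[Tuple[str, int]]:
    """Per data item: count distinct tweets per hashtag label (cf. Listing 6)."""
    graph = list(graph)
    tweets = {s for s, p, o in graph if p == "rdf:type" and o == "sma:Tweet"}
    tags = {s for s, p, o in graph if p == "rdf:type" and o == "sioctypes:Tag"}
    labels = {(s, o) for s, p, o in graph if p == "rdfs:label" and s in tags}
    pairs = {(label, tweet)
             for tweet, p, tag in graph if p == "sioc:topic" and tweet in tweets
             for t, label in labels if t == tag}
    # most_common() orders the labels by descending count
    return Counter(label for label, _ in pairs).most_common()


def snapshot(events: Iterable[Tuple[int, str, int]], now: int,
             width: int = 15) -> Dict[str, int]:
    """Sum the counts of the last `width` minutes, grouped by label (cf. Listing 7)."""
    totals: Counter = Counter()
    for minute, label, count in events:
        if now - width < minute <= now:
            totals[label] += count
    return dict(totals)


graph = [
    ("post:2", "rdf:type", "sma:Tweet"),
    ("post:2", "sioc:topic", "tag:3"),
    ("tag:3", "rdf:type", "sioctypes:Tag"),
    ("tag:3", "rdfs:label", "socialmedia"),
]
per_item = pre_query(graph)
# One identical data item per minute for 20 minutes.
events = [(minute, label, n) for minute in range(1, 21) for label, n in per_item]
bar_chart = snapshot(events, now=20)
```

Only the small tuples produced by the first stage are retained across the 15-minute window; the graphs themselves can be discarded after each item is processed.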
KPIs. As key performance indicators (KPIs), we measure the resource consumption
of the two systems and the correctness of the results. For the resource consumption,
we measure every 10 seconds: i) the CPU load of the system thread in %, ii) the
memory consumption of the thread in MB and iii) the memory consumption of the Java
Virtual Machine (JVM). For the correctness, we compare the computed results with
the expected results. Since the input is a constant flow of tweets that differ only in their
ID, the area chart is expected to be flat and the bar chart is expected to count exactly the
same number of hashtags every minute.
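One plausible way to turn this comparison into a number is a relative percentage error, sketched below. The formula is an assumption, since the text does not spell out how the error is computed, and the observed values are made up for illustration.

```python
from statistics import median

WINDOW_MINUTES = 15


def expected_area_value(rate: int) -> int:
    """Tweets expected in a full 15-minute window at a constant rate (posts/min)."""
    return rate * WINDOW_MINUTES


def pct_error(observed: float, expected: float) -> float:
    """Assumed error measure: relative deviation from the expected value, in %."""
    return abs(observed - expected) / expected * 100


expected = expected_area_value(3000)
observed = [45000, 45000, 44000, 30000]  # made-up engine outputs, one per minute
errors = [pct_error(o, expected) for o in observed]
summary = (min(errors), median(errors), max(errors))
```

Summaries of this kind (minimum, quartiles, median, maximum of the per-minute errors) match the shape of the statistics reported in the results table.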
Fig. 7. An overview of the experimental results; larger bubbles mean greater % errors. [Bubble chart: X axis, median CPU load (%), logarithmic scale; Y axis, median engine memory (MB), logarithmic scale; series SLD with an exponential trend line and SLD Revolution with a linear trend line (R² = 0.99891 and R² = 0.96413 appear in the plot); bubble labels give the median area chart error.]
5 Evaluation Results
Figure 7 offers an overview of the results of the experiments. The full results are reported
at the end of this section in Figure 10. On the X axis we plot the median of the CPU load
in %, while on the Y axis we plot the median of the memory allocated by the engine thread. The size
of the bubble maps the median of the error of the area chart. Bubbles in the lower left
corner correspond to the experiment where we sent 1500 tweets per minute.
Increasing the throughput results in more memory consumption and CPU load for
both systems. However, SLD Revolution consumes less memory than SLD and
occupies less CPU. Moreover, SLD Revolution presents a linear increment for both these
KPIs, while the resource usage for SLD grows exponentially with the throughput. The error
in the results also increases with the throughput: SLD already shows an error greater
than 3% in the bar chart at 3000 tweets per minute and in the area chart at 9000 tweets
per minute; SLD Revolution is faster, i.e., it reaches a higher maximum input throughput,
and more accurate, i.e., it reaches the 3% error level only at 18000 tweets per minute,
providing more precise results than SLD.
Figure 8 presents the recorded time series for CPU load and memory usage in both
systems. The memory usage graphs contain two different time series. The blue one
represents the memory usage of the system thread, while the orange one shows the total
memory usage for the JVM.
The memory usage of the system thread accounts for all the components and data
in the pipeline. Notably, when the system under test is not overloaded, the memory
usage is constant over time, while when the system is overloaded it grows until the
system crashes. The total memory usage of the JVM shows, instead, the typical pattern
of the garbage collector that lets the JVM memory grow before freeing it. Also in this
case, when the system is overloaded, the garbage collector fails to free the memory.
During the experiments the median of the memory used by SLD spans from 115 MB,
when loaded with 1500 posts/min, to 1.6 GB, when loaded with 18000 posts/min. For
SLD Revolution, instead, it spans from 44 MB to 511.5 MB in the same load conditions.
The experimental results clearly show that SLD Revolution consumes (on average) three
times less memory than its ancestor.
The same considerations apply to the CPU load. The median of the CPU
load spans from 2% to 10% for SLD Revolution, while it spans from 10% to 39.5% for
SLD. SLD Revolution consumes on average 4 times less CPU time than SLD.
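As a quick check of these ratios, the following Python snippet divides the SLD medians by the SLD Revolution medians at the lowest and highest input rates. The values are copied from Fig. 10; how the averages quoted in the text were computed is not stated, so only the two endpoint ratios are shown.

```python
# Median values copied from Fig. 10 for the lowest and highest input rates.
MEMORY_MB = {
    "SLD": {1500: 115, 18000: 1652},
    "SLD Revolution": {1500: 44, 18000: 511.5},
}
CPU_PCT = {
    "SLD": {1500: 10.0, 18000: 39.6},
    "SLD Revolution": {1500: 2.0, 18000: 10.3},
}


def ratios(kpi):
    """SLD median divided by SLD Revolution median, per input rate."""
    return {rate: round(kpi["SLD"][rate] / kpi["SLD Revolution"][rate], 1)
            for rate in kpi["SLD"]}


memory_ratio = ratios(MEMORY_MB)
cpu_ratio = ratios(CPU_PCT)
```

The endpoint ratios (about 2.6 and 3.2 for memory, 5.0 and 3.8 for CPU) are roughly in line with the "three times" and "4 times" figures given above.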
Fig. 8. Memory and CPU usage over time. [Grid of time-series plots: one row per input rate (1500, 3000, 6000, 9000, 12000 and 18000 posts/min); columns: memory usage for SLD Revolution and SLD, CPU time usage for SLD Revolution and SLD.]
The correctness results are summarized in Figure 9. As explained in Section 4, the
percentage of errors is computed by comparing the results for each time interval with
the expected ones. The X axis of each plot shows the percentage error; it ranges from
0% to 100%. The Y axis is the percentage of results with that error; it also goes from 0%
to 100%. A bar as tall as the Y axis on the left side of the graph means that all results
were correct. The smaller that bar is and the greater the number of bars to its right is,
the more errors were observed.
In general, the results show that SLD Revolution is more accurate (the result error
is smaller) than SLD. For the area chart, the distribution shows that the percentage of
error of SLD Revolution is very low when the input throughput is between 1500 posts/min
and 9000 posts/min. When it is higher (i.e., 12000 and 18000 posts/min), SLD
Revolution also starts suffering and the percentage of errors starts growing. For SLD, errors
are present even at lower input rates; the graph shows that the error distribution starts
moving to the right at 6000 posts/min. Similar considerations apply to the bar
chart error distribution. The degradation of performance of SLD starts at a very low rate; a
substantial presence of errors around 7% can be seen with 6000 posts/min in input.
Fig. 9. Area chart and bar chart error distributions. [Grid of histograms: one row per input rate (1500, 3000, 6000, 9000, 12000 and 18000 posts/min); columns: area chart correctness for SLD Revolution and SLD, bar chart correctness for SLD Revolution and SLD.]
Figure 8 and Figure 9 show the deep correlation between resource usage and errors.
Clearly, a growing input throughput drives the systems to be less reliable. For both
versions of the SLD framework (SLD Revolution and SLD), the correctness of the results
decreases as soon as the machine is overloaded and the resource usage starts rising out
of control.
KPI                post/min | SLD                                     | SLD Revolution
                            | Min.    1st Qu. Median  3rd Qu. Max.    | Min.    1st Qu. Median  3rd Qu. Max.
memory (MB)        1500     | 112     114     115     115     116     | 42      43      44      45      48
memory (MB)        3000     | 188     190     191     192     193     | 45      48      50      52      58
memory (MB)        6000     | 330     337     340     342     353     | 52      61      64      67      78
memory (MB)        9000     | n.a.    481     485     488     504     | 59      73      80      84      102
memory (MB)        12000    | 684     783     888     956     1015    | 66      88      97      107     144
memory (MB)        18000    | n.a.    919     1652    2565    4451    | 212     392     511.5   597.2   774
CPU load (%)       1500     | 7.3%    9.3%    10.0%   10.8%   14.1%   | 1.3%    1.8%    2.0%    2.2%    3.4%
CPU load (%)       3000     | 8.3%    10.5%   11.2%   12.1%   16.2%   | 1.3%    2.0%    2.2%    2.5%    3.8%
CPU load (%)       6000     | 16.1%   18.6%   19.7%   21.0%   29.5%   | 1.4%    2.1%    2.4%    2.8%    4.5%
CPU load (%)       9000     | n.a.    26.3%   28.3%   30.4%   37.0%   | 1.5%    2.3%    2.7%    3.2%    5.1%
CPU load (%)       12000    | n.a.    32.6%   34.9%   36.9%   43.3%   | 1.3%    2.6%    3.2%    4.2%    8.9%
CPU load (%)       18000    | n.a.    35.2%   39.6%   46.1%   72.5%   | 2.8%    7.2%    10.3%   13.8%   26.2%
area chart error   1500     | 0.0%    0.0%    0.0%    0.1%    66.7%   | 0.0%    0.0%    0.0%    0.0%    33.3%
area chart error   3000     | 0.0%    0.0%    0.0%    0.1%    33.4%   | 0.0%    0.0%    0.0%    0.0%    33.3%
area chart error   6000     | 0.0%    0.1%    0.6%    8.1%    61.8%   | 0.0%    0.0%    0.0%    0.0%    33.3%
area chart error   9000     | 0.1%    1.2%    3.9%    9.5%    30.2%   | 0.0%    0.0%    0.0%    0.0%    33.3%
area chart error   12000    | 0.1%    14.1%   22.2%   26.0%   82.4%   | 0.0%    0.0%    16.7%   33.3%   33.3%
area chart error   18000    | 38.2%   44.8%   50.2%   59.5%   93.1%   | 0.0%    16.9%   32.1%   53.7%   100.0%
bar chart error    1500     | 0.0%    0.0%    2.2%    2.3%    6.7%    | 0.0%    0.0%    0.0%    0.0%    93.3%
bar chart error    3000     | 0.0%    0.1%    4.5%    4.5%    4.5%    | 0.0%    0.0%    0.0%    0.1%    2.2%
bar chart error    6000     | 0.1%    2.3%    2.3%    4.1%    5.7%    | 0.1%    0.1%    2.3%    2.3%    2.3%
bar chart error    9000     | 1.8%    3.4%    4.2%    5.4%    7.5%    | 1.6%    2.4%    2.4%    2.4%    3.2%
bar chart error    12000    | 13.6%   19.4%   21.0%   25.0%   28.0%   | 0.2%    1.5%    2.5%    4.7%    6.1%
bar chart error    18000    | 49.5%   51.1%   52.6%   54.7%   57.9%   | 2.9%    9.4%    12.6%   16.3%   22.4%
Fig. 10. The experimental results.
6 Conclusions and Future Work
In our future work, we intend: i) to empirically demonstrate the value of using SLD
Revolution for all our deployments of socialOmeters, and ii) to investigate if SLD
Revolution can be the target platform for a new generation of Ontology Based Data
Integration [18] systems for Stream Reasoning [19]. Such a system could have the potential
to tame the velocity and variety dimensions of Big Data simultaneously.
As for the former, we first intend to stress test SLD Revolution using workloads that
resemble reality. Then, we aim at putting it to work in parallel with SLD in real-world
deployments. Once we have collected enough evidence that SLD Revolution is
always cheaper, faster yet more accurate than SLD, we will start using it for all our
deployments.
As for the latter, we aim at further investigating the generic processing model
presented in Section 3. We are defining an algebra able to capture the semantics of
complex stream processing applications that need to integrate a variety of data sources.
The current sketch of this algebra uses S2I and I2S operators from CQL [8], while keeping
them generic w.r.t. the payloads. It uses the SPARQL 1.1 algebra as I2I operators that
take graph-based payloads as input and generate either graph-based or tuple-based
payloads as output. The relational algebra will cover the I2I transformations of tuple-based
payloads. We are studying the application of R2RML [20] for formulating mappings
that work as I2I operators. Indeed, R2RML allows writing mappings from relational
data to RDF; more generally, such I2I operators take tuple-based payloads as input and
output graph-based ones. We still need to choose an algebra for transforming tree-based
payloads.
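As a rough illustration of the three operator families named above, the following Python sketch types S2I, I2S and I2I operators generically over the payload. It is only a sketch under stated assumptions: the names (Timestamped, s2i_window, i2i_map, tuple_to_triples, i2s_rstream), the dictionary encoding of tuples and the "ex:" prefix are hypothetical, and the toy mapping merely plays the role that a mapping language could play; it is neither R2RML nor the algebra under definition.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Generic, Iterable, List, Tuple, TypeVar

P = TypeVar("P")  # payload type of the input items
Q = TypeVar("Q")  # payload type of the output items

Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Timestamped(Generic[P]):
    """A time-stamped generic data item (hypothetical encoding)."""
    timestamp: int
    payload: P


def s2i_window(stream: Iterable[Timestamped[P]], start: int, end: int) -> List[P]:
    """S2I: keep the payloads whose timestamp falls in [start, end)."""
    return [item.payload for item in stream if start <= item.timestamp < end]


def i2i_map(instant: List[P], fn: Callable[[P], Q]) -> List[Q]:
    """I2I: transform an instantaneous collection, payload by payload."""
    return [fn(payload) for payload in instant]


def tuple_to_triples(row: Dict[str, str]) -> List[Triple]:
    """Toy tuple-to-graph mapping (an assumption, not R2RML)."""
    subject = f"ex:{row['id']}"
    return [(subject, f"ex:{key}", value) for key, value in row.items() if key != "id"]


def i2s_rstream(instant: List[Q], now: int) -> List[Timestamped[Q]]:
    """I2S: emit every element of the instantaneous result, stamped with `now`."""
    return [Timestamped(now, payload) for payload in instant]


# Usage: a tuple-based stream is windowed, mapped to graph-based payloads,
# and streamed out again.
stream = [
    Timestamped(1, {"id": "a", "tag": "x"}),
    Timestamped(5, {"id": "b", "tag": "y"}),
]
window = s2i_window(stream, 0, 3)
graphs = i2i_map(window, tuple_to_triples)
out = i2s_rstream(graphs, now=3)
```

Because the operators are parametric in the payload type, the same S2I and I2S code serves tuple-based and graph-based items alike; only the I2I step depends on the nature of the data.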
In our opinion, the grand challenge is how to fit all those formal elements into a
coherent framework that allows a system to automatically decide what the latest
moment for transforming data is (i.e., introducing the concept of lazy transformation) and
to perform optimizations such as introducing the pre-query that we put in all the pipelines
illustrated in Section 4.
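To make the effect of postponing the transformation concrete, here is a minimal sketch, assuming tuple-based input items encoded as dictionaries, a pre-query reduced to a simple equality test on a hypothetical "tag" field, and a toy tuple-to-triple mapping. It contrasts transforming every item up front with transforming only the items that survive the pre-query, and counts how many transformations each strategy performs. It is not the SLD Revolution implementation.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]
Triple = Tuple[str, str, str]


def to_triples(row: Row) -> List[Triple]:
    """Toy mapping from a tuple-based payload to a graph-based one (assumption)."""
    subject = "ex:" + row["id"]
    return [(subject, "ex:" + key, value) for key, value in row.items() if key != "id"]


def eager(rows: List[Row], wanted_tag: str) -> Tuple[List[List[Triple]], int]:
    """Transform every row first, then filter on the graph-based payload."""
    transformed = 0
    result = []
    for row in rows:
        triples = to_triples(row)
        transformed += 1
        if ("ex:" + row["id"], "ex:tag", wanted_tag) in triples:
            result.append(triples)
    return result, transformed


def lazy(rows: List[Row], wanted_tag: str) -> Tuple[List[List[Triple]], int]:
    """Pre-query on the tuple-based payload, then transform only the survivors."""
    transformed = 0
    result = []
    for row in rows:
        if row.get("tag") != wanted_tag:
            continue  # discarded before paying the transformation cost
        result.append(to_triples(row))
        transformed += 1
    return result, transformed


rows = [
    {"id": "1", "tag": "a"},
    {"id": "2", "tag": "b"},
    {"id": "3", "tag": "a"},
]
eager_result, eager_cost = eager(rows, "a")
lazy_result, lazy_cost = lazy(rows, "a")
```

Both strategies return the same graph-based payloads, but the lazy one transforms only the rows that pass the pre-query. Deciding automatically where such a filter can be placed is the optimization problem described above.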
When the work on this formally defined generic stream processing model is
completed, we will be able to start investigating how to extend mapping languages
like R2RML7 and, potentially, also ontological languages in order to make them
time-aware [21] while keeping the whole computational problem tractable.
References
1. Balduini, M., Celino, I., Dell’Aglio, D., Della Valle, E., Huang, Y., Lee, T.K., Kim, S., Tresp,
V.: BOTTARI: an augmented reality mobile application to deliver personalized and location-
based recommendations by continuous analysis of social media streams. J. Web Sem. 16
(2012) 33–41
2. Balduini, M., Della Valle, E., Dell’Aglio, D., Tsytsarau, M., Palpanas, T., Confalonieri, C.:
Social listening of city scale events using the streaming linked data framework. [22] 1–16
3. Barbieri, D.F., Braga, D., Ceri, S., Della Valle, E., Grossniklaus, M.: Querying RDF streams
with C-SPARQL. SIGMOD Record 39(1) (2010) 20–26
4. Barbieri, D.F., Della Valle, E.: A proposal for publishing data streams as linked data - A
position paper. In Bizer, C., Heath, T., Berners-Lee, T., Hausenblas, M., eds.: Proceedings of
the WWW2010 Workshop on Linked Data on the Web, LDOW 2010, Raleigh, USA, April
27, 2010. Volume 628 of CEUR Workshop Proceedings., CEUR-WS.org (2010)
5. Breslin, J.G., Decker, S., Harth, A., Bojars, U.: SIOC: an approach to connect web-based
communities. IJWBC 2(2) (2006) 133–142
6. Phuoc, D.L., Dao-Tran, M., Pham, M., Boncz, P.A., Eiter, T., Fink, M.: Linked stream
data processing engines: Facts and figures. In Cudré-Mauroux, P., Heflin, J., Sirin, E.,
Tudorache, T., Euzenat, J., Hauswirth, M., Parreira, J.X., Hendler, J., Schreiber, G., Bernstein,
A., Blomqvist, E., eds.: The Semantic Web - ISWC 2012 - 11th International Semantic Web
Conference, Boston, MA, USA, November 11-15, 2012, Proceedings, Part II. Volume 7650
of Lecture Notes in Computer Science., Springer (2012) 300–312
7. Dell’Aglio, D., Calbimonte, J., Balduini, M., Corcho, Ó., Della Valle, E.: On correctness in
RDF stream processor benchmarking. [22] 326–342
8. Arasu, A., Babu, S., Widom, J.: The CQL continuous query language: semantic foundations
and query execution. VLDB J. 15(2) (2006) 121–142
9. Garofalakis, M., Gehrke, J., Rastogi, R.: Data Stream Management: Processing High-Speed
Data Streams (Data-Centric Systems and Applications). Springer-Verlag New York, Inc.,
Secaucus, NJ, USA (2007)
10. Phuoc, D.L., Dao-Tran, M., Parreira, J.X., Hauswirth, M.: A native and adaptive approach
for unified processing of linked streams and linked data. In Aroyo, L., Welty, C., Alani, H.,
Taylor, J., Bernstein, A., Kagal, L., Noy, N.F., Blomqvist, E., eds.: The Semantic Web - ISWC
2011 - 10th International Semantic Web Conference, Bonn, Germany, October 23-27, 2011,
Proceedings, Part I. Volume 7031 of Lecture Notes in Computer Science., Springer (2011)
370–388
7 https://www.w3.org/TR/r2rml/
11. Calbimonte, J., Corcho, Ó., Gray, A.J.G.: Enabling ontology-based access to streaming
data sources. In Patel-Schneider, P.F., Pan, Y., Hitzler, P., Mika, P., Zhang, L., Pan, J.Z.,
Horrocks, I., Glimm, B., eds.: The Semantic Web - ISWC 2010 - 9th International Semantic
Web Conference, ISWC 2010, Shanghai, China, November 7-11, 2010, Revised Selected
Papers, Part I. Volume 6496 of Lecture Notes in Computer Science., Springer (2010) 96–111
12. Dell’Aglio, D., Della Valle, E., Calbimonte, J., Corcho, Ó.: RSP-QL semantics: A unifying
query model to explain heterogeneity of RDF stream processing systems. Int. J. Semantic
Web Inf. Syst. 10(4) (2014) 17–44
13. Le Phuoc, D., Nguyen-Mau, H.Q., Parreira, J.X., Hauswirth, M.: A middleware framework
for scalable management of linked streams. J. Web Sem. 16 (2012) 42–51
14. Gray, A.J.G., Garcia-Castro, R., Kyzirakos, K., Karpathiotakis, M., Calbimonte, J.P., Page,
K.R., Sadler, J., Frazer, A., Galpin, I., Fernandes, A.A.A., Paton, N.W., Corcho, Ó.,
Koubarakis, M., Roure, D.D., Martinez, K., Gómez-Pérez, A.: A semantically enabled
service architecture for mashups over streaming and stored data. In Antoniou, G., Grobelnik,
M., Simperl, E.P.B., Parsia, B., Plexousakis, D., Leenheer, P.D., Pan, J.Z., eds.: ESWC (2).
Volume 6644 of Lecture Notes in Computer Science., Springer (2011) 300–314
15. Compton, M., Barnaghi, P.M., Bermudez, L., Garcia-Castro, R., Corcho, Ó., Cox, S., Graybeal,
J., Hauswirth, M., Henson, C.A., Herzog, A., Huang, V.A., Janowicz, K., Kelsey, W.D., Le
Phuoc, D., Lefort, L., Leggieri, M., Neuhaus, H., Nikolov, A., Page, K.R., Passant, A., Sheth,
A.P., Taylor, K.: The SSN ontology of the W3C semantic sensor network incubator group. J.
Web Sem. 17 (2012) 25–32
16. Jazayeri, M., Loos, R., Musser, D.R., eds.: Generic Programming, International Seminar on
Generic Programming, Dagstuhl Castle, Germany, April 27 - May 1, 1998, Selected Papers.
Volume 1766 of Lecture Notes in Computer Science., Springer (2000)
17. Milner, R., Morris, L., Newey, M.: A logic for computable functions with reflexive and
polymorphic types. Department of Computer Science, University of Edinburgh (1975)
18. Lenzerini, M.: Data integration: A theoretical perspective. In: Proceedings of the Twenty-first
ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 3-5,
Madison, Wisconsin, USA. (2002) 233–246
19. Della Valle, E., Ceri, S., van Harmelen, F., Fensel, D.: It’s a streaming world! reasoning upon
rapidly changing information. IEEE Intelligent Systems 24(6) (2009) 83–89
20. Priyatna, F., Corcho, Ó., Sequeda, J.: Formalisation and experiences of R2RML-based SPARQL
to SQL query translation using morph. In Chung, C., Broder, A.Z., Shim, K., Suel, T., eds.:
23rd International World Wide Web Conference, WWW ’14, Seoul, Republic of Korea, April
7-11, 2014, ACM (2014) 479–490
21. Artale, A., Kontchakov, R., Ryzhikov, V., Zakharyaschev, M.: A cookbook for temporal
conceptual data modelling with description logics. ACM Trans. Comput. Log. 15(3) (2014)
25:1–25:50
22. Alani, H., Kagal, L., Fokoue, A., Groth, P.T., Biemann, C., Parreira, J.X., Aroyo, L., Noy,
N.F., Welty, C., Janowicz, K., eds.: The Semantic Web - ISWC 2013 - 12th International
Semantic Web Conference, Sydney, NSW, Australia, October 21-25, 2013, Proceedings, Part
II. Volume 8219 of Lecture Notes in Computer Science., Springer (2013)