DIONYSUS: Towards Query-aware Distributed Processing
               of RDF Graph Streams

                               Syed Gillani, Gauthier Picard, and Frédérique Laforest
                      Laboratoire Hubert Curien, UMR CNRS 5516, and Institute Henri Fayol, EMSE
                                                  Saint-Etienne, France
                             syed.gillani@univ-st-etienne.fr , gauthier.picard@emse.fr,
                                      frederique.laforest@telecom-st-etienne.fr

ABSTRACT                                                                 the solution for variety and heterogeneity of data sources. RDF is
Arguably, the most significant obstacle to handle the emerging ap-       a schema-free data model in which data are represented as subject-
plication’s data deluge is to design a system that addresses the chal-   predicate-object (hs, p, oi) statements called triples, and SPARQL
lenges for big data’s volume, velocity and variety. Work in RDF          is a standard query language for RDF with triple patterns as one of
stream processing (RSP) systems partly addresses the challenge of        its main constituent. Today, RSP systems are turning into mature
variety by promoting the RDF model. However, challenges like             academic systems [7, 18, 10, 17], while broadening the spectrum
volume, velocity are overlooked by existing approaches. These            of applications they serve. As a part of this process, we increas-
challenges demand optimised combination of scale-out and scale-          ingly see the need to entertain these systems with functionality like
up solutions. Furthermore, various other requirements for RSP sys-       scalability, distribution of sources, historical analysis on streamed
tems, such as an efficient integration of distributed stream sources,    data, and integration of new operators to take the advantages of the
storage of historical streams and their analysis, and integration of     structured data model.
stateful operators to support complex event processing over streams         In SEAS [1] project, we tackle with a Smart Grid scenario, where
are far from being addressed in an efficient way. Our vision is to       a set of distributed and heterogeneous sources emanate data. The
design a general purpose RDF graph streaming system, which will          sources are mainly composed of sensors providing data about the
be able to cope with distributed streams and shares local optimising     power consumption at appliance level in each house, power gener-
strategies to allow different kinds of queries (analytical, streaming,   ation/storage data, weather related data, usersâĂŹ activity data etc.
sequence-based) through one query interface. The proposed sys-           Given the variety of data sources, the system must support an as-
tem will offer a black-box solution that will allow analysts to tap      sortment of data sources, standard graph analytic (e.g., how many
in the goldmine of massive RDF graph streams. We consider the            appliance are present in a house, their operational times and power
challenges and opportunities associated in designing such system,        ratings), complex analytic (e.g., finding areas in the city with power
introduce our approaches to these topics, and discuss the compo-         shortages and nearby power rich areas), real-time monitoring (e.g.,
nents of our envisioned system.                                          detection of abnormalities in power usage, power disruptions), and
                                                                         complex event processing (CEP) (defining a sequence to trade elec-
                                                                         tricity) over temporal and distributed events. The above mentioned
1.    INTRODUCTION                                                       scenario identify three main dimensions of optimisations and re-
   In a wide range of domains from social media to financial trad-       quirements to be included in the current RSP systems. That is, (i)
ing, there is a growing need of supporting continuous queries over       distribution and scalability of RSP systems, (ii) enabling analytical
predefined windows of data. The initial generation of stream pro-        queries over historical streams, and (iii) incremental evaluation to
cessing systems (SPSs) were focused on delivering real-time re-          enable stateful operators for CEP.
sponse for the input that usually consisted of homogeneous stream           First, as in classical data integration, continuous queries on data
tuples (relational) with predefined schemas. The implementation of       streams may involve combined processing over a number of stream-
such stream processing engines is relatively straightforward as long     ing data sources physically distributed across the network. The
as the source data can be correctly represented in the data model of     currently available RSP systems require streams to be transmitted
the SPS, and vice versa; albeit involving some manual work and           to a single location for centralised processing. Unfortunately, the
inflexibility.                                                           continuous transmission of large number of rapid streams to a cen-
   In order to provide a more flexible solution, and to enable rea-      tralised locations may exceed the capacity of monitoring infrastruc-
soning capabilities for continuous queries; recently, the notion of      ture. Thus, these approaches are not feasible for many real-world
RDF stream processing (RSP) systems has been introduced [7, 18].         scenarios. Consider for instance, social network like Twitter, where
The data for such systems are modelled as RDF graphs; providing          on average there are around 6000 tweets per second and around 500
                                                                         million tweets per day. An ad-hoc RSP system with a set of queries
                                                                         defined over long temporal windows (e.g., 1 day) can easily exhaust
                                                                         the available system resources, and requires a distributed comput-
                                                                         ing model. Similar, for Smart Grid, each appliance in a house em-
                                                                         anates on average 200K events per day; extending it for a set of
                                                                         appliances in a house and then for a set of houses would be beyond
 c 2016, Copyright is with the authors. Published in the Workshop Pro-   the capacity of existing RSP systems [13].
ceedings of the EDBT/ICDT 2016 Joint Conference (March 15, 2016, Bor-
deaux, France) on CEUR-WS.org (ISSN 1613-0073). Distribution of this
                                                                            Second, the main optimisation goal of current RSP systems is to
paper is permitted under the terms of the Creative Commons license CC-   reduce the latency of results, since they mainly address what might
by-nc-nd 4.0
be called monitoring applications. That is, the systems were mainly        of selected challenges. Third, we provide an overview of our en-
not designed with storage in mind; while essentially all of the mon-       visioned approach. Fourth we provide various query optimisation
itoring applications that we encountered had a need for archival           techniques for our envisioned system.
storage. The storage of historical data within RSP systems would
enable (i) analytical queries on the historical data, (ii) make pre-
dictive analysis; while comparing the historical and current data
                                                                           2.      STATE OF THE ART
streams. Storing data from streams is not as simple as it sounds;             The main object of the proposed research is to build a general
the data from streams grow rapidly making it difficult to store and        purpose system which should support a wide variety of queries
manage. Storage capacity, however is only one aspect of the prob-          over RDF graph streams and on historical data. The existing tech-
lem. The data generated by streams can be so big that its analysis         niques in this scenario can be divided into (i) RDF stream process-
can also lead to severe time-bound issues. The complete analysis of        ing (RSP) systems, (ii) RDF storage systems for static data, (iii)
stored data, however is rarely required and what users need instead        CEP over RDF streams
is explorative access to the data to find interesting subsets that need       [RDF Stream Processing] Stream processing of graph struc-
further detailed and time-consuming analysis. For instance, a sen-         tured data (often dubbed as stream reasoning) – where streams
sory device capturing the temperature of the room every 2 seconds          are continuously processed together with semantics and rich back-
would produce a lot of repetitive data. However a user might be            ground knowledge – has gained fair momentum. Existing tech-
only interested in temperature values for longer time periods with         niques [7, 18, 10, 17] tackle diverse issues including, continuous
few distinct values.                                                       query processing, stream reasoning, ontology maintenance, ontology-
   Third, many real-time applications, not only require the on-line        based data access. However, they are far too limited and complex to
analytics of streams (as provided by all RSP engines), but also re-        be applicable on gigantic distributed data streams. These solutions
quire stateful operators; which allows the new data elements to act        are usually optimised for centralised settings and cannot be directly
as updates to the previously processed data elements. For instance,        adopted for federated/distributed settings. Furthermore, most of
the data element describing the measurement of power consump-              these techniques are based on the triple stream model, where each
tion by an appliance represents an update to its previously mea-           element/event within a stream is composed of a triple: a model di-
sured values. This kind of scenarios is not uncommon in applica-           rectly inspired from the relational tuple stream model. For the same
tions like Smart Grid, social networks, stock exchange, transport          reason, most of these solutions (e.g., CQELS [18], C-SPARQL [7])
systems etc., where each new input might represents the updated            map RDF triples on relational tuples – to be handled by an underly-
value of an object. This means, as opposite to the batch algorithms        ing relational stream processor. We argue that this would not allow
– that recomputes new output from scratch by re-evaluating all the         to reap the real power of structured events and it would be difficult
selected events – RSP system needs an incremental matching al-             to extend it for further graph operators; the W3C RSP1 group took
gorithm to minimise the computation by only evaluating the new             the same tone for triple stream model [2]. CEP over RDF [4] has
events, while considering the already computed states of the previ-        the comparable story as that of RSP – with triple model as the dom-
ous ones. Furthermore, such stateful behaviour can also provides           inant cause of controversy. We believe that an RDF graph model
another interesting class of queries, i.e., sequence-based queries.        for streams would first enable to close the semantic gap between
These queries determine the defined stateful sequence of events and        existing techniques, and second allows to tailor techniques from
produce the output once it has been matched. These operators can           static RDF graph solutions to RDF graph streams.
be defined under the umbrella of complex event processing (CEP).              [Centralised RDF Storage Systems] To cater the high volume
The general RSP systems do not offer native support for CEP; thus,         of data, we need to store it in an efficient manner; this would enable
any solution built on top of them will suffer in terms of expressive       near-to-real-time analytic queries and real-time streaming queries.
power, usability, and performance. Few specialised systems [4] that        Several techniques have been proposed to store graph structured
support sequencing over an RDF model are defined in the literature;        data. These techniques range from native graph storage systems
however, they lacks the scalability and distribution requirements as       [29, 6, 27] to specialised RDF storage systems [5, 3, 23]. Most
described earlier.                                                         of these systems are optimised only for static data and they em-
   There is a rich literature on RSP system, CEP and distributed           ploy extensive indexing techniques to optimise query performance.
querying over RDF, but corresponding systems are rather rigid (i.e.,       For instance, RDF3x [3] builds several clustered B+-trees for all
can only analyse one triple at a time (for RSP and CEP), and dis-          permutations of subjects, objects and predicates. Such extensive
tributed querying is based on static data), and do not deliver in          indexing techniques are not practical or feasible for high velocity
terms of performance required for applications that we envisioned.         graphs stream; as they would spend considerable amount of time
Existing systems that are capable of providing some sort of func-          and space for pre-processing and indexing RDF graph streams.
tionalities as discussed above fall into one of the following cate-           [Distributed RDF Storage Systems] In order to realise the dis-
gories: (1) distributed RDF batch processing systems that are op-          tributed nature and amount of data on the Web, several techniques
timised only for static data [28, 15], (2) centralised RSP systems         have been proposed for distributed RDF graph storage and query-
that only provide real-time analytics [7, 18, 10, 17], (3) centralised     ing. These techniques span from optimising a federated layer [24,
CEP systems that only provide sequence operators [4]. Our am-              9, 21] on top of existing data stores to a dedicated solutions for
bitious goal is to develop a general-purpose system that not only          clustering and querying RDF graphs in distributed manners [28,
provides all the required functionalities, but also shares optimisa-       15]. These techniques provide a good starting point but are not di-
tion strategies across the system; it will offer the abstractions, tools   rectly applicable for streaming settings, due to the following points.
and dedicated algorithms needed for achieving these goals. The             (1) Optimisation/distribution strategies are based on the assump-
proposed system will make it feasible for analysts to use one query        tion of static data, (2) the distribution/clustering of RDF graphs
interface to post various different types of queries on distributed        does not cater the heterogeneity of the sources; this can result into
data streams.                                                              an increase in network traffic and replication of data in streaming
   We first present the state-of-the-art to draw the picture of the        settings. Furthermore, due to their re-evaluation model, such dis-
work required to achieve all our tasks. Second, we present a list          1
                                                                               https://www.w3.org/community/rsp/
tributed approaches can lead to significant increase in query time                               Apps                Clients           Visulisation
with the increase of data size.
                                                                                                           Exact Query Graphs (EQGs)


3.    SELECTED RESEARCH CHALLENGES
    In this section, we expose main challenges to be dealt for our
envisioned system.                                                                                              Query Conductors

    [Streams, Events and Query Model] One important problem
                                                                                                        Bolts                      Bolts
for RDF graph-based stream processing is the lack of clean seman-                      CEP                           Stream                      Analytical
tic models for defining streams and continuous queries to process                    Optimiser                      Optimiser                    Optimiser

them. There is no agreement even on the definition of basic terms
such as “stream” and “event”. Although most of existing systems                                         Archipelago of CBGP-Stores
are based on a triple stream model, it is one of the many possi-
ble points that shows the dependence of RSP engines on relational
models. Hence, the existence of a wide variety of streaming ap-
                                                                             Static Island of CBGP-Stores Alive Island CBGP-Stores Deceased island of CBGP-Stores
plications not only introduces complexities in choosing the right
engine for given applications, but also makes the application de-
velopment and maintenance hard. The need of standardisation has                                            Data Stream Sources
been recently acknowledge by the W3C RSP working group and
few of their initial propositions clear out the discussion toward the
stream and event model. Furthermore, most of the existing RSP
                                                                                     Figure 1: Layered Architecture of DIONYSUS
systems use extended forms of SPARQL to embed streaming op-
erators. However, due to the lack of precise semantics, the query
results for these RSP engines are not even comparable [12, 22].          4.       DIONYSUS
    [Handling Volume and Velocity under Distributed Settings]               In this section, we describe the components of our envisioned
The problem of distributing data into a set of clusters, and then        system called as DIONYSUS2 . Our system design is motivated by
finding the query relevant data from all such distributed sources        the following goals and is described in Fig. 1.
at federation layer has been dealt in static settings [24, 9, 21, 28,    (1). We are interested in an efficient distribution of streaming data
15]. In such cases, data is known in advance and its distribution        from a set of sources, which are not known in advance. Thus, our
is performed once. However, this is not the case in streaming set-       storage and data distribution model is envisioned by Common Ba-
tings; data is not known in advance for distribution analysis and        sic Graph Pattern store (CBGP-store). Each CBGP-store is as-
new sources are added dynamically and old sources provide data           signed with a generic BGP (i.e., a set of triple patterns) generated
at variable velocities. Distributing and storing such a dynamic sea      automatically/manually from the domain ontology and domain use
of information brings new challenges and thus needed to be dealt.        cases (see Section 4.1). A collection of such stores exposes (i)
Existing RSP systems are based on sliding window-based execu-            fresh incrementally computed results of a set of CBGP for stream-
tion strategy: data in the windows are processed and expelled once       ing queries, (ii) the set of previously computed results for off-line
a window is expired or it slides forward [22]. This first does not       analytical queries. As we see in Section 4.1, this enables the use
allow historical/analytical analysis on data. Second, the windows        of light-weight incremental indexing technique to efficiently store
are devised in an ad hoc manner, thus are optimised only for small       data for each CBGP.
bandwidth.
    [Incremental Indexing and Query Evaluation] The use of in-           (2). The second goal of our approach is to push the intensive query
dexing (in distributed RDF settings) enables two primary tasks:          optimisation and processing locally at each CBGP-store for Exact
(1) it determines the relevant distributed sources for a query, (2)      Query Graphs (EQGs) registered by a user. An EQG is a more
it tunes the NP-hardness of subgraph isomorphism – that is match-        selective form of CBGP and it contains (i) subsets of triple pat-
ing a query graph and the data graph – into a more subtle prob-          terns that are distributed among CBGP-stores, (ii) SPARQL 1.1 op-
lem; that is either joining the result of a set of triple patterns or    erators, such as select, optional, union, filter, group
exploring smaller subsets of query graph to join them for the final      by, etc. The results of each EQGs are accumulated at the federated
result [29]. Nevertheless, the good indexing strategies can prove        level; Section 4.2 describes such techniques.
to be quite significant for the performance of a system; we argue        (3). Our third goal is to enable different kinds of queries – such
based on our experience with existing storage solutions that index-      as analytical, streaming, sequence-based – through a single query
ing itself can become a bottleneck to store and query RDF graph          interface. A query interface encompasses subsets of CBGP-stores
streams. For example, it takes more than 10 hours to build an            that can be abstracted as islands of CBGP-stores (see Fig. 1). This
optimised state-of-the-art index over a dataset of 2 billion triples     would enable to share query results computation and local optimi-
on modern servers [26]. It is imperative therefore, to first develop     sation strategies. For example, users would like to get the result of
techniques that interactively and adaptively build parts of the index,   (i) an analytical query describing the number of active appliances
while focusing on the data necessary to answer queries and making        and their power usage in a house, and the result of (ii) a sequence-
the data available immediately. Second, the continuous streaming         based query to determine the sequence of power usage by appli-
queries require incremental evaluation model. This means, the stor-      ances. This calls for a single query that can be optimised according
age mechanism in this scenario must store the state of previously        to its defined operators as presented in Section 5.
computed results and the new entries in the stream should act as an      (4). Our fourth aim would be to provide the semantic completeness
update to the previously stored states.                                  and locations transparency. That is, a new source can be added
                                                                         without affecting the integrity constraints, and a user query can
                                                                         2
                                                                             DistrIbuted aNalYtical, Streaming and Sequence qUerieS
span multiple islands of CBGP-stores. This would enable to first,            menting an incremental and adaptive indexing is one of the ma-
share the optimising strategies defined for each CBGP-store, sec-            jor challenges in storing and querying dynamic streams of RDF
ond to reduce the network traffic by employing local optimisation            graphs.
and computation strategies.
   In summary, our envisioned system can provide a way not to                A collection of CBGP-stores is abstracted under an island, where
drown in the sea of information emanating from heterogeneous                 each island is assigned with a set of query conductors. Query con-
distributed sources. It filters unnecessary information, which oth-          ductors shares optimisation strategies through bolts (see Fig. 1).
erwise can result in excessive use of storage and computational              Each CBGP-store is divided into three flavours: static, alive or de-
resources. The framework is designed to minimise the burden of               ceased. This classification is based on the fact that a streaming
query evaluation at the federation layer and to share local optimi-          query requires the current incrementally computed results, while
sation strategies across the islands of CBGP-stores.                         the analytical or predictive queries base their execution on histor-
                                                                             ical states of incrementally computed results. The static CBGP-
4.1     Common Basic Graph Pattern Stores                                    stores are used to enrich the streams with the static background
                                                                             knowledge – one of the main property of RSP engines. Each CBGP-
   The main difference between the static and streaming data distri-
                                                                             store captures a fraction of data from different sources, thus the
bution is its availability. The static data is available beforehand for
                                                                             storage of alive CBGP-store and query computation can be done
analysis and thus various techniques such as semantic hash-based
                                                                             completely in the main-memory, while the deceased and static CBGP-
pertaining [19] can be utilised to partition it into a set of clusters.
                                                                             stores can utilise disk-based storage.
However, the streaming data can not be available in advance for
distribution analysis. Thus, we envisioned an ontology-based data            4.2      Query Computation via Query Conduc-
distribution. Inspired by works on ontology partitioning [25, 20],                    tors
ontology modularisation [11] and ontology segmentation [8], we
propose to generate a set of common basic graph patterns (CBGP)                 The query conductor layer is the working horse of the system.
by analysing the structural relationships described in the domain            This layer is responsible for conducting various high-level optimi-
ontology/ontologies. Given a set of ontologies O = {o1 , o2 , . . . , on }   sations and accumulating results of the user-defined queries on an
used by a set of streams, a function F(O) produces a set of CBGP             archipelago of CBGP-stores (see Fig. 1). When a customised user
by analysing the structural relationships within ontologies and the          query – called as an Exact Query Graph (EQG) – is registered,
general use cases defined for a particular domain. Intuitively, we           the query conductor first determines its type (analytical, stream-
can say that each CBGP should contain information about a coher-             ing or sequence-based query) and creates an abstract syntax tree
ent subtopic within an ontology. The concepts within each CBGP               (AST). It then divides the EQG into a set of subquery graphs, along-
are semantically connected to each other and should not have strong          with their temporal information, in case of streaming and sequence-
dependencies with the information outside the CBGP. Thus, each               based queries. It next utilises the information of its islands of
CBGP = (C, D) contains a set of concepts C and links D between               CBGP-stores and orchestrates the execution of each subquery graph
them represent different kinds of dependencies; where D can be               on a set of CBGP-stores. The query conductor itself does little
reflected in the definition of an ontology or can be implied by the          computation other than concatenating or temporal sorting of re-
intuitive understanding of the concepts and background knowledge             sults from CBGP-stores. This design permits us to take the advan-
about the respective domain. Generally, CBGPs consist of differ-             tages of local optimisation strategies implemented for each CBGP-
ent types: star-shaped, shallow tree shaped, deep tree shaped, graph         store. Note that EQGs are more specialised and selective versions
with loops etc. Each CBGP is mapped to a data store; thus forming            of CBGP, thus the set of triple patterns within each EQG can be
a set of dataset clusters compose of data aggregated from a set of           easily decomposed and registered to CBGP-stores.
sources. Note that such data distribution may results in data du-               Results of streaming and sequence-based EQGs are held in main-
plication across the set of CBGP-stores. However the following               memory buffers, and they are updated whenever there is a newly
benefits would outweigh such shortcomings.                                   computed incremental results in the corresponding alive CBGP-
                                                                             stores. The query conductor also ensures that the data movement
(1) Aggregating commonly linked concepts within a single CBGP-               is not too expensive; we will explore the tactics for moving results
store. This would results in (i) querying processing to be focussed          based on operators involved, processing activity at various CBGP-
at local levels, and (ii) reducing the network traffic and load at fed-      stores, and overall network activity. That is, if a certain EQG re-
eration level. The data duplication among CBGP-stores ensure that            sults in an increase network traffic or query processing at federation
each query answer can be computed locally.                                   level, we can create a new CBGP-store by analysing the structure
                                                                             of the EQGs.
(2) Surviving in the sea of information by only filtering and storing
the relevant information. The CBGP-stores can act as data filters,           5.      QUERY OPTIMISATION
where only the relevant data based on the selected graph relation-             In this section, we discuss various opportunities offered by our
ships will be stored and the full dataset from a set of sources is           envisioned system to optimise the performance of different types of
summarised by a set of CBGPs. The summarised data acts as a sur-             queries: analytical, streaming and sequenced-based.
rogate for the original data and are queried instead of the complete
dataset.                                                                     5.1      Analytical Queries
                                                                                Analytical queries for historical data analysis usually contains
(3) In-memory incremental evaluation of the CBGPs; storing only              multiple aggregation phases. For example, the query to find the
the states of the computed results, rather than all the data elements        average power consumption by each house grouped by area is de-
from a set of sources.                                                       scribed in Query 1 (using SPARQL3 syntax). It contains three spec-
                                                                             ifications: (i) graph pattern matching to compute the query-related
(4) It enables the query-aware light-weight and adaptive indexing
                                                                             3
technique for storing CBGPs results. As described earlier, imple-                http://www.w3.org/TR/sparql11-query/
subgraphs corresponding to the power consumption of a house and              The scalability of continuous streaming queries are case depen-
area information, (ii) grouping the resulting patterns based on the       dent; there are two main flavours discussed in the literature. First,
values of area-house combinations, and (iii) aggregating the val-         scaling a large number of queries by distributing their execution.
ues of the power consumption to compute the average. Now con-             Second, scaling a complex query that needs large working mem-
sider that the house and its power consumption values lies in a           ory and might not fit within a single machine. The first case is
CBGP-store D1 , while information regarding the area lies in an-          easy to handle; a shared nothing execution architecture – i.e., nei-
other CBGP-store D2 , and there can be n CBGP-stores.                     ther streams or memory storage is shared among processors – can
                                                                          be utilised to run multiple instances of the streaming engine, each
  Query 1. Analytical query for Smart Grid use case
                                                                          running a subset of the queries. However, scaling complex queries
SELECT ?area, ?house, AVG(?power)            (iii)                        is still an open issue.
  WHERE                                                                       Query 2. Streaming query for Smart-Grid use case
{
?house :location ?l.                                                      SELECT ?power, ?house, ?temp, ?Wspeed, ?hum
?house :powerSource ?source.                   (i)                        WINDOW 2 HOURS
?source :value ?power.                                                     WHERE
                                                                          {
?l :partOf ?area.                                                         STREAM <http://example.org/powersource> [Range 2s]
?area :name ?areaName.                                                    {
}                                                                         ?house :location ?l.
GROUP BY (?area)                             (ii)                         ?house :powerSource ?source.
                                                                          ?source :value ?power.
                                                                          }
   Traditionally in federated settings, the two basic optimisation        STREAM <http://example.org/weathersource> [Range 2s]
techniques to compute such a query graph are: (i) compute each            {
triple pattern in a query graph against all the available data stores     ?l :temperature ?temp.
                                                                          ?l :windSpeed ?Wspeed.
and the results are joined at the server or (ii) evaluating each triple   ?l :humidity ?hum.
pattern in a nested loop join (NLJ) fashion; that is by substitut-        }
ing the results obtained from one triple pattern into another. These
                                                                          }
techniques and their optimised version [24] performs poorly for
highly selective and complex analytical queries. [16] provides var-          We argue that our system design could handle both cases effi-
ious optimisation strategies to cater analytical queries for federated    ciently with the following points. (1) We can utilise multiple query
static data sources and can act as a good starting point to extend it     conductors to serve a set of EQGs, where query conductors can
for dynamic settings, albeit in a different way. Since previous ap-       easily be distributed into a set of machines. (2) The main pro-
proaches assumed a cost-based model, they would not be efficient          cessing of a EQG is performed in a distributed manner on a set
to support the addition of new sources and the generation of new          of CBGP-stores. That is, a EQG is parsed into a set of subquery
CBGP-stores. The reason is that, they would require to maintain a         graphs, each is assigned to a CBGP-stores. The query execution
model of each subquery graph operation and the resources it needs.        results are stored in the distributed cache and then joined on com-
That is, essentially the query optimiser need to understands all the      mon variables (i.e., ?l in Query 2) to produce the final results; the
operations and storage techniques in all the distributed resources,       static CBGP-stores further provide the functionality of joining the
while continuously updating it for dynamic streams.                       static background knowledge with the streams. Such query execu-
   At opposite, we envisioned a black box approach, where no in-          tion strategy enables to share the results of common query graph
formation about the local query optimisations is known at the server      patterns within a set of EQGs. Furthermore, there is also the case
level. Each subquery graph of EQGs is sent to the corresponding           of incremental evaluation of the query result. Traditionally, Query
CBGP-store, where a local optimiser determines the subquery join          2 is executed in a re-evaluation manner [22, 18]. That is, the results
operations. Using the same example as described above, we can             for each subquery graph are indexed (usually B+ tree) and for each
compute the graph pattern matching – which is typically join inten-       new event, (i) the result of each subquery graph is re-evaluated, (ii)
sive – and aggregation of values at D1 store. Similarly, the basic        the join on the common variable is re-evaluated, and (iii) all the
graph pattern matching for area-related query graph patterns will         computation is performed in an ad-hoc manner. Therefore, exist-
be performed at D2 store. The results of both of these local pro-         ing approaches are unable to support the scalability, performance
cesses are sent back to the analytical query optimiser, which uses        and incremental evaluation requirements posed by the streaming
the cardinality of the results from D1 and D2 to efficiently order        queries.
the joins between house and area results, and finally performs the
grouping. Our black-box approach also allows the semantic com-            5.3     Sequence-based Queries
pleteness and location transparency; new sources and CBGP-stores             The distribution of sequence-based queries is an open research
can be easily updated without remodelling/recalculating local opti-       issue and there seems to be no effort in the context of the semantic
misation strategies.                                                      web community. A sequence-based query determines a data se-
                                                                          quence, which nin our case is an ordered sequence              of RDF graphs.
5.2    Continuous Streaming Queries                                                                                             o
                                                                          Formally, S = (G1 , t1 ), (G2 , t2 ), . . . (Gn , tn ) is defined as a sequence
   Continuous streaming queries are typically registered on a set of      of n RDF graphs, where each graph contains a set of triples and a
sources and the computation of the results are bounded by a win-          timestamp ti in which the recording was made. Given a set of sub-
dow – which slides by certain elements count or time interval. For        query graphs (QG) and an ordering/sequencing function O : QG →
example, see Query 2 (using CQELS [18] syntax). It determines             N, we determine the ordering of RDF graphs matched to the sub-
the subquery graph for power consumption on a source s1 and the           query graphs. A key observation in this case is that we need to
weather related subquery graph on source s2 , within a window of 2        execute or assemble the results of each subquery graphs in a way
hours that slides every 2 second. The results of the subquery graphs      that it follows the ordering defined for QGs. For example, consider
are joined on a common variable ?l.                                       Query 3 (using IntellCEP syntax [14]), it determines the sequence
of RDF graph events (SEQ(A,B)), where the power consumption             7.[1] https://itea3.org/project/seas.html.
                                                                               REFERENCES
of a house is greater than a certain threshold value, followed by
                                                                         [2] https://www.w3.org/community/rsp/wiki/RDF_Stream_Models.
specific weather conditions.                                                 Priminilary draft Link "https://goo.gl/iLQscV".
   Existing approaches [4] for executing such queries have various       [3] T. N. 0001 and G. Weikum. Rdf-3x: a risc-style engine for rdf.
shortcomings; (i) they utilise a triple-based model, where only one          PVLDB, 1(1):647–659, 2008.
triple is permitted in each event, (ii) they utilise ad hoc settings,    [4] D. Anicic and Fodor. EP-SPARQL: a unified language for event
where expensive indexing is used for ordering functions, and (iii)           processing and stream reasoning. In WWW, 2011.
they are based on a single stream model. These shortcoming makes         [5] M. Atre and Chaoji. Matrix "bit" loaded: A scalable lightweight join
them unsuitable for many real-world use cases, where distribution            query processor for RDF data. In WWW, pages 41–50, 2010.
is the key to cater huge volumes of data.                                [6] M. Atre and J. A. Hendler. BitMat: A main memory bit-matrix of rdf
                                                                             triples. In In: Proceedings of the 5th International Workshop on
     Query 3. Sequence-based query for Smart Grid use case                   Scalable Semantic Web Knowledge Base Systems., 2009.
    SELECT ?house,?l,?power                                              [7] D. F. Barbieri and Braga. C-SPARQL: Sparql for continuous
    WITHIN 24 hours                                                          querying. In WWW, pages 1061–1062, 2009.
    PARTITION BY (?house)                                                [8] M. Bhatt, C. Wouters, and Flahive. Semantic completeness in
    FROM STREAM S1 <http://example.org/powersource>                          sub-ontology extraction using distributed methods. In ICCSA. 2004.
    FROM STREAM S2 <http://example.org/weathersource>                    [9] C. Buil-Aranda, M. Arenas, O. Corcho, and A. Polleres. Federating
    WHERE                                                                    queries in {SPARQL} 1.1: Syntax, semantics and evaluation. Web
{
SEQ (A, B)                                                                   Semantics: Science, Services and Agents on the World Wide Web,
                                                                             18(1):1 – 17, 2013. Special Section on the Semantic and Social Web.
  A ON S1                                                               [10] J.-P. Calbimonte, O. Corcho, and A. J. G. Gray. Enabling
{                                                                            ontology-based access to streaming data sources. In ISWC’10.
?house :location ?l.                                                    [11] M. d’Aquin, M. Sabou, and E. Motta. Modularization: a key for the
?house :powerSource ?source.                                                 dynamic selection of relevant knowledge components. In 1st
?source :value ?power.
FILTER (?power > 50)                                                         International Workshop on Modular Ontologies, WoMO’06, 2006.
  }                                                                     [12] D. DellâĂŹAglio, J.-P. Calbimonte, and Balduini. On correctness in
                                                                             rdf stream processor benchmarking. In The Semantic Web âĂŞ ISWC
    B ON S2                                                                  2013. Springer Berlin Heidelberg, 2013.
{                                                                       [13] R. C. Fernandez, M. Weidlich, P. Pietzuch, and A. Gal. Scalable
                                                                             stateful stream processing for smart grids. In DEBS, pages 276–281,
?l :temperature ?temp.
?l :windSpeed ?Wspeed.                                                       New York, NY, USA. ACM.
?l :humidity ?hum.                                                      [14] S. Gillani, G. Picard, F. Laforest, and A. Zimmermann. Towards
FILTER (?temp > 20 && ?Wspeed > 10)                                          Efficient Semantically Enriched Complex Event Processing and
                                                                             Pattern Matching. In OrdRing 2014 @ ISWC 2014, Trentino, Italy,
    }                                                                        Oct. 2014.
                                                                        [15] S. Gurajada, S. Seufert, I. Miliaraki, and M. Theobald. TriAD: A
}                                                                            distributed shared-nothing rdf engine based on asynchronous
   Our approach of distributed CBGP-stores can however be useful             message passing. In SIGMOD, 2014.
in this scenario. Based on our earlier optimisation techniques, we      [16] D. Ibragimov, K. Hose, T. B. Pedersen, and E. Zimányi. Processing
                                                                             aggregate queries in a federation of SPARQL endpoints. In ESWC,
can send each subquery graph to the corresponding alive CBGP-                pages 269–285, 2015.
stores. The computation of graph pattern matching and aggregation       [17] S. Komazec and D. Cerri. Sparkwave: Continuous schema-enhanced
on subquery graph is computed locally, while the results along-with          pattern matching over rdf data streams. In DEBS, 2012.
their temporal properties can be utilised to determine the sequence     [18] D. Le-Phuoc and Dao-Tran. A native and adaptive approach for
defined on a set of events. That is, the query evaluation is bro-            unified processing of linked streams and linked data. In ISWC. 2011.
ken into several steps in a pipeline that matches the events with       [19] K. Lee and L. Liu. Scaling queries over big rdf graphs with semantic
a subquery graph and republish the matched events to a step fur-             hash partitioning. Proc. VLDB Endow., 2013.
ther in the pipeline. The sequence of events within a pipeline can      [20] B. MacCartney, S. McIlraith, and Amir. Practical partition-based
                                                                             theorem proving for large knowledge bases. In IJCAI, 2003.
be matched by either utilising a rule-based system or an automata-
                                                                        [21] K. Makris, N. Bikakis, N. Gioldasis, and S. Christodoulakis.
based approach. Crucially, this step requires deep insight into the          SPARQL-RW: Transparent query access over mapped rdf data
temporal measurements and state management for aggregate oper-               sources. In EDBT, pages 610–613, New York, NY, USA, 2012. ACM.
ators.                                                                  [22] A. Margara, J. Urbani, F. van Harmelen, and H. Bal. Streaming the
                                                                             web: Reasoning over dynamic data. Web Semantics: Science,
6.      CONCLUSION                                                           Services and Agents on the World Wide Web, 25(0):24 – 44, 2014.
                                                                        [23] Y. Nenov, R. Piro, B. Motik, I. Horrocks, Z. Wu, and J. Banerjee.
   Even though the scalability, state management and distribution            RDFox: A highly-scalable RDF store. In ISWC. Springer, 2015.
of sources are very common requirements for RSP system, there is        [24] A. Schwarte, P. Haase, and M. Schmidt. FedX: A federation layer for
currently no system that can provide a generic solution to accom-            distributed query processing on linked open data. In ESWC’13.
modate these attributes. In this paper, we summarised the chal-         [25] H. Stuckenschmidt and M. Klein. Structure-based partitioning of
lenges and opportunities provided by a distributed general purpose           large concept hierarchies. In The Semantic Web âĂŞ ISWC 2004,
system for RDF graph stream processing. We propose DIONYSUS                  volume 3298, pages 289–303. Springer Berlin Heidelberg, 2004.
that will provide one query interface to enable analytical, streaming   [26] Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li. Efficient subgraph
                                                                             matching on billion node graphs. Proc. VLDB Endow., 2012.
and sequence-based queries. It will support islands of data stores
                                                                        [27] P. Yuan, P. Liu, B. Wu, H. Jin, W. Zhang, and L. Liu. Triplebit: A fast
each assigned to a CBGP; thus filtering and indexing the RDF                 and compact system for large scale rdf data. VLDB’13.
graphs in an incremental manner. Furthermore, the use of such           [28] K. Zeng, J. Yang, H. Wang, B. Shao, and Z. Wang. A distributed
system allows to share optimisation strategies and query matches             graph engine for web scale rdf data. In PVLDB’13.
within a set of queries; leading to the desired scalability require-    [29] L. Zou and M. Tamer. gStore: a graph-based SPARQL query engine.
ment presented by real-world systems.                                        VLDB J., pages 565–590, 2014.