DIONYSUS: Towards Query-aware Distributed Processing of RDF Graph Streams Syed Gillani, Gauthier Picard, and Frédérique Laforest Laboratoire Hubert Curien, UMR CNRS 5516, and Institute Henri Fayol, EMSE Saint-Etienne, France syed.gillani@univ-st-etienne.fr , gauthier.picard@emse.fr, frederique.laforest@telecom-st-etienne.fr ABSTRACT the solution for variety and heterogeneity of data sources. RDF is Arguably, the most significant obstacle to handle the emerging ap- a schema-free data model in which data are represented as subject- plication’s data deluge is to design a system that addresses the chal- predicate-object (hs, p, oi) statements called triples, and SPARQL lenges for big data’s volume, velocity and variety. Work in RDF is a standard query language for RDF with triple patterns as one of stream processing (RSP) systems partly addresses the challenge of its main constituent. Today, RSP systems are turning into mature variety by promoting the RDF model. However, challenges like academic systems [7, 18, 10, 17], while broadening the spectrum volume, velocity are overlooked by existing approaches. These of applications they serve. As a part of this process, we increas- challenges demand optimised combination of scale-out and scale- ingly see the need to entertain these systems with functionality like up solutions. Furthermore, various other requirements for RSP sys- scalability, distribution of sources, historical analysis on streamed tems, such as an efficient integration of distributed stream sources, data, and integration of new operators to take the advantages of the storage of historical streams and their analysis, and integration of structured data model. stateful operators to support complex event processing over streams In SEAS [1] project, we tackle with a Smart Grid scenario, where are far from being addressed in an efficient way. Our vision is to a set of distributed and heterogeneous sources emanate data. The design a general purpose RDF graph streaming system, which will sources are mainly composed of sensors providing data about the be able to cope with distributed streams and shares local optimising power consumption at appliance level in each house, power gener- strategies to allow different kinds of queries (analytical, streaming, ation/storage data, weather related data, usersâĂŹ activity data etc. sequence-based) through one query interface. The proposed sys- Given the variety of data sources, the system must support an as- tem will offer a black-box solution that will allow analysts to tap sortment of data sources, standard graph analytic (e.g., how many in the goldmine of massive RDF graph streams. We consider the appliance are present in a house, their operational times and power challenges and opportunities associated in designing such system, ratings), complex analytic (e.g., finding areas in the city with power introduce our approaches to these topics, and discuss the compo- shortages and nearby power rich areas), real-time monitoring (e.g., nents of our envisioned system. detection of abnormalities in power usage, power disruptions), and complex event processing (CEP) (defining a sequence to trade elec- tricity) over temporal and distributed events. The above mentioned 1. INTRODUCTION scenario identify three main dimensions of optimisations and re- In a wide range of domains from social media to financial trad- quirements to be included in the current RSP systems. That is, (i) ing, there is a growing need of supporting continuous queries over distribution and scalability of RSP systems, (ii) enabling analytical predefined windows of data. The initial generation of stream pro- queries over historical streams, and (iii) incremental evaluation to cessing systems (SPSs) were focused on delivering real-time re- enable stateful operators for CEP. sponse for the input that usually consisted of homogeneous stream First, as in classical data integration, continuous queries on data tuples (relational) with predefined schemas. The implementation of streams may involve combined processing over a number of stream- such stream processing engines is relatively straightforward as long ing data sources physically distributed across the network. The as the source data can be correctly represented in the data model of currently available RSP systems require streams to be transmitted the SPS, and vice versa; albeit involving some manual work and to a single location for centralised processing. Unfortunately, the inflexibility. continuous transmission of large number of rapid streams to a cen- In order to provide a more flexible solution, and to enable rea- tralised locations may exceed the capacity of monitoring infrastruc- soning capabilities for continuous queries; recently, the notion of ture. Thus, these approaches are not feasible for many real-world RDF stream processing (RSP) systems has been introduced [7, 18]. scenarios. Consider for instance, social network like Twitter, where The data for such systems are modelled as RDF graphs; providing on average there are around 6000 tweets per second and around 500 million tweets per day. An ad-hoc RSP system with a set of queries defined over long temporal windows (e.g., 1 day) can easily exhaust the available system resources, and requires a distributed comput- ing model. Similar, for Smart Grid, each appliance in a house em- anates on average 200K events per day; extending it for a set of appliances in a house and then for a set of houses would be beyond c 2016, Copyright is with the authors. Published in the Workshop Pro- the capacity of existing RSP systems [13]. ceedings of the EDBT/ICDT 2016 Joint Conference (March 15, 2016, Bor- deaux, France) on CEUR-WS.org (ISSN 1613-0073). Distribution of this Second, the main optimisation goal of current RSP systems is to paper is permitted under the terms of the Creative Commons license CC- reduce the latency of results, since they mainly address what might by-nc-nd 4.0 be called monitoring applications. That is, the systems were mainly of selected challenges. Third, we provide an overview of our en- not designed with storage in mind; while essentially all of the mon- visioned approach. Fourth we provide various query optimisation itoring applications that we encountered had a need for archival techniques for our envisioned system. storage. The storage of historical data within RSP systems would enable (i) analytical queries on the historical data, (ii) make pre- dictive analysis; while comparing the historical and current data 2. STATE OF THE ART streams. Storing data from streams is not as simple as it sounds; The main object of the proposed research is to build a general the data from streams grow rapidly making it difficult to store and purpose system which should support a wide variety of queries manage. Storage capacity, however is only one aspect of the prob- over RDF graph streams and on historical data. The existing tech- lem. The data generated by streams can be so big that its analysis niques in this scenario can be divided into (i) RDF stream process- can also lead to severe time-bound issues. The complete analysis of ing (RSP) systems, (ii) RDF storage systems for static data, (iii) stored data, however is rarely required and what users need instead CEP over RDF streams is explorative access to the data to find interesting subsets that need [RDF Stream Processing] Stream processing of graph struc- further detailed and time-consuming analysis. For instance, a sen- tured data (often dubbed as stream reasoning) – where streams sory device capturing the temperature of the room every 2 seconds are continuously processed together with semantics and rich back- would produce a lot of repetitive data. However a user might be ground knowledge – has gained fair momentum. Existing tech- only interested in temperature values for longer time periods with niques [7, 18, 10, 17] tackle diverse issues including, continuous few distinct values. query processing, stream reasoning, ontology maintenance, ontology- Third, many real-time applications, not only require the on-line based data access. However, they are far too limited and complex to analytics of streams (as provided by all RSP engines), but also re- be applicable on gigantic distributed data streams. These solutions quire stateful operators; which allows the new data elements to act are usually optimised for centralised settings and cannot be directly as updates to the previously processed data elements. For instance, adopted for federated/distributed settings. Furthermore, most of the data element describing the measurement of power consump- these techniques are based on the triple stream model, where each tion by an appliance represents an update to its previously mea- element/event within a stream is composed of a triple: a model di- sured values. This kind of scenarios is not uncommon in applica- rectly inspired from the relational tuple stream model. For the same tions like Smart Grid, social networks, stock exchange, transport reason, most of these solutions (e.g., CQELS [18], C-SPARQL [7]) systems etc., where each new input might represents the updated map RDF triples on relational tuples – to be handled by an underly- value of an object. This means, as opposite to the batch algorithms ing relational stream processor. We argue that this would not allow – that recomputes new output from scratch by re-evaluating all the to reap the real power of structured events and it would be difficult selected events – RSP system needs an incremental matching al- to extend it for further graph operators; the W3C RSP1 group took gorithm to minimise the computation by only evaluating the new the same tone for triple stream model [2]. CEP over RDF [4] has events, while considering the already computed states of the previ- the comparable story as that of RSP – with triple model as the dom- ous ones. Furthermore, such stateful behaviour can also provides inant cause of controversy. We believe that an RDF graph model another interesting class of queries, i.e., sequence-based queries. for streams would first enable to close the semantic gap between These queries determine the defined stateful sequence of events and existing techniques, and second allows to tailor techniques from produce the output once it has been matched. These operators can static RDF graph solutions to RDF graph streams. be defined under the umbrella of complex event processing (CEP). [Centralised RDF Storage Systems] To cater the high volume The general RSP systems do not offer native support for CEP; thus, of data, we need to store it in an efficient manner; this would enable any solution built on top of them will suffer in terms of expressive near-to-real-time analytic queries and real-time streaming queries. power, usability, and performance. Few specialised systems [4] that Several techniques have been proposed to store graph structured support sequencing over an RDF model are defined in the literature; data. These techniques range from native graph storage systems however, they lacks the scalability and distribution requirements as [29, 6, 27] to specialised RDF storage systems [5, 3, 23]. Most described earlier. of these systems are optimised only for static data and they em- There is a rich literature on RSP system, CEP and distributed ploy extensive indexing techniques to optimise query performance. querying over RDF, but corresponding systems are rather rigid (i.e., For instance, RDF3x [3] builds several clustered B+-trees for all can only analyse one triple at a time (for RSP and CEP), and dis- permutations of subjects, objects and predicates. Such extensive tributed querying is based on static data), and do not deliver in indexing techniques are not practical or feasible for high velocity terms of performance required for applications that we envisioned. graphs stream; as they would spend considerable amount of time Existing systems that are capable of providing some sort of func- and space for pre-processing and indexing RDF graph streams. tionalities as discussed above fall into one of the following cate- [Distributed RDF Storage Systems] In order to realise the dis- gories: (1) distributed RDF batch processing systems that are op- tributed nature and amount of data on the Web, several techniques timised only for static data [28, 15], (2) centralised RSP systems have been proposed for distributed RDF graph storage and query- that only provide real-time analytics [7, 18, 10, 17], (3) centralised ing. These techniques span from optimising a federated layer [24, CEP systems that only provide sequence operators [4]. Our am- 9, 21] on top of existing data stores to a dedicated solutions for bitious goal is to develop a general-purpose system that not only clustering and querying RDF graphs in distributed manners [28, provides all the required functionalities, but also shares optimisa- 15]. These techniques provide a good starting point but are not di- tion strategies across the system; it will offer the abstractions, tools rectly applicable for streaming settings, due to the following points. and dedicated algorithms needed for achieving these goals. The (1) Optimisation/distribution strategies are based on the assump- proposed system will make it feasible for analysts to use one query tion of static data, (2) the distribution/clustering of RDF graphs interface to post various different types of queries on distributed does not cater the heterogeneity of the sources; this can result into data streams. an increase in network traffic and replication of data in streaming We first present the state-of-the-art to draw the picture of the settings. Furthermore, due to their re-evaluation model, such dis- work required to achieve all our tasks. Second, we present a list 1 https://www.w3.org/community/rsp/ tributed approaches can lead to significant increase in query time Apps Clients Visulisation with the increase of data size. Exact Query Graphs (EQGs) 3. SELECTED RESEARCH CHALLENGES In this section, we expose main challenges to be dealt for our envisioned system. Query Conductors [Streams, Events and Query Model] One important problem Bolts Bolts for RDF graph-based stream processing is the lack of clean seman- CEP Stream Analytical tic models for defining streams and continuous queries to process Optimiser Optimiser Optimiser them. There is no agreement even on the definition of basic terms such as “stream” and “event”. Although most of existing systems Archipelago of CBGP-Stores are based on a triple stream model, it is one of the many possi- ble points that shows the dependence of RSP engines on relational models. Hence, the existence of a wide variety of streaming ap- Static Island of CBGP-Stores Alive Island CBGP-Stores Deceased island of CBGP-Stores plications not only introduces complexities in choosing the right engine for given applications, but also makes the application de- velopment and maintenance hard. The need of standardisation has Data Stream Sources been recently acknowledge by the W3C RSP working group and few of their initial propositions clear out the discussion toward the stream and event model. Furthermore, most of the existing RSP Figure 1: Layered Architecture of DIONYSUS systems use extended forms of SPARQL to embed streaming op- erators. However, due to the lack of precise semantics, the query results for these RSP engines are not even comparable [12, 22]. 4. DIONYSUS [Handling Volume and Velocity under Distributed Settings] In this section, we describe the components of our envisioned The problem of distributing data into a set of clusters, and then system called as DIONYSUS2 . Our system design is motivated by finding the query relevant data from all such distributed sources the following goals and is described in Fig. 1. at federation layer has been dealt in static settings [24, 9, 21, 28, (1). We are interested in an efficient distribution of streaming data 15]. In such cases, data is known in advance and its distribution from a set of sources, which are not known in advance. Thus, our is performed once. However, this is not the case in streaming set- storage and data distribution model is envisioned by Common Ba- tings; data is not known in advance for distribution analysis and sic Graph Pattern store (CBGP-store). Each CBGP-store is as- new sources are added dynamically and old sources provide data signed with a generic BGP (i.e., a set of triple patterns) generated at variable velocities. Distributing and storing such a dynamic sea automatically/manually from the domain ontology and domain use of information brings new challenges and thus needed to be dealt. cases (see Section 4.1). A collection of such stores exposes (i) Existing RSP systems are based on sliding window-based execu- fresh incrementally computed results of a set of CBGP for stream- tion strategy: data in the windows are processed and expelled once ing queries, (ii) the set of previously computed results for off-line a window is expired or it slides forward [22]. This first does not analytical queries. As we see in Section 4.1, this enables the use allow historical/analytical analysis on data. Second, the windows of light-weight incremental indexing technique to efficiently store are devised in an ad hoc manner, thus are optimised only for small data for each CBGP. bandwidth. [Incremental Indexing and Query Evaluation] The use of in- (2). The second goal of our approach is to push the intensive query dexing (in distributed RDF settings) enables two primary tasks: optimisation and processing locally at each CBGP-store for Exact (1) it determines the relevant distributed sources for a query, (2) Query Graphs (EQGs) registered by a user. An EQG is a more it tunes the NP-hardness of subgraph isomorphism – that is match- selective form of CBGP and it contains (i) subsets of triple pat- ing a query graph and the data graph – into a more subtle prob- terns that are distributed among CBGP-stores, (ii) SPARQL 1.1 op- lem; that is either joining the result of a set of triple patterns or erators, such as select, optional, union, filter, group exploring smaller subsets of query graph to join them for the final by, etc. The results of each EQGs are accumulated at the federated result [29]. Nevertheless, the good indexing strategies can prove level; Section 4.2 describes such techniques. to be quite significant for the performance of a system; we argue (3). Our third goal is to enable different kinds of queries – such based on our experience with existing storage solutions that index- as analytical, streaming, sequence-based – through a single query ing itself can become a bottleneck to store and query RDF graph interface. A query interface encompasses subsets of CBGP-stores streams. For example, it takes more than 10 hours to build an that can be abstracted as islands of CBGP-stores (see Fig. 1). This optimised state-of-the-art index over a dataset of 2 billion triples would enable to share query results computation and local optimi- on modern servers [26]. It is imperative therefore, to first develop sation strategies. For example, users would like to get the result of techniques that interactively and adaptively build parts of the index, (i) an analytical query describing the number of active appliances while focusing on the data necessary to answer queries and making and their power usage in a house, and the result of (ii) a sequence- the data available immediately. Second, the continuous streaming based query to determine the sequence of power usage by appli- queries require incremental evaluation model. This means, the stor- ances. This calls for a single query that can be optimised according age mechanism in this scenario must store the state of previously to its defined operators as presented in Section 5. computed results and the new entries in the stream should act as an (4). Our fourth aim would be to provide the semantic completeness update to the previously stored states. and locations transparency. That is, a new source can be added without affecting the integrity constraints, and a user query can 2 DistrIbuted aNalYtical, Streaming and Sequence qUerieS span multiple islands of CBGP-stores. This would enable to first, menting an incremental and adaptive indexing is one of the ma- share the optimising strategies defined for each CBGP-store, sec- jor challenges in storing and querying dynamic streams of RDF ond to reduce the network traffic by employing local optimisation graphs. and computation strategies. In summary, our envisioned system can provide a way not to A collection of CBGP-stores is abstracted under an island, where drown in the sea of information emanating from heterogeneous each island is assigned with a set of query conductors. Query con- distributed sources. It filters unnecessary information, which oth- ductors shares optimisation strategies through bolts (see Fig. 1). erwise can result in excessive use of storage and computational Each CBGP-store is divided into three flavours: static, alive or de- resources. The framework is designed to minimise the burden of ceased. This classification is based on the fact that a streaming query evaluation at the federation layer and to share local optimi- query requires the current incrementally computed results, while sation strategies across the islands of CBGP-stores. the analytical or predictive queries base their execution on histor- ical states of incrementally computed results. The static CBGP- 4.1 Common Basic Graph Pattern Stores stores are used to enrich the streams with the static background knowledge – one of the main property of RSP engines. Each CBGP- The main difference between the static and streaming data distri- store captures a fraction of data from different sources, thus the bution is its availability. The static data is available beforehand for storage of alive CBGP-store and query computation can be done analysis and thus various techniques such as semantic hash-based completely in the main-memory, while the deceased and static CBGP- pertaining [19] can be utilised to partition it into a set of clusters. stores can utilise disk-based storage. However, the streaming data can not be available in advance for distribution analysis. Thus, we envisioned an ontology-based data 4.2 Query Computation via Query Conduc- distribution. Inspired by works on ontology partitioning [25, 20], tors ontology modularisation [11] and ontology segmentation [8], we propose to generate a set of common basic graph patterns (CBGP) The query conductor layer is the working horse of the system. by analysing the structural relationships described in the domain This layer is responsible for conducting various high-level optimi- ontology/ontologies. Given a set of ontologies O = {o1 , o2 , . . . , on } sations and accumulating results of the user-defined queries on an used by a set of streams, a function F(O) produces a set of CBGP archipelago of CBGP-stores (see Fig. 1). When a customised user by analysing the structural relationships within ontologies and the query – called as an Exact Query Graph (EQG) – is registered, general use cases defined for a particular domain. Intuitively, we the query conductor first determines its type (analytical, stream- can say that each CBGP should contain information about a coher- ing or sequence-based query) and creates an abstract syntax tree ent subtopic within an ontology. The concepts within each CBGP (AST). It then divides the EQG into a set of subquery graphs, along- are semantically connected to each other and should not have strong with their temporal information, in case of streaming and sequence- dependencies with the information outside the CBGP. Thus, each based queries. It next utilises the information of its islands of CBGP = (C, D) contains a set of concepts C and links D between CBGP-stores and orchestrates the execution of each subquery graph them represent different kinds of dependencies; where D can be on a set of CBGP-stores. The query conductor itself does little reflected in the definition of an ontology or can be implied by the computation other than concatenating or temporal sorting of re- intuitive understanding of the concepts and background knowledge sults from CBGP-stores. This design permits us to take the advan- about the respective domain. Generally, CBGPs consist of differ- tages of local optimisation strategies implemented for each CBGP- ent types: star-shaped, shallow tree shaped, deep tree shaped, graph store. Note that EQGs are more specialised and selective versions with loops etc. Each CBGP is mapped to a data store; thus forming of CBGP, thus the set of triple patterns within each EQG can be a set of dataset clusters compose of data aggregated from a set of easily decomposed and registered to CBGP-stores. sources. Note that such data distribution may results in data du- Results of streaming and sequence-based EQGs are held in main- plication across the set of CBGP-stores. However the following memory buffers, and they are updated whenever there is a newly benefits would outweigh such shortcomings. computed incremental results in the corresponding alive CBGP- stores. The query conductor also ensures that the data movement (1) Aggregating commonly linked concepts within a single CBGP- is not too expensive; we will explore the tactics for moving results store. This would results in (i) querying processing to be focussed based on operators involved, processing activity at various CBGP- at local levels, and (ii) reducing the network traffic and load at fed- stores, and overall network activity. That is, if a certain EQG re- eration level. The data duplication among CBGP-stores ensure that sults in an increase network traffic or query processing at federation each query answer can be computed locally. level, we can create a new CBGP-store by analysing the structure of the EQGs. (2) Surviving in the sea of information by only filtering and storing the relevant information. The CBGP-stores can act as data filters, 5. QUERY OPTIMISATION where only the relevant data based on the selected graph relation- In this section, we discuss various opportunities offered by our ships will be stored and the full dataset from a set of sources is envisioned system to optimise the performance of different types of summarised by a set of CBGPs. The summarised data acts as a sur- queries: analytical, streaming and sequenced-based. rogate for the original data and are queried instead of the complete dataset. 5.1 Analytical Queries Analytical queries for historical data analysis usually contains (3) In-memory incremental evaluation of the CBGPs; storing only multiple aggregation phases. For example, the query to find the the states of the computed results, rather than all the data elements average power consumption by each house grouped by area is de- from a set of sources. scribed in Query 1 (using SPARQL3 syntax). It contains three spec- ifications: (i) graph pattern matching to compute the query-related (4) It enables the query-aware light-weight and adaptive indexing 3 technique for storing CBGPs results. As described earlier, imple- http://www.w3.org/TR/sparql11-query/ subgraphs corresponding to the power consumption of a house and The scalability of continuous streaming queries are case depen- area information, (ii) grouping the resulting patterns based on the dent; there are two main flavours discussed in the literature. First, values of area-house combinations, and (iii) aggregating the val- scaling a large number of queries by distributing their execution. ues of the power consumption to compute the average. Now con- Second, scaling a complex query that needs large working mem- sider that the house and its power consumption values lies in a ory and might not fit within a single machine. The first case is CBGP-store D1 , while information regarding the area lies in an- easy to handle; a shared nothing execution architecture – i.e., nei- other CBGP-store D2 , and there can be n CBGP-stores. ther streams or memory storage is shared among processors – can be utilised to run multiple instances of the streaming engine, each Query 1. Analytical query for Smart Grid use case running a subset of the queries. However, scaling complex queries SELECT ?area, ?house, AVG(?power) (iii) is still an open issue. WHERE Query 2. Streaming query for Smart-Grid use case { ?house :location ?l. SELECT ?power, ?house, ?temp, ?Wspeed, ?hum ?house :powerSource ?source. (i) WINDOW 2 HOURS ?source :value ?power. WHERE { ?l :partOf ?area. STREAM [Range 2s] ?area :name ?areaName. { } ?house :location ?l. GROUP BY (?area) (ii) ?house :powerSource ?source. ?source :value ?power. } Traditionally in federated settings, the two basic optimisation STREAM [Range 2s] techniques to compute such a query graph are: (i) compute each { triple pattern in a query graph against all the available data stores ?l :temperature ?temp. ?l :windSpeed ?Wspeed. and the results are joined at the server or (ii) evaluating each triple ?l :humidity ?hum. pattern in a nested loop join (NLJ) fashion; that is by substitut- } ing the results obtained from one triple pattern into another. These } techniques and their optimised version [24] performs poorly for highly selective and complex analytical queries. [16] provides var- We argue that our system design could handle both cases effi- ious optimisation strategies to cater analytical queries for federated ciently with the following points. (1) We can utilise multiple query static data sources and can act as a good starting point to extend it conductors to serve a set of EQGs, where query conductors can for dynamic settings, albeit in a different way. Since previous ap- easily be distributed into a set of machines. (2) The main pro- proaches assumed a cost-based model, they would not be efficient cessing of a EQG is performed in a distributed manner on a set to support the addition of new sources and the generation of new of CBGP-stores. That is, a EQG is parsed into a set of subquery CBGP-stores. The reason is that, they would require to maintain a graphs, each is assigned to a CBGP-stores. The query execution model of each subquery graph operation and the resources it needs. results are stored in the distributed cache and then joined on com- That is, essentially the query optimiser need to understands all the mon variables (i.e., ?l in Query 2) to produce the final results; the operations and storage techniques in all the distributed resources, static CBGP-stores further provide the functionality of joining the while continuously updating it for dynamic streams. static background knowledge with the streams. Such query execu- At opposite, we envisioned a black box approach, where no in- tion strategy enables to share the results of common query graph formation about the local query optimisations is known at the server patterns within a set of EQGs. Furthermore, there is also the case level. Each subquery graph of EQGs is sent to the corresponding of incremental evaluation of the query result. Traditionally, Query CBGP-store, where a local optimiser determines the subquery join 2 is executed in a re-evaluation manner [22, 18]. That is, the results operations. Using the same example as described above, we can for each subquery graph are indexed (usually B+ tree) and for each compute the graph pattern matching – which is typically join inten- new event, (i) the result of each subquery graph is re-evaluated, (ii) sive – and aggregation of values at D1 store. Similarly, the basic the join on the common variable is re-evaluated, and (iii) all the graph pattern matching for area-related query graph patterns will computation is performed in an ad-hoc manner. Therefore, exist- be performed at D2 store. The results of both of these local pro- ing approaches are unable to support the scalability, performance cesses are sent back to the analytical query optimiser, which uses and incremental evaluation requirements posed by the streaming the cardinality of the results from D1 and D2 to efficiently order queries. the joins between house and area results, and finally performs the grouping. Our black-box approach also allows the semantic com- 5.3 Sequence-based Queries pleteness and location transparency; new sources and CBGP-stores The distribution of sequence-based queries is an open research can be easily updated without remodelling/recalculating local opti- issue and there seems to be no effort in the context of the semantic misation strategies. web community. A sequence-based query determines a data se- quence, which nin our case is an ordered sequence of RDF graphs. 5.2 Continuous Streaming Queries o Formally, S = (G1 , t1 ), (G2 , t2 ), . . . (Gn , tn ) is defined as a sequence Continuous streaming queries are typically registered on a set of of n RDF graphs, where each graph contains a set of triples and a sources and the computation of the results are bounded by a win- timestamp ti in which the recording was made. Given a set of sub- dow – which slides by certain elements count or time interval. For query graphs (QG) and an ordering/sequencing function O : QG → example, see Query 2 (using CQELS [18] syntax). It determines N, we determine the ordering of RDF graphs matched to the sub- the subquery graph for power consumption on a source s1 and the query graphs. A key observation in this case is that we need to weather related subquery graph on source s2 , within a window of 2 execute or assemble the results of each subquery graphs in a way hours that slides every 2 second. The results of the subquery graphs that it follows the ordering defined for QGs. For example, consider are joined on a common variable ?l. Query 3 (using IntellCEP syntax [14]), it determines the sequence of RDF graph events (SEQ(A,B)), where the power consumption 7.[1] https://itea3.org/project/seas.html. REFERENCES of a house is greater than a certain threshold value, followed by [2] https://www.w3.org/community/rsp/wiki/RDF_Stream_Models. specific weather conditions. Priminilary draft Link "https://goo.gl/iLQscV". Existing approaches [4] for executing such queries have various [3] T. N. 0001 and G. Weikum. Rdf-3x: a risc-style engine for rdf. shortcomings; (i) they utilise a triple-based model, where only one PVLDB, 1(1):647–659, 2008. triple is permitted in each event, (ii) they utilise ad hoc settings, [4] D. Anicic and Fodor. EP-SPARQL: a unified language for event where expensive indexing is used for ordering functions, and (iii) processing and stream reasoning. In WWW, 2011. they are based on a single stream model. These shortcoming makes [5] M. Atre and Chaoji. Matrix "bit" loaded: A scalable lightweight join them unsuitable for many real-world use cases, where distribution query processor for RDF data. In WWW, pages 41–50, 2010. is the key to cater huge volumes of data. [6] M. Atre and J. A. Hendler. BitMat: A main memory bit-matrix of rdf triples. In In: Proceedings of the 5th International Workshop on Query 3. Sequence-based query for Smart Grid use case Scalable Semantic Web Knowledge Base Systems., 2009. SELECT ?house,?l,?power [7] D. F. Barbieri and Braga. C-SPARQL: Sparql for continuous WITHIN 24 hours querying. In WWW, pages 1061–1062, 2009. PARTITION BY (?house) [8] M. Bhatt, C. Wouters, and Flahive. Semantic completeness in FROM STREAM S1 sub-ontology extraction using distributed methods. In ICCSA. 2004. FROM STREAM S2 [9] C. Buil-Aranda, M. Arenas, O. Corcho, and A. Polleres. Federating WHERE queries in {SPARQL} 1.1: Syntax, semantics and evaluation. Web { SEQ (A, B) Semantics: Science, Services and Agents on the World Wide Web, 18(1):1 – 17, 2013. Special Section on the Semantic and Social Web. A ON S1 [10] J.-P. Calbimonte, O. Corcho, and A. J. G. Gray. Enabling { ontology-based access to streaming data sources. In ISWC’10. ?house :location ?l. [11] M. d’Aquin, M. Sabou, and E. Motta. Modularization: a key for the ?house :powerSource ?source. dynamic selection of relevant knowledge components. In 1st ?source :value ?power. FILTER (?power > 50) International Workshop on Modular Ontologies, WoMO’06, 2006. } [12] D. DellâĂŹAglio, J.-P. Calbimonte, and Balduini. On correctness in rdf stream processor benchmarking. In The Semantic Web âĂŞ ISWC B ON S2 2013. Springer Berlin Heidelberg, 2013. { [13] R. C. Fernandez, M. Weidlich, P. Pietzuch, and A. Gal. Scalable stateful stream processing for smart grids. In DEBS, pages 276–281, ?l :temperature ?temp. ?l :windSpeed ?Wspeed. New York, NY, USA. ACM. ?l :humidity ?hum. [14] S. Gillani, G. Picard, F. Laforest, and A. Zimmermann. Towards FILTER (?temp > 20 && ?Wspeed > 10) Efficient Semantically Enriched Complex Event Processing and Pattern Matching. In OrdRing 2014 @ ISWC 2014, Trentino, Italy, } Oct. 2014. [15] S. Gurajada, S. Seufert, I. Miliaraki, and M. Theobald. TriAD: A } distributed shared-nothing rdf engine based on asynchronous Our approach of distributed CBGP-stores can however be useful message passing. In SIGMOD, 2014. in this scenario. Based on our earlier optimisation techniques, we [16] D. Ibragimov, K. Hose, T. B. Pedersen, and E. Zimányi. Processing aggregate queries in a federation of SPARQL endpoints. In ESWC, can send each subquery graph to the corresponding alive CBGP- pages 269–285, 2015. stores. The computation of graph pattern matching and aggregation [17] S. Komazec and D. Cerri. Sparkwave: Continuous schema-enhanced on subquery graph is computed locally, while the results along-with pattern matching over rdf data streams. In DEBS, 2012. their temporal properties can be utilised to determine the sequence [18] D. Le-Phuoc and Dao-Tran. A native and adaptive approach for defined on a set of events. That is, the query evaluation is bro- unified processing of linked streams and linked data. In ISWC. 2011. ken into several steps in a pipeline that matches the events with [19] K. Lee and L. Liu. Scaling queries over big rdf graphs with semantic a subquery graph and republish the matched events to a step fur- hash partitioning. Proc. VLDB Endow., 2013. ther in the pipeline. The sequence of events within a pipeline can [20] B. MacCartney, S. McIlraith, and Amir. Practical partition-based theorem proving for large knowledge bases. In IJCAI, 2003. be matched by either utilising a rule-based system or an automata- [21] K. Makris, N. Bikakis, N. Gioldasis, and S. Christodoulakis. based approach. Crucially, this step requires deep insight into the SPARQL-RW: Transparent query access over mapped rdf data temporal measurements and state management for aggregate oper- sources. In EDBT, pages 610–613, New York, NY, USA, 2012. ACM. ators. [22] A. Margara, J. Urbani, F. van Harmelen, and H. Bal. Streaming the web: Reasoning over dynamic data. Web Semantics: Science, 6. CONCLUSION Services and Agents on the World Wide Web, 25(0):24 – 44, 2014. [23] Y. Nenov, R. Piro, B. Motik, I. Horrocks, Z. Wu, and J. Banerjee. Even though the scalability, state management and distribution RDFox: A highly-scalable RDF store. In ISWC. Springer, 2015. of sources are very common requirements for RSP system, there is [24] A. Schwarte, P. Haase, and M. Schmidt. FedX: A federation layer for currently no system that can provide a generic solution to accom- distributed query processing on linked open data. In ESWC’13. modate these attributes. In this paper, we summarised the chal- [25] H. Stuckenschmidt and M. Klein. Structure-based partitioning of lenges and opportunities provided by a distributed general purpose large concept hierarchies. In The Semantic Web âĂŞ ISWC 2004, system for RDF graph stream processing. We propose DIONYSUS volume 3298, pages 289–303. Springer Berlin Heidelberg, 2004. that will provide one query interface to enable analytical, streaming [26] Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li. Efficient subgraph matching on billion node graphs. Proc. VLDB Endow., 2012. and sequence-based queries. It will support islands of data stores [27] P. Yuan, P. Liu, B. Wu, H. Jin, W. Zhang, and L. Liu. Triplebit: A fast each assigned to a CBGP; thus filtering and indexing the RDF and compact system for large scale rdf data. VLDB’13. graphs in an incremental manner. Furthermore, the use of such [28] K. Zeng, J. Yang, H. Wang, B. Shao, and Z. Wang. A distributed system allows to share optimisation strategies and query matches graph engine for web scale rdf data. In PVLDB’13. within a set of queries; leading to the desired scalability require- [29] L. Zou and M. Tamer. gStore: a graph-based SPARQL query engine. ment presented by real-world systems. VLDB J., pages 565–590, 2014.