Towards Enriching CQELS with Complex Event Processing and Path Navigation Minh Dao-Tran1 and Danh Le-Phuoc2 1 Institute of Information Systems, Vienna University of Technology Favoritenstraße 9-11, A-1040 Vienna, Austria dao@kr.tuwien.ac.at 2 Insight Centre for Data Analytics, National University of Ireland, Galway danh.lephuoc@nuigalway.ie Abstract. The increasing popularity of RDF Stream Processing has led to the developments of several RSP engines. Among these, CQELS has been developed with a native and adaptive approach, which gives it a performance advantage over other engines. However, it currently does not support two important features, namely Complex Event Process- ing and RDFS reasoning. We propose in this paper an extension of the CQELS query language and semantics towards enriching CQELS with these attractive functionalities. Keywords: RDF Stream Processing, Complex Event Processing, Path Navigation 1 Introduction Following the trend of using RDF as a unified data model for integrating diverse web data sources across systems under different controls, RDF Stream Processing (RSP) extends RDF to addresses the challenges in querying heterogeneous data streams coming from dynamic data sources of emerging ICT systems such as IoT and Smart Cities. This research field has being increasingly getting more attention with the developments of several RSP engines such as C-SPARQL [5], CQELS [17], EP-SPARQL [3], SPARQLStream [7], and Sparkwave [13] which mainly aim at providing stream processing functionalities. Among these engines, CQELS engine was designed to achieve the strict per- formance requirements of stream processing engines [20] with native physical query operators. The current implementation of CQELS engine provides high throughput physical operators to cover the combination of CQL operators and SPARQL 1.1 with CQELS query language (CQELS-QL). It has been shown to have a performance advantage over other RSP engines [18]. However, CQELS currently does not support two important features, namely Complex Event Processing (CEP) and RDFS reasoning. The former play an im- portant role in handling and analyzing complex relation over high volume of dynamic data [1], while the latter allows reasoning about types and brings addi- tional expressiveness for RSP. Having them in CQELS will enable a new range of `2 Gulda Lane (g) Haydn Park. (h) m ,m h h y, y, y, ay la la la Mozart C. (m) Center Sq. (c) l de de de de 1, 1, 1, 1, d d a a `1 Beethoven St. (b) `3 10 12 14 16 Fig. 1: Public Transportation Scenario applications that require not only high performance but also high expressiveness, for example, as in the motivating scenario below. Towards a more powerful CQELS, in this paper, we enrich the current query language of CQELS to support the desired features. Our contributions are: – extending the RDF stream processing model to work on time intervals in- stead of time points; – proposing an extension of the CQELS-QL to handle navigational paths and complex event processing. For the former, we make use of nSPARQL [16] and SPARQL 1.1 property paths; for the latter, we start the sequence oper- ator SEQ, which is the most popular CEP operator in practice; – giving a semantics of the new language in the SPARQL style, by lifting the join and sequence-related operators to work on sets of mappings with inter- vals, and introducing evaluation functions that make use of these operators. The proposed new language features and semantics will be illustrated with the following scenario. Motivating Scenario. Suppose that the traffic center at city of Vienna wants to improve the quality of public transportation by means of smart services. For this purpose, the center has background datasets regarding available vehicles such as subways, trams, buses, and connections between every pair of consecutive stops, including the stop names, types of vehicles and the duration of time needed to travel between them. Moreover, the center receives sensor data reporting the arrival and delay of vehicles with respect to stops. Passengers who register to the smart service also provide their locations as input streams. On top of these background and streaming data, the center can offer smart services (i) to aid traffic officers in monitoring and quickly identifying potential problems in the transportation network or (ii) to notify passengers of potential traffic delays and recommend alternative routes. Example 1 To make sure smooth connections to the city center, a continuous query can be placed to notify about recent repeated delays of subways follow- ing by no arrival at stops that according to the network plan can be reached by subways. When such information is instantly available to an officer, he/she can immediately react by triggering rerouting or providing complementary vehicles to solve the traffic. Take a simplified public transportation map in Figure 1 where `1 and `2 are subway lines, and `3 is a tram line. Assume that delays and arrivals of vehicles wrt. stops are reported along the time lines. At time point 14, a repeated delay for the subway d1 on the way to h is detected, but this is not reported as getting from h to c needs to use the tram line `3 . At time point 16, the repeated delay for the subway a1 at m is however reported as one can get from m to c by subways.  2 Preliminaries This section briefly reviews the basic building blocks of this work, namely RDF, SPARQL, RDF stream processing, the navigation language nSPARQL, and CEP. 2.1 RDF and SPARQL RDF is a W3C recommendation for data interchange on the Web [8]. It models data as directed labeled graphs whose nodes are resources and edges represent relations among them. Each node can be a named resource (identified by an IRI), an anonymous resource (a blank node), or a literal. We denote by I, B, L the sets of IRIs, blank nodes, and literals, respectively. Let IB = I ∪ B, IBL = I ∪ B ∪ L. A triple (s, p, o) ∈ IB × I × IBL) is an RDF triple, where s is the subject, p the predicate, and o the object. An RDF graph is a set of RDF triples. Example 2 (cont’d) The background dataset in Example 1 can be represented by the following RDF graph:     :conn1 :beg :m, :conn1 :end :b, :conn1 :means :subway, :conn1 :dur :3m,        :conn2 :beg :b, :conn2 :end :c, :conn2 :means :subway, :conn2 :dur :2m,         :conn 3 :beg :h, :conn3 :end :g, :conn 3 :means :subway, :conn 3 :dur :3m,      :conn4 :beg :g, :conn4 :end :c, :conn4 :means :tram,  :conn4 :dur :5m,  D= .. .           :a1 rdf:type :subway,    :d1 rdf:type :subway,    .       ..   A triple pattern is a tuple (sp, pp, op) ∈ (IB ∪ V )×(I ∪ V )×(IBL ∪ V ), where V is a set of variables. A basic graph pattern is a set of triple patterns. SPARQL [12], a W3C recommendation for querying RDF graphs, is essen- tially a graph-matching query language. A SPARQL query is of the form H ← B, where B, the body of the query, is a complex RDF graph pattern composed of basic graph patterns with different algebraic operators such as UNION, OPTIONAL, etc.; and H, the head of the query, is an expression that indicates how to con- struct the answer to the query [15]. The semantics of SPARQL is defined via mappings. A mapping is a partial function µ : V → IBL. The result of a SELECT SPARQL query is a set of mappings that match the query’s body, projected to the variables specified in the SELECT clause. However, one-shot queries by themselves are not able to give answers under dynamic input as in the running scenario. For this purpose, we need RDF stream processing. 2.2 RDF Stream Processing RDF Streams and Temporal RDF Graphs. In continuous query process- ing over dynamic data, the temporal nature of the data is crucial and needs to be captured in the data representation. This applies to both Linked Stream Data and Linked Data, as updates in Linked Data collections are also possible. In [17], RDF streams and instantaneous RDF datasets were defined by intro- ducing timestamps (time points) to input triples of the former and RDF graphs of the latter. In this paper, we adapt these notions by using time intervals as timestamps. Thereby, 1. An RDF graph at timestamp t, denoted by G(t), is a set of RDF triples valid at time t and called an instantaneous RDF graph. A temporal RDF graph is a sequence G = [G(t)], t ∈ N, ordered by t. 2. An RDF stream S is a sequence of elements hg : [t1 , t2 ]i, where g is an RDF graph and [t1 , t2 ] is a time interval. Example 3 (cont’d) The sensor data regarding delay and arrival of vehicles as in Figure 1 can be represented as the following RDF stream.3 h{(a1 , delay, m)}, [10, 10]i, h{(d1 , delay, h)}, [12, 12]i, S= h{(d1 , delay, h)}, [14, 14]i, h{(a1 , delay, m)}, [16, 16]i, . . .  Continuous Queries. Queries in CQELS are inspired by the Continuous Query Language (CQL) [4], where a continuous query is composed from three classes of operators, namely stream-to-relation (S2R), relation-to-relation (R2R), and relation-to-stream (R2S) operators. In CQELS, the former are captured by ex- tending SPARQL 1.1 grammar4 with a “stream graph” pattern, while the latter are taken care of by SPARQL’s operators. For more details on the query syntax, we refer the reader to [17]. Example 4 (cont’d) The following continuous query in the CQELS-QL noti- fies stops with delays of subways during the last 10 minutes to users: 1 SELECT ?s 2 FROM ex : transportationMap 3 FROM NAMED WINDOW : W ON ex : publicTransport [ RANGE 10 m ] 4 WHERE { 5 WINDOW : W { ? v : delayAt ? s } 6 ? v rdf : type : subway . 7 } 3 Note that the notification of delay or arrival here is considered atomic event; there- fore, the time intervals associated with this data is of the form [t, t]. On the other hand, the output of a CQELS query containing non-atomic events (cf. Example 9) can be fed as input stream to another query. In such situation, input events are associated with time interval [t1 , t2 ] with t1 ≤ t2 . 4 http://www.w3.org/TR/sparql11-query/#grammar [[self :: a]]G = {(a, a)} [[next :: a]]G = {(x, y) | ∃z : (x, z, y) ∈ G} [[axis −1 :: a]]G = {(x, y) | (y, x) ∈ [[axis :: a]]G }, where axis ∈ {next, node, edge} .. . Table 1: Formal semantics of nested regular expressions 2.3 Navigating RDF Graphs with nSPARQL For a data model with graph structure like RDF, being able to navigate through the graphs is fundamentally important. In [16], the authors proposed nSPARQL, a language that incorporates navigational capabilities to a fragment of SPARQL using nested regular expressions. nSPARQL allows to pose interesting and nat- ural queries over RDF data, and has an attractive computational property that nested regular expressions can be efficiently evaluated in polynomial time. The new SPARQL 1.1 query language introduced property paths 5 that covers a fragment of nSPARQL without nesting. In this paper, we augment CQELS-QL with the full functionalities of nSPARQL based on the syntax of SPARQL 1.1. We now briefly recall nSPARQL, the following grammar defines the syntax of nested regular expressions: exp ::= axis | axis :: a(a ∈ IBL) | axis :: [exp] | exp/exp | exp|exp | exp ∗ | exp + , where axis ∈ {self, next, next−1 , edge, edge−1 , node, node−1 }. The evaluation of a nested regular expression exp in a graph G is formally defined as a binary relation [[exp]]G , denoting the pairs of nodes (x, y) such that y is reachable from x in G by following a path that conforms to exp. Table 1 partly shows the formal semantics of nSPARQL on constructs that are used in our running example. For the full semantics, we refer the reader to [16]. Example 5 (cont’d) To find all stops from which one can reach the city center by subway connections, we can use the following expression: ?s (next−1 ::beg[next::means/self::subway]/next::end)+ :c.  Based on nSPARQL, we define an extended triple pattern as either a triple pat- tern or a triple (sp, exp, op), where sp, op ∈ I ∪ L ∪ V and exp is a nested regular expression. An extended graph pattern P is a set of extended triple patterns. 2.4 Complex Event Processing Complex Event Processing [14] emerged from publish-subscribe systems [19]. While the latter refer only to single isolated events, CEP aims at timely detecting 5 http://www.w3.org/TR/sparql11-query/#propertypaths high-level events as complex patterns of incoming single atomic events whose order is crucial. The defined complex events can in turn be used to compose even more complex events and so forth. To express ordering between events, CEP languages assign time intervals as timestamps for events and make use of operators rooted from Allen’s in- terval algebra [2]. In this paper, we take the first step to incorporate CEP into CQELS-QL by adopting the sequencing operator SEQ from the SASE sys- tem [21] and applies it to time intervals. The version of SEQ with time points in SASE can be briefly described as follows. Let Ai be an event type. Its semantics represented as Ai (t), is that at a given point t in time, Ai (t) is True if an Ai type event occurs at t, and is False otherwise. The SEQ operator takes a list of n > 1 event types as its parameters, e.g., SEQ(A1 , . . . , An ) and specifies a particular order in which the events of interest should occur. Formally speaking: SEQ(A1 , . . . , An ) ≡ ∃ t1 < t2 < . . . < tn : A1 (t1 ) ∧ A2 (t2 ) ∧ . . . ∧ An (tn ). When an event type is negatively specified in the sequence, that is: SEQ(A1 , . . . , Ai−1 , !Ai , Ai+1 , . . . , An ), this corresponds to the SEQ WITHOUT operator (abbreviated as SEQ WO in this paper), which intuitively detects sequences of events of types A1 , . . . , Ai−1 , Ai+1 , . . . , An where no event of type Ai occurs in between two events of types Ai−1 and Ai+1 . Formally: SEQ(A1 , . . . , Ai−1 , !Ai , Ai+1 , . . . , An ) ≡ ∃t1 < . . .