Towards Comparing RDF Stream Processing Semantics

Minh Dao-Tran, Harald Beck, and Thomas Eiter
Institute of Information Systems, Vienna University of Technology
Favoritenstraße 9-11, A-1040 Vienna, Austria
{dao,beck,eiter}@kr.tuwien.ac.at

(This research has been supported by the Austrian Science Fund (FWF) projects P24090, P26471, and W1255-N23.)

Abstract. The increasing popularity of RDF Stream Processing (RSP) has led to the development of data models and processing engines that diverge in several aspects, ranging from the representation of RDF streams to their semantics. Benchmarking systems such as LSBench, SRBench, and CSRBench were introduced as attempts to compare different approaches. However, these works mainly concentrate on operational aspects. The recent logic-based LARS framework provides a theoretical underpinning for analyzing stream processing/reasoning semantics. Towards comparing RSP engines at the semantic level, in this paper we pick two representative RSP engines, namely C-SPARQL and CQELS, and propose translations from their languages and execution modes into LARS. We show the faithfulness of the translations and discuss how they can be exploited to provide a formal analysis and comparison of RSP semantics.

Keywords: RDF Stream Processing, Linked-Stream Data, Semantics Comparison

1 Introduction

Within the Semantic Web research area, RDF Stream Processing (RSP) recently emerged to address challenges in querying heterogeneous data streams. This has led to developments of data models and processing engines, which diverge in several aspects, ranging from the representation of RDF streams and execution modes to semantics [2, 15, 7, 6, 11, 18]. Thus, the RSP community (https://www.w3.org/community/rsp/) was formed to establish a W3C recommendation. A standardization effort must start from identifying the differences between existing approaches, and thus comparing RSP engines is an important topic.

Initial empirical comparisons were carried out in SRBench [19] and LSBench [16]. The former defined only functional tests to verify the query language features supported by the engines, while the latter measured the mismatch between the outputs of different engines. Later on, CSRBench [9] introduced an oracle that pregenerates the correct answers wrt. each engine's semantics, which are then used to check the output returned by the engine. This approach, however, allows only a partial comparison between engines, by referring to their idealized counterparts.

Due to the lack of a common language to express divergent RSP approaches, the three works above could only look at the output of the engines and had no further means to explain, beyond the output, what caused the differences semantically. Recently, [10] proposed a unifying query model to explain the heterogeneity of RSP systems. It shows a difference between two approaches, represented on the one hand by C-SPARQL [2] and SPARQLStream [7], and on the other hand by CQELS [15], which are representative engines in the RSP community. This work is based on extending the SPARQL semantics to the stream setting.

Recently, a Logic-based framework for Analyzing Reasoning over Streams (LARS) was introduced [5]. LARS can be used as a unifying language into which stream processing/reasoning languages can be translated. It may serve as a formal host language to express semantics and thus allows a deeper comparison that goes beyond merely looking at the output of the respective engines.
In this paper, towards comparing RSP engines at the semantic level, we pick C-SPARQL and CQELS, and propose

– translations that capture the push- and pull-based execution modes for general LARS programs, and
– translations from the query languages of C-SPARQL and CQELS to LARS, based on a well-known translation from SPARQL to Datalog [17].

We show the faithfulness of the translations and discuss how they can be exploited to provide a formal analysis and comparison of RSP semantics.

2 Preliminaries

This section briefly reviews RDF, SPARQL, RSP, and LARS, which will be illustrated using the following running scenario inspired by [10].

Example 1 The Sirius Cybernetics Corporation offers shop owners a real-time geo-marketing solution (RTGM) to increase their sales. RTGM provides two services: (i) an application that allows shop owners to push instantaneous discount coupons to a server, and (ii) a free mobile app that fetches the coupons from shops near the phone, matches them with the preferences specified in the user's shopping profile, and delivers the matched coupons to the user.

Alice and Bob own shops a and b that sell shoes and glasses, respectively. At time point 10, Alice sends out a coupon for a 30% discount on men's MBT shoes. At time 15, Bob sends out a coupon for a 25% discount on Ray-Ban glasses. Claire has the app installed on her mobile phone and is walking near shops a and b from time 18. She is interested neither in discounts on men's products nor in discounts of less than 20%. Therefore, she will get only the discount from shop b.

2.1 RDF and SPARQL

RDF is a W3C recommendation for data interchange on the Web [8]. It models data as directed labeled graphs whose nodes are resources and whose edges represent relations among them. Each node can be a named resource (identified by an IRI), an anonymous resource (a blank node), or a literal. We denote by I, B, L the sets of IRIs, blank nodes, and literals, respectively. A triple (s, p, o) ∈ (I ∪ B) × I × (I ∪ B ∪ L) is an RDF triple, where s is the subject, p the predicate, and o the object. An RDF graph is a set of RDF triples.

Example 2 (cont'd) Information in the scenario of Ex. 1 about products, offers from shops, and Claire's relative location to shops can be stored in the following RDF graphs:

G  = { "mbt" :g_classify 1.  "rayban" :g_classify 0.  . . . }
g1 = { :a :offers :c1.  :c1 :on "mbt".  :c1 :reduce 30. }
g2 = { :b :offers :c2.  :c2 :on "rayban".  :c2 :reduce 25. }
g3 = { "claire" :isNear :a.  "claire" :isNear :b. }

A triple pattern is a tuple (sp, pp, op) ∈ (I ∪ B ∪ V) × (I ∪ V) × (I ∪ B ∪ L ∪ V), where V is a set of variables. A basic graph pattern is a set of triple patterns.

SPARQL [12], a W3C recommendation for querying RDF graphs, is essentially a graph-matching query language. A SPARQL query is of the form H ← B, where B, the body of the query, is a complex RDF graph pattern composed of basic graph patterns combined with algebraic operators such as UNION, OPTIONAL, etc.; and H, the head of the query, is an expression that indicates how to construct the answer to the query [14].

However, SPARQL is not able to give answers under dynamic input as in the running scenario. For this purpose, we need RSP.
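To make the notions above concrete, the following minimal Python sketch represents the graphs of Example 2 as sets of triples and matches a basic graph pattern against them by naive enumeration of variable bindings; the encoding of IRIs and literals as plain strings and numbers is only illustrative and not tied to any RSP engine.

```python
# Minimal sketch: RDF graphs as sets of triples and naive basic graph
# pattern matching, using the data of Example 2. Names are illustrative.

G  = {("mbt", ":g_classify", 1), ("rayban", ":g_classify", 0)}
g1 = {(":a", ":offers", ":c1"), (":c1", ":on", "mbt"), (":c1", ":reduce", 30)}
g2 = {(":b", ":offers", ":c2"), (":c2", ":on", "rayban"), (":c2", ":reduce", 25)}
g3 = {("claire", ":isNear", ":a"), ("claire", ":isNear", ":b")}

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def match_triple(pattern, triple, binding):
    """Extend a variable binding so that pattern matches triple, or return None."""
    binding = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            if p in binding and binding[p] != t:
                return None
            binding[p] = t
        elif p != t:
            return None
    return binding

def match_bgp(bgp, graph):
    """All bindings under which every triple pattern of the BGP occurs in the graph."""
    bindings = [{}]
    for pattern in bgp:
        bindings = [b2 for b in bindings for t in graph
                    if (b2 := match_triple(pattern, t, b)) is not None]
    return bindings

bgp = [("?shop", ":offers", "?coupon"), ("?coupon", ":reduce", "?percent")]
print(match_bgp(bgp, g1 | g2))   # two answers, one per shop
```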
2.2 RDF Stream Processing

RDF Streams and Temporal RDF Graphs. These two notions are introduced to extend Linked Stream Data and Linked Data with temporal aspects:

1. An RDF graph at timestamp t, denoted by G(t), is a set of RDF triples valid at time t, called an instantaneous RDF graph. A temporal RDF graph is a sequence G = [G(t)], t ∈ ℕ = {0, 1, 2, . . .}, ordered by t.
2. An RDF stream S is a sequence of elements ⟨g : [t]⟩, where g is an RDF graph and t is a timestamp.

Example 3 (cont'd) An input stream S of our running scenario is a sequence of elements ⟨g_i : [t_i]⟩, where g_i, representing an offer, is of the form in Ex. 2 and t_i can be either (i) the time point when the offer is announced by a shop owner (application time), or (ii) the time point when g_i arrives at an RSP engine (system time).

Continuous Queries. Continuous queries are registered on a set of input streams and background data, and continuously send out answers as new input arrives on the streams. There are two modes to execute such queries. In pull-based mode, the system is scheduled to execute periodically, independent of the arrival of data and its incoming rate; in push-based mode, the execution is triggered as soon as data is fed into the system.

Continuous queries in C-SPARQL and CQELS follow the approach of the Continuous Query Language (CQL) [1], in which queries are composed of three classes of operators, namely stream-to-relation (S2R), relation-to-relation (R2R), and relation-to-stream (R2S) operators. In the context of RSP, S2R operators are captured by windows on RDF streams, R2R operators are realized by SPARQL operators, and R2S operators convert the "pure" SPARQL output after R2R into output streams.

As CQL is based on SQL, the background data tables and input streams all have schemas. This makes it crystal clear which input tuple comes from which stream. On the other hand, as RDF is schema-less, this distinction is not straightforward to obtain. RSP engines use different approaches to build the snapshot datasets for R2R evaluation [10]:

(B1) C-SPARQL merges snapshots of the input streams into the default graph,
(B2) CQELS directly accesses the content of the input streams by introducing a new "stream graph" pattern in the body of the query.

Example 4 (cont'd) A continuous query to notify Claire of instantaneous coupons matching her preferences can be expressed in C-SPARQL and CQELS as follows. For readability, we write prefixed names such as :offers instead of full IRIs, and abbreviate the IRIs of the static dataset and of the two input streams as ⟨...⟩.

SELECT ?shop ?product ?percent
FROM ⟨...⟩
FROM STREAM ⟨...⟩ [RANGE 30m]
FROM STREAM ⟨...⟩ [RANGE 5m]
WHERE {
  ?shop :offers ?coupon.
  ?coupon :reduce ?percent.
  ?coupon :on ?product.
  ?user :isNear ?shop.
  ?product :g_classify ?gender.
  FILTER (?percent >= 20 && ?gender != 1) }

Q1: Notification query in C-SPARQL

SELECT ?shop ?product ?percent
FROM ⟨...⟩
WHERE {
  STREAM ⟨...⟩ [RANGE 30m] {
    ?shop :offers ?coupon.
    ?coupon :reduce ?percent.
    ?coupon :on ?product. }
  STREAM ⟨...⟩ [RANGE 5m] {
    ?user :isNear ?shop. }
  ?product :g_classify ?gender.
  FILTER (?percent >= 20 && ?gender != 1) }

Q2: Notification query in CQELS

2.3 Logic-oriented View on Streams, Windows and Time Reference

We will gradually introduce the central concepts of LARS [5], tailored to the considered fragment. We distinguish extensional atoms A_E for input data and intensional atoms A_I for derived information. By A = A_E ∪ A_I, we denote the set of atoms.

Definition 1 (Stream) A stream S = (T, υ) consists of a timeline T, which is an interval in ℕ, and an evaluation function υ : ℕ → 2^A. The elements t ∈ T are called time points.

Intuitively, a stream S associates with each time point a set of atoms. We call S a data stream if it contains only extensional atoms.
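As an illustration of Definition 1, the following small Python sketch (an assumption of this text, not part of LARS or of any engine) represents a stream as a timeline together with an evaluation function, instantiated with the data of the running scenario.

```python
# A minimal sketch of Definition 1 (stream = timeline + evaluation function),
# assuming atoms are represented as plain tuples (predicate, arguments...).
from dataclasses import dataclass, field

@dataclass
class Stream:
    timeline: range                                    # an interval of N, e.g. range(0, 51)
    evaluation: dict = field(default_factory=dict)     # time point -> set of atoms

    def atoms_at(self, t):
        return self.evaluation.get(t, set())

# The data stream D of Example 5, restricted to the extensional atoms.
D = Stream(range(0, 51), {
    10: {("offer", ":a", "mbt", 30)},
    15: {("offer", ":b", "rayban", 25)},
    18: {("isNear", ":a"), ("isNear", ":b")},
})
```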
Example 5 (cont'd) The offers in the running scenario (Ex. 1) can be modeled as a data stream D = (T_D, υ_D) with a timeline T_D = [0, 50], whose time unit is minutes, and the evaluation function υ_D(10) = {offer(a, "mbt", 30)}, υ_D(15) = {offer(b, "rayban", 25)}, υ_D(18) = {isNear(a), isNear(b)}, and υ_D(t) = ∅ for all t ∈ T_D \ {10, 15, 18}. The evaluation function υ_D can be equally represented as

υ_D = { 10 ↦ {offer(a, "mbt", 30)},  15 ↦ {offer(b, "rayban", 25)},  18 ↦ {isNear(a), isNear(b)} }.

To cope with the amount of data, one usually considers only recent atoms. Let S = (T, υ) and S′ = (T′, υ′) be two streams s.t. S′ ⊆ S, i.e., T′ ⊆ T and υ′(t′) ⊆ υ(t′) for all t′ ∈ T′. Then S′ is called a substream of S.

Definition 2 (Window function) A (computable) window function w_ι of type ι takes as input a stream S = (T, υ), a time point t ∈ T, called the reference time point, and a vector x of window parameters for type ι, and returns a substream S′ of S.

Important are tuple-based and time-based window functions. The former select a fixed number of latest tuples, while the latter select all atoms appearing in the last n time points.

Window operators ⊞^w. Window functions can be accessed in formulas via window operators. That is, an expression ⊞^w α has the effect that α is evaluated on the "snapshot" of the stream delivered by its associated window function w. By dropping information based on time, window operators specify temporal relevance. For each atom in a window, we control the semantics by some temporal reference.

Time Reference. Let S = (T, υ) be a stream, a ∈ A, and B ⊆ A static background data. Then, at time point t ∈ T,
– a holds, if a ∈ υ(t) or a ∈ B;
– ◊a holds, if a holds at some time point t′ ∈ T;
– □a holds, if a holds at all time points t′ ∈ T; and
– @_t′ a holds, if t′ ∈ T and a holds at t′.

Next, the set A⁺ of extended atoms is given by the grammar

a | @_t a | ⊞^w @_t a | ⊞^w ◊a | ⊞^w □a,

where a ∈ A and t is any time point. Expressions of the form ⊞^w ⋆ a, where ⋆ ∈ {@_t, ◊, □}, are called window atoms.

Example 6 The window atom ⊞^30 ◊offer(Sh, Pr, Pe) takes a snapshot of the last 30 minutes of a stream and uses the ◊ operator to check whether an offer from shop Sh on product Pr with a discount of Pe% appeared in the stream during this period. Similarly, ⊞^5 ◊isNear(Sh) does the same job on a snapshot of size 5 minutes for the shops near the user.

2.4 LARS Programs

We present a fragment of the formalism in [5].

Syntax. A rule r is of the form α ← β(r), where H(r) = α is the head and β(r) = β_1, . . . , β_j, not β_{j+1}, . . . , not β_n is the body of r. Here, α is of the form a or @_t a, where a ∈ A_I, and each β_i is either an ordinary atom or a window atom. Let B(r) = B⁺(r) ∪ B⁻(r), where B⁺(r) = {β_i | 1 ≤ i ≤ j} is the positive and B⁻(r) = {β_i | j < i ≤ n} is the negative body of r. A (LARS) program P is a set of rules. A program is positive if none of its rules has a negative body atom.

Example 7 (cont'd) Suppose we are given static background data B that contains product information in a predicate of the form g_classify(Pr, Ge), where Ge = 0 (resp., 1) marks that product Pr is for women (resp., men). The following LARS rule amounts to the queries in Example 4, under input streams in a format as in Example 5.

ans(Sh, Pr, Pe) ← ⊞^30 ◊offer(Sh, Pr, Pe), ⊞^5 ◊isNear(Sh), g_classify(Pr, Ge), Pe ≥ 20, Ge ≠ 1.

This rule works as follows. The two window atoms provide the offers announced in the last 30 minutes and the shops near the user within the last 5 minutes. Together with the gender classification of products provided by g_classify, only products that are not for men (Ge ≠ 1) and that have a discount rate of at least 20% are derived in the head with predicate ans.
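To see how the window atoms of Example 7 are evaluated, the following hedged Python sketch implements a time-based window in the sense of Definition 2 and checks the rule body at a reference time point; it reuses the Stream class and the data stream D sketched above, and the predicate encoding is again illustrative rather than any engine's actual data model.

```python
# Sketch of a time-based window (last n time points) and of checking the
# window atoms of Example 7 at a reference time point, reusing the Stream
# class sketched above; background data and predicate names are illustrative.

def time_window(stream, t, n):
    """Substream covering the last n time points up to the reference time point t."""
    lo = max(stream.timeline.start, t - n)
    sub_eval = {u: stream.atoms_at(u) for u in range(lo, t + 1)}
    return Stream(range(lo, t + 1), sub_eval)

def diamond(stream, pred):
    """All atoms with predicate pred holding at SOME time point of the stream."""
    return {a for atoms in stream.evaluation.values() for a in atoms if a[0] == pred}

background = {("g_classify", "mbt", 1), ("g_classify", "rayban", 0)}

def answers(stream, t):
    offers = diamond(time_window(stream, t, 30), "offer")    # ⊞30 ◊offer(Sh,Pr,Pe)
    nearby = diamond(time_window(stream, t, 5), "isNear")    # ⊞5 ◊isNear(Sh)
    near_shops = {a[1] for a in nearby}
    gender = {pr: ge for (_, pr, ge) in background}
    return {("ans", sh, pr, pe) for (_, sh, pr, pe) in offers
            if sh in near_shops and pe >= 20 and gender.get(pr) != 1}

print(answers(D, 18))   # {('ans', ':b', 'rayban', 25)}, the counterpart of Example 8
```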
Semantics. Let P be a LARS program. For a data stream D = (T_D, υ_D), any stream I = (T, υ) ⊇ D that coincides with D on A_E is an interpretation stream for D. A tuple M = ⟨T, υ, W, B⟩ is an interpretation for D, where W is a set of window functions w such that the corresponding window operator ⊞^w appears in P, and B is the background knowledge. Throughout, we assume W and B are fixed and thus also omit them.

Satisfaction by M at t ∈ T is as follows: M, t ⊨ α for α ∈ A⁺, if α holds in (T, υ) at time t; M, t ⊨ r for a rule r, if M, t ⊨ β(r) implies M, t ⊨ H(r), where M, t ⊨ β(r), if (i) M, t ⊨ β_i for all i ∈ {1, . . . , j} and (ii) M, t ⊭ β_i for all i ∈ {j+1, . . . , n}; and M, t ⊨ P for a program P, i.e., M is a model of P (for D) at t, if M, t ⊨ r for all r ∈ P. Moreover, M is minimal if, in addition, no model M′ = ⟨T, υ′, W, B⟩ ≠ M of P exists such that υ′ ⊆ υ.

Definition 3 (Answer Stream) An interpretation stream I = (T, υ) for a data stream D ⊆ I is an answer stream of program P at time t, if M = ⟨T, υ, W, B⟩ is a minimal model of the reduct P^{M,t} = {r ∈ P | M, t ⊨ β(r)}. By AS(P, D, t) we denote the set of all such answer streams I.

Since RSP queries return just a single deterministic answer (which, of course, can contain multiple rows) at a time point, we consider in this paper LARS programs that have a single answer stream. By slight abuse of notation, we then write AS(P, D, t) also for the single element of AS(P, D, t).

Example 8 (cont'd) Consider background data B that contains product information as in Ex. 2, that is, B = {. . . , g_classify("mbt", 1), g_classify("rayban", 0), . . .}. Take the data stream D from Ex. 5 and let P be the LARS program consisting of the single rule in Ex. 7. Then, I = (T_I, υ_I) is the only answer stream of P wrt. D and B at time t = 18, where T_I = T_D and υ_I = υ_D ∪ {18 ↦ {ans(b, "rayban", 25)}}.

3 Modeling RSP Queries

Section 2.2 shows a divergence in realizing continuous queries in C-SPARQL and CQELS. To be able to capture and analyze the difference between the two approaches, we need a common starting point. This section proposes a formal model of RSP queries that captures this common starting point, and then classifies C-SPARQL and CQELS wrt. this model.

Similarly as in [17], we ignore solution modifiers and formalize an RSP query as a quadruple Q = (V, P, D, S), where V is a result form, P is a graph pattern, D is a dataset (for simplicity, we omit instantaneous background datasets, which can be added in a straightforward way), and S is a set of stream graph patterns. Roughly, S is a set of tuples of the form (s, ω, g), where s is a stream identifier, ω is a window expression, and g is a basic RDF graph pattern. Given a result form V, we denote by V̄ the tuple obtained by lexicographically ordering the set of variables in V.

Example 9 Queries Q1 and Q2 in Ex. 4 stem from Q = (V, P, D, S), where

V  = {?shop, ?product, ?percent}
P  = (P1 ∪ P2 ∪ P3) FILTER R
P1 = { ?shop :offers ?coupon.  ?coupon :on ?product.  ?coupon :reduce ?percent. }
P2 = { ?user :isNear ?shop. }
P3 = { ?product :g_classify ?gender. }
R  = (?percent ≥ 20 && ?gender ≠ 1)
D  = { ⟨...⟩ }
S  = { (⟨...⟩, [RANGE 30m], P1), (⟨...⟩, [RANGE 5m], P2) }.

This query covers all common aspects of Q1 and Q2, which both access the static dataset and the two input streams, with windows of range 30 and 5 minutes, respectively. On top of the snapshots from the input streams, together with the static dataset, pattern matching is carried out on the graph pattern P.
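Purely as an illustration of this query model, the quadruple Q = (V, P, D, S) of Example 9 can be written down as a plain data structure; the stream and dataset IRIs below are placeholders, since the concrete IRIs are not fixed here.

```python
# Sketch of the RSP query model Q = (V, P, D, S) of Example 9 as a plain
# data structure; dataset and stream identifiers are placeholder IRIs.
from collections import namedtuple

RSPQuery  = namedtuple("RSPQuery", "V P D S")        # result form, pattern, dataset, stream patterns
StreamPat = namedtuple("StreamPat", "stream window bgp")

P1 = [("?shop", ":offers", "?coupon"), ("?coupon", ":on", "?product"),
      ("?coupon", ":reduce", "?percent")]
P2 = [("?user", ":isNear", "?shop")]
P3 = [("?product", ":g_classify", "?gender")]

Q = RSPQuery(
    V={"?shop", "?product", "?percent"},
    P=("FILTER", ("AND", P1 + P2 + P3), "?percent >= 20 && ?gender != 1"),
    D={"<staticDataset>"},                            # placeholder IRI
    S=[StreamPat("<offerStream>", "[RANGE 30m]", P1),  # placeholder IRIs
       StreamPat("<nearbyStream>", "[RANGE 5m]", P2)],
)
```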
Next, we show how this RSP query model captures the divergent C-SPARQL and CQELS queries. Consider an RSP query Q.

• The corresponding C-SPARQL query, denoted by cs(Q), is obtained from Q by setting the graph patterns in all stream graph patterns in S to ∅. This goes along with the idea of C-SPARQL to merge patterns on the input streams into the default graph.

• A corresponding CQELS query, however, can be obtained from Q at different levels of cautiousness: for every part of P that contains g_i s.t. (s_i, ω_i, g_i) ∈ S, replace it with either (i) (STREAM s_i ω_i g_i), or (ii) ((STREAM s_i ω_i g_i) UNION g_i). The former is a brave approach, applicable when one can make sure that the static dataset and the stream s_i do not share patterns; the latter is more cautious, for the case that one is not sure and rather expects triples matching g_i to come from either the static dataset or the input streams. Therefore, Q corresponds to a set cq(Q) of 2^|S| CQELS queries, including a brave one, a cautious one, and the ones in between. Note that Q2 in Ex. 4 is the brave CQELS query of Q.

4 Capturing RSP Queries Using LARS

We will use the following strategy to capture different RSP approaches with LARS:

(1) First, the push- and pull-based execution modes are captured for general LARS programs via two straightforward translations.
(2) Then, window expressions in RSP are translated into window operators in LARS.
(3) Next, R2R operators and the approaches to building the datasets evaluated by the R2R operators are captured by two slightly different translations τ1 and τ2, based on the translation from SPARQL to Datalog rules in [17].
(4) Finally, post-processing can be carried out to mimic R2S operators. Note that this post-processing can be done operationally. It is therefore not of theoretical interest and will not be considered in this paper.

We now go into the details of (1)-(3).

4.1 Push- and Pull-Based Execution Modes for LARS Programs

This section provides two translations that capture the push- and pull-based execution modes by means of LARS itself. Given a LARS program P and a pulling period U > 0, the translations push(P) and pull(P, U) encode the push- and pull-based mode by LARS rules, respectively.

Intuitively, we add to the body of each rule in P an ordinary atom trigger. Then, rules to conclude trigger are added depending on the mode. In push-based mode, trigger is concluded upon each new incoming input triple. In pull-based mode, the condition is that the current time point is a multiple of U. Formally, for a LARS rule r, a LARS program P, and a pulling period U, let

trigger(r) = H(r) ← B(r), trigger.
trigger(P) = { trigger(r) | r ∈ P ∧ B(r) ≠ ∅ }
push(P)    = trigger(P) ∪ { trigger ← ⊞^NOW ◊p(X̄). | p ∈ A_E }
pull(P, U) = trigger(P) ∪ { trigger ← ⊞^NOW @_T true, T % U = 0. }

Notably, the translation pull(·) needs to acquire the current time point, which is achieved as follows. The logical constant true always holds, and thus @_T true holds for all considered time points T. By applying the window operator ⊞^NOW (or equivalently ⊞^0) beforehand, only the current time point is selected.
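The following sketch shows the shape of the rewritings push(P) and pull(P, U) on the rule of Example 7, treating LARS rules purely symbolically as head/body pairs of strings; the textual window-atom syntax inside the strings is only illustrative.

```python
# A small sketch of the trigger-based rewritings push(P) and pull(P, U),
# representing LARS rules symbolically as (head, body) pairs of strings.
from collections import namedtuple

Rule = namedtuple("Rule", "head body")      # body: list of (possibly negated) literals

def trigger_rules(program):
    """Add the ordinary atom 'trigger' to the body of every non-fact rule."""
    return [Rule(r.head, r.body + ["trigger"]) for r in program if r.body]

def push(program, extensional_preds):
    extra = [Rule("trigger", [f"⊞NOW ◊{p}(X)"]) for p in extensional_preds]
    return trigger_rules(program) + extra

def pull(program, period):
    extra = [Rule("trigger", ["⊞NOW @T true", f"T % {period} = 0"])]
    return trigger_rules(program) + extra

P = [Rule("ans(Sh,Pr,Pe)",
          ["⊞30 ◊offer(Sh,Pr,Pe)", "⊞5 ◊isNear(Sh)",
           "g_classify(Pr,Ge)", "Pe >= 20", "Ge != 1"])]

for r in pull(P, 10):
    print(r.head, "<-", ", ".join(r.body))
```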
The following proposition shows that push(·) and pull(·) faithfully capture the execution modes.

Proposition 1 Let P be a LARS program, U a positive integer, and D = (T_D, υ_D) an input stream. For every t ∈ T_D, it holds that:
(1) if υ_D(t) ≠ ∅, then AS(push(P), D, t) = AS(P, D, t);
(2) if υ_D(t) = ∅, then AS(push(P), D, t) = {D};
(3) if t % U = 0, then AS(pull(P, U), D, t) = AS(P, D, t);
(4) if t % U ≠ 0, then AS(pull(P, U), D, t) = {D}.

4.2 Translating RSP Window Expressions to LARS Window Operators

Table 1 presents a translation from windows in RSP to window operators in LARS. Given a window expression ω in RSP, τ(ω) returns a LARS window operator which corresponds to a window function providing the same functionality as ω [4, 3, 5].

Window expression ω   | τ(ω)
[RANGE L]             | ⊞^L
[RANGE L SLIDE D]     | ⊞^{L,0,D}
[ROWS N]              | ⊞^N_#
[NOW]                 | ⊞^0 or ⊞^NOW
[RANGE UNBOUNDED]     | ⊞^∞

Table 1: Translating window expressions ω to LARS window operators

4.3 Translating RSP Queries to LARS Programs

For capturing the R2R operators of continuous SPARQL queries, we can exploit an existing translation from SPARQL to Datalog rules [17]. The difference in our setting is the streaming input and how RSP engines take snapshots of the stream to build datasets for SPARQL evaluation. We propose two strategies (T1) and (T2) to extend the translation in [17] so as to capture R2R operators and the two ways (B1), (B2) of building snapshot datasets (cf. Section 2.2):

(T1) For (B1), we just need to make sure that the triples from the input streams are collected into the default graph.
(T2) For (B2), we introduce one more case for translating a stream graph pattern to LARS rules.

Towards formally presenting our translations, we start with a review of the translation from SPARQL to Datalog in [17], which has two parts:

(i) The first part imports RDF triples from the dataset into a 4-ary predicate of the form triple(S, P, O, G), where (S, P, O) covers RDF triples and G holds a graph identifier. This can be done with the Answer Set Programming solver dlvhex (http://www.kr.tuwien.ac.at/research/systems/dlvhex/).
(ii) For the second part, a function τ takes as input a result form V, a graph pattern P, a dataset D, and an integer i > 0, and translates the input into a Datalog program, recursively along P. The base case is a single RDF triple pattern, i.e., P = {(S, P, O)}. Intuitively, τ converts SPARQL operators to declarative rules.

Our purpose is to provide a translation for theoretical analysis rather than for a practical implementation of RSP queries. Thus, we concentrate on (ii). For (i), we assume that
– each triple (s, p, o) from the static dataset D can be accessed by triple(s, p, o, D), and
– each triple (s, p, o) arriving on a stream s at time t contributes to the evaluation function υ at t under a predicate striple, that is, striple(s, p, o, s) ∈ υ(t).

Figure 1 shows the extension of τ in [17] with a parameter S representing the input streams. The translation LT(·) is taken from [17]; it is based on the rewriting defined by Lloyd and Topor [13].

τ(V, (S, P, O), D, S, i) = { ans_i(V̄, D, S) ← triple(S, P, O, D). }

τ(V, (P1 AND P2), D, S, i) = τ(vars(P1), P1, D, S, 2i) ∪ τ(vars(P2), P2, D, S, 2i+1) ∪
  { ans_i(V̄, D, S) ← ans_2i(vars(P1), D, S), ans_2i+1(vars(P2), D, S). }

τ(V, (P1 UNION P2), D, S, i) = τ(vars(P1), P1, D, S, 2i) ∪ τ(vars(P2), P2, D, S, 2i+1) ∪
  { ans_i(V̄[(V \ vars(P1)) → null], D, S) ← ans_2i(vars(P1), D, S).
    ans_i(V̄[(V \ vars(P2)) → null], D, S) ← ans_2i+1(vars(P2), D, S). }

τ(V, (P1 MINUS P2), D, S, i) = τ(vars(P1), P1, D, S, 2i) ∪ τ(vars(P2), P2, D, S, 2i+1) ∪
  { ans_i(V̄[(V \ vars(P1)) → null], D, S) ← ans_2i(vars(P1), D, S), not ans′_2i(vars(P1) ∩ vars(P2), D, S).
    ans′_2i(vars(P1) ∩ vars(P2), D, S) ← ans_2i+1(vars(P2), D, S). }

τ(V, (P1 OPT P2), D, S, i) = τ(V, (P1 AND P2), D, S, i) ∪ τ(V, (P1 MINUS P2), D, S, i)

τ(V, (P FILTER R), D, S, i) = τ(V, P, D, S, 2i) ∪ LT(ans_i(V̄, D, S) ← ans_2i(vars(P), D, S), R.)

τ(V, (GRAPH g P), D, S, i) = τ(V, P, g, S, i) ∪
  { ans_i(V̄, D, S) ← ans_i(V̄, g, S), isIRI(g), g ≠ default. }     for g ∈ V ∪ I

Fig. 1: Extending the translation τ of [17] with input streams S
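To make the recursion of Figure 1 concrete, the following simplified Python sketch generates Datalog-style rules for the base case, AND, UNION, and FILTER; it omits the stream parameter S, treats a whole basic graph pattern as the base case, and skips the null padding of UNION, so it is an approximation of the translation rather than a faithful implementation.

```python
# A compact sketch of the recursive translation of Figure 1, restricted to a
# BGP base case, AND, UNION, and FILTER; rules are built as strings and the
# vars_of(.) helper collects the variables of a pattern. Purely illustrative.

def vars_of(pattern):
    if pattern[0] == "BGP":
        return sorted({t for triple in pattern[1] for t in triple if t.startswith("?")})
    if pattern[0] in ("AND", "UNION"):
        return sorted(set(vars_of(pattern[1])) | set(vars_of(pattern[2])))
    if pattern[0] == "FILTER":
        return vars_of(pattern[1])
    raise ValueError(pattern)

def tau(V, pattern, D, i):
    head = f"ans{i}({','.join(V)},{D})"
    kind = pattern[0]
    if kind == "BGP":                      # base case: join the triple patterns directly
        body = ", ".join(f"triple({s},{p},{o},{D})" for (s, p, o) in pattern[1])
        return [f"{head} :- {body}."]
    if kind == "AND":
        v1, v2 = vars_of(pattern[1]), vars_of(pattern[2])
        return (tau(v1, pattern[1], D, 2 * i) + tau(v2, pattern[2], D, 2 * i + 1)
                + [f"{head} :- ans{2*i}({','.join(v1)},{D}), ans{2*i+1}({','.join(v2)},{D})."])
    if kind == "UNION":                    # null padding of unbound variables omitted
        v1, v2 = vars_of(pattern[1]), vars_of(pattern[2])
        return (tau(v1, pattern[1], D, 2 * i) + tau(v2, pattern[2], D, 2 * i + 1)
                + [f"{head} :- ans{2*i}({','.join(v1)},{D}).",
                   f"{head} :- ans{2*i+1}({','.join(v2)},{D})."])
    if kind == "FILTER":
        v = vars_of(pattern[1])
        return tau(v, pattern[1], D, 2 * i) + [f"{head} :- ans{2*i}({','.join(v)},{D}), {pattern[2]}."]
    raise ValueError(kind)

bgp = ("BGP", [("?shop", ":offers", "?coupon"), ("?coupon", ":reduce", "?percent")])
for rule in tau(["?shop", "?percent"], ("FILTER", bgp, "?percent >= 20"), "default", 1):
    print(rule)
```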
For (T1), we modify the base case of τ for a triple pattern (S, P, O) to

τ(V, (S, P, O), D, S, i) = { ans_i(V̄, D, S) ← triple(S, P, O, D). } ∪ { ans_i(V̄, D, S) ← triple(S, P, O, S). },

and add the following rules to import the incoming stream triples into the default graph:

τ′(S) = { triple(S, P, O, S) ← ⊞^τ(ω) ◊striple(S, P, O, s). | (s, ω, g) ∈ S }.

The translation for strategy (T1) is τ1(V, P, D, S, i) = τ(V, P, D, S, i) ∪ τ′(S).

For (T2), let τ2 be a function that agrees with τ and, moreover, fulfills

τ2(V, (STREAM s ω g), D, S, i) = { ans_i(V̄, D, S) ← ⊞^τ(ω) ( ⋀_{(S,P,O) ∈ g} ◊striple(S, P, O, s) ). }.

When it is clear from the context, we will write in the sequel τ/τ_i(Q) (i ∈ {1, 2}) for a query Q = (V, P, D, S) instead of τ/τ_i(V, P, D, S, 1).

Given an RSP query Q = (V, P, D, S), let Q′ ∈ {cs(Q)} ∪ cq(Q) and I_i = AS(τ_i(Q′), D, t) for a data stream D and a time point t, where i ∈ {1, 2}. We denote by

chop(I, Q) = { ans_j(V̄_j, D) | ans_j(V̄_j, D, S) ∈ I } ∪ ( I \ { ans_j(V̄_j, D, S) | ans_j(V̄_j, D, S) ∈ I } )

the result of projecting away, from all atoms of predicate ans_j in I, the parameter corresponding to S. The following result shows that our translations preserve the translation in [17].

Proposition 2 Let Q = (V, P, D, ∅) be an RSP query, that is, a SPARQL query, and let cq(Q) = {Q′}. Let I be the single answer set of τ(Q), I1 = AS(τ1(cs(Q)), D, t), and I2 = AS(τ2(Q′), D, t). Then I = chop(I1, Q) = chop(I2, Q).

Translations τ1 and τ2 share the core of the translation τ from [17], and differ due to the two approaches of C-SPARQL and CQELS in extending SPARQL to deal with streaming input. Furthermore, the two engines execute in two different modes, namely pull- and push-based. This makes it non-trivial to analyze in which situations the engines should return the same output. Tackling this question now becomes possible with LARS, as discussed next.

5 Utilizing LARS for Comparing RSP Semantics

This section discusses ideas for tackling the comparison between C-SPARQL and CQELS using LARS. The main questions are:

(i) What do we mean by saying "C-SPARQL and CQELS return the same output"?
(ii) When do the two approaches of dealing with streaming input coincide, i.e., when does merging input streams into the default graph versus using stream graph patterns make no difference in building the datasets evaluated by the engines?
(iii) Under which conditions do the push- and pull-based executions fulfill (i)?

The most important question is (i), which establishes the whole setup and methodology for the comparison. Since C-SPARQL and CQELS build on a pull- and a push-based mode, respectively, it does not make sense to require that the outputs of the two engines coincide at every single time point. Instead, we need some notion of agreement between the two engines.
Intuitively, we say that C-SPARQL and CQELS agree on a time interval [t1, t2] if the union of the outputs returned by CQELS at every time point in (t1, t2] (i.e., the total output in that interval) coincides with the output returned by C-SPARQL at t2. Assume that both engines start at time 0 of a timeline T and that C-SPARQL is executed with a pulling period of U time units. Then, we say that the engines agree on T just if they agree on every interval [i · U, (i + 1) · U] ⊆ T, for i ≥ 0. On the translated LARS programs, checking agreement on the output boils down to checking agreement on the facts over the predicate ans_1.

Imagine a user staring at her app, waiting for notifications. If the two engines agree, then at the end of each pulling period the user sees the same notifications in both cases; if the pulling period is short enough, she might not even notice any difference between the two modes. This makes the agreement notion reasonable from a practical point of view.

We now analyze possible sufficient conditions for agreement between C-SPARQL and CQELS by finding answers to questions (ii) and (iii). To put (ii) more concretely in the context of LARS, take an RSP query Q and let Q1 = cs(Q) and Q2 ∈ cq(Q); we want to find conditions under which the answer streams of τ1(Q1) and τ2(Q2) at time point t on the input stream D have the same extension of predicate ans_1.

Observe that τ1(Q1) and τ2(Q2) differ in how they deal with input coming from predicate striple. The former program merges facts of striple into triple, which are later used to conclude ans_i; the latter program concludes ans_i directly from striple. If the static dataset and the input streams share some patterns, then τ1 may not be capable of identifying the origin of a triple. Even if the input stream does not receive any incoming triple, τ1(Q1) may still conclude some output due to facts residing in the static part. On the other hand, τ2(Q2) returns no output, as the stream is not updated. To avoid such confusion, it is required that the static dataset and the input streams do not share patterns. Fortunately, this usually holds in practice. For instance, in our running example, the pattern ?user :isNear ?shop can only match triples from the location stream, since the predicate :isNear is not streamed on the offer stream, and the static dataset should not contain such information.

When the conditions for question (ii) are fulfilled, one has a better setting to analyze (iii). Still, as a first step, we need to assume, as in [10], that the execution time of the engines is marginal compared to the input rate. Note that with these conditions, we can, for a given LARS program P, compare the output of pull(P, U) and push(P) on time intervals [t1, t2] = [i · U, (i + 1) · U]. As the rules in P are shared by both translations, intuitively, the total output of ans_1 by push(P) during [t1, t2] will coincide with the one of pull(P, U), if the total input to push(P) during [t1, t2] coincides with the input to pull(P, U) at t2. Here the "input" consists of the snapshots obtained by evaluating the windows on the streams. However, this condition cannot be guaranteed under high throughput. The reason is that, with dense input streams, the snapshots taken at time points near the beginning of an interval will likely collect more input than the snapshot at its end. Thus, in practice it is rather unlikely that C-SPARQL and CQELS will agree, due to the strong semantic implications of push/pull-based querying.
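The agreement notion above can be phrased operationally as follows. This sketch assumes two given answer functions, for instance the answers function and the data stream D from the earlier sketches, one restricted to time points with new input (push) and one evaluated only at interval ends (pull); it is an illustration of the definition, not of either engine.

```python
# Sketch of the agreement notion: the push-mode engine and the pull-mode
# engine (period U) agree on [t1, t2] if the union of the push-mode answers
# over (t1, t2] equals the pull-mode answers at t2.

def agree_on_interval(push_answers, pull_answers, t1, t2):
    pushed = set()
    for t in range(t1 + 1, t2 + 1):
        pushed |= push_answers(t)
    return pushed == pull_answers(t2)

def agree_on_timeline(push_answers, pull_answers, horizon, U):
    return all(agree_on_interval(push_answers, pull_answers, i * U, (i + 1) * U)
               for i in range(horizon // U))

# Example: both modes evaluate the notification rule of Example 7 on the
# stream D from the earlier sketches; push produces output only on new input.
push_ans = lambda t: answers(D, t) if D.atoms_at(t) else set()
pull_ans = lambda t: answers(D, t)
print(agree_on_timeline(push_ans, pull_ans, 50, 10))   # True for this small stream
```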
Conclusions and Outlook. This paper establishes first steps towards formally comparing two RSP semantics implemented in two well-known engines, namely C-SPARQL and CQELS, by proposing translations that capture the languages and execution modes of the engines, and by discussing ideas for formalizing a notion of agreement between the two semantics as well as conditions for it to hold. Next steps include working out these ideas formally and implementing the translations to build a benchmarking system that compares different stream processing approaches using LARS.

References

1. A. Arasu, S. Babu, and J. Widom. The CQL continuous query language: semantic foundations and query execution. VLDB J., 15(2):121–142, 2006.
2. D. F. Barbieri, D. Braga, S. Ceri, E. Della Valle, and M. Grossniklaus. C-SPARQL: a continuous query language for RDF data streams. Int. J. Semantic Computing, 4(1):3–25, 2010.
3. H. Beck, M. Dao-Tran, T. Eiter, and M. Fink. Towards a logic-based framework for analyzing stream reasoning. In OrdRing, pages 11–22, 2014.
4. H. Beck, M. Dao-Tran, T. Eiter, and M. Fink. Towards ideal semantics for analyzing stream reasoning. In ReactKnow, 2014.
5. H. Beck, M. Dao-Tran, T. Eiter, and M. Fink. LARS: A logic-based framework for analyzing reasoning over streams. In AAAI, 2015.
6. A. Bolles, M. Grawunder, and J. Jacobi. Streaming SPARQL - extending SPARQL to process data streams. In ESWC, pages 448–462, 2008.
7. J.-P. Calbimonte, Ó. Corcho, and A. J. G. Gray. Enabling ontology-based access to streaming data sources. In ISWC (1), pages 96–111, 2010.
8. R. Cyganiak, D. Wood, and M. Lanthaler. RDF 1.1 Concepts and Abstract Syntax. http://www.w3.org/TR/rdf11-concepts/, 2014.
9. D. Dell'Aglio, J.-P. Calbimonte, M. Balduini, Ó. Corcho, and E. Della Valle. On correctness in RDF stream processor benchmarking. In ISWC, pages 326–342, 2013.
10. D. Dell'Aglio, E. Della Valle, J.-P. Calbimonte, and Ó. Corcho. RSP-QL semantics: A unifying query model to explain heterogeneity of RDF stream processing systems. IJSWIS, 10(4), 2015.
11. S. Groppe. Data Management and Query Processing in Semantic Web Databases. Springer, 2011.
12. S. Harris and A. Seaborne. SPARQL 1.1 Query Language. http://www.w3.org/TR/sparql11-query/, 2013.
13. J. W. Lloyd and R. W. Topor. Making Prolog more expressive. J. Log. Program., 1(3):225–240, 1984.
14. J. Pérez, M. Arenas, and C. Gutierrez. Semantics and complexity of SPARQL. ACM Trans. Database Syst., 34:16:1–16:45, September 2009.
15. D. L. Phuoc, M. Dao-Tran, J. X. Parreira, and M. Hauswirth. A native and adaptive approach for unified processing of linked streams and linked data. In ISWC (1), pages 370–388, 2011.
16. D. L. Phuoc, M. Dao-Tran, M.-D. Pham, P. Boncz, T. Eiter, and M. Fink. Linked stream data processing engines: Facts and figures. In ISWC - ET, pages 300–312, 2012.
17. A. Polleres. From SPARQL to rules (and back). In WWW, pages 787–796, 2007.
18. O. Walavalkar, A. Joshi, T. Finin, and Y. Yesha. Streaming knowledge bases. In SSWS, 2008.
19. Y. Zhang, P. Minh Duc, Ó. Corcho, and J.-P. Calbimonte. SRBench: A streaming RDF/SPARQL benchmark. In ISWC, pages 641–657, 2012.