A Proposal for Publishing Data Streams as Linked Data
- A Position Paper -

Davide F. Barbieri, Emanuele Della Valle
Dipartimento di Elettronica e Informazione, Politecnico di Milano
Piazza L. da Vinci 32, 20133 Milano
dbarbieri@elet.polimi.it, dellavalle@elet.polimi.it

Copyright is held by the author/owner(s).
LDOW2010, April 27, 2010, Raleigh, North Carolina.

ABSTRACT
Streams are appearing more and more often on the Web, in sites that distribute and present information in real time. We anticipate a rapidly growing need to mash up this streaming information with more static information. While best practices for linking static data on the Web have been published and facilitate the mash-up of static information published on the Web, streams have been neglected. In this short position paper, we propose an approach to publish Data Streams as Linked Data.

Keywords
Data Streams, Linked Data, Virtual RDF, Stream Reasoning

1. INTRODUCTION
A growing number of Web sites are distributing and presenting information in real-time streams. Microblogs such as Twitter (http://twitter.com/), weather monitoring sites such as AccuWeather (http://www.accuweather.com/), and traffic monitoring sites such as Waze (http://world.waze.com/) are a few representative examples.

Streams, being unbounded sequences of time-varying data elements, should not be treated as persistent data to be stored (forever) and queried on demand, but rather as transient data to be consumed on the fly by continuous queries. Continuous queries, after being registered, keep analyzing such streams, producing answers triggered by the streaming data and not by explicit invocation. Such a paradigmatic change has been largely investigated in the last decade by the database community [15]. Specialized Data Stream Management Systems (DSMS) have been developed (e.g., STREAM [2], Aurora/Borealis [1] and Stream Mill [6]). Several startups such as StreamBase (http://www.streambase.com/) are commercializing DSMS, and features of DSMS are becoming supported by major database products, such as Oracle and DB2.

Motivated by the availability of real-time streams on the Web and by the lack of Web-based approaches to process them, we have been working since 2008 on an extension to SPARQL [20] for continuous querying over streams of RDF and static RDF graphs (namely C-SPARQL [7, 9]).

Listing 1 shows an example of a C-SPARQL query that, given a static description of brokers and a stream of financial transactions for all brokers, computes the amount of transactions for Swiss brokers within the last hour.

     1  REGISTER STREAM TotalAmountPerBroker COMPUTE EVERY 10m AS
     2  PREFIX ex: <http://example/>
     3  CONSTRUCT {?broker ex:hasTotalAmount ?total .}
     4  FROM <http://brokerscentral.org/brokers.rdf>
     5  FROM STREAM <http://stockex.org/market.trdf>
     6       [RANGE 1h STEP 10m]
     7  WHERE {
     8    ?broker ex:from ?country .
     9    ?broker ex:does ?tx .
    10    ?tx ex:with ?amount .
    11    FILTER (?country = "CH")
    12  }
    13  AGGREGATE { (?total, SUM(?amount), ?broker) }

Listing 1: Example of C-SPARQL which allows dealing with streams of RDF triples as well as static RDF graphs

At line 1, the REGISTER clause is used to tell the C-SPARQL engine that it should register a continuous query, i.e., a query that will continuously compute answers. In particular, we are registering a query that generates an RDF stream. The COMPUTE EVERY clause states the frequency of every new computation, in the example every 10 minutes. At line 5, the clause FROM STREAM defines the RDF stream of financial transactions used within the query. Next, line 6 defines the window of observation of the RDF stream. Streams, by their very nature, are volatile and for this reason should be consumed on the fly; thus, they are observed through a window, which includes the last elements of the stream and changes over time. In the example, the window comprises RDF triples produced in the last hour, and the window slides every 10 minutes. The WHERE clause is standard; it includes a set of matching patterns and FILTER clauses as in standard SPARQL. Finally, at line 13, the AGGREGATE function asks the C-SPARQL engine to include in the result set a new variable ?total, which is bound to the sum of the amounts of the transactions of each broker.

Our C-SPARQL Engine [9] treats non-RDF DSMSs as virtual RDF streams and graphs. It allows registering queries that continuously combine (virtual) RDF streams and RDF graphs. In this respect, our C-SPARQL Engine is similar to D2RQ [12], which treats non-RDF databases as virtual RDF graphs. In our previous works [7, 8, 9] we developed an engine for registering and continuously executing C-SPARQL queries. With this position paper, we propose an extension of our C-SPARQL Engine that publishes data streams as Linked Data. Such an extension complements the work done so far and lowers the entry barrier for external (Semantic) Web applications to consume data streams.

The rest of the paper is organized as follows. In Section 2 we describe the design principles that inspire our proposal for Streaming Linked Data. Section 3 explains how to publish a single data stream as an RDF stream. In the same section we also present a vocabulary to describe the time interval in which the published data are valid. The URI schema that allows controlling the window behavior is presented in Section 4. In Section 5, we describe the RESTful [21] services which allow controlling the C-SPARQL query that continuously computes the published RDF stream. Finally, Sections 6 and 7 present some related work and draw some conclusions, respectively.

2. DESIGN PRINCIPLE
The design principle that inspires our approach is illustrated in Figure 1. Our C-SPARQL engine is able to process data streams and RDF streams in combination with RDF graphs. In our previous work, we used an in-memory connection between our C-SPARQL engine and local C-SPARQL clients. However, we anticipate a rapidly growing need to mash up results of our C-SPARQL engine with SPARQL- and RDF-based linked data clients. A Streaming Linked Data Server is a special local C-SPARQL Client that connects in memory to a C-SPARQL engine and exposes as Linked Data the results of continuous queries registered in the C-SPARQL engine.

Figure 1: Architectural solution of our approach to publish Streaming Linked Data

By using our C-SPARQL engine as a one-to-one mapper from data streams to RDF streams, we can make a raw data stream available to Linked Data Clients (see Section 3). Moreover, we offer an interface to remotely control the behavior of the window through which the stream is observed (see Section 4). Finally, we make available RESTful services that implement a remote C-SPARQL Client (see Section 5). Such services provide full control (i.e., beyond window behavior) over the C-SPARQL queries whose results are served as Linked Data by the Streaming Linked Data Server.

3. PUBLISHING A STREAM
A data stream is defined as an ordered sequence of pairs, where each pair is made of a tuple and its timestamp τ. For instance, the stream of financial transactions used in the example in Listing 1 could contain a transaction tr1 done by broker1 for $1000 registered at τ_i, and two transactions at τ_{i+1}: tr2 done by broker1 for $3000 and tr3 done by broker2 for $2000.

    (⟨Transaction(tr1, broker1, "$1000")⟩, τ_i)
    (⟨Transaction(tr2, broker1, "$3000")⟩, τ_{i+1})
    (⟨Transaction(tr3, broker2, "$2000")⟩, τ_{i+1})

In a similar way, we define an RDF stream [7] as an ordered sequence of pairs, where each pair is made of an RDF triple and its timestamp τ. By mapping the data stream above to RDF using the D2RQ mapping language [10], we obtain the following RDF stream:

    (⟨broker1 does tr1 .⟩, τ_i)
    (⟨tr1 with "$1000" .⟩, τ_i)
    (⟨broker1 does tr2 .⟩, τ_{i+1})
    (⟨tr2 with "$3000" .⟩, τ_{i+1})
    (⟨broker2 does tr3 .⟩, τ_{i+1})
    (⟨tr3 with "$2000" .⟩, τ_{i+1})

We propose to represent RDF streams in RDF using named graphs [13]. We distinguish between two kinds of named graphs: the Stream Graphs (s-graphs for short) and the Instantaneous Graphs (i-graphs for short). In our proposal, an RDF Stream can be represented using one s-graph and several i-graphs, one for each timestamp.

An s-graph is a metadata graph that describes the current content of the window over the RDF Stream. The most important parts of an s-graph are the triples that refer to the i-graphs using rdfs:seeAlso and those that describe when each i-graph was received using the property receivedAt. (We chose to link s-graphs to i-graphs using the property rdfs:seeAlso because it has been largely adopted to link named graphs; see for instance the usage of rdfs:seeAlso in Sindice [19] and in the Semantic Web Client [17].)

A few other metadata properties complete the description of an s-graph. The property lastUpdate describes the last time the graph was updated. The property expires tells a Linked Data Client that the information in the graph will expire at a given moment in the future. The properties sld:windowType and windowSize describe the window through which the stream is observed (see Section 4 for more information).

For instance, if the data stream exemplified above were the current content of a window over the stream of financial transactions, it could be represented using the s-graph in Listing 2 and the two i-graphs in Listings 3 and 4.

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    @prefix sld: <http://www.streaminglinkeddata.org/schema#> .
    @prefix : <http://example/> .

    :sgraph1 sld:lastUpdate "τ_{i+1}"^^xsd:dateTime ;
             sld:expires "τ_{i+2}"^^xsd:dateTime ;
             sld:windowType sld:logicalTumbling ;
             sld:windowSize "PT1H"^^xsd:duration .

    :sgraph1 rdfs:seeAlso :igraph1 .
    :igraph1 sld:receivedAt "τ_i"^^xsd:dateTime .

    :sgraph1 rdfs:seeAlso :igraph2 .
    :igraph2 sld:receivedAt "τ_{i+1}"^^xsd:dateTime .

Listing 2: Example of Stream Graph linking two Instantaneous Graphs

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    @prefix sld: <http://www.streaminglinkeddata.org/schema#> .
    @prefix : <http://example/> .

    :igraph1 sld:receivedAt "τ_i"^^xsd:dateTime ;
             rdfs:seeAlso :sgraph1 .

    :broker1 :does :tr1 .
    :tr1 :with "$1000" .

Listing 3: The Instantaneous Graph timestamped with τ_i.

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    @prefix sld: <http://www.streaminglinkeddata.org/schema#> .
    @prefix : <http://example/> .

    :igraph2 sld:receivedAt "τ_{i+1}"^^xsd:dateTime ;
             rdfs:seeAlso :sgraph1 .

    :broker1 :does :tr2 .
    :tr2 :with "$3000" .
    :broker2 :does :tr3 .
    :tr3 :with "$2000" .

Listing 4: The Instantaneous Graph timestamped with τ_{i+1}.

Following the guidelines on cool URIs [5], we propose to give s-graphs and i-graphs an IRI using the following schemata:

    s-graph: http://ex.org/%stream-name%
       e.g., http://stockex.org/transactions
    i-graph: http://ex.org/%stream-name%/URLencode(%timestamp%)
       e.g., http://stockex.org/transactions/2010-02-12T13%3A34%3A41Z

Moreover, following the best practice on how to publish Linked Data on the Web [11], when IRIs that follow the schemata shown above are dereferenced, the Streaming Linked Data Server returns an information resource appropriate for the client (using HTTP content negotiation):

• Linked Data Clients are redirected to
    http://ex.org/trdf/%stream-name%
    http://ex.org/trdf/%stream-name%/URLencode(%timestamp%)

• HTML Clients are redirected to
    http://ex.org/page/%stream-name%
    http://ex.org/page/%stream-name%/URLencode(%timestamp%)

4. CONTROLLING THE WINDOW
As we have explained in the previous section, streams are intrinsically infinite. In C-SPARQL, we introduce the notion of windows over streams. In Section 3, we focused on the general approach to publish a data stream rather than on the notion of window. However, we foresee the need for a consumer of Streaming Linked Data to be able to control the behavior of the window through which the stream is observed.

Types and characteristics of windows in C-SPARQL are inspired by those of the windows defined in continuous query languages for relational streaming data, such as CQL [3]. Windows are expressed in C-SPARQL within the FROM STREAM clause, whose syntax is as follows:

    FromStrClause  → 'FROM' ['NAMED'] 'STREAM' StreamIRI '[ RANGE' Window ']'
    Window         → LogicalWindow | PhysicalWindow
    LogicalWindow  → Number TimeUnit WindowOverlap
    TimeUnit       → 'ms' | 's' | 'm' | 'h' | 'd'
    WindowOverlap  → 'STEP' Number TimeUnit | 'TUMBLING'
    PhysicalWindow → 'TRIPLES' Number

A window extracts from the stream the last data stream elements, which are considered by the query. Such extraction can be physical (a given number of triples) or logical (all the triples which occur during a given time interval, the number of which is variable over time).

Logical windows are sliding [16] when they are progressively advanced by a given STEP (i.e., a time interval that is shorter than the window's time interval); they are non-overlapping (or TUMBLING) when they are advanced by exactly their time interval at each iteration. With tumbling windows every triple of the stream is included in exactly one window, whereas with sliding windows some triples can be included in several windows.

We believe that consumers of Streaming Linked Data would largely benefit from controlling the window of a running C-SPARQL query. Therefore we propose the following IRI schemata:

• physical windows can be controlled by replacing %size% with the number of triples (e.g., the last 1000 triples)
    Schema:  http://ex.org/%stream-URI%/physical/%size%
    Example: http://stockex.org/transactions/physical/1000

• logical windows can be controlled by replacing %size% with a time interval (e.g., PT1H meaning 1 hour) and replacing %step% either with the keyword tumbling or with a time interval (e.g., PT10M meaning 10 minutes). The lexical space of such an interval is the same as xsd:duration, i.e., the format PnYnMnDTnHnMnS defined by ISO 8601 [18].
    Schema:  http://ex.org/%stream-URI%/logical/%size%/%step%
    Example: http://stockex.org/transactions/logical/PT1H/PT10M

Notably, each of these IRIs is translated to an equivalent C-SPARQL query that processes the data stream. For instance, the logical window example above is equivalent to the following C-SPARQL query.

    REGISTER STREAM transactions COMPUTE EVERY 10m AS
    PREFIX : <http://example/>
    CONSTRUCT *
    FROM STREAM <http://stockex.org/market.trdf> [RANGE 1h STEP 10m]
    WHERE { ?s ?p ?o . }

5. CONTROLLING C-SPARQL QUERIES
In this section, we describe the RESTful [21] services which allow one to control each C-SPARQL query that continuously computes each RDF stream published with our approach.

As we explained above, C-SPARQL queries have to be registered in the C-SPARQL Engine. As soon as a query is registered, the C-SPARQL engine starts to compute it. An explicit stop command is required to stop the processing of a registered query. Similarly, an unregister command allows for deleting a C-SPARQL query.

We designed a RESTful interface that uses the HTTP methods to control the C-SPARQL queries:

• PUT, with a C-SPARQL query as parameter, registers a query that generates a certain RDF stream,
• POST, with a start or stop command as parameter, is used to start or stop a registered query, and
• DELETE can be used to unregister a query.

6. RELATED WORK
Two previous works [14, 22] address the need for publishing data streams as Linked Data.

In [14], Corcho introduces the concept of Linked Stream Data, a way in which the Linked Data principles can be applied to stream data and be part of the Web of Linked Data. At first glance, his proposal could appear similar to ours. Both his and our proposal use named graphs and define IRI schemata. However, his approach does not take into account the nature of streams, which, being unbounded sequences of time-varying data elements, should not be treated as persistent data to be stored (forever) and queried on demand, but rather as transient data to be consumed on the fly by continuous queries. His proposal allows for opening a window starting from and ending at any moment in time (see listing below). This is incompatible with the principle of keeping a window open on the latest data, which has to be consumed on the fly. It requires the Linked Stream Data server to store the stream for an indefinite time period.

    http://www.domain.org/sensor/name/%start time%,%end time%

In [22], Rodríguez et al. introduce the notion of Time-Annotated RDF (TA-RDF), which allows for representing time-series data, especially streaming data, using the Semantic Web approach. TA-RDF is an extension of the RDF model where resources are optionally annotated with a time value, i.e., a time-annotated resource is a pair of the form resource[time] (see listing below for an example).

    <urn:OHARE> <urn:hasRainSensor> <urn:sensor1> .
    <urn:sensor1>["2009-01-01Z-06:00"^^xsd:date] <urn:hasReading> "0" .
    <urn:sensor1>["2009-01-01Z-06:05"^^xsd:date] <urn:hasReading> "5" .
    ...
    <urn:sensor1>["2009-01-31Z-10:00"^^xsd:date] <urn:hasReading> "15" .

A TA-RDF graph can be represented as a set of RDF graphs using two special properties: belongsTo, which indicates a data element in a stream, and hasTimestamp, which points toward the timestamp of the data element.

As for the previous related work, the TA-RDF proposal looks very similar to ours, but it still lacks the paradigmatic change from persistent data to transient data. In TA-RDF streams are supposed to be stored indefinitely.

Finally, the two proposals do not consider the rich types of windows proposed in DSMS. They do not propose a vocabulary to describe the window type (i.e., sld:physical vs. sld:logical) and the size of the window (i.e., the equivalent of our property windowSize). The properties lastUpdate and expires, which in our vocabulary tell a Linked Data Client when the graph was updated and when it will expire, are not present.

7. CONCLUSION
Distributing and presenting information in real-time streams is becoming a best practice on the Web. The nature of streams requires a paradigmatic change from persistent data, to be stored and queried on demand, to transient data, to be consumed on the fly by continuous queries.

In our previous work we investigated C-SPARQL as an approach to treat non-RDF DSMSs as virtual RDF streams and graphs. With this position paper, we propose an extension of our C-SPARQL Engine that publishes data streams as Linked Data. In this paper, we described the principle that inspires our approach and we explained how to publish RDF streams continuously generated by C-SPARQL queries. Such a best practice introduces the concepts of Stream Graph (or s-graph) and Instantaneous Graph (or i-graph) as well as a small vocabulary that describes which part of the stream has been published and when the information will expire. A RESTful service to control the C-SPARQL queries that generate the RDF streams is also detailed.

We believe that our proposal can lower the entry barrier for external (Semantic) Web applications to consume data streams. Our next step is to complete the prototypical implementation of our Streaming Linked Data Server and evaluate it against several use cases. We are currently considering the synthetic Linear Road Benchmark [4], a well-established benchmark for Data Stream Management Systems, and several real sources of streams that we are already experimenting with (see for instance the social media streams in [8] or the Milan traffic streams in [9]).

8. ACKNOWLEDGMENTS
The work described in this paper has been partially supported by the European project LarKC (FP7-215535).

9. REFERENCES
[1] D. J. Abadi, Y. Ahmad, M. Balazinska, U. Çetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. S. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and S. Zdonik. The Design of the Borealis Stream Processing Engine. In Proc. Intl. Conf. on Innovative Data Systems Research (CIDR 2005), 2005.
[2] A. Arasu, B. Babcock, S. Babu, M. Datar, K. Ito, I. Nishizawa, J. Rosenstein, and J. Widom. STREAM: The Stanford Stream Data Manager (Demonstration Description). In Proc. ACM Intl. Conf. on Management of Data (SIGMOD 2003), page 665, 2003.
[3] A. Arasu, S. Babu, and J. Widom. The CQL Continuous Query Language: Semantic Foundations and Query Execution. The VLDB Journal, 15(2):121-142, 2006.
[4] A. Arasu, M. Cherniack, E. F. Galvez, D. Maier, A. Maskey, E. Ryvkina, M. Stonebraker, and R. Tibbetts. Linear Road: A Stream Data Management Benchmark. In M. A. Nascimento, M. T. Özsu, D. Kossmann, R. J. Miller, J. A. Blakeley, and K. B. Schiefer, editors, VLDB, pages 480-491. Morgan Kaufmann, 2004.
[5] D. Ayers and M. Völkel. Cool URIs for the Semantic Web. World Wide Web Consortium, Note NOTE-cooluris-20081203, December 2008. Available online at: http://www.w3.org/TR/2008/NOTE-cooluris-20081203/.
[6] Y. Bai, H. Thakkar, H. Wang, C. Luo, and C. Zaniolo. A Data Stream Language and System Designed for Power and Extensibility. In Proc. Intl. Conf. on Information and Knowledge Management (CIKM 2006), pages 337-346, 2006.
[7] D. F. Barbieri, D. Braga, S. Ceri, E. Della Valle, and M. Grossniklaus. C-SPARQL: SPARQL for Continuous Querying. In Proc. Intl. Conf. on World Wide Web (WWW), pages 1061-1062, 2009.
[8] D. F. Barbieri, D. Braga, S. Ceri, E. Della Valle, and M. Grossniklaus. Continuous Queries and Real-time Analysis of Social Semantic Data with C-SPARQL. In Proceedings of Social Data on the Web Workshop at the 8th International Semantic Web Conference, October 2009.
[9] D. F. Barbieri, D. Braga, S. Ceri, and M. Grossniklaus. An Execution Environment for C-SPARQL Queries. In Proc. Intl. Conf. on Extending Database Technology (EDBT), 2010.
[10] C. Bizer. D2R MAP - A Database to RDF Mapping Language. In WWW (Posters), 2003.
[11] C. Bizer, R. Cyganiak, and T. Heath. How to Publish Linked Data on the Web. Web page, 2007. Revised 2008. Accessed 07/08/2009.
[12] C. Bizer and A. Seaborne. D2RQ - Treating Non-RDF Databases as Virtual RDF Graphs. In ISWC2004 (posters), November 2004.
[13] J. J. Carroll, C. Bizer, P. J. Hayes, and P. Stickler. Named Graphs, Provenance and Trust. In A. Ellis and T. Hagino, editors, WWW, pages 613-622. ACM, 2005.
[14] O. Corcho. Linked Stream Data: A Position Paper. In The 2nd International Workshop on Semantic Sensor Networks 2009, 2009.
[15] M. Garofalakis, J. Gehrke, and R. Rastogi. Data Stream Management: Processing High-Speed Data Streams (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2007.
[16] L. Golab and M. T. Özsu. Processing Sliding Window Multi-Joins in Continuous Queries over Data Streams. In Proc. Intl. Conf. on Very Large Data Bases (VLDB 2003), pages 500-511, 2003.
[17] O. Hartig, C. Bizer, and J. C. Freytag. Executing SPARQL Queries over the Web of Linked Data. In A. Bernstein, D. R. Karger, T. Heath, L. Feigenbaum, D. Maynard, E. Motta, and K. Thirunarayan, editors, International Semantic Web Conference, volume 5823 of Lecture Notes in Computer Science, pages 293-309. Springer, 2009.
[18] International Organization for Standardization. Data elements and interchange formats - Information interchange - Representation of dates and times. ISO 8601, December 2004. Available online at: http://xml.coverpages.org/ISO-FDIS-8601.pdf.
[19] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: A Document-oriented Lookup Index for Open Linked Data. IJMSO, 3(1):37-52, 2008.
[20] E. Prud'hommeaux and A. Seaborne. SPARQL Query Language for RDF. http://www.w3.org/TR/rdf-sparql-query/.
[21] L. Richardson and S. Ruby. RESTful Web Services. O'Reilly, Beijing, 2007.
[22] A. Rodriguez, R. McGrath, Y. Liu, and J. Myers. Semantic Management of Streaming Data. In Proc. Intl. Workshop on Semantic Sensor Networks (SSN), 2009.
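
As a supplementary illustration of the IRI schemata of Sections 3 and 4, the following Python sketch builds an i-graph IRI from a timestamp and maps a window-control IRI to the window part of a C-SPARQL FROM STREAM clause. It is not part of the C-SPARQL Engine or of the Streaming Linked Data Server: the function names (igraph_iri, duration_to_csparql, window_clause) are hypothetical, and the sketch assumes that durations have a single component (such as PT1H, PT10M or P1D), which is narrower than the full xsd:duration lexical space.

```python
import re
from urllib.parse import quote

# Assumption: only single-component durations (e.g. PT1H, PT10M, P1D) are
# handled, not the full xsd:duration lexical space.
_DURATION = re.compile(r"^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$")
_UNITS = ("d", "h", "m", "s")  # C-SPARQL TimeUnit tokens, in regex group order


def igraph_iri(sgraph_iri, timestamp):
    """Append the URL-encoded timestamp to the s-graph IRI (Section 3 schema)."""
    return sgraph_iri.rstrip("/") + "/" + quote(timestamp, safe="")


def duration_to_csparql(duration):
    """Turn a simple duration such as 'PT1H' into the C-SPARQL form '1h'."""
    match = _DURATION.match(duration)
    if not match:
        raise ValueError(f"unsupported duration: {duration!r}")
    parts = [(n, u) for n, u in zip(match.groups(), _UNITS) if n is not None]
    if len(parts) != 1:
        raise ValueError(f"expected exactly one component: {duration!r}")
    number, unit = parts[0]
    return f"{int(number)}{unit}"


def window_clause(window_iri):
    """Map the tail of a window-control IRI (Section 4) to a RANGE clause."""
    segments = window_iri.rstrip("/").split("/")
    if len(segments) >= 2 and segments[-2] == "physical":
        # .../physical/%size% selects the last %size% triples.
        return f"[RANGE TRIPLES {int(segments[-1])}]"
    if len(segments) >= 3 and segments[-3] == "logical":
        # .../logical/%size%/%step% where %step% is 'tumbling' or a duration.
        size, step = segments[-2], segments[-1]
        if step == "tumbling":
            overlap = "TUMBLING"
        else:
            overlap = "STEP " + duration_to_csparql(step)
        return f"[RANGE {duration_to_csparql(size)} {overlap}]"
    raise ValueError(f"not a window-control IRI: {window_iri!r}")


if __name__ == "__main__":
    print(igraph_iri("http://stockex.org/transactions", "2010-02-12T13:34:41Z"))
    print(window_clause("http://stockex.org/transactions/logical/PT1H/PT10M"))
    print(window_clause("http://stockex.org/transactions/physical/1000"))
```

Under these assumptions, the logical window example of Section 4 yields the clause [RANGE 1h STEP 10m] used in the equivalent C-SPARQL query shown there. A real server would also need to decide how to treat multi-component durations and stream names that collide with the keywords physical and logical, which this sketch does not attempt.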