sparqlPuSH: Proactive notification of data updates in RDF stores using PubSubHubbub

Alexandre Passant1 and Pablo N. Mendes2

1 Digital Enterprise Research Institute, National University of Ireland, Galway — alexandre.passant@deri.org
2 Kno.e.sis Center, CSE Department, Wright State University, Dayton, OH, USA — pablo@knoesis.org

Abstract. With the growing number of status-update websites and related wrappers, initiatives modelling sensor data in RDF, and the dynamic nature of many Linked Data exporters, there is a need for protocols enabling real-time notification and broadcasting of RDF data updates. In this paper we present a flexible approach that allows such notifications to be delivered in real-time to any RSS or Atom reader. Our framework enables the active delivery of SPARQL query results through the PubSubHubbub (PuSH) protocol upon the arrival of new information in RDF stores. Our open-source implementation can be plugged on any SPARQL endpoint and can directly reuse PuSH hubs that are already deployed in scalable clouds (e.g. Google's).

1 Introduction

Since the Semantic Web is "an extension of the current Web" [5], it has to deal with the different paradigm shifts happening on the Web. In particular, more and more streamed information is available online, ranging from microblogging updates to sensor data, often combined with trends in ubiquitous computing — e.g. status and geolocation updates from mobile phones, related to the two aforementioned aspects. On the Semantic Web side, this entails the ability to capture this streamed information in RDF, through efforts such as semantic microblogging [12, 9] or the representation of sensor information as Linked Open Data [1]. In addition, many data sources in the Linking Open Data cloud are built from user-generated content, such as DBpedia, Freebase, FOAF profiles, etc. [6].
Consequently, there is a need to tackle these dynamic generation aspects, which lead to a constantly evolving body of RDF data available at Web scale. These dynamic aspects of RDF data entail various issues, including change management [13], stream querying [4], etc. In this paper we focus on how to enable real-time notifications of data updates in RDF stores. We provide a way to let users subscribe to a subset of the content available within an RDF store (defined as a SPARQL query) and get a notification message each time some content within that subset changes in the store. To achieve this goal, we define a complete framework for such notifications and broadcasting, based on:

– the representation of data updates in RDF stores through RSS/Atom feeds, and a registration system to map SPARQL queries to these feeds;
– the use of the PubSubHubbub protocol3 to proactively broadcast the previous feeds and inform clients about data updates in real-time;
– an open-source implementation of the aforementioned principles, built in PHP and flexible enough to be deployed on top of any SPARQL endpoint supporting SPARQL Query and SPARQL Update4.

The rest of this paper is organised as follows. In Section 2, we discuss our motivations for real-time notification of data updates, before discussing related work in Section 3. In Section 4, we present how we use the PubSubHubbub protocol to broadcast changes happening in RDF stores. In particular, we discuss how we map SPARQL query results to RSS feeds and how they are broadcast to interested parties. In Section 5, we discuss the implementation of the previous principles in sparqlPuSH. Finally, we conclude the paper.

2 Motivations

On the Web, particularly in the post-Web 2.0 era, there is a ubiquitous feed of socially created content around the globe that may or may not find its way to an interested user. Imagine users interested in monitoring a developing news story (e.g.
the 2010 Chile earthquake), keeping up-to-date on media placements for a brand, or getting the latest news on the stock market. In such cases, content updates may contain actionable information that is useful only if delivered in real-time, especially in emergency scenarios such as earthquake monitoring [11]. Google Alerts, a content monitoring service, automatically notifies users when there are new Google results for a set of keywords. It delivers updates through e-mails (text and HTML) and RSS feeds. However, monitoring alerts are limited to keyword-based queries. From the information retrieval point of view, SPARQL queries provide a more expressive language than keywords to describe a user's information need — i.e. complex constraints and unambiguous references. A Semantic Web counterpart to Google Alerts should allow users to register a SPARQL query and get updates pushed to them as new matching triples arrive in an underlying RDF store containing relevant data. As such, Semantic Search engines that allow persistent searches and real-time updates could support the use-cases that we now briefly describe.

2.1 Monitoring competitors' information in a corporate context

A product manager interested in following the competitors of his company could ask a semantic search engine to "select user-generated content mentioning companies that compete with mine", using DBpedia to identify such competitors, as depicted in Fig. 1. Using a pull approach, his RSS aggregator would have to constantly fetch the feeds to identify new content. However, with a push model, relevant data is simply delivered as soon as it comes into the original system(s).

3 http://code.google.com/p/pubsubhubbub/
4 http://www.w3.org/TR/sparql11-update/

PREFIX ex: <http://example.org/>
PREFIX moat: <http://moat-project.org/ns#>
PREFIX company: <http://dbpedia.org/ontology/Company>
SELECT ?document WHERE {
  ?document moat:taggedWith ?competitor .
  ?competitor company:industry ?industry .
  ex:MyCompany company:industry ?industry .
}

Fig. 1. Example SPARQL query selecting documents that mention competitors (i.e. working in the same field) of a fictitious company identified as ex:MyCompany.

2.2 Following updates on the Health Care Reform

In the context of the Health Care Reform in the U.S.A., we present the following fictitious use-case: Otto is a moderate congressman who is concerned primarily with practical considerations, rather than moralistic premises. He would like to follow, as the discussion unfolds, how public perception changes across states. In particular, he would like to compare the trending topics in states with Republican versus Democratic majorities. Many data sources are applicable to Otto's use-case, including local news and microblog posts. Otto would like to "select all entities mentioned in microblog posts from democratic states", while in another window he would like to "select all entities mentioned in microblog posts from republican states". Looking at the windows side by side, he would be able to quickly glance over the differences. However, a more complex query could provide a direct answer, if he chooses to "select all entities mentioned in microblog posts from democratic states that were not mentioned in microblog posts from republican states". Once again, in order to get an accurate perception of the current trends, this information should be updated as soon as new data comes into one of these original systems.

3 Related work

Various works have recently focused on the representation of changes in RDF data sources and related datasets, notably in the Linked Data realm5. These include the Talis Changesets6 and Triplify update vocabularies7 [3], as well as the dady (Dataset Dynamics) vocabulary8, which can be combined with voiD [2].

5 http://www.ldodds.com/blog/2010/04/rdf-dataset-notifications/
While the first two provide the ability to represent atomic changes (e.g. new triples being added to a resource), dady focuses mainly on representing characteristics of changes in datasets, such as their expected frequency. In addition, atomic changes can be transmitted in Atom9. Other efforts include the Web of Data Link Maintenance Protocol10 [14], which notifies linked resources when links to them are added or removed, as well as — to some extent — the Semantic Sitemap extension [7], which defines how often data can be re-crawled from a website to get fresh information; however, since it relies on clients regularly fetching it, it does not directly solve the notification issue. Regarding real-time notification, XMPP messaging [10] can be used as a way to transport SPARQL queries11 [8], while Semantic Pingback12 focuses on informing remote sources of new links as soon as a link is created from a seed source. Finally, outside the Semantic Web world, both rssCloud13 and PubSubHubbub14 (PuSH) address the notification issue. Both follow a push approach to broadcast notifications, in feeds transmitted via hubs that push information proactively from one service to the various clients interested in following that service's changes. However, rssCloud focuses only on RSS 2.0 feeds, while PuSH can be adapted to both RSS and Atom. In addition, PuSH provides a public infrastructure that can be used by implementers, notably the Google public PuSH hub15.

4 Distributing data updates as RSS and Atom feeds

4.1 Shifting from a pull to a push approach

As we presented in the use-cases, our main motivation is to enable data change notifications to shift from a pull to a push approach, i.e. letting people be notified of changes in the information they are interested in, rather than forcing them to constantly poll sources to identify new data.
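To make the push model concrete: instead of polling, a client registers a callback with a PuSH hub once, and the hub calls back whenever the feed changes. The following sketch builds the body of such a PuSH subscription request; the hub, feed, and callback URLs are illustrative, and the callback endpoint is hypothetical:

```python
from urllib.parse import urlencode

def build_subscribe_request(hub_url, topic_url, callback_url):
    """Build the form-encoded body of a PubSubHubbub subscription request.

    Per the PuSH specification, a subscriber POSTs these fields to the hub;
    the hub then verifies the callback and pushes new feed entries to it.
    """
    body = urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,        # the feed the client wants to follow
        "hub.callback": callback_url,  # where the hub will POST new entries
        "hub.verify": "async",         # let the hub verify asynchronously
    })
    return hub_url, body

hub, body = build_subscribe_request(
    "https://pubsubhubbub.appspot.com/",
    "http://example.org/feed/34562738",
    "http://client.example.org/push-callback",  # hypothetical client endpoint
)
```

After this single request, no further action is needed from the client: updates arrive at the callback as they happen, which is the behaviour the pull approach cannot offer.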
In particular, our goal is to enable proactive notification of changes happening in RDF stores, whatever they concern: new data of a particular type being added, updated statements about a given resource, etc. To do so, we rely on the aforementioned PubSubHubbub protocol to broadcast these updates, combined with a two-step approach: (i) registering the SPARQL queries related to the updates that must be monitored in an RDF store and (ii) broadcasting changes when data mapped to these queries is updated in the store. This workflow has been implemented in sparqlPuSH, a system that can be plugged on top of any SPARQL endpoint to achieve this goal, and that we further describe in Section 5.

6 http://n2.talis.com/wiki/Changeset
7 http://triplify.org/vocabulary/update
8 http://vocab.deri.ie/dady
9 http://linkeddatacamp.org/wiki/LinkedDataCampVienna2009/DatasetDynamics
10 http://www4.wiwiss.fu-berlin.de/bizer/silk/wodlmp/
11 http://danbri.org/words/2008/02/11/278
12 http://aksw.org/Projects/SemanticPingback
13 http://rsscloud.org/
14 http://pubsubhubbub.googlecode.com/
15 https://pubsubhubbub.appspot.com/

4.2 Registering SPARQL queries for data updates

Fig. 2. Workflow of the sparqlPuSH query registration, see text for details.

The registration of a SPARQL query to be notified of updates happening in an RDF store works as follows16 (Fig. 2):

1. a user sends a SPARQL query to the sparqlPuSH interface (compliant with the SPARQL protocol), e.g. http://example.org/sparqlPuSH/;
2. the sparqlPuSH interface registers the query locally and maps it to a new feed, also indicating its creation date (Fig. 3). This information is stored in a particular graph, e.g. http://example.org/sparqlPuSH/feeds;
3. the interface generates a feed (RSS or Atom) corresponding to the query, and registers it with a PuSH hub (using Google's public one by default);
4.
the feed, containing a link to the PuSH hub URL in its header — according to the PuSH specification17 — is sent to the client;
5. the client parses the feed in order to get the PuSH hub URL, and registers its interest in the feed at this particular hub.

16 The last two steps of this workflow are simply the adaptation of the PuSH principles to our Semantic Web use-case.
17 http://code.google.com/p/pubsubhubbub/wiki/RssFeeds

@prefix sp: <http://vocab.deri.ie/sparqlpush#> .
@prefix dct: <http://purl.org/dc/terms/> .

<http://example.org/feed/34562738> a sp:Feed ;
  sp:query "SELECT ?uri ?author ?label ?date WHERE { ?uri a sioc:Post ; sioc:has_creator ?author ; dc:title ?label ; dct:created ?date . } ORDER BY ASC(?date)" ;
  dct:modified "2010-03-29T09:18:23Z" .

Fig. 3. Mapping RSS feeds to SPARQL query results.

In order to register the query, any system can send an HTTP POST request to the sparqlPuSH interface, and the query has to be passed using the query parameter18. The system automatically interprets some common prefixes, and additional ones can easily be added in an appropriate configuration file. Moreover, in order to provide relevant feeds (e.g. appropriate title or link elements), the system requires a few conventions to be respected in the SPARQL query:

– ?uri — the URI of the element(s) to be retrieved;
– ?date — their creation / modification date;

and some optional ones:

– ?label — their label;
– ?author — their author19.

That way, the different elements of the retrieved RDF data are converted into elements of the feeds. Notably, the use of a ?date variable is required to order information by date and ensure that information in the feed is ordered as expected, i.e. by update date. Moreover, any element of the feed (identified by the ?uri variable) can be a dereferenceable URI, so that it can be easily consumed by Linked Data aware clients.
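As an illustration of this mapping (a sketch only, not the paper's PHP implementation), the fragment below turns SPARQL result bindings that follow the ?uri/?date/?label/?author conventions into a minimal RSS 2.0 feed, ordered by date with the newest item first:

```python
import xml.etree.ElementTree as ET

def bindings_to_rss(feed_title, bindings):
    """Turn SPARQL result bindings using the ?uri/?date/?label/?author
    conventions into a minimal RSS 2.0 feed, newest items first."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    # ?date orders the feed; ?uri becomes the (dereferenceable) item link
    for b in sorted(bindings, key=lambda b: b["date"], reverse=True):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = b.get("label", b["uri"])
        ET.SubElement(item, "link").text = b["uri"]
        ET.SubElement(item, "pubDate").text = b["date"]
        if "author" in b:  # ?author is optional
            ET.SubElement(item, "author").text = b["author"]
    return ET.tostring(rss, encoding="unicode")

# Illustrative bindings, as they might come from the query of Fig. 3
feed = bindings_to_rss("sparqlPuSH feed", [
    {"uri": "http://example.org/post/1", "label": "First post",
     "date": "2010-03-29T09:18:23Z", "author": "http://example.org/alex"},
    {"uri": "http://example.org/post/2", "label": "Second post",
     "date": "2010-03-30T10:00:00Z"},
])
```

A production feed would also need channel metadata and the hub link element in its header; the sketch keeps only the elements that the four variable conventions determine.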
That way, we use simple RSS or Atom feeds to transfer RDF information between the original triple-store and the client(s) that request the change notifications. Furthermore, these queries (and the complete architecture proposed in this paper) can be used not only to identify new data of a particular type being loaded into a store (sometimes with additional constraints, such as filtering based on a given topic or author), but also to identify changes corresponding to a particular entity being modified. For example, relying on the Talis Changeset vocabulary, one can register a SPARQL query that will — based on the following broadcast system — raise an alert as soon as new statements are edited regarding a given resource, as described in Fig. 4.

18 In addition, as we will describe later, sparqlPuSH provides a form-based interface to directly register queries.
19 The ?author variable can bind either to the URI of an author, or to a literal identifying it, though the former is obviously preferred in a Linked Data scenario.

PREFIX cs: <http://purl.org/vocab/changeset/schema#>
SELECT ?uri ?author ?label ?date WHERE {
  ?uri a cs:ChangeSet ;
    cs:creatorName ?author ;
    cs:changeReason ?label ;
    cs:createdDate ?date ;
    cs:subjectOfChange <http://example.org/FooBar> .
} ORDER BY ASC(?date)

Fig. 4. Registering a SPARQL query to identify changes of a particular resource

4.3 Triggering events and broadcasting updates via PuSH hubs

Fig. 5. Workflow of the sparqlPuSH changes notification, see text for details.

Once feeds have been registered in the system, the process works as follows (Fig. 5):

1. RDF data can be loaded into the RDF store through the sparqlPuSH interface, which is compliant with SPARQL Update principles, so that it can be sent using HTTP POST.
Actually, this interface just acts as a proxy that launches triggers when new data is loaded;
2. once the data has been loaded, the system runs all the registered SPARQL queries and updates — if needed — the corresponding feeds20;
3. for each updated feed, a notification is sent to the PuSH hub;
4. immediately, the hub broadcasts the information to all the clients that have registered to this particular feed with it.

In practice, our experiments showed that once the data is loaded in the sparqlPuSH interface, the clients receive it only a few seconds later, using the Google public PuSH hub server21. With regard to the triggering step, since the sparqlPuSH interface lives on top of any SPARQL endpoint, it is done by (1) sending the update query from the interface to the store via HTTP, then (2) running all the registered queries when it receives a response from this update query. However, this process can be a bit cumbersome, especially when the sparqlPuSH interface and the original RDF store are on the same server, as it still implies running the queries over HTTP. To solve this issue, and as we will now discuss, the interface can be adapted for some particular stores to use their internal API for querying, rather than doing it via HTTP.

5 Implementation

We implemented sparqlPuSH in PHP, as a proxy that can be plugged on top of any RDF store supporting the currently standardised SPARQL Update language, the SPARQL protocol via HTTP, as well as named graphs (in order to register the queries and their mappings in a particular graph)22. The system includes (i) an interface to let users register queries (Figure 6(a)) — while this is generally done remotely, through clients that can then interpret the resulting feed and register to the appropriate PuSH hub — and (ii) an interface listing the available RSS feeds and the corresponding SPARQL queries (Figure 6(b)).
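The per-feed notification that the server sends to the hub is simply a PuSH publisher ping: a form-encoded POST naming the updated feed(s). A minimal sketch of building that request body (the feed URL is the running example, not a real endpoint):

```python
from urllib.parse import urlencode

def build_publish_ping(feed_urls):
    """Build the body of the PuSH publisher notification: a form-encoded
    POST telling the hub which feed(s) have new content. On receiving it,
    the hub re-fetches the feeds and pushes new entries to subscribers."""
    pairs = [("hub.mode", "publish")]
    # the hub.url parameter may be repeated, one per updated feed
    pairs += [("hub.url", url) for url in feed_urls]
    return urlencode(pairs)

body = build_publish_ping(["http://example.org/feed/34562738"])
# e.g. "hub.mode=publish&hub.url=http%3A%2F%2Fexample.org%2Ffeed%2F34562738"
```

Since the hub fans the update out to all subscribers itself, the server's cost per change stays constant regardless of how many clients follow a feed, which is what makes delegating to a public hub attractive.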
The sparqlPuSH implementation is available at http://code.google.com/p/sparqlpush through SVN, and comes with an example client that can be used to test the approach, in addition to the server part described in this paper. This test client can (1) register SPARQL queries with any sparqlPuSH interface (including the management of the PuSH hub registration when retrieving the response feed) and (2) receive notifications from any PuSH hub to update its interface in real-time (using JavaScript to check every second whether a notification has been received). In addition, a video is available from the project website to showcase the interest of the approach.

20 In addition, their dct:modified value (as seen in Fig. 3) is updated.
21 http://pubsubhubbub.appspot.com/
22 These requirements are common features of many RDF stores.

Fig. 6. sparqlPuSH: (a) registering queries; (b) listing available queries.

Furthermore, as we previously mentioned, in addition to the generic SPARQL connector, sparqlPuSH includes a direct interface to ARC2 using its PHP API. That way, interactions between sparqlPuSH and the RDF store — including the triggering step — are not done through HTTP but directly through the ARC2 API, making the approach even faster. New wrappers for other APIs could easily be added, in order to avoid these HTTP interactions and make sparqlPuSH part of the RDF store itself.

6 Conclusion

In this paper, we detailed an architecture for triggering data updates in RDF stores and broadcasting them in real-time to various clients. Our system, sparqlPuSH, achieves this by combining SPARQL queries, RSS / Atom feeds, and the PubSubHubbub protocol. We believe that this approach can be a first step towards a push model for the Semantic Web, which is increasingly needed considering the evolution of the (Semantic) Web towards a continuous stream of data.
In addition, while currently available as a plug-in for any RDF store, we hope that this push approach can become a default model in various RDF store implementations, enabling more capabilities to monitor, in real-time, changes related to RDF data. In the future, we could also imagine registering the queries to be monitored using C-SPARQL [4], instead of the current triggering approach.

Acknowledgements

The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Líon-2).

References

1. Proceedings of the 1st SemSensWeb2009 Workshop on the Semantic Sensor Web, volume 468. CEUR-WS.org, 2009.
2. Keith Alexander, Richard Cyganiak, Michael Hausenblas, and Jun Zhao. Describing Linked Datasets. In Proceedings of the Second Workshop on Linked Data on the Web (LDOW2009) at WWW2009, 2009.
3. Sören Auer, Sebastian Dietzold, Jens Lehmann, Sebastian Hellmann, and David Aumueller. Triplify: light-weight linked data publication from relational databases. In Juan Quemada, Gonzalo León, Yoëlle S. Maarek, and Wolfgang Nejdl, editors, Proceedings of the 18th International Conference on World Wide Web, WWW 2009, Madrid, Spain, April 20-24, 2009, pages 621–630. ACM, 2009.
4. Davide Francesco Barbieri, Daniele Braga, Stefano Ceri, Emanuele Della Valle, and Michael Grossniklaus. C-SPARQL: SPARQL for continuous querying. In WWW, pages 1061–1062, 2009.
5. Tim Berners-Lee, James A. Hendler, and Ora Lassila. The Semantic Web. Scientific American, 284(5):34–43, 2001.
6. Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems (IJSWIS), 5(3):1–22, 2009.
7. Richard Cyganiak, Holger Stenzhorn, Renaud Delbru, Stefan Decker, and Giovanni Tummarello. Semantic Sitemaps: Efficient and Flexible Access to Datasets on the Semantic Web.
In Proceedings of the 5th European Semantic Web Conference (ESWC 2008), volume 5021 of Lecture Notes in Computer Science, pages 690–704. Springer, 2008.
8. Frank Osterfeld, Malte Kiesel, and Sven Schwarz. Nabu — A Semantic Archive for XMPP Instant Messaging. In Proceedings of the 1st Workshop on The Semantic Desktop, 4th International Semantic Web Conference, volume 175 of CEUR Workshop Proceedings. CEUR-WS.org, 2005.
9. Alexandre Passant, Uldis Bojars, John G. Breslin, Tuukka Hastrup, Milan Stankovic, and Philippe Laublet. An Overview of SMOB 2: Open, Semantic and Distributed Microblogging. In 4th International Conference on Weblogs and Social Media, ICWSM 2010. AAAI, 2010.
10. Peter Saint-Andre. Extensible Messaging and Presence Protocol (XMPP): Core. Request for Comments: 3920, Internet Engineering Task Force, 2004. http://www.ietf.org/rfc/rfc3920.txt.
11. Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors. In Proceedings of the Nineteenth International WWW Conference (WWW2010). ACM, 2010.
12. Joshua Shinavier. Real-time SemanticWeb in <= 140 chars. In Proceedings of the Third Workshop on Linked Data on the Web (LDOW2010) at WWW2010, 2010.
13. Jürgen Umbrich, Michael Hausenblas, Aidan Hogan, Axel Polleres, and Stefan Decker. Towards Dataset Dynamics: Change Frequency of Linked Open Data Sources. In Proceedings of the Third Workshop on Linked Data on the Web (LDOW2010) at WWW2010, 2010.
14. Julius Volz, Christian Bizer, Martin Gaedke, and Georgi Kobilarov. Discovering and maintaining links on the web of data. In International Semantic Web Conference, volume 5823 of Lecture Notes in Computer Science, pages 650–665. Springer, 2009.