Gize: A Time Warp in the Web of Data Valeria Fionda1 , Melisachew Wudage Chekol2 , and Giuseppe Pirrò3 1 University of Calabria, Italy fionda@mat.unical.it 2 University of Mannheim, Germany mel@informatik.uni-mannheim.de 3 Institute for High Performance Computing and Networking, ICAR-CNR, Italy pirro@icar.cnr.it Abstract. We introduce the Gize framework for querying historical RDF data. Gize builds upon two main pillars: a lightweight approach to keep historical data, and an extension of SPARQL called SPARQ–LTL, which incorporates temporal logic primitives to enable a rich class of queries. One striking point of Gize is that its features can be readily made available in existing query processors. 1 Introduction Querying historical data is of utmost importance in many contexts, from city monitoring, where one needs to track different aspects (e.g., pollution, popula- tion) to generic exploratory research, where one is interested in posing queries like “Retrieve players that are now managing some club they played for” or “Re- trieve the annotation of a gene since the discovery of a particular interaction”. The classical approach for querying RDF data (e.g., via SPARQL endpoints) only considers the latest version. In fact, querying of historical RDF data poses some challenges. The first concerns the representation and storing of historical data. Some approaches (e.g., [4, 7]) allow to retrieve data by providing times- tamps. Other resort to dedicated indexing structures (e.g., [5]) to speed-up query processing. The second problem concerns what type of querying primitive to pro- vide. The most common approach is to devise SPARQL extensions that work with intervals or SPARQL translations into temporal logic. The addition of ad- hoc components either in terms of data representation, query language or both, hinders the applicability on existing RDF (query) processing infrastructures. To cope with these issues we present the Gize4 framework. Gize is built around two main components. The first is a lightweight approach to store RDF data, where each version of the data is stored in a separate named graph. The second component is a powerful extension of SPARQL called SPARQ–LTL. This language inherits a variety of temporal operators from Linear Temporal Logics (LTL) [2]. To evaluate SPARQ–LTL on existing SPARQL processors we devise a translation to standard SPARQL queries. 4 Gize ( ) is the Ethiopian word for time. Related Work. Approaches like the DBpedia Wayback Machine [4] allow to re- trieve data at a certain timestamp (provided by the user). Other approaches (e.g.,[7]) access historical data via (HTTP) content negotiation, typically using the Memento framework. Yet other approaches (e.g., [5]) introduce timestamps into RDF triples along with ad-hoc indexing strategies. In Description Logics, some proposals focus on temporal conjunctive query answering (e.g., [1]); other efforts (e.g., [6]) have focused on the translation of SPARQL into LTL. Sur- prisingly, the design of solutions for querying historical RDF data on existing SPARQL processors is still in its infancy. Gize fills this gap by contributing an extension of SPARQL, called SPARQ–LTL, that allows for a rich class of temporal primitives borrowed from LTL (e.g., SINCE, NEXT, PREVIOUS) along with a translation from SPARQ–LTL queries into standard SPARQL queries. 2 The Gize Framework Representing Historical RDF Data. The tenet of Gize is to enable querying of historical RDF data on existing processors. To represent versions, Gize lever- ages the notion of RDF quad. An RDF quad (for simplicity, we omit bnodes) is a tuple of the form hs, p, o, ci ∈ I × I × (I ∪ L) × I, where I (IRIs) and L (literals) are countably infinite sets. The forth element of the quad represents that named graph to which the triple belongs. June 2012 March 2014 September 2014 dbp:Cesare_Prandelli dbp:Cesare_Prandelli dbp:Andrea_Pirlo dbp:Antonio_Conte dbp:Andrea_Pirlo dbp:Andrea_Pirlo dbpo:coach dbpo:name dbpo:name dbpo:coach . dbpo:coach . dbp:Luca_Antonelli dbpo:name . . dbp:Italy_national_football_team . dbpo:name . dbp:Italy_national_football_team . dbp:Italy_national_football_team dbpo:regionalName dbpo:name dbp:Giampaolo_Pazzini . dbpo:regionalName dbpo:name dbp:Matteo_Darmian dbpo:regionalName . dbpo:name dbp:UEFA_European_Championship dbp:Giampaolo_Pazzini dbp:UEFA_European_Championship dbp:UEFA_Euro_2016_qualifying dbp:UEFA_Euro_2012 dbp:UEFA_European_Championship dbpo:current dbpo:current June 2016 July 2016 update addition deletion dbp:Antonio_Conte dbp:Giampiero_Ventura dbp:Andrea_Pirlo dbp:Andrea_Pirlo DBpedia Update Statistics about People dbpo:coach dbpo:name . dbpo:coach dbpo:name . . . dbp:Italy_national_football_team dbp:Italy_national_football_team . . dbpo:regionalName dbpo:name dbp:Matteo_Darmian dbpo:regionalName dbpo:name dbp:Matteo_Darmian dbp:UEFA_European_Championship dbp:UEFA_European_Championship dbp:UEFA_Euro_2016 dbp:UEFA_Euro_2016 dbpo:current dbpo:current Fig. 1. An exceprt of evolving data from DBpedia. Fig. 1 shows the evolution of some data taken from DBpedia. Each of the 5 versions considered is represented by a named graph. From this small data sample one may notice that Italy has changed 3 coaches from July 2012 (C. Prandelli) to July 2016 (G. Ventura). Interestingly, the latest coach (G. Ventura) will start on July 18th. One may also notice that some players like A. Pirlo were part of the team along the whole period, while some other like G. Pazzini or M. Darmiani were left out or added, respectively. The figure also shows (bottom right corner) update statistics about People in DBpedia. Each percentage is computed wrt the previous version. The availability of historical data allows to pose queries like “Find all players that played with the highest number of coaches” or “Find players that played since C. Prandelli was the coach”. Most of existing approaches either are not able to express such kind of queries or have to resort to ad-hoc processing infrastructures. The SPARQ–LTL Language. The syntax of SPARQ–LTL is shown below while Table 1 provides a description of the temporal operators. Let q be a SPARQ–LTL query, H = {h1 , h2 , . . . , hm } be the set of versions, and hc be the current (not necessarily the latest) version of the data. QP ::= QP1 . QP2 | {QP1 } UNION {QP2 } | {QP1 } MINUS {QP2 } | QP1 OPTIONAL {QP2 } | | GRAPH I ∪ VQP1 | QP1 FILTER {R} | t = (I ∪ V) × (I ∪ V) × (I ∪ L ∪ V) | | X{QP} | W{QP} | F{QP} | G{QP} | {QP1 } U {QP2 } | | Y{QP} | Z{QP} | P{QP} | H{QP} | {QP1 } S {QP2 } Operator SPARQ–LTL Syntax Meaning Xq NEXT Evaluate q on version hc+1 Fq EVENTUALLY Evaluate q on all versions hc , ....hm Gq ALWAYS The evaluation of q must be the same on all versions hc , ....hm q1 U q2 UNTIL If S2 is the solution of q2 in a version, hk ∈ {hc , ....hm } then there exists a solution S1 of q1 on hc such that S1 is compatible with S2 in all versions {hc , ....hk } Yq PREVIOUS Evaluate q on version hc−1 Pq PAST Evaluate q on all versions h1 , ....hc Hq ALWAYSPAST The evaluation of q must be the same on all versions h1 , ....hc q1 S q2 SINCE If S2 is the solution of q2 in a version, hk ∈ {h1 , ....hc } then there exists a solution S1 of q1 on {hk , ....hc } such that S1 is compatible with S2 in all versions {hk , ....hc } Table 1. Meaning of the temporal operators in SPARQ–LTL. SPARQ–LTL allows to use an additional set of keywords when writing SPARQL queries. SPARQ–LTL are evaluated by translating the temporal operators via a (set of) pattern(s) evaluated on named graphs maintaining data versions. We give some examples by considering the five version of data shown in Fig. 1 (stored in separate named graphs). In what follows, dbp:INFT is a shorthand for dbp:Italy_national_football_team. Example 1. Select players who are playing in the Italian national football team and played at least under a different coach than the current one. SPARQ–LTL Translation into SPARQL SELECT ?p WHERE { dbp:INFT dbpo:name ?p. dbp:INFT dbpo:coach ?c1. {GRAPH { dbp:INFT dbpo:name ?p. SELECT ?p WHERE { dbp:INFT dbpo:coach ?c2. dbp:INFT dbpo:name ?p. FILTER (?c1!=?c2) } } UNION dbp:INFT dbpo:coach ?c1. {GRAPH { dbp:INFT dbpo:name ?p. PAST{ dbp:INFT dbpo:coach ?c2. dbp:INFT dbpo:name ?p. FILTER (?c1!=?c2) } } UNION dbp:INFT dbpo:coach ?c2. {GRAPH { dbp:INFT dbpo:name ?p. FILTER (?c1!=?c2) dbp:INFT dbpo:coach ?c2. } } FILTER (?c1!=?c2)} } UNION {GRAPH { dbp:INFT dbpo:name ?p. dbp:INFT dbpo:coach ?c2. FILTER (?c1!=?c2) } } UNION {GRAPH { dbp:INFT dbpo:name ?p. dbp:INFT dbpo:coach ?c2. FILTER (?c1!=?c2)} } } The SPARQL query on the right is automatically generated and can be eval- uated on existing processors. Note that the translation (because of the semantics of PAST described in Table 1) requires to look into all versions. Example 2. Find the name of the coach of the Italian national football team after the sacking of Cesare Prandelli. SPARQ–LTL Translation into SPARQL SELECT ?n WHERE { SELECT ?n WHERE { {GRAPH { PAST { dbp:INFT dbpo:coach dbp:CP. dbp:INFT dbpo:coach dbp:CP. GRAPH NEXT { {dbp:INFT dbpo:coach ?n. FILTER (?n != dbp:CP )} dbp:INFT dbpo:coach ?n. } } UNION ...... UNION FILTER (?n != dbp:CP ) {GRAPH { } dbp:INFT dbpo:coach dbp:CP. } GRAPH } {dbp:INFT dbpo:coach ?n. FILTER (?n != dbp:CP ) } } }} In the previous query, dbp:CP is a shorthand for dbp:Cesare_Prandelli. As before, the translation of PAST makes usage of UNION queries over each ver- sions vi ; then, for each vi , NEXT checks in version vi+1 (via a FILTER) that the coach changed. 3 Conclusions We have outlined Gize, which enables to set-up an infrastructure for query- ing historical RDF data on existing SPARQL processors. Gize adopts a simple approach to store different versions of the data and a powerful temporal ex- tension of SPARQL called SPARQ–LTL. As a future work, we are considering approaches like RDF HDT [3] to improve the storage space consumption. References 1. S. Borgwardt, M. Lippmann, and V. Thost. Temporalizing Rewritable Query Lan- guages Over Knowledge Bases. JWS, 33:50–70, 2015. 2. G. D. De Giacomo and M. Y. Vardi. Linear Temporal Logic and Linear Dynamic Logic on Finite Traces. In IJCAI, 2013. 3. J. D. Fernández, M. A. Martı́nez-Prieto, C. Gutiérrez, A. Polleres, and M. Arias. Binary RDF Representation for Publication and Exchange. JWS, 19:22–41, 2013. 4. J. D. Fernández, P. Schneider, and J. Umbrich. The DBpedia Wayback Machine. In SEMANTICS, pages 192–195, 2015. 5. S. Gao, J. Gu, and C. Zaniolo. RDF-TX: A Fast, User-Friendly System for Querying the History of RDF Knowledge Bases. In EDBT, pages 269–280, 2016. 6. R. Mateescu, S. Meriot, and S. Rampacek. Extending SPARQL with Temporal Logic. Report, 2009. 7. H. Van de Sompel, R. Sanderson, M. L. Nelson, L. L. Balakireva, H. Shankar, and S. Ainsworth. An HTTP-based Versioning Mechanism for Linked Data. 2010.