Update Strategies for DBpedia Live

Claus Stadler+, Michael Martin*, Jens Lehmann*, and Sebastian Hellmann*

Universität Leipzig, Institut für Informatik, Johannisgasse 26, D-04103 Leipzig, Germany
+cstadler@informatik.uni-leipzig.de, *{lastname}@informatik.uni-leipzig.de
http://aksw.org

Abstract. Wikipedia is one of the largest public information spaces, with a huge user community that collaboratively works on the largest online encyclopedia. Its users add or edit up to 150 thousand wiki pages per day. The DBpedia project extracts RDF from Wikipedia and interlinks it with other knowledge bases. In the DBpedia live extraction mode, Wikipedia edits are instantly processed to update information in DBpedia. Due to the high number of edits and the growth of Wikipedia, the update process has to be very efficient and scalable. In this paper, we present different strategies to tackle this challenging problem and describe how we modified the DBpedia live extraction algorithm to work more efficiently.

1 Introduction

The Linking Open Data (LOD) cloud continues to grow and contains various very large knowledge bases. Many of those knowledge bases are extracted or derived from other sources. Hence, they need to be synchronised with the original source in order to reflect changes in it. This raises performance challenges, in particular if the knowledge bases aim at timely synchronisation. In this article, we describe how we handle this challenge in the specific case of the DBpedia knowledge base. We hope that this example can spur a broader discussion of the topic.

The DBpedia live extraction framework [5] extracts RDF data from articles in the English Wikipedia (http://en.wikipedia.org) with a delay of only a few seconds after they are edited. In brief, the extraction proceeds as follows: When an article is modified, the live extraction framework polls its latest revision via Wikipedia's non-public OAI repository. After that, a set of extractors is run to generate RDF output for this Wikipedia page. Finally, this data is written into a publicly accessible triple store, where it can be accessed via Linked Data [3] and SPARQL [9].

As the DBpedia store should only reflect the data extracted from the latest revision of an article, a strategy for identifying and removing outdated triples needs to be employed. In our case the triple management strategy needs to respect two things. Firstly, the DBpedia live store is seeded with the DBpedia dataset [6, 1] in version 3.4. The seeding is done in order to provide initial data about articles that have not been edited since the start of the live extraction process. As an effect, the extraction takes place on an existing dataset, which not only contains data extracted from the English Wikipedia but also third-party datasets, amongst them YAGO [10], SKOS (http://www.w3.org/TR/skos-reference/), UMBEL (http://www.umbel.org), and OpenCyc (http://www.opencyc.org). On the one hand, the live extraction needs to keep the data of the third-party datasets intact. On the other hand, when an article gets edited, its corresponding data in the seeding dataset must be updated. Since all data resides in the same graph (http://dbpedia.org), this becomes a complex task.

Secondly, the state of the extractors needs to be taken into account. An extractor can be in one of the states Active, Purge, and Keep, which affects the generation and removal of triples as follows:

– Active: The extractor is invoked on that page so that triples are generated.
– Purge: The extractor is disabled and all triples previously generated by the extractor for that page should be removed from the store.
– Keep: The extractor is disabled, but previously generated triples for that page should be retained.

Our initial strategy is as follows: Upon the first edit of an article seen by the extraction framework, a clean-up is performed using the queries described in Section 2.1. The clean-up removes all but the static facts from the seeding dataset for the article's corresponding resource. The new triples are then inserted together with annotations: each triple is annotated with its extractor, its DBpedia page URI, and its date of extraction, using OWL 2 axiom annotations. Once these annotations exist, they allow for simple subsequent deletions of all triples corresponding to a certain page and extractor in the event of repeated article edits. As DBpedia consists of approximately 300 million facts, annotations would increase this number by a factor of six (three triples of the annotation vocabulary, omitting ?s rdf:type owl:Axiom, plus the three annotations for extractor, page, and extraction date). As the amount of data in the store grew, we soon realized that the update performance of the store became so slow that edits on Wikipedia occurred more frequently than they could be processed. Before resorting to acquiring better hardware, we considered alternative triple management approaches.

The paper is structured as follows: We describe the concepts for optimizing the update process in Section 2. We also provide a short evaluation of the performance improvement facilitated by the newly deployed update strategy in Section 3. We conclude and present related as well as future work in Section 4.

2 Concepts for Optimizing the Update Process

The extraction of DBpedia facts that have changed through edits in Wikipedia is described in [5]. After changed facts have been computed, the DBpedia knowledge base has to be updated. The formerly used process for updating the model with new information caused performance problems as described in Section 1. To optimize the update process, we sketch three strategies in this section:

– A specialized update process which uses a set of DBpedia-specific SPARUL queries.
– An update process on the basis of multiple resource-specific graphs, which uses a separate graph for each set of triples generated by an extractor from an article.
– An RDB-assisted update process which uses an additional relational database table for storing the affected RDF resources.

2.1 Specialized Update Process

As our generic solution using annotations turned out to be too slow, we chose a domain-specific one in order to reduce the amount of explicitly generated metadata. As opposed to our initial approach, which allows extractors to generate arbitrary data, we now require the data to satisfy two constraints (illustrated by the sketch following this list):

– All subjects of the triples extracted from a Wikipedia article must start with the DBpedia URI corresponding to the article (the prefix dbpedia stands for http://dbpedia.org/resource/). If the subject is not equal to the DBpedia URI, we call it a subresource. For instance, for the article "London" both subjects dbpedia:London and dbpedia:London/prop1 would meet this naming constraint.
– Extractors must only generate triples whose predicates and/or objects are specific to that extractor. For instance, the extractor for infoboxes would be the only one to generate triples whose properties are in the http://dbpedia.org/property/ namespace.
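To make the effect of these constraints concrete, the following minimal Python sketch (hypothetical helper names and an illustrative predicate-to-extractor mapping, not code from the extraction framework) shows how the subject prefix identifies the article and the predicate identifies the generating extractor:

# Illustration of the two constraints above; the mapping below is an
# assumption made for this sketch, not the framework's configuration.
DBPEDIA_NS = "http://dbpedia.org/resource/"

PREDICATE_PREFIX_TO_EXTRACTOR = {
    "http://dbpedia.org/property/": "infobox extractor",
    "http://xmlns.com/foaf/0.1/homepage": "homepage extractor",
}

def owning_article(subject):
    # dbpedia:London and the subresource dbpedia:London/prop1 both map to "London".
    if not subject.startswith(DBPEDIA_NS):
        return None
    return subject[len(DBPEDIA_NS):].split("/", 1)[0]

def owning_extractor(predicate):
    for prefix, extractor in PREDICATE_PREFIX_TO_EXTRACTOR.items():
        if predicate.startswith(prefix):
            return extractor
    return None

For example, owning_article("http://dbpedia.org/resource/London/prop1") yields "London", and owning_extractor("http://dbpedia.org/property/population") yields the infobox extractor.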
As a consequence, a triple's subject and predicate implicitly and uniquely determine the corresponding article and extractor. Whenever an article is modified, the deletion procedure is as follows: As subresources are so far only generated by the infobox extractor (when recursively extracting data from nested templates), they can be deleted unless this extractor is in state Keep. The query for that task is shown in Listing 1.1. Deletion of the main DBpedia article URIs is more complex: triples need to be filtered by the state of their generating extractor as well as by their membership in the static part of DBpedia. This results in a complex, dynamically built query as sketched in Listing 1.2.

Listing 1.1. Deleting all statements matching part of the URI of London

DELETE FROM <http://dbpedia.org> { ?sub ?p ?o . }
FROM <http://dbpedia.org>
{
  <http://dbpedia.org/resource/London> ?p2 ?sub .
  ?sub ?p ?o .
  FILTER ( REGEX(?sub, '^http://dbpedia.org/resource/London/') )
}

Listing 1.2. Deleting resources according to specific extractors while preventing the deletion of the static part

DELETE FROM <http://dbpedia.org> { <http://dbpedia.org/resource/London> ?p ?o . }
{
  <http://dbpedia.org/resource/London> ?p ?o .
  # Dynamically generated filters based on extractors in
  # active and purge state
  FILTER ( REGEX(?p, '^http://dbpedia.org/property/') ||
           ?p = foaf:homepage ||
           # more conditions for other extractors
         ) .
  # Static filters preventing deletion of the static DBpedia part
  FILTER ( (?p != <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ||
            !(REGEX(?o, '^http://dbpedia.org/class/yago/'))) &&
           # more conditions for the static part
         ) .
}

2.2 Update Process with Resource-Specific Graphs

The previously mentioned approaches have the disadvantage of either introducing a high overhead with respect to the number of triples needed to store metadata or being very complex. A different approach is to put each set of triples generated by an extractor from an article into its own graph. For instance, a URI containing a hash of the extractor and article name could serve as the graph name. The update process then becomes greatly simplified: upon an edit, it is only necessary to clear the corresponding graph and insert the new triples. This approach requires splitting the seeding DBpedia dataset into separate graphs from the beginning. As the DBpedia dataset v3.4 comes in separate files for each extractor, the subjects of the triples in these files determine the target graph.

The downside of this approach is that the data no longer resides in a single graph. Therefore it is not possible to specify the dataset in the SPARQL FROM clause. Instead, a FILTER over the graphs is required, as shown in Listing 1.3.

Listing 1.3. Selecting triples across multiple graphs

SELECT ?s ?p ?o
{
  GRAPH ?g { ?s ?p ?o } .
  FILTER ( REGEX(?g, '^http://dbpedia.org/') ) .
}
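The following Python sketch illustrates this scheme; the concrete graph URI format, the http://dbpedia.org/graph/ namespace, and the helper names are assumptions, since the text above only suggests that a hash of the extractor and article name could serve as the graph name.

import hashlib

# Sketch of the graph-per-(article, extractor) naming scheme; URI format
# and namespace are assumptions made for illustration.
def graph_uri(article, extractor):
    digest = hashlib.md5((extractor + "|" + article).encode("utf-8")).hexdigest()
    return "http://dbpedia.org/graph/" + digest

def update_queries(article, extractor, new_triples):
    # On an edit, the corresponding graph is simply cleared and refilled;
    # no filtering over extractor states or the static dataset is needed.
    g = graph_uri(article, extractor)
    clear = "CLEAR GRAPH <%s>" % g
    insert = "INSERT INTO <%s> { %s }" % (
        g, " ".join("%s %s %s ." % t for t in new_triples))
    return clear, insert

The trade-off of the approach is visible here: per-graph updates need no extractor-state or static-data filters, at the price of spreading the data over many graphs.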
2.3 RDB-Assisted Update Process

The third approach we evaluated and implemented is to additionally store the RDF statements in a relational database (RDB). This approach is motivated by the observation that most changes made to Wikipedia articles only cause small changes in the corresponding RDF data. Therefore, the idea is to have a method for quickly retrieving the set of triples previously generated for an article, comparing it to the new set of triples, and only performing the necessary updates.

For selecting the resources that have to be updated after a completed Wikipedia extraction run, we first created an RDB table as illustrated in Figure 1.

Fig. 1. Definition of the RDB table dbpedia_page with the columns page_id, resource_uri, and serialized_data.

Whenever a Wikipedia page is edited, the extraction method generates a JSON object holding information about each extractor and its generated triples. After serialization, this object is stored in combination with the corresponding page identifier. In case a record with the same page identifier already exists in this table, the old JSON object and the new one are compared. The result of this comparison are two disjoint sets of triples, which are used on the one hand for adding statements to the DBpedia RDF graph and on the other hand for removing statements from this graph. The update procedure thus becomes straightforward: once the initial clean-up for a page has been performed, all further modifications to that page only trigger a simple update process. This update process no longer involves complex SPARQL filters; instead, it can modify the affected triples directly. The corresponding SQL and SPARQL statements are shown in Listings 1.4 and 1.5, and Algorithm 1 summarizes the procedure.

Listing 1.4. SQL statements for fetching and storing the data of a resource

SELECT data FROM dbpedia_page WHERE page_id = 'http://dbpedia.org/resource/London';

INSERT INTO dbpedia_page (page_id, data)
  VALUES ('http://dbpedia.org/resource/London', <JSON-Object>);

UPDATE dbpedia_page SET data = <JSON-Object> WHERE page_id = <pageId>;

Listing 1.5. Simple SPARQL delete and insert queries

DELETE FROM <http://dbpedia.org> { ... concrete triples ... }
INSERT INTO <http://dbpedia.org> { ... concrete triples ... }

Algorithm 1. RDB-assisted update process

// The data to be put into the store is included in the extractionResult object
pageId ← extractionResult[pageId]
resourceUri ← extractionResult[resourceUri]
newTriples ← extractionResult[triples]
// Attempt to retrieve previously inserted data for the pageId
jsonObject ← fetchFromSQLDB(pageId)
if jsonObject ≠ NULL then
    oldTriples ← extractTriples(jsonObject)
    insertSet ← newTriples − oldTriples
    removeSet ← oldTriples − newTriples
    removeTriplesFromRDFStore(removeSet)
    addTriplesToRDFStore(insertSet)
else
    cleanUpRDFStore(pageId)
    insertIntoRDFStore(newTriples)
end if
jsonObject ← generateJSONObject(pageId, resourceUri, newTriples)
putIntoSQLDB(jsonObject)
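As a minimal illustration of this procedure, the following Python sketch mimics Algorithm 1 under simplifying assumptions: an in-memory SQLite table stands in for the production dbpedia_page table, triples are plain string tuples, and the triple store interaction is stubbed out. It is not the framework's actual implementation.

import json
import sqlite3

# Simplified stand-in for the dbpedia_page table (resource_uri omitted).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE dbpedia_page (page_id TEXT PRIMARY KEY, data TEXT)")

def apply_to_store(remove_set, insert_set):
    # Placeholder for issuing the simple DELETE/INSERT queries of Listing 1.5.
    print("removing", len(remove_set), "and inserting", len(insert_set), "triples")

def update_page(page_id, new_triples):
    new_set = set(new_triples)
    row = db.execute("SELECT data FROM dbpedia_page WHERE page_id = ?",
                     (page_id,)).fetchone()
    if row is not None:
        old_set = set(tuple(t) for t in json.loads(row[0]))
        # Only the difference between the old and new triple sets is applied.
        apply_to_store(old_set - new_set, new_set - old_set)
    else:
        # First edit seen for this page: clean up and insert everything.
        apply_to_store(set(), new_set)
    db.execute("INSERT OR REPLACE INTO dbpedia_page (page_id, data) VALUES (?, ?)",
               (page_id, json.dumps(sorted(new_set))))
    db.commit()

Calling update_page twice for the same page with slightly different triple sets causes the second call to touch only the statements that actually changed.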
3 Evaluation of the RDB-Assisted Update Process

We did a small evaluation by comparing the RDB-assisted update process to a simplified version of the DBpedia-specific one. This simplified version deletes triples with a certain subject using Listing 1.6 instead of Listing 1.2; the only difference is that the complex filter patterns are omitted.

Listing 1.6. Example of the simplified delete query

DELETE FROM <http://dbpedia.org> { <http://dbpedia.org/resource/London> ?p ?o . }
{ <http://dbpedia.org/resource/London> ?p ?o . }

The benchmark simulates edits of articles and was set up as follows. 5000 distinct resources were picked at random from the DBpedia dataset. For each resource, two sets O and N were created by randomly picking p% of the triples whose subject starts with the resource. The sets O and N represent the sets of triples corresponding to an article prior and posterior to an edit, respectively. (As the RDB-assisted approach involves a JSON object holding information about each extractor, the generation of the sets O and N was related to a dummy extractor.) A run of the benchmark first clears the target graph and the dbpedia_page table. Then each resource's O-set is inserted into the store. Finally, the time to update the old sets of triples to the new ones using either the simplified specialized update strategy or the RDB-assisted one is measured. Additionally, the total numbers of triples that were removed (O − N), added (N − O), and retained (N ∩ O) were counted.

Three runs were performed with p = 0.9, p = 0.8, and p = 0.5, meaning that the simulated edits changed 10%, 20%, and 50% of the triples, respectively. We assume that the actual ratio of triples updated by the live extraction process in the event of repeated edits of articles is between 10 and 20 percent; however, the exact value has not been determined yet.
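A rough Python sketch of this edit simulation for a single resource follows; the benchmark code itself is not given in the paper, so the function names and any sampling details beyond what is stated above are assumptions.

import random

# Sketch of the edit simulation described above: O and N are independent
# random samples containing p% of the triples whose subject starts with
# the resource URI.
def simulate_edit(resource_triples, p, rng=random):
    k = int(round(p * len(resource_triples)))
    O = set(rng.sample(resource_triples, k))  # triples before the simulated edit
    N = set(rng.sample(resource_triples, k))  # triples after the simulated edit
    return O, N

def edit_statistics(O, N):
    # The quantities counted during a benchmark run (cf. Table 1).
    return {"added": len(N - O), "removed": len(O - N), "retained": len(N & O)}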
The benchmark was run on a machine with a two-core 1.2 GHz Celeron CPU and 2 GB RAM. The triple store used was "Virtuoso Open-Source Edition 6.1.1" in its default configuration with the four indices GS, SP, POGS, and OP.

p    Added    Removed   Retained   Strategy   Time taken (sec)
0.5  124924   124937    123319     SQL        240
                                   RDF        200
0.8  79605    79710     318149     SQL        200
                                   RDF        250
0.9  44629    44554     402748     SQL        170
                                   RDF        300

Table 1. Benchmark results.

In Table 1 the value SQL indicates the RDB-assisted approach, and RDF the specialized one. As can be seen from the table, the former approach, which reduces the updates to the triple store to a minimum, performs better than the specialized version when there is sufficient overlap between O and N (p = 0.8 and p = 0.9). On the other hand, the smaller the overlap, the more the RDB becomes a bottleneck (p = 0.5). This is expected, as in the worst case there is no overlap between O and N. In this situation the specialized approach would delete and reinsert triples directly. The RDB-assisted approach would ultimately do the same, however with the overhead of additionally reading from and writing to the dbpedia_page table.

4 Related Work, Future Work and Conclusion

In this paper we sketched four different approaches for managing triples in the context of the DBpedia live extraction: the first based on OWL 2 annotations, the second using domain-specific queries, the third using individual graphs, and the fourth being assisted by an RDB. Initially, because of RDF's flexibility, we were tempted to find a solution which operates on the triple store alone. We were sceptical about the RDB-assisted approach, as it meant having to duplicate every single triple. The lesson learnt is that for application scenarios involving frequent minor updates of resources in a triple store, an RDB-assisted approach may be advantageous despite the implied data duplication.

Related Work. Apart from the strategies described in this article, there are a number of other ways to improve the performance of synchronising a knowledge base with its original source. The first and most obvious challenge in research and practice is to further improve the performance of triple stores, in particular for SPARUL queries. Although the Berlin SPARQL Benchmark [4] (BSBM) became a reference for measuring the query performance of SPARQL endpoints, up to now there is no benchmark playing a comparable role for SPARUL. Another method is to avoid decoupling the original source and the generated knowledge base. For instance, the Triplify tool [2] is a thin layer on top of a relational database. An RDF representation is generated by SQL queries augmented with syntactic sugar. This lightweight integration does not require a synchronisation process (although a mirror of the original source may need to be kept in sync). However, it is usually preferable only for simple extraction processes; otherwise the burden of generating an RDF representation in real time becomes computationally very expensive. Furthermore, complex transformations of the original source, as present in the DBpedia extraction framework, are difficult to handle.

Future Work. For improving the performance of the DBpedia Navigator [7], we will integrate an adaptive SPARQL query cache [8] as a proxy layer on top of the DBpedia SPARQL endpoint. This caching solution analyses the triple patterns of SPARQL queries and stores them in combination with their result sets. To trigger the invalidation process of the cache proxy, all added and updated statements which are committed to the RDF store have to be committed to the cache proxy as well. The invalidation process works very selectively and invalidates only those cache objects whose aggregated triple patterns match the added or updated statements. However, invalidation is an expensive process and should only be triggered if necessary. The strategies presented in this paper help to reduce the change sets of statements for the DBpedia update process; they also benefit the cache proxy, as the number of statements that must be considered in the invalidation process is minimized.
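To make the idea of selective invalidation concrete, the following Python sketch (an illustration only, not the implementation of the cache proxy in [8]; the pattern representation is an assumption) invalidates exactly those cache objects for which some changed triple matches one of their triple patterns:

# Hypothetical sketch of selective cache invalidation: a pattern position is
# either a variable (leading "?") or a concrete term, and a cached object is
# invalidated only if some changed triple matches one of its patterns.
def term_matches(pattern_term, term):
    return pattern_term.startswith("?") or pattern_term == term

def triple_matches(pattern, triple):
    return all(term_matches(p, t) for p, t in zip(pattern, triple))

def invalidated(cache_objects, changed_triples):
    # cache_objects: mapping of cache key -> list of triple patterns
    return [key for key, patterns in cache_objects.items()
            if any(triple_matches(p, t)
                   for p in patterns for t in changed_triples)]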
References

1. Sören Auer, Chris Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.
2. Sören Auer, Sebastian Dietzold, Jens Lehmann, Sebastian Hellmann, and David Aumueller. Triplify: Light-weight linked data publication from relational databases. In Juan Quemada, Gonzalo León, Yoëlle S. Maarek, and Wolfgang Nejdl, editors, Proceedings of the 18th International Conference on World Wide Web, WWW 2009, Madrid, Spain, April 20-24, 2009, pages 621–630. ACM, 2009.
3. Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked data – the story so far. International Journal on Semantic Web and Information Systems (IJSWIS), 2009.
4. Christian Bizer and Andreas Schultz. The Berlin SPARQL benchmark. International Journal on Semantic Web and Information Systems (IJSWIS), 5(2):1–24, 2009.
5. Sebastian Hellmann, Claus Stadler, Jens Lehmann, and Sören Auer. DBpedia live extraction. In Proceedings of the 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.
6. Jens Lehmann, Chris Bizer, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. DBpedia – a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.
7. Jens Lehmann and Sebastian Knappe. DBpedia Navigator. Semantic Web Challenge, International Semantic Web Conference 2008, 2008.
8. Michael Martin, Jörg Unbehauen, and Sören Auer. Improving the performance of semantic web applications with SPARQL query caching. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), 30 May – 3 June 2010, Heraklion, Greece, 2010.
9. Eric Prud'hommeaux and Andy Seaborne. SPARQL query language for RDF. W3C Recommendation, 2008. http://www.w3.org/TR/rdf-sparql-query.
10. Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: A core of semantic knowledge. In Proceedings of the 16th International World Wide Web Conference (WWW 2007). ACM Press, 2007.