<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Update Strategies for DBpedia Live</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Claus Stadler</string-name>
          <email>cstadler@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Martin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jens Lehmann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Hellmann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universitat Leipzig, Institut fur Informatik</institution>
          ,
          <addr-line>Johannisgasse 26, D-04103 Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Wikipedia is one of the largest public information spaces with a huge user community, which collaboratively works on the largest online encyclopedia. Its users add or edit up to 150 thousand wiki pages per day. The DBpedia project extracts RDF from Wikipedia and interlinks it with other knowledge bases. In the DBpedia live extraction mode, Wikipedia edits are instantly processed to update information in DBpedia. Due to the high number of edits and the growth of Wikipedia, the update process has to be very efficient and scalable. In this paper, we present different strategies to tackle this challenging problem and describe how we modified the DBpedia live extraction algorithm to work more efficiently.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The DBpedia live extraction was seeded with the DBpedia
dataset [
        <xref ref-type="bibr" rid="ref1">6, 1</xref>
        ] in version 3.4. The seeding is done in order to provide initial data
about articles that have not been edited since the start of the live extraction
process. As an effect, the extraction takes place on an existing dataset, which
not only contains data extracted from the English Wikipedia but also third-party
datasets, amongst them YAGO [10], SKOS2, UMBEL3, and OpenCyc4. On the
one hand, the live extraction needs to keep the data of the third-party datasets
intact. On the other hand, when an article gets edited, its corresponding data in
the seeding dataset must be updated. Since all data resides in the same graph5,
this becomes a complex task. Secondly, the state of the extractors needs to be
taken into account. An extractor can be in one of the states Active, Purge, or
Keep, which affects the generation and removal of triples as follows:
      </p>
      <p>– Active: The extractor is invoked on that page so that triples are generated.
– Purge: The extractor is disabled and all triples previously generated by the
extractor for that page should be removed from the store.
– Keep: The extractor is disabled but previously generated triples for that page
should be retained.</p>
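      <p>The following minimal Python sketch (not part of the extraction framework; the enum and the extractor.extract helper are hypothetical) illustrates how these three states determine whether triples are regenerated, purged, or left untouched when a page is edited:</p>
      <p>from enum import Enum

class ExtractorState(Enum):
    ACTIVE = "active"  # extractor runs, fresh triples are generated
    PURGE = "purge"    # extractor is disabled, its old triples are removed
    KEEP = "keep"      # extractor is disabled, its old triples are retained

def triples_after_edit(extractor, state, page):
    # Decide, per extractor, which triples should end up in the store for this page.
    if state is ExtractorState.ACTIVE:
        return extractor.extract(page)  # new triples replace the previously generated ones
    if state is ExtractorState.PURGE:
        return []                       # empty result: previously generated triples get deleted
    return None                         # KEEP: leave the previously generated triples as they are</p>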
    <sec id="sec-3">
      <title>Our initial strategy is as follows: Upon the rst edit of an article seen by</title>
      <p>the extraction framework a clean up is performed using the queries described
in Section 2.1. The clean up removes all but the static facts from the seeding
data set for the article's corresponding resource. The new triples are then
inserted together with annotations. Each triple is annotated with its extractor,
DBpedia URI and its date of extraction, using OWL 2 axiom annotations. Once
these annotations exist, they allow for simple subsequent deletions of all triples
corresponding to a certain page and extractor in the event of repeated article
edits.</p>
    <sec id="sec-5">
      <title>As DBpedia consists of approximately 300 million facts, annotations would</title>
      <p>boost this value by a factor of six6. As the amount of data in the store grew, we
soon realized that the update performance of the store became so slow that edits
on Wikipedia occurred more frequently than they could be processed. Before
resorting to acquiring better hardware, we considered alternative triple management
approaches.</p>
      <p>The paper is structured as follows: We describe the concepts for
optimizing the update process in Section 2. We also provide a short evaluation of
the performance improvement that was facilitated by the newly deployed update
strategy in Section 3. We conclude and present related as well as future work in
Section 4.</p>
      <p>Footnotes:
2 http://www.w3.org/TR/skos-reference/
3 http://www.umbel.org
4 http://www.opencyc.org
5 http://dbpedia.org
6 Three triples of the annotation vocabulary (omitting ?s rdf:type owl:Axiom) and
the three annotations (extractor, page, and extraction date)</p>
    </sec>
    <sec id="sec-6">
      <title>Concepts for Optimizing the Update Process</title>
      <p>The extraction of DBpedia facts which have changed through edits
in Wikipedia is described in [5]. After changed facts have been computed, the
DBpedia knowledge base has to be updated. The formerly used process for
updating the model with new information caused some performance problems as
described in Section 1. To optimize the update process, we sketch the following
three strategies in this section:</p>
      <p>– A specialized update process which uses a set of DBpedia-specific SPARUL
queries.
– An update process on the basis of multiple resource-specific graphs which
uses separate graphs for each set of triples generated by an extractor from an
article.
– An RDB-assisted update process which uses an additional relational database
table for temporarily storing affected RDF resources.</p>
      <sec id="sec-6-1">
        <title>Specialized Update Process</title>
        <p>As our generic solution using annotations turned out to be too slow, we chose to
use a domain-specific one in order to reduce the amount of explicitly generated
metadata. As opposed to our initial approach, which allows extractors to generate
arbitrary data, we now require the data to satisfy two constraints:
– All subjects of the triples extracted from a Wikipedia article must start with
the DBpedia URI corresponding to the article. If the subject is not equal
to the DBpedia URI, we call it a subresource. For instance, for the
article "London" both subjects dbpedia:London7 and dbpedia:London/prop1
would meet this naming constraint.
– Extractors must only generate triples whose predicates and/or objects are
specific to that extractor. For instance, the extractor for infoboxes would
be the only one to generate triples whose properties are in the http://
dbpedia.org/property/ namespace.</p>
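        <p>A minimal Python sketch of how these two constraints could be checked for the output of an extractor (the per-extractor namespaces and helper names are illustrative assumptions, and the second constraint is simplified to predicate namespaces only):</p>
        <p>EXTRACTOR_NAMESPACES = {
    "infobox": ["http://dbpedia.org/property/"],
    # ... one entry per extractor
}

def check_constraints(article_uri, extractor_name, triples):
    # triples: iterable of (subject, predicate, object) URI strings
    allowed = EXTRACTOR_NAMESPACES[extractor_name]
    for s, p, o in triples:
        # Constraint 1: subjects must start with the article's DBpedia URI
        if not s.startswith(article_uri):
            raise ValueError("subject %s does not belong to %s" % (s, article_uri))
        # Constraint 2 (simplified): predicates must be specific to this extractor
        if not any(p.startswith(ns) for ns in allowed):
            raise ValueError("predicate %s is not specific to %s" % (p, extractor_name))</p>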
        <p>As a consequence, a triple's subject and predicate implicitly and uniquely
determine the corresponding article and extractor. Whenever an article is modified,
the deletion procedure is as follows: As subresources are so far only generated by
the infobox extractor (when recursively extracting data from nested templates),
they can be deleted unless this extractor is in state keep. The query for that task
is shown in Listing 1.1. Deletion of the main DBpedia article URIs is more
complex: Triples need to be filtered by the state of their generating extractor as well
as by their membership to the static part of DBpedia. This results in a complex,
dynamically built query as sketched in Listing 1.2.
7 The prefix dbpedia stands for http://dbpedia.org/resource/.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Listing 1.1. Deleting all statements matching part of the URI of London</title>
      <p>DELETE FROM &lt;http://dbpedia.org&gt; { ?sub ?p ?o . }
FROM &lt;http://dbpedia.org&gt; {
  &lt;http://dbpedia.org/resource/London&gt; ?p ?sub .
  ?sub ?p ?o .
  FILTER ( REGEX(?sub, '^http://dbpedia.org/resource/London/') )
}</p>
    </sec>
    <sec id="sec-8">
      <title>Listing 1.2. Deleting resources according to specific extractors while preventing the deletion of the static part</title>
      <p>DELETE FROM &lt;http://dbpedia.org&gt;
{ &lt;http://dbpedia.org/resource/London&gt; ?p ?o . }
{ &lt;http://dbpedia.org/resource/London&gt; ?p ?o .
  # Dynamically generated filters based on extractors in
  # active and purge state
  FILTER ( REGEX(?p, '^http://dbpedia.org/property/') ||
           ?p = foaf:homepage ||
           # more conditions for other extractors
         ) .
  # Static filters preventing deletion of the static DBpedia part
  FILTER ( ( ?p != &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#type&gt; ||
             !REGEX(?o, '^http://dbpedia.org/class/yago/') ) &amp;&amp;
           # more conditions for the static part
         )
}</p>
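      <p>To illustrate how such a query can be assembled dynamically, the following Python sketch builds the extractor-dependent part of the FILTER from the current extractor states. It is only a sketch: the per-extractor conditions, the mapping and the helper name are assumptions for illustration, not the framework's actual code.</p>
      <p>EXTRACTOR_CONDITIONS = {
    "infobox": "REGEX(?p, '^http://dbpedia.org/property/')",
    "homepage": "?p = foaf:homepage",
    # ... more extractors
}

def build_delete_query(resource_uri, extractor_states):
    # Only triples of extractors in state 'active' or 'purge' may be deleted.
    conditions = [cond for name, cond in EXTRACTOR_CONDITIONS.items()
                  if extractor_states.get(name) in ("active", "purge")]
    dynamic_filter = " ||\n           ".join(conditions)
    return ("DELETE FROM &lt;http://dbpedia.org&gt;\n"
            "{ &lt;%s&gt; ?p ?o . }\n"
            "{ &lt;%s&gt; ?p ?o .\n"
            "  FILTER ( %s ) .\n"
            "  # static filters protecting the static DBpedia part would follow here\n"
            "}" % (resource_uri, resource_uri, dynamic_filter))</p>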
    </sec>
    <sec id="sec-9">
      <title>Update Process on the Basis of Multiple Resource-Specific Graphs</title>
      <p>The previously mentioned approaches have the disadvantage of either introducing
a high overhead with respect to the amount of triples needed to store metadata
or being very complex. A different approach is to put each set of triples
generated by an extractor from an article into its own graph. For instance, a URI
containing a hash of the extractor and article name could serve as the graph
name. The update process then becomes greatly simplified: upon an edit, it is
only necessary to clear the corresponding graph and insert the new triples. This
approach requires splitting the seeding DBpedia dataset into separate graphs
from the beginning. As the DBpedia dataset v3.4 comes in separate files for each
extractor, the subjects of the triples in these files determine the target graph.</p>
      <p>The downside of this approach is that the data no longer resides in a single
graph. Therefore it is not possible to specify the dataset in the SPARQL FROM
clause. Instead, a FILTER over the graphs is required, as shown in Listing 1.3.</p>
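      <p>A minimal Python sketch of this strategy follows. The md5-based graph naming scheme and the helper names are illustrative assumptions; the update itself uses the Virtuoso-style SPARUL syntax also shown in Listing 1.5.</p>
      <p>import hashlib

def graph_uri(article, extractor):
    # One named graph per (article, extractor) pair; the naming scheme is an assumption.
    digest = hashlib.md5(("%s|%s" % (article, extractor)).encode("utf-8")).hexdigest()
    return "http://dbpedia.org/graph/" + digest

def update_queries(article, extractor, new_triples):
    # new_triples: iterable of N-Triples-formatted statements (assumption)
    g = graph_uri(article, extractor)
    clear_query = "CLEAR GRAPH &lt;%s&gt;" % g
    insert_query = "INSERT INTO &lt;%s&gt; { %s }" % (g, " ".join(new_triples))
    return clear_query, insert_query</p>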
    </sec>
    <sec id="sec-10">
      <title>Listing 1.3. Selecting triples across multiple graphs.</title>
      <p>SELECT ?s ?p ?o
{ GRAPH ?g { ?s ?p ?o } .
  FILTER ( REGEX(?g, '^http://dbpedia.org/') ) .
}</p>
      <sec id="sec-10-1">
        <title>RDB Assisted Update Process</title>
        <p>The third approach we evaluated and implemented is to additionally store RDF
statements in a relational database (RDB). This approach is motivated
by the observation that most changes made to Wikipedia articles only cause
small changes in the corresponding RDF data. Therefore, the idea is to have
a method for quickly retrieving the set of triples previously generated for an
article, comparing it to the new set of triples and only performing the necessary
updates.</p>
        <p>For the selection of resources which have to be updated after a periodically
finished Wikipedia extraction process, we first created an RDB table as illustrated
in Figure 1 (a table dbpedia_page with the columns page_id, resource_uri and
serialized_data). Whenever a Wikipedia page is edited, the extraction method
generates a JSON object holding information about each extractor and its generated
triples. After serialization of such an object, it is stored in combination with
the corresponding page identifier. In case a record with the same page identifier
already exists in this table, the old JSON object and the new one are
compared. The results of this comparison are two disjoint sets of triples which
are used on the one hand for adding statements to the DBpedia RDF graph
and on the other hand for removing statements from this graph. Therefore the
update procedure becomes straightforward:</p>
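        <p>The procedure is shown as Algorithm 1 in the evaluation section; the following minimal Python sketch mirrors it. The db and store helper objects and their method names are hypothetical stand-ins for the SQL table access and the triple store access, not the framework's actual API.</p>
        <p>import json

def update_page(db, store, page_id, resource_uri, new_triples):
    # new_triples: iterable of triple strings produced by the extractors for this page
    new_set = set(new_triples)
    row = db.fetch_serialized_data(page_id)      # SELECT data FROM dbpedia_page WHERE page_id = ...
    if row is not None:
        old_set = set(json.loads(row)["triples"])
        store.remove_triples(old_set - new_set)  # statements no longer produced for this page
        store.add_triples(new_set - old_set)     # newly produced statements
    else:
        store.clean_up(resource_uri)             # initial clean up of the seeded data
        store.add_triples(new_set)
    db.store_serialized_data(page_id, json.dumps({
        "resource_uri": resource_uri,
        "triples": sorted(new_set),
    }))</p>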
        <p>With this strategy, once the initial clean up for a page has been performed,
all further modifications to that page only trigger a simple update process. This
update process no longer involves complex SPARQL filters; instead it can modify
the affected triples directly.</p>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>Listing 1.4. SQL Statements for fetching data for a resource</title>
      <p>SELECT data FROM dbpedia_page
  WHERE page_id = 'http://dbpedia.org/resource/London';

INSERT INTO dbpedia_page (page_id, data)
  VALUES ('http://dbpedia.org/resource/London', &lt;JSON-Object&gt;);

UPDATE dbpedia_page SET data = &lt;JSON-Object&gt; WHERE page_id = &lt;pageId&gt;;</p>
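      <p>A small self-contained Python sketch of the corresponding fetch-or-upsert logic follows; sqlite3 is used here only to keep the example runnable, whereas the actual deployment may use any relational database.</p>
      <p>import sqlite3

def store_page_data(conn, page_id, data_json):
    # Insert the serialized JSON object for a page, or update it if a record already exists.
    cur = conn.execute("SELECT data FROM dbpedia_page WHERE page_id = ?", (page_id,))
    if cur.fetchone() is None:
        conn.execute("INSERT INTO dbpedia_page (page_id, data) VALUES (?, ?)",
                     (page_id, data_json))
    else:
        conn.execute("UPDATE dbpedia_page SET data = ? WHERE page_id = ?",
                     (data_json, page_id))
    conn.commit()</p>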
    </sec>
    <sec id="sec-13">
      <title>Listing 1.5. Simple SPARQL Delete and insert queries</title>
      <p>DELETE FROM &lt;http://dbpedia.org&gt; { ... concrete triples ... }
INSERT INTO &lt;http://dbpedia.org&gt; { ... concrete triples ... }</p>
    </sec>
    <sec id="sec-14">
      <title>Evaluation of the RDB Assisted Update Process</title>
      <p>We did a small evaluation by comparing the RDB assisted update process to
a simplified version of the DBpedia-specific one. This simplified version deletes
triples with a certain subject using Listing 1.6 instead of Listing 1.2. The difference
is only that the complex filter patterns were omitted.</p>
      <p>Algorithm 1: Algorithm of the RDB assisted update process
// The data to be put into the store is included in the extractionResult object
pageId      ← extractionResult[pageId]
resourceUri ← extractionResult[resourceUri]
newTriples  ← extractionResult[triples]
// Attempt to retrieve previously inserted data for the pageId
jsonObject ← fetchFromSQLDB(pageId)
if jsonObject ≠ NULL then
    oldTriples ← extractTriples(jsonObject)
    insertSet  ← newTriples \ oldTriples
    removeSet  ← oldTriples \ newTriples
    removeTriplesFromRDFStore(removeSet)
    addTriplesToRDFStore(insertSet)
else
    cleanUpRDFStore(pageId)
    insertIntoRDFStore(newTriples)
end if
jsonObject ← generateJSONObject(pageId, resourceUri, newTriples)
putIntoSQLDB(jsonObject)</p>
    </sec>
    <sec id="sec-15">
      <title>Listing 1.6. Example of the simplified delete query</title>
      <p>DELETE FROM &lt;http://dbpedia.org&gt;
{ &lt;http://dbpedia.org/resource/London&gt; ?p ?o . }
{ &lt;http://dbpedia.org/resource/London&gt; ?p ?o . }</p>
    </sec>
    <sec id="sec-16">
      <title>The benchmark simulates edits of articles and was set up as follows. 5000</title>
      <p>distinct resources were picked at random from the DBpedia dataset. For each
resource two sets O and N were created by randomly picking p% of the triples
whose subject starts with the resource. The sets O and N represent the sets of
triples corresponding to an article prior and posterior to an edit, respectively.</p>
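      <p>A small Python sketch of this set generation (an illustrative reconstruction, not the original benchmark code; triples_of_resource is assumed to be the list of triples whose subject starts with the resource):</p>
      <p>import random

def simulate_edit(triples_of_resource, p):
    # O: triples before the simulated edit, N: triples after it; each picks p% at random.
    k = int(round(p * len(triples_of_resource)))
    O = set(random.sample(triples_of_resource, k))
    N = set(random.sample(triples_of_resource, k))
    removed = O - N              # triples that have to be deleted from the store
    added = N - O                # triples that have to be inserted
    retained = N.intersection(O) # triples that stay untouched
    return O, N, added, removed, retained</p>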
      <p>A run of the benchmark first clears the target graph and the dbpedia_page table.
Then each resource's O-set is inserted into the store. Finally, the time to update
the old sets of triples to the new ones using either the simplified specialized
update strategy or the RDB-assisted one8 is measured. Additionally, the total
number of triples that were removed (O \ N), added (N \ O) and retained
(N ∩ O) were counted. Three runs were performed with p = 0.9, p = 0.8, and
p = 0.5, meaning that the simulated edits changed 10%, 20% and 50% of the
triples, respectively. We assume that the actual ratio of triples updated by the
live extraction process in the event of repeated edits of articles is between 10
and 20 percent; however, the exact value has not been determined yet. The
benchmark was run on a machine with a two-core 1.2 GHz Celeron CPU and 2 GB
RAM. The triple store used was "Virtuoso Open-Source Edition 6.1.1" in its
default configuration with four indices GS, SP, POGS, and OP.
8 As this approach involves a JSON object holding information about each extractor,
the generation of the sets O and N was related to a dummy extractor.</p>
      <p>Table 1. Triples added, removed and retained per benchmark run (per-strategy runtimes omitted)
p     Added     Removed   Retained
0.5   124924    124937    123319
0.8    79605     79710    318149
0.9    44629     44554    402748</p>
        <p>In Table 1 the value SQL indicates the RDB-assisted approach, and RDF
the specialized one. As can be seen from the table, the former approach, which
reduces the updates to the triple store to a minimum, performs better than the
specialized version when there is sufficient overlap between O and N (p = 0.8
and p = 0.9). On the other hand, the smaller the overlap, the more the RDB
becomes a bottleneck (p = 0.5). This is expected, as in the worst case there is no
overlap between O and N. In this situation the specialized approach would delete
and reinsert triples directly. The RDB-assisted approach would ultimately do
the same, however with the overhead of additionally reading from and writing
to the dbpedia_page table.</p>
    </sec>
    <sec id="sec-18">
      <title>Related Work, Future Work and Conclusion</title>
      <p>In this paper we sketched four different approaches for managing triples in the
context of the DBpedia live extraction: the first based on OWL 2 annotations,
the second using domain-specific queries, the third using individual graphs, and
the fourth being assisted by an RDB. Initially, because of RDF's flexibility, we
were tempted to find a solution which operates on the triple store alone. In
regard to the RDB-assisted approach we were sceptical, as it meant having to
duplicate every single triple. The lesson learnt is that for application scenarios
involving frequent minor updates of resources in a triple store, an RDB-assisted
approach may be advantageous, despite the implied data duplication.</p>
    <sec id="sec-19">
      <title>Related Work Apart from the strategies described in this article, there are a</title>
      <p>number other ways to improve the performance of synchronising a knowledge
base to its original source: The first and most obvious challenge in research
and practice is to further improve the performance of triple stores, in particular for
SPARUL queries. Although the Berlin SPARQL Benchmark [4] (BSBM) became
a reference for measuring the query performance of SPARQL endpoints, up to
now there is no such benchmark playing a comparable role for SPARUL. Another
method is to avoid decoupling the original source and the generated knowledge
base. For instance, the Triplify tool[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is a thin layer above a relational database.
An RDF representation is generated by SQL queries augmented with syntactic
sugar. This lightweight integration does not require a synchronisation process
(of course, it could be that a mirror of the original source needs to be kept
in sync). However, it is usually preferable only for simple extraction processes,
otherwise the burden of generating an RDF representation in real time becomes
computationally very expensive. Furthermore, complex transformations of the
original source, as present in the DBpedia extraction framework, are difficult to
handle.
      </p>
      <p>Future Work: For improving the performance of the DBpedia Navigator [7] we
will integrate an adaptive SPARQL query cache [8] as a proxy layer on top
of the DBpedia SPARQL Endpoint. This caching solution analyses the triple
patterns of the SPARQL queries and stores them in combination with their
result sets. To trigger the invalidation process of the cache proxy, all added
and updated statements, which are committed to the RDF store, have to be
committed to the cache proxy as well. The invalidation process of this cache
proxy works very selectively and invalidates only those cache objects whose
aggregated triple pattern matches the added or updated statements. However,
an invalidation process is an expensive process and should only be triggered if
necessary. The strategies presented in this paper contribute to reducing the
change sets of statements for the DBpedia update process. However, they also
benefit the cache proxy, as the amount of statements that must be considered
in the invalidation process is minimized.</p>
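      <p>A minimal Python sketch of this selective invalidation idea (a simplified illustration of the mechanism described in [8], not its actual implementation; patterns are represented as (s, p, o) tuples with None marking a variable):</p>
      <p>def matches(pattern, triple):
    # True if a cached query's triple pattern matches an added or updated statement.
    return all(p is None or p == t for p, t in zip(pattern, triple))

def invalidate(cache, updated_statements):
    # Drop only those cache objects whose aggregated triple patterns match an update.
    for key, entry in list(cache.items()):
        if any(matches(pat, s) for pat in entry["patterns"] for s in updated_statements):
            del cache[key]</p>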
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Soren Auer, Chris Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives.
          <article-title>DBpedia: A nucleus for a web of open data</article-title>
          .
          <source>In Proceedings of the 6th International Semantic Web Conference (ISWC)</source>
          , volume
          <volume>4825</volume>
          of Lecture Notes in Computer Science, pages
          <volume>722</volume>
          {
          <fpage>735</fpage>
          . Springer,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Soren Auer, Sebastian Dietzold, Jens Lehmann,
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Hellmann</surname>
          </string-name>
          , and David Aumueller.
          <article-title>Triplify: light-weight linked data publication from relational databases</article-title>
          .
          <source>In Juan Quemada</source>
          , Gonzalo Leon, Yoelle
          <string-name>
            <given-names>S.</given-names>
            <surname>Maarek</surname>
          </string-name>
          , and Wolfgang Nejdl, editors,
          <source>Proceedings of the 18th International Conference on World Wide Web, WWW</source>
          <year>2009</year>
          , Madrid, Spain,
          <source>April 20-24</source>
          ,
          <year>2009</year>
          , pages
          <fpage>621</fpage>
          {
          <fpage>630</fpage>
          . ACM,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          , Tom Heath, and
          <string-name>
            <surname>Tim</surname>
          </string-name>
          Berners-Lee.
          <article-title>Linked data - the story so far</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems (IJSWIS)</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Schultz</surname>
          </string-name>
          .
          <article-title>The berlin sparql benchmark</article-title>
          .
          <source>Int. J. Semantic Web Inf. Syst.</source>
          ,
          <volume>5</volume>
          (
          <issue>2</issue>
          ):1{
          <fpage>24</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Sebastian Hellmann, Claus Stadler, Jens Lehmann, and Soren Auer. DBpedia live extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209-1223, 2009.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Jens Lehmann, Chris Bizer, Georgi Kobilarov, Soren Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154-165, 2009.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Jens Lehmann and Sebastian Knappe. DBpedia navigator. Semantic Web Challenge, International Semantic Web Conference 2008, 2008.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Michael Martin, Jorg Unbehauen, and Soren Auer. Improving the performance of semantic web applications with SPARQL query caching. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), 30 May - 3 June 2010, Heraklion, Greece, 2010.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Eric Prud'hommeaux and Andy Seaborne. SPARQL query language for RDF. W3C Recommendation, 2008. http://www.w3.org/TR/rdf-sparql-query.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: A Core of Semantic Knowledge. In 16th International World Wide Web Conference (WWW 2007), New York, NY, USA, 2007. ACM Press.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>