<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Materializing the editing history of Wikipedia as linked data in DBpedia</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabien Gandon</string-name>
          <email>fabien.gandon@inria.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raphael Boyer</string-name>
          <email>raphael.boyer@inria.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olivier Corby</string-name>
          <email>olivier.corby@inria.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexandre Monnin</string-name>
          <email>alexandre.monnin@inria.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Université Côte d'Azur</institution>
          ,
<addr-line>Inria, CNRS, I3S, Wimmics, Sophia Antipolis</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We describe a DBpedia extractor materializing the editing history of Wikipedia pages as linked data to support queries and indicators on the history.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<p>The different instances of the DBpedia platform typically extract RDF from Wikipedia
using up to 16 extractors. The extraction focuses on structured content including
infoboxes, categories, links, etc. As an example, the French chapter (http://fr.dbpedia.org/), for which we
are responsible, extracted 185 million triples in 2015. The resulting RDF graph is then
published and supported up to 2.5 million SPARQL queries per day, with an average of
70,000 SPARQL queries per day in 2015. But Wikipedia is a social medium that produces
more data than the actual content of its pages. The activity of the epistemic communities
of Wikipedia produces a huge amount of traces showing, for instance, the evolution,
conflicts, trends, and variety of opinions of the users. In fact, the different projects of
the Wikimedia Foundation grow at a rate of over ten edits per second, performed by
users from all over the world (https://en.wikipedia.org/wiki/Wikipedia:Statistics). And this activity covers a broad collection of
topics: the English chapter of Wikipedia alone has over 5 million articles, and the
combined Wikipedias for all the other languages exceed the English chapter in size, with
more than 27 billion words in 40 million articles in 293 languages
(https://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, accessed 23/08/16). As a result, the
history of the editing actions captures the peaks and shifts of interest of the contributors
and indirectly reflects the unfolding of events all around the world and in every domain.</p>
      <p>
        Providing means to monitor the editing activity has always been important for
Wikipedians to follow the changes. These means include APIs such as the recent
changes API, the IRC streams per language, the WebSockets streams, the Server-Sent
Events streams, etc. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Previous work also suggested monitoring the real-time editing
activity of Wikipedia to detect events such as natural disasters [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] a resource
versioning mechanism inspired by the Memento protocol (RFC 7089) is applied, but
only to DBpedia dumps. In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] historical versions of resources are regenerated for a
given timestamp with some revision data, but only through a RESTful API. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] the
preservation of the history of linked datasets is tested, but only on a sample of 100,000
resources. We do not mention here works on formats, vocabularies or algorithms to
detect and describe updates to RDF datasets since, at this stage, we are focusing on
editing acts on Wikipedia.
      </p>
      <p>
        Data about the editing activity provide historical indicators of interest and attention over the
set of resources they cover. They have also been used, for instance, to assess the
currency of the data [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], to study conflict resolution [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], to temporally anchor data, to
attribute changes and to identify vandalism [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] or to precisely attribute the authorship
of content [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Conversely, using statements from other datasets (e.g. typing), one can filter
and analyze the editing history along chosen dimensions (e.g. focusing on events
about artists). But none of the previous contributions supports public SPARQL querying
of the full editing history. The potential of these linked data is even greater when they are
combined with other linked data sources, and this is not easily done with an API
approach, e.g. “give me the 10 most edited populated places in July 2012”.
      </p>
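      <p>For instance, the example query above can be sketched in SPARQL over the extracted history data. This is a sketch only: the prefix declarations are omitted, the dbfr:revPerMonth pattern assumes the vocabulary produced by the extractor (see Figure 1), and the dbo:PopulatedPlace typing assumes the DBpedia ontology.
SELECT ?place (xsd:integer(?n) AS ?edits) WHERE {
  ?page a prov:Revision ;
        dc:subject ?place ;
        dbfr:revPerMonth [ dc:date "07/2012"^^xsd:gYearMonth ;
                           rdf:value ?n ] .
  ?place a dbo:PopulatedPlace .
}
ORDER BY DESC(?edits)
LIMIT 10</p>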
<p>For this reason, we designed and provide a new DBpedia extractor producing a linked
data representation of the editing history of Wikipedia pages. Instead of real-time
monitoring, we capture the history as linked data so as to be able to query it, mine it and
combine it with other sources, augmenting the dimensions we can exploit when querying
linked data in general and DBpedia in particular.</p>
<p>A history dump of a Wikipedia chapter contains all the modifications dating back
to the inception of this linguistic chapter, along with some information for each and
every modification. As an example, the French editing history dump represents 2 TB of
uncompressed data. The data extraction is performed through streams in Node.js with a
MongoDB instance. It took 4 days to extract 55 GB of RDF in Turtle on 8 Intel(R)
Xeon(R) CPU E5-1630 v3 @ 3.70GHz cores with 68 GB of RAM and SSD disks. The
result is then published through a SPARQL endpoint alongside the DBpedia chapter
(http://dbpedia-historique.inria.fr/sparql).</p>
<p>The extractor reuses as many existing vocabularies from the Linked Open Vocabularies (LOV)
directory (http://lov.okfn.org/, as accessed in June 2016) as
possible in order to facilitate integration and reuse. Figure 1 is a sample of the output
of the editing history extractor for the page describing the author “Victor Hugo” in the
DBpedia French chapter. The history data for such an entry contain one section of
general information about the article history (lines 1-15), along with as many additional
sections as there are previous revisions, to capture each change (e.g. two revisions at
lines 16-24). The general information about the article includes: the number of
revisions (line 3), the dates of creation and last modification (lines 4-5), the number of
unique contributors (line 6), the number of revisions per year and per month (e.g. lines
7-8) and the average size of revisions per year and per month (e.g. lines 9-10). In
addition, each individual revision description includes: the date and time of the
modification (e.g. line 17), the size of the revision as a number of characters (e.g. line
18), the size of the modification as a number of characters (e.g. line 19), the optional
comment of the contributor (e.g. line 20), the username or IP address of the contributor,
whether the contributor is a human or a bot (e.g. line 21 or 24) and a link to the previous
revision (e.g. line 22).</p>
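      <p>As a minimal illustration of the vocabulary described above, the global indicators of a page can be retrieved with a query such as the following sketch (property names as produced by the extractor, prefix declarations omitted):
SELECT ?created ?modified ?contributors WHERE {
  &lt;https://fr.wikipedia.org/wiki/Victor_Hugo&gt;
      dc:created ?created ;
      dc:modified ?modified ;
      dbfr:uniqueContributorNb ?contributors .
}</p>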
<p>By construction, the data are fully linked to the DBpedia resources, and the
vocabularies used include PROV-O, Dublin Core, the Semantic Web Publishing
Vocabulary, the DBpedia ontologies, FOAF and SIOC. As a result, the produced linked
data are well integrated into the LOD cloud. Whenever a predicate was missing, we
added it to the DBpedia FR ontology. As shown in Figure 2, these data support fairly
arbitrary queries such as, in this example, requesting the most modified pages
grouped by pairs of pages modified on the same day.</p>
      <p>Fig. 1. Sample of the editing history data for the page “Victor Hugo”:
1. &lt;https://fr.wikipedia.org/wiki/Victor_Hugo&gt; a prov:Revision ;
2. dc:subject &lt;http://fr.dbpedia.org/resource/Victor_Hugo&gt; ;
3. swp:isVersion "3496"^^xsd:integer ;
4. dc:created "2002-06-06T08:48:32"^^xsd:dateTime ;
5. dc:modified "2015-10-15T14:17:02"^^xsd:dateTime ;
6. dbfr:uniqueContributorNb 1295 ;
(...)
7. dbfr:revPerYear [ dc:date "2015"^^xsd:gYear ; rdf:value "79"^^xsd:integer ] ;
8. dbfr:revPerMonth [ dc:date "06/2002"^^xsd:gYearMonth ; rdf:value "3"^^xsd:integer ] ;
(...)
9. dbfr:averageSizePerYear [ dc:date "2015"^^xsd:gYear ; rdf:value "154110.18"^^xsd:float ] ;
10. dbfr:averageSizePerMonth [ dc:date "06/2002"^^xsd:gYearMonth ; rdf:value "2610.66"^^xsd:float ] ;
(...)
11. dbfr:size "159049"^^xsd:integer ;
12. dc:creator [ foaf:nick "Rinaldum" ] ;
13. sioc:note "wikification"^^xsd:string ;
14. prov:wasRevisionOf &lt;https:// … 119074391&gt; ;
15. prov:wasAttributedTo [ foaf:name "Rémih" ; a prov:Person, foaf:Person ] .
(...)
16. &lt;https:// … 119074391&gt; a prov:Revision ;
17. dc:created "2015-09-29T19:35:34"^^xsd:dateTime ;
18. dbfr:size "159034"^^xsd:integer ;
19. dbfr:sizeNewDifference "-5"^^xsd:integer ;
20. sioc:note "/*Années théâtre*/ neutralisation"^^xsd:string ;
21. prov:wasAttributedTo [ foaf:name "Thouny" ; a prov:Person, foaf:Person ] ;
22. prov:wasRevisionOf &lt;https://... 118903583&gt; .
(...)
23. &lt;https:// … oldid=118201419&gt; a prov:Revision ;
24. prov:wasAttributedTo [ foaf:name "OrlodrimBot" ; a prov:SoftwareAgent ] ;
(...)</p>
      <p>
The STTL template language [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] makes it possible to generate portals in a declarative and fast
way. We used it to build two portals that show the richness of the historical data
materialized. The first application is a visual history browser that displays
images of the 50 most edited topics for every month. With the second portal we
demonstrate the ability to join this new dataset with other linked data sources, starting
with DBpedia itself: we built a focused portal generator that restricts the monitoring
activity to specific DBpedia categories of resources (e.g. companies, actors, countries,
etc.), for instance focusing on artists (mode=dbo:Artist,
http://corese.inria.fr/srv/template?profile=st:dbedit&amp;mode=dbo:Artist)
or on countries (mode=dbo:Country,
http://corese.inria.fr/srv/template?profile=st:dbedit&amp;mode=dbo:Country)
using the DBpedia ontology. Figure 3 is a screenshot of the portal focused on countries and shows the events
in Ukraine in 2014. Many applications of the editing activity already exist [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and these
two portals are only a proof of concept for what can be done with SPARQL over the
linked data of editing activity.
      </p>
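      <p>The join on which the focused portals rely can be sketched in SPARQL as follows, here to rank the most revised countries. This is a sketch only, assuming the vocabulary of Figure 1 (swp:isVersion holding the number of revisions) and the dbo:Country class of the DBpedia ontology; the per-day aggregations of the portals themselves are more involved.
SELECT ?country ?revisions WHERE {
  ?page a prov:Revision ;
        dc:subject ?country ;
        swp:isVersion ?revisions .
  ?country a dbo:Country .
}
ORDER BY DESC(xsd:integer(?revisions))
LIMIT 10</p>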
<p>The history extractor is now integrated into the DBpedia open-source code base and runs
on the production server of the French chapter. We are studying the integration of the
live change feed, for both the chapter and its history, in order to reflect real-time changes
to the content and the editing logs. We are also considering ways to represent more
precisely the changes between two revisions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>T.</given-names>
            <surname>Steiner</surname>
          </string-name>
          ,
          <article-title>Comprehensive Wikipedia monitoring for global and realtime natural disaster detection</article-title>
          .
          <source>In Proceedings of the ISWC Developers Workshop at the 13th International Semantic Web Conference (ISWC 2014), Riva del Garda</source>
          ,
          <year>2014</year>
          ,
          <fpage>86</fpage>
          -
          <lpage>95</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>O.</given-names>
            <surname>Corby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Faron-Zucker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gandon</surname>
          </string-name>
          ,
          <article-title>A Generic RDF Transformation Software and its Application to an Online Translation Service for Common Languages of Linked Data</article-title>
          .
          <source>The 14th International Semantic Web Conference, Oct</source>
          <year>2015</year>
          , Bethlehem, United States.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><given-names>H.</given-names> <surname>Van de Sompel</surname></string-name>
          ,
          <string-name><given-names>R.</given-names> <surname>Sanderson</surname></string-name>
          ,
          <string-name><given-names>M. L.</given-names> <surname>Nelson</surname></string-name>
          ,
          <string-name><given-names>L. L.</given-names> <surname>Balakireva</surname></string-name>
          ,
          <string-name><given-names>H.</given-names> <surname>Shankar</surname></string-name>
          , and
          <string-name><given-names>S.</given-names> <surname>Ainsworth</surname></string-name>
          .
          <article-title>An HTTP-based versioning mechanism for linked data</article-title>
          .
          <source>LDOW</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><given-names>Javier D.</given-names> <surname>Fernández</surname></string-name>
          ,
          <string-name><given-names>Patrik</given-names> <surname>Schneider</surname></string-name>
          , and
          <string-name><given-names>Jürgen</given-names> <surname>Umbrich</surname></string-name>
          .
          <year>2015</year>
          .
          <article-title>The DBpedia wayback machine</article-title>
          .
          <source>In Proceedings of the 11th International Conference on Semantic Systems SEMANTICS '15 ACM</source>
          ,
          <fpage>192</fpage>
          -
          <lpage>195</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Paul</given-names>
            <surname>Meinhardt</surname>
          </string-name>
          ,
          <string-name><given-names>Magnus</given-names> <surname>Knuth</surname></string-name>
          , and
          <string-name>
            <given-names>Harald</given-names>
            <surname>Sack</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>TailR: a platform for preserving history on the web of data</article-title>
          .
          <source>In Proceedings of the 11th International Conference on Semantic Systems SEMANTICS '15, ACM</source>
          ,
          <fpage>57</fpage>
          -
          <lpage>64</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Steiner</surname>
          </string-name>
          ,
          <article-title>The Wiki(pedia|data) Edit Streams Firehose</article-title>
          , Invited Talk, Wiki Workshop, April 12, at
          <source>WWW</source>
          <year>2016</year>
          , Montreal, Canada, http://bit.ly/wiki-firehose
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Anisa</given-names>
            <surname>Rula</surname>
          </string-name>
          ,
          <string-name><given-names>Luca</given-names> <surname>Panziera</surname></string-name>
          ,
          <string-name><given-names>Matteo</given-names> <surname>Palmonari</surname></string-name>
          ,
          <string-name><given-names>Andrea</given-names> <surname>Maurino</surname></string-name>
          :
          <article-title>Capturing the Currency of DBpedia Descriptions and Get Insight into their Validity</article-title>
          .
          <source>COLD 2014</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Volha</given-names>
            <surname>Bryl</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Learning conflict resolution strategies for cross-language Wikipedia data fusion</article-title>
          .
          <source>In 4th Workshop on Web Quality Workshop (WebQuality) at WWW</source>
          <year>2014</year>
          ,
          <year>2014</year>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>E.</given-names>
            <surname>Alfonseca</surname>
          </string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Garrido</surname></string-name>
          ,
          <string-name><given-names>J.-Y.</given-names> <surname>Delort</surname></string-name>
          , and
          <string-name><given-names>A.</given-names> <surname>Penas</surname></string-name>
          .
          <article-title>WHAD: Wikipedia historical attributes data - Historical structured data extraction and vandalism detection from the Wikipedia edit history</article-title>
          .
          <source>Language Resources and Evaluation</source>
          ,
          <volume>47</volume>
          (
          <issue>4</issue>
          ):
          <fpage>1163</fpage>
          -
          <lpage>1190</lpage>
          ,
          <year>2013</year>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Fabian</given-names>
            <surname>Flöck</surname>
          </string-name>
          and
          <string-name>
            <given-names>Maribel</given-names>
            <surname>Acosta</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>WikiWho: precise and efficient attribution of authorship of revisioned content</article-title>
          .
          <source>In Proceedings of the 23rd international conference on World wide web WWW '14. ACM</source>
          ,
          <fpage>843</fpage>
          -
          <lpage>854</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>