<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hedera: Scalable Indexing and Exploring Entities in Wikipedia Revision History</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
<institution>L3S Research Center / Leibniz Universität Hannover</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Much of the work in the Semantic Web that relies on Wikipedia as its main source of knowledge operates on static snapshots of the dataset. The full history of Wikipedia revisions, while containing much more useful information, is still difficult to access due to its exceptional volume. To enable further research on this collection, we developed a tool, named Hedera, that efficiently extracts semantic information from Wikipedia revision history datasets. Hedera exploits the Map-Reduce paradigm to achieve rapid extraction: it is able to process the entire revision history of Wikipedia articles within a day on a medium-scale cluster, and it supports flexible data structures for various kinds of Semantic Web study.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        For over a decade, Wikipedia has been a backbone of Semantic Web research,
with the proliferation of high-quality, large knowledge bases (KBs) such as
DBpedia [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], whose information is derived from public Wikipedia
collections. Existing approaches often rely on a single offline snapshot of the dataset: they
treat knowledge as static and ignore the temporal evolution of information in
Wikipedia. When, for instance, a fact changes (e.g. the death of a celebrity) or
entities themselves evolve, the change is only reflected in the next version of the
knowledge base (typically extracted afresh from a newer Wikipedia dump). This
undesirable property of KBs makes them unable to capture the temporally dynamic
relationships latent among revisions of the encyclopedia (e.g., joint participation
in complex events), which are difficult to detect in a single Wikipedia
snapshot. Furthermore, applications relying on obsolete facts might fail to reason
in new contexts (e.g. question answering systems for recent real-world
incidents), because those facts were not captured in the KBs. To complement these
temporal aspects, the whole Wikipedia revision history should be exploited.
However, such longitudinal analytics over the enormous volume of Wikipedia require
huge computation. In this work, we develop Hedera, a large-scale framework that
supports processing, indexing and visualising the Wikipedia revision history. Hedera
is an end-to-end system that works directly with the raw dataset, processes it
into streaming data, and incrementally indexes and visualizes the information of
entities registered in KBs in a dynamic fashion. In contrast to existing work
that handles the dataset in centralized settings [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Hedera employs the
MapReduce paradigm to achieve scalable performance: it is able to transform
raw data covering 2.5 years of revision history for 1 million entities into a full-text index
within a few hours on an 8-node cluster. We open-sourced Hedera to facilitate
further research 1.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Extracting and Indexing Entities</title>
      <p>Preprocessing Dataset
Here we describe the Hedera architecture and workflow. As shown in Figure 1,
the core data input of Hedera is a Wikipedia revision history dump 2. Hedera
currently works with the raw XML dumps; it supports accessing and extracting
information directly from compressed files. Hedera makes heavy use of the Hadoop
framework. The preprocessor is responsible for re-partitioning the raw files into
independent units (a.k.a. InputSplit in Hadoop) depending on users' needs. There
are two levels of partitioning: entity-wise and document-wise. Entity-wise
partitioning guarantees that revisions belonging to the same entity are sent to one
computing node, while document-wise partitioning sends the content of revisions arbitrarily to
any node, and keeps in each revision a reference to its preceding ones
for future use at the Map-Reduce level. The preprocessor accepts user-defined
low-level filters (for instance, only partition articles, or only revisions within 2011 and
2012), as well as a list of entity identifiers from a knowledge base to restrict to. When
filtering by a knowledge base, users must provide methods to verify a
revision against the map of entities (for instance, using the Wikipedia-derived URLs of
entities). The results are Hadoop file splits, in XML or JSON format.</p>
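The two partitioning levels can be sketched as follows. This is a toy illustration, not Hedera's actual API: the record fields (entity, rev_id), function names, and the round-robin split assignment are our own assumptions.

```python
from collections import defaultdict

def partition_revisions(revisions, mode="entity", entity_ids=None, num_splits=2):
    """Toy re-partitioner mimicking Hedera's two partitioning levels.
    Field and function names are illustrative, not Hedera's real schema."""
    splits = defaultdict(list)
    if mode == "entity":
        # Entity-wise: all revisions of one entity land in the same split,
        # optionally restricted to entity identifiers from a knowledge base.
        for rev in revisions:
            if entity_ids is None or rev["entity"] in entity_ids:
                splits[rev["entity"]].append(rev)
    else:
        # Document-wise: revisions are scattered over splits; each revision
        # keeps a reference to its predecessor for later Map-Reduce use.
        last_seen = {}
        for i, rev in enumerate(revisions):
            rev = dict(rev, prev_rev=last_seen.get(rev["entity"]))
            last_seen[rev["entity"]] = rev["rev_id"]
            splits[i % num_splits].append(rev)
    return dict(splits)

revisions = [
    {"entity": "Obama", "rev_id": 1},
    {"entity": "Euro", "rev_id": 2},
    {"entity": "Obama", "rev_id": 3},
]
by_entity = partition_revisions(revisions, entity_ids={"Obama"})
by_doc = partition_revisions(revisions, mode="document")
```

Entity-wise partitioning trades load balance for locality: a reducer sees an entity's complete history, which is what most longitudinal analyses need.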
      <p>[Figure 1: The Hedera architecture. Guided by user-defined filters and optional Wikipedia-derived ontologies, the preprocessor turns the Wikipedia revision history dump into input splits; transformers emit entity snippets to Hadoop jobs and Pig workflows; extensions include batch indexing, temporal information extraction, and longitudinal analytics.]</p>
      <p>
Before being extracted in the Map-Reduce phase (the Extraction component in Figure 1),
file splits output by the preprocessor are streamed into a Transformer. The
main goal of the transformer is to consume the files and emit (key, value) pairs
suitable as input to a Map function. Hedera provides several classes of
transformer, each of which implements one operator specified in the extraction
layer. Pushing down these operators into transformers significantly reduces the
volume of text sent around the network. The extraction layer enables users to
write extraction logic in high-level programming languages such as Java or Pig 3,
which can be reused in other applications. The extraction layer also accepts
user-defined filters, allowing users to extract and index different portions of the same
partitions at different times. For instance, a user can first filter and
partition the Wikipedia articles published in 2012, and later sample, from
one partition, the revisions about people published in May 2012. This flexibility
facilitates rapid development of research-style prototypes on the Wikipedia revision
dataset, which is one of our major contributions.
1 Project documentation and code can be found at: https://github.com/antoine-tran/Hedera
2 http://dumps.wikimedia.org</p>
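The transformer contract described above (consume a file split, optionally apply a user-defined filter, emit (key, value) pairs for a Map function) can be sketched as follows. The field names and the filter are our own illustration, not Hedera's actual classes.

```python
def year_filter(year):
    """User-defined filter: keep only revisions from the given year
    (a stand-in for the filters described in the text)."""
    return lambda rev: rev["year"] == year

def entity_transformer(file_split, rev_filter=None):
    """Consume one file split and emit (key, value) pairs for a Map
    function. Keying by entity lets one reducer see all of an entity's
    revisions; projecting only the needed fields here is the push-down
    that cuts network traffic."""
    for rev in file_split:
        if rev_filter is None or rev_filter(rev):
            yield rev["entity"], (rev["year"], rev["text"])

split = [
    {"entity": "Obama", "year": 2012, "text": "re-elected"},
    {"entity": "Euro", "year": 2011, "text": "qualifiers"},
    {"entity": "Obama", "year": 2011, "text": "campaign"},
]
pairs = list(entity_transformer(split, year_filter(2012)))
```

Because filters are applied inside the transformer, the same partition can be re-read later with a different filter (e.g. people in May 2012) without re-running the preprocessor.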
    </sec>
    <sec id="sec-3">
      <title>Indexing and Exploring Entity-based Evolutions in Wikipedia</title>
      <p>
        In this section, we illustrate the use of Hedera in one application: incremental
indexing and visualizing of the Wikipedia revision history. Indexing large-scale
longitudinal data collections such as the Wikipedia history is not a straightforward
problem. The challenge of finding a scalable data structure and distributed storage
that can best exploit the data along the time dimension is still not fully addressed.
In Hedera, we present a distributed approach in which the collection is processed
and the indexing is then parallelized using the Map-Reduce paradigm. This
approach (based on the document-oriented data structure of ElasticSearch)
can be considered a baseline for further optimizations. The index schema is
loosely structured, which allows flexible updates and incremental indexing of new
revisions (a necessity for the evolving Wikipedia history collection). Our
preliminary evaluation showed that this approach outperforms the well-known
centralized indexing method provided by [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The gap in indexing time
widens rapidly as the data volume increases. In addition,
we also evaluated the querying time of the system and observed similar results.
We describe how the temporal index facilitates large-scale analytics on
the semantics of Wikipedia with some case studies. The details of the
experiment are described below.
      </p>
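A minimal sketch of the loosely structured, incrementally updatable index described above (the real system uses ElasticSearch; the document fields, id scheme, and store are our own illustration):

```python
def revision_doc(entity, rev_id, timestamp, text):
    """A loosely structured revision document. New fields can be added
    later without a schema migration, which is what makes incremental
    indexing of an evolving collection practical."""
    return {
        "_id": "%s-%d" % (entity, rev_id),  # stable id: re-indexing is idempotent
        "entity": entity,
        "timestamp": timestamp,
        "text": text,
    }

class ToyIndex:
    """Stand-in for a document store: upsert by _id, so indexing newly
    arrived revisions never requires rebuilding the existing index."""
    def __init__(self):
        self.docs = {}
    def index(self, doc):
        self.docs[doc["_id"]] = doc

idx = ToyIndex()
idx.index(revision_doc("Obama", 1, "2011-01-05", "44th president"))
idx.index(revision_doc("Obama", 2, "2012-11-07", "re-elected"))
idx.index(revision_doc("Obama", 2, "2012-11-07", "re-elected"))  # upsert, no duplicate
```

With stable document ids, each Map-Reduce indexing job can be re-run safely, and new revision batches only append documents.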
      <p>We extract 933,837 entities registered in DBpedia, each of which corresponds
to one Wikipedia article. The time interval spans from 1 January 2011 to 13 July
2013 and contains 26,067,419 revisions, amounting to 601 GBytes of text in
uncompressed format. The data is processed and re-partitioned using Hedera before
being passed on and indexed into ElasticSearch 4 (a distributed real-time
indexing framework that supports data at large scale) using Map-Reduce. Figure 2
illustrates a toy example of analysing the temporal dynamics of entities in
Wikipedia. Here we aggregate the results for three distinct entity queries, i.e.,
obama, euro and olympic, on the temporal anchor-text (the visible text of a
hyperlink between two Wikipedia articles) index. The left-most table shows the
top terms appearing in the returned results, whereas the two timeline graphs
illustrate the dynamic evolution of the entities over the studied time period (with
1-week and 1-day granularity, from left to right respectively). As is easily observed,
the three entities peak at the times when related events happen (Euro 2012 for
euro, the US Presidential Election for obama, and the Summer and Winter Olympics
for olympic). This further shows the value of temporal anchor text in mining
Wikipedia entity dynamics. We experimented analogously on the Wikipedia
full-text index. Here we present a case study of entity co-occurrence (or
temporal relationship) between Usain Bolt and Mo Farah, where the two
co-peak at the time of the Summer Olympics 2012, one big tournament in which the
two athletes participated together. These examples demonstrate the value of
our temporal Wikipedia indexes for temporal semantic research challenges.
3 http://pig.apache.org
4 http://www.elasticsearch.org</p>
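The timeline aggregation behind Figure 2 amounts to bucketing entity mentions by a chosen time granularity. A minimal stand-in (the record layout is our own assumption):

```python
import datetime
from collections import Counter

def timeline(mentions, granularity="week"):
    """Count (entity, date) mention records per time bucket, a toy
    version of the 1-week / 1-day timelines in Figure 2."""
    counts = Counter()
    for entity, day in mentions:
        if granularity == "week":
            year, week, _ = day.isocalendar()
            bucket = (year, week)
        else:
            bucket = day  # 1-day granularity
        counts[(entity, bucket)] += 1
    return counts

mentions = [
    ("euro", datetime.date(2012, 6, 8)),    # Euro 2012 kicks off
    ("euro", datetime.date(2012, 6, 10)),
    ("obama", datetime.date(2012, 11, 6)),  # US Presidential Election
]
weekly = timeline(mentions)
```

In the real system the bucketing runs as a date-histogram aggregation over the temporal index rather than in application code, but the computation is the same.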
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>In this paper, we introduced Hedera, our ongoing work on supporting flexible
and efficient access to the Wikipedia revision history dataset. Hedera works
directly with raw data at the low level, and it uses Map-Reduce to achieve
high-performance computation. We open-source Hedera for future use by the research
community, and believe our system is the first publicly available of its kind. Future
work includes deeper integration with knowledge bases, with more APIs and
services to access the extraction layer more flexibly.</p>
      <p>Acknowledgements</p>
      <p>This work is partially funded by the FP7 project ForgetIT (under grant No. 600826)
and the ERC Advanced Grant ALEXANDRIA (under grant No. 339233).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kobilarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ives</surname>
          </string-name>
          .
          <article-title>DBpedia: A nucleus for a web of open data</article-title>
          . Springer,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>O.</given-names>
            <surname>Ferschke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zesch</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <article-title>Wikipedia revision toolkit: efficiently accessing Wikipedia's edit history</article-title>
          .
          <source>In HLT</source>
          , pages
          <fpage>97</fpage>
          –
          <lpage>102</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>