<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Entity-Centric Preservation for Web Archive Enrichment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gerhard Gossen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Demidova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Risse</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giang Binh Tran</string-name>
          <email>gtrang@L3S.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>L3S Research Center and Leibniz Universität Hannover</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>81</fpage>
      <lpage>84</lpage>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Today, Web content and long-term Web archives are becoming increasingly
interesting to researchers in the humanities and social sciences [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In this context, Linked
Open Data (LOD) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] - a standardized method to publish and interlink
structured semantic data - plays an increasingly important role in the area of digital
libraries and archives. For example, linking entities in documents with their
semantic descriptions in the LOD Cloud provides richer semantic descriptions of
the document and enables better semantic-based access methods [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        As the temporal dimension gains importance, it becomes necessary to
look at LOD as a constantly evolving source of knowledge. Like the Web itself, LOD
has no documented history. Therefore, in the area of Web archiving, it becomes
increasingly important to preserve relevant parts of the LOD Cloud along with
crawled Web pages. This preservation process requires several steps, such as
entity extraction from Web pages (e.g. using Named Entity
Recognition (NER) techniques [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) coupled with enrichment of extracted entities
using metadata from LOD [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] as close as possible to the content collection time
point. It therefore calls for Web archive enrichment approaches that collect
entity context information from LOD.
      </p>
      <p>To facilitate future interpretation of the archived Web documents and the LOD
entities linked to them, presentation of the LOD entities within Web archives
should take into account the content, provenance, quality, authenticity
and context dimensions. In addition, prioritization methods are required to select
the most relevant sources, entities and properties for archiving. In
the following, we discuss these requirements in more detail. We then present the
resulting entity preservation process.</p>
      <p>Content and Schema: To obtain the complete information provided
by object properties, traversal of the knowledge graph is required.
However, traversal of large knowledge graphs is computationally expensive. Moreover,
the available properties are source dependent, and their usefulness for
a specific Web archive varies. Whereas some properties (e.g.
entity types or equivalence links) can be considered crucial, others
are less important and can be excluded from the entity
preservation process. Therefore, property weighting is required to guide an entity-centric
crawling process.
</p>
      <p>[Figure 1: (a) Web representation; (b) RDF representation.]</p>
      <p>Provenance and Quality: As in Web document archiving,
documentation of the crawling or entity extraction process as well as unique identification
and verification methods for collected entities are required to ensure reusability
and citability of the collected resources. In addition, for better interpretation of
the entities stored within the Web archive, it is crucial to collect metadata
describing the original data source of an entity. Ideally, this metadata should
include data quality parameters such as the methods used for dataset creation (e.g.
automatic entity extraction, manual population, etc.), freshness (last update at
the dataset and entity levels), data size, as well as completeness and consistency
of the data. Unfortunately, such metadata is rarely available in the LOD Cloud.
Therefore, archiving systems should provide functionality for statistical analysis
of data sources to estimate their quality and reliability. To ensure correct access
to the archived information, available metadata about the publisher, copyright and
licenses of the sources needs to be preserved.</p>
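      <p>As a rough illustration of the identification and verification requirement, the following Python sketch wraps a collected entity representation in a provenance record. The record layout, the field names, the example URI and the choice of SHA-256 are assumptions made for this example; they are not prescribed by the requirements above.</p>
      <preformat>
```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EntitySnapshot:
    """Hypothetical provenance record for one archived entity representation."""
    entity_uri: str    # URI of the entity in the LOD source
    source: str        # dataset or endpoint the data came from
    retrieved_at: str  # crawl time as an ISO 8601 string (UTC by default)
    method: str        # how it was collected, e.g. "sparql" or "uri-lookup"
    license_info: str  # license statement of the source, if known
    checksum: str      # SHA-256 digest of the serialized entity


def make_snapshot(entity_uri, source, method, license_info, payload, now=None):
    """Build a record whose checksum lets later readers verify the payload.

    payload is the serialized entity as bytes (e.g. an RDF serialization).
    """
    now = now or datetime.now(timezone.utc)
    digest = hashlib.sha256(payload).hexdigest()
    return EntitySnapshot(entity_uri, source, now.isoformat(),
                          method, license_info, digest)


def verify(snapshot, payload):
    """Return True if the payload still matches the recorded checksum."""
    return hashlib.sha256(payload).hexdigest() == snapshot.checksum


# Usage with made-up values: the payload stands in for a serialized entity.
payload = b"example serialized entity"
snapshot = make_snapshot("http://example.org/entity/London", "example-dataset",
                         "uri-lookup", "unknown", payload)
print(verify(snapshot, payload))
```
      </preformat>
      <p>Storing the digest next to the crawl time and source is one simple way to make an archived entity both citable and checkable; a production archive would likely rely on its existing container format for this.</p>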
      <p>Authenticity and Context: Authenticity is the ability to see an entity in
the same way as it appeared on the Web at crawl or extraction time. One
of the main principles of LOD is the use of dereferenceable URIs that can be
used to obtain a Web-based view of an entity (this view may differ from the
machine-readable representation). For example, Figure 1 presents a part of the
Web and RDF representations of the entity "London" from the Freebase dataset
of March 4, 2014. To satisfy authenticity requirements, human-readable pages
representing archived entities should be preserved in addition to the
machine-readable entity representation. Such information can include visual resources such as
photographs of people, map snippets, snippets of Wikipedia articles, etc.</p>
    </sec>
    <sec id="sec-2">
      <title>Entity Preservation Process</title>
      <p>Whereas entity extraction from the archived Web documents performed by NER
tools can deliver entity labels, types and possibly initial pointers (URIs) to the
relevant entities in the reference LOD datasets, the collection of relevant
entities should be extended beyond the LOD sources used by the extractors (such as
DBpedia). Furthermore, the content of the entities needs to be collected and
preserved. In this context, important challenges are connected to SPARQL endpoint
discovery and metadata crawling, prioritization of the crawler and entity-centric
crawling to obtain the most relevant parts of the LOD graphs efficiently.</p>
      <p>[Figure 2: (a) SPARQL endpoint discovery and metadata crawling; (b) Entity-centric crawling.]</p>
      <p>SPARQL endpoint discovery and metadata crawling: Existing dataset
catalogs such as DataHub<sup>1</sup> or the LinkedUp catalogue<sup>2</sup> include endpoint URLs
of selected datasets as well as selected statistics, mostly concerning the size
of specific datasets; however, existing catalogs are highly incomplete. Metadata
crawling includes several steps to obtain SPARQL endpoint URLs and generate
metadata. Based on this metadata, LOD sources and the properties
within these sources can be prioritized for preservation. Fig. 2(a) presents the
overall procedure of SPARQL endpoint discovery and metadata crawling.</p>
      <p>Overall, SPARQL endpoint discovery, metadata collection and pre-processing
include the following steps:</p>
      <p>Step 1a. Query LOD catalogs to obtain a seed list of LOD datasets, SPARQL
endpoints and other available metadata.</p>
      <p>Step 1b. For each LOD source, collect the available metadata (or generate it), e.g. topics,
schema, version, quality-related statistics and license.</p>
      <p>Step 1c. For a specific Web archive, select LOD datasets with respect to their topical relevance
and quality parameters. Establish property weighting for crawl prioritization.</p>
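      <p>Step 1c can be sketched as a simple ranking. The example below assumes that steps 1a and 1b have already produced, for each dataset, a topical relevance estimate and a quality estimate in the range 0 to 1. The product used as the score, the threshold, and all dataset names and endpoint URLs are illustrative assumptions, not part of the process described above.</p>
      <preformat>
```python
from dataclasses import dataclass


@dataclass
class DatasetMetadata:
    """Hypothetical per-dataset metadata produced by steps 1a and 1b."""
    name: str
    endpoint: str
    relevance: float  # topical relevance for the Web archive, 0 to 1
    quality: float    # aggregated quality estimate, 0 to 1


def select_datasets(candidates, min_score=0.25):
    """Rank datasets by relevance * quality and keep those reaching min_score."""
    scored = [(d.relevance * d.quality, d) for d in candidates]
    kept = [(score, d) for score, d in scored if score >= min_score]
    # Sort on the score only, highest first.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in kept]


# Made-up seed list standing in for the result of querying a catalog.
candidates = [
    DatasetMetadata("dataset-a", "http://example.org/a/sparql", 0.9, 0.8),
    DatasetMetadata("dataset-b", "http://example.org/b/sparql", 0.4, 0.5),
    DatasetMetadata("dataset-c", "http://example.org/c/sparql", 0.7, 0.9),
]
selected = select_datasets(candidates)
print([d.name for d in selected])  # prints ['dataset-a', 'dataset-c']
```
      </preformat>
      <p>A multiplicative score drops a dataset that is weak on either dimension; a weighted sum would be an equally plausible choice if one dimension should be able to compensate for the other.</p>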
      <p>Prioritization of the crawler: Similar to Web crawling, there is a
trade-off between data collection efficiency and completeness in the context of LOD
crawling. On the one hand, the entity preservation process should aim to create an
overview of the available entity representations that is as complete as possible and collect
data from as many sources as possible. On the other hand, due to the topical variety,
scale and quality differences of LOD datasets, preservation should be performed
in a selective manner, prioritizing data sources and properties according to their
topical relevance for the Web archive, their general importance and their quality (e.g.
in terms of relative completeness, mutual consistency and freshness).</p>
      <p>1 http://datahub.io/</p>
      <p>2 http://data.linkededucation.org/linkedup/catalog/</p>
      <p>
        Entity-centric crawling: Entities in the Linked Data Cloud can be retrieved
from SPARQL endpoints, through URI lookups or from data dumps. The entity
preservation process can be part of a Web archive metadata enrichment that
extracts entities from the archived Web pages and feeds them into a Linked
Data crawler (e.g. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) for preservation. Fig. 2(b) presents the overall procedure
for entity extraction and crawling. Overall, entity-centric crawling includes the
following steps, which enable the crawler to collect comprehensive representations of
entities and their sources for archiving:
      </p>
      <p>Step 2a. For each entity extracted by NER, collect machine-readable entity
representations from the relevant datasets (using either SPARQL or HTTP).</p>
      <p>Step 2b. Collect the human-readable entity view(s) available through the HTTP protocol
(see Fig. 1(a)) using Web crawling techniques.</p>
      <p>Step 2c. Follow object properties (URIs) of an entity to collect related entities from
the same LOD dataset. Here, property weighting together with other
heuristics (e.g. path length) can be used to prioritize the crawler.</p>
      <p>Step 2d. Follow links (e.g. owl:sameAs) to external datasets to collect equivalent
or related entities. In this step, prioritization of the crawler can be
based on the estimated data quality parameters of these external datasets.</p>
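      <p>Steps 2c and 2d can be read as a best-first traversal of the entity graph. The following Python sketch runs on a small in-memory graph instead of a live SPARQL endpoint. The property weights, the multiplicative path score, the depth limit, the crawl budget and the "ex:" and "other:" identifiers are assumptions chosen for illustration; they are not values or names from this paper.</p>
      <preformat>
```python
import heapq

# Hypothetical property weights (step 1c); unknown properties get a low default.
PROPERTY_WEIGHTS = {"rdf:type": 1.0, "owl:sameAs": 0.9, "ex:locatedIn": 0.5}
DEFAULT_WEIGHT = 0.1


def crawl(graph, seeds, max_depth=2, budget=10):
    """Best-first traversal that follows high-weight properties first.

    graph maps an entity URI to a list of (property, object URI) pairs.
    The score of a path is the product of its property weights, so both
    low-weight properties and long paths are deprioritized.
    """
    # heapq is a min-heap, so scores are negated to pop the best entry first.
    frontier = [(-1.0, 0, uri) for uri in seeds]
    heapq.heapify(frontier)
    visited = []
    seen = set(seeds)
    while frontier and budget > len(visited):
        neg_score, depth, uri = heapq.heappop(frontier)
        visited.append(uri)
        if depth == max_depth:
            continue  # path length heuristic: do not expand any further
        for prop, target in graph.get(uri, []):
            if target in seen:
                continue  # simplification: the first path found to a node wins
            seen.add(target)
            weight = PROPERTY_WEIGHTS.get(prop, DEFAULT_WEIGHT)
            heapq.heappush(frontier, (neg_score * weight, depth + 1, target))
    return visited


# Toy graph: "other:England" stands for an equivalent entity in an external dataset.
graph = {
    "ex:London": [("rdf:type", "ex:City"),
                  ("ex:locatedIn", "ex:England"),
                  ("ex:population", "ex:PopulationRecord")],
    "ex:England": [("owl:sameAs", "other:England")],
}
print(crawl(graph, ["ex:London"]))
```
      </preformat>
      <p>In this toy run, the type and equivalence links are visited before the low-weight population record. For step 2d, the weight of a link into an external dataset could additionally be scaled by that dataset's estimated quality.</p>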
    </sec>
    <sec id="sec-3">
      <title>Conclusions and Outlook</title>
      <p>In this paper, we discussed aspects of entity-centric preservation of LOD in the
context of long-term Web archives. We described requirements and methods
for entity preservation in Web archives. As the next steps towards
entity-centric LOD preservation, we envision the investigation of preservation-relevant
quality aspects of linked datasets, as well as the development of methods for automatic
property weighting to achieve effective prioritization of entity crawling.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This work was partially funded by the European Research Council under
ALEXANDRIA (ERC 339233) and the COST Action IC1302 (KEYSTONE).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>E.</given-names>
            <surname>Demidova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Barbieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Funk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Holzmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Maynard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Papailiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Risse</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Spiliotopoulos</surname>
          </string-name>
          .
          <article-title>Analysing and enriching focused semantic web archives for parliament applications</article-title>
          .
          <source>Future Internet</source>
          ,
          <volume>6</volume>
          (
          <issue>3</issue>
          ):
          <fpage>433</fpage>
          -
          <lpage>456</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Finkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Grenager</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Incorporating non-local information into information extraction systems by Gibbs sampling</article-title>
          .
          <source>In Proc. of the ACL</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Linked Data: Evolving the Web into a Global Data Space (1st edition)</article-title>
          .
          <source>Synthesis Lectures on the Semantic Web: Theory and Technology</source>
          ,
          <volume>1</volume>
          :
          <issue>1</issue>
          ,
          <fpage>1</fpage>
          -
          <lpage>136</lpage>
          . Morgan &amp; Claypool,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Harth</surname>
          </string-name>
          .
          <article-title>LDSpider: An open-source crawling framework for the web of linked data</article-title>
          .
          <source>In Proc. of ISWC 2010</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>R.</given-names>
            <surname>Rogers</surname>
          </string-name>
          . Digital Methods. MIT Press,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>