<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Crawl Me Maybe: Iterative Linked Dataset Preservation</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>L3S Research Center, Leibniz Universität Hannover</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The abundance of Linked Data being published, updated, and interlinked calls for strategies to preserve datasets in a scalable way. In this paper, we propose a system that iteratively crawls and captures the evolution of linked datasets based on flexible crawl definitions. The captured deltas of datasets are decomposed into two conceptual sets: evolution of (i) metadata and (ii) the actual data covering schema and instance-level statements. The changes are represented as logs which determine three main operations: insertions, updates and deletions. Crawled data is stored in a relational database for efficiency purposes, while the diffs of a dataset and its live version are exposed in RDF format.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked Data</kwd>
        <kwd>Dataset</kwd>
        <kwd>Crawling</kwd>
        <kwd>Evolution</kwd>
        <kwd>Analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Over the last decade there has been a large drive towards publishing structured
data on the Web, a prominent case being data published in accordance with
Linked Data principles [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Next to the advantages concomitant with the
distributed and linked nature of such datasets, challenges emerge with respect to
managing the evolution of datasets through adequate preservation strategies.
Due to the inherent nature of linkage in the LOD cloud, changes to one part
of the LOD graph influence and propagate to other parts of the
graph. Hence, capturing the evolution of entire datasets or specific subgraphs is a
fundamental prerequisite to reflect the temporal nature of data and links.
However, given the scale of existing LOD, scalable and efficient means to compute
and archive diffs of datasets are required.
      </p>
      <p>
        A significant effort towards this problem has been presented by Käfer et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
with the Dynamic Linked Data Observatory: a long-term experiment to monitor
a two-hop neighbourhood of a core set of diverse linked data documents.
      </p>
      <p>The authors investigate the lifespan of the core set of documents, measuring
their on-line and off-line time and the frequency of changes. Furthermore, they delve
into the evolution of links between dereferenceable documents over time.
An understanding of how links evolve over time is essential for traversing linked
data documents, in terms of reachability and discoverability. In contrast to the
previous initiatives, in this work we provide an iterative linked dataset crawler.
It distinguishes between two main conceptual types of data: metadata and the
actual data covering schema and instance-level statements.</p>
      <p>In the remainder of this paper, we explain the schema used to capture the
crawled data, the workflow of the iterative crawler and the logging states which
encode the evolution of a dataset.</p>
    </sec>
    <sec id="sec-2">
      <title>Iterative Linked Dataset Crawler</title>
      <p>The dataset crawler extracts resources from linked datasets. The crawled data
is stored in a relational database. The database schema (presented in Figure 1)
was designed towards ease of storage and retrieval.</p>
      <p><sup>1</sup> The time at which a given crawl operation is triggered.</p>
      <p>computed at different levels. Each crawl explicitly logs the various changes at
schema and resource-levels in a dataset as either inserted, updated or deleted.
The changes themselves are first captured at triple-level, and then attributed to
either schema-level or resource instance-level. The dataset crawler handles the
following log operators with respect to dataset evolution.
- Insertions. New triples may be added to a dataset. Such additions
correspond to insertions.
- Deletions. Over time, triples may be deleted from a dataset for various
reasons, ranging from maintaining correctness to the detection of errors. These
correspond to deletions.
- Updates. Updates correspond to the update of one element of a triple
&lt;s, p, o&gt;.</p>
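      <p>The three log operators can be illustrated with a minimal Python sketch. It assumes, for illustration only, that a crawl is held in memory as a set of (s, p, o) triples and that an update is a change of the object value for an otherwise unchanged subject and predicate. The names Triple and compute_diff are hypothetical and are not taken from the crawler's code base.</p>
      <preformat>
```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class Triple(NamedTuple):
    s: str  # subject
    p: str  # predicate
    o: str  # object (URI or literal value)


def compute_diff(previous: Set[Triple], live: Set[Triple]) -> Dict[str, List]:
    """Classify triple-level changes between two crawls as log operations."""
    removed = previous - live  # triples only in the earlier crawl
    added = live - previous    # triples only in the fresh crawl

    # Group changed triples by (subject, predicate) to spot value changes.
    removed_by_sp = defaultdict(list)
    added_by_sp = defaultdict(list)
    for t in removed:
        removed_by_sp[(t.s, t.p)].append(t)
    for t in added:
        added_by_sp[(t.s, t.p)].append(t)

    log: Dict[str, List] = {"inserted": [], "updated": [], "deleted": []}
    for key in sorted(set(removed_by_sp) | set(added_by_sp)):
        old = sorted(removed_by_sp.get(key, []))
        new = sorted(added_by_sp.get(key, []))
        if len(old) == 1 and len(new) == 1:
            # Assumption: exactly one old and one new value means an update.
            log["updated"].append((old[0], new[0]))
        else:
            log["deleted"].extend(old)
            log["inserted"].extend(new)
    return log


if __name__ == "__main__":
    # Illustrative snapshots modelled on the Madras/Chennai example.
    earlier = {Triple("resource_uri_2", "dbpedia-owl:city", "Madras")}
    fresh = {Triple("resource_uri_2", "dbpedia-owl:city", "Chennai")}
    for operation, entries in compute_diff(earlier, fresh).items():
        print(operation, entries)
```
      </preformat>
      <p>Pairing a removed and an added triple only when a subject and predicate have exactly one of each keeps multi-valued properties from being misreported as updates; the actual crawler may resolve such cases differently.</p>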
      <p>Figure 2 presents an example depicting the computation of diffs between a
previously crawled dataset at crawl-point t0 and a fresh crawl at crawl-point t1.</p>
      <p>First, assume a change in the "live dataset" in the form of an insertion of the
triple corresponding to the URI resource_uri_2. Thus, the triple describing the
city Madras is added. If the value of the property dbpedia-owl:city is
subsequently updated, then a later crawl captures this difference in the
literal value of the property as an update to Chennai. Similarly, deletions
are also detected during the computation of diffs. Thus, computing and storing
diffs on-the-fly in accordance with the log operators is beneficial; we avoid the
overheads emerging from storing dumps of entire datasets.</p>
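      <p>When diffs are stored instead of full dumps, the state of a dataset at a given crawl-point can be derived by replaying the log. The following Python sketch shows one way to do so. The entry layout (crawl_point, operation, old_triple, new_triple), the integer crawl-points, and the function name state_at are assumptions made for illustration and do not reflect the schema of Figure 1.</p>
      <preformat>
```python
from typing import Iterable, Optional, Set, Tuple

TripleT = Tuple[str, str, str]
# Hypothetical log entry: (crawl_point, operation, old_triple, new_triple)
LogEntry = Tuple[int, str, Optional[TripleT], Optional[TripleT]]


def state_at(log_entries: Iterable[LogEntry], crawl_point: int) -> Set[TripleT]:
    """Rebuild the set of triples valid at crawl_point by replaying the log."""
    state: Set[TripleT] = set()
    # sorted() is stable, so entries of the same crawl keep their order.
    for point, operation, old, new in sorted(log_entries, key=lambda e: e[0]):
        if point > crawl_point:
            break  # later crawls do not affect the requested state
        if operation == "inserted":
            state.add(new)
        elif operation == "deleted":
            state.discard(old)
        elif operation == "updated":
            state.discard(old)
            state.add(new)
        else:
            raise ValueError("unknown log operation: " + operation)
    return state


if __name__ == "__main__":
    madras = ("resource_uri_2", "dbpedia-owl:city", "Madras")
    chennai = ("resource_uri_2", "dbpedia-owl:city", "Chennai")
    log = [
        (1, "inserted", None, madras),
        (2, "updated", madras, chennai),
    ]
    print(state_at(log, 1))
    print(state_at(log, 2))
```
      </preformat>
      <p>Replaying from the first crawl is the simplest option; a deployment with many crawl-points might instead keep periodic full snapshots and replay only the entries recorded after the nearest one.</p>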
      <sec>
        <title>Web Interface for the Iterative Dataset Crawler</title>
        <p>We present a Web interface (accessible at http://data-observatory.org/dataset_crawler)
that provides means to access the crawled resources, given
specific crawl-points of interest from the periodical crawls. The interface allows
us to filter for specific datasets and resource types. The Web application has
three main components (see Figure 3): (i) displaying metadata of the dataset, (ii)
dataset evolution, showing summaries of added/updated/deleted resources for
the different types, and (iii) dataset type-specific evolution, showing a summary
of the added/updated/deleted resource instances for a specific resource type and
corresponding to specific crawl time-points. In addition, the crawler tool is made
available along with instructions for installation and configuration<sup>2</sup>.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>In this paper, we presented a linked dataset crawler for capturing dataset
evolution. Data is preserved in the form of three logging operators
(insertions/updates/deletions) by performing an online computation of diffs for any given dataset with
respect to the live state of the dataset and its previously crawled state (if
available). Furthermore, the crawled data and computed diffs of a dataset can be used to
assess its state at any given crawl-point. Finally, we provided a web interface
which allows the setup of the crawler, and facilitates simple query functionalities
over the crawled data.</p>
      <p><sup>2</sup> https://github.com/bfetahu/dataset_crawler</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked data - the story so far</article-title>
          .
          <source>Int. J. Semantic Web Inf. Syst.</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1</fpage>
          –
          <lpage>22</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. T. Kafer,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdelrahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>O'Byrne</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          .
          <article-title>Observing linked data dynamics</article-title>
          .
          In
          <source>The Semantic Web: Semantics and Big Data</source>
          , pages
          <fpage>213</fpage>
          –
          <lpage>227</lpage>
          . Springer,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>