<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Preserving Linked Data on the Semantic Web by the application of Link Integrity techniques from Hypermedia</article-title>
      </title-group>
      <contrib-group>
<contrib contrib-type="author">
          <string-name>Rob Vesse</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wendy Hall</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leslie Carr</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Intelligence, Agents &amp; Multimedia Group, School of Electronics &amp; Computer Science, University of Southampton</institution>
          ,
          <addr-line>Southampton SO17 1BJ</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2010</year>
      </pub-date>
      <fpage>26</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>As the Web of Linked Data expands it will become increasingly important to preserve data and links such that the data remains useful. In this work we present a method for locating linked data to preserve which functions even when the URI the user wishes to preserve does not resolve (i.e. is broken/not RDF) and an application for monitoring and preserving the data. This work is based upon the principle of adapting ideas from hypermedia link integrity in order to apply them to the Semantic Web.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The Web of Linked Data is characterised by the
interlinking between disparate heterogeneous data sources and
the fact that the links between the data sources are one of
the primary mechanisms for navigating through this data
space. Since links are essential to the Web of Linked Data
we believe that it is important to have mechanisms in place
to maintain link integrity. The aim of link integrity is to
ensure that a link works correctly in that traversing the link
takes you to a resource and that as far as possible the
resource is the one intended by the provider of the link. On
a larger scale, link integrity deals with the overall integrity
of interlinked datasets such as documents within a Content
Management System (CMS) or the linked data sets available
on the Semantic Web. Therefore link integrity is one way of
ensuring data integrity within the overall system which in
our use case is linked datasets.</p>
<p>Link integrity is an existing and well known problem from hypermedia, where there were two problems to be dealt with: dangling links and the editing problem. Dangling links are the better known problem and are regularly experienced by users on the Web as they find themselves presented with an HTTP error because the link they followed pointed to a resource which cannot be retrieved. The editing problem refers to the situation in which the content at the end of the link is changed so that it is no longer what the creator of the link intended to link to. Both these issues affect the Semantic Web: since on the Semantic Web everything is interlinked data, data is immediately susceptible to dangling links. The editing problem becomes much more problematic on the Semantic Web since anyone can make a statement about anything, so the meaning of things on the Semantic Web is subject to semantic drift (see, for example, the discussion on the W3C Semantic Web Mailing List: http://lists.w3.org/Archives/Public/semantic-web/2009May/0315.html).</p>
      <p>Due to the interlinking between data on the Semantic Web
we show that it is possible to exploit the data model such
that links themselves can be used to recover missing data in
the event of a dangling link being encountered. This
provides for a means to retrieve the data that was/may have
been at a given URI even if that URI is no longer resolvable.
Using this approach for locating data about a URI we are
able to preserve and monitor data about a URI from
multiple sources and to recover data about URIs that are no
longer functioning as described in Section 3.1.</p>
<p>The Semantic Web also introduces two additional problems in link integrity specific to linked data. The first of these is URI Identity &amp; Meaning - what does a URI mean, and does this meaning actually matter to the applications that use it and the data that contains it? This is very much an open research debate and beyond the scope of our current work. The second is the co-reference problem, which refers to the situation in which some `thing' we wish to make statements about has multiple URIs that could be used for it. In this work we utilise existing work in this area of research as part of our algorithm for preserving linked data.</p>
<p>Section 2 of this paper covers the related work in link integrity for hypermedia and the Semantic Web, from which we use ideas in Section 3 to design and develop our algorithm and the AAT software. Section 5 outlines our plans for future research in this area and we conclude in Section 6 by discussing the potential benefits of link integrity for the Semantic Web.</p>
    </sec>
    <sec id="sec-2">
<title>2. RELATED WORK</title>
      <p>
Link integrity in hypermedia first received serious attention in the late 1980s and early 1990s, primarily from researchers in the open hypermedia community. Systems like Microcosm [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and Hyper-G [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] were among the first to consider the issue in depth; Davis's thesis [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and Kappe's 1995 paper [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] provide examples of link integrity in open hypermedia. The widespread growth of the World Wide Web [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] in the mid-1990s led to some new research, but as search engines became commonplace towards the end of the decade research interest dwindled. It was perceived that users did not care sufficiently to warrant research into the problem, as they could locate missing resources effectively using search engines; in addition, the scale of the Web by that time was simply too vast for many proposed solutions to handle. Davis's survey [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] provides a good overview of the state of this research as of the end of the 1990s. Another reason for the decline in research was that the ability of links to fail was one of the reasons the Web was able to expand as fast as it did: it did not matter if links failed and produced the familiar HTTP 404 error, so users were able to publish content without worrying about whether their links to external content were valid.
      </p>
      <p>
        Ashman's 2000 paper [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] which discusses link integrity
with particular reference to electronic document archives,
provides both a useful survey of existing work and describes
a key motivation for ongoing research. As more document
collections were translated into digital forms and placed onto
intranets people once again started to be concerned about
link integrity. Users wanted assurances that links into the
document archives would work consistently and ideally links
out of the archives would work correctly as well since it may
not be possible to alter the archived documents without
invalidating the integrity of the archive.
      </p>
      <p>
In this vein Veiga and Ferreira [
        <xref ref-type="bibr" rid="ref24 ref25">24, 25</xref>
        ] discuss the possibility of turning the Web into an effective knowledge repository by use of replication and versioning. Their work follows on from earlier work such as Moreau &amp; Gray's [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], which proposed limited use of replication and versioning but had significant reliance on author and user involvement in the process. In Veiga &amp; Ferreira's work there is no requirement for author involvement in the process; only the end user need use a browser plugin to indicate the content they wish to replicate and preserve. Their results showed that the user could preserve the sections of the Web they were interested in with no perceivable performance impact - on average there was only a 12ms increase in retrieval time for resources. In Section 3 we discuss using an approach of this kind for the Semantic Web.
      </p>
      <p>
        Phelps &amp; Wilensky introduced the concept of lexical
signatures for Web pages in their Robust Hyperlinks paper [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
They compute the lexical signature of a page and append
it to all links to that page so that in the event of the link
failing a browser plugin can use the signature to relocate the
page using a search engine. The obvious aw in their work
was that it required rewriting all the links on the Web but
Harrison &amp; Nelson later showed that these signatures need
only be computed Just-in-Time (JIT) when a link fails [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
In their Opal system the signatures can be computed JIT
by retrieving cached copies of the pages from a search
engine cache, computing the signature and then using search
engines to relocate the page. As discussed in Section 3.1 a
JIT style approach can be e ectively used to recover linked
data about a URI.
2.1
      </p>
    </sec>
    <sec id="sec-3">
<title>2.1 Semantic Web Research</title>
      <p>
Unlike the traditional Web, it is not possible for semantic search engines like Sindice [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] and Falcons [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to fulfil the same role as document search engines, because the users in the Semantic Web domain are typically client applications rather than humans. When a human encounters a dead link they usually navigate to a search engine and enter an appropriate search phrase to find alternative sources of information. A client application encountering a dead link will typically have no concept of how/where to find alternative sources of information, and URIs for linked data are not always ideal for searching upon compared to textual search for documents. It should be noted that, as with the existing Web, if the Web of Linked Data is to undergo a massive expansion in the same way then things must be allowed to fail, but this does not mean we should not attempt to mitigate the problem as far as possible.
      </p>
      <p>
In terms of the Semantic Web there has been research into the versioning and synchronisation of RDF data which is relevant to aspects of our work, such as Tummarello et al's RDFSync [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], which is an algorithm for efficiently synchronising changes in RDF between multiple machines. This shows that change detection in RDF is non-trivial due to the inherent data isomorphism caused by the use of blank nodes, but also shows that it can be achieved in an efficient manner. More recent research from Papavassiliou et al [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] has shown that information about very basic changes in the RDF - such as that provided by systems like RDFSync or All About That (see Section 3.2) - can be used to build applications which provide useful information to end users. In the case of Papavassiliou et al's paper they built a system which furnished users with high level descriptions of how RDFS vocabularies have changed in order to aid users in working with such vocabularies. In addition there are systems like the Talis Platform (http://www.talis.com/platform), a Semantic Web store that implements a versioning mechanism whereby updates can be made via a Changeset protocol [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. As part of this protocol they utilise a useful lightweight vocabulary for publishing changes in RDF data as RDF which, as will be discussed in Section 3.2.3, we reuse in our own system.
      </p>
      <p>
Regarding Semantic Web specific link integrity problems, the research has largely focused on the co-reference problem. Since there are many organisations publishing similar data semantically (bibliographic databases being a prime example) there are frequently many URIs for a single entity such as an author. Co-reference research aims to develop ways to efficiently and accurately determine URI equivalences and refactor the data or republish this information to help other Semantic Web applications. There are several competing philosophies, ranging from the Okkam approach described by Bouquet et al [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which advocates universally agreed URIs for each entity, to the Co-reference Resolution Service (CRS) approach of Jaffri et al [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], which determines co-referent URIs and republishes the information in dedicated triple stores. The CRS approach taken by the ReSIST project (http://www.resist-noe.org/) within the RKB Explorer (http://www.rkbexplorer.com) application has potential for use in link integrity, as the information provided by a CRS could be utilised in a JIT fashion as in Harrison &amp; Nelson's work; we demonstrate how this can be done in Sections 3 and 4.
      </p>
      <p>
In terms of link maintenance for the Semantic Web there has been some research in the form of the Silk framework by Volz et al [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], which is a framework for computing links between different datasets. Their approach allows users to stipulate arbitrarily complex matching criteria for entity matching between datasets; the links produced from this can then be published via a CRS style service or added to the relevant datasets. As proposed in their later paper [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] this can be used as part of a link maintenance strategy; the possibility of combining this with our approach is discussed in Section 5. In a similar vein, Haslhofer and Popitsch's DSNotify system [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] can monitor linked resources and inform the application when links are no longer valid, using feature based similarity metrics like the Silk framework.
      </p>
    </sec>
    <sec id="sec-4">
<title>3. METHOD</title>
<p>As we have discussed, it is not realistic to maintain link integrity in a pre-emptive way since such solutions have consistently been shown in previous work not to scale to the size of the Web. Therefore the focus must be on recovery in the event of failure and on preservation to guard against the loss of data which is considered interesting/useful to end users. As the amount of data in the Web of Linked Data starts to expand massively - particularly with linked data being adopted by an increasing number of major organisations - we expect that, as with the early document Web, there will be an increasing amount of content published by both big companies and individuals. Just like the document Web, this explosion of content will most likely include much content that is poorly maintained and will lead to increasing numbers of broken links. We have two connected goals in this work: 1) to provide a means to retrieve resource descriptions in the form of linked data about a URI even when the URI is non-functional and 2) to provide the means for an end user to preserve and version these descriptions. To attempt to solve this problem we present an expansion algorithm for retrieving Linked Data about a URI even if that URI itself has failed in Section 3.1 and a preservation system built using this algorithm in Section 3.2.</p>
    </sec>
    <sec id="sec-5">
<title>3.1 Expansion Algorithm</title>
<p>Since the goal of this work is to preserve linked data it was deemed essential that, as far as possible, we leverage existing linked data technologies and services in order to effect this preservation. To this end we designed a relatively simple algorithm which uses simple crawling techniques directed by a user definable expansion profile (see Definition 1). Our aim with this algorithm is to provide resource descriptions of a URI regardless of whether the URI itself is dereferenceable.</p>
<p>Even in the case where a URI is used only as an identifier in the description of another resource and is not itself dereferenceable, it is likely that we can still retrieve some data about it. The fact that a URI is minted only as an identifier and that the person/organisation minting the URI does not provide the means to dereference the URI does not affect our ability to find data about it, assuming that the identifier is used elsewhere, i.e. it is reused as part of linked data.</p>
<p>Definition 1. An expansion profile is a Vocabulary of Interlinked Datasets (VoID) description of a set of datasets and linksets that should be used to locate linked data about the URI of interest. The VoID description may optionally be annotated with additional properties which affect the behaviour of the algorithm.</p>
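<p>To make Definition 1 concrete, the following sketch models an expansion profile as a plain data structure rather than as VoID RDF. The field names and the foaf:knows linkset are illustrative assumptions only; in AAT itself a profile is an RDF document using the VoID vocabulary plus the extensions described later in this section.</p>

```python
# Illustrative sketch only: an expansion profile reduced to a plain
# Python structure. In the paper a profile is a VoID RDF description;
# all field names and endpoint URLs below are assumptions.

profile = {
    "max_depth": 1,                    # how deep the crawl may expand
    "datasets": [
        {"name": "DBPedia",
         "sparql_endpoint": "http://dbpedia.org/sparql",
         "ignored": False},
        {"name": "sameAs.org",
         "discovery_endpoint": "http://www.sameas.org/",
         "ignored": False},
    ],
    "linksets": [                      # extra predicates treated as links
        {"link_predicate": "foaf:knows", "ignored": False},
    ],
}
```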
      <p>
Drawing on ideas described in Alexander et al's Vocabulary of Interlinked Datasets (VoID) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] about the way it can be used to direct crawlers, we decided to use VoID as the primary means of expressing an expansion profile. We introduce a couple of additional predicates, since we require the means to allow end users to specify some basic characteristics of how the algorithm should behave, and there is a type of service we need to express which is not contained in the VoID ontology. VoID has concepts of Datasets and Linksets: the former represent a set of data which may have SPARQL endpoint(s) and/or URI lookup endpoint(s), while the latter represent the types of interlinkings between datasets. What VoID does not have is a means to express the location of a service provided by a dataset which allows an application to retrieve URIs which are considered equivalent to a given URI - this we term a URI discovery endpoint (see Definition 2). A discovery endpoint differs from a lookup endpoint in that the latter is expected to return everything the dataset knows about the given URI as opposed to only returning equivalent URIs. Examples of existing discovery endpoints on the Semantic Web include RKB Explorer's CRSes [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and sameAs.org (http://www.sameas.org). Another key difference between a lookup and a discovery endpoint is that links discovered from a discovery endpoint are considered to be on the same level of the crawl for the purposes of the algorithm, i.e. they do not have increased depth relative to the URI that discovery is performed upon. By this we mean that the execution of the algorithm results in performing a breadth-first depth-limited linked data crawl starting from a given URI - in this tree structure a discovery endpoint introduces sibling nodes for a URI while a lookup endpoint introduces child nodes for a URI.
      </p>
<p>Our other extensions to VoID allow individual datasets/linksets to be marked as ignored (the algorithm will not use them) and allow the user to define the depth to which the algorithm should crawl (defaults to 1). These extensions are defined as part of the AAT schema detailed in Section 3.2.1.</p>
<p>Definition 2. A URI discovery endpoint is an endpoint that, when passed a URI, returns a Graph containing equivalent URIs of the input URI, typically in the form of owl:sameAs links.</p>
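<p>The distinction between the two endpoint types can be illustrated with a small sketch over an in-memory set of triples; both functions are hypothetical stand-ins for HTTP calls to real endpoints:</p>

```python
# Hypothetical stand-ins illustrating Definition 2: a lookup endpoint
# returns everything a dataset knows about the URI (child nodes in the
# crawl tree), while a discovery endpoint returns only equivalent URIs
# (sibling nodes at the same crawl depth). Triples are (s, p, o) tuples.

def lookup(dataset, uri):
    """Lookup endpoint: everything the dataset knows about the URI."""
    return {t for t in dataset if uri in (t[0], t[2])}

def discover(dataset, uri):
    """Discovery endpoint: only URIs asserted equivalent to the input."""
    equivalent = set()
    for s, p, o in dataset:
        if p == "owl:sameAs" and s == uri:
            equivalent.add(o)
        elif p == "owl:sameAs" and o == uri:
            equivalent.add(s)
    return equivalent
```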
      <p>
As already stated, the actual algorithm is a simple crawler which uses the input expansion profile as a guide to which potential sources of linked data it should use to try and find data about the URI of interest - this procedure is detailed in Algorithm 1. Note that the algorithm does not terminate in the event of an error retrieving data from a particular URI/endpoint and simply continues; by doing this it is still possible to retrieve some data even if the starting URI does not return a valid response. The algorithm will continue and issue queries about the URI to the various endpoints described in the given expansion profile, so unless the URI refers to a document that had very poor linkages or was not indexed by the semantic search services used, some RDF will be returned. This approach has similarities to the JIT style approach of Harrison &amp; Nelson [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] in that there does not need to be any foreknowledge of the URIs you wish to recover data about when you discover they are broken, since by utilising the caches and lookup services of relevant datasets it is still possible to recover data about the URI.
      </p>
<p>The basic behaviour of the algorithm is only to follow owl:sameAs and rdfs:seeAlso links, but the end user can specify that any predicate be treated as a link to follow by specifying an appropriate VoID linkset in their expansion profile.</p>
      <sec id="sec-5-1">
<title>Algorithm</title>
        <sec id="sec-5-1-1">
          <title>Algorithm 1 Expansion Algorithm</title>
<p>Require: URI, Expansion Profile
1: ToExpand as a set of pairs of URIs and Depths
2: while ToExpand ≠ ∅ do
3:   Remove first pair from ToExpand
4:   if Graph with URI is already in the Dataset then
5:     Continue
6:   if Depth &gt; Max Depth then
7:     Continue
8:   Retrieve the Graph at the URI
9:   Add the Graph to the Dataset
10:  for all Triples in Graph do
11:    if Triple is a Link then
12:      Add a new pair to ToExpand
13:  for all Datasets in Expansion Profile do
14:    if Dataset has a SPARQL Endpoint then
15:      Issue a DESCRIBE for the URI against the Endpoint
16:      Add resulting Graph to the Dataset
17:      Process the Graph for additional Links
18:    if Dataset has a Lookup Endpoint then
19:      Issue a Lookup for the URI against the Endpoint
20:      Add resulting Graph to the Dataset
21:      Process the Graph for additional Links
22:    if Dataset has a Discovery Endpoint then
23:      Issue a Discovery for the URI against the Endpoint
24:      for all Equivalent URIs do
25:        Add a new pair to ToExpand
26: return Dataset</p>
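<p>A minimal executable sketch of Algorithm 1, assuming triples are modelled as (subject, predicate, object) tuples and endpoints are supplied as callables; this illustrates the crawl's control flow and error tolerance, not the actual AAT implementation:</p>

```python
from collections import deque

def expand(start_uri, profile, fetch, max_depth=1):
    """Breadth-first, depth-limited crawl returning {uri: graph}.

    A graph is a set of (s, p, o) tuples. Each entry of profile is a
    dict that may provide 'lookup' and/or 'discover' callables standing
    in for the VoID-described endpoints, plus optional extra
    'link_predicates'. Errors never abort the crawl, so data can still
    be recovered even when the starting URI is broken.
    """
    link_predicates = {"owl:sameAs", "rdfs:seeAlso"}   # default links
    for ds in profile:
        link_predicates.update(ds.get("link_predicates", ()))

    dataset = {}                         # named graphs keyed by source URI
    to_expand = deque([(start_uri, 0)])
    while to_expand:
        uri, depth = to_expand.popleft()
        if uri in dataset or depth > max_depth:
            continue                     # already crawled or too deep
        graph = set()
        try:
            graph.update(fetch(uri))     # try to dereference the URI itself
        except Exception:
            pass                         # dangling URI: carry on regardless
        for ds in profile:
            if "lookup" in ds:
                try:                     # everything the dataset knows
                    graph.update(ds["lookup"](uri))
                except Exception:
                    pass
            if "discover" in ds:
                try:                     # equivalent URIs join as siblings
                    for same in ds["discover"](uri):
                        to_expand.append((same, depth))
                except Exception:
                    pass
        dataset[uri] = graph
        for s, p, o in graph:
            if s == uri and p in link_predicates:
                to_expand.append((o, depth + 1))   # children: depth + 1
    return dataset
```

<p>With the default depth of 1 only immediate neighbours of the starting URI are crawled, matching the behaviour of the default profile described in Section 3.1.1.</p>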
          <p>
There are already some existing systems which work in a similar way to our algorithm, such as the Sponger middleware used in Virtuoso [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]. The main difference between our algorithm and algorithms such as those in the Virtuoso Sponger is that our algorithm is only interested in linked data and it does not infer/create any additional data. Unlike the Virtuoso Sponger it does not attempt to turn non-linked data into RDF and it does not do any inference over the data it returns; it is designed only to find and return (in the form of an RDF dataset) linked data about the URI of interest. Yet as expansion profiles may reference any datasets and associated endpoints they wish, there is no reason why a user could not direct our algorithm to utilise a service like URIBurner (http://www.uriburner) which uses the Virtuoso Sponger in order to get the benefits of the additional inferred data.
          </p>
<p><bold>3.1.1 Default Profile</bold></p>
<p>Since the end user of such an algorithm may not always know where to look for linked data about the URI they are interested in, the algorithm has a default expansion profile which is used when no profile is specified. This profile uses 3 data sources which are in our opinion important hubs of the Web of Linked Data:</p>
<p>DBPedia (http://dbpedia.org) - The DBPedia SPARQL endpoint is used to look up URIs</p>
<p>Sindice (http://www.sindice.com) Cache - The Sindice Cache API (http://www.sindice.com/developers/cacheapi) allows the retrieval of Sindice's cached copy of the RDF from a URI.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>Default Profile (continued)</title>
<p>SameAs.org (http://www.sameas.org) - SameAs.org provides a URI discovery endpoint (see Section 3.1 and Definition 2) which can be used to find URIs which are equivalent to a given URI.</p>
        <p>The default profile (http://www.dotnetrdf.org/expander/defaultProfile) has a max expansion depth of 1, which means it only considers URIs which are immediate neighbours of the starting URI.</p>
<p>In the case where the end user does know which linked data sources will have useful information about the URI, they can specify their own expansion profile which is used instead of the default profile. In this case the algorithm will use the datasets and linksets they define in the profile to discover linked data about the URI of interest; for example, if attempting to recover data about a person it may be useful to follow foaf:knows links.</p>
      </sec>
    </sec>
    <sec id="sec-6">
<title>3.2 Preservation</title>
      <p>
The preservation approach taken is to allow the end user to monitor and preserve a set of linked data that they are interested in. The data is preserved not at the data source but rather at a local level on the user's server, with the user able to republish this data as they desire. This is in line with the ideas of Veiga &amp; Ferreira [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] in that the end user specifies the parts of the Web they want to preserve and then the software takes care of this. The data must be preserved in such a way that the original data can be efficiently extracted from it and sufficient information to provide versioning over the data is kept.
      </p>
<p>In the Semantic Web domain the objects of interest are URIs, so we propose that a profile of a URI be preserved (see Definition 3). Since the data being processed is RDF it is logically divided into triples which can be preserved and monitored individually. It is deemed necessary to store information pertaining to the temporality and provenance of each triple - when it was first seen, when it was last updated, its source URI(s) and whether it has changed or been retracted/deleted from the RDF.</p>
<p>Definition 3. A URI's profile is the transformed and annotated form of the linked data retrievable about a given URI, such that the temporality and provenance of the triples contained therein are inferable from the profile.</p>
<p>In terms of user interface the system should allow a user to view a profile both in the stored form and in its original form. The system must monitor the original data source over time, updating the profiles as necessary such that it can provide a report of changes in the data to the user. Since a URI profile will contain versioning information the interface should allow a user to view a particular version of the profile.</p>
<p><bold>3.2.1 Schema</bold></p>
<p>As the first stage of implementation an RDF Schema for All About That (AAT, available at http://www.dotnetrdf.org/AllAboutThat/) is defined which embodies classes and properties allowing the description and annotation of triples in such a way that the required information as discussed in the preceding proposal can be stored for each triple. The schema defines a class for representing profiles called aat:Profile and uses the rdf:Statement class to represent triples. rdf:Statement is used as the basis of triple storage as it makes it possible for non-AAT aware tools to extract the original triples from the profile easily. A number of properties are defined which store metadata about the profile itself such as created &amp; updated date, source URI and a locally unique identifier for the profile. Similar properties are defined for triples which allow the first and last asserted dates, source URI and change status of a triple to be indicated. A key distinction in the schema is between aat:profileSource and aat:source; despite storing equivalent data, two predicates are created since the former expresses the URI which is the starting point for the profile while the latter expresses all the URIs at which a given triple is asserted.</p>
      <p>
While there were alternative schemas and vocabularies available that could potentially have been used to store the required data, the motivation behind designing our own schema was to provide a lightweight schema that attached all data to a single subject for ease of processing. Alternatives such as the Provenance Vocabulary by Hartig &amp; Zhao [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] are far more expressive but they potentially require introducing multiple intermediate blank nodes, which would significantly complicate the processing needed to implement many of the core features of AAT. Similarly the Open Provenance Model as described by Moreau et al [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] is highly expressive but, like Hartig &amp; Zhao's vocabulary, its RDF serialization is overly complex for use in AAT. As discussed in Section 5 there is no reason why the data contained in AAT could not be exposed in other provenance vocabularies, but for AAT's processing and storage a lightweight vocabulary is preferable.
      </p>
<p>The use of reification was chosen over the use of named graphs primarily due to the need to make annotations at the level of individual triples rather than at the graph level; its usage is motivated by the fact that the mechanism provides a clear and obvious schema for encoding a triple and adding additional annotations to it. While reification may significantly increase the size of the data being stored initially, over time this balances out compared to named graphs, where it is necessary to either store many copies of the same graph or store multiple named graphs which represent a series of deltas to the original data. The other difficulty inherent in the named graphs approach is that the annotations typically would then be held separately in other named graphs, which adds to the complexity of the data processing. Nevertheless named graphs are used within AAT since each profile naturally forms a named graph and AAT generates several related named graphs about each profile detailing change history and changesets as described in Section 3.2.3.</p>
<p><bold>3.2.2 Profile Creation &amp; Update</bold></p>
<p>To create a URI's profile, linked data about the URI is first retrieved using the expansion algorithm presented in Section 3.1; then using the AAT schema each triple can be transformed into a set of triples which represent an annotation of the original triple. For each triple in the original RDF a blank node is created which is then used as the subject of a set of triples which represent the required information about the original triple. Figure 1 shows an example triple and Figure 2 shows it transformed into the AAT form. A URI's profile consists of a set of transformed triples where each profile is a named graph in the underlying store.</p>
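<p>The transformation described above can be sketched as follows; the aat: property names mirror those introduced in Section 3.2.1, but the exact shape of the output is an assumption for illustration:</p>

```python
import itertools

# Hedged sketch of the AAT-style transformation: each original triple
# becomes a blank node typed rdf:Statement and annotated with
# provenance properties. The aat: property names follow the text; the
# precise output shape is an assumption for illustration.

_ids = itertools.count()

def annotate_triple(triple, source_uri, seen_date):
    """Return the set of triples annotating one original triple."""
    s, p, o = triple
    node = "_:t%d" % next(_ids)        # fresh blank node per triple
    return {
        (node, "rdf:type", "rdf:Statement"),
        (node, "rdf:subject", s),
        (node, "rdf:predicate", p),
        (node, "rdf:object", o),
        (node, "aat:firstAsserted", seen_date),
        (node, "aat:lastAsserted", seen_date),
        (node, "aat:source", source_uri),
    }

def build_profile(profile_uri, graph, seen_date):
    """Transform retrieved RDF into a profile (one named graph)."""
    profile = {(profile_uri, "rdf:type", "aat:Profile"),
               (profile_uri, "aat:profileSource", profile_uri),
               (profile_uri, "aat:created", seen_date)}
    for t in graph:
        profile.update(annotate_triple(t, profile_uri, seen_date))
    return profile

def export_profile(profile):
    """Recreate the original triples from a profile (cf. Definition 5)."""
    nodes = {}
    for s, p, o in profile:
        nodes.setdefault(s, {})[p] = o
    return {(d["rdf:subject"], d["rdf:predicate"], d["rdf:object"])
            for d in nodes.values() if d.get("rdf:type") == "rdf:Statement"}
```

<p>Because the annotations use plain rdf:Statement reification, export_profile shows how even non-AAT aware tools can recover the original triples from a profile.</p>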
      <p>
        Since the user needs both to browse the data they are
preserving and potentially to republish it, a Web based
interface was designed as the primary interaction mechanism.
The interface allows users to explore the data by first
selecting a profile to view, and then allowing them to view profile
contents, exports, versions and change reports. A user may
also use the interface to add new URIs they wish to monitor
to the system and to initiate updates to profiles (see
Definition 4). Following linked data best practices [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and to
provide the ability for the user to republish their preserved
data, multiple dereferenceable URIs for each profile are
created and made accessible through the Web interface. These allow
the retrieval of the profile contents, which consists of all the
triples ever retrieved from the profile URI in the transformed
form, the export of the profile (see Definition 5) and various
meta graphs about a profile, e.g. change history and changesets.
This means that the profile of a URI has a URI of its own and thus
can itself be profiled if desired.
      </p>
      <p>Definition 4. An update of a profile occurs when AAT
uses the expansion algorithm to retrieve RDF about the
given URI. The triples contained are compared with the
triples currently in the profile and the profile is updated
accordingly.</p>
      <p>Definition 5. The export of a profile is the recreation of
the RDF in its original form based upon the current contents
of the profile. An export represents the RDF as it was last
seen by AAT.</p>
      <p>3.2.3 Change Reporting</p>
      <p>A key feature of AAT is the ability to generate change
reports about how the RDF at the profiled URI has changed
over time. To do this, a number of relatively simple
computations over the annotated triples can be made, based
primarily on the first and last asserted dates of the triples. In
creating change reports, four different types of changes in the
RDF are looked for (see Definitions 6-9). A distinction is
made between missing knowledge and retracted or deleted
knowledge, as it is possible for triples to be perceived to
be temporarily non-present in the RDF. For example, in the
event of a transient network issue making some or all of the
relevant URIs unretrievable, the updated date for the
profile will still be updated, leaving all the triples in the profile
appearing to be missing. The length of time we require triples
to be missing before we consider them to be deleted is
currently set to 7 days for our monitoring of the BBC dataset
described in Section 4.2.1; this time period is a domain
specific parameter that can be adjusted depending on the data
that is being monitored.</p>
      <p>Definition 6. New knowledge is any triple that is new to
the RDF at the profiled URI.</p>
      <p>Definition 7. Changed knowledge is any triple where the
object of the triple has changed. Only triples whose
predicate has a cardinality of 1 can be considered to change.</p>
      <p>Definition 8. Missing knowledge is any triple no longer
found in the RDF at the profiled URI but which was recently
seen in the RDF.</p>
      <p>Definition 9. Retracted or deleted knowledge is any triple
no longer found in the RDF at the profiled URI which has
not been seen for a reasonable length of time.</p>
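      <p>The distinction between missing and deleted knowledge reduces to a
simple date comparison; the sketch below assumes an in-memory
representation and uses the 7 day threshold quoted above (the function
and parameter names are illustrative):</p>

```python
from datetime import date, timedelta

# Domain-specific threshold: triples absent for longer than this are
# considered deleted rather than merely missing (7 days in the BBC tests).
DELETION_THRESHOLD = timedelta(days=7)

def classify_absent_triple(last_asserted, profile_updated):
    """Classify a triple no longer seen at the profiled URI as
    'missing' (recently seen) or 'deleted' (absent beyond threshold)."""
    if profile_updated - last_asserted > DELETION_THRESHOLD:
        return "deleted"
    return "missing"
```
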
      <p>With regard to the concept of changed knowledge, consider
some arbitrary predicates ex:one and ex:many which have
cardinalities of 1 and unrestricted respectively. Since ex:one
has a cardinality of 1, it can be said that whenever the object
of that triple has changed it is changed knowledge. The same
cannot be said for ex:many triples: as the predicate has
unrestricted cardinality, each triple using this
predicate must be treated as a unique entity, i.e. one instance of
a triple using this predicate cannot be considered to replace
another. In the examples, the fact that &lt;A&gt; was related
to &lt;C&gt; via the predicate ex:many in Example 1 and now
is instead related to &lt;E&gt; in Example 2 does not mean it is
related to &lt;E&gt; instead of &lt;C&gt;; it just means it no
longer considers itself related to &lt;C&gt;. The fact that
it is related to &lt;E&gt; is new knowledge while the fact that
it was related to &lt;C&gt; is missing/deleted knowledge; had
the value of the ex:one relationship changed, that
would be considered changed knowledge.</p>
      <sec id="sec-6-1">
        <title>Example 1 Original Graph</title>
        <p>&lt;A&gt; ex:one &lt;B&gt; .
&lt;A&gt; ex:many &lt;C&gt; .
&lt;A&gt; ex:many &lt;D&gt; .</p>
      </sec>
      <sec id="sec-6-2">
        <title>Example 2 Modified Graph</title>
        <p>&lt;A&gt; ex:one &lt;B&gt; .
&lt;A&gt; ex:many &lt;D&gt; .
&lt;A&gt; ex:many &lt;E&gt; .</p>
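      <p>Under Definitions 6-9 the difference between the two example graphs
can be computed mechanically. The sketch below treats graphs as sets of
triples and takes a hypothetical set of cardinality-1 predicates as
input; it illustrates the classification rules rather than AAT's actual
implementation:</p>

```python
def diff_graphs(old, new, cardinality_one):
    """Classify triple-level differences between two graphs.
    Predicates in cardinality_one yield 'changed' knowledge when their
    object differs; all other differences are new or missing knowledge."""
    changed, added, missing = [], [], []
    for (s, p, o_old) in old - new:
        # Look for a replacement object for the same subject/predicate.
        replacement = next(
            (o for (s2, p2, o) in new - old if (s2, p2) == (s, p)), None)
        if p in cardinality_one and replacement is not None:
            changed.append((s, p, o_old, replacement))
        else:
            missing.append((s, p, o_old))
    for t in new - old:
        s, p, o = t
        # Skip triples already accounted for as the new value of a change.
        if not (p in cardinality_one and any(c[:2] == (s, p) for c in changed)):
            added.append(t)
    return changed, added, missing
```

Applied to Examples 1 and 2 with ex:one as the only cardinality-1 predicate, this yields no changed knowledge, one new triple (the ex:many link to E) and one missing triple (the ex:many link to C), matching the discussion above.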
        <p>
          When a change report is computed, it is itself serialized
into an RDF graph using the Talis Changeset ontology [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]
which is stored as a named graph in the underlying store
and republished via the web interface. Each Changeset
generated links back to the previous Changeset (if one exists)
so that an end user or client application consuming the data
can follow the history of changes; a special URI which
retrieves the most recent Changeset is provided so that users
have a starting point for this. Separate from Changesets, a
named graph containing a history for each profile is also
stored which links to all the relevant Changesets for that
profile.
        </p>
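      <p>The chain of Changesets behaves like a linked list which a client can
walk backwards from the most recent Changeset; a minimal sketch, assuming
an in-memory store and an illustrative precedingChangeSet key (not the
actual Talis Changeset terms):</p>

```python
def changeset_history(store, latest_uri):
    """Follow links from the most recent Changeset back to the first,
    returning the changesets in reverse chronological order.
    `store` maps Changeset URIs to their data; the chain ends when a
    changeset has no precedingChangeSet link."""
    history = []
    uri = latest_uri
    while uri is not None:
        cs = store[uri]
        history.append(cs)
        uri = cs.get("precedingChangeSet")  # None when no earlier report
    return history
```
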
      </sec>
    </sec>
    <sec id="sec-7">
      <title>4. RESULTS</title>
    </sec>
    <sec id="sec-8">
      <title>4.1 Expansion</title>
      <p>To test the expansion algorithm we took a small sample of
URIs which included the URIs of the authors, places
associated with the authors, and TV programmes from the BBC
(since we use the BBC programmes dataset for our
preservation tests, as described in Section 4.2). The results shown
in Table 1 show that the amount of linked data that can
be obtained using the default expansion profile described in
Section 3.1.1 varies depending on the URI being profiled.
Expanding the URI of a person potentially produces a large
number of small graphs, particularly if that person is a well
published academic, since many bibliographic databases are
exposed as linked data and provide small amounts of data
about people. As can be seen, URIs for places return
varying amounts of data depending on the size and relative
importance of the place. Conversely, expanding the URIs
of BBC programmes using the default profile produces very
little linked data; we suspect that this is due to the type
of data and the fact that the linking it uses is mostly based on
the BBC's ontologies. As outlined in Section 5 we plan to
conduct experiments in the future to assess the efficacy of
the algorithm on various types of data and using domain
specific expansion profiles.</p>
      <p>One of the benefits of the algorithm, as can be seen
in the results in Table 1, is that it is trivially parallel.
Increasing the number of threads used to process the
discovered URIs shows a significant reduction in the time taken
to retrieve the linked data. Experiments were conducted
with higher numbers of threads, but 8 threads was found to
be optimal since beyond 8 threads erratic behaviour is
observed due to two factors: 1. underlying limitations of the
HTTP API used in terms of stable concurrent connections,
and 2. high volumes of concurrent access to a single site look
like DoS attacks and lead to temporary bans on accessing
those sites. Differences in the number of triples and graphs
returned for URIs can be attributed to a couple of factors.
In the case of the London URI, where the difference is
dramatic (over 200,000 triples), this is because with
a smaller number of threads connections seem more likely
to time out, though we are unsure why this is. In the other
cases many of the graphs were from the same domain name
and the API used to retrieve the RDF had a bug in
connection management for multiple concurrent connections
to the same domain which caused connections to fail
unexpectedly, which is why a reduction in the amount of data is
observed as the number of threads increased.</p>
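      <p>The trivially parallel structure of the retrieval step can be
sketched with a standard thread pool; the fetch function below is a
placeholder, and the default of 8 workers reflects the optimum reported
above:</p>

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_rdf(uri):
    """Placeholder for dereferencing a URI and parsing the RDF returned."""
    return {"uri": uri, "triples": []}

def expand_parallel(uris, workers=8):
    """Retrieve linked data for many discovered URIs concurrently.
    The pool size is capped at 8 because, as reported above, more
    threads triggered erratic behaviour (HTTP API connection limits
    and apparent DoS bans from heavily accessed sites)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves the input order of the URIs in its results.
        return list(pool.map(fetch_rdf, uris))
```
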
    </sec>
    <sec id="sec-9">
      <title>4.2 Preservation</title>
      <p>4.2.1 BBC Programmes</p>
      <p>In order to test AAT properly it was used to monitor a
subset of the BBC Programmes13 dataset, which is a large
and constantly changing linked data set; this allowed both
for testing the scalability of AAT and for verifying that its
change detection algorithms worked as intended. The subset
used was all the brands (i.e. programmes) associated with
the service BBC1 (the BBC's main TV channel), since this
includes many brands which change regularly, such as soaps
and news broadcasts. Table 2 demonstrates the average
number of changes detected over just a short period.
13http://www.bbc.co.uk/programmes/developers</p>
      <p>As can be seen in Table 2, the BBC update
their dataset on a daily basis. The initial high number
of changes is due to starting from a base dataset that was
a couple of months old, owing to architectural changes made
to AAT to support the use of the expansion algorithm and
improve the efficiency of the system. The average number
of changes being 2 reflects the fact that the typical
update we see the BBC make to their data is that they add a
triple describing a newly broadcast episode of a programme
and update the value of the dc:modified triple. The
apparently high maximum of 25 changes is due to
one of the programme URIs failing to resolve, resulting in the
contents of that profile being considered missing, so the
change report for each day reports those triples as removed.
The relatively high number of profiles changing each day
is due to the fact that, as already stated, many of the
programmes associated with BBC 1 are broadcast daily, such
as soaps and news bulletins, and that the BBC publish data
about programmes several days before the programmes are
actually broadcast.</p>
      <p>
        To demonstrate the reuse of the data being harvested we
created a demonstration application: a simple web
based faceted browser which lets users browse through
information about recently shown BBC shows. Facets can be
used to filter by Genre and Channel, and the user can view
detailed information about both programmes and the
individual episodes. This application was presented as part of an
earlier prototype of AAT described in [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] and shown in
Figure 3. Like previous work by Papavassiliou et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] it shows
that simple information about basic triple-level changes in
RDF (additions, deletions, etc.) can be reprocessed into
useful applications for end users.
      </p>
      <p>4.2.2 Architecture &amp; Scalability</p>
      <p>AAT's architecture is constructed as shown in Figure 4;
as can be seen, it is decomposed into several
components which then rely on some external standalone
components: an RDF API and the expansion algorithm. AAT
is theoretically agnostic of its underlying storage, though in
practice differences in implementation between triple stores
mean only certain stores are currently viable for use as the
backing store. In the early prototyping stage an RDBMS
based store was used, which was sufficient for initial
prototyping but not scalable enough for real world testing, so the
usage of production grade triple stores was then adopted. Initially
it was intended to use the open source release of Virtuoso14
as the backing store, but it was found that Virtuoso did not
correctly preserve boolean typed literals, which created
issues in the internal processing of data within AAT. 4store15
was then used briefly, but it was found to be unable to
handle the heavy volume of parallel reads/writes which AAT
performs during its data processing, due to 4store's concurrency
model. Currently AAT runs against AllegroGraph16, since it
has demonstrated in testing the ability to handle the high
volumes of reads/writes necessary for using AAT on the large
dataset described in the preceding section.</p>
      <p>In terms of general scalability, the majority of algorithms
in AAT need to run on a single thread for each profile, but
it is trivial to process multiple profiles in parallel and this
is the approach currently taken. Since work can be divided
over multiple threads, it will also be possible to significantly
increase scalability by dividing the work over a cluster
of machines, which would allow much larger datasets to be
monitored efficiently.
14http://www.openlinksw.com/virtuoso
15http://4store.org
16http://www.franz.com/agraph/allegrograph/</p>
    </sec>
    <sec id="sec-10">
      <title>5. FUTURE WORK</title>
      <p>There are a number of things that could be done to
improve the expansion algorithm outlined in Section 3.1, with
regard both to making it more intelligent in how it retrieves
linked data and to conducting a detailed analysis of the data
returned. Manual inspection of the data shows that it does
appear to be relevant to the URI of interest, but we
propose that a full IR analysis of this is conducted in order to
statistically confirm this initial assessment. Additionally, as
was seen in Table 1, some types of URIs produced very little
linked data using the default expansion profile; a broader
analysis using domain specific profiles is necessary to
ascertain whether those URIs have low levels of interlinking
or whether the interlinking just uses domain specific links rather
than the generic owl:sameAs and rdfs:seeAlso links that
are followed by default.</p>
      <p>
        In terms of improving the intelligence of the algorithm: at
the moment it submits every URI to every SPARQL, lookup
and discovery endpoint described in the expansion profile; it
would improve the speed of the algorithm if it could apply some
decision making as to which endpoints a given URI should
be submitted to. Conversely, there is the possibility
that this would impact the effectiveness of the algorithm, so
it would be necessary to conduct experiments to determine
whether there is a trade-off between speed and accuracy. It is
also worth considering that searching on URIs is not the only
viable mechanism for finding additional linked data about
a URI of interest. Using terms extracted from the RDF,
such as the objects of rdfs:label or dc:title triples, would
provide a way to augment URI based lookup with term/text
based search results from semantic search engines. There are
already frameworks like Silk [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] which can be used to do
this and it would be useful to integrate the Silk framework
with the expansion algorithm.
      </p>
      <p>
        One limitation inherent in AAT is that it currently does
not do any kind of special handling of blank nodes, which
means that if data contains blank nodes AAT will
continuously think it has encountered new knowledge when most
likely it has not. For the data we have worked with so far
this is generally not an issue, since the linked data
community tends to avoid blank nodes, but if we are to provide for
preserving all kinds of RDF effectively then we need to
handle blank nodes properly. Solving this problem may involve
sub-graph matching and isomorphism checking to see if
the sections of the graph that contain blank nodes can be
mapped to the previously seen sections of the graph, as in
Tummarello et al.'s RDFSync [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. The blank nodes
themselves could either be left as-is or they could be translated
to URIs, as done by systems like the Talis17 platform.
      </p>
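      <p>One lightweight approach to this problem is to compare blank nodes by
a signature derived from their non-blank neighbourhood rather than by
label; this is a rough sketch of that idea only (real systems such as
RDFSync use more robust graph decomposition), with graphs modelled as
sets of string triples:</p>

```python
def bnode_signature(graph, bnode):
    """Signature of a blank node: the sorted (predicate, object) and
    (subject, predicate) pairs that involve it, ignoring its label.
    Blank nodes with equal signatures are treated as the same node."""
    out_edges = sorted((p, o) for (s, p, o) in graph
                       if s == bnode and not o.startswith("_:"))
    in_edges = sorted((s, p) for (s, p, o) in graph
                      if o == bnode and not s.startswith("_:"))
    return (tuple(out_edges), tuple(in_edges))

def same_bnode(g1, b1, g2, b2):
    """Heuristic: blank nodes match if their signatures agree, so a
    relabelled blank node is not mistaken for new knowledge."""
    return bnode_signature(g1, b1) == bnode_signature(g2, b2)
```
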
      <p>Given that this work was inspired by traditional link
integrity techniques from hypermedia, it is interesting to note
that it has the potential to be applied back to the
document web, since there is increasing cross-over between the
document and data webs, primarily due to the increasing
uptake of RDFa. As increasing numbers of documents embed
structured data using RDFa, it will become possible to
preserve and monitor the structured information embedded in
ordinary web pages in the same way as can be done with
linked data now; we therefore envisage this having
applications in the automated monitoring and maintenance of
document based websites.</p>
      <p>
        As mentioned in Section 3.2.1, a lightweight schema is used
by AAT to annotate and store the data, but there are
alternative vocabularies that could have been used, such as the
Provenance Vocabulary [
        <xref ref-type="bibr" rid="ref13">13</xref>
          ] and the Open Provenance Model
[
        <xref ref-type="bibr" rid="ref18">18</xref>
          ]. It would be a fairly easy and potentially useful
enhancement to map the AAT schema to these vocabularies
so that the data could be retrieved in the desired form by
users/client applications designed to work with those
formats.
      </p>
    </sec>
    <sec id="sec-11">
      <title>6. CONCLUSION</title>
      <p>In this work we have introduced a simple but powerful
expansion algorithm which can be used to retrieve linked
data about a URI even when that URI is not resolvable.
This provides an important tool for preserving data in the
Semantic Web and recovering from data loss, and shows that
in the Semantic Web links themselves can be exploited as
a means to recover from broken links. As we have outlined
in Sections 4.1 and 5, there is a need to conduct a detailed
analysis of the algorithm to assess its efficacy for a wider
variety of URIs and using domain specific expansion profiles.
Depending on the results of this analysis, the algorithm may
need to be further refined to improve both its speed and
accuracy.</p>
      <p>We have also presented the All About That (AAT)
system, which allows users to monitor and preserve linked data
they are interested in, using the expansion algorithm as the
primary retrieval method for deciding which linked data to
preserve based on a starting URI. As we demonstrated in
Section 4.2.1, we envisage the usage of such a system as a
base on which to build rich Semantic Web applications that
can take the changing data and present it in interesting and
useful ways to end users. It also fulfils a role in the overall
goal of our research, which is to provide a suite of algorithms
and systems which can be used to manage both data and
link integrity on the Semantic Web.</p>
      <p>As has been discussed in Section 5, there are some
limitations in the current versions of our algorithm and the AAT
system which we intend to investigate and address in the
future. It is clear that there is still a significant amount
of work to be done to create a comprehensive set of tools
such that they can be applied to as wide a variety of data
on the Semantic Web as possible, and the experience of past
research in link integrity for the document Web tells us that
there will be no perfect solution.</p>
      <p>Despite this, it is our belief that as the Semantic Web
grows, data and link integrity will be increasingly important
issues for users as their applications come to rely upon linked
data. There is a need to have systems in place such that data
can be preserved and accessed even if the original sources
are gone or unavailable. This has already been seen with
the release of services like the Sindice Cache API18, which is
used as one of the data sources in the default expansion
profile (see Section 3.1.1). Additionally, with the rising adoption of
RDFa embedded inside documents on the web, systems like
this become applicable to the preservation of the structured
data embedded in the document based Web, as discussed in
Section 5.
17http://www.talis.com/platform
18http://www.sindice.com/developers/cacheapi</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>Changeset protocol</article-title>
          ,
          <year>2007</year>
          . http://n2.talis.com/wiki/Changeset_Protocol.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>[2] Virtuoso sponger</article-title>
          .
          <source>Technical report, OpenLink Software</source>
          ,
          <year>2009</year>
          . http://virtuoso.openlinksw.com/ Whitepapers/html/VirtSpongerWhitePaper.html.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Alexander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hausenblas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <article-title>Describing linked datasets: On the design and usage of voiD, the 'vocabulary of interlinked datasets'</article-title>
          .
          <source>In Proceedings of the Linked Data on the Web Workshop (LDOW2009)</source>
          , Madrid, Spain,
          <year>April 2009</year>
          . http://ceur-ws.org/Vol-538/ldow2009_paper20.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ashman</surname>
          </string-name>
          .
          <article-title>Electronic document addressing: dealing with change</article-title>
          .
          <source>ACM Comput. Surv.</source>
          ,
          <volume>32</volume>
          (
          <issue>3</issue>
          ):
          <fpage>201</fpage>
          -
          <lpage>212</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cailliau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Luotonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. F.</given-names>
            <surname>Nielsen</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Secret</surname>
          </string-name>
          .
          <article-title>The world-wide web</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>37</volume>
          (
          <issue>8</issue>
          ):
          <fpage>76</fpage>
          -
          <lpage>82</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          .
          <article-title>How to publish linked data on the web</article-title>
          ,
          <year>2007</year>
          . http://sites.wiwiss. fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bouquet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Stoermer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Bazzanella</surname>
          </string-name>
          .
          <article-title>An entity name system (ens) for the semantic web</article-title>
          .
          <source>In 5th European Semantic Web Conference, ESWC</source>
          <year>2008</year>
          , volume
          <volume>5021</volume>
          , page 258. Springer,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] G. Cheng, W. Ge, and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qu</surname>
          </string-name>
          .
          <article-title>Falcons: searching and browsing entities on the semantic web</article-title>
          .
          <source>In WWW '08: Proceeding of the 17th international conference on World Wide Web</source>
          , pages
          <fpage>1101</fpage>
          -
          <lpage>1102</lpage>
          , New York, NY, USA,
          <year>2008</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Davis</surname>
          </string-name>
          .
          <article-title>Data Integrity Problems in an Open Hypermedia Link Service</article-title>
          .
          <source>PhD thesis</source>
          , University of Southampton,
          <year>November 1995</year>
          . http://eprints.ecs.soton.ac.uk/6597/.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H. C.</given-names>
            <surname>Davis</surname>
          </string-name>
          .
          <article-title>Hypertext link integrity</article-title>
          .
          <source>ACM Comput. Surv., page 28</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Fountain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Heath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. C.</given-names>
            <surname>Davis</surname>
          </string-name>
          .
          <article-title>Microcosm: an open model for hypermedia with dynamic linking</article-title>
          .
          <source>In Hypertext: concepts</source>
          ,
          <source>systems and applications</source>
          , pages
          <fpage>298</fpage>
          -
          <lpage>311</lpage>
          , New York, NY, USA,
          <year>1992</year>
          . Cambridge University Press.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Harrison</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Nelson</surname>
          </string-name>
          .
          <article-title>Just-in-time recovery of missing web pages</article-title>
          .
          <source>In HYPERTEXT '06: Proceedings of the seventeenth conference on Hypertext and hypermedia</source>
          , pages
          <fpage>145</fpage>
          -
          <lpage>156</lpage>
          , New York, NY, USA,
          <year>2006</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>O.</given-names>
            <surname>Hartig</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <article-title>Guide to the provenance vocabulary</article-title>
          ,
          <year>2009</year>
          . http://sourceforge.net/apps/mediawiki/trdf/ index.php?title=Provenance_Vocabulary.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>Haslhofer</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Popitsch</surname>
          </string-name>
          .
          <article-title>DSNotify - Detecting and Fixing Broken Links in Linked Data Sets</article-title>
          .
          <source>In Proceedings of 8th International Workshop on Web Semantics</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jaffri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glaser</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Millard</surname>
          </string-name>
          .
          <article-title>Managing uri synonymity to enable consistent reference on the semantic web</article-title>
          .
          <source>In IRSW2008 - Identity and Reference on the Semantic Web</source>
          <year>2008</year>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>F.</given-names>
            <surname>Kappe</surname>
          </string-name>
          .
          <article-title>A scalable architecture for maintaining referential integrity in distributed information systems</article-title>
          .
          <source>Journal of Universal Computer Science</source>
          ,
          <volume>1</volume>
          (
          <issue>2</issue>
          ):
          <fpage>84</fpage>
          -
          <lpage>104</lpage>
          ,
          <year>1995</year>
          . http://www.jucs.org/jucs_1_2/a_scalable_architecture_for.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>F.</given-names>
            <surname>Kappe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Andrews</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Faschingbauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaisbauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pichler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Schipflinger</surname>
          </string-name>
          .
          <article-title>Hyper-G: A new tool for distributed hypermedia</article-title>
          .
          <source>Institutes for Information Processing Graz</source>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>L.</given-names>
            <surname>Moreau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Freire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Futrelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>McGrath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Myers</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Paulson</surname>
          </string-name>
          .
          <article-title>The open provenance model</article-title>
          . December
          <year>2007</year>
          . http://eprints.ecs.soton.ac.uk/14979/.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>L.</given-names>
            <surname>Moreau</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Gray</surname>
          </string-name>
          .
          <article-title>A Community of Agents Maintaining Links in the World Wide Web (Preliminary Report)</article-title>
          .
          <source>In The Third International Conference and Exhibition on The Practical Application of Intelligent Agents and Multi-Agents</source>
          , pages
          <fpage>221</fpage>
          -
          <lpage>235</lpage>
          , London, UK, Mar.
          <year>1998</year>
          . http://www.ecs.soton.ac.uk/~lavm/papers/gcWWW.ps.gz.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>V.</given-names>
            <surname>Papavassiliou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Flouris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Fundulaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kotzinos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Christophides</surname>
          </string-name>
          .
          <article-title>On Detecting High-Level Changes in RDF/S KBs</article-title>
          .
          <source>In The Semantic Web: 8th International Semantic Web Conference (ISWC 2009)</source>
          , pages
          <fpage>473</fpage>
          -
          <lpage>488</lpage>
          . Springer,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Phelps</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Wilensky</surname>
          </string-name>
          .
          <article-title>Robust hyperlinks: Cheap, everywhere, now</article-title>
          .
          <source>In Digital Documents: Systems and Principles</source>
          , pages
          <fpage>514</fpage>
          -
          <lpage>549</lpage>
          . Springer,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>G.</given-names>
            <surname>Tummarello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Delbru</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Oren</surname>
          </string-name>
          .
          <article-title>Sindice.com: Weaving the Open Linked Data</article-title>
          .
          <source>In The Semantic Web: 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC</source>
          , pages
          <fpage>552</fpage>
          -
          <lpage>565</lpage>
          . Springer,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>G.</given-names>
            <surname>Tummarello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Morbidoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bachmann-Gmür</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O.</given-names>
            <surname>Erling</surname>
          </string-name>
          .
          <article-title>RDFSync: efficient remote synchronization of RDF models</article-title>
          .
          <source>In The Semantic Web: 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference</source>
          , ISWC 2007 + ASWC 2007, Busan, Korea, November 11-15,
          <year>2007</year>
          , Proceedings, pages
          <fpage>537</fpage>
          -
          <lpage>551</lpage>
          . Springer,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>L.</given-names>
            <surname>Veiga</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Ferreira</surname>
          </string-name>
          .
          <article-title>Repweb: replicated web with referential integrity</article-title>
          .
          <source>In SAC '03: Proceedings of the 2003 ACM symposium on Applied computing</source>
          , pages
          <fpage>1206</fpage>
          -
          <lpage>1211</lpage>
          , New York, NY, USA,
          <year>2003</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>L.</given-names>
            <surname>Veiga</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Ferreira</surname>
          </string-name>
          .
          <article-title>Turning the web into an effective knowledge repository</article-title>
          .
          <source>ICEIS 2004: Software Agents and Internet Computing</source>
          ,
          <volume>14</volume>
          (
          <issue>17</issue>
          ),
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>R.</given-names>
            <surname>Vesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hall</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Carr</surname>
          </string-name>
          .
          <article-title>All About That - a URI profiling tool for monitoring and preserving linked data</article-title>
          .
          <source>In ISWC 2009</source>
          , August
          <year>2009</year>
          . http://eprints.ecs.soton.ac.uk/17815.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J.</given-names>
            <surname>Volz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaedke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Kobilarov</surname>
          </string-name>
          .
          <article-title>Discovering and maintaining links on the web of data</article-title>
          . In A. Bernstein,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Karger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Feigenbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Maynard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Motta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Thirunarayan</surname>
          </string-name>
          , editors,
          <source>The Semantic Web - ISWC</source>
          <year>2009</year>
          , volume
          <volume>5823</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>650</fpage>
          -
          <lpage>665</lpage>
          . Springer,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J.</given-names>
            <surname>Volz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaedke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Kobilarov</surname>
          </string-name>
          .
          <article-title>Silk - a link discovery framework for the web of data</article-title>
          .
          <source>In 2nd Linked Data on the Web Workshop (LDOW2009)</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>