Preserving Linked Data on the Semantic Web by the application of Link Integrity techniques from Hypermedia

Rob Vesse, Wendy Hall, Leslie Carr
Intelligence, Agents & Multimedia Group
School of Electronics & Computer Science
University of Southampton, Southampton SO17 1BJ
{rav08r,wh,lac}@ecs.soton.ac.uk

ABSTRACT

As the Web of Linked Data expands it will become increasingly important to preserve data and links such that the data remains useful. In this work we present a method for locating linked data to preserve which functions even when the URI the user wishes to preserve does not resolve (i.e. is broken/not RDF) and an application for monitoring and preserving the data. This work is based upon the principle of adapting ideas from hypermedia link integrity in order to apply them to the Semantic Web.

General Terms: Experimentation, Reliability, Design

Keywords: Semantic Web, Linked Data, Link Integrity, Preservation

Copyright is held by the author/owner(s). WWW2010, April 26-30, 2010, Raleigh, North Carolina.

1. INTRODUCTION

The Web of Linked Data is characterised by the interlinking between disparate heterogeneous data sources and by the fact that the links between the data sources are one of the primary mechanisms for navigating through this data space. Since links are essential to the Web of Linked Data we believe that it is important to have mechanisms in place to maintain link integrity. The aim of link integrity is to ensure that a link works correctly: traversing the link takes you to a resource and, as far as possible, the resource is the one intended by the provider of the link. On a larger scale, link integrity deals with the overall integrity of interlinked datasets such as documents within a Content Management System (CMS) or the linked datasets available on the Semantic Web. Link integrity is therefore one way of ensuring data integrity within the overall system, which in our use case is a collection of linked datasets.

Link integrity is an existing and well known problem from hypermedia, where there were two problems to be dealt with: dangling links and the editing problem. Dangling links are the better known problem and are regularly experienced by users on the Web when they are presented with an HTTP error because the link they followed pointed to a resource which cannot be retrieved. The editing problem refers to the situation in which the content at the end of the link is changed so that it is no longer what the creator of the link intended to link to. Both these issues affect the Semantic Web since on the Semantic Web everything is interlinked data, and data is therefore immediately susceptible to dangling links. The editing problem becomes much more problematic on the Semantic Web since anyone can make a statement about anything, so the meaning of things on the Semantic Web is subject to semantic drift (see the discussion on the W3C Semantic Web Mailing List for an example of this: http://lists.w3.org/Archives/Public/semantic-web/2009May/0315.html).

Due to the interlinking between data on the Semantic Web we show that it is possible to exploit the data model such that links themselves can be used to recover missing data in the event of a dangling link being encountered. This provides a means to retrieve the data that was/may have been at a given URI even if that URI is no longer resolvable. Using this approach for locating data about a URI we are able to preserve and monitor data about a URI from multiple sources and to recover data about URIs that are no longer functioning, as described in Section 3.1.

The Semantic Web also introduces two additional link integrity problems specific to linked data. The first of these is URI identity and meaning: what does a URI mean, and does this meaning actually matter to the applications that use it and the data that contains it? This is very much an open research debate and beyond the scope of our current work. The second is the co-reference problem, which refers to the situation in which some 'thing' we wish to make statements about has multiple URIs that could be used for it. In this work we utilise existing work in this area of research as part of our algorithm for preserving linked data.
Section 2 of this paper covers the related work in link integrity for hypermedia and the Semantic Web, from which we draw ideas in Section 3 to design and develop our algorithm and the AAT software. Section 5 outlines our plans for future research in this area and we conclude in Section 6 by discussing the potential benefits of link integrity for the Semantic Web.

2. RELATED WORK

Link integrity in hypermedia first received serious attention in the late 1980s and early 1990s, primarily from researchers in the open hypermedia community. Systems like Microcosm [11] and Hyper-G [17] were among the first to consider the issue in depth; Davis's thesis [9] and Kappe's 1995 paper [16] provide examples of link integrity in open hypermedia. The widespread growth of the World Wide Web [5] in the mid-1990s led to some new research, but as search engines became commonplace towards the end of the decade research interest dwindled. It was perceived that users did not care sufficiently to warrant research into the problem as they could locate missing resources effectively using search engines; in addition the scale of the Web by that time was simply too vast for many proposed solutions to handle. Davis's survey [10] provides a good overview of the state of this research as of the end of the 1990s. Another reason for the decline in research was that the fact that links could fail was one of the reasons the Web was able to expand as fast as it did: it didn't matter if links failed and produced the familiar HTTP 404 error, so users were able to publish content without worrying about whether their links to external content were valid.
Ashman's 2000 paper [4], which discusses link integrity with particular reference to electronic document archives, provides both a useful survey of existing work and a key motivation for ongoing research. As more document collections were translated into digital forms and placed onto intranets, people once again started to be concerned about link integrity. Users wanted assurances that links into the document archives would work consistently and ideally that links out of the archives would work correctly as well, since it may not be possible to alter the archived documents without invalidating the integrity of the archive.

In this vein Veiga and Ferreira [24, 25] discuss the possibility of turning the Web into an effective knowledge repository by use of replication and versioning. Their work follows on from earlier work such as Moreau & Gray's [19], which proposed limited use of replication and versioning but relied significantly on author and user involvement in the process. In Veiga & Ferreira's work there is no requirement for author involvement; the end user need only use a browser plugin to indicate the content they wish to replicate and preserve. Their results showed that the user could preserve the sections of the Web they were interested in with no perceivable performance impact - on average there was only a 12ms increase in retrieval time for resources. In Section 3 we discuss using an approach of this kind for the Semantic Web.

Phelps & Wilensky introduced the concept of lexical signatures for Web pages in their Robust Hyperlinks paper [21]. They compute the lexical signature of a page and append it to all links to that page so that, in the event of the link failing, a browser plugin can use the signature to relocate the page using a search engine. The obvious flaw in their work was that it required rewriting all the links on the Web, but Harrison & Nelson later showed that these signatures need only be computed Just-in-Time (JIT) when a link fails [12]. In their Opal system the signatures can be computed JIT by retrieving cached copies of the pages from a search engine cache, computing the signature and then using search engines to relocate the page. As discussed in Section 3.1 a JIT style approach can be effectively used to recover linked data about a URI.

2.1 Semantic Web Research

Unlike the traditional Web it is not possible for semantic search engines like Sindice [22] and Falcons [8] to fulfil the same role as document search engines, because the users in the Semantic Web domain are typically client applications rather than humans. When a human encounters a dead link they usually navigate to a search engine and enter an appropriate search phrase to find alternative sources of information. A client application encountering a dead link will typically have no concept of how or where to find alternative sources of information, and URIs for linked data are not always ideal to search upon compared to textual search for documents. It should be noted that, as with the existing Web, if the Web of Linked Data is to undergo a massive expansion in the same way then things must be allowed to fail - but this does not mean we shouldn't attempt to mitigate the problem as far as possible.
In terms of the Semantic Web there has been research into the versioning and synchronisation of RDF data which is relevant to aspects of our work, such as Tummarello et al's RDFSync [23], an algorithm for efficiently synchronising changes in RDF between multiple machines. This shows that change detection in RDF is non-trivial due to the inherent data isomorphism caused by the use of blank nodes, but also shows that it can be achieved in an efficient manner. More recent research from Papavassiliou et al [20] has shown that information about very basic changes in the RDF - such as that provided by systems like RDFSync or All About That (see Section 3.2) - can be used to build applications which provide useful information to end users. In the case of Papavassiliou et al's paper they built a system which furnished users with high level descriptions of how RDFS vocabularies have changed in order to aid users in working with such vocabularies. In addition there are systems like the Talis Platform (http://www.talis.com/platform), a Semantic Web store that implements a versioning mechanism whereby updates can be made via a Changeset protocol [1]. As part of this protocol they utilise a useful lightweight vocabulary for publishing changes in RDF data as RDF which, as discussed in Section 3.2.3, we reuse in our own system.

Regarding Semantic Web specific link integrity problems, the research has largely focused on the co-reference problem. Since there are many organisations publishing similar data semantically (bibliographic databases being a prime example) there are frequently many URIs for a single entity such as an author. Co-reference research aims to develop ways to efficiently and accurately determine URI equivalences and to refactor the data or republish this information to help other Semantic Web applications. There are several competing philosophies, ranging from the Okkam approach described by Bouquet et al [7], which advocates universally agreed URIs for each entity, to the Co-reference Resolution Service (CRS) approach of Jaffri et al [15], which determines co-referent URIs and republishes the information in dedicated triple stores. The CRS approach taken by the ReSIST project (http://www.resist-noe.org/) within the RKB Explorer (http://www.rkbexplorer.com) application has potential for use in link integrity, as the information provided by a CRS could be utilised in a JIT fashion as in Harrison & Nelson's work; we demonstrate how this can be done in Sections 3 and 4.

In terms of link maintenance for the Semantic Web there has been some research in the form of the Silk framework by Volz et al [28], a framework for computing links between different datasets. Their approach allows users to stipulate arbitrarily complex matching criteria for entity matching between datasets; the links produced from this can then be published via a CRS style service or added to the relevant datasets. As proposed in their later paper [27] this can be used as part of a link maintenance strategy; the possibility of combining this with our approach is discussed in Section 5. In a similar vein Haslhofer and Popitsch's DSNotify system [14] can monitor linked resources and inform the application when links are no longer valid, using feature based similarity metrics like the Silk framework.

3. METHOD

As we have discussed, it is not realistic to maintain link integrity in a pre-emptive way since such solutions have consistently been shown in previous work not to scale to Web scale. Therefore the focus must be on recovery in the event of failure and on preservation to guard against the loss of data which is considered interesting/useful to end users. As the amount of data in the Web of Linked Data starts to expand massively - particularly with linked data being adopted by an increasing number of major organisations - we expect that, as with the early document web, there will be an increasing amount of content published by both big companies and individuals. Just like the document web, this explosion of content will most likely include much content that is poorly maintained and will lead to increasing numbers of broken links. We have two connected goals in this work: 1) to provide a means to retrieve resource descriptions in the form of linked data about a URI even when the URI is non-functional, and 2) to provide the means for an end user to preserve and version these descriptions. To attempt to solve this problem we present an expansion algorithm for retrieving linked data about a URI even if that URI itself has failed in Section 3.1, and a preservation system built using this algorithm in Section 3.2.
3.1 Expansion Algorithm

Since the goal of this work is to preserve linked data it was deemed essential that, as far as possible, we leverage existing linked data technologies and services to effect this preservation. To this end we designed a relatively simple algorithm which uses simple crawling techniques directed by a user definable expansion profile (see Definition 1). Our aim with this algorithm is to provide resource descriptions of a URI regardless of whether the URI itself is dereferenceable.

Even in the case where a URI is used only as an identifier in the description of another resource and is not itself dereferenceable, it is likely that we can still retrieve some data about it. The fact that a URI is minted only as an identifier, and that the person/organisation minting the URI does not provide the means to dereference it, does not affect our ability to find data about it provided that the identifier is used elsewhere, i.e. it is reused as part of linked data.

Definition 1. An expansion profile is a Vocabulary of Interlinked Datasets (VoID) description of a set of datasets and linksets that should be used to locate linked data about the URI of interest. The VoID description may optionally be annotated with additional properties which affect the behaviour of the algorithm.

Drawing on the ideas described in Alexander et al's Vocabulary of Interlinked Datasets (VoID) paper [3] about the way VoID can be used to direct crawlers, we decided to use VoID as the primary means of expressing an expansion profile. We introduce a couple of additional predicates, since we require the means to allow end users to specify some basic characteristics of how the algorithm should behave and there is a type of service we need to express which is not contained in the VoID ontology. VoID has concepts of Datasets and Linksets: the former represent a set of data which may have SPARQL endpoint(s) and/or URI lookup endpoint(s), while the latter represent the types of interlinkings between datasets. What VoID does not have a means to express is the location of a service, provided by a dataset, which allows an application to retrieve URIs considered equivalent to a given URI - this we term a URI discovery endpoint (see Definition 2). A discovery endpoint differs from a lookup endpoint in that the latter is expected to return everything the dataset knows about the given URI as opposed to only returning equivalent URIs. Examples of existing discovery endpoints on the Semantic Web include RKBExplorer's CRSes [15] and sameAs.org (http://www.sameas.org). Another key difference between a lookup and a discovery endpoint is that links discovered from a discovery endpoint are considered to be on the same level of the crawl for the purposes of the algorithm, i.e. they do not have increased depth relative to the URI that discovery is performed upon. By this we mean that the execution of the algorithm results in a breadth-first, depth-limited linked data crawl starting from a given URI - in this tree structure a discovery endpoint introduces sibling nodes for a URI while a lookup endpoint introduces child nodes for a URI.

Our other extensions to VoID allow individual datasets/linksets to be marked as ignored (the algorithm will not use them) and allow the user to define to what depth the algorithm should crawl (this defaults to 1). These extensions are defined as part of the AAT schema detailed in Section 3.2.1.

Definition 2. A URI discovery endpoint is an endpoint that, when passed a URI, returns a Graph containing equivalent URIs of the input URI, typically in the form of owl:sameAs links.
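To make the shape of an expansion profile concrete, the following is a minimal illustrative sketch in Turtle. The void: terms are standard VoID; the aat: namespace and the property names aat:maxDepth, aat:uriDiscoveryEndpoint and aat:ignore are placeholders of our choosing, since the exact terms of the AAT schema are not reproduced in this paper.

    @prefix void: <http://rdfs.org/ns/void#> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix aat:  <http://example.org/aat#> .              # placeholder namespace

    <#profile> aat:maxDepth 1 .                            # crawl depth (placeholder term)

    <#dbpedia> a void:Dataset ;
        void:sparqlEndpoint <http://dbpedia.org/sparql> .

    <#sameas> a void:Dataset ;
        aat:uriDiscoveryEndpoint <http://www.sameas.org/> .  # see Definition 2 (placeholder term)

    <#people> a void:Linkset ;
        void:linkPredicate foaf:knows .                    # treat foaf:knows as a link to follow

    <#noisy> a void:Dataset ;
        aat:ignore true .                                  # marked ignored (placeholder term)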
As already stated, the actual algorithm is a simple crawler which uses the input expansion profile as a guide to which potential sources of linked data it should use to try and find data about the URI of interest - this procedure is detailed in Algorithm 1. Note that the algorithm does not terminate in the event of an error retrieving data from a particular URI/endpoint but simply continues; by doing this it is still possible to retrieve some data even if the starting URI does not return a valid response. The algorithm will continue to issue queries about the URI to the various endpoints described in the given expansion profile, so unless the URI refers to a document that has very poor linkages or was not indexed by the semantic search services used, some RDF will be returned. This approach has similarities to the JIT style approach of Harrison & Nelson [12] in that there need not be any foreknowledge of the URIs you wish to recover data about when you discover they are broken, since by utilising the caches and lookup services of relevant datasets it is still possible to recover data about the URI.

The basic behaviour of the algorithm is only to follow owl:sameAs and rdfs:seeAlso links, but the end user can specify that any predicate be treated as a link to follow by specifying an appropriate VoID linkset in their expansion profile.

Algorithm 1 Expansion Algorithm
Require: URI, Expansion Profile
 1: ToExpand as a set of pairs of URIs and Depths
 2: while ToExpand ≠ ∅ do
 3:   Remove first pair from ToExpand
 4:   if Graph with URI is already in the Dataset then
 5:     Continue
 6:   if Depth > Max Depth then
 7:     Continue
 8:   Retrieve the Graph at the URI
 9:   Add the Graph to the Dataset
10:   for all Triples in Graph do
11:     if Triple is a Link then
12:       Add a new pair to ToExpand
13:   for all Datasets in Expansion Profile do
14:     if Dataset has a SPARQL Endpoint then
15:       Issue a DESCRIBE for the URI against the Endpoint
16:       Add resulting Graph to the Dataset
17:       Process the Graph for additional Links
18:     if Dataset has a Lookup Endpoint then
19:       Issue a Lookup for the URI against the Endpoint
20:       Add resulting Graph to the Dataset
21:       Process the Graph for additional Links
22:     if Dataset has a Discovery Endpoint then
23:       Issue a Discovery for the URI against the Endpoint
24:       for all Equivalent URIs do
25:         Add a new pair to ToExpand
26: return Dataset
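For readers who prefer an executable form, the following Python sketch (using rdflib and SPARQLWrapper; our own implementation is not written in Python) condenses Algorithm 1. Lookup and discovery endpoints are modelled as user-supplied callables since there is no single standard protocol for them; all names here are illustrative rather than the AAT API.

    from collections import deque
    from rdflib import Graph, URIRef
    from rdflib.namespace import OWL, RDFS
    from SPARQLWrapper import SPARQLWrapper

    LINKS = {OWL.sameAs, RDFS.seeAlso}  # default link predicates

    def describe(endpoint, uri):
        """Lines 14-15: issue a SPARQL DESCRIBE for the URI against an endpoint."""
        s = SPARQLWrapper(endpoint)
        s.setQuery("DESCRIBE <%s>" % uri)
        return s.query().convert()  # DESCRIBE results parse into an rdflib Graph

    def expand(start_uri, profile, max_depth=1):
        """Breadth-first depth-limited crawl. `profile` is a list of dicts with
        optional 'sparql' (endpoint URL) and 'lookup'/'discover' (callables)."""
        dataset = {}  # graph URI -> rdflib Graph
        to_expand = deque([(URIRef(start_uri), 0)])

        def follow(graph, depth):
            # Lines 10-12, 17, 21: queue the targets of link triples as children
            for _, p, o in graph:
                if p in LINKS and isinstance(o, URIRef):
                    to_expand.append((o, depth + 1))

        while to_expand:
            uri, depth = to_expand.popleft()
            if uri in dataset or depth > max_depth:
                continue  # lines 4-7
            graph = Graph()
            try:
                graph.parse(uri)  # line 8: dereference; failure is non-fatal
            except Exception:
                pass  # the crawl continues even when retrieval fails
            dataset[uri] = graph
            follow(graph, depth)
            for ds in profile:  # line 13
                try:
                    if "sparql" in ds:
                        g = describe(ds["sparql"], uri)
                        graph += g; follow(g, depth)
                    if "lookup" in ds:  # lines 18-21: callable returns a Graph
                        g = ds["lookup"](uri)
                        graph += g; follow(g, depth)
                    if "discover" in ds:  # lines 22-25: equivalents are siblings
                        for equiv in ds["discover"](uri):
                            to_expand.append((URIRef(equiv), depth))
                except Exception:
                    pass  # endpoint errors never terminate the algorithm
        return dataset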
There are already some existing systems which work in a similar way to our algorithm, such as the sponger middleware used in Virtuoso [2]. The main difference between our algorithm and algorithms such as those in the Virtuoso sponger is that our algorithm is only interested in linked data and does not infer/create any additional data. Unlike the Virtuoso sponger it does not attempt to turn non-linked data into RDF and it does not do any inference over the data it returns; it is designed only to find and return (in the form of an RDF dataset) linked data about the URI of interest. Yet, as expansion profiles may reference any datasets and associated endpoints they wish, there is no reason why a user could not direct our algorithm to utilise a service like URIBurner (http://www.uriburner), which uses the Virtuoso sponger, in order to get the benefits of the additional inferred data.

3.1.1 Default Profile

Since the end user of such an algorithm may not always know where to look for linked data about the URI they are interested in, the algorithm has a default expansion profile which is used when no profile is specified. This profile uses 3 data sources which are in our opinion important hubs of the Web of Linked Data:

• DBPedia (http://dbpedia.org) - the DBPedia SPARQL endpoint is used to look up URIs

• Sindice (http://www.sindice.com) Cache - the Sindice Cache API (http://www.sindice.com/developers/cacheapi) allows the retrieval of Sindice's cached copy of the RDF from a URI

• SameAs.org (http://www.sameas.org) - SameAs.org provides a URI discovery endpoint (see Section 3.1 and Definition 2) which can be used to find URIs which are equivalent to a given URI

The default profile (http://www.dotnetrdf.org/expander/defaultProfile) has a max expansion depth of 1, which means it only considers URIs which are immediate neighbours of the starting URI.

In the case where the end user does know which linked data sources will have useful information about the URI, they can specify their own expansion profile which is used instead of the default profile. In this case the algorithm will use the datasets and linksets they define in the profile to discover linked data about the URI of interest; for example, if attempting to recover data about a person it may be useful to follow foaf:knows links.
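Tying this to the Python sketch above, either kind of profile reduces to the same structure; a default-profile-like invocation might look as follows, where sindice_lookup and sameas_discover stand for user-written wrappers around the respective services and are purely illustrative.

    profile = [
        {"sparql": "http://dbpedia.org/sparql"},   # DBPedia lookup via DESCRIBE
        {"lookup": sindice_lookup},                # hypothetical Sindice Cache wrapper
        {"discover": sameas_discover},             # hypothetical sameAs.org wrapper
    ]
    dataset = expand("http://dbpedia.org/resource/Southampton", profile, max_depth=1)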
3.2 Preservation

The preservation approach taken is to allow the end user to monitor and preserve a set of linked data that they are interested in. The data is preserved not at the data source but rather at a local level on the user's server, with the user able to republish this data as they desire. This is in line with the ideas of Veiga & Ferreira [25] in that the end user specifies the parts of the Web they want to preserve and the software then takes care of this. The data must be preserved in such a way that the original data can be efficiently extracted from it and sufficient information is kept to provide versioning over the data.

In the Semantic Web domain the objects of interest are URIs, so we propose that a profile of a URI be preserved (see Definition 3). Since the data being processed is RDF it is logically divided into triples, which can be preserved and monitored individually. It is deemed necessary to store information pertaining to the temporality and provenance of each triple: when it was first seen, when it was last updated, its source URI(s) and whether it has changed or been retracted/deleted from the RDF.

Definition 3. A URI's profile is the transformed and annotated form of the linked data retrievable about a given URI, such that the temporality and provenance of the triples contained therein are inferable from the profile.

In terms of user interface the system should allow a user to view a profile both in the stored form and in its original form. The system must monitor the original data source over time, updating the profiles as necessary, such that it can provide a report of changes in the data to the user. Since a URI profile will contain versioning information the interface should allow a user to view a particular version of the profile.

3.2.1 Schema

As the first stage of implementation an RDF Schema for All About That (AAT) is defined (available at http://www.dotnetrdf.org/AllAboutThat/) which embodies classes and properties allowing the description and annotation of triples in such a way that the required information discussed in the preceding proposal can be stored for each triple. The schema defines a class for representing profiles called aat:Profile and uses the rdf:Statement class to represent triples. rdf:Statement is used as the basis of triple storage as it makes it possible for non-AAT aware tools to extract the original triples from the profile easily. A number of properties are defined which store metadata about the profile itself, such as created & updated dates, source URI and a locally unique identifier for the profile. Similar properties are defined for triples which allow the first and last asserted dates, source URI and change status of a triple to be indicated. A key distinction in the schema is between aat:profileSource and aat:source: despite storing equivalent data, two predicates are created since the former expresses the URI which is the starting point for the profile while the latter expresses all the URIs at which a given triple is asserted.

While there were alternative schemas and vocabularies available that could potentially have been used to store the required data, the motivation behind designing our own schema was to provide a lightweight schema that attached all data to a single subject for ease of processing. Alternatives such as the Provenance Vocabulary by Hartig & Zhao [13] are far more expressive but potentially require introducing multiple intermediate blank nodes, which would significantly complicate the processing needed to implement many of the core features of AAT. Similarly the Open Provenance Model as described by Moreau et al [18] is highly expressive, but like Hartig & Zhao's vocabulary its RDF serialization is overly complex for use in AAT. As discussed in Section 5 there is no reason why the data contained in AAT could not be exposed in other provenance vocabularies, but for AAT's processing and storage a lightweight vocabulary is preferable.

Figure 1: Original Triple

Figure 2: Triple transformed to AAT Annotated Form
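The figures themselves are not reproduced in this text, but based on the schema description above they would resemble the following Turtle sketch; aat:firstAsserted and aat:lastAsserted are illustrative stand-ins for the schema's actual date properties, and the triple itself is invented.

    @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
    @prefix aat:  <http://example.org/aat#> .                      # placeholder namespace

    # Figure 1: an original triple
    <http://example.org/A> foaf:name "Example" .

    # Figure 2: the same triple in AAT annotated form
    _:t a rdf:Statement ;
        rdf:subject   <http://example.org/A> ;
        rdf:predicate foaf:name ;
        rdf:object    "Example" ;
        aat:source    <http://example.org/A> ;                     # URI(s) asserting the triple
        aat:firstAsserted "2010-02-12T09:00:00Z"^^xsd:dateTime ;   # illustrative term
        aat:lastAsserted  "2010-02-18T09:00:00Z"^^xsd:dateTime .   # illustrative term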
The use of reification was chosen over the use of named graphs primarily due to the need to make annotations at the level of individual triples rather than at the graph level; its usage is also motivated by the fact that the mechanism provides a clear and obvious schema for encoding a triple and adding additional annotations to it. While reification may significantly increase the size of the data being stored initially, over time this balances out compared to named graphs, where it is necessary either to store many copies of the same graph or to store multiple named graphs representing a series of deltas to the original data. The other difficulty inherent in the named graphs approach is that the annotations would then typically be held separately in other named graphs, which adds to the complexity of the data processing. Nevertheless, named graphs are used within AAT since each profile naturally forms a named graph, and AAT generates several related named graphs about each profile detailing change history and changesets as described in Section 3.2.3.

Since the user needs both to browse the data they are preserving and potentially to republish it, a Web based interface was designed as the primary interaction mechanism. The interface allows users to explore the data by first selecting a profile to view and then allowing them to view profile contents, exports, versions and change reports. A user may also use the interface to add new URIs they wish to monitor to the system and to initiate updates to profiles (see Definition 4). Following linked data best practices [6], and to provide the ability for the user to republish their preserved data, multiple dereferenceable URIs for each profile are created and made accessible through the Web interface. These allow the retrieval of the profile contents (which consist of all the triples ever retrieved from the profile URI, in the transformed form), the export of the profile (see Definition 5) and various meta graphs about a profile, e.g. change history and changesets. This means that the profile of a URI has a URI and thus could itself be profiled if desired.

Definition 4. An update of a profile occurs when AAT uses the Expansion algorithm to retrieve RDF about the given URI. The triples contained are compared with the triples currently in the profile and the profile is updated accordingly.

Definition 5. The export of a profile is the recreation of the RDF in its original form based upon the current contents of the profile. An export represents the RDF as it was last seen by AAT.

3.2.2 Profile Creation & Update

To create a URI's profile, linked data about the URI is first retrieved using the expansion algorithm presented in Section 3.1; then, using the AAT schema, each triple can be transformed into a set of triples which represent an annotation of the original triple. For each triple in the original RDF a blank node is created which is then used as the subject of a set of triples representing the required information about the original triple. Figure 1 shows an example triple and Figure 2 shows it transformed into the AAT form. A URI's profile consists of a set of transformed triples, where each profile is a named graph in the underlying store.
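A sketch of this transformation in Python with rdflib follows; as before, the aat: namespace and the date property names are illustrative rather than the schema's actual terms.

    from datetime import datetime, timezone
    from rdflib import Graph, BNode, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    AAT = Namespace("http://example.org/aat#")  # placeholder namespace

    def to_profile(source_uri, graph):
        """Transform each triple of `graph` into its AAT annotated form."""
        profile, now = Graph(), Literal(datetime.now(timezone.utc))
        for s, p, o in graph:
            node = BNode()  # one annotation node per original triple
            profile.add((node, RDF.type, RDF.Statement))
            profile.add((node, RDF.subject, s))
            profile.add((node, RDF.predicate, p))
            profile.add((node, RDF.object, o))
            profile.add((node, AAT.source, URIRef(source_uri)))
            profile.add((node, AAT.firstAsserted, now))  # illustrative term
            profile.add((node, AAT.lastAsserted, now))   # illustrative term
        return profile

An export (Definition 5) simply reverses this: for every rdf:Statement node, its rdf:subject, rdf:predicate and rdf:object values are reassembled into an ordinary triple.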
3.2.3 Change Reporting

A key feature of AAT is the ability to generate change reports about how the RDF at the profiled URI has changed over time. To do this a number of relatively simple computations over the annotated triples can be made, based primarily on the first and last asserted dates of the triples. In creating change reports four different types of change in the RDF are looked for (see Definitions 6-9). A distinction is made between missing knowledge and retracted or deleted knowledge, as it is possible for triples to be only temporarily non-present in the RDF. For example, in the event of a transient network issue making some/all of the relevant URIs unretrievable, the updated date for the profile will still be updated, leaving all the triples in the profile appearing to be missing. The length of time we require triples to be missing before we consider them deleted is currently set to 7 days for our monitoring of the BBC dataset described in Section 4.2.1; this time period is a domain specific parameter that can be adjusted depending on the data being monitored.

Definition 6. New knowledge is any triple that is new to the RDF at the profiled URI.

Definition 7. Changed knowledge is any triple where the object of the triple has changed. Only triples where the predicate has a cardinality of 1 can be considered to change.

Definition 8. Missing knowledge is any triple no longer found in the RDF at the profiled URI but which was recently seen in the RDF.

Definition 9. Retracted or deleted knowledge is any triple no longer found in the RDF at the profiled URI which has not been seen for a reasonable length of time.

In regard to the concept of changed knowledge, consider some arbitrary predicates ex:one and ex:many which have cardinalities of 1 and unrestricted respectively. Since ex:one has a cardinality of 1, it can be said that whenever the object of that triple has changed it is changed knowledge. The same cannot be said for ex:many triples: as the predicate has unrestricted cardinality, each triple using this predicate must be treated as a unique entity, i.e. one instance of a triple using this predicate cannot be considered to replace another. In the examples below (where the objects <B> and <D> are arbitrary) the fact that <A> was related to <C> via the predicate ex:many in Example 1 and is now instead related to <E> in Example 2 doesn't mean <A> is related to <E> instead of <C>; it just means <A> no longer considers itself related to <C>. The fact that <A> is related to <E> is new knowledge while the fact that it was related to <C> is missing/deleted knowledge; but if the value of the ex:one relationship had changed then that would be considered changed knowledge.

Example 1 Original Graph
<A> ex:one <B> .
<A> ex:many <C> .
<A> ex:many <D> .

Example 2 Modified Graph
<A> ex:one <B> .
<A> ex:many <D> .
<A> ex:many <E> .

When a change report is computed it is itself serialized into an RDF graph using the Talis Changeset ontology [1], which is stored as a named graph in the underlying store and republished via the web interface. Each Changeset generated links back to the previous Changeset (if one exists) such that an end user/client application consuming the data can follow the history of changes; a special URI which retrieves the most recent Changeset is provided so that users have a starting point for this. Separate to Changesets, a named graph containing a history for each profile is also stored which links to all the relevant Changesets for a profile.
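The classification described by Definitions 6-9 can be sketched as a simple set computation over the previous and current triples, parameterised by the set of cardinality-1 predicates and the deletion grace period; this is an illustration of the definitions, not AAT's actual code.

    from datetime import timedelta

    def classify(previous, current, last_seen, now, functional, grace=timedelta(days=7)):
        """previous/current: sets of (s, p, o) tuples; last_seen maps a triple to
        the datetime it was last asserted; functional holds cardinality-1 predicates."""
        new, changed, missing, deleted = set(), set(), set(), set()
        for s, p, o in current - previous:
            if p in functional and any(t[0] == s and t[1] == p for t in previous):
                changed.add((s, p, o))   # Definition 7: cardinality-1 object replaced
            else:
                new.add((s, p, o))       # Definition 6: new knowledge
        for s, p, o in previous - current:
            if p in functional and any(t[0] == s and t[1] == p for t in current):
                continue                 # already reported as changed knowledge
            if now - last_seen[(s, p, o)] > grace:
                deleted.add((s, p, o))   # Definition 9: retracted/deleted knowledge
            else:
                missing.add((s, p, o))   # Definition 8: possibly transient absence
        return new, changed, missing, deleted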
4. RESULTS

4.1 Expansion

To test the expansion algorithm we took a small sample of URIs which included the URIs of the authors, places associated with the authors, and TV programmes from the BBC (since we use the BBC Programmes dataset for our preservation tests as described in Section 4.2). The results shown in Table 1 show that the amount of linked data that can be obtained using the default expansion profile described in Section 3.1.1 varies depending on the URI being profiled. Expanding the URI of a person potentially produces a large number of small graphs, particularly if that person is a well published academic, since many bibliographic databases are exposed as linked data and provide small amounts of data about people. URIs for places return varying amounts of data depending on the size and relative importance of the place. Conversely, expanding the URIs of BBC programmes using the default profile produces very little linked data; we suspect that this is due to the type of data and the fact that the linking it uses is mostly based on the BBC's ontologies. As outlined in Section 5 we plan to conduct experiments in the future to assess the efficacy of the algorithm on various types of data and using domain specific expansion profiles.

One of the benefits of the algorithm, as can be seen in the results in Table 1, is that it is trivially parallel. Increasing the number of threads used to process the discovered URIs shows a significant reduction in the time taken to retrieve the linked data. Experiments were conducted with higher numbers of threads, but 8 threads was found to be optimal since beyond 8 threads erratic behaviour is observed due to two factors: 1. underlying limitations of the HTTP API used in terms of stable concurrent connections and 2. high volumes of concurrent access to a single site look like DoS attacks and lead to temporary bans on accessing those sites. Differences in the number of triples and graphs returned for URIs can be attributed to a couple of factors. In the case of the London URI, where the difference is dramatic - over 200,000 triples - this is because with a smaller number of threads connections seem more likely to time out, though we are unsure why this is. In the other cases many of the graphs were from the same domain name and the API used to retrieve the RDF had a bug in connection management for multiple concurrent connections to the same domain, which caused connections to fail unexpectedly; this is why a reduction in the amount of data is observed as the number of threads increased.

Table 1: Sample Expansion Algorithm Results

URI                                          Graphs   Triples   Time (s)   Threads
Rob Vesse                                       4        115       13.8        1
http://id.ecs.soton.ac.uk/person/11471          4        115        1.8        2
                                                4        115        1.8        4
                                                4        115        2.1        8
Wendy Hall                                    691      4,068      786.3        1
http://id.ecs.soton.ac.uk/person/1650         692      4,070      383.8        2
                                              692      4,070      375.9        4
                                              692      4,070      359.6        8
Les Carr                                      368      2,694      438.9        1
http://id.ecs.soton.ac.uk/person/60           279      2,516      109.5        2
                                              238      2,434       75.9        4
                                              204      2,366       64.8        8
Ilkeston                                        6        444       19.1        1
http://dbpedia.org/resource/Ilkeston            5        393       13.5        2
                                                5        416        9.6        4
                                                5        393        5.3        8
Southampton                                    24      3,735       57.2        1
http://dbpedia.org/resource/Southampton        23      3,497       43.8        2
                                               23      3,497       27.3        4
                                               23      3,497       55.3        8
Nottingham                                     17      4,154       41.4        1
http://dbpedia.org/resource/Nottingham         16      4,048       39.5        2
                                               16      4,048       27.4        4
                                               16      4,048       25.9        8
London                                         13     53,886      142.4        1
http://dbpedia.org/resource/London             13     53,870      211.9        2
                                               13     53,870      149.8        4
                                               14    280,424      385.8        8
Eastenders                                      2        612        1.8        1
http://www.bbc.co.uk/programmes/b006m86d        2        612        0.7        2
                                                2        612        0.6        4
                                                2        612        0.7        8
Panorama                                        2        174        1.4        1
http://www.bbc.co.uk/programmes/b006t14n        2        174        0.9        2
                                                2        174        0.7        4
                                                2        174        0.6        8
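The parallelisation is straightforwardly expressed with a thread pool over the current crawl frontier; a minimal sketch (again not AAT's actual implementation) might be:

    from concurrent.futures import ThreadPoolExecutor
    from rdflib import Graph

    def fetch(uri):
        """Dereference one URI, returning an (often empty) graph on failure."""
        g = Graph()
        try:
            g.parse(uri)
        except Exception:
            pass
        return g

    # 8 workers: Table 1 suggests this as the practical optimum before remote
    # sites begin treating the crawler's concurrent requests as a DoS attack.
    with ThreadPoolExecutor(max_workers=8) as pool:
        graphs = dict(zip(uris, pool.map(fetch, uris)))  # `uris`: the crawl frontier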
pro- grammes) associated with the service BBC1 (the BBCs main TV channel) since this includes many brands which change Figure 4: All About That Architecture regularly such as soaps and news broadcasts. Table 2 demon- strates the average number of changes detected over just a short period. AATs architecture is constructed as shown in Figure 4, As can be seen in Table 2 you can see that the BBC up- and as can be seen it is decomposed into several compo- date their dataset on a daily basis, the initial high number nents which then rely on some external standalone compo- of changes is due to starting from a base dataset that was nents: an RDF API and the expansion algorithm. AAT a couple of months old due to architectural changes made is theoretically agnostic of its underlying storage though in to AAT to support the use of the expansion algorithm and practise differences in implementation between triple stores improve the efficiency of the system. The average number mean only certain stores are currently viable for use as the of changes being 2 is due to the fact that the typical up- backing store. In the early prototyping stage a RDBMS date we see the BBC make to their data is that they add a based store was used which was sufficient for initial proto- triple describing a newly broadcast episode of a programme typing but not scalable for real world testing so then the us- and update the value of the dc:modified triple. The appar- age of production grade triple stores was adopted. Initially ently high number of 25 for the maximum changes is due to it was intended to use the open source release of Virtuoso14 one of the program URIs failing to resolve resulting in the as the backing store but it was found that Virtuoso didn’t contents of that profile being considered to missing so the correctly preserve boolean typed literals which created is- change report for each day reports those triples as removed. sues in the internal processing of data within AAT. 4store15 The relatively high number of profiles changing each day was then used briefly but it was found that it was unable to is due to the fact that as already stated many of the pro- handle the heavy volume of parallel read/writes which AAT grammes associated with BBC 1 are broadcast daily such uses during its data processing due to 4store’s concurrency as soaps and news bulletins and that the BBC publish data model. Currently AAT runs again AllegroGraph16 since it about programmes several days before the programmes are has demonstrated in testing the ability to handle the high actually broadcast. volumes of read/writes necessary for using AAT on the large To demonstrate the reuse of the data being harvested we dataset described in the preceding section. created a demonstration application which is a simple web In terms of general scalability the majority of algorithms based faceted browser which lets users browse through in- in AAT need to run on a single thread for each profile but formation about recently shown BBC shows. Facets can be it is trivial to process multiple profiles in parallel and this used to filter by Genre and Channel and the user can view is the approach taken currently. Since work can be divided detailed information about both programmes and the indi- over multiple threads it will also be possible to significantly vidual episodes. 
4.2.2 Architecture & Scalability

AAT's architecture is constructed as shown in Figure 4; as can be seen, it is decomposed into several components which rely on some external standalone components: an RDF API and the expansion algorithm. AAT is theoretically agnostic of its underlying storage, though in practice differences in implementation between triple stores mean only certain stores are currently viable for use as the backing store. In the early prototyping stage an RDBMS based store was used, which was sufficient for initial prototyping but not scalable for real world testing, so the usage of production grade triple stores was then adopted. Initially it was intended to use the open source release of Virtuoso (http://www.openlinksw.com/virtuoso) as the backing store, but it was found that Virtuoso didn't correctly preserve boolean typed literals, which created issues in the internal processing of data within AAT. 4store (http://4store.org) was then used briefly but was found to be unable to handle the heavy volume of parallel reads/writes which AAT uses during its data processing, due to 4store's concurrency model. Currently AAT runs against AllegroGraph (http://www.franz.com/agraph/allegrograph/), since it has demonstrated in testing the ability to handle the high volumes of reads/writes necessary for using AAT on the large dataset described in the preceding section.

Figure 4: All About That Architecture

In terms of general scalability the majority of algorithms in AAT need to run on a single thread for each profile, but it is trivial to process multiple profiles in parallel and this is the approach currently taken. Since work can be divided over multiple threads it will also be possible to significantly increase scalability by dividing the work over a cluster of machines, which would allow much larger datasets to be monitored efficiently.

5. FUTURE WORK

There are a number of things that could be done to improve the expansion algorithm outlined in Section 3.1, with regard to both making it more intelligent in how it retrieves linked data and conducting a detailed analysis of the data returned. Manual inspection of the data shows that it does appear to be relevant to the URI of interest, but we propose that a full IR analysis be conducted in order to statistically confirm this initial assessment. Additionally, as was seen in Table 1, some types of URIs produced very little linked data using the default expansion profile; a broader analysis using domain specific profiles is necessary to ascertain whether those URIs have low levels of interlinking or whether the interlinkings just use domain specific links rather than the generic owl:sameAs and rdfs:seeAlso links that are followed by default.

In terms of improving the intelligence of the algorithm: at the moment it submits every URI to every SPARQL, lookup and discovery endpoint described in the expansion profile, and it would improve the speed of the algorithm if it could apply some decision making as to which endpoints a given URI should be submitted to. Conversely, there is the possibility that this would impact the effectiveness of the algorithm, so it would be necessary to conduct experiments to determine whether there is a trade off between speed and accuracy. It is also worth considering that searching on URIs is not the only viable mechanism for finding additional linked data about a URI of interest.
Using terms extracted from the RDF, such as the objects of rdfs:label or dc:title triples, would provide a way to augment URI based lookup with term/text based search results from semantic search engines. There are already frameworks like Silk [28] which can be used to do this, and it would be useful to integrate the Silk framework with the expansion algorithm.

One limitation inherent in AAT is that it currently does not do any kind of special handling of blank nodes, which means that if data contains blank nodes AAT will continuously think it has encountered new knowledge when most likely it has not. For the data we have worked with so far this is generally not an issue, since the linked data community tends to avoid blank nodes, but if we are to provide for preserving all kinds of RDF effectively then we need to handle blank nodes properly. Solving this problem may involve doing some sub-graph matching and isomorphism checking to see whether the sections of the graph that contain blank nodes can be mapped to previously seen sections of the graph, as in Tummarello et al's RDFSync [23]. The blank nodes themselves could either be left as-is or be translated to URIs as done by systems like the Talis platform (http://www.talis.com/platform).

As mentioned in Section 3.2.1 a lightweight schema is used by AAT to annotate and store the data, but there are alternative vocabularies that could have been used such as the provenance vocabulary [13] and the open provenance model [18]. It would be a fairly easy and potentially useful enhancement to map the AAT schema to these vocabularies such that the data could be retrieved in the desired form by users/client applications designed to work with those formats.

Given that this work was inspired by traditional link integrity techniques from hypermedia, it is interesting to note that it has the potential to be applied back to the document web, since there is increasing cross-over between the document and data web, primarily due to the increasing uptake of RDFa. As increasing numbers of documents embed structured data using RDFa it will become possible to preserve and monitor the structured information embedded in ordinary web pages in the same way as can be done with linked data now; we therefore envisage this as having applications in the automated monitoring and maintenance of document based websites.

6. CONCLUSION

In this work we have introduced a simple but powerful expansion algorithm which can be used to retrieve linked data about a URI even when that URI is not resolvable. This provides an important tool for preserving data in the Semantic Web and recovering from data loss, and shows that in the Semantic Web links themselves can be exploited as a means to recover from broken links. As we have outlined in Sections 4.1 and 5 there is a need to conduct a detailed analysis of the algorithm to assess its efficacy for a wider variety of URIs and using domain specific expansion profiles. Depending on the results of this analysis the algorithm may need to be further refined to improve both its speed and accuracy.

We have also presented the All About That (AAT) system, which allows users to monitor and preserve linked data they are interested in, using the expansion algorithm as the primary retrieval method for deciding which linked data to preserve based on a starting URI. As we demonstrated in Section 4.2.1, we envisage the usage of such a system as a base on which to build rich Semantic Web applications that can take the changing data and present it in interesting and useful ways to end users. It also fulfils a role in the overall goal of our research, which is to provide a suite of algorithms and systems that can be used to manage both data and link integrity on the Semantic Web.

As discussed in Section 5 there are some limitations in the current versions of our algorithm and the AAT system which we intend to investigate and address in the future. It is clear that there is still a significant amount of work to be done to create a comprehensive set of tools such that they can be applied to as wide a variety of data on the Semantic Web as possible, and the experience of past research in link integrity for the document Web tells us that there will be no perfect solution.

Despite this it is our belief that, as the Semantic Web grows, data and link integrity will be increasingly important issues for users as their applications come to rely upon linked data. There is a need to have systems in place such that data can be preserved and accessed even if the original sources are gone or unavailable. This has already been seen with the release of services like the Sindice Cache API (http://www.sindice.com/developers/cacheapi), which is used as one of the data sources in the default expansion profile (see Section 3.1.1). Additionally, with the rising adoption of RDFa embedded inside documents on the web, systems like this become applicable to the preservation of the structured data embedded in the document based Web, as discussed in Section 5.
7. REFERENCES

[1] Changeset protocol, 2007. http://n2.talis.com/wiki/Changeset_Protocol.
[2] Virtuoso sponger. Technical report, OpenLink Software, 2009. http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html.
[3] K. Alexander, R. Cyganiak, M. Hausenblas, and J. Zhao. Describing linked datasets: On the design and usage of VoID, the 'vocabulary of interlinked datasets'. In Proceedings of the Linked Data on the Web Workshop (LDOW2009), Madrid, Spain, April 2009. http://ceur-ws.org/Vol-538/ldow2009_paper20.pdf.
[4] H. Ashman. Electronic document addressing: dealing with change. ACM Comput. Surv., 32(3):201-212, 2000.
[5] T. Berners-Lee, R. Cailliau, A. Luotonen, H. F. Nielsen, and A. Secret. The world-wide web. Commun. ACM, 37(8):76-82, 1994.
[6] C. Bizer, R. Cyganiak, and T. Heath. How to publish linked data on the web, 2007. http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial.
[7] P. Bouquet, H. Stoermer, and B. Bazzanella. An entity name system (ENS) for the semantic web. In 5th European Semantic Web Conference, ESWC 2008, volume 5021, page 258. Springer, 2008.
[8] G. Cheng, W. Ge, and Y. Qu. Falcons: searching and browsing entities on the semantic web. In WWW '08: Proceedings of the 17th International Conference on World Wide Web, pages 1101-1102, New York, NY, USA, 2008. ACM.
[9] H. Davis. Data Integrity Problems in an Open Hypermedia Link Service. PhD thesis, University of Southampton, November 1995. http://eprints.ecs.soton.ac.uk/6597/.
[10] H. C. Davis. Hypertext link integrity. ACM Comput. Surv., page 28, 1999.
[11] A. M. Fountain, W. Hall, I. Heath, and H. C. Davis. Microcosm: an open model for hypermedia with dynamic linking. In Hypertext: concepts, systems and applications, pages 298-311, New York, NY, USA, 1992. Cambridge University Press.
[12] T. L. Harrison and M. L. Nelson. Just-in-time recovery of missing web pages. In HYPERTEXT '06: Proceedings of the Seventeenth Conference on Hypertext and Hypermedia, pages 145-156, New York, NY, USA, 2006. ACM.
[13] O. Hartig and J. Zhao. Guide to the provenance vocabulary, 2009. http://sourceforge.net/apps/mediawiki/trdf/index.php?title=Provenance_Vocabulary.
[14] B. Haslhofer and N. Popitsch. DSNotify - detecting and fixing broken links in linked data sets. In Proceedings of the 8th International Workshop on Web Semantics, 2009.
[15] A. Jaffri, H. Glaser, and I. Millard. Managing URI synonymity to enable consistent reference on the semantic web. In IRSW2008 - Identity and Reference on the Semantic Web, 2008.
[16] F. Kappe. A scalable architecture for maintaining referential integrity in distributed information systems. Journal of Universal Computer Science, 1(2):84-104, 1995. http://www.jucs.org/jucs_1_2/a_scalable_architecture_for.
[17] F. Kappe, K. Andrews, J. Faschingbauer, M. Gaisbauer, M. Pichler, and J. Schipflinger. Hyper-G: A new tool for distributed hypermedia. Institutes for Information Processing Graz, 1994.
[18] L. Moreau, J. Freire, J. Futrelle, R. McGrath, J. Myers, and P. Paulson. The open provenance model. December 2007. http://eprints.ecs.soton.ac.uk/14979/.
[19] L. Moreau and N. Gray. A community of agents maintaining links in the World Wide Web (preliminary report). In The Third International Conference and Exhibition on The Practical Application of Intelligent Agents and Multi-Agents, pages 221-235, London, UK, March 1998. http://www.ecs.soton.ac.uk/~lavm/papers/gcWWW.ps.gz.
[20] V. Papavassiliou, G. Flouris, I. Fundulaki, D. Kotzinos, and V. Christophides. On detecting high-level changes in RDF/S KBs. In The Semantic Web: 8th International Semantic Web Conference (ISWC 2009), pages 473-488. Springer, 2009.
[21] T. A. Phelps and R. Wilensky. Robust hyperlinks: Cheap, everywhere, now. In Digital Documents: Systems and Principles, pages 514-549. Springer, 2004.
[22] G. Tummarello, R. Delbru, and E. Oren. Sindice.com: weaving the open linked data. In The Semantic Web: 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, pages 552-565. Springer, 2008.
[23] G. Tummarello, C. Morbidoni, R. Bachmann-Gmür, and O. Erling. RDFSync: efficient remote synchronization of RDF models. In The Semantic Web: 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11-15, 2007, Proceedings, pages 537-551. Springer, 2007.
[24] L. Veiga and P. Ferreira. RepWeb: replicated web with referential integrity. In SAC '03: Proceedings of the 2003 ACM Symposium on Applied Computing, pages 1206-1211, New York, NY, USA, 2003. ACM.
[25] L. Veiga and P. Ferreira. Turning the web into an effective knowledge repository. ICEIS 2004: Software Agents and Internet Computing, 14(17), 2004.
[26] R. Vesse, W. Hall, and L. Carr. All About That - a URI profiling tool for monitoring and preserving linked data. In ISWC 2009, August 2009. http://eprints.ecs.soton.ac.uk/17815.
[27] J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov. Discovering and maintaining links on the web of data. In A. Bernstein, D. R. Karger, T. Heath, L. Feigenbaum, D. Maynard, E. Motta, and K. Thirunarayan, editors, The Semantic Web - ISWC 2009, volume 5823 of Lecture Notes in Computer Science, pages 650-665. Springer, 2009.
[28] J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov. Silk - a link discovery framework for the web of data. In 2nd Linked Data on the Web Workshop (LDOW2009), 2009.