=Paper=
{{Paper
|id=None
|storemode=property
|title=Preserving Linked Data on the Semantic Web by the application of Link Integrity techniques from Hypermedia
|pdfUrl=https://ceur-ws.org/Vol-628/ldow2010_paper08.pdf
|volume=Vol-628
|dblpUrl=https://dblp.org/rec/conf/www/VesseHC10
}}
==Preserving Linked Data on the Semantic Web by the application of Link Integrity techniques from Hypermedia==
Rob Vesse, Wendy Hall, Leslie Carr
Intelligence, Agents & Multimedia Group
School of Electronics & Computer Science
University of Southampton
Southampton
SO17 1BJ
{rav08r,wh,lac}@ecs.soton.ac.uk
ABSTRACT

As the Web of Linked Data expands it will become increasingly important to preserve data and links such that the data remains useful. In this work we present a method for locating linked data to preserve which functions even when the URI the user wishes to preserve does not resolve (i.e. is broken/not RDF) and an application for monitoring and preserving the data. This work is based upon the principle of adapting ideas from hypermedia link integrity in order to apply them to the Semantic Web.

General Terms

Experimentation, Reliability, Design

Keywords

Semantic Web, Linked Data, Link Integrity, Preservation

Copyright is held by the author/owner(s). WWW2010, April 26-30, 2010, Raleigh, North Carolina.

1. INTRODUCTION

The Web of Linked Data is characterised by the interlinking between disparate heterogeneous data sources and by the fact that the links between the data sources are one of the primary mechanisms for navigating through this data space. Since links are essential to the Web of Linked Data, we believe that it is important to have mechanisms in place to maintain link integrity. The aim of link integrity is to ensure that a link works correctly, in that traversing the link takes you to a resource and that, as far as possible, the resource is the one intended by the provider of the link. On a larger scale, link integrity deals with the overall integrity of interlinked datasets, such as documents within a Content Management System (CMS) or the linked data sets available on the Semantic Web. Link integrity is therefore one way of ensuring data integrity within the overall system, which in our use case is linked datasets.

Link integrity is an existing and well-known problem from hypermedia, where there were two problems to be dealt with: dangling links and the editing problem. Dangling links are the better-known problem and are regularly experienced by users on the Web when they find themselves presented with an HTTP error because the link they followed pointed to a resource which cannot be retrieved. The editing problem refers to the situation in which the content at the end of the link is changed so that it is no longer what the creator of the link intended to link to. Both these issues affect the Semantic Web: since on the Semantic Web everything is interlinked data, data is immediately susceptible to dangling links. The editing problem becomes much more problematic on the Semantic Web since anyone can make a statement about anything, so the meaning of things on the Semantic Web is subject to semantic drift^1.

^1 See the discussion on the W3C Semantic Web Mailing List for an example of this - http://lists.w3.org/Archives/Public/semantic-web/2009May/0315.html

Due to the interlinking between data on the Semantic Web, we show that it is possible to exploit the data model such that links themselves can be used to recover missing data in the event of a dangling link being encountered. This provides a means to retrieve the data that was/may have been at a given URI even if that URI is no longer resolvable. Using this approach for locating data about a URI we are able to preserve and monitor data about a URI from multiple sources and to recover data about URIs that are no longer functioning, as described in Section 3.1.

The Semantic Web also introduces two additional problems in link integrity specific to linked data. The first of these is URI Identity & Meaning - what does a URI mean, and does this meaning actually matter to the applications that use it and the data that contains it? This is very much an open research debate and beyond the scope of our current work. The second is the co-reference problem, which refers to the situation in which some 'thing' we wish to make statements about has multiple URIs that could be used for it. In this work we utilise existing work in this area of research as part of our algorithm for preserving linked data.

Section 2 of this paper covers the related work in link integrity for hypermedia and the Semantic Web, from which we draw ideas in Section 3 to design and develop our algorithm and the AAT software. Section 5 outlines our plans for future research in this area and we conclude in Section 6 by discussing the potential benefits of link integrity for the Semantic Web.
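The dangling-link failure described above can be illustrated with a minimal sketch. This is our own illustration, not part of the paper's system; the `is_dangling` helper and the injected `fetch` callable are hypothetical names, with `fetch` injected so the check can be exercised without network access (a real client would issue an HTTP GET with an RDF `Accept` header and return the status and content type).

```python
# A minimal sketch (our own illustration, not the paper's system) of how a
# client might detect a dangling link.  `fetch(uri)` is an injected stand-in
# that returns an (HTTP status, content type) pair.

def is_dangling(uri, fetch):
    """Return True when the link dangles: the URI cannot be retrieved at
    all, yields an HTTP error, or does not serve RDF."""
    try:
        status, content_type = fetch(uri)
    except IOError:
        return True  # unreachable counts as dangling
    if status >= 400:
        return True  # e.g. the familiar HTTP 404
    # "broken/not RDF": a response that is not an RDF serialisation
    return not content_type.startswith(("application/rdf+xml", "text/turtle"))
```

A stub such as `lambda u: (404, "text/html")` then reports the link as dangling.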
2. RELATED WORK

Link integrity in hypermedia first received serious attention in the late 1980s and early 1990s, primarily from researchers in the open hypermedia community. Systems like Microcosm [11] and HyperG [17] were among the first to consider the issue in depth; Davis's thesis [9] and Kappe's 1995 paper [16] provide examples of link integrity in open hypermedia. The widespread growth of the World Wide Web [5] in the mid-1990s led to some new research, but as search engines became commonplace towards the end of the decade research interest dwindled. It was perceived that users did not care sufficiently to warrant research into the problem as they could locate missing resources effectively using search engines; in addition, the scale of the Web by that time was simply too vast for many proposed solutions to handle. Davis's survey [10] provides a good overview of the state of this research as of the end of the 1990s. Another reason for the decline in research was the fact that the ability of links to fail was one of the reasons the Web was able to expand as fast as it did: since it didn't matter if links failed and produced the familiar HTTP 404 error, users were able to publish content without worrying about whether their links to external content were valid.

Ashman's 2000 paper [4], which discusses link integrity with particular reference to electronic document archives, provides both a useful survey of existing work and a key motivation for ongoing research. As more document collections were translated into digital forms and placed onto intranets, people once again started to be concerned about link integrity. Users wanted assurances that links into the document archives would work consistently, and ideally links out of the archives would work correctly as well, since it may not be possible to alter the archived documents without invalidating the integrity of the archive.

In this vein Veiga and Ferreira [24, 25] discuss the possibility of turning the Web into an effective knowledge repository by use of replication and versioning. Their work follows on from earlier work such as Moreau & Gray's [19], which proposed limited use of replication and versioning but relied significantly on author and user involvement in the process. In Veiga & Ferreira's work there is no requirement for author involvement; the end user need only use a browser plugin to indicate the content they wish to replicate and preserve. Their results showed that the user could preserve the sections of the Web they were interested in with no perceivable performance impact - on average there was only a 12ms increase in retrieval time for resources. In Section 3 we discuss using an approach of this kind for the Semantic Web.

Phelps & Wilensky introduced the concept of lexical signatures for Web pages in their Robust Hyperlinks paper [21]. They compute the lexical signature of a page and append it to all links to that page so that, in the event of the link failing, a browser plugin can use the signature to relocate the page using a search engine. The obvious flaw in their work was that it required rewriting all the links on the Web, but Harrison & Nelson later showed that these signatures need only be computed Just-in-Time (JIT) when a link fails [12]. In their Opal system the signatures can be computed JIT by retrieving cached copies of the pages from a search engine cache, computing the signature and then using search engines to relocate the page. As discussed in Section 3.1, a JIT style approach can be effectively used to recover linked data about a URI.

2.1 Semantic Web Research

Unlike the traditional Web, it is not possible for semantic search engines like Sindice [22] and Falcons [8] to fulfil the same role as document search engines, because the users in the Semantic Web domain are typically client applications rather than humans. When a human encounters a dead link they usually navigate to a search engine and enter an appropriate search phrase to find alternative sources of information. A client application encountering a dead link will typically have no concept of how or where to find alternative sources of information, and URIs for linked data are not always ideal for searching upon compared to textual search for documents. It should be noted that, as with the existing Web, if the Web of Linked Data is to undergo a massive expansion in the same way then things must be allowed to fail, but this does not mean we shouldn't attempt to mitigate the problem as far as possible.

In terms of the Semantic Web there has been research into the versioning and synchronisation of RDF data which is relevant to aspects of our work, such as Tummarello et al's RDFSync [23], an algorithm for efficiently synchronising changes in RDF between multiple machines. This shows that change detection in RDF is non-trivial, due to the inherent data isomorphism caused by the use of blank nodes, but also shows that it can be achieved in an efficient manner. More recent research from Papavassiliou et al [20] has shown that information about very basic changes in the RDF - such as that provided by systems like RDFSync or All About That (see Section 3.2) - can be used to build applications which provide useful information to end users. In the case of Papavassiliou et al's paper they built a system which furnished users with high level descriptions of how RDFS vocabularies have changed, in order to aid users in working with such vocabularies. In addition there are systems like the Talis Platform^2, a Semantic Web store that implements a versioning mechanism whereby updates can be made via a Changeset protocol [1]. As part of this protocol they utilise a useful lightweight vocabulary for publishing changes in RDF data as RDF which, as will be discussed in Section 3.2.3, we reuse in our own system.

^2 http://www.talis.com/platform

Regarding Semantic Web specific link integrity problems, the research has largely focused on the co-reference problem. Since there are many organisations publishing similar data semantically (bibliographic databases being a prime example) there are frequently many URIs for a single entity such as an author. Co-reference research aims to develop ways to efficiently and accurately determine URI equivalences and to refactor the data or republish this information to help other Semantic Web applications. There are several competing philosophies, ranging from the Okkam approach described by Bouquet et al [7], which advocates universally agreed URIs for each entity, to the Co-reference Resolution Service (CRS) approach of Jaffri et al [15], which determines co-referent URIs and republishes the information in dedicated triple stores. The CRS approach taken by the ReSIST project^3 within the RKB Explorer^4 application has potential for use in link integrity, as the information provided by a CRS could be utilised in a JIT fashion as in Harrison & Nelson's work; we demonstrate how this can be done in Sections 3 and 4.

^3 http://www.resist-noe.org/
^4 http://www.rkbexplorer.com
In terms of link maintenance for the Semantic Web there has been some research in the form of the Silk framework by Volz et al [28], a framework for computing links between different datasets. Their approach allows users to stipulate arbitrarily complex matching criteria to do entity matching between datasets; the links produced from this can then be published via a CRS style service or added to the relevant datasets. As proposed in their later paper [27] this can be used as part of a link maintenance strategy; the possibility of combining this with our approach is discussed in Section 5. In a similar vein, Haslhofer and Popitsch's DSNotify system [14] can monitor linked resources and inform the application when links are no longer valid, using feature based similarity metrics like the Silk framework.

3. METHOD

As we have discussed, it is not realistic to maintain link integrity in a pre-emptive way, since such solutions have been consistently shown in previous work not to scale to Web scale. Therefore the focus must be on recovery in the event of failure and on preservation to guard against the loss of data which is considered interesting/useful to end users. As the amount of data in the Web of Linked Data starts to expand massively - particularly with linked data being adopted by an increasing number of major organisations - we expect that, as with the early document web, there will be an increasing amount of content published by both big companies and individuals. Just like the document web, this explosion of content will most likely include much content that is poorly maintained and will lead to increasing numbers of broken links. We have two connected goals in this work: 1) to provide a means to retrieve resource descriptions in the form of linked data about a URI even when the URI is non-functional, and 2) to provide the means for an end user to preserve and version these descriptions. To attempt to solve this problem we present an expansion algorithm for retrieving Linked Data about a URI even if that URI itself has failed in Section 3.1 and a preservation system built using this algorithm in Section 3.2.

3.1 Expansion Algorithm

Since the goal of this work is to preserve linked data, it was deemed essential that as far as possible we leverage existing linked data technologies and services in order to effect this preservation. To this end we designed a relatively simple algorithm which uses simple crawling techniques directed by a user definable expansion profile (see Definition 1). Our aim with this algorithm is to provide resource descriptions of a URI regardless of whether the URI itself is dereferenceable.

Even in the case where a URI is used only as an identifier in the description of another resource and is not itself dereferenceable, it is likely that we can still retrieve some data about it. The fact that a URI is minted only as an identifier, and that the person/organisation minting the URI does not provide the means to dereference it, does not affect our ability to find data about it, assuming that the identifier is used elsewhere, i.e. it is reused as part of linked data.

Definition 1. An expansion profile is a Vocabulary of Interlinked Datasets (VoID) description of a set of datasets and linksets that should be used to locate linked data about the URI of interest. The VoID description may optionally be annotated with additional properties which affect the behaviour of the algorithm.

Drawing on the ideas described in Alexander et al's Vocabulary of Interlinked Datasets (VoID) paper [3] about the way VoID can be used to direct crawlers, we decided to use VoID as the primary means of expressing an expansion profile. We introduce a couple of additional predicates, since we require the means to allow end users to specify some basic characteristics of how the algorithm should behave and there is a type of service we need to express which is not contained in the VoID ontology. VoID has concepts of Datasets and Linksets: the former represent a set of data which may have SPARQL endpoint(s) and/or URI lookup endpoint(s), while the latter represent the types of interlinkings between datasets. What VoID does not have is a means to express the location of a service provided by a dataset which allows an application to retrieve URIs which are considered equivalent to a given URI - this we term a URI discovery endpoint (see Definition 2). A discovery endpoint differs from a lookup endpoint in that the latter is expected to return everything the dataset knows about the given URI, as opposed to only returning equivalent URIs. Examples of existing discovery endpoints on the Semantic Web include RKBExplorer's CRSes [15] and sameAs.org^5. Another key difference between a lookup and a discovery endpoint is that links discovered from a discovery endpoint are considered to be on the same level of the crawl for the purposes of the algorithm, i.e. they do not have increased depth relative to the URI that discovery is performed upon. By this we mean that the execution of the algorithm results in performing a breadth-first depth-limited linked data crawl starting from a given URI - in this tree structure a discovery endpoint introduces sibling nodes for a URI while a lookup endpoint introduces child nodes for a URI.

^5 http://www.sameas.org

Our other extensions to VoID allow individual datasets/linksets to be marked as ignored (the algorithm will not use them) and allow the user to define to what depth the algorithm should crawl (defaulting to 1). These extensions are defined as part of the AAT schema detailed in Section 3.2.1.

Definition 2. A URI discovery endpoint is an endpoint that, when passed a URI, returns a Graph containing equivalent URIs of the input URI, typically in the form of owl:sameAs links.

As already stated, the actual algorithm is a simple crawler which uses the input expansion profile as a guide to which potential sources of linked data it should use to try and find data about the URI of interest - this procedure is detailed in Algorithm 1. Note that the algorithm does not terminate in the event of an error retrieving data from a particular URI/endpoint but simply continues; by doing this it is still possible to retrieve some data even if the starting URI does not return a valid response. The algorithm will continue to issue queries about the URI to the various endpoints described in the given expansion profile, so unless the URI refers to a document that had very poor linkages or was not indexed by the semantic search services used, some RDF will be returned. This approach has similarities to the JIT style approach of Harrison & Nelson [12] in that there does not need to be any foreknowledge of the URIs you wish to recover data about when you discover they are broken, since by utilising the caches and lookup services of relevant datasets it is still possible to recover data about the URI.

The basic behaviour of the algorithm is only to follow owl:sameAs and rdfs:seeAlso links, but the end user can specify that any predicate be treated as a link to follow by specifying an appropriate VoID linkset in their expansion profile.
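The sibling/child distinction between discovery and lookup endpoints described above can be sketched as a breadth-first, depth-limited crawl. This is a simplified illustration, not the AAT implementation: the `lookup` and `discover` callables are injected stand-ins for the real endpoints, and the SPARQL DESCRIBE step of Algorithm 1 is omitted.

```python
from collections import deque

def expand(start_uri, lookup, discover, max_depth=1):
    """Sketch of the breadth-first, depth-limited expansion.  `lookup(uri)`
    returns linked URIs, which become child nodes at depth + 1;
    `discover(uri)` returns equivalent URIs, which become sibling nodes at
    the SAME depth - the key property of a discovery endpoint.  Returns a
    mapping of each visited URI to the depth at which it was visited."""
    seen = {}
    queue = deque([(start_uri, 0)])
    while queue:
        uri, depth = queue.popleft()
        if uri in seen or depth > max_depth:
            continue
        seen[uri] = depth
        for child in lookup(uri):
            queue.append((child, depth + 1))
        for sibling in discover(uri):
            queue.append((sibling, depth))  # no depth increase for equivalents
    return seen
```

With `max_depth=1` (the default profile's setting), equivalents of the starting URI found via discovery are still expanded in full because they sit at depth 0, while ordinary links are followed only one step.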
Algorithm 1 Expansion Algorithm
Require: URI, Expansion Profile
1: ToExpand as a set of pairs of URIs and Depths
2: while ToExpand ≠ ∅ do
3:   Remove first pair from ToExpand
4:   if Graph with URI is already in the Dataset then
5:     Continue
6:   if Depth > Max Depth then
7:     Continue
8:   Retrieve the Graph at the URI
9:   Add the Graph to the Dataset
10:  for all Triples in Graph do
11:    if Triple is a Link then
12:      Add a new pair to ToExpand
13:  for all Datasets in Expansion Profile do
14:    if Dataset has a SPARQL Endpoint then
15:      Issue a DESCRIBE for the URI against the Endpoint
16:      Add resulting Graph to the Dataset
17:      Process the Graph for additional Links
18:    if Dataset has a Lookup Endpoint then
19:      Issue a Lookup for the URI against the Endpoint
20:      Add resulting Graph to the Dataset
21:      Process the Graph for additional Links
22:    if Dataset has a Discovery Endpoint then
23:      Issue a Discovery for the URI against the Endpoint
24:      for all Equivalent URIs do
25:        Add a new pair to ToExpand
26: return Dataset

There are already some existing systems which work in a similar way to our algorithm, such as the sponger middleware used in Virtuoso [2]. The main difference between our algorithm and algorithms such as those in the Virtuoso sponger is that our algorithm is only interested in linked data and it does not infer/create any additional data. Unlike the Virtuoso sponger, it does not attempt to turn non-linked data into RDF and it does not do any inference over the data it returns; it is designed only to find and return (in the form of an RDF dataset) linked data about the URI of interest. Yet as expansion profiles may reference any datasets and associated endpoints they wish, there is no reason why a user could not direct our algorithm to utilise a service like URIBurner^6, which uses the Virtuoso sponger, in order to get the benefits of the additional inferred data.

^6 http://www.uriburner

3.1.1 Default Profile

Since the end user of such an algorithm may not always know where to look for linked data about the URI they are interested in, the algorithm has a default expansion profile which is used when no profile is specified. This profile uses 3 data sources which are in our opinion important hubs of the Web of Linked Data:

• DBPedia^7 - The DBPedia SPARQL endpoint is used to lookup URIs

• Sindice^8 Cache - The Sindice Cache API^9 allows the retrieval of Sindice's cached copy of the RDF from a URI.

• SameAs.org^10 - SameAs.org provides a URI discovery endpoint (see Section 3.1 and Definition 2) which can be used to find URIs which are equivalent to a given URI

^7 http://dbpedia.org
^8 http://www.sindice.com
^9 http://www.sindice.com/developers/cacheapi
^10 http://www.sameas.org

The default profile^11 has a max expansion depth of 1, which means it only considers URIs which are immediate neighbours of the starting URI.

^11 http://www.dotnetrdf.org/expander/defaultProfile

In the case where the end user does know which linked data sources will have useful information about the URI, they can specify their own expansion profile, which is used instead of the default profile. In this case the algorithm will use the datasets and linksets they define in the profile to discover linked data about the URI of interest; for example, if attempting to recover data about a person it may be useful to follow foaf:knows links.

3.2 Preservation

The preservation approach taken is to allow the end user to monitor and preserve a set of linked data that they are interested in. The data is preserved not at the data source but rather at a local level on the user's server, with the user able to republish this data as they desire. This is in line with the ideas of Veiga & Ferreira [25] in that the end user specifies the parts of the Web they want to preserve and the software then takes care of this. The data must be preserved in such a way that the original data can be efficiently extracted from it and sufficient information to provide versioning over the data is kept.

Since in the Semantic Web domain the objects of interest are URIs, we propose that a profile of a URI be preserved (see Definition 3). Since the data being processed is RDF, it is logically divided into triples which can be preserved and monitored individually. It is deemed necessary to store information pertaining to the temporality and provenance of each triple - when it was first seen, when it was last updated, its source URI(s) and whether it has changed or been retracted/deleted from the RDF.

Definition 3. A URI's profile is the transformed and annotated form of the linked data retrievable about a given URI such that the temporality and provenance of the triples contained therein are inferable from the profile.

In terms of user interface, the system should allow a user to view a profile both in the stored form and in its original form. The system must monitor the original data source over time, updating the profiles as necessary such that it can provide a report of changes in the data to the user. Since a URI profile will contain versioning information, the interface should allow a user to view a particular version of the profile.

3.2.1 Schema

As the first stage of implementation an RDF Schema for All About That^12 (AAT) is defined which embodies classes and properties which allow the description and annotation of triples in such a way that the required information as discussed in the preceding proposal can be stored for each triple.

^12 This schema is available at http://www.dotnetrdf.org/AllAboutThat/
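The per-triple bookkeeping that such a schema must support (first/last asserted dates, source URIs and change status) can be sketched as a plain record. The field names here are our own illustrative choices, not the actual AAT predicates.

```python
from dataclasses import dataclass, field

@dataclass
class TripleRecord:
    """Illustrative per-triple annotation (field names are ours, not the
    AAT predicates): the temporality and provenance information that must
    be stored for each triple in a profile."""
    triple: tuple          # the original (subject, predicate, object)
    first_asserted: str    # date the triple was first seen
    last_asserted: str     # date the triple was last seen
    sources: set = field(default_factory=set)  # all URIs asserting the triple
    status: str = "current"  # current / changed / missing / retracted

    def seen(self, date, source):
        """Record another sighting of the triple at a source URI."""
        self.last_asserted = date
        self.sources.add(source)
        self.status = "current"
```

Keeping the original triple intact inside the record mirrors the design goal stated above: non-AAT aware tools should be able to extract the original data without understanding the annotations.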
The schema defines a class for representing profiles called aat:Profile and uses the rdf:Statement class to represent triples. rdf:Statement is used as the basis of triple storage as it makes it possible for non-AAT aware tools to extract the original triples from the profile easily. A number of properties are defined which store metadata about the profile itself, such as created & updated dates, source URI and a locally unique identifier for the profile. Similar properties are defined for triples, which allow the first and last asserted dates, source URI and change status of a triple to be indicated. A key distinction in the schema is between aat:profileSource and aat:source; despite storing equivalent data, two predicates are created since the former expresses the URI which is the starting point for the profile while the latter expresses all the URIs at which a given triple is asserted.

While there were alternative schemas and vocabularies available that could potentially have been used to store the required data, the motivation behind designing our own schema was to provide a lightweight schema that attached all data to a single subject for ease of processing. Alternatives such as the Provenance Vocabulary by Hartig & Zhao [13] are far more expressive, but they potentially require introducing multiple intermediate blank nodes which would significantly complicate the processing needed to implement many of the core features of AAT. Similarly the Open Provenance Model as described by Moreau et al [18] is highly expressive, but like Hartig & Zhao's vocabulary its RDF serialization is overly complex for use in AAT. As discussed in Section 5 there is no reason why the data contained in AAT could not be exposed in other provenance vocabularies, but for AAT's processing and storage a lightweight vocabulary is preferable.

The use of reification was chosen over the use of named graphs primarily due to the need to make annotations at the level of individual triples rather than at the graph level; its usage is also motivated by the fact that the mechanism provides a clear and obvious schema for encoding a triple and adding additional annotations to it. While reification may significantly increase the size of the data being stored initially, over time this balances out compared to named graphs, where it is necessary either to store many copies of the same graph or to store multiple named graphs which represent a series of deltas to the original data. The other difficulty inherent in the named graphs approach is that the annotations would then typically be held separately in other named graphs, which adds to the complexity of the data processing. Nevertheless named graphs are used within AAT, since each profile naturally forms a named graph and AAT generates several related named graphs about each profile detailing change history and changesets, as described in Section 3.2.3.

3.2.2 Profile Creation & Update

To create a URI's profile, linked data about the URI is first retrieved using the expansion algorithm presented in Section 3.1; then, using the AAT schema, each triple can be transformed into a set of triples which represent an annotation of the original triple. For each triple in the original RDF a blank node is created which is then used as the subject of a set of triples which represent the required information about the original triple. Figure 1 shows an example triple and Figure 2 shows it transformed into the AAT form. A URI's profile consists of a set of transformed triples, where each profile is a named graph in the underlying store.

Figure 1: Original Triple

Figure 2: Triple transformed to AAT Annotated Form

Since the user needs both to browse the data they are preserving and potentially to republish it, a Web based interface was designed as the primary interaction mechanism. The interface allows users to explore the data by first selecting a profile to view and then allowing them to view profile contents, exports, versions and change reports. A user may also use the interface to add new URIs they wish to monitor to the system and to initiate updates to profiles (see Definition 4). Following linked data best practices [6], and to provide the ability for the user to republish their preserved data, multiple dereferenceable URIs for each profile are created and made accessible through the Web interface. These allow the retrieval of the profile contents (which consist of all the triples ever retrieved from the profile URI in the transformed form), the export of the profile (see Definition 5) and various meta graphs about a profile, e.g. change history and changesets. This means that the profile of a URI has a URI and thus can itself be profiled if desired.

Definition 4. An update of a profile occurs when AAT uses the Expansion algorithm to retrieve RDF about the given URI. The triples contained are compared with the triples currently in the profile and the profile is updated accordingly.

Definition 5. The export of a profile is the recreation of the RDF in its original form based upon the current contents of the profile. An export represents the RDF as it was last seen by AAT.
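The transformation described in Section 3.2.2 can be sketched as follows. The aat: property names for the dates are our own illustrative guesses at the schema (the paper names aat:source but not the date predicates), and a counter stands in for blank node generation.

```python
import itertools

_counter = itertools.count(1)

def annotate(triple, source, date):
    """Sketch of the AAT transformation: reify a triple via rdf:Statement
    (so non-AAT aware tools can recover the original triple) and attach
    annotation triples to a single node standing in for the blank node AAT
    creates.  The aat: date predicates are illustrative, not the actual
    schema terms."""
    s, p, o = triple
    node = "_:b%d" % next(_counter)  # stand-in for the generated blank node
    return [
        (node, "rdf:type", "rdf:Statement"),
        (node, "rdf:subject", s),
        (node, "rdf:predicate", p),
        (node, "rdf:object", o),
        (node, "aat:source", source),         # a URI asserting this triple
        (node, "aat:firstAsserted", date),    # illustrative name
        (node, "aat:lastAsserted", date),     # illustrative name
    ]
```

Because every annotation hangs off the one node, all information about a triple is attached to a single subject, which is the ease-of-processing property the schema design above aims for.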
3.2.3 Change Reporting

A key feature of AAT is the ability to generate change reports about how the RDF at the profiled URI has changed over time. To do this, a number of relatively simple computations over the annotated triples can be made, based primarily on the first and last asserted dates of the triples. In creating change reports, four different types of changes in the RDF are looked for (see Definitions 6-9). A distinction is made between missing knowledge and retracted or deleted knowledge, as it may be possible for triples to be perceived to be temporarily non-present in the RDF. For example, in the event of a transient network issue making some/all of the relevant URIs unretrievable, the updated date for the profile will still be updated, leaving all the triples in the profile appearing to be missing. The length of time we require triples to be missing before we consider them to be deleted is currently set to 7 days for our monitoring of the BBC dataset described in Section 4.2.1; this time period is a domain specific parameter that can be adjusted depending on the data that is being monitored.

Definition 6. New knowledge is any triple that is new to the RDF at the profiled URI.

Definition 7. Changed knowledge is any triple where the object of the triple has changed. Only triples where the predicate has a cardinality of 1 can be considered to change.

Definition 8. Missing knowledge is any triple no longer found in the RDF at the profiled URI but which was recently seen in the RDF.

Definition 9. Retracted or deleted knowledge is any triple no longer found in the RDF at the profiled URI which has not been seen for a reasonable length of time.

In regard to the concept of changed knowledge, consider some arbitrary predicates ex:one and ex:many which have cardinalities of 1 and unrestricted respectively. Since ex:one has a cardinality of 1, it can be said that whenever the object of such a triple has changed it is changed knowledge. Yet the same cannot be said for ex:many triples: as the predicate has unrestricted cardinality, each triple using this predicate must be treated as a unique entity, i.e. one instance of a triple using this predicate cannot be considered to replace another. In the examples, the fact that <A> was related to <C> via the predicate ex:many in Example 1 and is now instead related to <E> in Example 2 doesn't mean they

have a starting point for this. Separate to Changesets, a named graph containing a history for each profile is also stored which links to all the relevant Changesets for a profile.

4. RESULTS

4.1 Expansion

To test the expansion algorithm we took a small sample of URIs which included the URIs of the authors, places associated with the authors and TV programmes from the BBC (since we use the BBC programmes dataset for our preservation tests as described in Section 4.2). The results shown in Table 1 show that the amount of linked data that can be obtained using the default expansion profile described in Section 3.1.1 varies depending on the URI being profiled. Expanding the URI of a person potentially produces a large number of small graphs, particularly if that person is a well published academic, since many bibliographic databases are exposed as linked data and provide small amounts of data about people. As can be seen, URIs for places return varying amounts of data depending on the size and relative importance of the place. Conversely, expanding the URIs of BBC programmes using the default profile produces very little linked data; we suspect that this is due to the type of data and the fact that the linking it uses is mostly based on the BBC's ontologies. As outlined in Section 5 we plan to conduct experiments in the future to assess the efficacy of the algorithm on various types of data and using domain specific expansion profiles.

One of the benefits of the algorithm is that, as can be seen in the results in Table 1, it is trivially parallel. Increasing the number of threads used to process the discovered URIs shows a significant reduction in the time taken to retrieve the linked data. Experiments were conducted
are related to < E > instead of < C >, it just means they no with higher number of threads but 8 threads was found to
longer consider themselves related to < C >. The fact that be optimal since beyond 8 threads erratic behaviour is ob-
they are related to < E > is new knowledge while the fact served due to two factors: 1. underlying limitations of the
they related to < C > is missing/deleted knowledge, but if HTTP API used in terms of stable concurrent connections
the value of the ex:one relationship had changed then that and 2. high volumes of concurrent access to a single site look
would be considered changed knowledge. like DoS attacks and lead to temporary bans on accessing
those sites. Differences in the number of triples and graphs
Example 1 Original Graph returned for URIs can be attributed to a couple of factors.
ex:one . In the case of the London URI where the difference is dra-
ex:many . matic - over 200,000 triples difference - this is because with
ex:many . a smaller number of threads connections seem more likely
to time out though we are unsure why this is. In the other
cases many of the graphs were from the same domain name
and the API used to retrieve the RDF had a bug regarding
connection management for multiple concurrent connections
Example 2 Modified Graph
to the same domain which caused connections to fail unex-
ex:one . pectedly which is why a reduction in the amount of data is
ex:many . observed as the number of threads increased.
ex:many .
4.2 Preservation
When a change report is computed is it itself serialized 4.2.1 BBC Programmes
into an RDF Graph using the Talis Changeset ontology [1] In order to test AAT properly it was used to monitor a
which is stored as a named graph in the underlying store subset of the BBC Programmes13 dataset which is a large
and republished via the web interface. Each Changeset gen- and constantly changing linked data set which allowed for
erated links back to the previous Changeset (if one exists) both the testing of the scalability of AAT and for the ver-
such that a end user/client application consuming the data ification that it’s change detection algorithms worked as
can follow the history of changes, a special URI which re-
13
trieves the most recent Changeset is provided such that users http://www.bbc.co.uk/programmes/developers
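The change categories of Definitions 6-9 can be sketched as a set comparison between the previously stored profile and the freshly retrieved RDF. The following is a minimal illustrative sketch, not AAT's actual code: triples are plain (subject, predicate, object) tuples, and the cardinality-1 predicate list is a hypothetical stand-in for whatever configuration the real system uses.

```python
EX = "http://example.org/"
CARDINALITY_ONE = {EX + "one"}  # predicates assumed to have cardinality 1 (illustrative)

def classify_changes(old_triples, new_triples):
    """Return (new, changed, missing) knowledge per Definitions 6-9.

    'missing' knowledge only becomes 'retracted/deleted' knowledge once a
    triple has stayed absent longer than a domain-specific threshold
    (7 days in the BBC monitoring described in Section 4.2.1)."""
    old_set, new_set = set(old_triples), set(new_triples)
    added = new_set - old_set      # candidate new knowledge (Definition 6)
    removed = old_set - new_set    # candidate missing knowledge (Definition 8)
    changed = set()                # changed knowledge (Definition 7)
    for (s, p, o) in sorted(added):
        if p in CARDINALITY_ONE:
            # For a cardinality-1 predicate a differing object is a *change*,
            # not an independent addition/removal pair
            prior = [t for t in removed if t[0] == s and t[1] == p]
            if prior:
                changed.add((s, p, prior[0][2], o))  # (s, p, old object, new object)
                added.discard((s, p, o))
                removed.discard(prior[0])
    return added, changed, removed
```

Applied to Examples 1 and 2, where <A>'s ex:many object changes from <C> to <E>, this reports one new and one missing triple rather than a change, while a differing ex:one object is reported as changed knowledge.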
Table 1: Sample Expansion Algorithm Results
URI Total Graphs Total Triples Retrieval Time (seconds) Threads Used
Rob Vesse 4 115 13.8 1
http://id.ecs.soton.ac.uk/person/11471 4 115 1.8 2
4 115 1.8 4
4 115 2.1 8
Wendy Hall 691 4,068 786.3 1
http://id.ecs.soton.ac.uk/person/1650 692 4,070 383.8 2
692 4,070 375.9 4
692 4,070 359.6 8
Les Carr 368 2,694 438.9 1
http://id.ecs.soton.ac.uk/person/60 279 2,516 109.5 2
238 2,434 75.9 4
204 2,366 64.8 8
Ilkeston 6 444 19.1 1
http://dbpedia.org/resource/Ilkeston 5 393 13.5 2
5 416 9.6 4
5 393 5.3 8
Southampton 24 3,735 57.2 1
http://dbpedia.org/resource/Southampton 23 3,497 43.8 2
23 3,497 27.3 4
23 3,497 55.3 8
Nottingham 17 4,154 41.4 1
http://dbpedia.org/resource/Nottingham 16 4,048 39.5 2
16 4,048 27.4 4
16 4,048 25.9 8
London 13 53,886 142.4 1
http://dbpedia.org/resource/ 13 53,870 211.9 2
13 53,870 149.8 4
14 280,424 385.8 8
Eastenders 2 612 1.8 1
http://www.bbc.co.uk/programmes/b006m86d 2 612 0.7 2
2 612 0.6 4
2 612 0.7 8
Panorama 2 174 1.4 1
http://www.bbc.co.uk/programmes/b006t14n 2 174 0.9 2
2 174 0.7 4
2 174 0.6 8
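The trivially parallel retrieval benchmarked in Table 1 can be sketched with a bounded thread pool. This is a hedged illustration under assumed names: `fetch_graph` is a placeholder for the real dereferencing logic, and the 8-thread cap simply mirrors the optimum reported above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_graph(uri):
    """Placeholder for dereferencing a URI and parsing the RDF it returns;
    the real retrieval (content negotiation, parsing, timeouts) is omitted."""
    raise NotImplementedError

def expand_in_parallel(discovered_uris, fetch=fetch_graph, max_threads=8):
    """Retrieve each discovered URI on a bounded thread pool.

    8 threads was the observed sweet spot: beyond that, unstable concurrent
    connections and DoS-style rate limiting caused erratic results."""
    graphs = {}
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = {pool.submit(fetch, uri): uri for uri in discovered_uris}
        for future in as_completed(futures):
            uri = futures[future]
            try:
                graphs[uri] = future.result()
            except Exception:
                graphs[uri] = None  # record failures as unretrievable; don't abort the run
    return graphs
```

Recording failed retrievals as `None` rather than aborting matches the behaviour the profile update requires: a transiently unreachable URI should merely leave its triples looking missing, not stop the whole run.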
Table 2: Changes detected in the preserved BBC Programmes dataset over 1 week
Date and Time Number of Changed Profiles Average Changes per Profile Max. Changes Min. Changes
12/2/2010 163 43 311 1
13/2/2010 105 2 25 1
14/2/2010 90 2 25 1
15/2/2010 87 2 25 1
16/2/2010 90 2 25 1
17/2/2010 87 2 25 1
18/2/2010 86 2 25 1
The subset used was all the brands (i.e. programmes) associated with the service BBC1 (the BBC's main TV channel), since this includes many brands which change regularly such as soaps and news broadcasts. Table 2 demonstrates the average number of changes detected over just a short period.

As can be seen in Table 2, the BBC update their dataset on a daily basis. The initial high number of changes is due to starting from a base dataset that was a couple of months old, a consequence of architectural changes made to AAT to support the use of the expansion algorithm and improve the efficiency of the system. The average of 2 changes per profile is due to the fact that the typical update we see the BBC make to their data is to add a triple describing a newly broadcast episode of a programme and update the value of the dc:modified triple. The apparently high maximum of 25 changes is due to one of the programme URIs failing to resolve, resulting in the contents of that profile being considered to be missing, so the change report for each day reports those triples as removed. The relatively high number of profiles changing each day is due to the fact that, as already stated, many of the programmes associated with BBC1 are broadcast daily (such as soaps and news bulletins) and that the BBC publish data about programmes several days before the programmes are actually broadcast.

Figure 3: BBC Programmes Demonstration Application built on top of data from AAT

To demonstrate the reuse of the data being harvested we created a demonstration application, a simple web-based faceted browser which lets users browse through information about recently shown BBC shows. Facets can be used to filter by Genre and Channel, and the user can view detailed information about both programmes and the individual episodes. This application was presented as part of an earlier prototype of AAT described in [26] and is shown in Figure 3. Like previous work by Papavassiliou et al [20] it shows that simple information about basic triple level changes in RDF (additions, deletions etc.) can be reprocessed into useful applications for end users.

4.2.2 Architecture & Scalability

Figure 4: All About That Architecture

AAT's architecture is constructed as shown in Figure 4; as can be seen, it is decomposed into several components which then rely on some external standalone components: an RDF API and the expansion algorithm. AAT is theoretically agnostic of its underlying storage, though in practice differences in implementation between triple stores mean only certain stores are currently viable for use as the backing store. In the early prototyping stage an RDBMS-based store was used, which was sufficient for initial prototyping but not scalable enough for real world testing, so the usage of production grade triple stores was adopted. Initially it was intended to use the open source release of Virtuoso14 as the backing store, but it was found that Virtuoso did not correctly preserve boolean typed literals, which created issues in the internal processing of data within AAT. 4store15 was then used briefly but was found to be unable to handle the heavy volume of parallel reads/writes which AAT performs during its data processing, due to 4store's concurrency model. Currently AAT runs against AllegroGraph16, since it has demonstrated in testing the ability to handle the high volumes of reads/writes necessary for using AAT on the large dataset described in the preceding section.

In terms of general scalability, the majority of algorithms in AAT need to run on a single thread for each profile, but it is trivial to process multiple profiles in parallel and this is the approach currently taken. Since work can be divided over multiple threads it will also be possible to significantly increase scalability by dividing the work over a cluster of machines, which would allow much larger datasets to be monitored efficiently.

14 http://www.openlinksw.com/virtuoso
15 http://4store.org
16 http://www.franz.com/agraph/allegrograph/
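The storage-agnostic design described above can be sketched as a minimal abstraction over named-graph stores. The interface and class names here are illustrative assumptions, not AAT's actual API: the point is that profile-processing code depends only on the abstraction, with one adapter per concrete triple store.

```python
from abc import ABC, abstractmethod

class GraphStore(ABC):
    """Hypothetical backing-store interface: application code programs against
    this, while adapters (e.g. one wrapping a production triple store,
    or the in-memory one below for tests) supply the implementation."""

    @abstractmethod
    def save_graph(self, graph_uri, triples):
        """Persist a named graph, replacing any previous contents."""

    @abstractmethod
    def load_graph(self, graph_uri):
        """Return the set of triples in a named graph (empty if absent)."""

class InMemoryStore(GraphStore):
    """Toy adapter useful for testing; deployments swap in a real store."""

    def __init__(self):
        self._graphs = {}

    def save_graph(self, graph_uri, triples):
        self._graphs[graph_uri] = set(triples)

    def load_graph(self, graph_uri):
        return set(self._graphs.get(graph_uri, set()))
```

As the paper notes, such an abstraction is only "theoretical" agnosticism: differences in concurrency models and literal handling between stores still decide which adapters are viable in practice.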
5. FUTURE WORK

There are a number of things that could be done to improve the expansion algorithm outlined in Section 3.1, with regards to both making it more intelligent in how it retrieves linked data and conducting a detailed analysis of the data returned. Manual inspection of the data shows that it does appear to be relevant to the URI of interest, but it is proposed that a full IR analysis of this is conducted in order to statistically confirm this initial assessment. Additionally, as was seen in Table 1, some types of URIs produced very little linked data using the default expansion profile; a broader analysis using domain-specific profiles is necessary to ascertain whether those URIs have low levels of interlinking or if the interlinkings just use domain-specific links rather than the generic owl:sameAs and rdfs:seeAlso links that are followed by default.

In terms of improving the intelligence of the algorithm, at the moment it submits every URI to every SPARQL, lookup and discovery endpoint described in the expansion profile; it would improve the speed of the algorithm if it could use some decision making as to which endpoints a given URI should be submitted to. Conversely, there is the possibility that this would impact the effectiveness of the algorithm, so it would be necessary to conduct experiments to determine whether there is a trade-off between speed and accuracy. It is also worth considering that searching on URIs is not the only viable mechanism for finding additional linked data about a URI of interest. Using terms extracted from the RDF, such as the objects of rdfs:label or dc:title triples, would provide a way to augment URI-based lookup with term/text-based search results from semantic search engines. There are already frameworks like Silk [28] which can be used to do this and it would be useful to integrate the Silk framework with the expansion algorithm.

One limitation inherent in AAT is that it currently does not do any kind of special handling of blank nodes, which means that if data contains blank nodes AAT will continuously think it has encountered new knowledge when most likely it has not. For the data we have worked with so far this is generally not an issue, since the linked data community tends to avoid blank nodes, but if we are to provide for preserving all kinds of RDF effectively then we need to handle blank nodes properly. Solving this problem may involve doing some sub-graph matching and isomorphism testing to see if the sections of the graph that contain blank nodes can be mapped to the previously seen sections of the graph, as in Tummarello et al's RDFSync [23]. The blank nodes themselves could either be left as-is or they could be translated to URIs as done by systems like the Talis17 platform.

Given that this work was inspired by traditional link integrity techniques from hypermedia, it is interesting to note that it has the potential to be applied back to the document web, since there is increasing cross-over between the document and data webs, primarily due to the increasing uptake of RDFa. As increasing numbers of documents embed structured data using RDFa it will become possible to preserve and monitor the structured information embedded in ordinary web pages in the same way as can be done with linked data now; therefore we envisage this as having applications in automated monitoring and maintenance of document-based websites.

As mentioned in Section 3.2.1, a lightweight schema is used by AAT to annotate and store the data, but there are alternative vocabularies that could have been used such as the provenance vocabulary [13] and the open provenance model [18]. It would be a fairly easy and potentially useful enhancement to map the AAT schema to these vocabularies such that the data could be retrieved in the desired form by users/client applications designed to work with those formats.

6. CONCLUSION

In this work we have introduced a simple but powerful expansion algorithm which can be used to retrieve linked data about a URI even when that URI is not resolvable. This provides an important tool for preserving data in the Semantic Web and recovering from data loss, and shows that in the Semantic Web links themselves can be exploited as a means to recover from broken links. As we have outlined in Sections 4.1 and 5, there is a need to conduct a detailed analysis of the algorithm to assess its efficacy for a wider variety of URIs and using domain-specific expansion profiles. Depending on the results of this analysis the algorithm may need to be further refined to improve both its speed and accuracy.

We have also presented the All About That (AAT) system, which allows users to monitor and preserve linked data they are interested in, using the expansion algorithm as the primary retrieval method for deciding which linked data to preserve based on a starting URI. As we demonstrated in Section 4.2.1, we envisage the usage of such a system as a base on which to build rich Semantic Web applications that can take and present the changing data in interesting and useful ways to end users. It also fulfils a role in the overall goal of our research, which is to provide a suite of algorithms and systems which can be used to manage both data and link integrity on the Semantic Web.

As discussed in Section 5, there are some limitations in the current versions of our algorithm and the AAT system which we intend to investigate and address in the future. It is clear that there is still a significant amount of work to be done to create a comprehensive set of tools such that they can be applied to as wide a variety of data on the Semantic Web as is possible, and experience of past research in link integrity for the document Web tells us that there will be no perfect solution.

Despite this, it is our belief that as the Semantic Web grows, data and link integrity will become increasingly important issues to users as their applications come to rely upon linked data. There is a need to have systems in place such that data can be preserved and accessed even if the original sources are gone or unavailable. This has already been seen with the release of services like the Sindice Cache API18, which is used as one of the data sources in the default expansion profile (see Section 3.1.1). Additionally, with the rising adoption of RDFa embedded inside documents on the web, systems like this become applicable to the preservation of the structured data embedded in the document-based Web as discussed in Section 5.

17 http://www.talis.com/platform
18 http://www.sindice.com/developers/cacheapi

7. REFERENCES

[1] Changeset protocol, 2007. http://n2.talis.com/wiki/Changeset_Protocol.
[2] Virtuoso sponger. Technical report, OpenLink Software, 2009. http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html.
[3] K. Alexander, R. Cyganiak, M. Hausenblas, and J. Zhao. Describing linked datasets: On the design and usage of void, the 'vocabulary of interlinked datasets'. In Proceedings of the Linked Data on the Web Workshop (LDOW2009), Madrid, Spain, April 2009. http://ceur-ws.org/Vol-538/ldow2009_paper20.pdf.
[4] H. Ashman. Electronic document addressing: dealing with change. ACM Comput. Surv., 32(3):201-212, 2000.
[5] T. Berners-Lee, R. Cailliau, A. Luotonen, H. F. Nielsen, and A. Secret. The world-wide web. Commun. ACM, 37(8):76-82, 1994.
[6] C. Bizer, R. Cyganiak, and T. Heath. How to publish linked data on the web, 2007. http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial.
[7] P. Bouquet, H. Stoermer, and B. Bazzanella. An entity name system (ens) for the semantic web. In 5th European Semantic Web Conference, ESWC 2008, volume 5021, page 258. Springer, 2008.
[8] G. Cheng, W. Ge, and Y. Qu. Falcons: searching and browsing entities on the semantic web. In WWW '08: Proceeding of the 17th international conference on World Wide Web, pages 1101-1102, New York, NY, USA, 2008. ACM.
[9] H. Davis. Data Integrity Problems in an Open Hypermedia Link Service. PhD thesis, University of Southampton, November 1995. http://eprints.ecs.soton.ac.uk/6597/.
[10] H. C. Davis. Hypertext link integrity. ACM Comput. Surv., page 28, 1999.
[11] A. M. Fountain, W. Hall, I. Heath, and H. C. Davis. Microcosm: an open model for hypermedia with dynamic linking. In Hypertext: concepts, systems and applications, pages 298-311, New York, NY, USA, 1992. Cambridge University Press.
[12] T. L. Harrison and M. L. Nelson. Just-in-time recovery of missing web pages. In HYPERTEXT '06: Proceedings of the seventeenth conference on Hypertext and hypermedia, pages 145-156, New York, NY, USA, 2006. ACM.
[13] O. Hartig and J. Zhao. Guide to the provenance vocabulary, 2009. http://sourceforge.net/apps/mediawiki/trdf/index.php?title=Provenance_Vocabulary.
[14] B. Haslhofer and N. Popitsch. DSNotify: Detecting and Fixing Broken Links in Linked Data Sets. In Proceedings of the 8th International Workshop on Web Semantics, 2009.
[15] A. Jaffri, H. Glaser, and I. Millard. Managing uri synonymity to enable consistent reference on the semantic web. In IRSW2008 - Identity and Reference on the Semantic Web 2008, 2008.
[16] F. Kappe. A scalable architecture for maintaining referential integrity in distributed information systems. Journal of Universal Computer Science, 1(2):84-104, 1995. http://www.jucs.org/jucs_1_2/a_scalable_architecture_for.
[17] F. Kappe, K. Andrews, J. Faschingbauer, M. Gaisbauer, M. Pichler, and J. Schipflinger. Hyper-G: A new tool for distributed hypermedia. Institutes for Information Processing Graz, 1994.
[18] L. Moreau, J. Freire, J. Futrelle, R. McGrath, J. Myers, and P. Paulson. The open provenance model. December 2007. http://eprints.ecs.soton.ac.uk/14979/.
[19] L. Moreau and N. Gray. A Community of Agents Maintaining Links in the World Wide Web (Preliminary Report). In The Third International Conference and Exhibition on The Practical Application of Intelligent Agents and Multi-Agents, pages 221-235, London, UK, Mar. 1998. http://www.ecs.soton.ac.uk/~lavm/papers/gcWWW.ps.gz.
[20] V. Papavassiliou, G. Flouris, I. Fundulaki, D. Kotzinos, and V. Christophides. On Detecting High-Level Changes in RDF/S KBs. In The Semantic Web: International Semantic Web Conference (ISWC2009), pages 473-488. Springer, 2009.
[21] T. A. Phelps and R. Wilensky. Robust hyperlinks: Cheap, everywhere, now. In Digital Documents: Systems and Principles, pages 514-549. Springer, 2004.
[22] G. Tummarello, R. Delbru, and E. Oren. Sindice.com: Weaving the Open Linked Data. In The Semantic Web: 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC, pages 552-565. Springer, 2008.
[23] G. Tummarello, C. Morbidoni, R. Bachmann-Gmür, and O. Erling. RDFSync: efficient remote synchronization of RDF models. In The Semantic Web: 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11-15, 2007, Proceedings, pages 537-551. Springer, 2007.
[24] L. Veiga and P. Ferreira. Repweb: replicated web with referential integrity. In SAC '03: Proceedings of the 2003 ACM symposium on Applied computing, pages 1206-1211, New York, NY, USA, 2003. ACM.
[25] L. Veiga and P. Ferreira. Turning the web into an effective knowledge repository. ICEIS 2004: Software Agents and Internet Computing, 14(17), 2004.
[26] R. Vesse, W. Hall, and L. Carr. All About That - a URI profiling tool for monitoring and preserving linked data. In ISWC 2009, August 2009. http://eprints.ecs.soton.ac.uk/17815.
[27] J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov. Discovering and maintaining links on the web of data. In A. Bernstein, D. R. Karger, T. Heath, L. Feigenbaum, D. Maynard, E. Motta, and K. Thirunarayan, editors, The Semantic Web - ISWC 2009, volume 5823 of Lecture Notes in Computer Science, pages 650-665. Springer, 2009.
[28] J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov. Silk - a link discovery framework for the web of data. In 2nd Linked Data on the Web Workshop (LDOW2009), 2009.