Preserving Linked Data on the Semantic Web by the application of Link Integrity techniques from Hypermedia

Rob Vesse, Wendy Hall, Leslie Carr
Intelligence, Agents & Multimedia Group
School of Electronics & Computer Science
University of Southampton, Southampton SO17 1BJ
{rav08r,wh,lac}@ecs.soton.ac.uk

ABSTRACT

As the Web of Linked Data expands it will become increasingly important to preserve data and links such that the data remains useful. In this work we present a method for locating linked data to preserve which functions even when the URI the user wishes to preserve does not resolve (i.e. is broken/not RDF) and an application for monitoring and preserving the data. This work is based upon the principle of adapting ideas from hypermedia link integrity in order to apply them to the Semantic Web.

General Terms: Experimentation, Reliability, Design

Keywords: Semantic Web, Linked Data, Link Integrity, Preservation

Copyright is held by the author/owner(s). WWW2010, April 26-30, 2010, Raleigh, North Carolina.

1. INTRODUCTION

The Web of Linked Data is characterised by the interlinking between disparate heterogeneous data sources and by the fact that the links between the data sources are one of the primary mechanisms for navigating through this data space. Since links are essential to the Web of Linked Data we believe that it is important to have mechanisms in place to maintain link integrity. The aim of link integrity is to ensure that a link works correctly: traversing the link takes you to a resource and, as far as possible, the resource is the one intended by the provider of the link. On a larger scale, link integrity deals with the overall integrity of interlinked datasets such as documents within a Content Management System (CMS) or the linked datasets available on the Semantic Web. Link integrity is therefore one way of ensuring data integrity within the overall system, which in our use case is a collection of linked datasets.

Link integrity is an existing and well known problem from hypermedia, where there were two problems to be dealt with: dangling links and the editing problem. Dangling links are the better known problem and are regularly experienced by users on the Web when they are presented with an HTTP error because the link they followed pointed to a resource which cannot be retrieved. The editing problem refers to the situation in which the content at the end of the link is changed so that it is no longer what the creator of the link intended to link to. Both these issues affect the Semantic Web since on the Semantic Web everything is interlinked data, and data is therefore immediately susceptible to dangling links. The editing problem becomes much more problematic on the Semantic Web since anyone can make a statement about anything, so the meaning of things on the Semantic Web is subject to semantic drift (see the discussion on the W3C Semantic Web Mailing List for an example of this: http://lists.w3.org/Archives/Public/semantic-web/2009May/0315.html).

Due to the interlinking between data on the Semantic Web we show that it is possible to exploit the data model such that links themselves can be used to recover missing data in the event of a dangling link being encountered. This provides a means to retrieve the data that was/may have been at a given URI even if that URI is no longer resolvable. Using this approach for locating data about a URI we are able to preserve and monitor data about a URI from multiple sources and to recover data about URIs that are no longer functioning, as described in Section 3.1.

The Semantic Web also introduces two additional link integrity problems specific to linked data. The first of these is URI identity and meaning: what does a URI mean, and does this meaning actually matter to the applications that use it and the data that contains it? This is very much an open research debate and beyond the scope of our current work. The second is the co-reference problem, which refers to the situation in which some 'thing' we wish to make statements about has multiple URIs that could be used for it. In this work we utilise existing work in this area of research as part of our algorithm for preserving linked data.
Section 2 of this paper covers the related work in link integrity for hypermedia and the Semantic Web, from which we draw ideas in Section 3 to design and develop our algorithm and the AAT software. Section 5 outlines our plans for future research in this area and we conclude in Section 6 by discussing the potential benefits of link integrity for the Semantic Web.

2. RELATED WORK

Link integrity in hypermedia first received serious attention in the late 1980s and early 1990s, primarily from researchers in the open hypermedia community. Systems like Microcosm [11] and Hyper-G [17] were among the first to consider the issue in depth; Davis's thesis [9] and Kappe's 1995 paper [16] provide examples of link integrity in open hypermedia. The widespread growth of the World Wide Web [5] in the mid-1990s led to some new research, but as search engines became commonplace towards the end of the decade research interest dwindled. It was perceived that users did not care sufficiently to warrant research into the problem as they could locate missing resources effectively using search engines; in addition the scale of the Web by that time was simply too vast for many proposed solutions to handle. Davis's survey [10] provides a good overview of the state of this research as of the end of the 1990s. Another reason for the decline in research was that the fact that links could fail was one of the reasons the Web was able to expand as fast as it did: it didn't matter if links failed and produced the familiar HTTP 404 error, so users were able to publish content without worrying about whether their links to external content were valid.
Ashman's 2000 paper [4], which discusses link integrity with particular reference to electronic document archives, provides both a useful survey of existing work and a key motivation for ongoing research. As more document collections were translated into digital forms and placed onto intranets, people once again started to be concerned about link integrity. Users wanted assurances that links into the document archives would work consistently and ideally that links out of the archives would work correctly as well, since it may not be possible to alter the archived documents without invalidating the integrity of the archive.

In this vein Veiga and Ferreira [24, 25] discuss the possibility of turning the Web into an effective knowledge repository by use of replication and versioning. Their work follows on from earlier work such as Moreau & Gray's [19], which proposed limited use of replication and versioning but relied significantly on author and user involvement in the process. In Veiga & Ferreira's work there is no requirement for author involvement; the end user need only use a browser plugin to indicate the content they wish to replicate and preserve. Their results showed that the user could preserve the sections of the Web they were interested in with no perceivable performance impact - on average there was only a 12ms increase in retrieval time for resources. In Section 3 we discuss using an approach of this kind for the Semantic Web.

Phelps & Wilensky introduced the concept of lexical signatures for Web pages in their Robust Hyperlinks paper [21]. They compute the lexical signature of a page and append it to all links to that page so that, in the event of the link failing, a browser plugin can use the signature to relocate the page using a search engine. The obvious flaw in their work was that it required rewriting all the links on the Web, but Harrison & Nelson later showed that these signatures need only be computed Just-in-Time (JIT) when a link fails [12]. In their Opal system the signatures can be computed JIT by retrieving cached copies of the pages from a search engine cache, computing the signature and then using search engines to relocate the page. As discussed in Section 3.1 a JIT style approach can be effectively used to recover linked data about a URI.

2.1 Semantic Web Research

Unlike the traditional Web it is not possible for semantic search engines like Sindice [22] and Falcons [8] to fulfil the same role as document search engines, because the users in the Semantic Web domain are typically client applications rather than humans. When a human encounters a dead link they usually navigate to a search engine and enter an appropriate search phrase to find alternative sources of information. A client application encountering a dead link will typically have no concept of how or where to find alternative sources of information, and URIs for linked data are not always ideal to search upon compared to textual search for documents. It should be noted that, as with the existing Web, if the Web of Linked Data is to undergo a massive expansion in the same way then things must be allowed to fail - but this does not mean we shouldn't attempt to mitigate the problem as far as possible.
In terms of the Semantic Web there has been research into the versioning and synchronisation of RDF data which is relevant to aspects of our work, such as Tummarello et al's RDFSync [23], an algorithm for efficiently synchronising changes in RDF between multiple machines. This shows that change detection in RDF is non-trivial due to the inherent data isomorphism caused by the use of blank nodes, but also shows that it can be achieved in an efficient manner. More recent research from Papavassiliou et al [20] has shown that information about very basic changes in the RDF - such as that provided by systems like RDFSync or All About That (see Section 3.2) - can be used to build applications which provide useful information to end users. In the case of Papavassiliou et al's paper they built a system which furnished users with high level descriptions of how RDFS vocabularies have changed in order to aid users in working with such vocabularies. In addition there are systems like the Talis Platform (http://www.talis.com/platform), a Semantic Web store that implements a versioning mechanism whereby updates can be made via a Changeset protocol [1]. As part of this protocol they utilise a useful lightweight vocabulary for publishing changes in RDF data as RDF which, as discussed in Section 3.2.3, we reuse in our own system.

Regarding Semantic Web specific link integrity problems, the research has largely focused on the co-reference problem. Since there are many organisations publishing similar data semantically (bibliographic databases being a prime example) there are frequently many URIs for a single entity such as an author. Co-reference research aims to develop ways to efficiently and accurately determine URI equivalences and to refactor the data or republish this information to help other Semantic Web applications. There are several competing philosophies, ranging from the Okkam approach described by Bouquet et al [7], which advocates universally agreed URIs for each entity, to the Co-reference Resolution Service (CRS) approach of Jaffri et al [15], which determines co-referent URIs and republishes the information in dedicated triple stores. The CRS approach taken by the ReSIST project (http://www.resist-noe.org/) within the RKB Explorer (http://www.rkbexplorer.com) application has potential for use in link integrity, as the information provided by a CRS could be utilised in a JIT fashion as in Harrison & Nelson's work; we demonstrate how this can be done in Sections 3 and 4.

In terms of link maintenance for the Semantic Web there has been some research in the form of the Silk framework by Volz et al [28], a framework for computing links between different datasets. Their approach allows users to stipulate arbitrarily complex matching criteria for entity matching between datasets; the links produced from this can then be published via a CRS style service or added to the relevant datasets. As proposed in their later paper [27] this can be used as part of a link maintenance strategy; the possibility of combining this with our approach is discussed in Section 5. In a similar vein Haslhofer and Popitsch's DSNotify system [14] can monitor linked resources and inform the application when links are no longer valid, using feature based similarity metrics like the Silk framework.

3. METHOD

As we have discussed, it is not realistic to maintain link integrity in a pre-emptive way since such solutions have consistently been shown in previous work not to scale to Web scale. Therefore the focus must be on recovery in the event of failure and on preservation to guard against the loss of data which is considered interesting/useful to end users. As the amount of data in the Web of Linked Data starts to expand massively - particularly with linked data being adopted by an increasing number of major organisations - we expect that, as with the early document web, there will be an increasing amount of content published by both big companies and individuals. Just like the document web, this explosion of content will most likely include much content that is poorly maintained and will lead to increasing numbers of broken links. We have two connected goals in this work: 1) to provide a means to retrieve resource descriptions in the form of linked data about a URI even when the URI is non-functional, and 2) to provide the means for an end user to preserve and version these descriptions. To attempt to solve this problem we present an expansion algorithm for retrieving linked data about a URI even if that URI itself has failed in Section 3.1, and a preservation system built using this algorithm in Section 3.2.
3.1 Expansion Algorithm

Since the goal of this work is to preserve linked data it was deemed essential that, as far as possible, we leverage existing linked data technologies and services to effect this preservation. To this end we designed a relatively simple algorithm which uses simple crawling techniques directed by a user definable expansion profile (see Definition 1). Our aim with this algorithm is to provide resource descriptions of a URI regardless of whether the URI itself is dereferenceable.

Even in the case where a URI is used only as an identifier in the description of another resource and is not itself dereferenceable, it is likely that we can still retrieve some data about it. The fact that a URI is minted only as an identifier, and that the person/organisation minting the URI does not provide the means to dereference it, does not affect our ability to find data about it provided that the identifier is used elsewhere, i.e. it is reused as part of linked data.

Definition 1. An expansion profile is a Vocabulary of Interlinked Datasets (VoID) description of a set of datasets and linksets that should be used to locate linked data about the URI of interest. The VoID description may optionally be annotated with additional properties which affect the behaviour of the algorithm.

Drawing on the ideas described in Alexander et al's Vocabulary of Interlinked Datasets (VoID) paper [3] about the way VoID can be used to direct crawlers, we decided to use VoID as the primary means of expressing an expansion profile. We introduce a couple of additional predicates, since we require the means to allow end users to specify some basic characteristics of how the algorithm should behave and there is a type of service we need to express which is not contained in the VoID ontology. VoID has concepts of Datasets and Linksets: the former represent a set of data which may have SPARQL endpoint(s) and/or URI lookup endpoint(s), while the latter represent the types of interlinkings between datasets. What VoID does not have a means to express is the location of a service, provided by a dataset, which allows an application to retrieve URIs considered equivalent to a given URI - this we term a URI discovery endpoint (see Definition 2). A discovery endpoint differs from a lookup endpoint in that the latter is expected to return everything the dataset knows about the given URI as opposed to only returning equivalent URIs. Examples of existing discovery endpoints on the Semantic Web include RKBExplorer's CRSes [15] and sameAs.org (http://www.sameas.org). Another key difference between a lookup and a discovery endpoint is that links discovered from a discovery endpoint are considered to be on the same level of the crawl for the purposes of the algorithm, i.e. they do not have increased depth relative to the URI that discovery is performed upon. By this we mean that the execution of the algorithm results in a breadth-first, depth-limited linked data crawl starting from a given URI - in this tree structure a discovery endpoint introduces sibling nodes for a URI while a lookup endpoint introduces child nodes for a URI.

Our other extensions to VoID allow individual datasets/linksets to be marked as ignored (the algorithm will not use them) and allow the user to define to what depth the algorithm should crawl (this defaults to 1). These extensions are defined as part of the AAT schema detailed in Section 3.2.1.

Definition 2. A URI discovery endpoint is an endpoint that, when passed a URI, returns a Graph containing equivalent URIs of the input URI, typically in the form of owl:sameAs links.
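To make the shape of an expansion profile concrete, the following is a minimal illustrative sketch in Turtle. The void: terms are standard VoID; the aat: namespace and the property names aat:maxDepth, aat:uriDiscoveryEndpoint and aat:ignore are placeholders of our choosing, since the exact terms of the AAT schema are not reproduced in this paper.

    @prefix void: <http://rdfs.org/ns/void#> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix aat:  <http://example.org/aat#> .              # placeholder namespace

    <#profile> aat:maxDepth 1 .                            # crawl depth (placeholder term)

    <#dbpedia> a void:Dataset ;
        void:sparqlEndpoint <http://dbpedia.org/sparql> .

    <#sameas> a void:Dataset ;
        aat:uriDiscoveryEndpoint <http://www.sameas.org/> .  # see Definition 2 (placeholder term)

    <#people> a void:Linkset ;
        void:linkPredicate foaf:knows .                    # treat foaf:knows as a link to follow

    <#noisy> a void:Dataset ;
        aat:ignore true .                                  # marked ignored (placeholder term)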
As already stated, the actual algorithm is a simple crawler which uses the input expansion profile as a guide to which potential sources of linked data it should use to try and find data about the URI of interest - this procedure is detailed in Algorithm 1. Note that the algorithm does not terminate in the event of an error retrieving data from a particular URI/endpoint but simply continues; by doing this it is still possible to retrieve some data even if the starting URI does not return a valid response. The algorithm will continue to issue queries about the URI to the various endpoints described in the given expansion profile, so unless the URI refers to a document that has very poor linkages or was not indexed by the semantic search services used, some RDF will be returned. This approach has similarities to the JIT style approach of Harrison & Nelson [12] in that there need not be any foreknowledge of the URIs you wish to recover data about when you discover they are broken, since by utilising the caches and lookup services of relevant datasets it is still possible to recover data about the URI.

The basic behaviour of the algorithm is only to follow owl:sameAs and rdfs:seeAlso links, but the end user can specify that any predicate be treated as a link to follow by specifying an appropriate VoID linkset in their expansion profile.

Algorithm 1 Expansion Algorithm
Require: URI, Expansion Profile
 1: ToExpand as a set of pairs of URIs and Depths
 2: while ToExpand ≠ ∅ do
 3:   Remove first pair from ToExpand
 4:   if Graph with URI is already in the Dataset then
 5:     Continue
 6:   if Depth > Max Depth then
 7:     Continue
 8:   Retrieve the Graph at the URI
 9:   Add the Graph to the Dataset
10:   for all Triples in Graph do
11:     if Triple is a Link then
12:       Add a new pair to ToExpand
13:   for all Datasets in Expansion Profile do
14:     if Dataset has a SPARQL Endpoint then
15:       Issue a DESCRIBE for the URI against the Endpoint
16:       Add resulting Graph to the Dataset
17:       Process the Graph for additional Links
18:     if Dataset has a Lookup Endpoint then
19:       Issue a Lookup for the URI against the Endpoint
20:       Add resulting Graph to the Dataset
21:       Process the Graph for additional Links
22:     if Dataset has a Discovery Endpoint then
23:       Issue a Discovery for the URI against the Endpoint
24:       for all Equivalent URIs do
25:         Add a new pair to ToExpand
26: return Dataset
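For readers who prefer an executable form, the following Python sketch (using rdflib and SPARQLWrapper; our own implementation is not written in Python) condenses Algorithm 1. Lookup and discovery endpoints are modelled as user-supplied callables since there is no single standard protocol for them; all names here are illustrative rather than the AAT API.

    from collections import deque
    from rdflib import Graph, URIRef
    from rdflib.namespace import OWL, RDFS
    from SPARQLWrapper import SPARQLWrapper

    LINKS = {OWL.sameAs, RDFS.seeAlso}  # default link predicates

    def describe(endpoint, uri):
        """Lines 14-15: issue a SPARQL DESCRIBE for the URI against an endpoint."""
        s = SPARQLWrapper(endpoint)
        s.setQuery("DESCRIBE <%s>" % uri)
        return s.query().convert()  # DESCRIBE results parse into an rdflib Graph

    def expand(start_uri, profile, max_depth=1):
        """Breadth-first depth-limited crawl. `profile` is a list of dicts with
        optional 'sparql' (endpoint URL) and 'lookup'/'discover' (callables)."""
        dataset = {}  # graph URI -> rdflib Graph
        to_expand = deque([(URIRef(start_uri), 0)])

        def follow(graph, depth):
            # Lines 10-12, 17, 21: queue the targets of link triples as children
            for _, p, o in graph:
                if p in LINKS and isinstance(o, URIRef):
                    to_expand.append((o, depth + 1))

        while to_expand:
            uri, depth = to_expand.popleft()
            if uri in dataset or depth > max_depth:
                continue  # lines 4-7
            graph = Graph()
            try:
                graph.parse(uri)  # line 8: dereference; failure is non-fatal
            except Exception:
                pass  # the crawl continues even when retrieval fails
            dataset[uri] = graph
            follow(graph, depth)
            for ds in profile:  # line 13
                try:
                    if "sparql" in ds:
                        g = describe(ds["sparql"], uri)
                        graph += g; follow(g, depth)
                    if "lookup" in ds:  # lines 18-21: callable returns a Graph
                        g = ds["lookup"](uri)
                        graph += g; follow(g, depth)
                    if "discover" in ds:  # lines 22-25: equivalents are siblings
                        for equiv in ds["discover"](uri):
                            to_expand.append((URIRef(equiv), depth))
                except Exception:
                    pass  # endpoint errors never terminate the algorithm
        return dataset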
There are already some existing systems which work in a similar way to our algorithm, such as the sponger middleware used in Virtuoso [2]. The main difference between our algorithm and algorithms such as those in the Virtuoso sponger is that our algorithm is only interested in linked data and does not infer/create any additional data. Unlike the Virtuoso sponger it does not attempt to turn non-linked data into RDF and it does not do any inference over the data it returns; it is designed only to find and return (in the form of an RDF dataset) linked data about the URI of interest. Yet, as expansion profiles may reference any datasets and associated endpoints they wish, there is no reason why a user could not direct our algorithm to utilise a service like URIBurner (http://www.uriburner), which uses the Virtuoso sponger, in order to get the benefits of the additional inferred data.

3.1.1 Default Profile

Since the end user of such an algorithm may not always know where to look for linked data about the URI they are interested in, the algorithm has a default expansion profile which is used when no profile is specified. This profile uses 3 data sources which are in our opinion important hubs of the Web of Linked Data:

• DBPedia (http://dbpedia.org) - the DBPedia SPARQL endpoint is used to look up URIs

• Sindice (http://www.sindice.com) Cache - the Sindice Cache API (http://www.sindice.com/developers/cacheapi) allows the retrieval of Sindice's cached copy of the RDF from a URI

• SameAs.org (http://www.sameas.org) - SameAs.org provides a URI discovery endpoint (see Section 3.1 and Definition 2) which can be used to find URIs which are equivalent to a given URI

The default profile (http://www.dotnetrdf.org/expander/defaultProfile) has a max expansion depth of 1, which means it only considers URIs which are immediate neighbours of the starting URI.

In the case where the end user does know which linked data sources will have useful information about the URI, they can specify their own expansion profile which is used instead of the default profile. In this case the algorithm will use the datasets and linksets they define in the profile to discover linked data about the URI of interest; for example, if attempting to recover data about a person it may be useful to follow foaf:knows links.
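Tying this to the Python sketch above, either kind of profile reduces to the same structure; a default-profile-like invocation might look as follows, where sindice_lookup and sameas_discover stand for user-written wrappers around the respective services and are purely illustrative.

    profile = [
        {"sparql": "http://dbpedia.org/sparql"},   # DBPedia lookup via DESCRIBE
        {"lookup": sindice_lookup},                # hypothetical Sindice Cache wrapper
        {"discover": sameas_discover},             # hypothetical sameAs.org wrapper
    ]
    dataset = expand("http://dbpedia.org/resource/Southampton", profile, max_depth=1)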
3.2 Preservation

The preservation approach taken is to allow the end user to monitor and preserve a set of linked data that they are interested in. The data is preserved not at the data source but rather at a local level on the user's server, with the user able to republish this data as they desire. This is in line with the ideas of Veiga & Ferreira [25] in that the end user specifies the parts of the Web they want to preserve and the software then takes care of this. The data must be preserved in such a way that the original data can be efficiently extracted from it and sufficient information is kept to provide versioning over the data.

In the Semantic Web domain the objects of interest are URIs, so we propose that a profile of a URI be preserved (see Definition 3). Since the data being processed is RDF it is logically divided into triples, which can be preserved and monitored individually. It is deemed necessary to store information pertaining to the temporality and provenance of each triple: when it was first seen, when it was last updated, its source URI(s) and whether it has changed or been retracted/deleted from the RDF.

Definition 3. A URI's profile is the transformed and annotated form of the linked data retrievable about a given URI, such that the temporality and provenance of the triples contained therein are inferable from the profile.

In terms of user interface the system should allow a user to view a profile both in the stored form and in its original form. The system must monitor the original data source over time, updating the profiles as necessary, such that it can provide a report of changes in the data to the user. Since a URI profile will contain versioning information the interface should allow a user to view a particular version of the profile.

3.2.1 Schema

As the first stage of implementation an RDF Schema for All About That (AAT) is defined (available at http://www.dotnetrdf.org/AllAboutThat/) which embodies classes and properties allowing the description and annotation of triples in such a way that the required information discussed in the preceding proposal can be stored for each triple. The schema defines a class for representing profiles called aat:Profile and uses the rdf:Statement class to represent triples. rdf:Statement is used as the basis of triple storage as it makes it possible for non-AAT aware tools to extract the original triples from the profile easily. A number of properties are defined which store metadata about the profile itself, such as created & updated dates, source URI and a locally unique identifier for the profile. Similar properties are defined for triples which allow the first and last asserted dates, source URI and change status of a triple to be indicated. A key distinction in the schema is between aat:profileSource and aat:source: despite storing equivalent data, two predicates are created since the former expresses the URI which is the starting point for the profile while the latter expresses all the URIs at which a given triple is asserted.

While there were alternative schemas and vocabularies available that could potentially have been used to store the required data, the motivation behind designing our own schema was to provide a lightweight schema that attached all data to a single subject for ease of processing. Alternatives such as the Provenance Vocabulary by Hartig & Zhao [13] are far more expressive but potentially require introducing multiple intermediate blank nodes, which would significantly complicate the processing needed to implement many of the core features of AAT. Similarly the Open Provenance Model as described by Moreau et al [18] is highly expressive, but like Hartig & Zhao's vocabulary its RDF serialization is overly complex for use in AAT. As discussed in Section 5 there is no reason why the data contained in AAT could not be exposed in other provenance vocabularies, but for AAT's processing and storage a lightweight vocabulary is preferable.

Figure 1: Original Triple

Figure 2: Triple transformed to AAT Annotated Form
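The figures themselves are not reproduced in this text, but based on the schema description above they would resemble the following Turtle sketch; aat:firstAsserted and aat:lastAsserted are illustrative stand-ins for the schema's actual date properties, and the triple itself is invented.

    @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
    @prefix aat:  <http://example.org/aat#> .                      # placeholder namespace

    # Figure 1: an original triple
    <http://example.org/A> foaf:name "Example" .

    # Figure 2: the same triple in AAT annotated form
    _:t a rdf:Statement ;
        rdf:subject   <http://example.org/A> ;
        rdf:predicate foaf:name ;
        rdf:object    "Example" ;
        aat:source    <http://example.org/A> ;                     # URI(s) asserting the triple
        aat:firstAsserted "2010-02-12T09:00:00Z"^^xsd:dateTime ;   # illustrative term
        aat:lastAsserted  "2010-02-18T09:00:00Z"^^xsd:dateTime .   # illustrative term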
The use of reification was chosen over the use of named graphs primarily due to the need to make annotations at the level of individual triples rather than at the graph level; its usage is also motivated by the fact that the mechanism provides a clear and obvious schema for encoding a triple and adding additional annotations to it. While reification may significantly increase the size of the data being stored initially, over time this balances out compared to named graphs, where it is necessary either to store many copies of the same graph or to store multiple named graphs representing a series of deltas to the original data. The other difficulty inherent in the named graphs approach is that the annotations would then typically be held separately in other named graphs, which adds to the complexity of the data processing. Nevertheless, named graphs are used within AAT since each profile naturally forms a named graph, and AAT generates several related named graphs about each profile detailing change history and changesets as described in Section 3.2.3.

Since the user needs both to browse the data they are preserving and potentially to republish it, a Web based interface was designed as the primary interaction mechanism. The interface allows users to explore the data by first selecting a profile to view and then allowing them to view profile contents, exports, versions and change reports. A user may also use the interface to add new URIs they wish to monitor to the system and to initiate updates to profiles (see Definition 4). Following linked data best practices [6], and to provide the ability for the user to republish their preserved data, multiple dereferenceable URIs for each profile are created and made accessible through the Web interface. These allow the retrieval of the profile contents (which consist of all the triples ever retrieved from the profile URI, in the transformed form), the export of the profile (see Definition 5) and various meta graphs about a profile, e.g. change history and changesets. This means that the profile of a URI has a URI and thus could itself be profiled if desired.

Definition 4. An update of a profile occurs when AAT uses the Expansion algorithm to retrieve RDF about the given URI. The triples contained are compared with the triples currently in the profile and the profile is updated accordingly.

Definition 5. The export of a profile is the recreation of the RDF in its original form based upon the current contents of the profile. An export represents the RDF as it was last seen by AAT.

3.2.2 Profile Creation & Update

To create a URI's profile, linked data about the URI is first retrieved using the expansion algorithm presented in Section 3.1; then, using the AAT schema, each triple can be transformed into a set of triples which represent an annotation of the original triple. For each triple in the original RDF a blank node is created which is then used as the subject of a set of triples representing the required information about the original triple. Figure 1 shows an example triple and Figure 2 shows it transformed into the AAT form. A URI's profile consists of a set of transformed triples, where each profile is a named graph in the underlying store.
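A sketch of this transformation in Python with rdflib follows; as before, the aat: namespace and the date property names are illustrative rather than the schema's actual terms.

    from datetime import datetime, timezone
    from rdflib import Graph, BNode, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    AAT = Namespace("http://example.org/aat#")  # placeholder namespace

    def to_profile(source_uri, graph):
        """Transform each triple of `graph` into its AAT annotated form."""
        profile, now = Graph(), Literal(datetime.now(timezone.utc))
        for s, p, o in graph:
            node = BNode()  # one annotation node per original triple
            profile.add((node, RDF.type, RDF.Statement))
            profile.add((node, RDF.subject, s))
            profile.add((node, RDF.predicate, p))
            profile.add((node, RDF.object, o))
            profile.add((node, AAT.source, URIRef(source_uri)))
            profile.add((node, AAT.firstAsserted, now))  # illustrative term
            profile.add((node, AAT.lastAsserted, now))   # illustrative term
        return profile

An export (Definition 5) simply reverses this: for every rdf:Statement node, its rdf:subject, rdf:predicate and rdf:object values are reassembled into an ordinary triple.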
3.2.3 Change Reporting

A key feature of AAT is the ability to generate change reports about how the RDF at the profiled URI has changed over time. To do this a number of relatively simple computations over the annotated triples can be made, based primarily on the first and last asserted dates of the triples. In creating change reports four different types of change in the RDF are looked for (see Definitions 6-9). A distinction is made between missing knowledge and retracted or deleted knowledge, as it is possible for triples to be only temporarily non-present in the RDF. For example, in the event of a transient network issue making some/all of the relevant URIs unretrievable, the updated date for the profile will still be updated, leaving all the triples in the profile appearing to be missing. The length of time we require triples to be missing before we consider them deleted is currently set to 7 days for our monitoring of the BBC dataset described in Section 4.2.1; this time period is a domain specific parameter that can be adjusted depending on the data being monitored.

Definition 6. New knowledge is any triple that is new to the RDF at the profiled URI.

Definition 7. Changed knowledge is any triple where the object of the triple has changed. Only triples where the predicate has a cardinality of 1 can be considered to change.

Definition 8. Missing knowledge is any triple no longer found in the RDF at the profiled URI but which was recently seen in the RDF.

Definition 9. Retracted or deleted knowledge is any triple no longer found in the RDF at the profiled URI which has not been seen for a reasonable length of time.

In regard to the concept of changed knowledge, consider some arbitrary predicates ex:one and ex:many which have cardinalities of 1 and unrestricted respectively. Since ex:one has a cardinality of 1, it can be said that whenever the object of that triple has changed it is changed knowledge. The same cannot be said for ex:many triples: as the predicate has unrestricted cardinality, each triple using this predicate must be treated as a unique entity, i.e. one instance of a triple using this predicate cannot be considered to replace another. In the examples below (where the objects <B> and <D> are arbitrary) the fact that <A> was related to <C> via the predicate ex:many in Example 1 and is now instead related to <E> in Example 2 doesn't mean <A> is related to <E> instead of <C>; it just means <A> no longer considers itself related to <C>. The fact that <A> is related to <E> is new knowledge while the fact that it was related to <C> is missing/deleted knowledge; but if the value of the ex:one relationship had changed then that would be considered changed knowledge.

Example 1 Original Graph
<A> ex:one <B> .
<A> ex:many <C> .
<A> ex:many <D> .

Example 2 Modified Graph
<A> ex:one <B> .
<A> ex:many <D> .
<A> ex:many <E> .

When a change report is computed it is itself serialized into an RDF graph using the Talis Changeset ontology [1], which is stored as a named graph in the underlying store and republished via the web interface. Each Changeset generated links back to the previous Changeset (if one exists) such that an end user/client application consuming the data can follow the history of changes; a special URI which retrieves the most recent Changeset is provided so that users have a starting point for this. Separate to Changesets, a named graph containing a history for each profile is also stored which links to all the relevant Changesets for a profile.
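The classification described by Definitions 6-9 can be sketched as a simple set computation over the previous and current triples, parameterised by the set of cardinality-1 predicates and the deletion grace period; this is an illustration of the definitions, not AAT's actual code.

    from datetime import timedelta

    def classify(previous, current, last_seen, now, functional, grace=timedelta(days=7)):
        """previous/current: sets of (s, p, o) tuples; last_seen maps a triple to
        the datetime it was last asserted; functional holds cardinality-1 predicates."""
        new, changed, missing, deleted = set(), set(), set(), set()
        for s, p, o in current - previous:
            if p in functional and any(t[0] == s and t[1] == p for t in previous):
                changed.add((s, p, o))   # Definition 7: cardinality-1 object replaced
            else:
                new.add((s, p, o))       # Definition 6: new knowledge
        for s, p, o in previous - current:
            if p in functional and any(t[0] == s and t[1] == p for t in current):
                continue                 # already reported as changed knowledge
            if now - last_seen[(s, p, o)] > grace:
                deleted.add((s, p, o))   # Definition 9: retracted/deleted knowledge
            else:
                missing.add((s, p, o))   # Definition 8: possibly transient absence
        return new, changed, missing, deleted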
4. RESULTS

4.1 Expansion

To test the expansion algorithm we took a small sample of URIs which included the URIs of the authors, places associated with the authors, and TV programmes from the BBC (since we use the BBC Programmes dataset for our preservation tests as described in Section 4.2). The results shown in Table 1 show that the amount of linked data that can be obtained using the default expansion profile described in Section 3.1.1 varies depending on the URI being profiled. Expanding the URI of a person potentially produces a large number of small graphs, particularly if that person is a well published academic, since many bibliographic databases are exposed as linked data and provide small amounts of data about people. URIs for places return varying amounts of data depending on the size and relative importance of the place. Conversely, expanding the URIs of BBC programmes using the default profile produces very little linked data; we suspect that this is due to the type of data and the fact that the linking it uses is mostly based on the BBC's ontologies. As outlined in Section 5 we plan to conduct experiments in the future to assess the efficacy of the algorithm on various types of data and using domain specific expansion profiles.

One of the benefits of the algorithm, as can be seen in the results in Table 1, is that it is trivially parallel. Increasing the number of threads used to process the discovered URIs shows a significant reduction in the time taken to retrieve the linked data. Experiments were conducted with higher numbers of threads, but 8 threads was found to be optimal since beyond 8 threads erratic behaviour is observed due to two factors: 1. underlying limitations of the HTTP API used in terms of stable concurrent connections and 2. high volumes of concurrent access to a single site look like DoS attacks and lead to temporary bans on accessing those sites. Differences in the number of triples and graphs returned for URIs can be attributed to a couple of factors. In the case of the London URI, where the difference is dramatic - over 200,000 triples - this is because with a smaller number of threads connections seem more likely to time out, though we are unsure why this is. In the other cases many of the graphs were from the same domain name and the API used to retrieve the RDF had a bug in connection management for multiple concurrent connections to the same domain, which caused connections to fail unexpectedly; this is why a reduction in the amount of data is observed as the number of threads increased.

Table 1: Sample Expansion Algorithm Results

URI                                          Graphs   Triples   Time (s)   Threads
Rob Vesse                                       4        115       13.8        1
http://id.ecs.soton.ac.uk/person/11471          4        115        1.8        2
                                                4        115        1.8        4
                                                4        115        2.1        8
Wendy Hall                                    691      4,068      786.3        1
http://id.ecs.soton.ac.uk/person/1650         692      4,070      383.8        2
                                              692      4,070      375.9        4
                                              692      4,070      359.6        8
Les Carr                                      368      2,694      438.9        1
http://id.ecs.soton.ac.uk/person/60           279      2,516      109.5        2
                                              238      2,434       75.9        4
                                              204      2,366       64.8        8
Ilkeston                                        6        444       19.1        1
http://dbpedia.org/resource/Ilkeston            5        393       13.5        2
                                                5        416        9.6        4
                                                5        393        5.3        8
Southampton                                    24      3,735       57.2        1
http://dbpedia.org/resource/Southampton        23      3,497       43.8        2
                                               23      3,497       27.3        4
                                               23      3,497       55.3        8
Nottingham                                     17      4,154       41.4        1
http://dbpedia.org/resource/Nottingham         16      4,048       39.5        2
                                               16      4,048       27.4        4
                                               16      4,048       25.9        8
London                                         13     53,886      142.4        1
http://dbpedia.org/resource/London             13     53,870      211.9        2
                                               13     53,870      149.8        4
                                               14    280,424      385.8        8
Eastenders                                      2        612        1.8        1
http://www.bbc.co.uk/programmes/b006m86d        2        612        0.7        2
                                                2        612        0.6        4
                                                2        612        0.7        8
Panorama                                        2        174        1.4        1
http://www.bbc.co.uk/programmes/b006t14n        2        174        0.9        2
                                                2        174        0.7        4
                                                2        174        0.6        8
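The parallelisation is straightforwardly expressed with a thread pool over the current crawl frontier; a minimal sketch (again not AAT's actual implementation) might be:

    from concurrent.futures import ThreadPoolExecutor
    from rdflib import Graph

    def fetch(uri):
        """Dereference one URI, returning an (often empty) graph on failure."""
        g = Graph()
        try:
            g.parse(uri)
        except Exception:
            pass
        return g

    # 8 workers: Table 1 suggests this as the practical optimum before remote
    # sites begin treating the crawler's concurrent requests as a DoS attack.
    with ThreadPoolExecutor(max_workers=8) as pool:
        graphs = dict(zip(uris, pool.map(fetch, uris)))  # `uris`: the crawl frontier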
pro- grammes) associated with the service BBC1 (the BBCs main TV channel) since this includes many brands which change Figure 4: All About That Architecture regularly such as soaps and news broadcasts. Table 2 demon- strates the average number of changes detected over just a short period. AATs architecture is constructed as shown in Figure 4, As can be seen in Table 2 you can see that the BBC up- and as can be seen it is decomposed into several compo- date their dataset on a daily basis, the initial high number nents which then rely on some external standalone compo- of changes is due to starting from a base dataset that was nents: an RDF API and the expansion algorithm. AAT a couple of months old due to architectural changes made is theoretically agnostic of its underlying storage though in to AAT to support the use of the expansion algorithm and practise differences in implementation between triple stores improve the efficiency of the system. The average number mean only certain stores are currently viable for use as the of changes being 2 is due to the fact that the typical up- backing store. In the early prototyping stage a RDBMS date we see the BBC make to their data is that they add a based store was used which was sufficient for initial proto- triple describing a newly broadcast episode of a programme typing but not scalable for real world testing so then the us- and update the value of the dc:modified triple. The appar- age of production grade triple stores was adopted. Initially ently high number of 25 for the maximum changes is due to it was intended to use the open source release of Virtuoso14 one of the program URIs failing to resolve resulting in the as the backing store but it was found that Virtuoso didn’t contents of that profile being considered to missing so the correctly preserve boolean typed literals which created is- change report for each day reports those triples as removed. sues in the internal processing of data within AAT. 4store15 The relatively high number of profiles changing each day was then used briefly but it was found that it was unable to is due to the fact that as already stated many of the pro- handle the heavy volume of parallel read/writes which AAT grammes associated with BBC 1 are broadcast daily such uses during its data processing due to 4store’s concurrency as soaps and news bulletins and that the BBC publish data model. Currently AAT runs again AllegroGraph16 since it about programmes several days before the programmes are has demonstrated in testing the ability to handle the high actually broadcast. volumes of read/writes necessary for using AAT on the large To demonstrate the reuse of the data being harvested we dataset described in the preceding section. created a demonstration application which is a simple web In terms of general scalability the majority of algorithms based faceted browser which lets users browse through in- in AAT need to run on a single thread for each profile but formation about recently shown BBC shows. Facets can be it is trivial to process multiple profiles in parallel and this used to filter by Genre and Channel and the user can view is the approach taken currently. Since work can be divided detailed information about both programmes and the indi- over multiple threads it will also be possible to significantly vidual episodes. 
4.2.2 Architecture & Scalability

AAT's architecture is constructed as shown in Figure 4; as can be seen, it is decomposed into several components which rely on some external standalone components: an RDF API and the expansion algorithm. AAT is theoretically agnostic of its underlying storage, though in practice differences in implementation between triple stores mean only certain stores are currently viable for use as the backing store. In the early prototyping stage an RDBMS based store was used, which was sufficient for initial prototyping but not scalable for real world testing, so the usage of production grade triple stores was then adopted. Initially it was intended to use the open source release of Virtuoso (http://www.openlinksw.com/virtuoso) as the backing store, but it was found that Virtuoso didn't correctly preserve boolean typed literals, which created issues in the internal processing of data within AAT. 4store (http://4store.org) was then used briefly but was found to be unable to handle the heavy volume of parallel reads/writes which AAT uses during its data processing, due to 4store's concurrency model. Currently AAT runs against AllegroGraph (http://www.franz.com/agraph/allegrograph/), since it has demonstrated in testing the ability to handle the high volumes of reads/writes necessary for using AAT on the large dataset described in the preceding section.

Figure 4: All About That Architecture

In terms of general scalability the majority of algorithms in AAT need to run on a single thread for each profile, but it is trivial to process multiple profiles in parallel and this is the approach currently taken. Since work can be divided over multiple threads it will also be possible to significantly increase scalability by dividing the work over a cluster of machines, which would allow much larger datasets to be monitored efficiently.

5. FUTURE WORK

There are a number of things that could be done to improve the expansion algorithm outlined in Section 3.1, with regard to both making it more intelligent in how it retrieves linked data and conducting a detailed analysis of the data returned. Manual inspection of the data shows that it does appear to be relevant to the URI of interest, but we propose that a full IR analysis be conducted in order to statistically confirm this initial assessment. Additionally, as was seen in Table 1, some types of URIs produced very little linked data using the default expansion profile; a broader analysis using domain specific profiles is necessary to ascertain whether those URIs have low levels of interlinking or whether the interlinkings just use domain specific links rather than the generic owl:sameAs and rdfs:seeAlso links that are followed by default.

In terms of improving the intelligence of the algorithm: at the moment it submits every URI to every SPARQL, lookup and discovery endpoint described in the expansion profile, and it would improve the speed of the algorithm if it could apply some decision making as to which endpoints a given URI should be submitted to. Conversely, there is the possibility that this would impact the effectiveness of the algorithm, so it would be necessary to conduct experiments to determine whether there is a trade off between speed and accuracy. It is also worth considering that searching on URIs is not the only viable mechanism for finding additional linked data about a URI of interest.
Using terms extracted from the RDF, such as the objects of rdfs:label or dc:title triples, would provide a way to augment URI based lookup with term/text based search results from semantic search engines. There are already frameworks like Silk [28] which can be used to do this, and it would be useful to integrate the Silk framework with the expansion algorithm.

One limitation inherent in AAT is that it currently does not do any kind of special handling of blank nodes, which means that if data contains blank nodes AAT will continuously think it has encountered new knowledge when most likely it has not. For the data we have worked with so far this is generally not an issue, since the linked data community tends to avoid blank nodes, but if we are to provide for preserving all kinds of RDF effectively then we need to handle blank nodes properly. Solving this problem may involve doing some sub-graph matching and isomorphism checking to see whether the sections of the graph that contain blank nodes can be mapped to previously seen sections of the graph, as in Tummarello et al's RDFSync [23]. The blank nodes themselves could either be left as-is or be translated to URIs as done by systems like the Talis platform (http://www.talis.com/platform).

As mentioned in Section 3.2.1 a lightweight schema is used by AAT to annotate and store the data, but there are alternative vocabularies that could have been used such as the provenance vocabulary [13] and the open provenance model [18]. It would be a fairly easy and potentially useful enhancement to map the AAT schema to these vocabularies such that the data could be retrieved in the desired form by users/client applications designed to work with those formats.

Given that this work was inspired by traditional link integrity techniques from hypermedia, it is interesting to note that it has the potential to be applied back to the document web, since there is increasing cross-over between the document and data web, primarily due to the increasing uptake of RDFa. As increasing numbers of documents embed structured data using RDFa it will become possible to preserve and monitor the structured information embedded in ordinary web pages in the same way as can be done with linked data now; we therefore envisage this as having applications in the automated monitoring and maintenance of document based websites.

6. CONCLUSION

In this work we have introduced a simple but powerful expansion algorithm which can be used to retrieve linked data about a URI even when that URI is not resolvable. This provides an important tool for preserving data in the Semantic Web and recovering from data loss, and shows that in the Semantic Web links themselves can be exploited as a means to recover from broken links. As we have outlined in Sections 4.1 and 5 there is a need to conduct a detailed analysis of the algorithm to assess its efficacy for a wider variety of URIs and using domain specific expansion profiles. Depending on the results of this analysis the algorithm may need to be further refined to improve both its speed and accuracy.

We have also presented the All About That (AAT) system, which allows users to monitor and preserve linked data they are interested in, using the expansion algorithm as the primary retrieval method for deciding which linked data to preserve based on a starting URI. As we demonstrated in Section 4.2.1, we envisage the usage of such a system as a base on which to build rich Semantic Web applications that can take the changing data and present it in interesting and useful ways to end users. It also fulfils a role in the overall goal of our research, which is to provide a suite of algorithms and systems that can be used to manage both data and link integrity on the Semantic Web.

As discussed in Section 5 there are some limitations in the current versions of our algorithm and the AAT system which we intend to investigate and address in the future. It is clear that there is still a significant amount of work to be done to create a comprehensive set of tools such that they can be applied to as wide a variety of data on the Semantic Web as possible, and the experience of past research in link integrity for the document Web tells us that there will be no perfect solution.

Despite this it is our belief that, as the Semantic Web grows, data and link integrity will be increasingly important issues for users as their applications come to rely upon linked data. There is a need to have systems in place such that data can be preserved and accessed even if the original sources are gone or unavailable. This has already been seen with the release of services like the Sindice Cache API (http://www.sindice.com/developers/cacheapi), which is used as one of the data sources in the default expansion profile (see Section 3.1.1). Additionally, with the rising adoption of RDFa embedded inside documents on the web, systems like this become applicable to the preservation of the structured data embedded in the document based Web, as discussed in Section 5.
7. REFERENCES

[1] Changeset protocol, 2007. http://n2.talis.com/wiki/Changeset_Protocol.
[2] Virtuoso sponger. Technical report, OpenLink Software, 2009. http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html.
[3] K. Alexander, R. Cyganiak, M. Hausenblas, and J. Zhao. Describing linked datasets: On the design and usage of VoID, the 'vocabulary of interlinked datasets'. In Proceedings of the Linked Data on the Web Workshop (LDOW2009), Madrid, Spain, April 2009. http://ceur-ws.org/Vol-538/ldow2009_paper20.pdf.
[4] H. Ashman. Electronic document addressing: dealing with change. ACM Comput. Surv., 32(3):201-212, 2000.
[5] T. Berners-Lee, R. Cailliau, A. Luotonen, H. F. Nielsen, and A. Secret. The world-wide web. Commun. ACM, 37(8):76-82, 1994.
[6] C. Bizer, R. Cyganiak, and T. Heath. How to publish linked data on the web, 2007. http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial.
[7] P. Bouquet, H. Stoermer, and B. Bazzanella. An entity name system (ENS) for the semantic web. In 5th European Semantic Web Conference, ESWC 2008, volume 5021, page 258. Springer, 2008.
[8] G. Cheng, W. Ge, and Y. Qu. Falcons: searching and browsing entities on the semantic web. In WWW '08: Proceedings of the 17th International Conference on World Wide Web, pages 1101-1102, New York, NY, USA, 2008. ACM.
[9] H. Davis. Data Integrity Problems in an Open Hypermedia Link Service. PhD thesis, University of Southampton, November 1995. http://eprints.ecs.soton.ac.uk/6597/.
[10] H. C. Davis. Hypertext link integrity. ACM Comput. Surv., page 28, 1999.
[11] A. M. Fountain, W. Hall, I. Heath, and H. C. Davis. Microcosm: an open model for hypermedia with dynamic linking. In Hypertext: concepts, systems and applications, pages 298-311, New York, NY, USA, 1992. Cambridge University Press.
[12] T. L. Harrison and M. L. Nelson. Just-in-time recovery of missing web pages. In HYPERTEXT '06: Proceedings of the Seventeenth Conference on Hypertext and Hypermedia, pages 145-156, New York, NY, USA, 2006. ACM.
[13] O. Hartig and J. Zhao. Guide to the provenance vocabulary, 2009. http://sourceforge.net/apps/mediawiki/trdf/index.php?title=Provenance_Vocabulary.
[14] B. Haslhofer and N. Popitsch. DSNotify - detecting and fixing broken links in linked data sets. In Proceedings of the 8th International Workshop on Web Semantics, 2009.
[15] A. Jaffri, H. Glaser, and I. Millard. Managing URI synonymity to enable consistent reference on the semantic web. In IRSW2008 - Identity and Reference on the Semantic Web, 2008.
[16] F. Kappe. A scalable architecture for maintaining referential integrity in distributed information systems. Journal of Universal Computer Science, 1(2):84-104, 1995. http://www.jucs.org/jucs_1_2/a_scalable_architecture_for.
[17] F. Kappe, K. Andrews, J. Faschingbauer, M. Gaisbauer, M. Pichler, and J. Schipflinger. Hyper-G: A new tool for distributed hypermedia. Institutes for Information Processing Graz, 1994.
[18] L. Moreau, J. Freire, J. Futrelle, R. McGrath, J. Myers, and P. Paulson. The open provenance model. December 2007. http://eprints.ecs.soton.ac.uk/14979/.
[19] L. Moreau and N. Gray. A community of agents maintaining links in the World Wide Web (preliminary report). In The Third International Conference and Exhibition on The Practical Application of Intelligent Agents and Multi-Agents, pages 221-235, London, UK, March 1998. http://www.ecs.soton.ac.uk/~lavm/papers/gcWWW.ps.gz.
[20] V. Papavassiliou, G. Flouris, I. Fundulaki, D. Kotzinos, and V. Christophides. On detecting high-level changes in RDF/S KBs. In The Semantic Web: 8th International Semantic Web Conference (ISWC 2009), pages 473-488. Springer, 2009.
[21] T. A. Phelps and R. Wilensky. Robust hyperlinks: Cheap, everywhere, now. In Digital Documents: Systems and Principles, pages 514-549. Springer, 2004.
[22] G. Tummarello, R. Delbru, and E. Oren. Sindice.com: weaving the open linked data. In The Semantic Web: 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, pages 552-565. Springer, 2008.
[23] G. Tummarello, C. Morbidoni, R. Bachmann-Gmür, and O. Erling. RDFSync: efficient remote synchronization of RDF models. In The Semantic Web: 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11-15, 2007, Proceedings, pages 537-551. Springer, 2007.
[24] L. Veiga and P. Ferreira. RepWeb: replicated web with referential integrity. In SAC '03: Proceedings of the 2003 ACM Symposium on Applied Computing, pages 1206-1211, New York, NY, USA, 2003. ACM.
[25] L. Veiga and P. Ferreira. Turning the web into an effective knowledge repository. ICEIS 2004: Software Agents and Internet Computing, 14(17), 2004.
[26] R. Vesse, W. Hall, and L. Carr. All About That - a URI profiling tool for monitoring and preserving linked data. In ISWC 2009, August 2009. http://eprints.ecs.soton.ac.uk/17815.
[27] J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov. Discovering and maintaining links on the web of data. In A. Bernstein, D. R. Karger, T. Heath, L. Feigenbaum, D. Maynard, E. Motta, and K. Thirunarayan, editors, The Semantic Web - ISWC 2009, volume 5823 of Lecture Notes in Computer Science, pages 650-665. Springer, 2009.
[28] J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov. Silk - a link discovery framework for the web of data. In 2nd Linked Data on the Web Workshop (LDOW2009), 2009.