Managing Co-reference on the Semantic Web

Hugh Glaser, Afraz Jaffri, Ian C. Millard
School of Electronics and Computer Science
University of Southampton
Southampton, Hampshire, UK
{hg, aoj04r, icm}@ecs.soton.ac.uk

Copyright is held by the author/owner(s).
LDOW2009, April 20, 2009, Madrid, Spain.

ABSTRACT
Co-reference resolution, or the determination of 'equivalent' URIs referring to the same concept or entity, is a significant hurdle to overcome in the realisation of large scale Semantic Web applications. However, it has only recently gained the attention of research communities in the Semantic Web context, and while activities are now underway in identifying co-referent or conflated URIs, little consideration has been given to tools and techniques for storing, manipulating, and reusing co-reference information.

This paper provides an overview of the specification, implementation, interactions and experiences in using the Co-reference Resolution Service (CRS) to facilitate rigorous management of URI co-reference data, and enable interoperation between multiple Linked Open Data sources. Comparisons are made throughout the paper between the way the CRS manages multiple URIs for the same resource and the emerging practice of using owl:sameAs to identify duplicate URIs. The benefits that have been gained from deploying the CRS on a site with multiple Linked Data repositories are also highlighted.

Categories and Subject Descriptors
H.3.5 [Information Systems]: Information storage and retrieval - Online Information Services - data sharing, web based services

Keywords
Co-reference, Linked Data, Semantic Web

1. INTRODUCTION
The Semantic Web vision fundamentally requires the creation of a 'Web of Data' containing large quantities of readily accessible, interlinked, and machine readable information, analogous to the existing World Wide Web content currently available for human consumption.

In recent years an increasing number of semantic datasets have emerged, fuelled primarily by the efforts of the Linking Open Data community, forming the beginnings of such a resource. By following the guidelines set out in the Linked Data Tutorial [3], or through the use of tools such as D2R [5], existing datasets are being made available online, including notable encyclopaedic, geographical, music, film and academic publication related resources.

However, in many cases there is minimal interlinking between these datasets, as often they are existing resources which have recently been exposed in a semantic representation. As a result, the 'web' remains fragmented and difficult to navigate.

As more datasets appear there is significant potential for overlap to occur, with a given resource being described in two or more repositories. These representations are more than likely to use different identifiers, stemming from each source, unless one or other of the datasets is relatively new and has been constructed with knowledge of the other. Furthermore, the information in different repositories could be expressed against different ontologies.

The problems involved with identifying these 'duplicate' descriptions, either within a single dataset or across multiple data sources, are encapsulated in the area of co-reference resolution [9]. Research has been and continues to be carried out in this field, developing systematic analysis and heuristic based approaches to identifying co-references in or between datasets; however, techniques for managing, publishing and using co-reference information are lacking.

This is not a problem that will disappear as the Semantic Web gains momentum, and it is naïve to suppose that a 'standard' set of identifiers will eventually emerge over time; organisations, companies and government agencies are unlikely to be willing to adopt identifiers over which they have no authority or control.

The remainder of this paper describes the Co-reference Resolution Service (CRS, formerly known as the Consistent Reference Service), which we have developed to meet the needs of managing co-reference data, both within our own project and the Semantic Web as a whole.

2. THE NEED FOR A CO-REFERENCE RESOLUTION SERVICE
Part of the ReSIST project with which the authors are involved aims to provide a synthesised view of resources related to Resilient and Dependable Computing research, utilising a Semantic Web based approach [6]. Data has been acquired from multiple sources, detailing academic publications, researchers, institutions and projects, and converted into RDF as required. This totals approximately 100M triples, from 20+ different sources, ranging from large publication repositories to small chunks of information submitted by project partners. Data from each different source has been kept separate, and published as Linked Data on subdomains of rkbexplorer.com.

Clearly there is likely to be overlap and duplication of information between these repositories, particularly with people and publications. We have deployed a number of algorithms to identify co-referent identifiers in and between our datasets; however, this is of little use without some way of applying these results to a semantic application or further analysis tools.

The most prevalent way of dealing with 'duplicate' URIs that are deemed to be the same is to use the owl:sameAs predicate to link between them.

Merge $uri_a$ and $uri_b$ ...
    $bundle_1 = \{uri_a\}$    (1)
    $bundle_2 = \{uri_b\}$    (2)
    $bundle_3 = bundle_1 \cup bundle_2$    (3)
    $bundle_3 = \{uri_a, uri_b\}$    (4)
    $bundle_1,\ bundle_2$ marked inactive    (5)
Merge $uri_a$ and $uri_c$ ...
    $bundle_4 = \{uri_c\}$    (6)
    $bundle_5 = bundle_3 \cup bundle_4$    (7)
    $bundle_5 = \{uri_a, uri_b, uri_c\}$    (8)
    $bundle_3,\ bundle_4$ marked inactive    (9)
Merge $uri_m$ and $uri_n$ ...
    $bundle_6 = \{uri_m\}$    (10)
    $bundle_7 = \{uri_n\}$    (11)
    $bundle_8 = bundle_6 \cup bundle_7$    (12)
    $bundle_8 = \{uri_m, uri_n\}$    (13)
    $bundle_6,\ bundle_7$ marked inactive    (14)
Merge $uri_n$ and $uri_b$ ...
    $bundle_9 = bundle_8 \cup bundle_5$    (15)
    $bundle_9 = \{uri_a, uri_b, uri_c, uri_m, uri_n\}$    (16)
    $bundle_8,\ bundle_5$ marked inactive    (17)

Figure 1: Examples of bundle formation

The semantics of owl:sameAs dictate that all the URIs linked with this predicate have the same identity [2], implying that the subject and object must be the same resource. The major disadvantage with this approach is that the two URIs become indistinguishable, even though they may refer to different entities according to the context in which they are used.

Named graphs may be used in some cases to overcome this problem, but this approach has significant drawbacks. In addition to being outside of the RDF model, prior understanding of the graphs and their partition is required. Furthermore, if RDF descriptions are combined, cached, or passed between different services, then named graphs can easily be lost.

Generally, co-reference resolution techniques are not as certain as one might hope, somewhat undermining the strong semantics behind owl:sameAs. Once again, we must consider the notion of equivalence within a given context: with the exception of very elementary examples, one can only be sure that two URIs are equivalent within the confines of a specific application, whereas owl:sameAs asserts that two references are always the same.

It is the authors' belief that more often than not the use of owl:sameAs is inappropriate and is being applied incorrectly, and that owl:sameAs should only be used when the two concepts being represented are utterly indistinguishable.

The approach taken within the ReSIST project has been to separate out knowledge regarding co-reference and equivalence from the main datasets, in a manner similar to that in which early hypertext systems were developed by storing content and link-bases as distinct components. By treating such information as a first class entity and storing it in a separate system, the Co-reference Resolution Service, a number of benefits can be realised.

Firstly, a number of CRSes can be used to represent different co-reference contexts; applications can then use one or more CRSes as appropriate. For example, in undertaking citation analysis, a paper with the same title and text that appeared both as a journal article and technical report should be considered as two separate papers, whereas in many other applications it may be thought of as the same resource appearing in two different publication formats. A different CRS instance could be created to represent each viewpoint, whereas an application accessing a linked data site with embedded owl:sameAs links has no opportunity to choose an equivalence context.

Secondly, in recognising co-reference data as important knowledge in its own right, and by storing it separately and manipulating it through custom services, more powerful management techniques can be applied, including history, rollback and annotation capabilities.

In relation to the ongoing issues over URI identity, both the RDF predicate based approach [8] and the URI declarations approach [4] to identity management can be seamlessly integrated into the CRS framework. Indeed, any approach to URI identity management will be easier to implement and control in a world where knowledge of URI synonyms and URI definition are kept separate.

Consider the case where the URI synonyms for the same resource are included as owl:sameAs links in the definition of the resource that is being described. Alterations to these URIs will cause an alteration in the definition of the URI. Separating the URI co-reference links into a bundle in a separate knowledge base allows the duplicate URIs to be changed without affecting the definition of the resource for the original URI.

3. CRS IMPLEMENTATION
The CRS provides what is essentially a very simple service – maintaining sets of equivalent URIs – however it has taken several iterations to arrive at the current version 3, which is maintaining co-reference data for each of the rkbexplorer.com repositories and enabling the complex cross-repository interoperation required by the RKBExplorer application [7].

What may appear to be a trivially straightforward service actually delivers a refined yet powerful set of capabilities, which is the result of much thought, deliberation and experience through implementing and using the service to manage real-world data and support complex applications.

The core CRS functionality is implemented in a PHP class, enabling easy integration into a wide variety of web-based applications and middleware libraries, and backed by a MySQL database to facilitate acceptable performance when used with large datasets.

Equivalent URIs are conceptually stored in a 'bundle' – a set of identifiers referring to resources which are considered to be the same in a given context. A URI can exist in at most one bundle within a CRS instance. One URI in each bundle is nominated to be a canonical identifier, or canon, for that bundle, representing a 'preferred' URI for the set of duplicates.

uris
    hash bundleID deprecated
bundles
    bundleID canonHash active
symbols
    hash lexical URI

Figure 2: CRS database schema

An application that wishes to use data from multiple sources as if they were a single resource can process results by looking up URIs in a CRS and replacing them with their canons on the fly, reducing the multiplicity of identifiers to a single definitive URI. Bundles additionally have sequential numeric identifiers; however, these are only used internally and are not exposed.

Bundles are formed by atomic operations only, by means of merging pairs of URIs together. In merging $uri_a$ and $uri_b$ the CRS first checks to see that each URI is already known and exists within a bundle. If not, a 'singleton' bundle is created for new URIs as required. To perform the merge, a new third bundle is created consisting of the union of the bundles that contain the URIs which are being asserted as equivalent. The two constituent bundles which were merged are then marked as inactive, as shown in Figure 1.

A number of schemes can be employed to elect the canon for this newly merged bundle: random allocation, selection by an ordering according to a list of preferred URI domains, or simply assuming the canon from the bundle on the left hand side of the pair of merged URIs, as in the example above.

In order to handle large datasets, the CRS uses a MySQL database for back end storage. To facilitate fast access when querying the CRS, data is internalised in indexed tables of hashed URIs, according to the schema in Figure 2. This enables simple queries to be formulated which permit extremely fast lookups to find the canon of a given URI, or to find all URIs in a given bundle; these are the two fundamental query operations and most used features of the CRS.

Each operation performed by the CRS can additionally be logged in a history table, including the facility to record a comment as to why an action was carried out. As a result, if at a later date it is discovered that two URIs were incorrectly deemed to be equivalent, then operations can be 'undone' or rolled back to rectify the situation.

Finally, functionality is provided to 'deprecate' URIs within a dataset, by setting a flag in the uris table. A number of sources from which we acquired publications data contained particularly poor quality information with regards to person identifiers, often conflating different individuals who share common names under the same URI. As a result, we were forced to generate a new URI for every author name on every publication, and then perform our own co-reference analysis to collapse equivalent URIs where appropriate [7].

However, this process led to bundles containing many tens or low hundreds of equivalent URIs, each from within the same 'local' dataset. These duplicates are of our own creation, provide little additional value, and in fact cause significant overheads if each variant has to be checked by an application. It was decided therefore that once a phase of this 'cold start' co-reference analysis had been completed, the underlying RDF data in the associated knowledge base should be modified to remove unnecessary duplicates by consulting the CRS and re-writing each 'local' URI with the canon returned by the CRS for that URI. In removing the unnecessary duplicates, we reduce the number of query iterations that are required to retrieve all possible facts from an equivalence closure. Those duplicates removed from the underlying dataset are flagged as deprecated within the CRS, which continues to give results when asked to give equivalents for both normal and deprecated URIs. However, deprecated URIs are not returned in equivalence sets, hence if the CRS is queried for equivalents of a deprecated URI, only the non-deprecated members of the bundle are returned. All URIs remain in their bundles, maintaining the history and bundle formation structures; deprecated ones are simply filtered out when results are returned. Checks have been put in place to ensure that canons cannot be deprecated, and while it would be perfectly feasible to change the canon for a given bundle to an alternative member of that bundle, we have not found the need to implement such functionality.

4. USING A SINGLE CRS
As stated in the previous section, the core functionality of the CRS is implemented in a PHP class. This can be used directly, incorporated within an application, or wrapped in simple scripts to expose functionality via HTTP interfaces. In either case, the back end database does not have to reside on the same machine as the code executing the CRS class, given appropriate MySQL permissions and firewall access, enabling multiple applications to access the same co-reference information directly via PHP. However, this may incur additional overheads.

The CRS class provides a function to 'merge' two URIs, i.e. assert that they are equivalent, along with a number of other useful functions to facilitate querying of the underlying knowledge. One can request equivalent URIs for a given input URI, which returns the set of non-deprecated URIs from the bundle in which the requested URI resides. If no information is known about the requested URI, a set is simply returned containing only that URI. Similarly, the canonical URI can be requested for any given input URI.

There is no level of access control built in to the core functionality, other than authenticating to the MySQL database. As a result, we have chosen not to give public access to the rkbexplorer.com CRSes via the CRS class, but rather to provide a number of web interfaces which permit read-only querying of the co-reference knowledge.

For each rkbexplorer.com sub-domain, the URI naming scheme uses the following pattern for non-information resources: http://.rkbexplorer.com/id/xyz.

When a non-information resource is dereferenced with Accept: application/rdf+xml an RDF representation is returned as expected within linked data best practice. However, in this document, there is an additional link via the coref:coreferenceData predicate, indicating that there is co-reference data available at the URI http://.rkbexplorer.com/crs/xyz and allowing CRS aware applications to discover related CRSes. This URI produces a representation of the bundle for the /id/xyz URI in either HTML or RDF, based on content negotiation, such as that in Figure 3. Alternatively, to facilitate use by a wider number of systems, a request can be made which returns a document in ntriples format describing the canon to be owl:sameAs all other duplicates in the bundle.

Applications may also wish to query the CRS in a more general sense, which is provided by the interface accessible via /crs/export/?term=&format=

The core CRS implementation can handle arbitrary URIs from any number of sources; however, the HTTP interfaces described above and used with rkbexplorer.com sub-domains have the implication that at least one of the URIs in any pair of equivalence assertions comes from the sub-domain for which that CRS is representative. As a result, each rkbexplorer.com CRS maintains co-references between URIs on that sub-domain, in addition to links to equivalent URIs in other rkbexplorer.com sub-domains or external Linked Data sources.

Finally, each CRS instance, or database, is assumed to contain knowledge according to a single co-reference context only. Unfortunately ontologies have not yet been defined for encapsulating the contextual aspects of co-reference analysis or the use of co-reference information; hence an application must currently either have prior knowledge of a set of CRSes it may consult, or accept data 'carte-blanche' from any CRS it discovers.

5. USING MULTIPLE CRSES
It is conceivable that linked data providers may wish to publish co-reference information about their dataset, representing equivalences both between local URIs and linking to external URIs in other sources. Typically we envisage that providers could host one (or more) CRSes per dataset, as demonstrated with rkbexplorer.com. When investigating co-reference for a given URI, application developers may choose to treat a CRS which exists on the same domain as the URI in question as a first port of call, or as more 'authoritative' than other CRSes published elsewhere; however, this is not a prescribed semantic.

We have seen how the separation of co-reference data into CRSes allows for additional services to be provided that could not easily be achieved with owl:sameAs approaches. Another of these is the use of multiple CRSes to efficiently deduce a global equivalence closure for finding duplicates for a given URI. Finding all equivalences is simply a matter of following the coref:coreferenceData links to the bundle for that URI and recursively repeating the process for each URI in that bundle.

There are various methods that can speed up this process, such as only looking at one URI from each CRS repository, or following only the coref:canon predicates in order to build up a unified view of equivalent URIs. It is also possible for an application to maintain a list of known CRSes applicable to a given context, and to query each one in parallel to discover any equivalences it knows about, or to naïvely query the .rkbexplorer.com/crs/ CRS whenever a .rkbexplorer.com/id/ URI is encountered.

In comparison, Semantic Web applications that rely on owl:sameAs to represent all co-references must always recursively load and potentially compute inference over the data of each URI that is deemed equivalent to the current URI in order to compute a global equivalence closure. This may bring significant performance overheads, imposing unnecessary loading and processing of large chunks of data. Furthermore, there are no opportunities to limit or control the expansion of the equivalence set, whereas the CRS architecture allows for following as many, or as few, duplicate URIs as required with no significant barrier on performance.

We have provided CRS instances for each of our rkbexplorer.com sub-domains, and performed significant co-reference analysis both internally and across these datasets. A visualisation of the cross-repository linkage is presented in Figure 4, and experimental voiD descriptions [1] are provided at http://.rkbexplorer.com/id/void detailing these linkages and CRS content in a semantically annotated manner.

This set of CRSes informs our faceted browser application, RKBExplorer, enabling data from the various repositories to be incorporated as required. Although rigorous performance and load testing has not been carried out on the CRS implementation, managing millions of URIs in tens of millions of bundles has presented no problems. Indeed, fetching the global equivalence closure is an insignificant step when compared to other processing and analysis phases within the application. It is anticipated that individual CRSes will scale well beyond the current usage, and even more so when multiple CRSes are employed.

The global equivalence closure described above has been implemented within the RKBExplorer application, and additionally exposed through an HTTP interface at http://www.rkbexplorer.com/sameAs/. This service will consult all necessary CRSes to determine the overall set of equivalents for a given URI, while additionally picking a canon from a preferential order of domains. Again, to enable easy integration of CRS knowledge in non CRS aware applications, the service can simply be queried with content negotiation or the additional parameter &format=n3 to retrieve a document listing the equivalence relationships using the owl:sameAs predicate.

Figure 3: Example RDF description of equivalent URIs in a bundle

Figure 4: Co-references between CRSes – see http://www.rkbexplorer.com/linkage/

6. CONCLUSIONS
This paper has briefly outlined the problems of co-reference resolution within Linked Open Data repositories and on the Semantic Web as a whole. The problems of using owl:sameAs have been discussed, and the need for more capable management techniques presented. We detail the rationale, capabilities and implementation of the CRS architecture, and describe its use in real-world applications.

Co-reference within the Semantic Web is a growing, yet largely unappreciated problem. It has been suggested that it is a matter that will resolve itself as the Semantic Web evolves, with careful social engineering and planning; however, for the reasons discussed previously we do not believe this to be the case.

It is our conclusion that the most effective means for combating the issue is to make co-reference awareness an architectural feature of future semantic applications. Existing use of owl:sameAs is not sufficient, and in many cases incorrect. We believe the use of the bundle framework provides a flexible, expandable and readily compatible notation for conceptualising co-reference, and that the CRS implementation provides a broad strategy for co-reference management that integrates the process of reference management into the architecture of the Semantic Web by utilising both social and technical engineering.

Readers are encouraged to experiment with and if possible make use of the rkbexplorer.com services discussed in this paper, and we welcome any feedback. The core CRS implementation may be available on request.

7. ACKNOWLEDGMENTS
This work is funded in part by the ReSIST Network of Excellence (NoE) which is sponsored by the EU Sixth Framework programme (FP6) under contract number IST-4-026764-NOE, and in collaboration with The Korea Institute of Science and Technology Information (KISTI).

We would also like to thank our colleagues at Southampton, Newcastle, and DERI, along with numerous members of the Linking Open Data community who have contributed both directly and indirectly through informative and enlightening discussion.

8. REFERENCES
[1] K. Alexander, R. Cyganiak, M. Hausenblas, and J. Zhao. voiD Guide - Using the Vocabulary of Interlinked Datasets, 2008.
[2] S. Bechhofer, F. van Harmelen, J. Hendler, I. Horrocks, D. McGuinness, P. Patel-Schneider, and L. Stein. OWL Web Ontology Language Reference, 2004.
[3] C. Bizer, R. Cyganiak, and T. Heath. How to publish Linked Data on the Web, 2007.
[4] D. Booth. Why URI Declarations? A Comparison of Architectural Approaches. In 1st Workshop on Identity and Reference for the Semantic Web (IRSW2008), 2008.
[5] R. Cyganiak and C. Bizer. D2R Server – Publishing Relational Databases on the Web as SPARQL Endpoints. In 15th International World Wide Web Conference (WWW2006), 2006.
[6] H. Glaser, I. C. Millard, T. Anderson, and B. Randell. ReSIST Deliverable D10: Prototype Knowledge Base, 2006.
[7] H. Glaser, I. C. Millard, T. Anderson, and B. Randell. ReSIST Deliverable D23: Resilience Knowledge Base – Version 2, 2007.
[8] P. Hayes and H. Halpin. In Defense of Ambiguity. In Workshop on Identity, Identifiers and Identification (WWW2007), 2007.
[9] A. Jaffri, H. Glaser, and I. C. Millard. URI Disambiguation in the Context of Linked Data. In 1st Workshop on Linked Data on the Web (LDOW2008), 2008.
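The bundle formation procedure of Section 3 (singleton creation, union into a new bundle, marking the constituents inactive, history logging and rollback, deprecation) can be illustrated with a minimal in-memory sketch. This is not the PHP/MySQL implementation described in the paper; the names BundleStore, merge, undo_last_merge, canon, equivalents and deprecate are hypothetical, and the canon is simply taken from the left hand bundle, which is only one of the election schemes mentioned.

```python
from dataclasses import dataclass, field


@dataclass
class Bundle:
    members: frozenset
    canon: str
    active: bool = True


@dataclass
class BundleStore:
    """Hypothetical in-memory stand-in for a single CRS instance."""
    bundles: dict = field(default_factory=dict)   # bundle id -> Bundle
    current: dict = field(default_factory=dict)   # uri -> id of its active bundle
    deprecated: set = field(default_factory=set)
    history: list = field(default_factory=list)   # (left id, right id, new id, comment)

    def _new_bundle(self, members, canon):
        bundle_id = len(self.bundles) + 1          # sequential, internal only
        self.bundles[bundle_id] = Bundle(frozenset(members), canon)
        for uri in members:
            self.current[uri] = bundle_id
        return bundle_id

    def _bundle_of(self, uri):
        # Unknown URIs receive a singleton bundle on first use.
        if uri not in self.current:
            self._new_bundle({uri}, uri)
        return self.current[uri]

    def merge(self, left_uri, right_uri, comment=""):
        left, right = self._bundle_of(left_uri), self._bundle_of(right_uri)
        if left == right:
            return left                            # already asserted equivalent
        union = self.bundles[left].members | self.bundles[right].members
        # Assumed election scheme: keep the canon of the left hand bundle.
        merged = self._new_bundle(union, self.bundles[left].canon)
        self.bundles[left].active = False
        self.bundles[right].active = False
        self.history.append((left, right, merged, comment))
        return merged

    def undo_last_merge(self):
        left, right, merged, _comment = self.history.pop()
        del self.bundles[merged]
        for bundle_id in (left, right):
            self.bundles[bundle_id].active = True
            for uri in self.bundles[bundle_id].members:
                self.current[uri] = bundle_id

    def canon(self, uri):
        return self.bundles[self.current[uri]].canon if uri in self.current else uri

    def equivalents(self, uri):
        if uri not in self.current:
            return {uri}                           # nothing known: return the URI itself
        return set(self.bundles[self.current[uri]].members) - self.deprecated

    def deprecate(self, uri):
        if self.canon(uri) == uri:
            raise ValueError("a canon cannot be deprecated")
        self.deprecated.add(uri)
```

Replaying the four merges of Figure 1 against this sketch allocates bundle identifiers 1 to 9 in the same order as the figure, leaving only the final bundle active. Deprecated URIs stay in their bundles and are filtered only when results are returned, mirroring the behaviour described in the text.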
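The two fundamental lookups named in Section 3 (the canon of a URI, and all URIs in its bundle) can be sketched against tables shaped like those of Figure 2. The sketch below uses SQLite in place of MySQL so that it is self-contained. Several details are assumptions rather than facts from the paper: the SHA-1 hash, the column types, reading the symbols table as a map from hash to the lexical form of the URI (column name lexical), and the helper names uri_hash, load_bundle, canon_of and bundle_members.

```python
import hashlib
import sqlite3

# Table and column names follow Figure 2; types and the index are assumptions.
SCHEMA = """
CREATE TABLE uris    (hash TEXT PRIMARY KEY, bundleID INTEGER, deprecated INTEGER DEFAULT 0);
CREATE TABLE bundles (bundleID INTEGER PRIMARY KEY, canonHash TEXT, active INTEGER DEFAULT 1);
CREATE TABLE symbols (hash TEXT PRIMARY KEY, lexical TEXT);
CREATE INDEX uris_by_bundle ON uris (bundleID);
"""


def uri_hash(uri):
    # The paper does not name the hash function; SHA-1 is an assumption.
    return hashlib.sha1(uri.encode("utf-8")).hexdigest()


def load_bundle(db, bundle_id, canon, members):
    """Store one bundle; the canon is expected to be one of the members."""
    db.execute("INSERT INTO bundles (bundleID, canonHash) VALUES (?, ?)",
               (bundle_id, uri_hash(canon)))
    for uri in members:
        db.execute("INSERT INTO symbols (hash, lexical) VALUES (?, ?)",
                   (uri_hash(uri), uri))
        db.execute("INSERT INTO uris (hash, bundleID) VALUES (?, ?)",
                   (uri_hash(uri), bundle_id))


def canon_of(db, uri):
    row = db.execute(
        "SELECT s.lexical FROM uris u "
        "JOIN bundles b ON b.bundleID = u.bundleID "
        "JOIN symbols s ON s.hash = b.canonHash "
        "WHERE u.hash = ?", (uri_hash(uri),)).fetchone()
    return row[0] if row else uri


def bundle_members(db, uri):
    # Deprecated URIs stay in the table but are filtered from results.
    rows = db.execute(
        "SELECT s.lexical FROM uris u "
        "JOIN uris m ON m.bundleID = u.bundleID "
        "JOIN symbols s ON s.hash = m.hash "
        "WHERE u.hash = ? AND m.deprecated = 0", (uri_hash(uri),)).fetchall()
    return {r[0] for r in rows} or {uri}
```

Both lookups reduce to joins on indexed hash columns, which is one plausible reading of why the paper reports them as fast. A connection created with sqlite3.connect(":memory:") and initialised with executescript(SCHEMA) is enough to try the functions.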
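The global equivalence closure of Section 5 amounts to a graph traversal: look up the bundle for a URI, then repeat for each URI in that bundle. A minimal sketch follows, assuming a hypothetical lookup function that stands in for following a coref:coreferenceData link over HTTP. Here it is backed by in-memory sets keyed by host name; equivalence_closure, make_lookup and the max_lookups parameter are illustrative names only.

```python
from collections import deque
from urllib.parse import urlsplit


def equivalence_closure(start_uri, lookup, max_lookups=None):
    """Breadth-first expansion of the equivalents of start_uri.

    lookup(uri) returns the bundle (a set of URIs) held for uri by the CRS
    responsible for it. max_lookups optionally bounds the expansion.
    """
    seen = {start_uri}
    queue = deque([start_uri])
    lookups = 0
    while queue and (max_lookups is None or lookups < max_lookups):
        uri = queue.popleft()
        lookups += 1
        for equivalent in lookup(uri):
            if equivalent not in seen:
                seen.add(equivalent)
                queue.append(equivalent)
    return seen


def make_lookup(crs_by_host):
    """Build a lookup over in-memory bundles, one list of bundles per host.

    This replaces real HTTP requests purely for illustration.
    """
    def lookup(uri):
        host = urlsplit(uri).netloc
        for bundle in crs_by_host.get(host, []):
            if uri in bundle:
                return bundle
        return {uri}
    return lookup
```

The optional bound reflects the point made in the text that an application may follow as many or as few duplicate URIs as it requires; with owl:sameAs links embedded in the data there is no comparable control.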
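Sections 3 and 4 describe two ways in which an application can consume a bundle: rewriting URIs in results with their canons on the fly, and exporting the bundle as owl:sameAs statements in ntriples format for systems that are not CRS aware. Both are sketched below under stated assumptions: canon_of is any function mapping a URI to its canon, triples are plain tuples of strings, and the function names are hypothetical. Only the owl:sameAs property URI is taken from the OWL vocabulary itself.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def rewrite_with_canons(triples, canon_of):
    """Replace subject and object URIs by their canons.

    Values unknown to canon_of (for example literals) are expected to be
    returned unchanged by it. Duplicate triples collapse in the result set.
    """
    return {(canon_of(s), p, canon_of(o)) for s, p, o in triples}


def sameas_ntriples(canon, bundle):
    """Describe the canon as owl:sameAs every other URI in the bundle."""
    lines = []
    for uri in sorted(bundle):
        if uri != canon:
            lines.append(f"<{canon}> <{OWL_SAME_AS}> <{uri}> .")
    return "\n".join(lines)
```

Rewriting collapses descriptions that differ only in which duplicate URI they use, while the export leaves the choice of how far to trust the equivalences to the consuming application.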