Managing Co-reference on the Semantic Web

                                        Hugh Glaser, Afraz Jaffri, Ian C. Millard
                                            School of Electronics and Computer Science
                                                     University of Southampton
                                                   Southampton, Hampshire, UK
                                             {hg, aoj04r, icm}@ecs.soton.ac.uk


ABSTRACT
Co-reference resolution, or the determination of ‘equivalent’ URIs
referring to the same concept or entity, is a significant hurdle to
overcome in the realisation of large scale Semantic Web applications.
However, it has only recently gained the attention of research
communities in the Semantic Web context, and while activities are now
underway in identifying co-referent or conflated URIs, little
consideration has been given to tools and techniques for storing,
manipulating, and reusing co-reference information.
   This paper provides an overview of the specification,
implementation, interactions and experiences in using the Co-reference
Resolution Service (CRS) to facilitate rigorous management of URI
co-reference data, and enable interoperation between multiple Linked
Open Data sources. Throughout the paper, the way the CRS manages
multiple URIs for the same resource is contrasted with the emerging
practice of using owl:sameAs to identify duplicate URIs. The benefits
that have been gained from deploying the CRS on a site with multiple
Linked Data repositories are also highlighted.

Categories and Subject Descriptors
H.3.5 [Information Systems]: Information storage and retrieval –
Online Information Services – data sharing, web based services

Keywords
Co-reference, Linked Data, Semantic Web

Copyright is held by the author/owner(s).
LDOW2009, April 20, 2009, Madrid, Spain.

1.   INTRODUCTION
   The Semantic Web vision fundamentally requires the creation of a
‘Web of Data’ containing large quantities of readily accessible,
interlinked, and machine readable information, analogous to the
existing World Wide Web content currently available for human
consumption.
   In recent years an increasing number of semantic datasets have
emerged, fuelled primarily by the efforts of the Linking Open Data
community, forming the beginnings of such a resource. By following the
guidelines set out in the Linked Data Tutorial [3], or through the use
of tools such as D2R [5], existing datasets are being made available
online, including notable encyclopaedic, geographical, music, film and
academic publication related resources.
   However, in many cases there is minimal interlinking between these
datasets, as often they are existing resources which have recently
been exposed in a semantic representation. As a result, the ‘web’
remains fragmented and difficult to navigate.
   As more datasets appear there is significant potential for overlap
to occur, with a given resource being described in two or more
repositories. These representations are more than likely to use
different identifiers, stemming from each source, unless one or other
of the datasets is relatively new and has been constructed with
knowledge of the other. Furthermore, the information in different
repositories could be expressed against different ontologies.
   The problems involved with identifying these ‘duplicate’
descriptions, either within a single dataset or across multiple data
sources, are encapsulated in the area of co-reference resolution [9].
Research has been and continues to be carried out in this field,
developing systematic analysis and heuristic based approaches to
identifying co-references in or between datasets; however, techniques
for managing, publishing and using co-reference information are
lacking.
   This is not a problem that will disappear as the Semantic Web gains
momentum, and it is naïve to suppose that a ‘standard’ set of
identifiers will eventually emerge over time; organisations, companies
and government agencies are unlikely to be willing to adopt
identifiers over which they have no authority or control.
   The remainder of this paper describes the Co-reference Resolution
Service (CRS, formerly known as the Consistent Reference Service),
which we have developed to meet the needs of managing co-reference
data, both within our own project and the Semantic Web as a whole.

2.   THE NEED FOR A CO-REFERENCE RESOLUTION SERVICE
   Part of the ReSIST project with which the authors are involved aims
to provide a synthesised view of resources related to Resilient and
Dependable Computing research, utilising a Semantic Web based approach
[6]. Data has been acquired from multiple sources, detailing academic
publications, researchers, institutions and projects, and converted
into RDF as required. This totals approximately 100M triples, from 20+
different sources, ranging from large publication repositories to
small chunks of information submitted by project partners. Data from
each different source has been kept separate, and published as Linked
Data on
subdomains of rkbexplorer.com.
   Clearly there is likely to be overlap and duplication of
information between these repositories, particularly with people and
publications. We have deployed a number of algorithms to identify
co-referent identifiers in and between our datasets; however, this is
of little use without some way of applying these results to a semantic
application or further analysis tools.
   The most prevalent way of dealing with ‘duplicate’ URIs that are
deemed to be the same is to use the owl:sameAs predicate to link
between them. The semantics of owl:sameAs dictate that all the URIs
linked with this predicate have the same identity [2], implying that
the subject and object must be the same resource. The major
disadvantage with this approach is that the two URIs become
indistinguishable, even though they may refer to different entities
according to the context in which they are used.
   Named graphs may be used in some cases to overcome this problem,
but this approach has significant drawbacks. In addition to being
outside of the RDF model, prior understanding of the graphs and their
partition is required. Furthermore, if RDF descriptions are combined,
cached, or passed between different services, then named graphs can
easily be lost.
   Generally, co-reference resolution techniques are not as certain as
one might hope, somewhat undermining the strong semantics behind
owl:sameAs. Once again, we must consider the notion of equivalence
within a given context: with the exception of very elementary
examples, one can only be sure that two URIs are equivalent within the
confines of a specific application, whereas owl:sameAs asserts that
two references are always the same.
   It is the authors’ belief that more often than not the use of
owl:sameAs is inappropriate and is being applied incorrectly, and
rather that owl:sameAs should only be used when the two concepts being
represented are utterly indistinguishable.
   The approach taken within the ReSIST project has been to separate
out knowledge regarding co-reference and equivalence from the main
datasets, in a manner similar to that in which early hypertext systems
were developed by storing content and link-bases as distinct
components. By treating such information as a first class entity and
storing it in a separate system, the Co-reference Resolution Service,
a number of benefits can be realised.
   Firstly, a number of CRSes can be used to represent different
co-reference contexts; applications can then use one or more CRSes as
appropriate. For example, in undertaking citation analysis, a paper
with the same title and text that appeared both as a journal article
and technical report should be considered as two separate papers,
whereas in many other applications it may be thought of as the same
resource appearing in two different publication formats. A different
CRS instance could be created to represent each viewpoint, whereas an
application accessing a linked data site with embedded owl:sameAs
links has no opportunity to choose an equivalence context.
   Secondly, in recognising co-reference data as important knowledge
in its own right, and by storing it separately and manipulating it
through custom services, more powerful management techniques can be
applied, including history, rollback and annotation capabilities.
   In relation to the ongoing issues over URI identity, both the RDF
predicate based approach [8] and the URI declarations approach [4] to
identity management can be seamlessly integrated into the CRS
framework. Indeed, any approach to URI identity management will be
easier to implement and control in a world where knowledge of URI
synonyms and URI definition are kept separate.
   Consider the case where the URI synonyms for the same resource are
included as owl:sameAs links in the definition of the resource that is
being described. Alterations to these URIs will cause an alteration in
the definition of the URI. Separating the URI co-reference links into
a bundle in a separate knowledge base allows the duplicate URIs to be
changed without affecting the definition of the resource for the
original URI.

   Merge $uri_a$ and $uri_b$ ...
\begin{align}
bundle_1 &= \{uri_a\} \tag{1}\\
bundle_2 &= \{uri_b\} \tag{2}\\
bundle_3 &= bundle_1 \cup bundle_2 \tag{3}\\
bundle_3 &= \{uri_a, uri_b\} \tag{4}\\
&\cancel{bundle_1},\ \cancel{bundle_2} \tag{5}
\end{align}
   Merge $uri_a$ and $uri_c$ ...
\begin{align}
bundle_4 &= \{uri_c\} \tag{6}\\
bundle_5 &= bundle_3 \cup bundle_4 \tag{7}\\
bundle_5 &= \{uri_a, uri_b, uri_c\} \tag{8}\\
&\cancel{bundle_3},\ \cancel{bundle_4} \tag{9}
\end{align}
   Merge $uri_m$ and $uri_n$ ...
\begin{align}
bundle_6 &= \{uri_m\} \tag{10}\\
bundle_7 &= \{uri_n\} \tag{11}\\
bundle_8 &= bundle_6 \cup bundle_7 \tag{12}\\
bundle_8 &= \{uri_m, uri_n\} \tag{13}\\
&\cancel{bundle_6},\ \cancel{bundle_7} \tag{14}
\end{align}
   Merge $uri_n$ and $uri_b$ ...
\begin{align}
bundle_9 &= bundle_8 \cup bundle_5 \tag{15}\\
bundle_9 &= \{uri_a, uri_b, uri_c, uri_m, uri_n\} \tag{16}\\
&\cancel{bundle_8},\ \cancel{bundle_5} \tag{17}
\end{align}
   (Struck-through bundles are those marked inactive by the merge.)

                Figure 1: Examples of bundle formation

3.   CRS IMPLEMENTATION
   The CRS provides what is essentially a very simple service –
maintaining sets of equivalent URIs – however it has taken several
iterations to arrive at the current version 3, which is maintaining
co-reference data for each of the rkbexplorer.com repositories and
enabling the complex cross-repository interoperation required by the
RKBExplorer application [7].
   What may appear to be a trivially straightforward service actually
delivers a refined yet powerful set of capabilities, which is the
result of much thought, deliberation and experience through
implementing and using the service to manage real-world data and
support complex applications.
   The core CRS functionality is implemented in a PHP class, enabling
easy integration with a wide variety of web-based applications and
middleware libraries, and backed by
a MySQL database to facilitate acceptable performance when used with
large datasets.
   Equivalent URIs are conceptually stored in a ‘bundle’ – a set of
identifiers referring to resources which are considered to be the same
in a given context. A URI can exist in at most one bundle within a CRS
instance. One URI in each bundle is nominated to be a canonical
identifier, or canon, for that bundle, representing a ‘preferred’ URI
for the set of duplicates. An application that wishes to use data from
multiple sources as if they were a single resource can process results
by looking up URIs in a CRS and replacing them with their canons on
the fly, reducing the multiplicity of identifiers to a single
definitive URI. Bundles additionally have sequential numeric
identifiers; however, these are only used internally and are not
exposed.
   Bundles are formed by atomic operations only, by means of merging
pairs of URIs together. In merging $uri_a$ and $uri_b$ the CRS first
checks to see that each URI is already known and exists within a
bundle. If not, a ‘singleton’ bundle is created for new URIs as
required. Now to perform the merge, a new third bundle is created
consisting of the union of the bundles that contain the URIs which are
being asserted as equivalent. The two constituent bundles which were
merged are then marked as inactive, as shown in Figure 1.
   A number of schemes can be employed to elect the canon for this
newly merged bundle, from random allocation, selection by an ordering
according to a list of preferred URI domains, or simply by assuming
the canon from the bundle in the left hand side of the pair of merged
URIs, as in the example above.
   In order to handle large datasets, the CRS uses a MySQL database
for back end storage. To facilitate fast access when querying the CRS,
data is internalised in indexed tables of hashed URIs, according to
the schema in Figure 2. This enables simple queries to be formulated
which permit extremely fast lookups to find the canon of a given URI,
or to find all URIs in a given bundle: the two fundamental query
operations and most used features of the CRS.

      Table      Columns
      uris       hash, bundleID, deprecated
      bundles    bundleID, canonHash, active
      symbols    hash, lexical URI

                    Figure 2: CRS database schema

   Each operation performed by the CRS can additionally be logged in a
history table, including the facility to record a comment as to why an
action was carried out. As a result, if at a later date it is
discovered that two URIs were incorrectly deemed to be equivalent,
then operations can be ‘undone’ or rolled back to rectify the
situation.
   Finally, functionality is provided to ‘deprecate’ URIs within a
dataset, by setting a flag in the uris table. A number of sources from
which we acquired publications data contained particularly poor
quality information with regards to person identifiers, often
conflating different individuals who share common names under the same
URI. As a result, we were forced to generate a new URI for every
author name on every publication, and then perform our own
co-reference analysis to collapse equivalent URIs where appropriate
[7].
   However, this process led to bundles containing many tens or low
hundreds of equivalent URIs, each from within the same ‘local’
dataset. These duplicates are of our own creation, provide little
additional value, and in fact cause significant overheads if each
variant has to be checked by an application. It was decided therefore
that once a phase of this ‘cold start’ co-reference analysis had been
completed, the underlying RDF data in the associated knowledge base
should be modified to remove unnecessary duplicates by consulting the
CRS and re-writing each ‘local’ URI with the canon returned by the CRS
for that URI. In removing the unnecessary duplicates, we reduce the
number of query iterations that are required to retrieve all possible
facts from an equivalence closure. Those duplicates removed from the
underlying dataset are flagged as deprecated within the CRS, which
continues to give results when asked to give equivalents for both
normal and deprecated URIs. However, deprecated URIs are not returned
in equivalence sets; hence if the CRS is queried for equivalents of a
deprecated URI, only the non-deprecated members of the bundle are
returned. All URIs remain in their bundles, maintaining the history
and bundle formation structures; deprecated ones are simply filtered
out when results are returned. Checks have been put in place to ensure
that canons cannot be deprecated, and while it would be perfectly
feasible to change the canon for a given bundle to an alternative
member of that bundle, we have not found need to implement such
functionality.

4.   USING A SINGLE CRS
   As stated in the previous section, the core functionality of the
CRS is implemented in a PHP class. This can be used directly,
incorporated within an application, or wrapped in simple scripts to
expose functionality via HTTP interfaces. In either case, the back end
database does not have to reside on the same machine as the code
executing the CRS class, given appropriate MySQL permissions and
firewall access, enabling multiple applications to access the same
co-reference information directly via PHP. However, remote database
access may incur additional overheads.
   The CRS class provides a function to ‘merge’ two URIs, i.e. assert
that they are equivalent, along with a number of other useful
functions to facilitate querying of the underlying knowledge. One can
request equivalent URIs for a given input URI, which returns the set
of non-deprecated URIs from the bundle in which the requested URI
resides. If no information is known about the requested URI, a set is
simply returned containing only that URI. Similarly, the canonical URI
can be requested for any given input URI.
   There is no level of access control built in to the core
functionality, other than authenticating to the MySQL database. As a
result, we have chosen not to give public access to the
rkbexplorer.com CRSes via the CRS class, but rather to provide a
number of web interfaces which permit read-only querying of the
co-reference knowledge.
   For each rkbexplorer.com sub-domain, the URI naming scheme uses the
following pattern for non-information resources:
http://<repository>.rkbexplorer.com/id/xyz.
When a non-information resource is dereferenced with
Accept: application/rdf+xml an RDF representation is
returned as expected within linked data best practice. However, in
this document, there is an additional link via the
coref:coreferenceData predicate, indicating that there is co-reference
data available at the URI
http://<repository>.rkbexplorer.com/crs/xyz and allowing CRS aware
applications to discover related CRSes. This URI produces a
representation of the bundle for the /id/xyz URI in either HTML or
RDF, based on content negotiation, such as that in Figure 3.
Alternatively, to facilitate use by a wider number of systems, a
request can be made which returns a document in N-Triples format
describing the canon to be owl:sameAs all other duplicates in the
bundle.
   Applications may also wish to query the CRS in a more general
sense, which is provided by the interface accessible via
/crs/export/?term=<uri>&format=<format>
   The core CRS implementation can handle arbitrary URIs from any
number of sources; however, the HTTP interfaces described above and
used with rkbexplorer.com sub-domains have the implication that at
least one of the URIs in any pair of equivalence assertions comes from
the sub-domain for which that CRS is representative. As a result, each
rkbexplorer.com CRS maintains co-references between URIs on that
sub-domain, in addition to links to equivalent URIs in other
rkbexplorer.com sub-domains or external Linked Data sources.
   Finally, each CRS instance, or database, is assumed to contain
knowledge according to a single co-reference context only.
Unfortunately ontologies have not yet been defined for encapsulating
the contextual aspects of co-reference analysis or the use of
co-reference information; hence an application must currently either
have prior knowledge of a set of CRSes it may consult, or accept data
‘carte blanche’ from any CRS it discovers.

5.   USING MULTIPLE CRSES
   It is conceivable that linked data providers may wish to publish
co-reference information about their dataset, representing
equivalences both between local URIs and linking to external URIs in
other sources. Typically we envisage that providers could host one (or
more) CRSes per dataset, as demonstrated with rkbexplorer.com. When
investigating co-reference for a given URI, application developers may
choose to treat a CRS which exists on the same domain as the URI in
question as a first port of call, or as more ‘authoritative’ than
other CRSes published elsewhere; however, this is not a prescribed
semantic.
   We have seen how the separation of co-reference data into CRSes
allows for additional services to be provided that could not easily be
achieved with owl:sameAs approaches. Another of these is the use of
multiple CRSes to efficiently deduce a global equivalence closure for
finding duplicates for a given URI. Finding all equivalences is simply
a matter of following the coref:coreferenceData links to the bundle
for that URI and recursively repeating the process for each URI in
that bundle.
   There are various methods that can speed up this process, such as
only looking at one URI from each CRS repository, or following only
the coref:canon predicates in order to build up a unified view of
equivalent URIs. It is also possible for an application to maintain a
list of known CRSes applicable to a given context, and to query each
one in parallel to discover any equivalences it knows about, or to
naïvely query the <repository>.rkbexplorer.com/crs/ CRS whenever a
<repository>.rkbexplorer.com/id/ URI is encountered.
   In comparison, Semantic Web applications that rely on owl:sameAs to
represent all co-references must always recursively load and
potentially compute inference over the data of each URI that is deemed
equivalent to the current URI in order to compute a global equivalence
closure. This may bring significant performance overheads, imposing
unnecessary loading and processing of large chunks of data.
Furthermore, there are no opportunities to limit or control the
expansion of the equivalence set, whereas the CRS architecture allows
for following as many, or as few, duplicate URIs as required with no
significant impact on performance.
   We have provided CRS instances for each of our rkbexplorer.com
sub-domains, and performed significant co-reference analysis both
internally and across these datasets. A visualisation of the
cross-repository linkage is presented in Figure 4, and experimental
voiD descriptions [1] are provided at
http://<repository>.rkbexplorer.com/id/void
detailing these linkages and CRS content in a semantically annotated
manner.
   This set of CRSes informs our faceted browser application,
RKBExplorer, enabling data from the various repositories to be
incorporated as required. Although rigorous performance and load
testing has not been carried out on the CRS implementation, managing
millions of URIs in tens of millions of bundles has presented no
problems. Indeed, fetching the global equivalence closure is an
insignificant step when compared to other processing and analysis
phases within the application. It is anticipated that individual CRSes
will scale well beyond the current usage, and even more so when
multiple CRSes are employed.
   The global equivalence closure described above has been implemented
within the RKBExplorer application, and additionally exposed through
an HTTP interface at http://www.rkbexplorer.com/sameAs/. This service
will consult all necessary CRSes to determine the overall set of
equivalents for a given URI, while additionally picking a canon from a
preferential order of domains. Again, to enable easy integration of
CRS knowledge in non CRS aware applications, the service can simply be
queried with content negotiation or the additional parameter
&format=n3 to retrieve a document listing the equivalence
relationships using the owl:sameAs predicate.

6.   CONCLUSIONS
   This paper has briefly outlined the problems of co-reference
resolution within Linked Open Data repositories and on the Semantic
Web as a whole. The problems of using owl:sameAs have been discussed,
and the need for more capable management techniques presented. We
detail the rationale, capabilities and implementation of the CRS
architecture, and describe its use in real-world applications.
   Co-reference within the Semantic Web is a growing, yet largely
unappreciated problem. It has been suggested that it is a matter that
will resolve as the Semantic Web evolves, with careful social
engineering and planning; however, for the reasons discussed
previously we do not believe this to be the case.
   It is our conclusion that the most effective means for combating
the issue is to make co-reference awareness an architectural feature
of future semantic applications. Existing use of owl:sameAs is not
sufficient, and in many cases incorrect. We believe the use of the
bundle framework provides a
<rdf:RDF xmlns:coref="http://www.rkbexplorer.com/ontologies/coref#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

  <coref:Bundle>
    <coref:canon rdf:resource="http://southampton.rkbexplorer.com/id/person-00021"/>
    <coref:duplicate rdf:resource="http://acm.rkbexplorer.com/id/person-102898" />
    <coref:duplicate rdf:resource="http://citeseer.rkbexplorer.com/id/resource-CSP109002" />
    <coref:duplicate rdf:resource="http://dblp.rkbexplorer.com/id/people-27aedbcb" />
    <coref:duplicate rdf:resource="http://eprints.rkbexplorer.com/id/kfupm/person-27aed0c1" />
    <coref:duplicate rdf:resource="http://southampton.rkbexplorer.com/id/person-00021" />
    <coref:duplicate rdf:resource="http://wiki.rkbexplorer.com/id/hugh_glaser" />
    <coref:lastUpdated>2009-01-16 11:11:40</coref:lastUpdated>
  </coref:Bundle>

</rdf:RDF>


                    Figure 3: Example RDF description of equivalent URIs in a bundle
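
A bundle description such as the one in Figure 3 can be consumed with any XML or RDF parser. The following is a minimal sketch in Python using only the standard library; it is not part of the CRS itself. The function names `parse_bundle` and `to_same_as_ntriples` are hypothetical, the embedded document is an abridged copy of Figure 3, and the N-Triples output is only our assumption of what a document "describing the canon to be owl:sameAs all other duplicates" might contain.

```python
import xml.etree.ElementTree as ET

# Namespaces taken from Figure 3; the owl:sameAs URI is the standard OWL one.
COREF = "http://www.rkbexplorer.com/ontologies/coref#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Abridged copy of the bundle in Figure 3 (three of its duplicates).
BUNDLE_XML = """<rdf:RDF xmlns:coref="http://www.rkbexplorer.com/ontologies/coref#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <coref:Bundle>
    <coref:canon rdf:resource="http://southampton.rkbexplorer.com/id/person-00021"/>
    <coref:duplicate rdf:resource="http://acm.rkbexplorer.com/id/person-102898"/>
    <coref:duplicate rdf:resource="http://southampton.rkbexplorer.com/id/person-00021"/>
    <coref:duplicate rdf:resource="http://wiki.rkbexplorer.com/id/hugh_glaser"/>
  </coref:Bundle>
</rdf:RDF>"""


def parse_bundle(xml_text):
    """Return (canon, duplicates) for the first coref:Bundle in the document."""
    root = ET.fromstring(xml_text)
    bundle = root.find(f"{{{COREF}}}Bundle")
    resource = f"{{{RDF}}}resource"  # namespaced attribute key used by ElementTree
    canon = bundle.find(f"{{{COREF}}}canon").get(resource)
    duplicates = [d.get(resource) for d in bundle.findall(f"{{{COREF}}}duplicate")]
    return canon, duplicates


def to_same_as_ntriples(canon, duplicates):
    """Assumed export shape: one owl:sameAs triple from the canon to each other URI."""
    return [f"<{canon}> <{OWL_SAME_AS}> <{uri}> ." for uri in duplicates if uri != canon]


canon, duplicates = parse_bundle(BUNDLE_XML)
for line in to_same_as_ntriples(canon, duplicates):
    print(line)
```

Note that the canon also appears among the coref:duplicate entries in Figure 3, so the sketch skips it when emitting owl:sameAs triples.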


             Figure 4: Co-references between CRSes – see http://www.rkbexplorer.com/linkage/
flexible, expandable and readily compatible notation for con-
ceptualising co-reference, and that the CRS implementation
provides a broad strategy for co-reference management that
integrates the process of reference management into the ar-
chitecture of the Semantic Web by utilising both social and
technical engineering.
   Readers are encouraged to experiment with and, if possible,
make use of the rkbexplorer.com services discussed in
this paper, and we welcome any feedback. The core CRS
implementation may be available on request.
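
For readers without access to the implementation, the bundle operations described in Section 3 can be approximated in a few lines. The following Python sketch is an in-memory illustration only, not the PHP/MySQL code: the class name `BundleStore` and its method names are hypothetical, the canon is always taken from the left-hand bundle (one of the schemes mentioned in Section 3), and returning an unknown URI as its own canon is our assumption.

```python
class BundleStore:
    """In-memory stand-in for the uris and bundles tables of Figure 2."""

    def __init__(self):
        self.bundle_of = {}      # URI -> ID of the active bundle containing it
        self.members = {}        # bundle ID -> frozenset of member URIs
        self.canon_of = {}       # bundle ID -> canonical URI
        self.active = {}         # bundle ID -> True while the bundle is current
        self.deprecated = set()  # URIs filtered out of equivalence sets
        self._next_id = 1        # sequential numeric bundle identifiers

    def _new_bundle(self, uris, canon):
        bundle_id = self._next_id
        self._next_id += 1
        self.members[bundle_id] = frozenset(uris)
        self.canon_of[bundle_id] = canon
        self.active[bundle_id] = True
        for uri in uris:
            self.bundle_of[uri] = bundle_id
        return bundle_id

    def _bundle_for(self, uri):
        # Unknown URIs first receive a 'singleton' bundle.
        if uri not in self.bundle_of:
            self._new_bundle({uri}, canon=uri)
        return self.bundle_of[uri]

    def merge(self, left, right):
        """Assert left and right equivalent; return the ID of the resulting bundle."""
        a = self._bundle_for(left)
        b = self._bundle_for(right)
        if a == b:
            return a
        # A new bundle is the union of both; the old ones are kept but inactive.
        merged = self._new_bundle(self.members[a] | self.members[b],
                                  canon=self.canon_of[a])
        self.active[a] = False
        self.active[b] = False
        return merged

    def canon(self, uri):
        if uri not in self.bundle_of:
            return uri  # assumption: an unknown URI is its own canon
        return self.canon_of[self.bundle_of[uri]]

    def equivalents(self, uri):
        """Non-deprecated members of the URI's bundle, or {uri} if unknown."""
        if uri not in self.bundle_of:
            return {uri}
        bundle = self.members[self.bundle_of[uri]]
        return {u for u in bundle if u not in self.deprecated}

    def deprecate(self, uri):
        if self.canon(uri) == uri:
            raise ValueError("a canon cannot be deprecated")
        self.deprecated.add(uri)


# Replay the four merges of Figure 1.
store = BundleStore()
store.merge("uri_a", "uri_b")          # creates bundles 1, 2 and 3
store.merge("uri_a", "uri_c")          # creates bundles 4 and 5
store.merge("uri_m", "uri_n")          # creates bundles 6, 7 and 8
final = store.merge("uri_n", "uri_b")  # bundle 9 = bundle 8 union bundle 5
print(final, sorted(store.equivalents("uri_a")), store.canon("uri_a"))
```

Because merged bundles are never deleted, only marked inactive, the sequence of merges remains available for the history and rollback facilities described in Section 3; the sketch does not implement rollback itself.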

7.   ACKNOWLEDGMENTS
   This work is funded in part by the ReSIST Network of Ex-
cellence (NoE) which is sponsored by the EU Sixth Frame-
work programme (FP6) under contract number IST-4-026764-
NOE, and in collaboration with The Korea Institute of Sci-
ence and Technology Information (KISTI).
   We would also like to thank our colleagues at Southamp-
ton, Newcastle, and DERI, along with numerous members
of the Linking Open Data community who have contributed
both directly and indirectly through informative and enlight-
ening discussion.


8.   REFERENCES
[1] K. Alexander, R. Cyganiak, M. Hausenblas, and
    J. Zhao. voiD Guide - Using the Vocabulary of
    interlinked Datasets, 2008.
[2] S. Bechhofer, F. van Harmelen, J. Hendler, I. Horrocks,
    D. McGuinness, P. Patel-Schneider, and L. Stein. OWL Web
    Ontology Language Reference, 2004.
[3] C. Bizer, R. Cyganiak, and T. Heath. How to publish
    Linked Data on the Web, 2007.
[4] D. Booth. Why URI Declarations? A Comparison of
    Architectural Approaches. In 1st Workshop on Identity
    and Reference for the Semantic Web (IRSW2008),
    2008.
[5] R. Cyganiak and C. Bizer. D2R Server – Publishing
    Relational Databases on the Web as SPARQL
    Endpoints. In 15th International World Wide Web
    Conference (WWW2006), 2006.
[6] H. Glaser, I. C. Millard, T. Anderson, and B. Randell.
    ReSIST Deliverable D10: Prototype Knowledge Base,
    2006.
[7] H. Glaser, I. C. Millard, T. Anderson, and B. Randell.
    ReSIST Deliverable D23: Resilience Knowledge Base –
    Version 2, 2007.
[8] P. Hayes and H. Halpin. In Defense of Ambiguity. In
    Workshop on Identity, Identifiers and Identification
    (WWW2007), 2007.
[9] A. Jaffri, H. Glaser, and I. C. Millard. URI
    Disambiguation in the Context of Linked Data. In 1st
    Workshop on Linked Data on the Web (LDOW2008),
    2008.