URI Disambiguation in the Context of Linked Data

Afraz Jaffri, Hugh Glaser, Ian Millard
School of Electronics and Computer Science, University of Southampton
a.o.jaffri@ecs.soton.ac.uk, hg@ecs.soton.ac.uk, icm@ecs.soton.ac.uk

Copyright is held by the author/owner(s).
LDOW2008, April 22, 2008, Beijing, China.

ABSTRACT
The Linked Data initiative has given rise to an increasing number of RDF datasets, many of which are freely accessible online. These resources often arise as a result of database exports; however, sufficient consideration may not be given to the unseen implications caused when they are used in the wider context of the Semantic Web. This paper investigates two popular resources, DBLP and DBpedia, and discusses whether the issues regarding identity management and co-reference resolution have been suitably addressed. We find that a large percentage of authors in DBLP have been conflated, and that disambiguation pages have been incorrectly linked using owl:sameAs within DBpedia. Systems for dealing with these issues are presented, and directions are given for future research.

Categories and Subject Descriptors
H.3.5 [Information Systems]: Information Storage and Retrieval: Online Information Services – data sharing, web-based services.

General Terms
Management, Design, Reliability

Keywords
Linked Data, URI, Co-reference

1. INTRODUCTION
As the Linking Open Data project gathers pace, more and more repositories of knowledge are being added to the Linked Data Cloud, covering a wide range of topics. Many datasets stem from the focal point of the Linked Data Cloud, DBpedia [2]. Since DBpedia has harvested knowledge from Wikipedia, there is the potential to create links to any subject that is described in Wikipedia.

The datasets that have been interlinked so far have knowledge relating to people, places, books, songs and CYC [8] concepts as well as many others. Entities such as these are often prone to the problems of duplication and co-reference.

Whilst extensive linking between datasets has been widely encouraged, there has been little analysis of the accuracy of the links or the datasets themselves.

Datasets are often converted from existing sources which can themselves be either incomplete or inaccurate. The linking process accentuates these inconsistencies and produces a snowball effect as more datasets are added. If the Semantic Web is to provide a meaningfully interconnected web of assertions and relations, there must also be some guarantee or measure of the correctness of the information.

One of the main areas in which errors occur, both in databases and in digital libraries (the kinds of repositories that have been converted into RDF for use with linked data), is the problem of co-reference. Co-reference is the problem of ensuring that two different entities do not share the same name or identifier, and conversely identifying when two identifiers refer to the same entity. In the context of the Semantic Web we are therefore concerned with URIs.

This paper presents some analysis of datasets used to link data and raises the question of how to manage the identity and meaning of URIs in the Semantic Web. The next section describes some related work in the field of co-reference and author disambiguation, while Section 3 describes the problem of co-reference and where it occurs in DBLP and DBpedia. Section 4 goes on to describe possible solutions to the problem that are currently in deployment. Section 5 concludes and issues an invitation to help to provide an infrastructure where data can be confidently used on the Semantic Web.

2. RELATED WORK

2.1 Author Disambiguation
The problem of co-reference occurs in many different disciplines. A brief overview of the problems and solutions that appear in Information Science and database design can be found in [16]. One of the main areas in which co-reference becomes a major problem is author disambiguation. There are many authors who share the same name, and distinguishing between them is a vital part of any digital library or citation system. Not only do authors share the same names, but variation in the spelling of names can also lead to a single author having multiple identities. For example, the author ‘Hugh Glaser’ could be represented with his full name or by using ‘H. Glaser’, or ‘Glaser, H.’

A wide variety of methods have been employed to try to solve the problem of author disambiguation. Some of these include record linkage [10] used in databases, citation matching [17], name matching [5] and name equivalence identification [9]. These methods involve some form of string matching and word sense disambiguation.

Although these methods can help in identifying names with different spellings or written in different formats, the problem of disambiguating authors with exactly the same name remains a challenge. There have been recent attempts that use a different approach from the traditional string-based systems. Using the Web as a means of author disambiguation has been highlighted as a possible solution to the problem. Since Web pages often contain information about people that is not included in citation references, automatic scripts can be made that check the results of search engine queries made on the names of authors [19]. Another web-based approach attempts to find the publication page of an author from his or her institution’s website and match the publications contained in the page to citations in the repository [22]. The accuracy of such web-based systems ranges from 73% to 84%. These systems also rely on there being sufficient information available on the Web about each author. This is not always the case, especially with older publications and publications not in the field of computer science.

Another method that has been put into practice is an unsupervised machine learning approach using k-way spectral clustering that disambiguates authors in citations [12]. This study focused on the DBLP dataset and chose the top ranked ambiguous names such as ‘J. Lee’, ‘S. Lee’, ‘Y. Chen’, ‘C. Chen’, ‘J Anderson’ and ‘J Smith’. The unsupervised learning technique used co-author names, publication titles and publication venue titles for author disambiguation. This assumes that individuals will quite often author with the same people and publish to the same venues. The results of this experiment show that an average of around 65% of authors can be successfully disambiguated.

The purpose of mentioning the ongoing work in author disambiguation in a different domain is to highlight the importance of a problem that is only beginning to be appreciated on the Semantic Web. Section 3 will elaborate on this. The next section will look at how co-reference is being managed on the Semantic Web.

2.2 Disambiguation on the Semantic Web
There has been much discussion about identity and meaning on the Semantic Web from a theoretical point of view. Such discussions will continue as questions fundamental to the architecture of the Semantic Web are debated. Attention is now turning towards practical solutions for managing co-reference, or URI identity management. Since co-reference between datasets is essential for linked data to work properly, a perfect opportunity arises to test some of the methods and solutions that have been proposed.

The various methods that have been suggested for managing co-reference and identity on the Semantic Web range from ontology-based approaches [22, 11] and object consolidation [15] to complete management systems [7, 16]. These approaches have been applied to geographical data [21], wikis [11] and general Semantic Web data [7, 16].

There has been valuable work done on studying the reliability and stability of Wikipedia URIs [14] that are being used by DBpedia. This study suggests that the meaning of a URI found in Wikipedia stays stable approximately 93% of the time. While the results have been documented, there has been little attempt to quantify how big a problem co-reference on the Semantic Web actually is. In the next section we will consider how well the meaning of URIs from Wikipedia has carried over to DBpedia. We will also present a study made on the DBLP bibliographic database, which is available both as linked data and as a database.

3. THE PROBLEM OF CO-REFERENCE
Co-reference on the Semantic Web can occur in two ways: firstly, when a single URI identifies more than one resource, and secondly, when multiple URIs identify the same resource. Both situations occur frequently when studying linked data. For an example of the first situation, many URIs in the DBLP dataset are used for identifying a single author when, in fact, there are a number of people with the same name who are being incorrectly identified as being the same person. The second situation occurs much more frequently, as different datasets use their own URIs to identify the same resource. People and places are entities which suffer from URI multiplicity. Spain, for example, has at least four URIs:

http://dbpedia.org/resource/Spain
http://www4.wiwiss.fu-berlin.de/factbook/resource/Spain
http://sws.geonames.org/2510769
http://www4.wiwiss.fu-berlin.de/eurostat/resource/countries/Espa%C3%B1a

‘Hugh Glaser’ has at least eight URIs:

http://acm.rkbexplorer.com/rdf/resource-P112732
http://citeseer.rkbexplorer.com/rdf/resource-CSP109020
http://citeseer.rkbexplorer.com/rdf/resource-CSP109013
http://citeseer.rkbexplorer.com/rdf/resource-CSP109011
http://citeseer.rkbexplorer.com/rdf/resource-CSP109002
http://dblp.rkbexplorer.com/rdf/resource-27de9959
http://europa.eu/People/#person-0ff816fa
http://resist.ecs.soton.ac.uk/wiki/User:hugh_glaser
http://www.ecs.soton.ac.uk/info/#person-00021

This is to be expected and does not present a problem in itself. The problem occurs when these URIs are linked to other URIs via owl:sameAs. Since URI identity can often depend on the context in which it is used [6], there can be no guarantee that the two URIs are in fact the same entity. The next section supports this assertion by looking at the DBLP dataset and also the DBpedia dataset to reveal inconsistencies in the linking and naming of resources.

3.1 DBLP
The DBLP database reportedly contains over 900 000 articles from over 500 000 different authors in the field of computer science and related disciplines. The database can be seen as RDF by means of a D2R Server [4] and has been converted into linked data by adding owl:sameAs links to authors who are also in DBpedia. Whilst DBLP provides a comprehensive repository for scientific publications, there are a number of inconsistencies that appear in the data. This problem is not only found in DBLP but also in other digital repositories. Due to lack of resources there is often not enough time available to rigorously check the input for correctness or completeness. This has resulted in many authors having publications incorrectly attributed to them, with some having more titles under their name and some having fewer. This will have a major impact on the Semantic Web when such repositories are used as data sources without any attempt to manage the inconsistencies or ‘clean’ the data.

To assess the quality of data stored in DBLP we looked at some of the most common names and tried to ascertain whether the name belonged to a single author. This was achieved by looking at the publications attributed to each name and performing a Web search on the publication to find out to which institution an author was affiliated. The remaining publications were then checked in the same way, and authors who came from the same institutions were grouped together. Authors also frequently change institutions. To accommodate this, when a name was found that belonged to a different institution, it was assumed to be a different author unless:

1. The co-authors of any publication were the same.
2. The publication venue was the same.
3. The area of research was similar.

The author’s own publication page was also used if one could be found. This process allowed for a conservative estimate to be made of the number of different authors who appeared under the same name. Single author papers and papers where there was a difference of greater than four years between their publications were excluded, as authors can change their field of research over a period of time.

Names were chosen in order to provide a worst case scenario for authors not having been disambiguated. The ten most common surnames in the UK along with a list of common forenames were used. A total of 49 names were investigated by selecting five forenames with the nine most common surnames, and four forenames with the remaining surname. The DBLP dataset that was used was from October 2006 and contains a total of 491 796 authors. Thus, the selected names were almost 0.01% of the total population.

The results showed that for 92% of names chosen there were at least two different authors whose publications had been incorrectly merged. The highest number of different authors was 15 for the name ‘David Smith’. The mean number of authors for each name was 3.8 with a standard deviation of 2.6. The ten most ambiguous author names are shown in Table 1.

As well as several authors being considered as one, there are also a number of cases where an author has more than one name because initials are used instead of full names. For example, ‘C.B. Jones’, ‘Cliff B. Jones’ and ‘Cliff Jones’ are all the same author, yet his publications appear under these three different names.

To estimate the number of names in the entire population which include at least two separate authors, a Laplace point estimate can be calculated, with a 95% confidence interval obtained using the Adjusted Wald Method [1]. Multiplying the total number of entries in DBLP by the Laplace point estimate (0.902) gives 443 600 names. This will not be a truly accurate estimation since common names were chosen and not random names. Nevertheless, we can conclude that if a person has a common name, the probability of their publications being merged with those of other authors is approximately 90%. These results should be of concern to those working in the Semantic Web and especially those who deploy linked data.

When existing data sources are used for Semantic Web data integration it is important to consider the consistency and completeness of the data before assigning links to other datasets, including links in the form of owl:sameAs.

Each name in DBLP has its own URI, which is assumed to identify one single author with that particular name. As these results show, in most situations that is not the case.

Table 1. Names with the highest number of distinct authors

Name              No. Authors
David Smith       15
David Williams    10
David Jones       8
David Evans       7
Alan Williams     6
Matthew Jones     4
Andrew Taylor     4
Michael Taylor    4
Andrew Brown      4
Ben Smith         4

This identity problem is not just theoretical, but also has implications for the future when more applications will be built that reason with and use Semantic Web data. In particular, consider the attempt that is being made in the UK to allocate research funding and judge research excellence by citation impact [13]. One could naturally believe that a Semantic Web application could be made that amalgamates all bibliographic data from DBLP and other repositories and ranks people or institutions based on their publications. If the issue of co-reference is not taken into consideration then it is clear that not everyone will be fairly represented.

Now that the problem of co-reference has been highlighted in DBLP, we move on to looking at how well the problem is handled in DBpedia.

3.2 DBpedia
The huge amount of data that has been extracted from Wikipedia has led to a rapid increase in the number of URIs that can be used to identify people, places and things. At present DBpedia has identifiers for close to two million entities. This has enabled many other datasets to become linked with DBpedia entities through the use of owl:sameAs, giving rise to the Web of Data.

Whilst DBpedia provides a valuable resource for data providers and application developers, the conversion process has not taken into account the different needs that DBpedia has in comparison to Wikipedia. In particular, the issues of ambiguity and co-reference raised in this paper have not been addressed.

Wikipedia deals with the issue of co-reference by having special ‘disambiguation’ pages. These pages are created when there is more than one entry that has the same name but carries a different meaning. Disambiguation pages are mainly intended for humans searching on a particular topic who may need some help in locating the page that they are looking for. These same disambiguation pages have been carried over into DBpedia where there is no real need for them. Instead of making entities unambiguous, as in Wikipedia, the DBpedia URIs actually introduce more ambiguity.

Consider a person or machine wanting to use a URI for Robert Williams, the American politician. Using the URI http://dbpedia.org/resource/Robert_Williams reveals that properties belonging to Sir Robert Williams of Dorset, Robbie Williams the singer and Robert Williams the actor have all been merged onto one page. This happens with a large number of pages that fall into the Wikipedia category ‘Disambiguation’.

DBpedia 2.0 provides a number of examples where URIs are not sufficiently disambiguated. One example is the URI http://dbpedia.org/resource/Nancy_Wilson: if this URI refers to Nancy Wilson the singer, then the dbpedia:spouse property attached to it is that of Nancy Wilson the guitarist.

There are, of course, other URIs which have all the properties belonging to the correct person. The URI http://dbpedia.org/resource/Nancy_Wilson_%27guitarist%28 is the correct URI for the guitarist Nancy Wilson. This is simple for a human to work out, but machines will struggle. This is demonstrated by the fact that putting ‘Robert Williams’ or ‘Nancy Wilson’ into Sindice [20] puts the ambiguous URI at a higher rank than the ‘real’ URIs. Therefore the disambiguation URIs used in DBpedia only act as URI ‘noise’ and should probably be removed.

It is pleasing to note that DBpedia 3.0 has given much more attention to the issue of disambiguation. However, whilst a new ‘disambiguates’ property has been created, rogue properties belonging to distinct URIs still appear in URIs referring to disambiguation pages. There are approximately 150 000 of these URIs, which can be detected with relative ease. It is hoped that successive improvements to the method by which URIs are disambiguated will mean that the co-reference resolution of URIs can then be handled by external systems as described in Section 4.

A second problem arises due to the strong implications prescribed by the owl:sameAs property. By stating that one URI is owl:sameAs another, one is stating that the two references identify the same resource, and that each should share the properties of the other [3]. Looking at the owl:sameAs links in DBpedia one can see that URIs are made to be the same as several URIs with different meanings. For example, http://dbpedia.org/resource/Welsh is taken from a Wikipedia disambiguation page for the term ‘Welsh’. In DBpedia this URI is owl:sameAs a number of URIs for distinct concepts. None of these links are made from the pages that actually identify those concepts, such as http://dbpedia.org/resource/Welsh_language. In another example the URI http://dbpedia.org/resource/H.P._Lovecraft is owl:sameAs the CYC URI identifying the author and the Zitgist URI identifying the music band. Clearly the two are not the same. When looking at the real URI for the music band in DBpedia there is no owl:sameAs link.

These issues demonstrate the necessity of having dedicated management systems in order to manage co-reference resolution on the Semantic Web. The next section looks at two such systems that are currently in production. Problems during the creation of these systems have shown that there is a significant problem that needs to be tackled.

4. POSSIBLE SOLUTIONS
There are two main initiatives that have been set up in order to confront the issue of co-reference on the Semantic Web. Our own ReSIST [18] project has gathered metadata from publications and institutions and exposed them as linked data. The Okkam project [7] is a relatively new project that will formally begin this year, although the initial architecture has already been conceived.

4.1 Consistent Reference Services
The CRS sits in the Semantic Web as any other knowledge base or database would. Each data provider maintains one or more CRSs for their own knowledge. In the ReSIST project there are over 15 repositories, each with their own CRS.

The CRS introduces the concept of a bundle to group together resources that have been deemed to refer to the same concept within a given context. Different bundles may be used to group together URIs of the same resource in different contexts. For example, there may be a bundle containing all of the URIs about a person in the context of institution 1, and another bundle containing all of the URIs about the same person in the context of institution 2. Each CRS can use different algorithms to identify equivalent resources. A full description of the service can be found in [16].

The system is being used on a live site at http://www.rkbexplorer.com. Extending this system for use with DBpedia and other sites would involve using the linking algorithms for each dataset and storing the links in a CRS. Each dataset would have one or more CRSs which would act as an authority for their data. An application may choose to give precedence to a CRS hosted from the same domain as the URI in question. Taking the owl:sameAs links out of the data ensures the knowledge is semantically correct without introducing a significant overhead. However, if owl:sameAs links are to be made then the CRS can be used for this purpose.

4.2 Okkam
The Okkam project has been created to enable a ‘Web of Entities’ [7]. Whereas the CRS is a fully distributed system, the Okkam system is centralised. The main aims are to create a naming service for entities and a directory containing entity profiles under the single control of one authority.

The main service, OkkamCore, allows for the publishing, modifying and removing of entities and assertions of identity, and provides a retrieval service based on a set of criteria. A prototype application has been made and will be incrementally improved and upgraded throughout the duration of the project. By holding identifiers for all types of entities the project hopes to avoid the proliferation of URIs that is currently occurring. For the purposes of linked data it is yet to be seen what the final system will provide. The project will be monitored with interest as it progresses.

5. CONCLUSION
This paper has attempted to provide some motivation for finding solutions to the co-reference problem. With the linked data initiative in its early stages, it is important to think about the integrity of the data being provided before errors are found in the applications that attempt to use the data.

We would stress that DBLP and Wikipedia/DBpedia are valuable and hard-won facilities that deliver searchable resources very effectively to their many users. The problem that is arising is that in the context of the Semantic Web and Linked Data, different measures of quality pertain. It is the very Network Effect that the Linked Data community is seeking that causes the difference.

The issue has attracted significant theoretical debate, yet the only systems attempting to solve the problem are the two mentioned in Section 4. It would be in the interest of the whole Semantic Web community if this issue were carefully considered as a fundamental part of the architecture needed to make the Semantic Web gain widespread adoption.

6. ACKNOWLEDGEMENTS
This work is supported under the ReSIST Network of Excellence (NoE) which is sponsored by the Information Society Technology (IST) priority of the EU Sixth Framework programme (FP6) under contract number IST-4-026764-NOE.

7. REFERENCES
[1] Agresti, A. and Coull, B.A. Approximate is better than ‘Exact’ for Interval Estimation of Binomial Proportions. The American Statistician. 52 pp.119-126. 1998
[2] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R. and Ives, Z. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference (Busan, Korea 2007). Springer.
[3] Bechhofer, S., Van Harmelen, F., Hendler, J., Horrocks, I., McGuinness, D.L., Schneider, P.F. and Stein, L.A. OWL Web Ontology Language Reference, Technical Report, W3C, http://www.w3.org/TR/owl-ref/
[4] Bizer, C. and Cyganiak, R. D2R Server – Publishing Relational Databases on the Web as SPARQL Endpoints. In Proceedings of the 15th International World Wide Web Conference. (Edinburgh, Scotland 2006). ACM
[5] Bilenko, M., Mooney, R., Cohen, W., Ravikumar, P., and Fienberg, S. Adaptive Name Matching in Information Integration. IEEE Intelligent Systems, 18(5) pp.16-23, 2003
[6] Booth, D. URIs and the Myth of Resource Identity, Proceedings of the Workshop on Identity, Meaning and the Web (IMW06) at International World Wide Web Conference. (Edinburgh, Scotland. 2006) ACM
[7] Bouquet, P., Stoermer, H and Giacomuzzi, D. OKKAM: Enabling a Web of Entities. In Proceedings of the 16th International World Wide Web Conference (Banff, Canada) ACM.
[8] Cycorp Inc. http://www.cyc.com
[9] Feitelson, D.G. On Identifying Name Equivalences in Digital Libraries. Information Research, 9(4), p.192. 2004
[10] Fellegi, I.P. and Sunter, A.B. A Theory for Record Linkage, Journal of the American Statistical Association, 64(328), pp.1183-1210, December 1969
[11] Gangemi, A., and Presutti, V. A Grounded Ontology for Identity and Reference of Web Resources. In Proceedings of the 16th International World Wide Web Conference (Banff, Canada) ACM.
[12] Han, H., Hongyuan, Z., and Giles, C.L. Name Disambiguation in Author Citations using a K-Way Spectral Clustering Method. In Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries. (Denver) ACM
[13] Harnad, S., Carr, L., Brody, T. and Oppenheim, C. Mandated online RAE CV’s linked to university eprint archives: enhancing UK research impact and assessment. Ariadne http://www.ariadne.ac.uk/issue35/harnad/
[14] Hepp, M., Siorpaes, K. and Bachlechner, D. Harvesting Wiki Consensus Using Wikipedia Entries as Vocabulary for Knowledge Management. IEEE Internet Computing. 11(5) pp.54-65 Sep 2007.
[15] Hogan, A., Harth, A and Decker, S. A Grounded ontology for Identity and Reference of Web Resources. In Proceedings of the 16th International World Wide Web Conference (Banff, Canada) ACM.
[16] Jaffri, A., Glaser, H., and Millard, I. URI Identity Management for Semantic Web Data Integration and Linkage. In Proceedings of the Workshop on Scalable Semantic Web Systems (Vilamoura, Portugal 2007) Springer.
[17] McCallum, A., Niham, R., and Ungar, L.H. Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining. (Boston, USA 2000). ACM Press.
[18] Resilience for Survivability in IST (ReSIST) Network of Excellence. http://resist-noe.eu
[19] Tan, Y.F., Kan, M.-Y. and Lee, D. Search Engine Driven Author Disambiguation, Proceedings 6th ACM/IEEE-CS Joint Conference on Digital Libraries, pp.314-315, ACM Press, New York.
[20] Tummarello, G., Delbru, R and Oren, E. Sindice.com: Weaving the Open Linked Data. In Proceedings of the 6th International Semantic Web Conference (Busan, Korea 2007) ACM
[21] Volz, R., Kleb, J., and Mueller, W. Towards Ontology-based Disambiguation of Geographical Identifiers. In Proceedings of the 16th International World Wide Web Conference (Banff, Canada) ACM.
[22] Yang, K., Jiang, J., Lee, H. and Ho, J. Extracting Citation Relationships from Web Documents for Author Disambiguation, Technical Report No.TR-IIS-06-017, Institute of Information Science, Taipei, Taiwan
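As an illustration of the estimate reported in Section 3.1, the following minimal sketch reproduces the Laplace point estimate and an Adjusted Wald (Agresti-Coull) interval [1]. The count of 45 ambiguous names is an assumption inferred from "92% of 49 names"; the paper does not state the raw count, and the function names are ours.

```python
import math

def laplace_estimate(successes: int, trials: int) -> float:
    """Laplace point estimate: add one success and one failure."""
    return (successes + 1) / (trials + 2)

def adjusted_wald_interval(successes: int, trials: int, z: float = 1.96):
    """Adjusted Wald interval: add z^2/2 successes and z^2 trials,
    then apply the ordinary Wald formula to the adjusted proportion."""
    n_adj = trials + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

SAMPLED_NAMES = 49       # names investigated (from Section 3.1)
AMBIGUOUS_NAMES = 45     # ASSUMPTION: 92% of 49, rounded to a whole name
DBLP_AUTHORS = 491_796   # author entries in the October 2006 dataset

point = laplace_estimate(AMBIGUOUS_NAMES, SAMPLED_NAMES)
low, high = adjusted_wald_interval(AMBIGUOUS_NAMES, SAMPLED_NAMES)

# The paper multiplies the rounded estimate (0.902) by the population size.
projected = round(point, 3) * DBLP_AUTHORS

print(f"Laplace point estimate: {point:.3f}")
print(f"95% Adjusted Wald interval: ({low:.3f}, {high:.3f})")
print(f"Projected conflated names: {projected:,.0f}")
```

Because the sample consisted of deliberately common names, the projection is best read as an upper-end indication rather than an estimate for a randomly chosen DBLP name.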
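The effect of a mistaken owl:sameAs link, as in the H.P. Lovecraft example of Section 3.2, can be sketched by computing the equivalence classes that a set of owl:sameAs statements implies and merging the properties within each class. This is only an illustrative sketch: the CYC and Zitgist URIs and the property values below are hypothetical placeholders, since the paper does not give them, and the union-find helper is our own construction rather than part of any system described here.

```python
from collections import defaultdict

def same_as_closure(links):
    """Group URIs into equivalence classes implied by owl:sameAs,
    which is symmetric and transitive (simple union-find)."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for uri in list(parent):
        groups[find(uri)].add(uri)
    return list(groups.values())

DBPEDIA = "http://dbpedia.org/resource/H.P._Lovecraft"
CYC_AUTHOR = "http://example.org/cyc/HPLovecraft-author"      # hypothetical URI
ZITGIST_BAND = "http://example.org/zitgist/HPLovecraft-band"  # hypothetical URI

same_as_links = [(DBPEDIA, CYC_AUTHOR), (DBPEDIA, ZITGIST_BAND)]

# Hypothetical facts held about each URI in its own dataset.
facts = {
    CYC_AUTHOR: {("type", "Writer")},
    ZITGIST_BAND: {("type", "MusicGroup")},
}

groups = same_as_closure(same_as_links)

# Under owl:sameAs semantics every member of a class shares all properties.
merged = [set().union(*(facts.get(uri, set()) for uri in group))
          for group in groups]

for group, properties in zip(groups, merged):
    print(sorted(group), "->", sorted(properties))
```

In this sketch the two links place the author and the band in a single class, so a reasoner would conclude that one resource is both a writer and a music group. Keeping such equivalences in a separate, context-specific bundle, as the CRS of Section 4.1 does, avoids asserting them in the data itself.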