DBpediaSameAs: an Approach to Tackle Heterogeneity in DBpedia Identifiers

Andre Valdestilhas, Natanael Arndt, Dimitris Kontokostas
AKSW, Department of Computer Science
Augustusplatz 10, D-04109 Leipzig, Germany
valdestilhas@informatik.uni-leipzig.de, arndt@informatik.uni-leipzig.de, kontokostas@informatik.uni-leipzig.de

ABSTRACT
The DBpedia dataset has multiple URIs, within the dataset and from other datasets, that are connected by (transitive) owl:sameAs relations and thus refer to the same concepts. With this heterogeneity of identifiers it is complicated for users and agents to find the unique identifier that should preferably be used. We introduce the concept of the DBpedia Unique Identifier (DUI) and a dataset of linksets relating URIs to DUIs. In order to improve the quality of our dataset we developed a mechanism that allows the user to rate and suggest links. As a proof of concept, an implementation with a graphical web user interface is provided for accessing the linkset and rating the links. The DBpedia sameAs service is available at http://dbpsa.aksw.org/SameAsService.

Categories and Subject Descriptors
M.0 [Knowledge Management]: Knowledge Acquisition; H.4 [Information Systems Applications]: Miscellaneous

General Terms
Semantic Web, Linked Data, DBpedia, SameAs, Link Rate

1. INTRODUCTION
As DBpedia [3] was evolving during the 9 years of its existence, the community extended the linksets to DBpedia resources. Thus, DBpedia has more than one URI that represents the same resource, which leads to the identifier heterogeneity problem. For instance, a DBpedia resource can contain owl:sameAs links to other data sets such as FreeBase¹, Wikidata, GeoNames² or yago³.

¹ Freebase project webpage: http://freebase.com
² GeoNames project and exploration webpage: http://geonames.org
³ Yago project webpage: http://yago-knowledge.org

Furthermore, DBpedia has more than one URI representing the same resource within the dataset, e.g. dbpedia:Brassil⁴ has at least the following equivalents within DBpedia: dbpedia:Republica_Federativa_do_Brasil, dbpedia:ISO_3166-1:BR and dbpedia:Brazil, which all redirect to dbpedia:Brazil. Thus a problem to consider is how to resolve any of the equivalents directly to the final URI, e.g. http://dbpedia.org/resource/Brazil, without any redundancies.

⁴ Throughout the paper we are using the following namespace definitions: owl: http://www.w3.org/2002/07/owl#, dbpedia: http://dbpedia.org/resource/.

Also, according to Halpin et al. [1] and Wood et al. [4], sameas.org has collected millions of triples with owl:sameAs relations. It would be important to promote reciprocal owl:sameAs confirmation mechanisms and to develop effective trust mechanisms to assure the quality of owl:sameAs relations.

To tackle the identifier heterogeneity problem we make the following contributions:

• We describe an approach for the mitigation of the identifier heterogeneity problem and implement a prototype where the user is able to evaluate existing links, as well as suggest new links to be rated.
• The ability to generate statistics about good and bad links, which brings the possibility of quality control for the links to DBpedia.
• We define the DBpedia Unique Identifier (DUI): instead of several transient owl:sameAs DBpedia URIs for the same final address, it is now possible to have a unique URI from DBpedia. A DUI goes directly to the final address instead of having to process several possible intermediate results. For example, with a URI from Freebase, 17 redundant URIs from DBpedia were avoided, and if one used a service such as sameAs.org, 1141 URIs would be avoided.

The rest of the paper is organized as follows: Section 2 presents the proposed approach for tackling the identifier heterogeneity problem, we evaluate our work in Section 3, in Section 4 we focus on related work, and finally Section 5 concludes the paper and outlines future work.

2. REPRESENTATION OF THE IDEA
This section explains our main idea, including its implementation and descriptions.

Before continuing, we state the definitions that were adopted.

• Normalization of the URI: By normalizing URIs we understand the elimination of redundancies.
• DBpedia unique identifier: The DBpedia Unique Identifier (DUI) is a unique URI that identifies a resource in the DBpedia repository and is also the result of our normalization.

The idea started with a stand-alone service on the web that solves the problem as follows: the user provides a URI as parameter and, instead of several transient URIs with the owl:sameAs property, receives a single DUI from our service.

2.1 The work-flow
The work-flow for requesting the DUI of a given resource is represented in Figure 1. First, the user provides a URI from some source, e.g. FreeBase. Then, instead of possibly several resulting URIs with the property owl:sameAs, our system returns a DUI. Consequently, the user has the possibility to rate, verify, validate, and suggest a different link. The ratings in turn give us the chance to collect statistics about the quality of the links.

Figure 1: General work-flow.

A service was also implemented, where the user can provide a URI and the API returns the DBpedia identifier as a URI that represents the owl:sameAs of the URI provided.

2.2 Methodology
This section describes in four steps the technique and how the idea was developed, from the phase of importing links into a relational database to the development of the web service and a GUI.

(1) The files with triples that contain owl:sameAs links were downloaded.
(2) All triples were imported into a relational database⁵, because we will use some characteristics of a relational database, i.e. comparison with a voting system, in future work.
(3) A web service was implemented, where the user enters a URI and receives a DUI.
(4) In order to provide an interface to this service, a web system was created that receives a URI as input, returns a DBpedia identifier as output, and allows the user to rate the resulting link and make suggestions about it.

⁵ http://tinyurl.com/creatdb

Figure 2 presents the relation of the contributions in graph form.

Figure 2: Relation of the contributions.

Here the DBpedia Link Repository uses the DBpediaSameAs service to tackle the heterogeneity and give the appropriate DUI, which redirects the user to the DBpedia Link Rate interface, thus providing feedback to the DBpedia Link Repository and thereby improving the quality of the DBpedia endpoint.

3. EVALUATION
The aim of this qualitative⁶ evaluation was to verify the behavior of the DBpediaSameAs service and of the Graphical User Interface (GUI) that makes it possible to verify and rate the links. Three evaluation criteria were chosen: (1) Normalization on DBpedia URIs: whether DBpediaSameAs can provide a normalization of DBpedia URIs. (2) Rate the links: whether DBpediaSameAs can provide a way to rate the links. (3) DBpediaSameAs as a service: whether DBpediaSameAs can provide a stand-alone service on the web that brings the normalization of DBpedia URIs.

⁶ http://tinyurl.com/rmethod

3.1 Normalization on DBpedia URIs
The criterion used in this evaluation is solely to tackle the heterogeneity that was observed during the search for co-references between different data sets, where redundancies were a problem.

When a URI from Freebase was used to obtain a DBpedia URI, we observed that at least 3 URIs were returned, all leading to the same final address.

As an example of a real case, executed on our public server, with a URI from Freebase:

    $ curl "http://dbpsa.aksw.org/SameAsService/SameAsServlet?uris=http%3A%2F%2Frdf.freebase.com%2Fns%2Fm.015fr"
    returns: http://dbpedia.org/resource/Brazil

In this case, instead of 17 URIs from DBpedia that lead to the same final address, our approach takes the user directly to the final address.

As can be observed in Figure 4, which depicts the transitive and redirect URIs, with this approach the user gets only one URI from DBpediaSameAs instead of several. In this way, a normalization of DBpedia URIs is provided.

3.2 Rate the links
In order to obtain link ratings, a GUI was implemented that allows users to give feedback and suggestions, in this way improving the quality of the links. Rating is a quite simple process: the GUI just asks the user to rate the link with +1 if the link meets their expectations or -1 if the link is wrong or some type of spam. The GUI was developed using concepts from prefix.cc⁷ and the work of Zaveri [5], such as our rating system (+1 and -1) and the standard of the web documents. Some improvements and personalization were also provided, such as the suggestions and the possibility to check the link. Figure 3 shows the moment when the user clicked on -1, indicating that the user didn't like the link, and was asked to suggest a new URI.

⁷ http://prefix.cc

Figure 3: Rate a link. Available at http://dbpsa.aksw.org/SameAsService

The field for suggesting a new link only appears when the user is not satisfied with the current link: after the user clicks on -1, the system asks for an optional suggestion.

3.3 Results
The results of this work can also be expressed in numbers that were obtained while importing triples into the relational database, together with some results from the sameAs.org web site. A total of 62,531,487 triples were imported into our database; the whole operation took 2,220 seconds, i.e. about 28,167 triples were imported per second. The source code used to obtain the results is available in our GitHub repository⁸.

⁸ https://github.com/firmao/dbpedia-links/blob/master/CreateDB.sh

3.3.1 Transitive and Redirect Links
Transitive and redirect links are redundancies in DBpedia that are supposed to link to the same place; in other words, they use the owl:sameAs property. These links redirect to other links and thus provide a transition between links, hence the name transitive. Instead of using these transitive links that point to the same final destination URI, the final destination URI is used directly. Figure 4 illustrates this.

Figure 4: Transitive / Redirect links in DBpedia.

We discovered and treated 6,473,988 triples with transitive and redirect links among 62,531,487 imported links from 142 domains inside DBpedia. Thus, 10.35% of the links can be avoided in some cases.

3.4 Discussion
DBpediaSameAs was evaluated with respect to its normalization of URIs, its link rating, and its use as a stand-alone service on the web. As a result of the normalization, a DUI was obtained in order to tackle the heterogeneity. In other words, instead of several URIs, e.g. from sameAs.org, one DUI was obtained. The link rating functionality further allows the quality of the dataset to be improved.

Besides the GUI of DBpediaSameAs, a stand-alone service on the web was also developed. It provides the functionality to get a DUI without a GUI, for agents or people who don't need to use DBpediaSameAs in graphical mode, allowing its use as an off-the-shelf component.

4. RELATED WORK
The work [5] elaborates a data quality assessment methodology for DBpedia, which comprises a manual and a semi-automatic process. This work reinforces the concept of data quality used in our work, although in our case the process is more manual; we are also able to improve the DBpedia data quality.

The work [2] presents a two-staged experiment for the creation of gold standards that act as benchmarks for several interlinking algorithms. The aspects similar to our work are the validation of links and a so-called manual validation, where the user, i.e. validator or evaluator, specifies whether a link generated by an interlinking tool is correct or incorrect. The results of the link validation process are used to learn presumably better link specifications, thus achieving high quality. This work also proposes an experiment to investigate the effect of user intervention in dataset interlinking on small knowledge bases.

4.1 A related problem with sameAs.org
sameAs.org is a service that is a leading source of co-reference data on the Semantic Web. As an example, we accessed the sameAs.org web site with a URI from Freebase that should bring information about the country Brazil.

We used the URI http://rdf.freebase.com/ns/m.015fr as parameter to the service and received more than 1140 URIs in return, as shown in Figure 5, which can leave the user in doubt about which one is correct.

Figure 5: Several URIs with the property owl:sameAs from the web site sameAs.org.

Our work is not an alternative to the sameAs web site, but brings additional possibilities. For instance, we noticed that sameAs.org does not provide a way to rate a link, whereas with such a rating it is possible to improve the quality of the data and to make things easier for the user.

5. CONCLUSION AND FUTURE WORKS
An approach was provided to tackle the heterogeneity caused by owl:sameAs redundancies that were observed while researching co-references between different data sets. It provides a unique DBpedia identifier and gives users the chance to rate the resulting links and make suggestions.

A proof of concept was implemented as a web system in order to present and validate our idea and every concept of this work. The source code is available⁹.

⁹ https://github.com/firmao/DBPediaLinkSameAs.git

Our results show that there are benefits when a considerable number of owl:sameAs redundancies can be avoided. Rating the links and allowing users to make link suggestions brings more quality to the repository, and the stand-alone service on the web allows DBpediaSameAs to be used in a command-line environment as well, as an off-the-shelf component.

For the future we plan: (1) A study of the results of link rating. This needs a period of usage of the DBpediaSameAs service in order to gather sufficient results for proper analysis. (2) An implementation case with more members of the DBpedia community, studying how the system behaves when deployed within the DBpedia community.

6. ACKNOWLEDGMENT
We would like to acknowledge the National Council for Scientific and Technological Development (CNPq)¹⁰ and Universität Leipzig for their support. Special thanks to Markus Ackermann for essential help with the deployment of the application and very good suggestions. Additionally, research activities of this paper were funded by grants from the EU's 7th & H2020 Programmes for projects ALIGNED (GA 644055), GeoKnow (GA 318159) and LIDER (GA 610782).

¹⁰ http://www.cnpq.br/

7. REFERENCES
[1] H. Halpin, P. J. Hayes, J. P. McCusker, D. L. McGuinness, and H. S. Thompson. When owl:sameAs isn't the same: An analysis of identity in linked data. In P. F. Patel-Schneider, Y. Pan, P. Hitzler, P. Mika, L. Zhang, J. Z. Pan, I. Horrocks, and B. Glimm, editors, International Semantic Web Conference (1), volume 6496 of Lecture Notes in Computer Science, pages 305–320. Springer, 2010.
[2] M. Hassan, J. Lehmann, and A.-C. N. Ngomo. Interlinking: Performance assessment of user evaluation vs. supervised learning approaches. In 24th International World Wide Web Conference (WWW 2015): workshop: Linked Data on the Web (LDOW2015), Florence, Italy, May 18 to 22, 2015, Proceedings, 2015.
[3] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal, 2014.
[4] D. Wood, M. Zaidman, L. Ruth, and M. Hausenblas, editors. Linked Data: Structured data on the Web. Manning, 2014.
[5] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, pages 97–104, New York, NY, USA, 2013. ACM.
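
As a supplementary illustration of Sections 3.1 and 3.3.1, the following minimal Python sketch shows how a request to the service could be built and how a chain of transitive or redirect links could be followed to a single final URI. It is not the DBpediaSameAs implementation. The function names build_request_url and resolve_dui, the in-memory dictionary standing in for the relational database, and the particular redirect chain are assumptions made for illustration; only the service endpoint, the Freebase URI and the DBpedia resource names are taken from the text.

```python
from typing import Dict
from urllib.parse import quote

# Endpoint taken from the curl example in Section 3.1.
SERVICE_ENDPOINT = "http://dbpsa.aksw.org/SameAsService/SameAsServlet"


def build_request_url(uri: str) -> str:
    """Percent-encode a URI and pass it as the `uris` query parameter."""
    return f"{SERVICE_ENDPOINT}?uris={quote(uri, safe='')}"


def resolve_dui(uri: str, links: Dict[str, str]) -> str:
    """Follow redirect / owl:sameAs links until a URI without an outgoing link is reached.

    `links` maps a URI to the URI it points to. It is a hypothetical
    in-memory stand-in for the link table described in Section 2.2.
    """
    seen = {uri}
    while uri in links:
        uri = links[uri]
        if uri in seen:  # guard against cyclic link chains
            raise ValueError(f"cyclic link chain at {uri}")
        seen.add(uri)
    return uri


# Hypothetical chain built from the equivalents named in the introduction.
DBR = "http://dbpedia.org/resource/"
links = {
    DBR + "Brassil": DBR + "Republica_Federativa_do_Brasil",
    DBR + "Republica_Federativa_do_Brasil": DBR + "ISO_3166-1:BR",
    DBR + "ISO_3166-1:BR": DBR + "Brazil",
}

print(build_request_url("http://rdf.freebase.com/ns/m.015fr"))
print(resolve_dui(DBR + "Brassil", links))
```

Under these assumptions, every URI in the chain resolves to the same final address, so an agent needs only one lookup instead of processing each intermediate link. Raising an error on a cycle is one possible design choice; a real deployment might instead pick a canonical member of the cycle.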