Automated Mapping Generation for Converting Databases into Linked Data Simeon Polfliet Ryutaro Ichise Ensimag engineering school Principles of Informatics Research Division Grenoble Institute of Technology (INPG) National Institute of Informatics Grenoble, France Tokyo, Japan simeon.polfliet@ensimag.imag.fr ichise@nii.ac.jp Abstract. Most of the data on the Web is stored in relational databases. In order to make the Semantic Web grow we need to provide easy-to-use tools to convert those databases into linked data, so that even people with little knowledge of the semantic web can use them. Some programs able to convert relational databases into RDF files have been developed, but the user still has to link manually the database attribute names to existing ontology properties and this generated “linked data” is not actually linked with external relevant data. We propose here a method to associate automatically attribute names to existing ontology entities in order to complete the automation of the conversion of databases. We also present a way - rather basic, but with low error rate - to add links automatically to relevant data from other data sets. Keywords: Database, Linked Data, Semantic Integration, Semantic Web 1 Introduction Even though significant research and development efforts have been made, the achieve- ment of the vision of the Semantic Web remains remote. The amount of data on the Semantic Web remains marginal in comparison with the traditional Web. The impor- tance of revealing relational data and making it available as RDF and as Linked Data[1] has been already acknowledged. Most notably, Virtuoso RDF views[4] and D2RQ[2] are production-ready tools for generating RDF representations from relational database contents. But the main restriction to their deployment is the complexity of generating a mapping, which is the last non-automated part of these programs. In this paper, we will present a method to generate automatically the mapping between attribute names and existing ontology entities, completed with a method to add links automatically to external data. Then, we will present the application of this method on relational databases applied on the D2RQ Mapping system and the D2R Server[3], and the tests and results on different kind of relational databases. A presentation of our software AuReLi (Automatic Relational Database to Linked Data Converter) can be found at http://ri-www.nii.ac.jp/AuReLi/ 2 Method In ontology matching, there are three types of methods to compare two entities: string- based, structure-based and knowledge-based methods. In the present problem, there 2 Automated Mapping Generation for Converting Databases into Linked Data is on one side a database and on the other side we have several ontology descriptions. Thus, structure-based method are not relevant here. We are seeking to compare the name of an attribute in a database with the name of an ontology property. These names can be composed by one or several words: the first step is to split the name into a set of words in order to compare the words of each set. The success of the matching depends on the correctness of the word decomposition. The words composing names of ontology properties and of attributes in a database are usually either separated by special characters, for instance product_name, or by a change of case, e.g. ProductName. As sometimes it is not the case, we completed this simple splitting method with a method based on the presence of the words in a dictionary such as the WordNet dictionary1 used here. After doing the previous splitting, it is necessary to check if the resulting words exist in the dictionary. If that is not the case, then we try to split it into words that are in the dictionary. However, because it is possible that the word is not in the dictionary but some part of it is, we will only keep the result if all the decomposed parts are present in the dictionary. With this method, even productname will be correctly split. The second step is to compare the resulting set of words of the attribute name with the sets of words of all the ontology entities, and then return the best match. In order to compare the words, we use string-based similarity measures2 , especially Jaro-Winkler, and WordNet similarity measures3 : Lin[5] and Wu and Palmer[6] measures. We use WordNet measures if the words exist in the WordNet dictionary, otherwise we use the string-based ones. Once the mapping is done, in order to have true linked data, we want to add links to relevant data. The idea is to make a SPARQL query on a given data set. If you know the target data and its ontology entities, you can specifically build SPARQL queries for this data set to get links. But here, in a more general setting, we do not have this information. However, there is a property common to most of the data sets: rdfs:label. Even better, this property is especially good because it is usually at the same time short and clearly defining the data. Therefore, if the rdfs:label property was correctly set on your data, the SPARQL query based on this property should not return wrong links and has good chances to find a result if there is a related data in the target data set. 3 Implementation We produced a reusable Java library and used the D2RQ Map and the D2R Server[3] as a basis to implement and test our method. A Java graphical user interface was produced for the mapping generation, in order to simplify its use as much as possible. First, the user has to define the parameters to connect to the relational database, and to give to the program the ontology descriptions he wants to use, as shown in Fig. 1. We already provide some of the most common generic ontology descriptions along with some more specialized ones, but the user can add any other ontology by providing a file with its OWL definition. Then, the program generates the mapping of the table and attribute names with the ontology entities. It presents the resulting mapping to 1 Princeton University: WordNet, Version 3.0: http://wordnet.princeton.edu/wordnet/download/ 2 S. Chapman: SimMetrics Java library: http://www.dcs.shef.ac.uk/~sam/simmetrics.html 3 D. Hope: Java WordNet::Similarity: http://www.cogs.susx.ac.uk/users/drh21/ Automated Mapping Generation for Converting Databases into Linked Data 3 Fig. 1. Mapping generation graphical interface the user so that he can check and make changes if necessary. It also allows the user to choose which attributes to use as labels for the rdfs:label property. The D2R Server was also modified to add links automatically in the generated data. If the feature is activated, it makes a SPARQL query on DBpedia for each request of the user and add the link to the data if there was a result. We used DBpedia because it is currently one of the biggest and the most general linked database. 4 Test and Results Five databases from different sources, with different size and about different topics were used for the tests: Northwind4 , World and Sakila5 , Automobile6 , World Development Indicator7 There are approximately three hundred attributes in those five databases: after a manual check of the mappings, 79,66% of the attributes were correctly mapped. The wrong mappings are explained by the fact that some attributes were too spe- cific and consequently could not match any existing ontology property in the ontology descriptions used in the experiment. Another limit is the use of acronyms or short abbreviations, which did not produce a correct mapping either. The generation time was around one minute for each database. The mapping generated automatically can be seen in Fig. 2 for the World database. On the left and the middle are the table names and the attribute names, on the right are the matched ontology entities. We can observe for instance that the attribute GNPOld do not have good corresponding property and thus is mapped with foaf:OnlineAccount which is obviously irrelevant. But on the other hand, Percentage becomes dbpedia:part, which is quite good since a percentage is a part of something. This matching is due to WordNet because it would not have been found by a string-based similarity measure. For the server, the use of the feature to automatically add links is slightly slowing down each request of the user because it needs the answer of the SPARQL query. It 4 Example database for the Microsoft SQL Server: http://www.microsoft.com/downloads/details.aspx?FamilyID= 06616212-0356-46a0-8da2-eebc53a68034 5 Two example database from the MySQL website: http://dev.mysql.com/doc/index-other.html 6 Data set from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets.html 7 database from the World Bank Data Catalog: http://data.worldbank.org/data-catalog 4 Automated Mapping Generation for Converting Databases into Linked Data Fig. 2. Mapping generation result for the World database becomes problematic if the external data set is slow or do not answer to the query. The results on the rdfs:label property on DBpedia are usually good, providing the labels in the mapping are correct. The principal case where the added links are wrong is in the case of homonyms, e.g. cities such as London, England and London, Canada. 5 Conclusion The automatic mapping generation is a difficult problem which renders almost impossi- ble the automatic production of a 100% correct mapping. Nevertheless, even if the user still needs some knowledge of the Semantic Web, we managed to simplify the process with a user-friendly interface where the user only has to check the correctness of the proposed mapping. The automatic addition of links in the generated RDF is simple and functional, and can easily be extended to add a greater variety of links. References 1. T. Berners-Lee. Design issues: Linked data, 2006. http://www.w3.org/DesignIssues/LinkedData.html 2. C. Bizer and A. Seaborne. D2RQ - treating non-RDF databases as virtual RDF graphs. In ISWC2004 (posters), November 2004. 3. C. Bizer and R. Cyganiak: D2R Server, Version 0.7 http://www4.wiwiss.fu-berlin.de/bizer/d2r-server/ 4. O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In Proceedings of the 1st Conference on Social Semantic Web, volume P-113 of GI-Edition - Lecture Notes in Informatics (LNI), ISSN 1617-5468. Bonner Kollen Verlag, September 2007. 5. Lin, D. An information-theoretic definition of similarity. In Proceedings of the In- ternational Conference on Machine Learning, 1998 6. Z. Wu and M. Palmer. Verb semantics and lexical selection. In 32nd Annual Meeting of the Association for Computational Linguistics, 133-138, 1994