=Paper=
{{Paper
|id=Vol-2918/short1
|storemode=property
|title=Semantic Data Transformation
|pdfUrl=https://ceur-ws.org/Vol-2918/short1.pdf
|volume=Vol-2918
|authors=Joffrey de Oliveira,Hammad Aslam Khan,Olivier Curé,Philippe Calvez
|dblpUrl=https://dblp.org/rec/conf/esws/OliveiraKCC21
}}
==Semantic Data Transformation==
Joffrey de Oliveira1,2, Hammad Aslam Khan2, Olivier Curé1, Philippe Calvez2
1 LIGM, Univ Paris Est Marne-la-Vallée, CNRS, F-77454. {firstname.lastname}@univ-eiffel.fr
2 ENGIE LAB CRIGEN. philippe.calvez1@engie.com, hammadaslam.khan@engie.com

Abstract. Data transformation from diverse formats to semantic data is a major challenge that commercial products address only partially; the transformation process still requires considerable effort from users. Knowledge creation, one of the major steps in knowledge graph (KG) maintenance, requires transforming existing data from various formats into RDF (Resource Description Framework) data. Current data transformation approaches are either manual or mapping-based. This work presents a semi-automatic semantic data transformation approach for knowledge creation that requires as little input from users as possible, relying on machine learning and natural language processing techniques.

Keywords: Semantic data, knowledge graph, RDF, machine learning, knowledge creation

1 Introduction

With the emergence of the Semantic Web and Linked Open Data, more and more applications have been developed over the past decades to exploit semantic data. Nevertheless, the vast majority of content published on the Web is not represented in RDF and thus requires a transformation before being published. These data integration issues are very common when working with data sources on the Web. Several solutions exist to transform data into semantic data, most of them based on mappings that can accurately generate semantic data from hand-written scripts. Writing these mappings, however, requires considerable effort on the part of qualified users: they often involve a specific mapping language and the definition of rules for each data source, which makes the process time-consuming and prone to human error. Some solutions try to lower the required expertise by providing dedicated mapping languages or SPARQL extensions, such as SPARQL-Generate [2], RML [1] and R2RML [3]. In addition, there is a multitude of different data sources, and each new source demands its own transformation. This has led to the development of new solutions that better manage this source diversity. However, these solutions remain time-consuming on the user side, since they usually require the user to manually map each element of their data to some ontologies. The most advanced of these systems try to make the process semi-automatic by learning from the data and schemas that match their ontologies [6]. To ease the creation of knowledge graphs from non-semantic data, we propose an approach combining natural language processing (NLP) and machine learning to allow for a more automated transformation of data.

2 Motivating Example

Over the past few years, more and more governments have started publishing their data for use by the public. These Open Data initiatives take place at every level of government, from single cities to nationwide programs, and involve many different types of data. One of these is geographical data, often in the form of addresses and geographical locations of all sorts of places, such as named places, landmarks, residential buildings, or businesses. This data allows users to develop a variety of useful applications, notable examples being non-proprietary maps of the world such as OpenStreetMap, MapBox and others.
But currently the vast majority of published data is not available as linked data. OpenStreetMap (OSM) created OSMonto [5], an ontology close to OSM's database, to integrate more closely with the Semantic Web. Nevertheless, linked data from certain regions can remain hard to find, if it exists at all; hence the need to translate openly published relational data into linked data. The task remains difficult, as there is no single worldwide standard for describing addresses, and even within a single country there is rarely a single database regrouping all data under a single schema. It is not unusual to find a different schema for each city when cities publish the data themselves. We explain our approach through examples obtained while transforming this type of data in Section 3 and evaluate our approach in Section 4.

3 Approach

3.1 Processing overview

Figure 1 represents our approach to data transformation. We take the data sources that the user wants to map to ontologies and the description of these ontologies. We then match the columns and tables found in the data sources to the properties and classes from the ontologies. The matches are suggested based on scores from three different modules, denoted NLP, String and Numeric (detailed in this section). There may be cases where the mapping becomes ambiguous because of a lack of information. In such a scenario, a mapping is proposed and the end user is given the opportunity to verify and change some of the mappings through the graphical user interface provided with the tool we developed.

Fig. 1. Representation of our process

3.2 Matching classes

When a user wants to transform tabular data, the application lets her create entity types by selecting the columns she wants to see in the entity type. Here, an entity type is a table containing a number of columns that will be matched with a class from the ontologies known by the application. The columns under the entity type will be matched to properties associated with the class. That way, when generating the RDF document, each row in the entity type's table will generate an instance of the class that matched the entity type, and the values in this row's columns will be used to generate the properties associated with that instance. The matching is performed between properties and columns. The average score of the matchings between the columns of a table and the properties related to a class is then used as the score for the matching between the class and the table.

3.3 NLP Module

The NLP module matches properties and columns. It relies on Word2Vec [7], a group of deep learning models that makes it possible to represent words in a vector space. We can ask Word2Vec for the cosine similarity of the two vectors representing two words, expressed as a float, and we assume that this cosine similarity is representative of the semantic proximity of the two words. When we match a column with a property, we tokenize the name of the property and the name of the column. We then compare each token from the column's name with each token from the property's name. The highest cosine similarity is kept as the score for the matching between this column and property. The NLP module is essential to our process as it is the only one of our modules that can run without data from previous matchings made by the user. This allows us to make predictions on new data.

Fig. 2. NLP Module matching a column with the property "locn:thoroughfare"

Figure 2 shows how the NLP module scores the match between the property "thoroughfare" and a column named "StreetName". In our implementation we chose to use word embeddings trained on Google News articles (https://code.google.com/archive/p/word2vec/). When working on a specific domain, it may be beneficial to train word embeddings on domain-specific documents to ensure the recognition of some words and to prevent the semantics of common parlance from diluting the meaning of domain words.
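The Python sketch below illustrates this token-level scoring under stated assumptions: it uses gensim's KeyedVectors with the pretrained Google News embeddings mentioned above, and the file name, tokenizer and function names are ours for illustration, not the paper's actual implementation.

```python
# Minimal sketch of the NLP module's column/property scoring (Section 3.3).
import re
from itertools import product

from gensim.models import KeyedVectors

# Illustrative path to the pretrained Google News embeddings.
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def tokenize(name: str) -> list[str]:
    """Split an identifier such as 'StreetName' or 'locn:thoroughfare' into lowercase tokens."""
    name = name.split(":")[-1]                      # drop an ontology prefix if present
    tokens = []
    for part in re.split(r"[_\-\s]+", name):
        tokens += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part)
    return [t.lower() for t in tokens if t]

def nlp_score(column_name: str, property_name: str) -> float:
    """Highest cosine similarity between any column token and any property token."""
    col_tokens = [t for t in tokenize(column_name) if t in kv]
    prop_tokens = [t for t in tokenize(property_name) if t in kv]
    if not col_tokens or not prop_tokens:
        return 0.0
    return max(kv.similarity(c, p) for c, p in product(col_tokens, prop_tokens))

# e.g. nlp_score("StreetName", "locn:thoroughfare") compares
# {'street', 'name'} against {'thoroughfare'} and keeps the best similarity.
```

Averaging such column scores over a table, as described in Section 3.2, would then give the score of the table against a candidate class.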
3.4 String module

The string module uses predicates to establish a profile of the strings in the name and in the content of attributes; each predicate tests a specific feature of the string. For example: does the string contain numbers? Special characters? Fewer than X tokens? Does it start with the prefix X? With dozens of such predicates we determine a set of labels representing a string, each positive predicate generating a label. These labels are known by the module, which assigns them weights learned from the strings that previously matched each property. Figure 3 shows an example of some predicates generated for the shown data. During its initialization phase, the module retrieves data from previous sessions, measures the prevalence of each label for each property and compares it to the prevalence of that label in all of the other properties to determine the weight to give it. In general, the more a label is present in the strings of its property and the less it is present in the other properties, the greater its weight. A label present in all strings therefore indicates nothing and receives a weight of zero. From the best results of these modules on each attribute of the entity, we determine a degree of certainty for the correspondence with this class. The class with the best certainty is kept, and the properties that correspond best to its attributes are also preserved. These results are presented to the user, who can change them by choosing the class and properties herself if they do not suit her.

Fig. 3. Tokenization and generation of some predicates from a column and its content. This data generates the following predicates, among others: nameContainsAddress(), contentContainsNumerical(), contentContainsLane(), contentContainsSt().

3.5 Numeric module

The numeric module uses the same mechanics as the string module but analyses the content of columns as integers, floats and doubles instead of plain strings. The predicates built by this module focus on the essential aspects of the numbers present in the column. For instance, the presence of negative values, the presence of floating-point values and the range of the values in the column are significant pieces of information that help us clearly distinguish between concepts such as a postal code and a latitude.

3.6 Combining the scores

The final score for the matching between a property and a column is calculated by adding up the scores of the three modules. The score of the NLP module is divided by two in this step. This does not impact its performance in cases with no previous data, as it is then the only active module. We do this because we consider that the numeric and string modules make better matchings individually, while the NLP module can still help break ties in ambiguous cases.
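As an illustration of Sections 3.4 to 3.6, the sketch below shows a few hand-written predicates producing labels, a weight-based module score, and the final combination in which the NLP score is halved. The predicate names, the latitude-range check and the weight dictionary are illustrative assumptions, not the paper's actual predicate set or weighting scheme.

```python
# Illustrative sketch of predicate-based labelling and score combination.
import re

def string_labels(name: str, values: list[str]) -> set[str]:
    """Labels produced by simple string predicates over a column name and its content."""
    labels = set()
    if any(re.search(r"\d", v) for v in values):
        labels.add("contentContainsNumerical")
    if any(re.search(r"[^\w\s]", v) for v in values):
        labels.add("contentContainsSpecialChars")
    if "address" in name.lower():
        labels.add("nameContainsAddress")
    if all(len(v.split()) < 4 for v in values):
        labels.add("contentLessThan4Tokens")
    return labels

def numeric_labels(values: list[float]) -> set[str]:
    """Labels produced by numeric predicates: negative values, floats, value range."""
    labels = set()
    if any(v < 0 for v in values):
        labels.add("containsNegative")
    if any(v != int(v) for v in values):
        labels.add("containsFloatingPoint")
    if all(-90 <= v <= 90 for v in values):
        labels.add("rangeFitsLatitude")     # helps separate latitudes from postal codes
    return labels

def module_score(labels: set[str], weights: dict[str, float]) -> float:
    """Weights come from previous sessions: high when a label is frequent for this
    property and rare for the others; uninformative labels get a weight of zero."""
    return sum(weights.get(label, 0.0) for label in labels)

def combined_score(nlp: float, string: float, numeric: float) -> float:
    # The NLP score is halved when combined with the other two modules (Section 3.6).
    return nlp / 2 + string + numeric
```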
4 Evaluation

In this section, we present a preliminary evaluation of our solution, comparing our results to a well-established semantic data integration tool.

4.1 Data sets and Ontologies

In this experiment, our goal is to create RDF data from textual documents. We consider the location (https://www.w3.org/ns/locn) and geo (https://www.w3.org/2003/01/geo/) ontologies. We use the location ontology to represent addresses through their street name (locn:thoroughfare), number (locn:poBox), full address (locn:fullAddress), postal code (locn:postalCode) and city name (locn:postalName). The geo ontology is used to represent geographical points through their longitude and latitude (geo:long, geo:lat). We translate data from four different open data projects from different countries: the "Base Adresse Nationale" (BAN, https://adresse.data.gouv.fr/donnees-nationales), a national project from the French government, and open data projects from Montreal (https://donnees.montreal.ca/ville-de-montreal), Austin (https://data.austintexas.gov/Locations-and-Maps/) and London (https://opendata.london.ca/). We cleaned the data before translating it by separating the geographical information to form a "Geographical" table and an "Address" table, and we ignored data that we would not translate, such as ids, keys and metadata from the original databases. The largest case was BAN, where only 7 of the original 20 columns were relevant for our ontologies.

4.2 Comparison setting

Karma [4] is a data integration tool allowing users to generate RDF data from a variety of formats. The tool facilitates the mapping of columns from the source data to properties and classes in ontologies. When mapping columns, Karma recommends a number of mappings based on the content of the column and by learning from matches made by the user in previous sessions. We compare the quality of the top recommendation from our approach with Karma's by mapping the same data onto the same ontologies using both systems.

4.3 Data and Results

Since our objective is to automate the translation process, we measure the quality of our matching by counting the number of user interactions necessary to create linked data from a table. We increment the counter of user interactions when the proposed matching is incorrect. We count the matching between a property and a literal as "Assign Property" and the matching between a property and a class as "Assign Relationship". Both Karma and our solution start the evaluation with no previous matches to learn from, and the expected matchings are the same for both solutions.

Fig. 4. Results for the initial matching

Figure 4 details the user interactions that were counted when translating data from the different sources into linked data. We reach an average of 0.45 user actions per column using our approach, compared to Karma's 0.68. We notice that, out of the 7 properties we were matching with the data, 3 were found solely based on NLP in the first dataset.

5 Related Work

Most solutions for ontology matching and mapping are either mapping languages or transformation software. In both cases, they are oriented towards expert users such as ontology experts. Karma stood out as it tries to be accessible to anyone with an understanding of their data and their ontology, thanks to a semi-automated approach. We based the core of our approach on predicates inspired by Karma and extended it by introducing NLP; the addition of NLP allows us to suggest mappings for properties and classes that have not been matched before.
This was not possible with Karma, as it needed data from previous matchings made by the user for a property before suggesting it in new matchings. The other addition we made to Karma's process was to split the content predicates into string-based predicates and numerical predicates.

6 Conclusion and future work

We have proposed an approach to semi-automatically match non-semantic data with ontologies, using machine learning and NLP techniques to create knowledge graphs. We have shown that combining different approaches, in particular NLP and string matching, improves the quality of our matchings. The quality of our matchings depends directly on the quality of the data we are matching and on the quality of the data from previous sessions. Data attributes using real-life vocabulary can be used directly to suggest matchings with our NLP approach, whereas truncated words and acronyms are unusable. The more we use each ontology, the better the quality of our matchings.

Even though our process shows good results, it could be improved by extending its NLP approaches. In our evaluation, several sources had their schemas in different languages, which complicated the task of our NLP module, since Word2Vec must have seen the exact word being matched in order to give an answer. If we translated the names in the data's schema in a separate phase, or if the NLP module were capable of doing so automatically, its results should improve significantly. The combination of the different modules could itself be improved by adjusting the weight of the different scores based on how accurate their predictions were for a given property in previous sessions. In future work, the model that we train will be made reusable and able to learn during sessions; this will improve the quality of our matchings during the sessions themselves. In this paper our focus was on the quality of the matchings; this focus, however, came at a cost in memory consumption and processing power, and work will be needed to optimize our approach so that it can handle larger amounts of data. The full integration of the graphical user interface and the extension of its functionalities will also be essential, and we are working towards the development of a full application.

References

1. Dimou, A., Vander Sande, M., Colpaert, P., Verborgh, R., Mannens, E., Van de Walle, R.: RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data. CEUR Workshop Proceedings, vol. 1184 (2014).
2. Lefrançois, M., Zimmermann, A., Bakerally, N.: Flexible RDF Generation from RDF and Heterogeneous Data Sources with SPARQL-Generate. pp. 131-135 (2017). https://doi.org/10.1007/978-3-319-58694-6_16
3. Das, S., Sundara, S., Cyganiak, R.: R2RML: RDB to RDF Mapping Language. W3C Recommendation, September 2012. https://www.w3.org/TR/r2rml/
4. Knoblock, C.A., et al.: Semi-automatically Mapping Structured Sources into the Semantic Web. In: Simperl, E., Cimiano, P., Polleres, A., Corcho, O., Presutti, V. (eds.) The Semantic Web: Research and Applications, ESWC 2012. Lecture Notes in Computer Science, vol. 7295. Springer, Berlin, Heidelberg (2012). https://doi.org/10.1007/978-3-642-30284-8_32
5. Codescu, M., Horsinka, G., Kutz, O., Mossakowski, T., Rau, R.: OSMonto - An Ontology of OpenStreetMap Tags (2014).
6. Cerbah, F.: Learning highly structured semantic repositories from relational databases: The RDBToOnto tool. In: Proceedings of the 5th European Semantic Web Conference (ESWC 2008), pp. 777-781 (2008). ISBN 978-3-540-68233-2.
7. Mikolov, T., Corrado, G., Chen, K., Dean, J.: Efficient Estimation of Word Representations in Vector Space, pp. 1-12 (2013).