=Paper=
{{Paper
|id=Vol-2447/paper7
|storemode=property
|title=Automated Mapping for Semantic-based Conversion of Transportation Data Formats
|pdfUrl=https://ceur-ws.org/Vol-2447/paper7.pdf
|volume=Vol-2447
|authors=Marjan Hosseini,Safia Kalwar,Matteo Rossi,Mersedeh Sadeghi
|dblpUrl=https://dblp.org/rec/conf/i-semantics/HosseiniKRS19
}}
==Automated Mapping for Semantic-based Conversion of Transportation Data Formats==
Marjan Hosseini, Safia Kalwar, Matteo Rossi, and Mersedeh Sadeghi

Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milano, Italy
{firstname.lastname}@polimi.it

Abstract. This position paper outlines our proposed approach to automate the process of creating mappings between different data formats in the transportation domain. The approach exploits the word2vec model, in combination with graphs, to find meaningful equivalence relationships between concepts in different data formats.

Keywords: Machine Learning · Ontology · Data Mapping

1 Introduction

The modern vision of transportation is that of "mobility as a service", in which users can seamlessly build door-to-door trips spanning several travel modes through a single entry point, with a unified interface and payment methods. To realize this vision, a wide range of diverse actors in the transportation ecosystem must communicate, interact, and cooperate with one another. The divergence of transportation standards and the heterogeneity of data representations, formats, and models are the main obstacles to making such an interoperable system a reality. Hence, solutions are needed that bridge this fragmentation, hide the peculiarities of different standards, and allow for the communication and exchange of data among heterogeneous, non-integrated systems.

In line with this objective, the SPRINT (Semantics for PerfoRmant and scalable INteroperability of multimodal Transport) project aims at developing tools and technologies that facilitate interoperability in the transport domain.
The core idea underlying the project is to go beyond pure "syntactic" interoperability, where interested parties are forced to adopt a unified set of formats for data exchange, and instead leverage "semantic" interoperability, which enables different systems to communicate with each other through their native standards by mapping their concepts to a common ontology that provides an unambiguous and homogeneous view of data. One of the specific goals of the SPRINT project is to enhance and automate the conversion process realized by the ST4RT Converter [2], which is the main component for the realization of semantic interoperability among heterogeneous, legacy transport services.

(This work was supported by Shift2Rail and the EU Horizon 2020 research and innovation programme under grant agreement No. 826172 (SPRINT). Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).)

A ST4RT converter (whose main principles and processes are depicted in Fig. 1) is a software artefact that acts as an adapter between two distinct formats. Given a suitable mapping between the source/target data and a reference formal ontology, a ST4RT converter first transforms data expressed in the source format into an intermediate representation based on the reference ontology. Then, following a similar procedure in the reverse direction, the converter translates the intermediate representation into the target data model. This approach has the notable advantage of exempting the participating parties from harmonizing the syntax and structure of their data; meaningful communication is achieved only if they agree on the concepts and semantics behind their terminology and syntax. As shown in Fig. 1(a), the ST4RT Converter relies on dual lifting and lowering processes that map the given standards to/from the reference ontology.
The first step in the whole approach is the annotation process, through which the source and target standards are semantically annotated to state the mappings between their data models and the reference ontology. Figure 1(b) shows the whole conversion workflow at design-time (left) and run-time (right). At design-time, given the structured data model of a standard, corresponding Java classes are automatically generated, which are the basis for the annotation process. At this stage, Java classes, attributes, and methods must be annotated to map each term to its equivalent concept in the reference ontology [2]. Using the annotated Java classes as input resources, the conversion process happens at run-time, when the system receives a message that is an instance of the source standard. The converter decomposes the source message into concepts and terms according to its native standard and creates instances of the corresponding Java classes for each concept. Finally, it uses the defined mapping to lift these Java instances to RDF triples conforming to the reference ontology. Following the inverse direction in the lowering stage, the converter first uses the defined mapping to translate the RDF triples into instances of suitable Java classes representing concepts of the target standard, and it ultimately generates the converted message, which is an instance of the target data model.

[Fig. 1. a) ST4RT approach to semantic interoperability; b) converter workflow at design- and run-time, composed of annotation, lifting, and lowering processes.]
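As an illustration of the lifting step just described, the sketch below turns a decoded source message into RDF-style triples. The annotation table, the `ex:` prefix, and all property names are hypothetical stand-ins for the actual ST4RT annotations, which the paper does not spell out.

```python
# Hypothetical sketch of lifting: annotated attributes of a decoded source
# object become RDF-style triples over a reference ontology.
# All names (the "ex:" prefix, the attribute/property table) are illustrative.

ANNOTATIONS = {
    # source attribute -> reference-ontology property (assumed mapping)
    "RouteSource": "ex:hasOrigin",
    "RouteDestination": "ex:hasDestination",
    "TravelerName": "ex:passengerName",
}

def lift(instance_id, source_object):
    """Lift a decoded source message into triples of the reference ontology."""
    triples = [(instance_id, "rdf:type", "ex:Booking")]
    for attr, value in source_object.items():
        prop = ANNOTATIONS.get(attr)
        if prop is not None:  # only annotated attributes are lifted
            triples.append((instance_id, prop, value))
    return triples

msg = {"RouteSource": "Milano", "RouteDestination": "Paris", "TravelerName": "Ada"}
print(lift("ex:booking1", msg))
```

The lowering stage would apply the same table in reverse, from triples back to attribute assignments on target-standard classes.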
Except for the mapping process in the annotation phase, all steps are accomplished in an automated manner. This position paper presents our proposed approach to make the annotation phase of the conversion process more efficient, automated, and user-friendly. As mentioned above, so far the annotation step has been carried out manually, which hampers the efficiency and overall performance of the process. Human users who are experts in both the reference ontology and the desired standards are required to establish the mappings with the concepts appearing in the source and target specifications, which is a time- and effort-consuming procedure. In the proposed approach, we aim at making the annotation-creation process more automated by taking advantage of machine-learning algorithms and methodologies. The rest of this paper briefly describes related work (Sect. 2), then outlines the proposed method (Sect. 3), and concludes with a brief discussion (Sect. 4).

2 Related Work

The word2vec model [3] is a two-layer neural network that is trained on a sufficiently large text corpus and outputs a feature vector for each word appearing in the corpus, such that the vectors of semantically similar words are mapped near one another in the vector space. These vectors can be employed to establish meaningful associations among words (e.g., Milan is to Italy what Paris is to France). The produced vectors can also be used as the input to other machine-learning techniques, such as clustering or additional deep neural networks. Another property of the word2vec model [4] is the capability of meaningfully combining word vectors to represent longer texts, by performing operations such as addition or subtraction. The word2vec model has already been employed in the medical domain for concept extraction [1].
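The analogy property mentioned above can be illustrated with a minimal sketch. The hand-crafted 3-dimensional vectors below are assumptions chosen so that the city-country analogy holds exactly; real word2vec vectors are 300-dimensional and learned from text.

```python
import math

# Toy 3-d embeddings, hand-picked so that milan - italy + france = paris.
# Real word2vec vectors are learned, not constructed like this.
VECS = {
    "milan":  (1.0, 0.0, 1.0),
    "italy":  (1.0, 0.0, 0.0),
    "paris":  (0.0, 1.0, 1.0),
    "france": (0.0, 1.0, 0.0),
    "berlin": (0.0, 0.0, 2.0),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def analogy(a, b, c, vecs=VECS):
    """Return the word d maximising cosine(d, a - b + c): a is to b as d is to c."""
    target = tuple(x - y + z for x, y, z in zip(vecs[a], vecs[b], vecs[c]))
    candidates = {w: v for w, v in vecs.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("milan", "italy", "france"))  # -> 'paris'
```

The same nearest-vector search, applied to concept names instead of everyday words, is what the method in Sect. 3 relies on.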
3 Method

As explained in Section 1, in order to lift/lower a given standard to/from the reference semantic model, we need to state the mapping between their concepts and structures in the annotation phase. This section describes the proposed method for the automatic generation of such mappings.

Definitions. Let S and R be, respectively, the source standard and the reference semantic model. We indicate by XS (resp., XR) the structure used by S (resp., R), and by OS (resp., OR) the set of terms (the vocabulary) appearing in XS (resp., XR). For example, XS could be defined through XSD or OWL. If MS is an instance of XS and MR is an instance of XR, we say that MS and MR are equivalent if they are used for the same purpose in S and R, i.e., they are semantically equal. We consider instances Mi (where i ∈ {S, R}) in which we can identify a root concept. Then, Mi is defined using a sub-tree WXi of Xi, which is based on a vocabulary VOi that is a subset of Oi (i.e., VOi ⊆ Oi). We write Mi ∈ T(VOi, WXi) to indicate that Mi is built on sub-tree WXi using the terms of VOi. Given MS, VOS, WXS, OR and XR, we aim to define a Map method that maps the concepts appearing in MS to concepts of XR, thus building an instance MR of R that is equivalent to MS:

MR = Map(MS, VOS, WXS, OR, XR)

Assumptions. For applying the proposed method, we assume that the following four premises hold. Although some of them might not be true in general, our aim is to automate the mapping process in most cases, sacrificing completeness for efficiency; thus, we deem these simplifications to be acceptable and general enough. We briefly discuss in Sect. 4 how the last one can be relaxed.

Assumption 1. The language on both sides of the mapping is English.
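A minimal sketch of these definitions, under the assumption (not stated in the paper) that a tree structure WX is encoded as a nested dictionary whose keys are the terms of the vocabulary and whose leaves are empty dictionaries:

```python
def terms_of(tree):
    """Collect the set of terms used by a tree structure (nested dicts)."""
    terms = set()
    def walk(node):
        for key, child in node.items():
            terms.add(key)
            walk(child)
    walk(tree)
    return terms

def is_instance_of(tree, vocabulary):
    """M ∈ T(VO, WX): every term of the instance tree belongs to VO ⊆ O."""
    return terms_of(tree) <= set(vocabulary)

# Illustrative source structure, mirroring the PreBook example of Fig. 2.
W_XS = {"PreBook": {"RouteSource": {}, "RouteDestination": {}, "TravelerName": {}}}
VO_S = {"PreBook", "RouteSource", "RouteDestination", "TravelerName"}
print(is_instance_of(W_XS, VO_S))  # -> True
```

The Map method sought by the paper would take such a tree plus its vocabulary and return an equivalent tree over the reference vocabulary.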
Assumption 2. Given that our method targets mappings between standards in the same domain, and that both involved systems cover the concepts of the domain, we assume that for each concept in the source system there is at least one corresponding concept in the reference system.

Assumption 3. The corresponding instances in the source (i.e., the given standard) and target (i.e., the reference semantic model) formats include the same equivalent concepts; that is, for each concept in the source instance, there exists exactly one concept in the target format (a one-to-one relationship between concepts).

Assumption 4. All concepts exist in the word2vec model.

Procedure. Figure 2 depicts the overall workflow of the proposed procedure. In order to map the source data to the reference data format, the first step is to decompose the source data into its components: VOS, which is the set of terms that occur in MS, and the tree representation WXS of the given structured data. Then, the semantic equivalent of the main concept in the source data should be determined, where the main concept is the root of the tree. According to Assumption 2, there should be at least one concept in the reference data model corresponding to the source main concept. To detect it, the extracted root of the tree structure in the source data (WXS) is embedded into its 300-dimensional vector space using the word2vec model. If the main concept is a phrase, its atomic parts are embedded individually and then averaged. Since the word2vec model identifies semantically close concepts based on their relative distances in the 300-dimensional vector space, we search the space for the vector that is nearest to that of the source main concept and tag it as the equivalent concept in the reference system. After determining the equivalent main concepts, the structures corresponding to that particular main concept in the source and reference systems (WXS and WXR) are retrieved.
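The main-concept identification step can be sketched as follows. The two-dimensional toy embeddings stand in for the 300-dimensional word2vec model and are assumptions for illustration only; in particular, the concept names mirror the pre book / pre register example of Fig. 2.

```python
import math

# Two-dimensional toy embeddings standing in for the word2vec model
# (assumed for illustration; real vectors are 300-dimensional and learned).
VECS = {
    "pre":      (0.9, 0.1),
    "book":     (0.1, 0.9),
    "register": (0.2, 0.8),
    "route":    (0.5, 0.1),
}

def embed(phrase):
    """Embed a (possibly multi-word) concept by averaging its word vectors."""
    vecs = [VECS[w] for w in phrase.lower().split()]  # Assumption 4: all words known
    return tuple(sum(c) / len(vecs) for c in zip(*vecs))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def nearest_concept(source_root, reference_concepts):
    """Tag the reference concept whose embedding is nearest to the source root."""
    src = embed(source_root)
    return max(reference_concepts, key=lambda c: cosine(embed(c), src))

print(nearest_concept("pre book", ["pre register", "route"]))  # -> 'pre register'
```

With real word2vec vectors, the candidate set would be the whole reference vocabulary rather than a short hand-written list.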
[Fig. 2. Proposed mapping procedure from source to reference data format.]

Then, inside the tree structures of each data format, all possible routes from the root to the leaf nodes are extracted; to reach a given leaf node, there is only one path from the root. Each extracted route consists of the set of all the terms from the root to that particular leaf node. According to Assumption 3, the structures of the equivalent main concepts in the source and reference data contain the same number of attributes; hence, the number of leaf nodes in both data formats is equal, and so is the number of routes. For each route in the tree structure of the source data, its corresponding route in the reference data structure should be identified. To do so, a vector for each of the words in the current route is obtained using the word2vec model, and then the average of these vectors is calculated. We expect the vectors resulting from this averaging operation for corresponding routes in the source and reference data to fall close to each other in the vector space. This is possible due to the properties of the word2vec model, in which combinations of words can be meaningfully represented through vector addition.
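Route extraction can be sketched as a simple traversal, again under the assumption that the tree structure is a nested dictionary with empty dictionaries as leaves; the example terms mirror those of Fig. 2.

```python
def routes(tree, prefix=()):
    """All root-to-leaf paths of a tree given as nested dicts; each path is the
    tuple of terms encountered from the root down to one leaf."""
    paths = []
    for term, child in tree.items():
        path = prefix + (term,)
        if child:                       # inner node: recurse into its sub-tree
            paths.extend(routes(child, path))
        else:                           # leaf: the path is complete
            paths.append(path)
    return paths

# Illustrative source structure from Fig. 2 (assumed encoding).
W_XS = {"PreBook": {"RouteSource": {}, "RouteDestination": {}, "TravelerName": {}}}
print(routes(W_XS))
```

Each returned path would then be embedded word by word and averaged, and matched to the nearest-average path on the reference side.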
Similarly, the averaging operation preserves the semantics of the summed vectors, since it merely scales the vector magnitude by a positive factor, leaving its direction unchanged. We assign each route from the source data structure to the route in the reference data structure whose average vector is nearest. Then, the attribute name of the leaf node from the source data is mapped to the corresponding leaf node in the reference data format, and the attribute values are transferred accordingly. (As mentioned in Sect. 3, we trade off generality for efficiency; in future work we will look to relax the assumptions underlying the approach.)

4 Discussion

This section outlines some of the challenges that might arise while applying the proposed method. The first one is related to the extracted concepts, and occurs because the individual concepts in the structure are typically a combination of two or more words, for example pre booking or PreBooking. These kinds of compound words usually do not exist in the word2vec model as single words. To address this issue, it might be necessary to perform a pre-processing step, which could consist in splitting the compound words into their atomic stems and then computing the average of their vectors.

Another challenge could be due to the absence of some terms of the source or reference ontologies, which prevents the averaging in the route-matching step. To tackle this problem, one possible approach is to further train the existing word2vec model using the transfer-learning technique. To this end, a sufficient number of unstructured texts containing the missing words are necessary. Alternatively, instead of unstructured text, it might be possible to perform transfer learning using instances of structured data formats containing the missing words, either by flattening the structured text to make it unstructured, or by adding extra layers to the word2vec model.
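The compound-word pre-processing mentioned above could look like the following sketch, which splits camelCase, PascalCase, and snake_case identifiers; the exact splitting rules are an assumption, not part of the proposed method.

```python
import re

def split_compound(term):
    """Split camelCase, PascalCase and snake_case identifiers into atomic,
    lower-cased words that are more likely to exist in the word2vec vocabulary."""
    words = []
    for part in re.split(r"[_\s]+", term):
        # Insert a space before each inner capital letter, then split on spaces.
        words.extend(re.sub(r"(?<!^)(?=[A-Z])", " ", part).split())
    return [w.lower() for w in words if w]

print(split_compound("PreBooking"))   # -> ['pre', 'booking']
print(split_compound("pre_booking"))  # -> ['pre', 'booking']
```

Each resulting stem can then be embedded individually and the vectors averaged, as done for multi-word concepts in Sect. 3.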
Finally, to validate the method, a possible approach consists in preparing a dataset containing a set of pairs (M1, M2) of equivalent instances in different data formats (hence, with the same main concepts). Then, the Map method should be applied to one element of each pair (say, M1) and the result should be compared to the true data structure and terms of the other element of the pair (i.e., M2). Subsequently, the direction of the Map method should be reversed (i.e., it should be applied to M2) and the same process repeated. The method is validated if both mapped instances are equivalent to the corresponding true data formats.

References

1. Andrew L. Beam, Benjamin Kompa, Inbar Fried, Nathan P. Palmer, Xu Shi, Tianxi Cai, and Isaac S. Kohane. Clinical concept embeddings learned from massive sources of multimodal medical data. arXiv preprint arXiv:1804.01486, 2018.
2. Alessio Carenini, Ugo DellArciprete, Stefanos Gogos, Mohammad Mehdi Purhashem Khallehbasti, Matteo Rossi, and Riccardo Santoro. ST4RT – semantic transformations for rail transportation. In Transport Research Arena (TRA), pages 1–10, 2018.
3. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
4. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.