=Paper=
{{Paper
|id=Vol-2277/paper42
|storemode=property
|title=
Combining Lexical and Semantic Similarity Measures with Machine Learning Approach for Ontology Matching Problem
|pdfUrl=https://ceur-ws.org/Vol-2277/paper42.pdf
|volume=Vol-2277
|authors=Lev Bulygin
|dblpUrl=https://dblp.org/rec/conf/rcdl/Bulygin18
}}
==
Combining Lexical and Semantic Similarity Measures with Machine Learning Approach for Ontology Matching Problem
==
Combining Lexical and Semantic Similarity Measures with Machine Learning Approach for Ontology and Schema Matching Problem © Lev Bulygin Lomonosov Moscow State University, Moscow, Russia buliginleo@yandex.ru Abstract. Ontology and schema matching is one of the most important tasks for data integration. We suggest to combine the different matchers with machine learning approach. The features are the outputs of lexical and semantic similarity functions. Naive Bayesian classifier, logistic regression and gradient tree boosting have been trained on these features. The proposed approach is tested on the OAEI 2017 benchmark “conference” with the various splits of the data on train and test sets. Experiments show that final combined model in element-level matching outperformed the single matchers. Results are compared with EditDistance matcher and WordNet matcher. Keywords: ontology matching, element-level matcher, lexical matcher, semantic matcher, word2vec, word2net. approach on OAEI 2017 “conference” ontologies with 1 Introduction single matchers: edit distance and WordNet. In Section 2 we reviewed related work and it's Data integration is now a very important problem, as the evolution. Section 3 describes the setup of the paper. In amount of data is growing very much. One of the most Section 4 presents our matching algorithm in detail. important steps in automatic data integration is the Section 5 shows the experiments and the analysis. We comparison of ontologies or schemas. This problem is conclude this paper in Section 6. traditionally solved manually or semi-automatically. Manual ontology or schema matching is very labor- 2 Related Work intensive, long in time and prone to errors. Therefore now the task is to maximally automate the process of Schema matching and ontology matching are very comparing the circuit with the greatest accuracy. similar. In this article, there is no difference for schema For schema matching and ontology matching matching and ontology matching because we only work element-level and structure-level matchers are used. with the names of the entities. So we considered the Element-level matchers use information of elements of works about schema matching and ontology matching schema (name of columns, description). Structure-level together. In [1] a review of approaches for automatic matchers use information of structure (type similarity, schema matching are conducted. They introduced the key properties, hierarchy of elements). classification of matchers and concluded that it is Recently, element-level matchers have been widely impossible to create a fully automatic matching system, used lexical and semantic information. For getting universal for all subject areas. One of the most important semantic information WordNet and Word2vec are used. conclusions is that hybrid matching system is better than Word2vec allow us to get a meaning of the word and we single system. Intuitively, this is obvious, since a hybrid can use this for improving ontology and schema matcher uses more information to make a decision. matching. So we can combine many lexical and semantic In [2] the ready-made solutions for automatic schema information into one similarity matrix and train a matching are compared and the concept of pre-match machine learning model for trying to solve ontology and effort and post-match effort introduced. Pre-match effort schema matching problems. is learning of the matcher, configuration of thresholds This work is performed as Master thesis. The main and weights. Post-match effort is correction and goal of this work is to propose a approach for solving improvement of the match output. In the same year, in ontology and schema matching problem with the greatest [3] the authors described their COMA system. The accuracy. To achieve this goal we need to perform the authors introduced new methods of post-match effort: following tasks: review related work, create the reuse of matchings, aggregation of individual matchers. architecture of solution, select of information for In [4] the Target-based Integration Query System (TIQS) matching and create experiments. We compared our are described. In [5] a new classification on matching dimensions are presented. A natural issue is uncertainty, In [6] uncertainty are considered, monotonicity and Proceedings of the XX International Conference statistical monotonicity are introduced . All the above “Data Analytics and Management in Data Intensive articles don't use machine learning for aggregation Domains” (DAMDID/RCDL’2018), Moscow, Russia, October 9-12, 2018 245 results of matchers. 3 Problem Statement, Evaluation Metrics In [7] Bayesian Networks are used for combining and Similarity Measures ontology matching. As the features lexical information (N-gram, Levenshtein, Dice Coefficient etc) were used. 3.1 Problem Statement The best accuracy is 88%. In [8] the outputs from several single matchers are An ontology O is by a 3-tuple (C, P, I). C is the classes, used and the authors trained on this data a multi-layer denoting the concepts. P is the relations within the neural network.The authors used lexical information and ontology. I is the instances of classes. The task of the WordNet similarity. ontology matching is to find the alignment between In [9] ontology matching problem in detail are entities in a source ontology 𝑂𝑂1 and a target ontology 𝑂𝑂2 . described: applications, classifications of ontology An alignment is a set {(𝑒𝑒1 , 𝑒𝑒2 , 𝑐𝑐𝑐𝑐𝑐𝑐)|𝑒𝑒1 ∈ 𝑂𝑂1 , 𝑒𝑒2 ∈ matching techniques, evaluations and alignment. They 𝑂𝑂2 }, where 𝑒𝑒1 is an entity in 𝑂𝑂1 , 𝑒𝑒2 is an entity in 𝑂𝑂2 , and suggest three dimensions of building correspondences: 𝑐𝑐𝑐𝑐𝑐𝑐 is the confidence of the correspondence. conceptual, extensional and semantic. Our algorithm works only with names of entities In [10] Multiple Concept Similarity for ontology without the structure information. The input of our mapping are described. The author reduced the ontology algorithm is the two ontologies: source and target. The mapping problem to the machine learning problem. They output of our algorithm is the alignment. used lexical information (prefix, suffix, Edit distance, n- gram), semantic information (WordNet) and special 3.2 Evaluation Metrics types of similarity called Word List similarity (a The effectiveness of ontology matching could similarity for sentences). measured by precision, recall and F-measure. In [11] a machine-learning approach to ontology 𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝 = |𝐴𝐴∩𝐵𝐵| , (1) alignment problem are described. The authors used |𝐴𝐴| |𝐴𝐴∩𝐵𝐵| lexical information, semantic information (WordNet) 𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟 = , (2) |𝐵𝐵| and structural information for training Support Vector 2⋅𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝⋅𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟 Machine, K-Nearest Neighbours, Decision Tree, 𝐹𝐹 − 𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚 = , (3) 𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝+𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟 AdaBoost models. The authors improves F-measure criterion up to 99%. where A is a predict alignment and B is a true In [12] Bayeasian Networks are used for composition alignment. of matchers. The resulting model achieved 81% accuracy. The authors used the outputs of the lexical 3.3 Used Similarity Measures matchers, structure-level matchers, synonym matchers As features for machine learning we used the similarity and instance-level matchers. measures. We extract from ontology the names of In [13] all stages of ontology matching problem in entities. Further we need to create sets of words from detail are described: feature selection, methods of names for computing some similarity measures. combining matchers and experiments. Example. We have two names of entities: In [14] word2vec embeddings for ontology matching “early_paid_applicant” and “early-registered- problem are used. Authors proposed the new algorithm of participant”. The sets are [“early”, “paid”, “applicant”] matching and sentence2vec algorithm. The results matcher and [“early”, “registered”, “participant”]. are compared with various types of WordNet, EditDistance matchers. In [15] the combination of word2vec features and Metrics based on lexical information from names of a neural network are used. One of the most important entities: problems is that articles using machine learning can not be • N-gram [11]. Let ngram(S, N) be the set of compared with each other because of the lack of a unified substrings of string S of length N. The n-gram benchmark. The OAEI is not intended to use machine similarity for two strings S and T: learning, so the training dataset is chosen by the author of |𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛(𝑆𝑆,𝑁𝑁)∩𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛(𝑇𝑇,𝑁𝑁)| the article. 𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛(𝑆𝑆, 𝑇𝑇) = (4) 𝑚𝑚𝑚𝑚𝑚𝑚(|𝑆𝑆|,|𝑇𝑇|)−𝑁𝑁+1 At the moment, the most promising option is the • Dice coefficient [11]. The Dice similarity score is using of machine learning to create a hybrid model with defined as twice the shared information lexical, semantic and structure information. In this (intersection) divided by sum of cardinalities. For article, we want to connect many different single two sets X and Y, the Dice similarity score is: matchers to one hybrid matcher using machine learning. 2∗|𝑋𝑋∩𝑌𝑌| One of the newest features we used is Word2vec. So far 𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑(𝑋𝑋, 𝑌𝑌) = (5) |𝑋𝑋|+|𝑌𝑌| we only used the single element-level matchers. • Jaccard similarity [11]. For two sets X and Y, the All reviewed papers use widely the lexical Jaccard similarity score is: information and WordNet distances. In this paper we |𝑋𝑋∩𝑌𝑌| used 29 similarity metrics from 17 various single 𝑗𝑗𝑗𝑗𝑗𝑗𝑗𝑗𝑗𝑗𝑗𝑗𝑗𝑗(𝑋𝑋, 𝑌𝑌) = (6) |𝑋𝑋∪𝑌𝑌| element-level matchers. • Jaro measure [11]. The Jaro measure is a type of edit distance, developed mainly to compare short strings, such as first and last names. 246 • Substring similarity [11]. For longest substring T of • Ratio. For two strings X, Y: two strings X and Y substring similarity is: 𝑀𝑀 𝑅𝑅𝑅𝑅𝑅𝑅𝑅𝑅𝑅𝑅 = 𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟(2.0 ⋅ ⋅ 100), (10) 2∗|𝑇𝑇| 𝑇𝑇 𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠(𝑋𝑋, 𝑌𝑌) = (7) • where T is the total number of characters in both |𝑋𝑋|+|𝑌𝑌| • Monge-Elkan [11]. Token-based and internal strings, and M is the number of matches in the two similarity function for tokens: strings X and Y. |𝑥𝑥| • Soundex [6]. Phonetic measure such as soundex 1 𝑠𝑠𝑠𝑠𝑚𝑚𝑀𝑀𝑀𝑀 (𝑥𝑥, 𝑦𝑦) = ∑ 𝑚𝑚𝑚𝑚𝑥𝑥𝑗𝑗=1,|𝑦𝑦| 𝑠𝑠𝑠𝑠𝑚𝑚′ (𝑥𝑥[𝑖𝑖], 𝑦𝑦[𝑗𝑗]) match string based on their sound. These measures |𝑥𝑥| 𝑖𝑖=1 have been especially effective in matching names, (8) since names are often spelled in different ways that • Smith-Waterman [11]. The Smith-Waterman sound the same. algorithm performs local sequence alignment; that is, for determining similar regions between two • Token Sort. Fuzzy Wuzzy token sort ratio raw strings. raw_score is a measure of the strings similarity as an int in the range [0, 100]. For two strings X and Y, • Needleman-Wunsch [11]. The Needleman-Wunsch the score is obtained by splitting the two strings into distance generalizes the Levenshtein distance and tokens and then sorting the tokens. The score is then considers global alignment between two strings. the fuzzy wuzzy ratio raw score of the transformed Specifically, it is computed by assigning a score to strings. each alignment between the two input strings and choosing the score of the best alignment, that is, the • Tversky Index. For sets X and Y: |𝑋𝑋∩𝑌𝑌| maximal score. 𝑇𝑇𝑇𝑇(𝑋𝑋, 𝑌𝑌) = , (11) |𝑋𝑋∩𝑌𝑌|+𝛼𝛼|𝑋𝑋−𝑌𝑌|+𝜂𝜂|𝑌𝑌−𝑋𝑋| • Affine gap [17]. Returns the affine gap score • where 𝛼𝛼, 𝜂𝜂 > 0. between two strings. The affine gap measure is an • Overlap coefficient. For sets X and Y: extension of the Needleman-Wunsch measure that |𝑋𝑋∩𝑌𝑌| handles the longer gaps more gracefully. 𝑂𝑂𝑂𝑂(𝑋𝑋, 𝑌𝑌) = (12) 𝑚𝑚𝑚𝑚𝑚𝑚(|𝑋𝑋|,|𝑌𝑌|) • Bag distance. The number of common symbols in two strings. Metrics based on semantic information from names • Cosine similarity [11]. For two sets X and Y: of entities: |𝑋𝑋∩𝑌𝑌| • WordNet similarity. Best score of similarity between 𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐(𝑋𝑋, 𝑌𝑌) = (9) synonyms of one word with synonyms of other �|𝑋𝑋|⋅|𝑌𝑌| • Partial Ratio. Given two strings X and Y, let the word. We use a sentence similarity algorithm from shorter string (X) be of length m. It finds the fuzzy [14]. wuzzy ratio similarity measure between the shorter • Word2vec and Sentence2vec similarity. Word2vec string and every substring of length m of the longer is a model that are used to produce word string, and returns the maximum of those similarity embeddings. We used word2vec model trained by measures. the texts of Google News. The dimensionality of • Soft TFIDF [21]. A variant of TF-IDF that considers these vectors is 300. We used sentence2vec words equal based on Jaro Winkler rather than exact algorithm from [14]. match. • Editex. Editex is a phonetic distance measure that 4 Proposed Matching Algorithm combines the properties of edit distances with the The input of the matching system is the two letter-grouping strategy used by Soundex and ontologies. For each ontology the system extracts the Phonix. names of the entities. Further it generate all possible pairs • Generalized Jaccard. This similarity measure is of the names of the ontology with the names of the other softened version of the Jaccard measure. The ontology. For all pairs system computes the all similarity Jaccard measure is promising candidate for tokens measures. All outputs of the similarity measures are which exactly match across the sets. concatenated into the one similarity matrix. The computed features are used as input to a machine • JaroWinkler. The Jaro-Winkler measure is designed learning model. to capture cases where two strings have a low Jaro score, but share a prefix and thus are likely to match. Program 1 Element-level matching algorithm • Levenshtein distance [11]. Levenshtein distance Input: Ontology1, Ontology2 - input ontologies computes the minimum cost of transforming one or schemas, THRESHOLD - threshold for create matching between elements string into the other. Output: alignment - output alignment for • Partial Token Sort. For two strings X and Y, the Ontology1 and Ontology2 score is obtained by splitting the two strings into for Entity1 ∈ Ontology1 do for Entity2 ∈ Ontology2 do tokens and then sorting the tokens. The score is then Name1 ← get_name(Entity1) the fuzzy wuzzy partial ratio raw score of the Name2 ← get_name(Entity2) transformed strings. Features ← get_features(Name1, Name2) 247 Match ← predict_match(Features) Table 1 Our split of dataset “conference” from OAEI if Match > THRESHOLD then 2017 benchmark alignment.append((Name1,Name2)) end if end for Train pairs Test pairs end for return alignment Conference, Ekaw Iasted, Sigkdd Ekaw, Sigkdd Program 2 Creating dataset and training a machine Cmt, Confof Cmt, Sigkdd learning model Confof, Edas Cmt, Ekaw Edas, Iasted Input: TrueAlignments - set of true alignments, Ekaw, Iasted TrainAlignments - set of train pairs Confof, Ekaw Edas, Ekaw Cmt, Iasted Output: Model - trained model for predict Conference, Iasted Edas, Sigkdd matching Conference, Sigkdd Confof, Sigkdd Conference, Edas for Ontology1, Ontology2 in TrueAlignments do Conference, Confof Confof, Iasted for Entity1 ∈ Ontology1 do Cmt, Edas for Entity2 ∈ Ontology2 do Cmt, Conference if (Entity1, Entity2) ∈ TrainAlignments then add_to_train(Entity1, Entity2, 1) 5.2 Experiments else add_to_train(Entity1, Entity2, 0) As a baseline solution we take EditDistance and end if WordNet matchers. For every matcher we select end for end for threshold with the best F-measure on the test dataset. As end for the machine learning models we chose Naive Bayes Model.train() Classifier 27, Logistic regression 28 as the simple models return Model and Gradient boosting 29 as the complex model. For training a machine learning model we need the Table 2 The results on test dataset ontologies and the alignments. We split the alignments on the training alignments and testing alignments. Precision, F- Further we train a machine learning model on the Method Recall, % % measure, % features from train alignments. Then we can evaluate F- measure on testing alignment. EditDistance 50 35.8 41.7 5 Preliminaries and Results WordNet 9.1 38.2 14.7 Naive Bayes 5.1 Collection of Data 2.13 80.08 4.15 Classifier The experiments were conducted at the dataset Logistic 75.58 40.12 52.41 “conference” from OAEI 2017. The dataset consists 16 regression ontologies from the same domain (conference organisation) and 21 true ontology matchings. This XGBoost 69.15 45.67 55.01 dataset is convenient for machine learning because all ontologies are from the same domain and this dataset has the several true matchings between ontologies. We can see on Table 2 that XGBoost is the model We get entities from ontologies with parsing code on with best F-measure 55.01%. The worst model is Naive Python and get array of tuples: (entity1, entity2, match), Bayes Classifiers with F-measure 4.15%. EditDistance is where entity1, entity2 - string names of entities, match - better than WordNet by 27%. For getting more stable bool variable, 1 means that the entities are matched, 0 results we splitted the datasets randomly on training and means that the entities are not matched. We generated 29 testing pairs 20 times and averaged the results. The similarity measures from 3.3 with Python packages: results are show on Table 3. We can see that XGBoost py_stringmatching 24, fuzzycomp 25, gensim 26. are remained the best model. The final dataset is very unbalanced: 391 572 negative samples and 305 positive samples. After generating features we split data randomly on train and test datasets in equal proportions. In Table 1 are described our split. 24 https://github.com/kvpradap/py_stringmatching .GaussianNB.html 25 https://github.com/fuzzycode/fuzzycomp 28 http://scikit- 26 https://github.com/RaRe-Technologies/gensim learn.org/stable/modules/generated/sklearn.linear_mode 27 http://scikit- l.LogisticRegression.html learn.org/stable/modules/generated/sklearn.naive_bayes 29 https://github.com/dmlc/xgboost 248 Table 3 The averaged results on 20 various random Semantics IV, pp. 146-171 (2005). doi: splits on train and test datasets 10.1007/11603412_5 [6] Gal, A.: Why is Schema Matching Tough and Precision, F- Method Recall, % What Can We Do About It? In: ACM SIGMOD % measure, % Record, vol. 35, issue 4, pp. 2-5 (2006). doi: EditDistance 47.31 39.2 42.82 10.1145/1228268.1228269 [7] Svab, O., Svatek, V.: Combining Ontology WordNet 11.87 41.69 18.35 Mapping Methods Using Bayesian Networks. In: OM'06 Proceedings of the 1st International Naive Bayes Conference on Ontology Matching, vol. 225, pp. 2.63 81.49 5.10 Classifier 206-210. Logistic 74.98 43.43 54.95 [8] Hariri, B., Abolhassani, H., Sayyadi, H.: A regression Neural-Networks-Based Approach for Ontology XGBoost 72.72 47.51 57.29 Alignment. (2006). https://www.jstage.jst.go.jp/article/softscis/2006/ 0/2006_0_1248/_article [9] Euzenat, J., Shvaiko, P.: Ontology Matching. 6 Conclusions Springer-Verlag Berlin Heidelberg, Berlin In this paper, we combined the lexical and semantic (2007). doi: 10.1007/978-3-642-38721-0 similarity measures into one hybrid model. The [10] Ichise, R.: Machine Learning Approach for experiments show that hybrid model is better than single Ontology Mapping Using Multiple Concept matchers. Similarity Measures. In: Seventh IEEE/ACIS In future work we want to run our solution on other International Conference on Computer and benchmarks. We must add the structure-level and Information Science, (2008). doi: instance-level features because in our dataset there are 10.1109/ICIS.2008.51 examples in which the names of entities are exactly the [11] Nezhadi, A., Shadgar, B., Osareh, A.: Ontology same but they are not matched. Alignment Using Machine Learning Techniques. In: International Journal of Computer Science & Acknowledgments. This work is supervised by Sergey Information Technology, vol. 3, pp. 139-150 Stupnikov, Institute of Informatics Problems, Federal (2011). doi: 10.5121/ijcsit.2011.3210 Research Center “Computer Science and Control“ of the [12] Nikovsi, D., Esenther, A., Ye, X., Shiba, M., Russian Academy of Sciences”. Takayama, S.: Bayesian Networks for Matcher Composition in Automatic Schema Matching. References In: Mitsubishi Electric Research Laboratories (2011). doi: 10.1.1.644.2168 [1] Rahm, E., Bernstein, A.: A survey of approaches to automatic schema matching. In: The [13] Ngo, D.: Enhancing Ontology Matching by International Journal on Very Large Data Bases, Using Machine Learning, Graph Matching and vol. 10, issue 4, pp. 334-350 (2001). doi: Information Retrieval Techniques. In: University 10.1007/s007780100057 Montpellier II - Sciences et Techniques du Languedoc (2012). doi: 10.1.1.302.587 [2] Do, H., Melnik, S., Rahm, E.: Comparison of Schema Matching Evaluations. In: Revised [14] Zhang, Y., Wang, X., Lai, S., He, S., Liu, K., Papers from the NODe 2002 Web and Database- Zhao, J., Lv, X.: Ontology Matching with Word Related Workshops on Web, Web-Services, and Embeddings. In: Chinese Computational Database Systems, pp. 221-237 (2002). doi: Linguistics and Natural Language Processing 10.1007/3-540-36560-5_17 Based on Naturally Annotated Big Data, pp 34- 45 (2014). doi: 10.1007/978-3-319-12277-9_4 [3] Do, H., Rahm, E.: COMA: a system for flexible combination of schema matching approaches. In: [15] Jayawardana, V., Lakmal, D., Silva, N., Perera, VLDB '02 Proceedings of the 28th international A., Sugathadasa, K., Ayesha, B.: Deriving a conference on Very Large Data Bases, pp. 610- Representative Vefctor for Ontology Classes 621 (2002). doi: 10.1016/B978-155860869- with Instance Word Vector Embeddings. In: 6/50060-3 2017 Seventh International Conference on Innovative Computing Technology (INTECH), [4] L. Xu, D. Embley.: Automating Schema (2017). doi: 10.1109/INTECH.2017.8102426 Mapping for Data Integration. (2003). http://www.deg.byu.edu/papers/AutomatingSche maMatching.journal.pdf [5] Shvaiko, P., Euzenat, J.: A Survey of Schema- Based Matching Approaches. In: Journal on Data 249