KGMatcher+ Results for OAEI 2022

Omaima Fallatah1,2, Ziqi Zhang1 and Frank Hopfgartner3
1 Information School, The University of Sheffield, Regent Court, 211 Portobello, Sheffield S1 4DP, UK
2 Information Systems, Umm Al Qura University, Mecca 24382, Saudi Arabia
3 Universität Koblenz-Landau, Mainz 55118, Germany

Abstract
KGMatcher+ is a scalable and domain-independent matching system that matches the schema (classes) of large Knowledge Graphs by following a hybrid matching approach. KGMatcher+ is composed of an instance-based matcher, which only uses annotated instances of knowledge graph classes to generate candidate class alignments, and a string-based matcher. This year marks the second OAEI participation of KGMatcher+, formerly known as KGMatcher. Further improvements have been added to the matcher, particularly in handling imbalanced class distributions, and it is the best-performing system in the Common Knowledge Graphs track this year.

Keywords
Knowledge Graphs, Instance-based Ontology Matching, Machine Learning, Schema Matching.

1. Presentation of the system

1.1. State, purpose, general statement

Combining different matching techniques is a common practice in ontology matching tools. Matching techniques are divided into three main categories [1]: (1) element-level techniques, which discover similar entities by processing the textual annotations of ontology entities; (2) structural-level techniques, which study the relations between ontology entities to generate candidate pairs; and (3) extensional or instance-based techniques, which utilize populated instances, i.e., (ABox) data, to generate alignments at the schema level (TBox). The matcher proposed here is a hybrid approach that combines an element-level matcher with an instance-based one. The first matcher uses entity labels to generate candidate pairs, and the latter produces candidate class alignments by only using annotated instance names.
While conventional ontologies mainly focus on modeling classes and properties, Knowledge Graphs (KGs), particularly those available on the web, are large-scale and describe numerous instances. Following a self-supervised, two-way classification approach, the presented matcher trains a classifier using the instances annotated in each KG as training data. A KG classifier can then be used to classify any instance name into one of its own classes. The system is domain-independent and is capable of coping with KGs with unbalanced populations (for details, see [2, 3]). This makes the matcher particularly useful for matching large KGs with numerous populated and overlapping instances, such as DBpedia [4] and YAGO [5]. This is the second OAEI participation of this matching system; KGMatcher participated in the OAEI in 2021 [6].

OM2022: the International Workshop on Ontology Matching, October 23, 2022, Hangzhou, China
oafallatah1@sheffield.ac.uk (O. Fallatah); ziqi.zhang@sheffield.ac.uk (Z. Zhang); hopfgartner@uni-koblenz.de (F. Hopfgartner)
ORCID: 0000-0002-5466-9119 (O. Fallatah); 0000-0002-8587-8618 (Z. Zhang); 0000-0003-0380-6088 (F. Hopfgartner)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Figure 1: An overview of the KGMatcher+ process

1.2. Specific techniques used

Given two input knowledge graphs $\mathcal{O}$ and $\mathcal{O}'$: $\mathcal{O}$ has a set of classes $\mathcal{O} = \{C^{\mathcal{O}}_0, C^{\mathcal{O}}_1, \ldots, C^{\mathcal{O}}_i\}$, and each class contains a set of instances $C^{\mathcal{O}}_i = \{e^i_0, e^i_1, \ldots, e^i_n\}$. Similarly, $\mathcal{O}'$ contains a set of classes such that $\mathcal{O}' = \{C^{\mathcal{O}'}_0, C^{\mathcal{O}'}_1, \ldots, C^{\mathcal{O}'}_j\}$, where $C^{\mathcal{O}'}_j = \{e^j_0, e^j_1, \ldots, e^j_m\}$. KGMatcher+ has two main components: an instance-based matcher and a name matcher. The workflow of KGMatcher+ is illustrated in Figure 1.

1.2.1.
Preprocessing

Given the two input KGs, the matcher starts by parsing and indexing the lexical data of the two KGs separately. Following the standard free-text search/index approach, an index is created for each KG where each class is treated as a document and the content of each 'document' is the concatenation of the class's instance labels. In addition to the standard text cleaning processes, a word segmentation method is applied in order to separate multi-word entities, e.g., academicfield. Using a general dictionary, this method infers word boundaries and inserts the missing spaces.

1.2.2. Instance-based Matcher

The first matching component of KGMatcher+ belongs to the extensional matcher category. It uses a self-supervised machine learning approach to map KG classes based on their instance overlap. The matching is done in a two-way classification fashion where a KG classifier is trained using one KG's instance data. Later on, that classifier is used to classify any instance name into one of its classes. Here, we summarize the matching process of KGMatcher+; readers may refer to [3] for further details.

• Exact Name Filtering. The matcher starts by applying an exact name filter to exclude class labels that exist in both input KGs. Given the large number of classes in typical KGs, this step works as a blocking strategy that reduces the search space of the instance-based matcher.

• Resampling KG Instances. The class distribution in typical public KGs tends to be highly imbalanced. Thus, the goal of this step is to balance the number of populated instances in the two input KGs to avoid biased classification results. The previous version of the system only targeted the majority, i.e., large classes, by undersampling their instances using a TF-IDF approach [7].
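The word segmentation applied during preprocessing can be sketched as follows. This is a minimal illustration only: the paper does not specify the segmentation algorithm or dictionary, so the dynamic-programming split and the toy DICTIONARY below are assumptions.

```python
# Hypothetical sketch of the preprocessing word-segmentation step:
# split concatenated multi-word entity names (e.g. "academicfield")
# using a general dictionary. The toy DICTIONARY is an assumption;
# the dictionary and algorithm used by KGMatcher+ are not specified.

DICTIONARY = {"academic", "field", "train", "station", "railway"}

def segment(name: str) -> str:
    """Dynamic-programming segmentation: find a split of `name` into
    dictionary words and rejoin them with spaces; fall back to the
    original string when no full segmentation exists."""
    n = len(name)
    best = [None] * (n + 1)   # best[i] = list of words covering name[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and name[j:i] in DICTIONARY:
                best[i] = best[j] + [name[j:i]]
                break
    return " ".join(best[n]) if best[n] is not None else name

print(segment("academicfield"))   # -> "academic field"
```

A real implementation would back the lookup with a large general-purpose word list rather than this toy set.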
Unlike the previous version, here we introduce a new sampling strategy that combines undersampling the majority classes with oversampling the classes with fewer instances, i.e., the minority classes. In [2], we studied six different strategies for handling class distribution, including state-of-the-art methods such as SMOTE [8] and cost-sensitive learning [9]. The sampling method that outperformed all others combined TF-IDF undersampling with random oversampling. The standard TF-IDF method, which is often deployed to measure word relevance in a collection of documents, is used to undersample KG classes. Here, the TF-IDF of a word in each class represents the relevance of that word to a particular class in comparison to the other classes in the KG. Therefore, for each majority class, the words with the highest TF-IDF scores are used to undersample its instance names: instance names that do not contain any of the high-scoring words are discarded. Then, random oversampling is used to generate repeated random samples of instances in the classes that fall into the minority category. As a result, a more balanced and indicative set of KG instances is obtained to be used as training data. Readers interested in further details of the sampling strategies incorporated in KGMatcher+ may refer to [2].

• Training KG classifiers. Here, a KG classifier is trained separately for each input KG using the previously re-sampled instance data. Pre-trained word embeddings are used as features to capture and represent the semantics of KG instance names. Compared to traditional feature representation methods, word embeddings and language models are recognized as effective ways to capture the semantic similarity of words.
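The combined re-sampling strategy described above can be sketched as follows. This is an illustrative assumption, not the production implementation: a plain-Python TF-IDF over instance-name tokens (each class treated as a document) is used to undersample majority classes, and `random.choices` provides random oversampling with replacement for minority classes.

```python
import math
import random
from collections import Counter

def tfidf_top_words(classes, k=5):
    """Per-class TF-IDF over instance-name tokens, treating each class
    as a document; returns the k highest-scoring words per class."""
    docs = {c: Counter(w for name in names for w in name.split())
            for c, names in classes.items()}
    n_docs = len(docs)
    df = Counter()
    for tf in docs.values():
        df.update(tf.keys())
    top = {}
    for c, tf in docs.items():
        scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
        top[c] = {w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]}
    return top

def resample(classes, target):
    """Undersample majority classes to instances containing a top TF-IDF
    word; randomly oversample minority classes up to `target` instances."""
    top = tfidf_top_words(classes)
    balanced = {}
    for c, names in classes.items():
        if len(names) > target:
            kept = [n for n in names if set(n.split()) & top[c]]
            balanced[c] = kept[:target]
        else:
            balanced[c] = names + random.choices(names, k=target - len(names))
    return balanced
```

The `target` size, the choice of `k`, and the tokenization are all hypothetical knobs here; [2] evaluates the actual strategy.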
KGMatcher+ is able to train two types of classifiers: a Deep Neural Network (DNN) model, similar to other successful NLP applications such as [10] and [11], and a pre-trained BERT model [12]. KGMatcher+ automatically opts to use the BERT model if a GPU is available at runtime. The output of this phase is two classifiers, $CLS_{\mathcal{O}}$ and $CLS_{\mathcal{O}'}$, trained using the instances from the two input KGs $\mathcal{O}$ and $\mathcal{O}'$ respectively. (The DNN model uses an input layer of a pre-trained word embedding model, the Google News token-based model https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1, followed by four fully connected hidden layers with 128, 128, 64, and 32 rectified linear units. A dropout layer of 0.2 is added between each pair of dense layers for regularization. Finally, a softmax layer performs multi-class classification over the total number of classes in the KG the classifier is trained for.)

• KG1 alignment elicitation is the process of generating candidate pairs using the classifier trained on the first input KG. Candidate pairs are generated by iteratively applying the classifier $CLS_{\mathcal{O}}$ to the instances in the other KG's classes. As a result, each instance name in $C^{\mathcal{O}'}_j$ is now classified into a class in $\mathcal{O}$. The candidate pair $(C^{\mathcal{O}}_i, C^{\mathcal{O}'}_j)$ is added to the first candidate alignment set $A_{\mathcal{O}\to\mathcal{O}'}$ if the majority of the instances of $C^{\mathcal{O}'}_j$ were classified as instances of $C^{\mathcal{O}}_i$. A similarity score in [0,1] is obtained from the percentage of instances that voted for a particular class. Therefore, if 600 out of 1000 instance names in $C^{\mathcal{O}'}_j$ voted for $C^{\mathcal{O}}_i$, the similarity score of that pair will be 0.6.

• KG2 alignment elicitation is similar to the above-illustrated elicitation process.
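The one-way elicitation step just described can be sketched as a majority vote over classifier predictions. The function and data layout below are illustrative assumptions, not KGMatcher+'s actual code; `predictions` stands in for the output of applying a trained classifier to the other KG's instance names.

```python
from collections import Counter

def elicit_alignments(predictions):
    """Directional alignment elicitation (a sketch of the voting scheme
    described above). `predictions` maps each class of O' to the list of
    O-classes assigned to its instance names by the classifier trained
    on O. A candidate pair is kept when one O-class receives a majority
    of the votes; the similarity score is that class's vote share."""
    alignments = {}
    for src_class, votes in predictions.items():
        winner, count = Counter(votes).most_common(1)[0]
        score = count / len(votes)
        if score > 0.5:              # majority of instances agree
            alignments[(winner, src_class)] = score
    return alignments

# Hypothetical example: 600 of 1000 instance names of C'_j vote for C_i.
preds = {"C'_j": ["C_i"] * 600 + ["C_k"] * 400}
print(elicit_alignments(preds))   # {('C_i', "C'_j"): 0.6}
```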
However, the roles of the two KGs are reversed: $CLS_{\mathcal{O}'}$, i.e., the classifier trained on the second KG ($\mathcal{O}'$), is applied to the instances of $\mathcal{O}$ in order to obtain the second candidate alignment set $A_{\mathcal{O}'\to\mathcal{O}}$.

• Similarity computing is where KGMatcher+ combines the two candidate alignment sets resulting from the two-way classification method. First, the matcher separately stores each directional alignment in an alignment matrix of dimension $|\mathcal{O}| \times |\mathcal{O}'|$. The two matrices are then aggregated into one matrix by taking the average similarity score of each pair. For example, if $(C^{\mathcal{O}}_6, C^{\mathcal{O}'}_3, 0.88)$ is in $A_{\mathcal{O}\to\mathcal{O}'}$ and $(C^{\mathcal{O}'}_3, C^{\mathcal{O}}_6, 0.64)$ is in $A_{\mathcal{O}'\to\mathcal{O}}$, their aggregated similarity value will be 0.76. Finally, the alignments of this matcher are chosen by following the state-of-the-art automatic final alignment selection approach introduced in [13]. Given an alignment matrix, this method iteratively selects the pair with the highest similarity score for each class in both KGs.

1.2.3. Name Matcher

The second component of KGMatcher+ is an element-level matcher, which measures the similarity of KG class labels using two measures. First, the Levenshtein edit distance is calculated for each class pair. Second, the semantic similarity of the pair is measured with a word embedding method: a pre-trained word2vec model is used to represent class labels before measuring their cosine similarity. The semantic similarity is measured in a vector space model, where words with strong semantic relations are represented close to each other. In the case of multi-word labels, the vector representation of the label is the element-wise average of the vectors of its composing words. Finally, the maximum of the two similarity measures is chosen as the name similarity of that pair.
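The name matcher's max-of-two-measures scheme can be sketched as follows. The normalized Levenshtein similarity is implemented directly; the word-embedding cosine similarity is passed in as a stand-in callable, since a real pre-trained word2vec model is outside the scope of this sketch. The threshold parameter and function names are assumptions.

```python
def levenshtein_sim(a: str, b: str) -> float:
    """Normalized Levenshtein similarity: 1 - edit_distance / max_length."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b), 1)

def name_similarity(a, b, embedding_sim, threshold=0.8):
    """Maximum of edit-distance and embedding similarity; pairs below the
    threshold are dropped (None). `embedding_sim` is a stand-in for the
    cosine similarity of averaged word2vec vectors."""
    score = max(levenshtein_sim(a, b), embedding_sim(a, b))
    return score if score >= threshold else None
```

For the paper's example, `name_similarity("RailwayStation", "TrainStation", emb)` with an embedding similarity of 0.83 returns 0.83, since the edit-distance similarity is lower and 0.83 clears the 0.8 threshold.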
The threshold value of the name matcher is set to 0.8. To illustrate, if the word embedding similarity of (RailwayStation, TrainStation) is 0.83 while their Levenshtein similarity is 0.56, the maximum of the two, i.e., the word embedding similarity, is selected since it exceeds the threshold. If both similarity scores are lower than the threshold, the pair is excluded from the candidate alignment.

1.2.4. Post Processing

KGMatcher+ combines the results generated from the two matching components by following the same method described earlier for combining the two instance classification alignments.

1.2.5. Instance Matching

For the OAEI participation, we have adapted KGMatcher+ to also match the instances of KGs. The instance matching component is very simple. First, standard text preprocessing techniques such as lower-casing and removing stopwords and non-alphanumeric characters are applied. Then, KGMatcher+ generates candidate instance pairs based on the existence of the same label in the opposite knowledge graph.

1.3. Adaptations made for the evaluation

KGMatcher+ is mainly developed in Python. To facilitate reusing and evaluating KGMatcher+, and for the OAEI submission, it was packaged using a SEALS client. The wrapping service from the Matching EvaLuation Toolkit (MELT) [14] was used to wrap the system's Python process and to generate the SEALS package.

2. Results

In this section, we present and discuss the results for each of the OAEI tracks where KGMatcher+ was able to produce a non-empty alignment file. The results include the following OAEI tracks: Conference, Knowledge Graph, and Common Knowledge Graphs.

2.1. Conference

In the Conference track, under the rar2-M3 evaluation, the F1 score of KGMatcher+ (0.52) is slightly lower than both baselines, i.e., StringEquiv (0.53) and edna (0.56). This particular evaluation, i.e., M3, takes into consideration both class and property matches.
The fact that KGMatcher+ does not match properties explains the negative impact of the undiscovered property alignments on the matcher's performance in this task. Further, given that the Conference track datasets do not include enough instances to apply the instance-based matcher, the name matcher is the only matcher applied to map classes. In the cross-domain test case of mapping DBpedia and OntoFarm, KGMatcher+ is the second-best performing system.

2.2. Knowledge Graph

In the Knowledge Graph track, KGMatcher+ was able to generate results for all 5 test cases at both the class and instance level. In terms of class matching, the matcher yields satisfactory results, with an F1 score of 0.79. The added instance matcher has positively impacted the overall result on this task, with a precision of 0.94, a recall of 0.66, and an F1 score of 0.82. KGMatcher+ is the second-best performing system in this track.

2.3. Common Knowledge Graphs

In this track, KGMatcher+ was able to complete the task of matching the classes of 4 cross-domain and common KGs. On the task of matching NELL and DBpedia, the matcher obtained the highest F1 score of 0.95. On the second task, which maps classes from Yago and Wikidata, KGMatcher+ is also the best-performing matching system, with a recall of 0.83 and an F1 score of 0.91. Overall, KGMatcher+ yields the best performance results in this track.

3. General comments

The results of KGMatcher+ have been very encouraging. In the Common Knowledge Graphs track, it achieves outstanding results. This indicates that our hybrid approach, which utilizes instance data to map KG classes, is able to outperform systems that use other combinations of matchers. It is important to note that the performance of the instance-based component of KGMatcher+ depends on the nature of the dataset.
Since KGMatcher+ learns KG classifiers using general pre-trained word embedding models, the more representative the KG instances are of real-world entities, the better the instance classification results. Figure 2 shows the difference in performance when classifying instances from a common KG, e.g., NELL, compared to a single-domain KG from the Knowledge Graph track. Note that the latter mainly annotates classes in the entertainment domain [15].

Figure 2: The instance classification report for 20 randomly sampled classes from the OAEI KG MemoryAlpha in (a) and NELL in (b). Note that the y-axis numbers indicate class IDs.

4. Conclusion

As part of OAEI 2022, this paper presents KGMatcher+, a system for matching the schema of large-scale and domain-independent KGs by utilizing instance data. The process works by learning KG classifiers, which are able to classify instances into a particular KG class. The results suggest that a hybrid approach that incorporates an instance-based technique can be highly effective, particularly for matching large cross-domain KGs with imbalanced class distributions, such as Yago and Wikidata.

References

[1] L. Otero-Cerdeira, F. J. Rodríguez-Martínez, A. Gómez-Rodríguez, Ontology matching: A literature review, Expert Systems with Applications (2015) 949–971.
[2] O. Fallatah, Z. Zhang, F. Hopfgartner, The impact of imbalanced class distribution on knowledge graphs matching, in: Proceedings of the 17th International Workshop on Ontology Matching (OM 2022), CEUR-WS, 2022.
[3] O. Fallatah, Z. Zhang, F. Hopfgartner, A hybrid approach for large knowledge graphs matching, in: Proceedings of the 16th International Workshop on Ontology Matching (OM 2021), CEUR-WS, 2021.
[4] C. Bizer, J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, C. Bizer, DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia, Semantic Web (2012) 1–5.
[5] T.
P. Tanon, G. Weikum, F. Suchanek, YAGO 4: A reason-able knowledge base, in: European Semantic Web Conference, Springer, 2020, pp. 583–596.
[6] O. Fallatah, Z. Zhang, F. Hopfgartner, KGMatcher results for OAEI 2021, in: CEUR Workshop Proceedings, volume 3063, 2021, pp. 160–166.
[7] C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, Cambridge, 2008.
[8] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002) 321–357.
[9] C. Elkan, The foundations of cost-sensitive learning, in: International Joint Conference on Artificial Intelligence, volume 17, Lawrence Erlbaum Associates Ltd, 2001, pp. 973–978.
[10] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad, M. Chenaghlu, J. Gao, Deep learning based text classification: A comprehensive review, arXiv preprint arXiv:2004.03705 (2020).
[11] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, Natural language processing (almost) from scratch, Journal of Machine Learning Research 12 (2011) 2493–2537.
[12] A. S. Maiya, ktrain: A low-code library for augmented machine learning, arXiv preprint arXiv:2004.10703 (2020).
[13] M. Gulić, B. Vrdoljak, M. Banek, CroMatcher: An ontology matching system based on automated weighted aggregation and iterative final alignment, Journal of Web Semantics 41 (2016) 50–71.
[14] S. Hertling, J. Portisch, H. Paulheim, MELT – Matching EvaLuation Toolkit, in: Semantic Systems. The Power of AI and Knowledge Graphs – 15th International Conference, 2019, pp. 231–245.
[15] S. Hertling, H. Paulheim, The knowledge graph track at OAEI, in: European Semantic Web Conference, Springer, 2020, pp. 343–359.