LSMatch and LSMatch-Multilingual Results for OAEI 2022
Abhisek Sharma 1,∗,†, Archana Patel 2,† and Sarika Jain 1,†
1 National Institute of Technology Kurukshetra, India
2 Eastern International University, Vietnam


Abstract
This paper presents the Large-Scale Ontology Matching System in its two variants, LSMatch and LSMatch-Multilingual, and their results on the OAEI 2022 datasets. LSMatch is an element-level, label-based ontology matching system that combines a string similarity matcher with a synonyms matcher. LSMatch-Multilingual uses the same configuration together with the MyMemory translation memory to handle multilingual ontologies. The systems can match classes, instances, and properties between two ontologies in both monolingual and multilingual settings. This year, LSMatch and LSMatch-Multilingual together participated in six OAEI tracks: Anatomy, Conference, Multifarm, Bio-ML, Common Knowledge Graphs, and Knowledge Graph. LSMatch has shown encouraging outcomes across all six tracks.

Keywords
Ontology Matching, Knowledge Schema, Alignment, String similarity, Synonym matcher.


1. Presentation of the system
1.1. State, purpose, general statement
LSMatch (Large Scale Ontology Matching System) is an ontology matching system that finds
correspondences between ontologies using lexical properties. It employs the Levenshtein string
similarity measure and a synonyms matcher, which uses background knowledge containing
synonyms to identify concepts with similar meanings but different lexical representations
[1]. For multilingual matching, LSMatch uses the MyMemory translation memory. This is
LSMatch’s second OAEI appearance, and it was tested on six tracks: Anatomy, Conference,
Multifarm, Bio-ML, Common Knowledge Graphs, and Knowledge Graph. The LSMatch system
was wrapped in the MELT framework [2]. It performs on par with other systems, and in the
Multifarm track LSMatch-Multilingual obtained the highest F1-score.


∗
    Corresponding author.
†
     These authors contributed equally.
Email: abhisek_61900048@nitkkr.ac.in (A. Sharma); archana.patel@eiu.edu.vn (A. Patel); jasarika@nitkkr.ac.in (S. Jain)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
1.2. Specific techniques used
Compared to last year’s submission, the current version of LSMatch addresses both monolingual
and multilingual ontology alignment. The workflow of the LSMatch system is shown in
Figure 1. We introduce the parts of the system by following two knowledge
schemas/ontologies through it. LSMatch takes input in any format and loads the input
schemas/ontologies as RDF graphs. After extracting classes, properties, and instances, we
perform stemming, remove stopwords and non-alphabetic characters, and normalize letters.
We then pass the ontology concepts through the Levenshtein and synonyms matcher modules.
The underlying modules have the following functionality:

[Figure 1 (diagram): Input layer (source ontology, target ontology, reference alignment); Pre-processing (loading ontology as graph object, extracting concepts/properties/instances, text preprocessing); Processing (Levenshtein matcher, synonyms matcher, similarity matrix, alignment filtering), supported by external resources (synonym source, translations); Output layer (final alignment; evaluation by precision, recall, and F1).]

Figure 1: Combined architecture of LSMatch and LSMatch-Multilingual systems


        • Levenshtein matcher: LSMatch uses a string similarity matcher that calculates the
          Levenshtein distance between concepts [3]. The concepts are represented by their
          rdfs:label or directly by the class name in the ontologies. The Levenshtein distance
          is defined as “The smallest number of insertions, deletions, and substitutions
          required to change one string or tree into another” (footnote 1).
        • Background knowledge [4]: To identify different lexical representations, LSMatch uses a
          synonym matcher that fetches synonyms from WordNet [5]. Python’s nltk library is used
          to access WordNet.
        • Synonym Matcher: LSMatch fetches synonyms from WordNet. Although the synonyms are
          pre-fetched, during execution each concept is checked to confirm that its synonyms
          are present. If a concept has no pre-fetched synonyms, we fetch them on the fly.

1
    https://xlinux.nist.gov/dads/HTML/Levenshtein.html
        • Translations (footnote 2): For translations we use MyMemory’s translation memory, as it
          provides good translations, is free, and is the world’s largest translation memory.
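
The label preprocessing and the two matchers described above can be illustrated with a minimal sketch. This is not the LSMatch implementation: the stopword list, the small in-memory synonym table (standing in for WordNet accessed through nltk), and the normalization of the edit distance to a similarity in [0, 1] are assumptions made for illustration, and stemming is omitted.

```python
import re

# Hypothetical stopword list; the paper does not specify which one LSMatch uses.
STOPWORDS = {"a", "an", "and", "of", "the", "is", "has"}

# Hypothetical synonym table standing in for WordNet lookups.
SYNONYMS = {"paper": {"article"}, "author": {"writer"}}


def normalize(label):
    """Lowercase a label, drop non-alphabetic characters and stopwords."""
    tokens = re.sub(r"[^A-Za-z]+", " ", label).lower().split()
    return " ".join(t for t in tokens if t not in STOPWORDS)


def levenshtein(a, b):
    """Smallest number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Edit distance scaled to [0, 1]; this scaling is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def are_synonyms(a, b):
    """True if the labels are equal or listed as synonyms in either direction."""
    return a == b or b in SYNONYMS.get(a, set()) or a in SYNONYMS.get(b, set())
```

For example, `normalize("The Program_Committee")` yields `"program committee"`, which can then be compared against a normalized target label with `similarity` and `are_synonyms`.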

   LSMatch uses a dictionary to store and retrieve alignments. The dictionary holds
<key, value> pairs in which the key is hashed [6, 7]. LSMatch stores the alignments produced
by both matchers along with their similarity scores. Because the scores of pairs are stored
and updated multiple times during the alignment process, hashed keys let us do this
efficiently. By default, LSMatch keeps all alignments with a combined score
(Levenshtein + Synonym) of 0.5 or above, so that the alignments can be checked over
variable thresholds. For the final selection of alignments, the current version of LSMatch
uses a threshold of 0.95.
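
A minimal sketch of such a store is given below, under stated assumptions: the function names are hypothetical, and because the text does not specify how the Levenshtein and synonym scores are combined, the sketch simply takes the larger of the two. Only the 0.5 and 0.95 thresholds come from the description above.

```python
KEEP_THRESHOLD = 0.5    # candidates at or above this combined score are kept
FINAL_THRESHOLD = 0.95  # threshold used for the final alignment


def record(store, source, target, lev_score, syn_score):
    """Insert or update the score of a (source, target) pair in the dictionary.

    Assumption: the combined score is the maximum of the two matcher scores.
    """
    combined = max(lev_score, syn_score)
    if combined < KEEP_THRESHOLD:
        return
    key = (source, target)  # tuples are hashable, so lookup and update are O(1) on average
    store[key] = max(combined, store.get(key, 0.0))


def final_alignment(store, threshold=FINAL_THRESHOLD):
    """Return the stored pairs whose score reaches the given threshold."""
    return sorted((s, t, score) for (s, t), score in store.items() if score >= threshold)


store = {}
record(store, "Paper", "Article", 0.14, 1.0)    # kept through the synonym score
record(store, "Author", "Authors", 0.86, 0.0)   # kept as a candidate only
record(store, "Chair", "Review", 0.17, 0.0)     # below 0.5, discarded
print(final_alignment(store))
```

Keeping every candidate at or above 0.5 means that a different final threshold can be tried by calling `final_alignment(store, threshold=...)` without rerunning the matchers.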


2. Results
This section describes the results of the LSMatch and LSMatch-Multilingual systems on six
tracks: Anatomy, Conference, Multifarm, Bio-ML, Common Knowledge Graphs, and Knowledge
Graph. The results are summarized in Table 1. Differences from OAEI 2021 are discussed in
the subsections below.

2.1. Anatomy
In the Anatomy track, the overall result is almost the same as last year. Recall improved
from 0.618 to 0.634, but precision fell from 0.997 to 0.952, so the F-measure decreased
slightly, from 0.763 to 0.761.
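
The small drop in F-measure despite higher recall can be checked numerically. The sketch below assumes that the F1 values in Table 1 are the usual harmonic mean of precision and recall; the function name is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Anatomy (Mouse-Human) precision and recall as reported in Table 1.
f1_2022 = f1_score(0.952, 0.634)
f1_2021 = f1_score(0.997, 0.618)
print(round(f1_2022, 3), round(f1_2021, 3))
```

Under this assumption the computed values agree with the reported 0.761 and 0.763: the gain in recall is offset by the loss in precision.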

2.2. Conference
For the Conference track, the results are identical to last year’s because, due to an error,
we had to run last year’s version of LSMatch on this track.

2.3. Multifarm
This is LSMatch’s first entry in the Multifarm track, for which we specifically developed
LSMatch-Multilingual. Both versions of LSMatch were tested on this track, and
LSMatch-Multilingual obtained the best F1-score among all systems, 0.47 (see Table 2 for
comparative results).

2.4. Bio-ML
The Bio-ML track is a Machine Learning (ML) friendly biomedical track that supersedes
the previous largebio and phenotype tracks. LSMatch was tested on five tasks in total, all
of them equivalence matching, one per ontology pair:
OMIM-ORDO (Disease), NCIT-DOID (Disease), SNOMED-FMA (Body), SNOMED-NCIT (Pharm),
and SNOMED-NCIT (Neoplas). On OMIM-ORDO (Disease) and NCIT-DOID (Disease), LSMatch

2
    https://mymemory.translated.net/
Table 1
Result summary of LSMatch at OAEI 2022 and OAEI 2021
  Task                                                     Year   Precision   F1      Recall
  -- Anatomy --
  Mouse-Human                                              2022   0.952       0.761   0.634
  Mouse-Human                                              2021   0.997       0.763   0.618
  -- Conference --
  OntoFarm (rar2-M3)                                       2022   0.83        0.55    0.41
  OntoFarm (rar2-M3)                                       2021   0.83        0.55    0.41
  OntoFarm (Sharp)                                         2022   0.88        0.57    0.42
  OntoFarm (Sharp)                                         2021   0.88        0.57    0.42
  OntoFarm (Discrete)                                      2022   0.87        0.66    0.53
  OntoFarm (Discrete)                                      2021   0.88        0.66    0.53
  OntoFarm (Continuous)                                    2022   0.88        0.67    0.54
  OntoFarm (Continuous)                                    2021   0.88        0.67    0.54
  DBpedia-OntoFarm                                         2022   0.5         0.55    0.6
  DBpedia-OntoFarm                                         2021   0.5         0.55    0.6
  -- Bio-ML (Unsupervised (90% Test Mapping)) --
  Equivalent Matching Results for OMIM-ORDO (Disease)      2022   0.65        0.329   0.221
  Equivalent Matching Results for NCIT-DOID (Disease)      2022   0.719       0.633   0.565
  Equivalent Matching Results for SNOMED-FMA (Body)        2022   0.809       0.132   0.072
  Equivalent Matching Results for SNOMED-NCIT (Pharm)      2022   0.982       0.706   0.551
  Equivalent Matching Results for SNOMED-NCIT (Neoplas)    2022   0.902       0.377   0.238
  -- Bio-ML (Semi-supervised (70% Test Mapping)) --
  Equivalent Matching Results for OMIM-ORDO (Disease)      2022   0.594       0.325   0.223
  Equivalent Matching Results for NCIT-DOID (Disease)      2022   0.665       0.611   0.565
  Equivalent Matching Results for SNOMED-FMA (Body)        2022   0.762       0.128   0.07
  Equivalent Matching Results for SNOMED-NCIT (Pharm)      2022   0.976       0.702   0.548
  Equivalent Matching Results for SNOMED-NCIT (Neoplas)    2022   0.877       0.374   0.238
  -- Large BioMed and Disease & Phenotype track (2021) --
  FMA-NCI small                                            2021   0.979       0.876   0.792
  FMA-SNOMED small                                         2021   0.988       0.33    0.198
  HP-MP task                                               2021   1           0.421   0.267
  DOID-ORDO task                                           2021   1           0.463   0.301
  -- Common KG Track --
  Nell-DBPedia                                             2022   0.96        0.84    0.75
  Nell-DBPedia                                             2021   0.99        0.87    0.78
  Yago-Wikidata                                            2022   0.96        0.76    0.63
  -- Knowledge Graph Track --
  Year   Class (P / F1 / R)    Property (P / F1 / R)   Instance (P / F1 / R)   Overall (P / F1 / R)
  2022   0.97 / 0.78 / 0.64    0.73 / 0.71 / 0.69      0.66 / 0.63 / 0.6       0.66 / 0.63 / 0.61
  2021   1 / 0.78 / 0.64       0 / 0 / 0               0 / 0 / 0               1 / 0.01 / 0
Table 2
Results on Multifarm Track
                    System                     Precision       F1      Recall
                    LSMatch                       0.24       0.038      0.021
                    LSMatch-multilingual          0.68       0.47        0.36
                    CIDER-LM                      0.16        0.25      0.58
                    LogMap                        0.72        0.44       0.31
                    LogMapLt                      0.24       0.038       0.02
                    Matcher                     0.00082    0.000082   0.000043


obtained average results. On SNOMED-FMA (Body), LSMatch has the 6th best precision out of 9.
On SNOMED-NCIT (Pharm) and SNOMED-NCIT (Neoplas), LSMatch has the 2nd best precision, just
after LogMap-Lite. All of the above results are for the Unsupervised (90% Test Mapping)
setting. For the Semi-supervised (70% Test Mapping) setting, LSMatch has average performance
in all tasks.

2.5. Common Knowledge Graphs
This year the Common Knowledge Graphs track has one additional task, Yago-Wikidata, on which
LSMatch’s performance was decent but needs improvement. On the Nell-DBPedia task, LSMatch’s
results are close to last year’s.

2.6. Knowledge Graph
In OAEI 2021, LSMatch supported only class matching. This year (OAEI 2022), LSMatch added
functionality to also match instances and properties. Class matching results are nearly the
same as last year (F1 of 0.78 and recall of 0.64 in both years, with precision of 0.97
versus 1). With property and instance matching added, the overall result this year was 0.66
precision, 0.63 F1, and 0.61 recall, compared with 1, 0.01, and 0 last year.


3. Conclusion
This year, the system was tested on six tracks: Anatomy, Conference, Multifarm, Bio-ML,
Common Knowledge Graphs, and Knowledge Graph. The system achieved good precision in all
the tracks but lagged behind in recall. In future versions, we will add further matchers
and improve the use of background knowledge, so that we can find better correspondences
between concepts that are not properly aligned using lexical measures alone.


References
[1] S. Zhang, Y. Hu, G. Bian, Research on string similarity algorithm based on levenshtein
    distance, in: 2017 IEEE 2nd Advanced Information Technology, Electronic and Automation
    Control Conference (IAEAC), IEEE, 2017, pp. 2247–2251.
[2] S. Hertling, J. Portisch, H. Paulheim, Melt-matching evaluation toolkit, in: International
    conference on semantic systems, Springer, Cham, 2019, pp. 231–245.
[3] T. T. A. Nguyen, S. Conrad, Ontology matching using multiple similarity measures, in:
    2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering
    and Knowledge Management (IC3K), volume 1, IEEE, 2015, pp. 603–611.
[4] Z. Aleksovski, W. Ten Kate, F. Van Harmelen, Exploiting the structure of background
    knowledge used in ontology matching, in: Ontology Matching, 2006, p. 13.
[5] G. A. Miller, Wordnet: a lexical database for english, Communications of the ACM 38 (1995)
    39–41.
[6] P. Ochieng, S. Kyanda, Large-scale ontology matching: State-of-the-art analysis, ACM
    Computing Surveys (CSUR) 51 (2018) 1–35.
[7] S. Anam, Y. S. Kim, B. H. Kang, Q. Liu, Review of ontology matching approaches and
    challenges, International Journal of Computer Science and Network Solutions 3 (2015) 1–27.