ALOD2vec Matcher Results for OAEI 2021

Jan Portisch^1,2 [0000-0001-5420-0663] and Heiko Paulheim^1 [0000-0003-4386-8195]

^1 Data and Web Science Group, University of Mannheim, Germany
   {jan, heiko}@informatik.uni-mannheim.de
^2 SAP SE Business Technology Platform - One Domain Model, Walldorf, Germany
   jan.portisch@sap.com


        Abstract. This paper presents the results of the ALOD2vec Matcher in
        the Ontology Alignment Evaluation Initiative (OAEI) 2021. The matching
        system exploits a Web-scale dataset, namely WebIsALOD, as a background
        knowledge source. To make use of the dataset, the RDF2vec approach is
        applied to derive embeddings for each concept available in the dataset.
        ALOD2vec Matcher previously participated in the OAEI 2018 and 2020
        campaigns; this is the system's third participation.^3

        Keywords: Ontology Matching · Ontology Alignment · External Resources ·
        Background Knowledge · Knowledge Graph Embeddings · RDF2vec


1     Presentation of the System

1.1    State, Purpose, General Statement

The ALOD2vec Matcher is an element-level, label-based matcher which uses a
large-scale Web-crawled RDF dataset of hypernymy relations as general-purpose
background knowledge. The dataset contains many tail entities as well as
instance data, such as persons or places, which cannot be found in common
thesauri. In order to exploit the external dataset, a neural language model
approach is used to obtain a vector for each concept contained in the dataset.
This matching system was initially introduced at the OAEI 2018 [13] and also
participated in the 2020 campaign [10]. The implementation is based on the
Matching EvaLuation Toolkit (MELT) [6] and uses the KGvec2go [11] REST API to
obtain vector representations.


1.2    Specific Techniques Used

After the basic concepts of this matcher are introduced (Foundations), the
specific techniques applied are presented.


Foundations

^3 Copyright © 2021 for this paper by its authors. Use permitted under Creative
   Commons License Attribution 4.0 International (CC BY 4.0).

WebIsALOD Dataset A frequent problem that occurs when working with external
background knowledge is that less common entities are not contained within a
knowledge base. The WebIsA [16] database is an attempt to tackle this problem by
providing a dataset which is not based on a single source of knowledge, like
DBpedia [7], but instead on the whole Web: the dataset consists of hypernymy
relations extracted from the Common Crawl^4, a freely downloadable crawl of a
significant portion of the Web. A sample triple from the dataset is
"european union skos:broader international organization"^5. The dataset is also
available via a Linked Open Data (LOD) endpoint^6 under the name WebIsALOD [5].
In the LOD dataset, a machine-learned confidence score c ∈ [0, 1] is assigned
to every hypernymy triple, indicating the assumed degree of truth of the
statement.
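
To illustrate how the confidence score can be used, the following minimal
sketch models hypernymy triples in Python and keeps only those above a cut-off.
The class name, the function name, the second triple, and all confidence values
are hypothetical and chosen for illustration only; they are not taken from the
dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HypernymyTriple:
    hyponym: str
    hypernym: str
    confidence: float  # machine-learned score c in [0, 1]


def filter_by_confidence(triples, min_confidence):
    """Keep only triples whose confidence is at least min_confidence."""
    return [t for t in triples if t.confidence >= min_confidence]


# Hypothetical sample data; the confidence values are made up.
triples = [
    HypernymyTriple("european union", "international organization", 0.9),
    HypernymyTriple("european union", "country", 0.2),
]
kept = filter_by_confidence(triples, min_confidence=0.5)
```

With the made-up scores above, only the first triple survives the cut-off.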

RDF2vec The background dataset can be viewed as a very large knowledge
graph; in order to obtain a similarity score for nodes and edges in that graph, the
RDF2vec [15] approach is used. It applies the word2vec [8,9] model to RDF data:
Random walks are performed for each node and are interpreted as sentences.
After the walk generation, the sentences are used as input for the word2vec
algorithm. As a result, one obtains a vector for each word, i.e., a concept in the
RDF graph. Multiple flavors of RDF2vec have been developed in the past, such
as biased walks [1] or RDF2Vec Light [12].^7
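
The walk generation step can be sketched as follows. This is a simplified
illustration under stated assumptions, not the RDF2vec reference
implementation: the graph is a plain adjacency dictionary, the function name
and parameters are hypothetical, and the second edge of the toy graph is
invented. The resulting walks would then be passed as sentences to a word2vec
implementation.

```python
import random


def random_walks(graph, walks_per_node, depth, seed=42):
    """Generate random walks of the form node, edge, node, edge, ..."""
    rng = random.Random(seed)
    walks = []
    for start in graph:
        for _ in range(walks_per_node):
            walk = [start]
            node = start
            for _ in range(depth):
                edges = graph.get(node, [])
                if not edges:
                    break  # dead end: the walk stops early
                predicate, target = rng.choice(edges)
                walk.extend([predicate, target])
                node = target
            walks.append(walk)
    return walks


# Toy graph: node -> list of (predicate, object); the second edge is made up.
graph = {
    "european_union": [("skos:broader", "international_organization")],
    "international_organization": [("skos:broader", "organization")],
    "organization": [],
}
walks = random_walks(graph, walks_per_node=2, depth=2)
```

Each walk is a list of tokens, so nodes and edges alike end up in the
vocabulary of the downstream language model.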

KGvec2go Training embeddings on large knowledge graphs can be computationally
very expensive. Moreover, the resulting embedding models can be very large since
a multidimensional vector needs to be persisted for every node in the knowledge
graph. However, most downstream applications require only a small subset of node
vectors. The KGvec2go project [11] addresses these problems by providing a free
REST API^8 for pre-trained RDF2vec models on various large knowledge graphs
(among which WebIsALOD is also available).
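
The idea of fetching only the vectors that are actually needed can be sketched
with a small lazy cache. The class below is hypothetical and is not the
KGvec2go client: the remote lookup is injected as a plain function, and a
dictionary stands in for the Web service so that the example runs offline.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]


class CachedVectorStore:
    """Fetch vectors on demand and remember them (hypothetical helper)."""

    def __init__(self, fetch: Callable[[str], Optional[Vector]]):
        self._fetch = fetch
        self._cache: Dict[str, Optional[Vector]] = {}
        self.remote_calls = 0

    def get(self, concept: str) -> Optional[Vector]:
        if concept not in self._cache:
            self.remote_calls += 1
            self._cache[concept] = self._fetch(concept)
        return self._cache[concept]


# Stand-in for a remote service; the vectors are made up.
_fake_service = {"lens": [0.1, 0.9], "epithelium": [0.4, 0.6]}
store = CachedVectorStore(lambda concept: _fake_service.get(concept))

first = store.get("lens")
second = store.get("lens")  # served from the cache, no second remote call
missing = store.get("unknown concept")
```

Caching misses as well as hits avoids repeating requests for labels that have
no counterpart in the background dataset.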


Monolingual Matching ALOD2vec Matcher is a monolingual matching system. For
the alignment process, the system retrieves the labels of all elements of the
ontologies to be matched. A filter adds all simple string matches to the final
alignment in order to increase the performance. The remaining labels are linked
to concepts in the background dataset, the linked concepts are compared, and
the best-scoring correspondence is added to the final alignment. A high-level
view of the matching system is provided in Figure 1.
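
A string-match filter of this kind might look as follows. This is a minimal
sketch: the normalization rules, the function names, the confidence of 1.0,
and the example identifiers are assumptions made for illustration, since the
text only states that simple string matches are added first.

```python
import re


def normalize(label):
    """Split camelCase, unify separators, and lowercase (assumed rules)."""
    label = re.sub(r"([a-z])([A-Z])", r"\1 \2", label)
    label = re.sub(r"[_\-\s]+", " ", label)
    return label.strip().lower()


def string_match_filter(labels_1, labels_2):
    """Return string matches plus the labels left for embedding matching."""
    index = {}
    for entity, label in labels_2.items():
        index.setdefault(normalize(label), []).append(entity)
    matches, remaining_1 = [], {}
    for entity, label in labels_1.items():
        hits = index.get(normalize(label))
        if hits:
            matches.extend((entity, hit, 1.0) for hit in hits)
        else:
            remaining_1[entity] = label
    matched_2 = {m[1] for m in matches}
    remaining_2 = {e: l for e, l in labels_2.items() if e not in matched_2}
    return matches, remaining_1, remaining_2


# Hypothetical input: entity identifier -> label.
labels_1 = {"o1:AqueousHumor": "AqueousHumor", "o1:Lens": "lens epithelium"}
labels_2 = {"o2:aqueous_humor": "aqueous humor", "o2:lens": "surface of lens"}
matches, rest_1, rest_2 = string_match_filter(labels_1, labels_2)
```

Only the entries in rest_1 and rest_2 would need to be looked up in the
background dataset.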

^4 see http://commoncrawl.org/
^5 see http://webisa.webdatacommons.org/concept/european_union_
^6 see http://webisa.webdatacommons.org/
^7 For a good overview of the RDF2vec approach and its applications, refer to
   http://www.rdf2vec.org/
^8 see http://kgvec2go.org/api.html

Fig. 1. High-level view of the ALOD2vec matching process. KG1 and KG2 represent
the input ontologies and optionally instances. The final alignment is referred to as A.

    The first step is to link the obtained labels from the ontology to concepts in
the WebIsALOD dataset. To this end, string operations are performed on the label
and it is checked whether the label is available in WebIsALOD. If it cannot be
found, a token lookup is performed. Given two entities $e_1$ and $e_2$, the matcher
uses their textual labels to link them to concepts $e'_1$ and $e'_2$ in the external
dataset. Afterwards, the embedding vectors $v_{e'_1}$ and $v_{e'_2}$ of the linked
concepts are retrieved via a Web request and the cosine similarity between them is
calculated. Hence: $sim(e_1, e_2) = sim_{cosine}(v_{e'_1}, v_{e'_2})$. If
$sim(e_1, e_2) > t$, where $t$ is a threshold between 0 and 1, a correspondence is
added to a temporary alignment. In a last step, a one-to-one arity is enforced by
applying a Maximum Weight Bipartite [2] filter on the temporary alignment.
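
The similarity, threshold, and one-to-one steps can be sketched as follows.
The code is an illustration under stated assumptions and not the MELT filter:
the function names, the toy vectors, and the threshold of 0.7 are hypothetical,
and the maximum weight bipartite matching is solved by an exhaustive search
with memoization, which is exact but only practical for small inputs.

```python
import math
from functools import lru_cache


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_alignment(vectors_1, vectors_2, t):
    """Temporary alignment: all pairs whose similarity exceeds t."""
    candidates = {}
    for e1, v1 in vectors_1.items():
        for e2, v2 in vectors_2.items():
            sim = cosine(v1, v2)
            if sim > t:
                candidates[(e1, e2)] = sim
    return candidates


def one_to_one(candidates):
    """Exact maximum weight bipartite matching (small inputs only)."""
    left = sorted({e1 for e1, _ in candidates})
    right = sorted({e2 for _, e2 in candidates})

    @lru_cache(maxsize=None)
    def best(i, used):
        if i == len(left):
            return 0.0, ()
        # Option 1: leave left[i] unmatched.
        best_score, best_pairs = best(i + 1, used)
        # Option 2: match left[i] with any free right-hand entity.
        for j, e2 in enumerate(right):
            weight = candidates.get((left[i], e2))
            if weight is None or used & (1 << j):
                continue
            score, pairs = best(i + 1, used | (1 << j))
            if weight + score > best_score:
                best_score = weight + score
                best_pairs = ((left[i], e2),) + pairs
        return best_score, best_pairs

    return {pair: candidates[pair] for pair in best(0, 0)[1]}


# Toy vectors of already linked concepts; values are made up.
vectors_1 = {"a": [1.0, 0.0], "b": [0.8, 0.6]}
vectors_2 = {"x": [1.0, 0.0], "y": [0.6, 0.8]}
temporary = candidate_alignment(vectors_1, vectors_2, t=0.7)
alignment = one_to_one(temporary)
```

In the toy data, "b" is similar to both "x" and "y"; the one-to-one step keeps
the assignment with the highest total similarity.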
     In order to consume the vectors in Java, a client has been implemented and
contributed to the MELT-ML module. The KGvec2go REST API can now be
accessed through the class KGvec2goClient. Even though this matcher only uses the
WebIsALOD dataset, the implementation supports all datasets accessible on
KGvec2go. The extension is available by default in MELT 2.6.


Instance Matching After classes and properties have been matched, instances
are matched using a string index. The confidence score assigned to instances
belonging to matched classes is higher than that of matches between instances
belonging to non-matched classes.
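
A string index with class-dependent confidences could be sketched as below.
The function name, the data layout, and the two confidence values are
hypothetical; the text only states that matches between instances of matched
classes receive the higher score.

```python
from collections import defaultdict


def match_instances(instances_1, instances_2, class_matches,
                    conf_same_class=1.0, conf_other=0.5):
    """Match instances by label; instances_* map id -> (label, class id)."""
    index = defaultdict(list)
    for inst, (label, cls) in instances_2.items():
        index[label.strip().lower()].append((inst, cls))
    alignment = []
    for inst_1, (label, cls_1) in instances_1.items():
        for inst_2, cls_2 in index.get(label.strip().lower(), []):
            matched = (cls_1, cls_2) in class_matches
            conf = conf_same_class if matched else conf_other
            alignment.append((inst_1, inst_2, conf))
    return alignment


# Hypothetical instances and class alignment.
instances_1 = {"i1": ("Berlin", "c1:City"), "i2": ("Paris", "c1:Person")}
instances_2 = {"j1": ("berlin", "c2:Town"), "j2": ("paris", "c2:City")}
class_matches = {("c1:City", "c2:Town")}
instance_alignment = match_instances(instances_1, instances_2, class_matches)
```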


Explainability ALOD2vec Matcher provides an explanation for every
correspondence that is added to the final alignment. To this end, the extension
capabilities of the alignment format [3] are used. Two concrete examples of
explanations from the Anatomy track are: “Label 'aqueous humour' of ontology 1
and label 'Aqueous Humor' of ontology 2 have a very similar writing.” and “The
following two label sets have a cosine above the given threshold:
|lens|anterior|epithelium| and |anterior|surface|lens|”. In order to explain a
correspondence, the description property^9 of the Dublin Core Metadata
Initiative is used.
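
Conceptually, an explanation is an extension value keyed by the property URI.
The sketch below uses a plain dictionary rather than the alignment API; the
function name, the field names, the entity URIs, and the confidence value are
hypothetical, while the property URI and the explanation text are those given
in this paper.

```python
DC_DESCRIPTION = "http://purl.org/dc/terms/description"


def make_correspondence(entity_1, entity_2, confidence, explanation):
    """Build a correspondence record carrying its explanation."""
    return {
        "entity1": entity_1,
        "entity2": entity_2,
        "relation": "=",
        "confidence": confidence,
        "extensions": {DC_DESCRIPTION: explanation},
    }


# Entity URIs and the confidence are made up for illustration.
correspondence = make_correspondence(
    "http://example.org/onto1#aqueous_humour",
    "http://example.org/onto2#Aqueous_Humor",
    1.0,
    "Label 'aqueous humour' of ontology 1 and label 'Aqueous Humor' of "
    "ontology 2 have a very similar writing.",
)
```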

1.3   Extensions to the Matching System for the 2021 Campaign
For the 2021 campaign, the matching system was adapted to use the latest
MELT release and was packaged as a MELT Web Docker^10 container. The 2021
implementation is publicly available on GitHub.^11

2     Results
2.1   Anatomy Track
On the anatomy dataset, the system scores a precision of 0.828, a recall of 0.766,
and an F1 of 0.796.
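
As a quick consistency check, the reported F1 can be recomputed from precision
and recall, assuming the standard definition of F1 as their harmonic mean. The
function name is chosen for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


anatomy_f1 = f1_score(0.828, 0.766)
print(round(anatomy_f1, 3))  # 0.796
```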

2.2   Conference Track
On the conference track, the matcher achieves a recall of 0.49 and a precision of
0.64. The overall F1 score on ra1-M3 was 0.59.

2.3   Multifarm Track
Since the WebIsALOD dataset is only available in English, the focus of the
ALOD2vec Matcher is on monolingual matching tasks.

2.4   LargeBio Track
The LargeBio track is too large for the current version of the matching
system's architecture. There is a tradeoff between package size and runtime
performance (a large package with all vectors matches faster than the submitted
small package, which obtains vectors at runtime from KGvec2go). The current
architecture of ALOD2vec Matcher is not intended for large-scale matching;
however, the matching algorithm itself could be used for large-scale matching.

2.5   Knowledge Graph Track
The system could complete all matching tasks in time. As in the previous year,
this matcher obtains the second-best results, achieving almost the same score as
the Wiktionary Matcher 2021 [14]. The overall F1 score was 0.87 on the complete
track.

^9 see http://purl.org/dc/terms/description
^10 see https://dwslab.github.io/melt/matcher-packaging/web
^11 see https://github.com/janothan/ALOD2VecMatcher

2.6   Common Knowledge Graph Track
This year, a new track was added to the OAEI: the Common Knowledge Graph
Track [4]. Although not optimized for this track, ALOD2vec Matcher achieved the
second-best result with an F1 score of 0.89.

3     Conclusion
In this paper, we presented the newest version of the ALOD2vec Matcher, a
matcher utilizing an RDF2vec vector representation of the WebIsALOD dataset,
as well as its results in the 2021 OAEI. In the future, the matching system
could be improved by using another, potentially larger or newer, hypernymy
database, by exploiting other embedding algorithms, and by adding further
matching strategies to the overall algorithm, such as checking logical
constraints.

References
 1. Cochez, M., Ristoski, P., Ponzetto, S.P., Paulheim, H.: Biased graph walks for
    RDF graph embeddings. In: Akerkar, R., Cuzzocrea, A., Cao, J., Hacid, M. (eds.)
    Proceedings of the 7th International Conference on Web Intelligence, Mining and
    Semantics, WIMS 2017, Amantea, Italy, June 19-22, 2017. pp. 21:1–21:12. ACM
    (2017), https://doi.org/10.1145/3102254.3102279
 2. Cruz, I.F., Antonelli, F.P., Stroe, C.: Efficient selection of mappings and auto-
    matic quality-driven combination of matching methods. In: Proceedings of the 4th
    International Conference on Ontology Matching-Volume 551. pp. 49–60. Citeseer
    (2009)
 3. David, J., Euzenat, J., Scharffe, F., dos Santos, C.T.: The alignment API 4.0.
    Semantic Web 2(1), 3–10 (2011), https://doi.org/10.3233/SW-2011-0028
 4. Fallatah, O., Zhang, Z., Hopfgartner, F.: A gold standard dataset for large knowl-
    edge graphs matching. In: Shvaiko, P., Euzenat, J., Jiménez-Ruiz, E., Hassanzadeh,
    O., Trojahn, C. (eds.) Proceedings of the 15th International Workshop on Ontology
    Matching co-located with the 19th International Semantic Web Conference (ISWC
    2020), Virtual conference (originally planned to be in Athens, Greece), November
    2, 2020. CEUR Workshop Proceedings, vol. 2788, pp. 24–35. CEUR-WS.org (2020),
    http://ceur-ws.org/Vol-2788/om2020_LTpaper3.pdf
 5. Hertling, S., Paulheim, H.: Webisalod: Providing hypernymy relations extracted
    from the web as linked open data. In: d’Amato, C., Fernández, M., Tamma, V.A.M.,
    Lécué, F., Cudré-Mauroux, P., Sequeda, J.F., Lange, C., Heflin, J. (eds.) The
    Semantic Web - ISWC 2017 - 16th International Semantic Web Conference, Vienna,
    Austria, October 21-25, 2017, Proceedings, Part II. Lecture Notes in Computer
    Science, vol. 10588, pp. 111–119. Springer (2017),
    https://doi.org/10.1007/978-3-319-68204-4_11
 6. Hertling, S., Portisch, J., Paulheim, H.: MELT - matching evaluation toolkit. In:
    Acosta, M., Cudré-Mauroux, P., Maleshkova, M., Pellegrini, T., Sack, H., Sure-
    Vetter, Y. (eds.) Semantic Systems. The Power of AI and Knowledge Graphs - 15th
    International Conference, SEMANTiCS 2019, Karlsruhe, Germany, September 9-
    12, 2019, Proceedings. Lecture Notes in Computer Science, vol. 11702, pp. 231–245.
    Springer (2019), https://doi.org/10.1007/978-3-030-33220-4_17
 7. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N.,
    Hellmann, S., Morsey, M., van Kleef, P., Auer, S., Bizer, C.: Dbpedia - A large-
    scale, multilingual knowledge base extracted from wikipedia. Semantic Web 6(2),
    167–195 (2015), https://doi.org/10.3233/SW-140134
 8. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word represen-
    tations in vector space. In: Bengio, Y., LeCun, Y. (eds.) 1st International Confer-
    ence on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4,
    2013, Workshop Track Proceedings (2013), http://arxiv.org/abs/1301.3781
 9. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed repre-
    sentations of words and phrases and their compositionality. In: Burges, C.J.C.,
    Bottou, L., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Infor-
    mation Processing Systems 26: 27th Annual Conference on Neural Information
    Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake
    Tahoe, Nevada, United States. pp. 3111–3119 (2013)
10. Portisch, J., Hladik, M., Paulheim, H.: Alod2vec matcher results for OAEI 2020.
    In: Shvaiko, P., Euzenat, J., Jiménez-Ruiz, E., Hassanzadeh, O., Trojahn, C.
    (eds.) Proceedings of the 15th International Workshop on Ontology Matching
    co-located with the 19th International Semantic Web Conference (ISWC 2020),
    Virtual conference (originally planned to be in Athens, Greece), November 2,
    2020. CEUR Workshop Proceedings, vol. 2788, pp. 147–153. CEUR-WS.org (2020),
    http://ceur-ws.org/Vol-2788/oaei20_paper2.pdf
11. Portisch, J., Hladik, M., Paulheim, H.: Kgvec2go - knowledge graph embed-
    dings as a service. In: Calzolari, N., Béchet, F., Blache, P., Choukri, K., Cieri,
    C., Declerck, T., Goggi, S., Isahara, H., Maegaard, B., Mariani, J., Mazo, H.,
    Moreno, A., Odijk, J., Piperidis, S. (eds.) Proceedings of The 12th Language Re-
    sources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16,
    2020. pp. 5641–5647. European Language Resources Association (2020),
    https://www.aclweb.org/anthology/2020.lrec-1.692/
12. Portisch, J., Hladik, M., Paulheim, H.: Rdf2vec light - A lightweight approach for
    knowledge graph embeddings. In: Taylor, K.L., Gonçalves, R.S., Lécué, F., Yan, J.
    (eds.) Proceedings of the ISWC 2020 Demos and Industry Tracks: From Novel Ideas
    to Industrial Practice co-located with 19th International Semantic Web Conference
    (ISWC 2020), Globally online, November 1-6, 2020 (UTC). CEUR Workshop
    Proceedings, vol. 2721, pp. 79–84. CEUR-WS.org (2020),
    http://ceur-ws.org/Vol-2721/paper520.pdf
13. Portisch, J., Paulheim, H.: Alod2vec matcher. In: Shvaiko, P., Euzenat, J., Jiménez-
    Ruiz, E., Cheatham, M., Hassanzadeh, O. (eds.) Proceedings of the 13th Interna-
    tional Workshop on Ontology Matching co-located with the 17th International
    Semantic Web Conference, OM@ISWC 2018, Monterey, CA, USA, October 8,
    2018. CEUR Workshop Proceedings, vol. 2288, pp. 132–137. CEUR-WS.org (2018),
    http://ceur-ws.org/Vol-2288/oaei18_paper3.pdf
14. Portisch, J., Paulheim, H.: Wiktionary Matcher results for OAEI 2021. In:
    OM@ISWC 2021 (2021), to appear
15. Ristoski, P., Rosati, J., Noia, T.D., Leone, R.D., Paulheim, H.: Rdf2vec: RDF
    graph embeddings and their applications. Semantic Web 10(4), 721–752 (2019),
    https://doi.org/10.3233/SW-180317
16. Seitner, J., Bizer, C., Eckert, K., Faralli, S., Meusel, R., Paulheim, H., Ponzetto,
    S.P.: A large database of hypernymy relations extracted from the web. In: Cal-
    zolari, N., Choukri, K., Declerck, T., Goggi, S., Grobelnik, M., Maegaard, B.,
    Mariani, J., Mazo, H., Moreno, A., Odijk, J., Piperidis, S. (eds.) Proceedings of
    the Tenth International Conference on Language Resources and Evaluation LREC
    2016, Portorož, Slovenia, May 23-28, 2016. European Language Resources
    Association (ELRA) (2016),
    http://www.lrec-conf.org/proceedings/lrec2016/summaries/204.html