ALOD2Vec Matcher?

    Jan Portisch[0000−0001−5420−0663] and Heiko Paulheim[0000−0003−4386−8195]

          Data and Web Science Group, University of Mannheim, Germany
           jan.portisch@sap.com, heiko@informatik.uni-mannheim.de


       Abstract. In this paper, we introduce the ALOD2Vec Matcher, an on-
       tology matching tool that exploits a Web-scale data set, i.e., WebIsA-
       LOD, as external knowledge source. In order to make use of the data set,
       the RDF2Vec approach is chosen to derive embeddings for each concept
       available in the data set.
       We show that it is possible to use very large RDF graphs as external
       background knowledge source for the task of ontology matching.

       Keywords: Ontology Matching · Ontology Alignment · External Re-
       sources · Vector Space Embeddings · RDF2Vec


1     Presentation of the System
1.1    State, purpose, general statement
The ALOD2Vec Matcher is an element-level, label-based matcher which uses a
large-scale Web-crawled RDF data set of hypernymy relations as background
knowledge. One advantage of that data set is the inclusion of many tail-entities,
as well as instance data, such as persons or places, which cannot be found in
thesauri. In order to make use of the external data set, a neural language model
approach is used to calculate an embedding vector for each concept contained
in it.
Given two entities e1 and e2 , the matcher uses their textual labels to link them to
concepts e01 and e02 in the external data set. Then, the pre-calculated embedding
vectors ve01 and ve02 of the linked concepts (e01 and e02 ) are retrieved and the cosine
similarity between those is calculated. Hence: sim(e1 , e2 ) = simcosine (ve01 , ve02 ).
The resulting alignment is homogenous, i.e., classes, object properties, and data-
type properties are handled separately. In addition, the matcher enforces a one-
to-many matching restriction.

1.2    Specific techniques used
For the alignment process, the matcher retrieves textual descriptions of all ele-
ments of the ontologies to be matched. A filter adds all simple string matches
to the final alignment in order to increase the performance. The remaining la-
bels are linked to concepts of the background data set, are compared, and the
best solution is added to the final alignment. A high-level view of the system is
depicted in figure 1.
?
    Supported by SAP SE.
2      J. Portisch and H. Paulheim


                       Fig. 1. ALOD2Vec Matching Process


WebIsALOD Data Set When working with knowledge bases in order to ex-
ploit the contained knowledge in applications, a frequent problem is the fact
that less common entities are not contained within the knowledge base. The
WebIsA [7] database is an attempt to tackle this problem by providing a data
set which is not based on a single source of knowledge – like DBpedia [3] – but in-
stead on the whole Web: The data set consists of hypernymy relations extracted
from the Common Crawl 1 , a freely downloadable crawl of a significant portion
of the Web. A sample triple from the data set is european union skos:broader
international organization 2 . The data set is also available via a Linked Open
Data (LOD) endpoint3 under the name WebIsALOD [2]. In the LOD data set, a
machine-learned confidence score c ∈ [0, 1] is assigned to every hypernymy triple
indicating the assumed degree of truth of the statement.


RDF2Vec The background data set can be viewed as a very large knowledge
graph; in order to obtain a similarity score for nodes in that graph, the RDF2Vec
[6] approach is used. It applies the word2vec [4,5] model to RDF data: Random
walks are performed for each node and are interpreted as sentences. After the
walk generation, the sentences are used as input for the word2vec algorithm. As
a result, one obtains a vector for each word, i.e., a concept in the RDF graph.
The approach is used here to obtain vectors for all concepts in the WebIsALOD
data set.
1
  see http://commoncrawl.org/
2
  see http://webisa.webdatacommons.org/concept/european_union_
3
  see http://webisa.webdatacommons.org/
                                                          ALOD2Vec Matcher        3

Linking The first step is to link the obtained labels from the ontology to con-
cepts in the WebIsALOD data set. Therefore, string operations are performed
on the label and it is checked whether the label is available in WebIsALOD. If
it cannot be found, labels consisting of multiple words are truncated from the
right, and the process is repeated to check for sub-concepts. For example, the
label United Nations Peacekeeping Mission in Mali cannot be found in WebIsA-
LOD. Therefore, it is truncated until the longest label from the left is found – in
this case United Nations. The process is repeated until all tokens are processed.
The resulting concepts for the given label are: United Nations 4 , peacekeeping
mission 5 , and Mali 6 .


Similarity Calculation As stated before, labels are linked to concepts, their
vectors are retrieved, and the cosine similarity between them is used as similarity
score.
    There are cases in which parts of a label cannot be found, however, for exam-
ple in tubule macula and in macula lutea both times only macula can be found
using the WebIsALOD data set. If only the found concepts would be used to
calculate the similarity between the concepts, a perfect score would be obtained
because sim(macula, macula) = 1.0. This is not precise as the approach does
not allow to discriminate between perfect matches due to incomplete linking and
real perfect matches. Therefore, a penalty factor p ∈ [0, 1] is introduced that is
to be multiplied with the final similarity score and which lowers the score for in-
complete links; p = 0 indicates the maximal penalty, p = 1 indicates no penalty.
The calculation of p is depicted in equation 1:

                    |F ound Concepts L1 |            |F ound Concepts L2 |
       p = 0.5 ∗                            + 0.5 ∗                             (1)
                   |P ossible Concepts L1 |         |P ossible Concepts L2 |

where L1 is the label of the first concept and L2 is the label of the second one;
|F ound Concepts Li | is the number of tokens for which a concept could be found
(minus stopwords) and |P ossible Concepts Li | is the number of tokens of the
label without stopwords. The penalty score is multiplied with the final similarity
score. Hence, incomplete linkages are penalized.
    If two labels were matched to multiple concepts, a resolution is required. In
this case the best average similarity is used:
                                     1|c |    2 |c |
                                  Σi∈c 1
                                         M axj∈c 2
                                                   sim(c1i , c2j )
                     simaverage =                                               (2)
                                              |c1 |

where c1 and c2 represent two individual concepts and c1i , respectively c2j ,
represent the ith and j th sub-concept of c1 and c2 ; |c1 | and |c2 | are the number
of sub-concepts of c1 and c2 ; c1 is the concept with more tokens.
4
  see http://webisa.webdatacommons.org/concept/united_nations_
5
  see http://webisa.webdatacommons.org/concept/peacekeeping_mission_
6
  see http://webisa.webdatacommons.org/concept/_mali_
4       J. Portisch and H. Paulheim

    Typically, there is more than one label to an entity of an ontology. Therefore,
a score-matrix is used: Every label of an entity is linked and compared to every
label of the other entity and the best score is returned.

RDF2Vec Configuration Parameters We generated 100 sentences of depth
8 for each node in the WebIsALOD data set for the training process of the model.
In order to have also sentences for nodes that do not have out-going edges, those
were identified and sentences were generated backwards and afterwards reversed.
The sentences were generated in a biased fashion [1], i.e., high-confidence edges
are followed with a higher probability. Eventually, the embeddings were trained
using the continuous bag of words (CBOW) approach with the parameters of the
original RDF2Vec paper: window size = 5, number of iterations = 5, negative
sampling = true, negative samples = 25, average input vector = true, and
200 dimensional embeddings.

2     Results
2.1   Anatomy
For the Anatomy data set, the matcher achieves a higher recall and F1 score
compared to the baseline solution. However, the true positives are mostly exact
lexical matches or share many common tokens.
Concerning runtime-performance, ALOD2Vec Matcher performs in the upper
half of all matchers that participated in the Anatomy track.

2.2   Conference
On the Conference data set, it can be seen that the matcher is better in aligning
classes than in aligning properties. This is in line with the results reported for
other matchers. In this case, it is due to fewer lexical matches in properties as
well as the higher usage of non-nouns which cannot be properly linked to the
background knowledge source.

2.3   Large BioMed
For the Large BioMed matching tasks, the matcher is capable of aligning the
small fragments within the given time frame of 6 hours. While ALOD2Vec
Matcher performs slightly above the 2017 and 2018 F1 averages on the small
FMA-NCI data set, it perfoms in the lower half for the remaining ones.

2.4   Complex Track
Although the matcher presented here is not capable of generating complex corre-
spondences yet, it could produce results for the entity identification subtask for
two data sets: On GeoLink, ALOD2Vec Matcher achieved the highest F1 score
and recall of all matchers that participated; on Hydrograph, alignments for the
English ontologies could be generated and scored within the median.
                                                        ALOD2Vec Matcher           5

3     General Comments

3.1   Comments on the results

The matcher performs above the given baselines. However, the matches are still
rather trivial and mostly share common tokens.
     There are multiple reasons for the mediocre performance. First, the under-
lying data set is very noisy: It contains a lot of wrong information (e.g. fish
skos:broader fisher )7 , subjective information (e.g. donald trump skos:broader lu-
natic)8 , and is not strictly hierarchical (e.g. live skos:broader quality, and vice
versa)9 . In addition, the tail-entity problem is still not solved because very spe-
cific entities are involved in very few hypernymy statements and their resulting
vectors are likely not meaningful (e.g. complex congenital heart defect)10 .
     Besides the pitfalls of the data set, the matcher cannot handle homonyms,
non-nouns, or non-English labels.


3.2   Discussions on the way to improve the proposed system

There are three ways in which the current research focusing on this approach can
be improved in the future: Firstly, more propositionalization techniques for very
large data sets could be explored. Secondly, the matcher itself can be enhanced to
use more information available in ontologies such as their structure. And lastly,
the data sets to be used can be improved. WebIsALOD is only one Web-scale
RDF data set and still has some pitfalls such as the restriction to hypernymy
relations and noise. More such data sets can be created and used in the future.


4     Conclusion

In this paper, we presented the ALOD2Vec Matcher, a matcher utilizing a Web-
crawled knowledge data set by applying the RDF2Vec methodology to a hyper-
nymy data set extracted from the Web. It could be shown that it is possible to
use very large RDF graphs as external background knowledge and the RDF2Vec
methodology for the task of ontology matching.


References
1. Cochez, M., Ristoski, P., Ponzetto, S.P., Paulheim, H.: Biased graph walks for rdf
   graph embeddings. In: Proceedings of the 7th International Conference on Web
   Intelligence, Mining and Semantics. p. 21. ACM (2017)
7
   see http://webisa.webdatacommons.org/concept/_fish_
8
   see http://webisa.webdatacommons.org/concept/donald_trump_
 9
   see http://webisa.webdatacommons.org/concept/_quality_
10
   see http://webisa.webdatacommons.org/concept/complex+congenital+heart_
   defect_
6       J. Portisch and H. Paulheim

2. Hertling, S., Paulheim, H.: WebIsALOD: Providing Hypernymy Relations Extracted
   From the Web as Linked Open Data. In: International Semantic Web Conference.
   pp. 111–119. Springer (2017)
3. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hell-
   mann, S., Morsey, M., Bizer, C.: DBpedia A Large-scale, Multilingual Knowledge
   Base Extracted from Wikipedia. Semantic Web 6(2), 167–195 (2012)
4. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Repre-
   sentations in Vector Space. arXiv preprint arXiv:1301.3781 (2013)
5. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Dis-
   tributed Representations of Words and Phrases and Their Compositional-
   ity. In: Advances in Neural Information Processing Systems. pp. 3111–3119
   (2013), http://papers.nips.cc/paper/5021-distributed-representations-of-
   words-and-phrases-and-their-compositionality
6. Ristoski, P., Rosati, J., Di Noia, T., De Leone, R., Paulheim, H.: RDF2vec: RDF
   Graph Embeddings and Their Applications. Semantic Web Journal (2017), http:
   //www.semantic-web-journal.net/system/files/swj1495.pdf
7. Seitner, J., Bizer, C., Eckert, K., Faralli, S., Meusel, R., Paulheim, H., Ponzetto,
   S.P.: A Large DataBase of Hypernymy Relations Extracted from the Web. In: LREC
   (2016)