<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ALOD2Vec Matcher</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>External Re-</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data and Web Science Group, University of Mannheim</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we introduce the ALOD2Vec Matcher, an ontology matching tool that exploits a Web-scale data set, namely WebIsALOD, as an external knowledge source. To make use of the data set, the RDF2Vec approach is used to derive an embedding for each concept available in it. We show that it is possible to use very large RDF graphs as an external background knowledge source for the task of ontology matching.</p>
      </abstract>
      <kwd-group>
        <kwd>Ontology Matching</kwd>
        <kwd>Ontology Alignment</kwd>
        <kwd>External Resources</kwd>
        <kwd>Vector Space Embeddings</kwd>
        <kwd>RDF2Vec</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>Presentation of the System</title>
      <sec id="sec-2-1">
        <title>State, purpose, general statement</title>
        <p>The ALOD2Vec Matcher is an element-level, label-based matcher which uses a
large-scale Web-crawled RDF data set of hypernymy relations as background
knowledge. One advantage of that data set is the inclusion of many tail-entities,
as well as instance data, such as persons or places, which cannot be found in
thesauri. In order to make use of the external data set, a neural language model
approach is used to calculate an embedding vector for each concept contained
in it.</p>
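        <p>Given two entities <inline-formula><tex-math>e_1</tex-math></inline-formula> and <inline-formula><tex-math>e_2</tex-math></inline-formula>, the matcher uses their textual labels to link them to
concepts <inline-formula><tex-math>e'_1</tex-math></inline-formula> and <inline-formula><tex-math>e'_2</tex-math></inline-formula> in the external data set. Then, the pre-calculated embedding
vectors <inline-formula><tex-math>v_{e'_1}</tex-math></inline-formula> and <inline-formula><tex-math>v_{e'_2}</tex-math></inline-formula> of the linked concepts are retrieved and the cosine
similarity between them is calculated. Hence:
<disp-formula><tex-math>\mathrm{sim}(e_1, e_2) = \mathrm{sim}_{cosine}(v_{e'_1}, v_{e'_2})</tex-math></disp-formula>
The resulting alignment is homogeneous, i.e., classes, object properties, and
datatype properties are handled separately. In addition, the matcher enforces a
one-to-many matching restriction.</p>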
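        <p>The similarity computation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names cosine_similarity, sim, and EMBEDDINGS are hypothetical, the three-dimensional toy vectors are made up, and labels are assumed to be already linked to concepts.</p>
<preformat>
```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings keyed by linked concept.
# The vectors described in the paper have 200 dimensions.
EMBEDDINGS = {
    "macula": [1.0, 0.0, 1.0],
    "retina": [1.0, 1.0, 0.0],
}


def sim(concept_1, concept_2, embeddings=EMBEDDINGS):
    """Similarity of two linked concepts via their embedding vectors."""
    return cosine_similarity(embeddings[concept_1], embeddings[concept_2])
```
</preformat>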
      </sec>
      <sec id="sec-2-2">
        <title>Specific techniques used</title>
        <p>
          For the alignment process, the matcher retrieves textual descriptions of all
elements of the ontologies to be matched. A filter adds all simple string matches
to the final alignment in order to increase the performance. The remaining
labels are linked to concepts of the background data set and compared, and the
best solution is added to the final alignment. A high-level view of the system is
depicted in Figure 1.
        </p>
        <p>Supported by SAP SE.</p>
        <p>
          WebIsALOD Data Set. When working with knowledge bases in order to
exploit the contained knowledge in applications, a frequent problem is
that less common entities are not contained in the knowledge base. The
WebIsA [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] database is an attempt to tackle this problem by providing a data
set which is not based on a single source of knowledge (like DBpedia [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]) but
instead on the whole Web: The data set consists of hypernymy relations extracted
from the Common Crawl<sup>1</sup>, a freely downloadable crawl of a significant portion
of the Web. A sample triple from the data set is european union skos:broader
international organization<sup>2</sup>. The data set is also available via a Linked Open
Data (LOD) endpoint<sup>3</sup> under the name WebIsALOD [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. In the LOD data set, a
machine-learned confidence score c ∈ [0, 1] is assigned to every hypernymy triple,
indicating the assumed degree of truth of the statement.
        </p>
        <p>
          RDF2Vec. The background data set can be viewed as a very large knowledge
graph; in order to obtain a similarity score for nodes in that graph, the RDF2Vec
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] approach is used. It applies the word2vec [
          <xref ref-type="bibr" rid="ref4 ref5">4,5</xref>
          ] model to RDF data: Random
walks are performed for each node and are interpreted as sentences. After the
walk generation, the sentences are used as input for the word2vec algorithm. As
a result, one obtains a vector for each word, i.e., a concept in the RDF graph.
The approach is used here to obtain vectors for all concepts in the WebIsALOD
data set.
        </p>
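        <p>The walk generation step can be illustrated with a minimal sketch. It is not the RDF2Vec implementation: the function random_walks, the adjacency-list layout of GRAPH, and the second toy triple (international_organization skos:broader organization) are assumptions made for illustration. Each walk alternates nodes and predicates and would afterwards be passed to a word2vec trainer as one sentence.</p>
<preformat>
```python
import random


def random_walks(graph, start, num_walks, depth, rng):
    """Generate num_walks walks from start; each walk follows at most depth edges."""
    walks = []
    for _ in range(num_walks):
        walk = [start]
        node = start
        for _ in range(depth):
            edges = graph.get(node, [])
            if not edges:
                break  # dead end: no out-going edge to follow
            predicate, node = rng.choice(edges)
            walk.extend([predicate, node])
        walks.append(walk)
    return walks


# Toy graph: node -> list of (predicate, target) pairs.
# The first triple is the sample from the text; the second is hypothetical.
GRAPH = {
    "european_union": [("skos:broader", "international_organization")],
    "international_organization": [("skos:broader", "organization")],
}

# Each walk is one "sentence" whose "words" are nodes and predicates.
sentences = random_walks(
    GRAPH, "european_union", num_walks=3, depth=2, rng=random.Random(42)
)
```
</preformat>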
        <sec id="sec-2-2-1">
          <p><sup>1</sup> See http://commoncrawl.org/
<sup>2</sup> See http://webisa.webdatacommons.org/concept/european_union_
<sup>3</sup> See http://webisa.webdatacommons.org/</p>
          <p>Linking. The first step is to link the labels obtained from the ontology to
concepts in the WebIsALOD data set. To this end, string operations are performed
on the label and it is checked whether the label is available in WebIsALOD. If
it cannot be found, labels consisting of multiple words are truncated from the
right, and the process is repeated to check for sub-concepts. For example, the
label United Nations Peacekeeping Mission in Mali cannot be found in
WebIsALOD. Therefore, it is truncated from the right until the longest label starting at the left is found, in
this case United Nations. The process is repeated on the remaining tokens until all tokens are processed.
The resulting concepts for the given label are: United Nations<sup>4</sup>, peacekeeping
mission<sup>5</sup>, and Mali<sup>6</sup>.</p>
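          <p>The truncation procedure can be sketched as a greedy longest-prefix search. The following is a minimal illustration, not the matcher's actual code: link_label and the in-memory set KNOWN are hypothetical stand-ins for the lookup against WebIsALOD, and lower-casing is the only string operation applied.</p>
<preformat>
```python
def link_label(label, known_concepts):
    """Greedy linking: find the longest known token prefix, then repeat on the rest."""
    tokens = label.lower().split()
    found, unlinked = [], []
    start = 0
    while start != len(tokens):
        match_end = None
        # Truncate from the right until the remaining prefix is a known concept.
        for end in range(len(tokens), start, -1):
            if " ".join(tokens[start:end]) in known_concepts:
                match_end = end
                break
        if match_end is None:
            unlinked.append(tokens[start])  # no known concept starts at this token
            start += 1
        else:
            found.append(" ".join(tokens[start:match_end]))
            start = match_end
    return found, unlinked


# Hypothetical stand-in for the concepts available in the background data set.
KNOWN = {"united nations", "peacekeeping mission", "mali"}

found, unlinked = link_label("United Nations Peacekeeping Mission in Mali", KNOWN)
```
</preformat>
          <p>In this sketch, found holds the three concepts named in the example, while the token "in" remains unlinked.</p>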
          <p>Similarity Calculation. As stated before, labels are linked to concepts, their
vectors are retrieved, and the cosine similarity between them is used as the similarity
score.</p>
          <p>There are cases, however, in which parts of a label cannot be found. For
example, in tubule macula and in macula lutea, only macula can be found
in the WebIsALOD data set in both cases. If only the found concepts were used to
calculate the similarity between the concepts, a perfect score would be obtained
because sim(macula, macula) = 1.0. This is imprecise, as the approach could
not discriminate between perfect matches due to incomplete linking and
real perfect matches. Therefore, a penalty factor p ∈ [0, 1] is introduced that is
multiplied with the final similarity score and lowers the score for
incomplete links; p = 0 indicates the maximal penalty, p = 1 indicates no penalty.
The calculation of p is depicted in equation 1:
<disp-formula><label>(1)</label><tex-math>p = 0.5 \cdot \frac{|\text{Found Concepts } L_1|}{|\text{Possible Concepts } L_1|} + 0.5 \cdot \frac{|\text{Found Concepts } L_2|}{|\text{Possible Concepts } L_2|}</tex-math></disp-formula>
where <inline-formula><tex-math>L_1</tex-math></inline-formula> is the label of the first concept and <inline-formula><tex-math>L_2</tex-math></inline-formula> is the label of the second one;
<inline-formula><tex-math>|\text{Found Concepts } L_i|</tex-math></inline-formula> is the number of tokens for which a concept could be found
(minus stopwords) and <inline-formula><tex-math>|\text{Possible Concepts } L_i|</tex-math></inline-formula> is the number of tokens of the
label without stopwords. Hence, incomplete linkages are penalized.</p>
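          <p>As a worked example of the penalty factor, the sketch below computes p for the two labels discussed above. The stopword list, the function names, and the representation of linked tokens as a set are assumptions; the paper does not specify them.</p>
<preformat>
```python
# Hypothetical stopword list; the list used by the matcher is not given in the text.
STOPWORDS = {"in", "of", "the", "a", "an"}


def content_tokens(label):
    """Tokens of a label without stopwords."""
    return [t for t in label.lower().split() if t not in STOPWORDS]


def penalty(label_1, linked_1, label_2, linked_2):
    """p = 0.5 * found/possible for label 1 + 0.5 * found/possible for label 2."""

    def coverage(label, linked_tokens):
        possible = content_tokens(label)
        if not possible:
            # Assumption: a label without content tokens receives the full penalty.
            return 0.0
        found = [t for t in possible if t in linked_tokens]
        return len(found) / len(possible)

    return 0.5 * coverage(label_1, linked_1) + 0.5 * coverage(label_2, linked_2)


# Only "macula" could be linked in both labels, so half of each label is covered.
p = penalty("tubule macula", {"macula"}, "macula lutea", {"macula"})
penalized_score = p * 1.0  # sim(macula, macula) = 1.0 before the penalty
```
</preformat>
          <p>Under these assumptions p = 0.5, so the spurious perfect score is halved, whereas two fully linked labels keep p = 1.</p>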
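          <p><sup>4</sup> See http://webisa.webdatacommons.org/concept/united_nations_
<sup>5</sup> See http://webisa.webdatacommons.org/concept/peacekeeping_mission_
<sup>6</sup> See http://webisa.webdatacommons.org/concept/_mali_</p>
          <p>If two labels were matched to multiple concepts, a resolution is required. In
this case the best average similarity is used:
<disp-formula><label>(2)</label><tex-math>\mathrm{sim}_{average} = \frac{\sum_{i=1}^{|c_1|} \max_{j=1}^{|c_2|} \mathrm{sim}(c_{1i}, c_{2j})}{|c_1|}</tex-math></disp-formula>
where <inline-formula><tex-math>c_1</tex-math></inline-formula> and <inline-formula><tex-math>c_2</tex-math></inline-formula> represent two individual concepts and <inline-formula><tex-math>c_{1i}</tex-math></inline-formula> and <inline-formula><tex-math>c_{2j}</tex-math></inline-formula>
represent the i-th and j-th sub-concept of <inline-formula><tex-math>c_1</tex-math></inline-formula> and <inline-formula><tex-math>c_2</tex-math></inline-formula>, respectively; <inline-formula><tex-math>|c_1|</tex-math></inline-formula> and <inline-formula><tex-math>|c_2|</tex-math></inline-formula> are the numbers
of sub-concepts of <inline-formula><tex-math>c_1</tex-math></inline-formula> and <inline-formula><tex-math>c_2</tex-math></inline-formula>; <inline-formula><tex-math>c_1</tex-math></inline-formula> is the concept with more tokens.</p>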
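          <p>The best-average resolution can be sketched as follows. The function average_best_similarity is a hypothetical name, the similarity values in TOY_SIM are invented, and the sketch assumes that "the concept with more tokens" can be approximated by the side with more sub-concepts.</p>
<preformat>
```python
def average_best_similarity(concepts_1, concepts_2, sim):
    """Average, over the sub-concepts of the longer side, of the best match on the other side."""
    if len(concepts_2) > len(concepts_1):
        # Assumption: c1 is taken to be the side with more sub-concepts.
        concepts_1, concepts_2 = concepts_2, concepts_1
    if not concepts_2:
        return 0.0  # nothing to compare against
    total = sum(max(sim(a, b) for b in concepts_2) for a in concepts_1)
    return total / len(concepts_1)


# Invented pairwise similarities, standing in for cosine similarities of embeddings.
TOY_SIM = {
    ("united nations", "un"): 0.9,
    ("united nations", "africa"): 0.2,
    ("mali", "un"): 0.1,
    ("mali", "africa"): 0.6,
}


def toy_sim(a, b):
    """Symmetric lookup in TOY_SIM; identical concepts score 1.0."""
    if a == b:
        return 1.0
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))


# Best matches are 0.9 and 0.6, so the average is about 0.75.
score = average_best_similarity(["united nations", "mali"], ["un", "africa"], toy_sim)
```
</preformat>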
          <p>Typically, an entity of an ontology has more than one label. Therefore,
a score matrix is used: Every label of one entity is linked and compared to every
label of the other entity, and the best score is returned.</p>
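          <p>A score matrix of this kind might look as follows. The sketch is illustrative only: best_label_score is a hypothetical name, and toy_label_sim uses token overlap as a simple stand-in for the embedding-based label similarity of the matcher.</p>
<preformat>
```python
def best_label_score(labels_1, labels_2, label_sim):
    """Compare every label of one entity with every label of the other; keep the best score."""
    matrix = [[label_sim(a, b) for b in labels_2] for a in labels_1]
    best = max((score for row in matrix for score in row), default=0.0)
    return best, matrix


def toy_label_sim(a, b):
    """Stand-in similarity (token overlap); the matcher uses embedding similarity instead."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    union = tokens_a.union(tokens_b)
    if not union:
        return 0.0
    return len(tokens_a.intersection(tokens_b)) / len(union)


# Rows correspond to labels of the first entity, columns to labels of the second.
best, matrix = best_label_score(
    ["macula lutea", "yellow spot"], ["macula", "macula lutea"], toy_label_sim
)
```
</preformat>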
          <p>
            RDF2Vec Configuration Parameters. We generated 100 sentences of depth
8 for each node in the WebIsALOD data set for the training process of the model.
In order to also have sentences for nodes that do not have out-going edges, those
nodes were identified, and sentences were generated backwards and afterwards reversed.
The sentences were generated in a biased fashion [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ], i.e., high-confidence edges
are followed with a higher probability. Eventually, the embeddings were trained
using the continuous bag-of-words (CBOW) approach with the parameters of the
original RDF2Vec paper: window size = 5, number of iterations = 5, negative
sampling = true, negative samples = 25, average input vector = true, and
200-dimensional embeddings.
          </p>
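          <p>The biased and backward walk generation can be sketched as below. This is one plausible reading rather than the authors' implementation: the function names, the toy confidences, and the second toy edge are invented; predicates are omitted from the walks for brevity; and confidences are assumed to be strictly positive so that they can serve directly as sampling weights.</p>
<preformat>
```python
import random

NUM_WALKS = 100  # sentences per node, as stated in the text
DEPTH = 8        # walk depth, as stated in the text


def biased_walk(edges, start, depth, rng):
    """Follow up to depth edges; edges with higher confidence are chosen more often."""
    walk = [start]
    node = start
    for _step in range(depth):
        candidates = edges.get(node, [])
        if not candidates:
            break
        weights = [confidence for _target, confidence in candidates]
        node, _confidence = rng.choices(candidates, weights=weights, k=1)[0]
        walk.append(node)
    return walk


def backward_walk(edges, end, depth, rng):
    """For a node without out-going edges: walk along inverted edges, then reverse."""
    inverted = {}
    for source, targets in edges.items():
        for target, confidence in targets:
            inverted.setdefault(target, []).append((source, confidence))
    return list(reversed(biased_walk(inverted, end, depth, rng)))


# Toy graph: node -> list of (target, confidence); the values are invented.
EDGES = {
    "european_union": [("international_organization", 0.9), ("country", 0.1)],
}

forward = biased_walk(EDGES, "european_union", DEPTH, random.Random(0))
backward = backward_walk(EDGES, "international_organization", DEPTH, random.Random(0))
```
</preformat>
          <p>In a full pipeline, NUM_WALKS such walks would be generated per node and the resulting sentences passed to a CBOW word2vec trainer with the parameters listed above.</p>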
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <sec id="sec-3-1">
        <title>Anatomy</title>
        <p>For the Anatomy data set, the matcher achieves a higher recall and F1 score
compared to the baseline solution. However, the true positives are mostly exact
lexical matches or share many common tokens.</p>
        <p>Concerning runtime performance, ALOD2Vec Matcher performs in the upper
half of all matchers that participated in the Anatomy track.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Conference</title>
        <p>On the Conference data set, the matcher is better at aligning
classes than at aligning properties. This is in line with the results reported for
other matchers. In this case, it is due to fewer lexical matches among properties as
well as the more frequent use of non-nouns, which cannot be properly linked to the
background knowledge source.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Large BioMed</title>
        <p>For the Large BioMed matching tasks, the matcher is capable of aligning the
small fragments within the given time frame of 6 hours. While ALOD2Vec
Matcher performs slightly above the 2017 and 2018 F1 averages on the small
FMA-NCI data set, it performs in the lower half for the remaining ones.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Complex Track</title>
        <p>Although the matcher presented here is not yet capable of generating complex
correspondences, it could produce results for the entity identification subtask for
two data sets: On GeoLink, ALOD2Vec Matcher achieved the highest F1 score
and recall of all matchers that participated; on Hydrograph, alignments for the
English ontologies could be generated and scored within the median.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>General Comments</title>
      <sec id="sec-4-1">
        <title>Comments on the results</title>
        <p>The matcher performs above the given baselines. However, the matches are still
rather trivial and mostly share common tokens.</p>
        <p>There are multiple reasons for the mediocre performance. First, the
underlying data set is very noisy: It contains a lot of wrong information (e.g., fish
skos:broader fisher)<sup>7</sup>, subjective information (e.g., donald trump skos:broader
lunatic)<sup>8</sup>, and is not strictly hierarchical (e.g., live skos:broader quality, and vice
versa)<sup>9</sup>. In addition, the tail-entity problem is still not solved because very
specific entities are involved in very few hypernymy statements and their resulting
vectors are likely not meaningful (e.g., complex congenital heart defect)<sup>10</sup>.</p>
        <p>Besides the pitfalls of the data set, the matcher cannot handle homonyms,
non-nouns, or non-English labels.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Discussions on the way to improve the proposed system</title>
        <p>There are three ways in which the current research focusing on this approach can
be improved in the future: Firstly, more propositionalization techniques for very
large data sets could be explored. Secondly, the matcher itself can be enhanced to
use more information available in ontologies such as their structure. And lastly,
the data sets to be used can be improved. WebIsALOD is only one Web-scale
RDF data set and still has some pitfalls such as the restriction to hypernymy
relations and noise. More such data sets can be created and used in the future.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we presented the ALOD2Vec Matcher, a matcher utilizing a
Web-crawled knowledge data set by applying the RDF2Vec methodology to a
hypernymy data set extracted from the Web. It could be shown that it is possible to
use very large RDF graphs as external background knowledge and the RDF2Vec
methodology for the task of ontology matching.</p>
      <sec id="sec-5-1">
        <p><sup>7</sup> See http://webisa.webdatacommons.org/concept/_fish_
<sup>8</sup> See http://webisa.webdatacommons.org/concept/donald_trump_
<sup>9</sup> See http://webisa.webdatacommons.org/concept/_quality_
<sup>10</sup> See http://webisa.webdatacommons.org/concept/complex+congenital+heart_defect_</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Cochez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ristoski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ponzetto</surname>
            ,
            <given-names>S.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
          </string-name>
          , H.:
          <article-title>Biased graph walks for RDF graph embeddings</article-title>
          .
          <source>In: Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics</source>
          . p.
          <fpage>21</fpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Hertling</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
          </string-name>
          , H.:
          <article-title>WebIsALOD: Providing Hypernymy Relations Extracted From the Web as Linked Open Data</article-title>
          .
          <source>In: International Semantic Web Conference</source>
          . pp.
          <fpage>111</fpage>
          –
          <lpage>119</lpage>
          . Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isele</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jakob</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jentzsch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kontokostas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mendes</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morsey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia</article-title>
          .
          <source>Semantic Web</source>
          <volume>6</volume>
          (
          <issue>2</issue>
          ),
          <fpage>167</fpage>
          –
          <lpage>195</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed Representations of Words and Phrases and Their Compositionality</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          . pp.
          <fpage>3111</fpage>
          –
          <lpage>3119</lpage>
          (
          <year>2013</year>
          ), http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ristoski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosati</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Di Noia</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Leone</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
          </string-name>
          , H.:
          <article-title>RDF2vec: RDF Graph Embeddings and Their Applications</article-title>
          .
          <source>Semantic Web Journal</source>
          (
          <year>2017</year>
          ), http://www.semantic-web-journal.net/system/files/swj1495.pdf
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Seitner</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eckert</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Faralli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meusel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ponzetto</surname>
            ,
            <given-names>S.P.</given-names>
          </string-name>
          :
          <article-title>A Large DataBase of Hypernymy Relations Extracted from the Web</article-title>
          . In: LREC (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>