<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>AnyGraphMatcher Submission to the OAEI Knowledge Graph Challenge 2019?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexander Lutke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Mannheim</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Matching objects between two di erent data bases typically relies on syntactic similarity measurements and the exhausting of ontology restrictions. As opposed to them, AnyGraphMatcher (AGM) introduces an additional source of information for semantically matching data { i.e. the creation of word embeddings. AGM's key idea wraps around a stable marriage for determining best matching data objects between two data bases. Results on the OAEI knowledge graph track however indicate the need for a more advanced blocking technique. Results show that word embeddinngs are to be seen a supportive feature for mapping rather than a key source of information.</p>
      </abstract>
      <kwd-group>
        <kwd>Ontology matching machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>1.1</p>
    </sec>
    <sec id="sec-2">
      <title>Presentation of the system</title>
      <sec id="sec-2-1">
        <title>State, purpose, general statement</title>
        <p>In recent years, data has developed into a di erentiator for business success. But
organizations gathered data typically comes from various sources, which are
mutually heterogeneous and inconsistent. Identity resolution is required to locate
and integrate common pieces of information between these data sources. More
precisely, data elements between those data bases have to be compared to each
other and a decision on whether they describe the same real world concept must
be made. Most prevalent techniques resolve around the comparison of
syntactic elements, like titles, labels or descriptions. However, those techniques fail to
preserve the actual semantic meaning of data objects. Consider for example the
word Berlin, either describing Germany's capital or a cargo ship. Just from the
title, the semantic meaning of the word \Berlin" cannot be determined. Recent
breakthrough in linguistics research on the latent representation of words o ers
a promising opportunity [3]. The main notion wraps around the distributional
hypothesis by Harris, stating that the meaning of a word is de ned by its
context. First, the hypothesis referred to linguistics only. But the adaptation in
Paulheim's RDF2Vec approach [5] showcased, that the same concept is
applicable to (semi-)structured databases. ALOD2VEC [4] and DOME [1] are two
? Copyright c 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 Internation (CC BY 4.0)
systems, which already used the idea of word embeddings for data integration
in the 2018 OAEI challenge. Compared to them, AnyGraphMatcher (AGM) is a
novel concept speci cally for identity resolution, which extends RDF2Vec with
the use of semi-supervised machine learning and stable marriage.
1.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>Speci c techniques used</title>
        <p>The task of identity resolution is perceived as a binomial statement whether two
given data elements describe the same real-world object. The nal goal is such
a binomial prediction for the entire Cartesian product of all elements of two
databases A and B. Essentially, AnyGraphMatcher employs a ve step process
to get there:
1. Blocking
2. Graph walk
3. Word embedding model
4. Semi-supervised machine learning
5. Stable marriage</p>
        <p>The Cartesian product can be extremely large depending on the size of the
input datasets, representing a burden to runtime performance. However, most of
the object pairs from the Cartesian product will not be matching, particularly
if the input datasets are considered free of duplicates. So blocking is required.
Instead of performing the following, computationally expensive predictions on
the whole Cartesian product, a number of candidate-pairs is chosen based on
an e cient similarity computation rst. All pairs except the candidate-pairs are
directly predicted non-matching. For the e cient candidate selection, a
levenshtein automata is utilized, which measures syntactic distances. More precisely,
edit distances up to two edits per word in a string are calculated. Note one
major limitation in this concept: An assumption is met that actually corresponding
data objects have similar labels. This might increase precision, but decrease
recall.</p>
        <p>Afterwards, a graph walk is employed, which iterates through the set of
vertices and edges of an ontology graph. While walking through the graph, visited
paths are written down, so that a corpus le is created in the end. The more
detailed, recursive procedure looks as follows: For each vertice, an outgoing edge is
selected. The edge is traced to its endpoint and the selection-procedure is started
from there on again. At each vertice visited, such a selection is triggered n times.
Furthermore, after excelling a distance of k steps from the vertice, where the path
started, the procedure is terminated. Due to runtime limitations, for the OAEI
submission, n equalling the number of outgoing edges of the current vertice and
k = 1 was chosen. Basically, this boils down to a simple NT- le representation
of an ontology. Further improvements could be derived from experiments with
larger k values.</p>
        <p>The graph walk has generated a text corpus, which can be passed to a word
embedding model to form a latent representation of each word occurring in the
corpus. As the utilized SkipGram neural model is quite prevalent today, further
details on the concept remains to the respective works in the literature. Since
the SkipGram model is however highly con gurable, most parameters have been
adopted from the general recommendation of the inventors (e.g. hidden layer
size = 100) Only the number of epochs has been modi ed. Due to runtime
limitations, those depend on the size of the generated corpus, but halts between
10 and 250. After running the SkipGram model on the corpus, each resource in
the input ontology can be assigned a 100-dimensional vector.</p>
        <p>The nal goal is a prediction whether two candidate pairs correspond to each
other. This is achieved by semi-supervised machine learning. Supervision
requires for a gold standard, which cannot be presupposed for said mapping
tasks. That is why AnyGraphMatcher employs an automatic gold standard set
generation. For this purpose, the candidate-pairs from the blocking are
considered once again. Special attention is paid to the similarity of data object's
labels. For the e cient calculation, the library \Apache Lucene" is used, which
measures similarity in a value range of up to 2:5 for identical strings and 0:0
for very distant strings.1 Very similar candidate-pairs with a similarity-score of
2:0 or above are assumed matches. All further candidate-pairs, in which one of
the matching data objects occurs, are assumed non-matches. In the end, this
gold standard captures most apparent matches, but di erentiates them by the
inclusion of matches (i.e. positives) and non-matches (i.e. negatives).</p>
        <p>With the gold standard in place, a binary machine learning classi er { here
an XGBoost { can work properly. Besides, the classi er takes more information
than just the syntactic similarity into account. I.e. latent information is derived
from the SkipGram model and passed to the classi er. Pairwise cosine similarity,
Euclidean distance and SkipGram context probability (i.e. the probability that
resource x appears in the context of resource y) are calculated. Other than
expected, the binary classi cation is not meant to be the nal prediction step
however. Rather, it is implemented as another, more enhanced way of blocking.
This is done by up-sampling the matching pairs during training until there are
1.5 times as many matches as non-matches. This procedure makes sure, that
only very likely non-matches are classi ed negatives and excluded from further
processing.</p>
        <p>
          The nal prediction is achieved by stable marriage under the
assumption of 1:1 cardinality mappings. Here, each data object is considered in
isolation rst. All remaining candidate-pairs, in which a given data object appears,
are extracted. For all found candidate-pairs, an overall similarity score is
calculated. That score includes cosine similarity, Euclidean distance, SkipGram
context probability and levenshtein distance. For each of these measures, the
relative similarity in comparison with other candidate-pairs is computed. This
is quanti ed by the position p in the following formula:
simrelative = 2 (p 1)
(
          <xref ref-type="bibr" rid="ref1">1</xref>
          )
1 The exact calculation is not meant to be explained here. For further details, refer to
https://lucene.apache.org/core/3 5 0/scoring.html.
        </p>
        <p>The following tables 1.2, 1.2 and 1.2 illustrate the idea once more. The focused
data object in these tables is called A. A might match to B, C and D. Table
1.2 shows various thought similarity values between the candidate objects and
A. Stable marriage goes on as follows to determine which one of the candidates
is the best matching one: Table 1.2 indicates the order, how well a
candidatepair matches compared to other candidate pairs. Note that each of the similarity
measure is still considered in isolation here. Table 1.2 then translates the ordering
into values computed by the equation above. The nal score in table 1.2 is
calculated by summing the translated values for each of the candidate pairs. The
one pair with the highest total score is assumed to match best. As a secondary
criterion for further discrimination, the score derived from Levenshtein distance
is taken. The way the nal score is calculated is to be seen as an optimisable
characteristic of AGM.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Summary of the system's limitations</title>
        <p>Despite AnyGraphMatcher's exploitation of a wide range of characteristics of
data objects, it su ers from two major limitations:</p>
        <p>
          First, strong con dence is set to syntactic similarity. Basically two
assumption lead to this conclusion: (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) data objects, which are syntactically similar, do
match (see gold standard generation) and (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ) actually matching data objects
have similar labels (see blocking).
        </p>
        <p>
          Second limitation is the recall-bias in the entire pipeline. Note that the only
three steps in place to predict candidate-pairs negative are (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) blocking, (
          <xref ref-type="bibr" rid="ref2">2</xref>
          )
semi-supervised machine learning and (
          <xref ref-type="bibr" rid="ref3">3</xref>
          ) stable marriage. Blocking (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) and
semisupervised machine learning (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ) themselves are recall-biased. So they prefer to
predict (syntactically) similar samples as positives rather than negatives. Stable
marriage can only predict negatives, if a data object has already been identi ed
a match with another data object. So all in all, there is no strict exclusion of
negatives from the set of candidate-pairs. This might raise precision, but reduce
recall.
        </p>
        <p>In sum, AGM can basically exploit semantics, if and only if the underlying
data sets consistently follow the syntactic similarity assumption from above.
1.4</p>
      </sec>
      <sec id="sec-2-4">
        <title>Adaptations made for the evaluation</title>
        <p>For the OAEI submission, the melt framework provided by the University of
Mannheim has been used [2]. Melt handles most of the regulations required
for submitting matcher systems to the OAEI challenge. Since melt is originally
written in Java, while AGM is mainly developed in Python, melt is used as a
wrapper service, that calls the AGM pipeline by starting a new Python-process.
Furthermore, the blocking process has been adapted to run more e ciently on
the larger of the data sets in the OAEI knowledge graph track. A very strict
blocking is applied, that initially excludes a lot candidate matches. Whether
this technique harms recall is to be clari ed in the results section.
1.5</p>
      </sec>
      <sec id="sec-2-5">
        <title>Link to the system and parameters le</title>
        <p>The implementation of AGM can be found on Github using the link
https://github.com/XLexxaX/AnyGraphMatcher/tree/SUBMISSION.
2</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>The following paragraphs shortly outline the results of AGM compared to the
baseline gures. It therefore refers to the full result table available online on
http://oaei.ontologymatching.org/2019/results/knowledgegraph/index.html. For
the reason of comprehensibility, table 4 lists a more compact overview of the
knowledge graph track.
With an overall F-score of 11%, AGM fails to output proper mappings on the
rst of the ve mapping tasks. Taking a closer look at the ve steps in the AGM
pipeline, the exclusion of many candidate-mappings during blocking stands out.
2.2</p>
      <sec id="sec-3-1">
        <title>Memory Alpha</title>
        <sec id="sec-3-1-1">
          <title>M emoryBeta</title>
          <p>The mapping of Memory Alpha to Memory Beta yielded slightly better results
with an F-score of 32%. But still, the purely syntax-based baseline oupterforms
AGM by approximately 50%. Notable is however, that this time, precision (47%)
is signi cantly better than recall (24%).
2.3</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Memory Alpha</title>
        <sec id="sec-3-2-1">
          <title>StarT rekExpandedU niverse</title>
          <p>The observations from Memory Alpha and Memory Beta continue throughout
the remaining three mapping tasks. All in all, an F-score of 30% was achieved
when mapping Memory Alpha to Star Trek Expanded Universe. The baseline of
91% F-score is missed.
2.4</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Star Wars Wiki</title>
        <sec id="sec-3-3-1">
          <title>StarW arsGalaxiesW iki</title>
          <p>For the Star wars wiki mapping, again 30% F-score is achieved. The baseline
F-score of 67% is out of reach. Note however the even larger gap of 52% between
AGM's precision and recall this time.
2.5</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>Star Wars Wiki</title>
        <sec id="sec-3-4-1">
          <title>T heOldRepublicW iki</title>
          <p>Repeatedly, a 52% gap between recall and precision is conspicious. The F-score
of 20% does not meet the pretension of the baseline by far.
3.1</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>General comments</title>
      <sec id="sec-4-1">
        <title>Comments on the results</title>
        <p>In sum, results of AGM on the knowledge graph track are relatively weak
compared to the baseline gures. Mainly recall is lacking behind the competition's
results. This can be traced back to the strict blocking technique. However,
precision also lacks behind the baseline gures. Recap, that a levenshtein automata
has been used, which can only measure edit distances of up to 2 edits per word. In
case two long texts are compared, this restriction leads to imprecise measuring.
So the levenshtein automata as implemented in AGM is rather an
approximation of syntactic similarity. Nevertheless, a static threshold has been used for
blocking (see section 1.2), such that in the end precision su ers as well.
3.2</p>
      </sec>
      <sec id="sec-4-2">
        <title>Discussions on the way to improve the proposed system</title>
        <p>In order to compensate for the weak results, another way has to be found block
based on a weaker threshold, while ensuring runtime e ciency of the AGM
pipeline. One idea is to loosen the current threshold and introduce a second
blocking step, that blocks based on exact edit distances for all candidates found
by the levensthein automata.
4</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>AGM follows a novel approach to data mappings by utilizing the idea of word
embeddings. It implements a ve step process including blocking, a graph walk,
embedding creation, semi-supervised machine learning and stable marriage. By
combining di erent similarity measures derived from syntax and word
embeddings, AGM aims to yield semantically correct mappings. However results show
a relatively poor performance compared to the purely syntax based baseline
gures. A strict and imprecise blocking technique has been identi ed a root cause.
Though the results cannot achieve the baseline gures, they provide a valuable
outcome for AGM's approach in general: The stable marriage depends a lot on
the upstream steps and su ers from error-propagation. This implies that
features derived from embeddings cannot be solely used for mapping. Embeddings
are to be seen an approximation of concept's semantic meaning, such that they
can additionally support in distinguishing them. In order to compensate for this
observation in the future, a more advanced blocking technique is required.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>1. DOME results for OAEI 2018</article-title>
          . In OM@ISWC, volume
          <volume>2288</volume>
          <source>of CEUR Workshop Proceedings</source>
          , pages
          <volume>144</volume>
          {
          <fpage>151</fpage>
          ,
          <string-name>
            <surname>Karlsruhe</surname>
          </string-name>
          ,
          <year>2018</year>
          . CEUR-WS.org.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Sven</given-names>
            <surname>Hertling</surname>
          </string-name>
          , Jan Portisch, and
          <string-name>
            <given-names>Heiko</given-names>
            <surname>Paulheim. MELT - Matching EvaLuation</surname>
          </string-name>
          <article-title>Toolkit</article-title>
          .
          <source>In Semantics 2019 SEM2019 Proceedings, Karlsruhe</source>
          ,
          <year>2019</year>
          , to appear.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Ilya Sutskever, Kai Chen, Greg S Corrado, and
          <string-name>
            <given-names>Je</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          . In C. J.
          <string-name>
            <surname>C. Burges</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Bottou</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Welling</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Ghahramani</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          K. Q. Weinberger, editors,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>26</volume>
          , pages
          <fpage>3111</fpage>
          {
          <fpage>3119</fpage>
          . Curran Associates, Inc.,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Jan</given-names>
            <surname>Portisch</surname>
          </string-name>
          and
          <string-name>
            <given-names>Heiko</given-names>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>Alod2vec matcher</article-title>
          .
          <source>In OM@ISWC</source>
          , volume
          <volume>2288</volume>
          <source>of CEUR Workshop Proceedings</source>
          , pages
          <volume>132</volume>
          {
          <fpage>137</fpage>
          . CEUR-WS.org,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Petar</given-names>
            <surname>Ristoski</surname>
          </string-name>
          and
          <string-name>
            <given-names>Heiko</given-names>
            <surname>Paulheim</surname>
          </string-name>
          . Rdf2vec:
          <article-title>Rdf graph embeddings for data mining</article-title>
          .
          <source>In The Semantic Web - ISWC</source>
          <year>2016</year>
          : 15th International Semantic Web Conference, Kobe, Japan,
          <source>October 17-21</source>
          ,
          <year>2016</year>
          , Proceedings,
          <string-name>
            <surname>Part</surname>
            <given-names>I</given-names>
          </string-name>
          , volume
          <volume>9981</volume>
          , pages
          <fpage>498</fpage>
          {
          <fpage>514</fpage>
          ,
          <string-name>
            <surname>Cham</surname>
          </string-name>
          ,
          <year>2016</year>
          . Springer International Publishing.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>