<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MaasMatch results for OAEI 2012</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Frederik C. Schadd</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nico Roos</string-name>
          <email>roosg@maastrichtuniversity.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Maastricht University</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper summarizes the results of the participation of MaasMatch in the Ontology Alignment Evaluation Initiative (OAEI) of 2012. We provide a brief description of the techniques that have been applied, with the emphasis being on the utilized similarity measures and the performed improvements over the system that participated in the year 2011. Additionally, the results of the 2012 OAEI campaign will be discussed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>Presentation of the system</title>
      <sec id="sec-2-1">
        <title>State, purpose, general statement</title>
        <p>
          Sharing and reusing knowledge is an important aspect of modern information
systems. For several decades, researchers have been investigating methods that
facilitate knowledge sharing in the corporate domain, allowing, for instance, the integration
of external data into a company’s own knowledge system. Ontologies are at the center
of this research, allowing the explicit definition of a knowledge domain. With the steady
development of ontology languages, such as the current OWL language [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], knowledge
domains can be modelled with an increasing amount of detail.
        </p>
        <p>The initial research on the MaasMatch framework focused on resolving
terminological heterogeneities between ontology concepts, which is reflected in its initial selection
of similarity measures. Recent research has focused on further developing these techniques,
while widening the spectrum of similarity measures so that the system is
applicable to a wider range of matching tasks. The matching domain supported by
MaasMatch is limited to semi-large (up to 2000 concepts per ontology),
mono-lingual OWL ontologies, thus yielding predictable results for the Library and
Multifarm tracks.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Specific techniques used</title>
        <p>Various similarity measures covering differing categories have been applied in the
current system. This subsection provides a brief explanation of each measure and how
these are combined to extract the final alignment.</p>
        <p>
          Syntactic Similarity. MaasMatch currently utilizes a token-based measure for the
purpose of determining the syntactic similarity between concepts. More specifically,
concept names and labels are compared by computing the 3-grams [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] of their names and
determining their similarity using the Jaccard [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] measure.
        </p>
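        <p>As a minimal sketch of this measure, the following Python fragment computes the Jaccard coefficient over character 3-grams. The lower-casing, the handling of names shorter than three characters, and the function names are illustrative assumptions and are not taken from the MaasMatch implementation.</p>
        <preformat><![CDATA[
```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lower-cased string."""
    text = text.lower()
    if len(text) < n:
        # Assumption: a string shorter than n contributes itself as its only gram.
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # assumption: two empty names count as identical
    return len(a & b) / len(a | b)


def trigram_similarity(name1, name2):
    """Syntactic similarity of two concept names or labels."""
    return jaccard(char_ngrams(name1), char_ngrams(name2))
```
]]></preformat>
        <p>For example, "author" and "authors" share four of their five distinct 3-grams, giving a similarity of 0.8.</p>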
        <p>
          Structural Similarity. A Name-Path similarity is used as the structural similarity. Given
a concept c, such a similarity collects the names of c and of all ancestors of c, which are
subsequently used as the basis for comparison. Given the nature of these strings, a
hybrid similarity has been selected for this purpose. A hybrid similarity is defined as any
similarity that relies on another similarity measure for its computation. Cohen et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]
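researched a token-based framework for a hybrid distance. Given two strings s and t,
the set of tokens a<sub>1</sub>, a<sub>2</sub>, ..., a<sub>K</sub> into which string s can be divided and the set of
tokens b<sub>1</sub>, b<sub>2</sub>, ..., b<sub>L</sub> into which string t can be divided, a hybrid distance can be
computed as follows:
          <disp-formula id="eq1">
            <label>(1)</label>
            <tex-math><![CDATA[\mathrm{sim}(s,t) = \frac{1}{K} \sum_{i=1}^{K} \max_{j=1}^{L} \mathrm{sim}'(a_i, b_j)]]></tex-math>
          </disp-formula>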
        </p>
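        <p>The token-level combination of Equation 1, together with the substring-extended Levenshtein similarity sim′ that the text goes on to define in Equation 2, can be sketched in Python as follows. Several details are assumptions made for illustration only: the Levenshtein similarity is normalised by the length of the longer string, LCS(s, t) and min(s, t) are read as lengths, and the default scaling factor of 0.5 is an arbitrary placeholder rather than the value used by MaasMatch.</p>
        <preformat><![CDATA[
```python
def levenshtein_distance(s, t):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(s, t):
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(s, t) / longest


def longest_common_substring_length(s, t):
    """Length of the longest contiguous substring shared by s and t."""
    best = 0
    prev = [0] * (len(t) + 1)
    for cs in s:
        curr = [0]
        for j, ct in enumerate(t, start=1):
            curr.append(prev[j - 1] + 1 if cs == ct else 0)
            best = max(best, curr[j])
        prev = curr
    return best


def inner_similarity(s, t, scale):
    """Equation 2, reading LCS(s, t) and min(s, t) as lengths (assumption)."""
    shortest = min(len(s), len(t))
    if shortest == 0:
        return 1.0 if s == t else 0.0
    lev = levenshtein_similarity(s, t)
    lcs_ratio = longest_common_substring_length(s, t) / shortest
    return lev + lcs_ratio * scale * (1.0 - lev)


def hybrid_similarity(s_tokens, t_tokens, scale=0.5):
    """Equation 1: mean over the tokens of s of the best match among the tokens of t."""
    if not s_tokens or not t_tokens:
        return 0.0
    total = sum(max(inner_similarity(a, b, scale) for b in t_tokens) for a in s_tokens)
    return total / len(s_tokens)
```
]]></preformat>
        <p>Note that the measure is not symmetric, because the average runs over the tokens of s only; a caller who needs symmetry would have to combine both directions.</p>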
        <p>
          The hybrid similarity in MaasMatch utilizes the Levenshtein [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] similarity, to which
a substring-based extension is applied. This extension functions similarly to the Winkler
[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] extension, but is not limited with respect to the size or location of the substring. This
setup has been shown to outperform other variations of measures on the conference
dataset and a record matching dataset [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Given two strings s and t, the longest common
substring of s and t defined as LCS(s, t) and a scaling factor S, sim′ of our hybrid
distance is computed as follows:
          <disp-formula id="eq2">
            <label>(2)</label>
            <tex-math><![CDATA[\mathrm{sim}'(s,t) = \mathrm{Levenshtein}(s,t) + \frac{\mathrm{LCS}(s,t)}{\min(s,t)} \cdot S \cdot \bigl(1 - \mathrm{Levenshtein}(s,t)\bigr)]]></tex-math>
          </disp-formula>
        </p>
        <p>
          Virtual Document Similarity. A new similarity that is deployed in MaasMatch is the
comparison of virtual documents representing ontology concepts, which are created by
gathering the information contained within a concept and the information of its related
neighbours according to a specific model. This approach has been pioneered by Qu et al.
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. In essence, this approach uses a weighted combination of descriptions of concepts.
A description of a concept is a weighted document vector describing the terms that
occur in the concept description. The model of creating such a description allows for
certain types of terms, such as the concept name, label or comments, to be weighted
differently according to their perceived importance. Descriptions of related concepts
are added to the description of a particular concept by multiplying the term weights of
the related descriptions by a diminishing factor before merging the vectors. For a full
description of this process, we refer the reader to the work of Qu et al.
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
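        <p>The construction described above can be illustrated with a small Python sketch. The field weights, the diminishing factor of 0.5, the tokenisation, and the use of cosine similarity to compare two documents are all assumptions chosen for illustration; the source does not state the values or the exact comparison used by MaasMatch.</p>
        <preformat><![CDATA[
```python
import math
import re
from collections import Counter

# Hypothetical weights per term source; the actual values are not given in the text.
FIELD_WEIGHTS = {"name": 1.0, "label": 0.5, "comment": 0.25}


def describe(concept):
    """Weighted term vector built from a concept's own annotations."""
    vector = Counter()
    for field, weight in FIELD_WEIGHTS.items():
        for term in re.findall(r"[a-z0-9]+", concept.get(field, "").lower()):
            vector[term] += weight
    return vector


def virtual_document(concept, neighbours, damping=0.5):
    """Own description plus neighbour descriptions scaled by a diminishing factor."""
    document = describe(concept)
    for neighbour in neighbours:
        for term, weight in describe(neighbour).items():
            document[term] += damping * weight
    return document


def cosine(u, v):
    """Cosine similarity of two sparse term vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)
```
]]></preformat>
        <p>With these placeholder weights, a concept named "Paper" with the label "scientific paper" and a neighbour named "Author" yields the term weights paper = 1.5, scientific = 0.5 and author = 0.5.</p>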
        <p>Lexical Similarity. Compared to its counterpart in the 2011 competition, this similarity has been improved
with regard to its computation time. The similarity uses
WordNet as a basic lexical resource, but utilizes virtual document similarities between
ontology concepts and WordNet synsets in order to assign to each concept only those synsets
which accurately describe its meaning. Let O<sub>1</sub> and
O<sub>2</sub> be the two ontologies that are to be matched. O<sub>1</sub> contains the sets of entities E<sub>x</sub><sup>1</sup> = {e<sub>1</sub><sup>1</sup>, e<sub>2</sub><sup>1</sup>, ..., e<sub>m</sub><sup>1</sup>}, where
x distinguishes between the sets of classes, properties and instances, O<sub>2</sub> contains the sets
of entities E<sub>x</sub><sup>2</sup> = {e<sub>1</sub><sup>2</sup>, e<sub>2</sub><sup>2</sup>, ..., e<sub>n</sub><sup>2</sup>}, and C(e) denotes a collection of synsets representing
entity e. The essential steps of our approach, performed separately for each x, can be described as
follows:</p>
        <list list-type="order">
          <list-item>
            <p>For every entity e in E<sub>x</sub><sup>i</sup>, compute its corresponding set C(e) by performing the
following procedure:</p>
            <list list-type="alpha-lower">
              <list-item><p>Assemble the set C(e) with synsets that might denote the meaning of entity e.</p></list-item>
              <list-item><p>Create a virtual document of e, and a virtual document for every synset in C(e).</p></list-item>
              <list-item><p>Calculate the document similarities between the virtual document denoting e
and the different virtual documents originating from C(e).</p></list-item>
              <list-item><p>Discard all synsets from C(e) that resulted in a low similarity score with the
virtual document of e, using some selection procedure.</p></list-item>
            </list>
          </list-item>
          <list-item>
            <p>Compute the WordNet similarity for all combinations of e<sup>1</sup> ∈ E<sub>x</sub><sup>1</sup> and e<sup>2</sup> ∈ E<sub>x</sub><sup>2</sup>
using the processed collections C(e<sup>1</sup>) and C(e<sup>2</sup>).</p>
          </list-item>
        </list>
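        <p>Step 1 of this procedure can be sketched as follows. The sketch does not access WordNet: candidate synsets are passed in as a plain mapping from a made-up identifier to a descriptive text, virtual documents are reduced to simple bags of words, and the selection rule (keep every synset scoring at least a fixed fraction of the best score) is a hypothetical stand-in for the unspecified selection procedure of step 1(d). Step 2 is not covered.</p>
        <preformat><![CDATA[
```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Very simple virtual document: term frequencies of lower-cased words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity of two sparse term vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)


def select_synsets(entity_text, candidates, keep_ratio=0.5):
    """Steps (b) to (d): score each candidate synset and discard the weak ones.

    candidates maps a synset identifier to its descriptive text.
    keep_ratio is a hypothetical rule: keep synsets whose score is at least
    keep_ratio times the best score.
    """
    entity_doc = bag_of_words(entity_text)
    scores = {sid: cosine(entity_doc, bag_of_words(text))
              for sid, text in candidates.items()}
    best = max(scores.values(), default=0.0)
    if best == 0.0:
        return set()
    return {sid for sid, score in scores.items() if score >= keep_ratio * best}
```
]]></preformat>
        <p>A stricter or looser keep_ratio trades the risk of retaining a wrong word sense against the risk of discarding the correct one.</p>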
        <p>
          Aggregation and Extraction. In our system, the similarity matrices are aggregated by
computing, for each pairwise combination of concepts, the average of the similarity values
in the computed similarity cube. The Naive descending extraction algorithm [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]
is applied to the aggregated similarity matrix in order to determine the final alignment.
At this point, a confidence threshold can be applied in order to avoid producing
correspondences which do not satisfy a required degree of confidence.
While for practical applications it is recommended to apply a confidence threshold in
the extraction step, this has been omitted for the evaluation system in order to allow
the experimenters to conduct a more thorough analysis of the
produced correspondences, even if these have a low confidence value and would not be included
in the final alignment under normal circumstances.
        </p>
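        <p>A minimal sketch of the two steps, averaging the similarity matrices and then extracting a one-to-one alignment greedily in order of decreasing similarity, is given below. Representing each matrix as a dictionary keyed by concept pair, treating a missing pair as similarity 0, and the alphabetical tie-breaking are assumptions made for this illustration.</p>
        <preformat><![CDATA[
```python
def aggregate(matrices):
    """Average several similarity matrices (dicts keyed by concept pair)."""
    pairs = set().union(*matrices)
    return {pair: sum(m.get(pair, 0.0) for m in matrices) / len(matrices)
            for pair in pairs}


def naive_descending(similarities, threshold=0.0):
    """Greedy one-to-one extraction in order of decreasing similarity."""
    used_left, used_right, alignment = set(), set(), []
    ranked = sorted(similarities.items(), key=lambda item: (-item[1], item[0]))
    for (left, right), score in ranked:
        if score < threshold:
            break  # all remaining pairs score lower still
        if left not in used_left and right not in used_right:
            alignment.append((left, right, score))
            used_left.add(left)
            used_right.add(right)
    return alignment
```
]]></preformat>
        <p>Calling naive_descending with the default threshold of 0 corresponds to the evaluation setting described above, in which no confidence cutoff is applied.</p>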
      </sec>
      <sec id="sec-2-3">
        <title>Link to the system and parameters file</title>
        <p>MaasMatch and its corresponding parameter file are available on the SEALS platform
and can be downloaded at http://www.seals-project.eu/tool-services/browse-tools.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>This section presents the evaluation of the OAEI 2012 results achieved by MaasMatch.
Evaluations utilizing ontologies exceeding the supported complexity range, such as the
Library track, are excluded from the discussion for the sake of brevity. Note that
the evaluations of some of the tracks do not determine the optimal confidence
threshold of the produced alignments, so that correspondences with low confidence values
are incorporated into the evaluations as well, resulting in lower performance measures
compared to a normal execution environment.</p>
      <sec id="sec-3-1">
        <title>Benchmark</title>
        <p>The benchmark data set consists of several base ontologies which are matched with
automatically altered versions of themselves. This makes it possible to establish under
what condition a matcher performs well or poorly. Previous competitions used only
a single ontology as base, with the alterations being done by hand. The current data
set consists of several base ontologies such that a more varied spectrum of knowledge
domains is utilized. The results of MaasMatch on the benchmark data set can be seen
in Table 1.</p>
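        <table-wrap id="tab1">
          <label>Table 1</label>
          <table>
            <thead>
              <tr><th>Test Set</th><th>Precision</th><th>Recall</th><th>F-Measure</th></tr>
            </thead>
            <tbody>
              <tr><td>biblio</td><td>0.54</td><td>0.57</td><td>0.56</td></tr>
              <tr><td>2</td><td>0.6</td><td>0.6</td><td>0.6</td></tr>
              <tr><td>3</td><td>0.53</td><td>0.53</td><td>0.53</td></tr>
              <tr><td>4</td><td>0.54</td><td>0.54</td><td>0.54</td></tr>
              <tr><td>finance</td><td>0.59</td><td>0.6</td><td>0.59</td></tr>
            </tbody>
          </table>
        </table-wrap>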
        <p>
          Table 1 shows that the results stand in stark contrast to
those of the 2011 competition [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. The continued development of our system was
successful in increasing the recall of the produced alignments; however, this came at the cost of
reduced precision, yielding a similar f-measure compared to the previous year.
However, this evaluation does not take into account the confidence values provided with the
alignments, so correspondences with low confidence values are included in the
evaluation. In a realistic scenario, a pruning mechanism, for instance a simple cutoff
rate, would be applied so that correspondences with low confidence values are not
included. As reported by the experimenter, pruning the alignments results in f-measure
gains between 0.07 and 0.15, mostly due to a significant gain in precision, thus yielding
significantly improved results over the MaasMatch system of 2011.
        </p>
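        <p>The effect of such a cutoff can be illustrated with a short sketch that prunes an alignment by confidence and then computes precision, recall and f-measure against a reference alignment. The data layout (triples of left entity, right entity and confidence) and the function names are assumptions for this illustration, not the OAEI evaluation code.</p>
        <preformat><![CDATA[
```python
def prune(alignment, cutoff):
    """Keep only correspondences whose confidence reaches the cutoff."""
    return [c for c in alignment if c[2] >= cutoff]


def evaluate(alignment, reference):
    """Return (precision, recall, f_measure) against a reference set of pairs."""
    found = {(left, right) for left, right, _ in alignment}
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```
]]></preformat>
        <p>When the low-confidence correspondences are mostly incorrect, pruning raises precision while leaving recall largely untouched, which is the pattern reported for the benchmark track.</p>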
      </sec>
      <sec id="sec-3-2">
        <title>Anatomy</title>
        <p>The anatomy data set consists of two large real-world ontologies from the biomedical
domain, with one ontology describing the anatomy of a mouse and the other being the
NCI Thesaurus, which describes the human anatomy. The results of this data set can be
seen in Table 2.</p>
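        <table-wrap id="tab2">
          <label>Table 2</label>
          <table>
            <thead>
              <tr><th>Test Set</th><th>Precision</th><th>Recall</th><th>F-Measure</th></tr>
            </thead>
            <tbody>
              <tr><td>mouse-human</td><td>0.434</td><td>0.784</td><td>0.559</td></tr>
            </tbody>
          </table>
        </table-wrap>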
        <p>
          The results of the anatomy data set have also changed drastically compared to
the results of the previous year. The recall has been significantly improved, albeit at the
cost of a significant proportion of precision. Overall, the f-measure has been improved
by 0.11 over the results of the previous year [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
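        <p>As a consistency check, and assuming the f-measure is the usual harmonic mean of precision P and recall R, the values reported in Table 2 can be reproduced as follows:</p>
        <disp-formula>
          <tex-math><![CDATA[F = \frac{2PR}{P+R} = \frac{2 \cdot 0.434 \cdot 0.784}{0.434 + 0.784} = \frac{0.680512}{1.218} \approx 0.559]]></tex-math>
        </disp-formula>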
      </sec>
      <sec id="sec-3-3">
        <title>Conference</title>
        <p>The conference data set consists of numerous real-world ontologies describing the
domain of organizing scientific conferences. The results of this track can be seen in Table
3.</p>
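        <table-wrap id="tab3">
          <label>Table 3</label>
          <table>
            <thead>
              <tr><th>Test Set</th><th>Precision</th><th>Recall</th><th>F1-Measure</th></tr>
            </thead>
            <tbody>
              <tr><td>ra1</td><td>0.63</td><td>0.57</td><td>0.60</td></tr>
              <tr><td>ra2</td><td>0.60</td><td>0.50</td><td>0.56</td></tr>
            </tbody>
          </table>
        </table-wrap>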
        <p>For this data set, MaasMatch produced alignments of fairly balanced quality. The
comparison to the standard reference alignments resulted in an f-measure of 0.6, which
is a significant improvement compared to the same evaluation of the previous year.
In the evaluation using reference alignments that have been pruned with a consistency
reasoner, the recall is more affected than the precision of the alignments.</p>
        <p>The large biomedical ontologies data set consists of several large scale ontologies, containing up to tens of
thousands of concepts. While ontologies of such scale are not in the target domain of
MaasMatch due to the high computational complexity, some evaluations could still be
performed; the results are visible in Table 4.</p>
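        <table-wrap id="tab4">
          <label>Table 4</label>
          <table>
            <thead>
              <tr><th>Test Set</th></tr>
            </thead>
            <tbody>
              <tr><td>FMA-NCI Original UMLS</td></tr>
              <tr><td>FMA-NCI Clean UMLS (LogMap)</td></tr>
              <tr><td>FMA-NCI Clean UMLS (Alcomo)</td></tr>
            </tbody>
          </table>
        </table-wrap>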
        <p>Across the different evaluation methods, MaasMatch produced fairly consistent
alignments when matching the FMA and NCI ontologies, all resulting in f-measures of
approximately 0.68. Unfortunately, the remaining ontologies of this data set are outside of
the supported complexity range, so that an alignment could not be computed within
the given time frame. However, the results of the completed tasks indicate that our
system is already capable of producing alignments of high quality in this domain. Thus,
improving its efficiency, for instance by applying partitioning techniques, should result
in an overall satisfying performance during the next evaluation.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Multifarm</title>
        <p>The Multifarm data set is based on ontologies from the OntoFarm data set that have
been translated into a set of different languages in order to test the multi-lingual
capabilities of a specific system. Currently, the similarities employed by MaasMatch are not
suitable for multi-lingual matching problems, thus yielding predictably poor results.</p>
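        <table-wrap id="tab5">
          <label>Table 5</label>
          <table>
            <thead>
              <tr><th>Test Set</th><th>Precision</th><th>Recall</th><th>F-Measure</th></tr>
            </thead>
            <tbody>
              <tr><td>type I</td><td>0.02</td><td>0.14</td><td>0.03</td></tr>
              <tr><td>type II</td><td>0.14</td><td>0.14</td><td>0.14</td></tr>
            </tbody>
          </table>
        </table-wrap>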
        <p>In Table 5, the aggregated measures are separated into heterogeneous ontologies
translated into different languages (type I) and homogeneous ontologies translated into
different languages (type II). While the recall is the same for both matching types, the
precision is positively influenced for homogeneous matching tasks.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>General comments</title>
      <sec id="sec-4-1">
        <title>Comments on the results</title>
        <p>Overall, our system has seen improvements across various tracks, aided by the
incorporation of additional similarity measures as well as the further development of the
already existing measures. While the results of the previous year were high in precision
and low in recall, the results of this year’s participation demonstrate a better balance
between precision and recall, with both measures usually having a similar value.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Discussions on the way to improve the proposed system</title>
        <p>The first area of improvement would consist of expanding the supported domain of
matching problems, such that large scale or multi-lingual ontologies can be matched
as well. Matching large scale ontologies would require the development of partitioning
techniques in order to reduce the computational complexity of a matching task,
preferably without impacting the results.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Comments on the SEALS platform</title>
        <p>While the SEALS platform is a convenient tool for competition purposes, it would be
valuable to see its capabilities expanded so that evaluations can be performed
automatically for research purposes, for instance by automatically evaluating any
uploaded matching tool on the different available data sets.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Comments on the OAEI 2012 procedure</title>
        <p>This year’s competition has seen some confusion as to whether or not the participants should
omit post-processing measures, such as cutoff-based alignment pruning, given that some
tracks perform automatic thresholding in order to generate the best possible alignments.
However, the reported results of the benchmark data set did not include automatic
thresholding, thus giving the impression that the system performs worse than it
actually does. It would be preferable to have a clear statement on this matter and to have each
track evaluated according to the same policy.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Comments on the OAEI 2012 measures</title>
        <p>An important part of the scientific method is the ability to recreate experimental
results. Some tracks aggregate precision, recall and f-measure using the harmonic mean.
However, given that these three values lie in the interval [0, 1], it is
possible that values of 0 are incorporated in the evaluation, which in turn would yield
a division by 0, since the harmonic mean is computed from the reciprocals of these values. It is currently
unclear how this is circumvented and how exactly the measures are aggregated, making
it very difficult to replicate experiments outside the OAEI environment. Thus, it would
be preferable to include a detailed explanation of the computation, and especially the
aggregation, of the measures, even if this means including the same text in
each year’s proceedings.</p>
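        <p>The problem can be made concrete with a short Python sketch. The first function is the textbook harmonic mean and fails as soon as one value is 0; the second shows one possible convention for handling zeros. That convention is only an assumption for illustration, since how the OAEI evaluations treat this case is exactly what is unclear.</p>
        <preformat><![CDATA[
```python
def harmonic_mean(values):
    """Plain harmonic mean; raises ZeroDivisionError if any value is 0."""
    return len(values) / sum(1.0 / v for v in values)


def harmonic_mean_zero_safe(values):
    """One possible convention (an assumption): any zero forces the mean to 0."""
    if any(v == 0 for v in values):
        return 0.0
    return harmonic_mean(values)
```
]]></preformat>
        <p>Other conventions, such as skipping zero values, would lead to different aggregated scores, which is why the aggregation needs to be documented for results to be reproducible.</p>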
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>This paper describes the 2012 participation of MaasMatch in the OAEI campaign, in
which considerable improvements have been observed in the benchmark, anatomy and
conference tracks, which were also evaluated in the previous year. New tracks were
introduced with matching problems outside of the currently supported matching
domain; however, we intend to expand the capabilities of our system such that the new types
of problems can be tackled as well.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>W. W.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ravikumar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Fienberg</surname>
          </string-name>
          .
          <article-title>A comparison of string distance metrics for name-matching tasks</article-title>
          .
          <source>In Proc. IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03)</source>
          , pages
          <fpage>73</fpage>
          -
          <lpage>78</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M.</given-names>
            <surname>Hermans</surname>
          </string-name>
          and
          <string-name>
            <given-names>F. C.</given-names>
            <surname>Schadd</surname>
          </string-name>
          .
          <article-title>A generalization of the Winkler extension and its application for ontology mapping</article-title>
          .
          <source>In Proceedings of the 24th Benelux Conference on Artificial Intelligence (BNAIC 2012)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>P.</given-names>
            <surname>Jaccard</surname>
          </string-name>
          .
          <article-title>Étude comparative de la distribution florale dans une portion des Alpes et des Jura</article-title>
          .
          <source>Bulletin de la Société Vaudoise des Sciences Naturelles</source>
          ,
          <volume>37</volume>
          :
          <fpage>547</fpage>
          -
          <lpage>579</lpage>
          ,
          <year>1901</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>V. I.</given-names>
            <surname>Levenshtein</surname>
          </string-name>
          .
          <article-title>Binary codes capable of correcting deletions, insertions, and reversals</article-title>
          .
          <source>Technical Report 8</source>
          ,
          <year>1966</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>D. L.</given-names>
            <surname>McGuinness</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>van Harmelen</surname>
          </string-name>
          .
          <article-title>OWL web ontology language overview</article-title>
          .
          <source>W3C recommendation, W3C</source>
          ,
          <year>February 2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>C.</given-names>
            <surname>Meilicke</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Stuckenschmidt</surname>
          </string-name>
          .
          <article-title>Analyzing mapping extraction approaches</article-title>
          .
          <source>The Second International Workshop on Ontology Matching</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Cheng</surname>
          </string-name>
          .
          <article-title>Constructing virtual documents for ontology matching</article-title>
          .
          <source>In Proceedings of the 15th international conference on World Wide Web, WWW '06</source>
          , pages
          <fpage>23</fpage>
          -
          <lpage>31</lpage>
          , New York, NY, USA,
          <year>2006</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>F. C.</given-names>
            <surname>Schadd</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Roos</surname>
          </string-name>
          .
          <article-title>MaasMatch results for OAEI 2011</article-title>
          .
          <source>In Proc. 6th ISWC workshop on Ontology Matching (OM)</source>
          , pages
          <fpage>171</fpage>
          -
          <lpage>178</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>F. C.</given-names>
            <surname>Schadd</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Roos</surname>
          </string-name>
          .
          <article-title>Coupling of WordNet entries for ontology mapping using virtual documents</article-title>
          .
          <source>In Proceedings of the ISWC'12 International Workshop OM-2012</source>
          ,
          <year>2012</year>
          . Accepted Paper.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Shannon</surname>
          </string-name>
          .
          <article-title>A mathematical theory of communication</article-title>
          .
          <source>SIGMOBILE Mob. Comput. Commun. Rev.</source>
          ,
          <volume>5</volume>
          (
          <issue>1</issue>
          ):
          <fpage>3</fpage>
          -
          <lpage>55</lpage>
          ,
          <year>January 2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>W. E.</given-names>
            <surname>Winkler</surname>
          </string-name>
          .
          <article-title>String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage</article-title>
          .
          <source>Technical report</source>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>