<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>COMA++: Results for the Ontology Alignment Contest OAEI 2006</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sabine Massmann</string-name>
          <email>massmann@informatik.uni-leipzig.de</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Engmann</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erhard Rahm</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>This paper summarizes the OAEI Contest 2006 results for the matching tool COMA++. The study shows that a generic schema matching system can also effectively solve complex ontology matching tasks.</p>
        <p>Presentation of the system. COMA++ is an extension of our previous COMA prototype [1]. It is a customizable, generic tool for matching both schemas and ontologies specified in languages such as SQL, XML Schema or OWL [2]. COMA++ offers a GUI and supports the combined use of several match algorithms as well as the reuse of previously confirmed match results [6]. The COMA++ architecture is shown in Figure 1. The Repository persistently stores all match-related data, the Model and Mapping Pools manage all schemas, ontologies, and mappings in memory, and the Matching Engine performs the match operations. The GUI provides access to these components and is used to visualize models and to manage the match process and mappings. The Matching Engine contains several libraries that support many match algorithms and match strategies. The similarity results of the individual matchers are maintained and aggregated in one similarity matrix per match task [1]. Match strategies implement workflows for complex match tasks; they enable the reuse of previous results and the decomposition of larger match tasks into smaller ones [3].</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
    </sec>
    <sec id="sec-2">
      <title>State, purpose, general statement</title>
      <p>COMA and COMA++ have proven to be very effective for matching database and
XML schemas [1, 4, 6]. Our main goal in participating in the contest was to assess how
effective a generic matching tool is when dealing with ontologies.</p>
      <p>[Figure 1: COMA++ architecture. Labels recoverable from the figure: External Schemas, Ontologies; Exported Mappings; Graphical User Interface; Matching Engine (Fragment-based) with Resolution Library, Matcher Library and Combination Library; Model Pool; Mapping Pool; Model Manipulation.]</p>
      <p>
An automatic match process in COMA++ consists of several steps. In the first step
the imported schemas and ontologies are transformed into a generic graph
representation. The graph nodes represent schema/ontology components such as
classes or properties and have attributes like name and data type. All relationships,
e.g. aggregations and specializations, are uniformly represented by edges between
nodes. In the next step graph nodes are matched with each other using a match
strategy and matchers. There is no differentiation made between node types, so that
for example classes and properties can be matched. The similarity values obtained by
the individual matchers are aggregated according to a combination strategy (average,
etc.). The match candidates are selected from the aggregated correspondences, e.g.
based on a threshold criterion. Finally, the result mapping (RDF alignment) is
generated.</p>
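      <p>The aggregation and selection steps described above can be illustrated with a minimal sketch. It assumes that each matcher returns a dictionary of similarity values per node pair; the function names, the example values and the treatment of missing pairs as 0.0 are hypothetical and not taken from COMA++.</p>
      <preformat><![CDATA[
```python
from statistics import mean


def aggregate_average(matcher_results):
    """Average the similarity values of several matchers per node pair.

    matcher_results: list of dicts mapping (source, target) to a value in [0, 1].
    A pair missing from a matcher's result counts as 0.0 (assumption).
    """
    pairs = set()
    for result in matcher_results:
        pairs.update(result)
    return {
        pair: mean(result.get(pair, 0.0) for result in matcher_results)
        for pair in pairs
    }


def select_by_threshold(similarities, threshold):
    """Keep the pairs whose aggregated similarity reaches the threshold."""
    return {pair: sim for pair, sim in similarities.items() if sim >= threshold}


# Hypothetical similarity values from two matchers
name_matcher = {("Book", "Volume"): 0.2, ("Book", "Book"): 1.0}
comment_matcher = {("Book", "Volume"): 0.6, ("Book", "Book"): 0.8}

aggregated = aggregate_average([name_matcher, comment_matcher])
mapping = select_by_threshold(aggregated, threshold=0.5)
print(sorted(mapping))
```
]]></preformat>
      <p>In this example only the pair (Book, Book) survives the selection, because the averaged similarity of (Book, Volume) stays below the assumed threshold of 0.5.</p>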
      <p>In addition to the schema-based matchers, we used an instance-level matcher that
has recently been added to the COMA++ match library.</p>
    </sec>
    <sec id="sec-3">
      <title>Adaptations made for the evaluation</title>
      <p>Apart from the integration of an instance matcher, only a few changes to COMA++
were necessary to deal with the specifics of the contest. As mentioned, the output
mapping was translated into the predefined RDF alignment format. Furthermore, the
result of a matcher was ignored if it contained the same similarity value for all
entities, since such a result carries no information for distinguishing candidates.
This minor adaptation was made because the same strategy had to be used
for all tests.</p>
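      <p>The rule of ignoring matcher results with a single, constant similarity value could look roughly as follows. This is a sketch under the assumption that matcher results are dictionaries of pair similarities; the function names and values are hypothetical.</p>
      <preformat><![CDATA[
```python
def is_uninformative(result):
    """True if a matcher assigned the same similarity to every pair."""
    return len(set(result.values())) <= 1


def average_informative(matcher_results):
    """Average only the matcher results that discriminate between pairs."""
    useful = [r for r in matcher_results if not is_uninformative(r)]
    if not useful:
        return {}
    pairs = set().union(*useful)
    return {p: sum(r.get(p, 0.0) for r in useful) / len(useful) for p in pairs}


# Hypothetical example: the comment matcher returns 0.0 everywhere,
# e.g. because the ontologies contain no comments.
name = {("a", "x"): 0.9, ("a", "y"): 0.1}
comment = {("a", "x"): 0.0, ("a", "y"): 0.0}
combined = average_informative([name, comment])
print(combined)
```
]]></preformat>
      <p>Without the filter, the constant comment result would halve the average of (a, x) to 0.45, which could push a correct correspondence below the selection threshold.</p>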
      <p>Another change was the splitting of very large ontologies into several smaller ones. The
results of the smaller match tasks were then merged, and a further selection step was
applied to the merged results to obtain the final result mapping.</p>
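      <p>The split, merge and select procedure can be sketched as follows. The sketch assumes fixed-size parts and a "best target per source" selection rule; both are illustrative assumptions, since the text does not specify how COMA++ partitions the ontologies or which selection it applies. All names are hypothetical.</p>
      <preformat><![CDATA[
```python
from itertools import product


def chunks(items, size):
    """Split a list into consecutive parts of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def best_per_source(candidates):
    """Keep, for each source element, only its highest-scoring target."""
    best = {}
    for (s, t), sim in candidates.items():
        if s not in best or sim > best[s][1]:
            best[s] = (t, sim)
    return {(s, t): sim for s, (t, sim) in best.items()}


def match_task(source, target, similarity, threshold):
    """One small match task: threshold filter plus local selection."""
    candidates = {}
    for s, t in product(source, target):
        sim = similarity(s, t)
        if sim >= threshold:
            candidates[(s, t)] = sim
    return best_per_source(candidates)


def match_in_parts(source, target, similarity, size, threshold):
    """Match part by part, merge the partial results, then select again."""
    merged = {}
    for src_part, tgt_part in product(chunks(source, size), chunks(target, size)):
        merged.update(match_task(src_part, tgt_part, similarity, threshold))
    return merged, best_per_source(merged)


# Hypothetical similarity values
table = {("heart", "Heart"): 1.0, ("heart", "HeartValve"): 0.6, ("lung", "Lung"): 1.0}


def similarity(s, t):
    return table.get((s, t), 0.0)


merged, final = match_in_parts(
    ["heart", "lung"], ["Heart", "Lung", "HeartValve"], similarity, size=2, threshold=0.5
)
print(len(merged), sorted(final))
```
]]></preformat>
      <p>Because each partial task selects its candidates locally, the merged result here contains the additional pair (heart, HeartValve); the second selection over the merged result removes it again.</p>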
      <p>To comply with the contest rules, the prototype did not use the synonyms and abbreviations
that can be supplied to the system. Creating them specifically for the contest was not allowed, although it
would have been necessary because of the different domains.</p>
    </sec>
    <sec id="sec-4">
      <title>Link to the system, parameters file and to the set of provided alignments</title>
      <p>The .zip archives of all contest results are available at the following URL.
The system and a parameters file can also be downloaded there.
http://dbs.uni-leipzig.de/Research/coma_oaei.html</p>
      <sec id="sec-4-1">
        <title>Results</title>
        <p>The results discussed here were computed with five matchers: NameType,
Comment, Parents, Children and Instance. The match results were combined by
computing the average similarity value, and the correspondences were selected using, e.g., a
threshold. The best setting was determined by running different configurations
on the benchmark and choosing the one with the highest F-measure. The exact
parameters can be found in the appendix.</p>
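        <p>Choosing a configuration by its F-measure can be sketched as follows, using the standard definitions of precision, recall and F-measure (their harmonic mean). The configuration names and alignments are hypothetical, and the sketch evaluates a single alignment rather than the full benchmark.</p>
        <preformat><![CDATA[
```python
def evaluate(found, reference):
    """Return (precision, recall, f_measure) of a found alignment."""
    found, reference = set(found), set(reference)
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def best_configuration(results, reference):
    """Pick the configuration whose alignment has the highest F-measure."""
    return max(results, key=lambda name: evaluate(results[name], reference)[2])


# Hypothetical reference alignment and results of two configurations
reference = {("a", "x"), ("b", "y"), ("c", "z")}
results = {
    "threshold_0.1": {("a", "x"), ("b", "y"), ("c", "w"), ("d", "z")},
    "threshold_0.5": {("a", "x"), ("b", "y")},
}
best = best_configuration(results, reference)
print(best, evaluate(results[best], reference))
```
]]></preformat>
        <p>For an empty reference alignment this function returns 0.0 for all three values, however many correspondences were found.</p>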
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Benchmark</title>
      <p>The benchmark is a systematic series of 50 tests that can be used to
identify the strengths and weaknesses of an algorithm.</p>
      <p>The overall score of COMA++ for this task (excluding test 102) is quite good:</p>
      <table-wrap>
        <table>
          <thead>
            <tr><th></th><th>Precision</th><th>Recall</th><th>F-Measure</th><th>Time</th></tr>
          </thead>
          <tbody>
            <tr><td>Average</td><td>0.96</td><td>0.82</td><td>0.88</td><td>7.0 sec</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-5-1">
        <title>Tests 101-104</title>
        <p>The results for tests 101, 103 and 104 are perfect because the classes and properties
have the same names, comments and instances. The language restriction and
generalization have no influence.</p>
        <p>The alignment for the irrelevant ontology 102 contains a few false matches between entities with
similar names, e.g. “year – yearValue”. No matches are expected for this test,
so precision and recall are automatically 0.0; we therefore left this test out of the average
calculation.</p>
        <table-wrap>
          <table>
            <thead>
              <tr><th>Precision</th><th>Recall</th><th>F-Measure</th><th>Time</th></tr>
            </thead>
            <tbody>
              <tr><td>1.00</td><td>1.00</td><td>1.00</td><td>15.4 sec</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-5-2">
        <title>Tests 201-247</title>
        <p>The results of these tests differ depending on the information available, because the
chosen strategy uses names, data types, comments, structure and instances. If one or
more of these kinds of information are missing, only the remaining ones can be used.
For tests 202, 209 and 210 the names and the comments differ, so this
information cannot be used and the results have a lower recall.</p>
        <p>For all other tests of this group the names, the comments or both contain useful
information, so the results are quite good.</p>
        <p>Tests 221-247 even have the same names and comments, whereas the structure is
different. The instances are similar, but some ontologies do not contain any. The given
information is enough to reach very good results.</p>
        <table-wrap>
          <table>
            <thead>
              <tr><th></th><th>Precision</th><th>Recall</th><th>F-Measure</th><th>Time</th></tr>
            </thead>
            <tbody>
              <tr><td>Average</td><td>0.98</td><td>0.95</td><td>0.97</td><td>8.1 sec</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-5-3">
        <title>Tests 248-266</title>
        <p>In these tests the names have been replaced with random strings and there are no
comments. The algorithm can thus only use the hierarchy and the instances, if given.
Instances do not exist for every class and property, so this information only partly helps to find
corresponding entities. Given these restrictions, the results for these tests are satisfactory.</p>
        <table-wrap>
          <table>
            <thead>
              <tr><th></th><th>Precision</th><th>Recall</th><th>F-Measure</th><th>Time</th></tr>
            </thead>
            <tbody>
              <tr><td>Average</td><td>0.89</td><td>0.51</td><td>0.65</td><td>4.2 sec</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Tests 301-304 (Real Ontologies)</title>
      <p>The real-world ontologies were a more difficult task for COMA++ because they
differ considerably from the 101 ontology. Three of the four
ontologies do not contain instances; only 304 does. Ontologies 302 and 303 do not use comments,
their structure is quite different, and the names are often dissimilar. The prototype
could not find such correspondences because the contest rules did not allow us to use auxiliary information.</p>
      <table-wrap>
        <table>
          <thead>
            <tr><th></th><th>Precision</th><th>Recall</th><th>F-Measure</th><th>Time</th></tr>
          </thead>
          <tbody>
            <tr><td>Average</td><td>0.84</td><td>0.69</td><td>0.76</td><td>3.6 sec</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-7">
      <title>Anatomy</title>
      <p>For the anatomy task two large ontologies had to be aligned. Because of their size,
our system had to split the matching task into smaller ones. The partial results
were merged, and a selection step was then applied to them. This selection was necessary
because the split matching produced more false matches.</p>
      <p>Another difficulty was that in the FMA ontology the class ids look
like “frame_92794” and “frame_51746” and the real information is in the label,
whereas the OpenGALEN ontology has meaningful ids and rarely uses labels. These
labels or ids are made up of many tokens and sometimes differ in only a few
letters, e.g. “fifth” instead of “first”. We therefore expect that more false positives
will be found than in the benchmark test.</p>
    </sec>
    <sec id="sec-8">
      <title>Directory</title>
      <p>For this test we matched 4640 pairs of ontologies.</p>
      <p>To learn more about the quality of our strategy on this kind of test, we also
matched the 2265 ontology pairs of the 2005 contest. We reached a recall of around 0.32,
which is as good as that of the best participants. Looking at the missing correspondences, we
could not find any similarity between the names, e.g.
“7/source.owl#Academic_Departments” and “7/target.owl#United_Kingdom”, and no
comments or instances existed. We therefore could not identify a way to improve
our system.</p>
    </sec>
    <sec id="sec-9">
      <title>Food</title>
      <p>The food ontologies use a different format, SKOS. We transformed the given
SKOS files into OWL format to be able to match them. These ontologies are quite
large, so the match process had to be split, as in the anatomy test.</p>
    </sec>
    <sec id="sec-10">
      <title>Conference</title>
      <p>This task contains 10 ontologies that deal with conference organisation. Computing the
alignments between each pair of them posed no problem because of their
smaller size.</p>
      <sec id="sec-10-1">
        <title>General comments</title>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>Comments on the results</title>
      <p>Given that COMA and COMA++ were not specifically designed for matching
ontologies and that we invested only a small amount of time in the contest, the overall
results are surprisingly good. The new instance matcher proved to be effective,
especially for the tests where useful information was provided only by instance
values.</p>
      <p>The parameters were selected for the whole set of tests. For individual match
tasks, better results than those reported can be obtained with tailored configuration
parameters. In addition, domain-specific abbreviations, synonyms and
previous match results could not be used, in order to comply with the contest rules.</p>
    </sec>
    <sec id="sec-12">
      <title>Discussions on the way to improve the proposed system</title>
      <p>The use of auxiliary information that conforms to the rules, e.g. WordNet or
UMLS, could improve recall. The addition of ontology-oriented matchers
and a distinction between node and relationship types could also be helpful.</p>
    </sec>
    <sec id="sec-13">
      <title>Comments on the OAEI 2006 procedure</title>
      <p>This was our first participation in the Ontology Alignment Contest. Since we were not
involved in the contest preparation, we had no prior knowledge of most tasks and the
regulations. We thus had comparatively little time (about 2 months) to deal with the
details of six test series and with technical problems caused by unfamiliar formats and large
files. Furthermore, we had to adapt the system to the contest rules and to find the
best strategy and configuration.</p>
      <sec id="sec-13-1">
        <title>Conclusion</title>
        <p>The presented contest results show that COMA++ is not only effective for schema
matching but also for ontology matching. This underlines the viability of generic
approaches for complex metadata management problems.</p>
      </sec>
      <sec id="sec-13-2">
        <title>Appendix: Raw results</title>
        <p>The benchmark results were computed with the following parameters:</p>
        <list list-type="bullet">
          <list-item><p>Strategy: NoContext</p></list-item>
          <list-item><p>Matcher: NameType, Comment, Instance, Parents, Children</p></list-item>
          <list-item><p>Combination: Average</p></list-item>
          <list-item><p>Selection: N=0, Delta=0.0001, Threshold=0.13, Direction=Both</p></list-item>
        </list>
        <p>The tests were run on a PC running Windows XP with an Intel Pentium 4 2.4 GHz
processor and 512 MB memory.</p>
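        <p>One possible reading of the selection parameters is sketched here. It assumes that N=0 means no limit on the number of candidates, that Delta is an absolute tolerance below the best similarity of an element, and that Direction=Both accepts a pair only if it is selected from both sides. These interpretations and all names are assumptions made for illustration and are not confirmed by the text.</p>
        <preformat><![CDATA[
```python
def select_directional(similarities, threshold, delta, by_source=True):
    """For each element on one side, keep the candidates that reach the
    threshold and lie within `delta` of that element's best similarity."""
    index = 0 if by_source else 1
    best = {}
    for pair, sim in similarities.items():
        key = pair[index]
        best[key] = max(best.get(key, 0.0), sim)
    return {
        pair
        for pair, sim in similarities.items()
        if sim >= threshold and sim >= best[pair[index]] - delta
    }


def select_both(similarities, threshold=0.13, delta=0.0001):
    """Accept a pair only if it is selected in both directions (assumption)."""
    forward = select_directional(similarities, threshold, delta, by_source=True)
    backward = select_directional(similarities, threshold, delta, by_source=False)
    return forward & backward


# Hypothetical aggregated similarity values
sims = {("a", "x"): 0.80, ("a", "y"): 0.40, ("b", "y"): 0.35, ("c", "z"): 0.10}
selected = select_both(sims)
print(selected)
```
]]></preformat>
        <p>In this example only (a, x) is selected: (a, y) is the best candidate for y but not for a, (b, y) is the best candidate for b but not for y, and (c, z) stays below the threshold.</p>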
        <p>Matrix of results</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name><surname>Do</surname>, <given-names>H.H.</given-names></string-name>,
          <string-name><surname>Rahm</surname>, <given-names>E.</given-names></string-name>:
          <article-title>COMA - A System for Flexible Combination of Schema Matching Approaches</article-title>.
          <source>Proc. Intl. Conf. Very Large Databases (VLDB)</source>,
          <year>2002</year>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name><surname>Aumüller</surname>, <given-names>D.</given-names></string-name>,
          <string-name><surname>Do</surname>, <given-names>H.H.</given-names></string-name>,
          <string-name><surname>Massmann</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Rahm</surname>, <given-names>E.</given-names></string-name>:
          <article-title>Schema and Ontology Matching with COMA++ (Software Demonstration)</article-title>.
          <source>Proc. 24. ACM SIGMOD Intl. Conf. Management of Data</source>,
          <year>2005</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name><surname>Rahm</surname>, <given-names>E.</given-names></string-name>,
          <string-name><surname>Do</surname>, <given-names>H.H.</given-names></string-name>,
          <string-name><surname>Massmann</surname>, <given-names>S.</given-names></string-name>:
          <article-title>Matching Large XML Schemas</article-title>.
          <source>ACM SIGMOD Record</source>
          <volume>33</volume>(<issue>4</issue>),
          <year>2004</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name><surname>Do</surname>, <given-names>H.H.</given-names></string-name>,
          <string-name><surname>Melnik</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Rahm</surname>, <given-names>E.</given-names></string-name>:
          <article-title>Comparison of Schema Matching Evaluations</article-title>.
          <source>Proc. 2. Intl. Workshop Web and Databases, LNCS 2593</source>,
          Springer Verlag,
          <year>2003</year>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name><surname>Rahm</surname>, <given-names>E.</given-names></string-name>,
          <string-name><surname>Bernstein</surname>, <given-names>P.A.</given-names></string-name>:
          <article-title>A Survey of Approaches to Automatic Schema Matching</article-title>.
          <source>VLDB Journal</source>,
          <volume>10</volume>(<issue>4</issue>),
          <year>2001</year>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name><surname>Do</surname>, <given-names>Hong-Hai</given-names></string-name>:
          <source>Schema Matching and Mapping-based Data Integration, Dissertation</source>,
          Department of Computer Science, Universität Leipzig, Germany,
          <year>2006</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>