=Paper=
{{Paper
|id=Vol-156/paper-10
|storemode=property
|title=Introduction to the Ontology Alignment Evaluation 2005
|pdfUrl=https://ceur-ws.org/Vol-156/paper10.pdf
|volume=Vol-156
|dblpUrl=https://dblp.org/rec/conf/kcap/EuzenatSY05
}}
==Introduction to the Ontology Alignment Evaluation 2005==
Jérôme Euzenat, INRIA Rhône-Alpes, 655 avenue de l'Europe, 38330 Montbonnot, France, Jerome.Euzenat@inrialpes.fr
Heiner Stuckenschmidt, Vrije Universiteit Amsterdam, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands, heiner@cs.vu.nl
Mikalai Yatskevich, Dept. of Information and Communication Technology, University of Trento, Via Sommarive, 14, I-38050 Povo, Trento, Italia, yatskevi@unitn.it
The increasing number of methods available for schema matching/ontology integration suggests the need to establish a consensus for the evaluation of these methods. The Ontology Alignment Evaluation Initiative (http://oaei.inrialpes.fr) is now a coordinated international initiative that has been set up for organising the evaluation of ontology matching algorithms. After the two events organized in 2004 (namely, the Information Interpretation and Integration Conference (I3CON) and the EON Ontology Alignment Contest [4]), this year one unique evaluation campaign is organised. Its outcome is presented at the Workshop on Integrating Ontologies held in conjunction with K-CAP 2005 at Banff (Canada) on October 2, 2005.

Since last year, we have set up a web site, improved the software on which the tests can be evaluated and set up precise guidelines for running these tests. We have taken into account last year's remarks by (1) adding more coverage to the benchmark suite and (2) elaborating two real world test cases (as well as addressing other technical comments). This paper serves as a presentation of the 2005 evaluation campaign and an introduction to the results provided in the following papers.

1. GOALS
Last year's events demonstrated that it is possible to evaluate ontology alignment tools.

One intermediate goal of this year is to take into account the comments from last year's contests. In particular, we aimed at improving the tests by widening their scope and variety. Benchmark tests are more complete (and harder) than before. Newly introduced tracks are more 'real-world' and of a considerable size.

The main goal of the Ontology Alignment Evaluation is to be able to compare systems and algorithms on the same basis and to allow drawing conclusions about the best strategies. Our ambition is that from such challenges the tool developers can learn and improve their systems.

2. GENERAL METHODOLOGY
We present below the general methodology for the 2005 campaign. In it we took into account many of the comments made during the previous campaign.

2.1 Alignment problems
This year's campaign consists of three parts: it features two real world blind tests (anatomy and directory) in addition to the systematic benchmark test suite. By blind tests it is meant that the result expected from a test is not known in advance by the participants. The evaluation organisers provide the participants with the pairs of ontologies to align as well as (in the case of the systematic benchmark suite only) the expected results. The ontologies are described in OWL-DL and serialized in the RDF/XML format. The expected alignments are provided in a standard format expressed in RDF/XML [2].
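To make the expected output more concrete, here is a minimal sketch, under our own assumptions, of how a single correspondence ("cell") of such an alignment can be read. The element names and URIs below follow the Alignment API conventions of [2] as we understand them and are purely illustrative; the files actually exchanged in the campaign may differ in namespaces and details.

```python
# Illustrative only: one correspondence in the spirit of the RDF/XML alignment
# format of [2].  The entity URIs are invented for the example.
import xml.etree.ElementTree as ET

CELL = """<Cell xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <entity1 rdf:resource="http://example.org/onto1#Book"/>
  <entity2 rdf:resource="http://example.org/onto2#Monograph"/>
  <relation>=</relation>
  <measure>1.0</measure>
</Cell>"""

cell = ET.fromstring(CELL)
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
entity1 = cell.find("entity1").attrib[RDF + "resource"]
entity2 = cell.find("entity2").attrib[RDF + "resource"]
relation = cell.find("relation").text          # "=" here; other relations are possible
confidence = float(cell.find("measure").text)  # 1.0 or a continuous value
print(entity1, relation, entity2, confidence)
```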
Like for last year's EON contest, a systematic benchmark series has been produced. The goal of this benchmark series is to identify the areas in which each alignment algorithm is strong and weak. The test is based on one particular ontology dedicated to the very narrow domain of bibliography and a number of alternative ontologies of the same domain for which alignments are provided.

The directory real world case consists of aligning web site directories (like the Open Directory or Yahoo's). It comprises more than two thousand elementary tests.

The anatomy real world case covers the domain of body anatomy and consists of two ontologies with an approximate size of several tens of thousands of classes and several dozen relations.

The evaluation has been processed in three successive steps.
2.2 Preparatory phase
The ontologies and alignments of the evaluation have been provided in advance, during the period between June 1st and July 1st. This was the occasion for potential participants to send observations, bug reports, remarks and other test cases to the organizers. The goal of this preliminary period is to ensure that the delivered tests make sense to the participants. The feedback is important, so participants should not hesitate to provide it. The final test base was released on July 4th. After this date the tests were only changed in order to ensure a better and easier participation.

2.3 Execution phase
During the execution phase the participants have used their algorithms to automatically match the ontologies of each test. The participants were required to use only one algorithm and the same set of parameters for all tests. Of course, it is legitimate to select the set of parameters that provides the best results. Beside these parameters, the input of the algorithms must be the two provided ontologies to align and any general purpose resource available to everyone (that is, no resource especially designed for the test). In particular, the participants should not use the data (ontologies and results) from other test sets to help their algorithm.

The participants have provided their alignments for each test in the Alignment format, together with a paper describing their results. (Andreas Hess from UC Dublin was not able to provide a paper in due time; a description of his system can be found in [3].)

In an attempt to independently validate the results, participants were required to provide a link to their program and the parameter set used for obtaining the results.

2.4 Evaluation phase
The organizers have evaluated the results of the algorithms used by the participants and provided comparisons on the basis of the provided alignments.

In the case of the real world ontologies, only the organizers will do the evaluation, with regard to the withheld alignments.

The standard evaluation measures are precision and recall computed against the reference alignments. For aggregating the measures we have computed a true global precision and recall (not a mere average). We have also computed precision/recall graphs for some of the participants (see below).

Finally, in an experimental way, we will attempt this year to reproduce the results provided by participants (validation).
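As an aside, the difference between this "true global" aggregation and a mere average of per-test values can be sketched as follows. This is our own illustration, assuming alignments are reduced to sets of entity pairs; it is not the official evaluation code, and the figures in the result tables are computed by the organisers' own tools.

```python
# Minimal sketch: pooled ("global") precision/recall versus a simple average.
def precision_recall(found, reference):
    correct = len(found & reference)
    p = correct / len(found) if found else 0.0
    r = correct / len(reference) if reference else 0.0
    return p, r

def global_precision_recall(tests):
    """tests: list of (found, reference) pairs of sets, one per test.
    Counts are pooled over all tests, so larger tests weigh more."""
    correct = sum(len(f & r) for f, r in tests)
    found = sum(len(f) for f, _ in tests)
    expected = sum(len(r) for _, r in tests)
    return correct / max(found, 1), correct / max(expected, 1)

def averaged_precision_recall(tests):
    """Mere average of per-test precision and recall, for comparison."""
    pairs = [precision_recall(f, r) for f, r in tests]
    n = len(pairs)
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n
```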
3. COMMENTS ON THE EXECUTION
We had more participants than at last year's event and it was easier to run these tests (qualitatively we had fewer comments and the results were easier to analyse). We summarize the list of participants in Table 1. As can be seen, not all participants provided results for all the tests and not all systems were correctly validated. However, when the tests are straightforward to process (benchmarks and directory), participants provided results. The main problem with the anatomy test was its size. We also mention the kind of results sent by each participant (relations and confidence).

We note that the time devoted to performing these tests (three months) and the period allocated for it (summer) is relatively short and does not really allow the participants to analyse their results and improve their algorithms. On the one hand, this prevents having algorithms really tuned for the contests; on the other hand, this can be frustrating for the participants. We should try to allow more time for participating next time.

Complete results are provided at http://oaei.inrialpes.fr/2005/results/. These are the only official results (the results presented here are only partial and prone to correction). The summary of results track by track is provided below.

4. BENCHMARK
The benchmark test case improved on last year's base by providing new variations of the reference ontology (last year the test contained 19 individual tests while this year it contains 53 tests). These new tests are supposed to be more difficult. The other improvement was the introduction of other evaluation metrics (real global precision and recall as well as the generation of precision-recall graphs).

4.1 Test set
The systematic benchmark test set is built around one reference ontology and many variations of it. The participants have to match this reference ontology with the variations. These variations focus on characterising the behaviour of the tools rather than having them compete on real-life problems. The ontologies are described in OWL-DL and serialized in the RDF/XML format.

Since the goal of these tests is to offer some kind of permanent benchmark to be used by many, the test is an extension of last year's EON Ontology Alignment Contest. Test numbering (almost) fully preserves the numbering of the first EON contest.

The reference ontology is based on the one of the first EON Ontology Alignment Contest. It has been improved by including a number of circular relations that were missing from the first test. The domain of this first test is bibliographic references. It is, of course, based on a subjective view of what a bibliographic ontology should be. There can be many different classifications of publications (based on area, quality, etc.). We chose the one common among scholars, based on the means of publication; like many of the ontologies below (tests #301-304), it is reminiscent of BibTeX.
Name System Benchmarks Directory Anatomy Validated Relations Confidence
U. Karlsruhe FOAM √ √ √ = cont
U. Montréal/INRIA OLA √ √ = cont
IRST Trento CtxMatch 2 √ √ √ =, ≤ 1.
U. Southampton CMS √ √ √ √ = 1.
Southeast U. Nanjin Falcon √ √ = 1.
UC. Dublin ? √ √ = cont
CNR/Pisa OMAP = 1.
Table 1: Participants and the state of their submissions. Confidence is given as 1/0 or continuous values.
The reference ontology is that of test #101. It contains 33 named classes, 24 object properties, 40 data properties, 56 named individuals and 20 anonymous individuals.

The reference ontology is put in the context of the semantic web by using other external resources for expressing non bibliographic information. It takes advantage of FOAF (http://xmlns.com/foaf/0.1/) and iCalendar (http://www.w3.org/2002/12/cal/) for expressing the People, Organization and Event concepts. Here are the external references used:

– http://www.w3.org/2002/12/cal/#:Vevent (defined in http://www.w3.org/2002/12/cal/ical.n3 and supposedly in http://www.w3.org/2002/12/cal/ical.rdf)
– http://xmlns.com/foaf/0.1/#:Person (defined in http://xmlns.com/foaf/0.1/index.rdf)
– http://xmlns.com/foaf/0.1/#:Organization (defined in http://xmlns.com/foaf/0.1/index.rdf)

This reference ontology is a bit limited in the sense that it does not contain attachment to several classes.

Similarly, the kind of proposed alignments is still limited: they only match named classes and properties, and they mostly use the "=" relation with a confidence of 1.

There are still three groups of tests in this benchmark:

– simple tests (1xx) such as comparing the reference ontology with itself, with another irrelevant ontology (the wine ontology used in the OWL primer) or with the same ontology in its restriction to OWL-Lite;
– systematic tests (2xx) that were obtained by discarding some features of the reference ontology. The considered features were names, comments, hierarchy, instances, relations, restrictions, etc. The tests are systematically generated so as to start from the reference ontology and discard pieces of information in order to evaluate how the algorithms behave when this information is lacking. These tests were largely improved from last year by combining all feature discardings.
– four real-life ontologies of bibliographic references (3xx) that were found on the web and left mostly untouched (only xmlns and xml:base attributes were added).

Table 5 summarizes what has been retracted from the reference ontology in the systematic tests. There are 6 categories of alteration (a small illustration follows the list):

Name: names of entities can be replaced by (R/N) random strings, (S)ynonyms, (N)ames with different conventions, or (F) strings in another language than English.
Comments: comments can be (N) suppressed or (F) translated into another language.
Specialization: the hierarchy can be (N) suppressed, (E)xpanded or (F)lattened.
Instances: can be (N) suppressed.
Properties: can be (N) suppressed or (R) have the restrictions on classes discarded.
Classes: can be (E)xpanded, i.e., replaced by several classes, or (F)lattened.
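To make the name alterations concrete, here is a minimal sketch, under our own assumptions, of how a 2xx-style variant could be produced: entity names are replaced by random strings while the structure is kept, and the reference alignment simply records which scrambled name corresponds to which original. This is only an illustration, not the generator actually used for the campaign.

```python
# Illustrative sketch of the "R" (random string) name alteration.
import random
import string

def random_name(rng, length=8):
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def alter_names(entity_names, seed=42):
    """Return {original: scrambled} plus the reference alignment for the variant."""
    rng = random.Random(seed)
    mapping = {name: random_name(rng) for name in entity_names}
    # Each original entity is equivalent to its scrambled counterpart.
    reference = [(old, new, "=", 1.0) for old, new in mapping.items()]
    return mapping, reference

mapping, reference = alter_names(["Book", "Article", "InProceedings"])
```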
4.2 Results
Table 2 provides the consolidated results, by groups of tests. Table 6 contains the full results.

We display the results of the participants as well as those given by a very simple edit distance algorithm on labels (edna). The computed values here are real precision and recall and not a simple average of precision and recall. This is more accurate than what has been computed last year.
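For reference, here is a rough sketch, under our own assumptions, of a baseline in the spirit of edna: entities are matched when their labels are close in normalised edit distance. The threshold and normalisation are illustrative choices and the baseline actually used for the tables may differ.

```python
# Illustrative edit-distance-on-labels baseline (not the official edna code).
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def label_baseline(labels1, labels2, threshold=0.5):
    """Match each label of the first ontology to its closest label in the second."""
    alignment = []
    for l1 in labels1:
        best = min(labels2, key=lambda l2: edit_distance(l1.lower(), l2.lower()))
        dist = edit_distance(l1.lower(), best.lower())
        similarity = 1 - dist / max(len(l1), len(best))
        if similarity >= threshold:
            alignment.append((l1, best, "=", similarity))
    return alignment
```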
As can be seen, the 1xx tests are relatively easy for most of the participants. The 2xx tests are more difficult in general, while the 3xx tests are not significantly more difficult than the 2xx ones for most participants. The really interesting result is that there are significant differences across algorithms within the 2xx test series. Most of the best algorithms combine different ways of finding the correspondences. Each of them is able to perform quite well on some tests with some methods. So the key issue seems to have been the combination of different methods (as described in the papers).

One algorithm, Falcon, seems largely dominant, while a group of other algorithms (Dublin, OLA, FOAM) compete against each other, and CMS and CtxMatch currently perform at a lower rate.
algo edna falcon foam ctxMatch2-1 dublin20 cms omap ola
test Prec. Rec. Prec. Rec. Prec. Rec. Prec. Rec. Prec. Rec. Prec. Rec. Prec. Rec. Prec. Rec.
1xx 0.96 1.00 1.00 1.00 0.98 0.65 0.10 0.34 1.00 0.99 0.74 0.20 0.96 1.00 1.00 1.00
2xx 0.41 0.56 0.90 0.89 0.89 0.69 0.08 0.23 0.94 0.71 0.81 0.18 0.31 0.68 0.80 0.73
3xx 0.47 0.82 0.93 0.83 0.92 0.69 0.08 0.22 0.67 0.60 0.93 0.18 0.93 0.65 0.50 0.48
H-means 0.45 0.61 0.91 0.89 0.90 0.69 0.08 0.24 0.92 0.72 0.81 0.18 0.35 0.70 0.80 0.74
Table 2: Means of results obtained by participants (corresponding to harmonic means)
Concerning these algorithms, CMS seems to privilege precision and performs correctly in this regard (OLA seems to have privileged recall with regard to last year). CtxMatch has the difficulty of delivering many subsumption assertions. These assertions are taken positively by our evaluation procedure (even if equivalence assertions were required), but since there are many more assertions than in the reference alignments, this brings the result down.

These results can be compared with last year's results given in Table 3 (with the aggregated measures computed anew with the methods of this year). For the sake of comparison, the results of this year on the same test set as last year are given in Table 4. As can be expected, the two participants of both challenges (Karlsruhe2 corresponding to foam and Montréal/INRIA corresponding to ola) have largely improved their results. The results of the best participants this year are above or similar to those of last year. This is remarkable, because participants did not tune their algorithms to the challenge of last year but to that of this year (more difficult since it contains more tests of a more difficult nature and because of the addition of cycles in them).

So, it seems that the field is globally progressing.

Because of the precision/recall trade-off, as noted last year, it is difficult to compare the middle group of systems. In order to assess this, we attempted to draw precision/recall graphs. We provide in Figure 1 the averaged precision and recall graphs of this year. They involve the results of all participants. However, the results corresponding to participants who provided confidence measures different from 1 or 0 (see Table 1) can be considered as approximations. Moreover, for reasons of time, these graphs have been computed by averaging the graphs of each test (instead of pure precision and recall).

Figure 1: Precision-recall graphs

These graphs are not totally faithful to the algorithms because participants have cut their results (in order to get high overall precision and recall). However, they provide a rough idea about the way participants compete against each other in the precision/recall space. It would be very useful if next year we asked for results with continuous ranking for drawing these kinds of graphs.
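For concreteness, here is a rough sketch, under our own assumptions, of how such averaged graphs can be produced: for each test, precision is interpolated at fixed recall levels from the correspondences ranked by confidence, and the per-test curves are then averaged point-wise. This mirrors the "average of the graphs of each test" mentioned above; the actual plotting code used for Figure 1 may differ.

```python
# Illustrative sketch of averaged precision/recall graphs over several tests.
def interpolated_curve(ranked, reference, levels=11):
    """ranked: correspondences sorted by decreasing confidence;
    reference: set of expected correspondences (assumed non-empty)."""
    points, correct = [], 0
    for i, cell in enumerate(ranked, 1):
        correct += cell in reference
        points.append((correct / max(len(reference), 1), correct / i))  # (recall, precision)
    curve = []
    for k in range(levels):
        level = k / (levels - 1)
        precisions = [p for r, p in points if r >= level]
        curve.append(max(precisions) if precisions else 0.0)
    return curve  # interpolated precision at recall 0.0, 0.1, ..., 1.0

def averaged_curve(tests):
    """tests: list of (ranked, reference) pairs; point-wise mean over tests."""
    curves = [interpolated_curve(rk, ref) for rk, ref in tests]
    return [sum(col) / len(col) for col in zip(*curves)]
```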
4.3 Comments
As a general comment, we remark that it is still difficult for participants to provide results that correspond to the challenge (incorrect format, alignment with external entities). Because time is short and we try to avoid modifying the provided results, this test is still a test of both the algorithms and their ability to deliver the required format. However, some teams are really proficient in this (and the same teams generally have their tools validated relatively easily).

The evaluation of algorithms like ctxMatch, which provide many subsumption assertions, is relatively inadequate, even if the test can remain a test of inference equivalence. It would be useful to be able to count adequately, i.e., not negatively for precision, true assertions like owl:Thing subsuming another concept. We must develop new evaluation methods taking into account these assertions and the semantics of the OWL language.

As a side note: all participants but one have used the UTF-8 version of the tests, so next time this one will have to be the standard one, with iso-latin as an exception.
algo karlsruhe2 umontreal fujitsu stanford
test Prec. Rec. Prec. Rec. Prec. Rec. Prec. Rec.
1xx NaN 0.00 0.57 0.93 0.99 1.00 0.99 1.00
2xx 0.60 0.46 0.54 0.87 0.93 0.84 0.98 0.72
3xx 0.90 0.59 0.36 0.57 0.60 0.72 0.93 0.74
H-means 0.65 0.40 0.52 0.83 0.88 0.85 0.98 0.77
Table 3: EON 2004 results with this year’s aggregation method.
algo edna falcon foam ctxMatch2-1 dublin20 cms omap ola
test Prec. Rec. Prec. Rec. Prec. Rec. Prec. Rec. Prec. Rec. Prec. Rec. Prec. Rec. Prec. Rec.
1xx 0.96 1.00 1.00 1.00 0.98 0.65 0.10 0.34 1.00 0.99 0.74 0.20 0.96 1.00 1.00 1.00
2xx 0.66 0.72 0.98 0.97 0.87 0.73 0.09 0.25 0.98 0.92 0.91 0.20 0.89 0.79 0.89 0.86
3xx 0.47 0.82 0.93 0.83 0.92 0.69 0.08 0.22 0.67 0.60 0.93 0.18 0.93 0.65 0.50 0.48
H-means 0.66 0.78 0.97 0.96 0.74 0.59 0.09 0.26 0.94 0.88 0.65 0.18 0.90 0.81 0.85 0.83
Table 4: This year’s results on EON 2004 test bench.
5. DIRECTORY

5.1 Data set
The data set exploited in the web directories matching task was constructed from the Google, Yahoo and Looksmart web directories as described in [1]. The key idea of the data set construction methodology was to significantly reduce the search space for human annotators. Instead of considering the full mapping task, which is very big (the Google and Yahoo directories have up to 3∗10^5 nodes each: this means that the human annotators would need to consider up to (3∗10^5)^2 = 9∗10^10 mappings), it uses semi-automatic pruning techniques in order to significantly reduce the search space. For example, for the dataset described in [1] the human annotators considered only 2265 mappings instead of the full mapping problem.

The major limitation of the current dataset version is the fact that it contains only true positive mappings (i.e., mappings which state that a particular relation holds between nodes in the two trees). At the same time it does not contain true negative mappings (or zero mappings), which state that no relation holds between a pair of nodes. Notice that manually constructed mapping sets (such as the ones presented for the systematic tests) assume all the mappings except the true positives to be true negatives. This assumption does not hold in our case, since the dataset generation technique guarantees correctness but not completeness of the produced mappings. This limitation allows the dataset to be used only for the evaluation of Recall but not Precision (since Recall is defined as the ratio of correct mappings found by the system to the total number of correct mappings). At the same time, measuring Precision necessarily requires the presence of true negatives in the dataset, since Precision is defined as the ratio of correct mappings found by the system to all the mappings found by the system. This means that all the systems would have 100% Precision on the dataset, since there are no incorrect mappings to be found.

The absence of true negatives has significant implications on the testing methodology in general. In fact, most of the state of the art matching systems can be tuned either to produce results with better Recall or to produce results with better Precision. For example, a system which produces the equivalence relation on any input will always have 100% Recall. Therefore, the main methodological goal in the evaluation was to prevent Recall-tuned systems from getting unrealistically good results on the dataset. In order to accomplish this goal, a double validation of the results was performed. The participants were asked for the binaries of their systems and were required to use the same sets of parameters in both the web directory and systematic matching tasks. Then the results were double checked by the organizers to ensure that the latter requirement was fulfilled by the authors. This process allows Recall-tuned systems to be recognized by analysis of the systematic test results.

The dataset was originally presented in its own format. The mappings were presented as pairwise relationships between the nodes of the web directories identified by their paths to the root. Since the systems participating in the evaluation all take OWL ontologies as input, a conversion of the dataset to OWL was performed. In the conversion process the nodes of the web directories were modelled as classes and the classification relation connecting the nodes was modelled as an rdfs:subClassOf relation. Therefore the matching task was presented as 2265 tasks of finding the semantic relation holding between paths to the root in the web directories modelled as subclass hierarchies.
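As a hypothetical illustration of this conversion, each path to the root of a web directory can be turned into a chain of classes related by rdfs:subClassOf. The class names and the Turtle-like output below are our own invention; the actual OAEI files are OWL ontologies serialised in RDF/XML.

```python
# Illustrative conversion of a directory path into a subclass hierarchy.
def path_to_class_hierarchy(path):
    """'Top/Arts/Music' -> [('Top_Arts', 'Top'), ('Top_Arts_Music', 'Top_Arts')]"""
    nodes = path.split("/")
    classes = ["_".join(nodes[: i + 1]) for i in range(len(nodes))]
    return [(sub, sup) for sup, sub in zip(classes, classes[1:])]

for sub, sup in path_to_class_hierarchy("Top/Arts/Music"):
    print(f":{sub} rdfs:subClassOf :{sup} .")
```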
5.2 Results
The results for the web directory matching task are presented in Figure 2. As can be seen from the figure, the web directories matching task is a very hard one. In fact, the best systems found about 30% of the mappings from the dataset (i.e., have a Recall of about 30%).

Figure 2: Recall for web directories matching task
The evaluation results can be considered from two perspectives. On the one hand, they are a good indicator of the complexity of real world ontology matching. On the other hand, the results can provide information about the quality of the dataset used in the evaluation. The desired mapping dataset quality properties were defined in [1] as Complexity, Discrimination capability, Incrementality and Correctness. The first means that the dataset is "hard" for state of the art matching systems, the second that it discriminates among the various matching solutions, the third that it is effective in recognizing weaknesses in state of the art matching systems, and the fourth that it can be considered as correct.

The results of the evaluation give us some evidence for the Complexity and Discrimination capability properties. As can be seen from Figure 2, the TaxME dataset is hard for state of the art matching techniques, since no system has a Recall of more than 35% on the dataset. At the same time, all the matching systems together found about 60% of the mappings. This means that there is a big space for improvement for state of the art matching solutions.

Consider Figure 3. It shows the partitioning of the mappings found by the matching systems. As can be seen from the figure, 44% of the mappings found by any of the matching systems were found by only one system. This is a good argument for the Discrimination capability property of the dataset.

Figure 3: Partitioning of the mappings found by the matching systems

5.3 Comments
The web directories matching task is an important step towards evaluation on real world matching problems. At the same time, there are a number of limitations which make the task only an intermediate step. First of all, the current version of the mapping dataset provides a correct but not complete set of reference mappings. New mapping dataset construction techniques can overcome this limitation. In the evaluation, the mapping task was split into tiny subtasks. This strategy allowed results to be obtained from all the matching systems participating in the evaluation. At the same time, it hides the computational complexity of "real world" matching (the web directories have up to 10^5 nodes) and may affect the results of tools relying on a "look for similar siblings" heuristic.

The results obtained on the web directories matching task coincide well with previously reported results on the same dataset. According to [1], generic matching systems (or systems intended to match any graph-like structure) have a Recall of 30% to 60% on the dataset. At the same time, the real world matching tasks are very hard for state of the art matching systems and there is a huge space for improvement in ontology matching techniques.

6. ANATOMY

6.1 Test set
The focus of this task is to confront existing alignment technology with real world ontologies. Our aim is to get a better impression of where we stand with respect to really hard challenges that normally require an enormous manual effort and in-depth knowledge of the domain.

The task is placed in the medical domain, as this is the domain where we find large, carefully designed ontologies. The specific characteristics of the ontologies are:

– Very large models: be prepared to handle OWL models of more than 50MB!
– Extensive Class Hierarchies: tens of thousands of classes organized according to different views on the domain.
– Complex Relationships: classes are connected by a number of different relations.
– Stable Terminology: the basic terminology is rather stable and should not differ too much in the different models.
– Clear Modelling Principles: the modelling principles are well defined and documented in publications about the ontologies.

This implies that the task will be challenging from a technological point of view, but there is guidance for tuning the matching approach that needs to be taken into account.
The ontologies to be aligned are different representations of human anatomy developed independently by teams of medical experts. Both ontologies are available in OWL format and mostly contain classes and relations between them. The use of axioms is limited.

6.1.1 The Foundational Model of Anatomy
The Foundational Model of Anatomy is a medical ontology developed by the University of Washington. We extracted an OWL version of the ontology from a Protege database. The model contains the following information:

– Class hierarchy;
– Relations between classes;
– Free text documentation and definitions of classes;
– Synonyms and names in different languages.

6.1.2 The OpenGalen Anatomy Model
The second ontology is the Anatomy model developed in the OpenGalen project by the University of Manchester. We created an OWL version of the ontology using the export functionality of Protege. The model contains the following information:

– Concept hierarchy;
– Relations between concepts.

The task is to find an alignment between classes in the two ontologies. In order to find the alignment, any information in the two models can be used. In addition, it is allowed to use background knowledge that has not specifically been created for the alignment task (i.e., no hand-made mappings between parts of the ontologies). Admissible background knowledge includes other medical terminologies, such as UMLS, as well as medical dictionaries and document sets. Further, results must not be tuned manually, for instance by removing obviously wrong mappings.

6.2 Results
At the time of printing we are not able to provide results of the evaluation on this test.

Validation of the results on the medical ontologies matching task is still an open problem. The results can be replicated in a straightforward way. At the same time, there is no sufficiently big set of reference mappings, which makes the calculation of matching quality measures impossible.

The approach we are currently developing for creating such a set is to exploit semi-automatic reference mapping acquisition techniques. The underlying principle is that the task of creating such a reference alignment is fundamentally different from the actual mapping problem. In particular, we believe that automatically creating reference alignments is easier than solving the general mapping problem. The reason for this is that methods for creating general mappings have to take into account both the correctness and the completeness of the generated mappings. This is difficult, because applying very strict heuristics will lead to correct, but very incomplete mappings, while using loose heuristics for matching nodes will create a rather complete, but often incorrect set of mappings. In our approach for generating reference alignments, we completely focus on correctness. The result is a small set of reference mappings that we can assume to be correct. We can evaluate matching approaches against this set of mappings. The idea is that the matching approaches should at least be able to determine these mappings. From the result, we can extrapolate the expected completeness of a matching algorithm.

We assume that the task is to create a reference alignment for a number of known conceptual models. In contrast to existing work [1] we do not assume that instance data is available or that the models are represented in the same way or using the same language. Normally, the models will be from the same domain (e.g. medicine or business). The methodology consists of four basic steps. In the first step, basic decisions are made about the representation of the conceptual models and the instance data to be used. In the second step, instance data is created by selecting it from an existing set or by classifying data according to the models under consideration. In the third step, the generated instance data is used to generate candidate mappings based on shared instances. In the fourth step, finally, the candidate mappings are evaluated against a set of quality criteria and the final set of reference mappings is determined.

6.2.1 Step 1. Preparation
The first step of the process is concerned with data preparation. In particular, we have to transform the conceptual models into a graph representation and select and prepare the appropriate instance data to be used to analyze the overlap between concepts in the different models. We structure this step based on the KDD process for Knowledge Discovery and Data Mining.

6.2.2 Step 2. Instance Classification
In the second step, the chosen instance data is classified according to the different conceptual models. For this purpose, an appropriate classification method has to be chosen that fits the data and the conceptual model. Further, the result of the classification process has to be evaluated. For this step we rely on established methods from Machine Learning and Data Mining.

6.2.3 Step 3. Hypothesis Generation
In the third step, we generate hypotheses for reference mappings based on the shared instances created in the first two steps. In this step, we prune the classification by removing instances that are classified with a low confidence and selecting subsets of the conceptual models that show sufficient overlap. We further compute a degree of overlap between concepts in the different models and, based on this degree of overlap, select a set of reference mappings between concepts with a significant overlap.
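The overlap computation of this step can be sketched as follows, under our own assumptions: each concept is associated with the set of documents classified under it, and pairs of concepts whose extensions overlap sufficiently become candidate reference mappings. The overlap measure (a Jaccard coefficient here) and the threshold are illustrative choices, not necessarily those used by the methodology.

```python
# Illustrative sketch of instance-based candidate mapping generation.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def candidate_mappings(extensions1, extensions2, min_overlap=0.5):
    """extensions*: dict mapping a concept name to the set of instance ids
    classified under it (low-confidence classifications already pruned)."""
    candidates = []
    for c1, docs1 in extensions1.items():
        for c2, docs2 in extensions2.items():
            overlap = jaccard(docs1, docs2)
            if overlap >= min_overlap:
                candidates.append((c1, c2, overlap))
    return sorted(candidates, key=lambda t: -t[2])
```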
6.3 Step 4. Evaluation
In the last step, the generated reference mapping is evaluated against the results of the different matching systems as described in ??, using a number of criteria for a reference mapping. These criteria include correctness, the complexity of the mapping problem and the ability of the mappings to discriminate between different matching approaches.

We are testing this methodology using a data set of medical documents called OHSUMED. The data set contains 350,000 articles from medical journals covering all aspects of medicine. For classifying these documents according to the two ontologies of anatomy, we use the Collexis text indexing and retrieval system, which implements a number of automatic methods for assigning concepts to documents. Currently, we are testing the data set and the system on a subset of UMLS with known mappings in order to assess the suitability of the methodology. The generation of the reference mappings for the Anatomy case will proceed around the end of 2005 and we hope to have a thoroughly tested set of reference mappings for the 2006 alignment challenge.

6.4 Comments
We had very few participants able to even produce alignments between the two ontologies. This is mainly due to their inability to load these ontologies with current OWL tools (caused either by the size of the ontologies or by errors in the OWL).

7. RESULT VALIDATION
As can be seen from the procedure, the results published in the following papers are not obtained independently. The results provided here have been computed from the alignments provided by the participants and can be considered as the official results of the evaluation.

In order to go one step further, we have attempted, this year, to regenerate the results obtained by the participants from their tools. The tools for which the results have been validated independently are marked in Table 1.

8. LESSONS LEARNED
A) It seems that there are more and more tools able to jump into this kind of test.

B) Contrary to last year, it seems that the tools are more robust and deal with a wider implementation of OWL. However, this can also be because we tuned the tests so that no one has problems.

C) Contrary to what many people think, it is not that easy to find ontological corpora suitable for this evaluation. From the proposals we had from last year, only one proved to be usable, and with great difficulty (on size, conformance and juridical aspects).

D) The extension of the benchmark tests towards more coverage of the space is relatively systematic. However, it would be interesting and certainly more realistic, instead of crippling all names, to do it for some random proportion of them (5%, 10%, 20%, 40%, 60%, 100% random change). This has not been done for reasons of time.

E) The real world benchmarks were huge. Two different strategies have been taken with them: cutting them into a huge set of tiny benchmarks or providing them as is. The first solution takes us away from the "real world", while the second one raised serious problems for the participants. It would certainly be worth designing these tests so as to assess the current limitations of the tools by providing an increasingly large sequence of such tests (0.1%, 1%, 10%, 100% of the corpus, for instance).

F) Validation of the results is quite difficult to establish.

9. FUTURE PLANS
The future plans for the Ontology Alignment Evaluation Initiative are certainly to go ahead and improve the functioning of these evaluation campaigns. This most surely involves:

– Finding new real world cases;
– Improving the tests along the lessons learned;
– Accepting continuous submissions (through validation of the results);
– Improving the measures to go beyond precision and recall.

Of course, these are only suggestions and other ideas could come up during the wrap-up meeting in Banff.

10. CONCLUSION
In summary, the tests that have been run this year are harder and more complete than those of last year. However, more teams participated and the results tend to be better. This shows that, as expected, the field of ontology alignment is getting stronger (and we hope that evaluation is contributing to this progress).

Reading the papers of the participants should help people involved in ontology matching to find out what makes these algorithms work and what could be improved.

The Ontology Alignment Evaluation Initiative will continue these tests by improving both the test cases and the test methodology to be more accurate. It can be found at: http://oaei.inrialpes.fr.

11. ACKNOWLEDGEMENTS
We warmly thank each participant of this contest. We know that they worked hard to have their results ready and they provided insightful papers presenting their experience. The best way to learn about the results remains to read what follows.
Many thanks are due to the teams at the University of Wash-
ington and the University of Manchester for allowing us to
use their ontologies of anatomy.
The other members of the Ontology Alignment Evaluation Initiative Steering committee are: Benjamin Ashpole (Lockheed Martin Advanced Technology Lab.), Marc Ehrig (University of Karlsruhe), Lewis Hart (Applied Minds), Todd Hughes (Lockheed Martin Advanced Technology Labs), Natasha Noy (Stanford University), and Petko Valtchev (Université de Montréal, DIRO).
This work has been partially supported by the Knowledge
Web European network of excellence (IST-2004-507482).
12. REFERENCES
[1] Paolo Avesani, Fausto Giunchiglia, and Michael
Yatskevich. A large scale taxonomy mapping
evaluation. In Proceedings of International Semantic
Web Conference (ISWC), 2005.
[2] Jérôme Euzenat. An API for ontology alignment. In
Proc. 3rd international semantic web conference,
Hiroshima (JP), pages 698–712, 2004.
[3] Andreas Heß and Nicholas Kushmerick. Iterative
ensemble classification for relational data: A case study
of semantic web services. In Proceedings of the 15th
European Conference on Machine Learning, Pisa, Italy,
2004.
[4] York Sure, Oscar Corcho, Jérôme Euzenat, and Todd
Hughes, editors. Proceedings of the 3rd Evaluation of
Ontology-based tools (EON), 2004.
Montbonnot, Amsterdam, Trento, September 7th, 2005
# Name Com Hier Inst Prop Class Comment
101 Reference alignment
102 Irrelevant ontology
103 Language generalization
104 Language restriction
201 R No names
202 R N No names, no comments
203 N No comments (was misspelling)
204 C Naming conventions
205 S Synonyms
206 F F Translation
207 F
208 C N
209 S N
210 F N
221 N No specialisation
222 F Flattened hierarchy
223 E Expanded hierarchy
224 N No instance
225 R No restrictions
226 No datatypes
227 Unit difference
228 N No properties
229 Class vs instances
230 F Flattened classes
231* E Expanded classes
232 N N
233 N N
236 N N
237 F N
238 E N
239 F N
240 E N
241 N N N
246 F N N
247 E N N
248 N N N
249 N N N
250 N N N
251 N N F
252 N N E
253 N N N N
254 N N N N
257 N N N N
258 N N F N
259 N N E N
260 N N F N
261 N N E N
262 N N N N N
265 N N F N N
266 N N E N N
301 Real: BibTeX/MIT
302 Real: BibTeX/UMBC
303 Real: Karlsruhe
304 Real: INRIA
Table 5: Structure of the systematic benchmark test-case
algo edna falcon foam ctxMatch2-1 dublin20 cms omap ola
test Prec. Rec. Prec. Rec. Prec. Rec. Prec. Rec. Prec. Rec. Prec. Rec. Prec. Rec. Prec. Rec.
101 0.96 1.00 1.00 1.00 n/a n/a 0.10 0.34 1.00 0.99 n/a n/a 0.96 1.00 1.00 1.00
103 0.96 1.00 1.00 1.00 0.98 0.98 0.10 0.34 1.00 0.99 0.67 0.25 0.96 1.00 1.00 1.00
104 0.96 1.00 1.00 1.00 0.98 0.98 0.10 0.34 1.00 0.99 0.80 0.34 0.96 1.00 1.00 1.00
201 0.03 0.03 0.98 0.98 n/a n/a 0.00 0.00 0.96 0.96 1.00 0.07 0.80 0.38 0.71 0.62
202 0.03 0.03 0.87 0.87 0.79 0.52 0.00 0.00 0.75 0.28 0.25 0.01 0.82 0.24 0.66 0.56
203 0.96 1.00 1.00 1.00 1.00 1.00 0.08 0.34 1.00 0.99 1.00 0.24 0.96 1.00 1.00 1.00
204 0.90 0.94 1.00 1.00 1.00 0.97 0.09 0.28 0.98 0.98 1.00 0.24 0.93 0.89 0.94 0.94
205 0.34 0.35 0.88 0.87 0.89 0.73 0.05 0.11 0.98 0.97 1.00 0.09 0.58 0.66 0.43 0.42
206 0.51 0.54 1.00 0.99 1.00 0.82 0.05 0.08 0.96 0.95 1.00 0.09 0.74 0.49 0.94 0.93
207 0.51 0.54 1.00 0.99 0.96 0.78 0.05 0.08 0.96 0.95 1.00 0.09 0.74 0.49 0.95 0.94
208 0.90 0.94 1.00 1.00 0.96 0.89 0.09 0.28 0.99 0.96 1.00 0.19 0.96 0.90 0.94 0.94
209 0.35 0.36 0.86 0.86 0.78 0.58 0.05 0.11 0.68 0.56 1.00 0.04 0.41 0.60 0.43 0.42
210 0.51 0.54 0.97 0.96 0.87 0.64 0.05 0.08 0.96 0.82 0.82 0.09 0.88 0.39 0.95 0.94
221 0.96 1.00 1.00 1.00 1.00 1.00 0.12 0.34 1.00 0.99 1.00 0.27 0.96 1.00 1.00 1.00
222 0.91 0.99 1.00 1.00 0.98 0.98 0.11 0.31 1.00 0.99 1.00 0.23 0.96 1.00 1.00 1.00
223 0.96 1.00 1.00 1.00 0.99 0.98 0.09 0.34 0.99 0.98 0.96 0.26 0.96 1.00 1.00 1.00
224 0.96 1.00 1.00 1.00 1.00 0.99 0.10 0.34 1.00 0.99 1.00 0.27 0.96 1.00 1.00 1.00
225 0.96 1.00 1.00 1.00 0.00 0.00 0.08 0.34 1.00 0.99 0.74 0.26 0.96 1.00 1.00 1.00
228 0.38 1.00 1.00 1.00 1.00 1.00 0.12 1.00 1.00 1.00 0.74 0.76 0.92 1.00 1.00 1.00
230 0.71 1.00 0.94 1.00 0.94 1.00 0.08 0.35 0.95 0.99 1.00 0.26 0.89 1.00 0.95 0.97
231 0.96 1.00 1.00 1.00 0.98 0.98 0.10 0.34 1.00 0.99 1.00 0.27 0.96 1.00 1.00 1.00
232 0.96 1.00 1.00 1.00 1.00 0.99 0.12 0.34 1.00 0.99 1.00 0.27 0.96 1.00 1.00 1.00
233 0.38 1.00 1.00 1.00 1.00 1.00 0.12 1.00 1.00 1.00 0.81 0.76 0.92 1.00 1.00 1.00
236 0.38 1.00 1.00 1.00 1.00 1.00 0.09 1.00 1.00 1.00 0.74 0.76 0.92 1.00 1.00 1.00
237 0.91 0.99 1.00 1.00 1.00 0.99 0.11 0.31 1.00 0.99 1.00 0.23 0.95 1.00 0.97 0.98
238 0.96 1.00 0.99 0.99 1.00 0.99 0.07 0.34 0.99 0.98 0.96 0.26 0.96 1.00 0.99 0.99
239 0.28 1.00 0.97 1.00 0.97 1.00 0.14 1.00 0.97 1.00 0.71 0.76 0.85 1.00 0.97 1.00
240 0.33 1.00 0.97 1.00 0.94 0.97 0.10 1.00 0.94 0.97 0.71 0.73 0.87 1.00 0.97 1.00
241 0.38 1.00 1.00 1.00 1.00 1.00 0.12 1.00 1.00 1.00 0.81 0.76 0.92 1.00 1.00 1.00
246 0.28 1.00 0.97 1.00 0.97 1.00 0.14 1.00 0.97 1.00 0.71 0.76 0.85 1.00 0.97 1.00
247 0.33 1.00 0.94 0.97 0.94 0.97 0.10 1.00 0.94 0.97 0.71 0.73 0.87 1.00 0.97 1.00
248 0.06 0.06 0.84 0.82 0.89 0.51 0.00 0.00 0.71 0.25 0.25 0.01 0.82 0.24 0.59 0.46
249 0.04 0.04 0.86 0.86 0.80 0.51 0.00 0.00 0.74 0.29 0.25 0.01 0.81 0.23 0.59 0.46
250 0.01 0.03 0.77 0.70 1.00 0.55 0.00 0.00 1.00 0.09 0.00 0.00 0.05 0.45 0.30 0.24
251 0.01 0.01 0.69 0.69 0.90 0.41 0.00 0.00 0.79 0.32 0.25 0.01 0.82 0.25 0.42 0.30
252 0.01 0.01 0.67 0.67 0.67 0.35 0.00 0.00 0.57 0.22 0.25 0.01 0.82 0.24 0.59 0.52
253 0.05 0.05 0.86 0.85 0.80 0.40 0.00 0.00 0.76 0.27 0.25 0.01 0.81 0.23 0.56 0.41
254 0.02 0.06 1.00 0.27 0.78 0.21 0.00 0.00 NaN 0.00 0.00 0.00 0.03 1.00 0.04 0.03
257 0.01 0.03 0.70 0.64 1.00 0.64 0.00 0.00 1.00 0.09 0.00 0.00 0.05 0.45 0.25 0.21
258 0.01 0.01 0.70 0.70 0.88 0.39 0.00 0.00 0.79 0.32 0.25 0.01 0.82 0.25 0.49 0.35
259 0.01 0.01 0.68 0.68 0.61 0.34 0.00 0.00 0.59 0.21 0.25 0.01 0.82 0.24 0.58 0.47
260 0.00 0.00 0.52 0.48 0.75 0.31 0.00 0.00 0.75 0.10 0.00 0.00 0.05 0.86 0.26 0.17
261 0.00 0.00 0.50 0.48 0.63 0.30 0.00 0.00 0.33 0.06 0.00 0.00 0.01 0.15 0.14 0.09
262 0.01 0.03 0.89 0.24 0.78 0.21 0.00 0.00 NaN 0.00 0.00 0.00 0.03 1.00 0.20 0.06
265 0.00 0.00 0.48 0.45 0.75 0.31 0.00 0.00 0.75 0.10 0.00 0.00 0.05 0.86 0.22 0.14
266 0.00 0.00 0.50 0.48 0.67 0.36 0.00 0.00 0.33 0.06 0.00 0.00 0.01 0.15 0.14 0.09
301 0.48 0.79 0.96 0.80 0.83 0.31 0.10 0.07 0.74 0.64 1.00 0.13 0.94 0.25 0.42 0.38
302 0.31 0.65 0.97 0.67 0.97 0.65 0.14 0.27 0.62 0.48 1.00 0.17 1.00 0.58 0.37 0.33
303 0.40 0.82 0.80 0.82 0.89 0.80 0.04 0.29 0.51 0.53 1.00 0.18 0.93 0.80 0.41 0.49
304 0.71 0.95 0.97 0.96 0.95 0.96 0.11 0.26 0.75 0.70 0.85 0.22 0.91 0.91 0.74 0.66
H-means 0.45 0.61 0.91 0.89 0.90 0.69 0.08 0.24 0.92 0.72 0.81 0.18 0.35 0.70 0.80 0.74
Table 6: Full results