=Paper= {{Paper |id=Vol-156/paper-15 |storemode=property |title=OLA in the OAEI 2005 Alignment Contest |pdfUrl=https://ceur-ws.org/Vol-156/paper15.pdf |volume=Vol-156 |dblpUrl=https://dblp.org/rec/conf/kcap/EuzenatGV05 }} ==OLA in the OAEI 2005 Alignment Contest== https://ceur-ws.org/Vol-156/paper15.pdf
                        OLA in the OAEI 2005 alignment contest

                  Jérôme Euzenat                 Philippe Guégan                    Petko Valtchev
                  INRIA Rhône-Alpes           DIRO, Université de Montréal    DIRO, Université de Montréal
                  Montbonnot, France             Montr eal (Qc), Canada            Montr eal (Qc), Canada
               euzenat-at-inrialpes.fr                 Guegan-at-                    Petko.Valtchev-at-
                                                    iro.umontreal.ca                   umontreal.ca


ABSTRACT                                                        vironment for alignment manipulation. Indeed, in its
 Among the variety of alignment approaches (e.g., us-           current version, the system offers, via its GUI compo-
ing machine learning, subsumption computation, for-             nent VisOn, the following services:
mal concept analysis, etc.) similarity-based ones rely
on a quantitative assessment of pair-wise likeness be-
                                                                   • parsing and visualization of OWL-Lite and OWL-
tween entities. Our own alignment tool, OLA, features
                                                                     DL ontologies,
a similarity model rooted in principles such as: com-
                                                                   • computation of similarities between entities from
pleteness on the ontology language features, weighting
                                                                     two ontologies,
of different feature contributions and mutual influence
                                                                   • extraction of alignments from a pair of ontologies,
between related ontology entities. The resulting similar-
                                                                     provided with a set of similarity matrices, one per
ities are recursively defined hence their values are cal-
                                                                     category of ontology entities (see below),
culated by a step-wise, fixed-point-bound approximation
                                                                   • manual construction of alignments by composing
process. For the OAEI 2005 contest, OLA was provided
                                                                     entity pairs from two ontologies,
with an additional mechanism for weight determination
                                                                   • use of an existing (partial) alignment as a seed
that increases the autonomy of the system.
                                                                     for automated alignment construction (alignment
                                                                     completion),
1.      PRESENTATION OF THE SYSTEM                                 • alignment visualization,
OLA (for OWL-Lite Alignment) is an open-source tool                • comparison of two alignments.
jointly developed by teams at University of Montréal
and INRIA Rhône Alpes. It features similarity-based
alignment and a set of auxiliary services supporting the        In the remainder, the focus will be limited to the auto-
manipulation of alignment results [5, 6].                       mated alignment construction with OLA.

1.1      General purpose statement                               1.1.2   Principles of the alignment in OLA
The primary goal behind the OLA tool design is to per-          The following fundamental principles underly the de-
form alignment of ontologies expressed in OWL, with a           sign of the three key mechanisms in OLA, internal rep-
short-term emphasis on OWL-Lite and long-term one               resentation of the ontology, similarity computation and
on OWL-DL. However, its GUI component, VisOn1                   alignment extraction, that are involved in the global
allows for many other services involving alignments (in         ontology alignment process:
the sense of [4]) to be accessed.
                                                                All-encompassing comparison : We tend to believe
1.1.1       Functional specifications                                that all the available knowledge about a pair of on-
From a mere algorithm for automated alignment con-                   tology entities should be taken into account when
struction, OLA has grown for the last year to an en-                 aligning. This does not exclude the possibility of
1
    see http://www.iro.umontreal.ca/∼owlola/                         ignoring particular aspects, i.g., OWL instances
                                                                     in case of OWL class comparison. However such
                                                                     a choice should be deliberately made by the tool
                                                                     user, here through appropriate weight assignment,
                                                                     or, if performed by an automated mechanisms,
                                                                     should reflect some particularity, either of the en-
                                                                     tire ontology (e.g., global absence of instances in
                                                                     both ontologies) or of the pair of entities at hand
                                                                     (e.g., local absence of instances in the pair of classes
                                                                     to be compared).




                                                          97
Highest automation level : Although we recognize                 OLA features an alignment process that splits into three
    that the entire alignment process often needs to             basic steps: constructing the intermediate representa-
    be set on a semi-automated basis, we nevertheless            tion of the compared ontologies as labeled graphs, com-
    argue in favor of a completely automated process             puting the similarity of each pair of same-category en-
    for ”draft” alignment generation. Thus, we see the           tities from the respective ontology graphs, extracting
    OLA user providing a minimal set of parameters               an alignment from the similarity matrices for each cat-
    at the initial steps of the process whereas the tool         egory.
    will suggest one or more candidate alignments at
    the end, without any other human intervention.               1.2.1      OL-Graph construction
Category-dependent comparison : Following the syn-               OL-Graphs are graph structures that provide an easy-
    tactic structure of the OWL language, entities are           to-process inner representation of OWL ontologies. An
    divided into categories, e.g., classes, objects, prop-       OL-Graph is a labeled graph where vertices correspond
    erties, relations, and only entities of the same cat-        to OWL entities and edges to inter-entity relationships.
    egory are compared. Moreover, the entities of a              As described in [6], the set of different vertex categories
    category are compared using similarity functions             is: class (C), object (O), relation (R), property (P ),
    of the same basic shape. The respective category             property instance (A), datatype (D), datavalue (V ),
    functions comprise the same factors and the same             property restriction labels (L). Furthermore, we dis-
    weights. They are further customized for each pair           tinguish between datatype relations (Rdt ) and object
    of category entities by projecting them over the ac-         relations (Ro ), and between datatype properties (Pdt )
    tual feature space of the entities (which may be far         and object ones (Po ).
    smaller than the complete space of the category).
Comparability of similarity results : To enable com-             The OL-Graph model allows the following relationships
    parison of similarity scores between different align-        among entities to be expressed:
    ment tasks but also for some computational rea-
    sons, a set of useful properties is insured for the
                                                                      • specialization between classes or relations (denoted
    similarity functions: normalization, positiveness,
                                                                        S),
    maximalness 2 , and symmetry 3 .
                                                                      • instanciation (denoted I) between objects and classes,
                                                                        property instances and properties, values and datatypes,
1.1.3    Current limitations                                          • attribution (denoted A) between classes and prop-
   • Although it would be of certain value for align-                   erties, objects and property instances;
     ment, OLA currently offers no inference mecha-                   • restriction (denoted R) expressing the restriction
     nisms that could help complete the entity descrip-                 on a property in a class,
     tions. In particular, inheritance is not used to ex-             • valuation (denoted U) of a property in an object.
     pand entities, mostly out of efficiency considera-
     tions.
                                                                 The OL-Graph of an ontology is built after the ontology
   • Although neighborhoods play crucial role in the             is parsed4 . The process of OL-Graph construction is
     similarity definition, two neighbor entities are not        described in [8].
     necessarily affecting each other’s respective simi-
     larities to a pair of other entities. As only descrip-      1.2.2      Similarity model
     tive knowledge is taken into account, given two             The similarity functions used in OLA are designed in
     such entities, say e1 and e2 , for e2 to appear in          a category-specific manner and cover all the available
     a similarity expression for e1 , it should be consid-       descriptive knowledge about an entity pair. Thus, given
     ered as part of the description of the latter. For          a category X of OL-Graph nodes, the similarity of two
     instance, a data type is not seen as being described        nodes from X depends on:
     by a property whose range the datatype repre-
     sents. Consequently, datatypes are compared in
     an ontology-independent manner.                                  • the similarities of the terms used to designate them,
                                                                        i.e., URIs, labels, names, etc.,
   • Category borders are not similarity-permeable: Only              • the similarity of the pairs of neighbor nodes in the
     entities from the same category are compared for                   respective OL-Graphs that are linked by edges ex-
     similarity and hence for alignment.                                pressing the same relationships (e.g., class node
                                                                        similarity depends on similarity of superclasses, of
1.2     Specific techniques used                                        property restrictions and of member objects),
                                                                      • the similarity of other local descriptive features de-
2
  With normalization, this amounts to forcing scores of 1 for           pending on the specific category (e.g., cardinality
identical entities within identical ontologies                          intervals, property types)
3
  The price to pay for symmetry is the impossibility of de-
                                                                 4
tecting subsumption by this purely numerical procedure.              So far, we use the OWL API [1].




                                                            98
 Funct.   Node    Factor                            Measure          where N + (x, x0 ) is the set of all relationships F for
 SimO     o∈O     λ(o)                              simL
                  a ∈ A, (o, a) ∈ A                 M SimA           which F(x) ∪ F(x0 ) 6= ∅ 5 .
 SimA     a∈A     r ∈ R, (a, r) ∈ R                 SimR
                  o ∈ O, (a, o) ∈ U                 M SimO           OLA relies on various functions for identifiers compar-
                  v ∈ V , (a, v) ∈ U                M SimV
 SimV     v∈V     value literal                     type dependent   ison. Both string distances and lexical distances are
 SimC     c∈C     λ(c)                              simL             used. Lexical distances rely on an exploration of Word-
                  p ∈ P , (c, p) ∈ A                M SimP
                  c0 ∈ C, (c, c0 ) ∈ S              M SimC
                                                                     Net 2.0 [7] with a quantitative assessment of the “relat-
 simD     d∈D     λ(r)                              XML-Schema       edness” between two, possibly multi-word, terms. More
 SimR     r∈R     λ(r)                              simL             specifically, the degree of relatedness between two Word-
                  c ∈ C, (r, domain, c) ∈ R         M SimC
                  c ∈ C, (r, range, c) ∈ R          M SimC           Net entries is computed as the ratio between the depth,
                  d ∈ D, (r, range, d) ∈ R          SimD             in graph-theoretic sense, of the most specific common
                  r 0 ∈ R, (r, r 0 ) ∈ S            M SimR
 SimP     p∈P     r ∈ R, (p, r 0 ) ∈ S              SimR
                                                                     hypernym and the average of both term depths. The
                  c ∈ C, (p, all, c) ∈ R            M SimC           computation of multi-word term similarity consists in
                  n ∈ {0, 1, ∞}, (p, card, n) ∈ R   equality         first splitting the terms into a set of tokens each and
Table 1: Similarity function decomposition                           then comparing all possible pairs of tokens from oppo-
(card = cardinality and all = allValuesFrom).                        site sets using the above depth-based principle. The
                                                                     global term similarity is then computed as a similarity-
                                                                     based matching between both sets (see above).

Datatype and datavalue similarities are external to our              As circular dependencies are impossible to avoid with
model and therefore they are either user-provided or                 the above definitions, the computing of the similarity
measured by a standard function (e.g., string identity               values requires non-standard mechanisms. Following [2,
of values and datatype names/URIs).                                  9], an equation system is composed out of the similar-
                                                                     ity definitions where variables correspond to similari-
Formally, given a category X together with the set of                ties of node pairs while coefficients come from weights.
relationships it is involved in, N (X), the similarity mea-          The process of iterative, fixed-point-bound resolution
sure SimX : X 2 → [0, 1] is defined as follows:                      of that system, as well as the related convergence and
                                                                     determinism issues are described in [6].
                       X
   SimX (x, x0 ) =               X
                                πF M SimY (F(x), F(x0 )).
                     F ∈N (X)
                                                                     1.3   Implementation
                                                                     OLA is implemented in Java. Its architecture follows
                                                 X                   the one of the Alignment API and the recent implemen-
The function
        P is normalized,      i.e., the weights πF  sum to           tation that was described in [4]. OLA relies on the OWL
                      X
a unit, F ∈N (X) πF      = 1. for the computability The              API [1] for parsing OWL files. An entire subsystem is
set functions M SimY compare two sets of nodes of the                dedicated to the onstruction of OL-Graphs on top of the
same category (see [6] for details). Table 1 illustrates             parsed ontologies. A set of further components that of-
the set of similarities in our model.                                fer similarity computation services: substring distances,
Following the lessons learned with our participation in
the EON 2004 alignment contest [?], we have adapted                  edit distances, Hamming distance, WordNet interface
the above measure to fit cases where particular pair                 (via the JWNL library [3]), etc., that were originally
of entities is described only by a small subset of the               designed for OLA are now part of the Alignment API.
entire set of category descriptors. Thus, a descriptive              The VisOn GUI component offers a uniform interface
factor is ignored for similarity computation whenever                to all services provided by Alignment API and OLA.
neither of the compared entities possesses a neighbor                In particular, it visualizes both the input data, i.e., the
with the underlying link label (e.g., no instances for a             OL-Graphs, and the final result, i.e., the alignment file,
pair of compared classes). In this case, not only its                of the global process.
weight is set to 0, but also the weights of the remaining
”active” factors are increased correspondingly. To scale             1.4   Adaptations made for the contest
that principle up to the entire set of descriptive factors,          Along the preparation of the AOEI 2005 contest, a row
the following simple mechanism has been realized in                  of changes have been made to the system in order to
OLA: In order to keep both normalization and equity                  make it fit the complexity of the alignment discovery
in similarity values, the weights of all non-null factors            task. The most striking one is the introduction of a
for a given entity pair are divided through their sum.               weight-computing mechanism that eliminates the ne-
Thus, for a category X, the similarity measure Sim+    X :           cessity for the tool user to provide initial weights and
X 2 → [0, 1] becomes:                                                hence makes a significant step towards full automation
                                                                     of the alignment process.
                      0      SimX (x, x0 )                           5
                                                                      That is, there exists at least one y such that (x, y) ∈ F or
            Sim+
               X (x, x  ) = P
                             F ∈N + (x,x0 ) πF                       at least one y 0 such that (x0 , y 0 ) ∈ F.




                                                                99
1.4.1      Weight computing mechanism                          1.4.3     Minor adaptations
As it is far from obvious for novice users how to weigh        Following experiences from EON 2004, a set of simple
the different similarity factors, we initiated work on in-     but decisive modifications have been applied in order
corporating a weight computing mechanism within the            to prevent the precision leak in the tests. First, the
system. The intended mechanism is both intuitive and           instances have been excluded from the alignments by
effective so that alignment practitioners with various         default, although the possibility is given to the user
skill levels could find a match for their knowledge and        to reverse this choice. Then, entities external to the
experience. So far, we used a simple heuristic method          ontologies at hand have also been excluded from the
that, according to the obtained results, performs rea-         alignment (but not from the similarity computation).
sonably well. The basic idea of the method consists            Finally, one-to-one alignment production has been en-
in distributing the weights among similarity factors in        forced in OLA to increase the potential recall of the
the generic similarity function of a node category ac-         resulting alignment.
cording to the relative importance of the corresponding
category in the entire ontology. That is to say we use         2.    RESULTS
the average number of links of the corresponding type          The comments are grouped by test categories.
per entity of the category at hand. For instance, the
greater the number of super-class links in the ontology,       2.101      Tests 10X
the higher the weight of the super-class factor in the         OLA performed very well on the tests of this group.
class similarity formula.                                      This seems to be due to the fact that while the lan-
                                                               guage varies along the individual tests of the group, the
1.4.2      Similarity measure for entity names                 basic ontology entities involved in the similarity compu-
                                                               tation remain unchanged with respect to the reference
OLA uses two alternative modes of comparison for en-
                                                               ontology.
tity names (URIs, labels, etc.): a string measure6 (a
default) and a lexical similarity measure that relies on
WordNet 2.0 (see above).
                                                               2.102      Tests 2XX
                                                               The performances of the algorithm seem to suggest that
The highly sophisticated lexical similarity measure that       three sub-groups of tests can be distinguished. The first
was used in OLA for the EON competition has been re-           one comprises the tests 21X, 22X, 23X and 24X, with a
placed by a simpler but more purposeful one. Indeed,           small number of exceptions where the performance have
the initial function compared multi-word terms on three        been:
separate axes: nouns, verbs and adjectives, as provided
by WordNet 2,0. Such comparison seemed appropriate                  • Quite good: This is the case of tests 201, 202,
for cases where the meanings of a word fall in more                   with random class names. The random names
than one part-of-speech category. The inter-word simi-                were putting a strain on the ability of the algo-
larities on each axis were aggregated by an independent               rithm to propagate similarity along the network of
best-match computations while the three resulting val-                node pairs. Obviously, our technique needs some
ues were further combined to a single one via a weighted              improvements on that point.
sum.
                                                                    • Satisfactory: In the case of tests 248, 249, there
The new measure trades separate matchings on speech-                  is a combination of missing (or random) names
part-wise basis to a single global matching along entry               with one other missing factor. For tests 248, 249,
similarities that aggregate all three possible aspects of a           the missing factors are hierarchy (sub-class links)
word. Thus, the words are compared to each other with                 and instances, respectively. Both play important
all possible meanings and the highest similarity over a               role in similarity computation of classes, whenever
single pair of meanings is taken for the words.                       these are stripped of their names as is the case
                                                                      with these two ontologies. Hence the sharp drop in
For the OAEI competition, as we had to rely on a fixed                precision and recall with respect to the preceding
parameter set for the entire collection of tests, we have             tests.
chosen to force the use of the string distance. Indeed,             • Weak: The notorious failure here have been the
it showed better performances while being much more                   tests 205, 209, which are the only ones to use of
efficient than the WordNet-based computation.                         synonymous names in the ontology entities (with
                                                                      respect to the intial ontology). As WordNet has
Nevertheless, the improved lexical similarity was not                 been plugged-out of the similarity computation,
completely discarded: it is currently used as a pre-                  these results are not surprising.
processing tool that helps decide automatically the dis-
tribution of weights among similarity factors.
                                                               The second groups is made of the tests 25X. Here OLA
6
    subString distance provided by the Alignment API           performances varied substantially: from extremely poor




                                                         100
(254) to satisfactory (252, 259).                              choice could easily generate a chain of wrong alignment
                                                               decisions and thus negatively impact the performances
The last five ontologies of the group, the 26X ones, have      of the tool.
proven to represent a serious obstacle for OLA. The
performances of the system here were poor to very poor.

2.103    Tests 30X
The real-world ontologies of the group 30X made OLA            3.3   Comments on the experiment
perform in an unimpressive way. We believe that this           Two months during summer period is definitely too
is due to the fact that string similarity was systemati-       short to run shuch an experiment.
cally used as identifier comparison means. Indeed, ten-
tative runs with WordNet as basis for name similarity
yielded way more precise alignments on that group. Un-
fortunately, they also brought down the overall statis-
tics from the entire test set such as mean precision and
mean recall. Hence the choice of the WordNet-based             4.    CONCLUSION
lexical similarity for a default name comparison means         In its latest version, OLA has proven a more robust
has been dropped.                                              tool for alignment than it was a year before. While
                                                               the difficulties with real-world ontologies persist, the
                                                               progress on noisy ones has been substantial.
3. GENERAL COMMENTS
3.1 Comments on the results                                    The next key topic of the research around OLA will be
The results show a substantial progress has been made          the automation of the weight computation for a specific
since the EON 2004 alignment contest. With respect to          pair of ontologies.
the performances of OLA at that forum, we made a big
leap amounting to about 25% in both mean precision
and mean recall.

Nevertheless, we see that a vast space for improvement
lays ahead of our project. The weaknesses of the current
                                                               5.    RAW RESULTS
similarity mechanisms can be summarized as follows.
First, the tuning of the algorithm is still a rigid pro-
cess. Indeed, while the weights can now be computed
following a specific footprint of the ontology, a mecha-       5.1   Link to the set of provided alignments
nism for the choice of a particular name similarity on         A .zip archive of all the contest results is available at
the same basis has yet to be defined.                          the following URL:

Second, although we take into account the biggest possi-
ble amount of knowledge about entities, there are sources
of similarity that have been ignored so far, in particular     http://www.iro.umontreal.ca/∼owlola/OAEI.html
entity comments.

3.2   Discussions on the way to improve the pro-
      posed system
Besides expanding the lexical processing to comments           5.2   Link to the system and parameters file
in entities and providing a flexible decision mechanism        A similar archive with the parameters and the .jar files
for the choice of the default name similarity, a possible      used in the contest-related experiments is available at
improvement of the system will be the integration of a         the following URL:
learning module for weight estimation. As for similarity,
the biggest challenge here is to define the representation
of the input data, i.e., the descriptors of the entries for
the learning algorithm.
                                                               http://www.iro.umontreal.ca/∼owlola/OAEI.html
Another research track would be the definition of an
optimal matching algorithm. In fact, the current proce-
dures are sub-optimal in the sense that they only chose
local optima for each aligned entity. Consequently, as
strict 1:1 matchings are to be produced, a single bad          5.3   Matrix of results




                                                         101
                                                         [2] Gilles Bisson. Learning in FOL with similarity
  #     Name                      Prec.   Rec.   Time        measure. In Proc. 10th AAAI, San-Jose (CA US),
 101    Reference alignment         1       1                pages 82–87, 1992.
 103    Language generalization     1       1
 104    Language restriction        1       1            [3] John Didion. The Java WordNet Library,
 201    No names                  0.71    0.62               http://jwordnet.sourceforge.net/, 2004.
 202    No names & no comments    0.66    0.56           [4] Jérôme Euzenat. An API for ontology alignment.
 203    No comments                 1       1                In Proc. 3rd ISWC, pages 698–712, Hiroshima
 204    Naming conventions        0.94    0.94               (JP), 2004.
 205    Synonyms                  0.43    0.42
 206    Translation               0.94    0.93           [5] Jérôme Euzenat and Petko Valtchev. An
 207                              0.95    0.94               integrative proximity measure for ontology
 208                              0.94    0.94               alignment. In Proc. ISWC-2003 workshop on
 209                              0.43    0.42               semantic information integration, Sanibel Island
 210                              0.95    0.94               (FL US), pages 33–38, 2003.
 221    No specialisation           1       1
                                                         [6] Jérôme Euzenat and Petko Valtchev.
 222    Flatenned hierarchy         1       1
                                                             Similarity-based ontology alignment in OWL-lite.
 223    Expanded hierarchy          1       1
                                                             In Proc. 15th ECAI, pages 333–337, Valencia (ES),
 224    No instance                 1       1
                                                             2004.
 225    No restrictions             1       1
 228    No properties               1       1            [7] A.G. Miller. Wordnet: A lexical database for
 230    Flattened classes         0.95    0.97               english. Communications of the ACM,
 231                                1       1                38(11):39–41, 1995.
 232                                1       1
 233                                1       1            [8] Mohamed Touzani. Alignement d’ontologies dans
 236                                1       1                OWL. Master’s thesis, University of Montréal, in
 237                              0.97    0.98               preparation.
 238                              0.99    0.99           [9] Petko Valtchev. Construction automatique de
 239                              0.97      1                taxonomies pour l’aide à la représentation de
 240                              0.97      1                connaissances par objets. Thèse d’informatique,
 241                                1       1                Université Grenoble 1, 1999.
 246                              0.97      1
 247                              0.97      1
 248                              0.59    0.46
 249                              0.59    0.46
 250                               0.3    0.24
 251                              0.42     0.3
 252                              0.59    0.52
 253                              0.56    0.41
 254                              0.04    0.03
 257                              0.25    0.21
 258                              0.49    0.35
 259                              0.58    0.47
 260                              0.26    0.17
 261                              0.14    0.09
 262                               0.2    0.06
 265                              0.22    0.14
 266                              0.14    0.09
 301    Real: BibTeX/MIT          0.42    0.38
 302    Real: BibTeX/UMBC         0.37    0.33
 303    Real: Karlsruhe           0.41    0.49
 304    Real: INRIA               0.74    0.66

6.     REFERENCES
[1] Sean Bechhofer, Raphael Voltz, and Phillip Lord.
    Cooking the semantic web with the OWL API. In
    Proc. 2nd International Semantic Web Conference
    (ISWC), Sanibel Island (FL US), 2003.




                                                   102