Eff2Match Results for OAEI 2010

                       Watson Wei Khong Chua1 and Jung-Jae Kim2
   1
       Nanyang Technological University, Singapore watsonchua@pmail.ntu.edu.sg
        2
          Nanyang Technological University, Singapore jungjae.kim@ntu.edu.sg


        Abstract. While the primary objective of an ontology alignment tool is to iden-
        tify as many correct correspondences as possible, efficiency in terms of run-time
        needs to be achieved for practical usage. Not only does run-time efficiency enable
        scalability, it also facilitates information integration for time-critical applications
        using heterogeneous ontologies. In this paper, we present our ontology alignment
        approach known as Eff2Match which aligns a pair of ontologies with high accu-
        racy and low runtime.


1 Presentation of the system
Ontologies are being widely used for semantic representation in applications from var-
ious domains such as biomedical informatics [3] and earth sciences [5]. They can be
used to provide data with semantics, thus resolving the heterogeneity problem be-
tween information sources at the data level. However, the problem is only partially
resolved because different ontology engineers model their ontologies differently. There-
fore, the heterogeneity problem is escalated to the ontological level if information is
to be shared between applications using different ontologies. As a result, ontology
alignment tools that can achieve high accuracy are required. In this paper, we present
Eff2Match (pronounced “Eff Squared match”), an Effective and Efficient ontology
matching tool which can match a pair of ontologies with good accuracy within a short
amount of time. Eff2Match uses an effective and dynamic candidate reduction technique
to avoid performing unnecessary comparisons, thereby achieving high efficiency.

1.1 State, purpose, general statement
In order to facilitate the sharing of information among applications using different on-
tologies, we have developed an automatic ontology alignment tool called Eff2Match. It
is able to align both concepts and properties in different ontologies that are semantically
equivalent (concepts are matched to concepts and properties are matched to properties).
The current implementation does not match the instances in the ontologies.

1.2 Specific techniques used
Eff2Match takes as input the URI of a pair of ontologies to be aligned and matches
entities (concepts or properties) in the source ontology to those in the target ontology.
The alignment process consists of four stages: 1) Anchor Generation, 2) Candidates
Generation, 3) Anchor Expansion and 4) Iterative Score Boosting as shown in Fig. 1 .
                                                  Candidate
                                                    Set
                                                                                     Remove pairs in Anew
     O1                O2


                            Remove pairs in Ain


          1) Anchors                           2) Candidate                     3) Anchors                  4) Iterative
          Generation                            Generation                      Expansion                    Boosting


                                Initial                          New
                              anchor set,                     anchor set,
                                  Ain                            Anew


                                                                 U


                                                                                      Expanded                 New
                                                                                      anchor set,             matches
                                                                                         Aexp                  Mnew

                                               Final
                                            Alignment                       U
                                                Afin


                                            Fig. 1. Eff2Match Algorithm Flow


Anchor Generation In the Anchor Generation stage, matching entities are identified
using an exact string matching technique. Local names and labels of entities e2j in the
target ontology are first preprocessed through camel case conversion, case-normalization
and removal of delimiters. A hash-table is then used to map the preprocessed local
names and labels to their corresponding entities. After that, we preprocess the local
name and label for each entity e1i in the source ontology and look them up in the
hash-table. If either a matching local name or label can be found in the hash-table,
we consider the corresponding entity e2j in the target ontology to be equivalent to the
source entity. This method is significantly faster than a pairwise comparison of local
names and labels as it takes only O(n1 ) + O(n2 ) time compared to the latter which
requires O(n1 × n2 ) where n1 and n2 are the number of entities in the source ontology
O1 and target ontology O2 respectively.


Candidate Generation In the Candidates Generation stage, we enumerate candidates
for entities in the source ontology that has not been matched in the previous stage using
a Vector Space Model (VSM) approach. For each concept, we generated three VSM
vectors from the annotations (local name, label and comments) in the ancestors (V eca ),
descendants (V ecd ) and the concept (V ecc ) itself. For each property, the vectors gen-
erated consist of the annotations in the property (V ecp ) itself, the property’s domain
concepts (V ecdo ) and range concepts(V ecr ). The VSM similarity (SimV SM ) between
two concepts c1i and c2j is an aggregation of the cosine similarity between the concept
VSM vectors, ancestor VSM vectors and descendant VSM vectors using a weighted
average.

                α × V ecc1i · V ecc2j + β × V eca2i · V eca2j + γ × V ecd2i · V ecd2j
  SimV SM =
                                            α+β+γ

where α, β and γ are the weights given to the similarity between annotations of the
concepts themselves, annotations of their ancestor concepts and annotations of their
descendant concepts respectively. The similarity values for properties are calculated in
a similar manner and two matrices, Mcon and Mprop are used to store the similarity
values for concepts and properties respectively. The VSM similarities are normalised to
[0,1] by dividing each entry in Mcon and Mprop by their largest value to get the nor-
malised VSM similarity, SimV SMN (C1i , C2j ). Candidates selection is then performed
for each source entity by taking the top-K entities in the target ontology according to
their VSM similarities.


Anchor Expansion In the anchor expansion stage, more equivalent pairs of entities are
identified by comparing the source entities with their candidate entities using termino-
logical methods. In Eff2Match, a term-removing algorithm (TRA) is used for efficiency
purposes and the algorithm is illustrated in Fig. 2. First, the labels (local names) of a
pair of entities are tokenised. The tokens are then stemmed and words that are stemmed
to the same form are removed from both labels. If there are tokens remaining in both
the labels, the next stage compares tokens from different labels pairwise using WordNet
to determine if they are synonyms of each other. Synonymous tokens are then removed
from the labels. If there are no tokens remaining in both the labels after any stage, the
two entities are considered to be equivalent and added to the anchor set.
    If there are remaining tokens in only one of the labels and not the other, we use
a novel technique known as Informative Word Matching (IWM) to determine if the
entities are matches. If concept C1i contains p more terms than C2j and these p terms
occur in the labels of the ancestors of C2j , we can consider C1i and C2j to have the
same meaning. For example, if C1i has the label Heart Endocardium and C2j has the
label Endocardium and the C2j is a sub-concept of Heart Part, we can determine that
C1i and C2j are semantically equivalent from their labels since the word Heart in C1i
is not informative.
    Given a pair of entities eemp and erem where eemp is the entity without remain-
ing tokens in its label and erem is the entity with p remaining tokens in its label, the
following steps are performed to determine if the p remaining tokens are informative
words:

 1. Collect the labels of ancestors of eemp up to r generations or when the root of the
    ontology is reached, whichever is earlier.
 2. Tokenise and stem the collection of labels collected to get a set of stemmed ancestor
    tokens Ssat .
 3. For each token ti , i ∈ [1..p] remaining in erem , stem ti and check if it exists in
    Ssat . If it does, the word is not informative and it is removed it from erem .
 4. Look up the definition of the original label of eemp in WordNet, tokenise and stem
    the words in the definition before adding them to the set of definition words, Sdef .
 5. For each token ti , i ∈ [1..p] remaining in erem , stem ti and check if it exists in
    Sdef . If it does, the word is not informative and it is removed it from erem .


                                      Remove tokens                           Remove tokens
                Tokenise and
                                      stemmed to the                             that are
              remove stopwords
                                         same form                             synonymous


                                       Are both labels           No                      Are both labels
                                           empty?                                            empty?


                                             No                                                 No


                                        Is one of the                                      Is one of the
                                       labels empty?                                      labels empty?


                                                               Check
                                          Yes            informativeness of               Yes
                                                            excess words

                        Yes


                                                                                                           Yes
                                                            Are excess
                                                               words
                                                           informative?


                                                                No

                                                         Add entity pair to
                                                         expanded anchor
                                                               set


                                 Fig. 2. Anchor Expansion Flow Chart


Iterative Boosting In the final stage of the matching process, an iterative boosting (Iter-
Boost) process is used to identify more pairs of equivalent concepts using the expanded
anchor set Aexp . In this stage, the algorithm attempts to match the source concepts that
have not been matched with their candidates iteratively. In each iteration, the source
concepts are ranked based on the sum of their ancestors and descendants with matches
in Aexp . The top K source concepts are then selected and a formula is used to boost the
score of their candidates based on the number of common ancestors and descendants
that they share. Given a source concept C1i and a candidate concept C2j , the structural
overlap SO(C1i , C2j ) between them is calculated using:

                                    |ξ(A(C1i ), A(C2j ))| + |ξ(D(C1i ), D(C2j ))|
      SO(C1i , C2j ) =
                                 min(|A(C1i )|, |A(C2j )|) + min(|D(C1i )|, |D(C2j )|)

where A(C) is the set of ancestors of concept C and D(C) is the set of descendants
of concept C and the function ξ(X, Y ) enumerates the set of concepts in X which
have equivalences in Y. The equivalences were determined from the comparison in the
previous stages. The similarity between C1i and C2j is then given by:
                                  ½
                                      Sim
                                      p V SMN (C1i , C2j ),         SO(C1i , C2j ) > tb
       Simboost (C1i , C2j ) =
                                        SimV SMN (C1i , C2j ),      otherwise.

The highest scoring candidate is then selected to be the matching concept and the confi-
dence that the pair matches is given by Simboost (C1i , C2j ). If Simboost (C1i , C2j ) is
greater a cut-off threshold tc , C1i and C2j are inserted into Aexp as well as the final set
of alignment and the process is repeated until all the source entities and their candidates
have been visited. If Simboost (C1i , C2j ) is less than tc , C1i and C2j are inserted into
the set of final alignment but are not considered anchors for future iterations.


1.3 Adaptations made for the evaluation

No special adaptations were made for individual tracks and all alignment processes
make use of the same set of parameters. The cut-off threshold for the correspondences
was set at 0.7 for the best F-Measure. The only external resource that we used is Word-
Net. For Informative Word Matching, we set p = 1, meaning that we only perform IFM
for entities with one remaining token after the Term-Removing Algorithm (TRA). For
IterBoost, the cut-off threshold for boosting, tb is set at 0.4 while the cut-off threshold
for matching entities to be considered anchors, tc is set at 0.5. Lastly, the weights α, β,
and γ are set to 2, 1, 1, respectively. The matcher has also been implemented as a web
service so that it can be evaluated on the SEALS platform.


1.4 Link to the system and parameters file

The Eff2Match system (jar file) and configuration files can be found at 3
http://www.cais.ntu.edu.sg/˜chua0507/OAEI/Eff2MatchSystem.zip
with installation instructions at
http://www.cais.ntu.edu.sg/˜chua0507/OAEI/Instructions.txt


1.5 Link to the set of provided alignments (in align format)

The set of alignments produced by Eff2Match can be found at
http://www.cais.ntu.edu.sg/˜chua0507/OAEI/Eff2MatchAlignments.zip


2 Results

Eff2Match participated in the benchmark, anatomy and conference tracks of the OAEI
2010 competition and the results are presented in the following subsections:
 3
     Please replace the ˜symbol by typing it in manually using your keyboard.
2.1 Benchmark

The ontologies in the benchmark dataset can be categorised into 5 different categories
according to the difficulty of matching as shown in Table 2.1. Eff2Match performs well
on all the categories, achieving an F-Measure of more than 0.75 with the exception of
the category 248 − 266. Ontologies in this category have different linguistics and struc-
tural characteristics, which are core attributes which Eff2Match relies on for finding
correspondences, thus explaining the poor results.


                               Ontologies Precision Recall F-Measure
                                 101-104     1.000 1.000       1.000
                                 201-210     0.986 0.684       0.768
                                 221-247     0.990 1.000       0.995
                                 248-266     0.929 0.502       0.591
                                 301-304     0.889 0.711       0.780

                         Table 1. Results for benchmark dataset


2.2 Anatomy

The anatomy dataset consists of two large real world ontologies, namely the Adult
Mouse Anatomy with 2247 classes and the anatomy part of the NCI Thesaurus with
3304 classes. Eff2Match’s results for this track are shown in Table 2.2 and shows that
Eff2Match can match large ontologies with high accuracy and short run-time.


              Task                       Precision Recall F-Measure Recall+ Time Taken
              Task 1 (Optimal F-Measure)    0.955 0.781       0.859 0.440      2.5 mins
              Task 2 (Optimal Precision)    0.968 0.745       0.842   N.A.     2.5 mins
              Task 3 (Optimal Recall)       0.766 0.838       0.800 0.588      2.5 mins

                           Table 2. Results for anatomy dataset


2.3 Conference

Lastly, Table 2.3 presents results for Eff2Match in the conference track for the 21 on-
tologies where reference alignments are available. As these ontologies are developed
heterogeneously, discovering the correct correspondences between them is more diffi-
cult than for the other two tracks. Eff2Match was able to achieve F-Measures ranging
from 0.4 to 0.759 for pairs of ontologies in this track.
                                  Ontologies Precision Recall F-Measure
                                     cmt-ekaw   0.316 0.545       0.400
                             conference-edas    0.303 0.588       0.400
                           conference-iasted    0.350 0.500       0.412
                                    cmt-iasted  0.267 1.000       0.421
                                   edas-iasted  0.471 0.421       0.444
                              cmt-conference    0.467 0.438       0.452
                                   cmt-confof   0.538 0.438       0.483
                            conference-ekaw     0.481 0.520       0.500
                                    edas-ekaw   0.500 0.522       0.511
                                      cmt-edas  0.385 0.769       0.513
                                  confof-edas   0.500 0.684       0.578
                                 ekaw-sigkdd    0.538 0.636       0.583
                               confof-sigkdd    0.667 0.571       0.615
                          conference-sigkdd     0.588 0.667       0.625
                          conference-confof     0.550 0.733       0.629
                                confof-iasted   0.600 0.667       0.632
                                iasted-sigkdd   0.500 0.867       0.634
                                  ekaw-iasted   0.533 0.800       0.640
                                  edas-sigkdd   0.611 0.733       0.667
                                 confof-ekaw    0.824 0.700       0.757
                                   cmt-sigkdd   0.647 0.917       0.759
                                      Average   0.506 0.653       0.555

                        Table 3. Results for conference dataset


3 General comments
3.1 Comments on the results
This is the first time that Eff2Match is participating in the OAEI competition and it has
shown good results compared to the results of other systems in the 2009 competition.
In particular, for the anatomy track, its F-Measure of 0.857 tops the best F-Measure
achieved in the OAEI 2009 anatomy track. What is more remarkable is that this was
achieved with a runtime of only around 2.5 minutes, thereby living up to its name of
being an effective and efficient matcher. In addition, its average F-Measure of 0.555 for
the conference track ranks second when compared with systems participating in OAEI
2009.

3.2 Discussions on ways to improve Eff2Match
The current implementation of Eff2Match only matches concepts and properties with
equivalence relations. The techniques used are mainly terminological and structural.
Our first proposed improvement to Eff2Match is to extend its functionalities to include
the discovery of non-equivalence correspondences. Other than the subsumption and
disjointedness correspondences defined in OWL, Eff2Match will also discover other
pre-defined relations that are common within the ontologies to be aligned. For example,
in the biomedical domain, many ontologies in the OBO foundry [1] contains the part-
of relationship but they only connect concepts within the same ontology. We intend to
discover relationships like these in a future version of Eff2Match.
    In addition, the current version of Eff2Match requires a few parameters to be set
by the user. The tuning of parameters is a manual process which can be tedious and
ineffective. Our other proposed improvement to Eff2Match is to enable it to tune the
parameters automatically, like what has been done in [2].

3.3 Comments on the OAEI 2010 procedure, test cases and measures
In this year’s OAEI competition, evaluation was done on the SEALS platform [4] for the
benchmark, anatomy and conference tracks. Matchers participating in this track have to
be implemented as web services so that they can be evaluated on the SEALS platform.
We feel that this service is very useful for us, particularly for the anatomy track. Unlike
the benchmark and conference tracks where the reference alignments are made known
to us, evaluation for the anatomy track is done using a blind test. Therefore, it is not
possible for us to observe how changes we make to the algorithm affect the results on
the anatomy dataset and it is difficult to make improvements. The SEALS platform has
alleviated this problem and made evaluation for the anatomy track possible.
    Another useful feature of the SEALS evaluation mechanism is that it shows the
correct, incorrect and missing correspondences to the participants, allowing them to
gain a greater insight of the strengths and weaknesses of their systems. Though this is
an extremely useful feature, we feel that the correspondences for the anatomy dataset
should not be shown if it were to remain a blind test. The reason is that by joining the
set of correct correspondences and the set of missing correspondences, one can easily
get hold of the complete reference alignment for the anatomy dataset.


4 Conclusion

We have presented an ontology matcher named Eff2Match that can align ontologies
efficiently and effectively. Experiments were performed on different pairs of ontologies
from three different tracks in OAEI 2010 and results show that Eff2Match is able to
match real-world ontologies accurately. In addition, it scales well to large ontologies
and can be used in applications where the ontology matching process has to be fast.


References
1. The Open Biomedical Ontologies. Available at: http://obofoundry.org/about.shtml.
2. Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan, and Arnon Rosenthal. Tuning schema
   matching software using synthetic scenarios. In Proceedings of 31st International Conference
   on Very Large Data Bases (VLDB’05), pages 994–1005, Trondheim, Norway, 2005.
3. Nadine Schuurman and Agnieszka Leszczynski. Ontologies for bioinformatics. Bioinform
   Biol Insights, 2:187–200, 2008.
4. SEALS-Semantic        Evaluation     At      Large     Scale.               Available     at:
   http://seals.inrialpes.fr/platform.
5. Semantic Web for Earth and Environmental Terminology (SWEET).                  Available at:
   http://sweet.jpl.nasa.gov/ontology/.