=Paper=
{{Paper
|id=Vol-1111/oaei13_paper14
|storemode=property
|title=WeSeE-Match results for OAEI 2013
|pdfUrl=https://ceur-ws.org/Vol-1111/oaei13_paper14.pdf
|volume=Vol-1111
|dblpUrl=https://dblp.org/rec/conf/semweb/PaulheimH13a
}}
==WeSeE-Match results for OAEI 2013==
WeSeE-Match Results for OAEI 2013

Heiko Paulheim (University of Mannheim, Data and Web Science Group, heiko@informatik.uni-mannheim.de) and Sven Hertling (Technische Universität Darmstadt, Knowledge Engineering Group, hertling@ke.tu-darmstadt.de)

Abstract. WeSeE-Match is a simple, element-based ontology matching tool. Its basic technique is to invoke a web search engine request for each concept and to determine element similarity based on the similarity of the search results obtained. Multi-lingual ontologies are translated using a standard web-based translation service. Furthermore, the tool implements a simple strategy for selecting candidate mappings interactively.

1 Presentation of the system

1.1 State, purpose, general statement

The idea of WeSeE-Match is to use information on the web for matching ontologies. When developing the algorithm, we were guided by the way a human would possibly solve a matching task. Consider the following example from the OAEI anatomy track (http://oaei.ontologymatching.org/2013/anatomy/): one element of the reference alignment is the pair of classes with the labels eyelid tarsus and tarsal plate, respectively. A person not trained in anatomy might assume that they have something in common, but could not tell without doubt. For a human, the most straightforward strategy in the internet age would be to search for both terms with a search engine, look at the results, and try to figure out whether the websites returned by both searches talk about the same thing. Implicitly, what a human does is identify relevant sources of information on the web and analyze their contents for similarity with respect to the given search term. This naive algorithm is implemented in WeSeE-Match. Furthermore, WeSeE-Match uses a basic method for interactive matching, which tries to adaptively set the threshold for selecting the final alignment from the set of candidates.

1.2 Specific techniques used

The core idea of our approach is to use a web search engine for retrieving web documents that are relevant for the concepts in the ontologies to match. For getting search terms from ontology concepts (i.e., classes and properties), we use the labels, comments, and URI fragments of those concepts. The search results of all concepts are then compared to each other: the more similar the search results, the higher the concepts' similarity score.

To search for websites, we use the URI fragments, labels, and comments of each concept as search strings, and perform some pre-processing, i.e., splitting camel-case and underscore-separated words into single words, and omitting stop words. For every search, all the titles and summaries of web pages provided by the search engine are put together into one describing document. This approach allows us to parse only the search engine's answer, while avoiding the computational burden of retrieving and parsing all websites in the result sets. The answer provided by the search engine contains titles and excerpts from the websites (i.e., some sentences surrounding the occurrence of the search term in the website). Therefore, we do not use whole websites, but ideally only the relevant parts of those websites, i.e., we exploit the search engine both for information retrieval and for information extraction.

For each concept c, we perform a single search each for the fragment, the label, and the comment (if present); thus, we generate up to three documents: doc_fragment(c), doc_label(c), and doc_comment(c). The similarity score for each pair of concepts is then computed as the maximum similarity over all of the documents generated for those concepts:

  sim(c_1, c_2) := \max_{i,j \in \{fragment, label, comment\}} sim^*(doc_i(c_1), doc_j(c_2))   (1)

For computing the similarity sim^* of two documents, we compute a TF-IDF score, based on the complete set of documents retrieved for all concepts in both ontologies.

Using the TF-IDF measure for computing the similarity of the documents has several advantages. First, stop words like and, or, and so on are inherently filtered, because they occur in the majority of documents. Second, terms that are common in the domain and thus have little value for disambiguating mappings are also weighted lower. For example, the word anatomy will occur quite frequently in the anatomy track; thus, it has only little value for determining mappings there. On the other hand, in the library track, it will be a useful topic identifier and thus be helpful to identify mappings. The TF-IDF measure guarantees that the word anatomy gets weighted accordingly in each track.

The result is a score matrix with entries between 0 and 1 for each pair of concepts from both ontologies. For each row and each column in which a score exceeds τ, we return the pair of concepts with the highest score as a mapping. Furthermore, the filter chain explained in [2] is used, which removes mappings for datatype properties with different ranges, as well as mappings that refer to any imported ontologies.
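As an illustration, the following is a minimal Python sketch of this non-interactive matching pipeline, not the actual (Java-based) WeSeE-Match implementation. The search_web function is a hypothetical stand-in for the web search API and is assumed to return (title, snippet) pairs; interpreting sim^* as the cosine similarity of TF-IDF vectors is our reading of the description above, and the filter chain of [2] is omitted.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def preprocess(term):
    """Split camel-case and underscore-separated words into single words."""
    term = term.replace("_", " ")
    term = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", term)  # "tarsalPlate" -> "tarsal Plate"
    return term.lower()

def describing_document(search_term, search_web):
    """Concatenate all result titles and snippets into one describing document.

    Only the search engine's answer is parsed, not the websites themselves.
    search_web is a hypothetical function returning (title, snippet) pairs.
    """
    results = search_web(preprocess(search_term))
    return " ".join(title + " " + snippet for title, snippet in results)

def concept_similarity(docs1, docs2, vectorizer):
    """Equation (1): maximum pairwise similarity of the up to three describing
    documents (fragment, label, comment) of two concepts.

    The vectorizer must be fitted on the complete corpus of describing
    documents from both ontologies, so that the IDF weights reflect how
    common a word like "anatomy" is across the whole matching task.
    """
    best = 0.0
    for d1 in docs1.values():
        for d2 in docs2.values():
            v1, v2 = vectorizer.transform([d1]), vectorizer.transform([d2])
            best = max(best, cosine_similarity(v1, v2)[0, 0])
    return best

def extract_alignment(scores, concepts1, concepts2, tau):
    """For each row and each column of the score matrix with a score
    exceeding tau, return the pair of concepts with the highest score."""
    mappings = set()
    for i in range(len(concepts1)):
        j = max(range(len(concepts2)), key=lambda j: scores[i][j])
        if scores[i][j] > tau:
            mappings.add((concepts1[i], concepts2[j]))
    for j in range(len(concepts2)):
        i = max(range(len(concepts1)), key=lambda i: scores[i][j])
        if scores[i][j] > tau:
            mappings.add((concepts1[i], concepts2[j]))
    return mappings
```

In this sketch, the vectorizer would be obtained via TfidfVectorizer().fit(all_documents) over all describing documents from both ontologies; the score matrix is then filled with concept_similarity for every pair of concepts, and extract_alignment yields the candidate mappings before filtering.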
For multi-lingual ontologies, we first translate the fragments, labels, and comments to English as a pivot language [4]. The translated concepts are then processed as described above.

While the threshold is set to fixed values for the standard tracks in the evaluation, we use an interactive method of selecting the threshold in the interactive track. We use a binary search for selecting the threshold, as discussed in [6], using the following algorithm (a sketch in code follows at the end of this section):

1. Set τ to the average confidence of all candidates. Set τ_min = 0, τ_max = 1.
2. Present the candidate that has a confidence of τ (or the candidate whose confidence is closest to τ) to the oracle.
3. If the candidate is correct, set τ to τ_min + (τ_max − τ_min)/2 and τ_min to the previous value of τ; otherwise, set τ to τ_max − (τ_max − τ_min)/2 and τ_max to the previous value of τ.
4. If τ_max > τ_min, go to 2.
5. Select the final candidates: all candidates with a confidence above τ, plus all candidates rated positive by the oracle, minus all candidates rated negative by the oracle.

Given that the ordering of candidates is optimal, i.e., all wrong candidates have a lower confidence score than all correct candidates, this algorithm yields a threshold τ that separates the correct from the incorrect candidates.

1.3 Adaptations made for the evaluation

The 2012 version of WeSeE-Match [5] used Microsoft Bing as a search engine, as well as the Microsoft Translator API. In order to create a matching system that does not use any services requiring payment, we made the following changes for the 2013 evaluation campaign:

– The Bing web search was replaced by JFreeWebSearch (http://www.ke.tu-darmstadt.de/resources/jfreewebsearch), which encapsulates the free FAROO web search API (http://www.faroo.com/).
– The Microsoft Translator API was replaced by the Web Translator API (http://sourceforge.net/projects/webtranslator/).

The other new feature for this year's evaluation is the inclusion of the interactive post-processing method described above. The same code for handling interactive matching is also used in the Hertuda matching system [3] for the OAEI 2013 evaluation campaign. The parameter τ was set to 0.42 for multi-lingual and to 0.51 for mono-lingual matching problems in the non-interactive tracks.
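The following is a minimal sketch of the binary-search threshold selection described in Section 1.2, under assumptions not spelled out in the paper: the oracle is modeled as a callable returning True for correct candidates, each candidate is presented to the oracle at most once, and an explicit bound on oracle queries supplements step 4's loop condition, since with real-valued confidences the interval may close only slowly.

```python
def select_threshold(confidences, oracle, max_queries=20):
    """Binary search for a separating threshold (steps 1-5 in Section 1.2).

    confidences: dict mapping candidate mappings to scores in [0, 1].
    oracle: callable returning True if a candidate mapping is correct.
    max_queries is an assumption bounding the oracle interaction.
    """
    tau = sum(confidences.values()) / len(confidences)  # step 1: average confidence
    tau_min, tau_max = 0.0, 1.0
    positives, negatives = set(), set()
    for _ in range(max_queries):
        # Step 2: present the not-yet-rated candidate closest to tau.
        remaining = [c for c in confidences if c not in positives | negatives]
        if not remaining:
            break
        candidate = min(remaining, key=lambda c: abs(confidences[c] - tau))
        prev = tau
        # Step 3: bisect the interval and move one boundary to the old tau.
        if oracle(candidate):
            positives.add(candidate)
            tau = tau_min + (tau_max - tau_min) / 2
            tau_min = prev
        else:
            negatives.add(candidate)
            tau = tau_max - (tau_max - tau_min) / 2
            tau_max = prev
        if tau_max <= tau_min:  # step 4: stop once the interval has closed
            break
    # Step 5: thresholded candidates, plus oracle positives, minus negatives.
    selected = {c for c, conf in confidences.items() if conf > tau}
    return (selected | positives) - negatives
```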
1.4 Link to the system and parameters file

The system is available from http://www.ke.tu-darmstadt.de/resources/ontology-matching/wesee-match/.

2 Results

2.1 benchmark

The results from the benchmark track are not surprising. For those problems where labels, URI fragments, and/or comments are present and contain actual terms, i.e., they are not replaced by random strings, WeSeE-Match provides reasonable results. As soon as those strings are removed, the F-measure of WeSeE-Match drops, since WeSeE-Match does not use any other evidence (e.g., the ontology structure).

The runtime evaluation also reveals that WeSeE-Match is one of the slowest matching systems participating in the campaign. This is due to the fact that the search engine used restricts usage to one request per second; thus, WeSeE-Match spends a lot of idle time to fulfill that requirement.

2.2 anatomy and conference

On anatomy, WeSeE-Match is one of the worst performing matchers, suffering both in precision and recall. It is particularly interesting to see the drop from last year's performance (F-measure 0.829) to this year's (F-measure 0.47), where the only significant change was the use of a different search engine. This shows that the choice of the web search engine in search-engine-based matching can make a huge difference in the results. In the conference track, the differences to last year are not that significant, ending at a comparable F-measure of 0.55, which makes WeSeE-Match an average matcher in that track.

2.3 multifarm

For multifarm, the F-measure has dropped from 0.41 in OAEI 2012 to 0.15 in OAEI 2013. As discussed above, the performance on the English-only conference track, which underlies multifarm, has remained stable, so this is mainly an effect of using a different translation engine. Again, it becomes obvious that the choice of translation engine clearly influences the results.

In detail, the 2012 version of WeSeE-Match was capable of matching Chinese and Russian ontologies, because these languages were supported by the Microsoft Translator API, but they are not supported by the Web Translator API. Furthermore, the results for Czech are significantly below the 2012 results, which indicates suboptimal support for that language in the Web Translator API. For the other languages, the results have remained on a similar level.

2.4 library and large biomedical ontologies

Since WeSeE-Match is not optimized for scalability (in particular, the waiting times necessary to avoid being blocked by search engine providers slow down the matcher), it could not be run on those larger tracks within the time limits. However, the approach does in principle scale up to larger ontologies as well. In the OAEI 2012 campaign, for example, WeSeE-Match was capable of completing the library track when the time limit was set to one week instead of one day [1].

Table 1. Thresholds and results in the interactive track. The table depicts the best possible threshold and the F-measure achieved with that threshold, as well as the threshold chosen and the F-measure achieved by WeSeE-Match. In cases where an interval of thresholds leads to the same result, we report the average of that interval.
                 Best possible selection          Actual selection
Test case        Thr.   Prec.  Rec.   F-m.        Thr.   Prec.  Rec.   F-m.
cmt-conference   0.14   0.438  0.467  0.452       0.03   0.304  0.467  0.368
cmt-confOf       0.15   0.385  0.313  0.345       0.26   0.714  0.313  0.435
cmt-edas         0.94   0.889  0.615  0.727       1.00   0.778  0.538  0.636
cmt-ekaw         0.90   0.625  0.455  0.526       0.19   0.500  0.455  0.476
cmt-iasted       0.66   0.800  1.000  0.889       1.00   0.800  1.000  0.889
cmt-sigkdd       0.14   0.786  0.917  0.846       0.23   0.769  0.833  0.800
conf-confOf      0.97   0.583  0.467  0.519       1.00   0.714  0.333  0.455
conf-edas        0.95   0.583  0.412  0.483       1.00   1.000  0.118  0.211
conf-ekaw        0.12   0.333  0.480  0.393       0.18   0.538  0.280  0.368
conf-iasted      0.15   0.500  0.357  0.417       0.79   1.000  0.214  0.353
conf-sigkdd      0.28   0.818  0.600  0.692       0.78   0.750  0.600  0.667
confOf-edas      1.00   0.643  0.474  0.545       1.00   0.778  0.368  0.500
confOf-ekaw      0.11   0.619  0.650  0.634       1.00   0.667  0.600  0.632
confOf-iasted    0.90   0.571  0.444  0.500       1.00   1.000  0.333  0.500
confOf-sigkdd    0.69   0.667  0.571  0.615       0.19   0.500  0.429  0.462
edas-ekaw        0.60   0.526  0.435  0.476       1.00   1.000  0.130  0.231
edas-iasted      0.28   0.500  0.421  0.457       0.19   0.500  0.105  0.174
edas-sigkdd      0.95   0.875  0.467  0.609       1.00   0.875  0.467  0.609
ekaw-iasted      0.57   0.429  0.600  0.500       1.00   0.556  0.500  0.526
ekaw-sigkdd      0.81   0.875  0.636  0.737       1.00   1.000  0.273  0.429
iasted-sigkdd    0.94   0.647  0.733  0.688       1.00   0.667  0.133  0.222

2.5 interactive

The interactive matching component in WeSeE-Match (which is the same as in Hertuda) tries to find an optimal threshold for selecting the final mapping via binary search, as discussed above. Table 1 depicts the results. For comparison, we also show the optimal threshold that could have been selected in theory.

It can be observed that the thresholds chosen by our algorithm are most often far from the optimum, which is also reflected in the F-measure achieved. In most cases, the selected threshold is higher than the optimal one, i.e., the selection is biased towards precision. In fact, 1.0 is often chosen as a threshold. The reason is that mapping elements such as conference#Conference = edas#Conference, which naturally receive a confidence score of 1.0, are rated as wrong by the oracle, so that our binary search approach sets the threshold τ to the maximum possible value, i.e., 1.0.

(Note: there are cases where the F-measure achieved by the actual selection is higher than that of the best possible selection. This is due to the fact that, in a post-processing step, all information obtained from the oracle is exploited by removing the wrong mappings and including the correct ones, even if they do not fall into the interval defined by the final threshold selection. Furthermore, in some cases, the precision is lower despite a higher threshold. This is because the interactive track uses a slightly different reference alignment than the original conference track, on which the optimal thresholds were determined.)

In general, while our interactive approach would work well for ideally sorted candidates, its performance with the real confidence scores provided by WeSeE-Match (and also Hertuda) is not satisfying. Thus, using a greedy algorithm here is maybe not the best choice, and other means of determining an optimal threshold, such as estimating the F-measure based on user interaction, as discussed in [6], may be a better option for future versions.

3 Conclusion

In this paper, we have discussed the results of the 2013 version of WeSeE-Match. It can be observed that the choice of free services instead of commercial ones has changed the performance of WeSeE-Match for the worse.
The trade-off between the use of commercial services and high-quality results is not easy to address. Furthermore, it can be seen that a naive greedy approach for interactively selecting a threshold parameter does not provide satisfying results, which calls for more sophisticated methods.

References

1. José Luis Aguirre, Kai Eckert, Jérôme Euzenat, Alfio Ferrara, Willem Robert van Hage, Laura Hollink, Christian Meilicke, Andriy Nikolov, Dominique Ritze, François Scharffe, Pavel Shvaiko, Ondřej Šváb-Zamazal, Cássia Trojahn, Ernesto Jiménez-Ruiz, Bernardo Cuenca Grau, and Benjamin Zapilko. Results of the Ontology Alignment Evaluation Initiative 2012. In Seventh International Workshop on Ontology Matching (OM 2012), 2012.
2. Thanh Tung Dang, Alexander Gabriel, Sven Hertling, Philipp Roskosch, Marcel Wlotzka, Jan Ruben Zilke, Frederik Janssen, and Heiko Paulheim. HotMatch Results for OAEI 2012. In Seventh International Workshop on Ontology Matching (OM 2012), 2012.
3. Sven Hertling. Hertuda Results for OAEI 2012. In Seventh International Workshop on Ontology Matching (OM 2012), 2012.
4. Michael Paul, Hirofumi Yamamoto, Eiichiro Sumita, and Satoshi Nakamura. On the Importance of Pivot Language Selection for Statistical Machine Translation. In 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 221–224, 2009.
5. Heiko Paulheim. WeSeE-Match Results for OAEI 2012. In Seventh International Workshop on Ontology Matching (OM 2012), 2012.
6. Heiko Paulheim, Sven Hertling, and Dominique Ritze. Towards Evaluating Interactive Ontology Matching Tools. In Lecture Notes in Computer Science, 2013.