                   WikiMatch Results for OAEI 2012

                             Sven Hertling and Heiko Paulheim

                            Technische Universität Darmstadt
                    {hertling,paulheim}@ke.tu-darmstadt.de


        Abstract. WikiMatch is a matching tool which uses Wikipedia as an external
        knowledge resource. The overall idea is to search Wikipedia for a given concept
        and retrieve all pages describing the term. If two terms share a large number of
        common pages, we assume that the corresponding concepts have similar semantics.
        We also make use of the inter-language links between Wikipedias in different
        languages to match multilingual ontologies. The results show that this simple
        idea can keep up with state-of-the-art tools. Moreover, the results on the Multifarm
        track depend on the number of articles in the involved Wikipedias as well as on
        the number of links to the Wikipedia of the other natural language. The growth
        of Wikipedia will thus help this matcher to improve its matching quality.


1     Presentation of the system
1.1   State, purpose, general statement
WikiMatch is an element-level ontology matching tool. It uses Wikipedia as a large
source of background knowledge to determine how similar two concepts are. The
algorithm extracts all labels, comments, and URI fragments, and uses Wikipedia’s
search function to retrieve a set of articles related to each term. If the intersection
between two such sets is large, we assume that the terms have something in common
and are related to each other.
    To also deal with multilingual ontologies, all language links of the returned articles
are requested as a second step. For each language, the Jaccard coefficient of the two
sets of articles retrieved is computed, as equation (1) shows.

\[
sim(c_1, c_2) := \max_{\substack{t_i \in \{label(c_i),\, fragment(c_i),\, comment(c_i)\} \\ i \in \{1,2\}}} \frac{\#(S(t_1) \cap S(t_2))}{\#(S(t_1) \cup S(t_2))} \tag{1}
\]
     For a term $t_i$, $S(t_i)$ denotes the set of articles retrieved for it. The maximum over
all labels, comments, and URI fragments is then taken as the similarity of the two concepts.
     If Wikipedia returns a spelling suggestion for the term, a new query is issued with the
suggested search term. This is typically the case when a misspelled term is entered in the search.
An overview of the entire system is shown in Fig. 1.
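    To make the two Wikipedia lookups described above more concrete, the following Python
sketch queries the public MediaWiki API (a minimal illustration, not the authors’ implementation;
the helper names, the limit of 20 search hits, and the single follow-up on a spelling suggestion
are our own assumptions):

import requests

API_TEMPLATE = "https://{lang}.wikipedia.org/w/api.php"


def search_articles(term, lang="en", limit=20):
    """Return the set of Wikipedia article titles matching a search term."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "srlimit": limit,
        "format": "json",
    }
    data = requests.get(API_TEMPLATE.format(lang=lang), params=params).json()
    query = data.get("query", {})
    titles = {hit["title"] for hit in query.get("search", [])}
    # If the search engine proposes a spelling correction, follow it once.
    suggestion = query.get("searchinfo", {}).get("suggestion")
    if not titles and suggestion:
        params["srsearch"] = suggestion
        data = requests.get(API_TEMPLATE.format(lang=lang), params=params).json()
        titles = {hit["title"] for hit in data.get("query", {}).get("search", [])}
    return titles


def language_links(title, lang="en"):
    """Return (language, title) pairs linked from one Wikipedia article."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "max",
        "format": "json",
    }
    data = requests.get(API_TEMPLATE.format(lang=lang), params=params).json()
    pages = data.get("query", {}).get("pages", {}).values()
    return {(ll["lang"], ll["*"]) for page in pages
            for ll in page.get("langlinks", [])}

The title sets obtained for two terms can then be compared per language with the Jaccard
coefficient from equation (1).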

1.2   Specific techniques used
Our first test was to search for the whole term in Wikipedia. We call this approach
simple search. As a result, the precision is high, while the recall is very low.
[Fig. 1 diagram: the ontologies O1 and O2 are read; for each concept, the fragment, label,
and comment are used to query Wikipedia titles in the term’s own language (Lang=x, Lang=a);
the translated titles for further languages (Lang=y, Lang=z) are obtained via language links
and compared per language.]
Fig. 1. Illustration of the matching process (see [1]). As a first step, all Wikipedia articles are
requested for the language of the term. As a second step, all language links from these articles
are queried. The comparison of all these sets is done per language. The maximum over the cross
product of fragment, comment, and label is returned.


To improve the recall, we have tried another search approach, i.e., splitting each
term into individual tokens and searching for those tokens individually. For example,
the query for the string Passive conference participant will therefore consist of three
individual searches with passive, conference, and participant. Both search approaches
are shown in Fig. 2 in pseudo code.
    Our own tests showed that the individual token search (ITS) results in a better
recall, but a lower precision. Comparing the F-Measure of the two approaches, the
simple search produces the better values. Therefore, this approach was submitted.
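    For illustration, a minimal sketch of what the two strategies send to Wikipedia (the
tokenization by whitespace, underscores, and camel case is our own assumption; the paper
only states that terms are split into individual tokens):

import re


def tokenize(term):
    """Split a term at underscores, whitespace, and camel-case boundaries."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", term.replace("_", " "))
    return [token.lower() for token in spaced.split()]


def queries_simple(term):
    """Simple search: a single query containing the whole term."""
    return [term]


def queries_individual_tokens(term):
    """Individual token search (ITS): one query per token."""
    return tokenize(term)


print(queries_simple("Passive conference participant"))
# ['Passive conference participant']
print(queries_individual_tokens("ProgramCommitteeChair"))
# ['program', 'committee', 'chair']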


1.3   Adaptations made for the evaluation

No adaptations were made for the evaluation.
float getSimilarity(term1, term2) {
  titlesForTerm1 = getAllTitles(term1);
  titlesForTerm2 = getAllTitles(term2);

    commonTitles = intersectionOf(titlesForTerm1, titlesForTerm2);
    allTitles    = unionOf(titlesForTerm1, titlesForTerm2);

    // Jaccard coefficient, cf. equation (1)
    return #(commonTitles) / #(allTitles);
}

List getAllTitles(searchTerm) {
  searchTerm = removeStopwords(searchTerm);
  searchTerm = removePunctuation(searchTerm);

    resultList = emptyList();

    if(simpleSearch) {
      // simple search: one query containing the whole term
      resultList = searchWikipedia(searchTerm);
    }

    if(individualTokenSearch) {
      // individual token search: one query per token
      tokens = tokenize(searchTerm);
      for each token in tokens
        resultList = resultList + searchWikipedia(token);
    }

    // add the inter-language links of every returned article
    languageLinks = emptyList();
    for each page in resultList
      languageLinks = languageLinks + getLanguageLinks(page);

    return resultList + languageLinks;
}


          Fig. 2. Pseudo code of simple search and individual token search (see [1]).


1.4   Link to the system and parameters file

The WikiMatch tool can be downloaded from http://www.ke.tu-darmstadt.de/resources/ontology-matching/wikimatch.


2     Results

2.1   Benchmark

Since our approach is entirely element-based, removing or replacing labels or comments
results in a lower F-Measure. If only one of the describing elements is removed, WikiMatch
still works on the remaining literals and can provide good results. If there are neither labels
nor comments, the approach does not work. On the other hand, removing structural features,
such as subclass relations, does not influence the results of WikiMatch.
2.2    Anatomy
Compared to the StringEquiv baseline of OAEI 2011.5, WikiMatch does not perform well:
the recall is not much higher, while the precision is much lower (0.864 compared to 0.997).
A nontrivial mapping found by our tool is ophthalmic artery and Opthalmic Artery.

2.3    Conference
In the conference track, WikiMatch reaches an F-Measure of 0.6 for ra1. This is better
than the baseline2 from OAEI 2011.5. The same applies for ra2. Unfortunately, the
conference domain is not covered well enough in Wikipedia to match specific terms like
Chair PC and ProgramCommitteeChair. However, through the suggestion feature it is
possible to find a mapping between Sponsorship and Sponzorship.

2.4    Multifarm
On the Multifarm track, WikiMatch exploits the inter-language links from each returned
article. Therefore a mapping between different languages can be found. The best results
are achieved for matching English to Spanish (F-Measure 0.29), the worst for Chinese-
German and Chinese-Portuguese (F-Measure 0.1).
    The results on the Multifarm track strongly depend on the sizes of the involved
Wikipedias, in particular the number of articles and links to other Wikipedias. Fig. 3
depicts the results of WikiMatch in relation to the corresponding Wikipedias’ article
counts; Fig. 4 shows the results in relation to the number of links from the corresponding
Wikipedias to other Wikipedias1 . It can be observed that the results get better with larger
and more strongly inter-linked Wikipedias.
    As the number of articles and inter-Wikipedia links grow by around 2% per month
(even more rapidly for Chinese, which is currently the smallest and least interlinked
Wikipedia used in Multifarm), we expect the results of WikiMatch to improve just by
the growth of Wikipedia. The trend lines in Fig. 3 and 4 indicate that about 500,000
additional articles and Wikipedia links lead to an increase of five percentage points in
F-Measure. At the current growth rate of Wikipedia, this takes a little less than two
years.
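    As a rough sanity check of this estimate (assuming, purely for illustration, a representative
Wikipedia of about one million articles growing at a constant 2% per month), the number of
months $m$ needed to gain 500,000 articles follows from
\[
10^6 \cdot (1.02^m - 1) = 5 \cdot 10^5 \quad\Rightarrow\quad 1.02^m = 1.5 \quad\Rightarrow\quad m = \frac{\ln 1.5}{\ln 1.02} \approx 20.5,
\]
i.e., roughly 20 months, which is consistent with the figure of a little less than two years.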

2.5    Library
The library track unfortunately did not finish within one week. The reason may be the
calculation of the cross product between the concepts of the ontologies, or the generally
long times required for looking up concepts in Wikipedia. This requires a more detailed
look.

2.6    Large Biomedical Ontologies
As in the library track, the ontologies in this track are too large to be handled by
WikiMatch in its current version.
 1
   Using numbers obtained from http://stats.wikimedia.org/
[Fig. 3 plot: F-Measure (0 to 0.35) over H-mean article count (0 to 2,500,000)]

                Fig. 3. Multifarm results in relation to the corresponding Wikipedias’ article counts




[Fig. 4 plot: F-Measure (0 to 0.35) over H-mean link count (0 to 6,000,000)]

            Fig. 4. Multifarm results in relation to the corresponding Wikipedias’ inter-wiki link counts
3     General comments

3.1   Comments on the results
On the Multifarm and Conference tracks, WikiMatch shows that a simple element-based
approach can keep up with state-of-the-art tools. Especially the use of the inter-language
links in Wikipedia looks like a promising approach to deal with multi-lingual ontologies.
On the large tracks, the current approach does not scale well and did not finish in time.
    In general, like most approaches that use web data by querying the web at run-time,
WikiMatch is rather slow compared to matchers that work entirely internally or only use
local resources.


3.2   Discussions on the way to improve the proposed system
To improve the approach, we envision setting threshold values dynamically, based on
the matched ontologies. In order to cope with the run-time restrictions, it is possible not
to use WikiMatch as the only matching approach, but to first match the easy cases (i.e.,
same or very similar terms) with string-level methods, as sketched below.
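    A minimal sketch of such a cascade (the function names are our own; wikimatch_similarity
stands for the Wikipedia-based measure from Section 1, and exact string equality after
normalization is only one possible string-level method):

def cascade_similarity(term1, term2, wikimatch_similarity):
    """Match the easy cases with a cheap string comparison first and only
    fall back to the expensive Wikipedia-based measure otherwise."""
    if term1.strip().lower() == term2.strip().lower():
        return 1.0  # trivially equal labels need no Wikipedia queries
    return wikimatch_similarity(term1, term2)

This keeps the expensive Wikipedia queries for those pairs that cheap string comparison
cannot decide.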
    At the moment, WikiMatch only uses the page identifiers returned by the search,
ignoring the text snippets, i.e., the portions of the Wikipedia pages that are relevant
for the search term. Using those snippets, e.g., like WeSeE-Match does [2], could help
leverage the potential of WikiMatch more effectively.


4     Conclusion
With our work on WikiMatch, we have shown how a large general-purpose resource like
Wikipedia can be used for ontology matching. Especially the cross-linking of Wikipedias
in different languages is useful for multi-lingual ontology matching. Furthermore, we have
seen that the results of WikiMatch improve with a growing size of Wikipedia, which
indicates that the results will improve in the future merely through the growth of
Wikipedia.


References
1. Hertling, S., Paulheim, H.: WikiMatch - Using Wikipedia for Ontology Matching. In: Seventh
   International Workshop on Ontology Matching (OM 2012) (2012)
2. Paulheim, H.: WeSeE-Match Results for OAEI 2012. In: Seventh International Workshop on
   Ontology Matching (OM 2012) (2012)