<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>WikiV3 results for OAEI 2017</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Data and Web Science Group, University of Mannheim</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>WikiV3 is the successor of WikiMatch (which participated in OAEI 2012 and 2013) and explores Wikipedia as an external knowledge base for ontology matching. The results show that the matcher is slightly better than matchers based on string equality and achieves higher recall values. Moreover, due to the construction of the system, it is able to compute mappings in a multilingual setup.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>Presentation of the system</title>
      <sec id="sec-2-1">
        <title>State, purpose, general statement</title>
<p>
          WikiV3 is a system which exploits external knowledge bases, in this case Wikipedia.
It uses the MediaWiki API and searches for pages which correspond to a given
resource. By exploring the interlanguage links of Wikipedia (https://en.wikipedia.org/wiki/Help:Interlanguage_links) the system is also
able to find mappings between ontologies of different languages. These links point
from a Wikipedia page to the corresponding page in a Wikipedia of a different
language. In contrast to the previous version of the matcher (WikiMatch [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which
participated in OAEI 2012 and 2013), all interlanguage links are now stored in
Wikidata.
        </p>
<p>Wikidata is a separate project which allows building a collaboratively edited
knowledge base. One part of this project is to centralize the interlanguage links.
Thus the text of Wikipedia can be used to map to Wikidata entities, which works
better than using only the text available in Wikidata itself. The search engine of Wikipedia is based
on Elasticsearch and is wrapped by a MediaWiki plugin called CirrusSearch (https://www.mediawiki.org/wiki/Help:CirrusSearch).
The service provided by this plugin is heavily used by this matcher to find
corresponding resources.</p>
<p>The general approach is shown in figure 1.</p>
<p>For each resource of the first ontology a list of corresponding Wikidata
concepts is generated. A resource can be a class, a datatype property or an object
property. All of them are handled separately to ensure that no mapping between
different types of resources is generated (e.g. no class is matched to a datatype
or object property). In the same way a list of Wikidata IDs (WIDs) is created
for the second ontology. If at least one WID of a list in ontology 2 also
appears in a list of WIDs in ontology 1, then a mapping is created. This yields the
mapping M = { (r1, r2) | WID(Ont1(r1)) ∩ WID(Ont2(r2)) ≠ ∅ },</p>
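<p>As a minimal sketch (all names are hypothetical; the WID retrieval itself is described below), the candidate generation by WID overlap could look like this:</p>
<preformat>
```python
def candidate_mappings(wids_ont1, wids_ont2):
    """Create a candidate mapping whenever the WID sets of two
    resources (of the same type) share at least one Wikidata ID.

    wids_ont1/wids_ont2: dict mapping a resource URI to the set of
    Wikidata IDs (WIDs) retrieved for that resource.
    """
    mappings = []
    for r1, wids1 in wids_ont1.items():
        for r2, wids2 in wids_ont2.items():
            if wids1.intersection(wids2):  # at least one shared WID
                mappings.append((r1, r2))
    return mappings

# Hypothetical example: two tiny ontologies
ont1 = {"o1#Person": {"Q215627", "Q5"}, "o1#Paper": {"Q13442814"}}
ont2 = {"o2#Human": {"Q5"}, "o2#Article": {"Q191067"}}
print(candidate_mappings(ont1, ont2))  # only Person/Human share Q5
```
</preformat>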
<p>[Figure 1: General approach. For each resource (Resource1, Resource2) of Ontology 1 and Ontology 2, the fragment, label and comment texts are sent to the search API, and a maximum of 10 Wikidata IDs per text is collected.]</p>
        <sec id="sec-2-1-2">
          <title>Wikidata IDs</title>
<p>where M represents the mapping, Ont1 and Ont2 select the corresponding
resource in ontology one or two, and the function WID returns the set of all
Wikidata IDs for the corresponding resource.</p>
<p>The retrieval of WIDs for one resource is now described in more detail. The
goal is to generate a list of WIDs which represents a given resource. In the best
case there is a WID which directly represents the resource, but most of the time
there will only be Wikidata entries which partially represent the concept. To
achieve this goal, the search API of Wikipedia is used (https://www.mediawiki.org/wiki/API:Search).</p>
<p>We query the search API for all labels, all comments and the fragment of
the URI of each resource. The text is truncated if it is longer than 300
characters because otherwise the endpoint does not process the query. Furthermore
we do not consult the endpoint if 50% of the characters are numbers. Because
the search endpoint is sensitive to tokenization (compare the results for
"Review_preference" (http://en.wikipedia.org/w/index.php?search=Review_preference) and "Review preference" (http://en.wikipedia.org/w/index.php?search=Review+preference)), the text is tokenized (using
the following characters as splitting points: ",;:()?!. - "). Afterwards all tokens
are joined with a single whitespace.</p>
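<p>The preprocessing rules above could be sketched as follows (an illustration under the stated rules; the exact implementation is not given in the paper):</p>
<preformat>
```python
import re

SPLIT_CHARS = ',;:()?!. -'  # characters used as splitting points

def preprocess(text, max_len=300):
    """Normalize a label/comment before querying the search endpoint:
    truncate long texts, skip mostly-numeric texts, re-tokenize."""
    if len(text) > max_len:            # endpoint rejects long queries
        text = text[:max_len]
    digits = sum(c.isdigit() for c in text)
    if text and digits / len(text) >= 0.5:
        return None                    # do not consult the endpoint
    tokens = re.split('[' + re.escape(SPLIT_CHARS) + ']+', text)
    return ' '.join(t for t in tokens if t)

print(preprocess('Review preference'))  # tokens rejoined with a space
print(preprocess('osseus-spiral.lamina'))
print(preprocess('123456'))             # mostly numeric: skipped
```
</preformat>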
<p>The search URI (https://{language}.wikipedia.org/w/api.php?action=query&amp;list=search&amp;format=json&amp;srsearch={text}&amp;srinfo=suggestion&amp;srlimit=10&amp;srprop=&amp;srwhat=text) is parameterized: the language variable is replaced with
the ISO 639-1 language code of the literal. In case there is no language tag, the
default language of the ontology is used (the most frequently used language over all literals).
The variable text is replaced with the processed string of the literal. With this
query the suggestions of Wikipedia are also explored. Thus misspellings can be
detected and fixed.</p>
<p>The results of this API call are Wikipedia page titles. These are converted to
WIDs by using the page properties call (https://{language}.wikipedia.org/w/api.php?action=query&amp;prop=pageprops&amp;format=json&amp;titles={joinedTitles}&amp;ppprop=wikibase_item), where the variable joinedTitles
is replaced with the Wikipedia page titles. For faster processing all queries are
cached.</p>
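<p>Only the construction of this page properties request is sketched below (the parameters follow the API call cited above; the function name is hypothetical, and fetching and caching are omitted):</p>
<preformat>
```python
from urllib.parse import urlencode

def pageprops_url(language, titles):
    """Build the MediaWiki pageprops query that maps Wikipedia page
    titles to their Wikidata items (ppprop=wikibase_item)."""
    params = {
        'action': 'query',
        'prop': 'pageprops',
        'format': 'json',
        'titles': '|'.join(titles),   # the joinedTitles variable
        'ppprop': 'wikibase_item',
    }
    return f'https://{language}.wikipedia.org/w/api.php?' + urlencode(params)

print(pageprops_url('en', ['Zygomatic bone', 'Synovial membrane']))
```
</preformat>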
<p>After comparing the WID lists from each ontology, the result is an n:m mapping
of the concepts with a computed confidence value, which is used in a second step
to increase the precision of the matcher. This step filters all mappings below
a given threshold. There are two different thresholds depending on whether the matching
task is multilingual or not. This is detected through the default languages of
both ontologies. If they differ, the threshold is not applied, because in a
multilingual setup the recall would otherwise drop drastically. In a monolingual setup we
choose a threshold of 0.28, which means that more than a quarter of the WIDs
of two resources have to match.</p>
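<p>The confidence and threshold step could be sketched like this (a Jaccard-style overlap ratio is assumed here, consistent with the listed confidence values; the paper itself only states that confidence measures the overlap of the two WID sets):</p>
<preformat>
```python
def confidence(wids1, wids2):
    """Assumed overlap measure: shared WIDs divided by all WIDs."""
    if not wids1 or not wids2:
        return 0.0
    inter = wids1.intersection(wids2)
    union = wids1.union(wids2)
    return len(inter) / len(union)

def apply_threshold(mappings, multilingual, threshold=0.28):
    """Filter mappings by confidence; skipped in the multilingual case."""
    if multilingual:
        return mappings           # no threshold: recall would drop
    return [(r1, r2, c) for (r1, r2, c) in mappings if c > threshold]

m = [('a', 'x', 0.25), ('b', 'y', 0.30)]
print(apply_threshold(m, multilingual=False))  # keeps only ('b', 'y', 0.30)
print(apply_threshold(m, multilingual=True))   # unchanged
```
</preformat>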
<p>The confidence filter does not ensure that we get a 1:1 mapping. Therefore an
additional cardinality filter is applied. In case there is an n:m mapping it chooses
the one with the best confidence score. As a last step, all mappings which do not
have the same host URI as the majority of the ontology are deleted. This
ensures that the final mapping does not contain trivial mappings.</p>
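<p>The cardinality filter could be realized as a greedy 1:1 extraction by descending confidence (a simplification; the paper does not specify the extraction algorithm, and the host-URI filter is omitted here):</p>
<preformat>
```python
def one_to_one(mappings):
    """Greedy 1:1 filter: process mappings by descending confidence
    and keep a correspondence only if both resources are still free."""
    used1, used2 = set(), set()
    result = []
    for r1, r2, conf in sorted(mappings, key=lambda m: -m[2]):
        if r1 not in used1 and r2 not in used2:
            result.append((r1, r2, conf))
            used1.add(r1)
            used2.add(r2)
    return result

m = [('a', 'x', 0.4), ('a', 'y', 0.3), ('b', 'y', 0.5)]
print(one_to_one(m))  # ('b','y') takes y first, then ('a','x'); ('a','y') dropped
```
</preformat>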
        </sec>
      </sec>
      <sec id="sec-2-2">
<title>Specific techniques used</title>
<p>The main technique is the usage of the Wikipedia API as an external source to find
mappings in Wikidata. With this information it is also possible to deal with
a multilingual ontology matching setup. The filter steps of the postprocessing
ensure a 1:1 mapping, which is generally applicable.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Adaptations made for the evaluation</title>
<p>The only adaptation of the system is the threshold setting. In a multilingual setup
the threshold is not applied, whereas in all other cases a value of 0.28 is used. In
the context of the matching system this value represents the percentage overlap
of two sets consisting of WIDs representing a resource.</p>
      </sec>
      <sec id="sec-2-4">
<title>Link to the system and parameters file</title>
<p>The WikiV3 tool can be downloaded from
https://www.dropbox.com/s/kqthgvci2onj472/WikiV3.zip.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <sec id="sec-3-1">
        <title>Anatomy</title>
<p>WikiV3 has by far the highest runtime due to the Wikipedia API calls (nearly 37
minutes). In comparison to the string equivalence baseline the system has only
a slightly higher F-measure (+0.036), but a better recall (+0.112).</p>
<p>The system is able to match the following resources, but only with a low
threshold:</p>
<p>left label | confidence | right label
osseus spiral lamina | 0.2857 | Lamina Spiralis Ossea
thoracic vertebra 9 | 0.3333 | T9 Vertebra
trigeminal V spinal sensory nucleus | 0.3333 | Nucleus of the Spinal Tract of the Trigeminal Nerve
zygomatic bone | 0.3333 | Zygomatic Arch
lumbar vertebra 2 | 0.3333 | L2 Vertebra
nasopharyngeal tonsil | 0.3333 | Pharyngeal Tonsil
endocrine pancreas secretion | 0.3636 | Pancreatic Endocrine Secretion
synovium | 0.4000 | Synovial Membrane
xiphoid cartilage | 0.4286 | Xiphoid Process</p>
<p>
          The more similar the two labels are, the higher the confidence gets. But such examples can clearly also be found by string comparison approaches [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Conference</title>
<p>In the conference track the situation is the same as in anatomy. WikiV3 is slightly better
than the string equivalence baseline (+0.02 F-measure in ra1-M1).
Nevertheless it finds correspondences like http://iasted#Sponsor = http://sigkdd#Sponzor
(different spelling) and http://iasted#Student_registration_fee
= http://sigkdd#Registration_Student (different fragment text).</p>
      </sec>
      <sec id="sec-3-3">
        <title>Multifarm</title>
<p>In the interesting case of matching ontologies in different languages our
system achieves an F-measure of 0.25. Most problematic is the recall of 0.25, even though
we already reduced the threshold in the multilingual setup. In most cases the
concept at hand is not represented by its own Wikipedia article. Nevertheless
the system is able to find mappings, for example in the English-German case.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>General comments</title>
      <sec id="sec-4-1">
        <title>Comments on the results</title>
<p>The overall results show that WikiV3 is able to beat at least the string
equivalence matching approaches in terms of F-measure. The recall values are higher
than those of the baselines, but could be even higher.</p>
        <p>The main drawback of the system is that most of the resources in the
ontologies are not described by exactly one concept in Wikipedia (and thus Wikidata).
Furthermore, the Elasticsearch cluster can only deal with small misspellings,
not with semantically equivalent terms, and no more sophisticated approaches like
rewriting the query or applying machine learning are used. On the other hand, this allows
reproducible results when fixing a specific version of the CirrusSearch dumps.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Discussions on the way to improve the proposed system</title>
<p>One improvement concerns the runtime of WikiV3. Each call to the Wikipedia API
costs a lot of time. A future version of this matcher could
replicate the CirrusSearch dumps (https://dumps.wikimedia.org/other/cirrussearch/) with the corresponding settings (https://en.wikipedia.org/w/api.php?action=cirrus-settings-dump&amp;formatversion=2) and mapping (https://en.wikipedia.org/w/api.php?action=cirrus-mapping-dump&amp;formatversion=2)
files. Querying such an Elasticsearch cluster directly is also possible because the
corresponding query can be retrieved (https://en.wikipedia.org/w/index.php?title=Special:Search&amp;cirrusDumpQuery=&amp;search=cat+dog+chicken). With this information an in-depth analysis
of the results is feasible. This setup enables changing the index settings and
preprocessing steps to further improve the results.</p>
<p>
          In the classification of elementary matching approaches [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] the system works
at the syntactic element level and does not use any graph or model based
techniques. This is a desired property for this matching system, but it could be extended
to also use structural information.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
<p>In this paper we analyzed the results for WikiV3, an ontology matching
system which explores Wikipedia as an external knowledge base. It is able to find
more correspondences than a simple string comparison approach. Nevertheless
it is only slightly better in terms of F-measure. Thus such a mapping
approach can be used as an intermediate step to increase recall, also in
multilingual setups.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Hertling</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
          </string-name>
          , H.:
          <article-title>Wikimatch - using wikipedia for ontology matching</article-title>
          .
          <source>In: Ontology Matching : Proceedings of the 7th International Workshop on Ontology Matching (OM-</source>
          <year>2012</year>
          <article-title>) collocated with the 11th International Semantic Web Conference (ISWC-</article-title>
          <year>2012</year>
          ). vol.
          <volume>946</volume>
          , pp.
          <volume>37</volume>
–
          <fpage>48</fpage>
. RWTH, Aachen (
          <year>2012</year>
          ), http://ub-madoc.bib.uni-mannheim.de/33071/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Shvaiko</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Euzenat</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A survey of schema-based matching approaches</article-title>
          . In: Spaccapietra,
          <string-name>
            <surname>S</surname>
          </string-name>
          . (ed.)
          <source>Journal on Data Semantics IV, Lecture Notes in Computer Science</source>
          , vol.
          <volume>3730</volume>
          , pp.
          <volume>146</volume>
–
          <fpage>171</fpage>
          . Springer Berlin Heidelberg (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheatham</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>A replication study: understanding what drives the performance in wikimatch</article-title>
          .
          <source>In: Ontology Matching : Proceedings of the 12th International Workshop on Ontology Matching collocated with the 16th International Semantic Web Conference (ISWC-2017)</source>
          (
          <year>2017</year>
          ), to appear
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>