<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>WikiV3 results for OAEI 2017</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Data and Web Science Group, University of Mannheim</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>WikiV3 is the successor of WikiMatch (which participated in OAEI 2012 and 2013) and explores Wikipedia as an external knowledge base for ontology matching. The results show that the matcher is slightly better than matchers based on string equality and achieves higher recall values. Moreover, due to the construction of the system, it is able to compute mappings in a multilingual setup.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>Presentation of the system</title>
      <sec id="sec-2-1">
        <title>State, purpose, general statement</title>
<p>
          WikiV3 is a system which exploits external knowledge bases, in this case Wikipedia.
It uses the MediaWiki API and searches for pages which correspond to a given
resource. By exploring the interlanguage links of Wikipedia (https://en.wikipedia.org/wiki/Help:Interlanguage_links) the system is also
able to find mappings between ontologies of different languages. These links point
from a Wikipedia page to the corresponding page in a Wikipedia of a different
language. In contrast to the previous version of the matcher (WikiMatch [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which
participated in OAEI 2012 and 2013), all interlanguage links are now stored in
Wikidata.
        </p>
<p>Wikidata is a separate project which allows building a collaboratively edited
knowledge base. One part of this project is to centralize the interlanguage links.
Thus the text of Wikipedia can be used to map to Wikidata entities, which works
better than using only the text available in Wikidata itself. The search engine of Wikipedia is based
on Elasticsearch and is wrapped by a MediaWiki plugin called CirrusSearch (https://www.mediawiki.org/wiki/Help:CirrusSearch).
The service provided by this plugin is heavily used by this matcher to find
corresponding resources.</p>
<p>The general approach is shown in figure 1.</p>
<p>For each resource of the first ontology a list of corresponding Wikidata
concepts is generated. A resource can be a class, a datatype property or an object
property. All of them are handled separately to ensure that no mapping between
different types of resources is generated (e.g. no class is matched to a datatype
or object property). In the same way a list of Wikidata IDs (WIDs) is created
for the second ontology. If at least one WID of a list in ontology 2 also
appears in a list of WIDs in ontology 1, then a mapping is created. This yields the
mapping M = { (r1, r2) | WID(Ont1(r1)) ∩ WID(Ont2(r2)) ≠ ∅ },</p>
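<p>As a minimal sketch (all names are hypothetical; the WID retrieval itself is described below), the candidate generation by WID overlap could look like this:</p>
<preformat>
```python
def candidate_mappings(wids_ont1, wids_ont2):
    """Create a candidate mapping whenever the WID sets of two
    resources (of the same type) share at least one Wikidata ID.

    wids_ont1/wids_ont2: dict mapping a resource URI to the set of
    Wikidata IDs (WIDs) retrieved for that resource.
    """
    mappings = []
    for r1, wids1 in wids_ont1.items():
        for r2, wids2 in wids_ont2.items():
            if wids1.intersection(wids2):  # at least one shared WID
                mappings.append((r1, r2))
    return mappings

# Hypothetical example: two tiny ontologies
ont1 = {"o1#Person": {"Q215627", "Q5"}, "o1#Paper": {"Q13442814"}}
ont2 = {"o2#Human": {"Q5"}, "o2#Article": {"Q191067"}}
print(candidate_mappings(ont1, ont2))  # only Person/Human share Q5
```
</preformat>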
<p>[Figure 1: General approach. For each resource (Resource1, Resource2) of Ontology 1 and Ontology 2, the fragment, label and comment texts are sent to the search API, and a maximum of 10 Wikidata IDs per text is collected.]</p>
        <sec id="sec-2-1-2">
          <title>Wikidata IDs</title>
<p>where M represents the mapping, Ont1 and Ont2 select the corresponding
resource in ontology one or two, and the function WID returns the set of all
Wikidata IDs for the corresponding resource.</p>
<p>The retrieval of WIDs for one resource is now described in more detail. The
goal is to generate a list of WIDs which represents a given resource. In the best
case there is a WID which directly represents the resource, but most of the time
there will only be Wikidata entries which partially represent the concept. To
achieve this goal, the search API of Wikipedia is used (https://www.mediawiki.org/wiki/API:Search).</p>
<p>We query the search API for all labels, all comments and the fragment of
the URI of each resource. The text is truncated if it is longer than 300
characters because otherwise the endpoint does not process the query. Furthermore
we do not consult the endpoint if 50% of the characters are numbers. Because
the search endpoint is sensitive to tokenization (compare the results for
"Review_preference" (http://en.wikipedia.org/w/index.php?search=Review_preference) and "Review preference" (http://en.wikipedia.org/w/index.php?search=Review+preference)), the text is tokenized (using
the following characters as splitting points: ",;:()?!. - "). Afterwards all tokens
are joined with a single whitespace.</p>
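<p>The preprocessing rules above could be sketched as follows (an illustration under the stated rules; the exact implementation is not given in the paper):</p>
<preformat>
```python
import re

SPLIT_CHARS = ',;:()?!. -'  # characters used as splitting points

def preprocess(text, max_len=300):
    """Normalize a label/comment before querying the search endpoint:
    truncate long texts, skip mostly-numeric texts, re-tokenize."""
    if len(text) > max_len:            # endpoint rejects long queries
        text = text[:max_len]
    digits = sum(c.isdigit() for c in text)
    if text and digits / len(text) >= 0.5:
        return None                    # do not consult the endpoint
    tokens = re.split('[' + re.escape(SPLIT_CHARS) + ']+', text)
    return ' '.join(t for t in tokens if t)

print(preprocess('Review preference'))  # tokens rejoined with a space
print(preprocess('osseus-spiral.lamina'))
print(preprocess('123456'))             # mostly numeric: skipped
```
</preformat>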
<p>The search URI (https://{language}.wikipedia.org/w/api.php?action=query&amp;list=search&amp;format=json&amp;srsearch={text}&amp;srinfo=suggestion&amp;srlimit=10&amp;srprop=&amp;srwhat=text) is parameterized: the language variable is replaced with
the ISO 639-1 language code of the literal. In case there is no language tag, the
default language of the ontology is used (the most frequently used language over all literals).
The variable text is replaced with the processed string of the literal. With this
query the suggestions of Wikipedia are also explored. Thus misspellings can be
detected and fixed.</p>
<p>The results of this API call are Wikipedia page titles. These are converted to
WIDs by using the page properties call (https://{language}.wikipedia.org/w/api.php?action=query&amp;prop=pageprops&amp;format=json&amp;titles={joinedTitles}&amp;ppprop=wikibase_item), where the variable joinedTitles
is replaced with the Wikipedia page titles. For faster processing all queries are
cached.</p>
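<p>Only the construction of this page properties request is sketched below (the parameters follow the API call cited above; the function name is hypothetical, and fetching and caching are omitted):</p>
<preformat>
```python
from urllib.parse import urlencode

def pageprops_url(language, titles):
    """Build the MediaWiki pageprops query that maps Wikipedia page
    titles to their Wikidata items (ppprop=wikibase_item)."""
    params = {
        'action': 'query',
        'prop': 'pageprops',
        'format': 'json',
        'titles': '|'.join(titles),   # the joinedTitles variable
        'ppprop': 'wikibase_item',
    }
    return f'https://{language}.wikipedia.org/w/api.php?' + urlencode(params)

print(pageprops_url('en', ['Zygomatic bone', 'Synovial membrane']))
```
</preformat>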
<p>After comparing the WID lists from each ontology, the result is an n:m mapping
of the concepts with a computed confidence value, which is used in a second step
to increase the precision of the matcher. This step filters all mappings below
a given threshold. There are two different thresholds depending on whether the matching
task is multilingual or not. This is detected through the default languages of
both ontologies. If they differ, the threshold is not applied, because in a
multilingual setup the recall would otherwise drop drastically. In a monolingual setup we
choose a threshold of 0.28, which means that more than a quarter of the WIDs
of two resources have to match.</p>
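<p>The confidence and threshold step could be sketched like this (a Jaccard-style overlap ratio is assumed here, consistent with the listed confidence values; the paper itself only states that confidence measures the overlap of the two WID sets):</p>
<preformat>
```python
def confidence(wids1, wids2):
    """Assumed overlap measure: shared WIDs divided by all WIDs."""
    if not wids1 or not wids2:
        return 0.0
    inter = wids1.intersection(wids2)
    union = wids1.union(wids2)
    return len(inter) / len(union)

def apply_threshold(mappings, multilingual, threshold=0.28):
    """Filter mappings by confidence; skipped in the multilingual case."""
    if multilingual:
        return mappings           # no threshold: recall would drop
    return [(r1, r2, c) for (r1, r2, c) in mappings if c > threshold]

m = [('a', 'x', 0.25), ('b', 'y', 0.30)]
print(apply_threshold(m, multilingual=False))  # keeps only ('b', 'y', 0.30)
print(apply_threshold(m, multilingual=True))   # unchanged
```
</preformat>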
<p>The confidence filter does not ensure that we get a 1:1 mapping. Therefore an
additional cardinality filter is applied. In case there is an n:m mapping it chooses
the one with the best confidence score. As a last step, all mappings which do not
have the same host URI as the majority of the ontology are deleted. This
ensures that the final mapping does not contain trivial mappings.</p>
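<p>The cardinality filter could be realized as a greedy 1:1 extraction by descending confidence (a simplification; the paper does not specify the extraction algorithm, and the host-URI filter is omitted here):</p>
<preformat>
```python
def one_to_one(mappings):
    """Greedy 1:1 filter: process mappings by descending confidence
    and keep a correspondence only if both resources are still free."""
    used1, used2 = set(), set()
    result = []
    for r1, r2, conf in sorted(mappings, key=lambda m: -m[2]):
        if r1 not in used1 and r2 not in used2:
            result.append((r1, r2, conf))
            used1.add(r1)
            used2.add(r2)
    return result

m = [('a', 'x', 0.4), ('a', 'y', 0.3), ('b', 'y', 0.5)]
print(one_to_one(m))  # ('b','y') takes y first, then ('a','x'); ('a','y') dropped
```
</preformat>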
        </sec>
      </sec>
      <sec id="sec-2-2">
<title>Specific techniques used</title>
<p>The main technique is the usage of the Wikipedia API as an external source to find
mappings in Wikidata. With this information it is also possible to deal with
a multilingual ontology matching setup. The filter steps of the postprocessing
ensure a 1:1 mapping, which is generally applicable.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Adaptations made for the evaluation</title>
<p>The only adaptation of the system is the threshold setting. In a multilingual setup
the threshold is not applied, whereas in all other cases a value of 0.28 is used. In
the context of the matching system this value represents the percentage overlap
of two sets consisting of WIDs representing a resource.</p>
      </sec>
      <sec id="sec-2-4">
<title>Link to the system and parameters file</title>
<p>The WikiV3 tool can be downloaded from
https://www.dropbox.com/s/kqthgvci2onj472/WikiV3.zip.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <sec id="sec-3-1">
        <title>Anatomy</title>
<p>WikiV3 has by far the highest runtime due to the Wikipedia API calls (nearly 37
minutes). In comparison to the string equivalence baseline the system has only
a slightly higher F-measure (+0.036), but a better recall (+0.112).</p>
<p>The system is able to match the following resources, but only with a low
threshold:</p>
<p>left label | confidence | right label
osseus spiral lamina | 0.2857 | Lamina Spiralis Ossea
thoracic vertebra 9 | 0.3333 | T9 Vertebra
trigeminal V spinal sensory nucleus | 0.3333 | Nucleus of the Spinal Tract of the Trigeminal Nerve
zygomatic bone | 0.3333 | Zygomatic Arch
lumbar vertebra 2 | 0.3333 | L2 Vertebra
nasopharyngeal tonsil | 0.3333 | Pharyngeal Tonsil
endocrine pancreas secretion | 0.3636 | Pancreatic Endocrine Secretion
synovium | 0.4000 | Synovial Membrane
xiphoid cartilage | 0.4286 | Xiphoid Process</p>
<p>
          The more similar the two labels are, the higher the confidence gets. But such examples can clearly also be found by string comparison approaches [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Conference</title>
<p>In the conference track the situation is the same as in anatomy. WikiV3 is slightly better
than the string equivalence baseline (+0.02 F-measure in ra1-M1).
Nevertheless it finds correspondences like http://iasted#Sponsor = http://sigkdd#Sponzor
(different spelling) and http://iasted#Student_registration_fee
= http://sigkdd#Registration_Student (different fragment text).</p>
      </sec>
      <sec id="sec-3-3">
        <title>Multifarm</title>
<p>In the interesting case of matching ontologies in different languages our
system achieves an F-measure of 0.25. Most problematic is the recall of 0.25, even though
we already reduced the threshold in the multilingual setup. In most cases the
concept at hand is not represented by its own Wikipedia article. Nevertheless
the system is able to find mappings, for example in the English-German case.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>General comments</title>
      <sec id="sec-4-1">
        <title>Comments on the results</title>
<p>The overall results show that WikiV3 is able to beat at least the string
equivalence matching approaches in terms of F-measure. The recall values are higher
than those of the baselines, but could be even higher.</p>
        <p>The main drawback of the system is that most of the resources in the
ontologies are not described by exactly one concept in Wikipedia (and thus Wikidata).
Furthermore, the Elasticsearch cluster can only deal with small misspellings,
not with semantically equivalent terms, and no more sophisticated approaches like
rewriting the query or applying machine learning are used. On the other hand, this allows
reproducible results when fixing a specific version of the CirrusSearch dumps.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Discussions on the way to improve the proposed system</title>
<p>One improvement concerns the runtime of WikiV3. Each call to the Wikipedia API
costs a lot of time. A future version of this matcher could
replicate the CirrusSearch dumps (https://dumps.wikimedia.org/other/cirrussearch/) with the corresponding settings (https://en.wikipedia.org/w/api.php?action=cirrus-settings-dump&amp;formatversion=2) and mapping (https://en.wikipedia.org/w/api.php?action=cirrus-mapping-dump&amp;formatversion=2)
files. Querying such an Elasticsearch cluster directly is also possible because the
corresponding query can be retrieved (https://en.wikipedia.org/w/index.php?title=Special:Search&amp;cirrusDumpQuery=&amp;search=cat+dog+chicken). With this information an in-depth analysis
of the results is feasible. This setup enables changing the index settings and
preprocessing steps to further improve the results.</p>
<p>
          In the classification of elementary matching approaches [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] the system works
at the syntactic element level and does not use any graph or model based
techniques. This is a desired property for this matching system, but it could be extended
to also use structural information.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
<p>In this paper we analyzed the results for WikiV3, an ontology matching
system which explores Wikipedia as an external knowledge base. It is able to find
more correspondences than a simple string comparison approach. Nevertheless
it is only slightly better in terms of F-measure. Thus such a mapping
approach can be used as an intermediate step to increase recall, also in
multilingual setups.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Hertling</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
          </string-name>
          , H.:
          <article-title>Wikimatch - using wikipedia for ontology matching</article-title>
          .
          <source>In: Ontology Matching : Proceedings of the 7th International Workshop on Ontology Matching (OM-</source>
          <year>2012</year>
          <article-title>) collocated with the 11th International Semantic Web Conference (ISWC-</article-title>
          <year>2012</year>
          ). vol.
          <volume>946</volume>
          , pp.
          <volume>37</volume>
–
          <fpage>48</fpage>
. RWTH, Aachen (
          <year>2012</year>
          ), http://ub-madoc.bib.uni-mannheim.de/33071/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Shvaiko</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Euzenat</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A survey of schema-based matching approaches</article-title>
          . In: Spaccapietra,
          <string-name>
            <surname>S</surname>
          </string-name>
          . (ed.)
          <source>Journal on Data Semantics IV, Lecture Notes in Computer Science</source>
          , vol.
          <volume>3730</volume>
          , pp.
          <volume>146</volume>
–
          <fpage>171</fpage>
          . Springer Berlin Heidelberg (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheatham</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>A replication study: understanding what drives the performance in wikimatch</article-title>
          .
          <source>In: Ontology Matching : Proceedings of the 12th International Workshop on Ontology Matching collocated with the 16th International Semantic Web Conference (ISWC-2017)</source>
          (
          <year>2017</year>
          ), to appear
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>