<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">WikiV3 results for OAEI 2017</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Sven</forename><surname>Hertling</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Data and Web Science Group</orgName>
								<orgName type="institution">University of Mannheim</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">WikiV3 results for OAEI 2017</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">BACDAF0C6B185A1C41AEECCC84805DAA</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T09:46+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>WikiV3 is the successor of WikiMatch (a participant in OAEI 2012 and 2013), which explores Wikipedia as an external knowledge base for ontology matching. The results show that the matcher is slightly better than matchers based on string equality and can reach higher recall values. Moreover, due to the construction of the system, it is able to compute mappings in a multilingual setup.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Presentation of the system</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">State, purpose, general statement</head><p>WikiV3 is a system that exploits an external knowledge base, in this case Wikipedia. It uses the MediaWiki API to search for pages that correspond to a given resource. By following the interlanguage links of Wikipedia, the system is also able to find mappings between ontologies in different languages. These links point from a Wikipedia page to the corresponding page in another language edition of Wikipedia. In contrast to the situation for the previous version of the matcher (WikiMatch <ref type="bibr" target="#b0">[1]</ref>, which participated in OAEI 2012 and 2013), all interlanguage links are now stored in Wikidata.</p><p>Wikidata is a separate project for building a collaboratively edited knowledge base. One part of this project is to centralize the interlanguage links. WikiV3 therefore uses the text of Wikipedia to map resources to Wikidata entities, which works better than using only the text available in Wikidata. The search engine of Wikipedia is based on Elasticsearch and is wrapped by a MediaWiki extension called CirrusSearch. The service provided by this extension is heavily used by this matcher to find corresponding resources.</p><p>The general approach is shown in Figure <ref type="figure" target="#fig_0">1</ref>.</p><p>For each resource of the first ontology a list of corresponding Wikidata concepts is generated. A resource can be a class, a datatype property or an object property. These types are handled separately to ensure that no mapping is generated between resources of different types (e.g. no class is matched to a datatype or object property). In the same way a list of Wikidata IDs (WIDs) is created for each resource of the second ontology. A mapping is created whenever the WID list of a resource in ontology 2 shares at least one WID with the WID list of a resource in ontology 1. This results in an n:m mapping, which means that one concept can be mapped to multiple other concepts. This n:m mapping is reduced in a later step. 
The confidence value of a generated mapping is computed with the Jaccard index, which is defined as</p><formula xml:id="formula_1">\mathrm{confidence}(M) = \frac{|\mathrm{WID}(\mathrm{Ont1}(M)) \cap \mathrm{WID}(\mathrm{Ont2}(M))|}{|\mathrm{WID}(\mathrm{Ont1}(M)) \cup \mathrm{WID}(\mathrm{Ont2}(M))|}<label>(1)</label></formula><p>where M denotes the mapping, Ont1 and Ont2 select the mapped resource in ontology one and ontology two, respectively, and the function WID returns the set of all Wikidata IDs for that resource. The retrieval of WIDs for one resource is now described in more detail. The goal is to generate a list of WIDs that represents a given resource. In the best case there is a WID that directly represents the resource, but most of the time there are only Wikidata entries that partially represent the concept. To achieve this goal, the search API of Wikipedia is used <ref type="foot" target="#foot_0">3</ref> .</p><p>We query the search API with all labels, all comments and the fragment of the URI of each resource. Text longer than 300 characters is truncated because the endpoint does not process longer queries. Furthermore, we do not consult the endpoint if 50% of the characters are digits. Because the search endpoint is sensitive to tokenization (compare the results for "Review_preference"<ref type="foot" target="#foot_1">4</ref> and "Review preference"<ref type="foot" target="#foot_2">5</ref> ), the text is tokenized, using the following characters as split points: ",;:()?!. -". Afterwards all tokens are joined with a single space.</p><p>The search URI<ref type="foot" target="#foot_3">6</ref> is parameterized: the language variable is replaced with the ISO 639-1 language code of the literal. 
If there is no language tag, the default language of the ontology is used (the most frequent language among all its literals). The variable text is replaced with the processed string of the literal. This query also returns the search suggestions of Wikipedia, so misspellings can be detected and fixed.</p><p>The results of this API call are Wikipedia page titles. These are converted to WIDs with the page properties call<ref type="foot" target="#foot_4">7</ref>, in which the remaining variable joinedTitles is replaced with the Wikipedia page titles. For faster processing all queries are cached.</p><p>Comparing the WID lists of both ontologies yields an n:m mapping of the concepts, each with a computed confidence value. A second step uses this value to increase the precision of the matcher by filtering out all mappings below a given threshold. The threshold depends on whether the matching task is multilingual, which is detected through the default languages of both ontologies. If they differ, the threshold is not applied because the recall would drop drastically in a multilingual setup. In a monolingual setup we choose a threshold of 0.28, which means that more than a quarter of the WIDs of two resources have to match.</p><p>The confidence filter does not ensure a 1:1 mapping. Therefore an additional cardinality filter is applied: where a resource takes part in several mappings, the filter keeps the one with the best confidence score. As a last step, all mappings whose resources do not have the same host URI as the majority of the ontology are deleted. This ensures that the final mapping does not contain trivial mappings.</p></div>
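<div xmlns="http://www.tei-c.org/ns/1.0"><p>The preprocessing of a literal and the construction of the search request can be sketched as follows. This is a minimal illustration under stated assumptions, not the WikiV3 implementation: the function names prepare_query and build_search_url are hypothetical, the 50% rule is read as "at least 50%", truncation is applied after tokenization, and the split set is extended by the underscore and whitespace (suggested by the "Review_preference" example). The request parameters are those of the parameterized search URI given in the footnote.</p>
<p><code lang="python"><![CDATA[
```python
import re
from typing import Optional
from urllib.parse import urlencode

MAX_QUERY_LENGTH = 300    # from the text: longer queries are not processed
DIGIT_RATIO_LIMIT = 0.5   # from the text: skip literals with 50% digits

# Assumption: split characters from the text plus underscore and whitespace.
SPLIT_PATTERN = re.compile(r"[,;:()?!.\-_\s]+")


def prepare_query(text: str) -> Optional[str]:
    """Return the normalized search string, or None if the literal is skipped."""
    if not text:
        return None
    digit_count = sum(ch.isdigit() for ch in text)
    # Assumption: "50% of the characters" is interpreted as "at least 50%".
    if digit_count / len(text) >= DIGIT_RATIO_LIMIT:
        return None
    tokens = [token for token in SPLIT_PATTERN.split(text) if token]
    # Join tokens with a single space, then cut to the maximum length.
    query = " ".join(tokens)[:MAX_QUERY_LENGTH].rstrip()
    return query or None


def build_search_url(language: str, query: str) -> str:
    """Fill the parameterized search URI for one language and one query."""
    params = {
        "action": "query",
        "list": "search",
        "format": "json",
        "srsearch": query,
        "srinfo": "suggestion",
        "srlimit": 10,
        "srprop": "",
        "srwhat": "text",
    }
    return f"https://{language}.wikipedia.org/w/api.php?{urlencode(params)}"
```
]]></code></p>
<p>With these assumptions, prepare_query("Review_preference") yields "Review preference", and a literal such as "2017-10" is skipped because most of its characters are digits.</p></div>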
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Specific techniques used</head><p>The main technique is the use of the Wikipedia API as an external source to find mappings in Wikidata. With this information it is also possible to deal with a multilingual ontology matching setup. The filter steps of the postprocessing ensure a 1:1 mapping, which is generally applicable.</p></div>
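<div xmlns="http://www.tei-c.org/ns/1.0"><p>One way to reduce an n:m mapping to a 1:1 mapping is a greedy selection by confidence, sketched below. The paper only states that the mapping with the best confidence score is chosen, so the greedy strategy, the Mapping type and the function name one_to_one are assumptions made for illustration.</p>
<p><code lang="python"><![CDATA[
```python
from typing import List, NamedTuple, Set


class Mapping(NamedTuple):
    """Hypothetical container for one correspondence."""
    source: str        # resource URI in ontology 1
    target: str        # resource URI in ontology 2
    confidence: float  # Jaccard index of the two WID sets


def one_to_one(mappings: List[Mapping]) -> List[Mapping]:
    """Greedily keep the best-scoring mapping for each source and target."""
    used_sources: Set[str] = set()
    used_targets: Set[str] = set()
    kept: List[Mapping] = []
    # Visit candidates from highest to lowest confidence.
    for mapping in sorted(mappings, key=lambda m: m.confidence, reverse=True):
        if mapping.source in used_sources or mapping.target in used_targets:
            continue  # a better mapping already covers this resource
        kept.append(mapping)
        used_sources.add(mapping.source)
        used_targets.add(mapping.target)
    return kept
```
]]></code></p>
<p>A greedy pass is simple and deterministic for distinct scores, but it does not necessarily maximize the total confidence of the final alignment.</p></div>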
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3">Adaptations made for the evaluation</head><p>The only adaptation of the system is the threshold setting. In a multilingual setup the threshold is not applied, whereas in all other cases a value of 0.28 is used. In the context of the matching system, this value is the required overlap, expressed as a fraction, of the two WID sets that represent the matched resources.</p></div>
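<div xmlns="http://www.tei-c.org/ns/1.0"><p>The confidence computation and the threshold rule can be illustrated with a short sketch. It assumes that the confidence is the Jaccard index of the two WID sets and that mappings below 0.28 are dropped only when both ontologies share the same default language. The function names and the WIDs in the usage note are placeholders, not values from the evaluation.</p>
<p><code lang="python"><![CDATA[
```python
from typing import Set

MONOLINGUAL_THRESHOLD = 0.28  # value reported in the text


def jaccard(wids_a: Set[str], wids_b: Set[str]) -> float:
    """Jaccard index: size of the intersection divided by size of the union."""
    union = wids_a | wids_b
    if not union:
        return 0.0
    return len(wids_a & wids_b) / len(union)


def keep_mapping(wids_a: Set[str], wids_b: Set[str],
                 lang_a: str, lang_b: str) -> bool:
    """Decide whether a candidate mapping survives the confidence filter."""
    confidence = jaccard(wids_a, wids_b)
    if confidence == 0.0:
        return False  # no shared WID, so no mapping is created at all
    if lang_a != lang_b:
        return True   # multilingual setup: the threshold is not applied
    return confidence >= MONOLINGUAL_THRESHOLD
```
]]></code></p>
<p>For instance, two shared WIDs out of seven distinct ones give 2/7, approximately 0.2857, which passes the monolingual threshold, whereas one shared WID out of four gives 0.25 and is filtered out unless the default languages of the ontologies differ.</p></div>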
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.4">Link to the system and parameters file</head><p>The WikiV3 tool can be downloaded from https://www.dropbox.com/s/kqthgvci2onj472/WikiV3.zip.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Anatomy</head><p>WikiV3 has by far the highest runtime (nearly 37 minutes) because of the Wikipedia API calls. Compared with the string equivalence baseline, the system has only a slightly higher F-measure (+0.036) but a better recall (+0.112).</p><p>The system is able to match the resources listed in Table <ref type="table" target="#tab_0">1</ref>, but only with a low threshold. The more similar the label texts are, the higher the confidence. However, these examples can clearly also be found by string comparison approaches <ref type="bibr" target="#b2">[3]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Conference</head><p>In the conference track the situation is the same as in anatomy: WikiV3 is slightly better than the string equivalence baseline (+0.02 F-measure in ra1-M1). Nevertheless, it finds correspondences like http://iasted#Sponsor = http://sigkdd#Sponzor (different spelling) and http://iasted#Student_registration_fee = http://sigkdd#Registration_Student (different fragment text).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Multifarm</head><p>In the interesting case of matching different ontologies in different languages, our system achieves an F-measure of 0.25. Most problematic is the recall of 0.25, because we have already lowered the threshold for the multilingual setup. In most cases the concept at hand is not represented by its own Wikipedia article. Nevertheless, the system is able to find mappings such as the following English-German examples: Autor@de = author@en, Konferenz@de = conference@en, hat E-Mailadresse@de = has email@en, and Dokument@de = document@en.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">General comments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Comments on the results</head><p>The overall results show that WikiV3 is able to beat at least the string equivalence matching approaches in terms of F-measure. The recall values are higher than those of the baselines but could be even higher.</p><p>The main drawback of the system is that most of the resources in the ontologies are not described by exactly one concept in Wikipedia (and thus in Wikidata). Furthermore, the Elasticsearch cluster handles only small misspellings; it does not resolve semantically equivalent terms, nor does it apply more sophisticated techniques such as query rewriting or machine learning. On the other hand, this allows reproducible results when a specific version of the CirrusSearch dumps is fixed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Discussions on the way to improve the proposed system</head><p>One improvement concerns the runtime of WikiV3, since each call to the Wikipedia API takes a lot of time. A future version of this matcher could replicate the CirrusSearch dumps<ref type="foot" target="#foot_7">10</ref> with the given settings<ref type="foot" target="#foot_8">11</ref> and mapping<ref type="foot" target="#foot_9">12</ref> files. Querying the resulting Elasticsearch cluster is also possible because the corresponding query can be retrieved<ref type="foot" target="#foot_10">13</ref>. With this information an in-depth analysis of the results is feasible. This setup also makes it possible to change the index settings and preprocessing steps to further improve the results.</p><p>In the classification of elementary matching approaches <ref type="bibr" target="#b1">[2]</ref>, the system works at the syntactic element level and does not use any graph-based or model-based techniques. This is a desired property of this matching system, but it could be extended to also use structural information.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusions</head><p>In this paper we analyzed the results of WikiV3, an ontology matching system that explores Wikipedia as an external knowledge base. It is able to find more correspondences than a simple string comparison approach. Nevertheless, it is only slightly better than such an approach in terms of F-measure. Thus such a mapping approach can be used as an intermediate step to increase the recall, also in multilingual setups.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Matching strategy of WikiV3</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1.</head><label>1</label><figDesc>True positive matches in Anatomy</figDesc><table><row><cell>left label</cell><cell>confidence</cell><cell>right label</cell></row><row><cell>osseus spiral lamina</cell><cell>0.2857</cell><cell>Lamina Spiralis Ossea</cell></row><row><cell>thoracic vertebra 9</cell><cell>0.3333</cell><cell>T9 Vertebra</cell></row><row><cell>trigeminal V spinal sensory nucleus</cell><cell>0.3333</cell><cell>Nucleus of the Spinal Tract of the Trigeminal Nerve</cell></row><row><cell>zygomatic bone</cell><cell>0.3333</cell><cell>Zygomatic Arch</cell></row><row><cell>lumbar vertebra 2</cell><cell>0.3333</cell><cell>L2 Vertebra</cell></row><row><cell>nasopharyngeal tonsil</cell><cell>0.3333</cell><cell>Pharyngeal Tonsil</cell></row><row><cell>endocrine pancreas secretion</cell><cell>0.3636</cell><cell>Pancreatic Endocrine Secretion</cell></row><row><cell>synovium<ref type="foot" target="#foot_5">8</ref></cell><cell>0.4000</cell><cell>Synovial Membrane</cell></row><row><cell>xiphoid cartilage<ref type="foot" target="#foot_6">9</ref></cell><cell>0.4286</cell><cell>Xiphoid Process</cell></row></table></figure>
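<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2.</head><label>2</label><figDesc>True positive matches in Multifarm</figDesc><table><row><cell>left label</cell><cell>right label</cell></row><row><cell>Autor@de</cell><cell>author@en</cell></row><row><cell>Konferenz@de</cell><cell>conference@en</cell></row><row><cell>hat E-Mailadresse@de</cell><cell>has email@en</cell></row><row><cell>Dokument@de</cell><cell>document@en</cell></row></table></figure>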
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">https://www.mediawiki.org/wiki/API:Search</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1">http://en.wikipedia.org/w/index.php?search=Review_preference</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2">http://en.wikipedia.org/w/index.php?search=Review+preference</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_3">https://{language}.wikipedia.org/w/api.php?action=query&amp;list=search&amp;format=json&amp;srsearch={text}&amp;srinfo=suggestion&amp;srlimit=10&amp;srprop=&amp;srwhat=text</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_4">https://{language}.wikipedia.org/w/api.php?action=query&amp;prop=pageprops&amp;format=json&amp;titles={joinedTitles}&amp;ppprop=wikibase_item</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_5">https://en.wikipedia.org/wiki/Synovial_membrane</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_6">https://en.wikipedia.org/w/index.php?search=xiphoid+cartilage&amp;title=Special:Search</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_7">https://dumps.wikimedia.org/other/cirrussearch/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_8">https://en.wikipedia.org/w/api.php?action=cirrus-settings-dump&amp;formatversion=2</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="12" xml:id="foot_9">https://en.wikipedia.org/w/api.php?action=cirrus-mapping-dump&amp;formatversion=2</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="13" xml:id="foot_10">https://en.wikipedia.org/w/index.php?title=Special:Search&amp;cirrusDumpQuery=&amp;search=cat+dog+chicken</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">WikiMatch - using Wikipedia for ontology matching</title>
		<author>
			<persName><forename type="first">S</forename><surname>Hertling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Paulheim</surname></persName>
		</author>
		<ptr target="http://ub-madoc.bib.uni-mannheim.de/33071/" />
	</analytic>
	<monogr>
		<title level="m">Ontology Matching : Proceedings of the 7th International Workshop on Ontology Matching (OM-2012) collocated with the 11th International Semantic Web Conference (ISWC-2012)</title>
				<meeting><address><addrLine>RWTH, Aachen</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="volume">946</biblScope>
			<biblScope unit="page" from="37" to="48" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A survey of schema-based matching approaches</title>
		<author>
			<persName><forename type="first">P</forename><surname>Shvaiko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Euzenat</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Journal on Data Semantics IV</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">S</forename><surname>Spaccapietra</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="volume">3730</biblScope>
			<biblScope unit="page" from="146" to="171" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A replication study: understanding what drives the performance in WikiMatch</title>
		<author>
			<persName><forename type="first">L</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cheatham</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Ontology Matching : Proceedings of the 12th International Workshop on Ontology Matching collocated with the 16th International Semantic Web Conference (ISWC-2017)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note>to appear</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
