<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Demonstration of MTab: Tabular Data Annotation with Knowledge Graphs</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Phuc</forename><surname>Nguyen</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">National Institute of Informatics</orgName>
								<address>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ikuya</forename><surname>Yamada</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Studio Ousia</orgName>
								<address>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Natthawut</forename><surname>Kertkeidkachorn</surname></persName>
							<affiliation key="aff2">
								<orgName type="institution">Japan Advanced Institute of Science and Technology</orgName>
								<address>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ryutaro</forename><surname>Ichise</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">National Institute of Informatics</orgName>
								<address>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hideaki</forename><surname>Takeda</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">National Institute of Informatics</orgName>
								<address>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Demonstration of MTab: Tabular Data Annotation with Knowledge Graphs</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">4C6668B54798215CCDA6F59EE9023D0D</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T01:34+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>tabular data annotation</term>
					<term>knowledge graph</term>
					<term>semantic annotation</term>
					<term>structural annotation</term>
					<term>Wikidata</term>
					<term>Wikipedia</term>
					<term>DBpedia</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper presents a demonstration of MTab, a toolkit for annotating tabular data with knowledge graphs: Wikidata, Wikipedia, and DBpedia. MTab was the best-performing system in all semantic annotation tasks at the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab) in 2019 and 2020. This paper introduces MTab's public APIs, which provide structural and semantic annotations for tabular data. We also provide a graphical interface to visualize the annotation results. The tool supports multilingual tables and can process many table formats, such as Excel, CSV, TSV, markdown tables, or pasted table content. MTab's repository is publicly available at https://github.com/phucty/mtab_tool.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Thanks to the Open Data movement, many valuable tabular resources have been made available on the Internet and on Open Data portals. However, the use of tabular data in applications remains very limited because of missing or insufficient data descriptions, heterogeneous data formats, and vocabulary issues. Tabular data often have no description, or the description does not cover the data content. Tabular data also lack a specification of table structure and layout. Moreover, many tables do not use a standard vocabulary: they may be written in languages other than English, use abbreviations or ambiguous terms, or contain misspellings and encoding problems. A tabular data annotation system that provides explicit information about table content is therefore crucial for improving the usability of tabular data.</p><p>Previous studies addressed many tabular data annotation tasks, such as structural annotation <ref type="bibr" target="#b5">[6]</ref>, <ref type="bibr" target="#b8">[9]</ref> and semantic annotation, the latter by the systems participating in the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching: SemTab 2019 <ref type="bibr" target="#b2">[3]</ref> and SemTab 2020 <ref type="bibr" target="#b3">[4]</ref>. Unfortunately, most of these solutions or systems are not available for use, or they require extensive configuration and setup, require high computing power, or have high time complexity <ref type="bibr" target="#b9">[10]</ref>.</p><p>This paper introduces MTab, a public service that generates structural and semantic annotations for tabular data. The structural annotations identify the table headers and the core attribute of the table. The semantic annotations match table elements to knowledge graph concepts: cells to entities (the CEA task), columns to types (the CTA task), and the relation between the core attribute and another column to a property (the CPA task). We also provide a graphical interface to visualize the annotation results.</p><p>The major advantages of MTab compared to other systems are as follows.</p><p>-Effectiveness: MTab was the best-performing system in SemTab 2019 <ref type="bibr" target="#b4">[5]</ref>, <ref type="bibr" target="#b2">[3]</ref> and SemTab 2020 <ref type="bibr" target="#b6">[7]</ref>, <ref type="bibr" target="#b3">[4]</ref>. The key to MTab's success is its entity search modules with multilingual support: a keyword search with the BM25 algorithm, a fuzzy search with edit distances, and an aggregation search that is a weighted fusion of the keyword search and the fuzzy search. The fuzzy search supports up to six edits (on a low-budget Mac mini M1 2021), while most other systems support only two edits. As a result, MTab can handle a higher level of noise than other systems. The entity search module achieves an average top-1 accuracy of 87.98% (the top-1000 accuracy is 99.7%) <ref type="bibr" target="#b7">[8]</ref> on the SemTab 2020 <ref type="bibr" target="#b3">[4]</ref> and Tough Tables <ref type="bibr" target="#b0">[1]</ref> datasets.</p><p>MTab's repository, API documents, and other information can be accessed at https://github.com/phucty/mtab_tool; the demonstration video is available at https://youtu.be/0ibTWeObWaA.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">MTab</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Knowledge Graphs</head><p>We build WikiGraph from the dumps of Wikidata, Wikipedia, and DBpedia as the target knowledge graph for the annotation tasks. Wikidata is the central knowledge graph because it has the largest number of entities among the three. From the dumps of 1 January 2021, we extracted 91.2 million entities and 249.3 million multilingual entity labels, including labels, aliases, other names, redirect labels, and disambiguation entities. We also extracted 3.5 billion triples for WikiGraph. WikiGraph will be updated regularly as new dumps of the knowledge graphs (Wikidata, Wikipedia, and DBpedia) are released.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Entity Search Modules</head><p>Entity Search on a Cell. MTab provides the following search modes<ref type="foot" target="#foot_0">1</ref> <ref type="bibr" target="#b7">[8]</ref>.</p><p>-Keyword search with the BM25 algorithm: We use the hyper-parameters b = 0.75 and k1 = 1.2.</p><p>-Fuzzy search with edit distance: We use the Damerau-Levenshtein distance as the edit distance. To reduce the number of pairwise edit distance calculations, we also filter candidates and hash precomputed deletions of entity labels, following the Symmetric Delete algorithm <ref type="bibr" target="#b1">[2]</ref>. Overall, MTab supports fuzzy search with up to six edits.</p><p>-Aggregation search: This module is a weighted fusion of the keyword search and fuzzy search results.</p></div>
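<div xmlns="http://www.tei-c.org/ns/1.0"><p>The fuzzy search idea can be illustrated with a minimal sketch. It is not MTab's implementation: the function names (deletes, build_index, fuzzy_search, osa_distance), the toy labels, and the limit of two edits are assumptions chosen for illustration, and the verification step uses the optimal string alignment variant of the Damerau-Levenshtein distance. Deletions of every label are precomputed and hashed, the same deletions are generated for the query, and the edit distance is computed only for labels that share a deletion with the query.</p><p><![CDATA[
```python
from collections import defaultdict


def deletes(term, max_edits):
    """Return the term plus every string reachable by deleting up to max_edits characters."""
    variants = {term}
    frontier = {term}
    for _ in range(max_edits):
        frontier = {t[:i] + t[i + 1:] for t in frontier for i in range(len(t))}
        variants |= frontier
    return variants


def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    prev2, prev = None, list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            # Adjacent transposition counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


def build_index(labels, max_edits):
    """Hash every precomputed deletion of every label to the labels that produce it."""
    index = defaultdict(set)
    for label in labels:
        for variant in deletes(label, max_edits):
            index[variant].add(label)
    return index


def fuzzy_search(query, index, max_edits):
    """Collect labels sharing a deletion with the query, then verify by edit distance."""
    candidates = set()
    for variant in deletes(query, max_edits):
        candidates |= index.get(variant, set())
    scored = [(osa_distance(query, label), label) for label in candidates]
    return sorted(pair for pair in scored if pair[0] <= max_edits)


# Toy labels, for illustration only.
index = build_index(["tokyo", "kyoto", "osaka"], max_edits=2)
print(fuzzy_search("tokio", index, max_edits=2))
```
]]></p><p>Because only deletions are indexed, a lookup never has to enumerate insertions or substitutions over the alphabet. This is why the approach stays practical for multilingual labels, at the cost of a larger index as the edit limit grows.</p></div>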
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Statement Search on Two Cells</head><p>This module is built on the assumption that two cells in the same table row are logically related, and that this relation corresponds to a knowledge graph triple. We keep only those candidates for the two cells that have equivalent statements in WikiGraph. We implement the statement search with a sparse matrix of 91 million entities and around 500 million edges.</p></div>
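<div xmlns="http://www.tei-c.org/ns/1.0"><p>The filtering step of the statement search can be sketched as follows. This is an illustration under stated assumptions, not MTab's code: a dictionary of adjacency sets stands in for the sparse matrix, the entity identifiers are toy values, and an edge in either direction is treated as a statement linking the two candidates.</p><p><![CDATA[
```python
def linked(a, b, edges):
    """True if a statement connects a and b in either direction (an assumption of this sketch)."""
    return b in edges.get(a, set()) or a in edges.get(b, set())


def statement_search(cands_a, cands_b, edges):
    """Keep only the candidates of two cells that take part in at least one shared statement."""
    pairs = [(a, b) for a in cands_a for b in cands_b if linked(a, b, edges)]
    keep_a = [a for a in cands_a if any(a == p[0] for p in pairs)]
    keep_b = [b for b in cands_b if any(b == p[1] for p in pairs)]
    return keep_a, keep_b


# Toy graph: adjacency sets play the role of the sparse matrix rows.
edges = {
    "tokyo_city": {"japan"},
    "tokyo_film": {"france"},
    "japan": {"tokyo_city"},
}

# Candidates for two cells in the same row, e.g. "Tokyo" and "Japan".
cell_1 = ["tokyo_city", "tokyo_film"]
cell_2 = ["japan", "japan_band"]
print(statement_search(cell_1, cell_2, edges))
```
]]></p><p>In this toy example only the city and the country survive, because no statement links the other candidates. Storing the graph sparsely keeps the membership test cheap even when the number of entities is large.</p></div>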
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Table Annotation: Use Case and Demo</head><p>The MTab demonstration is available at https://mtab.app. Users can submit table files in various formats and in any language to the MTab API, or copy the table content and paste it into the interface. They can then click the "Annotate" button to get the annotation results.</p><p>The annotation procedure<ref type="foot" target="#foot_1">2</ref> consists of the following steps:</p><p>The picture on the right shows the annotation results. The table header is in the first row, and the core attribute is in the first column. Entity annotations are shown in red below each table cell value. The type annotations are shown in green in the "Type" column. Finally, the relations between the core attribute and the other columns are shown in blue in the property column.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Conclusions</head><p>This paper presents a demonstration of the MTab toolkit for table annotation with the knowledge graphs Wikidata, DBpedia, and Wikipedia. MTab is effective, efficient, and easy to use.</p><p>In future work, we will focus on building downstream applications based on MTab's annotations, such as question answering and data analysis.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 :</head><label>1</label><figDesc>Fig. 1: Example tabular data annotation with MTab</figDesc><graphic coords="4,279.50,116.39,200.53,125.40" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>-Efficiency: MTab's fuzzy search is efficient thanks to candidate filtering based on entity labels and hashing of precomputed entity label deletions, as in the Symmetric Delete algorithm [2]. Moreover, the statement search substantially improves efficiency because it eliminates entity candidates that have no matching statement. Additionally, we use a lightweight solution, value matching, to calculate the context similarity between entity candidate statements and table row values. Experiments show that our solution improves efficiency without sacrificing effectiveness [4]. Overall, MTab takes only 1.52 seconds per table on average to annotate the SemTab 2020 dataset.</figDesc><table /><note>-Easy to use: We provide public APIs and a graphical interface, so users do not need extensive setup or configuration. MTab also supports multilingual tables and can process many table formats such as Excel, CSV, TSV, or markdown tables. Wang et al. report that they could generate annotations only with the MTab tool, while other systems had too high a time complexity to process their data <ref type="bibr" target="#b9">[10]</ref>. -Privacy Policy: MTab does not store any user data. All tabular data files submitted by users are completely deleted after the annotation.</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>table cells whose data types are strings. The CTA matching targets are columns whose data types are strings. The CPA matching targets are the relations between the core attribute and the remaining table columns. Then, we generate entity candidates for each table cell with the entity search, and for each pair of cells in the same row with the statement search. We calculate context similarities by value matching between the statements of entity candidates in the core attribute and the table row values. Finally, we generate the annotations for entities, properties, and types based on majority voting over the context similarities <ref type="bibr" target="#b6">[7]</ref>. Fig. 1 illustrates an annotation example for a table from a SemTab dataset.</figDesc><table /><note>MTab took 0.49 seconds to annotate a table pasted into the text box (left picture).</note></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Entity Search Documents: https://mtab.app/mtabes/docs</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">Table Annotation Document: https://mtab.app/mtab/docs</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>The research was supported by the Cross-ministerial Strategic Innovation Promotion Program (SIP) Second Phase, "Big-data and AI-enabled Cyberspace Technologies" by the New Energy and Industrial Technology Development Organization (NEDO).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Tough Tables: Carefully evaluating entity linking for tabular data</title>
		<author>
			<persName><forename type="first">V</forename><surname>Cutrona</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Bianchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Jiménez-Ruiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Palmonari</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-62466-8_21</idno>
		<ptr target="https://doi.org/10.1007/978-3-030-62466-8_21" />
	</analytic>
	<monogr>
		<title level="m">The Semantic Web -ISWC 2020</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">12507</biblScope>
			<biblScope unit="page" from="328" to="343" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Symspell: Symmetric delete algorithm</title>
		<author>
			<persName><forename type="first">W</forename><surname>Garbe</surname></persName>
		</author>
		<ptr target="https://github.com/wolfgarbe/SymSpell" />
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">SemTab 2019: Resources to benchmark tabular data to knowledge graph matching systems</title>
		<author>
			<persName><forename type="first">E</forename><surname>Jiménez-Ruiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Hassanzadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Efthymiou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Srinivas</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-49461-2_30</idno>
		<ptr target="https://doi.org/10.1007/978-3-030-49461-2_30" />
	</analytic>
	<monogr>
		<title level="m">The Semantic Web -17th International Conference, ESWC 2020</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">12123</biblScope>
			<biblScope unit="page" from="514" to="530" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Results of SemTab 2020</title>
		<author>
			<persName><forename type="first">E</forename><surname>Jiménez-Ruiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Hassanzadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Efthymiou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Srinivas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Cutrona</surname></persName>
		</author>
		<ptr target="http://ceur-ws.org/Vol-2775/paper0.pdf" />
	</analytic>
	<monogr>
		<title level="m">SemTab@ISWC. CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">2775</biblScope>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">MTab: Matching tabular data to knowledge graph using probability models</title>
		<author>
			<persName><forename type="first">P</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kertkeidkachorn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ichise</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Takeda</surname></persName>
		</author>
		<ptr target="http://ceur-ws.org/Vol-2553/paper2.pdf" />
	</analytic>
	<monogr>
		<title level="m">SemTab@ISWC 2019. CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">2553</biblScope>
			<biblScope unit="page" from="7" to="14" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">TabEAno: Table to knowledge graph entity annotation</title>
		<author>
			<persName><forename type="first">P</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kertkeidkachorn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ichise</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Takeda</surname></persName>
		</author>
		<idno>CoRR abs/2010.01829</idno>
		<ptr target="https://arxiv.org/abs/2010.01829" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">MTab4Wikidata at SemTab 2020: Tabular data annotation with Wikidata</title>
		<author>
			<persName><forename type="first">P</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Yamada</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kertkeidkachorn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ichise</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Takeda</surname></persName>
		</author>
		<ptr target="http://ceur-ws.org/Vol-2775/paper9.pdf" />
	</analytic>
	<monogr>
		<title level="m">SemTab@ISWC</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">2775</biblScope>
			<biblScope unit="page" from="86" to="95" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">MTabES: Entity search with keyword search, fuzzy search, and entity popularities</title>
		<author>
			<persName><forename type="first">P</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Yamada</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Takeda</surname></persName>
		</author>
		<ptr target="https://www.jstage.jst.go.jp/article/pjsai/JSAI2021/0/JSAI2021_1N4IS1a02/_pdf" />
	</analytic>
	<monogr>
		<title level="m">The 35th Annual Conference of the Japanese Society for Artificial Intelligence</title>
				<meeting><address><addrLine>JSAI</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note>The Japanese Society for Artificial Intelligence</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Matching HTML tables to DBpedia</title>
		<author>
			<persName><forename type="first">D</forename><surname>Ritze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Lehmberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bizer</surname></persName>
		</author>
		<idno type="DOI">10.1145/2797115.2797118</idno>
		<ptr target="https://doi.org/10.1145/2797115.2797118" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics</title>
				<meeting>the 5th International Conference on Web Intelligence, Mining and Semantics<address><addrLine>WIMS</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page">6</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">TCN: table convolutional network for web table interpretation</title>
		<author>
			<persName><forename type="first">D</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shiralkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Lockard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">L</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jiang</surname></persName>
		</author>
		<idno type="DOI">10.1145/3442381.3450090</idno>
		<ptr target="https://doi.org/10.1145/3442381.3450090" />
	</analytic>
	<monogr>
		<title level="m">WWW &apos;21: The Web Conference 2021, Virtual Event</title>
				<meeting><address><addrLine>Ljubljana, Slovenia</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2021">April 19-23, 2021</date>
			<biblScope unit="page" from="4020" to="4032" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
