<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">KGCODE-Tab Results for SemTab 2022</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Xinhe</forename><surname>Li</surname></persName>
							<email>lixinhe669@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="department">School of Computer Science and Engineering</orgName>
								<orgName type="institution">Southeast University</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Shuxin</forename><surname>Wang</surname></persName>
							<email>shuxinwang662@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="department">School of Computer Science and Engineering</orgName>
								<orgName type="institution">Southeast University</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Wei</forename><surname>Zhou</surname></persName>
							<email>zhouweiseu@seu.edu.cn</email>
							<affiliation key="aff1">
								<orgName type="department">College of Software Engineering</orgName>
								<orgName type="institution">Southeast University</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gongrui</forename><surname>Zhang</surname></persName>
							<email>grzhang@seu.edu.cn</email>
							<affiliation key="aff2">
								<orgName type="department">Chien-Shiung Wu College</orgName>
								<orgName type="institution">Southeast University</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Chenghuan</forename><surname>Jiang</surname></persName>
							<affiliation key="aff2">
								<orgName type="department">Chien-Shiung Wu College</orgName>
								<orgName type="institution">Southeast University</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tianyu</forename><surname>Hong</surname></persName>
							<email>tianyuhong677@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="department">School of Computer Science and Engineering</orgName>
								<orgName type="institution">Southeast University</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Peng</forename><surname>Wang</surname></persName>
							<email>pwang@seu.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="department">School of Computer Science and Engineering</orgName>
								<orgName type="institution">Southeast University</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">College of Software Engineering</orgName>
								<orgName type="institution">Southeast University</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="department">Chien-Shiung Wu College</orgName>
								<orgName type="institution">Southeast University</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">KGCODE-Tab Results for SemTab 2022</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">59BB2ABE7A54C78D9154B63B8AAAB996</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T20:39+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Tabular Data</term>
					<term>Knowledge Graph</term>
					<term>Entity Linking</term>
					<term>KGCODE-Tab</term>
					<term>Semantic Annotation</term>
					<term>ORCID: 0000-0002-6299-4229 (X. Li)</term>
					<term>0000-0002-3677-8477 (S. Wang)</term>
					<term>0000-0002-4558-245X (W. Zhou)</term>
					<term>0000-0002-8342-5834 (G. Zhang)</term>
					<term>0000-0002-3583-3569 (C. Jiang)</term>
					<term>0000-0002-3773-7108 (T. Hong)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper presents the results of KGCODE-Tab in SemTab 2022, the tabular data to knowledge graph matching contest. KGCODE-Tab is an efficient tabular data linking system that participates in three tasks of the contest: Column Type Annotation (CTA), Cell Entity Annotation (CEA), and Columns Property Annotation (CPA). We briefly introduce the specific techniques used by KGCODE-Tab and discuss its strengths and weaknesses.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>KGCODE-Tab combines several effective tabular data preprocessing techniques, which are fundamental for tabular data to knowledge graph matching (TDKGM). We analyze the structure of each table to extract the subject column and the non-subject columns, correct the spelling of the cell text, and recall all candidate entities together with the information needed by the later modules. In the entity disambiguation module, preliminary scores are assigned to all candidate entities of the cells in the subject column, based on the similarities between table cells and property values in the KGs. For each task, a ranking algorithm is designed on top of the preliminary scores, and the final semantic annotation is obtained from the ranks. KGCODE-Tab separates the look-up step from the entity linking step: the latter directly uses the intermediate results that the former stores in JSON files.</p><p>In SemTab 2022, KGCODE-Tab serves as an efficient tabular data linking system, and some of its algorithms and matching strategies are designed for high efficiency.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.">Specific techniques used</head><p>KGCODE-Tab aims to provide high-quality semantic annotation of tabular data. The main specific techniques used by KGCODE-Tab are as follows.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.1.">Table Structure Analysis</head><p>Firstly, KGCODE-Tab classifies each column as an entity column or a non-entity column. It employs spaCy<ref type="foot" target="#foot_0">2</ref>, a Python package that provides Named Entity Recognition (NER), to give each cell a tag. A cell is an entity cell if it is tagged with</p><formula xml:id="formula_0">PERSON, NORP, FAC, ORG, GPE, LOC, PRODUCT, EVENT, WORK_OF_ART, LAW or LANGUAGE. A cell is a non-entity cell if it is tagged with DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL or CARDINAL.</formula><p>Cells that spaCy cannot recognize are treated as entity cells to prevent omissions. A column is then an entity column if more than half of its cells (excluding the header) are entity cells; otherwise, it is a non-entity column.</p><p>Secondly, KGCODE-Tab selects the subject column from the entity columns. It defines the Column Entropy, which describes the diversity of the contents of a column. The subject column commonly has a higher Column Entropy. If more than one candidate subject column exists, KGCODE-Tab selects the one with the smallest index.</p></div>
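<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration only, the column classification and subject column selection described above could be sketched in Python as follows. The exact definition of the Column Entropy is not given here, so the sketch assumes Shannon entropy over the cell values of a column. All function names are hypothetical, and the NER tags are passed in as plain strings instead of being computed with spaCy.</p>

```python
import math
from collections import Counter

# NER labels treated as entity / non-entity tags in the text above.
ENTITY_TAGS = {"PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT",
               "EVENT", "WORK_OF_ART", "LAW", "LANGUAGE"}
NON_ENTITY_TAGS = {"DATE", "TIME", "PERCENT", "MONEY", "QUANTITY",
                   "ORDINAL", "CARDINAL"}


def is_entity_column(cell_tags):
    """cell_tags: one NER label per non-header cell, or None if unrecognized."""
    # Unrecognized cells (None) count as entity cells to prevent omissions.
    entity_cells = sum(1 for tag in cell_tags
                       if tag is None or tag in ENTITY_TAGS)
    # "More than half" of the cells must be entity cells.
    return entity_cells * 2 > len(cell_tags)


def column_entropy(cells):
    """Assumed definition: Shannon entropy (in bits) of the cell values."""
    counts = Counter(cells)
    total = len(cells)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_subject_column(columns, entity_column_indices):
    """Highest-entropy entity column; ties go to the smallest index."""
    return max(entity_column_indices,
               key=lambda idx: (column_entropy(columns[idx]), -idx))
```

<p>Under these assumptions, a column whose cells are all distinct receives the highest entropy, which matches the intuition that the subject column usually holds one distinct entity per row.</p></div>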
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.2.">Spell Correction</head><p>Tables on the Internet often contain misspelled words, and previous studies <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3]</ref> show that spelling mistakes can make a huge difference to entity recall. Some systems <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5]</ref> remove special characters from the text but do not correct misspelled words. Inspired by <ref type="bibr" target="#b5">[6]</ref>, KGCODE-Tab uses a search engine to find the correct words.</p><p>Firstly, for a tabular cell $c_{ij}$, KGCODE-Tab searches for its text with Bing<ref type="foot" target="#foot_1">3</ref> and obtains the result page in HTML format. Secondly, it extracts the titles of the websites from the HTML and splits them into words $\mathcal{W} = \{w_1, w_2, \ldots, w_n\}$, where $n$ is the total number of words. Thirdly, it calculates the Levenshtein Distance between each $w_i$ ($i = 1, 2, \ldots, n$) and $c_{ij}$. Finally, the word with the shortest Levenshtein Distance to $c_{ij}$ is selected as the correct mention of $c_{ij}$, and words whose Levenshtein Distance to the correct word is no more than 2 are also appended to the list of candidate mentions of $c_{ij}$ to prevent omissions.</p></div>
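<div xmlns="http://www.tei-c.org/ns/1.0"><p>The last two steps of the procedure above can be sketched as follows. This is a minimal illustration, not the system's code: the web search and HTML parsing are left out, the title words are passed in as a list, the function names are hypothetical, and the fallback for an empty word list is an assumption.</p>

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def candidate_mentions(cell_text, title_words, radius=2):
    """Closest title word first, then words within `radius` edits of it."""
    if not title_words:
        return [cell_text]  # assumption: fall back to the raw cell text
    best = min(title_words, key=lambda w: levenshtein(w, cell_text))
    mentions = [best]
    for word in title_words:
        if word not in mentions and radius >= levenshtein(word, best):
            mentions.append(word)
    return mentions
```

<p>For example, with the misspelled cell "Gemany" and the title words "Germany", "Germania", "France" and "Wikipedia", the sketch selects "Germany" as the corrected mention and keeps "Germania" as an additional candidate.</p></div>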
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.3.">Entity Recall</head><p>Entity recall aims to select several candidate entities from a given KG. If the system cannot recall the ground truth entities, all the subsequent work is in vain. As the data source of the KG, some systems <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9]</ref> build their own database from a local Wikidata dump. However, this method demands high storage capacity and IO performance because of the huge size of the dump files. Therefore, we use the look-up services MediaWiki Action API<ref type="foot" target="#foot_2">4</ref> and DBpedia Lookup<ref type="foot" target="#foot_3">5</ref> to access the KG data online. We issue entity queries from 100 threads to improve query speed and obtain up to 50 candidate entities for each query text.</p><p>Furthermore, we find that the look-up services of the KGs (Wikidata/DBpedia) are sensitive to noise in the query text, such as adverbs, adjectives, and prepositions, which may lead to wrong or empty results.</p><p>To tackle this problem, we introduce tokenization. For the text of cell $c_{ij}$ with $l$ words $\mathbf{t} = [t_1, t_2, \ldots, t_l]$, KGCODE-Tab constructs the query set</p><formula xml:id="formula_1">\mathcal{Q} = \{\mathbf{q}_{i:j} = [t_i, t_{i+1}, \ldots, t_j] \mid i, j = 1, 2, \ldots, l \text{ and } i \leqslant j\}</formula><p>Then it sends each $\mathbf{q}_{i:j}$ in $\mathcal{Q}$ to the spell correction module and obtains the candidate mention set $\mathcal{M}$ of $c_{ij}$. Finally, it sends $\mathcal{M}$ to the KG APIs and obtains the candidate entity set $\mathcal{E}$. It also collects the information of each entity into a dictionary containing its label, description, statements, identifiers, and so on.</p></div>
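<div xmlns="http://www.tei-c.org/ns/1.0"><p>The construction of the query set can be sketched as below. Splitting on whitespace, ordering the spans from longest to shortest, and removing duplicate spans are assumptions made for the illustration; the function name is hypothetical.</p>

```python
def build_query_set(cell_text):
    """All contiguous token spans q_{i:j} with i ≤ j, longest first."""
    tokens = cell_text.split()  # assumption: whitespace tokenization
    n = len(tokens)
    spans = []
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            spans.append(" ".join(tokens[start:start + length]))
    # Drop duplicate spans while keeping the original order.
    return list(dict.fromkeys(spans))
```

<p>A cell with $l$ distinct words yields $l(l+1)/2$ queries, so the three-word cell "the red fox" produces six queries: "the red fox", "the red", "red fox", "the", "red", and "fox".</p></div>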
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.4.">Entity Disambiguation</head><p>Entity disambiguation selects the ground truth entity from the candidate entities. The architectures of existing systems fall into two categories: graph-based <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b9">10]</ref> and score-based <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b10">11]</ref>. Our system is score-based, and we design an algorithm to calculate the similarity score.</p><p>Commonly, a table has at least one subject column, and the others are non-subject columns. The non-subject columns generally hold properties of the subject column. Therefore, KGCODE-Tab can exclude some candidate entities of the subject column by comparing their properties with the content of the related non-subject columns. There are six main data types in Wikidata: wikibase-entityid, string, time, globecoordinate, quantity, and multilingualtext, so we design different formulas to calculate the similarity score for different data types. Let an entity $e$ have $P$ properties, and let $v_k$ denote the $k$-th property.</p><p>For the string and multilingualtext data types, it is enough to rely on the Levenshtein Distance. Values of the wikibase-entityid data type must first be converted to labels. The similarity score is calculated as follows:</p><formula xml:id="formula_2">\mathit{Sim}(c_{ij}, v_k) = \begin{cases} \mathit{LevRatio}(c_{ij}, v_k), \quad \text{if } \mathit{LevRatio}(c_{ij}, v_k) \geqslant \alpha \\ 0, \quad \text{otherwise} \end{cases}<label>(1)</label></formula><p>where the parameter $\alpha$ is set to 0.98, the best value found in our experiments.</p><p>For the quantity data type, we define the Number Relevance Degree (NRD) as follows:</p><formula xml:id="formula_3">\mathit{NRD}(a, b) = \begin{cases} 1 - \frac{|a-b|}{\max(|a|,|b|)}, \quad \text{if } ab \neq 0 \text{ and } 1 - \frac{|a-b|}{\max(|a|,|b|)} \geqslant \beta \\ 1 - |a-b|, \quad \text{if } ab = 0 \text{ and } 1 - |a-b| \geqslant \beta \\ 0, \quad \text{otherwise} \end{cases}<label>(2)</label></formula><formula xml:id="formula_4">\mathit{Sim}(c_{ij}, v_k) = \mathit{NRD}(c_{ij}, v_k)<label>(3)</label></formula><p>where the parameter $\beta$ is set to 0.98, which is also the best value found in our experiments. For the globecoordinate data type, which contains a longitude and a latitude, we directly use the NRD to calculate the similarity score, where $v_k^{la}$ and $v_k^{lo}$ denote the latitude and the longitude of $v_k$:</p><formula xml:id="formula_5">\mathit{Sim}(c_{ij}, v_k) = \max\left(\mathit{NRD}(c_{ij}, v_k^{la}), \mathit{NRD}(c_{ij}, v_k^{lo})\right)<label>(4)</label></formula><p>For the time data type, we define a list $\mathrm{T}$ that contains the year, month, day, hour, minute, and second to represent a time value. For tabular data, we use regular expressions to extract the time information of a cell as a list $\mathrm{T}$. The similarity score is calculated as follows:</p><formula xml:id="formula_6">\mathit{Sim}(c_{ij}, v_k) = \begin{cases} 1, \quad \text{if } \mathrm{T}_{c_{ij}} \subseteq \mathrm{T}_{v_k} \\ 0, \quad \text{otherwise} \end{cases}<label>(5)</label></formula><p>After the similarity scores have been calculated, each candidate entity receives a final score:</p><formula xml:id="formula_7">\mathit{FS}(e) = \frac{1}{N-1} \sum_{j=1, j \neq s}^{N} \max_{v_k \in P_e} \mathit{Sim}(c_{ij}, v_k)<label>(6)</label></formula><p>where $e$ is a candidate entity of the $i$-th cell in the subject column, $N$ is the number of columns, $s$ denotes the index of the subject column, and $P_e$ is the set of properties of $e$.</p></div>
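<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the string similarity, the NRD, and the final score concrete, the following minimal sketch implements Eq. 1, Eq. 2, and Eq. 6 under stated assumptions. LevRatio is not defined in this section, so the ratio of Python's difflib.SequenceMatcher is used as a stand-in, which may differ from the ratio used by the system. The function names and the input layout of final_score are hypothetical.</p>

```python
from difflib import SequenceMatcher

ALPHA = 0.98  # string threshold reported in the text
BETA = 0.98   # numeric threshold reported in the text


def sim_string(cell, value, alpha=ALPHA):
    """Eq. (1), with difflib's ratio standing in for LevRatio (assumption)."""
    ratio = SequenceMatcher(None, cell, value).ratio()
    return ratio if ratio >= alpha else 0.0


def nrd(a, b, beta=BETA):
    """Number Relevance Degree, Eq. (2)."""
    if a * b != 0:
        score = 1 - abs(a - b) / max(abs(a), abs(b))
    else:
        score = 1 - abs(a - b)
    return score if score >= beta else 0.0


def final_score(row_sims):
    """Eq. (6): mean over non-subject columns of the best property similarity.

    row_sims holds, for each non-subject column j, the list of
    Sim(c_ij, v_k) values against every property v_k of the candidate.
    """
    if not row_sims:
        return 0.0
    return sum(max(sims, default=0.0) for sims in row_sims) / len(row_sims)
```

<p>With $\beta = 0.98$, the values 100 and 99 give an NRD of 0.99, whereas 100 and 90 fall below the threshold and score 0. The high thresholds mean that only near-exact matches contribute to the final score.</p></div>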
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.5.">Task Analysis</head><p>In our system, we use a cooperative score mechanism. Let $M(e_i^k, c_{ij})$ and $M(e_i^k, e_{ij}^{k'})$ denote the matching scores of the pairs $(e_i^k, c_{ij})$ and $(e_i^k, e_{ij}^{k'})$, which are used below. We use a normalization function</p><formula xml:id="formula_8">\phi(x) = (ax)^b<label>(7)</label></formula><p>to widen the gap between high and low matching scores, where $a = 1.1$ and $b = 8$.</p></div>
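<div xmlns="http://www.tei-c.org/ns/1.0"><p>A short numeric sketch shows how strongly this function separates scores. The function name phi is chosen for the illustration; the parameter values are the ones stated above.</p>

```python
def phi(x, a=1.1, b=8):
    """Normalization function of Eq. (7): phi(x) = (a * x) ** b."""
    return (a * x) ** b


# Matching scores of 1.0 and 0.5 end up far apart after normalization.
high = phi(1.0)
low = phi(0.5)
ratio = high / low
```

<p>Here a matching score of 1.0 maps to about 2.14 and a score of 0.5 maps to about 0.0084, so halving the input divides the output by $2^8 = 256$.</p></div>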
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Column Type Annotation</head><p>Let $e_i^k$ denote the $k$-th candidate entity of the $i$-th cell in the subject column. Then the set of candidate types is $\mathcal{C}_{\text{sub}} = \{t \mid (e_i^k, \text{InstanceOf}, t) \in KG, i = 1, 2, \ldots, m, k = 1, 2, \ldots, N(c_i)\}$, where $N(c_i)$ is the number of candidate entities of the $i$-th cell. We assign a score to each type $t$ in $\mathcal{C}_{\text{sub}}$ by Eq. 9.</p><formula xml:id="formula_9">I_{\text{sub}}(t, e_i^k) = \begin{cases} \frac{1}{N-1} \sum_{j \neq s} M(e_i^k, c_{ij}), \quad \text{if } (e_i^k, \text{InstanceOf}, t) \in KG \\ 0, \quad \text{otherwise} \end{cases}<label>(8)</label></formula><formula xml:id="formula_10">\mathit{CTAScore}_{sub}(t) = \sum_{i=1}^{m} \max_{k=1}^{N(c_i)} \phi\left(I_{\text{sub}}(t, e_i^k)\right)<label>(9)</label></formula><p>For non-subject columns, the scores of the candidate types in $\mathcal{C}_{\text{non}}$ are assigned by Eq. 11.</p><formula xml:id="formula_11">I_{\text{non}}(t, e_i^k, e_{ij}^{k'}) = \begin{cases} M(e_i^k, e_{ij}^{k'}), \quad \text{if } (e_{ij}^{k'}, \text{InstanceOf}, t) \in KG \\ 0, \quad \text{otherwise} \end{cases}<label>(10)</label></formula><formula xml:id="formula_12">\mathit{CTAScore}_{non}(t_j) = \sum_{i=1}^{M} \max_{k,k'} \phi\left(I_{\text{non}}(t_j, e_i^k, e_{ij}^{k'})\right)<label>(11)</label></formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Cell Entity Annotation</head><p>For an entity in the subject column, we enumerate all types $t_u$ of the candidates to take advantage of the CTA scores, as shown in Eq. 12, where the parameter $\lambda$ is a cooperative factor set to 0.1. We skip the terms for which $I_{\text{sub}}(\cdot, \cdot)$ or $M(\cdot, \cdot)$ equals 0.</p><formula xml:id="formula_13">\mathit{CEAScore}_{sub}(e_i^k) = \max_{t,u} \left\{ \phi\left(I_{\text{sub}}(t_u, e_i^k)\right) + \lambda \cdot \mathit{CTAScore}_{sub}(t_u) \right\}<label>(12)</label></formula><p>For a non-subject column with index $j$, we score the entity $e_{ij}^{k'}$ by Eq. 13.</p><formula xml:id="formula_14">\mathit{CEAScore}_{non}(e_{ij}^{k'}) = \max_{k'=1}^{N(c_{ij})} \left\{ \phi\left(M(e_i^k, e_{ij}^{k'})\right) + \lambda \cdot \mathit{CEAScore}_{sub}(e_i^k) \right\}<label>(13)</label></formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Columns Property Annotation</head><p>The set of candidate properties is denoted by $\mathcal{P} = \{p \mid (e_i^G, \mathit{hasProperty}, p) \in KG, i = 1, 2, \ldots, m\}$. We assign a score to each property $p$ in $\mathcal{P}$ with respect to the $j$-th column by:</p><formula xml:id="formula_15">I(p, e_i^k) = \begin{cases} M(e_i^k, e_{ij}^{k'}), \quad \text{if } (e_i^k, p, e_{ij}^{k'}) \in KG \\ 0, \quad \text{otherwise} \end{cases}<label>(14)</label></formula><p>The CPA matching score is calculated by Eq. 15.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><formula xml:id="formula_16">\mathit{CPAScore}(p_j) = \sum_{i=1}^{M} \max_{k=1}^{N(c_i)} \left\{ \phi\left(I(p_j, e_i^k)\right) + \lambda \cdot \mathit{CEAScore}_{sub}(e_i^k) \right\}<label>(15)</label></formula></div>
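<div xmlns="http://www.tei-c.org/ns/1.0"><p>The cooperative scoring formulas share one pattern: normalize a per-candidate score, take the best candidate per row, and sum over rows. As a hedged sketch of that pattern, the code below computes the subject-column CTA score of Eq. 9. The input layout, the function names, and the example type names are hypothetical, and each candidate's mean matching score (the non-zero branch of Eq. 8) is assumed to be precomputed.</p>

```python
from collections import defaultdict


def phi(x, a=1.1, b=8):
    """Normalization function of Eq. (7)."""
    return (a * x) ** b


def cta_scores_subject(rows):
    """Eq. (9): sum over rows of the best normalized score per type.

    rows: one list per subject cell; each list holds one
    (candidate_types, mean_matching_score) pair per candidate entity.
    """
    totals = defaultdict(float)
    for candidates in rows:
        best = {}
        for types, mean_score in candidates:
            # I_sub is mean_score for the candidate's own types, 0 otherwise.
            for t in types:
                best[t] = max(best.get(t, 0.0), phi(mean_score))
        for t, score in best.items():
            totals[t] += score
    return dict(totals)


example_rows = [
    [({"city"}, 1.0), ({"city", "river"}, 0.5)],  # two candidates, row 1
    [({"river"}, 1.0)],                           # one candidate, row 2
]
scores = cta_scores_subject(example_rows)
```

<p>In this toy input, "river" is supported by a strong candidate in the second row and a weak one in the first, so it scores slightly higher than "city", which is supported in the first row only.</p></div>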
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Results</head><p>In the Accuracy Track of SemTab 2022, participants compete with each other over three rounds.</p><p>In each round, different datasets are provided to evaluate the systems on the CTA, CEA, and CPA tasks.</p><p>Table <ref type="table" target="#tab_0">1</ref> shows the results of KGCODE-Tab on all datasets of SemTab 2022. Since our system evolved as the competition went on, its rank and performance rose throughout the competition.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Round 1</head><p>In Round 1, the tables of the HardTables datasets have small numbers of rows and columns, and the subject column of most tables is the first column. Thus KGCODE-Tab processes tables in batches and sets the first column as the subject column by default. Experiments show that batch processing dramatically improves the efficiency of spell correction and entity recall because it fully exploits multithreading. Fixing the subject column also reduces the errors introduced by the table structure analysis module.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Round 2</head><p>In Round 2, the subject columns of the tables in the ToughTables datasets are not always the first columns, and the non-subject columns are not necessarily properties of the subject columns but may be their descriptions. Hence, the table structure analysis module comes into play, and the descriptions of entities take part in the calculation of the similarity scores. Results show that these modifications largely increase the accuracy of the entity disambiguation module and improve the ranking of our system. In addition, the number of rows per table in the ToughTables datasets fluctuates greatly, and some tables have extremely large numbers of rows. Hence, we introduce adaptive batch processing according to the size of the tabular data. For a table with a large number of rows, only a randomly selected subset of rows is used for the CTA task, which improves the efficiency of spell correction and entity recall.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Round 3</head><p>In Round 3, the tables in the BiodivTab datasets are about biodiversity, so KGCODE-Tab constructs a biodiversity corpus of abbreviations and aliases commonly used in this field. Furthermore, many cells contain noise such as adverbs and adjectives, and most headers carry semantic information. Therefore, tokenization is introduced to reduce the effect of noise, and KGCODE-Tab converts the CTA task into a CEA task on the headers.</p><p>For the GitTables datasets, by observing the annotation results of the training dataset, we find that the number of labels is small and the annotated types are relatively general, so we use a text classification algorithm to solve the problem. After preliminary analysis and research, we select the FastText <ref type="bibr" target="#b11">[12]</ref> model. Firstly, the original words are divided into several tokens, and the CTA results are used as labels. Then spaCy is used for word recognition, and the results are used as keywords. These are fed into the FastText model for training. After training, the model is used to annotate the test dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">General comments</head><p>SemTab 2022 is the first SemTab edition in which the KGCODE-Tab team participates, and the team obtains good results. Among all the participating teams, we achieve first-place results in multiple tasks.</p><p>KGCODE-Tab uses several strategies to improve performance with less query time. The task analysis in the top layer can directly call the interface of the bottom layer, which increases the maintainability of the system. The tabular data preprocessing module makes full use of tools such as search engines, the KG APIs, and the spaCy library to generate a structured JSON file for each table, which increases reusability. To achieve the semantic annotation of tabular data, the three tasks CEA, CTA, and CPA are handled jointly. As a whole, KGCODE-Tab fully utilizes the context of the whole table and the information provided by the KGs to achieve high accuracy.</p><p>However, the entity disambiguation module can be optimized further, and machine learning algorithms could be used to train its parameters.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusion</head><p>In this paper, we propose a novel table annotation system, KGCODE-Tab, which handles three TDKGM tasks: CTA, CEA, and CPA. We propose several effective tabular data preprocessing techniques, consisting of table structure analysis, spell correction, and entity recall. KGCODE-Tab emphasizes entity disambiguation with table context, which removes much noise and retains the candidate entities with high confidence. For each task, we design a scoring formula that selects the right answer among the candidates and makes use of the results of the other tasks. Results of SemTab 2022 show that KGCODE-Tab has excellent disambiguation ability and achieves outstanding performance.</p><p>Supplemental Material Statement: Source code and constructed datasets will be released on GitHub soon.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Results of KGCODE-Tab obtained in SemTab 2022.</figDesc><table>
<row><cell>Task</cell><cell cols="3">CTA</cell><cell cols="3">CEA</cell><cell cols="3">CPA</cell></row>
<row><cell>Dataset</cell><cell>APrecision</cell><cell>AF1</cell><cell>Rank</cell><cell>APrecision</cell><cell>AF1</cell><cell>Rank</cell><cell>APrecision</cell><cell>AF1</cell><cell>Rank</cell></row>
<row><cell>Round 1</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row>
<row><cell>HardTablesR1(WD)</cell><cell>0.944</cell><cell>0.942</cell><cell>4</cell><cell>0.916</cell><cell>0.893</cell><cell>4</cell><cell>0.918</cell><cell>0.906</cell><cell>5</cell></row>
<row><cell>Round 2</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row>
<row><cell>HardTablesR2(WD)</cell><cell>0.971</cell><cell>0.968</cell><cell>1</cell><cell>0.875</cell><cell>0.856</cell><cell>2</cell><cell>0.943</cell><cell>0.916</cell><cell>3</cell></row>
<row><cell>ToughTables(WD)</cell><cell>0.546</cell><cell>0.543</cell><cell>1</cell><cell>0.913</cell><cell>0.905</cell><cell>3</cell><cell>/</cell><cell>/</cell><cell>/</cell></row>
<row><cell>ToughTables(DBP)</cell><cell>0.485</cell><cell>0.480</cell><cell>1</cell><cell>0.830</cell><cell>0.827</cell><cell>1</cell><cell>/</cell><cell>/</cell><cell>/</cell></row>
<row><cell>Round 3</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row>
<row><cell>BiodivTab(DBP)</cell><cell>0.867</cell><cell>0.867</cell><cell>1</cell><cell>0.911</cell><cell>0.911</cell><cell>1</cell><cell>/</cell><cell>/</cell><cell>/</cell></row>
<row><cell>GitTables(DBP)</cell><cell>0.608</cell><cell>0.587</cell><cell>2</cell><cell>/</cell><cell>/</cell><cell>/</cell><cell>/</cell><cell>/</cell><cell>/</cell></row>
<row><cell>GitTables(SCH)(class)</cell><cell>0.716</cell><cell>0.693</cell><cell>1</cell><cell>/</cell><cell>/</cell><cell>/</cell><cell>/</cell><cell>/</cell><cell>/</cell></row>
<row><cell>GitTables(SCH)(property)</cell><cell>0.665</cell><cell>0.618</cell><cell>2</cell><cell>/</cell><cell>/</cell><cell>/</cell><cell>/</cell><cell>/</cell><cell>/</cell></row>
</table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">https://github.com/explosion/spaCy</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">https://www.bing.com/search</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">https://wikidata.org/w/api.php</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">https://lookup.dbpedia.org/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Resources to benchmark tabular data to knowledge graph matching systems</title>
		<author>
			<persName><forename type="first">E</forename><surname>Jiménez-Ruiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Hassanzadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Efthymiou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Srinivas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th Extended Semantic Web Conference (ESWC 2020)</title>
				<meeting>the 17th Extended Semantic Web Conference (ESWC 2020)<address><addrLine>Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note>Semtab</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Amalgam: Making tabular dataset explicit with knowledge graph</title>
		<author>
			<persName><forename type="first">R</forename><surname>Azzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Diallo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2020) co-located with the 19th International Semantic Web Conference (ISWC 2020)</title>
				<meeting>the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2020) co-located with the 19th International Semantic Web Conference (ISWC 2020)<address><addrLine>Virtual, Online</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Linkingpark: An integrated approach for semantic table interpretation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Karaoglu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Negreanu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-G</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gordon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2020) co-located with the 19th International Semantic Web Conference (ISWC 2020)</title>
				<meeting>the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2020) co-located with the 19th International Semantic Web Conference (ISWC 2020)<address><addrLine>Virtual, Online</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Dagobah: An end-to-end context-free tabular data semantic annotation system</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Chabot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Labbe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Troncy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2019) co-located with the 18th International Semantic Web Conference (ISWC 2019)</title>
				<meeting>the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2019) co-located with the 18th International Semantic Web Conference (ISWC 2019)<address><addrLine>Auckland, New Zealand</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Dagobah: Enhanced scoring algorithms for scalable annotations of tabular data</title>
		<author>
			<persName><forename type="first">V.-P</forename><surname>Huynh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Chabot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Labbe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Monnin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Troncy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2020) co-located with the 19th International Semantic Web Conference (ISWC 2020)</title>
				<meeting>the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2020) co-located with the 19th International Semantic Web Conference (ISWC 2020)<address><addrLine>Virtual, Online</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Knowledge graph matching with inter-service information transfer</title>
		<author>
			<persName><forename type="first">S</forename><surname>Yumusak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2020) co-located with the 19th International Semantic Web Conference (ISWC 2020)</title>
				<meeting>the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2020) co-located with the 19th International Semantic Web Conference (ISWC 2020)<address><addrLine>Virtual, Online</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">ADOG - Annotating data with ontologies and graphs</title>
		<author>
			<persName><forename type="first">D</forename><surname>Oliveira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>d'Aquin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2019) co-located with the 18th International Semantic Web Conference (ISWC 2019)</title>
				<meeting>the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2019) co-located with the 18th International Semantic Web Conference (ISWC 2019)<address><addrLine>Auckland, New Zealand</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Mantistable: An automatic approach for the semantic table interpretation</title>
		<author>
			<persName><forename type="first">M</forename><surname>Cremaschi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Avogadro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chieregato</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2019) co-located with the 18th International Semantic Web Conference (ISWC 2019)</title>
				<meeting>the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2019) co-located with the 18th International Semantic Web Conference (ISWC 2019)<address><addrLine>Auckland, New Zealand</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Semantic table interpretation using lod4all</title>
		<author>
			<persName><forename type="first">H</forename><surname>Morikawa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2019) co-located with the 18th International Semantic Web Conference (ISWC 2019)</title>
				<meeting>the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2019) co-located with the 18th International Semantic Web Conference (ISWC 2019)<address><addrLine>Auckland, New Zealand</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Csv2kg: Transforming tabular data into semantic knowledge</title>
		<author>
			<persName><forename type="first">B</forename><surname>Steenwinckel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Vandewiele</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>De Turck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ongenae</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2019) co-located with the 18th International Semantic Web Conference (ISWC 2019)</title>
				<meeting>the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2019) co-located with the 18th International Semantic Web Conference (ISWC 2019)<address><addrLine>Auckland, New Zealand</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Lexma: Tabular data to knowledge graph matching using lexical techniques</title>
		<author>
			<persName><forename type="first">S</forename><surname>Tyagi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Jimenez-Ruiz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2020) co-located with the 19th International Semantic Web Conference (ISWC 2020)</title>
				<meeting>the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2020) co-located with the 19th International Semantic Web Conference (ISWC 2020)<address><addrLine>Virtual, Online</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Bag of tricks for efficient text classification</title>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bojanowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017)</title>
				<meeting>the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017)<address><addrLine>Valencia, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
