GBMTab: A Graph-Based Method for Interpreting Noisy Semantic Table to Knowledge Graph

Lianzheng Yang1, Shuyang Shen2, Jingyi Ding3, and Jiahui Jin3(B)

1 School of Cyber Science and Engineering, Southeast University, China
2 Chien-Shiung Wu College, Southeast University, China
3 School of Computer Science and Engineering, Southeast University, China
{lianzhengyang,shua,jingyi_ding,jjin}@seu.edu.cn

Abstract. Tabular data on the web contains rich semantic information, yet matching noisy tabular data to knowledge graphs is challenging. In this paper, we propose GBMTab, a semantic table interpretation framework that uses a robust probabilistic graphical model (PGM) to solve the entity linking and column type annotation problems for tabular data. With this framework we participated in two tasks of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2021): CEA (Cell Entity Annotation) and CTA (Column Type Annotation). The results of the SemTab 2021 challenge show that our framework performs well.

Keywords: Tabular data · Entity linking · Table interpretation · GBMTab

1 Introduction

Tables on web pages have become a high-quality source for applications such as knowledge extraction [1] and knowledge graph augmentation [2], as they contain a large amount of relational data covering many fields. Tabular data on the web carries rich semantic information, and that information is also recorded in specific knowledge graphs (KGs), so matching tabular data to knowledge graphs has critical research value. In recent years there has been much work on semantic annotation of tabular data, much of it based on probabilistic graphical models. Limaye et al. [3] and Mulwad et al. [4] each proposed a probabilistic graphical model that captures the semantic information in tables and annotates cells, columns and column pairs as entities, types and relationships in KGs. According to the results of SemTab 2020, the best performing approach was MTab4Wikidata [5], which improves entity search by using fuzzy search and statement search to deal with noisy mentions.

However, it is arduous for machines to interpret semantic tabular data because of the diversity of languages and noisy mentions. The Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2021) benchmarks systems that match tabular data to knowledge graphs. There are three tasks: Column Type Annotation (CTA), Cell Entity Annotation (CEA) and Column Property Annotation (CPA). The CTA task assigns a semantic type to a column, the CEA task matches cells to entities in a specific KG, and the CPA task assigns a KG property to the relationship between two columns. The three tasks are illustrated in Figure 1.

Fig. 1. Three sub-tasks of SemTab 2021: an example table with columns Artist, Nationality and Genre (with noisy cells such as "Traditional popp" and "J-ppop") whose cells, columns and column pairs are matched to entities (CEA), types (CTA) and properties (CPA) in the KG.

We propose an approach to solve the CTA and CEA tasks, where the result of the CEA task is the input of the CTA task.
For the CEA task, we build a graphical model and use an iterative probability propagation algorithm to find the referent entity for each cell. For the CTA task, we match column types using the Wikidata SPARQL endpoint and feature engineering.

The remainder of this paper is organized as follows. Section 2 introduces the approach we propose for the CEA and CTA tasks. Section 3 describes the results of the two rounds, and Section 4 concludes.

2 Methodology

The framework flowchart of the CEA and CTA tasks is shown in Figure 2: the CEA task is composed of candidate entity generation and entity disambiguation, and the CTA task is composed of relation match and column type disambiguation. An input table first passes through the CEA task; note that in this step we introduce the noise mention repair mechanism. After the CEA task is completed, the annotated table is used as the input of the CTA task, which annotates the column types according to the amount of information available for the corresponding mentions in the table. The rest of this section details our approach.

Fig. 2. The framework flowchart of the CEA and CTA tasks: tables → candidate entity generation → entity disambiguation (CEA results) → relation match → column type disambiguation (CTA results).

2.1 Entity linking

Entity linking is divided into two parts: candidate generation and entity disambiguation. In round 1, we selected the challenge that links tabular data to the DBpedia 2016-10 knowledge graph. In round 2, we linked "BioTable" to the Wikidata knowledge graph.

2.1.1 Candidate generation

For each mention in the tabular data, we first need to find the corresponding candidate entities in the given knowledge graph. Because the knowledge graphs are composed differently, we handle each knowledge graph differently in the candidate generation step. It is worth mentioning that in this step we add a noise mention repair mechanism.

For DBpedia, we use the following methods.

String similarity comparison: We build a hash-table index over the DBpedia ontology and the corresponding links using the DBpedia 2016-10 datasets. We then compute the similarity between the mentions and all DBpedia entities to determine which entities become candidates for a mention. Let $s$ be a mention in the tabular data and $e$ an entity in the knowledge graph. We define the string similarity as

$$\mathrm{StringSimilarity}(s, e) = 1 - \frac{\mathrm{LevenshteinDistance}(s, e)}{\mathrm{length}(s) + \mathrm{length}(e)} \qquad (1)$$

We set a threshold to determine whether to add an entity $e$ to the candidate set. If the threshold is too high, the ground-truth entity may not be included; conversely, if it is too low, the subsequent entity disambiguation task becomes harder. In the experiment shown in Figure 3, we randomly selected 10,000 mentions and generated candidate sets under different threshold settings; a threshold of 0.7 turns out to be a good choice. A minimal sketch of this filtering step follows.

Fig. 3. The influence of the threshold on the candidate set (x-axis: threshold from 0 to 1; y-axis: percentage of ground-truth entities included, roughly 70-100%).
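To make the candidate filtering concrete, here is a minimal sketch of Equation (1) and the threshold test. The plain dynamic-programming Levenshtein function and the flat label list are simplified stand-ins for our hash-table index; only the 0.7 threshold follows Figure 3.

```python
# Sketch of threshold-based candidate generation using Equation (1).
# The flat label list stands in for the hash-table index over DBpedia labels.

def levenshtein_distance(s: str, e: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(e) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ce in enumerate(e, start=1):
            cost = 0 if cs == ce else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def string_similarity(s: str, e: str) -> float:
    """Equation (1): 1 - LevenshteinDistance(s, e) / (length(s) + length(e))."""
    if not s and not e:
        return 1.0
    return 1.0 - levenshtein_distance(s, e) / (len(s) + len(e))

def generate_candidates(mention: str, labels: list[str],
                        threshold: float = 0.7) -> list[str]:
    """Keep every entity label whose similarity to the mention passes the threshold."""
    return [label for label in labels
            if string_similarity(mention, label) >= threshold]

# The misspelled mention "Berln" keeps both "Berlin" and "Bern" as candidates,
# which the disambiguation step must then resolve.
print(generate_candidates("Berln", ["Berlin", "Bern", "Paris"]))
```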
Noise mention repair: When noise is added to a short mention, for example when "cat" becomes "catt", the string similarity drops sharply because of the short length, and the correct entity may be filtered out. This is fatal for the subsequent entity disambiguation task. To solve the noisy mention problem, we use a search engine: the Google search engine automatically corrects keywords containing noisy mentions and continues searching with the recommended keywords. If the broader search results would bias the judgment, we also limit the search scope to the DBpedia domain name.

Multilingual: For mentions whose language is not English, we introduce the multilingual DBpedia datasets.

For Wikidata, we use the following methods.

Wikidata MediaWiki API: Unlike for DBpedia, for Wikidata we query the MediaWiki API (https://www.mediawiki.org/wiki/API:Search) by posting the mention, with the limit set to a maximum of 50 results. Wikidata performs a fuzzy search and ranks the results internally, so we consider higher-ranked entities to be more related to the mention.

Correction of noisy mentions: This step is the same as for DBpedia: we use the Google search engine.

2.1.2 Entity disambiguation

In this step, our goal is to select the entity from the candidate set that best fits the contextual semantics. We assume that mentions in the same row or column of a table are related. We therefore build a fully connected graphical model and use the iterative probability propagation algorithm [6] to find the set of entities that most likely expresses the table's semantics, jointly disambiguating the mentions in the table. These steps are illustrated in Figure 4.

Fig. 4. The flowchart of entity disambiguation: a graph over mention nodes $m_{i,j}$ and candidate nodes, connected by mention-mention and mention-candidate edges.

i) Build disambiguation graph: Starting from a given mention, create a disambiguation graph over the other mentions in the same row or column and their candidates. For a mention $m_{i,j}$ in the $i$-th row and $j$-th column, add the $i$-th row mentions $m_{i,*}$ and the $j$-th column mentions $m_{*,j}$ to the graph as mention nodes, then add the candidate sets $S(m)$ of these mentions to the graph as candidate nodes.

ii) Build features between nodes: We use feature engineering to score the relationships between nodes and thereby measure their semantic connections. The features consist of the following parts:

Priori Features: We calculate priori features from the knowledge graph and from WDC (http://webdatacommons.org/webtables/). For example, "New York" in Wikidata can be interpreted as New York City, New York State, film and television works, literature, music, etc. When linking the words "New York", we tend to prefer New York City or New York State, which appear more frequently. For DBpedia, we calculate the popularity of candidate entities; for Wikidata, we take advantage of the order in which the MediaWiki API returns results. In addition, we consider the prior knowledge of the context rather than only the individual mention itself: WDC is a web table corpus extracted from a large amount of web table data on the internet. For example, when "New York" appears in a web table together with "Lou Reed", we can interpret "New York" in this table as the album, because the album New York and its author co-occur frequently in WDC.

Context Features: Tabular structure is naturally composed of an ordered set of rows and columns [7]. Commonly, one row contains a key object and its properties, so the mention texts in the same row can be read as extended information about the key object. Although the connections among cells in the same column are subtle, there are strong semantic similarities among them. Under some circumstances a key object may appear repeatedly, which usually indicates a repeated row; this helps to improve accuracy when noise occurs. For these reasons, we take the values of the cells in the same row or column as features of the target cell, using Levenshtein distance and cosine distance to rank the candidate entities, as sketched below.
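The following is a minimal sketch of this context-based ranking. The bag-of-words cosine similarity, difflib's ratio as a stand-in for Equation (1), and the equal weighting of the two scores are illustrative assumptions, not the exact features of the full system.

```python
# Sketch of context-feature scoring: rank a cell's candidates by combining
# string similarity to the mention with a bag-of-words cosine similarity
# between the candidate label and the other cells in the same row/column.
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words token vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def context_score(candidate: str, mention: str, context_cells: list[str]) -> float:
    """Equal-weight mix of mention similarity and row/column context similarity
    (the weighting is an assumption made for this sketch)."""
    string_sim = SequenceMatcher(None, mention.lower(), candidate.lower()).ratio()
    return 0.5 * string_sim + 0.5 * cosine_similarity(candidate, " ".join(context_cells))

def rank_candidates(mention: str, candidates: list[str],
                    context_cells: list[str]) -> list[str]:
    """Return the candidates sorted by descending context score."""
    return sorted(candidates,
                  key=lambda c: context_score(c, mention, context_cells),
                  reverse=True)

# "New York" next to "Lou Reed" should prefer the album over the state.
print(rank_candidates("New York",
                      ["New York (Lou Reed album)", "New York (state)"],
                      ["Lou Reed", "1989"]))
```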
Abstract Features: In round 1 of SemTab 2021, the tasks can be fulfilled with DBpedia data. Compared to Wikidata, DBpedia has less data, but the data is more structured: many entities have abstracts that describe the general information of the entity, and the abstract of an entity can easily be accessed through a local DBpedia SPARQL endpoint. Once appropriately tokenized, the abstract can be intersected with the other available text features, and the corresponding entity is scored with cosine distance.

iii) Iterative probability propagation: Based on the current values of the nodes, iterative probability propagation greedily assigns each node its maximum-likelihood value, then repeatedly recalculates and updates the features of the mentions until it reaches a global optimum. Formally, let $P_0$ denote the state of the graphical model at initialization, $T$ the maximum number of iterations, and $t$ the current iteration. When $t = T$ or $P_t = P_{t-1}$, probability propagation finishes, and $P_t$ is output as the global disambiguation result. A minimal sketch of this loop follows.
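Here is a minimal sketch of the propagation loop under the two stopping conditions above. The graph encoding (per-mention candidate scores plus a pairwise coherence function) is an illustrative assumption; in the full framework the pairwise scores would come from the priori, context and abstract features described above.

```python
# Sketch of iterative probability propagation over the disambiguation graph.

def propagate(scores: dict, neighbors: dict, coherence, T: int = 50) -> dict:
    """scores[m][e]   : prior score of candidate e for mention m
       neighbors[m]   : mentions connected to m (same row or column)
       coherence(e,f) : pairwise score between two candidate entities
       Returns the final assignment P_t: one entity per mention."""
    # P_0: greedy maximum-likelihood initialization from the priors alone.
    state = {m: max(cands, key=cands.get) for m, cands in scores.items()}
    for t in range(1, T + 1):
        new_state = {}
        for m, cands in scores.items():
            # Re-score every candidate of m against the neighbors' current
            # choices, then greedily keep the maximum-likelihood candidate.
            def total(e, m=m, cands=cands):
                return cands[e] + sum(coherence(e, state[n]) for n in neighbors[m])
            new_state[m] = max(cands, key=total)
        if new_state == state:  # P_t == P_{t-1}: converged
            return new_state
        state = new_state
    return state                # t == T: iteration budget exhausted
```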
2.2 Column type annotation

Fig. 5. The flowchart of column type annotation: the CEA result is checked against the primary key column; candidates are scored via the intersection of primary-key properties and the target column, plus embeddings, and then ranked and voted on to produce the result.

The CTA task follows the process shown in Figure 5, using the result of the CEA task. The first step is to identify the primary key column among the targets. We then apply the relation match (the "Intersection" step in Figure 5) and embedding similarity to rank the candidates retrieved through the SPARQL endpoints.

2.2.1 Relation match

After annotating the cell values, we retrieve the types of each linked entity in the same column through the Wikidata SPARQL endpoint. Since the search results may contain many wrong candidates, it is essential to clean the candidate set of each cell before applying further mechanisms such as voting and embedding-based disambiguation. Here we assume that the mentions in primary key columns are all correctly linked to their entities, while mentions outside primary key columns are often linked to a "vague entity". For example, suppose Trump and American are the texts of two cells in the same row: Trump can easily be linked to "Donald Trump", while American may be linked to "The United States of America", which does not precisely describe the semantics of the column, namely the birthplace of Trump. When we search for the types of "The United States of America", the results include "Country", "Superpower", etc., none of which involve the relation "place of birth". We therefore use the knowledge graph structure to choose the most suitable relation path [3] for entities whose property values exactly match the "vague entity", and extract the required type from the ontology. Then we use a vote mechanism, embedding distance [8][9] and text similarity to rank the types in the candidate set and obtain the final result. A sketch of the type lookup follows.
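As an illustration, the following query retrieves candidate types for a linked entity from the public Wikidata SPARQL endpoint. Expressing the lookup through wdt:P31 (instance of) and wdt:P279 (subclass of) is our assumption about a reasonable form of the query, not a listing of the exact one we used; Q30 (United States of America) appears only as an example.

```python
# Sketch of the type lookup against the public Wikidata SPARQL endpoint.
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def candidate_types(entity_id: str) -> list[str]:
    """Return the direct classes of an entity together with their superclasses."""
    query = f"""
    SELECT DISTINCT ?type WHERE {{
      wd:{entity_id} wdt:P31/wdt:P279* ?type .
    }} LIMIT 100
    """
    resp = requests.get(WIKIDATA_SPARQL,
                        params={"query": query, "format": "json"},
                        headers={"User-Agent": "GBMTab-sketch/0.1"})
    resp.raise_for_status()
    return [b["type"]["value"] for b in resp.json()["results"]["bindings"]]

# Example: the types of Q30 include country (Q6256), which the relation match
# must reconcile with the expected range of "place of birth".
print(candidate_types("Q30")[:5])
```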
2.2.2 Column type disambiguation

If the intersection of the properties and the search candidates is empty, we need feature engineering, rather than simple voting, to find the most suitable linking entity. Word embeddings achieve better accuracy at the semantic level, so using them to extend the available information within a table is reliable for the CTA task. Since the mention texts in the CEA task are hard to embed because of uncommon, out-of-vocabulary words, the trending method for CTA is voting [10], which ignores the properties of the cells in the primary key column. However, when the intersection of properties and search candidates is empty, we use a hybrid method that combines the vote mechanism, embedding distance and text similarity to rank the types in the candidate set.

3 Results

Table 1 shows the results of our first two rounds in the SemTab 2021 challenge.

Table 1. The SemTab 2021 results of our team.

            Round 1               Round 2
            F1       Precision    F1       Precision
    CEA     0.692    0.692        0.868    0.868
    CTA     0.133    0.133        -        -

3.1 Result for the CEA task

For the CEA task in round 1, we use the DBpedia 2016-10 data as the knowledge graph. As Table 2 shows, the noise added to the tables affects the candidate entities generated from string similarity. After adding the noise mention repair mechanism, F1 and precision improve greatly, which shows that the method is effective. In round 2, we used Wikidata as our knowledge graph and focused on the impact of candidate entity generation on the CEA results. After experimentation, we prefer the MediaWiki API, which can perform fuzzy searches over characters; the resulting precision is clearly better than in round 1. Figure 6 shows that, when we vary the top-k range over candidate sets that contain the ground-truth entity, the accuracy of our probabilistic graphical model degrades only slightly, indicating strong robustness.

Table 2. The impact of noise mention repair in round 1.

                      F1       Precision
    Without Repair    0.502    0.502
    With Repair       0.692    0.692

Fig. 6. The impact of the range for Top-K on the PGM model (x-axis: range for top-k from 0 to 100, where the candidate set contains the ground-truth entity; y-axis: accuracy of disambiguation, roughly 80-100%).

3.2 Result for the CTA task

For the CTA task, introducing the encoding model and the elements in the primary key column regularizes the candidate list at the semantic level and gives less weight to coarse-grained candidates. Even with the intersection of the properties of the key value and the target mention described in Section 2.2, we still remove the most semantically irrelevant entities to keep the candidate list as clean as possible. A better solution might be to fine-tune a pre-trained BERT model, since the language used in the ontology is restricted (almost entirely English) and the word tokens used in the ontology are high-frequency words included in the vocabulary. This could be future work for the task.

4 Conclusion

The graphical model with iterative probability propagation is clearly effective for entity disambiguation, and candidate generation, as its upstream task, has a large impact on the disambiguation results. Spell correction and noise mention detection in the CEA task improve the performance of the CTA task. The size of the table also greatly influences speed: propagating probabilities over the graph of a large table takes a long time. In addition, BERT embeddings and the property intersection helped to improve the CTA result. Compared with the round 1 dataset, the datasets of the following rounds are more complicated and field-specific; DBpedia can extend the information of some entities but helps little with uncommon, field-specific entities.

In the future, we will pay more attention to the accuracy and speed of candidate entity generation and entity disambiguation, and we will improve column type annotation by using pre-trained models to integrate the information, making the results more fine-grained.

5 Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grant No. 62072099, the Jiangsu Provincial Key Laboratory of Network and Information Security under Grant No. BM2003201, the Key Laboratory of Computer Network and Information Integration of the Ministry of Education of China under Grant No. 93K-9, and the Fundamental Research Funds for the Central Universities.

References

1. Daheng Wang, Prashant Shiralkar, Colin Lockard, Binxuan Huang, Xin Luna Dong, and Meng Jiang. TCN: Table Convolutional Network for Web Table Interpretation. In Proceedings of the Web Conference 2021, pp. 4020-4032, April 2021.
2. Dominique Ritze, Oliver Lehmberg, Yaser Oulabi, and Christian Bizer. Profiling the Potential of Web Tables for Augmenting Cross-domain Knowledge Bases. In Proceedings of the 25th International Conference on World Wide Web, pp. 251-261, April 2016.
3. Girija Limaye, Sunita Sarawagi, and Soumen Chakrabarti. Annotating and Searching Web Tables Using Entities, Types and Relationships. Proc. VLDB Endow., vol. 3, pp. 1338-1347, September 2010.
4. Varish Mulwad, Tim Finin, and Anupam Joshi. Semantic Message Passing for Generating Linked Data from Tables. In ISWC 2013, Part I, LNCS vol. 8218, pp. 363-378. Springer, Heidelberg, 2013. https://doi.org/10.1007/978-3-642-41335-3_23
5. Phuc Nguyen, Ikuya Yamada, Natthawut Kertkeidkachorn, Ryutaro Ichise, and Hideaki Takeda. MTab4Wikidata at SemTab 2020: Tabular Data Annotation with Wikidata. In SemTab@ISWC, pp. 86-95, 2020.
6. Chandra Sekhar Bhagavatula, Thanapon Noraset, and Doug Downey. TabEL: Entity Linking in Web Tables. In Proceedings of the 14th International Semantic Web Conference (ISWC 2015), pp. 425-441, 2015.
7. Yusra Ibrahim, Mirek Riedewald, and Gerhard Weikum. Making Sense of Entities and Quantities in Web Tables. In Proceedings of CIKM 2016, pp. 1703-1712, 2016.
8. Jiaoyan Chen, Ernesto Jiménez-Ruiz, Ian Horrocks, and Charles A. Sutton. ColNet: Embedding the Semantics of Web Tables for Column Type Prediction. In AAAI, 2019.
9. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT, 2019.
10. S. Tyagi and E. Jiménez-Ruiz. LexMa: Tabular Data to Knowledge Graph Matching Using Lexical Techniques. In Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab). CEUR-WS.org, 2020.