
LinkingPark: An Integrated Approach for Semantic Table Interpretation

Shuang Chen (0, 1), Alperen Karaoglu, Carina Negreanu, Tingting Ma (0, 1), Jin-Ge Yao, Jack Williams, Andy Gordon, Chin-Yew Lin

0 Harbin Institute of Technology, Harbin, China
1 Microsoft Research Asia, Beijing, China
2 Microsoft Research Cambridge, Cambridge, UK

In this paper, we present LinkingPark, our system for the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2020). LinkingPark is an integrated approach for semantic table interpretation. Our system includes a cascaded pipeline for candidate generation, an iterative coarse-to-fine entity disambiguation algorithm, a multi-pass property linking algorithm, and a type inference algorithm that tackles the issue of the loose ontology in Wikidata. Results on SemTab 2020 demonstrate the effectiveness of our approach.

Introduction

Approach

[Figure: system overview. Candidate generation (MediaWiki API access, mention spelling corrector, fine-grained Elastic Search) feeds the entity linker (entity disambiguation with a coarse phase, candidate pruning and a fine phase) and the property linker (entity property linker; lexical property linker with perfect and fuzzy matchers), which produce the CEA and CPA outputs. Type inference produces the CTA output by sieving on SupportCount(t), then AverageLevel(t), then Population(t) if the column is the main column and InstanceRank(t) otherwise.]

... describe key attributes of an entity. $\mathcal{F}$ is the set of facts, which consists of RDF triples $\langle s, p, o \rangle$, where $s$ denotes a subject (an entity $e \in \mathcal{E}$), $p \in \mathcal{P}$ is a property (also known as a predicate or relation) and $o$ denotes an object (an entity $e$ or a data value, e.g. a number, time or string). The target knowledge base of SemTab 2020 is Wikidata (footnote 5).

The three matching tasks of SemTab 2020 can be described as:

- CEA (Cell Entity Annotation): link each entity mention string $t_{ij}$ in table $T$ to its referent entity in $\mathcal{E}$.
- CTA (Column Type Annotation): associate a table column $c_j$ with an entity type $t \in \mathcal{T}$. A column may be described by multiple types, and the most specific one is usually preferred.
- CPA (Columns Property Annotation): associate a pair of columns, $c_s$ and $c_t$, with a property $p \in \mathcal{P}$.

Footnote 5: http://wikidata.org/

... the entity disambiguation module to characterise the relatedness among different rows. Finally, we design a heuristic multi-pass sieve method for type inference based on the linked entities. Next, we describe each component in detail.

Entity linker

The entity linker is implemented with a typical approach that consists of two sub-modules: candidate generation and entity disambiguation.

Candidate generation. Given an entity mention $t_{ij}$, we generate its candidate entities $E_{ij} = (e_{ij1}, \ldots, e_{ijk})$ through a cascaded pipeline with three core steps:

- Accessing the Wikidata MediaWiki API: we start by querying the Wikidata MediaWiki API (footnote 6), capping the number of candidates returned by this API at 50.
- Correcting spelling errors: the MediaWiki API does not handle spelling errors. Following the design principles of a typical spelling corrector (footnote 7), we implement a tailored mention spelling corrector for better candidate retrieval. Specifically, the corrector enumerates all strings within edit distance one of the original mention string and retains those that appear among the Wikidata entity titles as candidates. This step is not intended for mentions with multiple spelling errors, because the number of strings to check grows exponentially with the edit distance.
- Searching with fine-grained Elastic Search: in addition, we build a fine-grained Elastic Search index over all Wikidata entity titles. The index uses a weighted combination of a word-based BM25 score and a trigram-based BM25 score for fuzzy matching. This step can improve the recall of candidate generation, but may also return more false-positive candidates than the first two steps.
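The one-edit-distance lookup in the spelling-correction step can be sketched as follows, in the spirit of the spelling corrector cited above. This is a minimal illustration, not the authors' implementation: the function names, the lower-casing, the alphabet, and the tiny `known_titles` set are all assumptions made for the example; in the described system the title set would be the full set of Wikidata entity titles.

```python
import string

def edits1(mention, alphabet=string.ascii_lowercase + " "):
    """Return every string reachable by one delete, transpose, replace or insert."""
    splits = [(mention[:i], mention[i:]) for i in range(len(mention) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def correct_mention(mention, known_titles):
    """Keep only the one-edit variants that are known (lower-cased) entity titles."""
    return sorted(edits1(mention.lower()) & known_titles)

# Hypothetical stand-in for the set of Wikidata entity titles.
known_titles = {"kingston", "kingstown", "montreal"}

one_missing_letter = correct_mention("Kingstn", known_titles)
swapped_letters = correct_mention("Montrael", known_titles)
two_errors = correct_mention("Kngstn", known_titles)
```

The candidate set grows roughly linearly with the mention length for one edit, but applying `edits1` repeatedly multiplies the set size at each level, which is why the step is restricted to a single edit.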

Entity disambiguation. Given an entity mention $t_{ij}$ along with its candidate list $E_{ij} = (e_{ij1}, \ldots, e_{ijk})$, the entity disambiguation stage aims to select the correct entity $\hat{e}_{ij} \in E_{ij}$ from the candidate list based on contextual information.

Formally, given a table $T = \{\{t_{11}, \ldots, t_{1n}\}, \ldots, \{t_{m1}, \ldots, t_{mn}\}\}$, the objective of entity disambiguation is to find the most compatible entity assignment for each cell $t_{ij}$:

$$\operatorname*{argmax}_{e_{11}, e_{12}, \ldots, e_{mn} \,\in\, E_{11} \times E_{12} \times \cdots \times E_{mn}} g(e_{11}, e_{12}, \ldots, e_{mn} \mid T), \tag{1}$$

where $g(e_{11}, e_{12}, \ldots, e_{mn} \mid T)$ is the function measuring the compatibility score of the entity assignments in table $T$.

Footnote 6: https://www.wikidata.org/w/api.php?action=help&modules=wbsearchentities
Footnote 7: https://norvig.com/spell-correct.html

Algorithm 1: Coarse-to-fine disambiguation algorithm

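Input: Table $T$ with candidate lists $\{E_{ij}\}$ and parameters $\{\alpha, \beta, \gamma\}$
Output: Entity assignments $\{\hat{e}_{ij}\}$
1: Initialize $e^{0}_{ij} = \operatorname{argmax}_{e \in E_{ij}} \; \alpha \cdot \mathrm{edit\_dist\_sim}(e, t_{ij}) + (1 - \alpha) \cdot ps(e \mid t_{ij})$
2: while $t < \mathrm{max\_iter}$ and any of the entity assignments have changed do
3:   $s_{col} = \frac{1}{m-1} \sum_{k=1, k \neq i}^{m} \mathrm{coarse\_ent\_sim}(e, e^{t-1}_{kj})$
4:   $s_{row} = \begin{cases} \frac{1}{n-1} \sum_{k=1, k \neq j}^{n} \max\left(f_{lexical}(e, t_{ik}),\, f_{entity}(\{e\}, E_{ik})\right) & \text{if } j = 0 \\ f_{entity}(E_{i0}, \{e\}) & \text{else} \end{cases}$
5:   $s_{ij}(e) = \beta \cdot s_{col} + \gamma \cdot s_{row} + \alpha \cdot \mathrm{edit\_dist\_sim}(e, t_{ij}) + (1 - \alpha - \beta - \gamma) \cdot ps(e \mid t_{ij})$
6: end while
7: Prune candidates based on $s_{ij}(e)$
8: while $t < \mathrm{max\_iter}$ and any of the entity assignments have changed do
9:   $s_{col} = \frac{1}{m-1} \sum_{k=1, k \neq i}^{m} \mathrm{fine\_ent\_sim}(e, e^{t-1}_{kj})$
10:  $s_{row} = \begin{cases} \frac{1}{n-1} \sum_{k=1, k \neq j}^{n} \max\left(f_{lexical}(e, t_{ik}),\, f_{entity}(\{e\}, \hat{E}_{ik})\right) & \text{if } j = 0 \\ f_{entity}(\{e^{t-1}_{i0}\}, \{e\}) & \text{else} \end{cases}$
11:  $s_{ij}(e) = \beta \cdot s_{col} + \gamma \cdot s_{row} + \alpha \cdot \mathrm{edit\_dist\_sim}(e, t_{ij}) + (1 - \alpha - \beta - \gamma) \cdot ps(e \mid t_{ij})$
12: end while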
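The control flow of Algorithm 1 (independent initialisation, then repeated greedy re-assignment against the previous iteration's assignments until nothing changes) can be sketched as below. This is a simplified illustration under stated assumptions: `ica_disambiguate`, `local_score`, `context_score`, the toy entity identifiers, the type table and the 0.5/0.5 weights are all hypothetical, and the toy context score only checks type agreement within a column rather than the full column and row scores of the algorithm.

```python
def ica_disambiguate(candidates, local_score, context_score, max_iter=10):
    """Greedy iterative re-assignment; candidates maps cell -> non-empty list."""
    # Initialisation: each cell is decided independently from local evidence.
    assignment = {cell: max(cands, key=lambda e: local_score(cell, e))
                  for cell, cands in candidates.items()}
    for _ in range(max_iter):
        previous = dict(assignment)
        # Every cell is re-scored against the assignments of the last iteration.
        for cell, cands in candidates.items():
            assignment[cell] = max(
                cands, key=lambda e: context_score(cell, e, previous))
        if assignment == previous:  # converged: no assignment changed
            break
    return assignment

# Toy data: cells are (row, column); identifiers and types are made up.
TYPES = {"kingston_band": "band", "kingston_city": "city", "montreal_city": "city"}
candidates = {(0, 0): ["kingston_band", "kingston_city"], (1, 0): ["montreal_city"]}

def local_score(cell, entity):
    # Prior score 1 / rank, with rank starting at 1.
    return 1.0 / (candidates[cell].index(entity) + 1)

def context_score(cell, entity, previous):
    # Share of other cells in the same column whose assigned type agrees.
    others = [e for c, e in previous.items() if c != cell and c[1] == cell[1]]
    agree = sum(TYPES[e] == TYPES[entity] for e in others)
    col = agree / len(others) if others else 0.0
    return 0.5 * col + 0.5 * local_score(cell, entity)

result = ica_disambiguate(candidates, local_score, context_score)
```

In this toy table the first cell starts with the top-ranked "band" candidate and is switched to the "city" candidate once the column context is taken into account.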

Since exact inference of the above objective is NP-hard, we adopt the framework of the Iterative Classification Algorithm (ICA) [1] for approximate inference. ICA is an iterative local search method that greedily re-assigns each cell to the entity that maximises the probability conditioned on the current entity assignments of the other cells. The disambiguation model is designed around two assumptions: (1) type consistency along each column of entities, and (2) property relatedness within each row of attribute values. In other words, entities mentioned in the same column should have compatible types, while entities or values mentioned in the same row (hence describing the same entity) should be related via relational facts and satisfy lexical constraints. Specifically, our model includes a coarse-grained phase, which filters out type-incompatible candidates, and a fine-grained phase, which selects the best candidate by considering more fine-grained property values. The pseudo-code of the disambiguation procedure is shown in Algorithm 1 and consists of the following four steps:

1. Initialization (line 1): Let $e^{t}_{ij}$ be the entity assignment of cell $t_{ij}$ at iteration $t$. Initially, the entity assignments for all cells are set independently by maximising a local score for each cell. The score is a weighted combination of the string similarity between the cell text and the title of the candidate entity, $\mathrm{edit\_dist\_sim}(e, t_{ij})$ (implemented using the Levenshtein.ratio function in Python), and a prior score $ps(e \mid t_{ij})$. The prior score is calculated as $ps(e \mid t_{ij}) = \frac{1}{rank_e}$, where $rank_e$ is the ranking index (starting at 1) of the entity $e$ in its candidate list $E_{ij}$.

2. Coarse-grained phase (lines 2-6): During the coarse phase, the candidate entity's score $s_{ij}(e)$ is a weighted combination of the column support score $s_{col}$, the row support score $s_{row}$, the string similarity $\mathrm{edit\_dist\_sim}(e, t_{ij})$ and the prior score $ps(e \mid t_{ij})$. The column score $s_{col}$ is calculated by averaging the entity similarity between the current candidate entity and the entity assignments of the remaining cells in the same column at the previous iteration ($e^{t-1}_{kj}$). Specifically, we represent each entity as a sparse feature vector in which each property, and each value of the instance of (P31) / subclass of (P279) properties, serves as one feature dimension. Our basic assumption is that the properties of an entity are also a proxy for its type, in addition to the explicit type annotations in the KB. The $\mathrm{coarse\_ent\_sim}(\cdot, \cdot)$ function is implemented as the cosine similarity of these sparse feature vectors. The features are not equally important, so we adopt a dynamic method to generate feature weights that considers how widely a feature is shared along the column and how discriminative it is for disambiguating the current cell. We use a scheme similar to TF-IDF weighting: the term fraction of a feature $f$ in a column $j$, denoted by $TF_j(f)$, is defined as

$$TF_j(f) = \frac{\left|\{e^{t-1}_{ij} \mid f \in e^{t-1}_{ij},\ 1 \le i \le m\}\right|}{m}, \tag{2}$$

which is the fraction of entities assigned in the column at the previous time step that contain this feature. To avoid noise from irrelevant features, we set $TF_j(f) = 0$ if it is lower than 0.5. The Inverse Document Frequency (IDF) of a feature $f$ over one cell $t_{ij}$ is defined as

$$IDF_{ij}(f) = \log \frac{|E_{ij}| + 1}{\left|\{e \mid f \in e,\ e \in E_{ij}\}\right| + 1} + 1, \tag{3}$$

which essentially treats each candidate as a document and measures the IDF over the candidate list. Here we adopt a smoothed version of IDF to avoid zero divisions and zero weights. Finally, the weight of a feature $f$ over a cell $t_{ij}$, denoted by $f_{ij}$, is defined as

$$f_{ij} = TF_j(f) \cdot IDF_{ij}(f). \tag{4}$$

Similar TF-IDF formulations have been used successfully by previous SemTab participants (e.g., the Tabularisi system [7] at SemTab 2019, for calculating the ranking score). We adapt this formulation to the ICA framework to calculate pairwise entity similarities, using a smoothed version of IDF and pruning features with low support to mitigate noise.

The row score $s_{row}$ is calculated by extracting property features at both the lexical and the entity level. It characterises the property relatedness between the current candidate entity and the remaining cells in the same row. Specifically, if a cell lies in the main column of the table, we calculate the support score from each remaining cell in the same row; otherwise, we only consider the support score from the cell in the main column. Given the property distribution from the property linker, the support score ($f_{entity}(\cdot, \cdot)$ or $f_{lexical}(\cdot, \cdot)$) is calculated by first retrieving the possible properties between the current candidate entity and the remaining cells, and then taking the largest confidence in the corresponding property distribution.

3. Pruning (line 7): We reduce the search space at this stage, before the more fine-grained processing. For each entity mention, we look at the candidates sorted by their final scores. If the difference between the final scores of the top-2 candidates is above a threshold min_diff, we keep only the top-1 candidate. Otherwise, we keep the top-K candidates plus any candidates whose final score is above a threshold min_abs.

4. Fine-grained phase (lines 8-12): For some highly ambiguous cases, we need to compare the specific values of a certain property instead of looking only at the presence of the property fields. For example, for a column of Canadian cities such as ["Kingston", "Montreal"], the system may know after the coarse-grained step that these are cities, but there are multiple cities named "Kingston", and we still have to choose between Kingston in Jamaica and Kingston in Canada. In such cases, we have to consider the specific values of certain key properties, such as Country = Canada. In the fine-grained phase, we therefore extend the sparse features used for calculating entity similarity from all properties to all property values.
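The feature weighting used for the column score in the coarse-grained phase (the thresholded term fraction, the smoothed IDF and their product) can be illustrated with a small worked example. This is a sketch under assumptions: entities are represented as plain sets of feature labels, the labels themselves are made up, and applying the weight as the vector entry for every feature an entity contains is one plausible reading of how the weights enter the cosine similarity.

```python
import math

def tf(feature, column_entities):
    """Fraction of previously assigned column entities containing the feature."""
    frac = sum(feature in feats for feats in column_entities) / len(column_entities)
    return frac if frac >= 0.5 else 0.0  # low-support features are zeroed out

def idf(feature, cell_candidates):
    """Smoothed IDF, treating each candidate of the cell as a document."""
    containing = sum(feature in feats for feats in cell_candidates)
    return math.log((len(cell_candidates) + 1) / (containing + 1)) + 1

def weighted_cosine(a, b, weight):
    """Cosine similarity of two feature sets whose entries are weight(f)."""
    dot = sum(weight(f) ** 2 for f in a & b)
    norm_a = math.sqrt(sum(weight(f) ** 2 for f in a))
    norm_b = math.sqrt(sum(weight(f) ** 2 for f in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up feature labels for three entities assigned in one column.
column = [
    {"instance_of=city", "country"},
    {"instance_of=city", "country", "population"},
    {"instance_of=band", "record_label"},
]
# Made-up candidate list for the cell being disambiguated.
cell_candidates = [column[0], column[2]]

def weight(f):
    return tf(f, column) * idf(f, cell_candidates)

similarity = weighted_cosine(column[0], column[1], weight)
```

Here "population" appears in only one of three column entities, so its term fraction falls below 0.5 and it is ignored; the two city entities then look identical under the weighted representation.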

Property linker

For the property linking algorithm, we use the approach presented in the technical report [3]. For every relational column, we start from the strings in the cells and try to generate candidates as described in the previous section for the coarse-grained phase. When the search does not return satisfactory results (for example, when none of the strings in the column can be matched to an entity), the column usually holds a numerical property, with numbers or dates as values, and we treat it as a special column.

For columns where we can identify KB entities, we try to find direct matches, or matches within a given edit distance, with the property values of the entities in the main column. For numerical properties, we try to find direct matches up to unit conversion. Once we have a set of matches, each row votes to find a first most-likely property. If the votes do not reach a certain threshold, or the difference between the top choices is too small, we use a second refinement phase that is more computationally expensive. For numerical properties, we have precomputed a set of characteristic statistics per type (for example, human heights have a certain range, mean and standard deviation). For each type that can suitably describe the main column, we check which of the precomputed statistics best match the numerical column that we could not identify. For the SemTab dataset, we found that just looking at ranges suffices.
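The row-voting step, with its fallback to a refinement phase when the vote is weak or too close, might be sketched as follows. The function name, the property labels, and the two thresholds `min_share` and `min_margin` (with their default values) are hypothetical; the text above only states that a threshold and a minimum difference between the top choices are used.

```python
from collections import Counter

def vote_property(row_matches, min_share=0.5, min_margin=0.1):
    """row_matches holds, for each row, the set of properties matched in that row.

    Returns (most voted property, whether the refinement phase is needed).
    """
    votes = Counter(p for props in row_matches for p in props)
    if not votes:
        return None, True  # nothing matched: defer to the refinement phase
    n_rows = len(row_matches)
    ranked = votes.most_common(2)
    top_prop, top_votes = ranked[0]
    runner_up_votes = ranked[1][1] if len(ranked) > 1 else 0
    top_share = top_votes / n_rows
    margin = (top_votes - runner_up_votes) / n_rows
    needs_refinement = top_share < min_share or margin < min_margin
    return top_prop, needs_refinement

# Made-up matches for a four-row column pair.
clear_rows = [{"country"}, {"country", "located_in"}, {"country"}, set()]
tied_rows = [{"country", "located_in"}, {"country", "located_in"}]

clear_result = vote_property(clear_rows)
tied_result = vote_property(tied_rows)
```

In the tied case both properties receive the same number of votes, so the margin is zero and the more expensive second phase would be triggered.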

A common issue we encounter with Wikidata is that entities do not have complete information, i.e. some properties may be missing. For columns where we can identify KB entities, we extend the ranking score by considering the properties of similar entities. If several rows voted for a given property and a given row does not have that property, we want to know whether the property is missing or not applicable. We extend the binary score (whether a given property is present for a given entity) to a score in $(0, 1)$ that takes into account how many similar entities do or do not contain the relevant property. We define the most similar entities of a given entity as the set of nearest neighbours (by cosine distance) in the BigGraph embedding space [4] that share the same type as the given entity.
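One way to turn the binary "property present" score into a value in (0, 1) is sketched below. The exact formula is not given in the text, so this is purely an assumption for illustration: the score is the add-one smoothed share of similar entities that carry the property, on the reasoning that a property held by most neighbours is more likely missing than not applicable. The function name and the data are hypothetical, and the neighbour list is taken as given rather than retrieved from an embedding space.

```python
def soft_property_score(entity_props, neighbour_props, prop):
    """Score how plausibly `prop` holds for an entity, given similar entities.

    entity_props: set of properties of the entity itself.
    neighbour_props: list of property sets, one per similar entity.
    """
    if prop in entity_props:
        return 1.0  # the property is present: plain binary support
    with_prop = sum(prop in props for props in neighbour_props)
    # Add-one smoothing keeps the score strictly between 0 and 1.
    return (with_prop + 1) / (len(neighbour_props) + 2)

# Made-up neighbours: two of three similar entities have a "population" value.
neighbours = [{"population", "country"}, {"population"}, {"country"}]

probably_missing = soft_property_score({"country"}, neighbours, "population")
probably_not_applicable = soft_property_score({"country"}, neighbours, "record_label")
```

With no neighbours at all the score falls back to 0.5, which expresses that nothing is known either way.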

Type inference

Our type inference algorithm is a heuristic multi-pass sieve method that is fully dependent on the entity linking results. To predict the type of column $j$, we first acquire the entity linking results $E_j = \{\hat{e}_{ij} \mid 1 \le i \le m\}$ from the entity linker. Then we retrieve the entity types $T(e)$ for each entity $e \in E_j$, where we define $T(e)$ as the set of all types satisfying the SPARQL expression ?entity wdt:P31/wdt:P279?/wdt:P279? ?types., treating the values of instance of (P31) and subclass of (P279) as the types of each entity. The goal is then to find the most common types shared by most of the entities. To do so, we define the first criterion, SupportCount(t):

$$\mathrm{SupportCount}(t) = \left|\{e \mid e \in E_j,\ t \in T(e)\}\right|. \tag{5}$$

We select the type with the maximum SupportCount(t), but multiple types may have the same count. In that case, we want to prioritise the most specific one. We design a second criterion, AverageLevel(t), based on the type ontology to characterise the specificity of a type $t$:

$$\mathrm{AverageLevel}(t) = \mathrm{AVG}\left(\{h \mid e \text{ is an instance of } t \text{ via a path of length } h,\ e \in E_j\}\right). \tag{6}$$

Since a shorter distance to the entity nodes indicates a more specific type, we select the type with the minimum AverageLevel(t) to break the above ties. However, this method does not guarantee uniqueness. In practice, we found that the following design works well for tie-breaking on the SemTab data. For the main column, we select the type with the minimum Population(t) on Wikidata, where

$$\mathrm{Population}(t) = \left|\{e \mid t \in T(e),\ e \in \mathcal{E}\}\right|. \tag{7}$$

For relation columns, we select the type with the minimum InstanceRank(t), where

$$\mathrm{InstanceRank}(t) = \mathrm{AVG}\left(\{r \mid e \text{ is an instance of } t \text{ at rank } r,\ e \in E_j\}\right), \tag{8}$$

and rank means the position of the type $t$ among the statement group of the instance of property.
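The multi-pass sieve above can be sketched as a sequence of filters over the candidate types. This is a minimal illustration under assumptions: each linked entity is given as a dictionary from type label to path length, the final tie-break values (Population for the main column, InstanceRank for relation columns) are passed in precomputed as `tie_break`, and the last-resort ordering by type name is added here only to make the sketch deterministic. All labels and numbers are made up.

```python
def infer_column_type(entity_types, tie_break):
    """entity_types: one {type: path_length} dict per linked entity in the column.
    tie_break: {type: value}, lower wins (Population or InstanceRank)."""
    support, levels = {}, {}
    for types in entity_types:
        for t, h in types.items():
            support[t] = support.get(t, 0) + 1
            levels.setdefault(t, []).append(h)
    if not support:
        return None
    # Pass 1: keep the types with maximum SupportCount.
    best_support = max(support.values())
    pool = [t for t, count in support.items() if count == best_support]
    # Pass 2: keep the most specific types (minimum AverageLevel).
    avg_level = {t: sum(levels[t]) / len(levels[t]) for t in pool}
    best_level = min(avg_level.values())
    pool = [t for t in pool if avg_level[t] == best_level]
    # Pass 3: break remaining ties with the column-specific criterion.
    return min(pool, key=lambda t: (tie_break.get(t, float("inf")), t))

# Made-up types with path lengths for three linked entities.
column = [
    {"city": 1, "human settlement": 2, "big city": 1},
    {"city": 1, "human settlement": 2, "big city": 1},
    {"city": 1, "human settlement": 2},
]
by_level = infer_column_type(column, tie_break={})

# Two types tied on support and level; the smaller population wins.
tied = [{"city": 1, "municipality": 1}, {"city": 1, "municipality": 1}]
by_population = infer_column_type(tied, {"city": 1000, "municipality": 50})
```

In the first column "city" and "human settlement" are both supported by all three entities, and the shorter average path selects "city"; in the second, the first two passes cannot separate the types, so the third pass decides.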

Setup and Results

Accessing the online SPARQL endpoint is very slow given the large amount of data, so we use an offline Wikidata dump (20200525). Our experimental pipeline starts by calling the MediaWiki API, which usually takes 2-3 days for each Round. After we generate the entity candidates, we cache the results and extract the relevant subset of the Wikidata dump. Our multi-threaded Python pipeline takes at most 20-30 minutes for each Round on an Intel(R) Xeon(R) CPU E7-4860 v2 (4 processors) machine. As we do not train the hyper-parameters, we empirically set the three weighting parameters of Algorithm 1 to 0.20, 0.50 and 0.1 (in the order in which they are listed in its input), min_diff to 0.30, min_abs to 0.50 and K to 2.

Results

Discussion

Synthetic data vs. Real data

The evaluation datasets for the SemTab challenge use synthetic data that is automatically generated from the knowledge base. Although various refinement strategies have been adopted in the generation process to simulate real data, we argue that there is still a significant gap:

- In real data, a table expresses the intent of its creator, while synthetic data is generated through a random combination of type-compatible entities.
- The spelling errors introduced are not necessarily representative of the errors that a table creator might produce.
- Real data might contain many more entities, types, and relations outside the specified knowledge base, making it more challenging than data synthesised from the knowledge base.

Footnote 9: https://www.cs.ox.ac.uk/isg/challenges/sem-tab/2020/results.html

The currently available datasets curated from real-world data are either small in scale [5, 6] or very noisy, as the data is automatically extracted from Wikipedia [2]. In order to make progress in this field, better datasets need to be curated and carefully annotated to complement the synthetic SemTab data produced in the current way.

Type ontology in Wikidata

The type ontology in Wikidata is noisy, as we can see from the example in Fig. 2. Under an ontology with such complex sub-structures, it is hard to determine the specificity of a certain type. In order to define the CTA task more clearly and more fairly on the Wikidata ontology, further cleaning is required (either manually or automatically) to reach a more reliable structure such as the one curated for DBpedia.

Challenge design

Finally, we suggest splitting the dataset into a development set and a test set. The test set should be used for the final evaluation, while the development set should be released for model design and tuning. This way, participants can improve their systems without having to make multiple submissions.

Conclusion

In this paper, we present LinkingPark, our system for SemTab 2020. Our pipeline with multiple components is an integrated approach for semantic table interpretation. Results on SemTab 2020 demonstrate the effectiveness of our approach for all three tasks. We hope that some parts of our solutions, as well as the observations and insights we gathered during the challenge, will be beneficial for future research efforts towards a better understanding of tabular data.

References

1. Bhagavatula, C.S., Noraset, T., Downey, D.: TabEL: entity linking in web tables. In: International Semantic Web Conference. pp. 425-441. Springer (2015)

2. Efthymiou, V., Hassanzadeh, O., Rodriguez-Muro, M., Christophides, V.: Matching web tables with knowledge base entities: from entity lookups to entity embeddings. In: International Semantic Web Conference. pp. 260-277. Springer (2017)

3. Karaoglu, A., Negreanu, C., Chen, S., Williams, J., Fabian, D., Gordon, A., Lin, C.Y.: Wiki2row - the in's and out's or row suggestion with a large scale knowledge base. Tech. Rep. MSR-TR-2020-37, Microsoft (October 2020), https://www.microsoft.com/en-us/research/publication/wiki2row-the-ins-and-outs-or-row-suggestion-with-a-large-scale-knowledge-base/

4. Lerer, A., Wu, L., Shen, J., Lacroix, T., Wehrstedt, L., Bose, A., Peysakhovich, A.: PyTorch-BigGraph: A Large-scale Graph Embedding System. In: Proceedings of the 2nd SysML Conference. Palo Alto, CA, USA (2019)

5. Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. Proceedings of the VLDB Endowment 3(1-2), 1338-1347 (2010)

6. Ritze, D., Lehmberg, O., Bizer, C.: Matching HTML tables to DBpedia. In: Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics. pp. 1-6 (2015)

7. Thawani, A., Hu, M., Hu, E., Zafar, H., Divvala, N.T., Singh, A., Qasemi, E., Szekely, P.A., Pujara, J.: Entity linking to knowledge graphs to infer column types and properties. In: SemTab@ISWC (2019)