Introduction

MantisTable SE: an E cient Approach for the Semantic Table Interpretation

Marco Cremaschi

marco.cremaschi@unimib.it 0

Roberto Avogadro

Andrea Barazzetti

a.barazzetti1g@campus.unimib.it 0

David Chieregato

0 0 University of Milan - Bicocca , Viale Sarca 336, 20126 Milan , Italy

In this paper, we present a novel unsupervised and automatic approach for Semantic Table Interpretation. The technique presented is performed against DBpedia and Wikidata, and it can be easily adapted to any other Knowledge Graph (KG). Moreover, we provide a tool (LamAPI) that allows to e ciently fetch data needed for Semantic Table Interpretation (STI) tasks from the KG dumps.

Semantic Table Interpretation DBpedia Wikidata Knowledge Graph

Introduction

The STI is a research eld in continuous evolution with increasing interest over time. To mention a few numbers describing the phenomenon:

KG because it was the richest data source available with ground truths available, later we adapted our approach to Wikidata because of it's increasing popularity and in order to take part in the SemTab2020 challenge[ 4 ]. The requirement to adapt the approach to Wikidata drove us to the development of a new approach. In this way, we showed that the approach can be easily adapted to be used with any KG. We will be referring to our new approach with \MantisTable SE", and it is the 4th Open Source implementation of MantisTable. 2

Methodology

As seen in Section 1, to obtain the STI of tabular data, it is required to link elements of the table with the elements in a KG. The elements in KGs (e.g. DBpedia or Wikidata) are frequently stored in Resource Description Framework (RDF) format, so to access to these elements, it is necessary to query a SPARQL endpoint. For instance, the most popular way to access DBpedia dumps is by using OpenLink Virtuoso, a row-wise transaction-oriented RDBMS with a SPARQL query engine to access to RDF graph store1. Wikidata instead uses Blazegraph2 that is a high-performance graph database supporting Blueprints and RDF/SPARQL APIs. The issue faced with these solutions is the time required for importing the data. Wikidata2019 dump requires some days to set 1 http://virtuoso.openlinksw.com 2 https://blazegraph.com up3. Another problem is given by the amount of information present in KG; for instance, the Wikidata dump is about 1.1TB (uncompressed). The English version of DBpedia instead is split into multiple les of the size of 26GB, which leads to high computation times to obtain a complete STI (e.g. TableMiner+ according to the author [ 6 ] took 13.35 hours to process the Limaye200 dataset). However, not all the information present in a KG is necessary to carry out a STI. Therefore, in order to obtain an e cient approach, it is necessary to identify other ways for querying KG. To do that, the approach described in this paper does not use SPARQL queries but queries indexes built on DBpedia and Wikidata dump les. These indexes are accessible through the use of four di erent API services that allow us to replace all the SPARQL queries. The tool that provides these APIs is called LamAPI. LamAPI is an open-source tool, and the code is available as a Git repository4. LamAPI provides the following services: 1. Label Matching: given a free text (in this case a data inside the table cell), it retrieves the entities with the greatest similarity, using ElasticSearch's default sorted scoring algorithm with full-text search enabled. For instance, considering a table cell containing \Jurassic World" the result returned is shown in Listing 1.1 or Listing 1.2; 2. Predicate matching: given two entities it retrieves all the predicates between them; considering the entity \Jurassic World" and the entity \Colin Trevorrow", the list of predicates is shown in Listing 1.4; 3. Object matching: given an entity it retrieves all the related objects and predicates; for example, with the entity \Jurassic World" the result is the one shown in Listing 1.4; 4. Concept matching: given an entity it retrieves all it's concepts as shown in Listing 1.3. 3 https://addshore.com/2019/10/your-own-wikidata-query-service-with-no-limitspart-1/ 4 https://bitbucket.org/disco unimib/lamapi

Listing 1.1: Wikidata labels. } " count ": 70 , " r"esuurilt"s:":" Q[5{5615459 ",

" label ": " Jurassic World " } ,{" uri ": " Q3512046 ",

" label ": " Jurassic World " } ,{" uri ": " Q18615494 ",

" label ": " Jurassic World " } ,{" uri ": " Q2336369 ",

" label ": " Jurassic World " } ,{" uri ": " Q37598390 ",

" label ": " Jurassic World Evol ." Listing 1.3: Query result - Literals and Concepts. 7 9 11 13 15

The loading of the labels (as shown in Listing 1.1 and Listing 1.2) takes place through the use of pre-computed JSON les. The other three services are made with Redis5 key-value database. As a search engine for the labels service, we use an instance of ElasticSearch6 that supports HTTP GZip natively, giving a performance boost. KG's dump les need some pre-processing in order to be used for this system, in particular we use namespace abbreviations (e.g. dbo: for dbpedia.org/ontology ) in order to save space and we avoid data duplication.

In the following, we will focus on the algorithmic process of our MantisTable SE. The process is organised into eight phases as follows: i) Data Preparation and Normalisation, ii) Column Analysis and Subject Detection, iii) Data Retrieval, iv) Cell Entity Annotation (CEA), v) Column Predicate Annotation (CPA), vi) Column Type Annotation (CTA), vii) Revision and viii) Export.

The process was designed to retrieve the candidate for a given cell only once, especially if the content of that cell is repeated across multiple tables. This allows

5 https://redis.io/ 6 https://www.elastic.co/

the approach to avoid repeating queries for the same content and saves network time, but the drawback is more di cult memory management during execution. Every phase must be completed for all the tables before going to the next one. This does not preclude running the tool against only one table, but when running with multiple tables the execution time will be sharply reduced (for the same text in a cell we have the same candidate entities that will be sorted considering the content of every table. This will be explained more in detail in the CEA phase). For examples, we will refer to Table 1.

SUBJ

LIT

i) Data Preparation and Normalisation. During this phase, all the cells of all the tables are analysed using a tokeniser managing special characters and removing parenthesis and the text block inside them.

ii) Column Analysis and Subject Detection. During Column Analysis we identify literal columns (L-column) by using a set of regex [ 2 ] (i.e. BOOLEAN, DATE, EMAIL, GEOCORDS, INTEGER, REAL, ISBN, URL) to identity different datatypes. The other columns are annotated as Named Entity column (NE-column). The subject column (S-column) can be identi ed between NEcolumn thanks to content-based scores, but it will not be discussed in this approach as it would not introduce anything new from [ 2,6 ]. Besides, target columns are provided during SemTab2019 and SemTab2020 challenges.

iii) Data Retrieval. For each normalised cell, the candidate entities are retrieved from the Label Matching API service. Retrieving data implies to seek for the entities in the KG with similar labels to our Named Entities cells. For example, if we query \jurassic world" on LamAPI Wikidata we get a list of candidates: { Q3512046 ("Jurassic World" - Movie); { Q18615494 ("Jurassic World" - Comic Strip); { Q55615459 ("Jurassic World" - Attraction); { thousands more results.

The candidates will be ranked during the next phases. Cells of NE columns that have the same string value in di erent tables start by considering the same set of candidate entities; this set will be sorted di erently, considering relationships between the entire contents of a table.

iv) Cell Entity Annotation (CEA). For each cell of the S-column and NE-columns a con dence score is calculated by computing the edit distance (Levenshtein distance) between the labels (in di erent languages) of candidate entity ei;j 2 Ei;j and the content of the cell tx(i; j): 1 norm(LevenshteinDistance(tx(i; j)); ei;j ) (1) All the values are normalised in (0,1) range with Divide by Maximum normalisation (for every entity). For L-columns, the con dence score is computed as follows: { for L-columns with numeric datatype: The con dence score is calculated as of the formula in Equation 2. Where a is the cell and b is the corresponding candidate value in the KG and is a constant (experimentally xed at 100). e 0:5[ (a b) ]2 (2) { for L-columns with string datatype: the con dence score is computed like in the case of NE-column for short strings. For long strings, a Jaccard distance is computed, because the number of edits required to change a long string into another one is not necessarily signi cant while considering n-grams with Jaccard is generally a better similarity measure. { for L-column with date datatype, the dates are considered as sortable numeric values in the format YYYYMMDDHHmmSS. The con dence score is computed in the same way as in the numeric case.

As an example we consider the cell containing \superman returns" which can refer to both a movie (Table 1) and a videogame (Table 2).

Listing 1.6: Candidates for the cell \superman returns" of the Table 2.

Considering literal values for the cell \batman begins" with the content of the column \length in min" we are almost sure that the value is correct because after sorting all the values we have the result shown in Listing 1.7.

Listing 1.7: Literal values for the entity \Batman Begins ( lm)". " P4632 ": " label ": " Bechdel Test Movie List ID ", " value ": 40 " confidence ": 0 " P2047 ": " label ": " duration ", " value ": 140 " confidence ": 1.0 " P3110 ": " label ": " ISzDb film ID ", " value ": 234 " confidence ": 0 v) Column Predicate Annotation (CPA). Considering that all the necessary information is gathered in the previous phase, the CPA is a relatively fast process. For each column, all the predicates are retrieved for every possible candidate couple with con dence greater than 0 and they are sorted by their relative frequency to the entire column. The predicate with the greatest frequency will be ranked rst. This process allows to reduce the number of candidate entities for every cell. Con dence scores for the predicates of the director column are shown in Listing 1.8.

When we consider the video game in Table 2, the CPA phase allowed us to drop the \ lm" with the same name. This because it does not have any relationship with the rest of the content of the table (the lm does not have anything to do with \electronic arts", while the video game has a property with exact match).

Listing 1.8: Predicates for \Director" column.

" P57 ": 2 "" lcaobneflid"e:nc"ed"i:rec1t.o0r ", 4 " P58 ":

" label ": " screenwriter ", 6 " P162""c:onfidence ": 0.75 8 "" lcaobneflid"e:nc"ep"r:odu0c.e7r5 ", 10 " P161 ":

" label ": " cast member ", 12 " confidence ": 0.25

vi) Column Type Annotation (CTA). In this phase every candidate entity ei;j 2 Ei;j is analysed again considering the CPA reordering. The process is a bit di erent in DBpedia and Wikidata because the Wikidata KG contains cycles. For reasons of space, in this paper, only the approach for Wikidata will be considered. In the Wikidata version of MantisTable SE, we retrieve the concepts corresponding to every candidate entity from LamAPI. For every column, we use an in-memory structure storing frequencies of the most common datatypes, as shown in Listing 1.9.

Listing 1.9: Example of the structure storing the most frequent concepts.

The frequencies are stored with by-max normalization in order to be homogeneous (i.e. values are between 0 and 1). For every column, the winning concept is chosen as follows: { concepts with a frequency lower than 0.95 are discarded in order to reduce the needed computation on the fully connected Wikidata concepts graph; { in every single table, we consider every possible column pair and we count inbound and outbound edges; the nal concept selected is the one having the highest number of connections in the Wikidata graph.

Listing 1.10: Con dence for rst column of the lm table (Table 1). 1 " Q229390 ":

" label ": "3 D film ", 3 " Q114"2c4o"n:fidence ": 1.0 5 "" lcaobneflid"e:nc"ef"il:m "1.,0 7 " Q114"2l4a"be:l {": " live - action animated film ", 9 " confidence ": 0.33

As both \ lm" and \3D lm" have the same con dence, we consider the number of links in the Wikidata Graph. In our algorithm, we initially ranked rst concepts with the highest number of links in the graph, but after nding out experimentally that in the SemTab2020 challenge a most speci c concept is preferred by the scoring system, we decided to consider the relationship between the nal concepts. In this example, as \3D lm" is a child concept of \ lm", we assign the most speci c concept, so \3D lm" is the nal winning concept. This consideration should not be considered ideal for every dataset.

Listing 1.11: Links in Wikidata Graph for concepts of rst column of Table 1. 1 " Q229390 ":

" label ": "3 D film ", 3 " Q114"2l4i"nk:s ": 0 5 "" llaibnekls "":: "1 film ",

vii) Revision. The revision phase analyses all the information gathered in the previous phases in order to do a nal reordering of the candidates. In particular, this allows for correcting CEA entities previously selected: every entity have to be coherent with the rest of the concepts and predicates selected in every column. Moreover, predicates are also re-ranked.

viii) Export. The MantisTable SE approach previously described keeps the candidates coming from each phase; all the results are stored. It is possible to apply some thresholds during the export phase to reach a balance between annotation quality and the number of annotations provided. The export threshold can have a signi cant role in evaluation metrics for every gold standard. 3

Deployment

As described above, the MantisTable SE approach has been integrated into a web application developed with Python and Django (Figure 4a, Figure 4b). A MongoDB database acts as table and KG repository. The code is freely available through a Git repository7. In order to achieve the scalability of the application, and therefore improve e ciency, MantisTable SE has been installed in a Docker container to achieve parallelisms at the application level and to facilitate the deployment on servers. The management of resources is performed by using Task Queues (i.e. Celery Workers8). The GUI is current under development, when ready it will allow to graphically explore annotations for every phase. (a) (b) This section will report the results obtained by MantisTable SE in Rounds 1, 2, 3 of the SemTab 2020 challenge.

7 bitbucket.org/disco unimib/mantistable-4 8 docs.celeryproject.org/en/latest/userguide/workers.html

Tasks FR1oundP1 FR1oundP2 FR1oundP3 CEA 0.982 0.989 0.991 0.993 0.974 0.979 CTA 0.745 0.753 0.966 0.973 0.958 0.965

CPA 0.888 0.942 0.961 0.966 0.941 0.957

The results obtained in the rst three rounds shows that our method is highly promising. Unfortunately, during Round4, our results are not representative. Due to an infrastructural problem, it was not possible to complete the computation in the given timeframe. 5

Conclusions and General comments

MantisTable SE i) provides a comprehensive solution to support all annotation steps; ii) provides an unsupervised method to annotate datasets of tables e ciently; iii) generates context for disambiguation; iv) provides a tool to support KG querying for STI tasks (LamAPI) and a work ow for STI (MantisTable SE) which are publicly available.

Talking about future improvements: { CEA does take in consideration only directly linked entities. This is likely the most important source of missing/wrong annotations and is going to be solved in the next version; { CPA is directly computed from CEA results but some ambiguities still remain. A slightly more complex algorithm is currently under development; { for CTA annotations we will possibly use space-representations and featurebased similarities (e.g. embeddings) to match missing concepts in the KG.

1. Cafarella , M.J. , Halevy , A. , Wang , D.Z. , Wu , E. , Zhang, Y.: Webtables: Exploring the power of tables on the web . Proc. VLDB Endow . 1 ( 1 ), 538 {549 (Aug 2008 ). https://doi.org/10.14778/1453856.1453916, https://doi.org/10.14778/1453856. 1453916

2. Cremaschi , M. , De Paoli , F. , Rula , A. , Spahiu , B. : A fully automated approach to a complete semantic table interpretation . Future Generation Computer Systems 112 , 478 { 500 ( 2020 ). https://doi.org/https://doi.org/10.1016/j.future. 2020 . 05 .019, http://www.sciencedirect.com/science/article/pii/S0167739X19302663

3. Fetahu , B. , Anand , A. , Koutraki , M. : Tablenet: An approach for determining ne-grained relations for wikipedia tables . In: The World Wide Web Conference . p. 2736 { 2742 . WWW ' 19 , Association for Computing Machinery, New York, NY, USA ( 2019 ). https://doi.org/10.1145/3308558.3313629, https://doi.org/10. 1145/3308558.3313629

4. Jimenez-Ruiz , E. , Hassanzadeh , O. , Efthymiou , V. , Chen , J. , Srinivas , K. : Semtab 2019: Resources to benchmark tabular data to knowledge graph matching systems . In: European Semantic Web Conference . pp. 514 { 530 . Springer ( 2020 )

5. Lehmberg , O. , Ritze , D. , Meusel , R. , Bizer , C. : A large public corpus of web tables containing time and context metadata . In: Proceedings of the 25th International Conference Companion on World Wide Web . p. 75 { 76 . WWW '16 Companion, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE ( 2016 ). https://doi.org/10.1145/2872518.2889386, https://doi.org/10.1145/2872518.2889386

6. Zhang , Z. : E ective and e cient semantic table interpretation using tableminer+ . Semantic Web 8 ( 6 ), 921 { 957 ( 2017 )