s-elBat: a Semantic Interpretation Approach for Messy taBle-s

Marco Cremaschi1, Roberto Avogadro1 and David Chieregato1
1 University of Milan - Bicocca, viale Sarca 336, Edificio U14, 20126, Milan, Italy
SemTab 2022: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching
marco.cremaschi@unimib.it (M. Cremaschi); roberto.avogadro@unimib.it (R. Avogadro); david.chieregato@unimib.it (D. Chieregato)
ORCID: 0000-0001-7840-6228 (M. Cremaschi); 0000-0001-8074-7793 (R. Avogadro)

Abstract
This paper describes s-elBat, a Semantic Table Interpretation approach. The approach inherits and improves upon techniques from MantisTable, an approach used and tested in previous editions of the SemTab challenge. s-elBat adds an innovative and optimised lookup approach for generating candidate entities for the annotation.

Keywords
Semantic Table Interpretation, Tabular Data, SemTab Challenge, Knowledge Graph

1. Introduction
Tables are everywhere and play a crucial role in creating, organising, and sharing information on the Web of Data. The growing spread of tables can be linked to the uptake of the Open Data movement, whose purpose is to make a large number of tabular data sources freely available, addressing a wide range of domains, such as finance, mobility, tourism, sports, or cultural heritage [1]. The scale of the phenomenon can be gauged from the number of available tables and from the number of people who use Google Sheets or Microsoft Excel:

• Web Tables: in 2008, 14.1 billion HTML tables were extracted, of which 154 million (1.1%) were found to be high-quality tables;
• Web Tables: 233 million content tables in the 2015 Common Crawl repository (commoncrawl.org);
• Wikipedia Tables: the 2022 English snapshot of Wikipedia contains 2 803 424 tables from 21 149 260 articles [2];
• Spreadsheets: there are 750 million to 2 billion people in the world who use either Google Sheets or Microsoft Excel (askwonder.com/research/number-google-sheets-users-worldwide-eoskdoxav).

Tables contain high-value data, but they can be challenging to understand for humans and machines due to several factors, such as the ambiguity of the values contained therein and the lack of contextual information (e.g., metadata).

The table-to-KG matching problem, also referred to as Semantic Table Interpretation (STI), provides explicit semantic annotations (e.g., identifying and annotating entities in cells, their types/classes and the connections/properties between entities), thus capturing knowledge from tables [3]. STI has recently attracted much attention in the research community [4, 5, 6] and is a key step to enrich data [7, 8] and to construct and extend Knowledge Graphs (KGs) from semi-structured data [9, 10]. The input of STI is i) a well-formed and normalised relational table (i.e., a table with headers and simple values, thus excluding nested and figure-like tables), and ii) a Knowledge Graph (KG) (e.g., Wikidata, DBpedia) which describes real-world entities in the domain of interest (i.e., a set of types, datatypes, predicates, instances, and the relations among them). The output is a semantically annotated table, obtained by mapping its elements (i.e., cells/mentions, columns, rows) to semantic tags (i.e., entities, types, properties) from KGs, as shown in Figure 1.
This process is typically broken down into the following tasks: (i) cell/mention to KG entity matching (CEA task), (ii) column to KG type matching (CTA task), and (iii) column pair to KG property matching (CPA task) [6].

Figure 1: Example of an annotated table. The figure shows a relational table about films (columns: Title, Director, Release date, Domestic distributor, length in min, Worldwide gross). Title is the subject column (S-column), Director and Domestic distributor are named-entity columns (NE-columns), and the remaining columns are literal columns (LIT-columns). The main tasks are highlighted: CEA (Cell Entity Annotation, i.e., entity reconciliation), CPA (Column Property Annotation, i.e., identification of relationships), and CTA (Column Type Annotation, i.e., identification of types). Cells are annotated with Wikidata entities (e.g., Q3512046, Jurassic World; Q5145625, Colin Trevorrow), columns with types and datatypes (Q11424 film, Q5 human, Q1762059 film production company, DATE, INTEGER), and column pairs with properties (P57 director, P577 publication date, P750 distributed by, P2047 duration, P2142 box office).

As depicted in Figure 1, the majority of the entities in the Title column are of type Film, and publication date (P577) can be identified as the property connecting the entities in the Title column with the values in the Release date column. Unfortunately, explicit situations like the ones in the example are not so common; therefore, we need to set up strategies and algorithms to address several issues. An excellent STI approach must consider and adequately balance the different features of a table (or of a set of tables). The annotation involves several key challenges:

i) disambiguation: the types of the entities described in a table are not known in advance, and those entities may correspond to more than one type in the KG;
ii) homonymy: different entities may share the same name and type;
iii) matching: the mention in the table may be syntactically different from the label of the entity in the KG (i.e., use of acronyms, aliases, and typos);
iv) NIL-mentions: the approach must also consider strings that refer to entities for which a representation has not yet been created within the KG, namely NIL-mentions;
v) literal and named-entity: in a table, there can be columns that contain named-entity mentions (NE-columns) and columns that contain literal values such as strings, numbers and dates (LIT-columns);
vi) missing context: it is often easier to extract context from textual documents than from tables, due to the small amount of content available for processing. For instance, the header, i.e., the first row of a table, which usually contains descriptive attributes for the columns, may or may not be present;
vii) amount of data: the approach must handle both large tables with many rows and columns and tables with very few mentions;
viii) different domains: the tables within a set can belong to very general or to very specific domains.

s-elBat is an approach that employs several techniques to address all of these challenges. It is a new approach that inherits and improves upon what has been proposed by the MantisTable [3] and MantisTable SE [11] STI approaches.
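To make the three annotation tasks concrete, the following sketch shows, as plain Python literals, the kind of output STI produces for the table in Figure 1; the identifiers are the ones reported in the figure, while the dictionary layout is only illustrative and does not correspond to the output format of any specific tool.

```python
# Illustrative STI output for the table in Figure 1 (identifiers taken from the figure).
annotations = {
    # CEA: cell mention -> KG entity
    "cea": {
        ("Jurassic World", "Title"): "Q3512046",      # Jurassic World (the film)
        ("Colin Trevorrow", "Director"): "Q5145625",  # Colin Trevorrow
    },
    # CTA: column -> KG type
    "cta": {
        "Title": "Q11424",                    # film
        "Director": "Q5",                     # human
        "Domestic distributor": "Q1762059",   # film production company
    },
    # CPA: column pair -> KG property
    "cpa": {
        ("Title", "Director"): "P57",               # director
        ("Title", "Release date"): "P577",          # publication date
        ("Title", "Domestic distributor"): "P750",  # distributed by
        ("Title", "length in min"): "P2047",        # duration
        ("Title", "Worldwide gross"): "P2142",      # box office
    },
}
```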
The experience acquired with the tools mentioned above and their participation in the various editions of the SemTab challenge (www.cs.ox.ac.uk/isg/challenges/sem-tab/) led to the definition of new techniques for: i) an efficient lookup approach, based on indexes over optimised data structures; ii) Information Retrieval-based Entity Linking, augmented with a type-based filtering feature; and iii) a feature-vector-based entity disambiguation approach.

The rest of the paper is organised as follows. In Section 2 we describe s-elBat in detail. Section 3 introduces the Gold Standards, the configuration parameters, and the evaluation results. Finally, we conclude the paper and discuss future directions in Section 4.

2. s-elBat approach
s-elBat (the name comes from taBle-s and from Semantic Entity Linking to BAtch Table) provides an iterative process that performs Entity Linking (EL) on tables. Given a KG containing a set of entities E and a collection of named-entity mentions M, the goal of EL is to map each entity mention m ∈ M to its corresponding entity e ∈ E in the KG. As described above, a typical EL service consists of the following modules [12]:

1. Entity Retrieval (ER). In this module, for each entity mention m ∈ M, irrelevant entities in the KG are filtered out to return a set Em of candidate entities, i.e., entities that the mention m may refer to. To achieve this goal, state-of-the-art techniques have been used, such as name dictionary-based techniques, surface form expansion from the local document, and methods based on search engines.
2. Entity Disambiguation (ED). In this module, the entities in the set Em are ranked more accurately in order to select the correct entity among the candidates. In practice, this is a re-ranking activity that considers other information (e.g., contextual information) besides the simple textual mention m used in the ER module.

In s-elBat, these modules are integrated into a pipeline composed of seven sequential phases: Preprocessing and Data Preparation, Entity Retrieval, Cell Entity Annotation (CEA), Column Property Annotation (CPA), Column Type Annotation (CTA), Revision, and Export. The overall framework is described in Figure 2.

Figure 2: The s-elBat process. The pipeline takes a dataset of tables as input, pre-processes it, extracts the set of mentions, queries LamAPI for entity retrieval, computes preliminary CEA, CPA, and CTA annotations (metadata), revises low-confidence annotations using the prior knowledge gathered in the first pass, and finally exports the annotated mentions.

2.1. Preprocessing and Data preparation
During this phase, as a first step, every cell of the table is converted to lowercase. The next step performs column classification, assigning to every column a LIT-column tag (columns containing literal values such as strings, numbers and dates) or an NE-column tag (columns containing named-entity mentions) [3, 11]. The potential subject column (S-column, i.e., the main column, the one all the other columns refer to) is then identified [3]. In s-elBat, the selected subject column is not decisive for the final annotation, but it can positively influence the execution time. Finally, the cells of the NE-columns are extracted to generate the set M of mentions for the next phase.

2.2. Entity Retrieval
According to state-of-the-art experiments [13], the role of the ER module is critical, since it should ensure the presence of the correct entity in the returned set so that the ED module can find it. For this reason, s-elBat integrates LamAPI (LAbel Matching API), a comprehensive tool for Information Retrieval (IR)-based ER, augmented with type-based filtering features [14].
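Before moving to the retrieval details, the following is a minimal sketch of the preprocessing phase of Section 2.1 (lowercasing, column tagging, and mention extraction); subject-column detection is omitted. The classification rule used here (a column is tagged LIT when most of its cells parse as numbers or simple dates) is only an illustrative stand-in for the classification described in [3, 11].

```python
import re

DATE_RE = re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$|^\d{4}(-\d{2}){0,2}$")

def is_literal(cell: str) -> bool:
    """Very rough literal detector: numbers (with separators) or simple dates."""
    value = cell.replace(",", "").replace(".", "").strip()
    return value.isdigit() or bool(DATE_RE.match(cell.strip()))

def preprocess(table: list[list[str]]):
    """Lowercase every cell, tag each column as NE or LIT, and extract the NE mentions."""
    rows = [[cell.lower().strip() for cell in row] for row in table]
    tags = []
    for c in range(len(rows[0])):
        column = [row[c] for row in rows]
        literal_ratio = sum(is_literal(v) for v in column) / len(column)
        tags.append("LIT" if literal_ratio > 0.5 else "NE")
    mentions = {(r, c): rows[r][c]
                for r in range(len(rows)) for c in range(len(tags)) if tags[c] == "NE"}
    return rows, tags, mentions

# Example with two rows of Table 2
rows, tags, mentions = preprocess([
    ["Jurassic World", "Colin Trevorrow", "2015", "Universal Pictures", "124", "1,670,400,637"],
    ["Avatar", "James Cameron", "2009", "Twentieth Century Fox", "162", "2,744,336,793"],
])
print(tags)  # ['NE', 'NE', 'LIT', 'NE', 'LIT', 'LIT']
```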
The current version of LamAPI integrates the data of DBpedia (v. 2016-10 and v. 2022.03.01) and Wikidata (v. 20220708). The KGs are indexed with ElasticSearch (www.elastic.co), an engine that can search and analyse huge volumes of data in near real-time. The ElasticSearch index was configured using the IB similarity function (www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html) with the default values of its hyperparameters. LamAPI is highly modular, so it is possible to integrate any indexing engine (e.g., Apache Solr, Apache Lucene, ArangoDB). The entity's identifier and label are indexed; for each entity, the length of the label in characters, the number of tokens, the n-grams of the label, and a value representing the entity's popularity are also stored in the index.

Among the services provided by LamAPI to search and retrieve information from a KG, Lookup and Types are described below, since they are the services relevant to the STI tasks.

Lookup: given a string as input, it retrieves a set of candidate entities E from the reference KG. The request can be qualified by setting some attributes:

• limit: an integer value specifying the number of entities to retrieve. The default value is 100; it has been empirically verified that this limit provides a good level of coverage.
• kg: specifies which KG and version to use. The default is dbpedia_2022_03_01; the other possible values are dbpedia_2016_10 and wikidata_latest.
• fuzzy: a boolean value. When true, tokens inside the string are matched with an edit distance (Levenshtein distance) less than or equal to 2, which gives a greater tolerance for spelling errors. When false, the fuzzy operator is not applied to the input.
• ngrams: a boolean value. When true, it enables n-gram search. After many empirical experiments, we set the 'n' of the n-grams to 3: a lower value can introduce some bias into the search, while a higher value is not very effective against spelling errors. With n-grams of size 3, "albert einstein" is split into ['alb', 'lbe', 'ber', 'ert', ...]. When false, n-gram search is not applied.
• types: this parameter allows the specification of a list of types (e.g., rdf:type for DBpedia and Property:P31 [instance of] for Wikidata) associated with the input string, used to filter the retrieved entities. This attribute plays a key role in re-ranking the candidates, allowing a more accurate search based on the input types.

Types: given the unique id of an entity as input, it retrieves all the types of which the entity is an instance. For DBpedia entities, the service returns the direct types, the transitive types, and the Wikidata types of the related entity, while for Wikidata it returns only the list of concepts/types of the input entity.

For each mention m ∈ M, the approach performs a search using the LamAPI Lookup service to retrieve a set of entities E. During the service invocation, some heuristics are applied to handle possibly misspelt input. In detail, two different requests are made: i) using the mention as it is, and ii) using the mention after removing repeated letters and bracketed content. For instance, for the mention "pariss", the second query is built using "paris". Repeated characters are a frequent mistake, and the 3-gram search implemented with ElasticSearch is badly affected by this kind of error, while fuzzy matching easily overcomes a possibly missing or duplicated character. Brackets affect the edit distance, and their content is frequently irrelevant.
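As an illustration of how a client could combine the cleaning heuristics with the Lookup service, the sketch below issues the two requests described above. The attribute names limit, kg, fuzzy, ngrams, and types come from the paper, while the endpoint URL, the name of the query parameter carrying the mention, and the response shape are assumptions made for this example and are not taken from the LamAPI documentation.

```python
import re
import requests

LAMAPI_LOOKUP = "http://localhost:8000/lookup"  # assumed endpoint, not from the paper

def clean_mention(mention: str) -> str:
    """Second-query heuristic: drop bracketed content and collapse repeated letters."""
    no_brackets = re.sub(r"\s*[\(\[][^\)\]]*[\)\]]", "", mention)
    collapsed = re.sub(r"(.)\1+", r"\1", no_brackets)   # "pariss" -> "paris"
    return collapsed.strip()

def lookup(mention: str, types=None, kg="wikidata_latest", limit=100):
    """Query the Lookup service with the raw mention and with the cleaned one."""
    candidates = []
    for query in {mention.lower(), clean_mention(mention.lower())}:
        params = {"name": query, "kg": kg, "limit": limit,
                  "fuzzy": True, "ngrams": True}        # attribute names as in Section 2.2
        if types:
            params["types"] = " ".join(types)           # e.g. Wikidata P31 values
        resp = requests.get(LAMAPI_LOOKUP, params=params, timeout=30)
        resp.raise_for_status()
        candidates.extend(resp.json())                  # assumed: JSON list of candidates
    return candidates

# Example: candidates for a misspelt mention, restricted to films (Q11424)
# print(lookup("jurassic worldd", types=["Q11424"]))
```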
For the selection of the best set Em of candidates, LamAPI computes a similarity between the mention m ∈ M and the label l(e) (i.e., the rdfs:label of e) of each entity e ∈ E:

\[ \mathit{stringSimilarity}(m, l(e)) = 1 - \frac{\mathit{LevenshteinDistance}(m, l(e))}{\max\{\mathit{length}(m),\ \mathit{length}(l(e))\}} \qquad (1) \]

A threshold determines whether an entity e should be removed from the candidate set E. To set this threshold empirically, four Gold Standards (GS) have been selected (2T [15], SemTab 2020 R4 [5], HardTable 2021 R2, and HardTable 2021 R3 [6]), and the distribution of the string similarity values between the entity labels in the GS and the corresponding mentions in the tables was analysed (Table 1). It turned out that a threshold of 0.40 is a good choice.

Table 1: String similarity values for different Gold Standards. For each dataset, the first row reports the number of mentions whose string similarity with the correct entity label falls below each threshold, and the second row the corresponding percentages; "total" is the total number of mentions, "avg" the average string similarity, and "exact match" the number of mentions with string similarity equal to 1.0.

Dataset            total   avg     <0.10  <0.20  <0.30  <0.40  <0.50  <0.60   <0.70   <0.80   <0.90   <1.0    exact match
2T 2020            73327   81.03%  1839   3550   5263   5996   6904   7951    10281   21972   43055   57793   15534
                                   2.51%  4.84%  7.18%  8.18%  9.42%  10.84%  14.02%  29.96%  58.72%  78.82%  21.18%
SemTab 2020 R4     485190  98.18%  71     257    556    918    1363   2116    3067    5291    25927   93654   391536
                                   0.01%  0.05%  0.11%  0.19%  0.28%  0.44%   0.63%   1.09%   5.34%   19.30%  80.70%
HardTable 2021 R2  37153   98.48%  4      14     55     73     92     129     168     308     1721    5876    31277
                                   0.01%  0.04%  0.15%  0.20%  0.25%  0.35%   0.45%   0.83%   4.63%   15.82%  84.18%
HardTable 2021 R3  51169   97.11%  11     20     26     35     46     67      156     563     6128    13557   37612
                                   0.02%  0.04%  0.05%  0.07%  0.09%  0.13%   0.30%   1.10%   11.98%  26.49%  73.51%
Average                    93.70%  0.64%  1.24%  1.87%  2.16%  2.51%  2.94%   3.85%   8.25%   20.17%  35.11%  64.89%

2.3. Cell Entity Annotation
During this phase, for each pair of candidates associated with two cells on the same row but in different columns, the respective properties are extracted using LamAPI. For each candidate entity e, a feature vector with the following items is created:

• string similarity: the score is based on the Levenshtein edit distance, computed between the mention and the label associated with the candidate entity.
• jaccard: the same as string similarity, but computed with the Jaccard distance instead of the Levenshtein distance.
• object: this score is set only if there is a property between two candidate entities. In this case, the subject entity of the pair receives a boost equal to the string similarity score of that entity.
• relation: as with the object score, the relation score is set only if there is at least one property between the considered entities. It is the exact counterpart of the object score: in this case, the object entity receives the boost.
• literal: this score is applied to relations between the cell of the subject column and cells of LIT-columns. In this case, date, number and string values are compared as explained in [16].

Given a vector of features, the final score for each entity is computed as follows:

\[ \mathit{score}(e) = \sum_{i=1}^{\#\mathit{features}} w_i \, \mathit{features}_i(e) \qquad (2) \]

The weights w are set as follows: string similarity 10, jaccard 8, object 3, relation 4, literal 7. The final score is used to rank the candidates. Considering Table 2, the candidates for the mention "Jurassic World" are reported in Listing 1. The entity with the highest score is Q3512046 (Jurassic World), and it is used to annotate the mention.
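A minimal, self-contained sketch of Equation 1 and of the weighted scoring of Equation 2, with the weights reported above; it only illustrates the formulas and is not the authors' implementation (the exact feature values, and therefore the scores shown in Listing 1, depend on details not reproduced here).

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def string_similarity(mention: str, label: str) -> float:
    """Equation 1: 1 - LevenshteinDistance / max(length(m), length(l(e)))."""
    if not mention and not label:
        return 1.0
    return 1.0 - levenshtein(mention, label) / max(len(mention), len(label))

# Feature weights reported in Section 2.3
WEIGHTS = {"string similarity": 10, "jaccard": 8, "object": 3, "relation": 4, "literal": 7}

def score(features: dict) -> float:
    """Equation 2: weighted sum of the feature values of a candidate entity."""
    return sum(WEIGHTS[name] * value for name, value in features.items())

# Usage with made-up feature values for a perfect label match
candidate = {"string similarity": string_similarity("jurassic world", "jurassic world"),
             "jaccard": 1.0, "object": 0.0, "relation": 0.0, "literal": 0.0}
print(round(score(candidate), 2))  # 18.0 for this illustrative feature vector
```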
Table 2: Example of a relational table containing a list of movies.

title             director           release year  domestic distributor   length in min  worldwide gross
Jurassic World    Colin Trevorrow    2015          Universal Pictures     124            1,670,400,637
Superman Returns  Bryan Singer       2006          Warner Bros.           154            391,081,192
Batman Begins     Christopher Nolan  2005          Warner Bros.           140            371,853,783
Avatar            James Cameron      2009          Twentieth Century Fox  162            2,744,336,793

Listing 1: Candidate entities for the mention "Jurassic World" (Table 2).

{
  "Q3512046": {
    "label": "Jurassic World",
    "instance_of": ["3D film", "film", "sequel"],
    "string similarity": 1.44,
    "jaccard": 1.0,
    "object": 2.88,
    "relation": 0,
    "literal": 1.728,
    "score": 51.78
  },
  "Q21877685": {
    "label": "Jurassic World",
    "instance_of": ["film"],
    "string similarity": 1.2,
    "jaccard": 1.0,
    "object": 2.88,
    "relation": 0,
    "literal": 0.809,
    "score": 39.99
  },
  "Q55178974": {
    "label": "Jurassic World 3",
    "instance_of": ["film", "film project"],
    "string similarity": 0.875,
    "jaccard": 0.667,
    "object": 2.888,
    "relation": 0,
    "literal": 0.491,
    "score": 31.37
  }
}

2.4. Column Property Annotation
In this phase, the information collected during the previous phase is used to aggregate predicates by frequency. The CPA step is relatively fast because all the necessary information was already gathered during the CEA phase using LamAPI. The first step consists of creating, for every column, a dictionary containing all the winning properties and their frequencies. In the second step, the most frequent property is selected as the CPA annotation. For the director column, the most frequent property is P57 (director), as shown in Listing 2.

Listing 2: Set of properties for the "director" column (Table 2).

{
  "P57":  {"label": "director",     "confidence": 1.0},
  "P58":  {"label": "screenwriter", "confidence": 0.75},
  "P162": {"label": "producer",     "confidence": 0.5},
  "P161": {"label": "cast member",  "confidence": 0.25}
}

2.5. Column Type Annotation
In this phase, the information collected during the CEA phase is used. To obtain the CTA annotation for a given column, all the cells of that column are iterated. During the process, the approach creates a dictionary with the frequencies of all the classes of the winning entities obtained in the previous step. The type with the maximum frequency is selected as the annotation for the column under analysis. An example for Wikidata is shown in Listing 3.

Listing 3: Example of the structure storing the most frequent classes for each column in Table 2.

{
  "movie_table": {
    "0": {
      "Q229390 (3D film)": 1.0,
      "Q11424 (film)": 0.667,
      "Q261636 (sequel)": 0.33
    },
    "1": {
      "Q5 (Human)": 1.0,
      "Q2526255 (Film director)": 1.0,
      "Q28389 (screenwriter)": 1.0,
      "Q3282637 (film producer)": 0.667
    },
    "3": {
      "Q1762059 (film production company)": 1.0,
      "Q375336 (film studio)": 0.5,
      "Q1107679 (animation studio)": 0.5,
      "Q18127 (record label)": 0.5,
      "Q4830453 (business)": 0.5,
      "Q10689397 (television production company)": 0.25
    }
  }
}
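The frequency-based aggregation used by CPA and CTA can be sketched as follows, assuming that, for each row, the properties linking the winning entities (for CPA) and the types of the winning entity (for CTA) are already available from the CEA phase; the data structures and the example values are simplified for illustration and do not come from the s-elBat code.

```python
from collections import Counter

def most_frequent(per_row_values: list[list[str]]) -> tuple[str, float]:
    """Return the most frequent value across rows and its relative frequency."""
    counts = Counter(v for row in per_row_values for v in set(row))
    value, hits = counts.most_common(1)[0]
    return value, hits / len(per_row_values)

# CPA for the (title, director) column pair: properties between winning entities, per row
director_props = [["P57", "P58"], ["P57"], ["P57", "P58", "P162"], ["P57"]]
print(most_frequent(director_props))   # ('P57', 1.0), cf. Listing 2

# CTA for the director column: types of the winning entity, per row
director_types = [["Q5", "Q2526255"], ["Q5"], ["Q5", "Q28389"], ["Q5"]]
print(most_frequent(director_types))   # ('Q5', 1.0), cf. Listing 3
```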
2.6. Revision
The revision process consists of setting constraints on the types and predicates obtained in the first execution of the annotation process. Since this implies re-computing all the phases, only low-confidence mentions are considered, in order to optimise computational efficiency. An experiment was conducted to identify the best criterion for classifying these mentions, using the datasets available from the previous editions of the SemTab challenge. For every dataset considered, the CEA results are checked against the corresponding ground truth, and the errors are noted. The results of this analysis are represented graphically in Figure 3: the x-axis is the cutoff threshold, while the y-axis represents the number of wrong mentions that are not considered for the revision step; e.g., a threshold of 0.6 would, on average, keep the inner error under 5%. Clearly, the number of wrong mentions must be minimised while keeping computational efficiency in mind. For the SemTab challenge, the threshold was set to 0.4, with an inner error under 2%.

Figure 3: Analysis of the threshold for the Revision phase.

Additional features have been added to the previous feature vector to obtain a better ranker. These new features were not available before the CEA phase; in detail:

• cta: this score is related to the types of the candidates. For a candidate entity, the score is based on the intersection between the types of the candidate entity and the types found for the whole column.
• cpa: in the same way as the previous score, this one considers the predicates for which the mention acts as the subject, using the scores collected during the CPA phase.

The key challenges presented in Section 1 are managed as follows: i) disambiguation: the disambiguation is managed by the ED module presented above; ii) homonymy: homonym cases are generally resolved through the row context, and the "object", "relation", and "literal" scores presented in the CEA phase help the resolution; iii) matching: LamAPI handles most of the matching issues encountered during the annotation process; iv) NIL-mentions: annotations with a confidence lower than 1 are highly likely to be NIL-mentions; v) literal and named-entity: the data preparation phase manages the column classification; vi) missing context: when the header context is missing, other kinds of context can be used, such as the column context exploited by the cta score; vii) amount of data: as shown in Section 3, the proposed annotation process grows nearly linearly with the size of the data; viii) different domains: the approach was validated with general-purpose datasets; in the future, it may be fine-tuned for better performance on domain-specific tasks, for example by reducing the set of possible candidates.

2.7. Export
In this phase, the objective is to export the annotated mentions. For every mention, the system must decide, based on confidence, whether the given annotation is correct. This can lead to three possible scenarios: i) there is a clear winning candidate: if the score difference between the first-ranked and the second-ranked candidate is higher than 0.3, the system is confident enough to annotate the mention with the first-ranked candidate; ii) the final score is lower than 1: the mention is considered a "No-annotation", because the system cannot be confident enough to provide an annotation; iii) the first two candidates have final scores that are too close to decide which one is correct. This can lead to unresolved mentions, for which three possible resolution methods are considered: a) the first candidate is taken; this can lead to a higher number of wrong annotations, but it is a fast way to annotate more mentions; b) the mentions are not annotated, which favours precision at the expense of coverage; c) a ranking system based on the KG data is used to decide which annotation wins.
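A minimal sketch of the export decision rule just described, using the thresholds given in the text (a 0.3 gap between the two best candidates and a minimum score of 1) and the three tie-breaking policies; whether these thresholds apply to raw or normalised scores is not specified in the paper, so the example values below are hypothetical.

```python
def export_decision(ranked, min_score=1.0, gap=0.3, tie_policy="lowest_qid"):
    """ranked: list of (entity_id, score) pairs sorted by decreasing score.
    Returns the chosen entity id, or None for a 'No-annotation'."""
    if not ranked or ranked[0][1] < min_score:
        return None                              # scenario ii): not confident enough
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] > gap:
        return ranked[0][0]                      # scenario i): clear winning candidate
    # scenario iii): the two best scores are too close
    if tie_policy == "first":
        return ranked[0][0]                      # a) fast, but may add wrong annotations
    if tie_policy == "skip":
        return None                              # b) leave the mention unannotated
    # c) KG-based ranking: prefer the lowest Wikidata identifier (older, usually more popular)
    return min((e for e, _ in ranked[:2]), key=lambda q: int(q.lstrip("Q")))

# Hypothetical scores for the three candidates of one mention
print(export_decision([("Q3512046", 1.8), ("Q21877685", 1.6), ("Q55178974", 0.9)]))  # Q3512046
```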
1 "Q3512046": "label": "Jurassic World" 3 "instance_of": ["3D film", "film", "sequel"] "string similarity": 1.44 5 "jaccard": 1.0 "object": 2.88 7 "relation": 0 "literal": 1.72 9 "cta": 2.0 "cpa": 3.692 11 "score": 74.167 "Q21877685": 13 "label": "Jurassic World" "instance_of": ["film"] 15 "string similarity": 1.2 "jaccard": 1.0 17 "object": 2.88 "relation": 0 19 "literal": 0.809 "cta": 0.667 21 "cpa": 3.442 "score": 26.939 23 "Q55178974": "label": "Jurassic World" 25 "instance_of": ["film", "film project"] "string similarity": 0.875 27 "jaccard": 0.667 "object": 2.88 29 "relation": 0 "literal": 0.491 31 "cta": 0.667 "cpa": 1.545 33 "score": 20.733 when annotating on general-purpose data, one possible criterion is to take the lowest identifier because more popular entities were generally created before newer ones. 3. Validation Table 3 shows the results obtained by the s-elBat tool for three main tasks (CEA, CPA, and CTA) on 2T and HardTable datasets. Overall, these results show that s-elBat tool achieves a significant performance across multiple datasets from different domains and generation methods. The results show that the methodology presented in this paper is a general-purpose approach that can be applied to any dataset. Table 3 Results for the SemTab 2022 dataset HardTable 2022 R1 HardTable 2022 R2 2T 2022 Tasks P F1 P F1 P F1 CEA 0.964 0.945 0.875 0.825 0.938 0.937 CTA 0.951 0.957 0.878 0.859 0.367 0.366 CPA 0.989 0.983 0.96 0.931 - - Experiments were carried out to determine the computational efficiency of the different phases and to validate the assumption that the computation time grows linearly based on the data size. The data from the previous editions of the challenge is used to validate those assumptions. In Figure 4, it is possible to see how different datasets with an increasing number of mentions have a nearly linear outcome regarding execution time. Figure 4: Computation time analysis of s-elBat approach. In Table 4 the complete data about execution time is available. From this analysis, the results show that the most computationally expensive phase is the candidate generation. More in detail, it is possible to notice how the candidate generation consists of at least 96% of the whole processing time for any dataset while the rest of the phases aggregated as “computation time” use less than 4%. Table 4 Time analysis on different datasets from SemTab challenges. Using cache Entity retrieval Entity retrieval Computation Computation Dataset # Mentions Time (s) time (s) time (s) time % time (s) time % T2D 8079 2471 81 2390 96.72% 81 3.28% HardTable 2021 R2 47440 8599 282 8317 96.72% 282 3.28% HardTable 2021 R3 58949 80528 905 79623 98.88% 905 1.12% SemTab 2020 R3 390457 14264 1446 12818 89.86% 1446 10.14% 2T 2020 667244 95190 1777 93413 98.13% 1777 1.87% SemTab 2020 R4 994921 174046 4345 169701 97.50% 4345 2.50% A further contribution from this paper consists of the definition of a format for a generic API specification useful for STI tasks. In Listing 5 the JSON format specification is reported; after the “name”, “dataset”, and “header” properties, there is the “rows” array. In this array, there is an object where the first element “idRow” is a numbered identifier for the row and the other element “data” contains the row content. The key “semanticAnnotations” allows to specify prior knowledge regarding the annotation of the table. For example, the “cta” key can be filled if the column types are already known. 
In the same way, the "cpa" and "cea" keys can also be populated before the computation.

The experiments were conducted using 16 parallel processes: i) the ER is performed on a server with 40 Intel Xeon 4114 CPUs @ 2.20 GHz and 40 GB of RAM; ii) the ED is performed on a server with 32 Intel Xeon E5-2650 CPUs @ 2.00 GHz and 94 GB of RAM.

Listing 5: API specification for s-elBat.

[
  {
    "name": "TEST1",
    "dataset": "Dataset1",
    "header": ["col1", "col2", "col3"],
    "rows": [
      {"idRow": 1, "data": ["Alabama", "United States", "5,024,279"]},
      {"idRow": 2, "data": ["Colorado", "United States", "5,773,714"]},
      {"idRow": 3, "data": ["North Carolina", "United States", "10,439,388"]}
    ],
    "semanticAnnotations": {
      "cea": [],
      "cpa": [],
      "cta": []
    },
    "metadata": {
      "column": [
        {"idColumn": 0, "tag": "NE"},
        {"idColumn": 1, "tag": "NE"},
        {"idColumn": 2, "tag": "LIT", "datatype": "NUMBER"}
      ]
    },
    "kgReference": "wikidata"
  },
  {
    "name": "TEST2",
    "dataset": "Dataset1",
    "rows": [...],
    ...
    "kgReference": "wikidata"
  }
]

The tool and all the resources used for the experiments are released following the FAIR Guiding Principles (www.nature.com/articles/sdata201618). s-elBat is released under the Apache 2.0 licence (bitbucket.org/disco_unimib/s-elbat/).

4. Conclusion and Future Works
s-elBat is a new approach that inherits and improves upon what was proposed by MantisTable. The results show an improvement in terms of both the quality of the annotations and scalability. The formalisation of an STI API specification is an interesting addition to the state of the art. Regarding future developments, we want to find ways to further reduce the computation time of entity retrieval. Another interesting research direction would be to analyse how the features used for entity retrieval and disambiguation impact the results on different datasets.

Acknowledgement
This work has received funding from the European Union's Horizon Europe research and innovation programme under grant agreement No 101070284 - enRichMyData.

References
[1] S. Neumaier, J. Umbrich, J. X. Parreira, A. Polleres, Multi-level semantic labelling of numerical values, in: The Semantic Web – ISWC 2016, Springer International Publishing, Cham, 2016, pp. 428–445.
[2] M. Marzocchi, M. Cremaschi, R. Pozzi, R. Avogadro, M. Palmonari, MammoTab: a giant and comprehensive dataset for semantic table interpretation, in: Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, SemTab 2022, in press.
[3] M. Cremaschi, F. De Paoli, A. Rula, B. Spahiu, A fully automated approach to a complete semantic table interpretation, Future Generation Computer Systems 112 (2020) 478–500.
[4] E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen, K. Srinivas, SemTab 2019: Resources to benchmark tabular data to knowledge graph matching systems, in: The Semantic Web, Springer International Publishing, Cham, 2020, pp. 514–530.
[5] E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen, K. Srinivas, V. Cutrona, Results of SemTab 2020, CEUR Workshop Proceedings 2775 (2020) 1–8.
[6] V. Cutrona, J. Chen, V. Efthymiou, O. Hassanzadeh, E. Jiménez-Ruiz, J. Sequeda, K. Srinivas, N. Abdelmageed, M. Hulsebos, D. Oliveira, C. Pesquita, Results of SemTab 2021, in: 20th International Semantic Web Conference, volume 3103 of CEUR Workshop Proceedings, 2022, pp. 1–12.
[7] V. Cutrona, M. Ciavotta, F. De Paoli, M. Palmonari, ASIA: a tool for assisted semantic interpretation and annotation of tabular data, in: Proceedings of the ISWC 2019 Satellite Tracks, volume 2456 of CEUR Workshop Proceedings, CEUR-WS.org, 2019, pp. 209–212.
[8] M. Palmonari, M. Ciavotta, F. De Paoli, A. Košmerlj, N. Nikolov, EW-Shopp project: Supporting event and weather-based data analytics and marketing along the shopper journey, in: Advances in Service-Oriented and Cloud Computing, Springer International Publishing, Cham, 2020, pp. 187–191.
[9] G. Weikum, X. L. Dong, S. Razniewski, F. M. Suchanek, Machine knowledge: Creation and curation of comprehensive knowledge bases, Foundations and Trends in Databases 10 (2021) 108–490.
[10] M. Kejriwal, C. A. Knoblock, P. Szekely, Knowledge graphs: Fundamentals, techniques, and applications, MIT Press, 2021.
[11] M. Cremaschi, R. Avogadro, A. Barazzetti, D. Chieregato, MantisTable SE: an efficient approach for the semantic table interpretation, in: SemTab@ISWC, 2020, pp. 75–85.
[12] W. Shen, J. Wang, J. Han, Entity linking with a knowledge base: Issues, techniques, and solutions, IEEE Transactions on Knowledge and Data Engineering 27 (2015) 443–460.
[13] B. Hachey, W. Radford, J. Nothman, M. Honnibal, J. R. Curran, Evaluating entity linking with Wikipedia, Artificial Intelligence 194 (2013) 130–150. Special issue on Artificial Intelligence, Wikipedia and Semi-Structured Resources.
[14] R. Avogadro, M. Cremaschi, F. D'adda, F. De Paoli, M. Palmonari, LamAPI: a comprehensive tool for string-based entity retrieval with type-based filters, in: 17th ISWC Workshop on Ontology Matching (OM), 2022, in press.
[15] V. Cutrona, F. Bianchi, E. Jiménez-Ruiz, M. Palmonari, Tough Tables: Carefully evaluating entity linking for tabular data, in: The Semantic Web – ISWC 2020, Springer International Publishing, Cham, 2020, pp. 328–343.
[16] R. Avogadro, M. Cremaschi, MantisTable V: A novel and efficient approach to semantic table interpretation, in: SemTab@ISWC, 2021, pp. 79–91.