MantisTable: an Automatic Approach for the Semantic Table Interpretation

Marco Cremaschi, Roberto Avogadro, and David Chieregato

University of Milan - Bicocca, Viale Sarca 336, 20126 Milan, Italy
marco.cremaschi@unimib.it, {r.avogadro,d.chieregato}@campus.unimib.it

1 Presentation of the system

1.1 State, purpose, general statement

On the Web we can find a vast amount of structured data, represented in tables that contain relevant information. Despite the huge corpus of such tables on different topics, their format limits their use in artificial intelligence tasks, such as semantic search and query answering. For this reason, several approaches have been proposed to extract, annotate and transform tabular data into machine-readable formats. In particular, in recent years there has been a great deal of work on the annotation of tabular data, also known as Semantic Table Interpretation (STI). These approaches can be classified as supervised (they exploit already annotated tables for training) [5,10,7,2] or unsupervised (they do not require training data) [13,8,1,6], and as automatic [5,7,4] or semi-automatic [2]. Moreover, some approaches [3,13,8,1,6,12,11] focus mainly on the analysis of Web tables' context, such as the Web page title, the table caption, or the surrounding text, while others [5,10,7,4,9] address independent tables, which can only rely on their own data. We identify the following limits in state-of-the-art approaches: i) they adopt lexical comparisons for matching, which ignore contextual semantics; ii) they rely on metadata like column names, and sometimes even on external information like table descriptions, both of which are often unavailable in real-world applications; iii) they use personalised Knowledge Graphs (KGs); iv) they perform only a few steps of STI. To overcome such limitations, we propose a comprehensive approach and a tool named MantisTable^1, which provides an unsupervised method to annotate independent tables, possibly without a header row or other external information. MantisTable takes in input a well-formed and normalised relational table (i.e. a table with headers and simple values, thus excluding nested and figure-like tables) and a KG which describes real-world entities in the domain of interest (i.e. a set of concepts, datatypes, predicates, entities, and the relations among them), and returns a semantically annotated table in output. This process comprises different steps to semantically annotate tables, such as the semantic classification of columns, which classifies a column either as a literal or as a named-entity column. Besides the approach, we propose the MantisTable tool, a web interface and open-source Semantic Table Interpretation tool that automatically annotates and manages tables and makes their semantics accessible to humans and machines. The tool is independent of any particular context. Additional built-in guidance functionalities help to avoid common pitfalls and to create correct annotations. Although STI comprises several steps, as will be explained in the next section, the key feature of our approach is that all the STI steps run fully automatically. The approach and tool were developed by one PhD student and two master's students.

Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
^1 mantistable.disco.unimib.it

1.2 Specific techniques used

The MantisTable approach implements the STI steps through five phases:

0. Data Preparation, which aims to prepare the data inside the table;
1. Column Analysis, whose tasks are the semantic classification, which assigns a type (NE-column or L-column) to each column, and the detection of the subject column (S-column);
2. Entity Linking, which deals with mappings between cells and entities in a KG;
3. Predicate Annotation, whose task is to find relations, in the form of predicates, between the main column and the other columns, to set the overall meaning of the table;
4. Concept and Datatype Annotation, which deals with mappings between columns and semantic elements (concepts or datatypes) in a KG.

To describe each phase of the STI approach we consider Table 1^2, which lists video games with additional information, such as publisher, release date, etc.

Table 1. Table with a list of video games extracted from Round 1.

Title                                      | Publisher   | EU Release Date | AU Release Date | PEGI | ACB
Donkey Kong Country                        | Nintendo    | 2006-12-08      | 2006-12-07      | 7    | G
Super Castlevania IV                       | Konami      | 2006-12-29      | 2006-12-29      | 3    | PG
DoReMi Fantasy: Milon's DokiDoki Adventure | Hudson Soft | 2008-09-05      | 2008-09-05      | 3    | G (900 Wii Points)
...

Data Preparation aims to clean and make uniform the data inside the table. The transformations applied to tables are as follows: deletion of HTML tags and some characters (i.e. " '), transformation of text into lowercase, deletion of text in brackets, resolution of acronyms and abbreviations, and normalisation of units of measurement. To decipher acronyms and abbreviations, the Oxford English Dictionary^3 is used. The normalisation of units of measurement is performed by applying regular expressions, as described in [8]. MantisTable extends the original set of regular expressions to cover a complete set of units, which includes area, currency, density, electric current, energy, flow rate, force, frequency, fuel efficiency, information unit, length, linear mass density, mass, numbers, population density, power, pressure, speed, temperature, time, torque, voltage and volume.

^2 Round 1 table index: 11833461_1_3811022039809817402
^3 public.oed.com/how-to-use-the-oed/abbreviations/
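As an illustration, the following is a minimal Python sketch of the Data Preparation transformations. It is a sketch under assumptions, not the actual MantisTable code: the single length rule in UNIT_PATTERN stands in for the full set of unit regular expressions listed above, and all function names are hypothetical.

import re

# A single, hypothetical unit rule standing in for the full MantisTable
# set (area, currency, speed, etc.): rewrite miles as kilometres.
UNIT_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(km|mi)\b")
MI_TO_KM = 1.609344

def normalise_units(text):
    """Rewrite 'X mi' as kilometres so all lengths share one unit."""
    def convert(m):
        value, unit = float(m.group(1)), m.group(2)
        km = value * MI_TO_KM if unit == "mi" else value
        return f"{km:g} km"
    return UNIT_PATTERN.sub(convert, text)

def prepare_cell(text):
    """Apply the Data Preparation transformations to one cell."""
    text = re.sub(r"<[^>]+>", "", text)                       # delete HTML tags
    text = text.replace("\u201d", "").replace("\u2018", "")   # delete stray quote characters
    text = re.sub(r"\([^)]*\)", "", text)                     # delete text in brackets
    text = text.lower().strip()                               # transform into lowercase
    return normalise_units(text)

print(prepare_cell("Donkey Kong Country (900 Wii Points)"))
# -> 'donkey kong country'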
Column Analysis performs two tasks: the semantic classification, which assigns to each column the type named-entity column (NE-column) or literal column (L-column), and the detection of the subject column (S-column). The first step of the Column Analysis phase is to identify good L-column candidates. To accomplish this task, we consider 16 regular expressions that identify several Regextypes (e.g. numbers, geo coordinates, addresses, hex colour codes, URLs). If the number of occurrences of the most frequent Regextype in a column exceeds a given threshold, that column is annotated as an L-column; otherwise, it is annotated as an NE-column. The second step deals with the subject column detection, which takes into account the identified NE-columns. We define the S-column as the main column of the table, based on different statistical features, like the Average Number of Words (aw) in each cell, the Fraction of Empty Cells (emc) in the column, the Fraction of Cells with Unique content (uc) and the Distance from the First NE-column (df). These features are combined to compute the subcol(c_j) score for each NE-column as follows:

subcol(c_j) = \frac{2\,uc_{norm}(c_j) + aw_{norm}(c_j) - emc_{norm}(c_j)}{\sqrt{df(c_j) + 1}}    (1)

The column with the highest score is selected as the S-column for the considered table. The values of the features for the S-column detection on the video games table (Table 1) are shown in Table 2. In this case the Title column is the S-column of the table (Table 3).

Table 2. Values of the features of the S-column detection for the video games table.

Feature | Title column | Publisher column
emc     | 0            | 0
uc      | 1            | 0.21
df      | 1            | 2
aw      | 1            | 0.37
final   | 3            | 0.57

Table 3. Table 1 after the Column Analysis phase.

S                   | NE        | L               | L               | L    | NA
Title               | Publisher | EU Release Date | AU Release Date | PEGI | ACB
donkey kong country | nintendo  | 2006-12-08      | 2006-12-07      | 7    | g
...
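The following minimal Python sketch applies Equation 1 to the feature values of Table 2. One assumption is made here: we take df as the zero-based distance from the first NE-column (Table 2 reports the one-based positions); with this reading the sketch reproduces the "final" row of Table 2 up to rounding of the uc and aw features.

import math

def subcol(uc, aw, emc, df):
    """Equation 1: subject-column score from the normalised features."""
    return (2 * uc + aw - emc) / math.sqrt(df + 1)

# Feature values from Table 2; df is taken zero-based here (an
# assumption: Table 2 lists the one-based column positions).
features = {
    "Title":     dict(uc=1.0,  aw=1.0,  emc=0.0, df=0),
    "Publisher": dict(uc=0.21, aw=0.37, emc=0.0, df=1),
}

scores = {col: subcol(**f) for col, f in features.items()}
print(scores)                       # Title: 3.0, Publisher: ~0.56 (Table 2 reports 0.57 from unrounded features)
print(max(scores, key=scores.get))  # 'Title' is the S-column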
Entity Linking deals with mappings between the content of cells and entities in the KG. The original version of MantisTable used a naive approach: we employed a SPARQL query considering only the entities for which, among the values of rdf:type, it was possible to find the label of the winning class. If more than one entity was returned for a cell, the one with the smallest edit distance (i.e. Wagner-Fischer distance) was taken. Between Round 3 and Round 4 we developed a more effective approach, which is described below. In the new approach we consider the table entities row by row. To discover the mappings for each cell in a row, we take a set of eligible entities from the KG. We define as eligible all the entities whose label contains the cell's content or the cell's tokens, as shown in Listing 1.1. By converting this information into a graph representation, we find all the paths that start from an eligible entity of the row's subject cell and reach an eligible entity of one of the row's object cells. To increase the accuracy we also try to match the literal cells with the literals in the subject cell's candidate entities: that is, we seek a match between literals in the table and eligible entities by using a simple matching algorithm. This algorithm distinguishes between three main datatypes: dates, numbers and strings. While dates and strings are matched exactly, numeric matching is done with an approximate matching algorithm: a pair of numbers is a match if the distance (the absolute difference) between the two is less than a threshold. To pick the winning entities, we apply to each path a score defined as follows:

PS_r(p_i) = \lVert p_i \rVert + \sum_{j=1}^{\lVert p_i \rVert} \bigl(1 - editDistance(tx(j, r), e_{j,r})\bigr)    (2)

where PS_r is the path score for row r, p_i is an eligible path, \lVert p_i \rVert is the length of the path, defined as the number of vertices it contains, and editDistance(tx(j, r), e_{j,r}) is the edit distance (Levenshtein) between the cell content and the eligible entity. We then take the path with the maximum score.

Listing 1.1. SPARQL query to retrieve a set of eligible entities for a cell's content.

CONSTRUCT {
  ?s ?p ?o .
} WHERE {
  {
    {
      ?s ?p ?o .
      ?s rdfs:label ?label .
      ?label <bif:contains> '(<cell value>)' .
    } UNION {
      ?s ?p ?o .
      ?s rdfs:label ?label .
      ?label <bif:contains> '(<token> AND <token> [AND <token>])' .
    } UNION {
      OPTIONAL {
        ?altName rdfs:label ?lab .
        ?lab <bif:contains> '(<token> AND <token> [AND <token>])' .
        ?altName dbo:wikiPageRedirects ?s .
        ?s ?p ?o .
      } .
    }
  } UNION {
    ?o ?p ?s .
    ?s rdfs:label ?label .
    ?label <bif:contains> '(<token> AND <token> [AND <token>])' .
  }
  [...]
}
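A minimal Python sketch of the path score in Equation 2 follows, assuming the candidate paths have already been extracted from the KG. One detail is our assumption: the paper only names the Levenshtein distance, and we normalise it to [0, 1] so that the similarity term 1 - editDistance cannot be negative.

def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def edit_distance(a, b):
    """Levenshtein distance normalised to [0, 1] (the normalisation is
    an assumption; the paper does not state it)."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))

def path_score(cells, entities):
    """Equation 2: path length plus one similarity term per vertex."""
    length = len(entities)  # ||p_i||: number of vertices in the path
    return length + sum(1 - edit_distance(c, e)
                        for c, e in zip(cells, entities))

# Hypothetical two-vertex path for the first row of Table 1: from a
# candidate subject entity to a candidate object entity.
print(path_score(cells=["donkey kong country", "nintendo"],
                 entities=["donkey kong country", "nintendo"]))  # 4.0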
Predicate Annotation finds relations, in the form of predicates, between the Subject column and the other columns, to set the overall meaning of the table. The MantisTable approach considers the set of predicates found in the Entity Linking phase and takes the predicate with the maximum frequency.

Concept and Datatype Annotation deals with mappings between column headers and semantic elements (concepts or datatypes) in a KG. Our approach reuses the information extracted by the linking phase (the linked entities). For each winning entity E_{ij} we pick the associated concepts (rdf:type); then, for each column, we build a dictionary C_j where each rdf:type is associated with the number of rows of the table containing that concept, divided by the total number of entities found for that column. We use a threshold set at 40% of the maximum score; concepts with a score lower than this threshold are discarded. Table 4 shows an example of this scoring.

Table 4. Example for C_j.

Concept        | Score
Person         | 0.7
Politician     | 0.5
Athlete        | 0.3
GolfPlayer     | 0.3
Place          | 0.3
PopulatedPlace | 0.3
Settlement     | 0.3
City           | 0.3

With the concept list obtained in this way, we build a concept graph representing the hierarchy between the concepts, as shown in Figure 1. To identify the winning connected component within this graph, we take the maximum result of the following formula (connected component score) applied to each component:

CCScore(CC_i) = \sum_{n \in CC_i} conceptScore(n)    (3)

Fig. 1. Example of the concept graph's paths.

where CC_i is the i-th connected component and conceptScore is a function that returns the corresponding concept score. After this step, in order to complete the column annotation task, we pick the path of the winning connected component which maximises the score. We define the path score as follows, where p is a path of concepts defined as c_1, c_2, ..., c_n:

pathScore(p) = \sum_{i=1}^{n} conceptScore(c_i)    (4)

This method is used to reinforce the identification of winning concepts despite the presence of noisy data or contingent errors in the knowledge base. Without loss of generality, let us consider the example in Table 4 and Figure 1. To calculate the scores for the Person and Place connected components we compute CCScore as in Equation 3:

CCScore(PersonCC) = conceptScore(Person) + conceptScore(Athlete) + conceptScore(Politician) + conceptScore(GolfPlayer)
CCScore(PlaceCC) = conceptScore(Place) + conceptScore(PopulatedPlace) + conceptScore(Settlement) + conceptScore(City)

therefore

CCScore(PersonCC) = 0.7 + 0.3 + 0.5 + 0.3 = 1.8
CCScore(PlaceCC) = 0.3 + 0.3 + 0.3 + 0.3 = 1.2

so the winning connected component is PersonCC. Let us now see how the path score on PersonCC is computed. Consider two paths in this connected component, p_1 = Person, Politician and p_2 = Person, ..., GolfPlayer; their scores are:

pathScore(p_1) = conceptScore(Person) + conceptScore(Politician) = 0.7 + 0.5 = 1.2
pathScore(p_2) = conceptScore(Person) + conceptScore(Athlete) + conceptScore(GolfPlayer) = 0.7 + 0.3 + 0.3 = 1.3

so our method chooses the p_2 path as the winning concept annotation.
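The short Python sketch below reproduces this worked example (Equations 3 and 4). Note that the component membership and the paths are hard-coded from Table 4 and Figure 1; in MantisTable they are derived from the KG hierarchy.

# Concept scores from Table 4.
concept_score = {
    "Person": 0.7, "Politician": 0.5, "Athlete": 0.3, "GolfPlayer": 0.3,
    "Place": 0.3, "PopulatedPlace": 0.3, "Settlement": 0.3, "City": 0.3,
}

# Hierarchy paths hard-coded from Figure 1; in MantisTable they come
# from the rdf:type / subclass relations in the KG.
components = {
    "PersonCC": [["Person", "Politician"],
                 ["Person", "Athlete", "GolfPlayer"]],
    "PlaceCC":  [["Place", "PopulatedPlace", "Settlement", "City"]],
}

def cc_score(concepts):  # Equation 3: sum over all vertices of the component
    return sum(concept_score[c] for c in concepts)

def path_score(path):    # Equation 4: sum over the concepts of one path
    return sum(concept_score[c] for c in path)

# Pick the winning component by Equation 3 ...
winner = max(components, key=lambda cc: cc_score(
    {c for path in components[cc] for c in path}))
print(winner)  # 'PersonCC' (1.8 vs 1.2)

# ... then the winning path inside it by Equation 4.
print(max(components[winner], key=path_score))
# ['Person', 'Athlete', 'GolfPlayer'] (1.3 vs 1.2)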
2 Link to the system

As described above, the MantisTable approach has been integrated into a web application developed with Python and Django. A MongoDB database acts as the table and KG repository. The code is freely available through a Git repository^4. In order to achieve scalability, and therefore improve efficiency, MantisTable has been installed in a Docker container, to achieve parallelism at the application level and to facilitate deployment on servers. The management of resources is performed by using task queues (i.e. Celery Workers^5). The five phases of the STI have been implemented modularly, allowing easy replacement or extension by other developers.

^4 bitbucket.org/disco_unimib/mantistable-tool.py
^5 docs.celeryproject.org/en/latest/userguide/workers.html
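To illustrate this modular design, here is a minimal sketch of how the five phases could be wired together as Celery tasks. The task bodies, task names and broker URL are illustrative assumptions, not the actual MantisTable code.

from celery import Celery, chain

# Illustrative broker URL; the real deployment configuration may differ.
app = Celery("mantistable", broker="redis://localhost:6379/0")

# One task per STI phase, so each phase can be replaced or extended
# independently. The bodies are placeholders.
@app.task
def data_preparation(table):       # phase 0: clean and normalise cells
    return table

@app.task
def column_analysis(table):        # phase 1: NE/L types and S-column
    return table

@app.task
def entity_linking(table):         # phase 2: cell-entity mappings
    return table

@app.task
def predicate_annotation(table):   # phase 3: column predicates
    return table

@app.task
def concept_annotation(table):     # phase 4: column concepts/datatypes
    return table

def annotate(table):
    """Chain the five phases; each runs as a separate worker task."""
    return chain(data_preparation.s(table), column_analysis.s(),
                 entity_linking.s(), predicate_annotation.s(),
                 concept_annotation.s()).apply_async()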
2.1 Adaptations made for the evaluation

To participate in the challenge we made the following changes: i) MantisTable was originally developed to support the JSON format for loading tables and exporting results; in order to take part in the challenge, we developed a new parser to manage CSV tables, and during this phase we encountered several problems with the handling of different character encodings; ii) since the target columns (the columns to be annotated) were provided during the challenge, we disabled most of the Column Analysis phase (i.e. subject column detection); iii) our solution was designed to identify just one correct class per column, so we wrote a new export script with different criteria for the selection of concepts, also taking into account their hierarchy in the KG.

3 Results

In this section we discuss the results of the MantisTable approach in Rounds 1 to 4 of the challenge. In the following tables, F1 and F2 refer to the F1-Score and Precision for the CEA and CPA tasks in every round and for the CTA task in Round 1, while they refer to the AH-Score and AP-Score for the CTA task in Rounds 2 to 4.

In the first round MantisTable achieved the results in Table 5. The results are good, in particular in the CEA task. In the second round MantisTable achieved the results in Table 6.

Table 5. Results of Round 1.

TASK | F1    | F2
CTA  | 0.929 | 0.929
CEA  | 1.0   | 1.0
CPA  | 0.965 | 0.991

Table 6. Results of Round 2.

TASK | F1    | F2
CTA  | 1.049 | 0.247
CEA  | 0.614 | 0.673
CPA  | 0.460 | 0.544

Table 7. Results of Round 3.

TASK | F1    | F2
CTA  | 1.648 | 0.269
CEA  | 0.633 | 0.679
CPA  | 0.518 | 0.595

Table 8. Results of Round 4.

TASK | F1    | F2
CTA  | 1.682 | 0.322
CEA  | 0.973 | 0.983
CPA  | 0.787 | 0.841

In Round 2 we focused in particular on the CTA task, because it was crucial to our initial algorithm for the other steps; we used the results of the Concept Annotation to filter the results in the CEA and CPA tasks. Regarding the CEA task, it is possible that during the parsing and cleaning of the data some rows were misplaced at different indexes. We did not have much time to look into it, but we are confident we could have done a better job in this task by using other resources for disambiguation (e.g. DBpedia or Wikidata). Most of the considerations we made in Round 2 are no longer valid with the new approach.

In the third round MantisTable achieved the results in Table 7. The results are better than in the previous round both in Precision and in F1-Score, but the CEA and CPA tasks still have low precision due to too many false positives: in the CEA task our approach did not have enough contextual information to discriminate between eligible entities, while in the CPA task our approach suffered from errors in the entity linking. In the fourth round MantisTable achieved the results in Table 8. To overcome the lack of contextual information needed by the linking phase, we changed our approach as described in the previous sections, gaining a considerable precision boost for both the CEA and CPA tasks compared to previous rounds. The results of the CEA and CPA tasks with the new method are shown in Tables 9 and 10.

Table 9. Results of Round 2 (recomputed).

TASK | F1    | F2
CEA  | 0.799 | 0.897
CPA  | 0.663 | 0.702

Table 10. Results of Round 3 (recomputed).

TASK | F1    | F2
CEA  | 0.94  | 0.977
CPA  | 0.723 | 0.759

4 Conclusions and General comments

Unlike state-of-the-art approaches, MantisTable i) provides a comprehensive solution that supports all annotation steps; ii) provides an unsupervised method to annotate independent tables; iii) generates context for disambiguation; iv) provides a tool to support the STI workflow and a tool to support the evaluation by providing validation indicators, both of which are publicly available. In relation to the results obtained in Round 4, we are going to change the workflow of our tool to better implement the methods for the CEA and CPA tasks. We are also going to develop a way to process tables via remote API calls to allow easier third-party integration.

References

1. Deng, D., Jiang, Y., Li, G., Li, J., Yu, C.: Scalable column concept determination for web tables using large knowledge bases. Proc. VLDB Endow. 6(13), 1606-1617 (Aug 2013)
2. Knoblock, C.A., Szekely, P., Ambite, J.L., Goel, A., Gupta, S., Lerman, K., Muslea, M., Taheriyan, M., Mallick, P.: Semi-automatically Mapping Structured Sources into the Semantic Web, pp. 375-390. Springer Berlin Heidelberg, Berlin, Heidelberg (2012)
3. Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. Proc. VLDB Endow. 3(1-2), 1338-1347 (Sep 2010)
4. Mulwad, V., Finin, T., Joshi, A.: Semantic message passing for generating linked data from tables. In: Alani, H., Kagal, L., Fokoue, A., Groth, P., Biemann, C., Parreira, J.X., Aroyo, L., Noy, N., Welty, C., Janowicz, K. (eds.) The Semantic Web - ISWC 2013. pp. 363-378. Springer Berlin Heidelberg, Berlin, Heidelberg (2013)
5. Pham, M., Alse, S., Knoblock, C.A., Szekely, P.: Semantic Labeling: A Domain-Independent Approach, pp. 446-462. Springer International Publishing, Cham (2016)
6. Quercini, G., Reynaud, C.: Entity discovery and annotation in tables. In: Proceedings of the 16th International Conference on Extending Database Technology. pp. 693-704. EDBT '13, ACM, New York, NY, USA (2013)
7. Ramnandan, S., Mittal, A., Knoblock, C.A., Szekely, P.: Assigning Semantic Labels to Data Sources, pp. 403-417. Springer International Publishing, Cham (2015)
8. Ritze, D., Lehmberg, O., Bizer, C.: Matching HTML tables to DBpedia. In: Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics. pp. 10:1-10:6. WIMS '15, ACM, New York, NY, USA (2015)
9. Syed, Z., Finin, T., Mulwad, V., Joshi, A.: Exploiting a web of semantic data for interpreting tables. In: Proceedings of the Second Web Science Conference. vol. 5 (2010)
10. Taheriyan, M., Knoblock, C.A., Szekely, P., Ambite, J.L.: Learning the semantics of structured data sources. Web Semantics: Science, Services and Agents on the World Wide Web 37-38, 152-169 (2016)
11. Venetis, P., Halevy, A., Madhavan, J., Paşca, M., Shen, W., Wu, F., Miao, G., Wu, C.: Recovering semantics of tables on the web. Proc. VLDB Endow. 4(9), 528-538 (Jun 2011)
12. Wang, J., Wang, H., Wang, Z., Zhu, K.Q.: Understanding tables on the web. In: Proceedings of the 31st International Conference on Conceptual Modeling. pp. 141-155. ER'12, Springer-Verlag, Berlin, Heidelberg (2012)
13. Zhang, Z.: Effective and efficient semantic table interpretation using TableMiner+. Semantic Web 8(6), 921-957 (2017)