
Towards Generative Semantic Table Interpretation

Viet-Phi Huynh

Yoan Chabot

Raphaël Troncy

Orange

France

EURECOM

France

Semantic Table Interpretation (STI), or Semantic Table Annotation, is the process of understanding the semantics of tabular data with reference information identified in knowledge graphs (KG). In this paper, we first present insights gained from the design and implementation of DAGOBAH SL, a top-performing STI system in state-of-the-art benchmarks, and we discuss the unsolved challenges that need to be addressed to make STI more effective in practice. Pre-trained generative Large Language Models (LLMs) have demonstrated their versatility in tackling a broad spectrum of natural language understanding tasks. We envision their potential for improving STI systems, and we describe several research ideas that could lay the foundation for the future development of Generative Semantic Table Interpretation.

Keywords: Semantic Table Interpretation, DAGOBAH, Knowledge Graph, Large Language Model, Generative Information Extraction

1. Introduction

scenarios, or the choice of algorithmic backbone [8, 9]. DAGOBAH SL [9, 10], the winning system of SemTab over the last two years, has been shown to perform efficiently on well-formed relational tables (like the one in Figure 1) thanks to a rich set of match-based heuristics that evaluate the relevance between the table context and the entity graph. However, it still struggles with enterprise tables, heterogeneous tables found in the wild Web, and tables that have low encyclopedic coverage (e.g. GitTables [11]).

This paper has two objectives. First, we present the insights gained when designing and implementing DAGOBAH SL, and we discuss the remaining scientific barriers that need to be tackled to obtain a reliable and generic annotation system. Second, in light of advancements in pre-trained generative Large Language Models (LLMs) [12], with their emergent abilities and flexibility in solving a wide range of natural language understanding tasks, we envision several future steps for leveraging LLMs to fuel our table annotation system. Specifically, we rely on fine-tuning or few-shot learning techniques to adapt the model to table structure, inject and update knowledge within LLMs, and use this knowledge to generate the annotations auto-regressively in textual form.

2. DAGOBAH SL

DAGOBAH SL [9, 10] (SL standing for Semantic Lookup) is a framework for interpreting relational tables automatically via a two-stage pipeline: i) preprocessing is performed to clean the table and extract metadata, such as orientation, header, key column, and column primitive typing (e.g. units, URL, email, etc.). Importantly, this step can automatically detect the {cells, columns, column pairs} targets that require annotation; ii) annotation follows a retrieve-then-rerank strategy, where the retrieve phase searches for relevant KG entity candidates for a target cell mention via a keyword-based entity lookup, and the rerank phase then selects the most relevant entity for the cell (CEA task). The cell annotations are subsequently leveraged to predict the column types (CTA) and column-pair relations (CPA). The strength of DAGOBAH SL lies in two points:

A powerful keyword-based entity lookup service, built on Elasticsearch, indexing every label/alias of every entity in the alias table into an inverted index. It is capable of covering diverse surface forms of cell mentions, such as acronyms, synonyms or typos, through alias table enrichment, which results in a high recall within a few retrieved candidates. It incorporates three ranking factors: two similarity scores between the mention and the entity label/aliases, calculated using edit distance and BM25, and a page-rank score quantifying the popularity of an entity. DAGOBAH SL currently supports various snapshots of the DBpedia and Wikidata knowledge graphs.

An iterative cell entity/column type/column-pair relation disambiguation that leverages the mutual interaction between table elements to optimize the re-ranking of candidate entities/column types/relations. For example, the types of a column can guide the ranking of entity candidates in cells associated with that column, and vice versa. The compatibility between the table context and the entity graph is evaluated thoroughly using a comprehensive set of matching rules, including: i) semantic context plays a more important role than literal context; ii) in the table, a neighboring column that is highly connected to the target column should have a higher contextual weight; iii) the entity representation is expressed by a multi-hop graph centered around the entity, allowing the exploitation of richer context; iv) the semantic correlation between the column header and the cell entity's description is exploited via a BERT-based cross-encoder.

Furthermore, DAGOBAH SL is packaged in a RESTful API [2] and a user-friendly Web UI [13] to facilitate the usage of the STI framework.

3. Lessons Learned and Challenges

Tabular data is highly heterogeneous. The relational table, as shown in Figure 1, is not the sole type of table. There are other types with a different topology of the semantic connections between the cells, such as the entity table¹ or the matrix table². In terms of layout structure, tables can have multiple headers, split cells, merged cells or cells containing multiple values (e.g. a list). A more detailed table taxonomy is provided in [1].

Arguably, an annotation algorithm designed for a specific table type or table layout may be neither suitable nor efficient for other types or layouts. DAGOBAH SL is built upon intuitions and assumptions derived from relational tables, where i) each row corresponds to the description of a specific entity, with columns providing its attributes; ii) a semantic cell (i.e. a cell that contains a mention that can be disambiguated) is fully represented by an entity; iii) tables have either no header or only a single header. However, the first assumption does not hold for matrix tables, and the latter two hinder DAGOBAH SL from handling split cells and multi-valued cells.

¹ See Fig. 1b in webtables for an example of an entity table.

² See Fig. 1c in webtables for an example of a matrix table.

Knowledge Base (KB) Indexing and Exploitation. As a closed Information Extraction application [14], STI systems rely on a knowledge base (e.g. a knowledge graph, an ontology or a catalog of entity definitions) to constrain and guide the annotation process. A typical KB consists of millions of entities, requiring a proper indexing strategy for efficient retrieval and exploitation. The usage of a KB in information extraction tasks involves two aspects: i) entity attributes to be indexed and exploited: an entity can be characterized by its labels/aliases, its description or its contextual information within a graph. These attributes are leveraged partly or fully to retrieve and disambiguate relevant entity candidates; ii) skewness of the entity/relation distribution: entities and relations vary significantly in terms of popularity and expressiveness, which may challenge the consistency of model performance. Heuristic-based annotation systems like DAGOBAH SL, relying on inverted indexes associated with flexible matching mechanisms (e.g. exact matching or fuzzy matching), can work effectively and efficiently on tables that exhibit a high degree of literal similarity with the entities' attributes. On the other hand, representation learning-based models learn the underlying semantics of tables and entities through embeddings, making them more robust to noise and to ambiguous/incomplete context. However, their performance depends strongly on the quality of the training data and can differ greatly between frequently occurring entities/relations and rare ones.

Error Propagation from Detection to Annotation. Like many other STI systems [15, 16, 17, 18], DAGOBAH SL performs target detection (via preprocessing) and target annotation independently, and hence suffers from error accumulation, where errors caused by the first stage propagate to the later stage. Cases in which the system fails to distinguish cells containing literal mentions from cells containing semantic mentions will lead to missing or incorrect annotations. Moreover, most target detection techniques are heuristic-based (e.g. cells with a string data type are considered as CEA targets) or locally contextualized (e.g. using only the single column to determine whether the inner cells are linkable to KG entities). The effectiveness of these techniques in various table scenarios remains uncertain.

Candidate Generation is challenging. Entity candidate generation is critical for effective STI systems that rely on the retrieve-then-rerank paradigm. Its goal is to deal with a huge number of entities and narrow down the search space. DAGOBAH SL employs a dictionary lookup that computes the literal similarity between the table mention (possibly with context) and the entity's attributes (labels, aliases or descriptions). While this approach is appealing for handling various surface forms of mentions (e.g. acronyms, synonyms, typos), it lacks semantic understanding, which can amplify the ambiguity within candidate sets and make the subsequent candidate re-ranking phase more challenging. With the rise of pre-trained language models, the embedding approach via contrastive learning based on dual encoders is capable of capturing both table and entity semantics and can hence offer superior disambiguation capability. However, to the best of our knowledge, there has been no work investigating the potential of dual encoders specifically on structured tabular data. Moreover, the construction of negative sets for contrastive learning is a non-trivial task that impacts the quality of the learned embeddings. In addition, like the target detection module, candidate generation risks propagating errors to the annotation step: if the gold entity is not part of the retrieved candidates, the corresponding table element will never be correctly annotated.

4. Towards Generative Semantic Table Interpretation

Pre-trained generative LLMs (or foundational language models) have revolutionized numerous natural language understanding tasks, including information extraction (IE) tasks such as named entity recognition, entity linking and relation extraction [19]. The present state-of-the-art IE models leverage the flexibility of decoder-only or encoder-decoder LLM architectures for structured prediction, allowing for the joint handling of different IE tasks in an end-to-end and unified manner [20, 21, 22, 23, 24]. While this approach was originally applied to unstructured textual data, we argue it is still beneficial for structured data. In line with [25, 26], we believe that LLMs will be more and more adopted to tackle this task. This paper introduces our vision towards generative closed Information Extraction tailored for tabular data, namely Generative Semantic Table Interpretation (GenSTI, Figure 2). As a stepping stone, we will aim to evaluate the potential of LLMs in tackling the challenges discussed in Section 3. The desiderata are as follows:

Ability to handle simultaneously various table types and layouts. By framing the STI tasks within a unified seq-2-seq framework [27], generative LLMs can be prompted with different table types/layouts and can jointly solve the CEA, CTA and CPA tasks. This framework follows the multi-task learning paradigm that has been successful in NLP. In particular, the model is fine-tuned with a mix of table sets to perform multiple STI tasks at once. Accordingly, it can facilitate knowledge transfer between tasks and table structures, leading to the acquisition of more robust and generalizable representations. Inspired by generative information extractors for text, we investigate two sequence modeling strategies: (i) serialize the table input and annotation outputs as plain text (text-2-text) and solve with natural language LLMs (NL-LLMs) (Figure 2-left) [20, 21, 23]. This flattening method, however, ignores the structural information embedded in the table and in the output, and it suffers from the discrepancy between the datasets used to fine-tune LLMs for STI tasks and the natural language corpora on which LLMs are pre-trained. To alleviate this issue, we could rely on [24, 28] to (ii) cast STI tasks as code generation (code-2-code) and solve with Code-LLMs (Figure 2-right). Converting structured data into code is easier and provides a more informative representation than transforming it into free-form text. Hence, this technique narrows the gap between pre-training and fine-tuning in Code-LLMs. Interestingly, by programming a table as a two-dimensional list, which is a common data type in code and is expected to occur frequently during pre-training, the model could better capture the table's topology (i.e. facilitate the identification of the $i$-th row, the $j$-th column or the cell at coordinates $[i, j]$), compared to NL-LLMs. [24, 28] have demonstrated the appealing few-shot performance of Code-LLMs (i.e. Codex [29]) in structured prediction tasks that involve no code at all, such as information extraction or argument graph generation. We argue that using Code-LLMs to tackle table-related downstream tasks could be a promising future research direction.

End-to-End Semantic Table Annotation. Instead of performing target detection and target annotation as separate stages, end-to-end STI takes into account the mutual dependence and cooperation between the two, which could lead to significant performance improvement. This concept has successfully inspired state-of-the-art models in related applications such as entity linking [22, 30] or information extraction [20, 23]. In the context of GenSTI, there are two aspects related to the arbitrary generation of generative LLMs that need to be controlled to ensure a reliable end-to-end GenSTI: (i) Output structure consistency: only generate valid tags and adhere to the predefined schema. For example, in Figure 2, regarding the CEA task, the NL-LLM has to follow the template: [CEA] | cell_mention (entity_id) | ... where cell_mention is copied from the table and is linked to the corresponding entity_id in the KB; (ii) Semantic consistency: entity_id must be a valid entity existing in the KB, and [CPA] must generate a relation_id rather than an entity_id. A solution to both challenges is to endow LLMs with a decoding scheme constrained by prefix trees that forces the model to generate only legal tokens at each decoding step [22, 23]. Interestingly, even without such constrained decoding, we observe that [24] still reports good few-shot performance for IE tasks, suggesting that, to some extent, LLMs, especially Code-LLMs, are capable of capturing the internal representations of the task and of generating relevant outputs without guidelines. [31] made a similar observation when training an LLM to play the game Othello by feeding it a naive transcript recording the interleaving moves of the two players, without adding any knowledge of the game rules. The model effectively learned meaningful latent representations, enabling it to uncover the game and make legal disc moves on the board.
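As a rough illustration of decoding constrained by a prefix tree, the following Python sketch builds a trie over the token sequences of valid KB identifiers and lets a greedy decoder choose only among the tokens that the trie allows at each step. The tokenization of identifiers, the `<eos>` marker, the toy identifiers and `toy_score` (a stand-in for the scores an LLM would produce) are assumptions made for this example and are not taken from [22, 23].

```python
from typing import Callable, Dict, List, Sequence

END = "<eos>"  # hypothetical end-of-sequence marker


def build_trie(sequences: Sequence[Sequence[str]]) -> Dict:
    """Build a nested-dict prefix tree from the valid token sequences."""
    root: Dict = {}
    for seq in sequences:
        node = root
        for tok in list(seq) + [END]:
            node = node.setdefault(tok, {})
    return root


def allowed_next(trie: Dict, prefix: Sequence[str]) -> List[str]:
    """Tokens that may legally follow `prefix`; empty if the prefix is invalid."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []
        node = node[tok]
    return sorted(node.keys())


def constrained_greedy_decode(
    trie: Dict,
    score_fn: Callable[[List[str], str], float],
    max_len: int = 16,
) -> List[str]:
    """Greedy decoding in which only trie-legal tokens compete at each step."""
    out: List[str] = []
    for _ in range(max_len):
        legal = allowed_next(trie, out)
        if not legal:
            break
        best = max(legal, key=lambda tok: score_fn(out, tok))
        if best == END:
            break
        out.append(best)
    return out


# Toy KB identifiers, split into tokens (illustrative assumption).
KB_IDS = [["Q", "14", "2"], ["Q", "14", "5"], ["Q", "90"]]
TRIE = build_trie(KB_IDS)


def toy_score(prefix: List[str], token: str) -> float:
    """Stand-in for LLM scores; its favourite token "7" is never legal."""
    preferences = {"7": 9.0, "5": 3.0, "14": 2.0, "Q": 1.0, END: 0.5}
    return preferences.get(token, 0.0)


decoded = constrained_greedy_decode(TRIE, toy_score)
print(decoded)  # ['Q', '14', '5']
```

Although the toy scorer prefers the token "7", that token never appears among the legal continuations, so the decoded identifier is guaranteed to exist in the toy KB. The same trie could hold relation identifiers for the [CPA] tag, which would address the semantic consistency requirement under the same assumptions.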

Efficient KB indexing and exploitation. Recently, Differentiable Search Index (DSI) [32, 33] has emerged as a novel generative retrieval approach, deviating from the common retrieve-then-rerank paradigm. One of our objectives is to investigate whether DSI can serve as a viable solution for efficient KB indexing and exploitation within the GenSTI system. Specifically, entities and their graphs (e.g. [34]) are directly encoded (or stored) into the LLM's parameters (a.k.a. indexing). The model then leverages the injected knowledge to autoregressively predict the entity identifier through softmax calculations over the vocabulary's tokens. This indexing mechanism relaxes the Candidate Generation phase, which saves a non-negligible computation cost and eliminates both the need for external space to store entity embeddings and the need for meaningful negative samples for learning, as required by dual encoder-based candidate generation. While DSI works effectively on small KBs, its behavior when scaling to large KBs (e.g. Wikidata with ∼100 million entities) remains an open research challenge [35], with three key elements to be clarified: (i) in the indexing phase, how many entities the model can memorize [36]; (ii) whether the model can efficiently learn entities and propagate its knowledge to support the generation [37]; and (iii) the robustness to entity/relation skewness. Generative Information Extraction [23] has been shown to be more robust than strong baselines for long-tail entities, but is still far from satisfactory. [38] proposes fine-tuning the extractor on a more balanced dataset, which remarkably improves the macro performance.
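To make the link between entity skewness and macro performance concrete, the short Python sketch below compares micro-averaged accuracy (every mention counts equally) with macro-averaged accuracy (every entity counts equally). The entity names and predictions are made-up toy values, and the function `micro_macro_accuracy` is a hypothetical helper written for this illustration; it does not reproduce the evaluation protocol of [23] or [38].

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def micro_macro_accuracy(pairs: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Compute (micro, macro) accuracy from (gold_entity, predicted_entity) pairs."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for gold, pred in pairs:
        total[gold] += 1
        correct[gold] += int(gold == pred)
    # Micro: pool all mentions, so frequent entities dominate the score.
    micro = sum(correct.values()) / sum(total.values())
    # Macro: average the per-entity accuracies, so rare entities weigh as much.
    macro = sum(correct[e] / total[e] for e in total) / len(total)
    return micro, macro


# Toy predictions: one popular entity and two long-tail ones that the
# (imaginary) model always confuses with the popular entity.
pairs = [("Q_head", "Q_head")] * 8 + [("Q_tail1", "Q_head"), ("Q_tail2", "Q_head")]
micro, macro = micro_macro_accuracy(pairs)
print(f"micro={micro:.2f} macro={macro:.2f}")  # micro=0.80 macro=0.33
```

In this toy setting, a model that only ever predicts the popular entity still reaches a micro accuracy of 0.80, whereas the macro accuracy drops to 0.33 because the two long-tail entities are never recovered.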

5. Conclusion

This paper has first reflected on the work carried out when designing DAGOBAH SL over the last few years and the lessons we learned in implementing a top-performing system for STI. This work is also the first step in our journey towards Generative Semantic Table Interpretation (GenSTI). We plan to conduct intensive experiments to uncover the challenges and gain a deeper understanding of the contributions of LLMs to the STI topic, as discussed in Sections 3 and 4.

[1] Liu, Chabot, Troncy, V.-P. Huynh, Labbé, Monnin, From tabular data to knowledge graphs: A survey of semantic table interpretation tasks and methods, Journal of Web Semantics 76 (2023).
[2] Chabot, Deuzé, V.-P. Huynh, Labbé, J. Liu, Monnin, Troncy, A Framework for Automatically Interpreting Tabular Data at Orange, in: ISWC (Posters/Demos/Industry Track), 2021.
[3] Cutrona, Bianchi, Jiménez-Ruiz, Palmonari, Tough tables: Carefully evaluating entity linking for tabular data, in: 9th International Semantic Web Conference (ISWC), Springer, 2020, pp. 328–343.
[4] Jiménez-Ruiz, Hassanzadeh, Efthymiou, Chen, K. Srinivas, SemTab 2019: Resources to Benchmark Tabular Data to Knowledge Graph Matching Systems, in: European Semantic Web Conference (ESWC), 2020.
[5] Jiménez-Ruiz, Hassanzadeh, Efthymiou, Chen, Srinivas, Cutrona, Results of SemTab 2020, in: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, 2020.
[6] Cutrona, Chen, Efthymiou, Hassanzadeh, Jiménez-Ruiz, Sequeda, Srinivas, Abdelmageed, Hulsebos, Oliveira, et al., Results of SemTab 2021, in: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, 2021.
[7] Abdelmageed, Chen, Cutrona, Efthymiou, Hassanzadeh, Hulsebos, Jiménez-Ruiz, Sequeda, Srinivas, Results of SemTab 2022, in: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, 2022.
[8] Hulsebos, Gathani, Gale, I. Dillig, P. Groth, Ç. Demiralp, Making Table Understanding Work in Practice, arXiv preprint arXiv:2109.05173, 2021.
[9] V.-P. Huynh, Chabot, Labbé, J. Liu, Troncy, From Heuristics to Language Models: A Journey Through the Universe of Semantic Table Interpretation with DAGOBAH, in: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab), 2022.
[10] V.-P. Huynh, J. Liu, Chabot, Deuzé, Labbé, Monnin, Troncy, DAGOBAH: Table and Graph Contexts for Efficient Semantic Annotation of Tabular Data, in: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab), 2021.
[11] Hulsebos, Ç. Demiralp, Groth, Gittables: A large-scale corpus of relational tables, Proceedings of the ACM on Management of Data 1 (2023) 1–17.
[12] Wei, Tay, Bommasani, Raffel, Zoph, Borgeaud, Yogatama, Bosma, Zhou, Metzler, E. H. Chi, Hashimoto, Vinyals, Liang, Dean, Fedus, Emergent Abilities of Large Language Models, Transactions on Machine Learning Research (2022).
[13] Sarthou-Camy, Jourdain, Chabot, Monnin, Deuzé, V.-P. Huynh, J. Liu, Labbé, Troncy, DAGOBAH UI: a new hope for semantic table interpretation, in: European Semantic Web Conference (ESWC), Satellite Events, Springer, 2022, pp. 107–111.
[14] J. L. Martinez-Rodriguez, Hogan, I. Lopez-Arevalo, Information extraction meets the semantic web: a survey, Semantic Web 11 (2020) 255–335.
[15] M. Cremaschi, F. De Paoli, A. Rula, B. Spahiu, A fully automated approach to a complete semantic table interpretation, Future Generation Computer Systems 112 (2020) 478–500.
[16] P. Nguyen, I. Yamada, N. Kertkeidkachorn, R. Ichise, H. Takeda, Demonstration of MTab: Tabular Data Annotation with Knowledge Graphs, in: ISWC (Posters/Demos/Industry Track), 2021.
[17] R. Shigapov, P. Zumstein, J. Kamlah, L. Oberländer, J. Mechnich, I. Schumm, bbw: Matching csv to wikidata via meta-lookup, in: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab), 2020.
[18] N. Abdelmageed, S. Schindler, JenTab: Matching Tabular Data to Knowledge Graphs, in: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab), 2020, pp. 40–49.
[19] H. Ye, N. Zhang, H. Chen, H. Chen, Generative knowledge graph construction: A review, in: International Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2022.
[20] Y. Lu, Q. Liu, D. Dai, X. Xiao, H. Lin, X. Han, L. Sun, H. Wu, Unified Structure Generation for Universal Information Extraction, in: 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022, pp. 5755–5772.
[21] G. Paolini, B. Athiwaratkun, J. Krone, J. Ma, A. Achille, R. Anubhai, C. N. dos Santos, B. Xiang, S. Soatto, Structured Prediction as Translation between Augmented Natural Languages, in: 9th International Conference on Learning Representations (ICLR), 2021.
[22] N. De Cao, G. Izacard, S. Riedel, F. Petroni, Autoregressive entity retrieval, in: 9th International Conference on Learning Representations (ICLR), 2021.
[23] M. Josifoski, N. De Cao, M. Peyrard, F. Petroni, R. West, GenIE: Generative information extraction, in: Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Association for Computational Linguistics, Seattle, United States, 2022.
[24] P. Li, T. Sun, Q. Tang, H. Yan, Y. Wu, X. Huang, X. Qiu, CodeIE: Large Code Generation Models are Better Few-Shot Information Extractors, in: 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
[25] N. Tang, J. Fan, F. Li, J. Tu, X. Du, G. Li, S. Madden, M. Ouzzani, RPT: relational pre-trained transformer is almost all you need towards democratizing data preparation, The VLDB Endowment (2021).
[26] W.-C. Tan, Unstructured and structured data: Can we have the best of both worlds with large language models?, arXiv preprint arXiv:2304.13010, 2023.
[27] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 5485–5551.
[28] A. Madaan, S. Zhou, U. Alon, Y. Yang, G. Neubig, Language Models of Code are Few-Shot Commonsense Learners, in: International Conference on Empirical Methods in Natural Language Processing (EMNLP), Abu Dhabi, UAE, 2022.
[29] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al., Evaluating large language models trained on code, arXiv preprint arXiv:2107.03374 (2021).
[30] N. Kolitsas, O.-E. Ganea, T. Hofmann, End-to-End Neural Entity Linking, in: 22nd Conference on Computational Natural Language Learning (CoNLL), 2018, pp. 519–529.
[31] K. Li, A. K. Hopkins, D. Bau, F. Viégas, H. Pfister, M. Wattenberg, Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task, in: 11th International Conference on Learning Representations (ICLR), 2023.
[32] Y. Tay, V. Tran, M. Dehghani, J. Ni, D. Bahri, H. Mehta, Z. Qin, K. Hui, Z. Zhao, J. Gupta, et al., Transformer memory as a differentiable search index, Advances in Neural Information Processing Systems 35 (2022) 21831–21843.
[33] M. Bevilacqua, G. Ottaviano, P. Lewis, S. Yih, S. Riedel, F. Petroni, Autoregressive search engines: Generating substrings as document identifiers, Advances in Neural Information Processing Systems 35 (2022) 31668–31683.
[34] F. Moiseev, Z. Dong, E. Alfonseca, M. Jaggi, SKILL: Structured Knowledge Infusion for Large Language Models, in: Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2022, pp. 1581–1588.
[35] R. Pradeep, K. Hui, J. Gupta, A. D. Lelkes, H. Zhuang, J. Lin, D. Metzler, V. Q. Tran, How Does Generative Retrieval Scale to Millions of Passages?, in: Generative Information Retrieval @ SIGIR, 2023.
[36] N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, C. Zhang, Quantifying Memorization Across Neural Language Models, in: 11th International Conference on Learning Representations (ICLR), 2023.
[37] Y. Onoe, M. J. Zhang, S. Padmanabhan, G. Durrett, E. Choi, Can LMs Learn New Entities from Descriptions? Challenges in Propagating Injected Knowledge, in: 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
[38] M. Josifoski, M. Sakota, M. Peyrard, R. West, Exploiting asymmetry for synthetic training data generation: Synthie and the case of information extraction, arXiv preprint arXiv:2303.04132, 2023.