A Framework for the Generation of Training Examples from Tabular Data (Discussion Paper)

Jean-Flavien Bussotti1, Paolo Papotti1, Donatello Santoro2 and Enzo Veltri2,*

1 EURECOM, Biot, France
2 Università degli Studi della Basilicata (UNIBAS), Potenza, Italy

SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy
* Corresponding author.
jflavien.bussotti@gmail.com (J. Bussotti); papotti@eurecom.fr (P. Papotti); donatello.santoro@unibas.it (D. Santoro); enzo.veltri@unibas.it (E. Veltri)
ORCID: 0009-0009-8869-6025 (J. Bussotti); 0000-0003-0651-4128 (P. Papotti); 0000-0002-5651-8584 (D. Santoro); 0000-0001-9947-8909 (E. Veltri)

Abstract
Tabular data is becoming increasingly important in Tabular Natural Language Inference (TNLI), where the goal is to assess whether a table supports or refutes a given hypothesis expressed in natural language text. A major issue in TNLI is the lack of training data. Existing approaches are based on manual annotation of new training data or on simple augmentation techniques that lack data variety and complexity. We present a system, Tenet, that automatically generates new training examples for TNLI applications in different domains. Our framework exploits SQL queries to introduce data variety through evidence-queries, which identify new cell values over the data by exploiting different data patterns, and complexity through semantic-queries, which describe the different ways such data can be identified with SQL. Descriptions from the semantic-queries are used to verbalize the new cell values from the evidence-queries using a Pre-trained Language Model (PLM). The verbalized sentence and the cell values can be used as a new training example in the target TNLI application. We show how Tenet generates human-like examples that are comparable with manually written examples.

Keywords
Tabular Natural Language Inference (TNLI), Natural Language Processing (NLP) for Databases, Text Generation, Query Generation, Data Augmentation

1. Introduction

A large class of natural language inference (NLI) problems aims at classifying a given hypothesis, such as a textual statement, as true/false/unknown given some evidence. Recently, a new class of applications has emerged that focuses on inference with structured data as evidence, i.e., tabular natural language inference (TNLI). Example applications are table understanding and computational fact checking, where systems label text claims according to input structured data [1, 2, 3, 4, 5, 6, 7]. Most of the solutions for TNLI are supervised, and manually defined datasets for TNLI have been proposed, such as Feverous [8], TabFact [9], and Infotabs [5]. However, these datasets: i) cover only some generic topics from Wikipedia tables.
For example, if there is a need for fact-checking claims for emerging domains such as Covid-19, a new annotated corpus must be crafted by manually writing examples using the tabular reports published by governments; ii) they are not comparable in scale and variety to those available for textual NLI [10]. For example, about 80% of the examples in Totto [11] have sentences describing the data with text that does not contain mathematical expressions, such as max, min, and count, or comparisons across values; iii) they contain bias and errors that may lead to incorrect learning in the target models [12].

The problem of the lack of labeled examples has been studied in the literature for NLI, but it has not been tackled yet for TNLI. If some examples are given in a warm start setting, existing NLI augmentation methods can be used in the TNLI setting: the text part of the example can be rewritten with augmentation w.r.t. the (fixed) data [13]. While these methods increase the number of examples, they do not generate a new corpus that raises the variety and complexity of the examples w.r.t. the structured data, and they ultimately have a minor impact on the accuracy of the TNLI task. Moreover, in a cold start setting, where training data is unavailable, there is no proposal yet for creating annotated examples for TNLI starting only from the tables.

User-provided tables can be exploited to generate ad-hoc training data for the application at hand. Our system, Tenet (TExtual traiNing Examples from daTa), generates large annotated corpora of training examples that are complex and rich in terms of data patterns, linguistic diversity, and reasoning complexity [14]. Code and datasets are available at https://github.com/dbunibas/tenet. Figure 1 shows an overview of our architecture. The system generates training data for the target TNLI application, given only a table as input. Once generated, the examples are used to train the inference model, which is validated on test data.

Figure 1: Given any table, Tenet generates new training examples for a target TNLI application. The first example has a hypothesis that is refuted according to the data evidence.

Tenet is built around three modules that cover the three main elements of a complete and annotated TNLI example.

Data Evidence. A key intuition in our approach is that tabular data already contains rich information for new examples. Content changes across datasets, and every relation has its own active domain. Moreover, data relationships across entities and their properties are arranged differently across datasets. To identify the data evidence needed to create a variety of examples, we propose alternative approaches to select sets of cells from the given table, including a query generation algorithm for the semi-supervised case. A query returns sets of evidence, such as Donald and Michelle in the first example in Figure 1, each partially describing an example.

Textual Hypothesis. Once the data is identified, we obtain the textual statement (or hypothesis) for the annotated example. Given a set of cells, we generate queries that identify such data evidence over the input table. Every query characterizes the data with different conditions (e.g., selections with constants) or constructs (e.g., aggregates). From the query and the evidence, we create a text with a prompting method that exploits the human-like generation abilities of large pre-trained language models (PLMs), such as GPT-3 [15]. Our prompting leads to a variety of factual hypotheses, such as "Barack and Nancy are in the same party" in the second example in Figure 1, while maximizing the coverage of the provided evidence and minimizing hallucination.
Inference Label. Finally, we need the corresponding label for every example. While Supports examples are obtained naturally, as the hypothesis reflects the evidence from the table, for Refutes examples we introduce generic methods built around the idea of injecting errors in the data evidence. Once the data is modified, the process for text generation is applied to the "dirty" data to obtain hypotheses that are refuted w.r.t. the original "clean" data.

In the next section we describe the main components; then we present some experimental results obtained with Tenet.

2. Overview of the Solution

Problem Formulation. Let r be a tuple in the instance I of a relational schema R and Ai an attribute in R. We refer to the value of tuple r in attribute Ai as a cell value and, for simplicity, to the instance I as a table². A textual hypothesis is a sentence in natural language. A Tabular Natural Language Inference (TNLI) application takes as input a pair (table c, textual hypothesis h) and outputs whether h is supported or refuted by c. Data evidence is a non-empty subset of cell values from c that varies from a small fraction of the table in some settings [8] to the entire relation in others [9]³. Solutions for the TNLI task rely on supervised models trained with annotated examples; our goal is to reduce the effort in creating such training data. We consider solving the example generation problem for a TNLI application A where we are given the label space L for A, a corpus of tables C, and (optionally) a set of training examples T for A. Every example is a quadruple (h, l, e, c) with a textual hypothesis h, a label l ∈ L, and a set of data evidence cells e contained in one relational table c of the corpus C. We assume access to a text-to-text pre-trained language model (PLM) M. We do not assume access to the TNLI application A at hand. In this work, we focus on L with Supports and Refutes labels only, as those are the most popular in TNLI corpora, covering, e.g., 97% of the examples in [8]. In the warm start version of the problem, training examples for A are available and used by Tenet. In the cold start version of the problem, we drop the assumption on the availability of the examples T. In this case, we aim at creating new training examples D for A just by using the tables in C.

² Some TNLI corpora contain both relational and entity tables, i.e., relational tables transposed with a single row. Tenet supports both, but we focus the presentation on relational ones for clarity.
³ Our proposal is independent of the size of the data evidence and its retrieval.

Figure 2: Tenet overview. Existing examples are optional. Any text-to-text pre-trained language model (PLM) can be used, e.g., ChatGPT. Any target TNLI application can be supported, e.g., tabular fact-checking.

Process and Challenges. Tenet is designed around three main steps, as depicted in Figure 2. Given a relational table c ∈ C, it first gathers the evidence (set of cells) e to produce a Supports example. Second, to enable the generation of a Refutes example, it injects errors in table c to create its noisy version and derive data evidence e'. Third, a textual claim (hypothesis) h is generated for every data evidence e. The quadruple (data evidence e, textual claim h, label Supports/Refutes, table c) is a complete example for training data D for the target TNLI application A. However, the three steps come with their own challenges.
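To make this formulation concrete, the following minimal Python sketch shows one way to encode the quadruple (h, l, e, c), instantiated with the Supports example of Figure 1. The encoding, the class names, and the table identifier are our own illustrative assumptions, not code from the Tenet repository.

from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    row_id: str     # e.g., "t1"
    attribute: str  # e.g., "Party"
    value: str      # e.g., "Dem"

@dataclass
class TrainingExample:
    hypothesis: str       # textual claim h
    label: str            # "Supports" or "Refutes"
    evidence: list[Cell]  # data evidence e
    table_id: str         # identifier of table c in the corpus C

# The Supports example of Figure 1, encoded as a quadruple (h, l, e, c).
example = TrainingExample(
    hypothesis="Barack and Nancy are in the same party",
    label="Supports",
    evidence=[
        Cell("t1", "Name", "Barack"), Cell("t3", "Name", "Nancy"),
        Cell("t1", "Party", "Dem"), Cell("t3", "Party", "Dem"),
    ],
    table_id="politicians",  # hypothetical identifier for the Figure 1 table
)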
Table 1: People table. Cells in bold (the Name and Age values of t1 and t2) form data evidence e1.

      Name  Age  City  Team
t1    Mike  47   SF    DBMS
t2    Anne  22   NY    AI
t3    John  19   NY    DBMS
t4    Paul  18   NY    UOL

Data Evidence. Training examples D must capture the variety of relationships in a table, such as those relating cell values in the same tuple or in the same attribute. A hypothesis is defined over a group of cell values, such as the data evidence e1 highlighted in bold in Table 1 for tuples t1 and t2, i.e., the names of two people with different age values. The hypothesis "Mike is older than Anne" captures the relationship across these four cell values. Data evidence with two cell values, e.g., Name from tuple t1 and Age from tuple t2, can lead to a hypothesis such as "There is a person called Mike and a person 22 years old", but such a sentence does not capture relationships across tuples or attributes. In general, for effective training, the data evidence used in the examples should cover the variety of patterns that can be identified in a relation. One approach for data evidence generation is to pick different sets of cell values at random. While this simple approach is effective and enables an unsupervised solution, there are meaningful patterns, such as e1, that may be covered only rarely by chance. We call this approach cold-start. One approach to improve this task and obtain meaningful patterns with fewer generated examples is to infer data patterns from human-provided examples T, when those are available. For example, given an example in T, we identify a query q (named evidence query, or simply e-query) that returns the cell values of its data evidence as one result row. We then execute the e-query over the relation. The e-query leads to more sets of cells (one per result row) that enable the generation of examples following the same data pattern, for example involving t3 and t4. We call this approach warm-start.

Warm Start. While the cold-start is easy to implement, the generation of the e-query in the warm start is not trivial. Given a set of cell values es and a table cs as input, we want to identify the query q that outputs es among its results. Executing such a query over the original table cs, we obtain more data evidence e1, ..., en that follow the original data pattern in es. Consider again the example in Table 1, with the cell values in bold in the first two rows (t1 and t2) as seed data evidence es. Given such input, we want an algorithm producing a query that returns all pairs of distinct names with their different ages, such as:

q: SELECT c1.Name, c2.Name AS Name2, c1.Age, c2.Age AS Age2
   FROM people c1, people c2
   WHERE c1.Age > c2.Age AND c1.Name <> c2.Name

The e-query generation is based on an evidence graph [14], where each node corresponds to a cell in the evidence es and a (directed) edge between two nodes represents the relationship between their values (equality, difference, comparison). By visiting this graph, we construct the e-query [14].
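A minimal sketch of how such an e-query could be assembled from an evidence graph is shown below. It is our own simplified illustration, not the Tenet algorithm: it only relates cells of the same attribute in different rows, considers equality/ordering/difference relationships, and all function names and the dictionary-based cell encoding are invented for this example.

from itertools import combinations

# Seed evidence es from Table 1: the bold cells of t1 and t2.
seed = [
    {"row": "t1", "attr": "Name", "value": "Mike"},
    {"row": "t1", "attr": "Age", "value": 47},
    {"row": "t2", "attr": "Name", "value": "Anne"},
    {"row": "t2", "attr": "Age", "value": 22},
]

def build_evidence_graph(cells):
    """Edges relate cells of the same attribute in different rows with an
    equality, comparison, or difference relationship."""
    edges = []
    for a, b in combinations(cells, 2):
        if a["attr"] != b["attr"] or a["row"] == b["row"]:
            continue
        if a["value"] == b["value"]:
            op = "="
        elif isinstance(a["value"], (int, float)):
            op = ">" if a["value"] > b["value"] else "<"
        else:
            op = "<>"
        edges.append((a, b, op))
    return edges

def to_e_query(cells, edges, table="people"):
    """Visit the graph and emit a self-join query over the table."""
    rows = sorted({c["row"] for c in cells})
    alias = {r: f"c{i+1}" for i, r in enumerate(rows)}
    select = ", ".join(f'{alias[c["row"]]}.{c["attr"]}' for c in cells)
    frm = ", ".join(f"{table} {alias[r]}" for r in rows)
    where = " AND ".join(
        f'{alias[a["row"]]}.{a["attr"]} {op} {alias[b["row"]]}.{b["attr"]}'
        for a, b, op in edges)
    return f"SELECT {select} FROM {frm} WHERE {where}"

print(to_e_query(seed, build_evidence_graph(seed)))
# SELECT c1.Name, c1.Age, c2.Name, c2.Age FROM people c1, people c2
# WHERE c1.Name <> c2.Name AND c1.Age > c2.Age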
Hypothesis. Given a table c and an evidence set e βŠ† c, the latter can be described with a textual sentence. However, the way a set of cells is converted into a sentence has a huge impact on the variety and the reasoning complexity of the training data. Indeed, given a set of cells from a table, many alternatives exist for describing it in natural language. Consider again data evidence e1 in the example. The values in bold can be correctly described with "Mike is older than Anne." or "There are two persons with age higher than 19.". The more alternative sentences are created for a given data evidence, the better the training set for the target model. Unfortunately, most efforts for automatic data-to-text generation focus on surface, or look-up, sentences [11], such as "Mike is 47 years old and Anne 22.". While these kinds of sentences are fundamental, we aim to maximize the variety in the training data. For this goal, we generate various queries that return evidence e given c. We call such queries semantic queries, or simply s-queries. Such s-queries represent different ways of semantically describing the data. PLMs are trained over huge amounts of textual data, which gives them proficiency in writing, and over source code, which gives them the ability to write code [16] or to be instructed with functions. We therefore propose prompting methods for PLMs to generate alternative sentences that describe the evidence set according to the semantics of the queries. We identify several types of s-queries: 1) surface s-queries, i.e., queries that select cells by using only constant values; 2) comparison s-queries, i.e., queries that compare two or more rows on at least one attribute; 3) filter s-queries, i.e., queries that select cells according to a condition; 4) aggregate s-queries, i.e., queries that select cells that can be used with an aggregate function (count, sum, avg, min, max); 5) filter-aggregate s-queries, i.e., queries that select cells for an aggregation over a group of cells identified by a selection on some conditions. Such s-queries are automatically detected by Tenet [14]. For each s-query, we define a task that describes the text generation function that we want to use. Such generation functions are defined by us with the prompts for the PLM. The task uses the function from the s-query and the evidence. The text generation functions mapped to the corresponding s-queries are reported in Table 2, with examples of the text they generate. Due to space limits, examples of the prompts used with ChatGPT are reported in the full paper [14].

Table 2: Functions used by Tenet in ChatGPT prompts.

S-Query          Function                                      Example
Surface          read(attrList)[*]                             Anne is 22 years old and Paul is 18.
Comparison       compare(op, attr)                             Anne is older than Paul.
Filter           filter(cond, attr)                            Anne, John and Paul are from NY.
FilterAggregate  filter(cond, attr); compute(func, attr)=val   The oldest person from NY is 22 years old.
Aggregate        compute(func, attr)=val                       Mike is the oldest person.
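The exact prompts used by Tenet are reported in the full paper [14]; the sketch below only illustrates the general idea of pairing a generation function from Table 2 with the evidence returned by an s-query to form a PLM prompt. The prompt wording, the FUNCTIONS dictionary, and build_prompt are our own illustrative assumptions.

# The generation functions of Table 2, keyed by s-query type.
FUNCTIONS = {
    "surface":          "read(attrList)[*]",
    "comparison":       "compare(op, attr)",
    "filter":           "filter(cond, attr)",
    "aggregate":        "compute(func, attr)=val",
    "filter_aggregate": "filter(cond, attr); compute(func, attr)=val",
}

def build_prompt(s_query_type, function_call, evidence_rows):
    """Assemble a text-generation prompt from an s-query's function and the
    evidence it returns (one dict per evidence row). Wording is illustrative."""
    evidence_txt = "\n".join(str(row) for row in evidence_rows)
    return (
        "You are given rows of evidence from a table and a function that "
        "describes how they were selected.\n"
        f"Function template: {FUNCTIONS[s_query_type]}\n"
        f"Function instance: {function_call}\n"
        f"Evidence:\n{evidence_txt}\n"
        "Write one factual sentence that describes the evidence according "
        "to the function, using only the values shown above."
    )

# Example: a comparison s-query over Table 1 (t2 vs t4).
prompt = build_prompt(
    "comparison",
    "compare('>', 'Age')",
    [{"Name": "Anne", "Age": 22}, {"Name": "Paul", "Age": 18}],
)
# The prompt is then sent to the PLM (e.g., via the OpenAI API), which is
# expected to produce a sentence such as "Anne is older than Paul."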
Label. By construction, the generated data evidence is coherent with the semantics expressed in the input table. An evidence set therefore leads to an example with a Supports label w.r.t. the data in the table, so the methods above produce Supports examples. However, applications also need examples with a Refutes label, i.e., textual claims not supported by the input table. We tackle this problem with an error injection approach, perturbing the input table to break the original relationships across cell values. This new version of the table is then used to identify again an evidence set e', which leads to a textual hypothesis that does not reflect the semantics of the original (clean) table. We generate a Refutes example for every Supports one. Given some evidence e from the original input table c, we inject noise into a copy c', so that we derive new evidence e' using the same e-query used for the Supports example. A hypothesis h' is then derived from e' with the approach proposed above. Hypothesis h' is a Supports sentence for c', with evidence e', but it is also a Refutes sentence w.r.t. the original (clean) table c and evidence e. The new example is the quadruple with label Refutes, table c, hypothesis h', and evidence e. To inject errors, we first create a copy c' of the table and manipulate it to inject noise. We shuffle in c' the values of 50% of the attributes involved in e. This step breaks the original relationships across cell values at the tuple level. We then either introduce a new tuple in c' or remove one tuple from c' at random. This step changes the cardinality of the tuples, which is key for s-queries involving aggregates, and introduces out-of-domain values. The generation of the new values depends on the attribute type. For categorical attributes, we use a PLM. For numerical attributes, we generate values lower/higher than the min/max value of the active domain; these new values break the original min/max/avg properties of the updated attribute. Finally, we remove from c' any row that also appears in c.
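The following sketch illustrates the error-injection procedure just described (shuffle half of the involved attributes, add or drop a tuple, remove rows that also appear in the clean table). It is a simplified illustration under our own assumptions; the function name, the list-of-dicts table encoding, and the plm_new_value callback are not part of the Tenet codebase.

import random

def inject_errors(table, evidence_attrs, plm_new_value, seed=None):
    """Return a noisy copy c' of `table` (a list of dicts, one per tuple).
    `evidence_attrs` are the attributes involved in the evidence e;
    `plm_new_value` is a callback that asks a PLM for an out-of-domain
    categorical value (an assumption of this sketch)."""
    rng = random.Random(seed)
    dirty = [dict(row) for row in table]  # work on a copy c'

    # 1) Shuffle the values of 50% of the attributes involved in e,
    #    breaking relationships across cell values at the tuple level.
    for attr in rng.sample(evidence_attrs, max(1, len(evidence_attrs) // 2)):
        values = [row[attr] for row in dirty]
        rng.shuffle(values)
        for row, v in zip(dirty, values):
            row[attr] = v

    # 2) Add or remove one tuple at random, changing the cardinality
    #    (important for aggregate s-queries) and adding out-of-domain values.
    if rng.random() < 0.5 and len(dirty) > 1:
        dirty.pop(rng.randrange(len(dirty)))
    else:
        new_row = {}
        for attr in table[0]:
            domain = [row[attr] for row in table]
            if all(isinstance(v, (int, float)) for v in domain):
                # numerical: go below the min or above the max of the domain
                new_row[attr] = min(domain) - 1 if rng.random() < 0.5 else max(domain) + 1
            else:
                new_row[attr] = plm_new_value(attr, domain)  # categorical: ask the PLM
        dirty.append(new_row)

    # 3) Drop any tuple of c' that also appears in the clean table c.
    clean = {tuple(sorted(row.items())) for row in table}
    return [row for row in dirty if tuple(sorted(row.items())) not in clean]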
3. Experiments and Conclusions

We organize our evaluation around two main questions. First, does Tenet automatically generate training data of quality comparable to data manually created by human annotators? Second, what are the costs of Tenet, in terms of execution time and budget for external APIs?

Train Datasets. In this paper we present results for a dataset from the TNLI literature: Feverous [8]. Results for other datasets are presented in the full paper [14]. Feverous comes with one subset (split) of examples for training and one for testing. Every annotated example consists of a table, a textual hypothesis, data evidence (a subset of the table), and a Supports/Refutes label. All examples are manually written by humans. As a baseline, we extend the original training dataset with an augmentation for text [17]. Given an example, we produce seven new versions of it by changing the textual hypothesis using back translation, wordnet, word2vec, synonyms, random word swap, random word deletion, and random word insertion (Aug). We also produce training datasets with our techniques. Given a corpus of tables, we always generate the Tenet Cold (TenetC) dataset. Since Feverous has annotations for data evidence, we can also generate the dataset for Tenet Warm (TenetW). Hypotheses are created with s-queries and negative examples are generated according to the presented technique. For each given table, we produce three Supports and three Refutes hypotheses, therefore all Tenet datasets are balanced in terms of labels. For every table, Tenet creates one example with a surface s-query and two with s-queries among the other four types (Comparison, Filter, Aggregate, FilterAggregate).

Inference Models for TNLI. Our goal is to show the quality of automatically generated training data. We therefore do not propose new TNLI models and adopt the ones in the original papers. For Feverous, the inference predictor is a RoBERTa (large) encoder fine-tuned for classification on multiple NLI datasets [18].

Pre-trained Language Models. For the hypothesis generation and the error injection, we assume that a pre-trained language model (PLM) is available. We tested several PLMs and use ChatGPT as default. We report a comparison of T5, fine-tuned on ToTTo, and ChatGPT in the full paper [14].

Metrics. We report accuracy for the TNLI task: how many Supports/Refutes classification decisions are correct over the total number of tests. We also report execution times and costs (for external APIs) of running the models.

Quality of Training Examples. We start by comparing results obtained with training data whose examples are generated from the same sets of tables. The tables are taken from the Feverous dataset. As state-of-the-art solutions, we directly use the manually written examples (Human), possibly augmenting them (Human+Aug). For the Tenet methods, we take the corresponding tables of the original training data and generate examples with TenetC and TenetW. For every experiment, we increase the number of input tables, collect or generate the examples, and run the inference model to compute the accuracy on the same test data. The TNLI accuracy results in Figure 3a for the Feverous test data show the impact of the examples, which is a proxy for their quality. Up to 700 input tables, both Tenet-generated datasets outperform the examples written by humans, by more than 20 absolute points in the cases with fewer than 150 tables. Even with only 200 tables available for the training step, both Tenet example generation methods achieve an accuracy over 0.8 on the (manually crafted) original test data. If we augment the Human examples with those generated by TenetW, we observe an accuracy of 0.8 even with only 150 tables in the training corpus. Tenet benefits from the fact that, for every input table, it extracts one data evidence and generates three Supports and three Refutes examples, while the humans wrote one example per table. Figure 3b reports the results for training done with a combination of Human and Tenet examples for Feverous. We report the impact of different numbers of generated examples.

Figure 3: (a) Inference accuracy for different training datasets over the Feverous test data. The x axis is the number of tables in the training set. Human is the Feverous original training data. (b) Inference accuracy on Feverous when training with the union of human examples (100 to 400) and Tenet generated examples (0 to 1000). The first bar is for Human examples only, the other bars are for Human+Tenet examples.

Increasing the size of the generated training data increases the accuracy on the test set. The benefit of Tenet examples is higher with smaller numbers of human training examples.

Execution Time and Cost. We measure the Tenet execution time to generate training data. We create five samples of 200 tables from Feverous and execute the full pipeline with the Cold and Warm approaches. On average, Cold takes 2.019 seconds and Warm 2.212 seconds per generated example. The most expensive step in our approach (97% of the execution time) is text generation. This heavily depends on the availability of ChatGPT and takes on average from 1.5 to 2.2 seconds per request.

Table 3: Costs of generating hypotheses with ChatGPT.

        # Tables  # Positives  # Negatives  Total #  Price ($)
Warm    200       1670         1536         3206     11.6
Cold    200       1655         1580         3245     11.7

Table 3 reports the costs of generating hypotheses with the OpenAI API and ChatGPT for 200 tables. The cost depends linearly on the number of generated examples, as ChatGPT prices requests based on the size of the input prompt together with the size of the generated output. On average, the generation of one example costs $0.0037. The total cost of all the experiments reported in the full paper [14] is about $130 for 36K generated examples.
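As a rough sanity check of these figures, the back-of-the-envelope arithmetic below (our own, not from the paper) derives the per-example cost from Table 3 and relates it to the overall budget.

# Rough sanity check of the reported API costs, using only the Table 3 figures.
warm_cost, warm_examples = 11.6, 3206   # Table 3, Warm row
cold_cost, cold_examples = 11.7, 3245   # Table 3, Cold row

per_example = (warm_cost + cold_cost) / (warm_examples + cold_examples)
print(f"implied cost per example: ${per_example:.4f}")                 # ~ $0.0036
print(f"estimated cost of 36K examples: ${36_000 * per_example:.0f}")  # ~ $130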
Conclusions. We presented a generic solution that automatically constructs high-quality annotated datasets for TNLI. Experiments show that, given only a table as input, Tenet creates examples that lead to high accuracy when used as training data in the target tasks. Even in settings with a small number of tables for training the system, Tenet produces examples with variety both in the patterns of the data and in the reasoning used to verify or refute the hypothesis.

References

[1] G. Karagiannis, M. Saeed, P. Papotti, I. Trummer, Scrutinizer: A mixed-initiative approach to large-scale, data-driven claim verification, Proc. VLDB Endow. 13 (2020) 2508–2521.
[2] Y. Wu, P. K. Agarwal, C. Li, J. Yang, C. Yu, Computational fact checking through query perturbations, ACM Trans. Database Syst. 42 (2017) 4:1–4:41.
[3] P. Nakov, D. P. A. Corney, M. Hasanain, F. Alam, T. Elsayed, A. Barrón-Cedeño, P. Papotti, S. Shaar, G. D. S. Martino, Automated fact-checking for assisting human fact-checkers, in: IJCAI, ijcai.org, 2021, pp. 4551–4558. URL: https://doi.org/10.24963/ijcai.2021/619. doi:10.24963/ijcai.2021/619.
[4] J. Herzig, P. K. Nowak, T. Müller, F. Piccinno, J. Eisenschlos, TaPas: Weakly supervised table parsing via pre-training, in: ACL, Association for Computational Linguistics, 2020, pp. 4320–4333. URL: https://aclanthology.org/2020.acl-main.398. doi:10.18653/v1/2020.acl-main.398.
[5] V. Gupta, M. Mehta, P. Nokhiz, V. Srikumar, INFOTABS: Inference on tables as semi-structured data, in: ACL, ACL, Online, 2020, pp. 2309–2324.
[6] E. Veltri, D. Santoro, G. Badaro, M. Saeed, P. Papotti, Pythia: Unsupervised generation of ambiguous textual claims from relational data, 2022, pp. 2409–2412. doi:10.1145/3514221.3520164.
[7] E. Veltri, G. Badaro, M. Saeed, P. Papotti, Data ambiguity profiling for the generation of training examples, in: 39th IEEE International Conference on Data Engineering, ICDE 2023, Anaheim, CA, USA, April 3-7, 2023, IEEE, 2023, pp. 450–463. URL: https://doi.org/10.1109/ICDE55515.2023.00041. doi:10.1109/ICDE55515.2023.00041.
[8] R. Aly, Z. Guo, M. S. Schlichtkrull, J. Thorne, A. Vlachos, C. Christodoulopoulos, O. Cocarascu, A. Mittal, FEVEROUS: Fact extraction and VERification over unstructured and structured information, in: NeurIPS (Datasets and Benchmarks), 2021.
[9] W. Chen, H. Wang, J. Chen, Y. Zhang, H. Wang, S. Li, X. Zhou, W. Y. Wang, Tabfact: A large-scale dataset for table-based fact verification, in: ICLR, 2020.
[10] P. Rajpurkar, R. Jia, P. Liang, Know what you don't know: Unanswerable questions for SQuAD, in: ACL, Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 784–789. URL: https://aclanthology.org/P18-2124. doi:10.18653/v1/P18-2124.
[11] A. P. Parikh, X. Wang, S. Gehrmann, M. Faruqui, B. Dhingra, D. Yang, D. Das, Totto: A controlled table-to-text generation dataset, in: EMNLP, ACL, 2020, pp. 1173–1186.
[12] V. Gupta, R. A. Bhat, A. Ghosal, M. Shrivastava, M. K. Singh, V. Srikumar, Is my model using the right evidence? Systematic probes for examining evidence-based tabular reasoning, Trans. Assoc. Comput. Linguistics 10 (2022) 659–679.
[13] M. Bayer, M.-A. Kaufhold, C. Reuter, A survey on data augmentation for text classification, ACM Computing Surveys (2022).
[14] J.-F. Bussotti, E. Veltri, D. Santoro, P. Papotti, Generation of training examples for tabular natural language inference, Proc. ACM Manag. Data 1 (2023). URL: https://doi.org/10.1145/3626730. doi:10.1145/3626730.
[15] I. Trummer, From BERT to GPT-3 codex: Harnessing the potential of very large language models for data management, Proc. VLDB Endow. 15 (2022) 3770–3773.
[16] G. Mecca, D. Santoro, N. Sileno, E. Veltri, Diogene-ct: tools and methodologies for teaching and learning coding, International Journal of Educational Technology in Higher Education 18 (2021). doi:10.1186/s41239-021-00246-1.
[17] J. Eisenschlos, S. Krichene, T. Müller, Understanding tables with intermediate pre-training, in: EMNLP, 2020, pp. 281–296.
[18] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).