<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>SEBD</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Framework for the Generation of Training Examples from Tabular Data</article-title>
        <subtitle>(Discussion Paper)</subtitle>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jean-Flavien Bussotti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Papotti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Donatello Santoro</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enzo Veltri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>EURECOM</institution>
          ,
          <addr-line>Biot</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università degli Studi della Basilicata (UNIBAS)</institution>
          ,
          <addr-line>Potenza</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>32</volume>
      <fpage>23</fpage>
      <lpage>26</lpage>
      <abstract>
        <p>Tabular data is becoming increasingly important in Tabular Natural Language Inference (TNLI), where the goal is to assess whether a table supports or refutes a given hypothesis expressed in NL text. A major issue in TNLI is the lack of training data. Existing approaches are based on manual annotation of new training data or on simple augmentation techniques that lack data variety and complexity. We present a system, Tenet, that automatically generates new training examples for TNLI applications on different domains. Our framework exploits SQL queries to introduce new data variety through evidence-queries, which identify new cell values over the data exploiting different data patterns, and complexity through semantic-queries, which describe the different ways such data can be identified with SQL queries. Descriptions from the semantic-queries are used to verbalize the new cell values from the evidence-queries using a Pre-trained Language Model (PLM). The verbalized sentence and the cell values can be used as a new training example in the target TNLI application. We show how Tenet generates human-like examples that are comparable with manually-written examples.</p>
      </abstract>
      <kwd-group>
        <kwd>Tabular Natural Language Inference (TNLI)</kwd>
        <kwd>Natural Language Processing (NLP) for Databases</kwd>
        <kwd>Text Generation</kwd>
        <kwd>Query Generation</kwd>
        <kwd>Data Augmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        A large class of natural language inference (NLI) problems aims at classifying a given hypothesis,
such as a textual statement, as true/false/unknown given some evidence. Recently, a new class
of applications has emerged that focuses on inference with structured data as evidence,
i.e., tabular natural language inference (TNLI). Example applications are table understanding
and computational fact checking, where systems label text claims according to input structured
data [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6 ref7">1, 2, 3, 4, 5, 6, 7</xref>
        ].
      </p>
      <p>
        Most of the solutions in TNLI are supervised, where manually defined datasets for TNLI have
been proposed, such as Feverous [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], TabFact [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and Infotabs [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, these datasets:
1) cover only some generic topics from Wikipedia tables. For example, if there is a need for
      </p>
      <p>[Figure 1: Overview of Tenet. An example table with tuples t1 (Barack, Dem, Michelle),
t2 (Donald, Rep, Melania), t3 (Nancy, Dem, Paul) over attributes Name, Party, Spouse yields
training data such as claim c: “Donald is married to Michelle” with label l: Refutes and
evidence cells E: t2.Name “Donald”, t2.Spouse “Melania”; and claim c: “Barack and Nancy are
in the same party” with label l: Supports and evidence cells E: t1.Name “Barack”, t3.Name
“Nancy”, t1.Party “Dem”, t3.Party “Dem”. The generated examples train the TNLI application,
which is validated on test data.]</p>
      <p>
        fact-checking claims for emerging domains such as Covid-19, a new annotated corpus must
be crafted by manually writing examples using the tabular reports published by governments;
2) they are not comparable in scale and variety to those available for textual NLI [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. For
example, about 80% of the examples in Totto [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] have sentences describing the data with text
that does not contain mathematical expressions, such as max, min, and count, or comparisons
across values; 3) they contain bias and errors that may lead to incorrect learning in the target
models [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        The problem of the lack of labeled examples has been treated in the literature for NLI, but it
has not been tackled yet for TNLI. If some examples are given in a warm start setting, existing
NLI augmentation methods can be used in the TNLI setting: the text part of the example can
be rewritten with augmentation w.r.t. the (fixed) data [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. While these methods increase the
number of examples, they do not generate a new corpus that raises the variety and complexity
of the examples w.r.t. the structured data, and thus they have only a minor impact on the accuracy of
the TNLI tasks. Moreover, in a cold start setting, where training data is unavailable, there is no
proposal yet on creating annotated examples for TNLI starting only from the tables.
      </p>
      <p>
        User-provided tables can be exploited to generate ad-hoc training data for the application at
hand. Our system, Tenet1 (TExtual traiNing Examples from daTa) generates large annotated
corpora of training examples that are complex and rich in terms of data patterns, linguistic
diversity, and reasoning complexity [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Figure 1 shows an overview of our architecture. The
system generates training data for the target TNLI application, given only a table as input. Once
generated, the examples are used to train the inference model validated on test data.
      </p>
      <p>Tenet is built around three modules that cover the three main elements of a complete and
annotated TNLI example.</p>
      <p>
        Data Evidence. A key intuition in our approach is that tabular data already contains rich
information for new examples. Content changes across datasets, and every relation has its own
active domain. Moreover, data relationships across entities and their properties are arranged
differently across datasets. To identify data evidence to create a variety of examples, we propose
alternative approaches to select sets of cells from the given table, including a query generation
algorithm for the semi-supervised case. A query returns a set of evidence, such as Donald and
Michelle in the first example in Figure 1, each partially describing an example.
1Code and datasets available at https://github.com/dbunibas/tenet
Textual Hypothesis. Once the data is identified, we obtain the textual statement (or hypothesis)
for the annotated example. Given a set of cells, we generate queries that identify such data
evidence over the input table. Every query characterizes the data with different conditions (e.g.,
selections with constants) or constructs (e.g., aggregates). From the query and the evidence,
we create a text with a prompting method that exploits the human-like generation abilities
of large pre-trained language models (PLMs), such as GPT-3 [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Our prompting leads to a
variety of factual hypotheses, such as Barack and Nancy are in the same party in the second
example in Figure 1, while maximizing the coverage of the provided evidence and minimizing
hallucination.
      </p>
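      <p>As an illustration of the evidence selection, the same-party pattern behind the second example in Figure 1 can be expressed as a SQL query; a minimal sketch with SQLite (table name and values taken from Figure 1, the query itself is ours):</p>
      <preformat>
```python
import sqlite3

# Build the example table of Figure 1 in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (Name TEXT, Party TEXT, Spouse TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Barack", "Dem", "Michelle"),
     ("Donald", "Rep", "Melania"),
     ("Nancy", "Dem", "Paul")],
)

# An evidence query: pairs of distinct people in the same party.
# Each result row is one set of evidence cells for a new example.
evidence = conn.execute(
    """SELECT p1.Name, p2.Name, p1.Party
       FROM people p1, people p2
       WHERE p1.Party = p2.Party AND p1.Name > p2.Name"""
).fetchall()

print(evidence)  # [('Nancy', 'Barack', 'Dem')]
```
      </preformat>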
      <p>Inference Label. Finally, we need the corresponding label for every example. While Supports
examples are obtained naturally, as the hypothesis reflects the evidence from the table, for
Refutes examples we introduce generic methods built around the idea of injecting errors in the
data evidence. Once the data is modified, the process for text generation is applied to the “dirty”
data to obtain hypotheses that are refuted w.r.t. the original “clean” data.</p>
      <p>In the next section we describe the main components; then we present some experimental
results obtained using Tenet.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Overview of the Solution</title>
      <p>Problem Formulation. Let t be a tuple in the instance I for a relational schema S and A an
attribute in S. We refer with cell value to the value of tuple t in attribute A and with table to
the instance I for simplicity2. A textual hypothesis is a sentence in natural language.</p>
      <p>
        A Tabular Natural Language Inference (TNLI) application takes as input a pair (table T, textual
hypothesis h) and outputs whether h is supported or refuted by T. Data evidence is a non-empty subset
of cell values from T that varies from a small fraction in some settings [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to the entire relation
in others [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]3. Solutions for the TNLI task rely on supervised models trained with annotated
examples - our goal is to reduce the effort in creating such training data.
      </p>
      <p>
        We consider solving the example generation problem for a TNLI application  where we are
given the label space  for , a corpus of tables , and (optionally) a set of training examples
 for . Every example is composed by a quadruple (ℎ, , , ) with textual hypothesis ℎ, label
 ∈ , set of data evidence cells  contained in one relational table  in the corpus . We assume
access to a text-to-text pre-trained language model (PLM)  . We do not assume access to the
TNLI application  at hand. In this work, we focus on  with Supports and Refutes labels only,
as those are the most popular in TNLI corpora, e.g., 97% of the examples [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>In the warm start version of the problem, training examples for 𝒜 are available and used by
Tenet. In the cold start version of the problem, we drop the assumption on the availability of
the examples X. In this case, we aim at creating new training examples X for 𝒜 just by using
the tables in C.
2Some TNLI corpora contain both relational and entity tables, i.e., relational tables transposed with a single row.
Tenet supports both, but we focus the presentation on relational ones for clarity.
3Our proposal is independent of the size of the data evidence and its retrieval.</p>
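      <p>A complete example can be sketched as a simple container; the following is an illustrative layout (the field names and the dict-based evidence encoding are ours, not Tenet's API):</p>
      <preformat>
```python
from dataclasses import dataclass

# Illustrative container for one training example; the field names are
# ours and mirror the quadruple (hypothesis, label, evidence, table).
@dataclass
class TNLIExample:
    hypothesis: str   # textual hypothesis
    label: str        # one of "Supports" / "Refutes"
    evidence: dict    # cell values: (tuple id, attribute) -> value
    table: str        # identifier of the source table in the corpus

ex = TNLIExample(
    hypothesis="Donald is married to Michelle",
    label="Refutes",
    evidence={("t2", "Name"): "Donald", ("t2", "Spouse"): "Melania"},
    table="people",
)
print(ex.label)  # Refutes
```
      </preformat>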
      <p>Process and Challenges. Tenet is designed around three main steps, as depicted in Figure 2.
Given a relational table T ∈ C, it first gathers the evidence (set of cells) E to produce a Supports
example. Second, to enable the generation of a Refutes example, it injects errors in table T
to create its noisy version and derive data evidence E′. Third, a textual claim (hypothesis)
h is generated for every data evidence. The quadruple (data evidence E, textual claim h,
label Supports/Refutes, table T) is a complete example for the training data X for the target TNLI
application 𝒜. However, the three steps come with their own challenges.</p>
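      <p>The three steps above can be sketched as a skeleton (function names, placeholder logic, and the dict-based table layout are ours, not Tenet's implementation):</p>
      <preformat>
```python
import random

# Skeleton of the three-step process with placeholder implementations.
def gather_evidence(table):
    # placeholder evidence selection: take the first two cells
    return dict(list(table.items())[:2])

def inject_errors(table, seed=0):
    # placeholder error injection: shuffle the values across cells
    keys, vals = list(table), list(table.values())
    random.Random(seed).shuffle(vals)
    return dict(zip(keys, vals))

def generate_examples(table, verbalize):
    supports_ev = gather_evidence(table)   # step 1: evidence for Supports
    noisy = inject_errors(table)           # step 2: noisy copy of the table
    refutes_ev = gather_evidence(noisy)    #         evidence for Refutes
    return [                               # step 3: verbalize each evidence
        (supports_ev, verbalize(supports_ev), "Supports", table),
        (refutes_ev, verbalize(refutes_ev), "Refutes", table),
    ]

table = {("t1", "Name"): "Mike", ("t1", "Age"): 47,
         ("t2", "Name"): "Anne", ("t2", "Age"): 22}
examples = generate_examples(table, verbalize=str)
print([ex[2] for ex in examples])  # ['Supports', 'Refutes']
```
      </preformat>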
      <p>Data Evidence. Training examples X must capture the variety of relationships in a table, such
as those relating cell values in the same tuple or attribute. A hypothesis is defined over a group
of cell values, such as the data evidence E1 highlighted in bold in Table 1 for tuples t1 and t2, i.e.,
the names of two people with different age values. The hypothesis “Mike is older than Anne” captures
the relationship across these four cell values. Data evidence with two cell values, e.g., Name
for tuple t1 and Age for tuple t2, can lead to a hypothesis, e.g., “There is a person called Mike
and a person 22 years old”, but such a sentence does not capture relationships across tuples or
attributes. In general, for effective training, the data evidence covered by the examples should
cover the variety of patterns that can be identified in a relation.</p>
      <p>One approach for the data evidence generation is to pick different sets of cell values at
random. While this simple approach is effective and enables an unsupervised solution, there
are meaningful patterns, such as E1, that may be covered only rarely by accident. We call this
approach cold-start. One approach to improve this task and obtain meaningful patterns with
fewer generated examples is to infer data patterns from human-provided examples X, when
those are available. For example, from X we identify a query q (named evidence query or simply
e-query) that returns the cell values in its data evidence as one result row. We then execute
the e-query over the relation. The e-query leads to more sets of cells (one per result row) that
enable the generation of examples following the same data pattern, for example involving t3
and t4. We call this approach warm-start.</p>
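      <p>The cold-start selection can be sketched as random cell sampling (a minimal sketch; Tenet's actual sampling strategy may differ):</p>
      <preformat>
```python
import random

# Cold-start evidence selection: pick a random set of cells from the table.
def random_evidence(table, n_cells, seed=None):
    """table: dict mapping (row id, attribute) -> value."""
    rng = random.Random(seed)
    cells = rng.sample(sorted(table), k=n_cells)
    return {cell: table[cell] for cell in cells}

table = {("t1", "Name"): "Mike", ("t1", "Age"): 47,
         ("t2", "Name"): "Anne", ("t2", "Age"): 22}
ev = random_evidence(table, n_cells=2, seed=0)
print(len(ev))  # 2
```
      </preformat>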
      <p>Warm Start. While the cold-start is easy to implement, the generation of the e-query in the
warm start is not trivial. Given the set of cell values E and the table T as input, we want to identify
a query q that outputs E among its results. Executing such a query over the original table
T, we obtain more data evidence E1, . . . , En that follow the original data pattern in E.</p>
      <p>Consider again the example in Table 1 with the cell values in bold in the first two rows (t1 and
t2) as seed data evidence E. Given such input, we want an algorithm producing a query that
returns all pairs of distinct names with their different ages, such as
q:</p>
      <p>SELECT c1.Name, c2.Name AS Name2, c1.Age, c2.Age AS Age2
FROM people c1, people c2
WHERE c1.Age &gt; c2.Age AND c1.Name &lt;&gt; c2.Name</p>
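      <p>Executing q over a small instance of the people table (only the two example rows; the full Table 1 is not shown here) illustrates how each result row yields a new data evidence; a minimal sketch with SQLite, writing the inequality operator as != :</p>
      <preformat>
```python
import sqlite3

# A small instance of the people table from the running example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (Name TEXT, Age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Mike", 47), ("Anne", 22)])

# The e-query q (using != as the inequality operator).
rows = conn.execute(
    """SELECT c1.Name, c2.Name AS Name2, c1.Age, c2.Age AS Age2
       FROM people c1, people c2
       WHERE c1.Age > c2.Age AND c1.Name != c2.Name"""
).fetchall()

# Each result row is one new data evidence following the seed pattern.
print(rows)  # [('Mike', 'Anne', 47, 22)]
```
      </preformat>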
      <p>
        The e-query generation is based on an evidence graph [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] where each node in the graph
corresponds to a cell in the evidence E and a (directed) edge across two nodes represents the
relationship between their values (equality, difference, comparison). Then, visiting such a graph,
we construct the e-query [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
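      <p>A minimal sketch of the evidence graph construction follows (the node/edge encoding is our assumption; see [14] for the actual definition):</p>
      <preformat>
```python
from itertools import combinations

# Sketch: nodes are evidence cells, edges record the relationship
# between the cell values (equality, difference, comparison).
def evidence_graph(evidence):
    """evidence: dict mapping (row id, attribute) -> value."""
    edges = []
    for (c1, v1), (c2, v2) in combinations(evidence.items(), 2):
        if v1 == v2:
            edges.append((c1, "=", c2))
        elif isinstance(v1, (int, float)) and isinstance(v2, (int, float)):
            # orient comparison edges from the larger to the smaller value
            edges.append((c1, ">", c2) if v1 > v2 else (c2, ">", c1))
        else:
            edges.append((c1, "!=", c2))
    return edges

seed = {("t1", "Name"): "Mike", ("t1", "Age"): 47,
        ("t2", "Name"): "Anne", ("t2", "Age"): 22}
edges = evidence_graph(seed)
# the Age/Age edge yields the comparison predicate of the e-query q
print((("t1", "Age"), ">", ("t2", "Age")) in edges)  # True
```
      </preformat>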
      <p>
        Hypothesis. Given a table T and an evidence set E from T, the latter can be described with a
textual sentence. However, the way a set of cells is converted to a sentence has a huge impact on
the variety and the reasoning complexity of the training data. Indeed, given a set of cells from a
table, many alternatives exist for describing it in natural language. Consider again data evidence
E1 in the example. The values in bold can be correctly described with “Mike is older than Anne.”
or “There are two persons with age higher than 19.”. The more alternative sentences for a
given data evidence are created, the better the training set for the target model. Unfortunately,
most efforts for automatic data-to-text are focused on surface, or look-up, sentences [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], such
as “Mike is 47 years old and Anne 22.”. While these kinds of sentences are fundamental, we
aim to maximize the variety in the training data. For this goal, we generate various queries
that return the evidence E given T. We call such queries semantic queries or simply s-queries. Such
s-queries represent different ways of semantically describing the data. PLMs are trained over
huge amounts of textual data, which gives them proficiency in writing, and over source code, which
gives them the ability to write code [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] or to be instructed with functions. We then propose
prompting methods for PLMs to generate alternative sentences to describe the evidence set
according to the semantics of the queries.
      </p>
      <p>
        We identify several types of s-queries: 1) surface s-queries, i.e., queries that select cells by
using only constant values; 2) comparison s-queries, i.e., queries that compare two or more rows
on at least one attribute; 3) filter s-queries, i.e., queries that select cells according to a condition;
4) aggregate s-queries, i.e., queries that select cells that can be used with an aggregate function
(count, sum, avg, min, max); 5) filter-aggregate s-queries, i.e., queries that select cells for an
aggregation over a group of cells identified by a selection on some conditions. Such s-queries
are automatically detected by Tenet [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
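      <p>For the running people(Name, Age) table, one s-query per type could look as follows (the concrete queries are our illustrations; Tenet detects s-queries automatically [14]):</p>
      <preformat>
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (Name TEXT, Age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Mike", 47), ("Anne", 22)])

# One illustrative s-query per type over the same evidence cells.
s_queries = {
    "surface":          "SELECT Name, Age FROM people WHERE Name = 'Mike'",
    "comparison":       "SELECT c1.Name, c2.Name FROM people c1, people c2 "
                        "WHERE c1.Age > c2.Age",
    "filter":           "SELECT Name FROM people WHERE Age > 19",
    "aggregate":        "SELECT MAX(Age) FROM people",
    "filter-aggregate": "SELECT COUNT(*) FROM people WHERE Age > 19",
}
results = {kind: conn.execute(q).fetchall() for kind, q in s_queries.items()}
print(results["aggregate"])  # [(47,)]
```
      </preformat>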
      <p>
        For each s-query, we define a task that describes the text generation function that we want to
use. Such generation functions are defined by us with the prompts for the PLM. The task uses
the function from the s-query and the evidence. The text generation functions mapped to the
corresponding s-queries are reported in Table 2 with examples of the text they generate. Due to space
limits, the examples of the prompts used with ChatGPT are reported in the full paper [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
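      <p>A sketch of how a task and an evidence set could be turned into a prompt (the wording below is ours; the actual prompts used with ChatGPT are reported in the full paper [14]):</p>
      <preformat>
```python
# Sketch of a verbalization prompt; asking for every cell and nothing
# else reflects the coverage and anti-hallucination goals.
def build_prompt(task, evidence):
    cells = "; ".join(f"{row}.{attr} = {val!r}"
                      for (row, attr), val in evidence.items())
    return (f"Write one factual sentence that {task}, using every cell "
            f"below and nothing else:\n{cells}")

prompt = build_prompt(
    "compares the two rows by Age",
    {("t1", "Name"): "Mike", ("t1", "Age"): 47,
     ("t2", "Name"): "Anne", ("t2", "Age"): 22},
)
print("Mike" in prompt and "22" in prompt)  # True
```
      </preformat>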
      <p>Label. By construction, the generated data evidence is coherent with the semantics expressed
in the input table. An evidence set leads to an example with a Supports label w.r.t. the data
in the table. The methods above thus produce Supports examples. However, applications also need
examples with a Refutes label, i.e., textual claims not supported by the input table. We tackle
this problem with an error injection approach, perturbing the input table to break the original
relationships across cell values. This new version of the table is then used to identify again
an evidence set E′, which leads to a textual hypothesis that does not reflect the semantics of
the original (clean) table. We generate a Refutes example for every Supports one. Given some
evidence E from the original input table T, we inject noise in a copy T′, so that we derive a
new evidence E′ using the same e-query used for the Supports example. A hypothesis h′ is then
derived from E′ using the same approach proposed above. Hypothesis h′ is a Supports sentence
for T′, with evidence E′, but it is also a Refutes sentence w.r.t. the original (clean) table T and
evidence E. The new example is the quadruple with the label Refutes, table T, hypothesis h′, and evidence E.</p>
      <p>To inject errors, we first create a copy T′ of the table and manipulate it to inject noise. We
shuffle in T′ the values for 50% of the attributes involved in E. This step breaks the original
relationships across cell values at the tuple level. We then either introduce a new tuple in T′ or
remove from T′ one tuple at random. This step changes the cardinality of the tuples, which is
key for s-queries involving aggregates, and introduces out-of-domain values. The generation of
the new values depends on the attribute type. For categorical attributes, we use a PLM. For numerical
attributes, we generate lower/higher values than the min/max value of every active domain;
these new values break the original min/max/avg property for the updated attribute. Finally,
we remove from T′ any row that appears in T.</p>
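      <p>The injection steps can be sketched as follows (a simplified sketch: tables as lists of dicts, a single shuffled attribute, and no PLM-based value generation):</p>
      <preformat>
```python
import random

# Simplified sketch of the error-injection steps described above.
def inject_errors(table, evidence_attrs, seed=0):
    rng = random.Random(seed)
    noisy = [dict(row) for row in table]
    # shuffle the values of 50% of the attributes involved in the evidence
    for attr in evidence_attrs[: max(1, len(evidence_attrs) // 2)]:
        vals = [row[attr] for row in noisy]
        rng.shuffle(vals)
        for row, v in zip(noisy, vals):
            row[attr] = v
    # change the cardinality of the tuples: drop one tuple at random
    noisy.pop(rng.randrange(len(noisy)))
    # keep only rows that no longer appear in the original table
    return [row for row in noisy if row not in table]

table = [{"Name": "Mike", "Age": 47}, {"Name": "Anne", "Age": 22},
         {"Name": "Paul", "Age": 30}]
noisy = inject_errors(table, ["Age"])
print(all(row not in table for row in noisy))  # True
```
      </preformat>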
    </sec>
    <sec id="sec-3">
      <title>3. Experiments and Conclusions</title>
      <p>
        We organize our evaluation around two main questions. First, does Tenet automatically
generate training data of quality comparable to that manually created by human annotators?
Second, what are the costs of Tenet, in terms of execution time and budget for external APIs?
Training Datasets. In this paper we present results for a dataset from the TNLI literature: Feverous [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
Results for other datasets are presented in the full paper [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Feverous comes with one subset
(split) of examples for training and one for test. Every annotated example consists of a table,
a textual hypothesis, data evidence (a subset of the table), and a Supports/Refutes label. All
examples are manually written by humans.
      </p>
      <p>As a baseline, we extend the original training dataset with an augmentation for text [17].
Given an example, we produce seven new versions of it by changing the textual hypothesis using
back translation, wordnet, word2vec, synonyms, random word swap, random word deletion,
random word insertion (Aug).</p>
      <p>We also produce training datasets with our techniques. Given a corpus of tables, we always
generate the Tenet Cold (TenetC) dataset. Since Feverous has annotations for data evidence,
we can also generate the dataset for Tenet Warm (TenetW). Hypotheses are created with
s-queries and negative examples are generated according to the presented technique. For each given
table, we produce three Supports and three Refutes hypotheses; therefore, all Tenet datasets
are balanced in terms of labels. For every table, Tenet creates one example with a surface
query and two with s-queries among the other four types (Comparison, Filter, Aggregate,
FilterAggregate).</p>
      <p>Inference Models for TNLI. Our goal is to show the quality of automatically generated
training data. We therefore do not propose new TNLI models and adopt the ones in the original
papers. For Feverous the inference predictor is a RoBERTa (large) encoder fine-tuned for
classification on multiple NLI datasets [18].</p>
      <p>
        Pre-trained Language Models. For the hypothesis generation and the error injection, we
assume that a pre-trained language model (PLM) is available. We tested several PLMs and use
ChatGPT as default. We report a comparison of T5, fine-tuned on ToTTo, and ChatGPT in the
full paper [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>Metrics. We report accuracy for the TNLI task: how many Supports/Refutes classification
decisions are correct over the total number of tests. We also report execution times and cost
(for external APIs) in running the models.</p>
      <p>Quality of Training Examples. We start by comparing results for training data with examples
generated from the same sets of tables. The tables are taken from the Feverous dataset. As
state-of-the-art solutions, we directly use the manually written examples (Human), possibly
augmenting them (Human+Aug). For the Tenet methods, we take the corresponding tables of the
original training data and generate examples with TenetC and TenetW. For every experiment,
we increase the number of input tables, collect or generate the examples, and run the inference
model to compute the accuracy on the same test data.</p>
      <p>The TNLI accuracy results in Figure 3a for the Feverous test data show the impact of the
examples, which is a proxy for their quality. Up to 700 input tables, both Tenet-generated
datasets outperform the examples written by humans, by more than 20 absolute points in
cases with fewer than 150 tables. Even with only 200 tables available for the training step, both
Tenet example generation methods achieve an accuracy over 0.8 on the (manually crafted)
original test data. If we augment the Human examples with those generated by TenetW, we
observe an accuracy of 0.8 even with only 150 tables in the training corpus. Tenet benefits from
the fact that for every input table it extracts one data evidence and generates three Supports and
three Refutes examples, while the humans wrote one example per table.</p>
      <p>Figure 3b reports the results for the training done with a combination of Human and Tenet
examples for Feverous. We report the impact of different numbers of generated examples.
Increasing the size of the generated training data increases the accuracy on the test set. The
benefit of Tenet examples is higher with smaller numbers of human training examples.
Execution Time and Cost. We measure Tenet execution time to generate training data. We
create five samples of 200 tables from Feverous and execute the full pipeline with the Cold and
Warm approaches. On average, Cold takes 2.019 seconds while Warm takes 2.212 seconds. The most
expensive step in our approach (97% of the execution time) is due to text generation. This
heavily depends on the ChatGPT availability and takes on average from 1.5 to 2.2 seconds per
request.</p>
      <p>and learning coding, International Journal of Educational Technology in Higher Education
18 (2021). doi:10.1186/s41239-021-00246-1.
[17] J. Eisenschlos, S. Krichene, T. Müller, Understanding tables with intermediate pre-training,
in: EMNLP, 2020, pp. 281-296.
[18] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer,
V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint
arXiv:1907.11692 (2019).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Karagiannis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saeed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Trummer</surname>
          </string-name>
          ,
          <article-title>Scrutinizer: A mixed-initiative approach to large-scale, data-driven claim verification</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>13</volume>
          (
          <year>2020</year>
          )
          <fpage>2508</fpage>
          -
          <lpage>2521</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Computational fact checking through query perturbations</article-title>
          ,
          <source>ACM Trans. Database Syst</source>
          .
          <volume>42</volume>
          (
          <year>2017</year>
          ) 4:1-4:41.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P. A.</given-names>
            <surname>Corney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D. S.</given-names>
            <surname>Martino</surname>
          </string-name>
          ,
          <article-title>Automated fact-checking for assisting human fact-checkers, in: IJCAI, ijcai</article-title>
          .org,
          <year>2021</year>
          , pp.
          <fpage>4551</fpage>
          -
          <lpage>4558</lpage>
          . URL: https://doi.org/10.24963/ijcai.2021/619. doi:10.24963/ijcai.2021/619.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Herzig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Nowak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piccinno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Eisenschlos</surname>
          </string-name>
          ,
          <article-title>TaPas: Weakly supervised table parsing via pre-training</article-title>
          , in: ACL, Association for Computational Linguistics,
          <year>2020</year>
          , pp.
          <fpage>4320</fpage>
          -
          <lpage>4333</lpage>
          . URL: https://aclanthology.org/2020.acl-main.398. doi:10.18653/v1/2020.acl-main.398.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nokhiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Srikumar</surname>
          </string-name>
          ,
          <article-title>INFOTABS: Inference on tables as semistructured data</article-title>
          , in: ACL,
          <string-name>
            <surname>ACL</surname>
          </string-name>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>2309</fpage>
          -
          <lpage>2324</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Veltri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Badaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saeed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <article-title>Pythia: Unsupervised generation of ambiguous textual claims from relational data</article-title>
          ,
          <year>2022</year>
          , pp.
          <fpage>2409</fpage>
          -
          <lpage>2412</lpage>
          . doi:10.1145/3514221.3520164.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Veltri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Badaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saeed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <article-title>Data ambiguity profiling for the generation of training examples</article-title>
          ,
          <source>in: 39th IEEE International Conference on Data Engineering, ICDE 2023</source>
          , Anaheim, CA, USA, April 3-7,
          <year>2023</year>
          , IEEE, pp.
          <fpage>450</fpage>
          -
          <lpage>463</lpage>
          . URL: https://doi.org/10.1109/ICDE55515.2023.00041. doi:10.1109/ICDE55515.2023.00041.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Aly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Schlichtkrull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vlachos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Christodoulopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Cocarascu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mittal</surname>
          </string-name>
          ,
          <article-title>FEVEROUS: Fact extraction and VERification over unstructured and structured information</article-title>
          ,
          <source>in: NeurIPS (Datasets and Benchmarks)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>TabFact: A large-scale dataset for table-based fact verification</article-title>
          ,
          <source>in: ICLR</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rajpurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Know what you don't know: Unanswerable questions for SQuAD</article-title>
          , in: ACL, Association for Computational Linguistics, Melbourne, Australia,
          <year>2018</year>
          , pp.
          <fpage>784</fpage>
          -
          <lpage>789</lpage>
          . URL: https://aclanthology.org/P18-2124. doi:10.18653/v1/P18-2124.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gehrmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Faruqui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dhingra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <article-title>ToTTo: A controlled table-to-text generation dataset</article-title>
          , in: EMNLP, Association for Computational Linguistics,
          <year>2020</year>
          , pp.
          <fpage>1173</fpage>
          -
          <lpage>1186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Bhat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghosal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Srikumar</surname>
          </string-name>
          ,
          <article-title>Is my model using the right evidence? Systematic probes for examining evidence-based tabular reasoning</article-title>
          ,
          <source>Trans. Assoc. Comput. Linguistics</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>659</fpage>
          -
          <lpage>679</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bayer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Kaufhold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Reuter</surname>
          </string-name>
          ,
          <article-title>A survey on data augmentation for text classification</article-title>
          ,
          <source>ACM Computing Surveys</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.-F.</given-names>
            <surname>Bussotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Veltri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <article-title>Generation of training examples for tabular natural language inference</article-title>
          ,
          <source>Proc. ACM Manag. Data</source>
          <volume>1</volume>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.1145/3626730. doi:10.1145/3626730.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>I.</given-names>
            <surname>Trummer</surname>
          </string-name>
          ,
          <article-title>From BERT to GPT-3 codex: Harnessing the potential of very large language models for data management</article-title>
          ,
          <source>Proc. VLDB Endow.</source>
          <volume>15</volume>
          (
          <year>2022</year>
          )
          <fpage>3770</fpage>
          -
          <lpage>3773</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>G.</given-names>
            <surname>Mecca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sileno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Veltri</surname>
          </string-name>
          ,
          <article-title>Diogene-ct: tools and methodologies for teaching</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>