<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>IberLEF 2025, Zaragoza, Spain</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>LyS at IberLEF 2025 Task PRESTA: Zero-Shot Code Generation for Spanish Tabular QA</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Roi Santos-Ríos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrián Gude</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francisco Prado-Valiño</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ana Ezquerro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jesús Vilares</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidade da Coruña, CITIC, Departamento de Ciencias de la Computación y Tecnologías de la Información</institution>
          ,
          <addr-line>Campus de Elviña s/n, 15071, A Coruña</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>9</fpage>
      <lpage>0009</lpage>
      <abstract>
        <p>This paper describes our participation in PRESTA, an IberLEF 2025 task focused on Tabular Question Answering. We developed a zero-shot pipeline that leverages a Large Language Model to generate functional code capable of extracting the relevant information from tabular data based on an input question. Our approach consists of a modular pipeline where the main code generator module is supported by additional components that identify the most relevant columns and analyze their data types to improve extraction accuracy. In the event that the generated code fails, an iterative refinement process is triggered, incorporating the error feedback into a new generation prompt to enhance robustness. Our results show that zero-shot code generation is a valid approach for Tabular QA, achieving rank 4 out of 7 in the test phase despite the lack of task-specific fine-tuning.</p>
      </abstract>
      <kwd-group>
        <kwd>Tabular Question Answering</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Zero-Shot</kwd>
        <kwd>Code Generation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Tabular Question Answering (Tabular QA) has huge potential in real-world applications such as financial
analysis, business intelligence, and scientific data exploration, where structured databases serve as
the primary source of information. Unlike traditional text-based Question Answering (QA), which
primarily deals with unstructured data, Tabular QA requires extracting information from structured
tables to be able to answer the input questions, thus involving reasoning about diverse table schemas,
column relationships, and heterogeneous data types.</p>
      <p>
        Complex supervised systems have been proposed to deal with the structured nature of Tabular QA,
either leveraging structured prediction with language representations [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] or by formulating the
task as a sequence-to-sequence problem [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]. However, with the rise of instruction-based Large
Language Models (LLM) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], recent approaches have shifted away from reliance on large annotated
datasets, instead reframing the task as a zero-shot generation problem [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>In this work, we further explore instruction-based LLMs to dynamically generate code functions
capable of retrieving relevant data from tables based on the input question in a zero-shot manner. To
enhance accuracy and reliability, we developed a modular three-stage pipeline that includes: (i) a
column selection mechanism to determine the most relevant columns and their data-type, (ii) a code
generation module responsible for producing executable code and (iii) an iterative error handling
module that, in case the initial code execution fails, tries to fix the generated code accordingly.</p>
      <p>
        Our group tested this approach within IberLEF’s [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] PRESTA task [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which provided a diverse
dataset featuring real-world tabular data. The competition required models to produce answers in
multiple formats, including boolean, categorical, numerical, and list-based outputs. Our model was
designed to generalize across different table structures, making it adaptable to various datasets beyond
the shared task, ensuring robustness and broad applicability. Although our approach demonstrated
strong performance in code generation and execution, subsequent analysis revealed that the model
struggles with columns containing complex data types (lists, dictionaries, etc.) and ambiguous queries,
particularly for list-based responses.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>
        Question Answering (QA) has been gaining significant attention in recent years, driven by the need
for models capable of reasoning over structured data. Early tasks in QA mainly focused on retrieving
information from unstructured text sources [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ], but the increasing availability of structured datasets
has led to new challenges in understanding and querying tabular data. Unlike classic text-based QA,
where answers are retrieved from free-form text, Tabular QA requires a higher level of interpretation
and robustness to map questions to relevant columns and rows, handle missing values, and compute
statistics when necessary.
      </p>
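      <p>As a toy illustration of what this mapping involves, consider answering "What is the maximum age of the women in the survey?" over a small table in Pandas (our own example, not part of any benchmark):</p>

```python
import pandas as pd

# Toy survey table: Tabular QA must map the question onto
# the right columns, filter the relevant rows, and aggregate.
df = pd.DataFrame({
    "sexo": ["mujer", "hombre", "mujer"],
    "edad": [34, 51, 47],
})

# "maximum age of the women" -> filter by sexo, aggregate edad
max_age = int(df.loc[df["sexo"] == "mujer", "edad"].max())
```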
      <p>
        In parallel, several datasets have been introduced to benchmark Tabular QA models, including
WikiTableQuestions [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], SQA [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], and the more recent DataBenchSPA dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which provides
real-world tabular data for evaluating models in different scenarios.
      </p>
      <p>
        Structured Tabular QA Most state-of-the-art approaches for Tabular QA leverage a pretrained
language model —equipped with a specialized encoding module to represent tabular information—
tailored for structured prediction. For example, TaPas [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] feeds both the input question and the flattened
table into BERT [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] as a single sequence, and finetunes the architecture to select relevant columns and
predict an aggregation function. Similarly, TaCube [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] combines a cube constructor with BART [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]
to predict the real answers based on the input question and the results of the cube operations.
      </p>
      <p>
        Generative Tabular QA To address the rigidity of structured approaches, recent works have explored
generative models for program synthesis, where an LLM is finetuned to generate executable programs
or instructions (in the form of SQL queries, for example) to be applied against tabular sources. Zhong
et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] proposed Seq2SQL, a sequence-to-sequence model to translate natural language into SQL
syntax, incorporating query-space pruning to significantly simplify and enhance the generative task.
Later, Yin et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] combined both concepts by optimizing tabular embeddings that fit both generative and
structured purposes.
      </p>
      <p>
        Zero-Shot Code Generation More recently, advancements in code generation have enabled a
paradigm shift in Tabular QA, driven by powerful multipurpose LLMs with strong coding capabilities,
such as Qwen [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and Mistral’s Codestral [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. These models facilitate a zero-shot approach to program
synthesis, eliminating the need for predefined templates or large annotated datasets. Instead, zero-shot
generation allows the system to dynamically adapt to different schemes without explicit prior knowledge
of the table structure [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], thus providing flexibility and scalability.
      </p>
      <p>Despite its potential, zero-shot code generation models still face significant challenges, particularly in
error handling, runtime execution failures, and schema variability. Building on this approach, our
work extends an instruction-based model with error awareness, enabling it to detect and recover from
execution failures in an iterative error-recovery mechanism, where the model dynamically analyzes
execution failures and regenerates code based on error feedback.</p>
    </sec>
    <sec id="sec-3">
      <title>3. System Overview</title>
      <p>
        Our approach for the PRESTA task iterates upon the code generation approaches for Tabular QA, where
the core component is a pretrained LLM responsible for generating executable code to extract the answer
from the tables. To build upon prior works [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we incorporated a module that helps select the
columns relevant to the question, while also identifying the data types of their content. Moreover, we
incorporated an error-fixing module that attempts to catch runtime errors and integrates them as part of
a new prompt, guiding the LLM to refine its code generation.
      </p>
      <p>Figure 1 shows a schematic view of the architecture of our system. We have designed a modular
pipeline that features three main components, which we describe below: (i) a column selector, (ii) an
answer generator and (iii) a code fixer.</p>
      <p>[Figure 1: Overview of our modular pipeline (Column Selector, Answer Generator, Code Fixer), illustrated with the input question "¿Cuál es la edad máxima de las mujeres participantes en la encuesta?" ("What is the maximum age of the women participating in the survey?").]</p>
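      <p>The interaction of the three components can be sketched as a simple control loop (a minimal illustration only; the helper names select_columns, generate_code and fix_code are placeholders standing in for the actual LLM calls, not our implementation):</p>

```python
import pandas as pd

def run_pipeline(df: pd.DataFrame, question: str,
                 select_columns, generate_code, fix_code,
                 max_retries: int = 3):
    """Column Selector -> Answer Generator -> (optional) Code Fixer."""
    columns = select_columns(df.columns.tolist(), question)
    code = generate_code(columns, question)
    for _ in range(max_retries):
        try:
            namespace = {}
            exec(code, namespace)          # generated code defines answer(df)
            return namespace["answer"](df)
        except Exception as err:
            # On failure, feed the error message back to the LLM and retry.
            code = fix_code(code, str(err))
    return None
```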
      <p>Column Selector Instead of relying on manually crafted heuristics or embedding similarity measures,
the first component of our system leverages an instruction-based LLM tasked with identifying the most
relevant columns of a tabular source from an input question in natural language form. Our template
provides the list of column names and instructs the model to return only those that are essential for
answering the query.</p>
      <p>Answer Generator Once the relevant columns are identified, the second component of our pipeline
is instructed to generate executable code that retrieves the answers from the tabular source using both
the input query and the relevant columns extracted in the previous step. As part of our prompt, we
guided the LLM to generate Python code and postprocessed the output to ensure that
only Python lines were passed through to the next module. Python was chosen since it is widely
used in data analysis and has extensive support for tabular data processing through libraries such as
Pandas.</p>
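      <p>The postprocessing that keeps only Python lines can be sketched as follows (a rough approximation of the idea; the regex and the fallback heuristic are ours, not the exact implementation):</p>

```python
import re

def extract_python(llm_output: str) -> str:
    """Keep only the Python code from an LLM response:
    prefer a fenced ```python block; otherwise drop prose lines."""
    match = re.search(r"```(?:python)?\n(.*?)```", llm_output, re.DOTALL)
    if match:
        return match.group(1).strip()
    # Fallback: keep lines that look like code (very rough heuristic).
    lines = [l for l in llm_output.splitlines()
             if l.startswith((" ", "\t", "import ", "def ", "return "))]
    return "\n".join(lines).strip()
```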
      <p>Code Fixer The final component of our pipeline captures execution errors that might occur due to
incorrect syntax, schema mismatches, or runtime exceptions. This module captures the error messages
and re-generates a corrected function by feeding the error context back into the LLM. To achieve this,
we used a structured prompt that includes the failing code together with the corresponding error
description.</p>
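      <p>For illustration, filling the Code Fixer template from the Annex with the failing code and its error could look like this (the function name is ours):</p>

```python
# Code Fixer prompt fragment, taken from the template in the Annex.
FIX_PROMPT = (
    "Below is the piece of code that needs to be fixed, along with "
    "the error message that results from running the code:\n"
    "{response}\n"
    "Error: {error}"
)

def build_fix_prompt(failing_code: str, error_message: str) -> str:
    """Embed the failing code and its runtime error into the prompt
    that guides the LLM to regenerate a corrected function."""
    return FIX_PROMPT.format(response=failing_code, error=error_message)
```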
      <p>Preprocessing Since our system strongly relies on a well-formatted prompt, we manually designed a
preprocessing step to ensure a consistent format to feed our system. We standardized column names
into simplified versions (removing emoji and all non-alphanumerical characters except punctuation
symbols) to prevent possible errors in the Answer Generator caused by mismatches between the table
structure and the generated code. We also identified enum-like column types, such as categorical
attributes that take a finite set of string values (e.g., a "Survey" column that only contains "Yes",
"No" or "Maybe"), and inferred a common scheme to ensure consistency across different attributes,
thus reducing errors related to unexpected variations in categorical values.</p>
      <p>All our prompts are available in our publicly released code on GitHub, as well as in the Annex section.</p>
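      <p>A sketch of this column-name standardization, using Unicode categories to drop emoji and other symbols while keeping punctuation (an illustrative approximation, not our exact rules):</p>

```python
import re
import unicodedata
import pandas as pd

def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Simplify column names: strip emoji and other non-alphanumerical
    characters except punctuation, then collapse whitespace."""
    def clean(name: str) -> str:
        kept = []
        for ch in str(name):
            cat = unicodedata.category(ch)
            # Keep letters (L), digits (N), spaces (Z) and punctuation (P);
            # emoji fall under the symbol categories (e.g. 'So') and are dropped.
            if cat[0] in ("L", "N", "Z", "P"):
                kept.append(ch)
        return re.sub(r"\s+", " ", "".join(kept)).strip()
    return df.rename(columns=clean)
```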
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>Our system relies on open-source LLMs for zero-shot code generation. This way, no explicit training or
finetuning was conducted. Instead, we used the available training-phase datasets to validate different
LLMs and select the best performing one for the final test phase.</p>
      <p>Dataset The dataset provided for the task is divided into three sets: training, development (aka dev),
and test. In our case, since we had opted for a zero-shot approach, the training set remained unused
during the development phase, and only the dev set was used for our experiments. During this stage we tried
different LLMs to compare their ability to generate adequate Python code to answer the input
questions. To do that, we analyzed the accuracy obtained with respect to the ground truth of the
validation set, together with manual checks to assess the quality of the generated code.</p>
      <p>Evaluation The official evaluation consists of checking whether the system is able to retrieve the answer
from the tables, comparing its output with the expected one. Notably, in this competition the evaluation
scripts allow for non-meaningful differences in the outputs; e.g., if the system outputs 125.0 (float) and the
expected result is 125 (int), it is counted as a correct response.</p>
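      <p>A comparison of that flavour can be sketched as follows (our own illustration of the tolerance described above, not the official evaluation script):</p>

```python
def lenient_match(predicted, expected) -> bool:
    """Treat numerically equal values (125.0 vs 125) and
    whitespace/case-insensitive strings as the same answer."""
    if isinstance(predicted, (int, float)) and isinstance(expected, (int, float)):
        return float(predicted) == float(expected)
    if isinstance(predicted, str) and isinstance(expected, str):
        return predicted.strip().lower() == expected.strip().lower()
    if isinstance(predicted, list) and isinstance(expected, list):
        # Compare list answers element-wise with the same tolerance.
        return len(predicted) == len(expected) and all(
            lenient_match(p, e) for p, e in zip(predicted, expected))
    return predicted == expected
```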
      <p>
        System Setup We conducted experiments with diferent open-source LLMs adjusted to our hardware
limitations, specifically pretrained for instruction-based code generation: Qwen-2.5-Coder [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] (with 7B
and 32B versions), Mistral-7B and Codestral-22B —the latter two from Mistral [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>To run the generated code we relied on Python 3.10.12 with Pandas 2.2.3 as a requirement. Due to
VRAM constraints, all models were executed with 4-bit quantization, using a greedy generation strategy
with a temperature of 0.7.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Analysis of Results</title>
      <p>In this section, we present the evaluation of our system on the task. We first report performance during
the development phase (§5.1), where we experimented with different models on the validation dataset,
followed by the final test phase (§5.2), where our system was evaluated on the test dataset through
CodaBench submissions.</p>
      <sec id="sec-5-1">
        <title>5.1. Development Phase</title>
        <p>As explained before, during the development phase we focused on selecting the best performing LLM
using just the dev set; that is, disregarding the training set. At this first stage, our pipeline comprised
only the Answer Generator module.</p>
        <p>
          The results obtained for this original setup, presented in Table 1, show that the only model able to
outperform the baseline system [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] is Qwen-2.5-Coder32B.
        </p>
        <p>The majority of failing code outputted by the Qwen-2.5-Coder7B, Mistral7B and Codestral22B
models stems from trying to use the function "split()", which cannot be used with the majority of
datatypes present in the columns of the dataset.</p>
        <p>The smaller LLMs needed more postprocessing in order to extract the code they
generated from the rest of the response. They usually add unnecessary textual descriptions of the code,
even though the prompt states that none of that is necessary, and these would make the execution fail.</p>
        <p>An example of this behavior with an output of Codestral22B:</p>
        <preformat>
import pandas as pd
def answer(df: pd.DataFrame) -&gt; list:
    column_name = 'considerando una escala de 0 a 10, donde 0 significa 'nada, en absoluto' y 10 'totalmente', digame, por favor, si durante la ultima semana se ha sentido..._Feliz'

response, and `nlargest(3)` is used to select the three most common responses. The `index` attribute is used to get the actual responses, and `tolist()` is used to convert the index to a list.
        </preformat>
        <p>Meanwhile, Qwen-2.5-Coder32B does not make this kind of error, and just outputs the desired code
without need of further postprocessing. It is important to note that the Qwen-2.5-Coder32B model was
able to perform better with list datatypes than with numbers, even though the former datatype tends to
stem from more difficult or complex queries.</p>
        <p>Ablation Study We relied on the results displayed in Table 1 to select the best performing LLM, which
served as the foundation for integrating the additional modules that could further enhance performance
(see Figure 1). Table 2 shows the results when varying the components of the pipeline while maintaining
Qwen-2.5-Coder32B as backbone. The AG (Answer Generator only) setup corresponds to the result
displayed in Table 1, against which the extra components of our pipeline were compared to see if there
was an actual improvement when introducing error-awareness and column pre-selection. The AG+CS
(AG with Column Selector) setup shows a clear improvement of 8 points with respect to the AG-only
model, outlining the importance of first asking the LLM to filter the relevance of the input attributes.
Lastly, when integrating the Code Fixer (CF) with an enhanced column selection (ECS) to feed richer
information about feature variations to the prompt, our final system setup (AG+ECS+CF) maintains
almost the same performance as the AG+ECS setup. It deals better with some datatypes, and
worse or equal with others. This suggests that the model is powerful enough not to produce erroneous code,
so the mistakes it makes come from misinterpreting the question and giving a wrong answer.</p>
        <p>[Table 2: Accuracy of the AG, AG+CS, AG+ECS and AG+ECS+CF setups on the validation and test sets.]</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Final Test Phase</title>
        <p>The best performing configuration is almost a tie between AG+ECS and AG+ECS+CF, but we decided
to go with the latter to participate in the competition, just in case the code fixer is able to correct
possible coding mistakes. Our zero-shot approach reached 79 points of accuracy in the task, which
ranked us in the 4th position out of 7 participants.</p>
        <p>Our test results benefited from a significant increase of 10 points in accuracy with respect to our
development-phase results (69 points), likely due to the complexity of the questions present in the test set.
By answer type, boolean accuracy reaches more than 80 points, while list-like types do not surpass 75 points. This might
indicate that the LLM is not able to infer these complex schemes on the test set, producing errors that
are propagated from the Column Selector module to the Answer Generator.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>In this work we propose a zero-shot approach for Tabular QA that demonstrated strong performance
in the PRESTA task, ranking among the best systems in the development phase, although suffering from
a performance drop in the test phase. Still, our system shows that an instruction-based approach allows
the system to dynamically adapt to different dataset schemes without requiring additional training or finetuning,
surpassing the baseline model even with limited hardware resources available.</p>
      <p>Future work will focus on further refining prompt templates, improving schema adaptation,
optimizing execution efficiency, and incorporating a voting system with different LLMs. Improving the
detection of complex datatypes is also critical, as it would allow the model to answer questions on less
structured tables —which constitute the majority of online data—, ultimately making the system more
generalizable.</p>
    </sec>
    <sec id="sec-7">
      <title>Hardware Setup</title>
      <p>Our hardware resources are somewhat limited by today’s standards. We had shared access to an Intel
Core i9-10920X at 3.50 GHz with 258 GiB RAM and two NVIDIA RTX 3090 GPUs, so we opted to
perform zero-shot inference instead of finetuning the LLMs.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>We acknowledge grants SCANNER-UDC (PID2020-113230RB-C21) funded by MICIU/AEI/10.13039/501100011033;
GAP (PID2022-139308OA-I00) funded by MICIU/AEI/10.13039/501100011033/ and ERDF, EU; LATCHING
(PID2023-147129OB-C21) funded by MICIU/AEI/10.13039/501100011033 and ERDF, EU; CIDMEFEO funded by
the Spanish National Statistics Institute (INE); as well as funding by Xunta de Galicia (ED431C 2024/02),
and Centro de Investigación de Galicia “CITIC”, funded by the Xunta de Galicia through the collaboration
agreement between the Consellería de Cultura, Educación, Formación Profesional e Universidades and the
Galician universities for the reinforcement of the research centres of the Galician University System (CIGUS).</p>
      <p>CITIC, as a center accredited for excellence within the Galician University System and a member
of the CIGUS Network, receives subsidies from the Department of Education, Science, Universities,
and Vocational Training of the Xunta de Galicia. Additionally, it is co-financed by the EU through the
FEDER Galicia 2021-27 operational program (Ref. ED431G 2023/01).</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT for minor copy-editing, and GitHub
Copilot for code autocompletion. After using these tool(s)/service(s), the author(s) reviewed and edited
the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Prompts used</title>
      <sec id="sec-10-1">
        <title>A.1. Answer Generator</title>
        <p>Role and Context
You are a Python-powered Tabular Data Question-Answering System. Your core expertise lies in understanding tabular
datasets and crafting Python scripts to generate precise solutions to user queries.</p>
        <p>Task Description:
Generate Python code to address a query based on the provided dataset. The output must:
- Use the dataset and query as given, avoiding any external assumptions.
- Adhere to strict syntax rules for Python, ensuring the code runs flawlessly without external modifications.
- Retain the original column names of the dataset in your script.</p>
        <p>Input Specification
dataset: A Pandas DataFrame containing the data to be analyzed.</p>
        <p>question: A string outlining the specific query.
Input:
column_names: { column_names }
question: { question }</p>
      </sec>
      <sec id="sec-10-2">
        <title>A.3. Code Fixer</title>
        <p>Role and Context
You are a Python-powered Tabular Data Question-Answering System. Your core expertise lies in understanding tabular datasets
and crafting Python scripts to generate precise solutions to user queries.</p>
        <p>Task Description:
Fix the Python code to address a query based on the provided dataset. The output must:
- Use the dataset and query as given, avoiding any external assumptions.
- Adhere to strict syntax rules for Python, ensuring the code runs flawlessly without external modifications.
- Retain the original column names of the dataset in your script.</p>
        <p>Input Specification
code: The Python code that needs to be fixed.</p>
        <p>error: The error message that results from running the code.</p>
        <p>Output Specification</p>
        <p>Return only the Python code that solves the query in the function, excluding any introductory explanations or comments.</p>
        <p>The function must:
Include all essential imports.</p>
        <p>Be concise and functional, ensuring the script can be executed without additional modifications.</p>
        <p>Use the dataset and return a result of type number, categorical value, boolean value, or a list of values.
Code:</p>
        <p>Below is the piece of code that needs to be fixed, along with the error message that results from running the code:
{ response }
Error: { error }</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Herzig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Nowak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piccinno</surname>
          </string-name>
          , J. Eisenschlos, TaPas: Weakly Supervised Table Parsing via Pre-training, in: D.
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Chai</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Schluter</surname>
          </string-name>
          , J. Tetreault (Eds.),
          <article-title>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</article-title>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>4320</fpage>
          -
          <lpage>4333</lpage>
          . URL: https://aclanthology.org/2020.acl-main.398/. doi:10.18653/v1/2020.acl-main.398.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Yin</surname>
          </string-name>
          , G. Neubig, W.-t. Yih, S. Riedel,
          <article-title>TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data</article-title>
          , in: D.
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Chai</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Schluter</surname>
          </string-name>
          , J. Tetreault (Eds.),
          <article-title>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</article-title>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>8413</fpage>
          -
          <lpage>8426</lpage>
          . URL: https://aclanthology.org/2020.acl-main.745/. doi:10.18653/v1/2020.acl-main.745.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <article-title>Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning</article-title>
          ,
          <year>2017</year>
          . URL: https://arxiv.org/abs/1709.00103. arXiv:1709.00103.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , D. Radev,
          <article-title>TypeSQL: Knowledge-Based Type-Aware Neural Text-to-SQL generation</article-title>
          , in:
          <string-name>
            <given-names>M.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stent</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>2</volume>
          (Short Papers)
          ,
          <source>Association for Computational Linguistics</source>
          , New Orleans, Louisiana,
          <year>2018</year>
          , pp.
          <fpage>588</fpage>
          -
          <lpage>594</lpage>
          . URL: https://aclanthology.org/N18-2093/. doi:10.18653/v1/N18-2093.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yates</surname>
          </string-name>
          , E. Kanoulas, M. de Rijke,
          <article-title>MultiTabQA: Generating Tabular Answers for MultiTable Question Answering</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Rogers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Boyd-Graber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Okazaki</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <source>Association for Computational Linguistics</source>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>6322</fpage>
          -
          <lpage>6334</lpage>
          . URL: https://aclanthology.org/2023.acl-long.348/. doi:10.18653/v1/2023.acl-long.348.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Herbert-Voss</surname>
          </string-name>
          , G. Krueger,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Winter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          , E. Sigler,
          <string-name>
            <given-names>M.</given-names>
            <surname>Litwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Berner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language Models are Few-Shot Learners</article-title>
          , in:
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ranzato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hadsell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Balcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          Curran Associates, Inc.,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fried</surname>
          </string-name>
          ,
          <article-title>API-Assisted Code Generation for Question Answering on Varied Table Structures</article-title>
          , in:
          <string-name>
            <given-names>H.</given-names>
            <surname>Bouamor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bali</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>14536</fpage>
          -
          <lpage>14548</lpage>
          . URL: https://aclanthology.org/2023.emnlp-main.897/. doi:10.18653/v1/2023.emnlp-main.897.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. Á.</given-names>
            <surname>González-Barba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiruzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jiménez-Zafra</surname>
          </string-name>
          ,
          <article-title>Overview of IberLEF 2025: Natural Language Processing Challenges for Spanish and other Iberian Languages</article-title>
          , in:
          <source>Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2025), co-located with the 41st Conference of the Spanish Society for Natural Language Processing (SEPLN 2025)</source>
          , CEUR-WS.org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Osés-Grijalba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Ureña-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Cámara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          ,
          <article-title>Overview of PRESTA at IberLEF 2025: Question Answering Over Tabular Data In Spanish</article-title>
          , in:
          <source>Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2025), co-located with the 41st Conference of the Spanish Society for Natural Language Processing (SEPLN 2025)</source>
          , CEUR-WS.org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rajpurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lopyrev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>SQuAD: 100,000+ Questions for Machine Comprehension of Text</article-title>
          ,
          <year>2016</year>
          . URL: https://arxiv.org/abs/1606.05250. arXiv:1606.05250.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering</article-title>
          , in:
          <string-name>
            <given-names>E.</given-names>
            <surname>Riloff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hockenmaier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tsujii</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Brussels, Belgium,
          <year>2018</year>
          , pp.
          <fpage>2369</fpage>
          -
          <lpage>2380</lpage>
          . URL: https://aclanthology.org/D18-1259/. doi:10.18653/v1/D18-1259.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Pasupat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Compositional Semantic Parsing on Semi-Structured Tables</article-title>
          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Zong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Strube</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing</source>
          (Volume
          <volume>1</volume>
          : Long Papers)
          ,
          <source>Association for Computational Linguistics</source>
          , Beijing, China,
          <year>2015</year>
          , pp.
          <fpage>1470</fpage>
          -
          <lpage>1480</lpage>
          . URL: https://aclanthology.org/P15-1142/. doi:10.3115/v1/P15-1142.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Iyyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>Search-based Neural Structured Learning for Sequential Question Answering</article-title>
          , in: R. Barzilay, M.-Y. Kan (Eds.),
          <source>Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <source>Association for Computational Linguistics</source>
          , Vancouver, Canada,
          <year>2017</year>
          , pp.
          <fpage>1821</fpage>
          -
          <lpage>1831</lpage>
          . URL: https://aclanthology.org/P17-1167/. doi:10.18653/v1/P17-1167.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J. Osés</given-names>
            <surname>Grijalba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Ureña-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. Martínez</given-names>
            <surname>Cámara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          ,
          <article-title>Question Answering over Tabular Data with DataBench: A Large-Scale Empirical Evaluation of LLMs</article-title>
          , in:
          <string-name>
            <given-names>N.</given-names>
            <surname>Calzolari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-Y.</given-names>
            <surname>Kan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Hoste</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sakti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Xue</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</source>
          , ELRA and ICCL, Torino, Italia,
          <year>2024</year>
          , pp.
          <fpage>13471</fpage>
          -
          <lpage>13488</lpage>
          . URL: https://aclanthology.org/2024.lrec-main.1179/.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Burstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Doran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Solorio</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <source>Association for Computational Linguistics</source>
          , Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://aclanthology.org/N19-1423/. doi:10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>TaCube: Pre-computing Data Cubes for Answering Numerical-Reasoning Questions over Tabular Data</article-title>
          , in:
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kozareva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>2278</fpage>
          -
          <lpage>2291</lpage>
          . URL: https://aclanthology.org/2022.emnlp-main.145/. doi:10.18653/v1/2022.emnlp-main.145.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schluter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>7871</fpage>
          -
          <lpage>7880</lpage>
          . URL: https://aclanthology.org/2020.acl-main.703/. doi:10.18653/v1/2020.acl-main.703.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ge</surname>
          </string-name>
          , Y. Han,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Qwen Technical Report</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2309.16609. arXiv:2309.16609.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          , D. de las Casas,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          , G. Lengyel,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. R.</given-names>
            <surname>Lavaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. E.</given-names>
            <surname>Sayed</surname>
          </string-name>
          ,
          <article-title>Mistral 7B</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2310.06825.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J. O.</given-names>
            <surname>Grijalba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A. U.</given-names>
            <surname>López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Cámara</surname>
          </string-name>
          ,
          <article-title>Towards quality benchmarking in question answering over tabular data in Spanish</article-title>
          ,
          <source>Proces. del Leng. Natural</source>
          <volume>73</volume>
          (
          <year>2024</year>
          )
          <fpage>283</fpage>
          -
          <lpage>296</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>