<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>sonrobok4 at IberLEF 2025 - PRESTA: Leveraging LLMs for Text-to-Python Question Answering over Tabular Data in Spanish</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nguyen Minh Son</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dang Van Thin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Information Technology-VNUHCM</institution>
          ,
          <addr-line>Quarter 6, Linh Trung Ward, Thu Duc District, Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vietnam National University</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>This paper presents our contribution to the PRESTA shared task on Question Answering over Tabular Data in Spanish. We explore the capabilities of large language models (LLMs) for text-to-code generation, focusing on text-to-Python approaches to handle diverse question types. Our method employs a multi-prompt strategy that emphasizes structured table understanding and language-aware prompt construction. We investigate the effectiveness of zero-shot prompting using cutting-edge models such as GPT-4o-mini, DeepSeek-V3, and DeepSeek-R1. Our experiments assess the quality of Python code generation for tabular QA, as well as the robustness of LLMs in handling multilingual and domain-specific tabular contexts. Our approach achieved the highest accuracy among competitors, reaching 87%.</p>
      </abstract>
      <kwd-group>
<kwd>Question Answering</kwd>
        <kwd>Tabular Data</kwd>
        <kwd>Text-to-Code</kwd>
        <kwd>Text-to-Python</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Zero-Shot Prompting</kwd>
        <kwd>Spanish Language</kwd>
        <kwd>Prompt Engineering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The growing interest in question answering (QA) over structured data has led to significant
advancements in understanding and reasoning over tabular formats. Although much of this progress has
focused on English-language datasets, the PRESTA [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] shared task at IberLEF 2025 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] addresses a
critical gap by introducing DataBenchSPA [3], the first large-scale benchmark specifically designed for
Spanish QA over tabular data. This initiative opens new opportunities to evaluate and improve QA
systems in multilingual and domain-specific contexts. DataBenchSPA comprises diverse real-world
tables with varying row and column counts, encompassing a wide range of data types such as numerical,
categorical, boolean, and list values. The task challenges participants to build systems that can interpret
natural language questions and accurately return answers from the corresponding tables.
      </p>
      <p>Motivated by these challenges, we focus our efforts on exploring LLM-based text-to-code generation
as a practical solution for this task. In particular, we investigate prompting strategies that encourage
structured table understanding while accounting for the linguistic characteristics of Spanish. Through
careful experimentation and system design, we aim to highlight the potential of prompt engineering
techniques in handling real-world QA over tabular data.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Question Answering on tabular data represents a critical area of NLP research. Early benchmark datasets,
including WikiSQL [4], Spider [5], and TabFact [6], primarily focused on English-language evaluation.
The development of DataBenchSPA [3], a comprehensive Spanish tabular QA dataset, significantly
expanded evaluation capabilities by introducing a non-English benchmark, thereby underscoring the
importance of multilingual approaches in table-based QA.</p>
      <p>Recent advances have increasingly leveraged Large Language Models (LLMs) for table QA tasks.
Notable approaches include Chain-of-Table [7], which dynamically evolves table representations
during reasoning, and Tree-of-Table [8], which employs hierarchical structures for large-scale table
understanding. The DataFrame QA framework introduces a novel method for table question answering
without raw data exposure by generating executable pandas queries. Meanwhile, Table-Critic [9]
demonstrates the effectiveness of multi-agent systems for collaborative table reasoning.</p>
      <p>Preliminary experiments with text-to-Python conversion on the DataBenchSPA dataset using small
open-source models such as Mistral [10] and DeepSeek-Coder [11] have revealed promising results while
highlighting significant opportunities for further improvement.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>Our system employs a Text-to-Code approach that utilizes proprietary large language models (LLMs) with
cost awareness to answer questions over tabular data. The architecture comprises several key components:
data preprocessing, table context preparation, code generation, an error-correction loop to handle
invalid code, and answer synthesis. Each component is described in detail in the following sections.
Our method is illustrated in Figure 1.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Preprocessing</title>
        <p>An analysis of the dataset revealed that many tables contain columns with a high percentage of null
values. Given that a single table can contain approximately 170 columns, it is essential to perform thorough
data cleaning by removing unnecessary or sparse columns. This step reduces noise and prevents the
large language model (LLM) from being overwhelmed by irrelevant or excessive data during processing.</p>
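        <p>As a minimal illustration, the sketch below implements this cleaning step with pandas; the 90% null threshold is an illustrative assumption, not a value tuned in our experiments.</p>
        <preformat>
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_null_frac: float = 0.9) -> pd.DataFrame:
    """Drop columns whose fraction of null values exceeds max_null_frac."""
    null_frac = df.isna().mean()  # per-column fraction of null values
    sparse = null_frac[null_frac > max_null_frac].index  # 0.9 is an illustrative threshold
    return df.drop(columns=sparse)
        </preformat>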
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Table Context Preparation</title>
        <p>During system development, we observed that the type and amount of table information provided
to the LLM significantly affect its performance. Specifically, supplying only the first few rows versus
including additional metadata such as column names and data types results in notable differences. We
provide five sample rows, carefully selected to capture the most unique values per column and maximize
the representation of special cases. Additionally, we supply a dictionary containing column names
alongside their corresponding data types to enhance the LLM’s understanding of the table schema.</p>
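        <p>The following sketch shows one plausible way to assemble this context, assuming pandas input and the JSON serialization evaluated in Section 4; the greedy row-selection heuristic is our illustrative reading of the per-column uniqueness criterion described above.</p>
        <preformat>
import json

import pandas as pd


def build_table_context(df: pd.DataFrame, n_rows: int = 5) -> str:
    """Assemble the table context handed to the LLM: schema plus sample rows."""
    # Greedy pass: keep a row only if it shows at least one value not seen yet,
    # so the sample maximizes distinct per-column values (repr() keeps
    # unhashable cells such as lists usable as set members).
    seen = {col: set() for col in df.columns}
    sample_index = []
    for idx, row in df.iterrows():
        if len(sample_index) == n_rows:
            break
        novelty = sum(1 for col in df.columns if repr(row[col]) not in seen[col])
        if novelty > 0 or not sample_index:
            sample_index.append(idx)
            for col in df.columns:
                seen[col].add(repr(row[col]))
    schema = {col: str(dtype) for col, dtype in df.dtypes.items()}
    rows = df.loc[sample_index].to_dict(orient="records")
    return json.dumps({"columns": schema, "sample_rows": rows},
                      ensure_ascii=False, default=str)
        </preformat>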
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Code Generation</title>
        <p>We use a zero-shot prompting strategy to generate executable Python code conditioned on the question
context. The generated code is then executed on the provided table to produce an intermediate result
for the answer synthesis stage. If the execution result is excessively long or malformed, we return None
to indicate a failure in the code generation logic.</p>
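        <p>A simplified sketch of this stage follows; the llm callable, the prompt wording, and the 1,000-character length guard are hypothetical placeholders rather than our exact production values.</p>
        <preformat>
def generate_and_run(llm, question: str, context: str, df):
    """Zero-shot code generation followed by guarded execution."""
    prompt = (
        "You are given a pandas DataFrame named `df`.\n"
        f"Table context:\n{context}\n\n"
        f"Question: {question}\n"
        "Write Python code that stores the final answer in a variable `result`."
    )
    code = llm(prompt)  # `llm` is a hypothetical client returning a code string
    scope = {"df": df}
    try:
        exec(code, scope)  # run the generated snippet against the table
    except Exception:
        return None
    result = scope.get("result")
    # Guard: missing or excessively long output signals failed generation logic;
    # the 1,000-character limit is an illustrative choice.
    if result is None or len(str(result)) > 1000:
        return None
    return result
        </preformat>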
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Correction Loop</title>
        <p>To address potential execution errors in the generated code, we implement a correction loop that
attempts to revise the code up to five times. In each iteration, the LLM receives the previous code,
the error message, and relevant table information to generate a corrected version. This mechanism
improves robustness by allowing recovery from common syntax and logic errors.</p>
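        <p>The loop can be sketched as follows; as above, the llm callable and the repair prompt wording are hypothetical placeholders.</p>
        <preformat>
def correction_loop(llm, code: str, df, table_info: str, max_tries: int = 5):
    """Retry execution, feeding failing code and its error back to the LLM."""
    for _ in range(max_tries):
        scope = {"df": df}
        try:
            exec(code, scope)
            return scope.get("result")
        except Exception as err:
            # The model receives the previous code, the error message, and the
            # table information, and proposes a corrected version.
            repair_prompt = (
                f"Table context:\n{table_info}\n\n"
                f"The following code failed:\n{code}\n\n"
                f"Error message: {err}\n"
                "Return corrected Python code that stores the answer in `result`."
            )
            code = llm(repair_prompt)  # hypothetical LLM client
    return None  # all attempts failed
        </preformat>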
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Answer Synthesis</title>
        <p>The competition defines a constrained set of acceptable output formats, making an answer synthesis
module essential for converting raw code output into valid final answers. Based on the type of question
and output, the system formats responses into one of the following categories (a minimal dispatcher is
sketched after this list):
• Boolean: Valid values include True/False, Yes/No, or Y/N (case-insensitive).
• Category: A value or substring from a single cell in the dataset.
• Number: A numeric value, potentially derived from calculations such as average, maximum, or
minimum.
• List[category]: A fixed-length list of categorical values (e.g., ['cat', 'dog']). The question
wording determines whether uniqueness or duplicates are expected.
• List[number]: A fixed-length list of numerical values, formatted similarly to List[category].</p>
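        <p>A minimal dispatcher illustrating this formatting logic might look as follows; the answer_type labels are assumptions about how the expected type is represented, not the task’s official identifiers.</p>
        <preformat>
def synthesize_answer(result, answer_type: str):
    """Coerce raw execution output into one of the allowed answer formats."""
    if answer_type == "boolean":
        return bool(result)  # True/False, Yes/No, Y/N are accepted case-insensitively
    if answer_type == "number":
        return float(result)  # e.g. an average, maximum, or minimum
    if answer_type in ("list[category]", "list[number]"):
        # Fixed-length list; order is kept and uniqueness is not forced here,
        # since the question wording decides whether duplicates are expected.
        return list(result)
    return str(result)  # category: a value or substring from a single cell
        </preformat>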
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <sec id="sec-4-1">
        <title>4.1. Development Phase</title>
        <p>Our experimental results on the development set are summarized in Table ??. We categorize the
experiments into three distinct groups:
1. Table Input Format and Size: This group investigates how different input representations
affect performance. Specifically, we vary whether the table rows are provided to the LLM in plain
string, Markdown, or JSON format. Additionally, we vary the number of rows given: the base
case includes 2 rows, while the extended input includes 5 rows.
2. Prompt Strategy: We explore the impact of different prompting techniques, including zero-shot,
CoT [12], and role-play [13]. These strategies aim to compare LLM performance across different
prompts.
3. Language Handling: Since the original dataset is in Spanish, we investigate whether language
affects LLM performance. We compare three approaches: translating only the question into
English, translating both the question and the column headers, and using Spanish prompts.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Testing Phase</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work, we investigated the impact of prompt design, input formatting, and language handling
on table-based question answering using large language models. Our experiments, conducted using
DeepSeek-V3 and DeepSeek-R1, revealed that structured JSON inputs and longer table contexts (i.e.,
more rows) significantly improve model performance. Among prompting strategies, role-play and
chain-of-thought techniques offer moderate gains.</p>
      <p>Notably, our results indicate that maintaining the original language (Spanish) in the prompt leads to
better performance on the development set, but may generalize less effectively to the testing phase.
Furthermore, we demonstrated that DeepSeek-V3 outperforms GPT-4o-mini on baseline tasks, justifying
its use in all subsequent experiments.</p>
      <p>Future work could explore dynamic prompting, multilingual fine-tuning, and hybrid symbolic-neural
table reasoning to further enhance performance in low-resource or multilingual domains.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was supported by The VNUHCM-University of Information Technology’s Scientific
Research Support Fund.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, we used GPT-4 and Grammarly to check grammar and
spelling and to edit the content for clarity and coherence. After using these tools, we reviewed and edited
the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Osés-Grijalba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Ureña-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Cámara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          ,
          <article-title>Overview of PRESTA at IberLEF 2025: Question Answering over Tabular Data in Spanish</article-title>
          , in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2025), co-located with the 41st Conference of the Spanish Society for Natural Language Processing (SEPLN 2025), CEUR-WS.org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. Á.</given-names>
            <surname>González-Barba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiruzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jiménez-Zafra</surname>
          </string-name>
          ,
          <article-title>Overview of IberLEF 2025: Natural Language Processing Challenges for Spanish and other Iberian Languages</article-title>
          , in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2025), co-located with the 41st Conference of the Spanish Society for Natural Language Processing (SEPLN 2025), CEUR-WS.org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Osés-Grijalba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Ureña-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Cámara</surname>
          </string-name>
          ,
          <article-title>Towards quality benchmarking in question answering over tabular data in Spanish</article-title>
          , Procesamiento del Lenguaje Natural 73 (
          <year>2024</year>
          ) 283-296.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>