<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>sonrobok4 at IberLEF 2025 - PRESTA: Leveraging LLMs for Text-to-Python Question Answering over Tabular Data in Spanish</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nguyen Minh Son</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dang Van Thin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Information Technology-VNUHCM</institution>
          ,
          <addr-line>Quarter 6, Linh Trung Ward, Thu Duc District, Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vietnam National University</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>This paper presents our contribution to the PRESTA shared task on Question Answering over Tabular Data in Spanish. We explore the capabilities of large language models (LLMs) for text-to-code generation, focusing on text-to-Python approaches to handle diverse question types. Our method employs a multi-prompt strategy that emphasizes structured table understanding and language-aware prompt construction. We investigate the effectiveness of zero-shot prompting using cutting-edge models such as GPT-4o-mini, DeepSeek-V3, and DeepSeek-R1. Our experiments assess the quality of Python code generation for tabular QA, as well as the robustness of LLMs in handling multilingual and domain-specific tabular contexts. Our approach achieved the highest accuracy among competitors, reaching 87%.</p>
      </abstract>
      <kwd-group>
<kwd>Question Answering</kwd>
        <kwd>Tabular Data</kwd>
        <kwd>Text-to-Code</kwd>
        <kwd>Text-to-Python</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Zero-Shot Prompting</kwd>
        <kwd>Spanish Language</kwd>
        <kwd>Prompt Engineering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The growing interest in question answering (QA) over structured data has led to significant
advancements in understanding and reasoning over tabular formats. Although much of this progress has
focused on English-language datasets, the PRESTA [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] shared task at IberLEF 2025 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] addresses a
critical gap by introducing DataBenchSPA [3], the first large-scale benchmark specifically designed for
Spanish QA over tabular data. This initiative opens new opportunities to evaluate and improve QA
systems in multilingual and domain-specific contexts. DataBenchSPA comprises diverse real-world
tables with varying row and column counts, encompassing a wide range of data types such as numerical,
categorical, boolean, and list values. The task challenges participants to build systems that can interpret
natural language questions and accurately return answers from the corresponding tables.
      </p>
      <p>Motivated by these challenges, we focus our efforts on exploring LLM-based text-to-code generation
as a practical solution for this task. In particular, we investigate prompting strategies that encourage
structured table understanding while accounting for the linguistic characteristics of Spanish. Through
careful experimentation and system design, we aim to highlight the potential of prompt engineering
techniques in handling real-world QA over tabular data.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Question Answering on tabular data represents a critical area of NLP research. Early benchmark datasets,
including WikiSQL [4], Spider [5], and TabFact [6], primarily focused on English-language evaluation.
The development of DataBenchSPA [3], a comprehensive Spanish tabular QA dataset, significantly
expanded evaluation capabilities by introducing a non-English benchmark, thereby underscoring the
importance of multilingual approaches in table-based QA.</p>
      <p>Recent advances have increasingly leveraged Large Language Models (LLMs) for table QA tasks.
Notable approaches include Chain-of-Table [7], which dynamically evolves table representations
during reasoning, and Tree-of-Table [8], which employs hierarchical structures for large-scale table
understanding. The DataFrame QA framework introduces a novel method for table question answering
without raw data exposure by generating executable pandas queries. Meanwhile, Table-Critic [9]
demonstrates the effectiveness of multi-agent systems for collaborative table reasoning.</p>
      <p>Preliminary experiments with text-to-Python conversion on the DataBenchSPA dataset using small
open-source models such as Mistral [10] and DeepSeek-Coder [11] have revealed promising results while
highlighting significant opportunities for further improvement.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>Our system employs a Text-to-Code approach that utilizes proprietary large language models (LLMs) with
cost awareness to answer questions over tabular data. The architecture comprises several key components:
data preprocessing, table context preparation, code generation, an error-correction loop to handle
invalid code, and answer synthesis. Each component is described in detail in the following sections.
Our method is illustrated in Figure 1.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Preprocessing</title>
        <p>An analysis of the dataset revealed that many tables contain columns with a high percentage of null
values. Given that a single table can contain approximately 170 columns, it is essential to perform thorough
data cleaning by removing unnecessary or sparse columns. This step reduces noise and prevents the
large language model (LLM) from being overwhelmed by irrelevant or excessive data during processing.</p>
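        <p>As a minimal illustration, the sketch below implements this cleaning step with pandas; the 90% null threshold is an illustrative assumption, not a value tuned in our experiments.</p>
        <preformat>
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_null_frac: float = 0.9) -> pd.DataFrame:
    """Drop columns whose fraction of null values exceeds max_null_frac."""
    null_frac = df.isna().mean()  # per-column fraction of null values
    sparse = null_frac[null_frac > max_null_frac].index  # 0.9 is an illustrative threshold
    return df.drop(columns=sparse)
        </preformat>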
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Table Context Preparation</title>
        <p>During system development, we observed that the type and amount of table information provided
to the LLM significantly affect its performance. Specifically, supplying only the first few rows versus
including additional metadata such as column names and data types results in notable differences. We
provide five sample rows, carefully selected to capture the most unique values per column and maximize
the representation of special cases. Additionally, we supply a dictionary containing column names
alongside their corresponding data types to enhance the LLM’s understanding of the table schema.</p>
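        <p>The following sketch shows one plausible way to assemble this context, assuming pandas input and the JSON serialization evaluated in Section 4; the greedy row-selection heuristic is our illustrative reading of the per-column uniqueness criterion described above.</p>
        <preformat>
import json

import pandas as pd


def build_table_context(df: pd.DataFrame, n_rows: int = 5) -> str:
    """Assemble the table context handed to the LLM: schema plus sample rows."""
    # Greedy pass: keep a row only if it shows at least one value not seen yet,
    # so the sample maximizes distinct per-column values (repr() keeps
    # unhashable cells such as lists usable as set members).
    seen = {col: set() for col in df.columns}
    sample_index = []
    for idx, row in df.iterrows():
        if len(sample_index) == n_rows:
            break
        novelty = sum(1 for col in df.columns if repr(row[col]) not in seen[col])
        if novelty > 0 or not sample_index:
            sample_index.append(idx)
            for col in df.columns:
                seen[col].add(repr(row[col]))
    schema = {col: str(dtype) for col, dtype in df.dtypes.items()}
    rows = df.loc[sample_index].to_dict(orient="records")
    return json.dumps({"columns": schema, "sample_rows": rows},
                      ensure_ascii=False, default=str)
        </preformat>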
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Code Generation</title>
        <p>We use a zero-shot prompting strategy to generate executable Python code conditioned on the question
context. The generated code is then executed on the provided table to produce an intermediate result
for the answer synthesis stage. If the execution result is excessively long or malformed, we return None
to indicate a failure in the code generation logic.</p>
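        <p>A simplified sketch of this stage follows; the llm callable, the prompt wording, and the 1,000-character length guard are hypothetical placeholders rather than our exact production values.</p>
        <preformat>
def generate_and_run(llm, question: str, context: str, df):
    """Zero-shot code generation followed by guarded execution."""
    prompt = (
        "You are given a pandas DataFrame named `df`.\n"
        f"Table context:\n{context}\n\n"
        f"Question: {question}\n"
        "Write Python code that stores the final answer in a variable `result`."
    )
    code = llm(prompt)  # `llm` is a hypothetical client returning a code string
    scope = {"df": df}
    try:
        exec(code, scope)  # run the generated snippet against the table
    except Exception:
        return None
    result = scope.get("result")
    # Guard: missing or excessively long output signals failed generation logic;
    # the 1,000-character limit is an illustrative choice.
    if result is None or len(str(result)) > 1000:
        return None
    return result
        </preformat>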
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Correction Loop</title>
        <p>To address potential execution errors in the generated code, we implement a correction loop that
attempts to revise the code up to five times. In each iteration, the LLM receives the previous code,
the error message, and relevant table information to generate a corrected version. This mechanism
improves robustness by allowing recovery from common syntax and logic errors.</p>
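        <p>The loop can be sketched as follows; as above, the llm callable and the repair prompt wording are hypothetical placeholders.</p>
        <preformat>
def correction_loop(llm, code: str, df, table_info: str, max_tries: int = 5):
    """Retry execution, feeding failing code and its error back to the LLM."""
    for _ in range(max_tries):
        scope = {"df": df}
        try:
            exec(code, scope)
            return scope.get("result")
        except Exception as err:
            # The model receives the previous code, the error message, and the
            # table information, and proposes a corrected version.
            repair_prompt = (
                f"Table context:\n{table_info}\n\n"
                f"The following code failed:\n{code}\n\n"
                f"Error message: {err}\n"
                "Return corrected Python code that stores the answer in `result`."
            )
            code = llm(repair_prompt)  # hypothetical LLM client
    return None  # all attempts failed
        </preformat>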
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Answer Synthesis</title>
        <p>The competition defines a constrained set of acceptable output formats, making an answer synthesis
module essential for converting raw code output into valid final answers. Based on the type of question
and output, the system formats responses into one of the following categories (a minimal dispatcher is
sketched after this list):
• Boolean: Valid values include True/False, Yes/No, or Y/N (case-insensitive).
• Category: A value or substring from a single cell in the dataset.
• Number: A numeric value, potentially derived from calculations such as average, maximum, or
minimum.
• List[category]: A fixed-length list of categorical values (e.g., ['cat', 'dog']). The question
wording determines whether uniqueness or duplicates are expected.
• List[number]: A fixed-length list of numerical values, formatted similarly to List[category].</p>
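        <p>A minimal dispatcher illustrating this formatting logic might look as follows; the answer_type labels are assumptions about how the expected type is represented, not the task’s official identifiers.</p>
        <preformat>
def synthesize_answer(result, answer_type: str):
    """Coerce raw execution output into one of the allowed answer formats."""
    if answer_type == "boolean":
        return bool(result)  # True/False, Yes/No, Y/N are accepted case-insensitively
    if answer_type == "number":
        return float(result)  # e.g. an average, maximum, or minimum
    if answer_type in ("list[category]", "list[number]"):
        # Fixed-length list; order is kept and uniqueness is not forced here,
        # since the question wording decides whether duplicates are expected.
        return list(result)
    return str(result)  # category: a value or substring from a single cell
        </preformat>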
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <sec id="sec-4-1">
        <title>4.1. Development Phase</title>
        <p>Our experimental results on the development set are summarized in Table ??. We categorize the
experiments into three distinct groups:
1. Table Input Format and Size: This group investigates how different input representations
affect performance. Specifically, we vary whether the table rows are provided to the LLM in plain
string, Markdown, or JSON format. Additionally, we vary the number of rows given: the base
case includes 2 rows, while the extended input includes 5 rows.
2. Prompt Strategy: We explore the impact of different prompting techniques, including zero-shot,
CoT [12], and role-play [13]. These strategies aim to compare LLM performance across different
prompts.
3. Language Handling: Since the original dataset is in Spanish, we investigate whether language
affects LLM performance. We compare three approaches: translating only the question into
English, translating both the question and the column headers, and using Spanish prompts.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Testing Phase</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work, we investigated the impact of prompt design, input formatting, and language handling
on table-based question answering using large language models. Our experiments, conducted using
DeepSeek-V3 and DeepSeek-R1, revealed that structured JSON inputs and longer table contexts (i.e.,
more rows) significantly improve model performance. Among prompting strategies, role-play and
chain-of-thought techniques offer moderate gains.</p>
      <p>Notably, our results indicate that maintaining the original language (Spanish) in the prompt leads to
better performance on the development set, but may generalize less effectively to the testing phase.
Furthermore, we demonstrated that DeepSeek-V3 outperforms GPT-4o-mini on baseline tasks, justifying
its use in all subsequent experiments.</p>
      <p>Future work could explore dynamic prompting, multilingual fine-tuning, and hybrid symbolic-neural
table reasoning to further enhance performance in low-resource or multilingual domains.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was supported by The VNUHCM-University of Information Technology’s Scientific
Research Support Fund.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, we used GPT-4 and Grammarly to check grammar and
spelling and to edit the content for clarity and coherence. After using these tools, we reviewed and edited
the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Osés-Grijalba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Ureña-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Cámara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          ,
          <article-title>Overview of PRESTA at IberLEF 2025: Question Answering over Tabular Data in Spanish</article-title>
          , in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2025), co-located with the 41st Conference of the Spanish Society for Natural Language Processing (SEPLN 2025), CEUR-WS.org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. Á.</given-names>
            <surname>González-Barba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiruzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jiménez-Zafra</surname>
          </string-name>
          ,
          <article-title>Overview of IberLEF 2025: Natural Language Processing Challenges for Spanish and other Iberian Languages</article-title>
          , in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2025), co-located with the 41st Conference of the Spanish Society for Natural Language Processing (SEPLN 2025), CEUR-WS.org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Osés-Grijalba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Ureña-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Cámara</surname>
          </string-name>
          ,
          <article-title>Towards quality benchmarking in question answering over tabular data in Spanish</article-title>
          , Procesamiento del Lenguaje Natural 73 (
          <year>2024</year>
          ) 283-296.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>