<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Reasoning through Code: Question Answering on Spanish Tabular Data using LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adarsh Prakash Vemali</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>R Raghav</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Independent Researcher</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of California</institution>
          ,
          <addr-line>San Diego</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Tabular Question Answering (QA) plays a key role in enabling structured data understanding. This paper presents our system for the IberLEF 2025 PRESTA Task on Question Answering over Spanish Tabular Data. We target the DataBenchSPA QA task using a code-generation based strategy with Large Language Models (LLMs) in a zero-shot setting. Our approach formulates QA as a Python code generation task, leveraging LLMs to produce executable Pandas code that queries the relevant tabular data. We adopt a unified LLM model that jointly performs reasoning and code generation. We optimize prompt design by introducing schema compression techniques such as column aliasing and sampling representative rows to reduce context size. Our execution-aware retry mechanism improves output correctness by iteratively refining erroneous code based on runtime feedback. Our system achieved an accuracy of 73.0% on the blind test set, ranking 6th overall. These results highlight the feasibility of efficient and accurate tabular QA.</p>
      </abstract>
      <kwd-group>
<kwd>Tabular Question Answering</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Code Generation</kwd>
        <kwd>Prompt Engineering</kwd>
        <kwd>Zero-Shot Prompting</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Tabular Question Answering (QA) is a pivotal area in Natural Language Processing (NLP), enabling
automated information extraction and enhancing accessibility to structured datasets. The IberLEF 2025
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] task, PRESTA: Question Answering over Tabular Data in Spanish / Preguntas y Respuestas sobre
Tablas en Español [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], advances this field through the introduction of DataBenchSPA [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] - the first
large-scale, Spanish-language benchmark comprising diverse tabular datasets across multiple domains.
This task requires systems to accurately answer natural language questions over tables, with expected
answer types including boolean values, categories, numbers, or lists of these.
      </p>
      <p>Our system adopts a code-generation based methodology, employing Python and Pandas in
combination with open-source Large Language Models (LLMs) to produce executable code for answering
questions. Given a question and its associated table, the system generates code that loads the data,
performs computations, and outputs the answer. This strategy enhances interpretability by encoding
the reasoning process directly into code and remains agnostic to specific table schemas. Importantly,
we optimize for a zero-shot setting to minimize the amount of table data passed to the LLM, ensuring
low resource usage and scalability.</p>
      <p>
        Our previous work [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] evaluated both agentic and unified LLM approaches, with the unified pipeline
proving more effective in maintaining consistency and reducing errors. Accordingly, we adopt the
unified approach for this Spanish-language task, leveraging prompt engineering and input minimization
- via row sampling and column aliasing - to enhance performance and efficiency.
      </p>
      <p>Since our method relies on code execution, one core challenge was ensuring the syntactic and
semantic correctness of generated code. We addressed this by implementing iterative retry mechanisms
that feed back error messages to the LLM for refinement. These retries proved effective in producing
robust, executable code.</p>
      <sec id="sec-1-1">
        <title>Dataset</title>
        <sec id="sec-1-1-1">
          <title>DatabenchSPA (Test)</title>
        </sec>
        <sec id="sec-1-1-2">
          <title>DatabenchSPA (Validation)</title>
          <p>−</p>
          <p>
            Our system achieved an accuracy of 73% on the test set and ranked 6ℎ overall, across both proprietary
and open-source models, as shown in Table 1. Phi-4, a Small Language Model (under 8 parameters)
proved to be our most efective system for the task. Key takeaways from our participation include
the significant impact of prompt engineering in guiding LLM reasoning and the feasibility of high
performance even with limited table input. Our approach and code is publicly available1. This is the
same codebase used in our earlier English-language SemEval task [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ], and it works out of the box for
this Spanish variant with minor modifications.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>
        Our work builds upon the dataset collection originally introduced in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. As summarized in Table 2,
the training set comprises 6 distinct datasets containing a total of 150 questions. The development set
includes 4 datasets with 100 questions, while the blind test set consists of 10 datasets and 100 questions
specifically designed to assess LLM performance on question answering over structured, real-world
Spanish tabular data.
      </p>
      <sec id="sec-2-1">
        <title>Dataset Split</title>
        <sec id="sec-2-1-1">
          <title>Training Set</title>
        </sec>
        <sec id="sec-2-1-2">
          <title>Development Set</title>
        </sec>
        <sec id="sec-2-1-3">
          <title>Blind Test Set</title>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Number of Datasets</title>
        <p>6
4
10</p>
      </sec>
      <sec id="sec-2-3">
        <title>Number of Questions</title>
        <p>150
100
100
2.1. Dataset Details
The entire dataset collection comprises a total of 31,318 rows and 1,740 columns, spanning a wide
variety of day-to-day domains. Across all datasets, 250 questions were designed to evaluate question
answering performance. For both the training and development sets, each dataset contains a consistent
25 questions, whereas the test set includes 10 questions per dataset, as shown in Table 3. An overview
of the datasets used is presented in Table 4.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Answer Type</title>
      </sec>
      <sec id="sec-2-5">
        <title>Sample</title>
        <sec id="sec-2-5-1">
          <title>Boolean</title>
        </sec>
        <sec id="sec-2-5-2">
          <title>Number</title>
        </sec>
        <sec id="sec-2-5-3">
          <title>Category</title>
        </sec>
        <sec id="sec-2-5-4">
          <title>List[category]</title>
        </sec>
        <sec id="sec-2-5-5">
          <title>List[number]</title>
        </sec>
        <sec id="sec-2-5-6">
          <title>True/False 4, 10</title>
        </sec>
        <sec id="sec-2-5-7">
          <title>Automotive, United States [apple, mango] [2, 4, 6, 8, 10]</title>
        </sec>
      </sec>
      <sec id="sec-2-6">
        <title>Number of Questions</title>
      </sec>
      <sec id="sec-2-7">
        <title>Train Dev Test</title>
        <p>
          30 19 20
40 22 20
28 22 20
25 18 20
26 18 20
1https://github.com/Adarsh-Vemali/LLM-Driven-Code-Generation-for-Zero-Shot-Question-Answering-on-Tabular-Data
2.2. Related Work
LLMs have significantly transformed code generation, especially through the incorporation of zero-shot
reasoning and structured prompting strategies such as Chain-of-Thought (CoT) prompting [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. CoT
enables models to break down tasks into intermediate reasoning steps, which is critical for complex
logical tasks. However, in code generation scenarios, particularly those involving syntactic correctness
and structural precision, CoT alone often proves insufficient. Approaches like CodeCoT [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] mitigate
these shortcomings by leveraging self-examination mechanisms that iteratively refine outputs based on
execution feedback.
        </p>
        <p>
          In contrast to agentic CoT systems, which rely on iterative, autonomous execution of subtasks [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ],
structured prompting techniques like SCoT [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] incorporate programmatic structures (e.g., sequences,
branches, and loops) directly into the reasoning process. These structured methods provide more
deterministic and syntactically aligned outputs compared to agentic retries, which often incur higher
computational cost without guaranteed improvements in correctness.
        </p>
        <p>
          Despite these advances in general-purpose code generation, applying LLMs effectively to code
generation over tabular datasets - especially in languages such as Spanish - remains underexplored.
Recent efforts like MarIA [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] introduce high-quality Spanish language models that support downstream
tasks like Spanish QA. These models, when paired with tabular QA tasks, present a unique challenge
due to the complexity of schema representations and the variability of column labels.
        </p>
        <p>
          One promising direction in this space is the use of summarization techniques for label compression.
Tools like AutoDDG [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] demonstrate that LLMs can generate semantic summaries of column metadata,
enabling more human-readable, compressed, and consistent schema representations. This not only
facilitates better generalization across datasets but also supports unified modeling strategies that
reduce dependency on handcrafted agentic pipelines. By simplifying schema understanding through
summarization, models are better positioned to align input representations with reasoning steps needed
for accurate QA and code generation.
        </p>
        <p>
          Recent studies [
          <xref ref-type="bibr" rid="ref11">11, 12, 13, 14</xref>
          ] have delved into fine-grained, task-specific adaptations of LLMs across
a range of domains, underscoring the value of domain-specific prompting and structured reasoning
techniques. Likewise, generative methods have been extended to structured tasks such as sentiment
analysis [15], showcasing the versatility of structured generation beyond conventional language tasks.
        </p>
        <p>Moreover, retrieval-augmented approaches [16, 17] and extensive instructive fine-tuning [18] present
viable avenues for strengthening reasoning reliability and generalization in tabular data contexts.</p>
        <p>In summary, while agentic CoT pipelines provide flexibility, our work argues for a shift towards
unified, summarization-augmented modeling frameworks that leverage structured CoT and semantic
compression of tabular schemas. This direction is particularly beneficial for under-resourced languages
like Spanish, where task adaptation, schema abstraction, and robust reasoning are essential for scalable
and accurate code generation in QA systems.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. System Overview</title>
      <p>
        As illustrated in Figure 1, our system reframes the tabular QA task as a code generation problem in a
zero-shot setting. Given a tabular dataset D and a natural language question Q, the system operates through
a structured pipeline designed to maximize the effectiveness of LLMs. Our approach closely follows
the methodology outlined in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], with minor adaptations. First, it extracts the dataset schema (which
includes the column names and data types), and a sample of the first n rows (with n = 3 yielding the
best results; see Table 5). This is then used to construct a structured prompt in the system-user-assistant
format, providing the model with essential context while minimizing unnecessary verbosity.
      </p>
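      <p>For concreteness, this schema-extraction and prompt-assembly step can be sketched as follows; this is a minimal sketch, and names such as build_prompt and the exact message wording are illustrative placeholders rather than our precise implementation.</p>
      <preformat>
# Minimal sketch of schema extraction and prompt assembly.
import pandas as pd

def build_prompt(df, question, n_rows=3):
    # Schema: column names paired with their pandas dtypes.
    schema = {col: str(dtype) for col, dtype in df.dtypes.items()}
    # Sample: the first n rows serialized as a list of dicts (n = 3 worked best).
    sample = df.head(n_rows).to_dict(orient="records")
    system_msg = ("You are an expert Python data engineer. "
                  "Generate only pandas code; store the answer in `result`.")
    user_msg = (f"Dataframe Schema:\n{schema}\n\n"
                f"Sample Rows:\n{sample}\n\n"
                f"User Question:\n{question}")
    # System-User-Assistant chat format; the assistant turn is left blank.
    return [{"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg}]
      </preformat>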
      <p>One challenge we encountered was that some datasets contained very long column names. Including
these in the prompt often pushed the input size beyond the LLM’s token limit of 16,384 tokens. To
address this, we used the same LLM to first automatically generate concise aliases for the column
names, effectively summarizing them while preserving their semantics. These shorter names were then
substituted throughout the schema and dataset before constructing the final prompt. Once the prompt
is prepared, the LLM then generates Python code intended to load the dataset and compute the answer
to the question. The generated code is then parsed, checked for errors, and executed to produce the
final output.</p>
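      <p>A minimal sketch of the aliasing step is shown below, assuming a hypothetical llm_chat helper that sends chat messages to the model and returns its text reply; the actual prompt we used is reproduced in Appendix A.1.1.</p>
      <preformat>
# Minimal sketch of LLM-based column aliasing (llm_chat is a hypothetical
# helper; the real prompt is given in Appendix A.1.1).
import json

ALIAS_SYSTEM = ("You shorten long Spanish column names. Return a JSON array of "
                "objects with keys 'original_name' and 'aliased_name'.")

def alias_long_columns(df, llm_chat, max_len=50):
    long_cols = [c for c in df.columns if len(c) &gt; max_len]
    if not long_cols:
        return df, {}
    reply = llm_chat([
        {"role": "system", "content": ALIAS_SYSTEM},
        {"role": "user", "content": json.dumps(long_cols, ensure_ascii=False)},
    ])
    mapping = {item["original_name"]: item["aliased_name"]
               for item in json.loads(reply)}
    # Substitute the aliases throughout the schema before building the prompt.
    return df.rename(columns=mapping), mapping
      </preformat>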
      <p>We introduce a robust retry strategy that takes execution outcomes into account. When the generated
code fails, the resulting error message is returned to the language model, enabling it to revise and
improve the code. This process can repeat up to three times, leading to more reliable outputs and a
notable decrease in execution errors.</p>
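      <p>This loop can be sketched as follows; llm_chat and extract_code are hypothetical helpers (the latter strips markdown fences from the model reply), not our exact implementation.</p>
      <preformat>
# Minimal sketch of the execution-aware retry loop (up to three attempts).
def answer_with_retries(messages, df, llm_chat, extract_code, max_retries=3):
    for _ in range(max_retries):
        code = extract_code(llm_chat(messages))
        env = {"df": df.copy()}
        try:
            exec(code, env)        # the generated code must define `result`
            return env["result"]
        except Exception as err:
            # Feed the runtime error back so the model can revise its code.
            messages = messages + [
                {"role": "assistant", "content": code},
                {"role": "user", "content": f"The code failed with: {err!r}. "
                                            "Fix it and return only code."},
            ]
    return None  # all attempts failed
      </preformat>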
      <p>
        By minimizing the input size, we significantly improve computational efficiency and scalability,
while also enabling the language model to better contextualize relevant information. Additionally, the
introduction of column name aliasing represents a targeted adaptation of our earlier methodology,
further enhancing the system’s flexibility and robustness.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Agentic CoT Approach</title>
        <p>
          In our previous work [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], we implemented an Agentic Chain-of-Thought (CoT) approach that separated
reasoning and code generation across two specialized LLMs. While this method offered explicit reasoning
traces and modularity, it suffered from frequent mismatches between the reasoning output and the
code execution logic. For instance, LLaMA often produced incomplete or imprecise reasoning steps,
which CodeLLaMA or Phi-4 then misinterpreted, leading to cascading execution failures. These issues,
compounded by the limitations of smaller model sizes due to resource constraints, ultimately hindered
performance. Motivated by these limitations, we adopt a unified LLM architecture for this task, enabling
seamless integration of reasoning and code synthesis within a single model. This approach reduces
misalignment and enhances overall robustness.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Unified LLM Approach</title>
        <p>
          Building on [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], which showed that a unified LLM pipeline outperforms multi-step CoT approaches by
enhancing consistency and reducing error propagation, we adopt a similar strategy in this study. To
accommodate lengthy column names, we introduce an intermediate column aliasing step, replacing
them with concise, semantically meaningful alternatives. Given the effectiveness of integrating
reasoning and code generation into a single inference pass, we hypothesized that this unified approach
would again yield strong performance. In this paper, we experimented with a range of LLMs, including
LLaMA, CodeLLaMA, and Phi-4, to evaluate their ability to handle diverse question types and table
structures within this framework.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Challenges and Solutions</title>
        <p>
          We encountered several challenges in this task:
• Long Column Names: Some datasets included excessively long column names, which exceeded
the LLM’s token limit of 16,384 tokens when including the dataset schema in the LLM prompt. For
column names that were more than 50 characters long, we used the LLM itself to generate concise
aliases that preserved the semantics of the original names. Refer to Appendix subsection A.2 for
examples.
• Table Reasoning: Following prior results, we pass only the schema and first three rows of the
dataframe to help the model reason over tabular inputs without exceeding token limits.
• Output Formatting: We enforce strict formatting constraints in the prompt to ensure that
generated outputs match expected types and structures.
• Code Reliability: We reuse the error feedback loop from our previous work, limiting to three
retries. As before, this balances correction and efficiency, with negligible gains beyond three
attempts.
• Prompt Strategy: We retain the System-User-Assistant template introduced earlier, which
continues to perform well with the Spanish variant of the task. A detailed description of our
prompt template can be found in Appendix subsection A.1.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>
        Following our previous work [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we adopt a zero-shot prompting strategy using quantized large
language models (LLMs), with a strong emphasis on prompt engineering to enhance performance.
All interactions are framed using a structured System–User–Assistant format, which has consistently
yielded reliable results. For efficient inference with minimal performance trade-offs, we leverage
dynamically quantized 4-bit models from Unsloth [19]. Specifically, we experiment with Meta LLaMA
3.1 8B Instruct (https://huggingface.co/unsloth/Meta-Llama-3.1-8B-Instruct-unsloth-bnb-4bit), CodeLLaMA,
and Phi-4. For each configuration, we use the respective LLM to alias columns exceeding 50 characters,
replacing them with shorter, semantically meaningful names before generating predictions on the
transformed dataset. All models were run using a single NVIDIA T4 GPU on Google Colab, demonstrating
that strong performance can be achieved even under constrained computational budgets.
      </p>
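      <p>As an illustration, loading one of these quantized checkpoints with Unsloth takes only a few lines; this is a minimal sketch, and the settings shown are indicative rather than our exact configuration.</p>
      <preformat>
# Minimal sketch: loading a dynamically quantized 4-bit model with Unsloth.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-unsloth-bnb-4bit",
    max_seq_length=16384,  # matches the token limit discussed above
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's faster inference mode
      </preformat>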
      <sec id="sec-4-1">
        <title>Model</title>
      </sec>
      <sec id="sec-4-2">
        <title>LLaMA</title>
      </sec>
      <sec id="sec-4-3">
        <title>CodeLLaMA Phi-4</title>
        <p>1
3
1
3
1
3
51%
46%
62%
65%
71%
75%
51%
47%
60%
62%
69%
73%
respective LLM to alias columns exceeding 50 characters, replacing them with shorter, semantically
meaningful names before generating predictions on the transformed dataset. All models were run using
a single NVIDIA T4 GPU on Google Colab, demonstrating that strong performance can be achieved
even under constrained computational budgets.
4.1. Evaluation Function
We follow the oficial evaluation setup provided by the organizers, based on the databench_eval 5
package. The evaluation function has been adapted to allow for minor formatting diferences, making it
more forgiving of small variations in output. It uses flexible matching strategies depending on the data
type, including tolerant comparisons for booleans, numbers, and categorical outputs.</p>
        <p>The evaluation function applies type-specific heuristics to allow for flexible matching. Boolean
values are accepted in multiple valid forms (e.g., “true/false,” “yes/no”), categorical outputs use string
comparison, and dates are parsed for equivalence. Numerical answers are rounded to two decimal
places, and lists are compared as sets to tolerate reordering.</p>
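        <p>A simplified sketch of these matching heuristics is shown below; the actual databench_eval implementation differs in detail, and this version is only illustrative.</p>
        <preformat>
# Illustrative sketch of type-specific flexible matching (not the exact
# databench_eval code).
def flexible_match(pred, truth):
    truthy = {"true", "yes", "y", "si", "sí"}
    falsy = {"false", "no", "n"}
    p, t = str(pred).strip().lower(), str(truth).strip().lower()
    # Booleans: accept any equivalent surface form.
    if t in truthy | falsy:
        return (p in truthy) == (t in truthy)
    # Numbers: round to two decimal places before comparing.
    try:
        return round(float(p), 2) == round(float(t), 2)
    except ValueError:
        pass
    # Lists: compare as sets to tolerate reordering.
    if p.startswith("[") and t.startswith("["):
        as_set = lambda s: {x.strip(" '\"") for x in s.strip("[]").split(",")}
        return as_set(p) == as_set(t)
    # Categories and other strings: case-insensitive comparison.
    return p == t
        </preformat>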
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>DataBenchSPA (Validation)</title>
      </sec>
      <sec id="sec-5-2">
        <title>DataBenchSPA (Test)</title>
      </sec>
      <sec id="sec-5-3">
        <title>Ablation study on the validation dataset to decide on the ideal number of rows to be provided to the model.  is</title>
        <p>the number of rows chosen from the dataset which is sampled and provided to the model</p>
      <p>Our system achieved strong results, ranking 6th in the General category (see Table 1). Improvements in
prompt design and the integration of execution-aware retry strategies further enhanced both accuracy
and efficiency. These findings reinforce the suitability of our approach for real-world tabular QA
scenarios. Comprehensive results and ablation analyses are provided in Table 5.</p>
      <p>
        We found that Phi-4 achieved the best performance when provided with three example rows (n = 3).
Consequently, we used Phi-4 for our final test predictions, making our system a lightweight solution
based on a Small Language Model (under 8 billion parameters).
      </p>
      <sec id="sec-5-1">
        <title>5.1. Key Findings</title>
        <p>
          Our key findings through our ablations (Table 5) include:
• As our system achieved results similar to [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], incorporating column aliasing did not
degrade the system’s performance.
• Our system is highly adaptable to diverse datasets across different languages, implying robust
performance across a wide range of tabular QA tasks.
• Execution-aware retry mechanisms improved answer correctness with minimal additional cost,
effectively resolving common syntactic and formatting errors.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>
        Our system demonstrates the viability of LLM-driven code generation for zero-shot question answering
over tabular data in Spanish. Building on our prior work [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we reaffirm the effectiveness of a unified
LLM pipeline in maintaining logical coherence and minimizing error propagation. Prompt engineering
remains central to guiding model behavior, while input minimization strategies - such as sampling
representative rows and aliasing long column names - proved effective for improving contextual relevance
and computational efficiency.
      </p>
      <p>While our current setup showcases the strength of LLMs in multilingual tabular QA, there remains
significant untapped potential in agentic approaches. Future work will focus on refining the
agentic framework to better accommodate multilingual reasoning, complex transformations, and more
sophisticated interactions with structured data.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, ChatGPT was used to provide suggestions for academic writing.
The authors subsequently reviewed and edited the content as necessary and take full responsibility for
the final version of this publication.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Appendix</title>
      <sec id="sec-8-a1">
        <title>A.1. Prompt Template</title>
        <p>We use a system-user-assistant chat-based prompting framework. For this task, we design two separate
prompt templates: one for column aliasing and another for code generation to answer questions over
tabular data. Both templates are provided below.</p>
      <sec id="sec-8-1">
        <title>A.1.1. Column Aliasing Prompt Template</title>
        <p>System:
You are a helpful assistant who understands Spanish and can shorten long column
names in tabular datasets. Your goal is to generate concise aliases (in the same
language as the original column) for column names that are longer than 50
characters. The alias should preserve the original meaning while being as short
and clear as possible.</p>
        <p>User:
Generate the alias for the following column name
Column name: &lt;original column name from the dataset&gt;
Output Format:
Return a JSON array, where each output is structured as follows:
{
"original_name": &lt;original column name from the dataset&gt;,
"aliased_name": "your_answer_here",
}
The value of the "aliased_name" key in the JSON should be the aliased column name
for the original column.</p>
      </sec>
      <sec id="sec-8-2">
        <title>A.1.2. Question Answering Prompt Template</title>
        <p>System:
You are an expert Python data engineer.
Your task is to generate pandas code based on a structured reasoning process.
You only generate code, no references or explanation - just code.
You generate only 20 lines of code at max.</p>
        <p>User:
Dataframe Schema:
&lt;The schema of the dataset which would contain the column name and its data type&gt;
Sample Rows:
&lt;The first 3 rows of the dataset serialized into a list of dictionaries, where
each dictionary represents a row with column names as keys&gt;
User Question:
&lt;The question that is asked about the dataset&gt;
Expected Output Format:
Generate runnable Python code that follows the given reasoning using pandas.
The code should assume that the dataframe is already loaded as `df`.
The final output should be stored in a variable named `result`.</p>
        <p>The expected answer type is unknown, but it will always be one of the following:
* Boolean: True/False, "Y"/"N", "Yes"/"No" (case insensitive).
* Category: A value from a cell (or substring of a cell) in the dataset.
* Number: A numerical value from a cell or a computed statistic.
* List[category]: A list of categories (unique or repeated based on context).
Format: ['cat', 'dog'].</p>
        <p>* List[number]: A list of numbers.</p>
        <p>Given the user question, you need to write code in pandas, assume that you already
have df.</p>
        <p>Generate only the code.</p>
        <p>The assistant prompt was left blank during inference. The code obtained from the model would then
be run on the dataset to evaluate the answer to the question.</p>
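        <p>As an illustration, for a hypothetical question such as "¿Cuál es la edad media de los encuestados?" ("What is the average age of the respondents?"), the model might return code of the following form; the column name here is invented:</p>
        <preformat>
# Illustrative model output for a hypothetical question; `df` is assumed to
# be already loaded, and "Edad" is an invented column name.
result = round(df["Edad"].mean(), 2)
        </preformat>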
      </sec>
      </sec>
      <sec id="sec-8-3">
        <title>A.2. Column Aliasing</title>
        <p>Across the 10 test datasets provided for the task, 765 out of 1,740 total columns had names longer than
50 characters. All of these lengthy column names were aliased using the LLM. The table below presents
examples of column names that were aliased in the test set.</p>
        <table-wrap id="tab6">
          <label>Table 6</label>
          <caption>
            <p>Examples of long column names aliased in the test set.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Dataset</th><th>Original Column Name</th><th>Aliased Column Name</th></tr>
            </thead>
            <tbody>
              <tr><td>ES_01_40db_Igualdad</td><td>Imagina una pareja formada por un hombre y una mujer que tuvieran exactamente las mismas condiciones laborales. En caso de que fuera necesario que una de las dos personas dejase de trabajar para cuidar de familiares dependientes (hijos/as, padres, madre)</td><td>Rol_cuidados_pareja_igual_laboral</td></tr>
              <tr><td>ES_02_40dB_Dormir</td><td>Si mañana se celebraran unas nuevas elecciones generales, ¿cuál sería la probabilidad de que acudieras a votar? Utiliza una escala de 0 a 10, en la que ‘0’ representa ‘con toda seguridad, no iría a votar’ y ‘10’ ‘con toda seguridad, sí iría a votar’</td><td>Prob_Votar_0a10</td></tr>
              <tr><td>ES_03_CIS_Enero_Marzo_2023</td><td>Motivos entrevista incorrecta. Negativa que ofrece dudas, no coincide la dirección ni datos personales. Posibilidad de error en el teléfono</td><td>Motivo_duda_datos_contacto</td></tr>
              <tr><td>ES_04_CEA_Barometro_Andaluz_Septiembre_2023</td><td>Del siguiente conjunto de medidas relacionadas con la gestión del agua, ¿cuál considera usted que sería la más adecuada para Andalucía? Dos opciones de respuesta_ Construir más desaladoras para potabilizar el agua</td><td>Medida_gestion_agua_deseada_desaladoras</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. Á.</given-names>
            <surname>González-Barba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiruzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jiménez-Zafra</surname>
          </string-name>
          ,
          <article-title>Overview of IberLEF 2025: Natural Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2025), co-located with the 41st Conference of the Spanish Society for Natural Language Processing (SEPLN 2025), CEUR-WS</article-title>
          . org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Osés-Grijalba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Ureña-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Cámara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          , Overview of PRESTA at IberLEF 2025:
          <article-title>Question Answering Over Tabular Data In Spanish, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2025), co-located with the 41st Conference of the Spanish Society for Natural Language Processing (SEPLN 2025), CEUR-WS</article-title>
          . org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. O.</given-names>
            <surname>Grijalba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A. U.</given-names>
            <surname>López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Cámara</surname>
          </string-name>
          ,
          <article-title>Towards quality benchmarking in question answering over tabular data in spanish</article-title>
          ,
          <source>Proces. del Leng. Natural</source>
          <volume>73</volume>
          (
          <year>2024</year>
          )
          <fpage>283</fpage>
          -
          <lpage>296</lpage>
          . URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6617.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Raghav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Vemali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Aswal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhupal</surname>
          </string-name>
          , Scottyposeidon at semeval
          <article-title>-2025 task 8: Llm-driven code generation for zero-shot question answering on tabular data</article-title>
          ,
          <source>in: Proceedings of the 19th International Workshop on Semantic Evaluation, SemEval</source>
          <year>2025</year>
          , Vienna, Austria,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kojima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matsuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Iwasawa</surname>
          </string-name>
          ,
          <article-title>Large language models are zero-shot reasoners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>22199</fpage>
          -
          <lpage>22213</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Bu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cui</surname>
          </string-name>
          , Codecot:
          <article-title>Tackling code syntax errors in cot reasoning for code generation</article-title>
          ,
          <source>arXiv preprint arXiv:2308.08784</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Autoagent: A fully-automated and zero-code framework for llm agents</article-title>
          , arXiv e-prints (
          <year>2025</year>
          ) arXiv-
          <fpage>2502</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <article-title>Structured chain-of-thought prompting for code generation</article-title>
          ,
          <source>ACM Transactions on Software Engineering and Methodology</source>
          <volume>34</volume>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gutiérrez-Fandiño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Armengol-Estapé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pàmies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Llop-Palao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Silveira-Ocampo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. P.</given-names>
            <surname>Carrino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gonzalez-Agirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Armentano-Oller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rodriguez-Penagos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Villegas</surname>
          </string-name>
          ,
          <article-title>Maria: Spanish language models</article-title>
          ,
          <source>arXiv preprint arXiv:2107.07253</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Y. Liu,
          <string-name>
            <given-names>A.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Freire</surname>
          </string-name>
          , et al.,
          <article-title>Autoddg: Automated dataset description generation using large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2502.01050</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mullick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nandy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Kapadnis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Patnaik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Raghav</surname>
          </string-name>
          ,
          <article-title>Fine-grained intent classification in the legal domain</article-title>
          ,
          <source>arXiv preprint arXiv:2205.03509</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Mullick, A. Nandy, M. Kapadnis, S. Patnaik, R. Raghav, R. Kar, An evaluation framework for legal document summarization, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, S. Piperidis (Eds.), Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2022, pp. 4747–4753. URL: https://aclanthology.org/2022.lrec-1.508/.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] A. Mullick, I. Mondal, S. Ray, R. Raghav, G. Chaitanya, P. Goyal, Intent identification and entity extraction for healthcare queries in Indic languages, in: A. Vlachos, I. Augenstein (Eds.), Findings of the Association for Computational Linguistics: EACL 2023, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 1870–1881. URL: https://aclanthology.org/2023.findings-eacl.140/. doi:10.18653/v1/2023.findings-eacl.140.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] R. Raghav, J. Rauchwerk, P. Rajwade, T. Gummadi, E. Nyberg, T. Mitamura, Biomedical question answering with transformer ensembles, in: CLEF (Working Notes), 2023.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] R. Raghav, A. Vemali, R. Mukherjee, ETMS@IITKGP at SemEval-2022 task 10: Structured sentiment analysis using a generative approach, in: Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), 2022, pp. 1373–1381.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] P. Carragher, A. Jha, R. Raghav, K. M. Carley, Quantifying memorization and retriever performance in retrieval-augmented vision-language models, arXiv preprint arXiv:2502.13836 (2025).</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] P. Carragher, N. Rao, A. Jha, R. Raghav, K. M. Carley, SegSub: Evaluating robustness to knowledge conflicts and hallucinations in vision-language models, arXiv preprint arXiv:2502.14908 (2025).</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] R. Raghav, A. P. Vemali, D. Aswal, R. Ramesh, P. Tusham, P. Rishi, TartanTritons at SemEval-2025 task 10: Multilingual hierarchical entity classification and narrative reasoning using instruct-tuned LLMs, in: Proceedings of the 19th International Workshop on Semantic Evaluation, SemEval 2025, Vienna, Austria, 2025.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] D. Han, M. Han, Unsloth team, Unsloth, 2023. URL: http://github.com/unslothai/unsloth.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>