<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NIL-UCM at PROFE 2025: Adapting QA Models to Multiple-Choice Tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anna-Maria Winkler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Díaz</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Facultad de Filología, Universidad Complutense de Madrid</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Facultad de Informática and ITC, Universidad Complutense de Madrid</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>These working notes evaluate pre-trained transformer models (BERT-large, RoBERTa-base, and RoBERTa-large) for multiple-choice question answering in Spanish reading comprehension exams. Using the IC-UNED-RC-ES dataset, the models were tested in a zero-shot setting with a token-matching approach. Exploratory experiments with semantic similarity, entailment, and generative prompts highlighted both limitations and future potential. The results underline the need for fine-tuning, robust prompting, and semantic alignment for reliable educational NLP applications.</p>
      </abstract>
      <kwd-group>
        <kwd>multiple choice</kwd>
        <kwd>Spanish reading comprehension</kwd>
        <kwd>NLP</kwd>
        <kwd>transformer models</kwd>
        <kwd>BERT</kwd>
        <kwd>RoBERTa</kwd>
        <kwd>generative language models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In recent years, the exponential growth of large language models (LLMs) has revolutionized the
field of Natural Language Processing (NLP), enabling major advancements across tasks such as
machine translation, summarization, and question answering. This study focuses on the application
of these models to multiple-choice question answering (MCQA) tasks that involve short fictional or
simplified contexts, similar to those found in educational assessments. Specifically, we explore how
different types of language models—encoder-only versus generative—perform in this setting, and
what challenges arise from task design, model architecture, and prompt formulation.</p>
      <p>The motivation for this work is rooted in the increasing relevance of AI-assisted tools in
education. MCQA is widely used in academic and professional settings, but correcting such questions
can be time-consuming. At the same time, students increasingly rely on AI systems for learning and
test preparation. Understanding how current models perform in MCQA tasks is key to developing
effective and fair educational technologies.</p>
      <p>
        Our experiments compare the performance of BERT [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], RoBERTa [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and Mistral [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] on a variety
of MCQA formats. We examine the impact of prompt design, model architecture, and output
interpretation on model accuracy and reliability.
      </p>
    </sec>
    <sec id="sec-2">
      <title>1. Task description</title>
      <sec id="sec-2-1">
        <title>1.1. Subtask Description</title>
        <p>In this subtask, each instance consists of:</p>
        <list list-type="bullet">
          <list-item><p>A short context paragraph (text)</p></list-item>
          <list-item><p>A question based on the text</p></list-item>
          <list-item><p>A set of predefined answer choices (typically A–D)</p></list-item>
        </list>
        <p>Only one option is correct per question. The task requires systems to select the correct answer based on the information in the text. This mirrors real-world reading comprehension exercises used in language proficiency testing. An illustrative instance is sketched below.</p>
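        <p>For illustration only, one instance can be pictured as a record of roughly the following shape; the field names (such as questionId) are the ones we use later when indexing predictions, and the exact schema of the distributed files may differ:</p>
        <preformat>
# Hypothetical example of one MCQA instance; the real file schema may differ.
instance = {
    "questionId": "A2-ex03-q1",   # identifier later used to index predictions
    "text": "Hola, Marta: te escribo para invitarte a mi fiesta el sábado.",
    "question": "¿Para qué escribe el autor la carta?",
    "options": {
        "A": "Para invitar a una fiesta",
        "B": "Para pedir un favor",
        "C": "Para cancelar una cita",
        "D": "Para dar las gracias",
    },
}
        </preformat>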
        <sec id="sec-2-1-1">
          <title>Dataset: IC-UNED-RC-ES</title>
          <p>
            For this study, we use the IC-UNED-RC-ES dataset [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ], which contains reading comprehension
tasks extracted from real exams created by the Instituto Cervantes for the evaluation of Spanish as a
Foreign Language (ELE). These exams are designed by human experts and cover levels from A1 to
C2 according to the Common European Framework of Reference for Languages (CEFR). The exams
span various difficulty levels (A1–C2) and were converted into machine-readable format as part of a
collaboration between the Instituto Cervantes and UNED, under a formal agreement signed in May
2021 and funded by the DeepInfo Project (AEI PID2021-127777OB-C22).
          </p>
          <p>In total, the full dataset comprises 282 exams, 855 exercises, and 6,146 annotated responses (from
16,570 possible options). For PROFE 2025, approximately 50% of the data is used, while the rest is
reserved for future editions to prevent overfitting and contamination by public LLMs. The gold
standard labels are not publicly distributed for the same reason.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Types of Questions Tackled</title>
          <p>In this paper, we focus on standard multiple-choice questions where only one answer is correct.
These questions are:</p>
        <list list-type="bullet">
          <list-item><p>Taken directly from ELE exams created by Instituto Cervantes.</p></list-item>
          <list-item><p>Designed to assess reading comprehension and general language understanding.</p></list-item>
          <list-item><p>Spanning levels from A1 to B2, where texts are short, often fictional or semi-authentic, and the questions require inference, reasoning, and detail recognition.</p></list-item>
        </list>
        <p>A typical example includes a brief personal letter followed by four or five questions, each with four possible answers. The distractors are carefully constructed to appear plausible, making the task non-trivial for AI models.</p>
      </sec>
        <sec id="sec-2-2-1">
          <title>Task Challenges</title>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>This task presents several challenges:</title>
        <list list-type="bullet">
          <list-item><p>Contextual dependence: many questions require precise understanding of the context and implicit reasoning.</p></list-item>
          <list-item><p>Ambiguity: distractor options can be semantically close to the correct answer.</p></list-item>
          <list-item><p>Varying difficulty: questions span CEFR levels from A1 (basic) to B2 (upper-intermediate), introducing variation in vocabulary complexity and required inference.</p></list-item>
          <list-item><p>Answer format: models must return the correct letter label (e.g., “B”).</p></list-item>
        </list>
        <p>In summary, this task provides a realistic benchmark for evaluating how well current NLP
models—both encoder-based and generative—can perform in educational language assessment
contexts.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2. Methodology</title>
      <sec id="sec-3-1">
        <title>2.1. Models Explored</title>
        <p>To evaluate the effectiveness of different pre-trained language models on contextual multiple-choice question answering (MCQA), we tested three transformer models:</p>
        <list list-type="bullet">
          <list-item><p>RoBERTa-base (deepset/roberta-base-squad2)</p></list-item>
          <list-item><p>RoBERTa-large (deepset/roberta-large-squad2)</p></list-item>
          <list-item><p>BERT-large (bert-large-uncased-whole-word-masking-finetuned-squad)</p></list-item>
        </list>
        <p>All three models have been fine-tuned on the SQuAD dataset and follow an extractive QA paradigm, returning the text span from the context that best answers the given question. None of these models were fine-tuned on the PROFE dataset; instead, they were used in a zero-shot evaluation setup.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Adapting QA Models to Multiple-Choice Tasks</title>
        <p>As these models are trained to return free-text answers, an answer-matching strategy was
required to map their outputs to one of the predefined multiple-choice options.</p>
        <p>For each question, the model receives the context and question as input and returns an answer span. This span is then normalized using the following steps:</p>
        <list list-type="bullet">
          <list-item><p>Lowercasing and Unicode normalization (removing diacritics).</p></list-item>
          <list-item><p>Removal of punctuation and splitting into tokens.</p></list-item>
          <list-item><p>Filtering out Spanish stopwords (via NLTK).</p></list-item>
        </list>
        <p>Each multiple-choice option is normalized in the same way. We then calculate the token overlap score between the model's answer and each of the options. The option with the highest score is selected as the model’s final prediction for that question. This heuristic provides a lightweight but effective way to adapt extractive QA models to the classification nature of MCQA tasks; a minimal sketch of the matching step is given below.</p>
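        <p>A minimal sketch of this matching heuristic (helper and variable names are ours, chosen for illustration) is:</p>
        <preformat>
# Sketch of the normalization and token-overlap matching described above.
import re
import unicodedata

from nltk.corpus import stopwords  # requires the NLTK "stopwords" corpus

SPANISH_STOPWORDS = set(stopwords.words("spanish"))

def normalize(text):
    """Lowercase, strip diacritics and punctuation, drop Spanish stopwords."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(c for c in text if not unicodedata.combining(c))
    tokens = re.findall(r"\w+", text)
    return {t for t in tokens if t not in SPANISH_STOPWORDS}

def match_option(predicted_span, options):
    """Return the option key whose tokens overlap most with the QA span."""
    span_tokens = normalize(predicted_span)
    scores = {key: len(span_tokens.intersection(normalize(text)))
              for key, text in options.items()}
    return max(scores, key=scores.get)
        </preformat>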
      </sec>
      <sec id="sec-3-3">
        <title>2.3. Prompting Strategy for Generative Models (Future Work)</title>
        <p>Although the focus of this work is on encoder-based models, we also explored the use of
prompt-based methods with decoder-only generative models, such as Mistral, as a complementary approach.
The goal was to evaluate how generative language models perform in multiple-choice question
answering when prompted in natural language. We designed a set of dynamic prompts,
automatically generated for each example, that included the reading comprehension text, the
question, and all answer choices.</p>
        <p>However, due to time constraints and the additional complexity involved in integrating and
evaluating generative models (e.g., handling variations in output, rate limits, prompt tuning), this
approach was not fully implemented or evaluated within the current phase of the project. It remains
as future work, with the potential to compare generation-based answers to the results of extractive
models.</p>
      </sec>
      <sec id="sec-3-4">
        <title>2.4. Technical Implementation</title>
        <p>All experiments were implemented in Python using Google Colab as the execution environment. The dataset was loaded from Google Drive, and predictions were saved back in .json format. The following tools and libraries were used:</p>
        <list list-type="bullet">
          <list-item><p>Hugging Face Transformers: loading and executing QA pipelines for all models.</p></list-item>
          <list-item><p>NLTK: Spanish stopword removal for normalization.</p></list-item>
          <list-item><p>Standard Python libraries: json, unicodedata, re, and os for file and text handling.</p></list-item>
        </list>
        <p>Each model was executed using the pipeline() utility from Hugging Face. Model outputs were processed in a loop across all questions in the test dataset, and final predictions were stored in a dictionary indexed by questionId; a minimal sketch of this loop is given below.</p>
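        <p>A sketch of this evaluation loop, assuming a hypothetical input file name and the instance field names used above (the exact schema may differ), is:</p>
        <preformat>
# Minimal sketch of the evaluation loop; "profe_test.json" and the item field
# names are assumptions made for illustration only.
import json
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

with open("profe_test.json", encoding="utf-8") as f:
    items = json.load(f)

predictions = {}
for item in items:
    span = qa(question=item["question"], context=item["text"])["answer"]
    # match_option() is the token-overlap heuristic sketched in Section 2.2
    predictions[item["questionId"]] = match_option(span, item["options"])

with open("predictions.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)
        </preformat>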
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Results and Discussion</title>
      <sec id="sec-4-1">
        <title>3.1. Final Evaluation Results</title>
        <p>The final evaluation results were significantly lower than expected. On the test set, all three
models—bert-large-uncased-whole-word-masking-finetuned-squad, deepset/roberta-base-squad2,
and deepset/roberta-large-squad2—achieved nearly identical accuracy scores. Both BERT and
RoBERTa-base reached 32.28%, while RoBERTa-large performed only marginally better with 32.51%.
These figures are surprisingly low given earlier development experiments and indicate a possible
mismatch between the development and test distributions.</p>
        <p>One particularly noticeable pattern was the disproportionately high number of predictions for
option A across questions. This suggests that the overall accuracy may partly reflect the natural
distribution of correct answers rather than actual model understanding. In other words, the models
might have been defaulting to option A when uncertain or when the token-matching heuristic failed,
which would artificially inflate scores if A happened to be correct more often.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Exploratory Experiments on the Development Set</title>
        <p>During an earlier experimental phase, several alternative methods were tested on a separate
development set. An overview of the models and their main results is presented in Table 1. These
approaches yielded considerably better results and provided insights into model behaviour under
different configurations. The most successful method combined RoBERTa-base with a
straightforward token-matching heuristic, reaching an accuracy of 70%. Despite its simplicity, this
method proved stable and precise across a variety of input structures.</p>
        <p>Other experiments explored semantic similarity through embeddings. A version using MiniLM to
compare the cosine similarity between the predicted span and the answer options reached 50%
accuracy. However, this approach performed less reliably when the context was short or when
answers shared overlapping vocabulary.</p>
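        <p>A sketch of this similarity-based selection, assuming the sentence-transformers library and a multilingual MiniLM checkpoint (the exact model variant we used may differ), is:</p>
        <preformat>
# Hedged sketch: pick the option whose embedding is closest to the QA span.
# The checkpoint name below is an assumption; any MiniLM sentence encoder works.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def pick_by_similarity(predicted_span, options):
    keys = list(options)
    span_emb = encoder.encode(predicted_span, convert_to_tensor=True)
    option_embs = encoder.encode([options[k] for k in keys], convert_to_tensor=True)
    scores = util.cos_sim(span_emb, option_embs)[0]
    return keys[int(scores.argmax())]
        </preformat>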
        <p>Generative models also showed promise. Using google/flan-t5-large in a free-text generation
setup led to 60% accuracy on the development set. The model demonstrated good semantic
understanding, but its performance was sensitive to input length and prompt truncation. Another
approach using facebook/bart-large-mnli to compute textual entailment between the
context-question pair and each answer option also achieved 50%, proving particularly effective when the distractors were clearly distinguishable.</p>
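        <p>The entailment-based scoring can be sketched as follows, using the Hugging Face zero-shot-classification pipeline with facebook/bart-large-mnli; the Spanish hypothesis wording is an illustrative assumption:</p>
        <preformat>
# Hedged sketch of entailment scoring between the context-question pair and
# each answer option; the hypothesis template wording is assumed.
from transformers import pipeline

nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def pick_by_entailment(text, question, options):
    keys = list(options)
    candidates = [options[k] for k in keys]
    result = nli(
        f"{text} {question}",
        candidate_labels=candidates,
        hypothesis_template="La respuesta correcta es {}.",
    )
    best_label = result["labels"][0]  # labels come back sorted by score
    return keys[candidates.index(best_label)]
        </preformat>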
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Prompting Limitations and Generative Model Issues</title>
        <p>Initial attempts to include a generative model such as Mistral in the final evaluation were
ultimately not successful. The prompt templates used in early tests did not explicitly include the
letter labels (A, B, C, D) associated with the answer choices. As a result, the model frequently
defaulted to generating "A" regardless of the content, or returned ambiguous outputs such as “Not
detected.” These issues underline the importance of prompt structure when working with generative
language models.</p>
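        <p>For reference, a label-aware prompt of the kind we later experimented with could be assembled roughly as follows (the wording is illustrative, not the exact template used):</p>
        <preformat>
# Illustrative only: a prompt that makes the letter labels explicit, so the
# generative model can answer with a single letter. Wording is an assumption.
def build_labelled_prompt(text, question, options):
    lines = [f"Texto: {text}", f"Pregunta: {question}", "Opciones:"]
    for letter in sorted(options):
        lines.append(f"{letter}) {options[letter]}")
    lines.append("Responde únicamente con la letra de la opción correcta (A, B, C o D).")
    return "\n".join(lines)
        </preformat>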
        <p>Due to time constraints and the need for more robust prompt engineering and post-processing
logic, generative models were excluded from the final comparison. However, the exploratory results
suggest that, with better formatting and output handling, generation-based models could become
competitive alternatives to extractive methods for this task.</p>
      </sec>
      <sec id="sec-4-4">
        <title>3.4. Interpretation and Future Directions</title>
        <p>The contrast between the final test results and earlier development experiments highlights several
important considerations. First, pre-trained QA models—even those fine-tuned on SQuAD—do not
generalize well to multiple-choice tasks without task-specific adaptation. Second, naive matching
strategies like token overlap are insufficient in cases where distractor options are semantically
similar. Third, the design of prompts and answer formats plays a critical role in how models interpret
and respond to inputs.</p>
        <p>Future work should explore fine-tuning encoder-based models on a subset of the PROFE dataset,
implementing semantic similarity measures beyond token matching, and developing more reliable
prompting strategies for generative models. With these improvements, it may be possible to
significantly raise the performance ceiling on this type of contextual MCQA task.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Conclusions and Future Work</title>
      <p>The main objective of this study was to evaluate the performance of different NLP approaches in
answering multiple-choice reading comprehension questions in Spanish, using the IC-UNED-RC-ES
dataset provided by the ProFE 2025 shared task. Our experiments show that RoBERTa outperformed
the other tested models during the development phase, but final test performance was
notably low, with all models achieving just over 32% accuracy. This study therefore highlights the
challenges of applying extractive QA models to multiple-choice tasks without fine-tuning or more
robust adaptation strategies.</p>
      <p>Several areas for improvement were identified. First, performance could likely be enhanced
through few-shot prompting, fine-tuning on a representative subset of the PROFE dataset, or more
advanced semantic matching techniques. The evaluation also highlighted the limitations of using a
static development/test split. Future iterations should explore a dynamic or stratified sampling
approach to ensure consistency in task difficulty and domain distribution, minimizing the risk of
overfitting to a particular data slice.</p>
      <p>Although generative models such as Mistral were not fully integrated into the final evaluation for
the competition due to time constraints, an experiment using an improved prompt, with a subset of
49 questions from the test set, showed a promising 62.81% accuracy. Future work should continue
exploring different prompting strategies and output formats to improve performance and stability in
generation-based setups.</p>
      <p>Beyond experimental refinements, this research has broader implications for educational
applications. Accurate automatic MCQA systems could support large-scale assessment, adaptive
learning environments, and AI-based tutoring systems. However, ensuring fairness, robustness, and
transparency remains essential for their responsible deployment.</p>
      <p>This publication is part of the R&amp;D&amp;I project HumanAI-UI, Grant PID2023-148577OB-C22 (Human-Centered AI: User-Driven Adaptative Interfaces, HumanAI-UI), funded by MICIU/AEI/10.13039/501100011033 and by FEDER/UE.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT for grammar and spelling checking and for paraphrasing and rewording. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.-W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding [Preprint]</article-title>
          . arXiv. https://arxiv.org/abs/1810.04805
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ott</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Stoyanov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach [Preprint]</article-title>
          . arXiv. https://arxiv.org/abs/1907.11692
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>A. Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sablayrolles</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mensch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bamford</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaplot</surname>
            ,
            <given-names>D. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de las Casas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bressand</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lengyel</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lample</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saulnier</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lavaud</surname>
            ,
            <given-names>L. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lachaux</surname>
            ,
            <given-names>M.-A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stock</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le Scao</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lavril</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lacroix</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>El Sayed</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>Mistral 7B [Preprint]</article-title>
          . arXiv. https://arxiv.org/abs/2310.06825
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Rodrigo</surname>
            ,
            <given-names>Á.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moreno-Álvarez</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>García-Plaza</surname>
            ,
            <given-names>A. P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peñas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agerri</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fruns-Jiménez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Soria-Pastor</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          (
          <year>2025</year>
          ).
          <article-title>Overview of ProFE at IberLEF 2025: Language proficiency evaluation</article-title>
          .
          <source>Procesamiento del Lenguaje Natural</source>
          ,
          <volume>75</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>González-Barba</surname>
            ,
            <given-names>J. Á.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chiruzzo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Jiménez-Zafra</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          (
          <year>2025</year>
          ).
          <article-title>Overview of IberLEF 2025: Natural language processing challenges for Spanish and other Iberian languages</article-title>
          .
          <source>In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2025), co-located with the 41st Conference of the Spanish Society for Natural Language Processing (SEPLN 2025)</source>
          . CEUR-WS.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Alcántara</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Pérez</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2025</year>
          ).
          <article-title>IC-UNED-RC-ES: Recursos para comprensión de lectura y preguntas de opción múltiple en español [Dataset]</article-title>
          .
          <source>In Proceedings of the IberLEF and ProFE 2025 Workshop</source>
          . Association for Computational Linguistics. https://www.aclweb.org/portal/content/profe-2025-iberlef-2025-call-participation
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>