<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Zaragoza, Spain
miguelangel.rodriguez@lsi.uned.es (M. Rodríguez-García); raul.cabido@urjc.es (R. Cabido); michel.maes@urjc.es
(M. Maes-Bermejo); soto.montalvo@urjc.es (S. Montalvo)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>URJC-Team at PROFE2025@IberLEF: Deep Language Comprehension Using Transformers Voting Architecture</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Miguel Ángel Rodríguez-García</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raúl Cabido</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michel Maes-Bermejo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soto Montalvo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dpto. Informática y Estadística, Universidad Rey Juan Carlos (URJC)</institution>
          ,
          <addr-line>Calle Tulipán s/n, Móstoles 28933, Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>NLP IR Group, Dpto. Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia (UNED)</institution>
          ,
          <addr-line>Juan del Rosal 16, Madrid 28040</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Machine Reading Comprehension (MRC) is a challenging task focused on developing systems that can read text and understand its meaning, which is crucial for question-answering systems. This area has gained significant attention from the research community, leading to the emergence of shared evaluation campaigns that promote exciting challenges in the field. In this context, IberLEF 2025 includes the PROFE task, which centers on deep language comprehension. In this work, we describe the system proposed for one of the subtasks in this challenge: matching answers. Our approach is primarily based on fine-tuning three Spanish language models (RoBERTa, BETO, and BERT) and applying a voting technique to determine the best match for each question. To train the models, we used external training data from a public dataset for Spanish question answering.</p>
      </abstract>
      <kwd-group>
        <kwd>Transformers</kwd>
        <kwd>Deep comprehension</kwd>
        <kwd>Question answering</kwd>
        <kwd>Answer extraction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        One of the main challenges of natural language processing is enabling machines to understand
text as well as humans do. Machine Reading Comprehension (MRC) is an essential task for evaluating
natural language understanding. A common way to assess someone’s understanding of a
text is to present a passage and ask questions about it that must be answered correctly. Analogously, MRC requires
training a machine to predict the answer to a question given a relevant context [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. If the machine
can answer the question correctly, we can conclude that it is capable of understanding the context and
inferring the answer from it [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Various works have focused on MRC, providing new datasets to encourage research [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], as well as
benchmarks designed to thoroughly assess the MRC capabilities of LLMs [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], among others. In this
line, the PROFE task [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] at IberLEF 2025 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] focuses on deep language comprehension and comprises three
subtasks, each containing several exercises of the same type. In the first subtask (multiple
choice), given a multiple-choice question, systems must select the correct answer among the candidates.
In the second subtask (matching), each exercise contains two sets of texts, and systems must find, for each text in the first set, the
text in the second set that best matches it. Finally, in the third subtask (filling-in-the-gap),
each exercise contains a text with several gaps corresponding to textual fragments, and systems must
determine the correct position for each fragment.
      </p>
      <p>This work presents a system developed for the second subtask. We approach the challenge by combining
different models designed for Spanish Question Answering tasks.</p>
      <p>The remainder of the article is organized as follows: Section 2 reviews the
approaches analyzed for building the proposed system. Section 3 offers an overview of the dataset and
methods used in the campaign. Section 4 discusses the results obtained. Finally, Section 5 outlines
the conclusions drawn from our participation in the shared evaluation campaign and suggests how
to improve the proposal in future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>In this section, we present a brief analysis of the Machine Reading Comprehension task within the
teaching domain. We outline various approaches that have been foundational in developing the
architecture for our participation in the PROFE task.</p>
      <p>
        Zoukagh [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] focuses on designing and implementing BERThiz, a model aimed at understanding
contextual cues and generating accurate responses for masked inputs. The proposed model surpasses
other models within the same category and language. On the Multi-Language Question Answering (MLQA)
task, the proposed model obtains the same results as the BETO model. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the authors present a study
that explores the challenging task of automatically generating Spanish multiple-choice questions using
multilingual language models.
      </p>
      <p>
        Ruiz et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] propose a question-answering system designed to assist workers in the manufacturing
industry in requesting information from technical manuals. The authors manually annotated these
manuals and conducted experiments with various language models (five Spanish and one
multilingual). The results suggest that the proposed system can enhance the efficiency of the operators,
as it allows them to perform routine tasks without physically consulting the manual.
      </p>
      <p>
        A more general approach is presented in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. This work analyzes the performance
of Spanish language models across various natural language tasks. The study examines the
convergence of Spanish sequence-to-sequence models on different downstream tasks, such as
summarization, splitting and rephrasing, dialogue, and question answering, among others.
      </p>
      <p>We also reviewed additional state-of-the-art approaches that are not presented in this section;
their analysis was crucial in motivating us to employ language models for the PROFE task.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Materials and methods</title>
      <p>In this section, we discuss the models used to structure the system’s architecture and the resources
utilized during the training stage.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>Training is a crucial stage when adapting a deep learning model built for one task to work on another.
Since the number of examples provided in the PROFE task was limited, we decided to seek out similar
datasets related to question-answering tasks. From these datasets, we adapted the information to fit the
format required for our specific task in order to train the selected models.</p>
        <p>For this purpose, we chose the Spanish Question-Answering Corpus (SQAC) dataset<sup>1</sup>, which contains
6,247 contexts and 18,817 questions with their answers, with one to five questions per context.
The dataset is divided into three files (training, development, and testing) of different sizes. We
used only the training file. This file is structured as a list of paragraphs, each consisting of
two attributes: i) "context", which holds the context of the query; and ii) "qas", a list
of entries with three attributes: question, id, and answer. The "question" is the query itself, "id" is
its unique identifier, and "answer" provides both the position of the answer and the corresponding
text. To create the synthetic dataset for training, we defined the following procedure.</p>
        <p>We mapped this information into a structure similar to the one provided by the organizers. For each question
and its answer, we selected four different contexts from the dataset, labeled A, B, C, and D, ensuring
that exactly one of these contexts contained the correct answer. This process was repeated for each question
available in the dataset. Consequently, we compiled hundreds of questions into the synthetic dataset.</p>
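        <p>The mapping described above can be illustrated with the following minimal sketch. It is not the code used in our experiments: the function name build_matching_items, the output field names, and the uniform random sampling of the three distractor contexts are assumptions made only for illustration. The input is assumed to be the list of paragraphs with "context" and "qas" attributes described earlier.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random

LABELS = ["A", "B", "C", "D"]


def build_matching_items(paragraphs, seed=0):
    """Turn SQAC-style paragraphs into four-option matching items.

    Hypothetical helper: each item pairs one question with four candidate
    contexts, exactly one of which is the context the question came from.
    """
    if len(paragraphs) < len(LABELS):
        raise ValueError("need at least four paragraphs to sample distractors")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    items = []
    for index, paragraph in enumerate(paragraphs):
        correct = paragraph["context"]
        # Every other paragraph is a candidate distractor context (assumption).
        others = [p["context"] for i, p in enumerate(paragraphs) if i != index]
        for qa in paragraph["qas"]:
            options = rng.sample(others, len(LABELS) - 1) + [correct]
            rng.shuffle(options)  # avoid always placing the answer last
            items.append({
                "id": qa["id"],
                "question": qa["question"],
                "options": dict(zip(LABELS, options)),
                "label": LABELS[options.index(correct)],
            })
    return items
```
]]></preformat>
        <p>Under these assumptions, each resulting item carries the question, four labeled candidate contexts, and the label of the context that contains the answer, so an item can be checked with item["options"][item["label"]].</p>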
        <sec id="sec-3-1-1">
          <title><sup>1</sup> https://huggingface.co/datasets/PlanTL-GOB-ES/SQAC</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Architecture of the system</title>
        <p>The primary approach taken to tackle the task was to utilize pre-trained models for question-answering
tasks in the Spanish language. Initially, we tested several language models, including DistilBERT, BERT,
RoBERTa in its various versions, and T5, among others. We then ranked these models based on their
performance using the prepared dataset during training. From this evaluation, we selected the top three
models to form the proposed system.</p>
        <p>The models that achieved the best results in our initial experiments were BETO<sup>2</sup>, BERT
(MMG/bert-base-spanish-wwm-cased-finetuned-sqac-finetuned-squad), and RoBERTa<sup>3</sup>. We combined these models
using a voting strategy: for each question, each model voted for the best-matching answer, and
the answer that received the most votes was selected as the final response to link with the
question. Figure 1 illustrates the architecture of the designed system.</p>
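        <p>The voting strategy can be illustrated with the following minimal sketch of majority voting over per-model predictions. It is an illustration under assumptions rather than our actual implementation: the function names are hypothetical, and the tie-breaking rule (when all models disagree, the vote of the model listed first wins) is an assumption, since ties are not discussed above.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def majority_vote(votes):
    """Return the most voted label.

    Assumption: ties are broken in favor of the earliest vote in the list,
    i.e. the model that is listed first.
    """
    counts = Counter(votes)
    best = max(counts.values())
    for label in votes:
        if counts[label] == best:
            return label


def ensemble_predict(per_model_predictions):
    """Combine predictions from several models, question by question.

    per_model_predictions maps a model name to its list of predicted labels,
    one label per question, all lists in the same question order.
    """
    columns = zip(*per_model_predictions.values())
    return [majority_vote(list(votes)) for votes in columns]
```
]]></preformat>
        <p>For example, if two of the three models choose option A for a question and the third chooses D, the ensemble returns A.</p>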
        <p>The proposed architecture, as shown in Figure 1, consists of a pipeline with three main stages. The
first stage is preprocessing: the dataset selected for training did not match the format
provided by the organizers, so its structure was reorganized to
group the contexts, questions, and answers into a format that facilitates training.
The second stage is training: the three selected models were trained on the dataset
compiled in the previous stage. Finally, the third stage produces the final prediction
by running a voting scheme over the three models integrated into the ensemble.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>This section outlines the experiments conducted for the shared evaluation campaign. The first step
involved selecting language models trained specifically on the Spanish Question-Answering Corpus.
We configured each model with the following settings: 3 epochs, a learning rate of 5e-5, a weight decay
of 0.01, and a batch size of 16. Subsequently, we used our dataset to train and evaluate
each model independently. The dataset was split into three parts: 70% for training, 20% for testing, and 10% for
validation. Table 1 presents the accuracy of each model on the test split.</p>
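      <p>The settings and the 70/20/10 split above can be summarized in the following minimal sketch. It only restates the reported values in code form and is not our training script: the names FineTuneConfig, split_dataset, and accuracy are hypothetical, and the shuffling with a fixed seed before splitting is an assumption.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters reported in the text, shared by all three models."""
    epochs: int = 3
    learning_rate: float = 5e-5
    weight_decay: float = 0.01
    batch_size: int = 16


def split_dataset(items, seed=0):
    """Shuffle and split into 70% train, 20% test, 10% validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # assumption: random split
    n_train = len(shuffled) * 7 // 10
    n_test = len(shuffled) * 2 // 10
    return {
        "train": shuffled[:n_train],
        "test": shuffled[n_train:n_train + n_test],
        "validation": shuffled[n_train + n_test:],
    }


def accuracy(predicted, gold):
    """Fraction of questions whose predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```
]]></preformat>
      <p>Integer arithmetic is used for the split sizes so that, for example, 100 items always yield exactly 70, 20, and 10 items, with any rounding remainder falling into the validation part.</p>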
      <p>Our results show a high level of performance for each model, with accuracy values very close
to 1. This led us to believe that combining the models in an ensemble could yield even better results,
which formed the basis of our hypothesis for building the ensemble model.</p>
      <p>Table 2 presents the official results achieved by the ensemble model on the test set provided by the
organizers.</p>
      <sec id="sec-4-1">
        <title><sup>2</sup> Josue/BETO-espanhol-Squad2; <sup>3</sup> PlanTL-GOB-ES/roberta-base-bne-sqac</title>
        <p>When comparing the performance of the ensemble model on our dataset with its performance
on the test set provided by the organizers, a significant difference in accuracy becomes apparent.
On the organizers' test set, the accuracy of the ensemble model dropped to 0.40, a loss of more than half of its accuracy.
This stark decline suggests that neither the individual models nor the ensemble can generalize
the knowledge learned during training to unseen instances. It also indicates a
substantial gap between the two datasets: the one we prepared and the one used in the test.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and future work</title>
      <p>The PROFE task challenges researchers to build a system capable of understanding information described
in natural language in Spanish and answering three different types of questions: multiple choice,
matching, and filling-in-the-gap. In this work, we focused specifically on the matching subtask, which
involves linking each context to the corresponding answer based on a given set of contexts and answers.</p>
      <p>We fine-tuned three Spanish language models (BETO, BERT, and RoBERTa), and applied a voting
ensemble approach to combine their individual predictions. Our low rankings on the leaderboard
indicate that there is significant room for improvement.</p>
      <p>We are considering several new directions for future work. On the one hand, we plan to use multilingual
datasets with multilingual models to evaluate their performance. On the other hand, we will set aside
the dataset compilation and explore language models with few-shot strategies to analyze their
convergence. Finally, we want to explore generative large language models from different points of view.</p>
      <p>Action funded by Universidad Rey Juan Carlos (URJC) under the 2024/2025 Educational Innovation
Projects Call, project code PIE24_092.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <sec id="sec-6-1">
        <title>The author(s) have not employed any Generative AI tools.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Smarnet: Teaching machines to read and comprehend like human</article-title>
          ,
          <source>arXiv preprint arXiv:1710.02772</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Che</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Teaching machines to read, answer and explain</article-title>
          ,
          <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          <volume>30</volume>
          (
          <year>2022</year>
          )
          <fpage>1483</fpage>
          -
          <lpage>1492</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Richardson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Burges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Renshaw</surname>
          </string-name>
          ,
          <article-title>MCTest: A challenge dataset for the open-domain machine comprehension of text</article-title>
          ,
          <source>in: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>193</fpage>
          -
          <lpage>203</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Mrceval: A comprehensive, challenging and accessible machine reading comprehension benchmark</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2503.07144. arXiv:2503.07144.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Á.</given-names>
            <surname>Rodrigo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Moreno-Álvarez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Pérez</given-names>
            <surname>García-Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Peñas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Agerri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fruns-Jiménez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>SoriaPastor</surname>
          </string-name>
          ,
          <article-title>Overview of PROFE at IberLEF 2025: Language Proficiency Evaluation</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>75</volume>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. Á.</given-names>
            <surname>González-Barba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiruzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jiménez-Zafra</surname>
          </string-name>
          ,
          <article-title>Overview of IberLEF 2025: Natural Language Processing Challenges for Spanish and other Iberian Languages</article-title>
          ,
          <source>in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2025), co-located with the 41st Conference of the Spanish Society for Natural Language Processing (SEPLN 2025), CEUR-WS.org</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Zoukagh</surname>
          </string-name>
          ,
          <article-title>Bethiz: Precise and complete bert model to fill out questions with correct answers, understanding the context for the spanish language</article-title>
          ,
          <source>Authorea Preprints</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>de Fitero-Dominguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garcia-Cabot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Garcia-Lopez</surname>
          </string-name>
          ,
          <article-title>Automated multiple-choice question generation in spanish using neural language models</article-title>
          ,
          <source>Neural Computing and Applications</source>
          <volume>36</volume>
          (
          <year>2024</year>
          )
          <fpage>18223</fpage>
          -
          <lpage>18235</lpage>
          . URL: https://doi.org/10.1007/s00521-024-10076-7.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Torres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>del Pozo</surname>
          </string-name>
          ,
          <article-title>Question answering models for human-machine interaction in the manufacturing industry</article-title>
          ,
          <source>Computers in Industry</source>
          <volume>151</volume>
          (
          <year>2023</year>
          )
          <fpage>103988</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V.</given-names>
            <surname>Araujo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Trusca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tufiño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-F.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <article-title>Sequence-to-sequence Spanish pre-trained language models</article-title>
          , in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.),
          <source>Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</source>
          , ELRA and ICCL, Torino, Italia,
          <year>2024</year>
          , pp.
          <fpage>14729</fpage>
          -
          <lpage>14743</lpage>
          . URL: https://aclanthology.org/2024.lrec-main.1283/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>