<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn>1613-0073</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.3390/educsci14060656</article-id>
      <title-group>
        <article-title>Models as Educational Evaluators: Reasoning and Explainability in Low-Resource Language Assessment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ekhi Azurmendi</string-name>
          <email>ekhi.azurmendi@ehu.eus</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Doctoral Symposium on Natural Language Processing</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>HiTZ Center - Ixa, University of the Basque Country UPV/EHU</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <fpage>24824</fpage>
      <lpage>24837</lpage>
      <abstract>
        <p>With large language models, significant advances have been made in various language processing tasks, such as sentiment analysis or machine translation. However, some tasks still prove difficult for these models, and the automatic evaluation of written compositions is one of them. In this task, the model must evaluate a composition according to guidelines or criteria. Different evaluation systems have been attempted, but much remains to be investigated. This project aims to develop an automatic evaluation system for compositions, focused on the Basque language. The goal is to evaluate compositions following the system's guidelines and provide feedback on the errors made. Additionally, the model should identify students' weaknesses and create exercises to address their deficiencies, contributing to the learning process. To develop this system, we will use advanced techniques, such as Retrieval Augmented Generation (RAG) or more general forms of text-conditioned learning, to overcome the limitations of language models, hallucinations, and the generation of incorrect information. We will work on zero-shot and few-shot learning techniques to follow guidelines not observed during training, as well as efficient parameter-tuning methods, such as supervised fine-tuning (SFT) or reinforcement learning. We will also define and create synthetic data and auxiliary tasks to aid the model's learning process. We will share with the scientific community the resources generated and the conclusions obtained throughout the project.</p>
      </abstract>
      <kwd-group>
        <kwd>NLP</kwd>
        <kwd>LLM</kwd>
        <kwd>automatic essay review</kwd>
        <kwd>reasoning</kwd>
        <kwd>explainable AI</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>2. Background and Related Work</title>
      <p>
        Thanks to LLMs, Natural Language Processing (NLP) has experienced unprecedented advancement
[
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. These models are trained on large amounts of text to learn language representations, and then
transfer that knowledge to new contexts after training on few manually annotated texts. These techniques
improve performance as the number of parameters increases [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ], as shown by results obtained in
different benchmarks such as SuperGLUE [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], MMLU [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and the BIG-bench evaluation benchmarks [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        Despite their unprecedented success, Large Language Models (LLMs) have significant limitations in
following instructions in unusual or highly specialized contexts [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. An obvious limitation of LLMs is the
phenomenon of hallucinations, where they provide erroneous information to users. Several efforts have
been made to overcome these limitations: techniques such as Retrieval Augmented Generation (RAG)
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] or more general forms of textual conditioning learning [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], better known as prompt learning or
in-context learning, offer great flexibility for adapting models to different domains and tasks. Conventional
fine-tuning can also help address this problem, but training on constantly changing data is not a practical
solution due to the high computational cost. Therefore, techniques such as the aforementioned RAG
are very useful when developing applications for the real world [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        Suitable recipes for effectively carrying out continual learning to keep LLM skills up to date have
not yet been clarified. Techniques such as RAG are appropriate for feeding models with information
extracted from external documents, and facilitate bringing the capabilities of huge LLMs to smaller
models [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ]. In this way, we can introduce new information not encoded in parameters without
training, which reduces update costs. Moreover, some studies have shown that the evaluation of essays
based on ideal references enhanced models’ performance [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], so RAG techniques are suitable to find
reference essays.
      </p>
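<p>The retrieval step described above can be sketched in a few lines. This is a toy illustration, not the project's system: it scores a hypothetical pool of reference essays against the student essay with bag-of-words cosine similarity, where a real RAG pipeline would use dense embeddings and a vector index.</p>

```python
# Minimal sketch of RAG-style retrieval of reference essays. The essays and
# the bag-of-words representation are illustrative assumptions.
from collections import Counter
import math

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=1):
    # Rank reference essays by similarity to the student essay.
    qv = Counter(query.lower().split())
    scored = [(cosine(qv, Counter(doc.lower().split())), doc) for doc in corpus]
    return [doc for _, doc in sorted(scored, key=lambda x: -x[0])[:k]]

references = [
    "the town festival was loud and joyful",
    "industrial pollution harms rivers and coasts",
]
essay = "pollution in rivers is a serious problem"
top = retrieve(essay, references, k=1)
# The retrieved reference then conditions the evaluator prompt.
prompt = f"Reference essay:\n{top[0]}\n\nEvaluate this essay:\n{essay}"
print(top[0])
```

<p>The retrieved essay is injected into the prompt without any parameter update, which is what keeps the update cost low.</p>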
      <p>
        One current trend is the simultaneous application of strategies that combine textual conditioning
learning with SFT. According to recent studies, LLMs have demonstrated generalization capability using
hybrid techniques when following guidelines not observed in training, surpassing existing zero-shot
capabilities, that is, improving models without training examples [
        <xref ref-type="bibr" rid="ref16 ref2">2, 16</xref>
        ]. However, even with the
spectacular results obtained, it is still unclear how knowledge obtained in one domain can be transferred
to another specialized domain. For example, adaptation techniques to a specialized domain such as
language learning have been little studied.
      </p>
      <p>In addition to the previous techniques, it has been observed that using external tools to complete
information that models don’t have encoded is a powerful strategy [17, 18, 19]. There are numerous
external tools: information retrieval, search engines, symbolic modules or code interpreters, for example.
The use of these tools opens multiple research avenues, in which models can be interactively trained
using reinforcement learning to adapt to these tools [20].</p>
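<p>The tool-use idea can be illustrated with a minimal controller loop. Everything here is hypothetical: the "CALL" convention, the tool name, and the dispatch format are invented for the sketch; real systems use structured function-calling interfaces.</p>

```python
# Hypothetical sketch of an external-tool loop: the model emits a tool call,
# the controller executes it and the result can be fed back to the model.
def calculator(expr):
    # Arithmetic module standing in for a real external tool.
    return str(eval(expr, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def run_turn(model_output):
    # A model output like "CALL calculator 1924-1866" triggers a tool;
    # anything else is treated as the final answer.
    if model_output.startswith("CALL "):
        _, name, arg = model_output.split(" ", 2)
        return ("tool_result", TOOLS[name](arg))
    return ("answer", model_output)

kind, value = run_turn("CALL calculator 1924-1866")
print(kind, value)
```

<p>Reinforcement learning then enters as the mechanism that teaches the model when to emit such calls rather than answer directly.</p>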
      <p>Although these lines of research generally aim to avoid hallucination and reasoning problems, they
can be used to create effective, powerful, and dynamic applications and systems.</p>
      <p>Another way to improve the skills of LLMs is to use reasoning strategies [21, 22], so that more
appropriate responses can be generated in exchange for more computational resources. Reasoning
strategies are completed through chains of reasoning, that is, the problem is divided into several steps
to facilitate its resolution. Previous studies have shown that LLMs are capable of simple reasoning [23],
but have problems when performing complex reasoning.</p>
      <p>For example, these models accurately respond to the birth and death dates of historical figures, but
often have problems when asked about how old these people were when they died. To address these
problems, reasoning chains are appropriate strategies, since by solving step by step, a more suitable
final result can be obtained.</p>
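<p>The age example can be made concrete. The sketch below mimics a reasoning chain by splitting the question into two fact lookups and one explicit arithmetic step; the fact table stands in for what a model would retrieve from its parameters.</p>

```python
# Sketch of a chain of reasoning: decompose "how old was X at death?" into
# retrievable facts plus explicit arithmetic. The fact table is illustrative.
facts = {"Miguel de Unamuno": {"born": 1864, "died": 1936}}

def age_at_death(person):
    # Step 1: retrieve birth year. Step 2: retrieve death year.
    # Step 3: combine the two facts with an explicit arithmetic step.
    born = facts[person]["born"]
    died = facts[person]["died"]
    steps = [
        f"{person} was born in {born}.",
        f"{person} died in {died}.",
        f"Therefore the age at death is {died} - {born} = {died - born}.",
    ]
    return died - born, steps

age, chain = age_at_death("Miguel de Unamuno")
print(age)  # 72
```

<p>The verbalized intermediate steps are exactly what makes the final answer checkable, which is the property this project needs for explainable feedback.</p>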
      <p>However, recent researchers have discovered new methods to improve the reasoning capabilities of
these models by applying reinforcement learning techniques [24]. The latest open source models are
able to outperform proprietary models in several benchmarks such as scientific question answering or
language understanding [25].</p>
      <p>
        Learning from textual commands can be seen as a method that enables interaction between humans and
computers. Work done in recent years has used learning from commands written in natural language
to guide computers towards different real-world tasks [
        <xref ref-type="bibr" rid="ref5">26, 5</xref>
        ].
      </p>
      <p>
        In relation to the human-computer interaction environment, education is an important field of
application where LLMs can have a great impact. The work carried out has shown that LLMs can be
helpful in writing or reading in the educational environment [27, 28]. Innovative research has also been
conducted [
        <xref ref-type="bibr" rid="ref10">29, 10</xref>
        ], using LLMs as aids in the classroom environment, supporting teacher-student
interaction, offering specialized teaching, or automatically assessing essays. The automatic creation
of adapted exercises is increasingly interesting given the creative competencies of LLMs; the automatic
generation of distractors and multiple-choice questions, for example, has already shown usable results
in practice [30, 31]. Although the competencies of LLMs for creating good exercises have improved
greatly, it is still not clear how models can be dynamically adapted to the specific needs of students to
improve exercises and feedback, nor how to adjust LLMs to create appropriate, high-quality exercises
across changing domains. Creating such exercises adapted to students is important to provide the most
appropriate help in the learning process.
      </p>
      <p>With Basque as the focus, automatic assessment systems for texts written by students have been
developed using traditional machine learning techniques [32]. These systems are based on the extraction
of linguistic features that take into account the evaluation criteria, subsequently using a classifier to
determine the level. However, traditional systems have shown problems adapting to new domains.</p>
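<p>A minimal sketch of such a traditional pipeline might look as follows, with invented features and toy per-level centroids; real systems of this kind use far richer, curriculum-aware linguistic features.</p>

```python
# Sketch of the traditional pipeline described above: hand-crafted linguistic
# features feed a simple classifier that predicts a proficiency level. The
# features, centroids, and example text are toy values.
def features(text):
    words = text.split()
    avg_len = sum(len(w) for w in words) / len(words)
    return [len(words), avg_len]  # length and a lexical-complexity proxy

def nearest_centroid(x, centroids):
    # Assign the level whose feature centroid is closest (squared distance).
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda level: dist(x, centroids[level]))

centroids = {"A2": [8.0, 3.5], "B2": [25.0, 5.5]}  # per-level feature means
level = nearest_centroid(features("a short and simple toy answer"), centroids)
print(level)
```

<p>The brittleness noted above follows directly from this design: the hand-crafted features encode one domain's evaluation criteria and do not transfer.</p>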
      <p>Using deep learning techniques, attempts have been made to improve the competencies of these
systems [33]. The weak point of these latest systems lies in the reasoning of responses, as the model
is not able to determine the errors or weaknesses identified through a textual description or give
indications that help in the writing process.</p>
      <p>Several attempts have been made to develop approaches to directly generate exercises from texts in
Basque [34]. The authors aimed to create exercises from free texts using rules and traditional learning
techniques, but in this approach the exercises are not revised based on the user’s errors; that is, the
system is not dynamic.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Description of the Proposed Research, Including the Main Hypotheses for Research</title>
      <p>Advances in NLP in the educational field have been achieved thanks to the surprising competencies that
LLMs have demonstrated. Very evident improvements have been achieved in tasks such as automatic
text evaluation, automatic exercise creation, or automatic text correction, among others. Despite these
advances, systems created through LLMs have shown limitations: 1) Annotated data is needed to
adjust the models, but the number of annotated texts in Basque in the educational field is low; 2) The
adjustment of models has high computational costs and there are problems when dynamically adapting
to new domains; 3) Despite the reasoning competencies of LLMs, there are still no adequate recipes for
developing systems to adapt to student needs.</p>
      <p>The objective of this project is to adapt LLMs to the educational domain to perform evaluations
following guidelines, explain errors or improvement needs to the user through explanations, and create
exercises or instructions dynamically adapting to the user’s needs. This main objective can be divided
into the following tasks or sub-objectives:</p>
      <p>Add guideline-following capacity to LLMs. In this way, reasoning competencies for model evaluations
would be developed and would be flexible enough to adapt to different evaluation criteria. It will be
essential to translate and adapt to Basque the techniques used in the current state of the art. We will
build on two main approaches:
• Use zero-shot or few-shot learning techniques, so the model has flexibility to adapt to new
domains when little annotated data is available. We will use RAG, textual conditioning and
command training methods to carry out this sub-objective.
• Study supervised learning methods by command, so that LLMs learn to follow domain instructions.</p>
      <p>Given the small amount of data, synthetic data or auxiliary tasks must be created to address this
problem.</p>
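<p>One way to sidestep the annotation bottleneck, sketched below with a single toy corruption rule, is to generate synthetic error-explanation pairs from clean text; the error type, the output format, and the sentences are illustrative assumptions, not the project's final design.</p>

```python
# Hypothetical sketch of synthetic data for an auxiliary task: corrupt clean
# sentences with rule-based errors so a model can learn to spot and explain
# them without manual annotation.
import random

def corrupt(sentence, rng):
    # Auxiliary-task generator: drop one word and record the error span.
    words = sentence.split()
    i = rng.randrange(len(words))
    removed = words.pop(i)
    return {
        "input": " ".join(words),
        "error_type": "missing_word",
        "explanation": f"The word '{removed}' is missing at position {i}.",
    }

rng = random.Random(0)  # fixed seed keeps the synthetic set reproducible
clean = ["the students wrote their essays carefully"]
dataset = [corrupt(s, rng) for s in clean]
print(dataset[0]["error_type"])
```

<p>Each generated pair couples an error with its explanation, which is the supervision signal the feedback objective needs.</p>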
      <p>Develop LLMs that adapt to the needs of users. To meet this objective, the model must have the ability
to plan, reason, explain and make appropriate comments. New Reinforcement Learning techniques,
such as Direct Preference Optimization (DPO) [35] or Group Relative Policy Optimization (GRPO) [24]
will be taken into account, as well as other state-of-the-art techniques used in reasoning, including
reasoning strategies.</p>
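<p>As a point of reference, the preference objective behind DPO [35] can be written down numerically: the loss is -log sigmoid of the beta-scaled margin by which the policy prefers the chosen answer over the rejected one, relative to a frozen reference model. The log-probabilities below are made-up numbers for illustration.</p>

```python
# Minimal numeric sketch of the DPO objective: reward the policy for ranking
# the preferred answer above the rejected one relative to a reference model.
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # margin = beta * ((pi - ref) on chosen minus (pi - ref) on rejected)
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Policy already prefers the chosen answer more than the reference does:
low = dpo_loss(-5.0, -9.0, -6.0, -8.0)
# Policy prefers the rejected answer: the loss is higher.
high = dpo_loss(-9.0, -5.0, -8.0, -6.0)
print(low, high)
```

<p>Because the loss needs only paired preferences and sequence log-probabilities, no separate reward model has to be trained, which is what makes it attractive under the data constraints of this project.</p>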
      <p>Develop a model that, independent of the domain, is capable of making comments based on student
errors and generating exercises. It is closely related to the previous objectives, as the model will
need reasoning and planning capabilities to successfully complete this task. Appropriate evaluation
methodologies and datasets must be created so that automatically created exercises are of good quality.</p>
      <p>Use appropriate adjustment and training methods to reduce computation costs. Reducing the costs
of adjustment, training, and use of LLMs is very important so that applications or real-world uses
are as accessible as possible. To carry out this objective, PEFT-type methods will be used, to achieve
competencies and the ability to follow instructions at reduced cost. We will rely on techniques known as
LoRA, QLoRA, or VeRA to meet this objective.</p>
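<p>The cost argument for these PEFT methods can be made concrete with simple parameter counting: a LoRA-style adapter updates a frozen d-by-k weight matrix through two low-rank factors, W + (alpha/r) * B A, so only r*(d+k) parameters are trained. The dimensions below are illustrative, not tied to any particular model.</p>

```python
# Back-of-the-envelope sketch of why low-rank adapters cut tuning costs.
def lora_params(d, k, r):
    full = d * k              # parameters touched by full fine-tuning
    adapter = r * (d + k)     # trainable parameters with a rank-r adapter
    return full, adapter

full, adapter = lora_params(d=4096, k=4096, r=8)
print(full, adapter, round(100 * adapter / full, 2))
```

<p>At rank 8 on a 4096-by-4096 matrix the adapter trains well under one percent of the parameters that full fine-tuning would touch.</p>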
    </sec>
    <sec id="sec-5">
      <title>4. Methodology and the Proposed Experiments</title>
      <p>This research project will use the research methodologies and functions presented below to carry out
the aforementioned objectives.</p>
      <p>The empirical method will be used; that is, the proposed hypotheses will be implemented in a
system and evaluated using publicly accessible datasets. In this evaluation, we will make a comparison
with systems available in the state of the art, validating the hypothesis when statistically significant
improvements are obtained in said comparison.</p>
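<p>As one concrete instantiation of this validation step, a paired sign-flip permutation test over per-item scores could be used; the scores below are invented and the choice of test is an assumption for illustration, not a commitment of the methodology.</p>

```python
# Sketch of a significance check for comparing two systems: a paired
# permutation (sign-flip) test on per-essay score differences.
import random

def permutation_pvalue(scores_a, scores_b, trials=5000, seed=0):
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(trials):
        # Under the null hypothesis the sign of each difference is arbitrary.
        flipped = sum(d * rng.choice((1, -1)) for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    return hits / trials

a = [0.9, 0.8, 0.85, 0.9, 0.7, 0.95, 0.8, 0.9]   # new system, per essay
b = [0.6, 0.5, 0.65, 0.6, 0.5, 0.70, 0.6, 0.7]   # baseline, per essay
print(permutation_pvalue(a, b))
```

<p>A hypothesis would be considered validated only when the p-value falls below a pre-registered threshold on the shared evaluation sets.</p>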
      <p>The objectives we propose in this thesis are ambitious, and it is likely that not all proposed hypotheses
will be fulfilled. For this reason, approaches will be tested one by one using the empirical method, and
those with the greatest future projection will be explored in depth, leaving the rest aside.</p>
      <p>The test banks and evaluation metrics and environments created and built throughout the project
will be shared with the scientific community. In this way, we could not only collect comparable results,
but also advice and improvements from the community. Although the results obtained may not be as
expected, we will meet the set objectives and learn from the comments of the scientific community.
We will disseminate the contributions made during the research at high-quality conferences with the
community (ICML, AAAI, IJCAI, ICLR, ACL, EACL, NAACL, EMNLP, all SCIE Class 1 - CORE A or A*).</p>
      <sec id="sec-5-1">
        <title>Research Tasks (RT) and Research Questions (RQ):</title>
        <p>RT0: Prepare the research environment. The first task is related to the preparation of the evaluation
environment. The task will be precisely defined and publicly available datasets will be collected.
Several works to follow guidelines have already been identified, but we need to check if there are new
developments at the start of the thesis. Available LLMs will also be selected and initial experiments will
be carried out to define an appropriate baseline. The main research questions in this section will be the
following:
• RQ0.A) In the educational field, what are the most appropriate evaluation environments and
tasks to evaluate LLM competencies?
• RQ0.B) What datasets are available and useful to us?
• RQ0.C) Of the publicly available LLMs, which are the most suitable for defining the baseline?
RT1: Adapt to follow evaluation guidelines in environments without training examples
(zero-shot scenario). The task will focus on training LLMs to follow evaluation guidelines. The goal
is to create a model independent of domains and guidelines. That is, the model should be able to adapt
to new guidelines. LLMs will be adapted to carry out the task in situations with no examples or few
examples. The main research questions in this section will be the following:
• RQ1.A) In an environment with no or few examples, what technique is most effective for
incorporating domain-associated knowledge into the model?
• RQ1.B) Are RAG and textual conditioning learning effective techniques for models to learn to
follow guidelines? If so, how can we implement them in the language learning domain?
RT2: Train LLMs by instruction to learn to follow guidelines. In this task, we will study
methods to overcome data scarcity, to learn to follow guidelines. In addition, auxiliary tasks will be
defined using synthetic data and avoiding the need for manual annotations. These tasks will help in the
learning process. We will focus on the following research questions:
• RQ2.A) In synthetic data generation, what are the most effective techniques for teaching LLMs
to follow instructions?
• RQ2.B) What auxiliary tasks might be most suitable for teaching LLMs to follow guidelines?
RT3: Align LLMs with user needs. The objective is to investigate different methods for LLMs to
carry out appropriate planning and reasoning and provide indications, explanations, and
recommendations to users. This task will include the following research questions:
• RQ3.A) What is the best way to give feedback to users after reasoning?
• RQ3.B) How could we adapt LLMs to generate exercises and comments based on the educational
needs and competencies of users?
• RQ3.C) Can we train an LLM dynamically and automatically to create exercises and comments?</p>
      </sec>
      <sec id="sec-5-2">
        <title>Annual Research Planning:</title>
        <p>First year, foundations: The work to be carried out during the first year will be related to tasks RT0
and RT1. The objective is to prepare the research environment and, therefore, create the evaluation
methodology and carry out the first experiments. The following tasks are planned:
1.1) To answer questions RQ0.A and RQ0.B, the evaluation environment will be defined. The necessary
datasets will be collected, and, if necessary for the project, a proprietary dataset will be created.
1.2) We will evaluate available LLMs and develop basic techniques to answer RQ0.C.
1.3) To answer RQ1.A, we will analyze state-of-the-art models. We will perform a quantitative and
qualitative analysis of the shortcomings of these systems: Their behavior across datasets and
different tasks will be analyzed in depth.
1.4) To answer RQ1.B, we will try to improve textual conditioning learning methods to adequately
follow guidelines.
1.5) We expect to send the answers and conclusions obtained from the different RQs to high-level
journals and conferences.</p>
        <p>Second year, model adaptation: In the second year, work will be done on tasks RT1 and RT2. The
objective is to finish researching training methodologies for LLMs to follow guidelines. The following
tasks are planned:
2.1) Taking advantage of the conclusions from the experiments performed, we will try to improve the
system created to adequately answer question RQ1.B.
2.2) To answer question RQ2.A, synthetic data methods will be analyzed, as well as the shortcomings
of current techniques. Using what was learned from the research conducted, new methods for
generating effective synthetic data will be proposed.
2.3) To answer question RQ2.B, on the other hand, we will analyze works to create auxiliary tasks
and design and create tasks appropriate to our domain.
2.4) With the learning obtained from questions RQ1 and RQ2, a more powerful and better model will
be developed. We will use the data generated in RQ1 and the auxiliary tasks from RQ2 to improve
the results of state-of-the-art techniques.
2.5) At least one article will be submitted to a main conference or journal, based on what is obtained
when answering these research questions.</p>
        <p>Third year, model alignment: During the third year, we will try to complete RT3. Our goal is
to align LLMs to the needs of users, so that they generate appropriate responses, improvements, and
exercises.</p>
        <p>3.1) The work done in section RT0 will be reviewed and new evaluation datasets will be added and
updated, if applicable.
3.2) To answer question RQ3.A, datasets associated with symbolic reasoning will be collected and
adapted. In addition, synthetic data generation techniques will be applied to improve the reasoning
capacity of the models.
3.3) The technologies and methods developed during the second year of the project will interact with
the model arising from question RQ3.A, to then answer RQ3.B.
3.4) We will conduct experiments to evaluate the improved model we are going to create, and at the
same time pay attention to question RQ3.C.
3.5) The results obtained with the new model will be sent to a main journal and conference.</p>
        <p>Fourth year, refinement: During the first month, the results obtained in previous years will be
collected and completed. Then, the thesis will be written, and the defense will be prepared. To carry
out these objectives, the following tasks have been defined:</p>
        <p>4.1) Refine tasks from previous years.
4.2) Submit an article to a journal.
4.3) Write the thesis.
4.4) Prepare the thesis defense.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Specific Issues of Research to be Discussed</title>
      <p>
        Our research focuses primarily on evaluating written texts according to specific rubrics, then providing
feedback and creating exercises based on the educational needs of the user. Although there are plans
to train models to follow guidelines for creating specific and useful feedback [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we are uncertain
about how to adapt these techniques to the educational domain in low-resource languages with data
scarcity. Preliminary experiments have shown that models can predict individual marks across different
evaluation criteria, but we remain unsure which techniques would be adequate to verbalize the inner
reasoning process of the model to create specific feedback while avoiding ambiguous or general
comments.
      </p>
      <p>
        The evaluation of the feedback and generated exercises presents a challenging task in our work.
While evaluation based on agreement with GPT-4 or other closed models is widely used in the field
[36], the linguistic capabilities of these models for Basque lag behind newer open-source models [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
suggesting they may not be appropriate for evaluating feedback and generated exercises in this context.
      </p>
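<p>Whatever judge model is eventually chosen, its agreement with human raters can be quantified; quadratic weighted kappa is a common choice in automated essay scoring. The sketch below computes it for toy human and judge marks on a four-level scale.</p>

```python
# Sketch of quadratic weighted kappa between two raters (e.g., an LLM judge
# and a human). The score lists are toy values, not real annotations.
def quadratic_weighted_kappa(a, b, num_levels):
    n = len(a)
    # Observed score-pair counts.
    obs = [[0.0] * num_levels for _ in range(num_levels)]
    for x, y in zip(a, b):
        obs[x][y] += 1.0
    # Marginal histograms give the chance-expected counts ha[i]*hb[j]/n.
    ha = [a.count(i) for i in range(num_levels)]
    hb = [b.count(i) for i in range(num_levels)]
    num = den = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            w = (i - j) ** 2 / (num_levels - 1) ** 2  # quadratic penalty
            num += w * obs[i][j]
            den += w * ha[i] * hb[j] / n
    return 1.0 - num / den

human = [0, 1, 2, 2, 3, 1, 0, 3]
judge = [0, 1, 2, 1, 3, 1, 0, 2]
print(round(quadratic_weighted_kappa(human, judge, 4), 3))
```

<p>The quadratic weighting penalizes disagreements by two levels four times as heavily as disagreements by one, which matches how rubric-based marks are usually compared.</p>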
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This PhD will be partially supported by:
• The Basque Government (IKER-GAITU project).
• The Ixa group, an A-type research group (IT1570-22).
• Ekhi Azurmendi holds a PhD grant from the Basque Government (PRE_2024_1_0035).</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <sec id="sec-8-1">
        <p>The author(s) have not employed any Generative AI tools.</p>
        <p>with predictive prompting and large language models, in: Joint European Conference on Machine
Learning and Knowledge Discovery in Databases, Springer, 2023, pp. 48–63.
[32] J. M. Arriola, M. Iruskieta, E. Arrieta, J. Alkorta, Towards automatic essay scoring of Basque
language texts from a rule-based approach based on curriculum-aware systems, in: Proceedings
of the NoDaLiDa 2023 Workshop on Constraint Grammar - Methods, Tools and Applications, 2023,
pp. 20–28.
[33] E. Agirre, I. Aldabe, X. Arregi, M. Artetxe, U. Atutxa, E. Azurmendi, I. De la Iglesia, J. Etxaniz,
V. García-Romillo, I. Hernaez-Rioja, et al., IKER-GAITU: research on language technology for Basque
and other low-resource languages, 2024.
[34] N. Perez, M. Cuadros, Multilingual CALL framework for automatic language exercise generation
from free text, in: Proceedings of the Software Demonstrations of the 15th Conference of the
European Chapter of the Association for Computational Linguistics, 2017, pp. 49–52.
[35] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, C. Finn, Direct preference optimization:
Your language model is secretly a reward model, 2024. URL: https://arxiv.org/abs/2305.18290.
arXiv:2305.18290.
[36] L. Zhu, X. Wang, X. Wang, JudgeLM: Fine-tuned large language models are scalable judges, 2025.
URL: https://arxiv.org/abs/2310.17631. arXiv:2310.17631.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] J. Etxaniz, O. Sainz, N. Perez, I. Aldabe, G. Rigau, E. Agirre, A. Ormazabal, M. Artetxe, A. Soroa, Latxa: An open language model and evaluation suite for Basque, 2024. URL: https://arxiv.org/abs/2403.20266. arXiv:2403.20266.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] S. Kim, J. Shin, Y. Cho, J. Jang, S. Longpre, H. Lee, S. Yun, S. Shin, S. Kim, J. Thorne, M. Seo, Prometheus: Inducing fine-grained evaluation capability in language models, 2024. URL: https://arxiv.org/abs/2310.08491. arXiv:2310.08491.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Herbert-Voss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Winter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sigler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Litwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Berner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/2005.14165. arXiv:2005.14165.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdhery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gehrmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schuh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsvyashchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Maynez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barnes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Prabhakaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Reif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hutchinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pope</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Austin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Isard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gur-Ari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Duke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Levskaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghemawat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Michalewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Robinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fedus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ippolito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Spiridonov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sepassi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Omernick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Pillai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pellat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lewkowycz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Moreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Polozov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Saeta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Firat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Catasta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Meier-Hellstern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Eck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Fiedel</surname>
          </string-name>
          ,
          <article-title>PaLM: Scaling language modeling with pathways</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2204.02311. arXiv:2204.02311.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Scialom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakrabarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Muresan</surname>
          </string-name>
          ,
          <article-title>Continual-T0: Progressively instructing 50+ tasks to language models without forgetting</article-title>
          ,
          <source>arXiv preprint arXiv:2205.12393</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yogatama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          , et al.,
          <article-title>Emergent abilities of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2206.07682</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pruksachatkun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nangia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Michael</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>SuperGLUE: A stickier benchmark for general-purpose language understanding systems</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/1905.00537. arXiv:1905.00537.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Burns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mazeika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          ,
          <article-title>Measuring massive multitask language understanding</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2009.03300. arXiv:2009.03300.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rastogi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A. M.</given-names>
            <surname>Shoeb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G.-A.</surname>
          </string-name>
          , et al.,
          <article-title>Beyond the imitation game: Quantifying and extrapolating the capabilities of language models</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2206.04615. arXiv:2206.04615.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Kamalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Santandreu Calonge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurrib</surname>
          </string-name>
          ,
          <article-title>New era of artificial intelligence in education: Towards a sustainable multifaceted revolution</article-title>
          ,
          <source>Sustainability</source>
          <volume>15</volume>
          (
          <year>2023</year>
          )
          <fpage>12451</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          , et al.,
          <article-title>Retrieval-augmented generation for knowledge-intensive NLP tasks</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>9459</fpage>
          -
          <lpage>9474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <source>GPT-4 technical report</source>
          , https://cdn.openai.com/papers/gpt-4.pdf (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for large language models: A survey</article-title>
          ,
          <source>arXiv preprint arXiv:2312.10997</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rutherford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Millican</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. B.</given-names>
            <surname>Van Den Driessche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-B.</given-names>
            <surname>Lespiau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Damoc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Improving language models by retrieving from trillions of tokens</article-title>
          , in: International Conference on Machine Learning,
          <source>PMLR</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>2206</fpage>
          -
          <lpage>2240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lomeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dwivedi-Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <article-title>Atlas: Few-shot learning with retrieval augmented language models</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>24</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>O.</given-names>
            <surname>Sainz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>García-Ferrero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Agerri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. L.</given-names>
            <surname>de Lacalle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rigau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Agirre</surname>
          </string-name>
          ,
          <article-title>GoLLIE: Annotation guidelines improve zero-shot information-extraction</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2310.03668.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>