<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing Cross-prompt Automated Essay Scoring by Selecting Training Data Based on Reinforcement Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Takumi Shibata</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Masaki Uto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The University of Electro-Communications</institution>
          ,
          <addr-line>1-5-1 Chofugaoka, Chofu, Tokyo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Automated essay scoring (AES) aims to automatically grade essays, thereby reducing the time and cost associated with manual scoring. The most common AES methods are classified under the prompt-specific approach, which involves developing a scoring model exclusively for a target prompt by using a dataset of scored essays corresponding to that prompt. Meanwhile, recent studies have emphasized the cross-prompt approach, which leverages scored essay data from other prompts, referred to as source prompts, to build an AES model for the target prompt. However, these cross-prompt methods have limitations in that they do not consider the presence of source prompt essays that can potentially have a negative impact on the construction of the AES model for the target prompt. To address this limitation, we propose a novel cross-prompt AES method that utilizes data valuation with reinforcement learning (DVRL). The proposed method enables the selective use of source prompt essays, which positively contributes to improving the scoring accuracy of AES for the target prompt. Experiments on a benchmark dataset demonstrate that the proposed method enhances the performance of various AES models in cross-prompt scoring settings.</p>
      </abstract>
      <kwd-group>
        <kwd>Cross-prompt automated essay scoring</kwd>
        <kwd>reinforcement learning</kwd>
        <kwd>data valuation</kwd>
        <kwd>transfer learning</kwd>
        <kwd>educational measurement</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, dynamic changes in social structures have led to a growing emphasis on practical
skills such as critical thinking and expressive abilities in educational settings. The essay exam
has gained attention as a popular method for assessing these practical abilities [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. However,
grading essays incurs substantial costs in terms of personnel, time, and money, and it is also
challenging to ensure consistency and fairness in scoring [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. To address these issues, automated
essay scoring (AES) methods, which employ artificial intelligence technologies to automatically
score essays, have been extensively explored in recent years (e.g., [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref3 ref4 ref5 ref6 ref7 ref8 ref9">3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23</xref>
        ]).
      </p>
      <p>
        AES methods can be broadly classified into two categories [22]: prompt-specific and cross-prompt methods. Prompt-specific AES methods construct a specialized scoring model for a single target prompt by using a training dataset consisting of scored essays corresponding to that prompt. (Here, the term prompt refers to the writing task or instructions given to a student, distinct from prompts used as inputs for large language models.) Traditional prompt-specific AES methods have relied on feature-based methods, which involve extracting specific features such as essay length and grammatical error rate from essays and training machine learning models using these features [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. However, these methods require substantial effort in feature engineering, and their performance depends heavily on manually designed features. To address these limitations, deep learning-based approaches have gained popularity in recent years. These methods directly input the word sequences of essays into deep neural networks, eliminating the need for manual feature design [
        <xref ref-type="bibr" rid="ref14 ref3">3, 14, 15</xref>
        ]. In particular, pre-trained transformer-encoder-based models, such as those using BERT [24] or its variants, have been widely adopted over the past few years and have demonstrated high performance [25]. Furthermore, recent research has begun to explore the potential of large language models (LLMs) for AES, investigating their enhanced knowledge retention and language-understanding capabilities [
        <xref ref-type="bibr" rid="ref15">26, 27</xref>
        ], although they are not necessarily superior to AES models using BERT or its variants.
      </p>
      <p>
        Although these prompt-specific AES models demonstrate high performance on the target
prompt for which they were trained, there is no guarantee that directly applying the trained
model to other prompts will yield high performance. To enhance the scoring performance for
other prompts, it is generally necessary to collect an additional scored essay dataset tailored to
each prompt and subsequently retrain the AES model using those data. To avoid such retraining
processes, cross-prompt AES methods have recently been proposed [
        <xref ref-type="bibr" rid="ref11 ref16 ref17">11, 17, 22, 23, 28, 29</xref>
        ].
Cross-prompt AES methods build an AES model for a target prompt by leveraging scored essay
data collected from other prompts, referred to as source prompts. The effective use of source
prompt data can enhance the performance of an AES model for a target prompt, even when
there are no or only a limited number of scored essays corresponding to that prompt.
      </p>
      <p>
        Various cross-prompt AES methods have been explored recently. For example, Li et al. [23]
proposed a feature-based AES model using prompt-independent features, constructed by domain
adversarial neural networks (DANN) [
        <xref ref-type="bibr" rid="ref18">30</xref>
        ]. Furthermore, Ridley et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] proposed a deep neural
network model that integrates prompt-independent features and is designed to receive sequences
of part-of-speech (POS) tags instead of word sequences as input in order to mitigate the influence
of prompt-specific information. More recently, Chen et al. [22] introduced a technique that
employs a contrastive learning approach to obtain more consistent prompt-independent features,
thereby achieving the current state-of-the-art.
      </p>
      <p>
        However, these existing cross-prompt AES methods utilize all source prompt essays, ignoring the presence of essays that can potentially have a negative impact on the construction of the AES model for the target prompt [
        <xref ref-type="bibr" rid="ref18 ref19 ref20">30, 31, 32</xref>
        ]. Because essays from source prompts that exhibit significantly different characteristics compared with the target prompt essays can act as noise, proper data selection that omits such essays is expected to improve scoring accuracy.
      </p>
      <p>
        For this reason, we propose a cross-prompt AES method that follows the approach of data valuation using reinforcement learning (DVRL) [
        <xref ref-type="bibr" rid="ref20">32</xref>
        ] to select source prompt essays that are valuable in constructing AES models for the target prompt. DVRL is a reinforcement learning framework that estimates the value of each data sample based on its contribution to performance improvement in a specific target task. In our method, we adapt DVRL to construct a data value estimator, which assigns higher values to source prompt essays that positively contribute to AES performance on the target prompt and lower values to those that might negatively impact it. The data selected using our DVRL framework can be used to construct any type of AES model, enhancing AES performance on the target prompt compared with scenarios that use all source prompt data. In this study, we evaluate the effectiveness of our proposed method using a benchmark dataset and several popular AES models, including BERT, Llama-2 [
        <xref ref-type="bibr" rid="ref21">33</xref>
        ], and the models proposed by Ridley et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and Chen et al. [22]. The experimental results show that the proposed method succeeded in improving performance across all AES models.
      </p>
      <p>The remainder of this paper is structured as follows: Section 2 provides further details on
conventional cross-prompt AES models. Section 3 explains the data valuation methods. Section
4 describes the proposed method, and Section 5 evaluates its effectiveness using a benchmark
dataset. Finally, Section 6 summarizes the study.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Conventional Cross-Prompt AES Methods</title>
      <p>This section provides an overview of conventional cross-prompt AES methods and discusses
the limitations and drawbacks of these approaches.</p>
      <p>
        Jin et al. [17] proposed a cross-prompt AES method based on a two-stage approach. In the
first stage, a RankSVM [
        <xref ref-type="bibr" rid="ref22">34</xref>
        ] is trained using essays from source prompts. This RankSVM is then
used to generate prediction scores for essays of the target prompt, which serve as pseudo-scores
for the next stage. In the second stage, a prompt-specific AES model is trained for the target
prompt, using these pseudo-scores.
      </p>
      <p>Li et al. [23] also proposed a two-stage AES method that utilizes DANN in the first stage.
DANN is a deep learning approach that learns domain-independent features through an
adversarial training process. This adversarial training uses two models: a main model that solves
a target task and a domain classifier that identifies the domain each datum belongs to. These
models are trained to maximize the performance of the main model while minimizing that of the
domain classifier. The first stage of the method of Li et al. [23] uses DANN to construct a
feature extractor that produces prompt-independent features. Then, an AES model is constructed
using source prompt data of essays that are vectorized by the feature extractor to generate
pseudo-scores for the target prompt essays. The second stage trains a prompt-dependent AES
model for the target prompt, using the target prompt essays with the pseudo-scores.</p>
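      <p>For illustration, the following is a minimal PyTorch sketch of a gradient reversal layer, the core building block behind DANN-style adversarial training; the layer sizes and the framework choice are our assumptions, not details of Li et al.'s implementation.</p>
      <preformat>
# Minimal gradient reversal layer sketch (illustrative; not Li et al.'s code).
import torch
from torch import nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates gradients in the backward pass,
    so the feature extractor learns to fool the domain classifier."""

    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output


feature_extractor = nn.Sequential(nn.Linear(768, 256), nn.ReLU())  # assumed sizes
scorer = nn.Linear(256, 1)      # main task: predict the essay score
domain_clf = nn.Linear(256, 2)  # adversary: classify source vs. target prompt

x = torch.randn(4, 768)                            # dummy essay feature vectors
h = feature_extractor(x)
score = scorer(h)                                  # trained to score accurately
domain_logits = domain_clf(GradReverse.apply(h))   # trained adversarially
      </preformat>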
      <p>
        Meanwhile, Ridley et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] introduced a model called the prompt-agnostic essay scorer (PAES),
which learns an AES model in an end-to-end fashion. PAES is a deep neural network model
that integrates manually-designed prompt-independent features. This neural model is designed
to receive sequences of POS tags instead of word sequences as input in order to mitigate the
influence of prompt-specific information.
      </p>
      <p>Chen et al. [22] proposed a model called prompt-mapping contrastive learning for cross-prompt automated essay scoring (PMAES), which uses contrastive learning to learn more consistent prompt-independent features. PMAES utilizes PAES as an encoder to generate feature vectors for essays. It then employs contrastive learning to bring the vectors from the essays of source prompts closer to those from the target prompt. This process contributes to the construction of more consistent prompt-independent features, which are effective for cross-prompt scoring. PMAES has achieved state-of-the-art performance among cross-prompt AES methods.</p>
      <p>
        As discussed above, conventional cross-prompt AES methods have focused primarily on
learning prompt-independent features in order to extract transferable knowledge in essay
scoring from source prompt data to target prompt data. However, these existing cross-prompt
AES methods utilize all source prompt essays, ignoring the presence of essays
that can negatively impact the construction of the AES model for the target prompt [
        <xref ref-type="bibr" rid="ref18 ref19 ref20">30, 31, 32</xref>
        ].
Although these methods assume the source prompts to be a mixture of multiple prompts [
        <xref ref-type="bibr" rid="ref11 ref16 ref17">11,
17, 22, 23, 28, 29</xref>
        ], not all of the source prompts will necessarily share similar characteristics
with the target prompt. Thus, the inclusion of source prompt essays that are greatly dissimilar
to the target prompt essays can act as noise in the construction of an AES model for the target
prompt. This issue becomes particularly relevant in conditions where there is a large variety
of source prompts in terms of topics and writing styles. These insights suggest that a careful
selection of source prompt essays would be effective for obtaining accurate cross-prompt AES
models. The idea of our study is thus to apply data valuation methods to construct a selector of
valuable source prompt essays.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Data Valuation Methods</title>
      <p>Data valuation is a method for quantifying the importance of each sample in a dataset.
Quantifying the value of data is regarded as an important task in various machine learning problems,
including domain adaptation, discovering noisy samples, learning robust models, and improving
the quality of datasets.</p>
      <p>
        Representative data valuation methods include leave-one-out and data Shapley [
        <xref ref-type="bibr" rid="ref23">35</xref>
        ].
Leave-one-out is a method that estimates the importance of each sample by calculating the change in
performance of a target task when removing each sample one by one. Data Shapley evaluates the
value of data, using the Shapley value from cooperative game theory. Specifically, data Shapley
calculates the marginal contribution of each sample by evaluating the prediction performance
of a target task when using each possible combination of samples. Moreover, another method
using the Banzhaf value, which originates from cooperative game theory as well, has also been
proposed [
        <xref ref-type="bibr" rid="ref24">36</xref>
        ].
      </p>
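      <p>To make the idea concrete, the following is a minimal sketch of leave-one-out valuation, assuming a ridge regressor as the task model and MSE on a validation set as the performance measure.</p>
      <preformat>
# Minimal leave-one-out data valuation sketch (illustrative assumptions:
# a ridge regressor as the task model and MSE as the performance metric).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error


def leave_one_out_values(X_train, y_train, X_val, y_val):
    """Value of sample i = increase in validation error when i is removed."""
    base = Ridge().fit(X_train, y_train)
    base_err = mean_squared_error(y_val, base.predict(X_val))
    values = np.zeros(len(X_train))
    for i in range(len(X_train)):
        mask = np.arange(len(X_train)) != i      # drop sample i
        model = Ridge().fit(X_train[mask], y_train[mask])
        err = mean_squared_error(y_val, model.predict(X_val))
        values[i] = err - base_err               # positive: i is valuable
    return values
      </preformat>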
      <p>
        Several data valuation methods based on meta-learning have also been proposed. One
example is ChoiceNet [
        <xref ref-type="bibr" rid="ref25">37</xref>
        ], a valuation method that identifies noisy data within training datasets
by separately estimating the distributions of meaningful data and noise data. Learning to
reweight [
        <xref ref-type="bibr" rid="ref26">38</xref>
        ] is another method that calculates the weights of each sample in the source dataset
based on the performance of a target task on a validation dataset. Furthermore, as a recent
meta-learning-based data valuation method, Yoon et al. [
        <xref ref-type="bibr" rid="ref20">32</xref>
        ] proposed a method called data
valuation using reinforcement learning (DVRL). DVRL employs a reinforcement learning strategy
that simultaneously optimizes a data value estimator and a predictor model for a target task. In
this study, we apply the framework of DVRL to cross-prompt AES.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Proposed Method</title>
      <sec id="sec-4-1">
        <title>4.1. Task Definition</title>
        <p>This study assumes that a large number of scored essays from a mixture of source prompts, $\mathcal{D}_S = \{(e_i^S, y_i^S)\}_{i=1}^{N_S}$, and a small number of scored essays for the target prompt, $\mathcal{D}_T = \{(e_i^T, y_i^T)\}_{i=1}^{N_T}$, are given. Here, $e_i^S$ and $e_i^T$ represent the $i$-th essay in the source and target prompt essays, respectively, while $y_i^S$ and $y_i^T$ denote their corresponding scores. $N_S$ and $N_T$ represent the total numbers of essays for the source prompts and the target prompt, respectively.</p>
        <p>Our study aims to develop an AES model that can accurately predict scores for unscored
essays corresponding to the target prompt by executing the following two steps.
1. Construct a data value estimator, using DVRL to assign value scores to each essay in the
source prompt essays.
2. Train an AES model for the target prompt, using a subset of source prompt essays assigned
high-value scores by the data value estimator.</p>
        <p>Note that this study exclusively uses $\mathcal{D}_S$ in the AES training process, while both $\mathcal{D}_S$ and $\mathcal{D}_T$ are used in the DVRL process. Although $\mathcal{D}_T$ is in principle also available to train the AES model constructed in step 2, we do not use it for training because this study focuses on how data selection by the proposed method affects AES performance compared with scenarios in which all source prompt data are used; a detailed evaluation of the effect of integrating $\mathcal{D}_T$ as AES training data remains a subject for future research. The following sections describe the details of each step.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Data Valuation Using DVRL</title>
        <p>Figure 1 shows an overview of the DVRL framework used in our method, where $\hat{y}$ denotes the predicted score of an essay. Here, $\phi$ and $\theta$ are the model parameters of the data value estimator and predictor, respectively.</p>
        <p>
          In the figure, $h^S$ and $h^T$ represent feature vectors corresponding to $e^S$ and $e^T$, respectively. The method for creating these feature vectors depends on the type of AES model that will ultimately be constructed. Specifically, when we intend to use AES models that accept word sequences as input, we use distributed essay representation vectors obtained from DeBERTa-v3-large [
          <xref ref-type="bibr" rid="ref27 ref28">39, 40</xref>
          ] as the feature vectors. Meanwhile, when we intend to use cross-prompt AES models such as PAES and PMAES, we utilize manually designed prompt-independent features.
        </p>
        <p>The learning process of DVRL is formulated as the following optimization problem:
$$\max_{\phi}\; \mathbb{E}_{(h,y)\sim \mathcal{D}_T}\!\left[R(\theta^*)\right] \quad \text{s.t.}\quad \theta^* = \arg\min_{\theta}\; \mathbb{E}_{(h,y)\sim \mathcal{D}_S}\!\left[v_{\phi}(h, y)\, \mathcal{L}_{\mathrm{MSE}}\!\left(f_{\theta}(h), y\right)\right]. \qquad (1)$$
Here, $R(\theta^*)$ represents the reward, which is the performance of the predictor $f_{\theta}$ trained using the source prompt data $\mathcal{D}_S$ and evaluated using $\mathcal{D}_T$ as test data. The reward is measured using the quadratic weighted kappa (QWK) metric, which assesses the agreement between the predicted scores and the ground truth scores and is widely used in AES studies [
          <xref ref-type="bibr" rid="ref10 ref3">3, 10</xref>
          ]. $\mathcal{L}_{\mathrm{MSE}}$ denotes the mean squared error (MSE) loss function used to train the predictor, as explained in Section 4.2.2. $\mathcal{D}_S$ and $\mathcal{D}_T$ here denote the distributions of the source prompt data and the target prompt data, respectively. Solving this formulation offers a data value estimator $v_{\phi}$ that estimates the value score of each essay. The following subsections explain the specific calculation procedures.</p>
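        <p>For reference, the QWK reward can be computed with an off-the-shelf routine; the sketch below uses scikit-learn's cohen_kappa_score with quadratic weights and assumes integer-valued score labels.</p>
        <preformat>
# Minimal QWK reward sketch (assumes integer score labels).
from sklearn.metrics import cohen_kappa_score


def qwk_reward(predicted_scores, true_scores):
    """Quadratic weighted kappa between predicted and ground truth scores."""
    return cohen_kappa_score(predicted_scores, true_scores, weights="quadratic")
        </preformat>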
        <sec id="sec-4-2-1">
          <title>4.2.1. Data Value Estimator</title>
          <p>For each essay vector $h_i^S$ and its score $y_i^S$ in $\mathcal{D}_S$, the data value estimator $v_{\phi}$ outputs the data value $v_i \in [0, 1]$ as $v_i = v_{\phi}(h_i^S, y_i^S)$. The data value estimator $v_{\phi}$ is implemented as a deep neural network with six stacked dense layers, where the output layer is designed as a linear layer with sigmoid activation; it also incorporates marginal information $m_i$ into its intermediate layer. The marginal information $m_i$ is a quantity expected to correlate with the data value of each essay and can be written as $m_i = |y_i^S - \hat{y}(h_i^S)|$, where $\hat{y}$ is a predictor trained on $\mathcal{D}_T$.</p>
          <p>Using the calculated data value $v_i$, the selection indicator $s_i \in \{0, 1\}$ for each essay is determined by sampling from a Bernoulli distribution with probability $v_i$; that is, $s_i \sim \mathrm{Ber}(v_i)$, where $s_i = 1$ means that the $i$-th data point is selected, and $s_i = 0$ means that it is not.</p>
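          <p>The following is a minimal PyTorch sketch of the estimator and the sampling step; the hidden width and the exact layer at which the marginal information enters are our assumptions (the paper only specifies six dense layers, a sigmoid output, and an intermediate junction for $m_i$).</p>
          <preformat>
# Minimal data value estimator sketch (hidden width and the junction for the
# marginal information m_i are illustrative assumptions).
import torch
from torch import nn


class DataValueEstimator(nn.Module):
    def __init__(self, feat_dim, hidden=100):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(feat_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        # marginal information m_i = |y_i - yhat(h_i)| joins mid-network
        self.head = nn.Sequential(
            nn.Linear(hidden + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())   # value v_i in [0, 1]

    def forward(self, h, y, m):
        z = self.body(torch.cat([h, y], dim=1))
        return self.head(torch.cat([z, m], dim=1)).squeeze(1)


estimator = DataValueEstimator(feat_dim=1024)     # e.g., DeBERTa vectors
h = torch.randn(8, 1024)                          # essay feature vectors
y = torch.rand(8, 1)                              # normalized scores
m = torch.rand(8, 1)                              # marginal information
v = estimator(h, y, m)                            # data values v_i
s = torch.bernoulli(v)                            # selection s_i ~ Ber(v_i)
          </preformat>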
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Predictor</title>
          <p>The source prompt data selected through the above procedure are used to train the predictor $f_{\theta}$. The predictor is designed as a multi-layer perceptron with a linear output layer with sigmoid activation. In our study, we used different multi-layer perceptrons depending on the input data types: a two-layer perceptron for cases inputting distributed essay representation vectors obtained from DeBERTa-v3-large, and a single-layer perceptron for cases inputting manually designed prompt-independent features. The weighted loss function $\mathcal{L}(\theta)$ used for learning is calculated as follows:
$$\mathcal{L}(\theta) = \frac{1}{N_S} \sum_{(h_i^S,\, y_i^S) \in \mathcal{D}_S} s_i \cdot \mathcal{L}_{\mathrm{MSE}}(\hat{y}_i^S, y_i^S), \qquad (2)$$
where $\hat{y}_i^S$ is the predicted score of the predictor $f_{\theta}$ for the $i$-th essay of the source prompt data. As the loss function $\mathcal{L}_{\mathrm{MSE}}$, we use the MSE between the predicted score $\hat{y}_i^S$ and the ground truth score $y_i^S$. Note that the ground truth scores are assumed to be normalized to the range $[0, 1]$ because the predicted scores are within this range too, as a result of the sigmoid activation in the output layer.</p>
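          <p>A one-function sketch of Eq. (2) is given below; averaging over all $N_S$ source essays follows our reading of the formulation.</p>
          <preformat>
# Minimal sketch of the selection-weighted MSE loss of Eq. (2).
import torch


def weighted_mse(pred, target, sel, n_source):
    """sel holds the sampled selection indicators s_i in {0, 1}."""
    return (sel * (pred - target) ** 2).sum() / n_source
          </preformat>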
        </sec>
        <sec id="sec-4-2-3">
          <title>4.2.3. Reinforcement Learning</title>
          <p>Using the trained predictor, our method computes the reward $R(\theta^*)$ for reinforcement learning as the QWK between the predicted scores and the ground truth scores evaluated using the dataset $\mathcal{D}_T$. The reward $R(\theta^*)$ is used to update the parameters $\phi$ of the data value estimator $v_{\phi}$. Specifically, the parameters $\phi$ are updated using the REINFORCE algorithm [41], a reinforcement learning algorithm, with the following loss function [
            <xref ref-type="bibr" rid="ref20">32</xref>
            ]:
$$\mathcal{L}(\phi) = R(\theta^*) \cdot \log p_{\phi}\!\left((s_1, s_2, \ldots, s_{N_S}) \mid \phi\right), \qquad (3)$$
where $p_{\phi}((s_1, s_2, \ldots, s_{N_S}) \mid \phi)$ represents the joint probability of the selection indicators given the parameters $\phi$. Note that each essay is selected independently, meaning that the joint probability can be written as $\prod_{i=1}^{N_S} v_i^{s_i} (1 - v_i)^{1 - s_i}$. Using this loss function, the parameters $\phi$ are updated by gradient ascent as follows:
$$\phi \leftarrow \phi + \alpha \nabla_{\phi} \mathcal{L}(\phi), \qquad (4)$$
where $\alpha$ represents the learning rate, which is set to 0.001 in this study. Adam [42] is used as the optimization method for parameter updates.</p>
          <p>Finally, by repeating the above steps until the model converges, the data value estimator $v_{\phi}$ is trained.</p>
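          <p>The following self-contained PyTorch sketch mirrors Eqs. (3) and (4); the learning rate of 0.001 and the use of Adam follow the paper, while the fixed reward value is a stand-in for the QWK computed on $\mathcal{D}_T$, and the data values are a leaf tensor rather than the output of the estimator network for brevity.</p>
          <preformat>
# Minimal REINFORCE update sketch for Eqs. (3)-(4); lr=0.001 and Adam follow
# the paper, while the fixed reward value is a stand-in for the QWK on D_T.
import torch

torch.manual_seed(0)
v = torch.rand(16).requires_grad_(True)   # stand-in for estimator outputs v_i
optimizer = torch.optim.Adam([v], lr=0.001)

s = torch.bernoulli(v.detach())           # selection indicators s_i ~ Ber(v_i)
reward = 0.42                             # stand-in reward R(theta*)
log_prob = (s * torch.log(v + 1e-8)
            + (1 - s) * torch.log(1 - v + 1e-8)).sum()
loss = -reward * log_prob                 # minimizing this ascends R * log p
optimizer.zero_grad()
loss.backward()
optimizer.step()
          </preformat>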
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Train an Arbitrary AES Model Based on Estimated Data Values</title>
        <p>
          Through the above process, we can obtain the data value estimator  and the resulting data
value scores for essays in the source prompt data . Thus, our last step is to construct an AES
model for the target prompt, using source prompt essays with high-value scores. However, it is
not clear how much data should be selected based on their value scores. Thus, we employ the
following approach, which is inspired by that described in [
          <xref ref-type="bibr" rid="ref20">32</xref>
          ], to select essays based on their
value scores.
        </p>
        <p>1. Sort the source prompt essays in descending order based on their estimated value scores.
2. Train an AES model using the essays with the top 10% of value scores, and repeat this process with different data usage percentages, ranging from 10% to 100%, in increments of 10%.
3. For the ten constructed models, evaluate the MSE loss using $\mathcal{D}_T$ as test data. The model with the lowest MSE loss is selected as the optimal one and is used for scoring the unscored target prompt essays. A sketch of this selection loop is given below.</p>
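        <p>A minimal sketch of this selection loop follows; train_model and mse_on are hypothetical placeholders for training the chosen AES model and evaluating its MSE on $\mathcal{D}_T$.</p>
        <preformat>
# Minimal sketch of the top-p% selection loop (train_model and mse_on are
# placeholders for the chosen AES model's training and evaluation routines).
import numpy as np


def select_best_model(values, X_s, y_s, X_t, y_t, train_model, mse_on):
    order = np.argsort(-values)              # descending by value score
    best_model, best_mse = None, np.inf
    for pct in range(10, 101, 10):           # top 10%, 20%, ..., 100%
        k = max(1, int(len(order) * pct / 100))
        subset = order[:k]
        model = train_model(X_s[subset], y_s[subset])
        mse = mse_on(model, X_t, y_t)        # validated on the D_T essays
        if mse &lt; best_mse:
            best_model, best_mse = model, mse
    return best_model
        </preformat>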
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiment</title>
      <sec id="sec-5-1">
        <title>5.1. Dataset</title>
        <p>We conducted an evaluation experiment using real-world data to demonstrate the score
prediction performance of the proposed method compared with the conventional method, which uses
all source data.</p>
        <p>In this experiment, we used the ASAP (Automated Student Assessment Prize)4 dataset as real-world data. The ASAP dataset was used in Kaggle's automated essay-scoring competition and is widely used as a benchmark dataset in many AES studies. ASAP contains a total of 8 essay prompts covering 3 genres: argumentative, source-dependent response, and narrative. Each prompt includes students' essays and their scores. The details of the dataset characteristics are shown in Table 1.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Performance Evaluation of our Proposed Method</title>
        <p>
          In line with previous cross-prompt AES studies, the present experiment was conducted using
prompt-wise cross-validation [
          <xref ref-type="bibr" rid="ref11">11, 17, 22</xref>
          ]. In prompt-wise cross-validation, one prompt is used
as the target prompt, while all remaining prompts are used as source prompts for training. This
operation is performed sequentially for all prompts, and the average is calculated to evaluate
performance.
        </p>
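        <p>Such an evaluation loop might be sketched as follows, with evaluate_on as a hypothetical placeholder that trains on the source prompts and returns the QWK on the held-out target prompt.</p>
        <preformat>
# Minimal prompt-wise cross-validation sketch (evaluate_on is a placeholder).
import numpy as np


def prompt_wise_cv(prompts, evaluate_on):
    qwks = []
    for target in prompts:                   # hold out one prompt at a time
        sources = [p for p in prompts if p != target]
        qwks.append(evaluate_on(sources, target))
    return float(np.mean(qwks))              # average QWK over all prompts
        </preformat>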
        <p>Our proposed method needs $\mathcal{D}_T$, a small set of scored essays sampled from the target prompt. In this experiment, the size of $\mathcal{D}_T$ was set to 30, and the samples were selected so as to maximize the sum of the Euclidean distances between their distributed essay representation vectors obtained from DeBERTa-v3-large.</p>
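        <p>The paper specifies only the objective (maximizing the summed Euclidean distances); the greedy strategy sketched below is our assumption about one way such a subset can be chosen.</p>
        <preformat>
# Greedy sketch for picking a spread-out D_T of k essays; the greedy scheme
# itself is an assumption, only the distance-sum objective is from the paper.
import numpy as np


def select_diverse(vectors, k=30):
    start = int(np.argmax(np.linalg.norm(vectors - vectors.mean(0), axis=1)))
    chosen = [start]                          # seed with an outlying essay
    for _ in range(k - 1):
        dists = np.linalg.norm(
            vectors[:, None, :] - vectors[chosen][None, :, :], axis=2
        ).sum(axis=1)                         # summed distance to chosen set
        dists[chosen] = -np.inf               # never re-pick an essay
        chosen.append(int(np.argmax(dists)))
    return chosen
        </preformat>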
        <p>
          Our proposed method can be used for any AES model. The present experiment used four
representative AES models: BERT, Llama-2-7B [
          <xref ref-type="bibr" rid="ref21">33</xref>
          ], PAES, and PMAES. Note that PMAES with the same hyper-parameters as in [22] could not be run on our GPU (RTX 4090); thus, we changed some hyper-parameters, specifically the number of mini-batches, from 2 to 20.
        </p>
        <p>The experiments were conducted in two settings, All source and Proposed, and the score prediction accuracies were compared. All source is a setting in which each AES model is trained using all source prompt data, which is equivalent to the case where all essays are selected in the proposed method. Proposed is a setting in which each AES model is trained using a subset of source prompt data selected by our method. The prediction performance of each trained model is evaluated by QWK on the target prompt essays, excluding the 30 essays in $\mathcal{D}_T$.</p>
        <p>Table 2 shows the experimental results. The results show that the proposed method outperforms the All source setting for all models. The improvement is particularly significant for BERT and Llama-2-7B. These models use word sequences as input data, which increases the difference in feature vector characteristics between the source and target prompts. This would amplify the negative impact of using source prompt essays irrelevant to the target prompt, thereby degrading an AES model trained using all source prompt data.</p>
        <p>For PAES and PMAES, the improvement margin is smaller because they mitigate the difference in the feature space between prompts by using prompt-independent features and POS sequences as input. However, even for these models, the proposed method succeeds in improving performance by selecting relevant essays that align better with the target prompt's characteristics.</p>
        <p>Moreover, BERT achieves higher performance with the proposed method than do PAES and PMAES without it. This suggests that the proposed method applied to BERT can achieve performance comparable to these cross-prompt AES models. This is a significant result because it indicates that, by simply selecting essays that are effective for the target prompt, it is possible to match conventional cross-prompt AES models without relying on complex techniques to align features across prompts.</p>
        <p>These results demonstrate the effectiveness of the proposed method in selecting the most relevant essays from the source prompts, leading to improved performance of conventional AES models.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Validity Evaluation of Estimated Data Values</title>
        <p>In this section, we investigate whether the value estimates of the proposed method appropriately relate to score prediction performance. To confirm this point, we examined the prediction accuracy (QWK) of an AES model trained using source prompt essays, excluding those with the top or bottom $p\%$ of value scores. The removal ratio $p$ was changed from 0% to 90% in increments of 10%. This analysis uses PAES as the AES model because, as reported above, it demonstrated the highest performance among the models to which the proposed method was applied.</p>
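        <p>This analysis can be scripted as sketched below; train_and_qwk is a hypothetical placeholder that retrains the AES model on the kept essays and returns its QWK on the target prompt.</p>
        <preformat>
# Minimal sketch of the exclusion analysis (train_and_qwk is a placeholder
# for retraining the AES model on the kept essays and returning its QWK).
import numpy as np


def exclusion_curve(values, train_and_qwk, remove="bottom"):
    order = np.argsort(values)               # ascending by value score
    if remove == "top":
        order = order[::-1]                  # remove highest-valued first
    curve = []
    for p in range(0, 100, 10):              # remove 0%, 10%, ..., 90%
        cut = int(len(order) * p / 100)
        kept = order[cut:]
        curve.append(train_and_qwk(kept))
    return curve
        </preformat>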
        <p>The experimental results for Prompt 1 are presented in Figure 2, which shows the ratio
of excluded essays on the horizontal axis and the QWK on the vertical axis. The blue line
represents the QWK when essays are excluded in order of the highest value scores, while the
orange line represents the QWK when essays are excluded in order of the lowest value scores.</p>
        <p>The figure demonstrates that, for the range where the ratios of removed essays are small
to medium, QWK tends to increase as essays with low value scores are sequentially excluded,
whereas it tends to decrease when essays with high value scores are sequentially excluded. For
the range where the ratios of removed essays are extremely large, both cases revealed low QWK
values due to the removal of too many training data, which is a reasonable trend.</p>
        <p>These results suggest that the value scores estimated by the proposed method appropriately
relate to the efectiveness of the scoring performance of the constructed AES model for the
target prompt.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This study introduced a novel cross-prompt AES approach that leverages the data valuation
method to select source prompt essays valuable for improving the accuracy of the AES model
for the target prompt. The experimental results demonstrate the effectiveness of our method in
improving the performance of AES models.</p>
      <p>In future work, we will perform further analyses of the proposed model aimed at gaining a deeper understanding of its characteristics and behavior. Additional experiments are needed to evaluate the effects of utilizing a small set of scored essays for the target prompt, denoted as $\mathcal{D}_T$, to train the AES model, in addition to its usage in our DVRL process. We also aim to explore methods that do not rely on $\mathcal{D}_T$ because this requirement may not always be feasible in real-world scenarios. Furthermore, we intend to develop an end-to-end model that integrates the data value estimation and AES components into a single, unified framework. This will enable a more streamlined and efficient approach to cross-prompt AES.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] I. D. Erguvan, B. Aksu Dunya, Analyzing rater severity in a freshman composition course using many facet Rasch measurement, Language Testing in Asia 10 (2020) 1–20. doi:10.1186/s40468-020-0098-3.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] M. Uto, M. Ueno, A generalized many-facet Rasch model and its Bayesian estimation using Hamiltonian Monte Carlo, Behaviormetrika 47 (2020) 469–496. doi:10.1007/s41237-020-00115-7.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] K. Taghipour, H. T. Ng, A neural approach to automated essay scoring, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1882–1891. doi:10.18653/v1/D16-1193.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Y. Attali, J. Burstein, Automated essay scoring with e-rater® v.2, The Journal of Technology, Learning and Assessment 4 (2006) 1–30. doi:10.1002/j.2333-8504.2004.tb01972.x.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] H. Chen, B. He, Automated essay scoring by maximizing human-machine agreement, in: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2013, pp. 1741–1752.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] P. Phandi, K. M. A. Chai, H. T. Ng, Flexible domain adaptation for automated essay scoring using correlated linear regression, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 431–439. doi:10.18653/v1/D15-1049.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] M. Dascalu, W. Westera, S. Ruseti, S. Trausan-Matu, H. Kurvers, ReaderBench learns Dutch: Building a comprehensive automated essay scoring system for Dutch language, in: International Conference on Artificial Intelligence in Education, Springer, 2017, pp. 52–63. doi:10.1007/978-3-319-61425-0_5.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] P. Hastings, S. Hughes, M. A. Britt, Active learning for improving machine learning of student explanatory essays, in: International Conference on Artificial Intelligence in Education, Springer, 2018, pp. 140–153. doi:10.1007/978-3-319-93843-1_11.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] L. Yao, S. J. Haberman, M. Zhang, Prediction of writing true scores in automated scoring of essays by best linear predictors and penalized best linear predictors, ETS Research Report Series 2019 (2019) 1–27. doi:10.1002/ets2.12248.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] M. Uto, A review of deep-neural automated essay scoring models, Behaviormetrika 48 (2021) 1–26. doi:10.1007/s41237-021-00142-y.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] R. Ridley, L. He, X. Dai, S. Huang, J. Chen, Prompt agnostic essay scorer: A domain generalization approach to cross-prompt automated essay scoring, 2020. arXiv:2008.01441.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] D. Alikaniotis, H. Yannakoudakis, M. Rei, Automatic text scoring using neural networks, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, 2016, pp. 715–725. doi:10.18653/v1/P16-1068.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] F. Dong, Y. Zhang, Automatic features for essay scoring – an empirical study, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1072–1077. doi:10.18653/v1/D16-1115.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>F.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Yang,
          <article-title>Attention-based recurrent convolutional neural network for automatic essay scoring</article-title>
          ,
          <source>in: Proceedings of the 21st Conference on Computational Natural</source>
          <volume>100213</volume>
          . doi:https://doi.org/10.1016/j.caeai.
          <year>2024</year>
          .
          <volume>100213</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[27] M. Stahl, L. Biermann, A. Nehring, H. Wachsmuth, Exploring LLM prompting strategies for joint essay scoring and feedback generation, 2024. arXiv:2404.15845.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[28] Y. Cao, H. Jin, X. Wan, Z. Yu, Domain-adaptive neural automated essay scoring, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, 2020, pp. 1011–1020. doi:10.1145/3397271.3401037.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[29] Z. Jiang, T. Gao, Y. Yin, M. Liu, H. Yu, Z. Cheng, Q. Gu, Improving domain generalization for prompt-aware essay scoring via disentangled representation learning, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, 2023, pp. 12456–12470. doi:10.18653/v1/2023.acl-long.696.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[30] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, V. Lempitsky, Domain-adversarial training of neural networks, Springer International Publishing, 2017. doi:10.1007/978-3-319-58347-1_10.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[31] X. Glorot, A. Bordes, Y. Bengio, Domain adaptation for large-scale sentiment classification: A deep learning approach, in: Proceedings of the 28th International Conference on International Conference on Machine Learning, Omnipress, 2011, pp. 513–520.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. O.</given-names>
            <surname>Arik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pfister</surname>
          </string-name>
          ,
          <article-title>Data valuation using reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the 37th International Conference on Machine Learning, JMLR.org</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [33]
          <string-name><given-names>H.</given-names> <surname>Touvron</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Martin</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Stone</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Albert</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Almahairi</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Babaei</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Bashlykov</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Batra</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Bhargava</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Bhosale</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Bikel</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Blecher</surname></string-name>,
          <string-name><given-names>C. C.</given-names> <surname>Ferrer</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Cucurull</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Esiobu</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Fernandes</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Fu</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Fu</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Fuller</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Gao</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Goswami</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Goyal</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Hartshorn</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Hosseini</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Hou</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Inan</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Kardas</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Kerkez</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Khabsa</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Kloumann</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Korenev</surname></string-name>,
          <string-name><given-names>P. S.</given-names> <surname>Koura</surname></string-name>,
          <string-name><given-names>M.-A.</given-names> <surname>Lachaux</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Lavril</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Liskovich</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Lu</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Mao</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Martinet</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Mihaylov</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Mishra</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Molybog</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Nie</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Poulton</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Reizenstein</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Rungta</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Saladi</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Schelten</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Silva</surname></string-name>,
          <string-name><given-names>E. M.</given-names> <surname>Smith</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Subramanian</surname></string-name>,
          <string-name><given-names>X. E.</given-names> <surname>Tan</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Tang</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Taylor</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Williams</surname></string-name>,
          <string-name><given-names>J. X.</given-names> <surname>Kuan</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Xu</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Yan</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Zarov</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Fan</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Kambadur</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Narang</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Rodriguez</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Stojnic</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Edunov</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Scialom</surname></string-name>,
          <article-title>Llama 2: Open foundation and fine-tuned chat models</article-title>,
          <year>2023</year>
          . arXiv:2307.09288.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          ,
          <article-title>Optimizing search engines using clickthrough data</article-title>
          ,
<source>in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , Association for Computing Machinery,
          <year>2002</year>
          , pp.
          <fpage>133</fpage>
          -
          <lpage>142</lpage>
. doi:10.1145/775047.775067.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghorbani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <article-title>Data shapley: Equitable valuation of data for machine learning</article-title>
          ,
<source>in: International Conference on Machine Learning</source>
          , PMLR,
          <year>2019</year>
          , pp.
          <fpage>2242</fpage>
          -
          <lpage>2251</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <article-title>Data Banzhaf: A robust data valuation framework for machine learning</article-title>
          ,
          <source>in: International Conference on Artificial Intelligence and Statistics</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>S.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
<string-name>
            <given-names>S.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <article-title>ChoiceNet: Robust learning by revealing output correlations</article-title>
          ,
          <year>2020</year>
          . arXiv:1805.06431.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Urtasun</surname>
          </string-name>
          ,
          <article-title>Learning to reweight examples for robust deep learning</article-title>
          ,
<source>in: International Conference on Machine Learning</source>
          , PMLR,
          <year>2018</year>
          , pp.
          <fpage>4334</fpage>
          -
          <lpage>4343</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>DeBERTa: Decoding-enhanced BERT with disentangled attention</article-title>
          ,
          <year>2021</year>
          . arXiv:2006.03654.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing</article-title>
          ,
          <year>2021</year>
          . arXiv:2111.09543.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>