<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging AI Graders for Missing Score Imputation to Achieve Accurate Ability Estimation in Constructed-Response Tests</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Masaki Uto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuma Ito</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The University of Electro-Communications</institution>
          ,
          <addr-line>1-5-1 Chofugaoka, Chofu, Tokyo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Evaluating the abilities of learners is a fundamental objective in the field of education. In particular, there is an increasing need to assess higher-order abilities such as expressive skills and logical thinking. Constructed-response tests such as short-answer and essay-based questions have become widely used as a method to meet this demand. Although these tests are effective, they require substantial manual grading, making them both labor-intensive and costly. Item response theory (IRT) provides a promising solution by enabling the estimation of ability from incomplete score data, where human raters grade only a subset of answers provided by learners across multiple test items. However, the accuracy of ability estimation declines as the proportion of missing scores increases. Although data augmentation techniques for imputing missing scores have been explored in order to address this limitation, they often struggle with inaccuracy for sparse or heterogeneous data. To overcome these challenges, this study proposes a novel method for imputing missing scores by leveraging automated scoring technologies for accurate IRT-based ability estimation. The proposed method achieves high accuracy in ability estimation while markedly reducing manual grading workload.</p>
      </abstract>
      <kwd-group>
        <kwd>Constructed-response tests</kwd>
        <kwd>educational measurement</kwd>
        <kwd>item response theory</kwd>
        <kwd>automated scoring</kwd>
        <kwd>data augmentation</kwd>
        <kwd>large language models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Evaluating the abilities of learners is a critical component of various educational assessments, including
entrance and qualification exams, as well as in-class assessments. Ability estimation is also essential for
educational applications such as personalized learning support systems, including intelligent tutoring
and adaptive learning platforms, because they generally require ability estimation to provide optimal
recommendations for learning strategies, content, and other interventions tailored to the ability of each
learner [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ].
      </p>
      <p>
        Objective tests, typically consisting of multiple-choice questions, have been widely adopted as a
popular approach for ability estimation in educational settings owing to their scalability and ease of
implementation. However, modern education increasingly emphasizes the importance of 21st-century
skills such as expressive abilities and critical thinking [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8 ref9">5, 6, 7, 8, 9</xref>
        ]. To effectively assess such abilities,
constructed-response tests, including short-answer and essay-type questions, have gained increasing
attention. These tests, however, necessitate substantial manual grading, which makes them both
labor-intensive and costly [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13">10, 11, 12, 13</xref>
        ].
      </p>
      <p>
        Item response theory (IRT) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], a statistical method well-established in the fields of educational
and psychological measurement, offers a promising solution for estimating ability using incomplete
score data, where human raters grade only a subset of learner answers across multiple test items, as
exemplified in Table 1. IRT typically estimates learner ability by maximizing the likelihood of observed
scores based on IRT models, which define the probability of score observations as a function of learner
ability and item characteristic parameters. This allows IRT to be easily applied to incomplete score data
by calculating the likelihood while excluding missing scores [
        <xref ref-type="bibr" rid="ref15 ref16 ref17">15, 16, 17</xref>
        ]. This feature is particularly
advantageous for achieving ability estimation while reducing manual grading workload. However, the
accuracy of ability estimation decreases as the proportion of missing scores increases.
      </p>
      <p>
        A possible strategy for addressing this problem is data augmentation through imputation of missing
scores [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ]. Simple methods such as mean or mode imputation [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] are widely recognized but fail
to effectively capture the underlying complex patterns within data. More advanced approaches such
as statistical model-based and machine learning-based methods [
        <xref ref-type="bibr" rid="ref21 ref22 ref23 ref24 ref25">21, 22, 23, 24, 25, 26</xref>
        ] aim to model
such underlying patterns to predict missing scores for more accurate imputation. However, when
the assumed model does not fit the target data or when the missing rate is very high, the resulting
imputations often lack accuracy. Moreover, conventional imputation methods generally rely on the
assumption that all data, including both observed and missing data, follow a single underlying pattern
modeled as a specific data-generation process, while real-world constructed-response tests frequently
violate this assumption. For instance, proficient learners could achieve lower scores due to inattention,
while less proficient learners could obtain higher scores due to compatibility between the learner and
the test item or other chance factors. These issues suggest that conventional imputation methods may
lack robustness to sparse and heterogeneous data, making them unsuitable for achieving accurate
ability estimation while substantially reducing manual scoring workload.
      </p>
      <p>
        To overcome these limitations, this study proposes a novel method for imputing missing scores
by leveraging automated scoring technologies [
        <xref ref-type="bibr" rid="ref11">11, 27, 28, 29, 30, 31</xref>
        ] for accurate IRT-based ability
estimation in constructed-response tests. Specifically, the approach begins by developing neural
automated scoring models trained on a subset of manually scored responses for each test item, or by
employing zero-shot scoring models using large language models (LLMs). These models are then used
to predict missing scores, generating a complete dataset. The augmented dataset is subsequently used
to estimate learner ability by using IRT models. The proposed method offers several key advantages:
1. More robust imputation is achieved, even for heterogeneous data, by using learner answer text
directly to predict missing scores without the need to model underlying patterns of score data.
2. Recent scoring models based on pre-trained neural models are expected to enable accurate
imputations from a relatively small subset of score data, as demonstrated in recent automated
scoring studies [28, 32, 33, 34, 35]. This facilitates accurate ability estimation while markedly
reducing reliance on human grading.
      </p>
      <p>Through empirical evaluation using real-world datasets, this study demonstrates that the proposed
method achieves markedly higher accuracy in ability estimation than do conventional approaches,
even with high missing ratios. While the proposed method is based on a relatively simple idea, its
effectiveness in greatly improving the accuracy of ability estimation directly contributes to enhancing
various educational applications, as outlined above.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Settings and Objective</title>
      <p>The objective of this study is to estimate learner ability based on a collection of scores assigned to their
answers for multiple constructed-response test items. This collection of scores is defined as follows:
$$\mathcal{U} = \{\, u_{ji} \in \mathcal{K} \cup \{-1\} \mid j \in \mathcal{J},\ i \in \mathcal{I} \,\}, \quad (1)$$
where $u_{ji}$ indicates the score assigned to the answer of learner $j \in \mathcal{J}$ for item $i \in \mathcal{I}$, and $\mathcal{I}$ and $\mathcal{J}$
represent the sets of items and learners, respectively. Furthermore, $\mathcal{K} = \{1, 2, \ldots, K\}$ represents the
set of score categories, where $K$ indicates the number of categories, and $u_{ji} = -1$ indicates missing
data.</p>
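      <p>To make this representation concrete, the following minimal Python sketch, assuming NumPy and invented score values, stores the score collection as a learner-by-item matrix in which -1 codes missing entries; the derived indicator matrix corresponds to the dummy variables used later in the IRT likelihood of Section 3.</p>
      <preformat>
import numpy as np

# Learner-by-item score matrix: rows are learners in J, columns are
# items in I; entries take values in {1, ..., K} and -1 marks missing.
U = np.array([
    [3, -1,  5],
    [2,  4, -1],
    [-1, 1,  2],
])

# z_ji = 1 if u_ji is observed and 0 if missing; this is the dummy
# variable that later excludes missing scores from the log-likelihood.
Z = (U != -1).astype(int)
print(Z.sum(), "observed scores out of", Z.size)
      </preformat>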
      <p>A common approach to estimating learner ability is to calculate the average or total score for
each learner. However, such simple methods are not suitable for datasets with missing data, such
as those exemplified in Table 1. This limitation arises because item characteristics such as difficulty
and discrimination vary among items, causing average or total scores to depend heavily on which
items each learner is graded on [36, 37]. This property is not suitable for our objective, which is to
accurately estimate learner ability from score data with substantial missing values to reduce assessment
workload. To address this limitation, we utilize IRT, a robust framework for estimating learner ability
from incomplete data.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Item Response Theory</title>
      <p>
        IRT [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is a test theory based on statistical models that has been widely employed for ability estimation
and item analysis in various educational tests. IRT estimates learner ability by considering the
characteristics of test items, such as item difficulty and discrimination. This is done using probabilistic models
known as IRT models, which define the probability of score observations as a function of learner ability
and item characteristics.
      </p>
      <p>Among the various IRT models, the generalized partial credit model (GPCM) [38] is a representative
model particularly suited for Likert-scale polytomous score data, as assumed in this study. The GPCM
defines the probability that learner $j$ receives score $k$ for constructed-response test item $i$ as
$$P(u_{ji} = k) = \frac{\exp \sum_{m=1}^{k} \left[ \alpha_i (\theta_j - \beta_i - d_{im}) \right]}{\sum_{l=1}^{K} \exp \sum_{m=1}^{l} \left[ \alpha_i (\theta_j - \beta_i - d_{im}) \right]}, \quad (2)$$
where $\theta_j$ represents the latent ability of learner $j$, $\alpha_i$ is the discrimination parameter for item $i$, $\beta_i$ is
the difficulty parameter for item $i$, and $d_{im}$ is the step difficulty parameter representing the difficulty
of transitioning between scores $m-1$ and $m$ for the item. For model identification, $d_{i1} = 0$ and
$\sum_{m=2}^{K} d_{im} = 0$ are assumed.</p>
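      <p>As a concrete illustration of Equation (2), the following is a minimal Python sketch of the GPCM category probabilities; the parameter values are invented for illustration and satisfy the identification constraints above.</p>
      <preformat>
import numpy as np

def gpcm_probs(theta, alpha, beta, d):
    """Category probabilities P(u = k), k = 1..K, under the GPCM.

    theta: learner ability; alpha, beta: item discrimination and
    difficulty; d: step-difficulty array of length K with d[0] = 0.
    """
    # Cumulative sums over m = 1..k of alpha * (theta - beta - d_m).
    logits = np.cumsum(alpha * (theta - beta - np.asarray(d)))
    logits = logits - logits.max()   # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Example: K = 5 categories with illustrative parameter values
# (d[0] = 0 and the remaining steps sum to 0, as required).
print(gpcm_probs(theta=0.5, alpha=1.2, beta=-0.3,
                 d=[0.0, -0.5, 0.0, 0.4, 0.1]))
      </preformat>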
      <p>The parameters of learner ability and item characteristics are estimated from a score collection $\mathcal{U}$,
typically by maximizing the following log-likelihood function:
$$\log \mathcal{L} = \sum_{i \in \mathcal{I}} \sum_{j \in \mathcal{J}} z_{ji} \log P(u_{ji}), \quad (3)$$
where $z_{ji}$ is a dummy variable that equals 0 if $u_{ji} = -1$ and 1 otherwise. As evident from this equation,
IRT can estimate parameters, including ability, from datasets with missing scores by calculating the
likelihood while excluding missing scores [
        <xref ref-type="bibr" rid="ref15 ref16 ref17">15, 16, 17</xref>
        ]. Furthermore, IRT generally provides more
accurate ability estimates compared with methods based on simple averages or total scores because
it accounts for the characteristics of test items during the estimation process [36, 37]. However, even
with the IRT approach, the accuracy of ability estimation diminishes as the proportion of missing
scores increases. A common strategy to address this limitation is the application of data augmentation
techniques to impute missing scores.
      </p>
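      <p>To illustrate the likelihood-based estimation above, the following sketch, assuming known item parameters and reusing the GPCM probabilities of the previous sketch, computes a maximum-likelihood ability estimate for one learner whose missing score is simply skipped, as in Equation (3); all parameter values are invented.</p>
      <preformat>
import numpy as np
from scipy.optimize import minimize_scalar

def gpcm_probs(theta, alpha, beta, d):
    # GPCM category probabilities (see the sketch above).
    logits = np.cumsum(alpha * (theta - beta - np.asarray(d)))
    p = np.exp(logits - logits.max())
    return p / p.sum()

def neg_log_lik(theta, scores, alphas, betas, ds):
    # Missing scores (coded -1) contribute nothing: z_ji = 0.
    ll = 0.0
    for u, a, b, d in zip(scores, alphas, betas, ds):
        if u == -1:
            continue
        ll += np.log(gpcm_probs(theta, a, b, d)[u - 1])
    return -ll

# Ability estimate for a learner graded on only two of three items,
# with illustrative (assumed known) item parameters.
scores = [4, -1, 2]
alphas = [1.0, 1.3, 0.8]
betas = [0.0, 0.5, -0.4]
ds = [[0.0, -0.5, 0.0, 0.4, 0.1]] * 3
res = minimize_scalar(neg_log_lik, bounds=(-4, 4), method="bounded",
                      args=(scores, alphas, betas, ds))
print(res.x)
      </preformat>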
    </sec>
    <sec id="sec-4">
      <title>4. Data Augmentation</title>
      <p>
        There are various methods for imputation-based data augmentation. Simple approaches include mean or
mode imputation, in which missing scores are replaced with averages or the most frequent scores [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
Although computationally efficient, these methods often produce biased estimates because they fail to
effectively capture the underlying patterns within the data.
      </p>
      <p>
        More advanced approaches such as statistical model-based and machine learning-based methods aim
to predict missing scores by modeling such underlying patterns for more accurate imputation [
        <xref ref-type="bibr" rid="ref21 ref22 ref23 ref24 ref25">21, 22,
23, 24, 25, 26</xref>
        ]. For example, one representative approach involves constructing supervised machine
learning models, such as linear regression, support vector machine, and random forests, that predict
each variable as the objective variable using the remaining variables as explanatory variables [
        <xref ref-type="bibr" rid="ref21 ref25">21,
25, 26</xref>
        ]. Another approach is based on unsupervised learning and directly utilizes the similarity of
observed data patterns among samples [
        <xref ref-type="bibr" rid="ref22 ref23 ref24">22, 23, 24</xref>
        ]. A typical example is k-nearest neighbors (k-NN),
which estimates missing values by identifying similar samples based on their observed data [
        <xref ref-type="bibr" rid="ref22 ref24">22, 24</xref>
        ].
Other examples include matrix factorization techniques, such as singular value decomposition, which
approximate the data as a low-rank matrix, estimating missing values by uncovering latent structures
and dependencies among variables [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Multiple imputation is another approach that integrates these
individual imputation methods by generating multiple plausible datasets and combining the results
through statistical pooling [39]. This approach accounts for the uncertainty of missing data.
      </p>
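      <p>For reference, a k-NN imputation of the kind described here can be sketched with scikit-learn's KNNImputer (the experiments in this paper instead used the VIM and missForest libraries in R); the scores are invented, and missing entries are recoded from -1 to NaN because scikit-learn expects NaN.</p>
      <preformat>
import numpy as np
from sklearn.impute import KNNImputer

# Learner-by-item scores with NaN marking missing entries (the -1
# coding of Equation (1) would be mapped to NaN beforehand).
U = np.array([
    [3.0, np.nan, 5.0],
    [2.0, 4.0, np.nan],
    [np.nan, 1.0, 2.0],
    [3.0, 4.0, 5.0],
])

imputer = KNNImputer(n_neighbors=2)
# Round to the nearest category and keep scores in the valid range 1..5.
U_complete = np.clip(np.rint(imputer.fit_transform(U)), 1, 5)
print(U_complete)
      </preformat>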
      <p>However, as discussed in Section 1, these traditional imputation methods often struggle to achieve
accurate imputation in real-world constructed-response tests, particularly in situations with high data
sparsity and potential violations of the assumption of an underlying consistent data-generation process.
This difficulty arises because they primarily infer missing values based on the relationships observed
among the available scores (e.g., correlations between items or similarities between learners). When
data sparsity is high, or when the assumption of a single underlying data-generation process is violated,
these observed relationships become unreliable predictors for the missing scores, leading to biased
or inaccurate imputations. To address these limitations, the main idea of this study is to leverage
automated scoring technologies to impute missing scores.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Automated Scoring for Constructed Responses</title>
      <p>
        Recently, automated essay scoring and automated short-answer grading using artificial intelligence
technologies have become prominent topics of artificial intelligence in the education community [
        <xref ref-type="bibr" rid="ref11">11,
27, 28, 29, 30, 31</xref>
        ]. While various methods have been proposed, conventional automated scoring
approaches typically fall into one of two categories: feature-engineering-based approaches or
neural-based approaches [
        <xref ref-type="bibr" rid="ref11">11, 31</xref>
        ].
      </p>
      <p>
        Feature-engineering-based approaches rely on manually designed features, such as text length or the
number of spelling errors, to predict scores using regression or classification models [
        <xref ref-type="bibr" rid="ref10">10, 40, 41, 42, 43, 44</xref>
        ].
While this approach offers interpretability and explainability, achieving high accuracy generally requires
extensive effort in feature design and selection, which often needs to be tailored for each specific test
item.
      </p>
      <p>To address this limitation, neural-based approaches, which automatically extract features from data
using deep neural networks, have gained increasing popularity. Early neural models have primarily
employed convolutional neural networks or recurrent neural networks [45, 46, 47, 48, 49, 50]. More
recent advancements have focused on using pretrained transformer networks [51], such as bidirectional
encoder representations from transformers (BERT) [52], which have demonstrated superior performance
and accuracy in automated scoring tasks [53, 54, 55, 56, 57, 58]. BERT and its variants use extensive
pretraining on large-scale text corpora, with high accuracy obtained by fine-tuning them for a target
scoring task using relatively small datasets of scored responses.</p>
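      <p>As an indicative sketch of this fine-tuning setup (not the authors' exact training code), the following uses the Hugging Face Transformers library to attach a classification head over the [CLS] representation, mirroring the configuration reported in Section 7 (tohoku-nlp/bert-base-japanese-v3, AdamW, learning rate 1e-5); the Japanese tokenizer additionally requires the fugashi and unidic-lite packages.</p>
      <preformat>
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# One scorer per item: a pre-trained Japanese BERT with a linear head
# over the [CLS] token, treating the K = 5 score categories as classes.
name = "tohoku-nlp/bert-base-japanese-v3"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=5)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(answer_texts, scores):
    """One gradient step on a mini-batch; scores are in {1, ..., 5}."""
    batch = tokenizer(answer_texts, padding=True, truncation=True,
                      return_tensors="pt")
    labels = torch.tensor(scores) - 1          # shift to class ids {0..4}
    out = model(**batch, labels=labels)        # cross-entropy loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
      </preformat>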
      <p>Most recently, LLMs have emerged as the next frontier in automated scoring. LLMs build upon the
transformer architecture, similar to BERT and its variants, but are pretrained on more massive and
diverse datasets using various training techniques, such as reinforcement learning from human feedback
and instruction tuning [59, 60]. A major advantage of LLMs is their capability to address various natural
language tasks, including automated scoring of constructed responses, by providing a concrete task
explanation as a prompt in a zero-shot setting or by including a small number of examples alongside
the prompt in a few-shot setting [61]. This reduces the reliance on extensive labeled datasets, making
LLMs highly adaptable to a wide range of natural language processing tasks. Recent studies exploring
the application of LLMs for essay and short-answer scoring have shown that LLMs often achieve
reasonable scoring performance using zero-shot or few-shot approaches [28, 30, 62, 63, 64, 65, 66, 67],
although they tend to perform worse than conventional fine-tuned scoring models based on pretrained
transformers [62, 64, 65, 67, 68].</p>
    </sec>
    <sec id="sec-6">
      <title>6. Proposed Method</title>
      <p>This study proposes a novel imputation-based data augmentation approach using automated scoring
technologies for accurate IRT-based ability estimation in constructed-response tests. The proposed
approach consists of the following steps:
1. Developing a scoring model: Neural automated scoring models are prepared either by
fine-tuning BERT or its variants on a subset of manually scored learner answers for each test item, or
by employing a zero-shot scoring model using LLMs. The choice between these methods depends
on various conditions as listed below:
• Fine-tuned models: These methods are recommended when a relatively large number of
scored answers, such as more than a hundred, can be prepared for each item. This is often
feasible in scenarios with a large number of examinees and when manual grading costs are
acceptable.
• Zero-shot scoring models: These are more suitable when only a very limited number of
scored answers are available for each item. For instance, this approach is preferable when
the number of examinees is small, or when scoring individual answers is time-consuming.
Zero-shot models are also suitable for situations in which clear scoring criteria are available
or in which the evaluation task is relatively easy, because zero-shot evaluation is expected
to be effective in such cases.
2. Predicting missing scores: Once the scoring model is prepared, it is used to predict missing
scores and construct a complete score dataset.
3. Estimating ability using IRT models: The augmented dataset is then used to estimate learner
ability by applying IRT models (see the sketch following this list).</p>
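      <p>The following minimal sketch illustrates steps 2 and 3 at the interface level: any scorer satisfying the assumed score_fn signature (a fine-tuned BERT model or a zero-shot LLM) fills the missing entries, and the completed matrix is then handed to a GPCM estimation routine such as the likelihood of Section 3; the helper names and data are hypothetical.</p>
      <preformat>
import numpy as np

def augment_scores(U, answer_texts, score_fn):
    """Step 2: fill missing entries (-1) using an automated scorer.

    score_fn(item_index, answer_text) must return a score in 1..K;
    it may wrap a fine-tuned BERT model or a zero-shot LLM call.
    """
    U_complete = U.copy()
    n_learners, n_items = U.shape
    for j in range(n_learners):
        for i in range(n_items):
            if U_complete[j, i] == -1:
                U_complete[j, i] = score_fn(i, answer_texts[j][i])
    return U_complete

# Demo with a dummy scorer that always returns the middle category;
# step 3 would pass the completed matrix to a GPCM estimator.
U = np.array([[3, -1, 5], [2, 4, -1]])
texts = [["answer 1a", "answer 1b", "answer 1c"],
         ["answer 2a", "answer 2b", "answer 2c"]]
print(augment_scores(U, texts, lambda i, text: 3))
      </preformat>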
      <p>The proposed method is expected to achieve more robust imputation, even for heterogeneous data
lacking a consistent data-generation process, because it directly leverages learner answer text to predict
missing scores without modeling underlying patterns of score data. Additionally, it is expected to provide
accurate imputations even under high missing ratios because constructing scoring models based on
pre-trained neural language models or zero-shot models often achieves reasonable scoring performance
with relatively few or no samples, as demonstrated in recent automated scoring studies [28, 32, 33, 34, 35].
These features make the proposed method suitable for achieving accurate ability estimation while
reducing the need for human grading.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Experiments</title>
      <sec id="sec-7-1">
        <title>7.1. Data</title>
        <p>We conducted empirical evaluation experiments using real-world datasets to demonstrate the
effectiveness of the proposed method.</p>
        <p>For our experiments, we required datasets comprising scored constructed responses for multiple items
in which the same set of learners answered all items. Therefore, popular benchmark datasets for
automated essay or short-answer grading tasks, such as ASAP (automated student assessment prize;
https://www.kaggle.com/competitions/asap-aes) and ASAP–SAS (short answer scoring;
https://github.com/benhamner/ASAP-SAS), could not be used owing to the lack of information identifying
respondents for each answer. Consequently, we utilized the following three datasets:
1. Short-Answer Grading (SAG) Dataset: This dataset, developed by the Benesse Educational
Research and Development Institute in Japan, consists of responses from 511 Japanese university
students to three short-answer items in a Japanese reading comprehension test. Scores for the
responses were provided by expert raters using five rating categories for each item.
2. Essay Scoring (ES) Dataset [69]: This dataset consists of essays written by 327 Japanese
university students in response to three essay tasks offered in a natural science lecture. Each
response was scored on a five-point scale by expert raters.
3. ELYZA-tasks-100 (ELYZA) Dataset (https://huggingface.co/datasets/elyza/ELYZA-tasks-100/tree/main): This dataset is designed for evaluating Japanese LLMs. It
includes responses generated by 33 LLMs for 100 writing tasks, with each response scored by
expert raters using a five-point scale. We used this dataset by treating individual LLMs as learners
and writing tasks as constructed-response test items. Although this dataset is not specifically
intended for automated essay or short-answer scoring, it appears to be suitable for applying
the proposed method with a zero-shot scoring model. This is because the number of examinees
is small, and LLM-generated responses are relatively straightforward to score using LLMs in a
zero-shot manner.</p>
        <sec id="sec-7-1-1">
          <title>7.2. Experimental Procedures</title>
          <p>We conducted the following experiments for each dataset:
1. We estimated the IRT parameters from the complete score data based on the GPCM introduced
earlier. The obtained ability estimates for learners were treated as gold-standard values in this
experiment.
2. We created incomplete datasets from the complete score data by converting some scores into
missing values. The missing ratios were varied as follows:
• For the SAG and ES datasets, we examined three missing ratios: 33%, 50%, and 62%
• For the ELYZA dataset, we examined five missing ratios: 10%, 20%, 50%, 65%, and 80%
The incomplete datasets were created following a systematic design [70, 71, 72] to generate
missing patterns while ensuring the applicability of IRT and the conventional missing imputation
methods. The algorithm for creating missing patterns, which supports the rationale behind the
selection of missing ratios for each dataset, is detailed in the Appendix.
3. Using each incomplete score dataset, we applied the GPCM to estimate learner ability through
the following methods:
• Estimation without imputation: This method directly applies the GPCM to the
incomplete score dataset, ignoring missing scores during likelihood calculation as detailed in
Section 3.
• Estimation with imputation by k-NN or random forest: Ability is estimated by the
GPCM after missing scores are imputed using k-NN or random forest (RF), as introduced
in Section 4. We used the VIM and missForest libraries in R for these imputations.
• Estimation with imputation by the proposed method: Ability is estimated using the
GPCM based on the proposed method. In the SAG and ES datasets, BERT-based models
are used for the automated scoring models fine-tuned on scored answers for each item.
Specifically, we used BERT models with a linear output layer on top of the [CLS] token,
which is appended to the beginning of the input text; the pre-trained model was
tohoku-nlp/bert-base-japanese-v3, the optimizer was AdamW with a learning rate of 1e-5,
and the mini-batch size was 64 for the SAG dataset and 16 for the ES dataset.</p>
        <p>For the ELYZA dataset, we employed a zero-shot scoring approach with GPT-4o, where the
prompt was designed to include detailed task instructions with scoring criteria defined in
the original dataset, as shown in Table 2.</p>
        <p>Table 2: The zero-shot grading prompt used with GPT-4o.</p>
        <p>You are a grader. You will be given a test item, a reference answer, a grading rubric, and a response.
Referencing the grading rubric and the reference answer, grade the response on a scale of 1 to 5, and
output the number only.</p>
        <p>Test Item: {text of the test item}
Reference Answer: {text of the reference answer}
Basic Grading Criteria
1. Incorrect: Does not follow instructions. Chooses an incorrect option in a multiple-choice question. (†)
2. Incorrect but heading in the right direction: Usually a 3-point answer with a 1-point deduction. (†)
3. Partially correct: Addresses the majority of a complex instruction correctly. (†)
4. Correct: Correctly answers the question. (†)
5. Helpful: Correctly answers the question and further anticipates user needs. (†)
Basic Deductions: Scores may be adjusted based on the following factors.
- Unnatural Japanese (-1 point): Syntactically awkward or unclear Japanese, repetition of the same
sentence, or abrupt insertion of English words.
- Partial Hallucination (-1 point): A response partially inconsistent with facts. (†)
- Excessive Safety Concerns (Score as 2 points): e.g., Responds with “I cannot answer for ethical reasons.”
Item-Specific Grading Criteria: {Grading Criteria defined for each item}
Response: {Response to be evaluated}</p>
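        <p>For reference, the following is a minimal sketch of how such a zero-shot scoring call might look with the OpenAI Python client; the prompt is abbreviated (the basic criteria and deductions of Table 2 are omitted for brevity), and the function assumes a valid API key is configured.</p>
        <preformat>
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def zero_shot_score(item, reference, criteria, response):
    """Request a 1-5 score from GPT-4o using a shortened Table 2 prompt."""
    prompt = (
        "You are a grader. You will be given a test item, a reference answer, "
        "a grading rubric, and a response. Referencing the grading rubric and "
        "the reference answer, grade the response on a scale of 1 to 5, and "
        "output the number only.\n\n"
        f"Test Item: {item}\n"
        f"Reference Answer: {reference}\n"
        f"Item-Specific Grading Criteria: {criteria}\n"
        f"Response: {response}"
    )
    out = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(out.choices[0].message.content.strip())
        </preformat>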
        <p>4. The root-mean-squared error (RMSE) and Pearson’s correlation coefficient were calculated
between the estimated and gold-standard abilities. To address the scale indeterminacy inherent in
IRT estimation, the ability values were normalized to have a mean of zero and a variance of one,
ensuring that the estimated and gold-standard abilities were directly comparable.
5. Steps 2 to 4 were repeated 10 times, each time using a different missing pattern. To create different
missing patterns using the same algorithm, the order of learners was randomly shuffled with
each repetition.</p>
        <p>Furthermore, we investigated the performance of the proposed method in situations in which the
conventional approach was not applicable owing to a substantially high missing ratio. Specifically, we
examined three additional missing ratios, namely, 70%, 80%, and 90%, for the SAG and ES datasets, and
a 100% missing ratio for the ELYZA dataset in experimental procedure 2 for the proposed method. It
should be noted that under these conditions, the conventional method cannot be applied due to the
presence of learners with no assigned scores.</p>
        <sec id="sec-7-2-1">
          <title>7.3. Experimental Results</title>
          <p>Fig. 1 shows the experimental results. The upper plot shows the averaged RMSEs, while the lower
plot shows the averaged correlation coefficients, with error bars indicating the standard deviation
over 10 repetitions. An exception is the condition with a 100% missing ratio in the ELYZA dataset, in
which results are based on a single trial due to the lack of variation in missing patterns. Note that the
conventional methods have no results for missing ratios above 62% in the SAG and ES datasets and
100% in the ELYZA dataset because they are not applicable under these conditions, as described above.</p>
          <p>According to the experimental results, the proposed method demonstrated lower RMSEs and higher
correlation coefficients compared to conventional methods across all conditions. While all methods
showed a reduction in the accuracy of ability estimation as the missing ratio increased, conventional
imputation methods exhibited a more rapid decline in accuracy than the proposed method, as
highlighted in the results for the ELYZA dataset. This reflects the difficulty that conventional methods
have in modeling the patterns underlying observed score data under high missing ratios.</p>
          <p>Additionally, conventional imputation methods underperformed compared to ability estimation without
imputation, suggesting that inaccurate imputation introduces heavy bias into ability estimation.</p>
          <p>A paired t-test was conducted to compare the average accuracy differences between the proposed
method and each of the other methods at each missing ratio. The results indicated significant differences
at the 1% significance level for all comparisons, except for the comparison between the proposed
method and the method without imputation at a missing ratio of 10% in the ELYZA dataset. These
findings indicate that the proposed method achieves remarkably high accuracy in ability estimation
from incomplete data, except when the missing ratio is extremely low.</p>
          <p>Furthermore, the proposed method achieved reasonable accuracy in ability estimation in situations
in which conventional methods are not applicable. Specifically, it maintained high accuracy even with
a 100% missing ratio in the ELYZA dataset, corresponding to the case employing zero-shot scoring.
The use of fine-tuned scoring models in the SAG and ES datasets also demonstrated relatively good
accuracy with a missing ratio of up to 80%. More specifically, the accuracies under the 80% missing
ratio were higher than those of conventional methods at a 62% missing ratio. These results highlight
the unique and important advantage of the proposed method for accurately estimating abilities, except
in cases of extremely large missing ratios, such as at a missing ratio of 90%, when fine-tuned scoring
models are used.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Analysis</title>
      <sec id="sec-8-1">
        <title>8.1. Accuracy of Missing Score Prediction</title>
        <p>We can infer that the high accuracy in ability estimation of the proposed method comes from its
high accuracy in missing score imputation. To confirm this, we analyzed the accuracy of missing
score imputation for each method. Specifically, using the results from experimental procedure 3, we
calculated the agreement between the predicted scores for missing values generated by each imputation
method, including k-NN, RF, and the proposed method, and their corresponding true scores. As an
evaluation metric, we used quadratic weighted kappa (QWK), which is commonly employed in research
on automated scoring.</p>
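        <p>For reference, both evaluation measures used in this paper can be computed with standard libraries, as in the following sketch with invented values: QWK via scikit-learn for imputation agreement, and RMSE and Pearson’s correlation on standardized abilities, mirroring the normalization of experimental procedure 4.</p>
        <preformat>
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Imputation agreement on the missing entries (invented values).
true_scores = np.array([4, 2, 5, 3, 1, 4])
imputed_scores = np.array([4, 3, 5, 3, 2, 4])
qwk = cohen_kappa_score(true_scores, imputed_scores, weights="quadratic")

# Ability accuracy: standardize both sets of estimates to remove the
# IRT scale indeterminacy, then compute RMSE and Pearson's correlation.
def standardize(x):
    return (x - x.mean()) / x.std()

gold = standardize(np.array([-0.8, 0.1, 1.2, 0.4, -0.9]))
est = standardize(np.array([-0.7, 0.3, 1.0, 0.5, -1.1]))
rmse = np.sqrt(np.mean((gold - est) ** 2))
r, _ = pearsonr(gold, est)
print(qwk, rmse, r)
        </preformat>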
        <p>Fig. 2 shows the results, with the average QWK values with error bars representing the standard
deviations obtained from 10 repeated experiments. Given that higher QWK values indicate greater
imputation accuracy, the results show that the proposed method achieves high accuracy in imputing
missing scores, whereas conventional methods exhibit markedly lower accuracy.</p>
        <p>Furthermore, in the SAG and ES datasets, the proposed method exhibits a decline in accuracy as
the missing ratio increases due to the reduction in training data for fine-tuning, while the degraded
accuracies remain higher than those of conventional methods. Moreover, in the ELYZA dataset, while
conventional methods suffer a drastic decline in imputation accuracy, the proposed method maintains
high accuracy regardless of the missing ratio. These results suggest that the proposed method achieves
high imputation accuracy, which is likely to contribute to its high ability estimation accuracy.</p>
      </sec>
      <sec id="sec-8-2">
        <title>8.2. Robustness to Heterogeneity</title>
        <p>As described in Section 1, the proposed method is expected to be effective for imputing missing
scores that are difficult to predict from the patterns of observed data owing to their heterogeneity. To
demonstrate this, Table 3 provides examples of score imputations by the proposed method and the
k-NN method for two learners sampled from the results of the SAG dataset.</p>
        <p>In the table, the True scores row represents the original complete data and the w/o imput. row shows
the incomplete data (with missing values indicated by “NA”). The k-NN and Proposed rows show the
complete data created by imputing the missing values using each respective method. Additionally, the
True avg. row indicates the averaged scores for each item calculated from the complete data, while ^
represents the estimated abilities based on each score dataset.</p>
        <p>[Table 3: Examples of score imputation by the k-NN and proposed methods. Rows: True scores, w/o imput., k-NN, Proposed, True avg.; columns: Item 1 to Item 3.]</p>
        <p>Considering that the average item scores follow the order of item 3 &gt; item 2 &gt; item 1, predicting
the missing data in the left table as a score of 5, as done by the k-NN method, might seem reasonable.
Similarly, in the right table, given that the score for item 3 was 3, the predictions made by the k-NN
method from this observed pattern might also appear plausible. However, as the true scores show,
these predictions by the k-NN method are quite inaccurate. This discrepancy arises because the true
scores in these cases do not follow the surrounding patterns, making them difficult to predict.</p>
        <p>As shown in the examples above, conventional methods struggle to handle such data. In contrast,
the proposed method does not rely on modeling surrounding score patterns but instead evaluates the
content of individual answer texts. Therefore, the proposed method can make accurate predictions for
heterogeneous score data.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>9. Conclusion</title>
      <p>This study proposed a novel method for imputing missing scores to enhance IRT-based ability estimation
by leveraging automated scoring technologies. Experimental results demonstrated that the proposed
method achieves higher accuracy in ability estimation compared with conventional approaches, even
under conditions of high missing ratios or heterogeneous data. This indicates that the proposed method
achieves high accuracy in ability estimation while markedly reducing the manual grading workload.</p>
      <p>However, we acknowledge several limitations. Firstly, the effectiveness of our method depends on the
accuracy of the employed automated scoring model. Inaccurate scoring models, whether fine-tuned on
insufficient data or based on zero-shot LLMs with suboptimal prompts, could lead to biased imputations
and consequently affect the validity of the final ability estimates. The observed accuracy drop at the
90% missing ratio when using fine-tuned models might partially reflect this dependency. Secondly,
developing these scoring models can require substantial computational resources for fine-tuning or
careful prompt engineering for zero-shot approaches, potentially offsetting the intended reduction
in manual grading workload. Thirdly, while we demonstrated effectiveness on specific datasets, all
datasets were limited to the Japanese language, which restricts the generalizability of our findings
to broader linguistic and assessment contexts. Fourthly, the proposed method treats imputed scores
as deterministic inputs to IRT estimation and does not propagate the uncertainty associated with
automated scoring, which could result in overconfident or biased ability estimates. Lastly, the use of
black-box models such as BERT and LLMs limits the interpretability of individual imputed scores and
hinders systematic fairness evaluation, an important consideration for high-stakes testing scenarios.</p>
      <p>
        Future research should address these limitations and explore further extensions. Although this study
focused on unidimensional IRT, the proposed method may be applicable in various ability measurement
contexts. For instance, the method is applicable to complex student models, such as multidimensional
IRT [73], cognitive diagnostic models [74], and knowledge tracing [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These complex models are
more sensitive to data sparsity, which may further highlight the advantages of the proposed approach.
Furthermore, exploring methods to incorporate the uncertainty associated with automated scores could
lead to more reliable ability estimates and associated error measures. In addition, methods to enhance the
interpretability of imputed scores should be explored to increase transparency and trust in the system.
Evaluating the method on more diverse datasets across multiple languages, domains, and assessment
formats is also a key direction for improving external validity. In addition, future studies should consider
comparing the proposed method with more advanced deep-learning-based imputation methods, such
as deep matrix factorization [75] and variational autoencoders, which were not included in the current
evaluation. Another promising direction is to leverage IRT models with rater parameters [37, 76, 77, 78],
such as many-facet Rasch models, to achieve more valid ability estimation by treating AI graders as
distinct raters within a unified measurement framework. Finally, studying the integration of this method
into practical applications like adaptive testing systems for constructed-response items, where real-time
scoring and imputation could enhance test efficiency and personalization, would be valuable. Given
the critical role of accurate ability estimation in learning support systems, we believe these future
investigations hold significant promise.
      </p>
    </sec>
    <sec id="sec-10">
      <title>Appendix: Algorithms for Generating Missing Patterns</title>
      <p>In this appendix, we explain how incomplete data were created from complete data in the experiments.
For the missing ratios of 33%, 50%, and 62% in the SAG and ES datasets, the missing patterns, which
enable IRT parameter linking [70, 71, 72] and the application of other imputation methods, were
generated as follows:
• The 33% missing pattern was created by repeating the patterns of learners 1–3 on the left side of</p>
      <p>Table 1.
• The 50% missing pattern was created by repeating the patterns of learners 1–6 on the right side
of Table 1.
• The 62% missing pattern was created by combining one repetition of learners 1–3 with six
repetitions of learners 4–6, based on the pattern on the right side of Table 1.</p>
      <p>For the missing ratios of 10%, 20%, 50%, 65%, and 80% in the ELYZA dataset, these missing patterns
were generated using Algorithm 1, which also ensures parameter linking in IRT estimation and the
application of other imputation methods.</p>
      <sec id="sec-10-1">
        <title>Algorithm 1 Algorithm for creating missing patterns for the ELYZA dataset.</title>
        <p>Input: ℐ, 𝒥, r (missing ratio, %)
Initialize the missing indicator variables {z_ji = 1 | i ∈ ℐ, j ∈ 𝒥} defined in Equation (3)
for j ∈ 𝒥 do
    Set a = (j × 10) mod 100 and b = (j × 10 + r) mod 100
    if b &gt; a then
        Set z_ji = 0 for i in [a : b]
    else
        Set z_ji = 0 for i in [a : 100] and [1 : b]
    end if
end for</p>
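        <p>An indicative Python reimplementation of Algorithm 1 (0-indexed, under the reconstruction above) is as follows; for each learner it marks a wrapping window covering r of the 100 ELYZA items as missing, with the window offset by 10 items per learner.</p>
        <preformat>
import numpy as np

def elyza_missing_pattern(n_learners, r):
    """Missing-indicator matrix Z for the 100 ELYZA items (Algorithm 1).

    For learner j (0-indexed), a wrapping window covering r of the 100
    items, offset by 10 items per learner, is marked missing (z_ji = 0).
    """
    Z = np.ones((n_learners, 100), dtype=int)
    for j in range(n_learners):
        a = (j * 10) % 100
        b = (j * 10 + r) % 100
        if b &gt; a:
            Z[j, a:b] = 0
        else:          # the window wraps around the end of the item list
            Z[j, a:] = 0
            Z[j, :b] = 0
    return Z

Z = elyza_missing_pattern(n_learners=33, r=50)
print(Z.sum())         # 33 learners * 50 observed items = 1650
        </preformat>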
        <p>For the other missing ratios in the SAG and ES datasets, where IRT and conventional imputation
methods are not applicable, missing patterns were generated by randomly selecting learners for each
item until the specified percentage of scores was converted to missing values.</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>Acknowledgments</title>
      <p>This work was supported by JSPS KAKENHI Grant Numbers 23K20727 and 24H00739. We thank
Yuki Doka and Yoshihiro Kato from the Benesse Educational Research and Development Institute for
permission to use the SAG dataset.</p>
    </sec>
    <sec id="sec-12">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4o for grammar and spelling checking.
After using this tool, the authors reviewed and edited the content as needed and take full responsibility
for the publication’s content.</p>
    </sec>
    <sec id="sec-13">
      <title>References</title>
      <p>Service Operations Management 24 (2022) 485–503.
[26] X. Zhang, X. Song, H. Wang, H. Zhang, Sequential local least squares imputation estimating
missing value of microarray data, Computers in Biology and Medicine 38 (2008) 1112–1120.
[27] R. Chaudhari, M. Patel, Deep learning in automated short answer grading: A comprehensive
review, in: ITM Web of Conferences, volume 65, 2024.
[28] G. Kortemeyer, Performance of the pre-trained large language model GPT-4 on automated short
answer grading, Discover Artificial Intelligence 4 (2024) 47.
[29] S. Li, V. Ng, Automated essay scoring: A reflection on the state of the art, in: Proceedings of the
2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 17876–17888.
[30] H. Misgna, B.-W. On, I. Lee, G. S. Choi, A survey on deep learning-based automated essay scoring
and feedback generation, Artificial Intelligence Review 58 (2025) 1–40.
[31] M. Uto, A review of deep-neural automated essay scoring models, Behaviormetrika 48 (2021)
459–484.
[32] H. Do, Y. Kim, G. G. Lee, Prompt- and trait relation-aware cross-prompt essay trait scoring, in:</p>
      <p>Findings of the Association for Computational Linguistics, 2023, pp. 1538–1551.
[33] L. Jiang, N. Bosch, Short answer scoring with GPT-4, in: Proceedings of the Eleventh ACM</p>
      <p>Conference on Learning@ Scale, 2024, pp. 438–442.
[34] R. Ridley, L. He, X. Y. Dai, S. Huang, J. Chen, Automated cross-prompt scoring of essay traits, in:
Proceedings of the Association for the Advancement of Artificial Intelligence, volume 35, 2021, pp.
13745–13753.
[35] T. Shibata, M. Uto, Enhancing cross-prompt automated essay scoring by selecting training data
based on reinforcement learning, in: Workshop on Automated Evaluation of Learning and
Assessment Content, 2024.
[36] M. Uto, M. Ueno, Empirical comparison of item response theory models with rater’s parameters,</p>
      <p>Heliyon 4 (2018).
[37] M. Uto, M. Ueno, A generalized many-facet Rasch model and its Bayesian estimation using</p>
      <p>Hamiltonian Monte Carlo, Behaviormetrika 47 (2020) 469–496.
[38] E. Muraki, A generalized partial credit model, in: W. J. van der Linden, R. K. Hambleton (Eds.),</p>
      <p>Handbook of Modern Item Response Theory, Springer, 1997, pp. 153–164.
[39] D. B. Rubin, Multiple imputation after 18+ years, Journal of the American Statistical Association
91 (1996) 473–489.
[40] S. Burrows, I. Gurevych, B. Stein, The eras and trends of automatic short answer grading, Journal
of Artificial Intelligence in Education 25 (2015) 60–117.
[41] M. Dascalu, W. Westera, S. Ruseti, S. Trausan-Matu, H. Kurvers, ReaderBench learns Dutch:
Building a comprehensive automated essay scoring system for Dutch language, in: Proceedings of
the International Conference on Artificial Intelligence in Education, 2017, pp. 52–63.
[42] C. Leacock, M. Chodorow, C-rater: Automated scoring of short-answer questions, Computers and
the Humanities 37 (2003) 389–405.
[43] H. V. Nguyen, D. J. Litman, Argument mining for improving the automated scoring of persuasive
essays, in: Proceedings of the Association for the Advancement of Artificial Intelligence, volume 32,
2018.
[44] M. D. Shermis, J. C. Burstein, Automated Essay Scoring: A Cross-disciplinary Perspective,
Routledge, 2003.
[45] D. Alikaniotis, H. Yannakoudakis, M. Rei, Automatic text scoring using neural networks, in:
Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2016, pp.
715–725.
[46] Y. Farag, H. Yannakoudakis, T. Briscoe, Neural automated essay scoring and coherence modeling
for adversarially crafted input, in: Proceedings of the Annual Conference of the North American
Chapter of the Association for Computational Linguistics, 2018, pp. 263–271.
[47] C. Jin, B. He, K. Hui, L. Sun, TDNN: A two-stage deep neural network for prompt-independent
automated essay scoring, in: Proceedings of the Annual Meeting of the Association for Computational
Linguistics, 2018, pp. 1088–1097.
[48] M. Mesgar, M. Strube, A neural local coherence model for text quality assessment, in: Proceedings
of the Conference on Empirical Methods in Natural Language Processing, 2018, pp. 4328–4339.
[49] K. Taghipour, H. T. Ng, A neural approach to automated essay scoring, in: Proceedings of the</p>
      <p>Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1882–1891.
[50] Y. Wang, Z. Wei, Y. Zhou, X. Huang, Automatic essay scoring incorporating rating schema via
reinforcement learning, in: Proceedings of the Conference on Empirical Methods in Natural
Language Processing, 2018, pp. 791–797.
[51] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin,
Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp.
5998–6008.
[52] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers
for language understanding, in: Proceedings of the Annual Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019,
pp. 4171–4186.
[53] T. Liu, W. Ding, Z. Wang, J. Tang, G. Y. Huang, Z. Liu, Automatic short answer grading via multiway
attention networks, in: Proceedings of the International Conference on Artificial Intelligence in
Education, 2019, pp. 169–173.
[54] J. Lun, J. Zhu, Y. Tang, M. Yang, Multiple data augmentation strategies for improving performance
on automatic short answer scoring, in: Proceedings of the Association for the Advancement of
Artificial Intelligence, volume 34, 2020, pp. 13389–13396.
[55] E. Mayfield, A. W. Black, Should you fine-tune BERT for automated essay scoring?, in: Proceedings
of the Workshop on Innovative Use of NLP for Building Educational Applications, 2020, pp. 151–
162.
[56] C. Sung, T. I. Dhamecha, N. Mukhi, Improving short answer grading using transformer-based
pre-training, in: Proceedings of the International Conference on Artificial Intelligence in Education,
2019, pp. 469–481.
[57] J. Xue, X. Tang, L. Zheng, A hierarchical BERT-based transfer learning approach for
multidimensional essay scoring, IEEE Access 9 (2021) 125403–125415.
[58] M. Yamaura, I. Fukuda, M. Uto, Neural automated essay scoring considering logical structure,
in: Proceedings of the International Conference on Artificial Intelligence in Education, 2023, pp.
267–278.
[59] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, et al., Training a helpful
and harmless assistant with reinforcement learning from human feedback, arXiv preprint (2022).
[60] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, et al., Training
language models to follow instructions with human feedback, Advances in Neural Information
Processing Systems 35 (2022) 27730–27744.
[61] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners,</p>
      <p>Advances in Neural Information Processing Systems 35 (2022) 22199–22213.
[62] I. Chamieh, T. Zesch, K. Giebermann, LLMs in short answer scoring: Limitations and promise of
zero-shot and few-shot approaches, in: Proceedings of the 19th Workshop on Innovative Use of
NLP for Building Educational Applications, 2024, pp. 309–315.
[63] L.-H. Chang, F. Ginter, Automatic short answer grading for finnish with ChatGPT, in: Proceedings
of the AAAI Conference on Artificial Intelligence, volume 38, 2024, pp. 23173–23181.
[64] S. Lee, Y. Cai, D. Meng, Z. Wang, Y. Wu, Unleashing large language models’ proficiency in zero-shot
essay scoring, in: Findings of the Association for Computational Linguistics, 2024, pp. 181–198.
[65] W. A. Mansour, S. Albatarni, S. Eltanbouly, T. Elsayed, Can large language models automatically
score proficiency of written essays?, in: Proceedings of the 2024 Joint International Conference on
Computational Linguistics, Language Resources and Evaluation, 2024, pp. 2777–2786.
[66] M. Stahl, L. Biermann, A. Nehring, H. Wachsmuth, Exploring LLM prompting strategies for joint
essay scoring and feedback generation, in: Proceedings of the 19th Workshop on Innovative Use
of NLP for Building Educational Applications, 2024, pp. 283–298.
[67] Y. Wang, R. Hu, Z. Zhao, Beyond agreement: Diagnosing the rationale alignment of automated
essay scoring methods based on linguistically-informed counterfactuals, in: Findings of the
Association for Computational Linguistics, 2024, pp. 8906–8925.
[68] K. P. Yancey, G. Laflair, A. Verardi, J. Burstein, Rating short L2 essays on the CEFR scale with
GPT-4, in: Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational
Applications, 2023, pp. 576–584.
[69] K. Takeuchi, M. Ohno, K. Motojin, M. Taguchi, Y. Inada, M. Iizuka, T. Abo, H. Ueda, Development
of essay scoring methods based on reference texts with construction of research-available Japanese
essay data, Journal of Information Processing [In Japanese] 62 (2021) 1586–1604.
[70] M. Ilhan, A comparison of the results of many-facet Rasch analyses based on crossed and judge
pair designs, Educational Sciences: Theory and Practice 16 (2016) 579–601.
[71] M. J. Kolen, R. L. Brennan, Test Equating, Scaling, and Linking, Springer, 2014.
[72] M. Uto, Accuracy of performance-test linking based on a many-facet Rasch model, Behavior</p>
      <p>Research Methods 53 (2021) 1440–1454.
[73] M. D. Reckase, Multidimensional Item Response Theory, Statistics for Social and Behavioral</p>
      <p>Sciences, Springer-Verlag New York, NY, 2009.
[74] J. L. Templin, R. A. Henson, Measurement of psychological disorders using cognitive diagnosis
models, Psychological Methods 11 (2006) 287–305.
[75] H.-J. Xue, X. Dai, J. Zhang, S. Huang, J. Chen, Deep matrix factorization models for recommender
systems, in: Proceedings of the 26th International Joint Conference on Artificial Intelligence, 2017,
pp. 3203–3209.
[76] T. Eckes, Introduction to Many-Facet Rasch Measurement: Analyzing and Evaluating
Rater-Mediated Assessments, Peter Lang Publishing, 2023.
[77] M. Uto, A multidimensional generalized many-facet Rasch model for rubric-based performance
assessment, Behaviormetrika 48 (2021) 425–457.
[78] M. Uto, J. Tsuruta, K. Araki, M. Ueno, Item response theory model highlighting rating scale of a
rubric and rater–rubric interaction in objective structured clinical examination, PLOS ONE 19
(2024) 1–23.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hajjioui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Zine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Benslimane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ibriz</surname>
          </string-name>
          ,
          <article-title>Intelligent tutoring systems: A review</article-title>
          ,
          <source>in: Big Data and Internet of Things</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>663</fpage>
          -
          <lpage>676</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Khine</surname>
          </string-name>
          ,
          <article-title>Using AI for Adaptive Learning</article-title>
          and Adaptive Assessment, Springer Nature,
          <year>2024</year>
          , pp.
          <fpage>341</fpage>
          -
          <lpage>466</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>A survey of knowledge tracing: Models, variants, and applications</article-title>
          ,
          <source>IEEE Transactions on Learning Technologies</source>
          <volume>17</volume>
          (
          <year>2024</year>
          )
          <fpage>1898</fpage>
          -
          <lpage>1919</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tomikawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Suzuki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Uto</surname>
          </string-name>
          ,
          <article-title>Adaptive question-answer generation with difficulty control using item response theory and pre-trained transformer models</article-title>
          ,
          <source>IEEE Transactions on Learning Technologies</source>
          <volume>17</volume>
          (
          <year>2024</year>
          )
          <fpage>2186</fpage>
          -
          <lpage>2198</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Abosalem</surname>
          </string-name>
          ,
          <article-title>Assessment techniques and students' higher-order thinking skills</article-title>
          ,
          <source>International Journal of Secondary Education</source>
          <volume>4</volume>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Bernardin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thomason</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Buckley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Kane</surname>
          </string-name>
          ,
          <article-title>Rater rating-level bias and accuracy in performance appraisals: The impact of rater personality, performance management competence, and rater accountability</article-title>
          ,
          <source>Human Resource Management</source>
          <volume>55</volume>
          (
          <year>2016</year>
          )
          <fpage>321</fpage>
          -
          <lpage>340</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>O. L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Frankel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. C.</given-names>
            <surname>Roohr</surname>
          </string-name>
          ,
          <article-title>Assessing critical thinking in higher education: Current state and directions for next-generation assessment</article-title>
          ,
          <source>ETS Research Report Series</source>
          (
          <year>2014</year>
          )
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Mislevy</surname>
          </string-name>
          ,
          <source>Sociocognitive Foundations of Educational Measurement</source>
          , Routledge,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Murtonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Balloo</surname>
          </string-name>
          ,
          <source>Redefining Scientific Thinking for Higher Education: Higher-Order Thinking, Evidence-Based Reasoning and Research Skills</source>
          , Palgrave Macmillan,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Amorim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cançado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Veloso</surname>
          </string-name>
          ,
          <article-title>Automated essay scoring in the presence of biased ratings</article-title>
          ,
          <source>in: Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>229</fpage>
          -
          <lpage>237</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <article-title>Automated essay scoring: A survey of the state of the art</article-title>
          ,
          <source>in: Proceedings of the International Joint Conference on Artificial Intelligence</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6300</fpage>
          -
          <lpage>6308</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Leckie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Baird</surname>
          </string-name>
          ,
          <article-title>Rater effects on essay scoring: A multilevel analysis of severity drift, central tendency, and rater experience</article-title>
          ,
          <source>Journal of Educational Measurement</source>
          <volume>48</volume>
          (
          <year>2011</year>
          )
          <fpage>399</fpage>
          -
          <lpage>418</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Uto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Okano</surname>
          </string-name>
          ,
          <article-title>Learning automated essay scoring models using item-response-theory-based scores to decrease effects of rater biases</article-title>
          ,
          <source>IEEE Transactions on Learning Technologies</source>
          <volume>14</volume>
          (
          <year>2021</year>
          )
          <fpage>763</fpage>
          -
          <lpage>776</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>F.</given-names>
            <surname>Lord</surname>
          </string-name>
          ,
          <source>Applications of Item Response Theory to Practical Testing Problems</source>
          , Routledge,
          <year>1980</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>F.</given-names>
            <surname>Baker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <source>Item Response Theory: Parameter Estimation Techniques</source>
          , CRC Press,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>W. J.</given-names>
            <surname>van der Linden</surname>
          </string-name>
          ,
          <source>Handbook of Item Response Theory, Volume Two: Statistical Tools</source>
          , CRC Press,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Nering</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ostini</surname>
          </string-name>
          ,
          <source>Handbook of Polytomous Item Response Theory Models</source>
          , Routledge,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Barrabés</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. N.</given-names>
            <surname>Moriano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Giró-I-Nieto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Montserrat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Ioannidis</surname>
          </string-name>
          ,
          <article-title>Advances in biomedical missing data imputation: A survey</article-title>
          ,
          <source>IEEE Access</source>
          <volume>13</volume>
          (
          <year>2025</year>
          )
          <fpage>16918</fpage>
          -
          <lpage>16932</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dezfouli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. V.</given-names>
            <surname>Bonilla</surname>
          </string-name>
          ,
          <article-title>Transformed distribution matching for missing value imputation</article-title>
          ,
          <source>in: Proceedings of the 40th International Conference on Machine Learning</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>42159</fpage>
          -
          <lpage>42186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>R. J. A.</given-names>
            <surname>Little</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. B.</given-names>
            <surname>Rubin</surname>
          </string-name>
          ,
          <source>Statistical Analysis with Missing Data</source>
          , Wiley &amp; Sons,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bertsimas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pawlowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. D.</given-names>
            <surname>Zhuo</surname>
          </string-name>
          ,
          <article-title>From predictive methods to missing data imputation: An optimization approach</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>18</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>L. P.</given-names>
            <surname>Brás</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Menezes</surname>
          </string-name>
          ,
          <article-title>Improving cluster-based missing value estimation of DNA microarray data</article-title>
          ,
          <source>Biomolecular Engineering</source>
          <volume>24</volume>
          (
          <year>2007</year>
          )
          <fpage>273</fpage>
          -
          <lpage>282</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>F.</given-names>
            <surname>Husson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Josse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Robin</surname>
          </string-name>
          ,
          <article-title>Imputation of mixed data with multilevel singular value decomposition</article-title>
          ,
          <source>Journal of Computational and Graphical Statistics</source>
          <volume>28</volume>
          (
          <year>2019</year>
          )
          <fpage>552</fpage>
          -
          <lpage>566</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>O.</given-names>
            <surname>Troyanskaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cantor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sherlock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hastie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tibshirani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Botstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <article-title>Missing value estimation methods for DNA microarrays</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>17</volume>
          (
          <year>2001</year>
          )
          <fpage>520</fpage>
          -
          <lpage>525</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>What is the impact of nonrandomness on random choice models?</article-title>
          ,
          <source>Manufacturing &amp; Service Operations Management</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>