<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Ordinality in Discrete-level Question Difficulty Estimation: Introducing Balanced DRPS and OrderedLogitNN</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Arthur Thuy</string-name>
          <email>arthur.thuy@ugent.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ekaterina Loginova</string-name>
          <email>ekaterina.d.loginova@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dries F. Benoit</string-name>
          <email>dries.benoit@ugent.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CVAMO Core Lab Flanders Make</institution>
          ,
          <addr-line>Tweekerkenstraat 2, 9000 Ghent</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dedalus Healthcare</institution>
          ,
          <addr-line>Roderveldlaan 2, 2600 Antwerp</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Ghent University</institution>
          ,
          <addr-line>Tweekerkenstraat 2, 9000 Ghent</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recent years have seen growing interest in Question Difficulty Estimation (QDE) using natural language processing techniques. Question difficulty is often represented using discrete levels, framing the task as ordinal regression due to the inherent ordering from easiest to hardest. However, the literature has neglected the ordinal nature of the task, relying on classification or discretized regression models, with specialized ordinal regression methods remaining unexplored. Furthermore, evaluation metrics are tightly coupled to the modeling paradigm, hindering cross-study comparability. While some metrics fail to account for the ordinal structure of difficulty levels, none adequately address class imbalance, resulting in biased performance assessments. This study addresses these limitations by benchmarking three types of model outputs—discretized regression, classification, and ordinal regression—using the balanced Discrete Ranked Probability Score (DRPS), a novel metric that jointly captures ordinality and class imbalance. In addition to using popular ordinal regression methods, we propose OrderedLogitNN, extending the ordered logit model from econometrics to neural networks. We fine-tune BERT on the RACE++ and ARC datasets and find that OrderedLogitNN performs considerably better on complex tasks. The balanced DRPS offers a robust and fair evaluation metric for discrete-level QDE, providing a principled foundation for future research.</p>
      </abstract>
      <kwd-group>
        <kwd>Question Difficulty Estimation</kwd>
        <kwd>Natural language processing</kwd>
        <kwd>Ordinal regression</kwd>
        <kwd>Ordered logit</kwd>
        <kwd>Fine-tuning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Question Difficulty Estimation (QDE), also known as question calibration, aims to predict a question’s
difficulty directly from its textual content and its answer options. This task plays a central role in
personalized learning tools such as computerized adaptive testing [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and dynamic online learning
platforms, which aim to present questions aligned with a learner’s proficiency level. Selecting questions
that are either too easy or excessively difficult can reduce student motivation and hinder learning
outcomes [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Reliable estimation of question difficulty is therefore essential.
      </p>
      <p>
        Traditionally, QDE has relied on manual calibration [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or pretesting [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], both of which are
time-consuming and costly. To address these limitations, research has explored the use of natural language
processing (NLP) techniques. These approaches train machine learning models to infer difficulty from
the question text, allowing for rapid and scalable calibration of new questions without the need for
manual intervention.
      </p>
      <p>Difficulty levels in QDE are represented either as continuous scores or as discrete categories, with
discrete levels being attractive for their ease of use. While continuous difficulty estimation is generally
framed as a regression task, discrete-level QDE is essentially an ordinal regression problem, given the
inherent ordering of difficulty levels from easiest to hardest.</p>
      <p>However, the discrete-level QDE literature has neglected the ordinal nature of the task. Instead,
existing work relies exclusively on classification and discretized regression methods, both of which
are oversimplifications of the problem structure. Classification models disregard ordinal relationships
altogether, and while discretized regression methods preserve ordinality, they implicitly assume equal
spacing between levels—an assumption often violated in real-world data. As such, specialized ordinal
regression techniques remain unexplored in this context. Moreover, no prior studies have systematically
compared these competing approaches. Compounding the issue, studies typically report only the
performance metrics aligning with their chosen modeling paradigm, making cross-study comparisons difficult.
The metrics also fail to account for class imbalances, a prevalent issue in this setting. As a result, there
is no consensus on the most effective evaluation metric or modeling approach for discrete-level QDE.</p>
      <p>[Table 1: overview of related work in discrete-level QDE, listing each study’s model output format
(regression, classification, or ordinal) and its reported metrics (accuracy, adjacent accuracy, F1-score,
RMSE, R², Spearman’s rank correlation, balanced DRPS).]</p>
      <p>This study addresses the literature gaps in discrete-level QDE by proposing the balanced Discrete
Ranked Probability Score (DRPS), a novel evaluation metric that jointly captures ordinality and class
imbalance. It also provides a direct way to compare deterministic predictions to probabilistic ones, which
are especially valuable for downstream decision-making. We benchmark three types of model outputs—
discretized regression, classification, and ordinal regression—using the balanced DRPS. Moreover, we
propose a novel ordinal regression model, OrderedLogitNN, extending the ordered logit model from
econometrics to neural networks (NNs). Our work is the first to (i) introduce the balanced DRPS metric,
(ii) compare classification and discretized regression models for QDE, and (iii) investigate specialized
ordinal regression techniques, including the novel OrderedLogitNN. We conduct experiments by
fine-tuning the Transformer model BERT on the RACE++ and ARC datasets.</p>
      <p>The remainder of the paper is structured as follows. Section 2 reviews related work. Section 3
introduces the balanced DRPS metric. Section 4 discusses the novel OrderedLogitNN model and Section
5 outlines existing methods for ordinal tasks. Section 6 describes our experimental setup, and Section 7
presents the results and discussion. We conclude in Section 8. The source code is available on GitHub
(https://github.com/arthur-thuy/qde-ordinality).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Building on the survey by [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we further investigate studies in QDE that utilize datasets with discrete
difficulty levels. Question difficulty is defined using one of three main approaches: (i) Classical Test
Theory (CTT) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], (ii) Item Response Theory (IRT) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and (iii) manual calibration. In the case of manual
calibration with expert annotators, difficulty is almost exclusively assigned in discrete levels due to its
ease of use. While difficulty scores derived from CTT and IRT are continuous by nature, they are often
discretized in practical applications to facilitate interpretation. Table 1 provides an overview of related
work in discrete-level QDE.
2.1. Output types
Early work on discrete-level QDE employed classification models, such as support vector machines
and Bayesian NNs [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ]. Subsequently, [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] proposed a discretized regression approach using an
LSTM-based NN, where difficulty was predicted as a continuous value between 0 and 1 and mapped to
discrete intervals (e.g., [0.0; 0.2), [0.2; 0.4), etc.).
      </p>
      <p>
        [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] introduced a multi-task BERT-based model that leverages shared representations across datasets,
using a classification head to predict difficulty levels. To reduce the reliance on large labeled datasets
in supervised methods, [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] proposed an unsupervised QDE method. They leverage the uncertainty
in pre-trained question-answering models as a proxy for human-perceived difficulty, computed as the
variance over the predictions from an ensemble of classification models.
      </p>
      <p>
        [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] conducted a benchmarking study comparing traditional machine learning methods and
end-to-end NNs across datasets with both discrete and continuous difficulty labels. Their results show that
fine-tuned Transformer-based models such as BERT and DistilBERT consistently outperform classical
methods. However, these models exclusively employed a discretized regression approach to address the
ordinal nature of the labels.
      </p>
      <p>
        More recently, [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] investigated active learning for QDE, demonstrating that comparable performance
to fully supervised models can be achieved by labeling only a small fraction of the training data. Yet
again, only a discretized regression modeling strategy was considered. As summarized in Table 1, prior
work has not directly compared discretized regression and classification approaches, and specialized
ordinal regression methods remain entirely unexplored.
2.2. Metrics
Accuracy is the most widely used evaluation metric for discrete-level QDE, particularly in studies
employing classification models, as it aligns with the cross-entropy loss typically used during training.
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is the only study using a discretized regression approach that also reports accuracy. However,
accuracy fails to account for the ordinal structure of difficulty levels: all misclassifications are treated
equally, regardless of their distance from the true label. Even if a prediction is incorrect, it should still
be as close as possible to the true difficulty level. Additionally, accuracy is a threshold-based metric,
relying solely on the final predicted label rather than the full output distribution—an issue that also
applies to metrics such as the F1-score. To partially address this, [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] report adjacent accuracy,
defined as the proportion of predictions within $k$ levels of the true label. However, the choice of $k$ is
often arbitrary and dataset-dependent, complicating comparisons across studies.
      </p>
      <p>
        The most recent works, which adopt discretized regression approaches, report RMSE as their primary
evaluation metric, treating difficulty levels as integers. While RMSE reflects the ordinal structure to
some extent, it assumes uniform spacing between levels—a condition rarely met in real-world data.
For example, in primary school, there is a non-linear increase in difficulty over the years [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. This
limitation also applies to metrics such as R² and Spearman’s rank correlation. Furthermore, since RMSE
is used as the loss function in discretized regression models, these models benefit from being evaluated
on the same objective they were optimized for, giving them an unfair advantage and introducing a
potential bias in comparative evaluations.
      </p>
      <p>
        In addition to the ordinality aspect, these commonly used metrics fail to account for class imbalance,
which is a prevalent issue in discrete-level QDE. In many datasets, mid-range difficulty levels tend to
dominate, while questions at the extremes—those that are very easy or very difficult—are
underrepresented [
        <xref ref-type="bibr" rid="ref17">17, 18</xref>
        ]. Standard metrics, which compute aggregate scores across all samples, inherently place
greater weight on the majority classes. As a result, model performance on minority classes is often
underrepresented, leading to inflated metrics that do not accurately reflect model effectiveness across
the full difficulty spectrum. This is particularly problematic in educational contexts, where balanced
performance across all difficulty levels is essential to ensure adequate personalized learning experiences.
      </p>
      <p>As a result, the literature ignores the ordinal aspect in the modeling approaches and employs
suboptimal evaluation metrics for discrete-level QDE. There is a clear need for an evaluation metric that
simultaneously accounts for ordinality and class imbalance, thereby enabling fair comparison across
different modeling strategies.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Balanced Discrete Ranked Probability Score</title>
      <p>The Continuous Ranked Probability Score (CRPS) is the most widely adopted scoring rule for evaluating
probabilistic forecasts of real-valued variables, such as in precipitation forecasting [19]. It is defined as
the integral of the squared difference (i.e., Brier score) between the cumulative distribution function
(CDF) of a probabilistic forecast $F$ and the CDF of the observed outcome, at all real-valued thresholds.
The observed outcome is represented as a degenerate distribution, as its CDF is a step function. Formally,
given a dataset $D = \{\mathbf{x}_i, y_i\}_{i=1}^{N}$, the CRPS is computed as:</p>
      <p>$$\mathrm{CRPS}(F, D) = \frac{1}{N} \sum_{i=1}^{N} \int_{-\infty}^{\infty} \left( F_i(\hat{y}) - \mathbb{1}\{\hat{y} \geq y_i\} \right)^2 \, d\hat{y},$$
where $F_i(\hat{y})$ denotes the CDF of the forecast and $\mathbb{1}\{\cdot\}$ is the step function.</p>
      <p>The CRPS is distance-sensitive, meaning it rewards forecasts that assign higher probability mass
to values near the true outcome. Specifically, when the forecast distribution concentrates probability
density around the true value, the squared error between the forecast CDF and the observed step
function is smaller across the integration range, resulting in a lower (better) score. In other words, even
if the predicted value does not exactly coincide with the true outcome, placing substantial probability
on neighboring values results in a better score than a prediction that is entirely off-target.</p>
      <p>This property makes the CRPS particularly well-suited for ordinal prediction tasks. In such cases, the
DRPS serves as a natural extension of the CRPS for discrete outcomes across $K$ ordered categories. The
DRPS has only been applied in meteorology [20] and has received little attention in the general field of
ordinal regression. It is defined as:</p>
      <p>$$\mathrm{DRPS}(F, D) = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K-1} \left( F_i(k) - \mathbb{1}\{k \geq y_i\} \right)^2,$$
where $F_i(k)$ denotes the predicted cumulative probability up to class $k$. The step function $\mathbb{1}\{\cdot\}$ moves
from 0.0 to 1.0 at the position of the ground truth label.</p>
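      <p>As a concrete illustration, the DRPS can be computed in a few lines of NumPy; this is a minimal sketch assuming levels are encoded as indices $0, \ldots, K-1$, with illustrative array names:</p>
      <preformat>
import numpy as np

def drps(probs: np.ndarray, y: np.ndarray) -> float:
    """probs: predicted class probabilities, shape [N, K]; y: true level indices.

    Deterministic predictions can be passed as one-hot rows (degenerate
    distributions), in which case the score reduces to the mean absolute error.
    """
    n, k = probs.shape
    cdf = np.cumsum(probs, axis=1)[:, : k - 1]      # F_i(k) at the K-1 thresholds
    step = np.arange(k - 1)[None, :] >= y[:, None]  # CDF of the degenerate truth
    return float(np.mean(np.sum((cdf - step) ** 2, axis=1)))
      </preformat>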
      <p>Unlike other evaluation metrics in related work, the DRPS operates on full probability distributions
rather than point estimates, enabling a more nuanced assessment of ordinal predictions. Such a
probability distribution over the levels is available for all classification and ordinal regression models,
but not for the discretized regression model as it only outputs a predicted difficulty level. In the
case of deterministic predictions, the output is treated as a degenerate distribution—analogous to the
representation of the observed outcome. Figures 1 and 2 illustrate how the DRPS is computed for single
observations with probabilistic and deterministic model outputs.</p>
      <p>Crucially, the DRPS respects the ordinal structure of the prediction task without assuming equal
inter-class distances, a notable limitation of metrics such as RMSE. Additionally, when applied to
deterministic predictions, the DRPS reduces to the mean absolute error, providing a direct way to
compare deterministic and probabilistic predictions within a unified evaluation metric.</p>
      <p>
        In this work, we introduce the balanced DRPS to address the class imbalance in discrete-level QDE
datasets, where extreme difficulty levels are typically underrepresented compared to mid-range levels
[
        <xref ref-type="bibr" rid="ref17">17, 18</xref>
        ]. The popular metrics (accuracy and RMSE) and standard DRPS compute unweighted averages,
which overemphasize performance on majority classes and can produce misleadingly high scores on
imbalanced data. However, for educational practitioners, robust performance across the full spectrum
of difficulty levels—including the rarest—is essential. To ensure fair evaluation, the balanced DRPS
weights each observation inversely proportional to the prevalence of its true class, using weights
$w_i = 1 / \sum_{j=1}^{N} \mathbb{1}\{y_j = y_i\}$:
$$\mathrm{DRPS}_{\mathrm{bal}}(F, D) = \frac{1}{K} \sum_{i=1}^{N} w_i \sum_{k=1}^{K-1} \left( F_i(k) - \mathbb{1}\{k \geq y_i\} \right)^2.$$
      </p>
      <p>Thus for balanced datasets, the balanced DRPS is equivalent to the standard DRPS.</p>
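      <p>A sketch of the balanced variant under the same assumptions as the DRPS snippet above (it additionally assumes every class occurs at least once in the evaluation set, so all inverse-frequency weights are defined):</p>
      <preformat>
import numpy as np

def balanced_drps(probs: np.ndarray, y: np.ndarray) -> float:
    n, k = probs.shape
    counts = np.bincount(y, minlength=k)  # prevalence of each true class
    w = 1.0 / counts[y]                   # w_i = 1 / count(y_i); weights sum to K
    cdf = np.cumsum(probs, axis=1)[:, : k - 1]
    step = np.arange(k - 1)[None, :] >= y[:, None]
    per_obs = np.sum((cdf - step) ** 2, axis=1)
    return float(np.sum(w * per_obs) / k)  # equals drps() on balanced data
      </preformat>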
      <p>In conclusion, the balanced DRPS offers a robust and fair evaluation metric for discrete-level QDE by
accounting for both ordinal structure and class imbalance. It supports both deterministic and probabilistic
predictions and remains neutral to training objectives, making it well-suited for benchmarking across
diverse modeling approaches.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Ordered Logit for NNs</title>
      <p>OrderedLogitNN extends the classical ordered logit model to NNs, effectively bridging the gap between
econometrics and deep learning. At its core, it is a latent variable model, where each observation is
associated with an unobserved continuous utility value $y_i^*$ [21], modeled as $y_i^* = \mathbf{x}_i^\top \boldsymbol{\beta} + \epsilon_i$. The observed
ordinal outcome $y_i$ is derived from $y_i^*$ through a censoring mechanism, whereby the continuous latent
variable is mapped to one of $K$ discrete categories based on a sequence of $K + 1$ increasing threshold
values $\{\tau_{-1}, \tau_0, \tau_1, \ldots, \tau_{K-1}\}$.</p>
      <p>To identify the model parameters, several normalizations are required. First, the thresholds must be
increasing, $\tau_k &gt; \tau_{k-1}$, to ensure valid (i.e., positive) probabilities. Second, the endpoints of the support
are fixed as $\tau_{-1} = -\infty$ and $\tau_{K-1} = +\infty$, covering the entire real line. Third, the error term $\epsilon_i$ is
assumed to follow a standardized logistic distribution (mean zero, variance $\pi^2/3$). The logistic distribution
is preferred over the Gaussian (i.e., probit) for computational convenience, as the derivative of its CDF has a
closed-form solution and is readily available as the sigmoid function. Finally, since $\mathbf{x}_i$ includes a bias term, the
threshold $\tau_0 = 0$.</p>
      <p>The model defines the class probabilities as:
$$P(y_i = k \mid \mathbf{x}_i) = \Lambda(\tau_k - \mathbf{x}_i^\top \boldsymbol{\beta}) - \Lambda(\tau_{k-1} - \mathbf{x}_i^\top \boldsymbol{\beta}),$$
with $\Lambda$ the CDF of the logistic distribution. An example for $K = 3$ is shown in Figure 3.</p>
      <p>The NN is trained by minimizing the negative log-likelihood (NLL):</p>
      <p>−1
NLL = ∑︁ ∑︁  log [ (  − x ) −  (
=1 =0
−1 − x )] ,
where  = 1 if  =  and 0 otherwise. The thresholds are reparameterized to ensure monotonically
increasing values:   =  −1 + exp( ) = ∑︀=1 exp( ).</p>
      <p>The $\gamma_k$ parameters are initialized such that the ordinal levels have equal probability mass under the
logistic distribution, with the first threshold set to zero. The bias term is initialized to lie at the center of
this distribution, while all other weights follow standard initialization practices in PyTorch. To facilitate
convergence, the learning rates for the $\gamma_k$ values and the bias term are scaled to be 100 times larger
than those of the remaining network parameters.</p>
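      <p>A minimal PyTorch sketch of such an ordered-logit output head is given below. The module and variable names are illustrative, the equal-probability-mass initialization of the $\gamma_k$ values is simplified to zeros, and the 100x learning-rate scaling would be handled via optimizer parameter groups:</p>
      <preformat>
import torch
import torch.nn as nn

class OrderedLogitHead(nn.Module):
    """Maps backbone features (e.g., BERT's pooled output) to K class probabilities."""

    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        self.utility = nn.Linear(in_features, 1)  # y* = x'beta (bias included)
        # K-2 free parameters; with tau_0 = 0 and the two infinite endpoints,
        # this yields the K-1 finite thresholds needed for K classes.
        self.gamma = nn.Parameter(torch.zeros(num_classes - 2))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        y_star = self.utility(features)                            # [N, 1]
        tau = torch.cat([torch.zeros(1, device=y_star.device),
                         torch.cumsum(torch.exp(self.gamma), 0)])  # tau_0..tau_{K-2}
        cdf = torch.sigmoid(tau[None, :] - y_star)                 # logistic CDF, [N, K-1]
        lower = torch.cat([torch.zeros_like(y_star), cdf], dim=1)  # Lambda(tau_{k-1} - y*)
        upper = torch.cat([cdf, torch.ones_like(y_star)], dim=1)   # Lambda(tau_k - y*)
        return upper - lower                                       # P(y = k | x), [N, K]

# Training minimizes the NLL of the class probabilities, e.g.:
# loss = torch.nn.functional.nll_loss(torch.log(head(feats) + 1e-12), targets)
      </preformat>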
      <p>Importantly, OrderedLogitNN is architecture-agnostic and can be integrated into any NN. Additionally,
it makes no assumptions about the distances between ordinal levels, allowing it to flexibly model a
wide range of ordered regression problems.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Existing approaches for ordinal regression with NNs</title>
      <p>This section describes existing approaches to handle ordinal regression using NNs. Depending on
the specific approach, the amount of ordinal information used and the underlying assumptions vary.
This study uses three existing specialized ordinal regression methods that follow the extended binary
classification framework, most widely used in the ordinal regression literature [22]. Note that the
methods discussed below are not tied to any specific architecture and can be utilized with any NN.</p>
      <p>Let $D = \{\mathbf{x}_i, y_i\}_{i=1}^{N}$ be the training dataset consisting of $N$ training examples. Here, $\mathbf{x}_i \in \mathcal{X}$ denotes
the $i$th training example and $y_i$ the corresponding rank, where $y_i \in \mathcal{Y} = \{r_1, r_2, \ldots, r_K\}$ with ordered
ranks $r_K \succ r_{K-1} \succ \cdots \succ r_1$. The objective is to find a model that maps $\mathcal{X} \rightarrow \mathcal{Y}$. For example, the ARC
dataset has $K = 7$ difficulty levels with an output space $\mathcal{Y} = \{\text{“grade 3”}, \text{“grade 4”}, \ldots, \text{“grade 9”}\}$.
5.1. Discretized regression
In the regression approach, also referred to as discretized regression, the $K$ rank indices are treated as
numerical values to utilize the ordinal information (see Table 1 for references). The model $f$ minimizes
the mean squared error loss and predicts a real-valued quantity $f(\mathbf{x}_i) \in \mathbb{R}$ representing a continuous
rank estimate, which is then converted to the closest rank index. For example, a regression estimate of
2.7 is converted to index 3 while estimate 5.2 is converted to index 5.</p>
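      <p>For instance, this decoding step amounts to rounding and clipping; a minimal sketch assuming rank indices $1, \ldots, K$:</p>
      <preformat>
import numpy as np

def to_rank_index(estimate: float, num_classes: int) -> int:
    # Round the continuous rank estimate to the nearest valid index.
    return int(np.clip(np.rint(estimate), 1, num_classes))

assert to_rank_index(2.7, 7) == 3 and to_rank_index(5.2, 7) == 5
      </preformat>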
      <p>
        Using a discretized regression approach in an ordinal QDE problem assumes that the inter-level
distances are equal. However, this condition is only rarely satisfied in practice [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. On the ARC dataset
with levels “grade 3” to “grade 9”, for example, such an approach assumes that the jump in difficulty
between consecutive grades is identical.
5.2. Classification
In the classification approach, the model’s output space is a set of $K$ unordered labels, one for each
rank (see Table 1 for references). The model is trained to minimize the cross-entropy loss and the
predicted rank label is the class with the highest predicted probability. As such, the predicted rank label
is $\hat{y}_i = \arg\max_{k \in \mathcal{Y}} P(k \mid \mathbf{x}_i)$.
      </p>
      <p>This approach essentially assumes that the difficulty levels are completely independent, hence
discarding the available ordinal information. For example, for a question in the ARC dataset with true
level “grade 3”, predicting levels “grade 4” and “grade 5” incurs the same loss even though the difference
between “grade 3” and “grade 5” is larger than that between “grade 3” and “grade 4”.
5.3. Ordinal: OR-NN
A popular general machine learning approach to ordinal regression is to cast it as an extended binary
classification problem [23], leveraging the relative order among the labels. That is, the ordinal regression
task with $K$ ranks is represented as a series of $K - 1$ simpler binary classification sub-problems. For
each rank index $k \in \{1, 2, \ldots, K - 1\}$, a binary classifier is trained according to whether the rank of a
sample is larger than $r_k$. As such, all $K - 1$ tasks share the same intermediate layers but are assigned
distinct weight parameters in the output layer. In summary, this framework relies on three steps: (i)
extending rank labels to binary vectors, (ii) training binary classifiers on the extended labels, and (iii)
computing the predicted rank label from the binary classifiers.</p>
      <p>In 2016, the authors of [24] adapted this framework to train NNs for ordinal regression; we refer
to this method as OR-NN. More formally, a rank label $y_i$ is first extended into $K - 1$ binary labels
$y_i^{(1)}, \ldots, y_i^{(K-1)}$ such that $y_i^{(k)} \in \{0, 1\}$ indicates whether $y_i$ exceeds rank $r_k$, i.e.,
$y_i^{(k)} = \mathbb{1}\{y_i &gt; r_k\}$. Using the extended binary labels, a single NN is trained with $K - 1$ binary classifiers in
the output layer to minimize the cross-entropy loss. Based on the binary task predictions, the predicted
rank label is $\hat{y}_i = r_q$. The rank index $q$ is given by</p>
      <p>$$q = 1 + \sum_{k=1}^{K-1} \mathbb{1}\left\{ P(y_i &gt; r_k) &gt; 0.5 \right\},$$
where $P(y_i &gt; r_k) \in [0, 1]$ is the predicted probability of the $k$th binary classifier in the output layer.</p>
      <p>However, the authors pointed out that OR-NN can suffer from rank inconsistencies among the binary
tasks, such that the predictions for individual binary tasks may disagree. For example, on the RACE++
dataset, it would be contradictory if the first binary task predicts that the difficulty is not higher than
middle school level while the second binary task predicts it to be more difficult than high school level.
This inconsistency could lead to suboptimal results when combining the $K - 1$ predictions to obtain the
estimated difficulty level. Figure 4 provides an example of a rank consistent and a rank inconsistent prediction.
In response, two methods have been proposed that overcome this drawback of rank inconsistency:
CORAL and CORN.</p>
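      <p>A minimal NumPy sketch of the OR-NN label extension and rank decoding described above, assuming ranks are encoded as integers $1, \ldots, K$ (function names are illustrative):</p>
      <preformat>
import numpy as np

def extend_labels(y: np.ndarray, num_classes: int) -> np.ndarray:
    # y_i^(k) = 1{y_i > r_k} for thresholds r_1 .. r_{K-1}
    thresholds = np.arange(1, num_classes)
    return (y[:, None] > thresholds[None, :]).astype(np.float32)

def decode_rank(binary_probs: np.ndarray) -> np.ndarray:
    # q = 1 + sum_k 1{P(y > r_k) > 0.5}
    return 1 + np.sum(binary_probs > 0.5, axis=1)

y = np.array([1, 3, 7])
assert (decode_rank(extend_labels(y, 7)) == y).all()
      </preformat>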
      <p>[Figure 4: example predictions that are (a) rank inconsistent and (b) rank consistent.]</p>
      <p>5.4. Ordinal: CORAL and CORN
CORAL [25] achieves rank consistency by imposing a weight-sharing constraint in the last layer. Instead
of learning distinct weights between each unit in the penultimate layer and each output unit, CORAL
enforces that the $K - 1$ binary tasks share the same weight parameters to the units in the penultimate
layer. In addition, CORAL learns independent bias terms for each output unit, as opposed to a single
bias term for the output layer.</p>
      <p>CORAL uses a cross-entropy loss over the $K - 1$ binary classifiers, and the authors show theoretically
that by minimizing this loss function, the learned bias terms of the output layer are non-increasing such
that $b_1 \geq b_2 \geq \cdots \geq b_{K-1}$. Consequently, the predicted probabilities of the $K - 1$ tasks are decreasing,
which ensures that the output reflects the ordinal information and is rank consistent. All other steps
are identical to the extended binary classification framework.</p>
      <p>However, while CORAL outperforms the OR-NN method in age prediction [25], the weight-sharing
constraint may restrict the expressiveness and capacity of the NN.</p>
      <p>More recently, the authors of [22] proposed CORN, which guarantees rank consistency without
restricting the NN’s expressiveness and capacity. CORN achieves rank consistency by a novel training
scheme which uses conditional training sets in order to obtain the unconditional rank probabilities.</p>
      <p>More formally, CORN constructs conditional training subsets such that the output of the $k$th binary
task $f_k(\mathbf{x}_i)$ represents the conditional probability $f_k(\mathbf{x}_i) = P(y_i &gt; r_k \mid y_i &gt; r_{k-1})$. For $k \geq 2$,
the conditional subsets consist of observations where $y_i &gt; r_{k-1}$. When $k = 1$, $f_1(\mathbf{x}_i)$ represents
the initial unconditional probability $P(y_i &gt; r_1)$ based on the complete dataset. The transformed
unconditional probabilities can then be computed by applying the chain rule for probabilities:
$P(y_i &gt; r_k) = \prod_{j=1}^{k} f_j(\mathbf{x}_i)$.
Since $\forall j, 0 \leq f_j(\mathbf{x}_i) \leq 1$, we have $P(y_i &gt; r_1) \geq P(y_i &gt; r_2) \geq \cdots \geq P(y_i &gt; r_{K-1})$, which
guarantees rank consistency among the $K - 1$ binary tasks. During model training, CORN minimizes
the cross-entropy loss over the binary tasks. All other steps are identical to the extended binary
classification framework.</p>
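      <p>At inference time, this chain rule is a cumulative product over the sigmoid outputs. A minimal PyTorch sketch, assuming a logits tensor of shape [N, K-1] from a CORN-trained output layer:</p>
      <preformat>
import torch

def corn_rank(logits: torch.Tensor) -> torch.Tensor:
    cond = torch.sigmoid(logits)               # f_k(x) = P(y > r_k | y > r_{k-1})
    uncond = torch.cumprod(cond, dim=1)        # P(y > r_k), non-increasing in k
    return 1 + torch.sum(uncond > 0.5, dim=1)  # same decoding as OR-NN
      </preformat>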
    </sec>
    <sec id="sec-6">
      <title>6. Experiments</title>
      <p>6.1. Data
We evaluate our models on two multiple-choice question (MCQ) datasets: RACE++ and ARC. These
datasets vary in domain and granularity of difficulty levels.</p>
      <p>
        RACE++ [
        <xref ref-type="bibr" rid="ref17">17, 26</xref>
        ] is a dataset of reading comprehension MCQs. Questions are labeled with one of
three difficulty levels—1 (middle school), 2 (high school), and 3 (university level)—which we treat as
ground truth labels for QDE. The label distribution is imbalanced: 25% of the questions are labeled as
level 1, 62% as level 2, and 13% as level 3. The dataset is partitioned into training, validation, and test
sets containing 100,568; 5599; and 5642 questions, respectively.
      </p>
      <p>
        ARC [18] is a dataset of science MCQs across grades 3 through 9. Question difficulty is indicated by
the target grade level (i.e., 7 levels), which we use as ground truth. The training, validation, and test
splits contain 3358, 862, and 3530 questions, respectively. The distribution is highly imbalanced: level
8 appears approximately 1400 times, level 5 about 700 times, and level 6 only 100 times. To decrease
this imbalance, we follow [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and downsample the two most frequent levels to 500 examples each,
resulting in a partially balanced training set of 2293 questions.
6.2. Model Architecture
We focus on end-to-end Transformer-based NNs, as they have been shown to outperform traditional
NLP approaches that rely on separate feature engineering and modeling stages [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. We fine-tune
the Transformer BERT (“bert-base-uncased”) on the task of QDE, stacking an output layer on top of
the pre-trained language model. During fine-tuning, both the weights of the output head and the
pre-trained model are updated. We follow the input encoding of [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and concatenate the question and
the text of all the possible answer choices in a single sentence, divided by separator tokens.
      </p>
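      <p>A sketch of this input encoding with the HuggingFace tokenizer; the question and options are illustrative placeholders, and the exact separator scheme is our reading of [14]:</p>
      <preformat>
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
question = "Which gas do plants absorb from the atmosphere?"
options = ["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"]

# Concatenate the question and all answer options, divided by separator tokens.
text = f" {tokenizer.sep_token} ".join([question] + options)
encoding = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
      </preformat>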
      <p>Additionally, we investigate the performance of two baselines which serve as a lower bound on
performance: (i) Random and (ii) Majority. The Random baseline randomly predicts a dificulty level,
while the Majority baseline consistently predicts the majority level in the training set.</p>
      <p>The experiments are implemented in PyTorch [27] using the HuggingFace [28] package, and results
are averaged over five independent runs with random seeds.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Results and Discussion</title>
      <p>7.1. Balanced DRPS
Tables 2 and 3 present the results for the RACE++ and ARC datasets, which contain 3 and 7 difficulty
levels, respectively. Balanced DRPS with probabilistic inputs is the main evaluation metric, as these
predictions express uncertainty and are particularly valuable in downstream decision-making. In
addition, we report the balanced DRPS computed using degenerate distributions—i.e., where all probability
mass is placed entirely on the predicted level. This setup removes any representation of uncertainty
from the predictions, thereby altering the scores for both classification and ordinal regression methods.
Notably, the baselines and the discretized regression model remain unaffected in this setting, as they
do not express uncertainty. Recall that for both metrics, lower is better. Furthermore, we include the
commonly used but flawed metrics RMSE and accuracy.</p>
      <p>On RACE++ with 3 levels (Table 2), the classification model performs comparably to the ordinal
methods OR-NN, CORN, and OrderedLogitNN in terms of balanced DRPS. The discretized regression
model, by contrast, shows slightly inferior performance. Interestingly, the CORAL model underperforms
significantly, likely due to its weight-sharing constraint limiting the NN’s capacity. Nonetheless, all
models substantially outperform the baseline approaches. We hypothesize that the small differences in
performance across models are due to the limited number of ordinal levels.</p>
      <p>The ARC dataset with 7 levels (Table 3) presents a more challenging setting, as it includes a greater
number of difficulty levels and exhibits more pronounced class imbalance. Here, the OrderedLogitNN
model considerably outperforms all other methods. For the remaining methods, the insights are
consistent with those observed on RACE++.</p>
      <p>When restricting the predictions to degenerate distributions, scores generally deteriorate (see the third
column in Tables 2 and 3) because balanced DRPS is designed to reward well-calibrated probabilistic
predictions while penalizing overconfident, incorrect ones. This shift brings the regression model’s
performance in line with the classification model, OR-NN, and CORN. OrderedLogitNN again performs
on par for RACE++ and performs substantially better on ARC.</p>
      <p>When considering RMSE and accuracy—metrics that are poorly suited for ordinal prediction tasks
(see Section 2.2)—we observe that models closely related to these objectives unsurprisingly achieve
the best performance. On the RACE++ dataset, the results are close and OrderedLogitNN performs
comparably to or even slightly better than the regression and classification models. In contrast, on ARC,
the regression approach achieves the lowest RMSE, while classification achieves the highest accuracy.
These results are expected, as the regression method is directly optimized with RMSE loss while the
classification method entirely omits the ordinal information, just like the accuracy metric.
7.2. Confusion matrix
To further investigate the behavior of the models, Figure 5 presents confusion matrices for the ARC
dataset, the more complex task in this study. These matrices are normalized by the true class (i.e.,
row-wise) and are based on discrete predicted levels—rather than full probability distributions—thus
aligning with the evaluation setting of the balanced DRPS with degenerate predictions.</p>
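      <p>For reference, a row-wise normalized confusion matrix of this kind can be obtained directly from scikit-learn; the labels below are illustrative:</p>
      <preformat>
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([3, 3, 4, 5, 9, 9])
y_pred = np.array([3, 4, 4, 5, 8, 8])
cm = confusion_matrix(y_true, y_pred, normalize="true")  # each row sums to 1
      </preformat>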
      <p>The results indicate the inherent difficulty of the task, as demonstrated by the substantial deviations
from the diagonal (highlighted in black). All models hardly predict level 6 as these observations are least
present in the training set. Notably, the CORAL model exhibits highly atypical behavior, predicting
exclusively at the extreme levels (3 and 9). This pattern suggests that the model fails to converge to a
meaningful solution, consistent with its poor performance on the balanced DRPS metric.</p>
      <p>Among the remaining models, all except OrderedLogitNN show concentrated errors in specific
off-diagonal cells, e.g., predicting level 4 instead of level 3, or level 8 instead of level 9. These errors are
associated with the outermost classes (levels 3 and 9), which are often neglected entirely by the models.
In contrast, OrderedLogitNN stands out as the only method that successfully captures both extremes of
the ordinal scale, avoiding these consistent misclassifications.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion</title>
      <p>This study approaches discrete-level QDE through the lens of ordinal regression, reflecting the inherent
ordering of difficulty levels from easiest to hardest. Prior work in this area has ignored this ordinal
structure, relying on classification or discretized regression models.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4o in order to check grammar and spelling.
After using this tool, the authors reviewed and edited the content as needed and take full responsibility
for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Van der Linden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Glas</surname>
          </string-name>
          ,
          <source>Computerized adaptive testing: Theory and practice</source>
          , Springer,
          <year>2000</year>
          . doi:10.1007/0-306-47531-6.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>A regularized competition model for question dificulty estimation in community question answering services</article-title>
          ,
          <source>in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1115</fpage>
          -
          <lpage>1126</lpage>
          . doi:10.3115/v1/D14-1118.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Attali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saldivia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jackson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schuppan</surname>
          </string-name>
          , W. Wanamaker,
          <article-title>Estimating item difficulty with comparative judgments</article-title>
          ,
          <source>ETS Research Report Series</source>
          <year>2014</year>
          (
          <year>2014</year>
          )
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . doi:10.1002/ets2.12042.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Raymond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Haladyna</surname>
          </string-name>
          , et al.,
          <source>Handbook of test development</source>
          , volume
          <volume>2</volume>
          , Routledge New York, NY,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Benedetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Caines</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Buttery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cappelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giussani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Turrin</surname>
          </string-name>
          ,
          <article-title>A survey on recent approaches to question difficulty estimation from text</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          . doi:10.1145/3556538.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Hambleton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <article-title>Comparison of classical test theory and item response theory and their applications to test development</article-title>
          ,
          <source>Educational measurement: issues and practice 12</source>
          (
          <year>1993</year>
          )
          <fpage>38</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Hambleton</surname>
          </string-name>
          ,
          <article-title>Fundamentals of item response theory</article-title>
          ,
          <source>Sage</source>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>F.-Y.</given-names> <surname>Hsu</surname></string-name>,
          <string-name><given-names>H.-M.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>T.-H.</given-names> <surname>Chang</surname></string-name>,
          <string-name><given-names>Y.-T.</given-names> <surname>Sung</surname></string-name>,
          <article-title>Automated estimation of item difficulty for multiple-choice tests: An application of word embedding techniques</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          <volume>54</volume>
          (
          <year>2018</year>
          )
          <fpage>969</fpage>
          -
          <lpage>984</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          , E. Suyong,
          <article-title>Feature analysis on English word difficulty by Gaussian mixture model</article-title>
          ,
          <source>in: 2018 International Conference on Information and Communication Technology Convergence (ICTC)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>191</fpage>
          -
          <lpage>194</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <article-title>Exercise difficulty prediction in online education systems</article-title>
          ,
          <source>in: 2019 International Conference on Data Mining Workshops (ICDMW)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>317</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>L.-H.</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>T.-H.</given-names> <surname>Chang</surname></string-name>,
          <string-name><given-names>F.-Y.</given-names> <surname>Hsu</surname></string-name>,
          <article-title>Automated prediction of item difficulty in reading comprehension using long short-term memory</article-title>
          ,
          <source>in: 2019 International Conference on Asian Language Processing (IALP)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>132</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>Y.</given-names> <surname>Zhou</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Tao</surname></string-name>,
          <article-title>Multi-task BERT for problem difficulty prediction</article-title>
          ,
          <source>in: 2020 International Conference on Communications, Information System and Computer Engineering (CISCE)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>213</fpage>
          -
          <lpage>216</lpage>
          . doi:10.1109/CISCE50729.2020.00048.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E.</given-names>
            <surname>Loginova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Benedetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Benoit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          ,
          <article-title>Towards the application of calibrated transformers to the unsupervised estimation of question difficulty from text</article-title>
          ,
          <source>in: RANLP</source>
          <year>2021</year>
          , INCOMA,
          <year>2021</year>
          , pp.
          <fpage>846</fpage>
          -
          <lpage>855</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Benedetto</surname>
          </string-name>
          ,
          <article-title>A quantitative study of NLP approaches to question difficulty estimation</article-title>
          ,
          <source>in: International Conference on Artificial Intelligence in Education</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>428</fpage>
          -
          <lpage>434</lpage>
          . doi:10.1007/978-3-031-36336-8_67.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Thuy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Loginova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. F.</given-names>
            <surname>Benoit</surname>
          </string-name>
          ,
          <article-title>Active learning to guide labeling efforts for question difficulty estimation</article-title>
          ,
          <source>arXiv preprint arXiv:2409.09258</source>
          (
          <year>2024</year>
          ). doi:10.48550/arXiv.2409.09258.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Coe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Searle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barmby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Higgins</surname>
          </string-name>
          ,
          <article-title>Relative difficulty of examinations in different subjects</article-title>, Durham: CEM Centre (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <article-title>A new multi-choice reading comprehension dataset for curriculum learning</article-title>
          ,
          <source>in: Asian Conference on Machine Learning, PMLR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>742</fpage>
          -
          <lpage>757</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>