<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Ordinality in Discrete-level Question Difficulty Estimation: Introducing Balanced DRPS and OrderedLogitNN</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Arthur Thuy</string-name>
          <email>arthur.thuy@ugent.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ekaterina Loginova</string-name>
          <email>ekaterina.d.loginova@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dries F. Benoit</string-name>
          <email>dries.benoit@ugent.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CVAMO Core Lab Flanders Make</institution>
          ,
          <addr-line>Tweekerkenstraat 2, 9000 Ghent</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dedalus Healthcare</institution>
          ,
          <addr-line>Roderveldlaan 2, 2600 Antwerp</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Ghent University</institution>
          ,
          <addr-line>Tweekerkenstraat 2, 9000 Ghent</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recent years have seen growing interest in Question Difficulty Estimation (QDE) using natural language processing techniques. Question difficulty is often represented using discrete levels, framing the task as ordinal regression due to the inherent ordering from easiest to hardest. However, the literature has neglected the ordinal nature of the task, relying on classification or discretized regression models, with specialized ordinal regression methods remaining unexplored. Furthermore, evaluation metrics are tightly coupled to the modeling paradigm, hindering cross-study comparability. While some metrics fail to account for the ordinal structure of difficulty levels, none adequately address class imbalance, resulting in biased performance assessments. This study addresses these limitations by benchmarking three types of model outputs—discretized regression, classification, and ordinal regression—using the balanced Discrete Ranked Probability Score (DRPS), a novel metric that jointly captures ordinality and class imbalance. In addition to using popular ordinal regression methods, we propose OrderedLogitNN, extending the ordered logit model from econometrics to neural networks. We fine-tune BERT on the RACE++ and ARC datasets and find that OrderedLogitNN performs considerably better on complex tasks. The balanced DRPS offers a robust and fair evaluation metric for discrete-level QDE, providing a principled foundation for future research.</p>
      </abstract>
      <kwd-group>
        <kwd>Question Difficulty Estimation</kwd>
        <kwd>Natural language processing</kwd>
        <kwd>Ordinal regression</kwd>
        <kwd>Ordered logit</kwd>
        <kwd>Fine-tuning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Question Difficulty Estimation (QDE), also known as question calibration, aims to predict a question’s
difficulty directly from its textual content and its answer options. This task plays a central role in
personalized learning tools such as computerized adaptive testing [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and dynamic online learning
platforms, which aim to present questions aligned with a learner’s proficiency level. Selecting questions
that are either too easy or excessively difficult can reduce student motivation and hinder learning
outcomes [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Reliable estimation of question difficulty is therefore essential.
      </p>
      <p>
        Traditionally, QDE has relied on manual calibration [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or pretesting [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], both of which are
time-consuming and costly. To address these limitations, research has explored the use of natural language
processing (NLP) techniques. These approaches train machine learning models to infer difficulty from
the question text, allowing for rapid and scalable calibration of new questions without the need for
manual intervention.
      </p>
      <p>Difficulty levels in QDE are represented either as continuous scores or as discrete categories, with
discrete levels being attractive for their ease of use. While continuous difficulty estimation is generally
framed as a regression task, discrete-level QDE is essentially an ordinal regression problem, given the
inherent ordering of difficulty levels from easiest to hardest.</p>
      <p>However, the discrete-level QDE literature has neglected the ordinal nature of the task. Instead,
existing work relies exclusively on classification and discretized regression methods, both of which
are oversimplifications of the problem structure. Classification models disregard ordinal relationships
altogether, and while discretized regression methods preserve ordinality, they implicitly assume equal
spacing between levels—an assumption often violated in real-world data. As such, specialized ordinal
regression techniques remain unexplored in this context. Moreover, no prior studies have systematically
compared these competing approaches. Compounding the issue, studies typically report only the
performance metrics aligning with their chosen modeling paradigm, making cross-study comparisons difficult.
The metrics also fail to account for class imbalances, a prevalent issue in this setting. As a result, there
is no consensus on the most effective evaluation metric or modeling approach for discrete-level QDE.</p>
      <p>[Table 1: overview of related work in discrete-level QDE, listing each study’s model output format
(regression, classification, or ordinal) and its reported metrics (accuracy, adjacent accuracy, F1-score,
RMSE, R², Spearman’s rank correlation, balanced DRPS).]</p>
      <p>This study addresses the literature gaps in discrete-level QDE by proposing the balanced Discrete
Ranked Probability Score (DRPS), a novel evaluation metric that jointly captures ordinality and class
imbalance. It also provides a direct way to compare deterministic predictions to probabilistic ones, which
are especially valuable for downstream decision-making. We benchmark three types of model outputs—
discretized regression, classification, and ordinal regression—using the balanced DRPS. Moreover, we
propose a novel ordinal regression model, OrderedLogitNN, extending the ordered logit model from
econometrics to neural networks (NNs). Our work is the first to (i) introduce the balanced DRPS metric,
(ii) compare classification and discretized regression models for QDE, and (iii) investigate specialized
ordinal regression techniques, including the novel OrderedLogitNN. We conduct experiments by
fine-tuning the Transformer model BERT on the RACE++ and ARC datasets.</p>
      <p>The remainder of the paper is structured as follows. Section 2 reviews related work. Section 3
introduces the balanced DRPS metric. Section 4 discusses the novel OrderedLogitNN model and Section
5 outlines existing methods for ordinal tasks. Section 6 describes our experimental setup, and Section 7
presents the results and discussion. We conclude in Section 8. The source code is available on GitHub
(https://github.com/arthur-thuy/qde-ordinality).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Building on the survey by [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we further investigate studies in QDE that utilize datasets with discrete
difficulty levels. Question difficulty is defined using one of three main approaches: (i) Classical Test
Theory (CTT) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], (ii) Item Response Theory (IRT) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and (iii) manual calibration. In the case of manual
calibration with expert annotators, difficulty is almost exclusively assigned in discrete levels due to its
ease of use. While difficulty scores derived from CTT and IRT are continuous by nature, they are often
discretized in practical applications to facilitate interpretation. Table 1 provides an overview of related
work in discrete-level QDE.
2.1. Output types
Early work on discrete-level QDE employed classification models, such as support vector machines
and Bayesian NNs [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ]. Subsequently, [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] proposed a discretized regression approach using an
LSTM-based NN, where difficulty was predicted as a continuous value between 0 and 1 and mapped to
discrete intervals (e.g., [0.0; 0.2), [0.2; 0.4), etc.).
      </p>
      <p>
        [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] introduced a multi-task BERT-based model that leverages shared representations across datasets,
using a classification head to predict difficulty levels. To reduce the reliance on large labeled datasets
in supervised methods, [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] proposed an unsupervised QDE method. They leverage the uncertainty
in pre-trained question-answering models as a proxy for human-perceived difficulty, computed as the
variance over the predictions from an ensemble of classification models.
      </p>
      <p>
        [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] conducted a benchmarking study comparing traditional machine learning methods and
end-to-end NNs across datasets with both discrete and continuous difficulty labels. Their results show that
fine-tuned Transformer-based models such as BERT and DistilBERT consistently outperform classical
methods. However, these models exclusively employed a discretized regression approach to address the
ordinal nature of the labels.
      </p>
      <p>
        More recently, [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] investigated active learning for QDE, demonstrating that comparable performance
to fully supervised models can be achieved by labeling only a small fraction of the training data. Yet
again, only a discretized regression modeling strategy was considered. As summarized in Table 1, prior
work has not directly compared discretized regression and classification approaches, and specialized
ordinal regression methods remain entirely unexplored.
2.2. Metrics
Accuracy is the most widely used evaluation metric for discrete-level QDE, particularly in studies
employing classification models, as it aligns with the cross-entropy loss typically used during training.
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is the only study using a discretized regression approach that also reports accuracy. However,
accuracy fails to account for the ordinal structure of difficulty levels: all misclassifications are treated
equally, regardless of their distance from the true label. Even if a prediction is incorrect, it should still
be as close as possible to the true difficulty level. Additionally, accuracy is a threshold-based metric,
relying solely on the final predicted label rather than the full output distribution—an issue that also
applies to metrics such as the F1-score. To partially address this, [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] report adjacent accuracy,
defined as the proportion of predictions within $k$ levels of the true label. However, the choice of $k$ is
often arbitrary and dataset-dependent, complicating comparisons across studies.
      </p>
      <p>
        The most recent works, which adopt discretized regression approaches, report RMSE as their primary
evaluation metric, treating difficulty levels as integers. While RMSE reflects the ordinal structure to
some extent, it assumes uniform spacing between levels—a condition rarely met in real-world data.
For example, in primary school, there is a non-linear increase in difficulty over the years [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. This
limitation also applies to metrics such as R² and Spearman’s rank correlation. Furthermore, since RMSE
is used as the loss function in discretized regression models, these models benefit from being evaluated
on the same objective they were optimized for, giving them an unfair advantage and introducing a
potential bias in comparative evaluations.
      </p>
      <p>
        In addition to the ordinality aspect, these commonly used metrics fail to account for class imbalance,
which is a prevalent issue in discrete-level QDE. In many datasets, mid-range difficulty levels tend to
dominate, while questions at the extremes—those that are very easy or very difficult—are
underrepresented [
        <xref ref-type="bibr" rid="ref17">17, 18</xref>
        ]. Standard metrics, which compute aggregate scores across all samples, inherently place
greater weight on the majority classes. As a result, model performance on minority classes is often
underrepresented, leading to inflated metrics that do not accurately reflect model effectiveness across
the full difficulty spectrum. This is particularly problematic in educational contexts, where balanced
performance across all difficulty levels is essential to ensure adequate personalized learning experiences.
      </p>
      <p>As a result, the literature ignores the ordinal aspect in the modeling approaches and employs
suboptimal evaluation metrics for discrete-level QDE. There is a clear need for an evaluation metric that
simultaneously accounts for ordinality and class imbalance, thereby enabling fair comparison across
different modeling strategies.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Balanced Discrete Ranked Probability Score</title>
      <p>The Continuous Ranked Probability Score (CRPS) is the most widely adopted scoring rule for evaluating
probabilistic forecasts of real-valued variables, such as in precipitation forecasting [19]. It is defined as
the integral of the squared difference (i.e., Brier score) between the cumulative distribution function
(CDF) of a probabilistic forecast $F$ and the CDF of the observed outcome, at all real-valued thresholds.
The observed outcome is represented as a degenerate distribution, as its CDF is a step function. Formally,
given a dataset $D = \{\mathbf{x}_i, y_i\}_{i=1}^{N}$, the CRPS is computed as:</p>
      <p>$$\mathrm{CRPS}(F, D) = \frac{1}{N} \sum_{i=1}^{N} \int_{-\infty}^{\infty} \left( F_i(\hat{y}) - \mathbb{1}\{\hat{y} \geq y_i\} \right)^2 \, d\hat{y},$$
where $F_i(\hat{y})$ denotes the CDF of the forecast and $\mathbb{1}\{\cdot\}$ is the step function.</p>
      <p>The CRPS is distance-sensitive, meaning it rewards forecasts that assign higher probability mass
to values near the true outcome. Specifically, when the forecast distribution concentrates probability
density around the true value, the squared error between the forecast CDF and the observed step
function is smaller across the integration range, resulting in a lower (better) score. In other words, even
if the predicted value does not exactly coincide with the true outcome, placing substantial probability
on neighboring values results in a better score than a prediction that is entirely off-target.</p>
      <p>This property makes the CRPS particularly well-suited for ordinal prediction tasks. In such cases, the
DRPS serves as a natural extension of the CRPS for discrete outcomes across $K$ ordered categories. The
DRPS has only been applied in meteorology [20] and has received little attention in the general field of
ordinal regression. It is defined as:</p>
      <p>$$\mathrm{DRPS}(F, D) = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K-1} \left( F_i(k) - \mathbb{1}\{k \geq y_i\} \right)^2,$$
where $F_i(k)$ denotes the predicted cumulative probability up to class $k$. The step function $\mathbb{1}\{\cdot\}$ moves
from 0.0 to 1.0 at the position of the ground truth label.</p>
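      <p>As a concrete illustration, the DRPS can be computed in a few lines of NumPy; this is a minimal sketch assuming levels are encoded as indices $0, \ldots, K-1$, with illustrative array names:</p>
      <preformat>
import numpy as np

def drps(probs: np.ndarray, y: np.ndarray) -> float:
    """probs: predicted class probabilities, shape [N, K]; y: true level indices.

    Deterministic predictions can be passed as one-hot rows (degenerate
    distributions), in which case the score reduces to the mean absolute error.
    """
    n, k = probs.shape
    cdf = np.cumsum(probs, axis=1)[:, : k - 1]      # F_i(k) at the K-1 thresholds
    step = np.arange(k - 1)[None, :] >= y[:, None]  # CDF of the degenerate truth
    return float(np.mean(np.sum((cdf - step) ** 2, axis=1)))
      </preformat>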
      <p>Unlike other evaluation metrics in related work, the DRPS operates on full probability distributions
rather than point estimates, enabling a more nuanced assessment of ordinal predictions. Such a
probability distribution over the levels is available for all classification and ordinal regression models,
but not for the discretized regression model as it only outputs a predicted difficulty level. In the
case of deterministic predictions, the output is treated as a degenerate distribution—analogous to the
representation of the observed outcome. Figures 1 and 2 illustrate how the DRPS is computed for single
observations with probabilistic and deterministic model outputs.</p>
      <p>Crucially, the DRPS respects the ordinal structure of the prediction task without assuming equal
inter-class distances, a notable limitation of metrics such as RMSE. Additionally, when applied to
deterministic predictions, the DRPS reduces to the mean absolute error, providing a direct way to
compare deterministic and probabilistic predictions within a unified evaluation metric.</p>
      <p>
        In this work, we introduce the balanced DRPS to address the class imbalance in discrete-level QDE
datasets, where extreme difficulty levels are typically underrepresented compared to mid-range levels
[
        <xref ref-type="bibr" rid="ref17">17, 18</xref>
        ]. The popular metrics (accuracy and RMSE) and standard DRPS compute unweighted averages,
which overemphasize performance on majority classes and can produce misleadingly high scores on
imbalanced data. However, for educational practitioners, robust performance across the full spectrum
of difficulty levels—including the rarest—is essential. To ensure fair evaluation, the balanced DRPS
weights each observation inversely proportional to the prevalence of its true class, using weights
$w_i = 1 / \sum_{j=1}^{N} \mathbb{1}\{y_j = y_i\}$:
$$\mathrm{DRPS}_{\mathrm{bal}}(F, D) = \frac{1}{K} \sum_{i=1}^{N} w_i \sum_{k=1}^{K-1} \left( F_i(k) - \mathbb{1}\{k \geq y_i\} \right)^2.$$
      </p>
      <p>Thus for balanced datasets, the balanced DRPS is equivalent to the standard DRPS.</p>
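      <p>A sketch of the balanced variant under the same assumptions as the DRPS snippet above (it additionally assumes every class occurs at least once in the evaluation set, so all inverse-frequency weights are defined):</p>
      <preformat>
import numpy as np

def balanced_drps(probs: np.ndarray, y: np.ndarray) -> float:
    n, k = probs.shape
    counts = np.bincount(y, minlength=k)  # prevalence of each true class
    w = 1.0 / counts[y]                   # w_i = 1 / count(y_i); weights sum to K
    cdf = np.cumsum(probs, axis=1)[:, : k - 1]
    step = np.arange(k - 1)[None, :] >= y[:, None]
    per_obs = np.sum((cdf - step) ** 2, axis=1)
    return float(np.sum(w * per_obs) / k)  # equals drps() on balanced data
      </preformat>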
      <p>In conclusion, the balanced DRPS offers a robust and fair evaluation metric for discrete-level QDE by
accounting for both ordinal structure and class imbalance. It supports both deterministic and probabilistic
predictions and remains neutral to training objectives, making it well-suited for benchmarking across
diverse modeling approaches.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Ordered Logit for NNs</title>
      <p>OrderedLogitNN extends the classical ordered logit model to NNs, effectively bridging the gap between
econometrics and deep learning. At its core, it is a latent variable model, where each observation is
associated with an unobserved continuous utility value $y_i^*$ [21], modeled as $y_i^* = \mathbf{x}_i^\top \boldsymbol{\beta} + \epsilon_i$. The observed
ordinal outcome $y_i$ is derived from $y_i^*$ through a censoring mechanism, whereby the continuous latent
variable is mapped to one of $K$ discrete categories based on a sequence of $K + 1$ increasing threshold
values $\{\tau_{-1}, \tau_0, \tau_1, \ldots, \tau_{K-1}\}$.</p>
      <p>To identify the model parameters, several normalizations are required. First, the thresholds must be
increasing, $\tau_k &gt; \tau_{k-1}$, to ensure valid (i.e., positive) probabilities. Second, the endpoints of the support
are fixed as $\tau_{-1} = -\infty$ and $\tau_{K-1} = +\infty$, covering the entire real line. Third, the error term $\epsilon_i$ is
assumed to follow a standardized logistic distribution (mean zero, variance $\pi^2/3$). The logistic distribution
is preferred over the Gaussian (i.e., probit) for computational convenience, as the derivative of its CDF has a
closed-form solution and is readily available as the sigmoid function. Finally, since $\mathbf{x}_i$ includes a bias term, the
threshold $\tau_0 = 0$.</p>
      <p>The model defines the class probabilities as:
$$P(y_i = k \mid \mathbf{x}_i) = \Lambda(\tau_k - \mathbf{x}_i^\top \boldsymbol{\beta}) - \Lambda(\tau_{k-1} - \mathbf{x}_i^\top \boldsymbol{\beta}),$$
with $\Lambda$ the CDF of the logistic distribution. An example for $K = 3$ is shown in Figure 3.</p>
      <p>The NN is trained by minimizing the negative log-likelihood (NLL):</p>
      <p>−1
NLL = ∑︁ ∑︁  log [ (  − x ) −  (
=1 =0
−1 − x )] ,
where  = 1 if  =  and 0 otherwise. The thresholds are reparameterized to ensure monotonically
increasing values:   =  −1 + exp( ) = ∑︀=1 exp( ).</p>
      <p>The $\gamma_k$ parameters are initialized such that the ordinal levels have equal probability mass under the
logistic distribution, with the first threshold set to zero. The bias term is initialized to lie at the center of
this distribution, while all other weights follow standard initialization practices in PyTorch. To facilitate
convergence, the learning rates for the $\gamma_k$ values and the bias term are scaled to be 100 times larger
than those of the remaining network parameters.</p>
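      <p>A minimal PyTorch sketch of such an ordered-logit output head is given below. The module and variable names are illustrative, the equal-probability-mass initialization of the $\gamma_k$ values is simplified to zeros, and the 100x learning-rate scaling would be handled via optimizer parameter groups:</p>
      <preformat>
import torch
import torch.nn as nn

class OrderedLogitHead(nn.Module):
    """Maps backbone features (e.g., BERT's pooled output) to K class probabilities."""

    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        self.utility = nn.Linear(in_features, 1)  # y* = x'beta (bias included)
        # K-2 free parameters; with tau_0 = 0 and the two infinite endpoints,
        # this yields the K-1 finite thresholds needed for K classes.
        self.gamma = nn.Parameter(torch.zeros(num_classes - 2))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        y_star = self.utility(features)                            # [N, 1]
        tau = torch.cat([torch.zeros(1, device=y_star.device),
                         torch.cumsum(torch.exp(self.gamma), 0)])  # tau_0..tau_{K-2}
        cdf = torch.sigmoid(tau[None, :] - y_star)                 # logistic CDF, [N, K-1]
        lower = torch.cat([torch.zeros_like(y_star), cdf], dim=1)  # Lambda(tau_{k-1} - y*)
        upper = torch.cat([cdf, torch.ones_like(y_star)], dim=1)   # Lambda(tau_k - y*)
        return upper - lower                                       # P(y = k | x), [N, K]

# Training minimizes the NLL of the class probabilities, e.g.:
# loss = torch.nn.functional.nll_loss(torch.log(head(feats) + 1e-12), targets)
      </preformat>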
      <p>Importantly, OrderedLogitNN is architecture-agnostic and can be integrated into any NN. Additionally,
it makes no assumptions about the distances between ordinal levels, allowing it to flexibly model a
wide range of ordered regression problems.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Existing approaches for ordinal regression with NNs</title>
      <p>This section describes existing approaches to handle ordinal regression using NNs. Depending on
the specific approach, the amount of ordinal information used and the underlying assumptions vary.
This study uses three existing specialized ordinal regression methods that follow the extended binary
classification framework, most widely used in the ordinal regression literature [22]. Note that the
methods discussed below are not tied to any specific architecture and can be utilized with any NN.</p>
      <p>Let $D = \{\mathbf{x}_i, y_i\}_{i=1}^{N}$ be the training dataset consisting of $N$ training examples. Here, $\mathbf{x}_i \in \mathcal{X}$ denotes
the $i$th training example and $y_i$ the corresponding rank, where $y_i \in \mathcal{Y} = \{r_1, r_2, \ldots, r_K\}$ with ordered
ranks $r_K \succ r_{K-1} \succ \cdots \succ r_1$. The objective is to find a model that maps $\mathcal{X} \rightarrow \mathcal{Y}$. For example, the ARC
dataset has $K = 7$ difficulty levels with an output space $\mathcal{Y} = \{\text{“grade 3”}, \text{“grade 4”}, \ldots, \text{“grade 9”}\}$.
5.1. Discretized regression
In the regression approach, also referred to as discretized regression, the $K$ rank indices are treated as
numerical values to utilize the ordinal information (see Table 1 for references). The model $f$ minimizes
the mean squared error loss and predicts a real-valued quantity $f(\mathbf{x}_i) \in \mathbb{R}$ representing a continuous
rank estimate, which is then converted to the closest rank index. For example, a regression estimate of
2.7 is converted to index 3 while estimate 5.2 is converted to index 5.</p>
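      <p>For instance, this decoding step amounts to rounding and clipping; a minimal sketch assuming rank indices $1, \ldots, K$:</p>
      <preformat>
import numpy as np

def to_rank_index(estimate: float, num_classes: int) -> int:
    # Round the continuous rank estimate to the nearest valid index.
    return int(np.clip(np.rint(estimate), 1, num_classes))

assert to_rank_index(2.7, 7) == 3 and to_rank_index(5.2, 7) == 5
      </preformat>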
      <p>
        Using a discretized regression approach in an ordinal QDE problem assumes that the inter-level
distances are equal. However, this condition is only rarely satisfied in practice [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. On the ARC dataset
with levels “grade 3” to “grade 9”, for example, such an approach assumes that the jump in difficulty
between consecutive grades is identical.
5.2. Classification
In the classification approach, the model’s output space is a set of $K$ unordered labels, one for each
rank (see Table 1 for references). The model is trained to minimize the cross-entropy loss and the
predicted rank label is the class with the highest predicted probability. As such, the predicted rank label
is $\hat{y}_i = \arg\max_{k \in \mathcal{Y}} P(k \mid \mathbf{x}_i)$.
      </p>
      <p>This approach essentially assumes that the difficulty levels are completely independent, hence
discarding the available ordinal information. For example, for a question in the ARC dataset with true
level “grade 3”, predicting levels “grade 4” and “grade 5” incurs the same loss even though the difference
between “grade 3” and “grade 5” is larger than that between “grade 3” and “grade 4”.
5.3. Ordinal: OR-NN
A popular general machine learning approach to ordinal regression is to cast it as an extended binary
classification problem [23], leveraging the relative order among the labels. That is, the ordinal regression
task with $K$ ranks is represented as a series of $K - 1$ simpler binary classification sub-problems. For
each rank index $k \in \{1, 2, \ldots, K - 1\}$, a binary classifier is trained according to whether the rank of a
sample is larger than $r_k$. As such, all $K - 1$ tasks share the same intermediate layers but are assigned
distinct weight parameters in the output layer. In summary, this framework relies on three steps: (i)
extending rank labels to binary vectors, (ii) training binary classifiers on the extended labels, and (iii)
computing the predicted rank label from the binary classifiers.</p>
      <p>In 2016, the authors of [24] adapted this framework to train NNs for ordinal regression; we refer
to this method as OR-NN. More formally, a rank label $y_i$ is first extended into $K - 1$ binary labels
$y_i^{(1)}, \ldots, y_i^{(K-1)}$ such that $y_i^{(k)} \in \{0, 1\}$ indicates whether $y_i$ exceeds rank $r_k$, i.e.,
$y_i^{(k)} = \mathbb{1}\{y_i &gt; r_k\}$. Using the extended binary labels, a single NN is trained with $K - 1$ binary classifiers in
the output layer to minimize the cross-entropy loss. Based on the binary task predictions, the predicted
rank label is $\hat{y}_i = r_q$. The rank index $q$ is given by</p>
      <p>$$q = 1 + \sum_{k=1}^{K-1} \mathbb{1}\left\{ P(y_i &gt; r_k) &gt; 0.5 \right\},$$
where $P(y_i &gt; r_k) \in [0, 1]$ is the predicted probability of the $k$th binary classifier in the output layer.</p>
      <p>However, the authors pointed out that OR-NN can suffer from rank inconsistencies among the binary
tasks, such that the predictions for individual binary tasks may disagree. For example, on the RACE++
dataset, it would be contradictory if the first binary task predicts that the difficulty is not higher than
middle school level while the second binary task predicts it to be more difficult than high school level.
This inconsistency could lead to suboptimal results when combining the $K - 1$ predictions to obtain the
estimated difficulty level. Figure 4 provides an example of a rank consistent and a rank inconsistent prediction.
In response, two methods have been proposed that overcome this drawback of rank inconsistency:
CORAL and CORN.</p>
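      <p>A minimal NumPy sketch of the OR-NN label extension and rank decoding described above, assuming ranks are encoded as integers $1, \ldots, K$ (function names are illustrative):</p>
      <preformat>
import numpy as np

def extend_labels(y: np.ndarray, num_classes: int) -> np.ndarray:
    # y_i^(k) = 1{y_i > r_k} for thresholds r_1 .. r_{K-1}
    thresholds = np.arange(1, num_classes)
    return (y[:, None] > thresholds[None, :]).astype(np.float32)

def decode_rank(binary_probs: np.ndarray) -> np.ndarray:
    # q = 1 + sum_k 1{P(y > r_k) > 0.5}
    return 1 + np.sum(binary_probs > 0.5, axis=1)

y = np.array([1, 3, 7])
assert (decode_rank(extend_labels(y, 7)) == y).all()
      </preformat>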
      <p>[Figure 4: example predictions that are (a) rank inconsistent and (b) rank consistent.]</p>
      <p>5.4. Ordinal: CORAL and CORN
CORAL [25] achieves rank consistency by imposing a weight-sharing constraint in the last layer. Instead
of learning distinct weights between each unit in the penultimate layer and each output unit, CORAL
enforces that the $K - 1$ binary tasks share the same weight parameters to the units in the penultimate
layer. In addition, CORAL learns independent bias terms for each output unit, as opposed to a single
bias term for the output layer.</p>
      <p>CORAL uses a cross-entropy loss over the $K - 1$ binary classifiers, and the authors show theoretically
that by minimizing this loss function, the learned bias terms of the output layer are non-increasing such
that $b_1 \geq b_2 \geq \cdots \geq b_{K-1}$. Consequently, the predicted probabilities of the $K - 1$ tasks are decreasing,
which ensures that the output reflects the ordinal information and is rank consistent. All other steps
are identical to the extended binary classification framework.</p>
      <p>However, while CORAL outperforms the OR-NN method in age prediction [25], the weight-sharing
constraint may restrict the expressiveness and capacity of the NN.</p>
      <p>More recently, the authors of [22] proposed CORN, which guarantees rank consistency without
restricting the NN’s expressiveness and capacity. CORN achieves rank consistency by a novel training
scheme which uses conditional training sets in order to obtain the unconditional rank probabilities.</p>
      <p>More formally, CORN constructs conditional training subsets such that the output of the $k$th binary
task $f_k(\mathbf{x}_i)$ represents the conditional probability $f_k(\mathbf{x}_i) = P(y_i &gt; r_k \mid y_i &gt; r_{k-1})$. For $k \geq 2$,
the conditional subsets consist of observations where $y_i &gt; r_{k-1}$. When $k = 1$, $f_1(\mathbf{x}_i)$ represents
the initial unconditional probability $P(y_i &gt; r_1)$ based on the complete dataset. The transformed
unconditional probabilities can then be computed by applying the chain rule for probabilities:
$P(y_i &gt; r_k) = \prod_{j=1}^{k} f_j(\mathbf{x}_i)$.
Since $\forall j, 0 \leq f_j(\mathbf{x}_i) \leq 1$, we have $P(y_i &gt; r_1) \geq P(y_i &gt; r_2) \geq \cdots \geq P(y_i &gt; r_{K-1})$, which
guarantees rank consistency among the $K - 1$ binary tasks. During model training, CORN minimizes
the cross-entropy loss over the binary tasks. All other steps are identical to the extended binary
classification framework.</p>
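      <p>At inference time, this chain rule is a cumulative product over the sigmoid outputs. A minimal PyTorch sketch, assuming a logits tensor of shape [N, K-1] from a CORN-trained output layer:</p>
      <preformat>
import torch

def corn_rank(logits: torch.Tensor) -> torch.Tensor:
    cond = torch.sigmoid(logits)               # f_k(x) = P(y > r_k | y > r_{k-1})
    uncond = torch.cumprod(cond, dim=1)        # P(y > r_k), non-increasing in k
    return 1 + torch.sum(uncond > 0.5, dim=1)  # same decoding as OR-NN
      </preformat>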
    </sec>
    <sec id="sec-6">
      <title>6. Experiments</title>
      <p>6.1. Data
We evaluate our models on two multiple-choice question (MCQ) datasets: RACE++ and ARC. These
datasets vary in domain and granularity of difficulty levels.</p>
      <p>
        RACE++ [
        <xref ref-type="bibr" rid="ref17">17, 26</xref>
        ] is a dataset of reading comprehension MCQs. Questions are labeled with one of
three difficulty levels—1 (middle school), 2 (high school), and 3 (university level)—which we treat as
ground truth labels for QDE. The label distribution is imbalanced: 25% of the questions are labeled as
level 1, 62% as level 2, and 13% as level 3. The dataset is partitioned into training, validation, and test
sets containing 100,568; 5599; and 5642 questions, respectively.
      </p>
      <p>
        ARC [18] is a dataset of science MCQs across grades 3 through 9. Question difficulty is indicated by
the target grade level (i.e., 7 levels), which we use as ground truth. The training, validation, and test
splits contain 3358, 862, and 3530 questions, respectively. The distribution is highly imbalanced: level
8 appears approximately 1400 times, level 5 about 700 times, and level 6 only 100 times. To decrease
this imbalance, we follow [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and downsample the two most frequent levels to 500 examples each,
resulting in a partially balanced training set of 2293 questions.
6.2. Model Architecture
We focus on end-to-end Transformer-based NNs, as they have been shown to outperform traditional
NLP approaches that rely on separate feature engineering and modeling stages [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. We fine-tune
the Transformer BERT (“bert-base-uncased”) on the task of QDE, stacking an output layer on top of
the pre-trained language model. During fine-tuning, both the weights of the output head and the
pre-trained model are updated. We follow the input encoding of [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and concatenate the question and
the text of all the possible answer choices in a single sentence, divided by separator tokens.
      </p>
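      <p>A sketch of this input encoding with the HuggingFace tokenizer; the question and options are illustrative placeholders, and the exact separator scheme is our reading of [14]:</p>
      <preformat>
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
question = "Which gas do plants absorb from the atmosphere?"
options = ["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"]

# Concatenate the question and all answer options, divided by separator tokens.
text = f" {tokenizer.sep_token} ".join([question] + options)
encoding = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
      </preformat>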
      <p>Additionally, we investigate the performance of two baselines which serve as a lower bound on
performance: (i) Random and (ii) Majority. The Random baseline randomly predicts a dificulty level,
while the Majority baseline consistently predicts the majority level in the training set.</p>
      <p>The experiments are implemented in PyTorch [27] using the HuggingFace [28] package, and results
are averaged over five independent runs with random seeds.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Results and Discussion</title>
      <p>7.1. Balanced DRPS
Tables 2 and 3 present the results for the RACE++ and ARC datasets, which contain 3 and 7 difficulty
levels, respectively. Balanced DRPS with probabilistic inputs is the main evaluation metric, as these
predictions express uncertainty and are particularly valuable in downstream decision-making. In
addition, we report the balanced DRPS computed using degenerate distributions—i.e., where all probability
mass is placed entirely on the predicted level. This setup removes any representation of uncertainty
from the predictions, thereby altering the scores for both classification and ordinal regression methods.
Notably, the baselines and the discretized regression model remain unaffected in this setting, as they
do not express uncertainty. Recall that for both metrics, lower is better. Furthermore, we include the
commonly used but flawed metrics RMSE and accuracy.</p>
      <p>On RACE++ with 3 levels (Table 2), the classification model performs comparably to the ordinal
methods OR-NN, CORN, and OrderedLogitNN in terms of balanced DRPS. The discretized regression
model, by contrast, shows slightly inferior performance. Interestingly, the CORAL model underperforms
significantly, likely due to its weight-sharing constraint limiting the NN’s capacity. Nonetheless, all
models substantially outperform the baseline approaches. We hypothesize that the small differences in
performance across models are due to the limited number of ordinal levels.</p>
      <p>The ARC dataset with 7 levels (Table 3) presents a more challenging setting, as it includes a greater
number of difficulty levels and exhibits more pronounced class imbalance. Here, the OrderedLogitNN
model considerably outperforms all other methods. For the remaining methods, the insights are
consistent with those observed on RACE++.</p>
      <p>When restricting the predictions to degenerate distributions, scores generally deteriorate (see the third
column in Tables 2 and 3) because balanced DRPS is designed to reward well-calibrated probabilistic
predictions while penalizing overconfident, incorrect ones. This shift brings the regression model’s
performance in line with the classification model, OR-NN, and CORN. OrderedLogitNN again performs
on par for RACE++ and performs substantially better on ARC.</p>
      <p>When considering RMSE and accuracy—metrics that are poorly suited for ordinal prediction tasks
(see Section 2.2)—we observe that models closely related to these objectives unsurprisingly achieve
the best performance. On the RACE++ dataset, the results are close and OrderedLogitNN performs
comparably to or even slightly better than the regression and classification models. In contrast, on ARC,
the regression approach achieves the lowest RMSE, while classification achieves the highest accuracy.
These results are expected, as the regression method is directly optimized with RMSE loss while the
classification method entirely omits the ordinal information, just like the accuracy metric.
7.2. Confusion matrix
To further investigate the behavior of the models, Figure 5 presents confusion matrices for the ARC
dataset, the more complex task in this study. These matrices are normalized by the true class (i.e.,
row-wise) and are based on discrete predicted levels—rather than full probability distributions—thus
aligning with the evaluation setting of the balanced DRPS with degenerate predictions.</p>
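      <p>For reference, a row-wise normalized confusion matrix of this kind can be obtained directly from scikit-learn; the labels below are illustrative:</p>
      <preformat>
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([3, 3, 4, 5, 9, 9])
y_pred = np.array([3, 4, 4, 5, 8, 8])
cm = confusion_matrix(y_true, y_pred, normalize="true")  # each row sums to 1
      </preformat>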
      <p>The results indicate the inherent difficulty of the task, as demonstrated by the substantial deviations
from the diagonal (highlighted in black). All models hardly predict level 6 as these observations are least
present in the training set. Notably, the CORAL model exhibits highly atypical behavior, predicting
exclusively at the extreme levels (3 and 9). This pattern suggests that the model fails to converge to a
meaningful solution, consistent with its poor performance on the balanced DRPS metric.</p>
      <p>Among the remaining models, all except OrderedLogitNN show concentrated errors in specific
off-diagonal cells, e.g., predicting level 4 instead of level 3, or level 8 instead of level 9. These errors are
associated with the outermost classes (levels 3 and 9), which are often neglected entirely by the models.
In contrast, OrderedLogitNN stands out as the only method that successfully captures both extremes of
the ordinal scale, avoiding these consistent misclassifications.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion</title>
      <p>This study approaches discrete-level QDE through the lens of ordinal regression, reflecting the inherent
ordering of difficulty levels from easiest to hardest. Prior work in this area has ignored this ordinal
structure, relying on classification or discretized regression models.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4o in order to check grammar and spelling.
After using this tool, the authors reviewed and edited the content as needed and take full responsibility
for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Van der Linden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Glas</surname>
          </string-name>
          ,
          <source>Computerized adaptive testing: Theory and practice</source>
          , Springer,
          <year>2000</year>
          . doi:10.1007/0-306-47531-6.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>A regularized competition model for question dificulty estimation in community question answering services</article-title>
          ,
          <source>in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1115</fpage>
          -
          <lpage>1126</lpage>
          . doi:10.3115/v1/D14-1118.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Attali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saldivia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jackson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schuppan</surname>
          </string-name>
          , W. Wanamaker,
          <article-title>Estimating item difficulty with comparative judgments</article-title>
          ,
          <source>ETS Research Report Series</source>
          <year>2014</year>
          (
          <year>2014</year>
          )
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . doi:10.1002/ets2.12042.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Raymond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Haladyna</surname>
          </string-name>
          , et al.,
          <source>Handbook of test development</source>
          , volume
          <volume>2</volume>
          , Routledge New York, NY,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Benedetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Caines</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Buttery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cappelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giussani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Turrin</surname>
          </string-name>
          ,
          <article-title>A survey on recent approaches to question difficulty estimation from text</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          . doi:10.1145/3556538.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Hambleton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <article-title>Comparison of classical test theory and item response theory and their applications to test development</article-title>
          ,
          <source>Educational measurement: issues and practice 12</source>
          (
          <year>1993</year>
          )
          <fpage>38</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Hambleton</surname>
          </string-name>
          ,
          <article-title>Fundamentals of item response theory</article-title>
          ,
          <source>Sage</source>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>F.-Y.</given-names> <surname>Hsu</surname></string-name>,
          <string-name><given-names>H.-M.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>T.-H.</given-names> <surname>Chang</surname></string-name>,
          <string-name><given-names>Y.-T.</given-names> <surname>Sung</surname></string-name>,
          <article-title>Automated estimation of item difficulty for multiple-choice tests: An application of word embedding techniques</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          <volume>54</volume>
          (
          <year>2018</year>
          )
          <fpage>969</fpage>
          -
          <lpage>984</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          , E. Suyong,
          <article-title>Feature analysis on English word difficulty by Gaussian mixture model</article-title>
          ,
          <source>in: 2018 International Conference on Information and Communication Technology Convergence (ICTC)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>191</fpage>
          -
          <lpage>194</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <article-title>Exercise difficulty prediction in online education systems</article-title>
          ,
          <source>in: 2019 International Conference on Data Mining Workshops (ICDMW)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>317</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>L.-H.</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>T.-H.</given-names> <surname>Chang</surname></string-name>,
          <string-name><given-names>F.-Y.</given-names> <surname>Hsu</surname></string-name>,
          <article-title>Automated prediction of item difficulty in reading comprehension using long short-term memory</article-title>
          ,
          <source>in: 2019 International Conference on Asian Language Processing (IALP)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>132</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>Y.</given-names> <surname>Zhou</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Tao</surname></string-name>,
          <article-title>Multi-task BERT for problem difficulty prediction</article-title>
          ,
          <source>in: 2020 International Conference on Communications, Information System and Computer Engineering (CISCE)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>213</fpage>
          -
          <lpage>216</lpage>
          . doi:10.1109/CISCE50729.2020.00048.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E.</given-names>
            <surname>Loginova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Benedetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Benoit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          ,
          <article-title>Towards the application of calibrated transformers to the unsupervised estimation of question difficulty from text</article-title>
          ,
          <source>in: RANLP</source>
          <year>2021</year>
          , INCOMA,
          <year>2021</year>
          , pp.
          <fpage>846</fpage>
          -
          <lpage>855</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Benedetto</surname>
          </string-name>
          ,
          <article-title>A quantitative study of NLP approaches to question difficulty estimation</article-title>
          ,
          <source>in: International Conference on Artificial Intelligence in Education</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>428</fpage>
          -
          <lpage>434</lpage>
          . doi:10.1007/978-3-031-36336-8_67.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Thuy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Loginova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. F.</given-names>
            <surname>Benoit</surname>
          </string-name>
          ,
          <article-title>Active learning to guide labeling efforts for question difficulty estimation</article-title>
          ,
          <source>arXiv preprint arXiv:2409.09258</source>
          (
          <year>2024</year>
          ). doi:10.48550/arXiv.2409.09258.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Coe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Searle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barmby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Higgins</surname>
          </string-name>
          ,
          <article-title>Relative difficulty of examinations in different subjects</article-title>, Durham: CEM Centre (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <article-title>A new multi-choice reading comprehension dataset for curriculum learning</article-title>
          ,
          <source>in: Asian Conference on Machine Learning, PMLR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>742</fpage>
          -
          <lpage>757</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>