<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Professor</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>NLP Methods May Actually Be Better Than Professors at Estimating Question Difficulty</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leonidas Zotos</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivo Pascal de Jong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matias Valdenegro-Toro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreea Ioana Sburlea</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Malvina Nissim</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hedderik van Rijn</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bernoulli Institute, University of Groningen</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Center for Language and Cognition, University of Groningen</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Experimental Psychology, University of Groningen</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>1</volume>
      <issue>1</issue>
      <abstract>
        <p>Estimating the difficulty of exam questions is essential for developing good exams, but professors are not always good at this task. We compare various Large Language Model-based methods with three professors in their ability to estimate what percentage of students will give correct answers on True/False exam questions in the areas of Neural Networks and Machine Learning. Our results show that the professors have limited ability to distinguish between easy and difficult questions and that they are outperformed by directly asking Gemini 2.5 to solve this task. Yet, we obtained even better results using uncertainties of the LLMs solving the questions in a supervised learning setting, using only 42 training samples. We conclude that supervised learning using LLM uncertainty can help professors better estimate the difficulty of exam questions, improving the quality of assessment.</p>
      </abstract>
      <kwd-group>
        <kwd>item difficulty estimation</kwd>
        <kwd>uncertainty estimation</kwd>
        <kwd>educational data</kwd>
        <kwd>large language models</kwd>
        <kwd>assessment</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Good exam design is time-consuming and difficult. One of the challenges is to ensure consistent
difficulty over multiple years, as exam scores should be comparable between cohorts. As previous exams
might circulate among students, instructors are required to design exams anew, selecting questions that
are neither too difficult nor too easy [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. One solution is to randomly select a sufficiently large sample
of questions from an item pool [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, the number of questions is often not sufficiently large
to be confident that the difficulty will remain constant over the years. This requires instructors to estimate
the difficulty of the questions to ensure consistency, a process that is often an implicit aspect of exam
design.
      </p>
      <p>
        In this paper we assess whether Artificial Intelligence (AI), and in particular Natural Language
Processing (NLP), can be used to assist instructors in this process. AI is being viewed as a valuable avenue
for decreasing workload and increasing the capacity of educational staff in a variety of applications,
ranging from tutor chat-bots to systems that can grade exams [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Even though using AI for difficulty
estimation has been explored [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], success has been modest [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], with NLP systems often performing
marginally better than average-based baselines [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This task is also challenging for teachers, as shown
by van de Watering and van der Rijt [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. They found that teachers could correctly estimate the difficulty
levels for only a small proportion of the questions.
      </p>
      <p>The modest success of question difficulty estimation using NLP methods and the known limitations of
teachers in estimating question difficulty motivate this study. It is clear that both teachers and NLP-based
methods have a limited ability to estimate exam item difficulty, but it is not known how they compare.
This comparison is critical for determining whether automated question difficulty estimation is ready
for educational practice. In our work, we compare NLP-based approaches for automated question
difficulty estimation with expert human estimation of difficulty. We demonstrate that state-of-the-art
NLP methods are better at question difficulty estimation than university professors, and highlight the
potential of integrating NLP-based methods into the workflow of exam design.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The task of question difficulty estimation using NLP methods is not new [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Already in the 1990s,
traditional AI methods were employed for question difficulty estimation [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. More recent approaches are
typically based on the transformer architecture. An example of this is the recent “Building Educational
Applications” shared task on “Automated Prediction of Item Difficulty and Item Response Time” [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
wherein a variety of approaches were explored, ranging from changes to the transformer architecture
to data augmentation techniques. The best performing team (EduTec) used a combination of model
optimisation techniques including scalar mixing [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], rational activation [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and multi-task learning
to predict the proportion of students answering each question correctly [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>Our Contribution
The goal of this study is to establish whether modern NLP-based methods can be applied for question
difficulty estimation in university education. This is operationalized by comparing whether NLP-based
approaches perform similarly to or better than the lecturers who would normally construct the exams. To
the best of our knowledge, this is the first study of this kind. This comparison is conducted using two
university-level exams, with the moderate-size question set being representative of the data that would
typically be available in real-world scenarios. The code implementation of this project is publicly
available.1</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>
        To compare the performance of professors and LLM-based solutions in question difficulty estimation,
we collected a dataset of exam questions used in university education. The proportion of students
answering a question correctly (known as the p-value) is considered the ground-truth difficulty. The
professors and LLM-based methods estimate this ground truth based on the exam question text. We
chose to use the p-value over IRT metrics [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], because it is more intuitive for a professor to interpret
and estimate.
      </p>
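      <p>For illustration, the ground-truth p-value of a question can be computed directly from archived response data. The minimal sketch below assumes a hypothetical file and column layout (one row per student-question pair) and pools responses for questions that were reused across years, as described in Section 3.1.</p>
      <preformat>
# Minimal sketch: ground-truth p-values from archived responses.
# The file name and column names (question_id, correct) are hypothetical.
import pandas as pd

# One row per student-question pair; "correct" is 1 if answered correctly, else 0.
responses = pd.read_csv("archived_exam_responses.csv")

# Pool every student who received a question (including reuse across years)
# and take the proportion of correct answers as the question's p-value.
p_values = responses.groupby("question_id")["correct"].mean()

print(p_values.sort_values().head())
      </preformat>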
      <sec id="sec-3-1">
        <title>3.1. Exam data</title>
        <p>We included data from two courses in the area of Artificial Intelligence that are taught at the University of
Groningen. Specifically, we used Neural Networks, which is taught in the Artificial Intelligence BSc
program, and Advanced Machine Learning, which is taught in the Artificial Intelligence MSc program.
Both courses and exams follow a similar setup. The course material consists of custom lecture notes,
and the exams are made of twenty-two questions selected from a private item pool of exam questions.
Each question in the exam is a True/False question, and students have two hours to complete the exam.
Examples of exam questions are shown in Table 1.</p>
        <p>We collected three archived exams for each course, covering the years 22/23 (111 students), 23/24
(114 students), and 24/25 (20 students) for Neural Networks and the years 21/22 (103 students), 22/23
(119 students) and 23/24 (71 students) for Advanced Machine Learning. We collected all questions from
the exams and pooled them together. For questions that were repeated across years, the p-value was
based on all students that received this question. This was the case for five questions in total for each
of the two courses. Moreover, the examiner of the courses considered four questions from Advanced
Machine Learning and one question from Neural Networks as ambiguous and marked those as correct
for both True and False. Those ambiguous questions were removed from this study. Additionally, there
was one question which included an image. This question was also removed as we consider this to
be out-of-scope for the present study. This resulted in 59 questions from Neural Networks and 53
questions from Advanced Machine Learning.</p>
        <p>Table 1: Example exam questions with their correct answers. (Machine Learning basics) Given a training data set (uᵢ, yᵢ), i = 1, ..., N, where uᵢ ∈ Rⁿ and yᵢ ∈ R, then for any model f : Rⁿ → R and any loss function L, the empirical risk ℛ̂(f) is less or equal to the risk ℛ(f). Answer: False. (Elementary math) Let f, g : R → R be differentiable functions with gradients ∇f, ∇g. Then ∇(f + g) = ∇f + ∇g. Answer: True.</p>
        <p>1. https://github.com/LeonidasZotos/nlp_vs_professors_difficulty_estimation</p>
          <p>
            We use this new dataset instead of existing datasets for three reasons. First and foremost, in
contrast to other datasets, we have professors available who are experts in the field and can provide
p-value estimates that represent manual question difficulty estimation in an ecologically valid way.
Secondly, the questions in our new dataset are not publicly available, guaranteeing that they are
unseen by all LLMs. Finally, by analyzing how difficulty estimation methods perform on questions
involving abstract mathematical reasoning and comprehension, we examine whether their previous
success in assessing the difficulty of clinical decision-making and language comprehension exams [
            <xref ref-type="bibr" rid="ref14 ref6">6, 14</xref>
            ]
extends to this domain.
          </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Professors’ estimations</title>
        <p>Three professors of the University of Groningen were asked to estimate, for each question, the percentage
of students that would answer it correctly. All three professors have expertise in Machine Learning and
Neural Networks and would be qualified to teach these courses. However, none of them has taught
these specific courses, nor have they ever been students in these courses. This ensures they are fairly
knowledgeable about the population of students and the topic, but have not seen students’ performance
on these questions. As an example and to provide some calibration, the professors were given one exam
question with the true percentage of students that answered it correctly. The professors were also given
the correct True/False answers. This was to help them focus on the task of predicting the difficulty of
the question rather than solving it. The exact annotation instructions are presented in Table 2.</p>
        <p>Annotation Instruction: Below are exam questions from the Advanced Machine Learning Course,
taught in the University of Groningen. For each question, the correct answer is highlighted in green.
Estimate, from the examiner’s perspective, what percentage of students will answer each question
correctly. Feel free to re-visit and adjust previous estimates. An example is presented below, where
the percentage of students who selected the correct answer is provided. At the end, provide an
estimate of the time spent on this item difficulty estimation task.</p>
        <p>Purpose of the Study: This study aims to compare the performance of expert educators and
state-of-the-art LLM-based methods in estimating the difficulty of True/False questions.</p>
        <p>Each professor made their estimates independently and at a moment that fit their schedule. On
average, estimating the difficulty of the total of 112 questions took each professor 2 hours and 15 minutes.
One professor (professor 3) declined to give estimates for sixteen questions for Neural Networks and
six questions for Advanced Machine Learning, stating that they lacked the specific knowledge of some
concepts needed to provide a confident estimate. These questions were not considered in the evaluation for this
professor, but were retained for the rest of the analysis.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. NLP Approaches</title>
        <p>
          We focus on two types of NLP-based methods for item difficulty estimation. We investigate methods
based on prompting, where LLMs directly estimate the question difficulty, and methods based on the
uncertainty of an LLM attempting to solve the question. The mathematical notation in all questions is
encoded using LaTeX, which LLMs can process well [
          <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
          ].
        </p>
        <p>Using Direct Estimation As a simple comparison between LLMs and professors, we tested two
setups in which a powerful LLM is prompted to directly estimate the p-value of a question. We
provided the LLMs with the same instruction (and example question) that we also gave to the professors.
Additionally, we use Chain of Thought (by instructing the LLM to “Think step by step”) to allow
the model to “reason” before giving an estimate [17]. To get the most competitive results we use
gemini-2.5-pro-preview-03-25 (Gemini 2.5) and Gemini-2.0-flash (Gemini 2.0), two of
the best-performing current LLMs, as measured by the community-driven Chatbot Arena [18]. At the
time of writing, they rank 1st and 8th respectively.</p>
        <p>The LLMs are prompted using two different setups. In the single question setup, the LLM is tasked
with predicting the p-value of each question individually, without being able to see the other questions.
In contrast, in the all-questions setup the LLM is given the complete question set and is tasked with
estimating the p-values of all items in one go. We consider both of these setups to be promising, each
with its own trade-offs. On the one hand, we observe that prompting the model to generate the difficulty of
a single question item encourages it to generate longer reasoning streams, which could lead to more
accurate predictions. At the same time, predicting the difficulty of all items concurrently can also
be beneficial, potentially steering the prediction of each item to be informed by the entire set. The
all-questions setup closely resembles the setup with the professors, as they can also see all questions to
gauge the overall difficulty of the question set.</p>
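        <p>As an illustration of the two setups, the sketch below only constructs the prompts; the instruction wording is abbreviated from Table 2, and the actual call to the Gemini API is deliberately left out, so the code shows prompt construction only.</p>
        <preformat>
# Minimal sketch of the two direct-estimation prompting setups.
# The instruction text is abbreviated from Table 2; sending the prompt to
# Gemini 2.5 / Gemini 2.0 is left to the API client of choice.
INSTRUCTION = (
    "Estimate, from the examiner's perspective, what percentage of students "
    "will answer each question correctly. Think step by step."  # Chain of Thought cue
)

def single_question_prompt(question):
    # Single-question setup: the model sees one question at a time.
    return INSTRUCTION + "\n\nQuestion:\n" + question

def all_questions_prompt(questions):
    # All-questions setup: the model sees the complete question set in one go.
    numbered = "\n".join(str(i + 1) + ". " + q for i, q in enumerate(questions))
    return INSTRUCTION + "\n\nQuestions:\n" + numbered
        </preformat>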
        <p>Using LLM Uncertainty As a task-specific question difficulty estimation method we implement the
approach by Zotos et al. [19], which is a good representation of the current state of the art for this task.</p>
        <p>In this approach, a set of nine LLMs is prompted to solve each question without the answer, and the
uncertainty of the LLMs is used as a feature for a supervised learning model to predict the p-value.
By using a mixture of stronger and weaker LLMs we obtain a good spread of LLM uncertainties as
features. We use the same LLMs as in the original work.</p>
        <p>Two measures of uncertainty are used to indicate the difficulty of the question according to the
LLM. One is the probability of the first generated token (the probability of “A” or “B”). The other
is Choice-Order Sensitivity [20], which measures whether the LLM gives the same prediction when
the order of the answer choices is shuffled. This is operationalised by performing inference with
different permutations of the choice order (for True/False questions, only two permutations are possible)
and calculating the proportion of times the correct answer is selected. Both measures have been found to
correlate with the probability that a prediction from an LLM is correct [21, 19] as well as with the p-values
of exam questions [22].</p>
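        <p>The sketch below shows how the two uncertainty measures can be extracted from a single open-weight LLM with the Hugging Face transformers library; the model name, prompt format and option labels are illustrative and not necessarily those of the nine LLMs used in this study.</p>
        <preformat>
# Minimal sketch of the two uncertainty features for one open-weight LLM.
# Model name and prompt format are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def first_token_probability(question, options):
    # Probability mass on the predicted option label, renormalised over "A"/"B".
    prompt = question + "\n" + "\n".join(options) + "\nAnswer:"
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    ids = [tok.encode(" A", add_special_tokens=False)[0],
           tok.encode(" B", add_special_tokens=False)[0]]
    probs = torch.softmax(logits[ids], dim=-1)
    return probs.max().item(), "AB"[int(probs.argmax())]

def choice_order_sensitivity(question, correct_answer):
    # Proportion of choice-order permutations in which the correct answer is
    # selected (for True/False questions only two permutations exist).
    orders = [["A. True", "B. False"], ["A. False", "B. True"]]
    hits = 0
    for opts in orders:
        _, picked_label = first_token_probability(question, opts)
        picked_text = opts[0] if picked_label == "A" else opts[1]
        hits += int(correct_answer in picked_text)
    return hits / len(orders)
        </preformat>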
        <p>As supervised learning models we use three different regressors: Random Forest, Support Vector
Machine and Linear Regression. Each model is trained for each course using an 80:20 train-test split.
The regression models for Neural Networks are therefore trained with 47 samples, and the models for
Advanced Machine Learning with 42 samples. Using Grid Search with 5-fold cross-validation we determine
the best hyperparameters for each regression model.</p>
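        <p>A minimal sketch of this supervised setup with scikit-learn is shown below; the hyperparameter grids are illustrative rather than the exact grids searched in this project.</p>
        <preformat>
# Minimal sketch: 80:20 split plus 5-fold grid search for the three regressors.
# The hyperparameter grids shown are illustrative.
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

def fit_regressors(X, y, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    candidates = {
        "random_forest": (RandomForestRegressor(random_state=seed),
                          {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}),
        "svm": (SVR(), {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}),
        "linear_regression": (LinearRegression(), {}),
    }
    fitted = {}
    for name, (estimator, grid) in candidates.items():
        search = GridSearchCV(estimator, grid, cv=5,
                              scoring="neg_root_mean_squared_error")
        search.fit(X_tr, y_tr)
        # Keep the refitted best model and its held-out score.
        fitted[name] = (search.best_estimator_, search.score(X_te, y_te))
    return fitted
        </preformat>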
        <p>We compare this with two alternative supervised learning setups. In one, we train a dummy model
with no features, always predicting the mean p-value. In the other, we consider what may be learned
with simple features from the text. For this we use TF-IDF features for a supervised learning model. For
completeness, we also consider the concatenation of TF-IDF features and LLM uncertainties as features.</p>
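        <p>These baseline feature sets can be built as in the short sketch below (again using scikit-learn; the exact preprocessing is illustrative).</p>
        <preformat>
# Minimal sketch of the baseline feature sets and the dummy mean predictor.
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.feature_extraction.text import TfidfVectorizer

def build_feature_sets(question_texts, llm_uncertainty_features):
    # llm_uncertainty_features: array of shape (n_questions, n_features)
    tfidf = TfidfVectorizer().fit_transform(question_texts).toarray()
    combined = np.hstack([tfidf, llm_uncertainty_features])
    return {"tfidf": tfidf,
            "uncertainty": llm_uncertainty_features,
            "tfidf_plus_uncertainty": combined}

# Featureless baseline that always predicts the mean p-value of the training set.
mean_baseline = DummyRegressor(strategy="mean")
        </preformat>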
        <p>Arguably, comparing the professors to this supervised learning approach is unfair. The professors
are only given one “labeled” example to calibrate their predictions, while the regressor needs more than
one example to be trained. The supervised learning approach can therefore directly learn the
distribution of p-values, which the professors do not have access to. At the same time, this setup is
realistic for an advanced NLP setup that may be used in practice. Universities often have archived data
from previous years’ exams, but reviewing them is time-consuming. Using this supervised learning setup,
we can capitalize on this existing data effectively.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>
        To evaluate how the professors compare to the NLP-based methods on question difficulty estimation,
we use the root mean squared error (RMSE) between the estimated p-values and the ground-truth
values as a standard metric for error. We also measure the rank correlation between the estimates and
the ground truth using Spearman’s ρ. This rank correlation assessment allows us to detect whether
an approach with consistently biased estimates still maintains a strong monotonic relationship
with the true p-values and is thus able to distinguish easy from difficult questions. The Mean Error
(ME) metric directly evaluates any consistent bias, by measuring whether the difficulty estimates are on
average too high or too low. The results of all experiments are presented in Table 3.
      </p>
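      <p>For reference, the three metrics can be computed as in the following minimal sketch (using NumPy and SciPy; not necessarily the exact evaluation code of this project).</p>
      <preformat>
# Minimal sketch of the evaluation metrics reported in Table 3.
import numpy as np
from scipy.stats import spearmanr

def evaluate(estimated, true_p_values):
    estimated = np.asarray(estimated, dtype=float)
    true_p_values = np.asarray(true_p_values, dtype=float)
    rmse = np.sqrt(np.mean((estimated - true_p_values) ** 2))
    rho = spearmanr(estimated, true_p_values).correlation
    mean_error = np.mean(estimated - true_p_values)  # positive: overestimation
    return {"RMSE": rmse, "Spearman rho": rho, "ME": mean_error}
      </preformat>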
      <p>
        Professor Performance Overall, and in line with the study by van de Watering and van der Rijt
        [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], professors seem to have limited ability to estimate question difficulty. We see that for the Neural
Networks (NN) exam, two of the professors estimated p-values that do not correlate with the
performance of students. Only professor 3 has a positive rank correlation, with ρ = 0.211. This may be partly
because professor 3 did not give an answer to sixteen questions for Neural Networks, presumably the
ones they felt uncertain about. For the MSc-level Advanced Machine Learning (AML) course, professor
2 and professor 3 achieved better performances. Their estimated p-values consistently show a weak
rank correlation with the ground truth.
      </p>
      <p>The Mean Errors are sometimes positive and sometimes negative, depending on the professor and
the exam. This suggests that there is no clear pattern of professors consistently over- or underestimating
question difficulty. The high RMSE does show that, overall, the professors perform poorly at directly
estimating p-values. Furthermore, for each question we also averaged the three professor estimates, as
if they were voting. This did not lead to any improvements.</p>
      <p>Direct Prompting Performance When assessing the two methods of direct prompting, we find
that prompting the model with one question at a time generally leads to lower RMSE and higher rank
correlations, with the exception of Gemini 2.0 on the Neural Networks set. Additionally, we find
that Gemini 2.5 is consistently more accurate than Gemini 2.0, which corresponds well with their
performance on other tasks [18]. When comparing the direct prompting of LLMs to the professors we
find that the LLMs tend to perform better. The best LLM method (Gemini 2.5, single question) has a
better rank correlation than all professors on both exams. For Advanced Machine Learning the best
rank correlation from a professor was ρ = 0.241, while that of the LLM was ρ = 0.345.</p>
      <p>Supervised Learning Performance The supervised learning methods achieve lower RMSE than
the professors and the direct LLM predictions, because only the supervised learning methods are able
to learn the distribution of p-values from the data. The SVM performs best, likely due to the small dataset
and non-linear relationships. Using only TF-IDF features was not sufficient to estimate p-values for
the Neural Networks set, but was already better than the professors and often better than the LLMs
for the Advanced Machine Learning set. The LLM uncertainties as features are substantially more
predictive, resulting in lower RMSE and higher rank correlation ρ. The SVM with TF-IDF Scores and
LLM Uncertainties performed best, with a rank correlation of ρ = 0.853 for Neural Networks. For
Advanced Machine Learning, the SVM trained only on LLM Uncertainties performed best, with a rank
correlation of ρ = 0.582. This is much better than either direct estimation from the LLM or estimation
from the professors.</p>
      <sec id="sec-4-1">
        <title>4.1. Inter-Annotator Agreement</title>
        <p>Figure 1 shows the inter-annotator agreement (including the direct assessments of the Gemini LLMs),
represented using the Spearman correlation coefficient. Overall, this analysis shows that, while the task
is difficult (as was shown earlier), there are moderate correlations between the professors, indicating
that they might be over/under-estimating the difficulty of the same questions. This suggests that for
Advanced Machine Learning the professors have a consistent notion of what should be difficult and what
should be easy and are making informed estimates.</p>
        <p>Additionally, we observe a high correlation between professor 3 and the Gemini models in the Neural
Networks dataset, in line with their relatively good performance on the set. Lastly, we also find a
moderate to high correlation between the assessments of the two Gemini LLMs, suggesting that LLMs
of the same family behave consistently on this task.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Per Question Analysis</title>
        <p>Figure 2 presents, per question, the p-value, along with the estimates of the best performing systems
per category. For each dataset, we separate the questions based on the train and test splits used for
the best-performing Supervised Learning approach (this separation has no impact on the teachers and
prompted LLMs). Here, we directly observe that there is a good range and distribution of difficulties,
with a balance of easy and difficult questions. We also see that a few questions in each set were
answered correctly by less than 40% of the student population, suggesting that these questions might
be misleading or trick questions.</p>
        <p>Looking at the general picture, and in line with the results presented in Table 3, the best professors’
predictions and Gemini 2.5’s predictions do not show a strong correlation with the true p-values.
Additionally, their estimates show high variability, but are seldom below 50%, suggesting that they
do not recognize trick questions that might lead students to perform worse than random guessing. In
contrast, the estimated p-values of the best trained Supervised Learning model show low variability
and remain near the average p-value observed in the training sets. We do see that the estimated
p-values correlate with the true p-values, but the very easy/difficult questions are estimated
close to the average. This explains the good rank correlation, but still high RMSE, that we observed in
Table 3.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>We have shown that Gemini 2.5 is better at question difficulty estimation for Neural Networks and
Advanced Machine Learning exams than three university professors. We also find that Gemini 2.5
consistently outperforms Gemini 2.0 on this task, which suggests that future LLM releases may lead
to further improvement.</p>
      <p>Additionally, we find that with as few as 42 training samples the supervised learning method of
Zotos et al. [22] based on the uncertainty of LLMs solving the problem substantially outperforms
both the professors and a standard LLM approach. This finding is significant, as it demonstrates that
implementing this system for individual courses is feasible, with only a couple of exams from previous
years being required to train a good regression model.</p>
      <p>
        Lastly, our findings concern questions that require parsing mathematical notation and performing
mathematical reasoning. This extends previous successes of NLP-based question difficulty estimation
on biopsychology [22], clinical decision making [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and language comprehension [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] exams to more
mathematical fields. While the current results are specific to Machine Learning, they suggest that
NLP methods are also promising for other mathematical topics such as physics, computer science and
astronomy.
      </p>
      <p>Overall, we have demonstrated that state-of-the-art NLP methods are – relative to professors – very
good at question difficulty estimation, and can support them in ranking question difficulty. Of course,
our findings are focused on question difficulty estimation, and we still need professors for the many
other aspects of exam design and education!</p>
      <p>Limitations The primary limitation of this study is that we relied on professors who did not teach
these specific courses. In reality, professors have additional information which can help with the task of
question difficulty estimation, such as the performance of a cohort during the semester. At the same
time, the professors in our study already have significantly more background information than the
better-performing Gemini 2.5, as they are familiar with the rest of the curriculum and know how these
students perform in other courses. This limitation may cast doubt on whether question difficulty estimation
from Gemini 2.5 is better than that of a professor who has been teaching a specific course. However, it
remains clear that the supervised learning method is superior, given the large differences.</p>
      <p>We also observed that professors mostly make p-value predictions in increments of 5% (e.g., 65% or
70%, but not 68%). This results in items being tied in terms of predicted p-value. While these ties do
not impact the calculation of the Root Mean Squared Error, they might negatively affect the Spearman
Rank Correlation Coefficient, as the granular judgments that would create a clear ranking of the question
items are not available. However, we observe that Gemini 2.5’s direct estimations also frequently occur
in 5% increments (with the model often predicting 60% or 75%, as shown in Figure 2), yet a consistently
higher correlation is observed compared to the professors’ annotations.</p>
      <p>Broader Impact Statement While the results of the current study are promising, implementing
such a system is not trivial: the best performing system we tested relies on the existence of some
training data, as well as the availability of sufficient computational resources to compute the uncertainty
metrics of the LLMs. At the same time, instructing a state-of-the-art proprietary LLM to estimate
question difficulty can lead to good performance on the task, a solution that is trivial to use. As a final
consideration, we believe that any system of this type should be used in a human-in-the-loop fashion to
address cases where the NLP methods unavoidably lack context, for example when a question
is assessed as easy even though the material was not covered in class.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We would like to thank Professor Herbert Jaeger for providing the exam material for this study. We
would also like to thank the three professors for volunteering their time and expertise to provide
estimates of question difficulty. This study was positively reviewed by the Faculty of Science and
Engineering Ethics Committee under reference FSE.EC25005.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <sec id="sec-7-1">
        <title>For the preparation of this work, Gemini 2.5 was used for: Grammar check.</title>
        <p>pp. 225–237. URL: https://aclanthology.org/2024.eacl-srw.17/.
[17] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, D. Zhou,
Chainof-thought prompting elicits reasoning in large language models, in: S. Koyejo, S. Mohamed,
A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems,
volume 35, Curran Associates, Inc., 2022, pp. 24824–24837. URL: https://proceedings.neurips.cc/p
aper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.
[18] W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, B. Zhu, H. Zhang, M. I. Jordan,
J. E. Gonzalez, I. Stoica, Chatbot arena: an open platform for evaluating LLMs by human preference,
in: Proceedings of the 41st International Conference on Machine Learning, ICML’24, JMLR.org,
2024.
[19] L. Zotos, H. van Rijn, M. Nissim, Are you doubtful? Oh, it might be dificult then! Exploring the
use of model uncertainty for question dificulty estimation, in: C. Mills, G. Alexandron, D. Taibi,
G. L. Bosco, L. Paquette (Eds.), Proceedings of the 18th International Conference on Educational
Data Mining, International Educational Data Mining Society, Palermo, Italy, 2025, pp. 77–89.
doi:10.5281/zenodo.15870153.
[20] P. Pezeshkpour, E. Hruschka, Large language models sensitivity to the order of options in
multiple-choice questions, in: K. Duh, H. Gomez, S. Bethard (Eds.), Findings of the Association
for Computational Linguistics: NAACL 2024, Association for Computational Linguistics, Mexico
City, Mexico, 2024, pp. 2006–2017. URL: https://aclanthology.org/2024.findings-naacl.130/.
doi:10.18653/v1/2024.findings-naacl.130.
[21] B. Plaut, K. Nguyen, T. Trinh, Softmax probabilities (mostly) predict large language model
correctness on multiple-choice Q&amp;A, CoRR abs/2402.13213 (2024). URL: https://doi.org/10.48550/a
rXiv.2402.13213.
[22] L. Zotos, H. van Rijn, M. Nissim, Can model uncertainty function as a proxy for
multiplechoice question item dificulty?, in: O. Rambow, L. Wanner, M. Apidianaki, H. Al-Khalifa, B. D.
Eugenio, S. Schockaert (Eds.), Proceedings of the 31st International Conference on Computational
Linguistics, Association for Computational Linguistics, Abu Dhabi, UAE, 2025, pp. 11304–11316.
URL: https://aclanthology.org/2025.coling-main.749/.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L. F.</given-names>
            <surname>Bachman</surname>
          </string-name>
          ,
          <article-title>Fundamental considerations in language testing</article-title>
          , Oxford university press,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W. D.</given-names>
            <surname>Way</surname>
          </string-name>
          ,
          <article-title>Protecting the integrity of computerized testing item pools</article-title>
          ,
          <source>Educational Measurement: Issues and Practice</source>
          <volume>17</volume>
          (
          <year>1998</year>
          )
          <fpage>17</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Tuomi</surname>
          </string-name>
          ,
          <article-title>State of the art and practice in AI in education</article-title>
          ,
          <source>European journal of education 57</source>
          (
          <year>2022</year>
          )
          <fpage>542</fpage>
          -
          <lpage>570</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Benedetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Caines</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Buttery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cappelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giussani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Turrin</surname>
          </string-name>
          ,
          <article-title>A survey on recent approaches to question difficulty estimation from text</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>AlKhuzaey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Grasso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Payne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tamma</surname>
          </string-name>
          ,
          <article-title>Text-based question difficulty prediction: A systematic review of automatic approaches</article-title>
          ,
          <source>International Journal of Artificial Intelligence in Education</source>
          <volume>34</volume>
          (
          <year>2024</year>
          )
          <fpage>862</fpage>
          -
          <lpage>914</lpage>
          . doi:
          <volume>10</volume>
          .1007/s40593-023-00362-1.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V.</given-names>
            <surname>Yaneva</surname>
          </string-name>
          , K. North,
          <string-name>
            <given-names>P.</given-names>
            <surname>Baldwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Ha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rezayi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Harik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Clauser</surname>
          </string-name>
          ,
          <article-title>Findings from the first shared task on automated prediction of difficulty and response time for multiple-choice questions</article-title>
          ,
          <source>in: Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA</source>
          <year>2024</year>
          ),
          <year>2024</year>
          , pp.
          <fpage>470</fpage>
          -
          <lpage>482</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>G. van de Watering</surname>
          </string-name>
          , J. van der Rijt,
          <article-title>Teachers' and students' perceptions of assessments: A review and a study into the ability and accuracy of estimating the difficulty levels of assessment items</article-title>
          ,
          <source>Educational Research Review</source>
          <volume>1</volume>
          (
          <year>2006</year>
          )
          <fpage>133</fpage>
          -
          <lpage>147</lpage>
          . URL: https://www.learntechlib.org/p/197391.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Perkins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tammana</surname>
          </string-name>
          ,
          <article-title>Predicting item difficulty in a reading comprehension test with an artificial neural network</article-title>
          ,
          <source>Language Testing</source>
          <volume>12</volume>
          (
          <year>1995</year>
          )
          <fpage>34</fpage>
          -
          <lpage>53</lpage>
          . doi:
          <volume>10</volume>
          .1177/02655322950120 0103.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R. F.</given-names>
            <surname>Boldt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Freedle</surname>
          </string-name>
          ,
          <article-title>Using a neural net to predict item difficulty</article-title>
          ,
          <source>ETS Research Report Series (1996) i-19</source>
          . doi:https://doi.org/10.1002/j.2333-
          <fpage>8504</fpage>
          .
          <year>1996</year>
          .tb01709.x.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gombert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Mitri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Karademir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kubsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kolbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tautz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Grimm</surname>
          </string-name>
          , I. Bohm,
          <string-name>
            <given-names>K.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Drachsler</surname>
          </string-name>
          ,
          <article-title>Coding energy knowledge in constructed responses with explainable NLP models</article-title>
          ,
          <source>Journal of Computer Assisted Learning</source>
          <volume>39</volume>
          (
          <year>2023</year>
          )
          <fpage>767</fpage>
          -
          <lpage>786</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schramowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kersting</surname>
          </string-name>
          ,
          <article-title>Padé activation units: End-to-end learning of flexible activation functions in deep networks</article-title>
          ,
          <source>in: 8th International Conference on Learning Representations</source>
          ,
          <string-name>
            <given-names>ICLR</given-names>
            ,
            <surname>Addis</surname>
          </string-name>
          <string-name>
            <surname>Ababa</surname>
          </string-name>
          , Ethiopia,
          <source>April 26-30</source>
          ,
          <year>2020</year>
          , OpenReview.net,
          <year>2020</year>
          . URL: https://openreview.net/forum?id=BJlBSkHtDS.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gombert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Menzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Mitri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Drachsler</surname>
          </string-name>
          ,
          <article-title>Predicting item difficulty and item response time with scalar-mixed transformer encoder models and rational network regression heads</article-title>
          , in: E. Kochmar,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bexte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Burstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Horbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Laarmann-Quante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Yaneva</surname>
          </string-name>
          ,
          <string-name>
            <surname>Z.</surname>
          </string-name>
          Yuan (Eds.),
          <source>Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA</source>
          <year>2024</year>
          ),
          <article-title>Association for Computational Linguistics</article-title>
          , Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>483</fpage>
          -
          <lpage>492</lpage>
          . URL: https://aclanthology.org/
          <year>2024</year>
          .bea-
          <volume>1</volume>
          .
          <fpage>40</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Hambleton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Swaminathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Rogers</surname>
          </string-name>
          ,
          <article-title>Fundamentals of item response theory</article-title>
          , volume
          <volume>2</volume>
          ,
          <string-name>
            <surname>Sage</surname>
          </string-name>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mullooly</surname>
          </string-name>
          , Ø. Andersen,
          <string-name>
            <given-names>L.</given-names>
            <surname>Benedetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Buttery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Caines</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J. F.</given-names>
            <surname>Gales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Karatay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Knill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liusie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Raina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Taslimipoor</surname>
          </string-name>
          , The Cambridge Multiple-Choice Questions Reading Dataset, Cambridge University Press and Assessment,
          <year>2023</year>
          . doi:
          <volume>10</volume>
          .17863/CAM.102185.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Frieder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pinchetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chevalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.-R.</given-names>
            <surname>Griffiths</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Salvatori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lukasiewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Petersen</surname>
          </string-name>
          , J. Berner, Mathematical capabilities of ChatGPT, in: A.
          <string-name>
            <surname>Oh</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Naumann</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Globerson</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Saenko</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Hardt</surname>
          </string-name>
          , S. Levine (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>36</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2023</year>
          , pp.
          <fpage>27699</fpage>
          -
          <lpage>27744</lpage>
          . URL: https://proceedings.neurips.cc/paper_files/paper/202 3/file/58168e8a92994655d6da3939e7cc0918-Paper-Datasets_and_Benchmarks.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lou</surname>
          </string-name>
          , D. Liu,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Yin,
          <article-title>Large language models for mathematical reasoning: Progresses and challenges</article-title>
          , in: N.
          <string-name>
            <surname>Falk</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Papi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          Zhang (Eds.),
          <source>Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop</source>
          , Association for Computational Linguistics, St.
          <source>Julian's, Malta</source>
          ,
          <year>2024</year>
          ,
          pp.
          <fpage>225</fpage>
          -
          <lpage>237</lpage>
          . URL: https://aclanthology.org/2024.eacl-srw.17/.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 24824-24837. URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, B. Zhu, H. Zhang, M. I. Jordan, J. E. Gonzalez, I. Stoica, Chatbot arena: an open platform for evaluating LLMs by human preference, in: Proceedings of the 41st International Conference on Machine Learning, ICML'24, JMLR.org, 2024.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] L. Zotos, H. van Rijn, M. Nissim, Are you doubtful? Oh, it might be difficult then! Exploring the use of model uncertainty for question difficulty estimation, in: C. Mills, G. Alexandron, D. Taibi, G. L. Bosco, L. Paquette (Eds.), Proceedings of the 18th International Conference on Educational Data Mining, International Educational Data Mining Society, Palermo, Italy, 2025, pp. 77-89. doi:10.5281/zenodo.15870153.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] P. Pezeshkpour, E. Hruschka, Large language models sensitivity to the order of options in multiple-choice questions, in: K. Duh, H. Gomez, S. Bethard (Eds.), Findings of the Association for Computational Linguistics: NAACL 2024, Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 2006-2017. URL: https://aclanthology.org/2024.findings-naacl.130/. doi:10.18653/v1/2024.findings-naacl.130.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] B. Plaut, K. Nguyen, T. Trinh, Softmax probabilities (mostly) predict large language model correctness on multiple-choice Q&amp;A, CoRR abs/2402.13213 (2024). URL: https://doi.org/10.48550/arXiv.2402.13213.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] L. Zotos, H. van Rijn, M. Nissim, Can model uncertainty function as a proxy for multiple-choice question item difficulty?, in: O. Rambow, L. Wanner, M. Apidianaki, H. Al-Khalifa, B. D. Eugenio, S. Schockaert (Eds.), Proceedings of the 31st International Conference on Computational Linguistics, Association for Computational Linguistics, Abu Dhabi, UAE, 2025, pp. 11304-11316. URL: https://aclanthology.org/2025.coling-main.749/.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>