Question Difficulty Prediction Based on Virtual Test-Takers and Item Response Theory

Masaki Uto1,*, Yuto Tomikawa1 and Ayaka Suzuki1
1 The University of Electro-Communications, 1-5-1 Chofugaoka, Chofu, Tokyo, Japan

EvalLAC'24: Workshop on Automatic Evaluation of Learning and Assessment Content, July 08, 2024, Recife, Brazil
* Corresponding author.
uto@ai.lab.uec.ac.jp (M. Uto)
https://sites.google.com/site/utomasakieng/ (M. Uto)
ORCID: 0000-0002-9330-5158 (M. Uto)

Abstract
Predicting the difficulty of test questions is a crucial task in the field of education. Many recent studies have proposed supervised machine learning methods that predict difficulty from question text. However, this approach requires a large dataset of questions with known difficulties to train difficulty prediction models. Recently, another approach was proposed that uses question-answering (QA) systems as virtual test-takers. This method predicts question difficulty based on the correct/incorrect responses obtained from QA systems, obviating the need to pre-collect questions with known difficulties. However, this approach is limited by the fact that the scale of difficulty values estimated from the responses of QA systems does not necessarily align with the scale derived from human test-takers' responses. To overcome this limitation, we propose a novel method that utilizes QA systems to predict question difficulty while ensuring the difficulty scale aligns with that derived from human test-takers. Our method uses the principle of test linking from item response theory to transform the difficulty scale predicted by QA systems into one derived from human test-takers. Experiments using real data demonstrate that our proposed method achieves higher accuracy in difficulty prediction than conventional methods.

Keywords
Question difficulty, item response theory, large language models, question answering, educational measurement

1. Introduction

Estimating the difficulty of test questions is a crucial task in the education domain. For example, in the context of learning support, providing questions of appropriate difficulty to individual learners enhances learning. Accordingly, such adaptive question presentation is a common objective of intelligent tutoring systems, adaptive learning systems, and knowledge tracing technologies [1, 2, 3, 4, 5, 6, 7]. Furthermore, in the context of educational measurement, estimating question difficulty, specifically using item response theory (IRT) [8], enables sophisticated testing operations, including (1) adaptive testing, which enables accurate measurement of ability in a short time by presenting questions with a difficulty tailored to each test-taker's ability [9]; (2) uniform test assembly, which involves composing multiple test forms with equivalent difficulty levels [10]; and (3) test linking, which facilitates ability estimation on a common scale for test-takers who have taken different test forms [11, 12]. Given these applications, estimating question difficulty plays a crucial role in various essential tasks in the educational field.

The most common approach for estimating question difficulty entails presenting target questions to human test-takers and using the resulting correct/incorrect response data to estimate their difficulties [11, 12, 13].
Methods for quantifying difficulty are generally divided into two approaches: one based on classical test theory [14], which quantifies question difficulty through the correct answer rate, and another based on IRT [15]. However, regardless of how difficulty is quantified, this strategy necessitates administering the target questions to human test-takers in advance, which incurs significant costs and can compromise test reliability owing to exposure of the test content.

Methodologies that employ natural language processing technology to predict question difficulty from question texts have recently attracted widespread attention as a means of overcoming this limitation [15, 16, 17, 18, 19, 20, 21]. In this approach, a large dataset of questions with known difficulties is assumed to be given. This dataset is compiled by presenting a large number of questions to a specific group of test-takers and estimating their difficulties from the correct/incorrect responses obtained. The resulting dataset of questions with known difficulties is then used to train a supervised machine learning model that predicts the difficulty of a question from its text.

Existing methods based on this approach can be broadly divided into feature-based and neural-based methods [15, 21]. Typical feature-based methods include R2DE (Regressor for Difficulty and Discrimination Estimation) [18] and its extensions (e.g., [16]). However, these methods require meticulous feature engineering to achieve high accuracy. Neural-based methods obviate the need for feature engineering by utilizing deep neural networks that process the sequence of words in a question. Recent studies have proposed neural-based methods that utilize pre-trained transformer models such as BERT (bidirectional encoder representations from transformers) [22] and DistilBERT (distilled BERT) [23], as illustrated in Fig. 1 [17, 19, 20]. (In Fig. 1, the input consists of the sequence of words in a question text, including related information such as the reading passage and the correct and distractor options; w_t denotes the t-th word of the input sequence and n the input length. The output vector of the special [CLS] token serves as a distributed representation of the given text, and the model predicts a difficulty value by converting this representation into a scalar through a linear layer.) However, even with these neural methods, the accuracy of difficulty prediction often remains modest. For example, a recent study using difficulty prediction models based on BERT and DistilBERT reported that the correlation between the predicted and actual IRT-based difficulty values was only 0.441, despite a relatively large training set of around 6,700 samples [24]. This suggests inherent limitations to the accuracy of this approach, potentially due to the substantial differences between the task of question difficulty prediction and general natural-language understanding; these differences complicate transferring the language understanding capability of pre-trained models to difficulty prediction.

Figure 1: Outline of a neural difficulty prediction model.
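To make the conventional text-based approach in Fig. 1 concrete, the following is a minimal sketch of a transformer-based difficulty regressor built with the Hugging Face transformers library. It is an illustration under assumed settings (model name, toy inputs, MSE loss), not the implementation used in the studies cited above.

```python
# Minimal sketch of a text-based difficulty regressor (cf. Fig. 1).
# A pre-trained transformer encodes the question text; the [CLS] output
# vector is mapped to a scalar difficulty by a linear layer and trained
# with a regression loss against IRT-based difficulty labels.
import torch
from transformers import AutoTokenizer, AutoModel

class DifficultyRegressor(torch.nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.linear = torch.nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = out.last_hidden_state[:, 0]    # [CLS] representation
        return self.linear(cls_vec).squeeze(-1)  # predicted difficulty b_i

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = DifficultyRegressor()
batch = tokenizer(["Example question text with its options ..."],
                  return_tensors="pt", padding=True, truncation=True)
pred_difficulty = model(batch["input_ids"], batch["attention_mask"])
# Toy IRT-based difficulty label; in practice these come from human response data.
loss = torch.nn.functional.mse_loss(pred_difficulty, torch.tensor([0.7]))
```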
On the other hand, an alternative approach has been explored that predicts difficulty using question-answering (QA) systems as virtual test-takers [25, 26, 27], as outlined in Fig. 2. This approach constructs several QA systems in advance and predicts the difficulties of target questions by estimating them from the correct/incorrect responses of the QA systems. For instance, Gao et al. [25] introduced a binary classification of question difficulty, categorizing a question as "easy" if answered correctly by two QA systems and "hard" if answered incorrectly by them. Byrd et al. [26] and Uto et al. [27] proposed estimating IRT-based question difficulty from the correct/incorrect responses of various QA systems. The key premise of this approach is that natural language understanding and QA capabilities are closely related, which makes constructing QA systems from pre-trained neural models easier than constructing text-based difficulty prediction models. Although this approach provides a difficulty prediction system without the need to pre-collect questions with known difficulties, it is limited by the fact that the scale of difficulty values estimated from the responses of QA systems does not necessarily align with the scale derived from human test-takers' responses. If these scales do not align, the difficulty values derived from QA systems may not be applicable or meaningful for human test-takers.

Figure 2: Outline of difficulty prediction using QA systems.

To overcome this limitation, we propose a novel method for predicting question difficulty using QA systems while ensuring that the difficulty scale aligns with that derived from human test-takers. This method leverages the principle of test linking within IRT [28, 29], a well-established strategy for unifying the scales of IRT parameters estimated from different datasets. Specifically, our method initially collects correct/incorrect responses to a set of questions from human test-takers and various QA systems, and applies IRT to this response data to estimate ability values. In this process, we first estimate the ability values of the human test-takers using only their response data. Subsequently, given the estimated abilities of the human test-takers, we estimate the ability values of the QA systems with IRT using the entire response data. This maps the QA systems' ability estimates onto the scale of the human test-takers. We then estimate the difficulty of new questions from the correct/incorrect responses of the QA systems, using IRT given the QA systems' ability values. Because the ability scale for the QA systems is matched to that of the human test-takers, the difficulty values for new questions derived from the QA systems are also aligned with the difficulty scale derived from human test-takers. Experiments with real data confirm that this method outperforms conventional methods that predict question difficulty from question texts.

To our knowledge, this research is the first to focus on using QA systems to predict question difficulty in alignment with the scale of human test-takers, although a similar attempt has recently been investigated [30]. Moreover, within the context of test theory, our proposed method introduces a novel test linking technique based on QA systems. This could mark a significant advancement both for traditional test theory methodologies and for question difficulty prediction research.
2. Quantification of difficulty using IRT

This study employs IRT to quantify question difficulty because our proposed method leverages the advantages of IRT. IRT uses statistical models, called IRT models, to define the probability of each test-taker's response to a question as a function of both the test-taker's ability and the question's difficulty. This study uses the Rasch model, the simplest IRT model, which defines the probability that test-taker j answers question i correctly as

    p_ij = [1 + exp(−(θ_j − b_i))]^(−1),   (1)

where θ_j is the parameter representing the ability of test-taker j, and b_i is the parameter representing the difficulty of question i. These parameters are estimated from a collection of correct/incorrect responses of a group of test-takers to a set of questions. In the following sections, we assume that difficulty is quantified with this Rasch model.

3. Proposed method

Our proposed method utilizes the concept of IRT-based test linking [28] to realize question difficulty prediction using QA systems while ensuring that the predicted difficulty scale aligns with that derived from human test-takers. The detailed procedure involves the following steps; a minimal code sketch of steps 2-4 is given after the list.

1. Construct QA systems by fine-tuning pre-trained neural models, such as BERT. The fine-tuning can use publicly available question corpora that match the format of the target questions, or a small subset of the target question bank. QA systems with varying performance need to be prepared to mimic the diverse abilities of human test-takers, for instance by varying the base pre-trained models or by limiting the amount of data used for fine-tuning.

2. Administer a set of questions to human test-takers, collect correct/incorrect response data, and apply the Rasch model to the data to estimate the ability value of each test-taker. In the subsequent experiments, we apply expected a posteriori (EAP) estimation based on a Markov chain Monte Carlo algorithm. Note that text-based difficulty prediction methods also require such human response data to construct training data consisting of questions with known difficulty.

3. Gather correct/incorrect responses from the QA systems for the same questions administered to the human test-takers. Then, using the entire set of response data collected from both the human test-takers and the QA systems, estimate the ability values of the QA systems with the Rasch model. In this process, the ability estimates of the human test-takers must be given and fixed. This aligns the QA systems' ability estimates with those of the human test-takers on a common scale.

4. Gather correct/incorrect responses from the QA systems for new target questions and estimate their difficulty from this response data with the Rasch model, given the QA systems' ability estimates. Because the ability scale of the QA systems is matched to that of the human test-takers, the difficulty values for new questions derived from the QA systems are also aligned with the difficulty scale derived from human test-takers. Note that this difficulty inference is feasible both for individual questions one by one and for all new questions simultaneously.

Fig. 3 provides conceptual diagrams for steps 2 and 3, which constitute the preparation phase for difficulty prediction, and Fig. 4 provides the diagram for step 4, the phase of predicting difficulty for new questions.

Figure 3: Preparation phase of our proposed method.
Figure 4: Prediction phase of our proposed method.
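The following is a minimal, illustrative sketch of steps 2-4. The experiments in this paper use EAP estimation based on MCMC; for brevity, the sketch instead uses a crude alternating maximum-likelihood approximation with SciPy, estimates parameters sequentially rather than jointly in step 3, and runs on toy response data, so all names, sizes, and optimization details are assumptions rather than our actual implementation.

```python
# Minimal sketch of the Rasch-based linking pipeline (steps 2-4).
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import expit  # logistic sigmoid

def rasch_p(theta, b):
    """P(correct) under the Rasch model: [1 + exp(-(theta - b))]^(-1)."""
    return expit(theta - b)

def _mle(nll):
    # Bounded 1-D maximum-likelihood estimate of a single Rasch parameter.
    return minimize_scalar(nll, bounds=(-6.0, 6.0), method="bounded").x

def estimate_ability(resp, b):
    """Ability of one test-taker given fixed difficulties b of the answered items."""
    return _mle(lambda t: -np.sum(resp * np.log(rasch_p(t, b))
                                  + (1 - resp) * np.log(1 - rasch_p(t, b))))

def estimate_difficulty(resp, theta):
    """Difficulty of one item given fixed abilities theta of its respondents."""
    return _mle(lambda d: -np.sum(resp * np.log(rasch_p(theta, d))
                                  + (1 - resp) * np.log(1 - rasch_p(theta, d))))

# --- Step 2: human abilities from human responses to the linking questions.
# human_resp: (num_humans, num_link_items) 0/1 matrix (toy data, assumed complete).
rng = np.random.default_rng(0)
human_resp = rng.integers(0, 2, size=(100, 50))
b_link = np.zeros(50)
theta_human = np.zeros(100)
for _ in range(20):  # crude alternating estimation; in practice an
    # identification constraint or priors (as in Bayesian EAP) are needed.
    b_link = np.array([estimate_difficulty(human_resp[:, i], theta_human)
                       for i in range(50)])
    theta_human = np.array([estimate_ability(human_resp[j], b_link)
                            for j in range(100)])

# --- Step 3: QA-system abilities on the human scale. Human abilities stay fixed;
# for simplicity, the linking-item difficulties obtained above (already anchored
# to the human scale) are reused here instead of re-estimating them jointly.
qa_resp_link = rng.integers(0, 2, size=(72, 50))
theta_qa = np.array([estimate_ability(qa_resp_link[k], b_link)
                     for k in range(72)])

# --- Step 4: difficulty of a new question from QA responses only,
# with the QA abilities held fixed.
qa_resp_new = rng.integers(0, 2, size=72)
b_new = estimate_difficulty(qa_resp_new, theta_qa)
print(f"predicted difficulty on the human scale: {b_new:.2f}")
```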
The advantage of our method compared with conventional text-based difficulty prediction methods is its efficiency in constructing the question difficulty prediction system. As discussed in Section 1, text-based difficulty prediction methods require a large number of questions with known difficulty as a training dataset. This means that a vast number of human responses to many questions must be collected in advance, which incurs extensive costs. The proposed method can significantly reduce the required amount of such response data because the ability parameter of the Rasch model can be estimated from just several dozen questions [31]. Furthermore, QA systems based on pre-trained transformer models are expected to be constructible from a relatively small number of questions, as discussed in Section 1, and can therefore contribute greatly to reducing the data needed to build a question difficulty prediction system. Finally, compared with conventional QA-based difficulty prediction methods, our method has the advantage of predicting question difficulty on a scale aligned with that derived from human test-takers' responses, using only the QA systems.

4. Evaluation experiment

We designed an empirical experiment to evaluate the effectiveness of our method. In it, the accuracy of predicting the difficulty of new questions with our method is compared with that of conventional text-based difficulty prediction methods.

4.1. Data

For our experiments, we utilized two publicly available datasets: EVKD (the ESL Learners' Vocabulary Knowledge Dataset) [32, 33], which comprises English vocabulary tests, and SQuAD (the Stanford Question Answering Dataset) [34], commonly employed as a benchmark in QA and question generation research.

EVKD comprises English vocabulary test questions that require test-takers to select, from multiple choices, the appropriate expression to replace a specified part of a given English sentence. The dataset contains 100 questions, each with question text, one correct answer choice, and three distractor choices, along with correct/incorrect response data from 100 English learners. In this experiment, a randomly selected subset of 50 questions was used to construct QA systems, while the remaining 50 questions and their corresponding response data were used to evaluate the prediction performance of both our proposed and the conventional methods.

SQuAD is a reading comprehension dataset comprising reading passages, comprehension questions, and reference answers. The reading passages are sourced from Wikipedia, and the questions and reference answers were generated by crowdworkers. Each reference answer corresponds to a segment of text in the reading passage. The SQuAD dataset is pre-split into 90% for training and 10% for testing. However, it cannot be applied directly to our experiment because it lacks correct/incorrect response data from human test-takers. Thus, for this study, we randomly selected 570 questions from the SQuAD test set and collected response data from 10 human test-takers for these questions. On average, each test-taker answered 120 questions, and at least two test-takers responded to each question. Answer correctness was verified by exact match after preprocessing the test-takers' answers (e.g., removing articles, standardizing case, eliminating spaces).
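The answer-correctness check can be illustrated with a small normalization routine in the spirit of the official SQuAD evaluation script; the exact preprocessing rules applied in our study may differ in detail.

```python
# Illustrative exact-match check after normalization (articles removed,
# case standardized, punctuation stripped, whitespace collapsed).
import re
import string

def normalize_answer(text: str) -> str:
    text = text.lower()                                    # standardize case
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)            # remove articles
    return " ".join(text.split())                          # collapse spaces

def exact_match(prediction: str, reference: str) -> bool:
    return normalize_answer(prediction) == normalize_answer(reference)

print(exact_match("The Eiffel Tower", "eiffel tower."))    # True
```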
In this experiment, the SQuAD training dataset was used to construct QA systems, and the 570 questions with responses from human test-takers were used to evaluate the prediction performance of both our proposed and the conventional methods.

4.2. QA systems

For each dataset, a variety of QA systems with differing abilities were developed. Specifically, we utilized 12 pre-trained transformer models from Hugging Face (https://huggingface.co/): bert-base-uncased, bert-large-uncased, roberta-base, roberta-large, microsoft/deberta-base, microsoft/deberta-large, microsoft/deberta-v3-base, microsoft/deberta-v3-large, albert-base-v1, albert-base-v2, albert-large-v2, and distilbert-base-uncased. We adapted the output layers of these models to the question type of each dataset and trained them with varying amounts of data to generate QA systems of diverse performance levels. (Training of the QA systems used AdamW with a learning rate of 1e-5 and a maximum of 5 epochs; neither the EVKD nor the SQuAD dataset was used in the original pretraining of these transformer models.)

For the EVKD dataset, the QA systems were designed as classifiers that process the question text and the four choice options to identify the correct answer. Specifically, the special token [CLS] is prepended to the input text, and a classification layer is added on top of the output vector corresponding to this token; a special token [SEP] is inserted as a separator between the question text and the four choice options. As mentioned previously, data from 50 questions were available for constructing the QA systems. Accordingly, we trained each QA system using the entire set and random subsets of 40, 30, 20, 10, and 5 questions, respectively.

For the SQuAD dataset, the QA systems were configured to predict the start and end positions of the answer within the reading passage. The model input is the concatenation of a passage and a question text, separated by the special token [SEP]. We trained each QA system using the entire SQuAD training dataset and random subsets of 3000, 2400, 1800, 1200, and 600 data points, respectively.

This procedure resulted in 72 QA systems with varying levels of ability for each dataset.
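To illustrate the two QA-system configurations described above, the following sketch instantiates the corresponding model heads with the Hugging Face transformers library. The input formatting, toy items, and the choice of a 4-way sequence classifier for EVKD are assumptions for illustration; the actual systems are obtained by fine-tuning each checkpoint on the training subsets listed above.

```python
# Illustrative construction of QA systems of varying ability (cf. Section 4.2).
# Checkpoint names and subset sizes follow the text; everything else is a sketch.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          AutoModelForQuestionAnswering)

checkpoints = ["bert-base-uncased", "roberta-base", "distilbert-base-uncased"]  # 12 in total
train_sizes = [600, 1200, 1800, 2400, 3000, None]  # None = full training set (SQuAD setting)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# EVKD-style system: 4-way classifier over "[CLS] question [SEP] choices".
clf = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                         num_labels=4)
question = "She tried to ____ the problem."              # toy item
choices = ["solve", "melt", "sing", "paint"]
enc = tok(question, " [SEP] ".join(choices), return_tensors="pt", truncation=True)
with torch.no_grad():
    pred_choice = clf(**enc).logits.argmax(-1)            # index of predicted option

# SQuAD-style system: predicts start/end positions of the answer span.
qa = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")
enc = tok("Who wrote Hamlet?", "Hamlet is a tragedy written by Shakespeare.",
          return_tensors="pt", truncation=True)
with torch.no_grad():
    out = qa(**enc)
start = out.start_logits.argmax().item()
end = out.end_logits.argmax().item()
answer = tok.decode(enc["input_ids"][0][start:end + 1])
# Before fine-tuning, both task heads are randomly initialized, so real use
# requires training each checkpoint on the corpora described above. Varying the
# base checkpoint and the amount of fine-tuning data yields QA systems with
# diverse ability levels, mimicking a heterogeneous group of test-takers.
```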
4.3. Experimental procedure

We evaluated the performance of difficulty prediction using both our proposed method with the constructed QA systems and conventional methods that predict difficulty from question texts using supervised regression models. As detailed in Section 4.1, 50 questions from the EVKD dataset and 570 questions from the SQuAD dataset, along with the corresponding responses from human test-takers, were available for this evaluation. For each dataset, we randomly split the data into 90% and 10%. The 90% portion, denoted D, was used to develop the difficulty predictors; developing a predictor corresponds to estimating the abilities of human test-takers and QA systems in our method, and to training a regression model that predicts difficulty from question texts in the conventional methods. The remaining 10%, denoted E, was used to evaluate the accuracy of difficulty prediction.

Specifically, in our method, the ability values of the human test-takers were first estimated with the Rasch model from their correct/incorrect response data within D. Subsequently, the ability values of the QA systems were estimated from the response data of both the 72 QA systems and the human test-takers for the same questions, with the ability values of the human test-takers held fixed. Finally, for each question in E, the difficulty was estimated with the Rasch model from the responses of the QA systems, with the ability estimates of the QA systems held fixed. These estimates were treated as the predicted difficulty values.

In the conventional methods, the difficulty of the questions within D was first estimated with the Rasch model from the human test-takers' response data. Regression models for predicting difficulty from question texts were then trained on the resulting set of questions with estimated difficulties (training used AdamW with a learning rate of 1e-5 and a maximum of 10 epochs). We explored two neural regression models, BERT and DistilBERT, which were also utilized in prior research. For each question in E, the predicted difficulty values were obtained by inputting the question texts into these trained models.

Because the objective of this study is to predict difficulty values for new questions on a scale aligned with that of human test-takers, the ground-truth difficulty value for each question in E was estimated from the response data of the human test-takers within E. In this estimation, the ability values of the human test-takers estimated from D were given. The process for generating the ground-truth difficulty values is depicted in Fig. 5.

Figure 5: Creation of the ground-truth difficulty.

We evaluated the accuracy of difficulty prediction by comparing the predicted difficulty of each question in E provided by each method with the corresponding ground truth defined above. Correlation coefficients and regression coefficients served as evaluation metrics; correlation and regression coefficient values closer to one indicate higher prediction accuracy. To improve the reliability of the results, we repeated the experiment 10 times, varying the random splits of D and E each time.

4.4. Experimental results

The experimental results are presented in Table 1. The rows labeled Mean and SD give the average performance over the 10 repetitions and its standard deviation, respectively.

Table 1: Experimental results

                     Correlation coefficient          Regression coefficient
                     Prop.    BERT     DistilBERT     Prop.    BERT     DistilBERT
EVKD    Mean         0.326    0.274    0.275          0.215    0.034    0.026
        SD           0.251    0.233    0.412          0.175    0.030    0.031
SQuAD   Mean         0.588    0.191    0.134          1.339    0.065    0.034
        SD           0.054    0.043    0.047          0.125    0.019    0.014

The results demonstrate that our proposed method outperforms the conventional methods on both datasets. Notably, the regression coefficients of the conventional methods are nearly zero. To elucidate this phenomenon, Fig. 6 and Fig. 7 display scatter plots of the predicted difficulty values against the ground truth for each dataset. These figures show that the conventional methods produce predictions with very limited variance, failing to capture the range of difficulty.

Figure 6: Relationship between predicted difficulty values and ground truth in the EVKD dataset.
Figure 7: Relationship between predicted difficulty values and ground truth in the SQuAD dataset.
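For reference, the two evaluation metrics reported in Table 1 can be computed as in the following sketch. Here the regression coefficient is assumed to be the ordinary least-squares slope obtained by regressing the ground-truth difficulties on the predicted values; the toy numbers are purely illustrative.

```python
# Illustrative computation of the evaluation metrics in Table 1.
import numpy as np

def evaluation_metrics(predicted, ground_truth):
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    corr = np.corrcoef(predicted, ground_truth)[0, 1]        # Pearson correlation
    slope = np.polyfit(predicted, ground_truth, deg=1)[0]    # assumed OLS slope
    return corr, slope

corr, slope = evaluation_metrics([-1.2, 0.3, 0.8, 1.5], [-0.9, 0.1, 1.0, 1.3])
print(f"correlation = {corr:.3f}, regression coefficient = {slope:.3f}")
```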
Table 2: Ability estimates and their PSDs for human test-takers and for each of the 12 pre-trained transformer models used in the QA systems.

                           Ability estimates θ_j     PSD of θ_j
Test-takers                Mean      SD              Mean     SD
human                      0.056     0.588           0.243    0.034
distilbert.base.uncased   -0.189     1.602           0.133    0.017
bert.base.uncased          0.374     1.161           0.130    0.011
albert.base.v1             1.025     0.733           0.134    0.013
albert.large.v2            1.031     0.510           0.133    0.006
albert.base.v2             1.101     0.766           0.138    0.012
bert.large.uncased         1.186     0.803           0.138    0.014
roberta.base               1.375     0.814           0.140    0.015
deberta.base               1.542     0.684           0.144    0.014
deberta.v3.base            1.590     0.646           0.143    0.014
roberta.large              1.789     0.587           0.148    0.013
deberta.v3.large           1.993     0.579           0.150    0.014
deberta.large              2.028     0.384           0.152    0.010

Table 3: Ability estimates of QA systems across various training sample sizes.

Sample size    Ability estimates θ_j
               Mean      SD
300            0.136     1.197
600            0.816     0.747
900            1.194     0.609
1200           1.362     0.610
1500           1.516     0.563
full           2.398     0.403

4.5. Additional analysis

This section analyzes the ability estimates of the QA systems constructed for our method. Specifically, we investigated the estimated abilities of the 72 QA systems and of the human test-takers obtained from the experiments conducted on the SQuAD dataset.

Because the experiments yielded ability estimates for each of the 10 repetitions, we first examined the correlations and root mean squared errors (RMSEs) of the estimates among all pairs of the 10 repetitions. The average correlation was 0.995 with an SD of 0.001, and the average RMSE was 0.102 with an SD of 0.014. These results suggest that the ability estimates are highly consistent across repetitions. We therefore investigated the ability estimates obtained from the first repetition.

Table 2 shows the statistics of the ability estimates for the human test-takers and for each of the 12 pre-trained transformer models used in the 72 QA systems: the average and SD of the ability estimates as well as of the posterior standard deviations (PSDs) of those estimates. The table reveals some reasonable trends. Variants of DeBERTa, one of the more recent models, exhibit the highest average abilities, whereas DistilBERT, a simplified version of BERT, shows the lowest. Furthermore, comparing model sizes, the larger models tend to attain higher abilities. Table 2 also indicates that the PSDs are low for all test-takers, including humans and QA systems, suggesting that the accuracy of the ability estimation is acceptable.

Furthermore, Table 3 shows the average and SD of the ability estimates of the QA systems across the various training sample sizes, demonstrating that increasing the training sample size increases the ability estimates, which is also a reasonable trend.

Finally, Tables 2 and 3 indicate that the developed QA systems tend to have higher abilities than the human test-takers, suggesting a mismatch in the ability distributions of QA systems and human test-takers. This discrepancy can potentially degrade difficulty estimation, especially in the low-difficulty range. Therefore, filtering the QA systems or adding relatively weak QA systems could improve the performance of difficulty estimation, a task we intend to address in future studies.

5. Conclusion

In this study, we introduced an IRT-based question difficulty prediction method that uses QA systems as virtual test-takers while ensuring alignment of the difficulty scale with that derived from human test-takers. Through experiments with real data, we showed that our proposed method outperforms traditional text-based difficulty prediction methods.
This study has some limitations. The first is the limited scope of the experiments, due primarily to the scarcity of open datasets that include both question data and human test-takers' responses. Future work will evaluate the effectiveness of the proposed method across a broader range of datasets in various educational domains to identify necessary adaptations. Further investigations based on our data, such as examining whether larger difficulty estimation errors tend to occur for lower-difficulty questions, will also be part of our future research.

Second, the proposed method requires collecting questions for training the QA systems in addition to questions with human test-takers' responses. We assume that it is generally easier to collect questions without human responses than questions with them, and that a relatively small dataset may suffice for training the QA systems. However, the feasibility of this data collection process and the amount of data required for training should be examined in future investigations. Furthermore, a recent study proposes treating the uncertainty of a QA system's predictions as the difficulty of multiple-choice questions [35]; this idea might be integrated with our approach to enhance its effectiveness.

Finally, the proposed method is expected to require significantly fewer questions with human responses than the conventional text-based difficulty prediction approach, because it uses the response data primarily to estimate a small number of parameters in an IRT model, whereas the conventional approach uses these data to train a large neural model on a complex task. Future work will explore the extent to which the proposed method can reduce the amount of data required and the corresponding costs.

References

[1] G. Kurdi, J. Leo, B. Parsia, U. Sattler, S. Al-Emari, A systematic review of automatic question generation for educational purposes, International Journal of Artificial Intelligence in Education 30 (2019) 121–204.
[2] N.-T. Le, T. Kojiri, N. Pinkwart, Automatic question generation for educational applications – the state of art, Advanced Computational Methods for Knowledge Engineering 282 (2014) 325–338.
[3] M. Rathod, T. Tu, K. Stasaski, Educational multi-question generation for reading comprehension, in: Proc. Workshop on Innovative Use of NLP for Building Educational Applications, Seattle, Washington, 2022, pp. 216–223.
[4] R. Zhang, J. Guo, L. Chen, Y. Fan, X. Cheng, A review on question generation from natural language text, ACM Transactions on Information Systems 40 (2021) 1–43.
[5] M. Liu, R. A. Calvo, Using information extraction to generate trigger questions for academic writing support, in: Intelligent Tutoring Systems, Chania, Crete, Greece, 2012, pp. 358–367.
[6] C. Piech, J. Bassen, J. Huang, S. Ganguli, M. Sahami, L. J. Guibas, J. Sohl-Dickstein, Deep knowledge tracing, Advances in Neural Information Processing Systems 28 (2015) 505–513.
[7] Z. Wang, X. Feng, J. Tang, G. Y. Huang, Z. Liu, Deep knowledge tracing with side information, in: Proc. International Conference on Artificial Intelligence in Education, 2019, pp. 303–308.
[8] F. M. Lord, Applications of item response theory to practical testing problems, Routledge, 1980.
[9] W. J. van der Linden, P. J. Pashley, Computerized Adaptive Testing: Theory and Practice, Springer Netherlands, 2000.
[10] D. Belov, Uniform test assembly: Concepts, problems, solvers, and applications for adaptive testing, Journal of Computerized Adaptive Testing 5 (2017) 1–21.
[11] F. B. Baker, S. H. Kim, Item Response Theory: Parameter Estimation Techniques, CRC Press, Boca Raton, FL, USA, 2004.
[12] W. J. van der Linden, R. K. Hambleton, Handbook of Modern Item Response Theory, Springer Verlag, 1996.
[13] M. J. Kolen, R. L. Brennan, Test Equating, Scaling, and Linking: Methods and Practices, Springer New York, 2013.
[14] F. M. Lord, M. R. Novick, Statistical Theories of Mental Test Scores, Information Age Publishing, 1968.
[15] L. Benedetto, P. Cremonesi, A. Caines, P. Buttery, A. Cappelli, A. Giussani, R. Turrin, A survey on recent approaches to question difficulty estimation from text, ACM Computing Surveys 55 (2023).
[16] L. Benedetto, G. Aradelli, P. Cremonesi, A. Cappelli, A. Giussani, R. Turrin, On the application of transformers for estimating the difficulty of multiple-choice questions from text, in: Proc. Workshop on Innovative Use of NLP for Building Educational Applications, 2021, pp. 147–157.
[17] L. Benedetto, A. Cappelli, R. Turrin, P. Cremonesi, Introducing a framework to assess newly created questions with natural language processing, in: Proc. International Conference on Artificial Intelligence in Education, 2020, pp. 43–54.
[18] L. Benedetto, A. Cappelli, R. Turrin, P. Cremonesi, R2DE: a NLP approach to estimating IRT parameters of newly generated questions, in: Proc. International Conference on Learning Analytics & Knowledge, 2020, pp. 412–421.
[19] A. D. McCarthy, K. P. Yancey, G. T. LaFlair, J. Egbert, M. Liao, B. Settles, Jump-starting item parameters for adaptive language tests, in: Proc. Conference on Empirical Methods in Natural Language Processing, 2021, pp. 883–899.
[20] K. Xue, V. Yaneva, C. Runyon, P. Baldwin, Predicting the difficulty and response time of multiple choice questions using transfer learning, in: Proc. Workshop on Innovative Use of NLP for Building Educational Applications, 2020, pp. 193–197.
[21] S. AlKhuzaey, F. Grasso, T. R. Payne, V. Tamma, Text-based question difficulty prediction: A systematic review of automatic approaches, International Journal of Artificial Intelligence in Education (2023).
[22] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proc. Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2019, pp. 4171–4186.
[23] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, arXiv preprint arXiv:1910.01108, 2019.
[24] L. Benedetto, A quantitative study of NLP approaches to question difficulty estimation, in: Proc. International Conference on Artificial Intelligence in Education, 2023, pp. 428–434.
[25] Y. Gao, L. Bing, W. Chen, M. Lyu, I. King, Difficulty controllable generation of reading comprehension questions, in: Proc. International Joint Conference on Artificial Intelligence, 2019, pp. 4968–4974.
[26] M. Byrd, S. Srivastava, Predicting difficulty and discrimination of natural language questions, in: Proc. Annual Meeting of the Association for Computational Linguistics, 2022, pp. 119–130.
[27] M. Uto, Y. Tomikawa, A. Suzuki, Difficulty-controllable neural question generation for reading comprehension using item response theory, in: Proc. Workshop on Innovative Use of NLP for Building Educational Applications, 2023, pp. 119–129.
[28] M. J. Kolen, R. L. Brennan, Test Equating, Scaling, and Linking, Springer Verlag, 2014.
[29] M. Uto, Accuracy of performance-test linking based on a many-facet Rasch model, Behavior Research Methods 53 (2021) 1440–1454.
[30] H. Maeda, Field-testing multiple-choice questions with AI examinees, preprint, Research Square, 2024.
[31] J. M. Linacre, Sample size and item calibration stability, Rasch Measurement Transactions 7 (1994).
[32] Y. Ehara, Building an English vocabulary knowledge dataset of Japanese English-as-a-second-language learners using crowdsourcing, in: Proc. Language Resources and Evaluation Conference, 2018, pp. 484–488.
[33] Y. Ehara, I. Sato, H. Oiwa, H. Nakagawa, Mining words in the minds of second language learners: Learner-specific word difficulty, in: Proc. International Conference on Computational Linguistics, 2012, pp. 799–814.
[34] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, in: Proc. Conference on Empirical Methods in Natural Language Processing, 2016, pp. 2383–2392.
[35] E. Loginova, L. Benedetto, D. Benoit, P. Cremonesi, Towards the application of calibrated transformers to the unsupervised estimation of question difficulty from text, in: Proc. International Conference on Recent Advances in Natural Language Processing, 2021, pp. 846–855.