=Paper=
{{Paper
|id=Vol-3836/paper1
|storemode=property
|title=Using Machine Learning to Predict the Number of Latent Skills in Online Learning Environments
|pdfUrl=https://ceur-ws.org/Vol-3836/paper2.pdf
|volume=Vol-3836
|authors=Changsheng Chen,Robbe D’hondt,Celine Vens,Wim Van Den Noortgate
|dblpUrl=https://dblp.org/rec/conf/all/ChenDVN24
}}
==Using Machine Learning to Predict the Number of Latent Skills in Online Learning Environments==
Changsheng Chen¹,²,∗, Robbe D’hondt²,³, Celine Vens²,³ and Wim Van Den Noortgate¹,²

¹ Faculty of Psychology and Educational Sciences, KU Leuven, Campus KULAK, Kortrijk, Belgium
² imec research group itec, KU Leuven, Kortrijk, Belgium
³ Department of Public Health and Primary Care, KU Leuven, Campus KULAK, Kortrijk, Belgium
Abstract
Extracting skill information about students in online learning environments is a challenging topic across different domains. Predicting the number of skills is the first step towards estimating students’ skills. In this paper, we propose prediction methods based on Machine Learning (ML) models: we use the analysis model to generate simulation data that reflects the data features of our target scenarios, and extract features from the simulation data to train and test ML models. We illustrate this approach in tandem with Multidimensional Item Response Theory (MIRT) for the simple and complex structures, and further compare the trained ML models with a selection of statistical methods on the test data. Our preliminary results show that, compared to the statistical methods, the ML models generally reach a noticeably higher proportion of correct estimations for both structures. Additionally, we find that an increase in the percentage of missing values and in the sample size has a negative and a positive effect on the methods’ performance, respectively. Using simulation data from the analysis model to train ML models and make predictions can extend the current practice of skill extraction, providing extra options for practitioners.
Keywords
machine learning, multidimensional item response theory, latent skills, online learning
1. Introduction
Skill information is a fundamental type of quantitative evidence for building an online learning system (including adaptive lifelong learning systems). With accurate estimates of users’ skills, such a system can personalize materials and instructional design to improve the learning experience effectively and efficiently. By monitoring changes in users’ skill information, the system can continually recommend further learning resources adapted to each user’s situation. However, which skills can be extracted and monitored, and how the skill information can be estimated from which test items and users’ responses, remain challenging questions.
Several kinds of techniques have been used to extract users’ skill information based on users’
response to test items, such as Multidimensional Item Response Theory (MIRT) [2], Cognitive
Diagnostic Model (CDM) [3], Matrix Factorization (MF) [4,5], and so forth. The common start
ALL’24: Workshop on Adaptive Lifelong Learning, July 08–12, 2024, Recife, Brazil [1]
∗ Corresponding author.
changsheng.chen@kuleuven.be (C. Chen); robbe.dhondt@kuleuven.be (R. D’hondt); celine.vens@kuleuven.be (C. Vens); wim.vandennoortgate@kuleuven.be (W. V. D. Noortgate)
ORCID: 0000-0001-6092-6655 (C. Chen); 0000-0001-7843-2178 (R. D’hondt); 0000-0003-0983-256X (C. Vens); 0000-0003-4011-219X (W. V. D. Noortgate)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
for conducting these techniques is to decide the number of skills and clarify the relationship
between items and skills (or knowledge components). In other words, the number of skills and
which items can be used to measure which skills are clearly defined before skill estimation and
tracing algorithms are performed. For example, in the MIRT, the item-dimension relationship
needs to be explored, which serves as the basis for estimating user’s skill values, after the
predetermination of the number of latent dimensions. In the CDM, the item-attributes
relationship depicted by the Q-matrix functions in a similar way and the number of attributes
should also be confirmed beforehand. In the MF, the number of ranks for shaping two
decomposed matrices (i.e., a user-factor matrix and an item-factor matrix) is required initially
before the technique is performed. Traditionally, the number of skills and the item–skill relationship are defined theoretically by domain experts. However, human examination is too inefficient to satisfy the needs of an online learning system, given the large number of items, which calls for a data-driven approach (i.e., extracting the number of skills and exploring and confirming the item–skill structure based on the response matrix).
Many techniques have been proposed to estimate the number of skills based on data-driven
evidence. For example, in the MIRT, the number of latent dimensions is estimated by certain
statistical methods, such as Kaiser Criterion (KC) [6], Empirical Kaiser Criterion (EKC) [7],
Parallel Analysis (PA) [8], non-graphical Scree Plot with Optimal Coordinates (OC) or
Acceleration Factor (AF) [9], Very Simple Structure (VSS) with two variants (i.e., C1 & C2) [10],
and so forth. In the CDM, the number of attributes and related Q-matrix are estimated and
evaluated by the designed algorithms or statistics, such as the G-DINA Discrimination Index
(GDI) method [11], the stepwise method [12], and so on. In the MF, the number of ranks is
usually seen as a hyperparameter, which is predicted based on the evaluation of defined loss
[13]. Additionally, some researchers have explored using Machine Learning (ML) methods to
estimate the number of skills, and they found that it can increase the proportion of correct
predictions. For example, Goretzko & Bühner [14] used eXtreme Gradient Boosting (XGBoost),
Random Forest (RF), and Adaptive Boosting to predict the number of factors for continuous
response simulation data, and found that these methods performed better than other traditional
statistical methods in terms of prediction accuracy (i.e., the proportion of correct estimation).
However, their study did not explore the possibility of using ML methods to predict the number of skills for dichotomous responses while considering the features of online or adaptive learning data (e.g., sparsity and a large number of items) and the properties of different multidimensional structures.
In this study, we aim to fill this research gap by proposing ML prediction methods inspired
by Goretzko & Bühner [14] and comparing their performance with other selected statistical
methods. The general operation is that we use the analysis model (such as the MIRT, CDM, or
MF) to generate simulation data reflecting the data features of target scenarios in online
learning environments. The simulation data includes two parts, i.e., the training data (including
validation data) for training and tuning ML models and the test data for evaluating the
performance of the ML models and the selected statistical methods. In detail, the selected methods included: 1) ML models: the regression variants of XGBoost and RF, whose results were rounded to the nearest integer; 2) statistical methods: KC, PA, EKC, Scree Plot (OC), Scree Plot (AF), VSS (C1), and VSS (C2). For the sake of parsimony, the explanation of the methods’ mechanisms is skipped; relevant details can be consulted in the provided references.
In the following sections, we illustrate this operation in tandem with the MIRT for the simple and complex structures. MIRT is the prevailing statistical model for analyzing students’ binary responses (0: wrong; 1: right) to estimate students’ abilities and relevant item parameters in the field of psychological and educational assessment. The principle of MIRT is that it models the probability of a correct answer based on the interaction between students’ abilities and item parameters. For example, a 2-parameter MIRT model can be expressed as:
$$P(x_{ij} = 1 \mid \boldsymbol{\theta}_i; \boldsymbol{\alpha}_j, d_j) = \frac{\exp(\boldsymbol{\alpha}_j \boldsymbol{\theta}_i' + d_j)}{1 + \exp(\boldsymbol{\alpha}_j \boldsymbol{\theta}_i' + d_j)}$$

In the above formula, $x_{ij} = 1$ refers to a correct response of user $i$ to item $j$, and $\boldsymbol{\theta}_i = (\theta_{i1}, \theta_{i2}, \ldots, \theta_{ik})$, $\boldsymbol{\alpha}_j = (\alpha_{j1}, \alpha_{j2}, \ldots, \alpha_{jk})$, and $d_j$ indicate the abilities of user $i$ for the $k$ skills, the discriminations of item $j$ for the $k$ skills, and the intercept of item $j$, respectively [2]. As for the
two multidimensional structures, under the simple structure, each item is solely related to one
latent skill and the latent skills are correlated with each other. Under the complex structure,
each item is related to more than one latent skill and the latent skills are correlated with each
other as well.
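For intuition, the 2-parameter response probability above can be sketched in a few lines (a Python illustration with hypothetical parameter values; the paper itself works in R):

```python
import numpy as np

def mirt_2pl_prob(theta, alpha, d):
    """Probability of a correct response under a 2-parameter MIRT model.

    theta : (k,) ability vector of a user across the k latent skills
    alpha : (k,) discrimination vector of an item for the k skills
    d     : scalar intercept of the item
    """
    logit = float(np.dot(alpha, theta) + d)
    # 1 / (1 + exp(-z)) is the same logistic form as exp(z) / (1 + exp(z)).
    return 1.0 / (1.0 + np.exp(-logit))

# A user with average ability on two skills, answering a neutral item:
p = mirt_2pl_prob(theta=np.array([0.0, 0.0]),
                  alpha=np.array([1.2, 0.8]),
                  d=0.0)
# The logit is 0, so the probability of a correct answer is 0.5.
```

Higher ability on the skills an item discriminates on raises the logit and hence the probability of a correct response.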
2. Method
2.1. Data
Table 1 presents the settings for generating the training data and test data by a 2-parameter
MIRT model for the simple and complex structures based on R function “simdata” of R package
“mirt” [15] in R 4.3.2 [16]. The simulation features comprised the number of items, the number of latent skills, the sample size, the proportion of missing values in the response matrix, and the correlation between latent skills. These settings mimicked the possible features of online learning and assessments [17,18]. The settings for generating the training data were randomly selected from the designed range of each simulation feature, except for the number of latent skills. In detail, we randomly selected 20 values from the specified range for the number of items; for the sample size and missingness, we randomly selected 10 values each; and for the correlation, we randomly selected 5 values. The settings for generating the test data were based on fixed values, so that their effects on the methods’ performance could be examined. In total, there were 80,000 and 7,200 scenarios for the training and test data, respectively. Considering the constraints on computation power, we randomly selected 1,000 scenarios for each and generated one dataset per scenario as the basis for the subsequent preliminary analysis. The simulation code will be made available upon request from the corresponding author once the paper with the final results is published.
Table 1
Settings of Generating Simulation Data

| Features | Settings for Training Data | Settings for Test Data |
|----------|----------------------------|------------------------|
| The number of items | From 300 to 800 | 300, 400, 500, 600, 700, 800 |
| The number of latent skills | 1, 2, 3, 4, 5, 6, 7, 8 | 1, 2, 3, 4, 5, 6, 7, 8 |
| Sample size | From 300 to 800 | 300, 400, 500, 600, 700, 800 |
| Missingness (proportion) | From 0 to 0.9 | 0, 0.25, 0.5, 0.75, 0.9 |
| Correlation (latent skills) | From 0.1 to 0.5 | 0.1, 0.2, 0.3, 0.4, 0.5 |
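Data generation under the simple structure can be sketched as follows (a Python analogue of what the paper does with the R function “simdata”; the equicorrelated abilities, uniform discrimination range, and missing-completely-at-random mask are illustrative assumptions, not the paper’s exact settings):

```python
import numpy as np

def simulate_simple_structure(n_users, n_items, n_skills, rho, p_missing, seed=0):
    """Simulate dichotomous 2PL MIRT responses under the simple structure."""
    rng = np.random.default_rng(seed)
    # Abilities from a multivariate normal with equicorrelation rho.
    cov = np.full((n_skills, n_skills), rho) + (1 - rho) * np.eye(n_skills)
    theta = rng.multivariate_normal(np.zeros(n_skills), cov, size=n_users)
    # Simple structure: each item loads on exactly one latent skill.
    skill_of_item = rng.integers(0, n_skills, size=n_items)
    alpha = np.zeros((n_items, n_skills))
    alpha[np.arange(n_items), skill_of_item] = rng.uniform(0.5, 2.0, size=n_items)
    d = rng.normal(0.0, 1.0, size=n_items)
    # Bernoulli responses from the 2PL probabilities.
    logits = theta @ alpha.T + d
    prob = 1.0 / (1.0 + np.exp(-logits))
    x = (rng.random((n_users, n_items)) < prob).astype(float)
    # Missing-completely-at-random mask with proportion p_missing.
    x[rng.random((n_users, n_items)) < p_missing] = np.nan
    return x

X = simulate_simple_structure(300, 300, 3, rho=0.3, p_missing=0.25)
```

For the complex structure, the same sketch would instead give each item nonzero discriminations on more than one skill.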
2.2. Methods Implementation
All methods were implemented in R 4.3.2 [16]. The statistical methods were mainly implemented based on the tetrachoric correlation matrix corresponding to the dichotomous responses, computed by the R function “tetrachoric2” of the R package “sirt” [19] with the Bonett method [20]. The results of KC and EKC were estimated by manually implemented functions in R. PA and scree plot (OC & AF) were performed by the relevant functions in the R package “nFactors” [21], and VSS (C1 & C2) was implemented by the relevant functions in the R package “psych” [22].
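As an example of how one of these baselines works, the Kaiser Criterion simply counts the eigenvalues of the correlation matrix that exceed 1 (a minimal Python sketch on an ordinary correlation matrix; the paper applies it to the tetrachoric correlation matrix computed in R):

```python
import numpy as np

def kaiser_criterion(corr):
    """Estimate the number of factors as the count of eigenvalues > 1."""
    eigvals = np.linalg.eigvalsh(corr)  # eigenvalues of the symmetric matrix
    return int(np.sum(eigvals > 1.0))

# A 4-variable correlation matrix with two clear two-variable blocks:
corr = np.array([[1.0, 0.8, 0.0, 0.0],
                 [0.8, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.8],
                 [0.0, 0.0, 0.8, 1.0]])
# Eigenvalues are 1.8, 1.8, 0.2, 0.2, so the criterion suggests 2 factors.
```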
Table 2
Hyperparameter Consideration for ML Models

| Random Forest | Setting | XGBoost | Setting |
|---------------|---------|---------|---------|
| Number of trees | From 10 to 500 | Maximum depth of a tree | From 1 to 20 |
| Number of considered variables at each split | From 1 to all features | Minimum sum of instance weight (hessian) | From 1 to 10 |
| Minimum size of terminal nodes | From 1 to 10 | Fraction of features for each tree | From 0.5 to 1 |
| Maximum size of terminal nodes | From 5 to 50 | Fraction of samples for each tree | From 0.5 to 1 |
| Maximum number of iterations for tuning | 100 | Number of boosting rounds | From 30 to 100 |
| Loss function | Mean Squared Error | Learning rate | From 0.01 to 0.5 |
| | | Minimum loss reduction | From 0 to 10 |
| | | Loss function | Mean Squared Error |
The RF and XGBoost were implemented by relevant functions in R package “mlr” [23] and
“xgboost” [24]. Both ML models were trained and tested based on the features extracted from
available information, such as the original response matrix, the estimated tetrachoric
correlation matrix, and the estimated results of statistical methods. The features included [14]:
1) from the response matrix: the sample size, the number of items, and the proportion of
missingness; 2) from the correlation matrix: the determinant, the number of entries smaller or
equal to 0.1, the number of eigenvalues larger than 0.7, the relative proportion of eigenvalues,
the standard deviation of all eigenvalues, the number of eigenvalues accounting for over 50% or
75% of the variance, the matrix norms (i.e., the L1-norm, Frobenius-norm, maximum-norm, and
spectral-norm), the average of off-diagonal entries and the communality estimates, the
sampling adequacy [25], the Gini-coefficient [26], the Kolm inequality [27], and the top 50 eigenvalue estimates; 3) from the results of statistical methods: KC, PA, EKC, scree plot (OC), scree plot (AF), VSS (C1), and VSS (C2).
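A few of the correlation-matrix features above can be computed directly (a Python sketch; the helper name and the selected subset of features are illustrative, and the spectral norm equals the largest eigenvalue only because a correlation matrix is positive semi-definite):

```python
import numpy as np

def correlation_features(corr):
    """Compute a small subset of the eigenvalue- and norm-based features."""
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]  # descending
    cum = np.cumsum(eigvals) / eigvals.sum()
    off_diag = corr[~np.eye(len(corr), dtype=bool)]
    return {
        "determinant": float(np.linalg.det(corr)),
        "n_small_entries": int(np.sum(np.abs(corr) <= 0.1)),
        "n_eig_gt_0.7": int(np.sum(eigvals > 0.7)),
        "sd_eigvals": float(np.std(eigvals)),
        "n_eig_for_50pct": int(np.argmax(cum >= 0.5) + 1),
        "frobenius_norm": float(np.linalg.norm(corr, "fro")),
        "l1_norm": float(np.abs(corr).sum(axis=0).max()),
        "spectral_norm": float(eigvals[0]),  # PSD, so largest eigenvalue
        "mean_offdiag": float(off_diag.mean()),
    }

feats = correlation_features(np.eye(3))
```

In the study, vectors of such features (one row per simulated dataset) form the training input for RF and XGBoost.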
As the ML models can be trained by integrating the results of the statistical methods, which may raise a fairness concern regarding the method comparison, we trained RF and XGBoost in two ways: one without including the results of the statistical methods in the features, and another with them included. Additionally, all ML models were trained with 10-fold cross-validation on the training data. Table 2 provides the partial hyperparameter settings for RF and XGBoost with or without the extra features (i.e., the results of the statistical methods). The settings of other hyperparameters followed the defaults of the two R packages. The relevant code will be made available upon request from the corresponding author once the paper with the final results is published.
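The 10-fold cross-validation used during training can be sketched as a plain index split (a pure-Python illustration; in the paper this is handled by the mlr resampling utilities):

```python
import random

def k_fold_indices(n_samples, k=10, seed=42):
    """Split sample indices into k roughly equal (train, validation) folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # round-robin over the shuffled indices
    # Each fold serves once as the validation set; the rest is training data.
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]

splits = k_fold_indices(1000, k=10)
# Each validation fold holds 100 samples; together the folds cover all 1000.
```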
2.3. Evaluation Metrics
To evaluate and compare the performance of all candidate methods, the deviation score and
several metrics based on the deviation score were used. The deviation score is defined as the
estimated number of latent skills minus the true number of latent skills. The correct-estimation
proportion is the number of deviation scores equal to zero divided by the total number of
estimates (i.e., 1000). The under-estimation proportion is the number of deviation scores lower
than zero divided by the total number of estimates. The over-estimation proportion is the
number of deviation scores higher than zero divided by the total number of estimates. The bias
is the average of deviation scores. The precision is the average absolute deviation score.
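These definitions translate directly into code (a minimal Python sketch; the function and key names are illustrative):

```python
def deviation_metrics(estimated, true):
    """Compute the evaluation metrics from estimated vs. true skill counts."""
    dev = [e - t for e, t in zip(estimated, true)]  # deviation scores
    n = len(dev)
    return {
        "correct": sum(d == 0 for d in dev) / n,   # correct-estimation proportion
        "under": sum(d < 0 for d in dev) / n,      # under-estimation proportion
        "over": sum(d > 0 for d in dev) / n,       # over-estimation proportion
        "bias": sum(dev) / n,                      # average deviation score
        "precision": sum(abs(d) for d in dev) / n, # average absolute deviation
    }

m = deviation_metrics([3, 5, 2, 4], [3, 4, 3, 4])
# dev = [0, 1, -1, 0] -> correct 0.5, under 0.25, over 0.25, bias 0.0, precision 0.5
```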
3. Results
Table 3 shows the results of all selected methods on the test data. For the simple structure, KC, PA, EKC, and scree plot (OC) performed worse than the other methods: their correct-estimation proportions were nearly equal to zero. For scree plot (AF), VSS (C1), and VSS (C2), the correct-estimation proportions ranged from about 0.36 to 0.45, which was clearly better than the other statistical methods. Regarding the performance of the ML models, RF and XGBoost without extra features reached even higher correct-estimation proportions (more than 0.7) than the variants with extra features (less than 0.7). The correct-estimation proportions of all ML models were higher than those of the statistical methods, with a minimum difference of 0.1954. In terms of under- and over-estimation, KC, PA, EKC, and scree plot (OC) tended to overestimate the number of latent skills, which was further confirmed by the results for bias and precision. The ML models also tended to estimate a higher number of latent skills, although their over-estimation proportions were relatively lower. In contrast, scree plot (AF), VSS (C1), and VSS (C2) estimated a smaller number of latent skills than the true number.
For the complex structure, the general pattern was similar to that of the simple structure. KC, PA, and scree plot (OC) had the lowest proportions of correct estimations, again close to zero. EKC performed poorly as well, with a correct proportion of only around 0.1. The correct proportions of scree plot (AF), VSS (C1), and VSS (C2) ranged from 0.2490 to 0.3740, which was better than the other statistical methods. The proportions of correct estimation of the ML models were higher than 0.74, substantially better than the statistical methods. Regarding under- and over-estimation, KC, PA, scree plot (OC), and EKC overestimated the number of latent skills (their over-estimation proportions were above 0.9), while
scree plot (AF), VSS (C1), and VSS (C2) tended to estimate a smaller number of latent skills (their
under-estimation proportions ranging from around 0.4 to 0.5). ML models also estimated a
smaller number of latent skills, but their under-estimation proportions (around 0.14) were
noticeably lower than statistical methods. The patterns of under and over estimations were
further supported by the results of bias and precision.
Table 3
Results of Test Data

| Method | Correct-estimation Proportion | Under-estimation Proportion | Over-estimation Proportion | Bias | Precision |
|--------|-------------------------------|-----------------------------|----------------------------|------|-----------|
| Simple Structure | | | | | |
| KC | 0 | 0 | 1 | 168.0779 | 168.0779 |
| PA | 0.0065 | 0 | 0.9935 | 94.0455 | 94.0455 |
| EKC | 0.0195 | 0 | 0.9805 | 71.4870 | 71.4870 |
| Scree Plot (OC) | 0.0130 | 0.0065 | 0.9805 | 30.3312 | 30.3442 |
| Scree Plot (AF) | 0.3571 | 0.6169 | 0.0260 | -2.7273 | 2.7792 |
| VSS (C1) | 0.3636 | 0.4481 | 0.1883 | -1.6039 | 2.3571 |
| VSS (C2) | 0.4545 | 0.2792 | 0.2662 | -0.6818 | 1.3312 |
| RF | 0.7143 | 0.1364 | 0.1494 | 0.0519 | 0.4935 |
| RF (extra) | 0.6883 | 0.0519 | 0.2597 | 0.5260 | 0.6818 |
| XGBoost | 0.7078 | 0.1104 | 0.1818 | 0.1169 | 0.4545 |
| XGBoost (extra) | 0.6494 | 0.0909 | 0.2597 | 0.4935 | 0.7143 |
| Complex Structure | | | | | |
| KC | 0 | 0 | 1 | 160.7990 | 160.7990 |
| PA | 0.0550 | 0 | 0.9450 | 86.2640 | 86.2640 |
| EKC | 0.1010 | 0 | 0.8990 | 67.2240 | 67.2240 |
| Scree Plot (OC) | 0.0580 | 0.0020 | 0.9400 | 27.1280 | 27.1420 |
| Scree Plot (AF) | 0.3740 | 0.5800 | 0.0460 | -2.5580 | 2.6500 |
| VSS (C1) | 0.2490 | 0.6710 | 0.0800 | -2.5470 | 2.7810 |
| VSS (C2) | 0.3130 | 0.3930 | 0.2940 | -0.3890 | 1.7310 |
| RF | 0.7600 | 0.1280 | 0.1120 | -0.1920 | 0.5240 |
| RF (extra) | 0.7490 | 0.1380 | 0.1130 | -0.2280 | 0.5200 |
| XGBoost | 0.7940 | 0.1420 | 0.0640 | -0.3040 | 0.4840 |
| XGBoost (extra) | 0.7820 | 0.1490 | 0.0690 | -0.3210 | 0.5050 |
Figure 1 and Figure 2 present the effects of the simulation features on the correct-estimation proportions of the selected methods. As these proportions were extremely low for KC, PA, EKC, and scree plot (OC), these methods were omitted from the effects analysis. For the simple structure, when the percentage of missing values in the response matrix increased from 0 to 90%, the proportions of all methods decreased, especially those of the ML models (falling from above 0.8 to below 0.2). Raising the sample size from 300 to 800 generally increased the proportions of the ML methods by 0.2, while the effects of sample size on the statistical methods were not detectable due to fluctuations. Regarding the number of latent skills, changing the setting from 1 to 8 was related to a tremendous decrease of around 0.7 in the proportions of scree plot (AF) and VSS (C1). Regarding the number of items, when it rose from 400 to 600, the proportions of most methods went down by around 0.2.
Figure 1: Effects of Simulation Features (x-axis) on the Correct-estimation Proportions (y-axis)
for the Simple Structure
Figure 2: Effects of Simulation Features (x-axis) on the Correct-estimation Proportions (y-axis)
for the Complex Structure
Compared to the patterns for the simple structure, the proportions for the complex structure fluctuated less. When the missingness percentage went up from 0 to 90%, the proportions of the ML methods dropped from over 0.9 to below 0.3, while the proportions of the statistical methods went down relatively slightly, by around 0.2. Raising the sample size increased the proportions of the ML methods by around 0.2, while the proportions of the statistical methods fluctuated by a small amount. In terms of the number of latent skills, when it changed from 2 to 8, the proportions of the statistical methods fell sharply from over 0.6 to below 0.1; in contrast, the proportions of the ML models stayed almost the same. Regarding the number of items, the proportions of all methods fluctuated slightly, without noticeable changes across the different settings.
4. Discussion
In the present study, we proposed a general procedure for building ML prediction models on simulation data to estimate the number of latent skills in online learning environments, illustrated with the MIRT. The performance comparison revealed that the ML models performed markedly better than the statistical methods in terms of correct-estimation proportions. This finding is generally consistent with the previous study [14]. However, the correct-estimation proportions in that study were higher than 0.9, whereas they ranged from 0.65 to 0.8 in this study. One possible explanation for this difference lies in the different simulation models and scenarios. The previous study did not consider dichotomous responses generated by the MIRT, and its simulation settings reflected the features of relatively small-scale psychological tests rather than large-scale online learning settings. For example, the number of items is usually set below 100 in the field of psychology, while it may reach hundreds or even thousands in online learning environments. Additionally, the problem of missingness or sparsity is also less of a concern in previous research. Regarding the performance of the statistical methods, our results showed that they performed surprisingly worse than in previous studies. Goretzko & Bühner [14] found that KC, EKC, and PA reached correct-estimation proportions above 0.75, which is completely different from our results. Guo & Choi [28] found that the proportion of identifying the correct number of latent skills with tetrachoric-based PA ranged from 0.43 to 1 across various simulation features, which is also dissimilar to our results. We speculate that this is due to the different settings of the simulation features.
Beyond the method comparison, the effects analysis of the simulation features showed that increases in missingness and sample size led to downward and upward trends, respectively, in the correct-estimation proportions of most methods. In other words, raising the missingness and the sample size may have a negative and a positive impact on the methods’ performance, respectively. As mentioned above, missingness was not considered in the previous study, and our study fills this gap. As for the positive effect of sample size, our results further confirm the findings of the previous study: for example, the correct-estimation proportion of the ML models increased by 0.06 when the sample size rose from 250 to 1000 in the study of Goretzko & Bühner [14].
Overall, the results of this study imply that compared to statistical methods, using simulation
data generated by the analysis model (e.g., the MIRT) to train ML models and applying them to
do predictions can work relatively effectively for estimating the number of latent skills in online
learning environments. This kind of operation can be generalized to other kinds of analysis
models. For example, when practitioners believe that their real-world data fits the assumptions of the CDM, they can choose a suitable CDM to simulate data reflecting the data features of the expected scenarios and train ML models to predict the number of attributes in the Q-matrix. This operation can also be used for MF, in terms of predicting the number of ranks.
Several limitations of this study need to be acknowledged. First, the trained and tuned ML models were not tested on real data. The conclusions of a simulation study rely heavily on the data-generation model and the settings of the simulation features, so the findings should be further confirmed on real data. Second, due to the constraints of computational power, the present preliminary study only covered part of the simulation scenarios, and the number of simulated datasets was limited to one per scenario, which may make the conclusions less stable. Third, as mentioned above, the illustration was based on the MIRT, and whether the findings remain the same for the CDM or MF still needs to be tested.
5. Conclusion
In this study, we used the MIRT to generate simulation data reflecting the data features of target scenarios and extracted features from the simulation data to train and test two ML models (i.e., RF and XGBoost) for the simple and complex structures. These two ML models were compared with selected statistical methods on their performance in predicting the number of latent skills. The preliminary results show that the ML models (with or without including the results of the statistical methods during the training stage) generally outperform the statistical methods in terms of correct-estimation proportions. Additionally, regarding the effects of the simulation features, we find that raising the missingness level and the sample size leads to downward and upward trends, respectively, in the correct-estimation proportions of most methods. To conclude, our results imply that, compared to statistical methods, using simulation data generated by the selected analysis model to train ML models and then making predictions can improve the prediction of the number of latent skills and extend the current practice of users’ skill extraction.
Acknowledgements
This work was funded by Research Fund Flanders (FWO fellowship 1S38023N). We also
acknowledge the Flemish Government (AI Research Program).
References
[1] A. Gharahighehi, R. Van Schoors, P. Topali, J. Ooge, Adaptive Lifelong Learning (ALL), in:
International Conference on Artificial Intelligence in Education, Springer Nature
Switzerland, Cham, 2024: pp. 452–459.
[2] W. Bonifay, Multidimensional item response theory, Sage, 2020.
[3] M. von Davier, Y.S. Lee, Handbook of diagnostic classification models, Springer Publishing,
2019.
[4] M.C. Desmarais, Mapping question items to skills with non-negative matrix factorization,
ACM SIGKDD Explorations Newsletter 13 (2012) 30–36.
https://doi.org/10.1145/2207243.2207248.
[5] M.C. Desmarais, R. Naceur, A matrix factorization method for mapping items to skills and
for enhancing expert-based Q-matrices, in: H.C. Lane, K. Yacef, J. Mostow, P. Pavlik (Eds.),
Artificial Intelligence in Education, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013:
pp. 441–450. https://doi.org/10.1007/978-3-642-39112-5_45.
[6] H.F. Kaiser, The application of electronic computers to factor analysis, Educ. Psychol. Meas.
20 (1960) 141–151. https://doi.org/10.1177/001316446002000116.
[7] J. Braeken, M.A.L.M. Van Assen, An empirical Kaiser criterion., Psychol. Methods 22 (2017)
450–466. https://doi.org/10.1037/met0000074.
[8] J.L. Horn, A rationale and test for the number of factors in factor analysis, Psychometrika
30 (1965) 179–185. https://doi.org/10.1007/BF02289447.
[9] G. Raîche, T.A. Walls, D. Magis, M. Riopel, J.-G. Blais, Non-graphical solutions for Cattell’s scree test, Methodology 9 (2013) 23–29. https://doi.org/10.1027/1614-2241/a000051.
[10] W. Revelle, T. Rocklin, Very simple structure: an alternative procedure for estimating the
optimal number of interpretable factors, Multivariate Behavioral Research 14 (1979) 403–
414. https://doi.org/10.1207/s15327906mbr1404_2.
[11] J. De La Torre, C.-Y. Chiu, A general method of empirical Q-matrix validation,
Psychometrika 81 (2016) 253–273. https://doi.org/10.1007/s11336-015-9467-8.
[12] W. Ma, J. De La Torre, An empirical Q‐matrix validation method for the sequential
generalized DINA model, Br. J. Math. Stat. Psychol. 73 (2020) 142–163.
https://doi.org/10.1111/bmsp.12156.
[13] W.-S. Chin, Y. Zhuang, Y.-C. Juan, C.-J. Lin, A fast parallel stochastic gradient method for
matrix factorization in shared memory systems, ACM Trans. Intell. Syst. Technol. 6 (2015)
2:1-2:24. https://doi.org/10.1145/2668133.
[14] D. Goretzko, M. Bühner, One model to rule them all? Using machine learning algorithms
to determine the number of factors in exploratory factor analysis., Psychological Methods
25 (2020) 776–786. https://doi.org/10.1037/met0000262.
[15] R.P. Chalmers, mirt: a multidimensional item response theory package for the R
environment, Journal of Statistical Software 48 (2012). https://doi.org/10.18637/jss.v048.i06.
[16] R Core Team, R: A language and environment for statistical computing, (2024).
https://www.R-project.org/.
[17] Y. Liu, F. Robin, H. Yoo, V. Manna, Statistical Properties of the GRE ® Psychology Test
Subscores, ETS Research Report Series 2018 (2018) 1–13. https://doi.org/10.1002/ets2.12206.
[18] USMLE, 2024 USMLE bulletin of information, (2023).
https://www.usmle.org/sites/default/files/2023-08/2024bulletin.pdf.pdf (accessed March 23,
2024).
[19] A. Robitzsch, sirt: Supplementary item response theory models, (2024). https://CRAN.R-
project.org/package=sirt.
[20] D.G. Bonett, R.M. Price, Inferential methods for the tetrachoric correlation coefficient, J.
Educ. Behav. Stat. 30 (2005) 213–225. https://doi.org/10.3102/10769986030002213.
[21] G. Raiche, D. Magis, nFactors: Parallel analysis and other non-graphical solutions to the Cattell scree test, (2022). https://CRAN.R-project.org/package=nFactors.
[22] W. Revelle, psych: Procedures for psychological, psychometric, and personality research, (2024). https://CRAN.R-project.org/package=psych.
[23] B. Bischl, M. Lang, L. Kotthoff, J. Schiffner, J. Richter, E. Studerus, G. Casalicchio, Z.M. Jones,
mlr: Machine Learning in R, Journal of Machine Learning Research 17 (2016) 1–5.
[24] T. Chen, T. He, M. Benesty, V. Khotilovich, Y. Tang, H. Cho, K. Chen, R. Mitchell, I. Cano,
T. Zhou, M. Li, J. Xie, M. Lin, Y. Geng, Y. Li, J. Yuan, xgboost: Extreme gradient boosting,
(2024). https://CRAN.R-project.org/package=xgboost.
[25] H.F. Kaiser, A second generation little jiffy, Psychometrika 35 (1970) 401–415.
https://doi.org/10.1007/BF02291817.
[26] H. Dalton, The measurement of the inequality of incomes, Econ. J. 30 (1920) 348.
https://doi.org/10.2307/2223525.
[27] S.-C. Kolm, The rational foundations of income inequality measurement, in: Handbook of
Income Inequality Measurement, Springer, 1999: pp. 19–100.
[28] W. Guo, Y.-J. Choi, Assessing dimensionality of IRT models using traditional and revised
parallel analyses, Educ. Psychol. Meas. 83 (2023) 609–629.
https://doi.org/10.1177/00131644221111838.