Difficulty of Items – Predictions on Linguistic Features

Anna Winklerová
Masaryk University, Botanická 68a, 60200 Brno, Czechia

Abstract
To fulfill the adaptive and mastery learning parameters of an educational system (both learning and assessment), it is necessary to continuously develop and manage a large item pool containing thousands of items in a properly designed structure. Content management can be efficiently supported by augmented intelligence models that can deduce the behaviour of items in the system from linguistic features, independent of user data. This paper focuses on categorizing linguistic features for short L2 English multiple choice items, discusses approaches to feature selection aimed at feature interpretability and its consequences for model prediction, and demonstrates the practical application of prediction results for item management and further meaningful feature development.

Keywords
Item difficulty prediction, Second language acquisition, Natural language feature engineering, Interpretable features, Estimation of question statistics

EvalLAC'24: Workshop on Automatic Evaluation of Learning and Assessment Content, July 08, 2024, Recife, Brazil
anna.winklerova@mail.muni.cz (A. Winklerová)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org, ISSN 1613-0073)

1. Introduction

The overall accuracy of an adaptive educational (both learning and assessment) system depends on student–item interactions. Difficulty of an item is one of the most descriptive metrics of how an item behaves in the system, but the reasoning behind that behaviour is more complex and tricky. For instance, we can easily identify that an item behaves differently than expected by measuring the distance of the item's error rate from the mean error rate of the whole set of items; however, the behaviour cannot easily be explained without further complementary data. This applies to item complexity features that are free of student interaction [1], in other words, the textual and form-related content of items. Analysis of linguistic features characterizing item complexity, in combination with item difficulty, provides valuable insight into item behaviour in the system.

Features engineered from even such short digital entities as multiple choice gap-filling (MCQ) items in educational content, as summarized in recent research, number in the hundreds. Linguistic features free of user interaction data take a large share of them. They range from simple lexical and surface features, through syntactic and basic semantic features, to composite discourse and embedding features. The current acceleration in computational linguistics brings benefits in the form of improved methods and tools for feature engineering, but also challenges in dimensionality reduction and in applying machine learning models properly to reach practical goals such as difficulty prediction.

Our research is motivated mainly by two groups of users: content decision makers and model developers. These two distinct groups represent different views on item features, as each emphasizes different objectives. Decision makers are mostly domain specialists, and they need to comprehend the results of the predictions and the decisions underlying them. Model developers, on the other hand, aim to improve the precision of the predictions no matter how complex and inarticulate the models and features are. Based on the work of Alexandra Zytek et al.
[2], we examine the overlap of the interpretable feature space and the model-ready feature space to find a suitable set of feature properties for selecting relevant interpretable features in the difficulty prediction setting.

The aim of this paper is to (1) describe feature engineering methods in the context of item difficulty prediction, (2) report on ongoing research into automated methods for interpretable feature selection on a 230-feature set of real-life data from an educational system containing thousands of items for English L2 practice with thousands of student interactions, and (3) demonstrate the exploitation of augmented intelligence for decision making in item management. The difficulty estimation is implemented as a regression task utilizing a simple Random Forest (RF) ensemble algorithm and Gradient Boosting Trees (GBT) for comparison. The focus of this work is not on optimizing the ML algorithms, and cross-validation of hyperparameter settings was not performed. Rather, the ML algorithms are used in a static setting to obtain comparable results on feature engineering and selection.

2. Relevant research

Recent comprehensive state-of-the-art overviews on item difficulty prediction [3, 4] have demonstrated intensive ongoing research in various learning contexts. The predominant reason to predict the difficulty of an item is to accelerate the establishment of new items in educational systems, which could reduce the resource cost of item pool management and administration. Behind the pursuit of these clear quantitative goals, data quality benefits of item feature modeling also emerge. As mentioned in [5], data-driven insights, such as intensive item modeling with linguistic features, can (1) inform the overall structuring of content and, in particular, (2) help predict individual learners' difficulties and skills.

Distilling features as numerical representations of textual parameters of a natural language passage is increasingly influential in recent research and applications. Libraries for basic handcrafted feature engineering are being developed, such as LFTK [6], which contains 220 features covering lexical, semantic, syntactic and discourse features. The efforts to categorize and standardize text features are creditable: they reduce the heavy lifting of text preprocessing and, in our research context, enable specific features to be built up systematically. In addition, implementing standardized vector representations of items by linguistic features can contribute to sharing data between educational and assessment systems. Lack of data for methodology comparison is identified as one of the obstacles in item difficulty prediction research.

With the increasing number of item features and the aim of practical usage of the predictive models, the need for interpretable features is evident. Consistent with the recent work of Alexandra Zytek et al. [2], we define the key stakeholders in item pool management as decision makers and model developers. Decision makers use the model results to gain insight for creating new items, identifying items for revision and taking appropriate action such as modifying or removing dysfunctional items, inspecting the coherency of domain subsets (e.g., splitting topics, adding new topics), and analysing item functioning in order to gain insight with applicable actions [7]. These users need the model results to be understandable and consistent with their domain expertise.
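As a simple illustration of how a decision maker might act on such results, the following sketch flags items whose observed error rate deviates strongly from the difficulty predicted from linguistic features. The column names, the example values and the 0.2 threshold are illustrative assumptions, not part of the system described in this paper.

```python
import pandas as pd

# Illustrative item table: observed error rate and model-predicted difficulty.
# Column names, values and the threshold are hypothetical, for demonstration only.
items = pd.DataFrame({
    "item_text": ["My sisters ___ a beach house.", "We ___ a dog.", "___ in a city?"],
    "observed_error_rate": [0.60, 0.15, 0.05],
    "predicted_difficulty": [0.24, 0.18, 0.30],
})

# Items where observation and prediction diverge are candidates for revision:
# much harder than predicted may signal a structural gap in the resource set,
# much easier than predicted may signal an unintentional clue or a weak distractor.
items["residual"] = items["observed_error_rate"] - items["predicted_difficulty"]
flagged = items[items["residual"].abs() > 0.2]
print(flagged[["item_text", "observed_error_rate", "predicted_difficulty", "residual"]])
```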
Model developers use and fine-tune machine learning algorithms to improve model precision for a given task (such as difficulty prediction) on a particular dataset. Their motivation is therefore focused on collecting predictive, model-ready features that correlate with the target variable. Building models on interpretable features aims to build trust [8] in end users and opens the door to further development, including crowd feature engineering.

In compliance with [2], we require interpretable features to simultaneously have the following properties, or to have clearly defined transform functions from the interpretable to the model-ready feature space. Relevant interpretable features are:

• Understandable – related to a real-world metric, e.g. Age of acquisition stated in years rather than on a scale from 1 to 100,
• Readable – labeled in plain human language with an understandable meaning (for our purpose, readable is defined to the extent of human-worded features), e.g. Number of words in a sentence instead of Average number of tokens per sentence,
• Meaningful – with a clear relation to the target variable, comprehensible to decision makers.

From the model point of view, a feature must be predictive: it should improve model performance, have sufficient data coverage, explain data variance and be independent of other features.

The subset of interpretable features drawn from the set of 230 features needs to be justified on a computational basis. Based on the comprehensive review of dimensionality reduction techniques by R. Zebari et al. [9], we examine the two main groups of approaches to reducing the feature set: feature selection and feature extraction. Although we have experimented with feature extraction methods such as PCA on feature correlation clusters, obtaining significant computational time improvements while keeping the prediction accuracy, the main objective of our work lies in the interpretability of the features' significance for item modeling. Therefore, the main ongoing work builds on feature selection methods.

The studied feature selection mechanisms focus on maximizing relevant information while minimizing redundant information. Research in this area deals with the automated calculation of the minimal or optimal number of features that still cover the relevant variance of the data while decreasing bias or noise, suggesting approaches such as mutual information, variance inflation factors, clustering or correlation-based analysis [10, 11]. Although in many cases of feature selection the best performance is obtained by using all available features [12], improvement in item difficulty prediction by feature subset selection was demonstrated in [5]. Furthermore, models with a smaller interpretable set of features are more usable to decision makers and are open to further development by users who are not machine learning experts.

Figure 1: Distribution of item difficulty.
Figure 2: Distribution of the number of answers per item.

Table 1
Examples of items.
Question                           Correct answer          Distractor     Difficulty
How many days ___ after school?    have you been running   have you run   0.37
___ the morning                    In                      At             0.26
They hope ___ us next year.        to visit                visiting       0.21
Hey! Check this ___ !              out                     in             0.17
You should cut ___ on sugar.       back                    over           0.42

3. Framework implementation

Our work does not focus merely on item difficulty prediction, but on the wider context of augmented intelligence models supporting decisions in item pool management.
Secondly, we aim to give insights and recommendations to model developers concerning (1) which feature types are still meaningful to investigate further in the context of educational content and (2) which features are useful for a given ML application task. Therefore, the individual pipeline steps, starting with text preprocessing and ending with result evaluation, are covered in a framework. In this paper, we focus mainly on the implementation and evaluation of the feature engineering and selection methods.

3.1. Umíme dataset

The items in this dataset come from a private educational system for practicing grammar, vocabulary and use of English. It targets L2 learners of English from the first years of studying the language up to advanced high school learners. Over 5900 items are structured into 34 resource sets focusing on different concepts of the language. Difficulty of items is calculated as an error rate on a scale from 0 to 1.

3.2. Feature engineering

MCQ items are short, one to two sentence long passages of text. The lexical and semantic representation of these sentences hardly captures every aspect of what makes an item difficult. To distinguish MCQ-relevant features, we propose describing item characteristics as static or dynamic. Imagine the sentence The United States have around 330 million inhabitants. This sentence carries static features including POS tags, syntax tree parameters, surface features such as syllable counts, and various metrics for readability indexes, word frequencies or age of acquisition. Next, assume an MCQ item created from this base sentence. The item has one correct answer (stated first in the brackets) and one distractor. The item can be created in one of the following ways:

• The United States have around [330;660] million inhabitants.
• The United States [has;have] around 330 million inhabitants.
• The United States has around 330 million[s;_] inhabitants.

Although the static features of the above MCQ items are almost identical, the dynamics of student engagement with the individual items are essentially different. Features describing the dynamic item component can be derived from the item answers or from the grammar pattern of the underlying knowledge domain. These features try to capture the context-sensitive stimuli leading to a student action and resolving into item difficulty. Results of our experiments show that the development of dynamic item features contributes to difficulty prediction and item behaviour modeling in an educational system.

3.2.1. Feature set

After standard text processing steps comprising item purification, contraction expansion (i.e. it's – it is), tokenization, stopword filtering etc., we derived numerical representations of handcrafted features provided by the LFTK package for Python and handcrafted additional features describing mostly the dynamic item component. The LFTK package currently provides 220 features divided into overlapping sections of foundation (e.g. verb count) and derivation (e.g. average verb count per word/per sentence) features from diverse domains and families (syntactic, semantic, discourse or named entities). The short, mostly one sentence long items do not make use of all features from the LFTK package, as many of them are designed for longer passages of text (readability measures, counts of unique words per sentence). MCQ-specific features are not present in the package.
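As an illustration of this extraction step, the sketch below computes a few of the LFTK features that appear in our feature set for a single item sentence. The chosen feature keys are examples only, and the exact call sequence follows the LFTK documentation as we understand it, not the full framework pipeline.

```python
import spacy
import lftk  # pip install lftk

# Load a small English pipeline; LFTK operates on spaCy Doc objects.
nlp = spacy.load("en_core_web_sm")
doc = nlp("The United States has around 330 million inhabitants.")

# Extract a handful of LFTK features used in this work
# (total words, Flesch Reading Ease, average characters per word,
#  average verbs per word) -- the keys below are illustrative examples.
extractor = lftk.Extractor(docs=doc)
features = extractor.extract(features=["t_word", "fkre", "a_char_pw", "a_verb_pw"])
print(features)  # e.g. {"t_word": 9, "fkre": ..., ...}
```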
The 10 remaining features were derived from the textual parts of items using NLP libraries such as SpaCy and nltk for distance metrics (distractor similarity measures), and a CEFR level dictionary for vocabulary difficulty analysis. These features are described in Table 2. The resulting feature set contains 230 features: all LFTK features and the basic MCQ-specific features.

3.3. Dimensionality reduction

In order to achieve model interpretability and usability, we performed several feature reduction experiments based on different methods and underlying calculations. To lower the high number of dimensions and to reduce bias caused by collinearity of features, we have experimented with (1) hierarchical clustering based on feature correlations combined with Principal Component Analysis, as well as (2) various approaches to feature selection.

Table 2
List of MCQ-specific handcrafted features in model-ready description.
Handcrafted feature name             Description
Manhattan vector distance            Difference between the correct and wrong sentence, each represented by its LFTK feature vector.
Distractor similarity                Edit distance between the correct answer and the distractor.
Sentence similarity                  Edit distance between the whole correct sentence (correct answer inserted into the gap) and the wrong sentence.
Certainty                            Certainty (probability) of a BERT fill-in-the-mask pretrained model for the masked expression.
Gap position                         Normalized position of the gap in the sentence on a 0-1 scale, representing the beginning and end of the sentence respectively.
Q0 mean CEFR                         Mean CEFR label (A1-C2, on a scale 1-6) over all words.
Q0 max CEFR                          Maximum CEFR label (A1-C2, on a scale 1-6) over all words.
Distractor max CEFR                  Distractor maximum CEFR label (A1-C2, on a scale 1-6).
Correct max CEFR                     Correct answer maximum CEFR label (A1-C2, on a scale 1-6).
Average sentence parser tree depth   Sentence parse depth represented by the average number of children of parse tree nodes.

3.3.1. Feature extraction

Feature extraction methods such as PCA are straightforward and automatically applicable to any data type (given numerical representation and normalization), preserving data variation and improving ML tasks in some cases. Instead of applying PCA across the whole set of 230 features, we first clustered correlated features, as illustrated in Figure 3. The figure shows large groups of highly correlated linguistic features (e.g. absolute and average token lengths and counts). Large clusters of closely similar features (correlating above 0.8) indicate that these types of features are not interesting for further investigation, as their informational gain has already been exhausted. These features can be substituted in further computations by a representative selection or a linear combination (e.g. PCA). On the other hand, small clusters may contain features of significant importance for the ML task and represent promising areas for deeper feature analysis and engineering (e.g. distractor similarity). The size of the clusters is parametrized by the distance threshold, which directly influences the number and size of feature groups, the PCA explained variance ratio and the difficulty prediction accuracy. In this particular experiment, PCA was applied to features from clusters containing at least 5 features. Features in smaller clusters are used as individual independent variables in the training set.
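A minimal sketch of this reduction step is shown below: it clusters features by correlation distance and applies PCA within each sufficiently large cluster. The threshold values and the handling of small clusters are illustrative assumptions, not the exact parametrization used in the experiments.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.decomposition import PCA

def reduce_correlated_features(X: pd.DataFrame, distance_threshold: float = 0.2,
                               min_cluster_size: int = 5) -> pd.DataFrame:
    """Cluster features by correlation distance; replace large clusters by their first PC.

    Assumes X holds normalized numeric feature columns; thresholds are illustrative.
    """
    corr = X.corr().abs()
    distance = 1.0 - corr.values                          # pairwise feature distance
    condensed = distance[np.triu_indices_from(distance, k=1)]
    labels = fcluster(linkage(condensed, method="average"),
                      t=distance_threshold, criterion="distance")

    reduced = []
    for cluster_id in np.unique(labels):
        cols = X.columns[labels == cluster_id]
        if len(cols) >= min_cluster_size:
            # Highly correlated group: keep one principal component as a substitute.
            pc = PCA(n_components=1).fit_transform(X[cols])
            reduced.append(pd.Series(pc[:, 0], index=X.index,
                                     name=f"pc_cluster_{cluster_id}"))
        else:
            # Small clusters keep their original, interpretable features.
            reduced.extend(X[c] for c in cols)
    return pd.concat(reduced, axis=1)
```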
The extracted principal component features lose their interpretable qualities and cannot be used unless a transformation function is applied that translates the principal components or factors into an abstract concept. Such an abstract concept (labeled, for example, as textual complexity) must be understandable and meaningful to the decision maker.

Figure 3: Hierarchical clustering based on pairwise correlation.
Figure 4: Correlation cluster of features based on verbs and their correlation with difficulty.

3.3.2. Feature selection

To investigate the feature selection methods that best fit our dataset and ML task, we made use of the fine granularity of the dataset structure. The feature correlations with the target value, but also the collinearity among features, proved to be very diverse across the different item resource sets. Future work on the item pool and feature set will focus on finding general characteristics in resource sets with low prediction accuracy. The desired result is a predictive model that is as general as possible yet able to explain most of the variance in various data. Two approaches are described in detail in the following text: features selected based on model feature importances and interpretable features selected based on hierarchical clustering.

Features selected based on model feature importances. This approach combines the results of the random forest algorithm, the model feature importances and the structure of the data. The difficulty prediction was calculated separately on each resource set (5900 items in 34 resource sets). The importance of each of the top 20 features in each resource set, as calculated by the RF, was further multiplied by the Pearson correlation coefficient of the prediction on the whole resource set. From these cumulated importances, 10 features were selected and a new prediction was performed.

Interpretable features selected based on hierarchical clustering. This approach uses the results of the hierarchical clustering described in Section 3.3.1 (Figure 3) and selects (manually) representatives of the constructed clusters that are both predictive and interpretable. For instance, all features from the LFTK package based on verbs ended up in the "verb" cluster, as shown in Figure 4. From the features in each cluster, we selected the most interpretable feature while maximizing the predictive quality (correlation with the target value).
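A hedged sketch of this last step could rank the features of one cluster by their correlation with the observed difficulty; the final choice of the representative remains a manual, interpretability-driven decision. The example feature keys in the commented call are taken from the verb cluster in Figure 4.

```python
import pandas as pd

def rank_cluster_candidates(X: pd.DataFrame, difficulty: pd.Series,
                            cluster_features: list[str]) -> pd.Series:
    """Rank features of one correlation cluster by |Pearson r| with difficulty.

    The top-ranked features are candidate representatives; the one that is
    also readable and understandable is then picked manually.
    """
    corr_with_target = X[cluster_features].corrwith(difficulty).abs()
    return corr_with_target.sort_values(ascending=False)

# Example with the verb-related cluster from Figure 4 (LFTK feature keys):
# rank_cluster_candidates(X, difficulty,
#                         ["n_verb", "n_uverb", "a_verb_pw", "a_verb_ps",
#                          "simp_verb_var", "root_verb_var", "corr_verb_var"])
```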
The selected cluster representatives are usually translated to an understandable definition, and some were created as a combination of model-ready features. Table 5 shows the results of predictions on the structured data; it contains the worst and best predictions expressed with the Pearson correlation. The difficulty predictions vary greatly among different resource sets. The overall accuracy of predictions on all data combined is stated in Table 6.

Table 3
Features selected based on the cumulative RF feature importance across all RS.
Feature                                                               Source        Type
Manhattan vector distance                                             handcrafted   dynamic
Distractor similarity (correct answer, distractor string similarity)  handcrafted   dynamic
Kuperman AoA per word                                                 LFTK          static
SubtlexUS                                                             LFTK          static
BERT fill-in-the-mask pretrained certainty                            handcrafted   dynamic
Gap position                                                          handcrafted   dynamic
Question mean CEFR A1-C2 class                                        handcrafted   static
Sentence similarity (correct vs wrong sentence)                       handcrafted   dynamic
Average count of auxiliary words per word                             LFTK          static
Correct answer max CEFR A1-C2 class                                   handcrafted   static
Average punctuation per word                                          handcrafted   static
Average verb per word                                                 LFTK          static

Table 4
Interpretable features from feature clusters.
Feature representation      Underlying feature                                      Source
Item length                 Word count                                              LFTK
Gap position                Normalized gap position                                 handcrafted
Gap expectation             BERT fill-in-the-mask pretrained probability            handcrafted
Similarity of options       Edit distance correct vs wrong sentence                 handcrafted
Consistency of vocabulary   CEFR_max(Qc) - CEFR_mean(Qc)                            handcrafted
Readability                 Flesch Reading Ease                                     LFTK
Age of Acquisition          Average Kuperman age of acquisition of words per word   LFTK
Usual vocabulary            SubtlexUS word frequency                                LFTK
Complexity of sentence      Average sentence parser tree depth                      handcrafted

4. Applicable results for decision making

Table 7 shows practical steps in the evaluation of difficulty prediction results. The difficulty of the item "My sisters ___ a beach house." was predicted from linguistic features as 0.24, whereas the observed difficulty (error rate) is 0.6. With further investigation of the linguistic parameters of similar items within the same resource set (item modeling in context), Table 8 shows that items with a similar mask ("-s have") represent an outlier; the suggested action for the content developer is to add more similar items in order to balance the structure and content of the resource set. The overall number of items with a correct answer containing have got or has got in the whole resource set is 67, of which only two show a similar token linkage with a plural noun, My sisters and My cousins respectively. The token (grand)parents implies the plural form of the verb have got. Other items use different subject forms, such as the plural pronouns they, we, you, and multiple personal names Jane and Pete. This item could have been marked as dysfunctional simply by comparing its outlying difficulty with the mean difficulty of the resource set. However, the further analysis leading to content developer action is heavily dependent on rich item modeling (POS tags and syntactic marking).

Table 5
Comparison of predictions for different feature sets on structured data.
RS name                                  Rank   RF import. rP   Rank   RF230 rP   RF interpr. rP
Present perfect: simple vs. continuous   1      0.01            1      -0.01      0.08
Phrasal verbs                            2      0.03            11     0.28       0.27
Wh- question                             3      0.06            3      0.1        0.2
Some, any, no, every                     4      0.08            2      0.06       0.11
Possessive pronouns                      30     0.57            31     0.61       0.53
Present tense: negatives                 31     0.66            30     0.6        0.59
To do, to have, to be in past simple     32     0.68            32     0.69       0.61
Can vs. could                            33     0.75            33     0.69       0.65
To be in present simple                  34     0.75            34     0.75       0.73
Table 6
Comparison of predictions on different feature sets on all data (5900 items).
Prediction model                    Pearson RMSE
All features RF                     0.394
Top 10 RF importance features       0.347
Top 10 interpretable features RF    0.313

Table 7
Results on prediction outliers with the highest difference between observed and predicted difficulty (error rate). TER = true error rate, PER = predicted error rate.
Item text                            Correct option   Distractor   TER    PER
My sisters ___ a beach house.        have got         has got      0.6    0.24
My cousins ___ a farm in Oregon.     have got         has got      0.46   0.26
Tourists ___ their cameras ready.    have got         has got      0.19   0.23
My parents ___ two dogs and a cat.   have got         has got      0.17   0.19
My grandparents ___ a big garden.    have got         has got      0.08   0.10

The other side of the difficulty prediction error scale contains cases where the predicted values are significantly higher than the true values. This is a much more interesting result, considering the novel insight it gives into the data. The explanation is often an unintentional clue in the item discovered by the students, or a wrongly placed item (too easy in the context of other items, but not necessarily the easiest). This type of item deficiency can be corrected by the content developer, who should be able to find the hidden clue and rewrite the item, or place it in a different set of items. Table 9 shows an example of an item with the highest difference between predicted and true difficulty. The explanation lies in the markedly different options: the correct answer Do I live and the distractor Have I live are a combination that has not appeared in any other item, and the distractor holds no attraction for the student.

Table 8
Analysis of items with have/has got grammar focus within the resource set To do, to have, to be in present simple, with average difficulty 0.16.
Number of items   Form       Example items                         Error rate
37                have got   We have got a dog.                    0.15
                             They have got two daughters.          0.18
                             Ann and Bill have got a new house.    0.10
30                has got    My father has got two cars.           0.20
                             My dog has got long ears.             0.20
                             Ela has got long hair.                0.13

Table 9
Example of an item easier than predicted.
Item text                  Correct option   Distractor
___ in a city?             Do I live        Have I live
___ he have an apple?      Does             Do
___ she there?             Is               Are
___ they sisters?          Are              Is
___ he need our help?      Does             Do

5. Conclusion

Exploiting recent advancements in linguistic computation and natural language processing, this paper demonstrates a framework covering the steps of feature engineering, dimensionality reduction and practical application. The main contribution lies in the interpretability of the model through its features. We believe that models with understandable features are of the most use to decision makers. Furthermore, interpretability can lead to improvements in model precision, as more involved users who are not ML specialists can contribute to dynamic feature engineering. Difficulty prediction tasks give deeper insight into the anatomy of items in the context of student skills, which helps to manage the content of educational systems.
References

[1] R. Pelánek, T. Effenberger, J. Čechák, Complexity and difficulty of items in learning systems, International Journal of Artificial Intelligence in Education 32 (2022) 196–232.
[2] A. Zytek, I. Arnaldo, D. Liu, L. Berti-Equille, K. Veeramachaneni, The need for interpretable features: Motivation and taxonomy, ACM SIGKDD Explorations Newsletter 24 (2022) 1–13.
[3] S. AlKhuzaey, et al., Text-based question difficulty prediction: A systematic review of automatic approaches, 2023.
[4] L. Benedetto, et al., A survey on recent approaches to question difficulty estimation from text, 2023.
[5] I. Pandarova, T. Schmidt, J. Hartig, A. Boubekki, R. D. Jones, U. Brefeld, Predicting the difficulty of exercise items for dynamic difficulty adaptation in adaptive language tutoring, 2019.
[6] B. W. Lee, J. H.-J. Lee, LFTK: Handcrafted features in computational linguistics, 2023.
[7] R. Pelánek, T. Effenberger, A. Kukučka, et al., Towards design-loop adaptivity: Identifying items for revision, Journal of Educational Data Mining 14 (2022) 1–25.
[8] S. R. Hong, J. Hullman, E. Bertini, Human factors in model interpretability: Industry practices, challenges, and needs, arXiv preprint arXiv:2004.11440 (2020).
[9] R. Zebari, A. Abdulazeez, D. Zeebaree, D. Zebari, J. Saeed, A comprehensive review of dimensionality reduction techniques for feature selection and feature extraction, 2020.
[10] H. Zhou, X. Wang, R. Zhu, Feature selection based on mutual information with correlation coefficient, Applied Intelligence 52 (2022) 5457–5474.
[11] H. Liu, Z. Wu, X. Zhang, Feature selection based on data clustering, in: Intelligent Computing Theories and Methodologies: 11th International Conference, ICIC 2015, Fuzhou, China, August 20-23, 2015, Proceedings, Part I 11, Springer, 2015, pp. 227–236.
[12] M. A. Munson, R. Caruana, On feature selection, bias-variance, and bagging, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2009, pp. 144–159.

6. Online Resources

• The Umíme educational system: umimeto.org
• LFTK: lftk.readthedocs.io
• scikit-learn: scikit-learn.org
• BERT Fill-Mask pretrained model: huggingface.co/google-bert