Robustness Testing of Machine Learning Families using Instance-level IRT-Difficulty

Raül Fabra-Boluda(1), Cèsar Ferri(1), Fernando Martínez-Plumed(1) and M. José Ramírez-Quintana(1)

(1) Universitat Politècnica de València
rafabbo@dsic.upv.es (R. Fabra-Boluda); cferri@dsic.upv.es (C. Ferri); fmartinez@dsic.upv.es (F. Martínez-Plumed); mramirez@dsic.upv.es (M. J. Ramírez-Quintana)

EBeM'22: Workshop on AI Evaluation Beyond Metrics, July 25, 2022, Vienna, Austria

Abstract
The performance evaluation of Machine Learning systems has usually been limited to performance measures on curated and clean datasets, which may not properly reflect how robustly these systems can operate in real-world situations. One key element in this understanding of robustness is instance difficulty: the effect of instance difficulty on robustness can be understood as how unexpected it would be for a typical system to fail on a particular instance of a certain difficulty. In order to provide further understanding of this issue, we estimate IRT-based instance difficulty for an illustrative set of supervised tasks, and we implement and test perturbation methods that simulate noise and variability depending on the type of input data. With this, we evaluate the robustness of different families of machine learning models, which we select and characterise according to their behaviour. The preliminary results of this work in progress allow us to define a novel taxonomy based on the robustness of the different models and the difficulty of the instances addressed. This study is a significant step towards exposing the vulnerabilities of particular families of machine learning models.

Keywords: Machine Learning, Robustness, Robustness Taxonomy, IRT, Noise, Adversarial

1. Introduction

The success of AI, and especially Machine Learning (ML), has caused these types of systems to spread across many applications from different domains, e.g., medical, financial, social or autonomous transport, among others [1, 2, 3]. These applications form part of our daily life and shape our lifestyle: they recommend music to listen to or people to establish career or social relationships with; they diagnose our health and monitor our finances. Given this scenario, there is an obvious need for more robust ML systems.

Robustness is defined by the IEEE standard glossary of software engineering terminology [4] as: the degree to which a system or component can function correctly in the presence of invalid inputs or stressful environmental conditions. In the context of ML, robustness measures the resilience of a system towards perturbations in any of its components (the data, the learning program, or the framework) [5]. Earlier works [6] assessed model robustness by perturbing instances in the training set, the test set, or both. A general way to perturb instances is to add noise, a method that has been extensively applied in the adversarial ML field for the generation of adversarial examples [7]. This is why most of the research in ML robustness has focused on measuring the robustness of systems against adversarial samples [7, 8].

On the other hand, there is a wide range of factors that can affect the robustness of a model [9], and the difficulty of the instances (intrinsic or extrinsic) [10] is one of the most relevant [11]. Robustness must be based on knowing where and why the model fails, avoiding highly unexpected failures, and difficulty is one key element in this understanding. Therefore, the question we want to analyse is whether the performance of a particular model is equally distributed across difficulties. We may also analyse model robustness by perturbing data and locating those examples that are more likely to change their predicted label under a certain amount of noise, i.e., those instances for which the model is less robust and, hence, more prone to cause a vulnerability. We can expect that the more difficult an instance is, the more likely it is to change its predicted label under minor perturbations.

Estimating the difficulty of instances is another problem we need to address. We may simply calculate the average error of a set of systems for each instance as a proxy for difficulty [12]. However, the use of a population of systems entails some risks as well. For instance, if the population contains a non-conformant system (failing on the easy instances and succeeding on some of the hard ones), it may lead to very unstable difficulty metrics. A solution to this problem was introduced several decades ago and is known as item response theory (IRT) [13], where difficulty is inferred from a matrix of items (instances) and respondents (systems), giving more relevance to conformant systems. In addition, IRT gives a scaled metric of difficulty that follows a normal distribution and can be compared directly against the ability of a system.
In this work-in-progress paper we present, as a proof of concept, an evaluation setting to analyse the robustness of different ML models empirically, considering the difficulty of the instances. We also perform a hierarchical clustering to derive taxonomies of ML models according to their robustness. The setting is general, as we employ datasets from different domains, a wide set of representative ML techniques, and an instance perturbation function that introduces random noise with no specific goal. Therefore, it can be adapted to more specific problems by changing the datasets, models and perturbation function to the domain of interest. In this general evaluation framework, we measure the robustness of a model as the agreement, modulo instance difficulty, between the output of the model for the original and the perturbed test sets.

The paper is structured as follows. In Section 2 we review part of the literature related to the assessment of robustness in the presence of noise, the estimation of instance difficulty, and a taxonomy of machine learning techniques derived from a notion of behavioural similarity. Section 3 describes the method we developed to assess model robustness. We apply that method and describe the experiments in Section 4. Finally, Section 5 concludes the paper.

2. Background

In this section we revisit some key concepts related to model robustness, instance IRT-based difficulty, and the definition of behavioural taxonomies of machine learning techniques.

2.1. Robustness in a noisy framework

Robustness is one of the properties of ML systems that characterise their behaviour [5], being particularly suitable for checking whether the system behaves as expected under changes in the data. A common way to simulate those changes is to inject noise into the data, given that real-world data often contain some degree of noise [14].

In the literature, there is a large number of approaches for adding artificial noise to datasets [15, 16, 17]. Usually, noise is introduced by perturbing the values of the attribute(s) (attribute noise) or perturbing the class label (label noise). A general technique for the injection of attribute noise consists in perturbing the instance attribute value(s) following a well-known distribution (e.g., a Gaussian distribution) for numerical attributes, or randomly choosing a different value for categorical attributes [14, 18, 19]. This is the method usually used in adversarial ML, where adversarial examples are created by slightly modifying attribute values of examples correctly classified by the model, crafting new instances (ideally indistinguishable from the original ones) for which the output of the model changes. Attribute noise has also been used in different approaches to improve the robustness of models to adversarial examples. For instance, [20] shows that the injection of noise into the training dataset results in models that are more robust to attacks, since they are able to detect the perturbed instances beforehand. Similarly, adding adversarial instances to the training set can improve the robustness of neural nets [8] and make a Speech Emotion Recognition system more robust [21]. On the other hand, artificial label noise is useful to simulate wrongly annotated instances or other sources of data corruption. Label noise has been used, for instance, to evaluate robustness in computer vision applications [22]. Additionally, label noise in the training set can be employed to enhance the robustness of models, for instance, by reducing errors derived from overfitting [23].
Different ways of assessing the robustness of a model in noisy environments have been proposed. The most general method consists in measuring the correctness loss of models with noise in the data, with respect to the case without noise, regardless of where the noise is located (training set or test set). For noise in the training set, the metrics used to quantify the loss are standard classification metrics such as accuracy and F-measure [14], as well as the Equalised Loss of Accuracy (a metric for measuring a classifier's noise robustness [24]). For the sub-field of adversarial robustness (where noise is used in the test set to generate the adversarial examples), specific metrics such as adversarial accuracy have been used [8]. There are other general techniques for evaluating robustness, including mixed integer programming [25], abstract interpretation [26], and symbolic execution [27, 28, 29].

In this work, we are interested in studying how the robustness of a model is affected by the distortion produced by injecting different levels of noise into test instances, considering the difficulty of the perturbed instances. As far as we know, instance difficulty has not yet been taken into account to evaluate the robustness of models.

2.2. Instance difficulty

Difficult instances may cause problems during AI system development, especially for models that are trained. These instances (usually associated with noise, outliers or decision boundaries) have been blamed for overfitting, lack of convergence, or both. Handling these sorts of anomalies has been addressed in a number of different ways aimed at preventing overfitting. However, these approaches usually try to identify anomalies or mislabelled instances without defining what characterises them.
For instance, in [30] instances that are hard to classify are identified through instance hardness metrics. These metrics try to characterise the level of difficulty of each input sample following a (populational) empirical definition based on the classification behaviour over the instances to be evaluated. If we move out of the field of machine learning (e.g., to computer vision or NLP-related tasks), we find an area still to be explored, where the different works are limited to analysing global image properties (e.g., salience, memorability, photo quality, tone, colour, texture, etc.) [31, 32, 33] or, in the case of NLP, are based on lexical readability and richness [34, 35].

All the approaches above are specific to a domain and, in many cases, also anthropocentric. A completely different approach is Item Response Theory (IRT), a well-developed subdiscipline in psychometrics [36], only recently brought to AI and machine learning [37, 12, 38, 39, 40]. In IRT, the probability of a correct response to an item is a function of the respondent's ability and some item parameters. The respondent solves the problem, and the item is the problem instance itself. We focus on dichotomous models, where the response can be either correct or incorrect. Let U_ji be the binary response of a respondent j to item i, with U_ji = 1 for a correct response and U_ji = 0 otherwise. For the basic one-parameter logistic (1PL) IRT model (where the discrimination a_i is fixed; when a_i is also estimated per item, this becomes the 2PL model shown in Fig. 1), the probability of a correct response given the examinee's ability is modelled as a logistic function:

    P(U_ji = 1 | θ_j) = 1 / (1 + exp(−a_i (θ_j − b_i)))    (1)

The parameter θ_j is the ability or proficiency of j, and b_i is the difficulty of i. If the ability θ_j equals the item difficulty b_i, then there are even odds of a correct answer (the curve is cut at exactly 0.5, as the light blue dashed line shows in Fig. 1). For each item, the above model provides an Item Characteristic Curve (ICC) (see Fig. 1, left), characterised by the difficulty b_i, which is the location parameter of the logistic function.

Figure 1: Left: Example of a 2PL IRT ICC curve, with slope a = 2 in red and location parameter b = 3 in blue. Right: Example of SCC curves for systems with different abilities (2, 3 and 4).
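To make the shape of Eq. (1) concrete, the following minimal R sketch computes and plots an ICC (the slope and location values are the illustrative ones from the caption of Fig. 1; the function name is ours):

    # Item characteristic curve (ICC) from Eq. (1): probability of a correct
    # response given ability theta, discrimination a and difficulty b.
    icc <- function(theta, a, b) {
      1 / (1 + exp(-a * (theta - b)))
    }

    theta <- seq(0, 6, by = 0.01)    # grid of ability values
    p <- icc(theta, a = 2, b = 3)    # slope a = 2, location b = 3 (Fig. 1, left)
    plot(theta, p, type = "l", xlab = "Ability", ylab = "Probability Correct")
    abline(h = 0.5, lty = 2)         # even odds exactly when theta = b

Note that icc(3, 2, 3) returns 0.5, matching the even-odds reading of the curve when ability equals difficulty.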
As all IRT models assume a single parameter for the respondent, their dual plots (known originally as person characteristic curves, here renamed system characteristic curves (SCC)) also follow a logistic function (see Fig. 1, right). Respondents who tend to correctly answer the most difficult items will be assigned high values of ability. Difficult items, in turn, are those correctly answered only by the most proficient respondents. From this understanding and some common assumptions (ability and difficulty following particular normal distributions), the latent variables can be inferred from a table of item-respondent pairs U_ji. Two-step iterative variants of maximum-likelihood estimation (MLE), such as Birnbaum's method [41], can be used to infer all the IRT parameters.

IRT difficulty is characterised by being system-independent and domain-generic, unlike the other metrics described above [11]. It also has some advantages over using average performance as a metric of difficulty, in terms of distribution, stability and predictability, as has been studied in the IRT literature.

2.3. Behaviour-based Machine Learning families

One classical way of characterising the rich range of machine learning techniques is by defining 'families' according to their formulation and learning strategy (e.g., neural networks, Bayesian methods, etc.) [43, 44, 45]. However, this taxonomy of learning techniques does not take into account the intrinsic behaviour of the models (measured as output agreement), especially considering predictions in sparse zones where insufficient training data was available. If we want to characterise the robustness of ML models, we need to analyse a diverse set of models, as many as possible, under different parameters as well. In this regard, [42] derived a taxonomy of ML techniques for classification, where families are clustered according to their degree of (dis)agreement in behaviour, i.e., the differences between models in how they distribute the output class labels along the feature space, considering both dense and sparse zones (where training data is scarce or inexistent) and using Cohen's kappa statistic [46]. While in dense areas differences between models may be difficult to find, in sparse areas the algorithms diverge significantly and unveil the characteristic behaviour of the models trained with those techniques.

The methodology was based on comparing the behaviour of 65 different learning models (including hyperparameter variations), performing a pairwise comparison (based on kappa) and averaging the results obtained over 75 datasets. For grouping into families, the authors applied a hierarchical clustering so that models presenting similar behaviour fell into the same cluster, which is considered a model family (see the 18 different families obtained in Figure 2). This method is useful to objectively quantify how different two models (or model families) are.

Figure 2: Dendrogram representing the hierarchical clustering (18 groups) of ML models. From [42].

3. Empirical Methodology

In this section we describe the experimental methodology performed to obtain a taxonomy of classification algorithms according to their robustness. We start by introducing the set of representative datasets and learning models we have employed. Then, we describe how we estimate instance difficulty, the approach followed to introduce noise into the data and, finally, how we define the taxonomy of ML families. All the data, code, complete experiments, plots and results can be found at https://github.com/rfabra/family-robustness.

3.1. Data and Classifiers

In order to estimate IRT difficulty, we need to find benchmarks that have instance-wise results for a good number of models. It is recommended to have at least 10-20 responses per item [47]. More importantly, we need the instance-wise results, i.e., a |J| × |I| matrix with the performance of each system j ∈ J for each instance i ∈ I. Finding experiments not reported in an aggregated way was not an easy task. As an exception to the instance-wise result problem, we find platforms such as OpenML [48], a repository in which AI researchers and practitioners can share datasets and results in as much detail as possible. The platform also provides several curated suites such as OpenML-CC18, from which we address a set of 3 benchmarks for supervised learning (see Table 1). The selection is guided by their illustrative character (with the preliminary objective of testing the effectiveness of our setting), but is also limited to those benchmarks with a sufficiently large number |I| of examples (instances) and |J| of models (respondents).

Table 1: List of datasets for the experiments.

    Dataset               | # Instances | # Features | # Classes
    letter                | 20000       | 16         | 26
    optdigits             | 5620        | 64         | 10
    wall-robot-navigation | 5456        | 4          | 4

Regarding ML models, we employed a set of 18 ML models from different ML families (see Table 2), derived in [42]. These 18 model families were obtained from a pool of 65 models learned and evaluated on a wide range of datasets and categorised into different families, as described in Section 2.3. For each family we selected a single model, its centroid (i.e., the model representing the centre of each family cluster), assuming it to be representative of its family. Thus, we can assume that the 18 selected models are diverse enough to provide a wide view of how different model families behave in terms of robustness.

Table 2: List of the 18 models employed for the experiments, along with the parameters used.

    Technique                         | Parameters                          | id
    C5.0                              |                                     | C5.0
    Cond. Inf. Tree                   | mincriterion = 0.05                 | CI_T
    Flex. Disc. Analysis              | degree = 1, nprune = 17             | FDA
    Stoch. Grad. Boosting Machine     | interaction.depth = 2, n.trees = 50 | GBM
    JRip                              |                                     | JRip
    K-Nearest Neighbour               | K = 3                               | 3NN
    Learning Vector Quant.            | size = 50, K = 3                    | LVQ
    MultiLayer Perceptron             | 1 hidden layer, 7 neurons           | MLP
    Multinomial Log. Regression       |                                     | MLR
    Naive Bayes                       |                                     | NB
    PART                              |                                     | PART
    Radial Basis Function Network     |                                     | RBF
    Regularised Discriminant Analysis |                                     | RDA
    Random Forest                     | mtry = 64                           | RF
    RPART                             |                                     | RPART
    Part. Least Squares               | ncomp = 3                           | PLS
    SVM                               | Poly, degree = 2                    | SVM
    RFRules                           | mtry = 64                           | RFRules
3.2. Estimation of Difficulty

As mentioned in Section 3.1, in order to estimate the difficulty of the instances, we first check that each benchmark selected from OpenML has at least 10-20 responses (model evaluations) per item/feature (e.g., we would need between 640 and 1280 responses for optdigits) and that these are sufficiently diverse (different architectures or technologies). Next, we obtain the models' responses for unseen instances (we use the test folds, so it is actually test performance, even if we cover the whole dataset). This will be our |J| × |I| matrix U with all binary responses U_ji.

We follow the recommendations from [12] for the application of IRT. In practice, for generating the IRT models, we used the mirt R package [49], using Birnbaum's method, as explained above. The mirt package (like many other IRT libraries) outputs indicators about goodness of fit, which can be used to quantify the discrepancy between the values observed in the data (items) and the values expected under the statistical IRT model. Item-fit statistics may be used to test the hypothesis of whether the fitted model could truly be the data-generating model or whether, conversely, we should expect the item parameter estimates to be biased. In practice, an IRT model may be rejected on the basis of bad item-fit statistics, as we would not be reasonably confident about the validity of the inferences drawn from it [50]. In the present case, none of the estimated models were discarded because of bad item-fit statistics or inconsistency in their results.
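As an illustration of this step, the sketch below fits an IRT model with mirt and extracts the instance difficulties. It assumes U is the |J| × |I| binary response matrix described above; we use itemtype = "2PL" to match the slope parameter a_i in Eq. (1) (itemtype = "Rasch" would give the plain 1PL fit), and mirt's default EM estimation rather than Birnbaum's original two-step procedure:

    library(mirt)

    # U: |J| x |I| binary response matrix (respondents/models in rows,
    # items/instances in columns), U[j, i] = 1 iff model j got instance i right.
    fit <- mirt(data = as.data.frame(U), model = 1, itemtype = "2PL")

    # Item parameters in the classical IRT parameterisation:
    # column 'a' is the discrimination a_i, column 'b' the difficulty b_i.
    items <- coef(fit, IRTpars = TRUE, simplify = TRUE)$items
    difficulty <- items[, "b"]

    # Item-fit statistics, used to check the fitted model (Section 3.2).
    fit_stats <- itemfit(fit)

    # Discard outlying difficulty estimates, as done in Section 4.2.
    difficulty <- difficulty[difficulty >= -6 & difficulty <= 6]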
3.2.1. System Characteristic Curves

One of the most powerful visualisation tools that derives from difficulty is what we call system characteristic curves (SCC) (Fig. 1, right). Inspired by the concept of person characteristic curve previously developed in IRT, an SCC is a plot of the response probability (e.g., accuracy, kappa, etc.) of a particular classifier as a function of instance difficulty. To produce an SCC, we divide the instances into bins according to difficulty. For each bin, we plot on the x-axis the average difficulty of the instances in the bin, and on the y-axis the selected performance metric.

3.3. Introduction of Noise

We need a method to generate noise that is representative and general enough, so that the experimental results can be adapted to other noise settings, e.g., to include adversarial attacks. Hence, we will work directly with noise levels, assuming that they are mapped from contexts. Noise is generated randomly using well-known probability distributions, following a procedure similar to [16]. Instances are perturbed by changing their attribute values within a range of possible values. The process to select among the possible values depends on whether the attribute is numerical or nominal (a sketch of both rules follows this list):

• Numerical attributes: Let ν be the level of noise to be injected into a numerical attribute at, and σ the standard deviation of all values of at. Then, a value x in at is modified as x′ ∼ N(x, σ⋅ν), i.e., we draw from a normal distribution using x as mean and σ multiplied by the noise level ν as standard deviation.

• Nominal attributes: Let {at_1, ..., at_m} be the set of the m possible values of a nominal attribute at, and p the vector representing the empirical distribution of at, that is, p = (p_{at_1}, ..., p_{at_m}), where p_i is the frequency of value i. Given an instance with value x = at_j in at, we represent it as the vector t = (t_{at_1}, ..., t_{at_m}) with t_{at_i} = 0 ∀i ∈ {1..m}, i ≠ j, and t_{at_j} = 1. To insert a noise level ν, we calculate α = 1 − e^{−ν} and then compute a new vector of probabilities p′ = α⋅p + (1−α)⋅t. Finally, we use p′ to sample the new value x′ of the attribute.
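A minimal R sketch of this perturbation function is given below. It is our own implementation of the two rules above; the function names and the "Class" column are illustrative assumptions, and the sampling of the δ proportion of instances per difficulty bin is omitted for brevity:

    # Perturb one attribute column 'x' with noise level 'nu'.
    perturb_attribute <- function(x, nu) {
      if (is.numeric(x)) {
        # Numerical: x' ~ N(x, sigma * nu), sigma being the sd of the attribute.
        rnorm(length(x), mean = x, sd = sd(x) * nu)
      } else {
        # Nominal: p' = alpha * p + (1 - alpha) * t, with alpha = 1 - exp(-nu),
        # p the empirical distribution of the attribute and t the one-hot
        # vector of the current value; the new value is sampled from p'.
        x <- as.factor(x)
        vals <- levels(x)
        p <- as.numeric(table(x)) / length(x)
        alpha <- 1 - exp(-nu)
        vapply(as.character(x), function(xi) {
          t_vec <- as.numeric(vals == xi)
          sample(vals, size = 1, prob = alpha * p + (1 - alpha) * t_vec)
        }, character(1))
      }
    }

    # Perturbation function phi: perturb every attribute of a test set,
    # assuming the class label is stored in a column named "Class".
    phi <- function(T_set, nu = 0.2) {
      attrs <- setdiff(names(T_set), "Class")
      T_set[attrs] <- lapply(T_set[attrs], perturb_attribute, nu = nu)
      T_set
    }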
For the experiments, we generate noisy datasets (test sets) using a noise level ν = 0.2. We vary the proportion of perturbed instances δ in each bin, from δ = 0 (keeping the original test set unperturbed) to δ = 1 (perturbing the whole test set). This is performed under a 5-fold cross-validation setting. For each model, we compare its predictions on the original test set with the predictions on each of the noisy test sets by means of the kappa metric, as we describe below.

3.4. Model robustness to noise and difficulty

We compare the behaviour of ML models from different families by classifying the same test set from a particular benchmark, to which we introduce different levels of noise. The more the behaviour of a model changes under noise, the less robust it is. This difference in behaviour can be measured with Cohen's kappa metric [46]. More concretely, given the domain 𝕋 of all datasets we can create from all possible inputs, a test set T ∈ 𝕋, a perturbation function φ : 𝕋 → 𝕋 that introduces noise into a dataset, the perturbed test set T′ = φ(T), the predictions of a model M for the original test set y_M = M(T), the predictions of M for the perturbed test set y′_M = M(T′), and two models M1 and M2 learned on the same data, model M1 is considered more robust than model M2 if

    κ(y_M1, y′_M1) > κ(y_M2, y′_M2)

Thus, we employ kappa as a measure of similarity between the predictions of a model on the original and the perturbed test sets. It is important to notice that we are not accounting for the real class label, since adding noise to the input attributes of an instance implies that the actual class is probably no longer the same as it was originally. Instead, we compare the labels the model predicts for the original test set (without noise) with the ones predicted for the noisy test sets (see the sketch at the end of this section). Our goal is not to determine how well a model solves a task, but to assess how the behaviour of the model changes under different levels of noise applied to instances of different levels of difficulty. As we want to analyse whether model robustness may vary depending on the difficulty of the instances addressed, we estimated the difficulty of each instance in the dataset following the procedure described above. Later, we grouped instances into difficulty bins to analyse robustness (to produce SCCs), as explained above.

Analysing the data from the SCCs for different models, we also derive an ML robustness taxonomy attending to the different shapes of the SCCs and the models' behaviour. In this regard, for each dataset we built a matrix where each row represents a model and each column represents a combination of difficulty bin and proportion of noisy instances per bin. Each element represents the similarity (i.e., the kappa metric) between the predictions of the model for the original test set and the predictions for each noisy test set and bin. By averaging these across all the datasets, we may perform a hierarchical clustering with the aim of obtaining different groupings of models by robustness, showing the similarity between different ML families in a data-driven fashion.
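The per-bin comparison behind both the robustness criterion and the SCCs can be sketched in a few lines of R. This is a hypothetical illustration reusing phi from the previous sketch; M, T_test and difficulty are assumed names, and cohens_kappa is our own small implementation (packages such as caret also report kappa):

    # Cohen's kappa between two label vectors.
    cohens_kappa <- function(y1, y2) {
      labs <- union(y1, y2)
      cm <- table(factor(y1, levels = labs), factor(y2, levels = labs))
      n <- sum(cm)
      po <- sum(diag(cm)) / n                      # observed agreement
      pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # chance agreement
      (po - pe) / (1 - pe)
    }

    # Robustness of a model M: agreement between its predictions on the
    # original test set T and on the perturbed test set T' = phi(T).
    y  <- predict(M, T_test)          # y_M
    yp <- predict(M, phi(T_test))     # y'_M

    # One SCC point per difficulty bin: mean difficulty on the x-axis,
    # kappa between original and noisy predictions on the y-axis.
    bins <- cut(difficulty, breaks = 5)
    scc <- data.frame(
      difficulty = tapply(difficulty, bins, mean),
      kappa = tapply(seq_along(y), bins,
                     function(i) cohens_kappa(y[i], yp[i]))
    )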
3.5. Experimental questions

Once the experimental methodology is clear, we want to investigate the relationship between the robustness of the models and the difficulty of the instances, the latter having been altered with different levels of noise. For this, we set three experimental questions. Q1: How are difficulties distributed per benchmark for the estimated IRT-difficulty metric? Q2: Can we see differences in robustness for different models based on the difficulty of the instances? Q3: Can we group models by robustness?

4. Experiments

4.1. Setup

We employed the R language with the caret package [51] to carry out our experiments, i.e., the training and evaluation of the models. All the models were learnt from scratch, so we did not use any pre-trained model. We used the mirt R package [49] for estimating IRT 1PL models. To feed the IRT method, we obtained the predictions from a wide variety of models by using the OpenML API [52]. In total, we employed the predictions of (up to) 2000 evaluations per dataset.

4.2. Results

IRT difficulties are built to approximately follow a normal distribution with standard deviation 1 but different locations depending on the dataset. When it comes to the item difficulty parameters, what is acceptable depends very much on the purpose of the test and the population of interest. For instance, values around 1 are typical in educational measurement. In health measurement, however, these values are usually much higher, around 4. In our case, when addressing ML benchmarks, difficulty values between roughly −3 and 3 are the norm (see [12]). For this reason, we decided to remove those instances whose difficulty is outside the range [−6, 6], which are considered outliers. This happened in all benchmarks for very easy instances on which all techniques are correct, never affecting more than 0.1% of the instances.

Figure 3 shows the IRT-difficulty distribution per benchmark, with a standard deviation around 1 (as expected). In terms of location (Q1), the letter benchmark contains more difficult instances (mean difficulty of −1.50 ± 0.92) than the others (−1.92 ± 0.67 for optdigits and −2.36 ± 0.9 for wall-robot-navigation). Although the distributions are generally normal, the wall-robot-navigation dataset presents a higher number of difficult instances, skewing the distribution to the right. This may be due to the diversity of the population of systems used for the difficulty estimation (similar cases can be observed in [11]).

Figure 3: IRT-difficulty distribution per dataset. Benchmarks sorted by average difficulty.

Regarding Q2, for each technique in Table 2, we compare its predictions on the original test set (for each dataset in Table 1) with the predictions on each of the noisy test sets by means of the kappa metric. The SCCs produced (using kappa values on the y-axis) are shown in Figure 4. Obviously, kappa takes values equal to 1 when the test set is not perturbed (δ = 0), since we are comparing the output labels of each trained model with themselves. As we increase the amount of perturbed instances (the same proportion for each difficulty bin), we can appreciate differences in the behaviour of the techniques analysed.

As expected, the most difficult instances are those that are more sensitive to noise, and this can be seen in the level of performance of the different techniques for the most difficult instance bins. This behaviour may indicate that these instances are located close to the decision boundary or in regions with class overlap, so the behaviour of most techniques is more unpredictable in those regions than in easier ones. In general, we may find some patterns of behaviour for different sets of techniques. First, we identify cases where robustness decreases non-linearly with increasing levels of difficulty. This is the most common case, but with differences in robustness variations for different models and datasets (see, e.g., CI_T, FDA, 3NN, MLP, MLR or SVM).
Second, we also see cases where robustness is mostly affected by the noise level and less by the difficulty of the instances (see, e.g., NB, RBF or RDA). Finally, there are cases in which robustness is barely altered by either difficulty or noise level (see, e.g., PLS, PART or LVQ).

Figure 4: Kappa vs difficulty for different models and benchmarks, varying the proportion of perturbed instances (δ ∈ {0, 0.2, 0.4, 0.6, 0.8, 1}) for each difficulty bin.

On the other hand, if we analyse the results at the dataset level, we see that the behaviour of some techniques changes significantly. For instance, techniques such as C5.0, CI_T and JRip exhibit an interesting behaviour on the optdigits dataset: they seem more prone to change their predictions on easy instances than on medium (or even hard) ones. Analysing the results in more detail, we have seen that this is due to the class distributions in the easier bins. These bins are usually composed of many instances of a single class (usually the majority class), but these instances may be misclassified as we increase the amount of noise, thus reflecting a drop in the kappa value. This is the case, for instance, with the JRip model learnt on the optdigits dataset. The first bin is composed of 479 instances of class "6" without introducing any noise (δ = 0). After perturbing all the instances (δ = 1), only 256 instances in this bin are predicted as class "6", which explains the observed drop in kappa. If we focus on the last bin (the most difficult) for this same model and dataset, we can see that it presents a similar behaviour to the first bin. However, this phenomenon happens for a different reason: the model predicts 160 instances of class "1" for δ = 0, whereas for δ = 1 the number of instances predicted for this class increases up to 244, i.e., this bin tends to absorb the predictions of class "1" the more noise is introduced. Both cases may constitute a robustness flaw for a particular model.

Finally, for Q3, we derive a taxonomy that groups techniques with similar robustness behaviour considering difficulty. To measure the dissimilarity between sets of observations, we employed the kappa metric computed for each model, aggregated across all datasets, difficulty bins, and proportions of perturbed instances δ. We performed an agglomerative hierarchical clustering, employing the Euclidean distance and the complete linkage method as the linkage criterion, as sketched below.
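The clustering step itself is standard; a sketch in base R, under the assumption that R_mat is the model × (difficulty bin, δ) kappa matrix from Section 3.4, already averaged across datasets and with model ids as row names:

    # R_mat: one row per model (18 rows), one column per combination of
    # difficulty bin and proportion of perturbed instances delta; each entry
    # is the kappa between original and noisy predictions, averaged over datasets.
    d <- dist(R_mat, method = "euclidean")
    taxonomy <- hclust(d, method = "complete")  # agglomerative, complete linkage
    plot(taxonomy)                              # dendrogram, as in Figure 5
    clusters <- cutree(taxonomy, k = 3)         # the three main clusters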
The result of applying the hierarchical clustering is shown in Figure 5. We found three main clusters. The first cluster shows that CI_T, JRip and C5.0 have very similar behaviour, joining with NB. The second cluster is composed of the models GBM, RF, MLR, MLP and FDA, joining with RPART and RFRules at a greater height. The last cluster shows two subgroups: the first consists of PART, PLS and LVQ; the second contains RDA, RBF, 3NN and SVM. These results show that models from different ML families may present similar robustness (e.g., the models JRip and CI_T), despite coming from very different techniques.

Figure 5: Robustness-based taxonomy for different ML families.

Overall, we have shown that estimating difficulty for analysing robustness may be very useful and insightful. We would need to inspect the test SCCs as an exercise before selecting and deploying models in real-world situations. SCCs can thus be used to select the (set of) best classifier(s) according to their robustness for different difficulty ranges. Since we may not know the difficulty values of unseen examples in a test/validation set, we may estimate them in different (and straightforward) ways, such as by averaging the difficulty values of the most similar examples in the original set [12] or by training a difficulty estimator [11]. We could even do this with small sets or even for single instances, always running the difficulty estimator to determine which model to use. If we can predict the difficulty of instances, we could set a threshold to use the system only for the easy instances for which it is robust.

5. Conclusions and Future Work

In this work we propose an evaluation setting to analyse the robustness of different ML models, from different ML families, when addressing noisy instances, attending to their difficulty. Furthermore, we established an ML model taxonomy based on robustness. Our results show that there are models affected by noise, by instance difficulty, or by both. Some models are more prone to change their prediction when adding noise to the most difficult instances, while other models behave similarly on easy instances. This might be caused by the concentration of certain instances of a predicted class in easy bins, which are misclassified after introducing noise. On the other hand, harder bins may absorb certain classes after introducing noise. Given this variety in model behaviour, we derived a model robustness taxonomy by performing a hierarchical clustering to group models that behave similarly. Our results show that there are three major clusters and that, within each cluster, very different models from different families behave very similarly in terms of robustness.

As future work, we will continue working on this evaluation setting by adding more benchmarks (from different domains) and perturbation functions to our experiments, in order to confirm the results obtained. All this will also add more diversity and generality to our method, thus providing better insights into the robustness of ML families. Future work may also include the application of our framework to specific use cases. We may focus, for instance, on tasks such as object detection for autonomous vehicles, for which we want to evaluate the robustness of a (set of) system(s). To do so, we would need particular benchmark(s) for the detection task, a difficulty estimator for them [11], and a perturbation function to generate invalid inputs, including noise in the captured images or different adversarial attacks [53]. By running our setup, we can potentially analyse which system(s) are more robust based on the difficulty of the task (e.g., also generating a taxonomy based on similarities), and select the best ones according to their robustness for different difficulty ranges. We are also interested in exploring alternative setups of our methodology to gain new insights into model behaviour. For instance, we could introduce noise by perturbing only the most relevant attribute(s), instead of all of them, so that we can assess the robustness of the model in relation to those attributes. We could also apply other noise injection methods.
Acknowledgments

This work has been partially supported by the Norwegian Research Council grant 329745 Machine Teaching for Explainable AI; by the EU (FEDER) and Spanish MINECO grant RTI2018-094403-B-C32, funded by MCIN/AEI/10.13039/501100011033 and by "ERDF A way of making Europe"; by Generalitat Valenciana under grant PROMETEO/2019/098; by the EU's Horizon 2020 research and innovation programme under grant agreement No. 952215 (TAILOR); by INNEST/2021/317 (project co-funded by the European Union with the "Programa Operativo del Fondo Europeo de Desarrollo Regional (FEDER) de la Comunitat Valenciana 2014-2020"); and by the UPV (Vicerrectorado de Investigación) grant PAI-10-21.

References

[1] J. Grimmer, M. E. Roberts, B. M. Stewart, Machine learning for social science: An agnostic approach, Annual Review of Political Science 24 (2021) 395–419.
[2] F. Zantalis, G. Koulouras, S. Karabetsos, D. Kandris, A review of machine learning and IoT in smart transportation, Future Internet 11 (2019) 94.
[3] C. Chen, Y. Zuo, W. Ye, X. Li, Z. Deng, S. P. Ong, A critical review of machine learning of energy materials, Advanced Energy Materials 10 (2020) 1903242.
[4] IEEE Standards Committee, et al., IEEE standard glossary of software engineering terminology (IEEE Std 610.12-1990), Los Alamitos, CA: IEEE Computer Society 169 (1990) 132.
[5] J. M. Zhang, M. Harman, L. Ma, Y. Liu, Machine learning testing: Survey, landscapes and horizons, IEEE Transactions on Software Engineering (2020).
[6] H. Xu, S. Mannor, Robustness and generalization, Machine Learning 86 (2012) 391–423.
[7] J. Rauber, W. Brendel, M. Bethge, Foolbox: A Python toolbox to benchmark the robustness of machine learning models, arXiv preprint arXiv:1707.04131 (2017).
[8] O. Bastani, Y. Ioannou, L. Lampropoulos, D. Vytiniotis, A. Nori, A. Criminisi, Measuring neural net robustness with constraints, Advances in Neural Information Processing Systems 29 (2016).
[9] J. Lian, L. Freeman, Y. Hong, X. Deng, Robustness with respect to class imbalance in artificial intelligence classification algorithms, Journal of Quality Technology 53 (2021) 505–525.
[10] J. Hernández-Orallo, B. S. Loe, L. Cheke, F. Martínez-Plumed, S. Ó hÉigeartaigh, General intelligence disentangled via a generality metric for natural and artificial intelligence, Scientific Reports 11 (2021) 1–16.
[11] F. Martínez-Plumed, D. Castellano-Falcón, C. Monserrat, J. Hernández-Orallo, When AI difficulty is easy: The explanatory power of predicting IRT difficulty, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
[12] F. Martínez-Plumed, R. B. Prudêncio, A. Martínez-Usó, J. Hernández-Orallo, Item response theory in AI: Analysing machine learning classifiers at the instance level, Artificial Intelligence 271 (2019) 18–42.
[13] R. K. Hambleton, H. Swaminathan, Item response theory: Principles and applications, Springer Science & Business Media, 2013.
[14] D. Ljunggren, S. Ishii, A comparative analysis of robustness to noise in machine learning classifiers, 2021.
[15] B. D. Ripley, Pattern recognition and neural networks, Cambridge University Press, 2007.
[16] C. Ferri, J. Hernández-Orallo, R. Modroiu, An experimental comparison of performance measures for classification, Pattern Recognition Letters 30 (2009) 27–38.
[17] J. A. Sáez, M. Galar, J. Luengo, F. Herrera, Analyzing the presence of noise in multi-class problems: alleviating its influence with the one-vs-one decomposition, Knowledge and Information Systems 38 (2014) 179–206.
[18] C.-M. Teng, Correcting noisy data, in: ICML, Citeseer, 1999, pp. 239–248.
[19] X. Zhu, X. Wu, Q. Chen, Eliminating class noise in large datasets, in: Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 920–927.
[20] D. Madaan, J. Shin, S. J. Hwang, Learning to generate noise for multi-attack robustness, in: International Conference on Machine Learning, PMLR, 2021, pp. 7279–7289.
[21] S. Latif, R. Rana, J. Qadir, Adversarial machine learning and speech emotion recognition: Utilizing generative adversarial networks for robustness, arXiv preprint arXiv:1811.11402 (2018).
[22] C. Leistner, A. Saffari, P. M. Roth, H. Bischof, On robustness of on-line boosting: a competitive study, in: 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), IEEE, 2009, pp. 1362–1369.
[23] J. M. Zhang, M. Harman, B. Guedj, E. T. Barr, J. Shawe-Taylor, Perturbation validation: A new heuristic to validate machine learning models, arXiv preprint arXiv:1905.10201 (2020).
[24] J. A. Sáez, J. Luengo, F. Herrera, Evaluating the classifier behavior with noisy data considering performance and robustness: The equalized loss of accuracy measure, Neurocomputing 176 (2016) 26–35.
[25] V. Tjeng, K. Xiao, R. Tedrake, Evaluating robustness of neural networks with mixed integer programming, arXiv preprint arXiv:1711.07356 (2017).
[26] T. Gehr, M. Mirman, D. Drachsler-Cohen, P. Tsankov, S. Chaudhuri, M. Vechev, AI2: Safety and robustness certification of neural networks with abstract interpretation, in: 2018 IEEE Symposium on Security and Privacy (SP), IEEE, 2018, pp. 3–18.
[27] D. Gopinath, K. Wang, M. Zhang, C. S. Pasareanu, S. Khurshid, Symbolic execution for deep neural networks, arXiv preprint arXiv:1807.10439 (2018).
[28] M. Usman, Y. Noller, C. S. Păsăreanu, Y. Sun, D. Gopinath, NeuroSPF: A tool for the symbolic analysis of neural networks, in: 2021 IEEE/ACM 43rd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), IEEE, 2021, pp. 25–28.
[29] G. Katz, C. Barrett, D. L. Dill, K. Julian, M. J. Kochenderfer, Reluplex: An efficient SMT solver for verifying deep neural networks, in: International Conference on Computer Aided Verification, Springer, 2017, pp. 97–117.
[30] M. R. Smith, T. Martinez, C. Giraud-Carrier, An instance level analysis of data complexity, Machine Learning 95 (2014) 225–256. doi:10.1007/s10994-013-5422-z.
[31] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., ImageNet large scale visual recognition challenge, International Journal of Computer Vision 115 (2015) 211–252.
[32] D. Liu, Y. Xiong, K. Pulli, L. Shapiro, Estimating image segmentation difficulty, in: International Workshop on Machine Learning and Data Mining in Pattern Recognition, Springer, 2011, pp. 484–495.
[33] S. Vijayanarasimhan, K. Grauman, What's it going to cost you?: Predicting effort vs. informativeness for multi-label image annotations, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 2262–2269.
[34] B. Richards, Type/token ratios: What do they really tell us?, Journal of Child Language 14 (1987) 201–209.
[35] D. L. Hoover, Another perspective on vocabulary richness, Computers and the Humanities 37 (2003) 151–178.
[36] S. E. Embretson, S. P. Reise, Item response theory for psychologists, L. Erlbaum, 2000.
[37] F. Martínez-Plumed, R. B. C. Prudêncio, A. Martínez-Usó, J. Hernández-Orallo, Making sense of item response theory in machine learning, in: ECAI 2016 - 22nd European Conference on Artificial Intelligence, 2016, pp. 1140–1148. doi:10.3233/978-1-61499-672-9-1140.
[38] F. Martínez-Plumed, J. Hernández-Orallo, Dual indicators to analyse AI benchmarks: Difficulty, discrimination, ability and generality, IEEE Transactions on Games 12 (2020) 121–131.
[39] J. P. Lalor, Learning Latent Characteristics of Data and Models using Item Response Theory, Ph.D. thesis, Doctoral Dissertations, 1842, 2020.
[40] Z. Chen, H. Ahn, Item response theory based ensemble in machine learning, International Journal of Automation and Computing 17 (2020) 621.
[41] A. Birnbaum, Statistical Theories of Mental Test Scores, Addison-Wesley, Reading, MA, 1968.
[42] R. Fabra-Boluda, C. Ferri, F. Martínez-Plumed, J. Hernández-Orallo, M. J. Ramírez-Quintana, Family and prejudice: A behavioural taxonomy of machine learning techniques, in: ECAI 2020, IOS Press, 2020, pp. 1135–1142.
[43] J. Hernández Orallo, C. Ferri Ramírez, M. Ramírez Quintana, Introducción a la Minería de Datos, Pearson Prentice Hall, 2004.
[44] P. Flach, Machine learning: the art and science of algorithms that make sense of data, Cambridge University Press, 2012.
[45] M. Fernández-Delgado, E. Cernadas, S. Barro, D. Amorim, Do we need hundreds of classifiers to solve real world classification problems?, The Journal of Machine Learning Research 15 (2014) 3133–3181.
[46] R. Landis, G. Koch, An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers, Biometrics (1977) 363–374.
[47] B. D. Wright, M. H. Stone, Best test design, Mesa Press, 1979.
[48] J. Vanschoren, J. N. van Rijn, B. Bischl, L. Torgo, OpenML: networked science in machine learning, ACM SIGKDD Explorations Newsletter 15 (2014) 49–60.
[49] R. P. Chalmers, mirt: A multidimensional item response theory package for the R environment, Journal of Statistical Software 48 (2012) 1–29.
[50] A. Maydeu-Olivares, Goodness-of-fit assessment of item response theory models, Measurement: Interdisciplinary Research and Perspectives 11 (2013) 71–101.
[51] M. Kuhn, Building predictive models in R using the caret package, Journal of Statistical Software 28 (2008) 1–26. doi:10.18637/jss.v028.i05.
[52] J. N. van Rijn, B. Bischl, L. Torgo, B. Gao, V. Umaashankar, S. Fischer, P. Winter, B. Wiswedel, M. R. Berthold, J. Vanschoren, OpenML: a collaborative science platform, in: Machine Learning and Knowledge Discovery in Databases, Springer, 2013, pp. 645–649.
[53] B. J. Petit, B. Stottelaar, M. Feiri, F. Kargl, Remote attacks on automated vehicles sensors: Experiments on camera and lidar, Black Hat Europe, 2015.