Robustness Testing of Machine Learning Families using Instance-level IRT-Difficulty

Raül Fabra-Boluda(1), Cèsar Ferri(1), Fernando Martínez-Plumed(1) and M. José Ramírez-Quintana(1)

(1) Universitat Politècnica de València
rafabbo@dsic.upv.es (R. Fabra-Boluda); cferri@dsic.upv.es (C. Ferri); fmartinez@dsic.upv.es (F. Martínez-Plumed); mramirez@dsic.upv.es (M. J. Ramírez-Quintana)

EBeM'22: Workshop on AI Evaluation Beyond Metrics, July 25, 2022, Vienna, Austria

Abstract
The performance evaluation of Machine Learning systems has usually been limited to performance measures on curated and clean datasets, which may not properly reflect how robustly these systems can operate in real-world situations. One key element in this understanding of robustness is instance difficulty: the effect of instance difficulty on robustness can be understood as how unexpected it would be for a typical system to fail on a particular instance of a certain difficulty. In order to provide further understanding of this issue, we estimate IRT-based instance difficulty for an illustrative set of supervised tasks, and we implement and test perturbation methods that simulate noise and variability depending on the type of input data. With this, we evaluate the robustness of different families of machine learning models, which we select and characterise according to their behaviour. The preliminary results of this work in progress allow us to define a novel taxonomy based on the robustness of the different models and the difficulty of the instances addressed. This study is a significant step towards exposing the vulnerabilities of particular families of machine learning models.

Keywords: Machine Learning, Robustness, Robustness Taxonomy, IRT, Noise, Adversarial

1. Introduction

The success of AI, and especially Machine Learning (ML), has caused these types of systems to spread across many applications from different domains, e.g., medical, financial, social or autonomous transport, among others [1, 2, 3]. These applications form part of our daily life and shape our lifestyle: they recommend music to listen to or people to establish career or social relationships with; they diagnose our health and monitor our finances. Given this scenario, there is an obvious need for more robust ML systems.

Robustness is defined by the IEEE standard glossary of software engineering terminology [4] as: the degree to which a system or component can function correctly in the presence of invalid inputs or stressful environmental conditions. In the context of ML, robustness measures the resilience of a system towards perturbations in any of its components (the data, the learning program, or the framework) [5]. Earlier works [6] assessed model robustness by perturbing instances in the training set, the test set, or both. A general way to perturb instances is to add noise, a method that has been extensively applied in the adversarial ML field for the generation of adversarial examples [7]. This is why most of the research in ML robustness has focused on measuring the robustness of systems against adversarial samples [7, 8].

On the other hand, there is a wide range of factors that can affect the robustness of a model [9], and the difficulty of the instances (intrinsic or extrinsic) [10] is one of the most relevant [11]. Robustness must be based on knowing where and why the model fails, avoiding highly unexpected failures, and difficulty is one key element in this understanding. Therefore, the question we want to analyse is whether the performance of a particular model is equally distributed across difficulties. We may also analyse model robustness by perturbing data and locating those examples that are more likely to change their predicted label under a certain amount of noise, i.e., those instances for which the model is less robust and, hence, more prone to cause a vulnerability. We can expect that the more difficult an instance is, the more likely it is to change its predicted label under minor perturbations.

Estimating the difficulty of instances is another problem we need to address. We may simply calculate the average error of a set of systems for each instance as a proxy for difficulty [12]. However, the use of a population of systems entails some risks as well. For instance, if the population contains a non-conformant system (failing on the easy instances and succeeding on some of the hard ones), it may lead to very unstable difficulty metrics. A solution to this problem was introduced several decades ago and is known as item response theory (IRT) [13], where difficulty is inferred from a matrix of items (instances) and respondents (systems), giving more relevance to conformant systems. In addition, IRT gives a scaled metric of difficulty that follows a normal distribution and can be compared directly against the ability of a system.
In this work-in-progress paper we present, as a proof of concept, an evaluation setting to analyse the robustness of different ML models empirically, considering the difficulty of the instances. We also perform a hierarchical clustering to derive taxonomies of ML models according to their robustness. The setting is general, as we employ datasets from different domains, a wide set of representative ML techniques, and an instance perturbation function that introduces random noise with no specific goal. Therefore, it can be adapted to more specific problems by changing the datasets, models and perturbation function to the domain of interest. In this general evaluation framework, we measure the robustness of a model as the agreement, modulo instance difficulty, between the output of the model for the original and the perturbed test sets.

The paper is structured as follows. In Section 2 we review part of the literature related to the assessment of robustness in the presence of noise, the estimation of instance difficulty, and a taxonomy of machine learning techniques derived from a notion of behavioural similarity. Section 3 describes the method we developed to assess model robustness. We apply that method and describe the experiments in Section 4. Finally, Section 5 concludes the paper.

2. Background

In this section we revisit some key concepts related to model robustness, instance IRT-based difficulty, and the definition of behavioural taxonomies of machine learning techniques.

2.1. Robustness in a noisy framework

Robustness is one of the properties of ML systems that characterise their behaviour [5], being particularly suitable for checking whether the system behaves as expected under changes in the data. A common way to simulate those changes is to inject noise into the data, given that real-world data often contain some degree of noise [14].

In the literature, there is a large number of approaches for adding artificial noise to datasets [15, 16, 17]. Usually, noise is introduced by perturbing the values of the attribute(s) (attribute noise) or perturbing the class label (label noise). A general technique for the injection of attribute noise consists in perturbing the instance attribute value(s) following a well-known distribution (e.g., a Gaussian distribution) for numerical attributes, or randomly choosing a different value for categorical attributes [14, 18, 19]. This is the method usually used in adversarial ML, where adversarial examples are created by slightly modifying attribute values of examples correctly classified by the model, crafting new instances (ideally indistinguishable from the original ones) for which the output of the model changes. Attribute noise has also been used in different approaches to improve the robustness of models to adversarial examples. For instance, [20] shows that the injection of noise into the training dataset results in models that are more robust to attacks, since they are able to detect the perturbed instances beforehand. Similarly, adding adversarial instances to the training set can improve the robustness of neural nets [8] and make a Speech Emotion Recognition system more robust [21]. On the other hand, artificial label noise is useful to simulate wrongly annotated instances or other sources of data corruption. Label noise has been used, for instance, to evaluate robustness in computer vision applications [22]. Additionally, label noise in the training set can be employed to enhance the robustness of models, for instance, by reducing errors derived from overfitting [23].
Different ways of assessing the robustness of a model in noisy environments have been proposed. The most general method consists in measuring the correctness loss of models with noise in the data, with respect to the case without noise, regardless of where the noise is located (training set or test set). For noise in the training set, the metrics used to quantify the loss are standard classification metrics such as accuracy and F-measure [14], as well as the Equalised Loss of Accuracy (a metric for measuring a classifier's noise robustness [24]). For the sub-field of adversarial robustness (where noise is used in the test set to generate the adversarial examples), specific metrics such as adversarial accuracy have been used [8]. There are other general techniques for evaluating robustness, including mixed integer programming [25], abstract interpretation [26], and symbolic execution [27, 28, 29].

In this work, we are interested in studying how the robustness of a model is affected by the distortion produced by injecting different levels of noise into test instances, considering the difficulty of the perturbed instances. As far as we know, instance difficulty has not yet been taken into account to evaluate the robustness of models.

2.2. Instance difficulty

Difficult instances may cause problems during AI system development, especially for models that are trained. These instances (usually associated with noise, outliers or decision boundaries) have been blamed for overfitting, lack of convergence, or both. Handling these sorts of anomalies has been addressed in a number of different ways aimed at preventing overfitting. However, these approaches usually try to identify anomalies or mislabelled instances without defining what characterises them.
For instance, in [30] instances that are hard to classify are identified through instance hardness metrics. These metrics try to characterise the level of difficulty of each input sample following a (populational) empirical definition based on the classification behaviour over the instances to be evaluated. If we move out of the field of machine learning (e.g., to computer vision or NLP-related tasks), we find an area still to be explored, where the different works are limited to analysing global image properties (e.g., salience, memorability, photo quality, tone, colour, texture, etc.) [31, 32, 33] or, in the case of NLP, are based on lexical readability and richness [34, 35].

All the approaches above are specific to a domain and, in many cases, also anthropocentric. A completely different approach is Item Response Theory (IRT), a well-developed subdiscipline in psychometrics [36], only recently brought to AI and machine learning [37, 12, 38, 39, 40]. In IRT, the probability of a correct response to an item is a function of the respondent's ability and some item parameters. The respondent solves the problem, and the item is the problem instance itself. We focus on dichotomous models, where the response can be either correct or incorrect. Let U_ji be the binary response of a respondent j to item i, with U_ji = 1 for a correct response and U_ji = 0 otherwise. For the basic one-parameter logistic (1PL) IRT model (where the discrimination a_i is fixed; when a_i is also estimated per item, this becomes the 2PL model shown in Fig. 1), the probability of a correct response given the examinee's ability is modelled as a logistic function:

    P(U_ji = 1 | θ_j) = 1 / (1 + exp(−a_i (θ_j − b_i)))    (1)

The parameter θ_j is the ability or proficiency of j, and b_i is the difficulty of i. If the ability θ_j equals the item difficulty b_i, then there are even odds of a correct answer (the curve is cut at exactly 0.5, as the light blue dashed line shows in Fig. 1). For each item, the above model provides an Item Characteristic Curve (ICC) (see Fig. 1, left), characterised by the difficulty b_i, which is the location parameter of the logistic function.

Figure 1: Left: Example of a 2PL IRT ICC curve, with slope a = 2 in red and location parameter b = 3 in blue. Right: Example of SCC curves for systems with different abilities (2, 3 and 4).
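To make the shape of Eq. (1) concrete, the following minimal R sketch computes and plots an ICC (the slope and location values are the illustrative ones from the caption of Fig. 1; the function name is ours):

    # Item characteristic curve (ICC) from Eq. (1): probability of a correct
    # response given ability theta, discrimination a and difficulty b.
    icc <- function(theta, a, b) {
      1 / (1 + exp(-a * (theta - b)))
    }

    theta <- seq(0, 6, by = 0.01)    # grid of ability values
    p <- icc(theta, a = 2, b = 3)    # slope a = 2, location b = 3 (Fig. 1, left)
    plot(theta, p, type = "l", xlab = "Ability", ylab = "Probability Correct")
    abline(h = 0.5, lty = 2)         # even odds exactly when theta = b

Note that icc(3, 2, 3) returns 0.5, matching the even-odds reading of the curve when ability equals difficulty.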
As all IRT models assume a single parameter for the respondent, their dual plots (known originally as person characteristic curves, here renamed system characteristic curves (SCC)) also follow a logistic function (see Fig. 1, right). Respondents who tend to correctly answer the most difficult items will be assigned high values of ability. Difficult items, in turn, are those correctly answered only by the most proficient respondents. From this understanding and some common assumptions (ability and difficulty following particular normal distributions), the latent variables can be inferred from a table of item-respondent pairs U_ji. Two-step iterative variants of maximum-likelihood estimation (MLE), such as Birnbaum's method [41], can be used to infer all the IRT parameters.

IRT difficulty is characterised by being system-independent and domain-generic, unlike the other metrics described above [11]. It also has some advantages over using average performance as a metric of difficulty, in terms of distribution, stability and predictability, as has been studied in the IRT literature.

2.3. Behaviour-based Machine Learning families

One classical way of characterising the rich range of machine learning techniques is by defining 'families' according to their formulation and learning strategy (e.g., neural networks, Bayesian methods, etc.) [43, 44, 45]. However, this taxonomy of learning techniques does not take into account the intrinsic behaviour of the models (measured as output agreement), especially considering predictions in sparse zones where insufficient training data was available. If we want to characterise the robustness of ML models, we need to analyse a diverse set of models, as many as possible, under different parameters as well. In this regard, [42] derived a taxonomy of ML techniques for classification, where families are clustered according to their degree of (dis)agreement in behaviour, i.e., the differences between models in how they distribute the output class labels along the feature space, considering both dense and sparse zones (where training data is scarce or inexistent) and using Cohen's kappa statistic [46]. While in dense areas differences between models may be difficult to find, in sparse areas the algorithms diverge significantly and unveil the characteristic behaviour of the models trained with those techniques.

The methodology was based on comparing the behaviour of 65 different learning models (including hyperparameter variations), performing a pairwise comparison (based on kappa) and averaging the results obtained over 75 datasets. For grouping into families, the authors applied a hierarchical clustering so that models presenting similar behaviour fell into the same cluster, which is considered a model family (see the 18 different families obtained in Figure 2). This method is useful to objectively quantify how different two models (or model families) are.

Figure 2: Dendrogram representing the hierarchical clustering (18 groups) of ML models. From [42].

3. Empirical Methodology

In this section we describe the experimental methodology performed to obtain a taxonomy of classification algorithms according to their robustness. We start by introducing the set of representative datasets and learning models we have employed. Then, we describe how we estimate instance difficulty, the approach followed to introduce noise into the data and, finally, how we define the taxonomy of ML families. All the data, code, complete experiments, plots and results can be found at https://github.com/rfabra/family-robustness.

3.1. Data and Classifiers

In order to estimate IRT difficulty, we need to find benchmarks that have instance-wise results for a good number of models. It is recommended to have at least 10-20 responses per item [47]. More importantly, we need the instance-wise results, i.e., a |J| × |I| matrix with the performance of each system j ∈ J for each instance i ∈ I. Finding experiments not reported in an aggregated way was not an easy task. As an exception to the instance-wise result problem, we find platforms such as OpenML [48], a repository in which AI researchers and practitioners can share datasets and results in as much detail as possible. The platform also provides several curated suites such as OpenML-CC18, from which we address a set of 3 benchmarks for supervised learning (see Table 1). The selection is guided by their illustrative character (with the preliminary objective of testing the effectiveness of our setting), but is also limited to those benchmarks with a sufficiently large number |I| of examples (instances) and |J| of models (respondents).

Table 1: List of datasets for the experiments.

    Dataset               | # Instances | # Features | # Classes
    letter                | 20000       | 16         | 26
    optdigits             | 5620        | 64         | 10
    wall-robot-navigation | 5456        | 4          | 4

Regarding ML models, we employed a set of 18 ML models from different ML families (see Table 2), derived in [42]. These 18 model families were obtained from a pool of 65 models learned and evaluated on a wide range of datasets and categorised into different families, as described in Section 2.3. For each family we selected a single model, its centroid (i.e., the model representing the centre of each family cluster), assuming it to be representative of its family. Thus, we can assume that the 18 selected models are diverse enough to provide a wide view of how different model families behave in terms of robustness.

Table 2: List of the 18 models employed for the experiments, along with the parameters used.

    Technique                         | Parameters                          | id
    C5.0                              |                                     | C5.0
    Cond. Inf. Tree                   | mincriterion = 0.05                 | CI_T
    Flex. Disc. Analysis              | degree = 1, nprune = 17             | FDA
    Stoch. Grad. Boosting Machine     | interaction.depth = 2, n.trees = 50 | GBM
    JRip                              |                                     | JRip
    K-Nearest Neighbour               | K = 3                               | 3NN
    Learning Vector Quant.            | size = 50, K = 3                    | LVQ
    MultiLayer Perceptron             | 1 hidden layer, 7 neurons           | MLP
    Multinomial Log. Regression       |                                     | MLR
    Naive Bayes                       |                                     | NB
    PART                              |                                     | PART
    Radial Basis Function Network     |                                     | RBF
    Regularised Discriminant Analysis |                                     | RDA
    Random Forest                     | mtry = 64                           | RF
    RPART                             |                                     | RPART
    Part. Least Squares               | ncomp = 3                           | PLS
    SVM                               | Poly, degree = 2                    | SVM
    RFRules                           | mtry = 64                           | RFRules
3.2. Estimation of Difficulty

As mentioned in Section 3.1, in order to estimate the difficulty of the instances, we first check that each benchmark selected from OpenML has at least 10-20 responses (model evaluations) per item/feature (e.g., we would need between 640 and 1280 responses for optdigits) and that these are sufficiently diverse (different architectures or technologies). Next, we obtain the models' responses for unseen instances (we use the test folds, so it is actually test performance, even if we cover the whole dataset). This will be our |J| × |I| matrix U with all binary responses U_ji.

We follow the recommendations from [12] for the application of IRT. In practice, for generating the IRT models, we used the mirt R package [49], using Birnbaum's method, as explained above. The mirt package (like many other IRT libraries) outputs indicators about goodness of fit, which can be used to quantify the discrepancy between the values observed in the data (items) and the values expected under the statistical IRT model. Item-fit statistics may be used to test the hypothesis of whether the fitted model could truly be the data-generating model or whether, conversely, we should expect the item parameter estimates to be biased. In practice, an IRT model may be rejected on the basis of bad item-fit statistics, as we would not be reasonably confident about the validity of the inferences drawn from it [50]. In the present case, none of the estimated models were discarded because of bad item-fit statistics or inconsistency in their results.
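As an illustration of this step, the sketch below fits an IRT model with mirt and extracts the instance difficulties. It assumes U is the |J| × |I| binary response matrix described above; we use itemtype = "2PL" to match the slope parameter a_i in Eq. (1) (itemtype = "Rasch" would give the plain 1PL fit), and mirt's default EM estimation rather than Birnbaum's original two-step procedure:

    library(mirt)

    # U: |J| x |I| binary response matrix (respondents/models in rows,
    # items/instances in columns), U[j, i] = 1 iff model j got instance i right.
    fit <- mirt(data = as.data.frame(U), model = 1, itemtype = "2PL")

    # Item parameters in the classical IRT parameterisation:
    # column 'a' is the discrimination a_i, column 'b' the difficulty b_i.
    items <- coef(fit, IRTpars = TRUE, simplify = TRUE)$items
    difficulty <- items[, "b"]

    # Item-fit statistics, used to check the fitted model (Section 3.2).
    fit_stats <- itemfit(fit)

    # Discard outlying difficulty estimates, as done in Section 4.2.
    difficulty <- difficulty[difficulty >= -6 & difficulty <= 6]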
3.2.1. System Characteristic Curves

One of the most powerful visualisation tools that derives from difficulty is what we call system characteristic curves (SCC) (Fig. 1, right). Inspired by the concept of person characteristic curve previously developed in IRT, an SCC is a plot of the response probability (e.g., accuracy, kappa, etc.) of a particular classifier as a function of instance difficulty. To produce an SCC, we divide the instances into bins according to difficulty. For each bin, we plot on the x-axis the average difficulty of the instances in the bin, and on the y-axis the selected performance metric.

3.3. Introduction of Noise

We need a method to generate noise that is representative and general enough, so that the experimental results can be adapted to other noise settings, e.g., to include adversarial attacks. Hence, we will work directly with noise levels, assuming that they are mapped from contexts. Noise is generated randomly using well-known probability distributions, following a procedure similar to [16]. Instances are perturbed by changing their attribute values within a range of possible values. The process to select among the possible values depends on whether the attribute is numerical or nominal (a sketch of both rules follows this list):

• Numerical attributes: Let ν be the level of noise to be injected into a numerical attribute at, and σ the standard deviation of all values of at. Then, a value x in at is modified as x′ ∼ N(x, σ⋅ν), i.e., we draw from a normal distribution using x as mean and σ multiplied by the noise level ν as standard deviation.

• Nominal attributes: Let {at_1, ..., at_m} be the set of the m possible values of a nominal attribute at, and p the vector representing the empirical distribution of at, that is, p = (p_{at_1}, ..., p_{at_m}), where p_i is the frequency of value i. Given an instance with value x = at_j in at, we represent it as the vector t = (t_{at_1}, ..., t_{at_m}) with t_{at_i} = 0 ∀i ∈ {1..m}, i ≠ j, and t_{at_j} = 1. To insert a noise level ν, we calculate α = 1 − e^{−ν} and then compute a new vector of probabilities p′ = α⋅p + (1−α)⋅t. Finally, we use p′ to sample the new value x′ of the attribute.
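A minimal R sketch of this perturbation function is given below. It is our own implementation of the two rules above; the function names and the "Class" column are illustrative assumptions, and the sampling of the δ proportion of instances per difficulty bin is omitted for brevity:

    # Perturb one attribute column 'x' with noise level 'nu'.
    perturb_attribute <- function(x, nu) {
      if (is.numeric(x)) {
        # Numerical: x' ~ N(x, sigma * nu), sigma being the sd of the attribute.
        rnorm(length(x), mean = x, sd = sd(x) * nu)
      } else {
        # Nominal: p' = alpha * p + (1 - alpha) * t, with alpha = 1 - exp(-nu),
        # p the empirical distribution of the attribute and t the one-hot
        # vector of the current value; the new value is sampled from p'.
        x <- as.factor(x)
        vals <- levels(x)
        p <- as.numeric(table(x)) / length(x)
        alpha <- 1 - exp(-nu)
        vapply(as.character(x), function(xi) {
          t_vec <- as.numeric(vals == xi)
          sample(vals, size = 1, prob = alpha * p + (1 - alpha) * t_vec)
        }, character(1))
      }
    }

    # Perturbation function phi: perturb every attribute of a test set,
    # assuming the class label is stored in a column named "Class".
    phi <- function(T_set, nu = 0.2) {
      attrs <- setdiff(names(T_set), "Class")
      T_set[attrs] <- lapply(T_set[attrs], perturb_attribute, nu = nu)
      T_set
    }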
For the experiments, we generate noisy datasets (test sets) using a noise level ν = 0.2. We vary the proportion of perturbed instances δ in each bin, from δ = 0 (keeping the original test set unperturbed) to δ = 1 (perturbing the whole test set). This is performed under a 5-fold cross-validation setting. For each model, we compare its predictions on the original test set with the predictions on each of the noisy test sets by means of the kappa metric, as we describe below.

3.4. Model robustness to noise and difficulty

We compare the behaviour of ML models from different families by classifying the same test set from a particular benchmark, to which we introduce different levels of noise. The more the behaviour of a model changes under noise, the less robust it is. This difference in behaviour can be measured with Cohen's kappa metric [46]. More concretely, given the domain 𝕋 of all datasets we can create from all possible inputs, a test set T ∈ 𝕋, a perturbation function φ : 𝕋 → 𝕋 that introduces noise into a dataset, the perturbed test set T′ = φ(T), the predictions of a model M for the original test set y_M = M(T), the predictions of M for the perturbed test set y′_M = M(T′), and two models M1 and M2 learned on the same data, model M1 is considered more robust than model M2 if

    κ(y_M1, y′_M1) > κ(y_M2, y′_M2)

Thus, we employ kappa as a measure of similarity between the predictions of a model on the original and the perturbed test sets. It is important to notice that we are not accounting for the real class label, since adding noise to the input attributes of an instance implies that the actual class is probably no longer the same as it was originally. Instead, we compare the labels the model predicts for the original test set (without noise) with the ones predicted for the noisy test sets (see the sketch at the end of this section). Our goal is not to determine how well a model solves a task, but to assess how the behaviour of the model changes under different levels of noise applied to instances of different levels of difficulty. As we want to analyse whether model robustness may vary depending on the difficulty of the instances addressed, we estimated the difficulty of each instance in the dataset following the procedure described above. Later, we grouped instances into difficulty bins to analyse robustness (to produce SCCs), as explained above.

Analysing the data from the SCCs for different models, we also derive an ML robustness taxonomy attending to the different shapes of the SCCs and the models' behaviour. In this regard, for each dataset we built a matrix where each row represents a model and each column represents a combination of difficulty bin and proportion of noisy instances per bin. Each element represents the similarity (i.e., the kappa metric) between the predictions of the model for the original test set and the predictions for each noisy test set and bin. By averaging these across all the datasets, we may perform a hierarchical clustering with the aim of obtaining different groupings of models by robustness, showing the similarity between different ML families in a data-driven fashion.
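The per-bin comparison behind both the robustness criterion and the SCCs can be sketched in a few lines of R. This is a hypothetical illustration reusing phi from the previous sketch; M, T_test and difficulty are assumed names, and cohens_kappa is our own small implementation (packages such as caret also report kappa):

    # Cohen's kappa between two label vectors.
    cohens_kappa <- function(y1, y2) {
      labs <- union(y1, y2)
      cm <- table(factor(y1, levels = labs), factor(y2, levels = labs))
      n <- sum(cm)
      po <- sum(diag(cm)) / n                      # observed agreement
      pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # chance agreement
      (po - pe) / (1 - pe)
    }

    # Robustness of a model M: agreement between its predictions on the
    # original test set T and on the perturbed test set T' = phi(T).
    y  <- predict(M, T_test)          # y_M
    yp <- predict(M, phi(T_test))     # y'_M

    # One SCC point per difficulty bin: mean difficulty on the x-axis,
    # kappa between original and noisy predictions on the y-axis.
    bins <- cut(difficulty, breaks = 5)
    scc <- data.frame(
      difficulty = tapply(difficulty, bins, mean),
      kappa = tapply(seq_along(y), bins,
                     function(i) cohens_kappa(y[i], yp[i]))
    )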
3.5. Experimental questions

Once the experimental methodology is clear, we want to investigate the relationship between the robustness of the models and the difficulty of the instances, the latter having been altered with different levels of noise. For this, we set three experimental questions. Q1: How are difficulties distributed per benchmark for the estimated IRT-difficulty metric? Q2: Can we see differences in robustness for different models based on the difficulty of the instances? Q3: Can we group models by robustness?

4. Experiments

4.1. Setup

We employed the R language with the caret package [51] to carry out our experiments, i.e., the training and evaluation of the models. All the models were learnt from scratch, so we did not use any pre-trained model. We used the mirt R package [49] for estimating IRT 1PL models. To feed the IRT method, we obtained the predictions from a wide variety of models by using the OpenML API [52]. In total, we employed the predictions of (up to) 2000 evaluations per dataset.

4.2. Results

IRT difficulties are built to approximately follow a normal distribution with standard deviation 1 but different locations depending on the dataset. When it comes to the item difficulty parameters, what is acceptable depends very much on the purpose of the test and the population of interest. For instance, values around 1 are typical in educational measurement. In health measurement, however, these values are usually much higher, around 4. In our case, when addressing ML benchmarks, difficulty values between roughly −3 and 3 are the norm (see [12]). For this reason, we decided to remove those instances whose difficulty is outside the range [−6, 6], which are considered outliers. This happened in all benchmarks for very easy instances on which all techniques are correct, never affecting more than 0.1% of the instances.

Figure 3 shows the IRT-difficulty distribution per benchmark, with a standard deviation around 1 (as expected). In terms of location (Q1), the letter benchmark contains more difficult instances (mean difficulty of −1.50 ± 0.92) than the others (−1.92 ± 0.67 for optdigits and −2.36 ± 0.9 for wall-robot-navigation). Although the distributions are generally normal, the wall-robot-navigation dataset presents a higher number of difficult instances, skewing the distribution to the right. This may be due to the diversity of the population of systems used for the difficulty estimation (similar cases can be observed in [11]).

Figure 3: IRT-difficulty distribution per dataset. Benchmarks sorted by average difficulty.

Regarding Q2, for each technique in Table 2, we compare its predictions on the original test set (for each dataset in Table 1) with the predictions on each of the noisy test sets by means of the kappa metric. The SCCs produced (using kappa values on the y-axis) are shown in Figure 4. Obviously, kappa takes values equal to 1 when the test set is not perturbed (δ = 0), since we are comparing the output labels of each trained model with themselves. As we increase the amount of perturbed instances (the same proportion for each difficulty bin), we can appreciate differences in the behaviour of the techniques analysed.

As expected, the most difficult instances are those that are more sensitive to noise, and this can be seen in the level of performance of the different techniques for the most difficult instance bins. This behaviour may indicate that these instances are located close to the decision boundary or in regions with class overlap, so the behaviour of most techniques is more unpredictable in those regions than in easier ones. In general, we may find some patterns of behaviour for different sets of techniques. First, we identify cases where robustness decreases non-linearly with increasing levels of difficulty. This is the most common case, but with differences in robustness variations for different models and datasets (see, e.g., CI_T, FDA, 3NN, MLP, MLR or SVM).
Second, we also see cases where robustness is mostly affected by the noise level and less by the difficulty of the instances (see, e.g., NB, RBF or RDA). Finally, there are cases in which robustness is barely altered by either difficulty or noise level (see, e.g., PLS, PART or LVQ).

Figure 4: Kappa vs difficulty for different models and benchmarks, varying the proportion of perturbed instances (δ ∈ {0, 0.2, 0.4, 0.6, 0.8, 1}) for each difficulty bin.

On the other hand, if we analyse the results at the dataset level, we see that the behaviour of some techniques changes significantly. For instance, techniques such as C5.0, CI_T and JRip exhibit an interesting behaviour on the optdigits dataset: they seem more prone to change their predictions on easy instances than on medium (or even hard) ones. Analysing the results in more detail, we have seen that this is due to the class distributions in the easier bins. These bins are usually composed of many instances of a single class (usually the majority class), but these instances may be misclassified as we increase the amount of noise, thus reflecting a drop in the kappa value. This is the case, for instance, with the JRip model learnt on the optdigits dataset. The first bin is composed of 479 instances of class "6" without introducing any noise (δ = 0). After perturbing all the instances (δ = 1), only 256 instances in this bin are predicted as class "6", which explains the observed drop in kappa. If we focus on the last bin (the most difficult) for this same model and dataset, we can see that it presents a similar behaviour to the first bin. However, this phenomenon happens for a different reason: the model predicts 160 instances of class "1" for δ = 0, whereas for δ = 1 the number of instances predicted for this class increases up to 244, i.e., this bin tends to absorb the predictions of class "1" the more noise is introduced. Both cases may constitute a robustness flaw for a particular model.

Finally, for Q3, we derive a taxonomy that groups techniques with similar robustness behaviour considering difficulty. To measure the dissimilarity between sets of observations, we employed the kappa metric computed for each model, aggregated across all datasets, difficulty bins, and proportions of perturbed instances δ. We performed an agglomerative hierarchical clustering, employing the Euclidean distance and the complete linkage method as the linkage criterion, as sketched below.
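The clustering step itself is standard; a sketch in base R, under the assumption that R_mat is the model × (difficulty bin, δ) kappa matrix from Section 3.4, already averaged across datasets and with model ids as row names:

    # R_mat: one row per model (18 rows), one column per combination of
    # difficulty bin and proportion of perturbed instances delta; each entry
    # is the kappa between original and noisy predictions, averaged over datasets.
    d <- dist(R_mat, method = "euclidean")
    taxonomy <- hclust(d, method = "complete")  # agglomerative, complete linkage
    plot(taxonomy)                              # dendrogram, as in Figure 5
    clusters <- cutree(taxonomy, k = 3)         # the three main clusters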
The result of applying the hierarchical clustering is shown in Figure 5. We found three main clusters. The first cluster shows that CI_T, JRip and C5.0 have very similar behaviour, joining with NB. The second cluster is composed of the models GBM, RF, MLR, MLP and FDA, joining with RPART and RFRules at a greater height. The last cluster shows two subgroups: the first consists of PART, PLS and LVQ; the second contains RDA, RBF, 3NN and SVM. These results show that models from different ML families may present similar robustness (e.g., the models JRip and CI_T), despite coming from very different techniques.

Figure 5: Robustness-based taxonomy for different ML families.

Overall, we have shown that estimating difficulty for analysing robustness may be very useful and insightful. We would need to inspect the test SCCs as an exercise before selecting and deploying models in real-world situations. SCCs can thus be used to select the (set of) best classifier(s) according to their robustness for different difficulty ranges. Since we may not know the difficulty values of unseen examples in a test/validation set, we may estimate them in different (and straightforward) ways, such as by averaging the difficulty values of the most similar examples in the original set [12] or by training a difficulty estimator [11]. We could even do this with small sets or even for single instances, always running the difficulty estimator to determine which model to use. If we can predict the difficulty of instances, we could set a threshold to use the system only for the easy instances for which it is robust.

5. Conclusions and Future Work

In this work we propose an evaluation setting to analyse the robustness of different ML models, from different ML families, when addressing noisy instances, attending to their difficulty. Furthermore, we established an ML model taxonomy based on robustness. Our results show that there are models affected by noise, by instance difficulty, or by both. Some models are more prone to change their prediction when adding noise to the most difficult instances, while other models behave similarly on easy instances. This might be caused by the concentration of certain instances of a predicted class in easy bins, which are misclassified after introducing noise. On the other hand, harder bins may absorb certain classes after introducing noise. Given this variety in model behaviour, we derived a model robustness taxonomy by performing a hierarchical clustering to group models that behave similarly. Our results show that there are three major clusters and that, within each cluster, very different models from different families behave very similarly in terms of robustness.

As future work, we will continue working on this evaluation setting by adding more benchmarks (from different domains) and perturbation functions to our experiments, in order to confirm the results obtained. All this will also add more diversity and generality to our method, thus providing better insights into the robustness of ML families. Future work may also include the application of our framework to specific use cases. We may focus, for instance, on tasks such as object detection for autonomous vehicles, for which we want to evaluate the robustness of a (set of) system(s). To do so, we would need particular benchmark(s) for the detection task, a difficulty estimator for them [11], and a perturbation function to generate invalid inputs, including noise in the captured images or different adversarial attacks [53]. By running our setup, we can potentially analyse which system(s) are more robust based on the difficulty of the task (e.g., also generating a taxonomy based on similarities), and select the best ones according to their robustness for different difficulty ranges. We are also interested in exploring alternative setups of our methodology to gain new insights into model behaviour. For instance, we could introduce noise by perturbing only the most relevant attribute(s), instead of all of them, so that we can assess the robustness of the model in relation to those attributes. We could also apply other noise injection methods.
Acknowledgments

This work has been partially supported by the Norwegian Research Council grant 329745 Machine Teaching for Explainable AI; by the EU (FEDER) and Spanish MINECO grant RTI2018-094403-B-C32, funded by MCIN/AEI/10.13039/501100011033 and by "ERDF A way of making Europe"; by Generalitat Valenciana under grant PROMETEO/2019/098; by the EU's Horizon 2020 research and innovation programme under grant agreement No. 952215 (TAILOR); by INNEST/2021/317 (project co-funded by the European Union with the "Programa Operativo del Fondo Europeo de Desarrollo Regional (FEDER) de la Comunitat Valenciana 2014-2020"); and by the UPV (Vicerrectorado de Investigación) grant PAI-10-21.

References

[1] J. Grimmer, M. E. Roberts, B. M. Stewart, Machine learning for social science: An agnostic approach, Annual Review of Political Science 24 (2021) 395–419.
[2] F. Zantalis, G. Koulouras, S. Karabetsos, D. Kandris, A review of machine learning and IoT in smart transportation, Future Internet 11 (2019) 94.
[3] C. Chen, Y. Zuo, W. Ye, X. Li, Z. Deng, S. P. Ong, A critical review of machine learning of energy materials, Advanced Energy Materials 10 (2020) 1903242.
[4] IEEE Standards Committee, et al., IEEE standard glossary of software engineering terminology (IEEE Std 610.12-1990), Los Alamitos, CA: IEEE Computer Society 169 (1990) 132.
[5] J. M. Zhang, M. Harman, L. Ma, Y. Liu, Machine learning testing: Survey, landscapes and horizons, IEEE Transactions on Software Engineering (2020).
[6] H. Xu, S. Mannor, Robustness and generalization, Machine Learning 86 (2012) 391–423.
[7] J. Rauber, W. Brendel, M. Bethge, Foolbox: A Python toolbox to benchmark the robustness of machine learning models, arXiv preprint arXiv:1707.04131 (2017).
[8] O. Bastani, Y. Ioannou, L. Lampropoulos, D. Vytiniotis, A. Nori, A. Criminisi, Measuring neural net robustness with constraints, Advances in Neural Information Processing Systems 29 (2016).
[9] J. Lian, L. Freeman, Y. Hong, X. Deng, Robustness with respect to class imbalance in artificial intelligence classification algorithms, Journal of Quality Technology 53 (2021) 505–525.
[10] J. Hernández-Orallo, B. S. Loe, L. Cheke, F. Martínez-Plumed, S. Ó hÉigeartaigh, General intelligence disentangled via a generality metric for natural and artificial intelligence, Scientific Reports 11 (2021) 1–16.
[11] F. Martínez-Plumed, D. Castellano-Falcón, C. Monserrat, J. Hernández-Orallo, When AI difficulty is easy: The explanatory power of predicting IRT difficulty, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
[12] F. Martínez-Plumed, R. B. Prudêncio, A. Martínez-Usó, J. Hernández-Orallo, Item response theory in AI: Analysing machine learning classifiers at the instance level, Artificial Intelligence 271 (2019) 18–42.
[13] R. K. Hambleton, H. Swaminathan, Item response theory: Principles and applications, Springer Science & Business Media, 2013.
[14] D. Ljunggren, S. Ishii, A comparative analysis of robustness to noise in machine learning classifiers, 2021.
[15] B. D. Ripley, Pattern recognition and neural networks, Cambridge University Press, 2007.
[16] C. Ferri, J. Hernández-Orallo, R. Modroiu, An experimental comparison of performance measures for classification, Pattern Recognition Letters 30 (2009) 27–38.
[17] J. A. Sáez, M. Galar, J. Luengo, F. Herrera, Analyzing the presence of noise in multi-class problems: alleviating its influence with the one-vs-one decomposition, Knowledge and Information Systems 38 (2014) 179–206.
[18] C.-M. Teng, Correcting noisy data, in: ICML, Citeseer, 1999, pp. 239–248.
[19] X. Zhu, X. Wu, Q. Chen, Eliminating class noise in large datasets, in: Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 920–927.
[20] D. Madaan, J. Shin, S. J. Hwang, Learning to generate noise for multi-attack robustness, in: International Conference on Machine Learning, PMLR, 2021, pp. 7279–7289.
[21] S. Latif, R. Rana, J. Qadir, Adversarial machine learning and speech emotion recognition: Utilizing generative adversarial networks for robustness, arXiv preprint arXiv:1811.11402 (2018).
[22] C. Leistner, A. Saffari, P. M. Roth, H. Bischof, On robustness of on-line boosting: a competitive study, in: 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), IEEE, 2009, pp. 1362–1369.
[23] J. M. Zhang, M. Harman, B. Guedj, E. T. Barr, J. Shawe-Taylor, Perturbation validation: A new heuristic to validate machine learning models, arXiv preprint arXiv:1905.10201 (2020).
[24] J. A. Sáez, J. Luengo, F. Herrera, Evaluating the classifier behavior with noisy data considering performance and robustness: The equalized loss of accuracy measure, Neurocomputing 176 (2016) 26–35.
[25] V. Tjeng, K. Xiao, R. Tedrake, Evaluating robustness of neural networks with mixed integer programming, arXiv preprint arXiv:1711.07356 (2017).
[26] T. Gehr, M. Mirman, D. Drachsler-Cohen, P. Tsankov, S. Chaudhuri, M. Vechev, AI2: Safety and robustness certification of neural networks with abstract interpretation, in: 2018 IEEE Symposium on Security and Privacy (SP), IEEE, 2018, pp. 3–18.
[27] D. Gopinath, K. Wang, M. Zhang, C. S. Pasareanu, S. Khurshid, Symbolic execution for deep neural networks, arXiv preprint arXiv:1807.10439 (2018).
[28] M. Usman, Y. Noller, C. S. Păsăreanu, Y. Sun, D. Gopinath, NeuroSPF: A tool for the symbolic analysis of neural networks, in: 2021 IEEE/ACM 43rd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), IEEE, 2021, pp. 25–28.
[29] G. Katz, C. Barrett, D. L. Dill, K. Julian, M. J. Kochenderfer, Reluplex: An efficient SMT solver for verifying deep neural networks, in: International Conference on Computer Aided Verification, Springer, 2017, pp. 97–117.
[30] M. R. Smith, T. Martinez, C. Giraud-Carrier, An instance level analysis of data complexity, Machine Learning 95 (2014) 225–256. doi:10.1007/s10994-013-5422-z.
[31] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., ImageNet large scale visual recognition challenge, International Journal of Computer Vision 115 (2015) 211–252.
[32] D. Liu, Y. Xiong, K. Pulli, L. Shapiro, Estimating image segmentation difficulty, in: International Workshop on Machine Learning and Data Mining in Pattern Recognition, Springer, 2011, pp. 484–495.
[33] S. Vijayanarasimhan, K. Grauman, What's it going to cost you?: Predicting effort vs. informativeness for multi-label image annotations, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 2262–2269.
[34] B. Richards, Type/token ratios: What do they really tell us?, Journal of Child Language 14 (1987) 201–209.
[35] D. L. Hoover, Another perspective on vocabulary richness, Computers and the Humanities 37 (2003) 151–178.
[36] S. E. Embretson, S. P. Reise, Item response theory for psychologists, L. Erlbaum, 2000.
[37] F. Martínez-Plumed, R. B. C. Prudêncio, A. Martínez-Usó, J. Hernández-Orallo, Making sense of item response theory in machine learning, in: ECAI 2016 - 22nd European Conference on Artificial Intelligence, 2016, pp. 1140–1148. doi:10.3233/978-1-61499-672-9-1140.
[38] F. Martínez-Plumed, J. Hernández-Orallo, Dual indicators to analyse AI benchmarks: Difficulty, discrimination, ability and generality, IEEE Transactions on Games 12 (2020) 121–131.
[39] J. P. Lalor, Learning Latent Characteristics of Data and Models using Item Response Theory, Ph.D. thesis, Doctoral Dissertations, 1842, 2020.
[40] Z. Chen, H. Ahn, Item response theory based ensemble in machine learning, International Journal of Automation and Computing 17 (2020) 621.
[41] A. Birnbaum, Statistical Theories of Mental Test Scores, Addison-Wesley, Reading, MA, 1968.
[42] R. Fabra-Boluda, C. Ferri, F. Martínez-Plumed, J. Hernández-Orallo, M. J. Ramírez-Quintana, Family and prejudice: A behavioural taxonomy of machine learning techniques, in: ECAI 2020, IOS Press, 2020, pp. 1135–1142.
[43] J. Hernández Orallo, C. Ferri Ramírez, M. Ramírez Quintana, Introducción a la Minería de Datos, Pearson Prentice Hall, 2004.
[44] P. Flach, Machine learning: the art and science of algorithms that make sense of data, Cambridge University Press, 2012.
[45] M. Fernández-Delgado, E. Cernadas, S. Barro, D. Amorim, Do we need hundreds of classifiers to solve real world classification problems?, The Journal of Machine Learning Research 15 (2014) 3133–3181.
[46] R. Landis, G. Koch, An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers, Biometrics (1977) 363–374.
[47] B. D. Wright, M. H. Stone, Best test design, Mesa Press, 1979.
[48] J. Vanschoren, J. N. van Rijn, B. Bischl, L. Torgo, OpenML: networked science in machine learning, ACM SIGKDD Explorations Newsletter 15 (2014) 49–60.
[49] R. P. Chalmers, mirt: A multidimensional item response theory package for the R environment, Journal of Statistical Software 48 (2012) 1–29.
[50] A. Maydeu-Olivares, Goodness-of-fit assessment of item response theory models, Measurement: Interdisciplinary Research and Perspectives 11 (2013) 71–101.
[51] M. Kuhn, Building predictive models in R using the caret package, Journal of Statistical Software 28 (2008) 1–26. doi:10.18637/jss.v028.i05.
[52] J. N. van Rijn, B. Bischl, L. Torgo, B. Gao, V. Umaashankar, S. Fischer, P. Winter, B. Wiswedel, M. R. Berthold, J. Vanschoren, OpenML: a collaborative science platform, in: Machine Learning and Knowledge Discovery in Databases, Springer, 2013, pp. 645–649.
[53] B. J. Petit, B. Stottelaar, M. Feiri, F. Kargl, Remote attacks on automated vehicles sensors: Experiments on camera and lidar, Black Hat Europe, 2015.