=Paper=
{{Paper
|id=Vol-3169/paper5
|storemode=property
|title=The Relevance of Non-Human Errors in Machine Learning
|pdfUrl=https://ceur-ws.org/Vol-3169/paper5.pdf
|volume=Vol-3169
|authors=Ricardo Baeza-Yates,Marina Estévez-Almenzar
|dblpUrl=https://dblp.org/rec/conf/ijcai/Baeza-YatesE22
}}
==The Relevance of Non-Human Errors in Machine Learning==
The Relevance of Non-Human Errors in Machine Learning Ricardo Baeza-Yates2,1,∗ , Marina Estévez-Almenzar1,∗ 1 DTIC, Pompeu Fabra University, Barcelona, Spain 2 Institute for Experiential AI, Northeastern University, USA Abstract The current practice of focusing the evaluation of a machine learning model on the accuracy of validation has been lately questioned, and has been declared as a systematic habit that is ignoring some important aspects when developing a possible solution to a problem. This lack of diversity in evaluation procedures reinforces the difference between human and machine perception on the relevance of data features, and reinforces the lack of alignment between the fidelity of current benchmarks and human-centered tasks. Hence, we argue that there is an urgent need to start paying more attention to the search for metrics that, given a task, take into account the most humanly relevant aspects. We propose to base this search on the errors made by the machine and the consequent risks involved in moving human logic away from that of the machine. If we work on identifying these errors and organize them hierarchically according to this logic, we can use this information to provide a reliable evaluation of machine learning models, and improve the alignment between training processes and the different considerations humans make when solving a problem and analyzing outcomes. In this context we define the concept of non-human errors, exemplifying it with an image classification task and discussing its implications. Keywords Machine Learning, Responsible AI, Evaluation, Error Analysis, Non-Human Errors 1. Introduction human and machine perception of the relevance of data features [6]. In fact, in many cases, the best operation Imagine that you enter a skyscraper and the elevator has point of a model is not the one of maximal accuracy. a sign that says: “Works 99% of the time”. Would you take We can state that the benchmark-task misalignment the elevator? Most people would not. However, if the can be directly explained by the misalignment between sign says “Does not work 1% of the time and when that human and machine perceptual mechanisms, and we pro- happens, stops”, you probably would use it, because you pose a simple taxonomy to bridge those differences that perceive that you will be safe thanks to the explanation of are potentially harmful to humans, in order to achieve the error and the possibility to evaluate the consequences: more reliable model training and evaluation procedures, “The elevator may fail, but when it does, the fail consists even if it implies a decrease of the validation accuracy. of stopping”. Today, Machine Learning (ML) models are In order to do that, it is essential to drive a decentral- evaluated primarily on the basis of success rather than ization of the evaluation process in ML models, which is failure. Worse, this evaluation does not take into account mainly focused on maximizing accuracy without paying the potential harm of its mistakes, like is done in the attention to other parameters that could be of great rel- pharmaceutical or the food industry. evance. Hence, our main objective will be to highlight Along the same lines as this example, the current the need to define new methodologies and metrics that benchmarks fidelity to human-centered tasks has re- represent the mechanisms of human perception in a more cently been called into question [1, 2, 3]. The practice realistic way. Obviously, these metrics do not have to of centering the model evaluation on the validation ac- fully represent human perception but should, at least, curacy has been stated as a dangerous habit [4, 5] that cover the most humanly relevant aspects of the task at is ignoring some important aspects of the human per- hand. We propose to base the search of these metrics on ception when developing a solution for a problem, such the different types of errors done by the model. More as carefully studying the risks of the solution and its specifically, we want to focus on how they differ from different points of operation. This lack of diversity in those errors that a human might make. If we work on evaluation procedures reinforces the difference between identifying these errors and organize them hierarchically according to these differences, we can use this infor- EBeM’22: Workshop on AI Evaluation Beyond Metrics, July 25, 2022, mation to provide a more meaningful evaluation of ML Vienna, Austria Envelope-Open rbaeza@acm.org (R. Baeza-Yates); marina.estevez@upf.edu models, and improve the alignment between ML training (M. Estévez-Almenzar) processes and the different considerations humans make GLOBE https://https://www.baeza.cl/ (R. Baeza-Yates); when solving a problem. https://ealmenzar.github.io/ (M. Estévez-Almenzar) Therefore, after discussing the state of the art in Sec- ∗ The authors contributed equally to this work as first authors. tion 2, we introduce the concept of non-human errors in © 2022 Copyright for this paper by its authors. Use permitted under Creative CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) Section 3. This concept allow us to build our error tax- onomy. Then we use an image classification task to do have been proposed. For example, [5] points out that ac- a proof of concept that uses our taxonomy in Section 4, curacy alone cannot distinguish between strategies. Two ending with a discussion of its consequences in Section 5. systems – brains or algorithms – may achieve similar ac- A simple notebook to illustrate this work is available in curacy with very different strategies. In their study, they Github.1 conclude that the consistency between human errors and errors made by deep learning models is not far away from what can be expected by chance alone, indicating that 2. Related Work machines still employ very different perceptual mecha- nisms. There are also some benchmarks that approach The practice of centering the model evaluation on accu- decentralisation with respect to accuracy. In [7] the au- racy has been and still is being questioned. For example, thors propose to study the impact of errors and compare [6] warns that the only-use of accuracy to measure ma- them according to their type, but both the proposed error chine performance works as a limitation to humans when classification they offer and the associated impact focus analyzing machines, and states that adversarial vulner- only on pose estimation algorithms. ability results from the susceptibility of models to data In [8], the author also proposes to pay particular at- features that are potentially candidates for generalization. tention to the analysis of error from a quantitative per- They recall that the fact that we train machines to solely spective. He proposes to focus this analysis on those maximize accuracy is making its learning system use any errors whose correction has the greatest impact in terms available signal to achieve this goal, even those signals of improving the accuracy of the algorithm. Even though that look incomprehensible to humans. this analysis is fundamental, we propose to focus on This idea is also supported by [1], who states that train- improving our models in qualitative terms. Given a con- ing a model in a robust way leads to a reduction of accu- text, if a type of error is sufficiently serious, illogical or racy. They argue that this trade-off between the accuracy risky, it does not matter if it is made infrequently: we of a model and its robustness to adversarial perturbations should work on minimising this type of error in order to is a consequence of robust classifiers learning fundamen- minimise the possible harmful consequences. Another tally different feature representations than standard clas- important discussion that the author mentions is how sifiers. These differences, in particular, seem to result to define human-level performance in order to compare in unexpected benefits: the representations learned by it with the performance of a machine. This highlights robust models tend to align better with salient data char- the importance of considering the context in which the acteristics and human perception. Other trade-offs of machine learning model is applied. The human-level per- this kind were recently addressed for language models formance we choose to consider will depend on the task and their intrinsic risks in [3], where authors state that itself, or the harm risks it may pose. researchers are extending the state of the art on a wide To solve the problem of the misalignment between array of tasks as measured by classification scores on state-of-the-art benchmarks and human-centered tasks, some benchmarks, following the methodology of using some of the works mentioned above propose as the main some pre-trained models and then fine-tuning them for solution either to correct the labels in data sets or redefine specific tasks. In this scenario, they take a step back and the way we store and represent these labels. Even when pay attention to the possible risks associated with this these corrections are essential, using correctly labeled or technology in terms of environmental and financial costs. redefined labels to evaluate models could still not be suf- Another research that calls into question the results ficient to cover the diversity of human perception. While obtained with current state-of-the-art benchmarks was working on improving the representations associated done by [4], that highlights the fact that some data sets with the inputs in ML systems is very necessary, we need contain errors in their labels, and they expose a subse- to do a similar effort on improving the way we interpret quent study about the potential for these label errors to and analyze the outputs. These outputs are mainly char- affect benchmark results. Surprisingly, they find that acterized by two elements: successes and failures. Until lower capacity models may be practically more useful now, ML evaluation metrics have been mainly based on than higher capacity models in real-world data sets with successes, giving visibility to the accuracy of the algo- high proportions of erroneously labeled data. They con- rithm over other possible ways of measuring the overall clude that ML practitioners must be careful when choos- performance of the algorithm. We propose to change the ing which model to deploy based on validation or test focus and start prioritizing the analysis of errors, as well accuracy. as their classification according to the potential damage In an attempt to overcome the limitation that entails they may cause in the context in which they occur. the only-use of accuracy in ML evaluation, other metrics 1 https://github.com/ealmenzar/non-human-errors crete use case is specified, we wonder whether we can determine which errors are the most relevant in terms of human harm risk. Although harm risk may be perceived very differently depending on the performer (human or machine), it is clear that we are interested in avoiding harmful consequences for the humans involved, directly or indirectly, in the task at hand. It is reasonable to think (a) (b) that the errors related to these consequences are those that are unexpected and atypical for humans, and there- Figure 1: Visual representation of human and ML perfor- fore those that are difficult for us to explain and control. mances. On the left, the red triangle represents the ML model, the blue ellipse represents the human, and the green sphere Since, as humans, we are accustomed to human errors, represents the ground truth. They are positioned in the solu- we might expect that those errors that are furthest away tion space of a binary prediction problem. For every figure, from the errors that a human might make could be con- positive answers are inside, and negative answers are outside, sidered risky: we refer to these types of disparate errors being the correct answers determined by the green sphere. On as non-human errors (see Figure 3). the right we can see the yellow region representing the correct We can also formalise this idea in terms of mathemat- answers obtained by both the human and the ML model. ical sets. This will help us to formally define the differ- ent types of errors mentioned and graphically expressed above. We denote 𝑆 as the green sphere, 𝑇 as the red trian- 3. Non-Human Errors gle, and 𝐸 as the blue ellipse (see in Figure 1a). Following the logic explained above, we could consider these sets Finding new methodologies and metrics that allow us to of points in the solution space (and their complementary exploit the valuable information in errors done by ma- sets, noted as 𝑆, 𝑇, and 𝐸 respectively) as follows: chines is not trivial. To illustrate in a simplified way the error exploration that we are proposing, let us consider 𝑆 ≡ true positives a problem with a binary solution space. In this space, we 𝑇 ≡ positives predicted by the model are given the ground truth, so we are capable of deter- 𝐸 ≡ positives predicted by the human mining whether a point from the space is a correct or an 𝑆 ≡ true negatives incorrect answer, as well as its impact. For example, a typical assumption when solving a binary classification 𝑇 ≡ negatives predicted by the machine problem, is to consider that false negatives have the same 𝐸 ≡ negatives predicted by the human weight of false positives. This is not always correct, be- cause their harm might be quite different. Indeed, when Focusing on the errors shown in Figure 2, now we de- predicting an illness, a physician will prefer to see many note the false positives errors made only by the machine more healthy patients just to avoid missing any ill one. as 𝑃𝑚 (Figure 2d), and we define this set of errors as One solution is to use a weighted accuracy but still the operational point might be different because here the re- 𝑃𝑚 = 𝑇 ∩ 𝐸 ∩ 𝑆 call of ill patients is much more relevant than the overall Similarly, we denote and define the false positives er- accuracy (weighted or not). rors made only by the human (Figure 2e), the false pos- In Figure 1 we can see this ground truth represented itive errors made by both the machine and the human as a green sphere, positioned on the solution space, such (Figure 2f), the false negative errors made only by the that those points that fall into the green area are the true machine (Figure 2d), the false negative errors made only positive answers, and the rest of the points are the true by the human (Figure 2b), and the false negative errors negative answers. We can see two more shapes in this made by both the machine and the human (Figure 2c) as space that symbolize the perceptual agents; a red triangle follows, respectively: representing the machine predictions, and a blue ellipse representing the human predictions. Following the pre- 𝑃ℎ = 𝑇 ∩ 𝐸 ∩ 𝑆 vious logic, in Figure 1b we can see the true answers 𝑃𝑏 = 𝑇 ∩ 𝐸 ∩ 𝑆 correctly predicted by both the model and a human. In Figure 2, we focus on the errors. Here we are able to dis- 𝑁𝑚 = 𝑇 ∩ 𝐸 ∩ 𝑆 tinguish between two kinds of errors: false positives and 𝑁ℎ = 𝑇 ∩ 𝐸 ∩ 𝑆 false negatives. And we make another distinction based 𝑁𝑏 = 𝑇 ∩ 𝐸 ∩ 𝑆 on the entity that is making the error (human and/or ML model). Note that all these sets are disjoint because of the exclu- In this general and abstract scenario, where no con- sivity imposed when considering which agent commits (a) (b) (c) (d) (e) (f) Figure 2: Visual representation of false negatives and false positives attributed to the model, the human, or to both. The first three diagrams ((a), (b) and (c)) represent false negatives errors done by only the model, only the human, or by both, respectively. The next three diagrams ((d), (e) and (f)) represent the false positives errors done by only the model, only the human, or by both, respectively. Error areas are overstated to emphasize the idea. points. The distinction of non-human errors is based on the distinction between the successes and mistakes made by the different perceptual agents (human and machine). Also, this distinction only makes sense in a context in which we can expect reasonable human performance. Thus, the category of non-human errors can be found in those human centered tasks that can be at least partially solved in a reasonable way by the humans and where a ML algorithm is applied instead. However, in practice, consideration of the specific use case will be decisive. We next apply this idea to a simple but illustrative Figure 3: Non-human errors stressed in red: both false neg- problem: classifying images of dogs and cats according atives and false positives done by the ML model but not by to their breed [9]. This translates into a fine-grained im- humans (cases (a) and (d) in Figure 2). age classification problem that is mainly solved by using deep neural networks. Based on expert sources in the classification of these animals (FIFe and FCI Federations), the error. Now the sets of interest arise from the union of we have been able to construct a taxonomy that repre- some of the previous sets. We note 𝑀 as the non-human sents the possible errors that can be made in this task. errors (those committed by the machine but not by the Following our definition of non-human errors, in this human) explained above, 𝐻 as those errors committed by problem we can identify them as those errors that are the human but not by the machine, and 𝐵 as those errors fundamentally different from the errors that a human committed by both the human and the machine together: solving this task would commit. Therefore, we define as non-human errors those cases in which the machine 𝑀 = 𝑁 𝑚 ∪ 𝑃𝑚 classifies a dog as a cat, or vice versa (see Figure 4). No- 𝐻 = 𝑁 ℎ ∪ 𝑃ℎ tice that there might be other non-human errors when 𝐵 = 𝑁 𝑏 ∪ 𝑃𝑏 comparing among only cats or dogs, but those are much less important and less common than the definition that In this paper we focus on 𝑀, non-human errors, which we use for this proof of concept and provides a lower we believe are the errors that we should address first be- bound for non-human errors. cause of the harm risks that could be involved in making So far, we have selected one of the top-ranked algo- errors that escape human logic. But how can we precisely rithms for solving this specific task, the Big Transfer (BiT) determine these errors? How can we measure how far model from [10], which achieved 93% of accuracy. In Fig- an answer should be from human logic in order to call it ure 6 we give the full confusion matrix of 3,312 prediction a non-human error? We address these challenges in the pairs among 25 dog breeds (top-left) and 12 cat breeds next section. (bottom-right), where we can see that there are 4 pairs that are hard to classify (two breeds of Terriers and 3 pairs of cat breeds). Notice that this confusion matrix 4. Proof of Concept is in general non-symmetric, as the output of the model Approaching a problem by adopting the previous abstract may differ because the input and the prediction for each perspective allows us to visualize it with some indepen- pair is different. dence from the use case or real-world application, which Here we found that more than 3% of the errors were is good for understanding the wide range of operating non-human errors (8 of 241 errors), which appear as light Figure 4: Part of the error taxonomy obtained from the anal- ysis made with Oxford-IIIT Pets data set [9]. In red, one of the most common non-human errors committed by the BiT model [10]. Figure 6: Confusion matrix for the Oxford-IIIT Pets data set [9]. Darker the squares, more errors were made for that pair of breeds. 5. Discussion (a) Why should we care if algorithms mistake dogs for cats? This is clear when similar tasks are proposed in fields where the lives of human beings and their fundamen- tal rights are at risk of being left unprotected. In these fields, even in the case of a low percentage of non-human errors, the consequences could have a catastrophic and ir- reversible impact. One concrete such example happened (b) in 2018, when a Uber self-driving car was not able to rec- ognize a woman in a bicycle crossing a road at night in Figure 5: Two of the non-human errors obtained when run- Tempe, Arizona.2 . A human most probably would have ning the BiT model [10] over the Oxford-IIIT Pets data set [9]. recognized the woman and hence this is a non-human In (a) a Chihuahua is mistaken for an Abyssinian cat with a error. We do not know if the backup driver could have confidence of 46.24%. In (b) a Bengal cat is mistaken for a Chi- reacted on time, but she was seeing a video as the car huahua with a confidence of 20.83%, a percentage very close to the one of the second option in the list of breeds sorted by was working well until then. Finally, she was charged their probability of being selected as the tag for that image. of negligence, as Uber quickly settled with the family of the victim to avoid being sued [11].Hence, this event at the end impacted the lives of two women. One related issue that we do not discuss is another bad squares in the top-right and bottom-left of Figure 6. Two habit: predicting an answer even when we have low con- of these errors are shown in Figure 5, where a Chihuahua fidence. For example, in Figure 5 (b), any smart/honest is classified as an Abyssinian cat (Figure 5a) and a Bengal person would say “I don’t know” with such low confi- cat is classified as a Chihuahua (Figure 5b). However, dence. Even in case (a), if there is a harm risk, not giving there is a notable difference between these two errors: an answer might be a safer output. In the Uber example is the certainty of the answer provided by the algorithm. the same. Predicting ”I don’t know” and stopping, might This supports the need to start providing new metrics. be safer than predicting ”there is no human in front of me In this case, for instance, it would be interesting to focus and is safer to run over the object” (notice that the later on the extent to which an algorithm is, under unreliable assumption might be still dangerous for the passengers). certainty, either predicting correctly or erring, regardless of whether the answer is right or wrong. 2 https://www.theguardian.com/technology/2018/mar/19/uber- self-driving-car-kills-woman-arizona-tempe A. Madry, From imagenet to image classification: Contextualizing progress on benchmarks, in: Inter- national Conference on Machine Learning, PMLR, 2020, pp. 9625–9635. [3] E. M. Bender, T. Gebru, A. McMillan-Major, S. Shmitchell, On the dangers of stochastic par- rots: Can language models be too big?, in: FAccT ’21: 2021 ACM Conference on Fairness, Account- ability, and Transparency, Virtual Event / Toronto, Canada, March 3-10, 2021, ACM, 2021, pp. 610–623. [4] C. G. Northcutt, A. Athalye, J. Mueller, Pervasive Figure 7: Classification of white blood cells. As it happened label errors in test sets destabilize machine learning with cats and dogs, after further investigation on the risk benchmarks, arXiv preprint 2103.14749 (2021). associated to mistake one cell for another [12, 13], this tree [5] R. Geirhos, K. Meding, F. A. Wichmann, Beyond could be used as a taxonomy to define disparate errors. Images collected by [14]. accuracy: Quantifying trial-by-trial behaviour of cnns and humans by measuring error consistency, arXiv preprint arXiv:2006.16736 (2020). [6] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, In fact the self-driving car did predict a bicycle one of the B. Tran, A. Madry, Adversarial examples are times [11]. not bugs, they are features, arXiv preprint We are currently working on a problem that is tech- arXiv:1905.02175 (2019). nically very similar to the classification of dogs and cats [7] M. Ruggero Ronchi, P. Perona, Benchmarking and according to breed, but is a real-life application that is error diagnosis in multi-instance pose estimation, much more relevant to humans: the classification of in: Proceedings of the IEEE international confer- white blood cells. This problem is also formulated as ence on computer vision, 2017, pp. 369–378. a fine-grained image classification problem and, even [8] A. Ng, Machine learning yearning, when the number of different classes of elements is much 2018. URL: https://info.deeplearning.ai/ smaller than in the previous example (see Figure 7), their machine-learning-yearning-book. differentiation is very important. Indeed, [12] points out [9] O. M. Parkhi, A. Vedaldi, A. Zisserman, C. Jawahar, that neutrophil levels were associated with breast can- Cats and dogs, in: 2012 IEEE conference on com- cer risk, including advanced stages of breast cancer. In puter vision and pattern recognition, IEEE, 2012, the meta-analysis proposed by [13], it was shown that pp. 3498–3505. breast cancer patients with a higher ratio of neutrophils [10] A. Kolesnikov, L. Beyer, X. Zhai, J. Puigcerver, to lymphocytes had a higher relapse and lower overall J. Yung, S. Gelly, N. Houlsby, Big transfer (bit): survival. General visual representation learning, in: 16th The importance of including an in-depth study of the European Conference on Computer Vision, Part V errors that an algorithm could make in this classification 16, Springer, 2020, pp. 491–507. is evident, just as it is fundamental that in these complex [11] L. Smiley, ‘I’m the operator’: The aftermath of a use cases, both the evaluations of the algorithms and their self-driving tragedy, Wired (2022). publication are accompanied by the corresponding pa- [12] Y. Okuturlar, M. Gunaldi, E. E. Tiken, B. Ozto- rameters or new metrics that make visible the different er- sun, Y. O. Inan, T. Ercan, S. Tuna, A. O. Kaya, rors made, their frequency and their associated risk based O. Harmankaya, A. Kumbasar, Utility of periph- on professional knowledge. Moreover, these parameters eral blood parameters in predicting breast cancer could provide not only transparency and explainability risk, Asian Pacific Journal of Cancer Prevention 16 to the model, but also valuable clues to researchers that (2015) 2409–2412. would allow the algorithms to be improved in terms of [13] B. Wei, M. Yao, C. Xing, W. Wang, J. Yao, Y. Hong, human-centered responsibility and accountability. Y. Liu, P. Fu, The neutrophil lymphocyte ratio is as- sociated with breast cancer prognosis: an updated systematic review and meta-analysis, OncoTargets References and therapy 9 (2016). [1] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, [14] X. Zheng, Y. Wang, G. Wang, J. Liu, Fast A. Madry, Robustness may be at odds with accuracy, and robust segmentation of white blood cell im- arXiv preprint arXiv:1805.12152 (2018). ages by self-supervised learning, Micron 107 [2] D. Tsipras, S. Santurkar, L. Engstrom, A. Ilyas, (2018) 55–71. URL: https://www.sciencedirect.com/ science/article/pii/S0968432817303037.