=Paper= {{Paper |id=Vol-3169/paper5 |storemode=property |title=The Relevance of Non-Human Errors in Machine Learning |pdfUrl=https://ceur-ws.org/Vol-3169/paper5.pdf |volume=Vol-3169 |authors=Ricardo Baeza-Yates,Marina Estévez-Almenzar |dblpUrl=https://dblp.org/rec/conf/ijcai/Baeza-YatesE22 }} ==The Relevance of Non-Human Errors in Machine Learning== https://ceur-ws.org/Vol-3169/paper5.pdf
The Relevance of Non-Human Errors in Machine Learning
Ricardo Baeza-Yates2,1,∗ , Marina Estévez-Almenzar1,∗
1
    DTIC, Pompeu Fabra University, Barcelona, Spain
2
    Institute for Experiential AI, Northeastern University, USA


                                             Abstract
                                             The current practice of focusing the evaluation of a machine learning model on the accuracy of validation has been lately
                                             questioned, and has been declared as a systematic habit that is ignoring some important aspects when developing a possible
                                             solution to a problem. This lack of diversity in evaluation procedures reinforces the difference between human and machine
                                             perception on the relevance of data features, and reinforces the lack of alignment between the fidelity of current benchmarks
                                             and human-centered tasks. Hence, we argue that there is an urgent need to start paying more attention to the search for
                                             metrics that, given a task, take into account the most humanly relevant aspects. We propose to base this search on the errors
                                             made by the machine and the consequent risks involved in moving human logic away from that of the machine. If we work
                                             on identifying these errors and organize them hierarchically according to this logic, we can use this information to provide a
                                             reliable evaluation of machine learning models, and improve the alignment between training processes and the different
                                             considerations humans make when solving a problem and analyzing outcomes. In this context we define the concept of
                                             non-human errors, exemplifying it with an image classification task and discussing its implications.

                                             Keywords
                                             Machine Learning, Responsible AI, Evaluation, Error Analysis, Non-Human Errors



1. Introduction                                                                                                       human and machine perception of the relevance of data
                                                                                                                      features [6]. In fact, in many cases, the best operation
Imagine that you enter a skyscraper and the elevator has                                                              point of a model is not the one of maximal accuracy.
a sign that says: “Works 99% of the time”. Would you take                                                                We can state that the benchmark-task misalignment
the elevator? Most people would not. However, if the                                                                  can be directly explained by the misalignment between
sign says “Does not work 1% of the time and when that                                                                 human and machine perceptual mechanisms, and we pro-
happens, stops”, you probably would use it, because you                                                               pose a simple taxonomy to bridge those differences that
perceive that you will be safe thanks to the explanation of                                                           are potentially harmful to humans, in order to achieve
the error and the possibility to evaluate the consequences:                                                           more reliable model training and evaluation procedures,
“The elevator may fail, but when it does, the fail consists                                                           even if it implies a decrease of the validation accuracy.
of stopping”. Today, Machine Learning (ML) models are                                                                    In order to do that, it is essential to drive a decentral-
evaluated primarily on the basis of success rather than                                                               ization of the evaluation process in ML models, which is
failure. Worse, this evaluation does not take into account                                                            mainly focused on maximizing accuracy without paying
the potential harm of its mistakes, like is done in the                                                               attention to other parameters that could be of great rel-
pharmaceutical or the food industry.                                                                                  evance. Hence, our main objective will be to highlight
   Along the same lines as this example, the current                                                                  the need to define new methodologies and metrics that
benchmarks fidelity to human-centered tasks has re-                                                                   represent the mechanisms of human perception in a more
cently been called into question [1, 2, 3]. The practice                                                              realistic way. Obviously, these metrics do not have to
of centering the model evaluation on the validation ac-                                                               fully represent human perception but should, at least,
curacy has been stated as a dangerous habit [4, 5] that                                                               cover the most humanly relevant aspects of the task at
is ignoring some important aspects of the human per-                                                                  hand. We propose to base the search of these metrics on
ception when developing a solution for a problem, such                                                                the different types of errors done by the model. More
as carefully studying the risks of the solution and its                                                               specifically, we want to focus on how they differ from
different points of operation. This lack of diversity in                                                              those errors that a human might make. If we work on
evaluation procedures reinforces the difference between                                                               identifying these errors and organize them hierarchically
                                                                                                                      according to these differences, we can use this infor-
EBeM’22: Workshop on AI Evaluation Beyond Metrics, July 25, 2022,                                                     mation to provide a more meaningful evaluation of ML
Vienna, Austria
Envelope-Open rbaeza@acm.org (R. Baeza-Yates); marina.estevez@upf.edu                                                 models, and improve the alignment between ML training
(M. Estévez-Almenzar)                                                                                                 processes and the different considerations humans make
GLOBE https://https://www.baeza.cl/ (R. Baeza-Yates);                                                                 when solving a problem.
https://ealmenzar.github.io/ (M. Estévez-Almenzar)                                                                       Therefore, after discussing the state of the art in Sec-
               ∗
                 The authors contributed equally to this work as first authors.                                       tion 2, we introduce the concept of non-human errors in
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       Commons License Attribution 4.0 International (CC BY 4.0).
                                       CEUR Workshop Proceedings (CEUR-WS.org)                                        Section 3. This concept allow us to build our error tax-
onomy. Then we use an image classification task to do           have been proposed. For example, [5] points out that ac-
a proof of concept that uses our taxonomy in Section 4,         curacy alone cannot distinguish between strategies. Two
ending with a discussion of its consequences in Section 5.      systems – brains or algorithms – may achieve similar ac-
A simple notebook to illustrate this work is available in       curacy with very different strategies. In their study, they
Github.1                                                        conclude that the consistency between human errors and
                                                                errors made by deep learning models is not far away from
                                                                what can be expected by chance alone, indicating that
2. Related Work                                                 machines still employ very different perceptual mecha-
                                                                nisms. There are also some benchmarks that approach
The practice of centering the model evaluation on accu-
                                                                decentralisation with respect to accuracy. In [7] the au-
racy has been and still is being questioned. For example,
                                                                thors propose to study the impact of errors and compare
[6] warns that the only-use of accuracy to measure ma-
                                                                them according to their type, but both the proposed error
chine performance works as a limitation to humans when
                                                                classification they offer and the associated impact focus
analyzing machines, and states that adversarial vulner-
                                                                only on pose estimation algorithms.
ability results from the susceptibility of models to data
                                                                   In [8], the author also proposes to pay particular at-
features that are potentially candidates for generalization.
                                                                tention to the analysis of error from a quantitative per-
They recall that the fact that we train machines to solely
                                                                spective. He proposes to focus this analysis on those
maximize accuracy is making its learning system use any
                                                                errors whose correction has the greatest impact in terms
available signal to achieve this goal, even those signals
                                                                of improving the accuracy of the algorithm. Even though
that look incomprehensible to humans.
                                                                this analysis is fundamental, we propose to focus on
   This idea is also supported by [1], who states that train-
                                                                improving our models in qualitative terms. Given a con-
ing a model in a robust way leads to a reduction of accu-
                                                                text, if a type of error is sufficiently serious, illogical or
racy. They argue that this trade-off between the accuracy
                                                                risky, it does not matter if it is made infrequently: we
of a model and its robustness to adversarial perturbations
                                                                should work on minimising this type of error in order to
is a consequence of robust classifiers learning fundamen-
                                                                minimise the possible harmful consequences. Another
tally different feature representations than standard clas-
                                                                important discussion that the author mentions is how
sifiers. These differences, in particular, seem to result
                                                                to define human-level performance in order to compare
in unexpected benefits: the representations learned by
                                                                it with the performance of a machine. This highlights
robust models tend to align better with salient data char-
                                                                the importance of considering the context in which the
acteristics and human perception. Other trade-offs of
                                                                machine learning model is applied. The human-level per-
this kind were recently addressed for language models
                                                                formance we choose to consider will depend on the task
and their intrinsic risks in [3], where authors state that
                                                                itself, or the harm risks it may pose.
researchers are extending the state of the art on a wide
                                                                   To solve the problem of the misalignment between
array of tasks as measured by classification scores on
                                                                state-of-the-art benchmarks and human-centered tasks,
some benchmarks, following the methodology of using
                                                                some of the works mentioned above propose as the main
some pre-trained models and then fine-tuning them for
                                                                solution either to correct the labels in data sets or redefine
specific tasks. In this scenario, they take a step back and
                                                                the way we store and represent these labels. Even when
pay attention to the possible risks associated with this
                                                                these corrections are essential, using correctly labeled or
technology in terms of environmental and financial costs.
                                                                redefined labels to evaluate models could still not be suf-
   Another research that calls into question the results
                                                                ficient to cover the diversity of human perception. While
obtained with current state-of-the-art benchmarks was
                                                                working on improving the representations associated
done by [4], that highlights the fact that some data sets
                                                                with the inputs in ML systems is very necessary, we need
contain errors in their labels, and they expose a subse-
                                                                to do a similar effort on improving the way we interpret
quent study about the potential for these label errors to
                                                                and analyze the outputs. These outputs are mainly char-
affect benchmark results. Surprisingly, they find that
                                                                acterized by two elements: successes and failures. Until
lower capacity models may be practically more useful
                                                                now, ML evaluation metrics have been mainly based on
than higher capacity models in real-world data sets with
                                                                successes, giving visibility to the accuracy of the algo-
high proportions of erroneously labeled data. They con-
                                                                rithm over other possible ways of measuring the overall
clude that ML practitioners must be careful when choos-
                                                                performance of the algorithm. We propose to change the
ing which model to deploy based on validation or test
                                                                focus and start prioritizing the analysis of errors, as well
accuracy.
                                                                as their classification according to the potential damage
   In an attempt to overcome the limitation that entails
                                                                they may cause in the context in which they occur.
the only-use of accuracy in ML evaluation, other metrics

    1
        https://github.com/ealmenzar/non-human-errors
                                                                  crete use case is specified, we wonder whether we can
                                                                  determine which errors are the most relevant in terms of
                                                                  human harm risk. Although harm risk may be perceived
                                                                  very differently depending on the performer (human or
                                                                  machine), it is clear that we are interested in avoiding
                                                                  harmful consequences for the humans involved, directly
                                                                  or indirectly, in the task at hand. It is reasonable to think
            (a)                                (b)                that the errors related to these consequences are those
                                                                  that are unexpected and atypical for humans, and there-
Figure 1: Visual representation of human and ML perfor-
                                                                  fore those that are difficult for us to explain and control.
mances. On the left, the red triangle represents the ML model,
the blue ellipse represents the human, and the green sphere       Since, as humans, we are accustomed to human errors,
represents the ground truth. They are positioned in the solu-     we might expect that those errors that are furthest away
tion space of a binary prediction problem. For every figure,      from the errors that a human might make could be con-
positive answers are inside, and negative answers are outside,    sidered risky: we refer to these types of disparate errors
being the correct answers determined by the green sphere. On      as non-human errors (see Figure 3).
the right we can see the yellow region representing the correct      We can also formalise this idea in terms of mathemat-
answers obtained by both the human and the ML model.              ical sets. This will help us to formally define the differ-
                                                                  ent types of errors mentioned and graphically expressed
                                                                  above. We denote 𝑆 as the green sphere, 𝑇 as the red trian-
3. Non-Human Errors                                               gle, and 𝐸 as the blue ellipse (see in Figure 1a). Following
                                                                  the logic explained above, we could consider these sets
Finding new methodologies and metrics that allow us to            of points in the solution space (and their complementary
exploit the valuable information in errors done by ma-            sets, noted as 𝑆, 𝑇, and 𝐸 respectively) as follows:
chines is not trivial. To illustrate in a simplified way the
error exploration that we are proposing, let us consider                  𝑆 ≡ true positives
a problem with a binary solution space. In this space, we                 𝑇 ≡ positives predicted by the model
are given the ground truth, so we are capable of deter-                   𝐸 ≡ positives predicted by the human
mining whether a point from the space is a correct or an
                                                                          𝑆 ≡ true negatives
incorrect answer, as well as its impact. For example, a
typical assumption when solving a binary classification                   𝑇 ≡ negatives predicted by the machine
problem, is to consider that false negatives have the same                𝐸 ≡ negatives predicted by the human
weight of false positives. This is not always correct, be-
cause their harm might be quite different. Indeed, when              Focusing on the errors shown in Figure 2, now we de-
predicting an illness, a physician will prefer to see many        note the false positives errors made only by the machine
more healthy patients just to avoid missing any ill one.          as 𝑃𝑚 (Figure 2d), and we define this set of errors as
One solution is to use a weighted accuracy but still the
operational point might be different because here the re-                                𝑃𝑚 = 𝑇 ∩ 𝐸 ∩ 𝑆
call of ill patients is much more relevant than the overall          Similarly, we denote and define the false positives er-
accuracy (weighted or not).                                       rors made only by the human (Figure 2e), the false pos-
   In Figure 1 we can see this ground truth represented           itive errors made by both the machine and the human
as a green sphere, positioned on the solution space, such         (Figure 2f), the false negative errors made only by the
that those points that fall into the green area are the true      machine (Figure 2d), the false negative errors made only
positive answers, and the rest of the points are the true         by the human (Figure 2b), and the false negative errors
negative answers. We can see two more shapes in this              made by both the machine and the human (Figure 2c) as
space that symbolize the perceptual agents; a red triangle        follows, respectively:
representing the machine predictions, and a blue ellipse
representing the human predictions. Following the pre-                                   𝑃ℎ = 𝑇 ∩ 𝐸 ∩ 𝑆
vious logic, in Figure 1b we can see the true answers                                    𝑃𝑏 = 𝑇 ∩ 𝐸 ∩ 𝑆
correctly predicted by both the model and a human. In
Figure 2, we focus on the errors. Here we are able to dis-                              𝑁𝑚 = 𝑇 ∩ 𝐸 ∩ 𝑆
tinguish between two kinds of errors: false positives and                               𝑁ℎ = 𝑇 ∩ 𝐸 ∩ 𝑆
false negatives. And we make another distinction based
                                                                                         𝑁𝑏 = 𝑇 ∩ 𝐸 ∩ 𝑆
on the entity that is making the error (human and/or ML
model).                                                              Note that all these sets are disjoint because of the exclu-
   In this general and abstract scenario, where no con-           sivity imposed when considering which agent commits
        (a)                  (b)                  (c)                   (d)                  (e)                  (f)

Figure 2: Visual representation of false negatives and false positives attributed to the model, the human, or to both. The
first three diagrams ((a), (b) and (c)) represent false negatives errors done by only the model, only the human, or by both,
respectively. The next three diagrams ((d), (e) and (f)) represent the false positives errors done by only the model, only the
human, or by both, respectively. Error areas are overstated to emphasize the idea.



                                                                points. The distinction of non-human errors is based on
                                                                the distinction between the successes and mistakes made
                                                                by the different perceptual agents (human and machine).
                                                                Also, this distinction only makes sense in a context in
                                                                which we can expect reasonable human performance.
                                                                Thus, the category of non-human errors can be found in
                                                                those human centered tasks that can be at least partially
                                                                solved in a reasonable way by the humans and where a
                                                                ML algorithm is applied instead. However, in practice,
                                                                consideration of the specific use case will be decisive.
                                                                   We next apply this idea to a simple but illustrative
Figure 3: Non-human errors stressed in red: both false neg-     problem: classifying images of dogs and cats according
atives and false positives done by the ML model but not by
                                                                to their breed [9]. This translates into a fine-grained im-
humans (cases (a) and (d) in Figure 2).
                                                                age classification problem that is mainly solved by using
                                                                deep neural networks. Based on expert sources in the
                                                                classification of these animals (FIFe and FCI Federations),
the error. Now the sets of interest arise from the union of     we have been able to construct a taxonomy that repre-
some of the previous sets. We note 𝑀 as the non-human           sents the possible errors that can be made in this task.
errors (those committed by the machine but not by the           Following our definition of non-human errors, in this
human) explained above, 𝐻 as those errors committed by          problem we can identify them as those errors that are
the human but not by the machine, and 𝐵 as those errors         fundamentally different from the errors that a human
committed by both the human and the machine together:           solving this task would commit. Therefore, we define
                                                                as non-human errors those cases in which the machine
                       𝑀 = 𝑁 𝑚 ∪ 𝑃𝑚
                                                                classifies a dog as a cat, or vice versa (see Figure 4). No-
                       𝐻 = 𝑁 ℎ ∪ 𝑃ℎ                             tice that there might be other non-human errors when
                       𝐵 = 𝑁 𝑏 ∪ 𝑃𝑏                             comparing among only cats or dogs, but those are much
                                                                less important and less common than the definition that
   In this paper we focus on 𝑀, non-human errors, which         we use for this proof of concept and provides a lower
we believe are the errors that we should address first be-      bound for non-human errors.
cause of the harm risks that could be involved in making           So far, we have selected one of the top-ranked algo-
errors that escape human logic. But how can we precisely        rithms for solving this specific task, the Big Transfer (BiT)
determine these errors? How can we measure how far              model from [10], which achieved 93% of accuracy. In Fig-
an answer should be from human logic in order to call it        ure 6 we give the full confusion matrix of 3,312 prediction
a non-human error? We address these challenges in the           pairs among 25 dog breeds (top-left) and 12 cat breeds
next section.                                                   (bottom-right), where we can see that there are 4 pairs
                                                                that are hard to classify (two breeds of Terriers and 3
                                                                pairs of cat breeds). Notice that this confusion matrix
4. Proof of Concept                                             is in general non-symmetric, as the output of the model
Approaching a problem by adopting the previous abstract         may differ because the input and the prediction for each
perspective allows us to visualize it with some indepen-        pair is different.
dence from the use case or real-world application, which           Here we found that more than 3% of the errors were
is good for understanding the wide range of operating           non-human errors (8 of 241 errors), which appear as light
Figure 4: Part of the error taxonomy obtained from the anal-
ysis made with Oxford-IIIT Pets data set [9]. In red, one of
the most common non-human errors committed by the BiT
model [10].




                                                                   Figure 6: Confusion matrix for the Oxford-IIIT Pets data set
                                                                   [9]. Darker the squares, more errors were made for that pair
                                                                   of breeds.



                                                                   5. Discussion
                                 (a)
                                                                   Why should we care if algorithms mistake dogs for cats?
                                                                   This is clear when similar tasks are proposed in fields
                                                                   where the lives of human beings and their fundamen-
                                                                   tal rights are at risk of being left unprotected. In these
                                                                   fields, even in the case of a low percentage of non-human
                                                                   errors, the consequences could have a catastrophic and ir-
                                                                   reversible impact. One concrete such example happened
                                 (b)
                                                                   in 2018, when a Uber self-driving car was not able to rec-
                                                                   ognize a woman in a bicycle crossing a road at night in
Figure 5: Two of the non-human errors obtained when run-           Tempe, Arizona.2 . A human most probably would have
ning the BiT model [10] over the Oxford-IIIT Pets data set [9].    recognized the woman and hence this is a non-human
In (a) a Chihuahua is mistaken for an Abyssinian cat with a        error. We do not know if the backup driver could have
confidence of 46.24%. In (b) a Bengal cat is mistaken for a Chi-
                                                                   reacted on time, but she was seeing a video as the car
huahua with a confidence of 20.83%, a percentage very close
to the one of the second option in the list of breeds sorted by
                                                                   was working well until then. Finally, she was charged
their probability of being selected as the tag for that image.     of negligence, as Uber quickly settled with the family of
                                                                   the victim to avoid being sued [11].Hence, this event at
                                                                   the end impacted the lives of two women.
                                                                      One related issue that we do not discuss is another bad
squares in the top-right and bottom-left of Figure 6. Two          habit: predicting an answer even when we have low con-
of these errors are shown in Figure 5, where a Chihuahua           fidence. For example, in Figure 5 (b), any smart/honest
is classified as an Abyssinian cat (Figure 5a) and a Bengal        person would say “I don’t know” with such low confi-
cat is classified as a Chihuahua (Figure 5b). However,             dence. Even in case (a), if there is a harm risk, not giving
there is a notable difference between these two errors:            an answer might be a safer output. In the Uber example is
the certainty of the answer provided by the algorithm.             the same. Predicting ”I don’t know” and stopping, might
This supports the need to start providing new metrics.             be safer than predicting ”there is no human in front of me
In this case, for instance, it would be interesting to focus       and is safer to run over the object” (notice that the later
on the extent to which an algorithm is, under unreliable           assumption might be still dangerous for the passengers).
certainty, either predicting correctly or erring, regardless
of whether the answer is right or wrong.                                2
                                                                          https://www.theguardian.com/technology/2018/mar/19/uber-
                                                                   self-driving-car-kills-woman-arizona-tempe
                                                                      A. Madry, From imagenet to image classification:
                                                                      Contextualizing progress on benchmarks, in: Inter-
                                                                      national Conference on Machine Learning, PMLR,
                                                                      2020, pp. 9625–9635.
                                                                  [3] E. M. Bender, T. Gebru, A. McMillan-Major,
                                                                      S. Shmitchell, On the dangers of stochastic par-
                                                                      rots: Can language models be too big?, in: FAccT
                                                                      ’21: 2021 ACM Conference on Fairness, Account-
                                                                      ability, and Transparency, Virtual Event / Toronto,
                                                                      Canada, March 3-10, 2021, ACM, 2021, pp. 610–623.
                                                                  [4] C. G. Northcutt, A. Athalye, J. Mueller, Pervasive
Figure 7: Classification of white blood cells. As it happened         label errors in test sets destabilize machine learning
with cats and dogs, after further investigation on the risk
                                                                      benchmarks, arXiv preprint 2103.14749 (2021).
associated to mistake one cell for another [12, 13], this tree
                                                                  [5] R. Geirhos, K. Meding, F. A. Wichmann, Beyond
could be used as a taxonomy to define disparate errors. Images
collected by [14].                                                    accuracy: Quantifying trial-by-trial behaviour of
                                                                      cnns and humans by measuring error consistency,
                                                                      arXiv preprint arXiv:2006.16736 (2020).
                                                                  [6] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom,
In fact the self-driving car did predict a bicycle one of the         B. Tran, A. Madry, Adversarial examples are
times [11].                                                           not bugs, they are features,           arXiv preprint
   We are currently working on a problem that is tech-                arXiv:1905.02175 (2019).
nically very similar to the classification of dogs and cats       [7] M. Ruggero Ronchi, P. Perona, Benchmarking and
according to breed, but is a real-life application that is            error diagnosis in multi-instance pose estimation,
much more relevant to humans: the classification of                   in: Proceedings of the IEEE international confer-
white blood cells. This problem is also formulated as                 ence on computer vision, 2017, pp. 369–378.
a fine-grained image classification problem and, even             [8] A.      Ng,      Machine        learning    yearning,
when the number of different classes of elements is much              2018.        URL:        https://info.deeplearning.ai/
smaller than in the previous example (see Figure 7), their            machine-learning-yearning-book.
differentiation is very important. Indeed, [12] points out        [9] O. M. Parkhi, A. Vedaldi, A. Zisserman, C. Jawahar,
that neutrophil levels were associated with breast can-               Cats and dogs, in: 2012 IEEE conference on com-
cer risk, including advanced stages of breast cancer. In              puter vision and pattern recognition, IEEE, 2012,
the meta-analysis proposed by [13], it was shown that                 pp. 3498–3505.
breast cancer patients with a higher ratio of neutrophils        [10] A. Kolesnikov, L. Beyer, X. Zhai, J. Puigcerver,
to lymphocytes had a higher relapse and lower overall                 J. Yung, S. Gelly, N. Houlsby, Big transfer (bit):
survival.                                                             General visual representation learning, in: 16th
   The importance of including an in-depth study of the               European Conference on Computer Vision, Part V
errors that an algorithm could make in this classification            16, Springer, 2020, pp. 491–507.
is evident, just as it is fundamental that in these complex      [11] L. Smiley, ‘I’m the operator’: The aftermath of a
use cases, both the evaluations of the algorithms and their           self-driving tragedy, Wired (2022).
publication are accompanied by the corresponding pa-             [12] Y. Okuturlar, M. Gunaldi, E. E. Tiken, B. Ozto-
rameters or new metrics that make visible the different er-           sun, Y. O. Inan, T. Ercan, S. Tuna, A. O. Kaya,
rors made, their frequency and their associated risk based            O. Harmankaya, A. Kumbasar, Utility of periph-
on professional knowledge. Moreover, these parameters                 eral blood parameters in predicting breast cancer
could provide not only transparency and explainability                risk, Asian Pacific Journal of Cancer Prevention 16
to the model, but also valuable clues to researchers that             (2015) 2409–2412.
would allow the algorithms to be improved in terms of            [13] B. Wei, M. Yao, C. Xing, W. Wang, J. Yao, Y. Hong,
human-centered responsibility and accountability.                     Y. Liu, P. Fu, The neutrophil lymphocyte ratio is as-
                                                                      sociated with breast cancer prognosis: an updated
                                                                      systematic review and meta-analysis, OncoTargets
References                                                            and therapy 9 (2016).
 [1] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner,           [14] X. Zheng, Y. Wang, G. Wang, J. Liu,               Fast
     A. Madry, Robustness may be at odds with accuracy,               and robust segmentation of white blood cell im-
     arXiv preprint arXiv:1805.12152 (2018).                          ages by self-supervised learning, Micron 107
 [2] D. Tsipras, S. Santurkar, L. Engstrom, A. Ilyas,                 (2018) 55–71. URL: https://www.sciencedirect.com/
                                                                      science/article/pii/S0968432817303037.