Towards international standards for evaluating machine learning
                     Frank Rudzicz1,2,3,4 and P Alison Paprica4,5 and Marta Janczarski6
                                       1
                                      Li Ka Shing Knowledge Institute, St Michael’s Hospital
                                      2
                                      Department of Computer Science, University of Toronto
                                               3
                                                 Surgical Safety Technologies Inc.
                                           4
                                             Vector Institute for Artificial Intelligence
       5
         Institute for Health Policy, Management and Evaluation, University of Toronto 6 Standards Council of Canada


                            Abstract                                 quently evaluate our systems in a cursory or incorrect way.
                                                                     Without careful consideration to empirical methodology, the
  Various international efforts to standardize artificial intelli-
  gence have begun, and many of these efforts involve issues
                                                                     truth and generalizability of our own statements may be sus-
  related to privacy, trustworthiness, safety, and public well-      pect. Without any conceptual barriers to overcome, success
  being, which are topics that don’t necessarily have interna-       in innocuous ‘toy’ tasks, such as distinguishing images of
  tional consensus, and may not for the foreseeable future.          cats from images of dogs (Elson et al. 2007), can quickly
  Meanwhile, the pursuit of achieving state-of-the-art accuracy      transfer to other tasks with superficially similar data types,
  in machine learning has resulted in a somewhat ad hoc ap-          such as distinguishing images of malignant skin lesions from
  plication of empirical methodology that may limit the cor-         benign ones (Esteva et al. 2017). That is, ML allows for
  rectness of the computation of those accuracies, resulting in      modelling procedures to be easily transferred across data
  unpredictable applicability of those models. Trusting the ob-      sets without necessarily considering possible covariates hid-
  jective quantitative performance of our systems is itself a        den in those data sets, nor the potential consequences of
  safety concern and should inform the earliest standards to-
  wards safety in AI.
                                                                     false positives or false negatives. Indeed, not accounting for
                                                                     skin colour in those images may aversely affect underrepre-
                                                                     sented populations in the data (Adamson and Smith 2018;
                        Introduction                                 Lashbrook 2018). The relative ease with which modern ML
Implementing international standards is a primary method             can be implemented may reveal unintended biases of this
to ensure the safety of a process or product. Outlining a            type or, more concerning, biases that we cannot understand.
set of specifications enables consumers and producers to
both abide by the requirements set out to reduce risk and                         The dawn of AI standards
harm. There are two principal avenues to employ these
                                                                     In addition to various regional efforts, two international or-
safety standards. The first is to develop a conformity assess-
                                                                     ganizations are leading the global standardization effort in
ment scheme against a set standard; this provides a certi-
                                                                     AI. The IEEE has launched the Global Initiative on Ethics of
fication that can clearly and quickly determine whether a
                                                                     Autonomous and Intelligent Systems to address some of the
product or process meets agreed-upon requirements. The
                                                                     societal concerns that are emerging with AI. These include
process of certification often includes both testing the prod-
                                                                     areas such as data governance and privacy, algorithmic bias,
uct or process, and analyzing the documentation through-
                                                                     transparency, ethically driven robots and autonomous sys-
out the design and creation process. The second avenue is
                                                                     tems, failsafe design and wellbeing metrics. Also, the ISO
incorporation by reference of standards in regulation. Re-
                                                                     has recently created a new technical subcommittee (SC) in
gional and federal regulations incorporate national and in-
                                                                     the area of artificial intelligence, ISO/JTC 1 SC 42, whose
ternational standards to ensure uniformity and a level of
                                                                     scope covers foundational standards as well as issues related
product requirements that are expected. Prominent examples
                                                                     to safety and trustworthiness. At their first plenary in April
include building codes, and the Consumer Product Safety
                                                                     2018, the subcommittee created the following study groups:
Act1 . For example, the International Standards Organization
(ISO) committee on consumer policy (COPOLCO) provides                Computational approaches and characteristics This
a potential platform to investigate the consumer impact of             concerns different technologies (e.g., ML algorithms,
AI, in particular through aspects of safety (ISO/IEC 2014).            reasoning) used by AI systems including their properties
   Developing foundational machine learning (ML) has, to               and characteristics. This will include specialized AI
some extent, been ad hoc, susceptible to trends, and rela-             systems (e.g., NLP or computer vision) to understand
tively undirected. There is no inherent moral or ethical prob-         and identify their underlying computational approaches,
lem with this approach except for the potential to conse-              architectures, and characteristics, and industry practices,
                                                                       processes and methods for the application of AI systems.
Copyright is held by the authors c 2019. All rights reserved.
  1                                                                  Trustworthiness This concerns approaches to establish
    https://laws-lois.justice.gc.ca/eng/acts/
C-1.68/                                                                trust in AI systems, e.g., through transparency, verifia-
  bility, explainability, controllability. Engineering pitfalls,      biases towards recidivism or gender biases in word em-
  typical threats and risks, their mitigation techniques, and         beddings (Bolukbasi et al. 2016); if the system is meant to
  approaches to robustness, accuracy, privacy, and safety             enable decision support prospectively, techniques to miti-
  will also be investigated.                                          gate bias should be taken.
Use cases and applications This focuses on application              Appropriate baselines Any classifier of interest should be
  domains for AI (e.g., social networks and embedded sys-             compared against at least one representative, appropriate
  tems) and the different context of their use (e.g., health          baseline. Trivial baselines should not be considered. A
  care, smart homes, and autonomous cars).                            trivial baseline, for example, always predicts the major-
   Work items related to the specification of performance of          ity class and, in general, is not the result of a machine
ML models, and comparison between models, are relevant                learning process.
to each of these study groups.                                      Appropriate measures There is a tendency to report accu-
                                                                      racy or area under the precision-recall (or other operator
       Standards for evaluating ML models                             characteristic) curves in nominal classification; however,
Machine learning research is driven, to a large extent, by the        this is not always correct. For example, systems that pre-
search for mechanisms that achieve greater accuracy than              dict cause-of-death according to international standards
competing approaches. Unfortunately, there have been sev-             for disease coding should not merely report accuracy,
eral methodological limitations or misapplications that have          but should include the cause-specific mortality fraction
hindered the comparison of models, or made such compar-               (CSMF), which is the fraction of in-hospital deaths for
isons forfeit. Indeed, references are routinely made in the           a given cause normalized over all causes (Murray et al.
literature to ‘state-of-the-art’ performance, sometimes in-           2007). CSMF accuracy is therefore a measure of predic-
volving minuscule differences on small data sets, and in              tive quality at the population level, which quantifies how
relatively esoteric tasks. Broad acceptance of such empiri-           closely the estimated CSMF values approximate the truth.
cal procedures, or their assumed generalizability, makes the          In fact, when it is possible to compute its associated coef-
supposed direct comparison between approaches tenuous at              ficients, the chance-corrected version of CSMF accuracy
best, and suspicious at worse.                                        should be used instead (Flaxman et al. 2015). Clearly, the
   When comparing the performance of two or more algo-                measure used, itself, can be highly context-dependent and
rithms, the following aspects must be carefully controlled            result in very different outcomes for different samples.
and reported:                                                       Limiting information leakage It is necessary to partition
Implementation For example, if an algorithm can be accel-             data between training sets and test sets in such a way so
  erated (e.g., by using GPU processing) in such a way that           that no latent information exists across sets, other than di-
  can affect outcomes (e.g., if a there is a stopping condition       rectly obtained from observation variables. This can occur
  in time), then this must be made explicit.                          when latent information is highly correlated to labels, an-
                                                                      notation, or other supervised information.
Hyper-parameters If the hyper-parameters of a ML model
                                                                      For example, a system may be designed to classify be-
 are optimized, the hyper-parameters of the comparative
                                                                      tween people with and without neurodegeneration from
 ML models should also be optimized, except when hyper-
                                                                      audio (Fraser, Meltzer, and Rudzicz 2015) and have mul-
 parameters themselves are being compared.
                                                                      tiple data points recorded from each human subject in
Preprocessing Preprocessing steps will not unjustly favour            the data. Some acoustic features, such as vocal jitter or
  one model over another. For example, if a classifier re-            phonation rates, may be used to identify pathology cross-
  quires ‘stop words’ (e.g., prepositions) to be retained in          sectionally, but they can also be used to identify the
  an NLP task, those words should not be removed. More-               speaker themselves. Since each speaker is associated with
  over, preprocessing should be consistent across all data.           a label for the outcome, even if individual samples are
  For example, if outliers, incomplete data, or noise are re-         partitioned across training and test sets, it would be inap-
  moved, it must be done uniformly.                                   propriate if an individual speaker is represented in both
Training and testing data When several machine learning               sets. This is because any model could learn the identity
  models are being compared, the data used to train those             of a speaker from the training data, and apply the known
  models or, separately, evaluate those models, should be             label to test data, tainting the results. Leave-one-out cross-
  identical. These data should be ecologically valid, statis-         validation is one mitigation strategy.
  tically indistinct, or otherwise similar to data expected to      Limiting channel effects A channel effect occurs when a
  be observed in deployment.                                          classifier may learn characteristics of the manner in which
Representative data The data should be as free of sam-                data were recorded, in addition to the nature of the data
  pling bias as possible. That is, the distribution of classes in     themselves. For example, a hospital-based system may be
  the data should be identical to their distribution in the real      designed to classify among patient data. However, if all or
  world, to the extent possible. There may be special con-            most patients with complex cancers seek treatment in ur-
  siderations to this point. To some extent, models trained           ban centres, then a classifier may learn to associate those
  on historical data may encapsulate biases from the past             cancers with certain regions.
  that the developers wish removed, such as demographic               Channel effects can be caused by the mechanism used
  to obtain the data, any preprocessing that occurred on           the governments represented on standardization bodies. Per-
  one or more proper subsets of the data, the identity of          haps more concerning is whether standards in AI will even
  the individual or individuals obtaining the data, or envi-       be acknowledged or endorsed by developers and practition-
  ronmental changes in which data were recorded, for ex-           ers, especially in academia. Agreeing upon a minimal set of
  ample. If these effects cannot be controlled, they must          standards for evaluation, across sectors, may require a broad
  be accounted for as covariates during statistical signif-        cultural change within the ML community. Indeed, Dror et
  icance testing. Additionally, strategies have been devel-        al. (2018) showed through a meta-analysis that, while the
  oped to explicitly factor out channel effects, as with i-        ML and natural language processing communities are driven
  Vectors through expectation-maximization, probabilistic          by experimental results, statistical significance testing is ig-
  linear discriminant analysis, and factor analysis (Verma         nored or misused most of the time. Therefore, as machine
  and Das 2015).                                                   learning becomes an increasingly applied science, empirical
                                                                   methods should be emphasized early in relevant University
   Furthermore, appropriate statistical tests of significance
                                                                   and educational programs, including Computer Science.
must be undertaken, when possible, in order to establish
                                                                      Naturally, innovation should continue to be encouraged.
whether there is any meaningful difference between ap-
                                                                   In fact, lowering certain regulatory barriers may promote
proaches. A difference of 0.5%, for example, on a single test
                                                                   certain safe uses of ML in the service of the public well-
set is not necessarily conclusive with regards to the models
                                                                   being, including healthcare. The point is not that risk should
compared. Naturally, tests of significance can also be mis-
                                                                   be averted – there is substantial evidence that undertaking
used (i.e., so-called ‘p-hacking’), so effort must be taken to
                                                                   certain kinds of risk can be beneficial when those risks are
choose appropriate tests. For example, if a test has an as-
                                                                   understood. The challenge we face is that we do not truly
sumed distribution (as in standard t-tests), then the validity
                                                                   understand the risks of AI and ML – we barely understand
of that assumption in the data should also be evaluated (e.g.,
                                                                   how to assess those systems in the first place, which should
through a Lilliefors or Kolmogorov-Smirnov test). If mul-
                                                                   be among our first priorities.
tiple comparisons are made (e.g., through multiple hyper-
                                                                      With some exceptions, such as cases where human rights
parameterizations), then this must be accounted for also,
                                                                   are at risk, we do not propose that standards in AI should
e.g., through a Bonferroni test. Alternatively, standardized
                                                                   inhibit our scientific exploration in any way. Nor do we pro-
computations of effect sizes (e.g., Cohen’s d) can mitigate
                                                                   pose limits to the capabilities of AI systems – those will
against the risks of p-hacking. Finally, where possible, all
                                                                   largely be dependent on national or regional laws or regu-
relevant covariates must be accounted for in the model; as
                                                                   lations. Rather, the international standardization community
mentioned above, this includes all aspects in the channel,
                                                                   has the opportunity to place certain expectations as to how
including confounding variables in the data themselves, that
                                                                   we, the ML community, evaluate our own work. Trusting the
could effect the outcomes, as well as the interactions be-
                                                                   objective quantitative performance of our systems is itself a
tween those variables.
                                                                   safety concern.
   The assessment of nominal classification or continuous
regression has been implicit in the discussion above, but the
same principles may be taken in evaluating reinforcement
                                                                                        Final comments
learning, for example.                                             This paper represents the beginning of a long, multi-year
                                                                   process in surveying challenges or shortcomings in the eval-
                                                                   uation of performance of machine learning, especially as it
                   Recommendations                                 relates to the trustworthiness of that performance. These are
Ensuring the safety and maintenance of AI systems will be          not a complete set of requirements, nor are the recommenda-
the subject of various standardization efforts, including ex-      tions fully expressed. Training software, rather than explic-
plainable models, unintended biases (including cultural, so-       itly programming it, poses unique challenges and imposes a
cial, historical, or sampling biases), human-machine interac-      specific development and test life cycle suitable that is dis-
tion, and scalable oversight. However, the clear-eyed refine-      tinct from traditional software development and, crucially,
ment of the actual evaluation methodologies will be crucial        regulation and standardization. Accurately controlling for
to many of these challenges. When so many independent re-          the behaviours of deployed machine learning, through a
searchers and organizations (across academic, commercial,          quantitative evaluation of its performance, is crucial.
or governmental sectors) are actively and competitively en-
gaged to achieve state-of-the-art performance, it is essen-                          Acknowledgements
tial to be able to objectively and quantitatively establish that   Rudzicz is the Canadian Chair of the mirror committee to
performance correctly, consistently, and with expected min-        ISO/JTC 1 SC 42 on Artificial Intelligence.
imal levels of reporting, otherwise any claims should not be
trusted.                                                                                   References
                                                                   Adamson, A. S., and Smith, A. 2018. Machine Learning
Endorsement versus enforcement                                     and Health Care Disparities in Dermatology. JAMA Derma-
As international standards are optional, a core concern is         tology 154(11):1247–1248.
whether these standards will be enforceable by governments,        Bolukbasi, T.; Chang, K.-w.; Zou, J.; Saligrama, V.; and
given a variety of attitudes towards artificial intelligence by    Kalai, A. 2016. Man is to Computer Programmer as Woman
is to Homemaker? Debiasing Word Embeddings. In NIPS,
1–9.
Dror, R.; Baumer, G.; Shlomov, S.; and Reichart, R. 2018.
The Hitchhiker’ s Guide to Testing Statistical Significance
in Natural Language Processing. In 56th Annual Meeting
ofthe Association for Computational Linguistics, 1–10.
Elson, J.; Douceur, J. R.; Howell, J.; and Saul, J. 2007.
Asirra: A CAPTCHA that Exploits Interest-Aligned Man-
ual Image Categorization. Proceedings of 14th ACM Con-
ference on Computer and Communications Security (CCS)
366–374.
Esteva, A.; Kuprel, B.; Novoa, R. A.; Ko, J.; Swetter, S. M.;
Blau, H. M.; and Thrun, S. 2017. Dermatologist-level clas-
sification of skin cancer with deep neural networks. Nature
542(7639):115–118.
Flaxman, A. D.; Serina, P. T.; Hernandez, B.; Murray, C.
J. L.; Riley, I.; and Lopez, A. D. 2015. Measuring causes
of death in populations: a new metric that corrects cause-
specific mortality fractions for chance. Population health
metrics 13(1):28.
Fraser, K. C.; Meltzer, J. A.; and Rudzicz, F. 2015. Linguis-
tic features identify Alzheimer’s disease in narrative speech.
Journal of Alzheimer’s Disease 49(2):407–422.
ISO/IEC. 2014. Safety aspects – guidelines for their inclu-
sion in standards. Standard, International Organization for
Standardization, Geneva, CH.
Lashbrook, A. 2018. AI-Driven Dermatology Could Leave
Dark-Skinned Patients Behind. The Atlantic.
Murray, C. J. L.; Lopez, A. D.; Barofsky, J. T.; Bryson-Cahn,
C.; and Lozano, R. 2007. Estimating Population Cause-
Specific Mortality Fractions from in-Hospital Mortality:
Validation of a New Method. PLoS Medicine 4(11):e326.
Verma, P., and Das, P. K. 2015. i-Vectors in speech process-
ing applications: a survey. International Journal of Speech
Technology 18(4):529–546.