Towards international standards for evaluating machine learning Frank Rudzicz1,2,3,4 and P Alison Paprica4,5 and Marta Janczarski6 1 Li Ka Shing Knowledge Institute, St Michael’s Hospital 2 Department of Computer Science, University of Toronto 3 Surgical Safety Technologies Inc. 4 Vector Institute for Artificial Intelligence 5 Institute for Health Policy, Management and Evaluation, University of Toronto 6 Standards Council of Canada Abstract quently evaluate our systems in a cursory or incorrect way. Without careful consideration to empirical methodology, the Various international efforts to standardize artificial intelli- gence have begun, and many of these efforts involve issues truth and generalizability of our own statements may be sus- related to privacy, trustworthiness, safety, and public well- pect. Without any conceptual barriers to overcome, success being, which are topics that don’t necessarily have interna- in innocuous ‘toy’ tasks, such as distinguishing images of tional consensus, and may not for the foreseeable future. cats from images of dogs (Elson et al. 2007), can quickly Meanwhile, the pursuit of achieving state-of-the-art accuracy transfer to other tasks with superficially similar data types, in machine learning has resulted in a somewhat ad hoc ap- such as distinguishing images of malignant skin lesions from plication of empirical methodology that may limit the cor- benign ones (Esteva et al. 2017). That is, ML allows for rectness of the computation of those accuracies, resulting in modelling procedures to be easily transferred across data unpredictable applicability of those models. Trusting the ob- sets without necessarily considering possible covariates hid- jective quantitative performance of our systems is itself a den in those data sets, nor the potential consequences of safety concern and should inform the earliest standards to- wards safety in AI. false positives or false negatives. Indeed, not accounting for skin colour in those images may aversely affect underrepre- sented populations in the data (Adamson and Smith 2018; Introduction Lashbrook 2018). The relative ease with which modern ML Implementing international standards is a primary method can be implemented may reveal unintended biases of this to ensure the safety of a process or product. Outlining a type or, more concerning, biases that we cannot understand. set of specifications enables consumers and producers to both abide by the requirements set out to reduce risk and The dawn of AI standards harm. There are two principal avenues to employ these In addition to various regional efforts, two international or- safety standards. The first is to develop a conformity assess- ganizations are leading the global standardization effort in ment scheme against a set standard; this provides a certi- AI. The IEEE has launched the Global Initiative on Ethics of fication that can clearly and quickly determine whether a Autonomous and Intelligent Systems to address some of the product or process meets agreed-upon requirements. The societal concerns that are emerging with AI. These include process of certification often includes both testing the prod- areas such as data governance and privacy, algorithmic bias, uct or process, and analyzing the documentation through- transparency, ethically driven robots and autonomous sys- out the design and creation process. The second avenue is tems, failsafe design and wellbeing metrics. Also, the ISO incorporation by reference of standards in regulation. Re- has recently created a new technical subcommittee (SC) in gional and federal regulations incorporate national and in- the area of artificial intelligence, ISO/JTC 1 SC 42, whose ternational standards to ensure uniformity and a level of scope covers foundational standards as well as issues related product requirements that are expected. Prominent examples to safety and trustworthiness. At their first plenary in April include building codes, and the Consumer Product Safety 2018, the subcommittee created the following study groups: Act1 . For example, the International Standards Organization (ISO) committee on consumer policy (COPOLCO) provides Computational approaches and characteristics This a potential platform to investigate the consumer impact of concerns different technologies (e.g., ML algorithms, AI, in particular through aspects of safety (ISO/IEC 2014). reasoning) used by AI systems including their properties Developing foundational machine learning (ML) has, to and characteristics. This will include specialized AI some extent, been ad hoc, susceptible to trends, and rela- systems (e.g., NLP or computer vision) to understand tively undirected. There is no inherent moral or ethical prob- and identify their underlying computational approaches, lem with this approach except for the potential to conse- architectures, and characteristics, and industry practices, processes and methods for the application of AI systems. Copyright is held by the authors c 2019. All rights reserved. 1 Trustworthiness This concerns approaches to establish https://laws-lois.justice.gc.ca/eng/acts/ C-1.68/ trust in AI systems, e.g., through transparency, verifia- bility, explainability, controllability. Engineering pitfalls, biases towards recidivism or gender biases in word em- typical threats and risks, their mitigation techniques, and beddings (Bolukbasi et al. 2016); if the system is meant to approaches to robustness, accuracy, privacy, and safety enable decision support prospectively, techniques to miti- will also be investigated. gate bias should be taken. Use cases and applications This focuses on application Appropriate baselines Any classifier of interest should be domains for AI (e.g., social networks and embedded sys- compared against at least one representative, appropriate tems) and the different context of their use (e.g., health baseline. Trivial baselines should not be considered. A care, smart homes, and autonomous cars). trivial baseline, for example, always predicts the major- Work items related to the specification of performance of ity class and, in general, is not the result of a machine ML models, and comparison between models, are relevant learning process. to each of these study groups. Appropriate measures There is a tendency to report accu- racy or area under the precision-recall (or other operator Standards for evaluating ML models characteristic) curves in nominal classification; however, Machine learning research is driven, to a large extent, by the this is not always correct. For example, systems that pre- search for mechanisms that achieve greater accuracy than dict cause-of-death according to international standards competing approaches. Unfortunately, there have been sev- for disease coding should not merely report accuracy, eral methodological limitations or misapplications that have but should include the cause-specific mortality fraction hindered the comparison of models, or made such compar- (CSMF), which is the fraction of in-hospital deaths for isons forfeit. Indeed, references are routinely made in the a given cause normalized over all causes (Murray et al. literature to ‘state-of-the-art’ performance, sometimes in- 2007). CSMF accuracy is therefore a measure of predic- volving minuscule differences on small data sets, and in tive quality at the population level, which quantifies how relatively esoteric tasks. Broad acceptance of such empiri- closely the estimated CSMF values approximate the truth. cal procedures, or their assumed generalizability, makes the In fact, when it is possible to compute its associated coef- supposed direct comparison between approaches tenuous at ficients, the chance-corrected version of CSMF accuracy best, and suspicious at worse. should be used instead (Flaxman et al. 2015). Clearly, the When comparing the performance of two or more algo- measure used, itself, can be highly context-dependent and rithms, the following aspects must be carefully controlled result in very different outcomes for different samples. and reported: Limiting information leakage It is necessary to partition Implementation For example, if an algorithm can be accel- data between training sets and test sets in such a way so erated (e.g., by using GPU processing) in such a way that that no latent information exists across sets, other than di- can affect outcomes (e.g., if a there is a stopping condition rectly obtained from observation variables. This can occur in time), then this must be made explicit. when latent information is highly correlated to labels, an- notation, or other supervised information. Hyper-parameters If the hyper-parameters of a ML model For example, a system may be designed to classify be- are optimized, the hyper-parameters of the comparative tween people with and without neurodegeneration from ML models should also be optimized, except when hyper- audio (Fraser, Meltzer, and Rudzicz 2015) and have mul- parameters themselves are being compared. tiple data points recorded from each human subject in Preprocessing Preprocessing steps will not unjustly favour the data. Some acoustic features, such as vocal jitter or one model over another. For example, if a classifier re- phonation rates, may be used to identify pathology cross- quires ‘stop words’ (e.g., prepositions) to be retained in sectionally, but they can also be used to identify the an NLP task, those words should not be removed. More- speaker themselves. Since each speaker is associated with over, preprocessing should be consistent across all data. a label for the outcome, even if individual samples are For example, if outliers, incomplete data, or noise are re- partitioned across training and test sets, it would be inap- moved, it must be done uniformly. propriate if an individual speaker is represented in both Training and testing data When several machine learning sets. This is because any model could learn the identity models are being compared, the data used to train those of a speaker from the training data, and apply the known models or, separately, evaluate those models, should be label to test data, tainting the results. Leave-one-out cross- identical. These data should be ecologically valid, statis- validation is one mitigation strategy. tically indistinct, or otherwise similar to data expected to Limiting channel effects A channel effect occurs when a be observed in deployment. classifier may learn characteristics of the manner in which Representative data The data should be as free of sam- data were recorded, in addition to the nature of the data pling bias as possible. That is, the distribution of classes in themselves. For example, a hospital-based system may be the data should be identical to their distribution in the real designed to classify among patient data. However, if all or world, to the extent possible. There may be special con- most patients with complex cancers seek treatment in ur- siderations to this point. To some extent, models trained ban centres, then a classifier may learn to associate those on historical data may encapsulate biases from the past cancers with certain regions. that the developers wish removed, such as demographic Channel effects can be caused by the mechanism used to obtain the data, any preprocessing that occurred on the governments represented on standardization bodies. Per- one or more proper subsets of the data, the identity of haps more concerning is whether standards in AI will even the individual or individuals obtaining the data, or envi- be acknowledged or endorsed by developers and practition- ronmental changes in which data were recorded, for ex- ers, especially in academia. Agreeing upon a minimal set of ample. If these effects cannot be controlled, they must standards for evaluation, across sectors, may require a broad be accounted for as covariates during statistical signif- cultural change within the ML community. Indeed, Dror et icance testing. Additionally, strategies have been devel- al. (2018) showed through a meta-analysis that, while the oped to explicitly factor out channel effects, as with i- ML and natural language processing communities are driven Vectors through expectation-maximization, probabilistic by experimental results, statistical significance testing is ig- linear discriminant analysis, and factor analysis (Verma nored or misused most of the time. Therefore, as machine and Das 2015). learning becomes an increasingly applied science, empirical methods should be emphasized early in relevant University Furthermore, appropriate statistical tests of significance and educational programs, including Computer Science. must be undertaken, when possible, in order to establish Naturally, innovation should continue to be encouraged. whether there is any meaningful difference between ap- In fact, lowering certain regulatory barriers may promote proaches. A difference of 0.5%, for example, on a single test certain safe uses of ML in the service of the public well- set is not necessarily conclusive with regards to the models being, including healthcare. The point is not that risk should compared. Naturally, tests of significance can also be mis- be averted – there is substantial evidence that undertaking used (i.e., so-called ‘p-hacking’), so effort must be taken to certain kinds of risk can be beneficial when those risks are choose appropriate tests. For example, if a test has an as- understood. The challenge we face is that we do not truly sumed distribution (as in standard t-tests), then the validity understand the risks of AI and ML – we barely understand of that assumption in the data should also be evaluated (e.g., how to assess those systems in the first place, which should through a Lilliefors or Kolmogorov-Smirnov test). If mul- be among our first priorities. tiple comparisons are made (e.g., through multiple hyper- With some exceptions, such as cases where human rights parameterizations), then this must be accounted for also, are at risk, we do not propose that standards in AI should e.g., through a Bonferroni test. Alternatively, standardized inhibit our scientific exploration in any way. Nor do we pro- computations of effect sizes (e.g., Cohen’s d) can mitigate pose limits to the capabilities of AI systems – those will against the risks of p-hacking. Finally, where possible, all largely be dependent on national or regional laws or regu- relevant covariates must be accounted for in the model; as lations. Rather, the international standardization community mentioned above, this includes all aspects in the channel, has the opportunity to place certain expectations as to how including confounding variables in the data themselves, that we, the ML community, evaluate our own work. Trusting the could effect the outcomes, as well as the interactions be- objective quantitative performance of our systems is itself a tween those variables. safety concern. The assessment of nominal classification or continuous regression has been implicit in the discussion above, but the same principles may be taken in evaluating reinforcement Final comments learning, for example. This paper represents the beginning of a long, multi-year process in surveying challenges or shortcomings in the eval- uation of performance of machine learning, especially as it Recommendations relates to the trustworthiness of that performance. These are Ensuring the safety and maintenance of AI systems will be not a complete set of requirements, nor are the recommenda- the subject of various standardization efforts, including ex- tions fully expressed. Training software, rather than explic- plainable models, unintended biases (including cultural, so- itly programming it, poses unique challenges and imposes a cial, historical, or sampling biases), human-machine interac- specific development and test life cycle suitable that is dis- tion, and scalable oversight. However, the clear-eyed refine- tinct from traditional software development and, crucially, ment of the actual evaluation methodologies will be crucial regulation and standardization. Accurately controlling for to many of these challenges. When so many independent re- the behaviours of deployed machine learning, through a searchers and organizations (across academic, commercial, quantitative evaluation of its performance, is crucial. or governmental sectors) are actively and competitively en- gaged to achieve state-of-the-art performance, it is essen- Acknowledgements tial to be able to objectively and quantitatively establish that Rudzicz is the Canadian Chair of the mirror committee to performance correctly, consistently, and with expected min- ISO/JTC 1 SC 42 on Artificial Intelligence. imal levels of reporting, otherwise any claims should not be trusted. References Adamson, A. S., and Smith, A. 2018. Machine Learning Endorsement versus enforcement and Health Care Disparities in Dermatology. JAMA Derma- As international standards are optional, a core concern is tology 154(11):1247–1248. whether these standards will be enforceable by governments, Bolukbasi, T.; Chang, K.-w.; Zou, J.; Saligrama, V.; and given a variety of attitudes towards artificial intelligence by Kalai, A. 2016. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. In NIPS, 1–9. Dror, R.; Baumer, G.; Shlomov, S.; and Reichart, R. 2018. The Hitchhiker’ s Guide to Testing Statistical Significance in Natural Language Processing. In 56th Annual Meeting ofthe Association for Computational Linguistics, 1–10. Elson, J.; Douceur, J. R.; Howell, J.; and Saul, J. 2007. Asirra: A CAPTCHA that Exploits Interest-Aligned Man- ual Image Categorization. Proceedings of 14th ACM Con- ference on Computer and Communications Security (CCS) 366–374. Esteva, A.; Kuprel, B.; Novoa, R. A.; Ko, J.; Swetter, S. M.; Blau, H. M.; and Thrun, S. 2017. Dermatologist-level clas- sification of skin cancer with deep neural networks. Nature 542(7639):115–118. Flaxman, A. D.; Serina, P. T.; Hernandez, B.; Murray, C. J. L.; Riley, I.; and Lopez, A. D. 2015. Measuring causes of death in populations: a new metric that corrects cause- specific mortality fractions for chance. Population health metrics 13(1):28. Fraser, K. C.; Meltzer, J. A.; and Rudzicz, F. 2015. Linguis- tic features identify Alzheimer’s disease in narrative speech. Journal of Alzheimer’s Disease 49(2):407–422. ISO/IEC. 2014. Safety aspects – guidelines for their inclu- sion in standards. Standard, International Organization for Standardization, Geneva, CH. Lashbrook, A. 2018. AI-Driven Dermatology Could Leave Dark-Skinned Patients Behind. The Atlantic. Murray, C. J. L.; Lopez, A. D.; Barofsky, J. T.; Bryson-Cahn, C.; and Lozano, R. 2007. Estimating Population Cause- Specific Mortality Fractions from in-Hospital Mortality: Validation of a New Method. PLoS Medicine 4(11):e326. Verma, P., and Das, P. K. 2015. i-Vectors in speech process- ing applications: a survey. International Journal of Speech Technology 18(4):529–546.