<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards international standards for evaluating machine learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Frank Rudzicz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>P Alison Paprica</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marta Janczarski</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Toronto</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute for Health Policy</institution>
          ,
          <addr-line>Management and Evaluation</addr-line>
          ,
          <institution>University of Toronto</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Li Ka Shing Knowledge Institute, St Michael's Hospital</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Standards Council of</institution>
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Surgical Safety Technologies Inc</institution>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Vector Institute for Artificial Intelligence</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Various international efforts to standardize artificial intelligence have begun, and many of these efforts involve issues related to privacy, trustworthiness, safety, and public wellbeing, which are topics that don't necessarily have international consensus, and may not for the foreseeable future. Meanwhile, the pursuit of achieving state-of-the-art accuracy in machine learning has resulted in a somewhat ad hoc application of empirical methodology that may limit the correctness of the computation of those accuracies, resulting in unpredictable applicability of those models. Trusting the objective quantitative performance of our systems is itself a safety concern and should inform the earliest standards towards safety in AI.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Implementing international standards is a primary method
to ensure the safety of a process or product. Outlining a
set of specifications enables consumers and producers to
both abide by the requirements set out to reduce risk and
harm. There are two principal avenues to employ these
safety standards. The first is to develop a conformity
assessment scheme against a set standard; this provides a
certification that can clearly and quickly determine whether a
product or process meets agreed-upon requirements. The
process of certification often includes both testing the
product or process, and analyzing the documentation
throughout the design and creation process. The second avenue is
incorporation by reference of standards in regulation.
Regional and federal regulations incorporate national and
international standards to ensure uniformity and a level of
product requirements that are expected. Prominent examples
include building codes, and the Consumer Product Safety
Act1. For example, the International Standards Organization
(ISO) committee on consumer policy (COPOLCO) provides
a potential platform to investigate the consumer impact of
AI, in particular through aspects of safety (ISO/IEC 2014).</p>
      <p>
        Developing foundational machine learning (ML) has, to
some extent, been ad hoc, susceptible to trends, and
relatively undirected. There is no inherent moral or ethical
problem with this approach except for the potential to
consequently evaluate our systems in a cursory or incorrect way.
Without careful consideration to empirical methodology, the
truth and generalizability of our own statements may be
suspect. Without any conceptual barriers to overcome, success
in innocuous ‘toy’ tasks, such as distinguishing images of
cats from images of dogs (Elson et al. 2007), can quickly
transfer to other tasks with superficially similar data types,
such as distinguishing images of malignant skin lesions from
benign ones (Esteva et al. 2017). That is, ML allows for
modelling procedures to be easily transferred across data
sets without necessarily considering possible covariates
hidden in those data sets, nor the potential consequences of
false positives or false negatives. Indeed, not accounting for
skin colour in those images may aversely affect
underrepresented populations in the data
        <xref ref-type="bibr" rid="ref1">(Adamson and Smith 2018;
Lashbrook 2018)</xref>
        . The relative ease with which modern ML
can be implemented may reveal unintended biases of this
type or, more concerning, biases that we cannot understand.
      </p>
    </sec>
    <sec id="sec-2">
      <title>The dawn of AI standards</title>
      <p>In addition to various regional efforts, two international
organizations are leading the global standardization effort in
AI. The IEEE has launched the Global Initiative on Ethics of
Autonomous and Intelligent Systems to address some of the
societal concerns that are emerging with AI. These include
areas such as data governance and privacy, algorithmic bias,
transparency, ethically driven robots and autonomous
systems, failsafe design and wellbeing metrics. Also, the ISO
has recently created a new technical subcommittee (SC) in
the area of artificial intelligence, ISO/JTC 1 SC 42, whose
scope covers foundational standards as well as issues related
to safety and trustworthiness. At their first plenary in April
2018, the subcommittee created the following study groups:
Computational approaches and characteristics This
concerns different technologies (e.g., ML algorithms,
reasoning) used by AI systems including their properties
and characteristics. This will include specialized AI
systems (e.g., NLP or computer vision) to understand
and identify their underlying computational approaches,
architectures, and characteristics, and industry practices,
processes and methods for the application of AI systems.
Trustworthiness This concerns approaches to establish
trust in AI systems, e.g., through transparency,
verifiability, explainability, controllability. Engineering pitfalls,
typical threats and risks, their mitigation techniques, and
approaches to robustness, accuracy, privacy, and safety
will also be investigated.</p>
      <p>Use cases and applications This focuses on application
domains for AI (e.g., social networks and embedded
systems) and the different context of their use (e.g., health
care, smart homes, and autonomous cars).</p>
      <p>Work items related to the specification of performance of
ML models, and comparison between models, are relevant
to each of these study groups.</p>
    </sec>
    <sec id="sec-3">
      <title>Standards for evaluating ML models</title>
      <p>Machine learning research is driven, to a large extent, by the
search for mechanisms that achieve greater accuracy than
competing approaches. Unfortunately, there have been
several methodological limitations or misapplications that have
hindered the comparison of models, or made such
comparisons forfeit. Indeed, references are routinely made in the
literature to ‘state-of-the-art’ performance, sometimes
involving minuscule differences on small data sets, and in
relatively esoteric tasks. Broad acceptance of such
empirical procedures, or their assumed generalizability, makes the
supposed direct comparison between approaches tenuous at
best, and suspicious at worse.</p>
      <p>When comparing the performance of two or more
algorithms, the following aspects must be carefully controlled
and reported:
Implementation For example, if an algorithm can be
accelerated (e.g., by using GPU processing) in such a way that
can affect outcomes (e.g., if a there is a stopping condition
in time), then this must be made explicit.</p>
      <p>Hyper-parameters If the hyper-parameters of a ML model
are optimized, the hyper-parameters of the comparative
ML models should also be optimized, except when
hyperparameters themselves are being compared.</p>
      <p>Preprocessing Preprocessing steps will not unjustly favour
one model over another. For example, if a classifier
requires ‘stop words’ (e.g., prepositions) to be retained in
an NLP task, those words should not be removed.
Moreover, preprocessing should be consistent across all data.
For example, if outliers, incomplete data, or noise are
removed, it must be done uniformly.</p>
      <p>Training and testing data When several machine learning
models are being compared, the data used to train those
models or, separately, evaluate those models, should be
identical. These data should be ecologically valid,
statistically indistinct, or otherwise similar to data expected to
be observed in deployment.</p>
      <p>
        Representative data The data should be as free of
sampling bias as possible. That is, the distribution of classes in
the data should be identical to their distribution in the real
world, to the extent possible. There may be special
considerations to this point. To some extent, models trained
on historical data may encapsulate biases from the past
that the developers wish removed, such as demographic
biases towards recidivism or gender biases in word
embeddings
        <xref ref-type="bibr" rid="ref2">(Bolukbasi et al. 2016)</xref>
        ; if the system is meant to
enable decision support prospectively, techniques to
mitigate bias should be taken.
      </p>
      <p>Appropriate baselines Any classifier of interest should be
compared against at least one representative, appropriate
baseline. Trivial baselines should not be considered. A
trivial baseline, for example, always predicts the
majority class and, in general, is not the result of a machine
learning process.</p>
      <p>Appropriate measures There is a tendency to report
accuracy or area under the precision-recall (or other operator
characteristic) curves in nominal classification; however,
this is not always correct. For example, systems that
predict cause-of-death according to international standards
for disease coding should not merely report accuracy,
but should include the cause-specific mortality fraction
(CSMF), which is the fraction of in-hospital deaths for
a given cause normalized over all causes (Murray et al.
2007). CSMF accuracy is therefore a measure of
predictive quality at the population level, which quantifies how
closely the estimated CSMF values approximate the truth.
In fact, when it is possible to compute its associated
coefficients, the chance-corrected version of CSMF accuracy
should be used instead (Flaxman et al. 2015). Clearly, the
measure used, itself, can be highly context-dependent and
result in very different outcomes for different samples.</p>
      <sec id="sec-3-1">
        <title>Limiting information leakage It is necessary to partition</title>
        <p>data between training sets and test sets in such a way so
that no latent information exists across sets, other than
directly obtained from observation variables. This can occur
when latent information is highly correlated to labels,
annotation, or other supervised information.</p>
        <p>For example, a system may be designed to classify
between people with and without neurodegeneration from
audio (Fraser, Meltzer, and Rudzicz 2015) and have
multiple data points recorded from each human subject in
the data. Some acoustic features, such as vocal jitter or
phonation rates, may be used to identify pathology
crosssectionally, but they can also be used to identify the
speaker themselves. Since each speaker is associated with
a label for the outcome, even if individual samples are
partitioned across training and test sets, it would be
inappropriate if an individual speaker is represented in both
sets. This is because any model could learn the identity
of a speaker from the training data, and apply the known
label to test data, tainting the results. Leave-one-out
crossvalidation is one mitigation strategy.</p>
        <p>Limiting channel effects A channel effect occurs when a
classifier may learn characteristics of the manner in which
data were recorded, in addition to the nature of the data
themselves. For example, a hospital-based system may be
designed to classify among patient data. However, if all or
most patients with complex cancers seek treatment in
urban centres, then a classifier may learn to associate those
cancers with certain regions.</p>
        <p>Channel effects can be caused by the mechanism used
to obtain the data, any preprocessing that occurred on
one or more proper subsets of the data, the identity of
the individual or individuals obtaining the data, or
environmental changes in which data were recorded, for
example. If these effects cannot be controlled, they must
be accounted for as covariates during statistical
significance testing. Additionally, strategies have been
developed to explicitly factor out channel effects, as with
iVectors through expectation-maximization, probabilistic
linear discriminant analysis, and factor analysis (Verma
and Das 2015).</p>
        <p>Furthermore, appropriate statistical tests of significance
must be undertaken, when possible, in order to establish
whether there is any meaningful difference between
approaches. A difference of 0:5%, for example, on a single test
set is not necessarily conclusive with regards to the models
compared. Naturally, tests of significance can also be
misused (i.e., so-called ‘p-hacking’), so effort must be taken to
choose appropriate tests. For example, if a test has an
assumed distribution (as in standard t-tests), then the validity
of that assumption in the data should also be evaluated (e.g.,
through a Lilliefors or Kolmogorov-Smirnov test). If
multiple comparisons are made (e.g., through multiple
hyperparameterizations), then this must be accounted for also,
e.g., through a Bonferroni test. Alternatively, standardized
computations of effect sizes (e.g., Cohen’s d) can mitigate
against the risks of p-hacking. Finally, where possible, all
relevant covariates must be accounted for in the model; as
mentioned above, this includes all aspects in the channel,
including confounding variables in the data themselves, that
could effect the outcomes, as well as the interactions
between those variables.</p>
        <p>The assessment of nominal classification or continuous
regression has been implicit in the discussion above, but the
same principles may be taken in evaluating reinforcement
learning, for example.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Recommendations</title>
      <p>Ensuring the safety and maintenance of AI systems will be
the subject of various standardization efforts, including
explainable models, unintended biases (including cultural,
social, historical, or sampling biases), human-machine
interaction, and scalable oversight. However, the clear-eyed
refinement of the actual evaluation methodologies will be crucial
to many of these challenges. When so many independent
researchers and organizations (across academic, commercial,
or governmental sectors) are actively and competitively
engaged to achieve state-of-the-art performance, it is
essential to be able to objectively and quantitatively establish that
performance correctly, consistently, and with expected
minimal levels of reporting, otherwise any claims should not be
trusted.</p>
      <sec id="sec-4-1">
        <title>Endorsement versus enforcement</title>
        <p>As international standards are optional, a core concern is
whether these standards will be enforceable by governments,
given a variety of attitudes towards artificial intelligence by
the governments represented on standardization bodies.
Perhaps more concerning is whether standards in AI will even
be acknowledged or endorsed by developers and
practitioners, especially in academia. Agreeing upon a minimal set of
standards for evaluation, across sectors, may require a broad
cultural change within the ML community. Indeed, Dror et
al. (2018) showed through a meta-analysis that, while the
ML and natural language processing communities are driven
by experimental results, statistical significance testing is
ignored or misused most of the time. Therefore, as machine
learning becomes an increasingly applied science, empirical
methods should be emphasized early in relevant University
and educational programs, including Computer Science.</p>
        <p>Naturally, innovation should continue to be encouraged.
In fact, lowering certain regulatory barriers may promote
certain safe uses of ML in the service of the public
wellbeing, including healthcare. The point is not that risk should
be averted – there is substantial evidence that undertaking
certain kinds of risk can be beneficial when those risks are
understood. The challenge we face is that we do not truly
understand the risks of AI and ML – we barely understand
how to assess those systems in the first place, which should
be among our first priorities.</p>
        <p>With some exceptions, such as cases where human rights
are at risk, we do not propose that standards in AI should
inhibit our scientific exploration in any way. Nor do we
propose limits to the capabilities of AI systems – those will
largely be dependent on national or regional laws or
regulations. Rather, the international standardization community
has the opportunity to place certain expectations as to how
we, the ML community, evaluate our own work. Trusting the
objective quantitative performance of our systems is itself a
safety concern.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Final comments</title>
      <p>This paper represents the beginning of a long, multi-year
process in surveying challenges or shortcomings in the
evaluation of performance of machine learning, especially as it
relates to the trustworthiness of that performance. These are
not a complete set of requirements, nor are the
recommendations fully expressed. Training software, rather than
explicitly programming it, poses unique challenges and imposes a
specific development and test life cycle suitable that is
distinct from traditional software development and, crucially,
regulation and standardization. Accurately controlling for
the behaviours of deployed machine learning, through a
quantitative evaluation of its performance, is crucial.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>Rudzicz is the Canadian Chair of the mirror committee to
ISO/JTC 1 SC 42 on Artificial Intelligence.
Elson, J.; Douceur, J. R.; Howell, J.; and Saul, J. 2007.
Asirra: A CAPTCHA that Exploits Interest-Aligned
Manual Image Categorization. Proceedings of 14th ACM
Conference on Computer and Communications Security (CCS)
366–374.</p>
      <p>ISO/IEC. 2014. Safety aspects – guidelines for their
inclusion in standards. Standard, International Organization for
Standardization, Geneva, CH.</p>
      <p>Lashbrook, A. 2018. AI-Driven Dermatology Could Leave
Dark-Skinned Patients Behind. The Atlantic.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Adamson</surname>
            ,
            <given-names>A. S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Machine Learning and Health Care Disparities in Dermatology</article-title>
          .
          <source>JAMA Dermatology</source>
          <volume>154</volume>
          (
          <issue>11</issue>
          ):
          <fpage>1247</fpage>
          -
          <lpage>1248</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Bolukbasi</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chang</surname>
            , K.-w.; Zou,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Saligrama</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kalai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2016</year>
          . Man is to Computer Programmer as Woman
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>