<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On the Use of Available Testing Methods for Verification &amp; Validation of AI-based Software and Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Franz Wotawa</string-name>
          <email>wotawa@ist.tugraz.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Graz University of Technology, Institute for Software Technology Inffeldgasse 16b/2</institution>
          ,
          <addr-line>A-8010 Graz</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Verification and validation of software and systems is an essential part of the development cycle for meeting given quality criteria, including functional and non-functional requirements. Testing, and in particular its automation, has been an active research area for decades, providing many methods and tools for automating test case generation and execution. Due to the increasing use of AI in software and systems, the question arises whether available testing techniques can be utilized in the context of AI-based systems. In this position paper, we elaborate on testing issues arising when using AI methods in systems, consider the case of different stages of AI, and start investigating the usefulness of certain testing methods for testing AI. We focus especially on testing at the system level, where we are interested not only in assuring that a system is correctly implemented but also that it meets given criteria, such as not contradicting moral rules, or being dependable. We state that some well-known testing techniques can still be applied, provided they are tailored to the specific needs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Because of the growing importance of AI methodologies
for current and future software and systems, there is a need
to come up with appropriate quality assurance measures.
Such measures should provide certain guarantees that
the resulting products fulfill their requirements, e.g., provide
the requested functionality and meet safety concerns. Providing
guarantees seems to be essential in order to gain trust in
AI-based system solutions. In particular, in autonomous driving,
to mention one recent application area of AI, we have
to establish a certification and homologation process that
assures that an autonomous vehicle follows given regulations and
other requirements.</p>
      <p>Because artifacts making use of AI
technology are themselves systems, the question is whether it is
possible to re-use ordinary testing methodologies and to adapt
them for providing means for certification and
homologation. In particular, besides components like vision systems
relying on machine learning, there are other components that
do not rely on any AI methodology. In Figure 1 we give an
overview of the architecture of such a system, comprising the
AI component and other components implementing
functionality like user interfaces or database access. In
addition, such systems rely on a computational stack where
we also have to consider the operating system, firmware, and
even the hardware for verification and validation purposes.
As a consequence, we have to consider verification and
validation of the whole system for quality assurance.</p>
      <p>
        In a previous paper
        <xref ref-type="bibr" rid="ref18 ref2 ref40">(Wotawa 2019)</xref>
        , we already focused
on the need for system testing. In contrast, in this paper, we
try to give a first answer regarding the usefulness of certain
available system testing methods for testing AI applications.
Furthermore, we discuss the corresponding general
verification and validation problem of such applications in more
detail. We always have to understand what we want to test
and what we want to achieve. We also have to be aware of
shortcomings arising when focusing only on subparts of the
overall verification and validation problem. First, faults
often arise because of untested interactions between different
system components. Such cases may arise because of
unintended interactions not considered during development.
Second, we might not be able to make sufficient guarantees
regarding the degree of testing. And finally, we may miss
critical inputs or scenarios that lead to trouble. The latter
especially holds for different machine learning approaches
and is referred to as adversarial attacks (see e.g.,
        <xref ref-type="bibr" rid="ref18 ref2 ref33 ref34">(Su, Vargas,
and Sakurai 2019)</xref>
        and
        <xref ref-type="bibr" rid="ref39 ref6">(Goodfellow, McDaniel, and
Papernot 2018)</xref>
        ).
      </p>
      <p>We organize this paper as follows. In the next section, we
discuss the system testing challenge in detail. We focus on
different aspects of testing to be considered and refer to
related literature. Afterwards, we present three system testing
approaches that have proven to find faults when
testing systems using AI techniques. Finally, we summarize
the obtained findings.</p>
    </sec>
    <sec id="sec-2">
      <title>The testing challenge</title>
      <p>
        As depicted in Figure 1, systems comprising AI methodology
also rely on other components providing interfaces and
functionality, as well as runtime support including operating
systems, firmware, and hardware. As a consequence, we have to
consider testing as a holistic activity that has to take care of
all the different parts of the whole system. In particular, we have
to clarify what to test and how to test. For example, a
logic-based reasoning system comprises a compiler for reading
in the logic rules and facts, and the reasoning part. Hence,
we have to test the compiler and the reasoning part first
separately and afterwards together in close interaction. The
compiler can be tested, for example, using fuzzing, where
more or less random inputs are generated (see
e.g.,
        <xref ref-type="bibr" rid="ref18 ref2 ref33 ref40">(Köroglu and Wotawa 2019)</xref>
        ). The reasoning engine
itself can be tested using certain known relations, such as that the
sequence of rules provided to the system does not influence
the final outcome (see e.g.,
        <xref ref-type="bibr" rid="ref39">(Wotawa 2018)</xref>
        ). The overall
system itself may be tested using fault injection, e.g.,
        <xref ref-type="bibr" rid="ref38">(Wotawa
2016)</xref>
        . All these examples have – more or less – in common
that they only capture some parts of the expected behavior.
      </p>
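The rule-order relation mentioned above can be checked automatically. The following Python sketch illustrates the idea; the tiny forward-chaining `infer` function is a hypothetical stand-in for a real reasoning engine, not an implementation from the cited work:

```python
import itertools

def infer(rules, facts):
    """Toy forward-chaining reasoner standing in for the engine under
    test; rules are (premise, conclusion) pairs over atomic facts."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise in known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known

# Known relation used as an oracle: the order in which rules are
# given to the engine must not influence the derived facts.
rules = [("a", "b"), ("b", "c"), ("c", "d")]
facts = ["a"]
expected = infer(rules, facts)
for permutation in itertools.permutations(rules):
    assert infer(list(permutation), facts) == expected
```

Because the relation holds for every permutation, any violation directly signals a fault in the engine without needing a hand-written expected result per test case.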
      <p>
        When using fault injection, we are interested in how systems
react to inputs that occur in case of faults. When using
invariants like the order of rules, we do not test all aspects
of reasoning. Hence, in order to thoroughly test such
systems, we need to understand what to test in order to identify
shortcomings of the underlying testing methods to be used.
Besides this, and more specifically for AI methods, we have to
provide some measures that at least indicate the quality of
testing. For ordinary programs, coverage (e.g.,
        <xref ref-type="bibr" rid="ref1">(Ammann,
Offutt, and Huang 2003)</xref>
        ) and mutation score (e.g.,
        <xref ref-type="bibr" rid="ref10">(Jia and
Harman 2011)</xref>
        ) are used to determine whether test suites are
good enough, i.e., likely able to reveal a faulty
behavior. Coverage helps to identify those parts of the program
that are executed using the test suite, i.e., code coverage
(note that besides code coverage there are other coverage
definitions, such as test input coverage or combinatorial coverage).
The mutation score is an indicator of the number of program
variants, i.e., the mutants, that can be detected using the
given test suite. It is worth noting that coverage or mutation
score can be seen as a measure or indicator for guaranteeing
that a test suite has the required capabilities for detecting a
failing behavior.
      </p>
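As an illustration of the mutation score, consider the following minimal sketch; the function under test and its mutants are made up for this example (in practice, a mutation testing tool generates the mutants automatically):

```python
def original(x):
    # Toy function under test.
    return abs(x)

# Hand-written mutants standing in for tool-generated program variants.
mutants = [
    lambda x: -abs(x),  # negated return value
    lambda x: x,        # call to abs dropped
    lambda x: abs(x),   # equivalent mutant: cannot be killed
]

test_inputs = [-3, 0, 5]

def mutation_score(mutants, test_inputs):
    """A mutant is 'killed' when at least one test input exposes a
    difference between the mutant and the original program."""
    killed = sum(
        any(m(t) != original(t) for t in test_inputs)
        for m in mutants
    )
    return killed / len(mutants)

print(mutation_score(mutants, test_inputs))  # 2 of 3 mutants killed
```

The third mutant is behaviorally equivalent to the original, which is exactly why a mutation score below 1.0 does not necessarily indicate a weak test suite.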
      <p>Let us consider testing neural networks as an example.
Neural networks are trained using a set of examples and
evaluated afterwards. Evaluation is used for assuring that a
network reaches a given quality of the prediction outcome.
The sets of examples used for training and evaluation have to
be distinct. The question now is whether this evaluation is
good enough to replace further testing effort. The answer
is no, not only because of adversarial attacks (Su, Vargas,
and Sakurai 2019) that lead to misclassifications even in
case of small input variations. Other reasons for
misclassifications are the use of a training data set that does not cover
all different examples, and other aspects like the
distribution of examples. Furthermore, note that variations of the
appearance of objects in the real world often exist. In
Figure 2 we depict different images of the traffic sign "do not
enter", ranging from a bent sign to occlusions caused by
stickers attached. An autonomous car would always be required to
handle these cases, and it is very unlikely that we really have
all such cases represented in the training data set.
Moreover, even if so, we would still have misclassifications occurring,
requiring us to assure that there is no unwanted effect on the
behavior of the overall system.</p>
      <p>
        There is plenty of literature regarding different testing
approaches for neural networks, e.g.,
        <xref ref-type="bibr" rid="ref27 ref28 ref31 ref36 ref39">(Pei et al. 2017; Sun,
Huang, and Kroening 2018; Ma et al. 2018b,a)</xref>
        and most
recently
        <xref ref-type="bibr" rid="ref11 ref18 ref18 ref2 ref2 ref33 ref33">(Kim, Feldt, and Yoo 2019; Sekhon and Fleming
2019)</xref>
        . In some of these methods, adapted versions of
coverage and mutation score for neural networks have been
used. Unfortunately, coverage information may be somewhat
misleading
        <xref ref-type="bibr" rid="ref25 ref32">(Li et al. 2019)</xref>
        , leaving the question regarding
the quality of the test suite open.
      </p>
      <p>
        In the case of neural networks, we may also ask whether
classical coverage or mutation score as used in ordinary
software engineering can serve as a quality measure when
testing a current neural network implementation.
        <xref ref-type="bibr" rid="ref18 ref2 ref33 ref40 ref5">(Chetouane,
Klampfl, and Wotawa 2019)</xref>
        showed that making use of
these measures when testing the configuration of neural
networks, i.e., setting the type of neurons and the number of layers
and neurons, can be justified. Unfortunately, this is not the
case when testing the whole neural network library, as
discussed in
        <xref ref-type="bibr" rid="ref13">(Klampfl, Chetouane, and Wotawa 2020)</xref>
        . Hence,
for neural networks, other measures and means for testing have
to be provided.
      </p>
      <p>
        Although we may need to live with the challenge that we
cannot completely test certain system parts and that there
is always a critical case where the AI part of a system may
deliver a wrong result, the further question is whether this
constitutes a problem for the whole system. The answer in
this case is no, provided that the system itself is able to
detect this critical case and to react appropriately. For example,
in autonomous driving, we may make use of more than one
sensor for obtaining information regarding objects around
the vehicle and use sensor fusion to obtain reliable
information. We only need to assure that the whole system interacts
with the environment in a way that is dependable and
fulfills our requirements, including ethical or moral
considerations. Hence, identifying critical scenarios between the
system and its environment seems to be a crucial factor of
testing AI-based systems
        <xref ref-type="bibr" rid="ref16 ref29 ref39">(Koopman and Wagner 2016; Menzel,
Bagschik, and Maurer 2018)</xref>
        .
      </p>
      <p>Moreover, it also seems important to consider that
critical scenarios often originate from different settings that
have to occur at the same time. One issue, e.g., missing a
certain traffic sign, may not lead to an accident on its own, but in
combination with other issues it would.</p>
      <p>We summarize our discussion in the following position:
Position 1 Testing aims at identifying interactions between
the system under test and its environment leading to an
unexpected behavior. When testing systems utilizing AI, we
have to consider testing all parts of a system, including
those with and those without AI, as well as their
interactions. Evaluating performance characteristics of
implemented AI methodology may not be sufficient for assuring
that quality criteria are met.</p>
      <p>Most testing is performed during development of
systems before deployment. In some cases certification (or even
homologation), i.e., the formal confirmation that an
application, product, or system meets its required characteristics, is
needed. In the case of AI technology, we are interested in the
system fulfilling dependability goals like safety, but maybe also
given ethical or moral rules. For example, we want a
conversational agent or a decision support system not to be racist or
sexist. Furthermore, because the system’s
underlying software is updated regularly in order to cope with
changes required because of bugs or improved functionality,
there is a need to carry out any certification regularly as
well. For example, in autonomous driving we have to assure
that a new software update is not going to lead to an unsafe
system. However, regression testing may require a lot of
effort or come with high costs, which may be reduced by
automating testing.</p>
      <p>Hence, automating at least part of certification may be a
future requirement. But how can certification of AI be
carried out? What we need is a process where we identify what
we want to achieve, and how this can be checked (or tested).
How can we come up with certain parameters justifying that
testing is appropriate? We shall also think about the
methods for checking, their limitations, and how to assure that
the methods can guarantee (with respect to a given certainty)
that the system fulfills the requested needs. In any case,
in order to bring AI technology into practice, we have to
convince customers that the systems cause no harm.
Certification that takes into account such customers’ considerations
as well as regulations provides the right means for further
supporting the delivery of AI technology into the practical
applications we are using on a daily basis.</p>
      <p>
        It is worth noting that there are many initiatives, like the
ethics guidelines for trustworthy AI
        <xref ref-type="bibr" rid="ref32">(Pietilä et al. 2019)</xref>
        ,
taking first steps towards how AI-based systems have
to be constructed, evaluated, and verified. However, for
example, in autonomous driving such principles have to be
concretized, leading to practical rules companies can follow
when developing AI systems or systems at least partially
based on AI methodologies and tools.
      </p>
      <p>Position 2 There is a need for well-defined certification and
homologation processes for AI-based systems that ideally
can be carried out in an automated way. Such certification
and homologation processes shall rely on existing guidelines
considering all aspects of trustworthy AI.</p>
      <p>If we want to carry out certification at least partially
automated, we may rely on testing. Hence, we have to ask
whether existing testing techniques can be used
for confirming that an AI-based system fulfills regulations
and other rules and expectations. Besides testing
functionality, this includes the degree to which generally agreed
ethical and moral rules are fulfilled. In the following section, we
introduce three techniques that can (at least partially) serve this
purpose.</p>
    </sec>
    <sec id="sec-3">
      <title>Testing AI</title>
      <p>
        As discussed, there seems to be a need for testing the whole
system, considering functional and non-functional
requirements including moral and ethical rules. For testing systems
at the system level, black-box approaches are used that do
not consider the internal structure. Various methods with
corresponding tools have been proposed, including
model-based testing (MBT)
        <xref ref-type="bibr" rid="ref37">(Utting and Legeard 2006)</xref>
        ,
combinatorial testing (CT)
        <xref ref-type="bibr" rid="ref21">(Kuhn et al. 2015)</xref>
        , or metamorphic
testing
        <xref ref-type="bibr" rid="ref4">(Chen, Cheung, and Yiu 1998)</xref>
        . MBT makes use of a
model of the system for obtaining test cases. In order to find
critical interactions between the system and its environment,
this may not be sufficient. It would be required to model the
environment including potential interactions and to observe
the reactions of the system.
      </p>
      <p>The focus on modeling the environment of the system in
order to obtain test cases is somewhat different from ordinary
MBT, where a model of the system is used for test case
generation. Changing from modeling the system to modeling
the environment is necessary for finding critical interactions
between an AI-based system and its environment. Moreover,
in this kind of testing we are not interested in showing that
an implementation works according to a model, but that it is
capable of handling arbitrary interactions that may not be
foreseen during development.</p>
      <p>
        In contrast to MBT, CT has been developed to search for
critical interactions between configuration parameters and
inputs. It has been shown that CT can effectively detect
faults in many different kinds of software
        <xref ref-type="bibr" rid="ref20">(Kuhn et al. 2009)</xref>
        .
      </p>
      <p>
        The question is whether we can also apply CT for AI
testing. In
        <xref ref-type="bibr" rid="ref22">(Li, Tao, and Wotawa 2020)</xref>
        the authors introduced
an approach utilizing a model of the system environment in
combination with CT for obtaining a test suite. In their
paper, the authors not only provide the foundations but also
report on a case study where they tested an automated
emergency braking (AEB) function. Out of 319 test cases, 9
test cases led to crashes (including test cases where
pedestrians would have been killed (see Figure 3)), and 30 were
considered critical. It is worth noting that the
proposed overall approach also includes a simulation
environment for carrying out the generated test cases in a realistic
setting automatically.
      </p>
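To make the CT idea concrete, the following sketch builds a greedy 2-way (pairwise) covering array over a hypothetical environment model; the parameter names and values are invented for illustration and are not those of the cited study, and real CT tools use more sophisticated generation algorithms:

```python
import itertools

# Hypothetical AEB scenario parameters forming the CT input model.
parameters = {
    "ego_speed":  [30, 50, 70],                    # km/h
    "weather":    ["dry", "rain", "snow"],
    "pedestrian": ["none", "crossing", "standing"],
    "light":      ["day", "night"],
}

def pairwise_suite(parameters):
    """Greedy 2-way covering array: every value pair of every two
    parameters appears in at least one generated test case."""
    names = list(parameters)
    uncovered = set()
    for n1, n2 in itertools.combinations(names, 2):
        for v1 in parameters[n1]:
            for v2 in parameters[n2]:
                uncovered.add(((n1, v1), (n2, v2)))
    suite = []
    while uncovered:
        best, best_gain = None, -1
        # Pick the full value combination covering the most new pairs.
        for combo in itertools.product(*(parameters[n] for n in names)):
            case = dict(zip(names, combo))
            gain = sum(1 for a, b in uncovered
                       if case[a[0]] == a[1] and case[b[0]] == b[1])
            if gain > best_gain:
                best, best_gain = case, gain
        suite.append(best)
        uncovered = {(a, b) for a, b in uncovered
                     if best[a[0]] != a[1] or best[b[0]] != b[1]}
    return suite

suite = pairwise_suite(parameters)
print(len(suite))  # far fewer cases than the 54 exhaustive combinations
```

Each abstract test case would then be concretized and executed in a simulation environment, with crashes or low time-to-collision values flagged as failures.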
      <p>
        <xref ref-type="bibr" rid="ref32">(Klück et al. 2019)</xref>
        introduced an alternative method for
generating critical scenarios, where the authors rely on
genetic algorithms for obtaining test cases. The idea is to
model test cases as genes that can be crossed and mutated.
An evaluation function maps test cases to a goodness value.
In each generation, the best test cases are taken, modified,
and evaluated again. This kind of testing is also referred to as
search-based testing. In
        <xref ref-type="bibr" rid="ref26">(Klück et al. 2019)</xref>
        the authors also
evaluated their approach using an AEB function. The
obtained results showed that genetic algorithms can be applied
to detect faults in the setting of autonomous and automated
driving, leading to the following position:
Position 3 Combinatorial testing and search-based testing
are effective testing techniques for identifying critical
scenarios.
      </p>
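The gene/crossover/mutation loop described above can be sketched as follows. Everything here is invented for illustration: the `fitness` function is a crude analytical stand-in for an actual simulation run, and the scenario encoding is not taken from the cited work:

```python
import random

random.seed(1)  # reproducibility of the sketch

def fitness(scenario):
    """Hypothetical stand-in for a simulation run: returns a rough
    minimal time to collision (s); lower values are more critical."""
    speed_kmh, distance_m, friction = scenario
    closing_ms = max(speed_kmh / 3.6, 0.1)
    braking = 8.0 * friction  # crude deceleration model (m/s^2)
    return max(distance_m / closing_ms - closing_ms / braking, 0.0)

def random_scenario():
    return (random.uniform(10, 130),   # ego speed (km/h)
            random.uniform(5, 100),    # obstacle distance (m)
            random.uniform(0.2, 1.0))  # road friction coefficient

def mutate(s):
    i = random.randrange(3)  # perturb one randomly chosen gene
    return tuple(v * random.uniform(0.8, 1.2) if j == i else v
                 for j, v in enumerate(s))

def crossover(a, b):
    return tuple(random.choice(pair) for pair in zip(a, b))

population = [random_scenario() for _ in range(20)]
for _ in range(30):
    population.sort(key=fitness)  # most critical scenarios first
    parents = population[:10]
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(10)]
    population = parents + children

best = min(population, key=fitness)
print(round(fitness(best), 3))  # TTC (s) of the most critical scenario found
```

Driving the fitness toward low TTC values steers the search toward near-crash scenarios, which is exactly the kind of critical interaction this form of testing aims to uncover.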
      <p>CT and search-based testing applied to test
autonomous and automated driving functions always check
the property that no crash with another car or even a
pedestrian occurs. In this context, closeness to a crash is often
represented as the time to collision (TTC), where 0 means
that a crash occurs. Usually, in many applications, positive
but small TTC values may also be considered unwanted.
When testing in the automotive domain,
including autonomous driving, we can always rely on the TTC for
judging whether a test case passes or fails. Hence, the
test oracle can be automated using the TTC, which is not
always the case when testing AI. We therefore require other
means for dealing with the oracle problem, i.e., providing
a function that allows us to distinguish passing executions of
programs and systems from failing ones.</p>
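A TTC-based oracle is simple to automate, as the following sketch shows; the 1.5 s threshold is an assumed example value, not a normative one:

```python
def time_to_collision(gap_m, ego_speed_ms, lead_speed_ms):
    """TTC in seconds; infinite when the ego vehicle is not closing in."""
    closing = ego_speed_ms - lead_speed_ms
    return gap_m / closing if closing > 0 else float("inf")

def verdict(min_ttc_s, threshold_s=1.5):
    """Oracle: fail on a crash (TTC 0) or when the minimal TTC observed
    during a test drops below the safety threshold."""
    return "fail" if min_ttc_s <= threshold_s else "pass"

# 50 m gap, closing at 10 m/s -> TTC of 5 s, above the threshold.
print(verdict(time_to_collision(50.0, 20.0, 10.0)))  # "pass"
```

In a real setting, the minimal TTC would be the smallest value observed over an entire simulated scenario rather than a single snapshot.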
      <p>
        The objective behind metamorphic testing
        <xref ref-type="bibr" rid="ref4">(Chen,
Cheung, and Yiu 1998)</xref>
        is to provide a solution to the oracle
problem of testing. The underlying idea is to define
relations over different inputs that always deliver the same
output. For example, sin(x) is equivalent for all values x
and x + 2π, i.e., sin(x) = sin(x + 2π) always holds.
In
        <xref ref-type="bibr" rid="ref32 ref9">(Guichard et al. 2019)</xref>
        and more specifically in
        <xref ref-type="bibr" rid="ref18 ref2 ref33 ref40">(Bozic and
Wotawa 2019)</xref>
        the authors proposed the use of metamorphic
testing for testing conversational agents, i.e., chatbots. The
underlying idea was to propose relations
considering semantic relationships between words and sentences,
e.g., some sentences have the same semantics when
replacing one word with its synonym, or sometimes the sequence
of sentences given to a chatbot does not change the answer
provided by the chatbot. Moreover, we are able to test for
the fulfillment of certain moral and ethical regulations. For example,
if an answer of a chatbot should not be influenced by the
race or sex of the chat participant, we can formulate this
as a metamorphic relation, where we say that a conversation
involving one race or sex should lead to the same results
when changing the race or sex. In the case of AI systems where
we are able to come up with metamorphic relations, we are
also able to apply metamorphic testing for solving the oracle
problem.
      </p>
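The sin example above can be turned directly into an executable metamorphic test, a minimal sketch of the technique:

```python
import math
import random

# The metamorphic relation sin(x) = sin(x + 2*pi) itself serves as the
# oracle: no expected output value per input is needed.
random.seed(0)
for _ in range(1000):
    x = random.uniform(-100.0, 100.0)
    assert math.isclose(math.sin(x), math.sin(x + 2 * math.pi),
                        abs_tol=1e-9), f"relation violated at x = {x}"
```

The same pattern carries over to chatbots: generate a conversation, apply a semantics-preserving transformation (synonym replacement, or swapping race or sex), and assert that the answers agree, without ever specifying what the correct answer is.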
      <p>Position 4 Metamorphic testing seems to be of use for
addressing the test oracle problem of AI systems, allowing
us to identify contradictions with requirements, which may
include ethical and moral considerations.</p>
      <p>There are more system testing approaches that can also be
adapted to fit the purpose of AI testing with the objective of
assuring safety of AI-based systems and software. However,
we have identified approaches for which there is experimental
evidence that they can be effectively used for testing
AI-based systems. These approaches may also fit into
certification and homologation processes. For this purpose, certain
measures have to be developed that can be used for deciding
when to stop testing in case no failing test case could be
obtained.</p>
      <p>Moreover, the presented methods and techniques for
testing AI-based systems have disadvantages. They mainly
focus on quality assurance of the overall system and not
its comprising parts. For example, the CT approach
considers a model of the environment, which serves as the basis for
obtaining the CT input model. The approach tests whether
certain interactions of the SUT with its environment reveal
a fault, and in the case of automated or autonomous driving a
crash, but does not consider any knowledge regarding the
SUT’s internal structure or behavior. Finding out the root
cause of any misbehavior within the SUT might be
complicated. Moreover, we are not able to make use of quality
assurance measures like code coverage or mutation score for
the particular test suite. Furthermore, CT, like MBT, requires
concretizing the abstract test cases computed using these
testing methods. This concretization step causes additional
effort and has to be done carefully in order to come up with
good test cases that can be executed and most likely reveal a
fault.</p>
      <p>In the case of metamorphic testing, it is essential to define the
metamorphic relations, which causes additional effort and
influences the ability to serve as a good test oracle. There
may be metamorphic relations leading to test cases a SUT
can easily fulfill, allowing only a fraction of the
functionality to be tested. In such cases, metamorphic testing would not lead to
tests covering most of the functionality and can, therefore,
be considered incomplete. Search-based testing requires
implementing a search procedure using a function that allows
estimating the quality of a current test, e.g., the ability of
a test to reveal a fault. Again, this requires additional effort
and costs. It is worth noting that in some cases random
testing, i.e., generating test inputs using a random procedure,
also provides fault-revealing test cases, requiring even less
time than search-based testing at almost no additional cost.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this position paper, we focused on providing an answer to
the question whether there exist testing techniques that can
be efficiently used for checking that a software or system
comprising AI methodologies fulfills its requirements,
including moral and ethical rules, and regulations. We also
discussed the involved challenges of testing, where we
identified shortcomings that arise when focusing only on
specific parts and not providing a holistic view. Finally, we
introduced several testing methods that have been developed
in the context of testing ordinary systems and elaborated on
their usefulness in the context of AI-based systems.
Search-based testing, combinatorial testing, and metamorphic
testing seem to be excellent candidates for this purpose and may
also be of use for automating certification and homologation
processes for AI applications.</p>
      <p>However, further studies have to be carried out. For CT,
more experiments making use of other autonomous and
automated functions have to be considered. Moreover, we
need to come up with certain measures of guarantees for the
computed test suites. Parameters of CT like the
combinatorial strength may be sufficient, but in the context of AI-based
systems there is no experimental evidence. For metamorphic
testing, we further need more use cases and experimental
evaluations making use of AI-based systems. In the case of
chatbots and also logic-based reasoning, metamorphic
testing has already been successfully applied. However, there is
a need to show the usefulness of metamorphic testing also in
other applications where AI technology is a central part.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The research was supported by ECSEL JU under the project
H2020 826060 AI4DI - Artificial Intelligence for Digitising
Industry. AI4DI is funded by the Austrian Federal Ministry
of Transport, Innovation and Technology (BMVIT) under
the program "ICT of the Future" between May 2019 and
April 2022. More information can be retrieved from
https://iktderzukunft.at/en/.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Ammann</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Offutt</surname>
            , J.; and Huang,
            <given-names>H.</given-names>
          </string-name>
          <year>2003</year>
          .
          <article-title>Coverage Criteria for Logical Expressions</article-title>
          .
          <source>In Proceedings of the 14th International Symposium on Software Reliability Engineering</source>
          , ISSRE '
          <fpage>03</fpage>
          . Washington, DC, USA: IEEE Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Bozic</surname>
          </string-name>
          , J.; and
          <string-name>
            <surname>Wotawa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Testing Chatbots Using Metamorphic Relations</article-title>
          . In Gaston, C.;
          <string-name>
            <surname>Kosmatov</surname>
            , N.; and
            <given-names>Le</given-names>
          </string-name>
          <string-name>
            <surname>Gall</surname>
          </string-name>
          , P., eds.,
          <source>Testing Software and Systems</source>
          ,
          <volume>41</volume>
          -
          <fpage>55</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          Cham: Springer International Publishing.
          <source>ISBN 978-3-030- 31280-0.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cheung</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Yiu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>1998</year>
          .
          <article-title>Metamorphic testing: a new approach for generating next test cases</article-title>
          .
          <source>Technical report</source>
          , Department of Computer Science, Hong Kong University of Science and Technology, Hong Kong.
          <source>Technical Report HKUST-CS98-01.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Chetouane</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Klampfl</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Wotawa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Investigating the Effectiveness of Mutation Testing Tools in the Context of Deep Neural Networks</article-title>
          .
          <source>In IWANN (1)</source>
          , volume
          <volume>11506</volume>
          of Lecture Notes in Computer Science,
          <volume>766</volume>
          -
          <fpage>777</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>McDaniel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Papernot</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Making Machine Learning Robust Against Adversarial Inputs</article-title>
          .
          <source>Commun. ACM</source>
          <volume>61</volume>
          (
          <issue>7</issue>
          ):
          <fpage>56</fpage>
          -
          <lpage>66</lpage>
          . ISSN 0001-0782. doi:10.1145/3134599. URL http://doi.acm.org/10.1145/3134599.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Guichard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ruane</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bean</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ventresque</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Assessing the Robustness of Conversational Agents using Paraphrases</article-title>
          .
          <source>In 2019 IEEE International Conference On Artificial Intelligence Testing (AITest)</source>
          ,
          <fpage>55</fpage>
          -
          <lpage>62</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Harman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>An Analysis and Survey of the Development of Mutation Testing</article-title>
          .
          <source>IEEE Transactions on Software Engineering</source>
          <volume>37</volume>
          (
          <issue>5</issue>
          ):
          <fpage>649</fpage>
          -
          <lpage>678</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Feldt</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Yoo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Guiding Deep Learning System Testing Using Surprise Adequacy</article-title>
          .
          <source>In Proceedings of the 41st International Conference on Software Engineering</source>
          , ICSE'19,
          <fpage>1039</fpage>
          -
          <lpage>1049</lpage>
          . IEEE Press. doi:10.1109/ICSE.2019.00108. URL https://doi.org/10.1109/ICSE.2019.00108.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Klampfl</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chetouane</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Wotawa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Mutation Testing for Artificial Neural Networks: An Empirical Evaluation</article-title>
          .
          <source>In IEEE 20th International Conference on Software Quality, Reliability and Security (QRS)</source>
          ,
          <fpage>356</fpage>
          -
          <lpage>365</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <year>2019</year>
          .
          <article-title>Performance Comparison of Two Search-Based Testing Strategies for ADAS System Validation</article-title>
          . In
          <string-name>
            <surname>Gaston</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kosmatov</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Le Gall</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , eds.,
          <source>Testing Software and Systems</source>
          ,
          <fpage>140</fpage>
          -
          <lpage>156</lpage>
          . Cham: Springer International Publishing. ISBN 978-3-030-31280-0.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Koopman</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Wagner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Challenges in Autonomous Vehicle Testing and Validation</article-title>
          .
          <source>SAE Int. J. Trans. Safety</source>
          <volume>4</volume>
          :
          <fpage>15</fpage>
          -
          <lpage>24</lpage>
          . doi:10.4271/2016-01-0128. URL https://doi.org/10.4271/2016-01-0128.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Köroglu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Wotawa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Fully automated compiler testing of a reasoning engine via mutated grammar fuzzing</article-title>
          . In
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Escalona</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Herzig</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , eds.,
          <source>Proceedings of the 14th International Workshop on Automation of Software Test, AST@ICSE 2019, May 27, 2019, Montreal, QC, Canada</source>
          ,
          <fpage>28</fpage>
          -
          <lpage>34</lpage>
          . IEEE / ACM. doi:10.1109/AST.2019.00010. URL https://doi.org/10.1109/AST.2019.00010.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Kuhn</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kacker</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lei</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Hunter</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Combinatorial Software Testing</article-title>
          .
          <source>Computer</source>
          <fpage>94</fpage>
          -
          <lpage>96</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Kuhn</surname>
            ,
            <given-names>D. R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bryce</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Duan</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ghandehari</surname>
            ,
            <given-names>L. S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lei</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kacker</surname>
            ,
            <given-names>R. N.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Combinatorial Testing: Theory and Practice</article-title>
          .
          <source>In Advances in Computers</source>
          , volume
          <volume>99</volume>
          ,
          <fpage>1</fpage>
          -
          <lpage>66</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Wotawa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Ontology-based test generation for automated and autonomous driving functions</article-title>
          .
          <source>Inf. Softw. Technol.</source>
          <volume>117</volume>
          . doi:10.1016/j.infsof.2019.106200. URL https://doi.org/10.1016/j.infsof.2019.106200.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Structural Coverage Criteria for Neural Networks Could Be Misleading</article-title>
          .
          <source>In 2019 IEEE/ACM 41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER)</source>
          ,
          <fpage>89</fpage>
          -
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xue</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Juefei-Xu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; et al.
          <year>2018a</year>
          .
          <article-title>Deepmutation: Mutation testing of deep learning systems</article-title>
          .
          <source>In 2018 IEEE 29th International Symposium on Software Reliability Engineering (ISSRE)</source>
          ,
          <fpage>100</fpage>
          -
          <lpage>111</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xue</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2018b</year>
          .
          <article-title>Combinatorial testing for deep learning systems</article-title>
          .
          <source>arXiv preprint arXiv:1806.07723</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <surname>Menzel</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bagschik</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Maurer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Scenarios for Development, Test and Validation of Automated Vehicles</article-title>
          .
          <source>arXiv:1801.08598</source>
          . URL https://arxiv.org/abs/1801.08598. Appeared in Proc. of the IEEE Intelligent Vehicles Symposium.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>Pei</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Jana</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>DeepXplore: Automated whitebox testing of deep learning systems</article-title>
          .
          <source>In Proceedings of the 26th Symposium on Operating Systems Principles</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <surname>Pietilä</surname>
            ,
            <given-names>P. A.</given-names>
          </string-name>
          ; et al.
          <year>2019</year>
          .
          <article-title>Ethics Guidelines For Trustworthy AI</article-title>
          .
          High-Level Expert Group on AI, European Commission.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <surname>Sekhon</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Fleming</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Towards Improved Testing For Deep Learning</article-title>
          .
          <source>In 2019 IEEE/ACM 41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER)</source>
          ,
          <fpage>85</fpage>
          -
          <lpage>88</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Vargas</surname>
            ,
            <given-names>D. V.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Sakurai</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>One Pixel Attack for Fooling Deep Neural Networks</article-title>
          .
          <source>IEEE Transactions on Evolutionary Computation</source>
          1-1. ISSN 1089-778X. doi:10.1109/TEVC.2019.2890858.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kroening</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Testing deep neural networks</article-title>
          .
          <source>arXiv preprint arXiv:1803.04792</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <string-name>
            <surname>Utting</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Legeard</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>Practical Model-Based Testing - A Tools Approach</article-title>
          . Morgan Kaufmann Publishers Inc.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <string-name>
            <surname>Wotawa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Testing Self-Adaptive Systems using Fault Injection and Combinatorial Testing</article-title>
          .
          <source>In Proceedings of the Intl. Workshop on Verification and Validation of Adaptive Systems (VVASS 2016)</source>
          . Vienna, Austria.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <string-name>
            <surname>Wotawa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Combining Combinatorial Testing and Metamorphic Testing for Testing a Logic-based NonMonotonic Reasoning System</article-title>
          .
          <source>In Proceedings of the 7th International Workshop on Combinatorial Testing (IWCT) / ICST 2018</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <string-name>
            <surname>Wotawa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>On the importance of system testing for assuring safety of AI systems</article-title>
          .
          <source>In CEUR Workshop Proceedings, Workshop on Artificial Intelligence Safety, AISafety 2019</source>
          , volume
          <volume>2419</volume>
          . Macao, China. URL http://ceur-ws.org/Vol-2419/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>