On the Use of Available Testing Methods for Verification & Validation of AI-based Software and Systems

Franz Wotawa
Graz University of Technology, Institute for Software Technology
Inffeldgasse 16b/2, A-8010 Graz, Austria
wotawa@ist.tugraz.at

Abstract

Verification and validation of software and systems is an essential part of the development cycle in order to meet given quality criteria including functional and non-functional requirements. Testing, and in particular its automation, has been an active research area for decades, providing many methods and tools for automating test case generation and execution. Due to the increasing use of AI in software and systems, the question arises whether it is possible to utilize available testing techniques in the context of AI-based systems. In this position paper, we elaborate on testing issues arising when using AI methods for systems, consider the case of different stages of AI, and start investigating the usefulness of certain testing methods for testing AI. We focus especially on testing at the system level, where we are interested not only in assuring that a system is correctly implemented but also that it meets given criteria like not contradicting moral rules, or being dependable. We state that some well-known testing techniques can still be applied, provided they are tailored to the specific needs.

Introduction

Because of the growing importance of AI methodologies for current and future software and systems, there is a need for coming up with appropriate quality assurance measures. Such measures should provide certain guarantees that the resulting products fulfill their requirements, e.g., provide the requested functionality and meet safety concerns. Providing guarantees seems to be essential in order to gain trust in AI-based system solutions. In particular, in autonomous driving, to mention one recent application area of AI, we have to establish a certification and homologation process that assures that an autonomous vehicle follows given regulations and other requirements.

Because artifacts making use of AI technology are themselves systems, the question is whether it is possible to re-use ordinary testing methodologies and to adapt them for providing means for certification and homologation. In particular, besides components like vision systems relying on machine learning, there are other components that do not rely on any AI methodology. In Figure 1 we give an overview of the architecture of such a system comprising the AI component and other components implementing functionality like providing user interfaces or database access. In addition, such systems rely on a computational stack where we also have to consider the operating system, firmware, and even the hardware for verification and validation purposes. As a consequence, we have to consider verification and validation of the whole system for quality assurance.

Figure 1: AI-based systems, their boundaries, and environment.

In a previous paper (Wotawa 2019), we already focused on the need for system testing. In contrast, in this paper, we try to give a first answer regarding the usefulness of certain available system testing methods for testing AI applications. Furthermore, we discuss the corresponding general verification and validation problem of such applications in more detail. We always have to understand what we want to test and what we want to achieve. We also have to be aware of shortcomings arising when focusing only on subparts of the overall verification and validation problem. First, faults often arise because of untested interactions between different system components. Such cases may arise because of unintended interactions not considered during development. Second, we might not be able to sufficiently make guarantees regarding the degree of testing. And finally, we may miss critical inputs or scenarios that lead to trouble. The latter especially holds for different machine learning approaches and is referred to as adversarial attacks (see, e.g., (Su, Vargas, and Sakurai 2019) and (Goodfellow, McDaniel, and Papernot 2018)).

We organize this paper as follows. In the next section, we discuss the system testing challenge in detail. We focus on different aspects of testing to be considered and refer to related literature. Afterwards, we present three approaches to system testing that have been proven to find faults when testing systems using AI techniques. Finally, we summarize the obtained findings.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The testing challenge

As depicted in Figure 1, systems comprising AI methodology also rely on other components providing interfaces and functionality, as well as runtime support including operating systems, firmware, and hardware. As a consequence, we have to consider testing as a holistic activity that has to take care of all different parts of the whole system. In particular, we have to clarify what to test and how to test. For example, a logic-based reasoning system comprises a compiler for reading in the logic rules and facts, and the reasoning part. Hence, we have to test the compiler and the reasoning part first separately and afterwards together in close interaction. The compiler can be tested, for example, using fuzzing, where more or less randomly generated inputs are provided to the program under test (see, e.g., (Köroglu and Wotawa 2019)). The reasoning engine itself can be tested using certain known relations, like that the sequence of rules provided to the system does not influence the final outcome (see, e.g., (Wotawa 2018)). The overall system itself may be tested using fault injection, e.g., (Wotawa 2016). All these examples have – more or less – in common that they only capture some parts of the expected behavior. If using fault injection, we are interested in how systems react to inputs that occur in case of faults. When using invariants like the order of rules, we do not test all aspects of reasoning. Hence, in order to thoroughly test such systems, we need to understand what to test in order to identify shortcomings of the underlying testing methods to be used.

Besides this, and more specifically for AI methods, we have to provide some measures that at least indicate the quality of testing. For ordinary programs, coverage (e.g., (Ammann, Offutt, and Huang 2003)) and mutation score (e.g., (Jia and Harman 2011)) are used to determine whether test suites are good enough, i.e., likely able to reveal a faulty behavior. Coverage helps to identify those parts of the program that are executed using the test suite, i.e., code coverage¹. The mutation score is an indicator of the number of program variants, i.e., the mutations, that can be detected using the given test suite. It is worth noting that coverage or mutation score can be seen as a measure or indicator for guaranteeing that a test suite has the required capabilities for detecting a failing behavior.

Let us consider testing neural networks as an example. Neural networks are trained using a set of examples and evaluated afterwards. Evaluation is used for assuring that a network reaches a given quality of the prediction outcome. The sets of examples used for training and evaluation have to be distinct. The question now is whether this evaluation is good enough to replace further testing effort. The answer is no, because of, but not only, adversarial attacks (Su, Vargas, and Sakurai 2019) that lead to misclassifications even in case of small input variations. Other reasons for misclassifications are the use of a training data set that does not cover all different examples, and other aspects like the distribution of examples. Furthermore, note that variations of the appearance of objects in the real world often exist. In Figure 2 we depict different images of the traffic sign "do not enter", ranging from a bend to occlusions because of stickers attached. An autonomous car would always be required to handle these cases, and it is very unlikely that we really have all such cases represented in the training data set. Moreover, even if so, we would still have misclassifications occurring, requiring us to assure that there is no unwanted effect on the behavior of the overall system.

Figure 2: Different variants of the "do not enter" traffic sign someone sees in reality: (a) is the original sign, (b) the traffic sign with a bend, a sticker, and partially missing color, and (c) and (d) are traffic signs with various stickers attached.

There is plenty of literature regarding different testing approaches for neural networks, e.g., (Pei et al. 2017; Sun, Huang, and Kroening 2018; Ma et al. 2018b,a) and most recently (Kim, Feldt, and Yoo 2019; Sekhon and Fleming 2019). In some of these methods an adapted version of coverage and mutation score for neural networks has also been used. Unfortunately, coverage information may be somewhat misleading (Li et al. 2019), leaving the question regarding the quality of the test suite open.

In the case of neural networks we may also ask whether classical coverage or mutation score used in ordinary software engineering can be used as a quality measure when testing a current neural network implementation. (Chetouane, Klampfl, and Wotawa 2019) showed that making use of these measures when testing the configuration of neural networks, i.e., setting the type of neurons, the number of layers and neurons, can be justified. Unfortunately, this is not the case when testing the whole neural network library as discussed in (Klampfl, Chetouane, and Wotawa 2020).
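To make the mutation-score measure discussed above concrete, it can be computed mechanically once a set of mutants and a test suite are fixed. The following is a minimal sketch; the function under test, the hand-written mutants, and the test inputs are invented for illustration only (real mutation tools generate mutants automatically):

```python
# Minimal mutation-score sketch. The function under test, its mutants,
# and the test suite are invented for illustration only.

def original(x, y):
    return x + y

# Hand-written mutants, each a small syntactic variant of the original.
mutants = [
    lambda x, y: x - y,   # arithmetic operator replacement
    lambda x, y: x * y,   # arithmetic operator replacement
    lambda x, y: x,       # drops the second operand entirely
]

# A test case is an input pair; the original's output serves as oracle.
test_suite = [(2, 2)]

def mutation_score(ms, tests):
    # A mutant is "killed" if at least one test distinguishes it
    # from the original; the score is the fraction of killed mutants.
    killed = sum(
        1 for m in ms
        if any(m(x, y) != original(x, y) for x, y in tests)
    )
    return killed / len(ms)

print(mutation_score(mutants, test_suite))  # 2/3: x*y coincides with x+y on (2, 2)
```

The surviving mutant shows why the score is an indicator of test-suite strength: adding a test case such as (1, 3), on which x*y and x+y differ, raises the score to 1.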
¹ Note that besides code coverage there are other coverage definitions used, like test input coverage, combinatorial coverage, etc.

Hence, for neural networks, other measures and means for testing shall be provided.

Although we may need to live with the challenge that we cannot completely test certain system parts, and that there is always a critical case where the AI part of a system may deliver a wrong result, the further question is whether this establishes a problem for the whole system. The answer in this case is no, provided that the system itself is able to detect this critical case and to react appropriately. For example, in autonomous driving, we may make use of more than one sensor for obtaining information regarding objects around the vehicle and use sensor fusion to obtain reliable information. We only need to assure that the whole system interacts with the environment in a way that is dependable and fulfills our requirements including ethical or moral considerations. Hence, identifying critical scenarios between the system and its environment seems to be a crucial factor of testing AI-based systems (Koopman and Wagner 2016; Menzel, Bagschik, and Maurer 2018).

Moreover, it seems also of importance to consider that critical scenarios often originate from different settings that have to occur at the same time. One issue, e.g., missing a certain traffic sign, may not lead to an accident, but in combination with other issues it would.

We summarize our discussion in the following position:

Position 1 Testing aims at identifying interactions between the system under test and its environment leading to an unexpected behavior. When testing systems utilizing AI, we have to consider testing all parts of a system, including the ones with and the ones without AI, as well as their interactions. Evaluating performance characteristics of implemented AI methodology may not be sufficient for assuring that quality criteria are met.

Most testing is performed during development of systems before deployment. In some cases certification (or even homologation), i.e., the formal confirmation that an application, product, or system meets its required characteristics, is needed. In the case of AI technology we are interested in whether the system fulfills dependability goals like safety, but maybe also given ethical or moral rules. For example, we want a conversational agent or a decision support system not to be racist or sexist. Furthermore, because the system's underlying software is updated regularly in order to cope with changes required because of bugs or improved functionality, there is a need for carrying out any certification regularly as well. For example, in autonomous driving we have to assure that a new software update is not going to lead to an unsafe system. However, regression testing may require a lot of effort or come with high costs, which may be reduced when automating testing.

Hence, automating at least part of certification may be a future requirement. But how can certification of AI be carried out? What we need is a process where we identify what we want to achieve, and how this can be checked (or tested). How can we come up with certain parameters justifying that testing is appropriate? We shall also think about the methods for checking, their limitations, and how to assure that the methods can guarantee (with respect to a given certainty) that the system fulfills the requested needs. However, in any case, in order to bring AI technology into practice, we have to convince customers that the systems cause no harm. Certification that takes into account such customer considerations as well as regulations provides the right means for further supporting the delivery of AI technology into the practical applications we use on a daily basis.

It is worth noting that there are many initiatives, like the ethics guidelines for trustworthy AI (Pietilä et al. 2019), for coming up with first steps of how AI-based systems have to be constructed, evaluated, and verified. However, for example, in autonomous driving such principles have to be concretized, leading to practical rules companies can follow when developing AI systems or systems at least partially based on AI methodologies and tools.

Position 2 There is a need for well-defined certification and homologation processes for AI-based systems that ideally can be carried out in an automated way. Such certification and homologation processes shall rely on existing guidelines considering all aspects of trustworthy AI.

When we want to carry out certification at least partially automated, we may rely on testing. Hence, we have to state the question whether existing testing techniques can be used for confirming that an AI-based system fulfills regulations and other rules and expectations. This includes, besides testing functionality, the degree of fulfilling generally agreed ethical and moral rules. In the following section, we introduce three techniques that can (at least partially) serve this purpose.

Testing AI

As discussed, there seems to be a need for testing the whole system considering functional and non-functional requirements including moral and ethical rules. For testing systems at the system level, black-box approaches are used that do not consider the internal structure. Various methods with corresponding tools have been proposed, including model-based testing (MBT) (Utting and Legeard 2006), combinatorial testing (CT) (Kuhn et al. 2015), or metamorphic testing (Chen, Cheung, and Yiu 1998). MBT makes use of a model of the system for obtaining test cases. In order to find critical interactions between the system and its environment this may not be sufficient. It would be required to model the environment, including potential interactions, and have a look at the reactions of the system.

The focus on modeling the environment of the system in order to obtain test cases is somewhat different from ordinary MBT, where a model of the system is used for test case generation. Changing from modeling the system to modeling the environment is necessary for finding critical interactions between an AI-based system and its environment. Moreover, in this kind of testing we are not interested in showing that an implementation works according to a model, but that it is capable of handling arbitrary interactions that may not be foreseen during development.

In contrast to MBT, CT has been developed to search for critical interactions between configuration parameters and inputs. It has been shown that CT can effectively detect faults in many different kinds of software (Kuhn et al. 2009). The question is whether we can also apply CT for AI testing. In (Li, Tao, and Wotawa 2020) the authors introduced an approach utilizing a model of the system environment in combination with CT for obtaining a test suite. In their paper, the authors not only provide the foundations but also report on a case study where they tested an automated emergency braking (AEB) function. From 319 test cases, 9 test cases led to crashes (including test cases where pedestrians would have been killed (see Figure 3)), and 30 were considered as being critical. It is worth noting that the proposed overall approach also includes a simulation environment for carrying out the generated test cases in a realistic setting automatically.

Figure 3: The last episode of a failing test case applied to an implementation of an automated emergency braking system, close to the time where a simulated pedestrian tries to cross the street coming from the right side. The crash occurred in a scenario where another vehicle in front brakes, causing the ego vehicle to brake. A first pedestrian crossing the street from the left passes by, and the second one, coming from the right, is overseen by the automated emergency braking system and hit.

(Klück et al. 2019) introduced an alternative method for generating critical scenarios, where the authors rely on genetic algorithms for obtaining test cases. The idea is to model test cases as genes that can be crossed and mutated. An evaluation function maps test cases to a goodness value. In each generation the best test cases are taken, modified, and again evaluated. This kind of testing is also referred to as search-based testing. In (Klück et al. 2019) the authors also evaluated the approach using an AEB function. The obtained results showed that genetic algorithms can be applied to detect faults in the setting of autonomous and automated driving, leading to the following position:

Position 3 Combinatorial testing and search-based testing are effective testing techniques for identifying critical scenarios.

CT and also search-based testing applied to test autonomous and automated driving functions always have to fulfill the property that no crash with another car or even a pedestrian occurs. In this context, closeness to a crash is often represented as the time to collision (TTC), where 0 means that a crash occurs. Usually, in many applications, positive but small TTC values may also be considered as unwanted. When testing in the automotive domain, including autonomous driving, we can always rely on the TTC for judging whether a test case passes or fails. Hence, there the test oracle can be automated using the TTC, which is not always the case when testing AI. We therefore require other means for dealing with the oracle problem, i.e., providing a function that allows us to distinguish passing executions of programs and systems from failing ones.

The objective behind metamorphic testing (Chen, Cheung, and Yiu 1998) is to provide a solution to the oracle problem of testing. The underlying idea is to define relations over different inputs that always deliver the same output. For example, sin(x) is equivalent for all values of x and x + 2·π, i.e., sin(x) = sin(x + 2·π) always holds. In (Guichard et al. 2019), and more specifically in (Bozic and Wotawa 2019), the authors proposed the use of metamorphic testing for testing conversational agents, i.e., chatbots. The underlying idea was to propose relations considering semantic relationships between words and sentences, e.g., some sentences have the same semantics when replacing one word with its synonym, or sometimes the sequence of sentences given to a chatbot does not change the answer provided by the chatbot. Moreover, we are able to test for fulfilling certain moral and ethical regulations. For example, if an answer of a chatbot should not be influenced by the race or sex of the chat participant, we can formulate this as a metamorphic relation, where we say that a conversation considering one race or sex should lead to the same results when changing race or sex. In the case of AI systems where we are able to come up with metamorphic relations, we are also able to apply metamorphic testing for solving the oracle problem.

Position 4 Metamorphic testing seems to be of use for addressing the test oracle problem of AI systems, allowing to identify contradictions with requirements, which may include ethical and moral considerations.

There are more system testing approaches that can also be adapted to fit the purpose of AI testing with the objective of assuring safety of AI-based systems and software. However, we have identified approaches for which there is experimental evidence that they can be effectively used for testing AI-based systems. These approaches may also fit into certification and homologation processes. For this purpose, certain measures have to be developed that can be used for deciding when to stop testing in case no failing test case could be obtained.

Moreover, the presented methods and techniques for testing AI-based systems have disadvantages. They are mainly focusing on quality assurance of the overall system and not its comprising parts.
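The oracle-free character of metamorphic testing (Position 4) can be illustrated with the sin(x) = sin(x + 2·π) relation mentioned above: the relation itself decides pass or fail, so no expected output is needed. The sketch below, including the injected fault, is illustrative only:

```python
import math
import random

# Metamorphic relation from the text: sin(x) = sin(x + 2*pi) for all x.
# The relation acts as the oracle; no reference output is required.
def relation_holds(f, x, tol=1e-9):
    return abs(f(x) - f(x + 2 * math.pi)) <= tol

def metamorphic_test(f, n_cases=100, seed=0):
    rng = random.Random(seed)
    failures = []
    for _ in range(n_cases):
        x = rng.uniform(-10.0, 10.0)  # randomly generated source inputs
        if not relation_holds(f, x):
            failures.append(x)
    return failures

# math.sin satisfies the relation (up to floating-point tolerance) ...
assert metamorphic_test(math.sin) == []

# ... while an implementation with an injected fault for negative inputs
# is revealed whenever x < 0 but x + 2*pi >= 0.
def buggy_sin(x):
    return math.sin(x) if x >= 0 else math.sin(x) + 1e-3  # injected fault

assert metamorphic_test(buggy_sin) != []
```

The same pattern carries over to the chatbot relations discussed above: replace the numeric relation by, e.g., "swapping a word for its synonym must not change the answer", and the test loop stays the same.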
For example, the CT approach considers a model of the environment, which works as the basis for obtaining the CT input model. The approach tests whether certain interactions of the SUT with its environment reveal a fault – in the case of automated or autonomous driving, a crash – but does not consider any knowledge regarding the SUT's internal structure or behavior. Finding out the root cause of any misbehavior within the SUT might be complicated. Moreover, we are not able to make use of quality assurance measures like code coverage or mutation score for the particular test suite. Furthermore, CT, like MBT, requires concretizing the abstract test cases computed using these testing methods. This concretization step causes additional effort and has to be done carefully in order to come up with good test cases that can be executed and most likely reveal a fault.

In the case of metamorphic testing it is essential to define the metamorphic relations, which causes additional effort and influences the ability to work as a good test oracle. There may be metamorphic relations leading to test cases an SUT can easily fulfill, only allowing to test a fraction of the functionality. In such cases metamorphic testing would not lead to tests covering most of the functionality and, therefore, can be considered as incomplete. Search-based testing requires implementing a search procedure using a function allowing to estimate the quality of a current test, e.g., the ability of a test to reveal a fault. Again, this requires additional effort and costs. It is worth noting that in some cases random testing, i.e., generating test inputs using a random procedure, also provides fault-revealing test cases, requiring even less time than search-based testing at almost no additional cost.

Conclusion

In this position paper, we focused on providing an answer to the question whether there exist testing techniques that can be efficiently used for checking that a software or system comprising AI methodologies fulfills requirements, including moral and ethical rules, and regulations. We also discussed the involved challenges of testing, where we identified shortcomings that arise when only focusing on specific parts and not providing a holistic view. Finally, we introduced several testing methods that have been developed in the context of testing ordinary systems and elaborated on their usefulness in the context of AI-based systems. Search-based testing, combinatorial testing, and metamorphic testing seem to be excellent candidates for this purpose and may also be of use for automating certification and homologation processes for AI applications.

However, further studies have to be carried out. For CT, more experiments making use of other autonomous and automated functions have to be considered. Moreover, we need to come up with certain measures of guarantees for the computed test suites. Parameters of CT like the combinatorial strength may be sufficient, but in the context of AI-based systems there is no experimental evidence. For metamorphic testing we further need more use cases and experimental evaluations making use of AI-based systems. In the case of chatbots and also logic-based reasoning, metamorphic testing has already been successfully applied. However, there is a need to show the usefulness of metamorphic testing also in other applications where AI technology is a central part.

Acknowledgments

The research was supported by ECSEL JU under the project H2020 826060 AI4DI - Artificial Intelligence for Digitising Industry. AI4DI is funded by the Austrian Federal Ministry of Transport, Innovation and Technology (BMVIT) under the program "ICT of the Future" between May 2019 and April 2022. More information can be retrieved from https://iktderzukunft.at/en/.

References

Ammann, P.; Offutt, J.; and Huang, H. 2003. Coverage Criteria for Logical Expressions. In Proceedings of the 14th International Symposium on Software Reliability Engineering, ISSRE '03. Washington, DC, USA: IEEE Computer Society.

Bozic, J.; and Wotawa, F. 2019. Testing Chatbots Using Metamorphic Relations. In Gaston, C.; Kosmatov, N.; and Le Gall, P., eds., Testing Software and Systems, 41–55. Cham: Springer International Publishing. ISBN 978-3-030-31280-0.

Chen, T.; Cheung, S.; and Yiu, S. 1998. Metamorphic testing: a new approach for generating next test cases. Technical Report HKUST-CS98-01, Department of Computer Science, Hong Kong University of Science and Technology, Hong Kong.

Chetouane, N.; Klampfl, L.; and Wotawa, F. 2019. Investigating the Effectiveness of Mutation Testing Tools in the Context of Deep Neural Networks. In IWANN (1), volume 11506 of Lecture Notes in Computer Science, 766–777. Springer.

Goodfellow, I.; McDaniel, P.; and Papernot, N. 2018. Making Machine Learning Robust Against Adversarial Inputs. Commun. ACM 61(7): 56–66. ISSN 0001-0782. doi:10.1145/3134599. URL http://doi.acm.org/10.1145/3134599.

Guichard, J.; Ruane, E.; Smith, R.; Bean, D.; and Ventresque, A. 2019. Assessing the Robustness of Conversational Agents using Paraphrases. In 2019 IEEE International Conference On Artificial Intelligence Testing (AITest), 55–62.

Jia, Y.; and Harman, M. 2011. An Analysis and Survey of the Development of Mutation Testing. IEEE Transactions on Software Engineering 37(5): 649–678.

Kim, J.; Feldt, R.; and Yoo, S. 2019. Guiding Deep Learning System Testing Using Surprise Adequacy. In Proceedings of the 41st International Conference on Software Engineering, ICSE '19, 1039–1049. IEEE Press. doi:10.1109/ICSE.2019.00108. URL https://doi.org/10.1109/ICSE.2019.00108.

Klampfl, L.; Chetouane, N.; and Wotawa, F. 2020. Mutation Testing for Artificial Neural Networks: An Empirical Evaluation. In IEEE 20th International Conference on Software Quality, Reliability and Security (QRS), 356–365. IEEE.

Klück, F.; Zimmermann, M.; Wotawa, F.; and Nica, M. 2019. Performance Comparison of Two Search-Based Testing Strategies for ADAS System Validation. In Gaston, C.; Kosmatov, N.; and Le Gall, P., eds., Testing Software and Systems, 140–156. Cham: Springer International Publishing. ISBN 978-3-030-31280-0.

Koopman, P.; and Wagner, M. 2016. Challenges in Autonomous Vehicle Testing and Validation. SAE Int. J. Trans. Safety 4: 15–24. doi:10.4271/2016-01-0128. URL https://doi.org/10.4271/2016-01-0128.

Köroglu, Y.; and Wotawa, F. 2019. Fully automated compiler testing of a reasoning engine via mutated grammar fuzzing. In Choi, B.; Escalona, M. J.; and Herzig, K., eds., Proceedings of the 14th International Workshop on Automation of Software Test, AST@ICSE 2019, May 27, 2019, Montreal, QC, Canada, 28–34. IEEE / ACM. doi:10.1109/AST.2019.00010. URL https://doi.org/10.1109/AST.2019.00010.

Kuhn, D.; Kacker, R.; Lei, Y.; and Hunter, J. 2009. Combinatorial Software Testing. Computer 94–96.

Kuhn, D. R.; Bryce, R.; Duan, F.; Ghandehari, L. S.; Lei, Y.; and Kacker, R. N. 2015. Combinatorial Testing: Theory and Practice. In Advances in Computers, volume 99, 1–66.

Li, Y.; Tao, J.; and Wotawa, F. 2020. Ontology-based test generation for automated and autonomous driving functions. Inf. Softw. Technol. 117. doi:10.1016/j.infsof.2019.106200. URL https://doi.org/10.1016/j.infsof.2019.106200.

Li, Z.; Ma, X.; Xu, C.; and Cao, C. 2019. Structural Coverage Criteria for Neural Networks Could Be Misleading. In 2019 IEEE/ACM 41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER), 89–92.

Ma, L.; Zhang, F.; Sun, J.; Xue, M.; Li, B.; Juefei-Xu, F.; Xie, C.; Li, L.; Liu, Y.; Zhao, J.; et al. 2018a. DeepMutation: Mutation testing of deep learning systems. In 2018 IEEE 29th International Symposium on Software Reliability Engineering (ISSRE), 100–111. IEEE.

Ma, L.; Zhang, F.; Xue, M.; Li, B.; Liu, Y.; Zhao, J.; and Wang, Y. 2018b. Combinatorial testing for deep learning systems. arXiv preprint arXiv:1806.07723.

Menzel, T.; Bagschik, G.; and Maurer, M. 2018. Scenarios for Development, Test and Validation of Automated Vehicles. arXiv:1801.08598. URL https://arxiv.org/abs/1801.08598. Appeared in Proc. of the IEEE Intelligent Vehicles Symposium.

Pei, K.; Cao, Y.; Yang, J.; and Jana, S. 2017. DeepXplore: Automated whitebox testing of deep learning systems. In Proceedings of the 26th Symposium on Operating Systems Principles, 1–18. ACM.

Pietilä, P. A.; et al. 2019. Ethics Guidelines For Trustworthy AI. High-Level Expert Group on AI, European Commission.

Sekhon, J.; and Fleming, C. 2019. Towards Improved Testing For Deep Learning. In 2019 IEEE/ACM 41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER), 85–88.

Su, J.; Vargas, D. V.; and Sakurai, K. 2019. One Pixel Attack for Fooling Deep Neural Networks. IEEE Transactions on Evolutionary Computation. ISSN 1089-778X. doi:10.1109/TEVC.2019.2890858.

Sun, Y.; Huang, X.; and Kroening, D. 2018. Testing deep neural networks. arXiv preprint arXiv:1803.04792.

Utting, M.; and Legeard, B. 2006. Practical Model-Based Testing - A Tools Approach. Morgan Kaufmann Publishers Inc.

Wotawa, F. 2016. Testing Self-Adaptive Systems using Fault Injection and Combinatorial Testing. In Proceedings of the Intl. Workshop on Verification and Validation of Adaptive Systems (VVASS 2016). Vienna, Austria.

Wotawa, F. 2018. Combining Combinatorial Testing and Metamorphic Testing for Testing a Logic-based Non-Monotonic Reasoning System. In Proceedings of the 7th International Workshop on Combinatorial Testing (IWCT) / ICST 2018.

Wotawa, F. 2019. On the importance of system testing for assuring safety of AI systems. In CEUR Workshop Proceedings, Workshop on Artificial Intelligence Safety, AISafety 2019, volume 2419. Macao, China. URL http://ceur-ws.org/Vol-2419/.