On the Use of Available Testing Methods for Verification & Validation of AI-based Software and Systems

Franz Wotawa
Graz University of Technology, Institute for Software Technology
Inffeldgasse 16b/2, A-8010 Graz, Austria
wotawa@ist.tugraz.at

Abstract

Verification and validation of software and systems is an essential part of the development cycle in order to meet given quality criteria including functional and non-functional requirements. Testing, and in particular its automation, has been an active research area for decades, providing many methods and tools for automating test case generation and execution. Due to the increasing use of AI in software and systems, the question arises whether it is possible to utilize available testing techniques in the context of AI-based systems. In this position paper, we elaborate on testing issues arising when using AI methods for systems, consider the case of different stages of AI, and start investigating the usefulness of certain testing methods for testing AI. We focus especially on testing at the system level, where we are interested not only in assuring that a system is correctly implemented but also that it meets given criteria like not contradicting moral rules, or being dependable. We state that some well-known testing techniques can still be applied, provided they are tailored to the specific needs.

Introduction

Because of the growing importance of AI methodologies for current and future software and systems, there is a need for coming up with appropriate quality assurance measures. Such measures should provide certain guarantees that the resulting products fulfill their requirements, e.g., provide the requested functionality and meet safety concerns. Providing guarantees seems to be essential in order to gain trust in AI-based system solutions. In particular, in autonomous driving, to mention one recent application area of AI, we have to establish a certification and homologation process that assures that an autonomous vehicle follows given regulations and other requirements.

Because artifacts making use of AI technology are themselves systems, the question is whether it is possible to re-use ordinary testing methodologies and to adapt them for providing means for certification and homologation. In particular, besides components like vision systems relying on machine learning, there are other components that do not rely on any AI methodology. In Figure 1 we give an overview of the architecture of such a system comprising the AI component and other components implementing functionality like providing user interfaces or database access. In addition, such systems rely on a computational stack where we also have to consider the operating system, firmware, and even the hardware for verification and validation purposes. As a consequence, we have to consider verification and validation of the whole system for quality assurance.

Figure 1: AI-based systems, their boundaries, and environment.

In a previous paper (Wotawa 2019), we already focused on the need for system testing. In contrast, in this paper, we try to give a first answer regarding the usefulness of certain available system testing methods for testing AI applications. Furthermore, we discuss the corresponding general verification and validation problem of such applications in more detail. We always have to understand what we want to test and what we want to achieve. We also have to be aware of shortcomings arising when focusing only on subparts of the overall verification and validation problem. First, faults often arise because of untested interactions between different system components. Such cases may arise because of unintended interactions not considered during development. Second, we might not be able to sufficiently make guarantees regarding the degree of testing. And finally, we may miss critical inputs or scenarios that lead to trouble. The latter especially holds for different machine learning approaches and is referred to as adversarial attacks (see, e.g., (Su, Vargas, and Sakurai 2019) and (Goodfellow, McDaniel, and Papernot 2018)).

We organize this paper as follows. In the next section, we discuss the system testing challenge in detail. We focus on different aspects of testing to be considered and refer to related literature. Afterwards, we present three approaches to system testing that have been proven to find faults when testing systems using AI techniques. Finally, we summarize the obtained findings.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The testing challenge

As depicted in Figure 1, systems comprising AI methodology also rely on other components providing interfaces and functionality, as well as runtime support including operating systems, firmware, and hardware. As a consequence, we have to consider testing as a holistic activity that has to take care of all different parts of the whole system. In particular, we have to clarify what to test and how to test. For example, a logic-based reasoning system comprises a compiler for reading in the logic rules and facts, and the reasoning part. Hence, we have to test the compiler and the reasoning part first separately and afterwards together in close interaction. The compiler can be tested, for example, using fuzzing, where more or less randomly generated inputs are provided to the program under test (see, e.g., (Köroglu and Wotawa 2019)). The reasoning engine itself can be tested using certain known relations, like that the sequence of rules provided to the system does not influence the final outcome (see, e.g., (Wotawa 2018)). The overall system itself may be tested using fault injection, e.g., (Wotawa 2016). All these examples have – more or less – in common that they only capture some parts of the expected behavior. If using fault injection, we are interested in how systems react to inputs that occur in case of faults. When using invariants like the order of rules, we do not test all aspects of reasoning. Hence, in order to thoroughly test such systems, we need to understand what to test in order to identify shortcomings of the underlying testing methods to be used.

Besides this, and more specifically for AI methods, we have to provide some measures that at least indicate the quality of testing. For ordinary programs, coverage (e.g., (Ammann, Offutt, and Huang 2003)) and mutation score (e.g., (Jia and Harman 2011)) are used to determine whether test suites are good enough, i.e., likely able to reveal a faulty behavior. Coverage helps to identify those parts of the program that are executed using the test suite, i.e., code coverage¹. The mutation score is an indicator of the number of program variants, i.e., the mutations, that can be detected using the given test suite. It is worth noting that coverage or mutation score can be seen as a measure or indicator for guaranteeing that a test suite has the required capabilities for detecting a failing behavior.

Let us consider testing neural networks as an example. Neural networks are trained using a set of examples and evaluated afterwards. Evaluation is used for assuring that a network reaches a given quality of the prediction outcome. The sets of examples used for training and evaluation have to be distinct. The question now is whether this evaluation is good enough to replace further testing effort. The answer is no, because of, but not only, adversarial attacks (Su, Vargas, and Sakurai 2019) that lead to misclassifications even in case of small input variations. Other reasons for misclassifications are the use of a training data set that does not cover all different examples, and other aspects like the distribution of examples. Furthermore, note that variations of the appearance of objects in the real world often exist. In Figure 2 we depict different images of the traffic sign "do not enter", ranging from a bend to occlusions because of stickers attached. An autonomous car would always be required to handle these cases, and it is very unlikely that we really have all such cases represented in the training data set. Moreover, even if so, we would still have misclassifications occurring, requiring us to assure that there is no unwanted effect on the behavior of the overall system.

Figure 2: Different variants of the "do not enter" traffic sign someone sees in reality: (a) is the original sign, (b) the traffic sign with a bend, a sticker, and partially missing color, and (c) and (d) are traffic signs with various stickers attached.

There is plenty of literature regarding different testing approaches for neural networks, e.g., (Pei et al. 2017; Sun, Huang, and Kroening 2018; Ma et al. 2018b,a) and most recently (Kim, Feldt, and Yoo 2019; Sekhon and Fleming 2019). In some of these methods an adapted version of coverage and mutation score for neural networks has also been used. Unfortunately, coverage information may be somewhat misleading (Li et al. 2019), leaving the question regarding the quality of the test suite open.

In the case of neural networks we may also ask whether classical coverage or mutation score used in ordinary software engineering can be used as a quality measure when testing a current neural network implementation. (Chetouane, Klampfl, and Wotawa 2019) showed that making use of these measures when testing the configuration of neural networks, i.e., setting the type of neurons, the number of layers and neurons, can be justified. Unfortunately, this is not the case when testing the whole neural network library as discussed in (Klampfl, Chetouane, and Wotawa 2020).
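To make the mutation-score measure discussed above concrete, it can be computed mechanically once a set of mutants and a test suite are fixed. The following is a minimal sketch; the function under test, the hand-written mutants, and the test inputs are invented for illustration only (real mutation tools generate mutants automatically):

```python
# Minimal mutation-score sketch. The function under test, its mutants,
# and the test suite are invented for illustration only.

def original(x, y):
    return x + y

# Hand-written mutants, each a small syntactic variant of the original.
mutants = [
    lambda x, y: x - y,   # arithmetic operator replacement
    lambda x, y: x * y,   # arithmetic operator replacement
    lambda x, y: x,       # drops the second operand entirely
]

# A test case is an input pair; the original's output serves as oracle.
test_suite = [(2, 2)]

def mutation_score(ms, tests):
    # A mutant is "killed" if at least one test distinguishes it
    # from the original; the score is the fraction of killed mutants.
    killed = sum(
        1 for m in ms
        if any(m(x, y) != original(x, y) for x, y in tests)
    )
    return killed / len(ms)

print(mutation_score(mutants, test_suite))  # 2/3: x*y coincides with x+y on (2, 2)
```

The surviving mutant shows why the score is an indicator of test-suite strength: adding a test case such as (1, 3), on which x*y and x+y differ, raises the score to 1.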
¹ Note that besides code coverage there are other coverage definitions used, like test input coverage, combinatorial coverage, etc.

Hence, for neural networks, other measures and means for testing shall be provided.

Although we may need to live with the challenge that we cannot completely test certain system parts, and that there is always a critical case where the AI part of a system may deliver a wrong result, the further question is whether this establishes a problem for the whole system. The answer in this case is no, provided that the system itself is able to detect this critical case and to react appropriately. For example, in autonomous driving, we may make use of more than one sensor for obtaining information regarding objects around the vehicle and use sensor fusion to obtain reliable information. We only need to assure that the whole system interacts with the environment in a way that is dependable and fulfills our requirements including ethical or moral considerations. Hence, identifying critical scenarios between the system and its environment seems to be a crucial factor of testing AI-based systems (Koopman and Wagner 2016; Menzel, Bagschik, and Maurer 2018).

Moreover, it seems also of importance to consider that critical scenarios often originate from different settings that have to occur at the same time. One issue, e.g., missing a certain traffic sign, may not lead to an accident, but in combination with other issues it would.

We summarize our discussion in the following position:

Position 1 Testing aims at identifying interactions between the system under test and its environment leading to an unexpected behavior. When testing systems utilizing AI, we have to consider testing all parts of a system, including the ones with and the ones without AI, as well as their interactions. Evaluating performance characteristics of implemented AI methodology may not be sufficient for assuring that quality criteria are met.

Most testing is performed during development of systems before deployment. In some cases certification (or even homologation), i.e., the formal confirmation that an application, product, or system meets its required characteristics, is needed. In the case of AI technology we are interested in whether the system fulfills dependability goals like safety, but maybe also given ethical or moral rules. For example, we want a conversational agent or a decision support system not to be racist or sexist. Furthermore, because the system's underlying software is updated regularly in order to cope with changes required because of bugs or improved functionality, there is a need for carrying out any certification regularly as well. For example, in autonomous driving we have to assure that a new software update is not going to lead to an unsafe system. However, regression testing may require a lot of effort or come with high costs, which may be reduced when automating testing.

Hence, automating at least part of certification may be a future requirement. But how can certification of AI be carried out? What we need is a process where we identify what we want to achieve, and how this can be checked (or tested). How can we come up with certain parameters justifying that testing is appropriate? We shall also think about the methods for checking, their limitations, and how to assure that the methods can guarantee (with respect to a given certainty) that the system fulfills the requested needs. However, in any case, in order to bring AI technology into practice, we have to convince customers that the systems cause no harm. Certification that takes into account such customer considerations as well as regulations provides the right means for further supporting the delivery of AI technology into the practical applications we use on a daily basis.

It is worth noting that there are many initiatives, like the ethics guidelines for trustworthy AI (Pietilä et al. 2019), for coming up with first steps of how AI-based systems have to be constructed, evaluated, and verified. However, for example, in autonomous driving such principles have to be concretized, leading to practical rules companies can follow when developing AI systems or systems at least partially based on AI methodologies and tools.

Position 2 There is a need for well-defined certification and homologation processes for AI-based systems that ideally can be carried out in an automated way. Such certification and homologation processes shall rely on existing guidelines considering all aspects of trustworthy AI.

When we want to carry out certification at least partially automated, we may rely on testing. Hence, we have to state the question whether existing testing techniques can be used for confirming that an AI-based system fulfills regulations and other rules and expectations. This includes, besides testing functionality, the degree of fulfilling generally agreed ethical and moral rules. In the following section, we introduce three techniques that can (at least partially) serve this purpose.

Testing AI

As discussed, there seems to be a need for testing the whole system considering functional and non-functional requirements including moral and ethical rules. For testing systems at the system level, black-box approaches are used that do not consider the internal structure. Various methods with corresponding tools have been proposed, including model-based testing (MBT) (Utting and Legeard 2006), combinatorial testing (CT) (Kuhn et al. 2015), or metamorphic testing (Chen, Cheung, and Yiu 1998). MBT makes use of a model of the system for obtaining test cases. In order to find critical interactions between the system and its environment this may not be sufficient. It would be required to model the environment, including potential interactions, and have a look at the reactions of the system.

The focus on modeling the environment of the system in order to obtain test cases is somewhat different from ordinary MBT, where a model of the system is used for test case generation. Changing from modeling the system to modeling the environment is necessary for finding critical interactions between an AI-based system and its environment. Moreover, in this kind of testing we are not interested in showing that an implementation works according to a model, but that it is capable of handling arbitrary interactions that may not be foreseen during development.

In contrast to MBT, CT has been developed to search for critical interactions between configuration parameters and inputs. It has been shown that CT can effectively detect faults in many different kinds of software (Kuhn et al. 2009). The question is whether we can also apply CT for AI testing. In (Li, Tao, and Wotawa 2020) the authors introduced an approach utilizing a model of the system environment in combination with CT for obtaining a test suite. In their paper, the authors not only provide the foundations but also report on a case study where they tested an automated emergency braking (AEB) function. From 319 test cases, 9 test cases led to crashes (including test cases where pedestrians would have been killed (see Figure 3)), and 30 were considered as being critical. It is worth noting that the proposed overall approach also includes a simulation environment for carrying out the generated test cases in a realistic setting automatically.

Figure 3: The last episode of a failing test case applied to an implementation of an automated emergency braking system, close to the time where a simulated pedestrian tries to cross the street coming from the right side. The crash occurred in a scenario where another vehicle in front brakes, causing the ego vehicle to brake. A first pedestrian crossing the street from the left passes by, and the second one, coming from the right, is overseen by the automated emergency braking system and hit.

(Klück et al. 2019) introduced an alternative method for generating critical scenarios, where the authors rely on genetic algorithms for obtaining test cases. The idea is to model test cases as genes that can be crossed and mutated. An evaluation function maps test cases to a goodness value. In each generation the best test cases are taken, modified, and again evaluated. This kind of testing is also referred to as search-based testing. In (Klück et al. 2019) the authors also evaluated the approach using an AEB function. The obtained results showed that genetic algorithms can be applied to detect faults in the setting of autonomous and automated driving, leading to the following position:

Position 3 Combinatorial testing and search-based testing are effective testing techniques for identifying critical scenarios.

CT and also search-based testing applied to test autonomous and automated driving functions always have to fulfill the property that no crash with another car or even a pedestrian occurs. In this context, closeness to a crash is often represented as the time to collision (TTC), where 0 means that a crash occurs. Usually, in many applications, positive but small TTC values may also be considered as unwanted. When testing in the automotive domain, including autonomous driving, we can always rely on the TTC for judging whether a test case passes or fails. Hence, there the test oracle can be automated using the TTC, which is not always the case when testing AI. We therefore require other means for dealing with the oracle problem, i.e., providing a function that allows us to distinguish passing executions of programs and systems from failing ones.

The objective behind metamorphic testing (Chen, Cheung, and Yiu 1998) is to provide a solution to the oracle problem of testing. The underlying idea is to define relations over different inputs that always deliver the same output. For example, sin(x) is equivalent for all values of x and x + 2·π, i.e., sin(x) = sin(x + 2·π) always holds. In (Guichard et al. 2019), and more specifically in (Bozic and Wotawa 2019), the authors proposed the use of metamorphic testing for testing conversational agents, i.e., chatbots. The underlying idea was to propose relations considering semantic relationships between words and sentences, e.g., some sentences have the same semantics when replacing one word with its synonym, or sometimes the sequence of sentences given to a chatbot does not change the answer provided by the chatbot. Moreover, we are able to test for fulfilling certain moral and ethical regulations. For example, if an answer of a chatbot should not be influenced by the race or sex of the chat participant, we can formulate this as a metamorphic relation, where we say that a conversation considering one race or sex should lead to the same results when changing race or sex. In the case of AI systems where we are able to come up with metamorphic relations, we are also able to apply metamorphic testing for solving the oracle problem.

Position 4 Metamorphic testing seems to be of use for addressing the test oracle problem of AI systems, allowing to identify contradictions with requirements, which may include ethical and moral considerations.

There are more system testing approaches that can also be adapted to fit the purpose of AI testing with the objective of assuring safety of AI-based systems and software. However, we have identified approaches for which there is experimental evidence that they can be effectively used for testing AI-based systems. These approaches may also fit into certification and homologation processes. For this purpose, certain measures have to be developed that can be used for deciding when to stop testing in case no failing test case could be obtained.

Moreover, the presented methods and techniques for testing AI-based systems have disadvantages. They are mainly focusing on quality assurance of the overall system and not its comprising parts.
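The oracle-free character of metamorphic testing (Position 4) can be illustrated with the sin(x) = sin(x + 2·π) relation mentioned above: the relation itself decides pass or fail, so no expected output is needed. The sketch below, including the injected fault, is illustrative only:

```python
import math
import random

# Metamorphic relation from the text: sin(x) = sin(x + 2*pi) for all x.
# The relation acts as the oracle; no reference output is required.
def relation_holds(f, x, tol=1e-9):
    return abs(f(x) - f(x + 2 * math.pi)) <= tol

def metamorphic_test(f, n_cases=100, seed=0):
    rng = random.Random(seed)
    failures = []
    for _ in range(n_cases):
        x = rng.uniform(-10.0, 10.0)  # randomly generated source inputs
        if not relation_holds(f, x):
            failures.append(x)
    return failures

# math.sin satisfies the relation (up to floating-point tolerance) ...
assert metamorphic_test(math.sin) == []

# ... while an implementation with an injected fault for negative inputs
# is revealed whenever x < 0 but x + 2*pi >= 0.
def buggy_sin(x):
    return math.sin(x) if x >= 0 else math.sin(x) + 1e-3  # injected fault

assert metamorphic_test(buggy_sin) != []
```

The same pattern carries over to the chatbot relations discussed above: replace the numeric relation by, e.g., "swapping a word for its synonym must not change the answer", and the test loop stays the same.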
For example, the CT approach considers a model of the environment, which works as the basis for obtaining the CT input model. The approach tests whether certain interactions of the SUT with its environment reveal a fault – in the case of automated or autonomous driving, a crash – but does not consider any knowledge regarding the SUT's internal structure or behavior. Finding out the root cause of any misbehavior within the SUT might be complicated. Moreover, we are not able to make use of quality assurance measures like code coverage or mutation score for the particular test suite. Furthermore, CT, like MBT, requires concretizing the abstract test cases computed using these testing methods. This concretization step causes additional effort and has to be done carefully in order to come up with good test cases that can be executed and most likely reveal a fault.

In the case of metamorphic testing it is essential to define the metamorphic relations, which causes additional effort and influences the ability to work as a good test oracle. There may be metamorphic relations leading to test cases an SUT can easily fulfill, only allowing to test a fraction of the functionality. In such cases metamorphic testing would not lead to tests covering most of the functionality and, therefore, can be considered as incomplete. Search-based testing requires implementing a search procedure using a function allowing to estimate the quality of a current test, e.g., the ability of a test to reveal a fault. Again, this requires additional effort and costs. It is worth noting that in some cases random testing, i.e., generating test inputs using a random procedure, also provides fault-revealing test cases, requiring even less time than search-based testing at almost no additional cost.

Conclusion

In this position paper, we focused on providing an answer to the question whether there exist testing techniques that can be efficiently used for checking that a software or system comprising AI methodologies fulfills requirements, including moral and ethical rules, and regulations. We also discussed the involved challenges of testing, where we identified shortcomings that arise when only focusing on specific parts and not providing a holistic view. Finally, we introduced several testing methods that have been developed in the context of testing ordinary systems and elaborated on their usefulness in the context of AI-based systems. Search-based testing, combinatorial testing, and metamorphic testing seem to be excellent candidates for this purpose and may also be of use for automating certification and homologation processes for AI applications.

However, further studies have to be carried out. For CT, more experiments making use of other autonomous and automated functions have to be considered. Moreover, we need to come up with certain measures of guarantees for the computed test suites. Parameters of CT like the combinatorial strength may be sufficient, but in the context of AI-based systems there is no experimental evidence. For metamorphic testing we further need more use cases and experimental evaluations making use of AI-based systems. In the case of chatbots and also logic-based reasoning, metamorphic testing has already been successfully applied. However, there is a need to show the usefulness of metamorphic testing also in other applications where AI technology is a central part.

Acknowledgments

The research was supported by ECSEL JU under the project H2020 826060 AI4DI - Artificial Intelligence for Digitising Industry. AI4DI is funded by the Austrian Federal Ministry of Transport, Innovation and Technology (BMVIT) under the program "ICT of the Future" between May 2019 and April 2022. More information can be retrieved from https://iktderzukunft.at/en/.

References

Ammann, P.; Offutt, J.; and Huang, H. 2003. Coverage Criteria for Logical Expressions. In Proceedings of the 14th International Symposium on Software Reliability Engineering, ISSRE '03. Washington, DC, USA: IEEE Computer Society.

Bozic, J.; and Wotawa, F. 2019. Testing Chatbots Using Metamorphic Relations. In Gaston, C.; Kosmatov, N.; and Le Gall, P., eds., Testing Software and Systems, 41–55. Cham: Springer International Publishing. ISBN 978-3-030-31280-0.

Chen, T.; Cheung, S.; and Yiu, S. 1998. Metamorphic testing: a new approach for generating next test cases. Technical Report HKUST-CS98-01, Department of Computer Science, Hong Kong University of Science and Technology, Hong Kong.

Chetouane, N.; Klampfl, L.; and Wotawa, F. 2019. Investigating the Effectiveness of Mutation Testing Tools in the Context of Deep Neural Networks. In IWANN (1), volume 11506 of Lecture Notes in Computer Science, 766–777. Springer.

Goodfellow, I.; McDaniel, P.; and Papernot, N. 2018. Making Machine Learning Robust Against Adversarial Inputs. Commun. ACM 61(7): 56–66. ISSN 0001-0782. doi:10.1145/3134599. URL http://doi.acm.org/10.1145/3134599.

Guichard, J.; Ruane, E.; Smith, R.; Bean, D.; and Ventresque, A. 2019. Assessing the Robustness of Conversational Agents using Paraphrases. In 2019 IEEE International Conference On Artificial Intelligence Testing (AITest), 55–62.

Jia, Y.; and Harman, M. 2011. An Analysis and Survey of the Development of Mutation Testing. IEEE Transactions on Software Engineering 37(5): 649–678.

Kim, J.; Feldt, R.; and Yoo, S. 2019. Guiding Deep Learning System Testing Using Surprise Adequacy. In Proceedings of the 41st International Conference on Software Engineering, ICSE '19, 1039–1049. IEEE Press. doi:10.1109/ICSE.2019.00108. URL https://doi.org/10.1109/ICSE.2019.00108.

Klampfl, L.; Chetouane, N.; and Wotawa, F. 2020. Mutation Testing for Artificial Neural Networks: An Empirical Evaluation. In IEEE 20th International Conference on Software Quality, Reliability and Security (QRS), 356–365. IEEE.

Klück, F.; Zimmermann, M.; Wotawa, F.; and Nica, M. 2019. Performance Comparison of Two Search-Based Testing Strategies for ADAS System Validation. In Gaston, C.; Kosmatov, N.; and Le Gall, P., eds., Testing Software and Systems, 140–156. Cham: Springer International Publishing. ISBN 978-3-030-31280-0.

Koopman, P.; and Wagner, M. 2016. Challenges in Autonomous Vehicle Testing and Validation. SAE Int. J. Trans. Safety 4: 15–24. doi:10.4271/2016-01-0128. URL https://doi.org/10.4271/2016-01-0128.

Köroglu, Y.; and Wotawa, F. 2019. Fully automated compiler testing of a reasoning engine via mutated grammar fuzzing. In Choi, B.; Escalona, M. J.; and Herzig, K., eds., Proceedings of the 14th International Workshop on Automation of Software Test, AST@ICSE 2019, May 27, 2019, Montreal, QC, Canada, 28–34. IEEE / ACM. doi:10.1109/AST.2019.00010. URL https://doi.org/10.1109/AST.2019.00010.

Kuhn, D.; Kacker, R.; Lei, Y.; and Hunter, J. 2009. Combinatorial Software Testing. Computer 94–96.

Kuhn, D. R.; Bryce, R.; Duan, F.; Ghandehari, L. S.; Lei, Y.; and Kacker, R. N. 2015. Combinatorial Testing: Theory and Practice. In Advances in Computers, volume 99, 1–66.

Li, Y.; Tao, J.; and Wotawa, F. 2020. Ontology-based test generation for automated and autonomous driving functions. Inf. Softw. Technol. 117. doi:10.1016/j.infsof.2019.106200. URL https://doi.org/10.1016/j.infsof.2019.106200.

Li, Z.; Ma, X.; Xu, C.; and Cao, C. 2019. Structural Coverage Criteria for Neural Networks Could Be Misleading. In 2019 IEEE/ACM 41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER), 89–92.

Ma, L.; Zhang, F.; Sun, J.; Xue, M.; Li, B.; Juefei-Xu, F.; Xie, C.; Li, L.; Liu, Y.; Zhao, J.; et al. 2018a. DeepMutation: Mutation testing of deep learning systems. In 2018 IEEE 29th International Symposium on Software Reliability Engineering (ISSRE), 100–111. IEEE.

Ma, L.; Zhang, F.; Xue, M.; Li, B.; Liu, Y.; Zhao, J.; and Wang, Y. 2018b. Combinatorial testing for deep learning systems. arXiv preprint arXiv:1806.07723.

Menzel, T.; Bagschik, G.; and Maurer, M. 2018. Scenarios for Development, Test and Validation of Automated Vehicles. arXiv:1801.08598. URL https://arxiv.org/abs/1801.08598. Appeared in Proc. of the IEEE Intelligent Vehicles Symposium.

Pei, K.; Cao, Y.; Yang, J.; and Jana, S. 2017. DeepXplore: Automated whitebox testing of deep learning systems. In Proceedings of the 26th Symposium on Operating Systems Principles, 1–18. ACM.

Pietilä, P. A.; et al. 2019. Ethics Guidelines For Trustworthy AI. High-Level Expert Group on AI, European Commission.

Sekhon, J.; and Fleming, C. 2019. Towards Improved Testing For Deep Learning. In 2019 IEEE/ACM 41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER), 85–88.

Su, J.; Vargas, D. V.; and Sakurai, K. 2019. One Pixel Attack for Fooling Deep Neural Networks. IEEE Transactions on Evolutionary Computation. ISSN 1089-778X. doi:10.1109/TEVC.2019.2890858.

Sun, Y.; Huang, X.; and Kroening, D. 2018. Testing deep neural networks. arXiv preprint arXiv:1803.04792.

Utting, M.; and Legeard, B. 2006. Practical Model-Based Testing - A Tools Approach. Morgan Kaufmann Publishers Inc.

Wotawa, F. 2016. Testing Self-Adaptive Systems using Fault Injection and Combinatorial Testing. In Proceedings of the Intl. Workshop on Verification and Validation of Adaptive Systems (VVASS 2016). Vienna, Austria.

Wotawa, F. 2018. Combining Combinatorial Testing and Metamorphic Testing for Testing a Logic-based Non-Monotonic Reasoning System. In Proceedings of the 7th International Workshop on Combinatorial Testing (IWCT) / ICST 2018.

Wotawa, F. 2019. On the importance of system testing for assuring safety of AI systems. In CEUR Workshop Proceedings, Workshop on Artificial Intelligence Safety, AISafety 2019, volume 2419. Macao, China. URL http://ceur-ws.org/Vol-2419/.