On the importance of system testing for assuring safety of AI systems

Franz Wotawa
CD Lab for Quality Assurance Methodologies for Autonomous Cyber-Physical Systems, TU Graz, Institute for Software Technology, Graz, Austria
wotawa@ist.tugraz.at

Abstract

Rigorous testing of automated and autonomous systems is inevitable, especially in the case of safety-critical systems like cars or airplanes. Several functional safety standards have to be fulfilled, like IEC 61508, which explicitly states that AI methodologies are not recommended for systems with higher safety requirements. Hence, there is a necessity to adapt these standards in a direction where AI methodology is allowed, provided that certain standardized quality assurance methods are applied during development. In this paper, we contribute to this endeavor and discuss the urgent need for system testing in the context of safety-critical systems comprising AI methodologies. In particular, we argue, based on one example from the automotive industry, that it is strongly recommended to consider not only subsystems but the whole system interacting with its environment when carrying out tests. The discussed example is an advanced driver-assistance system used to brake in case of an emergency; it does not rely on machine learning but comprises a decision part that invokes braking once the sensors identify an obstacle that might otherwise be hit. Results obtained from an already reported testing methodology revealed that tests considering the environment of an automated emergency braking system yield critical scenarios that might otherwise not have been detected. From this observation, we conclude that rigorous system testing becomes even more important for systems with AI methodology based on machine learning, or for systems that adapt their behavior during operation.

1 Introduction

Safety-critical systems are systems where an internal fault or a malfunction may cause death or serious injury to people, loss or severe damage to property, or environmental harm. For such systems, software and system engineering standards have been introduced, like IEC 61508, or ISO 26262 for the automotive industry. The latter introduces automotive safety integrity levels (ASIL), where ASIL A is the weakest and ASIL D the strongest, requiring specific considerations during development in order to keep risks below an acceptable level. It is well known that standards like IEC 61508 do not recommend AI methodology for systems with a higher safety integrity level (see IEC 61508-3:2010, Table A.2). It is interesting to note that this restriction also applies to the automotive industry, where we see an increasing use of automated and autonomous functions, some of them also based on machine learning technologies. There is obviously a gap between the safety standards and the use of AI methodology in practice, requiring adaptations in the standards. For example, Henriksson and colleagues [Henriksson et al., 2018] contributed ideas for evolving the standards towards capturing machine learning applications. In addition, there are new standards coming up, like ISO/PAS 21448:2019 on the safety of the intended functionality for road vehicles, which considers situations comprising complex sensors and processing algorithms, including the application of machine learning.

There have been many papers dealing with the general challenge of verifying and validating autonomous vehicles, like Koopman and Wagner [Koopman and Wagner, 2016], Wotawa [Wotawa, 2016a], Schuldt and colleagues [Schuldt et al., 2018], or Wotawa and colleagues [Wotawa et al., 2018]. An essential challenge in this context is how to test such systems in a way that can be considered good enough. Kalra and Paddock [Kalra and Paddock, 2016] answered this question, stating that an autonomous vehicle has to operate for 275 million miles for verification purposes. In their calculation, Kalra and Paddock considered the fatality rate of driving in the USA and assumed that an autonomous vehicle should have a far lower fatality rate. Despite the fact that such a huge number of miles can hardly be achieved with a small fleet of test cars, there is also a hidden assumption behind the calculation, i.e., that during testing on streets the autonomous vehicle has to deal with all critical scenarios, which seems somewhat unrealistic. Hence, as a consequence, researchers have proposed to use ontologies for testing automated and autonomous vehicles, e.g., [Xiong et al., 2013; Geyer et al., 2014; Menzel et al., 2018]. These contributions deal with testing the whole system using different scenarios.

The intention behind this paper is to discuss the need for system testing in the context of automated and autonomous driving and to derive important requirements that can also be used in other application areas utilizing AI technology. In particular, we discuss results obtained when applying two different testing techniques to verify the functionality of the advanced driver-assistance system (ADAS) autonomous emergency braking (AEB). The AEB comprises sensors for detecting obstacles and a decision system that controls automated braking whenever necessary to avoid or mitigate collisions. Currently, AEB systems utilize different sensor technologies like radar, cameras, or LIDAR, together with sensor fusion capabilities. In cases where obstacles need to be classified, e.g., as persons or bicyclists, machine learning methods might be applied to make the vision sensor learn the distinguishing categories. Verification of an AEB requires verifying its subsystems as well as the system itself during operation, or at least in an environment that is close to the real use.

There are similarities and expected differences when considering systems comprising AI methods like machine learning as subjects of testing. For example, in the case of machine learning, the outcome depends on the data used for learning a certain model and on the underlying machine learning approach. Hence, the outcome of the finally obtained model might vary. Other parts of the overall system relying on such a varying outcome have to deal with this uncertainty in an appropriate way without causing safety hazards. Hence, any verification approach needs to consider this variation and try to find critical situations. In addition, sensors based on machine learning may not always deliver correct classification results. Even with a correct classification in 99% of the cases, we have to assure that the remaining 1% of the cases does not lead to safety violations. Hence, the overall system has to compensate for inaccuracies that might be higher than for ordinary sensors, and there is a strong requirement to verify the system especially considering all the cases where classification goes wrong.

In the following, we will introduce such a testing environment that is used together with test case generation methods to carry out system testing automatically without the need of user interference. Interestingly, we will see that there are testing approaches that reveal faults in an AEB that very likely would not have been found otherwise. This testing approach combines environmental ontologies with combinatorial testing [Kuhn et al., 2012; Kuhn et al., 2015]. The underlying assumption is that there is a need for finding combinations of environmental entities for revealing faults. Hence, we argue that it is not only necessary to carry out system tests but also to consider environmental interactions in the case of autonomous systems. In addition, we discuss some further challenges of testing autonomous systems, i.e., providing some sort of guarantees and methods for estimating the residual risk, i.e., the risk of still comprising a fault even after carrying out a proposed testing methodology. We will see that the use of environmental models allows for specifying such guarantees based on the degree to which interactions between environmental entities are considered and the degree to which the environmental model has been used for verifying a particular system.

This paper is organized as follows: In Section 2, we discuss related research focusing on AI-based systems. Afterwards, and to be self-contained, we introduce the results from another paper dealing with system testing of an AEB in Section 3, from which we derive requirements necessary to verify safety-critical systems using AI methods (Section 4). Finally, we conclude the paper in Section 5.

2 Related research

Testing AI-based systems itself is not a novel research area. Starting in the 90s of the last century, R. Plant [Plant, 1991; Plant, 1992] reported on the development of expert systems, also considering testing. Together with S. Murrell, R. Plant [Murrell and Plant, 1997] also published a survey on tools for verifying and validating knowledge-based systems that had been published between 1985 and 1995. Later, El-Korany and colleagues [El-Korany et al., 2000] presented a structured approach, and Hartung and Håkansson [Hartung and Håkansson, 2007] an automated approach for testing such systems. Other work in the field of knowledge-based system testing includes [Hayes and Parzen, 1997], [Felfernig et al., 2005] dealing with testing recommender systems, [Tiihonen et al., 2002], and [Wotawa and Pill, 2014]. Most recently, there have been papers dealing with testing specific properties of logic reasoning engines [Wotawa, 2018a], the general challenge of testing such systems [Wotawa, 2018b], and testing subsystems of logic reasoning engines like their compilers [Koroglu and Wotawa, 2019]. In all of these cases, the focus was on testing the reasoning engine alone, not considering its use in a specific application like a mobile robot or any other application that requires reasoning capabilities. In contrast, Wotawa [Wotawa, 2016b] presented a testing approach for adaptive systems that makes use of models of the system itself, i.e., knowledge of the system's internal structure and behavior. There, the author considers fault injection for testing whether the adaptive system handles internal faults as expected.

In the case of machine learning, and in particular neural networks, there have been many papers dealing with testing, including [Ma et al., 2018a], [Sun et al., 2018], [Pei et al., 2017], [Ma et al., 2018b], and [Ma et al., 2018c], applying different well-known testing techniques, like mutation testing, combinatorial testing, or whitebox testing approaches, to neural networks. Chetouane and colleagues [Chetouane et al., 2019] discussed a slightly different approach to testing neural networks using mutation testing and coverage in a more ordinary setting. In addition, it is well known that neural networks are vulnerable to adversarial inputs, where changing only one pixel in an image leads to a wrong classification; see, for example, the work of Su and colleagues [Su et al., 2019]. Adversarial attacks can also be tailored towards more realistic attacks; we refer the interested reader to Wicker and colleagues [Wicker et al., 2018] for one example. Countermeasures against adversarial inputs have also been considered. Most recently, Goodfellow and colleagues [Goodfellow et al., 2018] discussed and summarized some of them.

In the context of automated and autonomous driving, we discussed related papers in the introduction. The main focus is on identifying critical scenarios using ontologies from which test cases can be derived, e.g., [Xiong et al., 2013; Geyer et al., 2014; Menzel et al., 2018]. In addition, there is a shift from carrying out vehicle tests on the road to simulation environments capturing physics as well as 3D models. In such an environment, more tests can be carried out in less time. Because of the improvement in simulation technology, the carried out tests become closer to reality, finding faults that would also be detected in a real environment. In an explorative study, Sotiropoulos and colleagues [Sotiropoulos et al., 2017] showed that simulation of a mobile robot indeed revealed faults that had also been detected when carrying out tests in our physical world. In addition, there has been research work on formally verifying machine learning and AI-based solutions. Seshia and Sadigh [Seshia and Sadigh, 2016] discussed the use of formal methods for verifying AI, including how to handle the involved challenges. Gauerhof et al. [Gauerhof et al., 2018] tackled the case of the use of machine learning in the context of autonomous driving, focusing on validation issues.

Figure 1: Overview of the testing process: from ontology to execution

What we can take with us from the mentioned related research is the following: (1) Different AI methodologies like knowledge-based systems or machine learning require specific testing methods, (2) improved 3D and physics simulation allows for revealing faults, and (3) in the case of automated and autonomous driving, the use of ontologies capturing the environment to generate critical scenarios seems to be of particular importance.
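As one concrete illustration of takeaway (3), an environment ontology can be flattened into a combinatorial-testing input model, i.e., a set of parameters with finite value domains. The following Python sketch uses invented concept and value names, not an ontology from the cited works:

```python
# Minimal sketch: flatten a toy environment ontology into a
# combinatorial-testing (CT) input model mapping parameters to domains.
# All concept, property, and value names below are hypothetical.

ontology = {
    "RoadFragment": {"shape": ["straight", "left_curve", "right_curve"],
                     "condition": ["dry", "wet", "icy"]},
    "Weather": {"precipitation": ["none", "rain", "snow"],
                "daylight": ["day", "night"]},
    "Pedestrian": {"crossing": ["none", "left_to_right", "right_to_left"]},
}

def to_input_model(ontology):
    """Turn each concept property into a CT parameter ('Concept.property')."""
    model = {}
    for concept, properties in ontology.items():
        for prop, domain in properties.items():
            model[f"{concept}.{prop}"] = list(domain)
    return model

model = to_input_model(ontology)
print(len(model))                 # 5 parameters
print(model["Weather.daylight"])  # ['day', 'night']
```

Each abstract value (e.g., a "left_curve") would still have to be concretized with lengths, radii, speeds, and so on before a scenario can be simulated.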
cific testing methods, (2) improved 3D and physics simula- Klück and colleagues [Klück et al., 2018] improved the tion allow for revealing faults, and (3) in case of automated conversion algorithm from ontologies to combinatorial tests and autonomous driving the use ontologies capturing the en- and also introduced how to apply the proposed testing vironment to generate critical scenarios seems to be of partic- methodology in a practical setup. In Figure 1 (from [Tao ular importance. et al., 2019]), we depict the overall application process that can be fully automated. The process starts with an ontology 3 System testing for automated driving that captures the environment of the system under test (SUT). From this ontology, we obtain a combinatorial testing (CT) functions input model that is used to generate tests. Note that the gener- In this section, we recapitulate an approach for system testing ated tests are abstract tests and need to be further concretized. an ADAS functionality relying on an environmental ontology For example, in the ontology we may only distinguish road and combinatorial testing for generating test cases. The con- fragments to be straight, or a left or a right curve. The de- tent of this section relies on Tao and colleague’s paper [Tao et tails about the length or the radius of a curve are not given. al., 2019]. The intention of this section is to show the neces- Hence, in a concretization step we have to set these values. sity of capturing interactions between environmental entities Afterwards, the SUT can be simulated. In case of [Tao et al., in the case of autonomous driving and ADAS functionality 2019] this is done using certain tools for 3D simulation and for revealing faults. physical simulation like VTD or ModelConnect. The underlying idea behind Tao et al.’s work is to auto- In [Tao et al., 2019], the authors make use of an AEB as a matically extract test cases from an environmental ontology SUT. 
Instead of considering a general ontology that captures directly. The basic foundations behind have been outlined in all different scenarios that might occur during driving, the other papers. Wotawa and Li [Wotawa and Li, 2018] pre- authors focus on the scenarios from the European New Car sented a first algorithm that allows to convert ontologies into Assessment Program (Euro NCAP), which is a well-known input models of combinatorial testing [Kuhn et al., 2012; organization for car safety performance assessment provid- Kuhn et al., 2015]. An input model captures basically nec- ing consumers with a safety performance assessment for the essary parameters and their domains. In case of automated majority of the most popular cars. In Figure 2 from NCAP and autonomous driving the parameters are road fragments Euro [Euro, 2017; Euro and Protocol, 2017], we see some together with their conditions, other cars or pedestrians, the typical scenarios that have to be considered when testing an weather conditions, and so on. In order to find critical sce- AEB implementation. This includes the ego vehicle, i.e., the narios, it would be required to consider all different combi- SUT, to approach another vehicle but also to pass by park- nations of parameter values, which of course is not feasible. ing cars and also to consider pedestrians that may cross the In combinatorial testing, we are not considering all combi- street. [Tao et al., 2019] made use of these scenarios to come nations but only all combinations for an arbitrary subset of up with an adapted ontology for automated AEB testing. the set of parameters of size t. If t is smaller than the to- Using the conversion algorithm from ontologies into CT tal number of parameters, we have to generated substantially input models and a CT algorithm for generating test cases, fewer tests. A test suite where all combinations for all subsets Tao et al. 
were able to generate 993 test cases from 39 pa- of the parameters of size t are considered, is called a t-way rameters and a domain size of maximum 27 considering a combinatorial test suite or a test suite of strength t. When combinatorial strength of 2 only. From these tests, Tao et al. applying combinatorial testing to the domain of autonomous identified 17 that lead to a crash. Interestingly to note that in and automated driving, the underlying assumption is that it is the test suite we have two test cases that distinguishes only in results. Hence, we have to check how the whole system deals with this fact. (2) A system implementing more and more autonomy, e.g., an AEB autonomously mak- ing a decision about invoking emergency breaking, has to be tested as a whole in very much detail. Such sys- tems usually are at least very much complicated if not even complex. We have to assure that the system ful- fills its specification under a sheer amount of potential interactions between the system and its surrounding en- vironment. Therefore, it is also very much important to carry out testing in an automated way making use of simulation environments. Note that we do not restrict simulation in this context to simulation where all hard- ware parts are represented as virtual models. We may Figure 2: Some AEB scenarios from the EuroNcap protocol also test the whole system including hardware and soft- ware using a test bench where the hardware directly can be stimulated. the fact that one captures the situation of a dry road where the There is also another reason why automated system test- other requires the road to be wet. The latter test case leads to ing becomes increasingly important. There is a grow- a crash whereas the other does not. In addition, Tao et al. 
also ing need for making changes in the system after de- identified a case where a crash with a pedestrian happened but ployment because of the increasing amount of software only because two pedestrians cross the road one from left to in such AI-based systems. When dealing with safety- right and the other from right to left in close proximity. These critical systems every change causes the SUT to be results show that it is important to take care of the interaction again tested thoroughly. Without automation such and of the parameters. Moreover, it was not necessary to consider endeavor would be impossible to achieve considering all interactions, i.e., all combinations but only a few, at least available budget, effort and time constraints. Previous partially confirming the underlying assumption that we only research – some briefly discussed in this paper – also need to take care of all combinations for all subsets of param- demonstrates the usefulness of automated system testing eters of size t. for finding faults in autonomous systems using simula- In summary, we can conclude from the described example tion environments (e.g., see [Sotiropoulos et al., 2017] application of testing in this section the following: (1) System and [Tao et al., 2019]). Therefore, we may consider this testing based on 3D and physical simulation is able to reveal type of testing as a best practice also for safety-critical faults in ADAS, (2) ontologies capturing the environment of AI-based systems intended to interact with entities of the SUT are good enough to detect faults, (3) interactions of our physical world. parameters of scenarios are required for fault detection, and (4) it seems to be sufficient considering only a restricted num- Consider environmental knowledge: When dealing with ber of combinations of parameters of scenarios. testing the question is always how to obtain test cases? 
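The strength-2 generation recapitulated above can be illustrated with a small greedy covering-array construction. This is only a toy sketch over invented parameters; industrial CT generators, such as those used in the cited study, rely on far more scalable heuristics:

```python
# Toy greedy construction of a strength t=2 (pairwise) test suite.
# 'model' maps parameter names to finite value domains (all names invented).
from itertools import combinations, product

def pairwise_suite(model):
    params = sorted(model)
    # Every pair of values for every pair of parameters must be covered.
    uncovered = {((p, v), (q, w))
                 for p, q in combinations(params, 2)
                 for v in model[p] for w in model[q]}
    suite = []
    while uncovered:
        best, best_gain = None, 0
        # Scoring every full combination is fine for tiny models only.
        for values in product(*(model[p] for p in params)):
            test = dict(zip(params, values))
            gain = sum(1 for a, b in uncovered
                       if test[a[0]] == a[1] and test[b[0]] == b[1])
            if gain > best_gain:
                best, best_gain = test, gain
        suite.append(best)
        uncovered = {(a, b) for a, b in uncovered
                     if not (best[a[0]] == a[1] and best[b[0]] == b[1])}
    return suite

model = {"road": ["straight", "left", "right"],
         "condition": ["dry", "wet", "icy"],
         "pedestrian": ["none", "crossing"]}
suite = pairwise_suite(model)
# Covers all value pairs with far fewer than the 3*3*2 = 18 exhaustive tests.
print(len(suite))
```

The gap widens quickly: with 39 parameters, as in the recapitulated study, exhaustive combination is hopeless, while a 2-way suite stays in the hundreds of tests.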
4 Testing safety-critical AI

In this section, we want to summarize the findings obtained from testing autonomous and automated driving functions and, in addition, generalize these findings to other application areas with safety requirements. In particular, we discuss the necessity of carrying out system tests automatically using test cases obtained from environmental knowledge, and outline challenges and partial solutions regarding guarantees of testing and the prediction of remaining risks after testing. Especially in the context of safety-critical systems, the last two issues are of importance.

Automated system testing: System testing is of utmost importance, not only in the context of AI-based systems. Even in the case that each subcomponent of a SUT has been tested or formally verified thoroughly, the interaction between subsystems may reveal faulty behavior. In the context of AI-based systems, the system test is even more important. There are two reasons: (1) An AI-based subsystem, e.g., a vision sensor used for classifying obstacles in the case of an AEB, might not always deliver correct results. Hence, we have to check how the whole system deals with this fact. (2) A system implementing more and more autonomy, e.g., an AEB autonomously making a decision about invoking emergency braking, has to be tested as a whole in great detail. Such systems are usually at least very complicated, if not complex. We have to assure that the system fulfills its specification under the sheer amount of potential interactions between the system and its surrounding environment. Therefore, it is also very important to carry out testing in an automated way, making use of simulation environments. Note that we do not restrict simulation in this context to setups where all hardware parts are represented as virtual models. We may also test the whole system, including hardware and software, using a test bench where the hardware can be stimulated directly.

There is also another reason why automated system testing becomes increasingly important. There is a growing need for making changes in the system after deployment because of the increasing amount of software in such AI-based systems. When dealing with safety-critical systems, every change requires the SUT to be tested thoroughly again. Without automation, such an endeavor would be impossible to achieve considering available budget, effort, and time constraints. Previous research – some of it briefly discussed in this paper – also demonstrates the usefulness of automated system testing for finding faults in autonomous systems using simulation environments (e.g., see [Sotiropoulos et al., 2017] and [Tao et al., 2019]). Therefore, we may consider this type of testing as a best practice also for safety-critical AI-based systems intended to interact with entities of our physical world.

Consider environmental knowledge: When dealing with testing, the question is always how to obtain test cases. In practice, test cases are often manually crafted, even in the case of safety-critical systems. In addition, other approaches like model-based testing (MBT) [Schieferdecker, 2012] are used, which utilize a model of the SUT for generating tests. MBT is without any doubt an important method for test case generation to guarantee covering the functionality of the SUT. However, in the case of AI-based systems interacting with our physical world, it is at least equally important to also consider the environmental interactions, which might also come from independent entities like other cars or pedestrians crossing the streets. Hence, finding a way to represent the knowledge we have about our world is essential for testing AI-based systems with increased autonomy or adaptive behavior. This requirement is very well supported by the large amount of research papers dealing with the use of environmental ontologies for testing such systems in the context of autonomous driving. Without environmental models, finding critical situations that might cause the SUT to violate its specification can hardly be achieved.

Providing guarantees: Testing is well known to be incomplete. Only in the case of a failure-revealing test do we know that the SUT is still faulty. If the SUT passes all tests, either we have not used the right test cases, or the SUT really fulfills its specification. Because we cannot test forever, the questions of when to stop testing, and of the consequences of stopping, are of utmost importance. Providing guarantees, like knowing that a certain type of coverage or mutation score has been achieved, is especially important when testing safety-critical systems, where standards require testing to reach a given coverage criterion. Testing the whole system is usually carried out without knowing internal details of the system, i.e., as black-box testing. In order to come up with testing criteria for AI-based systems, we may borrow some ideas from CT. In CT, the strength t is used as a means for representing coverage. This is due to the fact that t represents the number of interactions between any t parameters. Hence, t guarantees that all interactions of size t have been captured. When using CT as in the work of [Tao et al., 2019], the strength used for generating the tests can serve as a guarantee. What is missing is a detailed analysis of meaningful values for t in the context of autonomous systems. For ordinary systems like web browsers etc., Kuhn and colleagues [Kuhn et al., 2009] showed that at most 6 interactions are necessary to reveal all previously detected faults. For the autonomous systems domain, such an analysis is missing. Another way of coming up with a measure representing some sort of guarantee would be to consider the underlying ontologies themselves. For example, if we have a generally agreed ontology of the environment, we are able to judge testing with respect to the use of the environmental entities. Ontology coverage may be the percentage of concepts from such an ontology used in testing. Again, research is needed for (1) coming up with such an ontology for the application domain, and (2) defining ontology coverage formally.

Estimate the residual risk: The residual risk in the case of testing corresponds to the risk that the SUT will fail during operation after testing, causing serious harm. Hence, the remaining risk after applying verification and validation is proportional to the risk of missing important test cases, i.e., those leading to critical situations. When considering parameters like the combinatorial strength or ontology coverage as discussed before, the residual risk should be proportional to these parameters. As far as we know, there has been no research trying to estimate the residual risk based on metrics used to specify some sort of guarantees. In order to be of use in practice, we have to search for a method that allows estimating the residual risk of testing automated and autonomous systems based on certain parameters like coverage, mutation score, or combinatorial strength.

5 Conclusions

Testing AI-based systems has been in the focus of research for several decades. Because of the increasing importance and use of AI methodologies, ranging from knowledge-based systems to machine learning, there is a strong need for testing methodologies that come with certain guarantees. Especially for safety-critical systems, such a testing methodology would be required. In this paper, we argue that automated system testing is of utmost importance, comprising test case generation from environmental models and test execution using simulation. When relying on environmental models, i.e., ontologies, and testing techniques like combinatorial testing, the ontology coverage and the combinatorial strength can be used for giving guarantees and also for estimating potential residual risks.

Future research has to consider studies mapping the combinatorial strength to the number of undetected faults in the case of autonomous and AI-based systems. In addition, we have to come up with other coverage definitions like ontology coverage and a prediction of the residual risk in the case of testing.

Acknowledgment

The financial support by the Austrian Federal Ministry for Digital and Economic Affairs and the National Foundation for Research, Technology and Development is gratefully acknowledged.

References

[Chetouane et al., 2019] Nour Chetouane, Lorenz Klampfl, and Franz Wotawa. Investigating the effectiveness of mutation testing tools in the context of deep neural networks. In Proceedings of the 15th International Work-Conference on Artificial Neural Networks (IWANN), Gran Canaria, Spain, 2019. Springer.

[El-Korany et al., 2000] Abeer El-Korany, Ahmed Rafea, Hoda Baraka, and Saad Eid. A structured testing methodology for knowledge-based systems. In 11th International Conference on Database and Expert Systems Applications (DEXA), pages 427–436. Springer, 2000.

[Euro and Protocol, 2017] NCAP Euro and AEB VRU Test Protocol. Test protocol - AEB VRU test, 2017.

[Euro, 2017] NCAP Euro. Test protocol - AEB systems. Brussels, Belgium: Eur. New Car Assess. Programme (Euro NCAP), 2017.

[Felfernig et al., 2005] A. Felfernig, K. Isak, and T. Kruggel. Testing knowledge-based recommender systems. OEGAI Journal, 4:12–18, 2005.

[Gauerhof et al., 2018] Lydia Gauerhof, Peter Munk, and Simon Burton. Structuring validation targets of a machine learning function applied to automated driving. In Barbara Gallina, Amund Skavhaug, and Friedemann Bitsch, editors, Computer Safety, Reliability, and Security, pages 45–58, Cham, 2018. Springer International Publishing.

[Geyer et al., 2014] S. Geyer, M. Baltzer, B. Franz, S. Hakuli, M. Kauer, M. Kienle, S. Meier, T. Weissgerber, K. Bengler, R. Bruder, F. Flemisch, and H. Winner. Concept and development of a unified ontology for generating test and use-case catalogues for assisted and automated vehicle guidance. IET Intelligent Transport Systems, 8(3):183–189, May 2014.

[Goodfellow et al., 2018] Ian Goodfellow, Patrick McDaniel, and Nicolas Papernot. Making machine learning robust against adversarial inputs. Commun. ACM, 61(7):56–66, June 2018.

[Hartung and Håkansson, 2007] Ronald Hartung and Anne Håkansson. Automated testing for knowledge based systems. In Bruno Apolloni, Robert J. Howlett, and Lakhmi Jain, editors, Knowledge-Based Intelligent Information and Engineering Systems, volume 4692 of Lecture Notes in Computer Science, pages 270–278. Springer Berlin Heidelberg, 2007.

[Ma et al., 2018a] Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, et al. Deepgauge: Multi-granularity testing criteria for deep learning systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pages 120–131. ACM, 2018.

[Ma et al., 2018b] Lei Ma, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Felix Juefei-Xu, Chao Xie, Li Li, Yang Liu, Jianjun Zhao, et al. Deepmutation: Mutation testing of deep learning systems. In 2018 IEEE 29th International Symposium on Software Reliability Engineering (ISSRE), pages 100–111. IEEE, 2018.

[Ma et al., 2018c] Lei Ma, Fuyuan Zhang, Minhui Xue, Bo Li, Yang Liu, Jianjun Zhao, and Yadong Wang. Combinatorial testing for deep learning systems. arXiv preprint arXiv:1806.07723, 2018.
[Hayes and Parzen, 1997] Caroline C. Hayes and Michael I. Parzen. Quem: An achievement test for knowledge-based systems. IEEE Transactions on Knowledge and Data Engineering, 9(6):838–847, November/December 1997.

[Henriksson et al., 2018] Jens Henriksson, Markus Borg, and Cristofer Englund. Automotive safety and machine learning: initial results from a study on how to adapt the ISO 26262 safety standard. In Proceedings of the 1st International Workshop on Software Engineering for AI in Autonomous Systems, May 2018. DOI: 10.1145/3194085.3194090.

[Kalra and Paddock, 2016] Nidhi Kalra and Susan M. Paddock. Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability? Transportation Research Part A: Policy and Practice, 94:182–193, 2016.

[Klück et al., 2018] Florian Klück, Yihao Li, Mihai Nica, Jianbo Tao, and Franz Wotawa. Using ontologies for test suites generation for automated and autonomous driving functions. In Proc. of the 29th IEEE International Symposium on Software Reliability Engineering (ISSRE 2018) - Industrial Track, 2018.

[Koopman and Wagner, 2016] Philip Koopman and Michael Wagner. Challenges in autonomous vehicle testing and validation. SAE Int. J. Trans. Safety, 4:15–24, 04 2016.

[Koroglu and Wotawa, 2019] Yavuz Koroglu and Franz Wotawa. Fully automated compiler testing of a reasoning engine via mutated grammar fuzzing. In Proc. of the 14th IEEE/ACM International Workshop on Automation of Software Test (AST), Montreal, Canada, 27th May 2019.

[Kuhn et al., 2009] D.R. Kuhn, R.N. Kacker, Y. Lei, and J. Hunter. Combinatorial software testing. Computer, pages 94–96, August 2009.

[Kuhn et al., 2012] D. R. Kuhn, R. N. Kacker, and Y. Lei. Combinatorial testing. In Phillip A. Laplante, editor, Encyclopedia of Software Engineering. Taylor & Francis, 2012.

[Kuhn et al., 2015] D. Richard Kuhn, Renee Bryce, Feng Duan, Laleh Sh. Ghandehari, Yu Lei, and Raghu N. Kacker. Combinatorial testing: Theory and practice. In Advances in Computers, volume 99, pages 1–66. 2015.

[Menzel et al., 2018] Till Menzel, Gerrit Bagschik, and Markus Maurer. Scenarios for development, test and validation of automated vehicles. In arXiv:1801.08598, 2018. Appeared in Proc. of the IEEE Intelligent Vehicles Symposium.

[Murrell and Plant, 1997] Stephen Murrell and Robert T. Plant. A survey of tools for the validation and verification of knowledge-based systems: 1985-1995. Decision Support Systems, 21(4):307–323, 1997.

[Pei et al., 2017] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. Deepxplore: Automated whitebox testing of deep learning systems. In Proceedings of the 26th Symposium on Operating Systems Principles, pages 1–18. ACM, 2017.

[Plant, 1991] Robert Plant. Rigorous approach to the development of knowledge-based systems. Knowl.-Based Syst., 4(4):186–196, 1991.

[Plant, 1992] Robert T. Plant. Expert system development and testing: A knowledge engineer's perspective. Journal of Systems and Software, 19(2):141–146, 1992.

[Schieferdecker, 2012] Ina Schieferdecker. Model-based testing. IEEE Software, 29(1):14–18, Jan/Feb 2012.

[Schuldt et al., 2018] Fabian Schuldt, Andreas Reschka, and Markus Maurer. A method for an efficient, systematic test case generation for advanced driver assistance systems in virtual environments. In Hermann Winner, Gunter Prokop, and Markus Maurer, editors, Automotive Systems Engineering II. Springer International Publishing AG, 2018.

[Seshia and Sadigh, 2016] Sanjit A. Seshia and Dorsa Sadigh. Towards verified artificial intelligence. CoRR, abs/1606.08514, 2016.

[Sotiropoulos et al., 2017] T. Sotiropoulos, H. Waeselynck, J. Guiochet, and F. Ingrand. Can robot navigation bugs be found in simulation? An exploratory study. In 2017 IEEE International Conference on Software Quality, Reliability and Security (QRS), pages 150–159, July 2017.

[Su et al., 2019] J. Su, D. V. Vargas, and K. Sakurai. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation, pages 1–1, 2019.

[Sun et al., 2018] Youcheng Sun, Xiaowei Huang, and Daniel Kroening. Testing deep neural networks. arXiv preprint arXiv:1803.04792, 2018.

[Tao et al., 2019] Jianbo Tao, Yihao Li, Franz Wotawa, Hermann Felbinger, and Mihai Nica. On the industrial application of combinatorial testing for autonomous driving functions. In Proceedings of the International Workshop on Combinatorial Testing (IWCT). IEEE, 2019.

[Tiihonen et al., 2002] Juha Tiihonen, Timo Soininen, Ilkka Niemelä, and Reijo Sulonen. Empirical testing of a weight constraint rule based configurator. In Proceedings of the ECAI 2002 Configuration Workshop, pages 17–22, 2002.

[Wicker et al., 2018] Matthew Wicker, Xiaowei Huang, and Marta Kwiatkowska. Feature-guided black-box safety testing of deep neural networks. In Dirk Beyer and Marieke Huisman, editors, Tools and Algorithms for the Construction and Analysis of Systems - 24th International Conference, TACAS 2018, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2018, Thessaloniki, Greece, April 14-20, 2018, Proceedings, Part I, volume 10805 of Lecture Notes in Computer Science, pages 408–426. Springer, 2018.

[Wotawa and Li, 2018] Franz Wotawa and Yihao Li. From ontologies to input models for combinatorial testing. In Inmaculada Medina-Bulo, Mercedes G. Merayo, and Robert Hierons, editors, Testing Software and Systems, pages 155–170. Springer International Publishing, 2018.

[Wotawa and Pill, 2014] Franz Wotawa and Ingo Pill.

[Xiong et al., 2013] Zhitao Xiong, Hamish Jamson, Anthony G. Cohn, and Oliver Carsten. Ontology for Scenario Orchestration (OSO): A Standardised Scenario Description in Driving Simulation. ASCE Library, 2013.
Test- ing configuration knowledge-bases. In Proceedings of the 16th Workshop on Configuration, Novi Sad, Serbia, September 2014. [Wotawa et al., 2018] Franz Wotawa, Bernhard Peischl, Flo- rian Klück, and Mihai Nica. Quality assurance methodolo- gies for automated driving. Elektrotechnik & Information- stechnik, 135(4–5), 2018. https://doi.org/10.1007/s00502- 018-0630-7. [Wotawa, 2016a] Franz Wotawa. Testing autonomous and highly configurable systems: Challenges and feasible so- lutions. In D. Watzenig and M. Horn, editors, Automated Driving. Springer International Publishing, 2016. DOI 10.1007/978-3-319-31895-0 22. [Wotawa, 2016b] Franz Wotawa. Testing self-adaptive sys- tems using fault injection and combinatorial testing. In Proceedings of the Intl. Workshop on Verification and Val- idation of Adaptive Systems (VVASS 2016), Vienna, Aus- tria, 2016. [Wotawa, 2018a] Franz Wotawa. Combining combinatorial testing and metamorphic testing for testing a logic-based non-monotonic reasoning system. In In Proceedings of the 7th International Workshop on Combinatorial Testing (IWCT) / ICST 2018, April 13th 2018. [Wotawa, 2018b] Franz Wotawa. On the automation of test- ing a logic-based diagnosis system. In In Proceedings of the 13th International Workshop on Testing: Academia- Industry Collaboration, Practice and Research Techniques (TAIC PART) / ICST 2018, April 9th 2018.