On the importance of system testing for assuring safety of AI systems

                                            Franz Wotawa1
         1
           CD Lab for Quality Assurance Methodologies for Autonomous Cyber-Physical Systems,
                       TU Graz, Institute for Software Technology, Graz, Austria.
                                          wotawa@ist.tugraz.at


                         Abstract                                tegrity levels (ASIL) where ASIL A is the weakest and ASIL
                                                                 D the strongest requiring specific considerations during de-
     Rigorous testing of automated and autonomous                velopment in order to keep risks below an acceptable level. It
     systems is inevitable especially in case of safety-         is well known that standards like IEC 61580 do not recom-
     critical systems like cars or airplanes. There ex-          mend AI methodology to be used in systems with a higher
     ist several functional safety standards that have to        safety integrity level (see IEC 61580-3:2010 Table A.2). It
     be fulfilled like IEC 61508 explicitly stating that         is interesting to note that this restriction applies also to the
     AI methodologies are not recommended to be used             automotive industry where we see an increasing use of au-
     in case of systems with higher safety requirements.         tomated and autonomous functions some of them also based
     Hence, there is a necessity to adopt these standards        on machine learning technologies. There is obviously a gap
     in a direction where AI methodology is allowed              between the safety standards and the use of AI methodology
     to be used providing to fulfill certain standardized        in practice requiring adaptations in the standards. See for ex-
     quality assurance method to be taken care of dur-           ample, Henriksson and colleagues [Henriksson et al., 2018]
     ing development. In this paper, we contribute to            contributed ideas for evolving the standards towards captur-
     this endeavor and discuss the urgent need for sys-          ing machine learning applications. In addition, there are new
     tem testing in the context of safety-critical systems       standards coming up like ISO/PAS 21448:2019 considering
     comprising AI methodologies. In particular, we              safety of the intended functionality for road vehicles that con-
     argue based on one example from the automotive              sider situations comprising complex sensors and processing
     industry that it is strongly recommended to con-            algorithms including the application of machine learning.
     sider not only subsystems but instead the whole
     system interacting with its environment when car-              There have been many papers dealing with the general
     rying out tests. The discussed example is an ad-            challenge of verifying and validating autonomous vehicles
     vanced driver-assistance systems used to break in           like Koopman and Wagner [Koopman and Wagner, 2016],
     case of an emergency that does not rely on machine          Wotawa [Wotawa, 2016a], Schuldt and colleagues [Schuldt et
     learning but comprises a decision part that invokes         al., 2018], or Wotawa and colleagues [Wotawa et al., 2018].
     breaking once the sensors identify an obstacle that         An essential challenge in this context is how to assure to test
     might be hit otherwise. Results obtained from an            such systems in a way that can be considered as good enough?
     already reported testing methodology, revealed that         Kalra and Paddock [Kalra and Paddock, 2016] answered this
     when using tests considering the environment of             question stating that an autonomous vehicle has to operate for
     an automated emergency breaking systems, we ob-             275 million miles for verification purposes. In their calcula-
     tain critical scenarios that might otherwise have not       tion, Kalra and Paddock considered the fatality rate of driving
     been detected. From this observation, we conclude           in the USA and assumed that an autonomous vehicle should
     that rigorous system testing becomes even more              have a far lower fatality rate. Despite the fact that such a huge
     important for systems with AI methodology based             number of miles can hardly be achieved with a small fleet
     on machine learning or allowing to adapt the sys-           of test cars, there is also another hidden assumption behind
     tem’s behavior during operation.                            the calculation, i.e., during testing on streets the autonomous
                                                                 vehicle has to deal with all critical scenarios, which seems
                                                                 to be somehow unrealistic. Hence, as a consequence, re-
1   Introduction                                                 searchers have proposed to use ontologies for testing auto-
Safety-critical systems are systems where internal fault or      mated and autonomous vehicles, e.g., [Xiong et al., 2013;
a malfunction may cause death or serious injury to people,       Geyer et al., 2014; Menzel et al., 2018]. These contributions
loss or severe damage to property, or environmental harm.        deal with testing the whole system using different scenarios.
For such systems, software and system engineering standards         The intention behind this paper is to discuss the need for
have been introduced like IEC 61508 or ISO 26262 for the au-     system testing in the context of automated and autonomous
tomotive industry. The latter introduces automotive safety in-   driving and to learn important requirements that can be used
in other application areas utilizing AI technology as well. In       other paper dealing with system testing of an AEB in Sec-
particular, we discuss results obtained when applying two dif-       tion 3, from which we are going to derive requirements neces-
ferent testing techniques to verify the functionality of the ad-     sary to verify safety-critical systems using AI methods (Sec-
vanced driver-assistance system (ADAS) autonomous emer-              tion 4). Finally, we conclude the paper in Section 5.
gency breaking (AEB). The AEB comprises sensors for de-
tection obstacles, and a decision system that controls auto-
mated breaking whenever necessary to avoid or mitigate col-
                                                                     2   Related research
lisions. Currently, AEB systems utilize different sensor tech-       Testing AI-based system itself is not a novel research area.
nologies like radar, cameras, or LIDAR, together with sensor         Starting in the 90s of the last century, R. Plant [Plant, 1991;
fusion capabilities. In cases where obstacles need to be classi-     Plant, 1992] reported on the development of expert systems
fied, e.g., as persons or bicyclists, machine learning methods       and also considering testing. Together with S. Murell, R.
might be applied to learn the vision sensor distinguishing cat-      Plant [Murrell and Plant, 1997] published also a survey on
egories. Verification of AEB requires to verify its subsystems       tools for verifying and validating knowledge-based systems
as well as the system itself during operation or at least in an      that had been published between 1985 and 1995. Later El-
environment that is close to the real use.                           Korany and colleagues [El-Korany et al., 2000] presented a
   There are similarities and expected differences when con-         structured approach, and Hartung and Håkansson [Hartung
sidering systems comprising AI methods like machine learn-           and Håkansson, 2007] an automated approach for testing such
ing as subjects of testing. For example, in case of machine          systems. Other work, in the field of knowledge-based system
learning the outcome depends on the data used for learning a         testing include [Hayes and Parzen, 1997], [Felfernig et al.,
certain model and the underlying machine learning approach.          2005] dealing with testing recommender systems, [Tiihonen
Hence, the outcome of the finally obtained model might vary.         et al., 2002], and [Wotawa and Pill, 2014]. Most recently,
Other parts of the overall system relying on such a varying          there has been some paper dealing with testing specific prop-
outcome have to deal with this uncertainties in an appropriate       erties of logic reasoning engines [Wotawa, 2018a], the gen-
way not causing safety hazards. Hence, any verification ap-          eral challenge of testing such systems [Wotawa, 2018b], and
proach needs to consider this variation and try find a critical      testing subsystems of logic reasoning engines like their com-
situation. In addition, sensor based on machine learning may         pilers [Koroglu and Wotawa, 2019]. In any of these cases,
not always deliver the correct classification results. Even if       the focus were on testing the reasoning engine alone not con-
coming up with a correct classification in 99% of the cases,         sidering its use in a specific application like a mobile robot
we have to assure that the 1% of the remaining cases does            or any other application that requires reasoning capabilities.
not lead to safety violations. Hence, the overall system has         In contrast, Wotawa [Wotawa, 2016b] presented a testing ap-
to compensate for inaccuracies that might be higher than for         proach of adaptive systems that make use of models of the
ordinary sensors, and there is a strong requirement to verify        system itself, i.e., knowledge of system’s internal structure
the system especially considering all the cases where classifi-      and behavior. There the author considers fault injection for
cation goes wrong.                                                   testing whether the adaptive system handles internal faults as
   In the following, we will introduce such a testing environ-       expected.
ment that is used together with test case generation methods            In case of machine learning and in particular neural net-
to carry out system testing automatically without the need of        works there have been many papers dealing with testing in-
user interference. Interestingly, we will see that there are test-   cluding [Ma et al., 2018a], [Sun et al., 2018], [Pei et al.,
ing approaches that reveal faults in an AEB that very much           2017], [Ma et al., 2018b], and [Ma et al., 2018c] applying
likely would not have been found otherwise. This testing ap-         different well-known testing techniques, like mutation test-
proach combines environmental ontologies with combinato-             ing, combinatorial testing or whitebox testing approaches to
rial testing [Kuhn et al., 2012; Kuhn et al., 2015]. The un-         neural networks. Chetouane and colleagues [Chetouane et al.,
derlying assumption is that there is a need for finding combi-       2019] discussed a slightly different approach to testing neural
nations of environmental entities for revealing faults. Hence,       networks using mutation testing and coverage in a more ordi-
we argue that it is not only necessary to carry out system tests     nary setting. In addition, it is well known that neural networks
but also to consider environmental interactions in case of au-       are vulnerable against adversarial inputs where only changing
tonomous systems. In addition, we discuss some further chal-         one pixel in an image lead to a wrong classification. See for
lenges of testing autonomous systems, i.e., providing some           example, Su and colleagues [Su et al., 2019] work. Adver-
sort of guarantees and methods for estimating the residual           sarial attacks can also be more tailored towards more realistic
risk, i.e., the risk of still comprising a fault even after car-     attacks. We refer the interested reader to Wicker and col-
rying out a proposed testing methodology. We will see that           leagues [Wicker et al., 2018] for one example. Counter mea-
the use of environmental models allow for specifying such            sures against adversarial input has been considered. Most re-
guarantees based on the degree of considering interactions           cently, Goddfellow and colleagues [Goodfellow et al., 2018]
between environmental entities and the degree to which the           discuss and summarize some of them.
environmental model has been used for verifying a particular            In the context of automated and autonomous driving we
system.                                                              discussed related papers in the introduction. The main fo-
   This paper is organized as follows: In Section 2 we discuss       cus is on identifying critical scenarios using ontologies from
related research focusing on AI-based systems. Afterwards,           which test cases can be derived, e.g., [Xiong et al., 2013;
and to be self-contained, we introduce the results from an-          Geyer et al., 2014; Menzel et al., 2018]. In addition, there is
a shift from carrying out vehicle tests on the road to a simu-
lation environment capturing physics as well as 3D models.
In such an environment more tests in less time can be carried
out. Because of the improvement in simulation technology,
the carried out tests become closer to reality finding faults
that would also been detected in a real environment. In an
explorative study Sotiropoulos and colleagues [Sotiropoulos
et al., 2017] showed that simulation of a mobile robot in deed
revealed faults that had been also detected when carrying out
test in our physical world.
   In addition, there have been research work on formally ver-
ifying machine learning and AI-based solutions. Seshia and
Sadigh [Seshia and Sadigh, 2016] discussed the use of formal
                                                                   Figure 1: Overview of the testing process: from ontology to execu-
methods for verifying AI including how to handle involved
                                                                   tion
challenges. Gauerhof et al. [Gauerhof et al., 2018] tackled
the case of the use of machine learning in the context of au-
tonomous driving focusing on validation issues.
   What we can take with us from the mentioned related re-         sufficient to restrict testing to all combinations of all subsets
search is the following: (1) Different AI methodologies like       of size t instead of considering all combinations of parame-
knowledge-based systems or machine learning require spe-           ters.
cific testing methods, (2) improved 3D and physics simula-            Klück and colleagues [Klück et al., 2018] improved the
tion allow for revealing faults, and (3) in case of automated      conversion algorithm from ontologies to combinatorial tests
and autonomous driving the use ontologies capturing the en-        and also introduced how to apply the proposed testing
vironment to generate critical scenarios seems to be of partic-    methodology in a practical setup. In Figure 1 (from [Tao
ular importance.                                                   et al., 2019]), we depict the overall application process that
                                                                   can be fully automated. The process starts with an ontology
3   System testing for automated driving                           that captures the environment of the system under test (SUT).
                                                                   From this ontology, we obtain a combinatorial testing (CT)
    functions                                                      input model that is used to generate tests. Note that the gener-
In this section, we recapitulate an approach for system testing    ated tests are abstract tests and need to be further concretized.
an ADAS functionality relying on an environmental ontology         For example, in the ontology we may only distinguish road
and combinatorial testing for generating test cases. The con-      fragments to be straight, or a left or a right curve. The de-
tent of this section relies on Tao and colleague’s paper [Tao et   tails about the length or the radius of a curve are not given.
al., 2019]. The intention of this section is to show the neces-    Hence, in a concretization step we have to set these values.
sity of capturing interactions between environmental entities      Afterwards, the SUT can be simulated. In case of [Tao et al.,
in the case of autonomous driving and ADAS functionality           2019] this is done using certain tools for 3D simulation and
for revealing faults.                                              physical simulation like VTD or ModelConnect.
   The underlying idea behind Tao et al.’s work is to auto-           In [Tao et al., 2019], the authors make use of an AEB as a
matically extract test cases from an environmental ontology        SUT. Instead of considering a general ontology that captures
directly. The basic foundations behind have been outlined in       all different scenarios that might occur during driving, the
other papers. Wotawa and Li [Wotawa and Li, 2018] pre-             authors focus on the scenarios from the European New Car
sented a first algorithm that allows to convert ontologies into    Assessment Program (Euro NCAP), which is a well-known
input models of combinatorial testing [Kuhn et al., 2012;          organization for car safety performance assessment provid-
Kuhn et al., 2015]. An input model captures basically nec-         ing consumers with a safety performance assessment for the
essary parameters and their domains. In case of automated          majority of the most popular cars. In Figure 2 from NCAP
and autonomous driving the parameters are road fragments           Euro [Euro, 2017; Euro and Protocol, 2017], we see some
together with their conditions, other cars or pedestrians, the     typical scenarios that have to be considered when testing an
weather conditions, and so on. In order to find critical sce-      AEB implementation. This includes the ego vehicle, i.e., the
narios, it would be required to consider all different combi-      SUT, to approach another vehicle but also to pass by park-
nations of parameter values, which of course is not feasible.      ing cars and also to consider pedestrians that may cross the
In combinatorial testing, we are not considering all combi-        street. [Tao et al., 2019] made use of these scenarios to come
nations but only all combinations for an arbitrary subset of       up with an adapted ontology for automated AEB testing.
the set of parameters of size t. If t is smaller than the to-         Using the conversion algorithm from ontologies into CT
tal number of parameters, we have to generated substantially       input models and a CT algorithm for generating test cases,
fewer tests. A test suite where all combinations for all subsets   Tao et al. were able to generate 993 test cases from 39 pa-
of the parameters of size t are considered, is called a t-way      rameters and a domain size of maximum 27 considering a
combinatorial test suite or a test suite of strength t. When       combinatorial strength of 2 only. From these tests, Tao et al.
applying combinatorial testing to the domain of autonomous         identified 17 that lead to a crash. Interestingly to note that in
and automated driving, the underlying assumption is that it is     the test suite we have two test cases that distinguishes only in
                                                                         results. Hence, we have to check how the whole system
                                                                         deals with this fact. (2) A system implementing more
                                                                         and more autonomy, e.g., an AEB autonomously mak-
                                                                         ing a decision about invoking emergency breaking, has
                                                                         to be tested as a whole in very much detail. Such sys-
                                                                         tems usually are at least very much complicated if not
                                                                         even complex. We have to assure that the system ful-
                                                                         fills its specification under a sheer amount of potential
                                                                         interactions between the system and its surrounding en-
                                                                         vironment. Therefore, it is also very much important
                                                                         to carry out testing in an automated way making use of
                                                                         simulation environments. Note that we do not restrict
                                                                         simulation in this context to simulation where all hard-
                                                                         ware parts are represented as virtual models. We may
    Figure 2: Some AEB scenarios from the EuroNcap protocol              also test the whole system including hardware and soft-
                                                                         ware using a test bench where the hardware directly can
                                                                         be stimulated.
the fact that one captures the situation of a dry road where the
                                                                         There is also another reason why automated system test-
other requires the road to be wet. The latter test case leads to
                                                                         ing becomes increasingly important. There is a grow-
a crash whereas the other does not. In addition, Tao et al. also
                                                                         ing need for making changes in the system after de-
identified a case where a crash with a pedestrian happened but
                                                                         ployment because of the increasing amount of software
only because two pedestrians cross the road one from left to
                                                                         in such AI-based systems. When dealing with safety-
right and the other from right to left in close proximity. These
                                                                         critical systems every change causes the SUT to be
results show that it is important to take care of the interaction
                                                                         again tested thoroughly. Without automation such and
of the parameters. Moreover, it was not necessary to consider
                                                                         endeavor would be impossible to achieve considering
all interactions, i.e., all combinations but only a few, at least
                                                                         available budget, effort and time constraints. Previous
partially confirming the underlying assumption that we only
                                                                         research – some briefly discussed in this paper – also
need to take care of all combinations for all subsets of param-
                                                                         demonstrates the usefulness of automated system testing
eters of size t.
                                                                         for finding faults in autonomous systems using simula-
   In summary, we can conclude from the described example                tion environments (e.g., see [Sotiropoulos et al., 2017]
application of testing in this section the following: (1) System         and [Tao et al., 2019]). Therefore, we may consider this
testing based on 3D and physical simulation is able to reveal            type of testing as a best practice also for safety-critical
faults in ADAS, (2) ontologies capturing the environment of              AI-based systems intended to interact with entities of
the SUT are good enough to detect faults, (3) interactions of            our physical world.
parameters of scenarios are required for fault detection, and
(4) it seems to be sufficient considering only a restricted num-    Consider environmental knowledge: When dealing with
ber of combinations of parameters of scenarios.                         testing the question is always how to obtain test cases?
                                                                        In practice, test cases are often manually crafted even
4    Testing safety-critical AI                                         in case of safety-critical systems. In addition, other
                                                                        approaches like model-based testing (MBT) [Schiefer-
In this section, we want to summarize the finding obtained              decker, 2012] are used, which utilizes a model of the
from testing autonomous and automated driving functions,                SUT for generating tests. MBT is without any doubt an
and in addition, generalize the finding to other application            important method for test case generation to guarantee
areas with safety requirements. In particular, we discuss the           covering the functionality of the SUT. However, in the
necessity of carrying out system tests automatically using test         case of AI-based systems interacting with our physical
cases obtained from environmental knowledge, and outline                world, it is at least equally important to also consider the
challenges and partial solution regarding guarantees of test-           environmental interactions, which might also come from
ing and the prediction of remaining risks after testing. Es-            independent entities like other cars or pedestrians cross-
pecially, in the context of safety-critical systems the last two        ing the streets. Hence, finding a way to represent the
issues are of importance.                                               knowledge we have about our world is essential for test-
                                                                        ing AI-based systems with increased autonomy or adap-
Automated system testing: System testing is of uttermost
                                                                        tive behavior. This requirement is very well supported
    importance not only in the context of AI-based systems.
                                                                        by the large amount of research papers dealing with the
    Even in the case that each subcomponent of a SUT might
                                                                        use of environmental ontologies for testing such systems
    be tested or formally verified thoroughly, the interaction
                                                                        in the context of the autonomous driving. Without envi-
    between subsystems may reveal a faulty behavior. In
                                                                        ronmental models finding critical situations that might
    the context of AI-based systems the system tests is even
                                                                        cause the SUT to violate its specification can hardly be
    more important. There are two reasons: (1) An AI-based
                                                                        achieved.
    system, e.g., a vision sensor used for classification obsta-
    cles in case of an AEB, might not always deliver correct        Providing guarantees: Testing is well-known to be incom-
     plete. Only in case of a failure revealing test, we know        5   Conclusions
     that the SUT is still faulty. If the SUT passes all tests,      Testing AI-based systems has been in the focus of research
     either we have not used the right test case, or the SUT         for several decades. Because of the increasing importance
     itself is really fulfilling its specification. Because of the   and use of AI methodologies ranging from knowledge-based
     fact that we cannot test forever the questions of when to       systems to machine learning, there is a strong need for test-
     stop testing and also about the consequences are of ut-         ing methodologies that come with certain guarantees. Espe-
     termost importance. Providing guarantees like knowing           cially, for safety-critical systems such a testing methodology
     to achieve a certain type of coverage or mutation score         would be required. In this paper, we argue that the automated
     is especially important when testing safety-critical sys-       system test is of uttermost importance comprising test case
     tems where standards require to test for reaching a given       generation from environmental models and test execution us-
     coverage criteria. Testing the whole system usually is          ing simulation. When relying on environmental models, i.e.,
     carried out without knowing internal details of the sys-        ontologies, and testing techniques like combinatorial testing,
     tem i.e., as black-box testing. In order to come up with        the ontology coverage and the combinatorial strength can be
     a testing criteria for testing AI-based systems we may          used for giving guarantees and also for estimating potential
     borrow some ideas from CT.                                      residual risks.
     In CT the strength t is used as means for represent-               Future research has to consider studies mapping the com-
     ing coverage. This is due to the fact that t represents         binatorial strength to the number of not detected faults in case
     the number of interactions between any t parameters.            of autonomous and AI-based systems. In addition, we have
     Hence, t guarantees that all interactions of size t has         to come up with other coverage definitions like ontology cov-
     been captured. When using CT like in [Tao et al., 2019]         erage and a prediction of the residual risk in case of testing.
     work, the strength used for generating the test can be
     used as a guarantee. What is missing is a detailed anal-        Acknowledgment
     ysis about meaningful values for t in the context of            The financial support by the Austrian Federal Ministry for
     autonomous systems. For ordinary systems like web               Digital and Economic Affairs and the National Foundation
     browsers etc. Kuhn and colleagues [Kuhn et al., 2009]           for Research, Technology and Development is gratefully ac-
     showed that at the maximum 6 interactions are neces-            knowledged.
     sary to reveal all previously detected faults. For the au-
     tonomous systems domain such an analysis is missing.
                                                                     References
     Another way of coming up with a measure representing            [Chetouane et al., 2019] Nour Chetouane, Lorenz Klampfl,
     some sort of guarantees would be to consider the used              and Franz Wotawa. Investigating the effectiveness of
     underlying ontologies themselves. For example, if we               mutation testing tools in the context of deep neural net-
     have a generally agreed ontology of the environment,               works. In In Proceedings of the 15th International Work-
     we are able to judge testing with respect to the use of            Conference on Artificial Neural Networks (IWANN), Gran
     the environmental entities. Ontology coverage may be               Canaria, Spain, 2019. Springer.
     the percentage of concepts used in testing from such an
                                                                     [El-Korany et al., 2000] Abeer El-Korany, Ahmed Rafea,
     ontology. Again research is needed for (1) coming up
     with such an ontology for the application domain, and              Hoda Baraka, and Saad Eid. A structured testing method-
     (2) to define ontology coverage formally.                          ology for knowledge-based systems. In 11th International
                                                                        Conference on Database and Expert Systems Applications
                                                                        (DEXA), pages 427–436. Springer, 2000.
Estimate the residual risk: The residual risk in case of test-       [Euro and Protocol, 2017] NCAP Euro and AEB VRU Test
    ing corresponds to the risk that the SUT after testing              Protocol. Test protocol - aeb vru test, 2017.
    will fail during operation causing serious harm. Hence,
    the remaining risk after applying verification and vali-         [Euro, 2017] NCAP Euro. Test protocol - aeb systems. Brus-
    dation, is proportional to the risk of missing important            sels, Belgium: Eur. New Car Assess. Programme (Euro
    test cases, i.e., those leading to critical situations. When        NCAP), 2017.
    considering parameters like the combinatorial strength           [Felfernig et al., 2005] A. Felfernig, K. Isak, and T. Kruggel.
    or ontology coverage as discussed before, the residual              Testing knowledge-based recommender systems. OEGAI
    risk should be proportional to these parameters. As far             Journal, 4:12–18, 2005.
    as we know there has been no research trying to estimate
                                                                     [Gauerhof et al., 2018] Lydia Gauerhof, Peter Munk, and Si-
    the residual risk based on metrics used to specify some
    sort of guarantees.                                                 mon Burton. Structuring validation targets of a machine
                                                                        learning function applied to automated driving. In Bar-
     In order to be of use in practice, we have to search for a         bara Gallina, Amund Skavhaug, and Friedemann Bitsch,
     method that allows to estimate the residual risk of testing        editors, Computer Safety, Reliability, and Security, pages
     automated and autonomous systems based on certain pa-              45–58, Cham, 2018. Springer International Publishing.
     rameters like coverage, mutation score, or combinatorial        [Geyer et al., 2014] S. Geyer, M. Baltzer, B. Franz,
     strength.                                                          S. Hakuli, M. Kauer, M. Kienle, S. Meier, T. Weissgerber,
  K. Bengler, R. Bruder, F. Flemisch, and H. Winner.             [Ma et al., 2018a] Lei Ma, Felix Juefei-Xu, Fuyuan Zhang,
  Concept and development of a unified ontology for                 Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting
  generating test and use-case catalogues for assisted and          Su, Li Li, Yang Liu, et al. Deepgauge: Multi-granularity
  automated vehicle guidance. IET Intelligent Transport             testing criteria for deep learning systems. In Proceedings
  Systems, 8(3):183–189, May 2014.                                  of the 33rd ACM/IEEE International Conference on Auto-
[Goodfellow et al., 2018] Ian Goodfellow, Patrick Mc-               mated Software Engineering, pages 120–131. ACM, 2018.
  Daniel, and Nicolas Papernot. Making machine learning          [Ma et al., 2018b] Lei Ma, Fuyuan Zhang, Jiyuan Sun, Min-
  robust against adversarial inputs.         Commun. ACM,           hui Xue, Bo Li, Felix Juefei-Xu, Chao Xie, Li Li, Yang
  61(7):56–66, June 2018.                                           Liu, Jianjun Zhao, et al. Deepmutation: Mutation testing
[Hartung and Håkansson, 2007] Ronald Hartung and Anne              of deep learning systems. In 2018 IEEE 29th International
  Håkansson. Automated testing for knowledge based sys-            Symposium on Software Reliability Engineering (ISSRE),
  tems. In Bruno Apolloni, RobertJ. Howlett, and Lakhmi             pages 100–111. IEEE, 2018.
  Jain, editors, Knowledge-Based Intelligent Information         [Ma et al., 2018c] Lei Ma, Fuyuan Zhang, Minhui Xue,
  and Engineering Systems, volume 4692 of Lecture Notes             Bo Li, Yang Liu, Jianjun Zhao, and Yadong Wang. Com-
  in Computer Science, pages 270–278. Springer Berlin Hei-          binatorial testing for deep learning systems. arXiv preprint
  delberg, 2007.                                                    arXiv:1806.07723, 2018.
[Hayes and Parzen, 1997] Caroline C. Hayes and Michael I.        [Menzel et al., 2018] Till Menzel, Gerrit Bagschik, and
  Parzen. Quem: An achievement test for knowledge-based             Markus Maurer. Scenarios for development, test and vali-
  systems. IEEE Transactions on Knowledge and Data En-              dation of automated vehicles. In arXiv:1801.08598, 2018.
  gineering, 9(6):838–847, November/December 1997.                  appeared in Proc. of the IEEE Intelligent Vehicles Sympo-
[Henriksson et al., 2018] Jens Henriksson, Markus Borg,             sium.
  and Cristofer Englund. Automotive safety and ma-               [Murrell and Plant, 1997] Stephen Murrell and Robert T.
  chine learning: initial results from a study on how to            Plant. A survey of tools for the validation and verifica-
  adapt the iso 26262 safety standard. In Proceedings of            tion of knowledge-based systems: 1985-1995. Decision
  the 1st International Workshop on Software Engineer-              Support Systems, 21(4):307–323, 1997.
  ing for AI in Autonomous Systems, May 2018. DOI:
                                                                 [Pei et al., 2017] Kexin Pei, Yinzhi Cao, Junfeng Yang, and
  10.1145/3194085.3194090.
                                                                    Suman Jana. Deepxplore: Automated whitebox testing of
[Kalra and Paddock, 2016] Nidhi Kalra and Susan M. Pad-             deep learning systems. In proceedings of the 26th Sympo-
  dock. Driving to safety: How many miles of driving                sium on Operating Systems Principles, pages 1–18. ACM,
  would it take to demonstrate autonomous vehicle reliabil-         2017.
  ity? Transportation Research Part A: Policy and Practice,
  94:182 – 193, 2016.                                            [Plant, 1991] Robert Plant. Rigorous approach to the devel-
                                                                    opment of knowledge-based systems. Knowl.-Based Syst.,
[Klück et al., 2018] Florian Klück, Yihao Li, Mihai Nica,         4(4):186–196, 1991.
  Jianbo Tao, and Franz Wotawa. Using ontologies for test
  suites generation for automated and autonomous driving         [Plant, 1992] Robert T. Plant. Expert system development
  functions. In In Proc. of the 29th IEEE International Sym-        and testing: A knowledge engineer’s perspective. Journal
  posium on Software Reliability Engineering (ISSRE2018)            of Systems and Software, 19(2):141–146, 1992.
  - Industrial Track, 2018.                                      [Schieferdecker, 2012] Ina Schieferdecker. Model-based
[Koopman and Wagner, 2016] Philip Koopman and Michael               testing. IEEE Software, 29(1):14–18, Jan/Feb 2012.
  Wagner. Challenges in autonomous vehicle testing and           [Schuldt et al., 2018] Fabian Schuldt, Andreas Reschka, and
  validation. SAE Int. J. Trans. Safety, 4:15–24, 04 2016.          Markus Maurer. A method for an efficient, systematic test
[Koroglu and Wotawa, 2019] Yavuz Koroglu and Franz                  case generation for advanced driver assistance systems in
  Wotawa. Fully automated compiler testing of a reasoning           virtual environments. In Hermann Winner, Gunter Prokop,
  engine via mutated grammar fuzzing. In In Proc. of the            and Markus Maurer, editors, Automotive Systems Engi-
  14th IEEE/ACM International Workshop on Automation                neering II. Springer International Publishing AG, 2018.
  of Software Test (AST), Montreal, Canada, 27th May 2019.       [Seshia and Sadigh, 2016] Sanjit A. Seshia and Dorsa
[Kuhn et al., 2009] D.R. Kuhn, R.N. Kacker, Y. Lei, and             Sadigh. Towards verified artificial intelligence. CoRR,
  J. Hunter. Combinatorial software testing. Computer,              abs/1606.08514, 2016.
  pages 94–96, August 2009.                                      [Sotiropoulos et al., 2017] T. Sotiropoulos, H. Waeselynck,
[Kuhn et al., 2012] D. R. Kuhn, R. N. Kacker, and Y. Lei.           J. Guiochet, and F. Ingrand. Can robot navigation bugs be
  Combinatorial testing. In Phillip A. Laplante, editor, Ency-      found in simulation? an exploratory study. In 2017 IEEE
  clopedia of Software Engineering. Taylor & Francis, 2012.         International Conference on Software Quality, Reliability
[Kuhn et al., 2015] D. Richard Kuhn, Renee Bryce, Feng              and Security (QRS), pages 150–159, July 2017.
  Duan, Laleh Sh. Ghandehari, Yu Lei, and Raghu N.               [Su et al., 2019] J. Su, D. V. Vargas, and K. Sakurai. One
  Kacker. Combinatorial testing: Theory and practice. In            pixel attack for fooling deep neural networks. IEEE Trans-
  Advances in Computers, volume 99, pages 1–66. 2015.               actions on Evolutionary Computation, pages 1–1, 2019.
[Sun et al., 2018] Youcheng Sun, Xiaowei Huang, and               [Xiong et al., 2013] Zhitao Xiong, Hamish Jamson, An-
   Daniel Kroening. Testing deep neural networks. arXiv             thony G. Cohn, and Oliver Carsten. Ontology for Scenario
   preprint arXiv:1803.04792, 2018.                                 Orchestration (OSO): A Standardised Scenario Descrip-
[Tao et al., 2019] Jianbo Tao, Yihao Li, Franz Wotawa, Her-         tion in Driving Simulation. ASCE Library, 2013.
   mann Felbinger, and Mihai Nica. On the industrial ap-
   plication of combinatorial testing for autonomous driving
   functions. In Proceedings of the International Workshop
   on Combinatorial Testing (IWCT). IEEE, 2019.
[Tiihonen et al., 2002] Juha Tiihonen, Timo Soininen, Ilkka
   Niemelä, and Reijo Sulonen. Empirical testing of a weight
   constraint rule based configurator. In Proceedings of the
   ECAI 2002 Configuration Workshop, pages 17–22, 2002.
[Wicker et al., 2018] Matthew Wicker, Xiaowei Huang, and
   Marta Kwiatkowska. Feature-guided black-box safety
   testing of deep neural networks. In Dirk Beyer and
   Marieke Huisman, editors, Tools and Algorithms for the
   Construction and Analysis of Systems - 24th International
   Conference, TACAS 2018, Held as Part of the European
   Joint Conferences on Theory and Practice of Software,
   ETAPS 2018, Thessaloniki, Greece, April 14-20, 2018,
   Proceedings, Part I, volume 10805 of Lecture Notes in
   Computer Science, pages 408–426. Springer, 2018.
[Wotawa and Li, 2018] Franz Wotawa and Yihao Li. From
   ontologies to input models for combinatorial testing. In In-
   maculada Medina-Bulo, Mercedes G. Merayo, and Robert
   Hierons, editors, Testing Software and Systems, pages
   155–170. Springer International Publishing, 2018.
[Wotawa and Pill, 2014] Franz Wotawa and Ingo Pill. Test-
   ing configuration knowledge-bases. In Proceedings of
   the 16th Workshop on Configuration, Novi Sad, Serbia,
   September 2014.
[Wotawa et al., 2018] Franz Wotawa, Bernhard Peischl, Flo-
   rian Klück, and Mihai Nica. Quality assurance methodolo-
   gies for automated driving. Elektrotechnik & Information-
   stechnik, 135(4–5), 2018. https://doi.org/10.1007/s00502-
   018-0630-7.
[Wotawa, 2016a] Franz Wotawa. Testing autonomous and
   highly configurable systems: Challenges and feasible so-
   lutions. In D. Watzenig and M. Horn, editors, Automated
   Driving. Springer International Publishing, 2016. DOI
   10.1007/978-3-319-31895-0 22.
[Wotawa, 2016b] Franz Wotawa. Testing self-adaptive sys-
   tems using fault injection and combinatorial testing. In
   Proceedings of the Intl. Workshop on Verification and Val-
   idation of Adaptive Systems (VVASS 2016), Vienna, Aus-
   tria, 2016.
[Wotawa, 2018a] Franz Wotawa. Combining combinatorial
   testing and metamorphic testing for testing a logic-based
   non-monotonic reasoning system. In In Proceedings of
   the 7th International Workshop on Combinatorial Testing
   (IWCT) / ICST 2018, April 13th 2018.
[Wotawa, 2018b] Franz Wotawa. On the automation of test-
   ing a logic-based diagnosis system. In In Proceedings of
   the 13th International Workshop on Testing: Academia-
   Industry Collaboration, Practice and Research Techniques
   (TAIC PART) / ICST 2018, April 9th 2018.