
Enhancing Autonomous Vehicle Safety through N-version Machine Learning Systems

Qiang Wen

Júlio Mendonça

Fumio Machida

Marcus Völp

Department of Computer Science, University of Tsukuba, 305-8573, Japan
Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg, L-1855, Luxembourg

Unreliable outputs of machine learning (ML) models are a significant concern, particularly for safety-critical applications such as autonomous driving. ML models are susceptible to out-of-distribution samples, distribution shifts, hardware transient faults, and even malicious attacks. To address these concerns, the N-version ML system offers a general solution that enhances the reliability of ML system outputs by diversifying ML models and their inputs. However, existing studies of N-version ML systems have mainly focused on classification errors and have not considered their impact in a practical application scenario. In this paper, we investigate the applicability of the N-version ML approach in an autonomous vehicle (AV) scenario within the AV simulator CARLA. We deploy two-version and three-version perception systems in an AV implemented in CARLA, using healthy ML models and compromised ML models generated with fault-injection techniques, and analyze the behavior of the AV in the simulator. Our findings reveal the critical impact of compromised models on AV collision rates and show the potential of three-version perception systems in mitigating the risk. Our three-version perception system improves driving safety by tolerating one compromised model and by delaying collisions when at least one healthy model remains.

Keywords: autonomous driving, fault injection, machine learning system, N-version programming, perception

The IJCAI-24 Workshop on Artificial Intelligence Safety (AISafety 2024), August 04, 2024, Jeju, South Korea
*Corresponding author.
wen.qiang@sd.cs.tsukuba.ac.jp (Q. Wen); julio.mendonca@uni.lu (J. Mendonça); machida@cs.tsukuba.ac.jp (F. Machida); marcus.voelp@uni.lu (M. Völp)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.

1. Introduction

Rapid machine learning (ML) advancements have led to widespread applications across various domains. ML-based intelligent software systems, including face recognition, medical diagnosis, and autonomous robots, have become integral parts of our daily lives [1, 2]. However, ML models cannot guarantee a correct output in the application context due to ML models’ uncertainties in dealing with real samples [3]. Additionally, transient faults (e.g., leading to bit-flip errors [4]) and malicious attacks such as adversarial attacks [5] may affect the system’s capability to provide correct outputs, especially when a single ML model is in the software stack [6, 7]. When ML-based applications are incorporated into safety-critical systems, incorrect outputs can cause undesirable consequences. For example, the misrecognition of traffic signs by ML-based classifiers could result in accidents in autonomous driving scenarios. This example shows that ensuring the correctness of ML-based system outputs has become a critical concern, especially for systems in safety-critical domains.

Various approaches have been proposed to enhance the robustness of ML systems. ML testing is one of these approaches that focuses on detecting differences between existing and required behaviors of machine learning systems [8]. However, the existing works mainly focus on offline testing rather than runtime monitoring. To improve correctness during runtime, additional safety mechanisms such as data validation [9], safety monitors [10], and redundant architecture [11, 12] must be deployed. Current ML data validation techniques pose operational challenges, including an abundance of false positive warnings and the necessity for manual adjustments. Similarly, safety monitors, while crucial, lack adaptability due to their simultaneous training with the ML model. Model enhancement and specialization led to the generation of large Deep Neural Networks (DNNs), capable of modeling more complex patterns and data relationships, which, consequently, present improved results as output [13]. However, large DNNs require more computational resources to be executed, which specific systems, such as autonomous vehicles (AVs), may not afford as it would incur extra resource costs (e.g., energy). Although feasible and sometimes suitable, adopting a large DNN in limited-resource systems would incur the use of a single DNN, offering the system a single point of failure, which could cause malfunction of the entire system in the case of hardware or software failures or malicious attacks.

In contrast, adopting redundant architectures offers a more straightforward approach by utilizing diverse ML models and data inputs. Using multiple and diverse ML models, an ML-based system can avoid a single point of failure since replicated models can execute the same tasks, masking failures or misclassifications. Also, adopting model diversity can help the system mitigate problems such as overfitting and adversarial attacks, as different models could have a distinct structure and training data. Leveraging the idea of a traditional software fault tolerant technique, N-version programming (NVP) [14], the N-version ML system approach uses replication and diversification to improve the output reliability of ML systems [12]. By integrating multiple, independently functioning ML models, the N-version system is designed to maintain operation and accurate decision-making even when one or more components are compromised or faulty. The multiple versions of ML models and input data sources are used to generate multiple inference results, which may differ from each other. These results are subsequently analyzed using decision logic (e.g., a voter employing a majority voting rule [15]) or a protocol to agree on a single value (e.g., consensus protocols [16]) to determine the final output. This approach enables the system to detect and mitigate incorrect outputs arising from individual ML models. More recent studies have analyzed the adoption of N-version ML systems and presented their benefits for output reliability [11, 17, 15, 18]. However, none of these works have examined the safety impact in a practical application scenario.

Therefore, this paper leverages N-version ML system architectures for the perception module of AVs, aiming to investigate the impact of such architectures on the safety of autonomous driving scenarios using the CARLA simulator [19]. Specifically, we consider two-version and three-version perception systems, each comprising two or three independent ML modules, respectively, for object detection tasks in AVs. We incorporate multiple versions of ML models within the systems by deploying different versions of the YOLOv5 model. In addition, to simulate failures and errors, caused by transient faults or malicious attacks, we create compromised ML models using the fault-injection tool PyTorchFI [20]. The tool intentionally changes ML model parameters, which can introduce errors into the ML models, representing situations where ML systems may be affected by different types of faults (e.g., radiation, induced memory corruption). Then, we combine healthy and compromised ML models, following an N-version system architecture, and deploy it into an AV running on the CARLA simulator. The results show that single compromised models can significantly impact the AV collision rate in up to 90% of the analyzed scenarios. We also find that the three-version system has the potential to tolerate one compromised model efficiently and delay collisions caused by incorrect object detection when having at least one healthy model. We make the following contributions in this paper:

• We propose the application of N-version ML systems to the perception module of an AV to enhance autonomous driving safety.
• We conduct fault injection experiments to reveal the impact of compromised ML models in the perception module on the safety of AVs simulated in CARLA.
• Through the experiment, we demonstrate the enhanced driving safety achieved by a three-version perception system that can mitigate incorrect outputs from compromised ML models and delay possible AV collisions.

The remainder of the paper is organized as follows. Section 2 presents background and related work. Section 3 details the system and fault model adopted in this work. Section 4 clarifies the research questions addressed in the following experiment and describes the experiment settings. Section 5 discusses the achieved results, focusing on answering the defined research questions. Finally, Section 6 concludes the paper and briefly presents future work.

2. Background and Related Works

2.1. N-version Machine Learning

N-version ML architecture, based on NVP, comprises N (≥ 2) diverse versions of ML components operating in parallel for the same task [12]. The ML components generate multiple inference results individually, and the final output can be determined using a voting mechanism. Unlike ensemble learning [21], which aims to build a better model by combining weak learners, the N-version ML architecture is configured with pre-trained black-box models and designed for ML system operation. Recent studies have investigated N-version ML approaches to improve system reliability. Xu et al. [22] proposed the NV-DNN, a framework aimed at enhancing the fault tolerance of deep learning systems comprising N independently developed models and decision-making procedures. NV-DNN assumes processing a single input at a time, whereas N-version ML can also consider different inputs to exploit input diversity. Furthermore, diversifying input data can contribute to improving the reliability of N-version ML systems, as demonstrated in works from Machida and Wen [11, 15, 23]. Hong et al. [24] proposed a multimodal deep-learning approach to improve the classification accuracy of remote-sensing imagery, outperforming single-model or single-modality approaches. Mendonça et al. [18] investigated the improvement of output reliability in perception systems through a modeling approach when integrating N-version programming with rejuvenation techniques. Nevertheless, none of the existing studies have shown the effectiveness of the N-version ML approach in AV safety against the risk of faulty ML models.

2.2. Fault-injection for ML Models

Fault injection is a testing technique used to analyze systems under the presence of faults [25]. This method entails intentionally introducing faulty behavior into a system to examine its function under abnormal conditions. The objective is to evaluate whether the system can tolerate faults and continue to operate correctly or will misbehave. Recent studies have investigated fault injection techniques in deep neural networks (DNNs). For example, single bias attack and gradient descent attack are two types of fault injection attacks proposed by Liu et al. [26] to misclassify a specified input pattern into an adversarial class by modifying the parameters used in DNNs. Tools such as PyTorchFI [20] have been proposed for disturbing DNNs on the PyTorch platform, allowing users to induce perturbations in the weights or neurons of DNNs at runtime. Piazzesi et al. [27] used fault-injection tools to evaluate autonomous agents under the presence of artificial faults and attacks. In this study, we leverage a fault injection tool to evaluate an N-version perception system for AV.

2.3. Autonomous Driving Simulation

CARLA [19] is a well-known and adopted open-source simulator designed for autonomous driving research. The simulation platform supports flexible specification of sensor suites and environmental conditions. CARLA has been extensively used to assess various aspects of autonomous driving. For instance, simulations enable the verification of whether a driving system, trained using data from a simulator, can be effectively deployed on a real car [28]. Besides, works developed by Gao et al. [29] and Piazzesi et al. [27] leverage CARLA to develop and evaluate object detection algorithms tailored for autonomous driving applications. By utilizing the simulation environment, the detection models can be tested under various conditions. In this work, we shall focus on object detection tasks within the perception system, particularly in analyzing N-version architectures for perception systems running in the CARLA simulator.

3. Fault and System Model

We focus on an ML-based perception module running in an AV. A perception module is an essential component of AVs. The perception module leverages inputs from advanced sensors present in an AV. It serves as the vehicle’s sensory hub, collecting and processing vast amounts of data to create a detailed understanding of its surroundings. It can integrate inputs from cameras, LiDAR, radar, and ultrasonic sensors, each contributing unique capabilities for detecting and classifying objects such as other vehicles, pedestrians, and road signs, as well as identifying lane markings and traffic signals [30]. The comprehensive sensory data is then forwarded to the planning and prediction modules to form a dynamic 3D map of the environment, enabling the AV to navigate safely and efficiently.

Perception modules heavily rely on ML models to detect obstacles, pedestrians, traffic signs, signals, lanes, and other vehicles from the input captured by cameras and other sensors. Therefore, a failure or simple misclassification of objects in the environment may impact the safe driving behavior of the AV, which may lead to dangerous traffic situations and cause accidents.

Next, we detail the fault model adopted in this work and then present an N-version perception system for AVs, which aims to mitigate the impact of faulty and compromised ML models to enhance AV safety driving.

3.1. Fault Model

The fault model focuses on transient faults and malicious attacks related to ML models’ output correctness. Thus, we assume sensors produce correct data and they, as well as other components outside the perception system, are not subject to failures or attacks. On the other hand, we consider vulnerabilities in deep learning frameworks (e.g., PyTorch, TensorFlow, or Caffe) could allow attackers to (1) launch denial-of-service attacks, (2) crash deep learning applications due to memory exhaustion, (3) generate wrong classification outputs by corrupting the classifier’s memory, or (4) hijack the control flow to remote control the deep learning application hosting system [31]. The latest CVE reports on TensorFlow (CVE-2023-27506, CVE-2023-25668), PyTorch (CVE-2022-45907), and Caffe (CVE-2021-39158) confirm the presence of such vulnerabilities. Besides, ML models shall be subject to transient faults such as radiation, which are capable of causing bit-flips.

In this way, we assume an ML model can have three possible states: healthy (H), compromised but operational (C), or non-operational (N). When in a healthy state (H), the ML model performs normally, but it intrinsically includes producing incorrect outputs according to its accuracy. When faults or malicious attacks (e.g., radiation, induced memory corruption) affect the ML model, it may cause errors, which could lead to a subsequent failure. When faults or attacks cause errors in the ML model, it reaches a compromised but functional state (C). Compromised ML models can still perform object detection tasks but have a reduced probability of producing correct perception outputs. However, when the errors lead to failures, the ML model completely stops, entering a non-operational state (N), incapable of executing perception tasks.

In this work, we shall focus on how vulnerability exploitation and fault effects could be mitigated by using N-version ML models. In this way, we assume that errors or failures could harm all (or none) N ML models at the same time. This allows us to generalize N-version architecture to consider situations where ML models are executed isolated (e.g., in different cores) and are not subject to the same failures and situations where ML models are affected equally by a single failure. In practice, we demonstrate how artificially injected faults affect distinct ML models’ output differently by generating different compromised ML models, as well as the overall effect when the system only has different compromised ML models executing.

3.2. An N-version Perception System

To mitigate the impact of failed and compromised ML models on AV safety, we present an N-version perception system for enhancing perception outputs. Figure 1 shows the architecture of an N-version perception system. We assume a perception system of an AV composed of N ML models capable of executing object detection tasks, which aim to avoid AV collisions. We shall focus in this work on situations where data input variation is not employed. It means that all N ML models should receive the same data input from the AV sensors (e.g., cameras) to perform object detection. Note that in some systems, it is also possible that different sensor data could be combined through a sensor fusion component before being forwarded to the ML models [32]. After executing the object detection task, each model shall forward its output to a voter, which decides the final perception output based on a pre-defined voting rule. In the adopted system, we consider the voter implementing a majority-based voting rule for simplicity, while other rules can be implemented later. Besides, we assume the voter is implemented in a trustworthy component and shall not be susceptible to malicious attacks or faults. Such mechanism implementation has been demonstrated in practice by previous works, such as Gouveia et al. [33].
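As a rough illustration of the voter described above, the following minimal sketch implements a majority-style rule in Python. It assumes, for simplicity, that each model emits a single (label, bounding box) detection per frame, that a non-operational model is represented by `None`, and that two outputs agree when their labels match and their Intersection over Union (IoU) exceeds a threshold (0.8, the value used in the experiments of Section 4.2). The function names, the data layout, and the choice to return no output when fewer than two models are operational are assumptions of this sketch, not the implementation used in the paper.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2), hypothetical layout
Detection = Tuple[str, Box]               # (label, bounding box)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def agree(d1: Detection, d2: Detection, threshold: float = 0.8) -> bool:
    """Two outputs agree when labels are identical and IoU exceeds the threshold."""
    return d1[0] == d2[0] and iou(d1[1], d2[1]) > threshold


def vote(outputs: Sequence[Optional[Detection]],
         threshold: float = 0.8) -> Optional[Detection]:
    """Return an output supported by at least two operational models, else None.

    With three operational models this is a 2-out-of-3 rule; with two, both
    models must agree. Returning None for fewer than two operational models
    is an assumption of this sketch.
    """
    operational = [d for d in outputs if d is not None]
    if len(operational) < 2:
        return None
    for i, candidate in enumerate(operational):
        supporters = 1 + sum(
            1 for j, other in enumerate(operational)
            if j != i and agree(candidate, other, threshold)
        )
        if supporters >= 2:
            return candidate
    return None
```

For example, `vote([("car", (0, 0, 10, 10)), ("car", (0, 0, 10, 9.5)), ("truck", (50, 50, 60, 60))])` returns the first detection, because the first two models agree and outvote the third.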

Assuming the mentioned architecture, an N-version ML perception system can be represented by a set of reachable states, where each state is a tuple (h, c, n) and h, c, and n represent the numbers of ML models in the healthy, compromised, and non-operational state, respectively. Additionally, we assume the voter can automatically detect when an ML model is in a non-operational state (N). Usually, failure detection tools can be easily adopted to verify whether a component is operational. This would be necessary to prevent the voter from waiting indefinitely for the output of non-operational models and for it to be able to reconfigure itself with different pre-determined …

Figure 1: Architecture of an N-version perception system. Sensors (camera, GPS) provide the input data to the ML models, whose outputs are passed to the voter, which produces the perception output.
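To make the state representation concrete, the short sketch below enumerates every tuple (h, c, n) with h + c + n = N. The function name `reachable_states` is hypothetical, and treating all such tuples as reachable (including the one where every model is non-operational) is an assumption of this sketch.

```python
from typing import List, Tuple


def reachable_states(n_versions: int) -> List[Tuple[int, int, int]]:
    """Enumerate states (h, c, n): healthy, compromised, non-operational counts.

    Every state satisfies h + c + n == n_versions.
    """
    states = []
    for h in range(n_versions, -1, -1):
        for c in range(n_versions - h, -1, -1):
            states.append((h, c, n_versions - h - c))
    return states


if __name__ == "__main__":
    for state in reachable_states(3):
        print(state)
```

For N = 3 this yields ten states, starting with (3, 0, 0) and (2, 1, 0), which matches the notation used for the system configurations in the experiments.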

4. Experiments

4.1. Objective

The objective of this study is to investigate the applicability of an N-version perception system architecture for AV safety. We aim to answer the following research questions throughout experiments using CARLA.

RQ1: How does a compromised ML model of a perception system impact AV driving safety?

RQ2: How efficiently can an N-version perception system (N=2,3) tolerate compromised and non-operational models?

To address RQ1, we set up a simulation environment deploying different compromised ML models into the AV perception system to evaluate its driving behavior. We simulate compromised ML models using PyTorchFI to generate artificial faults in the ML models. We also compare the driving behavior of the compromised ML models against the healthy ML models. To answer RQ2, we implement two-version and three-version perception systems, incorporating various combinations of healthy and compromised models. Then, we investigate the driving behavior across all the possible configurations, including entirely healthy, mixed (healthy and compromised), and entirely compromised models. This analysis allows us to evaluate how the N-version ML approach influences the driving behavior of the AV’s perception system under various conditions.
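The idea behind weight-level fault injection can be sketched in plain Python. The snippet below is not PyTorchFI's API: it models a network as a hypothetical dictionary mapping layer names to flat lists of weights, picks one random weight in one random layer, and overwrites it with a value drawn uniformly from a range (the default range of -100 to 300 is the one reported in the testbed setup of Section 4.2). All names here are illustrative assumptions.

```python
import random
from typing import Dict, List, Optional, Tuple


def inject_random_weight_fault(
    layers: Dict[str, List[float]],
    min_val: float = -100.0,
    max_val: float = 300.0,
    rng: Optional[random.Random] = None,
) -> Tuple[Dict[str, List[float]], str, int]:
    """Return a perturbed copy of `layers` plus the location of the fault.

    `layers` is a hypothetical stand-in for a model's parameters; the input
    is left unmodified so the healthy model stays available.
    """
    rng = rng or random.Random()
    faulty = {name: list(weights) for name, weights in layers.items()}
    layer = rng.choice(sorted(faulty))            # randomly selected layer
    index = rng.randrange(len(faulty[layer]))     # randomly selected weight
    faulty[layer][index] = rng.uniform(min_val, max_val)
    return faulty, layer, index
```

Keeping the healthy parameters untouched and returning a separate compromised copy mirrors the experimental setup, in which healthy and fault-injected variants of the same model are deployed side by side.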

4.2. Testbed Setup

We utilize the CARLA AV simulator and a cooperative driving co-simulation framework OpenCDA [34] to simulate a single-lane driving scenario. During the simulation process in OpenCDA, sensors installed on each AV collect the surrounding environment as well as the ego vehicle information (e.g., 3D LiDAR points and Global Navigation Satellite System (GNSS) data). The collected sensors’ data are used by the perception and localization systems for object detection and localization. Subsequently, the perception output, including object 3D pose and ego position, is delivered to the downstream planning system to generate the AV trajectory and, consequently, update the AV’s acceleration, speed, and wheel turning. Finally, the planned trajectory and commands are passed to the control system, which generates the final control commands. In this paper, we choose Town03 in CARLA as the map shown in Figure 2. The yellow oval marks the starting point, and the yellow star marks the endpoint of the simulation run that an AV must execute. Towards this path, the AV relies on its perception system to accurately detect other vehicles and road obstacles.

Figure 2: Adopted scenario in Town03 of the CARLA simulator.

Next, we define two-version and three-version systems using different object detection models in the architecture. We employ unmodified versions of the YOLOv5 model [35], including YOLOv5s6, YOLOv5m6, and YOLOv5l6 as healthy models to deploy them into the AV perception system. Then, we generate compromised versions of these models using PyTorchFI [20]. More specifically, we adopt PyTorchFI’s runtime perturbation feature for weights and neurons in DNNs. Those functionalities are crucial for simulating real-world scenarios where models may encounter unexpected disruptions. Thus, we employ PyTorchFI’s random_weight_inj function with a weight range of (-100, 300) to mimic the conditions compromised models may encounter. The injection function randomly alters parameters within a randomly selected layer of the neural network, thereby introducing variability into the YOLO image detection algorithm. The degree to which the ego vehicle’s perception is impacted (i.e., whether it causes an error) depends on the sensitivity of the model layer to the randomly injected weight. After injecting artificial faults in the healthy models, we generate new models, renaming them to YOLOv5s6_FI, YOLOv5m6_FI, and YOLOv5l6_FI. Each one of these models shall represent the ML models when in a compromised state (C). Note that the weight perturbations injected on these compromised models affect all input data (e.g., image frames) during the entire period.

In our three-version perception system, we employ a majority voting rule. When three models are operational (i.e., in the states H or C), the voter provides a perception output when 2-out-of-3 models agree on the same output. The agreement is defined based on the criterion that the bounding boxes (bboxes) have an Intersection over Union (IoU) exceeding 0.8, and the labels are identical. When the majority cannot be reached, the voter provides no perception output. Consequently, the AV does not update its driving properties (e.g., speed and acceleration). When one model stops completely (i.e., entering a non-operational (N) state), the system degrades to a two-version system. In those cases, the voting rule implemented in the voter is that the two models must agree on the same output. Otherwise, it should not provide any perception output. Therefore, the AV can only update its trajectory and driving properties if the voter receives equal output from the two models.

Evaluation Metric: We measure the collision rate of the AV as the number of collision frames over the total number of frames in a run. We also give the first collision frame number and total frame number as evaluation metrics. The metrics measure the driving behavior of the AV under different system configurations that a three-version system could assume. We conduct ten runs for each system configuration and report the average of the metrics.

5. Results

In this section, we discuss the experiment results to answer research questions related to the N-version perception systems. We focus on the impact of compromised models and the effectiveness of an N-version perception system to answer the defined research questions.

5.1. Evaluation of Compromised ML Models

First, we investigate the effects of compromised models on object detection with an AV adopting a single-version perception system. We compare the object detection results between ML models in the healthy (i.e., in state H) and compromised (i.e., in state C) states. Recall that an ML model in a non-operational state (i.e., state N) cannot produce any output. Thus, the AV cannot afford a single-version architecture with only one model in this state. Figure 3 illustrates an object detection case of (a) a healthy YOLOv5m model, which accurately detects the vehicle in front, whereas (b) its fault-injected variant, YOLOv5m_FI, produces numerous erroneous bounding boxes, which are more likely to lead the AV to potential collisions.
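The evaluation metrics defined in Section 4.2 could be computed from per-frame collision flags roughly as follows. This is a minimal sketch under stated assumptions: each run is a list of booleans (one per frame, `True` when the AV is in collision), frame numbers are 0-based, and the average first collision frame is taken only over runs that actually collided. The paper does not specify these details, and the function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional, Sequence


def run_metrics(collision_flags: Sequence[bool]) -> Dict[str, Optional[float]]:
    """Metrics for one run: first collision frame, total frames, collision rate."""
    total = len(collision_flags)
    collision_frames = sum(1 for flag in collision_flags if flag)
    first = next((i for i, flag in enumerate(collision_flags) if flag), None)
    return {
        "first_collision_frame": first,          # None if the run had no collision
        "total_frames": total,
        "collision_rate": collision_frames / total if total else 0.0,
    }


def average_metrics(runs: List[Sequence[bool]]) -> Dict[str, Optional[float]]:
    """Average the per-run metrics over all runs of one system configuration."""
    per_run = [run_metrics(run) for run in runs]
    firsts = [m["first_collision_frame"] for m in per_run
              if m["first_collision_frame"] is not None]
    return {
        "avg_first_collision_frame": mean(firsts) if firsts else None,
        "avg_total_frames": mean(m["total_frames"] for m in per_run),
        "avg_collision_rate": mean(m["collision_rate"] for m in per_run),
        "fraction_of_runs_with_collision": len(firsts) / len(per_run),
    }
```

Reporting the fraction of runs with a collision alongside the average collision rate keeps the two notions separate: the former counts runs, the latter counts frames within a run.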

Figure 3: Example of object detection using a healthy and compromised model during AV driving. (a) Using a healthy YOLOv5m model. (b) Using a compromised YOLOv5m_FI model.

Table 1 presents the average values for three metrics across ten runs, including the first collision frame, total number of frames, and collision rate. System state represents (h, c, n), where h, c, and n represent the numbers of ML models in the healthy, compromised, and non-operational state, respectively. The upper part displays the results for single-version perception systems. Notably, the healthy models consistently exhibited a 0% average collision rate, while the compromised models showed significantly higher average collision rates (more than 70%). The AV had a collision in 90% of runs when driving with different versions of compromised models. The numbers demonstrated that the AV using compromised models tends to experience collisions from the very onset of the simulation. Figure 4 illustrates the collision and no collision cases in the simulated scenario. In our study, we primarily focus on vehicle-to-vehicle collisions. Even though there is a curve in the lane, other collisions such as vehicle-to-infrastructure collisions will not happen.

Figure 4: (a) No collision case. (b) Collision case.

Answer to RQ1: Compromised models of an AV perception system demonstrate a high average collision rate of more than 70%, adversely affecting AV driving safety.

5.2. Evaluation of N-version Perception Systems

Next, we present the results achieved when adopting two- and three-version perception systems. The middle part of Table 1 presents the experimental results for two-version systems. The results indicate that no collisions occurred in all runs when both healthy models were executed (i.e., state (2,0,1)). In contrast, configurations (1,1,1) and (0,2,1) experienced collisions. Notably, the number of collisions in the (1,1,1) configurations was lower than in the (0,1,2) configurations, suggesting that a two-version system with one compromised model can still mitigate some collisions. The average collision rates for the AV under the system state (1,1,1) and (0,2,1) were more than 50%. Additionally, the results demonstrate that the first collision frame, when considering compromised models, was at around frame 64, which is at a very early stage of the simulation. It is an important observation related to the layout of the vehicles during the simulation scenario, in which the ego vehicle starts the simulation in movement and is relatively close to the vehicle in front of it. When two ML models disagree, and the detection of the vehicle in front is abnormal, the ego vehicle tends to have a rear collision with the vehicle in front of it. Thus, when adopting a two-version perception system, the AV would have a short time before entering a critical erroneous state, generating wrong object detection outputs after having at least one ML model compromised.

The bottom part of Table 1 presents the computed results for the three-version perception system. When the system had the majority of models in a healthy state (i.e., (3,0,0) and (2,1,0)), no collisions were observed.

The result indicates that a three-version perception system can effectively tolerate at least one compromised model and mask its failures when adopting the majority voting rule. For the configuration (1,2,0), where a majority of the models were compromised, collisions occurred in most runs. However, there were instances where the system successfully avoided collisions. The average collision rates for this configuration were about 30%, significantly lower than those of single-version compromised models. This observation suggests that even a system with more compromised models has the potential to prevent collisions under certain circumstances. Notably, the average first collision frame in the (1,2,0) configurations was much later compared to single-version compromised models. This is a significant observation as it demonstrates that the system can delay the onset of erroneous outputs. This delay could provide critical additional time for the AV to take evasive action, thereby possibly avoiding a collision. Finally, for the configuration (0,3,0), where all models were compromised, most runs ended in collisions, as expected due to the lack of healthy models to correct the errors. However, two out of ten runs did not result in a collision, suggesting that even with all models compromised, specific conditions within the scenario might prevent failures temporarily.

Answer to RQ2: A three-version perception system can efficiently tolerate one compromised model. Even when the majority of the models are compromised, the system has the potential to prevent some collisions or delay erroneous perception outputs.

5.3. Discussion

The findings from three-version perception systems demonstrate the application of the N-version ML approach in improving the safety of AV. Specifically, the configurations did not result in a collision when most models were in healthy states, showing the three-version perception system’s ability to mitigate disruptions. Although collisions still occur in some configurations with the majority of compromised models, the system shows the potential to delay erroneous perception outputs that could lead to collisions. This capability is crucial in environments where even a minor delay in failure onset can provide essential time for initiating corrective actions, thereby preventing potentially catastrophic outcomes.

Limitations. In this experiment, we did not consider the cost and performance overhead imposed by the redundant modules. The use of multiple ML models in the N-version perception system introduces additional computational overheads and may be costly to implement in a real vehicle. The overhead and cost can be mitigated by adjusting the number of modules activated and/or the frame rates. The perception output can also be enhanced by diversifying the input data without using multiple models [36]. Such a design optimization under resource constraints needs to be investigated further in future work. Besides, our current evaluation is limited to a short run on one specific map. Further experiments with more diverse driving scenarios are needed to make more general conclusions.

6. Conclusions and Future Work

In this study, we explored the practical application of N-version ML system architectures through experiments conducted in the autonomous vehicle simulator CARLA. By deploying two-version and three-version perception systems for object detection tasks, we investigated the effectiveness of incorporating multiple versions of both healthy and compromised ML models within the systems. Our findings demonstrate that compromised models within the perception system significantly impact the AV collision rate, with rates exceeding 70%. In addition, we observed that three-version perception systems have the potential to mitigate object detection misclassifications, tolerating one compromised model and delaying collisions when at least one healthy model remains operational. In future work, we consider evaluating other system architectures and exploring alternative decision-making mechanisms beyond simple majority voting rules that could improve N-version ML systems’ output correctness.

Acknowledgments

This work was supported by JST SPRING Grant Number JPMJSP2124, and partly supported by JSPS KAKENHI Grant Numbers 19K24337 and 22K17871. This work has also been supported by the German Research Council (DFG) and by the Luxembourg Fond Nationale de Recherche (FNR) through the Core Inter Project ByzRT (C19-IS-13691843).

References

[4] M. A. Hanif, F. Khalid, R. V. W. Putra, S. Rehman, M. Shafique, Robust machine learning systems: Reliability and security for deep neural networks, in: International Symposium on On-Line Testing And Robust System Design, 2018.
[5] S. Qiu, Q. Liu, S. Zhou, C. Wu, Review of artificial intelligence adversarial attack and defense technologies, Applied Sciences 9 (2019) 909.
[6] A. Toschi, M. Sanic, J. Leng, Q. Chen, C. Wang, M. Guo, Characterizing perception module performance and robustness in production-scale autonomous driving system, in: IFIP International Conference on Network and Parallel Computing, Springer, 2019, pp. 235–247.
[7] Apollo perception, 2019. URL: http://www.fzb.me/apollo/specs/perception_apollo_5.0.html.
[8] J. M. Zhang, M. Harman, L. Ma, Y. Liu, Machine learning testing: Survey, landscapes and horizons, IEEE Transactions on Software Engineering (2020).
[9] W. Wu, H. Xu, S. Zhong, M. Lyu, I. King, Deep validation: Toward detecting real-world corner cases for deep neural networks, in: Proc. of the 49th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2019, pp. 125–137.
[10] R. S. Ferreira, J. Arlat, J. Guiochet, H. Waselynck, Benchmarking safety monitors for image classifiers with machine learning, in: Proc. of IEEE Pacific Rim International Symposium on Dependable Computing (PRDC), 2021, pp. 7–16.
[11] F. Machida, On the diversity of machine learning models for system reliability, in: IEEE Pacific Rim Int’l Symp. on Dependable Computing (PRDC), 2019, pp. 276–285.
[12] F. Machida, N-version machine learning models for safety critical systems, in: Proc. of the DSN Workshop on Dependable and Secure Machine Learning, 2019, pp. 48–51.
[13] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–444.

References [14] L. Chen, A. Avizienis, N-version programming: A fault-tolerance approach to reliability of software [1] J. A. Sidey-Gibbons, C. J. Sidey-Gibbons, Machine operation, in: Proc. of 8th IEEE Int. Symp. on Faultlearning in medicine: a practical introduction, BMC Tolerant Computing (FTCS-8), 1978, pp. 3–9. medical research methodology 19 (2019) 1–18. [15] Q. Wen, F. Machida, Reliability models and analysis [2] H. J. Vishnukumar, B. Butting, C. Müller, E. Sax, Ma- for triple-model with triple-input machine learning chine learning and deep neural network—artificial systems, in: Proc. of the 5th IEEE Conference on intelligence core for lab and real-world test and Dependable and Secure Computing, 2022, pp. 1–8. validation for adas and autonomous vehicles: Ai [16] R. Olfati-Saber, J. A. Fax, R. M. Murray, Consensus for eficient and quality test and validation, in: In- and cooperation in networked multi-agent systems, telligent systems conference (IntelliSys), 2017, pp. Proceedings of the IEEE 95 (2007) 215–233. 714–721. [17] S. Latifi, B. Zamirai, S. Mahlke, Polygraphmr, en[3] M. Henne, A. Schwaiger, G. Weiss, Managing un- hancing the reliability and dependability of cnns, certainty of ai-based perception for autonomous in: Proc. of 50th IEEE/IFIP International Conference systems, in: AISafety@ IJCAI, 2019, pp. 11–12. on Dependable Systems and Networks (DSN), 2020, pp. 99–112. [18] J. Mendonça, F. Machida, M. Völp, Enhancing the [30] S. D. Pendleton, H. Andersen, X. Du, X. Shen, reliability of perception systems using n-version M. Meghjani, Y. H. Eng, D. Rus, M. H. Ang, Perprogramming and rejuvenation, in: Proc. of the ception, planning, control, and coordination for 53rd Annual IEEE/IFIP International Conference autonomous vehicles, Machines 5 (2017) 6. on Dependable Systems and Networks Workshops [31] Q. Xiao, K. Li, D. Zhang, W. Xu, Security risks in (DSN-W), 2023, pp. 149–156. deep learning implementations, in: IEEE Security [19] A. Dosovitskiy, G. Ros, F. 
Codevilla, A. Lopez, and Privacy Workshops (SPW), 2018, pp. 123–128.

V. Koltun, Carla: An open urban driving simulator, [32] R. Maurice, M. Gerla, Autonomous driving: Sensor in: Proc. of the 1st Annual Conference on Robot fusion for multiple sensor types, in: Proceedings Learning, 2017, pp. 1–16. of the IEEE International Conference on Intelligent [20] A. Mahmoud, N. Aggarwal, A. Nobbe, J. R. S. Vi- Transportation Systems, IEEE, 2012. carte, S. V. Adve, C. W. Fletcher, I. Frosio, S. K. S. [33] I. P. Gouveia, M. Völp, P. Esteves-Verissimo, Behind Hari, Pytorchfi: A runtime perturbation tool for the last line of defense: Surviving soc faults and dnns, in: 2020 50th Annual IEEE/IFIP International intrusions, Computers & Security 123 (2022) 102920. Conference on Dependable Systems and Networks doi:10.1016/j.cose.2022.102920.

Workshops (DSN-W), 2020, pp. 25–31. [34] R. Xu, H. Xiang, X. Han, X. Xia, Z. Meng, C.-J. [21] Z.-H. Zhou, Ensemble methods: foundations and Chen, C. Correa-Jullian, J. Ma, The opencda openalgorithms, CRC press, 2012. source ecosystem for cooperative driving automa[22] H. Xu, Z. Chen, W. Wu, Z. Jin, S. Kuo, M. R. Lyu, tion research, IEEE Transactions on Intelligent VehiNv-dnn: towards fault-tolerant dnn systems with n- cles 8 (2023) 2698–2711. doi:10.1109/TIV.2023. version programming, in: Proc. of the 49th Annual 3244948.

IEEE/IFIP International Conference on Dependable [35] G. Jocher, et al., ultralytics/yolov5: v5.0 - yolov5-p6 Systems and Networks Workshops (DSN-W), 2019, 1280 models, aws, supervise.ly and youtube intepp. 44–47. grations, 2021. URL: https://doi.org/10.5281/zenodo. [23] Q. Wen, F. Machida, Characterizing reliability of 4679653.

three-version trafic sign classifier system through [36] K. Wakigami, F. Machida, T. Phung-Duc, Reliability diversity metrics, in: Proc. of the 34th International and performance evaluation of two-input machine Symposium on Software Reliability Engineering learning systems, in: 2023 IEEE 28th Pacific Rim In(ISSRE), 2023, pp. 333–343. ternational Symposium on Dependable Computing [24] D. Hong, L. Gao, N. Yokoya, J. Yao, J. Chanussot, (PRDC), IEEE, 2023, pp. 278–286.

Q. Du, B. Zhang, More diverse means better: Multimodal deep learning meets remote-sensing imagery classification, IEEE Transactions on Geoscience and

Remote Sensing 59 (2020) 4340–4354. [25] M. C. Hsueh, T. K. Tsai, R. K. Iyer, Fault injection

techniques and tools, Computer 30 (1997) 75–82. [26] Y. Liu, L. Wei, B. Luo, Q. Xu, Fault injection attack on deep neural network, in: Proc. of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2017, pp. 131–138. [27] N. Piazzesi, M. Hong, A. Ceccarelli, Attack and fault injection in self-driving agents on the carla simulator – experience report, in: Computer Safety, Reliability, and Security: 40th International Conference, SAFECOMP 2021, Springer-Verlag, York, UK, 2021, pp. 210–225. [28] B. Osiński, A. Jakubowski, P. Zięcina, P. Miłoś,

C. Galias, S. Homoceanu, H. Michalewski, Simulation-based reinforcement learning for realworld autonomous driving, in: IEEE international conference on robotics and automation (ICRA), 2020, pp. 6411–6418. [29] W. Gao, J. Tang, T. Wang, An object detection research method based on carla simulation, Journal of Physics: Conference Series 1948 (2021) 012163.
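
As a supplementary illustration of the majority voting rule referred to in the three-version results, the following is a minimal sketch and not the implementation used in the experiments. It assumes that each version's per-frame perception output has already been reduced to a single comparable label; the labels "obstacle", "clear" and "unknown" and the function name majority_vote are hypothetical. Voting over full object detection outputs would need an additional matching step that is not shown here.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the output shared by a strict majority of versions, else None."""
    if not outputs:
        return None
    # most_common(1) yields the single most frequent output and its count.
    label, count = Counter(outputs).most_common(1)[0]
    return label if count > len(outputs) / 2 else None


# Hypothetical per-frame outputs of three perception versions (assumed labels).
two_healthy_one_compromised = ["obstacle", "obstacle", "clear"]
one_healthy_two_compromised = ["obstacle", "clear", "clear"]
all_disagree = ["obstacle", "clear", "unknown"]

print(majority_vote(two_healthy_one_compromised))  # obstacle
print(majority_vote(one_healthy_two_compromised))  # clear
print(majority_vote(all_disagree))                 # None
```

In this simplified setting a single deviating version is outvoted, which matches the masking behavior described for configurations with one compromised model. When two versions agree on a wrong output they outvote the remaining one, and when no strict majority exists the voter returns None, leaving the handling of that case to the caller.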