Enhancing Autonomous Vehicle Safety through N-version Machine Learning Systems

Qiang Wen1,*, Júlio Mendonça2, Fumio Machida1 and Marcus Völp2
1 Department of Computer Science, University of Tsukuba, 305-8573, Japan
2 Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg, L-1855, Luxembourg

The IJCAI-24 Workshop on Artificial Intelligence Safety (AISafety 2024), August 04, 2024, Jeju, South Korea
* Corresponding author.
wen.qiang@sd.cs.tsukuba.ac.jp (Q. Wen); julio.mendonca@uni.lu (J. Mendonça); machida@cs.tsukuba.ac.jp (F. Machida); marcus.voelp@uni.lu (M. Völp)
© 2024 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Unreliable outputs of machine learning (ML) models are a significant concern, particularly for safety-critical applications such as autonomous driving. ML models are susceptible to out-of-distribution samples, distribution shifts, hardware transient faults, and even malicious attacks. To address these concerns, the N-version ML system offers a general solution for enhancing the reliability of ML system outputs by employing diversification of ML models and their inputs. However, existing studies of N-version ML systems have mainly focused on classification errors and have not considered their impacts in a practical application scenario. In this paper, we investigate the applicability of the N-version ML approach in an autonomous vehicle (AV) scenario within the AV simulator CARLA. We deploy two-version and three-version perception systems in an AV implemented in CARLA, using healthy ML models and compromised ML models generated with fault-injection techniques, and analyze the behavior of the AV in the simulator. Our findings reveal the critical impact of compromised models on AV collision rates and show the potential of three-version perception systems to mitigate this risk. Our three-version perception system improves driving safety by tolerating one compromised model and delaying collisions when at least one healthy model remains.

Keywords
autonomous driving, fault injection, machine learning system, N-version programming, perception

1. Introduction

Rapid machine learning (ML) advancements have led to widespread applications across various domains. ML-based intelligent software systems, including face recognition, medical diagnosis, and autonomous robots, have become integral parts of our daily lives [1, 2]. However, ML models cannot guarantee a correct output in the application context due to their uncertainties in dealing with real samples [3]. Additionally, transient faults (e.g., leading to bit-flip errors [4]) and malicious attacks such as adversarial attacks [5] may affect the system's capability to provide correct outputs, especially when a single ML model is in the software stack [6, 7]. When ML-based applications are incorporated into safety-critical systems, incorrect outputs can cause undesirable consequences. For example, the misrecognition of traffic signs by ML-based classifiers could result in accidents in autonomous driving scenarios. This example illustrates why ensuring the correctness of ML-based system outputs has become a critical concern, especially for systems in safety-critical domains.

Various approaches have been proposed to enhance the robustness of ML systems. ML testing is one of these approaches; it focuses on detecting differences between existing and required behaviors of machine learning systems [8]. However, existing works mainly focus on offline testing rather than runtime monitoring. To improve correctness during runtime, additional safety mechanisms such as data validation [9], safety monitors [10], and redundant architectures [11, 12] must be deployed. Current ML data validation techniques pose operational challenges, including an abundance of false positive warnings and the necessity for manual adjustments. Similarly, safety monitors, while crucial, lack adaptability due to their simultaneous training with the ML model. Model enhancement and specialization have led to large Deep Neural Networks (DNNs), capable of modeling more complex patterns and data relationships and, consequently, producing improved results [13]. However, large DNNs require more computational resources to execute, which specific systems, such as autonomous vehicles (AVs), may not afford because of the extra resource costs (e.g., energy). Moreover, even where it is feasible, adopting a single large DNN in a resource-limited system leaves the system with a single point of failure, which could cause malfunction of the entire system in the case of hardware or software failures or malicious attacks.
In contrast, adopting redundant architectures offers a more straightforward approach that utilizes diverse ML models and data inputs. Using multiple and diverse ML models, an ML-based system can avoid a single point of failure since replicated models can execute the same tasks, masking failures or misclassifications. Adopting model diversity also helps the system mitigate problems such as overfitting and adversarial attacks, as different models can have distinct structures and training data. Leveraging the idea of a traditional software fault tolerance technique, N-version programming (NVP) [14], the N-version ML system approach uses replication and diversification to improve the output reliability of ML systems [12]. By integrating multiple, independently functioning ML models, the N-version system is designed to maintain operation and accurate decision-making even when one or more components are compromised or faulty. The multiple versions of ML models and input data sources are used to generate multiple inference results, which may differ from each other. These results are subsequently analyzed using decision logic (e.g., a voter employing a majority voting rule [15]) or a protocol to agree on a single value (e.g., consensus protocols [16]) to determine the final output. This approach enables the system to detect and mitigate incorrect outputs arising from individual ML models. More recent studies have analyzed the adoption of N-version ML systems and presented their benefits for output reliability [11, 17, 15, 18]. However, none of these works have examined the safety impact in a practical application scenario.

Therefore, this paper leverages N-version ML system architectures for the perception module of AVs, aiming to investigate the impact of such architectures on the safety of autonomous driving scenarios using the CARLA simulator [19]. Specifically, we consider two-version and three-version perception systems, each comprising two or three independent ML modules, respectively, for object detection tasks in AVs. We incorporate multiple versions of ML models within the systems by deploying different versions of the YOLOv5 model. In addition, to simulate failures and errors caused by transient faults or malicious attacks, we create compromised ML models using the fault-injection tool PyTorchFI [20]. The tool intentionally changes ML model parameters, which can introduce errors into the ML models, representing situations where ML systems are affected by different types of faults (e.g., radiation, induced memory corruption). Then, we combine healthy and compromised ML models, following an N-version system architecture, and deploy them in an AV running on the CARLA simulator. The results show that single compromised models can significantly impact the AV collision rate, leading to collisions in up to 90% of the analyzed runs. We also find that the three-version system can efficiently tolerate one compromised model and delay collisions caused by incorrect object detection when at least one healthy model remains. We make the following contributions in this paper:

• We propose the application of N-version ML systems to the perception module of an AV to enhance autonomous driving safety.
• We conduct fault injection experiments to reveal the impact of compromised ML models in the perception module on the safety of AVs simulated in CARLA.
• Through the experiments, we demonstrate the enhanced driving safety achieved by a three-version perception system that can mitigate incorrect outputs from compromised ML models and delay possible AV collisions.
The remainder of the paper is organized as follows. Section 2 presents background and related work. Section 3 details the system and fault model adopted in this work. Section 4 states the research questions addressed in the experiments and describes the experimental settings. Section 5 discusses the achieved results, focusing on answering the defined research questions. Finally, Section 6 concludes the paper and briefly presents future work.

2. Background and Related Works

2.1. N-version Machine Learning

The N-version ML architecture, based on NVP, comprises N (≥2) diverse versions of ML components operating in parallel on the same task [12]. The ML components generate inference results individually, and the final output can be determined using a voting mechanism. Unlike ensemble learning [21], which aims to build a better model by combining weak learners, the N-version ML architecture is configured with pre-trained black-box models and designed for ML system operation. Recent studies have investigated N-version ML approaches to improve system reliability. Xu et al. [22] proposed NV-DNN, a framework aimed at enhancing the fault tolerance of deep learning systems that comprises N independently developed models and decision-making procedures. NV-DNN assumes processing a single input at a time, whereas N-version ML can also consider different inputs to exploit input diversity. Furthermore, diversifying input data can contribute to improving the reliability of N-version ML systems, as demonstrated in the works of Machida and Wen [11, 15, 23]. Hong et al. [24] proposed a multimodal deep-learning approach to improve the classification accuracy of remote-sensing imagery, outperforming single-model or single-modality approaches. Mendonça et al. [18] investigated the improvement of output reliability in perception systems through a modeling approach integrating N-version programming with rejuvenation techniques. Nevertheless, none of the existing studies have shown the effectiveness of the N-version ML approach for AV safety against the risk of faulty ML models.
2.2. Fault-injection for ML Models

Fault injection is a testing technique used to analyze systems in the presence of faults [25]. The method entails intentionally introducing faulty behavior into a system to examine its function under abnormal conditions. The objective is to evaluate whether the system can tolerate faults and continue to operate correctly or will misbehave. Recent studies have investigated fault injection techniques for deep neural networks (DNNs). For example, the single bias attack and the gradient descent attack are two types of fault injection attacks proposed by Liu et al. [26] to misclassify a specified input pattern into an adversarial class by modifying the parameters used in DNNs. Tools such as PyTorchFI [20] have been proposed for disturbing DNNs on the PyTorch platform, allowing users to induce perturbations in the weights or neurons of DNNs at runtime. Piazzesi et al. [27] used fault-injection tools to evaluate autonomous agents in the presence of artificial faults and attacks. In this study, we leverage a fault injection tool to evaluate an N-version perception system for AVs.

2.3. Autonomous Driving Simulation

CARLA [19] is a well-known and widely adopted open-source simulator designed for autonomous driving research. The simulation platform supports flexible specification of sensor suites and environmental conditions. CARLA has been extensively used to assess various aspects of autonomous driving. For instance, simulations enable the verification of whether a driving system trained using data from a simulator can be effectively deployed on a real car [28]. Besides, works developed by Gao et al. [29] and Piazzesi et al. [27] leverage CARLA to develop and evaluate object detection algorithms tailored for autonomous driving applications. By utilizing the simulation environment, the detection models can be tested under various conditions. In this work, we focus on object detection tasks within the perception system, particularly on analyzing N-version architectures for perception systems running in the CARLA simulator.
3. Fault and System Model

We focus on an ML-based perception module running in an AV. The perception module is an essential component of AVs that leverages inputs from the advanced sensors present in the vehicle. It serves as the vehicle's sensory hub, collecting and processing vast amounts of data to create a detailed understanding of its surroundings. It can integrate inputs from cameras, LiDAR, radar, and ultrasonic sensors, each contributing unique capabilities for detecting and classifying objects such as other vehicles, pedestrians, and road signs, as well as identifying lane markings and traffic signals [30]. The comprehensive sensory data is then forwarded to the planning and prediction modules to form a dynamic 3D map of the environment, enabling the AV to navigate safely and efficiently.

Perception modules heavily rely on ML models to detect obstacles, pedestrians, traffic signs, signals, lanes, and other vehicles from the input captured by cameras and other sensors. Therefore, a failure or a simple misclassification of objects in the environment may impact the safe driving behavior of the AV, which can lead to dangerous traffic situations and cause accidents.

Next, we detail the fault model adopted in this work and then present an N-version perception system for AVs, which aims to mitigate the impact of faulty and compromised ML models to enhance AV driving safety.

3.1. Fault Model

The fault model focuses on transient faults and malicious attacks related to ML models' output correctness. Thus, we assume sensors produce correct data and that they, as well as other components outside the perception system, are not subject to failures or attacks. On the other hand, we consider that vulnerabilities in deep learning frameworks (e.g., PyTorch, TensorFlow, or Caffe) could allow attackers to (1) launch denial-of-service attacks, (2) crash deep learning applications through memory exhaustion, (3) generate wrong classification outputs by corrupting the classifier's memory, or (4) hijack the control flow to remotely control the system hosting the deep learning application [31]. Recent CVE reports on TensorFlow (CVE-2023-27506, CVE-2023-25668), PyTorch (CVE-2022-45907), and Caffe (CVE-2021-39158) confirm the presence of such vulnerabilities. Besides, ML models can be subject to transient faults, such as those caused by radiation, which are capable of causing bit-flips.

In this way, we assume an ML model can be in one of three possible states: healthy (H), compromised but operational (C), or non-operational (N). When in a healthy state (H), the ML model performs normally, although it still produces incorrect outputs according to its accuracy. When faults or malicious attacks (e.g., radiation, induced memory corruption) affect the ML model, they may cause errors, which could lead to a subsequent failure. When faults or attacks cause errors in the ML model, it reaches a compromised but functional state (C). Compromised ML models can still perform object detection tasks but have a reduced probability of producing correct perception outputs. However, when the errors lead to failures, the ML model stops completely, entering a non-operational state (N), incapable of executing perception tasks.

Figure 1: An example of an N-version perception system architecture, containing N ML models: sensor inputs (e.g., camera data and GPS) feed the N ML models, whose perception outputs are combined by a voter. Each ML model can assume one of the following states at a given moment: healthy (H), compromised but operational (C), or non-operational (N).

In this work, we focus on how vulnerability exploitation and fault effects could be mitigated by using N-version ML models. We assume that errors or failures can harm all (or none) of the N ML models at the same time. This allows us to generalize the N-version architecture to cover both situations where ML models are executed in isolation (e.g., on different cores) and are not subject to the same failures, and situations where ML models are affected equally by a single failure. In practice, we demonstrate how artificially injected faults affect distinct ML models' outputs differently by generating different compromised ML models, as well as the overall effect when the system only has different compromised ML models executing.
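To keep the three model states and their assumed transitions explicit, the following minimal Python sketch encodes them; the enum and helper names are illustrative and introduced here for exposition only, not part of the experimental code.

```python
from enum import Enum

class ModelState(Enum):
    HEALTHY = "H"          # operates normally; wrong outputs occur only per its accuracy
    COMPROMISED = "C"      # faults/attacks caused errors; outputs are less reliable
    NON_OPERATIONAL = "N"  # errors led to a failure; the model produces no output

# Transitions assumed by the fault model:
#   HEALTHY -> COMPROMISED         when a transient fault or attack corrupts the model
#   COMPROMISED -> NON_OPERATIONAL when the resulting errors make the model stop entirely

def is_operational(state: ModelState) -> bool:
    """Only operational models (H or C) contribute detections to the voter."""
    return state is not ModelState.NON_OPERATIONAL
```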
3.2. An N-version Perception System

To mitigate the impact of failed and compromised ML models on AV safety, we present an N-version perception system for enhancing perception outputs. Figure 1 shows the architecture of an N-version perception system. We assume the perception system of an AV is composed of N ML models capable of executing object detection tasks, which aim to avoid AV collisions. In this work, we focus on situations where data input variation is not employed; that is, all N ML models receive the same data input from the AV sensors (e.g., cameras) to perform object detection. Note that in some systems, different sensor data could also be combined through a sensor fusion component before being forwarded to the ML models [32]. After executing the object detection task, each model forwards its output to a voter, which decides the final perception output based on a pre-defined voting rule. In the adopted system, we consider a voter implementing a majority-based voting rule for simplicity, while other rules can be implemented later. Besides, we assume the voter is implemented in a trustworthy component and is not susceptible to malicious attacks or faults. Such a mechanism has been demonstrated in practice by previous works, such as that of Gouveia et al. [33].

Assuming the mentioned architecture, an N-version ML perception system can be represented by a set of reachable states 𝑆 in which (ℎ, 𝑐, 𝑛) ∈ 𝑆, where h, c, and n represent the numbers of ML models in the healthy, compromised, and non-operational states, respectively. Additionally, we assume the voter can automatically detect when an ML model is in a non-operational state (N). Usually, failure detection tools can easily be adopted to verify whether a component is operational. This is necessary to prevent the voter from waiting indefinitely for the output of non-operational models and to allow it to reconfigure itself automatically with different pre-determined voting rules.
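As an illustration of the (h, c, n) state representation and of how the voter can reconfigure its rule when models drop out, consider the minimal sketch below. The helper names are assumptions for exposition; the reconfiguration policy (2-out-of-3 when three models are operational, unanimity when only two remain) mirrors the voting rules used later in our testbed (Section 4.2).

```python
from collections import Counter

def system_state(model_states):
    """Map per-model states (e.g., ["H", "C", "N"]) to the tuple (h, c, n)."""
    counts = Counter(model_states)
    return counts["H"], counts["C"], counts["N"]

def required_agreement(h, c, n):
    """Number of matching model outputs the voter needs before it emits a
    perception output, derived from how many models remain operational."""
    operational = h + c
    if operational >= 2:
        return operational // 2 + 1   # 2-out-of-3 majority, or both of the last two
    return None                       # a lone model cannot be cross-checked by voting

# Example: one model has crashed, so the system degrades to a two-version rule.
h, c, n = system_state(["H", "C", "N"])
assert (h, c, n) == (1, 1, 1)
assert required_agreement(h, c, n) == 2
```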
4. Experiments

4.1. Objective

The objective of this study is to investigate the applicability of an N-version perception system architecture for AV safety. We aim to answer the following research questions through experiments using CARLA.

RQ1: How does a compromised ML model of a perception system impact AV driving safety?
RQ2: How efficiently can an N-version perception system (N=2,3) tolerate compromised and non-operational models?

To address RQ1, we set up a simulation environment deploying different compromised ML models into the AV perception system to evaluate its driving behavior. We simulate compromised ML models by using PyTorchFI to generate artificial faults in the ML models. We also compare the driving behavior of the compromised ML models against the healthy ML models. To answer RQ2, we implement two-version and three-version perception systems, incorporating various combinations of healthy and compromised models. Then, we investigate the driving behavior across all the possible configurations, including entirely healthy, mixed (healthy and compromised), and entirely compromised models. This analysis allows us to evaluate how the N-version ML approach influences the driving behavior of the AV under various conditions.

4.2. Testbed Setup

We utilize the CARLA AV simulator and the cooperative driving co-simulation framework OpenCDA [34] to simulate a single-lane driving scenario. During the simulation process in OpenCDA, sensors installed on each AV collect the surrounding environment as well as ego vehicle information (e.g., 3D LiDAR points and Global Navigation Satellite System (GNSS) data). The collected sensor data are used by the perception and localization systems for object detection and localization. Subsequently, the perception output, including the objects' 3D poses and the ego position, is delivered to the downstream planning system to generate the AV trajectory and, consequently, update the AV's acceleration, speed, and wheel turning. Finally, the planned trajectory and commands are passed to the control system, which generates the final control commands. In this paper, we choose Town03 in CARLA as the map, shown in Figure 2. The yellow oval marks the starting point, and the yellow star marks the endpoint of the simulation run that the AV must execute. Along this path, the AV relies on its perception system to accurately detect other vehicles and road obstacles.

Figure 2: Adopted scenario in Town03 of the CARLA simulator.

Next, we define two-version and three-version systems using different object detection models in the architecture. We employ unmodified versions of the YOLOv5 model [35], namely YOLOv5s6, YOLOv5m6, and YOLOv5l6, as the healthy models deployed in the AV perception system. Then, we generate compromised versions of these models using PyTorchFI [20], more specifically, its runtime perturbation feature for weights and neurons in DNNs. These functionalities are crucial for simulating real-world scenarios where models may encounter unexpected disruptions. Thus, we employ PyTorchFI's random_weight_inj function with a weight range of (-100, 300) to mimic the conditions compromised models may encounter. The injection function randomly alters parameters within a randomly selected layer of the neural network, thereby introducing variability into the YOLO image detection algorithm. The degree to which the ego vehicle's perception is impacted (i.e., whether an error is caused) depends on the sensitivity of the affected model layer to the randomly injected weight. After injecting artificial faults into the healthy models, we obtain new models, named YOLOv5s6_FI, YOLOv5m6_FI, and YOLOv5l6_FI. Each of these models represents an ML model in the compromised state (C). Note that the weight perturbations injected into these compromised models affect all input data (e.g., image frames) during the entire simulation period.
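For illustration, the snippet below reproduces the essence of this perturbation in plain PyTorch: it overwrites one weight of a randomly chosen convolutional or linear layer with a value drawn uniformly from the (-100, 300) range used in our setup. It is a simplified stand-in for PyTorchFI's random_weight_inj (whose exact call signature may differ across versions); the helper name and the single-weight granularity are assumptions made for readability.

```python
import random
import torch

def inject_random_weight_fault(model: torch.nn.Module,
                               min_val: float = -100.0,
                               max_val: float = 300.0) -> None:
    """Corrupt one randomly chosen weight of a randomly chosen layer in place,
    approximating the weight perturbation applied with PyTorchFI."""
    weighted_layers = [m for m in model.modules()
                       if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))]
    layer = random.choice(weighted_layers)
    with torch.no_grad():
        flat = layer.weight.view(-1)
        idx = random.randrange(flat.numel())
        flat[idx] = random.uniform(min_val, max_val)  # out-of-range weight value

# Example: derive a "compromised" variant from a healthy YOLOv5 model loaded via torch.hub.
# model = torch.hub.load("ultralytics/yolov5", "yolov5s6")
# inject_random_weight_fault(model)   # the model now plays the role of YOLOv5s6_FI
```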
In our three-version perception system, we employ a majority voting rule. When three models are operational (i.e., in state H or C), the voter provides a perception output when 2-out-of-3 models agree on the same output. Agreement is defined by the criterion that the bounding boxes (bboxes) have an Intersection over Union (IoU) exceeding 0.8 and that the labels are identical. When a majority cannot be reached, the voter provides no perception output; consequently, the AV does not update its driving properties (e.g., speed and acceleration). When one model stops completely (i.e., enters the non-operational (N) state), the system degrades to a two-version system. In that case, the voting rule implemented in the voter requires the two models to agree on the same output; otherwise, the voter does not provide any perception output. Therefore, the AV can only update its trajectory and driving properties if the voter receives equal outputs from the two models.
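The sketch below shows one way such an agreement check can be written, assuming each operational model reports a detection as a (bbox, label) pair with boxes in (x1, y1, x2, y2) pixel coordinates. The pairwise comparison and the choice of which agreeing output to return are illustrative simplifications; only the IoU > 0.8 and identical-label criterion is taken from our setup.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def detections_agree(d1, d2, iou_thr=0.8):
    """Two detections agree when their labels match and their boxes overlap enough."""
    (box1, label1), (box2, label2) = d1, d2
    return label1 == label2 and iou(box1, box2) > iou_thr

def vote(detections, required=2):
    """Return a detection supported by at least `required` operational models,
    or None when no majority exists (the AV then keeps its previous plan)."""
    for i, candidate in enumerate(detections):
        support = 1 + sum(detections_agree(candidate, other)
                          for j, other in enumerate(detections) if j != i)
        if support >= required:
            return candidate      # e.g., 2-out-of-3 majority, or 2-out-of-2
    return None
```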
Evaluation Metric: We measure the collision rate of the AV as the number of collision frames over the total number of frames in a run. We also report the first collision frame number and the total frame number as evaluation metrics. These metrics characterize the driving behavior of the AV under the different system configurations that a three-version system can assume. We conduct ten runs for each system configuration and report the average of the metrics.
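As a small illustration, the three metrics can be computed from a per-frame collision log as in the following sketch; the log format (one boolean flag per simulation frame) is an assumption made for exposition.

```python
def run_metrics(collision_flags):
    """collision_flags: one boolean per simulation frame, True if a collision
    was registered in that frame. Returns (first collision frame, total frames,
    collision rate in percent)."""
    total_frames = len(collision_flags)
    collision_frames = sum(collision_flags)
    first_collision = next(
        (i for i, hit in enumerate(collision_flags) if hit), None)  # None -> "NA"
    collision_rate = 100.0 * collision_frames / total_frames
    return first_collision, total_frames, collision_rate
```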
5. Results

In this section, we discuss the experimental results to answer the research questions related to the N-version perception systems. We focus on the impact of compromised models and on the effectiveness of an N-version perception system.

5.1. Evaluation of Compromised ML Models

First, we investigate the effects of compromised models on object detection with an AV adopting a single-version perception system. We compare the object detection results between ML models in the healthy (i.e., state H) and compromised (i.e., state C) states. Recall that an ML model in a non-operational state (i.e., state N) cannot produce any output; thus, the AV cannot afford a single-version architecture with its only model in this state. Figure 3 illustrates an object detection case in which (a) a healthy YOLOv5m model accurately detects the vehicle in front, whereas (b) its fault-injected variant, YOLOv5m_FI, produces numerous erroneous bounding boxes, which is more likely to lead the AV to potential collisions.

Figure 3: Example of object detection during AV driving using (a) a healthy YOLOv5m model and (b) a compromised YOLOv5m_FI model.

Table 1 presents the average values of the three metrics across ten runs, including the first collision frame, total number of frames, and collision rate. The system state is represented as (ℎ, 𝑐, 𝑛), where h, c, and n denote the numbers of ML models in the healthy, compromised, and non-operational states, respectively. The upper part displays the results for single-version perception systems. Notably, the healthy models consistently exhibited a 0% average collision rate, while the compromised models showed significantly higher average collision rates (more than 70%). The AV had a collision in 90% of runs when driving with the different versions of compromised models. These numbers demonstrate that an AV using compromised models tends to experience collisions from the very onset of the simulation. Figure 4 illustrates the collision and no-collision cases in the simulated scenario. In our study, we primarily focus on vehicle-to-vehicle collisions; even though there is a curve in the lane, other collisions, such as vehicle-to-infrastructure collisions, do not occur.

Table 1: Collision data of the experiments over different states in single-, two-, and three-version systems.

System state | YOLO model | 1st collision frame | Total frames | Collision rate (%) | # Collisions
Single-version
(1,0,2) | v5s | NA | 687 | 0 | 0/10
(1,0,2) | v5m | NA | 685 | 0 | 0/10
(1,0,2) | v5l | NA | 682 | 0 | 0/10
(0,1,2) | v5s_FI | 119 | 628 | 71.40 | 9/10
(0,1,2) | v5m_FI | 103 | 622 | 74.33 | 9/10
(0,1,2) | v5l_FI | 89 | 644 | 76.72 | 9/10
Two-version
(2,0,1) | v5s, v5m | NA | 685 | 0 | 0/10
(1,1,1) | v5s, v5m_FI | 64 | 704 | 54.63 | 6/10
(1,1,1) | v5l, v5m_FI | 66 | 690 | 63.27 | 7/10
(0,2,1) | v5s_FI, v5m_FI | 64 | 667 | 81.22 | 9/10
Three-version
(3,0,0) | v5s, v5m, v5l | NA | 682 | 0 | 0/10
(2,1,0) | v5s, v5m, v5m_FI | NA | 693 | 0 | 0/10
(2,1,0) | v5s, v5m, v5s_FI | NA | 682 | 0 | 0/10
(1,2,0) | v5s, v5s_FI, v5m_FI | 272 | 666 | 28.82 | 5/10
(1,2,0) | v5m, v5s_FI, v5m_FI | 335 | 654 | 33.08 | 7/10
(0,3,0) | v5s_FI, v5m_FI, v5l_FI | 187 | 643 | 57.00 | 8/10

Answer to RQ1: Compromised models of an AV perception system demonstrate a high average collision rate of more than 70%, adversely affecting AV driving safety.

Figure 4: Fault injection effect example on the AV for the adopted scenario: (a) no-collision case and (b) collision case.

5.2. Evaluation of N-version Perception Systems

Next, we present the results achieved when adopting two- and three-version perception systems. The middle part of Table 1 presents the experimental results for two-version systems. The results indicate that no collisions occurred in any run when both healthy models were executed (i.e., state (2,0,1)). In contrast, configurations (1,1,1) and (0,2,1) experienced collisions. Notably, the number of collisions in the (1,1,1) configurations was lower than in the (0,1,2) configurations, suggesting that a two-version system with one compromised model can still mitigate some collisions. The average collision rates for the AV under the system states (1,1,1) and (0,2,1) were more than 50%. Additionally, the results show that the first collision frame, when compromised models were involved, occurred at around frame 64, a very early stage of the simulation. This is an important observation related to the layout of the vehicles in the simulation scenario, in which the ego vehicle starts the simulation in motion and relatively close to the vehicle in front of it. When the two ML models disagree and the detection of the vehicle in front is abnormal, the ego vehicle tends to have a rear collision with the vehicle in front. Thus, when adopting a two-version perception system with at least one compromised ML model, the AV has only a short time before entering a critical erroneous state and generating wrong object detection outputs.

The bottom part of Table 1 presents the results for the three-version perception system. When the majority of models were in a healthy state (i.e., (3,0,0) and (2,1,0)), no collisions were observed. This result indicates that a three-version perception system can effectively tolerate at least one compromised model and mask its failures when adopting the majority voting rule. For the configuration (1,2,0), where a majority of the models were compromised, collisions occurred in most runs. However, there were instances where the system successfully avoided collisions. The average collision rates for this configuration were about 30%, significantly lower than those of single-version compromised models. This observation suggests that even a system with more compromised models has the potential to prevent collisions under certain circumstances. Notably, the average first collision frame in the (1,2,0) configurations was much later than for single-version compromised models. This is a significant observation, as it demonstrates that the system can delay the onset of erroneous outputs. This delay could provide critical additional time for the AV to take evasive action, thereby possibly avoiding a collision. Finally, for the configuration (0,3,0), where all models were compromised, most runs ended in collisions, as expected, due to the lack of healthy models to correct the errors. However, two out of ten runs did not result in a collision, suggesting that even with all models compromised, specific conditions within the scenario might prevent failures temporarily.

Answer to RQ2: A three-version perception system can efficiently tolerate one compromised model. Even when the majority of the models are compromised, the system has the potential to prevent some collisions or delay erroneous perception outputs.

5.3. Discussion

The findings from the three-version perception systems demonstrate the applicability of the N-version ML approach to improving AV safety. Specifically, the configurations did not result in a collision when the majority of models were in healthy states, showing the three-version perception system's ability to mitigate disruptions. Although collisions still occur in some configurations with a majority of compromised models, the system shows the potential to delay erroneous perception outputs that could lead to collisions. This capability is crucial in environments where even a minor delay in failure onset can provide essential time for initiating corrective actions, thereby preventing potentially catastrophic outcomes.

Limitations. In this experiment, we did not consider the cost and performance overhead imposed by the redundant modules. The use of multiple ML models in the N-version perception system introduces additional computational overheads and may be costly to implement in a real vehicle. The overhead and cost can be mitigated by adjusting the number of modules activated and/or the frame rates. The perception output can also be enhanced by diversifying the input data without using multiple models [36]. Such design optimization under resource constraints needs to be investigated further in future work. Besides, our current evaluation is limited to a short run on one specific map. Further experiments with more diverse driving scenarios are needed to draw more general conclusions.

6. Conclusions and Future Work

In this study, we explored the practical application of N-version ML system architectures through experiments conducted in the autonomous vehicle simulator CARLA. By deploying two-version and three-version perception systems for object detection tasks, we investigated the effectiveness of incorporating multiple versions of both healthy and compromised ML models within the systems. Our findings demonstrate that compromised models within the perception system significantly impact the AV collision rate, with rates exceeding 70%. In addition, we observed that three-version perception systems have the potential to mitigate object detection misclassifications, tolerating one compromised model and delaying collisions when at least one healthy model remains operational. In future work, we plan to evaluate other system architectures and explore alternative decision-making mechanisms beyond simple majority voting rules that could improve the output correctness of N-version ML systems.

Acknowledgments

This work was supported by JST SPRING Grant Number JPMJSP2124, and partly supported by JSPS KAKENHI Grant Numbers 19K24337 and 22K17871. This work has also been supported by the German Research Council (DFG) and by the Luxembourg Fond Nationale de Recherche (FNR) through the Core Inter Project ByzRT (C19-IS-13691843).

References

[1] J. A. Sidey-Gibbons, C. J. Sidey-Gibbons, Machine learning in medicine: a practical introduction, BMC Medical Research Methodology 19 (2019) 1–18.
[2] H. J. Vishnukumar, B. Butting, C. Müller, E. Sax, Machine learning and deep neural network—artificial intelligence core for lab and real-world test and validation for ADAS and autonomous vehicles: AI for efficient and quality test and validation, in: Intelligent Systems Conference (IntelliSys), 2017, pp. 714–721.
[3] M. Henne, A. Schwaiger, G. Weiss, Managing uncertainty of AI-based perception for autonomous systems, in: AISafety@IJCAI, 2019, pp. 11–12.
[4] M. A. Hanif, F. Khalid, R. V. W. Putra, S. Rehman, M. Shafique, Robust machine learning systems: Reliability and security for deep neural networks, in: International Symposium on On-Line Testing and Robust System Design, 2018.
[5] S. Qiu, Q. Liu, S. Zhou, C. Wu, Review of artificial intelligence adversarial attack and defense technologies, Applied Sciences 9 (2019) 909.
[6] A. Toschi, M. Sanic, J. Leng, Q. Chen, C. Wang, M. Guo, Characterizing perception module performance and robustness in production-scale autonomous driving system, in: IFIP International Conference on Network and Parallel Computing, Springer, 2019, pp. 235–247.
[7] Apollo perception, 2019. URL: http://www.fzb.me/apollo/specs/perception_apollo_5.0.html.
[8] J. M. Zhang, M. Harman, L. Ma, Y. Liu, Machine learning testing: Survey, landscapes and horizons, IEEE Transactions on Software Engineering (2020).
[9] W. Wu, H. Xu, S. Zhong, M. Lyu, I. King, Deep validation: Toward detecting real-world corner cases for deep neural networks, in: Proc. of the 49th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2019, pp. 125–137.
[10] R. S. Ferreira, J. Arlat, J. Guiochet, H. Waselynck, Benchmarking safety monitors for image classifiers with machine learning, in: Proc. of the IEEE Pacific Rim International Symposium on Dependable Computing (PRDC), 2021, pp. 7–16.
[11] F. Machida, On the diversity of machine learning models for system reliability, in: IEEE Pacific Rim International Symposium on Dependable Computing (PRDC), 2019, pp. 276–285.
[12] F. Machida, N-version machine learning models for safety critical systems, in: Proc. of the DSN Workshop on Dependable and Secure Machine Learning, 2019, pp. 48–51.
[13] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–444.
[14] L. Chen, A. Avizienis, N-version programming: A fault-tolerance approach to reliability of software operation, in: Proc. of the 8th IEEE International Symposium on Fault-Tolerant Computing (FTCS-8), 1978, pp. 3–9.
[15] Q. Wen, F. Machida, Reliability models and analysis for triple-model with triple-input machine learning systems, in: Proc. of the 5th IEEE Conference on Dependable and Secure Computing, 2022, pp. 1–8.
[16] R. Olfati-Saber, J. A. Fax, R. M. Murray, Consensus and cooperation in networked multi-agent systems, Proceedings of the IEEE 95 (2007) 215–233.
[17] S. Latifi, B. Zamirai, S. Mahlke, PolygraphMR: Enhancing the reliability and dependability of CNNs, in: Proc. of the 50th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2020, pp. 99–112.
[18] J. Mendonça, F. Machida, M. Völp, Enhancing the reliability of perception systems using N-version programming and rejuvenation, in: Proc. of the 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), 2023, pp. 149–156.
[19] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, V. Koltun, CARLA: An open urban driving simulator, in: Proc. of the 1st Annual Conference on Robot Learning, 2017, pp. 1–16.
[20] A. Mahmoud, N. Aggarwal, A. Nobbe, J. R. S. Vicarte, S. V. Adve, C. W. Fletcher, I. Frosio, S. K. S. Hari, PyTorchFI: A runtime perturbation tool for DNNs, in: Proc. of the 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), 2020, pp. 25–31.
[21] Z.-H. Zhou, Ensemble methods: foundations and algorithms, CRC Press, 2012.
[22] H. Xu, Z. Chen, W. Wu, Z. Jin, S. Kuo, M. R. Lyu, NV-DNN: towards fault-tolerant DNN systems with N-version programming, in: Proc. of the 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), 2019, pp. 44–47.
[23] Q. Wen, F. Machida, Characterizing reliability of three-version traffic sign classifier system through diversity metrics, in: Proc. of the 34th International Symposium on Software Reliability Engineering (ISSRE), 2023, pp. 333–343.
[24] D. Hong, L. Gao, N. Yokoya, J. Yao, J. Chanussot, Q. Du, B. Zhang, More diverse means better: Multimodal deep learning meets remote-sensing imagery classification, IEEE Transactions on Geoscience and Remote Sensing 59 (2020) 4340–4354.
[25] M. C. Hsueh, T. K. Tsai, R. K. Iyer, Fault injection techniques and tools, Computer 30 (1997) 75–82.
[26] Y. Liu, L. Wei, B. Luo, Q. Xu, Fault injection attack on deep neural network, in: Proc. of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2017, pp. 131–138.
[27] N. Piazzesi, M. Hong, A. Ceccarelli, Attack and fault injection in self-driving agents on the CARLA simulator – experience report, in: Computer Safety, Reliability, and Security: 40th International Conference, SAFECOMP 2021, Springer-Verlag, York, UK, 2021, pp. 210–225.
[28] B. Osiński, A. Jakubowski, P. Zięcina, P. Miłoś, C. Galias, S. Homoceanu, H. Michalewski, Simulation-based reinforcement learning for real-world autonomous driving, in: IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 6411–6418.
[29] W. Gao, J. Tang, T. Wang, An object detection research method based on CARLA simulation, Journal of Physics: Conference Series 1948 (2021) 012163.
[30] S. D. Pendleton, H. Andersen, X. Du, X. Shen, M. Meghjani, Y. H. Eng, D. Rus, M. H. Ang, Perception, planning, control, and coordination for autonomous vehicles, Machines 5 (2017) 6.
[31] Q. Xiao, K. Li, D. Zhang, W. Xu, Security risks in deep learning implementations, in: IEEE Security and Privacy Workshops (SPW), 2018, pp. 123–128.
[32] R. Maurice, M. Gerla, Autonomous driving: Sensor fusion for multiple sensor types, in: Proc. of the IEEE International Conference on Intelligent Transportation Systems, IEEE, 2012.
[33] I. P. Gouveia, M. Völp, P. Esteves-Verissimo, Behind the last line of defense: Surviving SoC faults and intrusions, Computers & Security 123 (2022) 102920. doi:10.1016/j.cose.2022.102920.
[34] R. Xu, H. Xiang, X. Han, X. Xia, Z. Meng, C.-J. Chen, C. Correa-Jullian, J. Ma, The OpenCDA open-source ecosystem for cooperative driving automation research, IEEE Transactions on Intelligent Vehicles 8 (2023) 2698–2711. doi:10.1109/TIV.2023.3244948.
[35] G. Jocher, et al., ultralytics/yolov5: v5.0 - YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations, 2021. URL: https://doi.org/10.5281/zenodo.4679653.
[36] K. Wakigami, F. Machida, T. Phung-Duc, Reliability and performance evaluation of two-input machine learning systems, in: 2023 IEEE 28th Pacific Rim International Symposium on Dependable Computing (PRDC), IEEE, 2023, pp. 278–286.