Enhancing Autonomous Vehicle Safety through N-version Machine Learning Systems

Qiang Wen1,*, Júlio Mendonça2, Fumio Machida1 and Marcus Völp2
1 Department of Computer Science, University of Tsukuba, 305-8573, Japan
2 Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg, L-1855, Luxembourg

The IJCAI-24 Workshop on Artificial Intelligence Safety (AISafety 2024), August 04, 2024, Jeju, South Korea
* Corresponding author.
wen.qiang@sd.cs.tsukuba.ac.jp (Q. Wen); julio.mendonca@uni.lu (J. Mendonça); machida@cs.tsukuba.ac.jp (F. Machida); marcus.voelp@uni.lu (M. Völp)
© 2024 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Unreliable outputs of machine learning (ML) models are a significant concern, particularly for safety-critical applications such as autonomous driving. ML models are susceptible to out-of-distribution samples, distribution shifts, hardware transient faults, and even malicious attacks. To address these concerns, the N-version ML system offers a general solution for enhancing the reliability of ML system outputs by employing diversification of ML models and their inputs. However, existing studies of N-version ML systems have mainly focused on classification errors and have not considered their impacts in a practical application scenario. In this paper, we investigate the applicability of the N-version ML approach in an autonomous vehicle (AV) scenario within the AV simulator CARLA. We deploy two-version and three-version perception systems in an AV implemented in CARLA, using healthy ML models and compromised ML models generated with fault-injection techniques, and analyze the behavior of the AV in the simulator. Our findings reveal the critical impact of compromised models on AV collision rates and show the potential of three-version perception systems to mitigate this risk. Our three-version perception system improves driving safety by tolerating one compromised model and delaying collisions when at least one healthy model remains.

Keywords
autonomous driving, fault injection, machine learning system, N-version programming, perception

1. Introduction

Rapid machine learning (ML) advancements have led to widespread applications across various domains. ML-based intelligent software systems, including face recognition, medical diagnosis, and autonomous robots, have become integral parts of our daily lives [1, 2]. However, ML models cannot guarantee a correct output in the application context due to their uncertainties in dealing with real samples [3]. Additionally, transient faults (e.g., leading to bit-flip errors [4]) and malicious attacks such as adversarial attacks [5] may affect the system's capability to provide correct outputs, especially when a single ML model is in the software stack [6, 7]. When ML-based applications are incorporated into safety-critical systems, incorrect outputs can cause undesirable consequences. For example, the misrecognition of traffic signs by ML-based classifiers could result in accidents in autonomous driving scenarios. This example illustrates why ensuring the correctness of ML-based system outputs has become a critical concern, especially for systems in safety-critical domains.

Various approaches have been proposed to enhance the robustness of ML systems. ML testing is one of these approaches; it focuses on detecting differences between existing and required behaviors of machine learning systems [8]. However, existing works mainly focus on offline testing rather than runtime monitoring. To improve correctness during runtime, additional safety mechanisms such as data validation [9], safety monitors [10], and redundant architectures [11, 12] must be deployed. Current ML data validation techniques pose operational challenges, including an abundance of false positive warnings and the necessity for manual adjustments. Similarly, safety monitors, while crucial, lack adaptability due to their simultaneous training with the ML model. Model enhancement and specialization have led to large Deep Neural Networks (DNNs), capable of modeling more complex patterns and data relationships and, consequently, producing improved results [13]. However, large DNNs require more computational resources to execute, which specific systems, such as autonomous vehicles (AVs), may not afford because of the extra resource costs (e.g., energy). Moreover, even where it is feasible, adopting a single large DNN in a resource-limited system leaves the system with a single point of failure, which could cause malfunction of the entire system in the case of hardware or software failures or malicious attacks.
In contrast, adopting redundant architectures offers a more straightforward approach that utilizes diverse ML models and data inputs. Using multiple and diverse ML models, an ML-based system can avoid a single point of failure since replicated models can execute the same tasks, masking failures or misclassifications. Adopting model diversity also helps the system mitigate problems such as overfitting and adversarial attacks, as different models can have distinct structures and training data. Leveraging the idea of a traditional software fault tolerance technique, N-version programming (NVP) [14], the N-version ML system approach uses replication and diversification to improve the output reliability of ML systems [12]. By integrating multiple, independently functioning ML models, the N-version system is designed to maintain operation and accurate decision-making even when one or more components are compromised or faulty. The multiple versions of ML models and input data sources are used to generate multiple inference results, which may differ from each other. These results are subsequently analyzed using decision logic (e.g., a voter employing a majority voting rule [15]) or a protocol to agree on a single value (e.g., consensus protocols [16]) to determine the final output. This approach enables the system to detect and mitigate incorrect outputs arising from individual ML models. More recent studies have analyzed the adoption of N-version ML systems and presented their benefits for output reliability [11, 17, 15, 18]. However, none of these works have examined the safety impact in a practical application scenario.

Therefore, this paper leverages N-version ML system architectures for the perception module of AVs, aiming to investigate the impact of such architectures on the safety of autonomous driving scenarios using the CARLA simulator [19]. Specifically, we consider two-version and three-version perception systems, each comprising two or three independent ML modules, respectively, for object detection tasks in AVs. We incorporate multiple versions of ML models within the systems by deploying different versions of the YOLOv5 model. In addition, to simulate failures and errors caused by transient faults or malicious attacks, we create compromised ML models using the fault-injection tool PyTorchFI [20]. The tool intentionally changes ML model parameters, which can introduce errors into the ML models, representing situations where ML systems are affected by different types of faults (e.g., radiation, induced memory corruption). Then, we combine healthy and compromised ML models, following an N-version system architecture, and deploy them in an AV running on the CARLA simulator. The results show that single compromised models can significantly impact the AV collision rate, leading to collisions in up to 90% of the analyzed runs. We also find that the three-version system can efficiently tolerate one compromised model and delay collisions caused by incorrect object detection when at least one healthy model remains. We make the following contributions in this paper:

• We propose the application of N-version ML systems to the perception module of an AV to enhance autonomous driving safety.
• We conduct fault injection experiments to reveal the impact of compromised ML models in the perception module on the safety of AVs simulated in CARLA.
• Through the experiments, we demonstrate the enhanced driving safety achieved by a three-version perception system that can mitigate incorrect outputs from compromised ML models and delay possible AV collisions.
The remainder of the paper is organized as follows. Section 2 presents background and related work. Section 3 details the system and fault model adopted in this work. Section 4 states the research questions addressed in the experiments and describes the experimental settings. Section 5 discusses the achieved results, focusing on answering the defined research questions. Finally, Section 6 concludes the paper and briefly presents future work.

2. Background and Related Works

2.1. N-version Machine Learning

The N-version ML architecture, based on NVP, comprises N (≥2) diverse versions of ML components operating in parallel on the same task [12]. The ML components generate inference results individually, and the final output can be determined using a voting mechanism. Unlike ensemble learning [21], which aims to build a better model by combining weak learners, the N-version ML architecture is configured with pre-trained black-box models and designed for ML system operation. Recent studies have investigated N-version ML approaches to improve system reliability. Xu et al. [22] proposed NV-DNN, a framework aimed at enhancing the fault tolerance of deep learning systems that comprises N independently developed models and decision-making procedures. NV-DNN assumes processing a single input at a time, whereas N-version ML can also consider different inputs to exploit input diversity. Furthermore, diversifying input data can contribute to improving the reliability of N-version ML systems, as demonstrated in the works of Machida and Wen [11, 15, 23]. Hong et al. [24] proposed a multimodal deep-learning approach to improve the classification accuracy of remote-sensing imagery, outperforming single-model or single-modality approaches. Mendonça et al. [18] investigated the improvement of output reliability in perception systems through a modeling approach integrating N-version programming with rejuvenation techniques. Nevertheless, none of the existing studies have shown the effectiveness of the N-version ML approach for AV safety against the risk of faulty ML models.
2.2. Fault-injection for ML Models

Fault injection is a testing technique used to analyze systems in the presence of faults [25]. The method entails intentionally introducing faulty behavior into a system to examine its function under abnormal conditions. The objective is to evaluate whether the system can tolerate faults and continue to operate correctly or will misbehave. Recent studies have investigated fault injection techniques for deep neural networks (DNNs). For example, the single bias attack and the gradient descent attack are two types of fault injection attacks proposed by Liu et al. [26] to misclassify a specified input pattern into an adversarial class by modifying the parameters used in DNNs. Tools such as PyTorchFI [20] have been proposed for disturbing DNNs on the PyTorch platform, allowing users to induce perturbations in the weights or neurons of DNNs at runtime. Piazzesi et al. [27] used fault-injection tools to evaluate autonomous agents in the presence of artificial faults and attacks. In this study, we leverage a fault injection tool to evaluate an N-version perception system for AVs.

2.3. Autonomous Driving Simulation

CARLA [19] is a well-known and widely adopted open-source simulator designed for autonomous driving research. The simulation platform supports flexible specification of sensor suites and environmental conditions. CARLA has been extensively used to assess various aspects of autonomous driving. For instance, simulations enable the verification of whether a driving system trained using data from a simulator can be effectively deployed on a real car [28]. Besides, works developed by Gao et al. [29] and Piazzesi et al. [27] leverage CARLA to develop and evaluate object detection algorithms tailored for autonomous driving applications. By utilizing the simulation environment, the detection models can be tested under various conditions. In this work, we focus on object detection tasks within the perception system, particularly on analyzing N-version architectures for perception systems running in the CARLA simulator.
3. Fault and System Model

We focus on an ML-based perception module running in an AV. The perception module is an essential component of AVs that leverages inputs from the advanced sensors present in the vehicle. It serves as the vehicle's sensory hub, collecting and processing vast amounts of data to create a detailed understanding of its surroundings. It can integrate inputs from cameras, LiDAR, radar, and ultrasonic sensors, each contributing unique capabilities for detecting and classifying objects such as other vehicles, pedestrians, and road signs, as well as identifying lane markings and traffic signals [30]. The comprehensive sensory data is then forwarded to the planning and prediction modules to form a dynamic 3D map of the environment, enabling the AV to navigate safely and efficiently.

Perception modules heavily rely on ML models to detect obstacles, pedestrians, traffic signs, signals, lanes, and other vehicles from the input captured by cameras and other sensors. Therefore, a failure or a simple misclassification of objects in the environment may impact the safe driving behavior of the AV, which can lead to dangerous traffic situations and cause accidents.

Next, we detail the fault model adopted in this work and then present an N-version perception system for AVs, which aims to mitigate the impact of faulty and compromised ML models to enhance AV driving safety.

3.1. Fault Model

The fault model focuses on transient faults and malicious attacks related to ML models' output correctness. Thus, we assume sensors produce correct data and that they, as well as other components outside the perception system, are not subject to failures or attacks. On the other hand, we consider that vulnerabilities in deep learning frameworks (e.g., PyTorch, TensorFlow, or Caffe) could allow attackers to (1) launch denial-of-service attacks, (2) crash deep learning applications through memory exhaustion, (3) generate wrong classification outputs by corrupting the classifier's memory, or (4) hijack the control flow to remotely control the system hosting the deep learning application [31]. Recent CVE reports on TensorFlow (CVE-2023-27506, CVE-2023-25668), PyTorch (CVE-2022-45907), and Caffe (CVE-2021-39158) confirm the presence of such vulnerabilities. Besides, ML models can be subject to transient faults, such as those caused by radiation, which are capable of causing bit-flips.

In this way, we assume an ML model can be in one of three possible states: healthy (H), compromised but operational (C), or non-operational (N). When in a healthy state (H), the ML model performs normally, although it still produces incorrect outputs according to its accuracy. When faults or malicious attacks (e.g., radiation, induced memory corruption) affect the ML model, they may cause errors, which could lead to a subsequent failure. When faults or attacks cause errors in the ML model, it reaches a compromised but functional state (C). Compromised ML models can still perform object detection tasks but have a reduced probability of producing correct perception outputs. However, when the errors lead to failures, the ML model stops completely, entering a non-operational state (N), incapable of executing perception tasks.

Figure 1: An example of an N-version perception system architecture, containing N ML models: sensor inputs (e.g., camera data and GPS) feed the N ML models, whose perception outputs are combined by a voter. Each ML model can assume one of the following states at a given moment: healthy (H), compromised but operational (C), or non-operational (N).

In this work, we focus on how vulnerability exploitation and fault effects could be mitigated by using N-version ML models. We assume that errors or failures can harm all (or none) of the N ML models at the same time. This allows us to generalize the N-version architecture to cover both situations where ML models are executed in isolation (e.g., on different cores) and are not subject to the same failures, and situations where ML models are affected equally by a single failure. In practice, we demonstrate how artificially injected faults affect distinct ML models' outputs differently by generating different compromised ML models, as well as the overall effect when the system only has different compromised ML models executing.
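To keep the three model states and their assumed transitions explicit, the following minimal Python sketch encodes them; the enum and helper names are illustrative and introduced here for exposition only, not part of the experimental code.

```python
from enum import Enum

class ModelState(Enum):
    HEALTHY = "H"          # operates normally; wrong outputs occur only per its accuracy
    COMPROMISED = "C"      # faults/attacks caused errors; outputs are less reliable
    NON_OPERATIONAL = "N"  # errors led to a failure; the model produces no output

# Transitions assumed by the fault model:
#   HEALTHY -> COMPROMISED         when a transient fault or attack corrupts the model
#   COMPROMISED -> NON_OPERATIONAL when the resulting errors make the model stop entirely

def is_operational(state: ModelState) -> bool:
    """Only operational models (H or C) contribute detections to the voter."""
    return state is not ModelState.NON_OPERATIONAL
```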
3.2. An N-version Perception System

To mitigate the impact of failed and compromised ML models on AV safety, we present an N-version perception system for enhancing perception outputs. Figure 1 shows the architecture of an N-version perception system. We assume the perception system of an AV is composed of N ML models capable of executing object detection tasks, which aim to avoid AV collisions. In this work, we focus on situations where data input variation is not employed; that is, all N ML models receive the same data input from the AV sensors (e.g., cameras) to perform object detection. Note that in some systems, different sensor data could also be combined through a sensor fusion component before being forwarded to the ML models [32]. After executing the object detection task, each model forwards its output to a voter, which decides the final perception output based on a pre-defined voting rule. In the adopted system, we consider a voter implementing a majority-based voting rule for simplicity, while other rules can be implemented later. Besides, we assume the voter is implemented in a trustworthy component and is not susceptible to malicious attacks or faults. Such a mechanism has been demonstrated in practice by previous works, such as that of Gouveia et al. [33].

Assuming the mentioned architecture, an N-version ML perception system can be represented by a set of reachable states 𝑆 in which (ℎ, 𝑐, 𝑛) ∈ 𝑆, where h, c, and n represent the numbers of ML models in the healthy, compromised, and non-operational states, respectively. Additionally, we assume the voter can automatically detect when an ML model is in a non-operational state (N). Usually, failure detection tools can easily be adopted to verify whether a component is operational. This is necessary to prevent the voter from waiting indefinitely for the output of non-operational models and to allow it to reconfigure itself automatically with different pre-determined voting rules.
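As an illustration of the (h, c, n) state representation and of how the voter can reconfigure its rule when models drop out, consider the minimal sketch below. The helper names are assumptions for exposition; the reconfiguration policy (2-out-of-3 when three models are operational, unanimity when only two remain) mirrors the voting rules used later in our testbed (Section 4.2).

```python
from collections import Counter

def system_state(model_states):
    """Map per-model states (e.g., ["H", "C", "N"]) to the tuple (h, c, n)."""
    counts = Counter(model_states)
    return counts["H"], counts["C"], counts["N"]

def required_agreement(h, c, n):
    """Number of matching model outputs the voter needs before it emits a
    perception output, derived from how many models remain operational."""
    operational = h + c
    if operational >= 2:
        return operational // 2 + 1   # 2-out-of-3 majority, or both of the last two
    return None                       # a lone model cannot be cross-checked by voting

# Example: one model has crashed, so the system degrades to a two-version rule.
h, c, n = system_state(["H", "C", "N"])
assert (h, c, n) == (1, 1, 1)
assert required_agreement(h, c, n) == 2
```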
4. Experiments

4.1. Objective

The objective of this study is to investigate the applicability of an N-version perception system architecture for AV safety. We aim to answer the following research questions through experiments using CARLA.

RQ1: How does a compromised ML model of a perception system impact AV driving safety?
RQ2: How efficiently can an N-version perception system (N=2,3) tolerate compromised and non-operational models?

To address RQ1, we set up a simulation environment deploying different compromised ML models into the AV perception system to evaluate its driving behavior. We simulate compromised ML models by using PyTorchFI to generate artificial faults in the ML models. We also compare the driving behavior of the compromised ML models against the healthy ML models. To answer RQ2, we implement two-version and three-version perception systems, incorporating various combinations of healthy and compromised models. Then, we investigate the driving behavior across all the possible configurations, including entirely healthy, mixed (healthy and compromised), and entirely compromised models. This analysis allows us to evaluate how the N-version ML approach influences the driving behavior of the AV under various conditions.

4.2. Testbed Setup

We utilize the CARLA AV simulator and the cooperative driving co-simulation framework OpenCDA [34] to simulate a single-lane driving scenario. During the simulation process in OpenCDA, sensors installed on each AV collect the surrounding environment as well as ego vehicle information (e.g., 3D LiDAR points and Global Navigation Satellite System (GNSS) data). The collected sensor data are used by the perception and localization systems for object detection and localization. Subsequently, the perception output, including the objects' 3D poses and the ego position, is delivered to the downstream planning system to generate the AV trajectory and, consequently, update the AV's acceleration, speed, and wheel turning. Finally, the planned trajectory and commands are passed to the control system, which generates the final control commands. In this paper, we choose Town03 in CARLA as the map, shown in Figure 2. The yellow oval marks the starting point, and the yellow star marks the endpoint of the simulation run that the AV must execute. Along this path, the AV relies on its perception system to accurately detect other vehicles and road obstacles.

Figure 2: Adopted scenario in Town03 of the CARLA simulator.

Next, we define two-version and three-version systems using different object detection models in the architecture. We employ unmodified versions of the YOLOv5 model [35], namely YOLOv5s6, YOLOv5m6, and YOLOv5l6, as the healthy models deployed in the AV perception system. Then, we generate compromised versions of these models using PyTorchFI [20], more specifically, its runtime perturbation feature for weights and neurons in DNNs. These functionalities are crucial for simulating real-world scenarios where models may encounter unexpected disruptions. Thus, we employ PyTorchFI's random_weight_inj function with a weight range of (-100, 300) to mimic the conditions compromised models may encounter. The injection function randomly alters parameters within a randomly selected layer of the neural network, thereby introducing variability into the YOLO image detection algorithm. The degree to which the ego vehicle's perception is impacted (i.e., whether an error is caused) depends on the sensitivity of the affected model layer to the randomly injected weight. After injecting artificial faults into the healthy models, we obtain new models, named YOLOv5s6_FI, YOLOv5m6_FI, and YOLOv5l6_FI. Each of these models represents an ML model in the compromised state (C). Note that the weight perturbations injected into these compromised models affect all input data (e.g., image frames) during the entire simulation period.
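For illustration, the snippet below reproduces the essence of this perturbation in plain PyTorch: it overwrites one weight of a randomly chosen convolutional or linear layer with a value drawn uniformly from the (-100, 300) range used in our setup. It is a simplified stand-in for PyTorchFI's random_weight_inj (whose exact call signature may differ across versions); the helper name and the single-weight granularity are assumptions made for readability.

```python
import random
import torch

def inject_random_weight_fault(model: torch.nn.Module,
                               min_val: float = -100.0,
                               max_val: float = 300.0) -> None:
    """Corrupt one randomly chosen weight of a randomly chosen layer in place,
    approximating the weight perturbation applied with PyTorchFI."""
    weighted_layers = [m for m in model.modules()
                       if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))]
    layer = random.choice(weighted_layers)
    with torch.no_grad():
        flat = layer.weight.view(-1)
        idx = random.randrange(flat.numel())
        flat[idx] = random.uniform(min_val, max_val)  # out-of-range weight value

# Example: derive a "compromised" variant from a healthy YOLOv5 model loaded via torch.hub.
# model = torch.hub.load("ultralytics/yolov5", "yolov5s6")
# inject_random_weight_fault(model)   # the model now plays the role of YOLOv5s6_FI
```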
In our three-version perception system, we employ a majority voting rule. When three models are operational (i.e., in state H or C), the voter provides a perception output when 2-out-of-3 models agree on the same output. Agreement is defined by the criterion that the bounding boxes (bboxes) have an Intersection over Union (IoU) exceeding 0.8 and that the labels are identical. When a majority cannot be reached, the voter provides no perception output; consequently, the AV does not update its driving properties (e.g., speed and acceleration). When one model stops completely (i.e., enters the non-operational (N) state), the system degrades to a two-version system. In that case, the voting rule implemented in the voter requires the two models to agree on the same output; otherwise, the voter does not provide any perception output. Therefore, the AV can only update its trajectory and driving properties if the voter receives equal outputs from the two models.
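The sketch below shows one way such an agreement check can be written, assuming each operational model reports a detection as a (bbox, label) pair with boxes in (x1, y1, x2, y2) pixel coordinates. The pairwise comparison and the choice of which agreeing output to return are illustrative simplifications; only the IoU > 0.8 and identical-label criterion is taken from our setup.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def detections_agree(d1, d2, iou_thr=0.8):
    """Two detections agree when their labels match and their boxes overlap enough."""
    (box1, label1), (box2, label2) = d1, d2
    return label1 == label2 and iou(box1, box2) > iou_thr

def vote(detections, required=2):
    """Return a detection supported by at least `required` operational models,
    or None when no majority exists (the AV then keeps its previous plan)."""
    for i, candidate in enumerate(detections):
        support = 1 + sum(detections_agree(candidate, other)
                          for j, other in enumerate(detections) if j != i)
        if support >= required:
            return candidate      # e.g., 2-out-of-3 majority, or 2-out-of-2
    return None
```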
Evaluation Metric: We measure the collision rate of the AV as the number of collision frames over the total number of frames in a run. We also report the first collision frame number and the total frame number as evaluation metrics. These metrics characterize the driving behavior of the AV under the different system configurations that a three-version system can assume. We conduct ten runs for each system configuration and report the average of the metrics.
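As a small illustration, the three metrics can be computed from a per-frame collision log as in the following sketch; the log format (one boolean flag per simulation frame) is an assumption made for exposition.

```python
def run_metrics(collision_flags):
    """collision_flags: one boolean per simulation frame, True if a collision
    was registered in that frame. Returns (first collision frame, total frames,
    collision rate in percent)."""
    total_frames = len(collision_flags)
    collision_frames = sum(collision_flags)
    first_collision = next(
        (i for i, hit in enumerate(collision_flags) if hit), None)  # None -> "NA"
    collision_rate = 100.0 * collision_frames / total_frames
    return first_collision, total_frames, collision_rate
```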
5. Results

In this section, we discuss the experimental results to answer the research questions related to the N-version perception systems. We focus on the impact of compromised models and on the effectiveness of an N-version perception system.

5.1. Evaluation of Compromised ML Models

First, we investigate the effects of compromised models on object detection with an AV adopting a single-version perception system. We compare the object detection results between ML models in the healthy (i.e., state H) and compromised (i.e., state C) states. Recall that an ML model in a non-operational state (i.e., state N) cannot produce any output; thus, the AV cannot afford a single-version architecture with its only model in this state. Figure 3 illustrates an object detection case in which (a) a healthy YOLOv5m model accurately detects the vehicle in front, whereas (b) its fault-injected variant, YOLOv5m_FI, produces numerous erroneous bounding boxes, which is more likely to lead the AV to potential collisions.

Figure 3: Example of object detection during AV driving using (a) a healthy YOLOv5m model and (b) a compromised YOLOv5m_FI model.

Table 1 presents the average values of the three metrics across ten runs, including the first collision frame, total number of frames, and collision rate. The system state is represented as (ℎ, 𝑐, 𝑛), where h, c, and n denote the numbers of ML models in the healthy, compromised, and non-operational states, respectively. The upper part displays the results for single-version perception systems. Notably, the healthy models consistently exhibited a 0% average collision rate, while the compromised models showed significantly higher average collision rates (more than 70%). The AV had a collision in 90% of runs when driving with the different versions of compromised models. These numbers demonstrate that an AV using compromised models tends to experience collisions from the very onset of the simulation. Figure 4 illustrates the collision and no-collision cases in the simulated scenario. In our study, we primarily focus on vehicle-to-vehicle collisions; even though there is a curve in the lane, other collisions, such as vehicle-to-infrastructure collisions, do not occur.

Table 1: Collision data of the experiments over different states in single-, two-, and three-version systems.

System state | YOLO model | 1st collision frame | Total frames | Collision rate (%) | # Collisions
Single-version
(1,0,2) | v5s | NA | 687 | 0 | 0/10
(1,0,2) | v5m | NA | 685 | 0 | 0/10
(1,0,2) | v5l | NA | 682 | 0 | 0/10
(0,1,2) | v5s_FI | 119 | 628 | 71.40 | 9/10
(0,1,2) | v5m_FI | 103 | 622 | 74.33 | 9/10
(0,1,2) | v5l_FI | 89 | 644 | 76.72 | 9/10
Two-version
(2,0,1) | v5s, v5m | NA | 685 | 0 | 0/10
(1,1,1) | v5s, v5m_FI | 64 | 704 | 54.63 | 6/10
(1,1,1) | v5l, v5m_FI | 66 | 690 | 63.27 | 7/10
(0,2,1) | v5s_FI, v5m_FI | 64 | 667 | 81.22 | 9/10
Three-version
(3,0,0) | v5s, v5m, v5l | NA | 682 | 0 | 0/10
(2,1,0) | v5s, v5m, v5m_FI | NA | 693 | 0 | 0/10
(2,1,0) | v5s, v5m, v5s_FI | NA | 682 | 0 | 0/10
(1,2,0) | v5s, v5s_FI, v5m_FI | 272 | 666 | 28.82 | 5/10
(1,2,0) | v5m, v5s_FI, v5m_FI | 335 | 654 | 33.08 | 7/10
(0,3,0) | v5s_FI, v5m_FI, v5l_FI | 187 | 643 | 57.00 | 8/10

Answer to RQ1: Compromised models of an AV perception system demonstrate a high average collision rate of more than 70%, adversely affecting AV driving safety.

Figure 4: Fault injection effect example on the AV for the adopted scenario: (a) no-collision case and (b) collision case.

5.2. Evaluation of N-version Perception Systems

Next, we present the results achieved when adopting two- and three-version perception systems. The middle part of Table 1 presents the experimental results for two-version systems. The results indicate that no collisions occurred in any run when both healthy models were executed (i.e., state (2,0,1)). In contrast, configurations (1,1,1) and (0,2,1) experienced collisions. Notably, the number of collisions in the (1,1,1) configurations was lower than in the (0,1,2) configurations, suggesting that a two-version system with one compromised model can still mitigate some collisions. The average collision rates for the AV under the system states (1,1,1) and (0,2,1) were more than 50%. Additionally, the results show that the first collision frame, when compromised models were involved, occurred at around frame 64, a very early stage of the simulation. This is an important observation related to the layout of the vehicles in the simulation scenario, in which the ego vehicle starts the simulation in motion and relatively close to the vehicle in front of it. When the two ML models disagree and the detection of the vehicle in front is abnormal, the ego vehicle tends to have a rear collision with the vehicle in front. Thus, when adopting a two-version perception system with at least one compromised ML model, the AV has only a short time before entering a critical erroneous state and generating wrong object detection outputs.

The bottom part of Table 1 presents the results for the three-version perception system. When the majority of models were in a healthy state (i.e., (3,0,0) and (2,1,0)), no collisions were observed. This result indicates that a three-version perception system can effectively tolerate at least one compromised model and mask its failures when adopting the majority voting rule. For the configuration (1,2,0), where a majority of the models were compromised, collisions occurred in most runs. However, there were instances where the system successfully avoided collisions. The average collision rates for this configuration were about 30%, significantly lower than those of single-version compromised models. This observation suggests that even a system with more compromised models has the potential to prevent collisions under certain circumstances. Notably, the average first collision frame in the (1,2,0) configurations was much later than for single-version compromised models. This is a significant observation, as it demonstrates that the system can delay the onset of erroneous outputs. This delay could provide critical additional time for the AV to take evasive action, thereby possibly avoiding a collision. Finally, for the configuration (0,3,0), where all models were compromised, most runs ended in collisions, as expected, due to the lack of healthy models to correct the errors. However, two out of ten runs did not result in a collision, suggesting that even with all models compromised, specific conditions within the scenario might prevent failures temporarily.

Answer to RQ2: A three-version perception system can efficiently tolerate one compromised model. Even when the majority of the models are compromised, the system has the potential to prevent some collisions or delay erroneous perception outputs.

5.3. Discussion

The findings from the three-version perception systems demonstrate the applicability of the N-version ML approach to improving AV safety. Specifically, the configurations did not result in a collision when the majority of models were in healthy states, showing the three-version perception system's ability to mitigate disruptions. Although collisions still occur in some configurations with a majority of compromised models, the system shows the potential to delay erroneous perception outputs that could lead to collisions. This capability is crucial in environments where even a minor delay in failure onset can provide essential time for initiating corrective actions, thereby preventing potentially catastrophic outcomes.

Limitations. In this experiment, we did not consider the cost and performance overhead imposed by the redundant modules. The use of multiple ML models in the N-version perception system introduces additional computational overheads and may be costly to implement in a real vehicle. The overhead and cost can be mitigated by adjusting the number of modules activated and/or the frame rates. The perception output can also be enhanced by diversifying the input data without using multiple models [36]. Such design optimization under resource constraints needs to be investigated further in future work. Besides, our current evaluation is limited to a short run on one specific map. Further experiments with more diverse driving scenarios are needed to draw more general conclusions.

6. Conclusions and Future Work

In this study, we explored the practical application of N-version ML system architectures through experiments conducted in the autonomous vehicle simulator CARLA. By deploying two-version and three-version perception systems for object detection tasks, we investigated the effectiveness of incorporating multiple versions of both healthy and compromised ML models within the systems. Our findings demonstrate that compromised models within the perception system significantly impact the AV collision rate, with rates exceeding 70%. In addition, we observed that three-version perception systems have the potential to mitigate object detection misclassifications, tolerating one compromised model and delaying collisions when at least one healthy model remains operational. In future work, we plan to evaluate other system architectures and explore alternative decision-making mechanisms beyond simple majority voting rules that could improve the output correctness of N-version ML systems.

Acknowledgments

This work was supported by JST SPRING Grant Number JPMJSP2124, and partly supported by JSPS KAKENHI Grant Numbers 19K24337 and 22K17871. This work has also been supported by the German Research Council (DFG) and by the Luxembourg Fond Nationale de Recherche (FNR) through the Core Inter Project ByzRT (C19-IS-13691843).

References

[1] J. A. Sidey-Gibbons, C. J. Sidey-Gibbons, Machine learning in medicine: a practical introduction, BMC Medical Research Methodology 19 (2019) 1–18.
[2] H. J. Vishnukumar, B. Butting, C. Müller, E. Sax, Machine learning and deep neural network—artificial intelligence core for lab and real-world test and validation for ADAS and autonomous vehicles: AI for efficient and quality test and validation, in: Intelligent Systems Conference (IntelliSys), 2017, pp. 714–721.
[3] M. Henne, A. Schwaiger, G. Weiss, Managing uncertainty of AI-based perception for autonomous systems, in: AISafety@IJCAI, 2019, pp. 11–12.
[4] M. A. Hanif, F. Khalid, R. V. W. Putra, S. Rehman, M. Shafique, Robust machine learning systems: Reliability and security for deep neural networks, in: International Symposium on On-Line Testing and Robust System Design, 2018.
[5] S. Qiu, Q. Liu, S. Zhou, C. Wu, Review of artificial intelligence adversarial attack and defense technologies, Applied Sciences 9 (2019) 909.
[6] A. Toschi, M. Sanic, J. Leng, Q. Chen, C. Wang, M. Guo, Characterizing perception module performance and robustness in production-scale autonomous driving system, in: IFIP International Conference on Network and Parallel Computing, Springer, 2019, pp. 235–247.
[7] Apollo perception, 2019. URL: http://www.fzb.me/apollo/specs/perception_apollo_5.0.html.
[8] J. M. Zhang, M. Harman, L. Ma, Y. Liu, Machine learning testing: Survey, landscapes and horizons, IEEE Transactions on Software Engineering (2020).
[9] W. Wu, H. Xu, S. Zhong, M. Lyu, I. King, Deep validation: Toward detecting real-world corner cases for deep neural networks, in: Proc. of the 49th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2019, pp. 125–137.
[10] R. S. Ferreira, J. Arlat, J. Guiochet, H. Waselynck, Benchmarking safety monitors for image classifiers with machine learning, in: Proc. of the IEEE Pacific Rim International Symposium on Dependable Computing (PRDC), 2021, pp. 7–16.
[11] F. Machida, On the diversity of machine learning models for system reliability, in: IEEE Pacific Rim International Symposium on Dependable Computing (PRDC), 2019, pp. 276–285.
[12] F. Machida, N-version machine learning models for safety critical systems, in: Proc. of the DSN Workshop on Dependable and Secure Machine Learning, 2019, pp. 48–51.
[13] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–444.
[14] L. Chen, A. Avizienis, N-version programming: A fault-tolerance approach to reliability of software operation, in: Proc. of the 8th IEEE International Symposium on Fault-Tolerant Computing (FTCS-8), 1978, pp. 3–9.
[15] Q. Wen, F. Machida, Reliability models and analysis for triple-model with triple-input machine learning systems, in: Proc. of the 5th IEEE Conference on Dependable and Secure Computing, 2022, pp. 1–8.
[16] R. Olfati-Saber, J. A. Fax, R. M. Murray, Consensus and cooperation in networked multi-agent systems, Proceedings of the IEEE 95 (2007) 215–233.
[17] S. Latifi, B. Zamirai, S. Mahlke, PolygraphMR: Enhancing the reliability and dependability of CNNs, in: Proc. of the 50th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2020, pp. 99–112.
[18] J. Mendonça, F. Machida, M. Völp, Enhancing the reliability of perception systems using N-version programming and rejuvenation, in: Proc. of the 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), 2023, pp. 149–156.
[19] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, V. Koltun, CARLA: An open urban driving simulator, in: Proc. of the 1st Annual Conference on Robot Learning, 2017, pp. 1–16.
[20] A. Mahmoud, N. Aggarwal, A. Nobbe, J. R. S. Vicarte, S. V. Adve, C. W. Fletcher, I. Frosio, S. K. S. Hari, PyTorchFI: A runtime perturbation tool for DNNs, in: Proc. of the 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), 2020, pp. 25–31.
[21] Z.-H. Zhou, Ensemble methods: foundations and algorithms, CRC Press, 2012.
[22] H. Xu, Z. Chen, W. Wu, Z. Jin, S. Kuo, M. R. Lyu, NV-DNN: towards fault-tolerant DNN systems with N-version programming, in: Proc. of the 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), 2019, pp. 44–47.
[23] Q. Wen, F. Machida, Characterizing reliability of three-version traffic sign classifier system through diversity metrics, in: Proc. of the 34th International Symposium on Software Reliability Engineering (ISSRE), 2023, pp. 333–343.
[24] D. Hong, L. Gao, N. Yokoya, J. Yao, J. Chanussot, Q. Du, B. Zhang, More diverse means better: Multimodal deep learning meets remote-sensing imagery classification, IEEE Transactions on Geoscience and Remote Sensing 59 (2020) 4340–4354.
[25] M. C. Hsueh, T. K. Tsai, R. K. Iyer, Fault injection techniques and tools, Computer 30 (1997) 75–82.
[26] Y. Liu, L. Wei, B. Luo, Q. Xu, Fault injection attack on deep neural network, in: Proc. of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2017, pp. 131–138.
[27] N. Piazzesi, M. Hong, A. Ceccarelli, Attack and fault injection in self-driving agents on the CARLA simulator – experience report, in: Computer Safety, Reliability, and Security: 40th International Conference, SAFECOMP 2021, Springer-Verlag, York, UK, 2021, pp. 210–225.
[28] B. Osiński, A. Jakubowski, P. Zięcina, P. Miłoś, C. Galias, S. Homoceanu, H. Michalewski, Simulation-based reinforcement learning for real-world autonomous driving, in: IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 6411–6418.
[29] W. Gao, J. Tang, T. Wang, An object detection research method based on CARLA simulation, Journal of Physics: Conference Series 1948 (2021) 012163.
[30] S. D. Pendleton, H. Andersen, X. Du, X. Shen, M. Meghjani, Y. H. Eng, D. Rus, M. H. Ang, Perception, planning, control, and coordination for autonomous vehicles, Machines 5 (2017) 6.
[31] Q. Xiao, K. Li, D. Zhang, W. Xu, Security risks in deep learning implementations, in: IEEE Security and Privacy Workshops (SPW), 2018, pp. 123–128.
[32] R. Maurice, M. Gerla, Autonomous driving: Sensor fusion for multiple sensor types, in: Proc. of the IEEE International Conference on Intelligent Transportation Systems, IEEE, 2012.
[33] I. P. Gouveia, M. Völp, P. Esteves-Verissimo, Behind the last line of defense: Surviving SoC faults and intrusions, Computers & Security 123 (2022) 102920. doi:10.1016/j.cose.2022.102920.
[34] R. Xu, H. Xiang, X. Han, X. Xia, Z. Meng, C.-J. Chen, C. Correa-Jullian, J. Ma, The OpenCDA open-source ecosystem for cooperative driving automation research, IEEE Transactions on Intelligent Vehicles 8 (2023) 2698–2711. doi:10.1109/TIV.2023.3244948.
[35] G. Jocher, et al., ultralytics/yolov5: v5.0 - YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations, 2021. URL: https://doi.org/10.5281/zenodo.4679653.
[36] K. Wakigami, F. Machida, T. Phung-Duc, Reliability and performance evaluation of two-input machine learning systems, in: 2023 IEEE 28th Pacific Rim International Symposium on Dependable Computing (PRDC), IEEE, 2023, pp. 278–286.