<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing Autonomous Vehicle Safety through N-version Machine Learning Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Qiang Wen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Júlio Mendonça</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fumio Machida</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcus Völp</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Tsukuba</institution>
          ,
          <addr-line>305-8573</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg</institution>
          ,
          <addr-line>L-1855</addr-line>
          ,
          <country country="LU">Luxembourg</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Unreliable outputs of machine learning (ML) models are a significant concern, particularly for safety-critical applications such as autonomous driving. ML models are susceptible to out-of-distribution samples, distribution shifts, hardware transient faults, and even malicious attacks. To address these concerns, the N-version ML system offers a general solution that enhances the reliability of ML system outputs by diversifying ML models and their inputs. However, existing studies of N-version ML systems have mainly focused on classification errors and have not considered their impact in a practical application scenario. In this paper, we investigate the applicability of the N-version ML approach in an autonomous vehicle (AV) scenario within the AV simulator CARLA. We deploy two-version and three-version perception systems in an AV implemented in CARLA, using healthy ML models and compromised ML models, which are generated using fault-injection techniques, and analyze the behavior of the AV in the simulator. Our findings reveal the critical impact of compromised models on AV collision rates and show the potential of three-version perception systems in mitigating the risk. Our three-version perception system improves driving safety by tolerating one compromised model and delaying collisions when at least one model is healthy.</p>
      </abstract>
      <kwd-group>
        <kwd>autonomous driving</kwd>
        <kwd>fault injection</kwd>
        <kwd>machine learning system</kwd>
        <kwd>N-version programming</kwd>
        <kwd>perception</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Rapid machine learning (ML) advancements have led to widespread applications across various domains. ML-based intelligent software systems, including face recognition, medical diagnosis, and autonomous robots, have become integral parts of our daily lives [1, 2]. However, ML models cannot guarantee a correct output in the application context due to ML models’ uncertainties in dealing with real samples [3]. Additionally, transient faults (e.g., leading to bit-flip errors [4]) and malicious attacks such as adversarial attacks [5] may affect the system’s capability to provide correct outputs, especially when a single ML model is in the software stack [6, 7]. When ML-based applications are incorporated into safety-critical systems, incorrect outputs can cause undesirable consequences. For example, the misrecognition of traffic signs by ML-based classifiers could result in accidents in autonomous driving scenarios. This example shows that ensuring the correctness of ML-based system outputs has become a critical concern, especially for systems in safety-critical domains.</p>
      <p>Various approaches have been proposed to enhance the robustness of ML systems. ML testing is one of these approaches; it focuses on detecting differences between existing and required behaviors of machine learning systems [8]. However, the existing works mainly focus on offline testing rather than runtime monitoring. To improve correctness during runtime, additional safety mechanisms such as data validation [9], safety monitors [10], and redundant architecture [11, 12] must be deployed. Current ML data validation techniques pose operational challenges, including an abundance of false positive warnings and the necessity for manual adjustments. Similarly, safety monitors, while crucial, lack adaptability due to their simultaneous training with the ML model. Model enhancement and specialization led to the generation of large Deep Neural Networks (DNNs), capable of modeling more complex patterns and data relationships, which, consequently, present improved results as output [13]. However, large DNNs require more computational resources to be executed, which specific systems, such as autonomous vehicles (AVs), may not afford as it would incur extra resource costs (e.g., energy). Although feasible and sometimes suitable, adopting a large DNN in limited-resource systems would mean relying on a single DNN, which leaves the system with a single point of failure that could cause malfunction of the entire system in the case of hardware or software failures or malicious attacks.</p>
      <p>In contrast, adopting redundant architectures offers a more straightforward approach by utilizing diverse ML models and data inputs. Using multiple and diverse ML models, an ML-based system can avoid a single point of failure since replicated models can execute the same tasks, masking failures or misclassifications. Also, adopting model diversity can help the system mitigate problems such as overfitting and adversarial attacks, as different models could have a distinct structure and training data. Leveraging the idea of a traditional software fault-tolerance technique, N-version programming (NVP) [14], the N-version ML system approach uses replication and diversification to improve the output reliability of ML systems [12]. By integrating multiple, independently functioning ML models, the N-version system is designed to maintain operation and accurate decision-making even when one or more components are compromised or faulty. The multiple versions of ML models and input data sources are used to generate multiple inference results, which may differ from each other. These results are subsequently analyzed using decision logic (e.g., a voter employing a majority voting rule [15]) or a protocol to agree on a single value (e.g., consensus protocols [16]) to determine the final output. This approach enables the system to detect and mitigate incorrect outputs arising from individual ML models. More recent studies have analyzed the adoption of N-version ML systems and presented their benefits for output reliability [11, 17, 15, 18]. However, none of these works have examined the safety impact in a practical application scenario.</p>
      <p>Therefore, this paper leverages N-version ML system architectures for the perception module of AVs, aiming to investigate the impact of such architectures on the safety of autonomous driving scenarios using the CARLA simulator [19]. Specifically, we consider two-version and three-version perception systems, each comprising two or three independent ML modules, respectively, for object detection tasks in AVs. We incorporate multiple versions of ML models within the systems by deploying different versions of the YOLOv5 model. In addition, to simulate failures and errors caused by transient faults or malicious attacks, we create compromised ML models using the fault-injection tool PyTorchFI [20]. The tool intentionally changes ML model parameters, which can introduce errors into the ML models, representing situations where ML systems may be affected by different types of faults (e.g., radiation, induced memory corruption). Then, we combine healthy and compromised ML models, following an N-version system architecture, and deploy them into an AV running on the CARLA simulator. The results show that single compromised models can significantly impact the AV collision rate in up to 90% of the analyzed scenarios. We also find that the three-version system has the potential to tolerate one compromised model efficiently and delay collisions caused by incorrect object detection when having at least one healthy model. We make the following contributions in this paper:</p>
      <list list-type="bullet">
        <list-item><p>We propose the application of N-version ML systems to the perception module of an AV to enhance autonomous driving safety.</p></list-item>
        <list-item><p>We conduct fault injection experiments to reveal the impact of compromised ML models in the perception module on the safety of AVs simulated in CARLA.</p></list-item>
        <list-item><p>Through the experiment, we demonstrate the enhanced driving safety achieved by a three-version perception system that can mitigate incorrect outputs from compromised ML models and delay possible AV collisions.</p></list-item>
      </list>
      <p>The remainder of the paper is organized as follows. Section 2 presents background and related work. Section 3 details the system and fault model adopted in this work. Section 4 clarifies the research questions addressed in the following experiment and describes the experiment settings. Section 5 discusses the achieved results, focusing on answering the defined research questions. Finally, Section 6 concludes the paper and briefly presents future work.</p>
      <p>The IJCAI-24 Workshop on Artificial Intelligence Safety (AISafety 2024), August 04, 2024, Jeju, South Korea. * Corresponding author. wen.qiang@sd.cs.tsukuba.ac.jp (Q. Wen); julio.mendonca@uni.lu (J. Mendonça); machida@cs.tsukuba.ac.jp (F. Machida); marcus.voelp@uni.lu (M. Völp). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      <sec>
        <title>2. Background and Related Works</title>
        <sec>
          <title>2.1. N-version Machine Learning</title>
          <p>N-version ML architecture, based on NVP, comprises N (≥ 2) diverse versions of ML components operating in parallel for the same task [12]. The ML components generate multiple inference results individually, and the final output can be determined using a voting mechanism. Unlike ensemble learning [21], which aims to build a better model by combining weak learners, the N-version ML architecture is configured with pre-trained black-box models and designed for ML system operation. Recent studies have investigated N-version ML approaches to improve system reliability. Xu et al. [22] proposed the NV-DNN, a framework aimed at enhancing the fault tolerance of deep learning systems comprising N independently developed models and decision-making procedures. NV-DNN assumes processing a single input at a time, whereas N-version ML can also consider different inputs to exploit input diversity. Furthermore, diversifying input data can contribute to improving the reliability of N-version ML systems, as demonstrated in works from Machida and Wen [11, 15, 23]. Hong et al. [24] proposed a multimodal deep-learning approach to improve the classification accuracy of remote-sensing imagery, outperforming single-model or single-modality approaches. Mendonça et al. [18] investigated the improvement of output reliability in perception systems through a modeling approach when integrating N-version programming with rejuvenation techniques. Nevertheless, none of the existing studies have shown the effectiveness of the N-version ML approach in AV safety against the risk of faulty ML models.</p>
        </sec>
        <sec>
          <title>2.2. Fault-injection for ML Models</title>
          <p>Fault injection is a testing technique used to analyze systems under the presence of faults [25]. This method entails intentionally introducing faulty behavior into a system to examine its function under abnormal conditions. The objective is to evaluate whether the system can tolerate faults and continue to operate correctly or will misbehave. Recent studies have investigated fault injection techniques in deep neural networks (DNNs). For example, Liu et al. [26] proposed the single bias attack and the gradient descent attack, two types of fault injection attacks that misclassify a specified input pattern into an adversarial class by modifying the parameters used in DNNs. Tools such as PyTorchFI [20] have been proposed for disturbing DNNs on the PyTorch platform, allowing users to induce perturbations in the weights or neurons of DNNs at runtime. Piazzesi et al. [27] used fault-injection tools to evaluate autonomous agents under the presence of artificial faults and attacks. In this study, we leverage a fault injection tool to evaluate an N-version perception system for AV.</p>
        </sec>
        <sec>
          <title>2.3. Autonomous Driving Simulation</title>
          <p>CARLA [19] is a well-known and widely adopted open-source simulator designed for autonomous driving research. The simulation platform supports flexible specification of sensor suites and environmental conditions. CARLA has been extensively used to assess various aspects of autonomous driving. For instance, simulations enable the verification of whether a driving system, trained using data from a simulator, can be effectively deployed on a real car [28]. Besides, works developed by Gao et al. [29] and Piazzesi et al. [27] leverage CARLA to develop and evaluate object detection algorithms tailored for autonomous driving applications. By utilizing the simulation environment, the detection models can be tested under various conditions. In this work, we shall focus on object detection tasks within the perception system, particularly in analyzing N-version architectures for perception systems running in the CARLA simulator.</p>
        </sec>
      </sec>
      <sec>
        <title>3. Fault and System Model</title>
        <p>We focus on an ML-based perception module running in an AV. A perception module is an essential component of AVs. The perception module leverages inputs from advanced sensors present in an AV. It serves as the vehicle’s sensory hub, collecting and processing vast amounts of data to create a detailed understanding of its surroundings. It can integrate inputs from cameras, LiDAR, radar, and ultrasonic sensors, each contributing unique capabilities for detecting and classifying objects such as other vehicles, pedestrians, and road signs, as well as identifying lane markings and traffic signals [30]. The comprehensive sensory data is then forwarded to the planning and prediction modules to form a dynamic 3D map of the environment, enabling the AV to navigate safely and efficiently.</p>
        <p>Perception modules heavily rely on ML models to detect obstacles, pedestrians, traffic signs, signals, lanes, and other vehicles from the input captured by cameras and other sensors. Therefore, a failure or simple misclassification of objects in the environment may impact the safe driving behavior of the AV, which may lead to dangerous traffic situations and cause accidents.</p>
        <p>Next, we detail the fault model adopted in this work and then present an N-version perception system for AVs, which aims to mitigate the impact of faulty and compromised ML models to enhance AV driving safety.</p>
      </sec>
      <sec id="sec-1-1">
        <title>3.1. Fault Model</title>
        <p>The fault model focuses on transient faults and malicious attacks related to ML models’ output correctness. Thus, we assume sensors produce correct data and they, as well as other components outside the perception system, are not subject to failures or attacks. On the other hand, we consider that vulnerabilities in deep learning frameworks (e.g., PyTorch, TensorFlow, or Caffe) could allow attackers to (1) launch denial-of-service attacks, (2) crash deep learning applications due to memory exhaustion, (3) generate wrong classification outputs by corrupting the classifier’s memory, or (4) hijack the control flow to remotely control the system hosting the deep learning application [31]. The latest CVE reports on TensorFlow (CVE-2023-27506, CVE-2023-25668), PyTorch (CVE-2022-45907), and Caffe (CVE-2021-39158) confirm the presence of such vulnerabilities. Besides, ML models can be subject to transient faults such as radiation, which are capable of causing bit-flips. In this way, we assume an ML model can have three possible states: healthy (H), compromised but operational (C), or non-operational (N). When in a healthy state (H), the ML model performs normally, which intrinsically includes producing incorrect outputs according to its accuracy. When faults or malicious attacks (e.g., radiation, induced memory corruption) affect the ML model, they may cause errors, which could lead to a subsequent failure. When faults or attacks cause errors in the ML model, it reaches a compromised but functional state (C). Compromised ML models can still perform object detection tasks but have a reduced probability of producing correct perception outputs. However, when the errors lead to failures, the ML model completely stops, entering a non-operational state (N), incapable of executing perception tasks.</p>
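        <p>As an illustration of how a single transient bit-flip can move a model from the healthy state (H) to the compromised state (C), the following Python sketch inverts one bit of a weight stored as a 32-bit IEEE 754 float. The helper name flip_bit and the float32 storage format are assumptions made for this illustration; they are not part of the fault-injection setup used in this work.</p>
        <preformat preformat-type="code">
```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its float32 encoding inverted.

    Bit 31 is the sign, bits 30..23 the exponent, bits 22..0 the mantissa.
    """
    if bit not in range(32):
        raise ValueError("bit must be in 0..31")
    # Reinterpret the float32 as an unsigned 32-bit integer.
    (raw,) = struct.unpack("!I", struct.pack("!f", value))
    # XOR with a one-hot mask inverts exactly one bit.
    raw ^= 2 ** bit
    # Reinterpret the corrupted integer as a float32 again.
    (corrupted,) = struct.unpack("!f", struct.pack("!I", raw))
    return corrupted


if __name__ == "__main__":
    weight = 0.5
    for bit in (0, 30, 31):
        print(bit, flip_bit(weight, bit))
```
        </preformat>
        <p>In this sketch, flipping the highest exponent bit turns a weight of 0.5 into a value on the order of 1e38, whereas flipping the lowest mantissa bit changes it by less than one part in a million. This is one reason why the effect of a fault depends strongly on where it lands.</p>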
        <p>In this work, we shall focus on how vulnerability exploitation and fault effects could be mitigated by using N-version ML models. In this way, we assume that errors or failures could harm all (or none) of the N ML models at the same time. This allows us to generalize the N-version architecture to consider situations where ML models are executed in isolation (e.g., in different cores) and are not subject to the same failures, and situations where ML models are affected equally by a single failure. In practice, we demonstrate how artificially injected faults affect distinct ML models’ output differently by generating different compromised ML models, as well as the overall effect when the system only has different compromised ML models executing.</p>
        <sec id="sec-1-1-1">
          <title>3.2. An N-version Perception System</title>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <p>To mitigate the impact of failed and compromised ML models on AV safety, we present an N-version perception system for enhancing perception outputs. Figure 1 shows the architecture of an N-version perception system. We assume a perception system of an AV composed of N ML models capable of executing object detection tasks, which aim to avoid AV collisions. We shall focus in this work on situations where data input variation is not employed. This means that all N ML models receive the same data input from the AV sensors (e.g., cameras) to perform object detection. Note that in some systems, it is also possible that different sensor data could be combined through a sensor fusion component before being forwarded to the ML models [32]. After executing the object detection task, each model forwards its output to a voter, which decides the final perception output based on a pre-defined voting rule. In the adopted system, we consider the voter implementing a majority-based voting rule for simplicity, while other rules can be implemented later. Besides, we assume the voter is implemented in a trustworthy component and is not susceptible to malicious attacks or faults. The implementation of such a mechanism has been demonstrated in practice by previous works, such as Gouveia et al. [33].</p>
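        <p>A minimal sketch of such a majority voter is given below, in Python. It assumes, for brevity, that every operational model reports a single detection per frame as a label and an axis-aligned box (x1, y1, x2, y2) with positive area, and it uses the agreement criterion adopted later in the experiments (identical labels and an IoU above 0.8). The names Detection, iou, agree, and vote are hypothetical; non-operational models are simply left out of the input list.</p>
        <preformat preformat-type="code">
```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


@dataclass(frozen=True)
class Detection:
    label: str
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def agree(a: Detection, b: Detection, threshold: float = 0.8) -> bool:
    """Two outputs agree if labels match and the IoU exceeds the threshold."""
    return a.label == b.label and iou(a.box, b.box) > threshold


def vote(outputs: List[Detection]) -> Optional[Detection]:
    """Return an output backed by a strict majority, or None if there is none.

    `outputs` holds one detection per operational model.
    """
    needed = len(outputs) // 2 + 1
    for candidate in outputs:
        # A candidate counts itself, so support is at least 1.
        support = sum(1 for other in outputs if agree(candidate, other))
        if support >= needed:
            return candidate
    return None
```
        </preformat>
        <p>With three operational models this rule reduces to 2-out-of-3; with two it requires both models to agree. A return value of None stands for the case in which the voter provides no perception output.</p>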
        <p>Assuming the mentioned architecture, an N-version ML perception system can be represented by a set of reachable states whose elements are tuples (h, c, n), where h, c, and n represent the numbers of ML models in the healthy, compromised, and non-operational state, respectively. Additionally, we assume the voter can automatically detect when an ML model is in a non-operational state (N). Usually, failure detection tools can be easily adopted to verify whether a component is operational. This would be necessary to prevent the voter from waiting indefinitely for the output of non-operational models and for it to be able to reconfigure itself with different pre-determined</p>
        <p>[Figure 1: Architecture of an N-version perception system. Sensors (camera, GPS) provide input data to N ML models, whose results a voter combines into the perception output.]</p>
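        <p>To make the state notation concrete, the Python sketch below enumerates the tuples (h, c, n) for a given N, assuming that every combination with h + c + n = N is reachable. The helper healthy_majority is an illustrative assumption derived from the majority voting rule rather than a definition from this paper: it reports whether healthy models alone can form a majority among the operational ones.</p>
        <preformat preformat-type="code">
```python
from typing import List, Tuple

State = Tuple[int, int, int]  # (h, c, n): healthy, compromised, non-operational


def reachable_states(n_versions: int) -> List[State]:
    """List every (h, c, n) with h + c + n == n_versions."""
    return [
        (h, c, n_versions - h - c)
        for h in range(n_versions, -1, -1)
        for c in range(n_versions - h, -1, -1)
    ]


def healthy_majority(state: State) -> bool:
    """True if healthy models form a strict majority of operational models."""
    h, c, _ = state
    operational = h + c
    return operational > 0 and h >= operational // 2 + 1


if __name__ == "__main__":
    for state in reachable_states(3):
        print(state, healthy_majority(state))
```
        </preformat>
        <p>For N = 3 the enumeration yields ten states, from (3, 0, 0) down to (0, 0, 3).</p>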
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Experiments</title>
      <sec id="sec-2-1">
        <title>4.1. Objective</title>
        <sec id="sec-2-1-1">
          <p>The objective of this study is to investigate the applicability of an N-version perception system architecture for AV safety. We aim to answer the following research questions through experiments using CARLA.</p>
          <p>RQ1: How does a compromised ML model of a perception
system impact AV driving safety?</p>
          <p>RQ2: How efficiently can an N-version perception system (N=2,3) tolerate compromised and non-operational models?</p>
          <p>To address RQ1, we set up a simulation environment deploying different compromised ML models into the AV perception system to evaluate its driving behavior. We simulate compromised ML models using PyTorchFI to generate artificial faults in the ML models. We also compare the driving behavior of the compromised ML models against the healthy ML models. To answer RQ2, we implement two-version and three-version perception systems, incorporating various combinations of healthy and compromised models. Then, we investigate the driving behavior across all the possible configurations, including entirely healthy, mixed (healthy and compromised), and entirely compromised models. This analysis allows us to evaluate how the N-version ML approach influences the driving behavior of the AV’s perception system under various conditions.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>4.2. Testbed Setup</title>
        <p>We utilize the CARLA AV simulator and a cooperative driving co-simulation framework OpenCDA [34] to simulate a single-lane driving scenario. During the simulation process in OpenCDA, sensors installed on each AV collect the surrounding environment as well as the ego vehicle information (e.g., 3D LiDAR points and Global Navigation Satellite System (GNSS) data). The collected sensors’ data are used by the perception and localization systems for object detection and localization. Subsequently, the perception output, including object 3D pose and ego position, is delivered to the downstream planning system to generate the AV trajectory and, consequently, update the AV’s acceleration, speed, and wheel turning. Finally, the planned trajectory and commands are passed to the control system, which generates the final control commands. In this paper, we choose Town03 in CARLA as the map shown in Figure 2. The yellow oval marks the starting point, and the yellow star marks the endpoint of the simulation run that an AV must execute. Along this path, the AV relies on its perception system to accurately detect other vehicles and road obstacles.</p>
        <p>[Figure 2: Adopted scenario in Town03 of the CARLA simulator.]</p>
        <sec id="sec-2-2-1">
          <p>Next, we define two-version and three-version systems using different object detection models in the architecture. We employ unmodified versions of the YOLOv5 model [35], including YOLOv5s6, YOLOv5m6, and YOLOv5l6, as healthy models and deploy them into the AV perception system. Then, we generate compromised versions of these models using PyTorchFI [20]. More specifically, we adopt PyTorchFI’s runtime perturbation feature for weights and neurons in DNNs. Those functionalities are crucial for simulating real-world scenarios where models may encounter unexpected disruptions. Thus, we employ PyTorchFI’s random_weight_inj function with a weight range of (-100, 300) to mimic the conditions compromised models may encounter. The injection function randomly alters parameters within a randomly selected layer of the neural network, thereby introducing variability into the YOLO image detection algorithm. The degree to which the ego vehicle’s perception is impacted (i.e., whether it causes an error) depends on the sensitivity of the model layer to the randomly injected weight.</p>
          <p>After injecting artificial faults in the healthy models, we generate new models, renaming them to YOLOv5s6_FI, YOLOv5m6_FI, and YOLOv5l6_FI. Each one of these models represents the ML models when in a compromised state (C). Note that the weight perturbations injected on these compromised models affect all input data (e.g., image frames) during the entire period.</p>
          <p>In our three-version perception system, we employ a majority voting rule. When three models are operational (i.e., in the states H or C), the voter provides a perception output when 2-out-of-3 models agree on the same output. The agreement is defined based on the criterion that the bounding boxes (bboxes) have an Intersection over Union (IoU) exceeding 0.8, and the labels are identical. When the majority cannot be reached, the voter provides no perception output. Consequently, the AV does not update its driving properties (e.g., speed and acceleration).</p>
          <p>When one model stops completely (i.e., entering a non-operational (N) state), the system degrades to a two-version system. In those cases, the voting rule implemented in the voter is that the two models must agree on the same output. Otherwise, it should not provide any perception output. Therefore, the AV can only update its trajectory and driving properties if the voter receives equal output from the two models.</p>
          <p>Evaluation Metric: We measure the collision rate of the AV as the number of collision frames over the total number of frames in a run. We also give the first collision frame number and total frame number as evaluation metrics. The metrics measure the driving behavior of the AV under different system configurations that a three-version system could assume. We conduct ten runs for each system configuration and report the average of the metrics.</p>
          <sec>
            <title>5. Results</title>
            <p>In this section, we discuss the experiment results to answer research questions related to the N-version perception systems. We focus on the impact of compromised models and the effectiveness of an N-version perception system to answer the defined research questions.</p>
          </sec>
          <sec>
            <title>5.1. Evaluation of Compromised ML Models</title>
            <p>First, we investigate the effects of compromised models on object detection with an AV adopting a single-version perception system. We compare the object detection results between ML models in the healthy (i.e., in state H) and compromised (i.e., in state C) states. Recall that an ML model in a non-operational state (i.e., state N) cannot produce any output. Thus, the AV cannot afford a single-version architecture with only one model in this state. Figure 3 illustrates an object detection case of (a) a healthy YOLOv5m model, which accurately detects the vehicle in front, whereas (b) its fault-injected variant, YOLOv5m_FI, produces numerous erroneous bounding boxes, which are more likely to lead the AV to potential collisions.</p>
          </sec>
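          <p>The fault-injected variants used in this comparison are obtained by random weight perturbation. The following Python sketch mimics that idea on a toy model whose weights are plain lists; it is a stand-in written for illustration and does not use or reproduce the PyTorchFI API. The function name random_weight_injection, the dictionary layout, and the choice to overwrite exactly one weight are assumptions; only the value range (-100, 300) is taken from the testbed setup in Section 4.2.</p>
          <preformat preformat-type="code">
```python
import random
from typing import Dict, List, Optional, Tuple

Weights = Dict[str, List[float]]  # layer name -> flat list of weights (toy layout)


def random_weight_injection(
    weights: Weights,
    low: float = -100.0,
    high: float = 300.0,
    rng: Optional[random.Random] = None,
) -> Tuple[Weights, str, int]:
    """Return a corrupted copy of `weights` plus the injected layer and index.

    One randomly chosen weight in one randomly chosen (non-empty) layer is
    overwritten with a value drawn uniformly from [low, high].
    """
    rng = rng or random.Random()
    layer = rng.choice(sorted(weights))          # random layer
    index = rng.randrange(len(weights[layer]))   # random position in that layer
    # Copy every layer so the healthy model stays untouched.
    corrupted = {name: list(values) for name, values in weights.items()}
    corrupted[layer][index] = rng.uniform(low, high)
    return corrupted, layer, index


if __name__ == "__main__":
    healthy = {"conv1": [0.1, 0.2, 0.3], "fc": [0.4, 0.5]}
    faulty, layer, index = random_weight_injection(healthy, rng=random.Random(0))
    print(layer, index, faulty[layer][index])
```
          </preformat>
          <p>Keeping the healthy weights unchanged and returning a separate corrupted copy mirrors the setup in which healthy and compromised versions of the same model are deployed side by side.</p>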
          <p>[Figure 3: Example of object detection using a healthy and compromised model during AV driving. (a) Using a healthy YOLOv5m model. (b) Using a compromised YOLOv5m_FI model.]</p>
          <p>Table 1 presents the average values for three metrics across ten runs, including the first collision frame, total number of frames, and collision rate. System state represents (h, c, n), where h, c, and n represent the numbers of ML models in the healthy, compromised, and non-operational state, respectively. The upper part displays the results for single-version perception systems. Notably, the healthy models consistently exhibited a 0% average collision rate, while the compromised models showed significantly higher average collision rates (more than 70%). The AV had a collision in 90% of the runs when driving with different versions of compromised models. The numbers demonstrate that the AV using compromised models tends to experience collisions from the very onset of the simulation. Figure 4 illustrates the collision and no collision cases in the simulated scenario. In our study, we primarily focus on vehicle-to-vehicle collisions. Even though there is a curve in the lane, other collisions such as vehicle-to-infrastructure collisions will not happen.</p>
          <p>[Figure 4: Collision and no collision cases in the simulated scenario. (a) No collision case. (b) Collision case.]</p>
          <p>Answer to RQ1: Compromised models of an AV perception system demonstrate a high average collision rate of more than 70%, adversely affecting AV driving safety.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>5.2. Evaluation of N-version Perception Systems</title>
      </sec>
      <sec id="sec-2-4">
        <sec id="sec-2-4-1">
          <p>Next, we present the results achieved when adopting two- and three-version perception systems. The middle part of Table 1 presents the experimental results for two-version systems. The results indicate that no collisions occurred in any run when both healthy models were executed (i.e., state (2,0,1)). In contrast, configurations (1,1,1) and (0,2,1) experienced collisions. Notably, the number of collisions in the (1,1,1) configurations was lower than in the (0,1,2) configurations, suggesting that a two-version system with one compromised model can still mitigate some collisions. The average collision rates for the AV under the system states (1,1,1) and (0,2,1) were more than 50%. Additionally, the results demonstrate that the first collision frame, when considering compromised models, was at around frame 64, which is at a very early stage of the simulation. It is an important observation related to the layout of the vehicles during the simulation scenario, in which the ego vehicle starts the simulation in movement and is relatively close to the vehicle in front of it. When two ML models disagree, and the detection of the vehicle in front is abnormal, the ego vehicle tends to have a rear collision with the vehicle in front of it. Thus, when adopting a two-version perception system, the AV would have a short time before entering a critical erroneous state, generating wrong object detection outputs after having at least one ML model compromised.</p>
          <p>The bottom part of Table 1 presents the computed results for the three-version perception system. When the system had the majority of models in a healthy state (i.e., (3,0,0) and (2,1,0)), no collisions were observed.</p>
          <p>The result indicates that a three-version perception system can effectively tolerate at least one compromised model and mask its failures when adopting the majority voting rule. For the configuration (1,2,0), where a majority of the models were compromised, collisions occurred in most runs. However, there were instances where the system successfully avoided collisions. The average collision rates for this configuration were about 30%, significantly lower than those of single-version compromised models. This observation suggests that even a system with more compromised models has the potential to prevent collisions under certain circumstances. Notably, the average first collision frame in the (1,2,0) configurations was much later compared to single-version compromised models. This is a significant observation as it demonstrates that the system can delay the onset of erroneous outputs. This delay could provide critical additional time for the AV to take evasive action, thereby possibly avoiding a collision. Finally, for the configuration (0,3,0), where all models were compromised, most runs ended in collisions, as expected due to the lack of healthy models to correct the errors. However, two out of ten runs did not result in a collision, suggesting that even with all models compromised, specific conditions within the scenario might prevent failures temporarily.</p>
          <p>Answer to RQ2: A three-version perception system can efficiently tolerate one compromised model. Even when the majority of the models are compromised, the system has the potential to prevent some collisions or delay erroneous perception outputs.</p>
          <p>5.3. Discussion</p>
          <p>The findings from three-version perception systems demonstrate the application of the N-version ML approach in improving the safety of AV. Specifically, the configurations did not result in a collision when most models were in healthy states, showing the three-version perception system’s ability to mitigate disruptions. Although collisions still occur in some configurations with the majority of compromised models, the system shows the potential to delay erroneous perception outputs that could lead to collisions. This capability is crucial in environments where even a minor delay in failure onset can provide essential time for initiating corrective actions, thereby preventing potentially catastrophic outcomes.</p>
          <p>Limitations. In this experiment, we did not consider the cost and performance overhead imposed by the redundant modules. The use of multiple ML models in the N-version perception system introduces additional computational overheads and may be costly to implement in a real vehicle. The overhead and cost can be mitigated by adjusting the number of modules activated and/or the frame rates. The perception output can also be enhanced by diversifying the input data without using multiple models [36]. Such a design optimization under
resource constraints needs to be investigated further in
future work. Besides, our current evaluation is limited to a short run on one specific map. Further experiments with more diverse driving scenarios are needed to make more general conclusions.</p>
          <p>6. Conclusions and Future Work</p>
          <p>In this study, we explored the practical application of N-version ML system architectures through experiments conducted in the autonomous vehicle simulator CARLA. By deploying two-version and three-version perception systems for object detection tasks, we investigated the effectiveness of incorporating multiple versions of both healthy and compromised ML models within the systems. Our findings demonstrate that compromised models within the perception system significantly impact the AV collision rate, with rates exceeding 70%. In addition, we observed that three-version perception systems have the potential to mitigate object detection misclassifications, tolerating one compromised model and delaying collisions when at least one healthy model remains operational. In future work, we consider evaluating other system architectures and exploring alternative decision-making mechanisms beyond simple majority voting rules that could improve N-version ML systems’ output correctness.</p>
          <p>Acknowledgments</p>
          <p>This work was supported by JST SPRING Grant Number JPMJSP2124, and partly supported by JSPS KAKENHI Grant Numbers 19K24337 and 22K17871. This work has also been supported by the German Research Council (DFG) and by the Luxembourg Fond Nationale de Recherche (FNR) through the Core Inter Project ByzRT (C19-IS-13691843).</p>
          <p>References</p>
          <p>[1] J. A. Sidey-Gibbons, C. J. Sidey-Gibbons, Machine learning in medicine: a practical introduction, BMC medical research methodology 19 (2019) 1–18.</p>
          <p>[2] H. J. Vishnukumar, B. Butting, C. Müller, E. Sax, Machine learning and deep neural network—artificial intelligence core for lab and real-world test and validation for adas and autonomous vehicles: Ai for efficient and quality test and validation, in: Intelligent systems conference (IntelliSys), 2017, pp. 714–721.</p>
          <p>[3] M. Henne, A. Schwaiger, G. Weiss, Managing uncertainty of ai-based perception for autonomous systems, in: AISafety@IJCAI, 2019, pp. 11–12.</p>
          <p>[4] M. A. Hanif, F. Khalid, R. V. W. Putra, S. Rehman, M. Shafique, Robust machine learning systems: Reliability and security for deep neural networks, in: International Symposium on On-Line Testing And Robust System Design, 2018.</p>
          <p>[5] S. Qiu, Q. Liu, S. Zhou, C. Wu, Review of artificial intelligence adversarial attack and defense technologies, Applied Sciences 9 (2019) 909.</p>
          <p>[6] A. Toschi, M. Sanic, J. Leng, Q. Chen, C. Wang, M. Guo, Characterizing perception module performance and robustness in production-scale autonomous driving system, in: IFIP International Conference on Network and Parallel Computing, Springer, 2019, pp. 235–247.</p>
          <p>[7] Apollo perception, 2019. URL: http://www.fzb.me/apollo/specs/perception_apollo_5.0.html.</p>
          <p>[8] J. M. Zhang, M. Harman, L. Ma, Y. Liu, Machine learning testing: Survey, landscapes and horizons, IEEE Transactions on Software Engineering (2020).</p>
          <p>[9] W. Wu, H. Xu, S. Zhong, M. Lyu, I. King, Deep validation: Toward detecting real-world corner cases for deep neural networks, in: Proc. of the 49th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2019, pp. 125–137.</p>
          <p>[10] R. S. Ferreira, J. Arlat, J. Guiochet, H. Waselynck, Benchmarking safety monitors for image classifiers with machine learning, in: Proc. of IEEE Pacific Rim International Symposium on Dependable Computing (PRDC), 2021, pp. 7–16.</p>
          <p>[11] F. Machida, On the diversity of machine learning models for system reliability, in: IEEE Pacific Rim Int’l Symp. on Dependable Computing (PRDC), 2019, pp. 276–285.</p>
          <p>[12] F. Machida, N-version machine learning models for safety critical systems, in: Proc. of the DSN Workshop on Dependable and Secure Machine Learning, 2019, pp. 48–51.</p>
          <p>[13] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, nature 521 (2015) 436–444.</p>
          <p>[14] L. Chen, A. Avizienis, N-version programming: A fault-tolerance approach to reliability of software operation, in: Proc. of 8th IEEE Int. Symp. on Fault-Tolerant Computing (FTCS-8), 1978, pp. 3–9.</p>
          <p>[15] Q. Wen, F. Machida, Reliability models and analysis for triple-model with triple-input machine learning systems, in: Proc. of the 5th IEEE Conference on Dependable and Secure Computing, 2022, pp. 1–8.</p>
          <p>[16] R. Olfati-Saber, J. A. Fax, R. M. Murray, Consensus and cooperation in networked multi-agent systems, Proceedings of the IEEE 95 (2007) 215–233.</p>
          <p>[17] S. Latifi, B. Zamirai, S. Mahlke, Polygraphmr, enhancing the reliability and dependability of cnns, in: Proc. of 50th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2020, pp. 99–112.
</p>
          <p>[18] J. Mendonça, F. Machida, M. Völp, Enhancing the reliability of perception systems using n-version programming and rejuvenation, in: Proc. of the 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), 2023, pp. 149–156.</p>
          <p>[19] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, V. Koltun, Carla: An open urban driving simulator, in: Proc. of the 1st Annual Conference on Robot Learning, 2017, pp. 1–16.</p>
          <p>[20] A. Mahmoud, N. Aggarwal, A. Nobbe, J. R. S. Vicarte, S. V. Adve, C. W. Fletcher, I. Frosio, S. K. S. Hari, Pytorchfi: A runtime perturbation tool for dnns, in: 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), 2020, pp. 25–31.</p>
          <p>[21] Z.-H. Zhou, Ensemble methods: foundations and algorithms, CRC press, 2012.</p>
          <p>[22] H. Xu, Z. Chen, W. Wu, Z. Jin, S. Kuo, M. R. Lyu, Nv-dnn: towards fault-tolerant dnn systems with n-version programming, in: Proc. of the 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), 2019, pp. 44–47.</p>
          <p>[23] Q. Wen, F. Machida, Characterizing reliability of three-version traffic sign classifier system through diversity metrics, in: Proc. of the 34th International Symposium on Software Reliability Engineering (ISSRE), 2023, pp. 333–343.</p>
          <p>[24] D. Hong, L. Gao, N. Yokoya, J. Yao, J. Chanussot, Q. Du, B. Zhang, More diverse means better: Multimodal deep learning meets remote-sensing imagery classification, IEEE Transactions on Geoscience and Remote Sensing 59 (2020) 4340–4354.</p>
          <p>[25] M. C. Hsueh, T. K. Tsai, R. K. Iyer, Fault injection techniques and tools, Computer 30 (1997) 75–82.</p>
          <p>[26] Y. Liu, L. Wei, B. Luo, Q. Xu, Fault injection attack on deep neural network, in: Proc. of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2017, pp. 131–138.</p>
          <p>[27] N. Piazzesi, M. Hong, A. Ceccarelli, Attack and fault injection in self-driving agents on the carla simulator – experience report, in: Computer Safety, Reliability, and Security: 40th International Conference, SAFECOMP 2021, Springer-Verlag, York, UK, 2021, pp. 210–225.</p>
          <p>[28] B. Osiński, A. Jakubowski, P. Zięcina, P. Miłoś, C. Galias, S. Homoceanu, H. Michalewski, Simulation-based reinforcement learning for real-world autonomous driving, in: IEEE international conference on robotics and automation (ICRA), 2020, pp. 6411–6418.</p>
          <p>[29] W. Gao, J. Tang, T. Wang, An object detection research method based on carla simulation, Journal of Physics: Conference Series 1948 (2021) 012163.</p>
          <p>[30] S. D. Pendleton, H. Andersen, X. Du, X. Shen, M. Meghjani, Y. H. Eng, D. Rus, M. H. Ang, Perception, planning, control, and coordination for autonomous vehicles, Machines 5 (2017) 6.</p>
          <p>[31] Q. Xiao, K. Li, D. Zhang, W. Xu, Security risks in deep learning implementations, in: IEEE Security and Privacy Workshops (SPW), 2018, pp. 123–128.</p>
          <p>[32] R. Maurice, M. Gerla, Autonomous driving: Sensor fusion for multiple sensor types, in: Proceedings of the IEEE International Conference on Intelligent Transportation Systems, IEEE, 2012.</p>
          <p>[33] I. P. Gouveia, M. Völp, P. Esteves-Verissimo, Behind the last line of defense: Surviving soc faults and intrusions, Computers &amp; Security 123 (2022) 102920. doi:10.1016/j.cose.2022.102920.</p>
          <p>[34] R. Xu, H. Xiang, X. Han, X. Xia, Z. Meng, C.-J. Chen, C. Correa-Jullian, J. Ma, The opencda open-source ecosystem for cooperative driving automation research, IEEE Transactions on Intelligent Vehicles 8 (2023) 2698–2711. doi:10.1109/TIV.2023.3244948.</p>
          <p>[35] G. Jocher, et al., ultralytics/yolov5: v5.0 - yolov5-p6 1280 models, aws, supervise.ly and youtube integrations, 2021. URL: https://doi.org/10.5281/zenodo.4679653.</p>
          <p>[36] K. Wakigami, F. Machida, T. Phung-Duc, Reliability and performance evaluation of two-input machine learning systems, in: 2023 IEEE 28th Pacific Rim International Symposium on Dependable Computing (PRDC), IEEE, 2023, pp. 278–286.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>