<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Decision making for cybersecurity</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Madeline Cheah</string-name>
          <email>madeline.cheah@cambridgeconsultants.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jack Stone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samuel Bailey</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Haubrick</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Rimmer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matt Lacey</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Dorn</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CAMLIS'24: Conference on Applied Machine Learning for Information Security</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Cambridge Consultants</institution>
          ,
          <addr-line>29 Cambridge Science Park, Milton Road, Cambridge, CB0 5DW</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Fully autonomous decision-making for cyber-defence (the ability to make expert-level defensive choices without human intervention) is desirable but challenging. This is particularly so for operational technology because of its cyber-physical nature and the need to take into account multiple dimensions of context. Our contribution is the creation and substantial extension of our co-operative decision-making framework for cyber-defence (Co-Decyber). This framework allows us to break up a large multi-contextual action space into smaller decisions for multiple agents to optimise between. We have applied this framework to a vehicle platooning scenario (the linking of two or more trucks in a convoy). This paper discusses developments since our last published work: we increase complexity by defending against a more sophisticated attack (diversion of the convoy using GPS message spoofing) using more agents. Results show that Co-Decyber agents are able to successfully defend against an attack and recover the situation. We conclude that this framework is viable and, once mature, will assist in fully autonomous cyber-defence of operational technology.</p>
      </abstract>
      <kwd-group>
        <kwd>autonomous</kwd>
        <kwd>cyber-defence</kwd>
        <kwd>cybersecurity</kwd>
        <kwd>automotive</kwd>
        <kwd>artificial intelligence</kwd>
        <kwd>multi-agent</kwd>
        <kwd>reinforcement learning</kwd>
        <kwd>autonomous cyber-defence</kwd>
        <kwd>operational technology</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Our contribution in this paper is a substantial extension of our co-operative decision-making
framework for cyber-defence (which we call Co-Decyber). To the best of our knowledge, this is the first such
framework to be applied to autonomous cyber-defence (ACD) of vehicles.</p>
      <p>
        We apply this framework to operational technology, namely an automated vehicle platoon, as a
proof-of-concept. We extend our previous work [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] in the following ways:
• Created a significantly more complex attack scenario (GPS divert) which requires co-operation
between different agents to resolve the cyber-attack. To support this, we have extended our
simulator with new NATO Generic Vehicle Architecture (NGVA) modules (see Section
3.2), creating models for modules that do not yet exist in the NGVA specification (such as the
vehicle-to-vehicle or V2V module), and created three variants of the scenario (stateful, stateless
and varied attack source – see Section 3.3) to test against our updated Co-Decyber framework.
• Trained existing agents to work with our additional modules (GPS0 and GPS1) as well as an
entirely new kind of agent (NAV_RESET) to defend against the GPS divert attack.
• Created a model evaluation framework which allows us to assess performance of different agents
systematically.
      </p>
      <p>The rest of this paper is structured as follows: we explore related work in Section 2 before expanding
on the Co-Decyber framework in Section 3, including the modelling techniques and scenarios we
considered. We then cover our training approach in Section 4 before presenting our results in Section 5.
A discussion of our work is given in Section 6, and conclusions and future directions are outlined in Section
7.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Cyber-threats are growing more sophisticated, driven by the acceleration in pace of technological
change. The threat landscape is also evolving quickly, with more systems becoming connected, more
complex and more intelligent. This is propelling research in autonomous and automated cyber-defence,
in both offensive and defensive security, covering red-teaming, intrusion detection and prevention,
incident response and security planning [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        ACD can be defined as a competent decision-making system that should be capable
of detecting an attack and responding by recovering, proving or correcting the vulnerability in real
time without human intervention [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A discussion of definitions for ACD (and its components such as
agents) may be found in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>ACD frameworks fall broadly into two categories: rule-based and machine-learning
based.</p>
      <sec id="sec-2-1">
        <title>2.1. Rule-based frameworks</title>
        <p>
          The most mature rule-based methods can be found in incident response playbooks, security incident
and event management (SIEM) and security orchestration, automation and response (SOAR) products
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. However, these are focused on enterprise defence and are typically dependent on a centralised
architecture. They also require computational resources and guaranteed connectivity that are not typically
found in operational situations. There is also academic work looking to extend these frameworks,
especially on the challenges of mission or context-awareness. For example, there has been work looking
at automating attack-defence trees using a game theoretic approach, by modelling the optimal benefits
for attackers or defenders to help decide on the best response [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>
          This work is supported by studies that deal with explicit expressions and proofs of equivalence, such
as that between game theory and attack-defence trees [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Some use graphical models, e.g., [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], which
have been precomputed to estimate possible attack paths of an ongoing attack to mitigate in real-time.
Other approaches include automating the risk assessment process (again using attack-defence trees) for
deciding on optimal security controls [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>
          These methods are all high-confidence methods, with plentiful and rigorous mathematical and
algorithmic foundations and tools for cyber-attack and defence expression. However, there are limitations
with such approaches, including high knowledge barriers required for modelling systems and
interactions, a barrier that is compounded by the fact that operational technology is most likely to be legacy,
bespoke or proprietary. Further limitations arise from the rigidity of expression and increased
abstraction from the real-world which could also lead to missing out on real-world edge cases. Furthermore,
rule-based systems can be limited in their adaptability when facing novel attacks or dynamic situations
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], which can lead to a lack of resilience.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Multi-agent approaches</title>
        <p>
          The primary alternative that the security community has explored to address such limitations is
through the use of probabilistic approaches, including machine learning. Machine learning - and in
particular reinforcement learning (RL) - has been the most widely explored due to the ability to take
into account context. Uses include red teaming (such as simulating spoofing activity or malware) and
defences, including detection of deception and preventing data injection. A comprehensive survey of
reinforcement learning in ACD may be found in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>
          Multi-agent approaches are more recent. Agents within a multi-agent reinforcement learning (MARL)
framework can be competitive, collaborative or both. There are also two broad schools of research: the
first is training MARL agents using game-theoretic approaches in adversarial settings [
          <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
          ] using
intelligent red agents to train against. The second is the optimisation of defences through the
RL exploration-versus-exploitation paradigm.
        </p>
        <p>There has been a strong growth in interest in using MARL, with many agreeing that intelligent
autonomous agents will be present in future resilience, defence and operational technology contexts
[13] and that cooperation may be more successful than single or multiple independent agents [14]. Uses
explored have included generating defence strategies [15], intrusion detection [16] and learning tactical
responses (from scratch) in IT networks [17].</p>
        <p>All approaches share similar challenges, in that optimisation is not universal and is highly dependent
on the definitions of the defensive or tactical actions available to the MARL cyber-defence, and
these definitions require input from human subject matter experts. Additionally, the environment is
non-stationary and there are sparse-reward challenges (discussed further in Section 3.4). Many of the
approaches described remain theoretical and have rarely been transferred to real-world implementations.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. The Co-Decyber framework</title>
      <p>Our approach is based on cooperative MARL, where agents attached to a unit of functionality work
together to protect the system globally (see Section 3.4 and 3.5). In this section, we describe the
underlying attack and defence paradigm that allows us to scope and define our cyber-defence work in
Section 3.1. We expand on our high-level scenario (automated vehicle platooning) in Section 3.2 before
describing our attack scenario (GPS divert) in Section 3.3. We discuss mission level impacts and the
reward structure we use in Section 3.4 before giving an overview of the entire Co-Decyber framework
in Section 3.5.</p>
      <sec id="sec-3-1">
        <title>3.1. Attack and defence modelling</title>
        <p>We use attack-defence tree methodology [18] to map out attacks and defences (Figure 1).</p>
        <p>Each defence (I, II, III etc.) can be considered an action that Co-Decyber can take. The set of defences
is therefore Co-Decyber’s action space (although due to training constraints, we have not implemented
a complete coverage of this defensive action space). Note also that since Co-Decyber is intended to
provide operational cyber-defence, design-time defences are out of scope in our current research.</p>
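As a concrete sketch of how a tree scopes the action space, the following illustrative Python (not the authors' implementation; the node structure and labels are our assumptions) enumerates the defence nodes of a toy attack-defence tree to form a defensive action space:

```python
# Illustrative only: an attack-defence tree as a nested structure, with the
# set of defences enumerated to form a Co-Decyber-style action space.
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    kind: str            # "attack" or "defence"
    gate: str = "OR"     # OR / AND / SAND
    children: list = field(default_factory=list)

def defences(node):
    """Collect every defence node: these form the defensive action space."""
    found = [node.label] if node.kind == "defence" else []
    for child in node.children:
        found += defences(child)
    return found

tree = Node("Delay convoy", "attack", "OR", [
    Node("Send false fire alarm", "attack", "AND", [
        Node("Isolate using ACL", "defence"),
        Node("Block alarm", "defence"),
        Node("Power down alarm src", "defence"),
    ]),
])

print(defences(tree))  # the defensive action space for this toy tree
```

Design-time defences would simply be omitted from the tree (or filtered out) before enumeration, matching the operational scope described above.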
        <p>We perform no computation on the tree; the attack-defence tree paradigm was chosen to help scope
our action space and communicate conceptually how we break down the decision making to give rise
to the multi-agent architecture.</p>
        <p>ATTACK GOAL: A
(–OR: I)
(–OR: II)
–OR: B
–OR: C
–OR: D
(–OR: III)
(–OR: IV)
(–OR: V)
–OR: E
–AND: F
–AND: G</p>
        <p>The advantages and disadvantages of the multi-agent architecture in the context of our framework
are discussed in more detail in Section 3.5. For ease of reading, we use an adapted notation (see Figure
2) to represent all attack-defence trees in this paper.</p>
        <p>
          As an example of how we translate between attack-defence tree and multi-agent architecture, we
provide an attack of using a false fire alarm (see Figure 3) to delay the convoy (this is the scenario we
used in an earlier study [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] to establish the Co-Decyber framework).
        </p>
        <p>ATTACK GOAL: DELAY CONVOY
–OR: Message alteration
–OR: Unauthorised publication
–AND: Hijack publisher
–AND: Send false fire alarm
(–OR: Mitigate attack)
(–OR: Isolate using ACL)
(–OR: Block alarm)
(–OR: 1 alarm)
(–OR: All alarms)
(–OR: Power down alarm src)</p>
        <p>To enable Co-Decyber’s decision making, each action is assigned to an individual but interconnected RL
agent (see Figure 4). These agents (aggregated into a node) negotiate between themselves on what the
best course of action should be. Each node is coupled with a unit of functionality to be defended, to
ensure that defensive actions are appropriate and relevant to that particular piece or type of functionality
(see Section 3.5). There is also a hierarchy of agents:
• Lower-level agents are responsible for atomic actions (e.g. to isolate or block)
• Higher level aggregation agents are responsible for co-ordinating these actions, including deciding
between these actions, based on bottom-up input from lower-level agents. The lower-level agents
essentially act as a feature extractor for the aggregation agents.</p>
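The hierarchy above can be sketched as follows; this is a minimal illustration under our own assumptions (class and action names are hypothetical, and the low-level "Q-values" are stand-ins for trained networks), not the authors' code:

```python
# Minimal sketch of a Co-Decyber node: low-level agents each score their
# atomic action, and an aggregation agent arbitrates between them, using
# those scores as its input features.
class LowLevelAgent:
    def __init__(self, action):
        self.action = action
    def q_value(self, obs):
        # Stand-in for a trained network's Q-value for this atomic action.
        return obs.get(self.action, 0.0)

class AggregationAgent:
    def decide(self, features):
        # Low-level outputs act as extracted features; here we arbitrate
        # by simply picking the highest-scoring atomic action.
        return max(features, key=features.get)

class CoDecyberNode:
    def __init__(self, actions):
        self.low_level = [LowLevelAgent(a) for a in actions]
        self.aggregator = AggregationAgent()
    def act(self, obs):
        features = {a.action: a.q_value(obs) for a in self.low_level}
        return self.aggregator.decide(features)

node = CoDecyberNode(["isolate_acl", "reset_nav", "no_op"])
print(node.act({"isolate_acl": 0.9, "reset_nav": 0.2}))
```

Because the aggregation step only consumes the low-level scores, a low-level agent can be re-composed into a different node without retraining, which is the re-use property discussed below.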
        <p>An illustrative Co-Decyber node can be seen in Figure 4.</p>
        <p>We traded off the additional complexity of a hierarchical architecture against the disadvantages of a
flat peer-to-peer architecture. The latter would have meant that:
• Information from other agents would have been delayed by one time tick. This means that the other
agents may have already taken “bad” actions because they did not have full context when they
made their own decision.
• Individual agents must know about each other. Many of the advantages of a composable
multi-agent approach would be lost, as agents must be trained as part of a specific collection.</p>
        <p>We posit that the hierarchical approach gives the best of both worlds. Aggregation agents are more
tightly coupled with the specific modules they are trained for to determine the “best” action, but the
lower-level modules have a simpler development approach and can be re-used or re-composed into
different nodes for different modules. Hierarchical agents therefore provide module-specific knowledge
and arbitrate between different low-level agents without requiring retraining of the low-level agents.</p>
        <p>Some nodes are simple (containing one RL agent), and others are complex (containing multiple RL
agents).</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Platooning vehicles</title>
        <p>The Co-Decyber framework has been developed using a high-level leader-follower scenario (Figure 5),
in which military vehicles are aiming to move cargo from a rear operating base to a forward base. The
leader vehicle is manned, and the follower vehicle has an automated system that follows the leader’s
path.</p>
        <p>We chose this scenario for the richness of the attack-defence space and its relevance to
generation-after-next technologies as well as to defence. High-level attacks and defences in the scenario are
modelled using attack-defence tree methodology to scope the attack and explore the associated defence
action space (see Section 3.1).</p>
        <p>To enable training and generate the datasets we require, we built a training simulator (Figure 6)
to represent a NATO Generic Vehicle Architecture compliant system to be defended. There are two
components to the simulator: the environment and agent inference. These components allow exploration
of different environments and agent configurations, and are extensible to take into account future
scenarios we may want to generate datasets for. These are joined using an OpenAI Gym [19] interface,
with multi-agent support coming from PettingZoo [20].</p>
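The shape of such a multi-agent interface can be sketched as below. This mirrors the dict-keyed conventions of PettingZoo's parallel API rather than reproducing the authors' environment; the agent names and observation fields are our own illustrative assumptions:

```python
# Sketch of a multi-agent environment in the PettingZoo parallel style:
# reset() and step() exchange dicts keyed by agent name.
import random

class PlatoonEnv:
    def __init__(self):
        # Hypothetical agent names for the platooning scenario.
        self.agents = ["nav_acl", "gps0_acl", "gps1_acl", "nav_reset"]

    def reset(self, seed=None):
        random.seed(seed)
        return {agent: {"attack_detected": False} for agent in self.agents}

    def step(self, actions):
        # One simulation tick: apply each agent's action, then return
        # per-agent observations, rewards and termination flags.
        obs = {agent: {"attack_detected": False} for agent in self.agents}
        rewards = {agent: 0.0 for agent in self.agents}
        terminations = {agent: False for agent in self.agents}
        return obs, rewards, terminations

env = PlatoonEnv()
observations = env.reset(seed=0)
obs, rewards, done = env.step({agent: "no_op" for agent in env.agents})
```

In the real framework each of these keys would map onto a Co-Decyber node attached to a GVA module, as described in Section 3.5.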
        <p>NGVA is a framework that introduces pluggable modules for reconfigurability (called GVA modules),
connected together using a Data Distribution Service (DDS) backbone. DDS is a publish-subscribe
system which provides shared topics on which all modules can read and/or write. A GVA module
describes a specific unit of functionality e.g. engine controls, sensor systems, navigation. We use NGVA
to inform our attack-defence trees as well as the Co-Decyber architecture.</p>
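A toy publish-subscribe bus in the spirit of DDS is sketched below, including a per-topic access control list of the kind Co-Decyber's ACL agents act on. Module, topic and method names here are illustrative assumptions, not taken from the NGVA or DDS specifications:

```python
# Toy publish-subscribe bus: modules write to shared topics that any
# subscriber can read, and a per-(module, topic) ACL entry can block a
# compromised publisher (one of the defensive levers described later).
from collections import defaultdict

class Bus:
    def __init__(self):
        self.subscribers = defaultdict(list)
        self.blocked = set()          # (module, topic) pairs denied by ACL

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, module, topic, message):
        if (module, topic) in self.blocked:
            return False              # ACL drops the message
        for callback in self.subscribers[topic]:
            callback(message)
        return True

bus = Bus()
received = []
bus.subscribe("gps_position", received.append)
bus.publish("GPS0", "gps_position", (52.2, 0.1))     # legitimate update
bus.blocked.add(("MEDIA", "gps_position"))           # defensive ACL action
bus.publish("MEDIA", "gps_position", (10.0, 10.0))   # spoofed update dropped
print(received)
```

Note that with no ACL entry, any module can publish to any topic it knows about, which is exactly the property the GPS divert attack in Section 3.3 exploits.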
        <p>
          For this paper, we have extended our simulator to include simulation of more GVA modules (in
addition to the HMI, MEDIA, ECU and SWITCH modules which we had previously modelled [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]). This
includes:
• NAV, which refers to the navigation reference module, which traditionally integrates position
information from GPS and inertial measurements. The NAV module contains two low-level
Co-Decyber RL agents: the DDS-level Access Control List (ACL) agent, and the NAV reset agent
(which manages resetting the state of the NAV module).
• GPS, which refers to the modules which perform processing on information received through the
GPS sensors. There are two GPS modules (GPS0 and GPS1), each with a single low-level
agent (ACL) attached. This agent provides control over the message flow to and from the module.</p>
        <p>An overview of the vehicle architecture as implemented in simulation is given in Figure 7.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. GPS divert scenario</title>
        <p>The attack we have modelled involves hijacking a module on the DDS network, via a malicious insider
located in the rear operating base. We assume this is possible as DDS specifications do not by default
include authentication of GVA modules. The hijacked module is then able to send a series of spoofed
navigation messages to cause a diversion of the leader vehicle and therefore a diversion of the entire
convoy (see Figure 8).</p>
        <p>ATTACK GOAL: DIVERT CONVOY
–OR: Integrity attack through DDS msg
(–OR: Correct course using other vehicle)
–SAND: Hijack node
–OR: Hijack publisher (GPS topic)
–OR: Add malicious node
–SAND: Reset ODOM or IMS
–SAND: Start GPS attack messages
(–OR: Stop writing to GPS topic)
(–SAND: Isolate using ACL)
(–SAND: Reset NAV)
(–OR: Control power to module)
(–OR: Switch off MEDIA)
(–OR: Switch off NAV)
(–OR: Isolate switch port)
(–OR: Isolate MEDIA)
(–OR: Isolate NAV)</p>
        <p>• We are targeting the DDS stream of messages between GVA modules rather than the GPS sensor
itself.
• We assume that the usually optional DDS Security Specifications [21] have been implemented
in the system to a degree, with loose global policies that allowed an insider to hijack a node.</p>
        <p>These specifications are required to enable defensive actions by Co-Decyber.</p>
        <p>
          This scenario represents significant additional complexity to our earlier work in defending against
publishing of false fire alarms [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>Firstly, detection latency is a more significant factor, since the longer the attack goes undetected,
the more adrift the vehicles will become. Implications are discussed in Section 4.2. Secondly, there are
two variations of this scenario which we needed to model in the simulator: stateless and stateful. This
refers to whether the NAV GVA module is configured with or without state:
• Stateless: The NAV module accepts new location updates regardless of how far the vehicle is from
its current position estimate, allowing for immediate correction of course.
• Stateful: The NAV module implements a simple noise filter which will reject updates that are too
far away from the current position estimate. The noise filter is a more realistic representation of
the module, as real-world position fusion systems all have various filters used to improve noise
rejection and positioning accuracy.</p>
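The contrast between the two configurations can be sketched in a few lines. This is an illustrative one-dimensional stand-in for the module (the class name, single-axis position and tolerance value are all our assumptions):

```python
# Sketch of the two NAV configurations: stateless accepts any location
# update; stateful applies a simple distance gate standing in for the
# noise filter described above. The tolerance value is illustrative.
class NavModule:
    def __init__(self, stateful, tolerance=5.0):
        self.stateful = stateful
        self.tolerance = tolerance
        self.position = 0.0          # current 1-D position estimate

    def update(self, reading):
        # Stateful: reject readings too far from the current estimate.
        if self.stateful and abs(reading - self.position) > self.tolerance:
            return False
        self.position = reading
        return True

print(NavModule(stateful=False).update(100.0))  # accepted regardless
print(NavModule(stateful=True).update(100.0))   # rejected by the filter
print(NavModule(stateful=True).update(3.0))     # small step accepted
```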
        <p>These two variants were modelled because of the change in the difficulty of defence against a
cyber-attack, and provided the necessary stepwise levels of difficulty in the curriculum learning training
regime for the MARL (this training technique is discussed more in Section 4.1.4).</p>
        <p>In the stateless form, once the attacker is prevented from writing corrupt updates to NAV, it will
recover on its own as it receives new correct data. However, if filtering is implemented (as per the
stateful configuration), it is likely that the filters will reject the new correct updates, because the stored
estimate has already been dragged away from the true position. The recovery in
this case requires cooperation between multiple agents, first to block the corrupt data, and then to
recover the NAV module state, for example, by resetting. This can also be seen in the attack-defence
tree modelling with the addition of the SAND (Sequential AND) gate as a defence to hijack of publisher
in Figure 8.</p>
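The cooperative recovery sequence can be illustrated as follows; this is a toy one-dimensional model under our own assumptions (drift step, tolerance and the reset mechanism are invented for illustration), not the simulator itself:

```python
# Illustrative recovery in the stateful variant: after the attacker has
# dragged the estimate away, blocking alone is not enough (the filter now
# rejects the true position), so a second agent must reset the module.
class StatefulNav:
    def __init__(self, tolerance=5.0):
        self.position = 0.0
        self.tolerance = tolerance
        self.has_state = True

    def update(self, reading):
        if self.has_state and abs(reading - self.position) > self.tolerance:
            return False             # noise filter rejects the reading
        self.position = reading
        self.has_state = True
        return True

    def reset(self):
        self.has_state = False       # next update accepted unconditionally

nav = StatefulNav()
for _ in range(10):                  # attacker drifts the estimate gradually
    nav.update(nav.position + 4.0)   # each step is inside the tolerance
print(nav.update(0.0))               # True position now rejected: False
nav.reset()                          # NAV_RESET agent's action, after ACL block
print(nav.update(0.0))               # Module re-accepts the truth: True
```

The ordering matters: resetting before the corrupt stream is blocked would simply let the attacker re-poison the state, which is why the tree uses a SAND gate.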
        <p>From this, we can conclude that the stateful variant is the most representative of a situation in which
we would want to see cooperative MARL. As such, we will be discussing results, ramifications and
extensions using the stateful variant.</p>
        <p>We had assumed in earlier work that the hijacked module would most likely be a low-effort but
connected pivot point in the vehicle system, such as the MEDIA module. However,
the hijacked module could be any of the GVA modules on the system, which could be attacked through
various mechanisms.</p>
        <p>Therefore, in addition to the stateful and stateless variants above, we have also created another
scenario variation called varied attack source, whereby the GPS divert attack could originate from
MEDIA, GPS1 or GPS0. This also adds another level of variation to our datasets, which
allows us to address the question of generalisation and reduce the risk of overfitting.</p>
        <p>The mission impact of such an attack is also dependent on the level of location drift being broadcast
by the hijacked node and detection latency. Implications of mission impact are discussed below in
Section 3.4.</p>
        <p>We use a stochastic hard-coded attacker in this work with varying parameters including:
• Time-to-detect
• Attacker location in the vehicle
• Attack start time
• Noise in the environment</p>
        <p>This is written as a simple state machine which is both responsible for launching the attack(s) against
the vehicle and updating the IDS. From the environment’s point of view, the attacker is implemented as
an additional agent.</p>
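A state machine of this kind might look like the sketch below; the states, parameter ranges and module names are our illustrative assumptions rather than the authors' values:

```python
# Sketch of a stochastic hard-coded attacker as a small state machine with
# randomised parameters (start time and location, per the list above).
import random

class Attacker:
    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.start_time = rng.randint(5, 20)              # attack start time
        self.location = rng.choice(["MEDIA", "GPS0", "GPS1"])  # varied source
        self.state = "DORMANT"

    def tick(self, time):
        # Transition once the randomised start time is reached, then emit
        # a spoofing action every tick (the environment treats this as an
        # additional agent's action).
        if self.state == "DORMANT" and time >= self.start_time:
            self.state = "ATTACKING"
        if self.state == "ATTACKING":
            return {"spoof_gps": True, "source": self.location}
        return None                   # no action this tick

attacker = Attacker(seed=1)
actions = [attacker.tick(t) for t in range(30)]
```

Detection latency and environmental noise would be further randomised parameters layered on top of this skeleton.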
        <p>We did not use intelligent ‘red’ agents, as a trade-off against the complexity of training a MARL system
and because balancing two systems learning against each other could create unwanted behaviours.
For example, the red agent and Co-Decyber could become hyper-specialised to defeating each other
rather than protecting generalised systems, or may otherwise game the system in unexpected ways.
We further explore implications (including overfitting) in Section 4.1.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Mission level impacts and reward structure</title>
        <p>From a cyber-defence point of view, attacker and defender have system level goals (attack target vehicle
and mitigate attack). However, a “good” choice made by Co-Decyber can only be judged based on the
mission level.</p>
        <p>At system level, sending GPS divert messages affects our vehicle by compromising the navigation,
but there is a higher level of adversarial intent, which we can assume here to be the possible intent to
divert the cargo to an alternative enemy-desired destination (i.e. a mission level impact).</p>
        <p>This mission level impact is the end-goal attacker and defender wish to achieve in a given context.
This is important to consider due to three factors:
• Cascading impacts through the system: achieving only system-level defender goals could still
result in mission failure (e.g., excessive security processing leading to a denial of service)
• With the multi-agent architecture, a dense reward structure (and having to consider the value
of every action and associated reward for each timestep of a simulation for each agent) would
be challenging, with no guarantee that the path of actions leading to the highest dense reward
corresponds to the best mission objective.
• Mission level goals should be the highest priority and therefore should form the foundation of
the reward structure. The more of a defender’s mission that is achieved, the better the choices
that Co-Decyber is judged to have made. Note that although mission-level goals have priority,
system-level goals may still have influence (e.g., a small delay at the system level before defensive
actions kick in).</p>
        <p>To represent the situation above, a sparse reward structure is used. This allows the models to optimise
for best mission outcome without putting artificial constraints on the process, and directly incentivises
agents towards the desired mission objective. We note that sparse rewards representing mission-level
feedback may provide too little signal to the RL agents. We mitigate this using an offline training
approach (see Section 4.1).</p>
        <p>The reward structure for our work is shown in Table 1.</p>
        <p>The total reward is a sum of whichever conditions hold true, for example:
• One vehicle that arrives without cargo (on time or late) is given a 2.</p>
        <p>• Two vehicles arriving delayed, with one cargo intact, would be given 6.</p>
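Consistent with the worked examples above, the reward sum might be sketched as below. The per-condition values (2 per arriving vehicle, 2 per intact cargo) are inferred from those examples; the full Table 1 structure is assumed, not reproduced:

```python
# Sparse mission-level reward sketch: a sum of whichever conditions hold
# true at the end of the episode. Values inferred from the worked examples
# (one vehicle arriving = 2; two delayed vehicles + one cargo = 6).
def mission_reward(vehicles_arrived, cargo_intact):
    return 2 * vehicles_arrived + 2 * cargo_intact

print(mission_reward(vehicles_arrived=1, cargo_intact=0))  # 2
print(mission_reward(vehicles_arrived=2, cargo_intact=1))  # 6
```

Because this is only evaluated at mission end, agents receive no per-timestep feedback, which is the sparse-reward property the offline training approach in Section 4.1 is designed to handle.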
        <p>Cargo is simulated as having a percentage chance of being stolen if the delay or diversion is significant
enough. These numbers are currently arbitrary. As we move towards the real-world implementation,
we envisage adding more real-world metrics (such as whether defensive actions would affect availability
of the system) as well as a less binary interpretation between different outcomes (to represent scalar
differences such as being a little bit late versus extremely late).</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Generic framework and architecture</title>
        <p>Vehicles contain distributed systems. As such, we chose an architecture for Co-Decyber that reflects this
distribution (see Figure 9), with a Co-Decyber node coupled to and defending a unit of functionality (i.e.
a GVA module). Each GVA module therefore contains a Co-Decyber node, with each node containing
one or more RL agents.</p>
        <p>The framework also allows for communication not just between RL agents within a node, but also
between nodes themselves. This not only allows for cyber-defence at the individual module level, but
globally. As further work, we also plan to extend the framework so that communication can also happen across
systems, thereby providing protection to the entire platoon. The decentralised nature of Co-Decyber
makes coordination between agents more difficult; however, it also makes the framework more robust,
since defeating Co-Decyber now involves compromising or disabling many more nodes.</p>
        <p>We have also made design choices that would ensure a smoother transition between simulation and
the real world as soon as the concept matures enough for that transfer. At a high level, this means that
we:
• Ensure that partial coverage is accounted for (e.g. where only some system elements are covered by
an agent, or where it may not be possible to implement Co-Decyber on certain system elements).
• Ensure that Co-Decyber can handle changes in available modules (i.e. modules may be added or
removed from the system over time, or may be compromised or isolated because
of a defensive action).
• Ensure that attack detection reflects some element of real-world performance (i.e. that it is not
perfect).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Training</title>
      <sec id="sec-4-1">
        <title>4.1. Training approach</title>
        <p>In this section we describe our training approach in Section 4.1 and expand on how our RL agents fit
within the training framework in Section 4.2.</p>
        <p>For this work we have used Q-learning with Deep Q Networks (DQNs). DQNs have the ability to
generalise and tackle large state spaces. Q-learning has also been studied in a MARL context providing
further support for its use. DQNs are also well established in both research and industrial domains. All
the above makes this approach a good foundation for development.</p>
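For readers unfamiliar with the update these agents are built on, a minimal tabular Q-learning step is sketched below; a DQN replaces the table with a neural network but applies the same temporal-difference target. State and action names are our illustrative assumptions:

```python
# The Q-learning update underlying the DQN agents, shown in its minimal
# tabular form: move Q(s, a) toward reward + gamma * max_a' Q(s', a').
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.99):
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])

q = defaultdict(float)               # table of Q-values, default 0.0
actions = ["isolate", "reset", "no_op"]
q_update(q, "under_attack", "isolate", 1.0, "recovered", actions)
print(q[("under_attack", "isolate")])
```

With a sparse reward, this table would remain zero on most transitions, which is why the offline experience database described next is used to replay the rare rewarded episodes many times.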
        <p>To manage system and training complexity, we have adopted several novel strategies regarding
training methodology: offline RL, agent matrix training, simulation generation and curriculum learning.
These are described below.
4.1.1. Offline RL
The offline RL process splits work into two decoupled phases (Figure 10) [22]. Firstly, the scenario
exploration phase, where different situations are set up and agents are allowed to explore. The logs of these
explorations are collected into an “experience database.” During these simulation runs, the parameters
of the agent’s network are frozen, meaning that they do not learn during this phase.</p>
        <p>Next comes the training phase, during which agents learn. In this phase, the exploration logs in the
experience database are curated and used to train an improved variant of an agent.</p>
        <p>These two phases are run in an iterative fashion to improve agents over time. By decoupling the two
phases, the agent has a large and diverse database to learn from. There are several advantages [23]:
• Firstly, this approach reduces the interface requirements between simulation and model training
systems; this is unlike an online RL approach where both systems are tightly coupled and must
run simultaneously, which leads to a training system sitting idle whilst exploration progresses.
• Secondly, the experience database can, in efect, be used as a labelled dataset (and reused as
needed). This also helps to mitigate the increased amount of exploration needed to deal with a
sparse reward structure as mentioned in Section 3.4).
• Finally, the format, size and quality of the data within this database is flexible depending on
the scenarios explored. Furthermore, optimisation of policies can happen through policy
guidance from this ofline experience, without having to perform exceptionally large numbers of
explorations. This also mitigates against sparse reward issues.</p>
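The two decoupled phases can be sketched as below. This is an illustrative toy example, not the Co-Decyber implementation: the environment dynamics, the tabular Q-function, and all numeric values are hypothetical stand-ins for the real simulator and DQN.

```python
import random

ACTIONS = (0, 1)

def explore(policy, steps, db, epsilon=0.3, seed=0):
    """Scenario-exploration phase: the frozen policy acts (with some random
    exploration) and every transition is logged to the experience database.
    No learning happens here."""
    rng = random.Random(seed)
    state = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:
            action = policy.get(state, 0)
        next_state = (state + action + 1) % 4      # toy environment dynamics
        reward = 1.0 if next_state == 3 else 0.0   # sparse, goal-like reward
        db.append((state, action, reward, next_state))
        state = next_state

def train(q, db, alpha=0.5, gamma=0.9):
    """Training phase: replay the whole experience database offline to
    improve the Q-function."""
    for s, a, r, s2 in db:
        best_next = max(q.get((s2, b), 0.0) for b in ACTIONS)
        q.setdefault((s, a), 0.0)
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])

# Iterate the two phases: explore with the current frozen policy,
# train offline, then derive an improved policy for the next iteration.
q, db, policy = {}, [], {}
for iteration in range(3):
    explore(policy, steps=50, db=db, seed=iteration)
    train(q, db)
    policy = {s: max(ACTIONS, key=lambda a: q.get((s, a), 0.0))
              for s in range(4)}
```

Note that the database only ever grows: later training iterations reuse all earlier exploration logs, which is the "reused as needed" property of the labelled-dataset view above.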
        <sec id="sec-4-1-1">
          <title>4.1.2. Matrix training</title>
          <p>Agents in a MARL architecture can learn to rely on each other too much, which leads to a fragile system
[24]. We mitigate against this by ensuring that any given agent is trained against a variety of other
agents, so that it does not become overly dependent on any particular behaviour.</p>
          <p>In practice, this means that we created training datasets which include a variety of different
configurations and versions of agents.</p>
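The matrix idea can be sketched as enumerating every line-up of peer variants a target agent trains against. The role and variant names below are purely illustrative, not the actual Co-Decyber agent inventory.

```python
import itertools

# Hypothetical agent variants per role; the names are illustrative only.
variants = {
    "ISOLATE":  ["isolate_v1", "isolate_v2"],
    "FIREWALL": ["firewall_v1", "firewall_v2", "firewall_v3"],
    "POWER":    ["power_v1", "power_v2"],
}

def matrix_pairings(target_role, variants):
    """Enumerate every combination of peer-agent variants the target agent
    is trained against, so it cannot overfit to one partner's behaviour."""
    peer_roles = sorted(r for r in variants if r != target_role)
    combos = itertools.product(*(variants[r] for r in peer_roles))
    return [dict(zip(peer_roles, combo)) for combo in combos]

pairings = matrix_pairings("ISOLATE", variants)  # 3 FIREWALL x 2 POWER = 6
```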
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.3. Simulation generation for high-quality data</title>
          <p>The simulation environment has many different parameters that can be used to vary behaviour, broadly
categorised into:
• Environmental (e.g. attacker task, vehicle locations, GPS noise, etc.).
• Difficulty (e.g. NAV error tolerance, how long a fire lasts before damage occurs, etc.).
• Architectural (e.g. which agents are deployed or where an attacker is located).
• Agent (e.g. which agent variant is used as the deployed agent).</p>
          <p>We use a generation system that samples different combinations of parameters to generate many
different versions of a given scenario. We can then set an upper bound on the number of samples to
draw from the available combinations. To streamline this process, we have standardised the templates used
to set exploration ranges for these parameters, which means that whenever new data is generated it
includes a variety of different situations for the agent to learn from.</p>
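The sampling step can be sketched as building the full parameter grid from a template and drawing a bounded random sample from it. The template below is a hypothetical example; the real parameter names, ranges and template format are project-specific.

```python
import itertools
import random

# Illustrative parameter template; the actual names, ranges and categories
# (environmental, difficulty, architectural, agent) are project-specific.
template = {
    "attacker_task":     ["gps_divert", "false_fire_alarm"],
    "gps_noise":         [0.0, 0.5, 1.0],
    "nav_tolerance_m":   [5, 10],
    "attacker_location": ["GPS0", "GPS1"],
    "agent_variant":     ["v1", "v2", "v3"],
}

def generate_scenarios(template, max_samples, seed=0):
    """Build the full grid of parameter combinations, then draw at most
    max_samples distinct combinations as scenario variants."""
    keys = sorted(template)
    grid = [dict(zip(keys, combo))
            for combo in itertools.product(*(template[k] for k in keys))]
    return random.Random(seed).sample(grid, min(max_samples, len(grid)))

# 2 * 3 * 2 * 2 * 3 = 72 combinations; we cap the draw at 50.
scenarios = generate_scenarios(template, max_samples=50)
```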
        </sec>
        <sec id="sec-4-1-3">
          <title>4.1.4. Curriculum learning</title>
          <p>This is an RL training approach where agents are presented with incrementally increasing difficulty, to
help them learn more complex behaviour [25].</p>
          <p>We implement this curriculum across multiple training sessions rather than in a single run, starting
with simple scenarios (e.g. beginning with a false fire alarm, moving on to a stateless GPS divert, then to a
stateful one with varied attack sources). The data from those agents can then be collated into a training
set to train a new set of agents that can learn more complex behaviour, building on the behaviour of
the previous agents. This incremental approach allows us to assess how well agents are learning, as we
can compare the new agents with prior versions to see if they can tackle more situations.</p>
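The stage-by-stage collation described above can be sketched as follows. The stage names mirror the examples in the text; the `explore_stage` and `train_agent` callables are toy stand-ins for the real exploration and training pipeline.

```python
# Curriculum stages in increasing difficulty, following the text above.
curriculum = ["false_fire_alarm", "stateless_gps_divert",
              "stateful_gps_divert", "varied_attack_sources"]

def run_curriculum(stages, explore_stage, train_agent):
    """Collate each stage's exploration logs with all earlier experience,
    then train a new agent generation on the combined training set."""
    training_set, generations = [], []
    for stage in stages:
        training_set.extend(explore_stage(stage))       # collate new data
        generations.append(train_agent(stage, list(training_set)))
    return generations

generations = run_curriculum(
    curriculum,
    explore_stage=lambda stage: [(stage, "log")],        # toy exploration
    train_agent=lambda stage, data: (stage, len(data)),  # toy training
)
```

Each generation trains on everything gathered so far, so the final agents see data from all four stages, which is what lets them build on the behaviour of their predecessors.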
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Co-Decyber agents</title>
        <p>Co-Decyber is designed to be composable, so that different agents can be combined to form different
types of Co-Decyber nodes.</p>
        <p>The input data for these agents comes from our representative intrusion detection system (IDS). This
is implemented as a publication of detection messages as part of our attack model and is hard-coded
into our scenarios to reduce integration issues and the need to develop an actual detection system.</p>
        <p>Attack detection systems in operational technology are still generally of low maturity [26] and, because
of this, there are significant unknowns around typical IDS concerns such as false positive rates and
detection latency. Because the IDS serves as input to Co-Decyber, any changes to the IDS and its
performance could have a significant impact on output. As such, as part of this work, we modelled
different qualities of attack detection system (such as varying the time of detection) and conducted
more simulation runs to cover these states to mitigate the risk. This also helps to address any potential
overfitting risk.</p>
        <p>Every low-level agent is a fully connected neural network. The agent takes input from our
representative IDS, and has two types of output:
• The policy function Q(s, a) ∀a ∈ A, where the Q function describes the discounted expected
future reward from taking action a in state s. By choosing max_a Q(s, a) the expected reward is
maximised. The agent does not calculate the state explicitly; instead, it is implicit in the policy
function that the agent returns.
• The situation assessment output, which attempts to predict the total reward the agent will receive
at the end of the scenario. By adding this output, the agent is expected to learn a long-term
assessment of the situation.</p>
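A minimal sketch of this two-headed architecture is given below, assuming a single shared fully connected layer feeding a Q head and a situation-assessment head. The layer sizes, activation and pure-Python forward pass are illustrative stand-ins, not the actual network.

```python
import math
import random

def make_agent(n_obs, n_actions, hidden=8, seed=0):
    """Build random weights for a shared trunk and two output heads."""
    rng = random.Random(seed)
    def weights(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                for _ in range(rows)]
    return {"trunk": weights(hidden, n_obs),      # shared hidden layer
            "q_head": weights(n_actions, hidden),  # Q(s, a) per action
            "sa_head": weights(1, hidden)}         # total-reward prediction

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def forward(agent, obs):
    """One shared trunk feeds both heads: the per-action Q-values, and a
    scalar predicting the total reward at the end of the scenario."""
    hidden = [math.tanh(h) for h in matvec(agent["trunk"], obs)]
    q_values = matvec(agent["q_head"], hidden)
    situation = matvec(agent["sa_head"], hidden)[0]
    return q_values, situation

agent = make_agent(n_obs=4, n_actions=3)
q_values, situation = forward(agent, [1.0, 0.0, 0.0, 1.0])
best_action = max(range(len(q_values)), key=lambda a: q_values[a])
```

Choosing `best_action` this way is the max over Q(s, a) described above; the state never appears explicitly, only the observation fed into the trunk.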
        <p>The agents and their corresponding defensive actions that we have currently defined are ACL,
ISOLATE, FIREWALL, POWER, NAV_RESET. An outline of their actions is given below in Table 2.</p>
        <table-wrap id="tbl2">
          <label>Table 2</label>
          <caption><p>Outline of the Co-Decyber agents and their defensive actions.</p></caption>
          <table>
            <thead>
              <tr><th>Agent</th><th>Description</th></tr>
            </thead>
            <tbody>
              <tr>
                <td>Access Control List (ACL)</td>
                <td>One of the security extensions to DDS which allows the system designer to
create a policy for which topics a module is allowed to access (both read and
write). It is assumed that it would be possible to extend this support to allow
additional block rules to be injected at run time to provide additional security.
This agent can block all messages from a particular participant to the GVA
module it is attached to. It can also block a particular participant from writing
to a specified topic from the GVA module’s point of view.</td>
              </tr>
              <tr>
                <td>ISOLATE</td>
                <td>Disables or enables the switch port for the GVA module that it is attached to.</td>
              </tr>
              <tr>
                <td>FIREWALL</td>
                <td>Blocks all messages from a specified source address, or blocks messages from
a specified source address to a specified destination address.</td>
              </tr>
              <tr>
                <td>POWER</td>
                <td>Removes power from any GVA module.</td>
              </tr>
              <tr>
                <td>NAV_RESET</td>
                <td>Allows Co-Decyber to reset the internal state of the NAV module. This is
important because the navigation module contains the position fusion algorithm.
If the attacker manages to corrupt the state of this module by injecting false
data, then the vehicle may drive to the wrong location. By resetting the internal
state of this module (once the false message stream has been isolated) the
vehicle should recover its position and return to its proper course.</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>There is a substantial combinatorial challenge when assessing agent performance because there are
so many models, in so many locations in the system, with many simulation variations and variants of
scenarios. To address this, we have created an evaluation process which allows us to look at performance
of these agents systematically.</p>
      <p>The model evaluation flow (per scenario) begins with running the control setup, in which all defences
that Co-Decyber may action are disabled. Then, for each low-level model, we select the best performing variant to
run the single low-level agent setup. For each aggregation model, we select high-performing low-level
models and run the aggregation agent setup. For multi-node scenarios, we select high-performing
models (which can be low-level or aggregation models, depending on whether it is a single-agent
Co-Decyber node or a multi-agent Co-Decyber node) and run the multi-node setup. An overview can be
seen in Figure 11.</p>
      <p>Each model is assessed on the average peak reward it received, which is calculated as:
• For any given run of the simulator, we choose the maximum reward achieved on any single step.
• Across multiple equivalent runs, we average them together.</p>
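The two-step calculation above can be written directly; the reward traces in the usage line are made-up numbers for illustration.

```python
def average_peak_reward(runs):
    """runs: one reward-per-step trace for each equivalent simulator run.
    Peak = maximum reward achieved on any single step of a run;
    the score is the mean of those peaks across equivalent runs."""
    peaks = [max(trace) for trace in runs]
    return sum(peaks) / len(peaks)

# Three equivalent runs whose peak rewards are 5, 5 and 4.
score = average_peak_reward([[0, 2, 5, 5], [0, 1, 5], [0, 3, 4]])
```

Because only the per-run peak is kept, a simulation that simply runs longer accrues no advantage, which is the property the text calls out.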
      <p>This analysis approach is appropriate because of the mission-level reward structure in use. It was
also chosen to avoid issues where longer-running simulations achieved higher rewards simply because
they ran for a longer period of simulated time. Initial results are presented as a simple mean across
different equivalent runs, since the standard deviation of these results is low.</p>
      <p>Single-agent nodes, i.e., nodes that contain only one of ISOLATE, POWER CONTROL, ACL and
FIREWALL, all performed well within the attack scenario, gaining the maximum reward of 5 (for a
vehicle arriving on time with its cargo) 100% of the time.</p>
      <p>We have also proven that multiple nodes can work together to defend against the attack (ISOLATE
in one node and NAV_RESET in another), even when those attacks originate from varying parts of the
system (labelled as “Varied Attack Scenario” or VAS – see Section 3.3). These results can be seen in
Figure 12 and Table 3 for one vehicle.</p>
      <p>We have also visualised the defence using track data taken directly from the Co-Decyber environment,
where we can clearly see that without Co-Decyber, the GPS spoof messages cause the estimate of the
vehicle's position to be far adrift of its actual position, which diverts the vehicle to an attack destination
(Figure 13).</p>
      <p>With Co-Decyber (this time using the FIREWALL and NAV_RESET agents), the attack occurs (which
causes the red dotted blip), but the defence system successfully recovers the situation, and the vehicle
continues towards its intended destination (Figure 14).</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>We have shown that the Co-Decyber framework is successful in defending against an attack. Extending
our earlier work, Co-Decyber is in fact also capable of defending against different kinds of attacks,
with generalised agents able to handle both a false fire alarm and the GPS divert attack. However, there
are still challenges. Here we discuss two aspects: issues around generalisation (Section 6.1) and, briefly,
real-world considerations (Section 6.2).</p>
      <sec id="sec-6-1">
        <title>6.1. Generalisation</title>
        <p>We recognise generalisation as a significant challenge, both in this work and in MARL research more
generally. One of the first behaviours learned by all agents was what we called “pre-blocking”, whereby
agents had learned to disable or block the GVA module and its connections at T0 (before any actions
could take place). This meant the attack never had a chance to work, making subsequent learning
by other agents unnecessary. This ultimately led to agents that learnt to do nothing.</p>
        <p>From a cyber-defence point of view, this may be useful in that it successfully prevents an attack
from happening, but from a MARL point of view, it meant we could not demonstrate the full power
of co-operative decision making. This is a difficult problem to balance, because the agents are learning a
valid policy, which we do not want to completely invalidate. However, it is also a static policy. We have
addressed this through a combination of techniques.</p>
        <p>Firstly, we improved our datasets, particularly in relation to data sparsity, since ‘interesting’ events
could be drowned out by the ‘nothing happening’ events within our simulation.</p>
        <p>Secondly, we addressed dataset imbalance. In total there are 288 parameter combinations, of which we
take 50 random samples. Depending on the specific model and training set combination, there can be
significant sensitivity to hyperparameter selection. It is not unusual for only 1 of the 50 combinations
to give good results. We therefore also re-evaluated our hyperparameter selections to ensure that we
maximised the chances of gaining a good outcome. Some examples include:
• Removing very high values of gamma (0.999), since they never gave performant results.
• Changing the balance of weights for the model’s two output heads (the Q function and the total
reward prediction) so that the two heads are weighted equally. We had expected this
configuration to sacrifice some Q function performance in favour of the total reward head;
however, we found this was not the case. When using the balanced loss function we got good
performance from both output heads.
• Lowering the target model update frequency, since higher frequencies can lead to worse results
(likely because of instability). We tried two values, updating every 10 epochs and every 20 epochs. Using the lower
frequency (every 20 epochs) resulted in better Q function loss values, likely because the network
had longer to learn the distribution before it was updated again, leading to fewer spikes. Updating every
10 epochs also produces good results, but with more variability in loss values.</p>
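The balanced two-head loss can be sketched as a weighted sum of per-head mean squared errors. This is an assumed formulation: the text specifies equal weighting of the two heads but not the exact loss terms, so the MSE choice and the function below are illustrative.

```python
def two_head_loss(q_pred, q_target, total_pred, total_target,
                  w_q=0.5, w_total=0.5):
    """Weighted sum of the two output heads' errors. Equal default weights
    correspond to the balanced configuration described above; shifting the
    weights would favour one head over the other."""
    q_mse = sum((p - t) ** 2 for p, t in zip(q_pred, q_target)) / len(q_pred)
    total_mse = (total_pred - total_target) ** 2
    return w_q * q_mse + w_total * total_mse

loss = two_head_loss([0.0, 1.0], [1.0, 1.0], 2.0, 3.0)
```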
        <p>We also increased the observation space. For example, we added the observations of the current
isolation state of all modules (in addition to the IDS detection) to any given ISOLATE agent. This
allowed the agent to learn IF-THEN policies combining detection and isolation state; for example, if
the IDS detects an attack on GPS0, and GPS0 is not isolated, then isolate GPS0. We extended this to
POWER and FIREWALL.</p>
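The kind of IF-THEN policy described above can be written by hand to make it concrete; this is a hand-coded illustration of the behaviour the ISOLATE agent learned, not the agent itself, and the module names are examples.

```python
def isolate_rule(detections, isolation_state):
    """If the IDS reports an attack on a module and that module is not
    already isolated, isolate it; otherwise leave it alone."""
    return sorted(module for module, attacked in detections.items()
                  if attacked and not isolation_state.get(module, False))

# GPS0 is under attack and not yet isolated, so it should be isolated.
to_isolate = isolate_rule(
    detections={"GPS0": True, "GPS1": False, "NAV": False},
    isolation_state={"GPS0": False, "GPS1": False, "NAV": False},
)
```

Adding the isolation state to the observation is what lets the agent condition on both facts at once; with the IDS detection alone, the "and not already isolated" half of the rule is unlearnable.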
        <p>Finally, where the state space was particularly large (specifically the ACL agent where there are
many more deployments in the vehicle than other agents), we simplified our approach by training ACL
agents for specific nodes (Figure 15).</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Real world considerations</title>
        <p>Finally, for the simulation-to-reality transition, Co-Decyber shows promise in terms of edge deployment.
Our trained system currently runs at 10 Hz (equivalent to 10 decisions per second). We aim to
maintain this decision rate as we continue to move from simulation into the real world, targeting a bare
CPU deployment (without GPUs or accelerators) as the compute envelope.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and future work</title>
      <p>We are planning to extend this work in several ways. Firstly, we envisage introducing more attack
scenarios as part of curriculum learning, including modelling denial of service on the vehicle-to-vehicle
link, and attacking the time-distance sensor to introduce the agents to physical recovery actions
(as opposed to cyber-actions such as isolating or firewalling), including slowing down and braking. This
would allow us to continue to explore generalisation and to learn how such a framework may work with
the underlying vehicle controls.</p>
      <p>Secondly, we will be updating our reward structure to include a “system availability” metric. This
reward should encourage agents to keep as much of system functionality enabled as possible. We intend
to implement this by extending our simulator to include tracking which parts of the system are online
and able to communicate during a simulation run.</p>
      <p>Finally, we are roadmapping possibilities for progressively making the system more realistic, including
the use of digital twins as a next step, through to implementing the framework in representative, real-world
scaled vehicles. This will give us greater knowledge and understanding of edge deployment, physical
requirements, and potential real-world interactions (albeit in a demonstration environment).</p>
      <p>In conclusion, we have shown that Co-Decyber as a framework is able to deal with our more complex
GPS divert cyber-attack and is also able to generalise across both this simulated attack and the false fire
alarm attack in our earlier work. This demonstrates the viability of using such a framework to assist in
autonomous cyber-defence of vehicles.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This research is funded by Frazer-Nash Consultancy Ltd. on behalf of the Defence Science and
Technology Laboratory (Dstl), an executive agency of the UK Ministry of Defence. The research forms part of
the Autonomous Resilient Cyber Defence (ARCD) project within the Dstl Cyber Defence Enhancement
programme.</p>
      <p>[13] A. Kott, Intelligent autonomous agents are key to cyber defense of the future army networks, The
Cyber Defense Review 3 (2018) 57–70.
[14] A. Wilson, R. Menzies, N. Morarji, D. Foster, M. C. Mont, E. Turkbeyler, L. Gralewski,
Multi-agent reinforcement learning for maritime operational technology cyber security, arXiv preprint
arXiv:2401.10149 (2024).
[15] Y. Tang, J. Sun, H. Wang, J. Deng, L. Tong, W. Xu, A method of network attack-defense game and
collaborative defense decision-making based on hierarchical multi-agent reinforcement learning,
Computers &amp; Security 142 (2024) 103871.
[16] I. A. Saeed, A. Selamat, M. F. Rohani, O. Krejcar, J. A. Chaudhry, A systematic state-of-the-art
analysis of multi-agent intrusion detection, IEEE Access 8 (2020) 180184–180209.
[17] J. Wiebe, R. A. Mallah, L. Li, Learning cyber defence tactics from scratch with multi-agent
reinforcement learning, arXiv preprint arXiv:2310.05939 (2023).
[18] B. Kordy, S. Mauw, S. Radomirović, P. Schweitzer, Foundations of attack–defense trees, in: Formal
Aspects of Security and Trust: 7th International Workshop, FAST 2010, Pisa, Italy, September
16-17, 2010. Revised Selected Papers 7, Springer, 2011, pp. 80–95.
[19] G. Brockman, Openai gym, arXiv preprint arXiv:1606.01540 (2016).
[20] J. Terry, B. Black, N. Grammel, M. Jayakumar, A. Hari, R. Sullivan, L. S. Santos, C. Diefendahl,
C. Horsch, R. Perez-Vicente, et al., Pettingzoo: Gym for multi-agent reinforcement learning,
Advances in Neural Information Processing Systems 34 (2021) 15032–15043.
[21] Object Management Group, About the dds security specification version 1.1, 2018. URL: http://www.omg.org/
spec/DDS-SECURITY.
[22] R. F. Prudencio, M. R. Maximo, E. L. Colombini, A survey on offline reinforcement learning:
Taxonomy, review, and open problems, IEEE Transactions on Neural Networks and Learning
Systems (2023).
[23] H. Kiyohara, K. Kawakami, Y. Saito, Accelerating offline reinforcement learning application in
real-time bidding and recommendation: Potential use of simulation, arXiv preprint arXiv:2109.08331
(2021).
[24] L. Buşoniu, R. Babuška, B. De Schutter, Multi-agent reinforcement learning: An overview,
Innovations in multi-agent systems and applications-1 (2010) 183–221.
[25] S. Narvekar, B. Peng, M. Leonetti, J. Sinapov, M. E. Taylor, P. Stone, Curriculum learning for
reinforcement learning domains: A framework and survey, Journal of Machine Learning Research
21 (2020) 1–50.
[26] N. Jeffrey, Q. Tan, J. R. Villar, A review of anomaly detection strategies to detect threats to
cyber-physical systems, Electronics 12 (2023) 3283.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cheah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Haubrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bailey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rimmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Till</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lacey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kruczynska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dorn</surname>
          </string-name>
          , Co-decyber:
          <article-title>Co-operative decision making for cybersecurity using deep multi-agent reinforcement learning</article-title>
          ,
          <source>in: European Symposium on Research in Computer Security</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>628</fpage>
          -
          <lpage>643</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dhir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hoeltgebaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Briers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Burke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <article-title>Prospective artificial intelligence approaches for active cyber defence</article-title>
          ,
          <source>arXiv preprint arXiv:2104.09981</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Vyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hannay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bolton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Burnap</surname>
          </string-name>
          ,
          <article-title>Automated cyber defence: A review</article-title>
          ,
          <source>arXiv preprint arXiv:2303.04926</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Bridges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Rice</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Oesch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Nichols</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Watson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Spakes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Norem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huettel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jewell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Weber</surname>
          </string-name>
          , et al.,
          <article-title>Testing soar tools in use</article-title>
          ,
          <source>Computers &amp; Security</source>
          <volume>129</volume>
          (
          <year>2023</year>
          )
          <fpage>103201</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Jhawar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mauw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Zakiuddin</surname>
          </string-name>
          ,
          <article-title>Automating cyber defence responses using attack-defence trees and game theory</article-title>
          ,
          <source>in: European Conference on Cyber Warfare and Security</source>
          , Academic Conferences International Limited,
          <year>2016</year>
          , p.
          <fpage>163</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kordy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mauw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Melissen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schweitzer</surname>
          </string-name>
          ,
          <article-title>Attack-defense trees and two-player binary zero-sum extensive form games are equivalent, in: Decision and Game Theory for Security: First International Conference</article-title>
          , GameSec
          <year>2010</year>
          , Berlin, Germany, November 22-
          <issue>23</issue>
          ,
          <year>2010</year>
          . Proceedings 1, Springer,
          <year>2010</year>
          , pp.
          <fpage>245</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Eom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>A framework for real-time intrusion response in software defined networking using precomputed graphical security models</article-title>
          ,
          <source>Security and Communication Networks</source>
          <year>2020</year>
          (
          <year>2020</year>
          )
          <fpage>7235043</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>O.</given-names>
            <surname>Gadyatskaya</surname>
          </string-name>
          ,
          <article-title>Automating defence generation for risk assessment</article-title>
          ,
          <source>in: 1st IEEE European Symposium on Security and Privacy</source>
          ,
          <source>EuroS&amp;P</source>
          <year>2016</year>
          , IEEE,
          <year>2016</year>
          , pp.
          <source>Poster-6.</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tabebordbar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Beheshti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Benatallah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Barukh</surname>
          </string-name>
          ,
          <article-title>Adaptive rule adaptation in unstructured and dynamic environments</article-title>
          ,
          <source>in: Web Information Systems Engineering-WISE</source>
          <year>2019</year>
          : 20th International Conference, Hong Kong, China, January
          <volume>19</volume>
          -
          <issue>22</issue>
          ,
          <year>2020</year>
          , Proceedings 20, Springer,
          <year>2019</year>
          , pp.
          <fpage>326</fpage>
          -
          <lpage>340</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T. T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. J.</given-names>
            <surname>Reddi</surname>
          </string-name>
          ,
          <article-title>Deep reinforcement learning for cyber security</article-title>
          ,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          <volume>34</volume>
          (
          <year>2021</year>
          )
          <fpage>3779</fpage>
          -
          <lpage>3795</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sabur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vadnere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Montrose</surname>
          </string-name>
          ,
          <article-title>Sdn-based moving target defense using multi-agent reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the first International Conference on Autonomous Intelligent Cyber defense Agents (AICA</source>
          <year>2021</year>
          ), Paris, France,
          <year>2021</year>
          , pp.
          <fpage>15</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Adversarial decision-making for moving target defense: a multi-agent markov game and reinforcement learning approach</article-title>
          ,
          <source>Entropy</source>
          <volume>25</volume>
          (
          <year>2023</year>
          )
          <fpage>605</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>