<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Operational Technology Cyber Security</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alec Wilson</string-name>
          <email>Alec.Wilson@uk.bmt.org</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ryan Menzies</string-name>
          <email>Ryan.Menzies@uk.bmt.org</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Neela Morarji</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Foster</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Casassa Mont</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Esin Turkbeyler</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lisa Gralewski</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ADSP</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>BMT</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper demonstrates the potential for autonomous cyber defence to be applied on industrial control systems and provides a baseline environment to further explore Multi-Agent Reinforcement Learning's (MARL) application to this problem domain. It introduces a simulation environment, IPMSRL, of a generic Integrated Platform Management System (IPMS) and explores the use of MARL for autonomous cyber defence decision-making on generic maritime based IPMS Operational Technology (OT).</p>
      </abstract>
      <kwd-group kwd-group-type="author">
        <kwd>Multi-Agent Reinforcement Learning</kwd>
        <kwd>Cyber Security</kwd>
        <kwd>Operational Technology</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Operational Technology (OT) cyber defensive actions are less mature than they are for Enterprise
IT. This is due to the relatively ‘brittle’ nature of OT infrastructure, originating from the use of
legacy systems, design-time engineering assumptions, and a lack of full-scale modern security
controls. Traditional IT controls are rarely deployed on OT infrastructure, and where they
are, some threats aren’t fully addressed. Additionally, there may be a lack of trained cyber
personnel available in deployed operational OT environments, e.g., at-sea vessels, hence the
opportunity to develop autonomous defensive systems. Reinforcement learning (RL) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is a
subset of machine learning that allows an AI-driven system (sometimes referred to as an agent)
to learn through trial and error using feedback signals (through rewards) from its actions. An
agent executes autonomous actions and is given a reward (or punishment) contingent with the
consequences of these actions in an environment. The agent adapts its strategy, referred to as a
policy, to maximize cumulative reward.
      </p>
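<p>The agent–environment loop described above can be sketched in a few lines of Python. This is an illustrative toy example, not the IPMSRL interface: the two-state environment, its rewards and the tabular Q-learning update are invented purely to show how an agent adapts its policy from reward feedback.</p>

```python
import random

random.seed(0)

class ToyEnv:
    """Hypothetical two-state environment used only to illustrate the RL loop."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 from state 0 is rewarded; everything else is punished.
        reward = 1.0 if (self.state == 0 and action == 1) else -1.0
        self.state = 1 - self.state
        return self.state, reward

# Tabular Q-learning: the learned 'policy' is implicit in the action values.
q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
alpha, gamma, epsilon = 0.1, 0.9, 0.2
env = ToyEnv()
state = env.reset()
for _ in range(500):
    # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
    if random.random() > epsilon:
        action = max((0, 1), key=lambda a: q[(state, a)])
    else:
        action = random.choice((0, 1))
    next_state, reward = env.step(action)
    best_next = max(q[(next_state, a)] for a in (0, 1))
    # Move the action-value estimate towards the observed reward signal.
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    state = next_state
```

<p>After training, the value of the rewarded action in state 0 dominates, i.e. the agent has adapted its strategy to maximise cumulative reward.</p>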
      <p>
        RL has seen key recent developments in various domains such as game environments e.g.,
Go [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and real world problems e.g., video compression [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Many domains, including
cybersecurity, are now using RL to train within simulated environments. There are a few examples
of cyber focused RL environments in the literature, notably: Yawning Titan [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], CyberBattleSim
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and the CybORG environment [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. These environments are focused on IT networks and
currently lack the ability to train multiple defenders. The existing environments do not reflect
the challenges of cyber defence on an OT Industrial Control System (ICS). IPMSRL aims to
reflect the nuances of OT more accurately in an abstract simulator and provide a platform for
defensive agents to be trained to recover an IPMS from a cyber-attack.
      </p>
      <p>We applied Multi-Agent Reinforcement Learning (MARL) to the cyber defence of ICSs because
of the collaborative nature of the problem. Agents distributed across a network and working
together to take coordinated remedial actions are likely to be more successful than a single
agent, or multiple uncoordinated agents, working independently.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Operational Technology Scenario</title>
      <p>This paper focuses on a generic IPMS, a form of Industrial Control System used onboard ships
and submarines.</p>
      <p>An IPMS provides the capability for remote monitoring, control and management of the
ship’s machinery systems and damage control from key components. This real-time monitoring
and control facilitates continuous situational awareness of the machinery state and damage
condition of the ship.</p>
      <p>IPMS controls and monitors many ship systems across Propulsion, Power, Steering, Stability,
Auxiliary and Ancillary systems as well as those providing Damage Control and Fire Fighting
(DCFF). To achieve this, IPMS utilises a distributed control system architecture that facilitates
interfaces with sensors, equipment, plants, software-based control systems and network-based
data.</p>
      <p>An IPMS produces alerts in event of failures, anomalies, or issues. These alerts are not
necessarily cyber security alerts; however, they are likely to be instrumental in the context of
cyber-attack detection and response. These alerts were assumed to be fed into additional Cyber
Threat Detection and Security Information and Event Management (SIEM) tools, leveraging
IPMS inputs to detect and generate relevant attack alerts. The IPMS architecture is based upon
traditional Industrial Control System (ICS) design, inclusive of Remote Terminal Units (RTUs),
Programmable Logic Controllers (PLCs) and multifunction control consoles providing Human
Machine Interfaces (HMIs) combined by a dual redundant network backbone. Specifically, HMI
refers to IPMS consoles as well as panels on the equipment. The architecture work has been
deliberately abstracted and generalised using open-source information.</p>
    </sec>
    <sec id="sec-4">
      <title>3. IPMSRL - Simulation Environment</title>
      <p>In this paper we introduce IPMSRL, a highly configurable network-based multi-agent RL
environment. IPMSRL simulates an abstracted generic maritime IPMS where defensive agents
attempt to restore an infected network. Meanwhile, an attacker attempts to propagate through
the network and disrupt the IPMS and controlled systems to negatively impact operations.</p>
      <p>The IPMSRL network comprises two types of node: critical nodes and non-critical infectable
nodes. Critical nodes represent components of core-controlled systems critical to effective
operation of the vessel. Two controlled systems are modelled, Chilled Water Plants (CWPs)
and the Propulsion System. This selection represents a diverse subset of available controlled
systems. Infectable nodes represent HMIs, RTUs, Local Operating Panels (LOPs), PLCs, and
network switches.</p>
      <p>The properties of the links between nodes, shown in Figure 1, differ depending on their
type. In Figure 1, the star representation of the network is for visualisation purposes only.
Any HMI/RTU/LOP node which is connected to a network switch is adjacent to any other
node directly connected to a network switch, i.e., they are interconnected. The network holds
redundancy in its structure through its dual ring network backbone.</p>
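<p>The adjacency rule described above can be made concrete with a small sketch. The node names, the <code>switch_of</code> mapping and the helper function below are our own illustrative inventions, not the IPMSRL API: every node wired to a network switch is treated as adjacent to every other switch-connected node, while nodes off the backbone are reachable only via direct links.</p>

```python
# Hypothetical, simplified topology: node -> the switch it hangs off (if any).
switch_of = {
    "hmi_1": "switch_a",
    "rtu_1": "switch_a",
    "lop_1": "switch_b",
    "plc_1": None,       # PLCs sit behind an RTU, not directly on a switch
}
# Direct (non-switch) links, e.g. an RTU down to its PLC.
direct_links = {("rtu_1", "plc_1")}

def adjacent(a, b):
    """Two nodes are adjacent if directly linked, or both sit on the switched backbone."""
    if (a, b) in direct_links or (b, a) in direct_links:
        return True
    return switch_of.get(a) is not None and switch_of.get(b) is not None

# Any two switch-connected nodes are interconnected, even across switches,
# because the dual-ring backbone joins the switches together.
on_backbone = adjacent("hmi_1", "lop_1")
```
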
      <sec id="sec-4-1">
        <title>3.1. MITRE ATT&amp;CK® Framework</title>
        <p>
          MITRE ATT&amp;CK® ICS [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] provides a detailed analysis of the typical Tactics, Techniques, and
Procedures (TTPs) adopted by cyber attackers, relevant to industrial control systems.
        </p>
        <p>Fine-grained attack steps, based on MITRE ATT&amp;CK® ICS, were implemented to add
more realism, context and complexity to the cyber defensive remedial actions. At different
MITRE ATT&amp;CK® ICS stages the attacker will have different capabilities, e.g., lateral movement
by leveraging an HMI’s network card and connections. Different defensive remedial actions are
required depending on the MITRE ATT&amp;CK® ICS stage that the attacker has reached.</p>
        <p>The twelve attack stages represented in IPMSRL, and the MITRE
ATT&amp;CK® ICS Tactics to which they correspond, are shown in Figure 2.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Attacker</title>
        <p>
          The goal of the attacker is to compromise the vessel’s operational capacity, for example, by
disrupting IPMS systems and controlled systems (CWPs, Propulsion System) which negatively
impacts the vessel. An attacker propagates through the network by initially infecting a
noncritical infectable node, increasing the node’s infection level using an abstraction of the MITRE
ATT&amp;CK® ICS [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], framework, and then infecting an adjacent node. Any given node after
infection acts independently. It is therefore possible that if multiple nodes are infected then each
infected node can progress the infection or laterally move in the same timestep. This makes it
very dificult for the defender(s) to recover the network to a clean state once an infection has
spread.
        </p>
        <p>Attackers can be created on a sliding scale from fully targeted to fully viral. Fully targeted
attackers move directly towards critical nodes displaying ‘knowledge’ of the network. Fully
viral attackers move randomly to any adjacent node. Partially viral attackers move towards
critical nodes but may also move randomly with a given configured probability.</p>
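<p>The targeted-to-viral sliding scale can be sketched as a single sampling step. The function, its arguments and the parameter name <code>p_random</code> below are illustrative assumptions, not IPMSRL's actual configuration interface.</p>

```python
import random

def choose_next_target(adjacent_nodes, critical_path_next, p_random):
    """Sketch of the targeted-to-viral sliding scale (names are illustrative).

    p_random = 0.0 gives a fully targeted attacker that always steps towards
    a critical node; p_random = 1.0 gives a fully viral attacker that spreads
    to a uniformly random neighbour; values in between are partially viral.
    """
    if random.random() > p_random and critical_path_next in adjacent_nodes:
        return critical_path_next          # targeted move towards critical nodes
    return random.choice(adjacent_nodes)   # viral move to any adjacent node

random.seed(1)
# A fully targeted attacker deterministically steps along the critical path.
nxt = choose_next_target(["hmi_1", "rtu_1"], "rtu_1", p_random=0.0)
```
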
        <p>Attackers also have other configurable attributes such as the probability an infection will
progress through the MITRE ATT&amp;CK® ICS stages or the probability of a successful lateral
move.</p>
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Alerts</title>
        <p>Each of the infectable nodes has an alert system which can be triggered when an infection
is present or when an attacker initially tries to infect a non-infected node. The probability
that an alert is successfully passed to the defender(s) allows the user to configure the scale of
asymmetric information within the environment. In the context of this paper, an assumption was
made that IPMS-generated alerts were fed into a SIEM for further processing and cyber threat
detection. The alert success probability, therefore, directly impacts the partial observability of
the defender’s observation space.</p>
        <p>© 2023 The MITRE Corporation. This work is reproduced and distributed with the permission of The MITRE Corporation.</p>
        <p>It is worth noting that the alerts are static in nature. This means that if an alert is set off at a
certain timestep, the defender will receive information about the progress of the node infection
at that timestep. No further information about the progression of the infection will be given
to the defender unless another alert is set off or the defender takes a remedial action on the node.</p>
      </sec>
      <sec id="sec-4-3a">
        <title>3.4. NIST SP-800-61</title>
        <p>
          NIST SP-800-61 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] describes a standard process for handling cyber incidents and responses. The
three key cyber defensive steps relevant to this problem space are Contain, Eradicate, and
Recover, shown in Figure 3. Each of these types of action needs to be taken in a logical order
based on cyber defensive best practice: in general, contain an attack to avoid
it spreading across systems, eradicate the attack to remove it from the infected system, and
recover the node to return the system to an operational state. The order of these actions is
important, and other factors, such as the severity of infection, impact the effectiveness and ease
of taking these actions.
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>3.5. Defensive Remedial Actions</title>
        <p>Each defensive agent can take four types of action: Contain, Eradicate, Recover and Wait. These
are aligned with the real-world cyber incident response process indicated in NIST SP-800-61
discussed above.</p>
        <p>Containing makes a given node non-operational but prevents the infection from propagating
from that node. Eradicating cleans a node of infection. Recovering returns the node to an
operational state. For critical nodes, only the recover and wait actions are available whereas, for
all other nodes, all actions are available. At each timestep, each defender in succession takes an
action. Each of these actions has a configurable delay, where a timestep lag is applied depending
on the infection state of the node and the type of action. This delay aims to reflect the real-world
differences between taking simple actions against early infections, e.g., modifying credentials,
and taking more extreme actions to prevent unrecoverable damage, e.g., quarantining systems.
The costs of remedial actions are considered further in the later discussion of the reward
function.</p>
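<p>The configurable, severity-dependent action delay can be sketched as a lookup table. The delay values, the severity threshold and the field name <code>attack_stage</code> below are hypothetical placeholders; in IPMSRL these are user-configurable.</p>

```python
# Hypothetical delay table (timesteps) keyed by action and infection severity;
# the real values in IPMSRL are user-configurable.
ACTION_DELAY = {
    ("contain",   "early"): 1, ("contain",   "severe"): 3,
    ("eradicate", "early"): 1, ("eradicate", "severe"): 4,
    ("recover",   "early"): 1, ("recover",   "severe"): 2,
    ("wait",      "early"): 0, ("wait",      "severe"): 0,
}

def apply_action(node, action, now):
    """Return the timestep at which the action completes.

    Simple actions against an early infection (e.g. modifying credentials)
    resolve quickly; acting on a severely infected system takes longer,
    reflecting more extreme measures such as quarantining.
    """
    severity = "severe" if node["attack_stage"] >= 6 else "early"
    return now + ACTION_DELAY[(action, severity)]

node = {"attack_stage": 8}          # deep into the modelled attack stages
done_at = apply_action(node, "eradicate", now=10)
```
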
        <p>IPMSRL supports partial observability in which each defensive agent has its own view of
the network state. In addition to the alerts, defensive agents can investigate a node’s true
infection information by interacting with that node. This knowledge is not shared amongst
other defensive agents.</p>
      </sec>
      <sec id="sec-4-5">
        <title>3.6. Reward Function</title>
        <p>
          IPMSRL features a highly customisable reward function for both global and intrinsic rewards.
Global rewards are awarded to the agents at the end of the episode, whereas intrinsic rewards
are awarded within an episode. Global rewards have been subdivided into mission objective
and state rewards. The implemented reward function aims to tackle the problem of sparse
global rewards, where agents struggle to train since they only receive a small number of reward
signals, often at the end of an episode. This is a known problem within RL, and specifically
within applied control problems such as robotics [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The reward function used in IPMSRL
uses a combination of intrinsic rewards and reward shaping [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] to allow for further
experimentation into how these additions positively/negatively affect training. The mission
objective is defined as a win, loss or draw. For episode step t and maximum number of steps in
an episode T, a win, draw or loss is defined as:
• Win (+1) – There are no infected nodes or contained nodes, and t &lt; T.
• Loss (-1) – Any of the critical infrastructure has been infected, and t &lt; T.
        </p>
        <p>• Draw (+0) – t = T and neither the win nor loss criteria have been met.
For our experiments, T = 50.</p>
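<p>The win/draw/loss definition above translates directly into code. This is a sketch under stated assumptions: the function and argument names are ours, and the infected/contained node sets stand in for IPMSRL's internal state.</p>

```python
def mission_outcome(t, T, infected, contained, critical_infected):
    """Return the mission-objective reward from the win/draw/loss rules.

    t: current episode step; T: maximum number of steps in an episode.
    """
    if critical_infected and T > t:
        return -1.0                       # loss: critical infrastructure infected
    if not infected and not contained and T > t:
        return +1.0                       # win: network fully clean and operational
    if t == T:
        return 0.0                        # draw: episode timed out
    return None                           # episode still in progress

# With T = 50 as in the paper's experiments:
outcome = mission_outcome(t=12, T=50, infected=set(), contained=set(),
                          critical_infected=False)
```
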
        <p>The state reward reflects the impact to non-critical and critical nodes during the episode and
is graded as low, medium or high. Critical nodes, such as PLCs, RTUs and PCSs that directly
control machinery or switches, are more strongly penalised than other nodes, such as HMIs.
The global and intrinsic rewards are divided and shared equally between each agent.</p>
        <p>The weighting of penalties was informed by the failure effects and impact analysis of a generic
IPMS. State reward is produced by a 50/50 split of ‘state generic’ and ‘state specific’ rewards:</p>
        <p>State Generic – this refers to penalties that are provided when nodes within the environment
meet certain conditions e.g., a node is contained but not infected or a node is uninfected and
contained without being recovered.</p>
        <p>State Specific – this is defined, in levels of severity, by the number of different types of nodes
which have been infected in an episode. For example, if two HMIs were infected then the smallest
negative reward value (defined in the config) is applied.</p>
        <p>IPMSRL also allows agents to receive intrinsic rewards called action score rewards.
The defender(s) receive a negative reward depending on the action chosen and the status of the
node being interacted with (graded as low, medium or high). Nodes with a more severe
infection status incur a higher penalty. The agents are therefore incentivised to deal with
infections early, when they are first alerted. The weighting of the reward function components
is user configurable and is discussed in further detail within the experimental results
section.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Multi-Agent Reinforcement Learning (MARL)</title>
      <p>
        In this paper, two MARL algorithms, IPPO [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], and MAPPO [14], were tested. Multi-Agent Deep Deterministic Policy Gradient (MADDPG)
and Q-MIX were also explored, but the implementations within Ray and RLlib were found to be
unstable during training on IPMSRL. Our
implementation of IPPO is equivalent to the IPPO algorithm referenced in the original paper [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
In IPPO, each agent has an actor and a critic network, with the actor network representing the
policy and the critic network representing the value function. These actor and critic networks
are independent for each agent. Our MAPPO implementation is slightly different from the MAPPO
algorithm used in the original paper [14]. The original paper states that their MAPPO algorithm
uses shared parameters for both the policy and critic networks. The MAPPO implementation
used in this paper uses independent policy networks for each agent, but uses a shared centralised
critic, also referred to as a centralised value function. An important note is that in other papers,
some MAPPO implementations share information between agents by using the joint
observations and action spaces; this is the same as the implementation within MARLlib [15],
which uses the global state within training. Both Proximal Policy Optimisation (PPO) based algorithms
were stable during training and demonstrated the ability to recover an infected network within
IPMSRL. Two defensive agents were used for all the experiments in this paper. The hardware
used for training was a 6-core CPU, 56GB RAM and an NVIDIA Tesla K80 with 12GB of vRAM.
Ray Tune [16] and RLlib were utilised for training. Training the PPO-based algorithms for
one million timesteps took approximately one hour. The performance metrics used for the
experiments were mean episode reward, mean episode length and mean episode outcome, which
is the mean of the episode outcomes (win (+1)/draw (+0.5)/loss (0)).
      </p>
      <sec id="sec-5-1">
        <title>4.1. Hyperparameter Tuning with MAPPO</title>
        <p>An initial hyperparameter tuning stage was completed to create a baseline for the environment
parameter experiments. A brief grid search over commonly successful values was
used for eleven PPO-specific and general machine learning hyperparameters. In Figure 4, a clear
difference in training performance between the tuned and default hyperparameters
can be seen. The hyperparameters most influential in the improved performance
were the Learning Rate, Gamma, Lambda and the Clip Parameter. When MAPPO was trained
using the default hyperparameters, it converged to an optimal policy more slowly than the
tuned model, but it was still able to reach an optimal policy. The tuned MAPPO configuration
was also used throughout the Experimental Results section. Throughout the experiments, it was
found that, in general, PPO-based algorithms were still able to improve their performance during
training with most configurations of hyperparameters.</p>
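<p>A grid search of the kind described above can be sketched with the standard library alone. The hyperparameter names mirror common PPO settings, but the candidate values and the stand-in <code>evaluate</code> function below are illustrative assumptions, not the paper's actual grid or Ray Tune calls.</p>

```python
from itertools import product

# Illustrative subset of the tuned hyperparameters and candidate values;
# the paper's actual grid covered eleven PPO and general ML hyperparameters.
grid = {
    "lr":         [5e-5, 1e-4, 5e-4],
    "gamma":      [0.95, 0.99],
    "lambda":     [0.9, 1.0],
    "clip_param": [0.1, 0.2, 0.3],
}

def evaluate(config):
    """Stand-in for training a MAPPO trial and returning mean episode reward.

    Hypothetical scoring that prefers mid-range values, purely for illustration.
    """
    return -abs(config["lr"] - 1e-4) - abs(config["clip_param"] - 0.2)

best_config, best_score = None, float("-inf")
for values in product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    score = evaluate(config)
    if score > best_score:
        best_config, best_score = config, score
```

<p>In practice each <code>evaluate</code> call is a full training run, so frameworks such as Ray Tune schedule these trials in parallel rather than looping sequentially.</p>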
      </sec>
      <sec id="sec-5-2">
        <title>4.2. IPPO vs MAPPO</title>
        <p>
          MAPPO outperformed IPPO, converging to the optimal policy more quickly. This statement is based
on the higher win rate, higher total reward and the ability to win an episode in fewer episode
steps, suggesting a more efficient strategy has developed. The point at which the two algorithms
diverge in Figure 5 underpins some of the differences in performance. At 200K timesteps IPPO
and MAPPO have similar performance and are drawing most episodes. After this point, both
IPPO and MAPPO start to develop a winning policy, but with MAPPO optimising its policy at a
faster rate. In our experimentation, MAPPO clearly gains an advantage from using the shared critic
network. The centralised critic allows the agents to find a collaborative policy
faster than without the shared critic network. This is the converse of what has
been found elsewhere in the literature, where the addition of a shared value function reduced
performance significantly [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>MAPPO’s confidence interval (90%) bands were much narrower than IPPO’s. The narrower
confidence interval suggests that the individual trials of the MAPPO algorithm were
more stable than those of IPPO, implying the training is more consistent. This experiment
demonstrates that the centralised critic can help improve the policy updates at each training
epoch.</p>
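<p>The structural difference between the two algorithms' critics can be sketched in a few lines. The agent names and observation layout below are hypothetical; the point is only that IPPO's critic sees a single agent's local observation, while the centralised MAPPO critic used here sees the joint observation of all agents.</p>

```python
def ippo_critic_input(agent_id, local_obs):
    """IPPO: each agent's critic sees only that agent's own observation."""
    return local_obs[agent_id]

def mappo_critic_input(local_obs):
    """MAPPO (as implemented here): one centralised critic sees the joint
    observation of all agents, giving a less partial view of the network state."""
    joint = []
    for agent_id in sorted(local_obs):
        joint.extend(local_obs[agent_id])
    return joint

# Two defenders, each locally observing (alert level, infection estimate).
obs = {"defender_0": [0.0, 1.0], "defender_1": [1.0, 0.0]}
ippo_in = ippo_critic_input("defender_0", obs)   # [0.0, 1.0]
mappo_in = mappo_critic_input(obs)               # joint observation of both agents
```
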
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Experimental Results</title>
      <sec id="sec-6-1">
        <title>5.1. Alert Success Probability</title>
        <p>A configurable attacker parameter is the alert success probability: the probability that an
alert is set off on a node when an infection progresses or successfully moves laterally to
another node. The importance of observability to the defenders within the
environment is clear: the more accurate the information they receive, the better they can
perform. Within the training time set, no alert success probability below 1
produced an optimal strategy. However, agents with an alert success probability
of 0.75 or 0.9 were still able to win over 97.5% or 99.5% of episodes, respectively. These results
suggest that the agents can learn effective strategies in partially observable environments, but
that the SIEM’s alert detection and processing is important to the success of autonomous cyber
defence.</p>
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Reward Experiments</title>
        <p>In Figure 7, the difference in training performance between state-based
rewards and balanced rewards is investigated. State-based reward refers to a reward function
made up solely of the state rewards discussed in the reward function section, whereas balanced
reward also includes intrinsic rewards, given to the agents during an episode based on their
actions, and an overall score based on the episode outcome. Using solely state rewards,
convergence towards a winning strategy was faster than with balanced rewards, but the
resulting policy was less optimal overall. Figure 7 shows the two parameter sets plotted with
90% confidence intervals, validating the significant difference in their behaviour.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion and Future Work</title>
      <p>This paper introduced IPMSRL, a highly configurable network-based multi-agent RL environment,
and demonstrated the capability of MARL agents to successfully recover an abstract IPMS to an
operational state following a cyber-attack. Findings demonstrate that hyperparameter tuning,
reward shaping and the quality of alerts are all important aspects in developing performant
agents.</p>
      <p>A key issue with using a simulator to train an RL agent is the sim-to-real gap. This describes
the differences between real systems and the simulated environments that are used
to replicate them, and the problems that even small differences can cause. These issues are
compounded further when environments are intentionally abstracted away from the real system
for reasons of dimensionality, security or complexity. RL simulators such as IPMSRL are
important for developing the core concepts needed to build confidence and understanding in
the techniques being employed, such as MARL. But more realistic simulators or emulators will
be required to facilitate movement towards applying these concepts on real systems, which
should be the eventual goal.</p>
      <p>Another avenue of future research is the generalisability of the agents. Agents will need
to be able to adapt to different types of attack, scenario (where components may be valued
differently) and network topology. Without demonstrating this flexibility, agents are unlikely
to be stable enough to be trusted in real-world control systems.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>Research funded by Frazer-Nash Consultancy Ltd. on behalf of the Defence Science and
Technology Laboratory (Dstl) which is an executive agency of the UK Ministry of Defence
providing world class expertise and delivering cutting-edge science and technology for the
benefit of the nation and allies. The research supports the Autonomous Resilient Cyber Defence
(ARCD) project within the Dstl Cyber Defence Enhancement programme.</p>
      <p>The authors would also like to thank Clare Jubb, Andy Pollard, Christos Giachritsis, Jake
Rigby, Ben Golding, Julie Kimbrey and Brian Bassil for their wider contribution to the project
and paper.</p>
      <p>[13] Is Independent Learning All You Need in the StarCraft Multi-Agent Challenge?, 2020.
URL: http://arxiv.org/abs/2011.09533, arXiv:2011.09533 [cs].
[14] C. Yu, A. Velu, E. Vinitsky, J. Gao, Y. Wang, A. Bayen, Y. Wu, The Surprising Effectiveness
of PPO in Cooperative, Multi-Agent Games, 2022. URL: http://arxiv.org/abs/2103.01955,
arXiv:2103.01955 [cs].
[15] S. Hu, Y. Zhong, M. Gao, W. Wang, H. Dong, Z. Li, X. Liang, X. Chang, Y. Yang, MARLlib:
A Scalable Multi-agent Reinforcement Learning Library, 2023. URL: http://arxiv.org/abs/
2210.13708, arXiv:2210.13708 [cs].
[16] R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, I. Stoica, Tune: A research
platform for distributed model selection and training, arXiv preprint arXiv:1807.05118
(2018).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bradtke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Learning to Act using Real-Time Dynamic Programming</article-title>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning: an introduction, Second edition, in: Adaptive computation and machine learning series</article-title>
          , The MIT Press, Cambridge, Massachusetts,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Maddison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sifre</surname>
          </string-name>
          , G. van den Driessche, J. Schrittwieser,
          <string-name>
            <given-names>I.</given-names>
            <surname>Antonoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Panneershelvam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lanctot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dieleman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Grewe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kalchbrenner</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Leach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Graepel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hassabis</surname>
          </string-name>
          ,
          <article-title>Mastering the game of Go with deep neural networks and tree search</article-title>
          ,
          <source>Nature</source>
          <volume>529</volume>
          (
          <year>2016</year>
          )
          <fpage>484</fpage>
          -
          <lpage>489</lpage>
          . URL: https://doi.org/10.1038/nature16961. doi:10.1038/nature16961.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mandhane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhernov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rauh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Claus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Mankowitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Broshear</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schrittwieser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hubert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <article-title>MuZero with Self-competition for Rate Control in VP9 Video Compression</article-title>
          ,
          <year>2022</year>
          . URL: http://arxiv.org/abs/2202.06626, arXiv:2202.06626 [cs, eess].
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Andrew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Spillard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Collyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dhir</surname>
          </string-name>
          ,
          <article-title>Developing Optimal Causal Cyber-Defence Agents via Cyber Security Simulation</article-title>
          ,
          <year>2022</year>
          . URL: http://arxiv.org/abs/2207.12355.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <collab>Microsoft Defender Research</collab>
          ,
          <source>CyberBattleSim</source>
          ,
          <year>2021</year>
          . URL: https://github.com/microsoft/cyberbattlesim.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Standen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lucas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Richer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marriott</surname>
          </string-name>
          ,
          <article-title>CybORG: A Gym for the Development of Autonomous Cyber Agents</article-title>
          ,
          <year>2021</year>
          . URL: http://arxiv.org/abs/2108.09118.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <collab>The MITRE Corporation</collab>
          ,
          <source>MITRE ATT&amp;CK ICS Matrix</source>
          ,
          <year>2023</year>
          . URL: https://attack.mitre.org/matrices/ics/.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cichonski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Millar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Grance</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Scarfone</surname>
          </string-name>
          ,
          <article-title>Computer Security Incident Handling Guide: Recommendations of the National Institute of Standards and Technology</article-title>
          ,
          <source>Technical Report NIST SP 800-61r2, National Institute of Standards and Technology</source>
          ,
          <year>2012</year>
          . URL: https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-61r2.pdf. doi:10.6028/NIST.SP.800-61r2.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Qureshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nakamura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yoshikawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ishiguro</surname>
          </string-name>
          ,
          <article-title>Intrinsically motivated reinforcement learning for human-robot interaction in the real-world</article-title>
          ,
          <source>Neural Netw</source>
          <volume>107</volume>
          (
          <year>2018</year>
          )
          <fpage>23</fpage>
          -
          <lpage>33</lpage>
          . doi:10.1016/j.neunet.2018.03.014.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>I.</given-names>
            <surname>Popov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Heess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hafner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Barth-Maron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vecerik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lampe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tassa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Erez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <article-title>Data-efficient Deep Reinforcement Learning for Dexterous Manipulation</article-title>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Mguni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Jaferjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Slumbers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Perez-Nieves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>LIGS: Learnable Intrinsic-Reward Generation Selection for Multi-Agent Learning</article-title>
          ,
          <year>2022</year>
          . URL: http://arxiv.org/abs/2112.02618, arXiv:2112.02618 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C. S.</given-names>
            <surname>de Witt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Makoviichuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Makoviychuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. H. S.</given-names>
            <surname>Torr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Whiteson</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>