<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>S. Veluchamy, R. S. Kathavarayan, Deep reinforcement learning for building honeypots
against runtime dos attack, Int. J. Intell. Syst.</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/3292500.3330932</article-id>
      <title-group>
        <article-title>Multi-agent Reinforcement Learning for Cybersecurity: Approaches and Challenges</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Salvo Finistrella</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Mariani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Franco Zambonelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Modena and Reggio Emilia</institution>
          ,
          <addr-line>Reggio Emilia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>37</volume>
      <issue>2022</issue>
      <fpage>8</fpage>
      <lpage>10</lpage>
      <abstract>
        <p>In the face of a rapidly evolving threat landscape, traditional security measures often lag behind sophisticated cyber attacks. Through a review of existing literature, we examine the shortcomings of conventional cybersecurity methods, highlighting the need for Reinforcement Learning (RL) based methods. Our study classifies various RL approaches in cybersecurity, aimed at enhancing detection, mitigation, and response capabilities, along two dimensions: the RL technique used, and the network configuration. Moving forward, we emphasise the importance of further research and development to address challenges such as model complexity, sample efficiency, and vulnerabilities to adversarial attacks.</p>
      </abstract>
      <kwd-group>
        <kwd>Reinforcement learning</kwd>
        <kwd>Cybersecurity</kwd>
        <kwd>Multi-agent system</kwd>
        <kwd>DoS attack mitigation</kwd>
        <kwd>Intrusion Detection System (IDS)</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        On a global scale, projections indicate that the cost of cybercrime will surpass 8 trillion dollars,
cementing its status as the world’s third-largest and most rapidly expanding economy [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
This cost includes damage and destruction of data, stolen money, lost productivity, theft of
intellectual property, theft of personal and financial data, fraud, post-attack disruption, forensic
investigation, restoration and deletion of hacked data and systems, and reputational harm.
Given this and the escalating nature of cyber threats, the integration of Reinforcement Learning (RL)
techniques emerges as a promising strategy to fortify cybersecurity defences [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ].
      </p>
      <p>RL is an area of machine learning where an active entity (called an agent) is given the goal of
learning a behavioural policy through experience, by interacting with an environment through
trial and error. While interacting with such an environment, the agent may get rewards for
useful actions that advance it towards the task to be accomplished, or punishments for actions
that steer it away. By aiming at maximising the accumulated rewards, the agent learns which
actions lead to favourable outcomes and adjusts its behaviour accordingly [5].</p>
      <p>The motivation for employing RL in cybersecurity stems from its ability to introduce dynamic
and adaptive defence mechanisms. Traditional approaches struggle to keep pace with the rapidly
evolving threat landscape, whereas RL enables security systems to learn from experience and
adjust their strategies autonomously and in a timely manner. By automating response mechanisms, RL
algorithms can not only detect, but also analyse and mitigate cyber threats without human
intervention, reducing response times and potential damage. Furthermore, RL
offers continuous learning capabilities, allowing security systems to adapt to novel threats by
continuously updating their knowledge and strategies based on new data and experiences.</p>
      <p>In this article, we present a classification of RL methods tailored to bolster security measures
across diverse domains, encompassing single-agent paradigms as well as multi-agent ones. We
classify RL techniques as applied to host-based, network-based, and centralised network-based
configurations augmented with Software-Defined Networking (SDN, see Section 4.2). By summarising
these approaches, we aim to provide a roadmap for practitioners and researchers navigating
the complex landscape of RL applications in cybersecurity.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Motivation &amp; Background</title>
      <p>In today’s ever-evolving cyber landscape, traditional security measures often fall short in
defending against new threats. The reasons are many.</p>
      <p>• Static defences: they rely on fixed rules, threat signatures, and predefined attack patterns
for detection and prevention. Thus they are inherently limited to recognising known
threats and vulnerabilities [6].
• Lack of contextual understanding: they operate within rigid frameworks that do not
account for the broader context of an attack, failing to adapt to evolving scenarios. For
instance, these systems cannot adapt to new attack vectors or understand the nuances
of different operational environments, making it difficult to identify and respond to
sophisticated and previously unseen attacks [7].
• Limited scalability: they may become overwhelmed by the sheer volume of data to
analyse and the diversity of threats to address [8].
• Inadequate response times: they often rely on manual intervention to address security
incidents, introducing obvious delays [8].</p>
      <p>These challenges highlight the need for more adaptive and responsive cybersecurity strategies,
and RL offers a dynamic and flexible solution to meet these needs. By harnessing RL,
security systems can autonomously adjust to emerging threats, which is particularly crucial for
countering zero-day attacks. These exploit previously unknown vulnerabilities, leaving victims
defenceless with no time to prepare or patch the flaw [9].</p>
      <p>• Dynamic Threat Detection: RL algorithms continuously learn and adapt to new threat
patterns through interactions with the environment, improving accuracy and efficiency.
• Real-Time Threat Analysis: RL can identify and neutralise threats automatically, without
waiting for manual intervention.
• Enhanced Decision-Making: RL learns from experience to make better decisions,
analysing data patterns and detecting anomalies in threat identification.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Reinforcement Learning in a Cybersecurity Environment</title>
      <p>Reinforcement Learning represents a powerful paradigm in the domain of artificial intelligence,
enabling agents to learn optimal behaviour through interaction with their environment.
Mimicking the trial-and-error learning process observed in humans and animals, RL algorithms
iteratively explore and exploit their environment to maximise cumulative rewards. Rewards
can be sparse or dense. Sparse rewards are given infrequently, making it challenging for the agent to
learn desired behaviours. Conversely, dense rewards are provided more frequently, facilitating
quicker learning. RL has applications in diverse fields, from robotics and gaming to finance and
healthcare, where systems must autonomously adapt to uncertain and dynamic environments.</p>
      <p>In the domain of cybersecurity, the core RL concepts such as state and environment
observation, action selection, policy optimisation, reward mechanisms, and goal-driven strategies, can
be instantiated as follows.</p>
      <p>• Environment state and observation: RL algorithms rely on observing the (possibly
hidden) state of the environment to make decisions. As depicted in Figure 1, in
cybersecurity the environment is the network environment, within which several appliances
generate the data constituting the state: firewalls, Intrusion Detection Systems (IDSs),
Intrusion Prevention Systems (IPSs), proxy servers, sniffers, Operating Systems, and other
software. This entails monitoring various data such as network traffic patterns, system
logs, software configurations, and user behaviour. The generated observations serve as
inputs for the agents to learn.
• Action selection: after observing the state, the RL agent selects actions based on its
learned policy, including triggering alarms, deploying patches, updating security
configurations, isolating compromised systems, alerting security personnel, blocking
services, and dropping packets. The RL environment evaluates the efficacy of these actions,
assessing whether the state of security has improved or deteriorated. Rewards or penalties
are then issued accordingly (see below).
• Policy optimisation: RL agents continuously refine their decision-making policy
through trial and error, guided by rewards, aiming to maximise short-term or long-term rewards.
• Reward mechanisms: rewards provide feedback to the agent, indicating the efficacy
of its actions. In cybersecurity, their primary goal is to incentivise actions leading to
successful detection and mitigation of threats while penalising those that fall short. To
achieve this, the reward function is crafted to align with the objectives
of the Intrusion Detection System (IDS). Its design could involve rewarding accurate
identification and swift response to threats while penalising false positives and
false negatives. The specifics of this function’s formulation and computation vary, depending
on the RL algorithm and the objectives of the security system at hand.
Subsection 4.3 provides practical examples of rewards.
• Goal-driven strategies: RL agents are driven by overarching goals, such as maximising
the security posture of the environment. This encourages them to develop strategies that
prioritise actions leading to the most significant reduction in risk and protection of assets.</p>
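      <p>As a minimal sketch of how these concepts fit together, the following Python toy loop instantiates observation, action selection, reward, and policy update for a single defender. Everything in it is an illustrative assumption rather than part of any surveyed system: the hypothetical ToyNetworkEnv class, the two observations (normal, suspicious), the two actions (allow, block), the 90% observation accuracy, and the reward values. For brevity it uses a one-step, epsilon-greedy value update (a contextual-bandit simplification of RL), so every decision is treated as its own episode.</p>
      <preformat>
```python
import random

OBSERVATIONS = ["normal", "suspicious"]   # assumed, coarse IDS-style observations
ACTIONS = ["allow", "block"]              # assumed defender actions


class ToyNetworkEnv:
    """Hypothetical environment: each step emits one traffic observation."""

    def __init__(self, attack_rate=0.3, seed=0):
        self.rng = random.Random(seed)
        self.attack_rate = attack_rate
        self.is_attack = False

    def observe(self):
        # Hidden state: whether the current traffic is malicious.
        self.is_attack = self.attack_rate > self.rng.random()
        # The observation is a noisy view of the hidden state (90% accurate).
        truthful = self.rng.random() > 0.1
        looks_bad = self.is_attack if truthful else not self.is_attack
        return "suspicious" if looks_bad else "normal"

    def step(self, action):
        # Reward correct decisions, penalise false positives and misses.
        correct = (action == "block") == self.is_attack
        return 1.0 if correct else -1.0


def train(episodes=5000, alpha=0.05, epsilon=0.1, seed=0):
    env = ToyNetworkEnv(seed=seed)
    rng = random.Random(seed + 1)
    # Action-value table: q[observation][action]
    q = {o: {a: 0.0 for a in ACTIONS} for o in OBSERVATIONS}
    for _ in range(episodes):
        obs = env.observe()
        if epsilon > rng.random():
            action = rng.choice(ACTIONS)            # explore
        else:
            action = max(q[obs], key=q[obs].get)    # exploit
        reward = env.step(action)
        # One-step update: move the estimate towards the observed reward.
        q[obs][action] += alpha * (reward - q[obs][action])
    return q


q_table = train()
policy = {o: max(q_table[o], key=q_table[o].get) for o in OBSERVATIONS}
print(policy)
```
      </preformat>
      <p>In this toy setting the learned policy maps each observation to the action with the highest estimated reward; a realistic defender would face far richer observations and delayed consequences of its actions.</p>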
      <p>The RL agents are tasked with defending the environment against threat actors, which can be either
actual human/software attackers or emulated through datasets and simulation frameworks.
These threat actors strategically probe the environment to find vulnerabilities, and additional
challenges may arise when authorised resources or employees are compromised to gain access to other
parts of the system.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Classification: Framework and Survey</title>
      <p>
        To explore the intersection between RL and cybersecurity, we started by looking at existing
surveys. We searched the Scopus computer science database using the query string
“reinforcement learning” AND “cybersecurity” AND (“survey” OR “review”). This search yielded
31 results. However, manual inspection revealed that only five of them were truly broad
surveys focussed on RL applied to general cybersecurity. The others either focussed on a specific
application scenario (e.g. IoT) or were not centred around RL (e.g. mostly covered statistical
machine learning methods). Thus, we delved deeper into these five surveys [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4, 8, 10</xref>
        ] and
applied snowballing where appropriate to identify the articles that faithfully and clearly represent a
research thread within our proposed classification.
      </p>
      <p>
        Among these five surveys, three in particular inspired this work. First, Uprety and Rawat
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] organise their survey according to the nature of attacks in the specific application domain
of the Internet of Things (IoT). Adawadkar and Kulkarni [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], instead, focus on IDS and resource
optimisation in IoT environments. They identify key parameters for comparing
RL-based algorithms, including detection rate, precision, and accuracy, which help assess
the effectiveness of RL in enhancing cybersecurity measures. Finally, Cengiz and
Gök [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] concentrate initially on penetration testing and then on Intrusion Detection Systems
(IDS). They conclude
by explaining how RL can be applied to various types of attacks. By selectively surveying the
available literature and evidence, they offer an overview of the role of RL in
fortifying defences against emerging threats.
      </p>
      <p>With the similar goal of assessing the state of the art and comparing various approaches in RL for
cybersecurity, we have formulated a novel classification meant to better introduce researchers
and practitioners to the field, encompassing two key dimensions: (i) the architecture of the RL
approach employed, and (ii) the network configuration adopted for cybersecurity.</p>
      <sec id="sec-4-1">
        <title>4.1. RL techniques dimension</title>
        <p>In the context of RL, multiple learning agents can coexist, and they may or may not share learning
data. Additionally, when multiple agents exist, they can explicitly try to hinder each other’s learnt
policies. These possibilities give rise to five categories of RL approaches:
• Single-agent. In single-agent systems, there is only one agent operating in the
environment and learning. This agent makes decisions and takes actions on its own,
considering only its own information.
• Centralised multi-agent. In centralised multi-agent systems, multiple agents exist, but
there is a central controller or coordinator that learns a decision-making policy for all
agents based on the information assembled from each agent. This central controller is
the only one learning a policy, which is then given to every other agent.
• Decentralised multi-agent. In decentralised multi-agent systems, each agent makes
its own decisions independently without a central controller. Agents in decentralised
systems typically have limited access to information about the environment and the
actions of other agents. They must use local information and possibly communication
with nearby agents to learn and make decisions. Each agent learns its own policy.
• Multi-agent CTDE. The Centralised Training with Decentralised Execution (CTDE) paradigm
involves training a multi-agent system in
a centralised manner, where a central controller learns the policies or strategies for
each agent using global information. However, during execution, each agent operates
independently, making decisions based on the centrally learned policy but conditioned on
its own observations, and without direct coordination with the central controller or other
agents.
• Adversarial multi-agent. In adversarial multi-agent systems, agents operate in a
competitive environment where each agent’s objectives are directly opposed to those of other
agents. These systems often involve strategic interactions, where agents must anticipate
and react to the actions of other agents in order to achieve their own objectives.
This classification helps to quickly identify the learning scenario and enables further, more
fine-grained categorisation based on the specific RL algorithm adopted.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Network configuration dimension</title>
        <p>The other dimension we consider is the network configuration of the cybersecurity measures.
• Host-based cybersecurity. The defence system is focused on protecting individual
devices (hosts) such as computers, servers, mobile devices, and endpoints. As such, it
involves installing security software directly on these devices. This software may include
antivirus programs, firewalls, intrusion detection/prevention systems (IDS/IPS) [11], and
endpoint protection platforms (EPP). Host-based cybersecurity measures are essential for
safeguarding against threats like malware, unauthorised access, and data breaches that
may target specific devices (see Figure 2).
• Network-based cybersecurity. The focus shifts to securing the communication
pathways between different devices and sub-systems within a network. It involves
implementing security measures at the network level to detect and prevent unauthorised access,
malicious activities, and data breaches. Network-based security solutions include firewalls,
intrusion detection/prevention systems (IDS/IPS), virtual private networks (VPNs), and
network access control (NAC) systems (see Figure 2). These measures help protect against
threats such as unauthorised access attempts, malware propagation, and network-based
attacks like DDoS (Distributed Denial of Service) attacks.
• Network-based cybersecurity centralised with SDN. This configuration combines
network-based cybersecurity measures with Software-Defined Networking (SDN, see
Figure 3) technology. SDN [12] is an approach to networking that separates the control
plane from the data plane, allowing for centralised management and programmability of
network resources. Here, security policies and controls are centrally managed and
enforced across the network infrastructure through software-defined policies. This enables
more dynamic and granular control over network traffic. SDN-based security solutions
include centralised firewall management, dynamic access control policies, and real-time
threat intelligence integration, among others.</p>
        <p>This classification helps to quickly identify what kind of defence is needed.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Proposed Classification</title>
        <p>By organising our analysis along these two dimensions, our survey endeavours to furnish a
comprehensive overview of the strengths, limitations, and potential applications of diverse RL
methodologies within diverse network settings. This way, we can construct a holistic portrayal
of the current state-of-the-art in RL applications for cybersecurity, depicted in Figure 4, shedding
light on emerging trends and challenges.</p>
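        <p>One way to make the two-dimensional classification concrete is a small data structure. The following Python sketch is illustrative only: the names Technique, Configuration, Entry, and by_cell are hypothetical, and the four sample entries simply mirror the category under which each work is discussed in this section.</p>
        <preformat>
```python
from dataclasses import dataclass
from enum import Enum


class Technique(Enum):
    SINGLE_AGENT = "single-agent"
    CENTRALISED = "centralised multi-agent"
    DECENTRALISED = "decentralised multi-agent"
    CTDE = "multi-agent CTDE"
    ADVERSARIAL = "adversarial multi-agent"


class Configuration(Enum):
    HOST = "host-based"
    NETWORK = "network-based"
    NETWORK_SDN = "network-based centralised with SDN"


@dataclass(frozen=True)
class Entry:
    citation: str
    technique: Technique
    configuration: Configuration


# Placement follows the category heading under which each work is discussed.
ENTRIES = [
    Entry("Liu et al. [13]", Technique.SINGLE_AGENT, Configuration.HOST),
    Entry("Liu et al. [21]", Technique.SINGLE_AGENT, Configuration.NETWORK),
    Entry("Janakiraman and Deva Priya [23]", Technique.CENTRALISED, Configuration.NETWORK),
    Entry("Malialis and Kudenko [25]", Technique.DECENTRALISED, Configuration.NETWORK),
]


def by_cell(entries):
    """Group citations by (technique, configuration) cell of the classification."""
    cells = {}
    for entry in entries:
        key = (entry.technique, entry.configuration)
        cells.setdefault(key, []).append(entry.citation)
    return cells


cells = by_cell(ENTRIES)
for (technique, configuration), citations in cells.items():
    print(technique.value, "/", configuration.value, ":", ", ".join(citations))
```
        </preformat>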
        <p>Single-agent RL for host-based security. This category gathers the most approaches, as it
represents those solutions that are technically easier to set up: a single learning agent learns
based on the inputs coming from all the devices in the network, and controls all the security
measures therein installed.</p>
        <p>Liu et al. [13] present an RL-based approach to enhance the security of wireless networks
by mitigating spoofing attacks. The receiver (Bob) acts as an agent using the Q-learning
algorithm to make decisions about authenticating packets. Bob’s state (<inline-formula><tex-math>s</tex-math></inline-formula>) represents his
current knowledge of the channel conditions and historical authentication results. His action
(<inline-formula><tex-math>a</tex-math></inline-formula>) involves selecting a threshold for authentication to decide whether an incoming packet is
legitimate or spoofed. Bob observes the packet’s physical-layer characteristics, such as signal
strength and channel properties. The reward function (<inline-formula><tex-math>r</tex-math></inline-formula>) provides feedback based on the
accuracy of Bob’s authentication decisions, rewarding correct identifications and penalising
false alarms and missed detections, as follows:
<disp-formula><tex-math><![CDATA[
r =
\begin{cases}
r_{\text{correct}} & \text{if correct identification (legitimate packet)} \\
r_{\text{false\_alarm}} & \text{if false alarm (legitimate packet classified as spoofed)} \\
r_{\text{missed\_detection}} & \text{if missed detection (spoofed packet not detected)} \\
r_{\text{correct\_rejection}} & \text{if correct rejection (spoofed packet identified)}
\end{cases}
]]></tex-math></disp-formula>
Through repeated interactions and rewards, Bob learns to improve his authentication policy,
thus enhancing the network’s resilience to spoofing attacks.</p>
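        <p>To illustrate how a four-case reward of this kind can drive threshold selection, the following Python sketch applies tabular Q-learning to a toy authentication task. It is not the implementation of [13]: the Gaussian channel model in sample_packet, the candidate thresholds, the numeric reward values, the spoofing rate, and the choice of the previous outcome as the state are all assumptions made for illustration.</p>
        <preformat>
```python
import random

THRESHOLDS = [1.0, 1.5, 2.0, 2.5, 3.0]    # assumed candidate test thresholds
OUTCOMES = ["correct", "false_alarm", "missed_detection", "correct_rejection"]
# Assumed reward values; the text above only names the four cases.
REWARD = {"correct": 1.0, "false_alarm": -1.0,
          "missed_detection": -2.0, "correct_rejection": 1.0}


def classify_outcome(is_spoofed, accepted):
    """Map ground truth and Bob's decision to one of the four cases."""
    if is_spoofed:
        return "missed_detection" if accepted else "correct_rejection"
    return "correct" if accepted else "false_alarm"


def sample_packet(rng, spoof_rate=0.3):
    """Hypothetical channel model: spoofed packets deviate more from the
    expected channel response than legitimate ones."""
    is_spoofed = spoof_rate > rng.random()
    statistic = rng.gauss(3.0 if is_spoofed else 1.0, 0.5)
    return is_spoofed, statistic


def greedy(q_row):
    """Index of the highest-valued action in one row of the Q-table."""
    return max(range(len(q_row)), key=q_row.__getitem__)


def q_learning(steps=50000, alpha=0.01, gamma=0.5, epsilon=0.2, seed=1):
    rng = random.Random(seed)
    # State: outcome of the previous authentication (a crude "history").
    q = {s: [0.0] * len(THRESHOLDS) for s in OUTCOMES}
    state = "correct"
    for _ in range(steps):
        if epsilon > rng.random():
            action = rng.randrange(len(THRESHOLDS))   # explore
        else:
            action = greedy(q[state])                 # exploit
        is_spoofed, statistic = sample_packet(rng)
        # Accept the packet when its statistic does not exceed the threshold.
        accepted = THRESHOLDS[action] >= statistic
        next_state = classify_outcome(is_spoofed, accepted)
        reward = REWARD[next_state]
        # Standard Q-learning update with a bootstrapped target.
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
    return q


q = q_learning()
best = THRESHOLDS[greedy(q["correct"])]
print("threshold preferred after a correct acceptance:", best)
```
        </preformat>
        <p>Under these assumptions, very low thresholds cause many false alarms and very high thresholds let spoofed packets through, so the learned policy is expected to settle on an intermediate threshold.</p>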
        <p>Elnaggar and Bezzo [14] introduce a method to predict and recover from cyber-physical
attacks on Unmanned Aerial Vehicles (UAVs, aircraft operating without a human pilot on board)
using Inverse Reinforcement Learning (IRL). The focus is on scenarios where a UAV has to reach a
particular position and the attackers try to manipulate the sensor data to disrupt its navigation.
The key components of the approach are: Actions, which refer to specific movements or
adjustments the UAV can make, such as changing direction, speed, or altitude; States and
Observations, representing the various conditions or positions of the UAV within its
environment, such as the UAV’s geographic coordinates, velocity, altitude, along with sensor readings
from gyroscope, accelerometer, and GPS; Policy, a strategy derived from IRL that guides the
system to make decisions that avoid the attacker’s goals and maintain a desired operational
level; and Reward, a function that evaluates the success of actions in maintaining system
integrity and achieving goals.</p>
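        <p>The reward function <inline-formula><tex-math>R(s, a)</tex-math></inline-formula> is designed to reflect the system’s objectives and the attacker’s
interference. For example, it might be formulated as:
<disp-formula><tex-math><![CDATA[
R(s, a) = -\left( \alpha \cdot d(s, g) + \beta \cdot I(s) \right)
]]></tex-math></disp-formula>
where <inline-formula><tex-math>d(s, g)</tex-math></inline-formula> is the distance from the current state <inline-formula><tex-math>s</tex-math></inline-formula> to the goal state <inline-formula><tex-math>g</tex-math></inline-formula>, <inline-formula><tex-math>I(s)</tex-math></inline-formula> is an
indicator function that penalises unsafe states, and <inline-formula><tex-math>\alpha</tex-math></inline-formula> and <inline-formula><tex-math>\beta</tex-math></inline-formula> are weighting factors. The algorithm
uses Bayesian IRL within a Markov Decision Process framework, applying Markov Chain Monte Carlo
sampling to predict the attacker’s intentions. The system’s effectiveness is demonstrated
through simulations involving a UAV navigating a stochastic environment.</p>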
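        <p>A reward of this shape is straightforward to evaluate. The Python sketch below assumes a two-dimensional grid position as the state, Euclidean distance to the goal, a hypothetical set of unsafe cells for the indicator, and arbitrary values for the two weighting factors (ALPHA and BETA here). The action argument is omitted because the example expression above depends only on the state. None of these choices is taken from [14].</p>
        <preformat>
```python
import math

ALPHA = 1.0    # assumed weight on the distance to the goal
BETA = 10.0    # assumed weight on the unsafe-state penalty


def distance(state, goal):
    """Euclidean distance between two (x, y) positions."""
    return math.hypot(state[0] - goal[0], state[1] - goal[1])


def is_unsafe(state, unsafe_cells):
    """Indicator: 1 if the state lies in a (hypothetical) unsafe cell, else 0."""
    return 1 if state in unsafe_cells else 0


def reward(state, goal, unsafe_cells):
    """Negative weighted sum of distance to goal and unsafe-state penalty."""
    return -(ALPHA * distance(state, goal) + BETA * is_unsafe(state, unsafe_cells))


goal = (3, 4)
unsafe = {(1, 1)}
print(reward((0, 0), goal, unsafe))   # -5.0: safe cell, 5 units from the goal
print(reward((1, 1), goal, unsafe))   # closer to the goal, but penalised as unsafe
```
        </preformat>
        <p>With these weights, entering an unsafe cell costs more than any distance gain it could bring on this small grid, which is the kind of trade-off the weighting factors control.</p>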
        <p>Xu et al. proposed a series of works [15, 16, 17] where they introduce TD-SAD
(temporal-difference-based sequential anomaly detection) to combat multi-stage cyber attacks in computer
systems, showing high detection rates and low false alarm rates across various types of
program traces. Building upon this, they extended it to enhance anomaly detection in
host-based IDS, demonstrating the effectiveness of RL techniques. Finally, they proposed another
method for detecting anomalies in host computers using sequential anomaly detection based on
temporal-difference (TD) learning principles, highlighting its efficiency in modelling complex
sequential behaviours without prior knowledge of the underlying processes.</p>
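        <p>The general idea of using TD learning for sequential anomaly detection can be sketched as follows. This is a simplified illustration and not the TD-SAD algorithm of [15, 16, 17]: it assumes that each state is a single event in a trace, that a reward of 1 is received only at the end of a trace labelled anomalous, and that no discounting is applied, so the learned value of a state approximates the share of traces through it that end up anomalous. The traces and the 0.8 flagging threshold are made up.</p>
        <preformat>
```python
from collections import defaultdict


def td0_anomaly_values(labelled_traces, alpha=0.1, epochs=500):
    """TD(0) value estimation over labelled event traces.

    The reward is 1 only on the final step of an anomalous trace,
    and the value after the final step is taken to be 0.
    """
    values = defaultdict(float)
    for _ in range(epochs):
        for trace, is_anomalous in labelled_traces:
            for i, state in enumerate(trace):
                last = i == len(trace) - 1
                reward = 1.0 if (last and is_anomalous) else 0.0
                next_value = 0.0 if last else values[trace[i + 1]]
                # Move the estimate towards the one-step bootstrapped target.
                values[state] += alpha * (reward + next_value - values[state])
    return dict(values)


# Made-up traces: one normal and one anomalous sequence of events.
TRACES = [
    (["open", "read", "write", "close"], False),
    (["open", "read", "exec", "chmod"], True),
]

values = td0_anomaly_values(TRACES)
flagged = sorted(s for s, v in values.items() if v > 0.8)
print(flagged)
```
        </preformat>
        <p>Events seen only in the anomalous trace end up with values close to 1, events seen only in the normal trace stay at 0, and shared events settle in between, so a threshold on the value separates them.</p>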
        <p>Xiao et al. [18] delve into the security vulnerabilities inherent in Mobile Edge Computing
(MEC) systems. The paper uses RL to enhance security measures such as secure mobile offloading
against smart attacks, lightweight authentication, and collaborative caching schemes.</p>
        <p>Feng et al. [19] address the challenge of defending against application-layer distributed
denial-of-service (L7 DDoS) attacks, which exploit legitimate-appearing application-layer requests
to overwhelm server functions. Traditional DDoS defences struggle with L7 DDoS attacks
because these attacks are hard to distinguish from normal activity at the transport and network layers. The authors propose a defence
mechanism using RL, where an agent learns to mitigate these attacks through a multi-objective
reward function. This function balances the aggressive mitigation of malicious requests during
severe attacks with conservative mitigation to minimise collateral damage to legitimate traffic
under normal conditions. Their evaluation reports that the proposed approach
mitigates 98.73% of malicious events.</p>
        <p>Oh and Iyengar [20] introduce a sequential anomaly detection method using Inverse RL. It
models an agent’s behaviour through a learned reward function, identifying anomalies when
behaviours deviate from expected patterns. A Bayesian extension to IRL incorporates model
uncertainty, enhancing reliability. Key contributions include the application of IRL to anomaly
detection and the handling of varying-length input trajectories in real time.
The effectiveness of this approach is demonstrated through empirical studies on
publicly available real-world data.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Single-agent RL for network-based security</title>
        <p>This category includes approaches where
a single learning agent is responsible for monitoring and securing the entire network. The
agent analyses network traffic and activities in network devices (routers, firewalls, switches,
gateways, ...), making decisions to enhance the network’s overall security posture.</p>
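        <p>Liu et al. [21] present a Deep RL approach for mitigating Distributed Denial of Service (DDoS)
attacks in SDNs. The system employs a Deep Deterministic Policy Gradient algorithm. The
state space captures network features from OpenFlow switches; the action space configures
bandwidth limits for hosts using OpenFlow meters; and the observation process continuously
monitors network traffic. The reward function, defined as:
<disp-formula><tex-math><![CDATA[
\text{reward} =
\begin{cases}
\lambda \, p_b + (1 - \lambda)(1 - p_a) & \text{if } \text{Load} \le U \\
-1 & \text{if } \text{Load} > U
\end{cases}
]]></tex-math></disp-formula>
penalises server overload (a load above the bound <inline-formula><tex-math>U</tex-math></inline-formula>) and, through the weight <inline-formula><tex-math>\lambda</tex-math></inline-formula>, rewards maximising benign traffic (<inline-formula><tex-math>p_b</tex-math></inline-formula>) while minimising attack traffic (<inline-formula><tex-math>p_a</tex-math></inline-formula>).
In summary, the DRL-based approach dynamically adjusts bandwidth allocations to
mitigate DDoS attacks.</p>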
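        <p>The piecewise structure of such a reward can be sketched in a few lines of Python. The function and parameter names (ddos_reward, benign_share, attack_share, load_limit, weight) are illustrative assumptions, as is the default weight of 0.5; only the overall shape (a fixed penalty of -1 on overload, otherwise a weighted combination that favours benign traffic and disfavours attack traffic) follows the description above.</p>
        <preformat>
```python
def ddos_reward(benign_share, attack_share, load, load_limit, weight=0.5):
    """Piecewise reward for a bandwidth-limiting agent.

    benign_share and attack_share are assumed to be fractions in [0, 1]
    of benign and attack traffic that reach the server.
    """
    if load > load_limit:
        # Server overload is always penalised, whatever the traffic mix.
        return -1.0
    # Otherwise reward benign traffic and the absence of attack traffic.
    return weight * benign_share + (1.0 - weight) * (1.0 - attack_share)


# All benign traffic served, no attack traffic, server within its limit.
print(ddos_reward(1.0, 0.0, load=0.5, load_limit=1.0))
# Server overloaded: fixed penalty.
print(ddos_reward(0.8, 0.2, load=2.0, load_limit=1.0))
```
        </preformat>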
        <p>Veluchamy and Kathavarayan [22] proposed the Deep Adaptive RL for Honeypots (DARLH)
system that operates in both single-agent and multi-agent paradigms to enhance security in
honeypot environments. These are environments featuring decoy systems set up to attract and
analyse cyber attackers, by gathering data on attack methods to improve security. At the
single-agent level, the agent autonomously learns and makes decisions based on its observations of
network traffic and system behaviour. This single-agent approach allows for adaptive behaviour
and decision-making tailored to the specific environment. At the multi-agent level, the system
integrates multiple agents, each responsible for monitoring different aspects of the honeypot
environment. These agents collaborate to share information, coordinate actions, and collectively
contribute to the overall security posture of the system. This two-level architecture combines
the adaptability and autonomy of single-agent systems with the collaborative and coordinated
capabilities of multi-agent systems, offering a holistic approach to network security.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Multi-agent centralised RL for network-based security</title>
        <p>This category encompasses
methods where multiple learning agents are coordinated centrally to secure the network. Each
agent focuses on a specific aspect of the network, and their actions are managed by a central
controller to ensure cohesive and effective security measures.</p>
        <p>Janakiraman and Deva Priya [23] propose a Deep RL approach exploiting Long Short Term
Memory (LSTM) networks for mitigating DDoS attacks in fog-assisted cloud environments.
Multiple agents collaborate in a centralised network-based approach to identify and mitigate
DDoS attacks at the network layer. By utilising SDN controllers (see Section 4.2), the system is
able to analyse network traffic and differentiate between legitimate and malicious packets. The
LSTM component is used for its ability to handle time-dependent data and effectively categorise
incoming packets. The reward in this context is defined as the successful identification and
mitigation of DDoS attacks while minimising false positives and maintaining the availability of
legitimate network services. A detailed discussion on the reward structure and its implications
can be found in Section 3 of the paper.</p>
        <p>Multi-agent RL for host-based security. In this category, we explore research efforts focused
on leveraging the collective intelligence and collaborative decision-making of multiple agents
to fortify cybersecurity measures on a single device.</p>
        <p>Dasgupta et al. [24] focus on detecting and mitigating GPS spoofing attacks, crucial in
transportation cyber-physical systems. They propose a deep RL based method, using in-vehicle
sensor data and signal processing to detect spoofing attacks turn-by-turn. The State (S) includes
the positions and movements of vehicles as well as the signals received from GPS satellites, and
encapsulates information about the system’s susceptibility to spoofing attacks. The Action (A)
in this scenario corresponds to the responses that the system can take to detect and mitigate GPS
spoofing attacks. These may include adjusting the navigation algorithms, re-calibrating sensors,
or deploying countermeasures to verify the authenticity of GPS signals. The Observation
(O) consists of the data collected from in-vehicle sensors and GPS receivers, as well as the
feedback from the detection and mitigation mechanisms. These observations provide insights
into the effectiveness of the system’s response to spoofing attacks and guide further
decision-making processes. The Reward (R) signal in this context reflects the immediate benefits or
costs associated with the system’s actions in response to spoofing attacks. Rewards could be
based on successfully detecting and mitigating spoofing attempts, minimising disruptions to
navigation systems, or avoiding accidents caused by misleading GPS information. This study
demonstrates the potential of RL-based approaches to bolster cybersecurity in transportation
systems vulnerable to GPS spoofing attacks. Employing a multi-agent system, the method
enhances detection accuracy through collaborative decision-making among agents.</p>
      </sec>
      <sec id="sec-4-6">
        <title>Decentralised multi-agent RL for network-based security</title>
        <p>This category involves approaches
where multiple learning agents work independently but collaboratively to secure the
network. Each agent is responsible for a segment of the network, making autonomous decisions
while communicating with other agents to maintain overall network security.</p>
        <p>Malialis and Kudenko [25] propose a framework that involves deploying multiple agents
within the network to coordinate responses against threats. These agents use RL to dynamically
adjust router throttling mechanisms, effectively mitigating DDoS attacks’ impact on network
performance and availability. The approach is decentralised, as each agent operates
independently, making decisions based on local observations and interactions with the environment.
However, they collaborate indirectly by collectively improving the overall network resilience
through their individual actions.</p>
        <p>Bhagyashree Deokar [26] proposes a cooperative learning method for IDS based on
multi-agent systems. The system architecture involves multiple agents distributed across different
hosts, each responsible for monitoring network connections and system log files. These agents
collaborate in a decentralised manner by sharing information and making local decisions based
on their observations, contributing to a collective decision about whether an intrusion has
occurred. The decision-making process utilises influence diagrams and Bayesian networks to
model uncertainty and optimise decision outcomes.</p>
        <p>Bhosale et al. [27] propose an approach to IDS by leveraging a multi-agent framework and
RL techniques. It addresses the limitations of traditional single-agent IDS, which struggle to
handle the complexity and real-time demands of modern network security. By employing a
multi-agent system, each agent possesses partial information and collaborates with others to
improve decision-making capabilities. The decision-making process is facilitated by influence
diagrams, which represent probabilistic relationships between events and guide local
decision-making. This approach leans towards decentralisation, as agents collaborate but maintain their
autonomy in decision-making.</p>
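        <p>The kind of computation an influence diagram encodes in [26] and [27] can be pictured with a small worked example. The sketch below is a simplification under stated assumptions, not the authors' model: two agents report local evidence as likelihood pairs, the reports are fused with Bayes' rule assuming conditional independence, and the action with the highest expected utility is selected. The prior, likelihoods, utilities, and function names are all hypothetical.</p>
        <preformat><![CDATA[
```python
def posterior_intrusion(prior, reports):
    """Fuse agents' local evidence with Bayes' rule, assuming conditional independence.

    Each report is a pair (P(evidence | intrusion), P(evidence | no intrusion)).
    """
    p_intrusion, p_normal = prior, 1.0 - prior
    for like_intrusion, like_normal in reports:
        p_intrusion *= like_intrusion
        p_normal *= like_normal
    return p_intrusion / (p_intrusion + p_normal)


# Hypothetical utility table: UTILITY[action][true state]
UTILITY = {
    "raise_alarm": {"intrusion": 10.0, "normal": -2.0},
    "ignore": {"intrusion": -20.0, "normal": 0.0},
}


def best_action(p_intrusion):
    """Return the action with maximum expected utility, as a decision node would."""
    def expected(action):
        utility = UTILITY[action]
        return p_intrusion * utility["intrusion"] + (1.0 - p_intrusion) * utility["normal"]
    return max(UTILITY, key=expected)


# One agent watches network connections, the other watches system logs (assumed values).
reports = [(0.7, 0.2), (0.6, 0.3)]
p = posterior_intrusion(prior=0.05, reports=reports)
print(round(p, 3), best_action(p))
```
]]></preformat>
        <p>With the prior alone, the expected-utility comparison favours ignoring the event. Once both reports are combined the posterior rises to about 0.27, and raising the alarm becomes the better choice.</p>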
        <p>Shamshirband et al. [28] use Cooperative Game-based Fuzzy Q-learning (G-FQL) for detecting
and preventing intrusions, particularly DDoS attacks, in wireless sensor networks (WSNs).
G-FQL integrates game theory, fuzzy Q-learning, and a cooperative defence strategy involving sink
nodes, a base station, and attackers. The cooperative game mechanism allows the sensor nodes
to act as rational decision-makers, collaborating to detect and defend against attacks. Fuzzy
Q-learning reinforces the nodes’ self-learning abilities, providing them with incentive functions
to protect vulnerable sensor nodes. This approach is a mix of centralisation and decentralisation,
as the system involves centralised elements such as the base station coordinating overall strategy,
while individual sensor nodes operate autonomously but collaborate within the overarching
framework.</p>
      </sec>
      <sec id="sec-4-7">
        <title>Adversarial RL for network-based security</title>
        <p>This category focuses on a specific RL setting, termed adversarial, where RL agents learn a policy that is actively disrupted by hostile agents
(termed “threat actors” or “other problems” in Figure 1, which become the adversarial agents).
The learning agents are trained to anticipate and counteract sophisticated attacks, thereby enhancing
the network’s resilience against adversarial threats.</p>
        <p>Caminero et al. [29] propose Adversarial Environment using RL, an intrusion detection approach that pairs a classifier,
acting as the primary agent, with a simulated environment in a multi-agent setup. The environment generates network traffic samples
and provides rewards based on the classifier’s predictive accuracy. The classifier’s objective is to
predict the correct intrusion label for the given network samples, while the environment’s goal
is to actively increase the difficulty of predictions by behaving adversarially, challenging the
classifier to learn from the most difficult cases. By maximising the rewards obtained from the environment,
the classifier learns to adapt to these challenges, leading to enhanced performance in detecting and classifying intrusions
within network traffic. This dynamic interaction between the classifier and the adversarial
environment forms the core of the approach, enabling it to
address the evolving threats and complexities in network security.</p>
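        <p>The interplay between the two agents can be sketched in a few lines of Python. This is a toy illustration under our own assumptions rather than the method of [29]: both agents are tabular epsilon-greedy learners, the environment agent chooses which traffic class to serve next and is rewarded when the classifier errs, and the dataset, labels, and hyperparameters are hypothetical.</p>
        <preformat><![CDATA[
```python
import random

LABELS = ["normal", "dos", "probe"]
# Hypothetical toy dataset: discretised traffic features grouped by their true label.
DATASET = {
    "normal": [("tcp", "low"), ("udp", "low")],
    "dos": [("tcp", "flood"), ("icmp", "flood")],
    "probe": [("tcp", "scan"), ("udp", "scan")],
}


def epsilon_greedy(values, epsilon, rng):
    """Pick a random index with probability epsilon, else the index of the largest value."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return values.index(max(values))


def train(steps=3000, alpha=0.2, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    classifier_q = {}                     # features -> one value per candidate label
    environment_q = [0.0] * len(LABELS)   # one value per class the environment can serve
    for _ in range(steps):
        # Environment agent: choose the class from which the next sample is drawn.
        true_idx = epsilon_greedy(environment_q, epsilon, rng)
        features = rng.choice(DATASET[LABELS[true_idx]])
        # Classifier agent: predict a label for the sample.
        q_row = classifier_q.setdefault(features, [0.0] * len(LABELS))
        predicted_idx = epsilon_greedy(q_row, epsilon, rng)
        correct = predicted_idx == true_idx
        # Opposed rewards: the classifier gains when right, the environment when it is wrong.
        classifier_reward = 1.0 if correct else -1.0
        environment_reward = 0.0 if correct else 1.0
        q_row[predicted_idx] += alpha * (classifier_reward - q_row[predicted_idx])
        environment_q[true_idx] += alpha * (environment_reward - environment_q[true_idx])
    return classifier_q, environment_q


def predict(classifier_q, features):
    """Greedy prediction from the learned table (unseen features default to all zeros)."""
    q_row = classifier_q.get(features, [0.0] * len(LABELS))
    return LABELS[q_row.index(max(q_row))]


classifier_q, environment_q = train()
samples = [(features, label) for label, pool in DATASET.items() for features in pool]
accuracy = sum(predict(classifier_q, f) == label for f, label in samples) / len(samples)
print("accuracy on the toy dataset:", accuracy)
```
]]></preformat>
        <p>Because the environment agent earns reward only from misclassifications, its greedy choice drifts towards whichever class the classifier currently gets wrong, which is the adversarial pressure described above.</p>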
        <p>Turner et al. [30] focus on modelling the interactions between attackers and defenders in
a network environment. Attackers aim to exploit vulnerabilities, while defenders seek to
mitigate risks. Multi-Agent Reinforcement Learning (MARL) involves competition between
RL agents representing attackers and defenders, each learning to optimise its strategy based
on feedback received from the environment. Co-evolution involves evolving populations of
strategies for attackers and defenders simultaneously, with each population adapting to the
strategies of its opponent over time. The paper compares the effectiveness of these approaches
in generating robust solutions for cybersecurity challenges, emphasising the importance of
balancing exploration and exploitation in learning strategies.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Limitations, Challenges, and Open Issues</title>
      <p>While RL holds promise for addressing cybersecurity challenges, it also presents certain
limitations, challenges, and open issues, summarised in Table 2.</p>
      <p>Limitations. A first limitation of RL approaches regarding cybersecurity environments is
complexity. Cybersecurity environments often exhibit high-dimensional and dynamic
characteristics, leading to computationally intensive training processes. The complexity arises from the
need to represent diverse network states, attacker behaviours, and defensive actions accurately.
As a result, RL algorithms may encounter scalability issues, prolonged training times, and
resource constraints, limiting their practical applicability in real-world cybersecurity scenarios.</p>
      <p>Another limitation, generally applicable to RL but even more so to cybersecurity, is sample
efficiency. Many RL algorithms require extensive training data to learn effective policies, posing
challenges in resource-constrained cybersecurity settings where collecting sufficient labelled
data is difficult. A notable example in cybersecurity is detecting zero-day attacks, which by definition leave few or no prior examples to learn from.</p>
      <p>Challenges. A first challenge for RL is posed by adversarial attacks, which exploit
vulnerabilities in RL-based cybersecurity systems, for example by poisoning the training data with fake samples, to
manipulate the system’s behaviour, in particular by steering learning agents towards sub-optimal
policies. Another challenge is achieving robust generalisation across diverse cyber threats and
environments, for instance ensuring that an RL-based IDS trained on one network architecture
performs effectively when deployed in a different one. Finally, ensuring the availability of quality data
is another challenge, requiring data collection strategies that reflect real-world cyber threats,
environments, and defence mechanisms.</p>
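      <p>The poisoning threat mentioned above can be illustrated with a deliberately small Python sketch. It is an assumption-based toy, not drawn from the surveyed works: a stateless Q-learning defender chooses between two hypothetical responses, and an attacker replaces the genuine feedback for the better response with a fake penalty. The action names, reward values, and the poison_rate parameter are illustrative.</p>
      <preformat><![CDATA[
```python
import random

ACTIONS = ["block", "rate_limit"]
TRUE_REWARD = {"block": 1.0, "rate_limit": 0.5}   # hypothetical: blocking is the better response


def train_defender(poison_rate, target="block", steps=2000, alpha=0.1, epsilon=0.1, seed=0):
    """Train a stateless Q-learner whose feedback for `target` may be tampered with."""
    rng = random.Random(seed)
    q = {action: 0.0 for action in ACTIONS}
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)        # explore
        else:
            action = max(ACTIONS, key=q.get)    # exploit current estimates
        reward = TRUE_REWARD[action]
        # Poisoning: the attacker swaps genuine feedback for the target action with a fake penalty.
        if action == target and rng.random() < poison_rate:
            reward = -1.0
        q[action] += alpha * (reward - q[action])
    return max(ACTIONS, key=q.get), q


clean_policy, clean_q = train_defender(poison_rate=0.0)
poisoned_policy, poisoned_q = train_defender(poison_rate=1.0)
print("clean feedback    ->", clean_policy)
print("poisoned feedback ->", poisoned_policy)
```
]]></preformat>
      <p>With clean feedback the learner prefers the better response, whereas with fully poisoned feedback for that response it settles on the weaker one, which is the sub-optimal steering effect described above.</p>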
      <p>Open Issues. Amongst the open issues still to be fully investigated in RL applied to
cybersecurity, at least two emerge strongly from the surveyed literature: explainability and transferability.
The former amounts to ensuring that RL-based cybersecurity systems are understandable and
transparent to human beings. Achieving explainability involves making the decisions and
actions of RL algorithms interpretable to stakeholders. For instance, in autonomous threat
response systems, explainability ensures that security analysts can comprehend the reasoning
behind the system’s actions and trust its recommendations. Transferability instead amounts to
transferring knowledge and policies learned in one cybersecurity context to another. This would
improve efficiency and effectiveness, as there would be less need to retrain RL
systems from scratch, for instance by leveraging knowledge from detecting malware to enhance
intrusion detection in network traffic.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion &amp; Future Works</title>
      <p>In this paper, we have discussed the potential of RL to enhance cybersecurity defences by offering
adaptive and dynamic mechanisms to combat emerging threats. We highlighted limitations
of traditional approaches, explained why RL can help overcome them, and then proposed a
bi-dimensional classification to help researchers enter the field or get a bird’s-eye view.</p>
      <p>While our analysis demonstrates the promise of RL in improving detection, mitigation, and
response capabilities, the surveyed literature also identifies challenges that must be addressed.
These include complexity, sample efficiency, and vulnerabilities to adversarial attacks. Moving
forward, it is imperative to focus on developing robust and explainable RL-based defence
mechanisms, as well as exploring techniques for knowledge transfer and generalisation across
diverse cyber threats and environments. By addressing these challenges, we can harness the
full potential of RL to fortify cybersecurity defences and mitigate emerging threats effectively.</p>
      <p>
[5] J. H. Connell, K. Sridhar Mahadevan, Robot Learning, Robotica 17 (1999) 229–235. doi:10.1017/S0263574799271172.</p>
      <p>[6] Z. Hu, P. Chen, M. Zhu, P. Liu, Reinforcement Learning for Adaptive Cyber Defense Against Zero-Day Attacks, Springer International Publishing, Cham, 2019, pp. 54–93. doi:10.1007/978-3-030-30719-6_4.</p>
      <p>[7] M. Macas, C. Wu, W. Fuertes, A survey on deep learning for cybersecurity: Progress, challenges, and opportunities, Computer Networks 212 (2022) 109032. doi:10.1016/j.comnet.2022.109032.</p>
      <p>[8] T. T. Nguyen, V. J. Reddi, Deep reinforcement learning for cyber security, IEEE Transactions on Neural Networks and Learning Systems 34 (2023) 3779–3795. doi:10.1109/TNNLS.2021.3121870.</p>
      <p>[9] Y. Guo, A review of machine learning-based zero-day attack detection: Challenges and future directions, Computer Communications 198 (2023) 175–185. doi:10.1016/j.comcom.2022.11.001.</p>
      <p>[10] P. Dixit, S. Silakari, Deep learning algorithms for cybersecurity applications: A technological and status review, Computer Science Review 39 (2021) 100317. URL: https://www.sciencedirect.com/science/article/pii/S1574013720304172. doi:10.1016/j.cosrev.2020.100317.</p>
      <p>[11] D. Denning, An intrusion-detection model, IEEE Transactions on Software Engineering SE-13 (1987) 222–232. doi:10.1109/TSE.1987.232894.
</p>
      <p>[12] S. Shin, L. Xu, S. Hong, G. Gu, Enhancing network security through software defined networking (SDN), in: 25th International Conference on Computer Communication and Networks (ICCCN), 2016, pp. 1–9. doi:10.1109/ICCCN.2016.7568520.</p>
      <p>[13] J. Liu, L. Xiao, G. Liu, Y. Zhao, Active authentication with reinforcement learning based on ambient radio signals, Multimedia Tools and Applications 76 (2017) 3979–3998. doi:10.1007/s11042-015-2958-x.</p>
      <p>[14] M. Elnaggar, N. Bezzo, An IRL approach for cyber-physical attack intention prediction and recovery, in: 2018 Annual American Control Conference (ACC), 2018, pp. 222–227. doi:10.23919/ACC.2018.8430922.</p>
      <p>[15] X. Xu, T. Xie, A reinforcement learning approach for host-based intrusion detection using sequences of system calls, in: Advances in Intelligent Computing, Springer Berlin Heidelberg, Berlin, Heidelberg, 2005, pp. 995–1003. doi:10.1007/11538059_103.</p>
      <p>[16] X. Xu, Y. Luo, A kernel-based reinforcement learning approach to dynamic behavior modeling of intrusion detection, in: Advances in Neural Networks – ISNN 2007, Springer Berlin Heidelberg, 2007, pp. 455–464. doi:10.1007/978-3-540-72383-7_54.</p>
      <p>[17] X. Xu, Sequential anomaly detection based on temporal-difference learning: Principles, models and case studies, Applied Soft Computing 10 (2010) 859–867. doi:10.1016/j.asoc.2009.10.003.</p>
      <p>[18] L. Xiao, X. Wan, C. Dai, X. Du, X. Chen, M. Guizani, Security in mobile edge caching with reinforcement learning, IEEE Wireless Communications 25 (2018) 116–122. doi:10.1109/MWC.2018.1700291.</p>
      <p>[19] Y. Feng, J. Li, T. Nguyen, Application-layer DDoS defense with reinforcement learning, in: IEEE/ACM 28th International Symposium on Quality of Service (IWQoS), 2020, pp. 1–10. doi:10.1109/IWQoS49365.2020.9213026.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Morgan</surname>
          </string-name>
          ,
          <article-title>Cybercrime to cost the world 8 trillion annually in 2023</article-title>
          ,
          <year>2022</year>
          . URL: https://web.archive.org/web/20240429123425/https://cybersecurityventures.com/cybercrime-to-cost-the-world-8-trillion-annually-in-2023/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Uprety</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. B.</given-names>
            <surname>Rawat</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning for iot security: A comprehensive survey</article-title>
          ,
          <source>IEEE Internet of Things Journal</source>
          <volume>8</volume>
          (
          <year>2021</year>
          )
          <fpage>8693</fpage>
          -
          <lpage>8706</lpage>
          . doi:10.1109/JIOT.2020.3040957.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. M. K.</given-names>
            <surname>Adawadkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <article-title>Cyber-security and reinforcement learning - a brief survey</article-title>
          ,
          <source>Engineering Applications of Artificial Intelligence</source>
          <volume>114</volume>
          (
          <year>2022</year>
          )
          105116. doi:10.1016/j.engappai.2022.105116.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Cengiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gök</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning applications in cyber security: A review</article-title>
          ,
          <source>Sakarya University Journal of Science</source>
          <volume>27</volume>
          (
          <year>2023</year>
          )
          <fpage>481</fpage>
          -
          <lpage>503</lpage>
          . doi:10.16984/saufenbilder.1237742.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>