Multi-agent Reinforcement Learning for Cybersecurity: Approaches and Challenges

Salvo Finistrella*, Stefano Mariani and Franco Zambonelli
University of Modena and Reggio Emilia, Reggio Emilia, Italy

Abstract
In the face of the rapidly evolving threat landscape, traditional security measures often lag behind sophisticated cyber attacks. Through a review of existing literature, we examine the shortcomings of conventional cybersecurity methods, highlighting the need for Reinforcement Learning (RL) based methods. Our study classifies various RL approaches in cybersecurity, aimed at enhancing detection, mitigation, and response capabilities, along two dimensions: the RL technique used, and the network configuration. Moving forward, we emphasise the importance of further research and development to address challenges such as model complexity, sample efficiency, and vulnerabilities to adversarial attacks.

Keywords
Reinforcement learning, Cybersecurity, Multi-agent system, DoS attack mitigation, Intrusion Detection System (IDS)

WOA 2024: 25th Workshop "From Objects to Agents", July 8-10, 2024, Forte di Bard (AO), Italy
* Corresponding author.
salvo.finistrella@unimore.it (S. Finistrella); stefano.mariani@unimore.it (S. Mariani); franco.zambonelli@unimore.it (F. Zambonelli)
https://smarianimore.github.io (S. Mariani)
ORCID: 0009-0004-8597-9031 (S. Finistrella); 0000-0001-8921-8150 (S. Mariani); 0000-0002-9421-8566 (F. Zambonelli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

On a global scale, projections indicate that the cost of cybercrime will surpass 8 trillion dollars, cementing its status as the world's third-largest and most rapidly expanding economy [1]. Such cost includes damage and destruction of data, stolen money, lost productivity, theft of intellectual property, theft of personal and financial data, fraud, post-attack disruption, forensic investigation, restoration and deletion of hacked data and systems, and reputational harm. Given this and the escalating nature of cyber threats, the integration of Reinforcement Learning techniques emerges as a promising strategy to fortify cybersecurity defences [2, 3, 4].

RL is an area of machine learning where an active entity (called an agent) is given the goal of learning a behavioural policy through experience, by interacting with an environment through trial and error. While interacting with such an environment, the agent may get rewards for useful actions that advance it towards the task to be accomplished, or punishments for actions that steer it away from it. By aiming at maximising the accumulated rewards, the agent learns which actions lead to favourable outcomes and adjusts its behaviour accordingly [5].

The motivation for employing RL in cybersecurity stems from its ability to introduce dynamic and adaptive defence mechanisms. Traditional approaches struggle to keep pace with the rapidly evolving threat landscape, whereas RL enables security systems to learn from experience and adjust their strategies in a timely, autonomous way. By automating response mechanisms, RL algorithms can not only detect, but also analyse and mitigate cyber threats without human intervention, significantly reducing response times and potential damages.
Furthermore, RL offers continuous learning capabilities, allowing security systems to adapt to novel threats by continuously updating their knowledge and strategies based on new data and experiences.

In this article, we present a classification of RL methods tailored to bolster security measures across diverse domains, encompassing single-agent paradigms as well as multi-agent ones. We classify RL techniques as applied to host-based, network-based, and centralised network-based configurations augmented with Software-Defined Networking (SDN, see Section 4.2). By summarising these approaches, we aim to provide a roadmap for practitioners and researchers navigating the complex landscape of RL applications in cybersecurity.

2. Motivation & Background

In today's ever-evolving cyber landscape, traditional security measures often fall short in defending against new threats. The reasons are many.

• Static defences: they rely on fixed rules, threat signatures, and predefined attack patterns for detection and prevention. Thus, they are inherently limited to recognising known threats and vulnerabilities [6].
• Lack of contextual understanding: they operate within rigid frameworks that do not account for the broader context of an attack, failing to adapt to evolving scenarios. For instance, these systems cannot adapt to new attack vectors or understand the nuances of different operational environments, making it difficult to identify and respond to sophisticated and previously unseen attacks [7].
• Limited scalability: they may become overwhelmed by the sheer volume of data to analyse and the diversity of threats to address [8].
• Inadequate response times: they often rely on manual intervention to address security incidents, introducing obvious delays [8].

These challenges highlight the need for more adaptive and responsive cybersecurity strategies, and RL offers a dynamic and flexible solution to meet these needs. By harnessing RL, security systems can autonomously adjust to emerging threats, which is particularly crucial for countering zero-day attacks: these exploit previously unknown vulnerabilities, leaving victims defenceless with no time to prepare or patch the flaw [9]. In particular, RL can help as follows.

• Dynamic Threat Detection: RL algorithms continuously learn and adapt to new threat patterns through interactions with the environment, improving accuracy and efficiency.
• Real-Time Threat Analysis: RL can identify and neutralise threats with unparalleled speed and efficiency.
• Enhanced Decision-Making: RL learns from experience to make smarter decisions, analysing data patterns and detecting anomalies in threat identification.
• Efficiency Through Automation: RL can remove manual intervention from basic tasks like malware scanning and network traffic monitoring, enhancing consistency and precision.
• Minimising False Positives: RL effectively differentiates between genuine threats and routine activities, reducing false alarms over time.

Table 1: Limitations of traditional approaches to cybersecurity (columns: adaptive defences, context understanding, scalability, response times), and how RL helps improve on them (rows: dynamic detection, real-time analysis, enhanced decision-making, efficiency through automation, adversarial evasion).

Table 1 summarises which RL capabilities help overcome which limitations of traditional approaches.
3. Reinforcement Learning in a Cybersecurity Environment

Reinforcement Learning represents a powerful paradigm in the domain of artificial intelligence, enabling agents to learn optimal behaviour through interaction with their environment. Mimicking the trial-and-error learning process observed in humans and animals, RL algorithms iteratively explore and exploit their environment to maximise cumulative rewards. Rewards can be sparse or dense. The former are given infrequently, making it challenging for the agent to learn desired behaviours. Conversely, dense rewards are provided more frequently, facilitating quicker learning. RL has applications in diverse fields, from robotics and gaming to finance and healthcare, where systems must autonomously adapt to uncertain and dynamic environments.

In the domain of cybersecurity, the core RL concepts, such as state and environment observation, action selection, policy optimisation, reward mechanisms, and goal-driven strategies, can be instantiated as follows (a minimal sketch of the resulting observation-action-reward loop is given after the list).

• Environment state and observation: RL algorithms rely on observing the (possibly, hidden) state of the environment to make decisions. As depicted in Figure 1, in cybersecurity the environment is the network environment, within which several appliances generate the data constituting the state: firewalls, Intrusion Detection Systems (IDSs), Intrusion Prevention Systems (IPSs), proxy servers, sniffers, Operating Systems, and other software. This entails monitoring various data such as network traffic patterns, system logs, software configurations, and user behaviour. The generated observations serve as inputs for the agents to learn.
• Action selection: after observing the state, the RL agent selects actions based on its learned policy, including triggering alarms, deploying patches, updating security configurations, isolating compromised systems, alerting security personnel, blocking services, and dropping packets. The RL environment evaluates the efficacy of these actions, assessing whether the state of security has improved or deteriorated. Rewards or penalties are then issued accordingly (see below).
• Policy optimisation: RL agents continuously refine their decision-making policy through trial and error and rewards, aiming to maximise short- or long-term rewards.
• Reward mechanisms: rewards provide feedback to the agent, indicating the efficacy of its actions. In cybersecurity, their primary goal is to incentivise actions leading to successful detection and mitigation of threats while penalising those that fall short. To achieve this, the reward function is meticulously crafted to align with the objectives of the Intrusion Detection System (IDS). Its design could involve rewarding accurate identification and swift response to threats while penalising false positives and negatives. The specifics of this function's formulation and computation vary, contingent upon the intricacies of the RL algorithm and the objectives of the security system at hand. Subsection 4.3 provides practical examples of rewards.
• Goal-driven strategies: RL agents are driven by overarching goals, such as maximising the security posture of the environment. This encourages them to develop strategies that prioritise actions leading to the most significant reduction in risk and protection of assets.
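To make these concepts concrete, the following is a minimal, illustrative sketch of the observation-action-reward loop for a toy network-defence agent. It is not taken from any of the surveyed systems: the discretised observation (traffic level, IDS alert flag), the three defensive actions, the traffic model, and the ±1 reward are all simplifying assumptions made for the example.

```python
import random
from collections import defaultdict

# Toy environment: the hidden "truth" is whether the current traffic is malicious;
# the agent only sees a coarse observation (traffic level, IDS alert flag).
ACTIONS = ["allow", "alert", "block"]

def observe(malicious):
    """Noisy observation correlated with the (hidden) attack status."""
    level = random.choice([1, 2]) if malicious else random.choice([0, 1])
    alert = malicious and random.random() < 0.8
    return (level, alert)

def reward(action, malicious):
    """+1 for handling the traffic correctly, -1 otherwise."""
    if malicious:
        return 1.0 if action in ("alert", "block") else -1.0
    return 1.0 if action == "allow" else -1.0

# Tabular Q-learning over the discretised observations.
Q = defaultdict(float)
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for step in range(5000):
    malicious = random.random() < 0.3                 # 30% of traffic is an attack
    obs = observe(malicious)
    # epsilon-greedy action selection
    if random.random() < epsilon:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[(obs, a)])
    r = reward(action, malicious)
    next_obs = observe(random.random() < 0.3)         # next traffic sample
    # one-step Q-learning update
    best_next = max(Q[(next_obs, a)] for a in ACTIONS)
    Q[(obs, action)] += alpha * (r + gamma * best_next - Q[(obs, action)])

# The learned greedy policy should block or alert on high-traffic, alert-raising states.
print({obs: max(ACTIONS, key=lambda a: Q[(obs, a)]) for obs in {k[0] for k in Q}})
```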
Figure 1: A typical RL environment within the domain of cybersecurity. One or multiple agents interact with a network environment with several appliances available but also susceptible to attacks. A threat actor is also present, in the form of one or multiple active human/software attackers, cyber threat simulation software, or a dataset of past attacks.

The RL agents are tasked with defending the environment against threat actors, which can be either actual human/software attackers or emulated through datasets and simulation frameworks. These threat actors strategically probe the environment to find vulnerabilities, and additional challenges may arise when authorised resources or employees are compromised to gain access to other parts of the system.

4. Classification: Framework and Survey

To explore the intersection between RL and cybersecurity, we started by looking at existing surveys. We thus queried the Scopus computer science database using the query string "reinforcement learning" AND "cybersecurity" AND ("survey" OR "review"). This search yielded 31 results. However, manual inspection revealed that only five amongst them truly were broad surveys focussed on RL applied to general cybersecurity. The others either focussed on a specific application scenario (e.g. IoT) or were not centred around RL (e.g. they mostly covered statistical machine learning methods). Thus, we delved deeper into these five surveys [2, 3, 4, 8, 10] and applied snowballing where needed to identify which articles could faithfully and clearly represent a research thread within our proposed classification.

Among these five surveys, three in particular inspired this work. The first, by Uprety and Rawat [2], organises the literature according to the nature of attacks in the specific application domain of the Internet of Things (IoT). Adawadkar and Kulkarni [3], instead, focus on IDS and resource optimisation in IoT environments. The researchers identify key parameters for comparing RL-based algorithms, including detection rate, precision, and accuracy, providing valuable insights into the effectiveness of RL in enhancing cybersecurity measures. Finally, Cengiz and Gök [4] concentrate initially on penetration testing and then on Intrusion Detection Systems (IDS), providing valuable insights into the evolving cybersecurity landscape. They conclude by explaining how RL can be applied to various types of attacks. By selectively surveying the available literature and evidence, they offer a comprehensive overview of the role of RL in fortifying defences against emerging threats.

With the similar goal of assessing the state of the art and comparing various approaches to RL for cybersecurity, we have formulated a novel classification meant to better introduce researchers and practitioners to the field, encompassing two key dimensions: (i) the architecture of the RL approach employed, and (ii) the network configuration adopted for cybersecurity.

4.1. RL techniques dimension

In the context of RL, multiple learning agents can coexist, and they may or may not share learning data. Additionally, when multiple agents exist, they can explicitly try to hinder each other's learnt policies. These possibilities give rise to five categories of RL approaches:

• Single-agent. In single-agent systems, there is only one agent operating in the environment and learning. This agent makes decisions and takes actions independently, considering only its own information.
• Centralised multi-agent. In centralised multi-agent systems, multiple agents exist, but there is a central controller or coordinator that learns a decision-making policy for all agents based on the information assembled from each agent.
This central controller is the only one learning a policy, which is then distributed to every other agent.
• Decentralised multi-agent. In decentralised multi-agent systems, each agent makes its own decisions independently, without a central controller. Agents in decentralised systems typically have limited access to information about the environment and the actions of other agents. They must use local information and possibly communication with nearby agents to learn and make decisions. Each agent learns its own policy.
• Multi-agent CTDE. The Centralised Training with Decentralised Execution (CTDE) paradigm involves training a multi-agent system in a centralised manner, where a central controller learns the policies or strategies for each agent using global information. However, during execution, each agent operates independently, making decisions based on the centrally learned policy but conditioned on its own observations, and without direct coordination with the central controller or other agents.
• Adversarial multi-agent. In adversarial multi-agent systems, agents operate in a competitive environment where each agent's objectives are directly opposed to those of other agents. These systems often involve strategic interactions, where agents must anticipate and react to the actions of other agents in order to achieve their own objectives.

This classification helps to quickly identify the learning scenario and enables further, more fine-grained categorisation based on the specific RL algorithm adopted.

4.2. Network configuration dimension

The other dimension we consider is the network configuration of the cybersecurity measures.

• Host-based cybersecurity. The defence system is focused on protecting individual devices (hosts) such as computers, servers, mobile devices, and endpoints. As such, it involves installing security software directly on these devices. This software may include antivirus programs, firewalls, intrusion detection/prevention systems (IDS/IPS) [11], and endpoint protection platforms (EPP). Host-based cybersecurity measures are essential for safeguarding against threats like malware, unauthorised access, and data breaches that may target specific devices (see Figure 2).
• Network-based cybersecurity. The focus shifts to securing the communication pathways between different devices and sub-systems within a network. It involves implementing security measures at the network level to detect and prevent unauthorised access, malicious activities, and data breaches. Network-based security solutions include firewalls, intrusion detection/prevention systems (IDS/IPS), virtual private networks (VPNs), and network access control (NAC) systems (see Figure 2). These measures help protect against threats such as unauthorised access attempts, malware propagation, and network-based attacks like DDoS (Distributed Denial of Service) attacks.

Figure 2: Host-based vs network-based security configuration.

• Network-based cybersecurity centralised with SDN. This configuration combines network-based cybersecurity measures with Software-Defined Networking (SDN, see Figure 3) technology. SDN [12] is an approach to networking that separates the control plane from the data plane, allowing for centralised management and programmability of network resources. Here, security policies and controls are centrally managed and enforced across the network infrastructure through software-defined policies. This enables more dynamic and granular control over network traffic.
SDN-based security solutions include centralised firewall management, dynamic access control policies, and real-time threat intelligence integration, among others.

This classification helps to quickly identify what kind of defence is needed.

Figure 3: Traditional network vs software-defined network.

4.3. Proposed Classification

By organising our analysis along these two dimensions, our survey endeavours to furnish a comprehensive overview of the strengths, limitations, and potential applications of diverse RL methodologies within diverse network settings. This way, we can construct a holistic portrayal of the current state of the art in RL applications for cybersecurity, depicted in Figure 4, shedding light on emerging trends and challenges.

Figure 4: Articles about the application of RL in cybersecurity, categorised according to the network environment configuration (y-axis) and the RL approach adopted (x-axis). On both axes, decentralisation increases in the direction of the arrow.

Single-agent RL for host-based security. This category gathers the most approaches, as it represents those solutions that are technically easier to set up: a single learning agent learns based on the inputs coming from all the devices in the network, and controls all the security measures installed therein.

Liu et al. [13] present an RL-based approach to enhance the security of wireless networks by mitigating spoofing attacks. The receiver (Bob) acts as an agent using the Q-learning algorithm to make decisions about authenticating packets. Bob's state (s_t) represents his current knowledge of the channel conditions and historical authentication results. His action (a_t) involves selecting a threshold for authentication to decide whether an incoming packet is legitimate or spoofed. Bob observes the packet's physical-layer characteristics, such as signal strength and channel properties. The reward function (r_t) provides feedback based on the accuracy of Bob's authentication decisions, rewarding correct identifications and penalising false alarms and missed detections, as follows:

\[
r_t =
\begin{cases}
R_{\mathrm{correct}} & \text{if correct identification (legitimate packet accepted)}\\
R_{\mathrm{false\_alarm}} & \text{if false alarm (legitimate packet classified as spoofed)}\\
R_{\mathrm{missed\_detection}} & \text{if missed detection (spoofed packet not detected)}\\
R_{\mathrm{correct\_rejection}} & \text{if correct rejection (spoofed packet identified)}
\end{cases}
\]

Through repeated interactions and rewards, Bob learns to improve his authentication policy, thus enhancing the network's resilience to spoofing attacks.
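As an illustration of how such a threshold-selection policy could be learned, the sketch below implements a simplified Q-learning loop in the spirit of [13]. The Gaussian channel model, the candidate thresholds, the reward magnitudes, and the choice of using the previous authentication outcome as the state are assumptions made for this example, not details of the original system.

```python
import random
from collections import defaultdict

# Candidate authentication thresholds (the agent's actions) -- illustrative values.
THRESHOLDS = [0.5, 1.0, 1.5, 2.0]
# Reward values for the four cases of r_t above -- illustrative magnitudes.
R = {"correct": 1.0, "false_alarm": -1.0,
     "missed_detection": -2.0, "correct_rejection": 1.0}

def channel_feature(spoofed):
    """Physical-layer feature (e.g. an RSSI deviation): spoofed packets look 'shifted'."""
    return random.gauss(2.0 if spoofed else 0.0, 1.0)

def outcome(feature, threshold, spoofed):
    flagged = feature > threshold            # packet classified as spoofed
    if spoofed:
        return "correct_rejection" if flagged else "missed_detection"
    return "false_alarm" if flagged else "correct"

Q = defaultdict(float)                       # Q[(state, threshold)]
alpha, gamma, eps = 0.1, 0.9, 0.1
state = "correct"                            # state = outcome of the previous packet (assumption)

for t in range(20000):
    spoofed = random.random() < 0.2          # 20% of packets are spoofed
    if random.random() < eps:
        threshold = random.choice(THRESHOLDS)
    else:
        threshold = max(THRESHOLDS, key=lambda a: Q[(state, a)])
    result = outcome(channel_feature(spoofed), threshold, spoofed)
    best_next = max(Q[(result, a)] for a in THRESHOLDS)
    Q[(state, threshold)] += alpha * (R[result] + gamma * best_next - Q[(state, threshold)])
    state = result

# The learned greedy threshold per state trades off false alarms against missed detections.
print({s: max(THRESHOLDS, key=lambda a: Q[(s, a)]) for s in R})
```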
Elnaggar and Bezzo [14] introduce a method to predict and recover from cyber-physical attacks on UAVs (Unmanned Aerial Vehicles, aircraft operating without a human pilot on board) using Inverse Reinforcement Learning (IRL). The focus is on scenarios where a UAV has to reach a particular position and attackers try to manipulate its sensor data to disrupt navigation. The key components of the approach are: Actions, which refer to specific movements or adjustments the UAV can make, such as changing direction, speed, or altitude; States and Observations, representing the various conditions or positions of the UAV within its environment, such as the UAV's geographic coordinates, velocity, and altitude, along with sensor readings from gyroscope, accelerometer, and GPS; Policy, a strategy derived from IRL that guides the system to make decisions that avoid the attacker's goals and maintain a desired operational level; and Reward, a function that evaluates the success of actions in maintaining system integrity and achieving goals. The reward function R(s, a) is designed to reflect the system's objectives and the attacker's interference. For example, it might be formulated as:

\[
R(s, a) = -\left(\alpha \cdot d(s, s_{\mathrm{goal}}) + \beta \cdot I(s)\right)
\]

where d(s, s_goal) is the distance from the current state s to the goal state s_goal, I(s) is an indicator function that penalises unsafe states, and α and β are weighting factors. The algorithm uses Bayesian IRL within a Markov Decision Process framework, applying Markov Chain Monte Carlo sampling to predict the attacker's intentions. The system's effectiveness is demonstrated through simulations involving a UAV navigating a stochastic environment.

Xu et al. propose a series of works [15, 16, 17] introducing TD-SAD (temporal-difference-based sequential anomaly detection) to combat multi-stage cyber attacks in computer systems, showcasing its high detection rates and low false alarm rates across various types of program traces. Building upon this, they extended it to enhance anomaly detection in host-based IDS, demonstrating the effectiveness of RL techniques. Finally, they proposed another method for detecting anomalies in host computers using sequential anomaly detection based on temporal-difference (TD) learning principles, highlighting its efficiency in modelling complex sequential behaviours without prior knowledge of the underlying processes.

Xiao et al. [18] delve into the security vulnerabilities inherent in Mobile Edge Computing (MEC) systems. The paper uses RL to enhance security measures such as secure mobile offloading against smart attacks, lightweight authentication, and collaborative caching schemes.

Feng et al. [19] address the challenge of defending against application-layer distributed denial-of-service (L7 DDoS) attacks, which exploit legitimate-appearing application-layer requests to overwhelm server functions. Traditional DDoS defences struggle with L7 DDoS attacks due to their subtle nature at the transport and network layers. The authors propose a defence mechanism using RL, where an agent learns to mitigate these attacks through a multi-objective reward function. This function balances the aggressive mitigation of malicious requests during severe attacks with conservative mitigation to minimise collateral damage to legitimate traffic under normal conditions. Their evaluation demonstrates that the proposed approach effectively mitigates 98.73% of malicious events.

Oh and Iyengar [20] introduce a sequential anomaly detection method using Inverse RL. It models an agent's behaviour through a learned reward function, identifying anomalies when behaviours deviate from expected patterns. A Bayesian extension to IRL incorporates model uncertainty, enhancing reliability. Key contributions include the application of IRL to anomaly detection, the handling of varying-length input trajectories in real time, and the empirical validation of the approach on publicly available real-world data.
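To give a flavour of the TD-based sequential anomaly detection idea surveyed above (Xu et al. [15, 17]), the sketch below learns value estimates over system-call identifiers from normal traces and scores new traces by the magnitude of their temporal-difference errors. The state definition (a single call identifier), the unit reward per observed transition, and the scoring rule are simplifying assumptions made for illustration, not the authors' exact formulation.

```python
from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.1

def train_values(normal_traces):
    """TD(0) value estimates over system-call identifiers seen in normal traces.
    Every observed transition is treated as a 'normal' step with reward 1."""
    V = defaultdict(float)
    for trace in normal_traces:
        for s, s_next in zip(trace, trace[1:]):
            td_error = 1.0 + GAMMA * V[s_next] - V[s]
            V[s] += ALPHA * td_error
    return V

def anomaly_score(trace, V):
    """Mean magnitude of the TD error along a trace: transitions into calls that were
    rarely (or never) seen during training produce large errors."""
    errors = [abs(1.0 + GAMMA * V[s_next] - V[s])
              for s, s_next in zip(trace, trace[1:])]
    return sum(errors) / max(len(errors), 1)

# Toy usage: traces are sequences of system-call names (hypothetical data).
normal = [["open", "read", "write", "close"] * 10 for _ in range(50)]
V = train_values(normal)
print(anomaly_score(["open", "read", "write", "close"], V))   # low score
print(anomaly_score(["open", "exec", "socket", "send"], V))   # noticeably higher score
```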
Single-agent RL for network-based security. This category includes approaches where a single learning agent is responsible for monitoring and securing the entire network. The agent analyses network traffic and activities in network devices (routers, firewalls, switches, gateways, ...), making decisions to enhance the network's overall security posture.

Liu et al. [21] present a Deep RL approach for mitigating Distributed Denial of Service (DDoS) attacks in SDNs. The system employs a Deep Deterministic Policy Gradient algorithm. The state space captures network features from OpenFlow switches; the action space configures bandwidth limits for hosts using OpenFlow meters; and the observation process continuously monitors network traffic. The reward function, defined as:

\[
\mathit{reward} =
\begin{cases}
-1 & \text{if } \mathrm{Load}_s > U_s\\
\lambda p_b + (1 - \lambda)(1 - p_a) & \text{if } \mathrm{Load}_s \le U_s
\end{cases}
\]

penalises server overload and rewards maximising benign traffic while minimising attack traffic. In summary, the DRL-based approach dynamically adjusts bandwidth allocations to effectively mitigate DDoS attacks.

Veluchamy and Kathavarayan [22] propose the Deep Adaptive RL for Honeypots (DARLH) system, which operates in both single-agent and multi-agent paradigms to enhance security in honeypot environments. These are environments featuring decoy systems set up to attract and analyse cyber attackers, gathering data on attack methods to improve security. At the single-agent level, the agent autonomously learns and makes decisions based on its observations of network traffic and system behaviour. This single-agent approach allows for adaptive behaviour and decision-making tailored to the specific environment. At the multi-agent level, the system integrates multiple agents, each responsible for monitoring different aspects of the honeypot environment. These agents collaborate to share information, coordinate actions, and collectively contribute to the overall security posture of the system. This two-level architecture combines the adaptability and autonomy of single-agent systems with the collaborative and coordinated capabilities of multi-agent systems, offering a holistic approach to network security.

Multi-agent centralised RL for network-based security. This category encompasses methods where multiple learning agents are coordinated centrally to secure the network. Each agent focuses on a specific aspect of the network, and their actions are managed by a central controller to ensure cohesive and effective security measures.

Janakiraman and Deva Priya [23] propose a Deep RL approach exploiting Long Short-Term Memory (LSTM) networks for mitigating DDoS attacks in fog-assisted cloud environments. Multiple agents collaborate in a centralised network-based approach to identify and mitigate DDoS attacks at the network layer. By utilising SDN controllers (see Section 4.2), the system is able to analyse network traffic and differentiate between legitimate and malicious packets. The LSTM component is used for its ability to handle time-dependent data and effectively categorise incoming packets. The reward in this context is defined as the successful identification and mitigation of DDoS attacks while minimising false positives and maintaining the availability of legitimate network services. A detailed discussion on the reward structure and its implications can be found in Section 3 of the paper.
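For concreteness, the piecewise reward of Liu et al. [21] reported above can be read as the small function below. The interpretation of p_b and p_a (fraction of benign traffic preserved and fraction of attack traffic still admitted) and the default λ = 0.5 are assumptions made for this sketch; the original paper should be consulted for the exact definitions.

```python
def ddos_mitigation_reward(load, capacity, p_benign, p_attack, lam=0.5):
    """Piecewise reward in the spirit of the formula above.

    Assumptions for this sketch (not spelled out in the survey text):
      - load and capacity play the roles of Load_s and U_s,
      - p_benign is the fraction of benign traffic still served,
      - p_attack is the fraction of attack traffic still admitted,
      - lam is the trade-off weight lambda.
    """
    if load > capacity:                  # server overload: flat penalty
        return -1.0
    return lam * p_benign + (1.0 - lam) * (1.0 - p_attack)

# Example: a throttling action that preserves 90% of benign traffic and admits
# only 20% of attack traffic, without overloading the server, scores 0.85.
print(ddos_mitigation_reward(load=0.7, capacity=1.0, p_benign=0.9, p_attack=0.2))
```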
Multi-agent RL for host-based security. In this section, we explore research efforts focused on leveraging the collective intelligence and collaborative decision-making of multiple agents to fortify cybersecurity measures on a single device.

Dasgupta et al. [24] focus on detecting and mitigating GPS spoofing attacks, crucial in transportation cyber-physical systems. They propose a deep RL-based method, using in-vehicle sensor data and signal processing to detect spoofing attacks turn-by-turn. The State (S) includes the positions and movements of vehicles as well as the signals received from GPS satellites, and encapsulates information about the system's susceptibility to spoofing attacks. The Action (A) in this scenario corresponds to the responses that the system can take to detect and mitigate GPS spoofing attacks. These may include adjusting the navigation algorithms, re-calibrating sensors, or deploying countermeasures to verify the authenticity of GPS signals. The Observation (O) consists of the data collected from in-vehicle sensors and GPS receivers, as well as the feedback from the detection and mitigation mechanisms. These observations provide insights into the effectiveness of the system's response to spoofing attacks and guide further decision-making processes. The Reward (R) signal in this context reflects the immediate benefits or costs associated with the system's actions in response to spoofing attacks. Rewards could be based on successfully detecting and mitigating spoofing attempts, minimising disruptions to navigation systems, or avoiding accidents caused by misleading GPS information. This study demonstrates the potential of RL-based approaches to bolster cybersecurity in transportation systems vulnerable to GPS spoofing attacks. Employing a multi-agent system, the method enhances detection accuracy through collaborative decision-making among agents.

Decentralised multi-agent RL for network-based security. This category involves approaches where multiple learning agents work independently but collaboratively to secure the network. Each agent is responsible for a segment of the network, making autonomous decisions while communicating with other agents to maintain overall network security.

Malialis and Kudenko [25] propose a framework that involves deploying multiple agents within the network to coordinate responses against threats. These agents use RL to dynamically adjust router throttling mechanisms, effectively mitigating the impact of DDoS attacks on network performance and availability. The approach is decentralised, as each agent operates independently, making decisions based on local observations and interactions with the environment. However, the agents collaborate indirectly by collectively improving the overall network resilience through their individual actions.
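The following sketch conveys the flavour of such a distributed throttling scheme: each router agent learns, from its locally observed load only, how aggressively to throttle traffic towards a shared victim server. The topology, the load model, the discretisation, and the shared reward signal are assumptions made for this example and simplify the reward and coordination mechanisms actually studied in [25].

```python
import random
from collections import defaultdict

# Illustrative setup: each upstream router runs its own learner and picks a
# throttling rate based only on its locally observed (discretised) load.
RATES = [0.0, 0.25, 0.5, 0.75]          # fraction of traffic to drop (actions)
SERVER_CAPACITY = 10.0                  # hypothetical victim-server capacity
N_ROUTERS = 4
alpha, eps = 0.1, 0.1

Q = [defaultdict(float) for _ in range(N_ROUTERS)]   # one table per agent

def local_state(load):
    return min(int(load), 5)            # coarse discretisation of the local load

for episode in range(20000):
    loads = [random.uniform(1.0, 4.0) for _ in range(N_ROUTERS)]   # incoming traffic
    states, actions = [], []
    for i, load in enumerate(loads):
        s = local_state(load)
        a = (random.choice(RATES) if random.random() < eps
             else max(RATES, key=lambda r: Q[i][(s, r)]))
        states.append(s)
        actions.append(a)
    # Traffic reaching the server after throttling.
    served = sum(load * (1.0 - a) for load, a in zip(loads, actions))
    # Shared reward: keep the server below capacity while dropping as little
    # legitimate traffic as possible (an assumption, not the paper's exact reward).
    team_reward = -1.0 if served > SERVER_CAPACITY else served / SERVER_CAPACITY
    for i in range(N_ROUTERS):
        key = (states[i], actions[i])
        Q[i][key] += alpha * (team_reward - Q[i][key])   # one-step (bandit-style) update
```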
Deokar and Hazarnis [26] propose a cooperative learning method for IDS based on multi-agent systems. The system architecture involves multiple agents distributed across different hosts, each responsible for monitoring network connections and system log files. These agents collaborate in a decentralised manner by sharing information and making local decisions based on their observations, contributing to a collective decision about whether an intrusion has occurred. The decision-making process utilises influence diagrams and Bayesian networks to model uncertainty and optimise decision outcomes.

Bhosale et al. [27] propose an approach to IDS leveraging a multi-agent framework and RL techniques. It addresses the limitations of traditional single-agent IDSs, which struggle to handle the complexity and real-time demands of modern network security. By employing a multi-agent system, each agent possesses partial information and collaborates with others to improve decision-making capabilities. The decision-making process is facilitated by influence diagrams, which represent probabilistic relationships between events and guide local decision-making. This approach leans towards decentralisation, as agents collaborate but maintain their autonomy in decision-making.

Shamshirband et al. [28] use Cooperative Game-based Fuzzy Q-learning (G-FQL) for detecting and preventing intrusions, particularly DDoS attacks, in wireless sensor networks (WSNs). G-FQL integrates game theory, fuzzy Q-learning, and a cooperative defence strategy involving sink nodes, a base station, and attackers. The cooperative game mechanism allows the sensor nodes to act as rational decision-makers, collaborating to detect and defend against attacks. Fuzzy Q-learning reinforces the nodes' self-learning abilities, providing them with incentive functions to protect vulnerable sensor nodes. This approach mixes centralisation and decentralisation: centralised elements such as the base station coordinate the overall strategy, while individual sensor nodes operate autonomously but collaborate within the overarching framework.

Adversarial RL for network-based security. This category focuses on a specific RL setting, termed adversarial, where RL agents learn a policy that is actively disrupted by hostile agents (the "threat actors" or "other problems" of Figure 1, which here become the adversarial agents). The defensive agents are trained to anticipate and counteract sophisticated attacks, thereby enhancing the network's resilience against adversarial threats.

Caminero et al. [29] propose Adversarial Environment using RL, an approach for intrusion detection that incorporates a multi-agent technique by integrating a classifier, acting as the agent, with a simulated environment. This environment generates network traffic samples and provides rewards based on the classifier's predictive accuracy. The classifier's objective is to predict the correct intrusion label for the given network samples, while the environment behaves adversarially, actively increasing the difficulty of predictions and thereby challenging the classifier to learn from the most difficult cases. By maximising the rewards obtained from the environment, the classifier learns to adapt to these challenges, leading to enhanced performance in detecting and classifying intrusions within network traffic.
This dynamic interaction between the classifier and the adversarial environment is at the core of the approach, enabling it to effectively address the evolving threats and complexities of network security.

Turner et al. [30] focus on modelling the interactions between attackers and defenders in a network environment. Attackers aim to exploit vulnerabilities, while defenders seek to mitigate risks. Multi-Agent Reinforcement Learning (MARL) involves competition between RL agents representing attackers and defenders, each learning to optimise its strategy based on feedback received from the environment. Co-evolution involves evolving populations of strategies for attackers and defenders simultaneously, with each population adapting to the strategies of its opponent over time. The paper compares the effectiveness of these approaches in generating robust solutions for cybersecurity challenges, emphasising the importance of balancing exploration and exploitation in learning strategies.

5. Limitations, Challenges, and Open Issues

While RL holds promise for addressing cybersecurity challenges, it also presents certain limitations, challenges, and open issues, summarised in Table 2.

Table 2
Limitations, challenges, and open issues of RL approaches applied to cybersecurity.

            | Complexity | Sample Efficiency | Generalisation | Adversarial Attacks | Transfer Learning | Explainability
Limitation  |     X      |        X          |                |                     |                   |
Challenge   |            |                   |       X        |         X           |                   |
Open Issue  |            |                   |                |                     |        X          |       X

Limitations. A first limitation of RL approaches in cybersecurity environments is complexity. Cybersecurity environments often exhibit high-dimensional and dynamic characteristics, leading to computationally intensive training processes. The complexity arises from the need to represent diverse network states, attacker behaviours, and defensive actions accurately. As a result, RL algorithms may encounter scalability issues, prolonged training times, and resource constraints, limiting their practical applicability in real-world cybersecurity scenarios. Another limitation, generally applicable to RL but even more so to cybersecurity, is sample efficiency. Many RL algorithms require extensive training data to learn effective policies, posing challenges in resource-constrained cybersecurity settings where collecting sufficient labelled data is difficult. A notable example in cybersecurity is detecting zero-day attacks.

Challenges. A first challenge for RL is posed by adversarial attacks, which exploit vulnerabilities in RL-based cybersecurity systems, for instance by poisoning the training data with fake samples, to manipulate the system's behaviour and steer learning agents towards sub-optimal policies. Another challenge is achieving robust generalisation across diverse cyber threats and environments: for instance, ensuring that an RL-based IDS trained on one network architecture performs effectively when deployed in a different one. Finally, ensuring the availability of quality data is another challenge, requiring data collection strategies that reflect real-world cyber threats, environments, and defence mechanisms.

Open Issues. Amongst the open issues still to be fully investigated in RL applied to cybersecurity, at least two emerge strongly from the surveyed literature: explainability and transferability. The former amounts to ensuring that RL-based cybersecurity systems are understandable and transparent to human beings. Achieving explainability involves making the decisions and actions of RL algorithms interpretable to stakeholders.
For instance, in autonomous threat response systems, explainability ensures that security analysts can comprehend the reasoning behind the system's actions and trust its recommendations. Transferability, instead, amounts to transferring knowledge and policies learned in one cybersecurity context to another. This would obviously improve efficiency and effectiveness, as there would be less need to re-train RL systems from scratch: for instance, leveraging knowledge gained from detecting malware to enhance intrusion detection in network traffic.

6. Conclusion & Future Works

In this paper, we have discussed the potential of RL to enhance cybersecurity defences by offering adaptive and dynamic mechanisms to combat emerging threats. We highlighted the limitations of traditional approaches, motivated why RL can help surpass them, and then proposed a bi-dimensional classification to help researchers enter the field or get a bird's-eye view. While our analysis demonstrates the promise of RL in improving detection, mitigation, and response capabilities, the surveyed literature also identifies challenges that must be addressed. These include complexity, sample efficiency, and vulnerability to adversarial attacks. Moving forward, it is imperative to focus on developing robust and explainable RL-based defence mechanisms, as well as exploring techniques for knowledge transfer and generalisation across diverse cyber threats and environments. By addressing these challenges, we can harness the full potential of RL to fortify cybersecurity defences and mitigate emerging threats effectively.

References

[1] S. Morgan, Cybercrime to cost the world 8 trillion annually in 2023, 2022. URL: https://web.archive.org/web/20240429123425/https://cybersecurityventures.com/cybercrime-to-cost-the-world-8-trillion-annually-in-2023/.
[2] A. Uprety, D. B. Rawat, Reinforcement learning for IoT security: A comprehensive survey, IEEE Internet of Things Journal 8 (2021) 8693–8706. doi:10.1109/JIOT.2020.3040957.
[3] A. M. K. Adawadkar, N. Kulkarni, Cyber-security and reinforcement learning — a brief survey, Engineering Applications of Artificial Intelligence 114 (2022) 105116. doi:10.1016/j.engappai.2022.105116.
[4] E. Cengiz, M. Gök, Reinforcement learning applications in cyber security: A review, Sakarya University Journal of Science 27 (2023) 481–503. doi:10.16984/saufenbilder.1237742.
[5] J. H. Connell, S. Mahadevan, Robot Learning, Robotica 17 (1999) 229–235. doi:10.1017/S0263574799271172.
[6] Z. Hu, P. Chen, M. Zhu, P. Liu, Reinforcement Learning for Adaptive Cyber Defense Against Zero-Day Attacks, Springer International Publishing, Cham, 2019, pp. 54–93. doi:10.1007/978-3-030-30719-6_4.
[7] M. Macas, C. Wu, W. Fuertes, A survey on deep learning for cybersecurity: Progress, challenges, and opportunities, Computer Networks 212 (2022) 109032. doi:10.1016/j.comnet.2022.109032.
[8] T. T. Nguyen, V. J. Reddi, Deep reinforcement learning for cyber security, IEEE Transactions on Neural Networks and Learning Systems 34 (2023) 3779–3795. doi:10.1109/TNNLS.2021.3121870.
[9] Y. Guo, A review of machine learning-based zero-day attack detection: Challenges and future directions, Computer Communications 198 (2023) 175–185. doi:10.1016/j.comcom.2022.11.001.
[10] P. Dixit, S. Silakari, Deep learning algorithms for cybersecurity applications: A technological and status review, Computer Science Review 39 (2021) 100317. URL: https://www.sciencedirect.com/science/article/pii/S1574013720304172. doi:10.1016/j.cosrev.2020.100317.
[11] D. Denning, An intrusion-detection model, IEEE Transactions on Software Engineering SE-13 (1987) 222–232. doi:10.1109/TSE.1987.232894.
[12] S. Shin, L. Xu, S. Hong, G. Gu, Enhancing network security through software defined networking (SDN), in: 25th International Conference on Computer Communication and Networks (ICCCN), 2016, pp. 1–9. doi:10.1109/ICCCN.2016.7568520.
[13] J. Liu, L. Xiao, G. Liu, Y. Zhao, Active authentication with reinforcement learning based on ambient radio signals, Multimedia Tools and Applications 76 (2017) 3979–3998. doi:10.1007/s11042-015-2958-x.
[14] M. Elnaggar, N. Bezzo, An IRL approach for cyber-physical attack intention prediction and recovery, in: 2018 Annual American Control Conference (ACC), 2018, pp. 222–227. doi:10.23919/ACC.2018.8430922.
[15] X. Xu, T. Xie, A reinforcement learning approach for host-based intrusion detection using sequences of system calls, in: Advances in Intelligent Computing, Springer Berlin Heidelberg, Berlin, Heidelberg, 2005, pp. 995–1003. doi:10.1007/11538059_103.
[16] X. Xu, Y. Luo, A kernel-based reinforcement learning approach to dynamic behavior modeling of intrusion detection, in: Advances in Neural Networks – ISNN 2007, Springer Berlin Heidelberg, 2007, pp. 455–464. doi:10.1007/978-3-540-72383-7_54.
[17] X. Xu, Sequential anomaly detection based on temporal-difference learning: Principles, models and case studies, Applied Soft Computing 10 (2010) 859–867. doi:10.1016/j.asoc.2009.10.003.
[18] L. Xiao, X. Wan, C. Dai, X. Du, X. Chen, M. Guizani, Security in mobile edge caching with reinforcement learning, IEEE Wireless Communications 25 (2018) 116–122. doi:10.1109/MWC.2018.1700291.
[19] Y. Feng, J. Li, T. Nguyen, Application-layer DDoS defense with reinforcement learning, in: IEEE/ACM 28th International Symposium on Quality of Service (IWQoS), 2020, pp. 1–10. doi:10.1109/IWQoS49365.2020.9213026.
[20] M.-h. Oh, G. Iyengar, Sequential anomaly detection using inverse reinforcement learning, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Association for Computing Machinery, 2019, pp. 1480–1490. doi:10.1145/3292500.3330932.
[21] Y. Liu, M. Dong, K. Ota, J. Li, J. Wu, Deep reinforcement learning based smart mitigation of DDoS flooding in software-defined networks, in: IEEE 23rd International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD), 2018, pp. 1–6. doi:10.1109/CAMAD.2018.8514971.
[22] S. Veluchamy, R. S. Kathavarayan, Deep reinforcement learning for building honeypots against runtime DoS attack, International Journal of Intelligent Systems 37 (2022) 3981–4007. doi:10.1002/int.22708.
[23] S. Janakiraman, M. Deva Priya, A deep reinforcement learning-based DDoS attack mitigation scheme for securing big data in fog-assisted cloud environment, Wireless Personal Communications 130 (2023) 2869–2886. doi:10.1007/s11277-023-10407-2.
[24] S. Dasgupta, T. Ghosh, M. Rahman, A reinforcement learning approach for global navigation satellite system spoofing attack detection in autonomous vehicles, Transportation Research Record 2676 (2022) 318–330. doi:10.1177/03611981221095509.
[25] K. Malialis, D. Kudenko, Distributed response to network intrusions using multiagent reinforcement learning, Engineering Applications of Artificial Intelligence 41 (2015) 270–284. doi:10.1016/j.engappai.2015.01.013.
[26] B. Deokar, A. Hazarnis, Intrusion detection system using log files and reinforcement learning, International Journal of Computer Applications 45 (2012) 28–35. doi:10.5120/7026-9675.
[27] R. Bhosale, D. S. Mahajan, P. A. Kulkarni, Cooperative machine learning for intrusion detection system, International Journal of Scientific & Engineering Research 5 (2014). URL: https://www.ijser.org/researchpaper/Cooperative-Machine-Learning-For-Intrusion-Detection-System.pdf.
[28] S. Shamshirband, A. Patel, N. B. Anuar, M. L. M. Kiah, A. Abraham, Cooperative game theoretic approach using fuzzy q-learning for detecting and preventing intrusions in wireless sensor networks, Engineering Applications of Artificial Intelligence 32 (2014) 228–241. URL: https://www.sciencedirect.com/science/article/pii/S0952197614000311. doi:10.1016/j.engappai.2014.02.001.
[29] G. Caminero, M. Lopez-Martin, B. Carro, Adversarial environment reinforcement learning algorithm for intrusion detection, Computer Networks 159 (2019) 96–109. doi:10.1016/j.comnet.2019.05.013.
[30] M. J. Turner, E. Hemberg, U.-M. O'Reilly, Analyzing multi-agent reinforcement learning and coevolution in cybersecurity, in: Proceedings of the Genetic and Evolutionary Computation Conference, GECCO '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 1290–1298. doi:10.1145/3512290.3528844.