Reinforcement Learning Agents for Simulating Normal and Malicious Actions in Cyber Range Scenarios

Alessandro Santorsola1,*, Aldo Migliau1 and Sabino Caporusso1
1 BV-Tech s.p.a., 20123 Milan, Italy

Abstract
Cyber-attacks and their consequences have become one of the primary sources of risk in recent years. Cyber-attacks have the potential to cause physical damage both to infrastructures and to people. Several methods have been proposed to prevent such risks. The cyber security knowledge required for cyber defense can be developed through active learning in a cyber range. Although this type of cyber learning is popular and used worldwide by numerous organizations and companies, such simulations typically lack the presence of users and their effects on the systems. In particular, in a cyber environment where the only activities on the systems are those carried out by the Red Team, assessing malicious actions becomes a trivial task for the Blue Team. Hence, the resulting simulation does not reflect a real working condition. User simulation is needed to provide more realistic scenarios for training sessions. Additionally, a cyber range that relies on the actions of simulated users makes it possible to simulate a Zero Trust (ZT) condition. In such scenarios, the simulated users also act as virtual attackers or use social engineering attacks (e.g., phishing) within the company network. This work presents the development of a model whose purpose is to generate human-addressable actions in the cyber range. The agent leverages a Reinforcement Learning (RL) algorithm to simulate user-system interactions and simulates both normal and malicious actions on the systems.

Keywords
Cyber Range, Reinforcement Learning, Simulation & Modeling, Cyber Attacks

ITASEC'22: Italian Conference on Cybersecurity, June 20–23, 2022, Rome, Italy
* Corresponding author.
a.santorsola@bv-tech.it (A. Santorsola); a.migliau@bv-tech.it (A. Migliau); s.caporusso@bv-tech.it (S. Caporusso)
https://www.linkedin.com/in/alessandro-santorsola-680aaa145/ (A. Santorsola)
ORCID: 0000-0003-1094-8199 (A. Santorsola)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Rapid technological advancements (e.g., the Internet of Things (IoT), 5G) have become the main transformation source for several IT/OT domains (e.g., energy, health care, public transport) by increasing their productivity, value creation, and social welfare [1]. Despite these flourishing perspectives, insufficient knowledge jointly with a lack of security awareness provides fertile ground for several threat actors [2]. Threat actors may carry out different types of attacks that can produce tangible damage. In fact, many organizations and companies own or have access to cyber systems that can be exposed to several known and/or unknown attack vectors. The majority of cyber attacks have involved the Transportation and Storage, Industrial Control System (ICS), Government, Healthcare, and Entertainment sectors [3]. Furthermore, the proliferation of IoT devices in industrial plants (e.g., power grids, gas, and water distribution systems) has led to a progressive transformation of traditional ICS.
Moreover, due to the migration of control components from the electronic world to the software one, ICS components have grown considerably in complexity. Consequently, this has led to a sudden increase of the attack surface. The current trends also include remote users, personal devices, and cloud-based assets that are not physically located within an enterprise-owned network boundary but are always reachable. According to the ZT paradigm, the corresponding operations can generate normal alerts, but they can also be part of a more sophisticated cyber attack [4]. The resulting expanded attack surface, jointly with the well-known capabilities and motivations of advanced cyber adversaries, has made modern-day critical infrastructure more likely to be compromised.

Several methods have been proposed to provide the workforce with the right tools to face such risks. One of the possible solutions relies on training platforms (i.e., cyber ranges). A cyber range is a virtual environment that enables organizations to simulate cyber training, system/network development, testing, and benchmarking. Usually, such training follows the Red vs. Blue Team format, aiming to improve the responsive capacity in case of a cyber crisis. In a cyber range, the workforce can practice detection and response strategies using real-world tools and techniques. Despite these flourishing outcomes, such simulations do not consider the presence of users and their effects on the systems. It is a matter of fact that users within a corporate network can be an additional source of security alerts linked to unauthorized operations or sporadic human errors. As a consequence, the lack of users and of their effects on the systems makes the assessment of malicious actions a trivial task for the Blue Team. Hence, in order to simulate a real working condition, it is mandatory to include such behaviors in the cyber environment. Moreover, network traffic is generated in order to create external and/or internal requests (i.e., traffic that comes from the Internet or from the enterprise intranet). In this fashion, the Blue Team activities will be much closer to a real situation, since they will require the capability to discriminate the Red Team actions from the other kinds of traffic.

In this paper, we take into account both the statistical characterization of a series of requests performed by a user and the possibility to embed RL in such processes. Hence, we propose the adoption of an agent-based model that exploits RL to generate independent user network traffic according to a specific network topology and its security mechanisms. The reported results show that the embedded RL agents generate human-addressable independent requests that are consistent with realistic traffic patterns for the specific network.

The paper is organized as follows: Section 2 analyzes the contributions regarding cyber ranges and the application of reinforcement learning in such scenarios. Section 3 describes the proposed contribution. Section 4 reports and comments on the experimental settings and results. Finally, Section 5 concludes the work and draws future perspectives.

2. Background and Motivations
Reinforcement Learning is a machine learning technique in which a computer (i.e., an agent) learns to perform an activity through repeated "trial-and-error" interactions with a static or dynamic environment. This learning approach allows the agent to make a series of decisions that maximize a reward metric for the activity, without being explicitly programmed for such an operation and without human intervention [5]. Hence, the agent does not receive the information about which action to take, as in other forms of machine learning; instead, it must find out, from interactions with the environment, which actions will produce the best reward in each state.

Q-learning [5] is a model-free RL technique in which the action selection is based on the rewards (i.e., the feedback), and the update assumes that the agent subsequently follows the optimal action. The future reward function in state 𝑆𝑡 when performing action 𝐴𝑡, denoted as 𝑄(𝑆𝑡, 𝐴𝑡), is learned through interactions with the environment. The update of the state-action value function 𝑄(𝑆𝑡, 𝐴𝑡) is expressed as [5]:

    Q(S_t, A_t) = Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]    (1)

where 𝛼 represents the learning rate, 𝛾 is the discount factor, and 𝑡 represents the time instant. In general, 0 ≤ 𝛼 ≤ 1 and 0 ≤ 𝛾 ≤ 1. 𝑅𝑡+1 represents the reward extracted from the feedback. Finally, 𝑄(𝑆𝑡, 𝐴𝑡) represents the value of the Q-function when the agent is in state 𝑆𝑡 and takes action 𝐴𝑡.

A relevant aspect of the Q-learning approach is the choice of the actions to be performed during the process of estimating the 𝑄(𝑆𝑡, 𝐴𝑡) value function. This can be done by using any exploration/exploitation method, or even randomly. The most used policies are: (i) random, (ii) epsilon-greedy, and (iii) softmax. In the random strategy, the choice of the action is modeled as a uniformly distributed random variable. In the epsilon-greedy strategy, the agent uses both exploitation, to take advantage of prior knowledge, and exploration, to look for new options [6]. The epsilon-greedy approach selects the action with the highest estimated reward most of the time. In particular, let 𝜖 be a small probability and let 𝑃𝑀(𝑡) be a value drawn from a uniform distribution at the 𝑡-th round. The model compares 𝑃𝑀(𝑡) with 𝜖 in order to decide whether to explore, i.e., not to exploit what the model has learned so far. If exploration is chosen, the model adopts the random action selection; if exploitation is chosen, the model selects the action with the highest estimated reward. It is crucial to underline that exploration is more important when the agent does not have enough information about the environment it is interacting with. As learning progresses, the agent needs to interact optimally with the environment by exploiting its knowledge. As a consequence, 𝜖 should decay across the life of an agent so that it first learns and then acts optimally. Finally, the softmax strategy gives every action in the set of possible actions a chance to be chosen, based on the estimated reward value of the action: actions with higher values have a higher probability of being chosen. A Boltzmann distribution is used to evaluate the action-selection probabilities, defined as follows:

    P_t(a) = \frac{e^{Q_t(s,a)/\tau}}{\sum_{i=1}^{K} e^{Q_t(s,i)/\tau}}    (2)

where 𝑃𝑡(𝑎) is the probability that action 𝑎 will be chosen, 𝐾 is the total number of possible actions, and 𝜏 is the temperature parameter, which controls the amount of exploration.
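To make the interplay between Equation 1 and the action-selection policies concrete, the following is a minimal sketch of a tabular Q-learning update combined with the random, epsilon-greedy, and softmax selection rules. The toy environment, the state and action encoding, and the helper names (select_action, q_update) are illustrative assumptions and do not reproduce the authors' implementation; the default parameter values mirror Table 3.

```python
import numpy as np

def select_action(q_row, policy="epsilon-greedy", epsilon=0.2, tau=5.0, rng=np.random):
    """Pick an action index from one Q-table row using the given policy."""
    n_actions = len(q_row)
    if policy == "random":
        return rng.randint(n_actions)
    if policy == "epsilon-greedy":
        # Explore with probability epsilon, otherwise exploit the best-known action.
        if rng.random() < epsilon:
            return rng.randint(n_actions)
        return int(np.argmax(q_row))
    # Softmax (Boltzmann) selection, Equation 2: higher Q-values get higher probability.
    prefs = np.exp((q_row - np.max(q_row)) / tau)   # subtract the max for numerical stability
    probs = prefs / prefs.sum()
    return int(rng.choice(n_actions, p=probs))

def q_update(q_table, s, a, reward, s_next, alpha=0.9, gamma=0.1):
    """Tabular Q-learning update, Equation 1."""
    td_target = reward + gamma * np.max(q_table[s_next])
    q_table[s, a] += alpha * (td_target - q_table[s, a])

# Toy usage on a hypothetical 4-state, 3-action environment.
q = np.zeros((4, 3))
state = 0
for step in range(100):
    action = select_action(q[state], policy="epsilon-greedy")
    next_state, reward = (state + 1) % 4, 1.0 if action == 0 else -1.0  # stand-in dynamics
    q_update(q, state, action, reward, next_state)
    state = next_state
```

With epsilon decay enabled, as in Table 3, epsilon would shrink across episodes so that the agent gradually shifts from exploration to exploitation.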
Several approaches for cyber range development are reported in the literature. These approaches can be classified according to the specific simulation environment (e.g., IT, industrial) or according to the nature of the cyber range itself (e.g., hardware and software infrastructure, simulation, virtualization, and hybrid approaches). Each approach is characterized by different advantages and disadvantages related to the following aspects: (i) complexity, (ii) learning efficiency, and (iii) cost.

For instance, in [7] a physical replication of an electric grid ICS hardware and software is used to recreate an environment that is truly representative of real-life processes. This approach is certainly closer to reality, but it is characterized by a higher cost and complexity. In addition, no machine learning techniques have been included.

In [8], a simulation-based cyber range is presented. The authors propose a sandbox environment that provides functions similar to a real system. Like other simulation-based cyber ranges, this contribution presents the following advantages: (i) reconfigurability, (ii) maintainability, and (iii) scalability. However, it does not provide high fidelity, especially when software exploits and cyber attacks need to be considered, simply because network and/or physical interactions are not present.

In [9], the authors aim to explore existing studies of AI-based cyber attacks and to map them into a dedicated framework, providing insight into new threats. The framework includes the classification of several aspects of the malicious use of Artificial Intelligence (AI) during the cyber attack life cycle and provides a basis for detection to predict future threats. The authors report different applications of their proposal to analyze AI-based cyber attacks in a hypothetical scenario of critical smart grid infrastructure.

In [10], the authors start their research from the following assumption: adversaries show no restraint in adopting tools and techniques that can help them attain their goals. In particular, the authors use AI and machine learning approaches to solve security challenges. Autonomous agent interactions are investigated within a simulated enterprise network, and RL techniques are considered to improve security. The simulations take into account the enterprise environment by using a high-level abstraction of computer networks and cybersecurity concepts.

In [11], the authors describe the design of a virtualized cyber range intended to provide a flexible environment for evaluating cyber security tools, in particular tools that involve AI/ML, within realistic environments. The strengths and weaknesses of the tools are also investigated. In this contribution, the challenges are related to the evaluation of the performance and operating costs of AI/ML-based cyber security tools for application in large, government-sized environments.

In [12], a novel modular framework is proposed to replicate complex SCADA systems in a virtual simulation. The authors analyze the process of virtualization of each major component, present real-world critical infrastructures as case studies, and demonstrate the use of the framework for cybersecurity research by including different cyber attacks.
Recently, cyber range development has investigated the possibility of combining simulation, virtualization, and physical device replication in a single hybrid cyber range [13] [14]. The hybrid approach offers the possibility to overcome the disadvantages associated with the other types of cyber ranges. In [15], the authors present the development of a hybrid cyber range for ICS that is based on a real-time attacker-defender gameplay model in conjunction with dynamic simulation models of typical industrial systems. The authors present an industrial gas turbine as one use case of an archetypal industrial system and provide a demonstration of a sample training exercise. Finally, in [16], the authors propose an approach to minimize the response time and the impact of cyber-attacks on organizations, presenting a formative evaluation in the context of a digital twin implementation in the EU electrical power sector.

As shown in the literature, the majority of the applications of ML/AI-based models and algorithms in cyber ranges are addressed to anomaly detection and malware behavior purposes. Moreover, the aforementioned contributions do not take into account the presence of users during the simulations. The contribution proposed in this paper addresses this problem by developing two cooperative Reinforcement Learning agents whose purpose is to simulate the presence of users within the virtual environment of the cyber range. In addition, the proposed contribution takes into account the possibility that such simulated users may introduce security events in a human-addressable way. Finally, the model can simulate the presence of additional attackers within the network that act as a smoke screen for the Red Team. To the best of our knowledge, our contribution is the first one that addresses this kind of problem.

3. The Proposed Model

3.1. Statistical Characterization of User Networking Actions

The statistical characterization of user actions takes into account the following observations: (i) the model has to generate network traffic in such a way that it can be attributed to independent users, (ii) the model has to take into account both external and internal network traffic (i.e., network traffic that comes from the Internet and from the enterprise intranet), and (iii) the inter-arrival times between those requests have to be exponential. Hence, the external/internal traffic generation can be modeled as a series of Independent and Identically Distributed (i.i.d.) requests, and the traffic generation process can be classified as a memory-less process. According to these considerations, the network transaction generation process can be modeled as a non-homogeneous Poisson process, i.e., a Poisson process with a time-varying rate. Such processes can be used to model the arrival times of customers at a store, user requests to services, traffic events, and positions of damage along a road [17]. In particular, the model inter-request time is characterized by the following probability density function:

    f(x, \lambda(t)) = \lambda(t)\, e^{-\lambda(t) x}    (3)

where 𝑥 ≥ 0 and 𝜆(𝑡) is the rate function (i.e., requests per second) [18]. To guarantee the i.i.d. property of the requests for both external and internal traffic, two solutions have been investigated: (i) selection via a uniform binary probability distribution, and (ii) running two separate instances for simulating external and internal requests. Both solutions are valid.
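As an illustration of the traffic model above, the following minimal sketch draws request timestamps from a non-homogeneous Poisson process with a time-varying rate λ(t), so that the inter-arrival times follow Equation 3. The thinning-based sampler, the rate profile, and the function names are assumptions made for this example; the paper does not specify how the process is sampled.

```python
import numpy as np

def generate_request_times(rate_fn, duration_s, seed=None):
    """Draw request timestamps from a non-homogeneous Poisson process via thinning.

    rate_fn(t) must return the instantaneous rate lambda(t) in requests per second.
    """
    rng = np.random.default_rng(seed)
    # Upper bound on the rate over the simulated horizon (assumed computable here).
    lam_max = max(rate_fn(t) for t in np.linspace(0.0, duration_s, 1000))
    t, times = 0.0, []
    while True:
        t += rng.exponential(1.0 / lam_max)      # candidate inter-arrival time
        if t > duration_s:
            break
        if rng.random() < rate_fn(t) / lam_max:  # accept with probability lambda(t)/lambda_max
            times.append(t)
    return times

# Hypothetical rate profile: busier during the "office hours" of a short simulated day.
rate = lambda t: 0.5 + 0.4 * np.sin(np.pi * t / 3600.0)   # requests per second
timestamps = generate_request_times(rate, duration_s=3600.0, seed=42)
inter_arrivals = np.diff(timestamps)                       # approximately exponential
```

Thinning keeps the inter-arrival times exponential at the instantaneous rate even when λ(t) varies, which is consistent with the memory-less property assumed above.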
3.2. Characterization of User-Service Interactions

In general, the high-level structure of an enterprise network can be composed as follows:

• an External Firewall that exposes some corporate services to the Internet;
• a Demilitarized Zone (DMZ) that contains and exposes some corporate resources (e.g., Web Server, FTP Server, Mail Server);
• a Security Network that hosts the Blue Team machines (e.g., SOC);
• a Servers Network that hosts different IT systems (e.g., Active Directory Server);
• a Hosts Network that hosts the user workstations, which have no interactions with other external networks.

The network traffic usually obeys a set of rules that are defined at the firewall level. Considering as an example the incoming traffic from the external networks, this will affect only those services that are exposed by the company firewall and that are present in the DMZ. The same considerations apply to the intranet traffic: the definition of a set of firewall rules to permit or to block inter-subnet traffic is a common security best practice. Theoretically, by evaluating each source-destination combination it is possible to establish whether a connection is permitted or not. Moreover, this enumeration does not take into account the number of possible high-level actions that can be performed on a specific service, e.g., login procedures, HTTP GET methods, and file uploads. Such an exhaustive approach becomes inefficient when the network is large and highly segmented. Despite the computational problem, a possible solution is given by modeling the network as a cyber environment in which an agent has to perform a set of actions. Hence, a Reinforcement Learning approach can be adopted.

In this work, we have trained two Q-Learning RL agents that operate at the transport and application levels of the TCP/IP stack, respectively. In particular, the agents model the interaction between a user and the network in the specific cyber environment. In a nutshell, the cyber environment is automatically defined by the network topology jointly with the services and the firewall rules. At the end of the training, the Networking RL Agent is able to generate a series of transactions compliant with the environment. In the same fashion, the Application-Level RL Agent models the higher-level interactions with the specific service (e.g., login procedures, upload procedures). Moreover, the second RL Agent can generate normal, malicious, or idle interactions according to a specific behavior. Hence, our model supports a learning and knowledge system that allows it to perform network operations correctly.
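As a minimal sketch of how the cyber environment could provide feedback to the Networking RL Agent, the snippet below scores a candidate transaction (source subnet, destination service, port) against an allow-list of firewall rules: permitted transactions yield a positive reward, everything else falls back to the default block-all policy and yields a negative reward. The Rule encoding, the subnet and service names, and the reward values are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    source: str        # source subnet, or "*" for any
    destination: str   # destination host/service name
    port: int

# Illustrative allow-list; the entries and ports are assumptions for this sketch.
ALLOW_RULES = {
    Rule("*", "FTP", 21),
    Rule("*", "Mail", 25),
    Rule("Hosts Network", "WWW", 80),
    Rule("Hosts Network", "Servers Network", 445),
}

def transaction_reward(source: str, destination: str, port: int) -> float:
    """Reward signal for the Networking RL Agent: positive if the firewall permits
    the transaction, negative otherwise (default policy: block all)."""
    for rule in ALLOW_RULES:
        if rule.destination == destination and rule.port == port \
                and rule.source in ("*", source):
            return 1.0
    return -1.0

# Example: an internal host reaching the exposed FTP service vs. a blocked DMZ lateral move.
print(transaction_reward("Hosts Network", "FTP", 21))       # 1.0
print(transaction_reward("DMZ", "Servers Network", 445))    # -1.0
```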
3.3. Network Traffic Characterization

The network traffic model has to generate and manage different connections from different sources. To address this problem, a series of networking plug-ins have been developed. These modules manage the packets exchanged between the machine on which our model is installed and the target servers. In this perspective, the plug-ins manage TCP and UDP connections, and it is possible to define custom-made payloads. The plug-ins also implement messages such as echo request and reply (i.e., ping), DNS and ARP requests, and, finally, the handling of connections, logins, and command exchanges with the servers. Additionally, it is possible to simulate a Distributed Denial of Service (DDoS) attack, and other types of malicious traffic are modeled (i.e., Brute Force, Host Discovery, and Port Scan). As a final remark, phishing email delivery and malicious file uploads can be performed by the model.

3.4. Model Architecture

Figure 1 shows the general architecture of our model. The cyber environment is automatically defined by the network topology, the services, and the firewall rules. The network plug-in module realizes the interface with the Q-Learning Agents (i.e., the Networking and Application-Level RL Agents) and provides all the networking functions. The configurations module provides the setup parameters both for the RL algorithm and for the services, so as to emulate the desired traffic pattern (i.e., normal or malicious actions). The current implementation requires a proper setup with topology information, firewall configurations, and network rules, which limits the applicability to pre-defined and well-known scenarios. The RL Engine is composed of the following elements:

1. RL Dispatcher: acts as an interface between the RL Learners; it delivers the initial configurations and the network primitives to the RL Learners and, finally, provides the action to be performed to the transaction generator;
2. RL Memory: stores the RL Learners' Q-Tables;
3. RL Policy: implements the strategy used by the agent in pursuit of its goals;
4. RL Learners: implement the Q-Learning algorithm.

Figure 1: General Model Architecture.

Hence, we have chosen to develop two Q-Learning RL Agents that operate at the networking and application levels, respectively. In the learning phase, the RL Learners update the Q-Table values by evaluating the Q-function expressed in Equation 1. The RL Policy rules the action selection and can be defined in terms of a Markov Decision Process [5]. The following policies are available within the model: (i) random strategy, (ii) epsilon-greedy strategy, and (iii) softmax strategy. Finally, the Q-Tables are subdivided according to the traffic type (i.e., external or internal traffic) and to the specific TCP/IP working level. Moreover, they are built dynamically according to the number of services, the sources within the network, the agent behavior, and the application-level functions. The transaction generator module then builds a transaction request, and the model invokes the appropriate network plug-in to correctly manage the connection.
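The sketch below outlines, under stated assumptions, how the components described above could be wired together: the RL Dispatcher queries the two learners for a destination and a high-level action, the transaction generator assembles the request, and the corresponding network plug-in executes it, with exponential pauses between transactions. All class, function, and plug-in names (TabularAgent, dispatch_transaction, PLUGINS) are hypothetical placeholders rather than the actual module names.

```python
import random
import time

class TabularAgent:
    """Minimal stand-in for an RL Learner holding a Q-table over discrete choices."""
    def __init__(self, choices):
        self.q = {c: 0.0 for c in choices}
    def act(self, epsilon=0.2):
        if random.random() < epsilon:
            return random.choice(list(self.q))   # exploration
        return max(self.q, key=self.q.get)       # exploitation

# Hypothetical action spaces derived from the environment definition.
networking_agent = TabularAgent([("Hosts Network", "FTP", 21), ("Hosts Network", "Mail", 25)])
app_level_agent = TabularAgent(["login", "upload", "http_get", "idle"])

# Hypothetical plug-in registry: each plug-in knows how to execute one action type.
PLUGINS = {
    "login":    lambda dst: print(f"login attempt on {dst}"),
    "upload":   lambda dst: print(f"file upload to {dst}"),
    "http_get": lambda dst: print(f"HTTP GET towards {dst}"),
    "idle":     lambda dst: None,
}

def dispatch_transaction():
    """RL Dispatcher logic: combine both agents' choices into one transaction."""
    source, service, port = networking_agent.act()
    action = app_level_agent.act()
    PLUGINS[action]((service, port))
    return source, service, port, action

for _ in range(3):
    dispatch_transaction()
    time.sleep(random.expovariate(0.5))   # exponential inter-arrival, as in Section 3.1
```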
4. Experimental Evaluation

4.1. Experimental Settings

Figure 2: Reference Cyber Range Environment.

In Figure 2, the reference cyber range structure is shown. In particular, the company network is composed as follows:

• DMZ: FTP, Mail, and Web servers. The FTP and Mail servers are the only exposed services;
• Security Network: SOC and SIEM;
• Servers Network: Active Directory and DB server;
• Hosts Network: workstations.

The overall network is monitored by the Security Operations Center (SOC) and the SIEM, and the network traffic is filtered by the company firewall. The firewall configurations are reported in Tables 1 and 2. The external firewall, the attacker workstations, and the other services are introduced to simulate the Internet within the cyber range scenario. The rules regarding the traffic redirected to the SOC are not reported. In addition, the firewall default outcome for traffic matching no rule is the "block all" policy.

Table 1
Firewall Configuration for Exposed Services.

Allowed/Blocked | Source          | Destination
ALLOW           | *               | FTP:21
ALLOW           | *               | Mail:25/143/993

Table 2
Firewall Configuration for Intranet Traffic.

Allowed/Blocked | Source          | Destination
ALLOW           | Servers Network | FTP:21
ALLOW           | Hosts Network   | FTP:21
ALLOW           | Servers Network | Mail:25/143/993
ALLOW           | Hosts Network   | Mail:25/143/993
ALLOW           | Hosts Network   | WWWs:80
ALLOW           | Hosts Network   | Servers Network
BLOCK           | DMZ             | Servers Network

4.1.1. Model Deployment

The model deployment should take into account both the computational resources and the network capabilities of the agent machine. Three different approaches have been identified:

1. Distributed approach: the model is deployed on each workstation and a common configuration file is shared between them;
2. Centralized approach: the model is deployed on a routing node;
3. Centralized "ad-hoc" approach: the model is deployed on a single network node that has full view of the network traffic. This approach represents a hybrid version of the previous ones.

Each of the aforementioned approaches is characterized by advantages and disadvantages regarding: (i) configuration, (ii) network deployment, and (iii) model transparency with respect to the users. For instance, the first approach is the simplest one and uses real workstations in order to generate traffic. However, the deployment procedures and the model transparency are not guaranteed; in particular, the model could be affected by unpredictable actions performed by a Red Team attacker on the workstation itself. The centralized approach guarantees complete model transparency with respect to the cyber range users. Moreover, the routing capability of the node provides complete network visibility. As a final remark, the model can virtually simulate a higher number of sources than the real number of workstations within the cyber range; in this fashion, it is possible to introduce a time-variant number of users. Hence, in this work, the centralized approach is adopted.

4.2. Experimental Results

The simulation results regarding both learning and traffic generation are reported in the following sections. Table 3 reports the general model parameters for both the learning and traffic generation procedures. The reference cyber environment is depicted in Figure 2.

Table 3
General Model Parameters.

Parameter                      | Value/Set
Default FW Rule                | Block All
Number of Simulations          | 10
Number of Episodes             | 50
Maximum Number of Transactions | 600
Learning Rate 𝛼                | 0.9
Epsilon 𝜖                      | 0.2
Epsilon Decay                  | True
Temperature Parameter 𝜏        | 5
Discount Rate 𝛾                | 0.1
Policies                       | Random, Epsilon-Greedy, Softmax
Normal/Attack Profile          | 50 %
Traffic Generation Sim Time    | ∼ 70 min

4.2.1. Learning Performance

We first consider the Cumulative Permitted Transaction Rate (CPTR) generated by the Networking RL Agent. With reference to Figure 3, the CPTR is evaluated for each policy available within the model, and the average value over the total number of simulations and episodes is reported as a function of the overall transactions.

Figure 3: Cumulative Permitted Transaction Rate (CPTR) as a function of transactions and RL policies for the Networking RL Agent.

As depicted in the figure, the CPTR achieves an asymptotic value of ∼ 38 % in the case of the random policy. On the other hand, the CPTR scored with the Epsilon-Greedy and Softmax policies is 70 % and 80 %, respectively.
For the sake of clarity, each curve depicted in Figure 3 is characterized by a confidence interval evaluated as the standard deviation of the simulation data.

Figures 4a and 4b report the cumulative rewards of the Application-Level RL Agent in the case of internal requests (a) and external requests (b). The action space of the Application-Level agent is dynamically defined from the network plug-ins available within the model, as mentioned above. Moreover, the reward function presents a higher degree of granularity in order to take into account both the advantages and the cost of a specific operation (i.e., a DDoS attack is characterized by a higher cost with respect to a port scan). The cumulative rewards depicted in the figures are reported for each available policy and are evaluated over the total number of simulations and transactions.

Figure 4: Cumulative Reward as a function of episodes and RL policies for the Application-Level RL Agent in the case of internal (a) and external (b) requests.

With reference to Figure 4a, the cumulative reward achieved by using the random policy is extremely negative (∼ −4000 points). On the other hand, the cumulative rewards achieved by using the Epsilon-Greedy and Softmax policies are ∼ 2100 and ∼ 3800 points, respectively. These considerations change for the cumulative rewards reported in Figure 4b. In this case, the only policy that achieves a positive cumulative reward is Softmax, while the random and epsilon-greedy policies do not achieve positive values. Regarding the epsilon-greedy scores, the amount of exploration jointly with the network constraints (i.e., only the FTP and Mail servers are reachable from external requests) tends to worsen the learning outcomes.

4.2.2. Simulation Statistics and Outcomes

Figure 5: Execution Time as a function of episodes and RL policies for the Networking (a) and for the Application-Level (b) RL Agent.

Figures 5a and 5b report the total learning execution time for each available policy as a function of the number of episodes, both for the Networking and for the Application-Level RL Agents. The reported data are evaluated over the total number of simulations. As depicted in the figures, the total learning time for the random policy is quite similar to the one achieved by epsilon-greedy, whereas the total learning time of the softmax policy is higher than that of the other policies for both RL agents.

Figure 6: Pearson correlation coefficient between the generated transactions as a function of simulation time (a) and transaction inter-arrival time distribution (b).

Figures 6a and 6b show the Pearson correlation coefficient between the transactions and the transaction inter-arrival time distribution during a long-run simulation. With reference to Figure 6a, the correlation coefficient is reported as a function of the total simulation time. The results show that 99 % of the correlation data are located within the confidence interval between −0.1 and 0.25, with an average correlation of 17.5 %. Hence, the generated transactions have a relatively low correlation. Finally, Figure 6b shows the inter-arrival time distribution, where the simulation data are reported as a histogram. The red curve represents an ideal exponential decay and the green curve represents the data-derived distribution. From these results, it is possible to see that the transactions generated by the model are characterized by a low mutual correlation and by an exponentially distributed inter-arrival time. As a consequence, the model follows a Poisson process.
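A minimal sketch of how such statistical checks could be reproduced on a transaction log is shown below: a lag-1 Pearson correlation between consecutive inter-arrival times as a proxy for independence, and an exponential fit with a Kolmogorov-Smirnov test for the inter-arrival distribution. The log format, the use of SciPy, and the synthetic data are assumptions; the paper does not state which post-processing tools were used.

```python
import numpy as np
from scipy import stats

def independence_and_exponential_check(timestamps):
    """Check two properties of a generated transaction log:
    (i) low correlation between consecutive inter-arrival times (independence proxy),
    (ii) exponential distribution of the inter-arrival times."""
    inter = np.diff(np.sort(np.asarray(timestamps)))
    # Lag-1 Pearson correlation between consecutive inter-arrival times.
    r, p_value = stats.pearsonr(inter[:-1], inter[1:])
    # Fit an exponential distribution (rate = 1 / scale) and test the fit.
    loc, scale = stats.expon.fit(inter, floc=0.0)
    ks_stat, ks_p = stats.kstest(inter, "expon", args=(loc, scale))
    return {"pearson_r": r, "pearson_p": p_value,
            "fitted_rate": 1.0 / scale, "ks_p": ks_p}

# Example on synthetic data (~0.5 requests/s), standing in for a real simulation log.
rng = np.random.default_rng(0)
fake_timestamps = np.cumsum(rng.exponential(scale=2.0, size=2000))
print(independence_and_exponential_check(fake_timestamps))
```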
5. Conclusions

In this paper, we analyzed the problem of simulating user actions in cyber range scenarios by exploring the potential of Reinforcement Learning. The proposal is strengthened by a statistical characterization of user actions and by adopting RL Agents that learn and perform actions at different protocol stack layers. Moreover, the designed RL Agents cooperate and perform networking operations by following specific behaviors. In detail, we focused our investigation on the impact of the different RL policies, for both RL Agents, in terms of learning capabilities and execution time. In addition, we carried out a long-run simulation to evaluate the correlation between the generated transactions and, finally, the inter-arrival time distribution. Future work will focus on overcoming the limitation described in Section 3.4 by introducing autonomous setup configurations, on the implementation of more sophisticated RL algorithms (i.e., Deep Reinforcement Learning), and on cause-effect relationships between user actions.

Acknowledgments

This work was supported in part by the Fondo Europeo di Sviluppo Regionale Puglia Programma Operativo Regionale (POR) Puglia 2014–2020–Axis I–Specific Objective 1a–Action 1.1 (Research and Development)–Project Title: CyberSecurity and Security Operation Center (SOC) Product Suite by BV TECH S.p.A., under Grant CUP/CIG B93G18000040007.

References

[1] C. Ebert, C. H. C. Duarte, Digital transformation, IEEE Softw. 35 (2018) 16–21.
[2] S. Mendhurwar, R. Mishra, Integration of social and IoT technologies: architectural framework for digital transformation and cyber security challenges, Enterprise Information Systems 15 (2021) 565–584.
[3] Associazione Italiana per la Sicurezza Informatica, Clusit report 2021, https://clusit.it/rapporto-clusit/, 2021.
[4] E. Gilman, D. Barth, Zero Trust Networks, O'Reilly Media, Incorporated, 2017.
[5] D. Bertsekas, Reinforcement learning and optimal control, Athena Scientific, 2019.
[6] M. Tokic, G. Palm, Value-difference based exploration: adaptive control between epsilon-greedy and softmax, in: Annual Conference on Artificial Intelligence, Springer, 2011, pp. 335–346.
[7] Idaho National Laboratory, Securing the electrical grid from cyber and physical threats, https://inl.gov/research-programs/grid-resilience/, 2021.
[8] C. Queiroz, A. Mahmood, Z. Tari, SCADASim—a framework for building SCADA simulations, IEEE Transactions on Smart Grid 2 (2011) 589–597.
[9] N. Kaloudi, J. Li, The AI-based cyber threat landscape: A survey, ACM Computing Surveys (CSUR) 53 (2020) 1–34.
[10] Microsoft, CyberBattleSim, https://www.microsoft.com/en-us/research/project/cyberbattlesim/, 2020.
[11] J. A. Nichols, K. Spakes, C. Watson, R. A. Bridges, Assembling a cyber range to evaluate artificial intelligence / machine learning (AI/ML) security tools. URL: https://www.osti.gov/biblio/1772629.
[12] T. Alves, R. Das, A. Werth, T. Morris, Virtualization of SCADA testbeds for cybersecurity research: A modular approach, Computers & Security 77 (2018) 531–546.
[13] Q. Qassim, N. Jamil, I. Z. Abidin, M. E. Rusli, S. Yussof, R. Ismail, F. Abdullah, N. Ja'afar, H. C. Hasan, M. Daud, A survey of SCADA testbed implementation approaches, Indian Journal of Science and Technology 10 (2017) 1–8.
[14] T. Morris, R. Vaughn, Y. S. Dandass, A testbed for SCADA control system cybersecurity research and pedagogy, in: Proceedings of the Seventh Annual Workshop on Cyber Security and Information Intelligence Research, 2011, pp. 1–1.
[15] S. Khan, A. Volpatto, G. Kalra, J. Esteban, T. Pescanoce, S. Caporusso, M. Siegel, Cyber range for industrial control systems (CR-ICS) for simulating attack scenarios, Proceedings of the Italian Conference on Cybersecurity (ITASEC 2021) 2940 (2021) 246–259.
[16] A. Salvi, P. Spagnoletti, N. S. Noori, Cyber-resilience of critical cyber infrastructures: Integrating digital twins in the electric power ecosystem, Computers & Security 112 (2022) 102507.
[17] V. Sundarapandian, Probability, statistics and queuing theory, PHI Learning Pvt. Ltd., 2009.
[18] P. Z. Peebles Jr, Probability, random variables, and random signal principles, McGraw-Hill, 2001.

Appendix

Examples of the simulated network traffic.

Figure 7: Appendix Figures I. (a) DDoS Simulation; (b) Port Scan Traffic Simulation.
Figure 8: Appendix Figures II. (a) Malicious File Upload to FTP server; (b) Malicious File Specs.
Figure 9: Appendix Figures III. (a) Phishing Attack Simulation; (b) Phishing Email Specs.