<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>G. Costa, E. Russo, A. Armando, Automating the generation of cyber range virtual scenarios with
vsdl, Journal of Wireless Mobile Networks, Ubiquitous Computing, and Dependable Applications</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/3664476.3669976</article-id>
      <title-group>
        <article-title>Unveiling Attack Patterns from CTF Network Logs with Process Mining Techniques⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francesco Romeo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Blefari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco A. Pironti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Angelo Furfaro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DIMES - University of Calabria</institution>
          ,
          <addr-line>P. Bucci, 41C, 87036, Rende, CS</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IMT School for Advanced Studies</institution>
          ,
          <addr-line>Piazza S. Francesco, 19, 55100 Lucca, LU</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>13</volume>
      <issue>2022</issue>
      <fpage>207</fpage>
      <lpage>218</lpage>
      <abstract>
        <p>Providing concise yet suficiently detailed representations of potentially malicious interactions with networkexposed services could simplify attack analysis and help identify exploited vulnerabilities. A key challenge is the automatic, real-time generation of such representations to enable prompt defensive actions and rapid responses. Capture the Flag (CTF) competitions of type Attack and Defense (AD) are a gamified example of scenarios where the availability of such a tool could play a critical role. In addition, in CTF-AD environments, the problem of reverse engineering from network packet to high-level interactions is exacerbated by some provisions used to hide the packets' source. This paper proposes an efective solution, based on process mining techniques, which is able to identify and infer the attacker's behavior and to produce its representation as a Directly-Follows Graph (DFG). The approach has been thoroughly validated by exploiting a Cyber Range scenario where  teams fight a CTF-AD competition, comprising: a game server, a set of  machines, one for each team, hosting vulnerable services and from where the own services are handled, and a set of  ≥  machines, one for each simulated player, from where attacks are launched. The developed tool can be used by teams to analyze attacks on their services in order to identify exploited vulnerabilities and replicate them against adversaries.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Log-Analysis</kwd>
        <kwd>Process Mining</kwd>
        <kwd>Cyber Range</kwd>
        <kwd>Network Log</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The ever increasing pervasiveness of connected devices and applications in people’s everyday lives has
improved the way many tasks, even critical or hazardous ones, are carried out. However, at the same
time, the potential interferences due to the exposure to cyber threats are becoming a very important issue
that still needs to be adequately taken into account. Healthcare, infrastructure monitoring, economic
transactions, energy management and other critical applications are vulnerable to Internet threats
and, as highlighted in [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], it is necessary to ensure security in this crucial contexts. The process
of guaranteeing confidentiality, integrity, and availability of IT systems is further complicated by the
continuous emergence of new threats and by the heterogeneity of the involved devices. Elaborating
an efective countermeasure against cyber-attacks is not a trivial task to carry out. The dificulties are
increased by attacks exploiting 0-day vulnerabilities. In this context, the availability of tools able to
automatically detect and analyze anomalous behaviors may play an important role [3].
      </p>
      <p>
        Detection and analysis of security threats are still performed largely manually and often use highly
specialized techniques tailored for the specific domain. Modelling-based techniques are employed to
infer behavioral models of the involved entities, encoded in a mathematical notation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, these
techniques, if not properly used, may produce models that could be very abstract, too complex, and
hard to manipulate. Some attack modeling techniques known in literature are: (a) Attack Surface [4]; (b)
Kill Chain [5]; (c) OWASP’s threat model [6]; (d) Diamond model [7]; (e) Attack Vector [8]; (f) Attack
Tree or Graph [9, 10].
      </p>
      <p>This paper proposes an attack modeling approach, based on process mining techniques [11], which can
be exploited to infer the process model corresponding to the behavior of an external entity, interacting
with a given system, whose behavior is deemed as being anomalous and then probably malicious. Process
models of detected malware can be used to understand the attacker’s cyber-kill chain and identify the
exploited vulnerabilities (e.g. zero-day). The efectiveness of the approach has been evaluated in the
context of Capture the Flag (CTF) competition of type Attack and Defense (AD) executed into a Cyber
Range (CR) platform.</p>
      <p>This paper proposes an efective solution, based on process mining techniques, which is able to
identify and infer the attacker’s behavior and to produce its representation as a Directly-Follows Graph
(DFG).</p>
      <p>The remainder of the paper is organized as follows. Section 2 introduces the background giving a
short overview about process mining, Cyber Ranges and CTF environments. Section 3 illustrates the
proposed methodology. A Proof-of-concept implementing the proposed approach and an evaluation is
given in section 4. Section 5 concludes the paper and discusses future works.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <title>2.1. Process Mining</title>
        <p>The aim of Process Mining-based techniques is to extract the knowledge held by a process in such a
way to provide broader perspectives for the interpretation of the data, with the goal of obtaining a
better understanding of the process behavior. By leveraging process mining techniques, it is possible
to discover, monitor, and improve real processes by extracting knowledge from event logs. Clearly,
process mining is particularly relevant in a context where the involved actors are autonomous and can
deviate or exhibit emergent behaviors. The more ways in which services, people, and organizations
can deviate, the more interesting it is to observe and analyze the corresponding processes as they are
executed. Three fundamental types of process mining tasks can be identified [12]:
• Discovery: the model is constructed based on an event log. In this case, there is no a priori model.
• Conformance: in this case, an a priori model is used to verify if reality conforms to the model. It
is employed to detect, identify, explain, and measure the severity of the discovered deviations.
• Extension: similarly to the conformance case, there exists an a priori model, but it is extended
with a new aspect or perspective. The goal here is not to verify conformity but to enrich the
model.</p>
        <p>From the cyber security perspective, the use of PM techniques relying on process discovery are
efective in that, unlike classical monitoring solutions, they allow to reduce implementation time by not
requiring the availability of a process model in advance. Moreover, there are some approaches which
combine classical data-centric (e.g. Data Mining) approaches and process-centric approaches (e.g. BPM
Analysis) to give better results in addressing security issues [13]. These approaches represent an added
value since they are designed to examine when and how an actual process deviates from a given process
model, thus giving critical insights on the real-time behavior of a system, for example deviation from
the normal evolution toward a (potential) unsafe state.</p>
        <p>As stated before, the aim of process discovery is to infer a process model, expressed in a suitable
mathematical notation (e.g., Petri Net, BPMN, process tree), from the analysis of a set of traces [14].
The obtained model must be able to explain all the recorded logs. The resulting “mined” model can be
employed in a workflow management system or used to perform analysis on the actual process behavior,
thus identifying deviations between the model and the recorded behavior. A process discovery algorithm
usually performs the following subtasks: (i) analyzes traces or logs; (ii) extracts causal dependencies;
(iii) presents results in the form of a dependency graph; (iv) enriches the resulting dependency graphs
with advanced aspects of process behaviors by returning a formalized process model in specific process
modelling language. From a given log trace, diferent process models can be extracted, and a choice
of the one considered the most appropriate has to be made. Process discovery is often carried out
using heuristic approaches, as can be seen in [15], and the quality of the resulting models increases
proportionally with the number of available traces. The data set size has to be adequate, indeed, if it is
too small, the quality of the models tends to be very poor.</p>
        <p>Process mining tools such as ProM [16] are capable of working with considerable amounts of data,
allowing process mining to be applied to real web services without any problem.</p>
        <sec id="sec-2-1-1">
          <title>2.1.1. Directly-Follows Graph</title>
          <p>Directly-Follows Graphs [17] (DFGs) are commonly used in process mining to explore event data. A
trace, or process variant,  = ⟨1, 2, . . . , ⟩, is a sequence of activities. #( ) denotes the number of
cases in the event log  that correspond to  , while #() and #(, ) represent the occurrences of
activity  and the directly-follows relation (, ), respectively. Without loss of generality, it is assumed
that each case starts with a start event (▶) and ends with an end event (■ ). A DFG is a directed graph
where nodes represent activities, and edges represent directly-follows relationships. Three thresholds,
 var,  act, and  df, filter out infrequent traces, activities, and relations, respectively. The construction
process is as follows:
1. Input: Event log  and thresholds  var,  act,  df.
2. Filter traces: Remove traces  with #( ) &lt;  var, producing ′.
3. Filter activities: Remove activities  with #′ () &lt;  act, resulting in ′′.
4. Add nodes: Create a node for each activity in ′′.
5. Add edges: Connect nodes  and  if #′′ (, ) ≥  df.
6. Output: A DFG with nodes labeled #′′ () and edges labeled by #′′ (, ).</p>
          <p>Additional timing statistics can be computed for each edge.</p>
          <p>This process results in a simplified DFG that focuses on significant activities and relationships in
the event log. Some limitations of DFGs are their susceptibility to producing misleading results due
to improper filtering or threshold adjustments, which can lead to distorted activity sequences and
inaccurate performance diagnostics. Furthermore, DFGs lack the expressiveness to represent complex
process structures such as parallelism, loops, or routing logic.</p>
          <p>Despite these weaknesses, DFGs present advantages over other types of representations in terms of
immediacy of comprehension. Therefore, DFGs represent a good choice in a scenario where having
an easy to understand representations is deemed more important than obtaining extremely accurate
representations of complex structures.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>2.1.2. Anomaly Detection Process Model</title>
          <p>Due to the limitations of manual analysis, there is a growing demand among developers for automated
log analysis methods. During the past decade, extensive research has focused on innovative approaches
to automatic log analysis and anomaly detection [18, 19, 20], resulting in the development of a four-phase
process: (i) log collection; (ii) log parsing; (iii) feature extraction; (iv) anomaly detection [21].</p>
          <p>Usually, in the log collection phase, the logs are retrieved from their storage locations for analysis.
Each log must present a timestamp and a descriptive message indicating its parent event, as well as
other descriptive fields. The collected logs are then processed in order to detect anomalies. Starting
from the parsing phase, event models are extracted from previously collected unstructured logs to
obtain structured logs. These logs are composed of static and dynamic components. The two main
approaches to carry out log analysis are: • cluster-based: it utilizes distances between logs for clustering
and event template generation; and • heuristic-based: it identifies frequent words to compose candidate
log events. In the feature extraction phase, the structured logs are parsed into numerical feature vectors
and categorized into sequences using three grouping techniques: fixed windows, sliding windows, and
session windows, each based on diferent criteria. This is a fundamental phase which transforms logs
to enable the application of machine learning techniques. A feature vector is generated for each log
sequence that represents the occurrence number of each event. The fourth phase is the detection of
anomalies. The output of this phase is a feature matrix which can be provided to a machine learning
model for training and generation of an anomaly detection model. Based on the inferred model, a new
incoming log sequence can be classified as anomalous or normal.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Cyber Range Platforms</title>
        <p>The growing number of attacks occurred in recent years, raises constantly the need of highly skilled
professionals. There is the necessity of cyber-infrastructures able to support training, vulnerability
assessment, and testing activities. Cyber Ranges (CRs) platforms, which are composed of several
components and have characteristics that allow them to be able to adapt to a variety of diferent
scenarios, play an important role in this panorama. It is possible to define a Cyber Range as an
exercising environment containing both physical and virtual components to represent realistic training
scenarios [22]. With this need in mind, Capture the Flag (CTF) or Attack and Defense (AD) challenges
can be seen as a way to improve the player’s skillset.</p>
        <p>The National Institute of Standard Technology (NIST) highlighted that cyber security is a nowadays
challenge requiring even more skilled experts [23]. Indeed, today’s cyber security attacks are increasingly
complex, often coalescing vulnerabilities having diferent nature. In [ 23], Cyber Range (CR) platforms
are identified as a valid tool useful in reducing the lack of knowledge in the field of cyber security. By
supporting this assertion, over the years, CR platforms became one of the most promising advances in
the cyber security area for testing and training purposes. Despite the fact that solid theoretical security
knowledge constitutes the foundation to face cyber threats and vulnerabilities, cyber security experts
need to regularly train their abilities in order to enhance their skills and improve their expertise. Using
CRs, it is possible to provide a suitable platform capable of replicating real world scenarios [24, 25, 26] by
including: (i) physical and virtual components [22, 27]; (ii) software components able to emulate/simulate
some (physical) systems’ behavior; (iii) entities having specific software vulnerabilities [ 28] which are
under study.</p>
        <p>According to the state-of-the-art literature and to the NIST specification, CRs are defined as
“interactive and simulated representations pertaining to an organization’s local network, system, tools, and
applications" [29]. CRs are widely adopted in cyber security field, indeed, they are commonly used
to emulate (or simulate) specific computer environments, especially to test system or infrastructure
security, and to train people across diferent cyber security fields [ 30, 31]. The CRs platform acts as
a virtual playground for cyber security professionals, students and researchers, becoming the base
technology to assess and improve their knowledge and practice against cyber threats. The realistic and
controlled environment provided by a CR allows to create specialized purposes scenarios which mimic
real-world assets (networks, systems and applications) including simulation of auxiliary infrastructures
(such as servers, clients, firewalls and routers). The main advantage ofered by CRs is the high isolation
level that these platforms ensure. Examples of useful scenarios supported by CRs are: (i) red team
vs blue team battlefield; (ii) cyber security training environment for certification [ 32]; (iii) incident
response, forensics investigation, penetration testing, and vulnerability assessment scenarios for testing
purposes; (iv) malware analysis or reverse engineering scenarios for research purposes [33]. According
to the authors of [30], the creation of CRs is a challenging task, thus they propose the “Cyber Range
Instantiation System (CyRIS)” in order to ease the creation of complex CR Platforms. A new development
approach of next-gen CRs systems, based on cloud computing and hypervisor technologies, that could
improve the realism leveraging statistical and AI techniques has been recently proposed in [27].</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Attack &amp; Defense Capture the Flag</title>
        <p>A particular type of red team vs. blue team operations in a Cyber Range environment is an “Attack
and Defense” (A/D) Capture the Flag challenge game. These games are designed as a learning tool for
enhancing participants’ skills and as test bed for new platforms, attacks, and attacks and/or defense
methodologies that will be employed in the wild. In each challenge, every team has to defend a machine
(vulnbox). This specific scenario can be used for both training and fun purposes. An A/D game consists
of several rounds of equal duration. Within each round, the game server interfaces with the vulnbox
assigned to each team in order to refresh the stored flags by employing a series of scripts called checkers.
In this context, the flag term, refers to a secret information that each team has to steal from the others.
Once a flag has been successfully obtained, it is sent to the game server in order to redeem points. In
the same way, when an attacker team steals a flag and submits it, the defenders lose points. At the end
of the competition, the team with the highest score wins the challenge.</p>
        <p>Each vulnbox has a number of exposed services and the machine owner’s team has to keep all services
active in order to maintain their Service Levels Agreement (SLA) as high as possible. The entire game is
handled by a component named Game Server. It is the core component which hosts some services for
each team: (i) a scoreboard allows to see the current point distribution and the SLA for each service;
(ii) a service known as FlagID is exposed in order to communicate some bit of information needed
to find the flags (e.g the name of the right user); (iii) another services is responsible to authenticate
each team and accept or deny the provided flags. If a specific service does not work properly or is not
reachable, a penalty is assigned to the owner team. Such an operation is made by the Game Server that
checks also: (i) the availability of all services updating the SLA score accordingly; (ii) if it is possible
to insert a new flag; (iii) if it is possible to retrieve the previously inserted flag. Teams’ machines and
Game Server, cannot be identified by their IP address due to the Network Address Translation (NAT),
which hides the real IP addresses in order to avoid identification of attackers.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The devised approach is able to perform process mining on logs coming from hosts deployed in a
CR environment and has been tailored to work with HTTP-based web challenges. In this paper, we
decided to limit the evaluation of the methodology to the above-described contexts in order to have
a functional proof-of-concept that can be easily tested in real life challenges. We plan to expand the
validation to more general settings in future work. The basis of the proposal relays on customizing
the anomaly detection process in order to infer a model describing events occurring into the analyzed
service. Indeed, diferently from the classic anomaly detection process, we replaced (i) the feature
extraction phase with a custom-made phase to pre-process log data, and (ii) the anomaly detection
process phase, with the visualization of a dependency graph in such a formalism to better understand
dependencies among network events that occurred into the service. The resulting four-phase process is:
(i) Log collection; (ii) Log parsing; (iii) Log preprocessing; (iv) Inferred Model Visualization. During
the log collection phase, basic network packet snifers are employed to gather logs. The log parsing
is responsible for converting them into a more structured and lightweight JavaScript Object Notation
(JSON) representation, stripped of unnecessary data. In the Log preprocessing phase, parsed logs are
computed by a packet aggregator based on domain-specific metrics in order to reconstruct diferent
HTTP sessions. In the inferred model visualization phase, organized logs are refined by applying a
particularly suitable process mining algorithm in order to produce a better representation of the events.
In this phase we (i) perform process discovery analysis, and (ii) visualize the related process graph.
Log collection. The log collection phase is carried out by using a packet snifer tool (such as tcpdump [ 34]).
Such a tool, intercepts all network trafic for a specific web service storing it into “pcap” files to extract
useful information regarding network ISO/OSI levels from 2 to 7 in the following phase.
Log parsing. Each network packet, in the “pcap” file, is analyzed in this phase that performs their
conversion following the JSON specification. Moreover, pre-filtering, retaining only relevant fields
according to Table 1 for the HTTP request packets and for HTTP response packets, is performed. The
output of the log parsing phase is a list of JSON objects containing processed network packets.</p>
      <p>Field
timestamp
src_ip
src_port
dst_ip
dst_port
packet_type
HTTP_method
request_uri
status_code
cookie
body</p>
      <sec id="sec-3-1">
        <title>Description</title>
        <p>The absolute timestamp
The source IP address
The source port
The destination IP address
The destination port
Specifies if such a packet is a request or a response
The HTTP method used
The URI
The status code related to the HTTP response (Only in
HTTP response)
The cookie</p>
        <p>The body content
Log preprocessing. In this phase, the JSON list is transformed in order to obtain more structured and useful
data suitable for process mining algorithms. The Log Preprocessing phase is composed of three software
modules that sequentially handle data, respectively: (i) to reconstruct data, (ii) to filter and enrich data,
(iii) to convert the JSON list according to the eXtensible Event Stream (XES) format, an XML-based
format commonly used by process mining algorithms. The first module, called Session Reconstructor,
has been designed to merge related network flows based on (i) TCP connection, and (ii) cookie field or
authentication field. The idea behind session reconstruction is to identify all TCP sessions among all
TCP flows in order to merge multiple TCP sessions based on the cookie or authentication fields present
into the request (or response) header. The second module, called Log filtering and marking , initially
performs a packet filtering, retaining only relevant HTTP packets. Subsequently, a string describing the
activity is added to each entry of the filtered logs in the previous module. If the log entry is an HTTP
request, the activity description is composed of HTTP_method and request_uri. On the other hand, in
the case of an HTTP response, the activity description is obtained as status_code, HTTP_method, and
request_uri. Lastly, reconstructed sessions are analyzed using a pattern matcher to spot leaked sensitive
information. If an information leakage is detected, a marker is added to the whole session. Finally, the
module Parser Dict to XES is used to translate pre-processed log data into the XES format.
Inferred Model Visualization. The inferred model visualization phase is carried out using process mining
algorithms. In this phase, the XES file received as input is processed by a specific process mining
algorithm to: (i) create causal dependencies among all information (representing network packets)
present in each trace, and (ii) to obtain a model (visualizable in such a formalism) that specifies attackers’
interaction with the analyzed system.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Proof-of-concept and evaluation</title>
      <p>To prove the efectiveness of our methodology, we designed a proof-of-concept architecture. The
hardware employed in our experiments is based on a server HP ProLiant DL360 gen8 featured by
a 6x 8-core Intel Xeon E3-1260 processor, 256 GB of ram and 6 Solid State Drive (ssd). This server
is equipped with Proxmox VE [35], a type 1 hypervisor that facilitates the virtualization of virtual
machines running Unix or Windows operating systems and LXC containers. In choosing a scenario
suitable for the proposed approach where it can truly be audited and useful but also validated, the Attack
and Defense CTF challenges provide a good testing ground. Moreover, this challenges mimics real-world
vulnerabilities (e.g. services exposing SQL injection vulnerabilities) in a fast paced environment where
every minute count. Having a tool that can help to understand the opponent’s attacks leads to faster
patch and overall better scores, giving the users of this tool an advantage. The rigorous structure of
the flag allows treating this particular data as a marker, enabling a better understanding of what an
anomaly is (e.g. Unauthorized flag inclusion in a file) and the steps that the attackers has taken to gain
that flag.</p>
      <sec id="sec-4-1">
        <title>4.1. Methodology implementation steps</title>
        <p>For each phase of our methodology, we developed ad hoc software modules that, when computing input
data, provide output data suited for the next module or visualization (in the case of the final phase). The
ifrst phase collects network logs from a specific service. To this aim, we used tcpdump [ 34], a packet
snifer based on eBPF technology [ 36]. The output of this phase, as described in Section 3, is a “pcap”
ifle that contains all captured network packets; then it is served as input of the Log Parsing phase to be
converted using the PCAP Parser module. To this end, we relied on PyShark [37], a Python wrapper
for the tshark network protocol analyzer [38], allowing python packet parsing using wireshark [38]
dissectors. To perform process mining analysis, we used the pm4py Python library, which provides
process discovery algorithms. Since all process discovery algorithms in the pm4py library take in input
the data according to the XES format, we developed a conversion module to convert JSON data to meet
XES specification. In our proof-of-concept, we used only the inductive miner as process discovery
algorithm even if we designed the whole system to accommodate the addition of other process mining
algorithms.</p>
        <p>pm4py also allows the use of thresholds in the form of percentage. A proper setting of the threshold
values is crucial for the efectiveness of the mining process. A low threshold results in a highly detailed
graph, which could contain noise. On the other hand, a high threshold results in a loss of relevant
information. For this reason, after empirical evaluations, we choose to use only path thresholds with a
value of 30%.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. The Cyber Range platform</title>
        <p>For proof-of-concept purposes, hosting a complete game infrastructure on a cyber range was deemed
too complex. Instead, a reduced architecture was instantiated. To demonstrate the efectiveness of the
proposed approach, a small-scale cyber range was created and populated with a few vulnboxes.</p>
        <p>The scenario was configured to support a game setup that involves one vulnerable service exposed.
In this environment, three containers based on Ubuntu 20.04 were deployed: the first container (with IP
address 10.60.0.1) is the vulnbox where the network logs are collected and the services are exposed.
Another container (10.60.0.100) is simulating an attacker where a custom made script continuously
launches attacks directed to the vulnbox. The third machine (10.60.0.254) is configured as a Game
GET / (18)
18
18
●
83
83
POST /register (83)</p>
        <p>GET /forms (23)
39
13
200 POST /register (83)</p>
        <p>200 GET /forms (39)
200 GET / (18)
18
27</p>
        <p>9
18</p>
        <p>Server continuously running the checkers scripts and exposing a custom-made flagID service. Since all
services run within the same network (10.60.0.0/24), a virtual switch connecting the three containers
sufices. For more complex scenarios, virtual routers could be introduced to enhance the network
topology.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Challenge “CCForms"</title>
        <p>The CCForms service is a website where it is possible to create forms with multiple questions and
collect relative answers. It is a Webapp with a React front-end and a NodeJS back-end. A functionality
of this application is the possibility of writing a public or private note. Although CCForms has multiple
vulnerabilities, the one chosen for this experiment is an Insecure Direct Object References (IDOR)
vulnerability allowing unauthenticated agents to request directly a page and get the flag. This challenge
was proposed in the 2024 Attack and Defense final of the Italian competition “CyberChallenge.IT” [ 39].</p>
        <p>GET/(15)
15
15
128</p>
        <p>POST/register(128)
200GET/(15)
22</p>
        <p>14</p>
        <sec id="sec-4-3-1">
          <title>4.3.1. Evaluation</title>
          <p>In order to properly evaluate the proposed approach, a set of experiments is carried out in diferent
scenarios, and the resulting network logs are collected and processed. In the first scenario, the services
from the vulnbox are deployed and the game server is running the checker script simulating lawful
behavior. An accurate visualization of the checkers work is shown in Figure 2. In the second scenario,
an attacker that actively sends payloads to the vulnbox is added, simulating a combined lawful and
malicious behavior (see Figure 3). Through a “visual diference” between the previously described
scenarios, it is possible to infer the interaction of the attackers with the system (i.e. the path that
connects the “/register” node to the “/form/&lt;uuid&gt;/answers” response node marked as “FLAG_OUT”).</p>
          <p>To support our claim, we ran a scenario comprising only the attacker while omitting the checker.
This resulted in the graph shown in Figure 4, which corresponds to the previously identified path. This
type of validation was possible because the experiments were conducted in a controlled environment.</p>
          <p>A network log of 10 MB has been recorded for each simulated scenario.</p>
          <p>POST /register (274)
200 POST /register (275)
275
268
276
276
■
GET /form/&lt;uuid&gt;/answers FLAG_OUT_REQ (268)
200 GET /form/&lt;uuid&gt;/answers FLAG_OUT (276)</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>This work presented an innovative approach that employs process mining techniques to extract attack
graphs, in the form of Directly-Follows Graphs, from network logs. The efectiveness of the developed
technique has been evaluated by applying it to network logs collected in a Cyber Range environment
specifically tailored for this experimentation. The obtained DFGs proved to be accurate and efective
in illustrating the primary attack paths and in helping to distinguish between malicious actions and
legitimate user requests. These behavioral models are valuable tools for: understanding attackers’
behavior; identifying exploited vulnerabilities; gather useful insights for developing both defensive and
ofensive strategies in Attack and Defense (AD) Capture the Flag (CTF) scenarios. Future work is geared
at: evaluating the approach with diferent types of attacks (other than HTTP-based) and vulnerabilities
(other than IDOR); refine and expand the methodology to enhance its robustness and minimize the
need for manual setup.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.
ways forward, Maturitas 113 (2018) 48–52. doi:https://doi.org/10.1016/j.maturitas.
2018.04.008.
[3] H. Al-Mohannadi, Q. Mirza, A. Namanya, I. Awan, A. Cullen, J. Disso, Cyber-attack modeling
analysis techniques: An overview, in: 2016 IEEE 4th International Conference on Future Internet of
Things and Cloud Workshops (FiCloudW), 2016, pp. 69–76. doi:10.1109/W-FiCloud.2016.29.
[4] P. K. Manadhata, J. M. Wing, An attack surface metric, IEEE Transactions on Software Engineering
37 (2011) 371–386. doi:10.1109/TSE.2010.60.
[5] E. Hutchins, M. Cloppert, R. Amin, Intelligence-driven computer network defense informed by
analysis of adversary campaigns and intrusion kill chains, Leading Issues in Information Warfare
&amp; Security Research 1 (2011).
[6] X. Lin, P. Zavarsky, R. Ruhl, D. Lindskog, Threat modeling for csrf attacks, 2009, pp. 486 – 491.</p>
      <p>doi:10.1109/CSE.2009.372.
[7] S. Caltagirone, A. D. Pendergast, C. Betz, The diamond model of intrusion analysis, 2013. URL:
https://api.semanticscholar.org/CorpusID:108270876.
[8] M. Mulazzani, S. Schrittwieser, M. Leithner, M. Huber, E. Weippl, Dark clouds on the horizon:
Using cloud storage as attack vector and online slack space, in: 20th USENIX Security Symposium
(USENIX Security 11), 2011.
[9] C. Phillips, L. P. Swiler, A graph-based system for network-vulnerability analysis, in: Proceedings of
the 1998 Workshop on New Security Paradigms, NSPW ’98, Association for Computing Machinery,
New York, NY, USA, 1998, p. 71–79. URL: https://doi.org/10.1145/310889.310919. doi:10.1145/
310889.310919.
[10] B. Schneier, Attack trees, Dr. Dobb’s journal 24 (1999) 21–29.
[11] W. van der Aalst, Data Science in Action, Springer Berlin Heidelberg, Berlin, Heidelberg, 2016, pp.</p>
      <p>3–23. URL: https://doi.org/10.1007/978-3-662-49851-4_1. doi:10.1007/978-3-662-49851-4_1.
[12] W. Van Der Aalst, A. Adriansyah, A. K. A. De Medeiros, F. Arcieri, T. Baier, T. Blickle, J. C. Bose,
P. Van Den Brand, R. Brandtjen, J. Buijs, et al., Process mining manifesto, in: Business Process
Management Workshops: BPM 2011 International Workshops, Clermont-Ferrand, France, August
29, 2011, Revised Selected Papers, Part I 9, Springer, 2012, pp. 169–194.
[13] M. Macak, L. Daubner, M. F. Sani, B. Buhnova, Process mining usage in cybersecurity and software
reliability analysis: A systematic literature review, Array 13 (2022) 100120.
[14] Y. Bertrand, B. Van den Abbeele, S. Veneruso, F. Leotta, M. Mecella, E. Serral, A survey on the
application of process discovery techniques to smart spaces data, Engineering Applications of
Artificial Intelligence 126 (2023) 106748. URL: https://www.sciencedirect.com/science/article/pii/
S0952197623009326. doi:https://doi.org/10.1016/j.engappai.2023.106748.
[15] H. R’bigui, C. Cho, Heuristic rule-based process discovery approach from events data, International</p>
      <p>Journal of Technology Policy and Management 19 (2018). doi:10.1504/IJTPM.2019.10025752.
[16] W. M. Van der Aalst, B. F. van Dongen, C. W. Günther, A. Rozinat, H. Verbeek, A. Weijters, Prom:
The process mining toolkit, in: Proceedings of the BPM 2009 Demonstration Track (BPMDemos
2009, Ulm, Germany, September 8, 2009), CEUR-WS. org, 2009, pp. 1–4.
[17] W. M. van der Aalst, A practitioner’s guide to process mining: Limitations of the directly-follows
graph, Procedia Computer Science 164 (2019) 321–328. doi:https://doi.org/10.1016/j.
procs.2019.12.189, cENTERIS 2019 - International Conference on ENTERprise Information
Systems / ProjMAN 2019 - International Conference on Project MANagement / HCist 2019
International Conference on Health and Social Care Information Systems and Technologies,
CENTERIS/ProjMAN/HCist 2019.
[18] M. A. Siddiqui, J. W. Stokes, C. Seifert, E. Argyle, R. McCann, J. Neil, J. Carroll, Detecting cyber
attacks using anomaly detection with explanations and expert feedback, in: ICASSP 2019 - 2019
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp.
2872–2876. doi:10.1109/ICASSP.2019.8683212.
[19] M. Alabadi, Y. Celik, Anomaly detection for cyber-security based on convolution neural network :
A survey, in: proc of 2021 6th International Symposium on Computer and Information Processing
Technology (ISCIPT), 2020, pp. 1–14. doi:10.1109/HORA49412.2020.9152899.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.-W.</given-names>
            <surname>Ten</surname>
          </string-name>
          , G. Manimaran, C.-C. Liu,
          <article-title>Cybersecurity for critical infrastructures: Attack and defense modeling</article-title>
          ,
          <source>IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans</source>
          <volume>40</volume>
          (
          <year>2010</year>
          )
          <fpage>853</fpage>
          -
          <lpage>865</lpage>
          . doi:
          <volume>10</volume>
          .1109/TSMCA.
          <year>2010</year>
          .
          <volume>2048028</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Coventry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Branley</surname>
          </string-name>
          ,
          <article-title>Cybersecurity in healthcare: A narrative review of trends, threats and</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>