<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Comparative Analysis of Datasets for Intrusion Detection in Software-Defined Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francesco Di Gennaro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Cucchiarelli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Morbidoni</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Spalazzi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IMT School for Advanced Studies</institution>
          ,
          <addr-line>Lucca</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università Politecnica delle Marche</institution>
          ,
          <addr-line>Ancona</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università degli Studi G. d'Annunzio</institution>
          ,
          <addr-line>Pescara</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Software-Defined Networking (SDN) offers centralized management, programmability, flexibility and scalability, but it also introduces significant security risks, especially DDoS attacks against the SDN controller that threaten network availability. Machine learning (ML) and deep learning (DL) show promise in mitigating these threats, but their success depends on the quality of the available datasets. Existing SDN datasets often focus narrowly on specific DDoS scenarios or synthetic environments, limiting their real-world applicability. This paper analyzes SDN threat datasets, evaluating their methodologies, features and ML applications. It highlights strengths such as realistic traffic emulation and accessibility, alongside limitations such as narrow attack coverage and synthetic biases. A roadmap is proposed to guide the generation of new datasets, emphasizing diverse attacks, richer features, realistic augmentation and public access to enable robust ML/DL-based SDN security solutions.</p>
      </abstract>
      <kwd-group>
        <kwd>Dataset</kwd>
        <kwd>IDS</kwd>
        <kwd>SDN</kwd>
        <kwd>DDoS</kwd>
        <kwd>Machine Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Software-Defined Networking (SDN) is an emerging technology that offers centralized control,
programmability, scalability and flexibility in the management of complex networks. By decoupling the
control and data planes, SDN enables dynamic resource allocation and simplified network management.</p>
      <p>
        However, this centralized architecture also introduces significant security vulnerabilities[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], particularly
in the form of Distributed Denial of Service (DDoS) attacks[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which can overwhelm the SDN controller's
resources and compromise network availability. These vulnerabilities must therefore be mitigated
to ensure the resilience and robustness of SDN deployments,
especially as these networks grow larger and more complex. Researchers have focused
on developing intrusion detection systems and mitigation techniques to counter these threats,
leveraging machine learning and deep learning models[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        At the core of these solutions are datasets that simulate real-world SDN traffic conditions,
including both benign and malicious traffic[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These datasets serve as benchmarks for ML/DL-based systems
by offering insights into network behavior and enabling improved attack detection. However, they vary
in design, scope, attack target and features, and each exhibits both strengths and limitations. This underscores the
need for richer and more representative data to secure modern SDN deployments.
This paper examines the methodologies used to create these datasets, the network
topologies and tools employed and the traffic captured, highlighting their contributions to the field
and the gaps that remain to be filled. By critically evaluating these datasets, it aims to provide a
foundation for future research on resilient, adaptable and scalable security solutions for SDN
environments.
      </p>
      <p>The remainder of the paper is organized as follows. Section 2 discusses works related to the creation of datasets in
SDN environments. Section 3 analyzes the papers presented in Section 2 in terms of
network architecture. Section 4 discusses the findings of Sections 2 and 3.
Section 5 presents the conclusions and perspectives for future work. Appendix A provides the
network architectures employed by each paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature Review</title>
      <p>Software-Defined Networks (SDNs) present distinctive security challenges, particularly due to their
centralized architecture, which increases vulnerability to attacks targeting the availability of network
resources. To address these threats, datasets have been developed to support research in intrusion
detection and machine learning (ML)-based security solutions, each with strengths and limitations.
Table 1, in subsection 2.8, provides an overview of the features detailed for each dataset in the
following subsections.</p>
      <sec id="sec-2-1">
        <title>2.1. Niyaz et al. (NSJ) Dataset</title>
        <p>
          The dataset[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] was published in 2017. It analyzes DDoS attack detection in SDN across different vectors,
combining various flooding techniques over the TCP, UDP and ICMP protocols.
Seven specific types of attacks were examined. These attacks were simulated with tools like hping3<sup>1</sup> in
a testing environment, varying packet sizes and rates to overwhelm the SDN infrastructure, although the
specific rates were not quantified.
        </p>
        <p>Although the study’s primary focus is DDoS attacks, it also acknowledges the potential applicability of
its methodology to other intrusion types, such as network scanning and malware propagation.
Traffic was collected over three days, from a real-world wireless home network for benign traffic and from a
segregated lab environment for malicious traffic. The data was captured using tcpdump, manipulated
with bit-twist and replayed via tcpreplay. The experimental setup for malicious traffic generation used an
SDN testbed consisting of a POX<sup>2</sup> controller, OpenFlow-enabled switches and virtual machines running
on the VMware ESXi hypervisor, which ensures centralized traffic monitoring and feature extraction, as
shown in Figure 1.</p>
        <p>The POX controller hosted the application employed for detection and communicated over the
OpenFlow protocol, centrally managing all traffic channels. A set of 68 features was
extracted from TCP, UDP and ICMP traffic, including packet counts, entropy and flags such as SYN and
ACK.</p>
        <p>Although the dataset combines real-world and simulated traffic, it is not explicitly available for public use.
For detection, the system employed a Stacked Autoencoder (SAE) both for feature reduction
and for classification, outperforming other models such as softmax classifiers and neural networks.
The key contribution of the paper is the simulation of different attacks using the TCP, UDP and ICMP
protocols. The SDN testbed based on the POX controller demonstrated its ability to centralize monitoring
and feature extraction, with 68 features that serve as input for machine learning-based detection.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Zerbini et al. (ZCA) Dataset</title>
        <p>
          The authors of the paper[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] explore crucial aspects of security in network environments, with a focus on
the analysis, detection and mitigation of both DDoS and port scan attacks. The research provides a
framework for network security research, with emphasis on dataset generation, traffic analysis and
the adoption of machine learning models. It investigates various DDoS attack typologies such as
TCP-SYN, UDP and ICMP floods, as well as application-layer DDoS attacks targeting web servers.
The research also considers attack rates, ranging from high-rate floods that saturate bandwidth to
low-rate, stealthy attacks that aim to evade detection by targeting application-layer resources.
In addition to DDoS attacks, the dataset contains other types of intrusion, such as port scans.
Traffic was collected using a virtualized testbed simulating realistic conditions, with tcpdump used
to collect packet data and statistical features. Hping3, LOIC<sup>3</sup>, Metasploit, Hydra<sup>4</sup> and
SQLMap<sup>5</sup> were employed to generate traffic, while the network architecture
was based on a tree topology with a root switch and four subnets consisting of 20 hosts each. Open
vSwitch was employed for its compatibility with OpenFlow and robust capabilities in handling network
traffic. The controller platform leveraged the POX controller, integrating discrete wavelet transforms
(DWT), anomaly detection and mitigation modules.
        </p>
        <p><sup>1</sup> Documentation available at https://www.kali.org/tools/hping3/.
<sup>2</sup> Documentation available at https://noxrepo.github.io/pox-doc/html/.</p>
        <p>The network architecture is shown in Figure 1. Although the two-day simulation produced
only six features and focused primarily on traditional DDoS attacks, the dataset supports research in
network security and intrusion detection. It is reported to be available for further study. Various machine
learning models, including Decision Trees, Random Forests, Naive Bayes, K-Nearest Neighbors, Support
Vector Machines and Multi-Layer Perceptrons, were applied to assess the dataset, demonstrating its
utility in advancing machine learning applications for SDN security.</p>
        <p>The study provides a systematic approach to dataset generation and traffic analysis for network security,
emphasizing diverse DDoS attack typologies and application-layer attacks. It also explores additional
threats such as password guessing and web application exploits. Although it relies on a virtualized testbed, the
dataset reflects realistic network scenarios. The network architecture, based on a tree topology with
Open vSwitch and a POX controller, integrates anomaly detection mechanisms. Even though it focuses
on six features and traditional DDoS scenarios, the dataset demonstrates its applicability for research,
with machine learning models such as Random Forest and SVM showing high performance in threat
detection.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. InSDN Dataset</title>
        <p>
          The InSDN dataset[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] offers a representation of security challenges in SDN environments. It addresses
different attack types and includes advanced simulation tools. It spans multiple DDoS attack modes,
such as TCP-SYN, UDP and ICMP floods, as well as HTTP flood attacks conducted using Slowloris<sup>6</sup> and
Torshammer<sup>7</sup>.
        </p>
        <p>Beyond DDoS, the dataset also addresses other significant threats, including password-guessing attacks,
web application vulnerabilities, botnets and the exploitation of services such as Samba.</p>
        <p>Traffic collection was performed in a controlled virtualized environment using Tcpdump for packet
capture and CICFlowMeter for feature extraction, encompassing both normal and malicious traffic
against various SDN layers.</p>
        <p>The testbed’s network architecture consists of four virtual machines: a Kali Linux attacker
virtual machine, an ONOS<sup>8</sup>-based SDN controller, a Mininet<sup>9</sup> emulator with Open vSwitch<sup>10</sup> and a
Metasploitable-2 machine for vulnerability testing, as shown in the network
architecture scheme in Figure 2. Over 80 features were initially extracted and later reduced to 48 SDN-specific
metrics.</p>
        <p>The dataset is publicly available, supporting different research needs. It was tested using
common machine learning algorithms: Decision Tree, Random Forests, AdaBoost, K-Nearest
Neighbor, Naive Bayes, Support Vector Machines and Multi-Layer Perceptrons. These models have
demonstrated their effectiveness in detecting common attacks such as DDoS and probing, although
challenges remain in identifying complex patterns such as User-to-Root (U2R) attacks.</p>
        <p><sup>3</sup> Documentation available at https://github.com/NewEraCracker/LOIC/releases.
<sup>4</sup> Documentation available at https://www.kali.org/tools/hydra/.
<sup>5</sup> Documentation available at https://sqlmap.org/.
<sup>6</sup> Documentation available at https://github.com/gkbrk/slowloris.
<sup>7</sup> Documentation available at https://github.com/Karlheinzniebuhr/torshammer.
<sup>8</sup> Open Network Operating System, documentation available at https://opennetworking.org/onos/.
<sup>9</sup> Virtual network simulator available at https://mininet.org/.
<sup>10</sup> Multilayer virtual switch available at https://www.openvswitch.org/.</p>
        <p>In conclusion, the InSDN dataset is a robust and diverse resource for advancing SDN security research.
It captures a wide range of attack types, including both volumetric and application-layer DDoS
attacks, as well as other significant threats such as password guessing, botnet activities and web attacks.
The virtualized environment provides realistic traffic conditions, with a rich set of features
extracted to support detailed analysis. Its public availability enhances accessibility, making it a
valuable resource for machine learning research. Evaluations with different machine learning models,
such as Random Forest and AdaBoost, demonstrate the dataset’s capability in detecting common threats, but
some aspects require further investigation, such as the detection of complex patterns like U2R exploitation.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Novaes et al. (NCLP) Dataset</title>
        <p>
          The dataset proposed by NCLP[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] deals with the security problems of Software-Defined Networking,
with emphasis on both DDoS and Portscan attacks. The dataset provides a valuable base for SDN
security research, where various attack scenarios and methodologies are addressed. It encompasses
many types of DDoS attacks, including UDP, SYN, TFTP, DNS, NTP floods, WebDDoS and Apache
remote memory exhaustion attacks. In addition to DDoS, Portscan attacks are conducted, simulating
attempts to identify and exploit open ports.
        </p>
        <p>The dataset simulates high- and low-intensity DDoS attacks, using the Scapy tool to generate traffic with
different rates and durations in an attempt to reflect real-world conditions. Traffic was collected in a Mininet
environment over two days: normal traffic was captured on the first day, while the second day included
mixed traffic, analyzed in one-second intervals to provide sufficient granularity.</p>
        <p>The Floodlight SDN controller, an OpenFlow-compliant platform, managed the network, collecting flow
and port statistics for analysis. The network topology consisted of a star with six switches and
120 hosts, as shown in Figure 2, and was intended to approximate a real SDN deployment.</p>
        <p>Six features were captured, including bits per second, packets per second and entropy measures for
source and destination IPs and ports, in order to train machine learning models.</p>
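        <p>As an illustration of how per-second features of this kind can be derived, the following is a minimal sketch rather than the authors' extraction code. It assumes packets are available as tuples of timestamp, source IP, destination IP, source port, destination port and size in bytes; the tuple layout and the function names are hypothetical.</p>
        <preformat>
```python
import math
from collections import Counter, defaultdict


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy + 0.0  # turns -0.0 into 0.0 for single-valued windows


def window_features(packets, window=1.0):
    """Group packets into fixed-length windows and compute six features.

    Each packet is a hypothetical tuple:
    (timestamp_s, src_ip, dst_ip, src_port, dst_port, size_bytes).
    """
    buckets = defaultdict(list)
    for pkt in packets:
        buckets[int(pkt[0] // window)].append(pkt)

    rows = []
    for idx in sorted(buckets):
        pkts = buckets[idx]
        rows.append({
            "window": idx,
            "bits_per_s": 8 * sum(p[5] for p in pkts) / window,
            "packets_per_s": len(pkts) / window,
            "h_src_ip": shannon_entropy(p[1] for p in pkts),
            "h_dst_ip": shannon_entropy(p[2] for p in pkts),
            "h_src_port": shannon_entropy(p[3] for p in pkts),
            "h_dst_port": shannon_entropy(p[4] for p in pkts),
        })
    return rows


if __name__ == "__main__":
    sample = [
        (0.1, "10.0.0.1", "10.0.0.9", 1000, 80, 100),
        (0.5, "10.0.0.2", "10.0.0.9", 1001, 80, 100),
        (1.2, "10.0.0.1", "10.0.0.9", 1002, 80, 50),
    ]
    for row in window_features(sample):
        print(row)
```
        </preformat>
        <p>In this sketch, a window in which all packets share one destination yields a destination entropy of zero, while evenly spread sources yield a higher source entropy.</p>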
        <p>The dataset is publicly available, which increases its utility. The authors evaluated the dataset
using common machine learning models as well as a novel LSTM-Fuzzy Logic model. The latter
outperformed traditional algorithms such as k-Nearest Neighbors, Support Vector Machines and Multi-Layer
Perceptrons in terms of accuracy, both in predicting normal traffic and in detecting anomalies,
making the dataset a good resource for developing security
models for SDN environments.</p>
        <p>This dataset offers a good basis for SDN security research, addressing various DDoS attack types,
including volumetric and application-based attacks, as well as Portscan scenarios. By simulating high- and
low-intensity attacks in a Mininet environment, the authors provide traffic data that is
suitable for machine learning applications. The star topology, controlled by a single
Floodlight SDN controller, supports scalability and applicability to real-world conditions.
Six features enable the traffic analysis, while the accessibility of the dataset extends its utility. The
advanced models used, such as the LSTM-Fuzzy Logic model, demonstrate its potential for detecting
anomalies and predicting normal traffic, making it a valuable benchmark for developing resilient SDN security
mechanisms.</p>
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Yungaicela et al.’s (Y-NV-RP-DJM-C) SDN-SlowRate DDoS Dataset</title>
        <p>The dataset[8] was published in 2022. It was designed for deep learning systems which deal with
application layer attacks.</p>
        <p>The research is based on a real data center topology managed by the ONOS SDN controller, with a focus
on application-layer, slow-rate DDoS attacks. The slow HTTP read attack is the primary DDoS
variant considered; it is designed to deplete server resources by opening long-duration connections
without triggering timeouts.</p>
        <p>Although it is innovative in concentrating on slow-rate attacks, the dataset does not take into
consideration high-rate and volumetric DDoS cases. Traffic capture was performed on a physical testbed
representing an SDN-enabled infrastructure and consisting of HP Aruba and NEC hardware. Traffic
data was captured with CICFlowMeter, with detailed packet-level and flow-level statistics.
Legitimate traffic sources consisted of FTP and video streaming, which also allowed the evaluation of false
positives and system performance. The network architecture aims to reproduce a realistic data center
with a spine-leaf topology and real appliances for traffic simulation and generation. The ONOS controller
provides centralized management and supports IDS and IPS integration, as shown in Figure 3.
The dataset is available online and provides 13 features for each instance. The authors applied an
LSTM model to evaluate performance and demonstrated the dataset's suitability for identifying
slow-rate DDoS attacks in SDN.</p>
        <p>In conclusion, the SDN-SlowRate-DDoS dataset aims to fill a gap in SDN security by analyzing
slow-rate DDoS attacks. Its realistic configuration, managed by the ONOS controller, supports its applicability.
Moreover, the feature set of 13 flow metrics provides a good basis for testing IDS systems. However, its
focus on slow-rate attacks limits its applicability to other DDoS scenarios, leaving room for
future enhancements.</p>
      </sec>
      <sec id="sec-2-6">
        <title>2.6. Aladaileh et al. (AAHHBA) Dataset</title>
        <p>This dataset[9] was published in 2022 and was obtained via a POX controller and 64 hosts in a linear
topology. It targets the SDN domain with low-rate and high-rate DDoS attacks. The dataset considers
different DDoS attack configurations, namely single or multiple attackers
targeting one or more victims.</p>
        <p>The traffic rates are defined as low rate (5 packets per second) and high rate (33 packets per
second) to simulate various attack intensities. The dataset was generated in Mininet, which emulates
the network topology, with a POX controller managing SDN activities and Kali Linux
used to deploy attack traffic.</p>
        <p>Traffic data is sampled during 60-minute simulations at 5-second intervals across eight attack conditions.
Although the paper is innovative in its use of mixed traffic rates, the dataset contains only seven
features, among them the source and destination IP addresses used to calculate entropy values that capture packet
randomness over time. The network topology includes a single SDN controller, OpenFlow
switches and 64 hosts, providing flexibility in simulating diverse attack configurations.
The study does not account for attacks other than DDoS and relies on statistical
methods such as Rényi joint entropy for its analysis, with the future possibility of including machine learning
models for rule generation; the dataset is not open source. Its limited
feature set and focus on entropy leave room for further enhancement.
The paper provides a dataset focused on low-rate and high-rate DDoS attacks in SDN environments,
based on a simulated network topology using Mininet and a POX controller. The schematic of the
network architecture is depicted in Figure 3.</p>
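        <p>To make the entropy-based analysis more concrete, the following minimal sketch computes the Rényi entropy of order alpha, defined as log2 of the sum of p_i^alpha divided by (1 - alpha), over the joint distribution of source and destination address pairs in one sampling window. It is an illustration under stated assumptions: the exact formulation, order and thresholds used in the cited paper are not reproduced here, and the function names and the default alpha = 2 are hypothetical choices.</p>
        <preformat>
```python
import math
from collections import Counter


def renyi_entropy(events, alpha=2.0):
    """Renyi entropy (in bits) of order alpha for the empirical distribution of events."""
    if alpha == 1.0:
        raise ValueError("alpha = 1 is the Shannon limit; choose another order")
    counts = Counter(events)
    total = sum(counts.values())
    power_sum = sum((c / total) ** alpha for c in counts.values())
    return math.log2(power_sum) / (1.0 - alpha) + 0.0  # +0.0 avoids -0.0


def joint_renyi_entropy(pairs, alpha=2.0):
    """Joint entropy over (src_ip, dst_ip) pairs observed in one sampling window."""
    return renyi_entropy(((src, dst) for src, dst in pairs), alpha)


if __name__ == "__main__":
    # Four distinct sources sending to one destination: uniform over four pairs.
    spread = [("10.0.0.%d" % i, "10.0.0.64") for i in range(1, 5)]
    # One source repeatedly sending to one destination: a single pair.
    concentrated = [("10.0.0.1", "10.0.0.64")] * 4
    print(joint_renyi_entropy(spread))
    print(joint_renyi_entropy(concentrated))
```
        </preformat>
        <p>In this sketch the joint entropy approaches zero as traffic concentrates on a few address pairs and grows as the pairs become more evenly spread.</p>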
        <p>It demonstrates various attack scenarios, from a single attacker to multiple ones, and categorizes traffic
based on its intensity. Traffic data collection and the entropy-based analysis provide insights into
packet randomness over time. Nevertheless, the dataset relies on only seven features, and the
exclusion of some attacks, such as TCP and ICMP floods, limits its application. Public availability would
be needed to ensure its utility for scientific research.</p>
        <p>Despite this, the innovative aspects of this paper, namely the focus on statistical methodology and
the potential integration with machine learning, make it useful for further research in SDN security.</p>
      </sec>
      <sec id="sec-2-7">
        <title>2.7. Ahuja et al. (ASMK) Dataset</title>
        <p>The article by ASMK[10] deals with DDoS attack detection in Software-Defined Networking (SDN) through
automated machine learning-based detection to enhance security. The research proposes a
machine learning-based framework for DDoS detection in SDN architectures, with attention to real-time
responsiveness, efficiency and scalability for SDN controller security.</p>
        <p>It primarily targets volumetric DDoS attacks that exhaust controller resources, including TCP SYN, UDP
and ICMP flood attacks. The framework simulates both high-rate and moderate-rate attack scenarios,
emulating realistic volumetric traffic patterns.</p>
        <p>Traffic was emulated in a Mininet-based SDN testbed that generated both normal and malicious
traffic. Tools such as Mininet, hping3 and Wireshark supported traffic generation as well as
traffic analysis.</p>
        <p>The networking infrastructure consisted of a centralized tree topology with OpenFlow-switched nodes
and a single SDN controller, emulating real-world exposure to controller-directed attacks. The
network architecture is described in Figure 4. The study used the POX controller for traffic monitoring
and data plane interaction, extracting 23 features such as packet sizes, flow durations and
protocol-specific statistics to support machine learning model training.</p>
        <p>Evaluations were performed with Decision Tree, Random Forest and Neural Network models, with Random Forest
performing best at classifying benign versus malicious flows. Although the dataset includes synthetic and
simulated traffic for complete analysis, it is not publicly released. This limitation, along with the
framework's exclusive focus on DDoS attacks, highlights potential areas for future research,
particularly in broadening applicability to more general network anomalies.</p>
        <p>In conclusion, the study offers a structured approach to DDoS detection in SDN, emphasizing scalability
and efficiency. The focus on SDN-specific attack scenarios and machine learning applications provides
valuable insights, although the dataset’s unavailability and limited attack diversity suggest opportunities
for further research and dataset expansion.</p>
      </sec>
      <sec id="sec-2-8">
        <title>2.8. High and Low-Rates Dataset-Based DDoS Attacks Against SDN (HLD-DDoSDN) Dataset</title>
        <p>This study[10] proposes a benchmark dataset, HLD-DDoSDN, to support machine learning-based
mitigation of Distributed Denial of Service (DDoS) attacks on SDN platforms and to evaluate
detection performance.</p>
        <p>The HLD-DDoSDN dataset is designed to evaluate DDoS attack detection in SDN
environments by simulating realistic traffic conditions. The dataset focuses on DDoS attack typologies,
including TCP SYN, UDP and ICMP floods, at both high rates (33.33 packets/second) and low rates
(5 packets/second) to emulate both aggressive and stealthy strategies. However, it exclusively targets
DDoS flooding attacks, without addressing other attacks. Traffic was collected using a Mininet-based
SDN testbed with a linear topology, incorporating a single POX SDN controller, an OpenFlow vSwitch
and 64 hosts, as shown in Figure 4.</p>
        <p>Attack traffic was generated with spoofed IPs and randomized source ports, while normal traffic was
simulated concurrently. Scapy was used to craft attack packets, Wireshark captured traffic for
analysis and CICFlowMeter extracted 71 traffic features, including packet size distributions and inter-arrival
times, to enable robust analysis.</p>
        <p>The labeled dataset, adhering to benchmark standards and being made publicly available, provides a
valuable source of data for SDN security research. Machine learning evaluations employed a Deep
Multilayer Perceptron (D-MLP) model with high detection accuracy for binary and multiclass classification
tasks. The performance metrics highlight the model’s robustness, reinforcing the dataset’s utility for
developing and testing DDoS detection systems in SDN.</p>
        <p>In summary, this paper contributes to SDN security research by addressing existing dataset
limitations and introducing a detection framework. Future research can expand the dataset with more
attack vectors and test scalability on more complex SDN environments.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Network Architecture Complexity</title>
      <p>In evaluating datasets for intrusion detection systems in Software-Defined Networking, it is important to
highlight that network features and attack plans alone are insufficient for a comprehensive assessment.
To build a solid evaluation framework, the complexity of the network architecture should also be
taken into account[11].
While attack categories and network features give valuable insight into the nature of attacks and traffic
behavior, they fail to capture the scalability and operational problems of real-world SDN
deployments. The architectural aspects also play a crucial role in accurately simulating and evaluating
the network’s reaction to attacks, particularly with regard to ML-based IDS models[12].
Appendix A provides an overview of the schematics of the network architectures employed in the
reviewed papers. Table 2 provides an overview for assessing the complexity of the network architectures
described in the datasets. It facilitates comparative evaluations of architectural sophistication, focusing
on structural and operational characteristics critical to scalability and realism.
The SDN controller is fundamental to the architecture, enabling the separation of the control and
data planes, centralizing management and providing essential data for machine learning tasks such as
traffic optimization and anomaly detection. Without a controller, the lack of centralized data access
significantly hinders ML applications[13], and the network becomes comparable to a traditional one. For this
reason, studies underscore the controller’s pivotal role in enabling data-driven approaches for SDN
management[14].</p>
      <p>The number of Open vSwitch (OVS) instances reflects a network’s scalability and complexity. While
single-switch setups mimic traditional networks, they fail to represent real-world SDN scenarios.
Multiple OVS instances allow more realistic evaluations of dynamic network conditions[13], essential
for training ML models.</p>
      <p>Tools like Mininet frequently adopt such setups for comprehensive SDN simulations[14].
The nature of the environment, whether virtualized or hybrid with physical devices, determines the
realism of network simulations. Virtual systems are efficient and cost-saving for prototyping but lack
the dynamics of real-world setups, where hardware heterogeneity and latency play a vital
role[15].</p>
      <p>The number of hosts is another critical parameter, as hosts generate the traffic patterns essential for
network simulations. More hosts produce more diverse and realistic traffic patterns, which are beneficial for
ML training and testing[13], while fewer hosts can result in oversimplification. Studies highlight the
importance of varying host densities to reproduce real-world environments and train efficient
ML models[16].</p>
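      <p>The parameters discussed in this section (controller, switch count, environment and host count) can be recorded in a small structure for side-by-side comparison. The following minimal sketch is only illustrative and does not reproduce Table 2: it uses the values stated in Section 2 for five of the datasets, marks unspecified values as None, and the class, field and function names are hypothetical.</p>
      <preformat>
```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Testbed:
    dataset: str
    controller: str
    environment: str          # "virtual" or "physical", as described in Section 2
    switches: Optional[int]   # None when Section 2 gives no explicit count
    hosts: Optional[int]      # None when Section 2 gives no explicit count


# Values taken from the descriptions in Section 2 of this paper.
TESTBEDS = [
    Testbed("ZCA", "POX", "virtual", None, 4 * 20),   # four subnets of 20 hosts
    Testbed("NCLP", "Floodlight", "virtual", 6, 120),
    Testbed("SDN-SlowRate-DDoS", "ONOS", "physical", None, None),
    Testbed("AAHHBA", "POX", "virtual", None, 64),
    Testbed("HLD-DDoSDN", "POX", "virtual", None, 64),
]


def rank_by_hosts(testbeds):
    """Order testbeds with a known host count from largest to smallest."""
    known = [t for t in testbeds if t.hosts is not None]
    return sorted(known, key=lambda t: t.hosts, reverse=True)


if __name__ == "__main__":
    for t in rank_by_hosts(TESTBEDS):
        print(t.dataset, t.controller, t.environment, t.hosts)
```
      </preformat>
      <p>Host count is only one of the dimensions discussed above; a fuller comparison would also weigh the environment type and the number of switches.</p>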
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>This section examines the strengths and weaknesses of current datasets for intrusion detection in
SDN environments, highlighting their innovations and limitations. Subsection 4.1 focuses on
the strengths of existing datasets, including their ability to emulate realistic traffic patterns, cover
different attack types and integrate the wide feature sets essential for training ML models. Subsection
4.2 addresses their weaknesses, such as limited attack variability, the use of synthetic
traffic and the lack of real-time features, which limit their usability in complex and dynamic SDN scenarios.
Finally, subsection 4.3 provides a roadmap for designing new-generation datasets to overcome
these challenges and support security research.</p>
      <sec id="sec-4-1">
        <title>4.1. Strength Points</title>
        <p>
          The importance of realistic datasets in training effective intrusion detection systems (IDS) for
Software-Defined Networking (SDN) cannot be overstated. Realistic datasets mimicking real-world SDN traffic
scenarios, including high-rate and low-rate DDoS attacks, are important in the development of detection
systems that can operate in dynamic and complex environments[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. For instance, datasets like InSDN
cover a wide range of attack types, which makes them versatile and suitable for testing a variety of
detection methods.
        </p>
        <p>Richness in features is another important trait. Datasets such as NSJ and InSDN provide rich
data that plays a significant role in training robust machine learning models. Rich features allow
models to efficiently recognize and react to complex attack patterns[17].</p>
        <p>
          Public access is another important aspect. Openly available datasets like HLD-DDoSDN and InSDN
facilitate community-based improvements and
the reproducibility of results across different research settings[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>Additionally, many datasets integrate with modern SDN controllers like POX, ONOS and Floodlight,
ensuring practical relevance to real-world SDN architectures[16].</p>
        <p>By combining realistic traffic simulations, broad attack coverage and rich feature sets, these datasets play a
significant role in advancing the effectiveness of ML-driven IDS in SDN environments.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Weaknesses</title>
        <p>Most existing datasets, for instance NSJ and AAHHBA, have limited attack diversity and
cover only a small number of DDoS attack types, namely TCP SYN, UDP and
ICMP floods. They do not account for more sophisticated hybrid attacks, in which attackers combine
volumetric, protocol-based and application-layer methods, or for stealthier techniques
such as low-and-slow attacks and adversarial evasion.</p>
        <p>Consequently, these datasets are less representative of the diverse threat landscape encountered in
real-world networks. Narrow feature sets are another issue, with some datasets offering only
minimal statistical data. This lack of feature richness restricts the ability to explore correlations
between traffic patterns and anomalies, limiting the effectiveness of ML models. As a result, models trained on these
datasets are less adaptable and may not perform well in detecting complex or subtle attack patterns.</p>
        <p>Many datasets are also predominantly focused on DDoS attacks, ignoring other critical security threats
in SDN environments, such as attacks directed at the controller. This narrow focus leads to a blind spot
in detection systems trained on such datasets, leaving them vulnerable to other types of intrusions.</p>
        <p>The reliance on synthetic traffic generated in emulated environments like Mininet presents another
challenge. While these environments offer control and repeatability, they often fail to replicate the
unpredictable dynamics of real-world networks, such as heterogeneous device interactions, varying
user behaviors and external factors like latency and hardware failures. As a result, machine learning
models trained on these datasets may not generalize well to actual production environments.</p>
        <p>Dataset variety and volume are also a concern, because some datasets contain too little traffic
variability or volume. Datasets collected over a short period or from a single testbed cannot capture the
long-term patterns and varied scenarios on which realistic model training should be based.
Additionally, the absence of device diversity further reduces the relevance of these datasets for
dynamic and global SDN settings.</p>
        <p>Moreover, the majority of datasets lack the attributes needed to train real-time attack
detection systems, such as timestamps, inter-arrival times and flow state transitions, which
are essential for time-sensitive anomaly detection. The absence of such features precludes
the development of IDSs that can respond in real time, which is critical in high-risk
environments where low latency and quick responses are essential.</p>
        <p>Table 3 qualitatively maps the characteristics discussed above to the datasets
analyzed in this paper. The colors used in the table have the following meanings:</p>
        <p>• green: represents "yes" for binary classification and "high" for quantitative features.</p>
        <p>• yellow: indicates an intermediate value between high and low intervals.</p>
        <p>• red: denotes "no" for binary classification and "low" for quantitative features.</p>
        <p>Among the analyzed datasets, Y-NV-RP-DJ-M-C (corresponding to the fourth column)
scores best overall, as indicated by the predominance of green boxes. This dataset
contains only two red boxes, indicating the absence of attack types beyond DDoS attacks and the lack
of real-time statistics due to insufficient information on resource usage.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Roadmap for the Generation of a New Dataset</title>
        <p>To address the limitations of existing datasets, a comprehensive dataset should cover a broader
range of attacks, including hybrid DDoS, ransomware and zero-day exploits, to reflect evolving threats.
It should also enrich feature sets with temporal trends, entropy measurements and real-time signals to
support dynamic detection methods.</p>
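        <p>As an illustration of the entropy measurements mentioned above, the following minimal Python sketch computes the Shannon entropy of destination IP addresses over fixed time windows. It is an assumption-laden example, not a method taken from any of the surveyed datasets: the function names, the one-second window and the toy trace are all hypothetical, and Shannon entropy is used here only as the simplest representative of the entropy family.</p>
        <preformat>
```python
import math
from collections import Counter


def shannon_entropy(values):
    """Return the Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p * log2(1 / p) summed over all observed symbols
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def windowed_entropy(packets, window_seconds):
    """Group (timestamp, dst_ip) pairs into fixed windows and compute entropy per window.

    Returns a dict mapping window index to destination-IP entropy.
    """
    windows = {}
    for timestamp, dst_ip in packets:
        index = int(timestamp // window_seconds)
        windows.setdefault(index, []).append(dst_ip)
    return {index: shannon_entropy(ips) for index, ips in sorted(windows.items())}


# Hypothetical toy trace: (timestamp in seconds, destination IP).
packets = [
    (0.1, "10.0.0.1"), (0.4, "10.0.0.2"), (0.7, "10.0.0.3"), (0.9, "10.0.0.4"),
    (1.1, "10.0.0.9"), (1.2, "10.0.0.9"), (1.3, "10.0.0.9"), (1.4, "10.0.0.9"),
]
entropy_by_window = windowed_entropy(packets, window_seconds=1.0)
print(entropy_by_window)
```
        </preformat>
        <p>In this toy trace the first window spreads traffic evenly over four destinations, while the second concentrates it on a single one, so the entropy drops. A sudden drop of this kind is the sort of signal an entropy feature could expose to a detector, although suitable window sizes and thresholds would have to be determined empirically.</p>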
        <p>Traffic capture across different SDN topologies and controller platforms, such as POX, ONOS and Ryu,
can record diverse network configurations, while the combination of synthetic and real-world traffic
ensures practical relevance along with scalability.</p>
        <p>Detailed temporal annotations, such as timestamps and inter-arrival times, enable real-time anomaly
detection. Additionally, datasets should reflect large-scale networks for scalability testing and be made
publicly accessible with clear documentation and preprocessing tools.</p>
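        <p>To make the temporal annotations mentioned above concrete, the following minimal Python sketch derives inter-arrival times and a few summary statistics from the packet timestamps of a single flow. The function and feature names are hypothetical and chosen only for illustration; they do not correspond to the feature schema of any dataset analyzed in this paper.</p>
        <preformat>
```python
from statistics import mean, pstdev


def inter_arrival_times(timestamps):
    """Return the gaps between consecutive packet timestamps, after sorting."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def flow_timing_features(timestamps):
    """Summarize one flow's timing as a small feature dict (names are illustrative)."""
    gaps = inter_arrival_times(timestamps)
    if not gaps:
        # Zero or one packet: no gaps can be measured.
        return {"packet_count": len(timestamps), "iat_mean": 0.0, "iat_std": 0.0, "duration": 0.0}
    return {
        "packet_count": len(timestamps),
        "iat_mean": mean(gaps),
        "iat_std": pstdev(gaps),
        "duration": max(timestamps) - min(timestamps),
    }


# Hypothetical flow: packet arrival times in seconds.
features = flow_timing_features([0.0, 0.5, 1.0, 2.0])
print(features)
```
        </preformat>
        <p>Features of this kind can only be computed if per-packet or per-flow timestamps are retained in the published dataset, which is why the roadmap treats temporal annotations as a requirement rather than an optional extra.</p>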
        <p>Reducing the mismatch with real SDN architectures through novel network designs and traffic
generation methodologies will further enhance dataset applicability.</p>
        <p>By including these enhancements, a new dataset may fill in gaps and be a valuable resource for advancing
SDN security research.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>The study highlights the central role played by datasets in driving ML- and DL-based solutions for SDN
security. A critical assessment of existing datasets revealed major strengths, including
traffic simulation, extended feature sets and integration with modern SDN controllers.
Nevertheless, the paper also identifies the primary weaknesses of current datasets, summarized in Table 3. The majority of them
have poor attack diversity and cover only volumetric DDoS attacks, without addressing
emerging threats like application-layer attacks. Also, reliance on simulated environments reduces
the generalizability of models trained on these datasets to actual networks, which are
more varied and complex. The lack of temporal data and real-time annotations further restricts the
development of time-sensitive intrusion detection systems. To address these limitations, this paper
proposes a roadmap for the creation of next-generation datasets. It emphasizes hybrid attacks,
more sophisticated feature extraction, realistic data augmentation and multi-scenario traffic collection
across various topologies and controllers.</p>
      <p>Public accessibility and detailed documentation are also highlighted as essential for fostering widespread
adoption and collaborative improvement. By implementing these strategies, future datasets can better
reflect real-world network conditions, enabling the development of robust, scalable and adaptive security
mechanisms for SDN environments.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT-3.5/4.0 in order to paraphrase and
reword. After using this tool, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.</p>
      <p>[8] N. M. Yungaicela-Naula, C. Vargas-Rosales, J. A. Perez-Diaz, E. Jacob, C. Martinez-Cagnazzo, Physical Assessment of an SDN-Based Security Framework for DDoS Attack Mitigation: Introducing the SDN-SlowRate-DDoS Dataset, IEEE Access 11 (2023) 46820–46831. doi:10.1109/ACCESS.2023.3274577.</p>
      <p>[9] M. A. Aladaileh, M. Anbar, A. J. Hintaw, I. H. Hasbullah, A. A. Bahashwan, S. Al-Sarawi, Renyi Joint Entropy-Based Dynamic Threshold Approach to Detect DDoS Attacks against SDN Controller with Various Traffic Rates, Applied Sciences 12 (2022). URL: https://www.mdpi.com/2076-3417/12/12/6127. doi:10.3390/app12126127.</p>
      <p>[10] A. A. Bahashwan, M. Anbar, S. Manickam, G. Issa, M. A. Aladaileh, B. A. Alabsi, S. D. A. Rihan, HLD-DDoSDN: High and low-rates dataset-based DDoS attacks against SDN, PLOS ONE 19 (2024) 1–29. URL: https://doi.org/10.1371/journal.pone.0297548. doi:10.1371/journal.pone.0297548.</p>
      <p>[11] A. Khraisat, I. Gondal, P. Vamplew, J. Kamruzzaman, Survey of intrusion detection systems: techniques, datasets and challenges, Cybersecurity 2 (2019) 20. URL: https://doi.org/10.1186/s42400-019-0038-7. doi:10.1186/s42400-019-0038-7.</p>
      <p>[12] A. Ahmad, E. Harjula, M. Ylianttila, I. Ahmad, Evaluation of Machine Learning Techniques for Security in SDN (2020). doi:10.1109/GCWkshps50303.2020.9367477.</p>
      <p>[13] Y. Yoo, G. Yang, C. Shin, J. Lee, C. Yoo, Machine Learning-Based Prediction Models for Control Traffic in SDN Systems, IEEE Transactions on Services Computing 16 (2023) 4389–4403. doi:10.1109/TSC.2023.3324007.</p>
      <p>[14] M. Paliwal, D. Shrimankar, O. Tembhurne, Controllers in SDN: A Review Report, IEEE Access 6 (2018) 36256–36270. doi:10.1109/ACCESS.2018.2846236.</p>
      <p>[15] M. Hussain, N. Shah, R. Amin, S. S. Alshamrani, A. Alotaibi, S. M. Raza, Software-Defined Networking: Categories, Analysis, and Future Directions, Sensors 22 (2022). URL: https://www.mdpi.com/1424-8220/22/15/5551. doi:10.3390/s22155551.</p>
      <p>[16] J. Xie, F. R. Yu, T. Huang, R. Xie, J. Liu, C. Wang, Y. Liu, A Survey of Machine Learning Techniques Applied to Software Defined Networking (SDN): Research Issues and Challenges, IEEE Communications Surveys &amp; Tutorials 21 (2019) 393–430. doi:10.1109/COMST.2018.2866942.</p>
      <p>[17] M. Wang, N. Yang, Y. Guo, N. Weng, Learn-IDS: Bridging Gaps between Datasets and Learning-Based Network Intrusion Detection, Electronics 13 (2024). URL: https://www.mdpi.com/2079-9292/13/6/1072. doi:10.3390/electronics13061072.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Paper Network Architectures</title>
      <p>[Figure: network architecture diagram. Recoverable labels: Attacker (Kali Linux), OVS Switch, Mininet, Controller, Server DVWA, Metasploitable2.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Maleh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qasmaoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. E.</given-names>
            <surname>Gholami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sadqi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mounir</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey on SDN Security: Threats, mitigations, and future directions</article-title>
          ,
          <source>Journal of Reliable Intelligent Environments</source>
          <volume>8</volume>
          (
          <year>2022</year>
          )
          <fpage>29</fpage>
          -
          <lpage>48</lpage>
          . URL: https://link.springer.com/article/10.1007/s40860-022-00171-8#citeas.
          doi:10.1007/s40860-022-00171-8.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shukla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Goel</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey on DDoS detection, mitigation, and defense strategies in software-defined networks</article-title>
          ,
          <source>Cluster Computing</source>
          <volume>27</volume>
          (
          <year>2024</year>
          )
          <fpage>13129</fpage>
          -
          <lpage>13164</lpage>
          . URL: https://doi.org/10.1007/s10586-024-04596-z. doi:10.1007/s10586-024-04596-z.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Bahashwan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Anbar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Manickam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Al-Amiedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Aladaileh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Hasbullah</surname>
          </string-name>
          ,
          <article-title>A Systematic Literature Review on Machine Learning and Deep Learning Approaches for Detecting DDoS Attacks in Software-Defined Networking</article-title>
          ,
          <source>Sensors</source>
          <volume>23</volume>
          (
          <year>2023</year>
          ). URL: https://www.mdpi.com/1424-8220/23/9/4441. doi:10.3390/s23094441.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Niyaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Javaid</surname>
          </string-name>
          ,
          <article-title>A Deep Learning Based DDoS Detection System in Software-Defined Networking (SDN)</article-title>
          ,
          <source>EAI Endorsed Transactions on Security and Safety</source>
          <volume>4</volume>
          (
          <year>2017</year>
          ). doi:10.4108/eai.28-12-2017.153515.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C. B.</given-names>
            <surname>Zerbini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. F.</given-names>
            <surname>Carvalho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Abrão</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Proença</surname>
          </string-name>
          ,
          <article-title>Wavelet against random forest for anomaly mitigation in software-defined networking</article-title>
          ,
          <source>Applied Soft Computing</source>
          <volume>80</volume>
          (
          <year>2019</year>
          )
          <fpage>138</fpage>
          -
          <lpage>153</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S1568494619301115. doi:10.1016/j.asoc.2019.02.046.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.-A.</given-names>
            <surname>Le-Khac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Jurcut</surname>
          </string-name>
          ,
          <article-title>InSDN: A Novel SDN Intrusion Dataset</article-title>
          ,
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>165263</fpage>
          -
          <lpage>165284</lpage>
          . doi:10.1109/ACCESS.2020.3022633.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Novaes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. F.</given-names>
            <surname>Carvalho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lloret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Proença</surname>
          </string-name>
          ,
          <article-title>Long Short-Term Memory and Fuzzy Logic for Anomaly Detection and Mitigation in Software-Defined Network Environment</article-title>
          ,
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>83765</fpage>
          -
          <lpage>83781</lpage>
          . doi:10.1109/ACCESS.2020.2992044.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>