<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>The Quantum Bottleneck: An Analysis of Data Balancing in QML for Security</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author" equal-contrib="yes">
          <string-name>Mansur Ziiatdinov</string-name>
          <email>mansur.ziiatdinov@unime.it</email>
          <uri>https://gltronred.info/</uri>
        </contrib>
        <contrib contrib-type="author" equal-contrib="yes">
          <string-name>Salvatore Distefano</string-name>
          <email>salvatore.distefano@unime.it</email>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Quantum computing promises advantages in data analysis by mapping the input data to a higher-dimensional Hilbert space and performing computations based on the laws of physics. However, its potential is limited by noise and qubit counts, restricting the suitability of full-fledged, end-to-end quantum machine learning solutions. Proper classical data preprocessing is therefore required. As a preprocessing step, data sampling, i.e. the selection or generation of data items in imbalanced datasets, assumes strategic importance. Some undersampling techniques rely on pure random choice, while others use geometric considerations to select data items that better represent the original classes. As quantum machine learning approaches differ significantly from classical ones, it is of interest to investigate sampling techniques in both quantum and classical ML. In this study, an approach to combine classical and quantum data-driven ML workflows is proposed. Different undersampling techniques are then explored through classical and quantum ML models for cybersecurity threat classification on the well-known UNSW-NB15 network dataset taken from the literature to assess the proposed approach.</p>
      </abstract>
      <kwd-group>
        <kwd>quantum machine learning</kwd>
        <kwd>undersampling</kwd>
        <kwd>network intrusion detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivations</title>
      <p>The rising sophistication, variety, scope and range of cyber threats represent a major challenge for modern
digital infrastructures. Traditional security mechanisms, increasingly augmented by classical machine
learning (ML), strive to detect anomalies and intrusions in real time. However, classical ML models
face mounting pressure from the sheer scale of network traffic and the ever-evolving, asymmetric
nature of malicious activities. Two significant hurdles stand out: the curse of dimensionality, where the
complexity of data can overwhelm classical algorithms, and the persistent problem of class imbalance,
where anomalous or attack data is vastly outnumbered by normal traffic. This imbalance often biases
classical models towards the majority class, causing them to miss rare but critical security events.</p>
      <p>Quantum Computing offers a revolutionary computational paradigm with the potential to overcome
classical limitations. By mapping input data into an exponentially large Hilbert space, Quantum Machine
Learning (QML) promises to uncover intricate patterns that are intractable for classical systems. This
intrinsic ability to operate in high-dimensional feature spaces suggests a powerful new tool for tasks
like network intrusion detection. However, the current era of Noisy Intermediate-Scale Quantum (NISQ)
devices imposes significant practical constraints. The performance of today's quantum processors is
fundamentally limited by qubit decoherence, gate errors, and restricted qubit counts, making end-to-end,
fully-fledged quantum solutions presently unfeasible.</p>
      <p>This reality requires the development of hybrid quantum-classical models, which strategically
combine the strengths of both computational worlds. In such a workflow, classical computers handle data
preprocessing and postprocessing tasks, while the quantum processor is tasked with the core
computational kernel where its unique capabilities can provide an advantage. Within this hybrid framework, a
critical but underexplored question arises: How do classical data preparation techniques translate to
the quantum domain?</p>
      <p>Specifically, techniques like data sampling, which are fundamental for managing class imbalance in
classical ML, have an uncertain efficacy when applied to QML. Geometric assumptions that underpin
some classical sampling methods may not hold within the abstract quantum feature space. This
gap motivates the present research, which aims to bridge it by investigating the impact of a foundational sampling
technique — random undersampling — on the performance of a hybrid QML classifier for a real-world
cybersecurity task. Starting from the well-known UNSW-NB15 network intrusion detection dataset,
this paper investigates whether dataset size, balancing, and undersampling techniques are effective
strategies for cybersecurity analysis through classical and quantum ML.</p>
      <p>The contribution of this work is therefore threefold. First, to provide a methodology and algorithm for
hybrid classical-quantum data-driven machine learning workflows. Second, to explore, through a practical,
empirical analysis in the cybersecurity domain, the interplay between classical data preprocessing
and near-term quantum algorithms. Third, by assessing the performance of a QML model under
different conditions of data balance and size, to provide realistic insights into the current capabilities and
bottlenecks of classical and quantum computing, as well as their suitability to cybersecurity problems.</p>
      <p>To this purpose, the remainder of the paper is organized as follows: Section 2 discusses classical
and quantum ML, overviewing state-of-the-art solutions on data management and balancing; Section
3 proposes a hybrid classical-quantum methodology and algorithm for data-driven ML workflows
focusing on data balancing, which is then applied in Section 4 to a cybersecurity case study on a real dataset
to demonstrate its effectiveness. Final discussion and remarks in Section 5 close the
paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>
        Quantum Machine Learning (QML) is an emerging interdisciplinary field that integrates principles
from quantum mechanics and machine learning, aiming to develop algorithms that outperform classical
counterparts in data processing and analysis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This convergence of quantum physics and artificial
intelligence holds the promise of revolutionizing several fields, including healthcare, finance, and
materials science, by leveraging quantum properties such as superposition and entanglement to enhance
machine learning techniques [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. QML is attracting rising interest, as demonstrated by the volume
of published research: after 2015, in particular, scholarly articles and citations
have increased dramatically [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This highlights not only the growing recognition of QML capabilities but also the
collaborative efforts among researchers across disciplines, including physics, computer science, and
artificial intelligence [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Furthermore, bibliometric analyses reveal a dynamic interplay between QML
research and its practical applications, especially in areas contributing to technological advancement
and intellectual property, such as privacy and cybersecurity [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Despite its promising outlook, the
field of QML faces significant challenges, including limitations in current quantum hardware and
the complexity of integrating quantum algorithms with classical systems [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Issues such as qubit
decoherence, noise in quantum environments, and the need for effective data encoding techniques
remain critical barriers that researchers must address to realize the full potential of QML applications
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Additionally, the theoretical foundations of QML, although robust, require further exploration
to validate the claimed advantages of quantum algorithms over classical methods, as many existing
approaches lack formal evidence of their efficacy [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        The intersection of classical and quantum machine learning presents a compelling comparative
analysis. Recent studies have explored the performance of QML algorithms against classical ML ones
across datasets and metrics, investigating their strengths and weaknesses [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For example, in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
a comprehensive analysis involving classical and quantum ML algorithms across different subject
systems demonstrated variances in their ability to predict buggy and clean commits, as well as in other
software defect predictions. Current research on classical and quantum ML mostly aims at refining
quantum-specific solutions while evaluating their generalizability [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], based on the consideration that
foundational aspects of QML, such as quantum data encoding and the unique properties of quantum
systems, can potentially lead to more efficient processing capabilities and improved outcomes compared
to classical ML methods [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        This stresses the strategic importance of data management and preprocessing in ML, especially for
quantum ML [
        <xref ref-type="bibr" rid="ref4 ref8">4, 8</xref>
        ]. In particular, data balancing is an essential step in ML workflows, addressing the
challenges posed by class imbalance in datasets where one class significantly outnumbers another [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
Such imbalances can lead to biased predictive models that predominantly favor the majority class,
thereby misrepresenting the performance of minority classes and potentially compromising the
effectiveness of machine learning applications across fields, including healthcare, finance, and cybersecurity
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Techniques to balance datasets, such as random undersampling, random oversampling, and the
Synthetic Minority Over-sampling Technique (SMOTE), have been developed to enhance model
robustness and accuracy by ensuring fair representation of all classes [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Even in the realm of quantum
machine learning, innovative strategies have emerged that leverage quantum computing principles to
improve data balancing techniques. A notable one is QuantumSMOTE, exploiting quantum circuits
to generate synthetic data points for minority classes more efficiently than traditional oversampling
methods [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ].
      </p>
      <p>
        Despite the promise of these advancements, both classical and quantum data balancing methods
face significant challenges. For classical techniques, issues such as overfitting from oversampling and
information loss from undersampling persist, necessitating careful evaluation of performance metrics
beyond accuracy, such as precision, recall, and F1-score [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ]. In the quantum domain, obstacles
include the complexity of encoding large datasets into quantum states and the current limitations
of quantum hardware, which can impact the feasibility and reliability of implementing large-scale
quantum data balancing strategies [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ].
      </p>
      <p>Looking ahead, there is significant potential for advancing data balancing techniques through
the integration of classical and quantum computing, paving the way for improved machine learning
applications in diverse domains. Ultimately, a combined effort to enhance both classical and quantum
ML strategies may lead to more effective data-driven decision-making. This paper follows this
direction, aiming to investigate the effects of data balancing on classical and quantum ML algorithms
so as to provide insights on hybrid/combined classical-quantum techniques for data-driven ML workflows.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>[Figure 1: Flowchart of the proposed hybrid classical-quantum data-driven ML workflow. Stages: Problem Definition and Data Acquisition; Data Integration, Filtering, Cleaning, Aggregation, and Fusion; Feature Engineering (preceded by Quantum Data Encoding in the quantum loop); Data Balancing; Model Selection; Model Training; Model Evaluation and Validation; Model Interpretation and Tuning; Satisfactory decision point; Model Deployment. The workflow first runs with Paradigm = Classical, then switches to Paradigm = Quantum, resetting the feature and balancing parameters (Reset Pars).]</p>
      <p>The data-driven ML methodology proposed here is described by the algorithm shown in Figure 1,
aiming to systematically address the aforementioned data management challenges, particularly in the
context of unbalanced datasets, through the integration of classical and quantum computational paradigms.
The process starts with Problem Definition and Data Acquisition, where the specific challenge is
rigorously defined, and relevant datasets are identified and queried to select and retrieve useful data.
If such data are not sufficient for processing, raw data have to be collected through real experiments to
complement the dataset, possibly also resorting to synthetic data, as discussed below. These data
then undergo a series of critical preprocessing steps, including data Integration, Filtering, Cleaning,
Aggregation, and Fusion. During this phase, heterogeneous data streams are unified, irrelevant or noisy
data points are removed, missing values are addressed, and heterogeneous data are combined to form a
coherent, comprehensive dataset suitable for subsequent analysis.</p>
      <p>Following the initial data preparation, the workflow diverges based on the computational paradigm
selected for model development. First, a classical loop (Paradigm=Classical) is triggered, in which the
preprocessed data undergo Feature Engineering: domain expertise and analytical techniques are applied
to transform raw data into a set of informative features that enhance the discriminative power of
machine learning models. Then, Data Balancing techniques are applied to mitigate the inherent class
imbalance. This step often involves methods such as undersampling of the majority class, oversampling
of the minority class, or synthetic data generation to achieve a more equitable class distribution, which
is crucial for preventing model bias and improving the detection of rare, critical events, as detailed
below.</p>
      <p>Regardless of the chosen paradigm, the prepared data then feeds into Model Selection, where
appropriate machine learning algorithms — ranging from classical classifiers like Support Vector Machines
and Random Forests to quantum-inspired algorithms or hybrid quantum-classical models — are chosen
based on the problem characteristics and computational resources. This is followed by Model Training,
where the selected model learns the underlying patterns from the balanced and engineered datasets.
The trained model then undergoes rigorous Model Evaluation and Validation using appropriate
performance metrics (e.g., precision, recall, F1-score, accuracy, ROC curves) to assess its effectiveness and
generalization capabilities on unseen data.</p>
      <p>An iterative feedback loop is embedded through Model Interpretation and Tuning. Here, the model
decisions are analyzed for insights, and hyperparameters are optimized to enhance performance. A
Satisfactory decision point then determines whether the model performance meets predefined criteria.
If unsatisfactory, the process loops back to earlier stages to re-evaluate and re-apply feature
engineering and balancing strategies. Once the classical model performance analysis on feature and
balancing parameters is satisfactory, based on these experiments and results, the quantum loop is triggered
by switching the paradigm (Paradigm=Quantum) and resetting those parameters (Reset Pars).</p>
      <p>The quantum workflow loop is substantially the same as the classical one, except for the data encoding
stage required to encode classical data into qubits. The Quantum Data Encoding and Feature Engineering
step, indeed, involves mapping classical data features onto quantum states, which are then prepared on
qubits to be processed by quantum circuits. This encoding is a critical interface between classical data
and quantum computation, influencing how information is represented and manipulated in the Hilbert
space. While nascent, quantum feature engineering aims to leverage quantum effects to extract novel,
potentially more expressive features than classical methods. The quantum loop thus proceeds with
quantum model training, evaluation, validation, interpretation and tuning, then iterating on different
dataset features and balances for a parametric analysis.</p>
      <p>Finally, once both classical and quantum loops are done, the workflow proceeds to Model Deployment,
where the validated model is integrated into operational systems for live inferences, marking the
culmination of the data-driven ML workflow.</p>
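      <p>To make the control flow concrete, the following schematic sketch mirrors the loop structure of Figure 1. It is an illustration only: every callable passed in (engineer, balance, train, score) is a hypothetical stand-in for the stages described above, not code from this paper.</p>
      <preformat>
# Schematic, runnable sketch of the Figure 1 control flow; all callables are
# hypothetical stand-ins for the workflow stages, not the authors' code.
from typing import Callable

def hybrid_workflow(data,
                    engineer: Callable, balance: Callable,
                    train: Callable, score: Callable,
                    threshold: float = 0.9, max_iters: int = 10):
    models = {}
    for paradigm in ("classical", "quantum"):
        # Reset Pars: feature and balancing parameters restart per paradigm
        params = {"features": "all", "samples_per_class": 300}
        for _ in range(max_iters):
            feats = engineer(data, paradigm, params)  # + encoding if quantum
            balanced = balance(feats, params)         # e.g. undersampling
            model = train(balanced, paradigm)         # selection and training
            if score(model) >= threshold:             # Satisfactory?
                break
            params["samples_per_class"] += 100        # interpretation/tuning
        models[paradigm] = model
    return models                                     # ready for deployment
      </preformat>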
      <sec id="sec-3-1">
        <title>3.1. Feature Engineering and Data Balancing</title>
        <p>Focusing on classification, a classification problem can be formally defined as follows. Consider a set $X$
of $N$ tuples in a $d$-dimensional space: $X = \{\mathbf{x}_1, \dots, \mathbf{x}_N\} \in \mathbb{R}^{N \times d}$, where $\mathbf{x}_i = (x_{i,1}, \dots, x_{i,d}) \in \mathbb{R}^d$
and thus $x_{i,j} \in \mathbb{R}$ with $i = 1, \dots, N$, $j = 1, \dots, d$, and each data point $\mathbf{x}_i \in X$ is labeled by
the corresponding class $y_i \in \{1, \dots, C\} \subset \mathbb{N}$. The classification problem asks to predict the class label $\hat{y} \in
\{1, \dots, C\} \subset \mathbb{N}$ for a new, previously unseen data point $\hat{\mathbf{x}} \in \mathbb{R}^d$. If there are only two classes
(i.e. $C = 2$), the problem is called a binary classification problem.</p>
        <p>The set of data points belonging to a particular class $c \in \{1, \dots, C\} \subset \mathbb{N}$ is denoted by $X(c) = \{\mathbf{x}_i \in
X \mid y_i = c\}$. Therefore, $X(a) \cap X(b) = \emptyset$ $\forall a, b \in \{1, \dots, C\} \subset \mathbb{N}$ with $a \neq b$, and $\bigcup_{c=1}^{C} X(c) = X$. The
dataset is called balanced if different classes have a similar number of data points: $\forall a \neq b : |X(a)| \approx
|X(b)|$. If the dataset is imbalanced, the smallest class is called the minority class and the largest class is
called the majority class.</p>
        <p>There are different approaches to balancing datasets. First, one can generate additional data
points of the minority class (so-called oversampling), thus increasing its size; second, one can
choose a subset $X'(c) \subset X(c)$ for each non-minority class (so-called undersampling).</p>
        <p>This study focuses on the latter approach, since current quantum models have limitations in managing
large amounts of data. The random undersampling method simply selects a random subset of each
class. This method allows setting the required number of samples in each class independently, thus
controlling the balance of the resulting dataset. In this work, the NearMiss [18] algorithm is also adopted as
a heuristic algorithm, since it includes geometric criteria in the selection process. Let positive data
items be the items belonging to the class to be undersampled and negative data items be the items from
the minority class. NearMiss-3 (version 3 of the NearMiss family of algorithms) works in two steps: first,
for each negative data item, its nearest neighbors among the positive items are kept; then, the positive data items with the
largest average distance to these nearest neighbors are selected.</p>
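        <p>As a minimal sketch, the two samplers can be instantiated with the imbalanced-learn library (an assumption: the paper does not name its implementation, but imblearn provides both methods with this interface):</p>
        <preformat>
# Random undersampling vs. NearMiss-3 on toy imbalanced data; imblearn is an
# assumed implementation, not confirmed by the paper.
import numpy as np
from imblearn.under_sampling import NearMiss, RandomUnderSampler

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (5000, 32)),   # majority class
               rng.normal(1.0, 1.0, (300, 32))])   # minority class
y = np.array([1] * 5000 + [0] * 300)

# Random undersampling: pick a fixed number of points per class at random
rus = RandomUnderSampler(sampling_strategy={0: 300, 1: 300}, random_state=0)
X_rus, y_rus = rus.fit_resample(X, y)

# NearMiss-3: keep neighbors of each minority (negative) item, then select
# the majority (positive) items farthest on average from those neighbors
nm3 = NearMiss(version=3, n_neighbors_ver3=3)
X_nm3, y_nm3 = nm3.fit_resample(X, y)
print(X_rus.shape, X_nm3.shape)                    # roughly balanced subsets
        </preformat>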
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Quantum Data Encoding and Model Training</title>
        <p>As outlined before, the key difference between the classical and quantum paradigms is the data encoding
step of the workflow. This study adopts a heuristic real amplitudes encoding $|\psi_{\mathrm{qra}}(\mathbf{x})\rangle$ from the
Qiskit circuit library [19], denoted here as QRA. This technique involves several layers of angle encoding
interspersed with entanglement layers. Since it is difficult to express the encoded state in closed form,
the corresponding circuit is described. Let the number $n$ of qubits and the number of layers $L$ be such
that $nL = d$, where $d$ is the number of features. Then</p>
        <p>$$|\psi_{\mathrm{qra}}(\mathbf{x})\rangle = U_1(\mathbf{x})\, V\, U_2(\mathbf{x})\, V \times \cdots \times V\, U_L(\mathbf{x})\, |0\rangle,$$
where the encoding layer $U_l(\mathbf{x}) = \bigotimes_{q=1}^{n} R_Y(x_{i,(l-1)n+q})$ rotates the qubits to encode a block of the features (i.e.
$x_{i,(l-1)n+1}, x_{i,(l-1)n+2}, \dots, x_{i,(l-1)n+n}$), and the entanglement layer $V$ performs CNOT gates between
the $q$-th and $(q+1)$-th qubits for $q \in \{n-2, n-3, \dots, 1, 0\}$. See Figure 2 for an example.</p>
        <p>[Figure 2: example of the QRA encoding circuit.]</p>
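        <p>A minimal sketch of this circuit follows, assuming (beyond the cited Qiskit circuit library) that QRA-$n$-$L$ corresponds to Qiskit's RealAmplitudes circuit with num_qubits $= n$ and reps $= L - 1$, whose default reverse-linear entanglement matches the CNOT pattern above:</p>
        <preformat>
# Building the QRA encoding circuit; the RealAmplitudes mapping is an
# assumption consistent with n*L = 32 features (QRA-8-4 shown here).
import numpy as np
from qiskit.circuit.library import RealAmplitudes

n_qubits, n_layers = 8, 4                 # QRA-8-4: 8 qubits x 4 layers
ansatz = RealAmplitudes(num_qubits=n_qubits, reps=n_layers - 1)
assert ansatz.num_parameters == n_qubits * n_layers   # 32 rotation angles

x = np.random.rand(32)                    # one scaled 32-feature data point
encoding = ansatz.assign_parameters(x)    # circuit preparing |psi_qra(x)>
print(encoding.decompose().draw(fold=120))
        </preformat>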
        <p>After the encoding step, a QML model can be trained. Schuld and Petruccione [20] show that all
supervised QML models can be formulated as quantum kernel methods, so let us focus on them. The
quantum kernel method uses a quantum computer to compute the entries $K(\mathbf{x}_i, \mathbf{x}_j) = |\langle \psi(\mathbf{x}_i) | \psi(\mathbf{x}_j) \rangle|^2$ of
the kernel matrix. Once the entries are calculated, a support vector classifier (SVC) is trained by solving
the convex (dual) optimization problem:</p>
        <p>$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j),$$</p>
        <p>subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$, with $C$ the SVC regularization constant.</p>
        <p>The assessment of classification is performed with standard metrics like accuracy and the $F_1$ score
(see, e.g., [21]). Let $TP$, $TN$, $FP$, $FN$ denote, respectively, the number of true positive, true negative,
false positive, and false negative predictions. The accuracy $A$ is defined as</p>
        <p>$$A = \frac{TP + TN}{TP + TN + FP + FN}.$$</p>
        <p>The $F_1$ score is the harmonic mean of precision $\mathrm{prec}$ and recall $\mathrm{recall}$:</p>
        <p>$$F_1 = 2 \cdot \frac{\mathrm{prec} \cdot \mathrm{recall}}{\mathrm{prec} + \mathrm{recall}},$$
where $\mathrm{prec} = TP/(TP + FP)$ and $\mathrm{recall} = TP/(TP + FN)$.</p>
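        <p>For example, with $TP = 80$, $TN = 90$, $FP = 20$, $FN = 10$, the accuracy is $A = 170/200 = 0.85$, $\mathrm{prec} = 80/100 = 0.80$, $\mathrm{recall} = 80/90 \approx 0.89$, and $F_1 \approx 0.84$.</p>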
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Case Study</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset and Testbed</title>
        <p>To demonstrate the effectiveness of the proposed approach, the well-known UNSW-NB15 dataset [22, 23]
is taken as a benchmark. It targets network intrusion detection systems: each dataset
item describes a network packet. The UNSW-NB15 dataset has been cleaned from contaminant
features in [24]. After cleaning and encoding the categorical features as ordinals, the training dataset
contains 175 341 data items with 32 numeric features, while the test dataset contains 82 332 data items
with the same features. Each data item is associated with two labels: first, it is labeled as a “normal”
packet or as an “attack” packet; second, it is further labeled by different attack types, i.e. “Backdoor”,
“DoS”, “Exploits”, etc. (10 classes overall, including “normal”). The dataset is highly imbalanced: for
example, the smallest class “Worms” contains only 130 data items, while the “Exploits” class contains
33 393 data items.</p>
        <p>The experiments focus on the binary classification of a packet between the “normal” and “attack” classes.
They have been organized in three stages.</p>
        <p>Experiment 1 establishes a baseline for imbalanced data and classical ML model performance,
obtained exploiting the full dataset and the classical SVC with RBF kernel (i.e. $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i -
\mathbf{x}_j\|^2)$).</p>
        <p>Experiment 2 compares different balancing methods for classical and quantum ML models. Both
samplers, implementing the random undersampling method and the NearMiss-3 algorithm, respectively,
sample $\ell = 300$ data points per class, 600 overall, which are then used to train classical SVC RBF kernel
models and quantum SVC QRA kernel models (with $n = 8$, $L = 4$ for QRA-8-4 and $n = 4$, $L = 8$ for
QRA-4-8). The QRA parameters have been selected to ensure that all $32 = 8 \times 4$ features could be
embedded.</p>
        <p>Experiment 3 explores how balancing affects the performance of different classical and quantum
models. In each run of the experiment, the random sampler, demonstrated to be more effective than
NearMiss-3 by Experiment 2, selects $\ell = 50, 100, 200, 300, 400, 500$ data points per class, and these
subsets are exploited to train classical RBF and quantum QRA-4-8 and QRA-8-4 SVC kernel models.
The choice of the parameter $\ell \leq 500$ is due to hardware limitations on simulations.</p>
        <p>All of the experiments follow the same design principles. Each experiment performs 10 runs, and
the reported results are statistics obtained over these runs. At each run, the features are scaled using
MinMaxScaler, i.e. each data point $\mathbf{x}_i = (x_{i,1}, \dots, x_{i,d})$ is mapped to the vector $\mathbf{z}_i = (z_{i,1}, \dots, z_{i,d})$,
where $z_{i,j} = (x_{i,j} - \min_{k=1,\dots,N} x_{k,j}) / (\max_{k=1,\dots,N} x_{k,j} - \min_{k=1,\dots,N} x_{k,j})$, $i = 1, \dots, N$ and $j = 1, \dots, d$. Then, if necessary,
the sampler chooses $\ell$ data points for each class (if the class contains fewer than $\ell$ data points, all these
points are selected, but no new points are generated). Finally, the SVC is trained on this subset of the
training data and is evaluated on a balanced subset of the test data containing 100 points per class.</p>
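        <p>A compact sketch of a single run under this design is given below; the helper arrangement is illustrative, and the sampler classes come from imbalanced-learn, an assumed implementation:</p>
        <preformat>
# One experimental run: scale, undersample, train an RBF SVC, evaluate on a
# balanced test subset (100 points per class), and report the F1 score.
import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import f1_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

def run_once(X_train, y_train, X_test, y_test, n_per_class=300, seed=0):
    # MinMax scaling: z_ij = (x_ij - min_j) / (max_j - min_j)
    scaler = MinMaxScaler().fit(X_train)
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

    # Undersample to n_per_class points per class; classes smaller than
    # n_per_class are kept whole (no synthetic points are generated)
    sampler = RandomUnderSampler(
        sampling_strategy={c: min(n_per_class, int(np.sum(y_train == c)))
                           for c in np.unique(y_train)},
        random_state=seed)
    X_bal, y_bal = sampler.fit_resample(X_train, y_train)

    # Balanced test subset with 100 points per class
    test_sampler = RandomUnderSampler(
        sampling_strategy={c: 100 for c in np.unique(y_test)},
        random_state=seed)
    X_eval, y_eval = test_sampler.fit_resample(X_test, y_test)

    clf = SVC(kernel="rbf").fit(X_bal, y_bal)
    return f1_score(y_eval, clf.predict(X_eval))
        </preformat>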
        <p>All tests have been performed on a machine with the following characteristics. CPU: AMD Ryzen
9 5950X, RAM: 64 GiB, OS: Linux, kernel: 6.6.74-gentoo, Python: 3.12.9, jupyter-core: 5.7.2, numpy:
2.2.2, qiskit: 1.3.2, qiskit-machine-learning: 0.8.2, scikit-learn: 1.6.1. The source code is available in the
repository [25].</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results</title>
        <p>Experiment 1. The results of Experiment 1, the baseline classical case, are reported in Table 1.
Although more sophisticated classical methods can achieve better results, this study aims at investigating
the effects of balancing by comparing similar classical and quantum methods, i.e. kernel-based SVC.</p>
        <p>Experiment 2. The results of Experiment 2 are shown in Figure 3. The random sampler shows
better F1-score performance than the NearMiss-3 sampler for both classical and quantum models, with
∼ 60% improvement on the classical model and ∼ 10% on the quantum ones. Thus, although worse than
the random sampler overall, the NearMiss-3 sampler works comparatively better with quantum models.</p>
        <p>Experiment 3. The main findings of Experiment 3 are shown in Figure 4. All the models show a
similar learning-curve shape: the curve initially increases up to a maximum value and then decreases,
likely due to overfitting. The maximum values for the quantum and classical models are close, but do not
coincide: the classical model and the quantum model with 4 qubits (QRA-4-8) reach their maximum at
around $\ell = 300$, while the quantum model with 8 qubits (QRA-8-4) has its maximum at around $\ell = 400$.
The quantum models also show evidence of better generalization: they outperform the classical ones
for smaller numbers of data items ($\ell \leq 100$), about 5% better with a majority class size of 50.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Conclusions</title>
      <p>This work investigated how dataset size and balance can affect different ML
models, both classical and quantum. The investigation on a real and well-known cybersecurity dataset
(UNSW-NB15) led to three primary findings. First, the use of random undersampling proved to be
an effective technique for preprocessing the imbalanced UNSW-NB15 dataset for both classical and
quantum classification ML models, suggesting that established classical techniques are valuable in
hybrid ML workflows. Second, a strong relationship between dataset size, even when balanced, and
model performance has been observed. Both classical and quantum ML models show concave learning
curves, a sign of overfitting issues. This pushes ML experts and practitioners to find the optimal
majority class size for which the ML model provides the best performance. Third, the quantum learning
models have demonstrated accelerated learning curves compared to classical ones, with much better
performance on small datasets in particular, while the classical models' peak performance exceeds that of the
quantum ones. This suggests hybrid solutions exploiting quantum ML models to address data scarcity, or as
bootstrap models at the beginning of a data acquisition campaign, then switching to classical ones
when enough data are available. These findings collectively indicate that, while quantum approaches
offer potential efficiencies, their practical application currently depends on classical preprocessing
and on advances in quantum hardware fault tolerance.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We acknowledge financial support under the National Recovery and Resilience Plan (PNRR), Mission
4, Component 2, Investment 1.4, Call for tender No. 1031 published on 17/06/2022 by the Italian
Ministry of University and Research (MUR), funded by the European Union – NextGenerationEU,
Project Title "National Centre for HPC, Big Data and Quantum Computing (HPC)" – Code National
Center CN00000013 – CUP D43C22001240001.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used LLM tools for grammar and spelling
checking. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and
take(s) full responsibility for the publication content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>A comprehensive review of quantum machine learning: from nisq to fault tolerance</article-title>
          ,
          <source>Reports on Progress in Physics 87</source>
          (
          <year>2024</year>
          )
          <fpage>116402</fpage>
          . URL: http://dx.doi.org/10.1088/1361-6633/ad7f69. doi:10.1088/1361-6633/ad7f69.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tomar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tripathi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Comprehensive survey of qml: From data analysis to algorithmic advancements, 2025</article-title>
          . URL: https://arxiv.org/abs/2501.09528. arXiv:2501.09528.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Rajput</surname>
          </string-name>
          ,
          <article-title>Quantum machine learning: Unveiling trends, impacts through bibliometric analysis</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2504.07726. arXiv:2504.07726.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.-Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Broughton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mohseni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Babbush</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Boixo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Neven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>McClean</surname>
          </string-name>
          ,
          <article-title>Power of data in quantum machine learning</article-title>
          ,
          <source>Nature communications 12</source>
          (
          <year>2021</year>
          )
          <fpage>2631</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lopez</surname>
          </string-name>
          ,
          <article-title>Quantum machine learning: Applications, algorithms, and hardware challenges</article-title>
          ,
          <source>International Journal of AI, BigData, Computational and Management Studies</source>
          <volume>5</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nadim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Mandal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <article-title>Quantum vs. classical machine learning algorithms for software defect prediction: Challenges and opportunities</article-title>
          ,
          <source>arXiv preprint arXiv:2412.07698</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Khanal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rivas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sanjel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sooksatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Quevedo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <article-title>Generalization error bound for quantum machine learning in nisq era-a survey</article-title>
          ,
          <source>Quantum Machine Intelligence</source>
          <volume>6</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Date</surname>
          </string-name>
          ,
          <article-title>Quantum data encoding: A comparative analysis of classical-to-quantum mapping techniques and their impact on machine learning accuracy</article-title>
          ,
          <source>EPJ Quantum Technology</source>
          <volume>11</volume>
          (
          <year>2024</year>
          )
          <fpage>72</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bhatt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sahani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Y.</given-names>
            <surname>Panchal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <article-title>A review on data balancing techniques and machine learning methods</article-title>
          ,
          <source>in: 2023 5th International Conference on Smart Systems and Inventive Technology (ICSSIT)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>1004</fpage>
          -
          <lpage>1008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Yousefimehr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghatee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Seifi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fazli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tavakoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Rafei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghafari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikahd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Gandomani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Orouji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Kashani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Heshmati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. S.</given-names>
            <surname>Mousavi</surname>
          </string-name>
          ,
          <article-title>Data balancing strategies: A survey of resampling and augmentation methods</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2505.13518. arXiv:2505.13518.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Mooijman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Catal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Tekinerdogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blokland</surname>
          </string-name>
          ,
          <article-title>The efects of data balancing approaches: A case study</article-title>
          ,
          <source>Applied Soft Computing</source>
          <volume>132</volume>
          (
          <year>2023</year>
          )
          <fpage>109853</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mohanty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. K.</given-names>
            <surname>Behera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ferrie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dash</surname>
          </string-name>
          ,
          <article-title>A quantum approach to synthetic minority oversampling technique (SMOTE)</article-title>
          ,
          <source>Quantum Machine Intelligence</source>
          <volume>7</volume>
          (
          <year>2025</year>
          )
          <fpage>38</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.-W.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>A review of machine learning methods for imbalanced data challenges in chemistry</article-title>
          ,
          <source>Chemical Science</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mendoza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mycroft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Milbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kahani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jaskolka</surname>
          </string-name>
          ,
          <article-title>On the efectiveness of data balancing techniques in the context of ml-based test case prioritization</article-title>
          ,
          <source>in: Proceedings of the 18th International Conference on Predictive Models and Data Analytics in Software Engineering</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>72</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kaur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Pannu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Malhi</surname>
          </string-name>
          ,
          <article-title>A systematic review on imbalanced data challenges in machine learning: Applications and solutions</article-title>
          ,
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>52</volume>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mohanty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. K.</given-names>
            <surname>Behera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ferrie</surname>
          </string-name>
          ,
          <article-title>Quantum smote with angular outliers: Redefining minority class handling</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2501.19001. arXiv:2501.19001.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kwon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Kwon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-h.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kwon</surname>
          </string-name>
          ,
          <article-title>Leveraging quantum machine learning to address class imbalance: A novel approach for enhanced predictive accuracy</article-title>
          ,
          <source>Symmetry</source>
          <volume>17</volume>
          (
          <year>2025</year>
          ). URL: https://www.mdpi.com/2073-8994/17/2/186. doi:10.3390/sym17020186.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] I. Mani, I. Zhang, kNN approach to unbalanced data distributions: a case study involving information extraction, in: Proceedings of the Workshop on Learning from Imbalanced Datasets, volume 126, ICML, United States, 2003, pp. 1-7.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] M. Treinish, et al., Qiskit/qiskit: Qiskit 1.4.3, 2025. doi:10.5281/zenodo.15374661.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] M. Schuld, F. Petruccione, Machine learning with quantum computers, volume 676, Springer, 2021.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] A. C. Müller, S. Guido, Introduction to machine learning with Python: a guide for data scientists, O'Reilly Media, Inc., 2016.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] N. Moustafa, J. Slay, UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set), in: 2015 Military Communications and Information Systems Conference (MilCIS), 2015, pp. 1-6. doi:10.1109/MilCIS.2015.7348942.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] N. Moustafa, J. Slay, UNSW-NB15, 2024. doi:10.34740/KAGGLE/DSV/9350725.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] L. D'Hooge, M. Verkerken, T. Wauters, B. Volckaert, F. De Turck, Discovering non-metadata contaminant features in intrusion detection datasets, in: 2022 19th Annual International Conference on Privacy, Security &amp; Trust (PST), 2022, pp. 1-11. doi:10.1109/PST55820.2022.9851974.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] M. Ziiatdinov, S. Distefano, Cyber-risk case study, https://github.com/usce2qc/notebooks?tab=readme-ov-file#cybersecurity, 2025.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>