Detecting Zero-Day Attacks under Concept Drift: An Online Unsupervised Threat Detection System

Alessandra De Paola 1,2,†, Salvatore Drago 3,†, Pierluca Ferraro 1,2,∗,† and Giuseppe Lo Re 1,2,†

1 Department of Engineering, University of Palermo, Italy
2 Cybersecurity National Lab, CINI - Consorzio Interuniversitario Nazionale per l'Informatica
3 IMT School for Advanced Studies Lucca, Italy

Abstract
In recent years, there has been significant interest in mechanisms for detecting cyber-security threats. However, the dynamic nature of modern systems and networks poses significant challenges for threat detection systems exploiting machine learning models, since shifts in the statistical distribution of data over time, known as concept drift, can cause severe performance degradation. In this scenario, traditional static systems often need manual retraining by human operators, leaving networks exposed to vulnerabilities in the interim. Moreover, the challenge of detecting zero-day attacks through semi-supervised or unsupervised models remains a critical aspect that has garnered much attention in the literature. This work introduces an unsupervised online threat detection system designed to identify anomalous traffic indicative of zero-day attacks, while explicitly handling concept drift by automating retraining processes only when necessary. An extensive experimental evaluation on the real-world IoT-23 dataset, encompassing network traffic from IoT devices and malicious traffic from malware-infected devices, showcases the system's efficacy, showing superior performance in real-time threat detection compared to traditional static approaches.

Keywords
Concept Drift, Threat Detection System, Online Unsupervised Learning

ITASEC 2024: The Italian Conference on CyberSecurity, April 08–12, 2024, Salerno, IT
∗ Corresponding author.
† These authors contributed equally.
alessandra.depaola@unipa.it (A. De Paola); salvatore.drago@imtlucca.it (S. Drago); pierluca.ferraro@unipa.it (P. Ferraro); giuseppe.lore@unipa.it (G. Lo Re)
ORCID: 0000-0002-7340-1847 (A. De Paola); 0009-0009-0367-0484 (S. Drago); 0000-0003-1574-1111 (P. Ferraro); 0000-0002-8217-2230 (G. Lo Re)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

The digital age has seen an exponential rise in cyber-attacks targeting networks [1], computers, information systems [2] and IoT devices [3]. Consequently, cybersecurity has become essential as the integration of such technologies into our daily lives continues to evolve [4, 5], and ensuring their security is paramount. As a result, interest in developing systems for detecting cybersecurity threats has grown steadily. Intrusion Detection Systems (IDS) are a prime example of such systems, and many works [6, 7, 8] in the literature have demonstrated remarkable performance in distinguishing between benign and malicious traffic, exploiting supervised machine learning algorithms like deep neural networks [9] and decision trees [10, 11]. These algorithms are capable of learning complex separation criteria between different types of traffic and generalizing from the training dataset. Yet, the rapid evolution and growing complexity of digital systems pose new challenges.
The dynamic nature of these environments [12], compounded by the continuous emergence of new threats, particularly zero-day attacks, necessitates more adaptable solutions. Recent research has thus concentrated on two areas in particular. The first is the development of zero-day detection mechanisms using semi-supervised models that can accurately classify malicious data by detecting differences from the benign data on which they are trained. The second area concerns the phenomenon of concept drift [13, 14], where the statistical properties of input data may unexpectedly change over time, leading to the obsolescence of previously trained machine learning models and the emergence of new errors and inaccuracies.

Such drifts can occur, for example, when the benign traffic pattern of a corporate network shifts due to alterations in network topology, infrastructure changes, the introduction or removal of services, software updates, or changes in employee behavior prompted by new corporate policies. Notably, zero-day attacks can significantly deviate from the statistical distribution of known malicious traffic, thus often evading detection by even the most sophisticated supervised systems. In real-world scenarios, these changes can be sudden, and the same type of statistical input distribution, or concept, can be recurrent. Consider, for example, a click-day event, or the shift from in-person to remote work or education (and vice versa) during pandemic periods.

Responding to these challenges, this work introduces a novel online unsupervised anomaly detection system aimed at identifying zero-day attacks within network environments, with a specific focus on addressing concept drifts. The system's architecture is multi-layer and incorporates drift detection mechanisms alongside an unsupervised anomaly detection model. The primary objective is to reduce the necessity for frequent model re-training in response to concept drift, while simultaneously maintaining high detection accuracy. A distinctive feature of the proposed approach is the management of recurring concept drifts, a common occurrence in dynamic real-world scenarios. Unlike conventional drift-aware systems, which may discard outdated models, our system retains them for potential future reuse, recognizing the cyclic nature of certain drift patterns. A new model is trained only when all existing models prove to be inadequate for incoming data streams, ensuring efficient use of resources.

The effectiveness of the proposed system was extensively evaluated using the IoT-23 dataset [15], which includes real-world data from IoT devices, allowing for a rigorous evaluation in realistic conditions. The system is also compared to a static approach to demonstrate its effectiveness, highlighting its ability to adapt to novel threats with high accuracy and reduced re-training frequency, while being fully unsupervised.

The principal contributions of this paper are summarized as follows: (1) introduction of an online unsupervised anomaly detection system specifically designed to handle recurring concept drifts in zero-day attack scenarios; (2) adaptation of the traditional semi-supervised, static AutoEncoder anomaly detection technique to unsupervised real-time data stream analysis; (3) reduction in the frequency of required re-training for anomaly detection models, without compromising detection accuracy; (4) comprehensive validation of the proposed system using a real-world dataset.

The remainder of the paper is structured as follows.
Section 2 reviews related work, while Section 3 details the proposed architecture. Section 4 presents the experimental setup and findings, leading to Section 5, where we draw our conclusions and suggest directions for future research.

2. Related Work

In modern networks, which include a mix of personal and IoT devices, the distribution of input or output data can change unpredictably over time [16]. This phenomenon, known as concept drift, has been identified as a significant challenge in machine learning and cybersecurity [17, 18]. Concept drift can arise from various sources and is broadly categorized into two types: virtual drift, where the distribution of input data changes without affecting the output predictions, and actual drift, which is characterized by stable input data distributions but changing predictions, leading to alterations in the decision boundary. Often, these drifts occur at the same time, resulting in simultaneous changes in both input data distribution and output predictions. Another important classification of concept drift scenarios is based on the evolution pattern of data distribution over time, typically identified as sudden, incremental, gradual, and recurring drifts, each with distinct characteristics, as shown in Fig. 1.

Figure 1: Evolution of data distribution over time and types of Concept Drift.

Despite the growing number of studies, research explicitly addressing both threat detection systems and concept drift remains relatively scarce. The authors of [17] describe the workflow most commonly used in the literature for managing concept drift via sliding windows, organized in three distinct phases: real-time evaluation of each new record by the model, detection of concept drift, and adaptation to any detected drift. Concept drift detection is typically based on the model's error rate: a degradation of the system's performance is attributed to concept drift, triggering an adaptation phase that often includes supervised retraining on the data causing the performance decline. However, this reliance on error rates for drift detection requires knowledge of the actual data labels for comparison with predictions, a requirement that is impractical in online contexts due to the unrealistic expectation of real-time labeling by domain experts. This challenge is common across the literature on intrusion detection methods that are aware of concept drift, as seen in works such as [19]. The proposed system overcomes this limitation by adopting an unsupervised approach for detecting both concept drifts and anomalous traffic.

In the context of unsupervised drift detection, the authors of [20] assess incoming data in batches, using a parameter to limit the anomaly rate in these batches. However, selecting an appropriate value for this parameter is difficult without additional data knowledge in an unsupervised setting. The authors of [21] address this issue by recommending the use of KSWIN [22] to detect drift in the features of incoming data, allowing for model adjustments prior to evaluation. A notable gap in these methodologies is their failure to account for the potential recurrence of concept drifts. Unlike these approaches, our system is designed to detect and manage recurring concept drifts effectively, reducing the frequency of retraining phases. Moreover, many anomaly detection systems discussed in the literature are not well-suited for online applications.
For detecting zero-day attacks, AutoEncoders are often used in a semi-supervised manner [23] to model the principal characteristics of benign traffic data, which constitutes the training set. Additionally, a threshold for distinguishing between benign and malicious traffic is empirically defined during the training phase. This procedure does not fit the operational requirements of the proposed system for online functionality, as detailed in Section 3.1. To address this, we have adapted advanced AutoEncoder-based anomaly detection techniques to the real-time analysis of streaming data.

3. Proposed Architecture

This section presents the architecture of the unsupervised threat detection system, as illustrated in Fig. 2. The architecture integrates three core components: the Input Drift Detection (IDD) module, which monitors the statistical distribution of incoming data; the Unsupervised Anomaly Detection (UAD) module, which maintains the anomaly detection models; and the Output Drift Detection (ODD) module, which detects drifts based on the anomaly scores produced by those models. These components allow the system to handle sudden and recurring concept drift, autonomously identifying shifts in data distribution indicative of new, previously unencountered concepts that were not captured during the initial training phase of the model.

Figure 2: Workflow of the proposed architecture for the online threat detection system.

To mitigate performance degradation caused by such drifts, the system checks its memory for a pre-existing model capable of addressing the new concept, thereby minimizing the need for retraining. In the absence of an applicable model, the system trains a new one tailored to the data causing the concept drift. Furthermore, the architecture is designed to manage streaming data, addressing the inherent challenges of real-time data processing. A practical implementation of this architecture could be envisioned in a server setup dedicated to the continuous monitoring of network traffic, aiming to detect and isolate suspicious activities that may signify compromised devices. To this end, the design of the system also factors in memory and temporal constraints typical of real-time threat detection scenarios [24, 25].

To account for temporal constraints, data processing is performed in a sequential manner. This methodology significantly reduces the latency between the acquisition of new data and its subsequent analysis, allowing the system to provide predictions on novel samples not previously used for training. This "test-then-train" approach is commonly known as Prequential Evaluation [26]. As soon as a new record is available, the system immediately evaluates it with the anomaly detection model currently in use, as indicated by the green dotted path in Fig. 2. The record is then saved for potential concept drift identification and, if required, for the retraining of the model, as depicted by the orange continuous path in Fig. 2.

To address memory limitations, a fixed-size sliding window technique is used: only the two most recent batches of data are retained at any point in time, effectively bounding the system's memory usage. The sliding window W used by the system has a constant dimension 2w, containing the two most recent data batches B_{n-1} and B_n, where each batch B_i contains w elements. Additionally, the system is constrained to maintain a finite set of N models within UAD, alongside an array of anomaly scores derived from the data on which these models were originally trained.
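As an illustration of this workflow, the following Python sketch outlines the prequential loop, the sliding window W and the bounded model store under simplifying assumptions. It is a sketch, not the authors' code: all names are illustrative, and evaluate_record, train_model, input_drift_detected and odd_detects_drift stand for routines detailed in the remainder of this section.

```python
# An illustrative sketch of the prequential "test-then-train" loop with a
# sliding window of the two most recent batches and a bounded store of at
# most N models. Helper functions are assumptions, sketched later in this
# section (odd_detects_drift is a per-model wrapper around the ODD check).
from collections import deque
import numpy as np

def run(stream, w=300, N=10):
    """stream: any iterable of per-record feature vectors (assumed)."""
    models = deque(maxlen=N)       # bounded store; the oldest model is evicted
    current = None                 # (model, scaler, alpha) currently in use
    prev_batch, buffer = None, []  # window W holds B_{n-1} and the growing B_n

    for record in stream:
        if current is not None:                      # test first ...
            is_anomaly = evaluate_record(np.asarray(record), *current)
        buffer.append(record)                        # ... then keep for training
        if len(buffer) < w:
            continue
        B_n = np.asarray(buffer)
        if current is None:                          # first batch B_0
            current = train_model(B_n)
            models.append(current)
        elif input_drift_detected(prev_batch, B_n):  # IDD: drift between batches?
            # ODD: reuse the first stored model whose anomaly-score
            # distribution on B_n matches its training scores
            reusable = next((m for m in models
                             if not odd_detects_drift(m, B_n)), None)
            if reusable is not None:
                current = reusable                   # recurring concept: reuse
            else:
                current = train_model(B_n)           # all models drifted: retrain
                models.append(current)
        prev_batch, buffer = B_n, []                 # slide the window W
```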
The first unsupervised anomaly detection model used by the system is trained on the first batch of data, B_0. From that point on, each new record x_i is evaluated and labeled through the current model before being incorporated into the sliding window W. When a new batch of data B_n has been stored in W, the system tries to detect potential drifts through the IDD module. If IDD detects concept drift between the last two batches of data, then B_n is re-evaluated by each model in UAD to obtain a list of anomaly scores per model. Hence, for each model, the system retains anomaly score lists for both the training data and the new data that caused the drift. The ODD module is then employed to analyze these lists for each model. If drift is detected for all models, a new model is trained on B_n and set as the current model for future evaluations; if the list of models is full, the oldest one is replaced to accommodate the new one. Conversely, if a pre-existing model in UAD is found to be drift-resistant by ODD, it is selected as the model to use.

Our methodology for incorporating concept drift detection into both the IDD and ODD modules utilizes the KSWIN algorithm [22], which is based on the principles of the Kolmogorov-Smirnov (KS) statistical test [27]. KS is a non-parametric test, useful for its independence from underlying data distribution assumptions, although it traditionally applies only to one-dimensional data, evaluating the maximum difference between the empirical distribution functions of two sets of data. KSWIN can therefore be employed directly within the ODD module, by comparing the one-dimensional distributions of reconstruction errors. Given the multidimensional nature of our data, we adapted KSWIN for IDD with a necessary modification, as advised in [21], by applying the algorithm individually to each feature. Whenever drift is detected in any feature, the system considers the need to either initiate training on a new model or reinstate a previously trained model.
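To make the drift checks concrete, the sketch below uses SciPy's two-sample KS test as a simplified, batch-against-batch stand-in for KSWIN, which applies the same statistic over internally maintained sliding windows; the function names and the significance level alpha_ks are illustrative assumptions.

```python
# A minimal sketch of the IDD and ODD checks, assuming a batch-vs-batch
# KS test in place of KSWIN's windowed variant.
import numpy as np
from scipy.stats import ks_2samp

def input_drift_detected(batch_prev, batch_new, alpha_ks=0.01):
    """IDD: flag drift if any single feature's distribution has changed."""
    for j in range(batch_prev.shape[1]):          # one KS test per feature
        _, p_value = ks_2samp(batch_prev[:, j], batch_new[:, j])
        if p_value < alpha_ks:                    # feature j drifted
            return True
    return False

def output_drift_detected(train_scores, new_scores, alpha_ks=0.01):
    """ODD: compare the one-dimensional anomaly-score distributions of a
    model on its training data and on the batch that triggered IDD."""
    _, p_value = ks_2samp(train_scores, new_scores)
    return p_value < alpha_ks
```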
3.1. Unsupervised Anomaly Detection: AutoEncoder

In this work, an AutoEncoder (AE) has been chosen as the base model for unsupervised anomaly detection within the UAD module. AutoEncoders are specialized forms of artificial neural networks adept at learning efficient and meaningful encodings of a dataset's feature space. This is achieved through two principal functions: an encoder that transforms the input into a reduced encoded form, and a decoder that attempts to recreate the original input from this encoded representation. For anomaly detection purposes, AutoEncoders are generally employed in a semi-supervised manner to model and replicate the principal characteristics of benign traffic data, which forms the basis of the training set. The dataset is initially partitioned into training and testing subsets, with the training subset further refined to include only benign records. These records are used to train the AutoEncoder, enabling it to reconstruct benign traffic data accurately while failing to reconstruct anomalous data, which it has not encountered during training. This discrepancy in reconstruction capability, particularly evident during the testing phase when malicious records are introduced, signals the presence of anomalies [23]. The reconstruction error is defined as follows:

r(i) = ‖x_i − d_θ(e_φ(x_i))‖.   (1)

Equation 1 is computed as the absolute difference between the original data point x_i and its reconstruction d_θ(e_φ(x_i)), where d_θ denotes the decoding function and e_φ the encoding function. During the testing phase, this metric is employed as an anomaly score; if the score exceeds a threshold value α, the record is deemed anomalous. The threshold value α is determined based on the model's performance across the entire training dataset, with the objective of effectively distinguishing between benign and malicious traffic.

Unlike the semi-supervised static operation described so far, in the proposed system the AutoEncoder is utilized in a fully unsupervised manner. The online nature of the problem and the specific application domain preclude the assumption that the system can access ground truth data about the current time window for training. Therefore, it is not feasible to guarantee that the model is exposed exclusively to benign traffic. However, it can be assumed that, over a fixed period, the frequency of anomalies relative to the entire set of observations is very low, to such an extent that the impact of these anomalies during the training phase is negligible.

All the AutoEncoders used in this work share the structure described in Fig. 3. The encoder comprises two hidden layers with dimensions of 128 and 64, respectively, while the decoder is structured with two hidden layers having dimensions of 64 and 128, respectively. The latent space has 8 dimensions. The Gaussian Error Linear Unit (GELU) serves as the activation function for all layers except the output layer, which employs a linear function. During the training phase, the Mean Absolute Error (MAE) of the reconstruction is used as the loss function:

MAE = (1/N) Σ_{i=0}^{N} r(i).   (2)

Figure 3: AutoEncoder architecture for the UAD module.
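For reference, a minimal Keras sketch of this architecture is given below; it mirrors the layer sizes, GELU activations, linear output and MAE loss described above, while the function name and the input dimensionality n_features are assumptions (the paper does not state the number of input features).

```python
# A minimal Keras sketch of the AutoEncoder of Fig. 3: encoder 128-64,
# 8-dimensional latent space, decoder 64-128, GELU everywhere except the
# linear output layer, MAE reconstruction loss (Eq. 2), Adam optimizer
# with learning rate 0.001 (Sec. 4). n_features is an assumption.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_autoencoder(n_features: int) -> Model:
    inputs = tf.keras.Input(shape=(n_features,))
    # Encoder e_phi: 128 -> 64 -> 8-dimensional latent code
    x = layers.Dense(128, activation="gelu")(inputs)
    x = layers.Dense(64, activation="gelu")(x)
    latent = layers.Dense(8, activation="gelu")(x)
    # Decoder d_theta: 64 -> 128 -> linear reconstruction of the input
    x = layers.Dense(64, activation="gelu")(latent)
    x = layers.Dense(128, activation="gelu")(x)
    outputs = layers.Dense(n_features, activation="linear")(x)
    model = Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="mae")
    return model
```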
Algorithm 1: AutoEncoder training.
Input: B: batch of data used for training.
Output: model: new AutoEncoder model with trained encoder (e_φ) and decoder (d_θ); sc: Standard Scaler of model; α: threshold.
1  model ← AE()
2  φ, θ ← random weights initialization
3  sc ← StandardScaler()
4  sc ← sc.fit(B)
5  B_t ← sc.transform(B)
6  Train, Validation ← split(B_t)
7  model, φ, θ ← model.fit(Train, Validation)
8  r ← model.predict(B_t)
9  α ← μ(r) + 2σ(r)

A detailed description of the training phase is provided in Algorithm 1. Lines 1 and 2 initialize a new AutoEncoder with the previously described structure and random weights for both the encoder and decoder. Line 3 creates a new Standard Scaler, a preprocessing tool that normalizes the data by removing the mean and scaling to unit variance, ensuring that each feature contributes equally to the distance computations; this is crucial for models sensitive to the magnitude of features. Each AutoEncoder in the set of models available to the system is paired with a Standard Scaler, fitted on the batch of data on which the new model is to be trained (line 4) and then used to transform the training data batch (line 5), as well as each new data record before it is evaluated by the model. The training data batch is first scaled and then split between the training set (80%) and the validation set (20%), as shown in line 6. After the training phase (line 7), the reconstruction error r, as defined in Eq. 1, is calculated for each data point in the training batch (line 8) and used to determine the threshold value α (line 9). Unlike in semi-supervised operation, the selection of α cannot be based on the system's performance in correctly labeling anomalous traffic in the training batch, since ground truth data is not available. In this work, assuming that the distribution of the reconstruction error r for the training data, and thus for benign traffic, can be approximated by a normal distribution, α is set to μ(r) + 2σ(r), the mean plus two standard deviations of the reconstruction error distribution. This choice has been deemed appropriate based on the experimental evaluations reported in Section 4.

Algorithm 2: AutoEncoder testing.
Input: x_i: i-th record of network traffic; model: selected AE model with encoder (e_φ) and decoder (d_θ); sc: Standard Scaler of the selected model; α: threshold.
1  xs_i ← sc.transform(x_i)
2  r(i) ← model.predict(xs_i)
3  if r(i) > α then
4      x_i is an anomaly
5  else
6      x_i is not an anomaly
7  end

Algorithm 2 describes the evaluation phase for a new record. In line 1, the record x_i is transformed using the Standard Scaler of the selected model; then, in line 2, the reconstruction error r(i) is calculated. If r(i) exceeds the threshold value α (line 3), chosen during the training phase for that model, the record is labeled as anomalous traffic (line 4); otherwise, it is labeled as benign traffic (line 6).

4. Experimental Evaluation

This section evaluates the proposed system's performance using a comprehensive suite of metrics, as shown in Table 1, including accuracy, F1-score, false positive rate (FPR), and false negative rate (FNR), in line with established scientific standards [28].

Table 1: Metrics used for the experimental evaluation.

Metric      Formula
Accuracy    (TP + TN) / (TP + TN + FP + FN)
Precision   TP / (TP + FP)
Recall      TP / (TP + FN)
F1-score    2 · Precision · Recall / (Precision + Recall)
FPR         FP / (TN + FP)
FNR         FN / (TP + FN)

The experiments leverage the IoT-23 dataset collection [15], a repository of network traffic data covering heterogeneous IoT devices in both secure and compromised conditions due to various malware threats. Specifically, the experiments are performed on a modified subset of the IoT-23 collection, integrating data from an expanded range of IoT devices and malware instances. This subset includes both sudden and recurring concept drifts, presenting a challenging environment for the evaluation of a threat detection system. Given the system's online and unsupervised nature, and its initial training on traffic presumed mainly benign, all attacks encountered in the experiments are treated as unknown, similar to zero-day attacks. This setup highlights the system's capability to identify and respond to novel threats without prior knowledge, simulating a real-world scenario where each attack poses a novel challenge.

To address the significant class imbalance present in the dataset, the weighted F1-score was adopted as the metric of choice. Unlike the standard F1-score, which does not account for true negatives and may provide a skewed perspective in the presence of unbalanced classes, the weighted F1-score incorporates the prevalence of each class. This adjustment ensures a more accurate representation by emphasizing the importance of more frequently occurring classes [29].

All experiments utilize the AutoEncoder (AE) architecture described in Sec. 3.1. The AE models, developed using Keras and TensorFlow, are trained through backpropagation, employing the Adam optimizer with a learning rate of 0.001. To enhance training efficiency and prevent overfitting, an early stopping mechanism was implemented. This mechanism, activated after the tenth epoch, monitors the Mean Absolute Error (MAE) on the validation set, with a patience parameter of 5 epochs, terminating training if no improvement is observed. The training is further bounded by a maximum duration of 100 epochs or 300 seconds.
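Under these settings, Algorithms 1 and 2 could be realized along the following lines. This is a sketch under stated assumptions, not the authors' code: it reuses build_autoencoder from the previous sketch, reads the per-record error of Eq. 1 as a mean absolute deviation, omits the 300-second cap (which would require a custom callback), and assumes a TensorFlow release recent enough to support EarlyStopping's start_from_epoch argument.

```python
# A sketch of Algorithms 1 and 2 under the training setup above. Helper
# names are assumptions; the 300-second time cap is omitted for brevity.
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

def train_model(batch):
    sc = StandardScaler().fit(batch)                # Alg. 1, lines 3-4
    batch_t = sc.transform(batch)                   # line 5
    train, valid = train_test_split(batch_t, test_size=0.2)  # line 6: 80/20
    model = build_autoencoder(batch.shape[1])       # lines 1-2
    stop = tf.keras.callbacks.EarlyStopping(        # stop on validation MAE,
        monitor="val_loss", patience=5,             # patience 5, active after
        start_from_epoch=10)                        # epoch 10 (TF >= 2.11)
    model.fit(train, train, validation_data=(valid, valid),
              epochs=100, callbacks=[stop], verbose=0)        # line 7
    # line 8: per-record reconstruction error (Eq. 1, mean absolute deviation)
    r = np.abs(batch_t - model.predict(batch_t, verbose=0)).mean(axis=1)
    alpha = r.mean() + 2.0 * r.std()                # line 9: mu(r) + 2*sigma(r)
    return model, sc, alpha

def evaluate_record(x_i, model, sc, alpha):
    """Algorithm 2: True if the record is labeled anomalous."""
    xs = sc.transform(x_i.reshape(1, -1))                      # line 1
    r_i = np.abs(xs - model.predict(xs, verbose=0)).mean()     # line 2
    return r_i > alpha                                         # lines 3-6
```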
Experiments were repeated by varying several hyperparameters of the system, including the size w of the data window (from a minimum of 150 to a maximum of 500), the batch size during training (ranging between 16 and the largest power of two smaller than w), and the maximum number of stored models N in the UAD module (ranging from 1 to 15). Moreover, the experiments were also conducted with two different methodologies in addition to the proposed one: a conventional static approach, which involves a single initial training phase and no drift detection technique (referred to as Static), and a dynamic approach in which the model is retrained upon each detected concept drift, without explicitly addressing recurring concept drift (referred to as Detect Drift and Retrain, or DDR).

Results, illustrated in Fig. 4, highlight the superior performance of the proposed system under the optimal parameter configuration for each evaluated method. Notably, the static approach exhibits a high False Positive Rate (FPR), misclassifying 23.4% of benign traffic as malicious. This high value arises despite selecting a threshold α aiming for a 5% FPR, based on the assumption of a normal distribution of benign traffic, as discussed in Sec. 3.1. Conversely, both DDR and the proposed system closely align with this FPR target, achieving rates of 8% and 7%, respectively, showing their efficacy in adapting to dynamic traffic patterns.

Figure 4: Accuracy, F1-score, FPR and FNR of the static approach, DDR and the proposed system.

These findings highlight the dynamic nature of benign traffic, which evolves over time, leading to concept drifts. When traffic patterns gradually change, the performance of the static system degrades, because its model and α threshold no longer reflect the current input data distribution. This results in lower accuracy and F1-score (76.7% and 86.3%, respectively) compared to the other approaches. Indeed, systems equipped to adapt to these shifts in distribution by detecting and adjusting to concept drift maintain their performance efficacy. The DDR approach and the proposed system achieve accuracy rates of 92.3% and 92.7%, and F1-scores of 95.5% and 95.7%, respectively, demonstrating their superior adaptability over the static model. Both methods exhibit an increase in the FNR, attributable to the misidentification of concept drift by the IDD module amidst data noise, which leads to the detection of non-existent concept drifts and thus to unnecessary model retraining and threshold recalibration. The proposed system, however, mitigates this issue, achieving a lower FNR of 15% compared to DDR's 27%. By recognizing the current model as appropriate for the "false drift" detected by the IDD module, the proposed system avoids unnecessary retraining and threshold adjustments. This capability results in slightly higher accuracy (92.7%) than DDR (92.3%), while reusing an existing model 30% of the time, thus optimizing computational resource usage.
Minimizing evaluation delay is crucial for online threat detection systems, especially in environments where data arrival patterns are unpredictable, as detailed in Sec. 3. Table 2 reports the execution times of the three approaches, relative to the static system, which is taken as the baseline (1.00x execution time). While the static system exhibits the fastest execution time due to its lack of retraining, the trade-off in accuracy (92.7% for the proposed system versus 76.7% for the static system) justifies the slight increase in execution time (14% on average). The proposed system represents an optimal compromise: it significantly outperforms the static approach and slightly exceeds the performance of the DDR system in terms of accuracy and F1-score, while exhibiting reduced execution times compared to the DDR approach, which continually retrains without addressing recurring concept drifts.

Table 2: Execution times of the three approaches.

Drift Detection Strategy    Reused Models    Time
Static                      -                1.00x
Always retrain (DDR)        0%               1.19x
Proposed system             30%              1.14x

5. Conclusions and Future Work

This work explored the challenges of detecting and managing concept drift in the realm of threat detection for network traffic streams. A novel system was introduced, combining unsupervised anomaly detection with concept drift detection techniques to increase its resilience against zero-day attacks. The system applies AutoEncoder anomaly detection algorithms, typically used in static, semi-supervised contexts, to an online unsupervised setting, enhancing their practical usability in streaming data scenarios. A key contribution of the proposed approach is the explicit management of recurring concept drifts by maintaining a repository of previous models, significantly reducing the need for frequent re-training.

The validity of the proposed methodology was rigorously tested through comprehensive evaluations on a real-world dataset. Since the system is trained primarily on benign traffic, it treats all detected attacks as novel, similar to zero-day threats, reflecting real-world conditions where many attacks are unforeseen. The empirical results highlight the system's capability to accurately identify malicious traffic, even in the face of concept drift and unknown attacks, consistently achieving high accuracy and F1-scores. These outcomes not only surpass those of static approaches but also highlight the critical importance of effectively managing concept drift to prevent abrupt performance deterioration. Furthermore, the proposed system outperforms a method that always retrains its models, in both accuracy and execution speed, highlighting the often-overlooked significance of addressing recurring concept drift.

As future directions, the system could be enhanced by incorporating a more sophisticated model replacement policy, for example using reputation mechanisms [30, 31], and a concept drift detection module designed to operate directly in the multidimensional space of input features. These improvements aim to increase the system's resilience to noise and its efficiency, by minimizing false detections of concept drift.

Acknowledgments

This work is partially funded by the European Union Next-Generation EU (PIANO NAZIONALE DI RIPRESA E RESILIENZA (PNRR) – MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.3 – D.D. 1551.11–10-2022, PE00000004) - MICS (Made in Italy – Circular and Sustainable) Extended Partnership.

References

[1] H. Al-Mohannadi, Q. Mirza, A. Namanya, I. Awan, A. Cullen, J. Disso, Cyber-attack modeling analysis techniques: An overview, in: 2016 IEEE 4th International Conference on Future Internet of Things and Cloud Workshops (FiCloudW), IEEE, 2016, pp. 69–76.
[2] S. Gaglio, A. Giammanco, G. Lo Re, M. Morana, Adversarial machine learning in e-health: attacking a smart prescription system, in: International Conference of the Italian Association for Artificial Intelligence, Springer, 2021, pp. 490–502.
[3] I. Stellios, P. Kotzanikolaou, M. Psarakis, C. Alcaraz, J. Lopez, A survey of IoT-enabled cyberattacks: Assessing attack paths to critical infrastructures and services, IEEE Communications Surveys & Tutorials 20 (2018) 3453–3495.
[4] V. Agate, A. De Paola, P. Ferraro, G. Lo Re, M. Morana, SecureBallot: A secure open source e-voting system, Journal of Network and Computer Applications (JNCA) (2021).
[5] A. Bordonaro, A. De Paola, G. Lo Re, M. Morana, Smart auctions for autonomic ambient intelligence systems, in: 2020 IEEE International Conference on Smart Computing (SMARTCOMP), IEEE, 2020, pp. 180–187.
[6] H.-J. Liao, C.-H. R. Lin, Y.-C. Lin, K.-Y. Tung, Intrusion detection system: A comprehensive review, Journal of Network and Computer Applications 36 (2013) 16–24.
[7] R. Vinayakumar, M. Alazab, K. Soman, P. Poornachandran, A. Al-Nemrat, S. Venkatraman, Deep learning approach for intelligent intrusion detection system, IEEE Access 7 (2019) 41525–41550.
[8] A. De Paola, S. Gaglio, G. Lo Re, M. Morana, A hybrid system for malware detection on big data, in: IEEE INFOCOM 2018 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), IEEE, 2018, pp. 45–50.
[9] A. Javaid, Q. Niyaz, W. Sun, M. Alam, A deep learning approach for network intrusion detection system, in: Proceedings of the 9th EAI International Conference on Bio-inspired Information and Communications Technologies (formerly BIONETICS), 2016, pp. 21–26.
[10] N. Farnaaz, M. Jabbar, Random forest modeling for network intrusion detection system, Procedia Computer Science 89 (2016) 213–217.
[11] V. Agate, F. M. D'Anna, A. De Paola, P. Ferraro, G. Lo Re, M. Morana, A behavior-based intrusion detection system using ensemble learning techniques, in: CEUR Workshop Proceedings, 6th Italian Conference on Cybersecurity, ITASEC 2022, volume 3260, 2022, pp. 207–218.
[12] F. Concone, C. Ferdico, G. Lo Re, M. Morana, A federated learning approach for distributed human activity recognition, in: 2022 IEEE International Conference on Smart Computing (SMARTCOMP), IEEE, 2022, pp. 269–274.
[13] F. Bayram, B. S. Ahmed, A. Kassler, From concept drift to model degradation: An overview on performance-aware drift detectors, Knowledge-Based Systems 245 (2022) 108632.
[14] V. Agate, S. Drago, P. Ferraro, G. Lo Re, Anomaly detection for reoccurring concept drift in smart environments, in: 18th International Conference on Mobility, Sensing and Networking (MSN), IEEE, 2022, pp. 113–120.
[15] S. Garcia, A. Parmisano, M. J. Erquiaga, IoT-23: A labeled dataset with malicious and benign IoT network traffic (1.0.0) [Data set], Zenodo (2020). URL: https://doi.org/10.5281/zenodo.4743746.
[16] A. De Paola, P. Ferraro, S. Gaglio, G. Lo Re, M. Morana, M. Ortolani, D. Peri, A context-aware system for ambient assisted living, in: Ubiquitous Computing and Ambient Intelligence: 11th International Conference, UCAmI 2017, Philadelphia, PA, USA, November 7–10, 2017, Proceedings, Springer, 2017, pp. 426–438.
[17] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, A. Bouchachia, A survey on concept drift adaptation, ACM Computing Surveys (CSUR) 46 (2014) 1–37.
[18] J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, G. Zhang, Learning under concept drift: A review, IEEE Transactions on Knowledge and Data Engineering 31 (2018) 2346–2363.
[19] D. Mulimani, S. G. Totad, P. Patil, S. V. Seeri, Adaptive ensemble learning with concept drift detection for intrusion detection, in: Data Engineering and Intelligent Computing: Proceedings of ICICC 2020, Springer, 2021, pp. 331–339.
[20] Z. Ding, M. Fei, An anomaly detection approach based on isolation forest algorithm for streaming data using sliding window, IFAC Proceedings Volumes 46 (2013) 12–17.
[21] M. U. Togbe, Y. Chabchoub, A. Boly, M. Barry, R. Chiky, M. Bahri, Anomalies detection using isolation in concept-drifting data streams, Computers 10 (2021) 13.
[22] C. Raab, M. Heusinger, F.-M. Schleif, Reactive soft prototype computing for concept drift streams, Neurocomputing 416 (2020) 340–351.
[23] J. An, S. Cho, Variational autoencoder based anomaly detection using reconstruction probability, Special Lecture on IE 2 (2015) 1–18.
[24] F. Concone, A. De Paola, G. Lo Re, M. Morana, Twitter analysis for real-time malware discovery, in: 2017 AEIT International Annual Conference, IEEE, 2017, pp. 1–6.
[25] F. Concone, G. Lo Re, M. Morana, S. K. Das, SpADe: Multi-stage spam account detection for online social networks, IEEE Transactions on Dependable and Secure Computing (2022).
[26] J. Vinagre, A. M. Jorge, C. Rocha, J. Gama, Statistically robust evaluation of stream-based recommender systems, IEEE Transactions on Knowledge and Data Engineering 33 (2019) 2971–2982.
[27] F. J. Massey Jr., The Kolmogorov-Smirnov test for goodness of fit, Journal of the American Statistical Association 46 (1951) 68–78.
[28] Ž. Vujović, et al., Classification model evaluation metrics, International Journal of Advanced Computer Science and Applications 12 (2021) 599–606.
[29] D. M. Powers, Evaluation: From precision, recall and F-factor to ROC, informedness, markedness & correlation, Journal of Machine Learning Technologies 2 (2011) 37–63.
[30] V. Agate, A. De Paola, G. Lo Re, M. Morana, A platform for the evaluation of distributed reputation algorithms, in: 2018 IEEE/ACM 22nd International Symposium on Distributed Simulation and Real Time Applications (DS-RT), IEEE, 2018, pp. 1–8.
[31] V. Agate, A. De Paola, G. Lo Re, M. Morana, A simulation software for the evaluation of vulnerabilities in reputation management systems, ACM Transactions on Computer Systems (TOCS) 37 (2021) 1–30.