Detection of DDoS Attacks with Gaussian Mixture Model

Alessandro Cecchetto¹, Giuseppa Conte¹ and Christian Napoli¹,²,³

¹ Department of Computer, Control and Management Engineering, Sapienza University of Rome, Via Ariosto 25, Roma, 00185, Italy
² Institute for Systems Analysis and Computer Science, Italian National Research Council, Via dei Taurini 19, Roma, 00185, Italy
³ Department of Computational Intelligence, Czestochowa University of Technology, 42-201 Czestochowa, Poland

Abstract
A Distributed Denial of Service (DDoS) attack aims to stop or tamper with an online service by incapacitating a server with a flood of packets or requests, over the Internet or an intranet. The main aim of a DDoS attack is to overwhelm the network or server with abnormal traffic so that the service becomes unavailable to legitimate users. This problem is particularly profound due to the development of emerging technologies, such as cloud computing, the Internet of Things, and artificial intelligence techniques, which attackers can exploit to launch a huge volume of DDoS attacks at a lower cost; moreover, DDoS attacks are much harder to detect and prevent because DDoS traffic is similar to normal traffic. In this paper we implement a novel technique based on an unsupervised Gaussian Mixture Model (GMM) algorithm. Using a real traffic dataset, CIC-DDoS2019, as a benchmark, the proposed GMM achieves recall, precision, and accuracy up to 99%. Experiments reveal that this can be a promising solution for the detection of DDoS attacks.

Keywords
Distributed Denial of Service (DDoS), Gaussian Mixture Model (GMM), Machine Learning, Detection

1. Introduction
DoS (Denial of Service) attacks deplete the network bandwidth and computing resources of a targeted system by flooding it with malicious traffic, preventing the target system from offering regular services to legitimate users. DDoS (Distributed Denial of Service) goes even further, on a much larger scale. DDoS attacks can take over a large number of compromised systems, called bots, constituting a botnet, which is used to launch coordinated attacks on the victim system; based on this attack behavior, DDoS attacks can be divided into several branches, as reported in [1]. Along with the emergence and advancement of Internet technologies [2], DDoS attacks are evolving and proliferating in scale, frequency, and sophistication. Organizations face potential threats to their network environment that may cause severe impacts on their operations, such as business downtime, data breaches, or even ransom demands from hackers [3].

The detection of DDoS attacks is essential before any mitigation approach can be taken. In the early era, the alarm for DDoS attacks was triggered by rules programmed by traffic engineers, but in the current cybersecurity scenario the application of artificial intelligence and, more specifically, machine learning (ML) offers new and promising perspectives. Training predictive models to recognize anomalous patterns in network traffic provides a more agile and proactive means of detecting attacks. The Gaussian Mixture Model (GMM) is one such model, used for its ability to model complex distributions of data, such as those that characterize network traffic. In this paper, we propose a novel ML method based on GMM for detecting DDoS malicious packets.

Various ML technologies have been employed, mainly as classifiers, in the detection of DDoS attacks. Meng Wang et al. [4] proposed a dynamic multilayer perceptron (MLP) combined with a feature selection technique to detect DDoS attacks, where a feedback mechanism is applied to promote and reconstruct the detector system when detection is not accurate. In their model, as the complexity of the traffic network increases and changes, some of the selected features become unable to distinguish attack traffic from normal traffic, causing the detector to fail. Nhu-Ngoc Dao et al. [5] proposed a source-based IP filtering technique to defeat DDoS attacks. The approach tries to distinguish three kinds of users: the malicious user, who has a fixed source IP address and injects spoofed packets into the switch indefinitely; the DDoS attacking user, who sends spoofed packets to the switch indefinitely; and the frequent user, who acts as a normal user. The method distinguishes them and processes them differently according to the user type. It works well when the attack traffic is not very large and the attack type is mainly a flooding one; however, to use it, the network must first be surveyed and two parameters initialized for the detection method, so a kind of "setup" time is necessary to fix such parameters. The effectiveness of the method may therefore be affected by these artificial parameters, which increase the uncertainty of the detection result. Seyed Mohammad Mousavi et al. [6] proposed a solution to detect DDoS attacks based on the entropy variation of the destination IP address. Although it is a lightweight and effective detection method, a single factor cannot be relied upon when detecting DDoS attacks, since many factors can be used to identify them, and they can manifest themselves in very different ways beyond the proposed control; the method lacks a comprehensive consideration of multiple factors. Uygar Dincalp et al. [7] proposed a method based on the clustering algorithm DBSCAN for analyzing network traffic in order to catch the changes and varieties in attack vectors, showing what the attack is and where it is based. The proposed methodology strongly depends on a fixed threshold to raise the alarm for a possible attack; it is not clearly explained how the threshold is defined, but the results show that the proposed system worked well with the chosen attributes in their experiments. Akella et al. [8] proposed a detection mechanism where each intermediate router detects traffic anomalies using profiles of normal traffic. Each router keeps track of destinations whose traffic occupies more than a fraction of the capacity of the outgoing link, and sends this information to its neighbors; an attack is detected by the intermediate routers if the gathered traffic information for a specific destination exceeds a predefined threshold. This scheme cannot distinguish flash crowds produced by a spike of normal traffic from DDoS attacks, so the false alarm rate increases.

The remaining part of this paper is organized as follows: Section 2 describes and analyzes the dataset used. Section 3 explains all the steps of the workflow, from feature pre-processing to the GMM, together with the model development. Section 4 presents the experimental setup. Section 5 reports the obtained results, also taking into account the Cross-Validation phase. Finally, conclusions and future implementations are discussed in Section 6.

SYSYEM 2023: 9th Scholar's Yearly Symposium of Technology, Engineering and Mathematics, Rome, December 3-6, 2023
cecchetto@diag.uniroma1.it (A. Cecchetto); giuseppa.conte@uniroma1.it (G. Conte); cnapoli@diag.uniroma1.it (C. Napoli)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

2. Dataset

The proposed model is trained and validated on the dataset released by the Canadian Institute for Cybersecurity (CIC), namely CIC-DDoS2019 [9]. The dataset offers an extended set of Distributed Denial of Service attacks, most of which employ some form of amplification through reflection. These attacks are conducted concealing the attacker's identity through the IP spoofing technique, in which packets are sent to reflector servers by attackers with the source IP address set to the target victim's. The dataset contains benign traffic and the most up-to-date common DDoS attacks; the two classes are divided as reported in Figure 1, with a total of 431,371 records described by 88 features. A slight imbalance is present in the composition of the dataset, where the connections associated with attacks represent approximately 77% of the total. This slight lack of data for the benign connections could have been addressed through oversampling algorithms, creating artificial data based on the characteristics of the original ones; the choice in this experiment was not to make changes in this sense, as the imbalance between the two classes was not too accentuated and still allowed correct training of the model, whose performance evaluation is validated by the cross-validation step. The following types of attacks are present: UDP, MSSQL, Portmap, Syn, NetBIOS, UDPLag, LDAP, DrDoS_DNS, UDP-lag, WebDDoS, TFTP, DrDoS_UDP, DrDoS_SNMP, DrDoS_NetBIOS, DrDoS_LDAP, DrDoS_MSSQL, DrDoS_NTP.

Figure 1: Subdivision of the dataset.

3. Implementation

The proposed work (Figure 2) consists of two main phases: a data phase and a model phase. During the data phase, the following points are developed: feature pre-processing, feature selection, and dimensionality reduction. During the model phase, relating to the development of the Gaussian mixture model, the following points are addressed: model training, performance evaluation, and Cross-Validation.

Figure 2: Experiment workflow.

3.1. Feature pre-processing

The efficiency of classification techniques has to be improved through adequate data manipulation, which concerns different types of actions. Models trained on properly prepared data exhibit better performance than those trained on unprepared data, according to [10, 11, 12].

3.1.1. Feature Scaling

Feature scaling is a vital step in pre-processing data before building a machine learning model [13, 14, 15, 16]. The datasets used for model training in machine learning often contain values with varying scales, which can result in unfair comparisons between them. Feature scaling techniques address these challenges by adjusting the values so that they can be compared easily and fairly. The ML algorithm observes only numbers: if there is a significant difference in range, it assumes that numbers in the upper ranges are superior, so that features with larger numerical values have a greater effect on the distances between data points and dominate the other features when calculating distances. As a result, these larger numbers play a more critical role during model training. The scaling technique used in the experiment is Normalization, also known as Min-Max scaling, in which values are shifted and rescaled to the range between 0 and 1 without distorting the differences in the ranges of values or losing information.

3.1.2. Encoding

Many statistical learning algorithms require a numerical feature matrix as input, as is the case for the GMM. When categorical variables are present in the data, feature engineering is needed to encode the different categories into a suitable feature vector. Feature encoding is the process of transforming textual data into numerical values so that it can be fed to ML algorithms, resulting in improved model accuracy. Researchers have used many approaches to convert textual data into numerical values; in this work the "Label Encoding" technique [17] is used, applied to the target feature of the dataset, which has two different values, Attack and Benign. Label encoding has the advantage of being straightforward, yet it has the disadvantage that the numeric values can be "misinterpreted" by machine learning algorithms, since the number sequencing introduces a relation/comparison between them. There is no intrinsic relation between Attack and Benign, but looking at the numbers, 'Benign', which is encoded as 1, has higher precedence than 'Attack', which is encoded as 0; this kind of problem does not occur in this work.
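As an illustration, the Min-Max scaling and label encoding steps described above can be sketched with scikit-learn. The feature values below are hypothetical stand-ins for flow statistics (e.g. packet counts vs. mean inter-arrival times), not values taken from CIC-DDoS2019:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

# Hypothetical stand-in for two flow features on very different scales.
X = np.array([[1200.0, 0.002],
              [45000.0, 0.010],
              [300.0, 0.075]])
y = ["Attack", "Benign", "Attack"]

# Min-Max scaling: rescale every feature to [0, 1] without
# distorting the relative differences within a feature.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Label encoding of the binary target; LabelEncoder assigns codes in
# sorted order, so Attack -> 0 and Benign -> 1, as in the text.
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

print(X_scaled.min(), X_scaled.max())  # 0.0 1.0
print(list(y_encoded))                 # [0, 1, 0]
```

Note that the scaler is fitted per column, so each feature independently spans [0, 1] regardless of its original magnitude.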
3.1.3. Feature selection

Many ML models experience difficulty working with a large number of input features. Generally, features can be categorized as relevant, irrelevant, or redundant; the last two categories only increase the size of the input space [18], making the data harder to process without contributing to the learning process. To generate the best performing model, feature selection plays a major role: a subset of the available features is selected for the learning algorithm. The best subset is the one with the smallest number of dimensions that contributes most to learning accuracy, since an irrelevant feature does not help describe the target concept in any way, and a redundant feature adds nothing new to describing it; redundant features may even add more noise than useful information in describing the concept of interest. The main benefits of feature selection are: (i) reducing the measurement cost and storage requirements; (ii) coping with the degradation of classification performance due to the finite size of training sample sets; (iii) reducing training and utilization time; (iv) facilitating data visualization and data understanding; (v) reducing the risk of overfitting. This process can be carried out in three ways [19, 20]: filter, wrapper, and embedded.

The feature selection process of the proposed work uses a filter approach based on the Random Forest Classifier (RFC), in accordance with [21], since it shows the most suitable performance among the filtering approaches. Before the application of the model, the dataset is divided into train and test sets in the ratio 80:20, and feature selection uses only the training data rather than the entire dataset; this is particularly important in order to avoid so-called "data leakage". The main reason behind this choice is that the selection should be based only on the information available during the model training phase: if data from the test set were also used during feature selection, future information would guide the selection, introducing a bias into the results.

The RFC takes the training dataset and resamples it according to a procedure called "bootstrap". Each sample contains a random subset of the original columns and is used to fit a decision tree. Each tree of the random forest can calculate the importance of a feature according to its ability to increase the purity of the leaves: the higher the increment in leaf purity, the higher the importance of the feature. This is done for each tree, then averaged among all the trees and, finally, normalized to 1, so the sum of the importance scores calculated by a Random Forest is 1.

In this paper the Information Gain (IG) criterion is used for feature selection via the RFC. To use Information Gain for feature selection, an entropy value for each attribute of the data has to be calculated. The entropy value is used for ranking the features that affect data classification: a feature which does not have much effect on the classification has very small information gain and can be ignored without affecting the detection accuracy of a classifier [22]. IG calculates the amount of entropy (uncertainty) that is reduced as a result of splitting the data by a specific attribute. Hence, for each splitting attribute the information gain is calculated, and the attribute with the highest gain is chosen as the splitting attribute; this attribute creates minimum impurity or randomness in the generated splits, and hence it minimizes the information needed to classify the tuples. The entropy of a subset S is determined as follows:

Entropy(S) = − Σ_{i=1}^{C} p_i · log₂(p_i)    (1)

Here i = 1, ..., C ranges over the different classes in S, and p_i is the probability that an arbitrary tuple in S belongs to class C_i. Let A be a feature in S and a_1, a_2, ..., a_v the different values of attribute A in S, such that S_1, ..., S_v are the partitions generated based on these values. These partitions are likely to be impure. How much more information is still needed to arrive at an exact classification, or pure partition, is given by:

Entropy_A(S) = Σ_{i=1}^{v} (|S_i| / |S|) · Entropy(S_i)    (2)

The smaller this additional information, the greater the purity of the partition. The information gain of A is then:

IG(A) = Entropy(S) − Entropy_A(S)    (3)
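A minimal sketch of the entropy and information-gain computations in Eqs. (1)-(3), using only the standard library; the toy labels and feature values below are hypothetical:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the classes in S (Eq. 1)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v) (Eqs. 2-3),
    where S_v are the partitions induced by the values of feature A."""
    n = len(labels)
    partitions = {}
    for v, y in zip(feature_values, labels):
        partitions.setdefault(v, []).append(y)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

# Hypothetical toy data: a feature that perfectly separates the classes
# has IG equal to the full label entropy (1 bit here); a feature whose
# partitions keep the same class mix has IG 0.
labels = ["Attack", "Attack", "Benign", "Benign"]
perfect = ["a", "a", "b", "b"]
useless = ["a", "b", "a", "b"]
print(information_gain(perfect, labels))  # 1.0
print(information_gain(useless, labels))  # 0.0
```

This mirrors the ranking use described in the text: features with near-zero IG can be dropped without hurting the classifier.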
Once the IG for all the features of the dataset has been computed via the RFC, the Select From Model class is applied to the results to select the most important features, defining a threshold based on the average of the importance values coming from the RFC. The features whose IG is above the threshold are then selected; they are shown in Figure 3.

Figure 3: Selected features.

3.1.4. Dimensionality reduction

After the pre-processing and feature selection phases, the dataset still has a high dimensionality, with 28 features. A further reduction is therefore necessary, as we want a compact representation of the data. Dimensionality reduction has been performed using Principal Component Analysis (PCA) [23]. PCA simplifies the complexity of high-dimensional data while retaining trends and patterns. It does this by transforming the data into fewer dimensions, with minimal loss of overall dispersion, which act as summaries of the features. PCA reduces data by geometrically projecting them onto lower dimensions called principal components (PCs), defined as linear combinations of the data's original variables, with the goal of finding the best summary of the data using a limited number of PCs. The first PC is chosen to minimize the total distance between the data and their projection onto the PC; by minimizing this distance, we also maximize the variance of the projected points. The second (and each subsequent) PC is selected similarly, with the additional requirement of being orthogonal to (and therefore uncorrelated with) all previous PCs. Hence, the principal components represent the directions of the data that explain a maximal amount of variance, i.e., the lines that capture most of the information in the data. The relationship between variance and information here is that the larger the variance carried by a line, the larger the dispersion of the data points along it, and the larger the dispersion along a line, the more information it carries. PCs are thus new axes that provide the best angle from which to see and evaluate the data, so that the differences between observations are more visible.

Since PCA deals with distances, a fundamental step required to avoid falsification of the measurements is data standardization, because heterogeneous data representations influence the construction of the PCs: if a small set of variables has a much larger magnitude than the others, the components of the PCA are heavily weighted along those variables, while the other variables are ignored. As a consequence, the PCA simply recovers the values of these high-magnitude variables, so standardizing the scale of feature variation is essential. The consequence of this aspect is shown in Figure 4, with a comparison on the first principal component with and without standardized features. The first application of PCA, shown in Figure 5, is performed with 10 PCs, to see and analyze how the cumulative variance is distributed over a larger number of components. Subsequently, 3 components were extracted from the obtained results, since with 3 components it is possible to visualize the data and reduce the computational load, with a data loss of only 8%.
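The standardize-then-reduce pipeline above can be sketched with scikit-learn. The random matrix below is a synthetic stand-in for the 28 selected features (so the retained-variance figures will not match the paper's 8% loss):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the 28 selected flow features:
# columns on wildly different scales, as in raw traffic data.
X = rng.normal(size=(500, 28)) * rng.uniform(0.01, 1000.0, size=28)

# Without standardization, high-magnitude columns dominate the PCs,
# so the features are first rescaled to zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)

# First inspect 10 components to see how cumulative variance accrues,
# then keep only 3 for visualization and lower computational cost.
pca10 = PCA(n_components=10).fit(X_std)
print(np.cumsum(pca10.explained_variance_ratio_))

pca3 = PCA(n_components=3)
X_reduced = pca3.fit_transform(X_std)
print(X_reduced.shape)  # (500, 3)
```

Fitting the scaler and PCA on the training split only (and reusing them on the test split) keeps this step consistent with the data-leakage argument made for feature selection.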
Figure 4: Influence of the characteristics on the first principal component.

Figure 5: Cumulative Explained Variance applying PCA.

3.2. GMM

The Gaussian Mixture Model (GMM) [24] is a probabilistic model describing subsets of data within a general population that can be represented as a combination of normally distributed subpopulations. It is commonly used in unsupervised learning to learn the subpopulation models, which can also be separated automatically. Within a supervised approach, on the other hand, such models can be used to determine the boundaries of different subpopulations for classification purposes. In this latter case, the goal is to assign each data point to one of G pre-existing unordered classes (or populations), taking into account d observed variables X. The problem amounts to defining a function that maps an arbitrary observation x ∈ R^d to a prediction of the class from which it stems. This function, named the allocation rule, has to be estimated from the training data {x_ig, i = 1, ..., n_g and g = 1, ..., G}, which consist of the observations of the vector X and the class membership for n = Σ_{g=1}^{G} n_g items. In supervised classification, their use is related to the Bayes allocation rule, which is known to minimize the expected error rate in class membership prediction. The Bayes rule allocates x to the class ĝ with the highest posterior probability among the C classes, that is:

ĝ = arg max_{g=1,...,C} {π_g p_g(x)} = arg max_{g=1,...,C} {P(C_g) P(x|C_g)}    (4)

where π_g denotes the a priori probability of class g, which is known or estimated from the training data, and p_g(x) denotes the class-conditional probability distribution of X. For continuous predictors, mixtures of multivariate normal densities are preferred because of their computational convenience:

p_g(x) = Σ_{h=1}^{H_g} w_gh φ(x | μ_gh, Σ_gh)    (5)

In mixture (5) the components φ(·) are H_g d-dimensional Gaussian densities, each parameterized by its mean vector μ_gh and covariance matrix Σ_gh. The parameters w_gh (h = 1, ..., H_g) are the mixing proportions, also named mixture coefficients, which are constrained to be positive quantities that sum to 1. These parameters are determined by fitting model (5) to the data {x_ig, i = 1, ..., n_g and g = 1, ..., G}, usually by maximum likelihood via the expectation-maximization (EM) algorithm [25]. The fitted model is then plugged into rule (4).

3.2.1. Components estimation

To improve the performance of the GMM as a classifier, we need to correctly define how many Gaussian components should be used to approximate the data distribution: a GMM with too many Gaussian components may overfit the data [26], while a GMM with too few components may not be flexible enough to approximate the true underlying density distribution, underfitting the data. The number of components was estimated using the Bayesian Information Criterion (BIC) [27], a criterion for grading models based on the posterior probability of the models being compared. The BIC is given by:

BIC = −2 log f(x_n | Θ) + p log n    (6)

where f(x_n | Θ) is the likelihood of the chosen model, p is the number of parameters to be estimated, and n is the number of sample data points. Models with smaller values of BIC are preferable. Note that as the sample size increases, BIC imposes an increasingly large penalty per parameter and thus tends to select more parsimonious models. In fact, BIC tends to overly penalize complex models, so the goal is to find the model for which the criterion is minimized, preferring the point of the first "elbow" of the function.

4. Experimental setup

4.0.1. Model implementation

To obtain a better prediction on the data, two GMMs are implemented in two different ways: one models only the Benign data, the other only the Attack data. Hence, to extract the correct number of components for each model, the BIC is applied to each case; its change for different numbers of components is presented in Figure 6.
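Selecting the number of components by BIC, as done per class here, can be sketched as follows. The two-cluster data are synthetic, and scikit-learn's `GaussianMixture.bic` computes the same criterion as Eq. (6):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Synthetic 3-dimensional data drawn from two well-separated Gaussian
# clusters, standing in for the PCA-reduced features of one class.
X = np.vstack([
    rng.normal(loc=-2.0, scale=0.5, size=(300, 3)),
    rng.normal(loc=+2.0, scale=0.5, size=(300, 3)),
])

# Fit GMMs with a growing number of components and keep the BIC of
# each; the preferred model sits at the first clear "elbow"/minimum.
bics = []
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics.append(gmm.bic(X))

best_k = int(np.argmin(bics)) + 1
print(best_k)  # 2 for this two-cluster toy data
```

As in the paper's Attack model (five components chosen over the six-component minimum), a slightly smaller k near the minimum may be preferred when performance is equal, to save computation.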
Given the obtained results, for the GMM related to the Attack data five components are chosen: although the lowest BIC value corresponds to six components, the performance with five is equal, so five was preferred also in terms of computational complexity. The GMM related to the Benign data, on the other hand, is developed with three components, given the clear result. Hence, the two different models are defined as follows:

f_Attack(x) = Σ_{i=1}^{5} w_{Attack,i} φ(x | μ_{Attack,i}, Σ_{Attack,i})    (7)

f_Benign(x) = Σ_{i=1}^{3} w_{Benign,i} φ(x | μ_{Benign,i}, Σ_{Benign,i})    (8)

Figure 6: Changes in BIC for different numbers of components. (a) BIC related to Attack data. (b) BIC related to Benign data.

5. Results

For the classification of a new point, the posterior probability of belonging to each of the two classes is calculated:

P(C_i | x) = P(x | C_i) P(C_i) / P(x),  i ∈ {Attack, Benign}    (9)

then the point is associated with the class ĝ with the maximum posterior probability:

ĝ = arg max_{i ∈ {Attack, Benign}} {P(C_i) P(x | C_i)}    (10)

The metric used for evaluating the quality of the model's predictions is the balanced accuracy, the mean of Sensitivity and Specificity, where Sensitivity (True Positive Rate) is the probability of a positive case being accurately classified as positive, and Specificity (True Negative Rate) is the probability of a negative case being accurately classified as negative. These properties help the metric perform well with the slightly imbalanced dataset used, returning the average accuracy per class: by averaging Sensitivity and Specificity, imbalanced datasets are accounted for, since a model receives a worse balanced accuracy score if it only predicts accurately for the majority class. As a result, the balanced accuracy reports an accuracy equal to 99%. A summary of the obtained results is presented below, with a confusion matrix and then a classification report over the 86,275 samples used as test data.

Table 1
Confusion matrix.
              Attack   Benign
Attack        66361    295
Benign        292      19327

Table 2
Classification report.
              Precision  Recall  F1-score  Support
Attack        1.00       1.00    1.00      66656
Benign        0.98       0.99    0.99      19619
Accuracy                         0.99      86275
Macro avg     0.99       0.99    0.99      86275
Weighted avg  0.99       0.99    0.99      86275

5.0.1. Cross-Validation

Cross-validation is one of the most widely used data resampling methods for estimating the true prediction error of models. The cross-validation technique used in the experiment is Stratified k-fold, in which the available learning set is partitioned into k disjoint subsets of approximately equal size, each preserving the class proportions of the original dataset. The word "fold" refers to the number of resulting subsets. The partitioning is performed by randomly sampling cases from the learning set without replacement. The model is trained using k − 1 subsets, which together represent the training set; it is then applied to the remaining subset, denoted as the validation set, and the performance is measured. This procedure is repeated until each of the k subsets has served as validation set. Finally, the average of the k performances on the k validation sets is calculated, which represents the cross-validation performance; the standard deviation is also calculated, considering the result of each validation set with respect to the average performance score.

Table 3
Cross-Validation results.
Cross-validation performance  98.7%
Standard deviation            0.003
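Putting the pieces together, the two per-class GMMs (Eqs. (7)-(8)), the posterior rule (Eqs. (9)-(10)), and the stratified k-fold evaluation can be sketched on synthetic imbalanced data. The class means, sample sizes, and the component counts passed as defaults are illustrative, not the paper's fitted values:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(7)
# Synthetic stand-in for the 3 PCA features: Attack and Benign samples
# drawn from different Gaussians, with roughly the paper's 77/23 imbalance.
X = np.vstack([rng.normal(-1.5, 1.0, (770, 3)), rng.normal(1.5, 1.0, (230, 3))])
y = np.array([0] * 770 + [1] * 230)  # 0 = Attack, 1 = Benign

def fit_predict(X_tr, y_tr, X_te, n_attack=5, n_benign=3):
    """One GMM per class; classify by the largest P(C_i) * p(x | C_i)."""
    gmm_a = GaussianMixture(n_components=n_attack, random_state=0).fit(X_tr[y_tr == 0])
    gmm_b = GaussianMixture(n_components=n_benign, random_state=0).fit(X_tr[y_tr == 1])
    prior_a = np.mean(y_tr == 0)
    log_post = np.column_stack([
        gmm_a.score_samples(X_te) + np.log(prior_a),      # log p(x|Attack) + log P(Attack)
        gmm_b.score_samples(X_te) + np.log(1 - prior_a),  # log p(x|Benign) + log P(Benign)
    ])
    return np.argmax(log_post, axis=1)

# Stratified 5-fold cross-validation of the balanced accuracy.
scores = []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    scores.append(balanced_accuracy_score(y[te], fit_predict(X[tr], y[tr], X[te])))
print(round(float(np.mean(scores)), 3), round(float(np.std(scores)), 3))
```

Working in log space avoids underflow in the density products; the normalizer P(x) in Eq. (9) is omitted because it does not change the arg max.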
6. Conclusions

In this paper, a procedure for fitting Gaussian mixture models (GMM) oriented to supervised classification has been proposed for classifying DDoS attacks. The proposed method demonstrates exceptional adaptability, flexibility, and performance in detecting DDoS attacks, making it a promising solution for enhancing cybersecurity in critical infrastructures. Starting from the raw data of the CIC-DDoS2019 dataset, a pre-processing phase was necessary, in which the data were processed and the characteristics relevant to the purpose subsequently selected. The results on the test data give an accuracy of 99%, confirmed by the Cross-Validation with a 98.7% accuracy, highlighting a reliable result free of problems such as overfitting, which in some scenarios can conceal the real performance. As part of our future work, we will first focus on improving the DDoS attack data, increasing the diversity of the training data to cover as many types of attack as possible, but also including normal network traffic data in order to reach a more balanced dataset. This component plays a crucial role in the success of our strategy, as it requires a deep understanding of DDoS attacks; to achieve this goal, we plan to automate the process as much as possible. Subsequently, the GMM could be extended to classify more than two classes, e.g., classifying each kind of DDoS attack, or modeling a multi-class problem where the classes are different kinds of cyberattacks.

Acknowledgments

This work has been developed at is.Lab() Intelligent Systems Laboratory at the Department of Computer, Control, and Management Engineering, Sapienza University of Rome (https://islab.diag.uniroma1.it). The work has also been partially supported by the Italian Ministerial grant PRIN 2022 "ISIDE: Intelligent Systems for Infrastructural Diagnosis in smart-concretE", n. 2022S88WAY - CUP B53D2301318, and by the Age-It: Ageing Well in an ageing society project, task 9.4.1 work package 4 spoke 9, within topic 8 extended partnership 8, under the National Recovery and Resilience Plan (PNRR), Mission 4 Component 2 Investment 1.3 - Call for tender No. 1557 of 11/10/2022 of the Italian Ministry of University and Research, funded by the European Union - NextGenerationEU, CUP B53C22004090006.

References

[1] N. Hoque, D. K. Bhattacharyya, J. K. Kalita, Botnet in DDoS attacks: trends and challenges, IEEE Communications Surveys & Tutorials 17 (2015) 2242–2270.
[2] S. I. Illari, S. Russo, R. Avanzato, C. Napoli, A cloud-oriented architecture for the remote assessment and follow-up of hospitalized patients, volume 2694, 2020, pp. 29–35.
[3] Genie-Networks, DDoS attack statistics and trends report for 2020 (2021).
[4] M. Wang, Y. Lu, J. Qin, A dynamic MLP-based DDoS attack detection method using feature selection and feedback, Computers & Security 88 (2020) 101645.
[5] N.-N. Dao, J. Park, M. Park, S. Cho, A feasible method to combat against DDoS attack in SDN network, in: 2015 International Conference on Information Networking (ICOIN), IEEE, 2015, pp. 309–311.
[6] S. M. Mousavi, M. St-Hilaire, Early detection of DDoS attacks against SDN controllers, in: 2015 International Conference on Computing, Networking and Communications (ICNC), IEEE, 2015, pp. 77–81.
[7] U. Dincalp, M. S. Güzel, O. Sevine, E. Bostanci, I. Askerzade, Anomaly based distributed denial of service attack detection and prevention with machine learning, in: 2018 2nd International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), IEEE, 2018, pp. 1–4.
[8] A. Aditya, Detecting DDoS attacks on ISP networks, in: Proc. of ACM SIGMOD Workshop on Management and Processing of Data Streams, 2003.
[9] I. Sharafaldin, A. H. Lashkari, S. Hakak, A. A. Ghorbani, Developing realistic distributed denial of service (DDoS) attack dataset and taxonomy, in: 2019 International Carnahan Conference on Security Technology (ICCST), IEEE, 2019, pp. 1–8.
[10] B. Xue, M. Zhang, Evolutionary feature manipulation in data mining/big data, ACM SIGEVOlution 10 (2017) 4–11.
[11] V. Ponzi, S. Russo, V. Bianco, C. Napoli, A. Wajda, Psychoeducative social robots for an healthier lifestyle using artificial intelligence: a case-study, volume 3118, 2021, pp. 26–33.
[12] G. Lo Sciuto, S. Russo, C. Napoli, A cloud-based flexible solution for psychometric tests validation, administration and evaluation, volume 2468, 2019, pp. 16–21.
[13] G. Capizzi, G. L. Sciuto, C. Napoli, E. Tramontana, A multithread nested neural network architecture to model surface plasmon polaritons propagation, Micromachines 7 (2016). doi:10.3390/mi7070110.
[14] M. M. Ahsan, M. P. Mahmud, P. K. Saha, K. D. Gupta, Z. Siddique, Effect of data scaling methods on machine learning algorithms and model performance, Technologies 9 (2021) 52.
[15] G. Capizzi, G. L. Sciuto, C. Napoli, R. Shikler, M. Wozniak, Optimizing the organic solar cell manufacturing process by means of AFM measurements and neural networks, Energies 11 (2018). doi:10.3390/en11051221.
[16] S. Pepe, S. Tedeschi, N. Brandizzi, S. Russo, L. Iocchi, C. Napoli, Human attention assessment using a machine learning approach with GAN-based data augmentation technique trained using a custom dataset, OBM Neurobiology 6 (2022). doi:10.21926/obm.neurobiol.2204139.
[17] F. Pargent, F. Pfisterer, J. Thomas, B. Bischl, Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features, Computational Statistics 37 (2022) 2671–2692.
[18] B. Bhattacharya, D. P. Solomatine, Machine learning in sedimentation modelling, Neural Networks 19 (2006) 208–214.
[19] A. Jović, K. Brkić, N. Bogunović, A review of feature selection methods with applications, in: 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), IEEE, 2015, pp. 1200–1205.
[20] V. Ponzi, S. Russo, A. Wajda, R. Brociek, C. Napoli, Analysis pre and post COVID-19 pandemic Rorschach test data of using EM algorithms and GMM models, volume 3360, 2022, pp. 55–63.
[21] M. A. Salam, A. T. Azar, M. S. Elgendy, K. M. Fouad, The effect of different dimensionality reduction techniques on machine learning overfitting problem, Int. J. Adv. Comput. Sci. Appl 12 (2021) 641–655.
[22] B. Azhagusundari, A. S. Thanamani, et al., Feature selection based on information gain, International Journal of Innovative Technology and Exploring Engineering (IJITEE) 2 (2013) 18–21.
[23] F. L. Gewers, G. R. Ferreira, H. F. D. Arruda, F. N. Silva, C. H. Comin, D. R. Amancio, L. d. F. Costa, Principal component analysis: A natural approach to data exploration, ACM Computing Surveys (CSUR) 54 (2021) 1–34.
[24] D. A. Reynolds, et al., Gaussian mixture models, Encyclopedia of Biometrics 741 (2009).
[25] B. Barazandeh, M. Razaviyayn, On the behavior of the expectation-maximization algorithm for mixture models, in: 2018 IEEE Global Conference on Signal and Information Processing (GlobalSIP), IEEE, 2018, pp. 61–65.
[26] V. Nannen, The paradox of overfitting, Ph.D. thesis, Faculty of Science and Engineering, 2003.
[27] G. Schwarz, Estimating the dimension of a model, The Annals of Statistics (1978) 461–464.