Detection of DDoS Attacks with Gaussian Mixture Model

Alessandro Cecchetto¹, Giuseppa Conte¹ and Christian Napoli¹,²,³

¹ Department of Computer, Control and Management Engineering, Sapienza University of Rome, Via Ariosto 25, Roma, 00185, Italy
² Institute for Systems Analysis and Computer Science, Italian National Research Council, Via dei Taurini 19, Roma, 00185, Italy
³ Department of Computational Intelligence, Czestochowa University of Technology, 42-201 Czestochowa, Poland

Abstract
A Distributed Denial of Service (DDoS) attack aims to stop or tamper with an online service by incapacitating a server with a flood of packets or requests, over the Internet or an intranet. The main aim of a DDoS attack is to overwhelm the network or server with abnormal traffic so that the service becomes unavailable to legitimate users. This problem is particularly profound due to the development of emerging technologies, such as cloud computing, the Internet of Things, and artificial intelligence techniques, which attackers can exploit to launch a huge volume of DDoS attacks at a lower cost; moreover, DDoS attacks are much harder to detect and prevent because DDoS traffic is similar to normal traffic. In this paper we implement a novel technique based on an unsupervised Gaussian Mixture Model (GMM) algorithm. Using a real traffic dataset, CIC-DDoS2019, as a benchmark, the proposed GMM achieves recall, precision, and accuracy up to 99%. Experiments reveal that this can be a promising solution for the detection of DDoS attacks.

Keywords
Distributed Denial of Service (DDoS), Gaussian Mixture Model (GMM), Machine Learning, Detection

1. Introduction
DoS (Denial of Service) attacks deplete the network bandwidth and computing resources of a targeted system by flooding it with malicious traffic, preventing the target system from offering regular services to legitimate users. DDoS (Distributed Denial of Service) goes even further, on a much larger scale. DDoS attacks can take over a large number of compromised systems, called bots, constituting a botnet, which is used to launch coordinated attacks on the victim system; based on this attack behavior, DDoS attacks can be divided into several branches, as reported in [1]. Along with the emergence and advancement of Internet technologies [2], DDoS attacks are evolving and proliferating in scale, frequency, and sophistication. Organizations face potential threats to their network environment that may cause severe impacts on their operations, such as business downtime, data breaches, or even ransom demands from hackers [3].

The detection of DDoS attacks is essential before any mitigation approach can be taken. In the early era, the alarm for DDoS attacks was triggered by rules programmed by traffic engineers, but in the current cybersecurity scenario the application of artificial intelligence and, more specifically, machine learning (ML) offers new and promising perspectives. Training predictive models to recognize anomalous patterns in network traffic provides a more agile and proactive means of detecting attacks. The Gaussian Mixture Model (GMM) is one such model, used for its ability to model complex distributions of data, such as those that characterize network traffic. In this paper, we propose a novel ML method based on GMM for detecting DDoS malicious packets.

Various ML technologies have been employed, mainly as classifiers, in the detection of DDoS attacks. Meng Wang et al. [4] proposed a dynamic multilayer perceptron (MLP) combined with a feature selection technique to detect DDoS attacks, where a feedback mechanism is applied to promote and reconstruct the detector system when detection is not accurate. In their model, as the complexity of the traffic network increases and changes, some of the selected features become unable to distinguish attack traffic from normal traffic, causing the detector to fail. Nhu-Ngoc Dao et al. [5] proposed a source-based IP filtering technique to defeat DDoS attacks. The approach tries to distinguish three kinds of users: the malicious user, who has a fixed source IP address and injects spoofed packets into the switch indefinitely; the DDoS attacking user, who sends spoofed packets to the switch indefinitely; and the frequent user, who acts as a normal user. The method distinguishes them and processes them differently according to the user type. It works well when the attack traffic is not very large and the attack type is mainly a flooding one; however, to use it, the network must first be surveyed and two parameters initialized for the detection method, so a kind of "setup" time is necessary to fix such parameters. The effectiveness of the method may therefore be affected by these artificial parameters, which increase the uncertainty of the detection result. Seyed Mohammad Mousavi et al. [6] proposed a solution to detect DDoS attacks based on the entropy variation of the destination IP address. Although it is a lightweight and effective detection method, a single factor cannot be relied upon when detecting DDoS attacks, since many factors can be used to identify them, and they can manifest themselves in very different ways beyond the proposed control; the method lacks a comprehensive consideration of multiple factors. Uygar Dincalp et al. [7] proposed a method based on the clustering algorithm DBSCAN for analyzing network traffic in order to catch the changes and varieties in attack vectors, showing what the attack is and where it is based. The proposed methodology strongly depends on a fixed threshold to raise the alarm for a possible attack; it is not clearly explained how the threshold is defined, but the results show that the proposed system worked well with the chosen attributes in their experiments. Akella et al. [8] proposed a detection mechanism where each intermediate router detects traffic anomalies using profiles of normal traffic. Each router keeps track of destinations whose traffic occupies more than a fraction of the capacity of the outgoing link, and sends this information to its neighbors; an attack is detected by the intermediate routers if the gathered traffic information for a specific destination exceeds a predefined threshold. This scheme cannot distinguish flash crowds produced by a spike of normal traffic from DDoS attacks, so the false alarm rate increases.

The remaining part of this paper is organized as follows: Section 2 describes and analyzes the dataset used. Section 3 explains all the steps of the workflow, from feature pre-processing to the GMM, together with the model development. Section 4 presents the experimental setup. Section 5 reports the obtained results, also taking into account the Cross-Validation phase. Finally, conclusions and future implementations are discussed in Section 6.

SYSYEM 2023: 9th Scholar's Yearly Symposium of Technology, Engineering and Mathematics, Rome, December 3-6, 2023
cecchetto@diag.uniroma1.it (A. Cecchetto); giuseppa.conte@uniroma1.it (G. Conte); cnapoli@diag.uniroma1.it (C. Napoli)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

2. Dataset

The proposed model is trained and validated on the dataset released by the Canadian Institute for Cybersecurity (CIC), namely CIC-DDoS2019 [9]. The dataset offers an extended set of Distributed Denial of Service attacks, most of which employ some form of amplification through reflection. These attacks are conducted concealing the attacker's identity through the IP spoofing technique, in which packets are sent to reflector servers by attackers with the source IP address set to the target victim's. The dataset contains benign traffic and the most up-to-date common DDoS attacks; the two classes are divided as reported in Figure 1, with a total of 431,371 records described by 88 features. A slight imbalance is present in the composition of the dataset, where the connections associated with attacks represent approximately 77% of the total. This slight lack of data for the benign connections could have been addressed through oversampling algorithms, creating artificial data based on the characteristics of the original ones; the choice in this experiment was not to make changes in this sense, as the imbalance between the two classes was not too accentuated and still allowed correct training of the model, whose performance evaluation is validated by the cross-validation step. The following types of attacks are present: UDP, MSSQL, Portmap, Syn, NetBIOS, UDPLag, LDAP, DrDoS_DNS, UDP-lag, WebDDoS, TFTP, DrDoS_UDP, DrDoS_SNMP, DrDoS_NetBIOS, DrDoS_LDAP, DrDoS_MSSQL, DrDoS_NTP.

Figure 1: Subdivision of the dataset.

3. Implementation

The proposed work (Figure 2) consists of two main phases: a data phase and a model phase. During the data phase, the following points are developed: feature pre-processing, feature selection, and dimensionality reduction. During the model phase, relating to the development of the Gaussian mixture model, the following points are addressed: model training, performance evaluation, and Cross-Validation.

Figure 2: Experiment workflow.

3.1. Feature pre-processing

The efficiency of classification techniques has to be improved through adequate data manipulation, which concerns different types of actions. Models trained on properly prepared data exhibit better performance than those trained on unprepared data, according to [10, 11, 12].

3.1.1. Feature Scaling

Feature scaling is a vital step in pre-processing data before building a machine learning model [13, 14, 15, 16]. The datasets used for model training in machine learning often contain values with varying scales, which can result in unfair comparisons between them. Feature scaling techniques address these challenges by adjusting the values so that they can be compared easily and fairly. The ML algorithm observes only numbers: if there is a significant difference in range, it assumes that numbers in the upper ranges are superior, so that features with larger numerical values have a greater effect on the distances between data points and dominate the other features when calculating distances. As a result, these larger numbers play a more critical role during model training. The scaling technique used in the experiment is Normalization, also known as Min-Max scaling, in which values are shifted and rescaled to the range between 0 and 1 without distorting the differences in the ranges of values or losing information.

3.1.2. Encoding

Many statistical learning algorithms require a numerical feature matrix as input, as is the case for the GMM. When categorical variables are present in the data, feature engineering is needed to encode the different categories into a suitable feature vector. Feature encoding is the process of transforming textual data into numerical values so that it can be fed to ML algorithms, resulting in improved model accuracy. Researchers have used many approaches to convert textual data into numerical values; in this work the "Label Encoding" technique [17] is used, applied to the target feature of the dataset, which has two different values, Attack and Benign. Label encoding has the advantage of being straightforward, yet it has the disadvantage that the numeric values can be "misinterpreted" by machine learning algorithms, since the number sequencing introduces a relation/comparison between them. There is no intrinsic relation between Attack and Benign, but looking at the numbers, 'Benign', which is encoded as 1, has higher precedence than 'Attack', which is encoded as 0; this kind of problem does not occur in this work.
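As an illustration, the Min-Max scaling and label encoding steps described above can be sketched with scikit-learn. The feature values below are hypothetical stand-ins for flow statistics (e.g. packet counts vs. mean inter-arrival times), not values taken from CIC-DDoS2019:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

# Hypothetical stand-in for two flow features on very different scales.
X = np.array([[1200.0, 0.002],
              [45000.0, 0.010],
              [300.0, 0.075]])
y = ["Attack", "Benign", "Attack"]

# Min-Max scaling: rescale every feature to [0, 1] without
# distorting the relative differences within a feature.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Label encoding of the binary target; LabelEncoder assigns codes in
# sorted order, so Attack -> 0 and Benign -> 1, as in the text.
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

print(X_scaled.min(), X_scaled.max())  # 0.0 1.0
print(list(y_encoded))                 # [0, 1, 0]
```

Note that the scaler is fitted per column, so each feature independently spans [0, 1] regardless of its original magnitude.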
3.1.3. Feature selection

Many ML models experience difficulty working with a large number of input features. Generally, features can be categorized as relevant, irrelevant, or redundant; the last two categories only increase the size of the input space [18], making the data harder to process without contributing to the learning process. To generate the best performing model, feature selection plays a major role: a subset of the available features is selected for the learning algorithm. The best subset is the one with the smallest number of dimensions that contributes most to learning accuracy, since an irrelevant feature does not help describe the target concept in any way, and a redundant feature adds nothing new to describing it; redundant features may even add more noise than useful information in describing the concept of interest. The main benefits of feature selection are: (i) reducing the measurement cost and storage requirements; (ii) coping with the degradation of classification performance due to the finite size of training sample sets; (iii) reducing training and utilization time; (iv) facilitating data visualization and data understanding; (v) reducing the risk of overfitting. This process can be carried out in three ways [19, 20]: filter, wrapper, and embedded.

The feature selection process of the proposed work uses a filter approach based on the Random Forest Classifier (RFC), in accordance with [21], since it shows the most suitable performance among the filtering approaches. Before the application of the model, the dataset is divided into train and test sets in the ratio 80:20, and feature selection uses only the training data rather than the entire dataset; this is particularly important in order to avoid so-called "data leakage". The main reason behind this choice is that the selection should be based only on the information available during the model training phase: if data from the test set were also used during feature selection, future information would guide the selection, introducing a bias into the results.

The RFC takes the training dataset and resamples it according to a procedure called "bootstrap". Each sample contains a random subset of the original columns and is used to fit a decision tree. Each tree of the random forest can calculate the importance of a feature according to its ability to increase the purity of the leaves: the higher the increment in leaf purity, the higher the importance of the feature. This is done for each tree, then averaged among all the trees and, finally, normalized to 1, so the sum of the importance scores calculated by a Random Forest is 1.

In this paper the Information Gain (IG) criterion is used for feature selection via the RFC. To use Information Gain for feature selection, an entropy value for each attribute of the data has to be calculated. The entropy value is used for ranking the features that affect data classification: a feature which does not have much effect on the classification has very small information gain and can be ignored without affecting the detection accuracy of a classifier [22]. IG calculates the amount of entropy (uncertainty) that is reduced as a result of splitting the data by a specific attribute. Hence, for each splitting attribute the information gain is calculated, and the attribute with the highest gain is chosen as the splitting attribute; this attribute creates minimum impurity or randomness in the generated splits, and hence it minimizes the information needed to classify the tuples. The entropy of a subset S is determined as follows:

Entropy(S) = − Σ_{i=1}^{C} p_i · log₂(p_i)    (1)

Here i = 1, ..., C ranges over the different classes in S, and p_i is the probability that an arbitrary tuple in S belongs to class C_i. Let A be a feature in S and a_1, a_2, ..., a_v the different values of attribute A in S, such that S_1, ..., S_v are the partitions generated based on these values. These partitions are likely to be impure. How much more information is still needed to arrive at an exact classification, or pure partition, is given by:

Entropy_A(S) = Σ_{i=1}^{v} (|S_i| / |S|) · Entropy(S_i)    (2)

The smaller this additional information, the greater the purity of the partition. The information gain of A is then:

IG(A) = Entropy(S) − Entropy_A(S)    (3)
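A minimal sketch of the entropy and information-gain computations in Eqs. (1)-(3), using only the standard library; the toy labels and feature values below are hypothetical:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the classes in S (Eq. 1)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v) (Eqs. 2-3),
    where S_v are the partitions induced by the values of feature A."""
    n = len(labels)
    partitions = {}
    for v, y in zip(feature_values, labels):
        partitions.setdefault(v, []).append(y)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

# Hypothetical toy data: a feature that perfectly separates the classes
# has IG equal to the full label entropy (1 bit here); a feature whose
# partitions keep the same class mix has IG 0.
labels = ["Attack", "Attack", "Benign", "Benign"]
perfect = ["a", "a", "b", "b"]
useless = ["a", "b", "a", "b"]
print(information_gain(perfect, labels))  # 1.0
print(information_gain(useless, labels))  # 0.0
```

This mirrors the ranking use described in the text: features with near-zero IG can be dropped without hurting the classifier.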
Once the IG for all the features of the dataset has been computed via the RFC, the Select From Model class is applied to the results to select the most important features, defining a threshold based on the average of the importance values coming from the RFC. The features whose IG is above the threshold are then selected; they are shown in Figure 3.

Figure 3: Selected features.

3.1.4. Dimensionality reduction

After the pre-processing and feature selection phases, the dataset still has a high dimensionality, with 28 features. A further reduction is therefore necessary, as we want a compact representation of the data. Dimensionality reduction has been performed using Principal Component Analysis (PCA) [23]. PCA simplifies the complexity of high-dimensional data while retaining trends and patterns. It does this by transforming the data into fewer dimensions, with minimal loss of overall dispersion, which act as summaries of the features. PCA reduces data by geometrically projecting them onto lower dimensions called principal components (PCs), defined as linear combinations of the data's original variables, with the goal of finding the best summary of the data using a limited number of PCs. The first PC is chosen to minimize the total distance between the data and their projection onto the PC; by minimizing this distance, we also maximize the variance of the projected points. The second (and each subsequent) PC is selected similarly, with the additional requirement of being orthogonal to (and therefore uncorrelated with) all previous PCs. Hence, the principal components represent the directions of the data that explain a maximal amount of variance, i.e., the lines that capture most of the information in the data. The relationship between variance and information here is that the larger the variance carried by a line, the larger the dispersion of the data points along it, and the larger the dispersion along a line, the more information it carries. PCs are thus new axes that provide the best angle from which to see and evaluate the data, so that the differences between observations are more visible.

Since PCA deals with distances, a fundamental step required to avoid falsification of the measurements is data standardization, because heterogeneous data representations influence the construction of the PCs: if a small set of variables has a much larger magnitude than the others, the components of the PCA are heavily weighted along those variables, while the other variables are ignored. As a consequence, the PCA simply recovers the values of these high-magnitude variables, so standardizing the scale of feature variation is essential. The consequence of this aspect is shown in Figure 4, with a comparison on the first principal component with and without standardized features. The first application of PCA, shown in Figure 5, is performed with 10 PCs, to see and analyze how the cumulative variance is distributed over a larger number of components. Subsequently, 3 components were extracted from the obtained results, since with 3 components it is possible to visualize the data and reduce the computational load, with a data loss of only 8%.
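The standardize-then-reduce pipeline above can be sketched with scikit-learn. The random matrix below is a synthetic stand-in for the 28 selected features (so the retained-variance figures will not match the paper's 8% loss):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the 28 selected flow features:
# columns on wildly different scales, as in raw traffic data.
X = rng.normal(size=(500, 28)) * rng.uniform(0.01, 1000.0, size=28)

# Without standardization, high-magnitude columns dominate the PCs,
# so the features are first rescaled to zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)

# First inspect 10 components to see how cumulative variance accrues,
# then keep only 3 for visualization and lower computational cost.
pca10 = PCA(n_components=10).fit(X_std)
print(np.cumsum(pca10.explained_variance_ratio_))

pca3 = PCA(n_components=3)
X_reduced = pca3.fit_transform(X_std)
print(X_reduced.shape)  # (500, 3)
```

Fitting the scaler and PCA on the training split only (and reusing them on the test split) keeps this step consistent with the data-leakage argument made for feature selection.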
Figure 4: Influence of the characteristics on the first principal component.

Figure 5: Cumulative Explained Variance applying PCA.

3.2. GMM

The Gaussian Mixture Model (GMM) [24] is a probabilistic model describing subsets of data within a general population that can be represented as a combination of normally distributed subpopulations. It is commonly used in unsupervised learning to learn the subpopulation models, which can also be separated automatically. Within a supervised approach, on the other hand, such models can be used to determine the boundaries of different subpopulations for classification purposes. In this latter case, the goal is to assign each data point to one of G pre-existing unordered classes (or populations), taking into account d observed variables X. The problem amounts to defining a function that maps an arbitrary observation x ∈ R^d to a prediction of the class from which it stems. This function, named the allocation rule, has to be estimated from the training data {x_ig, i = 1, ..., n_g and g = 1, ..., G}, which consist of the observations of the vector X and the class membership for n = Σ_{g=1}^{G} n_g items. In supervised classification, their use is related to the Bayes allocation rule, which is known to minimize the expected error rate in class membership prediction. The Bayes rule allocates x to the class ĝ with the highest posterior probability among the C classes, that is:

ĝ = arg max_{g=1,...,C} {π_g p_g(x)} = arg max_{g=1,...,C} {P(C_g) P(x|C_g)}    (4)

where π_g denotes the a priori probability of class g, which is known or estimated from the training data, and p_g(x) denotes the class-conditional probability distribution of X. For continuous predictors, mixtures of multivariate normal densities are preferred because of their computational convenience:

p_g(x) = Σ_{h=1}^{H_g} w_gh φ(x | μ_gh, Σ_gh)    (5)

In mixture (5) the components φ(·) are H_g d-dimensional Gaussian densities, each parameterized by its mean vector μ_gh and covariance matrix Σ_gh. The parameters w_gh (h = 1, ..., H_g) are the mixing proportions, also named mixture coefficients, which are constrained to be positive quantities that sum to 1. These parameters are determined by fitting model (5) to the data {x_ig, i = 1, ..., n_g and g = 1, ..., G}, usually by maximum likelihood via the expectation-maximization (EM) algorithm [25]. The fitted model is then plugged into rule (4).

3.2.1. Components estimation

To improve the performance of the GMM as a classifier, we need to correctly define how many Gaussian components should be used to approximate the data distribution: a GMM with too many Gaussian components may overfit the data [26], while a GMM with too few components may not be flexible enough to approximate the true underlying density distribution, underfitting the data. The number of components was estimated using the Bayesian Information Criterion (BIC) [27], a criterion for grading models based on the posterior probability of the models being compared. The BIC is given by:

BIC = −2 log f(x_n | Θ) + p log n    (6)

where f(x_n | Θ) is the likelihood of the chosen model, p is the number of parameters to be estimated, and n is the number of sample data points. Models with smaller values of BIC are preferable. Note that as the sample size increases, BIC imposes an increasingly large penalty per parameter and thus tends to select more parsimonious models. In fact, BIC tends to overly penalize complex models, so the goal is to find the model for which the criterion is minimized, preferring the point of the first "elbow" of the function.

4. Experimental setup

4.0.1. Model implementation

To obtain a better prediction on the data, two GMMs are implemented in two different ways: one models only the Benign data, the other only the Attack data. Hence, to extract the correct number of components for each model, the BIC is applied to each case; its change for different numbers of components is presented in Figure 6.
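Selecting the number of components by BIC, as done per class here, can be sketched as follows. The two-cluster data are synthetic, and scikit-learn's `GaussianMixture.bic` computes the same criterion as Eq. (6):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Synthetic 3-dimensional data drawn from two well-separated Gaussian
# clusters, standing in for the PCA-reduced features of one class.
X = np.vstack([
    rng.normal(loc=-2.0, scale=0.5, size=(300, 3)),
    rng.normal(loc=+2.0, scale=0.5, size=(300, 3)),
])

# Fit GMMs with a growing number of components and keep the BIC of
# each; the preferred model sits at the first clear "elbow"/minimum.
bics = []
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics.append(gmm.bic(X))

best_k = int(np.argmin(bics)) + 1
print(best_k)  # 2 for this two-cluster toy data
```

As in the paper's Attack model (five components chosen over the six-component minimum), a slightly smaller k near the minimum may be preferred when performance is equal, to save computation.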
Given the obtained results, for the GMM related to the Attack data five components are chosen: although the lowest BIC value corresponds to six components, the performance with five is equal, so five was preferred also in terms of computational complexity. The GMM related to the Benign data, on the other hand, is developed with three components, given the clear result. Hence, the two different models are defined as follows:

f_Attack(x) = Σ_{i=1}^{5} w_{Attack,i} φ(x | μ_{Attack,i}, Σ_{Attack,i})    (7)

f_Benign(x) = Σ_{i=1}^{3} w_{Benign,i} φ(x | μ_{Benign,i}, Σ_{Benign,i})    (8)

Figure 6: Changes in BIC for different numbers of components. (a) BIC related to Attack data. (b) BIC related to Benign data.

5. Results

For the classification of a new point, the posterior probability of belonging to each of the two classes is calculated:

P(C_i | x) = P(x | C_i) P(C_i) / P(x),  i ∈ {Attack, Benign}    (9)

then the point is associated with the class ĝ with the maximum posterior probability:

ĝ = arg max_{i ∈ {Attack, Benign}} {P(C_i) P(x | C_i)}    (10)

The metric used for evaluating the quality of the model's predictions is the balanced accuracy, the mean of Sensitivity and Specificity, where Sensitivity (True Positive Rate) is the probability of a positive case being accurately classified as positive, and Specificity (True Negative Rate) is the probability of a negative case being accurately classified as negative. These properties help the metric perform well with the slightly imbalanced dataset used, returning the average accuracy per class: by averaging Sensitivity and Specificity, imbalanced datasets are accounted for, since a model receives a worse balanced accuracy score if it only predicts accurately for the majority class. As a result, the balanced accuracy reports an accuracy equal to 99%. A summary of the obtained results is presented below, with a confusion matrix and then a classification report over the 86,275 samples used as test data.

Table 1
Confusion matrix.
              Attack   Benign
Attack        66361    295
Benign        292      19327

Table 2
Classification report.
              Precision  Recall  F1-score  Support
Attack        1.00       1.00    1.00      66656
Benign        0.98       0.99    0.99      19619
Accuracy                         0.99      86275
Macro avg     0.99       0.99    0.99      86275
Weighted avg  0.99       0.99    0.99      86275

5.0.1. Cross-Validation

Cross-validation is one of the most widely used data resampling methods for estimating the true prediction error of models. The cross-validation technique used in the experiment is Stratified k-fold, in which the available learning set is partitioned into k disjoint subsets of approximately equal size, each preserving the class proportions of the original dataset. The word "fold" refers to the number of resulting subsets. The partitioning is performed by randomly sampling cases from the learning set without replacement. The model is trained using k − 1 subsets, which together represent the training set; it is then applied to the remaining subset, denoted as the validation set, and the performance is measured. This procedure is repeated until each of the k subsets has served as validation set. Finally, the average of the k performances on the k validation sets is calculated, which represents the cross-validation performance; the standard deviation is also calculated, considering the result of each validation set with respect to the average performance score.

Table 3
Cross-Validation results.
Cross-validation performance  98.7%
Standard deviation            0.003
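Putting the pieces together, the two per-class GMMs (Eqs. (7)-(8)), the posterior rule (Eqs. (9)-(10)), and the stratified k-fold evaluation can be sketched on synthetic imbalanced data. The class means, sample sizes, and the component counts passed as defaults are illustrative, not the paper's fitted values:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(7)
# Synthetic stand-in for the 3 PCA features: Attack and Benign samples
# drawn from different Gaussians, with roughly the paper's 77/23 imbalance.
X = np.vstack([rng.normal(-1.5, 1.0, (770, 3)), rng.normal(1.5, 1.0, (230, 3))])
y = np.array([0] * 770 + [1] * 230)  # 0 = Attack, 1 = Benign

def fit_predict(X_tr, y_tr, X_te, n_attack=5, n_benign=3):
    """One GMM per class; classify by the largest P(C_i) * p(x | C_i)."""
    gmm_a = GaussianMixture(n_components=n_attack, random_state=0).fit(X_tr[y_tr == 0])
    gmm_b = GaussianMixture(n_components=n_benign, random_state=0).fit(X_tr[y_tr == 1])
    prior_a = np.mean(y_tr == 0)
    log_post = np.column_stack([
        gmm_a.score_samples(X_te) + np.log(prior_a),      # log p(x|Attack) + log P(Attack)
        gmm_b.score_samples(X_te) + np.log(1 - prior_a),  # log p(x|Benign) + log P(Benign)
    ])
    return np.argmax(log_post, axis=1)

# Stratified 5-fold cross-validation of the balanced accuracy.
scores = []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    scores.append(balanced_accuracy_score(y[te], fit_predict(X[tr], y[tr], X[te])))
print(round(float(np.mean(scores)), 3), round(float(np.std(scores)), 3))
```

Working in log space avoids underflow in the density products; the normalizer P(x) in Eq. (9) is omitted because it does not change the arg max.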
6. Conclusions

In this paper, a procedure for fitting Gaussian mixture models (GMM) oriented to supervised classification has been proposed for classifying DDoS attacks. The proposed method demonstrates exceptional adaptability, flexibility, and performance in detecting DDoS attacks, making it a promising solution for enhancing cybersecurity in critical infrastructures. Starting from the raw data of the CIC-DDoS2019 dataset, a pre-processing phase was necessary, in which the data were processed and the characteristics relevant to the purpose subsequently selected. The results on the test data give an accuracy of 99%, confirmed by the Cross-Validation with a 98.7% accuracy, highlighting a reliable result free of problems such as overfitting, which in some scenarios can conceal the real performance. As part of our future work, we will first focus on improving the DDoS attack data, increasing the diversity of the training data to cover as many types of attack as possible, but also including normal network traffic data in order to reach a more balanced dataset. This component plays a crucial role in the success of our strategy, as it requires a deep understanding of DDoS attacks; to achieve this goal, we plan to automate the process as much as possible. Subsequently, the GMM could be extended to classify more than two classes, e.g., classifying each kind of DDoS attack, or modeling a multi-class problem where the classes are different kinds of cyberattacks.

Acknowledgments

This work has been developed at is.Lab() Intelligent Systems Laboratory at the Department of Computer, Control, and Management Engineering, Sapienza University of Rome (https://islab.diag.uniroma1.it). The work has also been partially supported by the Italian Ministerial grant PRIN 2022 "ISIDE: Intelligent Systems for Infrastructural Diagnosis in smart-concretE", n. 2022S88WAY - CUP B53D2301318, and by the Age-It: Ageing Well in an ageing society project, task 9.4.1 work package 4 spoke 9, within topic 8 extended partnership 8, under the National Recovery and Resilience Plan (PNRR), Mission 4 Component 2 Investment 1.3 - Call for tender No. 1557 of 11/10/2022 of the Italian Ministry of University and Research, funded by the European Union - NextGenerationEU, CUP B53C22004090006.

References

[1] N. Hoque, D. K. Bhattacharyya, J. K. Kalita, Botnet in DDoS attacks: trends and challenges, IEEE Communications Surveys & Tutorials 17 (2015) 2242–2270.
[2] S. I. Illari, S. Russo, R. Avanzato, C. Napoli, A cloud-oriented architecture for the remote assessment and follow-up of hospitalized patients, volume 2694, 2020, pp. 29–35.
[3] Genie-Networks, DDoS attack statistics and trends report for 2020 (2021).
[4] M. Wang, Y. Lu, J. Qin, A dynamic MLP-based DDoS attack detection method using feature selection and feedback, Computers & Security 88 (2020) 101645.
[5] N.-N. Dao, J. Park, M. Park, S. Cho, A feasible method to combat against DDoS attack in SDN network, in: 2015 International Conference on Information Networking (ICOIN), IEEE, 2015, pp. 309–311.
[6] S. M. Mousavi, M. St-Hilaire, Early detection of DDoS attacks against SDN controllers, in: 2015 International Conference on Computing, Networking and Communications (ICNC), IEEE, 2015, pp. 77–81.
[7] U. Dincalp, M. S. Güzel, O. Sevine, E. Bostanci, I. Askerzade, Anomaly based distributed denial of service attack detection and prevention with machine learning, in: 2018 2nd International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), IEEE, 2018, pp. 1–4.
[8] A. Aditya, Detecting DDoS attacks on ISP networks, in: Proc. of ACM SIGMOD Workshop on Management and Processing of Data Streams, 2003.
[9] I. Sharafaldin, A. H. Lashkari, S. Hakak, A. A. Ghorbani, Developing realistic distributed denial of service (DDoS) attack dataset and taxonomy, in: 2019 International Carnahan Conference on Security Technology (ICCST), IEEE, 2019, pp. 1–8.
[10] B. Xue, M. Zhang, Evolutionary feature manipulation in data mining/big data, ACM SIGEVOlution 10 (2017) 4–11.
[11] V. Ponzi, S. Russo, V. Bianco, C. Napoli, A. Wajda, Psychoeducative social robots for an healthier lifestyle using artificial intelligence: a case-study, volume 3118, 2021, pp. 26–33.
[12] G. Lo Sciuto, S. Russo, C. Napoli, A cloud-based flexible solution for psychometric tests validation, administration and evaluation, volume 2468, 2019, pp. 16–21.
[13] G. Capizzi, G. L. Sciuto, C. Napoli, E. Tramontana, A multithread nested neural network architecture to model surface plasmon polaritons propagation, Micromachines 7 (2016). doi:10.3390/mi7070110.
[14] M. M. Ahsan, M. P. Mahmud, P. K. Saha, K. D. Gupta, Z. Siddique, Effect of data scaling methods on machine learning algorithms and model performance, Technologies 9 (2021) 52.
[15] G. Capizzi, G. L. Sciuto, C. Napoli, R. Shikler, M. Wozniak, Optimizing the organic solar cell manufacturing process by means of AFM measurements and neural networks, Energies 11 (2018). doi:10.3390/en11051221.
[16] S. Pepe, S. Tedeschi, N. Brandizzi, S. Russo, L. Iocchi, C. Napoli, Human attention assessment using a machine learning approach with GAN-based data augmentation technique trained using a custom dataset, OBM Neurobiology 6 (2022). doi:10.21926/obm.neurobiol.2204139.
[17] F. Pargent, F. Pfisterer, J. Thomas, B. Bischl, Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features, Computational Statistics 37 (2022) 2671–2692.
[18] B. Bhattacharya, D. P. Solomatine, Machine learning in sedimentation modelling, Neural Networks 19 (2006) 208–214.
[19] A. Jović, K. Brkić, N. Bogunović, A review of feature selection methods with applications, in: 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), IEEE, 2015, pp. 1200–1205.
[20] V. Ponzi, S. Russo, A. Wajda, R. Brociek, C. Napoli, Analysis pre and post COVID-19 pandemic Rorschach test data of using EM algorithms and GMM models, volume 3360, 2022, pp. 55–63.
[21] M. A. Salam, A. T. Azar, M. S. Elgendy, K. M. Fouad, The effect of different dimensionality reduction techniques on machine learning overfitting problem, Int. J. Adv. Comput. Sci. Appl 12 (2021) 641–655.
[22] B. Azhagusundari, A. S. Thanamani, et al., Feature selection based on information gain, International Journal of Innovative Technology and Exploring Engineering (IJITEE) 2 (2013) 18–21.
[23] F. L. Gewers, G. R. Ferreira, H. F. D. Arruda, F. N. Silva, C. H. Comin, D. R. Amancio, L. d. F. Costa, Principal component analysis: A natural approach to data exploration, ACM Computing Surveys (CSUR) 54 (2021) 1–34.
[24] D. A. Reynolds, et al., Gaussian mixture models, Encyclopedia of Biometrics 741 (2009).
[25] B. Barazandeh, M. Razaviyayn, On the behavior of the expectation-maximization algorithm for mixture models, in: 2018 IEEE Global Conference on Signal and Information Processing (GlobalSIP), IEEE, 2018, pp. 61–65.
[26] V. Nannen, The paradox of overfitting, Ph.D. thesis, Faculty of Science and Engineering, 2003.
[27] G. Schwarz, Estimating the dimension of a model, The Annals of Statistics (1978) 461–464.