Real-Time Intrusion Detection via Machine Learning Approaches

Erik Murtaj1, Fausto Marcantoni1, Michele Loreti1, Michela Quadrini1,* and Hans-Friedrich Witschel2

1 School of Science and Technology, University of Camerino, Via Madonna delle Carceri, 9, Camerino, 62032, Italy
2 FHNW University of Applied Sciences and Arts Northwestern Switzerland, Riggenbachstrasse 16, CH-4600 Olten

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Michele Loreti
† These authors contributed equally.
Email: erik.murtaj@studenti.unicam.it (E. Murtaj); fausto.marcantoni@unicam.it (F. Marcantoni); michele.loreti@unicam.it (M. Loreti); michela.quadrini@unicam.it (M. Quadrini); hansfriedrich.witschel@fhnw.ch (H. Witschel)
ORCID: 0000-0002-7779-203X (F. Marcantoni); 0000-0003-3061-863X (M. Loreti); 0000-0003-0539-0290 (M. Quadrini); 0000-0002-8608-9039 (H. Witschel)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

Abstract

In many cybersecurity contexts, the real-time detection of hostile actions plays a fundamental role in protecting network infrastructures. In this scenario, Intrusion Detection Systems (IDS), based on signature-based or anomaly detection, are widely used to analyze network traffic. Signature-based detection relies on databases of known attack signatures, while anomaly detection is mainly based on Artificial Intelligence (AI) techniques. The latter is promising for detecting new kinds of cyberattacks in real time. In this work, we propose ReTiNA-IDS, a framework that integrates the CICFlowMeter tool with Machine Learning techniques to analyze real-time network traffic patterns and detect abnormalities that may suggest a possible intrusion. The considered machine learning techniques, random forest and multi-layer network, are based on selected features to enhance efficiency and scalability. To select the features and train the models, we use a version of the public dataset CSE-CIC-IDS2018. The framework's effectiveness has been tested in real-case scenarios by identifying different forms of intrusion. Analyzing the results, we conclude that the proposed solution shows valuable features.

Keywords

Random Forest, Feature Selection, analysis of Real-Time network traffic, Intrusion Detection Systems

1. Introduction

Intrusion Detection Systems (IDS) are relevant tools employed in cybersecurity to protect networks from possible cyber attacks.

In recent years, the world of cyber security has become more turbulent, with a rise in the number of cyber-attacks that target businesses worldwide. For this reason, new methodologies are continually needed to shield vital assets from hostile actors in reaction to this expanding danger.

Recently, there has been an increasing focus on the use of Artificial Intelligence (AI) in cyber security. As a subset of artificial intelligence, machine learning algorithms can improve threat detection and automate procedures. By utilizing machine learning algorithms, organizations may examine massive volumes of data in real time, spot patterns suggestive of malicious behaviour, and take preemptive measures to reduce risks.

In this work, we propose ReTiNA-IDS, a framework that integrates the CICFlowMeter tool with Machine Learning techniques to analyze real-time network traffic patterns and detect abnormalities that may suggest a possible intrusion. The integrated methodology, which is based on random forest and multi-layer networks, relies on selected features to enhance efficiency and scalability. To select the features and train the models, the public dataset CSE-CIC-IDS2018 has been used. The framework's effectiveness has been tested in real-case scenarios by identifying different forms of intrusion. Analyzing the results, we conclude that the proposed solution shows valuable features.

The paper is structured as follows. In Section 2 related works are discussed, while in Section 3 some basic background is introduced. In Section 4 the tool ReTiNA-IDS is presented, while in Section 5 some evaluation experiments are proposed. Section 6 concludes the paper.

2. Related Works

The use of machine learning approaches in intrusion detection systems to obtain real-time analysis has been exploited by many researchers. Many of them take advantage of Deep Learning (DL) approaches. ARCADE, proposed by Lunardi et al. [1], is an unsupervised DL-based approach for early anomaly detection using 1D Convolutional Neural Networks (CNNs). The approach builds a profile of normal traffic based on raw packet bytes. Kathareios et al. designed and tested a real-time network anomaly detection (AD) system, able to operate on encrypted and non-encrypted network packets, based on two learning stages: an autoencoder for adaptive unsupervised AD and a custom nearest-neighbour classifier to filter false positives [2]. Shuai proposed a prototype that combines big data processing frameworks like Apache Hadoop, Apache Kafka, and Apache Storm with ML techniques, i.e., Naïve Bayesian (NB), Support Vector Machine (SVM), and Decision Tree (DT). The proposed approach considers six features related to the IP addresses of the sender, receiver, and correspondent port without taking into account flow measurements. Ho et al. suggested an Intrusion Detection System (IDS) based on CNN that classifies all packet traffic as benign or malicious, detecting network intrusions [3]. Atefinia and Ahmadi proposed a modular deep neural network model that consists of four complete architectures that are combined with an aggregator module, each generating distinct outputs [4]. The four architectures are a Deep Feed-Forward Module (DFFM), a Stacked Restricted Boltzmann Machine Module (SRBMM), and two recurrent modules, one utilizing gated recurrent units (GRUM) and the other utilizing long short-term memory (LSTMM). Catillo et al. [5] proposed an approach based on Deep Autoencoder, and Fitni and Ramli [6] proposed a model based on decision trees that takes into account 23 features selected by Spearman's rank correlation coefficient [7]. Gamage and Samarabandu considered four DL architectures, i.e., feed-forward neural network, autoencoder, deep belief network, and LSTM [8]. Karatas et al. [9] reviewed the implementation of a Synthetic Minority Oversampling Technique (SMOTE) [10] to balance the data by exploiting six models. Kanimozhi and Jacob presented a two-layer MLP to detect only botnet attacks that exploits a grid search for hyper-parameter optimization and a 10-fold cross-validation for mitigating overfitting problems [11]. Huancayo Ramos et al. extended this approach by considering botnet data and Random Forests. Kim et al. also designed a model that exploits CNN for training on a single type of attack, specifically Denial of Service (DoS) attacks [12].

3. Background

In this section, we present CICFlowMeter, an Ethernet traffic bi-flow generator and analyzer for anomaly detection, and the Random Forest, a machine learning method used for classifying flow data and evaluating the importance of features. This classifier will then be integrated into CICFlowMeter for classifying network flows.

3.1. CICFlowMeter

CICFlowMeter is a network traffic flow generator and analyser [13, 14]. It generates bidirectional flows, where the first packet determines the forward (source to destination) and backward (destination to source) directions. The tool enables the extraction of more than 80 statistical network traffic features such as Duration, Number of packets, Number of bytes, Length of packets, etc. Such features can be calculated independently for both directions. The tool is developed in Java and provides a useful Graphical User Interface (GUI), shown in Figure 1, to monitor network flows in real time. TCP flows are usually terminated upon connection teardown (by a FIN packet), while a flow timeout terminates UDP flows [15].

Figure 1: Example of the CICFlowMeter's GUI

3.2. Machine Learning Approaches and Feature Selection

The Random Forest is an ML ensemble model used for both classification and regression tasks. During training, the model builds numerous decision trees; its output is the mode of the classes predicted by the individual trees (for classification) or the mean of their predictions (for regression). Introduced by Breiman in [16], this approach combines the bagging technique with the random selection of features. Such a random selection reduces the correlation among the decision trees within the forest. In the bagging phase, decision trees are constructed from bootstrap samples of the training dataset, where each sample is drawn with replacement, allowing for the possibility of repeated samples.
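As an illustration of the bagging phase described above, the following minimal sketch draws bootstrap replicates of a toy training set with replacement. It is not the implementation used in ReTiNA-IDS: the function names `bootstrap_sample` and `bagging_datasets`, the toy list `flows`, and the seed are hypothetical, and only the Python standard library is assumed instead of Scikit-learn.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows uniformly at random, with replacement."""
    return [rows[rng.randrange(len(rows))] for _ in rows]


def bagging_datasets(rows, n_trees, seed=0):
    """Build one bootstrap replicate of the training set per decision tree."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    return [bootstrap_sample(rows, rng) for _ in range(n_trees)]


# Toy training set: ten flow records, identified here only by their index.
flows = list(range(10))
replicates = bagging_datasets(flows, n_trees=3)

for i, sample in enumerate(replicates):
    distinct = len(set(sample))
    # Each replicate keeps the original size but usually has fewer distinct records.
    print(f"tree {i}: {sorted(sample)} ({distinct} distinct of {len(sample)})")
```

Because records are drawn with replacement, some appear several times in a replicate while others are left out, which is why the trees of the forest are trained on different views of the same data.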
These replicated datasets are then used to train decision trees, so that each tree sees a different portion of the original dataset during training. This bagging approach is coupled with random feature selection, which involves using distinct random subsets of the entire feature space to train each tree in the random forest. Usually, around $\sqrt{n}$ features are employed in each split for a classification task that considers $n$ features.

3.3. Dataset: CSE-CIC-IDS2018

The data used in this study is the CSE-CIC-IDS2018, a benchmark dataset for the evaluation of IDSs. Such data was collected by the Communications Security Establishment (CSE) and the Canadian Institute for Cybersecurity (CIC). The recorded data consists of ten days of traffic and includes seven types of attacks. Liu et al. identified some issues in this dataset related to its creation lifecycle, including attack orchestration, feature generation, documentation, and labelling. They reconstructed the dataset by deleting artefacts and correcting the labelling logic, and provided corrected implementations of existing features as well as new features that capture valuable flow state information [17]. Table 1 reports the amount of corrupted data.

Table 1: Corruption rate of different attacks on the CSE-CIC-IDS2018 dataset [17]

Attack Type          Corruption Rate (%)
Bot                  50.06
Web - Brute Force    53.85
Web Attack - XSS     50.43
DoS Attacks          >50
DDoS Attacks         >50
FTP-Patator          100.00
Infiltration         76.84
SQL Injection        54.02
SSH-Patator          49.97

3.4. Metrics

We evaluate the performance and effectiveness of the approaches by using Precision ($P$), Recall ($R$) and F1-score ($F_1$), defined as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = 2\,\frac{P \cdot R}{P + R}$$

where $TP$ represents the number of true positives, $FN$ denotes the number of false negatives, $FP$ represents the number of false positives, and $TN$ denotes the number of true negatives.

4. ReTiNA-IDS Approach

ReTiNA-IDS, Real-Time anomaly Detection IDS Approach, integrates an ML model mainly based on Random Forest into the CICFlowMeter tool to detect cyber-attacks in real time and act as a simple IDS. The Random Forest classifier considers only 13 of the 80 features calculated by the CICFlowMeter tool. The list of features with their descriptions, selected by another Random Forest model, is in Table 2. After being trained, the model has been exported in PMML format using the "sklearn-pmml-model" library for Scikit-learn [18]. The exported model is then imported into CICFlowMeter, which is developed in Java.

Table 2: The first 13 attributes ordered by importance

Id  Attribute                Description
1   FWD Init Win Bytes       The total number of bytes sent in initial window in the forward direction
2   Packet Length Std        Standard deviation length of a packet
3   Packet Length Mean       Mean length of a packet
4   Bwd Packet Length Std    Standard deviation size of packet in backward direction
5   Bwd Packet Length Max    Maximum size of packet in backward direction
6   Bwd PSH Flags            Number of times the PSH flag was set in packets travelling in the backward direction
7   ACK Flag Count           Number of packets with ACK
8   Fwd Seg Size Min         Minimum segment size observed in the forward direction
9   Fwd PSH Flags            Number of times the PSH flag was set in packets travelling in the forward direction
10  CWR Flag Count           Number of packets with CWR
11  Packet Length Variance   Variance length of a packet
12  Fwd Packet Length Max    Maximum size of packet in forward direction
13  Bwd Packet Length Mean   Mean size of packet in backward direction

4.1. ML Pipeline

The proposed approach is based on Random Forest, described in Section 3.2, and its scheme is shown in Figure 2.

Figure 2: Pipeline of our Approach

4.1.1. Data Preprocessing

In this study, the dataset used is a revised version of the CSE-CIC-IDS2018, as introduced in Section 3.3. The dataset consists of the network traffic captured on ten days, stored in 10 distinct files according to the day of data capture, as shown in Table 3.

Table 3: CSE-CIC-IDS2018 files

Id  File Name              Size
1   Wednesday-14-02-2018   3.03 GB
2   Thursday-15-02-2018    2.18 GB
3   Friday-16-02-2018      3.92 GB
4   Tuesday-20-02-2018     3.19 GB
5   Wednesday-21-02-2018   3.68 GB
6   Thursday-22-02-2018    3.23 GB
7   Friday-23-02-2018      3.17 GB
8   Wednesday-28-02-2018   3.54 GB
9   Thursday-01-03-2018    3.54 GB
10  Friday-02-03-2018      3.43 GB

The first step of the preprocessing consists of data cleaning, i.e., removing incomplete rows with missing values and rows containing invalid (or infinite) numerical values. Moreover, many features that are not relevant for spotting cyber-attacks have been eliminated, such as the IP address of the sender and receiver, the connection timestamp, the protocol type, and the destination/sender port. Furthermore, the traffic data related to Web Attacks is deleted since its volume is insufficient.

4.1.2. Data Balancing and Data Augmentation

The collected data related to network traffic is substantially unbalanced: benign traffic is more prevalent than malicious traffic. To balance the data, we have used one step of the bootstrapping procedure, implemented in the resample function of Sklearn. Due to the corrupted data in the original dataset, it does not contain data related to FTP Brute Force attacks. Therefore, we have added this kind of data by collecting it during a simulation of brute force attacks via FTP (File Transfer Protocol). The simulation involved the use of a Windows host (victim machine) and a Kali Linux host (attacker machine), both in the same local area network (connected to the same router). The victim machine runs a FileZilla server, an open-source software utility that facilitates the transmission of files using FTP. It enables users to establish their own FTP servers or connect to existing FTP servers to exchange data. The victim machine accepts connections on port 21, which is the port targeted by the attack. When the FileZilla server on the victim machine is running, the Kali Linux host performs a brute-force attack using Patator, a multi-purpose brute-forcer tool [19]. Table 4 shows the amount of data for each kind of attack after the cleaning and balancing phases.

Table 4: Amount of data per network traffic class

Class               Count
BENIGN              145904
DoS Attack          145904
BruteForce Attack   99147
PortScan Attack     49740
BotNet Attack       142921
Total               583616

4.1.3. Feature Selection and Classifier

To select the features, a Random Forest has been considered and implemented by setting the depth of each decision tree and the number of estimators to 16 and 20, respectively. To avoid possible issues related to overfitting, we use 5-fold cross-validation. Figure 3 shows the obtained confusion matrix.

Figure 3: Confusion Matrix of the Random Forest Classifier

The performance of the model, evaluated in terms of Precision, Recall and $F_1$-score, is shown in Table 5.

Table 5: Classification performance metrics of the Random Forest

Class                           Precision  Recall  F1-score
BENIGN                          1.00       1.00    1.00
Botnet Ares                     1.00       1.00    1.00
BruteForce Attack               1.00       1.00    1.00
DoS Attack                      1.00       1.00    1.00
Infiltration - NMAP Portscan    0.99       1.00    1.00
Accuracy                                           1.00

5. Experimental Setup

The ML models have been implemented in a Google Colab document with Python 3. The default CPU in the environment is an Intel Xeon CPU equipped with 2 virtual CPUs (vCPUs) and 13 GB of RAM [20]. For this study, the configuration involved the utilization of extra RAM, resulting in a total memory capacity of 50 GB (included with Google Colab Pro [20]).

For data handling, preprocessing, analysis, training, and evaluation metrics, the proposed model was built and evaluated using NumPy [21], Pandas [22], and Scikit-learn [23]. Matplotlib [24] was used to visualize the data. The testing phase for this study used a Windows machine with the following specifications: an Intel Core i5-4670 CPU at 3.40 GHz, 16 GB of DDR4 memory and an Nvidia GTX 1050 Ti GPU.

5.1. Testing

ReTiNA-IDS, a tool that integrates an ML model into CICFlowMeter, analyzes data patterns and distinguishes benign traffic from malicious traffic. The testing phase of ReTiNA-IDS intends to assess the efficiency and efficacy of the machine learning model in real-world network situations. To perform the simulations, we take advantage of the Graphical Network Simulator-3 (GNS3) software, an open-source network simulation tool used for creating, modelling, and testing virtual and real networks [25]. To this end, we create a simple network composed of a Cisco router [26] and two generic switches, outlining two different areas of a hypothetical Local Area Network (LAN), a Windows machine and a Kali Linux machine. Figure 4 shows the network infrastructure.

Figure 4: Network structure in GNS3 for testing simulations

The Windows machine represents the hypothetical victim running the ReTiNA-IDS tool, acting as an IDS, while the Kali Linux machine plays the role of attacker. The victim machine is a Windows 10 host, while the Kali Linux version used is Kali 2023.4.

Different attack simulations were performed, each one resulting in a positive detection by the tool:

• DoS attacks
• File Transfer Protocol (FTP) and Secure SHell (SSH) Bruteforce attacks
• Portscan attacks

Additionally, more tests were performed with the tool running in a normal traffic situation (without performing any cyberattack) in a local network and in the University of Camerino's network, for a total of around 5 hours of workload. The purpose of letting the tool run for hours on end was to see whether any crashes occurred during execution and to spot any false positive results. During the experiments, zero false positives were identified.

6. Conclusion and Future Work

In this work, we have presented ReTiNA-IDS, a tool that integrates an ML model into CICFlowMeter, which analyzes data patterns and distinguishes benign traffic from malicious traffic in real time. The ML model is based on a Random Forest, used to select features and to classify the data. The testing phase, performed by running the tool in a normal traffic situation (without performing any cyberattack) in a local network and in the University of Camerino's network, shows that the tool does not produce false positives.

In the near future, we intend to test the approach on botnet traffic to investigate the performance of ReTiNA-IDS. To reach this aim, we intend to create a central server to control potentially infected hosts. Moreover, we plan to consider other machine learning models, both supervised and unsupervised. Finally, motivated by the results obtained for modelling and verifying properties of Collective Adaptive Systems [27, 28, 29], we intend to define formal approaches to specify and verify properties of the data traffic, in order to monitor the traffic and identify anomalous patterns in it.

Acknowledgements. This work has been funded by the European Union - NextGenerationEU under the Italian Ministry of University and Research (MUR) National Innovation Ecosystem grant ECS00000041 - VITALITY - CUP J13C22000430001

References

[1] W. T. Lunardi, M. A. Lopez, J.-P. Giacalone, Arcade: Adversarially regularized convolutional autoencoder for network anomaly detection, IEEE Transactions on Network and Service Management (2022).
[2] G. Kathareios, A. Anghel, A. Mate, R. Clauberg, M. Gusat, Catch it if you can: Real-time network anomaly detection with low false alarm rates, in: 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2017, pp. 924–929.
[3] S. Ho, S. Al Jufout, K. Dajani, M. Mozumdar, A novel intrusion detection model for detecting known and innovative cyberattacks using convolutional neural network, IEEE Open Journal of the Computer Society 2 (2021) 14–25.
[4] R. Atefinia, M. Ahmadi, Network intrusion detection using multi-architectural modular deep neural network, The Journal of Supercomputing (2020) 3571–3593.
[5] M. Catillo, M. Rak, U. Villano, 2l-zed-ids: A two-level anomaly detector for multiple attack classes, in: Web, Artificial Intelligence and Network Applications: Proceedings of the Workshops of the 34th International Conference on Advanced Information Networking and Applications (WAINA-2020), Springer, 2020, pp. 687–696.
[6] Q. R. S. Fitni, K. Ramli, Implementation of ensemble learning and feature selection for performance improvements in anomaly-based intrusion detection systems, in: 2020 IEEE International Conference on Industry 4.0, Artificial Intelligence, and Communications Technology (IAICT), IEEE, 2020, pp. 118–124.
[7] W. W. Daniel, The spearman rank correlation coefficient, Biostatistics: A Foundation for Analysis in the Health Sciences (1987).
[8] S. Gamage, J. Samarabandu, Deep learning methods in network intrusion detection: A survey and an objective comparison, Journal of Network and Computer Applications 169 (2020) 102767. doi:10.1016/j.jnca.2020.102767.
[9] G. Karatas Baydogmus, O. Demir, O. Sahingoz, Increasing the performance of machine learning-based idss on an imbalanced and up-to-date dataset, IEEE Access PP (2020) 1–1. doi:10.1109/ACCESS.2020.2973219.
[10] B. Jason, Smote for imbalanced classification with python, 2021.
[11] V. Kanimozhi, T. P. Jacob, Artificial intelligence based network intrusion detection with hyper-parameter optimization tuning on the realistic cyber dataset cse-cic-ids2018 using cloud computing, in: 2019 international conference on communication and signal processing (ICCSP), IEEE, 2019, pp. 0033–0036.
[12] J. Kim, J. Kim, H. Kim, M. Shim, E. Choi, Cnn-based network intrusion detection against denial-of-service attacks, Electronics 9 (2020) 916.
[13] A. H. Lashkari, G. D. Gil, M. S. I. Mamun, A. A. Ghorbani, Characterization of tor traffic using time based features, in: International Conference on Information Systems Security and Privacy, volume 2, SciTePress, 2017, pp. 253–262.
[14] G. Draper-Gil, A. H. Lashkari, M. S. I. Mamun, A. A. Ghorbani, Characterization of encrypted and vpn traffic using time-related, in: Proceedings of the 2nd international conference on information systems security and privacy (ICISSP), 2016, pp. 407–414.
[15] U. of New Brunswick | UNB, Applications | research | canadian institute for cybersecurity | unb, 2017. URL: https://www.unb.ca/cic/research/applications.html.
[16] L. Breiman, Random forests, Machine learning 45 (2001) 5–32.
[17] L. Liu, G. Engelen, T. Lynar, D. Essam, W. Joosen, Error prevalence in nids datasets: A case study on cic-ids-2017 and cse-cic-ids-2018, in: 2022 IEEE Conference on Communications and Network Security (CNS), IEEE, 2022, pp. 254–262.
[18] scikit-learn: machine learning in python - scikit-learn 1.4.1 documentation, 2024. URL: https://scikit-learn.org/stable/index.html.
[19] Kali linux tools, patator, 2024. URL: https://www.kali.org/tools/patator/.
[20] Google, Google colab, 2024. URL: https://research.google.com/colaboratory/faq.html.
[21] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, T. E. Oliphant, Array programming with NumPy, Nature 585 (2020) 357–362. URL: https://doi.org/10.1038/s41586-020-2649-2. doi:10.1038/s41586-020-2649-2.
[22] T. pandas development team, pandas-dev/pandas: Pandas, 2020. URL: https://doi.org/10.5281/zenodo.3509134. doi:10.5281/zenodo.3509134.
[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[24] J. D. Hunter, Matplotlib: A 2d graphics environment, Computing in Science & Engineering 9 (2007) 90–95. doi:10.1109/MCSE.2007.55.
[25] S. Worldwide, Gns3 documentation, 2024. URL: https://docs.gns3.com/docs/.
[26] Cisco 3600 series - cisco, 2015. URL: https://www.cisco.com/c/en/us/td/docs/ios/12_2/12_2x/12_2xa/release/notes/rn3600xa.html.
[27] M. Loreti, M. Quadrini, A spatial logic for simplicial models, Log. Methods Comput. Sci. 19 (2023).
[28] N. Del Giudice, L. Matteucci, M. Quadrini, A. Rehman, M. Loreti, Sibilla: A tool for reasoning about collective systems, Science of Computer Programming (2024) 103095.
[29] N. D. Giudice, L. Matteucci, M. Quadrini, A. Rehman, M. Loreti, Sibilla: A tool for reasoning about collective systems, in: Coordination Models and Languages - 24th IFIP WG 6.1 International Conference, COORDINATION 2022, Held as Part of the 17th International Federated Conference on Distributed Computing Techniques, DisCoTec 2022, Lucca, Italy, June 13-17, 2022, Proceedings, 2022, pp. 92–98. doi:10.1007/978-3-031-08143-9_6.