Development of a Method for Increasing the Efficiency of Identifying the State of a Computer Network by Selecting the Most Informative Features in the Output Data

Svitlana Gavrylenko and Vlad Zozulia

National Technical University "Kharkiv Polytechnic Institute", 2, Kyrpychova str., Kharkiv, 61002, Ukraine

Abstract
The object of the study is the process of identifying the state of the computer network. The subject of the study is methods of selecting the most informative features in the initial data. The purpose of the work is to improve the efficiency of identification of the state of the computer network. Methods used: methods of artificial intelligence, machine learning, genetic and natural algorithms. The datasets UNSW-NB15, KDDCUP99, Kyoto2006, NSL-KDD, CIC-IDS-2017, and DoHBrw-2020, which contain information about the network during normal functioning and during intrusions, were used as input data. Software models based on genetic and natural algorithms were developed and studied for selecting the most informative features in the raw data at the data preprocessing stage: Genetic Algorithm, Particle Swarm Optimization, Differential Evolution, Cuckoo Search Algorithm, Bat Algorithm, and Flower Pollination Algorithm. To assess the quality of data preprocessing, a computer network state identification model based on the Random Forest algorithm was developed. The following results were obtained. The use of the above algorithms made it possible to significantly reduce the number of features of the complete dataset, reduce its size, and shorten the training time of the model. The best results were obtained with the genetic algorithm, which increased the training speed of the models by 47% on average. Conclusions.
According to the results of the study, the method of increasing the efficiency of identification of the state of the computer network was further developed by using genetic methods to select the most informative features in the source data.

Keywords
intrusion detection systems, computer networks, machine learning, feature informativeness, genetic and natural algorithms

ProfIT AI 2023: 3rd International Workshop of IT-professionals on Artificial Intelligence (ProfIT AI 2023), November 20–22, 2023, Waterloo, Canada
Email: svitlana.gavrylenko@khpi.edu.ua (S. Gavrylenko); zozuliavlad@gmail.com (V. Zozulia)
ORCID: 0000-0002-6919-0055 (S. Gavrylenko); 0000-0002-2168-3029 (V. Zozulia)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

With the increasing dependence of the modern world on information technologies, issues of cyber security are becoming more relevant and important. In this environment, ensuring the security of networks and data takes on the highest priority. One of the key aspects of cyber security is intrusion detection [1], the purpose of which is to detect illegal and malicious actions in computer systems and networks. Effective intrusion detection systems can prevent significant threats, minimize risks, and protect valuable resources.

With the advent of large volumes of data and sophisticated attacks, the efficiency and accuracy of intrusion detection systems become particularly important. In this context, the selection of the most informative features when building an intrusion detection model becomes a decisive factor in achieving optimal results. Reducing the dimensionality of the data while preserving relevant information can significantly increase the effectiveness of intrusion detection and optimize the learning process.

The object of the study is the process of identifying the state of the computer network. The subject of the study is methods of selecting the most informative features in the initial data. The purpose of the work is to improve the efficiency of identification of the state of the computer network.

2. Statement of the problem and relevance of the task

The presence of uninformative features in the dataset can lead to several problems when building a model:
1. Increase in size. Uninformative features increase the size of the data, which can make analysis and modeling tasks more difficult. This can lead to an increase in the computational complexity and training time of the models.
2. Increased noise. Uninformative features add "noise" to the data because they do not contain useful information about dependencies in the data. This can reduce the ability of models to identify true patterns and make them less robust to changes in the data.
3. Deterioration of model performance. Incorporating uninformative features into a model can degrade its performance. The model can overfit to the uninformative features and generalize poorly to new data.
4. Increasing data requirements. Having uninformative features means that more data may be needed for the model to learn useful dependencies. This can be expensive and time consuming.
5. Interpretation difficulties. Uninformative features can complicate model interpretation and make it difficult to understand the influence of specific features on the results.

To solve these problems, feature selection methods are often used, which identify and remove uninformative features from the dataset, namely: Feature Importance Analysis [2], Recursive Feature Elimination [3], wrapper methods, genetic and natural algorithms, etc.
These techniques can help identify and remove uninformative features, improving model performance and facilitating data analysis. However, feature selection may lead to the loss of some information, and this may reduce the performance of the model. Feature selection methods can be sensitive to data variability. In some cases, features that seem uninformative may be useful under other conditions. In addition, many methods, for example filter methods, mutual information methods, and recursive feature elimination (RFE), consider features independently of each other and do not take into account their interaction.

Although there are many different feature selection methods, choosing the appropriate one can be complex and depends on the specific machine learning method. Usually, it is necessary to investigate several methods and evaluate their impact on model performance in order to choose the best approach. Thus, improving methods for selecting informative features in order to improve the quality of the model is an urgent task that requires research.

3. Review of scientific publications

In this work, genetic and natural algorithms were investigated for evaluating the informativeness of the model input data and selecting it by creating optimal (or appropriate) subsets of features from the set of available features. The choice of genetic algorithms is justified by their ability to take into account the interaction between features and to evaluate how combinations of features affect the predictive ability of the model, which is important when detecting intrusions into computer networks.
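The point about feature interaction can be made concrete with a small toy example. The sketch below is an illustration only and is not taken from the study: it assumes a synthetic dataset with two binary features whose label is their XOR, and a hypothetical helper `accuracy_of_best_rule` that reports the best accuracy achievable by a rule that sees only the chosen features. Each feature on its own looks uninformative, while the pair determines the label completely, which is the situation that per-feature scoring can miss.

```python
from itertools import product

# Toy data (an assumption for illustration): two binary features, label = XOR.
rows = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)] * 25


def accuracy_of_best_rule(rows, feature_idx):
    """Best accuracy of a rule that sees only the features in feature_idx.

    For every distinct combination of the visible feature values, the rule
    predicts the majority label observed for that combination.
    """
    groups = {}
    for row in rows:
        key = tuple(row[i] for i in feature_idx)
        groups.setdefault(key, []).append(row[-1])
    correct = sum(max(labels.count(0), labels.count(1))
                  for labels in groups.values())
    return correct / len(rows)


acc_a = accuracy_of_best_rule(rows, [0])      # first feature alone
acc_b = accuracy_of_best_rule(rows, [1])      # second feature alone
acc_ab = accuracy_of_best_rule(rows, [0, 1])  # both features together
print(acc_a, acc_b, acc_ab)  # prints: 0.5 0.5 1.0
```

A method that scores each feature separately would rank both features as useless here, whereas a search over feature subsets evaluates the pair jointly and keeps it.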
In this work, the following genetic and natural algorithms are investigated to evaluate the informativeness of features and its influence on the efficiency of intrusion detection systems: Genetic Algorithm (GA [4]), Particle Swarm Optimization (PSO [5]), Differential Evolution (DE [6]), Cuckoo Search Algorithm (CSA [7]), Bat Algorithm (BA [8]), and Flower Pollination Algorithm (FPA [9]).

A Genetic Algorithm is a computational method that solves optimization problems using mechanisms of natural evolution: inheritance, crossover, mutation, and selection. A distinctive feature of the genetic algorithm is its emphasis on the crossover operator, which recombines candidate solutions and plays a role similar to that of crossover in living nature. In the simplest case, the optimization task consists of finding the extremum (minimum or maximum) of the objective function by systematically trying input values from a given set and calculating its value

$$Y = f(x_1, x_2, \ldots, x_n) \tag{1}$$

where $Y$ is the objective function, which depends on the parameters $x_1, x_2, \ldots, x_n$ and, depending on the problem, is to be maximized or minimized.

When using a genetic algorithm, feature selection can be described as follows:
1. Initialization of the population, that is, selection of the initial population of chromosomes.
2. Assessment of the fitness of each individual in the population based on some criterion, for example the accuracy of the model on test data. The better the individual is suited to the task, the higher its fitness.
3. Selection of individuals to create a new population, taking into account their fitness. Individuals with higher fitness have a greater chance of being selected.
4. Crossover and mutation. Crossover combines features from two parents, and mutation may involve the random addition or deletion of features.
5. Replacement of the previous generation with the new one. The process of selection, crossover, mutation, and fitness assessment is repeated for each generation.
6. Completion of the algorithm when a given number of generations is reached or a stopping condition is met, such as reaching the desired accuracy or a time limit.

The resulting subset of features (optimal or close to it) can be used to train the model in order to reduce the dimensionality of the data, improve the performance of the model, and reduce the risk of overfitting.

Particle Swarm Optimization is an optimization algorithm inspired by the behavior of swarms in nature, such as birds, fish, or insects. It is used to solve optimization problems in which the optimum of a fitness function must be found in a multidimensional space, which is explored by a set of "particles" that move through the space in search of a better solution. PSO optimizes a function by maintaining a population of candidate solutions, the swarm particles, and moving these particles in the solution space according to a simple formula. The movements are guided by the best positions found so far, which are updated whenever particles find more favorable positions.

Differential Evolution is a method of multidimensional mathematical optimization that belongs to the class of stochastic optimization algorithms (that is, it works with random numbers). It simulates the main evolutionary processes in living nature: crossover, mutation, and selection. The method is intended for finding the global minimum (or maximum) of non-differentiable, nonlinear, multimodal (possibly with a large number of local extrema) functions of many variables. The method is easy to implement and use (it has few control parameters that require tuning) and is easily parallelized.
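The mutation, crossover, and selection operations that Differential Evolution relies on can be sketched in a few lines. The block below is a minimal illustration of one common variant (usually written DE/rand/1/bin) and is not the implementation used in the study. The function name `differential_evolution`, its parameters (`pop_size`, `f`, `cr`, `generations`, `seed`), and the `sphere` test function are all assumptions chosen for the example.

```python
import random


def differential_evolution(objective, bounds, pop_size=20, f=0.5, cr=0.9,
                           generations=200, seed=0):
    """Minimise `objective` with a DE/rand/1/bin scheme (illustrative sketch)."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Initial population: uniform random vectors inside the bounds.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Mutation: combine three distinct individuals other than i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            mutant = [pop[a][k] + f * (pop[b][k] - pop[c][k])
                      for k in range(dim)]
            # Clip the mutant back into the search bounds.
            mutant = [min(max(v, lo), hi)
                      for v, (lo, hi) in zip(mutant, bounds)]
            # Binomial crossover: each coordinate comes from the mutant with
            # probability cr; j_rand forces at least one mutant coordinate.
            j_rand = rng.randrange(dim)
            trial = [mutant[k] if (rng.random() < cr or k == j_rand)
                     else pop[i][k] for k in range(dim)]
            # Selection: the trial replaces its parent only if it is not worse.
            trial_score = objective(trial)
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def sphere(x):
    """Simple test objective with its global minimum of 0 at the origin."""
    return sum(v * v for v in x)


best_x, best_score = differential_evolution(sphere, [(-5.0, 5.0)] * 3)
print(best_x, best_score)
```

The sketch works on continuous vectors. Using it for feature selection would need an extra encoding step, for example thresholding each coordinate to decide whether a feature is kept; that step is an assumption here and is not described in the text.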
The differential evolution algorithm is an optimization method designed to solve unconstrained optimization problems. It is based on the ideas of evolution and is used to find the global optimum in multidimensional spaces. It works by repeatedly applying the mutation, crossover, and selection operations named above to a population of candidate solutions.

The Cuckoo Search Algorithm is a metaheuristic for optimization problems that simulates the behavior of cuckoos, which rely on an aggressive breeding strategy. It works by dividing the search space into niches (nests), where each nest represents a potential solution to the optimization problem and a cuckoo egg is a new solution.

The Bat Algorithm is a metaheuristic global optimization algorithm. It simulates the way bats search for food, taking into account their ability to echolocate with different pulse rates and loudness. In the basic bat algorithm, each bat is treated as a "massless and size free" particle representing a feasible solution in the solution space. For a given fitness function, each bat has a corresponding function value, and the current best individual is determined by comparing these values. Then the acoustic wave frequency, velocity, pulse emission rate, and loudness of each bat in the population are updated, the iterative evolution continues, the current best solution is refined, and finally the global optimal solution is obtained. The algorithm updates the frequency, velocity, and position of each bat. The standard algorithm requires five basic parameters: frequency, loudness, pulse rate, and the loudness and pulse-rate coefficients. Frequency is used to balance the effect of the best historical position on the current position. An individual bat will search far from the group's historical position when the search frequency range is large, and vice versa.

The Flower Pollination Algorithm is a highly efficient metaheuristic optimization algorithm inspired by the pollination process of flowering plants. It models the movement and interaction of pollen grains and other particles inside a flower.

4. Source data

Six different datasets covering various scenarios of network activity and types of attacks were used as raw data:
1. The DoHBrw-2020 dataset [10] is focused on the analysis of network activity related to the DNS over HTTPS (DoH) protocol. The data was obtained by analyzing traffic associated with the use of the DoH protocol. DoHBrw-2020 provides data for the analysis of attacks masquerading as encrypted traffic.
2. The UNSW-NB15 dataset [11] was developed as a comprehensive dataset for evaluating network intrusion detection. It contains different types of attacks and normal network activity. The data was collected in a university network over a period of five months. Network activity logs are included, and artificial attacks were created to cover the scenarios more fully. UNSW-NB15 contains over 2 million records and covers attacks of varying complexity.
3. The KDDCUP99 dataset [12] was prepared for the KDD Cup 1999 intrusion detection competition. It contains different types of attacks and normal activity. The data was collected in a real network environment as part of the KDD Cup 1999 and represents a record of network activity, including various types of attacks. KDDCUP99 has an uneven distribution of classes and duplicated records of some attacks.
4. The Kyoto2006 dataset [13] specializes in the analysis of DoS attacks and forged packets. It includes records of network activity associated with such attacks. The data was collected in a real network environment using special sensors to detect DoS attacks and packet spoofing. Kyoto2006 provides unique data for packet-level attack analysis.
5. The NSL-KDD dataset [13] was created on the basis of the KDDCUP99 dataset in order to eliminate its shortcomings and add other types of attacks.
NSL-KDD includes attacks from the original KDDCUP99 dataset, as well as added artificial attacks and normal activity. NSL-KDD provides a more even distribution of classes and a variety of attack types.
6. The CIC-IDS-2017 dataset [14] is designed for detecting modern attacks, including those targeting applications. The data was collected in a real network environment and includes a variety of attacks from both traditional and emerging threat types. CIC-IDS-2017 provides up-to-date data for the analysis and detection of modern threats.

Each of these datasets provides unique opportunities to investigate and analyze the performance of genetic algorithms for selecting informative features in the context of intrusion detection.

5. Experimental part

This section presents the results of experiments using six different genetic and natural algorithms for extracting informative features in intrusion detection tasks. Feature analysis software was developed in Python in the Jupyter Notebook environment. The following feature selection methods were studied: Genetic Algorithm, Particle Swarm Optimization, Differential Evolution, Cuckoo Search Algorithm, Bat Algorithm, and Flower Pollination Algorithm. The Random Forest method [15,16] was used as the basic data classification model.

The experiments were carried out by iterating over all combinations of the aforementioned datasets and genetic and natural algorithms. Each experiment included the following steps:
1. Selection of a set of input data (full dataset).
2. Selection of the basic algorithm, model training, and evaluation of the informativeness of the features of the full dataset.
3. Removal of the uninformative features determined by the algorithm in the previous step and formation of a reduced dataset.
4. Estimation of the size of the full and reduced datasets.
5. Training of a classifier based on the Random Forest algorithm using the full and reduced datasets.
6. Assessment of classification accuracy on the full and reduced datasets.
7. Estimation of classifier training time on the full and reduced datasets.

According to the results of the study, the following quality indicators of the model were obtained (Figure 1).

Figure 1: Comparison of model accuracy when using different datasets and different genetic algorithms for selecting the most informative features

As can be seen from Figure 1, using a reduced dataset as the model input did not lead to a significant deterioration of its accuracy in comparison with the full dataset. For the UNSW-NB15 dataset, the use of the genetic algorithm (GA) and particle swarm optimization (PSO), on the contrary, led to a slight increase in the accuracy of the model.

The use of genetic algorithms made it possible to reduce the number of features of the full dataset by an average of 65% (Figure 2), which in turn reduced its size by a factor of two to six (Figure 3).

Figure 2: Comparison of the number of the most informative features when using different genetic selection algorithms

Figure 3: Comparison of the size of datasets when using different genetic algorithms for selecting the most informative features

Decreasing the size of the dataset led to a decrease in the training time of the model. A comparative analysis of the increase in the training speed of the model, in percent, for different datasets and different genetic algorithms for selecting the most informative features is given in Table 1.

As can be seen from Table 1, different algorithms have different effects on the speed of model training for different datasets. However, the best results were obtained with the genetic algorithm: depending on the dataset, the training speed of the models increased by 36.36% to 59.38%, or 47% on average. At the same time, the classification accuracy remained at about 99%.

Table 1. Comparative analysis of the increase in the training speed of the model (%) when using different datasets and different genetic algorithms for selecting the most informative features

Algorithm  UNSW-NB15  KDDCUP99  Kyoto2006  NSL-KDD  CIC-IDS-2017  DoHBrw-2020
GA         50.00      50.00     50.00      38.46    36.36         59.38
PSO        42.86      33.33     50.00      38.46    45.45         37.50
DE         50.00      33.33     33.33      38.46    54.55         34.38
CS         50.00      33.33     50.00      30.77    40.91         59.38
BA         42.86      33.33     50.00      30.77    31.82         46.88
FPA        35.71      16.67     33.33      30.77    45.45         53.13

6. Conclusions

In this work, the use of various genetic and natural algorithms for selecting the most informative features at the data preprocessing stage, in order to increase the efficiency of identification of the state of the computer network, is investigated. The following algorithms were considered: Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Differential Evolution (DE), Cuckoo Search Algorithm (CSA), Bat Algorithm (BA), and Flower Pollination Algorithm (FPA). Their software models were developed in the Jupyter Notebook environment using Python. To assess the quality of data preprocessing, a computer network state identification model based on the Random Forest algorithm was developed. The following datasets were used as source data: UNSW-NB15, KDDCUP99, Kyoto2006, NSL-KDD, CIC-IDS-2017, and DoHBrw-2020, which contain information about the network during normal functioning and during intrusions.

It was found that the use of genetic algorithms made it possible to significantly reduce the number of features of the complete dataset and reduce its size to 63%. Reducing the size of the dataset accelerated the training of the model by up to 59%. At the same time, the accuracy of the model did not decrease significantly. The study showed that the choice of the type of genetic or natural algorithm depends on the input data. In our study, the best results were obtained with the genetic algorithm, which increased the training speed of the models by 47% on average.
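As a quick arithmetic check of the averages quoted above, the rows of Table 1 can be averaged directly. The snippet below only re-computes values already given in Table 1; the dictionary name `table_1` and the helper variables are assumptions of this sketch, and the column order follows the table header.

```python
# Training-speed gains (%) copied from Table 1. Column order:
# UNSW-NB15, KDDCUP99, Kyoto2006, NSL-KDD, CIC-IDS-2017, DoHBrw-2020.
table_1 = {
    "GA":  [50.00, 50.00, 50.00, 38.46, 36.36, 59.38],
    "PSO": [42.86, 33.33, 50.00, 38.46, 45.45, 37.50],
    "DE":  [50.00, 33.33, 33.33, 38.46, 54.55, 34.38],
    "CS":  [50.00, 33.33, 50.00, 30.77, 40.91, 59.38],
    "BA":  [42.86, 33.33, 50.00, 30.77, 31.82, 46.88],
    "FPA": [35.71, 16.67, 33.33, 30.77, 45.45, 53.13],
}

# Mean gain per algorithm across the six datasets.
means = {name: sum(values) / len(values) for name, values in table_1.items()}
best = max(means, key=means.get)

for name, mean in sorted(means.items(), key=lambda item: -item[1]):
    print(f"{name:4s} {mean:6.2f}")
print("best on average:", best)
print("GA range:", min(table_1["GA"]), "to", max(table_1["GA"]))
```

Under these figures the GA row averages about 47.4%, with a range of 36.36% to 59.38%, which agrees with the values reported in the text.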
According to the results of the study, the method of increasing the efficiency of identifying the state of the computer network was further developed by using a procedure based on the genetic algorithm to select the most informative features in the source data.

Thus, the obtained results confirm the effectiveness of using genetic algorithms for selecting informative features at the data preprocessing stage in intrusion detection tasks. Accelerating model training and reducing the amount of data can significantly improve the performance of real-time systems, which is an important direction for further research.

References

[1] Zeeshan Ahmad, Adnan Shahid Khan, Cheah Wai Shiang, Johari Abdullah, Farhan Ahmad, Network intrusion detection system: A systematic study of machine learning and deep learning approaches, Transactions on Emerging Telecommunications Technologies, 32 (2021) e4150. doi:10.1002/ett.4150.
[2] Bouchlaghem Younes, Yassine Akhiat, Souad Amjad, Feature Selection: A Review and Comparative Study, E3S Web of Conferences, 351(1) (2022) 01046. doi:10.1051/e3sconf/202235101046.
[3] Gbashi Ekhlas, Mohammed Bilal, Intrusion Detection System for NSL-KDD Dataset Based on Deep Learning and Recursive Feature Elimination, Engineering and Technology Journal, 39(7) (2021). doi:10.30684/etj.v39i7.1695.
[4] Anita Thengade, Rucha Dondal, Genetic Algorithm – Survey Paper, IJCA Proc National Conference on Recent Trends in Computing, NCRTC, 5 (2012) 25-29.
[5] Eberhart, Shi Yuhui, Particle swarm optimization: Development, applications and resources, Proceedings of the IEEE Conference on Evolutionary Computation, ICEC, 1 (2001) 81-86. doi:10.1109/CEC.2001.934374.
[6] Das Swagatam, Suganthan Ponnuthurai, Differential Evolution: A Survey of the State-of-the-Art, IEEE Trans. Evolutionary Computation, 15 (2011) 4-31.
[7] Amir Gandomi, Xin-She Yang, Amir Alavi, Cuckoo search algorithm: a metaheuristic approach to solve structural optimization problems, Engineering With Computers, 29 (2013) 245-245. doi:10.1007/s00366-012-0308-4.
[8] Xin-She Yang, Bat Algorithm: Literature Review and Applications, International Journal of Bio-Inspired Computation, 5 (2013) 141-149. doi:10.1504/IJBIC.2013.055093.
[9] Yang, XS. (2012). Flower Pollination Algorithm for Global Optimization. In: Durand-Lose, J., Jonoska, N. (eds) Unconventional Computation and Natural Computation. UCNC 2012. Lecture Notes in Computer Science, vol 7445. Springer, Berlin, Heidelberg. doi:10.1007/978-3-642-32894-7_27.
[10] Jafar Mousa, Al-Fawa'reh Mohammad, Jafar Shifa, Analysis and Investigation of Malicious DNS Queries Using CIRA-CIC-DoHBrw-2020 Dataset 2. Manchester Journal of Artificial Intelligence and Applied Sciences (MJAIAS), (2021) 65-70.
[11] Moustafa Nour, Slay Jill, UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set) (2015). doi:10.1109/MilCIS.2015.7348942.
[12] Tavallaee Mahbod, Bagheri Ebrahim, Lu Wei, Ghorbani Ali, A detailed analysis of the KDD CUP 99 data set, IEEE Symposium. Computational Intelligence for Security and Defense Applications, CISDA, 2 (2009). doi:10.1109/CISDA.2009.5356528.
[13] Protic Danijela, Review of KDD Cup '99, NSL-KDD and Kyoto 2006+ datasets, Vojnotehnicki glasnik, 66 (2018) 580-596. doi:10.5937/vojtehg66-16670.
[14] Jose Jinsi, Jose Deepa, Deep learning algorithms for intrusion detection systems in internet of things using CIC-IDS 2017 dataset, International Journal of Electrical and Computer Engineering (IJECE), 13 (2023) 1134-1141. doi:10.11591/ijece.v13i1.pp1134-1141.
[15] Ali Jehad, Khan Rehanullah, Ahmad Nasir, Maqsood Imran, Random Forests and Decision Trees, International Journal of Computer Science Issues (IJCSI), 9 (2012) 272-278.
[16] S. Gavrylenko, V. Chelak and O. Hornostal, "Ensemble Approach Based on Bagging and Boosting for Identification the Computer System State," XXXI International Scientific Symposium Metrology and Metrology Assurance (MMA), Bulgaria, (2021) 1-7. doi:10.1109/MMA52675.2021.9610949.