=Paper=
{{Paper
|id=Vol-3641/short4
|storemode=property
|title=Development of a Method for Increasing the Efficiency of Identifying the State of a Computer Network by Selecting the Most Informative Features in the Output Data
|pdfUrl=https://ceur-ws.org/Vol-3641/short4.pdf
|volume=Vol-3641
|authors=Svitlana Gavrylenko,Vlad Zozulia
|dblpUrl=https://dblp.org/rec/conf/profitai/GavrylenkoZ23
}}
==Development of a Method for Increasing the Efficiency of Identifying the State of a Computer Network by Selecting the Most Informative Features in the Output Data==
Development of a Method for Increasing the Efficiency of
Identifying the State of a Computer Network by Selecting
the Most Informative Features in the Output Data
Svitlana Gavrylenko and Vlad Zozulia
National Technical University "Kharkiv Polytechnic Institute", 2, Kyrpychova str., Kharkiv, 61002, Ukraine
Abstract
The object of the study is the process of identifying the state of the computer network. The subject of
the study is methods of selecting the most informative features in the initial data. The purpose of the
work is to improve the efficiency of identification of the state of the computer network. Methods used:
methods of artificial intelligence, machine learning, genetic and natural algorithms. The data sets
UNSW-NB15, KDDCUP99, Kyoto2006, NSL-KDD, CIC-IDS-2017 and DoHBrw-2020, which contain
information about the network during normal functioning and during intrusions, were used as input data.
Software models based on genetic and natural algorithms were developed and researched: Genetic
Algorithm, Particle Swarm Optimization, Differential Evolution, Cuckoo Search Algorithm, Bat Algorithm
and Flower Pollination Algorithm for selecting the most informative features in the raw data at the stage
of data preprocessing. To assess the quality of data preprocessing, a computer network state
identification model based on the Random Forest algorithm was developed. The following results were
obtained. The use of the above algorithms made it possible to significantly reduce the number of features
of the complete dataset, reduce its size and shorten the training time of the model. At the same time,
the best results were obtained when using the genetic algorithm, which made it possible to increase the
training speed of the models by 47% on average. Conclusions. According to the results of the study, the
method of increasing the efficiency of identification of the state of the computer network by using
genetic methods to select the most informative features in the source data was further developed.
Keywords
intrusion detection systems, computer networks, machine learning, feature informativeness, genetic
and natural algorithms
1. Introduction
With the increasing dependence of the modern world on information technologies, issues of cyber
security are becoming more relevant and important. In this environment, ensuring the security
of networks and data takes on the highest priority. One of the key aspects in the field of cyber
security is intrusion detection [1], the purpose of which is to detect illegal and malicious actions
in computer systems and networks. Effective intrusion detection systems can prevent significant
threats, minimize risks and protect valuable resources.
With the advent of large volumes of data and sophisticated attacks, the issues of efficiency and
accuracy of intrusion detection systems become particularly important. In this context, the
selection of the most informative features when building an intrusion detection model becomes
a decisive factor for achieving optimal results. Reducing the dimensionality of data, while
preserving relevant information, can significantly increase the effectiveness of intrusion
detection and optimize the learning process.
The object of the study is the process of identifying the state of the computer network.
The subject of the study is methods of selecting the most informative features in the initial
data.
ProfIT AI 2023: 3rd International Workshop of IT-professionals on Artificial Intelligence (ProfIT AI 2023), November
20–22, 2023, Waterloo, Canada
svitlana.gavrylenko@khpi.edu.ua (S. Gavrylenko); zozuliavlad@gmail.com (V. Zozulia)
0000-0002-6919-0055 (S. Gavrylenko); 0000-0002-2168-3029 (V. Zozulia)
© 2023 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073
The purpose of the work is to improve the efficiency of identification of the state of the
computer network.
2. Statement of the problem and relevance of the task
The presence of uninformative features in the data set can lead to several problems when building
a model:
1. Increase in size. Uninformative features increase the size of the data, which can make
analysis and modeling tasks more difficult. This can lead to an increase in the computational
complexity and training time of the models.
2. Increased noise. Noninformative features add "noise" to the data because they do not
contain useful information about data dependencies. This can reduce the ability of models to
identify true patterns and make them less robust to changes in the data.
3. Deterioration of model performance. Incorporating uninformative features into a model
can degrade its performance. The model can overfit to uninformative features
and generalize poorly to new data.
4. Increasing data requirements. Having uninformative features means that more data may
be needed to train the model so that it can learn the useful dependencies. This can be
expensive and time consuming.
5. Interpretation difficulties. Non-informative features can complicate model interpretation
and make it difficult to understand the influence of specific features on results.
To solve these problems, feature selection methods are often used, which identify
and remove uninformative features from the data set, namely: feature importance analysis [2],
recursive feature elimination (RFE) [3], wrapper methods, genetic and natural algorithms, etc.
These techniques can improve model performance and facilitate data analysis.
However, feature selection may lead to the loss of some information, and this may reduce the
performance of the model. Feature selection methods can be sensitive to data variability. In some
cases, features that seem uninformative may be useful under other conditions. In addition, many
methods, for example, filter methods, mutual information methods, recursive feature elimination
(RFE), etc., consider features independently of each other and do not take into account their
interaction.
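The point about feature interaction can be illustrated with a small sketch. It is not taken from the study: the XOR toy data and the helper `mutual_information` are assumptions chosen only to show how a per-feature score can miss features that are informative jointly.

```python
from collections import Counter
from itertools import product
from math import log2

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in bits for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Hypothetical toy data: the label is the XOR of two binary features,
# and every feature combination occurs equally often.
rows = list(product([0, 1], repeat=2)) * 25
f1 = [a for a, _ in rows]
f2 = [b for _, b in rows]
label = [a ^ b for a, b in rows]

mi_f1 = mutual_information(f1, label)                   # feature 1 scored alone
mi_f2 = mutual_information(f2, label)                   # feature 2 scored alone
mi_pair = mutual_information(list(zip(f1, f2)), label)  # both features scored jointly
print(mi_f1, mi_f2, mi_pair)
```

Scored one at a time, each feature carries no information about the label, while the pair determines it completely. A method that ranks features individually would discard both.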
Although there are many different feature selection methods, choosing the appropriate
method can be complex and depends on the specific machine learning method. Usually, it is
necessary to investigate several methods and evaluate their impact on model performance in
order to choose the best approach for feature selection. Thus, the further development of
methods for selecting informative features in order to improve the quality of the model is an
urgent task and requires research.
3. Review of scientific publications
The use of genetic and natural algorithms was investigated in this work to evaluate the
informativeness of the model source data and to select it by creating optimal (or appropriate)
subsets of features from a set of available features. The choice of these algorithms is justified by
their ability to take into account the interaction between features and evaluate how combinations of
features affect the predictive ability of the model, which is important when detecting intrusions
into computer networks.
In this work, to evaluate the informativeness of features and their influence on the
efficiency of intrusion detection systems, the use of the following genetic and natural algorithms
is investigated: Genetic Algorithm (GA [4]), Particle Swarm Optimization (PSO [5]),
Differential Evolution (DE [6]), Cuckoo Search Algorithm (CSA [7]), Bat
Algorithm (BA [8]), Flower Pollination Algorithm (FPA [9]).
A Genetic Algorithm is a search technique that solves optimization tasks and is based on the
mechanisms of natural evolution: inheritance, crossover, mutation, and selection. A distinctive
feature of the genetic algorithm is the emphasis on the crossover operator, which recombines
candidate solutions in a way similar to crossover in living nature.
In the simplest case, the optimization task consists in finding the extremum (minimum or
maximum) of the objective function by systematically trying input values from a given set and
calculating its value
$$Y = f(x_1, x_2, \ldots, x_n) \tag{1}$$
where Y is the objective function that depends on the parameters (x1, x2, ..., xn) and, depending on
the problem, tends to the maximum or minimum value.
When using a genetic algorithm, feature selection can be described as
follows:
1. Initialization of the population or selection of the initial population of chromosomes.
2. Assessment of the fitness of each individual in the population based on some criterion,
which can be, for example, the accuracy of the model on test data. The better the individual
is suited to the task, the higher its fitness.
3. Selection of individuals to create a new population, taking into account their fitness.
Individuals with higher fitness have a greater chance of selection.
4. Crossover and mutation. Crossover combines features from two parents, and mutation
randomly adds or removes features.
5. Replacement of the previous generation with a new one. The process of selection, crossing,
mutation and assessment of fitness is repeated for each generation.
6. Completion of the algorithm when a given number of generations has been produced or
another stopping condition is met, such as reaching the desired accuracy or a time limit. The
obtained subset of features is the best one found (not necessarily the global optimum) and can
be used to train the model in order to reduce the dimensionality of the data, improve the
performance of the model and reduce the risk of overfitting.
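The six steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: tournament selection, one-point crossover, elitism, all parameter values and the toy fitness function (with its `INFORMATIVE` set) are assumptions. In a real experiment the fitness would be, for example, classifier accuracy on the selected features.

```python
import random

def genetic_feature_selection(fitness, n_features, pop_size=30, generations=40,
                              crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Search for a binary feature mask (1 = keep the feature) that maximizes fitness."""
    rng = random.Random(seed)
    # Step 1: random initial population of chromosomes (bit masks).
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Step 2: evaluate the fitness of every individual.
        scores = [fitness(ind) for ind in population]

        # Step 3: binary tournament selection, fitter individuals win more often.
        def select():
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return population[a] if scores[a] >= scores[b] else population[b]

        children = []
        while len(children) < pop_size:
            p1, p2 = select(), select()
            # Step 4: one-point crossover followed by bit-flip mutation.
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_features)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        # Step 5: the new generation replaces the old one; remember the best mask seen.
        population = children
        candidate = max(population, key=fitness)
        if fitness(candidate) > fitness(best):
            best = candidate
    # Step 6: stop after a fixed number of generations.
    return best

# Hypothetical toy fitness: reward three "informative" features, penalize mask size.
INFORMATIVE = {0, 3, 7}

def toy_fitness(mask):
    return sum(mask[i] for i in INFORMATIVE) - 0.1 * sum(mask)

best_mask = genetic_feature_selection(toy_fitness, n_features=12)
print(best_mask, toy_fitness(best_mask))
```

The size penalty in the toy fitness plays the role of the dimensionality-reduction goal: without it, selecting every feature would never score worse than selecting a subset.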
The Particle Swarm Algorithm is an optimization algorithm inspired by the collective behavior of
animals in nature, such as flocks of birds, schools of fish, or swarms of insects. It is used to solve
optimization problems where the optimum of a fitness function has to be found in a
multidimensional space by exploring it with a set of "particles" that move through the space in
search of a better solution. PSO optimizes a function by maintaining a population of candidate
solutions (swarm particles) and moving these particles in the solution space according to a simple
formula. The movement of each particle is guided by the best position it has found itself and by the
best position found by the swarm, both of which are updated whenever particles find more
favorable positions.
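The "simple formula" can be sketched as the commonly used inertia-weight velocity update. This is an illustration under assumptions, not the configuration used in the study: the coefficients `inertia`, `c1`, `c2`, the bounds and the sphere test function are all hypothetical choices.

```python
import random

def pso_minimize(objective, dim, n_particles=20, iterations=100,
                 inertia=0.7, c1=1.5, c2=1.5, bounds=(-5.0, 5.0), seed=0):
    """Minimize `objective` with a basic global-best particle swarm."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                    # best position of each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # best position of the swarm
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull toward own best + pull toward swarm best.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move the particle and keep it inside the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def sphere(x):
    """Toy objective with its minimum at the origin."""
    return sum(v * v for v in x)

best_pos, best_val = pso_minimize(sphere, dim=3)
print(best_pos, best_val)
```

For feature selection the continuous positions would still have to be mapped to a subset of features, for instance by thresholding each coordinate. That mapping is not shown here.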
The Differential Evolution algorithm is a method of multidimensional mathematical
optimization that belongs to the class of stochastic optimization algorithms (that is, it works with
the use of random numbers). It simulates the main evolutionary processes in living nature:
crossover, mutation, selection. The method is intended for finding the global minimum (or
maximum) of non-differentiable, nonlinear, multimodal (possibly with a large number of local
extrema) functions of many variables. The method is easy to implement and use (it contains few
control parameters that require tuning) and is easily parallelized.
The differential evolution algorithm is designed to solve unconstrained optimization problems.
It is based on the ideas of evolution and is used to find the global optimum in
multidimensional spaces.
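One common variant of differential evolution, usually called DE/rand/1/bin, can be sketched as follows. The source does not say which variant was used, so the scheme, the scale factor `F`, the crossover rate `CR`, the bounds and the toy objective are assumptions made for illustration.

```python
import random

def differential_evolution(objective, dim, pop_size=20, generations=100,
                           F=0.8, CR=0.9, bounds=(-5.0, 5.0), seed=0):
    """Minimize `objective` with the DE/rand/1/bin scheme."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    cost = [objective(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Mutation: combine three distinct individuals other than the target i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated coordinate
            trial = []
            for d in range(dim):
                # Binomial crossover between the mutant and the target vector.
                if d == j_rand or rng.random() < CR:
                    v = pop[a][d] + F * (pop[b][d] - pop[c][d])
                    trial.append(min(hi, max(lo, v)))
                else:
                    trial.append(pop[i][d])
            # Selection: the trial replaces the target only if it is not worse.
            trial_cost = objective(trial)
            if trial_cost <= cost[i]:
                pop[i], cost[i] = trial, trial_cost
    best = min(range(pop_size), key=lambda i: cost[i])
    return pop[best], cost[best]

def shifted_sphere(x):
    """Toy objective with its minimum at (1, 1, ..., 1)."""
    return sum((v - 1.0) ** 2 for v in x)

de_pos, de_val = differential_evolution(shifted_sphere, dim=3)
print(de_pos, de_val)
```

The three stages named in the text appear in order inside the inner loop: mutation builds a mutant from population differences, crossover mixes it with the current individual, and selection keeps the better of the two.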
The Cuckoo Search Algorithm is a metaheuristic for optimization problems that simulates the
brood parasitism of cuckoos, an aggressive breeding strategy in which eggs are laid in the nests
of other birds. The search is organized around a set of nests, where each nest represents a
potential solution to the optimization problem and a cuckoo egg is a new candidate solution.
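A simplified cuckoo search can be sketched as below. It is an assumption-laden illustration rather than the variant used in the study: new eggs are generated by Lévy flights (step lengths drawn with Mantegna's method), and the step scale `alpha`, the abandonment fraction `pa`, the bounds and the toy objective are hypothetical values.

```python
import math
import random

def levy_step(rng, beta=1.5):
    """Draw one heavy-tailed step length using Mantegna's method."""
    sigma_u = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
               / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.gauss(0.0, sigma_u)
    v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)

def cuckoo_search(objective, dim, n_nests=15, iterations=200, pa=0.25,
                  alpha=0.1, bounds=(-5.0, 5.0), seed=0):
    """Minimize `objective` with a simplified cuckoo search."""
    rng = random.Random(seed)
    lo, hi = bounds

    def random_nest():
        return [rng.uniform(lo, hi) for _ in range(dim)]

    nests = [random_nest() for _ in range(n_nests)]
    cost = [objective(n) for n in nests]
    n_drop = int(pa * n_nests)  # number of worst nests abandoned per iteration
    for _ in range(iterations):
        # A cuckoo lays an egg: a Levy flight starting from a randomly chosen nest.
        i = rng.randrange(n_nests)
        egg = [min(hi, max(lo, x + alpha * levy_step(rng))) for x in nests[i]]
        egg_cost = objective(egg)
        # The egg replaces a randomly chosen nest if it is better.
        j = rng.randrange(n_nests)
        if egg_cost < cost[j]:
            nests[j], cost[j] = egg, egg_cost
        # The worst nests are abandoned and rebuilt at random positions.
        worst = sorted(range(n_nests), key=lambda k: cost[k], reverse=True)[:n_drop]
        for k in worst:
            nests[k] = random_nest()
            cost[k] = objective(nests[k])
    best = min(range(n_nests), key=lambda k: cost[k])
    return nests[best], cost[best]

def sphere(x):
    """Toy objective with its minimum at the origin."""
    return sum(v * v for v in x)

cs_pos, cs_val = cuckoo_search(sphere, dim=3)
print(cs_pos, cs_val)
```

Because only the worst nests are rebuilt, the best nest is never discarded, so the best cost cannot get worse from one iteration to the next.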
The Bat Algorithm is a metaheuristic global optimization algorithm. It simulates the process
of searching for food by bats, taking into account their ability to echolocate with different pulse
rates and loudness.
In the basic bat algorithm, each bat is treated as a massless, sizeless particle
representing a feasible solution in the solution space. Each bat has a fitness function value, and
the current best individual is determined by comparing these values. Then the acoustic wave
frequency, velocity, pulse emission rate, and loudness of each bat in the population are updated,
and the iterative evolution continues so that the current best solution is gradually refined
toward the global optimum. At each iteration the algorithm updates the frequency, velocity and
position of each bat.
The standard algorithm requires five basic parameters: frequency, loudness, pulse emission rate,
and the loudness and pulse rate coefficients. Frequency is used to balance the effect of the best
historical position on the current position. An individual bat will search far from the group's
best historical position when the search frequency range is large, and vice versa.
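For reference, the update rules of the basic bat algorithm are usually written as follows. The notation is an assumption, since the source gives no formulas: $x_i^t$ and $v_i^t$ are the position and velocity of bat $i$ at iteration $t$, $x_*$ is the current best position, $f_i$ the frequency, $A_i$ the loudness, $r_i$ the pulse emission rate, and $\alpha$, $\gamma$ the loudness and pulse rate coefficients.

```latex
\begin{align}
f_i &= f_{\min} + (f_{\max} - f_{\min})\,\beta, \qquad \beta \sim U[0,1] \\
v_i^{t+1} &= v_i^{t} + \left(x_i^{t} - x_*\right) f_i \\
x_i^{t+1} &= x_i^{t} + v_i^{t+1} \\
A_i^{t+1} &= \alpha A_i^{t}, \qquad r_i^{t+1} = r_i^{0}\left[1 - \exp(-\gamma t)\right]
\end{align}
```

In this form a wide frequency range $[f_{\min}, f_{\max}]$ allows large velocity changes and therefore a wider search, which matches the role of frequency described above.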
The Flower Pollination Algorithm is a metaheuristic optimization algorithm
inspired by the pollination process of flowering plants. It models the transfer of pollen
between flowers, where each flower (or pollen grain) represents a candidate solution.
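A common formulation of the flower pollination update combines a global and a local step. It is given here as a sketch only: the symbols are assumptions, and sign and scaling conventions differ between presentations. $x_i^t$ is solution $i$ at iteration $t$, $g_*$ is the best solution found so far, $L(\lambda)$ is a step drawn from a Lévy distribution, $\gamma$ is a scaling factor, and $x_j^t$, $x_k^t$ are two randomly chosen solutions.

```latex
\begin{align}
\text{global pollination:}\quad x_i^{t+1} &= x_i^{t} + \gamma\, L(\lambda)\left(g_* - x_i^{t}\right) \\
\text{local pollination:}\quad x_i^{t+1} &= x_i^{t} + \epsilon \left(x_j^{t} - x_k^{t}\right), \qquad \epsilon \sim U[0,1]
\end{align}
```

A switch probability $p \in [0,1]$ decides at each step whether the global or the local rule is applied.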
4. Source data
Six different datasets covering various scenarios of network activity and types of attacks were
used as raw data:
1. The DoHBrw-2020 dataset [10] is focused on the analysis of network activity related to
the DNS over HTTPS (DoH) protocol. The data was obtained by analyzing
traffic that uses the DoH protocol. DoHBrw-2020 provides data for
analysis of attacks masquerading as encrypted traffic.
2. The UNSW-NB15 dataset [11] was developed as a comprehensive dataset for evaluating
network intrusion detection. It contains different types of attacks and normal network activity.
Data were collected in the university network over a period of five months. Network activity
logs are included, and artificial attacks are created to cover scenarios more fully. UNSW-NB15
contains over 2 million records and covers attacks of varying complexity.
3. The KDDCUP99 dataset [12] was prepared for the KDD Cup 1999 intrusion detection
competition. It contains different types of attacks and normal activity. The data was collected
in a real network environment as part of the KDD Cup 1999 and represents a record of
network activity, including various types of attacks. KDDCUP99 has an uneven distribution of
classes and contains repeated records of some attacks.
4. The Kyoto2006 dataset [13] specializes in the analysis of DoS type attacks and forged
packets. It includes records of network activity associated with such attacks. The data was
collected in a real network environment using special sensors to detect DoS attacks and packet
spoofing. Kyoto2006 provides unique data for packet-level attack analysis.
5. The NSL-KDD dataset [13] was created on the basis of the KDDCUP99 dataset in order to
eliminate its shortcomings and add other types of attacks. NSL-KDD includes attacks from the
original KDDCUP99 dataset, as well as added artificial attacks and normal activity. NSL-KDD
provides a more even distribution of classes and a variety of attack types.
6. The CIC-IDS-2017 dataset [14] is designed to detect modern attacks, including those
targeting applications. The data was collected in a real network environment and includes a
variety of attacks from both traditional and emerging threat types. CIC-IDS-2017 provides up-
to-date data for analysis and detection of modern threats.
Each of these datasets provides unique opportunities to investigate and analyze the
performance of genetic algorithms for selecting informative features in the context of intrusion
detection.
5. Experimental part
This section presents the results of experiments using six different genetic and natural
algorithms for selecting informative features in the tasks of detecting intrusions into cyber
security systems.
Feature analysis software was developed in Python in the Jupyter Notebook environment. The
following feature selection methods were studied: Genetic Algorithm, Particle Swarm
Optimization, Differential Evolution, Cuckoo Search Algorithm, Bat Algorithm, Flower Pollination
Algorithm.
The Random Forest method [15,16] was used as the basic data classification model.
The experiments were carried out by iterating over the aforementioned datasets and genetic
and natural algorithms. Each experiment included the following steps:
1. Selection of a set of input data (full dataset).
2. Selection of the basic algorithm and model training, evaluation of the informativeness of
the features of the complete dataset.
3. Removal of non-informative features determined by the algorithm in the previous step
and formation of a shortened dataset.
4. Estimation of the size of the full and reduced dataset.
5. Training of a classifier based on the Random Forest algorithm using full and reduced
datasets.
6. Assessment of classification accuracy on full and reduced datasets.
7. Estimation of classifier training time on full and reduced datasets.
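Steps 4 to 7 amount to training the same classifier twice and comparing feature count, accuracy and training time. A dependency-free sketch is shown below. Everything in it is an assumption made for illustration: `run_experiment` is a hypothetical helper, a nearest-centroid classifier stands in for the Random Forest, and the toy data is synthetic.

```python
import time

def run_experiment(train, predict, X_train, y_train, X_test, y_test, keep):
    """Train on all columns and on the columns in `keep`, then compare the two runs."""
    def evaluate(cols):
        def project(X):
            return [[row[c] for c in cols] for row in X]
        Xtr, Xte = project(X_train), project(X_test)
        start = time.perf_counter()
        model = train(Xtr, y_train)                     # steps 5 and 7: timed training
        elapsed = time.perf_counter() - start
        preds = predict(model, Xte)
        accuracy = sum(p == t for p, t in zip(preds, y_test)) / len(y_test)  # step 6
        return {"n_features": len(cols), "accuracy": accuracy, "train_time": elapsed}
    all_cols = list(range(len(X_train[0])))
    return {"full": evaluate(all_cols), "reduced": evaluate(keep)}

# Stand-in classifier (NOT a Random Forest): nearest class centroid.
def train_centroid(X, y):
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for d, v in enumerate(row):
            acc[d] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict_centroid(model, X):
    def nearest(row):
        return min(model, key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, model[lab])))
    return [nearest(row) for row in X]

# Synthetic data: column 0 determines the label, columns 1-3 are uninformative.
X = [[float(i % 2), (i * 3 % 5) / 10, (i * 7 % 4) / 10, 0.2] for i in range(40)]
y = [i % 2 for i in range(40)]
result = run_experiment(train_centroid, predict_centroid,
                        X[:30], y[:30], X[30:], y[30:], keep=[0])
print(result)
```

With scikit-learn available, `train` and `predict` could instead wrap the `fit` and `predict` methods of `sklearn.ensemble.RandomForestClassifier`. The comparison logic would stay the same.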
According to the results of the study, the following quality indicators of the model were
obtained (Figure 1).
Figure 1: Comparison of model accuracy when using different datasets and different genetic
algorithms for selecting the most informative features
As can be seen from Fig. 1, the use of a reduced dataset as input data of the model did not
lead to a significant deterioration of its accuracy in comparison with the full dataset. For the
UNSW-NB15 data set, the use of the genetic algorithm (GA) and the particle swarm algorithm
(PSO), on the contrary, led to a slight increase in the accuracy of the model.
The use of genetic and natural algorithms made it possible to reduce the number of features of the
complete dataset by an average of 65% (Figure 2), which made it possible to reduce its size by a
factor of two to six (Figure 3).
Figure 2: Comparison of the number of the most informative features when using different genetic
selection algorithms
Figure 3: Comparison of the size of datasets when using different genetic algorithms for selecting
the most informative features
Decreasing the size of the dataset led to a decrease in the training time of the model. A
comparative analysis of the increase in the training speed of the model, in percent, when
using different datasets and different genetic and natural algorithms for selecting the most
informative features is given in Table 1. As can be seen from Table 1, different algorithms have
different effects on the training speed of the model when using different datasets.
However, the best results were obtained when using the genetic algorithm. With it, the gain in
training speed ranged from 36.36% to 59.38% depending on the dataset and averaged about
47%. At the same time, the classification accuracy remained at the level of about 99%.
Table 1
Comparative analysis of the increase in the training speed of the model (%) when using different
datasets and different genetic and natural algorithms for selecting the most informative features
Algorithm  UNSW-NB15  KDDCUP99  Kyoto2006  NSL-KDD  CIC-IDS-2017  DoHBrw-2020
GA         50.00      50.00     50.00      38.46    36.36         59.38
PSO        42.86      33.33     50.00      38.46    45.45         37.50
DE         50.00      33.33     33.33      38.46    54.55         34.38
CSA        50.00      33.33     50.00      30.77    40.91         59.38
BA         42.86      33.33     50.00      30.77    31.82         46.88
FPA        35.71      16.67     33.33      30.77    45.45         53.13
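The per-algorithm averages behind the 47% figure can be recomputed directly from Table 1. The short sketch below only restates the table values, in the column order UNSW-NB15, KDDCUP99, Kyoto2006, NSL-KDD, CIC-IDS-2017, DoHBrw-2020. The variable names are illustrative.

```python
# Values from Table 1: increase in training speed (%), one list per algorithm.
speedup = {
    "GA":  [50.00, 50.00, 50.00, 38.46, 36.36, 59.38],
    "PSO": [42.86, 33.33, 50.00, 38.46, 45.45, 37.50],
    "DE":  [50.00, 33.33, 33.33, 38.46, 54.55, 34.38],
    "CSA": [50.00, 33.33, 50.00, 30.77, 40.91, 59.38],
    "BA":  [42.86, 33.33, 50.00, 30.77, 31.82, 46.88],
    "FPA": [35.71, 16.67, 33.33, 30.77, 45.45, 53.13],
}

# Mean gain per algorithm across the six datasets.
means = {alg: sum(vals) / len(vals) for alg, vals in speedup.items()}
best_algorithm = max(means, key=means.get)

for alg, m in sorted(means.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{alg:4s} {m:.2f}")
print("best on average:", best_algorithm)
```

The GA row averages about 47.4%, which is consistent with the 47% reported in the text, and it is the highest of the six row means.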
6. Conclusions
In this work, the use of various genetic and natural algorithms for the selection of the most
informative features in order to increase the efficiency of identification of the state of the
computer network is investigated at the stage of data preprocessing. Considered: Genetic
Algorithm (GA), Particle Swarm Optimization (PSO), Differential Evolution (DE), Cuckoo Search
Algorithm (CSA), Bat Algorithm (BA), Flower Pollination Algorithm (FPA). Their software models
were developed in the Jupyter Notebook environment using Python. To assess the quality of data
preprocessing, a computer network state identification model based on the Random Forest
algorithm was developed.
The following datasets were used as source data: UNSW-NB15, KDDCUP99,
Kyoto2006, NSL-KDD, CIC-IDS-2017 and DoHBrw-2020, which contain information about
the network during normal functioning and during intrusions.
It was found that the use of genetic algorithms made it possible to significantly reduce the
number of features of the complete dataset and reduce its size to 63%. Reducing the size of the
dataset shortened the training of the model by up to 59%. At the same time, the accuracy
of the model did not decrease significantly.
The study has shown that the choice of genetic or natural algorithm type depends on the input
data. In our study, the best results were obtained when using a genetic algorithm, which made it
possible to increase the training speed of models by 47% on average.
According to the results of the study, the method of increasing the efficiency of identifying the
state of the computer network by using the procedure for selecting the most informative features
in the source data based on the genetic algorithm was further developed.
Thus, the obtained results testify to the effectiveness of using genetic algorithms for the
selection of informative features at the stage of data pre-processing in the tasks of detecting
intrusions into cyber security systems. Accelerating model training and reducing the amount of
data can significantly improve the performance of real-time systems, which is an important
direction for further research.
References
[1] Zeeshan Ahmad, Adnan Shahid Khan, Cheah Wai Shiang, Johari Abdullah, Farhan Ahmad,
Network intrusion detection system: A systematic study of machine learning and deep
learning approaches. Transactions on Emerging Telecommunications Technologies. (2021)
32:e4150. doi:10.1002/ett.4150.
[2] Bouchlaghem Younes, Yassine Akhiat, Souad Amjad, Feature Selection: A Review and
Comparative Study. E3S Web of Conferences (2022) 351(1):01046.
doi:10.1051/e3sconf/202235101046.
[3] Gbashi Ekhlas, Mohammed, Bilal, Intrusion Detection System for NSL-KDD Dataset Based on
Deep Learning and Recursive Feature Elimination, Engineering and Technology Journal,
39(7), (2021). doi:10.30684/etj.v39i7.1695.
[4] Anita Thengade, Rucha Dondal, Genetic Algorithm – Survey Paper, IJCA Proc National
Conference on Recent Trends in Computing, NCRTC. 5, (2012) 25-29.
[5] Eberhart, Shi Yuhui, Particle swarm optimization: Development, applications and resources.
Proceedings of the IEEE Conference on Evolutionary Computation, ICEC, 1 (2001) 81 – 86.
doi:10.1109/CEC.2001.934374.
[6] Das Swagatam , Suganthan Ponnuthurai, Differential Evolution: A Survey of the State-of-the-
Art. IEEE Trans. Evolutionary Computation, 15 (2011) 4-31.
[7] Amir Gandomi, Xin-She Yang, Amir Alavi, Cuckoo search algorithm: a metaheuristic approach
to solve structural optimization problems, Engineering With Computers, 29 (2013) 245-245.
doi:10.1007/s00366-012-0308-4.
[8] Xin-She Yang, Bat Algorithm: Literature Review and Applications. International Journal of
Bio-Inspired Computation, 5 (2013) 141-149. doi:10.1504/IJBIC.2013.055093.
[9] Yang, XS. (2012). Flower Pollination Algorithm for Global Optimization. In: Durand-Lose, J.,
Jonoska, N. (eds) Unconventional Computation and Natural Computation. UCNC 2012.
Lecture Notes in Computer Science, vol 7445. Springer, Berlin, Heidelberg. doi:10.1007/978-
3-642-32894-7_27.
[10] Jafar Mousa, Al-Fawa'reh Mohammad, Jafar Shifa, Analysis and Investigation of Malicious
DNS Queries Using CIRA-CIC-DoHBrw-2020 Dataset 2. Manchester Journal of Artificial
Intelligence and Applied Sciences (MJAIAS), (2021) 65-70.
[11] Moustafa Nour, Slay Jill. UNSW-NB15: a comprehensive data set for network intrusion
detection systems (UNSW-NB15 network data set) (2015).
doi:10.1109/MilCIS.2015.7348942.
[12] Tavallaee Mahbod, Bagheri Ebrahim, Lu Wei, Ghorbani Ali, A detailed analysis of the KDD
CUP 99 data set. IEEE Symposium. Computational Intelligence for Security and Defense
Applications, CISDA. 2 (2009). doi:10.1109/CISDA.2009.5356528.
[13] Protic, Danijela. Review of KDD Cup '99, NSL-KDD and Kyoto 2006+ datasets. Vojnotehnicki
glasnik. 66 (2018) 580-596. doi:10.5937/vojtehg66-16670.
[14] Jose Jinsi, Jose Deepa, Deep learning algorithms for intrusion detection systems in internet
of things using CIC-IDS 2017 dataset, International Journal of Electrical and Computer
Engineering (IJECE) 13 (2023). 1134-1141. doi:10.11591/ijece.v13i1.pp1134-1141.
[15] Ali Jehad, Khan Rehanullah, Ahmad Nasir, Maqsood Imran, Random Forests and Decision
Trees, International Journal of Computer Science Issues (IJCSI), 9 (2012) 272-278.
[16] S. Gavrylenko, V. Chelak and O. Hornostal, "Ensemble Approach Based on Bagging and
Boosting for Identification the Computer System State," XXXI International Scientific
Symposium Metrology and Metrology Assurance (MMA), Bulgaria, (2021) 1-7, doi:
10.1109/MMA52675.2021.9610949.