=Paper=
{{Paper
|id=Vol-3188/short10
|storemode=property
|title=Using Machine Learning Techniques to Increase the Effectiveness of Cybersecurity (short paper)
|pdfUrl=https://ceur-ws.org/Vol-3188/short10.pdf
|volume=Vol-3188
|authors=Vasyl Buhas,Ihor Ponomarenko,Vаlеriy Bugas,Andrii Ramskyi,Volodymyr Sokolov
|dblpUrl=https://dblp.org/rec/conf/cpits/BuhasPBRS21
}}
==Using Machine Learning Techniques to Increase the Effectiveness of Cybersecurity (short paper)==
Vasyl Buhas¹, Ihor Ponomarenko¹, Vаlеriy Bugas¹, Andrii Ramskyi², and Volodymyr Sokolov²

¹ Kyiv National University of Technologies and Design, 2 Nemyrovycha-Danchenka str., Kyiv, 01011, Ukraine
² Borys Grinchenko Kyiv University, 18/2 Bulvarno-Kudriavska str., Kyiv, 04053, Ukraine

Abstract

In today's world, many organizations generate and accumulate large amounts of information that is of great value to its owners and that attackers regard as a valuable resource for enrichment. Any data storage system has vulnerabilities that can be exploited during cyberattacks. Because no system can be made fully secure against unauthorized access to data, companies have to respond continuously to evolving techniques of information theft by developing more effective methods of identifying and combating cyberattacks. This article examines the use of machine learning methods to identify illegal third-party access to the information of individuals and legal entities, which causes economic and reputational damage. The study considers methods of processing various types of data (numerical values, textual information, video and audio content, images) that can be used to build an effective cybersecurity system. A high level of identification of unauthorized access to data and of protection against data theft can be achieved by implementing modern machine learning approaches, which are constantly improved through innovative data processing algorithms and the use of powerful cloud computing services, and which serve as a counter to rapidly evolving attack technologies.

Keywords: Cybersecurity, machine learning, neural networks, image recognition, optimization, information, dataset.

===1. Introduction===
In the context of digitalization, the number of Internet users is growing rapidly both in the private sector and in the business environment.
The shift to the digital environment is associated with the intensive development of advanced information technologies that simplify economic, technological and social processes. Accordingly, the demand for innovative products is growing. In this context, cloud technologies deserve special attention, since they make it possible to accumulate large amounts of structured, semi-structured and unstructured information around the clock (24/7) [1]. At the same time, the methods of processing the generated information are actively evolving: powerful, high-capacity servers make it possible to speed up data processing through cloud computing. Significant competition in the market of data collection and processing makes cloud services with appropriate software solutions more accessible to most users. Whereas the early users of this technology were only transnational corporations (TNCs) and organizations supported by national governments, today a large number of small and medium-sized companies can use cloud services to optimize their business processes. Owing to the variety of such services, individual users also rely on cloud technologies to perform certain tasks and to access certain services (e-mail, mobile banking, personal accounts, etc.) [2].

CPITS-II-2021: Cybersecurity Providing in Information and Telecommunication Systems, October 26, 2021, Kyiv, Ukraine
EMAIL: buhas.vv@knutd.edu.ua (V. Buhas); ponomarenko.iv@knutd.com.ua (I. Ponomarenko); bugas.vv@knutd.edu.ua (V. Bugas); a.ramskyi@kubg.edu.ua (A. Ramskyi); v.sokolov@kubg.edu.ua (V. Sokolov)
ORCID: 0000-0001-8317-3350 (V. Buhas); 0000-0003-3532-8332 (I. Ponomarenko); 0000-0003-1046-9737 (V. Bugas); 0000-0001-7368-697X (A. Ramskyi); 0000-0002-9349-7946 (V. Sokolov)
© 2022 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

In many cases, the information generated in cloud services is of interest not only to data owners and related institutions, but also to third parties who seek illegal benefits from acquiring personal information. The purpose of data theft may be to obtain company trade secrets, to access private user information, to misappropriate funds from bank accounts, to interfere with software in order to disrupt the management and production processes of private and public organizations, to access state secrets, to gain illegal access to servers supporting the web resources of public administration bodies, etc. In every case, it is illegal access to information aimed at obtaining certain benefits and causing significant harm to data owners. As digitalization intensifies, the number of fraudulent actions with information in the global environment is gradually increasing, causing significant damage to international and national economic systems.

The spread of cybercrime compels national and international institutions to develop and implement strategies that ensure a sufficient level of safety of valuable information resources, and to involve relevant specialists and specialized software solutions. Building an effective cybersecurity system minimizes the risk of information loss and of illegally obtained data being used to cause economic and reputational losses to owners [3]. Keeping data safe requires constant improvement of the technologies and techniques used, because as the information environment evolves and new approaches emerge, fraudsters obtain more effective tools for illegal access to data.
Periodic cases of successful information theft force companies to act quickly and eliminate the vulnerabilities identified in their protection procedures. Improving a cybersecurity system involves various approaches implemented in the software, hardware and organizational solutions of specialized companies. One effective way to protect data is to use machine learning methods, which are applied as part of data science. Specialized algorithms make it possible to identify fraudulent actions with a high probability and to limit unauthorized access to private information. The prospects of machine learning as an important element of cybersecurity rest on the fact that models can be improved as relevant information accumulates, which increases the accuracy of “friend or foe” identification and of the detection of illegal actions. Various data (numerical values, textual information, video and audio content, images) can be used as sources of information for constructing models [4].

===2. The Aim===
Detecting cybercrime in the digital environment and addressing existing challenges is vital to ensuring the stability of individual institutions and of the system as a whole. Identifying serious threats requires significant financial resources to support scientific and practical developments in the field of cybersecurity. Modern cybercriminals use approaches such as phishing, host intrusion, and malware integration to commit illegal acts [5]. In response to these needs, a large number of methods are being developed and numerous scientific papers are being published on protecting information through innovation. An analysis of publications shows significant interest among scientists in using machine learning methods to build modern and effective information security systems.
The presented research is devoted to the study of advanced machine learning methods whose introduction makes it possible to increase the efficiency and resilience of cybersecurity systems. Considerable importance is attached to the use of various types of data in building machine learning models. In particular, attention is paid to the use of images to build neural networks as a way to raise the effectiveness of cybersecurity. Research on using machine learning methods to improve the efficiency of information security systems should be conducted on a continuous basis. The reliability and prospects of this approach were confirmed by a group of scientists who implemented algorithms that detect fake face images created by fraudsters using ceramic masks with foil in certain areas to imitate the uneven heat distribution across different parts of the head (spoofing) [6]. In addition, scientists are addressing the problem of countering Deepfake technology, which makes it possible to create "live" videos that are close to the original from original photos of the owners of certain information resources. As Deepfake algorithms improve, it becomes possible to mislead a facial identification system and gain access to valuable information resources. Accordingly, scientists and practitioners will have to keep testing various models for detecting fake photos and videos [7].

===3. Models and Methods===
Machine learning methods have become widespread in the vast majority of areas of human activity because they can process large amounts of information and their modeling results can be used to optimize the relevant processes. Constructing an effective cybersecurity system involves implementing a comprehensive strategy, which should include the use of machine learning methods to identify illegal interference in the information system.
Depending on the peculiarities of the anti-fraud system being built and the specifics of the available information, data science specialists use a variety of machine learning methods. To obtain highly reliable results, several methods can be applied, and the best one for a particular situation is selected not only for its accuracy but also for the time spent on modeling. Some of the most common machine learning techniques used to improve cybersecurity are described below.

===3.1. Using Neural Networks in Image Recognition===
Images are now used very often for cybersecurity because of the widespread use of webcams, smartphones and other specialized devices for tracking and identification. Developing effective neural networks on images involves several successive stages, and a quality result is possible only if each of them is carried out accurately. This machine learning model requires the graphic image to be transformed into digital form as a basis for modeling. In practice, there are two main approaches to image transformation: the 2D function and the 3D function. The 2D function is a function of the spatial coordinates x and y. For calculations, the digital image is represented as the amplitude of the function F at finite values of x and y. When the 3D function is used to transform images, the coordinates x, y and z are used, where the third coordinate can be read as the index of the color channel. This approach to digital image conversion is called RGB (Red, Green, Blue). The transformation of data into digital form has its own specifics and can in some cases negatively affect the modeling results. A key disadvantage of the RGB color space is the inability to separate color data from other information, such as brightness.
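The drawback of RGB named above can be illustrated with a minimal sketch. It uses Python's standard `colorsys` module to compare the RGB representation with the HSV (Hue, Saturation, Value) representation of two pixels; the pixel values and the helper name `to_hsv` are illustrative assumptions, not data or code from the study.

```python
import colorsys


def to_hsv(rgb):
    """Convert an 8-bit RGB pixel to an (h, s, v) triple with components in [0, 1]."""
    r, g, b = (channel / 255.0 for channel in rgb)
    return colorsys.rgb_to_hsv(r, g, b)


# Hypothetical pixels: the same red color at full and at half brightness.
bright_red = (200, 40, 40)
dark_red = (100, 20, 20)

h1, s1, v1 = to_hsv(bright_red)
h2, s2, v2 = to_hsv(dark_red)

# In RGB all three channels differ between the two pixels.
# In HSV only the value (brightness) channel differs; hue and saturation agree.
print("bright:", (h1, s1, v1))
print("dark:  ", (h2, s2, v2))
```

In this sketch a change in brightness alters every RGB channel, while in HSV it is confined to the V channel, so the color information can be handled separately.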
The use of three channels in the RGB approach also significantly slows down computation within the corresponding neural network. The HSV (Hue, Saturation, Value) color space does not have this drawback, because it isolates color information in a single H (Hue) channel [9]. Practice with the RGB and HSV approaches shows that the method of digitization and the corresponding neural network should be chosen on the basis of a set of factors: in some cases neural models achieve a better result with RGB, and in others with HSV.

Alongside neural networks, the set of image processing algorithms used to build an effective cybersecurity system includes tools such as edge detection, the Fourier transform, Gaussian image processing, morphological image processing, and wavelet image processing [10]. Depending on the specifics of the problem, the features of the image dataset, and the competence of the analyst, different algorithms can be used, but at the present stage neural networks have the greatest prospects for graphic image recognition.

Neural networks are built from basic information processing elements (neurons), which are combined into a complex system with a certain number of layers. The principle of operation is modeled on the functioning of the human brain: data are obtained from the environment, and modeling and learning to identify implicit patterns in the information are realized through the connections between neurons. The last stage involves obtaining predicted values or assigning an object to a specific group. A typical neural network model has the following layers:
* An input layer.
* A hidden layer.
* An output layer.
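The three-layer structure listed above can be sketched as a single forward pass in plain Python. This is a minimal illustration under stated assumptions, not the authors' model: the layer sizes, the random untrained weights, the bias terms, and the choice of ReLU for the hidden layer and softmax for the output layer are all assumptions made for the example.

```python
import math
import random


def relu(x):
    """Rectified linear unit: passes positive values, blocks negative ones."""
    return max(0.0, x)


def softmax(values):
    """Turn raw output scores into probabilities that sum to 1."""
    shift = max(values)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """Weighted sum for every neuron of a layer: sum_i(w_i * a_i) + b."""
    return [sum(w * a for w, a in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(pixels, w_hidden, b_hidden, w_out, b_out):
    """Input layer -> hidden layer (ReLU) -> output layer (softmax)."""
    hidden = [relu(z) for z in dense(pixels, w_hidden, b_hidden)]
    return softmax(dense(hidden, w_out, b_out))


random.seed(0)  # fixed seed so the illustrative weights are reproducible
n_in, n_hidden, n_out = 4, 3, 2  # hypothetical layer sizes

w_hidden = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
b_out = [0.0] * n_out

pixels = [0.0, 0.5, 1.0, 0.25]  # hypothetical normalized pixel intensities
probs = forward(pixels, w_hidden, b_hidden, w_out, b_out)

# The predicted class is the output neuron with the maximum probability.
predicted = max(range(n_out), key=probs.__getitem__)
print(probs, predicted)
```

Because the weights are random and untrained, the prediction itself is meaningless; the sketch only shows how data flow from the input layer through the hidden layer to the output layer.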
The basic structure of the neural network contains input layers, hidden layers and output layers (Fig. 1). The input layers are used to feed the primary transformed information into the neural network. At the next stage, the calculation is carried out: a certain number of neurons are activated in accordance with the selected probabilities and the number of hidden layers used. The choice of an architecture with a certain number of hidden layers is based on the analyst's experience and the specifics of the primary data. During training, a number of iterations are performed, which make it possible to track the accuracy of the implemented algorithm and to adjust the number of hidden layers until an acceptable level of model quality is achieved.

Figure 1: Basic structure of the neural network [11]

The algorithm for implementing the neural network in image processing is as follows:
1. The image is divided into pixels according to the chosen approach; the pixels act as neurons of the first layer.
2. Each channel (a connection between neurons) is assigned a certain value from 0 to 1 in accordance with scientifically sound approaches; this value acts as its weight.
3. Weighted sums are calculated by multiplying the weights by the corresponding input data; the calculated value is used as input to the hidden layers of the neural network.
4. An activation function, chosen on the basis of the specifics of the data, is applied to this value and determines whether the neuron is activated or the channel is blocked.
5. Activated neurons propagate data to the next layers of the neural network.
6. The output neuron is selected automatically according to the maximum probability.
7. To assess optimality, the error is calculated by subtracting the actual output from the expected value. To approach the optimal result, the calculated values are propagated backward through the network to the previous layers.
8.
The learning process involves a certain number of iterations of forward and backward propagation of data, with the weights changing at each stage. The neural network stops learning once the optimal value is reached.

Fig. 2 illustrates a typical operation for a single neuron that is part of a neural network, where a_i is the i-th input, w_i is the i-th weight, z is the output, and g is a specific activation function.

Figure 2: Operations on the neuron of the neural network [12]

As mentioned above, an activation function must be selected for the neural network to work correctly. There are many activation functions, which differ in how they activate neurons (Fig. 3) [13]. The ReLU (rectified linear unit) activation function, including its variant with "leakage" (Leaky ReLU), has become widespread [14].

===3.2. Application of Cluster Analysis for Identification of Suspicious Transactions===
When numerical and attributive (categorical) indicators are available, an effective cybersecurity system can be built by integrating a classification system that identifies suspicious transactions. One approach that divides operations into several groups without defining the groups in advance is cluster analysis. Under uncertainty, this machine learning method makes it possible, for example, to assign individual records in network traffic to groups that have characteristic features. At the next stage, each group is analyzed to identify atypical records, which appear as outliers in the studied population [15]. Cluster analysis based on the available information involves the following stages:
1.
Recoding of textual information into numerical form, so that attributive indicators can be used when this machine learning method divides the population into groups.
2. Standardization of data or use of the method of expert assessments. The data prepared for cluster analysis consist of indicators with different units and scales. Using the raw data distorts cluster formation: indicators with large values dominate, while the impact of indicators with small values is insignificant. Standardizing the primary data balances the influence of the indicators on cluster construction. The method of expert assessments, in turn, is used when different weights have to be assigned to the indicators, with greater weights given to those the researcher considers more significant. Standardization brings all the indicators to a uniform description, i.e. a new conditional unit of measurement is calculated, which allows a formal comparison of objects [16]. Indicators are standardized according to the following formula:

<math>z_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j}</math> (1)

where <math>\bar{x}_j</math> is the mean and <math>\sigma_j</math> is the standard deviation of the j-th indicator.

When indicators of direct and reverse action are distinguished, it is advisable to divide them into stimulants and destimulants, respectively. The following formulas are used to standardize each type of indicator:

<math>z_{ij} = \frac{x_{ij} - x_{\min}}{x_{\max} - x_{\min}}</math> (direct direction of action), (2)

<math>z_{ij} = \frac{x_{\max} - x_{ij}}{x_{\max} - x_{\min}}</math> (reverse direction of action), (3)

where <math>x_{ij}</math> is the value of the j-th indicator for the i-th element; <math>x_{\min}</math> is the minimum value of the j-th indicator; <math>x_{\max}</math> is the maximum value of the j-th indicator.

Figure 3: Main activation functions [13]

Among the available indicators, it is advisable to use only statistically significant ones, as determined by calculating the p-value.
To improve the results, it is also reasonable during clustering to check the resulting groups by trying various combinations of indicators in the modeling process.
3. Determining the number of clusters. The number of clusters is determined using one of the following methods: the silhouette width plot, the gap statistic plot, or the “elbow” method. To determine the optimal number of clusters, the obtained groups must also be analyzed, because in some situations a group containing only one unit of the population is formed. Such a situation indicates the need to use another method to identify the number of clusters [17].
4. Choice of clustering method. Hierarchical and non-hierarchical cluster analysis are the most popular clustering methods. When choosing the optimal approach to forming groups, experiments can be conducted with each of these methods in turn. The main parameters for forming clusters are the measure of distance and the rule of aggregation, which allow the units of the population to be grouped directly according to the formed system of indicators. In the tree (hierarchical) clustering method, the measure of distance makes it possible to select individual clusters on the basis of the available indicators. The distance between the formed clusters can be measured in a space of any dimension. An example of a distance measure used in tree clustering is the Euclidean distance, which is calculated by the following formula:

<math>d_{ij} = \sqrt{\sum_{k=1}^{p} (z_{ik} - z_{jk})^2}</math>, (4)

where <math>d_{ij}</math> is the distance between objects i and j, <math>z_{ik}</math> is the standardized value of the k-th variable for the i-th object, and p is the number of variables.
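Formulas (1) to (4) can be sketched in a few lines of plain Python. The example is a minimal illustration under stated assumptions: the function names, the sample records and the use of the population standard deviation in formula (1) are choices made for the sketch, not details taken from the study.

```python
import math


def z_score(column):
    """Formula (1): subtract the mean and divide by the standard deviation.

    Assumes the population standard deviation and a non-constant column.
    """
    n = len(column)
    mean = sum(column) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / sigma for x in column]


def min_max(column, reverse=False):
    """Formulas (2) and (3): scale an indicator to [0, 1].

    reverse=False treats the indicator as a stimulant (formula 2),
    reverse=True as a destimulant (formula 3). Assumes a non-constant column.
    """
    lo, hi = min(column), max(column)
    if reverse:
        return [(hi - x) / (hi - lo) for x in column]
    return [(x - lo) / (hi - lo) for x in column]


def euclidean(a, b):
    """Formula (4): Euclidean distance between two standardized objects."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical network records: bytes transferred and session duration (s).
bytes_sent = [1200.0, 1500.0, 900.0, 250000.0]
duration = [2.0, 3.0, 1.0, 4.0]

# Standardize each indicator, then describe every record by its scaled values.
objects = list(zip(min_max(bytes_sent), min_max(duration)))

d_01 = euclidean(objects[0], objects[1])  # two ordinary records
d_03 = euclidean(objects[0], objects[3])  # ordinary record vs. atypical record
print(d_01, d_03)
```

In this sketch the record with an unusually large transfer lies much farther from an ordinary record than two ordinary records lie from each other, which is the property that distance-based clustering relies on when isolating atypical records.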
Non-hierarchical clustering is characterized by a certain flexibility in redistributing units of the population between clusters during optimization. At the first stage, cluster centers are identified in accordance with the chosen number of clusters. At the next stage of modeling, the distance from each element to the existing centers is estimated, and the element is assigned to the nearest cluster according to the established threshold distances.
5. Validation of clusters. The adequacy of the obtained clusters is evaluated as follows:
* External validation compares the results of cluster analysis with a reference result.
* Relative validation studies the structure of clusters while varying the parameter values of a single cluster analysis method.
* Internal validation relies on internal information from the cluster formation procedure itself.
* Assessment of the stability of clustering applies the cluster analysis algorithms to various samples [10, 18, 19].
6. Selection of the optimal clustering method. After the results of the various approaches to cluster analysis have been examined, the optimal option is selected in accordance with the needs. The selection can be based both on a system of qualitative assessments of the formed clusters and on visualizations of the formed groups.

===4. Further Research===
The results obtained in the study show the effectiveness of machine learning methods that use various data (numerical values, textual information, video and audio content, images) to identify illegal actions by fraudsters seeking access to information.
Further research should focus on improving various data science approaches in order to perfect the cybersecurity system and keep data protection in line with current realities. Optimizing machine learning algorithms for countering cyberattacks involves programming languages such as Python together with appropriate libraries (Keras, PyTorch, Scikit-learn, TensorFlow, etc.) [20]. The development of deep learning technologies leads to the emergence of more complex neural networks, which make it possible to optimize the cybersecurity system. Comprehensive research in the field of deep learning will make it possible to test a variety of models and offer the best solutions to the market.

===5. Conclusions===
The introduction of innovative technologies expands a company's ability to collect, process and use large amounts of information to optimize key processes. Companies' comprehensive databases arouse the interest of outsiders, which leads to the creation of various tools for the illegal acquisition of information. To counter cyberattacks, effective security systems for storing valuable information and multi-level algorithms for access to the relevant resources are being created. Advanced approaches to implementing effective and robust cybersecurity systems involve machine learning methods, which are applied by building appropriate models on structured, semi-structured and unstructured data. Practice shows that neural networks are effective in combating spoofing, in which fraudsters try to seize someone else's data in order to create high-quality forged images. Cluster analysis makes it possible to segment objects based on a system of various metrics, identifying specific groups and singling out outliers that are likely to be suspicious transactions.

===6. References===
[1] J.
Xie, et al., Efficient Indexing Mechanism for Unstructured Data Sharing Systems in Edge Computing, in: IEEE Conference on Computer Communications, 820–828, 2019. https://doi.org/10.1109/infocom.2019.8737617
[2] Y. Kravchenko, et al., Evaluating the Effectiveness of Cloud Services, 2019 IEEE International Conference on Advanced Trends in Information Theory (ATIT), 120–124, 2019. https://doi.org/10.1109/atit49449.2019.9030430
[3] A. Corallo, M. Lazoi, M. Lezzi, Cybersecurity in the context of industry 4.0: A structured classification of critical assets and business impacts, Computers in Industry, vol. 114, 103–165, 2020. https://doi.org/10.1016/j.compind.2019.103165
[4] I. H. Sarker, et al., Cybersecurity data science: an overview from machine learning perspective, J Big Data, 7, 41, 2020. https://doi.org/10.1186/s40537-020-00318-5
[5] J. E. Thomas, Individual cyber security: Empowering employees to resist spear phishing to prevent identity theft and ransomware attacks, International Journal of Business Management, 12(3), 1–23, 2018. https://doi.org/10.5539/ijbm.v13n6p1
[6] Z. Yu, et al., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5295–5305, 2020.
[7] M. H. Maras, A. Alexandrou, Determining authenticity of video evidence in the age of artificial intelligence and in the wake of Deepfake videos, International Journal of Evidence & Proof, 23(3), 255–262, 2019. https://doi.org/10.1177/1365712718807226
[8] M. Kubanek, J. Bobulski, J. Kulawik, A Method of Speech Coding for Speech Recognition Using a Convolutional Neural Network, Symmetry, 11, 1185, 2019. https://doi.org/10.3390/sym11091185
[9] A. Radovan, Z. Ban, Prediction of HSV color model parameter values of cloud movement picture based on artificial neural networks, 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), 1110–1114, 2018. https://doi.org/10.23919/mipro.2018.8400202
[10] M. Gandhi, J. Kamdar, M. Shah, Preprocessing of Non-symmetrical Images for Edge Detection, Augment Hum Res 5, 10, 2020. https://doi.org/10.1007/s41133-019-0030-5
[11] Introduction to Different Activation Functions for Deep Learning. https://medium.com/@shrutijadon10104776/survey-on-activation-functions-for-deep-learning-9689331ba092
[12] Everything you need to know about Neural Networks. https://hackernoon.com/everything-you-need-to-know-about-neural-networks-8988c3ee4491
[13] Activation Functions for Artificial Neural Networks. http://rasbt.github.io/mlxtend/user_guide/general_concepts/activation-functions/
[14] D. Zou, et al., Gradient descent optimizes over-parameterized deep ReLU networks, Mach Learn 109, 467–492, 2020. https://doi.org/10.1007/s10994-019-05839-6
[15] K. Demertzis, L. Iliadis, S. Spartalis, A spiking one-class anomaly detection framework for cyber-security on industrial control systems, in: International Conference on Engineering Applications of Neural Networks, pp. 122–134, Springer, Cham, 2017.
[16] W. Liang, et al., An Industrial Network Intrusion Detection Algorithm Based on Multifeature Data Clustering Optimization Model, in: IEEE Transactions on Industrial Informatics, vol. 16, no. 3, 2063–2071, 2020. https://doi.org/10.1109/TII.2019.2946791
[17] Y. Chunhui, H. Yang, Research on K-Value Selection Method of K-Means Clustering Algorithm, 2, no. 2: 226–235, 2019. https://doi.org/10.3390/j2020016
[18] O. Romanovskyi, et al., Automated Pipeline for Training Dataset Creation from Unlabeled Audios for Automatic Speech Recognition, in: Advances in Computer Science for Engineering and Education IV, 25–36, 2021. https://doi.org/10.1007/978-3-030-80472-5_3
[19] Z. B. Hu, et al., Authentication System by Human Brainwaves Using Machine Learning and Artificial Intelligence, in: Advances in Computer Science for Engineering and Education IV, 374–388, 2021. https://doi.org/10.1007/978-3-030-80472-5_31
[20] C. D. Costa, Python libraries for modern machine learning models & projects, 2020. https://towardsdatascience.com/best-python-libraries-for-machine-learning-and-deep-learning-b0bd40c7e8c