Feature extraction for anomaly detection in industrial control systems Silvio Russo1,* , Claudio Zanasi1,* and Isabella Marasco1,* 1 Department of Computer Science and Engineering, University of Bologna, Italy Abstract The threat landscape for industrial systems is in rapid evolution, with cyber-attacks becoming increasingly sophisticated, targeted, and motivated. This situation should raise many concerns because of the growing interconnection of industrial control systems with the Internet, as well as the proliferation of cyber- physical systems and the Industrial Internet of Things. In these scenarios, an accurate detection of attacks is of utmost importance. The swiftness with which the environment of security risks in IoT and industrial systems is a cause for concern, given the rising complexity, specificity, and determination of cyber-attacks. This issue becomes particularly problematic due to the expanding integration of industrial control systems with the Internet and the widespread adoption of cyber-physical systems. In this work, we introduce a novel methodology for improving the Feature Extraction process. The solution shows versatility, operating not only as a standalone tool for identifying network attacks but, more significantly, as a valuable tool for pre-processing raw packet data tailored for integration with artificial intelligence models. The proposed solution was developed with an emphasis on addressing the specific cybersecurity needs of the industrial sector. This approach is driven by the imperative requirements of the industrial landscape, where safeguarding critical systems against cyber threats is of paramount importance. Furthermore, our system was tested on an industrial dataset that demonstrates the applicability and efficacy of our solution within the peculiar context of industrial environments. The outcomes of these tests contribute to the validation of our approach. Keywords Anomaly Detection, Control Systems, Feature Extraction 1. Introduction The analysis of network traffic through machine learning models is becoming necessary for the early detection of suspicious activities. Cyber-physical systems (CPS) constitute a network of interconnected devices facilitating seamless information exchange among tangible IoT devices. These types of devices are used in different contexts like medical devices, autonomous vehicles, industrial automatons, wearable, and urban smart infrastructures, and can be remotely con- figured and managed. With their progressive adoption, these devices can monitor and access huge amounts of critical and even sensitive data. Moreover, the proliferation of these devices in the industries expands the surface area for cyber-attacks with consequent higher risks of data leak, ransomware and sabotage operations. In certain industrial contexts and critical infrastructures, the severity of cyber threats increases due to the potential consequences of an ITASEC 2024: The Italian Conference on CyberSecurity, April 08–12, 2024, Salerno, IT * Corresponding author. $ silvio.russo3@unibo.it (S. Russo); claudio.zanasi4@unibo.it (C. Zanasi); isabella.marasco4@unibo.it (I. Marasco) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings attack on Industrial Control Systems (ICS) [1] and the challenges associated with implementing modern security measures for these systems. Current cybersecurity research and products primarily focus on methods and techniques for the information technology (IT) environment. However, the increasing integration of industrial control systems with IT systems, the proliferation of CPS, Industrial Internet of Things (IIoT) solutions, and the rising frequency of attacks on industrial systems necessitate a shift in research objectives towards addressing the unique challenges emerging from industrial settings. Common datasets used for evaluating network intrusion systems (NIDS), such as KDDCUPP99 [2] and NSL-KDD [3], are old and not applicable for industrial systems. For this reason, we consider the recent CIC Modbus 2023 dataset [4] and use it for the extraction of various features, including Flow Features, Basic Features, Content Features, Time Features, and Labelled Features. This comprehensive set allows a solid foundation for network traffic analysis specifically tailored for industrial environment. The feature extraction rules, implemented in the Bro-code scripting language, represent the core of this paper and a necessary step for the effective implementation of a classification neural network that takes a pcap file as its input and provides immediate evaluation of malicious or legitimate industrial traffic. Network Intrusion Detection Systems (NIDS) based on AI models assume a pivotal role in safeguarding network infrastructure [5]. In this context, the Feature Extractor (FE) components stand out as fundamental elements that contribute to the overall efficacy of the detection phase. The Feature Extractor serves as a critical component that is responsible for extracting relevant features and patterns from network traffic data. Its role is to analyze the incoming data streams, identify anomalies, and extract key information that can be indicative of potential security threats or malicious activities. This process involves scrutinizing various aspects of the network traffic, including packet headers, payload contents, communication patterns, and other pertinent attributes. The motivation behind this research stems from a critical issue observed in real-world scenarios: NIDS frequently generate an excessive number of false positives. This persistent problem significantly undermines the effectiveness of these systems in practical settings outside of controlled research environments. One of the primary causes we have identified is the poor quality of data that classification models receive as input, due to ineffective Feature Extraction processes. Therefore, with this research, we aim to take a first step towards a reliable and effective feature extraction process that can extract truly relevant information for detection purposes, thereby maximizing the efficacy of attack classification models without sacrificing system performance. Differently from other solutions that can be found in literature mostly based on artificial intelligence (AI) [6] techniques, our solution is based on a deterministic feature extraction process able to guarantee the quality of the extracted data to improve the training process. The paper aims to provide a macroscopic view of the functioning of NIDS, with a particular focus on the Feature Extraction process, demonstrating how they can be customized and integrated by implementing a neural network to enhance their accuracy. The rest of the paper is organized as follows. Section 2 discusses related works. Section 3 describes the methodology and the proposed solution. Section 4 discusses the used dataset and the identified attacks. Section 5 presents details about the prototype and the experimental tests. Section 6 summarizes the main conclusions and outlines future work. 2. Related Work The problem of IoT[7] and industrial security [8] has become more relevant because these components are crucial for the security and safety of cyber-physical environments. Several works[9, 10] analyze the critical situation in which the cybersecurity of IoT/OT word, proposing new solutions. These works examine the challenges and vulnerabilities present in the evolving interplay between the IoT and OT. In response to the identified risks, researchers have put forth innovative solutions[11, 12] aimed at enhancing the overall security posture of this interconnected ecosystem. The work [12] introduces a dynamic and adaptable security paradigm, aiming to redefine traditional security models by embracing the evolving nature of contemporary digital environ- ments. Several works have analyzed the problem of feature extraction from Network traffic, the main goal of these works is to determine the best solution to extract relevant information from traffic data to improve the capacity to identify attacks or anomalies. The adopted solutions are very different, leveraging a wide type of different techniques, from old-style script-based solutions to AI-based ones. In [13] the author presents and examines the UNSW-NB15 data set creation, comparing it with other datasets like e KDDCUP 99[2] and NSL-KDD[3]. To benchmark and compare the different datasets the authors propose a solution based on a FE composed of two different software, Zeek [14] and Argus[15]. The extracted features identify relevant information to have a comprehensive knowledge of the network behaviour, starting from this work we improve the feature extraction process by extracting a different set of features, as described in section3, and developing a set of scripts to automatic labelling process, also identifying specific types of attacks. In [6] the authors evaluate and compare the efficacy of different Deep Learning (DL) models in identifying attack vectors against three Shallow Learning models: Deep Feed Forward, Convolution Neural Network, Recurrent Neural Network, Decision Trees, Logistic Regression, and Naive Bayes. This work analyzes three FE techniques that have been evaluated on different datasets, in particular, Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Auto-encoder (AE) are investigated for their impact on three benchmark datasets: UNSW-NB15, ToN-IoT, and CSE-CIC-IDS2018. As we have done in this work the authors leverage the output of the FE as input for the Machine Learning models trying to improve the performances of the NIDS, in particular for PCA and AE, several dimensions (1,2,3,4,5,10,20 and 30) are selected trying to find the optimal number. The problem is that the FE process based on these algorithms makes it difficult to determine the quality of the extracted data, and this has an impact on the performance of the classifier. In our work, we emphasize the importance of a deterministic FE process to ensure the quality and reliability of extracted data. While PCA and AE in the cited paper attempt to optimize performance by varying the number of dimensions, our solution focuses on maximizing the efficacy of the FE process without compromising classifier performance. This approach addresses the challenge of determining the quality of extracted data, which directly impacts classifier performance. Furthermore, our research takes into account the practical constraints of real-world environ- ments, particularly the crucial aspect of execution time Moreover, in a real environment, the execution time is crucial to have an effective solution to detect and prevent cyber-attacks. In this sense, the suggested FE algorithm requires significantly more resources than the solution presented in our work. 3. Feature extraction An important part of this work is the FE designed to work with industrial data to build quality datasets that can be used to train machine learning model to detect anomalous activities. Today most papers use artificial networks of different types (AE, PCA, RNN ecc...) to extract relevant features to improve the efficiency and the effectiveness of other machine learning models that play the role of NIDS. This approach introduces a significant computational complexity that is acceptable in research but is useless in real contexts where NIDS must analyze huge numbers of packets per second. The proposed solution leverages Zeek [14] to implement an effective and fast FE. It is an open-source network security monitoring tool developed by researchers at Lawrence Berkeley National Laboratory. It is able of actively capturing, indexing, and analyzing real-time network traffic. Its capabilities include the passive monitoring of network traffic to extract information about protocols, connections, and data transfer. It can identify various protocols like HTTP, DNS, FTP, and others, showcasing a focus on behavioural analysis for anomaly detection. The tool supports signature-based detection for known threats, generating detailed logs that facilitate forensic analysis and incident investigation. More importantly, being an open-source project its extensibility, allows users to create custom scripts and plugins tailored to their specific security requirements. The scripting language associated with Zeek allows us to create customized scripts for effective network data analysis and processing. This tool is widely used in cybersecurity to augment network security by providing visibility into network traffic and identifying potential threats in real time. The proposed solution will be tested on the "CIC Modbus 2023" [4]. The extracted features are diverse, in this way, we can identify all the relevant information that allows us to analyze the connection. They are categorized into different groups, each focusing on specific aspects of communication: • Flow Feature: These features are designed to identify a connection uniquely. • Basic Feature: Offering general information about the connection, these features con- tribute to building a basic understanding. • Content Feature: Analyzing packet content, typically of TCP type, these features help identify specific communication characteristics. • Time Features: Focusing on packet timing within a connection. Building upon the foundation laid out in the referenced work [13], we have expanded the set of features to enhance the quality of the extracted information. Our goal is to acquire relevant data that can significantly enhance the subsequent training of the machine learning model. In particular, we extract new features to closely monitor the retransmitted packets that occur during communication sessions, we delve into the analysis of the time intervals between Feature Name Description Label 0 for normal and 1 for attack records s_retrans Source segments retransmitted (TCP) d_retrans Destination segments retransmitted (TCP) m_int_s For each connection, the mean interval between two packets (Source - in mSec) m_int_d For each connection, the mean interval between two packets (Destination - in mSec) http_post No. of flows that have the method Post in HTTP service user_ftp User FTP if requested pwd_ftp Password FTP if captured uid A unique identifier of the connection local_orig If the connection is originated locally, this value will be T. If it was originated remotely, it will be F local_resp If the connection is responded to locally, this value will be T. If it was responded to remotely, it will be F history Records the state history of connections as a string of letters tunnel_parents If this connection was over a tunnel, indicate the uid values for any encapsulating parent connections used over the lifetime of this inner connection orig_bytes The number of payload bytes the originator sent resp_bytes The number of payload bytes the responder sent Table 1 Extracted features successive packets for each connection, providing valuable insights into the dynamics of data transfer. Additionally, we employ a quantitative approach to evaluate flows by scrutinizing the usage of the POST method in the HTTP service, integrating also data related to FTP usernames and passwords. Furthermore, a unique identifier has been introduced for each connection, two boolean fields have been added to differentiate whether a connection originated locally or remotely, information particularly relevant in industrial networks to identify attacks. In industrial settings, it is crucial to isolate the networks from the external world. This involves minimizing remote communications as much as possible, as they can serve as potential access points for attackers, putting at risk the security of the entire infrastructure and posing risks to the safety of workers. Also, specific information for tunnelling and the connection state history has been included and documented as a string. The new extracted features are described in Table 1 The solution is based on different scripts for extracting the features, BASH automation was used to automatize the feature extraction. Several data structures have been employed to facilitate the management of blocks of infor- mation in the logs. These structures include: • Info: stores information to be logged for each identified connection in the analyzed pcap files; • FlowFeatures: preserves details of the addresses involved in a connection and the protocol used; • eachPackets: collects information described in the reference document for each packet; Figure 1: Simulated network architecture to create CIC Modbus Database • each TCP Conn: specific to TCP connections, stores relevant information; • infoAllC: summarizes the total number of HTTP flows using the Get and Post methods. Regarding log management, crucial for the usability of the extracted feature, it was decided to group similar functionalities to minimize the number of generated log files and enhance their management. 4. Threat and anomaly detection The dataset used is diverse and includes various types of network traffic. It is generated by capturing pcap files that simulate both legitimate and malicious network traffic within a fictional network. The dataset is divided into two parts: the ’Dataset Attack’ and the ’Dataset Benign’. Within the simulated network architecture (Figure 1), the SecureIEDs identified as IED1A (185.175.0.4) and IED4C (185.175.0.8) are considered secure. Similarly, the Secure SCADA HMI (185.175.0.3) is included among the devices considered secure. However, there is also an insecure IED (185.175.0.5) and an insecure SCADA HMI (185.175.0.3), making them potentially vulnerable. A crucial element of the architecture is the Central Agent (185.175.0.6), which receives the so-called detection score from agents present in each secure device. Simultaneously, there is an Attacker (185.175.0.7), responsible for the attacks visible in the "Dataset Attack". The provided dataset necessitates a preprocessing phase before being suitable for input into machine learning models. The FE, discussed in Section 3, facilitates the preparation of the data for the subsequent training phase. Given that the resulting data is unlabeled, the most immediate application is an unsupervised algorithm to discern the intrinsic characteristics of the traffic and perform an anomaly detection task. First, we used a K-Means clustering algorithm to perform an initial analysis of the data. Then we used the Isolation Forest algorithm to perform anomaly detection. Unsupervised algorithms excel at detecting anomalies in comparison to normal traffic. How- ever, for the accurate detection of malicious attacks, particularly those employing hiding techniques, a supervised learning approach may prove more suitable. Labelling the dataset becomes crucial for effectively training the models to distinguish real attacks from normal traffic. Therefore, we implemented an automatic labelling process based on various heuristics. Using these scripts, we comprehensively labelled all samples from the initial dataset to create a new dataset suitable for training a machine learning model. Following this, we trained a neural network on the refined dataset to build an automatic threat detection system. The neural network adopts at its core is a Long Short-Term Memory (LSTM) layer with 128 units, making it particularly effective for analyzing sequential data, such as the available traffic logs. Following the LSTM is a dense layer with a ReLU activation function, succeeded by a dropout layer for regularization to reduce the risk of overfitting. Lastly, there is a dense layer with a sigmoid activation function to produce the final binary classification output. The network can only distinguish between benign and malicious samples, as we did not include the specific type of attack in the labels. This decision was made to enhance the general- ization capabilities of the network, allowing the model to attempt recognition of attacks not explicitly present in the training data. 5. Results The use of Zeek to perform feature extraction from the pcap files of the “CIC Modbus 2023” dataset, exploiting its scripting capability, proved to be a good solution due to the reduced computational time required compared to machine learning models and the possibility of obtaining a deterministic solution, while also allowing a detailed analysis of the different connection types. To evaluate the capability of the proposed method, we applied machine learning techniques (K-means and Isolation Forest) on features extracted with Zeek to identify unusual patterns and behaviours in network traffic. Figure 2 shows the results obtained applying K-means to the analyzer.log file containing the information related to network protocols extracted by Zeek. We can observe the presence of five clusters, where cluster 0 (green) is the densest, indicating a prevalent traffic type. Clusters 1 (blue) and 4 (purple), which contain fewer instances and are more compact, represent less frequent categories of network traffic. Clusters 2 (red) and 3 (yellow), the former consisting of a single instance and the latter of only two, could indicate the presence of outliers or the presence of a new type of traffic or anomalies. The results from the conn.log file, containing traffic data and the subsequent analysis, is depicted in Figure 3, unveil the existence of three distinct clusters. Cluster 2 (green) stands out as the most cohesive, characterized by lower internal variance, representing standard traffic. In contrast, both cluster 0 (red) and cluster 1 (blue) exhibit greater dispersion and lower density, suggesting increased variability and heterogeneity in the type of traffic. The results of applying Isolation Forest to detect anomalies in the data, is reported in the Figure 4. It shows inliers, which represent observations that fit the general pattern of the data and are classified as normal activity and outlier observations, highlighted in black, that deviate significantly from the norm and may indicate anomalies or suspicious activity. Figure 2: Clustering results of analyzer.log Figure 3: Clustering results of conn.log Finally, by training the Artificial Neural Network model on the labelled dataset generated with Zeek, we evaluated its capability to detect the presence of attacks in network traffic. Table 2 presents the sample distribution in the training dataset, categorized into normal traffic and attacks. We partitioned the data into a standard 70% - 30% split for training and testing purposes. Using the neural network described in section 4 we achieved an accuracy of 84.21% in correctly detecting malicious samples from the normal traffic. 6. Conclusions The use of machine learning approach for feature extraction is common, although its high computational requirements make it impractical for real-world applications. In this paper, we Figure 4: Visualization of inliers and outliers Measure Benign Malicious N° of flows 4,559,770 9,055,995 Src bytes 1,478,356,764 3,179,361,798 Dst bytes 1,128,175,330 2,368,174,501 Src pkts 26,006,278 49,821,269 Dst pkts 20,094,581 38,436,078 TCP 4,550,004 9,042,823 UDP 8,589 11,964 ICMP 1,177 1,208 Other 60 80 Table 2 Training dataset size propose the use of Zeek for feature extraction to improve the speed and effectiveness, and machine learning techniques for the analysis of the extracted features. The dataset used is "CIC Modbus 2023", which contains traffic data from an industrial network. The effectiveness of the proposal was confirmed by the results of the detailed analysis performed on the different types of connections. The K-means and Isolation Forest algorithms were able to identify patterns within the network traffic and detect possible anomalies when additional features were used. Furthermore, the neural network achieved an accuracy of 84.21%, demonstrating its capability to distinguish between malicious and benign network traffic using the dataset generated by Zeek. This research emphasizes the potential of combining traditional tools with machine learning methods to enhance the detection of potential malicious activity. Acknowledgments This work was partially supported by project SERICS (PE00000014) under the MUR National Recovery and Resilience Plan funded by the European Union - NextGenerationEU. References [1] M. Benmalek, Ransomware on cyber-physical systems: Taxonomies, case studies, security gaps, and open challenges, Internet of Things and Cyber-Physical Systems (2024). [2] I. University of California, Kdd cup 1999 data, 1999. URL: https://kdd.ics.uci.edu/databases/ kddcup99/kddcup99.html. [3] T. et al., Nsl-kdd dataset, 2009. URL: https://www.unb.ca/cic/datasets/nsl.html. [4] K. Boakye-Boateng, A. A. Ghorbani, A. Lashkari, Securing substations with trust, risk posture, and multi-agent systems: A comprehensive approach, in: 2023 20th Annual International Conference on Privacy, Security and Trust (PST), IEEE Computer Society, Los Alamitos, CA, USA, 2023, pp. 1–12. URL: https://doi.ieeecomputersociety.org/10.1109/ PST58708.2023.10320154. doi:10.1109/PST58708.2023.10320154. [5] Z. Ahmad, A. Shahid Khan, C. Wai Shiang, J. Abdullah, F. Ahmad, Network intrusion detection system: A systematic study of machine learning and deep learning approaches, Transactions on Emerging Telecommunications Technologies 32 (2021) e4150. [6] M. Sarhan, S. Layeghy, N. Moustafa, M. Gallagher, M. Portmann, Feature extraction for ma- chine learning-based intrusion detection in iot networks, Digital Communications and Net- works (2022). URL: https://www.sciencedirect.com/science/article/pii/S2352864822001754. doi:https://doi.org/10.1016/j.dcan.2022.08.012. [7] nozominetworks, What it needs to know about ot/iot security threats in 2020, 2020. URL: https://www.nozominetworks.com/blog/ what-it-needs-to-know-about-ot-io-security-threats-in-2020. [8] T. Micro, State of ot security in 2022: Big survey key insights, 2022. URL: https://www. trendmicro.com/en_nl/research/22/f/state-of-ot-security-2022.html. [9] I. Stellios, P. Kotzanikolaou, M. Psarakis, C. Alcaraz, J. Lopez, A survey of iot-enabled cyberattacks: Assessing attack paths to critical infrastructures and services, IEEE Commu- nications Surveys & Tutorials 20 (2018) 3453–3495. doi:10.1109/COMST.2018.2855563. [10] M. A. Khan, K. Salah, Iot security: Review, blockchain solutions, and open chal- lenges, Future Generation Computer Systems 82 (2018) 395–411. URL: https:// www.sciencedirect.com/science/article/pii/S0167739X17315765. doi:https://doi.org/ 10.1016/j.future.2017.11.022. [11] S. Li, M. Iqbal, N. Saxena, Future industry internet of things with zero-trust security, Information Systems Frontiers (2022) 1–14. [12] C. Zanasi, S. Russo, M. Colajanni, Flexible zero trust architecture for the cyberse- curity of industrial iot infrastructures, Ad Hoc Networks 156 (2024) 103414. URL: https://www.sciencedirect.com/science/article/pii/S1570870524000258. doi:https://doi. org/10.1016/j.adhoc.2024.103414. [13] N. Moustafa, J. Slay, Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set), in: 2015 Military Communications and Information Systems Conference (MilCIS), 2015, pp. 1–6. doi:10.1109/MilCIS.2015.7348942. [14] I. C. S. I. (ICSI), Zeek, 2024. URL: https://zeek.org/. [15] Argus, 2024. URL: https://openargus.org/.