A Deep Learning Approach for Intrusion Detection System in Industry Network

Ahmad HIJAZI, EL Abed EL SAFADI, Jean-Marie FLAUS
Univ. Grenoble Alpes, G-SCOP, F-38000 Grenoble, France
ahd.hjz@gmail.com, Abed.safadi@grenoble-inp.fr, Jean-marie.Flaus@grenoble-inp.fr

Abstract— Networks have brought convenience to the world by allowing flexible transformation of data, but they also expose a high number of vulnerabilities. A Network Intrusion Detection System (NIDS) helps system and network administrators detect network security breaches in their organizations. Identifying anonymous and new attacks is one of the main challenges in IDS research. Deep learning (2010’s), a subfield of machine learning (1980’s), is concerned with algorithms based on the structure and function of the brain, called artificial neural networks. Progress in such learning algorithms may improve the functionality of IDSs, especially in Industrial Control Systems, by increasing their detection rate on unknown attacks. In this work, we propose a deep learning approach to implement an effective and enhanced IDS for securing industrial networks.

positive rates; in addition, it is difficult to select the normal behavior of a traffic dataset in the network.

Various machine learning techniques have been used to develop NIDSs, such as Artificial Neural Networks (ANN), Support Vector Machines (SVM), Naive Bayesian (NB), Random Forests (RF), Self-Organizing Maps (SOM), etc. These NIDSs are developed as classifiers that differentiate normal traffic from anomalous traffic [3].

In this paper, an intrusion detection system using deep learning is proposed to secure the ICS network. The proposed technique uses a multi-layer perceptron with binary
classification and trains on high-dimensional Modbus packet data captured from a network simulation. The data are labeled as normal or malicious so that the neural network can learn the underlying structure of the normal and anomalous behavior of the network.

Keywords—Intrusion Detection System, Deep Learning, SCADA, Modbus, Industrial Control Systems, Artificial Neural Networks.

I. INTRODUCTION

Targeted attacks on industrial control systems are the biggest threat to critical national infrastructure, says Kaspersky Lab. Today’s industrial control systems (ICS) face an array of digital threats. Two in particular stand out. On the one hand, digital attackers are increasingly targeting and succeeding in gaining unauthorized access to industrial organizations. Some actors use malware, while others resort to spear-phishing (or whaling) and other social engineering techniques [1]. The main challenge is linked to the fact that these systems typically control physical processes that relate to power, transport, water, gas and other critical infrastructure. Because the output of ICS relates to physical processes, the effects of any downtime, such as a power outage, can affect millions of people [2].

II. ICS AND IDS

A. ICS overview

Industrial control system (ICS) is a general term that encompasses several types of control systems, including supervisory control and data acquisition (SCADA) systems, distributed control systems (DCS), and other control system configurations such as Programmable Logic Controllers (PLC) often found in the industrial sectors and critical infrastructures.

ICS have different performance and reliability requirements, and also use operating systems and applications that may be considered unconventional in a typical IT network environment.
Security protections must be implemented in a way that maintains system integrity during normal operations as well as during times of cyber-attack.

A typical ICS contains numerous control loops, human interfaces, and remote diagnostics and maintenance tools built using an array of network protocols on layered network architectures. A control loop utilizes sensors, actuators, and controllers (e.g., PLCs) to manipulate some controlled process. A sensor is a device that produces a measurement of some physical property and then sends this information as controlled variables to the controller. The controller interprets the signals and generates corresponding manipulated variables, based on a control algorithm and target set points, which it transmits to the actuators.

Signature-based and anomaly-based intrusion detection is one aspect of an effective network security monitoring strategy. Very few asset owners have IDS/IPS deployed and configured appropriately at the boundary between the Enterprise IT and ICS networks. However, network intrusion detection has been criticized for its propensity to generate a perceived large amount of false positives and false negatives. Signature-based IDS lacks the capability of detecting new forms of attacks that it has not seen before, and anomaly-based IDS produces a high amount of false

C. Deep learning and IDS

Signature-based IDS is effective in the detection of known attacks and results in high detection accuracy and lower false-alarm rates. However, its performance suffers during detection of unknown or new attacks due to the limitation of rules that can be installed beforehand in an IDS. On the other hand, anomaly-based IDS is well suited for the detection of unknown and new attacks.

Industrial control systems underpin the critical national infrastructure and are essential for the success of industries such as:
• Electricity production and distribution
• Water supply and treatment
• Food production
• Oil and gas production and supply
• Chemical and pharmaceutical production
• Telecommunications
• Manufacturing of components and finished products
• Paper and pulp production [5].

SCADA and industrial protocols, such as Modbus/TCP, are critical for communications to most control devices. Unfortunately, many of these protocols were designed without security built in and do not typically require any authentication to remotely execute commands on a control device.

Although anomaly-based IDS produces high false-positive rates, its theoretical potential in the identification of new attacks has caused its wide acceptance among the research community. There are primarily two challenges that arise while developing an effective and flexible NIDS for unknown future attacks. First, proper feature selection from the network traffic dataset for anomaly detection is difficult. As attack scenarios are continuously changing and evolving, the features selected for one class of attack may not work well for other classes of attacks. Second, labeled traffic datasets from real networks are unavailable for developing an NIDS.

Deep learning belongs to a class of machine learning methods that employ consecutive layers of information-processing stages in a hierarchical manner for pattern classification and feature or representation learning. Deep learning usually plays an important role in image classification results. In addition, deep learning is also commonly used for language, graphical modeling, pattern recognition, speech, audio, image, video, natural language and signal processing.

B. IDS for ICS

For a long time, ICS/SCADA was an area that relied on
different embedded devices and clear-text communications such as Modbus/TCP, without taking security into consideration, which made it vulnerable to different types of attacks and a target of cyber threats. This resulted in a new focus on the security issues related to industrial control systems.

There are many deep learning methods, such as Deep Belief Network (DBN), Restricted Boltzmann Machine (RBM), Deep Boltzmann Machine (DBM), Deep Neural Network (DNN), Auto Encoder, Deep/stacked Auto Encoder, etc. [6]. Advances in learning algorithms might improve the ability of IDSs to reach a higher detection rate and a lower false alarm rate. It is envisioned that deep learning based approaches can help to overcome the challenges of developing an effective NIDS.

In this work, we use multi-layer perceptrons with binary classification, which we found to be the most useful type of neural network for this task; the only two output classes are normal and malicious. A perceptron is a single-neuron model that was a precursor to larger neural networks.

The power of neural networks comes from their ability to learn the representation in the training data and how best to relate it to the output variable to be predicted. In this sense, neural networks learn a mapping.

Intrusion detection systems are capable of providing visibility and detection of any breach on the network; an IDS can raise an alarm in response to network security or endpoint security events. IDSs for ICT networks have become very popular, especially for identifying the signatures of many pieces of known malicious code (e.g. SNORT rules); other IDSs utilize model-based anomaly detectors. Modern ICS equipment does not normally fall in the same category as computer systems in modern-day ICT networks. ICS equipment is not typically designed with security logging and processing in mind. It does not usually run standard operating systems used in ICT desktops and servers. Network-based IDSs are a network
device that collects network traffic directly from the network, often from a central point such as a router or switch. Data from multiple network sensors can be aggregated into a central processing engine, or processing may occur on the collection machine itself. The network traffic can also be analyzed for unsatisfactory traffic or behavior patterns: either patterns that are anomalous to a previously established traffic or behavior model, or specific traffic patterns that display non-conformity to standards, e.g. violations of specific communication protocols.

Mathematically, neural networks are capable of learning any mapping function and have been proven to be a universal approximation algorithm. The data structure can pick out (learn to represent) features at different scales or resolutions and combine them into higher-order features, for example from lines, to collections of lines, to shapes.

III. APPLICATION OF DEEP LEARNING ALGORITHM TO NETWORK TRAFFIC

The steps for building a good deep learning approach consist of preparing the data, defining and compiling the model, fitting the model, and evaluating the model (prediction). We will start with a brief overview of the deep learning structure.

A. Overview of deep neural networks

1) Neurons
The building blocks of neural networks are artificial neurons. These are simple computational units that have weighted input signals and produce an output signal using an activation function.

Fig. 2. An example of deep neural network with five layers

a) Input Layer
The first layer, which takes input from some dataset, is called the input or visible layer, because it is the exposed part of the neural network. Often a neural network is drawn with an input layer that has one neuron per input value in the dataset.

Fig. 1.
Model of a Simple Neuron

b) Hidden Layer
After the input layer come the hidden layers, so called because they are not directly exposed to the input. The simplest example of a neural network has a single neuron in the hidden layer that directly outputs a value. With the increase in computing power and very efficient libraries, very deep neural networks can be built. A neural network can have many hidden layers.

c) Output Layer
The last layer is called the output layer; it is responsible for outputting the value or vector of values in the format required for the problem.

B. Training The Network

a) Data Classification
In order to use binary classification, we should capture two types of data, in our case normal and malicious packets, to train the neural network on. As neural networks can only work with numerical data, we have to label the network packets with 0 for normal and 1 for malicious packets.

2) Neuron Weights
Each neuron has a bias, which can be thought of as an input that always has the value 1.0 and which must also be weighted. For example, a neuron with two inputs requires three weights: one for each input and one for the bias. Weights are often initialized to small random values, such as values in the range 0 to 0.3, although more complex initialization schemes can be used. As in linear regression, larger weights indicate increased complexity and fragility of the model. It is desirable to keep weights in the network small, and regularization techniques can be used to do so.

3) Activation
The weighted inputs are summed and passed through an activation function, sometimes called a transfer function. An activation function is a simple mapping of the summed weighted input to the output of the neuron. It is called an activation function because it governs the threshold at which the neuron is activated and the strength of the output signal. Historically
simple step activation functions were used: if the summed input was above a threshold, for example 0.5, the neuron would output a value of 1.0; otherwise it would output 0.0.

4) Network of Neurons
DL involves making very large and deep (i.e. many layers of neurons) neural networks to solve specific problems, as shown in Fig. 2. Thus, similar to how neurons are organized in layers in the human brain, neurons in neural networks are often organized in layers as well. So, an algorithm is deep if the input is passed through several non-linearities before being output.

We captured a large dataset composed of normal network traffic, i.e. the normal behavior of the ICS devices. In order to get the malicious packets, we prepared a table consisting of functions and values opposite to the normal ones, that is, different IP sources, IP destinations, port numbers, protocol numbers, Modbus fields (functions, values, registers, coils), etc. We then captured almost the same number of packets. After that, we combined the normal and malicious packets into one dataset and added a column labeling the packets 0 for normal and 1 for malicious.

b) Data Values
Data must be numerical, for example real values. If we have categorical data, such as a sex attribute with the values male and female, we can convert it to a real-valued representation called a one hot encoding. This is where one new column is added for each class value (two columns in the case of a sex attribute with the values male and female) and a 0 or 1 is added for each row depending on the class value for that row.

Neural networks require the input to be scaled in a consistent way. We can rescale it to the range between 0 and 1, which is called normalization.

d) Prediction
Once a neural network has been trained, it can be used to make predictions. You can make predictions on test or validation data in order to estimate the skill of the model on unseen data. You can also deploy it operationally and use it to make
predictions continuously. The network topology and the final set of weights are all that you need to save from the model. Predictions are made by providing the input to the network and performing a forward pass, allowing it to generate an output that you can use as a prediction [7].

C. Model Approach

a) Preparing the Neural Network
As a deep learning structure is defined as a sequence of layers, we create a sequential model and add layers one at a time until we are satisfied with our network topology. The first thing to get right is to ensure the input layer has the right number of inputs. In our case, the number of inputs is the number of fields extracted from the network packets, as shown in Fig. 5, in addition to the last field, which indicates whether the packet is normal or malicious.

Another popular scaling technique is to standardize the input so that the distribution of each column has a mean of zero and a standard deviation of 1. Scaling also applies to image pixel data. In our case, the data are a captured PCAP file whose fields consist of IP addresses, port numbers and hexadecimal Modbus values, as shown in Fig. 3.

Fig. 3. Modbus Frame

Thus, data must be well prepared before training the neural network: we should convert the IP addresses, hexadecimal values, and all other non-decimal attributes into decimal ones, preferably between 0 and 1.

c) Stochastic Gradient Descent
The classical and still preferred training algorithm for neural networks is called stochastic gradient descent. This is where one row of data is exposed to the network at a time as input. The network processes the input upward, activating neurons as it goes, to finally produce an output value. This is called a forward pass on the network. It is the type of pass that is also used after the network is trained in order to make predictions on new data. The output of the network is compared to the expected output and an error is calculated.
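The forward pass and error calculation described above can be illustrated with a minimal sketch in plain Python. The weights, biases and input values are hypothetical, the sigmoid output activation and the squared-error measure are assumptions made for illustration (the paper does not name either), and the network is far smaller than the one used in this work.

```python
import math

def relu(z):
    """Rectified linear activation for hidden neurons."""
    return max(0.0, z)

def sigmoid(z):
    """Squashes the output into (0, 1); assumed here for the binary output."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_w, hidden_b, out_w, out_b):
    """One forward pass: inputs -> ReLU hidden layer -> sigmoid output."""
    hidden = [relu(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(hidden_w, hidden_b)]
    z = sum(w * h for w, h in zip(out_w, hidden)) + out_b
    return sigmoid(z)

def squared_error(predicted, expected):
    """Error between the network output and the expected label (assumed measure)."""
    return (predicted - expected) ** 2

# Hypothetical, already-scaled packet features and hand-picked weights.
x = [0.5, 0.1, 0.9]
hidden_w = [[0.2, -0.1, 0.3], [0.1, 0.3, -0.2]]  # one row per hidden neuron
hidden_b = [0.0, 0.2]
out_w = [0.5, -0.4]
out_b = 0.0

prediction = forward(x, hidden_w, hidden_b, out_w, out_b)
error = squared_error(prediction, expected=1.0)  # 1 = malicious label
print(prediction, error)
```

The same `forward` function serves both training and prediction; only the weights change between the two.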
This error is then propagated back through the network, one layer at a time, and the weights are updated according to the amount that they contributed to the error. This clever bit of math is called the Back Propagation algorithm. The process is repeated for all of the examples in the training data. One round of updating the network for the entire training dataset is called an epoch. A network may be trained for tens, hundreds or thousands of epochs; an example of an epoch round is shown in Fig. 4.

Fig. 4. Epoch example during network training

deviation of 1. This can be thought of as subtracting the mean value or centering the data. Standardization can be useful, and even required in some machine learning algorithms, when the input data values are of different scales.

Fig. 5. Input parameters of the neural network

As shown in the above figure, we have 12 inputs including different types of fields (IP, TCP, and MODBUS). The neural network will try to train and learn using those attributes.

How do we know the number of hidden layers to use and their types? This is a hard question. There are heuristics that we can use, and often the best network structure is found through a process of trial and error experimentation. Generally, we need a network large enough to capture the structure of the problem. In our case we use a fully-connected network structure with three layers, as shown in Fig. 6.

Next, consider the structure of the layers: we have an input layer, some hidden layers and an output layer. As stated previously, a type of network that performs well on binary classification problems is the multi-layer perceptron. This type of neural network is often fully connected. That means that we are looking to build a fairly simple stack of fully-connected layers to solve this problem. As for the activation function, it is best to use one of the most common functions, the relu activation function [8].
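The training procedure described in subsection c) Stochastic Gradient Descent (one row at a time, error propagated back layer by layer, repeated over epochs) can be sketched in plain Python as follows. This is an illustrative toy, not the Keras model used in this work: the two-feature dataset, the starting weights, the learning rate, the sigmoid output and the cross-entropy loss are all assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(rows, labels, epochs=200, lr=0.5):
    """Train a tiny ReLU-hidden / sigmoid-output network with per-row SGD.

    Returns the mean loss of each epoch so the learning curve can be inspected.
    """
    # Small fixed starting weights (hypothetical values in the 0 to 0.3 range).
    w1 = [[0.10, 0.20], [0.30, 0.10]]  # hidden weights, one row per hidden neuron
    b1 = [0.0, 0.0]
    w2 = [0.20, 0.10]                  # output weights, one per hidden neuron
    b2 = 0.0
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, y in zip(rows, labels):
            # Forward pass.
            z1 = [sum(w * xi for w, xi in zip(ws, x)) + b for ws, b in zip(w1, b1)]
            h = [max(0.0, z) for z in z1]
            p = sigmoid(sum(w * hj for w, hj in zip(w2, h)) + b2)
            total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            # Backward pass: error signal at the output, then at each hidden neuron.
            d_out = p - y
            d_hidden = [d_out * w2[j] * (1.0 if z1[j] > 0 else 0.0)
                        for j in range(len(w2))]
            # Update each weight in proportion to its contribution to the error.
            for j in range(len(w2)):
                w2[j] -= lr * d_out * h[j]
                b1[j] -= lr * d_hidden[j]
                for i in range(len(x)):
                    w1[j][i] -= lr * d_hidden[j] * x[i]
            b2 -= lr * d_out
        history.append(total / len(rows))  # one entry per epoch
    return history

# Hypothetical scaled features: label 0 = normal, 1 = malicious.
rows = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
labels = [0, 0, 1, 1]
history = train(rows, labels)
print(history[0], history[-1])
```

Plotting `history` against the epoch index gives the kind of loss curve that a training framework reports after each epoch.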
The Rectified Linear Unit has become very popular in the last few years for logistic/continuous output. It computes the function

f(x) = max(0, x)

One way ReLUs improve neural networks is by speeding up training. The gradient computation is very simple (either 0 or 1 depending on the sign of x).

When building the model, it is therefore important to take into account that the first layer needs to make the input shape clear. The model needs to know what input shape to expect, which is why the input shape, input dimension and input length arguments appear in the documentation of the layers and in practical examples of those layers (Fig. 6).

Fig. 6. Visualization of Neural Network Structure: input layer (18 inputs), hidden layer (8 neurons), output layer (1 output)

Below is a table showing the network input conversion for a normal packet:

Table-1
Network packet different conversion stages

Attribute          Normal Value   Decimalized Value   Encoded Value
IP Source          192.168.1.5    3232235781          0.53640178
IP Destination     192.168.1.3    3232235779          0
Protocol           6              6                   0
TTL                128            128                 0.71646104
TCP Window Size    524288         524288              -1.06582338
Destination Port   56783          56783               1.0261182
Source Port        502            502                 -1.01072698
TCP Length         0              0                   -0.99563837
Modbus Data        FF:00          65280               -0.01348645
Modbus Code        5              5                   -0.88003806
Modbus Register    0              0                   -0.05902683
Modbus Reference   100            100                 -0.13751838

c) Computation Time
The machine used to run the algorithm is an Intel® Core™ i7-3630QM @ 2.4GHz x64-based processor with 4 cores and 8 logical processors, and 8GB of installed memory (RAM). The total time for learning (training + testing) was 3228 seconds, that is about 54 minutes (Fig. 7).

b) Encoding
However, the training must be on numerical fields only. If we have an IP address in the format xxx.xxx.xxx.xxx, the network will not understand it, and the same holds for hexadecimal Modbus data such as FF00. To solve this problem, data must be converted into decimals; we
used Excel plugins to convert IP addresses and hexadecimal values into numbers, so that all the fields took decimal values. As the scales of the different fields are wildly different, this may have a knock-on effect on the network's ability to learn. To overcome this, we used data standardization. Standardization is a scaling technique that assumes the data conform to a normal distribution. If a given data attribute is normal or close to normal, this is probably the scaling method to use. The result of standardization is that the features are rescaled so that they have the properties of a standard normal distribution with a mean of 0 and a standard

Fig. 7. Training computation time

IV. RESULTS AND DISCUSSION

A. Description of the Network

Our ICS network is composed of the SCADA, the PLC, and a simulated heater process, which loads the network with a large amount of traffic for gathering and analyzing real-time data to be shown on the SCADA screen. The reactor diagram is shown in Fig. 8.

Fig. 8. Reactor diagram with inputs/outputs label

containing a PLC, a local network, a SCADA control and a virtual mockup built of electronic-designed parts, and an HMI for operator interaction. Fig. 9 presents the generic schema of the system.

Fig. 9. General ICS architecture

The PLC performs the control of the virtual mockup. It receives the data from the digital mockup as though it were a sensor capturing ongoing information of a physical process such as a fluid heater process. Then, it uses the received data to calculate a control signal that is sent to the mockup through an analog output. The SCADA displays the system information for a supervisor, who can access the major system information about the industrial process. The information comes from the PLC, which gets information from the sensor and updates the system status. The supervisor uses a PC to control some functions of the system, such as the water temperature and the height. The real network is created by a switch.
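The encoding steps described in subsection b) Encoding (decimalizing IP addresses and hexadecimal Modbus values, then standardizing each column) can be sketched in Python as below. This is not the Excel-based procedure used in this work: the function names are hypothetical, the population standard deviation is an assumption, and mapping a constant column to 0 is an assumption chosen to match the zero entries in Table-1.

```python
import ipaddress
import statistics

def ip_to_decimal(address):
    """Convert a dotted IPv4 address to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(address))

def hex_to_decimal(value):
    """Convert hexadecimal Modbus data such as 'FF:00' or 'FF00' to an integer."""
    return int(value.replace(":", ""), 16)

def standardize(column):
    """Rescale a column to zero mean and unit standard deviation.

    A constant column has zero spread, so it is mapped to all zeros
    (assumption, consistent with the 0 entries in Table-1).
    """
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]

# Values taken from Table-1.
print(ip_to_decimal("192.168.1.5"))   # decimalized IP source
print(hex_to_decimal("FF:00"))        # decimalized Modbus data

# Hypothetical TTL column from a handful of packets.
print(standardize([128, 64, 128, 255]))
```

The encoded values in Table-1 depend on the statistics of the full captured dataset, so a small example column like the one above will not reproduce them.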
B. Proposed Approach

The proposed intrusion detection system considers a general type of attack scenario in which malicious packets are injected into a SCADA network composed of a heater and a PLC. The proposed intrusion detection system monitors incoming packets and determines whether an attack is taking place.

In this work, we consider the most common industrial protocol, namely the MODBUS protocol.

Our IDS design is composed of two main phases, the training phase and the detection phase. The training phase is performed offline, as it is somewhat time consuming. In the training phase, the Modbus packet is processed to extract features that represent the normal behavior of the network. Each training Modbus packet has a label indicating either a normal or a malicious packet, which is what we call supervised learning. We adopt the neural network structure to train on the features. The detection phase works almost the same way: the same features are extracted from an incoming packet, and the neural network computes, with the trained parameters, the binary decision, that is, either normal or malicious.

In order to perform the training phase, we simulated network traffic composed of real values for the neural network to train on.

The following table summarizes the system inputs/outputs shown in the reactor diagram (Fig. 8).

Table-2
Reactor system inputs and outputs values

Variable                Value
X1                      Opened/Closed
X2                      Opened/Closed
Xout                    Opened/Closed
Coolant Qc              [0; 500]
Liquid Height H         [0; 200]
Liquid Temperature T    [coolant temperature, undefined]
Reactant concentration  [0; undefined]
Explosion Notifier      True/False

Using existing approaches of a HIL system and a local network, a hybrid approach was designed, respecting some constraints, in order to simulate an industrial environment

a) Preparing the simulation
The simulation is composed of three virtual machines. The first one is the process, which is executed every 0.1 s in order to generate high network traffic. The second one is the SCADA-HMI screen, which displays the result and is capable of changing the temperature. The third is the PLC controller, which is responsible for reading from and writing to the registers and coils it holds, as shown in Fig. 10; the PLC controls the cooling flow rate.

Fig. 10. Simulation of Modbus traffic using virtual machines

The process sends and receives multiple input/output variables. These variables correspond to Modbus addresses, in addition to the value sent for each variable. The addresses with their corresponding variables are shown in Fig. 11.

Fig. 11. Process input/output values

The captured PCAP file can be saved as CSV by using Tshark (a tool installed along with Wireshark), where we can choose the specific fields to be saved (IP source, IP destination, ports, protocols, Modbus data, etc.). After obtaining good traffic and converting it into a CSV file, we can adjust and perform any operation on any field before training the neural network.

C. Results

After training the neural network on the prepared dataset using TensorFlow and Keras, we can evaluate the performance of the network on the same dataset. This gives the accuracy and the loss of the training after splitting the data into 70% for training and 30% for testing. These evaluations show how well the network is doing on the data it is being trained on; training accuracy usually keeps increasing throughout training. Using TensorFlow visualization on the training and testing datasets, we can view the accuracy of our approach, which is shown in Fig. 13.

Fig. 13.
Model accuracy during the training of the network

As we can see, the accuracy on the training data increases as the number of steps (epochs) increases, until it reaches approximately 99.89%, which means that there is a chance of 99.89% of detecting any malicious packet destined towards the network. It is worth noting that the neural network performed very well while training. This can be seen in the speed with which the network learned to draw a pattern from the data given to it: between 0 and 40 epochs the detection accuracy reached approximately 100%. This confirms the importance of decimalizing and reshaping the data before training the network on them.

Moreover, after each epoch, the model is tested against a validation set. Keras can separate a portion of the training data into a validation dataset and evaluate the performance of the model on that validation dataset after each epoch. The lower the loss, the better the model. Unlike accuracy, loss is not a percentage; it is a summation of the errors made for each example in the training or validation sets. Fig. 14 shows the loss upon training the network.

b) Capturing the traffic
Upon running the PLC, process and SCADA, a high volume of network packets can be captured using Wireshark, and then filtered in order to keep only the Modbus/TCP traffic running between the machines. An example of those packets is shown in Fig. 12.

Fig. 12. Modbus/TCP packets capture using Wireshark

detect Denial of Service attacks and adding time stamps to the fields in order to learn the time intervals at which packets usually arrive.

VI. REFERENCES

[1] David Bisson. (2016, Nov 13) How to Approach Cyber Security for Industrial Control Systems. [Online]. Available: https://www.tripwire.com/state-of-security/ics-security/approach-cyber-security-industrial-control-systems/
[2] Warwick Ashford. (2014, Oct 15) Industrial control systems: What are the security challenges? [Online].
Available: http://www.computerweekly.com/news/2240232680/Industrial-control-systems-What-are-the-security-challenges

Fig. 14. Model loss during the training of the network

Similar to accuracy, loss decreases as the number of epochs increases, until it reaches a value of 0.005%, which is an almost negligible loss at the end of the training.

To test the neural network on malicious packets, we prepared many anomalous packets with different combinations of IP addresses, ports, functions, and values and injected them into the IDS. The IDS detects all the packets with a high accuracy of 99.9%. An example of the result Keras shows when injecting it with a normal packet is 0.99987454, which when rounded becomes 1, that is a normal one.

This result, when compared to self-taught learning (STL) and soft-max regression (SMR) [9], shows a higher performance rate: SMR reached an accuracy of 97% and STL reached 98.4%, whereas our approach reached 99.9%.

[3] C.-F. Tsai, Y.-F. Hsu, C.-Y. Lin, and W.-Y. Lin, “Intrusion Detection by Machine Learning: A Review”, Expert Systems with Applications, vol. 36, no. 10, pp. 11994-12000, 2009.
[4] Keith Stouffer, Victoria Pillitteri, Suzanne Lightman, Marshall Abrams, Adam Hahn, “Guide to Industrial Control Systems (ICS) Security”, rev 2, NIST National Institute of Standards and Technology, U.S. Department of Commerce, May 2015.
[5] Characteristics of Industrial Control Systems. [Online]. Available: https://www.citicus.com/Characteristics-of-Industrial-Control-Systems
[6] Muhamad Erza Aminanto, Kwangjo Kim, “Deep Learning in Intrusion Detection System: An Overview”, School of Computing, KAIST, Korea.
[7] Jason Brownlee, “Deep Learning With Python: Develop Deep Learning Models on Theano and TensorFlow Using Keras”, v1.7.
[8] Karlijn Willems (2017, May 2) “Keras Tutorial: Deep Learning in Python”. [Online]. Available:
https://www.datacamp.com/community/tutorials/deep-learning-python
[9] Quamar Niyaz, Weiqing Sun, Ahmad Y Javaid, and Mansoor Alam, “A Deep Learning Approach for Network Intrusion Detection System”, College of Engineering, The University of Toledo, USA.

V. CONCLUSION AND FUTURE WORK

We proposed a deep learning based approach to build an effective and flexible IDS. A multi-layer perceptron IDS with binary classification was implemented. We used a simulated network dataset to evaluate anomaly detection accuracy. We observed that the IDS achieved a very high anomaly detection accuracy. The performance can be further enhanced by adding the ability to