=Paper=
{{Paper
|id=Vol-2889/PAPER_03
|storemode=property
|title=Drift-based approach for evolving data stream classification in Intrusion detection system
|pdfUrl=https://ceur-ws.org/Vol-2889/PAPER_03.pdf
|volume=Vol-2889
|authors=Sugandh Seth,Gurwinder Singh,Kuljit Kaur Chahal
}}
==Drift-based approach for evolving data stream classification in Intrusion detection system==
Sugandh Seth, Gurwinder Singh, Kuljit Kaur Chahal

Guru Nanak Dev University, Amritsar, India

Abstract

Machine learning and deep learning are extensively used to improve the performance of intrusion detection systems. Existing work on intrusion detection using data mining and machine learning is effective, but it trains static batch classifiers to detect intrusions irrespective of the time-varying characteristics of the data stream. Aims: This paper proposes an adaptive approach for online intrusion detection that uses stream-oriented learning to adapt to concept drift in a real-world environment. Method: An Adaptive Random Forest classifier with the ADWIN change detector is used to detect change in a data stream and to adapt to the detected drift, resulting in agile adaptation against unknown intrusions. The proposed approach also removes the need to retrain the model over time. Results: The latest CIC-IDS 2018 dataset is used to evaluate the approach. The proposed method achieves a final accuracy of 99.5% and a recall of 99.8%.

Keywords: Intrusion Detection System, Concept Drift, Stream-oriented learning, Adaptive Random Forest

1. Introduction

The prominence of data mining and machine learning applications is increasing with time. Much recent research applies machine learning and deep learning techniques in domains such as weather forecasting, spam detection, detection of fraudulent financial transactions, and intrusion detection. Traditionally, machine learning focused on static data assumed to be sufficient to represent the underlying distribution. However, real-world problems usually do not fit models with such restrictions. 
Also, many real-world applications such as Intrusion Detection Systems (IDS) have non-stationary data distributions, which cause the problem of non-stationary learning, or concept drift, over time. Many studies with good results investigate IDS using deep learning and machine learning approaches. However, most of these studies use static data sources. They fail to take rapid technological developments and the problem of concept drift into account [1], leading to poor system performance. Concept drift is important as an indicator because it measures how the data distribution varies over time. An IDS can also be considered a typical concept drift scenario. Usually, for a single source providing a data stream to a network, the monitored data is in a stable state and identically distributed. In the case of an unknown intrusion, the current data distribution changes dynamically compared with the historical data. This motivates building an adaptive intrusion detection method that uses incremental learning based on concept drift and quickly adapts to new intrusion types. Moreover, the endless emergence of new attacks and security loopholes raises the need for a classifier that quickly adapts to emerging intrusion methods.

WCNC-2021: Workshop on Computer Networks & Communications, May 01, 2021, Chennai, India. EMAIL: kuljitchahal.cse@gndu.ac.in (Kuljit Kaur Chahal). ORCID: 0000-0003-3785-116X (Kuljit Kaur Chahal). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

In such a situation, the static batch learning approach discussed above delivers poor performance. 
In other words, when a static classifier becomes obsolete, its response to new intrusion types becomes slow, and costly re-training [2] becomes mandatory. In contrast, incremental learning that adaptively updates the classifier as the data stream evolves keeps the performance of an IDS high. Current IDS models thus have several noteworthy drawbacks:

* The underlying model usually detects only known network attacks, whereas an IDS is exposed to novel malicious attacks.
* The models are built on static data, whereas data arrives in streams in an IDS and its distribution may vary over time.
* The models become obsolete with time and need to be retrained, which is cost intensive.

To overcome the above research gaps, we propose a concept drift-based approach for evolving data stream classification in intrusion detection systems. With limited processing time and memory, the concept drift detection method ensures accurate and quick identification of changes in the underlying data distribution, followed by the fastest possible adaptation of the model [3]. The major contributions of the paper are as follows:

* A high-performance intrusion detection system based on streamed data with concept drift.
* An incremental IDS model with adaptive updating of classification models, achieving high accuracy in real time.
* An IDS with agile adaptation against unknown intrusions.

2. Related Work

A lot of research has been done in the field of intrusion detection. Many researchers have proposed machine learning and deep learning techniques for detecting intrusions. Ferrag et al. [4] evaluated seven deep learning techniques, namely deep neural network (DNN), recurrent neural network (RNN), restricted Boltzmann machine (RBM), deep belief network (DBN), convolutional neural network (CNN), deep Boltzmann machine (DBM), and deep autoencoder (DA), on the latest CIC-IDS 2018 dataset. 
Although this paper comprehensively evaluates the deep learning techniques, all of the evaluation is done on static data. Karatas et al. [5] evaluated six machine learning-based IDSs using the K-Nearest Neighbour, Random Forest, Gradient Boosting, AdaBoost, Decision Tree, and Linear Discriminant Analysis algorithms on the CIC-IDS 2018 dataset. Their approach focuses on balancing the skewed dataset by oversampling with SMOTE. Although this approach gives good results, it also works on static data. Roshan et al. [6] proposed an adaptive intrusion detection system based on extreme learning and clustering. The model was evaluated on the NSL-KDD dataset, and the authors claim an accuracy of 89% for novel attacks. Feng et al. [7] developed a plug-and-play device to capture packets. Deep learning techniques were used for detecting DoS attacks, a CNN for detecting XSS, and an LSTM for detecting SQL injection. Yuan et al. [8] proposed a concept drift-based ensemble incremental learning approach for intrusion detection systems. The HDDM drift detection method, based on Hoeffding's bounds, was used to detect the anomaly, and ensemble-based incremental learning with weighted voting was used for classification. All the experiments were done on the NSL-KDD dataset, and the authors claim an accuracy of 94.91% with the proposed approach. Breve & Zhao [3] proposed semi-supervised classification with concept drift for intrusion detection. The study is based on passive drift detection, without explicitly using any detection algorithm, and is inspired by the competitive and cooperative behavior that some animals show when protecting their territory. It follows a natural way of learning new data and forgetting older data. The proposed algorithm was evaluated on the KDD Cup 1999 dataset. Park et al. [9] proposed an online eigenvector transformation for reflecting concept drift in intrusion detection systems. 
In online PCA, eigenvectors were computed by transforming the existing eigenvectors according to the latest data, without generating new eigenvectors. The authors compared the performance of network intrusion detection using both online and offline PCA. Both methods gave good precision, but online PCA outperformed the offline method on recall. The majority of the work proposed in the literature uses old datasets or static datasets. To overcome these research gaps, this paper uses an Adaptive Random Forest classifier with drift detection to classify attacks in stream data using the latest CIC-IDS 2018 dataset.

The rest of the paper is structured as follows. Section 2 reviews the current literature on intrusion detection systems. Section 3 discusses the research methodology. Section 4 discusses the results obtained. Section 5 concludes the paper with a summary.

3. Research Methodology

In the network security domain, malicious intrusions have increased, making IDS (Intrusion Detection System) design vital for more secure systems. Machine learning methods are increasingly used for network anomaly detection. However, currently available works do not explore the variation in data over time, which restricts their ability to detect new intrusion types. Therefore, to handle unpredicted changes in the statistical properties of the data over time, we propose an IDS with concept drift-based incremental learning using an Adaptive Random Forest classifier with ADWIN drift detection.

3.1. Data Preprocessing

3.1.1. Data Collection

The proposed study uses the latest CIC-IDS 2018 dataset, a massive dataset that incorporates 14 modern-day attacks. The dataset was published by the Communications Security Establishment (CSE) and the Canadian Institute for Cybersecurity (CIC). It comprises 80 features and 16 million rows.

3.1.2. 
Data Transformation

{| class="wikitable"
|+ Table 1: Count of benign and various attack sessions in the CIC-IDS 2018 dataset
! Label !! Count
|-
| Benign || 13484708
|-
| Bot || 286191
|-
| Brute Force-Web || 611
|-
| Brute Force-XSS || 230
|-
| DDOS attacks-HOIC || 686012
|-
| DDOS attacks-LOIC-UDP || 1730
|-
| DDOS attacks-LOIC-HTTP || 576191
|-
| DoS attacks-GoldenEye || 41508
|-
| DoS attacks-Hulk || 461912
|-
| DoS attacks-SlowHTTPTest || 139890
|-
| DoS attacks-Slowloris || 10990
|-
| FTP-BruteForce || 193360
|-
| Infiltration || 161934
|-
| SQL Injection || 87
|}

The CIC-IDS dataset comprises 13 modern-day attacks, as listed in Table 1. The dataset is relabeled into Attack and Benign sessions: all 13 attack types listed in the table are relabeled to the Attack class. The multi-class attack classification problem is thus reduced to binary classification. The dataset mix after preprocessing is listed in Table 2.

{| class="wikitable"
|+ Table 2: CIC-IDS 2018 dataset after pre-processing
! Label !! Count
|-
| Benign || 12615791
|-
| Attack || 2586295
|}

3.2. Training the Model

Any modification in the underlying process of data generation is referred to as concept drift. In the classification context, concept drift refers to variation in the statistical properties of the target variable. The target variable is the one for which the model is trying to make a time-based prediction, and the term concept is used for the quantity the researcher aims to predict. As previously mentioned, the distribution that generates the items of the data stream can change with time. To address this problem of concept drift in intrusion detection systems, the proposed model uses Adaptive Random Forest with the Adaptive Windowing (ADWIN) method for concept drift detection.

Adaptive Random Forest

Random Forest is a popular learning algorithm for regression and non-streaming (batch) classification tasks. This approach builds multiple trees and avoids overfitting by decorrelating them through bootstrap aggregation [10] and random feature selection at node splits. To create each tree's bootstrap sample, the original Random Forest passes over the input data multiple times. 
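
To make the multiple-pass issue concrete, the following minimal sketch shows one widely used single-pass substitute for bootstrap resampling, online bagging (Oza and Russell), in which each arriving instance receives a Poisson-distributed weight per tree instead of being resampled from a stored dataset. It is an illustration under assumptions and is not taken from the paper: the function names `poisson` and `online_bagging_weights`, the rate `lam=1.0`, and the toy stream are all hypothetical.

```python
import math
import random
from collections import Counter


def poisson(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def online_bagging_weights(stream, n_trees, lam=1.0, seed=0):
    """Yield (instance, weights) in a single pass over the stream.

    weights[i] is the number of times tree i would train on the instance,
    which approximates bootstrap resampling without storing the data.
    """
    rng = random.Random(seed)
    for instance in stream:
        yield instance, [poisson(lam, rng) for _ in range(n_trees)]


if __name__ == "__main__":
    # Toy stream of (features, label) pairs; the values are made up.
    stream = [((0.1 * i, i % 3), "Attack" if i % 5 == 0 else "Benign")
              for i in range(1000)]
    totals = Counter()
    for _, weights in online_bagging_weights(stream, n_trees=3):
        for tree_id, weight in enumerate(weights):
            totals[tree_id] += weight
    # Each tree sees roughly lam * len(stream) weighted instances in total.
    print(dict(totals))
```

With `lam=1.0`, a given tree skips an instance with probability of about 0.37 (that is, exp(-1)), which mirrors the fraction of instances left out of a classical bootstrap sample. A larger rate would make every tree train on more of the stream, so the value here should be read as an illustrative choice only.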
It also evaluates only a subset of the original features at each of the tree's internal nodes. Performing multiple passes over the input data is infeasible in data stream learning. Thus, Random Forests need to be adapted to streaming data with:
# A suitable process for online bootstrap aggregation
# A restriction of each leaf split decision to a feature subset
To achieve the second requirement, the base tree algorithm is modified [11]: the set of features considered for further splits is limited to a random subset of size m. Here, m