=Paper= {{Paper |id=Vol-2889/PAPER_03 |storemode=property |title=Drift-based approach for evolving data stream classification in Intrusion detection system |pdfUrl=https://ceur-ws.org/Vol-2889/PAPER_03.pdf |volume=Vol-2889 |authors=Sugandh Seth,Gurwinder Singh,Kuljit Kaur Chahal }} ==Drift-based approach for evolving data stream classification in Intrusion detection system== https://ceur-ws.org/Vol-2889/PAPER_03.pdf
Drift-based approach for evolving data stream classification in
Intrusion detection system
Sugandh Seth, Gurwinder Singh, Kuljit Kaur Chahal
Guru Nanak Dev University, Amritsar, India


                 Abstract
                 Machine learning and deep learning are extensively used to improve the performance of
                 intrusion detection systems. Existing work on intrusion detection using data mining and
                 machine learning is effective, but it trains static batch classifiers and ignores the
                 time-varying characteristics of a regular data stream. Aims: This paper proposes an
                 adaptive approach for online intrusion detection that uses stream-oriented learning to
                 adapt to concept drift in real-world environments. Method: An Adaptive Random Forest
                 classifier with the ADWIN change detector is used to detect changes in the data stream
                 and adapt to them, which gives agile adaptation against unknown intrusions and removes
                 the need to retrain the model over time. Results: The approach is evaluated on the
                 latest CIC-IDS 2018 dataset. The proposed method achieves a final accuracy of 99.5% and
                 a recall of 99.8%.

                 Keywords
                 Intrusion Detection System, Concept Drift, Stream-oriented learning, Adaptive Random Forest

1. Introduction
    Data mining and machine learning applications are growing in prominence. Much recent research
applies machine learning and deep learning techniques in domains such as weather forecasting, spam
detection, detection of fraudulent financial transactions, and intrusion detection. Traditionally,
machine learning has assumed a static dataset large enough to represent the underlying distribution.
Real-world problems, however, rarely fit models with such restrictions. Many real-world applications,
including Intrusion Detection Systems (IDS), have non-stationary data distributions, which gives rise
to the problem of non-stationary learning, or concept drift, over time. Many studies report good
results for IDS built with deep learning and machine learning approaches. However, most of them rely
on static data sources. They fail to take rapid technological developments and the problem of concept
drift into account [1], which leads to poor system performance.

    Concept drift matters as an indicator because it measures how a data distribution varies over
time. Intrusion detection is also a typical concept drift scenario. Usually, when a single source
provides a data stream to a network, the monitored data is in a stable state and identically
distributed. When an unknown intrusion occurs, the current data distribution changes relative to the
historical data. This motivates building an adaptive intrusion detection method that uses incremental
learning based on concept drift and quickly adapts to new intrusion types.

   Moreover, the endless emergence of new attacks and security loopholes calls for a classifier
that quickly adapts to emerging intrusion methods. In such a situation, the static batch

WCNC-2021: Workshop on Computer Networks & Communications, May 01, 2021, Chennai, India.
EMAIL: kuljitchahal.cse@gndu.ac.in (Kuljit Kaur Chahal)
ORCID: 0000-0003-3785-116X (Kuljit Kaur Chahal)
              © 2021 Copyright for this paper by its authors.
              Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
              CEUR Workshop Proceedings (CEUR-WS.org)


learning approach discussed above delivers poor performance. In other words, when a static classifier
becomes obsolete, it responds slowly to new intrusion types, and costly retraining [2] becomes
mandatory. In contrast, incremental learning, which adaptively updates the classifier as the data
stream evolves, helps an IDS sustain its performance over time.

Thus, current IDS models have several noteworthy drawbacks:
    The underlying model of a current IDS usually detects only known network attacks, so the IDS
        remains vulnerable to novel malicious attacks.
    IDS models are built on static data, whereas data arrives at an IDS in streams and the data
        distribution may vary over time.
    IDS models become obsolete with time and need to be retrained, which is costly.

    To address these gaps, we propose a concept drift-based approach for evolving data stream
classification in intrusion detection systems. Under limited processing time and memory, the concept
drift detection method aims to identify changes in the underlying data distribution accurately and
quickly, so that the model can adapt as fast as possible [3].

The major contributions of the paper are as follows:
    A high-performance intrusion detection system based on streamed data with concept drift.
    An incremental IDS model with adaptive updating of classification models, achieving high
       accuracy in real time.
    An IDS with agile adaptation against unknown intrusions.

2. Related Work
   A lot of research has been done in the field of intrusion detection. Many researchers have
proposed machine learning and deep learning techniques for detecting intrusions.

   Ferrag et al. [4] evaluated seven deep learning techniques, namely deep neural networks (DNN),
recurrent neural networks (RNN), restricted Boltzmann machines (RBM), deep belief networks (DBN),
convolutional neural networks (CNN), deep Boltzmann machines (DBM), and deep autoencoders (DA), on
the latest CIC-IDS 2018 dataset. Although the paper evaluates these techniques comprehensively, all
of the evaluation is done on static data.

   Karatas et al. [5] evaluated six machine learning-based IDS using the K Nearest Neighbour, Random
Forest, Gradient Boosting, AdaBoost, Decision Tree, and Linear Discriminant Analysis algorithms on
the CIC-IDS 2018 dataset. Their approach focuses on balancing the skewed dataset by oversampling
with SMOTE. Although this approach gives good results, it too is evaluated on static data.

   Roshan et al. [6] proposed an adaptive intrusion detection system based on extreme learning and
clustering. The model was evaluated on the NSL-KDD dataset, and the authors report an accuracy of
89% for novel attacks.

   Feng et al. [7] developed a plug-and-play device to capture packets. Deep learning techniques
were used to detect DoS attacks, a CNN was used to detect XSS, and an LSTM was used to detect SQL
injection.

   Yuan et al. [8] proposed a concept drift-based ensemble incremental approach for intrusion
detection. The HDDM drift detection method, which is based on Hoeffding's bounds, was used to detect
anomalies, and ensemble-based incremental learning with weighted voting was used for classification.
All experiments were done on the NSL-KDD dataset, and the authors report an accuracy of 94.91% with
the proposed approach.

   Breve and Zhao [3] proposed semi-supervised classification with concept drift for intrusion
detection. The study relies on passive drift detection, without an explicit detection algorithm, and
is inspired by the competitive and cooperative behavior that some animals show when protecting their
territory. It follows the natural process of learning new data and forgetting older data. The
proposed algorithm was evaluated on the KDD Cup 1999 dataset.

   Park et al. [9] proposed an online eigenvector transformation that reflects concept drift in
intrusion detection systems. In online PCA, the eigenvectors are obtained by updating the existing
eigenvectors with the latest data instead of generating new ones. The authors compared network
intrusion detection performance using online and offline PCA. Both methods gave good precision, but
online PCA achieved a better recall than the offline method.

    The majority of the work proposed in the literature uses old datasets or static datasets. To
overcome these research gaps, this paper uses an Adaptive Random Forest classifier with drift
detection to classify attacks in stream data from the latest CIC-IDS 2018 dataset.

   The rest of the paper is structured as follows. Section 2 reviews the current literature on
intrusion detection systems. Section 3 discusses the research methodology. Section 4 discusses the results
obtained. Section 5 concludes the paper with a summary.

3. Research Methodology
   Malicious intrusions have increased in the network security domain, making IDS (Intrusion
Detection System) design vital for more secure systems. Machine learning methods are increasingly
used for network anomaly detection. However, currently available works do not explore how the data
varies over time, which restricts their ability to detect new intrusion types. Therefore, to handle
unpredicted changes in the statistical properties of the data over time, we propose an IDS with
concept drift-based incremental learning using an Adaptive Random Forest classifier with ADWIN drift
detection.

3.1. Data Preprocessing

3.1.1. Data Collection
   The study uses the latest CIC IDS 2018 dataset, a massive dataset that incorporates 14
modern-day attacks. It was published by the Communications Security Establishment (CSE) and the
Canadian Institute for Cybersecurity (CIC). The dataset comprises 80 features and 16 million rows.

3.1.2. Data Transformation

Table 1: Count of benign and various attack sessions in the CIC-IDS 2018 dataset
                     Label                         Count
                     Benign                        13484708
                     Bot                           286191
                     Brute Force -Web              611
                     Brute Force -XSS              230
                     DDOS attacks-HOIC             686012
                     DDOS attacks-LOIC-UDP         1730
                     DDOS attacks-LOIC-HTTP        576191
                     DoS attacks-GoldenEye         41508
                     DoS attacks-Hulk              461912
                     DoS attacks-SlowHTTPTest      139890
                     DoS attacks-Slowloris         10990
                     FTP-BruteForce                193360
                     Infiltration                  161934
                     SQL Injection                 87

    The CIC-IDS dataset comprises 13 modern-day attacks, as listed in Table 1. The dataset is
relabeled into Attack and Benign sessions: all 13 attack types listed in the table are relabeled to
the Attack class. The multi-class attack classification problem is thus reduced to binary
classification. The resulting dataset mix after preprocessing is listed in Table 2.

Table 2: CIC-IDS 2018 dataset after pre-processing
                       Label                            Count
                       Benign                           12615791
                       Attack                           2586295
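
The relabelling step described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name to_binary_label and the toy label list are assumptions, and the only rule taken from the text is that every non-benign label maps to the Attack class.

```python
from collections import Counter

BENIGN = "Benign"
ATTACK = "Attack"


def to_binary_label(label: str) -> str:
    """Map an original session label to 'Benign' or 'Attack'."""
    # Any label other than benign is treated as an attack.
    return BENIGN if label.strip().lower() == "benign" else ATTACK


# Toy labels standing in for the dataset's label column (illustrative only).
labels = ["Benign", "Bot", "DoS attacks-Hulk", "Benign", "Sql Injection"]
binary = [to_binary_label(x) for x in labels]
counts = Counter(binary)
print(counts)
```

Counting the relabelled values, as in the last lines, yields the kind of two-row summary shown in Table 2.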


3.2. Training the Model
    Concept drift refers to any change in the underlying process that generates the data. In the
classification context, it means that the statistical properties of the target variable change over
time. The target variable is the quantity the model is trying to predict, and the term concept
refers to that quantity. As mentioned previously, the distribution that generates the items of the
data stream can change with time. To address concept drift in intrusion detection systems, the
proposed model uses an Adaptive Random Forest with the Adaptive Windowing (ADWIN) method for
concept drift detection.
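
The adaptive windowing idea can be illustrated with a deliberately naive sketch. It assumes the standard ADWIN cut condition: the window is split into an older part of length n0 and a newer part of length n1, and the oldest element is dropped whenever the two sub-window means differ by more than eps_cut = sqrt(ln(4n/delta) / (2m)), with m = 1/(1/n0 + 1/n1). The class name SimpleAdwin, the default delta, and the max_window cap are illustrative assumptions. Practical implementations compress the window into buckets instead of scanning every split, so this is not the detector used in the paper's experiments.

```python
import math
from collections import deque


class SimpleAdwin:
    """Naive ADWIN-style detector that checks every split of the window."""

    def __init__(self, delta: float = 0.002, max_window: int = 1000):
        self.delta = delta                       # confidence parameter (assumed default)
        self.window = deque(maxlen=max_window)   # cap is a practical assumption

    def _has_cut(self) -> bool:
        """Return True if some split shows a significant difference in means."""
        values = list(self.window)
        n = len(values)
        total = sum(values)
        delta_prime = self.delta / n
        head_sum = 0.0
        for n0 in range(1, n):
            head_sum += values[n0 - 1]
            n1 = n - n0
            mean_old = head_sum / n0
            mean_new = (total - head_sum) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)
            eps_cut = math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))
            if abs(mean_old - mean_new) > eps_cut:
                return True
        return False

    def update(self, value: float) -> bool:
        """Add one value in [0, 1]; return True if drift was detected."""
        self.window.append(value)
        drift = False
        # Drop the oldest elements until no split violates the threshold.
        while len(self.window) >= 2 and self._has_cut():
            self.window.popleft()
            drift = True
        return drift


# Synthetic 0/1 stream whose mean jumps from 0 to 1 halfway through.
detector = SimpleAdwin()
stream = [0.0] * 200 + [1.0] * 200
drift_points = [i for i, x in enumerate(stream) if detector.update(x)]
print(drift_points[:1], len(detector.window))
```

A detector of this kind is typically fed a 0/1 stream of prediction errors, so that a rise in the error rate shrinks the window and signals that the model should adapt.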

Adaptive Random Forest

    Random Forest is a popular learning algorithm for regression and non-stream (batch)
classification tasks. It builds multiple trees and reduces overfitting by decorrelating them through
bootstrap aggregation [10] and random feature selection at node splits. To create the bootstrap
sample for each tree, the original Random Forest passes over the input data multiple times. It also
considers only a subset of the original features at each of a tree's internal nodes.

  Performing multiple passes over the input data is infeasible in data stream learning.
Thus, Random Forests need to be adapted to streaming data with:

       A suitable process for online bootstrap aggregation
       Restriction of each leaf split decision to a subset of features
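
Both requirements can be sketched in a few lines. The sketch assumes the usual online bagging scheme, in which each tree trains on an incoming instance k times with k drawn from a Poisson distribution (rate 1 reproduces classical online bagging). The rate, the function names, and the subset size m = 9 below are illustrative assumptions and are not taken from the paper. The feature count of 80 follows the dataset description.

```python
import math
import random


def poisson(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def online_bootstrap_weights(n_trees: int, lam: float, rng: random.Random) -> list[int]:
    """For one incoming instance, decide how many times each tree trains on it."""
    return [poisson(lam, rng) for _ in range(n_trees)]


def split_candidates(n_features: int, m: int, rng: random.Random) -> list[int]:
    """Pick the random subset of m feature indices that a leaf may split on."""
    return sorted(rng.sample(range(n_features), m))


rng = random.Random(42)
weights = online_bootstrap_weights(n_trees=10, lam=1.0, rng=rng)
features = split_candidates(n_features=80, m=9, rng=rng)
print(weights, features)
```

A weight of 0 means the tree skips the instance, which plays the role of an instance being left out of a batch bootstrap sample, and no second pass over the stream is needed.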

    To meet the second requirement, the algorithm for the base tree is modified [11]: the set of
features considered for further splits is limited to a random subset of size m.
Here, m