=Paper= {{Paper |id=Vol-2889/PAPER_03 |storemode=property |title=Drift-based approach for evolving data stream classification in Intrusion detection system |pdfUrl=https://ceur-ws.org/Vol-2889/PAPER_03.pdf |volume=Vol-2889 |authors=Sugandh Seth,Gurwinder Singh,Kuljit Kaur Chahal }} ==Drift-based approach for evolving data stream classification in Intrusion detection system== https://ceur-ws.org/Vol-2889/PAPER_03.pdf

Drift-based approach for evolving data stream classification in
Intrusion detection system
Sugandh Seth, Gurwinder Singh, Kuljit Kaur Chahal
Guru Nanak Dev University, Amritsar, India

Abstract
Machine learning and deep learning are extensively used to improve the performance of
intrusion detection systems. Existing work on intrusion detection using data mining and
machine learning is effective, but it trains static batch classifiers to detect intrusions
irrespective of the time-varying characteristics of the data stream. Aims: This paper
proposes an adaptive approach for online intrusion detection that uses stream-oriented
learning to adapt to concept drift in real-world environments. Method: An Adaptive Random
Forest classifier with the ADWIN change detector is used to detect changes in the data
stream and adapt to them, resulting in agile adaptation against unknown intrusions; the
approach also removes the need to retrain the model over time. Results: The latest
CIC-IDS 2018 dataset is used to evaluate the approach. The proposed method achieves a
final accuracy of 99.5% and a recall of 99.8%.

Keywords
Intrusion Detection System, Concept Drift, Stream-oriented learning, Adaptive Random Forest

1. Introduction
The prominence of data mining and machine learning applications is increasing with time. Recently,
much research has proposed using machine learning and deep learning techniques in domains such as
weather forecasting, spam detection, fraudulent financial transaction detection, and intrusion
detection. Traditionally, machine learning focused on static data assumed sufficient to represent
the underlying distribution. However, real-world problems usually do not fit models with such
restrictions. Many real-world applications, such as Intrusion Detection Systems (IDS), have
non-stationary data distributions, which cause the problem of non-stationary learning, or concept
drift, over time. Many studies with good results investigate IDS using deep learning and machine
learning approaches. However, most of them use static data sources. These studies fail to take
rapid technological developments and the problem of concept drift into account [1], leading to
poor system performance.

Concept drift is important as an indicator because it measures how the data distribution varies
over time. An IDS is a typical concept drift scenario. Usually, for a single source providing a
data stream to a network, the monitored data is in a stable state and identically distributed.
When an unknown intrusion occurs, the current data distribution changes relative to the
historical data. This motivates building an adaptive intrusion detection method that uses
incremental learning based on concept drift and quickly adapts to new intrusion types.

Moreover, the endless emergence of new attacks and security loopholes raises the need for a
classifier that quickly adapts to emerging intrusion methods. In such a situation, the static batch
learning approach discussed above delivers poor performance. In other words, when a static classifier
becomes obsolete, its response to new intrusion types becomes slow, and costly re-training [2]
becomes mandatory. In contrast, incremental learning, which adaptively updates the classifier as
the data stream evolves, helps sustain the performance of an IDS.

WCNC-2021: Workshop on Computer Networks & Communications, May 01, 2021, Chennai, India.
EMAIL: kuljitchahal.cse@gndu.ac.in (Kuljit Kaur Chahal)
ORCID: 0000-0003-3785-116X (Kuljit Kaur Chahal)
© 2021 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

Thus, current IDS models have several noteworthy drawbacks:
• The underlying model of a current IDS usually detects only known network attacks, leaving the
IDS exposed to novel malicious attacks.
• IDS models are built on static data, whereas data arrive in streams in an IDS and the data
distribution may vary over time.
• IDS models become obsolete with time and need to be retrained, which is cost-intensive.

To address these gaps, we propose a concept drift-based approach for evolving data stream
classification in intrusion detection systems. With limited processing time and memory, the
concept drift detection method aims to identify changes in the underlying data distribution
accurately and quickly, followed by the fastest possible adaptation of the model [3].

The major contributions of the paper are as follows:
• A high-performance intrusion detection system based on streamed data with concept drift.
• An incremental IDS model with adaptive updating of classification models, achieving high
accuracy in real time.
• An IDS with agile adaptation against unknown intrusions.

2. Related Work
A lot of research has been done in the field of intrusion detection. Many researchers have
proposed machine learning and deep learning techniques for detecting intrusions.

Ferrag et al. in [4] evaluated seven deep learning techniques, namely deep neural networks (DNN),
recurrent neural networks (RNN), restricted Boltzmann machines (RBM), deep belief networks (DBN),
convolutional neural networks (CNN), deep Boltzmann machines (DBM), and deep autoencoders (DA),
on the latest CIC-IDS 2018 dataset. Although this paper evaluates the deep learning techniques
comprehensively, all of the evaluation is done on static data.

Karatas et al. in [5] evaluated six machine learning-based IDS models using the K-Nearest
Neighbour, Random Forest, Gradient Boosting, AdaBoost, Decision Tree, and Linear Discriminant
Analysis algorithms on the CIC-IDS 2018 dataset. Their approach focuses on balancing the skewed
dataset by oversampling with SMOTE. Although the approach gives good results, it is also
evaluated on static data.

In [6] Roshan et al. proposed an adaptive intrusion detection system based on extreme learning and
clustering. The proposed model was evaluated using the NSL-KDD dataset, and the authors claimed an
accuracy of 89% for novel attacks.

Feng et al. in [7] developed a plug-and-play device to capture packets. Deep learning techniques
were used for detecting DoS attacks, a CNN was used to detect XSS, and an LSTM was used to detect
SQL injection.

In [8] Yuan et al. proposed a concept drift-based ensemble incremental approach for intrusion
detection systems. The HDDM drift detection method, based on Hoeffding's bounds, was used to detect
the anomaly, and ensemble-based incremental learning with weighted voting was used for
classification. All the experiments were done on the NSL-KDD dataset. They claimed an accuracy of
94.91% with the proposed approach. In [3] Breve & Zhao proposed semi-supervised classification with
concept drift for intrusion detection. The study is based on passive drift detection without
explicitly using any detection algorithm and is inspired by the competitive and cooperative
behavior of some animals protecting their territory. It follows a natural way of learning new data
and forgetting older data. The proposed algorithm was evaluated on the KDD Cup 1999 dataset.

Park et al. in [9] proposed an online eigenvector transformation to reflect concept drift in
intrusion detection systems. In online PCA, the eigenvectors are obtained by transforming the
existing eigenvectors according to the latest data instead of generating new eigenvectors. They
compared the performance of network intrusion detection using online and offline PCA. Both methods
gave good precision, but online PCA achieved a better recall than the offline method.

The majority of the work proposed in the literature uses old datasets or static datasets. To
address these gaps, this paper uses an Adaptive Random Forest classifier with drift detection
to classify attacks in streamed data from the latest CIC-IDS 2018 dataset.

The rest of the paper is structured as follows. Section 2 reviews the current literature on
intrusion detection systems. Section 3 discusses the research methodology. Section 4 discusses the
results obtained. Section 5 concludes the paper with a summary.

3. Research Methodology
In the network security domain, malicious intrusions have increased, making the design of the IDS
(Intrusion Detection System) vital for more secure systems. Machine learning methods are
increasingly used for network abnormality detection. However, currently available works do not
explore the variation in data over time, which restricts their ability to detect new intrusion
types. Therefore, to handle unpredicted changes in the statistical properties of the data over
time, we propose an IDS with concept drift-based incremental learning that uses an Adaptive Random
Forest classifier with ADWIN drift detection.

3.1. Data Preprocessing

3.1.1. Data Collection
The proposed study uses the latest CIC IDS 2018 dataset, a massive dataset that incorporates
14 modern-day attacks. The dataset was published by the Communications Security Establishment
(CSE) and the Canadian Institute for Cybersecurity (CIC). It comprises 80 features and 16 million
rows.

3.1.2. Data Transformation

Table 1: Count of benign and various attack sessions in the CIC-IDS 2018 dataset
Label                       Count
Benign                      13484708
Bot                         286191
Brute Force -Web            611
Brute Force -XSS            230
DDOS attacks-HOIC           686012
DDOS attacks-LOIC-UDP       1730
DDOS attacks-LOIC-HTTP      576191
DoS attacks-GoldenEye       41508
DoS attacks-Hulk            461912
DoS attacks-SlowHTTPTest    139890
DoS attacks-Slowloris       10990
FTP-BruteForce              193360
Infiltration                161934
SQL Injection               87

The CIC-IDS dataset comprises the 13 modern-day attacks listed in Table 1. The dataset is
relabeled into Attack and Benign sessions: all 13 attack types listed in the table are relabeled
to the Attack class. The multi-class attack classification problem is thereby reduced to binary
classification. The dataset mix after preprocessing is listed in Table 2.

Table 2: CIC-IDS 2018 dataset after pre-processing
Label     Count
Benign    12615791
Attack    2586295
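
The relabeling step described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `to_binary_label` and the toy label list are hypothetical, and the sketch only shows how every non-benign label collapses into a single Attack class.

```python
from collections import Counter


def to_binary_label(label: str) -> str:
    """Map a raw session label to 'Benign' or 'Attack'."""
    return "Benign" if label.strip().lower() == "benign" else "Attack"


# Toy labels for illustration only; these are not rows from the dataset.
raw_labels = ["Benign", "Bot", "DoS attacks-Hulk", "Benign", "Infiltration", "SQL Injection"]

# Count sessions per class after collapsing all attack types into one class.
binary_counts = Counter(to_binary_label(label) for label in raw_labels)
print(binary_counts)
```

Matching on the benign label rather than enumerating attack names means that any attack type not anticipated in advance still falls into the Attack class.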

3.2. Training the Model
Concept drift refers to any change in the underlying process that generates the data. In the
classification context, it denotes a change over time in the statistical properties of the target
variable, that is, the variable the model is trying to predict; the term concept refers to this
quantity. As previously mentioned, the distribution that generates the items of a data stream can
change with time. To address concept drift in intrusion detection systems, the proposed model
uses an Adaptive Random Forest with the Adaptive Windowing (ADWIN) method for concept drift
detection.
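
To make the idea of adaptive windowing concrete, the following is a simplified sketch of an ADWIN-style detector. It is not the implementation used in the paper: the class name `SimpleAdwin` and the default `delta` are assumptions made for illustration, and the sketch checks every split of the window naively instead of using the bucket compression of the original algorithm. It keeps a window of recent values and drops the oldest ones whenever two sub-windows differ in mean by more than the threshold eps_cut = sqrt(ln(4/delta') / (2m)), with m = 1/(1/n0 + 1/n1) and delta' = delta/n.

```python
import math
from collections import deque


class SimpleAdwin:
    """Naive ADWIN-style detector: checks every split, no bucket compression."""

    def __init__(self, delta: float = 0.002):
        self.delta = delta      # confidence parameter (illustrative default)
        self.window = deque()   # recent values, oldest on the left

    def _has_significant_split(self) -> bool:
        n = len(self.window)
        if n < 2:
            return False
        total = sum(self.window)
        log_term = math.log(4.0 * n / self.delta)  # ln(4 / delta'), delta' = delta / n
        head_sum = 0.0
        for n0, value in enumerate(self.window, start=1):
            if n0 == n:
                break  # the newer sub-window must not be empty
            head_sum += value
            n1 = n - n0
            mean_old = head_sum / n0
            mean_new = (total - head_sum) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)
            eps_cut = math.sqrt(log_term / (2.0 * m))
            if abs(mean_old - mean_new) > eps_cut:
                return True
        return False

    def update(self, value: float) -> bool:
        """Add one observation; return True if the window was shrunk."""
        self.window.append(value)
        drift = False
        while self._has_significant_split():
            self.window.popleft()  # forget the oldest observation
            drift = True
        return drift


# Synthetic stream: the monitored value (e.g. an error indicator) jumps from 0 to 1.
detector = SimpleAdwin()
stream = [0.0] * 100 + [1.0] * 20
drift_points = [i for i, x in enumerate(stream) if detector.update(x)]
print(drift_points[:1], len(detector.window))
```

In a stream classifier, the value fed to the detector is typically the per-instance prediction error, so a shrinking window signals that the model's recent performance no longer matches its past performance.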

Adaptive Random Forest

Random Forest is a popular learning algorithm for batch (non-stream) classification and regression
tasks. It builds multiple trees and reduces overfitting by decorrelating them through bootstrap
aggregation [10] and random feature selection at node splits. To create the bootstrap sample for
each tree, the original Random Forest makes multiple passes over the input data. It also considers
only a subset of the original features at each internal node of a tree.

Performing multiple passes over the input data is infeasible in data stream learning.
Thus, Random Forests need to be adapted to streaming data through:

• A suitable process for online bootstrap aggregation
• Restriction of each leaf split decision to a feature subset
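
The two requirements above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: online bagging is commonly approximated by training each tree on an incoming instance k times, with k drawn from a Poisson distribution, and the value lambda = 6 is taken from the Adaptive Random Forest literature, not from this text. The helper names, the ensemble size, and the subset size passed to `random_feature_subset` are hypothetical.

```python
import math
import random


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw from Poisson(lam) with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return k
        k += 1


def random_feature_subset(n_features: int, m: int, rng: random.Random) -> list:
    """Pick m distinct feature indices to consider at one leaf split."""
    return sorted(rng.sample(range(n_features), m))


rng = random.Random(42)
LAMBDA = 6.0   # assumption: value reported for Adaptive Random Forest, not stated here
N_TREES = 5    # hypothetical ensemble size

# For one incoming instance, each tree trains on it `weight` times (0 means skip).
weights = [sample_poisson(LAMBDA, rng) for _ in range(N_TREES)]

# 80 is the feature count given for the dataset; m=9 is an arbitrary illustrative choice.
subset = random_feature_subset(n_features=80, m=9, rng=rng)
print(weights, subset)
```

Drawing the weight per instance and per tree replaces the multiple passes of batch bootstrapping with a single pass, which is what makes the scheme usable on a stream.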

To achieve the second requirement, the algorithm for the base tree is modified [11]: the set of
features considered for further splits is limited to a random subset of size m.
Here, m