A cross-comparison of feature selection algorithms on multiple cyber security data-sets

Alexander Powell, Darren Bates, Chad Van Wyk, and Adrian Darren de Abreu
Stellenbosch University, Stellenbosch, South Africa

Abstract. In network intrusion detection, it is essential to detect an attack in real-time and to initiate preventive measures accordingly. This paper evaluates whether SciKit Learn feature selection algorithms improve or worsen the accuracy and processing time of machine learning algorithms when used for network intrusion detection and classification. We develop recommendations of machine learning and feature selection algorithms that can be used to obtain a desirable level of accuracy whilst significantly reducing the total processing time of the algorithm.

Keywords: Cyber Security · Feature Selection · Machine Learning · Network Intrusion Detection · SciKit Learn · SK Learn

1 Introduction

During the last decade, the number of devices handling data has increased exponentially, and as a result cyber security has become a major concern and priority for many companies globally. Othman [12] states that recent advancements in technology have increased the volume, variety and speed of network-generated data and have made the analysis and processing of intrusion detection data a challenge for traditional algorithms, namely K-Means, Naïve Bayesian, K-Nearest Neighbors and Support Vector Machine. This increase in data volume has a linear relationship with the number of well-executed cyber-attacks, which have increased in both their occurrence and their capability of incapacitating individuals, businesses and organizations whilst remaining undetected [12]. Intrusion detection comprises identifying malicious activities that may hinder or disrupt the functionality of a system.
A challenging aspect of intrusion detection is distinguishing a clear boundary between normal and malicious network activity [5]. This paper provides a comprehensive investigation of various feature selection algorithms to evaluate their performance and accuracy when applied to more than one security data-set. Additionally, this research paper presents an evaluation of the accuracy and the various performance metrics of the machine learning algorithms used. Feature selection algorithms are then paired with machine learning algorithms to determine which feature selection algorithms produce the largest differences in accuracy and in computational processing times. A cross-comparison of the most compatible machine learning and feature selection algorithms will be presented. The resulting matrix provides users with a heuristic to select the most suitable combination of machine learning and feature selection algorithms for each data-set.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

The objective of this paper is to determine which feature selection algorithm is the most robust when tested on more than one security data-set for intrusion detection. Another main objective is to determine which of the selected machine learning algorithms produce the highest accuracy scores amongst the three data-sets with no feature selection. This will be achieved by baselining the following machine learning algorithms: Logistic Regression, K-Nearest Neighbors, Decision Tree, Random Forest, and K-Means. Only commonly used sklearn machine learning and feature selection algorithms were considered. The feature selection algorithms will be used in conjunction with the baselined machine learning algorithms to determine which feature selection algorithm produced the most consistent accuracy ratings whilst reducing processing times.
A final objective of our paper is to identify the two best performing combinations of machine learning and feature selection algorithms per data-set. Additionally, this paper will address the complexities of working with network intrusion data and, moreover, the difficulties of applying machine learning and feature selection to it. The rest of this paper is structured as follows: first, related work and cyber security complexities are presented. Following this, the methodology used in this paper is presented. Machine learning algorithms will then be analysed and compared with regard to accuracy and processing time in order to identify the best performing baselined machine learning algorithms. Following this, each combination of machine learning and feature selection algorithms will be examined in terms of its accuracy and processing time on each data-set. A cross-comparison is then presented, which determines the best performing combination of machine learning and feature selection algorithms for each data-set. After this, findings, conclusions and future work are discussed.

2 Related Work

Çavuşoğlu [4] proposed an intrusion detection system (IDS) that pairs multiple machine learning algorithms and feature selection techniques in order to achieve high detection and performance rates. The NSL-KDD data-set was used. Çavuşoğlu [4] used feature selection in an attempt to improve the performance and detection rates of the chosen algorithms by establishing new relationships between specific attributes and condensing the overall number of attributes present in the data-set. The machine learning algorithms exhibited high accuracy and low false-positive rates when identifying any attack type. The paper concludes that the proposed system identified attacks at higher rates than existing detection systems.
Gunduz et al [9] attempted to determine the most important feature sets within the KDD Cup'99 data-set by using feature selection techniques to increase accuracy and reduce computational time through decreasing the number of features analysed. The accuracy scores of each classification algorithm were compared to determine which feature selection set produced the most accurate results. Several authors have discussed feature selection algorithms and their application in cyber security. Relevant works include those of Al-Jarrah et al [2], Anusha and Sathiyamoorthy [3] and Xue et al [17]. In summary, previous research has applied machine learning intrusion detection to a single data-set, whereas our paper conducts a cross-comparison among three cyber security data-sets. Furthermore, the application of SciKit Learn's machine learning and feature selection algorithms has received little attention in existing literature and, to our knowledge, a comprehensive evaluation of this nature does not exist.

3 Complexities of working with cyber security data-sets

The cyber security data-sets utilized in this paper are representative of network flow data taken directly from a network at the Canadian Institute for Cybersecurity. There is thus a large amount of traffic in this data. For example, the CICIDS-2017 data-set is 3 GB, resulting in difficulty when analysing and experimenting on the data-set. The relatively large size of the data-set resulted in long processing times and difficulty in finding hardware with sufficient computing power to support the algorithms being run. Short processing times are important for an intrusion detection system, since any delay can compromise the effectiveness and speed at which a system can detect intrusions [1].
The issue of over-fitting is said to have occurred when a statistical model has successfully and accurately fit the data but has failed to describe the pattern or trends that underlie it [16]. Often due to a large number of data-set features relative to "events" (in this case benign instances), over-fitting leads to an increased likelihood of inaccurate predictions [6]. The feature counts of the network data-sets utilized in this paper were 43, 80, and 79 for NSL-KDD, ISCX-URL-2016 and CICIDS-2017, respectively. Additionally, excluding the ISCX-URL-2016 data-set, only a small portion of each data-set has been classified as malicious, thus prompting direct measures to prevent over-fitting. The preventative measures used within this paper, both widely accepted methods of combating over-fitting, are feature selection and the partitioning of each data-set through the use of a train test split [16].

Network intrusion data-sets inherently have many features that are neither intuitive nor have an obvious meaning. A lack of information on specific data-set features means that one cannot be certain how necessary these features are for the machine learning algorithms. High feature counts impact accuracy, increase processing and loading times, and increase the difficulty of analysing the data. The speed of analysis is very important, as delays could result in malicious attacks infiltrating a system [1]. The NSL-KDD data-set comprises 43 features, the ISCX-URL-2016 data-set 80, and the CICIDS-2017 data-set 79. Once the feature selection algorithms were run, the number of features was dramatically reduced. The more features a data-set possesses, the higher the computational power needed to process it. The problem with many features is that some may be irrelevant or redundant, and may not contribute to the final result or may negatively disturb end results [1].
Previous studies of the CICIDS-2017 data-set show a total of 3,119,345 observations, 83 features and 288,602 missing values [13]. The data-set used in this research paper contains 2,271,320 observations, 79 features and 2,867 missing values, and this discrepancy may be the result of data-set tampering. A lack of information with regard to what has been tampered with, removed, or added to the data-set may skew the results retrieved and potentially remove valuable insights. As discussed previously, the NSL-KDD data-set is a revised version of the KDD'99 data-set. The KDD'99 data-set was therefore modified by researchers in an attempt to remove several issues present in the data-set, such as duplicate records. Although this may have led to an improvement in the NSL-KDD data-set, tampering with network intrusion data-sets may skew the data unintentionally. Lastly, this is an indication that data-sets may be curated or tampered with by researchers, and thus cannot be true representations, but rather curated iterations of proposed network activity.

Defined as the over-representation of one or more classes in a data-set, class imbalance with regard to machine learning classification is a major challenge that is actively being studied [8]. Inherently, certain services are more often consumed than others in network environments, and network data-sets are thus in most cases non-uniform and display relatively large class imbalances. Subsequently, network data-sets used for classifying malicious activity often contain only a small proportion of malicious data. In our research paper, this trend holds for all but the ISCX-URL-2016 data-set. This, coupled with the fact that standard machine learning algorithms were developed under the assumption that training data would have an equal distribution of classes, is likely to cause classifier bias towards the majority class, thus skewing results [8].
The percentage of maliciously classified network data in the three data-sets used in this paper is 78.8%, 19.7%, and 19.05% for ISCX-URL-2016, CICIDS-2017, and NSL-KDD, respectively. To prevent classifier bias and the skewing of results, great care was taken to ensure that the training data-sets included instances representative both of benign traffic and of each attack type present in the original data-set. Lastly, the network data-sets utilized in this paper are the result of a simulation and are not an accurate representation of a real-world network environment.

4 Methodology

All data-sets used are publicly accessible and have been collected from the Canadian Institute for Cybersecurity [15]. This paper utilises three network intrusion data-sets: ISCX-URL-2016, NSL-KDD, and CICIDS-2017. The CICIDS-2017 data-set contains 2,830,743 observations with 79 features. Of these, 2,271,320 instances were flagged as benign and 556,556 as malicious activity. A total of 14 different attack types are represented within the data-set, and 19.7% of the traffic has been classified as malicious and the remainder benign. The ISCX-URL-2016 data-set, consisting of five types of URLs, namely benign, spam, phishing, malware, and defacement, was created by Mamun et al [10] for the purpose of detection and categorization of malicious URLs. The data-set contains 36,707 network observations and 80 features, with 19.70% of observations being malicious. The NSL-KDD data-set contains malicious network observations belonging to one of four distinct attack groups: denial of service, user-to-root, remote-to-local or probing attack [14]. The dimensions of the data-set are 4,898,431 by 43, with 19.05% of the data being malicious. The SciKit Learn package was selected as it is a reliable package for data pre-processing, machine learning and feature selection, providing various functions and efficient processing times for such tasks [11].
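As a sketch of the pre-processing used in this paper (a binary benign/malicious label and an 85%/15% train test split), the following illustrates the SciKit Learn workflow on a tiny hypothetical frame; the column names and label values are assumptions, not the actual data-set schema, and the split is stratified, consistent with the care taken to keep attack types represented in the training data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical flow records; real column names differ per data-set.
df = pd.DataFrame({
    "flow_duration": [1.2, 0.4, 3.1, 0.9, 2.2, 0.1, 1.7, 0.8],
    "packet_count":  [10, 2, 40, 7, 25, 1, 18, 5],
    "label": ["BENIGN", "DoS", "BENIGN", "DoS",
              "BENIGN", "DoS", "BENIGN", "DoS"],
})

# Binary column: benign (0) or malicious (1).
df["malicious"] = (df["label"] != "BENIGN").astype(int)

X = df[["flow_duration", "packet_count"]]
y = df["malicious"]

# 85% train / 15% test; stratify keeps both classes in the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)
```

On the real data-sets the same three steps apply unchanged: derive the binary column from the attack labels, separate features from the target, and split once before any training.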
In order to ensure standardization, one computer was used to run both the machine learning and feature selection algorithms and to compute performance statistics for every data-set. The device specifications are as follows: Dell Inspiron 7577, 16 GB RAM, Intel Core i7-7700HQ and 64-bit Windows 10. During pre-processing, a binary column was created to identify network traffic as either benign (0) or malicious (1), and thereafter the pre-processed data-sets were split using SK Learn's train test split function. The NSL-KDD data-set is pre-packaged into train and test data-sets. The resulting ratio, an 85% train and 15% test split, was applied to the other two data-sets to ensure standardisation. The selected machine learning algorithms were trained and tested on all three data-sets, with no feature selection or hyper-parameter tuning, and ranked according to their overall accuracy and computational time (training and testing duration). The top four performing baselined algorithms were then used together with the following SciKit Learn feature selection algorithms: Extra-Tree Classifier/SelectFromModel, SelectKBest, VarianceThreshold, and SelectPercentile. The SelectFromModel feature selection algorithm was used to remove unimportant features that were weighted by the Extra-Tree Classifier. Finally, a cross-comparison was conducted to compare how each machine learning and feature selection pair performed against its original baselined statistics, in order to determine which combinations of algorithms improved or retained the highest accuracy rates whilst decreasing overall computational time. Lastly, a total of six recommendations (two per data-set) are presented.

5 Analysis of baselined machine learning algorithms

Analysis was initiated by bench-marking the machine learning algorithms.
This was achieved by training and testing each machine learning algorithm on each of the three data-sets with no hyper-parameter tuning or feature selection, in order to determine which algorithms natively produce the highest accuracy rates in the lowest computational times. Only the top four machine learning algorithms were chosen, based on their accuracy and processing times, and used in the subsequent feature selection analysis. The chosen performance metrics include accuracy, precision, recall, F1-score and total processing time (training and testing duration).

Upon examination of the ISCX-URL-2016 data-set, it is apparent that the four best performing algorithms with regard to accuracy are Logistic Regression, K-Nearest Neighbors, Decision Tree, and Random Forest. Baselined, these algorithms produced accuracy ratings of 92%, 99%, 99%, and 99%, respectively. Upon further examination, Logistic Regression displayed the lowest computational time of 0.68 seconds, differing only by 0.10 seconds from the best performing baselined algorithm, Decision Tree. Furthermore, Random Forest was the third best performing algorithm, with an accuracy of 99% and a processing time of 4.67 seconds. The fourth top performing algorithm was K-Nearest Neighbors, which produced an accuracy of 99% in a total processing time of 4.96 seconds.

In analysing how each machine learning algorithm performed on the NSL-KDD data-set, it is apparent that the Logistic Regression and Decision Tree algorithms produced the highest accuracy rating (76%) with execution times of 1.02 and 1.40 seconds, respectively. Random Forest is the third best performing algorithm, with an accuracy of 72% and a processing time of 1.23 seconds. Lastly, K-Nearest Neighbors is the fourth best performing baselined machine learning algorithm, with an accuracy of 69% but the second longest processing time of 121.33 seconds. The remaining algorithms (supervised and unsupervised K-Means) exhibit a significant decrease in accuracy and an increase in processing time, do not rank in the top four best performing algorithms, and are thus not used in further analysis. Therefore, the four best performing baselined algorithms on the NSL-KDD data-set are Logistic Regression, Decision Tree, Random Forest, and K-Nearest Neighbors, with accuracy ratings above 69% and run-times as low as 1.02 seconds.

In terms of the CICIDS-2017 data-set, Logistic Regression achieved an overall accuracy of 93% with a total run-time of 308.71 seconds. While this is not the top performing algorithm, it produced its results in one of the shorter time frames. Although producing an accuracy of 99%, K-Nearest Neighbors displayed the longest processing time of 13,970.01 seconds. Decision Tree and Random Forest both achieved an accuracy of 100% with short processing times of 230.21 and 137.30 seconds, respectively. Compared with K-Nearest Neighbors, these algorithms achieved a similar accuracy in a far shorter processing time. From these results it can be determined that Random Forest was the best performing algorithm, since it was the fastest and most accurate, followed closely by Decision Tree. The third best performing algorithm is Logistic Regression: although 6% less accurate than K-Nearest Neighbors, it was far less computationally intensive, producing results 13,661 seconds faster. The last algorithm is K-Nearest Neighbors, which generated an accuracy of 99% but with the longest processing time of 13,970.01 seconds.

It is evident that Logistic Regression, K-Nearest Neighbors, Decision Tree and Random Forest are the machine learning algorithms that yielded the highest accuracy ratings within the lowest computational times across all three data-sets, and they will therefore be further analysed in terms of their changes in performance when used together with feature selection.
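The baselining procedure above can be sketched as follows, using a small synthetic imbalanced data-set in place of the network data: each classifier is run with SciKit Learn defaults (consistent with the absence of hyper-parameter tuning), and accuracy, precision, recall, F1-score and total processing time (training and testing duration) are recorded.

```python
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Synthetic stand-in for a network data-set, 80% benign / 20% malicious.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15,
                                          stratify=y, random_state=0)

baseline = {}
for name, clf in [("LogisticRegression", LogisticRegression(max_iter=1000)),
                  ("KNeighbors", KNeighborsClassifier()),
                  ("DecisionTree", DecisionTreeClassifier(random_state=0)),
                  ("RandomForest", RandomForestClassifier(random_state=0))]:
    start = time.perf_counter()          # total time = training + testing
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    elapsed = time.perf_counter() - start
    baseline[name] = {"accuracy": accuracy_score(y_te, pred),
                      "precision": precision_score(y_te, pred),
                      "recall": recall_score(y_te, pred),
                      "f1": f1_score(y_te, pred),
                      "time_s": elapsed}
```

Ranking the entries of `baseline` by accuracy and `time_s` reproduces the selection step used here to pick the top four algorithms.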
6 Analysis of feature selection algorithms

Each of the four best performing machine learning algorithms was paired with the following feature selection algorithms: Extra-Tree Classifier and SelectFromModel, SelectKBest, SelectPercentile, and VarianceThreshold. In this section, the machine learning algorithms are compared with one another in order to determine which machine learning algorithm is the most suitable for each of the four feature selection algorithms on each data-set. Thereafter, a total of 12 algorithm combinations are selected for further analysis.

The feature selection algorithms Extra-Tree Classifier, SelectKBest, SelectPercentile and VarianceThreshold reduced the number of features in the ISCX-URL-2016 data-set from 80 to 28, 10, 8, and 27, respectively. When paired with the Extra-Tree Classifier, Decision Tree was the best performing machine learning algorithm, producing an accuracy of 99% and a computational time of 0.29 seconds. Similarly, the pairing of Decision Tree with the SelectKBest algorithm yielded the best results in terms of accuracy and computational time, producing an accuracy of 97% in 0.11 seconds, the lowest computational time relative to the four highest performing pairs. SelectPercentile performed excellently when paired with the Random Forest algorithm, reducing the data-set from 80 features to 8 and producing an accuracy of 96% in a computational time of 0.18 seconds. Finally, although displaying one of the highest computational times (0.41 seconds) and the second highest number of features (27), Random Forest produced a high accuracy of 99% when paired with VarianceThreshold. Each of the four chosen feature selection algorithms was also run on the NSL-KDD data-set.
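As a sketch of the first pairing component, the Extra-Tree Classifier with SelectFromModel can be applied as follows, shown here on synthetic data; the default threshold (the mean feature importance) is an assumption, since the paper does not report the setting used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in: 40 features, of which only 8 carry signal.
X, y = make_classification(n_samples=500, n_features=40, n_informative=8,
                           random_state=0)

# SelectFromModel keeps only the features whose Extra-Trees importance
# exceeds the (default) mean-importance threshold.
selector = SelectFromModel(
    ExtraTreesClassifier(n_estimators=100, random_state=0))
X_reduced = selector.fit_transform(X, y)
```

The reduced matrix `X_reduced` is then what the paired machine learning algorithm is trained on, in place of the full feature set.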
When the Extra-Tree Classifier and SelectFromModel algorithms were applied, the number of features was reduced from 43 to 32, and Logistic Regression produced the highest accuracy rate (74%) with a relatively low computational time (1.32 seconds). The SelectKBest algorithm reduced the number of features from 43 to 10. After each machine learning algorithm was paired with the SelectKBest algorithm, it is evident that the Decision Tree algorithm produced the most desirable performance rates, despite these statistics being relatively low: an accuracy rating of 47% with a total computational run-time of 0.39 seconds. The SelectPercentile algorithm subsequently reduced the number of features from 43 to 13. The algorithm that performed best when combined with the SelectPercentile feature selection algorithm was Decision Tree, with an accuracy of 74% and a run-time of 0.46 seconds. It is important to note that when combined with SelectPercentile, Logistic Regression, K-Nearest Neighbors and Random Forest all produced significantly lower accuracy rates, ranging between 43% and 47%. Lastly, VarianceThreshold reduced the number of features from 43 to 40. Here, the Decision Tree algorithm performed best when paired with the VarianceThreshold algorithm, with an accuracy of 74% and a computational run-time of 0.71 seconds. Therefore, the best combinations of machine learning and feature selection algorithms when applied to the NSL-KDD data-set are Logistic Regression with Extra-Tree Classifier/SelectFromModel, Decision Tree with SelectKBest, Decision Tree with SelectPercentile, and lastly Decision Tree with VarianceThreshold.

When the Extra-Tree Classifier was run on the CICIDS-2017 data-set, the original 79 features were reduced to 26. When the Logistic Regression algorithm was trained on this reduced data-set, it produced an accuracy of 91% in 82.87 seconds.
The Decision Tree and Random Forest algorithms produced 100% accuracy ratings with run-times of 83.03 and 112.37 seconds, respectively. Secondly, SelectKBest reduced the feature count to 10. This yielded an accuracy for Logistic Regression of 88% with a processing time of 17.83 seconds: a decrease of 3% in accuracy but a substantial decrease of 65.04 seconds in processing time. The same pattern follows for Decision Tree and Random Forest, in that accuracy decreased by 3% to 97% but the processing times decreased by 56.54 and 41.98 seconds, respectively. The SelectPercentile feature selection algorithm reduced the number of features to 8. This resulted in Logistic Regression producing an accuracy of 88% in 12.03 seconds. Compared with the SelectKBest algorithm, the number of features was reduced by a further 2 and the processing time decreased by 5.8 seconds, whilst maintaining the same accuracy. Decision Tree and Random Forest obtained an accuracy of 97% with processing times of 23.2 and 53.21 seconds, which are improvements over the SelectKBest algorithm. The VarianceThreshold feature selection algorithm decreased the total features to 23. Logistic Regression produced an accuracy of 87% and a processing time of 87.74 seconds. Decision Tree and Random Forest retained an accuracy of 100% in processing times of 78.87 and 123.98 seconds, respectively. From the analysis of the four feature selection algorithms, the best performing combination of algorithms appears to be Decision Tree and VarianceThreshold.
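The three remaining selectors used above can be sketched as follows on synthetic data; the values of k, the percentile, and the variance threshold are illustrative choices for this example, since the paper reports only the resulting feature counts, and the score function is left at the SciKit Learn default.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, SelectPercentile,
                                       VarianceThreshold)

X, y = make_classification(n_samples=500, n_features=40, n_informative=8,
                           random_state=0)

# Keep the 10 best-scoring features (default score function: f_classif).
X_kbest = SelectKBest(k=10).fit_transform(X, y)

# Keep the top 20% of features by the same univariate score.
X_pct = SelectPercentile(percentile=20).fit_transform(X, y)

# Drop low-variance (near-constant) features; needs no class labels.
X_var = VarianceThreshold(threshold=0.5).fit_transform(X)
```

Note that VarianceThreshold is unsupervised, which is why it tends to remove far fewer features than the supervised selectors, matching the modest reductions reported for it on these data-sets.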
7 Cross-comparison of baselined and feature selection algorithms

The next phase of our analysis entails the cross-comparison of the four combinations of machine learning and feature selection algorithms against the original baselined machine learning algorithm performances, to determine whether, for each algorithm, the baselined machine learning accuracy rate either improved or retained its original rating, and whether the overall processing duration decreased substantially. Additionally, the number of true negatives (correctly classified attacks) and false negatives (incorrectly classified attacks) are compared, and the two best performing combinations of machine learning and feature selection algorithms are recommended per data-set.

Paired with the Extra-Tree Classifier on the ISCX-URL-2016 data-set, the accuracy achieved by the Decision Tree algorithm remained 99%, whilst its computational time decreased by 0.49 seconds, dropping from a baselined computational time of 0.78 seconds to 0.29 seconds. Additionally, the number of true negatives identified increased from 4323 to 4334, along with a decrease in false negatives from 45 to 34. Similarly, when paired with SelectKBest, the computational time of the Decision Tree algorithm dropped from 0.78 seconds to 0.11 seconds, the shortest computational time of the four combinations. This drop in time was accompanied by a 2% decrease in accuracy, lowering the accuracy attained at baseline, 99%, to 97%. Examination of both the true negatives and false negatives reveals a similar trend, with true negatives dropping from 4323 to 4281 and false negatives increasing from 45 to 87. The baselined accuracy of the Random Forest algorithm, 99%, was also lowered when paired with the SelectPercentile feature selection algorithm, dropping to 96%.
This 3% drop in accuracy was mirrored by the number of true negatives dropping from 4346 to 4246 and the false negatives increasing from 22 to 122, but it was accompanied by a substantially lower computational time: a total of 4.49 seconds was saved when combined with the SelectPercentile algorithm, lowering the baselined computational time from 4.67 seconds to 0.18 seconds. Lastly, when paired with the VarianceThreshold algorithm, the baselined accuracy of the Random Forest algorithm was matched at 99%, with the number of true negatives decreasing from 4346 to 4337 and the false negatives increasing from 22 to 31, accompanied by a drop in computational time of 4.26 seconds, from 4.67 to 0.41 seconds.

With regard to the NSL-KDD data-set, comparing the baselined Logistic Regression performance metrics against those of the Extra-Tree Classifier and Logistic Regression combination, it is evident that accuracy decreased from 76% to 74%, whereas total processing time increased from 1.02 seconds to 1.32 seconds. Originally, the Logistic Regression algorithm correctly identified 8738 attacks and misclassified 4096 attacks as normal network traffic, but after Logistic Regression was run with the Extra-Tree Classifier and SelectFromModel, the number of correctly classified network attacks was 8780 and the number of incorrectly classified attacks was 4053. The baselined Decision Tree performance metrics showed an accuracy of 76% and a run-time of 1.40 seconds.
When the Decision Tree algorithm was paired with the SelectKBest algorithm, accuracy decreased drastically from 76% to 47%, whilst processing time decreased from 1.40 seconds to 0.39 seconds. Additionally, the baselined Decision Tree algorithm successfully classified 8307 attacks and incorrectly classified 4526 attacks as benign network traffic; after being paired with the SelectKBest feature selection algorithm, Decision Tree successfully classified 3125 attacks and misclassified 9708 attacks as benign. The Decision Tree algorithm was also the best performing algorithm when partnered with the SelectPercentile algorithm. Here, accuracy decreased from 76% with no feature selection to 74% with SelectPercentile feature selection, and run-time decreased from 1.40 seconds to 0.46 seconds, whilst the combination correctly classified 7085 attacks and misclassified 5748 attacks. Lastly, Decision Tree was the best performing algorithm when used with VarianceThreshold feature selection. The Decision Tree baselined accuracy of 76% decreased to 74%, whilst computational processing time reduced from 1.40 to 0.71 seconds. The combination of Decision Tree and VarianceThreshold detected and correctly classified 8044 network attacks and incorrectly misclassified 4789 attacks as benign network traffic. Therefore, the Decision Tree algorithm is the most robust to feature selection, as it was the top performing algorithm when paired with three separate feature selection algorithms (SelectKBest, SelectPercentile, VarianceThreshold).

The best performing algorithms on the CICIDS-2017 data-set were Decision Tree and Random Forest for each feature selection algorithm. Decision Tree produced a baselined accuracy of 100% and a run-time of 230.21 seconds. When the Decision Tree and SelectKBest algorithms were paired, accuracy decreased to 97% and processing time to 26.49 seconds.
The false negatives increased from 276 to 10508 and the true negatives decreased from 82947 to 72762. When Decision Tree was run with ExtraTreeClassifier/SelectFromModel, it produced 341 false negatives and 82929 true negatives, with a processing time of 83.03 seconds and an accuracy rating of 100%. Decision Tree and VarianceThreshold produced a high accuracy rating of 100%, with 117 incorrectly and 83153 correctly classified attacks, respectively. The processing time was relatively low at 78.87 seconds, which is a large decrease from the original computational time (230.21 seconds) and remains reasonable when compared with the other algorithms. Decision Tree and SelectPercentile generated an accuracy of 97%, a processing time of 23.22 seconds, and correctly classified 72716 attacks whilst incorrectly classifying 10554 attacks as benign. Drawing from the above analysis, it is apparent that the most suitable combination of machine learning and feature selection algorithms is Decision Tree and VarianceThreshold. Paired together, these algorithms have a low false negative rate, greatly reduced the number of features to 23 whilst retaining an accuracy of 100%, and decreased the processing time to 78.87 seconds. Secondly, the combination of Decision Tree and ExtraTree Classifier produced an accuracy of 100% in 83.03 seconds but incorrectly classified 341 attacks, which is 224 more incorrectly classified attacks than the Decision Tree and VarianceThreshold combination.

8 Findings

It is evident that when machine learning algorithms were used in combination with feature selection algorithms, there was either a significant decrease in both accuracy and processing time, or the algorithm's baselined accuracy rate was maintained whilst computational time was drastically reduced.
Given that network intrusion detection requires the identification of malicious traffic in real time with as little downtime as possible, we are interested in the cases where machine learning algorithms maintained their accuracy and decreased in total processing time when feature selection was used. This section presents the combinations of machine learning and feature selection algorithms that are best suited for network intrusion detection and classification.

In our analysis of the ISCX-URL-2016 data-set, it was identified that the algorithm combination that achieved the most desirable results was Decision Tree and ExtraTree Classifier, reducing the number of features from 80 to 28 and producing an accuracy of 99% in 0.29 seconds. Moreover, this combination successfully classified 4334 attacks and misclassified only 34 attacks. The second top performing combination was Decision Tree and SelectKBest, which reduced the data-set to 10 features and displayed a total processing time of 0.11 seconds whilst retaining a high accuracy of 97%. This combination successfully classified 4281 attacks and incorrectly classified 87 attacks as benign.

The best performing algorithm combination on the NSL-KDD data-set was Decision Tree and VarianceThreshold. With this combination, the number of features was reduced to 40 and a relatively high accuracy rate of 74% was maintained, with a total processing time of 0.71 seconds. The algorithms correctly classified 8044 attacks whilst incorrectly labelling 4789 malicious attacks as benign network traffic. The second best combination of algorithms is Decision Tree and SelectPercentile, as these algorithms reduced the number of features from 43 to 13 and successfully classified 7085 attacks whilst incorrectly classifying 5748 attacks, producing an accuracy of 74% in an execution time of 0.46 seconds.
Lastly, on the CICIDS-2017 data-set, when Decision Tree was executed with VarianceThreshold, the number of features was reduced from 79 to 23 and the combination successfully classified 83153 attacks whilst incorrectly labelling 117 attacks as normal network traffic, producing an accuracy of 100% in 78.87 seconds. Secondly, when Decision Tree was run with ExtraTreeClassifier/SelectFromModel, the feature count was reduced to 26 and the combination correctly labelled 82929 instances of malicious network traffic and incorrectly classified 341 attacks in an execution time of 83.03 seconds, whilst retaining an accuracy rating of 100%. It is therefore evident that these six combinations provide a useful foundation for pairings of machine learning and feature selection algorithms that meet the main application of feature selection on network intrusion data: the reduction of data-set features and processing time to ensure that malicious network activity is promptly identified. Despite the findings being data-set-specific, general trends can be seen across the three data-sets. The most robust machine learning algorithm is Decision Tree, as it performed best with all feature selection algorithms on each data-set and, ultimately, managed to retain high accuracy rates with short execution times. The most robust feature selection algorithm changes depending on the data-set on which it was tested. However, a pattern emerged showing that ExtraTree Classifier performed better on both the larger and smaller data-sets, whereas SelectKBest and SelectPercentile performed better on the smaller data-sets. Therefore, an important outcome of these findings is a set of algorithm combinations for anyone seeking to use the SK Learn library for machine learning and feature selection on network intrusion data.
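Our reading of the "ExtraTreeClassifier/SelectFromModel" combination is SK Learn's embedded selection: SelectFromModel ranks features by the importances of a fitted ExtraTreesClassifier and keeps those above a threshold (the mean importance by default). A hedged sketch on synthetic data, with illustrative parameter values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (50 features, few of them informative).
X, y = make_classification(n_samples=1000, n_features=50, n_informative=8,
                           random_state=0)

# Keep features whose ExtraTrees importance exceeds the mean importance.
selector = SelectFromModel(ExtraTreesClassifier(n_estimators=50,
                                                random_state=0))
X_reduced = selector.fit_transform(X, y)
print("reduced feature count:", X_reduced.shape[1])

# The downstream classifier then trains on the reduced feature matrix.
clf = DecisionTreeClassifier(random_state=0).fit(X_reduced, y)
```

VarianceThreshold is used the same way via `fit_transform`, but is unsupervised: it drops features whose variance falls below a chosen threshold rather than consulting a fitted model.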
9 Conclusion and future research

In this paper, feature selection and machine learning algorithms were applied to the ISCX-URL-2016, NSL-KDD, and CICIDS-2017 data-sets to perform network intrusion detection. Each machine learning algorithm was baselined. Next, machine learning algorithms were paired with feature selection algorithms. Thereafter, a cross-comparison of the four best performing algorithms was conducted in terms of accuracy and processing time. The recommendations made provide network intrusion detection practitioners with varying combinations of feature selection and machine learning algorithms. Our research forms the basis for future work on the development of a recommendation system for SK Learn machine learning and feature selection algorithms in the application of intrusion detection. The top performing machine learning and feature selection algorithm combinations across the data-sets are as follows: Decision Tree and ExtraTree Classifier, Decision Tree and SelectKBest, Decision Tree and VarianceThreshold, and Decision Tree and SelectPercentile. Each of these combinations retains its baselined accuracy whilst significantly reducing the total processing time. The Decision Tree machine learning algorithm is the most robust to feature selection, as it retains its original accuracy with a significant reduction in processing time after feature selection. Additionally, there are several complexities that one faces when working with cyber security data-sets, including large data-sets and long processing times, a significant number of features, data tampering, and class imbalances. A major constraint when working with cyber security data is the complexity and non-intuitive nature of the data-set features. This complexity led our research to focus on automated feature selection, as opposed to manually removing highly correlated features, since the underlying meaning and structure of the data is non-obvious and complex.
This paper's research could be extended through the implementation of both cross-validation and hyper-parameter tuning when obtaining the initial baseline machine learning statistics. A further extension would be the inclusion of more advanced machine learning algorithms (such as neural networks) within SK Learn. Lastly, future work would include a shift away from total processing time to unit processing time (time per sample).

10 Acknowledgements

Our paper acknowledges the Canadian Institute for Cybersecurity for their publicly available network intrusion data-sets. We would also like to acknowledge Professor Louise Leenen from the University of the Western Cape for consultation on our methodology and research objectives. We would like to acknowledge Professor Bruce Watson and Professor Arina Britz as our paper supervisors and for consultations throughout the development of this paper. Lastly, we acknowledge the valuable feedback from the two anonymous FAIR2019 peer-reviewers.

References

1. Ait Tchackoucht, T. and Ezziyyani, M.: Building A Fast Intrusion Detection System For High-Speed-Networks: Probe and DoS Attacks Detection. Procedia Computer Science. 127, 521–530 (2018)
2. Al-Jarrah, O. and Arafat, A.: Network Intrusion Detection System using attack behavior classification. In: 2014 5th International Conference on Information and Communication Systems. pp. 1–6. IEEE (2014)
3. Anusha, K. and Sathiyamoorthy, E.: Comparative study for feature selection algorithms in intrusion detection system. Automatic Control and Computer Sciences. 50(1), 1–9 (2016)
4. Çavuşoğlu, Ü.: A new hybrid approach for intrusion detection using machine learning methods. Applied Intelligence. 49(7), 2735–2761 (2019)
5. Chattopadhyay, M., Sen, R. and Gupta, S.: A Comprehensive Review and Meta-Analysis on Applications of Machine Learning Techniques in Intrusion Detection. Australasian Journal of Information Systems. 22 (2018)
6.
Foster, K.R., Koprowski, R. and Skufca, J.D.: Machine learning, medical diagnosis, and biomedical engineering research-commentary. Biomedical Engineering Online. 13(1), 94 (2014)
7. Gharib, A., Sharafaldin, I., Lashkari, A.H. and Ghorbani, A.A.: An evaluation framework for intrusion detection dataset. In: 2016 International Conference on Information Science and Security. pp. 1–6. IEEE (2016)
8. Gómez, S.E., Hernández-Callejo, L., Martínez, B.C. and Sánchez-Esguevillas, A.J.: Exploratory study on class imbalance and solutions for network traffic classification. Neurocomputing. 343, 100–119 (2019)
9. Gündüz, S.Y. and Çeter, M.N.: Feature Selection and Comparison of Classification Algorithms for Intrusion Detection. Anadolu University Journal of Science and Technology: Applied Sciences and Engineering. 19(1), 206–218 (2018)
10. Mamun, M.S.I., Rathore, M.A., Lashkari, A.H., Stakhanova, N. and Ghorbani, A.A.: Detecting malicious urls using lexical analysis. In: International Conference on Network and System Security. pp. 467–482 (2016)
11. Mishra, P., Varadharajan, V., Tupakula, U. and Pilli, E.S.: A Detailed Investigation and Analysis of Using Machine Learning Techniques for Intrusion Detection. IEEE Communications Surveys & Tutorials. 21(1), 686–728 (2019)
12. Othman, S.M., Ba-Alwi, F.M., Alsohybe, N.T. and Al-Hashida, A.Y.: Intrusion detection model using machine learning algorithm on Big Data environment. Journal of Big Data. 5(34) (2018)
13. Panigrahi, R. and Borah, S.: A detailed analysis of CICIDS2017 dataset for designing Intrusion Detection Systems. 7, 479–482 (2018)
14. Tavallaee, M., Bagheri, E., Lu, W. and Ghorbani, A.A.: A detailed analysis of the KDD CUP 99 dataset. In: 2nd IEEE Symposium on Computational Intelligence for Security and Defence Applications (2009)
15. Canadian Institute for Cybersecurity, https://www.unb.ca/cic/datasets/nsl.html. Last accessed August 2019
16. Wagner, N.
and Rondinelli, J.M.: Theory-guided machine learning in materials science. Frontiers in Materials. 3, 28 (2016)
17. Xue, Y., Jia, W., Zhao, X. and Pang, W.: An evolutionary computation based feature selection method for intrusion detection. Security and Communication Networks. (2018)