=Paper=
{{Paper
|id=Vol-2600/paper1
|storemode=property
|title=Domain Knowledge Aided Explainable Artificial Intelligence for Intrusion Detection and Response
|pdfUrl=https://ceur-ws.org/Vol-2600/paper1.pdf
|volume=Vol-2600
|authors=Sheikh Rabiul Islam,William Eberle,Sheikh K. Ghafoor,Ambareen Siraj,Mike Rogers
|dblpUrl=https://dblp.org/rec/conf/aaaiss/IslamEGSR20
}}
==Domain Knowledge Aided Explainable Artificial Intelligence for Intrusion Detection and Response==
Sheikh Rabiul Islam, William Eberle, Sheikh K. Ghafoor, Ambareen Siraj, Mike Rogers

Department of Computer Science, Tennessee Technological University, Cookeville, USA

sislam42@students.tntech.edu, {weberle, sghafoor, asiraj, mrogers}@tntech.edu

Copyright (c) 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto, California, USA, March 23-25, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Abstract===

Artificial Intelligence (AI) has become an integral part of modern-day security solutions for its ability to learn very complex functions and to handle "Big Data". However, the lack of explainability and interpretability of successful AI models is a key stumbling block when trust in a model's prediction is critical. This leads to human intervention, which in turn results in a delayed response or decision. While there have been major advancements in the speed and performance of AI-based intrusion detection systems, the response is still at human speed when it comes to explaining and interpreting a specific prediction or decision. In this work, we infuse popular domain knowledge (i.e., the CIA principles) into our model for better explainability and validate the approach on a network intrusion detection test case. Our experimental results suggest that the infusion of domain knowledge provides better explainability as well as a faster decision or response. In addition, the infused domain knowledge generalizes the model to work well with unknown attacks and opens the path to adapting to a large stream of network traffic from numerous IoT devices.

===1 Introduction===

Most of the recent advancements in Artificial Intelligence (AI), and more specifically Machine Learning (ML), have come from complex non-linear models such as Deep Neural Networks, Ensemble Methods, and Support Vector Machines. These models are also known as "black box" models, as they are hard to interpret and explain, which arises from their inherent non-linear capabilities, multiple parameters, and very complex transformations. In addition, some algorithms require a very large number of samples (i.e., large training sets) to work efficiently, where it is very difficult to figure out what the model learned from the dataset and which portion of the dataset has more influence on the output (Kabul 2018).

Due to these challenges, black box models lack explainability and interpretability, ultimately resulting in a lack of trust in the model and its predictions, as well as possibly leading to a delayed human response or decision. This limitation also raises ethical issues in a few sensitive domains like finance (e.g., credit approval), health care (e.g., disease diagnosis), and security (e.g., identifying a target). For instance, AI and ML are becoming an integral part of security solutions and defense. To mitigate the unethical use of AI, as well as to promote the responsible use of AI systems, various governments have started taking precautionary initiatives. Recently, the European Union implemented the "right of explanation" rule, under which a user can ask for an explanation of an algorithmic decision (Goodman and Flaxman 2017). More recently, the US government introduced a new bill, the "Algorithmic Accountability Act", which would require companies to assess their machine learning systems for bias and discrimination and to take corrective measures (Wyden 2019). The U.S. Department of Defense (DoD) has identified explainability as a key stumbling block in the adoption of AI-based solutions in many of its projects, and DARPA has invested $2 billion in an Explainable Artificial Intelligence (XAI) program (Turek 2019; Rankin 2019).

Network intrusions are a common cyber-crime activity, estimated to cost around $6 trillion annually in damages by 2021 (Doyle 2019). In order to combat these attacks, an Intrusion Detection System (IDS) is a security system that monitors networks and computer systems (Hodo et al. 2016). Research in AI-based IDS has shown promising results (Hodo et al. 2016; Shone et al. 2018; Kim et al. 2016; Javaid et al. 2016; Li, Sun, and Wang 2012), and AI has become an integral part of security solutions due to its capability of learning complex, nonlinear functions and analyzing large data streams from numerous connected devices. A recent survey by (Dong and Wang 2016) suggests that deep learning-based methods are accurate and robust to a wide range of attacks and sample sizes. However, there are concerns regarding the sustainability of current approaches (e.g., intrusion detection/prevention systems) when faced with the demands of modern networks and the increasing level of human interaction (Shone et al. 2018). In the age of IoT and Big Data, an increasing number of connected devices and the associated streams of network traffic have exacerbated the problem. In addition, the delay in detection/response increases the chance of zero-day exploitation, whereby a previously unknown vulnerability is discovered by an attacker who immediately initiates an attack. Improved explainability of an AI model could quicken interpretation, making it more feasible to accelerate the response.

Explainability is the extent to which the internal working mechanism of a machine or AI system can be explained in human terms, while interpretability is the extent to which a cause and effect (i.e., an understanding of what is happening) can be observed within a system. In other words, interpretability is a form of abstract knowledge about what is happening, and explainability is the detailed, step-by-step knowledge of what is happening (Montavon, Samek, and Müller 2018; Turek 2019). While some literature treats interpretability and explainability as the same, they are actually two different traits of a model. Just because a model can be interpreted does not mean that it can be explained, and explainability needs to go beyond the algorithm (Lipton 2016).

Explainability and interpretability of a model can be pursued before, during, and after modeling. From the literature, we find that interpretability in pre-modeling (i.e., before modeling) is under-focused. (Miller 2018) argues that explainability should incorporate knowledge from other domains such as philosophy, psychology, and cognitive science, so that an explanation is not based solely on the researcher's intuition of what constitutes a good explanation. We also find that the use of domain knowledge for explainability is under-focused. In this work, we introduce a novel approach for an AI-based explainable intrusion detection and response system, and demonstrate its effectiveness by infusing a popular network security principle (the CIA principle) into the model for better explainability and interpretability of the decision.

We use a recent and comprehensive IDS dataset (CICIDS2017) which covers necessary criteria and contains common updated attacks such as DDoS, Brute Force, XSS, SQL Injection, Infiltration, Portscan, and Botnet. We infuse the CIA principle into the model, which yields a concise and interpretable set of important features. Computer security rests on the CIA principles: C stands for confidentiality (concealment of information or resources), I stands for integrity (trustworthiness of data or resources), and A stands for availability (the ability to use information or a resource as desired) (Matt et al. 2006). For instance, a compromise in confidentiality could happen through eavesdropping on unencrypted data, a compromise in integrity could happen through an unauthorized attempt to change data, and a compromise in availability could happen through a deliberate denial of access to data or a service.

We also convert the domain knowledge infused features into three features, C, I, and A, by quantitatively computing the compromise associated with each of them for each record. Output expressed in terms of this generalized and newly constructed set of features provides better explainability with negligible compromises in performance. We also found that the generalization provides more resiliency against unknown attacks.

In summary, our contributions in this work are as follows: (1) we demonstrate a method for the collection and use of domain knowledge in an intrusion detection/response system; (2) we introduce a way to bring popular security principles (e.g., the CIA principles) to aid in interpretability and explainability; (3) our experimental results show that infusing domain knowledge into "black box" models can make them better explainable with little or no compromise in performance; and (4) domain knowledge infusion increases generalizability, which leads to better resiliency against unknown attacks.

We start with a background of related work (Section 2), followed by a description of our proposed approach, an intuitive description of the standard supervised algorithms, and an overview of the dataset used in this work (Section 3). In Section 4, we describe our experiments, followed by Section 5, which discusses the results from the experiments. We conclude with limitations and future work in Section 6.

===2 Background===

Research in Explainable Artificial Intelligence (XAI) is a re-emerging field, following the earlier work of (Chandrasekaran, Tanner, and Josephson 1989), (Swartout and Moore 1993), and (Swartout 1985). Previous work focused primarily on explaining the decision process of knowledge-based systems and expert systems. The classical learning paradigm Explanation-Based Learning (EBL), introduced in the early '80s, can also be regarded as a precursor of explainability; EBL involves learning a problem-solving technique by observing and analyzing solutions to a specific problem (DeJong 1981; Mitchell, Keller, and Kedar-Cabelli 1986). The main reason for the renewed interest in XAI research has been recent advancements in AI and ML and their application to a wide range of areas, as well as concerns over unethical use and undesired biases in the models. In addition, recent concerns and laws by different governments are necessitating more research in XAI.

(Yang and Shafto 2017) use Bayesian Teaching, where a smaller subset of examples is used to train the model instead of the whole dataset. The subset of examples is chosen by domain experts, as those examples are most relevant to the problem of interest. However, choosing the right subset of examples in the real world is challenging.

(Lei, Barzilay, and Jaakkola 2016) propose an approach for sentiment analysis where a subset of text from the whole text is selected as the rationale for the prediction. The selected subset of text is concise and sufficient enough to act as a substitute for the original text, while remaining capable of making the correct prediction. Although their approach outperforms available attention-based models (from deep learning) with variable-length input (e.g., a model for document summarization), it is limited to text analysis only.

When an explanation is based on feature importance, it is necessary to keep in mind that features that are globally important may not be important in the local context, and vice versa (Ribeiro, Singh, and Guestrin 2016). (Ribeiro, Singh, and Guestrin 2016) propose a novel explanation technique capable of explaining the prediction of any classifier (i.e., in a model-agnostic way) with a locally interpretable model (i.e., in the vicinity of the instance being predicted) around the prediction. Their concern is with two issues: (1) whether the user should trust the prediction of the model and act on it, and (2) whether the user should trust a model to behave reasonably well when deployed. In addition, they involve human judgment in their experiment (i.e., a human in the loop) to decide whether to trust the model or not.

(Kim et al. 2017) propose a concept attribution-based approach (i.e., sensitivity to a concept) that provides an interpretation of a neural network's internal state in terms of human-friendly concepts. Their approach, Testing with CAVs (TCAV), quantifies the prediction's sensitivity to a high-dimensional concept. For example, given a user-defined set of examples that defines the concept "striped", TCAV can quantify the influence of "striped" on the prediction of "zebra" as a single number. To learn the high-dimensional concepts, they use Concept Activation Vectors (CAVs); CAVs are learned by training a linear classifier to distinguish between the activations produced by a concept's examples and other examples in any layer.

Most of these approaches try to find out how the prediction deviates from the base/average scenario. LIME (Ribeiro, Singh, and Guestrin 2016) tries to generate an explanation by locally (i.e., using local behavior) approximating the model with an interpretable model (e.g., decision trees, linear models). However, it is limited by its use of only a linear model to approximate the local behavior. (Lundberg and Lee 2017) propose SHAP, which unifies seven previous approaches, LIME (Ribeiro, Singh, and Guestrin 2016), DeepLIFT (Shrikumar, Greenside, and Kundaje 2017), Tree Interpreter (Ando 2019), QII (Datta, Sen, and Zick 2016), Shapley sampling values (Štrumbelj and Kononenko 2014), Shapley regression values (Lipovetsky and Conklin 2001), and Layer-wise Relevance Propagation (Bach et al. 2015), to explain the prediction of any machine learning model. While SHAP comes with theoretical guarantees about consistency and local accuracy from game theory, it needs to run many evaluations of the original model to estimate a single vector of feature importances (Lundberg 2019). ELI5 also uses the LIME algorithm internally for explanations; in addition, ELI5 is not truly model-agnostic, being mostly limited to tree-based and other parametric or linear models. Furthermore, Tree Interpreter is limited to tree-based approaches only (e.g., Random Forest, Decision Trees).

AI-based IDSs have continued to show promising performance (Hodo et al. 2016; Shone et al. 2018; Kim et al. 2016; Javaid et al. 2016; Li, Sun, and Wang 2012). (Shone et al. 2018) propose an approach combining both shallow (Random Forest) and deep learning (Auto-Encoder), capable of analyzing a wide range of network traffic and outperforming mainstream Deep Belief Networks (DBN). In a literature survey of traditional vs. deep learning IDS, (Dong and Wang 2016) suggest that deep learning-based methods provide better accuracy for a wide range of sample sizes and a variety of network traffic or attacks. However, in all of the previous work, there are still long training times and a reliance on a human operator (Shone et al. 2018).

Incorporating domain knowledge for explainability, though, has garnered little attention. Previously, we introduced the concept of infusing domain knowledge (Islam et al. 2019), albeit for bankruptcy prediction with a limited focus. (Miller 2018) has argued that incorporating knowledge from different domains will provide better explainability. In addition, (Kim et al. 2017) use the prediction's sensitivity to high-dimensional concepts (e.g., the concept "striped" for "zebra") to explain the prediction. Furthermore, both LIME (Ribeiro, Singh, and Guestrin 2016) and SHAP (Lundberg and Lee 2017) use a simplified input mapping, mapping the original input to a simplified set of inputs. To the best of our knowledge, none of these models incorporate domain knowledge with a focus on better explainability and interpretability. Although our proposed conceptual model comes with a negligible compromise in accuracy, it offers better explainability and interpretability, and scalability to big data problems.

===3 Methodology===

Figure 1: Proposed Technique.

====3.1 Proposed Approach====

The proposed approach consists of two components: a feature generalizer, which produces a generalized feature set with the help of domain knowledge in two different ways; and an evaluator, which produces and compares the results from the "black box" model for multiple configurations of features: domain knowledge infused features, newly constructed features from the domain knowledge infused features, selected features, and all features.

====3.2 Feature Generalizer====

The feature generalizer (Figure 1, top portion) takes the original features of the dataset (X_1, X_2, ..., X_n ∈ X, where X is the set of all features) and infuses domain knowledge to produce or re-construct a concise, better interpretable feature set (X_1', X_2', ..., X_k' ∈ X', where X' is the universal set of original/transformed/constructed features, and k is much smaller than n) in two different ways:

Feature Mapping: As stated earlier, we use the CIA principles (confidentiality, integrity, and availability) as domain knowledge. We analyze all types of attacks for associated compromises in each component of the CIA principles (see Table 1). The Heartbleed vulnerability is related to a compromise in confidentiality, as an attacker could gain access to the memory of systems protected by a vulnerable version of OpenSSL. A Web attack (e.g., SQL injection) is related to compromises in confidentiality and integrity (e.g., reading/writing data using an injected query) and availability (e.g., flooding the database server with injected complex queries like a cross join). An Infiltration attack is related to a compromise in confidentiality, as it normally exploits a software vulnerability (e.g., in Adobe Acrobat Reader) to create a backdoor and reveal information (e.g., IPs). A Port scan attack is related to a compromise in confidentiality, as the attacker sends packets with varying destination ports to learn the services and operating systems from the replies. All DoS and DDoS attacks are related to a compromise in availability, as they aim to hamper the availability of a service or data. Furthermore, SSH-Patator and FTP-Patator are brute force attacks and are usually responsible for a compromise in confidentiality. A Botnet (i.e., a robot network: a network of malware-infected computers) could provide a remote shell, file upload/download options, screenshot capture, and key logging, and thus has the potential for all of the confidentiality, integrity, and availability related compromises.

Table 1: Mapping of network attacks to the related components of the CIA principles

| Attack | Related component of CIA |
|---|---|
| DoS GoldenEye | A |
| Heartbleed | C |
| DoS Hulk | A |
| DoS Slowhttp | A |
| DoS slowloris | A |
| SSH-Patator | C |
| FTP-Patator | C |
| Web Attack | C, I, A |
| Infiltration | C |
| Bot | C, I, A |
| PortScan | C |
| DDoS | A |
Furthermore, from the feature ranking of the original dataset provider (Sharafaldin, Lashkari, and Ghorbani 2018), for each type of attack we take the top three features according to their importance (i.e., feature importance from a Random Forest Regressor) and calculate the mapping (Table 2) to the related compromises under the CIA principles. For example, the feature Average Packet Size is renamed Avg Packet Size - A, where -A indicates that it is a key feature for a compromise of availability (see Table 2). To get this mapping between a feature and its associated compromises, we first find the mapping between an attack and the related compromises (from Table 1, formulated as Equation 2). In other words, Formula 1 gives the name of the associated attack where the feature is among the top three features identifying that particular attack, and Formula 2 gives the associated compromises in C, I, or A from the attack name:

$$f(\text{feature}) \rightarrow \text{attack} \qquad (1)$$

$$f(\text{attack}) \rightarrow C, I, \text{ or } A \qquad (2)$$

Thus, with the help of domain knowledge, we keep 22 features (see Table 2) out of a total of 78 features. We will refer to these features as the domain features. The feature descriptions in Table 2 are taken from the data processing software's website (net 2019).
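The two lookups in Formulas 1 and 2 amount to small dictionary mappings. Below is a minimal sketch in Python; the mapping entries are abbreviated excerpts of Tables 1 and 2, and the helper name `rename_with_cia` is our own illustration, not code from the paper.

```python
# Equation (2): attack -> affected CIA components (excerpt of Table 1)
cia_of_attack = {
    "DDoS": "A",
    "SSH-Patator": "C",
    "Bot": "CIA",
    "PortScan": "C",
}

# Equation (1): feature -> attacks for which it ranks in the top three
# (excerpt of the ranking by Sharafaldin, Lashkari, and Ghorbani 2018)
top3_attacks_of = {
    "Average Packet Size": ["DDoS"],
    "ACK Flag Count": ["SSH-Patator"],
    "Bwd Packets/s": ["Bot", "PortScan"],
}

def rename_with_cia(feature: str) -> str:
    """Tag a domain feature with the union of CIA compromises of every
    attack it helps to identify, e.g. 'Average Packet Size - A'."""
    letters = set()
    for attack in top3_attacks_of[feature]:
        letters.update(cia_of_attack[attack])
    suffix = "".join(c for c in "CIA" if c in letters)
    return f"{feature} - {suffix}"

print(rename_with_cia("Bwd Packets/s"))  # -> 'Bwd Packets/s - CIA'
```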
Feature Construction: We also construct three new features, C, I, and A, from the domain features by quantitatively calculating the compromises associated with each domain feature. For that purpose, we calculate the correlation coefficient vector of the dataset to understand whether an increase in the value of a feature has a positive or negative impact on the target variable. We then convert each entry of the correlation coefficient (a.k.a. coeff) vector V into 1 or -1 based on whether the correlation coefficient is positive or negative. We also group the domain features and the corresponding coeff tuples into three groups (C, I, and A). Using Formulas 3, 4, and 5, we aggregate each group of domain features into the three new features C, I, and A. We scale all feature values from 0 to 1 before starting the aggregation process. During the aggregation for a particular group (e.g., C), if the correlation coefficient (e.g., $V_i$) for a feature (e.g., $C_i$) of that group is negative, then the product of the feature value and the correlation coefficient for that feature is deducted, and vice-versa if positive. In addition, when a feature is liable for more than one compromise, the feature value is split among the associated elements of the CIA principles.

$$C = \sum_{i=0}^{n} C_i V_i \qquad (3)$$

$$I = \sum_{i=0}^{n} I_i V_i \qquad (4)$$

$$A = \sum_{i=0}^{n} A_i V_i \qquad (5)$$
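The following is a minimal sketch of this construction under the assumptions stated above (min-max scaling to [0, 1], the sign of the correlation as V, and an even split of a feature's value among its CIA tags). The function name and the `domain_groups` argument are illustrative; the paper does not publish this exact code.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def construct_cia(df: pd.DataFrame, target: str, domain_groups: dict) -> pd.DataFrame:
    """domain_groups maps each domain feature to its CIA tags,
    e.g. {'Flow Duration': 'AC', 'ACK Flag Count': 'C', ...}."""
    X = df.drop(columns=[target])
    X = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)
    # V: sign (+1/-1) of each feature's correlation with the target (Formulas 3-5)
    V = np.sign(df.corr()[target].drop(target)).replace(0, 1)
    out = pd.DataFrame(0.0, index=X.index, columns=["C", "I", "A"])
    for feat, tags in domain_groups.items():
        share = X[feat] * V[feat] / len(tags)  # split among the tagged components
        for tag in tags:
            out[tag] += share
    out[target] = df[target].values
    return out
```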
Table 2: Mapping of features to the related components of the CIA principles (descriptions from (net 2019))

| Feature | Description | In top 3 features of attack | Renamed feature |
|---|---|---|---|
| ACK Flag Count | Number of packets with ACK | SSH-Patator | ACK Flag Count - C |
| Active Mean | Mean time a flow was active before becoming idle | DoS Slowhttp, Infiltration | Active Mean - AC |
| Active Min | Minimum time a flow was active before becoming idle | DoS Slowhttp | Active Min - A |
| Average Packet Size | Average size of packet | DDoS | Avg Packet Size - A |
| Bwd IAT Mean | Mean time between two packets sent in the backward direction | DoS slowloris | Bwd IAT Mean - A |
| Bwd Packet Length Std | Standard deviation of packet size in the backward direction | DoS Hulk, DoS GoldenEye, DDoS, Heartbleed | Bwd Packet Length Std - AC |
| Bwd Packets/s | Number of backward packets per second | Bot, PortScan | Bwd Packets/s - CIA |
| Fwd IAT Mean | Mean time between two packets sent in the forward direction | DoS slowloris | Fwd IAT Mean - A |
| Fwd IAT Min | Minimum time between two packets sent in the forward direction | DoS slowloris, DoS GoldenEye | Fwd IAT Min - A |
| Fwd Packet Length Mean | Mean size of packet in the forward direction | Benign, Bot | Fwd Packet Length Mean - CIA |
| Fwd Packets/s | Number of forward packets per second | FTP-Patator | Fwd Packets/s - C |
| Fwd PSH Flags | Number of times the PSH flag was set in packets travelling in the forward direction (0 for UDP) | FTP-Patator | Fwd PSH Flags - C |
| Flow Duration | Duration of the flow in microseconds | DDoS, DoS slowloris, DoS Hulk, DoS Slowhttp, Infiltration, Heartbleed | Flow Duration - AC |
| Flow IAT Mean | Mean inter-arrival time of packets | DoS GoldenEye | Flow IAT Mean - A |
| Flow IAT Min | Minimum inter-arrival time of packets | DoS GoldenEye | Flow IAT Min - A |
| Flow IAT Std | Standard deviation of packet inter-arrival time | DDoS, DoS Slowhttp, DoS Hulk | Flow IAT Std - A |
| Init Win bytes forward | Total number of bytes sent in the initial window in the forward direction | Web Attack | Init Win Bytes Fwd - CIA |
| PSH Flag Count | Number of packets with PUSH | PortScan | PSH Flag Count - C |
| Subflow Fwd Bytes | Average number of bytes in a subflow in the forward direction | Benign, SSH-Patator, Web Attack, Bot, Heartbleed, Infiltration | Subflow Fwd Bytes - CIA |
| SYN Flag Count | Number of packets with SYN | FTP-Patator | SYN Flag Count - C |
| Total Length of Fwd Packets | Total size of packets in the forward direction | Benign, SSH-Patator, Web Attack, Bot, Heartbleed, Infiltration | Total Length of Fwd Packets - CIA |

====3.3 Evaluator====

The task of the evaluator (Figure 1, bottom portion) is to execute the supervised models or algorithms and compare their performance (in detecting malicious and benign records) across four different configurations of features: (1) using all features; (2) using selected features, where selection is done by a feature selection algorithm; (3) using domain knowledge infused features; and (4) using the newly constructed features C, I, and A from the domain knowledge infused features (a compact sketch of this comparison loop follows the test descriptions below). In addition, the evaluator performs the following two tests:

1. Explainability Test: the purpose of this test is to discover the comparative advantages or disadvantages of incorporating domain knowledge in the experiment; and
2. Generalizability Test: the purpose of this test is to analyze how the different approaches perform in unknown or unseen attack detection. We delete all training records for a particular attack, one at a time, and investigate the performance of the model on the same test set, which includes records from the unknown or unseen attack. Details of these tests are described in Section 4.
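As a rough sketch of the evaluator, assume a prepared train/test split and a `feature_sets` dict mapping each configuration name to its column list (e.g., "all" to the 78 original columns, "domain" to the 22 CIA-tagged columns, "constructed" to ["C", "I", "A"]); Random Forest stands in here for any of the models, and all names are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate_configurations(X_train, y_train, X_test, y_test, feature_sets):
    """Fit one model per feature configuration and score it on the holdout set."""
    results = {}
    for name, cols in feature_sets.items():
        clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
        clf.fit(X_train[cols], y_train)
        proba = clf.predict_proba(X_test[cols])[:, 1]
        pred = (proba >= 0.5).astype(int)
        results[name] = {
            "Acc": accuracy_score(y_test, pred),
            "Prec": precision_score(y_test, pred),
            "Rec": recall_score(y_test, pred),
            "F-score": f1_score(y_test, pred),
            "AUC": roc_auc_score(y_test, proba),
        }
    return results
```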
====3.4 Algorithms====

We use six different algorithms for predicting malicious records: one is a probabilistic classifier based on Bayes' theorem, and the remaining five are supervised "black box" models. The algorithm descriptions are adapted from our previous work (Islam et al. 2019). A sketch of how these models can be instantiated follows the descriptions.

Artificial Neural Network (ANN): An Artificial Neural Network is a non-linear model capable of mimicking human brain functions to some extent. It consists of an input layer, one or more hidden layers, and an output layer. Each layer consists of multiple neurons that help to learn complex patterns, with each subsequent layer learning more abstract concepts before the network finally merges into the output layer.

Support Vector Machine (SVM): The Support Vector Machine was first introduced by (Boser, Guyon, and Vapnik 1992) and has been used for many supervised classification tasks. In addition to linear classification, the model can learn an optimal hyperplane that separates instances of different classes using a highly non-linear implicit mapping of input vectors into a high-dimensional feature space (i.e., the kernel trick) (Hooman et al. 2016). When the number of samples is very high (i.e., millions), it becomes very costly in terms of computation time.

Random Forest (RF): A Random Forest is a tree-based ensemble technique developed by (Breiman 2001) for supervised classification tasks. In RF, many trees are generated from bootstrapped subsamples (i.e., random samples drawn with replacement) of the training data. In each tree, the splitting attribute is chosen from a smaller random subset of that tree's attributes (i.e., the chosen split attribute is the best among that random subset). This randomness helps to make the trees less correlated, since correlated trees make the same kinds of prediction errors and can overfit the model. With less correlated trees, a few trees may be wrong but many others will be right, and as a group the trees move in the right direction, because the outputs of all trees are averaged for the final prediction.

Extra Trees (ET): Extremely Randomized Trees, or Extra Trees, is a tree-based ensemble technique similar to RF. The only difference is in the selection of the splitting attribute and the determination of the threshold (cut-off) value, both of which are chosen in an extremely random fashion (Islam, Ghafoor, and Eberle 2018). As in RF, a random subset of features is considered for the split selection, but instead of choosing the most discriminative cut-off threshold, ET sets cut-off thresholds to random values, and the best of these randomly chosen values becomes the threshold for the splitting rule at a particular node (ens 2019). Unlike a single Decision Tree, RF and ET use multiple trees, which reduces variance; however, some bias is introduced, as a subset of the whole feature set is chosen for each tree instead of all features. ET was proposed by (Geurts, Ernst, and Wehenkel 2006) and has achieved state-of-the-art performance in some anomaly/intrusion detection research (Islam 2018; Islam, Eberle, and Ghafoor 2018; Islam, Ghafoor, and Eberle 2018).

Gradient Boosting (GB): (Friedman 2001) generalized AdaBoost into a Gradient Boosting algorithm that allows a variety of loss functions. Here, the shortcomings of weak learners are identified using the gradient instead of the highly weighted data points used in AdaBoost. Gradient Boosting is a well-known classification/regression model in the form of an ensemble of weak prediction models, such as Decision Trees. It works sequentially, like the AdaBoost algorithm, in that each subsequent model tries to minimize the loss function (e.g., Mean Squared Error) by paying special attention to instances that were hard to get right in the previous model.

Naive Bayes (NB): The Naive Bayes algorithm is based on Bayes' Theorem, which was formulated in the seventeenth century. It is a supervised, simple, and comparatively fast algorithm based on statistics. In a real-world problem, it is unusual for all features to be independent; however, Naive Bayes assumes conditional independence among features and surprisingly works well in many cases. It also requires only a small amount of training data to estimate the necessary parameters (nai 2019). The independence assumption avoids a lot of computation (e.g., computing the conditional probability of each feature given the others) and makes it a faster algorithm; it also helps in Big Data problems where the curse of dimensionality is a concern, since each class-conditional feature distribution can be independently estimated as a one-dimensional distribution. However, NB is a bad estimator of probabilities (Zhang 2004). We use Bernoulli Naive Bayes (Manning, Raghavan, and Schütze 2010) for our experiments, where each feature is assumed to be binary-valued.
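For concreteness, the six classifiers can be instantiated as follows. The hyperparameters shown are illustrative defaults, not the paper's tuned settings, and scikit-learn's MLPClassifier stands in here for the TensorFlow network actually used for the ANN.

```python
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.neural_network import MLPClassifier

models = {
    "RF":  RandomForestClassifier(n_estimators=100, n_jobs=-1),
    "ET":  ExtraTreesClassifier(n_estimators=100, n_jobs=-1),
    "GB":  GradientBoostingClassifier(),                # sequential boosting of weak trees
    "SVM": SVC(kernel="rbf", probability=True),         # costly when samples run to millions
    "NB":  BernoulliNB(),                               # each feature treated as binary-valued
    "ANN": MLPClassifier(hidden_layer_sizes=(64, 32)),  # input -> hidden layers -> output
}
```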
====3.5 Data====

In this work, we use a recent and comprehensive IDS dataset, CICIDS2017, published in 2017, which covers necessary criteria and contains common updated attacks such as DoS, DDoS, Brute Force, XSS, SQL Injection, Infiltration, Portscan, and Botnet. This dataset was created to eliminate the shortcomings (e.g., lack of traffic diversity and volume, lack of variety of attacks, anonymized packet information, and datedness) of previous well-known IDS datasets created since 1998, such as DARPA98, KDD99, ISC2012, ADFA13, DEFCON, CAIDA, LBNL, CDX, Kyoto, Twente, and UMASS. It is a labeled dataset containing 78 network traffic features (some are listed in Table 2), extracted and calculated from pcap files using the CICFlowMeter software (Lashkari et al. 2017) for all benign and intrusive flows (Sharafaldin, Lashkari, and Ghorbani 2018). This new IDS dataset includes seven common updated families of attacks satisfying real-world criteria and is publicly available at https://www.unb.ca/cic/datasets/ids-2017.html.

Each record of the dataset is labeled with the particular type of attack. We create a new target feature, "Class", setting it to 1 for all records labeled as any of the 14 types of attacks (those records are malicious/intrusive) and to 0 for the remaining records (those are benign). In the whole dataset, there are 2,830,743 records covering 14 different attacks: 2,273,097 records are benign and 557,646 are malicious. Approximately 24.5% of the records are malicious, giving us an imbalanced dataset, which impacts the performance of some machine learning algorithms (e.g., a bias toward the class with the majority of samples). To overcome this problem, we use the well-known oversampling technique SMOTE (Chawla et al. 2002) to oversample the minority class; (Dong and Wang 2016) also use SMOTE to address this issue in their empirical comparison of traditional vs. deep learning-based IDS. SMOTE creates synthetic samples rather than simply oversampling with replacement: the minority class is oversampled by creating new examples along the line segments joining any or all of the k nearest minority-class neighbors, where k is chosen based on the percentage of oversampling required (i.e., a hyperparameter of the algorithm) (Chawla et al. 2002).

===4 Experiments===

====4.1 Experimental Setup====

We execute the experiments on a GPU-enabled Linux machine with 12 GB of RAM and a Core i7 processor. All supervised machine learning algorithms are implemented using the Python-based Scikit-learn library (sci 2019); in addition, we use TensorFlow (ten 2019) for the Artificial Neural Network. Due to resource limitations, instead of using the whole dataset, we take a stratified sample of the data that is big enough (i.e., 300K records) for a single GPU-enabled commodity machine. We make the sampled dataset available to the research community at (sam 2019). Furthermore, we use 70% of the data for training the models and keep 30% as a holdout set to test them, confirming that the target class has the same ratio in both sets. To avoid the adverse effect of class imbalance on classification performance, we re-sample the minority class of the training set using SMOTE (Chawla et al. 2002) to balance the dataset. However, we do not re-sample the test set: real-world data is skewed, and oversampling the test set could exhibit an over-optimistic performance.
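A minimal sketch of this preparation, assuming the merged CICIDS2017 CSV is loaded into a DataFrame `df` with its original "Label" column (column naming in the raw files varies, so treat these names as assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

df["Class"] = (df["Label"] != "BENIGN").astype(int)  # 1 = any of the 14 attacks, 0 = benign

X, y = df.drop(columns=["Label", "Class"]), df["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)  # same class ratio in both sets

# Oversample the minority class of the training set only; the test set keeps
# its real-world skew so the scores are not over-optimistic.
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)
```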
We run all supervised machine learning algorithms using four different approaches:

1. With all features: using all 78 features of the dataset without discarding any features.
2. With selected features: using a Random Forest Regressor (adapting the work of (Sharafaldin, Lashkari, and Ghorbani 2018)) to select important features of the dataset, giving us 50 important features with a nonzero influence on the target variable.
3. With domain knowledge infused features: using the infused domain knowledge features (see Section 3.2); we use the term domain features interchangeably as a short form.
4. With newly constructed features from domain knowledge infused features: using the newly constructed features C, I, and A (see Section 3.2); we use the term domain features-constructed interchangeably as a short form.

The following are the two types of experiments performed using each of the four feature settings.

====4.2 Explainability Test====

For this test, we run the six supervised algorithms (RF, ET, SVM, GB, ANN, and NB) using the four described feature settings and report the results in Section 5.1. Unlike NB, the other classifiers are "black box" in nature; NB is a probabilistic classifier based on Bayes' Theorem with a strong conditional independence assumption among features. The main purpose of including NB in the experiment is the generalizability test.

====4.3 Generalizability Test====

For testing the generalizability of the approach, we train the classifier without any representatives of a particular attack and test it on a set that does contain representatives of that attack, in order to classify them as malicious/benign. To be more specific, we delete all records of a particular attack from the training set, train the classifier with the records of the remaining 13 attacks, and test the classifier with all 14 attacks. We report the percentage of the deleted attack's records that are correctly detected as malicious (see Section 5.2). We repeat this, one attack at a time, for all 14 attacks. We make the source code available to the research community to replicate the experiments at (pro 2019).
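A sketch of one round of this test, under the assumption that the train/test DataFrames retain the original "Label" strings alongside the binary "Class" target (the names are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

def unseen_attack_detection_rate(train_df, test_df, attack, feature_cols):
    """Train without one attack, then measure how many of its records in the
    test set are still flagged as malicious (Class = 1)."""
    reduced = train_df[train_df["Label"] != attack]       # delete the attack from training
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
    clf.fit(reduced[feature_cols], reduced["Class"])
    unseen = test_df[test_df["Label"] == attack]          # records of the now-unseen attack
    return clf.predict(unseen[feature_cols]).mean()       # fraction detected as malicious
```

Repeating the call over the 14 attack labels, once per attack and once per feature setting, reproduces the layout of Tables 6 and 7.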
===5 Results===

The following sections discuss the results from the two categories of experiments previously described.

====5.1 Findings from Explainability Test====

Comparing the performance using all features vs. selected features, Table 3 shows that models using all features (denoted with an appended -A, for instance RF-A) tend to show better results on all performance metrics. However, the difference from the selected features setting (denoted -S) is negligible (<.0007 for RF) on every metric, which might be a result of the eliminated features having little significance. In addition, Random Forest outperforms the other algorithms (SVM, ET, GB, ANN, and NB) under this feature setting (i.e., using all features). So we consider the results using all features as the baseline to compare against our proposed approach.

Table 3: Performance using all features vs. selected features (here and in Tables 4 and 5, a parenthesized difference indicates that the second setting scored higher)

| Alg. | Acc. | Prec. | Rec. | F-score | AUC |
|---|---|---|---|---|---|
| RF-A | 0.9987 | 0.9965 | 0.9971 | 0.9968 | 0.9997 |
| RF-S | 0.9986 | 0.9962 | 0.9966 | 0.9964 | 0.9997 |
| Difference | 0.0002 | 0.0003 | 0.0006 | 0.0005 | 0.0000 |
| ET-A | 0.9981 | 0.9951 | 0.9951 | 0.9951 | 0.9994 |
| ET-S | 0.9980 | 0.9950 | 0.9950 | 0.9950 | 0.9994 |
| Difference | 0.0001 | 0.0002 | 0.0001 | 0.0001 | 0.0000 |
| ANN-A | 0.9802 | 0.9155 | 0.9908 | 0.9516 | 0.9984 |
| ANN-S | 0.9740 | 0.8929 | 0.9860 | 0.9372 | 0.9968 |
| Difference | 0.0062 | 0.0226 | 0.0047 | 0.0145 | 0.0017 |
| SVM-A | 0.9109 | 0.6996 | 0.9595 | 0.8092 | 0.9780 |
| SVM-S | 0.8869 | 0.6433 | 0.9565 | 0.7692 | 0.9746 |
| Difference | 0.0239 | 0.0563 | 0.0030 | 0.0400 | 0.0034 |
| GB-A | 0.9960 | 0.9854 | 0.9944 | 0.9899 | 0.9995 |
| GB-S | 0.9957 | 0.9840 | 0.9945 | 0.9892 | 0.9996 |
| Difference | 0.0003 | 0.0014 | (0.0001) | 0.0007 | (0.0001) |
| NB-A | 0.7753 | 0.4371 | 0.4888 | 0.4615 | 0.8601 |
| NB-S | 0.7621 | 0.4144 | 0.5019 | 0.4539 | 0.8508 |
| Difference | 0.0132 | 0.0228 | (0.0131) | 0.0076 | 0.0093 |

Before comparing the results of our approach against all features (i.e., the baseline), we seek the better of the two domain related feature settings of our proposed approach: domain knowledge infused features vs. the newly constructed features (C, I, and A). We find that, in almost all cases, the model with domain knowledge infused features (denoted with an appended -D1, for instance RF-D1) performs better than its counterpart (denoted -D2; see Table 4). For RF, the maximum performance gap is about .2, in recall; for ET that gap is .048, with similar precision. As the domain features (22 features) contain much more detail than the newly constructed features C, I, and A (3 features), the constructed setting loses some details. In terms of individual algorithms, RF is again the clear winner, this time using domain features. Although NB and ANN exhibit better recall using constructed features, that comes with compromises in precision. So, overall, we consider the domain features setting the better of the two.

Table 4: Performance using domain features vs. constructed features

| Alg. | Acc. | Prec. | Rec. | F-score | AUC |
|---|---|---|---|---|---|
| RF-D1 | 0.9973 | 0.9920 | 0.9945 | 0.9932 | 0.9993 |
| RF-D2 | 0.9511 | 0.9446 | 0.7985 | 0.8654 | 0.9572 |
| Difference | 0.0463 | 0.0475 | 0.1960 | 0.1278 | 0.0421 |
| ET-D1 | 0.9969 | 0.9913 | 0.9932 | 0.9923 | 0.9989 |
| ET-D2 | 0.9756 | 0.9321 | 0.9448 | 0.9384 | 0.9954 |
| Difference | 0.0214 | 0.0592 | 0.0483 | 0.0538 | 0.0036 |
| ANN-D1 | 0.9497 | 0.8300 | 0.9362 | 0.8799 | 0.9865 |
| ANN-D2 | 0.5952 | 0.3241 | 0.9721 | 0.4862 | 0.7921 |
| Difference | 0.3544 | 0.5059 | (0.0359) | 0.3937 | 0.1945 |
| SVM-D1 | 0.8489 | 0.5747 | 0.8968 | 0.7005 | 0.9252 |
| SVM-D2 | 0.7195 | 0.3739 | 0.6281 | 0.4687 | 0.7886 |
| Difference | 0.1294 | 0.2008 | 0.2687 | 0.2318 | 0.1366 |
| GB-D1 | 0.9881 | 0.9513 | 0.9904 | 0.9705 | 0.9986 |
| GB-D2 | 0.9230 | 0.7692 | 0.8701 | 0.8165 | 0.9789 |
| Difference | 0.0652 | 0.1821 | 0.1204 | 0.1539 | 0.0198 |
| NB-D1 | 0.7982 | 0.4881 | 0.5028 | 0.4953 | 0.8553 |
| NB-D2 | 0.5591 | 0.2687 | 0.7195 | 0.3913 | 0.6591 |
| Difference | 0.2391 | 0.2194 | (0.2167) | 0.1040 | 0.1962 |

Knowing that the best overall setting is all features (Table 3) and the better domain related setting is domain features (Table 4), we further compare the performance of models using these two best settings: all features (i.e., the baseline) vs. domain features. We find that, among all models, RF using all features (RF-A) performs better than all other algorithms (see Table 5 and Figure 3). Interestingly, RF using domain knowledge infused features (RF-D1) also shows promising performance: the difference between the two on any performance metric is negligible (<.005). In fact, the result of RF using the domain knowledge infused feature setting is better than what (Sharafaldin, Lashkari, and Ghorbani 2018) report on the same dataset; the slight improvement might stem from the experimental settings (e.g., training/test split, re-sampling techniques). Furthermore, in the domain knowledge infused feature setting we use only 22 features out of the 78 total, where each feature indicates the associated compromises (e.g., confidentiality, integrity, or availability), making the model capable of producing better explainable and interpretable results than its counterpart.

Table 5: Performance using all features vs. domain features

| Alg. | Acc. | Prec. | Rec. | F-score | AUC |
|---|---|---|---|---|---|
| RF-A | 0.9987 | 0.9965 | 0.9971 | 0.9968 | 0.9997 |
| RF-D1 | 0.9973 | 0.9920 | 0.9945 | 0.9932 | 0.9993 |
| Difference | 0.0014 | 0.0045 | 0.0027 | 0.0036 | 0.0004 |
| ET-A | 0.9981 | 0.9951 | 0.9951 | 0.9951 | 0.9994 |
| ET-D1 | 0.9969 | 0.9913 | 0.9932 | 0.9923 | 0.9989 |
| Difference | 0.0011 | 0.0038 | 0.0020 | 0.0029 | 0.0004 |
| ANN-A | 0.9802 | 0.9155 | 0.9908 | 0.9516 | 0.9984 |
| ANN-D1 | 0.9497 | 0.8300 | 0.9362 | 0.8799 | 0.9865 |
| Difference | 0.0305 | 0.0855 | 0.0546 | 0.0717 | 0.0119 |
| SVM-A | 0.9109 | 0.6996 | 0.9595 | 0.8092 | 0.9780 |
| SVM-D1 | 0.8489 | 0.5747 | 0.8968 | 0.7005 | 0.9252 |
| Difference | 0.0619 | 0.1249 | 0.0627 | 0.1087 | 0.0528 |
| GB-A | 0.9960 | 0.9854 | 0.9944 | 0.9899 | 0.9995 |
| GB-D1 | 0.9881 | 0.9513 | 0.9904 | 0.9705 | 0.9986 |
| Difference | 0.0079 | 0.0341 | 0.0039 | 0.0194 | 0.0009 |
| NB-A | 0.7753 | 0.4371 | 0.4888 | 0.4615 | 0.8601 |
| NB-D1 | 0.7982 | 0.4881 | 0.5028 | 0.4953 | 0.8553 |
| Difference | (0.0229) | (0.0510) | (0.0140) | (0.0338) | 0.0048 |

Figure 3: Performance deviations of using domain knowledge infused features from the baseline.

The prediction for a particular sample can be represented as:

$$P(D) = b + \sum_{g=0}^{G} \text{contribution}(g) \qquad (6)$$

where b is the model average, g is a generalized domain feature (e.g., ACK Flag Count - C), and P(D) is the probability value of the decision. Instead of using the contributions from each of the domain features, we can express the output in terms of the contribution from each element of the domain concept. For that, we need to aggregate the contributions of all features into three groups (C, I, and A). This enables an analyst to understand the nature of the attack more quickly (Figure 2). For instance, when the greater portion of a sample's feature contribution comes from features tagged with -A (i.e., Availability), it might be a DDoS attack, which usually comes with very high compromises in the availability of data or services. We use the iml package of the programming language R to generate the breakdown of feature contributions for a particular sample's prediction (Figure 2).

Figure 2: Breakdown of the prediction for a random sample in terms of C, I, and A.
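The paper generates Figure 2 with R's iml package; as a hedged Python analogue of Equation (6), the treeinterpreter package decomposes a Random Forest prediction into a bias term (the model average b) plus one contribution per feature, which can then be folded into C, I, and A shares. The helper below is our own illustration and assumes the renamed feature names of Table 2.

```python
from treeinterpreter import treeinterpreter as ti

def cia_breakdown(rf, sample, feature_names):
    """sample: array of shape (1, n_features); names end in '- C', '- AC', etc."""
    prediction, bias, contributions = ti.predict(rf, sample)
    shares = {"C": 0.0, "I": 0.0, "A": 0.0}
    for name, contrib in zip(feature_names, contributions[0, :, 1]):
        tags = name.rsplit("-", 1)[1].strip()   # e.g. 'AC' from 'Flow Duration - AC'
        for tag in tags:
            shares[tag] += contrib / len(tags)  # split among the tagged components
    return bias[0, 1], shares                   # P(D) = bias + sum of the shares
```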
Overall, although the However, from Table 3, 4, and 5, we can see that NB’s Table 6: Performance of unseen attack detection using RF Table 7: Performance of unseen attack detection using NB Attack Count All(%) Sel.(%) Dom.(%) Cons.(%) Attack Count All(%) Sel.(%) Dom.(%) Cons.(%) Ddos 4184 99.90 99.90 99.90 62.86 Ddos 4184 76.94 74.59 83.22 100 PortScan 4973 99.90 99.94 99.94 66.28 PortScan 4973 0.18 0.18 6.64 100 Bot 54 77.78 77.78 75.93 22.22 Bot 54 0.00 0.00 100 Infiltration 1 100 100 100 0.00 Infiltration 1 0.00 0.00 0.00 100 Web Attack-BF 49 95.92 95.92 91.84 75.51 Web Attack-BF 49 6.12 6.12 83.67 100 Web Attack-XSS 23 95.65 95.65 91.30 65.22 Web Attack-XSS 23 0.00 0.00 95.65 100 Web Attack-Sql 1 0.00 0.00 100 0.00 Web Attack-Sql 1 100 100 0.00 100 FTP-Patator 251 99.20 100 99.20 81.67 FTP-Patator 251 0.00 0.00 0.00 48.61 SSH-Patator 198 98.99 99.49 96.97 75.76 SSH-Patator 198 0.00 0.00 0.00 100 DoS slowloris 188 99.47 99.47 98.94 61.70 DoS slowloris 188 25.53 30.85 38.30 66.49 DoS Slowloris 174 99.43 99.43 96.55 31.61 DoS Slowloris 174 71.26 79.89 75.29 78.16 Dos Hulk 7319 99.71 99.73 99.34 96.19 Dos Hulk 7319 67.78 67.78 67.03 35.37 DoS GoldenEye 314 99.36 99.68 98.41 85.03 DoS GoldenEye 314 50.32 65.29 47.13 96.82 Heartbleed 1 100 100 100 100 Heartbleed 1 100 100 0.00 0.00 Figure 4: Unknown attack detection rate using Random For- est Figure 5: Unknown attack detection rate using Naive Bayes performance comes at a cost of precision and recall (i.e., speed where there is human in the loop. The lack of explain- produces comparatively more false positives and false neg- ability of an AI-based model is a key reason for this bottle- atives). In addition, NB is a bad probability estimator of neck. To mitigate this problem, we infuse the CIA principle the predicted output (Zhang 2004). However, NB with con- (i.e., domain knowledge) in the AI-based black box model structed features setting could be recommended as an addi- for better explainability and generalizability of the model. tional IDS for quick interpretation of huge traffic data given Our experimental results show realizable successes in better the decision is treated as tentative with the requirement of explainability with a comprehensive, up to date, and real- a further sanity check. We also calculate the average time world network intrusion dataset. In addition, the infused do- taken by each algorithm for all four feature settings and main knowledge helps in detecting an unknown attack as it found that NB is the fastest algorithm. RF, ET, GB, ANN, generalizes the problem, which ultimately opens the door to and SVM take 2.80, 9.27, 77.06, 15.07, and 444.50 times accommodate big data. more execution time compared to NB. Besides, the best al- gorithm, RF (1st in terms of the performance metric and 2nd Going forward, finding an optimal solution to segregate in terms of execution time), can be executed in parallel using the contribution of each participating feature (sample wise) an Apache Spark for a far better run-time (Chen et al. 2016) considering interactions (i.e., correlations among features making it highly scalable to big data problems. complicate explanations) among features will aid in better Overall, domain knowledge infusion provides better ex- explainability of an individual prediction (i.e., per sample). plainability with negligible compromises in performance. 
In Besides, to ensure trust, estimating the level of uncertainty addition, the generalization provides better execution time in the model will be another extension of this work. There and resiliency with unknown attacks. are some open challenges surrounding explainability and in- terpretability such as an agreement of what an explanation is and to whom, a formalism for the explanation, and quan- 6 Conclusion and Future Work tifying the human comprehensibility of the explanation. AI-based approaches have become an integral part of secu- rity solutions due to the potential for handling “Big Data” Acknowledgment and handling diverse network traffic data. Cybercrime- related damages continue to rise, and network intrusions are Thanks to Tennessee Tech’s Cyber-security Education, Re- a key tactic. Although AI-based IDS provides accelerated search and Outreach Center (CEROC) for supporting this speeds in intrusion detection, response is still at a human research. References methods in credit scoring. The Journal of Developing Ar- Ando, S. 2019. Interpreting random forests. eas 50(5):371–381. http://blog.datadive.net/interpreting-random-forests/. Islam, S. R.; Eberle, W.; Bundy, S.; and Ghafoor, S. K. 2019. Bach, S.; Binder, A.; Montavon, G.; Klauschen, F.; Müller, Infusing domain knowledge in ai-based” black box” models K.-R.; and Samek, W. 2015. On pixel-wise explanations for for better explainability with application in bankruptcy pre- non-linear classifier decisions by layer-wise relevance prop- diction. arXiv preprint arXiv:1905.11474. agation. PloS one 10(7):e0130140. Islam, S. R.; Eberle, W.; and Ghafoor, S. K. 2018. Credit de- Boser, B. E.; Guyon, I. M.; and Vapnik, V. N. 1992. A train- fault mining using combined machine learning and heuristic ing algorithm for optimal margin classifiers. In Proceedings approach. arXiv preprint arXiv:1807.01176. of the fifth annual workshop on Computational learning the- Islam, S. R.; Ghafoor, S. K.; and Eberle, W. 2018. Min- ory, 144–152. ACM. ing illegal insider trading of stocks: A proactive approach. Breiman, L. 2001. Random forests. Machine learning In 2018 IEEE International Conference on Big Data (Big 45(1):5–32. Data), 1397–1406. IEEE. Chandrasekaran, B.; Tanner, M. C.; and Josephson, J. R. Islam, S. R. 2018. An efficient technique for mining bad 1989. Explaining control strategies in problem solving. credit accounts from both olap and oltp. Ph.D. Dissertation, IEEE Intelligent Systems (1):9–15. Tennessee Technological University. Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; and Kegelmeyer, Javaid, A.; Niyaz, Q.; Sun, W.; and Alam, M. 2016. A W. P. 2002. Smote: synthetic minority over-sampling tech- deep learning approach for network intrusion detection sys- nique. Journal of artificial intelligence research 16:321– tem. In Proceedings of the 9th EAI International Conference 357. on Bio-inspired Information and Communications Technolo- Chen, J.; Li, K.; Tang, Z.; Bilal, K.; Yu, S.; Weng, C.; and Li, gies (formerly BIONETICS), 21–26. ICST (Institute for K. 2016. A parallel random forest algorithm for big data in Computer Sciences, Social-Informatics and . . . . a spark cloud computing environment. IEEE Transactions Kabul, I. K. 2018. Explainable ai. https://www.kdnuggets. on Parallel and Distributed Systems 28(4):919–933. com/2018/11/interpretability-trust-ai-machine- Datta, A.; Sen, S.; and Zick, Y. 2016. Algorithmic trans- learning.html. parency via quantitative input influence: Theory and exper- Kim, J.; Kim, J.; Thu, H. L. T.; and Kim, H. 2016. 
Long iments with learning systems. In 2016 IEEE symposium on short term memory recurrent neural network classifier for security and privacy (SP), 598–617. IEEE. intrusion detection. In 2016 International Conference on DeJong, G. 1981. Generalizations based on explanations. In Platform Technology and Service (PlatCon), 1–5. IEEE. IJCAI, volume 81, 67–69. Kim, B.; Wattenberg, M.; Gilmer, J.; Cai, C.; Wexler, J.; Vie- Dong, B., and Wang, X. 2016. Comparison deep learning gas, F.; and Sayres, R. 2017. Interpretability beyond feature method to traditional methods using for network intrusion attribution: Quantitative testing with concept activation vec- detection. In 2016 8th IEEE International Conference on tors (tcav). arXiv preprint arXiv:1711.11279. Communication Software and Networks (ICCSN), 581–585. Lashkari, A. H.; Draper-Gil, G.; Mamun, M. S. I.; and Ghor- IEEE. bani, A. A. 2017. Characterization of tor traffic using time Doyle, M. 2019. Don’t be lulled into a false sense of secu- based features. In ICISSP, 253–262. rity. https://www.securityroundtable.org/dont-lulled-false- Lei, T.; Barzilay, R.; and Jaakkola, T. 2016. Rationalizing sense-cybersecurity/. neural predictions. arXiv preprint arXiv:1606.04155. 2019. Ensemble methods. https://scikit-learn.org /sta- Li, Z.; Sun, W.; and Wang, L. 2012. A neural network based ble/modules/ensemble.html. distributed intrusion detection system on cloud platform. In Friedman, J. H. 2001. Greedy function approximation: a 2012 IEEE 2nd international conference on Cloud Comput- gradient boosting machine. Annals of statistics 1189–1232. ing and Intelligence Systems, volume 1, 75–79. IEEE. Geurts, P.; Ernst, D.; and Wehenkel, L. 2006. Extremely Lipovetsky, S., and Conklin, M. 2001. Analysis of regres- randomized trees. Machine learning 63(1):3–42. sion in game theory approach. Applied Stochastic Models in Business and Industry 17(4):319–330. Goodman, B., and Flaxman, S. 2017. European union reg- ulations on algorithmic decision-making and a “right to ex- Lipton, Z. C. 2016. The mythos of model interpretability. planation”. AI Magazine 38(3):50–57. arXiv preprint arXiv:1606.03490. Hodo, E.; Bellekens, X.; Hamilton, A.; Dubouilh, P.-L.; Lundberg, S. M., and Lee, S.-I. 2017. A unified approach Iorkyase, E.; Tachtatzis, C.; and Atkinson, R. 2016. Threat to interpreting model predictions. In Advances in Neural analysis of iot networks using artificial neural network intru- Information Processing Systems, 4765–4774. sion detection system. In 2016 International Symposium on Lundberg, S. 2019. Shap vs lime. https://github.com /slund- Networks, Computers and Communications (ISNCC), 1–6. berg/shap/issues/19. IEEE. Manning, C.; Raghavan, P.; and , H. 2010. Introduc- Hooman, A.; Marthandan, G.; Yusoff, W. F. W.; Omid, M.; tion to information retrieval. Natural Language Engineering and Karamizadeh, S. 2016. Statistical and data mining 16(1):100–103. Matt, B., et al. 2006. Introduction to computer security. Wyden, B. 2019. Algorithmic accountability. Pearson Education India. https://www.wyden.senate.gov/imo/media/doc/Algorithmic%20 Miller, T. 2018. Explanation in artificial intelligence: In- Accountability%20Act%20of%202019%20Bill%20Text.pdf. sights from the social sciences. Artificial Intelligence. Yang, S. C.-H., and Shafto, P. 2017. Explainable artificial Mitchell, T. M.; Keller, R. M.; and Kedar-Cabelli, S. T. intelligence via bayesian teaching. In NIPS 2017 workshop 1986. Explanation-based generalization: A unifying view. 
on Teaching Machines, Robots, and Humans. Machine learning 1(1):47–80. Zhang, H. 2004. The optimality of naive bayes. AA 1(2):3. Montavon, G.; Samek, W.; and Müller, K.-R. 2018. Methods for interpreting and understanding deep neural networks. Digital Signal Processing 73:1–15. 2019. Naive bayes. https://scikit-learn.org/ sta- ble/modules/naive bayes.html. 2019. Netflow meter. http://netflowmeter.ca/ netflowme- ter.html. 2019. domain-knowledge-aided code. https://github.com/ SheikhRabiul/domain-knowledge-aided-explainable-ai-for- intrusion-detection-and-response. Rankin, K. 2019. The dark secret at the heart of ai. https://www.technologyreview.com/s/604087/the-dark- secret-at-the-heart-of-ai/. Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. Why should i trust you?: Explaining the predictions of any classi- fier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 1135– 1144. ACM. 2019. domain-knowledge-aided dataset. https://github.com/ SheikhRabiul/domain-knowledge-aided-explainable-ai- for -intrusion-detection-and-response/tree/master/data/ combined sampled.zip. 2019. Scikit-learn: Machine learning in python. https://scikit-learn.org/stable/. Sharafaldin, I.; Lashkari, A. H.; and Ghorbani, A. A. 2018. Toward generating a new intrusion detection dataset and in- trusion traffic characterization. In ICISSP, 108–116. Shone, N.; Ngoc, T. N.; Phai, V. D.; and Shi, Q. 2018. A deep learning approach to network intrusion detection. IEEE Transactions on Emerging Topics in Computational Intelli- gence 2(1):41–50. Shrikumar, A.; Greenside, P.; and Kundaje, A. 2017. Learn- ing important features through propagating activation differ- ences. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 3145–3153. JMLR. org. Štrumbelj, E., and Kononenko, I. 2014. Explaining predic- tion models and individual predictions with feature contribu- tions. Knowledge and information systems 41(3):647–665. Swartout, W. R., and Moore, J. D. 1993. Explanation in sec- ond generation expert systems. In Second generation expert systems. Springer. 543–585. Swartout, W. R. 1985. Rule-based expert systems: The mycin experiments of the stanford heuristic programming project: Bg buchanan and eh shortliffe,(addison-wesley, reading, ma, 1984); 702 pages. 2019. Tensorflow. https://www.tensorflow.org/. Turek, M. 2019. Explainable ai. https://www.darpa.mil/ program/explainable-artificial-intelligence.