=Paper= {{Paper |id=Vol-2600/paper1 |storemode=property |title=Domain Knowledge Aided Explainable Artificial Intelligence for Intrusion Detection and Response |pdfUrl=https://ceur-ws.org/Vol-2600/paper1.pdf |volume=Vol-2600 |authors=Sheikh Rabiul Islam,William Eberle,Sheikh K. Ghafoor,Ambareen Siraj,Mike Rogers |dblpUrl=https://dblp.org/rec/conf/aaaiss/IslamEGSR20 }} ==Domain Knowledge Aided Explainable Artificial Intelligence for Intrusion Detection and Response== https://ceur-ws.org/Vol-2600/paper1.pdf
      Domain Knowledge Aided Explainable Artificial Intelligence for Intrusion
                           Detection and Response
       Sheikh Rabiul Islam, William Eberle, Sheikh K. Ghafoor, Ambareen Siraj, Mike Rogers
                                                Department of Computer Science
                                             Tennessee Technological University
                                                       Cookeville, U.S.
                         sislam42@students.tntech.edu, {weberle, sghafoor, asiraj, mrogers}@tntech.edu


                            Abstract

Artificial Intelligence (AI) has become an integral part of modern-day security solutions for its ability to learn very complex functions and handle "Big Data". However, the lack of explainability and interpretability of successful AI models is a key stumbling block when trust in a model's prediction is critical. This leads to human intervention, which in turn results in a delayed response or decision. While there have been major advancements in the speed and performance of AI-based intrusion detection systems, the response is still at human speed when it comes to explaining and interpreting a specific prediction or decision. In this work, we infuse popular domain knowledge (i.e., the CIA principles) into our model for better explainability and validate the approach on a network intrusion detection test case. Our experimental results suggest that the infusion of domain knowledge provides better explainability as well as a faster decision or response. In addition, the infused domain knowledge generalizes the model to work well with unknown attacks, and opens the path to adapting to a large stream of network traffic from numerous IoT devices.

Copyright © 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto, California, USA, March 23-25, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                      1    Introduction
Most of the recent advancements in Artificial Intelligence (AI), and more specifically Machine Learning (ML), have come from complex non-linear models such as Deep Neural Networks, Ensemble Methods, and Support Vector Machines. These models are also known as "black box" models, as they are complex to interpret and explain, which arises from their inherent non-linear capabilities, multiple parameters, and very complex transformations. In addition, some algorithms require a very large number of samples (i.e., large training sets) to work efficiently, where it is very difficult to figure out what the model learned from the dataset and which portion of the dataset has more influence on the output (Kabul 2018).

Due to these challenges, black box models lack explainability and interpretability, ultimately resulting in a lack of trust in the model and its predictions, as well as possibly leading to a delayed human response or decision. This limitation also raises ethical issues in sensitive domains like finance (e.g., credit approval), health care (e.g., disease diagnosis), and security (e.g., identifying targets). For instance, AI and ML are becoming an integral part of security solutions and defense. To mitigate the unethical use of AI as well as to promote the responsible use of AI systems, various governments have started taking precautionary initiatives. Recently, the European Union implemented the rule of "right of explanation", where a user can ask for an explanation of an algorithmic decision (Goodman and Flaxman 2017). In addition, more recently the US government introduced a new bill, the "Algorithmic Accountability Act", which would require companies to assess their machine learning systems for bias and discrimination, and to take corrective measures (Wyden 2019). The U.S. Department of Defense (DoD) has identified explainability as a key stumbling block in the adoption of AI-based solutions in many of their projects. Their DARPA division has invested $2 billion in an Explainable Artificial Intelligence (XAI) program (Turek 2019; Rankin 2019).

Network intrusions are a common cyber-crime activity, estimated to cost around $6 trillion annually in damages by 2021 (Doyle 2019). To combat these attacks, an Intrusion Detection System (IDS) is a security system that monitors networks and computer systems (Hodo et al. 2016). Research in AI-based IDS has shown promising results (Hodo et al. 2016), (Shone et al. 2018), (Kim et al. 2016), (Javaid et al. 2016), (Li, Sun, and Wang 2012), and AI has become an integral part of security solutions due to its capability of learning complex, nonlinear functions and analyzing large data streams from numerous connected devices. A recent survey by (Dong and Wang 2016) suggests that deep learning-based methods are accurate and robust to a wide range of attacks and sample sizes. However, there are concerns regarding the sustainability of current approaches (e.g., intrusion detection/prevention systems) when faced with the demands of modern networks and the increasing level of human interaction (Shone et al. 2018). In the age of IoT and Big Data, an increasing number of connected devices and associated streams of network traffic have exacerbated the problem. In addition, the delay in detection/response increases the chance of zero-day exploitation, whereby a previously unknown vulnerability is just discovered by the attacker, and the attacker immediately initiates an attack. However, improved explainability of an AI model could quicken interpretation, making it more feasible to accelerate the response.

Explainability is the extent to which the internal working mechanism of a machine or AI system can be explained in human terms, while interpretability is the extent to which a cause and effect (i.e., an understanding of what is happening) can be observed within a system. In other words, interpretability is a form of abstract knowledge about what is happening, and explainability is the detailed, step-by-step knowledge of what is happening (Montavon, Samek, and Müller 2018), (Turek 2019). However, while some literature treats interpretability and explainability as the same, they are actually two different traits of a model. Just because a model can be interpreted does not mean that it can be explained, and explainability needs to go beyond the algorithm (Lipton 2016).

Explainability and interpretability of a model can be achieved before, during, and after modeling. From the literature, we find that interpretability in pre-modeling (i.e., before modeling) is under-focused. (Miller 2018) argues that explainability should incorporate knowledge from different domains such as philosophy, psychology, and cognitive science, so that the explanation is not just based on the researcher's intuition of what constitutes a good explanation. However, we also find that the use of domain knowledge for explainability is under-focused. In this work, we introduce a novel approach for an AI-based explainable intrusion detection and response system, and demonstrate its effectiveness by infusing a popular network security principle (the CIA principles) into the model for better explainability and interpretability of the decision.

We use a recent and comprehensive IDS dataset (CICIDS2017), which covers the necessary criteria with common updated attacks such as DDoS, Brute Force, XSS, SQL Injection, Infiltration, Portscan, and Botnet. We infuse the CIA principles into the model, which provides a concise and interpretable set of important features. Computer security rests on the CIA principles: C stands for confidentiality (concealment of information or resources), I stands for integrity (trustworthiness of data or resources), and A stands for availability (the ability to use the desired information or resource) (Matt and others 2006). For instance, a security compromise in confidentiality could happen through eavesdropping on unencrypted data, a compromise in integrity could happen through an unauthorized attempt to change data, and a compromise in availability could happen through a deliberate arrangement of denial of access to data or a service.

We also convert the domain knowledge infused features into three features, C, I, and A, by quantitatively computing the compromises associated with each of them for each record. Output expressed in terms of this generalized and newly constructed set of features provides better explainability with negligible compromises in performance. We also found that this generalization provides more resiliency against unknown attacks.

In summary, our contributions in this work are as follows: (1) we demonstrate a method for the collection and use of domain knowledge in an intrusion detection/response system; (2) we introduce a way to bring popular security principles (e.g., the CIA principles) to aid in interpretability and explainability; (3) our experimental results show that infusing domain knowledge into "black box" models can make them better explainable with little or no compromise in performance; and (4) domain knowledge infusion increases generalizability, which leads to better resiliency against unknown attacks.

We start with a background of related work (Section 2), followed by a description of our proposed approach, an intuitive description of standard supervised algorithms, and an overview of the dataset (Section 3) used in this work. In Section 4, we describe our experiments, followed by Section 5, which contains a discussion of the results from the experiments. We conclude with limitations and future work in Section 6.

                          2    Background
Research in Explainable Artificial Intelligence (XAI) is a re-emerging field, following the earlier work of (Chandrasekaran, Tanner, and Josephson 1989), (Swartout and Moore 1993), and (Swartout 1985). Previous work focused primarily on explaining the decision process of knowledge-based systems and expert systems. The classical learning paradigm Explanation-Based Learning (EBL), introduced in the early '80s, can also be regarded as a precursor of explainability. EBL involves learning a problem-solving technique by observing and analyzing solutions to a specific problem (DeJong 1981), (Mitchell, Keller, and Kedar-Cabelli 1986). The main reason for the renewed interest in XAI research has stemmed from recent advancements in AI and ML and their application to a wide range of areas, as well as concerns over unethical use and undesired biases in the models. In addition, recent concerns and laws by different governments are necessitating more research in XAI.

(Yang and Shafto 2017) use Bayesian Teaching, where a smaller subset of examples is used to train the model instead of the whole dataset. The subset of examples is chosen by domain experts, as those examples are most relevant to the problem of interest. However, choosing the right subset of examples in the real world is challenging.

(Lei, Barzilay, and Jaakkola 2016) propose an approach for sentiment analysis where a subset of text from the whole text is selected as the rationale for the prediction. The selected subset of text is concise and sufficient enough to act as a substitute for the original text, while still being capable of producing the correct prediction. Although their approach outperforms available attention-based models (from deep learning) with variable-length input (e.g., a model for document summarization), it is limited to text analysis only.

When an explanation is based on feature importance, it is necessary to keep in mind that features that are globally important may not be important in the local context, and vice versa (Ribeiro, Singh, and Guestrin 2016). (Ribeiro, Singh, and Guestrin 2016) propose a novel explanation technique capable of explaining the prediction of any classifier (i.e., in a model-agnostic way) with a locally interpretable model (i.e., in the vicinity of the instance being predicted) around the prediction. Their concern is with two issues: (1) whether
the user should trust the prediction of the model and act on it, and (2) whether the user should trust a model to behave reasonably well when deployed. In addition, they involve human judgment in their experiment (i.e., a human in the loop) to decide whether or not to trust the model.

(Kim et al. 2017) propose a concept attribution-based approach (i.e., sensitivity to a concept) that provides an interpretation of a neural network's internal state in terms of human-friendly concepts. Their approach, Testing with CAVs (TCAV), quantifies the prediction's sensitivity to a high-dimensional concept. For example, given a user-defined set of examples that defines the concept "striped", TCAV can quantify the influence of "striped" on the prediction of "zebra" as a single number. To learn the high-dimensional concepts, they use Concept Activation Vectors (CAVs); CAVs are learned by training a linear classifier to distinguish between the activations produced by a particular concept's examples and other examples in any layer.

Most of these approaches try to find out how the prediction deviates from the base/average scenario. LIME (Ribeiro, Singh, and Guestrin 2016) tries to generate an explanation by locally (i.e., using local behavior) approximating the model with an interpretable model (e.g., decision trees, linear models). However, it is limited by its use of only linear models to approximate the local behavior. (Lundberg and Lee 2017) propose SHAP, which unifies seven previous approaches: LIME (Ribeiro, Singh, and Guestrin 2016), DeepLIFT (Shrikumar, Greenside, and Kundaje 2017), Tree Interpreter (Ando 2019), QII (Datta, Sen, and Zick 2016), Shapley sampling values (Štrumbelj and Kononenko 2014), Shapley regression values (Lipovetsky and Conklin 2001), and Layer-wise Relevance Propagation (Bach et al. 2015), to explain the prediction of any machine learning model. While SHAP comes with theoretical guarantees about consistency and local accuracy from game theory, it needs to run many evaluations of the original model to estimate a single vector of feature importances (Lundberg 2019). ELI5 also uses the LIME algorithm internally for explanations. In addition, ELI5 is not truly model-agnostic, being mostly limited to tree-based and other parametric or linear models. Furthermore, Tree Interpreter is limited to tree-based approaches only (e.g., Random Forest, Decision Trees).

AI-based IDSs have continued to show promising performance (Hodo et al. 2016), (Shone et al. 2018), (Kim et al. 2016), (Javaid et al. 2016), (Li, Sun, and Wang 2012). (Shone et al. 2018) propose an approach combining both shallow (Random Forest) and deep learning (Auto Encoder), capable of analyzing a wide range of network traffic and outperforming mainstream Deep Belief Networks (DBN). In a literature survey on traditional IDS vs. deep learning IDS, (Dong and Wang 2016) suggest that deep learning-based methods provide better accuracy for a wide range of sample sizes and a variety of network traffic or attacks (Dong and Wang 2016). However, in all of the previous work, there are still long training times and a reliance on a human operator (Shone et al. 2018).

However, incorporating domain knowledge for explainability has garnered little attention. Previously, we introduced the concept of infusing domain knowledge (Islam et al. 2019), albeit for bankruptcy prediction with a limited focus. (Miller 2018) has argued that incorporating knowledge from different domains will provide better explainability. In addition, (Kim et al. 2017) use the prediction's sensitivity to high-dimensional concepts (e.g., the concept "striped" for "Zebra") for explaining the prediction. Furthermore, both LIME (Ribeiro, Singh, and Guestrin 2016) and SHAP (Lundberg and Lee 2017) use a simplified input mapping, i.e., mapping the original input to a simplified set of inputs. To the best of our knowledge, none of these models incorporate domain knowledge with a focus towards better explainability and interpretability. Although our proposed conceptual model comes with a negligible compromise in accuracy, it provides better explainability and interpretability, as well as scalability to big data problems.

                          3    Methodology

                 Figure 1: Proposed Technique

3.1   Proposed Approach
The proposed approach consists of two components: a feature generalizer, which produces a generalized feature set with the help of domain knowledge in two different ways; and an evaluator, which produces and compares the results from the "black box" model for multiple configurations of features: domain knowledge infused features, newly constructed features from domain knowledge infused features, selected features, and all features.

3.2   Feature Generalizer
The feature generalizer (Figure 1, top portion) takes the original features of the dataset (X_1, X_2, ..., X_n ∈ X, where X is the set of all features) and infuses domain knowledge to
produce/re-construct a concise and better interpretable feature set (X_1', X_2', ..., X_k' ∈ X', where X' is the universal set of original/transformed/constructed features, and k is much smaller than n) in two different ways:

Feature Mapping   As stated earlier, we use the CIA principles (confidentiality, integrity, and availability) as domain knowledge. We analyze all types of attacks for the associated compromises in each component of the CIA principles (see Table 1). The Heartbleed vulnerability is related to a compromise in confidentiality, as an attacker could gain access to the memory of systems protected by a vulnerable version of OpenSSL. A Web attack (e.g., SQL injection) is related to a compromise in confidentiality and integrity (e.g., reading/writing data using an injected query), as well as availability (e.g., flooding the database server with injected complex queries such as a cross join). An Infiltration attack is related to a compromise in confidentiality, as it normally exploits a software vulnerability (e.g., in Adobe Acrobat Reader) to create a backdoor and reveal information (e.g., IPs). A Port scan attack is related to a compromise in confidentiality, as the attacker sends packets with varying destination ports to learn the services and operating systems from the replies. All DoS and DDoS attacks are related to a compromise in availability, as they aim to hamper the availability of a service or data. Furthermore, SSH-Patator and FTP-Patator are brute force attacks and are usually responsible for a compromise in confidentiality. A Botnet (i.e., robot network, a network of malware-infected computers) can provide a remote shell, file upload/download options, screenshot capture, and key logging, which has the potential for all of the confidentiality, integrity, and availability related compromises.

Table 1: Mapping of network attacks to related components of the CIA principles

         Attack         | Related component of CIA
         DoS GoldenEye  | A
         Heartbleed     | C
         DoS Hulk       | A
         DoS Slowhttp   | A
         DoS slowloris  | A
         SSH-Patator    | C
         FTP-Patator    | C
         Web Attack     | C, I, A
         Infiltration   | C
         Bot            | C, I, A
         PortScan       | C
         DDoS           | A

Furthermore, from the feature ranking of the original dataset provider (Sharafaldin, Lashkari, and Ghorbani 2018), for each type of attack we take the top three features according to their importance (i.e., feature importance from a Random Forest Regressor) and calculate the mapping (Table 2) with the related compromises under the CIA principles. For example, the feature Average Packet Size is renamed Avg Packet Size - A, where -A indicates that it is a key feature for the compromise of availability (see Table 2). To get this mapping between a feature and the associated compromises, we first find the mapping between an attack and the related compromises (from Table 1, formulated as Equation 2). In other words, Equation 1 gives the name of the associated attack where the feature is among the top three features for identifying that particular attack, and Equation 2 gives the associated compromises in C, I, or A from the attack name. Thus, with the help of domain knowledge, we keep 22 features (see Table 2) out of a total of 78 features. We will refer to these features as the domain features. The feature descriptions in Table 2 are taken from the data processing software's website (net 2019).

                     f(feature) → attack                    (1)
                     f(attack) → C, I, or A                 (2)

Feature Construction   We also construct three new features, C, I, and A, from the domain features by quantitatively calculating the compromises associated with each of the domain features. For that purpose, we calculate the correlation coefficient vector of the dataset to understand whether an increase in the value of a feature has a positive or negative impact on the target variable. We then convert each element of the correlation coefficient (a.k.a. coeff) vector V to 1 or -1 based on whether that correlation coefficient is positive or negative. We also group the domain features and corresponding coeff tuples into three groups. Using Equations 3, 4, and 5, we aggregate each group (for C, I, and A) of domain features into the three new features C, I, and A. We also scale all feature values from 0 to 1 before starting the aggregation process. During the aggregation for a particular group (e.g., C), if the correlation coefficient (e.g., V_i) for a feature (e.g., C_i) of that group has a negative value, then the product of the feature value and the correlation coefficient for that feature is deducted, and vice-versa if positive. In addition, when a feature is liable for more than one compromise, the feature value is split between the associated elements of the CIA principles.

                     C = \sum_{i=0}^{n} C_i V_i             (3)

                     I = \sum_{i=0}^{n} I_i V_i             (4)

                     A = \sum_{i=0}^{n} A_i V_i             (5)

3.3   Evaluator
The task of the evaluator (Figure 1, bottom portion) is to execute the supervised models (or algorithms) and compare their performance (in detecting malicious and benign records) for four different configurations of features: (1) using all features; (2) using selected features (where selection is done by a feature selection algorithm); (3) using domain knowledge infused features; and (4) using the newly constructed features C, I, and A from the domain knowledge infused features. In addition, the evaluator performs two further tests, the Explainability Test and the Generalizability Test, described below.
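The two-step mapping of Equations 1-2 and the C/I/A aggregation of Equations 3-5 can be sketched as follows. This is a minimal illustration only: the attack-to-CIA table is a subset of Table 1, while the feature-to-attack map, sign values, and record values are hypothetical stand-ins rather than the paper's exact data.

```python
# Equation 2: f(attack) -> C, I, or A (subset of Table 1).
ATTACK_TO_CIA = {
    "DDoS": {"A"},
    "Heartbleed": {"C"},
    "Web Attack": {"C", "I", "A"},
}

# Equation 1: f(feature) -> attack(s) for which it is a top-3 feature
# (hypothetical subset of the Table 2 ranking).
FEATURE_TO_ATTACKS = {
    "Average Packet Size": ["DDoS"],
    "Bwd Packet Length Std": ["DDoS", "Heartbleed"],
}

def cia_labels(feature):
    """Compose Equations 1 and 2: the CIA compromises tied to a feature."""
    labels = set()
    for attack in FEATURE_TO_ATTACKS[feature]:
        labels |= ATTACK_TO_CIA[attack]
    return labels

def aggregate_cia(record, signs):
    """Equations 3-5: fold scaled domain features into C, I, A scores.

    record -- feature -> value, already scaled to [0, 1]
    signs  -- feature -> +1 or -1 (sign of the correlation coefficient V_i)
    A feature tied to k CIA components contributes value/k to each of them.
    """
    scores = {"C": 0.0, "I": 0.0, "A": 0.0}
    for feature, value in record.items():
        labels = cia_labels(feature)
        for label in labels:
            scores[label] += signs[feature] * value / len(labels)
    return scores

record = {"Average Packet Size": 0.8, "Bwd Packet Length Std": 0.5}
signs = {"Average Packet Size": 1, "Bwd Packet Length Std": -1}
print(aggregate_cia(record, signs))
```

Note the negative sign deducts the contribution of negatively correlated features, and the value split across components mirrors the rule for features liable for more than one compromise.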
Table 2: Mapping of features to related components of the CIA principles

Feature                      | Description                                                                          | In top 3 features of attack                                              | Renamed feature
ACK Flag Count               | Number of packets with ACK                                                           | SSH-Patator                                                              | ACK Flag Count - C
Active Mean                  | Mean time a flow was active before becoming idle                                     | DoS Slowhttp, Infiltration                                               | Active Mean - AC
Active Min                   | Minimum time a flow was active before becoming idle                                  | DoS Slowhttp                                                             | Active Min - A
Average Packet Size          | Average size of packet                                                               | DDoS                                                                     | Avg Packet Size - A
Bwd IAT Mean                 | Mean time between two packets sent in the backward direction                         | DoS slowloris                                                            | Bwd IAT Mean - A
Bwd Packet Length Std        | Standard deviation size of packet in backward direction                              | DoS Hulk, DoS GoldenEye, DDoS, Heartbleed                                | Bwd Packet Length Std - AC
Bwd Packets/s                | Number of backward packets per second                                                | Bot, PortScan                                                            | Bwd Packets/s - CIA
Fwd IAT Mean                 | Mean time between two packets sent in the forward direction                          | DoS slowloris                                                            | Fwd IAT Mean - A
Fwd IAT Min                  | Minimum time between two packets sent in the forward direction                       | DoS slowloris, DoS GoldenEye                                             | Fwd IAT Min - A
Fwd Packet Length Mean       | Mean size of packet in forward direction                                             | Benign, Bot                                                              | Fwd Packet Length Mean - CIA
Fwd Packets/s                | Number of forward packets per second                                                 | FTP-Patator                                                              | Fwd Packets/s - C
Fwd PSH Flags                | Number of times the PSH flag was set in packets travelling in the forward direction (0 for UDP) | FTP-Patator                                                   | Fwd PSH Flags - C
Flow Duration                | Duration of the flow in microseconds                                                 | DDoS, DoS slowloris, DoS Hulk, DoS Slowhttp, Infiltration, Heartbleed    | Flow Duration - AC
Flow IAT Mean                | Mean inter-arrival time of packets                                                   | DoS GoldenEye                                                            | Flow IAT Mean - A
Flow IAT Min                 | Minimum inter-arrival time of packets                                                | DoS GoldenEye                                                            | Flow IAT Min - A
Flow IAT Std                 | Standard deviation of inter-arrival time of packets                                  | DDoS, DoS Slowhttp, DoS Hulk                                             | Flow IAT Std - A
Init Win bytes forward       | Total number of bytes sent in the initial window in the forward direction            | Web Attack                                                               | Init Win Bytes Fwd - CIA
PSH Flag Count               | Number of packets with PUSH                                                          | PortScan                                                                 | PSH Flag Count - C
Subflow Fwd Bytes            | Average number of bytes in a sub-flow in the forward direction                       | Benign, SSH-Patator, Web Attack, Bot, Heartbleed, Infiltration           | Subflow Fwd Bytes - CIA
SYN Flag Count               | Number of packets with SYN                                                           | FTP-Patator                                                              | SYN Flag Count - C
Total Length of Fwd Packets  | Total size of packets in forward direction                                           | Benign, SSH-Patator, Web Attack, Bot, Heartbleed, Infiltration           | Total Length of Fwd Packets - CIA


1. Explainability Test: The purpose of this test is to discover the comparative advantages or disadvantages of incorporating domain knowledge in the experiment; and

2. Generalizability Test: The purpose of this test is to analyze how different approaches perform in unknown or unseen attack detection. We delete all training records for a particular attack, one at a time, and investigate the performance of the model on the same test set, which includes records from the unknown or unseen attack. Details of these tests are described in Section 4.

3.4   Algorithms

We use six different algorithms for predicting malicious records: one is a probabilistic classifier based on the Naive Bayes theorem, and the remaining five are supervised “black box” models. The algorithm descriptions are taken from our previous work (Islam et al. 2019).

Artificial Neural Network (ANN): An Artificial Neural Network is a non-linear model capable of mimicking human brain functions to some extent. It consists of an input layer, one or more hidden layers, and an output layer. Each layer consists of multiple neurons that help to learn complex patterns, and each subsequent layer learns more abstract concepts before everything finally merges into the output layer.

Support Vector Machine (SVM): The Support Vector Machine (SVM) was first introduced by (Boser, Guyon, and Vapnik 1992) and has been used for many supervised classification tasks. In addition to linear classification, the model can learn an optimal hyperplane that separates instances of different classes using a highly non-linear implicit mapping of input vectors into a high-dimensional feature space (i.e., the kernel trick) (Hooman et al. 2016). When the number of samples is very high (e.g., millions), however, it becomes very costly in terms of computation time.

Random Forest (RF): A Random Forest is a tree-based ensemble technique developed by (Breiman 2001) for supervised classification tasks. In RF, many trees are generated from bootstrapped subsamples (i.e., random samples drawn with replacement) of the training data. In each tree, the splitting attribute is chosen from a smaller random subset of that tree's attributes (i.e., the chosen split attribute is the best among that random subset). This randomness helps to make the trees less correlated, as correlated trees make the same kinds of prediction errors and can overfit the model. With less correlated trees, a few trees may be wrong but many others will be right, so as a group the trees move in the right direction, and the outputs from all the trees are averaged for the final prediction.

Extra Trees (ET): Extremely Randomized Trees, or Extra Trees (ET), is a tree-based ensemble technique similar to RF. The only difference is in the process of selecting the splitting attribute and determining the threshold (cutoff) value: both are chosen in an extremely random fashion (Islam, Ghafoor, and Eberle 2018). As in RF, a random subset of features is considered for the split selection, but instead of choosing the most discriminative cutoff threshold, ET sets candidate thresholds to random values, and the best of these randomly chosen values is used as the splitting rule (ens 2019) for a particular node. Unlike a single decision tree (DT), RF has multiple trees, which leads to reduced variance; however, bias is introduced, as a subset of the whole feature set is chosen for each tree instead of all features. ET was proposed by (Geurts, Ernst, and Wehenkel 2006) and has achieved state-of-the-art performance in some anomaly/intrusion detection research (Islam 2018), (Islam, Eberle, and Ghafoor 2018), (Islam, Ghafoor, and Eberle 2018).

Gradient Boosting (GB): (Friedman 2001) generalized AdaBoost to a Gradient Boosting algorithm to allow a variety of loss functions. Here the shortcomings of weak learners are identified using the gradient, instead of the highly weighted data points used in AdaBoost. Gradient Boosting (GB) is a classification/regression model in the form of an ensemble of weak prediction models, such as decision trees. It works sequentially, like the AdaBoost algorithm, in that each subsequent model tries to minimize the loss function (e.g., mean squared error) by paying special attention to instances that were hard to get right in the previous model.

Naive Bayes (NB): The Naive Bayes algorithm is based on Bayes' Theorem, which was formulated in the seventeenth century. It is a supervised, simple, and comparatively fast algorithm based on statistics. In a real-world problem, it is unusual for all features to be independent; however, Naive Bayes assumes conditional independence among features and works surprisingly well in many cases. It also requires only a small amount of training data to estimate the necessary parameters (nai 2019). The independence assumption avoids a lot of computation (e.g., computing the conditional probability of each feature with respect to the others) and makes it a faster algorithm; it also helps in Big Data problems where the curse of dimensionality is a concern, since the class-conditional feature distributions can be estimated independently as one-dimensional distributions. However, NB is a bad estimator of probability (Zhang 2004). We use Bernoulli Naive Bayes (Manning, Raghavan, and Schütze 2010) for our experiments, where each feature is assumed to be binary-valued.

3.5   Data

In this work, we use CICIDS2017, a recent and comprehensive IDS dataset published in 2017 that covers the necessary criteria with common, updated attacks such as DoS, DDoS, Brute Force, XSS, SQL Injection, Infiltration, PortScan, and Botnet. This dataset was created to eliminate the shortcomings (e.g., lack of traffic diversity and volume, lack of variety of attacks, anonymized packet information, and being out of date) of previous well-known IDS datasets such as DARPA98, KDD99, ISC2012, ADFA13, DEFCON, CAIDA, LBNL, CDX, Kyoto, Twente, and UMASS, which date back to 1998. It is a labeled dataset containing 78 network traffic features (some features are listed in Table 2) extracted and calculated from pcap files using the CICFlowMeter software (Lashkari et al. 2017) for all benign and intrusive flows (Sharafaldin, Lashkari, and Ghorbani 2018). This new IDS dataset includes seven common, updated families of attacks satisfying real-world criteria, and is publicly available at https://www.unb.ca/cic/datasets/ids-2017.html.

Each record of the dataset is labeled with the particular type of attack. We create a new target feature, “Class”, setting its value to 1 for all records labeled as any of the 14 types of attacks, as those are malicious/intrusive, and to 0 for the remaining records, as those are benign. Overall, the dataset contains 2,830,743 records covering 14 different attacks; 2,273,097 records are benign and 557,646 are malicious. Approximately 24.5% of the records are malicious, giving us an imbalanced dataset, which impacts the performance (e.g., bias toward the majority class) of some machine learning algorithms. To overcome this problem, we use the well-known oversampling technique SMOTE (Chawla et al. 2002) to oversample the minority class. In (Dong and Wang 2016), the authors use SMOTE to overcome the same issue in their empirical comparison of traditional vs. deep learning-based IDS. SMOTE creates synthetic samples rather than just oversampling with replacement: the minority class is oversampled by creating new examples along the line segments joining any or all of the k nearest minority samples, where k is chosen based on the percentage of oversampling required (i.e., a hyperparameter of the algorithm) (Chawla et al. 2002).

                              4   Experiments
4.1   Experimental Setup

We execute the experiments on a GPU-enabled Linux machine with 12GB of RAM and a Core i7 processor. All supervised machine learning algorithms are implemented using the Python-based Scikit-learn (sci 2019) library; in addition, we use TensorFlow (ten 2019) for the Artificial Neural Network. Due to resource limitations, instead of using the whole dataset, we take a stratified sample of the data that is big enough (i.e., 300K records) for a single GPU-enabled commodity machine. We make the sampled dataset available to the research community at (sam 2019). Furthermore, we use 70% of the data for training the models and keep the remaining 30% as a holdout set to test them, confirming that the target class has the same ratio in both sets. To avoid the adverse effect of class imbalance on classification performance, we re-sample the minority class of the training set using SMOTE (Chawla et al. 2002) to balance the dataset. However, we do not re-sample the test set, as real-world data is skewed and oversampling the test set could exhibit overoptimistic performance.

   We run all supervised machine learning algorithms using four different approaches:

1. With all features: using all 78 features of the dataset without discarding any features.

2. With selected features: using a Random Forest Regressor (adapting the work of (Sharafaldin, Lashkari, and Ghorbani 2018)) to select important features of the dataset, giving us 50 important features having a nonzero influence on the target variable.

3. With domain knowledge infused features: using infused domain knowledge features (see Section 3.2); we use the term domain features interchangeably as a short form.

4. With newly constructed features from domain knowledge infused features: using the newly constructed features C, I, and A (see Section 3.2) from domain knowledge infused features; we use the term domain features-constructed interchangeably as a short form.

   The following are the two types of experiments run using each of the four feature settings.
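The SMOTE mechanism described above, which creates synthetic minority samples along the line segments joining nearby minority samples, can be sketched as follows. This is a simplified stand-in, not the reference implementation of (Chawla et al. 2002); the function name and parameters are illustrative only:

```python
import random

def smote_like(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic points, each lying on the line segment between a
    randomly chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x by squared Euclidean distance, excluding x itself
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random position along the segment x -> nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Because each synthetic point is a convex combination of two real minority samples, the new points stay inside the region the minority class already occupies, rather than being exact duplicates as in plain oversampling with replacement.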
Table 3: Performance using all features vs selected features

         Alg.         Acc.     Prec.    Rec.       F-score   AUC
         RF-A         0.9987   0.9965   0.9971     0.9968    0.9997
         RF-S         0.9986   0.9962   0.9966     0.9964    0.9997
         Difference   0.0002   0.0003   0.0006     0.0005    0.0000
         ET-A         0.9981   0.9951   0.9951     0.9951    0.9994
         ET-S         0.9980   0.9950   0.9950     0.9950    0.9994
         Difference   0.0001   0.0002   0.0001     0.0001    0.0000
         ANN-A        0.9802   0.9155   0.9908     0.9516    0.9984
         ANN-S        0.9740   0.8929   0.9860     0.9372    0.9968
         Difference   0.0062   0.0226   0.0047     0.0145    0.0017
         SVM-A        0.9109   0.6996   0.9595     0.8092    0.9780
         SVM-S        0.8869   0.6433   0.9565     0.7692    0.9746
         Difference   0.0239   0.0563   0.0030     0.0400    0.0034
         GB-A         0.9960   0.9854   0.9944     0.9899    0.9995
         GB-S         0.9957   0.9840   0.9945     0.9892    0.9996
         Difference   0.0003   0.0014   (0.0001)   0.0007    (0.0001)
         NB-A         0.7753   0.4371   0.4888     0.4615    0.8601
         NB-S         0.7621   0.4144   0.5019     0.4539    0.8508
         Difference   0.0132   0.0228   (0.0131)   0.0076    0.0093

Table 4: Performance using domain features vs constructed features

         Alg.         Acc.     Prec.    Rec.       F-score   AUC
         RF-D1        0.9973   0.9920   0.9945     0.9932    0.9993
         RF-D2        0.9511   0.9446   0.7985     0.8654    0.9572
         Difference   0.0463   0.0475   0.1960     0.1278    0.0421
         ET-D1        0.9969   0.9913   0.9932     0.9923    0.9989
         ET-D2        0.9756   0.9321   0.9448     0.9384    0.9954
         Difference   0.0214   0.0592   0.0483     0.0538    0.0036
         ANN-D1       0.9497   0.8300   0.9362     0.8799    0.9865
         ANN-D2       0.5952   0.3241   0.9721     0.4862    0.7921
         Difference   0.3544   0.5059   (0.0359)   0.3937    0.1945
         SVM-D1       0.8489   0.5747   0.8968     0.7005    0.9252
         SVM-D2       0.7195   0.3739   0.6281     0.4687    0.7886
         Difference   0.1294   0.2008   0.2687     0.2318    0.1366
         GB-D1        0.9881   0.9513   0.9904     0.9705    0.9986
         GB-D2        0.9230   0.7692   0.8701     0.8165    0.9789
         Difference   0.0652   0.1821   0.1204     0.1539    0.0198
         NB-D1        0.7982   0.4881   0.5028     0.4953    0.8553
         NB-D2        0.5591   0.2687   0.7195     0.3913    0.6591
         Difference   0.2391   0.2194   (0.2167)   0.1040    0.1962

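The Acc., Prec., Rec., and F-score columns reported in these tables follow the standard binary-classification definitions, sketched here with illustrative confusion-matrix counts (not taken from the paper's results):

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # fraction of flagged records that are truly malicious
    recall = tp / (tp + fn)      # fraction of malicious records that are caught
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score
```

On a skewed test set like the one used here, accuracy alone can look strong while precision or recall lags, which is why the tables report all four alongside AUC.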
4.2   Explainability Test
For this test, we run six supervised algorithms RF, ET, SVM,               Before starting the comparison of results from our ap-
GB, ANN, and NB using the four described feature settings               proach with all features (i.e., baseline), we seek the best
and report the results Section 5.1. Unlike NB, other classi-            feature setting among two domain related feature settings
fiers are “black box” in nature. NB is a probabilistic classi-          of our proposed approach. In other words, in our attempt to
fier based on Bayes Theorem with strong conditional inde-               find the better approach among using domain knowledge in-
pendence assumption among features. The main purpose to                 fused features vs newly constructed features (C, I, and A)
include NB in the experiment is the generalizability test.              from domain knowledge infused features, we find that, in
                                                                        almost all cases, the model with domain knowledge infused
4.3   Generalizability Test                                             features (denoted with an appended -D1, for instance RF-
                                                                        D1) performs better than the counterpart (see Table 4). Al-
For testing the generalizability of the approach, we train the          though for RF, the maximum performance gap is .2 in the
classifier without the representative of a particular attack,           recall, for ET that gap is .048 with a similar precision. As
and test it with the presence of the representative of that par-        the domain features (22 features) contain a lot more detail
ticular attack, in order to classify it malicious/benign. To be         than the newly constructed features C, I, and A (3 features),
more specific, we delete all records of a particular attack             it loses few details. In terms of individual algorithms, RF
from the training set, train the classifier with the records of         is again a clear winner this time using domain features. Al-
the remaining 13 attacks, and test the classifier with all 14           though NB and ANN exhibit better recall using constructed
attacks. We report the percentage of deleted attacks that are           features, it comes with compromises in precision. So, over-
correctly detected as malicious (see Section 5.2). We repeat            all we consider the domain features setting as the best over
this one by one for all 14 attacks. We make the source code             the constructed features.
available to the research community to replicate the experi-               While we know the best feature setting is the all features,
ments at (pro 2019).                                                    as shown in the comparison of all features vs selected fea-
                                                                        tures in the Table 3), we also know the best feature setting
                              5       Results                           domain features from domain features vs constructed fea-
The following sections discuss results from the two cate-               tures (see Table 4). So we further compare the performance
gories of experiments previously described.                             of models using the two best settings all features (i.e., base-
                                                                        line) vs domain features. We find that, among all models, RF
                                                                        using all features (denoted with an appended -A, for instance
5.1   Findings from Explainability Test                                 RF-A) performs better than all other algorithms (see Ta-
Comparing the performance using all features vs selected                ble 5 and Figure 3). Interestingly, RF using domain knowl-
features, Table 3 shows that models using all features (de-             edge infused features (denoted with an appended -D1, for
noted with an appended -A, for instance RF-A) tend to show              instance RF-D1) also shows promising performance. The
better results in terms of all performance metrics. However,            difference between these two in terms of any performance
while the difference with the selected features setting is neg-         metrics is negligible (<.005). In fact, the result of RF us-
ligible (<.0007 for RF) for any performance metric, that                ing the domain knowledge infused feature settings is better
might be a result of the elimination of features with little            than what (Sharafaldin, Lashkari, and Ghorbani 2018) re-
significance. In addition, Random Forest outperforms other              ports using the same dataset. The slight improvement might
algorithms SVM, ET, GB, ANN, and NB under this feature                  stem from the experimental settings (e.g., training test set
setting (i.e., using all features). So we consider the results          split, re-sampling techniques). Furthermore, in the domain
using all features as a baseline to compare against our pro-            knowledge infused feature setting we are using only 22 fea-
posed approach.                                                         tures out of 78 total, where each feature indicates the asso-
ciated compromises (e.g., confidentiality, integrity, or avail-
ability), capable of producing better explainable and inter-       Table 5: Performance using all features vs domain features
                                                                          Alg.         Acc.       Prec.      Rec.       F-score    AUC
pretable results compared to the counterpart. The prediction
for a particular sample can be represented as:                            RF-A
                                                                          RF-D1
                                                                                       0.9987
                                                                                       0.9973
                                                                                                  0.9965
                                                                                                  0.9920
                                                                                                             0.9971
                                                                                                             0.9945
                                                                                                                        0.9968
                                                                                                                        0.9932
                                                                                                                                   0.9997
                                                                                                                                   0.9993
                           G                                              Difference   0.0014     0.0045     0.0027     0.0036     0.0004
                           X                                              ET-A         0.9981     0.9951     0.9951     0.9951     0.9994
             P (D) = b +          contribution(g)           (6)           ET-D1        0.9969     0.9913     0.9932     0.9923     0.9989
                                                                          Difference   0.0011     0.0038     0.0020     0.0029     0.0004
                            g=0                                           ANN-A        0.9802     0.9155     0.9908     0.9516     0.9984
                                                                          ANN-D1       0.9497     0.8300     0.9362     0.8799     0.9865
where b is the model average and g is the generalized do-                 Difference   0.0305     0.0855     0.0546     0.0717     0.0119
                                                                          SVM-A        0.9109     0.6996     0.9595     0.8092     0.9780
main feature (e.g., ACK Flag Count - C), P(D) is the prob-                SVM-D1       0.8489     0.5747     0.8968     0.7005     0.9252
ability value of the decision. Instead of using contributions             Difference   0.0619     0.1249     0.0627     0.1087     0.0528
                                                                          GB-A         0.9960     0.9854     0.9944     0.9899     0.9995
from each of the domain features, we can express the output               GB-D1        0.9881     0.9513     0.9904     0.9705     0.9986
in terms of the contribution from each element of the do-                 Difference   0.0079     0.0341     0.0039     0.0194     0.0009
                                                                          NB-A         0.7753     0.4371     0.4888     0.4615     0.8601
main concept. For that, we need to aggregate contributions                NB-D1        0.7982     0.4881     0.5028     0.4953     0.8553
                                                                          Difference   (0.0229)   (0.0510)   (0.0140)   (0.0338)   0.0048
from all features into three groups (C, I, and A). This will en-
able an analyst to understand the nature of the attack more
quickly (Figure 2). For instance, when the greater portion of
a feature contribution for a sample is from features tagged
with -A (i.e., Availability) then it might be a DDoS attack,
which usually comes with very high compromises in avail-
ability of data or service. We use the iml package from the
programming language R to generate the breakdown of fea-
ture contributions of a particular sample’s prediction (Figure
2).




                                                                   Figure 3: Performance deviations of using domain knowl-
                                                                   edge infused features from baseline


                                                                   domain knowledge infused feature setting perform slightly
                                                                   worse than the all feature setting, it comes with an explain-
                                                                   able features set with the added capability of identifying a
                                                                   few unknown attacks.
                                                                      To reiterate, the constructed features set consists of only
                                                                   three features (C, I, and A) constructed from aggregating do-
                                                                   main knowledge infused features. As this feature setting is
Figure 2: Breakdown of the prediction for a random sample          composed of only three features, it is an extreme generaliza-
in terms of C, I, and A.                                           tion of features and it loses a lot of details of data. However,
                                                                   this time it comes with an exceptional capability which we
                                                                   realize after applying a statistical approach (Naive Bayes)
5.2   Findings from Generalizability Test                          on the dataset. We find that (see Table 7), for NB, the newly
Recall that the purpose of this test is to test the resiliency     constructed feature setting is best as NB is also able to detect
against unknown attacks. First, we use Random Forest (RF),         unknown attacks with similar accuracy compared to other
the best performing algorithm so far, using all four settings      feature settings by RF in Table 6. The most interesting thing
of features. As shown in Table 6 and Figure 4, we see that ex-     about this capability is that this feature set is composed of
cept for the constructed feature settings (denoted by Cons.),      only three features (C, I, and A), takes comparatively less
the performances of other feature settings (all, selected, and     time to execute, and comes with the added benefit of very
domain) are similar. The constructed features fail to provide      good explainability. Once the prediction is expressed as a
comparable performance for RF as it has only three fea-            percentage of influence from each of C, I, and A, the analyst
tures and loses data details (i.e., too much generalization).      would be able to perceive the level of compromise more in-
Surprisingly, a few unknown attacks are only detectable us-        tuitively from the hints about the type of attack (e.g., DDoS
ing the domain knowledge infused features. For instance,           will show a high percentage of A—compromise in Avail-
Web Attack Sql Injection is detected as suspicious only by         ability).
domain knowledge infused features. Overall, although the              However, from Table 3, 4, and 5, we can see that NB’s
 Table 6: Performance of unseen attack detection using RF             Table 7: Performance of unseen attack detection using NB
     Attack           Count   All(%)   Sel.(%)   Dom.(%)   Cons.(%)        Attack           Count   All(%)   Sel.(%)   Dom.(%)   Cons.(%)

     Ddos             4184    99.90    99.90     99.90     62.86           Ddos             4184    76.94    74.59     83.22     100
     PortScan         4973    99.90    99.94     99.94     66.28           PortScan         4973    0.18     0.18      6.64      100
     Bot              54      77.78    77.78     75.93     22.22           Bot              54      0.00               0.00      100
     Infiltration     1       100      100       100       0.00            Infiltration     1       0.00     0.00      0.00      100
     Web Attack-BF    49      95.92    95.92     91.84     75.51           Web Attack-BF    49      6.12     6.12      83.67     100
     Web Attack-XSS   23      95.65    95.65     91.30     65.22           Web Attack-XSS   23      0.00     0.00      95.65     100
     Web Attack-Sql   1       0.00     0.00      100       0.00            Web Attack-Sql   1       100      100       0.00      100
     FTP-Patator      251     99.20    100       99.20     81.67           FTP-Patator      251     0.00     0.00      0.00      48.61
     SSH-Patator      198     98.99    99.49     96.97     75.76           SSH-Patator      198     0.00     0.00      0.00      100
     DoS slowloris    188     99.47    99.47     98.94     61.70           DoS slowloris    188     25.53    30.85     38.30     66.49
     DoS Slowloris    174     99.43    99.43     96.55     31.61           DoS Slowloris    174     71.26    79.89     75.29     78.16
     Dos Hulk         7319    99.71    99.73     99.34     96.19           Dos Hulk         7319    67.78    67.78     67.03     35.37
     DoS GoldenEye    314     99.36    99.68     98.41     85.03           DoS GoldenEye    314     50.32    65.29     47.13     96.82
     Heartbleed       1       100      100       100       100             Heartbleed       1       100      100       0.00      0.00




Figure 4: Unknown attack detection rate using Random For-
est                                                                   Figure 5: Unknown attack detection rate using Naive Bayes

performance comes at a cost of precision and recall (i.e., it produces comparatively more false positives and false negatives). In addition, NB is a poor probability estimator of the predicted output (Zhang 2004). However, NB with the constructed features setting could be recommended as an additional IDS for quick interpretation of huge volumes of traffic data, provided its decisions are treated as tentative and subject to a further sanity check. We also calculated the average time taken by each algorithm across all four feature settings and found that NB is the fastest algorithm: RF, ET, GB, ANN, and SVM take 2.80, 9.27, 77.06, 15.07, and 444.50 times more execution time than NB, respectively. Besides, the best algorithm, RF (1st in terms of the performance metric and 2nd in terms of execution time), can be executed in parallel using Apache Spark for far better run-time (Chen et al. 2016), making it highly scalable to big data problems.
   Overall, domain knowledge infusion provides better explainability with negligible compromises in performance. In addition, the generalization provides better execution time and resiliency against unknown attacks.

                6    Conclusion and Future Work

AI-based approaches have become an integral part of security solutions due to their potential for handling "Big Data" and diverse network traffic data. Cybercrime-related damages continue to rise, and network intrusions are a key tactic. Although an AI-based IDS detects intrusions at accelerated speed, the response still proceeds at human speed when there is a human in the loop. The lack of explainability of an AI-based model is a key reason for this bottleneck. To mitigate this problem, we infuse the CIA principle (i.e., domain knowledge) into the AI-based black box model for better explainability and generalizability of the model. Our experimental results show realizable successes in better explainability on a comprehensive, up-to-date, and real-world network intrusion dataset. In addition, the infused domain knowledge helps in detecting unknown attacks because it generalizes the problem, which ultimately opens the door to accommodating big data.
   Going forward, finding an optimal solution to segregate the contribution of each participating feature (sample-wise), while accounting for interactions among features (correlations among features complicate explanations), will aid in better explainability of individual predictions (i.e., per sample). Besides, to ensure trust, estimating the level of uncertainty in the model will be another extension of this work. Some open challenges surrounding explainability and interpretability remain, such as agreement on what an explanation is and for whom, a formalism for the explanation, and quantifying the human comprehensibility of an explanation.

                       Acknowledgment

Thanks to Tennessee Tech's Cyber-security Education, Research and Outreach Center (CEROC) for supporting this research.
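The execution-time gap reported above is a direct consequence of NB's closed-form training: a single pass over the data collecting per-class feature statistics, with no iterative optimization as in GB, ANN, or SVM. The sketch below is a minimal Gaussian Naive Bayes for illustration only — the experiments themselves use the scikit-learn implementations, and the function names and toy flow-feature values here are invented:

```python
import math

def fit_gnb(X, y):
    # One pass: per-class priors plus per-feature mean/variance.
    # This closed-form fit (no iterative optimization) is why NB
    # trains far faster than RF, ET, GB, ANN, or SVM.
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    model = {}
    for label, rows in by_class.items():
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        var = [max(sum((v - m) ** 2 for v in c) / len(c), 1e-9)
               for c, m in zip(cols, means)]
        model[label] = (len(rows) / len(X), means, var)
    return model

def predict_gnb(model, x):
    # Pick the class with the highest log-posterior under per-feature
    # Gaussian likelihoods (the "naive" independence assumption).
    best, best_lp = None, -math.inf
    for label, (prior, means, var) in model.items():
        lp = math.log(prior)
        for v, m, s2 in zip(x, means, var):
            lp -= 0.5 * math.log(2 * math.pi * s2) + (v - m) ** 2 / (2 * s2)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Toy flows, e.g. (duration, packets/s) -- invented values.
X = [(0.2, 10.0), (0.3, 12.0), (5.0, 900.0), (6.0, 1100.0)]
y = ["benign", "benign", "dos", "dos"]
model = fit_gnb(X, y)
```

Because prediction is likewise a handful of arithmetic operations per feature, such a model can score traffic at line rate, which is what makes NB attractive as a tentative first-pass IDS ahead of a slower, more accurate model.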
                         References

Ando, S. 2019. Interpreting random forests. http://blog.datadive.net/interpreting-random-forests/.
Bach, S.; Binder, A.; Montavon, G.; Klauschen, F.; Müller, K.-R.; and Samek, W. 2015. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS One 10(7):e0130140.
Boser, B. E.; Guyon, I. M.; and Vapnik, V. N. 1992. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 144–152. ACM.
Breiman, L. 2001. Random forests. Machine Learning 45(1):5–32.
Chandrasekaran, B.; Tanner, M. C.; and Josephson, J. R. 1989. Explaining control strategies in problem solving. IEEE Intelligent Systems (1):9–15.
Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; and Kegelmeyer, W. P. 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357.
Chen, J.; Li, K.; Tang, Z.; Bilal, K.; Yu, S.; Weng, C.; and Li, K. 2016. A parallel random forest algorithm for big data in a Spark cloud computing environment. IEEE Transactions on Parallel and Distributed Systems 28(4):919–933.
Datta, A.; Sen, S.; and Zick, Y. 2016. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In 2016 IEEE Symposium on Security and Privacy (SP), 598–617. IEEE.
DeJong, G. 1981. Generalizations based on explanations. In IJCAI, volume 81, 67–69.
Dong, B., and Wang, X. 2016. Comparison deep learning method to traditional methods using for network intrusion detection. In 2016 8th IEEE International Conference on Communication Software and Networks (ICCSN), 581–585. IEEE.
Doyle, M. 2019. Don't be lulled into a false sense of security. https://www.securityroundtable.org/dont-lulled-false-sense-cybersecurity/.
2019. Ensemble methods. https://scikit-learn.org/stable/modules/ensemble.html.
Friedman, J. H. 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics 1189–1232.
Geurts, P.; Ernst, D.; and Wehenkel, L. 2006. Extremely randomized trees. Machine Learning 63(1):3–42.
Goodman, B., and Flaxman, S. 2017. European Union regulations on algorithmic decision-making and a "right to explanation". AI Magazine 38(3):50–57.
Hodo, E.; Bellekens, X.; Hamilton, A.; Dubouilh, P.-L.; Iorkyase, E.; Tachtatzis, C.; and Atkinson, R. 2016. Threat analysis of IoT networks using artificial neural network intrusion detection system. In 2016 International Symposium on Networks, Computers and Communications (ISNCC), 1–6. IEEE.
Hooman, A.; Marthandan, G.; Yusoff, W. F. W.; Omid, M.; and Karamizadeh, S. 2016. Statistical and data mining methods in credit scoring. The Journal of Developing Areas 50(5):371–381.
Islam, S. R.; Eberle, W.; Bundy, S.; and Ghafoor, S. K. 2019. Infusing domain knowledge in AI-based "black box" models for better explainability with application in bankruptcy prediction. arXiv preprint arXiv:1905.11474.
Islam, S. R.; Eberle, W.; and Ghafoor, S. K. 2018. Credit default mining using combined machine learning and heuristic approach. arXiv preprint arXiv:1807.01176.
Islam, S. R.; Ghafoor, S. K.; and Eberle, W. 2018. Mining illegal insider trading of stocks: A proactive approach. In 2018 IEEE International Conference on Big Data (Big Data), 1397–1406. IEEE.
Islam, S. R. 2018. An efficient technique for mining bad credit accounts from both OLAP and OLTP. Ph.D. Dissertation, Tennessee Technological University.
Javaid, A.; Niyaz, Q.; Sun, W.; and Alam, M. 2016. A deep learning approach for network intrusion detection system. In Proceedings of the 9th EAI International Conference on Bio-inspired Information and Communications Technologies (formerly BIONETICS), 21–26. ICST (Institute for Computer Sciences, Social-Informatics and ...).
Kabul, I. K. 2018. Explainable AI. https://www.kdnuggets.com/2018/11/interpretability-trust-ai-machine-learning.html.
Kim, J.; Kim, J.; Thu, H. L. T.; and Kim, H. 2016. Long short term memory recurrent neural network classifier for intrusion detection. In 2016 International Conference on Platform Technology and Service (PlatCon), 1–5. IEEE.
Kim, B.; Wattenberg, M.; Gilmer, J.; Cai, C.; Wexler, J.; Viegas, F.; and Sayres, R. 2017. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). arXiv preprint arXiv:1711.11279.
Lashkari, A. H.; Draper-Gil, G.; Mamun, M. S. I.; and Ghorbani, A. A. 2017. Characterization of Tor traffic using time based features. In ICISSP, 253–262.
Lei, T.; Barzilay, R.; and Jaakkola, T. 2016. Rationalizing neural predictions. arXiv preprint arXiv:1606.04155.
Li, Z.; Sun, W.; and Wang, L. 2012. A neural network based distributed intrusion detection system on cloud platform. In 2012 IEEE 2nd International Conference on Cloud Computing and Intelligence Systems, volume 1, 75–79. IEEE.
Lipovetsky, S., and Conklin, M. 2001. Analysis of regression in game theory approach. Applied Stochastic Models in Business and Industry 17(4):319–330.
Lipton, Z. C. 2016. The mythos of model interpretability. arXiv preprint arXiv:1606.03490.
Lundberg, S. M., and Lee, S.-I. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, 4765–4774.
Lundberg, S. 2019. SHAP vs LIME. https://github.com/slundberg/shap/issues/19.
Manning, C.; Raghavan, P.; and Schütze, H. 2010. Introduction to information retrieval. Natural Language Engineering 16(1):100–103.
Matt, B., et al. 2006. Introduction to Computer Security. Pearson Education India.
Miller, T. 2018. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence.
Mitchell, T. M.; Keller, R. M.; and Kedar-Cabelli, S. T. 1986. Explanation-based generalization: A unifying view. Machine Learning 1(1):47–80.
Montavon, G.; Samek, W.; and Müller, K.-R. 2018. Methods for interpreting and understanding deep neural networks. Digital Signal Processing 73:1–15.
2019. Naive Bayes. https://scikit-learn.org/stable/modules/naive_bayes.html.
2019. NetFlowMeter. http://netflowmeter.ca/netflowmeter.html.
2019. Domain-knowledge-aided code. https://github.com/SheikhRabiul/domain-knowledge-aided-explainable-ai-for-intrusion-detection-and-response.
Rankin, K. 2019. The dark secret at the heart of AI. https://www.technologyreview.com/s/604087/the-dark-secret-at-the-heart-of-ai/.
Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144. ACM.
2019. Domain-knowledge-aided dataset. https://github.com/SheikhRabiul/domain-knowledge-aided-explainable-ai-for-intrusion-detection-and-response/tree/master/data/combined_sampled.zip.
2019. Scikit-learn: Machine learning in Python. https://scikit-learn.org/stable/.
Sharafaldin, I.; Lashkari, A. H.; and Ghorbani, A. A. 2018. Toward generating a new intrusion detection dataset and intrusion traffic characterization. In ICISSP, 108–116.
Shone, N.; Ngoc, T. N.; Phai, V. D.; and Shi, Q. 2018. A deep learning approach to network intrusion detection. IEEE Transactions on Emerging Topics in Computational Intelligence 2(1):41–50.
Shrikumar, A.; Greenside, P.; and Kundaje, A. 2017. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, volume 70, 3145–3153. JMLR.org.
Štrumbelj, E., and Kononenko, I. 2014. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems 41(3):647–665.
Swartout, W. R., and Moore, J. D. 1993. Explanation in second generation expert systems. In Second Generation Expert Systems. Springer. 543–585.
Swartout, W. R. 1985. Review of Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project, by B. G. Buchanan and E. H. Shortliffe (Addison-Wesley, Reading, MA, 1984); 702 pages.
2019. TensorFlow. https://www.tensorflow.org/.
Turek, M. 2019. Explainable AI. https://www.darpa.mil/program/explainable-artificial-intelligence.
Wyden, B. 2019. Algorithmic accountability. https://www.wyden.senate.gov/imo/media/doc/Algorithmic%20Accountability%20Act%20of%202019%20Bill%20Text.pdf.
Yang, S. C.-H., and Shafto, P. 2017. Explainable artificial intelligence via Bayesian teaching. In NIPS 2017 Workshop on Teaching Machines, Robots, and Humans.
Zhang, H. 2004. The optimality of naive Bayes. AA 1(2):3.