=Paper=
{{Paper
|id=Vol-2600/paper1
|storemode=property
|title=Domain Knowledge Aided Explainable Artificial Intelligence for Intrusion Detection and Response
|pdfUrl=https://ceur-ws.org/Vol-2600/paper1.pdf
|volume=Vol-2600
|authors=Sheikh Rabiul Islam,William Eberle,Sheikh K. Ghafoor,Ambareen Siraj,Mike Rogers
|dblpUrl=https://dblp.org/rec/conf/aaaiss/IslamEGSR20
}}
==Domain Knowledge Aided Explainable Artificial Intelligence for Intrusion Detection and Response==
Sheikh Rabiul Islam, William Eberle, Sheikh K. Ghafoor, Ambareen Siraj, Mike Rogers

Department of Computer Science, Tennessee Technological University, Cookeville, USA

sislam42@students.tntech.edu, {weberle, sghafoor, asiraj, mrogers}@tntech.edu

Copyright (c) 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto, California, USA, March 23-25, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Abstract===

Artificial Intelligence (AI) has become an integral part of modern-day security solutions for its ability to learn very complex functions and to handle "Big Data". However, the lack of explainability and interpretability of successful AI models is a key stumbling block when trust in a model's prediction is critical. This leads to human intervention, which in turn results in a delayed response or decision. While there have been major advancements in the speed and performance of AI-based intrusion detection systems, the response is still at human speed when it comes to explaining and interpreting a specific prediction or decision. In this work, we infuse popular domain knowledge (i.e., the CIA principles) into our model for better explainability and validate the approach on a network intrusion detection test case. Our experimental results suggest that the infusion of domain knowledge provides better explainability as well as a faster decision or response. In addition, the infused domain knowledge generalizes the model to work well with unknown attacks and opens the path to adapting to a large stream of network traffic from numerous IoT devices.

===1 Introduction===

Most of the recent advancements in Artificial Intelligence (AI), and more specifically Machine Learning (ML), have come from complex non-linear models such as Deep Neural Networks, Ensemble Methods, and Support Vector Machines. These models are also known as "black box" models, as they are hard to interpret and explain, which arises from their inherent non-linear capabilities, multiple parameters, and very complex transformations. In addition, some algorithms require a very large number of samples (i.e., large training sets) to work efficiently, where it is very difficult to figure out what the model learned from the dataset and which portion of the dataset has more influence on the output (Kabul 2018).

Due to these challenges, black box models lack explainability and interpretability, ultimately resulting in a lack of trust in the model and its predictions, as well as possibly leading to a delayed human response or decision. This limitation also raises ethical issues in a few sensitive domains like finance (e.g., credit approval), health care (e.g., disease diagnosis), and security (e.g., identifying a target). For instance, AI and ML are becoming an integral part of security solutions and defense. To mitigate the unethical use of AI, as well as to promote the responsible use of AI systems, various governments have started taking precautionary initiatives. Recently, the European Union implemented the "right of explanation" rule, under which a user can ask for an explanation of an algorithmic decision (Goodman and Flaxman 2017). More recently, the US government introduced a new bill, the "Algorithmic Accountability Act", which would require companies to assess their machine learning systems for bias and discrimination and to take corrective measures (Wyden 2019). The U.S. Department of Defense (DoD) has identified explainability as a key stumbling block in the adoption of AI-based solutions in many of its projects, and DARPA has invested $2 billion in an Explainable Artificial Intelligence (XAI) program (Turek 2019; Rankin 2019).

Network intrusions are a common cyber-crime activity, estimated to cost around $6 trillion annually in damages by 2021 (Doyle 2019). In order to combat these attacks, an Intrusion Detection System (IDS) is a security system that monitors networks and computer systems (Hodo et al. 2016). Research in AI-based IDS has shown promising results (Hodo et al. 2016; Shone et al. 2018; Kim et al. 2016; Javaid et al. 2016; Li, Sun, and Wang 2012), and AI has become an integral part of security solutions due to its capability of learning complex, nonlinear functions and analyzing large data streams from numerous connected devices. A recent survey by (Dong and Wang 2016) suggests that deep learning-based methods are accurate and robust to a wide range of attacks and sample sizes. However, there are concerns regarding the sustainability of current approaches (e.g., intrusion detection/prevention systems) when faced with the demands of modern networks and the increasing level of human interaction (Shone et al. 2018). In the age of IoT and Big Data, an increasing number of connected devices and the associated streams of network traffic have exacerbated the problem. In addition, the delay in detection/response increases the chance of zero-day exploitation, whereby a previously unknown vulnerability is discovered by an attacker who immediately initiates an attack. Improved explainability of an AI model could quicken interpretation, making it more feasible to accelerate the response.

Explainability is the extent to which the internal working mechanism of a machine or AI system can be explained in human terms, while interpretability is the extent to which a cause and effect (i.e., an understanding of what is happening) can be observed within a system. In other words, interpretability is a form of abstract knowledge about what is happening, and explainability is the detailed, step-by-step knowledge of what is happening (Montavon, Samek, and Müller 2018; Turek 2019). While some literature treats interpretability and explainability as the same, they are actually two different traits of a model. Just because a model can be interpreted does not mean that it can be explained, and explainability needs to go beyond the algorithm (Lipton 2016).

Explainability and interpretability of a model can be pursued before, during, and after modeling. From the literature, we find that interpretability in pre-modeling (i.e., before modeling) is under-focused. (Miller 2018) argues that explainability should incorporate knowledge from other domains such as philosophy, psychology, and cognitive science, so that an explanation is not based solely on the researcher's intuition of what constitutes a good explanation. We also find that the use of domain knowledge for explainability is under-focused. In this work, we introduce a novel approach for an AI-based explainable intrusion detection and response system, and demonstrate its effectiveness by infusing a popular network security principle (the CIA principle) into the model for better explainability and interpretability of the decision.

We use a recent and comprehensive IDS dataset (CICIDS2017) which covers necessary criteria and contains common updated attacks such as DDoS, Brute Force, XSS, SQL Injection, Infiltration, Portscan, and Botnet. We infuse the CIA principle into the model, which yields a concise and interpretable set of important features. Computer security rests on the CIA principles: C stands for confidentiality (concealment of information or resources), I stands for integrity (trustworthiness of data or resources), and A stands for availability (the ability to use information or a resource as desired) (Matt et al. 2006). For instance, a compromise in confidentiality could happen through eavesdropping on unencrypted data, a compromise in integrity could happen through an unauthorized attempt to change data, and a compromise in availability could happen through a deliberate denial of access to data or a service.

We also convert the domain knowledge infused features into three features, C, I, and A, by quantitatively computing the compromise associated with each of them for each record. Output expressed in terms of this generalized and newly constructed set of features provides better explainability with negligible compromises in performance. We also found that the generalization provides more resiliency against unknown attacks.

In summary, our contributions in this work are as follows: (1) we demonstrate a method for the collection and use of domain knowledge in an intrusion detection/response system; (2) we introduce a way to bring popular security principles (e.g., the CIA principles) to aid in interpretability and explainability; (3) our experimental results show that infusing domain knowledge into "black box" models can make them better explainable with little or no compromise in performance; and (4) domain knowledge infusion increases generalizability, which leads to better resiliency against unknown attacks.

We start with a background of related work (Section 2), followed by a description of our proposed approach, an intuitive description of the standard supervised algorithms, and an overview of the dataset used in this work (Section 3). In Section 4, we describe our experiments, followed by Section 5, which discusses the results from the experiments. We conclude with limitations and future work in Section 6.

===2 Background===

Research in Explainable Artificial Intelligence (XAI) is a re-emerging field, following the earlier work of (Chandrasekaran, Tanner, and Josephson 1989), (Swartout and Moore 1993), and (Swartout 1985). Previous work focused primarily on explaining the decision process of knowledge-based systems and expert systems. The classical learning paradigm Explanation-Based Learning (EBL), introduced in the early '80s, can also be regarded as a precursor of explainability; EBL involves learning a problem-solving technique by observing and analyzing solutions to a specific problem (DeJong 1981; Mitchell, Keller, and Kedar-Cabelli 1986). The main reason for the renewed interest in XAI research has been recent advancements in AI and ML and their application to a wide range of areas, as well as concerns over unethical use and undesired biases in the models. In addition, recent concerns and laws by different governments are necessitating more research in XAI.

(Yang and Shafto 2017) use Bayesian Teaching, where a smaller subset of examples is used to train the model instead of the whole dataset. The subset of examples is chosen by domain experts, as those examples are most relevant to the problem of interest. However, choosing the right subset of examples in the real world is challenging.

(Lei, Barzilay, and Jaakkola 2016) propose an approach for sentiment analysis where a subset of text from the whole text is selected as the rationale for the prediction. The selected subset of text is concise and sufficient enough to act as a substitute for the original text, while remaining capable of making the correct prediction. Although their approach outperforms available attention-based models (from deep learning) with variable-length input (e.g., a model for document summarization), it is limited to text analysis only.

When an explanation is based on feature importance, it is necessary to keep in mind that features that are globally important may not be important in the local context, and vice versa (Ribeiro, Singh, and Guestrin 2016). (Ribeiro, Singh, and Guestrin 2016) propose a novel explanation technique capable of explaining the prediction of any classifier (i.e., in a model-agnostic way) with a locally interpretable model (i.e., in the vicinity of the instance being predicted) around the prediction. Their concern is with two issues: (1) whether the user should trust the prediction of the model and act on it, and (2) whether the user should trust a model to behave reasonably well when deployed. In addition, they involve human judgment in their experiment (i.e., a human in the loop) to decide whether to trust the model or not.

(Kim et al. 2017) propose a concept attribution-based approach (i.e., sensitivity to a concept) that provides an interpretation of a neural network's internal state in terms of human-friendly concepts. Their approach, Testing with CAVs (TCAV), quantifies the prediction's sensitivity to a high-dimensional concept. For example, given a user-defined set of examples that defines the concept "striped", TCAV can quantify the influence of "striped" on the prediction of "zebra" as a single number. To learn the high-dimensional concepts, they use Concept Activation Vectors (CAVs); CAVs are learned by training a linear classifier to distinguish between the activations produced by a concept's examples and other examples in any layer.

Most of these approaches try to find out how the prediction deviates from the base/average scenario. LIME (Ribeiro, Singh, and Guestrin 2016) tries to generate an explanation by locally (i.e., using local behavior) approximating the model with an interpretable model (e.g., decision trees, linear models). However, it is limited by its use of only a linear model to approximate the local behavior. (Lundberg and Lee 2017) propose SHAP, which unifies seven previous approaches, LIME (Ribeiro, Singh, and Guestrin 2016), DeepLIFT (Shrikumar, Greenside, and Kundaje 2017), Tree Interpreter (Ando 2019), QII (Datta, Sen, and Zick 2016), Shapley sampling values (Štrumbelj and Kononenko 2014), Shapley regression values (Lipovetsky and Conklin 2001), and Layer-wise Relevance Propagation (Bach et al. 2015), to explain the prediction of any machine learning model. While SHAP comes with theoretical guarantees about consistency and local accuracy from game theory, it needs to run many evaluations of the original model to estimate a single vector of feature importances (Lundberg 2019). ELI5 also uses the LIME algorithm internally for explanations; in addition, ELI5 is not truly model-agnostic, being mostly limited to tree-based and other parametric or linear models. Furthermore, Tree Interpreter is limited to tree-based approaches only (e.g., Random Forest, Decision Trees).

AI-based IDSs have continued to show promising performance (Hodo et al. 2016; Shone et al. 2018; Kim et al. 2016; Javaid et al. 2016; Li, Sun, and Wang 2012). (Shone et al. 2018) propose an approach combining both shallow (Random Forest) and deep learning (Auto-Encoder), capable of analyzing a wide range of network traffic and outperforming mainstream Deep Belief Networks (DBN). In a literature survey of traditional vs. deep learning IDS, (Dong and Wang 2016) suggest that deep learning-based methods provide better accuracy for a wide range of sample sizes and a variety of network traffic or attacks. However, in all of the previous work, there are still long training times and a reliance on a human operator (Shone et al. 2018).

Incorporating domain knowledge for explainability, though, has garnered little attention. Previously, we introduced the concept of infusing domain knowledge (Islam et al. 2019), albeit for bankruptcy prediction with a limited focus. (Miller 2018) has argued that incorporating knowledge from different domains will provide better explainability. In addition, (Kim et al. 2017) use the prediction's sensitivity to high-dimensional concepts (e.g., the concept "striped" for "zebra") to explain the prediction. Furthermore, both LIME (Ribeiro, Singh, and Guestrin 2016) and SHAP (Lundberg and Lee 2017) use a simplified input mapping, mapping the original input to a simplified set of inputs. To the best of our knowledge, none of these models incorporate domain knowledge with a focus on better explainability and interpretability. Although our proposed conceptual model comes with a negligible compromise in accuracy, it offers better explainability and interpretability, and scalability to big data problems.

===3 Methodology===

Figure 1: Proposed Technique.

====3.1 Proposed Approach====

The proposed approach consists of two components: a feature generalizer, which produces a generalized feature set with the help of domain knowledge in two different ways; and an evaluator, which produces and compares the results from the "black box" model for multiple configurations of features: domain knowledge infused features, newly constructed features from the domain knowledge infused features, selected features, and all features.

====3.2 Feature Generalizer====

The feature generalizer (Figure 1, top portion) takes the original features of the dataset (X_1, X_2, ..., X_n ∈ X, where X is the set of all features) and infuses domain knowledge to produce or re-construct a concise, better interpretable feature set (X_1', X_2', ..., X_k' ∈ X', where X' is the universal set of original/transformed/constructed features, and k is much smaller than n) in two different ways:

Feature Mapping: As stated earlier, we use the CIA principles (confidentiality, integrity, and availability) as domain knowledge. We analyze all types of attacks for associated compromises in each component of the CIA principles (see Table 1). The Heartbleed vulnerability is related to a compromise in confidentiality, as an attacker could gain access to the memory of systems protected by a vulnerable version of OpenSSL. A Web attack (e.g., SQL injection) is related to compromises in confidentiality and integrity (e.g., reading/writing data using an injected query) and availability (e.g., flooding the database server with injected complex queries like a cross join). An Infiltration attack is related to a compromise in confidentiality, as it normally exploits a software vulnerability (e.g., in Adobe Acrobat Reader) to create a backdoor and reveal information (e.g., IPs). A Port scan attack is related to a compromise in confidentiality, as the attacker sends packets with varying destination ports to learn the services and operating systems from the replies. All DoS and DDoS attacks are related to a compromise in availability, as they aim to hamper the availability of a service or data. Furthermore, SSH-Patator and FTP-Patator are brute force attacks and are usually responsible for a compromise in confidentiality. A Botnet (i.e., a robot network: a network of malware-infected computers) could provide a remote shell, file upload/download options, screenshot capture, and key logging, and thus has the potential for all of the confidentiality, integrity, and availability related compromises.

Table 1: Mapping of network attacks to the related components of the CIA principles

| Attack | Related component of CIA |
|---|---|
| DoS GoldenEye | A |
| Heartbleed | C |
| DoS Hulk | A |
| DoS Slowhttp | A |
| DoS slowloris | A |
| SSH-Patator | C |
| FTP-Patator | C |
| Web Attack | C, I, A |
| Infiltration | C |
| Bot | C, I, A |
| PortScan | C |
| DDoS | A |
Furthermore, from the feature ranking of the original dataset provider (Sharafaldin, Lashkari, and Ghorbani 2018), for each type of attack we take the top three features according to their importance (i.e., feature importance from a Random Forest Regressor) and calculate the mapping (Table 2) to the related compromises under the CIA principles. For example, the feature Average Packet Size is renamed Avg Packet Size - A, where -A indicates that it is a key feature for a compromise of availability (see Table 2). To get this mapping between a feature and its associated compromises, we first find the mapping between an attack and the related compromises (from Table 1, formulated as Equation 2). In other words, Formula 1 gives the name of the associated attack where the feature is among the top three features identifying that particular attack, and Formula 2 gives the associated compromises in C, I, or A from the attack name:

$$f(\text{feature}) \rightarrow \text{attack} \qquad (1)$$

$$f(\text{attack}) \rightarrow C, I, \text{ or } A \qquad (2)$$

Thus, with the help of domain knowledge, we keep 22 features (see Table 2) out of a total of 78 features. We will refer to these features as the domain features. The feature descriptions in Table 2 are taken from the data processing software's website (net 2019).
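The two lookups in Formulas 1 and 2 amount to small dictionary mappings. Below is a minimal sketch in Python; the mapping entries are abbreviated excerpts of Tables 1 and 2, and the helper name `rename_with_cia` is our own illustration, not code from the paper.

```python
# Equation (2): attack -> affected CIA components (excerpt of Table 1)
cia_of_attack = {
    "DDoS": "A",
    "SSH-Patator": "C",
    "Bot": "CIA",
    "PortScan": "C",
}

# Equation (1): feature -> attacks for which it ranks in the top three
# (excerpt of the ranking by Sharafaldin, Lashkari, and Ghorbani 2018)
top3_attacks_of = {
    "Average Packet Size": ["DDoS"],
    "ACK Flag Count": ["SSH-Patator"],
    "Bwd Packets/s": ["Bot", "PortScan"],
}

def rename_with_cia(feature: str) -> str:
    """Tag a domain feature with the union of CIA compromises of every
    attack it helps to identify, e.g. 'Average Packet Size - A'."""
    letters = set()
    for attack in top3_attacks_of[feature]:
        letters.update(cia_of_attack[attack])
    suffix = "".join(c for c in "CIA" if c in letters)
    return f"{feature} - {suffix}"

print(rename_with_cia("Bwd Packets/s"))  # -> 'Bwd Packets/s - CIA'
```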
Feature Construction: We also construct three new features, C, I, and A, from the domain features by quantitatively calculating the compromises associated with each domain feature. For that purpose, we calculate the correlation coefficient vector of the dataset to understand whether an increase in the value of a feature has a positive or negative impact on the target variable. We then convert each entry of the correlation coefficient (a.k.a. coeff) vector V into 1 or -1 based on whether the correlation coefficient is positive or negative. We also group the domain features and the corresponding coeff tuples into three groups (C, I, and A). Using Formulas 3, 4, and 5, we aggregate each group of domain features into the three new features C, I, and A. We scale all feature values from 0 to 1 before starting the aggregation process. During the aggregation for a particular group (e.g., C), if the correlation coefficient (e.g., $V_i$) for a feature (e.g., $C_i$) of that group is negative, then the product of the feature value and the correlation coefficient for that feature is deducted, and vice-versa if positive. In addition, when a feature is liable for more than one compromise, the feature value is split among the associated elements of the CIA principles.

$$C = \sum_{i=0}^{n} C_i V_i \qquad (3)$$

$$I = \sum_{i=0}^{n} I_i V_i \qquad (4)$$

$$A = \sum_{i=0}^{n} A_i V_i \qquad (5)$$
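The following is a minimal sketch of this construction under the assumptions stated above (min-max scaling to [0, 1], the sign of the correlation as V, and an even split of a feature's value among its CIA tags). The function name and the `domain_groups` argument are illustrative; the paper does not publish this exact code.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def construct_cia(df: pd.DataFrame, target: str, domain_groups: dict) -> pd.DataFrame:
    """domain_groups maps each domain feature to its CIA tags,
    e.g. {'Flow Duration': 'AC', 'ACK Flag Count': 'C', ...}."""
    X = df.drop(columns=[target])
    X = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)
    # V: sign (+1/-1) of each feature's correlation with the target (Formulas 3-5)
    V = np.sign(df.corr()[target].drop(target)).replace(0, 1)
    out = pd.DataFrame(0.0, index=X.index, columns=["C", "I", "A"])
    for feat, tags in domain_groups.items():
        share = X[feat] * V[feat] / len(tags)  # split among the tagged components
        for tag in tags:
            out[tag] += share
    out[target] = df[target].values
    return out
```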
Table 2: Mapping of features to the related components of the CIA principles (descriptions from (net 2019))

| Feature | Description | In top 3 features of attack | Renamed feature |
|---|---|---|---|
| ACK Flag Count | Number of packets with ACK | SSH-Patator | ACK Flag Count - C |
| Active Mean | Mean time a flow was active before becoming idle | DoS Slowhttp, Infiltration | Active Mean - AC |
| Active Min | Minimum time a flow was active before becoming idle | DoS Slowhttp | Active Min - A |
| Average Packet Size | Average size of packet | DDoS | Avg Packet Size - A |
| Bwd IAT Mean | Mean time between two packets sent in the backward direction | DoS slowloris | Bwd IAT Mean - A |
| Bwd Packet Length Std | Standard deviation of packet size in the backward direction | DoS Hulk, DoS GoldenEye, DDoS, Heartbleed | Bwd Packet Length Std - AC |
| Bwd Packets/s | Number of backward packets per second | Bot, PortScan | Bwd Packets/s - CIA |
| Fwd IAT Mean | Mean time between two packets sent in the forward direction | DoS slowloris | Fwd IAT Mean - A |
| Fwd IAT Min | Minimum time between two packets sent in the forward direction | DoS slowloris, DoS GoldenEye | Fwd IAT Min - A |
| Fwd Packet Length Mean | Mean size of packet in the forward direction | Benign, Bot | Fwd Packet Length Mean - CIA |
| Fwd Packets/s | Number of forward packets per second | FTP-Patator | Fwd Packets/s - C |
| Fwd PSH Flags | Number of times the PSH flag was set in packets travelling in the forward direction (0 for UDP) | FTP-Patator | Fwd PSH Flags - C |
| Flow Duration | Duration of the flow in microseconds | DDoS, DoS slowloris, DoS Hulk, DoS Slowhttp, Infiltration, Heartbleed | Flow Duration - AC |
| Flow IAT Mean | Mean inter-arrival time of packets | DoS GoldenEye | Flow IAT Mean - A |
| Flow IAT Min | Minimum inter-arrival time of packets | DoS GoldenEye | Flow IAT Min - A |
| Flow IAT Std | Standard deviation of packet inter-arrival time | DDoS, DoS Slowhttp, DoS Hulk | Flow IAT Std - A |
| Init Win bytes forward | Total number of bytes sent in the initial window in the forward direction | Web Attack | Init Win Bytes Fwd - CIA |
| PSH Flag Count | Number of packets with PUSH | PortScan | PSH Flag Count - C |
| Subflow Fwd Bytes | Average number of bytes in a subflow in the forward direction | Benign, SSH-Patator, Web Attack, Bot, Heartbleed, Infiltration | Subflow Fwd Bytes - CIA |
| SYN Flag Count | Number of packets with SYN | FTP-Patator | SYN Flag Count - C |
| Total Length of Fwd Packets | Total size of packets in the forward direction | Benign, SSH-Patator, Web Attack, Bot, Heartbleed, Infiltration | Total Length of Fwd Packets - CIA |

====3.3 Evaluator====

The task of the evaluator (Figure 1, bottom portion) is to execute the supervised models or algorithms and compare their performance (in detecting malicious and benign records) across four different configurations of features: (1) using all features; (2) using selected features, where selection is done by a feature selection algorithm; (3) using domain knowledge infused features; and (4) using the newly constructed features C, I, and A from the domain knowledge infused features (a compact sketch of this comparison loop follows the test descriptions below). In addition, the evaluator performs the following two tests:

1. Explainability Test: the purpose of this test is to discover the comparative advantages or disadvantages of incorporating domain knowledge in the experiment; and
2. Generalizability Test: the purpose of this test is to analyze how the different approaches perform in unknown or unseen attack detection. We delete all training records for a particular attack, one at a time, and investigate the performance of the model on the same test set, which includes records from the unknown or unseen attack. Details of these tests are described in Section 4.
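As a rough sketch of the evaluator, assume a prepared train/test split and a `feature_sets` dict mapping each configuration name to its column list (e.g., "all" to the 78 original columns, "domain" to the 22 CIA-tagged columns, "constructed" to ["C", "I", "A"]); Random Forest stands in here for any of the models, and all names are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate_configurations(X_train, y_train, X_test, y_test, feature_sets):
    """Fit one model per feature configuration and score it on the holdout set."""
    results = {}
    for name, cols in feature_sets.items():
        clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
        clf.fit(X_train[cols], y_train)
        proba = clf.predict_proba(X_test[cols])[:, 1]
        pred = (proba >= 0.5).astype(int)
        results[name] = {
            "Acc": accuracy_score(y_test, pred),
            "Prec": precision_score(y_test, pred),
            "Rec": recall_score(y_test, pred),
            "F-score": f1_score(y_test, pred),
            "AUC": roc_auc_score(y_test, proba),
        }
    return results
```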
====3.4 Algorithms====

We use six different algorithms for predicting malicious records: one is a probabilistic classifier based on Bayes' theorem, and the remaining five are supervised "black box" models. The algorithm descriptions are adapted from our previous work (Islam et al. 2019). A sketch of how these models can be instantiated follows the descriptions.

Artificial Neural Network (ANN): An Artificial Neural Network is a non-linear model capable of mimicking human brain functions to some extent. It consists of an input layer, one or more hidden layers, and an output layer. Each layer consists of multiple neurons that help to learn complex patterns, with each subsequent layer learning more abstract concepts before the network finally merges into the output layer.

Support Vector Machine (SVM): The Support Vector Machine was first introduced by (Boser, Guyon, and Vapnik 1992) and has been used for many supervised classification tasks. In addition to linear classification, the model can learn an optimal hyperplane that separates instances of different classes using a highly non-linear implicit mapping of input vectors into a high-dimensional feature space (i.e., the kernel trick) (Hooman et al. 2016). When the number of samples is very high (i.e., millions), it becomes very costly in terms of computation time.

Random Forest (RF): A Random Forest is a tree-based ensemble technique developed by (Breiman 2001) for supervised classification tasks. In RF, many trees are generated from bootstrapped subsamples (i.e., random samples drawn with replacement) of the training data. In each tree, the splitting attribute is chosen from a smaller random subset of that tree's attributes (i.e., the chosen split attribute is the best among that random subset). This randomness helps to make the trees less correlated, since correlated trees make the same kinds of prediction errors and can overfit the model. With less correlated trees, a few trees may be wrong but many others will be right, and as a group the trees move in the right direction, because the outputs of all trees are averaged for the final prediction.

Extra Trees (ET): Extremely Randomized Trees, or Extra Trees, is a tree-based ensemble technique similar to RF. The only difference is in the selection of the splitting attribute and the determination of the threshold (cut-off) value, both of which are chosen in an extremely random fashion (Islam, Ghafoor, and Eberle 2018). As in RF, a random subset of features is considered for the split selection, but instead of choosing the most discriminative cut-off threshold, ET sets cut-off thresholds to random values, and the best of these randomly chosen values becomes the threshold for the splitting rule at a particular node (ens 2019). Unlike a single Decision Tree, RF and ET use multiple trees, which reduces variance; however, some bias is introduced, as a subset of the whole feature set is chosen for each tree instead of all features. ET was proposed by (Geurts, Ernst, and Wehenkel 2006) and has achieved state-of-the-art performance in some anomaly/intrusion detection research (Islam 2018; Islam, Eberle, and Ghafoor 2018; Islam, Ghafoor, and Eberle 2018).

Gradient Boosting (GB): (Friedman 2001) generalized AdaBoost into a Gradient Boosting algorithm that allows a variety of loss functions. Here, the shortcomings of weak learners are identified using the gradient instead of the highly weighted data points used in AdaBoost. Gradient Boosting is a well-known classification/regression model in the form of an ensemble of weak prediction models, such as Decision Trees. It works sequentially, like the AdaBoost algorithm, in that each subsequent model tries to minimize the loss function (e.g., Mean Squared Error) by paying special attention to instances that were hard to get right in the previous model.

Naive Bayes (NB): The Naive Bayes algorithm is based on Bayes' Theorem, which was formulated in the seventeenth century. It is a supervised, simple, and comparatively fast algorithm based on statistics. In a real-world problem, it is unusual for all features to be independent; however, Naive Bayes assumes conditional independence among features and surprisingly works well in many cases. It also requires only a small amount of training data to estimate the necessary parameters (nai 2019). The independence assumption avoids a lot of computation (e.g., computing the conditional probability of each feature given the others) and makes it a faster algorithm; it also helps in Big Data problems where the curse of dimensionality is a concern, since each class-conditional feature distribution can be independently estimated as a one-dimensional distribution. However, NB is a bad estimator of probabilities (Zhang 2004). We use Bernoulli Naive Bayes (Manning, Raghavan, and Schütze 2010) for our experiments, where each feature is assumed to be binary-valued.
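For concreteness, the six classifiers can be instantiated as follows. The hyperparameters shown are illustrative defaults, not the paper's tuned settings, and scikit-learn's MLPClassifier stands in here for the TensorFlow network actually used for the ANN.

```python
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.neural_network import MLPClassifier

models = {
    "RF":  RandomForestClassifier(n_estimators=100, n_jobs=-1),
    "ET":  ExtraTreesClassifier(n_estimators=100, n_jobs=-1),
    "GB":  GradientBoostingClassifier(),                # sequential boosting of weak trees
    "SVM": SVC(kernel="rbf", probability=True),         # costly when samples run to millions
    "NB":  BernoulliNB(),                               # each feature treated as binary-valued
    "ANN": MLPClassifier(hidden_layer_sizes=(64, 32)),  # input -> hidden layers -> output
}
```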
====3.5 Data====

In this work, we use a recent and comprehensive IDS dataset, CICIDS2017, published in 2017, which covers necessary criteria and contains common updated attacks such as DoS, DDoS, Brute Force, XSS, SQL Injection, Infiltration, Portscan, and Botnet. This dataset was created to eliminate the shortcomings (e.g., lack of traffic diversity and volume, lack of variety of attacks, anonymized packet information, and datedness) of previous well-known IDS datasets created since 1998, such as DARPA98, KDD99, ISC2012, ADFA13, DEFCON, CAIDA, LBNL, CDX, Kyoto, Twente, and UMASS. It is a labeled dataset containing 78 network traffic features (some are listed in Table 2), extracted and calculated from pcap files using the CICFlowMeter software (Lashkari et al. 2017) for all benign and intrusive flows (Sharafaldin, Lashkari, and Ghorbani 2018). This new IDS dataset includes seven common updated families of attacks satisfying real-world criteria and is publicly available at https://www.unb.ca/cic/datasets/ids-2017.html.

Each record of the dataset is labeled with the particular type of attack. We create a new target feature, "Class", setting it to 1 for all records labeled as any of the 14 types of attacks (those records are malicious/intrusive) and to 0 for the remaining records (those are benign). In the whole dataset, there are 2,830,743 records covering 14 different attacks: 2,273,097 records are benign and 557,646 are malicious. Approximately 24.5% of the records are malicious, giving us an imbalanced dataset, which impacts the performance of some machine learning algorithms (e.g., a bias toward the class with the majority of samples). To overcome this problem, we use the well-known oversampling technique SMOTE (Chawla et al. 2002) to oversample the minority class; (Dong and Wang 2016) also use SMOTE to address this issue in their empirical comparison of traditional vs. deep learning-based IDS. SMOTE creates synthetic samples rather than simply oversampling with replacement: the minority class is oversampled by creating new examples along the line segments joining any or all of the k nearest minority-class neighbors, where k is chosen based on the percentage of oversampling required (i.e., a hyperparameter of the algorithm) (Chawla et al. 2002).

===4 Experiments===

====4.1 Experimental Setup====

We execute the experiments on a GPU-enabled Linux machine with 12 GB of RAM and a Core i7 processor. All supervised machine learning algorithms are implemented using the Python-based Scikit-learn library (sci 2019); in addition, we use TensorFlow (ten 2019) for the Artificial Neural Network. Due to resource limitations, instead of using the whole dataset, we take a stratified sample of the data that is big enough (i.e., 300K records) for a single GPU-enabled commodity machine. We make the sampled dataset available to the research community at (sam 2019). Furthermore, we use 70% of the data for training the models and keep 30% as a holdout set to test them, confirming that the target class has the same ratio in both sets. To avoid the adverse effect of class imbalance on classification performance, we re-sample the minority class of the training set using SMOTE (Chawla et al. 2002) to balance the dataset. However, we do not re-sample the test set: real-world data is skewed, and oversampling the test set could exhibit an over-optimistic performance.
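A minimal sketch of this preparation, assuming the merged CICIDS2017 CSV is loaded into a DataFrame `df` with its original "Label" column (column naming in the raw files varies, so treat these names as assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

df["Class"] = (df["Label"] != "BENIGN").astype(int)  # 1 = any of the 14 attacks, 0 = benign

X, y = df.drop(columns=["Label", "Class"]), df["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)  # same class ratio in both sets

# Oversample the minority class of the training set only; the test set keeps
# its real-world skew so the scores are not over-optimistic.
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)
```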
We run all supervised machine learning algorithms using four different approaches:

1. With all features: using all 78 features of the dataset without discarding any features.
2. With selected features: using a Random Forest Regressor (adapting the work of (Sharafaldin, Lashkari, and Ghorbani 2018)) to select important features of the dataset, giving us 50 important features with a nonzero influence on the target variable.
3. With domain knowledge infused features: using the infused domain knowledge features (see Section 3.2); we use the term domain features interchangeably as a short form.
4. With newly constructed features from domain knowledge infused features: using the newly constructed features C, I, and A (see Section 3.2); we use the term domain features-constructed interchangeably as a short form.

The following are the two types of experiments performed using each of the four feature settings.

====4.2 Explainability Test====

For this test, we run the six supervised algorithms (RF, ET, SVM, GB, ANN, and NB) using the four described feature settings and report the results in Section 5.1. Unlike NB, the other classifiers are "black box" in nature; NB is a probabilistic classifier based on Bayes' Theorem with a strong conditional independence assumption among features. The main purpose of including NB in the experiment is the generalizability test.

====4.3 Generalizability Test====

For testing the generalizability of the approach, we train the classifier without any representatives of a particular attack and test it on a set that does contain representatives of that attack, in order to classify them as malicious/benign. To be more specific, we delete all records of a particular attack from the training set, train the classifier with the records of the remaining 13 attacks, and test the classifier with all 14 attacks. We report the percentage of the deleted attack's records that are correctly detected as malicious (see Section 5.2). We repeat this, one attack at a time, for all 14 attacks. We make the source code available to the research community to replicate the experiments at (pro 2019).
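A sketch of one round of this test, under the assumption that the train/test DataFrames retain the original "Label" strings alongside the binary "Class" target (the names are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

def unseen_attack_detection_rate(train_df, test_df, attack, feature_cols):
    """Train without one attack, then measure how many of its records in the
    test set are still flagged as malicious (Class = 1)."""
    reduced = train_df[train_df["Label"] != attack]       # delete the attack from training
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
    clf.fit(reduced[feature_cols], reduced["Class"])
    unseen = test_df[test_df["Label"] == attack]          # records of the now-unseen attack
    return clf.predict(unseen[feature_cols]).mean()       # fraction detected as malicious
```

Repeating the call over the 14 attack labels, once per attack and once per feature setting, reproduces the layout of Tables 6 and 7.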
===5 Results===

The following sections discuss the results from the two categories of experiments previously described.

====5.1 Findings from Explainability Test====

Comparing the performance using all features vs. selected features, Table 3 shows that models using all features (denoted with an appended -A, for instance RF-A) tend to show better results on all performance metrics. However, the difference from the selected features setting (denoted -S) is negligible (<.0007 for RF) on every metric, which might be a result of the eliminated features having little significance. In addition, Random Forest outperforms the other algorithms (SVM, ET, GB, ANN, and NB) under this feature setting (i.e., using all features). So we consider the results using all features as the baseline to compare against our proposed approach.

Table 3: Performance using all features vs. selected features (here and in Tables 4 and 5, a parenthesized difference indicates that the second setting scored higher)

| Alg. | Acc. | Prec. | Rec. | F-score | AUC |
|---|---|---|---|---|---|
| RF-A | 0.9987 | 0.9965 | 0.9971 | 0.9968 | 0.9997 |
| RF-S | 0.9986 | 0.9962 | 0.9966 | 0.9964 | 0.9997 |
| Difference | 0.0002 | 0.0003 | 0.0006 | 0.0005 | 0.0000 |
| ET-A | 0.9981 | 0.9951 | 0.9951 | 0.9951 | 0.9994 |
| ET-S | 0.9980 | 0.9950 | 0.9950 | 0.9950 | 0.9994 |
| Difference | 0.0001 | 0.0002 | 0.0001 | 0.0001 | 0.0000 |
| ANN-A | 0.9802 | 0.9155 | 0.9908 | 0.9516 | 0.9984 |
| ANN-S | 0.9740 | 0.8929 | 0.9860 | 0.9372 | 0.9968 |
| Difference | 0.0062 | 0.0226 | 0.0047 | 0.0145 | 0.0017 |
| SVM-A | 0.9109 | 0.6996 | 0.9595 | 0.8092 | 0.9780 |
| SVM-S | 0.8869 | 0.6433 | 0.9565 | 0.7692 | 0.9746 |
| Difference | 0.0239 | 0.0563 | 0.0030 | 0.0400 | 0.0034 |
| GB-A | 0.9960 | 0.9854 | 0.9944 | 0.9899 | 0.9995 |
| GB-S | 0.9957 | 0.9840 | 0.9945 | 0.9892 | 0.9996 |
| Difference | 0.0003 | 0.0014 | (0.0001) | 0.0007 | (0.0001) |
| NB-A | 0.7753 | 0.4371 | 0.4888 | 0.4615 | 0.8601 |
| NB-S | 0.7621 | 0.4144 | 0.5019 | 0.4539 | 0.8508 |
| Difference | 0.0132 | 0.0228 | (0.0131) | 0.0076 | 0.0093 |

Before comparing the results of our approach against all features (i.e., the baseline), we seek the better of the two domain related feature settings of our proposed approach: domain knowledge infused features vs. the newly constructed features (C, I, and A). We find that, in almost all cases, the model with domain knowledge infused features (denoted with an appended -D1, for instance RF-D1) performs better than its counterpart (denoted -D2; see Table 4). For RF, the maximum performance gap is about .2, in recall; for ET that gap is .048, with similar precision. As the domain features (22 features) contain much more detail than the newly constructed features C, I, and A (3 features), the constructed setting loses some details. In terms of individual algorithms, RF is again the clear winner, this time using domain features. Although NB and ANN exhibit better recall using constructed features, that comes with compromises in precision. So, overall, we consider the domain features setting the better of the two.

Table 4: Performance using domain features vs. constructed features

| Alg. | Acc. | Prec. | Rec. | F-score | AUC |
|---|---|---|---|---|---|
| RF-D1 | 0.9973 | 0.9920 | 0.9945 | 0.9932 | 0.9993 |
| RF-D2 | 0.9511 | 0.9446 | 0.7985 | 0.8654 | 0.9572 |
| Difference | 0.0463 | 0.0475 | 0.1960 | 0.1278 | 0.0421 |
| ET-D1 | 0.9969 | 0.9913 | 0.9932 | 0.9923 | 0.9989 |
| ET-D2 | 0.9756 | 0.9321 | 0.9448 | 0.9384 | 0.9954 |
| Difference | 0.0214 | 0.0592 | 0.0483 | 0.0538 | 0.0036 |
| ANN-D1 | 0.9497 | 0.8300 | 0.9362 | 0.8799 | 0.9865 |
| ANN-D2 | 0.5952 | 0.3241 | 0.9721 | 0.4862 | 0.7921 |
| Difference | 0.3544 | 0.5059 | (0.0359) | 0.3937 | 0.1945 |
| SVM-D1 | 0.8489 | 0.5747 | 0.8968 | 0.7005 | 0.9252 |
| SVM-D2 | 0.7195 | 0.3739 | 0.6281 | 0.4687 | 0.7886 |
| Difference | 0.1294 | 0.2008 | 0.2687 | 0.2318 | 0.1366 |
| GB-D1 | 0.9881 | 0.9513 | 0.9904 | 0.9705 | 0.9986 |
| GB-D2 | 0.9230 | 0.7692 | 0.8701 | 0.8165 | 0.9789 |
| Difference | 0.0652 | 0.1821 | 0.1204 | 0.1539 | 0.0198 |
| NB-D1 | 0.7982 | 0.4881 | 0.5028 | 0.4953 | 0.8553 |
| NB-D2 | 0.5591 | 0.2687 | 0.7195 | 0.3913 | 0.6591 |
| Difference | 0.2391 | 0.2194 | (0.2167) | 0.1040 | 0.1962 |

Knowing that the best overall setting is all features (Table 3) and the better domain related setting is domain features (Table 4), we further compare the performance of models using these two best settings: all features (i.e., the baseline) vs. domain features. We find that, among all models, RF using all features (RF-A) performs better than all other algorithms (see Table 5 and Figure 3). Interestingly, RF using domain knowledge infused features (RF-D1) also shows promising performance: the difference between the two on any performance metric is negligible (<.005). In fact, the result of RF using the domain knowledge infused feature setting is better than what (Sharafaldin, Lashkari, and Ghorbani 2018) report on the same dataset; the slight improvement might stem from the experimental settings (e.g., training/test split, re-sampling techniques). Furthermore, in the domain knowledge infused feature setting we use only 22 features out of the 78 total, where each feature indicates the associated compromises (e.g., confidentiality, integrity, or availability), making the model capable of producing better explainable and interpretable results than its counterpart.

Table 5: Performance using all features vs. domain features

| Alg. | Acc. | Prec. | Rec. | F-score | AUC |
|---|---|---|---|---|---|
| RF-A | 0.9987 | 0.9965 | 0.9971 | 0.9968 | 0.9997 |
| RF-D1 | 0.9973 | 0.9920 | 0.9945 | 0.9932 | 0.9993 |
| Difference | 0.0014 | 0.0045 | 0.0027 | 0.0036 | 0.0004 |
| ET-A | 0.9981 | 0.9951 | 0.9951 | 0.9951 | 0.9994 |
| ET-D1 | 0.9969 | 0.9913 | 0.9932 | 0.9923 | 0.9989 |
| Difference | 0.0011 | 0.0038 | 0.0020 | 0.0029 | 0.0004 |
| ANN-A | 0.9802 | 0.9155 | 0.9908 | 0.9516 | 0.9984 |
| ANN-D1 | 0.9497 | 0.8300 | 0.9362 | 0.8799 | 0.9865 |
| Difference | 0.0305 | 0.0855 | 0.0546 | 0.0717 | 0.0119 |
| SVM-A | 0.9109 | 0.6996 | 0.9595 | 0.8092 | 0.9780 |
| SVM-D1 | 0.8489 | 0.5747 | 0.8968 | 0.7005 | 0.9252 |
| Difference | 0.0619 | 0.1249 | 0.0627 | 0.1087 | 0.0528 |
| GB-A | 0.9960 | 0.9854 | 0.9944 | 0.9899 | 0.9995 |
| GB-D1 | 0.9881 | 0.9513 | 0.9904 | 0.9705 | 0.9986 |
| Difference | 0.0079 | 0.0341 | 0.0039 | 0.0194 | 0.0009 |
| NB-A | 0.7753 | 0.4371 | 0.4888 | 0.4615 | 0.8601 |
| NB-D1 | 0.7982 | 0.4881 | 0.5028 | 0.4953 | 0.8553 |
| Difference | (0.0229) | (0.0510) | (0.0140) | (0.0338) | 0.0048 |

Figure 3: Performance deviations of using domain knowledge infused features from the baseline.

The prediction for a particular sample can be represented as:

$$P(D) = b + \sum_{g=0}^{G} \text{contribution}(g) \qquad (6)$$

where b is the model average, g is a generalized domain feature (e.g., ACK Flag Count - C), and P(D) is the probability value of the decision. Instead of using the contributions from each of the domain features, we can express the output in terms of the contribution from each element of the domain concept. For that, we need to aggregate the contributions of all features into three groups (C, I, and A). This enables an analyst to understand the nature of the attack more quickly (Figure 2). For instance, when the greater portion of a sample's feature contribution comes from features tagged with -A (i.e., Availability), it might be a DDoS attack, which usually comes with very high compromises in the availability of data or services. We use the iml package of the programming language R to generate the breakdown of feature contributions for a particular sample's prediction (Figure 2).

Figure 2: Breakdown of the prediction for a random sample in terms of C, I, and A.
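The paper generates Figure 2 with R's iml package; as a hedged Python analogue of Equation (6), the treeinterpreter package decomposes a Random Forest prediction into a bias term (the model average b) plus one contribution per feature, which can then be folded into C, I, and A shares. The helper below is our own illustration and assumes the renamed feature names of Table 2.

```python
from treeinterpreter import treeinterpreter as ti

def cia_breakdown(rf, sample, feature_names):
    """sample: array of shape (1, n_features); names end in '- C', '- AC', etc."""
    prediction, bias, contributions = ti.predict(rf, sample)
    shares = {"C": 0.0, "I": 0.0, "A": 0.0}
    for name, contrib in zip(feature_names, contributions[0, :, 1]):
        tags = name.rsplit("-", 1)[1].strip()   # e.g. 'AC' from 'Flow Duration - AC'
        for tag in tags:
            shares[tag] += contrib / len(tags)  # split among the tagged components
    return bias[0, 1], shares                   # P(D) = bias + sum of the shares
```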
Overall, although the However, from Table 3, 4, and 5, we can see that NB’s Table 6: Performance of unseen attack detection using RF Table 7: Performance of unseen attack detection using NB Attack Count All(%) Sel.(%) Dom.(%) Cons.(%) Attack Count All(%) Sel.(%) Dom.(%) Cons.(%) Ddos 4184 99.90 99.90 99.90 62.86 Ddos 4184 76.94 74.59 83.22 100 PortScan 4973 99.90 99.94 99.94 66.28 PortScan 4973 0.18 0.18 6.64 100 Bot 54 77.78 77.78 75.93 22.22 Bot 54 0.00 0.00 100 Infiltration 1 100 100 100 0.00 Infiltration 1 0.00 0.00 0.00 100 Web Attack-BF 49 95.92 95.92 91.84 75.51 Web Attack-BF 49 6.12 6.12 83.67 100 Web Attack-XSS 23 95.65 95.65 91.30 65.22 Web Attack-XSS 23 0.00 0.00 95.65 100 Web Attack-Sql 1 0.00 0.00 100 0.00 Web Attack-Sql 1 100 100 0.00 100 FTP-Patator 251 99.20 100 99.20 81.67 FTP-Patator 251 0.00 0.00 0.00 48.61 SSH-Patator 198 98.99 99.49 96.97 75.76 SSH-Patator 198 0.00 0.00 0.00 100 DoS slowloris 188 99.47 99.47 98.94 61.70 DoS slowloris 188 25.53 30.85 38.30 66.49 DoS Slowloris 174 99.43 99.43 96.55 31.61 DoS Slowloris 174 71.26 79.89 75.29 78.16 Dos Hulk 7319 99.71 99.73 99.34 96.19 Dos Hulk 7319 67.78 67.78 67.03 35.37 DoS GoldenEye 314 99.36 99.68 98.41 85.03 DoS GoldenEye 314 50.32 65.29 47.13 96.82 Heartbleed 1 100 100 100 100 Heartbleed 1 100 100 0.00 0.00 Figure 4: Unknown attack detection rate using Random For- est Figure 5: Unknown attack detection rate using Naive Bayes performance comes at a cost of precision and recall (i.e., speed where there is human in the loop. The lack of explain- produces comparatively more false positives and false neg- ability of an AI-based model is a key reason for this bottle- atives). In addition, NB is a bad probability estimator of neck. To mitigate this problem, we infuse the CIA principle the predicted output (Zhang 2004). However, NB with con- (i.e., domain knowledge) in the AI-based black box model structed features setting could be recommended as an addi- for better explainability and generalizability of the model. tional IDS for quick interpretation of huge traffic data given Our experimental results show realizable successes in better the decision is treated as tentative with the requirement of explainability with a comprehensive, up to date, and real- a further sanity check. We also calculate the average time world network intrusion dataset. In addition, the infused do- taken by each algorithm for all four feature settings and main knowledge helps in detecting an unknown attack as it found that NB is the fastest algorithm. RF, ET, GB, ANN, generalizes the problem, which ultimately opens the door to and SVM take 2.80, 9.27, 77.06, 15.07, and 444.50 times accommodate big data. more execution time compared to NB. Besides, the best al- gorithm, RF (1st in terms of the performance metric and 2nd Going forward, finding an optimal solution to segregate in terms of execution time), can be executed in parallel using the contribution of each participating feature (sample wise) an Apache Spark for a far better run-time (Chen et al. 2016) considering interactions (i.e., correlations among features making it highly scalable to big data problems. complicate explanations) among features will aid in better Overall, domain knowledge infusion provides better ex- explainability of an individual prediction (i.e., per sample). plainability with negligible compromises in performance. 
In Besides, to ensure trust, estimating the level of uncertainty addition, the generalization provides better execution time in the model will be another extension of this work. There and resiliency with unknown attacks. are some open challenges surrounding explainability and in- terpretability such as an agreement of what an explanation is and to whom, a formalism for the explanation, and quan- 6 Conclusion and Future Work tifying the human comprehensibility of the explanation. AI-based approaches have become an integral part of secu- rity solutions due to the potential for handling “Big Data” Acknowledgment and handling diverse network traffic data. Cybercrime- related damages continue to rise, and network intrusions are Thanks to Tennessee Tech’s Cyber-security Education, Re- a key tactic. Although AI-based IDS provides accelerated search and Outreach Center (CEROC) for supporting this speeds in intrusion detection, response is still at a human research. References methods in credit scoring. The Journal of Developing Ar- Ando, S. 2019. Interpreting random forests. eas 50(5):371–381. http://blog.datadive.net/interpreting-random-forests/. Islam, S. R.; Eberle, W.; Bundy, S.; and Ghafoor, S. K. 2019. Bach, S.; Binder, A.; Montavon, G.; Klauschen, F.; Müller, Infusing domain knowledge in ai-based” black box” models K.-R.; and Samek, W. 2015. On pixel-wise explanations for for better explainability with application in bankruptcy pre- non-linear classifier decisions by layer-wise relevance prop- diction. arXiv preprint arXiv:1905.11474. agation. PloS one 10(7):e0130140. Islam, S. R.; Eberle, W.; and Ghafoor, S. K. 2018. Credit de- Boser, B. E.; Guyon, I. M.; and Vapnik, V. N. 1992. A train- fault mining using combined machine learning and heuristic ing algorithm for optimal margin classifiers. In Proceedings approach. arXiv preprint arXiv:1807.01176. of the fifth annual workshop on Computational learning the- Islam, S. R.; Ghafoor, S. K.; and Eberle, W. 2018. Min- ory, 144–152. ACM. ing illegal insider trading of stocks: A proactive approach. Breiman, L. 2001. Random forests. Machine learning In 2018 IEEE International Conference on Big Data (Big 45(1):5–32. Data), 1397–1406. IEEE. Chandrasekaran, B.; Tanner, M. C.; and Josephson, J. R. Islam, S. R. 2018. An efficient technique for mining bad 1989. Explaining control strategies in problem solving. credit accounts from both olap and oltp. Ph.D. Dissertation, IEEE Intelligent Systems (1):9–15. Tennessee Technological University. Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; and Kegelmeyer, Javaid, A.; Niyaz, Q.; Sun, W.; and Alam, M. 2016. A W. P. 2002. Smote: synthetic minority over-sampling tech- deep learning approach for network intrusion detection sys- nique. Journal of artificial intelligence research 16:321– tem. In Proceedings of the 9th EAI International Conference 357. on Bio-inspired Information and Communications Technolo- Chen, J.; Li, K.; Tang, Z.; Bilal, K.; Yu, S.; Weng, C.; and Li, gies (formerly BIONETICS), 21–26. ICST (Institute for K. 2016. A parallel random forest algorithm for big data in Computer Sciences, Social-Informatics and . . . . a spark cloud computing environment. IEEE Transactions Kabul, I. K. 2018. Explainable ai. https://www.kdnuggets. on Parallel and Distributed Systems 28(4):919–933. com/2018/11/interpretability-trust-ai-machine- Datta, A.; Sen, S.; and Zick, Y. 2016. Algorithmic trans- learning.html. parency via quantitative input influence: Theory and exper- Kim, J.; Kim, J.; Thu, H. L. T.; and Kim, H. 2016. 
Long iments with learning systems. In 2016 IEEE symposium on short term memory recurrent neural network classifier for security and privacy (SP), 598–617. IEEE. intrusion detection. In 2016 International Conference on DeJong, G. 1981. Generalizations based on explanations. In Platform Technology and Service (PlatCon), 1–5. IEEE. IJCAI, volume 81, 67–69. Kim, B.; Wattenberg, M.; Gilmer, J.; Cai, C.; Wexler, J.; Vie- Dong, B., and Wang, X. 2016. Comparison deep learning gas, F.; and Sayres, R. 2017. Interpretability beyond feature method to traditional methods using for network intrusion attribution: Quantitative testing with concept activation vec- detection. In 2016 8th IEEE International Conference on tors (tcav). arXiv preprint arXiv:1711.11279. Communication Software and Networks (ICCSN), 581–585. Lashkari, A. H.; Draper-Gil, G.; Mamun, M. S. I.; and Ghor- IEEE. bani, A. A. 2017. Characterization of tor traffic using time Doyle, M. 2019. Don’t be lulled into a false sense of secu- based features. In ICISSP, 253–262. rity. https://www.securityroundtable.org/dont-lulled-false- Lei, T.; Barzilay, R.; and Jaakkola, T. 2016. Rationalizing sense-cybersecurity/. neural predictions. arXiv preprint arXiv:1606.04155. 2019. Ensemble methods. https://scikit-learn.org /sta- Li, Z.; Sun, W.; and Wang, L. 2012. A neural network based ble/modules/ensemble.html. distributed intrusion detection system on cloud platform. In Friedman, J. H. 2001. Greedy function approximation: a 2012 IEEE 2nd international conference on Cloud Comput- gradient boosting machine. Annals of statistics 1189–1232. ing and Intelligence Systems, volume 1, 75–79. IEEE. Geurts, P.; Ernst, D.; and Wehenkel, L. 2006. Extremely Lipovetsky, S., and Conklin, M. 2001. Analysis of regres- randomized trees. Machine learning 63(1):3–42. sion in game theory approach. Applied Stochastic Models in Business and Industry 17(4):319–330. Goodman, B., and Flaxman, S. 2017. European union reg- ulations on algorithmic decision-making and a “right to ex- Lipton, Z. C. 2016. The mythos of model interpretability. planation”. AI Magazine 38(3):50–57. arXiv preprint arXiv:1606.03490. Hodo, E.; Bellekens, X.; Hamilton, A.; Dubouilh, P.-L.; Lundberg, S. M., and Lee, S.-I. 2017. A unified approach Iorkyase, E.; Tachtatzis, C.; and Atkinson, R. 2016. Threat to interpreting model predictions. In Advances in Neural analysis of iot networks using artificial neural network intru- Information Processing Systems, 4765–4774. sion detection system. In 2016 International Symposium on Lundberg, S. 2019. Shap vs lime. https://github.com /slund- Networks, Computers and Communications (ISNCC), 1–6. berg/shap/issues/19. IEEE. Manning, C.; Raghavan, P.; and , H. 2010. Introduc- Hooman, A.; Marthandan, G.; Yusoff, W. F. W.; Omid, M.; tion to information retrieval. Natural Language Engineering and Karamizadeh, S. 2016. Statistical and data mining 16(1):100–103. Matt, B., et al. 2006. Introduction to computer security. Wyden, B. 2019. Algorithmic accountability. Pearson Education India. https://www.wyden.senate.gov/imo/media/doc/Algorithmic%20 Miller, T. 2018. Explanation in artificial intelligence: In- Accountability%20Act%20of%202019%20Bill%20Text.pdf. sights from the social sciences. Artificial Intelligence. Yang, S. C.-H., and Shafto, P. 2017. Explainable artificial Mitchell, T. M.; Keller, R. M.; and Kedar-Cabelli, S. T. intelligence via bayesian teaching. In NIPS 2017 workshop 1986. Explanation-based generalization: A unifying view. 
on Teaching Machines, Robots, and Humans. Machine learning 1(1):47–80. Zhang, H. 2004. The optimality of naive bayes. AA 1(2):3. Montavon, G.; Samek, W.; and Müller, K.-R. 2018. Methods for interpreting and understanding deep neural networks. Digital Signal Processing 73:1–15. 2019. Naive bayes. https://scikit-learn.org/ sta- ble/modules/naive bayes.html. 2019. Netflow meter. http://netflowmeter.ca/ netflowme- ter.html. 2019. domain-knowledge-aided code. https://github.com/ SheikhRabiul/domain-knowledge-aided-explainable-ai-for- intrusion-detection-and-response. Rankin, K. 2019. The dark secret at the heart of ai. https://www.technologyreview.com/s/604087/the-dark- secret-at-the-heart-of-ai/. Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. Why should i trust you?: Explaining the predictions of any classi- fier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 1135– 1144. ACM. 2019. domain-knowledge-aided dataset. https://github.com/ SheikhRabiul/domain-knowledge-aided-explainable-ai- for -intrusion-detection-and-response/tree/master/data/ combined sampled.zip. 2019. Scikit-learn: Machine learning in python. https://scikit-learn.org/stable/. Sharafaldin, I.; Lashkari, A. H.; and Ghorbani, A. A. 2018. Toward generating a new intrusion detection dataset and in- trusion traffic characterization. In ICISSP, 108–116. Shone, N.; Ngoc, T. N.; Phai, V. D.; and Shi, Q. 2018. A deep learning approach to network intrusion detection. IEEE Transactions on Emerging Topics in Computational Intelli- gence 2(1):41–50. Shrikumar, A.; Greenside, P.; and Kundaje, A. 2017. Learn- ing important features through propagating activation differ- ences. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 3145–3153. JMLR. org. Štrumbelj, E., and Kononenko, I. 2014. Explaining predic- tion models and individual predictions with feature contribu- tions. Knowledge and information systems 41(3):647–665. Swartout, W. R., and Moore, J. D. 1993. Explanation in sec- ond generation expert systems. In Second generation expert systems. Springer. 543–585. Swartout, W. R. 1985. Rule-based expert systems: The mycin experiments of the stanford heuristic programming project: Bg buchanan and eh shortliffe,(addison-wesley, reading, ma, 1984); 702 pages. 2019. Tensorflow. https://www.tensorflow.org/. Turek, M. 2019. Explainable ai. https://www.darpa.mil/ program/explainable-artificial-intelligence.