Implications of Combining Domain Knowledge in
Explainable Artificial Intelligence
Sheikh Rabiul Islama , William Eberleb
a
    University of Hartford, 200 Bloomfield Ave, West Hartford, CT 06117
b
    Tennessee Technological University, 1 William L Jones Dr, Cookeville, TN 38505


Abstract
Although the infusion of domain knowledge into explainable artificial intelligence techniques is a viable approach to enhancing the explainability of a “black box” model’s decisions, several open challenges remain, including the quantification of explainability, compromises in performance, and sacrifices of information. In our prior work, we demonstrated that the infusion of domain knowledge in network intrusion detection provides better explainability of decisions, better generalization to unknown attacks, and a faster decision or response. In this paper, we extend that work to quantify the level of information sacrificed by the introduced generalization, and to quantify the level of explainability in an explainable artificial intelligence technique applied to a network intrusion detection problem. Our experimental results suggest that, as a result of domain knowledge infusion, the level of information sacrificed is negligible, and the explainability score, using a recently proposed proxy method, is better than the case of not using domain knowledge.

Keywords
explainable artificial intelligence, domain knowledge infusion, explainability quantification




1. Introduction
Artificial Intelligence (AI) based “black box” models lack explainability even though they show
superior performance in many applications. Furthermore, the lack of explainability leads to a
lack of trust in the model and its predictions, which can involve ethical and legal issues in critical
domains due to the potential implications for human interests, rights, and lives (e.g., credit
approval in finance, killer robots in defense). To mitigate the unethical use of AI as well as to
promote the responsible use of AI systems, various governments have started taking different
precautionary initiatives. Recently, the European Union implemented the “right to explanation”
rule, under which a user can ask for an explanation of an algorithmic decision [1]. In addition,
more recently the U.S. government introduced a new bill, the “Algorithmic Accountability Act”,
which would require companies to assess their machine learning systems for bias and discrim-
ination, with a need to take corrective measures [2]. The U.S. Department of Defense (DoD)


In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI
2021 Spring Symposium on Combining Machine Learning and Knowledge Engineering (AAAI-MAKE 2021) - Stanford
University, Palo Alto, California, USA, March 22-24, 2021.
" shislam@hartford.edu (S.R. Islam); weberle@tntech.edu (W. Eberle)

© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)
has identified explainability as a key stumbling block in the adoption of AI-based solutions in
many of their projects [3, 4].
   Network intrusions are a common cyber-crime activity, estimated to cost around $6 trillion
annually in damages by 2021 [5]. To combat these attacks, an Intrusion Detection System (IDS)
monitors network and computer systems for malicious activity [6]. Research on AI-based IDSs
has shown promising results [6, 7, 8, 9, 10], and such systems have become an integral part of
security solutions due to their capability of learning complex, nonlinear functions and analyzing
large data streams from numerous connected devices. A recent survey [11] suggests that deep
learning-based methods are accurate and robust to a wide range of attacks and sample sizes.
However, there are concerns regarding the sustainability of current approaches (e.g., intrusion
detection/prevention systems) when faced with the demands of modern networks and the in-
creasing level of human interaction [7]. In the age of IoT and Big Data, an increasing number of
connected devices and associated streams of network traffic have exacerbated the problem. In
addition, the delay in detection/response increases the chance of zero-day exploitation, whereby
an attacker discovers a previously unknown vulnerability and immediately exploits it. Improved
explainability of an AI model could quicken interpretation, making it more feasible to accelerate
the response. In prior work [12], we introduced a novel approach for an AI-based explainable
intrusion detection and response system, and demonstrated its effectiveness by infusing a popular
network security principle (the CIA principle) into the model for better explainability and
interpretability of the decision.
   However, domain knowledge incorporation involves feature engineering, and the amount
of information being sacrificed is still not clear. In addition, there is a need to understand the
impact of the information lost to feature engineering, which is key to success in applied Machine
Learning. The performance of predictive modeling also depends on the chosen algorithm/model,
the available data, and the features used. To quantify and mitigate the impact of the lost
information, we introduce the use of Principal Component Analysis (PCA), a feature extraction
technique that creates a new, smaller set of features that still captures most of the information
in the data. In summary, PCA finds the directions of maximum variance in high-dimensional
data (i.e., the direction of maximum data dispersion, given by an eigenvector, and the importance
of that direction, given by the corresponding eigenvalue) and projects the data onto a smaller
subspace (i.e., a smaller set of new features) while retaining most of the information in the
raw data.
   Quantification of explainability is another open challenge, and a known limitation in prior
work [12]. In this paper, we extend that work to quantify the level of explainability for intrusion
detection and response, adapting the approach of [13], originally proposed for credit default prediction.
   In summary, our main contributions in this work are as follows: (1) we extend our prior
work on explainable AI for intrusion detection to quantify and analyze the level of information
loss from the introduced generalization, and (2) we quantify the level of explainability in the
previously proposed XAI system for intrusion detection and response.
   We start with a background of related work (Section 2) followed by a description of the pro-
posed approach, an intuitive description of algorithms, and an overview of the dataset (Section
3) used in this work. In Section 4, we describe our experiments, followed by Section 5 which
contains a discussion on results from the experiments. We conclude with limitations and future
work in Section 6.
2. Background
Research in Explainable Artificial Intelligence (XAI) is a re-emerging field following earlier
work such as [14], [15], and [16]. That previous work focused primarily on explaining the decision
process of knowledge-based systems and expert systems. The renewed interest in XAI research
stems from recent advancements in AI and ML and their application to a wide range of areas,
as well as concerns over unethical use and undesired biases in the models. In addition, recent
concerns and laws by different governments are necessitating more research in XAI.
   AI-based IDSs have continued to show promising performance [6, 7, 8, 9, 10]. However,
there are still problems of long training times and a reliance on a human operator [7].
Incorporating domain knowledge for explainability has received little attention. Previously, we
introduced the concept of infusing domain knowledge for network intrusion detection [12].
   There are two primary directions of research towards evaluating the explainability
of an AI/ML model: (1) human study-based, and (2) model complexity-based. Human study-
based explainability evaluation is an open challenge. [17] argue that interpretability is not an
absolute concept; rather, it is relative to the target model, and may or may not be relative to
the human. Their finding suggests that a model is readily interpretable to a human when it
uses no more than seven pieces of information [18]. However, this might vary from task to
task and person to person. For instance, a domain expert might consume far more detailed
information, depending on their experience. Quantifying explainability with human subjects
therefore still requires considerable attention.
   In the literature, model complexity and (lack of) model interpretability are often treated
as the same [19]. For instance, in [20] and [21], model size is often used as a measure of
interpretability (e.g., number of decision rules, depth of the tree, number of non-zero coef-
ficients). More recently, [19] attempts to quantify the complexity of an arbitrary machine
learning model with a model-agnostic measure. In that work, the authors demonstrate that
when feature interaction (i.e., the correlation among features) increases, the quality of the
representations produced by explainability tools degrades. In fact, from our study of different explainability
tools (e.g., LIME, SHAP, PDP), we have found that the correlation among features is a key stum-
bling block to representing feature contribution in a model agnostic way. Keeping the issue of
feature interactions in mind, [19] proposes a technique that uses three measures: number of
features, interaction strength among features, and the main effect (excluding the interaction
part) of features to measure the complexity of a post-hoc model for interpretation. While their
work focuses primarily on post-hoc models, we use their approach as a precursor to formulate
our explainability quantification. In addition, our approach is model agnostic and applies at
any stage of explainability (e.g., pre-modeling, post-hoc).


3. Methodology
3.1. Proposed Approach
In previous work [12], the proposed approach for combining domain knowledge consists of
two components: a feature generalizer, which gives a generalized feature set with the help of
domain knowledge in two different ways; and an evaluator that produces and compares the
results from the “black box” model for multiple configurations of features: domain knowledge
infused features, newly constructed features from domain knowledge infused features, selected
features, and all features.

Figure 1: Proposed Technique

3.2. Feature Generalizer
The feature generalizer (Figure 1, top portion) takes the original features of the dataset
($X_1, X_2, \ldots, X_n \in X$, where $X$ is the set of all features) and infuses domain knowledge
to produce/reconstruct a concise and more interpretable feature set ($X_1', X_2', \ldots, X_k' \in X'$,
where $X'$ is the universal set of original/transformed/constructed features, and $k$ is much
smaller than $n$) in two different ways:

3.2.1. Feature Mapping
We use the CIA principles (confidentiality, integrity, and availability) as domain knowledge.
We analyze all types of attacks for the associated compromises in each component of the CIA
principles (see Table 2 in the Appendix). For example, a Heartbleed vulnerability is related to a
compromise in confidentiality, as an attacker could gain access to the memory of systems
protected by a vulnerable version of OpenSSL; a Web attack (e.g., SQL injection) is related to a
compromise in confidentiality and integrity (e.g., reading/writing data using an injected query);
and flooding a database server with injected complex queries (e.g., a cross join) would constitute
an availability compromise. Another example is an Infiltration attack, which is related to a
compromise in confidentiality, as it normally exploits a software vulnerability (e.g., Adobe
Acrobat Reader) to create a backdoor and reveal information (e.g., IPs). A port scan attack is
related to a compromise in confidentiality, as the attacker sends packets with varying destination
ports to learn the services and operating systems from the replies. All DoS and DDoS attacks are
related to a compromise in availability, as they aim to hamper the availability of services or data.
Furthermore, SSH-Patator and FTP-Patator are brute force attacks and are usually responsible
for a compromise in confidentiality. Finally, a botnet (i.e., robot network: a network of
malware-infected computers) could provide a remote shell, file upload/download, screenshot
capture, and key logging options, which have the potential for compromises in all of
confidentiality, integrity, and availability.
   From the feature ranking of the public CICIDS2017 dataset [22], for each type of attack,
we take the top three features according to their importance (i.e., feature importance from a
Random Forest Regressor) and calculate the mapping to the related compromises under the CIA
principles. For example, the feature Average Packet Size is renamed Avg Packet Size - A, where
-A indicates that it is a key feature for the compromise of availability. To obtain this mapping
between features and associated compromises, we first find the mapping between an attack
and its related compromises (from Table 2 in the Appendix), formulated as Formulas 1 and 2. In
other words, Formula 1 gives the name of the attack for which the feature is among the top three
features to identify that particular attack, and Formula 2 gives the associated compromises in C, I,
or A from the attack name. Thus, with the help of domain knowledge, we keep 22 features out
of a total of 78 features. We will refer to these features as the domain features.

$$f(\mathit{feature}) \rightarrow \mathit{attack} \qquad (1)$$

$$f(\mathit{attack}) \rightarrow C, I, \text{ or } A \qquad (2)$$
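
As an illustration of Formulas 1 and 2, the following minimal Python sketch chains the two lookups for a handful of features. The feature-to-attack entries are illustrative assumptions (the actual top-three features per attack come from the ranking in [22]); the attack-to-CIA entries follow Table 2 in the Appendix.

```python
# Minimal sketch of the two-step mapping: feature -> attack (Formula 1),
# attack -> associated CIA compromises (Formula 2).

# Illustrative subset only (assumed examples); the real mapping uses the
# top-3 features per attack from the CICIDS2017 feature ranking [22].
FEATURE_TO_ATTACK = {
    "Average Packet Size": "DDoS",
    "Total Length of Fwd Packets": "PortScan",
    "Flow Duration": "DoS Hulk",
}

# Excerpt of Table 2 (Appendix): attack -> related CIA components.
ATTACK_TO_CIA = {
    "DDoS": {"A"},
    "PortScan": {"C"},
    "DoS Hulk": {"A"},
    "Bot": {"C", "I", "A"},
}

def domain_tags(feature: str) -> set:
    """Return the CIA compromises associated with a feature."""
    attack = FEATURE_TO_ATTACK[feature]   # Formula 1
    return ATTACK_TO_CIA[attack]          # Formula 2

# "Average Packet Size" -> {"A"}, so it is renamed "Avg Packet Size - A".
print({f: domain_tags(f) for f in FEATURE_TO_ATTACK})
```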

3.2.2. Feature Construction
We also construct three new features, C, I, and A, from the domain features by quantitatively
calculating the compromises associated with each of the domain features. For that purpose, we
calculate the correlation coefficient vector of the dataset to understand whether an increase in
the value of a feature has a positive or negative impact on the target variable. We then convert
each entry of the correlation coefficient (coeff) vector $V$ into 1 or -1, depending on whether
the correlation coefficient is positive or negative. We also group the domain features and the
corresponding coeff tuples into three groups. Using Formulas 3, 4, and 5, we aggregate each
group (C, I, and A) of domain features into the three new features C, I, and A. We also scale all
feature values to the range 0 to 1 before starting the aggregation process. During the aggregation
for a particular group (e.g., C), if the correlation coefficient (e.g., $V_i$) for a feature (e.g., $C_i$)
of that group is negative, then the product of the feature value and the correlation coefficient is
subtracted; conversely, it is added if positive. In addition, when a feature is liable for more than
one compromise, the feature value is split among the associated elements of the CIA principles.
$$C = \sum_{i=0}^{n} C_i V_i \qquad (3)$$

$$I = \sum_{i=0}^{n} I_i V_i \qquad (4)$$

$$A = \sum_{i=0}^{n} A_i V_i \qquad (5)$$
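
A minimal sketch of this aggregation is shown below, assuming a pandas DataFrame that holds the 22 domain features and a binary target column; the column names and the even split of a feature across multiple CIA tags are simplifying assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def construct_cia_features(df: pd.DataFrame, target: str,
                           feature_tags: dict) -> pd.DataFrame:
    """Aggregate scaled domain features into three new features C, I, and A
    (Formulas 3-5), weighting each feature by the sign of its correlation
    with the target and splitting it evenly across its CIA tags."""
    features = list(feature_tags)
    # Scale all domain feature values to the [0, 1] range.
    scaled = pd.DataFrame(MinMaxScaler().fit_transform(df[features]),
                          columns=features, index=df.index)
    # Sign (+1 / -1) of each feature's correlation coefficient with the target.
    signs = {f: 1.0 if df[f].corr(df[target]) >= 0 else -1.0 for f in features}

    out = pd.DataFrame(0.0, index=df.index, columns=["C", "I", "A"])
    for feature, tags in feature_tags.items():
        share = 1.0 / len(tags)          # split across multiple compromises
        for tag in tags:
            out[tag] += signs[feature] * scaled[feature] * share
    return out

# Usage (hypothetical column names):
# cia = construct_cia_features(data, "Label", {"Avg Packet Size - A": {"A"}, ...})
```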


3.3. Evaluator
The task of the evaluator (Figure 1, bottom portion) is to execute the supervised models or
algorithms and compare their performance (in detecting malicious and benign records) across
four different configurations of features: (1) using all features, (2) using selected features
(selection is done by a feature selection algorithm), (3) using domain knowledge infused features,
and (4) using the newly constructed features C, I, and A from domain knowledge infused features.
In addition, the evaluator performs the following two tests:
   1. Explainability Test: The purpose of this test is to discover the comparative advantages
      or disadvantages of incorporating domain knowledge in the experiment; and
   2. Generalizability Test: The purpose of this test is to analyze how different approaches
      perform in unknown or unseen attack detection. All training records for a particular
      attack are deleted one at a time, and the performance of the model is evaluated on the
      same test set, which includes records from unknown or unseen attacks. Details of these
      tests are described in Section 4.

3.4. Algorithms
We use six different standard algorithms for predicting malicious records: one is a probabilistic
classifier (Naive Bayes) based on Bayes' theorem, and the remaining five are supervised
“black box” models: Artificial Neural Network (ANN), Support Vector Machine (SVM), Random
Forest (RF), Extra Trees (ET), and Gradient Boosting (GB). In addition, we use Principal
Component Analysis (PCA) to quantify the information sacrifice, and a proxy-based
explainability quantification method to quantify explainability.

3.4.1. Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a feature extraction technique that creates a new,
smaller set of features that still captures most of the information in the data. In other words,
PCA finds the directions of maximum variance in high-dimensional data (i.e., the direction of
maximum data dispersion, given by an eigenvector, and the importance of that direction, given
by the corresponding eigenvalue) and projects the data onto a smaller subspace (i.e., a smaller
set of new features) while retaining most of the information in the raw data. PCA is therefore a
viable candidate for estimating the level of information lost to feature engineering, so we use it
to quantify the information loss induced by combining domain knowledge for better explainability.
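
The following minimal scikit-learn sketch shows how this estimate can be obtained, assuming X is the matrix of the 78 original features; it returns the number of principal components needed to reach a given share of the total variance, as used in Section 5.3.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def components_for_variance(X: np.ndarray, threshold: float = 0.9999) -> int:
    """Fit PCA on standardized features and return how many principal
    components are needed for the cumulative explained variance ratio
    to reach the given threshold."""
    X_std = StandardScaler().fit_transform(X)
    pca = PCA().fit(X_std)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cumulative, threshold) + 1)

# Usage (X is the 78-feature matrix of the sampled CICIDS2017 data):
# n_pc = components_for_variance(X)   # 38 PCs cover ~100% of the variance (Section 5.3)
```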

3.4.2. Explainability Quantification Method
In our prior work [13], we proposed a proxy task-based explainability quantification method
for XAI in credit default prediction. In this work, we apply that approach to quantify
explainability in XAI for intrusion detection. In fact, beyond the previously demonstrated field
of finance and, in this paper, cyber-security, there is nothing about our proposed explainability
quantification method that prohibits it from being applied to other domains. A proxy task-based
explainability quantification method considers different properties of the output representation
(e.g., depth of a decision tree, length of a rule list) as metrics for evaluation. Usually, humans can
relate and process 7±2 pieces of information (i.e., cognitive chunks) to understand something [18].
Suppose that, in the most generalized form, the quality of an explanation depends on the number
of cognitive chunks that the recipient has to relate in order to understand the explanation (i.e.,
the fewer, the better). Let E = explainability score; $N_c$ = number of cognitive chunks; I =
interaction; $N_i$ = number of input cognitive chunks; and $N_o$ = number of cognitive chunks
involved in the explanation representation (i.e., output cognitive chunks).
$$E = \frac{1}{N_c} \qquad (6)$$
However, sometimes, these cognitive chunks are correlated and their influences are not mutu-
ally exclusive. This interaction among cognitive chunks complicates the explainability. So we
penalize Formula 6 for having an interaction among cognitive chunks, resulting in Formula 7.
$$E = \frac{1}{N_c} + (1 - I) \qquad (7)$$
where the interaction I ranges between 0 and 1; the smaller the interaction, the better the
explainability, so we take its complement.
   Furthermore, both the number of input cognitive chunks in the model and the number of
output cognitive chunks involved in the representation of output are important to understand
the causal relationship, which is vital for explanation. While the ideal explainability case would
be when there is only one input and one output cognitive chunk (no chance of interaction), that
is unusual in real-world situations. Following the segregation of input and output cognitive
chunks, Formula 7 can be re-written as Formula 8:
$$E = \frac{1}{N_i} + \frac{1}{N_o} + (1 - I) \qquad (8)$$
where $N_i$ refers to the number of input cognitive chunks and $N_o$ refers to the number of
cognitive chunks involved in the explanation representation (i.e., output cognitive chunks).
Usually, the more cognitive chunks there are, the more complicated the explanation becomes. So,
the ratio of the best possible case (i.e., one cognitive chunk) to the observed case is added
towards the total explainability.
   After adding a weight term to each of the three predicates, Formula 8 becomes Formula 9:
$$E = \frac{w_1}{N_i} + \frac{w_2}{N_o} + w_3 (1 - I) \qquad (9)$$
Formula 9 can then be used to quantify the explainability of the explanation method (i.e., global
explainability). We can also use Formula 9 to quantify the explainability of an instance-level
prediction (i.e., local explainability). In that case, the first predicate of Formula 9 (including its
weight term) remains the same (i.e., the same number of input chunks). However, predicates 2
and 3 will differ from instance to instance, as a different set of cognitive chunks with different
interaction strengths might be involved in the representation of the explanation for a particular
instance, since explanations are selective.
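
A minimal sketch of Formula 9 is given below. The equal weights w1 = w2 = w3 = 1/3 are an assumption (the weights are not fixed in the text), chosen so that the three predicates contribute equally.

```python
def explainability_score(n_input: int, n_output: int, interaction: float,
                         weights=(1/3, 1/3, 1/3)) -> float:
    """Proxy explainability score E of Formula 9.

    n_input     -- number of input cognitive chunks (Ni)
    n_output    -- number of output cognitive chunks in the explanation (No)
    interaction -- interaction strength I, in [0, 1]
    weights     -- (w1, w2, w3); equal weights assumed here
    """
    w1, w2, w3 = weights
    return w1 / n_input + w2 / n_output + w3 * (1.0 - interaction)

# Example: a representation with 3 input chunks, 3 output chunks, and
# interaction strength 0.68 scores roughly 0.33.
print(round(explainability_score(3, 3, 0.68), 2))
```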

3.5. Data
In this work, we use the recent and comprehensive IDS dataset CICIDS2017, published in 2017,
which contains various attacks: DoS, DDoS, Brute Force, XSS, SQL Injection, Infiltration,
Portscan, and Botnet. This dataset was created to eliminate the shortcomings of previous
well-known IDS datasets dating back to 1998 (e.g., lack of traffic diversity and volume, limited
attack variety, anonymized packet information, and outdated traffic), such as DARPA98, KDD99,
ISC2012, ADFA13, DEFCON, CAIDA, LBNL, CDX, Kyoto, Twente, and UMASS. It is a labeled
dataset containing 78 network traffic features (some of which are listed in our prior work [13])
extracted and calculated from pcap files using the CICFlowMeter software [23] for all benign and
intrusive flows [22]. The dataset includes seven common attacks satisfying real-world criteria,
and is publicly available at https://www.unb.ca/cic/datasets/ids-2017.html.


4. Experiments
4.1. Experimental Setup
We execute the experiments on a GPU-enabled Linux machine with 12 GB of RAM and a
Core i7 processor. All supervised machine learning algorithms are implemented using the
Python-based Scikit-learn [24] library. In addition, we use TensorFlow [25] for the Artificial
Neural Network. Due to resource limitations, instead of using the whole dataset, we take a
stratified sample of the data that is large enough (i.e., 300K records) for a single GPU-enabled
commodity machine. We make the sampled dataset available to the research community at [26].
Furthermore, we use 70% of the data for training the models and keep 30% of the data as a
holdout set to test the models. We confirm that the target class has the same ratio in both sets.
To avoid the adverse effect of class imbalance on classification performance, we re-sample the
minority class of the training set using SMOTE [27] to balance the dataset. However, we do not
re-sample the test set, as real-world data is skewed, and oversampling the test set could exhibit
an over-optimistic performance.
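
A minimal sketch of this split-then-oversample setup is shown below, assuming X and y hold the features and labels of the sampled dataset; only the training portion is passed to SMOTE, as described above.

```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

def split_and_balance(X, y, seed: int = 42):
    """Stratified 70/30 split, then SMOTE on the training portion only."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed)
    # Oversample the minority class of the training set; the test set is
    # left skewed, as real-world traffic would be.
    X_train_bal, y_train_bal = SMOTE(random_state=seed).fit_resample(X_train, y_train)
    return X_train_bal, X_test, y_train_bal, y_test
```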
   We run all supervised machine learning algorithms using five different approaches:
   1. With all features: using all 78 features of the dataset without discarding any features.
   2. With selected features: using Random Forest Regressor (adapting the work of [22]) to
      select important features of the dataset, giving us 50 important features having a nonzero
      influence on the target variable;
   3. With domain knowledge infused features: using infused domain knowledge features (see
      Section 3.2.1), we will use the term domain features interchangeably to express it in short
      form;
   4. With newly constructed features from domain knowledge infused features: using newly
      constructed features C, I, and A (see Section 3.2.2) from domain knowledge infused fea-
      tures, we will use the term domain features-constructed interchangeably to express it in
      short form; and
   5. With an increasing number of Principal Components: using different combinations of
      principal components, in increasing order, as a feature set.
   We run two types of experiments using each of the first four feature settings. The last
feature setting, principal components as features, is used primarily to measure the information
sacrifice.

4.2. Explainability Test
For this test, we run the six supervised algorithms (RF, ET, SVM, GB, ANN, and NB) using the
four described feature settings and report the results in Section 5.1. Unlike NB, the other
classifiers are “black box” in nature; NB is a probabilistic classifier based on Bayes' theorem
with a strong conditional independence assumption among features.

4.3. Generalizability Test
Domain knowledge infusion is capable of enhancing explainability with the added benefit of
enhancing generalizability. To test the generalizability of the approach, we train the classifier
without any representative of a particular attack and test it on a set that contains representatives
of that attack, in order to classify them as malicious/benign. To be more specific, we delete all
records of a particular attack from the training set, train the classifier with the records of the
remaining 13 attacks, and test the classifier with all 14 attacks. We report the percentage of the
deleted attack's records that are correctly detected as malicious (see Section 5.2). We repeat this,
one at a time, for all 14 attacks. We make the source code available to the research community
to replicate the experiments at [28].
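
The leave-one-attack-out loop can be sketched as follows, assuming pandas DataFrames train_df and test_df with a hypothetical 'Attack' column naming the attack type and a binary 'Label' column (1 = malicious); the Random Forest settings are placeholders.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def unseen_attack_detection(train_df: pd.DataFrame, test_df: pd.DataFrame,
                            features: list) -> dict:
    """For each attack type, retrain without its records and report the
    percentage of that attack's test records detected as malicious."""
    rates = {}
    attacks = [a for a in train_df["Attack"].unique() if a != "BENIGN"]
    for attack in attacks:
        reduced = train_df[train_df["Attack"] != attack]        # drop one attack
        clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
        clf.fit(reduced[features], reduced["Label"])
        held_out = test_df[test_df["Attack"] == attack]          # unseen attack
        preds = clf.predict(held_out[features])
        rates[attack] = 100.0 * (preds == 1).mean()
    return rates
```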


5. Results
The following sections discuss results from the two categories of experiments that we performed
in our prior work [12], and from the new experiments quantifying (1) the information sacrifice
and (2) the explainability.

5.1. Findings from Explainability Test
Comparing the performance using all features vs. selected features, Table 3 (see Appendix) shows
that models using all features (denoted with an appended -A, for instance RF-A) tend to show
better results in terms of all performance metrics. However, the difference from the selected
features setting is negligible (<.0007 for RF) for every performance metric, which might be a
result of eliminating features with little significance. In addition, Random Forest outperforms
the other algorithms (SVM, ET, GB, ANN, and NB) under this feature setting (i.e., using all
features). So we consider the results using all features as a baseline to compare against our
proposed approach.
   Before comparing the results of our proposed approach with the baseline (i.e., using all
features), we seek the best feature setting among the two domain-related feature settings: domain
knowledge infused features and newly constructed features. In other words, in our attempt to
find the better approach between domain knowledge infused features and the newly constructed
features (C, I, and A) derived from them, we find that, in almost all cases, the model with domain
knowledge infused features (denoted with an appended -D1, for instance RF-D1) performs better
than its counterpart (see Table 4 in the Appendix). For RF, the maximum performance gap is .2
in recall; for ET, that gap is .048 with similar precision. As the domain features (22 features)
contain far more detail than the newly constructed features C, I, and A (3 features), they lose
fewer details. In terms of individual algorithms, RF is again a clear winner, this time using
domain features. Although NB and ANN exhibit better recall using the constructed features,
this comes with compromises in precision. So, overall, we consider the domain features setting
the best over the constructed features.
   While we know that all features is the best feature setting from the comparison of all
features vs. selected features in Table 3 (see Appendix), we also know that domain features is
the best setting from the comparison of domain features vs. constructed features (Table 4 in the
Appendix). So we further compare the performance of models using the two best settings: all
features (i.e., the baseline) vs. domain features. We find that, among all models, RF using all
features (denoted with an appended -A, for instance RF-A) performs better than all other
algorithms (see Table 5 and Figure 5 in the Appendix). Interestingly, RF using domain knowledge
infused features (denoted with an appended -D1, for instance RF-D1) also shows promising
performance. The difference between these two in terms of any performance metric is negligible
(<.005). In fact, the result of RF using the domain knowledge infused feature setting is better
than what [22] reports using the same dataset. The slight improvement might stem from the
experimental settings (e.g., training and test set split, re-sampling techniques). Furthermore, in
the domain knowledge infused feature setting we are using only 22 features out of 78 total, where
each feature indicates the associated compromises (e.g., confidentiality, integrity, or availability),
which makes the results more explainable and interpretable compared to the counterpart. The
prediction for a particular sample can be represented as:
$$P(D) = b + \sum_{g=0}^{G} \mathit{contribution}(g) \qquad (10)$$
where b is the model average, g indexes the generalized domain features (e.g., ACK Flag Count
- C), and P(D) is the probability value of the decision. Instead of using contributions from each
of the domain features, we can express the output in terms of the contribution from each element
of the domain concept. For that, we need to aggregate the contributions from all features into
three groups (C, I, and A). This enables an analyst to understand the nature of the attack more
quickly (see Figure 4 in the Appendix). For instance, when the greater portion of the feature
contributions for a sample comes from features tagged with -A (i.e., Availability), it might be a
DDoS attack, which usually comes with a very high compromise in the availability of data or
services. We use the iml package from the programming language R to generate the breakdown
of feature contributions for a particular sample's prediction (see Figure 4 in the Appendix).
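
Although the breakdown itself is generated with R's iml package, the aggregation step can be sketched in Python as follows; the per-feature contribution values below are hypothetical placeholders.

```python
# Aggregate per-feature contributions of a single prediction into the three
# domain groups C, I, and A, using the "- C"/"- I"/"- A" suffix attached to
# each domain feature name by the feature mapping.
def group_contributions(contributions: dict) -> dict:
    groups = {"C": 0.0, "I": 0.0, "A": 0.0}
    for name, value in contributions.items():
        tag = name.rsplit("- ", 1)[-1].strip()   # "Avg Packet Size - A" -> "A"
        if tag in groups:
            groups[tag] += value
    return groups

# Hypothetical breakdown of one sample's prediction:
sample = {"Avg Packet Size - A": 0.21, "ACK Flag Count - C": 0.05,
          "Flow Duration - A": 0.12}
print(group_contributions(sample))   # most contribution from A -> likely DDoS
```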

5.2. Findings from Generalizability Test
The purpose of the generalizability test is to test the resiliency against unknown attacks. First,
we use Random Forest (RF), the best performing algorithm so far, under all four feature settings.
As shown in Table 6 and Figure 6 (see Appendix), except for the constructed feature setting
(denoted by Dom-Cons. in Figure 6), the performance of the other feature settings (all, selected,
and domain) is similar. The constructed features fail to provide comparable performance for RF,
as this setting has only three features and loses data detail (i.e., too much generalization).
Surprisingly, a few unknown attacks are only detectable using the domain knowledge infused
features. For instance, Web Attack Sql Injection is detected as suspicious only with the domain
knowledge infused features. Overall, although the domain knowledge infused feature setting
performs slightly worse than the all-features setting, it comes with an explainable feature set
and the added capability of identifying a few unknown attacks.
   To reiterate, the constructed feature set consists of only three features (C, I, and A) obtained
by aggregating the domain knowledge infused features. As it is composed of only three features,
it is an extreme generalization and loses a lot of data detail. However, it comes with unique
capabilities that are realized after applying a statistical approach (Naive Bayes) to the dataset.
We find (see Table 7 in the Appendix) that, for NB, the newly constructed feature setting is the
best, as NB is able to detect unknown attacks with accuracy similar to the other feature settings
with RF in Table 6 (see Appendix). The most interesting aspect of this capability is that the
feature set is composed of only three features (C, I, and A), takes comparatively less time to
execute, and comes with the added benefit of very good explainability. Once the prediction is
expressed as a percentage of influence from each of C, I, and A, the analyst can perceive the
level of compromise more intuitively and obtain hints about the type of attack (e.g., DDoS will
show a high percentage of A, i.e., a compromise in Availability).
   However, from Tables 3, 4, and 5 (see Appendix), we can see that NB's performance comes at
a cost in precision and recall (i.e., it produces comparatively more false positives and false
negatives). In addition, NB is a poor probability estimator of the predicted output [29].
Nevertheless, NB with the constructed feature setting could be recommended as an additional
IDS for quick interpretation of huge volumes of traffic data, provided the decision is treated as
tentative and subject to a further sanity check. We also calculate the average time taken by each
algorithm across all four feature settings and find that NB is the fastest algorithm: RF, ET, GB,
ANN, and SVM take 2.80, 9.27, 77.06, 15.07, and 444.50 times the execution time of NB,
respectively. Besides, the best algorithm, RF (1st in terms of the performance metrics and 2nd in
terms of execution time), can be executed in parallel using Apache Spark for a far better
run-time [30], making it highly scalable to big data problems.
   Overall, domain knowledge infusion provides better explainability with negligible compromises
in performance. In addition, the generalization provides better execution time and resiliency
against unknown attacks.

5.3. Information Sacrifices from Explainability
Domain knowledge incorporation involves feature engineering, and the amount of information
that we are sacrificing is still not clear.
   In our proposed approach, we use different techniques for feature engineering: (a) the domain
mapped features approach falls under feature selection, and (b) the newly constructed features
approach falls under feature construction. Furthermore, we leverage Principal Component
Analysis (PCA), a feature extraction technique that creates a new, smaller set of features that
still captures most of the information in the data. The blue bottom line in Figure 2 shows the
ratio of variance (i.e., percentage of variance) covered by an increasing number of principal
components (PCs), from 1 PC to 78 PCs. Once we reach 38 PCs, the covered variance is 100%,
and adding further PCs does not change the covered variance, as 38 PCs are enough to cover
100% of the variance in the data. Therefore, we need at least 38 PCs to explain 100% of the
variance in the data.

Figure 2: Variance ratio and cumulative variance for different principal components.
   Figure 3 compares the performance of the different feature engineering techniques used in
this paper. For PCA, we use the first 38 principal components, which explain 100% of the data
variance (Figure 2). From Figure 3, we can see that in terms of all metrics, the PCA, Original, and
Domain feature settings show the same performance (.99) up to two decimal digits, and beyond
two decimal digits PCA minimally outperforms the other approaches. On the other hand, the
Constructed feature setting performs slightly worse (by 4% for accuracy, 5% for precision, 20%
for recall, and 4% for AUC) compared to the other feature settings. In the constructed feature
setting, we drastically reduce the number of features to 3, which is a reason for the compromise
in performance. Although PCA features lose the explainability of the original features, they
provide an estimate of the information present in the raw data, irrespective of the target of the
problem.
   In the end, our proposed approach (domain mapped features) provides competitive accuracy,
precision, recall, and F-score among all of these feature engineering techniques, and its
performance is also similar to PCA (up to two decimal digits). So, we can conclude that our
approach retains enough valuable information and does not trade away valuable information to
achieve explainability.
   Overall, even though infusing domain knowledge might lead to some compromise in
performance, it clearly ensures better explainability and interpretability, as the output is made
from a concise and familiar set of features from the domain.
Figure 3: Comparison of performance using different feature settings.


Table 1
Comparison of explainability
                                             Original   Domain     Constructed
                      Input chunks (Ni )     78         22         3
                      Output chunks (No )    78         5          3
                      Int. Strength (I)      1          0.8083     0.6803
                      Explainability (E)     0.0085     0.1450     0.3284


5.4. Explainability Quantification
We represent the predicted output as a composition of the individual elements of the domain
principle CIA (i.e., the newly constructed features); see Figure 4 in the Appendix.
   One will notice that this representation of the prediction in terms of the newly constructed
features (Figure 4 in the Appendix) provides better explainability, as the final prediction is
segregated into the individual influences of a very concise set of intuitive features (i.e., 3
compared to 78). However, there is no way to quantify the level of perceived explainability. To
use our proposed formula (Formula 9) to quantify explainability, we also need to calculate the
interaction strength (I). We measure the interaction strength among features using R's iml
package, which uses the partial dependence of an individual feature as the basis for calculating
the interaction strength (I).
   Applying Formula 9 to the metadata (Table 1) of the three different feature settings, we see
that the newly constructed features (CIA principle) provide the best explainability score of
0.3284, a considerable improvement over the 0.0085 that we obtain using the original features
(Table 1). In fact, even if we apply state-of-the-art post-hoc interpretability/explainability
methods such as SHAP, the explainability would still be limited to 0.0085, as such methods do
not reduce the number of cognitive chunks needed to represent the output. Using the domain
related features, the explainability score is 0.1450, which is better than using the original
features, although worse than using the newly constructed features.
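
   As a worked check, assuming equal weights $w_1 = w_2 = w_3 = 1/3$ in Formula 9 (an assumption; the weights are not stated explicitly), the constructed-features column of Table 1 is approximately reproduced:

$$E = \frac{1/3}{3} + \frac{1/3}{3} + \frac{1}{3}(1 - 0.6803) \approx 0.111 + 0.111 + 0.107 \approx 0.33,$$

which matches the reported score of 0.3284; the original-features column similarly yields $1/(3 \cdot 78) + 1/(3 \cdot 78) + 0 \approx 0.0085$.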
6. Conclusion and Future Work
We combine domain knowledge with a “black box” system to enhance the explainability of a
complex AI-based model's decisions. We also quantify the level of information sacrifice induced
by the domain knowledge infusion. To advance explainability quantification, which is one of the
open challenges in XAI, we apply a proxy task-based explainability quantification method to
network intrusion detection. Our experimental results suggest that, as a result of domain
knowledge infusion, the level of information sacrifice is negligible, and the explainability score,
using a recently proposed proxy method, is better than the case of not using domain knowledge.
   Going forward, as an extension of this work, we would like to incorporate human studies into
the explainability quantification, and investigate the effectiveness of domain knowledge infusion
across different learning approaches (e.g., supervised, unsupervised, semi-supervised) and
application areas (e.g., natural language processing, image recognition).


References
 [1] B. Goodman, S. Flaxman, European union regulations on algorithmic decision-making
     and a “right to explanation”, AI Magazine 38 (2017) 50–57.
 [2] B. Wyden, Algorithmic accountability, https://www.wyden.senate.gov/imo/media/doc/Algorithmic%20Accountability%20Act%20of%202019%20Bill%20Text.pdf, 2019.
 [3] M. Turek, Explainable AI, https://www.darpa.mil/program/explainable-artificial-intelligence, 2019.
 [4] K. Rankin, The dark secret at the heart of AI, https://www.technologyreview.com/s/604087/the-dark-secret-at-the-heart-of-ai/, 2019.
 [5] M. Doyle, Don't be lulled into a false sense of security, https://www.securityroundtable.org/dont-lulled-false-sense-cybersecurity/, 2019.
 [6] E. Hodo, X. Bellekens, A. Hamilton, P.-L. Dubouilh, E. Iorkyase, C. Tachtatzis, R. Atkin-
     son, Threat analysis of iot networks using artificial neural network intrusion detection
     system, in: 2016 International Symposium on Networks, Computers and Communica-
     tions (ISNCC), IEEE, 2016, pp. 1–6.
 [7] N. Shone, T. N. Ngoc, V. D. Phai, Q. Shi, A deep learning approach to network intrusion
     detection, IEEE Transactions on Emerging Topics in Computational Intelligence 2 (2018)
     41–50.
 [8] J. Kim, J. Kim, H. L. T. Thu, H. Kim, Long short term memory recurrent neural network
     classifier for intrusion detection, in: 2016 International Conference on Platform Technol-
     ogy and Service (PlatCon), IEEE, 2016, pp. 1–5.
 [9] A. Javaid, Q. Niyaz, W. Sun, M. Alam, A deep learning approach for network intrusion
     detection system, in: Proceedings of the 9th EAI International Conference on Bio-inspired
     Information and Communications Technologies (formerly BIONETICS), ICST (Institute
     for Computer Sciences, Social-Informatics and . . . , 2016, pp. 21–26.
[10] Z. Li, W. Sun, L. Wang, A neural network based distributed intrusion detection system
     on cloud platform, in: 2012 IEEE 2nd international conference on Cloud Computing and
     Intelligence Systems, volume 1, IEEE, 2012, pp. 75–79.
[11] B. Dong, X. Wang, Comparison deep learning method to traditional methods using for
     network intrusion detection, in: 2016 8th IEEE International Conference on Communi-
     cation Software and Networks (ICCSN), IEEE, 2016, pp. 581–585.
[12] S. R. Islam, W. Eberle, S. K. Ghafoor, A. Siraj, M. Rogers, Domain knowledge aided ex-
     plainable artificial intelligence for intrusion detection and response, AAAI-MAKE 2020
     Combining Machine Learning and Knowledge Engineering in Practice - Volume I: Spring
     Symposium (2020).
[13] S. R. Islam, W. Eberle, S. K. Ghafoor, Towards quantification of explainability in explain-
     able artificial intelligence methods, AAAI Publications, The Thirty-Third International
     Flairs Conference (2020).
[14] B. Chandrasekaran, M. C. Tanner, J. R. Josephson, Explaining control strategies in problem
     solving, IEEE Intelligent Systems (1989) 9–15.
[15] W. R. Swartout, J. D. Moore, Explanation in second generation expert systems, in: Second
     generation expert systems, Springer, 1993, pp. 543–585.
[16] W. R. Swartout, Rule-based expert systems: The MYCIN experiments of the Stanford
     Heuristic Programming Project: B. G. Buchanan and E. H. Shortliffe (Addison-Wesley,
     Reading, MA, 1984); 702 pages, 1985.
[17] A. Dhurandhar, V. Iyengar, R. Luss, K. Shanmugam, Tip: Typifying the interpretability of
     procedures, arXiv preprint arXiv:1706.02952 (2017).
[18] G. A. Miller, The magical number seven, plus or minus two: Some limits on our capacity
     for processing information., Psychological review 63 (1956) 81.
[19] C. Molnar, G. Casalicchio, B. Bischl, Quantifying model complexity via functional decom-
     position for better post-hoc interpretability, in: Joint European Conference on Machine
     Learning and Knowledge Discovery in Databases, Springer, 2019, pp. 193–204.
[20] J. Fürnkranz, D. Gamberger, N. Lavrač, Rule learning in a nutshell, in: Foundations of
     Rule Learning, Springer, 2012, pp. 19–55.
[21] H. Yang, C. Rudin, M. Seltzer, Scalable bayesian rule lists, in: Proceedings of the 34th
     International Conference on Machine Learning-Volume 70, JMLR. org, 2017, pp. 3921–
     3930.
[22] I. Sharafaldin, A. H. Lashkari, A. A. Ghorbani, Toward generating a new intrusion detec-
     tion dataset and intrusion traffic characterization., in: ICISSP, 2018, pp. 108–116.
[23] A. H. Lashkari, G. Draper-Gil, M. S. I. Mamun, A. A. Ghorbani, Characterization of tor
     traffic using time based features., in: ICISSP, 2017, pp. 253–262.
[24] Scikit-learn: Machine learning in python, https://scikit-learn.org/stable/, 2019.
[25] Tensorflow, https://www.tensorflow.org/, 2019.
[26] Domain-knowledge-aided dataset, https://github.com/SheikhRabiul/domain-knowledge-aided-explainable-ai-for-intrusion-detection-and-response/tree/master/data/combined_sampled.zip, 2019.
[27] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, Smote: synthetic minority
     over-sampling technique, Journal of artificial intelligence research 16 (2002) 321–357.
[28] Domain-knowledge-aided code, https://github.com/SheikhRabiul/domain-knowledge-aided-explainable-ai-for-intrusion-detection-and-response, 2019.
[29] H. Zhang, The optimality of naive bayes, AA 1 (2004) 3.
[30] J. Chen, K. Li, Z. Tang, K. Bilal, S. Yu, C. Weng, K. Li, A parallel random forest algorithm
     for big data in a spark cloud computing environment, IEEE Transactions on Parallel and
     Distributed Systems 28 (2016) 919–933.
A. Appendix




Table 2
Mapping of network attack with related component of CIA principles
                           Attack             Related component of CIA
                           DoS GoldenEye      A
                           Heartbleed         C
                           DoS hulk           A
                           DoS Slowhttp       A
                           DoS slowloris      A
                           SSH-Patator        C
                           FTP-Patator        C
                           Web Attack         C, I, A
                           Infiltration      C
                           Bot                C, I, A
                           PortScan           C
                           DDoS               A




Figure 4: Breakdown of the prediction for a random sample in terms of C, I, and A.
Table 3
Performance using all features vs selected features
                              Alg.         Acc.     Prec.    Rec.       F-score   AUC

                              RF-A         0.9987   0.9965   0.9971     0.9968    0.9997
                              RF-S         0.9986   0.9962   0.9966     0.9964    0.9997
                              Difference   0.0002   0.0003   0.0006     0.0005    0.0000
                              ET-A         0.9981   0.9951   0.9951     0.9951    0.9994
                              ET-S         0.9980   0.9950   0.9950     0.9950    0.9994
                              Difference   0.0001   0.0002   0.0001     0.0001    0.0000
                              ANN-A        0.9802   0.9155   0.9908     0.9516    0.9984
                              ANN-S        0.9740   0.8929   0.9860     0.9372    0.9968
                              Difference   0.0062   0.0226   0.0047     0.0145    0.0017
                              SVM-A        0.9109   0.6996   0.9595     0.8092    0.9780
                              SVM-S        0.8869   0.6433   0.9565     0.7692    0.9746
                              Difference   0.0239   0.0563   0.0030     0.0400    0.0034
                              GB-A         0.9960   0.9854   0.9944     0.9899    0.9995
                              GB-S         0.9957   0.9840   0.9945     0.9892    0.9996
                              Difference   0.0003   0.0014   (0.0001)   0.0007    (0.0001)
                              NB-A         0.7753   0.4371   0.4888     0.4615    0.8601
                              NB-S         0.7621   0.4144   0.5019     0.4539    0.8508
                              Difference   0.0132   0.0228   (0.0131)   0.0076    0.0093




Table 4
Performance using domain features vs constructed features
                              Alg.         Acc.     Prec.    Rec.       F-score   AUC

                              RF-D1        0.9973   0.9920   0.9945     0.9932    0.9993
                              RF-D2        0.9511   0.9446   0.7985     0.8654    0.9572
                              Difference   0.0463   0.0475   0.1960     0.1278    0.0421
                              ET-D1        0.9969   0.9913   0.9932     0.9923    0.9989
                              ET-D2        0.9756   0.9321   0.9448     0.9384    0.9954
                              Difference   0.0214   0.0592   0.0483     0.0538    0.0036
                              ANN-D1       0.9497   0.8300   0.9362     0.8799    0.9865
                              ANN-D2       0.5952   0.3241   0.9721     0.4862    0.7921
                              Difference   0.3544   0.5059   (0.0359)   0.3937    0.1945
                              SVM-D1       0.8489   0.5747   0.8968     0.7005    0.9252
                              SVM-D2       0.7195   0.3739   0.6281     0.4687    0.7886
                              Difference   0.1294   0.2008   0.2687     0.2318    0.1366
                              GB-D1        0.9881   0.9513   0.9904     0.9705    0.9986
                              GB-D2        0.9230   0.7692   0.8701     0.8165    0.9789
                              Difference   0.0652   0.1821   0.1204     0.1539    0.0198
                              NB-D1        0.7982   0.4881   0.5028     0.4953    0.8553
                              NB-D2        0.5591   0.2687   0.7195     0.3913    0.6591
                              Difference   0.2391   0.2194   (0.2167)   0.1040    0.1962




Figure 5: Performance deviations of using domain knowledge infused features from baseline
Table 5
Performance using all features vs domain features
                             Alg.           Acc.        Prec.      Rec.       F-score    AUC

                             RF-A           0.9987      0.9965     0.9971     0.9968     0.9997
                             RF-D1          0.9973      0.9920     0.9945     0.9932     0.9993
                             Difference     0.0014      0.0045     0.0027     0.0036     0.0004
                             ET-A           0.9981      0.9951     0.9951     0.9951     0.9994
                             ET-D1          0.9969      0.9913     0.9932     0.9923     0.9989
                             Difference     0.0011      0.0038     0.0020     0.0029     0.0004
                             ANN-A          0.9802      0.9155     0.9908     0.9516     0.9984
                             ANN-D1         0.9497      0.8300     0.9362     0.8799     0.9865
                             Difference     0.0305      0.0855     0.0546     0.0717     0.0119
                             SVM-A          0.9109      0.6996     0.9595     0.8092     0.9780
                             SVM-D1         0.8489      0.5747     0.8968     0.7005     0.9252
                             Difference     0.0619      0.1249     0.0627     0.1087     0.0528
                             GB-A           0.9960      0.9854     0.9944     0.9899     0.9995
                             GB-D1          0.9881      0.9513     0.9904     0.9705     0.9986
                             Difference     0.0079      0.0341     0.0039     0.0194     0.0009
                             NB-A           0.7753      0.4371     0.4888     0.4615     0.8601
                             NB-D1          0.7982      0.4881     0.5028     0.4953     0.8553
                             Difference     (0.0229)    (0.0510)   (0.0140)   (0.0338)   0.0048




Figure 6: Unknown attack detection rate using Random Forest




Table 6
Performance of unseen attack detection using RF
                           Attack               Count     All(%)   Sel.(%)    Dom.(%)    Cons.(%)

                           DDoS                 4184      99.90    99.90      99.90      62.86
                           PortScan             4973      99.90    99.94      99.94      66.28
                           Bot                  54        77.78    77.78      75.93      22.22
                           Infiltration         1         100      100        100        0.00
                           Web Attack-BF        49        95.92    95.92      91.84      75.51
                           Web Attack-XSS       23        95.65    95.65      91.30      65.22
                           Web Attack-Sql       1         0.00     0.00       100        0.00
                           FTP-Patator          251       99.20    100        99.20      81.67
                           SSH-Patator          198       98.99    99.49      96.97      75.76
                           DoS slowloris        188       99.47    99.47      98.94      61.70
                           DoS Slowhttptest     174       99.43    99.43      96.55      31.61
                           DoS Hulk             7319      99.71    99.73      99.34      96.19
                           DoS GoldenEye        314       99.36    99.68      98.41      85.03
                           Heartbleed           1         100      100        100        100
Figure 7: Unknown attack detection rate using Naive Bayes




Table 7
Performance of unseen attack detection using NB
                          Attack           Count   All(%)   Sel.(%)   Dom.(%)   Cons.(%)

                          DDoS             4184    76.94    74.59     83.22     100
                          PortScan         4973    0.18     0.18      6.64      100
                          Bot              54      0.00               0.00      100
                          Infiltration     1       0.00     0.00      0.00      100
                          Web Attack-BF    49      6.12     6.12      83.67     100
                          Web Attack-XSS   23      0.00     0.00      95.65     100
                          Web Attack-Sql   1       100      100       0.00      100
                          FTP-Patator      251     0.00     0.00      0.00      48.61
                          SSH-Patator      198     0.00     0.00      0.00      100
                          DoS slowloris    188     25.53    30.85     38.30     66.49
                          DoS Slowhttptest 174     71.26    79.89     75.29     78.16
                          DoS Hulk         7319    67.78    67.78     67.03     35.37
                          DoS GoldenEye    314     50.32    65.29     47.13     96.82
                          Heartbleed       1       100      100       0.00      0.00