<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Artificial Inteligence and Cybernetics Technical University of Kosice Kosice</institution>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <fpage>25</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>-The presented paper describes the design and validation of the hierarchical intrusion detection system (IDS), which combines machine learning approach with the knowledgebased methods. As the knowledge model, we have proposed the ontology of network attacks, which allow to us decompose detection and classification of the existing types of attacks or formalize detection rules for the new types. Designed IDS was evaluated on a widely used KDD 99 dataset and compared to similar approaches.</p>
      </abstract>
      <kwd-group>
        <kwd>ontologies</kwd>
        <kwd>network security incidents</kwd>
        <kwd>machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>With the extensive usage of the information and
communication technologies the number and variety of the
security attacks grow. This is also reflected in the growing of
budget invested by companies or public institutions into the
security. In order to cope with the current situation, the new and
innovative techniques are applied in order to automatize the
security management [1].</p>
      <p>
        Recently, we can observe two main approaches to the
security of the ICT: the first approach is data-oriented, and it is
based on the application of machine learning techniques to
proactively achieve the best possible prediction of the new
attacks [2][3][4][5]. The second approach is more user-centric
and it is based on the application of knowledge modelling
techniques in order to model user behavior and ICT environment
[8][9][
        <xref ref-type="bibr" rid="ref9">10</xref>
        ].
      </p>
      <p>The presented article tries to combine these two approaches
into a single system, where the domain knowledge about the
types, effects and severity of the attacks is used to decompose
intrusion detection task into the classification subtasks which
can be handled more efficiently with less training data. The
design of the proposed intrusion detection system is symmetrical
in the sense that both approaches (machine learning and
knowledge based) are equal and mutually contribute to address
the challenges of the detection and prevention of the security
threats.</p>
      <p>The rest of this paper is organized as follows: in the
following chapter we will present hierarchical knowledge model
in the form of the ontology which will be used for the
decomposition of the detection problem and which will provide
additional contextual information. Subsequent part describes
implemented machine learning models and how these models
are combined with the knowledge in the ontology. Subsequent
section then presents the experimental evaluation of the
proposed combined approach. In this chapter we at first define
quantitative evaluation metrics and then summarize the
performance of the system on the standard benchmark dataset
from the KDD Cup competition.</p>
      <p>II. HIERARCHICAL KNOWLEDGE-BASED INTRUSION DETECTION</p>
      <p>SYSTEM</p>
      <sec id="sec-1-1">
        <title>A. Overall system architecture</title>
        <p>The main objective of the proposed architecture is to
hierarchically decompose detection and classification of the
intrusions according to the types of the attacks. For the
decomposition we have proposed the Network Intrusion
Ontology which main part is formalized as the taxonomy of
attack types. This ontology allows to capture all knowledge
related to the known types of the attacks, including the
description of rare cases which are difficult to detect using the
machine learning methods.</p>
        <p>The main decomposition of the detection and classification
process can be divided into the following phases:
1.
2.
3.</p>
        <p>Coarse attack/normal classification - this phase is
implemented using the machine learning algorithm
which distinguish normal traffic and attacks. If a
network connection is labelled as a normal one, then an
alarm is not raised. Otherwise, the suspicious
connection is processed by a set of models to determine
the class of attack during the phase 2.</p>
        <p>Attack class and type prediction—this phase is guided
by the taxonomy of the attacks from the Network
Intrusion Ontology. The system hierarchically processes
the taxonomy and selects the appropriate model to
classify the instance on a particular level of a class
hierarchy. The model can be a machine learning model
statistically inferred from the training data, or rule-based
model formalized using the classes and relations from
the ontology.</p>
        <p>When a class of attack is predicted, ontology is queried
for all relevant sub-types of the attack type and to
retrieve the suitable model to predict the particular
subtype. Knowledge model can also be used to extract
specific domain-related information as a new attribute,
which could be used either to improve the classifier’s
performance or to provide context, domain-specific
information which could complement the predictive
model.</p>
        <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
The details about the predictive models and their evaluation will
be presented in the subsequent chapter.
system. The main concepts and relations of the ontology are
represented on the Figure 1.</p>
      </sec>
      <sec id="sec-1-2">
        <title>B. Network Intrusion Ontology</title>
        <p>The proposed knowledge model captures all essentials
concepts required to describe network intrusion systems. We
have designed our semantic model according to the
methodology proposed by Grüninger and Fox and with some
extensions from Methodology.</p>
        <p>The designed ontology is formalized using the OWL 2 RL
profile, which allows to formalize common constructs such as
multiple hierarchies and at the same time provides compatibility
with the rule languages for automatic reasoning. As the objective
of the knowledge model was to use it in the data analytical tasks,
the concepts and properties map directly to the data used in the
process. Moreover, ontology was extended with the concepts
related to the classification models, to create the relation
between the particular classifier and its usability on the specific
level of target attribute hierarchy. The main classes of ontology
include:
•
•
•
•
•
•
•
•
•</p>
        <p>Connections - This class represents the status of each
connection record. It specifies Attack connection or
normal traffic. Attack connections are further
conceptualized using the Attack hierarchy described
below.</p>
        <p>Effects - This class contains subclasses that represent all
possible consequences of individual attacks (e.g., slows
down server response, execute commands as root, etc.).
Mechanisms - The subclasses represent all possible
causes of individual ontology attacks (poor environment
sanitation, misconfiguration, etc.).</p>
        <p>Flags - The subclasses represent normal or error states
of individual connections (Established, responder
aborted, Connection attempt was rejected, etc.). Each of
these subclasses has a 1 equivalent instance.</p>
        <p>Protocols - The class contains subclasses that represent
the types of the communication protocols on which the
connection is running (TCP, UDP, and ICMP).</p>
        <p>Services - The subclasses represent each type of
connection service (http, telnet, etc. ...). Each of these
subclasses has a 1 equivalent instance.</p>
        <p>Severities - This class represents the severity of the
attack, its subclasses represent the severity level (weak,
medium, and high).</p>
        <p>Targets - The subclasses represent possible targets of a
given type of attack (user, network).</p>
        <p>Models concept covers the classification models used to
predict the given target attribute.</p>
        <p>The instances of the specified classes represent the network
connections (e.g., connection records from the data set). Trained
and serialized classification models are instantiated as the
instances of the Model class. The models are represented as the
web resources and they could be accessed by their URI property,
which points to the location where the model is serialized in the
The central part of the proposed semantic model is the
taxonomy of Attacks which are summarized in the following
figure. The taxonomy was extracted from the types of the attacks
described in the KDD 99 datasets. Attacks are divided into the
four main groups such as DOS, R2L, U2R and PROBE. The
main types of the attacks are further specified on the additional
level of the hierarchy.</p>
        <p>To evaluate the proposed approach, we used the KDD Cup
1999 competition dataset, which is a commonly accepted
benchmark for the intrusion detection task. The dataset consists
of the records from the device logs in a LAN network collected
over nine weeks. For the evaluation, we have used 10% sample
with the 494,021 records in total. Each record is labeled as the
normal communication or it is assigned to the major attack class
and specific attack types. There are 22 different attack types
which corresponds to the classes in the proposed ontology.</p>
        <p>The common problem with the diagnostic tasks such as
intrusion detection systems is that the target attribute (i.e. in our
case type of the attack) is highly unbalanced with the majority
of normal communication. Table I presents the taxonomy of
attack types together with the number of cases in the dataset.
Some attack classes such as Probe are more balanced but
generally for each attack class we can find some minor types
with only the few training examples. The lack of cases is
problematic not only for the training of statistical models but
also for the evaluation. On the other side, rare cases can be still
very critical and can in overall a big impact on the security of the
system.</p>
        <p>The records for each connection are described by set of
features, which are represented in the ontology as the data
attributes. The features can be divided into the basic features,
content features and traffic features. Overall there are 32
features. The first group describes the type of the communication
protocol, duration of the connection, service on the destination
network node and other standard attributes describing the TCP
connection. Content features are attributes that can be linked to
the domain specific knowledge depending on the applications
and environment in which communication occurs. The last
group of features (traffic) describe the communication attributes
captured during the 2 seconds time window, e.g. the number of
hosts communicating with the target host etc. For the data
preprocessing, we have selected only the most relevant features
for the classification which were identified in the work of [4].
The final list of features includes: service, src_bytes, dst_bytes,
logged_in, num_file_creations, srv_diff_host_rate,
dst_host_count, dst_host_diff_srv_rate,
dst_host_srv_diff_host_rate, srv_count, serror_rate, rerror_rate,</p>
        <p>Since the data of diagnostic tasks are commonly highly
unbalanced towards the normal cases, the proposed approach is
based on the decomposition of the diagnostic classification task
into the hierarchy of classifiers. At the top level of the class
hierarchy, an attack detection model is used for the prediction to
distinguish between the attack connections and normal traffic.
The classifier on this level was trained on the whole dataset and
target attribute was transformed to the binary indicator. The
main goal of this top-level classifier is to reliably separate
normal connections from the attack ones.</p>
        <p>If the top-level model detects an attack connection, the cases
are further classified by the ensemble models into the one of the
four types of the attack on the second level of the taxonomy
(DoS, R2L, U2R, Probe). In this level, we use ensemble
classifier with voting scheme trained on all attack instances (i.e.
without the normal communication cases). We found that the
proposed ensemble model is more efficient in the case of
unbalanced target classes. The standard machine learning
models proposed in the previous works were able to gain good
accuracy, achieved mostly on the dominant class (in our case on
KDD 99 dataset, on the most common DoS attack). However,
the simple models struggled to predict minor classes such as
U2R, which can be even more serious from the point of view of
network security. For example, when training a decision tree
model, the model has very good performance for the DoS and
R2L classes but missed a significant amount of the Probe attacks
and was not able to detect the U2R class at all.</p>
        <p>Proposed weighting schema is based on the idea of
complementing classifiers which is based on the performance of
a particular model on the particular class. This weighting schema
is presented on the Table II. The wi,j terms represent the weight
associated with the i-th model and j-th class.</p>
        <p>After the binary classification and classification of the attack
class by ensemble weighted classifier, we have trained particular
models to further classify specific type of the attack on the most
specific level of the taxonomy. Four different models were
trained using only the records of particular attack classes (i.e.
models for DoS, R2L, U2R and Probe). The most problematic
was minority U2R class, as the dataset contains very few records
of that type. The final implemented classification schema is
presented on the Figure 3. All models were implemented in the
Python environment using the standard pandas and scikit-learn
stack. Predictive models were then persistently stored and the
models URIs (Uniform Resource Locators) were added as the
data properties to the knowledge model.</p>
        <p>The main role of the semantic model in the proposed
detection system is to navigate through the target class taxonomy
and decompose classification problem to the sub-problems
implemented by the particular models for the specific type of
attack. The system is implemented using the Python language
and RDFlib package which provides integration with the
ontology using the SPARQL query interface. When predicting
the unknown connection, system query the ontology using the
SPARQL query and retrieve correspondent model for the
particular class of the attacks according to the URL stored in the
hasTargetAttribute property. Once the classification of the main
type is performed, the system checks in the ontology if there is a
classifier able to process the record further and to detect subtype
of the attack.</p>
        <p>Besides the hierarchical decomposition of the detection
process, knowledge model provides also additional context
which can be leveraged during the classification and improve
detection of the minor classes. We have mainly extended the
context with the potential effect of the attack. Additionally, if the
models are not reliable enough to predict the concrete attack
subtype, the system can be used to classify attacks at least according
to the severity which is retrieved from the knowledge model for
the particular main class of the attack. This could serve as a
supporting source of information, completing the attach type
classification.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>III. EVALUATION</title>
      <p>For the evaluation, we used the most common metrics
employed in the classification tasks such as recall and precision.
We have also computed confusion matrix for the particular
classes of attacks. The confusion matrices were especially
informative since they record number of correctly and
incorrectly classified examples and also the types of the error.
For the binary classification on the top level of the taxonomy
hierarchy we used standard evaluation metrics:
•
•</p>
      <p>Precision: P = TP / (TP + FP)</p>
      <p>Recall: R = TP / (TP + FN)
where TP, TN, FP, FN are numbers of true positive, true
negative, false positive and false negative records (e.g. for true
positive number of records when the predicted attack was in fact
attack, false positive when the predicted attack was in fact a
normal traffic, false negative when the predicted normal traffic
was in fact an attack, etc. The entire system was also evaluated
with the number of missed attacks and raised false alarms as
FAR metric (False Alarm Rate), which corresponds to the false
positive records divided by total number of normal traffic
records (true negative + false positive).</p>
      <p>For the evaluation of the binary classification on the top level
of the taxonomy, we used directly precision and recall metrics.
In the subsequent stages on the more specific levels of taxonomy
we have computed precision and recall for each class and used
macro-averaging for overall evaluation. Additionally, we have
computed multi-class confusion matrix to further investigate the
types of the errors produced by the system.</p>
      <sec id="sec-2-1">
        <title>A. Training and evaluation</title>
        <p>For the binary classification for the attack detection, we used
the decision tree classifier. Dataset includes all records and
target attribute was transformed to binary indicator
attack/normal traffic. The classifier was trained without the limit
for maximum depth with default settings for pruning and gini
index as the splitting criterion. We split the dataset randomly to
70/30 training/testing ration. The testing data were also used for
overall evaluation of the entire system. Model for the binary
classification achieved the accuracy 0.9997. The detailed
confusion matrix is presented in the Table III.</p>
        <p>For the training of ensemble classifier, we have selected only
the attack records from the training set. As the base classifiers
we have used various configuration of the Naive Bayes and
Decision Tree models. The experiments proved that the Decision
Tree classifier performed well on the Probe, DoS and R2L
attacks. On the other hand, for the U2R class model produces
many false alarms or (depending on pruning) the model was not
able to detect U2R attacks at all. For this reason, we have trained
one-vs-all model just to separate U2R class. We have then
combined both types of the models into the ensemble classifier.
The weights of the base classifiers were computed according to
the accuracies of the models on the training data. For the
evaluation we have used the same 70/30 dataset split as for the
binary classification, but we have further selected only the attack
records (since the normal communication is filtered already by
the binary classifier). In total, models were trained on 396743
records. The confusion matrix of the ensemble classifier is
presented on the Table IV.
0.982</p>
        <p>On the most specific level of the taxonomy, each major
attack class has dedicated one model for the further classification
of subtypes. The performance of each model was evaluated
using the precision and recall macro-averaged for each subtype.
The overall performance of the models is summarized in Table
V.</p>
        <p>
          The overall system with the hierarchical classification was
evaluated using the standard precision, recall F-measure and
FAR (False Alarm Rate) metrics. Comparison of the proposed
system and models published in previous works [4][6][7][
          <xref ref-type="bibr" rid="ref10">11</xref>
          ] is
presented in Table VI.
        </p>
        <p>Additionally, we have computed confusion matrix, which
summarizes the performance for each attack class. The
confusion matrix is presented in the Table VII.</p>
        <p>Besides the classification of attack types, we have
implemented and also evaluated the classification of the attack
severity. To train the severity detector we have used 10 % of
KDD 99 dataset with the 70/30 training/testing ratio. The
severity classifier was applied complementary to the ensemble
F1
0.970
0.986
0.998
0.998
model for the detection of the attack class. Overall achieved
performance was 0.999 precision and recall with very good
accuracy for the high and low severity. The Table VIII presents
the confusion matrix for the severity detection in comparison for
each class of the attack.</p>
        <p>Medium severity was biased by our model towards the high
severity which has the similar effect like the higher false positive
rate. Further details and information about the designed model
were published in [9].</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>IV. CONCLUSION AND FUTURE WORK</title>
      <p>In this paper we have proposed an approach based on the
combination of knowledge based and machine learning methods
for intrusion detection. The proposed knowledge model in the
form of the ontology is used for the hierarchical decomposition
of the detection process according to the types of the attack. This
decomposition allows to overcome the problems with the
unbalanced training data which are typical for the diagnostic
machine learning tasks. By the leveraging of the domain
knowledge, our combined approach also provides an additional
context which includes for example the effects and severity of
the attacks.</p>
      <p>The performance of the proposed IDS is 0.998 in terms of
precision as well as recall and 0.001 in terms of FAR metric,
which on the standard benchmark dataset outperforms other
state-of-the-art methods. Moreover, the proposed method has
also potential to partially detect new emerging types of attacks
in terms of the contextual information stored in the knowledge
model.</p>
      <p>In the future work we plan to extend the role of the
knowledge model by introducing a rule-based classifier which
will be based on the declarative rules and application of
automatic reasoning technique and logical programming. We
hope that this will allow to further improve accuracy for minor
classes with the low number of training examples. Additionally,
extended knowledge model will allow to create formalized
knowledge base of the existing cases.</p>
    </sec>
    <sec id="sec-4">
      <title>ACKNOWLEDGMENT This work was supported by Slovak Research and Development Agency under the contract No. APVV-16-0213 and by the VEGA project under grant No. 1/0493/16.</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>Symmetry</source>
          <year>2019</year>
          ,
          <volume>11</volume>
          ,
          <fpage>244</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Javaid</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Niyaz</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; Alam,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>A Deep Learning Approach for Network Intrusion Detection System</article-title>
          .
          <source>In Proceedings of the 9th EAI International Conference on Bio-inspired Information and Communications Technologies (formerly BIONETICS)</source>
          , New York, NY, USA,
          <fpage>3</fpage>
          -
          <issue>5</issue>
          <year>December 2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>Symmetry</source>
          <year>2019</year>
          ,
          <volume>11</volume>
          ,
          <fpage>583</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Zhou</surname>
            , Y.; Cheng, G; Jiang, S.; dai,
            <given-names>M.</given-names>
          </string-name>
          <article-title>An efficient detection system based on feature selection and ensemble classifier</article-title>
          .
          <source>arXiv</source>
          <year>2019</year>
          , arXiv:190401352 Aljawarneh,
          <string-name>
            <surname>S.</surname>
          </string-name>
          ; Aldwairi,
          <string-name>
            <given-names>M.</given-names>
            ;
            <surname>Yassein</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>B</article-title>
          .
          <article-title>Anomaly-based intrusion detection system through feature selection analysis and building hybrid efficient model</article-title>
          .
          <source>J. Comput. Sci</source>
          .
          <year>2018</year>
          ,
          <volume>25</volume>
          ,
          <fpage>152</fpage>
          -
          <lpage>160</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mukherjee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>A Novel Multi-Classifier Layered Approach to Improve Minority Attack Detection in IDS</article-title>
          .
          <source>Procedia Technol</source>
          .
          <year>2012</year>
          ,
          <volume>6</volume>
          ,
          <fpage>913</fpage>
          -
          <lpage>921</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Ahmim</surname>
            ,
            <given-names>A.; Ghoualmi</given-names>
          </string-name>
          <string-name>
            <surname>Zine</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <article-title>A new hierarchical intrusion detection system based on a binary tree of classifiers</article-title>
          .
          <source>Inf. Comput. Secur</source>
          .
          <year>2015</year>
          ,
          <volume>23</volume>
          ,
          <fpage>31</fpage>
          -
          <lpage>57</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Abdoli</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kahani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Ontology-based distributed intrusion detection system</article-title>
          .
          <source>In Proceedings of the 2009 14th International CSI Computer Conference</source>
          , Tehran, Iran,
          <fpage>20</fpage>
          -
          <lpage>21</lpage>
          October 2009; pp.
          <fpage>65</fpage>
          -
          <lpage>70</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Sarnovsky</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Paralic</surname>
            ,
            <given-names>J. Hierarchical Intrusion Detection</given-names>
          </string-name>
          <article-title>Using Machine Learning and Knowledge Model</article-title>
          .
          <source>Symmetry</source>
          <year>2020</year>
          ,
          <volume>12</volume>
          ,
          <fpage>203</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [10]
          <string-name>
            <surname>More</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Matthews,
          <string-name>
            <given-names>M.</given-names>
            ;
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ;
            <surname>Finin</surname>
          </string-name>
          ,
          <string-name>
            <surname>T.</surname>
          </string-name>
          <article-title>A Knowledge-Based Approach to Intrusion Detection Modeling</article-title>
          .
          <source>In Proceedings of the 2012 IEEE Symposium on Security and Privacy Workshops</source>
          , San Francisco, CA, USA,
          <fpage>24</fpage>
          -
          <lpage>25</lpage>
          May
          <year>2012</year>
          ; pp.
          <fpage>75</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>O</given-names>
            <surname>̈ zgü r</surname>
          </string-name>
          , A.;
          <string-name>
            <surname>Erdem</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <article-title>A review of KDD99 dataset usage in intrusion detection and machine learning between 2010 and 2015</article-title>
          .
          <source>PeerJ Preprints</source>
          <year>2016</year>
          ,
          <volume>4</volume>
          ,
          <fpage>e1954v1</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>