<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Ensemble Consensus: an unsupervised algorithm for anomaly detection in network security data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vincenzo Dentamaro</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vito Nicola Convertini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Galantucci</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Giglio</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tonino Palmisano</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Pirlo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Bari, Dep. Of computer Science</institution>
          ,
          <addr-line>Via Orabona 4, Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>Unsupervised network traffic monitoring is of paramount importance in cyber security: it makes it possible to detect suspicious events, defined as non-normal, and to report or block them. In this work the Ensemble Consensus algorithm for unsupervised network analysis is presented. The algorithm fuses the three most important anomaly detection techniques for the unsupervised detection of suspicious events. Tests are performed on the KDD Cup '99 dataset, one of the most famous supervised datasets for automatic intrusion detection, created by DARPA. The accuracies reveal that Ensemble Consensus performs on par with state-of-the-art supervised learning techniques, ensuring high generalization power even in borderline tests in which a small amount of data (5%) is used for training and the rest for validation and testing.</p>
      </abstract>
      <kwd-group>
        <kwd>Unsupervised anomaly detection</kwd>
        <kwd>isolation forest</kwd>
        <kwd>elliptic envelope</kwd>
        <kwd>local outlier factor</kwd>
        <kwd>anomaly detection logs</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Computer systems, and especially the networks that interconnect them, must be
carefully monitored to prevent attacks. An attack is generally detected through one of
two methods: either a known attack is searched for in the analyzed traffic, or the
analyzed traffic is compared against "normal" network traffic to look for variations.</p>
      <p>The first approach is not always applicable, as it requires a large amount of
information, i.e., a mapping (signature) of all possible attacks, and this information
must be in a format that allows its application to the analyzed problem. Moreover, the
first approach cannot in any way detect unknown attacks.</p>
      <p>Greater protection is given by the second approach, i.e., the use of machine
learning [16-17], which in fact decouples the detection system from the concept of a
specific attack and treats the traffic as mere data. The task to be solved is
therefore one of anomaly detection. The use of classifiers [13],[15] presents a
number of problems: firstly, an adequate source of data on traffic under attack is
not always available (absent data or unbalanced classes); secondly, it is difficult
to differentiate normal traffic from anomalous traffic.</p>
      <p>In cybersecurity contexts, a moderate number of false positives is
acceptable, since it is still up to the operator to evaluate suspicious traffic.</p>
      <p>In this paper an algorithm based on the combination of three anomaly
detection algorithms (Elliptic Envelope, Isolation Forest, Local Outlier Factor) is
proposed.</p>
      <p>The next section describes the state of the art of anomaly detection systems
for cybersecurity, while section 3 explains the functioning of the Ensemble Consensus
algorithm in detail. Section 4 presents the data of the conducted experiment, and the
evaluation of the results, conclusions and future developments are deferred to the
last section.</p>
    </sec>
    <sec id="sec-2">
      <title>2. State of the art review</title>
      <p>Anomaly detection systems use machine learning to model the normal state of
a batch of data and to identify which items deviate heavily from it. Such approaches
are of primary importance within Intrusion Prevention/Detection Systems. The
following review analyzes the most interesting approaches proposed for such systems,
since contextualization to cybersecurity domains and to network data is of
fundamental interest in this article.</p>
      <p>The authors in [1] propose an anomaly detection module that uses a
Self-Organizing Map (SOM) to map and model normal behavior; anything that differs
from normal behavior is classified as an attack. SOMs rely on unsupervised learning
to map nonlinear statistical relationships between high-dimensional input data and
the output space, a two-dimensional lattice or grid. SOMs place highly correlated
patterns in contiguous locations of the resulting output space, with good results,
and provide visualization and projection options for high-dimensional data. Finally,
a module called the abuse detection module, which makes use of a J.48 decision tree,
determines the type of attack. The model is trained with normal traffic data only.
The classification of a connection into the normal/anomalous classes is done through
the quantization error (QE) with respect to the best matching unit (BMU): if the
quantization error is greater than a fixed threshold, the connection is anomalous.</p>
      <p>The authors in [2] propose an Intrusion Detection System model that uses the
Neighborhood Outlier Factor (NOF). The main idea of this approach is to assign to
each data example an "outlier" ranking, the NOF, and to search for data that differ
strongly from the large amount of data representing normality.</p>
      <p>
        Important in the use of such a model is the amount of data: although the
system recognizes almost all types of attacks, when the number of outliers exceeds
the data that identify normal behavior [
        <xref ref-type="bibr" rid="ref1">14</xref>
        ], the system behaves as if the dataset itself were an intrusion. The great
advantage of this approach is the execution time, which proves to be lower than that
of any other machine learning classifier; this makes it an interesting system for
online or highly responsive approaches, which the cybersecurity domain can in fact
require. This advantage comes from the fact that the trained dataset is smaller, so
calculating the distance between the training dataset and the testing dataset is
particularly easy.
      </p>
      <p>A hybrid approach to detect unknown attacks by clustering is proposed in
[3]. The algorithm, called SSC-OCSVM, is obtained from the union of Sub-Space
Clustering (SSC) and the One-Class Support Vector Machine (OCSVM); the first is an
evolution of classical clustering mechanisms, while the second is an evolution of the
supervised Support Vector Machine algorithm that is particularly useful on unlabeled
data, where the support vector model is produced from single-class data corresponding
to normal traffic. The algorithm, tested on the NSL-KDD dataset, returns better
results than the K-means and DBSCAN approaches; however, it has a longer computation
time, due to the sequential execution of each subspace. The authors specify, though,
that each subspace is independent and can therefore be processed in parallel.</p>
      <p>SwissLog [4] was created on the basis of two considerations: the first is
that, in real application contexts, logs change frequently when dealing with software
under development or under active maintenance; the second is that performance
problems in systems are indicative of partial failures. SwissLog merges time
embedding and semantic embedding approaches to detect sequential log anomalies and
performance issues with a unified deep learning model. Anomalies of two types are
identified: sequences that are anomalous with respect to the expected order, and
changes in timing (performance), understood as variations in the execution time of a
single task. Operation takes place through processing in two phases: offline and
online. SwissLog's log analysis methods extract multiple patterns through the stages
of tokenization, dictionarization, and clustering of historical log data, which are
tracked within the system for the construction of the single sessions, finally
leading to the transformation into temporal information with semantic content.</p>
      <p>An interesting method for anomaly detection is offered by LogAnomaly [5],
which applies it to logs expressed directly in natural language. The algorithm uses
template2vec (a novel method) to extract semantic information from the logs. The
concept of anomaly partially overlaps with the one intended by SwissLog: the notion
of a sequence anomaly is maintained, i.e., when the sequence of logs differs from the
normal order; the notion of change over time, related to the performance of
individual tasks, is dropped, and a quantitative anomaly is introduced instead, which
occurs when the temporal relationship over a group of logs breaks down. The main idea
behind LogAnomaly is that log systems produce logs according to patterns, hence
semi-structured data, and understanding these patterns with NLP methods can aid
anomaly detection. LogAnomaly therefore assumes that, in a normal workflow without
anomalies, given a certain log it is possible to predict the next one. In the offline
training phase, an FT-Tree is used to generate templates from the history, and the
data are then matched against the produced templates; in the online testing phase, it
is verified whether the log in question corresponds to one of the above-mentioned
templates; otherwise, an approximation of it is looked for to verify whether it
matches the normal flow, and if not, the log is flagged as anomalous.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Ensemble Consensus</title>
      <p>There are three basic approaches to detect anomalies [6]. They are based on:
- Density
- Distance
- Isolation</p>
      <p>Within the Ensemble Consensus algorithm, three anomaly detection techniques
are fused together, each belonging to one of the three fundamental approaches
previously mentioned. The algorithm performs weighted majority voting among the three
techniques on the same data, making use of a weighted bagging approach.</p>
      <p>At its core there are three different anomaly detection algorithms:
- Elliptic Envelope, which exploits the concept of density. It creates an imaginary
elliptical area around a given dataset: values that fall within the area are
considered normal data, and anything outside of that area (distribution density) is
returned as an outlier [7].
- Isolation Forest, which 'isolates' observations by randomly selecting a feature and
then randomly selecting a split value between the maximum and minimum values of the
selected feature. It is an unsupervised algorithm and therefore does not need labels
to distinguish normal from anomalous data [8]. The path length is averaged and
normalized to calculate the outlier score.
- Local Outlier Factor, which exploits the concept of "distance" in a manner similar
to k-nearest neighbors to identify anomalous tuples [9].</p>
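      <p>As a concrete illustration, the three base detectors are available in scikit-learn. The sketch below (with illustrative toy data and an illustrative contamination value, not the paper's configuration) fits each one and collects its +1/-1 predictions:</p>

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
# Toy data: a dense "normal" cluster plus five planted outliers far away.
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.normal(8, 0.5, size=(5, 2))])

detectors = {
    "elliptic_envelope": EllipticEnvelope(contamination=0.05, random_state=0),     # density
    "local_outlier_factor": LocalOutlierFactor(contamination=0.05, novelty=True),  # distance
    "isolation_forest": IsolationForest(contamination=0.05, random_state=0),       # isolation
}

predictions = {}
for name, detector in detectors.items():
    detector.fit(X)
    predictions[name] = detector.predict(X)  # +1 = normal, -1 = outlier
```

      <p>Each detector returns one vote per row, which is the input the consensus step operates on.</p>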
      <p>The judgment on a tuple is made on the basis of the score previously
calculated by each technique. The tuple is classified as an outlier either if at
least two of the three algorithms deem it an outlier ("hard consensus"), or if the
average of the scores normalized between -1 and 1 falls below a soft threshold
("soft consensus").</p>
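      <p>A minimal sketch of the two consensus rules, assuming each detector emits a +1/-1 vote and a score already normalized to [-1, 1] (the function names are illustrative, not from the paper):</p>

```python
import numpy as np

def hard_consensus(votes):
    """votes: (n_detectors, n_samples) array of +1/-1 predictions.
    A sample is an outlier (-1) when at least 2 of the 3 detectors flag it."""
    outlier_votes = (votes == -1).sum(axis=0)
    return np.where(outlier_votes >= 2, -1, 1)

def soft_consensus(scores, soft_threshold=-0.2):
    """scores: (n_detectors, n_samples) array of scores normalized to [-1, 1].
    A sample is an outlier (-1) when the averaged score falls below the threshold."""
    return np.where(scores.mean(axis=0) < soft_threshold, -1, 1)
```

      <p>The -0.2 default mirrors the soft threshold value reported as best in the experiment section.</p>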
      <p>The “hard consensus” is just a majority voting anomaly detection. The “soft
consensus” technique requires further deepening.</p>
      <p>For the so-called "soft consensus", an outlier (or anomaly) score normalized
to [-1, 1] is built for each technique, and the average of the normalized scores is
taken. If the averaged score of a sample is lower than a soft threshold, the sample
is reported as anomalous, and as normal otherwise. For Isolation Forest, the score is
defined in equation (6):</p>
      <p>score(x) = -s(x, n) - offset, with s(x, n) = 2^(-E(h(x))/c(n)) (6)</p>
      <p>where h(x) is the path length for the sample x, c(n) is the average path
length of an unsuccessful search in a binary search tree with n external nodes, and
E(h(x)) is the average of h(x) over the ensemble of isolation trees [10]. The offset
value -0.5 is the same used in work [10].</p>
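      <p>As a cross-check against a concrete implementation (toy data, illustrative only): in scikit-learn's IsolationForest, decision_function is score_samples (the negated anomaly score) minus the fitted offset_, and with the default contamination="auto" that offset is exactly -0.5:</p>

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.normal(size=(256, 2))

forest = IsolationForest(random_state=0).fit(X)  # contamination="auto" -> offset_ = -0.5
raw = forest.score_samples(X)        # equals -s(x, n), the negated anomaly score
score = forest.decision_function(X)  # equals raw - offset_; negative values flag outliers
```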
      <p>For Local Outlier Factor, before giving equation (7) it is important to
define the local reachability density in equation (8):</p>
      <p>LOF_k(p) = ( sum over o in N_k(p) of lrd_k(o) / lrd_k(p) ) / |N_k(p)| (7)</p>
      <p>lrd_k(p) = 1 / ( sum over o in N_k(p) of reach-dist_k(p, o) / |N_k(p)| ) (8)</p>
      <p>lrd is the inverse of the mean reachability distance of a point p from its
neighbors: the density is inversely proportional to "how far" the neighbors are from
the point with respect to a predefined distance metric (e.g., Euclidean). For the
Local Outlier Factor used in [9], the offset is set to -1.5 as shown in equation
(9):</p>
      <p>score(p) = -LOF_k(p) - (-1.5) (9)</p>
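      <p>The same cross-check holds for scikit-learn's LocalOutlierFactor (toy data, illustrative only): with novelty=True and the default contamination="auto", score_samples returns -LOF(p) and decision_function subtracts exactly the -1.5 offset of equation (9):</p>

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = rng.normal(size=(256, 2))

lof = LocalOutlierFactor(novelty=True).fit(X)  # contamination="auto" -> offset_ = -1.5
raw = lof.score_samples(X)        # equals -LOF(p)
score = lof.decision_function(X)  # equals raw - offset_, i.e. -LOF(p) - (-1.5)
```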
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
    </sec>
    <sec id="sec-5">
      <title>4.1. Dataset</title>
      <p>In this work the KDD '99 dataset [11] has been used. It is one of the most
widely used datasets in network security. The dataset was built on the networking
information captured in the DARPA '98 intrusion detection program, and it is composed
of over 4 gigabytes of binary tcpdump data covering 7 weeks of network traffic, which
make up over 5 million records (rows). Each row contains 41 features and ends with a
label that identifies whether the row is normal or an attack, together with the exact
type of attack. The simulated attacks are grouped into four big families, namely:
1. Denial of Service (DoS)
2. User to Root (U2R)
3. Remote to Local (R2L)
4. Probing</p>
      <p>In this work the dataset is used entirely in a non-supervised setting,
showing the effectiveness of this solution even when labels are not present, as
normally happens in network traffic analysis.</p>
      <p>All four simulated attack families are grouped into one single non-normal
class; the experiment is thus performed in a binary classification fashion.</p>
    </sec>
    <sec id="sec-6">
      <title>4.2. Experiment</title>
      <p>The KDD dataset was parsed, and all non-numerical (string) variables were
encoded as categorical variables. The dataset was z-scored and split into 60% for
training, 20% for validation and 20% for testing.</p>
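      <p>The 60/20/20 split can be sketched as follows (synthetic stand-in data; note that this sketch fits the scaler on the training split only, a common leak-free variant, whereas the text z-scores the dataset before splitting):</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 41))  # stand-in for the 41 KDD features

# 60% train, then split the remaining 40% evenly into validation and test.
X_train, X_rest = train_test_split(X, test_size=0.4, random_state=0)
X_val, X_test = train_test_split(X_rest, test_size=0.5, random_state=0)

# z-scoring
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = (scaler.transform(s) for s in (X_train, X_val, X_test))
```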
      <p>The label column was removed in order to treat the dataset as an
unsupervised learning problem. A copy of the entire dataset keeps the label column
for accuracy comparisons.</p>
      <p>The exhaustive search known as grid search was used to determine the
hyperparameter combination that delivers the highest accuracy. This exhaustive search
was performed on the contamination ratio of each internal algorithm and on the soft
threshold described previously.</p>
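      <p>Since the ensemble is not a single scikit-learn estimator, the grid search can be written as a plain loop. The evaluate() function below is a hypothetical placeholder for "fit the ensemble with these hyperparameters and return validation accuracy", shaped here so that the grid optimum lands on the values reported as best (0.3 and -0.2):</p>

```python
import itertools

def evaluate(contamination, soft_threshold):
    """Hypothetical stand-in for fitting Ensemble Consensus and
    returning validation accuracy for this configuration."""
    return 1.0 - abs(contamination - 0.3) - abs(soft_threshold - (-0.2))

contaminations = [0.1, 0.2, 0.3, 0.4]
soft_thresholds = [-0.4, -0.3, -0.2, -0.1]

# Exhaustive (grid) search over every hyperparameter combination.
best = max(itertools.product(contaminations, soft_thresholds),
           key=lambda cfg: evaluate(*cfg))
```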
      <p>It was found that, with a contamination ratio of 0.3 for all the internal
algorithms used within Ensemble Consensus and a soft threshold of -0.2, the accuracy
was the highest, as reported in Table 1.</p>
      <p>For simplicity, the results presented in Table 1 were obtained by fixing the
soft threshold to -0.2.</p>
      <p>In addition, a borderline test was performed using only 5% of the dataset
for training, 30% for validation and the remaining 65% of the entire dataset for
testing. Accuracies are reported in Table 1.</p>
      <p>By F1 score is meant the weighted F1 score, which is used to obtain more
realistic accuracies that account for class imbalance.</p>
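      <p>The metric corresponds to scikit-learn's f1_score with average="weighted", which averages the per-class F1 scores weighted by class support (toy labels for illustration):</p>

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1]  # imbalanced: four normal rows, two anomalies
y_pred = [0, 0, 0, 1, 1, 0]

# Per-class F1 is 0.75 (majority) and 0.5 (minority); weighting by
# support gives (4 * 0.75 + 2 * 0.5) / 6 = 2/3.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
```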
    </sec>
    <sec id="sec-7">
      <title>4.3. Discussion</title>
      <p>The experimental results and comparisons in Table 2 suggest the
effectiveness of the Ensemble Consensus algorithm when its accuracies are compared
with those of state-of-the-art supervised learning algorithms. It is important to
note that in some cases, such as the Naïve Bayes classifier, random forest and neural
network, the recall is better than or almost on par with the others. From Table 1 it
is possible to observe that the quality of the result strongly depends on the amount
of data used to train the algorithm. This is intuitively explained by the intrinsic
properties of each algorithm composing Ensemble Consensus, which requires a
statistically representative number of instances in order to infer a specific pattern
and thus recognize anomalies (outliers) effectively. Using only 1000 instances for
training, and knowing that non-normal (anomaly) data is only a fraction (about
0.001%) of the whole data, achieving a precision of 0.72 and a recall of 0.69 with a
final weighted F1-score of 0.57 is encouraging. This suggests that the algorithm can
be used effectively even when a low quantity of data is available, and that it can
leverage the arrival of fresh data in an online learning fashion.</p>
    </sec>
    <sec id="sec-8">
      <title>5. Conclusions and future work</title>
      <p>The paper proposes an ensemble approach to the anomaly detection task,
applied to the network security domain using the KDD99 dataset.</p>
      <p>Ensemble Consensus offers the benefit of not being a supervised algorithm
and, with that, the fact that it can be used where no particular data about attacks
on the application network are available; therefore, the problem of class imbalance
does not arise. The system is immediately applicable, being able to recognize attacks
without having information about them.</p>
      <p>As widely discussed in the previous sections, the number of training
examples is an extremely influential parameter for generating an accurate classifier.
However, the proposed approach demonstrates that the system is applicable even in
data-poor contexts, as it achieves considerable precision and recall with respect to
the modest amount of data provided.</p>
      <p>The method is particularly suitable for network contexts with high temporal
variability, as it allows an online application, using as comparison (training) data
the data of a certain time slot considered representative of normal traffic. Since it
is an unsupervised machine learning algorithm, it can be applied directly, without
requiring data labeling by an operator, system administrator, or other means.</p>
      <p>An effective application of the Ensemble Consensus algorithm also depends on
the soft threshold parameter, which determines the value below which an example is
considered anomalous. In the experiment, the parameter is derived through a grid
search process aimed at maximizing performance. In a real-world context, the value
can be calibrated to the needs of the system, though it depends in particular on the
submitted data stream. A higher threshold will tend to flag more values as anomalous
and can be chosen when the system is desired to produce more alerts, i.e., when more
control is desired. A lower threshold will instead produce fewer alerts: on the one
hand, the detected anomalies are more certain (as they are more evident with respect
to the threshold); on the other hand, there is the risk of not recognizing some
anomalous data because their score lies above the threshold.</p>
      <p>When more complete data are available, Ensemble Consensus is instead able to
stand comparison with supervised algorithms, providing, against a small decrease of
the effectiveness scores, the advantage of working without labelled data. An
important result is the recall value which, not undergoing particular variations with
respect to the supervised algorithms, provides the CERT operator with an adequate
data flow to control.</p>
      <p>The main system improvement task for the future is to identify ways to
increase recall without penalizing overall precision, since for a critical
infrastructure the presence of false negatives could be dangerous. The possibility of
varying the soft threshold to achieve higher or lower recall will also be
investigated. The difference, in absolute value, between the threshold and the score
obtained for a single example could also represent an estimate of the criticality of
the anomaly; this analysis is deferred to later work.</p>
    </sec>
    <sec id="sec-9">
      <title>6. Funding</title>
      <p>Research activity co-financed by "Fondo Europeo di Sviluppo Regionale Puglia
POR Puglia 2014 - 2020 - Axis I - Specific Objective 1a - Action 1.1 (R&amp;D) -
Project Title: Suite products CyberSecurity and SOC" and by BV TECH S.p.A.</p>
    </sec>
    <sec id="sec-10">
      <title>7. References</title>
      <p>[1] Depren, Ozgur, Topallar, Murat, Anarim, Emin, and Ciliz, M. Kemal, "An
intelligent intrusion detection system (IDS) for anomaly and misuse detection in
computer networks," Expert Systems with Applications, pp. 713-722, 2005.</p>
      <p>[2] Jabez, J. and M. B., "Intrusion detection system (IDS): anomaly
detection using outlier detection approach," Procedia Computer Science, pp. 338-346,
2015.</p>
      <p>[3] Pu, Guo, Wang, Lijuan, Shen, Jun, and Dong, Fang, "A hybrid unsupervised
clustering-based anomaly detection method," Tsinghua Science and Technology, pp.
146-153, 2020.</p>
      <p>[4] "SwissLog: Robust and Unified Deep Learning Based Log Anomaly Detection
for Diverse Faults," in 2020 IEEE 31st International Symposium on Software
Reliability Engineering (ISSRE), 2020.</p>
      <p>[5] Meng, Weibin, Liu, Ying, Zhu, Yichen, Zhang, Shenglin, Pei, Dan, Liu,
Yuqing, Chen, Yihao, Zhang, Ruizhi, Tao, Shimin, Sun, Pei, and others, "LogAnomaly:
Unsupervised Detection of Sequential and Quantitative Anomalies in Unstructured
Logs," in IJCAI, 2019.</p>
      <p>[6] Yu, X., Tang, L. A. and Han, J., "Filtering and Refinement: A Two-Stage
Approach for Efficient and Effective Anomaly Detection," in 2009 Ninth IEEE
International Conference on Data Mining, Miami, 2009.</p>
      <p>[7] Rousseeuw, Peter J. and Van Driessen, Katrien, "A fast algorithm for the
minimum covariance determinant estimator," Technometrics, pp. 212-223, 1999.</p>
      <p>[8] Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua, "Isolation-based
anomaly detection," ACM Transactions on Knowledge Discovery from Data (TKDD), pp.
1-39, 2012.</p>
      <p>[9] Breunig, M. M., Kriegel, Hans-Peter, Ng, R. T. and Sander, J., "LOF:
identifying density-based local outliers," in Proceedings of the 2000 ACM SIGMOD
International Conference on Management of Data, 2000.</p>
      <p>[10] Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua, "Isolation forest," in
2008 Eighth IEEE International Conference on Data Mining, 2008.</p>
      <p>[11] KDD Cup '99. (n.d.). Retrieved February 2021, from
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.</p>
      <p>[12] Sapre, Suchet, Ahmadi, Pouyan and Islam, Khondkar, "A robust comparison
of the KDDCup99 and NSL-KDD IoT network intrusion detection datasets through various
machine learning algorithms," arXiv preprint arXiv:1912.13204, 2019.</p>
      <p>[13] Impedovo, Donato, and Giuseppe Pirlo, "Automatic signature verification
in the mobile cloud scenario: survey and way ahead," IEEE Transactions on Emerging
Topics in Computing, 2018.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <label>14</label>
        <mixed-citation>
          <string-name><surname>Pirlo</surname>, <given-names>G.</given-names></string-name>, and
          <string-name><given-names>D.</given-names> <surname>Impedovo</surname></string-name>,
          <article-title>"A new class of monotone functions of the residue number system,"</article-title>
          <source>Int. J. Math. Models Methods Appl. Sci 7.9</source>,
          <year>2013</year>:
          <fpage>803</fpage>-<lpage>809</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <label>15</label>
        <mixed-citation>
          <string-name><surname>Pirlo</surname>, <given-names>Giuseppe</given-names></string-name>, and
          <string-name><given-names>Donato</given-names> <surname>Impedovo</surname></string-name>,
          <article-title>"Cosine similarity for analysis and verification of static signatures,"</article-title>
          <source>IET Biometrics 2.4</source>,
          <year>2013</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <label>16</label>
        <mixed-citation>
          <string-name><surname>Impedovo</surname>, <given-names>Donato</given-names></string-name>, and
          <string-name><given-names>Giuseppe</given-names> <surname>Pirlo</surname></string-name>,
          <article-title>"Updating knowledge in feedback-based multi-classifier systems,"</article-title>
          <source>2011 International Conference on Document Analysis and Recognition, IEEE</source>,
          <year>2011</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <label>17</label>
        <mixed-citation>
          <string-name><surname>Pirlo</surname>, <given-names>Giuseppe</given-names></string-name>,
          <string-name><given-names>Claudia Adamita</given-names> <surname>Trullo</surname></string-name>, and
          <string-name><given-names>Donato</given-names> <surname>Impedovo</surname></string-name>,
          <article-title>"A feedback-based multi-classifier system,"</article-title>
          <source>10th International Conference on Document Analysis and Recognition, IEEE</source>,
          <year>2009</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>