<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Pizzo Calabro (VV),
Italy
" michele.ianni@univr.it (M. Ianni); elio.masciari@unina.it (E. Masciari)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Some experiments on activity outlier detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>(Discussion Paper)</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michele Ianni</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elio Masciari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università Federico II</institution>
          ,
          <addr-line>Napoli</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università della Calabria</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>The rise of cyber crime observed in recent years calls for more eficient and efective data exploration and analysis tools. In this respect, the need to support advanced analytics on activity logs and real time data is driving data scientist' interest in designing and implementing scalable cyber security solutions. However, when data science algorithms are leveraged for huge amounts of data, their fully scalable deployment faces a number of technical challenges that grow with the complexity of the algorithms involved and the task to be tackled. Thus algorithms, that were originally designed for classical scenarios, need to be redesigned in order to be efectively used for cyber security purposes. In this paper, we explore these problems and then propose a solution which has proven to be very efective in identifying malicious activities.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Cybersecurity</kwd>
        <kwd>Data Compression</kwd>
        <kwd>Activity Analysis</kwd>
        <kwd>Cluster Analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Monitoring activities is a crucial and often underused method in the task of documenting,
assessing and preventing risk. Either if we talk about a large computer network or a single
machine or even a small IoT device, we face the problem of handling a high volume of data
that is generated continuously. This data could be related to network activity, software logs,
user interaction, hardware monitoring to cite a few. By carefully analyzing this endless stream
of information, it could be possible to thwart many kinds of cyberattacks, even before they
happen [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Cybercrime is on the rise, annual damage from cyberattacks is set to hit 6 trillion dollar in
2021 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], this is twice the amount of annual damage from cybercrime recorded in 2016. Many
attacks follow well known patterns and many malicious software are just a variant of existing
menaces, however, even the simplest form of cyberattacks are still dangerous nowadays. It
is important to notice that many attacks could be detected in the very early stages, or even
prevented at all, if all the information available could be analyzed in a fast and reliable way [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
In this respect, activity monitoring is a crucial component of comprehensive cybersecurity. The
technologies in the field of Big Data analysis and the advances of computing capabilities led to
new possibilities in the dificult task of extracting sensitive information from huge amounts of
data [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Moreover, activity data has an inestimable value in cybersecurity as the information
extracted can easily be leveraged in order to better frame the legitimate actions performed thus
easily recognizing and blocking unwanted interactions.
      </p>
      <p>
        Nowadays, especially in business scenarios, the attackers could be insiders, even employees
sometimes, who (consciously or not) could endanger the sensitivity of data, or damage the
software/hardware architecture by opening holes that can be exploited by attackers [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Misconfigured services, unpatched binaries, unknown vulnerabilities are all issues who constantly
jeopardize the security of our computers, leaving big holes open which can be used by attackers
to easily infiltrate. By analyzing logs and monitoring activities in real time, many of these holes
could be detected before an unauthorized user could spread your valuable information [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>How Is Activity Monitoring Useful In The Event Of An Attack? If a cyberattack occurs,
activity logs provide a roadmap to the origin of the attack and what agents were directly involved.
Each log can provide further details about where the cyberattackers may be going next, how to
prevent further attacks and what can be done to secure your system after the threat has been
eliminated. It is important to notice that, in the majority of the cases, the domain experts are
able to classify malicious actions and safe ones. This is usually a straightforward task that is
performed in several diferent types of security systems, from intrusion detection systems to
blacklist based access control. Unfortunately, the task of identifying and marking malicious
actions is not always easy. As a matter of fact, even if we could enumerate all malicious actions,
we may found that many malicious activities are the result of the execution of several actions
that, when analyzed individually, are usually considered safe. This allows to understand why
simple blacklists are usually not enough to keep a system safe. System administrators are
overwhelmed by a continuous stream of logs that is very dificult to tackle. It is worth noticing
that, in managing this huge amount of data, it is likely to miss small amounts of relevant actions,
that could be related to an attack. Indeed, the actions related to an attack represent an almost
invisible fraction, in terms of size, hidden in endless logs as a needle in a haystack. Thus, it is
straightforward to note that the analysis of the activity logs became an infeasible task to be
performed manually, on the contrary it is mandatory to use anomaly detection techniques in
order to early detect any possible threat.</p>
      <p>
        In this paper, we leverage the prime numbers based encoding scheme described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] in
order to eficiently and efectively identify suspicious activities by means of a proper outlier
detection strategy. More in detail, our contribution can be summarized as follows:
• scout: a hierarchical approach for quick identification of outlying activities;
• A deep experimental evaluation of the proposed approach that confirmed the validity of
our proposal.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Computing Outliers on Activity Logs</title>
      <p>
        In order to identify outlying activities after they are encoded as explained in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we leverage
a fast hierarchical approach, that combines the benefits of the divisive and agglomerative
approaches. Fig. 1 illustrates the operations performed by our algorithm scout. The input
…
…
…
…
dataset  is first partitioned into a set of rectangular blocks {1, . . . , } by means of a
topdown divisive procedure. Those blocks are then refined in the next step, whose result is a
new partitioning of the points of , {0, 1, . . . , }, where 0 contains points that are
classified as outliers, and each of the remaining  sets represents a cluster. These  clusters
are processed in the third step, through a bottom-up agglomerative procedure, whose result
is a new partitioning of the points in 1 ∪ . . . ∪  into the sets 1, . . . , , where  ≤ ,
obtained by possibly merging groups of clusters in {1 ∪ . . . ∪ } into one only cluster of
{1, . . . , }. Finally, the new clusters in output from the agglomerative step and the outliers
0 in output from the previous, are refined, thus obtaining the final set of well-rounded clusters,
{1, . . . , }, and the set of outliers 0. In our running example of Fig. 2, we show how the
successive steps of scout applied to the two-dimensional data distribution of Fig. 2(a), produce
the results shown in Fig. 2(b–e). Thus, Fig. 2(b), (c), (d), and (e) show the results produced,
respectively, by the steps of Fig. 1, which are described in detail in the following.
      </p>
      <p>G</p>
      <p>C</p>
      <p>F</p>
      <p>D</p>
      <p>E
(a)</p>
      <p>B</p>
      <p>A
H</p>
      <p>J
1 C</p>
      <p>
        The divisive step of scout performs a top-down partitioning of the data set to isolate blocks
whose points are as much as possible close to each other: this is equivalent to minimizing the
  of the blocks, an objective also pursued by the vast majority of algorithms tailored for
outlier detection. As the clustering algorithm at the basis of the outlier detection strategy have
been described in details in [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ] we do not report here the complete description except a brief
overview of the approach focusing on the goal of this paper, i.e., the. outlier detection phase.
Indeed, the divisive phase described above partition the overall space into (A) several blocks
containing clusters, and (B) zero or more blocks that only contain outliers and noise points.
We also see that blocks of type A will only contain a single cluster since blocks containing
several clusters were split according to the “when in doubt split” philosophy used in the divisive
end
 ← average(v) + standardDeviation(v);  ← 1;
while  &lt;  ∧ [] ≤  do
      </p>
      <p>←  + 1;
end
 ← []; // the maximum density an outlier block may have
for  ← 1 to  do</p>
      <p>[] ← density()≤  ∧ aroundCentroidDensity()&lt; 2× density();
phase. To better understand this approach, we report the details of the refinement step in order
to better explain the details of the outlier detection strategy. More in detail, scout seeks to
realize the objectives of (I) identifying the blocks that contain only outliers, and (II) generating
well-rounded clusters for the remaining blocks. Task (I) is performed by using the observation
that (a) the density of blocks containing only outliers is low, and (b) the density of such blocks
is rather uniform because of the random nature of noise and outliers.</p>
      <p>Algorithm 1: The algorithm for identifying outlier blocks</p>
      <p>The pseudo-code is reported in Algorithm 1. scout checks condition (a) first by sorting the
densities of blocks and detecting discontinuity in density between blocks that are contiguous in
that ordering. That is, if  is an array containing the density values of the  blocks yielded by
the divisive step, sorted in ascending order, we compute an array  with  − 1 values, such that
[] = [ + 1]/[]. Then, if  outlier-blocks are present, the first  − 1 values of  will
be approximately equal to 1, while [] will be ‘quite large’. Through a number of experiments,
extensive empirical evaluation shows that a good threshold for value of  to be considered
‘quite large’ is provided by average value of  plus its standard deviation. Thus, with  the
smallest value index for which the v-value exceeds this threshold, we classify all blocks with
density above  = [] as cluster blocks. Then, we test the other blocks using condition
(b), i.e., we test the blocks that are suspected to be outlier blocks because of their low density.
This additional verification step is needed because large density diferences between cluster
blocks cannot be always excluded (e.g., because of large diferences in the volume of those
blocks). Therefore, a small hypercube is taken around the centroid of each block being checked
for condition (b), as depicted in Figure 3, and if the density in the hypercube is less than twice
the density of the whole block, this is classified as an outlier block (Fig. 3(a)); otherwise it is
classified as a cluster block (Fig. 3(b)). The size of the hypercube used for this check is such that,
for each dimension, the edge of the hypercube is 1/10 of that of the whole block on the same
dimension. The rationale of this the criterion can be understood by considering that for an
(a)
(b)
outlier-block, where data are randomly distributed, it is very unlikely that the density around
centroid is twice the global density while in a cluster-block, the density around centroid is very
likely to be twice the global density. For instance, these tests applied to our running example in
Fig. 2(b) have classified blocks “J”, “K”, “L” and “M” as outlier-blocks.</p>
      <p>Having thus identified all the cluster blocks, we can now proceed with task (II) and materialize
the centroids of their clusters and assign each point to the cluster with the nearest centroid.
To this end, we use a weighted function of the distance between points on each dimension.
Specifically, we measure the distance of a point p = (1, . . . , ) from a cluster-blocks  with
centroid (1, . . . , ) as
⋆(p, ) = ∑︁ (︂ ‖ − ‖ )︂ 2</p>
      <p>
        =1
 = 3 ·
√︂  ()

where  is the radius of the cluster along the -th dimension, which is estimated by the usual
3×  criterion [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] used to separate core points from outliers:
where  is the number of points inside  and  () is the variance of the -th coordinates
of the points in . Observe that ⋆(p, ) = 1 defines an ellipsoid whose center is the
centroid of  and whose radius on the -th dimension is . Therefore, we can now classify each
point in the sample set as either belonging to a particular cluster or as an outlier by assigning
each point p to the cluster  which minimizes (p, ), provided that this minimum distance
is less than or equal to 1; otherwise p is classified as outlier. The final output is thus a quite
simple and explainable set of points that are marked as outliers.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Evaluation</title>
      <p>In this section, we will describe the experimental assessment for our framework. First, we
introduce the pre-elaboration steps that are needed on the dataset before running our tests as
the activity logs usually contains a huge amount of information that could be not partitioned
in sessions or single user activities. More in detail, when dealing with transactional data, a
tuple is a collection of features, an activity record instead is an ordered set (i.e., a sequence) of
time-stamped actions. Moreover, log data are usually recorded in a variety of diferent formats,
and they can be built from several domains. As we deal with activity logs composed of diferent
types of actions we define a mapping function from the domains of actions to logs.
Definition 3.1. Let  be a set of actions contained in an activity logs, a function  :  →
 mapping each action to its activity log (or set of activity logs) is called a labeling function.</p>
      <p>We point out here, that due to privacy reasons, real datasets are usually anonymized,
nevertheless, in the experimental section we performed a proper labeling of activities in order to
obtain a meaningful experimental assessment for our approaches. More in detail, we ran our
experiments on several dataset whose naming convention are diferent for each action, thus,
we omit here to describe the labeling function we used as they are quite straightforward (by
using name permutation or punctuation) but we mention here their existence to clarify that a
proper mapping is always computed as pre-elaboration step especially when same action could
have diferent name in diferent sessions in order to have a fair comparison.
3.1. Results
For the experimental evaluation we used two diferent datasets.</p>
      <p>
        The first dataset source is the widely used KDD Cup 1999 Data [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] which contains more
than seven million connection records. A connection is a sequence of TCP packets starting and
ending within a time interval during which, data flows to and from a source IP address to a
target IP address under some well defined protocol. Each connection is labeled as either normal,
or as an attack, with exactly one specific attack type. Each connection record consists of about
100 bytes. Attacks fall into four main categories: i) DOS: denial-of-service, e.g. syn flood; ii) R2L:
unauthorized access from a remote machine, e.g. guessing password; iii) U2R: unauthorized
access to local superuser (root) privileges, e.g., various “bufer overflow” attacks; iv) probing:
surveillance and other probing, e.g., port scanning. As regards the second dataset, referred from
now on as “synthetic”, in order to perform fine tuning tests, we developed a logging system
based on the strace Linux command that captures the system calls invoked by a malicious
software we created ad-hoc. The generated log contains the timestamp, the user performing the
call, the name of the call, and its parameters. We further enriched the information contained in
the logs by running our logging system on many diferent applications, using diferent users, so
that the results contains information related both to normal usage of the operating system and
to malicious activity. It is important to notice that each dataset described above contains labels
that are used to mark malicious actions. However, since we are dealing with activities which
we defined as a sequence of several atomic actions, we consider an activity as a malevolent
one if it is a sequence of actions containing at least one malicious action. The first task in our
experimental evaluation is, then, identifying activities in our datasets. This task can be done
using diferent metrics and considerations, and, in most of the cases, is strictly related to the
specific log we are working on. The actions included in a given activity are, in fact, determined
by means of multiple factors like the user performing the action, the timestamp, the relationship
with a software session, a network connection, the type of the action performed and so on.
More in detail, we built activities by specifying a fixed length and then (eventually) padding
them as explained in previous section. We compared the performances on well-established
      </p>
      <p>Measure
Sensitivity
Specificity</p>
      <p>Precision
Negative Predictive Value</p>
      <p>False Positive Rate
False Discovery Rate
False Negative Rate</p>
      <p>Accuracy</p>
      <p>F1 Score
Matthews Correlation</p>
      <p>Coeficient</p>
      <p>Derivations</p>
      <p>+</p>
      <p>+</p>
      <p>+</p>
      <p>+</p>
      <p>+</p>
      <p>+</p>
      <p>+ 
  + 
 +
2 
2  +  + 
  *  −   *  
√(  +  )* (  + )* ( +  )* ( + )
evaluation described in table 1 where TP, FP, TN, FN respectively mean True Positive, False
Positive, True Negative and False Negative.</p>
      <p>In Figure 1, we report the results obtained for the KDD and Synthetic datasets.The
performances are quite interesting due to the peculiar features of our approach, especially when
observing that the percentage of threats that are not identified correctly (and the percentage of
normal activities marked as malicious as well) is negligible.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and Future Work</title>
      <p>In this paper we discussed SCOUT, an algorithm for clustering based outlier detection for
activity logs. Our goal is to overcome the limitation of classical approaches for outlier detection
that do not apply well to activity logs as each activity is a sequence of actions and the definition
of a proper metric for evaluating anomalous actions could be quite dificult. In this respect, we
leverage a simple yet efective encoding that allow us to easily represent complex activities as
real numbers. After the encoding phase, we perform an efective outlier detection based on
cluster analysis. We evaluated our algorithm on several datasets by performing an accurate
pre-elaboration step in order to make them more efective. The results obtained show a very
good efectiveness of our strategy on all the datasets being tested thus confirming the soundness
of our approach. As a future work, we plan to test other encoding strategies as well as more
refined outlier functions in order to further improve our (already satisfactory) performances.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Babbin</surname>
          </string-name>
          ,
          <article-title>Security log management: identifying patterns in the chaos</article-title>
          ,
          <source>Elsevier</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Morgan</surname>
          </string-name>
          , Cybercrime to cost
          <source>the world $10.5 trillion annually by</source>
          <year>2025</year>
          ,
          <year>2020</year>
          . URL: https://cybersecurityventures.com/hackerpocalypse-cybercrime
          <source>-report-2016/.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jang-Jaccard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nepal</surname>
          </string-name>
          ,
          <article-title>A survey of emerging threats in cybersecurity</article-title>
          ,
          <source>Journal of Computer and System Sciences</source>
          <volume>80</volume>
          (
          <year>2014</year>
          )
          <fpage>973</fpage>
          -
          <lpage>993</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Zikopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Eaton</surname>
          </string-name>
          , et al.,
          <article-title>Understanding big data: Analytics for enterprise class hadoop and streaming data</article-title>
          ,
          <source>McGraw-Hill Osborne Media</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E. E.</given-names>
            <surname>Schultz</surname>
          </string-name>
          ,
          <article-title>A framework for understanding and predicting insider attacks</article-title>
          ,
          <source>Computers &amp; Security</source>
          <volume>21</volume>
          (
          <year>2002</year>
          )
          <fpage>526</fpage>
          -
          <lpage>531</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Souppaya</surname>
          </string-name>
          ,
          <article-title>Guide to computer security log management</article-title>
          ,
          <source>NIST special publication 92</source>
          (
          <year>2006</year>
          )
          <fpage>1</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ianni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Masciari</surname>
          </string-name>
          ,
          <article-title>A compact encoding of security logs for high performance activity detection</article-title>
          ,
          <source>in: 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP</source>
          <year>2021</year>
          , Valladolid, Spain, March
          <volume>10</volume>
          -12,
          <year>2021</year>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>240</fpage>
          -
          <lpage>244</lpage>
          . URL: https://doi.org/10.1109/PDP52278.
          <year>2021</year>
          .
          <volume>00045</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ianni</surname>
          </string-name>
          , E. Masciari,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Mazzeo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mezzanzanica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zaniolo</surname>
          </string-name>
          ,
          <article-title>Fast and efective big data exploration by clustering</article-title>
          ,
          <source>Future Generation Computer Systems</source>
          <volume>102</volume>
          (
          <year>2020</year>
          )
          <fpage>84</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ianni</surname>
          </string-name>
          , E. Masciari,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Mazzeo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zaniolo</surname>
          </string-name>
          ,
          <article-title>Clustering goes big: Clubs-p, an algorithm for unsupervised clustering around centroids tailored for big data applications</article-title>
          ,
          <source>in: 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>558</fpage>
          -
          <lpage>561</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Livny</surname>
          </string-name>
          ,
          <string-name>
            <surname>Birch:</surname>
          </string-name>
          <article-title>An eficient data clustering method for very large databases</article-title>
          ,
          <source>in: SIGMOD</source>
          ,
          <year>1996</year>
          , pp.
          <fpage>103</fpage>
          -
          <lpage>114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hettich</surname>
          </string-name>
          ,
          <article-title>Kdd cup 1999 data, The UCI KDD Archive (</article-title>
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>