<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DIP-ECOD: Improving Anomaly Detection in Multimodal Distributions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kaixi Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paul Miller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jesús Martinez-del-Rincon</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Secure Information Technologies (CSIT), Queen's University Belfast</institution>
          ,
          <addr-line>Belfast</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Anomaly detection algorithms identify unusual events and outliers in large datasets where manual approaches are highly impractical. Most prior anomaly detection methods assume simple unimodal Gaussian data distributions; however, they produce suboptimal results on complex multimodal distributions. To address this problem, we propose Dip-ECOD, a novel anomaly detection algorithm leveraging unsupervised machine learning that generalises to both multimodal and unimodal distributions. Dip-ECOD integrates a dip test within the ECOD framework, using SkinnyDip to split a probability distribution into separate modes, after which ECOD is applied. In this way, difficult-to-find outliers between modes and hidden in the distribution tails of each mode are also detected. Experiments using nine benchmark datasets across a range of domains such as healthcare and imagery demonstrate Dip-ECOD's improved performance over ECOD in detecting outliers in both multimodal and unimodal distributions, with Dip-ECOD achieving an average AUC score of 0.791 compared to ECOD's 0.761. Further, using a proprietary enterprise dataset, we show Dip-ECOD effectively identifies anomalous GitHub commits, indicating its applicability to information security and software vulnerability detection, where multimodal distributions are expected.</p>
      </abstract>
      <kwd-group>
        <kwd>anomaly detection</kwd>
        <kwd>unsupervised learning</kwd>
        <kwd>modality testing</kwd>
        <kwd>distributed learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Outliers refer to rare events, data points, and unusual behaviors that do not follow the same trends,
distributions, or patterns as the majority of a dataset. Outlier detection (OD) algorithms have a range
of important applications, for example, intrusion detection [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], log anomaly detection [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], malware
detection [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], fraud detection [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and medical diagnosis [5]. Proximity-based approaches and
machine learning algorithms previously obtained promising results, although they often led to high computation
costs and suffered from the curse of dimensionality [6], [7]. Recent work [8], [9], [10], [11] developed
distribution-based approaches by fitting probability distributions to data points. However, these methods assumed
unimodal distributions [8], [12], resulting in a performance drop on more complex data distributions.
While the unimodal Gaussian distribution is a common assumption that fits many tasks, it does not
hold for more nuanced problems where multiple phenomena or data sources can cause abnormality.
In this regard, there is little existing work focused on multimodal datasets [13], [14], [15] or on detecting
anomalies within multimodal distributions. With the majority of real-world datasets [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [5] being
multimodal, we are thus motivated to investigate this under-researched area of OD.
      </p>
      <p>This paper presents Dip-ECOD, a novel unsupervised learning method for effective anomaly detection
within multimodal distributions. Our proposed approach generalises Empirical
Cumulative-distribution-based Outlier Detection (ECOD) [8] from simple unimodal Gaussian distributions to more widely
applicable multimodal cases. This is achieved by using the statistical dip test [16] of unimodality and
integrating it within the ECOD framework. We assume each dataset dimension is a multimodal Gaussian
mixture, a generalisation of the unimodal Gaussian. Inspired by SkinnyDip [17], we apply the dip test [16] recursively
to detect all the areas that cover the least frequent values between the modes, which we call anti-mode
intervals [18], and use the midpoint of each anti-mode interval as a split point for separating a multimodal
distribution into its components. After dividing the data distribution into separate unimodal
distributions, ECOD is applied to each component to detect outliers in the tail of each mode’s
distribution.</p>
      <p>Dip-ECOD’s performance is first demonstrated on multiple benchmark datasets from varying
real-world domains such as healthcare and imagery. Afterward, we further show its applicability to
information security by evaluating Dip-ECOD against a proprietary enterprise cybersecurity dataset of
GitHub commits. Identifying unusual and suspicious GitHub commits can prevent malicious actors
from executing software attacks; in reality, it is impossible to check all codebase commits by
hand, so OD algorithms are required.</p>
      <p>Our contributions are:
• A novel unsupervised anomaly detection method, Dip-ECOD, that performs on both multimodal and
unimodal distributions, including the first use of the statistical dip test within OD.
• An extensive evaluation on nine different real-world benchmark datasets showing Dip-ECOD’s
wider general-purpose applicability. Dip-ECOD outperforms eight other state-of-the-art methods,
achieving an average AUC score of 0.791.
• A further evaluation using a proprietary enterprise dataset of GitHub commits showing its
suitability for information security.</p>
      <p>The rest of the paper is organised as follows. In Section 2, related work on OD is reviewed. Our
approach, Dip-ECOD, is discussed in Section 3. The proposed model is evaluated in Section 4 while
Section 5 contains experimental results and analysis. Section 6 presents our conclusions and discusses
future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>In this section, relevant anomaly detection approaches and software vulnerability detection in the
literature are reviewed.</p>
      <sec id="sec-2-1">
        <title>2.1. Distribution-based approaches</title>
        <p>Anomaly detection methods based on out-of-distribution assumptions are implemented by fitting
probability distributions to data points, where the anomalies are those that have a very low probability.
Common distribution-based approaches for anomaly detection include the Gaussian Mixture Model
(GMM) [10], linear regression [11] or Kernel Density Estimation (KDE) [9]. However, those methods
either require tuning of hyperparameters or have high computational costs [8]. To address this sensitivity
to key parameters, a parameter-free, interpretable, and fast algorithm called ECOD was introduced by
Li et al. [8]. ECOD is an unsupervised anomaly detection technique based on the empirical
cumulative distribution function (ECDF) [19]. ECOD is inspired by the fact that outliers are often
the rare events that appear in the tails of a distribution. For each dimension, a univariate ECDF is
computed; outlier scores are then calculated by multiplying all of the estimated tail probabilities from
each of the univariate ECDFs. However, ECOD is built on the assumption of unimodal distributions,
and so may not perform well on multimodal distributions where, for instance, the outliers lie between
modes and are not detectable by ECOD. It has been stated that the majority of datasets are multimodal [20], meaning
it is worthwhile to investigate a method to handle the multimodal scenario. Therefore, taking
inspiration from [8] and [17], we propose a novel anomaly detection method called Dip-ECOD to solve
the problem of anomaly detection in data with multimodal distributions, with our motivation being
that rare events are also located in the shallows between frequent events rather than solely in the tails.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. GitHub repository analysis for software vulnerability</title>
        <p>There are many studies on detecting vulnerabilities in GitHub repositories. The existing methods
analyse GitHub repositories from three aspects: source code patches [21], [22], [23], [24], commit
messages [25], and commits logs [26], [27], [28].</p>
        <p>Among the most recent approaches, deep learning architectures and neural networks such as RNNs
[21], [25] and hierarchical attention networks [22] have been proposed, reporting good results using
either code change features or commit messages. However, those methods need the complete code
built into packages, or access to the whole source code files, which makes them unsuitable for
code snippets as well as slow and cumbersome to use in real time at commit upload. They also require
a large amount of annotated training data given their supervised approach, so they rely on either
synthetic source code [21], such as the Juliet Test Suite [29], or privately annotated scraped datasets [25].</p>
        <p>Simpler approaches, less reliant on annotated data, have also been proposed. In this respect, commit
logs are also analysed to detect anomalous commits. Gonzalez et al. [26] build a rule-based model to detect
malicious commits from the metadata of commit logs. The metadata features they considered
include lines of code added, removed, and modified, the number of files, and the number of unique file
types modified. Although they successfully detect more than 50% of the malicious commits, the
false positive rate remains high, showing there is still room for improvement. Therefore, we propose detecting
software vulnerabilities by analysing commit logs using unsupervised approaches.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>In this section, we propose Dip-ECOD, our novel anomaly detection method. This approach is based
on the assumption that a dataset distribution is an equally weighted multimodal Gaussian mixture,
with rare events lying both in the tails and between the mixture components. Our aim is to generalise
the anomaly detection method ECOD [8] to multimodal datasets. We also assume the outliers are rare
events whose data points occur in low-density regions of the probability distribution [30, 31].</p>
      <p>Figure 1 presents the architecture of our proposed approach, Dip-ECOD. Our approach involves
first estimating the modality of each dimension of the data using a dip test, from which a p-value is calculated
and modal intervals are returned. If a dimension is unimodal, the ECDF is estimated straightaway; otherwise
component splitting is applied to locate each component’s modal intervals recursively, followed by
taking the middle points between the modal intervals to split the multimodal distribution into individual
components. Then the ECDF of each component is estimated and the skewness is calculated. The outlier
score for each data point is obtained by aggregating the outlier scores across dimensions. More
details are given in the following sections.</p>
      <sec id="sec-3-1">
        <title>3.1. Modality estimation</title>
        <p>We estimate the modality of a dataset first. In the case of multimodality, as is expected for most real
datasets [20], Dip-ECOD’s component splitting will detect and isolate each mode. Otherwise, if
the data happen to be unimodal, our approach becomes equivalent to ECOD. More details can be found in
Section 3.2.</p>
        <p>Hartigan’s dip test [16] is a common approach to test the modality of distributions. In the dip test,
the null hypothesis assumes that the observed dataset is drawn from a unimodal distribution, which
means that there is only one mode in the distribution. The alternative hypothesis suggests that the
dataset is derived from a multimodal distribution, meaning that there is more than one mode.</p>
        <p>The dip statistic D is the maximum difference between the empirical distribution function of
the observed dataset, denoted F_n(x), and the best-fitting unimodal distribution function U(x) that
minimises that maximum difference. It represents the departure of the sample from unimodality and
can be calculated as</p>
        <p>D = sup_x |F_n(x) − U(x)| (1)</p>
        <p>with D ∈ (0, 0.25]: if D is close to zero, the distribution is unimodal, while the distribution
is deemed multimodal if D is close to 0.25. U(x) can be obtained by estimating the greatest convex
minorant (g.c.m.) and least concave majorant (l.c.m.) of F_n(x) [16], [32]. The modal interval (x_L, x_U)
indicates the region of the steepest slope in F_n(x): the g.c.m. of F_n(x) is taken in (−∞, x_L] and the
l.c.m. in [x_U, ∞). The g.c.m. in (−∞, x_L] means that U(x) is convex there and nowhere greater than
F_n(x); in contrast, the l.c.m. in [x_U, ∞) means that U(x) is concave there and nowhere less than
F_n(x) [16]. In other words, the slope of F_n(x) increases in (−∞, x_L] and decreases in [x_U, ∞).
The modal interval will be used in the following stages of Dip-ECOD.</p>
        <p>To assess the significance of the dip statistic, given the sample size n and the computed dip
statistic D, the p-value can be calculated using the formula</p>
        <p>p-value = √n · D (2)</p>
        <p>A p-value represents the probability of occurrence of the observed data under the null hypothesis.
In other words, it measures the difference between the data and the null hypothesis using an estimate
of the parameter of interest. The smaller the p-value, the stronger the evidence against the null
hypothesis. Note that the p-value is affected by the sample size: the larger the sample size, the smaller
the p-value [33]. As the sample size increases, the uncertainty about where the population mean lies
decreases. With a very large sample, the standard error becomes extremely small, so that even
minuscule distances between the estimate and the null hypothesis become statistically significant [34].</p>
        <p>We then compare the p-value with a threshold α to determine whether the p-value is low enough
to reject the null hypothesis that the dataset is unimodal. The higher the α, the higher the
probability of rejecting the null hypothesis and therefore identifying the data as multimodal. On the
other hand, the lower the α, the lower the probability of identifying the data as multimodal. After
careful consideration, we choose α = 0.001 as our default setting for all the experiments.</p>
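        <p>The decision rule itself is a simple threshold comparison; a minimal sketch (the helper name is ours; the p-value itself comes from the dip test):</p>

```python
ALPHA = 0.001  # default threshold used for all experiments in this paper

def modality_decision(p_value, alpha=ALPHA):
    """Reject the null hypothesis of unimodality when p < alpha.

    A larger alpha makes a multimodal verdict more likely;
    a smaller alpha makes it less likely."""
    return "multimodal" if p_value < alpha else "unimodal"

print(modality_decision(1e-6))   # strong evidence against unimodality -> multimodal
print(modality_decision(0.2))    # fail to reject the null -> unimodal
```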
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Cumulative distribution function (CDF) and empirical cumulative distribution function (ECDF)</title>
        <p>A cumulative distribution function (CDF) gives the probability that a random variable X takes on a
value less than or equal to x, and is usually denoted F(x). The CDF of a random variable X is
formally defined by:</p>
        <p>F(x) = P(X ≤ x) (3)</p>
        <p>where P(X ≤ x) is the probability that the random variable X takes on a value less than or equal to x.
A CDF has four properties: it is non-decreasing and right-continuous [35], with lim_{x→−∞} F(x) = 0
and lim_{x→+∞} F(x) = 1, illustrating that the CDF approaches 1 as x becomes large and 0 as x
becomes small.</p>
        <p>To estimate a CDF, an empirical cumulative distribution function (ECDF) [19] is used. It assigns
a probability mass of 1/n to each datum, orders the data from smallest to largest by value, and
accumulates the assigned probability masses up to and including each datum. The result is a
step function that increases by 1/n at each datum. The ECDF, F̂(x), is formally defined as:</p>
        <p>F̂(x) = (1/n) Σ_{i=1}^{n} 1(X_i ≤ x)</p>
        <p>where 1(X_i ≤ x) is an indicator of the event X_i ≤ x.</p>
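        <p>The ECDF definition translates directly into code; a minimal NumPy sketch (the function name is ours):</p>

```python
import numpy as np

def ecdf(sample, x):
    """Empirical CDF at x: the fraction of observations <= x.
    Each datum carries probability mass 1/n, so the resulting step
    function rises by 1/n at every data point."""
    sample = np.asarray(sample, dtype=float)
    return float(np.mean(sample <= x))

data = [3.0, 1.0, 4.0, 2.0]
print(ecdf(data, 2.0))   # 2 of 4 points are <= 2.0 -> 0.5
print(ecdf(data, 4.0))   # all points are <= 4.0   -> 1.0
print(ecdf(data, 0.5))   # no points are <= 0.5    -> 0.0
```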
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Mixture distribution</title>
        <p>A mixture distribution, also called a multimodal distribution, is a collection of probability distributions.
Each distribution that forms the mixture is called a mixture component and can be regarded
as a unimodal distribution. A weight is associated with each component to model its influence
on the overall distribution. The number of components in a mixture distribution can be finite or
infinite; however, in practical terms, a finite number of components should suffice to model the mixture
distribution within a defined error ε. In our case, we assume that the mixture distributions we
investigate have a finite number of components.</p>
        <p>Given a finite set of CDFs F_1(x), F_2(x), ..., F_k(x) and weights w_1, w_2, ..., w_k such that
w_i ≥ 0 and Σ_{i=1}^{k} w_i = 1, the CDF of the mixture can be represented by F(x̃) as a weighted
sum, formally:</p>
        <p>F(x̃) = Σ_{i=1}^{k} w_i F_i(x̃) (4)</p>
        <p>where each term combines an individual component CDF F_i(x̃) with its associated weight w_i.</p>
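        <p>As a worked example, the CDF of an equally weighted two-component Gaussian mixture (the shape assumed for the synthetic data later in the paper) can be evaluated as a weighted sum of component CDFs; a sketch using only the standard library (the function names are ours):</p>

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a Gaussian, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def mixture_cdf(x, components, weights):
    """CDF of a finite Gaussian mixture: the weighted sum of the
    component CDFs (weights non-negative, summing to one)."""
    return sum(w * normal_cdf(x, mu, s) for (mu, s), w in zip(components, weights))

# Equally weighted bimodal mixture with modes at 0 and 10.
comps, ws = [(0.0, 1.0), (10.0, 1.0)], [0.5, 0.5]
print(mixture_cdf(5.0, comps, ws))    # exactly between the modes -> 0.5
print(mixture_cdf(20.0, comps, ws))   # far right tail -> close to 1.0
```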
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Component splitting</title>
        <p>The main idea of our method is to generalise ECOD to multimodal distribution. To do so, a process
called component splitting is introduced to split the multimodal distribution into individual unimodal
distributions or components. In this section, we will focus on explaining how we split multimodal
distributions.</p>
        <p>Once a multimodal distribution is identified (see Section 3.1), the intervals of each mode from each
component in the multimodal distribution, called modal intervals, are identified using [17].
Next, the intervals outside the modal intervals, i.e. the regions where no mode is located, called
anti-modal intervals, are also identified. Then, split points are obtained by taking the middle point
between each pair of consecutive modal intervals. Consequently, the multimodal distribution
is split into multiple unimodal distributions, or components, according to these middle points. Finally,
ECOD can be applied to each component to calculate the outlier scores.</p>
        <p>We assume a multimodal dataset with n data points X = {x_i}_{i=1}^{n} ∈ R^{n×d}. There are k
components (modes, or modal intervals) in the dataset, and the index of each component is represented
as c. The location of each modal interval, i.e. [l_c^{(j)}, u_c^{(j)}], of each component {M_c}_{c=1}^{k}
at dimension j is returned by SkinnyDip [17], where l_c^{(j)} and u_c^{(j)} represent the lower and upper
bounds of the modal interval. As a result, a set of the modal intervals of the k components is obtained,
denoted {[l_1^{(j)}, u_1^{(j)}], ..., [l_k^{(j)}, u_k^{(j)}]}.</p>
        <p>The split points can be located according to the modal intervals obtained in the previous step.
Knowing that there are k components in total, the middle point between each pair of consecutive
modal intervals is taken as a split point, giving S = {s_c^{(j)}}_{c=1}^{k−1}, where s_c^{(j)} represents
the c-th split point of the j-th dimension; there are up to k − 1 split points, obtained using the
function below.</p>
        <p>s_c^{(j)} = u_{c−1}^{(j)} + (l_c^{(j)} − u_{c−1}^{(j)}) / 2 (5)</p>
        <p>where u_{c−1}^{(j)} is the upper bound of the modal interval of one component and l_c^{(j)} is the
lower bound of the next component beside it. As a result, the multimodal distribution is split
into multiple unimodal components, where each component C_c^{(j)} consists of all the data points lying
between two consecutive split points. Each component consists of n_c = |C_c^{(j)}| data points; for
instance, n_2 = |C_2^{(j)}| is the number of data points between s_1^{(j)} and s_2^{(j)}.</p>
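        <p>The split-point computation reduces to taking midpoints between consecutive modal-interval bounds. A minimal sketch (the function name is ours), checked against the modal intervals reported for Figure 2:</p>

```python
def split_points(modal_intervals):
    """Midpoints between consecutive modal intervals: for components
    c = 2..k, s_c = u_{c-1} + (l_c - u_{c-1}) / 2, yielding k - 1
    split points for k components."""
    return [u_prev + (l_next - u_prev) / 2.0
            for (_, u_prev), (l_next, _) in zip(modal_intervals, modal_intervals[1:])]

# Modal intervals [-1.6, 1.6] and [8.3, 11.6] give a split point of
# about 4.95 (reported as 5 in the paper); [-1.6, 1.6] and [2.3, 6.6]
# give 1.95, matching Figure 2b.
print(split_points([(-1.6, 1.6), (8.3, 11.6)]))
print(split_points([(-1.6, 1.6), (2.3, 6.6)]))
```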
        <p>Hence all the components are separated as individual unimodal distributions. A procedure similar
to ECOD is applied to measure how anomalous each x_i is inside each component: the left and right
tail ECDFs of each component are estimated using the following equations:</p>
        <p>F̂_left^{(j)}(x_i^{(j)}) = (1/n_c) Σ_{m=1}^{n_c} 1{x_m^{(j)} ≤ x_i^{(j)}},  x_i^{(j)} ∈ C_c^{(j)} (6)</p>
        <p>F̂_right^{(j)}(x_i^{(j)}) = (1/n_c) Σ_{m=1}^{n_c} 1{x_m^{(j)} ≥ x_i^{(j)}},  x_i^{(j)} ∈ C_c^{(j)} (7)</p>
        <p>where the indicator function 1{·} is 1 when its argument is true and 0 otherwise.</p>
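        <p>Within a single component, the left and right tail ECDFs are simple counting operations; a minimal NumPy sketch (the function name is ours):</p>

```python
import numpy as np

def tail_ecdfs(component, x):
    """Left and right tail ECDFs of a value x within one component:
    the fraction of the component's points <= x and >= x respectively."""
    c = np.asarray(component, dtype=float)
    return float(np.mean(c <= x)), float(np.mean(c >= x))

comp = [1.0, 2.0, 3.0, 4.0, 5.0]
print(tail_ecdfs(comp, 1.0))   # leftmost point:  (0.2, 1.0)
print(tail_ecdfs(comp, 5.0))   # rightmost point: (1.0, 0.2)
```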
        <p>Therefore, across all d dimensions, based on the assumption that different dimensions are
independent of each other, we estimate the joint left and right tail ECDFs for each data point as:</p>
        <p>F̂_left(x_i) = ∏_{j=1}^{d} F̂_left^{(j)}(x_i^{(j)}) (8)</p>
        <p>F̂_right(x_i) = ∏_{j=1}^{d} F̂_right^{(j)}(x_i^{(j)}) (9)</p>
        <p>After estimating the joint left and right tail ECDFs in the previous step, skewness is also needed
as part of the computation of the outlier score for each data point. Skewness is the degree of
asymmetry of a distribution; a distribution can have right (positive), left (negative), or zero skewness.
To decide which tail should be kept, the skewness of each component at dimension j is computed using
the function:</p>
        <p>γ^{(j)} = [ (1/n_c) Σ_{i=1}^{n_c} (x_i^{(j)} − x̄^{(j)})^3 ] / [ (1/(n_c − 1)) Σ_{i=1}^{n_c} (x_i^{(j)} − x̄^{(j)})^2 ]^{3/2} (10)</p>
        <p>where x̄^{(j)} = (1/n_c) Σ_{i=1}^{n_c} x_i^{(j)} is the mean of the j-th feature in each component.</p>
        <p>Finally, we calculate the outlier score for each data point. Since the values of the ECDFs are in
the range [0, 1] and potential outliers have lower tail probabilities, the negative log probability is
applied to aggregate the tail probabilities. As a result, a lower tail probability corresponds to a
higher outlier score and vice versa. The left, right, and automatically selected ("auto") outlier scores
for the point x_i are calculated as the equations shown below:</p>
        <p>O_left(x_i) = − Σ_{j=1}^{d} log(F̂_left^{(j)}(x_i^{(j)})) (11)</p>
        <p>O_right(x_i) = − Σ_{j=1}^{d} log(F̂_right^{(j)}(x_i^{(j)})) (12)</p>
        <p>O_auto(x_i) = − Σ_{j=1}^{d} [1{γ^{(j)} &lt; 0} log(F̂_left^{(j)}(x_i^{(j)})) + 1{γ^{(j)} ≥ 0} log(F̂_right^{(j)}(x_i^{(j)}))] (13)</p>
        <p>The highest of the three is then taken as the outlier score O(x_i) for x_i:</p>
        <p>O(x_i) = max(O_left(x_i), O_right(x_i), O_auto(x_i)) (14)</p>
        <p>Lastly, the outlier score O(x_i) ∈ [0, ∞) for every point x_i is obtained. The higher the outlier
score, the more likely the point is an outlier. Note that the outlier scores cannot be interpreted as
probabilities; they are only used for relative comparison across data points. The pseudocode of
Dip-ECOD is displayed in Algorithm 1.</p>
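        <p>The scoring stage can be sketched end-to-end for a single unimodal component, i.e. assuming component splitting has already been performed. The function below is our own compact illustration, not the reference implementation:</p>

```python
import numpy as np

def outlier_scores(X):
    """Negative-log aggregation of tail probabilities for one unimodal
    component: per dimension, compute left/right tail ECDFs, pick the
    skewness-matched tail for the "auto" score, sum the -log tail
    probabilities over dimensions, and return the elementwise max of
    the left, right and auto scores."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    o_left, o_right, o_auto = np.zeros(n), np.zeros(n), np.zeros(n)
    for j in range(d):
        col = X[:, j]
        left = np.array([np.mean(col <= v) for v in col])    # left tail ECDF
        right = np.array([np.mean(col >= v) for v in col])   # right tail ECDF
        skew = np.sum((col - col.mean()) ** 3)               # only the sign matters here
        o_left += -np.log(left)
        o_right += -np.log(right)
        o_auto += -np.log(left) if skew < 0 else -np.log(right)
    return np.maximum(o_auto, np.maximum(o_left, o_right))

# A clear outlier at 100 receives the highest score.
X = np.array([[1.0], [1.0], [1.0], [2.0], [100.0]])
print(int(np.argmax(outlier_scores(X))))   # -> 4
```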
        <p>Algorithm 1 Unsupervised Outlier Detection using Dip-ECOD</p>
        <p>Input: data X = {x_i}_{i=1}^{n} ∈ R^{n×d} with n data points and d features; x_i^{(j)} refers to the
value of the j-th feature of the i-th data point.</p>
        <p>Output: outlier scores O := O(X) ∈ R^n
1: for each dimension j in 1, ..., d do
2:   Estimate the modal intervals [l_c^{(j)}, u_c^{(j)}] of each component C_c^{(j)}
3:   Obtain the split points S = {s_c^{(j)}}_{c=1}^{k−1} ∈ R, where s_c^{(j)} refers to the c-th split point of the j-th feature
4:   Split the multimodal distribution into k components {C_c^{(j)}}_{c=1}^{k} according to the split points, where each component consists of n_c data points
5:   for each component c in 1, ..., k do
6:     Estimate the left and right tail ECDFs of the component
7:     Compute the skewness coefficient γ^{(j)}
8:   end for
9: end for
10: for each data point i in 1, ..., n do
11:   Aggregate the tail probabilities to obtain the left, right and auto outlier scores O_left(x_i), O_right(x_i) and O_auto(x_i)
12:   Select the highest as the outlier score O(x_i)
13: end for
14: return O = (O(x_1), ..., O(x_n))</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Settings</title>
      <p>In this section, we present in detail the datasets used to evaluate the performance of Dip-ECOD, the
performance measure, and the baseline methods we compare with.</p>
      <p>[Figure 2: synthetic bimodal data with a fixed Gaussian (μ1 = 0) and a moving Gaussian. (a) μ2 = 10:
the modal intervals are [-1.6, 1.6] and [8.3, 11.6], and the split point is 5. (b) μ2 = 4: the modal
intervals are [-1.6, 1.6] and [2.3, 6.6], and the split point is 1.95. (c) μ2 = 3: the modal intervals are
[-1.6, 1.6] and [1.4, 4.6], and the split point is 1.5.]</p>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>We perform three types of evaluations in our validation: synthetic datasets, nine public
benchmark datasets commonly used in previously published work [8], and a proprietary enterprise
dataset from the software vulnerability domain. More details on each type of dataset are discussed in
the following subsections.</p>
        <sec id="sec-4-1-1">
          <title>4.1.1. Synthetic dataset</title>
          <p>We first create a synthetic dataset to evaluate the capability of the proposed method in a multimodal
scenario. A bi-dimensional bimodal synthetic dataset of 1,000 data points is built, where the distribution
of each dimension is multimodal or consists of more than one Gaussian.</p>
          <p>Moreover, to evaluate our model on different shapes of multimodal dataset, we have one fixed
standard Gaussian, μ = 0 and σ² = 1, plus a second moving Gaussian which also has σ² = 1 but whose
mean moves from 10 to 2 on each iteration, in increments of 1. In this way, the distance between the two
Gaussians changes. The larger the mean of the second moving Gaussian, the fewer the overlapping
points, and the more distinguishable the outliers. The data points below the 5th percentile
and above the 95th percentile of each Gaussian are labeled as outliers for Gaussians centered
between 6 and 10, since there are no overlapping points between the two Gaussians. Figure 2a shows an
example of how we label the data points when μ = 10 for each dimension. In Figure 2a, the grey area
represents outliers, with the red lines being the 5th percentile and the 95th percentile of each Gaussian.</p>
          <p>For Gaussians centered at 4 and 5, the data points below the 5th percentile of the fixed Gaussian and
above the 95th percentile of the moving Gaussian are labeled as outliers. For the points overlapping
both the fixed and moving Gaussians, we label as outliers the data points that are above the 95th percentile
of the fixed Gaussian and below the 5th percentile of the moving Gaussian, per Figure 2b. For Gaussians
centered at ≤ 3, we only label as outliers data points which are below the 5th percentile of the fixed
Gaussian and above the 95th percentile of the moving Gaussian, per Figure 2c. The data points in the
middle are not included in the outliers.</p>
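          <p>The percentile-based labeling for a single well-separated component can be sketched as follows (our own illustration of the labeling rule, with an arbitrary random seed):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# One well-separated component: a standard Gaussian. Points below the
# 5th or above the 95th percentile are labeled outliers, mirroring the
# labeling scheme used when the two modes do not overlap.
sample = rng.normal(loc=0.0, scale=1.0, size=1000)
lo, hi = np.percentile(sample, [5, 95])
labels = (sample < lo) | (sample > hi)

print(labels.mean())   # about 10% of the points are labeled as outliers
```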
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. Public benchmark datasets</title>
          <p>Table 1 summarises the nine public anomaly detection benchmark datasets selected for this study,
including their dimensionality d and proportion of outliers. These benchmark datasets are part of the
ODDS database [20], which also provides ground truth. The ODDS database was developed in 2016 to
provide real-world anomaly detection datasets with labels. Furthermore, the datasets cover different
application areas and domains, for instance, medicine (arrhythmia, wbc, pima, and lympho) and
aerospace imagery (satellite and satimage-2). The selected datasets are all multidimensional, which
means there is more than one feature in each dataset.</p>
        </sec>
        <sec id="sec-4-1-3">
          <title>4.1.3. Proprietary enterprise dataset</title>
          <p>To highlight the applicability of Dip-ECOD for software vulnerability specifically, and for information
security in general, we also evaluate against the task of detecting anomalous outliers from real-world
GitHub commits in a software repository. Although this dataset is proprietary and non-public, we feel
it is worthwhile to provide the insights gained from using Dip-ECOD on this dataset.</p>
          <p>After selecting an up-to-date commercial repository, we extracted the commits made over the last
five years (January 2018 to January 2023) by the author who made the most commits (743 commits).
An example of a GitHub commit is shown in Figure 3. Only contextual features, such as the
environmental data around the changed code, are extracted. We do this because we would like to detect
anomalous behaviors at a high level rather than from the content data, i.e. the changed code itself;
this approach is faster and more privacy-preserving than analysing code. The contextual features for
our experiments include the commit ID, number of changed files, number of added files, number of
deleted files, and commit time, extracted using git and saved as a CSV file for analysis. Expected
outliers could be anomalous working patterns of the author, unusual code commits or deletions, etc.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental setup and evaluation metrics</title>
        <p>We split each dataset, using 60% for the training set and 40% for the testing set, with 10-fold
stratification used to help avoid overfitting and artificially high results. Our main focus is how well
the model distinguishes between classes. The Area Under the Curve (AUC) score is calculated as the
area under the Receiver Operating Characteristic (ROC) curve. The positive case is an outlier, and the
negative case is a non-outlier, or normal, data point. The ROC curve is created by plotting the true
positive rate (TPR), or sensitivity, against the false positive rate (FPR) at various threshold settings.
TPR and FPR are formally defined in Equation 15 and Equation 16:</p>
        <p>TPR = TP / (TP + FN) (15)</p>
        <p>FPR = FP / (FP + TN) (16)</p>
        <p>AUC represents the degree of class separability obtained by varying the decision threshold. It tells us
to what extent the model can distinguish between classes: the higher the AUC, the better the model
predicts each class correctly. In our case, this is how well the model distinguishes between outliers and
normal data points. An excellent model has an AUC near 1, meaning it has a good measure of
separability. A poor model has an AUC near 0, meaning it predicts the negative class as positive and
the positive class as negative. An AUC of 0.5 means the model has no class separation capacity.</p>
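        <p>AUC can equivalently be computed as the probability that a randomly chosen outlier is scored above a randomly chosen normal point, with ties counting one half; a minimal sketch (the function name is ours):</p>

```python
import numpy as np

def auc_score(labels, scores):
    """AUC via pairwise comparison: the fraction of (outlier, normal)
    pairs in which the outlier receives the higher score, with ties
    counted as one half. Equivalent to the area under the ROC curve."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]   # outliers
    neg = scores[labels == 0]   # normal points
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))

print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))   # -> 0.75
```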
      </sec>
      <sec id="sec-4-3">
        <title>4.3. State-of-the-art comparison setup</title>
        <p>We compare the performance of Dip-ECOD with ECOD [8] and seven other state-of-the-art outlier
detection algorithms. This variety of algorithms makes the state-of-the-art comparison more robust.
The seven other outlier detection algorithms are Clustering-Based Local Outlier Factor (CBLOF) [36],
Histogram-based Outlier Score (HBOS) [37], Isolation Forest (IForest) [38], k Nearest Neighbors (KNN)
[39], Lightweight On-line Detector of Anomalies (LODA) [40], Local Outlier Factor (LOF) [41], and
PCA-based outlier detection (PCA) [42].</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>To evaluate the performance of Dip-ECOD, we conduct experiments on a variety of synthetic and
real-world datasets, comparing Dip-ECOD’s results with those of the state-of-the-art algorithms. AUC
is used to measure the overall performance of each algorithm.</p>
      <sec id="sec-5-1">
        <title>5.1. Synthetic dataset evaluations</title>
        <p>First, to demonstrate Dip-ECOD is suitable for multi-dimensional multimodal datasets, we tested on
the synthetic datasets described in Section 4.1.1. As illustrated in Figure 4, the Gaussian lies along the
diagonal, i.e. the two dimensions are independent of each other. Three experiments are performed:
varying the means of the moving Gaussian, varying the threshold α of the Dip Test, and varying the
number of modes.
The first experiment illustrates the main contribution of the proposed algorithm: its performance on
multimodal distributions. Changing the mean of the moving Gaussian changes both the modality and
how pronounced that modality is; the larger the mean, the more pronounced the multimodality. The
AUC for varying means of the moving Gaussian is displayed in Figure 5. It
can be seen that Dip-ECOD’s AUC is significantly better than that of ECOD.</p>
        <p>However, when the mean of the moving Gaussian is less than 3, i.e. the two Gaussians are closer
together, Dip-ECOD’s performance is similar to, though slightly lower than, ECOD’s. This is likely due to
the distribution approaching unimodality, where ECOD performs slightly better. This result demonstrates
two things: first, Dip-ECOD performs well on multimodal data; and second, ECOD indeed works well, as
expected, on unimodal data.</p>
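        <p>The between-mode effect can be reproduced with a toy univariate example. In the sketch below,
ecdf_tail_score is our simplified, illustrative stand-in for an ECOD-style tail score, and the mode split is
hard-coded at the valley rather than found by SkinnyDip:
```python
import numpy as np

rng = np.random.default_rng(1)
# Bimodal sample: two unit-variance Gaussians with means 0 and 8
x = np.concatenate([rng.normal(0, 1, 1000), rng.normal(8, 1, 1000)])
outlier = 4.0  # lies between the modes

def ecdf_tail_score(sample, v):
    """Simplified ECOD-style score: negative log of the smaller ECDF tail."""
    left = np.mean(sample <= v)   # empirical P(X <= v)
    right = np.mean(sample >= v)  # empirical P(X >= v)
    return -np.log(max(min(left, right), 1e-12))

# Against the global ECDF the point looks central and scores low
global_score = ecdf_tail_score(x, outlier)

# Against its nearest mode alone it sits in the far tail and scores high
per_mode_score = ecdf_tail_score(x[x < 4], outlier)

print(f"global: {global_score:.2f}, per-mode: {per_mode_score:.2f}")
```
        </p>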
        <sec id="sec-5-1-1">
          <title>5.1.2. Varying α of the Dip Test</title>
          <p>As Section 3.1 discussed, the threshold α determines whether the p-value is low enough to reject
the null hypothesis, i.e. that the dataset is unimodal. An optimal α balances Type I and Type II errors;
we use α = 0.001 as the default setting for our remaining experiments. Note that the larger the sample
size, the smaller the p-value. This is because the larger the data size, the more accurate the statistical
test, and therefore the stronger the evidence rejecting or supporting the null hypothesis [33].</p>
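          <p>This size-dependence of p-values [33] is generic to significance tests, not specific to the dip test. As
a hedged illustration using a two-sample t-test (SciPy assumed available): the same small deviation from
the null yields a far smaller p-value once n is large:
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
for n in (100, 10_000):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.1, 1.0, n)  # a fixed, tiny true difference
    p = stats.ttest_ind(a, b).pvalue
    print(f"n = {n:>6}: p = {p:.2e}")
```
          </p>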
        </sec>
        <sec id="sec-5-1-2">
          <title>5.1.3. Varying the number of components</title>
          <p>This third evaluation focuses on the impact of varying the number of components, or modes. We
created additional synthetic datasets with distributions beyond the bimodal dataset, still allocating
1,000 data points to each Gaussian. We have one fixed standard Gaussian, with μ = 0 and σ² = 1, plus
multiple further Gaussians, each also with σ² = 1 and with their means being multiples of 1 to 10. Using
α = 0.001, the number of components from 2 to 10 is explored.</p>
          <p>Figure 6 shows there is no significant change in AUC as the number of components varies. We
can also see that our model outperforms ECOD in multimodal cases. In summary, the number of
components does not have a significant effect on the performance of our model, which further confirms
that Dip-ECOD outperforms ECOD on multimodal distributions.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Real-world benchmark dataset evaluations</title>
        <p>To demonstrate the generalisation potential of our approach to real-world cases, we evaluate Dip-ECOD
on the previously discussed nine selected publicly available benchmark datasets against eight other
state-of-the-art methods. The datasets are of varying modality types to test the performance of the
Dip-ECOD model in both multimodal and unimodal scenarios.</p>
        <p>Per Table 2, results show Dip-ECOD consistently outperforms all other models, including ECOD.
Overall, Dip-ECOD achieves an average AUC score of 0.791, compared to ECOD’s 0.761 and the lowest
average AUC of 0.617 from LOF. This demonstrates Dip-ECOD performs better than the other models at
separating outliers from normal samples. Moreover, Dip-ECOD reaches a higher AUC than ECOD on
all the multimodal Gaussian datasets and matches ECOD on all the strongly unimodal datasets, including
the real-world datasets that mix unimodal and multimodal features. In addition, Dip-ECOD generalises
better than ECOD in multimodal Gaussian mixture cases while maintaining strong performance on the
unimodal datasets. Although the improvement in AUC is only 0.03, on large datasets, such as one with
100,000 samples, this can translate into a dramatic reduction in the false negative rate.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Proprietary enterprise dataset evaluations</title>
        <p>In this last experiment, we evaluate our approach using the enterprise dataset described in Section 4.1.3
by applying Dip-ECOD to identify potentially unusual and anomalous GitHub commits. Five features
are extracted from the GitHub commits, per Table 3: the name of the author, the commit date, the
number of additions and deletions, and the number of changed files. Implicitly, these are key contextual
features that may contain anomalous signals. We analyse all the commits from the most active author
in the selected repository, numbering 738 commits over five years.</p>
        <p>Feature engineering transforms each commit date’s year, month, and day into individual features.
Days are converted to integers between 1 and 7, where 1 represents Monday, 2 represents Tuesday, and
so on, with the same for month integers running from 1 to 12. The 24-hour time of the commit date is
represented as a float, with the number of additions, deletions, and changed files remaining as integers.
To illustrate, a commit made at 3:11 am on Thursday, August 3, 2017 with 29 additions and 1 deletion
across 1 file is converted to the feature vector [3.11, 4, 8, 2017, 29, 1, 1].</p>
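        <p>As a sketch of this encoding (the function name commit_features and the exact packing of the time
as hour.minute are our illustrative assumptions):
```python
from datetime import datetime

def commit_features(when, additions, deletions, changed_files):
    """Encode a commit as [time, weekday, month, year, additions, deletions, files].

    The 24-hour time is packed as hour.minute (e.g. 3:11 am -> 3.11) and
    weekdays run 1 (Monday) to 7 (Sunday), as described in the text.
    """
    time_float = float(f"{when.hour}.{when.minute:02d}")
    return [time_float, when.isoweekday(), when.month, when.year,
            additions, deletions, changed_files]

# Thursday 3 August 2017, 3:11 am, 29 additions, 1 deletion, 1 changed file
print(commit_features(datetime(2017, 8, 3, 3, 11), 29, 1, 1))
# [3.11, 4, 8, 2017, 29, 1, 1]
```
        </p>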
        <p>Dip-ECOD identifies potentially unusual and suspicious commits from the enterprise data. We set
the top 1% as anomalies in this case. Therefore there are 7 out of 743 commits identified as anomalous.
The results are reviewed by an expert engineer and 3 out of 7 are confirmed as true positives that were never
detected by the company. For instance, the commit made on Thursday, August 3, 2017, at 3:11 am
with 29 additions, 1 deletion, and 1 changed file is identified as an outlier with the highest outlier score
of 10.94, which indeed seems unusual given the time is in the middle of the night and does not fit the
normal working patterns of the developer, i.e. 9.00 a.m. to 5.00 p.m. This demonstrates the potential of
our model for identifying unusual events in information security across a range of unlabelled real-world
datasets with multimodal distributions.</p>
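        <p>The top-1% cut-off can be implemented as a simple percentile threshold over the outlier scores; this
sketch uses synthetic scores in place of the proprietary data:
```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=743)  # one outlier score per commit
scores[:3] += 10               # a few clearly anomalous commits

# Flag the top 1% of scores as anomalies, as in the text
threshold = np.percentile(scores, 99)
anomalies = np.flatnonzero(scores >= threshold)
print(len(anomalies))          # roughly 1% of 743 commits
```
        </p>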
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and future work</title>
      <p>A novel anomaly detection algorithm, Dip-ECOD, has been presented, offering superior outlier detection
performance on multimodal datasets without loss of performance in the unimodal case. Results show
Dip-ECOD outperforms eight other state-of-the-art techniques on five multimodal Gaussian mixture
datasets and four unimodal datasets, achieving an average AUC of 0.791, compared with 0.761 achieved
by the next best, ECOD, and surpassing the other seven techniques. Dip-ECOD has a set of unique
properties that make it interpretable while not requiring distance calculations, making it faster to
compute. Dip-ECOD exploits the Dip Test to identify modal and anti-modal intervals, allowing a
multimodal Gaussian mixture dataset to be modelled as separate unimodal Gaussians so that outliers
hidden between modes can be identified.</p>
      <p>Like other anomaly detection techniques, Dip-ECOD has limitations: the dip test and ECOD both
treat each dimension independently as univariate data, meaning cross-dimensional information may
occasionally be lost, and Dip-ECOD is generally restricted to continuous data. Future work could
consider including the likelihood of each data point belonging to a given mode to better separate
multimodal data. Lastly, the dependencies between dimensions and the computational complexity of
our model may be worth investigating.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This project has received funding from eBay Ltd., and a patent application has been filed in the
US. We are grateful to our colleagues Stuart Millar, Jake Sloan, Marc Patterson, Alok Lal, and Xin Hong
for their insightful comments on the early draft of this work.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>
[5] W.-K. Wong, A. W. Moore, G. F. Cooper, M. M. Wagner, Bayesian network anomaly pattern
detection for disease outbreaks, in: Proceedings of the 20th International Conference on Machine
Learning (ICML-03), 2003, pp. 808–815.
[6] Y. Zhao, X. Ding, J. Yang, H. Bai, Suod: toward scalable unsupervised outlier detection, arXiv
preprint arXiv:2002.03222 (2020).
[7] Y. Zhao, G. H. Chen, Z. Jia, Tod: Gpu-accelerated outlier detection via tensor operations, arXiv
preprint arXiv:2110.14007 (2021).
[8] Z. Li, Y. Zhao, X. Hu, N. Botta, C. Ionescu, G. Chen, Ecod: Unsupervised outlier detection
using empirical cumulative distribution functions, IEEE Transactions on Knowledge and Data
Engineering (2022).
[9] M. Pavlidou, G. Zioutas, Kernel density outlier detector, in: Topics in Nonparametric Statistics:
Proceedings of the First Conference of the International Society for Nonparametric Statistics,
Springer, 2014, pp. 241–250.
[10] X. Yang, L. J. Latecki, D. Pokrajac, Outlier detection with globally optimal exemplar-based gmm, in:</p>
      <p>Proceedings of the 2009 SIAM international conference on data mining, SIAM, 2009, pp. 145–154.
[11] M. H. Satman, A new algorithm for detecting outliers in linear regression, International Journal
of statistics and Probability 2 (2013) 101.
[12] H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek, Loop: local outlier probabilities, in: Proceedings of
the 18th ACM conference on Information and knowledge management, 2009, pp. 1649–1652.
[13] K. Highnam, K. Arulkumaran, Z. Hanif, N. R. Jennings, Beth dataset: Real cybersecurity data for
unsupervised anomaly detection research, in: CEUR Workshop Proc, volume 3095, 2021, pp. 1–12.
[14] Y.-J. Kang, Y. Noh, Development of Hartigan’s dip statistic with bimodality coefficient to assess
multimodality of distributions, Mathematical Problems in Engineering 2019 (2019) 1–17.
[15] C. Zhang, B. E. Mapes, B. J. Soden, Bimodality in tropical water vapour, Quarterly Journal of the
Royal Meteorological Society: A journal of the atmospheric sciences, applied meteorology and
physical oceanography 129 (2003) 2847–2866.
[16] P. Hartigan, Algorithm AS 217: Computation of the dip statistic to test for unimodality, Journal of
the Royal Statistical Society. Series C (Applied Statistics) 34 (1985) 320–325.
[17] S. Maurus, C. Plant, Skinny-dip: clustering in a sea of noise, in: Proceedings of the 22nd ACM</p>
      <p>SIGKDD international conference on Knowledge discovery and data mining, 2016, pp. 1055–1064.
[18] Multimodal distribution - wikipedia, https://en.wikipedia.org/wiki/Multimodal_distribution, 2023.</p>
      <p>Accessed: 2023-09-11.
[19] Empirical distribution function, https://en.wikipedia.org/wiki/Empirical_distribution_function,
2023. Accessed: 2023-09-11.
[20] Outlier detection dataset, http://odds.cs.stonybrook.edu/, 2023. Accessed: 2023-09-11.
[21] N. Saccente, J. Dehlinger, L. Deng, S. Chakraborty, Y. Xiong, Project achilles: A prototype tool for
static method-level vulnerability detection of java source code using a recurrent neural network,
in: 2019 34th IEEE/ACM International Conference on Automated Software Engineering Workshop
(ASEW), IEEE, 2019, pp. 114–121.
[22] T. Hoang, H. J. Kang, D. Lo, J. Lawall, Cc2vec: Distributed representations of code changes, in:
Proceedings of the ACM/IEEE 42nd international conference on software engineering, 2020, pp.
518–529.
[23] T. H. M. Le, D. Hin, R. Croft, M. A. Babar, Deepcva: Automated commit-level vulnerability
assessment with deep multi-task learning, in: 2021 36th IEEE/ACM International Conference on
Automated Software Engineering (ASE), IEEE, 2021, pp. 717–729.
[24] H. Hanif, S. Maffeis, Vulberta: Simplified source code pre-training for vulnerability detection, in:
2022 International joint conference on neural networks (IJCNN), IEEE, 2022, pp. 1–8.
[25] Y. Zhou, A. Sharma, Automated identification of security issues from commit messages and bug
reports, in: Proceedings of the 2017 11th joint meeting on foundations of software engineering,
2017, pp. 914–919.
[26] D. Gonzalez, T. Zimmermann, P. Godefroid, M. Schäfer, Anomalicious: Automated detection of
anomalous and potentially malicious commits on github, in: 2021 IEEE/ACM 43rd International
Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), IEEE, 2021,
pp. 258–267.
[27] G. Bhandari, A. Naseer, L. Moonen, Cvefixes: automated collection of vulnerabilities and their fixes
from open-source software, in: Proceedings of the 17th International Conference on Predictive
Models and Data Analytics in Software Engineering, 2021, pp. 30–39.
[28] F. Lomio, E. Iannone, A. De Lucia, F. Palomba, V. Lenarduzzi, Just-in-time software vulnerability
detection: Are we there yet?, Journal of Systems and Software 188 (2022) 111283.
[29] Software assurance reference dataset, https://samate.nist.gov/SARD/testsuite.php, 2022. Accessed:
2022-04-03.
[30] A. Lazarevic, V. Kumar, Feature bagging for outlier detection, in: Proceedings of the eleventh ACM</p>
      <p>SIGKDD international conference on Knowledge discovery in data mining, 2005, pp. 157–166.
[31] D. Pokrajac, A. Lazarevic, L. J. Latecki, Incremental local outlier detection for data streams, in:
2007 IEEE symposium on computational intelligence and data mining, IEEE, 2007, pp. 504–515.
[32] P. Hartigan, Computation of the dip statistic to test for unimodality: Algorithm AS 217, Applied
      <p>Statistics 34 (1985) 320–5.
[33] E. Gómez-de Mariscal, V. Guerrero, A. Sneider, H. Jayatilaka, J. M. Phillip, D. Wirtz, A. Muñoz-Barrutia,
Use of the p-values as a size-dependent function to address practical differences when analyzing
large datasets, Scientific reports 11 (2021) 20942.
[34] M. Lin, H. Lucas, G. Shmueli, Too big to fail: large samples and the p-value problem, Information
Systems Research (2013).
[35] K. I. Park, Fundamentals of probability and stochastic processes with applications to communications,
Springer, 2018.
[36] Z. He, X. Xu, S. Deng, Discovering cluster-based local outliers, Pattern recognition letters 24
(2003) 1641–1650.
[37] M. Goldstein, A. Dengel, Histogram-based outlier score (hbos): A fast unsupervised anomaly
detection algorithm, KI-2012: poster and demo track 1 (2012) 59–63.
[38] F. T. Liu, K. M. Ting, Z.-H. Zhou, Isolation forest, in: 2008 Eighth IEEE International Conference on
Data Mining, IEEE, 2008, pp. 413–422.
[39] G. Guo, H. Wang, D. Bell, Y. Bi, K. Greer, KNN model-based approach in classification, in: OTM
Confederated International Conferences “On the Move to Meaningful Internet Systems”, Sicily, Italy,
2003, pp. 986–996.
[40] T. Pevný, Loda: Lightweight on-line detector of anomalies, Machine Learning 102 (2016) 275–304.
[41] M. M. Breunig, H.-P. Kriegel, R. T. Ng, J. Sander, Lof: identifying density-based local outliers, in:
Proceedings of the 2000 ACM SIGMOD international conference on Management of data, 2000, pp.
93–104.
[42] M.-L. Shyu, S.-C. Chen, K. Sarinnapakorn, L. Chang, A novel anomaly detection scheme based on
principal component classifier, in: Proceedings of the IEEE foundations and new directions of
data mining workshop, IEEE Press, 2003, pp. 172–179.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Crookes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Component-based feature saliency for clustering</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>33</volume>
          (
          <year>2019</year>
          )
          <fpage>882</fpage>
          -
          <lpage>896</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Buford</surname>
          </string-name>
          ,
          <article-title>Anomaly detection of command shell sessions based on distilbert: Unsupervised and supervised approaches</article-title>
          ,
          <source>arXiv preprint arXiv:2310.13247</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Harang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Rudd</surname>
          </string-name>
          ,
          <article-title>Sorel-20m: A large scale benchmark dataset for malicious pe detection</article-title>
          ,
          <source>arXiv preprint arXiv:2012.07634</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Nadim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. M.</given-names>
            <surname>Sayem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mutsuddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <article-title>Analysis of machine learning techniques for credit card fraud detection</article-title>
          ,
          <source>in: 2019 International Conference on Machine Learning and Data Engineering (iCMLDE)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>42</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>