Anomaly Detection on DNS Traffic using Big Data and Machine Learning

Kelvin Soh Boon Kai, University of Glasgow, Singapore (2355381s@student.gla.ac.uk)
Eugene Chong, Singtel, Singapore (eugene.chong@singtel.com)
Vivek Balachandran, Singapore Institute of Technology, Singapore (vivek.b@SingaporeTech.edu.sg)

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract—In this paper, we devise and build an anomaly detection model for detecting general DNS anomalies as an unsupervised learning problem, using multi-enterprise network traffic data collected from various organizations (a NetFlow dataset) without attack labels. In our approach, two types of clustering algorithms are implemented and their detection rates evaluated. The clustering algorithms K-means and Gaussian Mixture Model (GMM) are investigated due to their popularity as state-of-the-art techniques for detecting anomalies with low false negatives [1]. In addition, another unsupervised neural network algorithm (SOM) is used to visualize whether any potential clusters can be found in the dataset. DNS anomalies are simulated to evaluate the robustness of the final detection model, and K-means and GMM are compared by assessing their detection rates against the simulated anomalies. The final GMM model achieved a seemingly high detection rate on the simulated anomalies.

1. INTRODUCTION

With the abundance of devices and tools on the market, collecting data has never been easier. Corpora of network traffic data can be generated in millions or billions of records in seconds [2]. In this paper, the focus is on such big "network traffic" data. By adopting and leveraging current tools and off-the-shelf state-of-the-art algorithms, the objective is to achieve cyber situational awareness through data exploration and modelling, and to devise and build a detection model for network traffic anomalies with Big Data.

Cybersecurity is a major concern for companies and organizations that rely on technology to keep their business running. In any monetarily intensive organization (e.g. stocks and banking), a single glitch or vulnerability in the system could cost millions or even billions. With the rise of new technologies and tools, exploiting the vulnerabilities of such systems can be achieved easily with adequate resources. There comes a point where new techniques and technologies are required to detect unusual and suspicious activities within the network, so that anomalous behavior can be promptly detected and mitigated.

Big Data is prevalent and growing exponentially every year, owing to the increasing availability of technology for generating such data, e.g. IoT devices. We are interested in applying Big Data analytics to enormous network traffic datasets, with the goal of generating insights and knowledge to infer and detect unusual and suspicious network behavior [3]. The remainder of this paper is structured as follows: 2. Related Work, 3. Design, 4. Analysis, 5. Implementation, 6. Evaluation and 7. Conclusion.

2. RELATED WORK

Past research has shown that the traditional approach to anomaly detection uses a network Intrusion Detection System (IDS) to detect anomalies using known signatures. An IDS is known to be ineffective at detecting non-deterministic or unknown traffic: since it can only detect patterns or attacks from signatures, new and unknown traffic might not yet be captured by it. Detecting anomalies in real-time traffic has proven infeasible due to the non-deterministic nature of the varying traffic patterns in any enterprise network. Hence, alternative detection techniques should be implemented to handle the unknown nature of non-deterministic traffic patterns, such as zero-day attacks [4].

With the rise of Big Data, Machine Learning (ML), Deep Learning (DL) and Artificial Intelligence (AI), strategies involving these techniques are evolving rapidly to help overcome the limitations of traditional IDS-based detection, and these state-of-the-art techniques have been shown to perform better than the traditional IDS approach.

The following subsections A, B, C and D discuss the types of detection techniques used to perform anomaly detection [5].

A. Anomaly Detection

Anomaly detection is the process of detecting rare or unusual patterns that deviate from normal behavior. Unusual patterns are also known as "outliers" or "anomalies" [4], as shown in Fig.1. In the context of applying machine learning to anomaly detection, the focus here is on building an anomaly detection model using ML clustering algorithms.

Fig. 1: The anomalies are the sudden spike.

Currently there exist two types of detection techniques: 1. anomaly detection and 2. misuse detection [4], [5]. This research focuses on anomaly detection, due to the caveats of misuse detection and the absence of attack labels.

B. Misuse Detection

Misuse detection, or signature-based detection, is an intrusion detection technique that builds signatures for different types of known malicious behavior. Misuse detection has been proven to detect malicious behavior with a high detection rate, thanks to known patterns that are built and hard-coded as signatures by security experts. The downside is that it tends to perform badly on unknown and unprecedented patterns such as zero-day attacks [4], [5]. Fig.2 shows the process of misuse detection using a rule-based pattern matching approach.

Fig. 2: Detection is only possible with existing rules and signatures.

C. Establishing a baseline of common patterns

Prior to applying any of the detection techniques described above, one statistical approach to anomaly detection relies on the commonalities found in the data, the assumption being that the data follow a normal distribution. Since anomalies do not occur on a regular basis, the data should be approximately normally distributed, as the majority of DNS traffic should be largely normal compared to the anomalies. By identifying a baseline of common behaviors/patterns, one can determine the common DNS traffic patterns flowing in the network [4], e.g. the average number of packets and bytes in a normal DNS flow. *The terms flows, DNS traffic and DNS flows are used interchangeably in the subsequent sections. In addition, if anomalies are present in the data, they need to be removed; otherwise the detection model would fail to detect future instances of those anomalies, as it would assume they are normal [4].

Once the baseline of common patterns has been established, detection can be deployed using the normal distribution as the baseline. Since anomalies are statistically different from normal DNS traffic, by modelling the normal distribution of the data, anomalies can be detected one, two or three standard deviations away from the mean using some threshold [4].

D. Clustering in Anomaly Detection

Clustering is a common technique used to group similar objects together for cluster analysis. Objects that are similar to each other belong to a cluster of their own, and vice versa for dissimilar objects, as shown in Fig.3. For anomaly detection in DNS traffic, similar DNS traffic patterns should belong to a cluster of their own, under some distance metric such as the Euclidean distance. Following this rule, anomalies can be detected when new, incoming patterns stray away from all of the clusters, or when their distance to any cluster exceeds a certain threshold. Clustering algorithms have shown promise by learning distinctive and complex patterns from data without human intervention [6].

Fig. 3: Showing three blobs of clusters.

3. DESIGN

This section gives a short overview of the tools and framework used for the proposed anomaly detection workflow.

A. Tools and Framework

The following are the computational and hardware specifications for this research, provided by our industrial partner. Large-scale data processing with Big Data analytics is performed on the following big data cluster: 9 nodes with a total of 544 virtual cores and 3.5 TiB of memory. The big data framework Apache Spark is used to process the NetFlow datasets.

B. Proposed Anomaly Detection Workflow

The proposed anomaly detection workflow comprises 1) data collection, 2) data analysis/preprocessing, 3) training using clustering algorithms, 4) deploying the model for detection, 5) evaluation and 6) reiteration.

Fig. 4: Anomaly Detection Workflow.

4. ANALYSIS

This section takes a deeper look into exploring the data. It gives an overview of the NetFlow dataset, and further analysis is performed through exploratory data analysis to determine the characteristics that constitute normal DNS traffic.

A. NetFlow

NetFlow is a traffic monitoring protocol developed by Cisco for collecting network traffic flows from NetFlow-enabled routers. Data collected using NetFlow can be used by network analysts to understand how traffic flows in and out of the network [2].

Fig. 5: NetFlow data are collected from various multi-enterprise networks.

The NetFlow datasets are jointly provided by our industrial partner (for privacy reasons, the industrial partner prefers to remain anonymous). The dataset originally consists of 48 attributes covering one day's worth of network traffic, but it was reduced to 10 columns pre-selected as the most relevant attributes for data analysis of DNS traffic, as shown in Fig.6. We use one of the many NetFlow datasets stored in our database, dated 06/28/2018, which holds approximately 253 million records; it was reduced to 4 million records because the objective of this research is to focus on DNS traffic and detect DNS anomalies, so other services like HTTP, SSH etc. are removed. Each row or record in the dataset is known as a "network flow"; each flow can also be viewed as a transaction between a source and a destination address [2]. A standard flow record F contains the following attributes:

F = (IPsrc, IPdst, Pkts, Bytes, Tstart, Tend, Prot, Portsrc, Portdst, Flags)

Fig. 6: Sample of a DNS flow (the IP addresses are masked for privacy reasons).

B. DNS flow packets count

TABLE I below shows the pkts feature, where 1 and 2 packets are the most common numbers of packets sent per DNS flow compared to the subsequent leading packet counts 3..5; the 4.1 million flows with pkts = 1 form a commonality of 95% of the total DNS flows.

TABLE I: More than 95% of the total DNS flows consist of only pkts 1 & 2.

    pkts    count (UDP)    count (TCP)
    1       4147021        274610
    2       82152          1177
    3       22477          102
    4       11147          95
    5       6753           70

C. Normality of Data

Fig.6 above shows the 10 most relevant NetFlow attributes. The most useful feature for determining the normality and distribution of the data is the number of packets per DNS flow, the pkts feature. As shown in TABLE I, most of the UDP and TCP flows contain only 1 packet per DNS flow, and UDP DNS flows with low packet counts are much more common than larger ones.
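The packets-per-flow commonality behind TABLE I is a simple frequency count. The paper performs this at scale with Apache Spark; as an illustrative sketch only, the same shape can be computed with pandas on a toy stand-in (the column names `pkts` and `prot` are assumed, not taken from the real schema):

```python
import pandas as pd

# Toy stand-in for the DNS flow records; values and column names are
# illustrative, not from the actual NetFlow dataset.
flows = pd.DataFrame({
    "prot": ["UDP", "UDP", "UDP", "TCP", "UDP", "TCP", "UDP"],
    "pkts": [1, 1, 2, 1, 1, 3, 1],
})

# Frequency of packets-per-flow split by protocol (the shape of TABLE I).
table = pd.crosstab(flows["pkts"], flows["prot"])

# Share of flows carrying a single packet -- the commonality baseline.
one_pkt_share = (flows["pkts"] == 1).mean()
```

On the real dataset, the same crosstab yields TABLE I, and the single-packet share gives the 95% baseline figure used throughout the analysis.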
The following two figures, Fig.7 and Fig.8, show the typical distribution of the bytes feature in the NetFlow dataset. There are two peaks in the distribution, one at 81 bytes and another at 1508 bytes (Fig.7, red arrow), followed by a closer zoom into a single well-known DNS server (e.g. "Google") that exhibits a bimodal (multi-modal) and heavily right-skewed distribution with two peaks (Fig.8). It can be concluded that most DNS flows contain only 1 packet, with an average of around 81 bytes per DNS flow (Fig.7 and Fig.8, red arrow).

Fig. 7: Bimodal distribution of the feature bytes for all DNS servers.

Fig. 8: Bimodal distribution of the feature bytes for one DNS server, e.g. Google.

From TABLE I above, DNS flows using UDP are much more common than DNS flows using TCP. UDP flows are short-lived, and the typical use of DNS is name-to-address translation via a lookup against some DNS server, so the protocol must be fast to avoid latency, congestion and overhead in the network. DNS over TCP is not as common as DNS over UDP, since TCP usually carries more data per flow due to the necessity of a reliable connection (three-way handshake), with additional information stored in a flow [7]. The typical use of DNS over TCP is zone transfer, or sending large data over the network using DNS as a tunneling protocol where reliability is ensured [7], [8]. Nonetheless, DNS flows over either UDP or TCP share the same commonality of only 1 packet per flow.

D. Determine important features

Relevant features of interest should be carefully hand-picked before fitting any ML algorithm, since irrelevant data, anomalies and noise in the data will heavily penalize the quality of the final ML model during actual detection [9], [10]. Given our initial analysis of the most common occurrences of pkts in TABLE I, pkts should be considered one of the most crucial attributes for the detection of anomalies. In our initial observation of the normal baseline of DNS traffic, 95% of the flows contain only 1 packet per flow. This serves as important information for determining whether DNS traffic is anomalous or not.

E. Feature Selection

Selecting the relevant features is an indispensable step before data preprocessing, since the goal is to retain as much information as possible while removing any redundant information that does not contribute to the detection of anomalies.

The feature pkts is one important feature that helps in the detection of anomalies. It is evident from our initial analysis that DNS traffic with packet counts between 1 and 5 makes up more than 95% of the data commonalities (TABLE I), and the probability of a DNS flow containing 2..N packets, where N is the packet count, is at most 2% (Fig.9).

Fig. 9: Cumulative distribution function of DNS packets per flow.

The following Fig.10 shows the cumulative percentage of the average number of unique source and destination ports in a typical DNS flow, where 99% of the total DNS flows contain at most 1 to 2 source or destination ports. Hence, additional information from the features source port (sport) and destination port (dport) will be useful during the detection of anomalies.

Fig. 10: Cumulative percentage of the average number of source and destination ports.

F. Finalize Feature Selection

The following Fig.11 shows the 10 features we originally had in the NetFlow dataset. These features will be used for data preprocessing and for implementing the anomaly detection model.

Fig. 11: Features selected for data preprocessing.

G. Types of Clustering Algorithm

Three clustering algorithms are investigated in this paper, given their suitability for our problem of interest. They are discussed further in Section 5, Implementation. *The list is by no means exhaustive; there could be a clustering algorithm better suited for this research than the following.

• K-means clustering - based on centroid models.
• Self-Organizing Map (SOM) - based on an unsupervised neural network model with competitive learning.
• Gaussian Mixture Model (GMM) - based on distribution and probabilistic models.

Fig. 12: Different types of clustering algorithms.

5. IMPLEMENTATION

This section discusses the features used for implementing the anomaly detection model, the preprocessing stages of the model, and finally the inherent limitations and caveats of the aforementioned three clustering algorithms. Both UDP and TCP DNS flows are used to train the clustering algorithms after data preprocessing.

A. Initial features

The following Fig.13 shows some features, underlined in red, that were used to produce additional features through aggregation to derive a more sensible feature set, since these are nominal/categorical values that the ML algorithms do not understand.

Fig. 13: Features selected for data preprocessing.

B. Preprocessed features (Time Window Aggregation)

The following Fig.14 shows the set of preprocessed features after time window aggregation, using the features source IP (src ip) and datetime (first and last) for aggregation in 1 minute intervals. The reason for aggregating flows in 1 minute intervals is that a flow should not exceed X amount of packets when using DNS as a service [11]. Looking at the first row, the source IP "231.x.x.x" communicated with 133 unique destination IPs, sending 297 packets with a total of 33.6k bytes, from only one unique source port to 294 unique destination ports over protocol 17 (UDP), for a total of 295 flows in 1 minute. On inspection, the source IP "231.x.x.x" turns out to belong to a well-known legitimate DNS server (e.g. "Google"), using the single source port "53" to resolve DNS requests for 133 unique destination IPs in 1 minute, which is perfectly normal for a well-known DNS server.

Fig. 14: Aggregated flow in 1 minute intervals.

C. Feature Scaling

Feature scaling consists of normalization/standardization, which transforms features into the same units despite their different measurement scales. Normalization transforms data into a specific range, bringing values into [0,1]. Our preprocessed features are instead standardized, centered around mean 0 with standard deviation 1, modelling a normal distribution. Standardization is performed using the following equation 1:

z = (x − µ) / σ    (1)

where µ is the mean, σ is the standard deviation and z is the z-score. The rescaled feature values are represented in z as continuous values.

D. Dimensionality Reduction

After feature scaling, projecting the data for visualization is difficult, since projections of data points in a graph are only visually interpretable in at most two or three dimensions; anything beyond three dimensions is difficult for humans to visualize. Dimensionality reduction techniques are used so that our carefully preprocessed features can be represented in at most two or three dimensions for easier exploration and visualization.

The preprocessed features after feature scaling are transformed into six principal components (PC) using Principal Component Analysis (PCA). The following TABLE II shows that by applying PCA to our standardized preprocessed features, 95% of the total variance can be explained using only PC1 and PC2. Using PCA, the standardized preprocessed features are thus compressed into two components for cluster analysis.

TABLE II: Cumulative variance explained using six components.

    Component    Variance (%)    Cumulative (%)
    1            78.53           78.53
    2            17.05           95.58
    3            2.4             97.98
    4            1.1             99.08
    5            0.76            99.84
    6            0.16            100
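The standardization of equation 1 and the PCA compression described above can be sketched with scikit-learn. This is a minimal sketch on synthetic stand-ins for the six aggregated flow features, so the variance split will not match TABLE II:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic stand-in for the six aggregated flow features; the scales
# are illustrative only, not taken from the real dataset.
X = rng.normal(loc=[3.0, 250.0, 2.0, 1.0, 300.0, 17.0],
               scale=[1.0, 60.0, 0.5, 0.3, 90.0, 0.1],
               size=(1000, 6))

# Standardization (equation 1): z = (x - mu) / sigma.
Z = StandardScaler().fit_transform(X)

# Project onto six principal components and inspect the cumulative
# explained variance, as reported in TABLE II.
pca = PCA(n_components=6).fit(Z)
cumvar = np.cumsum(pca.explained_variance_ratio_)
```

On the real preprocessed features, the first two entries of `cumvar` correspond to the 95.58% figure that justifies keeping only PC1 and PC2 for cluster analysis.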
E. Clustering Algorithms

The following three clustering algorithms, 1) K-means, 2) Self-Organizing Maps (SOMs) and 3) Gaussian Mixture Model (GMM), are discussed further below.

1) K-means Clustering: K-means is one of the most popular unsupervised clustering algorithms in ML. It is capable of handling large amounts of data with O(n) complexity, is often used for preliminary cluster analysis, and is highly scalable to Big Data [11]. The K-means algorithm works iteratively; it takes one important parameter called k, the number of clusters you want the algorithm to return. The k cluster centers are also known as centroids [12]. Fig.15 shows K-means initialized with three clusters.

Fig. 15: K-means algorithm initialized with k=3.

The K-means training is performed by specifying k = 2, a maximum of 100 iterations, and a constant seed for reproducible results. The choice k = 2 is purely based on hypothetical assumptions and related work [11], [13]. In the absence of labels, we can only make general assumptions about the dataset; with k = 2, we presume two clusters, where cluster 0 denotes normal traffic and cluster 1 denotes anomalous traffic. Recall from our original analysis of DNS flows that the CDF of DNS packets per flow shows only 1 packet 95% of the time (Fig.9). The following Fig.16 shows the two clusters plotted on PC1 and PC2 for visualization, with an emerging V pattern (TABLE II).

Fig. 16: K-means cluster analysis on PC1 and PC2.

2) Self-Organizing Maps (SOMs): The Self-Organizing Map (SOM) is a type of artificial neural network that performs competitive learning, known as vector quantization, unlike standard neural network architectures that use error correction (e.g. backpropagation via gradient descent, an optimization technique). It is also a clustering technique that maps high-dimensional features into a low-dimensional space for data visualization; it is similar to K-means in that both are clustering algorithms. SOM is also a non-linear dimensionality reduction technique, as opposed to PCA (a linear one), and is suitable for learning complex non-linear patterns. Like many clustering algorithms, SOM does not require a target label. The following Fig.17 shows a map with an X by Y grid, where the nodes of X and Y denote the neurons and x1, x2..xn denote the input vector [14], [15].

Fig. 17: Self-Organizing Maps (SOMs).

SOM can be used for interpreting high-dimensional data in at most 2D or 3D, usually accompanied by a unified distance matrix (U-matrix) that visualizes the SOM as an N x N hexagonal grid for identifying potential neighbours/clusters in a large dataset without any prior k. Fig.18 shows the U-matrix of the Iris dataset exhibiting around two to three clusters in a hexmap.

Fig. 18: Iris dataset trained on SOM with grid size 7x7, where the contrasting cyan-like color in the middle represents the separation between the clusters, and the blueish areas are data points/input vectors that are similar to each other. [16]

SOM is applicable to our problem as an unsupervised learning approach, since we do not know how many hidden clusters k exist in the NetFlow dataset due to the absence of labels. The SOM algorithm was fitted on the preprocessed dataset with a grid size of 200x200. *Grid size is a hyperparameter which can be tuned; with a larger grid size, the separation of clusters becomes more visible. Fig.19 shows that it is hard to determine the number of clusters, due to the diversity of network traffic collected from the various organizations.

Fig. 19: U-Matrix visualized with a hexmap after SOM training. Contrasting high-value colors denote the cluster separators, while adjacent and similar colors represent the similarity of the data points.

3) Gaussian Mixture Model (GMM): The Gaussian Mixture Model belongs to the distribution/probabilistic models, which rely on normally distributed data. In real-world scenarios, many datasets tend to be Gaussian, or normally distributed, once enough data are collected. Using GMM, we assume more than one distribution exists within the data (a multi-modal distribution), which can be modelled using multiple Gaussians with unknown parameters, or latent variables, that need to be estimated using the expectation maximization (EM) algorithm. With GMM, to approximate a multi-modal distribution, a mixture of several sub-Gaussian distributions can be modelled as one overall distribution [17].

Fig. 20: Gaussian Mixture Model of three normal distributions.

In our analysis of the bimodal distributions in Fig.7 and Fig.8, there appeared to be two peaks in the bytes feature; thus GMM is worth considering, since there could presumably be more than one distribution in the underlying NetFlow dataset. In addition, GMM is computationally efficient when modelling large datasets, which is especially vital in the context of Big Data. With more than one feature, modelling the data with a single Gaussian is not feasible, since there could be different latent distributions that need to be estimated [17], [18]. Thus, using GMM, one can approximate which distribution a data point belongs to, by generating data points from a mixture of Gaussians. *The terms Gaussians, components and clusters are used interchangeably in the context of GMM.
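As a minimal illustration of the GMM idea above, a two-component mixture fitted by EM recovers the two modes of a bimodal byte distribution like the one observed in Fig.7. The peak locations 81 and 1508 come from the text; the sample itself is synthetic:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic bytes-per-flow sample with two modes, loosely mirroring the
# peaks around 81 and 1508 bytes reported for the bytes feature.
bytes_per_flow = np.concatenate([
    rng.normal(81, 5, 900),     # bulk of ordinary lookups
    rng.normal(1508, 40, 100),  # larger responses
]).reshape(-1, 1)

# EM estimates the latent means, covariances and mixture weights.
gmm = GaussianMixture(n_components=2, covariance_type="full",
                      max_iter=100, random_state=0).fit(bytes_per_flow)
means = sorted(gmm.means_.ravel())
```

The fitted component means land near the two peaks, and `gmm.weights_` recovers the relative sizes of the two modes.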
6. EVALUATION

Evaluation is conducted on a simulated NetFlow dataset with anomalies injected. This section also discusses the caveats of the proposed clustering algorithms K-means, SOM and GMM, such as finding the parameter k for K-means and the number of Gaussians or components for GMM. Finally, the final proposed clustering algorithm is evaluated on its detection rate against the simulated anomalies.

A. Simulated NetFlow Anomalies

The simulated NetFlow dataset is jointly provided by our industrial partner, using the tool "dns2tcp" to simulate a DNS attack known as data exfiltration. The simulated traffic is used to assess the performance of the detection model: the model should detect these anomalies and place them in the anomalous clusters, separating them from the normal cluster. Data exfiltration (large file transfer/unauthorized copying over DNS) was conducted between 1130hrs and 1430hrs.

Fig. 21: Sudden spike from 1130hrs to 1430hrs.

1) NetFlow Dataset: As in Section 5, Implementation, to determine what constitutes the baseline of a normal DNS flow, we combine an additional four different NetFlow datasets dated months and years apart. Once the baseline has been ascertained from the different NetFlow datasets, such that they exhibit similar behaviors/patterns for the typical DNS flow, we can safely combine them to form an even larger NetFlow dataset. This is to ensure that, for future NetFlow datasets, the normal behavior of a DNS flow should not fluctuate too much and affect normality. Finally, with more data, future unseen data can be approximated more accurately.

2) Combining different NetFlow Datasets: The idea of combining the different NetFlow datasets is to validate our original assumption that the preprocessed dataset conforms to the same normality across NetFlow datasets of different dates, e.g. days, weeks and years apart. The following Fig.22 shows the result of combining four different NetFlow datasets, preprocessed and plotted on two principal components. It is seemingly convincing that the visualization exhibits the same unique V-shaped pattern for the DNS traffic as Fig.16.

Fig. 22: Four different NetFlow datasets combined.

B. Evaluation of Clustering Algorithms

Our combined preprocessed NetFlow dataset is fed into the K-means and GMM algorithms to evaluate the detection rate on the simulated data exfiltration anomalies. The following 1) and 2) discuss the detection rate on the simulated anomalies for both clustering algorithms.

1) Anomalies Detection with K-means: The following Fig.23 shows the data exfiltration anomalies plotted in a two-dimensional scatterplot using the features no_pkts and no_bytes. All the blue points are anomalies; the plot is filtered to show only the data exfiltration points. In our analysis of the K-means clustering results, K-means assigns these anomalies to the normal cluster and fails to detect any data exfiltration anomalies. It is possible that the hypothetical assumption of k = 2, based on the research in [13], is not the optimal choice of k.

Fig. 23: K-means fails to detect even a single anomaly from the simulated dataset.

Finding the optimal k using the elbow method was also attempted; however, K-means still fails to detect even a single anomaly with the optimal k. It is evident that the K-means model is not complex enough to handle the diversity of DNS traffic.

2) Anomalies Detection with GMM: The GMM is trained with two components/clusters, covariance type "full" and a maximum of 100 iterations, arrived at after several rounds of trial and error and hyperparameter tuning.
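A toy version of this two-component GMM setup can be sketched as follows. The packet/byte magnitudes are illustrative stand-ins, not values from the dataset; the only assumption taken from the paper is the configuration (two components, full covariance, 100 iterations) and the idea that exfiltration flows carry far more packets and bytes than normal flows:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Normal aggregated flows: small packet/byte counts per minute.
normal = rng.normal(loc=[3.0, 250.0], scale=[1.0, 60.0], size=(2000, 2))
# Exfiltration-like flows: far larger transfers (illustrative values).
exfil = rng.normal(loc=[300.0, 22000.0], scale=[30.0, 2000.0], size=(50, 2))
X = np.vstack([normal, exfil])

# Two components, covariance "full", max 100 iterations, as in the paper.
gmm = GaussianMixture(n_components=2, covariance_type="full",
                      max_iter=100, random_state=0).fit(X)

# Treat the component with the smaller mixture weight as anomalous.
anom_label = int(np.argmin(gmm.weights_))
detected = float((gmm.predict(X)[-50:] == anom_label).mean())
```

Because the exfiltration-like points sit far from the normal mass, EM assigns them their own low-weight component, which is the mechanism behind the detection results reported next.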
To determine whether the GMM is able to detect anomalies and place them in the anomalous cluster, the simulated data exfiltration traffic is fed into the GMM for clustering. The following Fig.24 shows that the GMM detects the anomalies with a detection rate of 95%, given that the numbers of pkts and bytes for the anomalies are generally higher than for a normal DNS flow; the detection of data exfiltration anomalies thus works well for GMM, with only 5% false negatives (Fig.24, blue points). In the next section, we aim to overcome these false negatives, i.e. mis-clustered anomalies that are very similar to a normal DNS flow.

Fig. 24: GMM detected 95% of anomalies using two Gaussians.

C. Finding GMM model parameters

Up to this point, the parameter k = 2 has been a hypothetical assumption based on the existing literature and research [13]. However, a more suitable k could be chosen via statistical inference, such that anomalies can be better identified. The following 1) Finding no. of Components explores two techniques for selecting the optimal k for GMM.

1) Finding no. of Components: The two techniques discussed here are the Bayesian Information Criterion (BIC) and Akaike's Information Criterion (AIC). Both are statistical model selection criteria for determining whether one model is better than another [19]. The following Fig.25 shows that BIC and AIC are nearly identical across the numbers of components specified, with the BIC curve reaching an asymptote at around five components. Hence, we can safely choose the minimum number of components with the lowest BIC score where the maximum likelihood is achieved, in this case 5 (circled in red), where the BIC line is overlapped by AIC, both attaining the same score.

Fig. 25: BIC and AIC scores.

D. Anomalies Detection after Model Selection

After selecting the optimal number of components, where the lowest BIC and AIC scores indicate that around five components is a suitable trade-off between a better model fit and the extra computation time of adding more Gaussians, the GMM is re-trained on the same data with the number of components set to five and covariance type "full". When using this GMM model to cluster the simulated NetFlow dataset, the first cluster is the normal cluster, since it represents 81% of the total DNS flows, and the subsequent clusters 2..N are assumed to be the anomalous or unknown clusters, where N is the number of clusters/components, in this case five. The simulated data exfiltration anomalies were conducted between 1130hrs and 1430hrs; as shown in Fig.26, the GMM with five components accurately detects all the data exfiltration anomalies under Cluster 4, with the trade-off of having more components representing the different mixture distributions of the combined NetFlow dataset (Fig.22).

Fig. 26: Data exfiltration anomalies are detected under Cluster 4.

The following TABLE III shows the confusion matrix after GMM detection on the simulated NetFlow dataset, filtered to show only the data exfiltration anomalies. The simulated NetFlow dataset contains a total of 141 time-window-aggregated DNS flows of data exfiltration anomalies after preprocessing.

TABLE III: Confusion Matrix: all 141 data exfiltration anomalies were detected by GMM.

                         Normal    Anomalies
    Detected Normal      TP        FP
    Detected Anomalies   FN        TN=141
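The component-selection procedure of subsection C above can be sketched as a BIC sweep over candidate component counts. The data here are synthetic with three true modes (on the paper's combined NetFlow dataset the same sweep favored five components):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
# Synthetic data with three well-separated modes.
X = np.vstack([
    rng.normal((0.0, 0.0), 1.0, (300, 2)),
    rng.normal((10.0, 10.0), 1.0, (300, 2)),
    rng.normal((20.0, 0.0), 1.0, (300, 2)),
])

# Fit GMMs with an increasing number of components and score each by BIC
# (AIC is available analogously via gmm.aic(X)).
bics = []
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          random_state=0).fit(X)
    bics.append(gmm.bic(X))

# Choose the component count with the lowest BIC.
best_k = int(np.argmin(bics)) + 1
```

BIC penalizes extra parameters, so once the mixture matches the true number of modes, adding further components raises the score, which is why the BIC curve in Fig.25 flattens after the optimum.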
Cluster k    Mean µ    Weight
Normal          86      0.68
Cluster 1      587      0.08
Cluster 2      491      0.12
Cluster 3     1762      0.08
Cluster 4    22040      0.02

TABLE IV: The mean and weight of the average number of bytes with five Gaussians.

Fig. 27 shows the KDE of the five Gaussians after GMM clustering. There is some slight overlap between the Normal cluster (blue) and Cluster 3 (purple). By modelling the distribution of the DNS flows with a GMM, data exfiltration anomalies that are similar to normal traffic can still be accurately clustered using probabilistic/soft assignments.

Fig. 27: KDE on a mixture of five Gaussians.

7. CONCLUSION

Anomaly detection remains a challenging problem spanning decades of research [20]. The purpose of this research is to detect general DNS anomalies that are statistically different from normal behavior using Big Data, given that DNS is often used as a covert channel for attackers to perform malicious activities, e.g. data exfiltration. Detecting anomalies in a DNS environment is beneficial and has proved to be crucial in any enterprise network. Using a GMM with two clusters, the detection model is able to detect data exfiltration anomalies with a 95% detection rate. Further to that, SOM has also been used for cluster analysis to determine whether there are any visible clusters in the NetFlow dataset. However, given the nature of the network traffic collected from the multi-enterprise networks of different organizations, no fixed number of clusters can be obtained, due to the diversity of the network traffic and the varying traffic behavior of how different organizations operate. Thus, model estimation techniques using BIC/AIC have been selected to determine the optimal number of components/clusters for the GMM model. Using only five components, the final GMM model achieved a 100% detection rate on the data exfiltration anomalies. However, since only data exfiltration anomalies were simulated, the evaluation of the model is limited to that scenario alone. Hence, it is probable that our model is not able to detect kinds of DNS anomalies other than data exfiltration.

To validate the robustness of our detection model, more DNS anomalies need to be simulated to assess the detection rate/recall of the GMM model; by understanding the patterns of the different kinds of anomalies, further experimentation using statistical analysis and evaluation can be conducted.

8. ACKNOWLEDGMENTS

This is an industrial research project supported by the University of Glasgow and the Singapore Institute of Technology (SIT) with an industrial partner. I want to thank my professor, Dr. Vivek Balachandran of the Singapore Institute of Technology, for sharing his knowledge and guidance throughout. Finally, I would also like to thank my industrial supervisor, Eugene Chong, for giving me the chance to take on this challenging research project.

REFERENCES

[1] M. Facure, "Semi-Supervised Learning for Fraud Detection Part 1," https://lamfo-unb.github.io/2017/05/09/Semi-Supervised-learning-for-fraud-detection-Part-1/, 2017.
[2] M. W. Lucas, "Network Flow Analysis," ch. 1, pp. 9–11.
[3] Y. Cheng, T. T. Nguyen, H. Zeng, and J. Deng, "Big Data Analytics in Cybersecurity," ch. 2, pp. 27–30.
[4] S. Singh and G. Kaur, "Unsupervised anomaly detection in network intrusion detection using clusters," 2007.
[5] C. Zhang, G. Zhang, and S. Sun, "A mixed unsupervised clustering-based intrusion detection model," 2009.
[6] I. Syarif, A. Prugel-Bennett, and G. Wills, "Unsupervised clustering approach for network anomaly detection," 2012.
[7] S. Northcutt and J. Novak, "Network Intrusion Detection," 3rd ed., ch. 6, pp. 103–115.
[8] S. Northcutt and J. Novak, "Network Intrusion Detection," 3rd ed., ch. 6, pp. 113–115.
[9] S. Paul, "Beginner's Guide to Feature Selection in Python," https://www.datacamp.com/community/tutorials/feature-selection-python, 2018.
[10] S. Tripathy and L. Sahoo, "A survey of different methods of clustering for anomaly detection," vol. 6, 2015.
[11] D. S. Terzi, R. Terzi, and S. Sagiroglu, "Big data analytics for network anomaly detection from netflow data," 2017.
[12] A. Trevino, "An Introduction to K-means Clustering," https://www.datascience.com/blog/k-means-clustering, 2016.
[13] G. Munz, S. Li, and G. Carle, "Traffic anomaly detection using k-means clustering," 2007.
[14] A. Ralhan, "Self Organizing Maps," https://towardsdatascience.com/self-organizing-maps-ff5853a118d4, 2018.
[15] D. Miljkovic, "Brief review of self-organizing maps," 2017.
[16] P. Stefanovic and O. Kurasova, "Visual analysis of self-organizing maps," vol. 16, p. 495, 2011.
[17] C. M. Bishop, "Pattern Recognition and Machine Learning," ch. 9, pp. 430–435.
[18] C. M. Bishop, "Pattern Recognition and Machine Learning," ch. 9, pp. 435–439.
[19] C. Xie, J. Chang, and Y. Liu, "Estimating the number of components in Gaussian mixture models adaptively," 2013.
[20] V. Chandola and V. Kumar, "Anomaly Detection: A Survey," 2009.