<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Anomaly Detection on DNS Traffic using Big Data and Machine Learning</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Kelvin</forename><surname>Soh</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Boon</forename><surname>Kai</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Eugene</forename><forename type="middle">Chong</forename><surname>Singtel</surname></persName>
						</author>
						<author role="corresp">
							<persName><forename type="first">Vivek</forename><surname>Balachandran</surname></persName>
							<email>vivek.b@singaporetech.edu.sg</email>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="institution">University of Glasgow</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="institution">Singapore Institute of Technology</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Anomaly Detection on DNS Traffic using Big Data and Machine Learning</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">59E2D94E95486BDA943DC14D427A4A52</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T21:37+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper, we devise and build an anomaly detection model for detecting general DNS anomalies, framed as an unsupervised learning problem, using multi-enterprise network traffic data collected from various organizations (a NetFlow dataset) without attack labels. In our approach, two clustering algorithms are implemented and their detection rates evaluated: K-means and the Gaussian Mixture Model (GMM), both investigated for their popularity as state-of-the-art techniques for detecting anomalies with low false negatives [1]. In addition, an unsupervised neural-network algorithm, the Self-Organizing Map (SOM), is used to visualize whether any potential clusters exist in the dataset. DNS anomalies are simulated to evaluate the robustness of the final detection model, and K-means and GMM are compared by assessing their detection rates against the simulated anomalies. The final GMM model achieved a high detection rate on the simulated anomalies.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>With the abundance of devices and tools on the market, collecting data has never been easier. Corpora of network traffic data can be generated at a rate of millions or billions of records per second <ref type="bibr" target="#b1">[2]</ref>. In this paper, the focus is on this big "network traffic" data. By leveraging current tools and off-the-shelf state-of-the-art algorithms, the objective is to achieve cyber situational awareness through data exploration and modelling, and to devise and build a detection model for detecting network traffic anomalies with Big Data.</p><p>Cybersecurity is a major concern for companies and organizations that rely on technology to keep their business running. In monetary-intensive sectors such as stocks and banking, a single glitch or vulnerability in a system can cost an organization millions or even billions, and with the rise of new technologies and tools, exploiting the vulnerabilities of such systems can be achieved easily with adequate resources. There comes a point where new techniques and technologies are required to detect unusual and suspicious activity within the network, so that anomalous behavior can be promptly detected and mitigated.</p><p>Big Data is growing exponentially every year due to the increasing number of technologies available for generating data, e.g. IoT devices. We are interested in applying Big Data analytics to an enormous network traffic dataset with the goal of generating insights and knowledge to infer and detect unusual and suspicious network behavior <ref type="bibr" target="#b2">[3]</ref>. The remainder of this paper is structured as follows: 2. Related Work, 3. Design, 4. Analysis, 5. Implementation, 6. Evaluation and 7. Conclusion.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">RELATED WORK</head><p>Past research has shown that the traditional approach to anomaly detection uses a network Intrusion Detection System (IDS) to detect anomalies using known signatures. An IDS is ineffective at detecting non-deterministic or unknown traffic: since it can only detect patterns or attacks from signatures, new and unknown traffic may not yet be captured by the IDS. Detecting such traffic in real time has proven infeasible due to the non-deterministic nature of the varying traffic patterns in any enterprise network. Hence, alternative detection techniques should be implemented to handle non-deterministic traffic patterns such as zero-day attacks <ref type="bibr" target="#b3">[4]</ref>.</p><p>Given the rise of Big Data, Machine Learning (ML), Deep Learning (DL) and Artificial Intelligence (AI), strategies involving these techniques are evolving rapidly to help overcome the limitations of the traditional IDS approach, and these state-of-the-art techniques have been shown to perform better than a traditional IDS.</p><p>The following subsections A, B, C and D discuss the types of detection techniques used to perform anomaly detection <ref type="bibr" target="#b4">[5]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Anomaly Detection</head><p>Anomaly detection is the process of detecting rare or unusual patterns that deviate from normal behavior. Such unusual patterns are also known as "outliers" or "anomalies" <ref type="bibr" target="#b3">[4]</ref>, as shown in Fig. <ref type="figure" target="#fig_0">1</ref>. In the context of applying machine learning to anomaly detection, the focus is on building a detection model using ML clustering algorithms. Two types of detection techniques currently exist: 1. anomaly detection and 2. misuse detection <ref type="bibr" target="#b3">[4]</ref>, <ref type="bibr" target="#b4">[5]</ref>. This research focuses on anomaly detection, due to the caveats of misuse detection and the absence of attack labels.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Misuse Detection</head><p>Misuse detection, or signature-based detection, is an intrusion detection technique that builds signatures for different types of known malicious behavior. Misuse detection can detect malicious behavior with a high detection rate thanks to the known patterns that are built and hard-coded as signatures by security experts. The downside is that it tends to perform badly on unknown and unprecedented patterns such as zero-day attacks <ref type="bibr" target="#b3">[4]</ref>, <ref type="bibr" target="#b4">[5]</ref>. Fig. <ref type="figure" target="#fig_1">2</ref> shows the process of misuse detection using a rule-based pattern-matching approach. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Establishing a baseline of common patterns</head><p>Prior to applying any of the detection techniques described above, one statistical technique for anomaly detection relies on the commonalities found in the data.</p><p>One assumption is that the data should follow a normal distribution. Since anomalies do not occur regularly, the data is assumed to be normally distributed, as the majority of DNS traffic should be normal compared to the anomalies. In addition, if anomalies are present in the data, they need to be removed; otherwise the detection model will fail to detect future instances of anomalies, since it will have learned them as normal <ref type="bibr" target="#b3">[4]</ref>.</p><p>Once the baseline of common patterns has been established, detection can be deployed using the normal distribution as the baseline. Since anomalies are statistically different from normal DNS traffic, by modelling the normal distribution of the data, anomalies can be detected one, two or three standard deviations away from the mean using some threshold <ref type="bibr" target="#b3">[4]</ref>.</p></div>
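The baseline-and-threshold idea above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the toy `pkts` values and the 2.5-sigma cutoff are assumptions chosen for the example.

```python
import numpy as np

def zscore_anomalies(values, threshold=2.5):
    """Flag points more than `threshold` standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

# Mostly 1-packet DNS flows with one obvious spike.
pkts = [1, 1, 2, 1, 1, 3, 1, 2, 1, 500]
flags = zscore_anomalies(pkts)  # only the spike is flagged
```

In a deployment, the mean and standard deviation would be estimated from an anomaly-free baseline window rather than from the batch being scored.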
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Clustering in Anomaly Detection</head><p>Clustering is a common technique used to group similar objects together for cluster analysis. Objects that are similar to each other belong to a cluster of their own, and likewise for dissimilar objects, as shown in Fig. <ref type="figure" target="#fig_3">3</ref>. For anomaly detection in DNS traffic, similar DNS traffic patterns should belong to a cluster of their own, under some distance metric such as the Euclidean distance. Following this rule, anomalies can be detected when new, incoming patterns stray away from all of the clusters, or when their distance to every cluster exceeds a certain threshold. Clustering algorithms have shown promise by learning distinctive and complex patterns from data without human intervention <ref type="bibr" target="#b5">[6]</ref>. </p></div>
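The distance-to-cluster rule above can be sketched as follows. This is a hypothetical 2-D feature space with a fixed distance threshold, and scikit-learn's KMeans stands in for whichever clustering backs the deployed model; none of the numbers come from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two dense groups of "normal" flows in a 2-D feature space.
normal = np.vstack([rng.normal(0, 0.5, (100, 2)),
                    rng.normal(5, 0.5, (100, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(normal)

def is_anomaly(points, model, threshold=2.0):
    """Flag points whose distance to the nearest centroid exceeds threshold."""
    d = np.linalg.norm(points[:, None, :] - model.cluster_centers_[None], axis=2)
    return d.min(axis=1) > threshold

new = np.array([[0.2, 0.1],     # near an existing cluster -> normal
                [20.0, 20.0]])  # far from every cluster  -> anomalous
flags = is_anomaly(new, km)
```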
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">DESIGN</head><p>This section gives a short overview of the tools and framework used for the proposed anomaly detection workflow.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Tools and Framework</head><p>The following are the computational and hardware specifications for this research, provided by our industrial partner. Large-scale data processing using Big Data analytics is performed on the following big data cluster.</p><p>The cluster consists of 9 nodes with a total of 544 virtual cores and 3.5 TiB of memory. The big data framework Apache Spark is used to process the NetFlow datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Proposed Anomaly Detection Workflow</head><p>The proposed anomaly detection workflow comprises 1) data collection, 2) data analysis/preprocessing, 3) training using clustering algorithms, 4) deploying the model for detection, 5) evaluation and 6) reiteration. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">ANALYSIS</head><p>This section takes a deeper look at the data. It gives an overview of the NetFlow dataset, then performs exploratory data analysis to determine the characteristics that constitute normal DNS traffic.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. NetFlow</head><p>NetFlow is a traffic monitoring protocol developed by Cisco for collecting network traffic flows from NetFlow-enabled routers. Data collected using NetFlow can be used by network analysts to understand how traffic flows in and out of the network <ref type="bibr" target="#b1">[2]</ref>. The NetFlow datasets are jointly provided by our industrial partner (who, for privacy reasons, prefers to remain anonymous). Each dataset originally consists of 48 attributes covering one day's worth of network traffic, reduced to the 10 columns pre-selected as the most relevant attributes for data analysis of DNS traffic, as shown in Fig. <ref type="figure" target="#fig_6">6</ref>. We use one of the many NetFlow datasets stored in our database, dated 06/28/2018, holding approximately 253 million records, reduced to 4 million records: since the objective of this research is to detect DNS anomalies, flows for other services such as HTTP, SSH etc. are removed. Each row or record in the dataset is known as a "network flow"; each flow can also be viewed as a transaction between the source and destination address <ref type="bibr" target="#b1">[2]</ref>.</p><p>A standard flow record F contains the following attributes. Given TABLE I, DNS flows using UDP are much more common than DNS flows using TCP. UDP flows are short-lived, and the typical use of DNS is name-to-address translation via a lookup against some DNS server, so the protocol must be fast to avoid latency, congestion and overhead in the network. DNS over TCP is less common than over UDP, since TCP usually carries more data per flow due to the need for a reliable connection (three-way handshake), with additional information stored in a flow <ref type="bibr" target="#b6">[7]</ref>. The typical use of DNS over TCP is a zone transfer, or sending large data over the network using DNS as a tunneling protocol where reliability is ensured <ref type="bibr" target="#b6">[7]</ref>, <ref type="bibr" target="#b7">[8]</ref>. Nevertheless, DNS flows over both UDP and TCP share the commonality of only 1 packet per flow.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Determine important features</head><p>Relevant features of interest should be carefully hand-picked before fitting any ML algorithm, since irrelevant data, anomalies and noise in the data will heavily penalize the quality of the final model during detection <ref type="bibr" target="#b8">[9]</ref>, <ref type="bibr" target="#b9">[10]</ref>. Given our initial analysis of the most common values of pkts in TABLE I, pkts should be considered one of the most crucial attributes for detecting anomalies. In our initial observation of the normal baseline of DNS traffic, 95% of the flows contain only 1 packet per flow. This serves as important information for determining whether DNS traffic is anomalous.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Feature Selection</head><p>Selecting the relevant features is an indispensable step before data preprocessing: the goal is to retain as much information as possible while removing redundant information that does not contribute to the detection of anomalies. The feature pkts is one important feature that helps in detecting anomalies. From our initial analysis, DNS traffic with packet counts between 1 and 5 makes up more than 95% of the data (TABLE I), and the probability of higher packet counts, P r (2..N &lt;= 1), is &lt;= 2%, where N is the packet count, Fig. <ref type="figure" target="#fig_8">9</ref>. Fig. <ref type="figure" target="#fig_9">10</ref> shows the cumulative percentage of the average number of unique source and destination ports in a typical DNS flow: 99% of DNS flows contain at most 1 to 2 source or destination ports. Hence, the source port (sport) and destination port (dport) features will also be useful for detecting anomalies. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F. Finalize Feature Selection</head><p>Fig. <ref type="figure" target="#fig_10">11</ref> shows the 10 features we originally had in the NetFlow dataset. These features will be used for data preprocessing and for implementing the anomaly detection model. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G. Types of Clustering Algorithm</head><p>Three clustering algorithms are investigated in this paper, given their suitability for our problem of interest. They are discussed further in Section 5, Implementation. *This list is by no means exhaustive; there could be a clustering algorithm better suited to this research than the following.</p><p>• K-means clustering -Based on centroid models.</p><p>• Self-Organizing Map (SOM) -Based on an unsupervised neural network model with competitive learning.</p><p>• Gaussian Mixture Model (GMM) -Based on distribution and probabilistic models. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">IMPLEMENTATION</head><p>This section discusses the features used to implement the anomaly detection model, the preprocessing stage, and finally the inherent limitations and caveats of the aforementioned three clustering algorithms. Both UDP and TCP DNS flows are used to train the clustering algorithms after data preprocessing.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Initial features</head><p>Fig. <ref type="figure" target="#fig_12">13</ref> shows the initial features, underlined in red, that were aggregated to produce additional, more sensible features, since in their raw form they are nominal/categorical values that the ML algorithms cannot use directly. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Preprocessed features (Time Window Aggregation)</head><p>Fig. <ref type="figure" target="#fig_13">14</ref> shows the set of preprocessed features after time-window aggregation, using the source IP (src ip) and datetime (first and last) features to aggregate flows into 1-minute intervals. The reason for the 1-minute interval is that a flow should not exceed X packets when DNS is used as a service <ref type="bibr" target="#b10">[11]</ref>. Looking at the first row, the source IP "231.x.x.x" communicated with 133 unique destination IPs using 297 packets totalling 33.6k bytes, from a single unique source port to 294 unique destination ports over protocol 17 (UDP), with a total of 295 flows in 1 minute. On inspection, the source IP "231.x.x.x" belongs to a well-known legitimate DNS server (e.g. "Google") and uses the single source port "53" to resolve DNS requests for 133 unique destination IPs per minute, which is perfectly normal for a well-known DNS server. </p></div>
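The time-window aggregation described above can be sketched with pandas (the paper itself uses Apache Spark; a Spark DataFrame version would be structurally similar). The column names (`src_ip`, `first`, etc.) and the toy flow records are assumptions mirroring the paper's attribute names.

```python
import pandas as pd

# Toy flow records; column names mirror the paper's NetFlow attributes.
flows = pd.DataFrame({
    "src_ip": ["231.0.0.1"] * 3 + ["10.0.0.5"],
    "first":  pd.to_datetime(["2018-06-28 11:30:10", "2018-06-28 11:30:40",
                              "2018-06-28 11:31:05", "2018-06-28 11:30:20"]),
    "dst_ip": ["8.8.8.8", "8.8.4.4", "8.8.8.8", "8.8.8.8"],
    "sport":  [53, 53, 53, 40001],
    "dport":  [40123, 40124, 40125, 53],
    "pkts":   [1, 2, 1, 1],
    "bytes":  [120, 240, 130, 90],
})

# Aggregate per source IP over 1-minute windows of the flow start time.
agg = (flows.groupby(["src_ip", pd.Grouper(key="first", freq="1min")])
            .agg(no_dst_ips=("dst_ip", "nunique"),
                 no_sports=("sport", "nunique"),
                 no_dports=("dport", "nunique"),
                 no_pkts=("pkts", "sum"),
                 no_bytes=("bytes", "sum"),
                 no_flows=("pkts", "size"))
            .reset_index())
```

Each output row then summarizes one source IP's behavior in one minute, which is the unit the clustering algorithms operate on.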
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Feature Scaling</head><p>Feature scaling consists of normalization/standardization, which transforms features into the same units despite their different measurement scales. Normalization transforms data into a specific range, bringing values into "[0,1]". Our preprocessed features are standardized, centered around mean 0 with standard deviation 1, modelling a normal distribution. Standardization can be performed using the following equation 1.</p><formula xml:id="formula_0">z = (x − µ) / σ (1)</formula><p>where µ is the mean, σ is the standard deviation and z is the z-score. The rescaled feature values are represented in z as continuous values.</p></div>
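Equation 1 corresponds directly to column-wise standardization, e.g. scikit-learn's StandardScaler. The tiny [pkts, bytes] matrix below is made up for illustration; in the paper the inputs would be the aggregated flow features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy [pkts, bytes] rows; real inputs would be the aggregated flow features.
X = np.array([[1.0, 120.0],
              [2.0, 240.0],
              [3.0, 360.0]])

# z = (x - mean) / std, applied per column.
Z = StandardScaler().fit_transform(X)
```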
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Dimensionality Reduction</head><p>After feature scaling, projecting the data for visualization is difficult: plotted data points are only visually interpretable in at most two or three dimensions, and anything beyond three dimensions is difficult for humans to visualize. Dimensionality reduction techniques are therefore used so that our carefully preprocessed features can be represented in at most two or three dimensions for easier exploration and visualization.</p><p>The preprocessed features were transformed after feature scaling into six principal components (P C) using Principal Component Analysis (PCA). TABLE <ref type="table" target="#tab_1">II</ref> shows that, applying PCA to our standardized preprocessed features, 95% of the total variance can be explained using only P C 1 and P C 2 . Using PCA, the standardized preprocessed features are thus compressed into two components for cluster analysis.  </p></div>
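The PCA step can be sketched as follows. The synthetic data (six features driven by two latent factors) is an assumption standing in for the real standardized flow features; it is constructed so that, as in the paper's TABLE II, the first two components explain most of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic stand-in: six features driven by two latent factors plus small noise.
latent = rng.normal(size=(500, 2))
X = np.hstack([latent,
               latent @ rng.normal(size=(2, 4))
               + rng.normal(scale=0.05, size=(500, 4))])

pca = PCA(n_components=6).fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)  # cumulative explained variance

# Keep only PC1 and PC2 for cluster analysis and plotting.
X2 = PCA(n_components=2).fit_transform(X)
```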
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Clustering Algorithms</head><p>The three clustering algorithms 1) K-means, 2) Self-Organizing Maps (SOMs) and 3) Gaussian Mixture Model (GMM) are discussed below.</p><p>1) K-means Clustering: K-means is one of the most popular unsupervised clustering algorithms in ML. It can handle large amounts of data with O(n) complexity, is often used for preliminary cluster analysis, and is highly scalable with Big Data <ref type="bibr" target="#b10">[11]</ref>. The K-means algorithm works iteratively: it takes one important parameter, k, the number of clusters the algorithm should return, whose representatives are known as the cluster centers or centroids <ref type="bibr" target="#b11">[12]</ref>. Fig. <ref type="figure" target="#fig_5">15</ref> shows the K-means algorithm initialized with k=3, i.e. three clusters.</p><p>K-means is trained with k set to 2, a maximum of 100 iterations, and a constant seed for reproducible results. k = 2 is based purely on hypothetical assumptions and related work <ref type="bibr" target="#b10">[11]</ref>, <ref type="bibr" target="#b12">[13]</ref>. In the absence of labels we can only make general assumptions about the dataset: with k = 2 we presume two clusters, where cluster 0 denotes normal traffic and cluster 1 denotes anomalous traffic. Recall from our original analysis that the CDF of DNS packets per flow shows only 1 packet 95% of the time, Fig. <ref type="figure" target="#fig_8">9</ref>. Fig. <ref type="figure" target="#fig_6">16</ref> shows the K-means cluster analysis: two clusters plotted using P C 1 and P C 2 , with an emerging V pattern, TABLE II.</p><p>2) Self-Organizing Maps (SOMs): A Self-Organizing Map (SOM) is a type of artificial neural network that performs competitive learning, known as vector quantization, unlike standard neural network architectures that use error correction (e.g. backpropagation via gradient descent, an optimization technique). It is also a clustering technique that maps high-dimensional features into low dimensions for data visualization, and in that sense is similar to K-means, since both are clustering algorithms. SOM is additionally a non-linear dimensionality reduction technique, as opposed to PCA (a linear one), making it suitable for learning complex non-linear patterns, and like many clustering algorithms it does not require a target label. Fig. <ref type="figure" target="#fig_14">17</ref> shows a map with an X by Y grid, where the nodes denote the neurons and x 1 , x 2 ..x n denote the input vector <ref type="bibr" target="#b13">[14]</ref>, <ref type="bibr" target="#b14">[15]</ref>. SOM can be used to interpret high-dimensional data in at most 2D or 3D, usually accompanied by a unified distance matrix (U-matrix) visualized as an NxN hexagonal grid for identifying potential neighbours/clusters in a large dataset without any prior k. Fig. <ref type="figure" target="#fig_0">18</ref> shows the U-matrix of the Iris dataset trained on a SOM with grid size 7x7, exhibiting around two to three clusters in a hexmap: the contrasting cyan-like color in the middle represents the separation between the clusters, and the bluish areas are data points/input vectors that are similar to each other <ref type="bibr" target="#b15">[16]</ref>. SOM is applicable to our problem as an unsupervised learning approach, since, due to the absence of labels, we do not know how many hidden clusters exist in the NetFlow dataset. The SOM has been fitted on the preprocessed dataset with a grid size of 200x200 (*grid size is a hyperparameter which can be tuned; with a larger grid, the separation of clusters becomes more visible). The resulting map shows that it is hard to determine the number of clusters, due to the diversity of network traffic collected from the various organizations.</p><p>3) Gaussian Mixture Model (GMM): The Gaussian Mixture Model belongs to the family of distribution/probabilistic models that rely on normally distributed data. In real-world scenarios, many datasets tend to be Gaussian (normally distributed) once enough data is collected. Using GMM, we assume that more than one distribution exists within the data (it is multi-modal), modelled as multiple Gaussians with unknown parameters or latent variables that are estimated using the expectation-maximization (EM) algorithm. With GMM, a multi-modal distribution is approximated by modelling the mixture of several sub-Gaussian distributions as one overall distribution <ref type="bibr" target="#b16">[17]</ref>. During our analysis of the bimodal distribution, Fig. <ref type="figure">7</ref> and Fig. <ref type="figure">8</ref>, there appear to be two peaks in the bytes feature, so GMM is worth considering, as there could presumably be more than one distribution in the underlying NetFlow dataset. In addition, GMM is computationally efficient when modelling large datasets, which is vital in the context of Big Data. With more than one feature, modelling the data with a single Gaussian is not feasible, since different latent distributions may need to be estimated <ref type="bibr" target="#b16">[17]</ref>, <ref type="bibr" target="#b17">[18]</ref>. Thus, using GMM, one can approximate which distribution a data point belongs to, since the model generates data points from a mixture of Gaussians. *(The terms Gaussian, component and cluster are used interchangeably in the context of GMM.)</p></div>
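The two-component GMM setup described above can be sketched with scikit-learn. The synthetic bimodal [pkts, bytes]-like data and the rule that the majority component is "normal" are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Bimodal data: a large low-volume "normal" mode and a small high-volume mode.
X = np.vstack([rng.normal([1, 120], [0.3, 20], (950, 2)),
               rng.normal([50, 5000], [5, 500], (50, 2))])

# Two components, full covariance, EM capped at 100 iterations.
gmm = GaussianMixture(n_components=2, covariance_type="full",
                      max_iter=100, random_state=0).fit(X)
labels = gmm.predict(X)

# Take the majority component as "normal"; the other as anomalous.
normal_label = np.bincount(labels).argmax()
```

`predict_proba` would give the soft assignments mentioned later in the paper, i.e. the probability of each point under each Gaussian.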
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">EVALUATION</head><p>Evaluation is conducted on the simulated NetFlow dataset with injected anomalies. This section also discusses the caveats of the proposed clustering algorithms K-means, SOM and GMM, such as finding the parameter k for K-means and the number of Gaussians or components for GMM. Finally, the final proposed clustering algorithm is evaluated on its detection rate against the simulated anomalies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Simulated NetFlow Anomalies</head><p>The simulated NetFlow dataset is jointly provided by our industrial partner, using the tool "dns2tcp" to simulate a DNS attack known as data exfiltration. The simulated traffic is used to assess the performance of the detection model: the model should detect these anomalies and assign them to the anomalous clusters, separating them from the normal cluster. The data exfiltration (large file transfer/unauthorized copying over DNS) was conducted between 1130hrs and 1430hrs.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Evaluation of Clustering Algorithms</head><p>Our combined preprocessed NetFlow dataset is fed into the K-means and GMM algorithms to evaluate the detection rate on the simulated data exfiltration anomalies. The following 1) and 2) discuss the detection rates of the two clustering algorithms.</p><p>1) Anomaly Detection with K-means: Fig. <ref type="figure" target="#fig_19">23</ref> shows the data exfiltration anomalies plotted in a two-dimensional scatterplot using the features no pkts and no bytes. All the blue points are anomalies; the plot has been filtered to show only the data exfiltration points. In our analysis of the K-means clustering results, K-means assigns these anomalies to the normal cluster and fails to detect any data exfiltration. It is possible that the hypothetical assumption of k = 2, based on related research <ref type="bibr" target="#b12">[13]</ref>, is not the optimal choice of k. Finding the optimal k using the elbow method was also attempted; however, K-means still fails to detect even a single anomaly with the optimal k. It is evident that K-means is not complex enough to handle the diversity of DNS traffic.</p><p>2) Anomaly Detection with GMM: The GMM is trained with two components/clusters using covariance type "Full" and a maximum of 100 iterations, arrived at through trial and error and hyperparameter tuning.</p><p>To determine whether the GMM can detect anomalies and assign them to the anomalous cluster, the simulated data exfiltration traffic is fed into the GMM for clustering. Fig. <ref type="figure" target="#fig_4">24</ref> shows that the GMM detects the anomalies with a detection rate of 95%: the numbers of pkts and bytes for the anomalies are generally higher than for a normal DNS flow, so the detection of data exfiltration works well for GMM, with only 5% false negatives (Fig. <ref type="figure" target="#fig_4">24</ref>, blue points). In the next section, we aim to recover these false negatives, i.e. mis-clustered anomalies that are very similar to a normal DNS flow.</p></div>
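The detection-rate figure used in this section is simply the fraction of known attack flows assigned to a non-normal cluster. The helper below and its toy numbers are illustrative; the function and variable names are not from the paper.

```python
import numpy as np

def detection_rate(pred_labels, is_attack, normal_label=0):
    """Fraction of known attack flows assigned to a non-normal cluster."""
    pred_labels = np.asarray(pred_labels)
    is_attack = np.asarray(is_attack, dtype=bool)
    return (pred_labels[is_attack] != normal_label).mean()

# 20 simulated exfiltration flows; 19 land in the anomalous cluster (1),
# one is mis-clustered as normal (0) -> detection rate 0.95.
pred = np.array([1] * 19 + [0])
rate = detection_rate(pred, np.ones(20, dtype=bool))
```

The mis-clustered flows (the complement, here 5%) are the false negatives discussed above.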
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Finding GMM model parameters</head><p>Up to this point, the parameter k = 2 has been a hypothetical assumption based on existing literature and research <ref type="bibr" target="#b12">[13]</ref>. However, a more suitable k could be chosen via statistical inference, such that anomalies can be better identified. The following 1) explores two techniques for selecting the optimal k for GMM.</p><p>1) Finding the number of components: The two techniques discussed here are the Bayesian Information Criterion (BIC) and Akaike's Information Criterion (AIC). Both are statistical model selection criteria for determining whether one model is better than another <ref type="bibr" target="#b18">[19]</ref>.</p><p>Fig. <ref type="figure" target="#fig_20">25</ref> shows that BIC and AIC behave identically with respect to the number of components, with the BIC score reaching an asymptote at around five components or more. Hence, we can safely choose the minimum number of components with the lowest BIC score, in this case 5 (circled in red), where the BIC line is overlapped by AIC, both attaining the same score. </p></div>
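The BIC/AIC sweep described above can be sketched as follows. The synthetic three-mode data is an assumption (the real sweep runs over the preprocessed NetFlow features and settles on five components); the point is only to show the selection mechanics.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Synthetic data with three well-separated Gaussian modes.
X = np.vstack([rng.normal(m, 0.3, (200, 2))
               for m in ([0, 0], [5, 5], [10, 0])])

bics, aics = [], []
for k in range(1, 8):
    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          random_state=0).fit(X)
    bics.append(gmm.bic(X))  # lower is better; penalizes extra parameters
    aics.append(gmm.aic(X))

best_k = int(np.argmin(bics)) + 1  # component count with the lowest BIC
```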
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Anomalies Detection after Model Selection</head><p>After model selection, the lowest BIC and AIC scores tell us that around five components is a suitable trade-off between a better model fit and the extra computation time of adding additional Gaussians or components. The GMM has therefore been re-trained on the same data with the number of components set to five and covariance type "Full". When this GMM model is used to cluster the simulated NetFlow dataset, the first cluster is taken as the normal cluster, since it represents 81% of the total DNS flows, and the subsequent clusters 2..N, where N is the number of clusters/components (here five), are assumed to be anomalous or unknown clusters. The simulated data exfiltration anomalies were conducted between 1130hrs and 1430hrs; as shown in Fig. <ref type="figure" target="#fig_21">26</ref>, the GMM with five components accurately detects all data exfiltration anomalies under Cluster 4, with the trade-off of having more components representing the different mixture distributions of the combined NetFlow dataset, Fig. <ref type="figure" target="#fig_18">22</ref>. Fig. <ref type="figure" target="#fig_22">27</ref> shows the KDE of the five Gaussians after GMM clustering. There is some slight overlap between Normal (blue) and Cluster 3 (purple). By modelling the distribution of DNS flows with GMM, data exfiltration anomalies that are similar to normal traffic can be accurately clustered using probabilities/soft assignments. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">CONCLUSION</head><p>Anomaly detection remains a challenging problem spanning decades of research <ref type="bibr" target="#b19">[20]</ref>. The purpose of this research is to detect general DNS anomalies that are statistically different from normal behavior using Big Data, given that DNS is often used as a covert channel for attackers to perform malicious activities, e.g. data exfiltration. Detecting anomalies in the DNS environment has proved crucial in any enterprise network. Using a GMM with two clusters, the detection model detects data exfiltration anomalies with a 95% detection rate. In addition, SOM was used for cluster analysis to determine whether there are any visible clusters in the NetFlow dataset. However, given that the traffic was collected from the multi-enterprise networks of different organizations, no fixed number of clusters could be obtained, owing to the diversity of the network traffic and the varying ways in which different organizations operate. Thus, model estimation techniques using BIC/AIC were applied to determine the optimal number of components/clusters for the GMM. Using only five components, the final GMM model achieved a 100% detection rate on the data exfiltration anomalies. However, since only data exfiltration anomalies were simulated, the evaluation of the model is limited to that attack class; it is probable that our model cannot detect other kinds of DNS anomalies.</p><p>To validate the robustness of our detection model, more kinds of DNS anomalies need to be simulated to assess the detection rate/recall of the GMM model; by understanding the patterns of the different kinds of anomalies, further experimentation using statistical analysis and evaluation can be conducted.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 :</head><label>1</label><figDesc>Fig. 
1: The anomalies appear as a sudden spike.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 :</head><label>2</label><figDesc>Fig. 2: Detection is only possible with existing rules and signatures.</figDesc><graphic coords="2,84.22,419.74,170.86,102.94" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head></head><label></label><figDesc>By identifying the baseline of common behaviors/patterns, one can determine the common DNS traffic patterns flowing in the network [4], e.g. the average number of packets and bytes in a normal DNS flow. *The terms Flows, DNS traffic and DNS flows are used interchangeably in the subsequent sections.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 3 :</head><label>3</label><figDesc>Fig. 3: Showing three blobs of clusters.</figDesc><graphic coords="2,364.33,433.22,122.04,100.34" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Fig. 4 :</head><label>4</label><figDesc>Fig. 4: Anomaly Detection Workflow.</figDesc><graphic coords="3,59.81,187.57,219.68,120.78" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Fig. 5 :</head><label>5</label><figDesc>Fig. 5: NetFlow data are collected from various multi-enterprise network.</figDesc><graphic coords="3,47.60,519.15,244.08,141.11" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Fig. 6 :</head><label>6</label><figDesc>Fig. 6: Sample of a DNS flow (The IP addresses are masked for privacy reasons).</figDesc><graphic coords="3,303.31,295.72,244.09,66.71" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Fig. 7 :Fig. 8 :</head><label>78</label><figDesc>Fig. 7: Bimodal distribution of the feature bytes for all DNS servers.</figDesc><graphic coords="4,47.60,178.30,244.08,98.05" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Fig. 9 :</head><label>9</label><figDesc>Fig. 9: Cumulative distribution function of DNS packets per flow.</figDesc><graphic coords="4,339.92,319.18,170.86,114.71" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head>Fig. 10 :</head><label>10</label><figDesc>Fig. 10: Cumulative percentage of the average number of source and destination port.</figDesc><graphic coords="4,339.92,575.20,170.86,115.39" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_10"><head>Fig. 11 :</head><label>11</label><figDesc>Fig. 11: Features selected for data preprocessing.</figDesc><graphic coords="5,47.60,153.09,244.09,66.71" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_11"><head>Fig. 12 :</head><label>12</label><figDesc>Fig. 12: Different types of clustering algorithms.</figDesc><graphic coords="5,47.60,433.82,244.09,85.46" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_12"><head>Fig. 13 :</head><label>13</label><figDesc>Fig. 13: Features selected for data preprocessing.</figDesc><graphic coords="5,303.31,85.14,244.09,66.71" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_13"><head>Fig. 14 :</head><label>14</label><figDesc>Fig. 14: Aggregated Flow in 1 minute interval.</figDesc><graphic coords="5,303.31,431.36,244.09,78.15" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_14"><head>Fig. 17 :</head><label>17</label><figDesc>Fig. 17: Self-Organizing Maps (SOMs)</figDesc><graphic coords="6,352.13,584.10,146.45,101.89" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_15"><head>Fig. 19 :</head><label>19</label><figDesc>Fig. 19: U-Matrix visualized with a hexmap after SOM training. Contrasting high-value colors denote the cluster separators, while adjacent and similar colors represent the similarity of the data points.</figDesc><graphic coords="7,47.60,439.95,244.07,198.68" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_16"><head>Fig. 20 :</head><label>20</label><figDesc>Fig. 20: Gaussian Mixture Model of three normal distributions.</figDesc><graphic coords="7,364.33,201.91,122.04,112.25" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_17"><head>Fig. 21 :</head><label>21</label><figDesc>Fig. 21: Sudden spike from 1130hrs to 1430hrs.</figDesc><graphic coords="8,84.22,85.14,170.86,138.21" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_18"><head>Fig. 22 :</head><label>22</label><figDesc>Fig. 22: Four different NetFlow datasets combined.</figDesc><graphic coords="8,84.22,553.21,170.86,135.72" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_19"><head>Fig. 23 :</head><label>23</label><figDesc>Fig. 23: K-means fails to detect even a single anomaly in the simulated dataset.</figDesc><graphic coords="8,339.92,286.13,170.86,135.72" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_20"><head>Fig. 25 :</head><label>25</label><figDesc>Fig. 25: BIC and AIC scores.</figDesc><graphic coords="9,84.22,561.54,170.87,144.39" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_21"><head>Fig. 26 :</head><label>26</label><figDesc>Fig. 26: Data exfiltration anomalies are detected under Cluster 4. TABLE III shows the confusion matrix after GMM detection on the simulated NetFlow dataset, filtered to show only the data exfiltration anomalies. The simulated NetFlow dataset has a total of 141 time-window-aggregated DNS flows of data exfiltration anomalies after preprocessing.</figDesc><graphic coords="9,339.92,321.00,170.86,136.48" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_22"><head>Fig. 27 :</head><label>27</label><figDesc>Fig. 27: KDE on a mixture of five Gaussian.</figDesc><graphic coords="10,47.60,277.33,244.08,164.77" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>TABLE II</head><label>II</label><figDesc></figDesc><table /><note>: Cumulative variance explained using six components.</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>TABLE III :</head><label>III</label><figDesc>Confusion Matrix: All 141 data exfiltration anomalies have been detected by the GMM. E. Kernel Density Estimation (KDE) on Gaussian Mixture: TABLE IV lists the Gaussian means µ and mixture weights for the simulated NetFlow dataset. The Normal cluster/Gaussian represents 68% of the total distribution while the remaining clusters represent only 32%; the simulated data exfiltration anomalies are clustered under Cluster 4 by the GMM model.</figDesc><table><row><cell>Cluster k</cell><cell>Mean µ</cell><cell>Weight</cell></row><row><cell>Normal</cell><cell>86</cell><cell>0.68</cell></row><row><cell>Cluster 1</cell><cell>587</cell><cell>0.08</cell></row><row><cell>Cluster 2</cell><cell>491</cell><cell>0.12</cell></row><row><cell>Cluster 3</cell><cell>1762</cell><cell>0.08</cell></row><row><cell>Cluster 4</cell><cell>22040</cell><cell>0.02</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>TABLE IV :</head><label>IV</label><figDesc>The means and weights of the average number of bytes with five Gaussians.</figDesc><table /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">ACKNOWLEDGMENTS</head><p>This is an industrial research project supported by the University of Glasgow and the Singapore Institute of Technology (SIT) with an industrial partner. I want to thank my professor Dr. Vivek Balachandran of the Singapore Institute of Technology for sharing his knowledge and guidance throughout. Finally, I would also like to thank my industrial supervisor Eugene Chong for giving me the chance to take on this challenging research project.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Facure</surname></persName>
		</author>
		<ptr target="https://lamfo-unb.github.io/2017/05/09/Semi-Supervised-learning-for-fraud-detection-Part-1/" />
		<title level="m">Semi-Supervised Learning for Fraud Detection Part 1</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Network flow analysis</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">W</forename><surname>Lucas</surname></persName>
		</author>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="9" to="11" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Big data analytics in cybersecurity</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">T</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Deng</surname></persName>
		</author>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="27" to="30" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Unsupervised anomaly detection in network intrusion detection using clusters</title>
		<author>
			<persName><forename type="first">S</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kaur</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">A mixed unsupervised clustering-based intrusion detection model</title>
		<author>
			<persName><forename type="first">C</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sun</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Unsupervised clustering approach for network anomaly detection</title>
		<author>
			<persName><forename type="first">I</forename><surname>Syarif</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Prugel-Bennett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wills</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Network intrusion detection third edition</title>
		<author>
			<persName><forename type="first">S</forename><surname>Northcutt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Novak</surname></persName>
		</author>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="103" to="115" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Network intrusion detection third edition</title>
		<author>
			<persName><forename type="first">S</forename><surname>Northcutt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Novak</surname></persName>
		</author>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="113" to="115" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Beginner&apos;s Guide to Feature Selection in Python</title>
		<author>
			<persName><forename type="first">S</forename><surname>Paul</surname></persName>
		</author>
		<ptr target="https://www.datacamp.com/community/tutorials/feature-selection-python" />
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">A survey of different methods of clustering for anomaly detection</title>
		<author>
			<persName><forename type="first">S</forename><surname>Tripathy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sahoo</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="volume">6</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Big data analytics for network anomaly detection from netflow data</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Terzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Terzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sagiroglu</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">A Introduction to K-means Clustering</title>
		<author>
			<persName><forename type="first">A</forename><surname>Trevino</surname></persName>
		</author>
		<ptr target="https://www.datascience.com/blog/k-means-clustering" />
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Traffic anomaly detection using k-means clustering</title>
		<author>
			<persName><forename type="first">G</forename><surname>Munz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Carle</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Self Organizing Maps</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ralhan</surname></persName>
		</author>
		<ptr target="https://towardsdatascience.com/self-organizing-maps-ff5853a118d4" />
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Brief review of self-organizing maps</title>
		<author>
			<persName><forename type="first">D</forename><surname>Miljkovic</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Visual analysis of self-organizing maps</title>
		<author>
			<persName><forename type="first">O</forename><forename type="middle">K</forename><surname>Pavel</surname></persName>
		</author>
		<author>
			<persName><surname>Stefanovic</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page">495</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Pattern recognition and machine learning</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Bishop</surname></persName>
		</author>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="430" to="435" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Pattern recognition and machine learning</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Bishop</surname></persName>
		</author>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="435" to="439" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Estimating the number of components in gaussian mixture models adaptively</title>
		<author>
			<persName><forename type="first">C</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">Anomaly Detection: A Survey</title>
		<author>
			<persName><forename type="first">V</forename><surname>Chandola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kumar</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
