=Paper=
{{Paper
|id=Vol-2343/paper9
|storemode=property
|title=Performance/Cost Analysis of a Cloud Based Solution for Big Data Analytic: Application in Intrusion Detection
|pdfUrl=https://ceur-ws.org/Vol-2343/paper9.pdf
|volume=Vol-2343
|authors=Nada Chendeb,Iman Mallat,Nazim Agoulmine,Nour Mawass
|dblpUrl=https://dblp.org/rec/conf/bdcsintell/TaherMAM18
}}
==Performance/Cost Analysis of a Cloud Based Solution for Big Data Analytic: Application in Intrusion Detection==
Nada Chendeb Taher¹, Imane Mallat¹, Nazim Agoulmine², Nour Mawass³
¹ Lebanese University, Faculty of Engineering, Tripoli, Lebanon
² COSMO, IBISC Laboratory, UEVE, Paris-Saclay University, France
³ Normandie University, UNIROUEN, LITIS
Abstract—The essential target of 'Big Data' technology is to provide new techniques and tools to assimilate and store the large amounts of generated data in a way that allows analyzing and processing it to get insights and predictions that can offer new opportunities for the improvement of our life in different domains. In this context, 'Big Data' addresses two essential issues: the real-time analysis issue introduced by the increasing velocity at which data is generated, and the long-term analysis issue introduced by the huge volume of stored data.
To deal with these two issues, we propose in this paper a Cloud-based solution for big data analytics on the Amazon Cloud operator. Our objective is to evaluate the performance of the offered Big Data services with respect to the volume/velocity of the processed data. The dataset we use contains information about "network connections" in approximately 5 million records with 41 features; the solution works as a network intrusion detector. It receives data records in real time from a Raspberry Pi node and predicts whether the connection is bad (malicious intrusion or attack) or good (normal connection). The prediction model was built using a logistic regression network. We evaluate the cloud resources needed to train the machine learning model (batch processing) and to predict the new streaming data with the trained network in real time (real-time processing).
The solution worked very well with high accuracy, and the results show that when working with Big Data in the cloud we are mainly dealing with a cost/performance trade-off: the processing performance in terms of response time, for both long-term and real-time analysis, can always be guaranteed once the cloud resources are provisioned according to the needs.

I. INTRODUCTION

From the dawn of civilization until 2003, humankind generated five exabytes of data. Now we produce five exabytes every two days and the pace is accelerating [1]. Our smartphones collect data on how we use them and our web browsers collect information on what we are searching for. Servers also collect information about connections and user activities to protect and develop the services they are offering. Today, it is hard to imagine any activity or device that does not generate data. Just think about all the pictures we take on our smartphones: we upload and share hundreds of thousands of them on social media sites every second. With the datafication of everything comes Big Data.
Cloud Computing and Big Data are distinct disciplines that have evolved separately over time. However, they are increasingly becoming interdependent. The concept of Big Data has been around for many years, but its mainstream application started only recently [2]. The concept of cloud computing also traces back to the 1960s and has since evolved and passed through many stages to become today a mainstream commercial necessity [2].
Demand for Big Data is calling for the adoption of Cloud platforms, because Big Data techniques need very high computing resources that grow in parallel with the fast growth of the generated data. Therefore, if we want to transform the data into value and utilize its potential, we first need to fully embrace Cloud-based systems. The convergence of Big Data and Cloud Computing eventually provides new opportunities and applications in many verticals. Nowadays, many companies have already adopted Cloud technologies as a means to store and analyze data and, finally, to get predictions and insights about their business performance. Many examples of such applications exist in domains such as healthcare, Industry 2.0, networking, business, marketing and many others.
Big Data technology is also helping to fight hackers. Indeed, the same tools that are central to Big Data can also be used to analyze network traffic at large scale and react faster to attacks, thereby preventing damage or data leakage before it happens. Big Data can be used to identify anomalies in device behaviour, user behaviour or network connections.
In this context, any application with Big Data analytics can take one or both of two ways: (1) real-time analysis, which helps to find irregularities in the collected data and act as fast as possible to prevent an undesired scenario, or (2) long-term analysis, which uses the massive data collected and exploits the insights to identify future trends and opportunities.
With real-time analysis, we face the "Velocity" challenge. In major applications, we have to process data and obtain information and decisions in real time, before the next data is generated. For example, in network security, we have to detect any threat and stop it as early as the network connection starts to be established, preventing intruders from entering, interacting with or damaging the system. With long-term analysis, which requires batch processing, we face the "Volume" challenge: where to store the huge amount of data, and how to process it accurately, in a reasonable amount of time and at a reasonable cost?
The two different ways of processing data lead to the necessity of a Cloud-based solution for Big Data analytics that is able to address both the volume and the velocity challenges, as highlighted in Figure 1.

Fig. 1: A diagram for a Cloud based model for real-time and long-term analysis of Big Data

Fig. 2: Big Data 4V's
This model could be built on existing Cloud Computing infrastructures; however, these infrastructures (computing resources) do not offer the same level of Quality of Service, nor, in general, the same charging model. Therefore, it is important to be able to evaluate the best offers in terms of performance and services. It is also important to identify the trade-off between the provided performance and the cost. In particular, when the volume/complexity/velocity of the data increases, it is important to understand how the processing performance is positively or negatively impacted by the provided computing resources. These questions constitute the challenges we address in this paper in order to derive a performance/cost model for executing Big Data analytics in the Cloud for intrusion detection applications.
The data is related to "network connections" and the aim is to specify and deploy a machine learning solution able to detect malicious connections and stop them as soon as possible. Information about new data connections to the protected system is sent to an intrusion detection system running in the Cloud and is processed in real time (streaming). If the intrusion-detection system detects an anomaly, it blocks the connection and/or notifies the network administrator. The decisions are the result of batch processing and machine learning applied to the data accumulated over time.
The remainder of this paper is organized as follows: after preliminaries presenting Big Data and Big Data analysis in Section II, Section III presents the proposed intrusion detection solution based on Cloud Computing and Big Data analytics. The solution uses the services provided by the Amazon Cloud provider, namely AWS (Amazon Web Services). Section IV presents an implementation of the system as well as the dataset and the machine learning model used; in this section we also present the results of the conducted performance/cost analysis. In Section V, we discuss the results and highlight the cost-benefit of our solution against a traditional one. Finally, we conclude the work in Section VI and present some perspectives.

II. PRELIMINARIES

A. What is Big Data?
There is no stable definition of Big Data. We can use the term Big Data to describe a massive volume of both structured and unstructured data that is so large and increasing so fast that it is very difficult to process using traditional tools, databases and software techniques. To deal with this kind of data, a multitude of tools and frameworks are available, including the famous Hadoop (and Spark) ecosystem. The underlying techniques behind these tools and frameworks are distribution, parallelism and clustering of computing resources.

B. Characteristics of big data
Big data is commonly characterized using a number of V's. The first four are Volume, Velocity, Variety and Veracity, as shown in Figure 2. Volume refers to the vast amount of data that is generated every second, minute, hour and day in our digitized world. Variety refers to the ever increasing variety of forms of captured data, such as text, images, voice, geospatial, raw, etc. Velocity refers to the speed at which data is being generated and the pace at which data moves from one point to the next. Veracity refers to the quality and the reliability of the data.

C. Turning Data into Value
Data analytics, backed by the expansion of computing power, is enabling companies to extract maximum value from data and get the best insights. The way we move from data to insight and value is called the 'wisdom hierarchy' [3]. The wisdom hierarchy is a conceptual framework for thinking about how the raw inputs of reality (signals) are stored as data and transformed first into information, then into knowledge, and finally into wisdom (Figure 3). It summarizes the data analysis process. In other words, the 'Wisdom Hierarchy' represents a path from gathering and exploring raw data, to machine learning that enables getting knowledge from raw data, and finally to artificial intelligence that ensures a deep understanding of knowledge.

Fig. 3: Wisdom hierarchy
Based on this hierarchy, we discover the importance of machine learning in the big data domain for getting insights from raw data, making predictions and then taking the appropriate decisions; this is what we call 'data mining'. So what is machine learning? And what are the machine learning models?

D. Machine learning
Machine learning is an artificial intelligence technique that allows software applications to become more accurate in predicting outcomes without being explicitly programmed. The basic premise of machine learning is to build algorithms that can receive a large amount of input data and then predict a reliable output. Based on this definition, machine learning takes data as input. The input data is called 'training data'. The desired output is called 'Targets' or 'Labels'. Machine learning models are often categorized as supervised or unsupervised. Supervised models need humans to provide both the input ('Training Data') and the desired output ('Targets'). Once training is complete, the algorithm must be tested on new labeled data to compute its 'accuracy'. The accuracy parameter represents the proportion of correct predictions among all predictions made with the machine learning algorithm; the desired accuracy depends on the application at hand. As for unsupervised models, the training data does not include 'Targets', so we do not tell the system where to go; the system has to learn by itself from the data provided.

E. Data analysis process
When talking about the 'Wisdom Hierarchy', data analysis passes through many steps before being translated into actions. In the Big Data course provided by UCSD entitled 'Big Data Specialization' [4], these steps are clearly defined and detailed. They start with data acquisition and end with decision-making.
1) Acquiring data: The first step in acquiring data is to determine what data is important. Leaving out even a small amount of important data can lead to incorrect conclusions.
2) Exploring data: Exploring data is part of the two-step data preparation process. We need to do some preliminary investigation in order to gain a better understanding of the specific characteristics of the data. In this step, we look at things such as correlations, general trends and outliers.
3) Pre-processing data: There are two main goals in the data pre-processing step. The first is to clean the data to address data quality issues, and the second is to transform the raw data to make it suitable for analysis.
4) Analyzing data: Data analysis involves building a model from the clean data, which is called input data. The input data is used by the analysis technique to build a model; what the model generates is the output data. There are different types of problems, and so there are different types of analysis techniques. The main categories of analysis techniques are classification, regression, clustering, association analysis and graph analysis.
In classification, the goal is to predict the category of the input data. When the model has to predict a numeric value instead of a category, the task becomes a regression problem. In clustering, the goal is to organize similar items into groups. The goal in association analysis is to come up with a set of rules that capture associations between items or events.
Let us briefly look at how to evaluate each technique. For classification and regression, we have the correct output for each sample in the input data, and comparing the correct output with the output predicted by the model provides a way to evaluate the model. For clustering, the resulting groups should be examined to see if they make sense for our application. For association analysis, some investigation is needed to see if the results are correct.
In summary, data analysis involves selecting the appropriate technique for our problem, building the model, and then evaluating the results.
5) Turning insights into actions: The next step is to determine what action or actions should be taken based on the insights gained; this is the first step in turning insights into action. Once we have determined what action to take, the next step is to study how to implement it.

F. Techniques and tools for big data analysis
Hadoop is an open-source framework that allows storing and processing large datasets in a parallel and distributed fashion. It runs on clusters of commodity servers and can scale up to support thousands of hardware nodes and massive amounts of data. Consequently, Hadoop became a data management platform for big data analytics. As the diagram of Figure 4 shows, different layers operate in this ecosystem:
• The Hadoop 'HDFS' (Hadoop Distributed File System), a parallel and distributed storage unit.
• The Hadoop 'Yarn' (Yet Another Resource Negotiator), a resource manager layer that interacts with applications and schedules resources for their use.
• 'MapReduce', a programming model for processing large amounts of data in a parallel and distributed fashion, composed of Map() and Reduce() procedures.
• 'Storm', a distributed, real-time data processing platform.
• 'Spark', an open-source big data processing framework built around speed, ease of use and sophisticated analytics. In addition to MapReduce operations, it supports SQL queries, streaming data, machine learning and graph processing. It also offers a shell for Python.
• 'Cassandra', 'MongoDB' and 'HBase', which are distributed databases.

Fig. 4: Layers diagram in Hadoop ecosystem

In our cloud-based solution, we use the 'Spark' framework on top of a Hadoop cluster so that we can ensure real-time processing and sophisticated analytics such as 'Machine Learning' and 'data mining'.
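To give an idea of what this choice looks like in practice, the following is a minimal sketch, assuming PySpark running on an EMR/Hadoop cluster, of how a Spark session can serve both the batch (long-term) and the streaming (real-time) parts of the model. The application name, S3 path and batch interval are placeholders of ours, not values taken from the paper.

```python
# Minimal PySpark setup sketch (assumed configuration, not the authors' exact code).
# On Amazon EMR, Spark already runs on top of YARN/HDFS, so no explicit cluster wiring is needed here.
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

spark = SparkSession.builder \
    .appName("big-data-intrusion-detection") \
    .getOrCreate()
sc = spark.sparkContext

# Batch side: load a dataset stored on S3 (hypothetical path) as an RDD for long-term analysis.
raw_rdd = sc.textFile("s3://my-bucket/kddcup.data.gz")
print("records available for batch processing:", raw_rdd.count())

# Streaming side: a StreamingContext with a 1-second micro-batch interval
# would back the real-time part of the model described in Sections III and IV.
ssc = StreamingContext(sc, batchDuration=1)
```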
G. Cloud Computing
Cloud Computing is a paradigm in which any user with an internet connection can rent computing resources as needed from a cloud operator that owns large datacenters and offers services. This is mainly an economic revolution in the IT/networking field, made possible by the huge advances in virtualization technology and datacenters. Cloud Computing services cover a vast range of options, from the basics of storage, networking and processing power through to natural language processing and artificial intelligence services. A fundamental concept behind cloud computing is that the location of the service, and many of the details such as the hardware or operating system on which it is running, are largely irrelevant to the user. There are many benefits from cloud computing that make this field a very important one; three of the main benefits are self-service provisioning, elasticity and pay-per-use.

H. Literature review
With the exponential growth of data and the digitization of everything, cyber-attacks have become widespread and threaten both organizations' security and personal privacy. On the other hand, with the evolution of big data technology, new frameworks and techniques have appeared, providing solutions for real-time and complex analysis aimed at trend and behavior discovery, malicious activity prevention and fraud detection. Big Data analytics promises significant opportunities for solving different information security problems [5]. For this reason, research has evolved in the domain of Big Data and cyber-security.
Security is now a big data problem, because the data that has a security context is huge. Hence, to make a big data processing and computing infrastructure extra secure, the authors in [6] summarized a number of security- and privacy-related issues. In a big data environment, it is more complicated and difficult to store and process organizations' and customers' information in a secure manner, given the huge volume, the increasing velocity and the variety of the generated data. In this context, the author of [7] proposed some solutions for intrusion detection and the limitation of threats and attacks. One of these solutions was to implement a MapReduce machine learning model that can distinguish between bad and normal connections based on features and metrics such as the flow duration, the average bytes per packet in the flow, and the average bytes per second in the flow. The proposed model may use collected network traffic consisting of both normal flows and potential attack flows to train a logistic regression (LR) or naïve Bayes network that works as a binary classifier.
One of the machine learning techniques developed at the intersection of big data and cyber security is the neural network approach, which plays an interesting role in discovering patterns and malicious user activity. Ana-Maria Ghimes and Victor-Valeriu Patriciu proposed a neural network model and several case studies on algorithms and architectures of neural networks, aiming to determine the best way to discover new attack patterns in data [8]. For test purposes, they used datasets provided by the UCI Machine Learning Repository [9]. For classification, they used repositories like the "Detect Malicious Executable (AntiVirus) Data Set", which consists of malicious and non-malicious samples. They started the study with a simple implementation of a neural network and continued using pruning techniques to find the optimal network. The best model they obtained in the pruning process had the hyperbolic tangent function as the activation function and consisted of 4 layers: an input layer with 22 neurons, an output layer with one neuron and two hidden layers with 8 and 5 hidden neurons, respectively. Once the pruning is completed, the training process is initiated to update the connection weights and obtain the best performance.
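For concreteness, the sketch below illustrates the pruned 22-8-5-1 architecture reported in [8]. It is an illustrative reconstruction only: the framework (Keras), the sigmoid output activation, the loss function and the training settings are our assumptions and are not taken from [8].

```python
# Illustrative reconstruction of the 22-8-5-1 network described in [8].
# Assumptions: Keras as the framework, sigmoid output for binary classification,
# binary cross-entropy loss; these details are not specified in [8].
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(8, activation="tanh", input_shape=(22,)),  # first hidden layer, 22 input features
    keras.layers.Dense(5, activation="tanh"),                     # second hidden layer
    keras.layers.Dense(1, activation="sigmoid"),                  # single output neuron (malicious / benign)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Hypothetical training call on a feature matrix X (n_samples x 22) and labels y in {0, 1}.
X = np.random.rand(100, 22)
y = np.random.randint(0, 2, size=100)
model.fit(X, y, epochs=5, batch_size=16, verbose=0)
```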
III. PROPOSED CLOUD BASED SOLUTION FOR INTRUSION DETECTION

As the speed and the volume of network data increase, in particular for connections to remote servers, the need to perform data analysis in real time with machine learning algorithms and to extract a deeper understanding from the data becomes crucial for all businesses, organizations and governments. At the same time, to cope with the increasing velocity, volume and even complexity of the generated data, the use of big data tools and techniques that ensure parallelism in computing, scalability and reliability is a necessity.
Consequently, and given the importance of both real-time and long-term analysis in the majority of domains today, we created a cloud-based system for both stream and batch processing using the 'Amazon' Cloud, which provides all the Big Data techniques and tools we need to build such a system, at a low cost and without the necessity of procuring hardware or maintaining infrastructure. To implement our solution on Amazon, we relied on a set of Amazon Web Services (AWS) that can be connected to one another. Some of these services are for capturing data streams, others for computing and processing, and others for storage and notification. In the following we describe the services used in our model.

A. Used AWS services
1) AWS IoT Core: The Amazon IoT service is used to connect IoT devices, receive data from these devices using the 'MQTT' protocol and publish the messages to a specific 'topic'. In AWS IoT, a 'thing' represents any connected device. Additionally, AWS IoT 'rules' applied to the received data give IoT-enabled devices the ability to interact with other AWS services; one rule can combine more than one 'action'.
As of today, this service costs 0.08$ per million minutes of connection, 1$ per million messages and 0.15$ per million rules triggered.
2) Amazon Kinesis Stream: Amazon Kinesis Streams can continuously capture and store terabytes of data per hour from hundreds of thousands of sources. Data records are accessible for a default of 24 hours from the time they are added to a stream. During that window, data is available to be read, re-read, backfilled and analyzed, or moved to long-term storage. With Amazon Kinesis, streaming data can be ingested, buffered and processed in real time, so insights can be derived in seconds or minutes instead of hours or days.
Pricing is based on two core dimensions: Shard Hours and PUT Payload Units. A 'Shard' is the base throughput unit of an Amazon Kinesis stream; an Amazon Kinesis stream is made up of one or more shards. Each shard provides a capacity of 1 MB/sec data input and 2 MB/sec data output, and can support up to 1000 write and 5 read transactions per second. The number of shards needed within the stream is specified based on the throughput requirements, and shards are charged at an hourly usage rate.
A record is the data that the producer adds to the Amazon Kinesis stream. A PUT Payload Unit is counted in 25 KB payload "chunks" that comprise a record. For example, a 5 KB record contains one PUT Payload Unit, a 45 KB record contains two PUT Payload Units, and a 1 MB record contains 40 PUT Payload Units. PUT Payload Units are charged at a per-million-units rate. For each shard, the cost is 0.015$ per hour, and the cost is 0.014$ per million PUT Payload Units.
3) Amazon Elastic Compute Cloud (EC2): Amazon EC2 provides scalable computing capacity in the Amazon Web Services (AWS) cloud. Using Amazon EC2 eliminates the need to invest in hardware up front, allowing applications to be developed and deployed faster. Amazon EC2 can be used to launch as many or as few virtual servers as needed, configure security and networking, and manage storage. With On-Demand instances, only EC2 instance usage is charged, on a per-hour basis depending on which EC2 instance type is used. For example, for the c5.2xlarge instance type (8 CPUs and 16 GiB of RAM), the cost is 0.34$ per hour.
4) Elastic MapReduce (EMR): Amazon EMR is a highly distributed computing framework to easily and quickly process and store data in a cost-effective manner. Amazon EMR uses 'Apache Hadoop', an open-source framework, to distribute the data and processing across a resizable cluster of Amazon EC2 instances, and it allows the most common Hadoop tools, such as 'Hive', 'Pig', 'Spark' and so on, to be used. With Amazon EMR, more core nodes can be added at any time to increase the processing power.
Amazon EMR pricing is simple and predictable: we pay a per-second rate for every second of use, with a one-minute minimum. For example, a 10-node cluster running for 10 hours costs the same as a 100-node cluster running for 1 hour. The hourly rate depends on the instance type used; for example, for the c5.2xlarge EC2 instance type the cost is 0.085$ per hour.
5) Simple Storage Service (S3): Amazon S3 is carefully engineered to meet the requirements for scalability, reliability, speed, low cost and simplicity. We pay 0.023$ per GB for the first 50 TB per month, 0.022$ per GB for the next 450 TB per month, and 0.021$ per GB beyond 500 TB per month.
6) Simple Notification Service (SNS): SNS is a fully managed push notification service that allows sending individual messages to large numbers of recipients. Amazon SNS makes it simple and cost-effective to send push notifications to mobile device users and email recipients, or even to send messages to other distributed services.
SMS messages sent to non-US phone numbers are charged; for example, to send a message to Lebanon, the cost per message is 0.04746$ for an Alfa line and 0.05192$ for a Touch line.

B. Our proposed model
These services should be carefully connected to form our Cloud-based solution for real-time and batch processing of Big Data, as shown in Figure 5.

Fig. 5: Our Cloud Based solution for Big Data analytic

First, we create a Spark cluster of a specific EC2 instance type using Amazon EMR. After uploading the data to an S3 bucket, the data is pulled from the Spark cluster to train a machine learning network. A Raspberry Pi (RPi), which plays the role of any connected device, is connected to Amazon IoT Core. Data sent from the RPi is published to a specific topic, on which a Kinesis rule is applied to push the data into the Kinesis stream shard. In the EMR Spark cluster, the streaming data is pulled and predicted with the machine learning model already built. In case of anomalies, a notification is sent with Amazon SNS to a specific phone number.
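To make the ingest path of this model concrete, the following is a minimal sketch of the device side: a Raspberry Pi publishing connection records to an AWS IoT Core topic over MQTT. The endpoint, certificate paths and topic name are placeholders (they are account-specific and not given in the paper), and the IoT rule that forwards the topic to the Kinesis stream is configured in the AWS console, not in this code.

```python
# Device-side sketch: publish "network connection" records to AWS IoT Core over MQTT.
# Assumptions: paho-mqtt client, X.509 device certificates, placeholder endpoint/topic names.
import gzip
import ssl
import time
import paho.mqtt.client as mqtt

IOT_ENDPOINT = "xxxxxxxxxxxxxx-ats.iot.eu-west-1.amazonaws.com"  # placeholder AWS IoT endpoint
TOPIC = "connections/records"                                    # placeholder topic used by the Kinesis rule

client = mqtt.Client()
client.tls_set(ca_certs="AmazonRootCA1.pem",    # placeholder certificate files
               certfile="device.pem.crt",
               keyfile="private.pem.key",
               tls_version=ssl.PROTOCOL_TLSv1_2)
client.connect(IOT_ENDPOINT, port=8883)
client.loop_start()

# Stream the unlabeled KDD Cup records (41 comma-separated features per line).
with gzip.open("kddcup.testdata.unlabeled-10-percent.gz", "rt") as f:
    for line in f:
        client.publish(TOPIC, payload=line.strip(), qos=1)
        time.sleep(0.001)  # pacing: one shard supports up to 1000 write transactions per second

client.loop_stop()
client.disconnect()
```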
IV. DATA SET, PROCESSING MECHANISM AND RESULTS

A. Network connections DataSet
To build this solution, we have used the 'KDD Cup' dataset. This is a typical big dataset that helped us perform the performance/cost analysis of the complete model in the cloud. This dataset is the one used for the Third International Knowledge Discovery and Data Mining Tools Competition. The competition was to evaluate different network intrusion detectors, i.e. predictive models capable of distinguishing between 'bad' connections, called intrusions or attacks, and 'good' normal connections.
This dataset contains records about network connections with 41 features, such as the connection duration, the number of segments sent, the number of failed segments, the protocol (tcp or udp), the service used (http, ...), and the label (in the case of labeled data) that indicates the status of the connection ('normal', 'buffer-overflow', 'guess-passwd', ...), etc.
The dataset consists of 6 compressed data files. Each file contains a distinct set of data, including labeled data to use for training and testing the network and unlabeled data to use for prediction [10]. In our experiment, we have used the full data file 'kddcup.data.gz' for training the network, the 'corrected.gz' data file with corrected labels for testing the machine learning network, and the unlabeled data file 'kddcup.testdata.unlabeled-10-percent.gz' to stream data from the RPi, thus simulating new connections that should be analyzed in real time. Figure 6 shows a portion of the full data file 'kddcup.data.gz'.

Fig. 6: Sketch of a portion of the data used

B. Processing mechanism
We have used 'logistic regression', a supervised machine learning model, on this dataset to distinguish between normal and bad network connections. The machine learning model was inspired by a tutorial applied to the same dataset, testing different algorithms in Apache Spark [11]. The following is a quick walk-through of the process presented in Figure 5 (illustrative code sketches for the training and streaming-prediction steps follow the list):
• First, we uploaded the full dataset 'kddcup.data.gz' together with the 'corrected.gz' dataset to an S3 bucket.
• We created a Spark cluster on EMR to process the large amount of data. In this solution we used a cluster of 3 EC2 instances of the c5.2xlarge instance type (8 CPUs and 16 GB of RAM per instance).
• On the EMR cluster, we created a machine learning "Logistic Regression" model that takes as 'Input' the numerical features from the 'kddcup' data and as 'Target' the label '0' in the case of a normal connection and '1' in the case of a bad connection. We used 2 million records of 'kddcup.data.gz' for training the network. The 'corrected.gz' data file is also used to compute the accuracy and to validate this network.
• We downloaded the 'kddcup.testdata.unlabeled-10-percent.gz' data file on the Raspberry Pi. We pulled records from this file and sent them to Amazon IoT Core. The generated data is published to a topic on which a rule is applied to send the data to a Kinesis stream. Then, on the EMR cluster, we pulled each record from the Kinesis stream to predict it using the trained machine learning network.
• In case of a bad record (bad network connection), a notification is sent to a specific phone number using the Amazon SNS service.
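A minimal sketch of the training step, in the spirit of the Spark tutorial [11] that inspired the model, is given below. It uses the RDD-based MLlib API; the bucket name is a placeholder, and the exact feature selection and the 2-million-record subsetting used in the experiments are simplified away, so this is not the authors' exact code.

```python
# Training-step sketch (assumptions: PySpark MLlib RDD API, simplified feature handling).
from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="kddcup-logistic-regression")

def parse_record(line):
    """Turn one labeled KDD Cup CSV line into a LabeledPoint: label 0 = normal, 1 = attack."""
    fields = line.split(",")
    label = 0.0 if fields[-1] == "normal." else 1.0
    # Keep only the numerical features; columns 1-3 (protocol, service, flag) are symbolic.
    features = [float(x) for i, x in enumerate(fields[:-1]) if i not in (1, 2, 3)]
    return LabeledPoint(label, features)

# Placeholder bucket name; the paper uploads 'kddcup.data.gz' and 'corrected.gz' to S3.
train_rdd = sc.textFile("s3://my-bucket/kddcup.data.gz").map(parse_record)
test_rdd = sc.textFile("s3://my-bucket/corrected.gz").map(parse_record)

model = LogisticRegressionWithLBFGS.train(train_rdd, iterations=10)

# Accuracy on the corrected (labeled) test file, as described in the walk-through.
predictions = test_rdd.map(lambda p: (model.predict(p.features), p.label))
accuracy = predictions.filter(lambda x: x[0] == x[1]).count() / float(test_rdd.count())
print("test accuracy:", accuracy)
```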
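A companion sketch of the streaming-prediction step follows. For brevity it polls the Kinesis shard with boto3 rather than using the Spark Streaming Kinesis connector; the stream name, region and phone number are placeholders, and `model` is the trained model from the sketch above.

```python
# Streaming-prediction sketch (assumptions: boto3 polling of a single shard; placeholder
# stream name, region and phone number; 'model' comes from the training sketch above).
import time
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")
sns = boto3.client("sns", region_name="eu-west-1")

STREAM = "connection-records"   # placeholder Kinesis stream fed by the AWS IoT rule

def parse_unlabeled(line):
    """Numerical features of one unlabeled record (no label column in the streamed data)."""
    fields = line.split(",")
    return [float(x) for i, x in enumerate(fields) if i not in (1, 2, 3)]

shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(StreamName=STREAM, ShardId=shard_id,
                                      ShardIteratorType="LATEST")["ShardIterator"]

while True:
    out = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for rec in out["Records"]:
        features = parse_unlabeled(rec["Data"].decode("utf-8"))
        if model.predict(features) == 1:            # 1 = bad connection (intrusion/attack)
            sns.publish(PhoneNumber="+961XXXXXXXX",  # placeholder administrator number
                        Message="Intrusion detected: suspicious network connection")
    iterator = out["NextShardIterator"]
    time.sleep(0.2)  # one shard supports at most 5 read transactions per second
```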
C. Performance/Cost Analysis of the Training Phase
To perform this machine learning task, we tried different EC2 instance types as well as different numbers of instances in the cluster. The objective is to derive the most cost-effective configuration (i.e. a trade-off between the processing time and the cost). After several configuration tests in the real Amazon cloud, we were able to complete Table I and Table II.

TABLE I: Training time (in seconds) for different c5.xlarge cluster sizes and different volumes of training data.

              2M records   3M records   5M records
3 instances   —            —            —
5 instances   152.58       —            —
6 instances   142.3        —            —

TABLE II: Training time (in seconds) for different c5.2xlarge cluster sizes and different volumes of training data.

              2M records   3M records   5M records
3 instances   118.17       162.496      —
5 instances   101.22       132.9        —
6 instances   90.4         129          —

As shown in Table II, a c5.2xlarge cluster with 3 instances can train the network with a maximum of nearly 3M records. With the same type of virtual machine but with 5 instances, it is possible to reach a maximum of about 3.5M records. Eventually, when we increased the number of instances to 6, we became able to train the network with all the available records in the 'kddcup.data.gz' data file, i.e. 5M records. When using a c5.xlarge cluster with 5 or even 6 instances, it was not possible to train the network with more than roughly 2M records, as shown in Table I. Hence, we noticed that the processing time decreases as the size of the cluster increases and as a more powerful instance type is used. Since training with 2 million records gave us a good accuracy of 0.9195, we found it to be the right configuration for training the network.
From a cost perspective, using a c5.2xlarge cluster, the usage cost is 0.34$/h for each EC2 instance plus 0.085$/h for EMR. For a c5.xlarge cluster, the cost decreases to 0.199$/h per EC2 instance plus 0.052$/h for EMR. If we had used a c5.xlarge cluster with 5 instances, the total cost per hour would have been 5*0.199+0.052=1.047$/h and the training would have taken 152.58 s. With a c5.2xlarge cluster of 3 instances, the total cost per hour is 3*0.34+0.085=1.105$/h and the training takes 118.17 s. Therefore, we decided to use a cluster of 3 c5.2xlarge instances to train the network with 2M records (480 MB).
Regarding the response time of the analysis, which we wanted to be as close to real time as possible, we noticed that no record exceeds 150 bytes. It was therefore enough to use only one shard, on condition that the time between two sent records exceeds 1 ms, while the time between two received records is at least 0.2 s. The obtained response time was very low and eventually did not exceed a millisecond. If we had to send or receive faster or bigger records, we would have needed to increase the number of shards. Kinesis Stream is a very efficient service to ensure the scalability of our model while keeping the stream processing real-time. For each created shard the price was 0.015$/h, and 0.014$ for each million PUT Payload Units. In our case, each record contained only one PUT Payload Unit. We generated 1000 records per second, i.e. 3.6 million PUT Payload Units per hour; rounding up to 4 million, the bill to pay was about 4*0.014+0.015=0.071$/h.

V. PERFORMANCE/COST ANALYSIS OF THE INTRUSION DETECTION PHASE

A. Presentation of the Case Study
Assume we have 20 computing devices that all generate the same type of data (the one used to build our model). Assume that each record does not exceed 150 bytes, so that each record fits in one PUT Payload Unit. Assume also that each device generates data at a speed of 50 records per second; one shard is then enough to ensure this input velocity, since each shard supports 1000 write transactions per second, corresponding to the 20 devices generating 50 writes per second each. Assume the required output velocity is at least 1 output record per second per device. To achieve this performance, it is necessary to subscribe to 4 shards (since each shard supports 5 read transactions per second, for a total of 4*5=20 read transactions per second with 4 shards). The output velocity requirement depends on the consuming application: some applications may need to process data very quickly (critical applications), while others may accept processing the data more slowly.
Suppose a Spark cluster is built on Amazon EMR with 3 c5.2xlarge instances to read the output records from the shards and process them in real time. Assume finally that the prediction model detects 70 anomalies per month on average, so that about 70 notification messages will be sent monthly.
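As a worked check of this sizing, the short script below recomputes the shard requirement and the corresponding hourly Kinesis charge from the assumptions above and the prices quoted in Section III (0.015$ per shard-hour, 0.014$ per million PUT Payload Units). It is only arithmetic; the constant names are ours.

```python
# Worked sizing check for the case study (pure arithmetic on the figures quoted in the paper).
import math

DEVICES = 20
WRITES_PER_DEVICE = 50        # records per second per device
READS_PER_DEVICE = 1          # required output records per second per device
SHARD_WRITE_LIMIT = 1000      # write transactions per second per shard
SHARD_READ_LIMIT = 5          # read transactions per second per shard

writes_per_sec = DEVICES * WRITES_PER_DEVICE                       # 1000 writes/s
reads_per_sec = DEVICES * READS_PER_DEVICE                         # 20 reads/s
shards_for_input = math.ceil(writes_per_sec / SHARD_WRITE_LIMIT)   # 1 shard
shards_for_output = math.ceil(reads_per_sec / SHARD_READ_LIMIT)    # 4 shards
shards = max(shards_for_input, shards_for_output)                  # 4 shards overall

SHARD_HOUR = 0.015            # $ per shard-hour
PUT_UNITS_PER_MILLION = 0.014 # $ per million PUT Payload Units
put_units_per_hour = writes_per_sec * 3600                         # one unit per record (< 150 bytes)
hourly_kinesis = shards * SHARD_HOUR + (put_units_per_hour / 1e6) * PUT_UNITS_PER_MILLION

print(shards, round(hourly_kinesis, 4))   # -> 4 shards, about 0.1104 $/h
```

Multiplying this hourly Kinesis charge by 24*30 hours gives the monthly Kinesis figure used in the next subsection (about 79.5$).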
B. Monthly Cost Analysis
In this section, we evaluate the monthly cost for the case assumed above. If the chosen configuration is the c5.2xlarge instance type, the cost is 0.34$/h for each instance plus 0.085$/h for EMR. Therefore, the cost for the whole Spark cluster is 0.34*3+0.085=1.105$ per hour, and the monthly cost is in this case 1.105$*24*30=795.6$.
To ensure real-time processing, the data records are streamed using Amazon Kinesis. For each shard the cost is 0.015$/h, and for each million PUT Payload Units the cost is 0.014$. Assuming 1000 records are generated per second and 4 shards are used for the streaming, the monthly cost is 4*0.015*24*30=43.2$ for the used shards, and (0.014*3600*24*30*1000)/1000000=36.288$ per month for the PUT Payload Units. We can deduce the cost for the Amazon Kinesis stream service usage, which is 43.2+36.288=79.488$ monthly in our case.
For the Alfa line messaging service usage, the cost per sent message is 0.04746$. If we suppose that we need to send a mean of 70 messages per month, the monthly usage cost for this service is 70*0.04746=3.3222$.
Assuming we decide not to exceed the free tier offered for AWS IoT Core and Amazon S3, such a solution costs in total nearly 878.41$ per month, corresponding to the sum of the individual costs 795.6$+79.488$+3.3222$. This is obviously an advantageous solution compared to hiring one engineer in the company.

C. Comparison with an on-premises solution
If the decision is to deploy this model on premises, it is necessary to use a minimum of three powerful computers and to pay both CAPEX (hardware cost) and OPEX (the cost of the engineers who configure the specific environment and software on this hardware). One should also not neglect the maintenance cost of the hardware and software used. Moreover, this solution does not ensure scalability, since it is necessary to buy more hardware whenever the demand for computing or storage resources increases. With Big Data services deployed in Cloud Computing, it is possible to use hardware and software as commodities with a guarantee of performance. In this case, there is no need to worry about installation, maintenance or upgrades, since these are part of the Cloud Computing service.
For this reason, we advocate building this solution in the Cloud, taking advantage of all the services provided by Cloud operators, while performing an appropriate performance/cost analysis of the different deployment possibilities for the service, like the one presented in this paper. This approach can also be applied to other applications with different response-time or batch-processing requirements. In the case of machine learning, it is also important to include in the study a cost/performance analysis of the network resources needed to deal with the size and the speed of the data to be manipulated.

VI. CONCLUSIONS

In conclusion, the big data cloud-based model we built can reach the desired results in terms of response time and accuracy, at a relatively low cost. There is always a cost/performance trade-off: an increase in the complexity, speed and volume of the data leads to using more Cloud resources with higher specifications, and hence to paying more. We therefore have to define the exact needs on the cloud in order to reduce costs and build an efficient model.
Using the Amazon Cloud, which provides many services to address Big Data analytics requirements, we built a cloud-based solution that deals with both the real-time and the batch processing issues. This complete solution can help meet stringent business requirements in the most optimized, performant and resilient way possible. It can also be used in many domains that require real-time anomaly detection or long-term analysis to get insights and trends from stored data. The role of the engineer is to provision the required resources from the cloud operator to satisfy the needs at the lowest possible cost. Performance on the cloud is always guaranteed once the rented resources are sufficient for the required processing use case.

ACKNOWLEDGMENTS
This research was supported by the Lebanese University and CNRS Lebanon. Part of this work was also conducted in the frame of the PHC CEDRE Project N37319SK.

REFERENCES
[1] Sampriti Sarkar, "Convergence of Big Data, IoT and Cloud Computing for Better Future", Analytics Insight, 2017.
[2] Eric Schmidt.
[3] Adam Michael Wood, "The wisdom hierarchy: From signals to artificial intelligence and beyond", O'Reilly Data Newsletter, 2017.
[4] "Big Data Specialization", University of California San Diego, 2018.
[5] Rasim Alguliyev, Yadigar Imamverdiyev, "Big Data: Big Promises for Information Security", 2014 IEEE 8th International Conference on Application of Information and Communication Technologies (AICT), 2014.
[6] Aditya Dev Mishra, Yooddha Beer Singh, "Big Data Analytics for Security and Privacy Challenges", International Conference on Computing, Communication and Automation (ICCCA 2016), 2016.
[7] M. S. Al-kahtani, "Security and Privacy in Big Data", International Journal of Computer Engineering and Information Technology, 2017.
[8] Ana-Maria Ghimes, Victor-Valeriu Patriciu, "Neural Network Models in Big Data Analytics", 9th International Conference on Electronics, Computers and Artificial Intelligence (ECAI), 2017.
[9] Machine Learning Repository [Online]. Available: http://archive.ics.uci.edu/ml/
[10] Knowledge Discovery and Data Mining Tools Competition 1999 Data [Online]. Available: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99
[11] Spark Python Notebooks [Online]. Available: https://github.com/jadianes/spark-py-notebooks/blob/master/README.md