        Performance/Cost Analysis of a Cloud Based
       Solution for Big Data Analytic: Application in
                     Intrusion Detection
1 Nada Chendeb Taher, 1 Imane Mallat, 2 Nazim Agoulmine, 3 Nour Mawass
1 Lebanese University, Faculty of Engineering, Tripoli, Lebanon
2 COSMO, IBISC Laboratory, UEVE, Paris-Saclay University, France
3 Normandie University, UNIROUEN, LITIS


Abstract—The essential target of 'Big Data' technology is to provide new techniques and tools to assimilate and store the large amounts of generated data in a way that allows analyzing and processing it to get insights and predictions that can offer new opportunities for the improvement of our life in different domains. In this context, 'Big Data' treats two essential issues: the real-time analysis issue introduced by the increasing velocity at which data is generated, and the long-term analysis issue introduced by the huge volume of stored data.
To deal with these two issues, we propose in this paper a Cloud-based solution for big data analytics on the Amazon Cloud operator. Our objective is to evaluate the performance of the offered Big Data services with regard to the volume/velocity of the processed data. The dataset we use contains information about "network connections" in approximately 5 million records with 41 features; the solution works as a network intrusion detector. It receives data records in real time from a Raspberry Pi node and predicts whether the connection is bad (malicious intrusion or attack) or good (normal connection). The prediction model was built using a logistic regression network. We evaluate the cloud resources needed to train the machine learning model (batch processing) and to predict the new streaming data with the trained network in real time (real-time processing).
The solution worked very well with high accuracy, and the results show that when working with Big Data in the cloud we are mainly dealing with a cost/performance trade-off: the processing performance in terms of response time, for both long-term and real-time analysis, can always be guaranteed once the cloud resources are well provisioned according to the needs.

I. INTRODUCTION

From the dawn of civilization until 2003, humankind generated five exabytes of data. Now we produce five exabytes every two days, and the pace is accelerating [1]. Our smartphone collects data on how we use it, and our web browser collects information on what we are searching for. Servers also collect information about connections and user activities to protect and develop the services they are offering. Today, it is hard to imagine any activity or device that does not generate data. Just think about all the pictures we take on our smartphones: we upload and share hundreds of thousands of them on social media sites every second. With the datafication of everything comes Big Data.
Cloud Computing and Big Data are distinct disciplines that have evolved separately over time. However, they are increasingly becoming interdependent. The concept of Big Data has been around for many years, but its mainstream application started only recently [2]. The concept of cloud computing also traces back to the 1960s and has since then evolved and passed through many stages to become today a mainstream commercial necessity [2].
The demand for Big Data is calling for the adoption of Cloud platforms, because Big Data techniques need very high computing resources that grow in parallel with the fast growth of the generated data. Therefore, if we want to transform the data into value and exploit its potential, we first need to fully embrace Cloud-based systems. The convergence of Big Data and Cloud Computing eventually provides new opportunities and applications in many verticals. Nowadays, many companies have already adopted Cloud technologies as a means to store and analyze data, and finally get predictions and insights about their business performance. Many examples of these applications exist in domains such as healthcare, industry 2.0, networking, business, marketing and many others.
Big Data technology is also helping to fight hackers. Indeed, the same tools that are central to Big Data can also be used to analyze network traffic at large scale and react faster to attacks, and therefore prevent damage or data leakage before it happens. Big Data can be used to identify anomalies in device behaviour, user behaviour or network connections.
In this context, any application with Big Data analytics can take one or both of two ways: (1) real-time analysis, which helps to find out irregularities in the collected data and act as fast as possible to prevent an undesired scenario, or (2) long-term analysis, which uses the massive data collected and exploits the insights to identify future trends and opportunities.
With real-time analysis we face the "Velocity" challenge. In major applications, we have to process the data and get information and decisions in real time, before the next data is generated. For example, in network security we have to detect any threat and stop it as early as the network connection starts to be established, preventing individuals from entering, interacting with or damaging the system. With long-term analysis, which requires batch processing, we face the "Volume" challenge: where to store the huge amount of data, and how to process it accurately and in a reasonable amount of time and cost?




The two different ways of processing data lead to the necessity of a Cloud-based solution for Big Data analytics that is able to address both the volume and the velocity challenges, as highlighted in Figure 1.

Fig. 1: A diagram for a Cloud based model for real-time and long-term analysis of Big Data

This model could be built on existing Cloud Computing infrastructures; however, these infrastructures (computing resources) do not offer the same level of Quality of Service, nor in general the same charging model. Therefore, it is important to be able to evaluate the best offers in terms of performance and services. It is also important to identify the trade-off between the provided performance and the cost. In particular, when the volume/complexity/velocity of the data increases, it is important to understand how the processing performance is positively or negatively impacted by the provided computing resources. These questions constitute the challenges we are addressing in this paper in order to derive a performance/cost model for executing Big Data analytics in Cloud Computing for intrusion detection applications.
The data is related to "network connections", and the aim is to specify and deploy a machine learning solution able to detect malicious connections and stop them as soon as possible. Information about new data connections to the protected system is sent to an intrusion detection system that is running in the Cloud and is processed in real time (streaming). If the intrusion detection system detects an anomaly, it blocks the connection and/or notifies the network administrator. The decision-making is the result of batch processing/machine learning applied to the data accumulated over time.
The remainder of this paper is organized as follows: after preliminaries presenting Big Data and Big Data analysis in Section II, Section III presents the proposed intrusion detection solution based on Cloud Computing and Big Data analytics. The solution uses the services provided by the Amazon Cloud provider, namely AWS (Amazon Web Services). Section IV presents an implementation of the system as well as the used dataset and the machine learning model; in this section we also present the results of the conducted performance/cost analysis. In Section V, we discuss the results and highlight the cost benefit of our solution against a traditional one. Finally, we conclude the work in Section VI and present some perspectives.

II. PRELIMINARIES

A. What is Big Data?
There is no stable definition of Big Data. We can use Big Data to describe a massive volume of both structured and unstructured data that is so large, and increasing so fast, that it is very difficult to process using traditional tools, databases and software techniques. To deal with this kind of data, a multitude of tools and frameworks are available, including the famous Hadoop (and Spark) ecosystem. The underlying techniques behind these tools and frameworks are distribution, parallelism and clustering of computing resources.

B. Characteristics of big data
Big data is commonly characterized using a number of V's. The first four are Volume, Velocity, Variety and Veracity, as shown in Figure 2. Volume refers to the vast amount of data that is generated every second, minute, hour and day in our digitized world. Variety refers to the ever-increasing various forms of captured data, such as text, images, voice, geospatial, raw, etc. Velocity refers to the speed at which data is being generated and the pace at which data moves from one point to the next. Veracity refers to the quality and the reliability of the data.

Fig. 2: Big Data 4V's

C. Turning Data into Value
Data analytics, backed by the expansion of computing power, is enabling companies to extract maximum value from data to get the best insights. The way they help us move from data to insight and value is called the 'wisdom hierarchy' [3]. The wisdom hierarchy is a conceptual framework for thinking about how the raw inputs of reality (signals) are stored as data and transformed first into information, then into knowledge, and finally into wisdom (Figure 3). It summarizes the data analysis process. In other words, the 'wisdom hierarchy' represents a path from gathering and exploring raw data, to machine learning that enables getting knowledge from raw data, and finally to artificial intelligence that ensures a deep understanding of knowledge. Based on this hierarchy, we discover the importance of machine learning in the big data domain to get insights from raw data, make predictions and then take the appropriate decisions; this is what we call 'data mining'. So what is machine learning? And what are the machine learning models?

Fig. 3: Wisdom hierarchy




D. Machine learning
Machine learning is an artificial intelligence technique that allows software applications to become more accurate in predicting outcomes without being explicitly programmed. The basic premise of machine learning is to build algorithms that can receive a large amount of input data and then predict a reliable output. Based on this definition, machine learning takes data as input; the input data is called 'training data'. The desired output is called 'targets' or 'labels'. Machine learning models are often categorized as supervised or unsupervised. Supervised models need humans to provide both the input ('training data') and the desired output ('targets'). Once training is complete, the algorithm must be tested on new labeled data to compute its 'accuracy'. The accuracy parameter represents the proportion of correct predictions among all predictions made with the machine learning algorithm. The desired accuracy depends on the application we deal with. As for unsupervised models, the training data does not include 'targets', so we do not tell the system where to go; the system has to understand the structure itself from the data provided.
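As a minimal illustration of the accuracy measure described above, the short plain-Python sketch below compares predicted labels with the supervised targets; it is not tied to any particular library:

    def accuracy(targets, predictions):
        # share of test samples whose predicted label matches the supervised target
        correct = sum(1 for t, p in zip(targets, predictions) if t == p)
        return correct / len(targets)

    print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))   # 0.75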
E. Data analysis process
When talking about the 'wisdom hierarchy', data analysis passes through many steps before being translated into actions. In the Big Data course provided by UCSD entitled 'Big Data Specialization' [4], these steps are clearly defined and detailed. They start with data acquisition and end with decision-making.
1) Acquiring data: The first step in acquiring data is to determine what data is important. Leaving out even a small amount of important data can lead to incorrect conclusions.
2) Exploring data: Exploring data is part of the two-step data preparation process. We need to do some preliminary investigation in order to gain a better understanding of the specific characteristics of the data. In this step, we will be looking at things such as correlations, general trends and outliers.
3) Pre-processing data: There are two main goals in the data pre-processing step. The first is to clean the data to address data quality issues, and the second is to transform the raw data to make it suitable for analysis.
4) Analyzing data: Data analysis involves building a model from the clean data, which is called input data. The input data is used by the analysis technique to build a model. What the model generates is the output data. There are different types of problems, and so there are different types of analysis techniques. The main categories of analysis techniques are classification, regression, clustering, association analysis and graph analysis.
In classification, the goal is to predict the category of the input data. When the model has to predict a numeric value instead of a category, the task becomes a regression problem. In clustering, the goal is to organize similar items into groups. The goal in association analysis is to come up with a set of rules to capture associations between items or events.
Let us briefly look at how to evaluate each technique. For classification and regression, we have the correct output for each sample in the input data; comparing the correct output and the output predicted by the model provides a way to evaluate the model. For clustering, the groups resulting from clustering should be examined to see if they make sense for our application. For association analysis, some investigation will be needed to see if the results are correct.
In summary, data analysis involves selecting the appropriate technique for our problem, building the model, and then evaluating the results.
5) Turning insights into actions: The next step is to determine what action or actions should be taken based on the insights gained. This is the first step in turning insights into action. Once we have determined what action to take, the next step is to study how to implement the action.




F. Techniques and tools for big data analysis
Hadoop is an open-source framework that allows large datasets to be stored and processed in a parallel and distributed fashion. It runs on clusters of commodity servers and can scale up to support thousands of hardware nodes and massive amounts of data. Consequently, Hadoop became a data management platform for big data analytics. As the diagram of Figure 4 shows, different layers can operate in this ecosystem:
  • The Hadoop 'HDFS' (Hadoop Distributed File System), which is a parallel and distributed storage unit.
  • The Hadoop 'YARN' (Yet Another Resource Negotiator), a resource manager layer that interacts with applications and schedules resources for their use.
  • 'MapReduce', a programming model for processing large amounts of data in a parallel and distributed fashion, composed of Map() and Reduce() procedures.
  • The 'Storm' platform, a distributed, real-time data processing platform.
  • The 'Spark' platform, an open-source big data processing framework built around speed, ease of use and sophisticated analytics. In addition to MapReduce operations, it supports SQL queries, streaming data, machine learning and graph processing. It also offers a shell for Python.
  • 'Cassandra', 'MongoDB' and 'HBase', which are distributed databases.

Fig. 4: Layers diagram in Hadoop ecosystem

In our cloud-based solution, we will use the 'Spark' framework above a Hadoop cluster so that we can ensure real-time processing and sophisticated analytics like machine learning and data mining.
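To give a feel for how Spark is driven on top of such a cluster, the following minimal PySpark sketch reads a CSV file from HDFS and runs a distributed aggregation. The file path and the column name '_c41' (the default name Spark gives to the last column of a header-less CSV) are illustrative assumptions, not part of our deployment:

    from pyspark.sql import SparkSession

    # start (or reuse) a Spark session on the cluster
    spark = SparkSession.builder.appName("hadoop-spark-demo").getOrCreate()

    # read a header-less CSV stored in HDFS and count records per label column
    connections = spark.read.csv("hdfs:///data/kddcup.data", inferSchema=True)
    connections.groupBy("_c41").count().show()

    spark.stop()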
G. Cloud Computing
Cloud Computing is a paradigm in which any user with an internet connection can rent computing resources as needed from a cloud operator owning large datacenters and offering services. This is mainly an economic revolution in the IT/networking field, made possible by the huge advances in virtualization technology and datacenters. Cloud Computing services cover a vast range of options, going from the basics of storage, networking and processing power through to natural language processing and artificial intelligence services. A fundamental concept behind cloud computing is that the location of the service, and many of the details such as the hardware or the operating system on which it is running, are largely irrelevant to the user. There are many benefits of cloud computing that make this field a very important one; three of the main ones are self-service provisioning, elasticity and pay-per-use.

H. Literature review
With the exponential growth of data and the digitization of everything, cyber-attacks have become widespread and threaten organizations' security and personal privacy. On the other hand, with the evolution of big data technology, new frameworks and techniques have appeared, providing solutions for real-time and complex analysis aimed at discovering trends and behaviors, preventing malicious activity and detecting fraud. Big Data analytics promises significant opportunities for solving different information security problems [5]. For this reason, research and works have evolved in the domain of Big Data and cyber-security.
Security is now a big data problem, because the data that has a security context is huge. Hence, to make big data processing and computing infrastructures more secure, the authors in [6] summarized some security and privacy related issues. In a big data environment, it is more complicated and difficult to store and process organizations' and customers' information in a secure manner, given the huge volume, the increasing velocity and the variety of the generated data. In this context, the author of [7] proposed some solutions for intrusion detection and for limiting threats and attacks. One of these solutions was to implement a MapReduce machine learning model that can distinguish between bad and normal connections based on features and metrics such as the flow duration, the average bytes per packet in the flow, and the average bytes per second in the flow. The proposed model may use the collected network traffic, consisting of both normal flows and potential attack flows, to train a logistic regression (LR) or naïve Bayes network that works as a binary classifier.
One of the machine learning techniques developed in the big data and cyber-security domains is the neural network approach, which plays an interesting role in discovering patterns and malicious user activity. Ana-Maria Ghimes and Victor-Valeriu Patriciu proposed a neural network model that consists of several case studies on algorithms and architectures of neural networks for determining the best way to discover the malicious patterns of new attacks in data [8]. For test purposes, they used datasets provided by the UCI Machine Learning Repository [9]. For classification, they used repositories like the "Detect Malicious Executable (AntiVirus) Data Set", which consists of malicious and non-malicious samples. They started the study with a simple implementation of a neural network and continued using pruning techniques to find the optimal network. The best model they obtained in the pruning process had the hyperbolic tangent function as the activation function and consisted of 4 layers: an input layer with 22 neurons, an output layer with one neuron and two hidden layers with 8 and 5 hidden neurons, respectively. Once the pruning is completed, the training process is initiated to update the connection weights and obtain the best performance.




III. PROPOSED CLOUD BASED SOLUTION FOR INTRUSION DETECTION

As the speed and the volume of network data increase, in particular for connections to remote servers, the need to perform the data analysis in real time with machine learning algorithms and to extract a deeper understanding from the data becomes crucial for all businesses, organizations and governments. At the same time, to cope with the increasing velocity, volume and even complexity of the generated data, the use of big data tools and techniques that ensure parallelism in computing, scalability and reliability is a necessity.
Consequently, and given the importance of both real-time and long-term analysis in the majority of domains today, we created a cloud-based system for both stream and batch processing using the Amazon Cloud, which provides all the Big Data techniques and tools we need to build such a system, at a low cost and without the need to procure hardware or maintain infrastructure. To implement our solution on Amazon, we relied on a set of Amazon Web Services (AWS) that can be connected together. Some of these services are for capturing data streams, others for computing and processing, and others for storage and notification. In the following we describe the services we used in our model.

A. Used AWS services
1) AWS IoT Core: The Amazon IoT service is used to connect IoT devices, receive data from these devices using the MQTT protocol and publish the messages to a specific 'topic'. In AWS IoT, a 'thing' represents any connected device. Additionally, AWS IoT 'rules' applied to the received data give IoT-enabled devices the ability to interact with other AWS services. One rule can combine more than one 'action'.
As of today, this service costs 0.08$ per million minutes of connection, 1$ per million messages and 0.15$ per million rules triggered.
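As an illustration of the publishing side, the sketch below streams records from the Raspberry Pi to AWS IoT Core over MQTT using the generic paho-mqtt client; the endpoint, certificate file names and topic are placeholders for the ones generated when the device is registered as a 'thing' (the AWS IoT Device SDK for Python could be used in the same way):

    import gzip, time
    import paho.mqtt.client as mqtt

    ENDPOINT = "xxxxxxxxxxxx-ats.iot.us-east-1.amazonaws.com"   # placeholder account endpoint
    TOPIC = "connections/records"                               # placeholder topic name

    client = mqtt.Client()
    client.tls_set(ca_certs="AmazonRootCA1.pem",
                   certfile="device-certificate.pem.crt",
                   keyfile="device-private.pem.key")
    client.connect(ENDPOINT, 8883, keepalive=60)
    client.loop_start()

    # publish one unlabeled KDD record per tick
    with gzip.open("kddcup.testdata.unlabeled_10_percent.gz", "rt") as records:
        for line in records:
            client.publish(TOPIC, line.strip(), qos=1)
            time.sleep(0.001)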
2) Amazon Kinesis Stream: Amazon Kinesis Streams can continuously capture and store terabytes of data per hour from hundreds of thousands of sources. Data records are accessible for a default of 24 hours from the time they are added to a stream. During that window, data is available to be read, re-read, backfilled and analyzed, or moved to long-term storage. With Amazon Kinesis, streaming data can be ingested, buffered and processed in real time, so insights can be derived in seconds or minutes instead of hours or days.
Pricing is based on two core dimensions: Shard Hours and PUT Payload Units. A 'shard' is the base throughput unit of an Amazon Kinesis stream; an Amazon Kinesis stream is made up of one or more shards. Each shard provides a capacity of 1MB/sec data input and 2MB/sec data output, and can support up to 1000 write and 5 read transactions per second. The number of shards needed within the stream is specified based on the throughput requirements, and shards are charged at an hourly usage rate.
A record is the data that the producer adds to the Amazon Kinesis stream. A PUT Payload Unit is counted in 25KB payload "chunks" that comprise a record. For example, a 5KB record contains one PUT Payload Unit, a 45KB record contains two PUT Payload Units, and a 1MB record contains 40 PUT Payload Units. PUT Payload Units are charged at a per-million-units rate. For each shard, the cost is 0.015$ per hour, and 0.014$ per million PUT Payload Units.
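To make the charging model concrete, the small helper below counts PUT Payload Units the way the examples above do (using decimal 25KB chunks); it is only an illustration of the pricing rule:

    import math

    def put_payload_units(record_bytes):
        # one PUT Payload Unit per started 25KB chunk of a record
        return math.ceil(record_bytes / 25_000)

    print(put_payload_units(5_000))      # 1  -> 5KB record
    print(put_payload_units(45_000))     # 2  -> 45KB record
    print(put_payload_units(1_000_000))  # 40 -> 1MB record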




3) Amazon Elastic Compute Cloud (EC2): Amazon EC2 provides scalable computing capacity in the Amazon Web Services (AWS) cloud. Using Amazon EC2 eliminates the need to invest in hardware up front, allowing applications to be developed and deployed faster. Amazon EC2 can be used to launch as many or as few virtual servers as needed, configure security and networking, and manage storage. With On-Demand instances, only EC2 instance usage is charged, per hour and depending on which EC2 instance type is used. For example, for the c5.2xlarge instance type (8 CPUs and 16GiB of RAM), the cost is 0.34$ per hour.
4) Elastic MapReduce (EMR): Amazon EMR is a highly distributed computing framework to process and store data easily, quickly and in a cost-effective manner. Amazon EMR uses Apache Hadoop, an open-source framework, to distribute the data and the processing across a resizable cluster of Amazon EC2 instances, and allows the most common Hadoop tools, such as Hive, Pig, Spark and so on, to be used. With Amazon EMR, more core nodes can be added at any time to increase the processing power.
Amazon EMR pricing is simple and predictable: we pay a per-second rate for every second we use, with a one-minute minimum. For example, a 10-node cluster running for 10 hours costs the same as a 100-node cluster running for 1 hour. The hourly rate depends on the instance type used. For example, for the c5.2xlarge EC2 instance type the cost is 0.085$ per hour.
5) Simple Storage Service (S3): Amazon S3 is carefully engineered to meet the requirements for scalability, reliability, speed, low cost and simplicity. We pay 0.023$ per GB for the first 50 TB per month, 0.022$ per GB for the next 450 TB per month, and 0.021$ per GB for over 500 TB per month.
6) Simple Notification Service (SNS): SNS is a fully managed push notification service that allows individual messages to be sent to large numbers of recipients. Amazon SNS makes it simple and cost-effective to send push notifications to mobile device users or email recipients, or even to send messages to other distributed services.
SMS messages sent to non-US phone numbers are charged. For example, to send a message to Lebanon, the cost per message is 0.04746$ for an Alfa line and 0.05192$ for a Touch line.
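A minimal sketch of this notification step with the boto3 SDK is shown below; the region and the phone number are placeholders:

    import boto3

    sns = boto3.client("sns", region_name="us-east-1")
    sns.publish(PhoneNumber="+961XXXXXXXX",
                Message="Intrusion detection: a bad network connection was flagged")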


B. Our proposed model
These services should be carefully connected to form our Cloud-based solution for real-time and batch processing of Big Data, as shown in Figure 5. First, we create a Spark cluster of a specific EC2 instance type using Amazon EMR. After uploading the data to an S3 bucket, the data is pulled by the Spark cluster to train a machine learning network. A Raspberry Pi (RPi), which plays the role of any connected device, is connected to Amazon IoT Core. Data sent from the RPi is published to a specific topic on which a Kinesis rule is applied to push the data into the Kinesis stream shard. In the EMR Spark cluster, the streaming data is pulled to be predicted with the machine learning model already built. In case of anomalies, a notification is sent with Amazon SNS to a specific phone number.

Fig. 5: Our Cloud Based solution for Big Data analytic

IV. DATASET, PROCESSING MECHANISM AND RESULTS

A. Network connections dataset
To build this solution, we have used the 'KDD Cup' dataset. This is a typical big dataset that helped us to perform the performance/cost analysis of the complete model in the cloud. This dataset is the one used for the Third International Knowledge Discovery and Data Mining Tools Competition. The competition was to evaluate different network intrusion detectors, i.e., predictive models capable of distinguishing between 'bad' connections, called intrusions or attacks, and 'good' normal connections.
This dataset contains records about network connections with 41 features, like the connection duration, the number of segments sent, the number of failed segments, the protocol (tcp or udp), the service (http, ...) used and, in the case of labeled data, the label that indicates the status of the connection ('normal', 'buffer-overflow', 'guess-passwd', etc.).
The dataset consists of 6 compressed data files. Each file contains a distinct set of data, including labeled data to use for training and testing the network and unlabeled data to use for prediction [10]. In our experiment, we have used the full data file 'kddcup.data.gz' for training the network, the 'corrected.gz' data file with corrected labels for testing the machine learning network, and the unlabeled data file 'kddcup.testdata.unlabeled-10-percent.gz' to stream data from the RPi, thus simulating new connections that should be analyzed in real time. Figure 6 shows a portion of the full data file 'kddcup.data.gz'.

Fig. 6: Sketch of a portion of the data used

B. Processing mechanism
We have used logistic regression, a supervised machine learning model, on this dataset to distinguish between normal and bad network connections. The machine learning model was inspired by a tutorial applied to the same dataset, testing different algorithms in Apache Spark [11]. The following is a quick walk-through of the process presented in Figure 5 (illustrative code sketches for the main steps are given right after the list):




  • First, we uploaded the full dataset 'kddcup.data.gz', together with the 'corrected.gz' dataset, to an S3 bucket.
  • We created a Spark cluster on EMR to process the large amount of data. In this solution we used a cluster of 3 EC2 instances of the c5.2xlarge instance type (8 CPUs and 16GB of RAM for each instance).
  • On the EMR cluster, we created a machine learning "Logistic Regression" model that takes as input the numerical features from the 'kddcup' data and as target the label '0' in case of a normal connection and '1' in case of a bad connection. We used 2 million records of 'kddcup.data.gz' for training the network. The 'corrected.gz' data file is also used to compute the accuracy and to validate this network.
  • We downloaded the 'kddcup.testdata.unlabeled-10-percent.gz' data file on the Raspberry Pi. We pulled records from this file and sent them to Amazon IoT Core. The generated data is published to a topic on which a rule is applied to send the data to a Kinesis stream. Then, on the EMR cluster, we pulled each record from the Kinesis stream to predict it using the trained machine learning network.
  • In case of a bad record (bad network connection), a notification is sent to a specific phone number using the Amazon SNS service.
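The two sketches below illustrate the batch and the streaming parts of this walk-through. The first one trains the logistic regression model on the labeled files stored in S3 with Spark's RDD-based MLlib API, in the spirit of the tutorial in [11]; the bucket name is a placeholder, and dropping the three categorical columns is one possible way of keeping only the numerical features:

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS

    sc = SparkContext(appName="kdd-logistic-regression")

    def parse(line):
        f = line.split(",")
        # keep duration plus the numerical features; drop protocol, service and flag (columns 1-3)
        features = [float(x) for x in f[0:1] + f[4:-1]]
        label = 0.0 if f[-1] == "normal." else 1.0     # 0 = good connection, 1 = bad connection
        return LabeledPoint(label, features)

    train = sc.textFile("s3://my-bucket/kddcup.data.gz").map(parse)    # bucket name is a placeholder
    test = sc.textFile("s3://my-bucket/corrected.gz").map(parse)

    model = LogisticRegressionWithLBFGS.train(train, iterations=10)

    # accuracy on the labeled test file
    correct = test.map(lambda p: (p.label, float(model.predict(p.features)))) \
                  .filter(lambda lp: lp[0] == lp[1]).count()
    print("accuracy = %.4f" % (correct / float(test.count())))

The second sketch is a simplified polling consumer for the prediction side: it reads records from the Kinesis shard with boto3, predicts each one with the trained model, and raises an SMS alert through SNS for bad connections. In our deployment this logic runs as a streaming job on the EMR cluster; the stream name and the phone number are placeholders:

    import time
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")
    sns = boto3.client("sns", region_name="us-east-1")

    shard_it = kinesis.get_shard_iterator(StreamName="connections-stream",   # placeholder name
                                          ShardId="shardId-000000000000",
                                          ShardIteratorType="LATEST")["ShardIterator"]
    while True:
        out = kinesis.get_records(ShardIterator=shard_it, Limit=100)
        for rec in out["Records"]:
            point = parse(rec["Data"].decode())            # same parser as in the training sketch
            if model.predict(point.features) == 1:         # 1 = bad connection
                sns.publish(PhoneNumber="+961XXXXXXXX",
                            Message="Intrusion suspected for a received connection record")
        shard_it = out["NextShardIterator"]
        time.sleep(0.2)                                    # respect the 5 read transactions/s per shard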
                                                                      V. P ERFORMANCE /C OST A NALYSIS OF THE I NTRUSION
C. Performance/Cost Analysis of the Training Phase
To perform this machine learning, we tried different types of EC2 instances as well as different numbers of instances in the cluster. The objective is to derive the most cost-effective configuration (i.e., a trade-off between the processing time and the cost). After several configuration tests in the real Amazon cloud, we have been able to complete Table I and Table II.

TABLE I: Training time (in seconds) for different c5.xlarge cluster sizes and different volumes of training data.

                 2M records   3M records   5M records
   3 instances        —            —            —
   5 instances     152.58          —            —
   6 instances     142.3           —            —

TABLE II: Training time (in seconds) for different c5.2xlarge cluster sizes and different volumes of training data.

                 2M records   3M records   5M records
   3 instances     118.17       162.496         —
   5 instances     101.22       132.9           —
   6 instances      90.4        129             —

As shown in Table II, we notice that a c5.2xlarge cluster with 3 instances can train the network with a maximum of nearly 3M records. With the same type of virtual machine but with 5 instances, it is possible to reach a maximum of 3.5M records. Eventually, when we increased the number of instances to 6, we became able to train the network with all the available records in the 'kddcup.data.gz' data file, i.e., 5M records.
When using a c5.xlarge cluster with 5 or even 6 instances, it was not possible to train the network with much more than 2M records, as shown in Table I. Hence, we notice that the processing time decreases as the size of the cluster increases and as a more powerful instance type is used. Since the training with 2 million records has given us a good accuracy, equal to 0.9195, we found it the right configuration to train the network.
From a cost perspective, using a c5.2xlarge cluster, the usage cost for each EC2 instance is 0.34$/h, plus 0.085$/h for EMR. For a c5.xlarge cluster, the cost decreases to 0.199$/h for each EC2 instance, plus 0.052$/h for EMR. If we had used a c5.xlarge cluster with 5 instances, the total cost per hour would have been 5*0.199+0.052=1.047$/h and the training would have taken 152.58 secs. Similarly, if we had used a c5.2xlarge cluster with 3 instances, the total cost per hour would have been 3*0.34+0.085=1.105$/h and the training would have taken 118.17 secs. Therefore, we have decided to use a cluster of 3 c5.2xlarge instances to train the network with 2M records (480MB).
Regarding the response time of the analysis, which we wanted to be as close to real time as possible, we noticed that no record exceeds 150 bytes. It was therefore enough to use only one shard, on condition that the time between two sent records is at least 1 ms, while the time between two received records is at least 0.2 sec. The obtained response time was very low and eventually did not exceed one millisecond. If we had to send or receive faster or bigger records, we would have needed to increase the number of shards. Kinesis Stream is a very efficient service to ensure the scalability of our model as well as to keep the stream processing in real time. For each created shard, the price was 0.015$/h, while it was 0.014$ for each million PUT Payload Units. In our case, each record contained only one PUT Payload Unit. In one second we generated 1000 records, i.e., 3,600,000 PUT Payload Units per hour. Therefore, rounding this up to 4 million units per hour, the bill to pay was 4*0.014+0.015=0.071$/h.

V. PERFORMANCE/COST ANALYSIS OF THE INTRUSION DETECTION PHASE

A. Presentation of the Case Study
Assume we have 20 computing devices that all generate the same type of data (the one used to build our model). Assume that each record does not exceed 150 bytes; therefore, each record can be included in one PUT Payload Unit. Assume also that each device generates data at a speed of 50 records per second; therefore, one shard is enough to absorb this input velocity (each shard actually supports 1000 write transactions per second, corresponding to the 20 devices generating 50 writes per second). Assume the required output velocity is at least 1 output record per second. To achieve this performance, it is necessary to subscribe to 4 shards (since each shard supports 5 read transactions per second, for a total of 4*5=20 read transactions per second for the 4 shards).
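The shard sizing above can be reproduced with a few lines of arithmetic; the sketch below simply restates the reasoning of the case study:

    import math

    DEVICES, RATE_PER_DEVICE, RECORD_BYTES = 20, 50, 150

    writes_per_s = DEVICES * RATE_PER_DEVICE                      # 1000 writes/s
    input_mb_s = writes_per_s * RECORD_BYTES / 1e6                # 0.15 MB/s

    # write side: each shard absorbs 1000 writes/s and 1 MB/s
    shards_for_writes = max(math.ceil(writes_per_s / 1000),
                            math.ceil(input_mb_s / 1.0))          # -> 1 shard

    # read side: one read per second for each of the 20 producers,
    # with at most 5 read transactions/s per shard
    shards_for_reads = math.ceil(DEVICES / 5)                     # -> 4 shards

    print(max(shards_for_writes, shards_for_reads))               # 4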




The output velocity requirement depends on the consuming application. For example, some applications may need to process data very quickly (critical applications) while other applications may accept processing the data more slowly. Suppose a Spark cluster is built on Amazon EMR with 3 c5.2xlarge instances to read the output records from the shards and process them in real time. Assume the used prediction model detects 70 anomalies on average per month; therefore, about 70 notification messages will be sent monthly.

B. Monthly Cost Analysis
In this section, we evaluate the monthly cost for the case assumed above. If the used configuration is the c5.2xlarge instance type, the cost will be 0.34$/h for each instance plus 0.085$/h for EMR. Therefore, the cost for the whole Spark cluster will be 0.34*3+0.085=1.105$ per hour. We can deduce the monthly cost, which is in this case 1.105$*24*30=795.6$.
To ensure real-time processing, the data records will be streamed using Amazon Kinesis. For each shard, the cost will be 0.015$/h, and for each million PUT Payload Units, the cost will be 0.014$. Assuming 1000 records are generated per second and 4 shards are used for the streaming, the monthly cost will be 0.015*24*30=10.8$ per shard, i.e., 4*10.8=43.2$ for the four used shards, and (0.014*3600*24*30*1000)/1000000=36.288$ per month for the PUT Payload Units. We can deduce the cost for the Amazon Kinesis stream service usage, which is 36.288+43.2=79.488$ monthly in our case.
For the Alfa line messaging service usage, the cost per sent message will be 0.04746$. If we suppose that we need to send a mean of 70 messages per month, the monthly usage cost for this service will be 70*0.04746=3.3222$.
Assuming we decide not to exceed the free tier offered for AWS IoT Core and Amazon S3, such a solution will cost in total nearly 878.41$ per month, corresponding to the sum of the individual costs 795.6$+79.488$+3.3222$. This is obviously an advantageous solution compared to hiring one engineer in the company.
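The monthly bill of this case study can be recomputed directly from the unit prices quoted above; the sketch below reproduces the arithmetic (the four shard-hours are included in the Kinesis part):

    HOURS_PER_MONTH = 24 * 30

    emr_cluster = (3 * 0.34 + 0.085) * HOURS_PER_MONTH                  # 795.6 $ : 3 c5.2xlarge + EMR fee
    shard_hours = 4 * 0.015 * HOURS_PER_MONTH                           # 43.2 $  : 4 shards billed per hour
    put_units = 0.014 * (1000 * 3600 * HOURS_PER_MONTH) / 1_000_000     # 36.288 $: 1000 records/s, 1 unit each
    sms_alerts = 70 * 0.04746                                           # 3.3222 $: Alfa line

    print(round(emr_cluster + shard_hours + put_units + sms_alerts, 2))  # ~878.41 $ per month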
C. Comparison with an on-premises solution
If the decision is to deploy this model on premises, it is necessary to use a minimum of three powerful computers and to pay both CAPEX (hardware cost) and OPEX (engineering cost to configure the specific environment and software on the hardware). One should also not neglect the maintenance cost of the used hardware and software. Moreover, this solution does not ensure scalability, since it is necessary to buy more hardware in case of an increase in the demand for computing or storage resources. With Big Data services deployed in Cloud Computing, it is possible to use hardware and software as commodities with a guarantee of performance. In this case, there is no need to worry about installation, maintenance or upgrades, since they are part of the Cloud Computing service.
For this reason, we advocate building this solution in the Cloud, taking benefit of all the services provided by the Cloud operators, while performing an appropriate performance/cost analysis of the different deployment possibilities of the service, such as the one presented in this paper.
This approach can also be applied to other applications with different response-time or batch-processing requirements. In the case of machine learning, it is also important to include in the study a cost/performance analysis of the network resources needed to deal with the size and the speed of the data to be manipulated.

VI. CONCLUSIONS

In conclusion, the big data cloud-based model we built can reach the desired results in terms of response time and accuracy, at a relatively low cost. We always have a cost/performance trade-off: the increase in complexity, speed and volume of the data leads to using more Cloud resources with higher capabilities and hence paying more. So we have to define the exact needs on the cloud in order to reduce costs and then build an efficient model.
Using the Amazon Cloud, which provides many services to address Big Data analytics requirements, we built a cloud-based solution that deals with both the real-time and the batch processing issues. This complete solution can help meet stringent business requirements in the most optimized, performant and resilient possible way. It can also be used in many domains that require real-time anomaly detection or a long-term analysis to get insights and trends from stored data. The role of the engineer is to provision the requested resources from the cloud operator to satisfy the needs at the lowest possible cost. Performance on the cloud is always guaranteed once the rented resources are sufficient for the needed processing use case.

ACKNOWLEDGMENTS

This research was supported by the Lebanese University and CNRS Lebanon. Part of this work was also conducted in the frame of the PHC CEDRE Project N37319SK.

REFERENCES

[1] Sampriti Sarkar, "Convergence of Big Data, IoT and Cloud Computing for Better Future", Analytics Insight, 2017.
[2] Eric Schmidt
[3] Adam Michael Wood, "The wisdom hierarchy: From signals to artificial intelligence and beyond", O'Reilly Data Newsletter, 2017.
[4] "Big Data Specialization", University of California San Diego, 2018.
[5] Rasim Alguliyev, Yadigar Imamverdiyev, "Big Data: Big Promises for Information Security", 2014 IEEE 8th International Conference on Application of Information and Communication Technologies (AICT), 2014.
[6] Aditya Dev Mishra, Yooddha Beer Singh, "Big Data Analytics for Security and Privacy challenges", International Conference on Computing, Communication and Automation (ICCCA 2016), 2016.
[7] M. S. Al-kahtani, "Security and Privacy in Big Data", International Journal of Computer Engineering and Information Technology, 2017.
[8] Ana-Maria Ghimes, Victor-Valeriu Patriciu, "Neural Network Models in Big Data Analytics", 9th International Conference on Electronics, Computers and Artificial Intelligence (ECAI), 2017.
[9] UCI Machine Learning Repository [Online]. Available: http://archive.ics.uci.edu/ml/
[10] Knowledge Discovery and Data Mining Tools Competition 1999 Data [Online]. Available: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99
[11] Spark Python Notebooks [Online]. Available: https://github.com/jadianes/spark-py-notebooks/blob/master/README.md



