<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Comparing Performances of Big Data Stream Processing Platforms with RAM3S (extended abstract)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ilaria Bartolini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Patella</string-name>
          <email>marco.patellag@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DISI - Alma Mater Studiorum, Università di Bologna</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Nowadays, Big Data platforms allow the analysis of massive data streams in an efficient way. However, the services they provide are often too raw, thus the implementation of advanced real-world applications requires a non-negligible effort for interfacing with such services. This also complicates the task of choosing which one of the many available alternatives is the most appropriate for the application at hand. In this paper, we present a comparative study of the three major open-source Big Data platforms for stream processing, as performed by using our novel RAM3S framework. Although the results we present are specific to our use case (recognition of suspect people from massive video streams), the generality of the RAM3S framework allows both considering such results as valid for similar applications and implementing different use cases on top of Big Data platforms with very limited effort.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The huge amount of information produced by modern society could help in
several important tasks of public life, such as security, medicine, the environment,
and a smarter use of cities, but it is rarely exploited due to, among other things,
aspects of privacy, ethics, and the availability of appropriate technology. The
emerging Big Data paradigm provides opportunities for the management and
analysis of such large quantities of information, but the services such systems
provide are often too raw, since they focus on issues of fault tolerance, increased
parallelism, etc.</p>
      <p>In order to help bridge the technological gap between the facilities provided
by Big Data analysis platforms and advanced applications, in this paper we
focus on the real-time analysis of massive multimedia streams, where data come
from multiple data sources (like sensors, cameras, etc.) that are widely distributed
across the territory. The goal of such analysis is the discovery of new and
hidden information from the output of data sources as it occurs, thus with very
limited latency. A strong technological support is therefore needed to analyze
the information generated by the growing presence of multimedia sensors. Such
technologies have to meet strict requirements: rapid processing, high accuracy,
and minimal human intervention. In this context, Big Data techniques can help
this automated analysis, e.g., to enable corrective actions or to signal a state of
security alarm for citizens.</p>
      <p>To this end, we exploit our RAM3S (Real-time Analysis of Massive
MultiMedia Streams) framework for the automated real-time analysis of multimedia
streams to compare the performance of three different open source engines for
the analysis of streaming Big Data. In particular, we focus on the study of face
detection for the automatic identification of "suspect" people: this can be
useful, for example, in counter-terrorism, protection against espionage, intelligent
control of the territory, or smart investigations.1</p>
      <p>Although the problem of stream analysis is not novel per se [5,2], its
application in the presence of several multimedia (MM) streams is challenging, due
to the very nature of MM data, which are complex, heterogeneous, and of large
size. This makes the analysis of MM streams computationally expensive, so that,
when deployed on a single centralized system, computing power becomes a
bottleneck. Moreover, the size of the knowledge base could also prevent its storage
on a single computation node.</p>
      <p>In order to allow efficient analysis of massive multimedia streams in real
scenarios, we advocate scaling out the underlying system by using platforms for
Big Data management. Since we deal with online applications, we are
interested in analyzing the data as soon as these are available, in order to exploit
their "freshness". Therefore, we concentrate on platforms for Big Data
analytics following the stream processing paradigm, as opposed to systems for batch
processing [6]. While in the latter, whose main representative is Apache Hadoop
(hadoop.apache.org), data are first stored on a physical support (usually on a
Distributed File System, DFS) and then analyzed, in stream processing
data are not stored, and the efficiency of the system depends on the amount
of data processed while keeping latency low, at the second, or even millisecond,
level.</p>
    </sec>
    <sec id="sec-2">
      <title>Big Data Analysis Platforms</title>
      <p>In this section, we describe the alternatives available for the analysis of Big
Data under the stream processing paradigm. This is necessary to grasp the common
characteristics present in such systems, which will be exploited to guarantee the
generality of the proposed framework.</p>
      <p>
        <bold>Apache Spark Streaming</bold> Apache Spark [
        <xref ref-type="bibr" rid="ref9">10</xref>
        ] (spark.apache.org) is an open
source platform, originally developed at the University of California, Berkeley's
AMPLab. The main idea of Spark is that of storing data into a so-called Resilient
Distributed Dataset (RDD), which represents a read-only, fault-tolerant
collection of (Python, Java, or Scala) objects partitioned across a set of machines that
can be stored in main memory. RDDs are immutable and their operations are
lazy. The "lineage" of every RDD, i.e., the sequence of operations that produced
it, is kept so that it can be rebuilt in case of failures, thus guaranteeing
fault tolerance. Every Spark application is therefore a sequence of operations on such
RDDs, with the goal of producing the desired final result. There are two types of
operations that can be performed on an RDD: a transformation is any operation
that produces a new RDD, built from the original one; on the other hand, an
action is any operation that produces a single value or writes an output to disk.
The kernel of Spark is Spark Core, providing functionalities for distributed task
dispatching, scheduling, and I/O. The program invokes the operations on Spark
Core exploiting a functional programming model. Spark Core then schedules the
functions' execution in parallel on the cluster. This assumes the form of a
Directed Acyclic Graph (DAG), where nodes are operations on RDDs performed
on a computing node and arcs represent dependencies among RDDs.
(Footnote 1: Although we instantiated RAM3S for a very specific task, it is independent of the
particular application at hand, thus demonstrating its wide applicability, which is,
however, out of the scope of this paper.)
      </p>
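      <p>The transformation/action distinction and the lineage mechanism can be illustrated with a toy, single-machine sketch. The class and method names below are ours, not Spark's API: a transformation only records the operation and its parent, while an action replays the lineage to materialize a result.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy stand-in for an RDD: immutable, lazy, and aware of its lineage.
// Illustrative only -- real Spark RDDs are partitioned and distributed.
public class ToyRDD<T> {
    private final List<T> source;         // only set for the root RDD
    private final ToyRDD<?> parent;       // lineage: the RDD this one derives from
    private final Function<Object, T> op; // the transformation that produced it

    private ToyRDD(List<T> source, ToyRDD<?> parent, Function<Object, T> op) {
        this.source = source; this.parent = parent; this.op = op;
    }

    public static <T> ToyRDD<T> parallelize(List<T> data) {
        return new ToyRDD<>(List.copyOf(data), null, null);
    }

    // Transformation: records the operation, computes nothing yet.
    @SuppressWarnings("unchecked")
    public <R> ToyRDD<R> map(Function<T, R> f) {
        return new ToyRDD<>(null, this, x -> f.apply((T) x));
    }

    // Action: replays the lineage from the root and materializes a value;
    // re-running it is also how a lost partition would be rebuilt.
    public List<T> collect() {
        if (parent == null) return source;
        List<T> out = new ArrayList<>();
        for (Object x : parent.collect()) out.add(op.apply(x));
        return out;
    }
}
```
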
      <p>The Streaming version of Spark introduces receiver components, in charge
of receiving data from multiple sources, like Apache Kafka, Twitter, or TCP/IP
sockets. Real-time processing is made possible by introducing a new object, the
D-Stream, a sequence of RDDs sampled every n time units. Every D-Stream
can now be introduced in the Spark architecture as a "regular" RDD to be
elaborated, thus realizing micro-batching. A Spark Streaming application is built
from two main components: a receiver contains the data acquisition logic, defining
the OnStart() and OnStop() methods for initializing and terminating the data
input (every receiver occupies a core in the architecture); in a driver, the program
specifies the receiver(s) creating the D-Streams and the sequence of operations
transforming the data, terminated with an action storing the final result.</p>
      <p><bold>Apache Storm</bold> Apache Storm (storm.apache.org) is an open source platform,
originally developed by Backtype and afterwards acquired by Twitter. The basic
element that is the subject of computation is the tuple, a serializable user-defined
type. The Storm architecture is composed of two types of nodes: spouts are nodes
that generate the stream of tuples, binding to an external data source (like a
TCP/IP socket or a message queue), while bolts are nodes that perform stream
analysis; ideally, every bolt node should perform a transformation of limited
complexity, so that the complete computation is performed by the coordination
of several bolt nodes. Every bolt is characterized by the prepare() and the
execute(Tuple) methods; the former is used when the bolt is started, while the
latter is invoked every time a Tuple object is ready to be processed.</p>
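      <p>The bolt lifecycle described above can be sketched as follows; ToyBolt is a hypothetical stand-in for a Storm bolt, not the actual Storm API:</p>

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the bolt lifecycle: prepare() runs once when the bolt
// is started, execute() runs once per incoming tuple and performs a
// transformation of limited complexity. Names and types are illustrative.
public class ToyBolt {
    private List<String> seen;

    public void prepare() {               // one-time initialization
        seen = new ArrayList<>();
    }

    public String execute(String tuple) { // invoked for every ready tuple
        seen.add(tuple);
        return tuple.toUpperCase();       // the "limited-complexity" transformation
    }

    public int processedCount() {
        return seen.size();
    }
}
```
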
      <p>A Storm application simply defines a topology of spout and bolt nodes in
the form of a DAG, where nodes are connected by way of streams of tuples
flowing from one node to the other. The overall topology, therefore, specifies the
computation pipeline. A key concept in Storm is the acknowledgment of input
data, in order to provide fault tolerance: if a node emits a tuple and this is
not acknowledged, it can be re-processed, e.g., by forwarding it to another node.
This amounts to an "at-least-once" semantics, guaranteeing that every datum
is always processed. This allows low latency, but does not ensure the uniqueness
of the final result.</p>
      <p><bold>Apache Flink</bold> Apache Flink (flink.apache.org) is an open source platform,
originally started as a German research project [1], that was later transformed
into an Apache top-level project. Its core is a distributed streaming dataflow engine
that accepts programs in the form of a graph (JobGraph) of activities consuming
and producing data. Each JobGraph can be executed according to a single
distribution option, chosen among the several available for Flink (e.g., single JVM,
YARN, or cloud). This allows Flink to perform both batch and stream
processing, by simply turning data buffering on and off. Data processing in Flink is
based on the snapshot algorithm [3], using marker messages to save the state of
the whole distributed system without data loss or duplication. Flink saves the
topology state in main memory (or on a DFS) at fixed time intervals. The use of
the snapshot algorithm provides Flink with an "exactly-once" semantics,
guaranteeing that every datum is processed at least once and no more than once.
Clearly, this allows both a low latency and a low overhead, and also maintains
the original flow of data. Finally, Flink has a "natural" dataflow control, since
it uses fixed-length data queues for passing data between topology nodes.</p>
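      <p>The "natural" flow control given by fixed-length queues can be illustrated with a standard bounded queue; this is a conceptual sketch of the mechanism, not Flink's internal network stack:</p>

```java
import java.util.concurrent.ArrayBlockingQueue;

// Sketch of flow control via a fixed-length queue between two pipeline
// stages: once the queue is full, further input is rejected (offer())
// or, in a blocking producer, held back (put()) until the consumer
// catches up. Here the consumer never drains, so exactly `capacity`
// items are accepted.
public class BackpressureDemo {
    public static int acceptedFrames(int queueCapacity, int produced) {
        ArrayBlockingQueue<Integer> q = new ArrayBlockingQueue<>(queueCapacity);
        int accepted = 0;
        for (int i = 0; i < produced; i++) {
            if (q.offer(i)) accepted++;   // non-blocking: fails when the queue is full
        }
        return accepted;
    }
}
```
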
      <p>The main difference between Storm and Flink is that, in the former, the
graph topology is defined by the programmer, while the latter does not have
this requirement, since each node of the JobGraph can be deployed on any
computing node of the architecture.</p>
      <p><bold>The RAM3S Framework</bold>
The RAM3S framework provides a general infrastructure for the analysis of
multimedia streams on top of Big Data platforms. By exploiting RAM3S,
researchers can apply their multimedia analysis algorithms to huge data streams,
effectively scaling out methods that were originally created for a centralized
scenario, without incurring the overhead of understanding issues peculiar to
distributed systems. To be both general and effective, we designed RAM3S as a
"middleware" software layer between the underlying Big Data platforms and
the data stream application on top.</p>
      <p>The core of RAM3S consists of two interfaces that, when appropriately
instantiated, can act as a general stream analyzer: the Analyzer and the Receiver
(see below).</p>
      <sec id="sec-2-1">
        <title>Receiver: start(): void, stop(): void</title>
      </sec>
      <sec id="sec-2-2">
        <title>Analyzer: Analyzer(KnowledgeBase), analyze(MMObject): boolean</title>
        <p>The Analyzer is the component in charge of the analysis of individual
multimedia objects. In particular, every multimedia object is compared to the
underlying KB to see whether an alarm should be generated or not. On the other hand,
the Receiver should break up a single incoming multimedia stream into
individual objects that can be analyzed one by one. The interface of the Receiver
component is very simple: only the start() and stop() methods should be
defined, specifying how the data stream acquisition is initiated and terminated.
The output of the Receiver component is a sequence of MMObject instances.
The Analyzer component includes only a constructor and a single analyze()
method. The constructor has a single parameter, the KB, which will be used for
the analysis of individual objects. The analyze() method, on the other hand,
takes as input a single MMObject instance and outputs a boolean, indicating
whether the input MMObject "matches" the KB or not.</p>
        <p><bold>Applying the Framework to the Use Case</bold>
Our scenario of suspect people recognition is illustrated in Figure 1. Each camera
sends its video flow to a single Receiver, which performs the task of frame splitting
and generates a stream of still images (instances of the MMObject interface). The
Analyzer receives a flow of images and performs the face detection task first,
by exploiting a Haar cascade detector [8], followed by face recognition, by way
of an eigenfaces algorithm [7]: both techniques are provided by the OpenIMAJ
library (openimaj.org).</p>
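        <p>In Java-like form, the two core interfaces described above could look as follows; MMObject and KnowledgeBase are placeholders for the framework's actual types, so this is a sketch of the stated signatures rather than RAM3S source code:</p>

```java
// Sketch of the two RAM3S core interfaces as stated in the text.
interface MMObject { }       // one unit of multimedia data (e.g., a video frame)
interface KnowledgeBase { }  // e.g., the database of known faces

interface Receiver {
    void start();            // how data stream acquisition is initiated
    void stop();             // how data stream acquisition is terminated
}

abstract class Analyzer {
    protected final KnowledgeBase kb;

    protected Analyzer(KnowledgeBase kb) {  // the KB is fixed at construction
        this.kb = kb;
    }

    // true iff the object "matches" the KB (e.g., a known face was recognized)
    public abstract boolean analyze(MMObject obj);
}
```
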
        <sec id="sec-2-2-1">
          <title>Stream Analysis Pipeline (Figure 1)</title>
          <p>[Figure 1: Receiver (frame splitting), then Analyzer (face detection followed by face recognition against the Face DB); if a known face is found, an alert is generated.]</p>
        </sec>
        <sec id="sec-2-2-6">
          <title>Interface to Big Data Platforms</title>
          <p>The software framework described so far (and its application to the use case)
The software framework described so far (and its application to the use case)
is directly applicable to a centralized framework, by instantiating a Receiver
for each input data stream and a single Analyzer, collecting outputs of the
Receivers. In Figure 2, we show how RAM3S has been appropriately extended
in order to work as components of the distributed Big Data platforms.
(a)
(b)
(c)
Apache Spark For Spark (Figure 2 (a)), every camera sends its video stream to
a Receiver, corresponding to a single Spark receiver: the output of the receiver
is bu ered to build the D-stream. The bu er consists of n MMObject instances,
where n is a system parameter. Every time the bu er is full, a D-stream is
completed and it is sent to the cluster of Spark drivers. The OnStart() and
OnStop() methods of the Spark receiver correspond to the start() and stop()
methods of the Receiver interface. Any Spark driver includes an Analyzer.
The logic of Spark drivers should iterate the analyze() method for all the n
MMObject instances included in the input D-stream. The output of the Spark
driver is the boolean \or" of the n boolean results of the analyze() method.
Finally, this alarm is sent to the alert server.</p>
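          <p>The driver logic just described, iterating analyze() over the n objects of a micro-batch and OR-ing the results into a single alarm, can be sketched as follows (names are illustrative, not RAM3S code; analyze() is passed in as a predicate):</p>

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch of the Spark-driver logic: apply analyze() to each of the n
// objects in one micro-batch and compute the boolean "or" of the results.
public class MicroBatchDriver {
    public static <T> boolean alarmFor(List<T> batch, Predicate<T> analyze) {
        boolean alarm = false;
        for (T obj : batch) {
            alarm = alarm || analyze.test(obj);  // OR over the n results
        }
        return alarm;                            // true -> send to the alert server
    }
}
```
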
          <p><bold>Apache Storm</bold> For Storm (Figure 2 (b)), as before, every camera sends its
video stream to a Receiver, encapsulated into a Storm spout. Now, however,
every MMObject generated by the spout is immediately sent to the cluster of
Storm bolts. Storm bolts correspond to single Analyzer objects. The prepare()
method is equivalent to the Analyzer constructor, while execute() matches the
analyze() method, which is executed for every received MMObject. Any time the
Analyzer receives an MMObject, it has to acknowledge the emitting Receiver to
respect the Storm semantics. With respect to the Spark solution, we have the
advantage of a real stream processing architecture: latency is likely to be reduced,
since we do not have to wait for the completion of a D-Stream to begin the analysis
of any single MMObject.</p>
          <p><bold>Apache Flink</bold> Finally, for Flink (Figure 2 (c)), the configuration is apparently
rather similar to the Storm case: cameras send their video streams to a
cluster of Flink receivers, each including a single Receiver that sends individual
instances of MMObject to the cluster of Flink task nodes. A single Flink task
node consists of an Analyzer, executing the analyze() method for every received
MMObject and finally sending an alarm to the alert server. Although, for ease of
presentation, Figure 2 (c) shows Flink receivers separated from Flink task nodes,
we recall that this difference is only virtual, since each activity of the Flink
JobGraph can be executed on any computing node, i.e., there are no "receiver"
nodes and "task" nodes, but any single computing node can host either a receiver
or a task node at any given moment. This gives the Flink architecture high
resiliency and versatility, since it can easily accommodate node faults (for both
receivers and task nodes) and the addition/deletion of input data streams (which
would require modification of the topology in Storm).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>The goal of our experiments is to evaluate the efficiency of the proposed
architectures, comparing them on a fair basis. The open source platforms for
real-time data analysis described in Section 2 have been implemented both in a
local environment with a cluster of 12 PCs and on the Google Cloud Platform
(cloud.google.com). The former, realized in our datalab (www-db.disi.
unibo.it/research/datalab/), mimics the case where streams from
surveillance cameras are processed locally by a commodity cluster of low-end
computers, while the latter represents the scenario where all video streams are
transmitted to a center of massively parallel computers.</p>
      <p>For tests on the local cluster, we used (up to) 12 PCs with a Pentium 4 CPU @
2.8 GHz, equipped with 1GB of RAM, interconnected with a 100Mbps Ethernet
network. For tests on the cloud architecture, we used (up to) 128 machines of type
n1-standard-1, with 1 vCPU (2.75 GCEU) and 3.75GB of RAM. Our real dataset
for the face detection/recognition task is the YouTube Faces Dataset
(YTFD) [9], including 3,425 videos of 1,595 different people, with videos having
different resolutions (480x360 is the most common size) and a total of 621,126
frames containing at least one face (on average, 181.3 frames/video). In order to
freely vary the input frame rate, the incoming stream of video frames
was implemented as a RabbitMQ queue fed with sequences of still images.</p>
      <p>Figure 3 shows the sustainable input rate (the maximum frame rate such
that the input buffer does not overflow) obtained for the two scenarios. Results
show that Apache Storm consistently outperforms the other two frameworks
in both scenarios. We believe this behavior is due to the simpler at-least-once
semantics of Storm (with respect to the exactly-once semantics of Flink) and to
the fact that the Storm topology has to be explicitly defined by the programmer,
while for Flink the topology is decided by the streaming optimizer; this superior
flexibility of Flink comes at the cost of a slightly diminished efficiency. On the
other hand, Apache Spark exhibits the worst performance, although its results
are quite similar to those of Apache Flink in the cloud scenario. This is largely
expected, since Spark is not a real streaming engine, and the management of
D-Streams is the main reason for its inferior input rates.</p>
      <p>[Figure 3: sustainable input rate (frames/s) vs. number of nodes n, for Spark, Storm, and Flink, in (a) the local and (b) the cloud scenario.]</p>
      <p>Figure 4 illustrates the latency of the three architectures in the local and
the cloud scenarios. The latency of the system is measured as the (average)
time interval between an image exiting the FIFO queue and the corresponding
alert signal being received by the alert server, when using the maximum input rate
sustainable by an architecture composed of a single node. For Apache Spark,
the delay is the average delay of the images included in a single D-Stream. Since
a micro-batch period of 30 seconds was considered, this can be computed as
30/2 = 15 seconds plus the average latency; Figure 4 only plots the average latency,
omitting the 15 seconds from the total delay. Apache Storm consistently achieves the
lowest latency, with Apache Flink always very close. For both streaming-native
platforms, latency times are largely unaffected by the number of nodes; the
same behavior is also common to the non-discounted latency of Apache Spark
(not shown here to avoid cluttering the figures). The discounted latency of Spark,
on the other hand, clearly takes advantage of the increased number of nodes,
reaching latency values similar to Storm for high numbers of nodes.</p>
      <p>[Figure 4: latency (s) of Spark, Storm, and Flink vs. number of nodes.]</p>
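      <p>The discount applied to Spark's plotted latency can be made explicit with a small helper (a sketch of the computation described above, not code from the paper): with a micro-batch period p, an image waits on average p/2 seconds for its D-Stream to be completed, so the plotted value is the measured delay minus p/2.</p>

```java
// Sketch of the latency discount for Spark's micro-batching: with a
// batch period of p seconds, an image waits p/2 seconds on average for
// its D-Stream to be completed, so the discounted latency removes that
// constant from the measured end-to-end delay.
public class LatencyDiscount {
    public static double discounted(double measuredDelaySec, double batchPeriodSec) {
        return measuredDelaySec - batchPeriodSec / 2.0;
    }
}
```
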
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>Comparing the performance of the three Big Data platforms, Spark attains the lowest throughput and the highest latency, clearly due to its micro-batch design: when looking at its good latency performance in the cloud environment, we should remember that the time for building the D-Stream is not considered there. Comparing Storm and Flink, the former attains slightly better results, both in terms of throughput and of latency: as already pointed out in Section 3.2, Flink is likely to be superior to Storm when issues related to fault tolerance and correctness of the final result are considered; this clearly comes at the expense of slightly inferior results when only performance is an issue. Similar considerations, albeit for a slightly different scenario (counting JSON events from Apache Kafka), were recently obtained in [4].</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <label>1</label>
        <mixed-citation>A. Alexandrov, R. Bergmann, et al. The Stratosphere Platform for Big Data Analytics. In VLDBJ, 23(6), 2014.</mixed-citation>
      </ref>
      <ref id="ref2">
        <label>2</label>
        <mixed-citation>A. Bifet, G. Holmes, et al. MOA: Massive Online Analysis. In JMLR, 99, 2010.</mixed-citation>
      </ref>
      <ref id="ref3">
        <label>3</label>
        <mixed-citation>K. M. Chandy and L. Lamport. Distributed Snapshots: Determining Global States of Distributed Systems. In TOCS, 3(1), 1985.</mixed-citation>
      </ref>
      <ref id="ref4">
        <label>4</label>
        <mixed-citation>S. Chintapalli, D. Dagit, et al. Benchmarking Streaming Computation Engines at Yahoo!. Yahoo! Engineering blog, 2015.</mixed-citation>
      </ref>
      <ref id="ref5">
        <label>5</label>
        <mixed-citation>M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy. Mining Data Streams: A Review. In SIGMOD Record, 34(2), 2005.</mixed-citation>
      </ref>
      <ref id="ref6">
        <label>6</label>
        <mixed-citation>H. Hu, Y. Wen, T.-S. Chua, and X. Li. Toward Scalable Systems for Big Data Analytics: A Technology Tutorial. In IEEE Access, 2, 2014.</mixed-citation>
      </ref>
      <ref id="ref7">
        <label>7</label>
        <mixed-citation>M. Turk and A. P. Pentland. Face Recognition Using Eigenfaces. In CVPR 1991, Lahaina, HI.</mixed-citation>
      </ref>
      <ref id="ref8">
        <label>8</label>
        <mixed-citation>P. Viola and M. Jones. Rapid Object Detection Using a Boosted Cascade of Simple Features. In CVPR 2001, Kauai, HI.</mixed-citation>
      </ref>
      <ref id="ref10">
        <label>9</label>
        <mixed-citation>L. Wolf, T. Hassner, and I. Maoz. Face Recognition in Unconstrained Videos with Matched Background Similarity. In CVPR 2011, Colorado Springs, CO.</mixed-citation>
      </ref>
      <ref id="ref9">
        <label>10</label>
        <mixed-citation>M. Zaharia, M. Chowdhury, et al. Spark: Cluster Computing with Working Sets. In HotCloud '10, Boston, MA.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>