<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Designing Flink Pipelines in IoT Mashup Tools</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tanmaya Mahapatra</string-name>
          <email>tanmaya.mahapatra@tum.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ilias Gerostathopoulos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federico Alonso Fernandez Moreno</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Prehofer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lehrstuhl für Software und Systems Engineering, Fakultät für Informatik, Technische Universität München</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Internet of Things (IoT) applications are generating increasingly large amounts of data because of continuous activity and periodic sensing capabilities. Processing the data generated by IoT applications is necessary to derive important insights; for example, processing data on CO emissions can help municipal authorities apply traffic restrictions in order to improve a city's air quality. State-of-the-art stream-processing platforms, such as Apache Flink, can be used to process large amounts of data streams from different IoT devices. However, it is difficult to both set up and write applications for these platforms; this is also manifested in the increasing demand for data analysts and engineers. A promising solution is to enable domain experts, who are not necessarily programmers, to develop the necessary stream pipelines by providing them with domain-specific graphical tools. We present our proposal for extending a state-of-the-art mashup tool, originally developed for wiring IoT applications together, to graphically design streaming data pipelines and deploy them as a Flink application. Our prototype and experimental evaluation show that our proposal is feasible and potentially impactful.</p>
      </abstract>
      <kwd-group>
        <kwd>Flink pipelines</kwd>
        <kwd>graphical tool</kwd>
        <kwd>IoT mashup tools</kwd>
        <kwd>stream analytics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In recent years, there has been an upsurge in the number and usage of ubiquitous
connected physical devices, thereby making the era of the Internet of Things
(IoT) a reality. IoT is defined as the interconnection of ubiquitous computing
devices for increased value to end users [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Realising this value of IoT for
end users depends heavily on its software applications, which in turn depend on the
insights gained from IoT data.
      </p>
      <p>Copyright held by the author(s). NOBIDS 2018.</p>
      <p>IoT data typically comes in the form of data streams that often need to
be processed under latency requirements to obtain insights in a timely fashion.
Examples include traffic monitoring and control in a smart city: traffic data
from different sources (e.g. cars, induction-loop detectors, cameras) need to be
combined in order to take traffic-control decisions (e.g. setting speed limits,
opening extra lanes on highways). The more sensors and capabilities, the more
data streams require processing. Specialised stream-processing platforms, such
as Apache Flink, Spark Streaming and Kafka Streams, have been proposed to
address the challenge of processing vast amounts of data (also called Big Data)
that come in as streams, in a timely, cost-efficient and trustworthy manner.</p>
      <p>The problem with existing stream platforms is that they are difficult to both
set up and write applications for. The current practice relies on human expertise:
the skills of data engineers and analysts, who can deploy Big Data stream
platforms in clusters, manage their life-cycle and write data-analytics
applications in general-purpose high-level languages such as Java, Scala and Python.
Although many platforms, including Flink and Spark, provide SQL-like
programming interfaces to simplify data manipulation and analysis, the barrier is
still high for non-programmers.</p>
      <p>In response to this growing need, we believe a promising solution is to enable
domain experts, who are not necessarily programmers, to develop the necessary
pipelines for streaming data analytics by providing them with domain-specific
graphical tools. In particular, we propose to extend existing flow-based graphical
programming environments used for simplifying IoT application development,
called IoT mashup tools, to allow the specification of streaming data-analytics
pipelines (programs) via their intuitive graphical interfaces, which allow
components to be dragged, dropped and wired together.</p>
      <p>
        To provide a technical underpinning for our proposal and evaluate its
feasibility, we have extended aFlux (https://github.com/mahapatra09/aflux_tum) [
        <xref ref-type="bibr" rid="ref10 ref11">11, 10</xref>
        ], a state-of-the-art mashup tool developed
in our department, to support the specification of streaming data pipelines for
Flink, one of the most popular Big Data stream-processing platforms. One main
challenge is reconciling the difference between Flink's programming paradigm and
that of flow-based mashup tools. Flink relies on a lazy evaluation execution model, where
computations are materialised only if their output is necessary, while flow-based
programming triggers a component, proceeds to execution and finally passes its
output to the next component upon completion. To program Flink from mashup
tools, this difference in the computation model needs to be addressed.
Additionally, there needs to be a seamless connection between the two systems to enable
a smoother consumption of the generated results.
      </p>
      <p>Succinctly, we provide the following contributions in this paper:
1. We analyse the Flink ecosystem and identify the abstractions that are suitable
for graphical programming of Flink pipelines (Section 3).
2. We describe the concept and technical realisation of mapping a
graphical flow, designed in aFlux, to a Flink pipeline, and of providing basic flow-validation
functionalities at the level of aFlux (Section 4).
3. We evaluate our proposal by designing pipelines that monitor traffic
conditions and detect patterns in incoming streaming data, using real-time
traffic data from the city of Santander, Spain (Section 5).</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>In this section we give an overview of mashup tools, with an emphasis on aFlux,
and of Big Data stream-analytics platforms, with an emphasis on Flink.</p>
      <sec id="sec-2-1">
        <title>aFlux: An IoT Mashup Tool</title>
        <p>
          Mashups are a conglomeration of several accessible and reusable components
on the web [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Mashup tools simplify the development of mashups by allowing
end users to wire together mashup components, encapsulating business logic into
one or more mashup flows. When executing a mashup flow, control follows the
data flow from one component to the next; this type of flow-based programming
paradigm is also followed in a very popular mashup tool for IoT, Node-RED [
          <xref ref-type="bibr" rid="ref1 ref7">7,
1</xref>
          ].
        </p>
        <p>aFlux is a recently proposed IoT mashup tool that offers several
advantages compared to Node-RED. It features a multi-threaded execution model,
asynchronous and non-blocking execution semantics, and concurrent execution
of components.</p>
        <p>Fig. 1: The aFlux front-end, showing the available mashup components, the application header and menu bar, the side panel, activity tabs, the add-plug-in button, the canvas and a console-like output.</p>
        <p>aFlux consists of a web application and a back-end developed in Java and the
Spring Framework (https://spring.io/). The web application is composed of two main entities,
the front-end and the back-end, which communicate via a REST API. The front-end of aFlux (Fig. 1)
provides a GUI for the creation of mashups. It is based on the React (https://reactjs.org/)
and Redux (https://redux.js.org/) frameworks. Mashups are created by dragging and dropping mashup components</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <p>Components are dragged from the left panel; new mashup components are loaded from plug-ins. The
application shows a console-like output in the footer, and the details about a
selected item are shown on the right panel. Using the aFlux front-end, a user
can create a flow by wiring several mashup components (or sub-flows) together.</p>
      <p>
        When a flow is sent to the back-end, it is translated into an internal model,
which is a graph called the `Flow Execution Model' [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. This model is composed
of actors, as aFlux makes use of the Akka actor system (https://akka.io/) and Java. In an actor
system, actors encapsulate both state and behaviour. When an actor receives
a message, it starts to perform the associated computations, and it may send a
message to another actor when finished. In aFlux, a mashup component of the
front-end corresponds to an actor in the back-end. Messages can only be sent
asynchronously between actors [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]; concurrency of actors is also supported.
Currently, aFlux supports graphical Spark programming by making use of the
declarative APIs of the Spark ecosystem [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p><bold>Stream Analytics.</bold> The idea of processing data as streams, i.e. as they arrive, differs from
batch processing. The latter approach was followed in the first Big Data processing
systems, such as Hadoop's MapReduce and Apache Spark, which mainly
dealt with reliable parallel processing of Big Data residing in distributed file
systems, such as Hadoop's HDFS. Stream processing of Big Data has recently
been pursued as a solution to reduce the latency in data processing and provide
real-time insights (e.g. on the scale of seconds or milliseconds).</p>
      <p>
        In particular, an ideal stream-processing platform should meet the following
requirements [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]:
– Low latency. Streaming platforms usually make use of in-memory
processing, in order to avoid the time required to read/write data in a storage
facility and thus decrease the overall data-processing latency.
– High throughput. Scalability and parallelism enable high performance in
terms of data-processing capability. The real-time performance of
stream-processing systems is frequently demanded even with spikes in incoming
data [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
– Data querying. Streaming platforms should make it possible to find events
in the entire data stream. Typically, an SQL-like language is employed [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
However, since data streams never end, there needs to be a mechanism to
define the limits of a query; otherwise it would be impossible to query
streaming data. This is where the window concept comes in. Windows define the
data to which an operation may be applied, so they become key elements in
stream processing.
– Out-of-order data. Since a streaming platform does not wait for all the
data to become available, it must have a mechanism to handle data coming
late or never arriving. A concept of time needs to be introduced, to process
data in chunks regardless of order of arrival.
      </p>
    </sec>
    <sec id="sec-6">
      <p>– High availability and scalability. Stream processors will most likely
handle ever-growing amounts of data, and in most cases other systems could
rely on them, e.g. in IoT scenarios. For this reason, the stream-processing
platform must be reliable, fault-tolerant and capable of handling any amount
of data events.</p>
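      <p>The out-of-order requirement can be illustrated with a small sketch. The following plain-Java code (our own class and method names, no Flink dependency) buckets events by their event time, so that a late-arriving event still lands in the correct time chunk:</p>

```java
// Illustrative sketch (plain Java, no Flink): bucketing events by event time
// so that late-arriving events still land in the correct chunk.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class EventTimeBuckets {
    // Assigns each (eventTimeMillis, value) pair to a fixed-size time bucket,
    // regardless of the order in which events arrive.
    public static Map<Long, List<String>> bucketize(long bucketSizeMillis,
                                                    long[] eventTimes,
                                                    String[] values) {
        Map<Long, List<String>> buckets = new TreeMap<>();
        for (int i = 0; i < eventTimes.length; i++) {
            long bucketStart = (eventTimes[i] / bucketSizeMillis) * bucketSizeMillis;
            buckets.computeIfAbsent(bucketStart, k -> new ArrayList<>()).add(values[i]);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Events arrive out of order: the event stamped 1200 ms arrives last.
        long[] times = {0, 2500, 1200};
        String[] vals = {"a", "c", "b"};
        System.out.println(bucketize(1000, times, vals));
        // {0=[a], 1000=[b], 2000=[c]}
    }
}
```

      <p>In Flink, this role is played by event-time windows together with watermarks; the sketch only illustrates the underlying principle.</p>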
      <p>
        The first approaches to stream processing, notably Storm and Spark
Streaming, used to focus on requirements such as low latency and high throughput [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
Lambda architecture, a well-known approach [
        <xref ref-type="bibr" rid="ref12 ref6 ref9">6, 12, 9</xref>
        ], combines batch and
stream-like approaches to achieve shorter response times (on the order of seconds). This
approach has some advantages but one critical downside: the business logic
needs to be duplicated in the stream and the batch processors. In contrast to
this, stream-first solutions, such as Apache Flink, meet all the outlined
requirements [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <sec id="sec-6-1">
        <title>Flink Ecosystem: An Analysis</title>
        <p>
          Apache Flink is a processing platform for distributed stream as well as batch
data. Its core is a streaming data-flow engine, providing data distribution,
communication and fault tolerance for distributed computations over data streams [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
It is a distributed engine, built upon a distributed runtime that can be executed
in a cluster to benefit from high availability and high-performance computing
resources. It is based on stateful computations. Indeed, Flink offers exactly-once
state consistency, which means it can ensure correctness even in the case of
failure. Flink is also scalable, because the state can be distributed among several
systems. It supports both bounded and unbounded data streams. Flink achieves
all this by means of a distributed data-flow runtime that allows real stream-pipelined
processing of data.
        </p>
        <p>A streaming platform should be able to handle time, because a reference
frame is needed for understanding how the data stream flows, that is to say, which
events come before or after others. Time is used to create windows and perform
operations on streaming data, in a broad sense. Flink supports several notions
of time: (i) Event time refers to the time at which an event was produced on the
producing device. (ii) Processing time is related to the system time of the cluster
machine in which the streams are processed. (iii) Ingestion time is the time at
which an event enters the Flink platform, which lies between event time and processing time.</p>
        <p>Windows are a basic element in stream processors. Flink supports different
types of windows, and all of them rely on the notions of time described above.
Tumbling windows have a specified size, and they assign each event to one and
only one window, without any overlap. Sliding windows have fixed sizes, but an
overlap, called the slide, is allowed. Session windows can be of interest for some
applications, because sometimes it is insightful to process events in sessions. A
global window assigns all elements to one single window. This approach allows
for the definition of triggers, which tell Flink exactly when the computations
should be performed.</p>
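        <p>The difference between tumbling and sliding windows can be sketched as follows (plain Java, our own names; Flink's actual window assigners live in its windowing API):</p>

```java
// Illustrative sketch (plain Java, no Flink dependency): how tumbling and
// sliding windows assign an event timestamp to window start times.
import java.util.ArrayList;
import java.util.List;

public class WindowAssignment {
    // A tumbling window of the given size assigns each event to exactly one window.
    public static long tumblingWindowStart(long timestamp, long size) {
        return (timestamp / size) * size;
    }

    // A sliding window of the given size and slide may assign an event to
    // several overlapping windows (size / slide of them when slide divides size).
    public static List<Long> slidingWindowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long last = (timestamp / slide) * slide;   // latest window containing the event
        for (long s = last; s > timestamp - size; s -= slide) {
            if (s >= 0) starts.add(0, s);          // keep ascending order
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(tumblingWindowStart(7, 10));     // 0: one window only
        System.out.println(slidingWindowStarts(7, 10, 5));  // [0, 5]: two overlapping windows
    }
}
```

        <p>With a tumbling window, an event belongs to exactly one window; with a sliding window of size 10 and slide 5, the event at timestamp 7 belongs to the two overlapping windows starting at 0 and 5.</p>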
        <p>
          The Flink distributed data-flow programming model, together with its various
abstractions for developing applications, forms the Flink ecosystem. Flink offers
three different levels of abstraction for developing streaming/batch applications, as
follows: (i) Stateful stream processing: the lowest-level abstraction offers stateful
streaming, permitting users to process events from different streams. It features
full flexibility by enabling low-level processing and control. (ii) Core level: above
this level is the core API level of abstraction. By means of both a DataStream
API and a DataSet API, Flink enables not only stream processing but also
batch analytics on `bounded data streams', i.e., data sets with fixed lengths. (iii)
Declarative domain-specific language: Flink offers a Table API as well, which
provides a high-level abstraction for data processing. With this tool, a data set or
data stream can be converted to a table that follows a relational model. The
Table API is more concise because, instead of the exact code of an operation,
logical operations are defined [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]; it is, however, less expressive than the core APIs. In the
latest Flink releases, an even-higher-level SQL abstraction has been created as
an evolution of this declarative domain-specific language. In addition to the
aforementioned user-facing APIs, some libraries with special functionality are
built on top. The added value ranges from machine-learning algorithms (currently only
available in Scala) to complex event processing (CEP) and graph processing.
        </p>
        <p>Fig. 2: Structure of a Flink program: data source, transformations, data sink.</p>
        <p>The structure of a Flink program (especially when using the core-level APIs)
begins with data from a source entering Flink, where a set of transformations
is applied (window operations, data filtering, data mapping, etc.). The results
are subsequently yielded to a data sink, as shown in Figure 2. A Flink program
typically consists of streams and transformations. Simplistically, a stream is a
never-ending flow of data sets, and a transformation is an operation on one or
more streams that produces one or more streams as output.</p>
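        <p>This anatomy can be sketched in plain Java (hypothetical types, no Flink dependency), modelling a stream simplistically as a list of records flowing from a source through chained transformations into a sink:</p>

```java
// A minimal sketch of the source -> transformations -> sink anatomy described
// above; all names here are our own, not Flink API.
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class PipelineAnatomy {
    public static List<String> run(List<String> source,
                                   Predicate<String> filter,
                                   Function<String, String> map) {
        // Transformations are applied in order; the result is handed to the sink.
        return source.stream().filter(filter).map(map).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> sink = run(List.of("co:12", "temp:20", "co:30"),
                                r -> r.startsWith("co:"),   // filter transformation
                                r -> r.substring(3));       // map transformation
        System.out.println(sink); // [12, 30]
    }
}
```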
        <p>On deployment, a Flink program is mapped internally to a data-flow
consisting of streams and transformation operators. The data-flow typically resembles
a directed acyclic graph (DAG). Flink programs typically apply transformations
to data sources and save the results to data sinks before exiting. Flink has the
special classes DataSet, for bounded data sets, and DataStream, for unbounded
data streams, to represent data in a program. To summarise, Flink programs look
like regular programs that transform data collections. Each program consists of:
(i) initialising the execution environment, (ii) loading data sets, (iii) applying
transformations, and (iv) specifying where to save the results. Flink programs use
a lazy execution model, i.e. when the program's main method is executed, the
data loading and transformations do not happen immediately. Rather, each
operation is added to the program's plan, which is executed only when its output needs
to be used. This contrasts with the flow-based programming model of
mashup tools, which relies on an eager evaluation model, i.e. a flow component
is first executed before control flows to the next component. This difference
must be taken into consideration while enabling Flink programming from
mashup tools.</p>
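        <p>The contrast between the two execution models can be sketched as follows (plain Java, our own names): a lazy plan merely records operations until a result is demanded, whereas an eager flow component would run each step as soon as it is wired:</p>

```java
// Sketch contrasting the execution models discussed above: operations are only
// recorded in a plan; nothing runs until the result is materialised.
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

public class LazyPlan {
    private final List<UnaryOperator<Integer>> plan = new ArrayList<>();
    private final Integer input;

    public LazyPlan(Integer input) { this.input = input; }

    // Like a Flink transformation: nothing is computed here, the operation is
    // merely appended to the program's plan.
    public LazyPlan then(UnaryOperator<Integer> op) {
        plan.add(op);
        return this;
    }

    // Like writing to a sink / triggering execution: only now does the whole
    // recorded plan actually run.
    public int execute() {
        int value = input;
        for (UnaryOperator<Integer> op : plan) value = op.apply(value);
        return value;
    }

    public static void main(String[] args) {
        LazyPlan p = new LazyPlan(2).then(x -> x + 3).then(x -> x * 10);
        // No computation has happened yet; execute() materialises the result.
        System.out.println(p.execute()); // 50
    }
}
```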
        <sec id="sec-6-1-1">
          <title>Design Decisions</title>
          <p>In order to support Flink pipelines in mashup tools, we needed to decide on
(i) the required abstraction level, (ii) the execution-model mapping and (iii)
the way to support semantic validity of graphical flows. Accordingly, from the
different abstraction levels, we decided to select the core API abstraction level
for supporting Flink pipelines in graphical mashup tools, as these APIs are easy
to represent in a flow-based programming model. They avoid the need for
user-defined functions to bring about data transformation and provide predictable
input and output types for each operation; the tool can then focus on validating
the associated schema changes. Moreover, it is easy to represent DataStream
and DataSet APIs as graphical components that can be wired together. Finally,
the different input parameters required by an API can be specified by the user
from the front-end. We follow the lazy execution model while composing a Flink
pipeline graphically, i.e. when a user connects different components, we do not
automatically generate Flink code but instead take note of the structure and
capture it via a DAG, simultaneously checking the semantic validity of the flow.
When the flow is marked as complete, the runnable Flink code is generated.
Lastly, we impose semantic-validity restrictions on the graphical flow which can
be composed by the user, i.e. it must begin with a data-source component,
followed by a set of transformation components and finally end with a
data-sink component, in accordance with the anatomy of a Flink program.</p>
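          <p>The structural restriction above can be sketched as a simple check (plain Java; the component model is our own simplification of the aFlux components):</p>

```java
// Hedged sketch of the anatomy restriction: a flow must start with a data
// source, end with a data sink and contain only transformations in between.
import java.util.List;

public class FlowAnatomyCheck {
    public enum Kind { SOURCE, TRANSFORMATION, SINK }

    // Returns true when the ordered flow follows source -> transformations -> sink.
    public static boolean isValid(List<Kind> flow) {
        if (flow.size() < 2) return false;
        if (flow.get(0) != Kind.SOURCE || flow.get(flow.size() - 1) != Kind.SINK) return false;
        for (int i = 1; i < flow.size() - 1; i++) {
            if (flow.get(i) != Kind.TRANSFORMATION) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid(List.of(Kind.SOURCE, Kind.TRANSFORMATION, Kind.SINK))); // true
        System.out.println(isValid(List.of(Kind.TRANSFORMATION, Kind.SINK)));              // false
    }
}
```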
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>Designing Flink pipelines</title>
        <p>The conceptual approach for designing Flink pipelines via graphical flows
addresses the main contributions stated in Section 1 and consists of: (i) a model
to enable the graphical creation of programs for stream analytics, in other words,
to automatically translate items specified via a GUI into runnable source code,
known as the Translation &amp; Code Generation Model, and (ii) a model to
continuously assess the end-user flow composition for semantic validity and provide
feedback, to ensure that the final graphical flow yields compilable source code,
known as the Validation Model. Figure 3 gives a high-level overview of the
conceptual approach used to achieve this purpose. The approach is
based on the design decisions discussed in Section 3.</p>
        <p>Since the main idea is to support stream analytics in mashup tools, we
restrict the scope of the translator to the DataStream APIs from the core-level
API abstractions. In accordance with the anatomy of a Flink program, we have
built `SmartSantander Data' as the data-source component and an `Output Result'
component, which supports write operations to Kafka, CSV and plain text, as the data-sink
component. Map, filter and window operations are the supported transformation
components. Accordingly, we built the `GPS Filter' component to specify filter
operations, the `Select' component to support map operations, and `Window' as
well as `WindowOperation' components to specify windows on data streams. We also support
the Flink CEP library via the following components: `CEP Begin', `CEP End',
`CEP Add condition' and `CEP New condition'. The CEP library is used to
detect patterns in data streams. We also have two additional components, namely
`Begin Job' and `End Job', to mark the start and end of a Flink pipeline. The
translator &amp; code generation model has been designed to work within this scope
of selection. We define all potential semantic rules between these components, and
the validation model works within this scope.</p>
        <p>Fig. 3: Overview of the conceptual approach: a graphical flow of visual components, defined in the GUI by the user, is handled by the translator (graphical parser, actor system and code generator, exchanging actors, user-defined properties and the STDS), yielding a runnable Flink program.</p>
        <p>The aim of the translation &amp; code generation model is to provide a way to
translate a graphical flow defined by the end user of the mashup tool (via its
GUI) into source code to program Flink. This model behaves as follows: (i)
first, end users define graphical flows in the mashup tool GUI by connecting
a set of visual components in a flow-like structure. Each component represents a certain Flink
functionality and has a set of properties that the user may configure according
to their needs. (ii) Then, a translator acquires the aggregated information of the
user-defined flow, which contains (a) the set of visual components that compose
the flow, (b) the way in which they are connected, and (c) the properties that users
have configured for each component.</p>
        <p>The translator has three basic components: a graphical parser, an actor
system and a code generator. It takes as input the aggregated information of the
user-defined graphical flow (i.e. the visual components, the flow structure and the
user-defined properties), and its output is a packaged and runnable Flink job. The
graphical parser takes the aforementioned aggregated information and processes
it, creating an internal model and instantiating the set of actors corresponding
to the flow. The actor system is the execution environment of the actors, which
contain the business logic of the translator. Actors are taken from the output
of the graphical parser. The actor-model abstraction makes each actor
independent; the only way to interact with the rest is by means of exchanging
messages. Actors communicate using a data structure that has been explicitly
defined for making the translation, a tree-like structure that makes
appending new nodes extremely easy. In this model, the data structure is referred
to as the STDS (Specific Tree-like Data Structure). As previously stated, each
actor corresponds to a specific Flink functionality and, in turn, to the standalone
implementation method of that specific functionality. It adds a generic
method-invocation statement as a message response to the next connected actor. The
method-invocation statement also passes the user parameters and the output
from its preceding node as input to the standalone implementation method of
the Flink-functionality APIs. The next actor receives this message and appends its
corresponding method-invocation statement, and so forth.</p>
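        <p>The tree-building step can be sketched as follows (plain Java; the STDS internals are not spelled out here, so the representation below is our assumption): each actor appends one parametrised method-invocation statement that feeds on the output of its predecessor:</p>

```java
// Illustrative sketch of actors appending parametrised method-invocation
// statements to a tree-like structure; names are ours, not aFlux internals.
import java.util.ArrayList;
import java.util.List;

public class StdsSketch {
    private final List<String> statements = new ArrayList<>();
    private String lastOutput = "source";

    // Each "actor" appends one generic method-invocation statement, wiring the
    // previous node's output and the user parameters into the call.
    public StdsSketch append(String method, String userParams) {
        String out = "v" + statements.size();
        statements.add(out + " = " + method + "(" + lastOutput + ", " + userParams + ")");
        lastOutput = out;
        return this;
    }

    public List<String> statements() { return statements; }

    public static void main(String[] args) {
        StdsSketch stds = new StdsSketch()
                .append("filterByGps", "area=north")
                .append("window", "type=tumbling,size=5s");
        System.out.println(stds.statements());
        // [v0 = filterByGps(source, area=north), v1 = window(v0, type=tumbling,size=5s)]
    }
}
```

        <p>A code generator can then map each parametrised statement to concrete Flink source code, as described next.</p>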
        <p>Finally, the code generator takes the STDS as input. It has an internal mapping
to translate parametrised statements into real Flink source-code statements.
This entity combines the parametrised statements with this mapping and the
user-defined properties, and then generates the final source code. The compiling
process also takes place here. The code generator's output is a packaged, runnable
Flink job that can be deployed in an instance of Flink.</p>
        <p><bold>Validation.</bold>
The translation model allows the translation of graphical flows into source code.
However, some graphical flows may result in source code that either cannot
be compiled or yields runtime errors. We have provided support in aFlux for
handling the type of errors that occur because of data dependencies in a data
pipeline, during the specification of the pipeline from the GUI. If one of the
data-dependency rules is violated when the user connects or disconnects a component
in a flow, visual feedback is provided, which helps avoid problems early on.
Such semantic rules must be specified by the developers of the individual Flink
components of aFlux, according to the following pattern:</p>
        <p>`Component A' (the mandatory visual component) must | should come (immediately, i.e. the isConsecutive flag) before | after `Component B' (the precedent visual component)</p>
        <p>For example, the following rules can be specified:
– `Window' component must come immediately after `Select' component
– `End Job' component must come after `Load data' component
On the front-end, when a user connects two components, it is considered a
state-change. With every state-change, the entire flow is captured from the front-end
and subjected to the validation process. Basically, the flow is captured in the form
of a tree; the next step is to check whether the nodes are compatible to accept the
input received from their preceding nodes, whether two immediate connections
are legal, and whether the tested component's positional rules permit it to be used
after its immediate predecessor. Algorithm 1 summarises the semantic-validation
steps of the flow. During the check, if an error is found with any one component
of the flow, the user is alerted with the appropriate reasons and the component
is highlighted.</p>
        <p>Algorithm 1: Continuous Semantic Validation of Flink Pipelines
foreach flow in the canvas do
  order the list of elements as they appear in the flow;
  foreach element in the orderedList do
    instantiate the PropertyContainer that corresponds to element;
    get the set of conditions out of it;
    instantiate a new result;
    foreach condition in conditions do
      foreach element in the orderedList do
        if condition is not met then
          result.add(condition);
        end
      end
    end
    if result is empty then
      clear error information from element;
    else
      add error information to element;
    end
  end
end</p>
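        <p>Algorithm 1 can be sketched as runnable code (plain Java; the rule representation below is our assumption, derived from the rule pattern described above):</p>

```java
// Runnable sketch of the continuous semantic validation: every positional
// condition is checked against the ordered flow and violations are collected
// as error information for the affected component.
import java.util.ArrayList;
import java.util.List;

public class SemanticValidation {
    // A condition of the form: 'name' must come (immediately) after 'precedent'.
    public record Condition(String name, String precedent, boolean immediately) {}

    public static List<String> validate(List<String> orderedFlow, List<Condition> conditions) {
        List<String> errors = new ArrayList<>();
        for (Condition c : conditions) {
            int pos = orderedFlow.indexOf(c.name());
            if (pos < 0) continue; // component not used in this flow
            int prev = orderedFlow.indexOf(c.precedent());
            boolean ok = c.immediately() ? prev == pos - 1 : (prev >= 0 && prev < pos);
            if (!ok) errors.add(c.name() + " must come " +
                    (c.immediately() ? "immediately " : "") + "after " + c.precedent());
        }
        return errors;
    }

    public static void main(String[] args) {
        List<Condition> rules = List.of(
                new Condition("Window", "Select", true),
                new Condition("End Job", "Load data", false));
        System.out.println(validate(List.of("Load data", "Select", "Window", "End Job"), rules));
        // [] -> flow is valid
        System.out.println(validate(List.of("Load data", "Window", "Select", "End Job"), rules));
        // [Window must come immediately after Select]
    }
}
```

        <p>Running the two example rules from above against a valid and an invalid flow returns an empty error list and a single violation, respectively.</p>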
      </sec>
      <sec id="sec-6-3">
        <title>Evaluation and Discussion</title>
        <p>
          The implemented approach has been evaluated for its ease in graphically creating
Flink jobs from aFlux and abstracting the code generation from the end user.
For evaluation purposes, we have used real data from the city of Santander,
Spain, which is offered as open data behind public APIs [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. In this smart-city
use-case, the user is an analyst at Santander City Hall, who need not have
programming skills. The user only needs to know how to use aFlux from the
end-user perspective (e.g. drag and drop mashup components) and have some
very basic knowledge of what Flink can do from a functionality point of view
rather than from a developer point of view. For example, the city hall analyst
should know that changes in the city are measured as events and that events can be
processed in groups called windows. The user does not need to know any details
about how to create a window in the Flink Java or Scala API, or the fact that
generics need to be used when defining the type of the window. The process
of analysing real-time data involves combining data from different sources of the
city and processing it. The goal of this use-case is to gain insights about the city
that help decision makers take the appropriate actions.
        </p>
        <p>In this evaluation scenario, temperature vs. air quality in a certain area must
be compared with the average of the city. To study the relationship between the
level of a certain gas and temperature, the analyst needs to create four flows (or
wire them all together to create a single Flink job): two of them will analyse
temperature data (i.e. the `temperature' attribute in the `environment' dataset)
and two of them will analyse air quality (e.g. the `levelOfCO' attribute in the
`airQuality' dataset). Two flows are required for each dataset because one will
include a `GPS Filter' component (Fig. 4a) and the other one will not include
it, in order to process all the data in the city (Fig. 4b). To avoid re-adding the
same mashup components, the analyst could make use of the sub-flow feature of
aFlux.</p>
        <p>Fig. 4: Flows in aFlux for the case study on real-time data analytics: (a) Flow A, (b) Flow B.</p>
        <p>Figure 4 shows how the analyst can easily get input from real-time sources by
using a graphical data-source component, i.e. the `SmartSntndr Data' component. Adding
a third source of data, to see not only the level of NO2 but also the level of
ozone, is as simple as changing a property in the `SmartSntndr Data' component.
However, if this were done manually, the Java code for a new `MapFunction'
would have to be written.</p>
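        <p>To illustrate the burden the tool removes: a hand-written projection such as the hypothetical one below would be needed for each newly selected attribute (plain-Java stand-in; Flink's actual `MapFunction' interface is not used here, and the attribute name is illustrative):</p>

```java
// Hypothetical record-to-value projection the analyst would otherwise have to
// code by hand for every new attribute (plain Java, no Flink dependency).
import java.util.Map;
import java.util.function.Function;

public class OzoneProjection {
    // Builds a projection extracting one named attribute from a key-value record.
    public static Function<Map<String, String>, Double> attribute(String name) {
        return record -> Double.parseDouble(record.getOrDefault(name, "0"));
    }

    public static void main(String[] args) {
        Function<Map<String, String>, Double> levelOfOzone = attribute("levelOfOzone");
        System.out.println(levelOfOzone.apply(Map.of("levelOfOzone", "41.5"))); // 41.5
    }
}
```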
        <p>
          Tumbling windows were used in Figure 4a, but processing the data in a
different type of window (e.g. using sliding windows) is as easy as changing
the properties of the `Window' mashup component (Figure 5). In Java, the user
would need to know that a sliding window takes an extra parameter and that the
window slide needs to be specified using Flink's Time class, on which a different
method is invoked depending on the desired time units. The system has been
evaluated against additional scenarios and case studies [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
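<p>To make the difference concrete, the sketch below computes in plain Java which window(s) an event falls into for tumbling versus sliding windows; in actual Flink code these are configured with window assigners such as `TumblingProcessingTimeWindows.of(Time.seconds(10))' versus `SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5))'. The concrete sizes here are illustrative assumptions:</p>

```java
import java.util.ArrayList;
import java.util.List;

class WindowAssignment {

    // Tumbling windows: each event belongs to exactly one window of length `size`.
    static long tumblingWindowStart(long timestamp, long size) {
        return timestamp - (timestamp % size);
    }

    // Sliding windows: an event belongs to every window of length `size`
    // whose start lies in (timestamp - size, timestamp], stepped by `slide`.
    // This is why sliding windows need the extra slide parameter in Flink.
    static List<Long> slidingWindowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - (timestamp % slide);
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }
}
```

<p>For an event at time 12 with size 10 and slide 5, the tumbling assignment yields the single window starting at 10, while the sliding assignment yields two overlapping windows starting at 10 and 5; in the mashup tool this change is a single property edit.</p>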
        <sec id="sec-6-3-1">
          <title>Discussion</title>
<p>The approach used to model a Flink pipeline relies on three aspects: loading
data from a data source, transforming the data, and finally publishing the result via a data
sink. This also serves as a preliminary form of semantic validation, i.e. deciding whether the
positional hierarchy of a component is allowed or not. The user flow is parsed
and expressed as an abstract syntax tree, which is passed as input to the code
generator. Each node in the tree maps to a standalone implementation of the
Flink Core APIs. The code generator emits code for fixed sequences, such as opening
and closing a Flink session; for the nodes in the abstract syntax tree it wires
the standalone implementations of the APIs together, passing in the user parameters
and the output of the preceding node as input. The result is a runnable Flink
program, which is compiled, packaged and deployed on a cluster.</p>
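<p>As an illustration of this wiring step, the sketch below generates, for a linear flow, one statement per AST node and feeds each node's output variable into the next; the node names and the emitted helper names are hypothetical simplifications, not the actual aFlux generator:</p>

```java
import java.util.List;

class PipelineCodeGen {

    // Emits a straight-line, Flink-style program for a linear flow:
    // each AST node becomes one statement whose input is the
    // output variable of the preceding node.
    static String generate(List<String> nodes) {
        StringBuilder code = new StringBuilder();
        code.append("env = createExecutionEnvironment();\n"); // open the Flink session
        String prev = "env";
        for (int i = 0; i < nodes.size(); i++) {
            String out = "v" + i;
            code.append(out).append(" = ").append(nodes.get(i))
                .append("(").append(prev).append(");\n");
            prev = out; // next node consumes this node's output
        }
        code.append("env.execute();\n"); // submit the job (lazy evaluation triggers here)
        return code.toString();
    }
}
```

<p>Running the generator on a flow such as source, window, sink yields a chain where each generated variable is the input of the following call, mirroring how the tool threads outputs through the standalone API implementations.</p>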
<p>The current approach has several limitations, including the following.</p>
<p>Debugging run-time exceptions: The semantic validation techniques
described help the user create a flow which results in compilable Flink code.
Nevertheless, in the case of run-time exceptions, it becomes difficult to identify
the error from the logs and reverse-map the generated Flink program to identify
the corresponding graphical component on the front-end.</p>
<p>Integrate job monitoring: The current approach does not include job
monitoring and management features at the tool level. The user
can create a Flink job and consume the analytical result, but cannot
manage the job deployed on a Flink cluster. This is important from an end-user
perspective, as stream applications typically run indefinitely.</p>
<p>Seamless integration with Flink cluster: Currently, there is no seamless
integration between the mashup tool and the Flink run-time environment, hence
consuming the data analytics results is not a straightforward process.
As a result, real-time data visualisation suffers from time-delays,
unresponsiveness and very limited interactive capabilities. As of now, we rely
on third-party systems like Apache Kafka: the Flink application writes
its results to Kafka, and the mashup tool reads the data from there.</p>
        </sec>
      </sec>
      <sec id="sec-6-4">
        <title>Related Work</title>
        <p>
We did not find any mashup tool solutions which allow wiring components to
produce a Flink application. One of the closest existing solutions is Nussknacker [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
It is a tool currently in development which supports graphical Flink
programming. It consists of an engine whose aim is to transform the graphical model
created in the GUI into a Flink job. A standalone user interface application,
which allows both the development and deployment of Flink jobs, is written
in Scala and incorporates data persistence and a Flink client. Essentially, a user
needs to enter the data model of their use case into Nussknacker. Users with no
programming skills can benefit from the GUI to design a Flink job, send it to
a Flink cluster and monitor its execution. Nevertheless, it does not focus on
integrating data analytics with the business logic of an application, but rather
designs a data analytics Flink application based on a particular usage model.
IBM SPSS Modeller provides a GUI to develop data analytics flows involving
simple statistical algorithms, machine learning algorithms, data validation
algorithms and visualisation types [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Although SPSS Modeller is a tool built
for non-programmers to perform data analytics using pre-programmed blocks of
algorithms, it does not support wiring new Flink applications.
        </p>
      </sec>
      <sec id="sec-6-5">
        <title>Conclusion</title>
<p>We defined a new approach for high-level Flink programming from graphical
mashup tools to make the usage of stream analytics easier for domain
experts who are not necessarily programmers. We showed that this is feasible and evaluated what the right
abstractions are. Accordingly, (i) we analysed the Flink ecosystem, i.e. its distributed
dataflow programming model and the various abstraction levels offered to
program applications; we found the core APIs, based on the DataStream and DataSet
interfaces, to be the most suitable candidates for use in a graphical flow-based
programming paradigm, i.e. mashup tools; (ii) we adapted the eager
evaluation execution model of mashup tools to support designing Flink pipelines in a
lazy fashion and devised a novel generic approach for programming Flink from
graphical flows. The conceptual approach was implemented in aFlux, our JVM
actor-model-based mashup tool, and evaluated with real-time data from the
city of Santander.</p>
        <sec id="sec-6-5-1">
          <title>Acknowledgement</title>
<p>This work is part of the TUM Living Lab Connected Mobility (TUM LLCM)
project and has been funded by the Bavarian Ministry of Economic Affairs,
Energy and Technology (StMWi) through the Center Digitisation.Bavaria, an
initiative of the Bavarian State Government.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>IBM</given-names>
            <surname>Node-RED</surname>
          </string-name>
          ,
          <article-title>A visual tool for wiring the Internet of things</article-title>
          , http://nodered.org/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>IBM</given-names>
            <surname>SPSS</surname>
          </string-name>
          <article-title>Modeller</article-title>
          . https://www.ibm.com/products/spss-modeler, [Online; accessed 22-June-2018]
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Nussknacker. https://github.com/TouK/nussknacker, [Online; accessed 22- September-2018]</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Atzori</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iera</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morabito</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>The internet of things: A survey</article-title>
          .
<source>Computer Networks</source>
          <volume>54</volume>
          (
          <issue>15</issue>
          ),
          <fpage>2787</fpage>
          –
          <lpage>2805</lpage>
          (
          <year>2010</year>
          ). https://doi.org/10.1016/j.comnet.2010.05.010, http://www.sciencedirect.com/science/article/pii/S1389128610001568
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Daniel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matera</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
:
          <source>Mashups: Concepts, Models and Architectures</source>
          . Springer Berlin Heidelberg, Berlin, Heidelberg (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tzoumas</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          : Introduction to Apache Flink.
<source>O'Reilly</source>
          (September
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
<surname>Heath</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>How IBM's Node-RED is hacking together the internet of things</article-title>
          (
          <year>March 2014</year>
          ), http://www.techrepublic.com/article/node-red/, TechRepublic.com [Online; posted 13-March-2014]
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Iqbal</surname>
            ,
            <given-names>M.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soomro</surname>
            ,
            <given-names>T.R.</given-names>
          </string-name>
          :
          <article-title>Big data analysis: Apache storm perspective</article-title>
          .
<source>International Journal of Computer Trends and Technology</source>
          <volume>19</volume>
          (
          <issue>1</issue>
          ),
          <fpage>9</fpage>
          –
          <lpage>14</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kiran</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murphy</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monga</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dugan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baveja</surname>
            ,
            <given-names>S.S.</given-names>
          </string-name>
          :
<article-title>Lambda architecture for cost-effective batch and speed big data processing</article-title>
          .
          <source>In: Big Data (Big Data), 2015 IEEE International Conference on</source>
          . pp.
          <fpage>2785</fpage>
          –
          <lpage>2792</lpage>
          . IEEE (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mahapatra</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prehofer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gerostathopoulos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varsamidakis</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Stream analytics in iot mashup tools</article-title>
          .
          <source>In: 2018 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC)</source>
. pp.
          <fpage>227</fpage>
          –
          <lpage>231</lpage>
          (Oct
          <year>2018</year>
          ). https://doi.org/10.1109/VLHCC.2018.8506548
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Mahapatra</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gerostathopoulos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prehofer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gore</surname>
            ,
            <given-names>S.G.</given-names>
          </string-name>
          :
          <article-title>Graphical spark programming in iot mashup tool</article-title>
          .
          <source>In: The Fifth International Conference on Internet of Things: Systems, Management and Security</source>
          . p. In Press. IoTSMS (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Marz</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
,
          <string-name>
            <surname>Warren</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Big Data: Principles and best practices of scalable real-time data systems</article-title>
          . New York: Manning Publications Co. (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>F.A.F.</given-names>
          </string-name>
          :
<article-title>Modularizing Flink programs to enable stream analytics in IoT mashup tools</article-title>
          (
          <year>2018</year>
          ), http://oa.upm.es/52898/
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Santander City Council: Santander Open Data - REST API Documentation</surname>
          </string-name>
          (
          <year>2018</year>
          ), http://datos.santander.es/documentacion-api/
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Stonebraker</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cetintemel</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zdonik</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>The 8 requirements of real-time stream processing</article-title>
          .
          <source>ACM Sigmod Record</source>
          <volume>34</volume>
          (
          <issue>4</issue>
          ),
<fpage>42</fpage>
          –
          <lpage>47</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
<article-title>The Apache Software Foundation: Dataflow Programming Model</article-title>
          , v1.5 (
          <year>2018</year>
          ), https://ci.apache.org/projects/flink/flink-docs-release-1.5/concepts/programming-model.html
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>