<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Y. van den Wildenberg</string-name>
          <email>y.v.d.wildenberg@student.tue.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>W.W.L. Nuijten</string-name>
          <email>w.w.l.nuijten@student.tue.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>O. Papapetrou</string-name>
          <email>o.papapetrou@tue.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Eindhoven University of Technology</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Today's data deluge calls for novel, scalable data handling and processing solutions. Spark has emerged as a popular distributed in-memory computing engine for processing and analysing a large amount of data in parallel. However, the way parallel processing pipelines are designed is fundamentally diferent from traditional programming techniques, and hence most programmers are either unable to start using Spark, or are not utilising Spark to the maximum of its potential. This study describes an easier entry point into Spark. We design and implement a GUI that allows any programmer with knowledge of a standard programming language (e.g., Python or Java) to write Spark applications efortlessly and interactively, and to submit and execute them to large clusters.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Currently, data volumes are exploding in research and industry
because of the increasing data-intensive applications. As a
consequence, more and more disciplines face scalability concerns.
However, the barrier to utilize distributed Big Data platforms is
high for a multitude of reasons. Firstly, legacy code does not fit
very well in Big Data platforms. Secondly, most senior
programmers in the field – the ones usually taking the decisions – never
had formal training on Big Data platforms. On top of that, the
mere number of available Big Data platforms and pay-as-you-go
solutions (e.g. cloud solutions) complicate the right choice for the
user to scale-out, which increases again the barrier to entry.
Because of this, companies are frequently reluctant to invest in Big
Data platforms. In the long run, these companies will face either
inability to scale or they will face higher cost for maintaining
much stronger architectures in-house.</p>
      <p>
        Spark is possibly the most popular Big Data framework to
date. The framework implements the so-called master/slave
architecture. It includes a central coordinator (the driver), and many
distributed executors (the workers). Spark hides the complexity
of distributing the tasks and data across the workers, and
transparently handles fault tolerance. Nonetheless, the complexity
of Spark steepens the learning curve, especially for entry-point
Data Engineers [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Furthermore, Spark environment allows
for several pitfalls, such as using too many collects, no clear
understanding of caching and lazy evaluation, etc.
      </p>
      <p>
        So far, a number of attempts have been made in order to
simplify distributed computing. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] present a web-based GUI to
simplify MapReduce data processing, but supports only a small set of
pre-determined actions/processing nodes. Also, [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] introduced
an extension for RapidMiner1, Radoop, which runs distributed
processes on Hadoop via the RapidMiner UI. Another
(streaming) extension to RapidMiner Studio was recently released by
the INforE EU project2, which supports code-free creation of
optimized, cross-platform, streaming workflows running on one
of the following stream processing frameworks: Apache Flink,
Spark Structured Streaming or Kafka [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In addition, [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] present
RheemStudio, a visual IDE that creates code-free workflows on
(a subset of) Spark using RHEEM’s ecosystem [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to easily
specify cross-platform data analytic tasks. In the same line of work,
StreamSets Transformer3 ofers an UI based ETL pipeline where
data transformations can be executed on Spark. Legacy code can
be incorporated in StreamSets Transformers by writing custom
PySpark or Scala pipelines. Similarly, KNIME4, a visual
programming environment, supports extension of workflows with Spark
nodes. Spark code can be added in KNIME as a PySpark script
node in the workflow. However, in both cases, the developer
needs to understand the Spark API and semantics (RDDs, maps,
reduces). At the moment, we still lack a simple, open-source
graphical user interface that can be used out-of-the-box, to
support Spark newcomers – developers with potentially no training
and experience of Spark – to design, develop, and deploy complex
workflows in Spark that go beyond standard ETL processes. We
explicitly target a stand-alone tool that requires a very simple
installation process (e.g., unzipping a file, or clicking an icon)
and no servers/spark clusters, so that it can be used from novice
users. Such a tool will simplify Spark, lowering the learning curve
and initial cost for testing out integration and use of Spark in
mission-critical processes.
      </p>
      <p>This work introduces Easy Spark, a Graphical User Interface
(GUI) for guiding the developer and flattening out Spark’s
learning curve. Instead of having to write code, the user designs and
implements her big data application by designing a Directed
Acyclic Graph (DAG), inserting nodes, specifying the input and
output of each node, configuring the nodes, and linking them to
other nodes. Upon completion of the workflow, the tool translates
the model to executable Spark code, which can be submitted for
execution to a cluster, executed locally, or even saved for future
use. Beyond hiding the complexity of Spark by abstracting the job
to the natural DAG model of operators, the GUI itself prevents
bugs, e.g., by showing intermediary results to the developer, and
by restricting the developer to a model and to specific method
signatures that lend themselves to parallelism, yet without
requiring the introduction of Spark-specific concepts like Maps and
Reduces. Easy Spark can also serve more advanced requirements,
guiding experienced developers that do not know Spark to write
and to integrate code.</p>
      <p>The user group of Easy Spark is: (a) Spark newcomers and
students that want to quickly test out Spark, get a first introduction
to Spark’s basic ideas and capabilities, and construct a rapid
prototype/proof of concept, (b) data scientists and domain scientists
that are now hitting the limits of centralized computing, e.g., with
python, but do not have the formal training or extensive
programming experience to start with Spark, and, finally, (c) educators
and researchers that need novel methods to introduce, teach, and
2https://www.infore-project.eu/
3https://streamsets.com/products/dataops-platform/transformer-etl/
4https://www.knime.com
advertise Spark and similar Big Data frameworks. It is planned
that Easy Spark will be integrated in the syllabi of two courses
this year, in two diferent universities/diferent professors, and it
will be released as open-source after the conference.
2
2.1</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK AND BACKGROUND</title>
    </sec>
    <sec id="sec-3">
      <title>A MapReduce/Spark primer</title>
      <p>
        Apache Spark [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and Hadoop MapReduce [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] are the two most
popular open-source frameworks to date for large scale data
processing on commodity hardware.
      </p>
      <p>MapReduce breaks a job to multiple tasks and assigns the tasks
to the available workers. The programming API of MapReduce is
concentrated on the implementation of two (types of) methods,
mappers and reducers. Mappers take a pair (originating from a
ifle) as input and produce a set of intermediate key/value pairs.
Typical uses of mappers are filtering and transformations on the
input data. The results of the mappers are typically pushed to
the reducers. Each reducer instance (running the user-defined
reducer function) accepts a key and a list of values for that
particular key. Reducers typically serve the purpose of aggregating
all data for each key.</p>
      <p>
        Even though MapReduce is generally recognized as a highly
efective and eficient tool for Big Data analysis [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], it comes
with several limitations. During applications such as machine
learning and graph analytics, data needs to pass from several
processing iterations, i.e., a sequence of diferent jobs. MapReduces
reads its input data from secondary storage (typically network
drives), processes it, and writes it back for every job, posing a
significant I/O overhead. Furthermore, expressing of complex
programs with a series of maps and reduces proves to be
cumbersome. Spark ofers a solution to these problems, by the use of
Resilient Distributed Datasets (RDDs), and by an expansion of
the programming model to enable representing more complex
data flows that may consist of multiple stages, namely Directed
Acyclic Graphs (DAGs) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. RDDs are read-only and can be kept
in memory, enabling algorithms to iterate over the same data
many times eficiently. RDDs support two kinds of operations
organized as a DAG: transformations and actions.
Transformations are applied on one or more RDDs and return a new RDD.
Examples of transformations are map, flatMap, filter, join, and
union. Actions operate on an RDD and return back a non-RDD
answer, e.g., a count, or a conversion of the last RDD to an array
list. Due to the lazy-evaluation nature of Spark, transformations
are executed only when their result needs to be processed by an
action. Placement of actions in the Spark code is critical for the
performance of the code – invoking unnecessary actions may
nullify the benefits of distributed computation. A single Spark DAG
may involve a number of transformations and actions, which
are completed in a single run and can be optimized by the Spark
execution engine. As a result, Spark is generally much faster on
non-trivial jobs compared to Hadoop MapReduce [
        <xref ref-type="bibr" rid="ref13 ref3">3, 13</xref>
        ].
      </p>
      <p>The architecture of Spark is depicted in Fig. 1. A Spark
deployment consists of a driver program (SparkContext), many
executors (workers), and a cluster manager. The driver program
serves as the main program and is responsible for the entire
execution of the job. The SparkContext object connects to the cluster
manager, which sends tasks to the executors.
2.2</p>
    </sec>
    <sec id="sec-4">
      <title>Related work</title>
      <p>
        Even though MapReduce simplifies the complexity of
programming for distributed computing, it departs from the traditional
programming paradigms, imposing a steep learning curve to
programmers. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] developed a GUI in which users can design
their MapReduce workflow intuitively, without writing any code
and/or installing Hadoop locally. Users are only required to know
how to translate their tasks into target-value-action (TVA)
tuples, which reflect data processing into mappers and reducers.
Users pick objects as targets, whereas action filters or processes
the values with the same target. For example, to implement a
word-count code (i.e., count the frequency of each word in a text
ifle), we can identify the following TVA values: each word is a
target, 1 is a value, and sum is the desired action. The ofered GUI
ofers a predefined list of operations, such as merge, sum, count,
and multiply. However, the proposed GUI is inflexible in terms
of input data formats, and cannot support arbitrary code.
      </p>
      <p>
        RapidMiner Studio is an extendable and popular open-source
user-interface for data analytics. Radoop is an extension for
RapidMiner Studio and supports distributed computation on top of
Hadoop [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Furthermore, it supports integration with PySpark
and SparkR scripts [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Nonetheless, the user is still required to
know Spark’s API to include custom code for execution on top of
Spark’s distributed environment. Another more recent extension
to RapidMiner Studio is [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which supports code-free creation
of optimized, cross-platform, streaming workflows over various
clusters, including Spark. To the best of our knowledge, both
Radoop and [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] do not include data information available on the
operator level, whereas Easy Spark tries to reduce common errors
by explicitly guiding the user, i.e., extracting data output types at
individual nodes, and showing sample intermediate results when
constructing the DAG.
      </p>
      <p>StreamSets Transformer is a modern ETL pipelines engine for
building data transformation, stream processing, and machine
learning operations. It ofers an easy drag-and-drop web-based
UI in which users can create pipelines that are executed on Spark.
It is possible to write custom Pyspark and Scala code as part of
the user’s data pipeline. However, the user should be familiar to
Spark’s API to include custom Spark code into their pipeline5.</p>
      <p>KNIME 6 is an open-source platform for drag-and-drop data
analysis, visualization, machine learning, statistics, and ETL.
KNIME allows the user to create workflows via their GUI without,
or with only minimal programming. KNIME consists of an
extension7 that creates Spark nodes, which can execute Spark SQL
queries, create Spark MLlib models and allow for data blending
and manipulation. Spark Streaming and Spark GraphX are not
integrated in the extension. The extension consists of a PySpark
5https://streamsets.com/documentation/transformer/latest/help/index.html
6https://www.knime.com
7https://www.knime.com/knime-extension-for-apache-spark
script node, where users can add their own code. However,
similar to StreamSets and RapidMiner Studio, KNIME requires from
the user to be familiar with the Spark API.</p>
      <p>
        Lastly, RheemStudio [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is a visual IDE on top of the
opensource platform system RHEEM [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which enables data
processing over multiple data processing platforms, including Spark.
RheemStudio helps developers to build applications easily and
intuitively. It allows for a drag-and-drop generated RHEEM plan
where users can drag-and-drop operators from the RHEEM
operators panel to the drawing surface to construct a RHEEM plan. It
is then shown to the user how the previously presented RHEEM
plan can be specified into RHEEMLatin – the declarative
language for RHEEM. Next, the user is invited to select one of the
operators to revise and is asked to inject her own logics (couple
of lines of code) which is then checked to be syntactically correct.
Furthermore, RheemStudio consists of both a dashboard for
displaying RHEEM’s internals in detail, which allows for immediate
interaction with the developer, and a monitor for keeping track
of the status of the data analytic tasks. Compared to
RheemStudio, Easy Spark focuses on simplicity and more guidance for the
novice user. It displays intermediary results to the user which
are useful for debugging, and it does not require learning an
additional language like RHEEMLatin, since it does not need to
integrate with multiple data processing languages. This focus to
simplicity makes Easy Spark an ideal entry point to Spark, for
non-Spark programmers.
      </p>
      <p>Our contribution improves the state-of-the-art in multiple
directions. First, it supports newcomers to start writing Spark
without previous Spark knowledge since it avoids using the Spark
semantics (RDDs, maps, and reduces). Instead it uses constructs
that are identical, or very similar to the standard programming
constructs of traditional languages (python and java). It also
supports coding arbitrary code, which is useful for integrating
legacy code, and it provides guidance to the user during the
design of the workflow, e.g., by providing the expected structure
(data format) and sample answers of each intermediary operation
during the design of the workflow. This enables the developer to
avoid pitfalls (e.g., calling too many collects), and quickly identify
bugs. Finally, it can function as a stand-alone tool, not requiring
complex installation and maintenance of large clusters.
3</p>
    </sec>
    <sec id="sec-5">
      <title>EASY SPARK</title>
      <p>Recall that the target group of Easy Spark includes a diverse
set: Spark newcomers, CS and non-CS students, educators,
researchers, and professional data/domain scientists. Naturally,
each of these categories comes with diferent levels of expertise,
experience, and problems of diferent complexity. To support
all of them, we need a powerful and intuitive GUI where users
can visually control the flow of the data. In Easy Spark, every
operation on the data is represented by a node, and the user can
connect nodes to form a computation path, which is a Directed
Acyclic Graph (DAG) of computation nodes and edges that
represent the flow of data between nodes. We start by providing a
high-level overview of Easy Spark, with a special emphasis on
how it guides the developer and prevents traditional errors. We
then present a detailed discussion of the ofered functionality.
3.1</p>
    </sec>
    <sec id="sec-6">
      <title>A high-level overview of Easy Spark</title>
      <p>Easy Spark contains two types of buttons: (a) nodes of diferent
types, and (b) configuration buttons. Figure 2 depicts the currently
supported DAG node types and configuration buttons.</p>
      <p>Node types. Input and Output are for choosing the data input
and results output files, and configuring how these should be
read/written (e.g., format, delimiters). For-each, Filter, and
Aggregate, correspond to the Spark functionalities of (flat-)map, filter,
and reduce/reduceByKey respectively. Model and Evaluate allow
the user to train an ML model, e.g., an SVM, and to use it for
classification/predictions. Clicking on any of the above buttons
adds a node with the matching color in the DAG, and allows the
user to configure it (e.g., copy-paste legacy code into an editor, or
configure the filter predicates). Section 3.3 discusses the precise
meaning and configuration parameters of these node types.</p>
      <p>Configuration buttons. Button Options allows the user to
re-open the configuration panel on the selected node. Calculate
Path executes the DAG on the full dataset – or submits it for
execution to a cluster – and writes the results to the output node,
whereas Show code depicts the generated Spark Code. The precise
functionality of these buttons will be detailed in Section 3.4.</p>
      <p>Example. Fig. 3. depicts an example DAG that executes a
fairly common ML task: training a ML model on a part of the
dataset, and testing the produced model on the remaining part.
The user first configures the two input nodes, by selecting the
correct file names, and then adds a model training node (in this
case, selected to generate an SVM). The output of the model node
is the model itself, which is then passed to an evaluator node,
together with the testing partition of the dataset. The output of
the evaluator is finally saved into the output node.</p>
      <p>Notice that the described workflow does not include any
Sparkrelated concepts. We will see soon that the Spark fundamentals
remain hidden from the user.
3.2</p>
    </sec>
    <sec id="sec-7">
      <title>Guiding the user</title>
      <p>Diferent mechanisms are in place to guide the user through the
DAG design.
• Showing sample intermediary results. When adding a node, the
developer is able to immediately preview a small number of input
and output results of that node (see Fig. 4 for an example). The
preview is computed locally on a small part of the data such that
results can be shown with zero waiting time.
• Identifying and naming the attributes, and propagating the data
formats and structures. The developer configures a structure and
format for the data input when configuring the input nodes (part
of this is inferred if the data files contain headers). This
information is propagated in the following nodes, i.e., the user can see
and use the attribute names. When the code of an intermediary
node (e.g., a For-each node) modifies the data format, the
developer is supported to update the data format accordingly.
• Disabling buttons that are incompatible with the current state.
The current state and current selection determines the buttons
that are enabled and/or disabled. For example, when an output
node is selected, only the Calculate path button is enabled.
• Hiding the Spark semantics from the developer. Notice that the
used operations are not specific to Spark. E.g., the developer does
not need to understand map, reduce, and RDDs. She does not
need to write Spark boilerplate code, or directly submit the code
for execution on a server. All semantics of Easy Spark can be
easily understood by an everyday Python developer/data scientist.
• Preventing common errors. Besides hiding the complexity of
parallelism which is frequently a source of error, Easy Spark
supports the developer on using named and typed data structures,
and shows result samples at each intermediary node. Therefore
the developer can quickly detect most types of errors.
3.3</p>
    </sec>
    <sec id="sec-8">
      <title>Node types</title>
      <p>Nodes have to represent operations that are both familiar to the
user and useful in the context of data processing. To serve this
purpose, Easy Spark comes with a core set of node types, and
allows for an easy extension by implementing additional nodes.
The current node types are:
• Input. An input node serves as a data source. On creation of
an input node, the user chooses the data source, and the input
node handles the creation of the corresponding RDD, and
parallelizes the data to the available worker machines. Easy Spark
also prompts the user whether or not the data source contains
header data, and if so, the header is parsed and propagated to
subsequent nodes on the computation path. If no header is in the
data, the user is prompted to supply the related information.
• For-each. The for-each node allows the developer to enter code
that would normally be put in the body of a for-loop over all
records. The developer is prompted with a box for specifying the
desired behaviour of the node, i.e., to include the code that needs
to be executed. The node itself then chooses what Spark code can
be executed in order to replicate this behaviour, either through
map or flatMap functions.
• Aggregate. The aggregation node is responsible for aggregating
data over diferent groups in the data. The user is prompted to
supply the level of aggregation and the type of aggregation, and
the node itself generates key-value pairs and employs a reduce
or reduceByKey function in order to aggregate the data over the
given level of aggregation.
• Filter. The filter node handles the filtering of unwanted data.
The user can supply the tool with a boolean expression (or code
that will return a boolean expression) being evaluated for every
row in the data. This way, data is excluded from the dataset.
• Model. The model node is the gateway to the MLLib library
in Spark. The model node prompts the user to supply the split
between features and labels, and trains a Machine Learning (ML)
model on the given data. The complexity of training a ML model
is concealed from the user, such that the user can intuitively
create ML pipelines that run on Spark clusters.
• Evaluate. This node is responsible for evaluating the results of
a ML model trained by a model node on data that comes from
another node. This node supplies the user with the possibility of
evaluating a model on a large dataset, since evaluation will be
parallelized across nodes.
• Output. The output node executes each node within the path
of the graph and writes the results to a text file from which the
user picks the preferred location.
3.4</p>
    </sec>
    <sec id="sec-9">
      <title>Configuration buttons</title>
      <p>This subsection lists the purpose of the configuration buttons.
• Calculate path. By clicking on this button, the data of the node
that was previously selected is collected and written to disk. This
serves as a clear endpoint of the calculations for the user, and
supplies the tool with a clear target node for which we can apply
the transformations defined by the nodes on the path from the
data source nodes to the previously selected node that activates
the calculation. Note: it is possible to press this button multiple
times, for diferent nodes (and paths) in order to derive multiple
outputs from the DAG.
• Options. The options button enables the developer to show
additional information or set configuration parameters for the
selected node. The options for all nodes are as follows:
− Input presents a preview of ten rows from the data source (see
Fig. 4)
− For-each consists of a drop-down menu for the output type
(one output or multiple outputs), a drop-down menu for the level
of the for-loop based on the (provided) header data, an entry box
where the user can enter the code that needs to be execute for
every level and an entry box with the structure of the output.
− Aggregate shows a drop-down menu for the aggregation
function (sum or count), a drop-down menu for the key (one of the
headers) to aggregate on and an entry box to provide the
structure of the output.
− Filter includes a drop-down menu for the filter level (entire row
or attribute in a row) and an entry box where the filter condition
should be entered.
− Model has a drop-down menu for the type of model (SVM) and
two entry boxes where the statement that retrieves the label and
features from each data entry should be entered.
− Evaluate has two drop-down menus from which the model
node and data node should be selected and two entry boxes to
enter statements that retrieve the label and features.
• Show code. By clicking this button, the user can see the
generated Spark code on the selected node (see, e.g., Figure 5). If no
node is selected, the user gets the generated Spark code for the
whole program.
4</p>
    </sec>
    <sec id="sec-10">
      <title>CASE STUDIES</title>
      <p>We now discuss three case studies that are supported by the
tool. The goal of our discussion is to illustrate the simplicity and
expressive power of the tool, and to demonstrate the ease of use
with which Spark architectures can be designed.</p>
      <sec id="sec-10-1">
        <title>Algorithm 1: LetterCount.</title>
      </sec>
      <sec id="sec-10-2">
        <title>Input: Textfile f, dictionary Counts;</title>
        <p>for line in f do
for word in line do
for letter in word do
if letter in Counts then</p>
        <p>Counts[letter] += 1;
else
end</p>
        <p>Counts[letter] = 1;
end
end
end
4.1</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>Applying Easy Spark to LetterCount</title>
      <p>We start by parallelizing a workflow that counts the number of
appearances of each letter in a large text file. The pseudocode of
the algorithm (algorithm 1) involves a triple nested loop – first
the text is broken into lines, then each line is broken into words,
and then each word is broken into letters.</p>
      <p>To implement this using Easy Spark, we need three nodes: a
(yellow) Input node which contains the data source, an (orange)
Aggregation node to count, and a (green) Output node. Figure 6
depicts the drawn graph in the GUI for LetterCount.</p>
      <p>When creating the Input node, the user is asked to provide
the header (since a plain text file does not contain a header).
For the sake of this example, we choose the header to be of the
following structure: sentence - word - letter. This means that the
expected output of the input node we just added will contain
three diferent representations of the data. Notice that the names
are arbitrarily chosen by the user, but give a reference point for
the next nodes.</p>
      <p>Next, we create the Aggregation node and set the right options:
aggregation type is set to count and aggregation key is set to
letter. Before we can calculate the computation path, the GUI
needs to know how to get from sentence to word, and from word
to letter. This information is provided by the user through two
pop-up windows (Figure 7). Lastly, we add the Output node from
which we can initiate code generation/execution.
4.2</p>
    </sec>
    <sec id="sec-12">
      <title>Distributed Video Analysis</title>
      <p>
        In this use case we will apply our tool to parallelize the data
of individual frames in an input video, and perform necessary
computations on each frame. In particular, we will apply an
ML model that extracts keypoint predictions from a pre-trained
network [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and use these predictions to extract relevant angles
between keypoints for motion analysis. Since currently our Input
node does not natively support breaking of videos to frames, we
use a third-party tool to extract and save all frames from the
video as a csv file, and provide this as an input. After importing
the data, we use a For-each node to extract the keypoints by
running the pre-trained network, followed by another For-each
node that extracts the relevant information from the keypoints.
The internal code of the two For-each nodes is copied from the
centralized implementation. The final DAG appears in Fig. 8.
4.3
      </p>
    </sec>
    <sec id="sec-13">
      <title>Engineering and evaluating a Machine</title>
    </sec>
    <sec id="sec-14">
      <title>Learning pipeline</title>
      <p>Our next use case considers a standard problem when
engineering an ML pipeline: diferent models (or possibly the same model
with diferent hyperparameters) need to be trained and tested
with the training and the holdout datasets, in order to compute
the accuracy of each model and choose the best one. Figure 9
presents a simple DAG that trains and evaluates an SVM model.
The first step is to configure the data input – the training and the
hold-out data set. Both input files are passed through a For-each
node for pre-processing. The training data is used for building
a model (an SVM), which is then passed into an Evaluate node
together with the test data. To count the misclassifications we
add a Filter node that is responsible for filtering out all
correctlyclassified cases, followed by an Aggregation node which will
count the number of results. Observe that the DAG is suficient
to obtain a clear idea of the data flow, and that we can represent
complex operations on the data by a relatively small amount of
nodes, i.e., the nodes have a high expressive power.
5</p>
    </sec>
    <sec id="sec-15">
      <title>CURRENT AND FUTURE WORK</title>
      <p>Easy Spark is under development. It can already generate code for
complex workflows and submit it for execution, but it still misses
some of the envisioned functionalities. In this section, we discuss
our current and planned work and the involved challenges.</p>
      <p>• Adding new node types. Our efort with the current version
of Easy Spark was to provide a zero-learning-curve
proof-ofconcept tool that can be used in teaching, and as the entry point
of a newcomer in Spark. It was out of our scope to provide a
complete visual programming language equivalent to the Spark
capabilities. Being convinced with the usefulness of the tool, we
are now extending it with more node types to cover
frequentlyused functionality, e.g., Spark SQL, GraphX, Spark Streaming,
and the full MLlib library. It is fairly straightforward to add more
node types. We are also redesigning the user interface for helping
the users to find the desired node types quickly, e.g., have an ML
tab, where all the nodes related to MLlib will be added.
• Improved support for legacy code. Easy Spark includes support
for including hand-written legacy code, e.g., when writing code
for For-each nodes. Currently, Aggregate nodes enable selecting
from a pre-defined set of functions, such as sum and count, but
it is not challenging to enable arbitrary code. Support for legacy
code can be further improved by recognizing the signature of the
functions, and automatically extracting the output data types,
instead of asking the user to update the data types manually
whenever these change. Notice that this functionality is feasible,
as it is already present in diferent IDE tools.
• Improved error detection. Automatic extraction of function
signatures can be leveraged to identify and present errors to the
user that relate to the data types and formats, e.g., the output
of one node does not agree with the expected input of the next
node. Again, this is supported by most modern IDEs.
• Showing sample intermediate results. Presentation of sample
intermediate results (a sample of the output of each node)
during design time is often very useful, for getting a rough idea
on what is happening up until that point, and for keeping track
of the data format that the next node will receive. Our current
implementation presents sample intermediary results (the first
10 results) only at some nodes. Notice, however, that some node
types, e.g., Filter, For-each, can change this number. The number
of sample results shown at each node should be adapted in order
to still get meaningful samples throughout the workflow.
Furthermore, in some cases, the distribution of the sample results is
also important to get a meaningful sample output, e.g., training
of a binary SVM classifier requires representative samples from
both classes. Starting with a huge number of samples to ensure
that all nodes will have an output is also not an option, since
this will decrement the performance of Easy Spark. We are now
developing methods that adaptively choose the samples at each
node, in order to derive meaningful sample results at all nodes.
• Supporting more data input formats and streaming data. This
will further reduce the complexity of loading the data, and ofer
better representations of the data.
• Improved support for Spark-related errors. Some common Spark
errors, e.g., OutOfMemory exceptions, can (mostly) be addressed
with a few standard steps, e.g., increase the RAM available to
the executors or driver, avoid collects, or increase the number
of partitions. In the future we will detect these exceptions and
propose these standard steps to the user.
• Integrating the Spark UI. It will be useful – and with educational
value – to integrate Spark’s Web UI in Easy Spark in order to allow
the user to monitor the status of the developed Spark application,
resource consumption, Spark cluster, and Spark configurations.</p>
    </sec>
    <sec id="sec-16">
      <title>6 CONCLUSIONS</title>
      <p>In this paper we proposed Easy Spark, a Graphical User Interface
for easily designing Spark architectures for Big Data Engineering.
The GUI enables the developer to design complex Spark DAGs
with arbitrary functionality, by masking Spark constructs and
concepts behind traditional programming constructs that any
trained data scientist or computer scientist is able to understand,
use, and configure. We examined three use cases and showed how
Easy Spark can be used to generate executable Spark code. We
also discussed the mechanisms that are currently implemented
in Easy Spark to guide the user and prevent bugs, and elaborated
on our current and future work to extend and improve it.</p>
    </sec>
    <sec id="sec-17">
      <title>7 ACKNOWLEDGEMENTS</title>
      <p>This work was partially funded by the EU H2020 project
SmartDataLake (825041).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <fpage>2020</fpage>
          .
          <article-title>Spark Cluster Mode Overview</article-title>
          . https://spark.apache.org/docs/latest/ cluster-overview.html.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <year>2021</year>
          .
          <article-title>RapidMiner Radoop Feature List</article-title>
          . https://rapidminer.com/products/ radoop/feature-list/. [Online; accessed 27-January-2021].
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Bansod</surname>
            <given-names>A.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Eficient Big Data Analysis with Apache Spark in HDFS</article-title>
          .
          <source>International Journal of Engineering and Advanced Technology</source>
          <volume>4</volume>
          ,
          <issue>6</issue>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Divy</given-names>
            <surname>Agrawal</surname>
          </string-name>
          , Sanjay Chawla, Bertty Contreras-Rojas, Ahmed Elmagarmid, Yasser Idris, Zoi Kaoudi,
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Kruse</surname>
          </string-name>
          , Ji Lucas, Essam Mansour, Mourad Ouzzani, Paolo Papotti,
          <string-name>
            <surname>Jorge-Arnulfo</surname>
            Quiané-Ruiz,
            <given-names>Nan</given-names>
          </string-name>
          <string-name>
            <surname>Tang</surname>
            , Saravanan Thirumuruganathan, and
            <given-names>Anis</given-names>
          </string-name>
          <string-name>
            <surname>Troudi</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>RHEEM: Enabling Cross-Platform Data Processing: May the Big Data Be with You</article-title>
          !
          <source>Proc. VLDB Endow</source>
          .
          <volume>11</volume>
          ,
          <issue>11</issue>
          (
          <year>July 2018</year>
          ),
          <fpage>1414</fpage>
          -
          <lpage>1427</lpage>
          . https://doi.org/10.14778/3236187.3236195
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Jefrey</given-names>
            <surname>Dean</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sanjay</given-names>
            <surname>Ghemawat</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>MapReduce: simplified data processing on large clusters</article-title>
          .
          <source>Commun. ACM 51</source>
          ,
          <issue>1</issue>
          (
          <year>2008</year>
          ),
          <fpage>107</fpage>
          -
          <lpage>113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Nikos</given-names>
            <surname>Giatrakos</surname>
          </string-name>
          , David Arnu,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bitsakis</surname>
          </string-name>
          , Antonios Deligiannakis,
          <string-name>
            <given-names>M.</given-names>
            <surname>Garofalakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Klinkenberg</surname>
          </string-name>
          , Aris Konidaris, Antonis Kontaxakis,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kotidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Vasilis</given-names>
            <surname>Samoladas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Simitsis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>George</given-names>
            <surname>Stamatakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Temme</surname>
          </string-name>
          , Mate Torok, Edwin Yaqub, Arnau Montagud,
          <string-name>
            <given-names>M.</given-names>
            <surname>Leon</surname>
          </string-name>
          , Holger Arndt, and
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Burkard</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>INforE: Interactive Cross-platform Analytics for Everyone</article-title>
          .
          <source>Proceedings of the 29th ACM International Conference on Information &amp; Knowledge Management</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Georgia Gkioxari, Piotr Dollár, and
          <string-name>
            <given-names>Ross</given-names>
            <surname>Girshick</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <string-name>
            <surname>Mask</surname>
          </string-name>
          R-CNN.
          <source>arXiv:cs.CV/1703.06870</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Shih</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Simplifying MapReduce data processing</article-title>
          .
          <source>International Journal of Computational Science and Engineering</source>
          <volume>8</volume>
          (
          <year>2013</year>
          ). https://doi.org/10.1504/ijcse.
          <year>2013</year>
          .055353
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Ji</given-names>
            <surname>Lucas</surname>
          </string-name>
          , Yasser Idris, Bertty Contreras-Rojas,
          <article-title>Jorge-Arnulfo Quiané-Ruiz, and</article-title>
          <string-name>
            <given-names>S.</given-names>
            <surname>Chawla</surname>
          </string-name>
          .
          <year>2018</year>
          . RheemStudio:
          <string-name>
            <surname>Cross-Platform Data Analytics Made Easy</surname>
          </string-name>
          .
          <source>2018 IEEE 34th International Conference on Data Engineering</source>
          (
          <year>2018</year>
          ),
          <fpage>1573</fpage>
          -
          <lpage>1576</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Zoltán</surname>
            <given-names>Prekopcsák</given-names>
          </string-name>
          , Gabor Makrai,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henk</surname>
          </string-name>
          , and
          <string-name>
            <surname>Csaba</surname>
          </string-name>
          Gáspár-Papanek.
          <year>2011</year>
          .
          <article-title>Radoop: Analyzing Big Data with RapidMiner and Hadoop</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Salloum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dautov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            and
            <surname>Peng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Big data analytics on Apache Spark</article-title>
          .
          <source>Int. J. Data Sci. Anal</source>
          . (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Juwei</surname>
            <given-names>Shi</given-names>
          </string-name>
          , Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang,
          <string-name>
            <surname>Berthold Reinwald</surname>
            , and
            <given-names>Fatma</given-names>
          </string-name>
          <string-name>
            <surname>Özcan</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Clash of the Titans: MapReduce vs</article-title>
          .
          <source>Spark for Large Scale Data Analytics. Proc. VLDB Endow</source>
          .
          <volume>8</volume>
          ,
          <issue>13</issue>
          (Sept.
          <year>2015</year>
          ),
          <fpage>2110</fpage>
          -
          <lpage>2121</lpage>
          . https://doi.org/10.14778/2831360.2831365
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Verma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Mansuri</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Jain</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Big data management processing with Hadoop MapReduce and spark technology: A comparison</article-title>
          .
          <source>2016 Symposium on Colossal Data Analysis and Networking (CDAN)</source>
          (
          <year>2016</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Matei</surname>
            <given-names>Zaharia</given-names>
          </string-name>
          , Reynold S. Xin, Patrick Wendell,
          <string-name>
            <surname>Tathagata Das</surname>
          </string-name>
          ,
          <string-name>
            <surname>Michael Armbrust</surname>
            , Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman,
            <given-names>Michael J.</given-names>
          </string-name>
          <string-name>
            <surname>Franklin</surname>
            , Ali Ghodsi,
            <given-names>Joseph</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalez</surname>
            ,
            <given-names>Scott</given-names>
          </string-name>
          <string-name>
            <surname>Shenker</surname>
            , and
            <given-names>Ion</given-names>
          </string-name>
          <string-name>
            <surname>Stoica</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Apache Spark: a unified engine for big data processing</article-title>
          .
          <source>Commun. ACM</source>
          <volume>59</volume>
          ,
          <issue>11</issue>
          (
          <year>2016</year>
          ),
          <fpage>56</fpage>
          -
          <lpage>65</lpage>
          . https://doi.org/10.1145/2934664
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>