=Paper=
{{Paper
|id=Vol-2841/SIMPLIFY_7
|storemode=property
|title=Easy Spark
|pdfUrl=https://ceur-ws.org/Vol-2841/SIMPLIFY_7.pdf
|volume=Vol-2841
|authors=Ylaise van den Wildenberg,Wouter W.L. Nuijten,Odysseas Papapetrou
|dblpUrl=https://dblp.org/rec/conf/edbt/WildenbergNP21
}}
==Easy Spark==
Y. van den Wildenberg, W.W.L. Nuijten, O. Papapetrou (Eindhoven University of Technology)
y.v.d.wildenberg@student.tue.nl, w.w.l.nuijten@student.tue.nl, o.papapetrou@tue.nl

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

Today's data deluge calls for novel, scalable data handling and processing solutions. Spark has emerged as a popular distributed in-memory computing engine for processing and analysing large amounts of data in parallel. However, the way parallel processing pipelines are designed is fundamentally different from traditional programming techniques, and hence most programmers are either unable to start using Spark, or do not utilise Spark to the maximum of its potential. This study describes an easier entry point into Spark. We design and implement a GUI that allows any programmer with knowledge of a standard programming language (e.g., Python or Java) to write Spark applications effortlessly and interactively, and to submit and execute them on large clusters.

1 INTRODUCTION

Currently, data volumes are exploding in research and industry because of the increasing number of data-intensive applications. As a consequence, more and more disciplines face scalability concerns. However, the barrier to utilizing distributed Big Data platforms is high for a multitude of reasons. Firstly, legacy code does not fit very well in Big Data platforms. Secondly, most senior programmers in the field – the ones usually taking the decisions – never had formal training on Big Data platforms. On top of that, the sheer number of available Big Data platforms and pay-as-you-go solutions (e.g., cloud solutions) complicates the choice of the right way to scale out, which raises the barrier to entry even further. Because of this, companies are frequently reluctant to invest in Big Data platforms. In the long run, these companies will face either an inability to scale, or higher costs for maintaining much stronger architectures in-house.

Spark is possibly the most popular Big Data framework to date. The framework implements the so-called master/slave architecture: it includes a central coordinator (the driver) and many distributed executors (the workers). Spark hides the complexity of distributing the tasks and data across the workers, and transparently handles fault tolerance. Nonetheless, the complexity of Spark steepens the learning curve, especially for entry-level Data Engineers [11]. Furthermore, the Spark environment allows for several pitfalls, such as using too many collects, or having no clear understanding of caching and lazy evaluation.

So far, a number of attempts have been made to simplify distributed computing. [8] present a web-based GUI to simplify MapReduce data processing, but it supports only a small set of pre-determined actions/processing nodes. Also, [10] introduced Radoop, an extension for RapidMiner (https://rapidminer.com/) that runs distributed processes on Hadoop via the RapidMiner UI. Another (streaming) extension to RapidMiner Studio was recently released by the INforE EU project (https://www.infore-project.eu/), which supports code-free creation of optimized, cross-platform, streaming workflows running on one of the following stream processing frameworks: Apache Flink, Spark Structured Streaming, or Kafka [6]. In addition, [9] present RheemStudio, a visual IDE that creates code-free workflows on (a subset of) Spark using RHEEM's ecosystem [4] to easily specify cross-platform data analytic tasks. In the same line of work, StreamSets Transformer (https://streamsets.com/products/dataops-platform/transformer-etl/) offers a UI-based ETL pipeline where data transformations can be executed on Spark. Legacy code can be incorporated in StreamSets Transformer by writing custom PySpark or Scala pipelines. Similarly, KNIME (https://www.knime.com), a visual programming environment, supports extension of workflows with Spark nodes. Spark code can be added in KNIME as a PySpark script node in the workflow. However, in both cases the developer needs to understand the Spark API and semantics (RDDs, maps, reduces). At the moment, we still lack a simple, open-source graphical user interface that can be used out-of-the-box to support Spark newcomers – developers with potentially no training and experience of Spark – to design, develop, and deploy complex workflows in Spark that go beyond standard ETL processes. We explicitly target a stand-alone tool that requires a very simple installation process (e.g., unzipping a file, or clicking an icon) and no servers or Spark clusters, so that it can be used by novice users. Such a tool will simplify Spark, lowering the learning curve and the initial cost of testing out the integration and use of Spark in mission-critical processes.

This work introduces Easy Spark, a Graphical User Interface (GUI) for guiding the developer and flattening out Spark's learning curve. Instead of having to write code, the user designs and implements her big data application by designing a Directed Acyclic Graph (DAG): inserting nodes, specifying the input and output of each node, configuring the nodes, and linking them to other nodes. Upon completion of the workflow, the tool translates the model to executable Spark code, which can be submitted for execution to a cluster, executed locally, or saved for future use. Beyond hiding the complexity of Spark by abstracting the job to the natural DAG model of operators, the GUI itself prevents bugs, e.g., by showing intermediary results to the developer, and by restricting the developer to a model and to specific method signatures that lend themselves to parallelism, yet without requiring the introduction of Spark-specific concepts like Maps and Reduces. Easy Spark can also serve more advanced requirements, guiding experienced developers that do not know Spark to write and integrate code.

The user group of Easy Spark is: (a) Spark newcomers and students that want to quickly test out Spark, get a first introduction to Spark's basic ideas and capabilities, and construct a rapid prototype/proof of concept, (b) data scientists and domain scientists that are now hitting the limits of centralized computing, e.g., with Python, but do not have the formal training or extensive programming experience to start with Spark, and, finally, (c) educators and researchers that need novel methods to introduce, teach, and advertise Spark and similar Big Data frameworks. It is planned that Easy Spark will be integrated in the syllabi of two courses this year, at two different universities with different professors, and it will be released as open-source after the conference.
2 RELATED WORK AND BACKGROUND

2.1 A MapReduce/Spark primer

Apache Spark [14] and Hadoop MapReduce [5] are the two most popular open-source frameworks to date for large-scale data processing on commodity hardware.

MapReduce breaks a job into multiple tasks and assigns the tasks to the available workers. The programming API of MapReduce is concentrated on the implementation of two (types of) methods, mappers and reducers. Mappers take a pair (originating from a file) as input and produce a set of intermediate key/value pairs. Typical uses of mappers are filtering and transformations on the input data. The results of the mappers are typically pushed to the reducers. Each reducer instance (running the user-defined reducer function) accepts a key and a list of values for that particular key. Reducers typically serve the purpose of aggregating all data for each key.

Even though MapReduce is generally recognized as a highly effective and efficient tool for Big Data analysis [13], it comes with several limitations. In applications such as machine learning and graph analytics, data needs to pass through several processing iterations, i.e., a sequence of different jobs. MapReduce reads its input data from secondary storage (typically network drives), processes it, and writes it back for every job, posing a significant I/O overhead. Furthermore, expressing complex programs as a series of maps and reduces proves to be cumbersome. Spark offers a solution to these problems through Resilient Distributed Datasets (RDDs), and through an expansion of the programming model that can represent more complex data flows consisting of multiple stages, namely Directed Acyclic Graphs (DAGs) [12]. RDDs are read-only and can be kept in memory, enabling algorithms to iterate over the same data many times efficiently. RDDs support two kinds of operations organized as a DAG: transformations and actions. Transformations are applied on one or more RDDs and return a new RDD. Examples of transformations are map, flatMap, filter, join, and union. Actions operate on an RDD and return a non-RDD answer, e.g., a count, or a conversion of the last RDD to an array list. Due to the lazy-evaluation nature of Spark, transformations are executed only when their result needs to be processed by an action. Placement of actions in the Spark code is critical for the performance of the code – invoking unnecessary actions may nullify the benefits of distributed computation. A single Spark DAG may involve a number of transformations and actions, which are completed in a single run and can be optimized by the Spark execution engine. As a result, Spark is generally much faster on non-trivial jobs compared to Hadoop MapReduce [3, 13].
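To make the transformation/action distinction concrete, the following is a minimal PySpark sketch; the file name, its assumed contents (one number per line), and the filtering threshold are illustrative assumptions and do not come from the paper.

 from pyspark import SparkContext

 sc = SparkContext(appName="PrimerExample")

 lines = sc.textFile("data.txt")             # transformation: nothing is read yet
 numbers = lines.map(lambda s: int(s))       # transformation: recorded lazily
 large = numbers.filter(lambda x: x > 100)   # transformation: still no execution

 print(large.count())                        # action: triggers execution of the whole lineage

 sc.stop()

Only the final count() forces Spark to read the file and run the chained transformations; placing additional actions (e.g., collect()) in between would trigger extra, possibly redundant, computation.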
The architecture of Spark is depicted in Fig. 1. A Spark deployment consists of a driver program (SparkContext), many executors (workers), and a cluster manager. The driver program serves as the main program and is responsible for the entire execution of the job. The SparkContext object connects to the cluster manager, which sends tasks to the executors.

Figure 1: Spark Architecture (figure taken from [1])

2.2 Related work

Even though MapReduce simplifies the complexity of programming for distributed computing, it departs from traditional programming paradigms, imposing a steep learning curve on programmers. [8] developed a GUI in which users can design their MapReduce workflow intuitively, without writing any code or installing Hadoop locally. Users are only required to know how to translate their tasks into target-value-action (TVA) tuples, which map data processing onto mappers and reducers. Users pick objects as targets, whereas the action filters or processes the values with the same target. For example, to implement a word-count code (i.e., count the frequency of each word in a text file), we can identify the following TVA values: each word is a target, 1 is a value, and sum is the desired action. The GUI offers a predefined list of operations, such as merge, sum, count, and multiply. However, the proposed GUI is inflexible in terms of input data formats, and cannot support arbitrary code.

RapidMiner Studio is an extendable and popular open-source user interface for data analytics. Radoop is an extension for RapidMiner Studio and supports distributed computation on top of Hadoop [10]. Furthermore, it supports integration with PySpark and SparkR scripts [2]. Nonetheless, the user is still required to know Spark's API to include custom code for execution on top of Spark's distributed environment. Another, more recent extension to RapidMiner Studio is [6], which supports code-free creation of optimized, cross-platform, streaming workflows over various clusters, including Spark. To the best of our knowledge, neither Radoop nor [6] makes data information available at the operator level, whereas Easy Spark tries to reduce common errors by explicitly guiding the user, i.e., extracting data output types at individual nodes, and showing sample intermediate results when constructing the DAG.

StreamSets Transformer is a modern ETL pipeline engine for building data transformation, stream processing, and machine learning operations. It offers an easy drag-and-drop web-based UI in which users can create pipelines that are executed on Spark. It is possible to write custom PySpark and Scala code as part of the user's data pipeline. However, the user should be familiar with Spark's API to include custom Spark code in their pipeline (see https://streamsets.com/documentation/transformer/latest/help/index.html).

KNIME (https://www.knime.com) is an open-source platform for drag-and-drop data analysis, visualization, machine learning, statistics, and ETL. KNIME allows the user to create workflows via their GUI without, or with only minimal, programming. KNIME provides an extension (https://www.knime.com/knime-extension-for-apache-spark) that creates Spark nodes, which can execute Spark SQL queries, create Spark MLlib models, and allow for data blending and manipulation. Spark Streaming and Spark GraphX are not integrated in the extension. The extension includes a PySpark script node, where users can add their own code. However, similar to StreamSets and RapidMiner Studio, KNIME requires the user to be familiar with the Spark API.

Lastly, RheemStudio [9] is a visual IDE on top of the open-source platform RHEEM [4], which enables data processing over multiple data processing platforms, including Spark. RheemStudio helps developers build applications easily and intuitively. Users can drag-and-drop operators from the RHEEM operators panel to the drawing surface to construct a RHEEM plan. The user is then shown how the constructed RHEEM plan can be specified in RHEEMLatin – the declarative language for RHEEM. Next, the user is invited to select one of the operators to revise and is asked to inject her own logic (a couple of lines of code), which is then checked for syntactic correctness. Furthermore, RheemStudio includes both a dashboard for displaying RHEEM's internals in detail, which allows for immediate interaction with the developer, and a monitor for keeping track of the status of the data analytic tasks. Compared to RheemStudio, Easy Spark focuses on simplicity and more guidance for the novice user. It displays intermediary results to the user, which are useful for debugging, and it does not require learning an additional language like RHEEMLatin, since it does not need to integrate with multiple data processing platforms. This focus on simplicity makes Easy Spark an ideal entry point to Spark for non-Spark programmers.

Our contribution improves the state of the art in multiple directions. First, it supports newcomers in starting to write Spark without previous Spark knowledge, since it avoids the Spark semantics (RDDs, maps, and reduces). Instead, it uses constructs that are identical, or very similar, to the standard programming constructs of traditional languages (Python and Java). It also supports arbitrary code, which is useful for integrating legacy code, and it provides guidance to the user during the design of the workflow, e.g., by providing the expected structure (data format) and sample answers of each intermediary operation. This enables the developer to avoid pitfalls (e.g., calling too many collects) and to quickly identify bugs. Finally, it can function as a stand-alone tool, not requiring complex installation and maintenance of large clusters.
3 EASY SPARK

Recall that the target group of Easy Spark includes a diverse set: Spark newcomers, CS and non-CS students, educators, researchers, and professional data/domain scientists. Naturally, each of these categories comes with different levels of expertise and experience, and with problems of different complexity. To support all of them, we need a powerful and intuitive GUI where users can visually control the flow of the data. In Easy Spark, every operation on the data is represented by a node, and the user can connect nodes to form a computation path: a Directed Acyclic Graph (DAG) of computation nodes and edges that represent the flow of data between nodes. We start by providing a high-level overview of Easy Spark, with a special emphasis on how it guides the developer and prevents traditional errors. We then present a detailed discussion of the offered functionality.

3.1 A high-level overview of Easy Spark

Easy Spark contains two types of buttons: (a) nodes of different types, and (b) configuration buttons. Figure 2 depicts the currently supported DAG node types and configuration buttons.

Figure 2: Boxes that create and configure nodes

Node types. Input and Output are for choosing the data input and results output files, and for configuring how these should be read/written (e.g., format, delimiters). For-each, Filter, and Aggregate correspond to the Spark functionalities of (flat-)map, filter, and reduce/reduceByKey, respectively. Model and Evaluate allow the user to train an ML model, e.g., an SVM, and to use it for classification/predictions. Clicking on any of the above buttons adds a node with the matching color to the DAG, and allows the user to configure it (e.g., copy-paste legacy code into an editor, or configure the filter predicates). Section 3.3 discusses the precise meaning and configuration parameters of these node types.

Configuration buttons. The Options button allows the user to re-open the configuration panel of the selected node. Calculate Path executes the DAG on the full dataset – or submits it for execution to a cluster – and writes the results to the output node, whereas Show code depicts the generated Spark code. The precise functionality of these buttons is detailed in Section 3.4.

Example. Fig. 3 depicts an example DAG that executes a fairly common ML task: training an ML model on a part of the dataset, and testing the produced model on the remaining part. The user first configures the two input nodes by selecting the correct file names, and then adds a model training node (in this case, set to generate an SVM). The output of the model node is the model itself, which is then passed to an evaluator node, together with the testing partition of the dataset. The output of the evaluator is finally saved into the output node. Notice that the described workflow does not include any Spark-related concepts. We will see shortly that the Spark fundamentals remain hidden from the user.

Figure 3: Example graph
3.2 Guiding the user

Different mechanisms are in place to guide the user through the DAG design.

• Showing sample intermediary results. When adding a node, the developer can immediately preview a small number of input and output results of that node (see Fig. 4 for an example; a minimal sketch of such a preview follows this list). The preview is computed locally on a small part of the data, such that results can be shown with zero waiting time.

• Identifying and naming the attributes, and propagating the data formats and structures. The developer configures a structure and format for the data input when configuring the input nodes (part of this is inferred if the data files contain headers). This information is propagated to the following nodes, i.e., the user can see and use the attribute names. When the code of an intermediary node (e.g., a For-each node) modifies the data format, the developer is supported in updating the data format accordingly.

• Disabling buttons that are incompatible with the current state. The current state and current selection determine the buttons that are enabled and/or disabled. For example, when an output node is selected, only the Calculate path button is enabled.

• Hiding the Spark semantics from the developer. Notice that the used operations are not specific to Spark. For example, the developer does not need to understand map, reduce, and RDDs. She does not need to write Spark boilerplate code, or directly submit the code for execution on a server. All semantics of Easy Spark can be easily understood by an everyday Python developer/data scientist.

• Preventing common errors. Besides hiding the complexity of parallelism, which is frequently a source of errors, Easy Spark supports the developer in using named and typed data structures, and shows result samples at each intermediary node. Therefore the developer can quickly detect most types of errors.

Figure 4: Preview of the data via the options box for the input node
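The paper does not show how the preview is implemented; the following is a hypothetical sketch of how a node preview could be computed locally on a handful of rows, as described in the first bullet above. The helper name, file name, and row function are illustrative assumptions.

 import csv
 import itertools

 def preview_node(input_path, row_function, n=10):
     """Apply the node's per-row code to the first n rows of the input file."""
     with open(input_path, newline="") as f:
         sample = list(itertools.islice(csv.reader(f), n))
     return [row_function(row) for row in sample]

 # Example: previewing a For-each node that keeps only the first two columns.
 print(preview_node("input.csv", lambda row: row[:2]))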
3.3 Node types

Nodes have to represent operations that are both familiar to the user and useful in the context of data processing. To serve this purpose, Easy Spark comes with a core set of node types, and allows for easy extension by implementing additional nodes. The current node types are:

• Input. An input node serves as a data source. On creation of an input node, the user chooses the data source, and the input node handles the creation of the corresponding RDD and parallelizes the data to the available worker machines. Easy Spark also prompts the user on whether or not the data source contains header data; if so, the header is parsed and propagated to subsequent nodes on the computation path. If no header is in the data, the user is prompted to supply the related information.

• For-each. The for-each node allows the developer to enter code that would normally be put in the body of a for-loop over all records. The developer is prompted with a box for specifying the desired behaviour of the node, i.e., to include the code that needs to be executed. The node itself then chooses what Spark code should be executed in order to replicate this behaviour, either through map or flatMap functions (see the sketch after this list).

• Aggregate. The aggregation node is responsible for aggregating data over different groups in the data. The user is prompted to supply the level and the type of aggregation, and the node itself generates key-value pairs and employs a reduce or reduceByKey function in order to aggregate the data at the given level of aggregation.

• Filter. The filter node handles the filtering of unwanted data. The user supplies the tool with a boolean expression (or code that will return a boolean value), which is evaluated for every row in the data. This way, data is excluded from the dataset.

• Model. The model node is the gateway to the MLlib library in Spark. The model node prompts the user to supply the split between features and labels, and trains a Machine Learning (ML) model on the given data. The complexity of training an ML model is concealed from the user, such that the user can intuitively create ML pipelines that run on Spark clusters.

• Evaluate. This node is responsible for evaluating the results of an ML model trained by a model node, on data that comes from another node. This node gives the user the possibility of evaluating a model on a large dataset, since the evaluation is parallelized across nodes.

• Output. The output node executes each node within the path of the graph and writes the results to a text file at a location the user picks.
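The following sketch illustrates, under our own assumptions, how the For-each, Filter, and Aggregate nodes could be lowered onto Spark primitives; it is not the tool's actual code generator, and the placeholder functions stand for the code the user types into the nodes.

 from pyspark import SparkContext

 sc = SparkContext(appName="NodeTranslationSketch")
 rows = sc.textFile("input.csv").map(lambda line: line.split(","))

 def one_output(row):        # placeholder: For-each body returning one record
     return row

 def many_outputs(row):      # placeholder: For-each body returning several records
     return row

 # For-each node: one output per record -> map; multiple outputs -> flatMap.
 mapped = rows.map(one_output)
 flat = rows.flatMap(many_outputs)

 # Filter node: the user-supplied boolean expression becomes a filter predicate.
 kept = rows.filter(lambda row: row[1] == "some_value")

 # Aggregate node (function = count, key = first column): emit (key, 1) pairs
 # and reduce them by key.
 counts = rows.map(lambda row: (row[0], 1)).reduceByKey(lambda a, b: a + b)
 print(counts.take(5))       # action, used here for inspection only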
3.4 Configuration buttons

This subsection lists the purpose of the configuration buttons.

• Calculate path. By clicking this button, the data of the previously selected node is collected and written to disk. This serves as a clear endpoint of the calculations for the user, and supplies the tool with a clear target node, to which we apply the transformations defined by the nodes on the path from the data source nodes to the node that activated the calculation. Note: it is possible to press this button multiple times, for different nodes (and paths), in order to derive multiple outputs from the DAG.

• Options. The options button enables the developer to view additional information or set configuration parameters for the selected node. The options per node type are as follows:
− Input presents a preview of ten rows from the data source (see Fig. 4).
− For-each consists of a drop-down menu for the output type (one output or multiple outputs), a drop-down menu for the level of the for-loop based on the (provided) header data, an entry box where the user enters the code that needs to be executed for every level, and an entry box with the structure of the output.
− Aggregate shows a drop-down menu for the aggregation function (sum or count), a drop-down menu for the key (one of the headers) to aggregate on, and an entry box to provide the structure of the output.
− Filter includes a drop-down menu for the filter level (entire row or attribute in a row) and an entry box where the filter condition should be entered.
− Model has a drop-down menu for the type of model (SVM) and two entry boxes where the statements that retrieve the label and the features from each data entry should be entered.
− Evaluate has two drop-down menus from which the model node and data node should be selected, and two entry boxes to enter the statements that retrieve the label and the features.

• Show code. By clicking this button, the user sees the generated Spark code of the selected node (see, e.g., Figure 5). If no node is selected, the user gets the generated Spark code for the whole program.

Figure 5: Generated Spark code based on the aggregation node in LetterCount (subsection 4.1)

4 CASE STUDIES

We now discuss three case studies that are supported by the tool. The goal of our discussion is to illustrate the simplicity and expressive power of the tool, and to demonstrate the ease with which Spark architectures can be designed.

4.1 Applying Easy Spark to LetterCount

We start by parallelizing a workflow that counts the number of appearances of each letter in a large text file. The pseudocode of the algorithm (Algorithm 1) involves a triple nested loop – first the text is broken into lines, then each line is broken into words, and then each word is broken into letters.

Algorithm 1: LetterCount.
 Input: Textfile f, dictionary Counts
 for line in f do
   for word in line do
     for letter in word do
       if letter in Counts then Counts[letter] += 1
       else Counts[letter] = 1

To implement this using Easy Spark, we need three nodes: a (yellow) Input node which contains the data source, an (orange) Aggregation node to count, and a (green) Output node. Figure 6 depicts the graph drawn in the GUI for LetterCount.

Figure 6: Drawn graph for LetterCount.

When creating the Input node, the user is asked to provide the header (since a plain text file does not contain a header). For the sake of this example, we choose the header to have the following structure: sentence - word - letter. This means that the expected output of the input node we just added will contain three different representations of the data. Notice that the names are arbitrarily chosen by the user, but give a reference point for the next nodes.

Next, we create the Aggregation node and set the right options: the aggregation type is set to count and the aggregation key is set to letter. Before we can calculate the computation path, the GUI needs to know how to get from sentence to word, and from word to letter. This information is provided by the user through two pop-up windows (Figure 7). Lastly, we add the Output node, from which we can initiate code generation/execution.

Figure 7: Required transition data for LetterCount. (a) Sentence to word. (b) Word to letter.
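For illustration, the code below is a hedged sketch of the kind of Spark program the LetterCount DAG could generate (in the spirit of Figure 5, which we cannot reproduce here); the file paths and the exact splitting logic are our own assumptions rather than the tool's output.

 from operator import add
 from pyspark import SparkContext

 sc = SparkContext(appName="LetterCount")

 letters = (sc.textFile("book.txt")                # Input node: one record per line
              .flatMap(lambda line: line.split())  # sentence -> words
              .flatMap(list))                      # word -> letters

 counts = letters.map(lambda c: (c, 1)).reduceByKey(add)   # Aggregate node: count per letter

 counts.saveAsTextFile("letter_counts")            # Output node: write the results
 sc.stop()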
hold-out data set. Both input files are passed through a For-each For the sake of this example, we choose the header to be of the node for pre-processing. The training data is used for building following structure: sentence - word - letter. This means that the a model (an SVM), which is then passed into an Evaluate node expected output of the input node we just added will contain together with the test data. To count the misclassifications we three different representations of the data. Notice that the names add a Filter node that is responsible for filtering out all correctly- are arbitrarily chosen by the user, but give a reference point for classified cases, followed by an Aggregation node which will the next nodes. count the number of results. Observe that the DAG is sufficient Next, we create the Aggregation node and set the right options: to obtain a clear idea of the data flow, and that we can represent aggregation type is set to count and aggregation key is set to complex operations on the data by a relatively small amount of letter. Before we can calculate the computation path, the GUI nodes, i.e., the nodes have a high expressive power. needs to know how to get from sentence to word, and from word to letter. This information is provided by the user through two 5 CURRENT AND FUTURE WORK pop-up windows (Figure 7). Lastly, we add the Output node from Easy Spark is under development. It can already generate code for which we can initiate code generation/execution. complex workflows and submit it for execution, but it still misses some of the envisioned functionalities. In this section, we discuss 4.2 Distributed Video Analysis our current and planned work and the involved challenges. In this use case we will apply our tool to parallelize the data • Adding new node types. Our effort with the current version of individual frames in an input video, and perform necessary of Easy Spark was to provide a zero-learning-curve proof-of- computations on each frame. In particular, we will apply an concept tool that can be used in teaching, and as the entry point of partitions. In the future we will detect these exceptions and propose these standard steps to the user. • Integrating the Spark UI. It will be useful – and with educational value – to integrate Spark’s Web UI in Easy Spark in order to allow the user to monitor the status of the developed Spark application, resource consumption, Spark cluster, and Spark configurations. 6 CONCLUSIONS In this paper we proposed Easy Spark, a Graphical User Interface for easily designing Spark architectures for Big Data Engineering. Figure 9: DAG Corresponding to the ML pipeline The GUI enables the developer to design complex Spark DAGs with arbitrary functionality, by masking Spark constructs and concepts behind traditional programming constructs that any of a newcomer in Spark. It was out of our scope to provide a trained data scientist or computer scientist is able to understand, complete visual programming language equivalent to the Spark use, and configure. We examined three use cases and showed how capabilities. Being convinced with the usefulness of the tool, we Easy Spark can be used to generate executable Spark code. We are now extending it with more node types to cover frequently- also discussed the mechanisms that are currently implemented used functionality, e.g., Spark SQL, GraphX, Spark Streaming, in Easy Spark to guide the user and prevent bugs, and elaborated and the full MLlib library. 
5 CURRENT AND FUTURE WORK

Easy Spark is under development. It can already generate code for complex workflows and submit it for execution, but it still misses some of the envisioned functionalities. In this section, we discuss our current and planned work and the involved challenges.

• Adding new node types. Our goal with the current version of Easy Spark was to provide a zero-learning-curve proof-of-concept tool that can be used in teaching, and as the entry point of a newcomer to Spark. It was out of our scope to provide a complete visual programming language equivalent to Spark's capabilities. Convinced of the usefulness of the tool, we are now extending it with more node types to cover frequently-used functionality, e.g., Spark SQL, GraphX, Spark Streaming, and the full MLlib library. It is fairly straightforward to add more node types. We are also redesigning the user interface to help users find the desired node types quickly, e.g., by adding an ML tab where all the nodes related to MLlib will be grouped.

• Improved support for legacy code. Easy Spark includes support for including hand-written legacy code, e.g., when writing code for For-each nodes. Currently, Aggregate nodes enable selecting from a pre-defined set of functions, such as sum and count, but it is not challenging to enable arbitrary code. Support for legacy code can be further improved by recognizing the signatures of the functions and automatically extracting the output data types, instead of asking the user to update the data types manually whenever these change. Notice that this functionality is feasible, as it is already present in different IDE tools.

• Improved error detection. Automatic extraction of function signatures can be leveraged to identify and present errors to the user that relate to the data types and formats, e.g., when the output of one node does not agree with the expected input of the next node. Again, this is supported by most modern IDEs.

• Showing sample intermediate results. Presentation of sample intermediate results (a sample of the output of each node) during design time is often very useful for getting a rough idea of what is happening up until that point, and for keeping track of the data format that the next node will receive. Our current implementation presents sample intermediary results (the first 10 results) only at some nodes. Notice, however, that some node types, e.g., Filter and For-each, can change this number. The number of sample results shown at each node should be adapted in order to still get meaningful samples throughout the workflow. Furthermore, in some cases the distribution of the sample results is also important for a meaningful sample output, e.g., training a binary SVM classifier requires representative samples from both classes. Starting with a huge number of samples to ensure that all nodes will have an output is also not an option, since this would degrade the performance of Easy Spark. We are now developing methods that adaptively choose the samples at each node, in order to derive meaningful sample results at all nodes.

• Supporting more data input formats and streaming data. This will further reduce the complexity of loading the data, and offer better representations of the data.

• Improved support for Spark-related errors. Some common Spark errors, e.g., OutOfMemory exceptions, can (mostly) be addressed with a few standard steps, e.g., increasing the RAM available to the executors or driver, avoiding collects, or increasing the number of partitions (a configuration sketch follows this list). In the future we will detect these exceptions and propose these standard steps to the user.

• Integrating the Spark UI. It will be useful – and of educational value – to integrate Spark's Web UI in Easy Spark, in order to allow the user to monitor the status of the developed Spark application, resource consumption, the Spark cluster, and the Spark configurations.
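To illustrate the standard mitigations mentioned in the bullet on Spark-related errors, the snippet below shows how they could be expressed in PySpark; the memory size and partition count are examples only, not recommendations from the paper (and driver memory typically has to be set at submit time rather than in code).

 from pyspark import SparkConf, SparkContext

 conf = (SparkConf()
         .setAppName("TunedJob")
         .set("spark.executor.memory", "8g"))      # more RAM per executor

 sc = SparkContext(conf=conf)

 rdd = sc.textFile("big_input.txt").repartition(200)   # more, smaller partitions

 # Prefer writing out (or sampling) over collect(), which pulls all data to the driver.
 rdd.saveAsTextFile("output")
 sc.stop()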
6 CONCLUSIONS

In this paper we proposed Easy Spark, a Graphical User Interface for easily designing Spark architectures for Big Data Engineering. The GUI enables the developer to design complex Spark DAGs with arbitrary functionality, by masking Spark constructs and concepts behind traditional programming constructs that any trained data scientist or computer scientist is able to understand, use, and configure. We examined three use cases and showed how Easy Spark can be used to generate executable Spark code. We also discussed the mechanisms that are currently implemented in Easy Spark to guide the user and prevent bugs, and elaborated on our current and future work to extend and improve it.

7 ACKNOWLEDGEMENTS

This work was partially funded by the EU H2020 project SmartDataLake (825041).

REFERENCES

[1] 2020. Spark Cluster Mode Overview. https://spark.apache.org/docs/latest/cluster-overview.html
[2] 2021. RapidMiner Radoop Feature List. https://rapidminer.com/products/radoop/feature-list/. [Online; accessed 27-January-2021]
[3] A. Bansod. 2015. Efficient Big Data Analysis with Apache Spark in HDFS. International Journal of Engineering and Advanced Technology 4, 6 (2015).
[4] Divy Agrawal, Sanjay Chawla, Bertty Contreras-Rojas, Ahmed Elmagarmid, Yasser Idris, Zoi Kaoudi, Sebastian Kruse, Ji Lucas, Essam Mansour, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Saravanan Thirumuruganathan, and Anis Troudi. 2018. RHEEM: Enabling Cross-Platform Data Processing: May the Big Data Be with You! Proc. VLDB Endow. 11, 11 (July 2018), 1414–1427. https://doi.org/10.14778/3236187.3236195
[5] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107–113.
[6] Nikos Giatrakos, David Arnu, T. Bitsakis, Antonios Deligiannakis, M. Garofalakis, R. Klinkenberg, Aris Konidaris, Antonis Kontaxakis, Y. Kotidis, Vasilis Samoladas, A. Simitsis, George Stamatakis, F. Temme, Mate Torok, Edwin Yaqub, Arnau Montagud, M. Leon, Holger Arndt, and Stefan Burkard. 2020. INforE: Interactive Cross-platform Analytics for Everyone. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (2020).
[7] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2018. Mask R-CNN. arXiv:cs.CV/1703.06870
[8] C. S. Liao, J. M. Shih, and R. S. Chang. 2013. Simplifying MapReduce data processing. International Journal of Computational Science and Engineering 8 (2013). https://doi.org/10.1504/ijcse.2013.055353
[9] Ji Lucas, Yasser Idris, Bertty Contreras-Rojas, Jorge-Arnulfo Quiané-Ruiz, and S. Chawla. 2018. RheemStudio: Cross-Platform Data Analytics Made Easy. In 2018 IEEE 34th International Conference on Data Engineering (2018), 1573–1576.
[10] Zoltán Prekopcsák, Gabor Makrai, T. Henk, and Csaba Gáspár-Papanek. 2011. Radoop: Analyzing Big Data with RapidMiner and Hadoop.
[11] S. Salloum, R. Dautov, X. Chen, P. X. Peng, and J. Z. Huang. 2016. Big data analytics on Apache Spark. Int. J. Data Sci. Anal. (2016).
[12] Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, and Fatma Özcan. 2015. Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics. Proc. VLDB Endow. 8, 13 (Sept. 2015), 2110–2121. https://doi.org/10.14778/2831360.2831365
[13] A. Verma, A. H. Mansuri, and N. Jain. 2016. Big data management processing with Hadoop MapReduce and Spark technology: A comparison. In 2016 Symposium on Colossal Data Analysis and Networking (CDAN), 1–4.
[14] Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. 2016. Apache Spark: a unified engine for big data processing. Commun. ACM 59, 11 (2016), 56–65. https://doi.org/10.1145/2934664