=Paper=
{{Paper
|id=Vol-2841/SIMPLIFY_7
|storemode=property
|title=Easy Spark
|pdfUrl=https://ceur-ws.org/Vol-2841/SIMPLIFY_7.pdf
|volume=Vol-2841
|authors=Ylaise van den Wildenberg,Wouter W.L. Nuijten,Odysseas Papapetrou
|dblpUrl=https://dblp.org/rec/conf/edbt/WildenbergNP21
}}
==Easy Spark==
Y. van den Wildenberg, W.W.L. Nuijten, O. Papapetrou (Eindhoven University of Technology)
y.v.d.wildenberg@student.tue.nl, w.w.l.nuijten@student.tue.nl, o.papapetrou@tue.nl

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

Today's data deluge calls for novel, scalable data handling and processing solutions. Spark has emerged as a popular distributed in-memory computing engine for processing and analysing large amounts of data in parallel. However, the way parallel processing pipelines are designed is fundamentally different from traditional programming techniques, and hence most programmers are either unable to start using Spark, or do not utilise Spark to the maximum of its potential. This study describes an easier entry point into Spark. We design and implement a GUI that allows any programmer with knowledge of a standard programming language (e.g., Python or Java) to write Spark applications effortlessly and interactively, and to submit and execute them on large clusters.

1 INTRODUCTION

Currently, data volumes are exploding in research and industry because of the increasing number of data-intensive applications. As a consequence, more and more disciplines face scalability concerns. However, the barrier to utilizing distributed Big Data platforms is high for a multitude of reasons. Firstly, legacy code does not fit very well in Big Data platforms. Secondly, most senior programmers in the field – the ones usually taking the decisions – never had formal training on Big Data platforms. On top of that, the sheer number of available Big Data platforms and pay-as-you-go solutions (e.g., cloud solutions) complicates the choice of the right way to scale out, which raises the barrier to entry even further. Because of this, companies are frequently reluctant to invest in Big Data platforms. In the long run, these companies will face either an inability to scale, or higher costs for maintaining much stronger architectures in-house.

Spark is possibly the most popular Big Data framework to date. The framework implements the so-called master/slave architecture: it includes a central coordinator (the driver) and many distributed executors (the workers). Spark hides the complexity of distributing the tasks and data across the workers, and transparently handles fault tolerance. Nonetheless, the complexity of Spark steepens the learning curve, especially for entry-level Data Engineers [11]. Furthermore, the Spark environment allows for several pitfalls, such as using too many collects, or having no clear understanding of caching and lazy evaluation.

So far, a number of attempts have been made to simplify distributed computing. [8] present a web-based GUI to simplify MapReduce data processing, but it supports only a small set of pre-determined actions/processing nodes. Also, [10] introduced Radoop, an extension for RapidMiner (https://rapidminer.com/) that runs distributed processes on Hadoop via the RapidMiner UI. Another (streaming) extension to RapidMiner Studio was recently released by the INforE EU project (https://www.infore-project.eu/), which supports code-free creation of optimized, cross-platform, streaming workflows running on one of the following stream processing frameworks: Apache Flink, Spark Structured Streaming, or Kafka [6]. In addition, [9] present RheemStudio, a visual IDE that creates code-free workflows on (a subset of) Spark using RHEEM's ecosystem [4] to easily specify cross-platform data analytic tasks. In the same line of work, StreamSets Transformer (https://streamsets.com/products/dataops-platform/transformer-etl/) offers a UI-based ETL pipeline where data transformations can be executed on Spark. Legacy code can be incorporated in StreamSets Transformer by writing custom PySpark or Scala pipelines. Similarly, KNIME (https://www.knime.com), a visual programming environment, supports extension of workflows with Spark nodes. Spark code can be added in KNIME as a PySpark script node in the workflow. However, in both cases the developer needs to understand the Spark API and semantics (RDDs, maps, reduces). At the moment, we still lack a simple, open-source graphical user interface that can be used out-of-the-box to support Spark newcomers – developers with potentially no training and experience of Spark – to design, develop, and deploy complex workflows in Spark that go beyond standard ETL processes. We explicitly target a stand-alone tool that requires a very simple installation process (e.g., unzipping a file, or clicking an icon) and no servers or Spark clusters, so that it can be used by novice users. Such a tool will simplify Spark, lowering the learning curve and the initial cost of testing out the integration and use of Spark in mission-critical processes.

This work introduces Easy Spark, a Graphical User Interface (GUI) for guiding the developer and flattening out Spark's learning curve. Instead of having to write code, the user designs and implements her big data application by designing a Directed Acyclic Graph (DAG): inserting nodes, specifying the input and output of each node, configuring the nodes, and linking them to other nodes. Upon completion of the workflow, the tool translates the model to executable Spark code, which can be submitted for execution to a cluster, executed locally, or saved for future use. Beyond hiding the complexity of Spark by abstracting the job to the natural DAG model of operators, the GUI itself prevents bugs, e.g., by showing intermediary results to the developer, and by restricting the developer to a model and to specific method signatures that lend themselves to parallelism, yet without requiring the introduction of Spark-specific concepts like Maps and Reduces. Easy Spark can also serve more advanced requirements, guiding experienced developers that do not know Spark to write and integrate code.

The user group of Easy Spark is: (a) Spark newcomers and students that want to quickly test out Spark, get a first introduction to Spark's basic ideas and capabilities, and construct a rapid prototype/proof of concept, (b) data scientists and domain scientists that are now hitting the limits of centralized computing, e.g., with Python, but do not have the formal training or extensive programming experience to start with Spark, and, finally, (c) educators and researchers that need novel methods to introduce, teach, and advertise Spark and similar Big Data frameworks. It is planned that Easy Spark will be integrated in the syllabi of two courses this year, at two different universities with different professors, and it will be released as open-source after the conference.
2 RELATED WORK AND BACKGROUND

2.1 A MapReduce/Spark primer

Apache Spark [14] and Hadoop MapReduce [5] are the two most popular open-source frameworks to date for large-scale data processing on commodity hardware.

MapReduce breaks a job into multiple tasks and assigns the tasks to the available workers. The programming API of MapReduce is concentrated on the implementation of two (types of) methods, mappers and reducers. Mappers take a pair (originating from a file) as input and produce a set of intermediate key/value pairs. Typical uses of mappers are filtering and transformations on the input data. The results of the mappers are typically pushed to the reducers. Each reducer instance (running the user-defined reducer function) accepts a key and a list of values for that particular key. Reducers typically serve the purpose of aggregating all data for each key.

Even though MapReduce is generally recognized as a highly effective and efficient tool for Big Data analysis [13], it comes with several limitations. In applications such as machine learning and graph analytics, data needs to pass through several processing iterations, i.e., a sequence of different jobs. MapReduce reads its input data from secondary storage (typically network drives), processes it, and writes it back for every job, posing a significant I/O overhead. Furthermore, expressing complex programs as a series of maps and reduces proves to be cumbersome. Spark offers a solution to these problems through Resilient Distributed Datasets (RDDs), and through an expansion of the programming model that can represent more complex data flows consisting of multiple stages, namely Directed Acyclic Graphs (DAGs) [12]. RDDs are read-only and can be kept in memory, enabling algorithms to iterate over the same data many times efficiently. RDDs support two kinds of operations organized as a DAG: transformations and actions. Transformations are applied on one or more RDDs and return a new RDD. Examples of transformations are map, flatMap, filter, join, and union. Actions operate on an RDD and return a non-RDD answer, e.g., a count, or a conversion of the last RDD to an array list. Due to the lazy-evaluation nature of Spark, transformations are executed only when their result needs to be processed by an action. Placement of actions in the Spark code is critical for the performance of the code – invoking unnecessary actions may nullify the benefits of distributed computation. A single Spark DAG may involve a number of transformations and actions, which are completed in a single run and can be optimized by the Spark execution engine. As a result, Spark is generally much faster on non-trivial jobs compared to Hadoop MapReduce [3, 13].
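To make the transformation/action distinction concrete, the following is a minimal PySpark sketch; the file name, its assumed contents (one number per line), and the filtering threshold are illustrative assumptions and do not come from the paper.

 from pyspark import SparkContext

 sc = SparkContext(appName="PrimerExample")

 lines = sc.textFile("data.txt")             # transformation: nothing is read yet
 numbers = lines.map(lambda s: int(s))       # transformation: recorded lazily
 large = numbers.filter(lambda x: x > 100)   # transformation: still no execution

 print(large.count())                        # action: triggers execution of the whole lineage

 sc.stop()

Only the final count() forces Spark to read the file and run the chained transformations; placing additional actions (e.g., collect()) in between would trigger extra, possibly redundant, computation.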
The architecture of Spark is depicted in Fig. 1. A Spark deployment consists of a driver program (SparkContext), many executors (workers), and a cluster manager. The driver program serves as the main program and is responsible for the entire execution of the job. The SparkContext object connects to the cluster manager, which sends tasks to the executors.

Figure 1: Spark Architecture (figure taken from [1])

2.2 Related work

Even though MapReduce simplifies the complexity of programming for distributed computing, it departs from traditional programming paradigms, imposing a steep learning curve on programmers. [8] developed a GUI in which users can design their MapReduce workflow intuitively, without writing any code or installing Hadoop locally. Users are only required to know how to translate their tasks into target-value-action (TVA) tuples, which map data processing onto mappers and reducers. Users pick objects as targets, whereas the action filters or processes the values with the same target. For example, to implement a word-count code (i.e., count the frequency of each word in a text file), we can identify the following TVA values: each word is a target, 1 is a value, and sum is the desired action. The GUI offers a predefined list of operations, such as merge, sum, count, and multiply. However, the proposed GUI is inflexible in terms of input data formats, and cannot support arbitrary code.

RapidMiner Studio is an extendable and popular open-source user interface for data analytics. Radoop is an extension for RapidMiner Studio and supports distributed computation on top of Hadoop [10]. Furthermore, it supports integration with PySpark and SparkR scripts [2]. Nonetheless, the user is still required to know Spark's API to include custom code for execution on top of Spark's distributed environment. Another, more recent extension to RapidMiner Studio is [6], which supports code-free creation of optimized, cross-platform, streaming workflows over various clusters, including Spark. To the best of our knowledge, neither Radoop nor [6] makes data information available at the operator level, whereas Easy Spark tries to reduce common errors by explicitly guiding the user, i.e., extracting data output types at individual nodes, and showing sample intermediate results when constructing the DAG.

StreamSets Transformer is a modern ETL pipeline engine for building data transformation, stream processing, and machine learning operations. It offers an easy drag-and-drop web-based UI in which users can create pipelines that are executed on Spark. It is possible to write custom PySpark and Scala code as part of the user's data pipeline. However, the user should be familiar with Spark's API to include custom Spark code in their pipeline (see https://streamsets.com/documentation/transformer/latest/help/index.html).

KNIME (https://www.knime.com) is an open-source platform for drag-and-drop data analysis, visualization, machine learning, statistics, and ETL. KNIME allows the user to create workflows via their GUI without, or with only minimal, programming. KNIME provides an extension (https://www.knime.com/knime-extension-for-apache-spark) that creates Spark nodes, which can execute Spark SQL queries, create Spark MLlib models, and allow for data blending and manipulation. Spark Streaming and Spark GraphX are not integrated in the extension. The extension includes a PySpark script node, where users can add their own code. However, similar to StreamSets and RapidMiner Studio, KNIME requires the user to be familiar with the Spark API.

Lastly, RheemStudio [9] is a visual IDE on top of the open-source platform RHEEM [4], which enables data processing over multiple data processing platforms, including Spark. RheemStudio helps developers build applications easily and intuitively. Users can drag-and-drop operators from the RHEEM operators panel to the drawing surface to construct a RHEEM plan. The user is then shown how the constructed RHEEM plan can be specified in RHEEMLatin – the declarative language for RHEEM. Next, the user is invited to select one of the operators to revise and is asked to inject her own logic (a couple of lines of code), which is then checked for syntactic correctness. Furthermore, RheemStudio includes both a dashboard for displaying RHEEM's internals in detail, which allows for immediate interaction with the developer, and a monitor for keeping track of the status of the data analytic tasks. Compared to RheemStudio, Easy Spark focuses on simplicity and more guidance for the novice user. It displays intermediary results to the user, which are useful for debugging, and it does not require learning an additional language like RHEEMLatin, since it does not need to integrate with multiple data processing platforms. This focus on simplicity makes Easy Spark an ideal entry point to Spark for non-Spark programmers.

Our contribution improves the state of the art in multiple directions. First, it supports newcomers in starting to write Spark without previous Spark knowledge, since it avoids the Spark semantics (RDDs, maps, and reduces). Instead, it uses constructs that are identical, or very similar, to the standard programming constructs of traditional languages (Python and Java). It also supports arbitrary code, which is useful for integrating legacy code, and it provides guidance to the user during the design of the workflow, e.g., by providing the expected structure (data format) and sample answers of each intermediary operation. This enables the developer to avoid pitfalls (e.g., calling too many collects) and to quickly identify bugs. Finally, it can function as a stand-alone tool, not requiring complex installation and maintenance of large clusters.
3 EASY SPARK

Recall that the target group of Easy Spark includes a diverse set: Spark newcomers, CS and non-CS students, educators, researchers, and professional data/domain scientists. Naturally, each of these categories comes with different levels of expertise and experience, and with problems of different complexity. To support all of them, we need a powerful and intuitive GUI where users can visually control the flow of the data. In Easy Spark, every operation on the data is represented by a node, and the user can connect nodes to form a computation path: a Directed Acyclic Graph (DAG) of computation nodes and edges that represent the flow of data between nodes. We start by providing a high-level overview of Easy Spark, with a special emphasis on how it guides the developer and prevents traditional errors. We then present a detailed discussion of the offered functionality.

3.1 A high-level overview of Easy Spark

Easy Spark contains two types of buttons: (a) nodes of different types, and (b) configuration buttons. Figure 2 depicts the currently supported DAG node types and configuration buttons.

Figure 2: Boxes that create and configure nodes

Node types. Input and Output are for choosing the data input and results output files, and for configuring how these should be read/written (e.g., format, delimiters). For-each, Filter, and Aggregate correspond to the Spark functionalities of (flat-)map, filter, and reduce/reduceByKey, respectively. Model and Evaluate allow the user to train an ML model, e.g., an SVM, and to use it for classification/predictions. Clicking on any of the above buttons adds a node with the matching color to the DAG, and allows the user to configure it (e.g., copy-paste legacy code into an editor, or configure the filter predicates). Section 3.3 discusses the precise meaning and configuration parameters of these node types.

Configuration buttons. The Options button allows the user to re-open the configuration panel of the selected node. Calculate Path executes the DAG on the full dataset – or submits it for execution to a cluster – and writes the results to the output node, whereas Show code depicts the generated Spark code. The precise functionality of these buttons is detailed in Section 3.4.

Example. Fig. 3 depicts an example DAG that executes a fairly common ML task: training an ML model on a part of the dataset, and testing the produced model on the remaining part. The user first configures the two input nodes by selecting the correct file names, and then adds a model training node (in this case, set to generate an SVM). The output of the model node is the model itself, which is then passed to an evaluator node, together with the testing partition of the dataset. The output of the evaluator is finally saved into the output node. Notice that the described workflow does not include any Spark-related concepts. We will see shortly that the Spark fundamentals remain hidden from the user.

Figure 3: Example graph
3.2 Guiding the user

Different mechanisms are in place to guide the user through the DAG design.

• Showing sample intermediary results. When adding a node, the developer can immediately preview a small number of input and output results of that node (see Fig. 4 for an example; a minimal sketch of such a preview follows this list). The preview is computed locally on a small part of the data, such that results can be shown with zero waiting time.

• Identifying and naming the attributes, and propagating the data formats and structures. The developer configures a structure and format for the data input when configuring the input nodes (part of this is inferred if the data files contain headers). This information is propagated to the following nodes, i.e., the user can see and use the attribute names. When the code of an intermediary node (e.g., a For-each node) modifies the data format, the developer is supported in updating the data format accordingly.

• Disabling buttons that are incompatible with the current state. The current state and current selection determine the buttons that are enabled and/or disabled. For example, when an output node is selected, only the Calculate path button is enabled.

• Hiding the Spark semantics from the developer. Notice that the used operations are not specific to Spark. For example, the developer does not need to understand map, reduce, and RDDs. She does not need to write Spark boilerplate code, or directly submit the code for execution on a server. All semantics of Easy Spark can be easily understood by an everyday Python developer/data scientist.

• Preventing common errors. Besides hiding the complexity of parallelism, which is frequently a source of errors, Easy Spark supports the developer in using named and typed data structures, and shows result samples at each intermediary node. Therefore the developer can quickly detect most types of errors.

Figure 4: Preview of the data via the options box for the input node
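The paper does not show how the preview is implemented; the following is a hypothetical sketch of how a node preview could be computed locally on a handful of rows, as described in the first bullet above. The helper name, file name, and row function are illustrative assumptions.

 import csv
 import itertools

 def preview_node(input_path, row_function, n=10):
     """Apply the node's per-row code to the first n rows of the input file."""
     with open(input_path, newline="") as f:
         sample = list(itertools.islice(csv.reader(f), n))
     return [row_function(row) for row in sample]

 # Example: previewing a For-each node that keeps only the first two columns.
 print(preview_node("input.csv", lambda row: row[:2]))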
3.3 Node types

Nodes have to represent operations that are both familiar to the user and useful in the context of data processing. To serve this purpose, Easy Spark comes with a core set of node types, and allows for easy extension by implementing additional nodes. The current node types are:

• Input. An input node serves as a data source. On creation of an input node, the user chooses the data source, and the input node handles the creation of the corresponding RDD and parallelizes the data to the available worker machines. Easy Spark also prompts the user on whether or not the data source contains header data; if so, the header is parsed and propagated to subsequent nodes on the computation path. If no header is in the data, the user is prompted to supply the related information.

• For-each. The for-each node allows the developer to enter code that would normally be put in the body of a for-loop over all records. The developer is prompted with a box for specifying the desired behaviour of the node, i.e., to include the code that needs to be executed. The node itself then chooses what Spark code should be executed in order to replicate this behaviour, either through map or flatMap functions (see the sketch after this list).

• Aggregate. The aggregation node is responsible for aggregating data over different groups in the data. The user is prompted to supply the level and the type of aggregation, and the node itself generates key-value pairs and employs a reduce or reduceByKey function in order to aggregate the data at the given level of aggregation.

• Filter. The filter node handles the filtering of unwanted data. The user supplies the tool with a boolean expression (or code that will return a boolean value), which is evaluated for every row in the data. This way, data is excluded from the dataset.

• Model. The model node is the gateway to the MLlib library in Spark. The model node prompts the user to supply the split between features and labels, and trains a Machine Learning (ML) model on the given data. The complexity of training an ML model is concealed from the user, such that the user can intuitively create ML pipelines that run on Spark clusters.

• Evaluate. This node is responsible for evaluating the results of an ML model trained by a model node, on data that comes from another node. This node gives the user the possibility of evaluating a model on a large dataset, since the evaluation is parallelized across nodes.

• Output. The output node executes each node within the path of the graph and writes the results to a text file at a location the user picks.
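The following sketch illustrates, under our own assumptions, how the For-each, Filter, and Aggregate nodes could be lowered onto Spark primitives; it is not the tool's actual code generator, and the placeholder functions stand for the code the user types into the nodes.

 from pyspark import SparkContext

 sc = SparkContext(appName="NodeTranslationSketch")
 rows = sc.textFile("input.csv").map(lambda line: line.split(","))

 def one_output(row):        # placeholder: For-each body returning one record
     return row

 def many_outputs(row):      # placeholder: For-each body returning several records
     return row

 # For-each node: one output per record -> map; multiple outputs -> flatMap.
 mapped = rows.map(one_output)
 flat = rows.flatMap(many_outputs)

 # Filter node: the user-supplied boolean expression becomes a filter predicate.
 kept = rows.filter(lambda row: row[1] == "some_value")

 # Aggregate node (function = count, key = first column): emit (key, 1) pairs
 # and reduce them by key.
 counts = rows.map(lambda row: (row[0], 1)).reduceByKey(lambda a, b: a + b)
 print(counts.take(5))       # action, used here for inspection only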
3.4 Configuration buttons

This subsection lists the purpose of the configuration buttons.

• Calculate path. By clicking this button, the data of the previously selected node is collected and written to disk. This serves as a clear endpoint of the calculations for the user, and supplies the tool with a clear target node, to which we apply the transformations defined by the nodes on the path from the data source nodes to the node that activated the calculation. Note: it is possible to press this button multiple times, for different nodes (and paths), in order to derive multiple outputs from the DAG.

• Options. The options button enables the developer to view additional information or set configuration parameters for the selected node. The options per node type are as follows:
− Input presents a preview of ten rows from the data source (see Fig. 4).
− For-each consists of a drop-down menu for the output type (one output or multiple outputs), a drop-down menu for the level of the for-loop based on the (provided) header data, an entry box where the user enters the code that needs to be executed for every level, and an entry box with the structure of the output.
− Aggregate shows a drop-down menu for the aggregation function (sum or count), a drop-down menu for the key (one of the headers) to aggregate on, and an entry box to provide the structure of the output.
− Filter includes a drop-down menu for the filter level (entire row or attribute in a row) and an entry box where the filter condition should be entered.
− Model has a drop-down menu for the type of model (SVM) and two entry boxes where the statements that retrieve the label and the features from each data entry should be entered.
− Evaluate has two drop-down menus from which the model node and data node should be selected, and two entry boxes to enter the statements that retrieve the label and the features.

• Show code. By clicking this button, the user sees the generated Spark code of the selected node (see, e.g., Figure 5). If no node is selected, the user gets the generated Spark code for the whole program.

Figure 5: Generated Spark code based on the aggregation node in LetterCount (subsection 4.1)

4 CASE STUDIES

We now discuss three case studies that are supported by the tool. The goal of our discussion is to illustrate the simplicity and expressive power of the tool, and to demonstrate the ease with which Spark architectures can be designed.

4.1 Applying Easy Spark to LetterCount

We start by parallelizing a workflow that counts the number of appearances of each letter in a large text file. The pseudocode of the algorithm (Algorithm 1) involves a triple nested loop – first the text is broken into lines, then each line is broken into words, and then each word is broken into letters.

Algorithm 1: LetterCount.
 Input: Textfile f, dictionary Counts
 for line in f do
   for word in line do
     for letter in word do
       if letter in Counts then Counts[letter] += 1
       else Counts[letter] = 1

To implement this using Easy Spark, we need three nodes: a (yellow) Input node which contains the data source, an (orange) Aggregation node to count, and a (green) Output node. Figure 6 depicts the graph drawn in the GUI for LetterCount.

Figure 6: Drawn graph for LetterCount.

When creating the Input node, the user is asked to provide the header (since a plain text file does not contain a header). For the sake of this example, we choose the header to have the following structure: sentence - word - letter. This means that the expected output of the input node we just added will contain three different representations of the data. Notice that the names are arbitrarily chosen by the user, but give a reference point for the next nodes.

Next, we create the Aggregation node and set the right options: the aggregation type is set to count and the aggregation key is set to letter. Before we can calculate the computation path, the GUI needs to know how to get from sentence to word, and from word to letter. This information is provided by the user through two pop-up windows (Figure 7). Lastly, we add the Output node, from which we can initiate code generation/execution.

Figure 7: Required transition data for LetterCount. (a) Sentence to word. (b) Word to letter.
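For illustration, the code below is a hedged sketch of the kind of Spark program the LetterCount DAG could generate (in the spirit of Figure 5, which we cannot reproduce here); the file paths and the exact splitting logic are our own assumptions rather than the tool's output.

 from operator import add
 from pyspark import SparkContext

 sc = SparkContext(appName="LetterCount")

 letters = (sc.textFile("book.txt")                # Input node: one record per line
              .flatMap(lambda line: line.split())  # sentence -> words
              .flatMap(list))                      # word -> letters

 counts = letters.map(lambda c: (c, 1)).reduceByKey(add)   # Aggregate node: count per letter

 counts.saveAsTextFile("letter_counts")            # Output node: write the results
 sc.stop()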
hold-out data set. Both input files are passed through a For-each For the sake of this example, we choose the header to be of the node for pre-processing. The training data is used for building following structure: sentence - word - letter. This means that the a model (an SVM), which is then passed into an Evaluate node expected output of the input node we just added will contain together with the test data. To count the misclassifications we three different representations of the data. Notice that the names add a Filter node that is responsible for filtering out all correctly- are arbitrarily chosen by the user, but give a reference point for classified cases, followed by an Aggregation node which will the next nodes. count the number of results. Observe that the DAG is sufficient Next, we create the Aggregation node and set the right options: to obtain a clear idea of the data flow, and that we can represent aggregation type is set to count and aggregation key is set to complex operations on the data by a relatively small amount of letter. Before we can calculate the computation path, the GUI nodes, i.e., the nodes have a high expressive power. needs to know how to get from sentence to word, and from word to letter. This information is provided by the user through two 5 CURRENT AND FUTURE WORK pop-up windows (Figure 7). Lastly, we add the Output node from Easy Spark is under development. It can already generate code for which we can initiate code generation/execution. complex workflows and submit it for execution, but it still misses some of the envisioned functionalities. In this section, we discuss 4.2 Distributed Video Analysis our current and planned work and the involved challenges. In this use case we will apply our tool to parallelize the data • Adding new node types. Our effort with the current version of individual frames in an input video, and perform necessary of Easy Spark was to provide a zero-learning-curve proof-of- computations on each frame. In particular, we will apply an concept tool that can be used in teaching, and as the entry point of partitions. In the future we will detect these exceptions and propose these standard steps to the user. • Integrating the Spark UI. It will be useful – and with educational value – to integrate Spark’s Web UI in Easy Spark in order to allow the user to monitor the status of the developed Spark application, resource consumption, Spark cluster, and Spark configurations. 6 CONCLUSIONS In this paper we proposed Easy Spark, a Graphical User Interface for easily designing Spark architectures for Big Data Engineering. Figure 9: DAG Corresponding to the ML pipeline The GUI enables the developer to design complex Spark DAGs with arbitrary functionality, by masking Spark constructs and concepts behind traditional programming constructs that any of a newcomer in Spark. It was out of our scope to provide a trained data scientist or computer scientist is able to understand, complete visual programming language equivalent to the Spark use, and configure. We examined three use cases and showed how capabilities. Being convinced with the usefulness of the tool, we Easy Spark can be used to generate executable Spark code. We are now extending it with more node types to cover frequently- also discussed the mechanisms that are currently implemented used functionality, e.g., Spark SQL, GraphX, Spark Streaming, in Easy Spark to guide the user and prevent bugs, and elaborated and the full MLlib library. 
5 CURRENT AND FUTURE WORK

Easy Spark is under development. It can already generate code for complex workflows and submit it for execution, but it still misses some of the envisioned functionalities. In this section, we discuss our current and planned work and the involved challenges.

• Adding new node types. Our goal with the current version of Easy Spark was to provide a zero-learning-curve proof-of-concept tool that can be used in teaching, and as the entry point of a newcomer to Spark. It was out of our scope to provide a complete visual programming language equivalent to Spark's capabilities. Convinced of the usefulness of the tool, we are now extending it with more node types to cover frequently-used functionality, e.g., Spark SQL, GraphX, Spark Streaming, and the full MLlib library. It is fairly straightforward to add more node types. We are also redesigning the user interface to help users find the desired node types quickly, e.g., by adding an ML tab where all the nodes related to MLlib will be grouped.

• Improved support for legacy code. Easy Spark includes support for including hand-written legacy code, e.g., when writing code for For-each nodes. Currently, Aggregate nodes enable selecting from a pre-defined set of functions, such as sum and count, but it is not challenging to enable arbitrary code. Support for legacy code can be further improved by recognizing the signatures of the functions and automatically extracting the output data types, instead of asking the user to update the data types manually whenever these change. Notice that this functionality is feasible, as it is already present in different IDE tools.

• Improved error detection. Automatic extraction of function signatures can be leveraged to identify and present errors to the user that relate to the data types and formats, e.g., when the output of one node does not agree with the expected input of the next node. Again, this is supported by most modern IDEs.

• Showing sample intermediate results. Presentation of sample intermediate results (a sample of the output of each node) during design time is often very useful for getting a rough idea of what is happening up until that point, and for keeping track of the data format that the next node will receive. Our current implementation presents sample intermediary results (the first 10 results) only at some nodes. Notice, however, that some node types, e.g., Filter and For-each, can change this number. The number of sample results shown at each node should be adapted in order to still get meaningful samples throughout the workflow. Furthermore, in some cases the distribution of the sample results is also important for a meaningful sample output, e.g., training a binary SVM classifier requires representative samples from both classes. Starting with a huge number of samples to ensure that all nodes will have an output is also not an option, since this would degrade the performance of Easy Spark. We are now developing methods that adaptively choose the samples at each node, in order to derive meaningful sample results at all nodes.

• Supporting more data input formats and streaming data. This will further reduce the complexity of loading the data, and offer better representations of the data.

• Improved support for Spark-related errors. Some common Spark errors, e.g., OutOfMemory exceptions, can (mostly) be addressed with a few standard steps, e.g., increasing the RAM available to the executors or driver, avoiding collects, or increasing the number of partitions (a configuration sketch follows this list). In the future we will detect these exceptions and propose these standard steps to the user.

• Integrating the Spark UI. It will be useful – and of educational value – to integrate Spark's Web UI in Easy Spark, in order to allow the user to monitor the status of the developed Spark application, resource consumption, the Spark cluster, and the Spark configurations.
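To illustrate the standard mitigations mentioned in the bullet on Spark-related errors, the snippet below shows how they could be expressed in PySpark; the memory size and partition count are examples only, not recommendations from the paper (and driver memory typically has to be set at submit time rather than in code).

 from pyspark import SparkConf, SparkContext

 conf = (SparkConf()
         .setAppName("TunedJob")
         .set("spark.executor.memory", "8g"))      # more RAM per executor

 sc = SparkContext(conf=conf)

 rdd = sc.textFile("big_input.txt").repartition(200)   # more, smaller partitions

 # Prefer writing out (or sampling) over collect(), which pulls all data to the driver.
 rdd.saveAsTextFile("output")
 sc.stop()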
6 CONCLUSIONS

In this paper we proposed Easy Spark, a Graphical User Interface for easily designing Spark architectures for Big Data Engineering. The GUI enables the developer to design complex Spark DAGs with arbitrary functionality, by masking Spark constructs and concepts behind traditional programming constructs that any trained data scientist or computer scientist is able to understand, use, and configure. We examined three use cases and showed how Easy Spark can be used to generate executable Spark code. We also discussed the mechanisms that are currently implemented in Easy Spark to guide the user and prevent bugs, and elaborated on our current and future work to extend and improve it.

7 ACKNOWLEDGEMENTS

This work was partially funded by the EU H2020 project SmartDataLake (825041).

REFERENCES

[1] 2020. Spark Cluster Mode Overview. https://spark.apache.org/docs/latest/cluster-overview.html
[2] 2021. RapidMiner Radoop Feature List. https://rapidminer.com/products/radoop/feature-list/. [Online; accessed 27-January-2021]
[3] A. Bansod. 2015. Efficient Big Data Analysis with Apache Spark in HDFS. International Journal of Engineering and Advanced Technology 4, 6 (2015).
[4] Divy Agrawal, Sanjay Chawla, Bertty Contreras-Rojas, Ahmed Elmagarmid, Yasser Idris, Zoi Kaoudi, Sebastian Kruse, Ji Lucas, Essam Mansour, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Saravanan Thirumuruganathan, and Anis Troudi. 2018. RHEEM: Enabling Cross-Platform Data Processing: May the Big Data Be with You! Proc. VLDB Endow. 11, 11 (July 2018), 1414–1427. https://doi.org/10.14778/3236187.3236195
[5] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107–113.
[6] Nikos Giatrakos, David Arnu, T. Bitsakis, Antonios Deligiannakis, M. Garofalakis, R. Klinkenberg, Aris Konidaris, Antonis Kontaxakis, Y. Kotidis, Vasilis Samoladas, A. Simitsis, George Stamatakis, F. Temme, Mate Torok, Edwin Yaqub, Arnau Montagud, M. Leon, Holger Arndt, and Stefan Burkard. 2020. INforE: Interactive Cross-platform Analytics for Everyone. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (2020).
[7] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2018. Mask R-CNN. arXiv:cs.CV/1703.06870
[8] C. S. Liao, J. M. Shih, and R. S. Chang. 2013. Simplifying MapReduce data processing. International Journal of Computational Science and Engineering 8 (2013). https://doi.org/10.1504/ijcse.2013.055353
[9] Ji Lucas, Yasser Idris, Bertty Contreras-Rojas, Jorge-Arnulfo Quiané-Ruiz, and S. Chawla. 2018. RheemStudio: Cross-Platform Data Analytics Made Easy. In 2018 IEEE 34th International Conference on Data Engineering (2018), 1573–1576.
[10] Zoltán Prekopcsák, Gabor Makrai, T. Henk, and Csaba Gáspár-Papanek. 2011. Radoop: Analyzing Big Data with RapidMiner and Hadoop.
[11] S. Salloum, R. Dautov, X. Chen, P. X. Peng, and J. Z. Huang. 2016. Big data analytics on Apache Spark. Int. J. Data Sci. Anal. (2016).
[12] Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, and Fatma Özcan. 2015. Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics. Proc. VLDB Endow. 8, 13 (Sept. 2015), 2110–2121. https://doi.org/10.14778/2831360.2831365
[13] A. Verma, A. H. Mansuri, and N. Jain. 2016. Big data management processing with Hadoop MapReduce and Spark technology: A comparison. In 2016 Symposium on Colossal Data Analysis and Networking (CDAN), 1–4.
[14] Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. 2016. Apache Spark: a unified engine for big data processing. Commun. ACM 59, 11 (2016), 56–65. https://doi.org/10.1145/2934664