A Survey of Big Data Pipeline Orchestration Tools from the Perspective of the DataCloud Project∗

Mihhail Matskin1, Shirin Tahmasebi1, Amirhossein Layegh1, Amir H. Payberah1, Aleena Thomas2, Nikolay Nikolov2, and Dumitru Roman2
1 KTH Royal Institute of Technology, Stockholm, Sweden {misha,shirint,amlk,payberah}@kth.se
2 SINTEF AS, Oslo, Norway {firstname.lastname}@sintef.no

∗ This work is partly funded by the EC H2020 project "DataCloud: Enabling The Big Data Pipeline Lifecycle on the Computing Continuum" (Grant nr. 101016835). Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. This paper presents a survey of existing tools for Big Data pipeline orchestration based on a comparative framework developed in the DataCloud project. We propose criteria for evaluating the tools with respect to reusability, flexible pipeline communication modes, and separation of concerns in Big Data pipeline descriptions. The survey aims to identify research and technological gaps and to recommend approaches for filling them. Further work in the DataCloud project is oriented towards the design, implementation, and practical evaluation of the recommended approaches.

Keywords: Big Data pipeline · Orchestration tools · Reusability.

1 Introduction

The availability of massive amounts of data has tremendously changed data collection and analysis over the last few years. The concept of Big Data and the supporting solutions have made it possible to deal with potentially unlimited heterogeneous data in different formats within a practically acceptable time. However, the growth of data has increased both opportunities and challenges. In terms of opportunities, data processing is being heavily invested in to empower the decision-making process of organizations in possession of Big Data. At the same time, the data analytics process is becoming complex due to the characteristics of Big Data, the sophisticated tools and technologies involved, different interests among stakeholders, often changing business needs, and the lack of a standardized process for the lifecycle of Big Data pipelines [22].

Because of the complexity of Big Data analysis tasks, the software supporting such analysis requires a combination of a broad spectrum of trusted software components. Such a combination involves integrating components into pipelines that take care of the pipeline execution and data transfer. The design and usage of Big Data pipelines increase the efficiency of data analysis, while at the same time requiring support for designing and managing the pipelines. Many organizations recognize the significance of Big Data pipelines; however, there are still critical challenges in their implementation, such as the heterogeneity of the involved stakeholders and limited knowledge reuse [6]. In this paper, we refer to the tools that support the description and execution of Big Data pipelines as Big Data pipeline orchestration tools. Although many orchestration tools for Big Data pipelines exist, they have not focused on crucial areas such as reusability and separation of concerns.

The problem of combining different components into an executable process is not new. Workflow systems support the integration of the steps of a semi- or fully automated procedure into a manageable process. Business workflows are workflow systems oriented towards automating business processes [6].
Scientific workflows are another type of workflow system, oriented towards the automation of scientific experiments [5]. Recently, Big Data workflows have become prevalent; they refer to modeling processes containing various Big Data analytics or processing steps [25]. The main characteristics of Big Data workflows are the dynamics and heterogeneity of data sources and processing components, which typically require different orchestration models. Big Data pipelines are special cases of Big Data workflows that are more oriented towards end-users. However, since there is no clear boundary between Big Data workflows and Big Data pipelines, in this work we do not make an explicit distinction between them.

The approach in this paper is based on (1) extracting requirements for Big Data pipelines from the business cases of the DataCloud project as well as from the existing scientific literature and software tools around Big Data pipelines, and (2) analyzing which of the existing solutions (if any) can satisfy the identified requirements. For the requirements extraction, we defined the following Research Questions (RQ), which guided our work:

– RQ1. How to bridge the technological gap between the different experts involved in Big Data pipeline design, implementation, and management?
– RQ2. How to support the reuse of previously developed knowledge and solutions in designing and implementing Big Data pipelines?
– RQ3. How to support debugging of Big Data pipelines?

In this paper, we survey existing Big Data pipeline orchestration tools, identify research and technological gaps, and suggest approaches to filling the gaps. This work is done in the context of the HORIZON 2020 project DataCloud (https://datacloudproject.eu).

The rest of the paper is organized as follows. Section 2 provides a brief overview of the DataCloud project, including its objectives and expected results. Section 3 describes the classifiers, derived from the identified requirements, for building a comparison table of the existing tools relevant to the DataCloud project perspective, and includes the comparison tables used to identify gaps in the current solutions. The final Section 4 summarizes the main findings and refers to ongoing work in the DataCloud project to solve the identified gaps.

Fig. 1. Big Data pipeline lifecycle.

2 DataCloud Project Perspective

In this section, we provide a brief overview of the DataCloud project and present some requirements for Big Data pipeline orchestration tools identified while working on the project.

2.1 The DataCloud project

The DataCloud project, which runs between 2021 and 2023, aims to develop a novel paradigm for Big Data processing over heterogeneous resources, including the Cloud/Edge/Fog Computing Continuum. The core concept of the project is Big Data pipelines, whose complete lifecycle is supported by several processing capabilities. The DataCloud project utilizes this paradigm to solve issues for a broad set of business cases coming from Small and Medium-sized Enterprises (SMEs) and large organizations that have difficulties in capitalizing on Big Data due to the lack of technical expertise and suitable processing capabilities.

In the DataCloud project, we develop a set of new languages, methods, infrastructures, and software prototypes for discovering, simulating, deploying, and adapting Big Data pipelines on heterogeneous and untrusted resources. The project underlines the separation of concerns and separates the design of Big Data pipelines from the run-time aspects of their deployment.
This separation allows domain experts without significant technical/programming knowledge to participate in the definition and management of Big Data pipelines. Moreover, the DataCloud solutions allow organizations to incorporate Big Data pipelines into their business processes more efficiently and make the pipelines accessible to a broader set of stakeholders regardless of the hardware infrastructure. DataCloud assumes a typical Big Data pipeline lifecycle involving a set of high-level processing steps executed in a loop (see Figure 1). The project aims to deliver its solutions supporting the Big Data pipeline lifecycle as a set of interoperable tools that form the DataCloud toolbox. The toolbox includes the following components (see Figure 2):

– DIS-PIPE: Integrates process mining techniques and Artificial Intelligence (AI) algorithms to learn the structure of Big Data pipelines. The learning is based on extracting, processing, and interpreting huge amounts of event data collected from heterogeneous data sources.
– DEF-PIPE: Provides support for the visual design and description of Big Data pipelines based on a Domain Specific Language (DSL). The tool includes the means to store and load pipeline definitions, which enables the reuse of previously developed solutions. It also enables data experts and domain experts to define step content by configuring individual steps and injecting code or by customizing generic predefined step templates.
– SIM-PIPE: Simulates the enactment of Big Data pipelines and provides pipeline testing functionality, including a sandbox for evaluating the performance of individual pipeline steps. Furthermore, SIM-PIPE provides a simulator to analytically predict the performance of the overall Big Data pipelines across the Computing Continuum resources.
– R-MARKET: Deploys a decentralized backbone resource network based on a hybrid permissioned and permissionless blockchain. This component provides a marketplace for resources and enables transparent provisioning of resources, which increases the overall trust.
– DEP-PIPE: Enables elastic and scalable deployment of Big Data pipelines with real-time event detection and automated decision-making.
– ADA-PIPE: Provides a data-aware algorithm for intelligent and adaptive provisioning of resources and services across the Computing Continuum, as well as intelligent resource reconfiguration.

Fig. 2. DataCloud Tools.

We evaluate the DataCloud solutions on five business cases provided by the DataCloud business partners, which cover a broad spectrum of Big Data pipeline applications:

– Smart mobile marketing campaigns.
– Automatic live sports content annotation.
– Digital health system.
– Predicting deformations in ceramics.
– Analytics of manufacturing assets.

2.2 Identifying Requirements

The diversity and complexity of modeling data and processing Big Data pipelines, together with the heterogeneity of Computing Continuum platforms, require a multidisciplinary effort that combines domain knowledge, data knowledge, and technical knowledge of the computational environment. However, the collaboration among domain, data, and technical experts requires repeated communication cycles, introducing significant overhead and barriers to success. Therefore, it is crucial and challenging to provide tools that bridge the technological and knowledge gaps between all relevant experts and enable them to collaborate while keeping the separation of concerns.
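To make this separation of concerns more concrete, the sketch below shows, in plain Python, the kind of design-time pipeline description that a DEF-PIPE-style tool could keep apart from run-time deployment details. It is purely illustrative and is not the DataCloud DSL; all step names, container images, and configuration keys are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Step:
    name: str                                   # reusable step template
    image: str                                  # container implementing the step
    params: Dict[str, str] = field(default_factory=dict)


@dataclass
class Pipeline:
    name: str
    steps: List[Step]
    data_flow: List[Tuple[str, str]]            # (producer, consumer) step names


# Design-time view: what the pipeline does (authored by domain and data experts).
ingest = Step("ingest", image="acme/ingest:1.0", params={"source": "s3://bucket/raw"})
clean = Step("clean", image="acme/clean:1.0")
annotate = Step("annotate", image="acme/annotate:1.0")
pipeline = Pipeline(
    name="sports-annotation",
    steps=[ingest, clean, annotate],
    data_flow=[("ingest", "clean"), ("clean", "annotate")],
)

# Run-time view: where and how the steps execute, kept in a separate
# configuration that an orchestrator (not shown here) would consume.
deployment = {
    "ingest": {"placement": "edge", "replicas": 1},
    "clean": {"placement": "cloud", "replicas": 4},
    "annotate": {"placement": "cloud", "gpus": 1},
}
```

The point of such a split is that the upper, design-time part can be authored and reused by domain and data experts, while the lower, deployment part can be changed by infrastructure experts without touching the pipeline logic.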
Furthermore, there already exist many design solutions for Big Data pipelines, and reusing them can boost the design and development of new pipelines. The recent development of containers, a technology allowing efficient and reliable installation of software components on different platforms, provides a good foundation for implementing such reusable solutions, which makes containerization a promising approach in this context.

In designing Big Data pipelines, we also need to consider the computational resources and decide on the deployment of the pipelines. However, in many cases it is not easy to make such a decision in the design phase, and running pipelines in the production computational environment might be expensive. Therefore, to make the process more efficient, we need a simulation and debugging environment for pipelines that operates without deploying the pipeline. This is why combining pipeline descriptions with simulation tools is an essential requirement for modern systems.

By summarizing the above aspects and taking into account the properties of the DataCloud project described in Section 2.1, we outline the following requirements for Big Data pipeline orchestration tools:

– Req1: Provide separation of concerns between design and run-time aspects.
– Req2: Provide convenient means for describing pipelines, including visual and textual (DSL-based) interfaces.
– Req3: Support reusability of previously developed steps and pipelines in designing new pipelines.
– Req4: Provide flexible data transfer between steps in pipelines.
– Req5: Support containerization for nodes and pipeline descriptions.
– Req6: Provide smooth integration of description and simulation of components.

3 Overview of Existing Solutions

In this section, we describe a set of criteria that we consider important for Big Data pipeline orchestration tools and compare current solutions with respect to those criteria. Most of the criteria refer to the requirements identified in Section 2.2.

3.1 Criteria for Comparison

The traditional way of representing pipelines is a DSL incorporated into an orchestration environment. We analyzed existing tools based on the following classifiers, which reflect the requirements identified in the previous section.

Type of Workflow/Pipeline. We consider three types of workflows/pipelines: business workflows, scientific workflows, and Big Data workflows (defined in Section 1). Each tool falls into one of these three types according to its main applications.

Workflow/Pipeline Model. Workflow models can be categorized as follows (the categories are not mutually exclusive):

– Script-based: In this type of workflow, the composition of nodes is described using scripting languages. Such workflows allow expert users to design complex applications more flexibly and concisely [33].
– Event-based: These workflows are characterized by a discrete set of states. A transition from one state to another happens on the occurrence of events emitted asynchronously or by an external trigger. Hence, in event-based workflows, users define event rules to declare under what circumstances state transitions should occur. This provides a responsive orchestration process [6, 50].
– Adaptive: This model allows designing adaptive or context-aware workflows that take runtime situations and exceptions into account, such as a failure in pre-processing an input file. Such workflows can respond effectively to dynamic environmental needs [1, 6, 49, 58].
– Declarative: In a declarative approach, a minimal set of requirements, often expressed as a set of constraints, is defined. The execution of the workflow is then allowed as long as this set of constraints is satisfied. This increases flexibility and is especially useful in highly unpredictable contexts, in which there is a large number of allowed and possible alternatives [7, 13, 14].
– Procedural: This workflow model explicitly specifies the sequence of steps and tasks, known as the control flow. Thus, a process can be executed only as explicitly specified in the control flow [14, 44].

Separation of Concerns. This criterion is related to the Req1 requirement. If a tool has a mechanism for separating high-level workflow definition concerns from step-specific implementation and deployment details, it supports separation of concerns for stakeholders (the value of the classifier is Yes in the tables in the rest of the paper). Otherwise, it is concluded that separation of concerns is not a focus [15] (the value of the classifier is No).

Type of Language. This criterion is related to the Req2 requirement. We consider the following types of languages:

– General-Purpose Language (GPL): A GPL is a language applicable across a variety of application domains (e.g., Python, R, Java).
– Domain-Specific Language (DSL): A DSL is a language specially designed for a specific problem domain. The main advantage of using a DSL is that domain experts, who may have little knowledge outside of their domain, can efficiently design the relevant parts of the pipeline logic. Its main drawbacks are limited portability across different environments and low applicability outside the target domain.

Input Supported. This criterion is related to the Req2 requirement. Here, we consider two possible input types for designing and implementing Big Data pipelines: text-based input and graphical (visual) input. We assume that tools with visual input may also support text-based input.

Ease of Use. This criterion is related to the Req2 requirement. The possible values for this classifier are Hard, Medium, and Easy, which refer to the level of expertise a user needs in order to use the tool. For example, if a tool allows designing a workflow through a clear graphical interface, the tool is considered Easy to use. If the tool relies on a lightweight, human-readable text format, such as YAML, XML, or JSON, the level of ease of use is considered Medium. Otherwise, if programming knowledge and skills are needed to use the tool, the tool is considered Hard to use. Sometimes it is difficult to place a tool into only one category, so we allow a combination of values, for example, Medium/Hard or Easy/Medium.

Focus on Reusability. This criterion is related to the Req3 requirement. Reusability refers to the characteristic of a designed workflow whereby it can be used to create another, similar workflow. Reusability is a tool's focus if (i) it provides a visual drag-and-drop feature for reusing previously designed workflows, or (ii) it supports a search for previously designed similar solutions that can be imported as text-based workflows and reused in designing the current one. Otherwise, it is concluded that reusability is not a focus [15]. For tools in the former group, the value of the classifier is Yes; for the latter, it is No.
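The reusable-elements criterion introduced next distinguishes reuse of the entire workflow definition, of step definitions, and of step implementations. In GPL-based tools, the last of these often amounts to ordinary code reuse; the hypothetical sketch below (the function and pipeline names are ours and do not come from any surveyed tool) shows one step implementation shared by two pipeline scripts.

```python
def clean_text(records):
    """Reusable step implementation shared by several pipelines."""
    return [r.strip().lower() for r in records]


def marketing_pipeline(raw_records):
    # Pipeline A reuses the step implementation as-is.
    cleaned = clean_text(raw_records)
    return {"campaign_tokens": [r.split() for r in cleaned]}


def annotation_pipeline(raw_records):
    # Pipeline B reuses the very same implementation in a different workflow.
    cleaned = clean_text(raw_records)
    return [(r, len(r)) for r in cleaned]


if __name__ == "__main__":
    sample = ["  Hello World ", "BIG Data  "]
    print(marketing_pipeline(sample))
    print(annotation_pipeline(sample))
```

DSL-based tools, in contrast, typically need an explicit sharing mechanism, such as importable step templates or a searchable repository, to reach the same level of reuse.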
Reusable elements. This criterion is related to the Req3 requirement. In this research, the reusability of each tool is evaluated in three aspects:

– Whether or not it is possible to reuse the definition of the entire workflow.
– Whether or not it is possible to reuse the definition of each step.
– Whether or not it is possible to reuse the step implementation.

The possible values for each of these aspects are Yes, No, and Partial; Partial means that, although no specific way of sharing assets is provided, sharing can be done by manually copying scripts.

Nested Step Definition. This criterion is related to the Req3 requirement. It is a binary classifier whose value is Yes if the tool has a specific way of defining and using a step inside, and as part of, another step. Otherwise, its value is No.

Configurable Data Transmission Medium Definition. This criterion is related to the Req4 requirement. The classifier has two possible values, Yes and No, depending on whether or not the tool allows users to choose among multiple data transmission media (such as a shared file system, web services, RPC, file transfer, HTTP, and FTP) for transferring data between steps during pipeline execution.

Configurable Communication Medium Definition. This criterion is related to the Req4 requirement. The classifier has two possible values, Yes and No, depending on whether or not the tool supports choosing the medium of control-flow communication between steps during pipeline definition, invocation, and execution. For example, this could be done using a RESTful API, message queues, or RPC.

Containerization. This criterion is related to the Req5 requirement. In different workflow types, containers can be used to automate the deployment process, enabling better scalability. We consider three possible approaches to containerization [15]:

– Workflow-level: The entire workflow is wrapped inside a container, allowing scalability of the full workflow.
– Step-level: Each step is encapsulated and wrapped inside a container, allowing scalability of individual steps.
– Encapsulation: The Big Data pipeline framework/tool itself is available as a container image for installation and local or distributed usage.

Integrates a Simulation Tool. This criterion is related to the Req6 requirement. Here, we consider whether a tool integrates a simulation tool. The value of this classifier is either Yes or No.

Monitoring. This criterion captures whether the tool provides means for monitoring the execution of the pipeline. Depending on how the execution can be monitored, a tool falls into one of the following three sub-categories:

– Runtime: The monitoring is available in real time.
– Logging: The run logs are available at the end of the execution.
– No: The tool offers no monitoring support.

3.2 Tools Comparison

In Tables 1-4, we compare a representative number of the most popular Big Data pipeline orchestration tools with respect to the classifiers described in the previous subsection.
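To make several of these classifiers concrete before turning to the tables, the sketch below shows a small pipeline defined in Apache Airflow [3, 30], which the tables classify as a script-based, GPL (Python), text-input tool that is Hard to use. The task names and the container image are illustrative only, and the DockerOperator step additionally assumes that Airflow's Docker provider package (apache-airflow-providers-docker) is installed.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.docker.operators.docker import DockerOperator


def extract():
    # Step implementation injected as ordinary Python code (GPL, text input).
    print("pulling raw events")


with DAG(
    dag_id="example_big_data_pipeline",   # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)

    # Step-level containerization: this step runs in its own container image.
    transform_task = DockerOperator(
        task_id="transform",
        image="acme/transform:1.0",       # hypothetical image
        command="python transform.py",
    )

    extract_task >> transform_task        # control flow is part of the script
```

A declarative, DSL-based tool such as Argo Workflow [4] would instead capture the same steps and their dependencies in a YAML document of containerized step templates.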
Table 1. Classification Summary

Tool | Type of Workflow | Workflow Model | Type of Language | Input | Ease of Use | Monitoring
Apache Airflow [3, 30] | Big Data | Script-based, Event-based, Adaptive, Procedural | GPL | Text (4) | Hard | -
Argo Workflow [4] | Big Data | Script-based, Declarative, Procedural | DSL | Text | Hard | -
Pachyderm [40, 42] | Big Data | Script-based, Adaptive | DSL | Text | Hard | Runtime
Kepler [59] | Scientific | Procedural | DSL | Visual | Easy | Runtime
Trifacta [56] | Scientific | Procedural | DSL | Visual | Easy | Runtime
StreamPipes [54] | Big Data | Script-based, Event-based, Adaptive, Procedural | DSL (5) | Text (6) | Easy/Medium | Runtime
MakeFlow [32] | Scientific | Script-based, Declarative | DSL | Text | Hard | Runtime
CGAT-Core [16] | Scientific | Script-based, Adaptive | GPL | Text | Hard | Runtime
Snakemake [28, 52] | Big Data | Script-based, Adaptive, Declarative | DSL | Text | Hard | Runtime
Pegasus [12, 43] | Scientific | Script-based | DSL | Text | Hard | Runtime

Table 2. Classification Summary (cont.)

Tool | Type of Workflow | Workflow Model | Type of Language | Input | Ease of Use | Monitoring
NextFlow [35] | Scientific | Script-based, Adaptive, Declarative | DSL | Text | Medium | Runtime
KubeFlow [29] | Big Data | Script-based, Adaptive, Procedural | GPL | Text | Hard | -
Toil [55] | Big Data | Script-based, Adaptive, Procedural | GPL | Text | Hard | Runtime
BioDepot Workflow Builder [8] | Scientific | Script-based, Procedural | DSL | Visual | Easy | Runtime
Hyperloom [23] | Big Data | Script-based, Adaptive, Procedural | GPL | Text | Hard | Logging
MachineFlow [31] | Scientific | Procedural | GPL | Text | Medium | Logging
SoS Workflows [57] | Scientific | Script-based, Adaptive, Procedural | DSL (7) | Text | Medium | Runtime
Dray [17] | Scientific | Script-based, Procedural | GPL | Text | Medium | Runtime
Flyte [18] | Scientific | Script-based, Adaptive, Procedural | GPL | Text | Medium | Runtime
Galaxy [20] | Scientific | Procedural | DSL | Visual | Easy | Runtime
Table 3. Classification Summary (cont.)

Tool | Type of Workflow | Workflow Model | Type of Language | Input | Ease of Use | Monitoring
Popper [24] | Scientific | Script-based, Procedural, Declarative (8) | DSL | Text | Easy/Medium | -
StreamFlow [53] | Scientific | Script-based, Procedural | DSL | Text | Medium/Hard | -
BPipe [10] | Scientific | Script-based | DSL | Text | Medium/Easy | Logging
YAWL [19] | Business | Adaptive, Declarative | DSL | Visual | Easy | -
Apache Oozie [41] | Big Data | Script-based, Event-based, Procedural | DSL | Text | Hard | Runtime
KNIME [27] | Big Data | Script-based, Adaptive, Declarative, Procedural | DSL | Visual | Easy | Runtime
Google Workflow [21] | Business | Script-based, Adaptive, Declarative, Procedural | DSL | Text | Medium | Logging
Keboola [26] | Big Data | Script-based, Event-based, Procedural | DSL | Visual | Easy | Runtime
Accio [2] | Scientific | Script-based | DSL | Text (4) | Medium | Runtime
Node-RED [38, 39] | Big Data | Script-based, Event-based, Adaptive, Declarative, Procedural | DSL | Visual | Easy | Logging

Table 4. Classification Summary (cont.)

Tool | Type of Workflow | Workflow Model | Type of Language | Input | Ease of Use | Monitoring
Skitter [48, 51] | Big Data | Script-based, Adaptive | DSL | Text | Hard | -
Dagster [11] | Big Data | Script-based, Event-based, Adaptive, Procedural | GPL | Text (4) | Hard | Logging
Prefect [45, 46] | Big Data | Script-based, Event-based, Adaptive, Declarative, Procedural | GPL | Text (4) | Hard | Runtime
Apache NiFi [36] | Big Data | Declarative, Procedural | DSL | Visual | Easy | Runtime
Conductor [34] | Big Data | Script-based, Event-based, Adaptive, Declarative | DSL | Text (4) | Medium | Logging
Reflow [47] | Scientific | Procedural | DSL | Text | Hard | Logging
BMC Control-M [9] | Business | Script-based, Event-based, Adaptive, Declarative, Procedural | DSL | Visual | Easy | Runtime

Notes:
(4) There exist visual and graphical tools for working with workflows, but not for creating and defining workflows.
(5) Although the Java programming language is used for defining data sources, data sinks, and data processors, these are all converted to an RDF notation, and a graphical tool is used for defining the whole control flow; hence the language is classified as DSL, not GPL.
(6) Data sources, data sinks, and data processors are implemented in the Java programming language; however, a GUI is available for creating the whole workflow.
(7) The DSL provides domain-specific syntax and is built on top of Python 3.
(8) Provides explicit support for external automation tools to deploy experiments.

4 Conclusions

In this paper, we investigated existing Big Data pipeline orchestration tools. By analysing them (in Section 3.2), we show that important requirements defined in the DataCloud project are not (or are only partially) satisfied by the currently available tools. In particular, only a few tools support a graphical input language for the description of pipelines. While several tools allow some level of reusability, most reusability aspects (such as support for searching available suitable solutions) are not implemented. Moreover, integration with other tools (including simulation tools) is not supported by many of them. Among the reviewed tools, Apache Airflow, Argo Workflow, and Snakemake come closest to our requirements. Nevertheless, although some aspects of some requirements are considered in some tools, no tool supports all required aspects.

To address these problems in the DataCloud project, we are developing the DEF-PIPE component (see Section 2.1) for designing Big Data pipelines. This component will provide means for the description and manipulation of pipelines and the environment.
It will also support a complete graphical and textual interface, accumulation and reuse of solutions across different applications, flexible integration with a simulation tool, and separation of concerns between the description of design-time and run-time aspects. Preliminary results in this direction are reported in [37].

References

1. van der Aalst, W.M., Basten, T., Verbeek, H., Verkoulen, P.A., Voorhoeve, M.: Adaptive workflow: On the interplay between flexibility and support. In: Enterprise Inf. Systems, pp. 63-70. Springer (2000)
2. Accio: Accio - Workflow Authoring. https://privamov.github.io/accio/docs/creating-workflows.html, [Online; accessed 01-April-2021]
3. Airflow, A.: Tutorial. https://airflow.apache.org/docs/apache-airflow/stable/tutorial, [Online; accessed 25-March-2021]
4. Argo: Argo Workflow - Docs. https://argoproj.github.io/argo-workflows/, [Online; accessed 21-April-2021]
5. Barga, G., Taylor, D.: Workflows for e-Science, chap. Scientific versus Business Workflows, pp. 9-16. Springer (2007)
6. Barika, M., Garg, S., Zomaya, A.Y., Wang, L., Moorsel, A.V., Ranjan, R.: Orchestrating big data analysis workflows in the cloud: research challenges, survey, and future directions. ACM Computing Surveys (CSUR) 52(5), 1-41 (2019)
7. Bernardi, M.L., Cimitile, M., Di Lucca, G., Maggi, F.M.: Using declarative workflow languages to develop process-centric web applications. In: 2012 IEEE 16th International Enterprise Distributed Object Computing Conference Workshops, pp. 56-65. IEEE (2012)
8. BioDepot: BioDepot-Workflow-Builder - General Information. https://bwb.readthedocs.io/en/latest/#general-information, [Online; accessed 01-April-2021]
9. BMC: BMC Control-M - Docker Image with Embedded Agent. https://docs.bmc.com/docs/display/public/ctmapitutorials/Manage+workload+in+Docker+Containers, [Online; accessed 26-March-2021]
10. BPipe: BPipe - Pipeline Events. http://docs.bpipe.org/Guides/PipelineEvents/, [Online; accessed 02-April-2021]
11. Dagster: Dagster - Concepts. https://docs.dagster.io/concepts, [Online; accessed 26-March-2021]
12. Deelman, E., Vahi, K., Juve, G., Rynge, M., Callaghan, S., Maechling, P.J., Mayani, R., Chen, W., Da Silva, R.F., Livny, M., et al.: Pegasus, a workflow management system for science automation. Future Generation Computer Systems 46, 17-35 (2015)
13. Demeyer, R., Van Assche, M., Langevine, L., Vanhoof, W.: Declarative workflows to efficiently manage flexible and advanced business processes. In: Proceedings of the 12th International ACM SIGPLAN Symposium on Principles and Practice of Declarative Programming, pp. 209-218 (2010)
14. van der Aalst, W.M., Pesic, M., Schonenberg, H.: Declarative workflows: Balancing between flexibility and support. Computer Science - Research and Development 23(2), 99-113 (2009)
15. Dessalk, Y.D., Nikolov, N., Matskin, M., Soylu, A., Roman, D.: Scalable execution of big data workflows using software containers. In: Proceedings of the 12th International Conference on Management of Digital EcoSystems, pp. 76-83 (2020)
16. Developers, C.: CGAT-core documentation. https://cgat-core.readthedocs.io/en/latest/, [Online; accessed 21-April-2021]
17. Dray: Dray Overview. https://github.com/CenturyLinkLabs/dray (2015), [Online; accessed 17-March-2021]
18. Flyte: Flyte Documentation. https://docs.flyte.org/en/latest/index.html (2020), [Online; accessed 17-March-2021]
19. Foundation, Y.: YAWL - Yet Another Workflow Language. http://www.yawlfoundation.org/documents/YAWL_leaflet-final.pdf (2007), [Online; accessed 19-March-2021]
20. Galaxy: Galaxy Tutorials. https://galaxyproject.org/learn/ (2005), [Online; accessed 17-March-2021]
21. Google: Google Workflow - Error Handling Syntax. https://cloud.google.com/workflows/docs, [Online; accessed 31-March-2021]
22. Henning Baars, J.E.: From data warehouses to analytical atoms - the internet of things as a centrifugal force in business intelligence and analytics. In: Twenty-Fourth European Conference on Information Systems. ECIS (2016)
23. Hyperloom: Hyperloom - Basic Terms. https://loom-it4i.readthedocs.io/en/latest/intro.html#basicterms, [Online; accessed 01-April-2021]
24. Jimenez, I., Sevilla, M., Watkins, N., Maltzahn, C., Lofstead, J., Mohror, K., Arpaci-Dusseau, A., Arpaci-Dusseau, R.: The popper convention: Making reproducible systems evaluation practical. In: 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1561-1570 (2017). https://doi.org/10.1109/IPDPSW.2017.157
25. Kashlev, A., Lu, S., Mohan, A.: Big data workflows: A reference architecture and the dataview system. Services Transactions on Big Data (STBD) 4(1), 1-19 (2017)
26. Keboola: Keboola - Overview. https://help.keboola.com/overview/, [Online; accessed 02-April-2021]
27. Knime: KNIME - Extensions. https://docs.knime.com/2019-06/analytics_platform_quickstart_guide/index.html#extend-knime-analytics-platform, [Online; accessed 31-March-2021]
28. Köster, J., Rahmann, S.: Snakemake - a scalable bioinformatics workflow engine. Bioinformatics 28(19), 2520-2522 (2012)
29. Kubeflow: Kubeflow - Pipelines. https://www.kubeflow.org/docs/components/pipelines/overview/pipelines-overview/#what-is-a-pipeline, [Online; accessed 01-April-2021]
30. Lal Chattaraj, J., Villamariona, J.: Apache Airflow Tutorial - DAGs, Tasks, Operators, Sensors, Hooks & XCom. https://www.qubole.com/tech-blog/apache-airflow-tutorial-dags-tasks-operators-sensors-hooks-xcom/, [Online; accessed 25-March-2021]
31. MachineFlow: MachineFlow Overview. https://github.com/sean-mcclure/machine_flow (2018), [Online; accessed 17-March-2021]
32. MakeFlow: MakeFlow - Overview. https://cctools.readthedocs.io/en/latest/makeflow/#overview, [Online; accessed 02-April-2021]
33. Marozzo, F., Talia, D., Trunfio, P.: Js4cloud: script-based workflow programming for scalable data analysis on cloud platforms. Concurrency and Computation: Practice and Experience 27(17), 5214-5237 (2015)
34. Netflix: Conductor - Configuration. https://netflix.github.io/conductor/configuration, [Online; accessed 25-March-2021]
35. Nextflow: NextFlow - Completion Handler. https://www.nextflow.io/docs/latest/metadata.html#completion-handler, [Online; accessed 25-March-2021]
36. NiFi, A.: Apache NiFi - Expression Language Guide. https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html, [Online; accessed 26-March-2021]
37. Nikolov, N., Dessalk, Y.D., Khan, A.Q., Soylu, A., Matskin, M., Payberah, A.H., Roman, D.: Conceptualization and scalable execution of big data workflows using domain-specific languages and software containers. Internet of Things, 100440 (2021). https://doi.org/10.1016/j.iot.2021.100440, https://www.sciencedirect.com/science/article/pii/S2542660521000834
38. Node-RED: Node-RED - Flow Control. https://cookbook.nodered.org/, [Online; accessed 01-April-2021]
39. Node-RED: Node-RED - Subflow. https://nodered.org/docs/creating-nodes/, [Online; accessed 01-April-2021]
40. Novella, J.A., Emami Khoonsari, P., Herman, S., Whitenack, D., Capuccini, M., Burman, J., Kultima, K., Spjuth, O.: Container-based bioinformatics with pachyderm. Bioinformatics 35(5), 839-846 (2019)
41. Oozie, A.: Apache Oozie - Coordinator Job. https://oozie.apache.org/docs/5.2.1/CoordinatorFunctionalSpec.html#a1._Coordinator_Overview, [Online; accessed 26-March-2021]
42. Pachyderm: Pachyderm - Introduction. https://docs.pachyderm.com/latest/how-tos/create-pipeline/, [Online; accessed 25-March-2021]
43. Pegasus: Pegasus - Documentation. https://pegasus.isi.edu/documentation, [Online; accessed 02-April-2021]
44. Pesic, M., Schonenberg, H., van der Aalst, W.: Declarative workflow. In: Modern Business Process Automation, pp. 175-201. Springer (2010)
45. Prefect: Prefect - Core Concepts. https://docs.prefect.io/core/concepts/, [Online; accessed 25-March-2021]
46. Prefect: Prefect - Latest API. https://docs.prefect.io/api/latest/, [Online; accessed 25-March-2021]
47. Reflow: Reflow - Overview. https://github.com/grailbio/reflow, [Online; accessed 25-March-2021]
48. Saey, M.: A DSL for distributed, reactive workflows (2018)
49. Seiger, R., Huber, S., Schlegel, T.: Toward an execution system for self-healing workflows in cyber-physical systems. Software & Systems Modeling 17(2), 551-572 (2018)
50. Semeniuta, O., Falkman, P.: EPypes: a framework for building event-driven data processing pipelines. PeerJ Computer Science 5, e176 (2019)
51. Skitter: Skitter - Documents. https://soft.vub.ac.be/~mathsaey/skitter/docs/latest/readme.html, [Online; accessed 26-March-2021]
52. Snakemake: Snakemake - Tutorial. https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html#snakemake-tutorial, [Online; accessed 26-March-2021]
53. StreamFlow: StreamFlow GitHub. https://github.com/alpha-unito/streamflow (2020), [Online; accessed 17-March-2021]
54. Streampipe, A.: StreamPipes - Tutorial. https://streampipes.apache.org/docs/docs/dev-guide-tutorialsources/, [Online; accessed 01-April-2021]
55. Toil: Toil - Workflows with Multiple Jobs. https://toil.readthedocs.io/en/latest/developingWorkflows/developing.html#workflows-with-multiple-jobs, [Online; accessed 01-April-2021]
56. Trifacta: Trifacta - Functions. https://docs.trifacta.com/display/SS/Wrangle+Language#WrangleLanguage-Functions.1 (2013), [Online; accessed 26-March-2021]
57. Wang, G., Peng, B.: Script of scripts: A pragmatic workflow system for daily computational research. PLoS Computational Biology 15 (2019)
58. Wieland, M., Schwarz, H., Breitenbücher, U., Leymann, F.: Towards situation-aware adaptive workflows: SitOPT, a general purpose situation-aware workflow management system. In: 2015 IEEE International Conference on Pervasive Computing and Communication Workshops (PerCom Workshops), pp. 32-37. IEEE (2015)
59. Wikipedia: Kepler - Hierarchical Workflows. https://en.wikipedia.org/wiki/Kepler_scientific_workflow_system#Hierarchical_workflows (2015), [Online; accessed 01-April-2021]