Workflow Models for Heterogeneous Distributed Systems

Iacopo Colonnelli¹

¹ Università degli Studi di Torino, Department of Computer Science, Corso Svizzera 185, 10149, Torino, Italy

Abstract
This article introduces a novel hybrid workflow abstraction that injects topology awareness directly into the definition of a distributed workflow model. In particular, the article briefly discusses the advantages brought by this approach to the design and orchestration of large-scale data-oriented workflows, the current level of support from state-of-the-art workflow systems, and some future research directions.

Keywords
Scientific Workflows, HPC, Cloud Computing, Distributed Computing, Hybrid Workflows

1. Hybrid workflow models

When considering data-oriented workflows, all the aspects of data management become crucial for performance optimisation, privacy preservation, and security. The data locality principle, i.e., moving computation close to the data, inspired the foundational algorithms [1] and data structures [2] of modern Big Data analysis frameworks and became a non-negotiable requirement of federated learning approaches [3]. On the other hand, there are scenarios in which it is worthwhile, or even unavoidable, to transfer data between different modules of a complex application. The modular nature of modern applications and the heterogeneity of contemporary hardware resources and their features, further exacerbated by the end-to-end co-design approach [4], require Workflow Management Systems (WMSs) to support a large ecosystem of execution environments (from HPC to cloud to the Edge), optimisation policies (performance vs. energy efficiency), and computational models (from classical to quantum). For these reasons, modern workflow models and tools need to be topology-aware, allowing an explicit mapping of workflow steps onto (families of) processing elements.
This mapping can be either manual, driven by the combined experience of domain experts and computer scientists, or (semi-)automatic, using advanced learning algorithms to infer the best-suited execution environment for each step.

A hybrid workflow can be defined as a workflow whose steps can span multiple, heterogeneous, and independent computing infrastructures [5]. Each of these aspects has significant implications. Support for multiple infrastructures implies that each step must potentially target a different deployment location in charge of executing it. Locations can be heterogeneous, exposing different methods and protocols for authentication, communication, resource allocation, and job execution. Plus, they can be independent of each other, meaning that direct communications and data transfers among them may not be allowed. A suitable model for hybrid workflows must then be composite, enclosing a specification of the workflow dependencies, a topology of the involved locations, and a mapping relation between steps and locations.

ITADATA2023: The 2nd Italian Conference on Big Data and Data Science, September 11–13, 2023, Naples, Italy
Email: iacopo.colonnelli@unito.it (I. Colonnelli)
URL: https://alpha.di.unito.it/iacopo-colonnelli/ (I. Colonnelli)
ORCID: 0000-0001-9290-2017 (I. Colonnelli)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

2. State of the art and future directions

Grid-native WMSs [6, 7, 8] typically support distributed workflows out of the box, providing automatic scheduling and data transfer management across multiple execution locations. However, all the orchestration aspects are delegated to external, grid-specific libraries and frameworks, limiting the spectrum of supported execution environments.
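As a purely illustrative sketch, the composite model described in Section 1 (a specification of workflow dependencies, a topology of the involved locations, and a mapping relation between steps and locations) could be encoded as follows. All class and field names are hypothetical assumptions for illustration only and do not correspond to the data model or API of any WMS cited in this article.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Location:
    """A deployment location, e.g. an HPC queue or a cloud VM (hypothetical)."""
    name: str
    kind: str  # e.g. "hpc", "cloud", "edge"

@dataclass
class Step:
    """A workflow step together with its upstream data dependencies."""
    name: str
    depends_on: set = field(default_factory=set)  # names of upstream steps

@dataclass
class HybridWorkflow:
    """Composite model: dependency graph + location topology + mapping."""
    steps: dict      # step name -> Step
    locations: dict  # location name -> Location
    mapping: dict    # step name -> set of candidate location names

    def is_well_formed(self) -> bool:
        # Every step must be mapped onto at least one declared location,
        # and every dependency must refer to a declared step.
        return all(
            self.mapping.get(s) and self.mapping[s] <= self.locations.keys()
            for s in self.steps
        ) and all(
            dep in self.steps
            for step in self.steps.values()
            for dep in step.depends_on
        )
```

A two-step example binds a preprocessing step to a cloud VM and a training step to an HPC facility:

```python
wf = HybridWorkflow(
    steps={"prep": Step("prep"), "train": Step("train", {"prep"})},
    locations={"hpc": Location("hpc", "hpc"), "vm": Location("vm", "cloud")},
    mapping={"prep": {"vm"}, "train": {"hpc"}},
)
```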
Recently, a new class of topology-aware WMSs has started to be designed and implemented, bringing advantages in the performance and cost of workflow executions on heterogeneous distributed environments. StreamFlow [9] augments the Common Workflow Language (CWL) [10] open standard with a topology of deployment locations and relies on a set of connectors to support several execution environments, from HPC queue managers to container orchestrators. DagOnStar [11] allows users to model hybrid workflows as pure Python scripts, scheduling each task on an HPC facility, a cloud VM, or a software container. Jupyter Workflow [12] transforms a sequential computational notebook into a hybrid workflow by treating each cell as a workflow step, semi-automatically extracting inter-cell data dependencies from the code, and mapping each cell onto one or more execution locations. Mashup [13] automatically maps each workflow step onto the best-suited location, choosing between cloud VMs and serverless platforms.

Hybrid workflows have proved flexible enough to efficiently model and orchestrate large-scale applications from a diverse set of domains, including bioinformatics [9, 14], large-scale scientific simulations [15, 12], and deep learning [16, 17], on top of hybrid cloud-HPC environments. Nevertheless, the syntax and semantics used to model and execute distributed workflows are still product-specific, hindering the portability and reusability of both workflow models and orchestration strategies. Hybrid workflow models [5] represent a first step toward a vendor-agnostic way to incorporate topology awareness directly into the workflow definition, and further research efforts are ongoing to distil a formal representation of hybrid workflows, enabling optimisation strategies with theoretical correctness and consistency guarantees [18].
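To make the idea of a topology-aware optimisation strategy concrete, the following minimal sketch places each step on the candidate location that already holds most of its input data, a simple locality-driven heuristic in the spirit of the data locality principle discussed in Section 1. The cost model and all names are illustrative assumptions, not part of any system discussed in this article.

```python
def schedule(step_inputs, data_placement, candidates):
    """Locality-driven placement sketch (hypothetical).

    step_inputs:    step name -> set of input dataset names
    data_placement: dataset name -> location currently holding it
    candidates:     step name -> set of candidate locations (from the topology)
    Returns a plan mapping each step to its chosen location.
    """
    plan = {}
    for step, inputs in step_inputs.items():
        def transfer_cost(loc):
            # Count the input datasets that would have to move to `loc`.
            return sum(1 for d in inputs if data_placement.get(d) != loc)
        # Pick the cheapest candidate; sort first for a deterministic tie-break.
        plan[step] = min(sorted(candidates[step]), key=transfer_cost)
    return plan
```

For instance, a training step whose inputs both reside on an HPC location is placed there rather than on a cloud VM:

```python
plan = schedule(
    {"train": {"raw", "labels"}},
    {"raw": "hpc", "labels": "hpc"},
    {"train": {"hpc", "vm"}},
)
# plan["train"] == "hpc": no input dataset needs to be transferred
```

A real scheduler would weigh transfer volumes, queue times, and energy budgets rather than a flat per-dataset count, but the shape of the decision (minimising a topology-derived cost over the mapping relation) stays the same.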
Another promising research direction involves relying on topology information to improve the overall workflow execution plan, e.g., developing location-aware scheduling algorithms or transparently injecting streaming capabilities into file-based workflows [19].

Acknowledgments

This work has been partially supported by the ACROSS project, “HPC Big Data Artificial Intelligence cross-stack platform toward exascale,” and the EUPEX project, “European Pilot for Exascale,” which have received funding from the EuroHPC JU under grant agreements No. 955648 and 101033975, respectively. Plus, it has been partially supported by the ICSC – Centro Nazionale di Ricerca in High Performance Computing, Big Data and Quantum Computing, funded by the European Union – NextGenerationEU.

References

[1] J. Dean, S. Ghemawat, MapReduce: Simplified data processing on large clusters, in: 6th Symposium on Operating System Design and Implementation (OSDI 2004), USENIX Association, San Francisco, California, USA, 2004, pp. 137–150.
[2] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, M. J. Franklin, S. Shenker, I. Stoica, Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, in: Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2012, San Jose, CA, USA, April 25-27, 2012, USENIX Association, 2012, pp. 15–28.
[3] B. McMahan, E. Moore, D. Ramage, S. Hampson, B. A. y Arcas, Communication-efficient learning of deep networks from decentralized data, in: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, 2017, pp. 1273–1282.
[4] D. A. Reed, D. Gannon, J. J. Dongarra, Reinventing high performance computing: Challenges and opportunities, CoRR abs/2203.02544 (2022). doi:10.48550/arXiv.2203.02544.
[5] I. Colonnelli, Workflow models for heterogeneous distributed systems, Ph.D.
thesis, Università degli Studi di Torino, 2022. doi:10.5281/zenodo.7135483.
[6] I. J. Taylor, M. S. Shields, I. Wang, A. Harrison, The Triana workflow environment: Architecture and applications, in: Workflows for e-Science, Scientific Workflows for Grids, Springer, 2007, pp. 320–339. doi:10.1007/978-1-84628-757-2_20.
[7] T. Fahringer, R. Prodan, R. Duan, J. Hofer, F. Nadeem, F. Nerieri, S. Podlipnig, J. Qin, M. Siddiqui, H. L. Truong, A. Villazón, M. Wieczorek, ASKALON: A development and grid computing environment for scientific workflows, in: Workflows for e-Science, Scientific Workflows for Grids, Springer, 2007, pp. 450–471. doi:10.1007/978-1-84628-757-2_27.
[8] E. Deelman, K. Vahi, M. Rynge, R. Mayani, R. F. da Silva, G. Papadimitriou, M. Livny, The evolution of the Pegasus workflow management software, Computing in Science and Engineering 21 (2019) 22–36. doi:10.1109/MCSE.2019.2919690.
[9] I. Colonnelli, B. Cantalupo, I. Merelli, M. Aldinucci, StreamFlow: cross-breeding cloud with HPC, IEEE Transactions on Emerging Topics in Computing 9 (2021) 1723–1737. doi:10.1109/TETC.2020.3019202.
[10] M. R. Crusoe, S. Abeln, A. Iosup, P. Amstutz, J. Chilton, N. Tijanic, H. Ménager, S. Soiland-Reyes, C. A. Goble, Methods included: Standardizing computational reuse and portability with the Common Workflow Language, Communications of the ACM (2022). doi:10.1145/3486897.
[11] D. D. Sánchez-Gallegos, D. Di Luccio, S. Kosta, J. L. G. Compeán, R. Montella, An efficient pattern-based approach for workflow supporting large-scale science: The DagOnStar experience, Future Generation Computer Systems 122 (2021) 187–203. doi:10.1016/j.future.2021.03.017.
[12] I. Colonnelli, M. Aldinucci, B. Cantalupo, L. Padovani, S. Rabellino, C. Spampinato, R. Morelli, R. Di Carlo, N. Magini, C. Cavazzoni, Distributed workflows with Jupyter, Future Generation Computer Systems 128 (2022) 282–298. doi:10.1016/j.future.2021.10.007.
[13] R. B. Roy, T. Patel, V. Gadepally, D.
Tiwari, Mashup: making serverless computing useful for HPC workflows via hybrid execution, in: PPoPP ’22: 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM, 2022, pp. 46–60. doi:10.1145/3503221.3508407.
[14] A. Mulone, S. Awad, D. Chiarugi, M. Aldinucci, Porting the variant calling pipeline for NGS data in cloud-HPC environment, in: 47th IEEE Annual Computers, Software, and Applications Conference, COMPSAC 2023, IEEE, Torino, Italy, 2023, pp. 1858–1863. doi:10.1109/COMPSAC57700.2023.00288.
[15] R. Montella, D. Di Luccio, S. Kosta, DagOn*: Executing direct acyclic graphs as parallel jobs on anything, in: IEEE/ACM Workshop on Workflows in Support of Large-Scale Science, WORKS@SC 2018, IEEE, 2018, pp. 64–73. doi:10.1109/WORKS.2018.00012.
[16] I. Colonnelli, B. Cantalupo, R. Esposito, M. Pennisi, C. Spampinato, M. Aldinucci, HPC application cloudification: The StreamFlow toolkit, in: 12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and 10th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms, PARMA-DITAM 2021, volume 88 of OASIcs, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, Budapest, Hungary, 2021, pp. 5:1–5:13. doi:10.4230/OASIcs.PARMA-DITAM.2021.5.
[17] I. Colonnelli, B. Casella, G. Mittone, Y. Arfat, B. Cantalupo, R. Esposito, A. R. Martinelli, D. Medić, M. Aldinucci, Federated learning meets HPC and cloud, in: Astrophysics and Space Science Proceedings, volume 60, Springer, Catania, Italy, 2023, pp. 193–199. doi:10.1007/978-3-031-34167-0_39.
[18] D. Medić, M. Aldinucci, Towards formal model for location aware workflows, in: 47th IEEE Annual Computers, Software, and Applications Conference, COMPSAC 2023, IEEE, Torino, Italy, 2023, pp. 1864–1869. doi:10.1109/COMPSAC57700.2023.00289.
[19] A. R. Martinelli, M. Torquati, I. Colonnelli, B. Cantalupo, M.
Aldinucci, CAPIO: a middleware for transparent I/O streaming in data-intensive workflows, in: 30th IEEE International Conference on High Performance Computing, Data, and Analytics, HiPC 2023, IEEE, Goa, India, 2023.