Interactive Workflows for Exploratory Data Analysis

Nourhan Elfaramawy
supervised by Prof. Matthias Weidlich
Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany

Abstract

The analysis of scientific data is often exploratory, meaning that the exact design of a workflow to process data is subject to continuous investigation and redesign. While support for the design of such workflows is manifold, it focuses primarily on reuse, reproducibility, and traceability of analysis results. Yet, it typically relies on static models of workflows that force scientists to wait for completion and restart a workflow repeatedly to explore different design choices. This is inefficient in terms of the invested time and resources. In this PhD project, we strive for support of user interactions in workflow execution. Our proposal is to extend common workflow models with concepts to define interaction points and possible actions, thereby providing users the flexibility to realize diverse interaction primitives, such as forwarding, repetition, and sample-based exploration. We further outline our initial results on realizing a model for interactive workflows.

Keywords

Exploratory Data Analysis, Scientific Workflows, Interactive Workflows, Snakemake

1. Introduction

In domains such as bio-informatics, remote sensing, and materials science, the analysis of large-scale data is a prerequisite for scientific progress [1]. To this end, complex pipelines of operators, also referred to as scientific workflows or data analysis workflows, are designed and executed using infrastructures for distributed computation. However, the respective analysis is typically exploratory, meaning that it emerges from a scientific process in which hypotheses are designed and step-wise confirmed or invalidated. Therefore, workflows used for the analysis are also subject to continuous change.
While the importance of supporting the design and execution of workflows is widely recognized [3, 4], existing models and methods focus on reuse, reproducibility, and traceability of analysis results. Workflow engines such as Kepler [5], Galaxy [6], Pegasus [7], Snakemake [8], or Nextflow [9] offer means to specify workflows from reusable building blocks, provide technical abstractions of compute infrastructures, and include functionality for exchange and collaboration in the workflow design. However, they adopt a static notion of a workflow as shown in fig. 1(a): A user specifies and configures a workflow, which is subsequently executed.

Figure 1: (a) Data analysis based on a traditional workflow that is first defined and then executed; (b) exploratory data analysis that is supported by interactive workflows.

A static workflow model is inherently limited in its support of interactivity for exploratory data analysis, though. The exploration of design choices and possible changes in the analysis, and hence in the workflow, can only be defined at design time, if at all. In practice, therefore, such exploration is restricted to relatively simple scenarios, such as the definition of parameter sweeps of certain operators, see Nimrod [10] for Kepler and Scalarm [11] for Pegasus. The lack of flexibility in the workflow execution has severe implications. Scientists waste their own time as well as resources of a compute infrastructure, as they have to resort to submitting their workflow for execution and waiting for its completion, before repeating it all over again with potentially only minor adjustments.

In this paper, we propose to extend common workflow models with interaction capabilities, thereby providing support for exploratory data analysis by a human-in-the-loop model for workflow execution, see fig. 1(b). By enabling scientists to examine the intermediate data produced by a workflow, and to configure and adjust the workflow based on their observations, design choices can be explored immediately and systematically, using less compute resources compared to the traditional model.

VLDB'22: International Conference on Very Large Databases, September 05–09, 2022, Sydney, Australia
nourhan.elfaramawy@hu-berlin.de (N. Elfaramawy); ORCID 0000-0001-9444-5163 (N. Elfaramawy)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Figure 2: (a) The static PopIns workflow [2] for calling non-reference sequence insertions from many genomes jointly. The workflow includes operators that are executed separately per genome, as well as operators executed for all genomes. (b) A trace of the execution of this workflow once interactions by a user are included: A user explores the results of the contig alignment step in a notebook and decides to repeat the respective step before continuing with the workflow.

In order to realize this objective, we set out to answer the following research questions:

(R1) How to support interactions during workflow execution?
  (i) When and where shall a user be able to interact?
  (ii) What are the actions a user may apply?
(R2) How can this dynamic interactive exploratory model be implemented in common workflow systems and infrastructures?

Below, we first introduce a motivating example (§ 2) and review related work (§ 3). Then, we outline our ideas to answer the above questions. This includes a model for interactive workflows, including notions of interaction points and actions (§ 4). Moreover, we elaborate on our preliminary results in terms of realizing this model in an existing workflow engine (§ 5), before we conclude (§ 6).

2. Motivation

We illustrate the need for interactive workflows for exploratory data analysis with an example from the field of bio-informatics, in particular genomics. Here, recent advances in DNA sequencing led to the wide-spread availability of large volumes of genome sequence data. Specific research questions in this area, for instance, relate to the identification of structural variants (SVs) in genomes. In particular, tools such as PopIns [2] and PopDel [12] have been developed to detect large insertion and deletion variants in whole-genome sequencing (WGS) data of hundreds to tens of thousands of individuals.

A tool such as PopIns actually implements a workflow of multiple operators that are applied to the genome of each individual separately, or that combine data from multiple individuals, as shown in fig. 2(a). In general, this workflow starts with the assemble (AS) operator. It takes a genome (G) as input, i.e., a set of unaligned reads (UnR) from a sequencing dataset of a single individual, and reconstructs a set of contigs, representing candidate sequences of insertions. Then, a merge operator combines the contigs from different genomes, which results in so-called super-contigs. Those are used in the contig alignment (CoA) step, which aligns the unaligned reads to the super-contigs and outputs candidate locations (approximate positions) of the super-contigs in the reference genome. The placement of the reference alignment (PR) step identifies precise insertion positions of the super-contigs in the reference genome.

To summarize, the output of each step in the PopIns workflow is the input of the next one, and the behaviour of the system depends on the input/output quality. Therefore, it is necessary for the scientist to examine the intermediate results and observe the system behaviour at various points during execution. This may result in repeating or skipping some steps. These design choices make genome analysis workflows good examples for exploratory data analysis. Yet, the example also illustrates that, if no ground truth is available, the design of the respective workflows cannot be optimized automatically.

To support such application scenarios effectively, we envision a model of an interactive workflow to enable an execution as sketched in fig. 2(b). The execution starts with the assemble and merge steps, followed by contig alignment. However, we envision the definition of an interaction point, so that the execution of the workflow is paused. Then, a frontend, here denoted as a notebook, shall enable the visualization of the contig alignments and provide descriptive statistics over unaligned reads. Let us now assume that, based on some observations, a user takes the action to repeat the separation of unaligned reads with a different algorithm. However, the user may also update the definition of the interaction point that pauses the workflow after that specific step: It may be assigned a condition based on statistics over the unaligned reads, so that it is triggered only when this condition is met. Afterwards, another interaction point is activated, enabling the user to investigate all intermediate results obtained so far, i.e., contig alignments, unaligned reads, and a genome sample.

These interaction points provide the user with the flexibility to make decisions and take actions on the workflow execution based on intermediate results. This way, a user can incorporate immediate and systematic changes at runtime, which will not only save time and computational resources, but could also act as an early indicator for technical errors during workflow execution.
3. Related Work

Scientific workflows help scientists to manage and organize their data-driven analysis [13]. There are many workflow engines that scientists rely on, such as Kepler [5], Galaxy [6], Pegasus [7], Snakemake [8], and Nextflow [9]. These engines provide ease-of-use through user interfaces, graphical or script-based, and catalogues of standardized data preprocessing techniques. As mentioned above, some engines provide limited support for exploratory analysis, e.g., for parameter sweeps. Also, notably, dynamic control of iterations in workflows based on changes of the processed data by a user was proposed in [14]. However, there is a gap in terms of expressive models to support generic interactivity during runtime, i.e., while a workflow is executed.

Data flow optimization is related, as it targets some of the challenges stemming from workflows used for exploratory analysis. For instance, meta-dataflows (MDFs) were introduced in [15] to improve task scheduling and memory allocation in exploratory analysis. Data access patterns caused by exploratory analysis may also benefit from caching layers, such as Tachyon [16].

Debugging of data processing pipelines received increased interest in the data management community recently [17]. For example, Dagger [18] provides interactivity through debugging primitives in data-driven pipelines. DataExposer [19], in turn, helps to identify properties that can be considered to be root causes of performance degradation, or of system failure due to data. Yet, most of this work focuses on debugging at the data level, rather than the control-flow level.

4. Model of Interactive Workflows

Below, we describe our take on research question (R1), i.e., how to model interactive workflows. We first propose an extension to the common workflow model (§ 4.1), before elaborating on exploration primitives (§ 4.2).

4.1. Interaction Points and Actions

As a starting point, we consider a traditional model of a workflow, see [4]. It defines a workflow as a DAG, where vertices denote operators and edges denote data dependencies between the references to the datasets consumed or produced by operators, also known as input and output ports. A state of such a workflow is then given by a binding of specific files to these input and output ports.

As hinted at already above, the question of how to model interactive workflows can be split into two parts: when and where to interact, and what actions to apply. We therefore propose to extend the traditional workflow model with two concepts, as follows:

Interaction points indicate that the workflow execution shall be paused for a user to explore the current state in terms of the data generated so far, which potentially involves executing some additional analysis to get insightful visualizations or to compute descriptive statistics on the intermediate results. Such an interaction point is given by an edge of the workflow DAG and, potentially, a condition. The latter may refer to the state of the workflow execution (e.g., checking the number of lines in a data file) or meta-data (e.g., checking the execution time of an operator). The semantics of an interaction point (IP) are summarized as follows: Upon completing the execution of the operator that is the source of the respective edge, the workflow engine checks the condition and, if it is true, does not continue execution with the operator that is the target of the edge, but waits for user input.

Actions indicate how the user intends to continue the execution of a workflow once an interaction point is reached. To this end, we consider different types of actions, including:

• Revise interaction points: The set of interaction points defined for the workflow is updated.
• Revise workflow: The structure of the workflow in terms of operators and data dependencies is updated.
• Continuation: Workflow execution continues with the operator following the interaction point.
• Skipping: Workflow execution continues based on the workflow DAG, but skips over the specified operators when doing so, i.e., the output ports of skipped operators denote empty datasets.
• Rewind: Workflow execution continues from an earlier state, which is identified by an operator in the workflow DAG.

Naturally, the actions to revise interaction points and the workflow shall be combined with a continuation, skipping, or rewind action. Moreover, we note that the actions impose certain consistency requirements to enable proper workflow execution, e.g., in terms of reachability of operators in the workflow DAG and the realization of data dependencies.
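The two concepts can be sketched as plain data structures. This is an illustrative encoding with names of our own choosing, not a fixed API of any workflow engine:

```python
# Minimal sketch of the proposed model: a DAG edge, an interaction
# point attached to an edge with an optional condition, and the
# action types listed above. Names are illustrative only.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Optional

class ActionType(Enum):
    REVISE_INTERACTION_POINTS = auto()
    REVISE_WORKFLOW = auto()
    CONTINUATION = auto()
    SKIPPING = auto()
    REWIND = auto()

@dataclass(frozen=True)
class Edge:
    source: str  # operator producing the dataset
    target: str  # operator consuming the dataset

@dataclass
class InteractionPoint:
    edge: Edge
    # Condition over the current execution state (e.g., file statistics
    # or operator meta-data); None means: always pause at this edge.
    condition: Optional[Callable[[dict], bool]] = None

    def triggered(self, state: dict) -> bool:
        return True if self.condition is None else self.condition(state)

# Example: pause between contig alignment (CoA) and placement (PR)
# only if the intermediate data file is non-empty.
ip = InteractionPoint(Edge("CoA", "PR"), condition=lambda s: s["lines"] > 0)
print(ip.triggered({"lines": 42}))  # True
print(ip.triggered({"lines": 0}))   # False
```

Under these semantics, an engine would evaluate `triggered` after the source operator completes and, if it returns true, hand control to the user instead of scheduling the target operator.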
Yet, most of operators when doing so, i.e., the output ports of this work focuses on debugging at the data-level, rather skipped operators denote empty datasets. than the control-flow level. • Rewind: Workflow execution continues from an earlier state, which is identified by an operator in the workflow DAG. 4. Model of Interactive Workflows Naturally, the actions to revise interaction points and the workflow shall be combined with a continuation, skip- Below, we describe our take on research question (R1), i.e., ping, or rewind action. Moreover, we note that the ac- how to model interactive workflows. We first propose an tions impose certain consistency requirements to enable 5. Towards a Realization proper workflow execution, e.g., in terms of reachability of operators in the workflow DAG and the realization of Technical environment: To realize our conceptual data dependencies. model for interactive workflows (§ 4), we choose Snake- make, a state-of-the-art rule-based workflow manage- ment system. Here, a workflow is defined by a set of 4.2. Exploration Primitives rules. Each rule denotes a task or operator and specifies The extension of a model for workflows realized by inter- how to create sets of output files from sets of input files. action points and actions enables us to support various Then, the engine establishes the dependencies between primitives often found in exploratory workflows. Below, the rules by matching file names. In Snakemake, when we outline how this support is achieved for some of these starting a workflow, these rules are used to create a DAG exploration primitives: as the basis for execution. However, this also means that an adaptation of the rules and, hence, the DAG is not Fast-forward: Based on properties of some intermedi- possible after the start of workflow execution. 
ate results, a user may want to fast-forward the However, even though the DAG cannot be altered at workflow execution to save time and compute runtime, Snakemake provides limited support for interac- resources. An example would be a sequence of tivity. That is, a Jupyter notebook [20], a popular Python- operators to implement noise filtering, which may based computational environment, can easily be inte- not be needed if the data variance stays within grated with Snakemake. Such a notebook combines code certain limits. This is enabled by defining an in- snippets, documentation, as well as plots into a single teraction point to decide on fast-forwarding (e.g., document. The integration in Snakemake is realized via to compute the variance), which may then be dedicated rules, which, once executed, start a notebook realized through a skipping action. for a user to work with. The workflow only continues execution according to the constructed DAG once the Repetition: A user may want to repeat a certain step, user closes the respective notebook. or a set thereof, before advancing with the work- Preliminary results: To test the feasibility of our flow execution, e.g., to fine the configuration of ideas, we implemented parts of our proposed model for operators. Support for such repetition is limited interactive workflows in Snakemake using the notebook in common workflow management systems. The integration, and applied it to the PopIns workflow men- reason being that most of them adopt an exe- tioned earlier (§ 2). Specifically, we added interaction cution model based on a DAG, which prevents points in the workflow by rules that start a notebook, the definition of a cycles in the workflow struc- as sketched in fig. 2(b). These notebooks then enable ture. Using the concepts envisioned for interac- access to the intermediate results (e.g., unaligned reads tive workflows, we support repetitions by defin- and contigs in our example). 
Moreover, we realized the ing an interaction point at which a user may de- aforementioned rewind and continuation actions, which, cide on a rewind action or a continuation action. once triggered by a user in the notebook, enable the repe- Sample-based exploration: Another pattern in ex- tition of particular steps in the workflow. Since the DAG ploratory analysis is that a user wants to test their constructed by Snakemake is immutable, our solution for workflow on a subset of data, before applying it the rewind action is based on an abortion of the current to the complete dataset in order to save time and workflow instance and the creation of a new instance, compute resources. An example would be the cal- while the control-flow in the new instance is guided by ibration of some data transformations by fitting a the automated creation and deletion of dedicated files. statistical model. Here, the fit of various models Applied in the context of the PopIns workflow, our may first be explored using a sample of the data. prototype, despite its limited support of the envisioned Based thereon, the model is adopted to transform model, highlights the benefits of offering interactivity the whole dataset, or the workflow is altered to in workflow execution. Users can explore and examine incorporate a different transformation. This is intermediate results at runtime, and realize common ex- realized by defining an interaction point after the ploration primitives directly, rather than being delayed sample was processed, with the possible actions by the need to wait for workflow completion. being to rewind and re-execute the workflow with the complete dataset, or to revise the workflow 6. Conclusions by replacing the transformation operator. In this work, we outlined the need to support interactiv- ity in workflows for scientific data. 
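The file-based rewind mechanism can be illustrated in plain Python. The sketch below assumes that deleting an operator's output files causes the engine of a fresh workflow instance to recompute them (Snakemake re-runs rules whose outputs are missing); the function names and the DAG are ours, not part of Snakemake's API:

```python
# Sketch of the rewind mechanism: to rewind to an operator, delete its
# output files and those of all operators downstream of it, then start
# a new workflow instance that recomputes exactly the missing files.
from pathlib import Path

def downstream(dag: dict[str, list[str]], start: str) -> set[str]:
    """All operators reachable from `start` in the DAG, including `start`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(dag.get(node, []))
    return seen

def rewind(dag: dict[str, list[str]], outputs: dict[str, Path],
           target: str) -> list[str]:
    """Delete the outputs of `target` and of everything downstream,
    so that the next workflow instance re-executes those operators."""
    affected = sorted(downstream(dag, target))
    for op in affected:
        outputs[op].unlink(missing_ok=True)  # output may not exist yet
    return affected

# Example DAG of the PopIns-style workflow: AS -> merge -> CoA -> PR.
dag = {"AS": ["merge"], "merge": ["CoA"], "CoA": ["PR"], "PR": []}
print(sorted(downstream(dag, "CoA")))  # ['CoA', 'PR']
```

Rewinding to the contig alignment step thus leaves the assemble and merge outputs untouched, so the new instance resumes from the cached results rather than recomputing the whole workflow.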
6. Conclusions

In this work, we outlined the need to support interactivity in workflows for scientific data. To address this need, we outlined a model for interactive workflows, which is based on interaction points and actions. We further reported on our preliminary results of realizing this model in Snakemake, a rule-based workflow engine, and its integration with Jupyter notebooks. Specifically, we applied the implementation to the PopIns workflow for structural variant calling in genomics.

Having a first version of a model for interactive workflows, our research plan involves the following phases: First, we intend to study the realization of further exploration primitives using interaction points and actions. This way, we also seek to understand whether our model shall incorporate more expressive actions. Second, we aim to ensure that our implementation is supported not only in the stand-alone execution mode of Snakemake, but can also be employed for cluster-based execution. Third, our goal is to provide implementation strategies for our model of interactive workflows for other workflow engines, such as Nextflow.

Acknowledgments

This work is funded by the German Research Foundation (DFG), Project-ID 414984028, SFB 1404 FONDA.

References

[1] U. Leser, M. Hilbrich, C. Draxl, P. Eisert, L. Grunske, P. Hostert, D. Kainmüller, O. Kao, B. Kehr, T. Kehrer, C. Koch, V. Markl, H. Meyerhenke, T. Rabl, A. Reinefeld, K. Reinert, K. Ritter, B. Scheuermann, F. Schintke, N. Schweikardt, M. Weidlich, The Collaborative Research Center FONDA, Datenbank-Spektrum (2021). doi:10.1007/s13222-021-00397-5.
[2] B. Kehr, P. Melsted, B. V. Halldórsson, PopIns: population-scale detection of novel sequence insertions, Bioinform. 32 (2016) 961–967.
[3] A. Barker, J. van Hemert, Scientific workflow: A survey and research directions, Springer Berlin Heidelberg, Berlin, Heidelberg, 2008, pp. 746–753.
[4] Workflows and e-science: An overview of workflow system features and capabilities, Future Generation Computer Systems 25 (2009) 528–540. doi:10.1016/j.future.2008.06.012.
[5] D. Barseghian, I. Altintas, M. B. Jones, D. Crawl, N. Potter, J. Gallagher, P. Cornillon, M. Schildhauer, E. T. Borer, E. W. Seabloom, P. R. Hosseini, Workflows and extensions to the Kepler scientific workflow system to support environmental sensor data access and analysis, Ecological Informatics 5 (2010).
[6] D. Blankenberg, G. V. Kuster, N. Coraor, G. Ananda, R. Lazarus, M. Mangan, A. Nekrutenko, J. Taylor, Galaxy: A web-based genome analysis tool for experimentalists, Current Protocols in Molecular Biology 89 (2010).
[7] E. Deelman, K. Vahi, G. Juve, M. Rynge, S. Callaghan, P. J. Maechling, R. Mayani, W. Chen, R. Ferreira da Silva, M. Livny, K. Wenger, Pegasus, a workflow management system for science automation, Future Generation Computer Systems 46 (2015).
[8] J. Köster, S. Rahmann, Snakemake - a scalable bioinformatics workflow engine, Bioinform. 34 (2018) 3600.
[9] P. D. Tommaso, E. W. Floden, C. Magis, E. Palumbo, C. Notredame, Nextflow, an efficient tool to improve computation numerical stability in genomic analysis, 211 (2017). doi:10.1051/jbio/2017029.
[10] D. Abramson, B. Bethwaite, C. Enticott, S. Garic, T. Peachey, Parameter space exploration using scientific workflows, in: G. Allen, J. Nabrzyski, E. Seidel, G. D. van Albada, J. Dongarra, P. M. A. Sloot (Eds.), Computational Science – ICCS 2009, Springer Berlin Heidelberg, 2009.
[11] D. Król, J. Kitowski, R. F. da Silva, G. Juve, K. Vahi, M. Rynge, E. Deelman, Science automation in practice: Performance data farming in workflows, in: 2016 IEEE 21st International Conference on Emerging Technologies and Factory Automation (ETFA), 2016.
[12] S. Niehus, H. Jónsson, J. Schönberger, E. Björnsson, D. Beyter, H. P. Eggertsson, P. Sulem, K. Stefánsson, B. V. Halldórsson, B. Kehr, PopDel identifies medium-size deletions jointly in tens of thousands of genomes, bioRxiv (2020).
[13] C. S. Liew, M. P. Atkinson, M. Galea, T. F. Ang, P. Martin, J. I. V. Hemert, Scientific workflows: Moving across paradigms, ACM Comput. Surv. 49 (2016).
[14] J. Dias, G. Guerra, F. Rochinha, A. L. Coutinho, P. Valduriez, M. Mattoso, Data-centric iteration in dynamic workflows, Future Gener. Comput. Syst. 46 (2015) 114–126.
[15] R. Fernandez, W. Culhane, P. Watcharapichat, M. Weidlich, V. Morales, P. Pietzuch, Meta-dataflows: Efficient exploratory dataflow jobs, 2018, pp. 1157–1172.
[16] H. Li, A. Ghodsi, M. Zaharia, S. Shenker, I. Stoica, Tachyon: Reliable, memory speed storage for cluster computing frameworks, SOCC '14, Association for Computing Machinery, 2014, pp. 1–15. doi:10.1145/2670979.2670985.
[17] M. Brachmann, W. Spoth, Y. Yang, C. Bautista, S. Castelo, S. Feng, J. Freire, B. Glavic, O. Kennedy, H. Müeller, R. Rampin, Data debugging and exploration with Vizier, 2019, pp. 1877–1880.
[18] Y. Yang, M. Youill, M. E. Woicik, Y. Liu, X. Yu, M. Serafini, A. Aboulnaga, M. Stonebraker, FlexPushdownDB: Hybrid pushdown and caching in a cloud DBMS, Proc. VLDB Endow. 14 (2021) 2101–2113.
[19] S. Galhotra, A. Fariha, R. Lourenço, J. Freire, A. Meliou, D. Srivastava, DataExposer: Exposing disconnect between data and systems, CoRR abs/2105.06058 (2021).
[20] T. Kluyver, B. Ragan-Kelley, F. Pérez, B. Granger, M. Bussonnier, J. Frederic, K. Kelley, J. Hamrick, J. Grout, S. Corlay, P. Ivanov, D. Avila, S. Abdalla, C. Willing, Jupyter notebooks – a publishing format for reproducible computational workflows, in: F. Loizides, B. Schmidt (Eds.), Positioning and Power in Academic Publishing: Players, Agents and Agendas, IOS Press, 2016, pp. 87–90.