<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>PhD Workshop, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Interactive Workflows for Exploratory Data Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nourhan Elfaramawy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>supervised by Prof. Matthias Weidlich, Humboldt-Universität zu Berlin</institution>
          ,
          <addr-line>Unter den Linden 6, 10099 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>5</volume>
      <issue>2022</issue>
      <fpage>05</fpage>
      <lpage>09</lpage>
      <abstract>
<p>The analysis of scientific data is often exploratory, meaning that the exact design of a workflow to process data is subject to continuous investigation and redesign. While support for the design of such workflows is manifold, it focuses primarily on reuse, reproducibility, and traceability of analysis results. Yet, it typically relies on static models of workflows that force scientists to wait for completion and restart a workflow repeatedly to explore different design choices. This is inefficient in terms of the invested time and resources. In this PhD project, we strive for support of user interactions in workflow execution. Our proposal is to extend common workflow models with concepts to define interaction points and possible actions, thereby providing users the flexibility to realize diverse interaction primitives, such as forwarding, repetition, and sample-based exploration. We further outline our initial results on realizing a model for interactive workflows.</p>
      </abstract>
      <kwd-group>
<kwd>Exploratory Data Analysis</kwd>
        <kwd>Scientific Workflows</kwd>
        <kwd>Interactive Workflows</kwd>
        <kwd>Snakemake</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>In domains such as bio-informatics, remote sensing, and materials science, the analysis of large-scale data is a prerequisite for scientific progress [1]. To this end, complex pipelines of operators, also referred to as scientific workflows or data analysis workflows, are designed and executed using infrastructures for distributed computation.</p>
        <p>However, the respective analysis is typically exploratory,
meaning that it emerges from a scientific process, in
which hypotheses are designed and step-wise confirmed
or invalidated. Therefore, workflows used for the
analysis are also subject to continuous change.</p>
        <p>While the importance of supporting the design and execution of workflows is widely recognized [3, 4], existing models and methods focus on reuse, reproducibility, and traceability of analysis results. Workflow engines such as Kepler [5], Galaxy [6], Pegasus [7], Snakemake [8], or Nextflow [9] offer means to specify workflows from reusable building blocks, provide technical abstractions of compute infrastructures, and include functionality for exchange and collaboration in the workflow design. However, they adopt a static notion of a workflow as shown in fig. 1(a): A user specifies and configures a workflow, which is subsequently executed.</p>
        <p>Figure 1: (a) Data analysis based on a traditional workflow that is first defined and then executed; (b) exploratory data analysis that is supported by interactive workflows.</p>
        <p>A static workflow model is inherently limited in its support of interactivity for exploratory data analysis, though. The exploration of design choices and possible changes in the analysis, and hence in the workflow, can only be defined at design time, if at all. In practice, therefore, such exploration is restricted to relatively simple scenarios, such as the definition of parameter sweeps of certain operators, see Nimrod [10] for Kepler and Scalarm [11] for Pegasus. The lack of flexibility in the workflow execution has severe implications. Scientists waste their own time as well as resources of a compute infrastructure, as they have to resort to submitting their workflow for execution and waiting for its completion, before repeating it all over again with potentially only minor adjustments.</p>
        <p>In this paper, we propose to extend common workflow models with interaction capabilities, thereby providing support for exploratory data analysis by a human-in-the-loop model for workflow execution, see fig. 1(b). By enabling scientists to examine the intermediate data produced by a workflow, and to configure and adjust the workflow based on their observations, design choices can be explored immediately and systematically, using less compute resources compared to the traditional model.</p>
        <p>In order to realize this objective, we set out to answer the following research questions:
(R1) How to support interactions during workflow execution?
(i) When and where shall a user be able to interact?
(ii) What are the actions a user may apply?
(R2) How can this dynamic interactive exploratory model be implemented in common workflow systems and infrastructures?</p>
        <p>Below, we first introduce a motivating example (§ 2) and review related work (§ 3). Then, we outline our ideas to answer the above questions. This includes a model for interactive workflows including notions of interaction points and actions (§ 4). Moreover, we elaborate on our preliminary results in terms of realizing this model in an existing workflow engine (§ 5), before we conclude (§ 6).</p>
        <p>2. Motivation</p>
        <p>We illustrate the need for interactive workflows for exploratory data analysis with an example from the field of bio-informatics, in particular genomics. Here, recent advances in DNA sequencing led to the wide-spread availability of large volumes of genome sequence data. Specific research questions in this area, for instance, relate to the identification of structural variants (SVs) in genomes. In particular, tools such as PopIns [2] and PopDel [12] have been developed to detect large insertion and deletion variants in whole-genome sequencing (WGS) data of hundreds to tens of thousands of individuals.</p>
        <p>A tool such as PopIns actually implements a workflow of multiple operators that are applied to the genome of each individual separately, or that combine data from multiple individuals, as shown in fig. 2(a). In general, this workflow starts with the assemble (AS) operator that takes a genome (G) as input and reconstructs contigs from its unaligned reads (UnR), i.e., a set of unaligned reads from a sequencing dataset of a single individual; the contigs represent candidate sequences of insertions. Then, a merge operator combines the contigs from different genomes, which results in so-called super-contigs. Those are used in the contig alignment (CoA) step, which aligns the unaligned reads to the supercontigs and outputs candidate locations (approximate positions) of the supercontigs in the reference genome. The placement of the reference alignment (PR) step identifies precise insertion positions of the supercontigs in the reference genome.</p>
        <p>To summarize, the output of each step in the PopIns workflow is the input of the next one, and the behaviour of the system depends on the input/output quality. Therefore, it is necessary for the scientist to examine the intermediate results and observe the system behaviour at various points during execution. This may result in repeating or skipping some steps. These design choices make genome analysis workflows good examples for exploratory data analysis. Yet, the example also illustrates that, if no ground truth is available, the design of the respective workflows cannot be optimized automatically.</p>
        <p>To support such application scenarios effectively, we envision a model of an interactive workflow to enable an execution as sketched in fig. 2(b). The execution starts with the assemble and merge steps, followed by contig alignment. However, we envision the definition of an interaction point, so that the execution of the workflow is paused. Then, a frontend, here denoted as a notebook, shall enable the visualization of the contig alignments and provide descriptive statistics over unaligned reads. Let us now assume that, based on some observations, a user takes the action to repeat the separation of unaligned reads with a different algorithm. However, the user may also update the definition of the interaction point that may pause the workflow after that specific step: It may be assigned a condition based on statistics over the unaligned reads, so that it is triggered only when this condition is met. Afterwards, another interaction point is activated, enabling the user to investigate all intermediate results obtained so far, i.e., contig alignments, unaligned reads and a genome sample.</p>
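        <p>To make the data flow concrete, the chain of steps described above can be sketched as plain Python functions in which each step consumes the previous step's output; the function names and toy data are illustrative only and do not reflect PopIns' actual implementation.</p>

```python
# Toy sketch of a PopIns-style chain: each step's output is the next step's input.

def assemble(genome):
    # Reconstruct candidate contigs from the unaligned reads of one individual.
    return ["contig_%s_%d" % (genome, i) for i in range(2)]

def merge(contig_sets):
    # Combine contigs from different genomes into super-contigs.
    return ["super_" + c for contigs in contig_sets for c in contigs]

def contig_alignment(supercontigs):
    # Align unaligned reads to the super-contigs, yielding candidate locations.
    return {sc: "approx_pos_%d" % i for i, sc in enumerate(supercontigs)}

def placement(candidates):
    # Refine candidate locations into precise insertion positions.
    return {sc: pos.replace("approx", "precise") for sc, pos in candidates.items()}

contig_sets = [assemble(g) for g in ["G1", "G2"]]
positions = placement(contig_alignment(merge(contig_sets)))
```

        <p>Pausing after the contig alignment to inspect the candidate locations, as envisioned above, would amount to inserting an interaction point between the last two calls.</p>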
        <p>These interaction points provide the user with the flexibility to make decisions and take actions on the workflow execution based on intermediate results. This way, a user can incorporate immediate and systematic changes at runtime, which will not only save time and computational resources, but could also act as an early indicator for technical errors during workflow execution.</p>
      </sec>
    </sec>
    <sec id="sec-rw">
      <title>3. Related Work</title>
      <p>Scientific workflows help scientists to manage and organize their data-driven analysis [13]. There are many workflow engines that scientists rely on, such as Kepler [5], Galaxy [6], Pegasus [7], Snakemake [8], and Nextflow [9]. These engines provide ease-of-use through user interfaces, graphical or script-based, and catalogues of standardized data preprocessing techniques. As mentioned above, some engines provide limited support for exploratory analysis, e.g., for parameter sweeps. Also, notably, dynamic control of iterations in workflows based on changes of the processed data by a user was proposed in [14]. However, there is a gap in terms of expressive models to support generic interactivity during runtime, i.e., while a workflow is executed.</p>
      <p>Data flow optimization is related, as it targets some of the challenges stemming from workflows used for exploratory analysis. For instance, meta-dataflows (MDFs) were introduced in [15] to improve task scheduling and memory allocation in exploratory analysis. Data access patterns caused by exploratory analysis may also benefit from caching layers, such as Tachyon [16].</p>
      <p>Debugging of data processing pipelines has recently received increased interest in the data management community [17]. For example, Dagger [18] provides interactivity through debugging primitives in data-driven pipelines. DataExposer [19], in turn, helps to identify properties that can be considered root causes of performance degradation, or of system failure due to data. Yet, most of this work focuses on debugging at the data level, rather than the control-flow level.</p>
    </sec>
    <sec id="sec-2">
      <title>4. Model of Interactive Workflows</title>
      <p>Below, we describe our take on research question (R1), i.e., how to model interactive workflows. We first propose an extension to the common workflow model (§ 4.1), before elaborating on exploration primitives (§ 4.2).</p>
      <sec id="sec-2-1">
        <title>4.1. Interaction Points and Actions</title>
        <p>As a starting point, we consider a traditional model of a workflow, see [4]. It defines a workflow as a DAG, where vertices denote operators and edges denote data dependencies between the references to the datasets consumed or produced by operators, also known as input and output ports. A state of such a workflow is then given by a binding of specific files to these input and output ports.</p>
        <p>As hinted at already above, the question of how to model interactive workflows can be split into two parts: when and where to interact; and what actions to apply. We therefore propose to extend the traditional workflow model with two concepts, as follows:</p>
        <p>Interaction points indicate that the workflow execution shall be paused for a user to explore the current state in terms of the data generated so far, which potentially involves executing some additional analysis to get insightful visualizations or to compute descriptive statistics on the intermediate results. Such an interaction point is given by an edge of the workflow DAG and, potentially, a condition. The latter may refer to a state of workflow execution (e.g., checking the number of lines in a data file) or meta-data (e.g., checking the execution time of an operator). The semantics of an interaction point (IP) are summarized as follows: Upon completing the execution of the operator that is the source of the respective edge, the workflow engine checks the condition and, if it is true, does not continue execution with the operator that is the target of the edge, but waits for user input.</p>
        <p>Actions indicate how the user intends to continue the execution of a workflow once an interaction point is reached. To this end, we consider different types of actions, including:
• Revise interaction points: The set of interaction points defined for the workflow is updated.
• Revise workflow: The structure of the workflow in terms of operators and data dependencies is updated.
• Continuation: Workflow execution continues with the operator following the interaction point.
• Skipping: Workflow execution continues based on the workflow DAG, but skips over the specified operators when doing so, i.e., the output ports of skipped operators denote empty datasets.
• Rewind: Workflow execution continues from an earlier state, which is identified by an operator in the workflow DAG.</p>
        <p>Naturally, the actions to revise interaction points and the workflow shall be combined with a continuation, skipping, or rewind action. Moreover, we note that the actions impose certain consistency requirements to enable proper workflow execution, e.g., in terms of reachability of operators in the workflow DAG and the realization of data dependencies.</p>
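        <p>A minimal executable sketch of these semantics may help to fix ideas; the class and method names below are our own illustration, not a fixed API. It models operators as functions over a state that binds data to ports, interaction points as edges with conditions, and skipping and continuation as operations on the paused workflow:</p>

```python
# Minimal sketch of an interactive workflow: a DAG whose edges may carry
# interaction points, i.e., conditions under which execution pauses.

class InteractiveWorkflow:
    def __init__(self, operators):
        self.operators = operators        # operator name -> callable(state) -> state
        self.interaction_points = {}      # edge (src, dst) -> condition(state) -> bool
        self.paused_at = None             # edge at which execution is paused, if any

    def add_interaction_point(self, edge, condition=lambda state: True):
        self.interaction_points[edge] = condition

    def skip(self, op):
        # Skipping action: the skipped operator's output port denotes an empty dataset.
        self.operators[op] = lambda state, _op=op: dict(state, **{_op: []})

    def run(self, order, state):
        # Execute operators in topological order; after an operator completes,
        # check the interaction point on the edge to its successor and, if the
        # condition holds, pause and wait for user input.
        for i, op in enumerate(order):
            state = self.operators[op](state)
            nxt = order[i + 1] if i + 1 != len(order) else None
            cond = self.interaction_points.get((op, nxt))
            if cond is not None and cond(state):
                self.paused_at = (op, nxt)
                return state
        self.paused_at = None
        return state

    def resume(self, order, state):
        # Continuation action: continue with the operator following the interaction point.
        if self.paused_at is None:
            return state
        dst = self.paused_at[1]
        self.paused_at = None
        return self.run(order[order.index(dst):], state)
```

        <p>A rewind action would, analogously, re-enter run at an earlier operator of the order; revising the workflow or its interaction points amounts to updating the respective dictionaries before resuming.</p>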
      </sec>
      <sec id="sec-2-2">
        <title>4.2. Exploration Primitives</title>
        <p>The extension of a model for workflows realized by interaction points and actions enables us to support various primitives often found in exploratory workflows. Below, we outline how this support is achieved for some of these exploration primitives:</p>
        <p>Fast-forward: Based on properties of some intermediate results, a user may want to fast-forward the workflow execution to save time and compute resources. An example would be a sequence of operators to implement noise filtering, which may not be needed if the data variance stays within certain limits. This is enabled by defining an interaction point to decide on fast-forwarding (e.g., to compute the variance), which may then be realized through a skipping action.</p>
        <p>Repetition: A user may want to repeat a certain step, or a set thereof, before advancing with the workflow execution, e.g., to fine-tune the configuration of operators. Support for such repetition is limited in common workflow management systems, the reason being that most of them adopt an execution model based on a DAG, which prevents the definition of cycles in the workflow structure. Using the concepts envisioned for interactive workflows, we support repetitions by defining an interaction point at which a user may decide on a rewind action or a continuation action.</p>
        <p>Sample-based exploration: Another pattern in exploratory analysis is that a user wants to test their workflow on a subset of the data, before applying it to the complete dataset, in order to save time and compute resources. An example would be the calibration of some data transformations by fitting a statistical model. Here, the fit of various models may first be explored using a sample of the data. Based thereon, the model is adopted to transform the whole dataset, or the workflow is altered to incorporate a different transformation. This is realized by defining an interaction point after the sample was processed, with the possible actions being to rewind and re-execute the workflow with the complete dataset, or to revise the workflow by replacing the transformation operator.</p>
        <p>Technical environment: To realize our conceptual model for interactive workflows (§ 4), we choose Snakemake, a state-of-the-art rule-based workflow management system. Here, a workflow is defined by a set of rules. Each rule denotes a task or operator and specifies how to create sets of output files from sets of input files. Then, the engine establishes the dependencies between the rules by matching file names. In Snakemake, when starting a workflow, these rules are used to create a DAG as the basis for execution. However, this also means that an adaptation of the rules and, hence, of the DAG is not possible after the start of workflow execution.</p>
        <p>However, even though the DAG cannot be altered at runtime, Snakemake provides limited support for interactivity. That is, a Jupyter notebook [20], a popular Python-based computational environment, can easily be integrated with Snakemake. Such a notebook combines code snippets, documentation, as well as plots into a single document. The integration in Snakemake is realized via dedicated rules, which, once executed, start a notebook for a user to work with. The workflow only continues execution according to the constructed DAG once the user closes the respective notebook.</p>
        <p>Preliminary results: To test the feasibility of our ideas, we implemented parts of our proposed model for interactive workflows in Snakemake using the notebook integration, and applied it to the PopIns workflow mentioned earlier (§ 2). Specifically, we added interaction points in the workflow by rules that start a notebook, as sketched in fig. 2(b). These notebooks then enable access to the intermediate results (e.g., unaligned reads and contigs in our example). Moreover, we realized the aforementioned rewind and continuation actions, which, once triggered by a user in the notebook, enable the repetition of particular steps in the workflow. Since the DAG constructed by Snakemake is immutable, our solution for the rewind action is based on aborting the current workflow instance and creating a new instance, while the control-flow in the new instance is guided by the automated creation and deletion of dedicated files.</p>
        <p>Applied in the context of the PopIns workflow, our prototype, despite its limited support of the envisioned model, highlights the benefits of offering interactivity in workflow execution. Users can explore and examine intermediate results at runtime, and realize common exploration primitives directly, rather than being delayed by the need to wait for workflow completion.</p>
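        <p>As an illustration of this integration, an interaction point can be expressed as a Snakemake rule with a notebook directive. The following Snakefile fragment is a sketch with hypothetical rule, file, and command names, not the exact rules of our prototype:</p>

```python
# Sketch of a Snakefile fragment (hypothetical rule and file names): the
# inspect_alignments rule realizes an interaction point after contig alignment
# by opening a Jupyter notebook; the workflow proceeds only once the user
# closes the notebook (e.g., when run via: snakemake --edit-notebook ...).
rule contig_alignment:
    input:
        "results/supercontigs.fa",
        "results/unaligned_reads.fa"
    output:
        "results/contig_alignments.bam"
    shell:
        "popins_align {input} {output}"  # placeholder command

rule inspect_alignments:
    input:
        "results/contig_alignments.bam"
    output:
        touch("results/alignments_inspected.flag")
    notebook:
        "notebooks/inspect_alignments.py.ipynb"
```

        <p>Here, the flag file produced by the notebook rule is what downstream rules would depend on, so that the DAG constructed by Snakemake enforces the pause.</p>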
      </sec>
      <sec id="sec-2-3">
        <title>6. Conclusions</title>
        <p>In this work, we outlined the need to support interactivity in workflows for scientific data. To address this need, we outlined a model for interactive workflows, which is based on interaction points and actions. We further reported on our preliminary results of realizing this model in Snakemake, a rule-based workflow engine, and its integration with Jupyter notebooks. Specifically, we adopted the implementation in the PopIns workflow for structural variant calling in genomics.</p>
        <p>Having a first version of a model for interactive workflows, our research plan involves the following phases: First, we intend to study the realization of further exploration primitives using interaction points and actions. This way, we also seek to understand whether our model shall incorporate more expressive actions. Second, we aim to ensure that our implementation is supported not only in the stand-alone execution mode of Snakemake, but can also be employed for cluster-based execution. Third, our goal is to provide implementation strategies for our model of interactive workflows for other workflow engines, such as Nextflow.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <p>This work is funded by the German Research Foundation (DFG), Project-ID 414984028, SFB 1404 FONDA.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref-18"><label>[18]</label><mixed-citation>Y. Yang, M. Youill, M. E. Woicik, Y. Liu, X. Yu, M. Serafini, A. Aboulnaga, M. Stonebraker, FlexPushdownDB: Hybrid pushdown and caching in a cloud DBMS, Proc. VLDB Endow. 14 (2021) 2101–2113.</mixed-citation></ref>
      <ref id="ref-19"><label>[19]</label><mixed-citation>S. Galhotra, A. Fariha, R. Lourenço, J. Freire, A. Meliou, D. Srivastava, DataExposer: Exposing disconnect between data and systems, CoRR abs/2105.06058 (2021).</mixed-citation></ref>
      <ref id="ref-20"><label>[20]</label><mixed-citation>T. Kluyver, B. Ragan-Kelley, F. Pérez, B. Granger, M. Bussonnier, J. Frederic, K. Kelley, J. Hamrick, J. Grout, S. Corlay, P. Ivanov, D. Avila, S. Abdalla, C. Willing, Jupyter notebooks – a publishing format for reproducible computational workflows, in: F. Loizides, B. Schmidt (Eds.), Positioning and Power in Academic Publishing: Players, Agents and Agendas, IOS Press, 2016, pp. 87–90.</mixed-citation></ref>
    </ref-list>
  </back>
</article>