Interactive Workflows for Exploratory Data Analysis
Nourhan Elfaramawy
supervised by Prof. Matthias Weidlich,
Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany


                                          Abstract
                                          The analysis of scientific data is often exploratory, meaning that the exact design of a workflow to process data is subject
                                          to continuous investigation and redesign. While support for the design of such workflows is manifold, it focuses primarily
                                          on reuse, reproducibility, and traceability of analysis results. Yet, it typically relies on static models of workflows that force
                                          scientists to wait for completion and restart a workflow repeatedly to explore different design choices. This is inefficient in
                                          terms of the invested time and resources.
                                              In this PhD project, we strive for support of user interactions in workflow execution. Our proposal is to extend common
                                          workflow models with concepts to define interaction points and possible actions, thereby providing users the flexibility to
                                          realize diverse interaction primitives, such as forwarding, repetition, and sample-based exploration. We further outline our
                                          initial results on realizing a model for interactive workflows.

                                           Keywords
                                           Exploratory Data Analysis, Scientific Workflows, Interactive Workflows, Snakemake



1. Introduction
In domains such as bio-informatics, remote sensing, and
materials science, the analysis of large-scale data is a
prerequisite for scientific progress [1]. To this end, com-
plex pipelines of operators, also referred to as scientific
workflows or data analysis workflows, are designed and ex-
ecuted using infrastructures for distributed computation.
However, the respective analysis is typically exploratory,
meaning that it emerges from a scientific process, in
which hypotheses are designed and step-wise confirmed
or invalidated. Therefore, workflows used for the analy-
sis are also subject to continuous change.
   While the importance of supporting the design and execution of workflows is widely recognized [3, 4], existing models and methods focus on reuse, reproducibility, and traceability of analysis results. Workflow engines such as Kepler [5], Galaxy [6], Pegasus [7], Snakemake [8], or Nextflow [9] offer means to specify workflows from reusable building blocks, provide technical abstractions of compute infrastructures, and include functionality for exchange and collaboration in the workflow design. However, they adopt a static notion of a workflow as shown in fig. 1(a): A user specifies and configures a workflow, which is subsequently executed.

Figure 1: (a) Data analysis based on a traditional workflow that is first defined and then executed; (b) exploratory data analysis that is supported by interactive workflows.

   A static workflow model is inherently limited in its support of interactivity for exploratory data analysis, though. The exploration of design choices and possible changes in the analysis, and hence in the workflow, can only be defined at design time, if at all. In practice, therefore, such exploration is restricted to relatively simple scenarios, such as the definition of parameter sweeps of certain operators, see Nimrod [10] for Kepler and Scalarm [11] for Pegasus. The lack of flexibility in the workflow execution has severe implications. Scientists waste their own time as well as resources of a compute infrastructure as they have to resort to submitting their workflow for execution and waiting for its completion, before repeating it all over again with potentially only minor adjustments.

   In this paper, we propose to extend common workflow models with interaction capabilities, thereby providing support for exploratory data analysis by a human-in-the-loop model for workflow execution, see fig. 1(b). By enabling scientists to examine the intermediate data produced by a workflow, and to configure and adjust the workflow based on their observations, design choices can be explored immediately and systematically, using less compute resources compared to the traditional model.

VLDB’22: International Conference on Very Large Databases, September 05–09, 2022, Sydney, Australia
Contact: nourhan.elfaramawy@hu-berlin.de (N. Elfaramawy), ORCID 0000-0001-9444-5163
Figure 2: (a) The static PopIns workflow [2] for calling non-reference sequence insertions from many genomes jointly. The
workflow includes operators that are executed separately per genome, as well as operators executed for all genomes. (b) A
trace of the execution of this workflow once interactions by a user are included: A user explores the results of the contig
alignment step in a notebook and decides to repeat the respective step before continuing with the workflow.


   In order to realize this objective, we set out to answer the following research questions:

(R1) How to support interactions during workflow execution?
       (i) When and where shall a user be able to interact?
      (ii) What are the actions a user may apply?
(R2) How can this dynamic interactive exploratory model be implemented in common workflow systems and infrastructures?

Below, we first introduce a motivating example (§ 2) and review related work (§ 3). Then, we outline our ideas to answer the above questions. This includes a model for interactive workflows including notions of interaction points and actions (§ 4). Moreover, we elaborate on our preliminary results in terms of realizing this model in an existing workflow engine (§ 5), before we conclude (§ 6).


2. Motivation

We illustrate the need for interactive workflows for exploratory data analysis with an example from the field of bio-informatics, in particular genomics. Here, recent advances in DNA sequencing led to the wide-spread availability of large volumes of genome sequence data. Specific research questions in this area, for instance, relate to the identification of structural variants (SVs) in genomes. In particular, tools such as PopIns [2] and PopDel [12] have been developed to detect large insertion and deletion variants in whole-genome sequencing (WGS) data of hundreds to tens of thousands of individuals.

   A tool such as PopIns actually implements a workflow of multiple operators that are applied to the genome of each individual separately, or that combine data from multiple individuals, as shown in fig. 2(a). In general, this workflow starts with the assemble (AS) operator, which takes a genome (G) as input, extracts the unaligned reads (UnR) of the sequencing dataset of a single individual, and reconstructs from them a set of contigs representing candidate sequences of insertions. Then, a merge operator combines the contigs from different genomes, which results in so-called supercontigs. Those are used in the contig alignment (CoA) step, which aligns the unaligned reads to the supercontigs and outputs candidate locations (approximate positions) of the supercontigs in the reference genome. Finally, the placement of the reference alignment (PR) step identifies precise insertion positions of the supercontigs in the reference genome.
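   To make the structure of this static workflow concrete, the following is a minimal sketch of how it could be written down as Snakemake-style rules (the engine we later build on, see § 5). The file names, the list of genomes, and the shell commands are illustrative placeholders and do not reproduce the actual PopIns command-line interface.

    # Hypothetical Snakefile sketch of the static PopIns-style workflow of fig. 2(a).
    # File names, genome list, and commands are placeholders, not the real PopIns CLI.
    GENOMES = ["G1", "G2", "G3"]  # one entry per individual

    rule all:
        input:
            "results/insertion_positions.tsv"

    rule assemble:  # AS: per genome, extract unaligned reads and assemble contigs
        input:
            "data/{genome}.bam"
        output:
            contigs="work/{genome}/contigs.fa",
            unaligned="work/{genome}/unaligned_reads.fastq"
        shell:
            "assemble_contigs {input} {output.contigs} {output.unaligned}"  # placeholder

    rule merge:  # combine contigs from all genomes into supercontigs
        input:
            expand("work/{genome}/contigs.fa", genome=GENOMES)
        output:
            "work/supercontigs.fa"
        shell:
            "merge_contigs {input} > {output}"  # placeholder

    rule contig_alignment:  # CoA: align unaligned reads to the supercontigs
        input:
            supercontigs="work/supercontigs.fa",
            unaligned=expand("work/{genome}/unaligned_reads.fastq", genome=GENOMES)
        output:
            "work/candidate_locations.tsv"
        shell:
            "align_to_supercontigs {input.supercontigs} {input.unaligned} > {output}"  # placeholder

    rule place:  # PR: determine precise insertion positions in the reference genome
        input:
            "work/candidate_locations.tsv"
        output:
            "results/insertion_positions.tsv"
        shell:
            "place_insertions {input} > {output}"  # placeholder

The per-genome rules use a wildcard {genome}, so that the engine instantiates them once per individual and derives the dependency structure from the matching file names.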
   To summarize, the output of each step in the PopIns workflow is the input of the next one, and the behaviour of the system depends on the input/output quality. Therefore, it is necessary for the scientist to examine the intermediate results and observe the system behaviour at various points during execution. This may result in repeating or skipping some steps. These design choices make genome analysis workflows good examples for exploratory data analysis. Yet, the example also illustrates that, if no ground truth is available, the design of the respective workflows cannot be optimized automatically.

   To support such application scenarios effectively, we envision a model of an interactive workflow to enable an execution as sketched in fig. 2(b). The execution starts with the assemble and merge steps, followed by contig alignment. However, we envision the definition of an interaction point, so that the execution of the workflow is paused. Then, a frontend, here denoted as a notebook, shall enable the visualization of the contig alignments and provide descriptive statistics over unaligned reads.

   Let us now assume that, based on some observations, a user takes the action to repeat the separation of unaligned reads with a different algorithm. However, the user may also update the definition of the interaction point that pauses the workflow after that specific step: It may be assigned a condition based on statistics over the unaligned reads, so that it is triggered only when this condition is met. Afterwards, another interaction point is activated, enabling the user to investigate all intermediate results obtained so far, i.e., contig alignments, unaligned reads, and a genome sample.

   These interaction points provide the user with the flexibility to make decisions and take actions on the workflow execution based on intermediate results. This way, a user can incorporate immediate and systematic changes at runtime, which will not only save time and computational resources, but could also act as an early indicator of technical errors during workflow execution.


3. Related Work

Scientific workflows help scientists to manage and organize their data-driven analysis [13]. There are many workflow engines that scientists rely on, such as Kepler [5], Galaxy [6], Pegasus [7], Snakemake [8], and Nextflow [9]. These engines provide ease-of-use through user interfaces, graphical or script-based, and catalogues of standardized data preprocessing techniques. As mentioned above, some engines provide limited support for exploratory analysis, e.g., for parameter sweeps. Also, notably, dynamic control of iterations in workflows based on changes of the processed data by a user was proposed in [14]. However, there is a gap in terms of expressive models to support generic interactivity during runtime, i.e., while a workflow is executed.

   Data flow optimization is related as it targets some of the challenges stemming from workflows used for exploratory analysis. For instance, meta-dataflows (MDFs), introduced in [15], improve task scheduling and memory allocation in exploratory analysis. Data access patterns caused by exploratory analysis may also benefit from caching layers, such as Tachyon [16].

   Debugging of data processing pipelines has received increased interest in the data management community recently [17]. For example, Dagger [18] provides interactivity through debugging primitives in data-driven pipelines. DataExposer [19], in turn, helps to identify properties that can be considered root causes of performance degradation or system failure due to data. Yet, most of this work focuses on debugging at the data level, rather than the control-flow level.


4. Model of Interactive Workflows

Below, we describe our take on research question (R1), i.e., how to model interactive workflows. We first propose an extension to the common workflow model (§ 4.1), before elaborating on exploration primitives (§ 4.2).

4.1. Interaction Points and Actions

As a starting point, we consider a traditional model of a workflow, see [4]. It defines a workflow as a DAG, where vertices denote operators and edges denote data dependencies between references to the datasets consumed or produced by operators, also known as input and output ports. A state of such a workflow is then given by a binding of specific files to these input and output ports.

   As hinted at above, the question of how to model interactive workflows can be split into two parts: when and where to interact; and what actions to apply. We therefore propose to extend the traditional workflow model with two concepts, as follows:

Interaction points indicate that the workflow execution shall be paused for a user to explore the current state in terms of the data generated so far, which potentially involves executing some additional analysis to get insightful visualizations or to compute descriptive statistics on the intermediate results. Such an interaction point is given by an edge of the workflow DAG and, potentially, a condition. The latter may refer to the state of the workflow execution (e.g., checking the number of lines in a data file) or meta-data (e.g., checking the execution time of an operator). The semantics of an interaction point (IP) are summarized as follows: Upon completing the execution of the operator that is the source of the respective edge, the workflow engine checks the condition and, if it is true, does not continue execution with the operator that is the target of the edge, but waits for user input.

Actions indicate how the user intends to continue the execution of a workflow once an interaction point is reached. To this end, we consider different types of actions, including:

     • Revise interaction points: The set of interaction points defined for the workflow is updated.
     • Revise workflow: The structure of the workflow in terms of operators and data dependencies is updated.
     • Continuation: Workflow execution continues with the operator following the interaction point.
     • Skipping: Workflow execution continues based on the workflow DAG, but skips over the specified operators when doing so, i.e., the output ports of skipped operators denote empty datasets.
     • Rewind: Workflow execution continues from an earlier state, which is identified by an operator in the workflow DAG.

Naturally, the actions to revise interaction points and the workflow shall be combined with a continuation, skipping, or rewind action. Moreover, we note that the actions impose certain consistency requirements to enable proper workflow execution, e.g., in terms of reachability of operators in the workflow DAG and the realization of data dependencies.
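   To illustrate the two concepts, the following is a minimal Python sketch of how interaction points and actions could be represented. The class and attribute names are our own illustrative choices and not part of an existing engine's API.

    # Illustrative sketch of interaction points and actions; all names are hypothetical.
    from dataclasses import dataclass
    from enum import Enum, auto
    from typing import Callable, Optional


    class ActionType(Enum):
        CONTINUE = auto()    # proceed with the target operator of the edge
        SKIP = auto()        # skip specified operators; their output ports denote empty datasets
        REWIND = auto()      # resume from an earlier operator in the DAG
        REVISE_IPS = auto()  # update the set of interaction points
        REVISE_WF = auto()   # update operators and data dependencies


    @dataclass
    class WorkflowState:
        bindings: dict[str, str]    # output port -> bound file
        runtimes: dict[str, float]  # operator -> execution time (meta-data)


    @dataclass
    class InteractionPoint:
        edge: tuple[str, str]  # (source operator, target operator) in the workflow DAG
        condition: Optional[Callable[[WorkflowState], bool]] = None  # None: always pause

        def triggered(self, state: WorkflowState) -> bool:
            return self.condition is None or self.condition(state)


    # Example: pause after contig alignment only if that step took unusually long.
    ip = InteractionPoint(
        edge=("contig_alignment", "place"),
        condition=lambda state: state.runtimes.get("contig_alignment", 0.0) > 3600,
    )

Upon completion of the edge's source operator, an engine would evaluate triggered(state) and, if it returns true, hand control to the user, who answers with one of the actions listed above.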
4.2. Exploration Primitives

The extension of a model for workflows realized by interaction points and actions enables us to support various primitives often found in exploratory workflows. Below, we outline how this support is achieved for some of these exploration primitives:

Fast-forward: Based on properties of some intermediate results, a user may want to fast-forward the workflow execution to save time and compute resources. An example would be a sequence of operators to implement noise filtering, which may not be needed if the data variance stays within certain limits. This is enabled by defining an interaction point to decide on fast-forwarding (e.g., to compute the variance), which may then be realized through a skipping action (see the sketch after this list).

Repetition: A user may want to repeat a certain step, or a set thereof, before advancing with the workflow execution, e.g., to fine-tune the configuration of operators. Support for such repetition is limited in common workflow management systems, the reason being that most of them adopt an execution model based on a DAG, which prevents the definition of cycles in the workflow structure. Using the concepts envisioned for interactive workflows, we support repetitions by defining an interaction point at which a user may decide on a rewind action or a continuation action.

Sample-based exploration: Another pattern in exploratory analysis is that a user wants to test their workflow on a subset of the data, before applying it to the complete dataset, in order to save time and compute resources. An example would be the calibration of some data transformations by fitting a statistical model. Here, the fit of various models may first be explored using a sample of the data. Based thereon, the model is adopted to transform the whole dataset, or the workflow is altered to incorporate a different transformation. This is realized by defining an interaction point after the sample was processed, with the possible actions being to rewind and re-execute the workflow with the complete dataset, or to revise the workflow by replacing the transformation operator.
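   Continuing the illustrative sketch from § 4.1 (and reusing its hypothetical InteractionPoint, ActionType, and WorkflowState definitions), the fast-forward primitive could, for instance, be expressed as an interaction point whose condition inspects a data property and that is answered with a skipping action. The operator and port names are again placeholders.

    # Fast-forward as interaction point + skipping action; continues the sketch of § 4.1
    # and assumes its hypothetical InteractionPoint, ActionType, and WorkflowState types.

    def variance_within_limits(state: WorkflowState, limit: float = 0.05) -> bool:
        # The variance-computation operator is assumed to write a single number
        # to the file bound to its output port "variance_check.value".
        path = state.bindings["variance_check.value"]
        with open(path) as f:
            return float(f.read().strip()) < limit

    fast_forward_ip = InteractionPoint(
        edge=("variance_check", "noise_filter_1"),
        condition=variance_within_limits,
    )

    # If the interaction point triggers, the user (or a scripted default) answers with a
    # skipping action over the noise-filtering operators, whose outputs then stay empty.
    chosen_action = (ActionType.SKIP, ["noise_filter_1", "noise_filter_2"])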
5. Towards a Realization

Technical environment: To realize our conceptual model for interactive workflows (§ 4), we choose Snakemake, a state-of-the-art rule-based workflow management system. Here, a workflow is defined by a set of rules. Each rule denotes a task or operator and specifies how to create sets of output files from sets of input files. Then, the engine establishes the dependencies between the rules by matching file names. In Snakemake, when starting a workflow, these rules are used to create a DAG as the basis for execution. However, this also means that an adaptation of the rules and, hence, the DAG is not possible after the start of workflow execution.

   Nonetheless, even though the DAG cannot be altered at runtime, Snakemake provides limited support for interactivity. That is, a Jupyter notebook [20], a popular Python-based computational environment, can easily be integrated with Snakemake. Such a notebook combines code snippets, documentation, as well as plots into a single document. The integration in Snakemake is realized via dedicated rules, which, once executed, start a notebook for a user to work with. The workflow only continues execution according to the constructed DAG once the user closes the respective notebook.

   Preliminary results: To test the feasibility of our ideas, we implemented parts of our proposed model for interactive workflows in Snakemake using the notebook integration, and applied it to the PopIns workflow mentioned earlier (§ 2). Specifically, we added interaction points to the workflow by means of rules that start a notebook, as sketched in fig. 2(b). These notebooks then enable access to the intermediate results (e.g., unaligned reads and contigs in our example). Moreover, we realized the aforementioned rewind and continuation actions, which, once triggered by a user in the notebook, enable the repetition of particular steps in the workflow. Since the DAG constructed by Snakemake is immutable, our solution for the rewind action is based on aborting the current workflow instance and creating a new instance, while the control flow in the new instance is guided by the automated creation and deletion of dedicated files.

   Applied in the context of the PopIns workflow, our prototype, despite its limited support of the envisioned model, highlights the benefits of offering interactivity in workflow execution. Users can explore and examine intermediate results at runtime, and realize common exploration primitives directly, rather than being delayed by the need to wait for workflow completion.
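   For concreteness, the following is a minimal sketch of how such an interaction point can be attached to a workflow via Snakemake's Jupyter notebook integration (the notebook directive), in the spirit of the prototype described above. Rule names, paths, and the marker-file mechanism are illustrative assumptions rather than the actual prototype code.

    # Sketch of an interaction point realized as a Snakemake rule with the Jupyter
    # notebook integration; all names and paths are illustrative placeholders.
    GENOMES = ["G1", "G2", "G3"]  # as in the sketch of § 2

    rule inspect_contig_alignment:
        input:
            alignments="work/candidate_locations.tsv",
            unaligned=expand("work/{genome}/unaligned_reads.fastq", genome=GENOMES)
        output:
            # Marker file that signals the user has finished the interaction;
            # downstream rules depend on it, so the workflow pauses here.
            touch("work/checkpoints/contig_alignment.inspected")
        notebook:
            "notebooks/explore_contig_alignment.py.ipynb"

    rule place:
        input:
            locations="work/candidate_locations.tsv",
            inspected="work/checkpoints/contig_alignment.inspected"
        output:
            "results/insertion_positions.tsv"
        shell:
            "place_insertions {input.locations} > {output}"  # placeholder

With recent Snakemake versions, such a notebook can also be opened interactively via the --edit-notebook option with the marker file as target. A rewind can then be emulated along the lines described above, e.g., by removing the marker and intermediate files of the steps to repeat before re-invoking Snakemake on a fresh instance.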
6. Conclusions

In this work, we outlined the need to support interactivity in workflows for scientific data. To address this need, we outlined a model for interactive workflows, which is based on interaction points and actions. We further reported on our preliminary results of realizing this model in Snakemake, a rule-based workflow engine, and its integration with Jupyter notebooks. Specifically, we applied the implementation to the PopIns workflow for structural variant calling in genomics.

   Having a first version of a model for interactive workflows, our research plan involves the following phases: First, we intend to study the realization of further exploration primitives using interaction points and actions. This way, we also seek to understand whether our model shall incorporate more expressive actions. Second, we aim to ensure that our implementation is supported not only in the stand-alone execution mode of Snakemake, but can also be employed for cluster-based execution. Third, our goal is to provide implementation strategies for our model of interactive workflows for other workflow engines, such as Nextflow.


Acknowledgments

This work is funded by the German Research Foundation (DFG), Project-ID 414984028, SFB 1404 FONDA.


References

 [1] U. Leser, M. Hilbrich, C. Draxl, P. Eisert, L. Grunske, P. Hostert, D. Kainmüller, O. Kao, B. Kehr, T. Kehrer, C. Koch, V. Markl, H. Meyerhenke, T. Rabl, A. Reinefeld, K. Reinert, K. Ritter, B. Scheuermann, F. Schintke, N. Schweikardt, M. Weidlich, The Collaborative Research Center FONDA, Datenbank-Spektrum (2021). doi:10.1007/s13222-021-00397-5.
 [2] B. Kehr, P. Melsted, B. V. Halldórsson, PopIns: population-scale detection of novel sequence insertions, Bioinform. 32 (2016) 961–967.
 [3] A. Barker, J. van Hemert, Scientific workflow: A survey and research directions, Springer Berlin Heidelberg, Berlin, Heidelberg, 2008, pp. 746–753.
 [4] E. Deelman, D. Gannon, M. Shields, I. Taylor, Workflows and e-Science: An overview of workflow system features and capabilities, Future Generation Computer Systems 25 (2009) 528–540. doi:10.1016/j.future.2008.06.012.
 [5] D. Barseghian, I. Altintas, M. B. Jones, D. Crawl, N. Potter, J. Gallagher, P. Cornillon, M. Schildhauer, E. T. Borer, E. W. Seabloom, P. R. Hosseini, Workflows and extensions to the Kepler scientific workflow system to support environmental sensor data access and analysis, Ecological Informatics 5 (2010).
 [6] D. Blankenberg, G. V. Kuster, N. Coraor, G. Ananda, R. Lazarus, M. Mangan, A. Nekrutenko, J. Taylor, Galaxy: A web-based genome analysis tool for experimentalists, Current Protocols in Molecular Biology 89 (2010).
 [7] E. Deelman, K. Vahi, G. Juve, M. Rynge, S. Callaghan, P. J. Maechling, R. Mayani, W. Chen, R. Ferreira da Silva, M. Livny, K. Wenger, Pegasus, a workflow management system for science automation, Future Generation Computer Systems 46 (2015).
 [8] J. Köster, S. Rahmann, Snakemake - a scalable bioinformatics workflow engine, Bioinform. 34 (2018) 3600.
 [9] P. D. Tommaso, E. W. Floden, C. Magis, E. Palumbo, C. Notredame, Nextflow, an efficient tool to improve computation numerical stability in genomic analysis, 211 (2017). doi:10.1051/jbio/2017029.
[10] D. Abramson, B. Bethwaite, C. Enticott, S. Garic, T. Peachey, Parameter space exploration using scientific workflows, in: G. Allen, J. Nabrzyski, E. Seidel, G. D. van Albada, J. Dongarra, P. M. A. Sloot (Eds.), Computational Science – ICCS 2009, Springer Berlin Heidelberg, 2009.
[11] D. Król, J. Kitowski, R. F. da Silva, G. Juve, K. Vahi, M. Rynge, E. Deelman, Science automation in practice: Performance data farming in workflows, in: 2016 IEEE 21st International Conference on Emerging Technologies and Factory Automation (ETFA), 2016.
[12] S. Niehus, H. Jónsson, J. Schönberger, E. Björnsson, D. Beyter, H. P. Eggertsson, P. Sulem, K. Stefánsson, B. V. Halldórsson, B. Kehr, PopDel identifies medium-size deletions jointly in tens of thousands of genomes, bioRxiv (2020).
[13] C. S. Liew, M. P. Atkinson, M. Galea, T. F. Ang, P. Martin, J. I. V. Hemert, Scientific workflows: Moving across paradigms, ACM Comput. Surv. 49 (2016).
[14] J. Dias, G. Guerra, F. Rochinha, A. L. Coutinho, P. Valduriez, M. Mattoso, Data-centric iteration in dynamic workflows, Future Gener. Comput. Syst. 46 (2015) 114–126.
[15] R. Fernandez, W. Culhane, P. Watcharapichat, M. Weidlich, V. Morales, P. Pietzuch, Meta-dataflows: Efficient exploratory dataflow jobs, 2018, pp. 1157–1172.
[16] H. Li, A. Ghodsi, M. Zaharia, S. Shenker, I. Stoica, Tachyon: Reliable, memory speed storage for cluster computing frameworks, SOCC ’14, Association for Computing Machinery, 2014, pp. 1–15. doi:10.1145/2670979.2670985.
[17] M. Brachmann, W. Spoth, Y. Yang, C. Bautista, S. Castelo, S. Feng, J. Freire, B. Glavic, O. Kennedy, H. Müeller, R. Rampin, Data debugging and exploration with Vizier, 2019, pp. 1877–1880.
[18] Y. Yang, M. Youill, M. E. Woicik, Y. Liu, X. Yu, M. Ser-
     afini, A. Aboulnaga, M. Stonebraker, Flexpush-
     downdb: Hybrid pushdown and caching in a cloud
     DBMS, Proc. VLDB Endow. 14 (2021) 2101–2113.
[19] S. Galhotra, A. Fariha, R. Lourenço, J. Freire,
     A. Meliou, D. Srivastava, Dataexposer: Expos-
     ing disconnect between data and systems, CoRR
     abs/2105.06058 (2021).
[20] T. Kluyver, B. Ragan-Kelley, F. Pérez, B. Granger,
     M. Bussonnier, J. Frederic, K. Kelley, J. Hamrick,
     J. Grout, S. Corlay, P. Ivanov, D. Avila, S. Abdalla,
     C. Willing, Jupyter notebooks – a publishing for-
     mat for reproducible computational workflows, in:
     F. Loizides, B. Schmidt (Eds.), Positioning and Power
     in Academic Publishing: Players, Agents and Agen-
     das, IOS Press, 2016, pp. 87 – 90.