<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>PhD Workshop, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Interactive Workflows for Exploratory Data Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nourhan Elfaramawy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>supervised by Prof. Matthias Weidlich, Humboldt-Universität zu Berlin</institution>
          ,
          <addr-line>Unter den Linden 6, 10099 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>5</volume>
      <issue>2022</issue>
      <fpage>05</fpage>
      <lpage>09</lpage>
      <abstract>
<p>The analysis of scientific data is often exploratory, meaning that the exact design of a workflow to process data is subject to continuous investigation and redesign. While support for the design of such workflows is manifold, it focuses primarily on reuse, reproducibility, and traceability of analysis results. Yet, it typically relies on static models of workflows that force scientists to wait for completion and restart a workflow repeatedly to explore different design choices. This is inefficient in terms of the invested time and resources. In this PhD project, we strive for support of user interactions in workflow execution. Our proposal is to extend common workflow models with concepts to define interaction points and possible actions, thereby providing users the flexibility to realize diverse interaction primitives, such as forwarding, repetition, and sample-based exploration. We further outline our initial results on realizing a model for interactive workflows.</p>
      </abstract>
      <kwd-group>
<kwd>Exploratory Data Analysis</kwd>
        <kwd>Scientific Workflows</kwd>
        <kwd>Interactive Workflows</kwd>
        <kwd>Snakemake</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>In domains such as bio-informatics, remote sensing, and materials science, the analysis of large-scale data is a prerequisite for scientific progress [1]. To this end, complex pipelines of operators, also referred to as scientific workflows or data analysis workflows, are designed and executed using infrastructures for distributed computation.</p>
        <p>However, the respective analysis is typically exploratory,
meaning that it emerges from a scientific process, in
which hypotheses are designed and step-wise confirmed
or invalidated. Therefore, workflows used for the
analysis are also subject to continuous change.</p>
        <p>While the importance of supporting the design and execution of workflows is widely recognized [3, 4], existing models and methods focus on reuse, reproducibility, and traceability of analysis results. Workflow engines such as Kepler [5], Galaxy [6], Pegasus [7], Snakemake [8], or Nextflow [9] offer means to specify workflows from reusable building blocks, provide technical abstractions of compute infrastructures, and include functionality for exchange and collaboration in the workflow design. However, they adopt a static notion of a workflow as shown in fig. 1(a): A user specifies and configures a workflow, which is subsequently executed.</p>
        <p>Figure 1: (a) Data analysis based on a traditional workflow that is first defined and then executed; (b) exploratory data analysis that is supported by interactive workflows.</p>
        <p>A static workflow model is inherently limited in its support of interactivity for exploratory data analysis, though. The exploration of design choices and possible changes in the analysis, and hence in the workflow, can only be defined at design time, if at all. In practice, therefore, such exploration is restricted to relatively simple scenarios, such as the definition of parameter sweeps of certain operators, see Nimrod [10] for Kepler and Scalarm [11] for Pegasus. The lack of flexibility in the workflow execution has severe implications. Scientists waste their own time as well as resources of a compute infrastructure, as they have to resort to submitting their workflow for execution and waiting for its completion, before repeating it all over again with potentially only minor adjustments.</p>
        <p>In this paper, we propose to extend common workflow models with interaction capabilities, thereby providing support for exploratory data analysis by a human-in-the-loop model for workflow execution, see fig. 1(b). By enabling scientists to examine the intermediate data produced by a workflow, and to configure and adjust the workflow based on their observations, design choices can be explored immediately and systematically, using less compute resources compared to the traditional model.</p>
        <p>In order to realize this objective, we set out to answer the following research questions:
(R1) How to support interactions during workflow execution?
(i) When and where shall a user be able to interact?
(ii) What are the actions a user may apply?
(R2) How can this dynamic interactive exploratory model be implemented in common workflow systems and infrastructures?</p>
        <p>Below, we first introduce a motivating example (§ 2) and review related work (§ 3). Then, we outline our ideas to answer the above questions. This includes a model for interactive workflows including notions of interaction points and actions (§ 4). Moreover, we elaborate on our preliminary results in terms of realizing this model in an existing workflow engine (§ 5), before we conclude (§ 6).</p>
        <p>2. Motivation</p>
        <p>We illustrate the need for interactive workflows for exploratory data analysis with an example from the field of bio-informatics, in particular genomics. Here, recent advances in DNA sequencing led to the wide-spread availability of large volumes of genome sequence data. Specific research questions in this area, for instance, relate to the identification of structural variants (SVs) in genomes. In particular, tools such as PopIns [2] and PopDel [12] have been developed to detect large insertion and deletion variants in whole-genome sequencing (WGS) data of hundreds to tens of thousands of individuals.</p>
        <p>A tool such as PopIns actually implements a workflow of multiple operators that are applied to the genome of each individual separately, or that combine data from multiple individuals, as shown in fig. 2(a). In general, this workflow starts with the assemble (AS) operator that takes a genome (G) as input and reconstructs contigs from its unaligned reads (UnR), i.e., a set of unaligned reads from a sequencing dataset of a single individual; the contigs represent candidate sequences of insertions. Then, a merge operator combines the contigs from different genomes, which results in so-called super-contigs. Those are used in the contig alignment (CoA) step, which aligns the unaligned reads to the supercontigs and outputs candidate locations (approximate positions) of the supercontigs in the reference genome. The placement of the reference alignment (PR) step identifies precise insertion positions of the supercontigs in the reference genome.</p>
        <p>To summarize, the output of each step in the PopIns workflow is the input of the next one, and the behaviour of the system depends on the input/output quality. Therefore, it is necessary for the scientist to examine the intermediate results and observe the system behaviour at various points during execution. This may result in repeating or skipping some steps. These design choices make genome analysis workflows good examples for exploratory data analysis. Yet, the example also illustrates that, if no ground truth is available, the design of the respective workflows cannot be optimized automatically.</p>
        <p>To support such application scenarios effectively, we envision a model of an interactive workflow to enable an execution as sketched in fig. 2(b). The execution starts with the assemble and merge steps, followed by contig alignment. However, we envision the definition of an interaction point, so that the execution of the workflow is paused. Then, a frontend, here denoted as a notebook, shall enable the visualization of the contig alignments and provide descriptive statistics over unaligned reads. Let us now assume that, based on some observations, a user takes the action to repeat the separation of unaligned reads with a different algorithm. However, the user may also update the definition of the interaction point that may pause the workflow after that specific step: It may be assigned a condition based on statistics over the unaligned reads, so that it is triggered only when this condition is met. Afterwards, another interaction point is activated, enabling the user to investigate all intermediate results obtained so far, i.e., contig alignments, unaligned reads and a genome sample.</p>
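        <p>To make the data flow concrete, the chain of steps described above can be sketched as plain Python functions in which each step consumes the previous step's output; the function names and toy data are illustrative only and do not reflect PopIns' actual implementation.</p>

```python
# Toy sketch of a PopIns-style chain: each step's output is the next step's input.

def assemble(genome):
    # Reconstruct candidate contigs from the unaligned reads of one individual.
    return ["contig_%s_%d" % (genome, i) for i in range(2)]

def merge(contig_sets):
    # Combine contigs from different genomes into super-contigs.
    return ["super_" + c for contigs in contig_sets for c in contigs]

def contig_alignment(supercontigs):
    # Align unaligned reads to the super-contigs, yielding candidate locations.
    return {sc: "approx_pos_%d" % i for i, sc in enumerate(supercontigs)}

def placement(candidates):
    # Refine candidate locations into precise insertion positions.
    return {sc: pos.replace("approx", "precise") for sc, pos in candidates.items()}

contig_sets = [assemble(g) for g in ["G1", "G2"]]
positions = placement(contig_alignment(merge(contig_sets)))
```

        <p>Pausing after the contig alignment to inspect the candidate locations, as envisioned above, would amount to inserting an interaction point between the last two calls.</p>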
        <p>These interaction points provide the user with the flexibility to make decisions and take actions on the workflow execution based on intermediate results. This way, a user can incorporate immediate and systematic changes at runtime, which will not only save time and computational resources, but could also act as an early indicator for technical errors during workflow execution.</p>
      </sec>
    </sec>
    <sec id="sec-rw">
      <title>3. Related Work</title>
      <p>Scientific workflows help scientists to manage and organize their data-driven analysis [13]. There are many workflow engines that scientists rely on, such as Kepler [5], Galaxy [6], Pegasus [7], Snakemake [8], and Nextflow [9]. These engines provide ease-of-use through user interfaces, graphical or script-based, and catalogues of standardized data preprocessing techniques. As mentioned above, some engines provide limited support for exploratory analysis, e.g., for parameter sweeps. Also, notably, dynamic control of iterations in workflows based on changes of the processed data by a user was proposed in [14]. However, there is a gap in terms of expressive models to support generic interactivity during runtime, i.e., while a workflow is executed.</p>
      <p>Data flow optimization is related, as it targets some of the challenges stemming from workflows used for exploratory analysis. For instance, meta-dataflows (MDFs) were introduced in [15] to improve task scheduling and memory allocation in exploratory analysis. Data access patterns caused by exploratory analysis may also benefit from caching layers, such as Tachyon [16].</p>
      <p>Debugging of data processing pipelines has recently received increased interest in the data management community [17]. For example, Dagger [18] provides interactivity through debugging primitives in data-driven pipelines. DataExposer [19], in turn, helps to identify properties that can be considered root causes of performance degradation, or of system failure due to data. Yet, most of this work focuses on debugging at the data level, rather than the control-flow level.</p>
    </sec>
    <sec id="sec-2">
      <title>4. Model of Interactive Workflows</title>
      <p>Below, we describe our take on research question (R1), i.e., how to model interactive workflows. We first propose an extension to the common workflow model (§ 4.1), before elaborating on exploration primitives (§ 4.2).</p>
      <sec id="sec-2-1">
        <title>4.1. Interaction Points and Actions</title>
        <p>As a starting point, we consider a traditional model of a workflow, see [4]. It defines a workflow as a DAG, where vertices denote operators and edges denote data dependencies between the references to the datasets consumed or produced by operators, also known as input and output ports. A state of such a workflow is then given by a binding of specific files to these input and output ports.</p>
        <p>As hinted at already above, the question of how to model interactive workflows can be split into two parts: when and where to interact; and what actions to apply. We therefore propose to extend the traditional workflow model with two concepts, as follows:</p>
        <p>Interaction points indicate that the workflow execution shall be paused for a user to explore the current state in terms of the data generated so far, which potentially involves executing some additional analysis to get insightful visualizations or to compute descriptive statistics on the intermediate results. Such an interaction point is given by an edge of the workflow DAG and, potentially, a condition. The latter may refer to a state of workflow execution (e.g., checking the number of lines in a data file) or meta-data (e.g., checking the execution time of an operator). The semantics of an interaction point (IP) are summarized as follows: Upon completing the execution of the operator that is the source of the respective edge, the workflow engine checks the condition and, if it is true, does not continue execution with the operator that is the target of the edge, but waits for user input.</p>
        <p>Actions indicate how the user intends to continue the execution of a workflow once an interaction point is reached. To this end, we consider different types of actions, including:
• Revise interaction points: The set of interaction points defined for the workflow is updated.
• Revise workflow: The structure of the workflow in terms of operators and data dependencies is updated.
• Continuation: Workflow execution continues with the operator following the interaction point.
• Skipping: Workflow execution continues based on the workflow DAG, but skips over the specified operators when doing so, i.e., the output ports of skipped operators denote empty datasets.
• Rewind: Workflow execution continues from an earlier state, which is identified by an operator in the workflow DAG.</p>
        <p>Naturally, the actions to revise interaction points and the workflow shall be combined with a continuation, skipping, or rewind action. Moreover, we note that the actions impose certain consistency requirements to enable proper workflow execution, e.g., in terms of reachability of operators in the workflow DAG and the realization of data dependencies.</p>
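        <p>A minimal executable sketch of these semantics may help to fix ideas; the class and method names below are our own illustration, not a fixed API. It models operators as functions over a state that binds data to ports, interaction points as edges with conditions, and skipping and continuation as operations on the paused workflow:</p>

```python
# Minimal sketch of an interactive workflow: a DAG whose edges may carry
# interaction points, i.e., conditions under which execution pauses.

class InteractiveWorkflow:
    def __init__(self, operators):
        self.operators = operators        # operator name -> callable(state) -> state
        self.interaction_points = {}      # edge (src, dst) -> condition(state) -> bool
        self.paused_at = None             # edge at which execution is paused, if any

    def add_interaction_point(self, edge, condition=lambda state: True):
        self.interaction_points[edge] = condition

    def skip(self, op):
        # Skipping action: the skipped operator's output port denotes an empty dataset.
        self.operators[op] = lambda state, _op=op: dict(state, **{_op: []})

    def run(self, order, state):
        # Execute operators in topological order; after an operator completes,
        # check the interaction point on the edge to its successor and, if the
        # condition holds, pause and wait for user input.
        for i, op in enumerate(order):
            state = self.operators[op](state)
            nxt = order[i + 1] if i + 1 != len(order) else None
            cond = self.interaction_points.get((op, nxt))
            if cond is not None and cond(state):
                self.paused_at = (op, nxt)
                return state
        self.paused_at = None
        return state

    def resume(self, order, state):
        # Continuation action: continue with the operator following the interaction point.
        if self.paused_at is None:
            return state
        dst = self.paused_at[1]
        self.paused_at = None
        return self.run(order[order.index(dst):], state)
```

        <p>A rewind action would, analogously, re-enter run at an earlier operator of the order; revising the workflow or its interaction points amounts to updating the respective dictionaries before resuming.</p>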
      </sec>
      <sec id="sec-2-2">
        <title>4.2. Exploration Primitives</title>
        <p>The extension of a model for workflows realized by interaction points and actions enables us to support various primitives often found in exploratory workflows. Below, we outline how this support is achieved for some of these exploration primitives:</p>
        <p>Fast-forward: Based on properties of some intermediate results, a user may want to fast-forward the workflow execution to save time and compute resources. An example would be a sequence of operators to implement noise filtering, which may not be needed if the data variance stays within certain limits. This is enabled by defining an interaction point to decide on fast-forwarding (e.g., to compute the variance), which may then be realized through a skipping action.</p>
        <p>Repetition: A user may want to repeat a certain step, or a set thereof, before advancing with the workflow execution, e.g., to fine-tune the configuration of operators. Support for such repetition is limited in common workflow management systems, the reason being that most of them adopt an execution model based on a DAG, which prevents the definition of cycles in the workflow structure. Using the concepts envisioned for interactive workflows, we support repetitions by defining an interaction point at which a user may decide on a rewind action or a continuation action.</p>
        <p>Sample-based exploration: Another pattern in exploratory analysis is that a user wants to test their workflow on a subset of the data, before applying it to the complete dataset, in order to save time and compute resources. An example would be the calibration of some data transformations by fitting a statistical model. Here, the fit of various models may first be explored using a sample of the data. Based thereon, the model is adopted to transform the whole dataset, or the workflow is altered to incorporate a different transformation. This is realized by defining an interaction point after the sample was processed, with the possible actions being to rewind and re-execute the workflow with the complete dataset, or to revise the workflow by replacing the transformation operator.</p>
        <p>Technical environment: To realize our conceptual model for interactive workflows (§ 4), we choose Snakemake, a state-of-the-art rule-based workflow management system. Here, a workflow is defined by a set of rules. Each rule denotes a task or operator and specifies how to create sets of output files from sets of input files. Then, the engine establishes the dependencies between the rules by matching file names. In Snakemake, when starting a workflow, these rules are used to create a DAG as the basis for execution. However, this also means that an adaptation of the rules and, hence, of the DAG is not possible after the start of workflow execution.</p>
        <p>However, even though the DAG cannot be altered at runtime, Snakemake provides limited support for interactivity. That is, a Jupyter notebook [20], a popular Python-based computational environment, can easily be integrated with Snakemake. Such a notebook combines code snippets, documentation, as well as plots into a single document. The integration in Snakemake is realized via dedicated rules, which, once executed, start a notebook for a user to work with. The workflow only continues execution according to the constructed DAG once the user closes the respective notebook.</p>
        <p>Preliminary results: To test the feasibility of our ideas, we implemented parts of our proposed model for interactive workflows in Snakemake using the notebook integration, and applied it to the PopIns workflow mentioned earlier (§ 2). Specifically, we added interaction points in the workflow by rules that start a notebook, as sketched in fig. 2(b). These notebooks then enable access to the intermediate results (e.g., unaligned reads and contigs in our example). Moreover, we realized the aforementioned rewind and continuation actions, which, once triggered by a user in the notebook, enable the repetition of particular steps in the workflow. Since the DAG constructed by Snakemake is immutable, our solution for the rewind action is based on aborting the current workflow instance and creating a new instance, while the control-flow in the new instance is guided by the automated creation and deletion of dedicated files.</p>
        <p>Applied in the context of the PopIns workflow, our prototype, despite its limited support of the envisioned model, highlights the benefits of offering interactivity in workflow execution. Users can explore and examine intermediate results at runtime, and realize common exploration primitives directly, rather than being delayed by the need to wait for workflow completion.</p>
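        <p>As an illustration of this integration, an interaction point can be expressed as a Snakemake rule with a notebook directive. The following Snakefile fragment is a sketch with hypothetical rule, file, and command names, not the exact rules of our prototype:</p>

```python
# Sketch of a Snakefile fragment (hypothetical rule and file names): the
# inspect_alignments rule realizes an interaction point after contig alignment
# by opening a Jupyter notebook; the workflow proceeds only once the user
# closes the notebook (e.g., when run via: snakemake --edit-notebook ...).
rule contig_alignment:
    input:
        "results/supercontigs.fa",
        "results/unaligned_reads.fa"
    output:
        "results/contig_alignments.bam"
    shell:
        "popins_align {input} {output}"  # placeholder command

rule inspect_alignments:
    input:
        "results/contig_alignments.bam"
    output:
        touch("results/alignments_inspected.flag")
    notebook:
        "notebooks/inspect_alignments.py.ipynb"
```

        <p>Here, the flag file produced by the notebook rule is what downstream rules would depend on, so that the DAG constructed by Snakemake enforces the pause.</p>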
      </sec>
      <sec id="sec-2-3">
        <title>6. Conclusions</title>
        <p>In this work, we outlined the need to support interactivity in workflows for scientific data. To address this need, we outlined a model for interactive workflows, which is based on interaction points and actions. We further reported on our preliminary results of realizing this model in Snakemake, a rule-based workflow engine, and its integration with Jupyter notebooks. Specifically, we adopted the implementation in the PopIns workflow for structural variant calling in genomics.</p>
        <p>Having a first version of a model for interactive workflows, our research plan involves the following phases: First, we intend to study the realization of further exploration primitives using interaction points and actions. This way, we also seek to understand whether our model shall incorporate more expressive actions. Second, we aim to ensure that our implementation is supported not only in the stand-alone execution mode of Snakemake, but can also be employed for cluster-based execution. Third, our goal is to provide implementation strategies for our model of interactive workflows for other workflow engines, such as Nextflow.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <p>This work is funded by the German Research Foundation (DFG), Project-ID 414984028, SFB 1404 FONDA.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref-18"><label>[18]</label><mixed-citation>Y. Yang, M. Youill, M. E. Woicik, Y. Liu, X. Yu, M. Serafini, A. Aboulnaga, M. Stonebraker, FlexPushdownDB: Hybrid pushdown and caching in a cloud DBMS, Proc. VLDB Endow. 14 (2021) 2101–2113.</mixed-citation></ref>
      <ref id="ref-19"><label>[19]</label><mixed-citation>S. Galhotra, A. Fariha, R. Lourenço, J. Freire, A. Meliou, D. Srivastava, DataExposer: Exposing disconnect between data and systems, CoRR abs/2105.06058 (2021).</mixed-citation></ref>
      <ref id="ref-20"><label>[20]</label><mixed-citation>T. Kluyver, B. Ragan-Kelley, F. Pérez, B. Granger, M. Bussonnier, J. Frederic, K. Kelley, J. Hamrick, J. Grout, S. Corlay, P. Ivanov, D. Avila, S. Abdalla, C. Willing, Jupyter notebooks – a publishing format for reproducible computational workflows, in: F. Loizides, B. Schmidt (Eds.), Positioning and Power in Academic Publishing: Players, Agents and Agendas, IOS Press, 2016, pp. 87–90.</mixed-citation></ref>
    </ref-list>
  </back>
</article>