=Paper= {{Paper |id=Vol-2065/paper07 |storemode=property |title=Requirements for Facilitating the Continuous Creation of Scientific Workflow Variants |pdfUrl=https://ceur-ws.org/Vol-2065/paper07.pdf |volume=Vol-2065 |authors=Lucas A. M. C. Carvalho,Daniel Garijo,Bakinam T. Essawy,Claudia Bauzer Medeiros,Yolanda Gil |dblpUrl=https://dblp.org/rec/conf/kcap/CarvalhoGEMG17 }} ==Requirements for Facilitating the Continuous Creation of Scientific Workflow Variants== https://ceur-ws.org/Vol-2065/paper07.pdf
Requirements for Supporting the Iterative Exploration of Scientific Workflow Variants
Lucas A. M. C. Carvalho1, Bakinam T. Essawy2, Daniel Garijo3, Claudia Bauzer Medeiros1, Yolanda Gil3
1 University of Campinas, Institute of Computing, Campinas, SP, Brazil
2 University of Virginia, Department of Civil and Environmental Engineering, Charlottesville, VA, U.S.A
3 University of Southern California, Information Sciences Institute, Marina del Rey, CA, U.S.A
lucas.carvalho@ic.unicamp.br, bte2rn@virginia.edu, dgarijo@isi.edu, cmbm@ic.unicamp.br, gil@isi.edu
ABSTRACT
Workflow systems support scientists in capturing computational
experiments and managing their execution. However, such
systems are not designed to help scientists create and track the
many related workflows that they build as variants, trying different
software implementations and distinct ways to process data and
deciding what to do next by looking at previous workflow
results. An initial workflow will be changed to create many new
variants thereof that differ from each other in one or more steps.
Our goal is to support scientists in the iterative design of
computational experiments by assisting them in the creation and
management of workflow variants. In this paper, we present
several use cases for creating workflow variants in hydrology, from
which we specify requirements for workflow variants. We also
discuss major research directions to address these requirements.

CCS CONCEPTS
• Information systems → Artificial intelligence; Knowledge
representation and reasoning

KEYWORDS
Scientific workflows, workflow variants, computational experiments

1   INTRODUCTION
Scientific workflow systems play a major role in supporting
scientists to design, document and execute their computational
experiments, automatically tracking provenance during the
workflow execution [11; 1]. Scientists follow an iterative
exploratory cycle where they often create an initial workflow, and
then explore variations of it using different data, replacing some of
the software steps, or adding new steps. Sometimes workflows
have to be modified because of changes in data (e.g. when datasets
are updated with new formats) or software (e.g., software is no
longer available, a newer version is better).
    In current workflow systems, scientists manage this exploratory
process manually. Updating a workflow is a complex and
time-consuming task that may involve several steps, and may require
tracking down information about different versions of the software
used in the workflow.
    This paper presents use cases and their requirements to support
scientists in the process of exploring different variations of an
original workflow, and introduces research directions to address
these requirements. These scenarios are based on discussion with
domain scientists, particularly in hydrology and bioinformatics.

2    WORKFLOW VARIANTS
Computational workflows describe the computational steps and the
dataflow among them to perform complex multi-step analyses. The
steps are implemented by software components (or workflow
components) that process data. A software component has a
well-defined interface consisting of input and output files as well as
parameter constant values. The dataflow between components is
captured as connections among their respective interfaces. A
workflow component may be implemented by a scientist, for
example a routine to check for erroneous sensor readings. A
workflow component may also be implemented using third-party
software, for example invoking a linear regression function from a
machine learning software package. A workflow component can
be updated in two ways. In some cases, a new upgrade of the
component is created to override a previous one, for example in
cases where the underlying software was corrected to fix a bug. In
other cases, a new variant of the component is created with new
inputs or outputs or other modifications, where the previous
versions are still valid and available to the user to use in workflows.
    Workflow executions are the result of running workflows and
provide provenance for the newly generated data products.
    After running a certain workflow, a scientist may want to
explore a workflow variant that represents a variation of an existing
workflow that was run earlier where one or more steps are changed.
That step change may require changing other steps that may be
affected. In other cases, the scientist may create a new workflow
upgrade of a previously run workflow that simply replaces a
component by a new one with a bug fix. When a workflow is
upgraded, the scientist may need to redo previous runs.
    Due to the exploratory nature of science, a scientist may start
with an initial workflow and iteratively create many workflow
variants. During this process, the scientist will want to consider
different designs of variants, compare any given variant with
previous ones, and synthesize the results of several variants with
comparable settings. This iterative process of creating and
managing workflow variants is currently not well supported. There
are several reasons why a scientist may create a workflow variant:

K-CAP2017 Workshops and Tutorials Proceedings,
© Copyright held by the owner/author(s)
 SciKnow’2017 Austin, Texas USA                                                                                             Carvalho et al.

    1.  New versions of the software used in the workflow
        components are released. These may add new
        functionality that could be useful for the investigation.
        These may also correct errors or fix bugs, in which case
        the scientist may want to check whether the results
        obtained with the older version still hold.
    2. New possible models or algorithms become available.
        The scientist may discover these through online search,
        reading articles, or talking with colleagues. These open
        new possibilities for exploring alternative designs of the
        workflow.
    3. New datasets become available to the scientist. In this
        case, the scientist may want to change their workflow to
        incorporate that new kind of data.
    Sometimes the explorations are due to a combination of these.
For example, new software versions may fix errors and offer new
functionality that allows the scientist to use new kinds of data.
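
The distinction drawn above between an upgrade (which overrides its
predecessor) and a variant (which coexists with it) can be sketched as a
small data model. This is only an illustration under stated assumptions:
the names Component and ComponentCatalog, the method names, and the
"listing" output label are hypothetical and do not come from any workflow
system discussed in this paper.

```python
from dataclasses import dataclass, replace
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class Component:
    """A workflow component: software plus a well-defined interface."""
    name: str
    software: str
    version: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


class ComponentCatalog:
    """Hypothetical catalog that records how components derive from each other."""

    def __init__(self) -> None:
        self.parent: Dict[Component, Component] = {}
        self.superseded: Set[Component] = set()

    def add_upgrade(self, old: Component, new: Component) -> None:
        # An upgrade overrides the previous component (e.g. after a bug fix).
        self.parent[new] = old
        self.superseded.add(old)

    def add_variant(self, old: Component, new: Component) -> None:
        # A variant coexists with the component it was derived from.
        self.parent[new] = old

    def is_current(self, component: Component) -> bool:
        return component not in self.superseded


# Illustrative values; the "listing" output name is an assumption.
v102 = Component("MODFLOW", "MODFLOW-NWT", "1.0.2",
                 ("area", "elevation", "recharge"), ("listing",))
v103 = replace(v102, version="1.0.3")                       # bug-fix upgrade
with_wells = replace(v103, inputs=v103.inputs + ("well",))  # new interface

catalog = ComponentCatalog()
catalog.add_upgrade(v102, v103)
catalog.add_variant(v103, with_wells)
```

Recording the derivation kind explicitly, rather than inferring it from the
interface, matters because a variant may keep the same interface while
carrying out a different function.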

3 RELATED WORK
There have been several efforts to keep track of and manage workflow
updates and versions. VisTrails [2] tracks the evolution of
workflows using a change-based provenance model that records
information about modifications to workflow components, inputs,
outputs and parameters. They compare results of workflow
executions using visualizations. However, this approach focuses on
capturing changes and comparing workflows, while we are
interested in supporting the process of designing, creating, and
managing workflow variants.
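
To illustrate what a change-based provenance model records, the following
minimal sketch stores each edit as an action in a version tree and rebuilds
any workflow version by replaying the actions that lead to it. It is not the
VisTrails API: the action vocabulary ("add_step", "remove_step",
"set_param") and all class and method names are assumptions made for
illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Action:
    kind: str                    # "add_step", "remove_step" or "set_param"
    step: str
    param: Optional[str] = None
    value: Optional[str] = None


class ChangeLog:
    """Version tree in which each version is its parent plus one action."""

    def __init__(self) -> None:
        # Version 0 is the empty workflow: no parent and no action.
        self.nodes: List[Tuple[Optional[int], Optional[Action]]] = [(None, None)]

    def commit(self, parent: int, action: Action) -> int:
        self.nodes.append((parent, action))
        return len(self.nodes) - 1

    def materialize(self, version: int) -> Dict[str, Dict[str, str]]:
        # Walk up to the root, then replay the actions oldest first.
        chain: List[Action] = []
        current: Optional[int] = version
        while current is not None:
            parent, action = self.nodes[current]
            if action is not None:
                chain.append(action)
            current = parent
        workflow: Dict[str, Dict[str, str]] = {}
        for action in reversed(chain):
            if action.kind == "add_step":
                workflow[action.step] = {}
            elif action.kind == "remove_step":
                workflow.pop(action.step, None)
            elif action.kind == "set_param":
                workflow.setdefault(action.step, {})[action.param] = action.value
        return workflow


log = ChangeLog()
v1 = log.commit(0, Action("add_step", "Rasterize"))
v2 = log.commit(v1, Action("add_step", "MODFLOW"))
v3 = log.commit(v2, Action("set_param", "MODFLOW", "version", "1.0.2"))
v4 = log.commit(v2, Action("set_param", "MODFLOW", "version", "1.0.3"))  # sibling of v3
```

Here v3 and v4 share the history up to v2 and differ in a single recorded
action, which is the kind of information such a model makes available for
comparing workflows.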
    Koop et al. [8] focus on the problem of supporting workflow
upgrades when the software that implements a component has a
new version by suggesting how the change-based provenance
actions might be reused to upgrade other similar workflows. The
focus is on the mechanics of the upgrades, while our interest is in
supporting the iterative exploration and design of new workflows.
    Workflow variants are also explored in Experiment Lines [20].
Their focus is on the variation of models or algorithms and software
packages. In contrast, our focus is broader in that we support the
creation of workflow variants.

4   MOTIVATING SCENARIOS AND REQUIREMENTS
This section describes several scenarios where scientists iteratively
create and explore workflow variants. The scenarios use examples
from hydrology. A hydrologist uses models, often developed by
others, to estimate how much water will flow in an area. We will
consider several hydrology models in these scenarios.
MODFLOW is the U.S. Geological Survey's three-dimensional
(3D) finite-difference groundwater model that has been developed
for several years and has many versions and variants. A major
version of the core implementation is MODFLOW-2005 [7] which
simulates confined, unconfined, or a combination of confined and
unconfined groundwater-flow problems. A major variant is
MODFLOW-NWT [9] which uses a Newton-Raphson formulation.
MODFLOW has many packages that run different types of
simulations depending on the input data selected, so it needs to be
configured to use the packages needed to process the desired data.
This is done using FloPy [12], a Python package to create, run, and
post-process MODFLOW-based models. We will also use
MIKE-SHE [6], another computational hydrology model that solves for
both saturated and unsaturated zones in groundwater.
    Hydrology models need data about the area for the simulation.
For example, MODFLOW requires elevation data, in the U.S.
typically coming from the National Elevation Dataset, recharge
data, typically from the National Recharge Dataset, and the data for
the area from the Watershed Boundary Dataset.

Figure 1. A workflow diagram representing the initial workflow
W0 used in our scenarios. In scenarios S1 and S2 new variants
W1 and W2 are created by updating the MODFLOW step to use
a different version that has the same interface, and therefore has
the same overall workflow diagram shown here.

    Figure 1 shows an initial workflow W0 that uses
MODFLOW-NWT. The input data includes the boundary for the area being
studied, elevation data, and recharge (drainage) data. Area is an
input to the step Rasterize, which converts the data from a
geographic system format to raster (bitmap) and is implemented
using GDAL (Geospatial Data Abstraction Library). The step
Convert Area converts the raster data to text data (ASCII) using
NumPy, a library for array/matrix in Python. Then, the unit of
measurement for this data is converted from centimeters to meters
and the data format is converted from raster to text using the step
Convert Elevation, which is implemented using NumPy. Rasterize
Recharge, which is implemented using GDAL, generates the
recharge raster and then Convert Recharge, which is implemented
using NumPy, converts the unit of measurement of the data from
centimeters to meters and from meter/year to meter/day and from
raster format to ASCII. The simulation component is implemented
using MODFLOW-NWT Version 1.0.2, and uses FloPy to
configure it with the appropriate packages.

4.1 Case I: Same Component Interface, Different Software Version
In this case, a workflow component is replaced by another one that
uses a different version of software to implement it but the
component interface remains the same. We consider two main
scenarios for this case. One occurs when a new version of the
software used in a workflow component is released to fix errors or
bugs. The other one occurs when a new version is released to carry
out a different function.
    The first scenario S1 starts with a scientist that runs workflow
W0 several times, changing the data sets used and comparing the
results to understand how changes in the inputs influence the
results. After several weeks, the scientist notices a new release of
MODFLOW, with modifications to enhance the model outputs. So
they create an upgrade of the MODFLOW component, which
results in the creation of workflow W1, an upgrade of W0, shown in
Figure 2. In this example, from version 1.0.2 to 1.0.3 of
MODFLOW-NWT a bug was fixed in the UZF1 package that was
causing UZF1 to incorrectly calculate unsaturated-zone
evapotranspiration, which results in a much smaller value [19]. The
earlier version 1.0.1 calculated this value properly, so the bug was
introduced in version 1.0.2 but fixed in 1.0.3. In some cases,
scientists may downgrade to an earlier version because it has a
desired feature or it does not produce a wrong value introduced by
a bug in later versions but not fixed yet. The scientist may need to
discard all previous executions of W0 because the results were
incorrect due to bugs, and run them using W1 instead.
    The second scenario S2 occurs when the software in a
component is modified to carry out a different function. In this
case, from version 1.0.2 to 1.0.3 of MODFLOW-NWT there is a
major change to generate a more accurate calculation of
evapotranspiration. First, the header of the listing file that results
from running MODFLOW-NWT is changed from having a variable
"RMS" to "RMS1" and "RMS2," and from a variable "L2-NORM"
to "L2-NEW" and "L2-OLD". This change was done to improve
the calculation of the residual terms as the L2-NORM rather than
the root-mean-squared error (RMS error). This change does not
affect the format of the results, only their value to be more accurate.
Thus, the interface of the new component variant does not change,
and the newly created workflow variant W2 has the same structure
as W1 in Figure 2.

Figure 2. In scenario S1 a new upgrade W1 is created by
updating the MODFLOW component to use a different version
that has the same interface and fixes a bug. The changed
component is shown with a thicker outline. In scenario S2 a
new variant W2 is created with a similar diagram but using a
different MODFLOW component.

    The scientist needs to be able to understand the changes to the
software in new versions in order to assess the differences between
versions and estimate the effort to make the changes in the
workflow. This may require a significant effort, as this information
may be scattered across release notes, documentation, papers, web
sites, and other sources. In some cases, the scientist may be
interested in skipping ahead several versions. For example, she
may want to change from version 1.0.2 all the way to the newest
version, 1.1.3. This is challenging since the scientist needs to
track and summarize all the differences between several
consecutive versions.
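
One way to picture the summary needed when skipping several versions:
given per-release change notes, collect every entry for the releases after
the current version and up to the target version. The sketch below assumes
purely numeric dotted version strings; the function names are hypothetical,
version "1.1.0" is a made-up intermediate release, and the entries marked
"placeholder" are invented, not actual MODFLOW-NWT release notes.

```python
from typing import Dict, List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.0.3' into (1, 0, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def summarize_changes(notes: Dict[str, List[str]],
                      current: str, target: str) -> List[str]:
    """List all changes in releases with current < release <= target."""
    low, high = parse_version(current), parse_version(target)
    selected = sorted(
        (v for v in notes if low < parse_version(v) <= high),
        key=parse_version,
    )
    return [f"{v}: {change}" for v in selected for change in notes[v]]


# Illustrative notes; only the 1.0.3 entry reflects the scenario text.
notes = {
    "1.0.2": ["placeholder: change A"],
    "1.0.3": ["fix: UZF1 unsaturated-zone evapotranspiration"],
    "1.1.0": ["placeholder: change B"],
    "1.1.3": ["placeholder: change C"],
}

summary = summarize_changes(notes, "1.0.2", "1.1.3")
```

Comparing parsed tuples rather than raw strings avoids misordering releases
such as 1.0.10 and 1.0.9.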
    In addition, when changing a software version used to
implement a component, the scientist needs to check if the new
version is compatible with the software version of other
components of the same workflow. For example, a specific FloPy
version is compatible only with some MODFLOW-NWT versions.
Sometimes these incompatibilities can occur across different
workflow components, for example if two components make
different assumptions about the Newton-Raphson formulation.
This means that the scientist needs to track in detail all the software
dependencies and compatibilities across the software components
of a workflow.
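
Tracking compatibilities across components could, for instance, take the
form of an explicit table that is checked whenever the software version of
a component changes. In this sketch the table contents are invented for
illustration: the FloPy version labels "A" and "B" and their pairings with
MODFLOW-NWT versions are hypothetical, and find_incompatibilities is an
assumed helper name.

```python
from typing import Dict, List, Set, Tuple

Pin = Tuple[str, str]  # (software, version)

# Hypothetical table: for a pinned package, the versions of other
# packages it is known to work with.
COMPATIBLE: Dict[Pin, Dict[str, Set[str]]] = {
    ("FloPy", "A"): {"MODFLOW-NWT": {"1.0.2", "1.0.3"}},
    ("FloPy", "B"): {"MODFLOW-NWT": {"1.1.3"}},
}


def find_incompatibilities(workflow: Dict[str, str]) -> List[str]:
    """Report pairs of pinned versions that the table does not allow.

    `workflow` maps each software package used by the workflow to the
    version currently pinned for it.
    """
    problems: List[str] = []
    for software, version in workflow.items():
        constraints = COMPATIBLE.get((software, version), {})
        for other, allowed in constraints.items():
            used = workflow.get(other)
            if used is not None and used not in allowed:
                problems.append(
                    f"{software} {version} is not known to work with {other} {used}"
                )
    return problems


ok = find_incompatibilities({"FloPy": "A", "MODFLOW-NWT": "1.0.3"})
bad = find_incompatibilities({"FloPy": "A", "MODFLOW-NWT": "1.1.3"})
```

A table like this only captures incompatibilities that someone has already
recorded; differing assumptions between components would still need to be
described by hand.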
    Finally, the scientist may need to check that the new simulation
results do not require additional changes in the workflow steps that
use those results. In our case they were the output of the workflow,
but in other cases further adjustments may be required.
    Scenarios 1 and 2 motivate the following requirements:
    • R1 – Version descriptions need to capture useful
      metadata of the software.
    • R2 – Scientists need to understand differences in
      metadata between different software versions,
      particularly about their interfaces.
    • R3 – Scientists need to be alerted about relevant updates
      of software used in their workflows.
    • R4 – Workflow descriptions need to capture the
      software, software version, and functions used in the
      implementation of workflow components.
    • R5 – Scientists need to understand how new workflow
      variants can be used to correct errors in prior results.
    • R6 – Scientists should be able to easily replace a
      component of a workflow with a new one when the
      interfaces of the components are the same.
    • R7 – Given a software package that can be used to create
      many workflow components, scientists need to easily
      figure out how to implement new variants of a workflow
      component with newer versions of that package.
    • R8 – Scientists should be able to easily create new
      versions of workflow components and relate them to each
      other.
    • R9 – Scientists should be able to easily create new
      workflow variants and relate them to each other.
    • R10 – Scientists should be able to relate changes in
      software to specific workflow results, so it is clear how
      new software versions affect calculated variables to
      produce wrong values.
    • R11 – Version descriptions need to capture bug fixes and
      known bugs and relate them to software features and
      input and output file variables.
    • R12 – Scientists need a summarization of changes
      between a given software version and a newer version to
      understand their differences without need to understand
      the changes associated to each version in between those.
    • R13 – Scientists need to understand any incompatibilities
      between versions of different software packages and
      libraries used to implement a workflow component.
    • R14 – Scientists need to know whether a new workflow
      version or a new workflow variant is valid.

Figure 3. A workflow variant W3 derived from W2 is created
in Scenario S3 after adding well data, which requires adding the
necessary data conversion steps and also changing the
MODFLOW component to have an additional input for well
data. The new and modified components are shown with a
thicker outline.

4.2 Case II: Different Component Interface, Same Software Version
In this case, a workflow variant is created by replacing a workflow
component by another one that uses the same software
implementation but invokes a different function and as a result has
a new component interface (i.e., adding, removing or replacing
inputs or outputs). This interface change may require changes in
other steps of the workflow (e.g., adding, replacing or removing
data conversion or post-processing steps.) We consider two
scenarios for this case. One occurs when a component is changed
to use additional inputs or outputs provided by the software used to
implement it. Another occurs when a component is changed to
replace inputs or outputs or use them differently in the software
used to implement it. In both cases, the rest of the workflow may
be affected by the changes.
    Scenario S3 starts with a scientist running workflow W2. The
scientist would like to add an input regarding water elevation
through wells, so she creates a new variant of the workflow
component by adapting the MODFLOW component used in W2 by
adding a new input for well data. The well data is already provided
as an ASCII file, so unlike the elevation and recharge data there is
no need to convert wells data to ASCII. The only data preparation
needed is converting the unit of measurement from feet to meters.
To perform this change, the scientist adds Well to the workflow
inputs, adds the step Convert Well for unit conversion. This results
in workflow variant W3, shown in Figure 3. The scientist created
one new component variant and created a variant of an existing
component.
    In scenario S4, the scientist decides to include snowmelt in the
simulation. This can be done by using infiltration as an input
instead of recharge (since the infiltration package will also account
for recharge). Figure 4 shows the resulting workflow variant W4
where the recharge input of W3 is replaced with infiltration.
Additional changes include replacing the steps to prepare data for
simulation with those steps to clip and resample infiltration, and to
convert the unit of measurement in the input data from centimeter
per year to meters per day, reformatting it to ASCII format. In
addition, the MODFLOW step needs to be modified in two ways.
First, the recharge input needs to be replaced with infiltration input.
Second, the FloPy software configures MODFLOW to use the
infiltration packages. In total, the scientist created five new
components and created a variant of an existing component.

Figure 4. A workflow variant W4 derived from W3 is created
in Scenario S4 after replacing the input for recharge data with
infiltration data, which requires adding the necessary data
conversion steps and also creating a new MODFLOW
component variant to make the well data compatible. The five
new components and the modified one are shown with a thicker
outline.

    There are several important tasks that the scientist needs to
address in these two scenarios.
    Before creating the data preparation components for W3 and W4
the scientist has to find whether components that already do those
conversions are available or not. Reusing components saves time,
but after spending many years running similar workflows with
similar data it may be hard to remember which components have
been created before. Furthermore, in addition to reusing
components it may be possible to reuse entire sub-workflows. In
our example, the sub-workflow to prepare infiltration data has five
steps that can be reused together.
    Another important task is to compare the results of different
workflow variants. For example, a scientist would run W3 and W4
and compare the results to each other and to W2 to understand how
changes in the workflows affect the simulation results.
    Scenarios 3 and 4 motivate these additional requirements:
    • R15 – Scientists need to easily find software packages
      and workflow components that are appropriate to process
      a specific type of data input.
    • R16 – Scientists need to easily find workflow
      components for data conversion.
    • R17 – Scientists need to be able to understand the
      differences between two workflow variants.

4.3 Case III: Alternative Component, Different Software
In this case, a workflow variant is created by replacing a workflow
component by a component that does an equivalent function but is
implemented using a different software. There are several reasons
to create workflow variants that use equivalent software, such as
testing different models or taking into account parameters that are
ignored by the current model used in a workflow. The new
component may have a very different interface from the previous

one, thus requiring a major update of the workflow to create, replace, or
remove several data preparation or post-processing steps. Note that although
the tasks to create the workflow variant may be similar to those in
scenarios S3 and S4, now there are additional tasks in finding out
information about the new software to check its functionality and analyze
how it fits into the workflow and the overall exploration that the scientist
is doing. We consider two scenarios. One occurs when the scientist already
knows which alternative method to use. The other occurs when the scientist
is considering more than one method and needs to find and compare their
assumptions, their functionalities, and the effort required to change the
workflow in order to decide which one to use.
    Scenario S5 has a scientist who is concerned about MODFLOW-NWT only
solving for saturated zones, so some parameters are simplified or even
ignored. The scientist would like to use a different method to solve for the
unsaturated zones, and has heard that the MIKE-SHE model does a similar
simulation to MODFLOW, but is a fully coupled and integrated surface water
and ground water model that considers parameters regarding unsaturated
zones. The inputs and outputs for MIKE-SHE are different from those of
MODFLOW. MIKE-SHE uses area data in the same raw format that is provided by
the data source, so it does not need to be pre-processed. It also uses
topography data and can ingest formats very similar to the data source, so
the data only needs to be clipped. MIKE-SHE also requires several new
inputs, namely rainfall, evaporation, and temperature, all in the format
provided by data sources so they only need to be clipped. MIKE-SHE also
generates several separate outputs, including files associated with the
simulation (SHERES), a binary output file containing all the static
information on the simulation (FRF), and other results stored in a series of
DFS0, DFS2 and DFS3 files. As for the MIKE-SHE component, there is no need
to use the FloPy software to implement it. Figure 5 shows workflow variant
W5 created from W4: the scientist had to create three new data
pre-processing components, one new component for MIKE-SHE, and one new
component to combine the simulation results to obtain a format that is
comparable to W4 and to all the previous workflow variants. Many components
from W4 were discarded as they were no longer needed.

  Figure 5. Workflow variant W5 using the MIKE-SHE hydrological model. The
  area data can be used raw, and the component to clip elevation data can be
  reused from workflow W4. The five new components and the modified one are
  shown with a thicker outline. Several components from W4 were removed as
  they were no longer needed.

    In scenario S6, the scientist decides to investigate other models, since
MIKE-SHE is a commercial, proprietary model. There are many other hydrology
models available, including PIHM [13], TopoFlow [14], VIC [15], and dozens
of others available in repositories such as CSDMS [16]. The scientist starts
to investigate which models produce interesting simulation results, and
considers how much effort is required to locate the data required by each
model, to develop the data pre-processing components needed, and to install
and run each of these models. The scientist finds out that PIHM provides
HydroTerre [17], a comprehensive data repository that already provides data
in the required format, and the PIHM-GIS software to visualize simulation
results [18]. The scientist also finds that PIHM requires a solver in order
to run, so the simulation component needs to include the solver software in
addition to PIHM. The scientist develops a workflow variant W6 that includes
new components implemented using PIHM, HydroTerre, and PIHM-GIS.
    It is worth highlighting several important tasks done by the scientist
in these scenarios. In both scenarios, but particularly in scenario S6, the
scientist needs to compare how two models are similar and how they differ in
terms of the input data that they use and the output data that they
generate. The documentation of models always includes details of the input
and output requirements in terms of files and formats. The scientist will
want to understand conceptually how the models work in terms of the physical
variables used or generated in the model. That is, understanding the inputs
and outputs at the file and format level is important, but understanding how
model variables map to each of the files is also necessary. This information
is usually not included in the software documentation, but in the
publications associated with the model. The scientist will need to consult a
variety of sources in order to understand how different models compare [10].
    Another important task is to understand the assumptions made by the
different models. For example, in hydrology some models may assume the
Navier-Stokes equations for fluid motion, while others do not. These
assumptions are often not captured in the descriptions of workflow
components, which focus on the models as software artifacts rather than
research artifacts.
    In addition, after creating and running the new workflow variants W5 and
W6, the scientist will want to compare their results to the results obtained
with previous workflows W4, W3, and earlier ones. This requires that the
scientist understand how the model results are related to one another, which
in turn requires understanding what modeling variables are generated and
included in the simulation outputs.
    Scenarios S5 and S6 motivate the following additional requirements:
     R18 – Version descriptions need to capture assumptions used in software.
     R19 – Workflow components, inputs, outputs or parameters in new workflow
      variants that are no longer needed need to be removed.
                                                 Table 1. Summary of requirements from cases.
 Category         Requirement                                                                                                      Cases
 Workflow         R1 – Version descriptions need to capture useful metadata of the software.                                       C1, C2, C3
 component        R2 – Scientists need to understand differences in metadata between different software versions, particularly     C1, C3
 metadata         about their interfaces.
                  R3 – Scientists need to be alerted about relevant updates of software used in their workflows.                   C1
                  R4 – Workflow descriptions need to capture the software, software version, and functions used in the             C1, C2, C3
                  implementation of workflow components.
                  R8 – Scientists should be able to easily create new variants of workflow components and relate them to           C1, C2, C3
                  each other.
                  R9 – Scientists should be able to easily create new workflow variants and relate them to each other.             C1, C2, C3
                  R10 – Scientists should be able to relate changes in software to specific workflow results, so it is clear how   C1
                  new software versions affect calculated variables to produce wrong values.
                  R11 – Version descriptions need to capture bug fixes and known bugs and relate them to software features         C1
                  and input and output file variables.
                  R12 – Scientists need a summarization of changes between a given software version and a newer version            C1
                  to understand their differences without need to understand the changes associated to each version in
                  between those.
                  R13 – Scientists need to understand any incompatibilities between versions of different software packages        C1, C2, C3
                  and libraries used to implement a workflow component.
                  R18 – Version descriptions need to capture assumptions used in software.                                         C1, C2, C3
 Workflow         R6 – Scientists should be able to easily replace a component of the workflow with a new one when the             C1
 updates          interfaces of the components are the same.
                  R7 – Given a software package that can be used to create many workflow components, scientists need to            C1
                  easily figure out how to implement new variants of a workflow component with newer versions of that
                  package.
                  R14 – Scientists need to know whether a new workflow version or a new workflow variant is valid.                 C2, C3
                  R15 – Scientists need to easily find software packages and workflow components that are appropriate to           C1, C2, C3
                  process a specific type of data input.
                  R16 – Scientists need to easily find workflow components for data conversion.                                    C2, C3
                  R19 – Workflow components, inputs, outputs or parameters in new workflow variants that are no longer             C3
                  needed need to be removed.
                  R20 – Scientists need to assess and compare the effort in creating new workflow variants that represent a        C3
                  significant departure from previous ones.
                  R21 – Scientists need to find and compare equivalent computational models, including their inputs,               C3
                  outputs, model variables, data formats, and assumptions.
 Workflow         R5 – Scientists need to understand how new workflow variants can be used to correct errors in prior results.     C1, C2, C3
 comparisons      R17 – Scientists need to be able to understand the differences between two workflow variants.                    C1, C2, C3
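
Several requirements in Table 1 (for example R2, R6, R17, and R21) depend on being able to state how the interfaces and assumptions of two component variants differ. The following minimal Python sketch illustrates one possible way to do this with plain set operations. The ComponentVariant class and the compare_variants function are hypothetical names introduced only for this illustration and do not belong to any existing workflow system. The MIKE-SHE inputs and outputs follow scenario S5, while the MODFLOW-NWT values are placeholders chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentVariant:
    """Hypothetical description of one variant of a workflow component."""
    name: str
    software: str
    inputs: frozenset
    outputs: frozenset
    assumptions: frozenset = frozenset()


def compare_variants(old, new):
    """Summarize how `new` differs from `old` in interface and assumptions."""
    return {
        # R6: a drop-in replacement is only possible when interfaces match.
        "same_interface": old.inputs == new.inputs and old.outputs == new.outputs,
        "inputs_added": sorted(new.inputs - old.inputs),
        "inputs_removed": sorted(old.inputs - new.inputs),
        "outputs_added": sorted(new.outputs - old.outputs),
        "outputs_removed": sorted(old.outputs - new.outputs),
        # Assumptions stated for exactly one of the two variants.
        "assumptions_changed": sorted(old.assumptions ^ new.assumptions),
    }


# Placeholder values for illustration only, not the model's actual interface.
modflow = ComponentVariant(
    name="groundwater_simulation",
    software="MODFLOW-NWT",
    inputs=frozenset({"area", "topography"}),
    outputs=frozenset({"head"}),
    assumptions=frozenset({"saturated zone only"}),
)

# Inputs and outputs as described in scenario S5.
mike_she = ComponentVariant(
    name="groundwater_simulation",
    software="MIKE-SHE",
    inputs=frozenset({"area", "topography", "rainfall", "evaporation", "temperature"}),
    outputs=frozenset({"SHERES", "FRF", "DFS0", "DFS2", "DFS3"}),
    assumptions=frozenset({"coupled surface water and ground water"}),
)

report = compare_variants(modflow, mike_she)
for key, value in report.items():
    print(f"{key}: {value}")
```

In this sketch the added inputs (rainfall, evaporation, temperature) indicate which new data preparation steps a variant such as W5 would need, and the removed outputs indicate which downstream components may no longer apply. A real system would also need to compare formats and model variables, which plain name matching does not capture.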

     R20 – Scientists need to assess and compare the effort in creating new
      workflow variants that represent a significant departure from previous
      ones.
     R21 – Scientists need to find and compare equivalent computational
      models, including their inputs, outputs, model variables, data formats,
      and assumptions.

4.4 Requirements summary
The requirements of the previous scenarios can be grouped into three main
categories:
     Workflow component metadata, which tackles the representation and
      metadata of workflow components regarding their interface,
      functionalities and assumptions, and implementation using software
      packages and libraries. This metadata would also represent the
      characteristics of the different versions of the software and the
      different versions and variations of a given workflow component.
     Workflow updates, which address the creation of new workflow variants
      by replacing, adding, or removing workflow components, the propagation
      of the effects of those changes throughout the structure of the
      workflow, and the validation of the new workflow variants.
     Workflow comparisons, which address the comparison between different
      software versions, software packages, workflow variants and workflow
      runs.
    Table 1 summarizes the requirements introduced in this section, pointing
out the broad categories they belong to and the cases where they occur.
Although we adopt the hydrology domain in our scenarios to illustrate the
requirements, our requirements are domain-independent. Workflows in any
domain have pre-processing steps, post-processing steps, and major analytic
steps [3]. In the case of hydrology, the analytic steps are done using
different hydrology models. Other sciences use algorithms rather than
models. For example, different clustering algorithms or sequence alignment
algorithms would be used in genomics. The requirements outlined here are
generally applicable to other domains.

5   DISCUSSION AND FUTURE RESEARCH
Given the state of the art and the requirements from the scenarios, we
outline possible research directions for future work:
1. Describing workflow components and their underlying software. This
    includes the creation of new ontologies and the adaptation of existing
    ones to capture information about software versions and variants,
    including software interfaces and features. OntoSoft [5] is an ontology
    that might be extended to capture relevant information about software
    versions and variants. It is also important to integrate these
    ontologies with workflow systems to describe workflow components.
    Another area of work is to use them to support the creation of workflow
    variants.
2. Managing and tracking workflow variants and their differences. This
    includes how to compare workflow components and workflow variants
    regarding their interfaces and functions, and how to present these
    results in a useful way for scientists to understand their differences
    and the implications for experiment results. A possible approach is
    using multimedia narratives that combine text, graphics, and
    visualizations to explain the similarities and differences between
    software versions, software variants, workflow versions, or
    functions/methods. More importantly, these narratives should be easily
    customized to the reader’s level of expertise and interest. As a
    starting point, our approach may be based on an existing approach for
    data narrative generation [4]. Another research area is to manage
    histories of creation and evolution of workflow variants, and doing so
    across many users that may benefit from reusing segments or traversals
    across users.
3. Designing an interactive framework to support scientists in the
    exploration and experimentation process through workflow variants. This
    includes how to leverage workflow reuse and composition to support the
    creation of workflow variants. For example, given a new component that
    needs to replace an existing one in a workflow, suggest what other
    components may need to be added or removed from the workflow. Other
    research would involve mechanisms to identify critical and non-critical
    components in workflows. The critical and non-critical components could
    be associated with abstractions defined as motifs [3].

6   CONCLUSIONS
This paper discusses the need to support scientists in exploring different
experiment designs over time. We presented several scenarios where an
initial workflow is modified to create workflow variants by replacing,
adding or removing workflow steps. We described the requirements of these
scenarios, and grouped them into three categories: workflow component
metadata, workflow updates, and workflow comparisons. We also discussed
major research directions to address those requirements, including improved
frameworks for describing workflow components and the associated software,
for managing and tracking workflow variants, and for supporting scientists
in the iterative exploration and experimentation process through workflow
variants.

Acknowledgments. This work was supported in part by the US National Science
Foundation under awards ICER-1440323 and ICER-1632211 (EarthCube RCN
IS-GEO), and in part by the Sao Paulo Research Foundation (FAPESP) under
grants 2017/03570-3, 2014/23861-4 and 2013/08293-7.

REFERENCES
 [1] Altintas, I., Barney, O., and Jaeger-Frank, E. (2006). Provenance collection support in the Kepler scientific workflow system. International Provenance and Annotation Workshop (IPAW), pages 118–132. Springer.
 [2] Freire, J., Silva, C. T., Callahan, S. P., Santos, E., Scheidegger, C. E., and Vo, H. T. (2006). Managing rapidly-evolving scientific workflows. In Provenance and Annotation of Data, pages 10–18. Springer.
 [3] Garijo, D., Alper, P., Belhajjame, K., Corcho, O., Gil, Y., & Goble, C. (2014). Common motifs in scientific workflows: An empirical analysis. Future Generation Computer Systems, 36, 338-351.
 [4] Gil, Y. and Garijo, D. (2017). Towards Automating Data Narratives. In Proceedings of the Twenty-Second ACM International Conference on Intelligent User Interfaces (IUI-17), Limassol, Cyprus.
 [5] Gil, Y., Ratnakar, V., & Garijo, D. (2015). OntoSoft: Capturing scientific software metadata. Proceedings of the 8th International Conference on Knowledge Capture (K-CAP), 2015.
 [6] Graham, D. N., & Butts, M. B. (2005). Flexible, integrated watershed modelling with MIKE-SHE. Watershed models, 849336090, 245-272.
 [7] Harbaugh, A. W. MODFLOW-2005, the US Geological Survey modular ground-water model: the ground-water flow process. Reston: US Department of the Interior, US Geological Survey, 2005.
 [8] Koop, D., Scheidegger, C. E., Freire, J., & Silva, C. T. (2011). The Provenance of Workflow Upgrades. Third International Provenance and Annotation Workshop (IPAW), Vol. 6378. Springer.
 [9] Niswonger, R.G., Panday, Sorab, and Ibaraki, Motomu, 2011, MODFLOW-NWT, A Newton formulation for MODFLOW-2005: U.S. Geological Survey Techniques and Methods 6-A37, 44 p.
[10] Essawy, B. T.; Goodall, J. L.; Xu, H.; and Gil, Y. Evaluation of the OntoSoft Ontology for Describing Legacy Hydrologic Modeling Software. Environmental Modelling & Software, 92. 2017.
[11] Wolstencroft, K., Haines, R., Fellows, D., Williams, A., Withers, D., Owen, S., Soiland-Reyes, S., Dunlop, I., Nenadic, A., Fisher, P., Bhagat, J., Belhajjame, K., Bacall, F., Hardisty, A., Nieva de la Hidalga, A., Balcazar Vargas, M. P., Sufi, S., and Goble, C. (2013). The Taverna workflow suite: designing and executing workflows of web services on the desktop, web or in the cloud. Nucleic Acids Research, 41(W1). W557–W561.
[12] Bakker, M., Post, V., Langevin, C.D., Hughes, J.D., White, J.T., Starn, J.J., and Fienen, M.N., 2016, FloPy v3.2.6: U.S. Geological Survey Software Release, 19 March 2017, http://dx.doi.org/10.5066/F7BK19FH.
[13] Qu, Y. and C. J. Duffy. "A semidiscrete finite volume formulation for multiprocess watershed simulation." Water Resources Research 43(8), 2007.
[14] Peckham, S. D. Geomorphometry and spatial hydrologic modelling. In Geomorphometry: Concepts, Software, Applications, Developments in Soil Science, vol. 33, edited by S. D. Peckham, pp. 579–602, Elsevier.
[15] Liang, X., D. P. Lettenmaier, E. F. Wood, and S. J. Burges, 1994: A Simple Hydrologically-Based Model of Land Surface Water and Energy Fluxes for GSMs, J. Geophys. Res., 99(D7), 14,415-14,428.
[16] Community Surface Dynamics Modeling System (CSDMS) Model Repository. Available from http://csdms.colorado.edu/wiki/Model_download_portal.
[17] HydroTerre Data Services. Available from http://www.hydroterre.psu.edu.
[18] The Pennsylvania Integrated Hydrology Model GIS Interface (PIHMgis), http://www.pihm.psu.edu/pihmgis_home.html
[19] USGS. MODFLOW-NWT Release Notes. https://water.usgs.gov/ogw/modflow-nwt/Release.txt
[20] Marinho, A., de Oliveira, D., Ogasawara, E., Silva, V., Ocaña, K., Murta, L., Braganholo, V. and Mattoso, M., 2017. Deriving scientific workflows from algebraic experiment lines: A practical approach. Future Generation Computer Systems, 68, pp.111-127.