=Paper=
{{Paper
|id=Vol-2065/paper07
|storemode=property
|title=Requirements for Facilitating the Continuous Creation of Scientific Workflow Variants
|pdfUrl=https://ceur-ws.org/Vol-2065/paper07.pdf
|volume=Vol-2065
|authors=Lucas A. M. C. Carvalho,Daniel Garijo,Bakinam T. Essawy,Claudia Bauzer Medeiros,Yolanda Gil
|dblpUrl=https://dblp.org/rec/conf/kcap/CarvalhoGEMG17
}}
==Requirements for Facilitating the Continuous Creation of Scientific Workflow Variants==
Requirements for Supporting the Iterative Exploration of Scientific Workflow Variants

Lucas A. M. C. Carvalho<sup>1</sup>, Bakinam T. Essawy<sup>2</sup>, Daniel Garijo<sup>3</sup>, Claudia Bauzer Medeiros<sup>1</sup>, Yolanda Gil<sup>3</sup>

<sup>1</sup> University of Campinas, Institute of Computing, Campinas, SP, Brazil

<sup>2</sup> University of Virginia, Department of Civil and Environmental Engineering, Charlottesville, VA, U.S.A

<sup>3</sup> University of Southern California, Information Sciences Institute, Marina del Rey, CA, U.S.A

lucas.carvalho@ic.unicamp.br, bte2rn@virginia.edu, dgarijo@isi.edu, cmbm@ic.unicamp.br, gil@isi.edu

K-CAP2017 Workshops and Tutorials Proceedings, SciKnow’2017, Austin, Texas, USA. © Copyright held by the owner/author(s).

===ABSTRACT===

Workflow systems support scientists in capturing computational experiments and managing their execution. However, such systems are not designed to help scientists create and track the many related workflows that they build as variants, trying different software implementations and distinct ways to process data and deciding what to do next by looking at previous workflow results. An initial workflow will be changed to create many new variants thereof that differ from each other in one or more steps. Our goal is to support scientists in the iterative design of computational experiments by assisting them in the creation and management of workflow variants. In this paper, we present several use cases for creating workflow variants in hydrology, from which we specify requirements for workflow variants. We also discuss major research directions to address these requirements.

===CCS CONCEPTS===

• Information systems → Artificial intelligence; Knowledge representation and reasoning

===KEYWORDS===

Scientific workflows, workflow variants, computational experiments

===1 INTRODUCTION===

Scientific workflow systems play a major role in supporting scientists to design, document and execute their computational experiments, automatically tracking provenance during the workflow execution [11; 1]. Scientists follow an iterative exploratory cycle where they often create an initial workflow, and then explore variations of it using different data, replacing some of the software steps, or adding new steps. Sometimes workflows have to be modified because of changes in data (e.g., when datasets are updated with new formats) or software (e.g., software is no longer available, a newer version is better).

In current workflow systems, scientists manage this exploratory process manually. Updating a workflow is a complex and time-consuming task that may involve several steps, and may require tracking down information about different versions of the software used in the workflow.

This paper presents use cases and their requirements to support scientists in the process of exploring different variations of an original workflow, and introduces research directions to address these requirements. These scenarios are based on discussion with domain scientists, particularly in hydrology and bioinformatics.

===2 WORKFLOW VARIANTS===

Computational workflows describe the computational steps and the dataflow among them to perform complex multi-step analyses. The steps are implemented by software components (or workflow components) that process data. A software component has a well-defined interface consisting of input and output files as well as parameter constant values. The dataflow between components is captured as connections among their respective interfaces. A workflow component may be implemented by a scientist, for example a routine to check for erroneous sensor readings. A workflow component may also be implemented using third-party software, for example invoking a linear regression function from a machine learning software package.

A workflow component can be updated in two ways. In some cases, a new upgrade of the component is created to override a previous one, for example in cases where the underlying software was corrected to fix a bug. In other cases, a new variant of the component is created with new inputs or outputs or other modifications, where the previous versions are still valid and available to the user to use in workflows. Workflow executions are the result of running workflows and provide provenance for the newly generated data products.

After running a certain workflow, a scientist may want to explore a workflow variant that represents a variation of an existing workflow that was run earlier where one or more steps are changed. That step change may require changing other steps that may be affected. In other cases, the scientist may create a new workflow upgrade of a previously run workflow that simply replaces a component by a new one with a bug fix. When a workflow is upgraded, the scientist may need to redo previous runs.

Due to the exploratory nature of science, a scientist may start with an initial workflow and iteratively create many workflow variants. During this process, the scientist will want to consider different designs of variants, compare any given variant with previous ones, and synthesize the results of several variants with comparable settings. This iterative process of creating and managing workflow variants is currently not well supported. There are several reasons why a scientist may create a workflow variant:

1. New versions of the software used in the workflow components are released. These may add new functionality that could be useful for the investigation. These may also correct errors or fix bugs, and in that case the scientist may be interested in checking whether the results obtained with the older version still hold.

2. New possible models or algorithms become available. The scientist may discover these through online search, reading articles, or talking with colleagues. These open new possibilities for exploring alternative designs of the workflow.

3. New datasets become available to the scientist. In this case, the scientist may want to change their workflow to incorporate that new kind of data.

Sometimes the explorations are due to a combination of these. For example, new software versions may fix errors and offer new functionality that allows the scientist to use new kinds of data.

===3 RELATED WORK===

There have been several efforts to keep track of and manage workflow updates and versions. VisTrails [2] tracks the evolution of workflows using a change-based provenance model that records information about modifications to workflow components, inputs, outputs and parameters. It compares results of workflow executions using visualizations. However, this approach focuses on capturing changes and comparing workflows, while we are interested in supporting the process of designing, creating, and managing workflow variants.

Koop et al. [8] focus on the problem of supporting workflow upgrades when the software that implements a component has a new version by suggesting how the change-based provenance actions might be reused to upgrade other similar workflows. The focus is on the mechanics of the upgrades, while our interest is on supporting the iterative exploration and design of new workflows.

Workflow variants are also explored in Experiment Lines [20]. Their focus is on the variation of models or algorithms and software packages. In contrast, our focus is broader in that we support the creation of workflow variants.

Figure 1. A workflow diagram representing the initial workflow W0 used in our scenarios.
In scenarios S1 and S2, new variants W1 and W2 are created by updating the MODFLOW step to use a different version that has the same interface, and they therefore have the same overall workflow diagram shown here.

===4 MOTIVATING SCENARIOS AND REQUIREMENTS===

This section describes several scenarios where scientists iteratively create and explore workflow variants. The scenarios use examples from hydrology. A hydrologist uses models, often developed by others, to estimate how much water will flow in an area. We will consider several hydrology models in these scenarios.

MODFLOW is the U.S. Geological Survey's three-dimensional (3D) finite-difference groundwater model that has been developed for several years and has many versions and variants. A major version of the core implementation is MODFLOW-2005 [7], which simulates confined, unconfined, or a combination of confined and unconfined groundwater-flow problems. A major variant is MODFLOW-NWT [9], which uses a Newton-Raphson formulation. MODFLOW has many packages that run different types of simulations depending on the input data selected, so it needs to be configured to use the packages needed to process the desired data. This is done using FloPy [12], a Python package to create, run, and post-process MODFLOW-based models. We will also use MIKE-SHE [6], another computational hydrology model that solves for both saturated and unsaturated zones in groundwater.

Hydrology models need data about the area for the simulation. For example, MODFLOW requires elevation data, in the U.S. typically coming from the National Elevation Dataset, recharge data, typically from the National Recharge Dataset, and the data for the area from the Watershed Boundary Dataset.

Figure 1 shows an initial workflow W0 that uses MODFLOW-NWT. The input data includes the boundary for the area being studied, elevation data, and recharge (drainage) data. Area is an input to the step Rasterize, which converts the data from a geographic system format to raster (bitmap) and is implemented using GDAL (Geospatial Data Abstraction Library). The step Convert Area converts the raster data to text data (ASCII) using NumPy, a Python library for arrays and matrices. Then, the unit of measurement for this data is converted from centimeters to meters and the data format is converted from raster to text using the step Convert Elevation, which is implemented using NumPy. Rasterize Recharge, which is implemented using GDAL, generates the recharge raster and then Convert Recharge, which is implemented using NumPy, converts the unit of measurement of the data from centimeters to meters and from meter/year to meter/day and from raster format to ASCII. The simulation component is implemented using MODFLOW-NWT Version 1.0.2, and uses FloPy to configure it with the appropriate packages.

====4.1 Case I: Same Component Interface, Different Software Version====

In this case, a workflow component is replaced by another one that uses a different version of software to implement it but the component interface remains the same. We consider two main scenarios for this case. One occurs when a new version of the software used in a workflow component is released to fix errors or bugs. The other one occurs when a new version is released to carry out a different function.

The first scenario S1 starts with a scientist who runs workflow W0 several times, changing the data sets used and comparing the results to understand how changes in the inputs influence the results. After several weeks, the scientist notices a new release of MODFLOW, with modifications to enhance the model outputs. So they create an upgrade of the MODFLOW component, which results in the creation of workflow W1, an upgrade of W0, shown in Figure 2. In this example, from version 1.0.2 to 1.0.3 of MODFLOW-NWT a bug was fixed in the UZF1 package that was causing UZF1 to incorrectly calculate unsaturated-zone evapotranspiration, which results in a much smaller value [19]. The earlier version 1.0.1 calculated this value properly, so the bug was introduced in version 1.0.2 but fixed in 1.0.3. In some cases, scientists may downgrade to an earlier version because it has a desired feature or it does not produce a wrong value introduced by a bug in later versions but not fixed yet. The scientist may need to discard all previous executions of W0 because the results were incorrect due to bugs, and run them using W1 instead.

The second scenario S2 occurs when the software in a component is modified to carry out a different function. In this case, from version 1.0.2 to 1.0.3 of MODFLOW-NWT there is a major change to generate a more accurate calculation of evapotranspiration. First, the header of the listing file that results from running MODFLOW-NWT is changed from having a variable "RMS" to "RMS1" and "RMS2", and from a variable "L2-NORM" to "L2-NEW" and "L2-OLD". This change was done to improve the calculation of the residual terms as the L2-NORM rather than the root-mean-squared error (RMS error). This change does not affect the format of the results, only their values, which become more accurate. Thus, the interface of the new component variant does not change, and the newly created workflow variant W2 has the same structure as W1 in Figure 2.

Figure 2. In scenario S1 a new upgrade W1 is created by updating the MODFLOW component to use a different version that has the same interface and fixes a bug. The changed component is shown with a thicker outline. In scenario S2 a new variant W2 is created with a similar diagram but using a different MODFLOW component.

The scientist needs to be able to understand the changes to the software in new versions in order to assess the differences between versions and estimate the effort to make the changes in the workflow. This may require a significant effort, as this information may be scattered across release notes, documentation, papers, web sites, and other sources.

In some cases, the scientist may be interested in skipping ahead several versions. For example, she may want to change from version 1.0.2 all the way to the newest version that is 1.1.3. This is challenging since the scientist needs to track and summarize all the differences between several consecutive versions.

In addition, when changing a software version used to implement a component, the scientist needs to check if the new version is compatible with the software version of other components of the same workflow. For example, a specific FloPy version is compatible only with some MODFLOW-NWT versions. Sometimes these incompatibilities can occur across different workflow components, for example if two components make different assumptions about the Newton-Raphson formulation. This means that the scientist needs to track in detail all the software dependencies and compatibilities across the software components of a workflow.

Finally, the scientist may need to check that the new simulation results do not require additional changes in the workflow steps that use those results. In our case they were the output of the workflow, but in other cases further adjustments may be required.

Scenarios S1 and S2 motivate the following requirements:

R1 – Version descriptions need to capture useful metadata of the software.

R2 – Scientists need to understand differences in metadata between different software versions, particularly about their interfaces.

R3 – Scientists need to be alerted about relevant updates of software used in their workflows.
R4 – Workflow descriptions need to capture the software, software version, and functions used in the implementation of workflow components.

R5 – Scientists need to understand how new workflow variants can be used to correct errors in prior results.

R6 – Scientists should be able to easily replace a component of a workflow with a new one when the interfaces of the components are the same.

R7 – Given a software package that can be used to create many workflow components, scientists need to easily figure out how to implement new variants of a workflow component with newer versions of that package.

R8 – Scientists should be able to easily create new versions of workflow components and relate them to each other.

R9 – Scientists should be able to easily create new workflow variants and relate them to each other.

R10 – Scientists should be able to relate changes in software to specific workflow results, so it is clear how new software versions affect calculated variables to produce wrong values.

R11 – Version descriptions need to capture bug fixes and known bugs and relate them to software features and input and output file variables.

R12 – Scientists need a summarization of changes between a given software version and a newer version to understand their differences without need to understand the changes associated to each version in between those.

R13 – Scientists need to understand any incompatibilities between versions of different software packages and libraries used to implement a workflow component.

R14 – Scientists need to know whether a new workflow version or a new workflow variant is valid.

====4.2 Case II: Different Component Interface, Same Software Version====

In this case, a workflow variant is created by replacing a workflow component by another one that uses the same software implementation but invokes a different function and as a result has a new component interface (i.e., adding, removing or replacing inputs or outputs). This interface change may require changes in other steps of the workflow (e.g., adding, replacing or removing data conversion or post-processing steps). We consider two scenarios for this case. One occurs when a component is changed to use additional inputs or outputs provided by the software used to implement it. Another occurs when a component is changed to replace inputs or outputs or use them differently in the software used to implement it. In both cases, the rest of the workflow may be affected by the changes.

Scenario S3 starts with a scientist running workflow W2. The scientist would like to add an input regarding water elevation through wells, so she creates a new variant of the workflow component by adapting the MODFLOW component used in W2 by adding a new input for well data. The well data is already provided as an ASCII file, so unlike the elevation and recharge data there is no need to convert wells data to ASCII. The only data preparation needed is converting the unit of measurement from feet to meters. To perform this change, the scientist adds Well to the workflow inputs, adds the step Convert Well for unit conversion. This results in workflow variant W3, shown in Figure 3. The scientist created one new component variant and created a variant of an existing component.

Figure 3. A workflow variant W3 derived from W2 is created in Scenario S3 after adding well data, which requires adding the necessary data conversion steps and also changing the MODFLOW component to have an additional input for well data. The new and modified components are shown with a thicker outline.

In scenario S4, the scientist decides to include snowmelt in the simulation. This can be done by using infiltration as an input instead of recharge (since the infiltration package will also account for recharge). Figure 4 shows the resulting workflow variant W4 where the recharge input of W3 is replaced with infiltration. Additional changes include replacing the steps to prepare data for simulation with those steps to clip and resample infiltration, and to convert the unit of measurement in the input data from centimeter per year to meters per day, reformatting it to ASCII format. In addition, the MODFLOW step needs to be modified in two ways. First, the recharge input needs to be replaced with infiltration input. Second, the FloPy software configures MODFLOW to use the infiltration packages. In total, the scientist created five new components and created a variant of an existing component.

Figure 4. A workflow variant W4 derived from W3 is created in Scenario S4 after replacing the input for recharge data with infiltration data, which requires adding the necessary data conversion steps and also creating a new MODFLOW component variant to make the well data compatible. The five new components and the modified one are shown with a thicker outline.

There are several important tasks that the scientist needs to address in these two scenarios. Before creating the data preparation components for W3 and W4 the scientist has to find whether components that already do those conversions are available or not. Reusing components saves time, but after spending many years running similar workflows with similar data it may be hard to remember which components have been created before. Furthermore, in addition to reusing components it may be possible to reuse entire sub-workflows. In our example, the sub-workflow to prepare infiltration data has five steps that can be reused together.

Another important task is to compare the results of different workflow variants. For example, a scientist would run W3 and W4 and compare the results to each other and to W2 to understand how changes in the workflows affect the simulation results.

Scenarios 3 and 4 motivate these additional requirements:

R15 – Scientists need to easily find software packages and workflow components that are appropriate to process a specific type of data input.

R16 – Scientists need to easily find workflow components for data conversion.

R17 – Scientists need to be able to understand the differences between two workflow variants.

====4.3 Case III: Alternative Component, Different Software====

In this case, a workflow variant is created by replacing a workflow component by a component that does an equivalent function but is implemented using a different software. There are several reasons to create workflow variants that use equivalent software, such as testing different models or taking into account parameters that are ignored by the current model used in a workflow. The new component may have a very different interface from the previous one, thus requiring a major update of the workflow to create, replace, or remove several data preparation or post-processing steps. Note that although the tasks to create the workflow variant may be similar to those in scenarios 3 and 4, now there are additional tasks in finding out information about the new software to check its functionality and analyze how it fits into the workflow and the overall exploration that the scientist is doing. We consider two scenarios. One occurs when the scientist already knows which alternative method to use. Another one occurs when the scientist needs to find and compare the assumptions, functionalities and the effort to change the workflow when considering more than one method to decide which one to use.

Scenario S5 has a scientist who is concerned about MODFLOW-NWT only solving for saturated zones, so some parameters are simplified or even ignored. The scientist would like to use a different method to solve for the unsaturated zones, and has heard that the MIKE-SHE model does a similar simulation to MODFLOW, but is a fully coupled and integrated surface water and ground water model that considers parameters regarding unsaturated zones. The inputs and outputs for MIKE-SHE are different from MODFLOW. MIKE-SHE uses area data in the same raw format that is provided by the data source, so it does not need to be pre-processed. It also uses topography data and can ingest formats very similar to the data source, so the data only needs to be clipped. MIKE-SHE also requires several new input data, namely rainfall, evaporation, and temperature, all in the format provided by data sources so they only need to be clipped. MIKE-SHE also generates several separate outputs, including files associated with the simulation (SHERES), a binary output file containing all the static information on the simulation (FRF), and other results stored in a series of DFS0, DFS2 and DFS3 files. As for the MIKE-SHE component, there is no need to use the FloPy software to implement it. Figure 5 shows workflow variant W5 created from W4: the scientist had to create three new data pre-processing components, one new component for MIKE-SHE, and one new component to combine the simulation results to obtain a format that is comparable to W4 and to all the previous workflow variants. Many components from W4 were discarded as they were no longer needed.

Figure 5. Workflow variant W5 using the MIKE-SHE hydrological model. The area data can be used raw, and the component to clip elevation data can be reused from W4. The five new components and the modified one are shown with a thicker outline. Several components from W4 were removed as they were no longer needed.

In scenario S6, the scientist decides to investigate other models, since MIKE-SHE is a commercial, proprietary model. There are many other hydrology models available, including PIHM [13], TopoFlow [14], VIC [15], and dozens of others available in repositories such as CSDMS [16]. The scientist starts to investigate which models produce interesting simulation results, and considers how much effort is required to locate the data required by each model, to develop the data pre-processing components needed, and to install and run each of these models. The scientist finds out that PIHM provides Hydroterre [17], a comprehensive data repository that already provides data in the required format, and a PIHM-GIS software to visualize simulation results [18]. The scientist also finds that PIHM requires a solver in order to run, so the simulation workflow component needs to include the solver software in addition to PIHM. The scientist develops a workflow variant W6 that includes new components implemented using PIHM, Hydroterre, and PIHM-GIS.

It is important to highlight several important tasks done by the scientist in these scenarios. In both scenarios, but particularly in scenario S6, the scientist needs to compare how two models are similar and how they differ in terms of the input data that they use and the output data that they generate. The documentation of models always includes details of the input and output requirements in terms of files and formats. The scientist will want to understand conceptually how the models work in terms of the physical variables used or generated in the model. That is, understanding the inputs and outputs at the file and format level is important, but understanding how model variables map to each of the files is also necessary. This information is usually not included in the software documentation, but in the publications associated with the model. The scientist will need to consult a variety of sources in order to understand how different models compare [10].

Another important task is to understand the assumptions made by the different models. For example, in hydrology some models may assume the Navier-Stokes equations for fluid motion, while others do not. These assumptions are often not captured in the descriptions of workflow components, which focus on the models as software artifacts rather than research artifacts.

In addition, after creating and running the new workflow variants W5 and W6, the scientist will want to compare their results to the results obtained with previous workflows W4, W3, and earlier ones. This requires that the scientist understands how the model results are related to one another, which requires understanding what modeling variables are generated and included in the simulation outputs.

Scenarios S5 and S6 motivate the following additional requirements:

R18 – Version descriptions need to capture assumptions used in software.

R19 – Workflow components, inputs, outputs or parameters in new workflow variants that are no longer needed need to be removed.

R20 – Scientists need to assess and compare the effort in creating new workflow variants that represent a significant departure from previous ones.

R21 – Scientists need to find and compare equivalent computational models, including their inputs, outputs, model variables, data formats, and assumptions.

{| class="wikitable"
|+ Table 1. Summary of requirements from cases.
! Category !! Requirement !! Cases
|-
| Workflow component metadata || R1 – Version descriptions need to capture useful metadata of the software. || C1, C2, C3
|-
| Workflow component metadata || R2 – Scientists need to understand differences in metadata between different software versions, particularly about their interfaces. || C1, C3
|-
| Workflow component metadata || R3 – Scientists need to be alerted about relevant updates of software used in their workflows. || C1
|-
| Workflow component metadata || R4 – Workflow descriptions need to capture the software, software version, and functions used in the implementation of workflow components. || C1, C2, C3
|-
| Workflow component metadata || R8 – Scientists should be able to easily create new variants of workflow components and relate them to each other. || C1, C2, C3
|-
| Workflow component metadata || R9 – Scientists should be able to easily create new workflow variants and relate them to each other. || C1, C2, C3
|-
| Workflow component metadata || R10 – Scientists should be able to relate changes in software to specific workflow results, so it is clear how new software versions affect calculated variables to produce wrong values. || C1
|-
| Workflow component metadata || R11 – Version descriptions need to capture bug fixes and known bugs and relate them to software features and input and output file variables. || C1
|-
| Workflow component metadata || R12 – Scientists need a summarization of changes between a given software version and a newer version to understand their differences without need to understand the changes associated to each version in between those. || C1
|-
| Workflow component metadata || R13 – Scientists need to understand any incompatibilities between versions of different software packages and libraries used to implement a workflow component. || C1, C2, C3
|-
| Workflow component metadata || R18 – Version descriptions need to capture assumptions used in software. || C1, C2, C3
|-
| Workflow updates || R6 – Scientists should be able to easily replace a component of the workflow with a new one when the interfaces of the components are the same. || C1
|-
| Workflow updates || R7 – Given a software package that can be used to create many workflow components, scientists need to easily figure out how to implement new variants of a workflow component with newer versions of that package. || C1
|-
| Workflow updates || R14 – Scientists need to know whether a new workflow version or a new workflow variant is valid. || C2, C3
|-
| Workflow updates || R15 – Scientists need to easily find software packages and workflow components that are appropriate to process a specific type of data input. || C1, C2, C3
|-
| Workflow updates || R16 – Scientists need to easily find workflow components for data conversion. || C2, C3
|-
| Workflow updates || R19 – Workflow components, inputs, outputs or parameters in new workflow variants that are no longer needed need to be removed. || C3
|-
| Workflow updates || R20 – Scientists need to assess and compare the effort in creating new workflow variants that represent a significant departure from previous ones. || C3
|-
| Workflow updates || R21 – Scientists need to find and compare equivalent computational models, including their inputs, outputs, model variables, data formats, and assumptions. || C3
|-
| Workflow comparisons || R5 – Scientists need to understand how new workflow variants can be used to correct errors in prior results. || C1, C2, C3
|-
| Workflow comparisons || R17 – Scientists need to be able to understand the differences between two workflow variants. || C1, C2, C3
|}

====4.4 Requirements summary====

The requirements of the previous scenarios can be grouped into three main categories:

Workflow component metadata, which tackles the representation and metadata of workflow components regarding their interface, functionalities and assumptions, and implementation using software packages and libraries. This metadata would also represent the characteristics of the different versions of the software and the different versions and variations of a given workflow component.

Workflow updates, which address the creation of new workflow variants by replacing, adding, or removing workflow components, the propagation of the effects of those changes throughout the structure of the workflow, and the validation of the new workflow variants.

Workflow comparisons, which address the comparison between different software versions, software packages, workflow variants and workflow runs.

Table 1 summarizes the requirements introduced in this section, pointing out the broad categories they belong to and the cases where they occur. Although we adopt the hydrology domain in our scenarios to illustrate the requirements, our requirements are domain-independent. Workflows in any domain have pre-processing steps, post-processing steps, and major analytic steps [3]. In the case of hydrology, the analytic steps are done using different hydrology models. Other sciences use algorithms rather than models. For example, different clustering algorithms or sequence alignment algorithms would be used in genomics. The requirements outlined here are generally applicable to other domains.

===5 DISCUSSION AND FUTURE RESEARCH===

Given the state of the art and the requirements from the scenarios, we outline possible research directions for future work:

1. Describing workflow components and their underlying software. This includes the creation and adaptation of existing ontologies to capture information about software versions and variants, including software interfaces and features. OntoSoft [5] is an ontology that might be extended to capture relevant information about software versions and variants. It is also important to integrate these ontologies with workflow systems to describe workflow components. Another area of work is to use them to support the creation of workflow variants.

2. Managing and tracking workflow variants and their differences. This includes how to compare workflow components and workflow variants regarding their interfaces and functions, and present these results in a useful way for scientists to understand their differences and the implications on experiment results. A possible approach is using multi-media narratives that combine text, graphics, and visualizations to explain the similarities and differences between software versions, software variants, workflow versions, or functions/methods.

[...] updates, and workflow comparisons. We also discussed major research directions to address those requirements, including improved frameworks for describing workflow components and the associated software, for managing and tracking workflow variants, and supporting scientists in the iterative exploration and experimentation process through workflow variants.

Acknowledgments. This work was supported in part by a grant from the US National Science Foundation under award ICER-1440323 and ICER-1632211 (EarthCube RCN IS-GEO), and in part by the Sao Paulo Research Foundation (FAPESP) under grants 2017/03570-3, 2014/23861-4 and 2013/08293-7.

===REFERENCES===

[1] Altintas, I., Barney, O., and Jaeger-Frank, E. (2006). Provenance collection support in the Kepler scientific workflow system. International Provenance and Annotation Workshop (IPAW), pages 118–132. Springer.

[2] Freire, J., Silva, C. T., Callahan, S. P., Santos, E., Scheidegger, C. E., and Vo, H. T. (2006). Managing rapidly-evolving scientific workflows. In Provenance and Annotation of Data, pages 10–18. Springer.

[3] Garijo, D., Alper, P., Belhajjame, K., Corcho, O., Gil, Y., and Goble, C. (2014). Common motifs in scientific workflows: An empirical analysis. Future Generation Computer Systems, 36, 338-351.

[4] Gil, Y. and Garijo, D. (2017). Towards Automating Data Narratives. In Proceedings of the Twenty-Second ACM International Conference on Intelligent User Interfaces (IUI-17), Limassol, Cyprus.

[5] Gil, Y., Ratnakar, V., and Garijo, D. (2015). OntoSoft: Capturing scientific software metadata. Proceedings of the 8th International Conference on Knowledge Capture (K-CAP), 2015.

[6] Graham, D. N., and Butts, M. B. (2005). Flexible, integrated watershed modelling with MIKE-SHE. Watershed models, 849336090, 245-272.

[7] Harbaugh, A. W. (2005). MODFLOW-2005, the US Geological Survey modular ground-water model: the ground-water flow process. Reston: US Department of the Interior, US Geological Survey.

[8] Koop, D., Scheidegger, C. E., Freire, J., and Silva, C. T. (2011). The Provenance of Workflow Upgrades. Third International Provenance and Annotation Workshop (IPAW), Vol. 6378. Springer.
More importantly, these [9] Niswonger, R.G., Panday, Sorab, and Ibaraki, Motomu, 2011, MODFLOW-NWT, narratives should be easily customized to the reader’s level of A Newton formulation for MODFLOW-2005: U.S. Geological Survey expertise and interest. As a starting point our approach may be Techniques and Methods 6-A37, 44 p. [10] Essawy, B. T.; Goodall, J. L.; Xu, H.; and Gil, Y. Evaluation of the OntoSoft based in an approach for data narrative generation [4]. Another Ontology for Describing Legacy Hydrologic Modeling Software. research area is to manage histories of creation and evolution Environmental Modelling & Software, 92. 2017. [11] Wolstencroft, K., Haines, R., Fellows, D., Williams, A., Withers, D., Owen, S., of workflow variants, and doing so across many users that may Soiland-Reyes, S., Dunlop, I., Nenadic, A., Fisher, P., Bhagat, J., Belhajjame, benefit from reusing segments or traversals across users. K., Bacall, F., Hardisty, A., Nieva de la Hidalga, A., Balcazar Vargas, M. P., Sufi, S., and Goble, C. (2013). The Taverna workflow suite: designing and 3. Designing an interactive framework to support scientists in executing workflows of web services on the desktop, web or in the cloud. the exploration and experimentation process through Nucleic Acids Research, 41(W1). W557–W561. workflow variants. This includes how to leverage workflow [12] Bakker, M., Post, V., Langevin, C.D., Hughes, J.D., White, J.T., Starn, J.J., and Fienen, M.N., 2016, FloPy v3.2.6: U.S. Geological Survey Software Release, reuse and composition to support the creation of workflow 19 March 2017, http://dx.doi.org/10.5066/F7BK19FH. variants. For example, given a new component that needs to [13] Qu, Y. and C. J. Duffy. "A semidiscrete finite volume formulation for multiprocess watershed simulation." Water Resources Research 43(8), 2007. replace an existing one in a workflow, suggest what other [14] Peckham, S. D. Geomorphometry and spatial hydrologic modelling. 
In components may need to be added or removed from the Geomorphometry: Concepts, Software, Applications, Developments in Soil workflow. Other research would involve mechanisms to Science, vol. 33, edited by S. D. Peckham, pp. 579–602, Elsevier. [15] Liang, X., D. P. Lettenmaier, E. F. Wood, and S. J. Burges, 1994: A Simple identify critical and non-critical components in workflows. Hydrologically-Based Model of Land Surface Water and Energy Fluxes for The critical and non-critical components could be associated GSMs, J. Geophys. Res., 99(D7), 14,415-14,428. [16] Community Surface Dynamics Modeling System (CSDMS) Model Repository. to abstractions defined as motifs [3]. Available from http://csdms.colorado.edu/wiki/Model_download_portal. [17] HydroTerre Data Services. Available from http://www.hydroterre.psu.edu. [18] The Pennsylvania Integrated Hydrology Model GIS Interface (PIHMgis), 6 Conclusions http://www.pihm.psu.edu/pihmgis_home.html [19] USGS. MODFLOW-NWT Release Notes. https://water.usgs.gov/ogw/modflow- This paper discusses the need to support scientists in exploring nwt/Release.txt different experiment designs over time. We presented several [20] Marinho, A., de Oliveira, D., Ogasawara, E., Silva, V., Ocaña, K., Murta, L., scenarios where an initial workflow is modified to create workflow Braganholo, V. and Mattoso, M., 2017. Deriving scientific workflows from algebraic experiment lines: A practical approach. Future Generation Computer variants by replacing, adding or removing workflow steps. We Systems, 68, pp.111-127. describe the requirements of these scenarios, and grouped them into three categories: workflow component metadata, workflow
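
Illustrative sketch. Requirement R6 and research direction 3 both turn on comparing component interfaces: a replacement is a drop-in when the interfaces match, and otherwise the framework should suggest which components to add. The Python sketch below is one minimal way this check might look, not an implementation from the paper. It assumes a component can be summarized by the data types it consumes and produces; the `Component` class, the function names, and the component and type names in the example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Component:
    """Hypothetical workflow component, described only by its data types."""
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


def same_interface(old: Component, new: Component) -> bool:
    """R6: a drop-in replacement has the same input and output types."""
    return (sorted(old.inputs) == sorted(new.inputs)
            and sorted(old.outputs) == sorted(new.outputs))


def find_converter(available: Sequence[str], wanted: str,
                   catalog: List[Component]) -> Optional[Component]:
    """Return the first catalog component that produces `wanted` using
    only data types that are already available, or None."""
    for candidate in catalog:
        if wanted in candidate.outputs and set(candidate.inputs) <= set(available):
            return candidate
    return None


def plan_replacement(old: Component, new: Component,
                     catalog: List[Component]) -> Dict[str, Optional[str]]:
    """Suggest converters to add when `new` replaces `old`.

    Upstream steps still deliver old.inputs and downstream steps still
    expect old.outputs, so every mismatched type needs a converter.
    A value of None marks a gap that no catalog component can bridge.
    """
    plan: Dict[str, Optional[str]] = {}
    for data_type in new.inputs:
        if data_type not in old.inputs:
            converter = find_converter(old.inputs, data_type, catalog)
            plan["input:" + data_type] = converter.name if converter else None
    for data_type in old.outputs:
        if data_type not in new.outputs:
            converter = find_converter(new.outputs, data_type, catalog)
            plan["output:" + data_type] = converter.name if converter else None
    return plan


if __name__ == "__main__":
    # Hypothetical hydrology-flavoured example: the new model version
    # reads NetCDF rainfall instead of CSV rainfall.
    old_model = Component("model_v1", ("csv_rainfall",), ("netcdf_flow",))
    new_model = Component("model_v2", ("netcdf_rainfall",), ("netcdf_flow",))
    catalog = [Component("csv_to_netcdf", ("csv_rainfall",), ("netcdf_rainfall",))]
    print(same_interface(old_model, new_model))
    print(plan_replacement(old_model, new_model, catalog))
```

An empty plan means the new component can be swapped in directly. A real framework would need richer metadata than type names, such as the assumptions and version information discussed in research direction 1.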