NiW: Converting Notebooks into Workflows to Capture Dataflow and Provenance

Lucas A. M. C. Carvalho1, Regina Wang2, Yolanda Gil2, Daniel Garijo2
1 University of Campinas, Institute of Computing, Campinas, SP, Brazil
2 University of Southern California, Information Sciences Institute, Marina del Rey, CA, U.S.A.
lucas.carvalho@ic.unicamp.br, gil@isi.edu, dgarijo@isi.edu

K-CAP2017 Workshops and Tutorials Proceedings, SciKnow'2017, December 2017, Austin, Texas USA
© Copyright held by the owner/author(s)

ABSTRACT

Interactive notebooks are increasingly popular among scientists to expose computational methods and share their results. However, it is often challenging to track their dataflow, and therefore the provenance of their results. This paper presents an approach to convert notebooks into scientific workflows that explicitly capture the dataflow across software components and facilitate tracking provenance of new results. In our approach, users should first write notebooks according to a set of guidelines that we have designed, and then use an automated tool to generate workflow descriptions from the modified notebooks. Our approach is implemented in NiW (Notebooks into Workflows), and we demonstrate its use by generating workflows with third-party notebooks. The resulting workflow descriptions have explicit dataflow, which facilitates tracking provenance of new results, comparison of workflows, and sub-workflow mining. Our guidelines can also be used to improve understandability of notebooks by making the dataflow more explicit.

CCS CONCEPTS

• Information systems → Artificial intelligence; Knowledge representation and reasoning

KEYWORDS

Scientific Workflows; Workflow Design; Electronic Notebooks.

1 INTRODUCTION

Interactive notebooks have become very popular in science to capture computational experiments [14]. These notebooks include code, visualizations, and explanations, and can be easily shared and re-run.

As scientists carry out their research, they may need to compare the results and methods of different experiments. This involves comparing final results, comparing intermediate results, comparing steps of the method, and comparing parameter values. Since notebooks contain raw code, it can be hard to understand how new results are generated, as well as to compare notebooks. In contrast, workflows offer modular components to run code, and have an explicit dataflow. This can facilitate provenance capture, as well as automated mining of reusable workflow fragments [5]. Workflows also facilitate understanding and performing comparisons, particularly for non-programmers [6].

This paper presents an approach for converting notebooks into workflow descriptions by mapping various aspects of notebook cells into workflow components and dataflow. Our approach is implemented in NiW, a prototype tool to convert Jupyter Notebooks (http://jupyter.org) into WINGS workflows [7]. Based on the assumptions of our approach, we propose a set of guidelines for designing notebooks that facilitate the conversion and can be used by notebook developers to improve the understandability of their notebooks.

2 DATAFLOW AND PROVENANCE IN WORKFLOWS AND NOTEBOOKS

This section discusses general issues for identifying dataflow and tracking provenance in notebooks, compared with simple dataflow in workflows. Our work to date has focused on mapping Jupyter Notebooks to workflow descriptions that can be used in the WINGS workflow system [7], but many of the issues will be common for other notebook and workflow systems.

2.1 Dataflow in Workflows

Workflows explicitly capture the dataflow across software components. We describe here a very simple dataflow representation and workflow structure that we assume in the rest of the paper. This approach is used in several workflow systems, including WINGS [7], Pegasus/Condor [4], and Apache Taverna [12].

Each software component (or step) of a workflow may have multiple datasets as inputs, multiple datasets as outputs, and multiple parameters, which are provided as simple numeric or Boolean values. A dataset generated by a component can be input to another component, thereby indicating the flow of data (i.e., the dataflow) from one component to another.

A workflow management system can run a workflow if the software components can be executed and the respective input datasets and parameter values are provided. Because the dataflow is explicitly captured in the workflow, the system can record the provenance of each new dataset generated by the workflow.

The dataflow of a workflow is often shown as a graph. Workflows can be compared as graphs. Indeed, graph algorithms have been used to query workflow repositories [1], and to mine workflow repositories to find commonly occurring sub-workflows [5]. Visual user interfaces that show the dataflow graph in a workflow are easy to use for non-programmers [8].

2.2 Overview of Computational Notebooks

Notebooks aggregate text and code, grouped into a sequence of containers or cells. Cells can be code cells, markdown cells, and raw cells. Code cells contain running code, usually in a single programming language such as Python, R, or Java. Code cells are the heart of notebooks. Markdown cells hold comments and documentation, so users can add graphics, formatting, etc. These cells are not linked with any other cells and run without interfering with other cells. Raw cells display raw text without any conversions, and are much less used. Unlike raw and markdown cells, code cells are linked with each other when the notebook runs, as though they formed a single unit of code. Thus, when a code cell performs any activity (e.g., initializing a variable), the next code cell to be run carries it on as though there were no other cells in between. Outputs are shown in the notebook when the code in a cell includes a plot or print statement.

Jupyter Notebook (http://jupyter.org/) is one of the most popular notebook platforms. Jupyter Notebooks were originally named IPython Notebooks since they are primarily used with Python, but they have been expanded with kernels for several other programming languages. A kernel is a program that runs the notebook's code.

2.3 Understanding Dataflow of Notebooks

We analyzed a diversity of Jupyter Notebooks to understand their dataflow and the provenance of their results. Common problems that we found include:

1. Processing: A user may not have a clear understanding of what the main processing cells of a notebook are. For example, cells for assigning values to variables or importing libraries are not processing units and should be placed in the same cell that uses those variables.
2. Dataflow: Input file names are either implicit in the code or defined as parameters through method calls in previous cells. This represents implicit dependencies between cells, and therefore makes it difficult to understand the dataflow among them. In addition, users may also have difficulty figuring out what files were generated by a given cell.
3. Inputs: Input files may contain pointers to other files that are opened and used as inputs inside of a cell. This creates an implicit dependency that is difficult to detect.
4. Outputs: When a cell does not have an explicit output it is very difficult to understand what kind of process that cell performed. A cell may overwrite a file with the same name generated by other cells, so it can be hard to track the provenance of newly generated files. Notebooks can generate visualizations, but those do not necessarily generate output files.
5. Files: It is very difficult to understand how the files in notebook folders correspond to the cells that used or generated the files.
6. Data: Some notebooks are available in repositories without any test data. This makes it hard to understand the expected data format of the input files and the outputs generated by the notebook. It also makes it hard to rerun the notebooks.

In summary, many problems arise in trying to understand what the dataflow is across the cells of a notebook and how they use or generate notebook files. This makes it very hard to figure out the provenance of any results. This also makes it hard to understand a notebook, as well as to compare different notebooks.

3 MAPPING NOTEBOOKS TO WORKFLOWS

When mapping notebooks into workflows, many issues must be addressed. We discuss here those issues, and our approach to tackle them. We start with general issues. After that, we focus on issues specific to Jupyter Notebooks and Python, since that has been the focus of our work so far. Then we discuss issues specific to the WINGS workflow system, which is the target of our mappings.

3.1 Components

3.1.1 Executable Code

Differences: Each component in a workflow must have some running code within it. In notebooks, cells may contain solely value assignments to variables, function declarations or library imports, or documentation, which are not executable code by themselves and cannot be easily mapped into a workflow component. Another difference is that in notebooks, even though code is split cell by cell, most of the splits only exist to benefit human readability and do not actually affect the code itself. Cells are used just to modularize the code. As a result, different users may break up the code at different places, and it will not matter much as long as the sequential order in which these cells run is preserved when there are data dependencies between them. In contrast, in workflows the code is split into components which are isolated from one another and that individually carry out a meaningful function. For example, if a variable is created in one component and used in a second component, the latter will not have access to this variable unless it is generated by the former as an output and explicitly consumed as an input by the second component.

Approach: Each notebook cell with running code will be mapped to a workflow component. If a notebook cell does not have running code and only has library imports or method declarations, it will become part of a cell that requires that information.

3.1.2 Libraries and Methods

Differences: A notebook only needs to import a library or state a method once. Since workflows are componentized, the imports and method declarations need to be done in each workflow component that uses them.

Approach: Every library used in the notebook will be imported into all workflow components created. A method will be included in a component only if the method is used in it.

3.1.3 Open Files

Differences: A notebook can open a file and use it in any subsequent cell. In a workflow, a file can be used only inside the component that has that file as input.

Approach: If a file is opened and used across many cells, those cells will be merged into a single component. Note that an alternative approach might be to create separate components for each of the cells and open and close the file in each of those components, but this would result in inefficiencies if the data is written to files and read from files too many times.

3.1.4 Markdown Cells

Differences: Markdown cells in notebooks do not contain running code, but need to be included in the workflow as documentation so that the information that they contain is not lost. Workflow components can have documentation. There can be more markdown cells than code cells. In addition, the relationship between markdown cells and code cells is not explicit. A markdown cell may be related to either its previous or its subsequent cell.

Approach: Since the relationship between markdown cells and code cells is unknown, the assignment is made in the following way: a markdown cell's information will be attached to the documentation of the component created for its subsequent code cell.

3.1.5 Component Naming

Differences: In a notebook, a cell does not have a name. In a workflow, a component has a name that generally describes the function of the component in the workflow.

Approach: A name will be generated for each component of the workflow, starting with "Component" followed by the ordering number from the cell (e.g., Component1, Component2 and Component3).

3.2 Data and Parameters

3.2.1 Parameters

Differences: In notebooks, method parameters are set through program variables. In workflows, parameters are inputs to components and provided by users. In workflows, if parameters are coming from other components, these parameters must be passed explicitly through a file.

Approach: Variables of primitive types (e.g., Boolean, string, integer, float, date) that are given constant values in notebooks will be mapped to parameters in workflows, and they will be given the name that was used in the notebook.

3.2.2 Input Files

Differences: A notebook may be given input data once at the beginning, and there is no need to pass data through files from cell to cell. In a workflow, a component must output a data file that is then an input to another component.

Approach: A data file will be explicitly generated from the notebook code in order to be passed to another component in the workflow. Code will be added to the component that generates the data so that the resulting data is written into a file that can be passed to the next component. Although this may not be an efficient approach, it facilitates provenance tracking, which is very important in scientific analysis.

3.2.3 Output Files

Differences: In a notebook, cells can write results directly into files in the local file system. Other cells show results in visualizations. In contrast, workflows generate results in output files.

Approach: When a notebook cell writes into files in the local file system, the corresponding workflow component will have an output corresponding to that file. When a notebook cell shows a visualization, a workflow component will be created to generate that visualization as an output file.

3.3 Workflow Structure

3.3.1 Dataflow

Differences: In a notebook, although cells are specified sequentially, they might be executed in any order, or a cell may not be executed at all. In a workflow, the flow of data among components must be specified, and all components are executed.

Approach: The identifiers of the files generated and consumed by components generated for a given notebook will be used to obtain the dataflow between the components, and the dataflow will be explicitly stated in the workflow structure. We assume that notebooks run the cells sequentially, so we do not consider any parallelism in the execution.

3.4 Python-Specific Differences

A few mappings are challenging because of the specific way that Python is used in Jupyter Notebooks. The IPython kernel allows notebooks to use special functions that the standard Python interpreter does not support. Since a workflow component would be executed using standard Python, these functions cannot be directly mapped. Notebooks are also designed for human readability and are, as a result, much more documented and aggregate more resources than plain Python code. In addition, notebooks include Python commands to generate visualizations (e.g., graphs), which are executed and the results shown in the notebook but not necessarily saved.

3.4.1 Visualizations

Differences: Notebooks show visualizations which may not be saved into a file. Workflow components would generate visualizations and save them in an output file.

Approach: If the notebook does not save a visualization, the workflow will automatically save the visualization in a file.

3.4.2 Magic Commands

Differences: Notebooks have IPython kernel commands known as magic commands. They start with "%" and can list all variables, return the directory being used, etc. Magic commands only work on IPython kernels. The standard Python interpreter does not recognize these commands since they are not Python code.

Approach: Magic commands will be replaced with pre-defined Python code that implements them. Magic commands for which no such code is available will be ignored and not mapped.

3.4.3 Automatic Output

Differences: In a notebook, if a variable is printed (using the print statement), it will appear as an output. In a software component, that code would not generate an output.

Approach: A workflow component that includes print commands will have an extra output with all the results from the print statements.

3.5 WINGS-Specific Differences

A few mappings are specific to requirements in the WINGS workflow system. In WINGS, the code for each software component has an associated script that indicates the command line invocation for the software, but notebooks do not have this. Input files are also treated differently from notebooks. In WINGS, input files are classified into a hierarchy of data types. Semantic metadata properties can be specified as well for all input files. Notebooks do not have either.

3.5.1 Software Components

Differences: Each workflow component has an associated script that invokes the code to be executed in that component. The script for a workflow component in WINGS must specify the invocation command, number of inputs, parameters and outputs that the code for the component expects. Notebooks do not have this information.

Approach: After mapping the code from the cells of the notebook into components, a script will be generated for each component indicating the invocation command together with the total number of inputs, outputs and parameters.

3.5.2 Input Data Files

Differences: In WINGS, each input file is assigned a data type. In notebooks, there are no data types or metadata for the files.

Approach: All input files will be considered to be of the same general data type.

3.5.3 Workflow Components

Differences: In WINGS, each workflow component is assigned to a type in a component type hierarchy based on its function.

Approach: All workflow components will be given a general component type.

3.6 Usability Requirements

Our approach requires that users make changes to their notebooks in order to facilitate the conversion of notebooks into workflows. We took into account additional requirements to reduce the burden to users and maximize the utility of the changes required:

• The user should have to make minimal changes to a notebook to allow the conversion tool to generate a workflow.
• Any changes made to a notebook should improve its readability and documentation as well as facilitate its conversion into a workflow.
• Any changes made to a notebook should be independent of the target workflow system.
• Any changes made to a notebook should improve the understanding of the dataflow.
• The workflows should include all the documentation of the original notebooks.
• All the results generated by a notebook should be generated by the workflow as well, even if they are not explicit in the notebook.
• A conversion tool should automate the process as much as possible, and some manual intervention may be needed after running it.

4 GUIDELINES TO DEVELOP NOTEBOOKS

Based on our approach to map notebooks into workflows, we designed a set of guidelines that users can follow to facilitate the conversion of notebooks into workflows. Users who are not creating workflows will be able to use these guidelines to create notebooks that have more explicit dataflow, which will facilitate understanding, comparisons, and reuse by others.

We list here the set of guidelines, each with a justification.

1. Provide at least one cell with running code: a workflow component must have running code within it to be created and a workflow must be composed of at least one component.
2. Write into files any newly generated data: the code in a cell should write to files with the intermediate and final data generated, so that other cells can use those files. This will make the dataflow across cells more clear. Here we have a trade-off between input/output (I/O) performance on disk and provenance capture.
3. Keep code that uses the same file in the same cell: if files are opened and used across many cells, all those cells should be merged into a single cell, making the notebook cells and workflow components more modular.
4. Keep the notebook clean and working: the cells that are not needed to run the notebook should be removed and the code in cells must be running correctly to create the workflow components.
5. Ensure that the notebook produces correct results when running its cells from top to bottom: the notebook cells are considered sequentially (i.e., from top to bottom) to create the workflow structure. This makes it easier to capture the dataflow between cells and understand the notebook.
6. Provide meaningful names for variables and files: these names should make clear what kind of data the files contain. Avoid names such as "load" or "data05". Instead, use names like "PluviometricCalculation" or "SensorReadings". This makes the visual presentation of notebooks and workflows more readable.

These guidelines aim to facilitate the automated conversion of notebooks into workflows. They also improve the understandability of notebooks by making the dataflow more clear.

5 NiW: A Tool for Converting Notebooks into Workflows

NiW (Notebooks into Workflows) is a prototype system that implements our approach to convert notebooks into workflows. Our current NiW prototype creates workflows for the WINGS system from Jupyter Notebooks. The software is available online [3].

NiW takes as input a notebook file and generates: 1) a zip file for each workflow component (e.g., Component1.zip), containing the component code as a Python script (e.g., Component1.py), a script file (named io.sh) to handle the inputs and outputs of the component, and a script (named run) to execute the component; 2) a file with a list of the names of the components and their inputs, outputs, and parameters; and 3) the workflow structure. NiW first generates (1) and (2) and uses those files to automatically create (3). NiW also creates the data type "File" and associates all data files (inputs, intermediate, and results) to this data type. NiW uses the notebook's filename to name the workflow.

Figure 1 illustrates how the notebooks are converted into workflows by NiW using the approach outlined in Section 3.

Figure 1. Using NiW to generate a WINGS workflow from a Jupyter Notebook. [Diagram labels: Jupyter Notebook, Notebook cells, NiW, Making inputs explicit, Saving intermediate results, Merging related cells in components, WINGS Workflow, Workflow steps, Component2.py, Run.sh (wrapper).]

5.1 Current Limitations of NiW

The following are limitations of our current NiW implementation. Python is the only programming language supported. The use of magic commands is restricted; currently only the magic command %matplotlib (which allows visualizations to be generated) is supported. The only methods supported for opening files are the built-in method "open" and the method "read_csv" from Pandas (http://pandas.pydata.org), a well-known data analysis library in Python. Only Matplotlib can be used to generate visualizations. Finally, the notebook should run fully without errors. This is because if an error occurs while executing a notebook, it would be difficult to identify how data are generated and used throughout all the cells. Moreover, errors in code might be propagated to the workflow components.

5.2 Using NiW

To demonstrate how NiW works we have chosen a Jupyter Notebook for computational journalism taken from http://nbviewer.jupyter.org/gist/darribas/4121857. This notebook was created by journalists at The Guardian newspaper and uses real world data to analyze and map the incidents during the 2012 Gaza-Israel crisis, exploiting the spatial as well as the temporal dimension of the data. The modified version of the notebook, the WINGS workflow, and the workflow execution are available at [3].

We modified the notebook based on the guidelines presented in Section 4, and to address the limitations of our current implementation of NiW mentioned above. The only changes required by our guidelines in the notebook code were related to guideline #2 (writing newly generated data into files): (1) saving the data retrieved online in a local file, instead of loading it in memory to be used in subsequent cells; (2) saving changes made to the data in each cell into a new file; (3) opening the updated data file saved by (2) in subsequent cells.

The nine code cells in the original notebook resulted in five workflow components. The code cells containing only library imports were merged with other components, as were the cells containing the declaration code of the function parse_loc. The inputs of the workflow are the parameters api_key, request_url and query, variables with assignment to string values in the original cell. After retrieving the data, it is saved as a CSV file by Component1. Components 3, 4 and 5 save the graphs generated for future inspection, originally shown inline in the notebook.

The modified notebook is improved for use by scientists with respect to the original version in several respects, such as making inputs explicit, saving intermediate results, merging related cells into meaningful components, and making outputs explicit.

Figure 2 shows the workflow created by NiW using the modified notebook as input. Note that even if notebooks always have a sequential structure, workflows do not. In this case, there are three components at the bottom that could be run in any order because there are no data dependencies across them. This helps a user see how the different steps are related. Unfortunately, the workflow is not more understandable because the components and datasets do not have very meaningful names. A user could easily edit the workflow in WINGS to change those names.

Figure 2. WINGS workflow created from The Guardian's notebook.

One benefit of the workflow is to support comparison and provenance when run with new datasets. Journalists from The Guardian created the input dataset in collaboration with Internet users. If the input data is updated, the workflow could easily be executed again, and its results can be compared. Since all the intermediate results are stored as provenance information, they may also be compared to previous executions.

Another benefit of the workflow is to compare the results when the code changes. In this case, the notebook is collaborative and can be extended by users via GitHub. When the notebook is changed, NiW can be re-run and a new workflow would be generated and executed. The workflows for different notebook versions can be easily compared.

6 RELATED WORK

There are several related approaches to expose the dataflow within scripts and/or to map scripts into structures that support provenance tracking.

noWorkflow [11] captures provenance information from scripts to help scientists understand the script execution. However, this approach does not simplify the understanding of the script specifications for non-programmers. YesWorkflow [10] enables scientists to make explicit the dataflow in scripts by providing special tags that scientists use to annotate the scripts. These annotations split the script into steps and clarify the inputs and outputs of each step as well as the structure of the workflow. It enables the creation of a visualization based on these annotations, helping scientists to understand the dataflow within the script. However, the scientist still has a script which is difficult to reuse compared to workflows.

W2Share [2] focuses on the conversion of scripts into scientific workflows. This approach automatically generates workflows from annotated scripts. However, this work does not consider peculiarities of notebooks.

[13] proposes an approach to capture provenance from notebooks automatically, allowing the analysis of provenance information within the notebook, both to reason about and to debug their work. [9] captures dataflows from notebooks by specifying a unique and persistent identifier for each cell and its outputs, which can be referred to in other cells. In our work we provide guidelines to notebook designers to improve the understanding of the notebooks by scientists, and then convert the modified notebooks into workflows automatically.

7 CONCLUSIONS

We presented an approach to map notebooks into workflows, addressing many issues that arise because of the implicit dataflow in notebooks. We introduced a set of general guidelines for notebook developers that help make dataflow more explicit, which improves understandability and provenance tracking. We implemented NiW, a prototype tool that can convert notebooks that follow those guidelines into workflows, in particular Jupyter Notebooks into WINGS workflows.

An important area of future work is to make workflows more understandable. Users could edit the names of the workflow components and datasets to make them more meaningful. Another approach would be to use metadata tags for notebooks to facilitate the creation of components and workflows. For example, a metadata tag could be added in the notebook to give each cell a meaningful name, so that NiW would use that name rather than a synthetic one. Another example is the assignment of markdown cells to workflow components, which could also be addressed by using metadata tags provided by the notebook creator.

Future work also includes the automatic generation of notebooks from workflows. This would be useful for users who reuse workflows written by others, as it would enable them to use the notebook environment to inspect the code that implements the workflows. In addition, a notebook diagram could be included in the notebook to make the role of each cell clearer.

An interesting direction for future work is to explore the use of workflows for tracking provenance of notebook results and for comparing different notebooks. Workflows can provide provenance records for all the newly generated results. In addition, the structure of workflows makes it easier to compare notebooks because they expose the similarities, the common sub-workflows, and the differences in implementations. There are many opportunities to explore the interplay between notebooks and workflows in terms of alternative user interfaces, execution paradigms, provenance tracking, comparison, and reuse.

ACKNOWLEDGMENTS

This work was supported in part by a grant from the US National Science Foundation under awards ICER-1440323 and ICER-1632211 (EarthCube RCN IS-GEO), and in part by the Sao Paulo Research Foundation (FAPESP) under grants 2017/03570-3, 2014/23861-4 and 2013/08293-7. We would like to thank many collaborators for their feedback on this work, in particular Jeremy White and Zachary Stanko.

REFERENCES

[1] Bergmann, R.; and Gil, Y. Similarity Assessment and Efficient Retrieval of Semantic Workflows. Information Systems Journal, 40. 2014.
[2] Carvalho, L. A. M. C.; Malaverri, J. E. G.; Medeiros, C. B. Implementing W2Share: Supporting Reproducibility and Quality Assessment in eScience. In Proceedings of the 11th Brazilian e-Science Workshop, São Paulo, Brazil, 2017.
[3] Carvalho, L. A. M. C., Wang, R. and Garijo, D. (2017, December 9). KnowledgeCaptureAndDiscovery/niw: Notebooks into workflows 0.0.1 (Version 0.0.1). Zenodo. http://doi.org/10.5281/zenodo.1098344
[4] Deelman, E., Singh, G., Su, M. H., Blythe, J., Gil, Y., Kesselman, C., Mehta, G., Vahi, K., Berriman, G. B., Good, J., Laity, A., Jacob, J. C. and Katz, D. S. Pegasus: a Framework for Mapping Complex Scientific Workflows onto Distributed Systems. (2005) Scientific Programming Journal, vol. 13, pp. 219-237.
[5] Garijo, D.; Corcho, O.; Gil, Y.; Gutman, B. A.; Dinov, I. D.; Thompson, P.; and Toga, A. W. FragFlow: Automated Fragment Detection in Scientific Workflows. In Proceedings of the IEEE Conference on e-Science, Guaruja, Brazil, 2014.
[6] Garijo, D.; Corcho, O.; Gil, Y.; Braskie, M. N.; Hibar, D.; Hua, X.; Jahanshad, N.; Thompson, P.; and Toga, A. W. Workflow Reuse in Practice: A Study of Neuroimaging Pipeline Users. In Proceedings of the IEEE Conference on e-Science, Guaruja, Brazil, 2014.
[7] Gil, Y.; Ratnakar, V.; Kim, J.; Gonzalez-Calero, P. A.; Groth, P.; Moody, J.; and Deelman, E. Wings: Intelligent Workflow-Based Design of Computational Experiments. IEEE Intelligent Systems, 26(1). 2011.
[8] Hauder, M.; Gil, Y.; Sethi, R.; Liu, Y.; and Jo, H. Making Data Analysis Expertise Broadly Accessible through Workflows. In Proceedings of the Sixth Workshop on Workflows in Support of Large-Scale Science (WORKS'11), held in conjunction with SC 2011, Seattle, Washington, 2011.
[9] Koop, D., and Patel, J. Dataflow Notebooks: Encoding and Tracking Dependencies of Cells. In 9th USENIX Workshop on the Theory and Practice of Provenance (TaPP 17). USENIX Association. 2017.
[10] McPhillips, T., Song, T., Kolisnik, T., Aulenbach, S., Belhajjame, K., Bocinsky, K., Cao, Y., Chirigati, F., Dey, S., Freire, J. and Huntzinger, D. YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts. International Journal of Digital Curation 10, no. 1 (2015): 298-313.
[11] Murta, L., Braganholo, V., Chirigati, F., Koop, D. and Freire, J. noWorkflow: capturing and analyzing provenance of scripts. In International Provenance and Annotation Workshop (pp. 71-83). Springer. 2014.
[12] Oinn, T., M. Greenwood, M. Addis, N. Alpdemir, J. Ferris, K. Glover, C. Goble, A. Goderis, D. Hull, D. Marvin, P. Li, P. Lord, M. Pocock, M. Senger, R. Stevens, A. Wipat, and C. Wroe. Taverna: lessons in creating a workflow environment for the life sciences. Concurrency and Computation: Practice and Experience, 18(10), 2006.
[13] Pimentel, J.F.N., Braganholo, V., Murta, L. and Freire, J. Collecting and analyzing provenance on interactive notebooks: when IPython meets noWorkflow. In Workshop on the Theory and Practice of Provenance (TaPP), Edinburgh, Scotland (pp. 155-167), 2015.
[14] Shen, H. Interactive notebooks: Sharing the code. Nature, 05 November 2014.
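
As a supplementary illustration of the dataflow approach in Section 3.3.1, the following minimal sketch shows one way the identifiers of generated and consumed files could be matched to obtain the dataflow between components. It is not taken from the NiW implementation: the Component class, the derive_dataflow function, and all file names are hypothetical and chosen only for this example. It assumes, as the paper does, that components are considered sequentially.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A workflow step with the file identifiers it reads and writes."""
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


def derive_dataflow(components):
    """Return (producer, file_id, consumer) edges, scanning components in order.

    A consumed file is linked to the most recent earlier component that wrote
    it, which mirrors the assumption that cells run from top to bottom.
    """
    last_writer = {}  # file identifier -> name of the component that wrote it
    edges = []
    for comp in components:
        for file_id in comp.inputs:
            if file_id in last_writer:
                edges.append((last_writer[file_id], file_id, comp.name))
        for file_id in comp.outputs:
            last_writer[file_id] = comp.name
    return edges


# Hypothetical components and file names, for illustration only.
components = [
    Component("Component1", inputs=[], outputs=["incidents.csv"]),
    Component("Component2", inputs=["incidents.csv"], outputs=["cleaned.csv"]),
    Component("Component3", inputs=["cleaned.csv"], outputs=["map.png"]),
    Component("Component4", inputs=["cleaned.csv"], outputs=["timeline.png"]),
]
edges = derive_dataflow(components)
for producer, file_id, consumer in edges:
    print(f"{producer} --{file_id}--> {consumer}")
```

In this sketch Component3 and Component4 both depend only on Component2, so nothing orders them relative to each other, which is the kind of structure a sequential notebook hides and a workflow graph exposes.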
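
As a supplementary illustration of the parameter mapping in Section 3.2.1, the sketch below uses Python's standard ast module to find variables that are assigned constant values of primitive types in a cell, keeping the variable name as the parameter name. This is only one possible way to detect such assignments and is not claimed to be how NiW does it. The function name extract_parameters and the sample cell are hypothetical (only the variable name api_key appears in the paper), and the sketch assumes Python 3.8 or later and a cell containing plain Python without magic commands.

```python
import ast

# Primitive value types treated as workflow parameters in this sketch.
PRIMITIVES = (bool, int, float, str)


def extract_parameters(cell_source):
    """Map top-level 'name = constant' assignments to {name: value}."""
    params = {}
    for node in ast.parse(cell_source).body:
        if (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, PRIMITIVES)
        ):
            params[node.targets[0].id] = node.value.value
    return params


# Hypothetical cell content; the code is parsed, never executed.
cell = '''
api_key = "demo-key"
max_rows = 100
verbose = True
rows = load(api_key)
'''
params = extract_parameters(cell)
print(params)
```

Assignments whose right-hand side is a call or another expression (such as rows above) are left alone, since they are results of processing rather than user-provided values. Negative literals such as -1 are parsed as unary expressions and would need extra handling.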
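
As a supplementary illustration of Sections 3.1.1 and 3.1.5, the sketch below groups code cells into named components: a cell that holds only imports or declarations is carried forward and merged into the next cell with running code. The merge-forward policy, the sequential numbering of components, and the names has_running_code and cells_to_components are assumptions made for this example, not a description of the NiW code. The sketch also assumes that cells contain plain Python (no magic commands), and any trailing cells without running code are simply dropped.

```python
import ast

# Statement types that declare something but do not process data.
NON_RUNNING = (ast.Import, ast.ImportFrom, ast.FunctionDef, ast.ClassDef)


def has_running_code(source):
    """True if the cell has any statement besides imports and declarations."""
    body = ast.parse(source).body
    return any(not isinstance(node, NON_RUNNING) for node in body)


def cells_to_components(code_cells):
    """Group cell sources into components named Component1, Component2, ..."""
    components = {}
    pending = []  # declaration-only cells waiting for a cell that runs code
    for source in code_cells:
        pending.append(source)
        if has_running_code(source):
            name = f"Component{len(components) + 1}"
            components[name] = "\n".join(pending)
            pending = []
    return components


# Hypothetical notebook cells, for illustration only.
cells = [
    "import csv",
    "def double(x):\n    return 2 * x",
    "values = [double(v) for v in (1, 2)]",
    "print(values)",
]
components = cells_to_components(cells)
for name, code in components.items():
    print(name, "has", len(code.splitlines()), "lines")
```

Here four cells yield two components, in the same spirit as the example of Section 5.2, where nine code cells resulted in five workflow components.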