Improving Publication and Reproducibility of Computational
               Experiments through Workflow Abstractions

                Yolanda Gil                                  Daniel Garijo                           Margaret Knoblock
       Information Sciences Institute                Information Sciences Institute              Information Sciences Institute
      University of Southern California             University of Southern California           University of Southern California
                    USA                                           USA                                         USA
                 gil@isi.edu                                dgarijo@isi.edu                          mrk022@bucknell.edu

                Alyssa Deng                                 Ravali Adusumilli                          Varun Ratnakar
       Information Sciences Institute                      School of Medicine                    Information Sciences Institute
      University of Southern California                    Stanford University                  University of Southern California
                    USA                                           USA                                         USA
        shipingd@andrew.cmu.edu                            ravali@stanford.edu                           varunr@isi.edu

                                                             Parag Mallick
                                                           School of Medicine
                                                           Stanford University
                                                                  USA
                                                          paragm@stanford.edu


ABSTRACT
The current practice of publishing articles solely containing textual descriptions of methods is error prone and incomplete. Even when a reproducible workflow or notebook is linked to an article, the text of the article is not well integrated with those computational components, and the workflow and notebook are focused mostly on implementation details that are disconnected from the scientific approach described in the text of the article. Through an analysis of three multi-omics articles, we illustrate why this makes it difficult to understand, reproduce, compare, and reuse computational methods. We propose workflow abstractions that capture different concepts and perspectives that are important to scientists. These abstractions connect the text of an article to the corresponding workflow, and provide a framework to improve the publication and reproducibility of computational experiments.

CCS CONCEPTS
• Information systems → Artificial intelligence; Knowledge representation and reasoning

KEYWORDS
Reproducibility, semantic workflows, semantic science

K-CAP2017 Workshops and Tutorials Proceedings,
© Copyright held by the owner/author(s)

1    INTRODUCTION
The reproducibility crisis in science has received significant attention. Reproducibility requires that methods are described in enough detail to repeat the experiment in an independent lab or setting. For computational experiments, studies show that published papers often provide insufficient information about the data, protocols, software, and overall method used to obtain the new results [Van Noorden 2015]. A major barrier to reproducibility can be traced to the traditional, unstructured format of the “materials and methods” sections of publications. The ambiguity, imprecision, and linearity of text make natural language descriptions of computational analyses inadequate for reproducible research [Steehouder et al 2000; Garijo et al 2013; Gil 2015; Groth and Gil 2009]. A major problem is that there is no guidance or methodology to describe computational methods in articles. It is unclear what the intent of the descriptions in methods sections is. Is the goal to provide a step-by-step account of the procedures taken, parameters employed, and data provenance such that a study might be reproduced? Alternatively, is the goal to provide a high-level intuition for the steps that were performed? Both are valuable but are incompatible objectives in current text-based descriptions, leading to neither an intuitive reading experience nor a reproducible description.
K-CAP’17 SciKnow, December 2017, Austin, TX, USA                                                                  Y. Gil et al.

    Computational workflows and notebooks can be used to organize and record computational methods, and are often linked to publications. Workflows capture the dataflow among computations, so the different steps of the method are explicitly represented and linked. Notebooks are composed of cells that contain either text or code that can be easily re-run. Workflows and notebooks facilitate the documentation of the software and the structure of a computational method. But even when workflows and notebooks are used, the text for those publications is always manually generated by the authors and often inadequately captures the full complexity of an analysis, leading to poor reproducibility. In prior work, we developed an approach to automatically create descriptions of computational methods by generating text from workflows [Gil and Garijo 2017], where the text accurately represents what was done and can be presented from different perspectives. The text, however, can only be as good as the workflows that it was generated from. This motivates the need for a workflow design methodology that leads to workflow representations that support the automated generation of explanations as text that can be used in publications.
    This paper proposes abstractions designed to improve the publication of computational methods to facilitate reproducibility. This methodology extends our prior work on representing and publishing workflow abstractions using community standards for workflows and provenance [Garijo et al 2017].
    The paper begins with an overview of related work on reproducibility and publication of computational methods. We then present an analysis of the computational methods described in three seminal papers in cancer omics. We introduce the abstractions proposed, and discuss their merits in improving the descriptions of computational methods in scientific publications.

2    PUBLICATION AND REPRODUCIBILITY OF COMPUTATIONAL METHODS
Textual descriptions of methods in articles may be incomplete (e.g., [Ioannidis et al 2009; Donoho et al 2009]). Authors focus on conveying the major contributions of the work and describe the methods in that light, omitting details that may be important for transparency and reproducibility. For example, [Garijo et al 2013] describes our work to reproduce an article for which the authors had provided the data, software, and results to facilitate reproducibility. We created reproducibility maps that showed whether different categories of users could figure out from the text of the paper how the work was done. The reproducibility maps showed that only researchers with the same level of expertise in the subject as the authors were able to figure out how to fully reproduce the work. There are many similar results in the literature, some mentioning the lack of publication of data [Ioannidis et al 2009] and others the lack of details in the description of methods leading to “exercises in ‘forensic bioinformatics’ where aspects of raw data and reported results are used to infer what methods must have been employed” [Baggerly and Coombes 2009]. There are several reasons why text descriptions of methods are riddled with problems. First, articles often have space limitations, so authors tend to omit anything that seems unimportant. Second, because they are manually written without any particular guidance, it is easy for authors to provide imprecise descriptions. Finally, computational methods are often complex procedures with non-linear structures that are hard to describe with the sequential nature of text [Gil 2015]. Even when authors endeavor to describe enough details, textual descriptions are often ambiguous. A study reported in [Ince et al 2012] looked at writing software from scratch based on the textual descriptions reported in geophysics papers and found radical differences in the implementations. The papers were found to be ambiguous at the lexical, syntactic, and semantic level, and not necessarily because the authors were not rigorous but because natural language is inherently ambiguous. We also find that the methods sections of articles mix general methods with specific details of the executions carried out [Gil and Garijo 2017]. Although there are many tools and recommendations of best practices for authors [Stodden et al 2016], it is still up to them to figure out what to include in an article and its methods section. In summary, textual descriptions of methods in articles are far from ideal, since the text tends to be: 1) Incomplete, omitting important details about the computations performed; 2) Ambiguous, having several interpretations of how the computations were actually done; 3) Mingled, interspersing general overviews with execution details.
    Workflows capture unambiguously a computational analysis as a dataflow among steps [Taylor et al 2006]. In prior work, we found that workflow reusability is a major driver for users [Garijo et al 2014a]. Workflow repositories provide mechanisms to publish and search workflows, particularly to improve reproducibility and sharing of computational experiments. However, the descriptions of workflows are manually generated and therefore are as incomplete as those in scientific articles. In prior work we


analyzed the textual descriptions of workflows from one of these repositories [Groth and Gil 2009]. We found significant differences between what was included in the textual descriptions and the actual formal specification of the workflows. A major limitation of workflow representations is that they mix major method steps with ancillary steps that perform, for example, minor data reformatting. Also in previous work, we analyzed workflows to identify by hand general categories of steps (motifs) that make such distinctions [Garijo et al 2014b]. But workflows in themselves have no explicit mention of the relative importance of steps and all steps are treated equally. In summary, although workflows provide a formal computational representation of methods, the workflows themselves are: 1) Incomplete, because workflow representations do not express important semantic properties of steps; 2) Flat, with abstractions often absent from the workflow structure; and 3) Undifferentiated, as there is no explicit distinction between important steps and ancillary steps.
    A recent popular trend is electronic notebooks, such as Jupyter Notebook and Apache Zeppelin, where the advantage is that the text is intermixed with data and code so it is easier to follow step by step how the method actually works. This approach is akin to executable papers, which have been around for some time, such as Sweave and knitr, which combine LaTeX and R [Xie 2015]. However, a reader cannot easily compare two notebooks, since that requires comparing the code line by line, and cannot easily reuse parts of one to create another since the code in notebook cells is not necessarily modular. In addition, although notebooks are easily published and shared, they have not replaced published papers, possibly due to their idiosyncratic formats, which do not yet offer the persistence and archival guarantees required by publishers.
    In summary, in order to understand a published article and assess its validity, reproduce the work, or compare its method to another article, a reader must do a significant amount of work. Even when authors capture computational methods as workflows, there is no guidance on how to facilitate reproducibility and reuse. The next section analyzes specific articles in detail as motivating scenarios, and extracts desiderata for workflow design.

3    AN ANALYSIS OF MULTI-OMICS METHODS
This section motivates our work in the context of three seminal articles in multi-omics: [Zhang et al 2014], which is the first publication of a large-scale multi-omics analysis, [TCGA 2008] and [Imielinski et al 2012] which describe work on genomics that Zhang and colleagues built upon. A detailed analysis of all three articles is provided in [Knoblock 2017].

3.1    Method Descriptions
The methods section of a scientific article describes, together with the supplementary materials, the data and computational steps used for data analysis. We illustrate how methods are typically described using excerpts from [TCGA 2008] for variant calling from resequencing data, where SNPs and indels were screened against dbSNP for position/allele match. First, “Putative variants were identified using Polyphred 6.1, Polyscan 3.0, SNPdetector 3, and SNP Compare. SNPs and indels were screened against dbSNP for position/allele match”. This excerpt describes the software, although it refers to entire packages and not how they are used in the method. Then, “Boundaries of insertion, deletion and complex rearrangements [were] annotated”, and the detailed annotation guidelines are outlined in the article. That excerpt focuses on the science method. Next, “The first step in analysis of the mutation data was to combine the .maf files from all centers into a .mut file containing at most one record for each site-sample pair. In the process of combining the files, care was taken to detect and resolve conflicts between multiple records for the same site-sample”. This excerpt is focused on low-level implementation aspects such as file formats and handling duplicate entries. And finally, “As part of our sequencing pipeline, non-synonymous mutations were subjected to an orthogonal validation or re-sequencing (verification) step to decrease the prevalence of false positives. In our analysis we considered only those mutations that were confirmed by validation or verification to be actual somatic mutations”. This excerpt focuses on the science aspects of the analysis, but it does not specify how each step of the method is implemented by the software packages mentioned earlier.
    This is a common approach to describing methods. Method descriptions in scientific articles mix mentions of software, data formats, and scientific descriptions of the experiment.
    Figure 1 shows a workflow that a biologist created based on the method description in the article. Not surprisingly, the workflow steps are also a mixture of science concepts, software implementation, and data formats. Note that this makes it hard for another scientist to understand and therefore reuse the workflow.
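
    To illustrate how much the third excerpt quoted in Section 3.1 leaves open, the following is a minimal Python sketch of combining per-center mutation records into at most one record per site-sample pair. It is not the procedure used in [TCGA 2008]: the record fields (site, sample, allele), the function name, and the choice to report conflicts rather than resolve them are all assumptions made for illustration, since the excerpt does not state how conflicts were resolved.

```python
from collections import defaultdict


def merge_mutation_records(center_files):
    """Combine records from several centers, keyed by (site, sample).

    center_files maps a center name to a list of record dicts. The field
    names "site", "sample", and "allele" are hypothetical, not taken from
    the .maf format or from the article.
    Returns (merged, conflicts), both keyed by (site, sample).
    """
    grouped = defaultdict(list)
    for center, records in center_files.items():
        for record in records:
            key = (record["site"], record["sample"])
            grouped[key].append({**record, "center": center})

    merged, conflicts = {}, {}
    for key, records in grouped.items():
        alleles = {r["allele"] for r in records}
        if len(alleles) > 1:
            # The excerpt does not say how conflicts were resolved, so this
            # sketch only reports them instead of guessing a policy.
            conflicts[key] = records
        else:
            # All centers agree: keep the first record seen for this pair.
            merged[key] = records[0]
    return merged, conflicts


centers = {
    "center_A": [
        {"site": "chr1:100", "sample": "S1", "allele": "T"},
        {"site": "chr1:200", "sample": "S1", "allele": "G"},
    ],
    "center_B": [
        {"site": "chr1:100", "sample": "S1", "allele": "T"},
        {"site": "chr1:200", "sample": "S1", "allele": "C"},
    ],
}
merged, conflicts = merge_mutation_records(centers)
```

    Any real reimplementation would have to choose a conflict resolution policy at the marked point, which is exactly the kind of detail a reader currently has to guess.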


Figure 1: Computational workflow to annotate variant calling for resequencing data, based on [TCGA 2008]. Each workflow step (square boxes) is described as the software that implements the step.

    A scientist may not be familiar with the different software packages (there are hundreds of packages that are available for this kind of analysis), and therefore would not understand the function of each step. Therefore, the design of workflows should accommodate the separation between the conceptual description of the experiment and the implementation of the experiment in software.

3.2    Software Descriptions
Each step in a method can be described at a conceptual level in terms of the function that it performs, and at an implementation level as the software used for the step. For the articles we analyzed, these descriptions are as follows:
    • Base caller (Phred software): Calling genomic bases from input files.
    • Genome assembly (Consed software): Assembling genome for genomic alignment/features.
    • SNP caller (Polyphred software): Calling SNPs from the input genomic file (variant calling).
    • Indel and SNP caller (PolyScan software): Calling both indels and SNPs (variant calling).
    • Variant annotation (Annotate_MAFFormat software): Annotating variants based on reference genomes.
    • Join data files (mutipleFilesToOne software): Appending multiple text files.
    • Filter data files (filt_MAF_file software): Variant filtering based on input parameters.
    We make a few observations about how the software is presented in the article and used in the computational method.
    Paper descriptions of conceptual steps contain very limited information about how they map to software. Given the capabilities of the software used, conceptual steps may be mapped to several implemented steps, and vice versa. For example, Polyphred and Polyscan are two separate steps; both implement variant calling, but the former does SNP calling only while the latter implements both indel and SNP calling. Therefore, a requirement in designing a workflow is that it must make clear how each step is implemented in software.
    Paper descriptions of software contain limited information about what conceptual steps they implement. For example, [TCGA 2008] says: “Putative variants were identified using Polyphred 6.1, Polyscan 3.0, SNPdetector 3, and SNP Compare.” Four pieces of software are mentioned, but there are no details that specify what types of variants are detected by each of them. Therefore, in designing a workflow, the mapping of conceptual steps to software must be clearly stated.
    A given function can be implemented by many software packages. There are many software packages that provide a desired functionality. As a result, identical functions in different methods may be implemented by different software, making it hard for a scientist to compare workflows. For example, in the workflow in Figure 1 the Consed software is used to perform the genome assembly step. In [Zhang et al 2014], Tophat2 performs this genome assembly step. Therefore, a requirement is that the software steps be described according to their functionality, so that the methods for several papers can be more easily compared by a scientist. Functions should be specified for


each method step so that the correspondences across different software implementations for the same conceptual step will be explicit.
    A given software package has many functions. In comparing software to the workflows built from them, we found that many scientific software packages have a large number of functions. Though it is useful for scientists to have multiple functions in one software package, in research papers it can be difficult to tell what software packages are being used for what functions. Sometimes the functionality of a software package is quite broad. For example, the SAMtools software package, used in [Zhang et al 2014], can be used for Variant Calling and Variant Filtering but the article does not explicitly indicate for what function it is used. Therefore, when specifying what software is used to implement a step in a method, it is vital to indicate the specific function of that software to make it unambiguous what conceptual function the software is implementing.
    A computational step may perform a data reformatting, conversion, or other minor step that is not conceptually important and therefore is not mentioned in the article. Without a description of these steps, it may not be possible to interpret the results appropriately or to reproduce the method.
    In summary, the descriptions of method steps and their implementation in software that are typically found in scientific articles are very ambiguous and incomplete. Computational workflows can eliminate this ambiguity, but they must be intentionally designed to be unambiguous and complete.

3.3    Data Descriptions
Like software, data is described in scientific papers with a mixture of high-level concepts and low-level format references.
    Data is often described based on its format rather than its contents. We saw examples of this in the earlier article excerpts. Therefore, a requirement in the design of workflows is that data abstractions should be used to complement step abstractions.
    Data formats are sometimes used when data is generated in idiosyncratic formats by the specific software used. This can be seen in Figure 1. The Phred software generates output in a format called phd, and as a result the workflow indicates phd_File, which is specific to Phred. Thus, a user of the workflow unfamiliar with Phred would find it hard to understand that format. Therefore, a requirement in describing software steps and data in a workflow is to represent explicitly what formats are imposed by the use of specific software packages.
    Data of the same type can play very different roles in a method. In the workflow in Figure 1, the varAnnotParams input is an annotated variant parameter file but this type is not represented explicitly. Moreover, it is also not the only annotated file that is input to this method, but is the only one with a name that mentions annotation. Therefore, a requirement is to describe data conceptually according to the type of data contained, and that different data used or generated in the workflow be related by those types.
    Data results of the same type may be combined, filtered, or sorted in ways that are not considered important to mention in the paper. Readers must hypothesize these data manipulations.

3.4    Discussion
Through examples we have illustrated that the text descriptions in the methods sections of articles make the methods hard to reconstruct and replicate as an unambiguous and complete workflow. This is because papers describe methods in a mix of high-level conceptual terms together with mentions of specific software and formats. This makes it hard to understand and compare methods. Another observation is that different readers might be interested in different descriptions of the methods, some more abstract and some more specific. For example, a developer would be interested in data formats and software versions, while a biologist would be more interested in the overall statistical approach used.
    Ideally, method descriptions would make clear distinctions between high-level conceptual terminology and implementation terms, both for software and for data. In addition, method descriptions would make it clear what function each step performs, and whether a given function is implemented by a single step or by a set of steps. These desiderata lead us to propose workflow abstractions for describing computational experiments in a paper.

4    WORKFLOW ABSTRACTIONS
A computational method is typically described in terms of the specific software, data, and formats used. However, there are many ways to describe a method conceptually. This section describes different ways to design workflow abstractions that would be useful to make methods more understandable and comparable. Table 1 summarizes the issues identified earlier and the corresponding proposed abstractions to address them.

K-CAP’17 SciKnow, December 2017, Austin, TX, USA                                                             Y. Gil et al.


Table 1. An overview of the issues identified in the papers analyzed and
abstractions proposed to address them.

Issue identified                                  Abstraction approach
1) Method descriptions in scientific articles     Step abstractions
   mix mentions of software, data formats, and
   scientific descriptions of the experiment
2) Paper descriptions of conceptual steps         Sub-workflow and step
   contain very limited information about how     abstractions
   they map to software
3) Paper descriptions of software contain         Sub-workflow and step
   limited information about what conceptual      abstractions
   steps they implement
4) A given function can be implemented by         Step abstractions
   many software packages
5) A given software package has many              Step abstractions
   functions
6) A computational step may perform a data        Criticality abstractions
   reformatting, conversion, or other minor
   step that is not conceptually important
7) Data is often described based on its           Data abstractions
   format rather than its contents
8) Data formats are sometimes used when           Data abstractions
   data is generated in idiosyncratic formats
   by the specific software used
9) Data of the same type can play very            Data abstractions
   different roles in a method
10) Data results of the same type may be          Criticality abstractions
   combined, filtered, or sorted in ways that
   are not considered important to mention in
   the paper

4.1    Step Abstractions
A computational workflow can be described at a conceptual level in terms of
the functions that each step carries out. Figure 2 describes the same
workflow introduced in Figure 1. While Figure 1 describes the software
implementation of each step, Figure 2 characterizes the function of each
step.

Figure 2: A computational workflow that corresponds to the workflow in
Figure 1 but where each step is described conceptually.

    At the same time, the software steps in Figure 1 and the conceptual
steps of Figure 2 should be mapped to one another. This can be done through
a hierarchy of component functions, which defines many conceptual functions
at different levels of detail. The hierarchy bottoms out with mentions of
software that implements the parent function. Note that there may be several
software implementations of the same abstract function.
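
As a minimal sketch of how a hierarchy of component functions could relate
software steps to conceptual functions, the Python fragment below stores the
hierarchy as child-to-parent links and walks it in both directions. The
software and function names come from the example workflow discussed in this
paper. The shape of the tree, the dictionary PARENT, and the helper names
abstract_step and implementations are illustrative assumptions, and the
fragment is not the algorithm described in [Gil et al 2011].

```python
# Hypothetical hierarchy of component functions, stored as child -> parent.
# Leaves are software implementations; inner nodes are abstract functions.
PARENT = {
    "Polyphred": "SNP calling",
    "Polyscan": "indel calling",
    "SNP calling": "variant calling",
    "indel calling": "variant calling",
}


def ancestors(node):
    """Return the chain of increasingly general functions above a node."""
    chain = []
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain


def abstract_step(software, levels_up=1):
    """Replace a software step by the function found levels_up above it.

    Steps that are not in the hierarchy are returned unchanged, and the
    climb stops at the root if levels_up exceeds the depth of the node.
    """
    chain = ancestors(software)
    if not chain or levels_up <= 0:
        return software
    return chain[min(levels_up, len(chain)) - 1]


def implementations(function):
    """List the software leaves that can implement an abstract function."""
    children = [child for child, parent in PARENT.items() if parent == function]
    if not children:
        return [function]  # a leaf: the node is itself a software step
    found = []
    for child in children:
        found.extend(implementations(child))
    return found


# A grounded workflow fragment, written as an ordered list of software steps.
grounded_workflow = ["Polyphred", "Polyscan"]

# Generalize the grounded steps by one level, then by two levels.
print([abstract_step(step) for step in grounded_workflow])
print([abstract_step(step, levels_up=2) for step in grounded_workflow])

# Specialize an abstract function into its candidate software.
print(implementations("variant calling"))
```

Walking up the links generalizes a grounded step, and walking down lists the
alternative software for an abstract function, which is the case where
several implementations share the same parent.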


A Workflow Design Methodology to Improve Reproducibility            K-CAP’17 SciKnow, December 2017, Austin, TX, USA

Figure 3: A hierarchy of component functions to describe method steps. The
software steps in Figure 1 are shown in dark blue, and the abstract steps in
Figure 2 are shown in green, both from [TCGA 2008]. Additional steps in
[Imielinski et al 2012] are shown in purple, and those in [Zhang et al 2014]
are shown in light blue.

    Figure 3 shows a hierarchy of component functions for the steps in
Figures 1 and 2. The function in a given node represents a more specific
function than the function of its parent node. The hierarchy also includes
steps for the other two articles that we analyzed. Using this hierarchy, it
becomes possible to relate the method steps of the three articles.
    When designing a workflow, two distinct types of workflows should be
created. One type of workflow is an abstract workflow, with abstract
components that correspond to the more general functions in the hierarchy.
These abstract workflows capture the general functionality of methods, and
they would be independent of the software used to implement them. A second
type of workflow would be a grounded workflow, which would specify what
software is used to implement each step.
    We find that in practice it is hard to create a complete hierarchy of
component functions before creating the workflows. We recommend an iterative
process, where an initial hierarchy is created and then refined as the
workflows are fleshed out.
    Depending on the depth of the hierarchy of component functions, there
could be several abstract workflows with different levels of detail and
generality. Each abstract workflow may be useful to a different reader,
depending on the level of detail that they are looking to find. At the same
time, if the workflow contains descriptions of the steps that are too
general, it may not be very helpful to a reader. Workflow designers should
design appropriate conceptual levels.
    A hierarchy of component functions becomes a powerful enabler for
automation. Given a concrete workflow, the hierarchy could be used to
generate abstract workflows automatically. Conversely, given an abstract
workflow, the hierarchy could be used to specialize it and create a concrete
workflow. [Gil et al 2011] describe algorithms to do this kind of
automation.

4.2    Sub-Workflows
Several components may implement different aspects of the same function. For
example, in the workflow of Figures 1 and 2 the Polyphred software and the
Polyscan software implement SNP calling and indel calling respectively,
which are two aspects of variant calling. The software Annotate_MafFormat
annotates the resulting variants with respect to reference genomes. All
three steps could be considered as a sub-workflow, with an overarching
abstract function of detecting and annotating variants.
    A knowledge base of sub-workflows would capture these functional
decompositions. A sub-workflow would consist of a root abstract component,
which indicates the overarching abstract function, and a workflow fragment
that decomposes that function into a set of components at a lower level of
abstraction and the dataflow among them. Data abstractions should be taken
into account as well, since the sub-workflows express functions of different
abstraction levels. We discuss data abstractions below.
    When designing a workflow, steps that are functionally related should be
organized as sub-workflows. There may be alternative ways to group steps in
a workflow. Workflow designers should make decisions based on the expected
use of the sub-workflow decompositions by readers. The knowledge base of
sub-workflows could be dynamically extended based on a growing corpus of
workflows created by users. [Garijo et al 2014c] describe techniques to
detect workflow fragments automatically.

4.3    Criticality
Some steps in a workflow perform functions that are critical to the overall
computational method, while other steps carry out minor format conversions
and other ancillary functions. For example, the workflow in Figures 1 and 2
has a step to merge several files. Other workflows have reformatting steps,
unit conversion steps, and other functions that manage the details of how
the data is implemented. When describing a method in a paper, these
ancillary functions are rarely mentioned. There may be


different degrees of criticality, depending on how much detail each reader
is interested in seeing.
    This kind of abstraction could be captured in a hierarchy of criticality
levels. This hierarchy would identify the importance of including a step in
a scientific description of a method. [Garijo et al 2014b] describe an
approach to identifying criticality based on a library of workflow motifs
that include data pre-processing, visualization, and format conversion.
Criticality levels are highly dependent on the specific domain, but a broad
methodology to design those categories could be developed more generally.

4.4    Data Abstractions
Data type abstractions should be included in all three hierarchies above.
The data type in a node would represent data that is of a more specific type
than its parent node, for example because it is of a subtype or has more
specific metadata properties. In the hierarchy of component functions, each
abstract component function should specify inputs and outputs in terms of
those general types. At the bottom of the hierarchy, a component is
specified with a specific software invocation, including the exact command
line call to invoke the software and all the input data types and formats
that the software expects. In the hierarchy of sub-workflows, the root
component may refer to data types that are more abstract than those of the
workflow fragment.
    Data abstractions depend on the domain. In multi-omics, there are many
aspects of data that can be described in very specific terms but can be
abstracted away when describing an experiment in scientific terms.
Characteristics of a dataset that can lead to useful data abstractions
include: 1) type of sequence, such as RNA, DNA, etc.; 2) annotations on
those sequences, such as indels, CNVs, SNPs, etc.; 3) formats that are often
imposed by how software works, such as FASTA, MAF, phd, etc.; 4) level of
detail or accuracy on the sequences, for example sequences obtained with
next-generation sequencing machines are more accurate; 5) the role of a
dataset for a specific component, for example a sequence can be a patient
sequence or a reference sequence.
    Workflow designers should create a taxonomy of data abstractions that
facilitates the abstractions needed for the three hierarchies discussed
earlier. In our work, we have found that a proliferation of data types makes
the creation of workflows more complex. Instead, we create properties for
describing the different characteristics of data.

5    CONCLUSIONS
This paper motivates the need for capturing abstractions in the design of
scientific workflows. These abstractions are based on our analysis of
published articles and the workflows created to reconstruct their methods.
The proposed abstractions are captured in hierarchies of component functions
and criticality as well as knowledge bases of sub-workflows, and need to be
supported by data abstractions. Using these abstractions, different
workflows can be created to describe the same computation for readers with
different interests. In future work, we plan to develop these abstractions
for a target domain and associated publications, in order to demonstrate
their benefits.

Acknowledgements. We gratefully acknowledge support from the Defense
Advanced Research Projects Agency through the SIMPLEX program with award
W911NF-15-1-0555, the National Institutes of Health with awards 1U01CA196387
and 1R01GM117097, and the Canary Foundation.

REFERENCES
[Baggerly and Coombes 2009] Baggerly KA, and KR Coombes. “Deriving
Chemosensitivity from Cell Lines: Forensic Bioinformatics and Reproducible
Research in High-Throughput Biology.” Annals of Applied Statistics 3 (4),
2009.
[Donoho et al 2009] Donoho DL, Maleki A, Rahman IU, Shahram M, and V
Stodden. “Reproducible Research in Computational Harmonic Analysis.”
Computing in Science & Engineering 11 (1): 8–18, 2009.
[Garijo et al 2013] Garijo D, Kinnings S, Xie L, Xie L, Zhang Y, Bourne PE,
and Y Gil. “Quantifying Reproducibility in Computational Biology: The Case
of the Tuberculosis Drugome.” PLoS ONE 8 (11), 2013.
[Garijo et al 2014a] Garijo D, Corcho O, Gil Y, Braskie MN, Hibar D, Hua X,
Jahanshad N, Thompson P, and Toga AW. “Workflow Reuse in Practice: A Study
of Neuroimaging Pipeline Users.” Proceedings of the 10th IEEE International
Conference on e-Science, 2014.
[Garijo et al 2014b] Garijo D, Alper P, Belhajjame K, Corcho O, Gil Y, and C
Goble. “Common Motifs in Scientific Workflows: An Empirical Analysis.”
Future Generation Computer Systems 36, 2014.
[Garijo et al 2014c] Garijo D, Corcho O, Gil Y, Gutman BA, Dinov ID,
Thompson P, and AW Toga. “FragFlow: Automated Fragment Detection in Scientific



Workflows.” Proceedings of the 10th IEEE International Conference on
e-Science, 2014.
[Garijo et al 2017] Garijo D, Gil Y, and O Corcho. “Abstract, Link, Publish,
Exploit: An End to End Framework for Workflow Sharing.” Future Generation
Computer Systems, 2017.
[Gil 2015] Gil, Y. “Human Tutorial Instruction in the Raw.” ACM Transactions
on Interactive Intelligent Systems, 5 (1): 1–29, 2015.
[Gil and Garijo 2017] Gil Y, and D Garijo. “Towards Automating Data
Narratives.” Proceedings of the ACM Conf. on Intelligent User Interfaces,
2017.
[Gil et al 2011] Gil Y, Gonzalez-Calero PA, Kim J, Moody J, and V. Ratnakar.
“A Semantic Framework for Automatic Generation of Computational Workflows
Using Distributed Data and Component Catalogs.” Journal of Experimental and
Theoretical Artificial Intelligence, 23(4), 2011.
[Groth and Gil 2009] Groth P and Y Gil. “Analyzing the Gap between Workflows
and Their Natural Language Descriptions.” Proceedings of the IEEE
International Workshop on Scientific Workflows (SWF), 2009.
[Knoblock 2017] Knoblock M. “Designing Useful Abstractions for Multi-Omics
Data Analysis.” Technical Report, Information Sciences Institute, University
of Southern California, October 2017.
[Imielinski et al 2012] Imielinski M, Berger AH, Hammerman PS, Hernandez B,
et al. “Mapping the hallmarks of lung adenocarcinoma with massively parallel
sequencing.” Cell 150 (6): 1107–20, 2012.
[Ince et al 2012] Ince DC, Hatton L, and J Graham-Cumming. “The Case for
Open Computer Programs.” Nature, Vol 482, 2012.
[Ioannidis et al 2009] Ioannidis JPA, Allison DB, et al. “Repeatability of
Published Microarray Gene Expression Analyses.” Nature Genetics 41 (2),
2009.
[TCGA 2008] The Cancer Genome Atlas (TCGA) collaboration. “Comprehensive
Genomic Characterization Defines Human Glioblastoma Genes and Core
Pathways.” Nature, 455, 1061-1068, 23 October 2008.
[Stodden et al 2016] Stodden V, McNutt M, Bailey DH, Deelman E, Gil Y,
Hanson B, Heroux MA, Ioannidis JP, and M Taufer. “Enhancing Reproducibility
for Computational Methods.” Science, 354, 2016.
[Steehouder et al 2000] Steehouder, M., Karreman, J. and Ummelen, N. Making
sense of step-by-step procedures. Proceedings of 2000 Joint IEEE
International and 18th Annual Conference on Computer Documentation
(IPCC/SIGDOC 2000).
[Taylor et al 2006] Taylor IJ, Deelman E, Gannon DB, and M Shields.
“Workflows for e-Science: scientific workflows for grids.” Springer, 2006.
[Van Noorden 2015] Van Noorden, R. Sluggish data sharing hampers
reproducibility effort. Nature, 2015.
[Xie 2015] Y Xie. “Dynamic Documents with R and knitr.” CRC Press, 2015.
[Zhang et al 2014] Zhang B, Wang J, Wang X, et al. “Proteogenomic
Characterization of Human Colon and Rectal Cancer.” Nature 513 (7518):
382–87, 2014.

