<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Stodden V, McNutt M, Bailey DH,
Deelman E, Gil Y, Hanson B, Heroux MA, Ioannidis JP,
and M Taufer. “Enhancing Reproducibility for
Computational Methods.” Science</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Improving Publication and Reproducibility of Computational Experiments through Workflow Abstractions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yolanda Gil</string-name>
          <email>gil@isi.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alyssa Deng</string-name>
          <email>shipingd@andrew.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Garijo</string-name>
          <email>dgarijo@isi.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ravali Adusumilli</string-name>
          <email>ravali@stanford.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Parag Mallick</string-name>
          <email>paragm@stanford.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Margaret Knoblock</string-name>
          <email>mrk022@bucknell.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Varun Ratnakar</string-name>
          <email>varunr@isi.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Sciences Institute, University of Southern California</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Medicine, Stanford University</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2000</year>
      </pub-date>
      <volume>354</volume>
      <issue>2016</issue>
      <fpage>1061</fpage>
      <lpage>1068</lpage>
      <abstract>
        <p>The current practice of publishing articles solely containing textual descriptions of methods is error prone and incomplete. Even when a reproducible workflow or notebook is linked to an article, the text of the article is not well integrated with those computational components, and the workflow and notebook are focused mostly on implementation details that are disconnected from the scientific approach described in the text of the article. Through an analysis of three multi-omics articles, we illustrate why this makes it difficult to understand, reproduce, compare, and reuse computational methods. We propose workflow abstractions that capture different concepts and perspectives that are important to scientists. These abstractions connect the text of an article to the corresponding workflow, and provide a framework to improve the publication and reproducibility of computational experiments.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Information systems → Artificial intelligence; Knowledge
representation and reasoning</p>
      <p>Keywords: Reproducibility, semantic workflows, semantic science</p>
      <sec id="sec-1-1">
        <title>INTRODUCTION</title>
        <p>The reproducibility crisis in science has received
significant attention. Reproducibility requires that methods
are described with enough details to repeat the experiment
in an independent lab or setting. For computational
experiments, studies show that published papers often
provide insufficient information about the data, protocols,
software, and overall method used to obtain the new results
[Van Noorden 2015]. A major barrier to reproducibility
can be traced to the traditional, unstructured format of
publications’ “materials and methods” sections. The
ambiguity, imprecision, and linearity of text make natural
language descriptions of computational analyses inadequate
for reproducible research [Steehouder et al 2000; Garijo et
al 2013; Gil 2015; Groth and Gil 2009]. A major problem
is that there is no guidance or methodology to describe
computational methods in articles. It is unclear what the
intent of the descriptions in methods sections is. Is the goal
to provide a step-by-step account of the procedures taken,
parameters employed, and data provenance such that a
study might be reproduced? Alternately, is the goal to
provide a high-level intuition for the steps that were
performed? Both are valuable but are incompatible
objectives in current text-based descriptions, leading to
neither an intuitive reading experience nor a reproducible
description.</p>
        <p>Computational workflows and notebooks can be used to
organize and record computational methods, and are often
linked to publications. Workflows capture the dataflow
among computations, so the different steps of the method
are explicitly represented and linked. Notebooks are
composed of cells that contain either text or code that can
be easily re-run. Workflows and notebooks facilitate the
documentation of the software and the structure of a
computational method. But even when workflows and
notebooks are used, the text for those publications is always
manually generated by the authors and often inadequately
captures the full complexity of an analysis, leading to poor
reproducibility. In prior work, we developed an approach
to automatically create descriptions of computational
methods by generating text from workflows [Gil and Garijo
2017], where the text accurately represents what was done
and can be presented from different perspectives. The text,
however, can only be as good as the workflows that it was
generated from. This motivates the need for a workflow
design methodology that leads to workflow representations
that support the automated generation of explanations as
text that can be used in publications.</p>
        <p>This paper proposes abstractions designed to improve
the publication of computational methods to facilitate
reproducibility. This methodology extends our prior work
on representing and publishing workflow abstractions using
community standards for workflows and provenance
[Garijo et al 2017].</p>
        <p>The paper begins with an overview of related work on
reproducibility and publication of computational methods.
We then present an analysis of the computational methods
described in three seminal papers in cancer omics. We
introduce the abstractions proposed, and discuss their
merits in improving the descriptions of computational
methods in scientific publications.</p>
      </sec>
      <sec id="sec-1-4">
        <title>2 PUBLICATION AND REPRODUCIBILITY OF COMPUTATIONAL METHODS</title>
        <p>
          Textual descriptions of methods in articles may be
incomplete
          <xref ref-type="bibr" rid="ref2 ref2">(e.g., [Ioannidis et al 2009; Donoho et al
2009])</xref>
          . Authors focus on conveying the major
contributions of the work and describe the methods in that
light, omitting details that may be important for
transparency and reproducibility. For example, [Garijo et al
2013] describes our work to reproduce an article for which
the authors had provided the data, software, and results to
facilitate reproducibility. We created reproducibility maps
that showed how well different categories of users could figure out
from the text of the paper how the work was done. The
reproducibility maps showed that only researchers with the
same level of expertise in the subject as the authors were
able to figure out how to fully reproduce the work. There
are many similar results in the literature, some mentioning
the lack of publication of data [Ioannidis et al 2009] and
others the lack of details in the description of methods
leading to “exercises in ‘forensic bioinformatics’ where
aspects of raw data and reported results are used to infer
what methods must have been employed” [Baggerly and
Coombes 2009]. There are several reasons why text
descriptions of methods are riddled with problems. First,
articles often have space limitations, so authors tend to omit
anything that seems unimportant. Second, because the descriptions are
manually written without any particular guidance, it is easy
for authors to provide imprecise descriptions. Finally,
computational methods are often complex procedures with
non-linear structures that are hard to describe with the
sequential nature of text [Gil 2015]. Even when authors
endeavor to describe enough details, textual descriptions
are often ambiguous. A study reported in [Ince et al 2012]
looked at writing software from scratch based on the
textual descriptions reported in geophysics papers and
found radical differences in the implementations. The
papers were found to be ambiguous at the lexical, syntactic,
and semantic level, and not necessarily because the authors
were not rigorous but because natural language is
inherently ambiguous. We also find that the methods
sections of articles mix general methods with specific
details of the executions carried out [Gil and Garijo 2017].
Although there are many tools and recommendations of
best practices for authors [Stodden et al 2016], it is still up
to them to figure out what to include in an article and its
methods section. In summary, textual descriptions of
methods in articles are far from ideal, since the text tends to
be: 1) Incomplete, omitting important details about the
computations performed; 2) Ambiguous, having several
interpretations of how the computations were actually
done; 3) Mingled, interspersing general overviews with
execution details.
        </p>
        <p>Workflows capture unambiguously a computational
analysis as a dataflow among steps [Taylor et al 2006]. In
prior work, we found that workflow reusability is a major
driver for users [Garijo et al 2014a]. Workflow repositories
provide mechanisms to publish and search workflows,
particularly to improve reproducibility and sharing of
computational experiments. However, the descriptions of
workflows are manually generated and therefore are as
incomplete as those in scientific articles. In prior work we
analyzed the textual descriptions of workflows from one of
these repositories [Groth and Gil 2009]. We found
significant differences between what was included in the
textual descriptions and the actual formal specification of
the workflows. A major limitation of workflow
representations is that they mix major method steps with
ancillary steps that perform, for example, minor data reformatting.
Also in previous work, we analyzed workflows to identify
by hand general categories of steps (motifs) that make such
distinctions [Garijo et al 2014b]. But workflows in
themselves have no explicit mention of the relative
importance of steps and all steps are treated equally. In
summary, although workflows provide a formal
computational representation of methods, the workflows
themselves are: 1) Incomplete, because workflow
representations do not express important semantic
properties of steps; 2) Flat, with abstractions often absent
from the workflow structure; and 3) Undifferentiated, as
there is no explicit distinction between important steps and
ancillary steps.</p>
        <p>A recent popular trend is electronic notebooks, such as
Jupyter Notebook and Apache Zeppelin, where the
advantage is that the text is intermixed with data and code
so it is easier to follow step by step how the method
actually works. This approach is akin to executable papers
which have been around for some time, such as Sweave
and knitr, which combine LaTeX and R [Xie 2015].
However, a reader cannot easily compare two notebooks,
since that requires comparing the code line by line, and
cannot easily reuse parts of one to create another since the
code in notebook cells is not necessarily modular. In
addition, although notebooks are easily published and
shared they have not replaced published papers, possibly
due to their idiosyncratic formats which do not yet offer the
persistence and archival guarantees required by publishers.</p>
        <p>In summary, to understand a published article,
assess its validity, reproduce the work, or compare
its method to that of another article, a reader must do a significant
amount of work. Even when authors capture computational
methods as workflows, there is no guidance on how to
facilitate reproducibility and reuse. The next section
analyzes specific articles in detail as motivating scenarios,
and extracts desiderata for workflow design.</p>
      </sec>
      <sec id="sec-1-6">
        <title>3 AN ANALYSIS OF MULTI-OMICS METHODS</title>
        <p>This section motivates our work in the context of three
seminal articles in multi-omics: [Zhang et al 2014], which
is the first publication of a large-scale multi-omics analysis,
and [TCGA 2008] and [Imielinski et al 2012], which describe
work on genomics that Zhang and colleagues built upon. A
detailed analysis of all three articles is provided in
[Knoblock 2017].</p>
      </sec>
      <sec id="sec-1-7">
        <title>3.1 Method Descriptions</title>
        <p>The methods section of a scientific article describes,
together with the supplementary materials, the data and
computational steps used for data analysis. We illustrate
how methods are typically described using excerpts from
[TCGA 2008] for variant calling from resequencing data,
where SNPs and indels were screened against dbSNP for
position/allele match. First, “Putative variants were
identified using Polyphred 6.1, Polyscan 3.0, SNPdetector
3, and SNP Compare. SNPs and indels were screened
against dbSNP for position/allele match”. This excerpt
describes the software, although it refers to entire packages
and not how they are used in the method. Then,
“Boundaries of insertion, deletion and complex
rearrangements [were] annotated”, and the detailed
annotation guidelines are outlined in the article. That
excerpt focuses on the science method. Next, “The first
step in analysis of the mutation data was to combine the
.maf files from all centers into a .mut file containing at
most one record for each site-sample pair. In the process of
combining the files, care was taken to detect and resolve
conflicts between multiple records for the same
site-sample”. This excerpt is focused on low-level
implementation aspects such as file formats and handling
duplicate entries. And finally, “As part of our sequencing
pipeline, non-synonymous mutations were subjected to an
orthogonal validation or re-sequencing (verification) step
to decrease the prevalence of false positives. In our
analysis we considered only those mutations that were
confirmed by validation or verification to be actual somatic
mutations”. This excerpt focuses on the science aspects of
the analysis, but it does not specify how each step of the
method is implemented by the software packages
mentioned earlier.</p>
        <p>This is a common approach to describing methods.</p>
        <sec id="sec-1-7-1">
          <title>Method descriptions in scientific articles mix mentions of software, data formats, and scientific descriptions of the experiment.</title>
          <p>Figure 1 shows a workflow that a biologist created
based on the method description in the article. Not
surprisingly, the workflow steps are also a mixture of science
concepts and software implementation and data formats.
Note that this makes it hard for another scientist to
understand and therefore reuse the workflow.</p>
          <p>A scientist may not be familiar with the different
software packages (there are hundreds of packages that are
available for this kind of analysis), and therefore would not
understand the function of each step. Therefore, the design
of workflows should accommodate the separation between
the conceptual description of the experiment and the
implementation of the experiment in software.</p>
        </sec>
      </sec>
      <sec id="sec-1-8">
        <title>3.2 Software Descriptions</title>
        <p>Each step in a method can be described at a conceptual
level in terms of the function that it performs, and at an
implementation level as the software used for the step. For
the articles we analyzed, these descriptions are as follows:
          <list list-type="bullet">
            <list-item><p>Base caller (Phred software): Calling genomic bases from input files.</p></list-item>
            <list-item><p>Genome assembly (Consed software): Assembling genome for genomic alignment/features.</p></list-item>
            <list-item><p>SNP caller (Polyphred software): Calling SNPs from the input genomic file (variant calling).</p></list-item>
            <list-item><p>Indel and SNP caller (PolyScan software): Calling both indels and SNPs (variant calling).</p></list-item>
            <list-item><p>Variant annotation (Annotate_MAFFormat software): Annotating variants based on reference genomes.</p></list-item>
            <list-item><p>Join data files (mutipleFilesToOne software): Appending multiple text files.</p></list-item>
            <list-item><p>Filter data files (filt_MAF_file software): Variant filtering based on input parameters.</p></list-item>
          </list></p>
        <p>We make a few observations about how the software is
presented in the article and used in the computational
method.</p>
        <sec id="sec-1-8-1">
          <title>Paper descriptions of conceptual steps contain very limited information about how they map to software.</title>
          <p>Given the capabilities of the software used, conceptual
steps may be mapped to several implemented steps, and
vice versa. For example, Polyphred and Polyscan are two
separate steps; both implement variant calling, but the
former does SNP calling only while the latter implements
both indel and SNP calling. Therefore, a requirement in
designing a workflow is that it must make clear how each
step is implemented in software.</p>
        </sec>
        <sec id="sec-1-8-2">
          <title>Paper descriptions of software contain limited information about what conceptual steps they implement.</title>
          <p>For example, [TCGA 2008] says: “Putative
variants were identified using Polyphred 6.1, Polyscan 3.0,
SNPdetector 3, and SNP Compare.” Four pieces of
software are mentioned, but there are no details that specify
what types of variants are detected by each of them.
Therefore, in designing a workflow, the mapping of
conceptual steps to software must be clearly stated.</p>
        </sec>
        <sec id="sec-1-8-3">
          <title>A given function can be implemented by many software packages.</title>
          <p>There are many software packages that
provide a desired functionality. As a result, identical
functions in different methods may be implemented by
different software, making it hard for a scientist to compare
workflows. For example, in the workflow in Figure 1 the
Consed software is used to perform the genome assembly
step. In [Zhang et al 2014], Tophat2 performs this genome
assembly step. Therefore, a requirement is that the software
steps be described according to their functionality, so that
the methods for several papers can be more easily
compared by a scientist. Functions should be specified for
each method step so that the correspondences across
different software implementations for the same conceptual
step will be explicit.</p>
          <p>A given software package has many functions. In
comparing software to the workflows built from them, we
found that many scientific software packages have a large
number of functions. Though it is useful for scientists to
have multiple functions in one software package, in
research papers it can be difficult to tell what software
packages are being used for what functions. Sometimes the
functionality of a software package is quite broad. For
example, the SAMtools software package, used in [Zhang
et al 2014], can be used for Variant Calling and Variant
Filtering but the article does not explicitly indicate for what
function it is used. Therefore, when specifying what
software is used to implement a step in a method, it is vital
to indicate the specific function of that software to make it
unambiguous what conceptual function the software is
implementing.</p>
        </sec>
        <sec id="sec-1-8-4">
          <title>A computational step may perform a data reformatting, conversion, or other minor step that is not conceptually important and therefore is not mentioned in the article.</title>
          <p>Without a description of these steps, it may
not be possible to interpret the results appropriately or to
reproduce the method.</p>
          <p>In summary, the descriptions of method steps and their
implementation in software that are typically found in
scientific articles are very ambiguous and incomplete.
Computational workflows can eliminate this ambiguity, but
they must be intentionally designed to be unambiguous and
complete.</p>
        </sec>
      </sec>
      <sec id="sec-1-9">
        <title>3.3 Data Descriptions</title>
        <p>Like software, data is described in scientific papers with a
mixture of high-level concepts and low-level format
references.</p>
        <sec id="sec-1-9-1">
          <title>Data is often described based on its format rather than its contents.</title>
          <p>We saw examples of this in the earlier
article excerpts. Therefore, a requirement in the design of
workflows is that data abstractions should be used to
complement step abstractions.</p>
        </sec>
        <sec id="sec-1-9-2">
          <title>Data formats are sometimes used when data is generated in idiosyncratic formats by the specific software used.</title>
          <p>This can be seen in Figure 1. The Phred software
generates output in a format called phd, and as a result the
workflow indicates phd_File which is specific to Phred.
Thus, a user of the workflow unfamiliar with Phred would
find it hard to understand that format. Therefore, a
requirement in describing software steps and data in a
workflow is to represent explicitly what formats are
imposed by the use of specific software packages.</p>
          <p>Data of the same type can play very different roles
in a method. In the workflow in Figure 1, the
varAnnotParams input is an annotated variant parameter
file but this type is not represented explicitly. Moreover, it
is also not the only annotated file that is input to this
method, but is the only one with a name that mentions
annotation. Therefore, a requirement is to describe data
conceptually according to the type of data contained, and
that different data used or generated in the workflow be
related by those types.</p>
        </sec>
        <sec id="sec-1-9-3">
          <title>Data results of the same type may be combined, filtered, or sorted in ways that are not considered important to mention in the paper.</title>
          <p>Readers must
hypothesize these data manipulations.</p>
        </sec>
      </sec>
      <sec id="sec-1-10">
        <title>3.4 Discussion</title>
        <p>Through examples we have illustrated that the textual
descriptions in the methods sections of articles are
hard to reconstruct and replicate as an unambiguous and
complete workflow. This is because papers describe
methods in a mix of high-level conceptual terms together
with mentions of specific software and formats. This
makes it hard to understand and compare methods.
Another observation is that different readers might be
interested in different descriptions of the methods, some
more abstract and some more specific. For example, a
developer would be interested in data formats and software
versions, while a biologist would be more interested in the
overall statistical approach used.</p>
        <p>Ideally, method descriptions would make clear
distinctions between high-level conceptual terminology and
implementation terms, both for software and for data. In
addition, method descriptions would make it clear what
function each step performs, and whether a given function
is implemented by a single step or by a set of steps. These
desiderata lead us to propose workflow abstractions for
describing computational experiments in a paper.</p>
      </sec>
      <sec id="sec-1-11">
        <title>4 WORKFLOW ABSTRACTIONS</title>
        <p>A computational method is typically described in terms of
the specific software, data, and formats used. However,
there are many ways to describe a method
conceptually. This section describes different ways to
design workflow abstractions that would be useful to make
methods more understandable and comparable. Table 1
summarizes the issues identified earlier and the
corresponding proposed abstractions to address them.</p>
        <table-wrap>
          <label>Table 1</label>
          <caption>
            <p>Issues identified in method descriptions and the abstractions proposed to address them.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Issue</th><th>Abstraction approach</th></tr>
            </thead>
            <tbody>
              <tr><td>1) Method descriptions in scientific articles mix mentions of software, data formats, and scientific descriptions of the experiment</td><td>Step abstractions</td></tr>
              <tr><td>2) Paper descriptions of conceptual steps contain very limited information about how they map to software</td><td>Subworkflow and step abstractions</td></tr>
              <tr><td>3) Paper descriptions of software contain limited information about what conceptual steps they implement</td><td>Subworkflow and step abstractions</td></tr>
              <tr><td>4) A given function can be implemented by many software packages</td><td>Step abstractions</td></tr>
              <tr><td>5) A given software package has many functions</td><td>Step abstractions</td></tr>
              <tr><td>6) A computational step may perform a data reformatting, conversion, or other minor step that is not conceptually important</td><td>Criticality abstractions</td></tr>
              <tr><td>7) Data is often described based on its format rather than its contents</td><td>Data abstractions</td></tr>
              <tr><td>8) Data formats are sometimes used when data is generated in idiosyncratic formats by the specific software used</td><td>Data abstractions</td></tr>
              <tr><td>9) Data of the same type can play very different roles in a method</td><td>Data abstractions</td></tr>
              <tr><td>10) Data results of the same type may be combined, filtered, or sorted in ways that are not considered important to mention in the paper</td><td>Criticality abstractions</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-1-12">
        <title>4.1 Step Abstractions</title>
        <p>A computational workflow can be described at a conceptual
level in terms of the functions that each step carries out.
Figure 2 describes the same workflow introduced in Figure
1. While Figure 1 describes the software implementation
of each step, Figure 2 characterizes the function of each
step.</p>
        <sec id="sec-1-12-1">
          <p>At the same time, the software steps in Figure 1 and the
conceptual steps of Figure 2 should be mapped to one
another. This can be done through a hierarchy of
component functions, which defines many conceptual
functions at different levels of detail. The hierarchy
bottoms out with mentions of software that implements the
parent function. Note that there may be several software
implementations of the same abstract function.</p>
          <p>Figure 3 shows a hierarchy of component functions for
the steps in Figures 1 and 2. The function in a given node
represents a more specific function than the function of its
parent node. It also includes steps for the other two articles
that we analyzed. Using this hierarchy, it becomes possible
to relate the method steps of the three articles.</p>
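          <p>As an illustration only, a hierarchy of component functions can be sketched as a parent map in which software packages are the leaves. The node names below come from the examples in this paper, but their placement in the tree is an assumption made for this sketch and does not reproduce Figure 3. The helper shared_function is a hypothetical name; it returns the most specific function that two nodes have in common, which is one way to relate steps across articles.</p>
          <preformat preformat-type="code">
```python
# Hypothetical fragment of a hierarchy of component functions.
# Each key is a function or software name; each value is its parent.
# The placement of nodes is an illustrative assumption, not Figure 3.
PARENT = {
    "Variant calling": None,
    "SNP calling": "Variant calling",
    "Indel and SNP calling": "Variant calling",
    "Polyphred": "SNP calling",
    "PolyScan": "Indel and SNP calling",
    "Genome assembly": None,
    "Consed": "Genome assembly",
    "Tophat2": "Genome assembly",
}


def ancestors(node):
    """Return the chain from a node up to its root function, node first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def shared_function(a, b):
    """Return the most specific function covering both nodes, or None."""
    seen = set(ancestors(a))
    for candidate in ancestors(b):
        if candidate in seen:
            return candidate
    return None


if __name__ == "__main__":
    # Software used in different articles for the same conceptual step
    print(shared_function("Consed", "Tophat2"))
    # Two packages that implement different aspects of one function
    print(shared_function("Polyphred", "PolyScan"))
```
          </preformat>
          <p>Under these assumptions, Consed and Tophat2 meet at Genome assembly, and Polyphred and PolyScan meet at Variant calling, so steps implemented with different software can be matched by function.</p>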
          <p>When designing a workflow, two distinct types of
workflows should be created. One type of workflow is an
abstract workflow, with abstract components that
correspond to the more general functions in the hierarchy.
These abstract workflows capture the general functionality
of methods, and they would be independent of the software
used to implement them. A second type of workflow would be
a grounded workflow, which would specify what software
is used to implement each step.</p>
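          <p>The distinction between the two types of workflow can be sketched as follows. This is a minimal illustration, not the representation used in our workflow system: the Step class, the FUNCTION_OF mapping, and the dataset names are all hypothetical, and the dataflow shown is a simplified assumption. The sketch derives an abstract workflow from a grounded one by replacing each software name with its function while leaving the dataflow untouched.</p>
          <preformat preformat-type="code">
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str       # software name (grounded) or function name (abstract)
    inputs: tuple   # names of the datasets consumed
    outputs: tuple  # names of the datasets produced


# Hypothetical mapping from software to the function it implements here
FUNCTION_OF = {
    "Phred": "Base caller",
    "Consed": "Genome assembly",
    "Polyphred": "SNP caller",
}


def to_abstract(grounded_steps):
    """Replace each software name by its function; keep the dataflow as is."""
    return [
        Step(FUNCTION_OF[step.name], step.inputs, step.outputs)
        for step in grounded_steps
    ]


# Grounded workflow: dataset names and links are illustrative placeholders
grounded = [
    Step("Phred", ("traces",), ("bases",)),
    Step("Consed", ("bases",), ("assembly",)),
    Step("Polyphred", ("assembly",), ("snps",)),
]

abstract_workflow = to_abstract(grounded)

if __name__ == "__main__":
    for step in abstract_workflow:
        print(step.name, step.inputs, step.outputs)
```
          </preformat>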
          <p>We find that in practice it is hard to create a complete
hierarchy of component functions before creating the
workflows. We recommend an iterative process, where an
initial hierarchy is created and then refined as the
workflows are fleshed out.</p>
          <p>Depending on the depth of the hierarchy of component
functions there could be several abstract workflows that
could have different levels of detail and generality. Each
abstract workflow may be useful to a different reader,
depending on the level of detail that they are looking to
find. At the same time, if the workflow contains
descriptions of the steps that are too general, it may not be
very helpful to a reader. Workflow designers should design
appropriate conceptual levels.</p>
          <p>A hierarchy of component functions becomes a
powerful enabler for automation. Given a concrete
workflow, the hierarchy could be used to generate abstract
workflows automatically. Conversely, given an abstract
workflow, the hierarchy could be used to specialize it and
create a concrete workflow. [Gil et al 2011] describe
algorithms to do this kind of automation.</p>
        </sec>
      </sec>
      <sec id="sec-1-13">
        <title>4.2 Sub-Workflows</title>
        <p>Several components may implement different aspects of the
same function. For example, in the workflow of Figures 1
and 2 the Polyphred software and the Polyscan software
implement SNP calling and indel calling respectively,
which are two aspects of variant calling. The software
Annotate_MafFormat annotates the resulting variants with
respect to reference genomes. All three steps could be
considered as a sub-workflow, with an overarching abstract
function of detecting and annotating variants.</p>
        <p>A knowledge base of sub-workflows would capture these
functional decompositions. A sub-workflow would consist
of a root abstract component, which indicates the
overarching abstract function, and a workflow fragment that
decomposes that function into a set of components at a
lower level of abstraction and the dataflow among them.
Data abstractions should be taken into account as well, since
the sub-workflows express functions at different
abstraction levels. We discuss data abstractions below.</p>
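        <p>A knowledge-base entry of this kind might look like the sketch below. The SubWorkflow class and the collapse function are hypothetical names introduced for illustration, and the dataflow links and step order are assumptions based on the variant calling example above. Collapsing replaces a known workflow fragment with its root abstract component.</p>
        <preformat preformat-type="code">
```python
from dataclasses import dataclass, field


@dataclass
class SubWorkflow:
    root: str    # overarching abstract function
    steps: list  # lower-level components in the workflow fragment
    dataflow: list = field(default_factory=list)  # (producer, consumer) pairs


# Hypothetical knowledge-base entry built from the example in the text;
# the dataflow links are assumed for illustration.
KNOWLEDGE_BASE = [
    SubWorkflow(
        root="Detect and annotate variants",
        steps=["Polyphred", "Polyscan", "Annotate_MafFormat"],
        dataflow=[
            ("Polyphred", "Annotate_MafFormat"),
            ("Polyscan", "Annotate_MafFormat"),
        ],
    )
]


def collapse(workflow_steps, knowledge_base):
    """Replace each fully present fragment by its root, preserving order."""
    result = list(workflow_steps)
    for sub in knowledge_base:
        if all(step in result for step in sub.steps):
            first = min(result.index(step) for step in sub.steps)
            result = [step for step in result if step not in sub.steps]
            result.insert(first, sub.root)
    return result


if __name__ == "__main__":
    # Step order is an illustrative assumption, not a copy of Figure 1
    workflow = ["Phred", "Consed", "Polyphred", "Polyscan",
                "Annotate_MafFormat", "filt_MAF_file"]
    print(collapse(workflow, KNOWLEDGE_BASE))
```
        </preformat>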
        <p>When designing a workflow, steps that are functionally
related should be organized as sub-workflows. There may
be alternative ways to group steps in a workflow.
Workflow designers should make decisions based on the
expected use of the sub-workflow decompositions by
readers. The knowledge base of sub-workflows could be
dynamically extended based on a growing corpus of
workflows created by users. [Garijo et al 2014c] describe
techniques to detect workflow fragments automatically.</p>
      </sec>
      <sec id="sec-1-14">
        <title>4.3 Criticality</title>
        <p>Some steps in a workflow perform functions that are
critical to the overall computational method, while other
steps carry out minor format conversions and other
ancillary functions. For example, the workflow in Figures
1 and 2 has a step to merge several files. Other workflows
have reformatting steps, unit conversion steps, and other
functions that manage the details of how the data is
implemented. When describing a method in a paper, these
ancillary functions are rarely mentioned. There may be
different degrees of criticality, depending on how much
detail each reader is interested in seeing.</p>
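        <p>One simple way to record such degrees is to attach an ordered level to each step and filter on it when producing a description. The sketch below is only an assumption about how this could be encoded: the three level names and the level assigned to each step are hypothetical, chosen to mirror the file-merging example above.</p>
        <preformat preformat-type="code">
```python
from enum import IntEnum


class Criticality(IntEnum):
    ANCILLARY = 1   # e.g. format conversion, merging files
    SUPPORTING = 2  # hypothetical intermediate level
    CORE = 3        # central to the scientific method


# Step names follow the paper's examples; the levels are assumed
STEPS = [
    ("SNP caller", Criticality.CORE),
    ("Variant annotation", Criticality.CORE),
    ("Join data files", Criticality.ANCILLARY),
    ("Filter data files", Criticality.SUPPORTING),
]


def describe(steps, minimum):
    """Return the names of steps at or above the requested level."""
    return [name for name, level in steps if level >= minimum]


if __name__ == "__main__":
    # A reader interested in the science sees only the core steps
    print(describe(STEPS, Criticality.CORE))
    # A developer reproducing the method sees every step
    print(describe(STEPS, Criticality.ANCILLARY))
```
        </preformat>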
        <p>This kind of abstraction could be captured in a hierarchy
of criticality levels. This hierarchy would identify the
importance of including a step in a scientific description of
a method. [Garijo et al 2014b] describe an approach to
identifying criticality based on a library of workflow motifs
that include data pre-processing, visualization, and format
conversion. Criticality levels are highly dependent on the
specific domain, but a broad methodology for designing those
categories could be developed more generally.</p>
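        <p>One way to picture a hierarchy of criticality levels is as an ordered scale attached to each step, so that a description can be filtered by how much detail a reader wants. The sketch below assumes three levels and hypothetical step names; neither is prescribed by the approach described above.</p>
        <preformat>
```python
from enum import IntEnum


class Criticality(IntEnum):
    """Hypothetical three-level scale; a higher value is more critical."""
    ANCILLARY = 1   # e.g., format conversion or merging files
    SUPPORTING = 2  # e.g., data pre-processing
    CORE = 3        # central to the computational method


def describe(workflow, min_level):
    """Return the steps a reader sees at a given level of detail."""
    return [name for name, level in workflow if level >= min_level]


# Each step is paired with an assumed criticality level.
workflow = [
    ("MergeFiles", Criticality.ANCILLARY),
    ("NormalizeReads", Criticality.SUPPORTING),
    ("DetectVariants", Criticality.CORE),
    ("ConvertFormat", Criticality.ANCILLARY),
    ("AnnotateVariants", Criticality.CORE),
]

paper_summary = describe(workflow, Criticality.CORE)
full_detail = describe(workflow, Criticality.ANCILLARY)
```
        </preformat>
        <p>Here the paper-level summary lists only the core steps, while the full view retains the merging and conversion steps needed to reproduce the computation.</p>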
      </sec>
      <sec id="sec-1-15">
        <title>Data Abstractions</title>
        <p>Data type abstractions should be included in all three
hierarchies above. The data type in a node would represent
data of a more specific type than that of its parent node, for
example because it is a subtype or has more specific
metadata properties. In the hierarchy of component
functions, each abstract component function should specify
its inputs and outputs in terms of the more general data types. At the
bottom of the hierarchy, a component is specified with a
specific software invocation, including the exact command
line call to invoke the software and all the input data types
and formats that the software expects. In the hierarchy of
sub-workflows, the root component may refer to data types
that are more abstract than those of the workflow fragment.</p>
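        <p>The relation between abstract and specific components can be sketched as a subtype check over a small data type taxonomy. The taxonomy, the component dictionaries, and the command line below are placeholders chosen for illustration, not actual types or tools from our workflows.</p>
        <preformat>
```python
# Hypothetical taxonomy: each data type maps to its more general parent.
PARENT = {
    "SequenceData": None,
    "DNASequence": "SequenceData",
    "VariantData": None,
    "SNPData": "VariantData",
}


def is_subtype(specific, general):
    """True if `specific` equals `general` or descends from it."""
    current = specific
    while current is not None:
        if current == general:
            return True
        current = PARENT[current]
    return False


def specializes(concrete, abstract):
    """Check that a concrete component refines an abstract one.

    Each component is a dict with 'inputs' and 'outputs' lists of data
    types; the concrete component must use the same or more specific
    types in the same positions.
    """
    for port in ("inputs", "outputs"):
        if len(concrete[port]) != len(abstract[port]):
            return False
        for c_type, a_type in zip(concrete[port], abstract[port]):
            if not is_subtype(c_type, a_type):
                return False
    return True


# Abstract component function, specified with general types.
detect_variants = {"inputs": ["SequenceData"], "outputs": ["VariantData"]}

# Bottom of the hierarchy: a specific software invocation. The tool name
# and flags are placeholders, not a real command line.
snp_caller = {
    "inputs": ["DNASequence"],
    "outputs": ["SNPData"],
    "input_formats": ["FASTA"],
    "command": "example-caller --in sample.fasta --out snps.maf",
}

refines = specializes(snp_caller, detect_variants)
```
        </preformat>
        <p>Under these assumptions the specific component refines the abstract one, but not the reverse, which mirrors how a node in the hierarchy is more specific than its parent.</p>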
        <p>Data abstractions depend on the domain. In
multiomics, there are many aspects of data that can be described
in very specific terms but can be abstracted away when
describing an experiment in scientific terms.
Characteristics of a dataset that can lead to useful data
abstractions include: 1) type of sequence, such as RNA or
DNA; 2) annotations on those sequences, such as
indels, CNVs, and SNPs; 3) formats that are often imposed
by how software works, such as FASTA, MAF, and phd;
4) level of detail or accuracy of the sequences (for example,
sequences obtained with next-generation sequencing
machines are more accurate); and 5) the role of a dataset for a
specific component (for example, a sequence can be a
patient sequence or a reference sequence).</p>
        <p>Workflow designers should create a taxonomy of data
abstractions that supports the
three hierarchies discussed earlier. In our work, we have
found that a proliferation of data types makes the creation
of workflows more complex. Instead, we create properties
for describing the different characteristics of data.</p>
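        <p>Describing data through properties rather than through many distinct types can be sketched as follows. The property names follow some of the dataset characteristics discussed in this section, but the names and values are assumptions made for this example.</p>
        <preformat>
```python
def abstract_view(dataset, keep):
    """Project a dataset description onto the properties a reader cares about."""
    return {key: value for key, value in dataset.items() if key in keep}


def satisfies(dataset, requirement):
    """True if the dataset has every property value the requirement asks for."""
    return all(dataset.get(key) == value for key, value in requirement.items())


# Hypothetical description of one dataset by its properties.
patient_reads = {
    "sequence_type": "DNA",
    "annotations": ("SNPs", "indels"),
    "format": "FASTA",
    "role": "patient sequence",
}

# A scientific description abstracts away format and annotation details.
scientific_view = abstract_view(patient_reads, keep={"sequence_type", "role"})

# A component can state what it needs as a partial set of properties.
needs_dna = satisfies(patient_reads, {"sequence_type": "DNA"})
needs_maf = satisfies(patient_reads, {"format": "MAF"})
```
        </preformat>
        <p>With this representation a single description serves several abstraction levels: dropping properties yields a more abstract view, and no new data type has to be defined for each combination of characteristics.</p>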
      </sec>
      <sec id="sec-1-16">
        <title>CONCLUSIONS</title>
        <p>This paper motivates the need for capturing abstractions in
the design of scientific workflows. These abstractions are
based on our analysis of published articles and the
workflows created to reconstruct their methods. The
proposed abstractions are captured in hierarchies of
component functions and criticality as well as knowledge
bases of sub-workflows, and need to be supported by data
abstractions. Using these abstractions, different workflows
can be created to describe the same computation for readers
with different interests. In future work, we plan to develop
these abstractions for a target domain and associated
publications, in order to demonstrate their benefits.</p>
        <p>Acknowledgements. We gratefully acknowledge support
from the Defense Advanced Research Projects Agency
through the SIMPLEX program with award
W911NF-15-10555, the National Institutes of Health with awards
1U01CA196387 and 1R01GM117097, and the Canary
Foundation.</p>
        <p>[Garijo et al 2014c] Garijo D, Corcho O, Gil Y, Gutman BA, Dinov ID, Thompson P, and AW Toga. “FragFlow: Automated Fragment Detection in Scientific Workflows.” Proceedings of the 10th IEEE International Conference on e-Science, 2014.</p>
        <p>[Garijo et al 2017] Garijo D, Gil Y, and O Corcho. “Abstract, Link, Publish, Exploit: An End to End Framework for Workflow Sharing.” Future Generation Computer Systems, 2017.</p>
        <p>[Gil 2015] Gil Y. “Human Tutorial Instruction in the Raw.” ACM Transactions on Interactive Intelligent Systems 5 (1): 1–29, 2015.</p>
        <p>[Gil and Garijo 2017] Gil Y, and D Garijo. “Towards Automating Data Narratives.” Proceedings of the ACM Conf. on Intelligent User Interfaces, 2017.</p>
        <p>[Gil et al 2011] Gil Y, Gonzalez-Calero PA, Kim J, Moody J, and V Ratnakar. “A Semantic Framework for Automatic Generation of Computational Workflows Using Distributed Data and Component Catalogs.” Journal of Experimental and Theoretical Artificial Intelligence 23 (4), 2011.</p>
        <p>[Groth and Gil 2009] Groth P, and Y Gil. “Analyzing the Gap between Workflows and Their Natural Language Descriptions.” Proceedings of the IEEE International Workshop on Scientific Workflows (SWF), 2009.</p>
        <p>[Knoblock 2017] Knoblock M. “Designing Useful Abstractions for Multi-Omics Data Analysis.” Technical Report, Information Sciences Institute, University of Southern California, October 2017.</p>
        <p>[Imielinski et al 2012] Imielinski M, Berger AH, Hammerman PS, Hernandez B, et al. “Mapping the hallmarks of lung adenocarcinoma with massively parallel sequencing.” Cell 150 (6): 1107-20, 2012.</p>
        <p>[Ince et al 2012] Ince DC, Hatton L, and J Graham-Cumming. “The Case for Open Computer Programs.” Nature 482, 2012.</p>
        <p>[Ioannidis et al 2009] Ioannidis JPA, Allison DB, et al. “Repeatability of Published Microarray Gene Expression Analyses.” Nature Genetics 41 (2), 2009.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Baggerly and Coombes 2009]
          <string-name>
            <surname>Baggerly</surname>
            <given-names>KA</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>KR</given-names>
            <surname>Coombes</surname>
          </string-name>
          . “
          <article-title>Deriving Chemosensitivity from Cell Lines: Forensic Bioinformatics and Reproducible Research in High-Throughput Biology</article-title>
          .”
          <source>Annals of Applied Statistics</source>
          <volume>3</volume>
          (
          <issue>4</issue>
          ),
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Donoho et al 2009]
          <string-name>
            <surname>Donoho</surname>
            <given-names>DL</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maleki</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rahman</surname>
            <given-names>IU</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shahram</surname>
            <given-names>M</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>V</given-names>
            <surname>Stodden</surname>
          </string-name>
          . “
          <article-title>Reproducible Research in Computational Harmonic Analysis</article-title>
          .”
          <source>Computing in Science &amp; Engineering</source>
          <volume>11</volume>
          (
          <issue>1</issue>
          ):
          <fpage>8</fpage>
          -
          <lpage>18</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Garijo et al 2013]
          <string-name>
            <surname>Garijo</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kinnings</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xie</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xie</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bourne</surname>
            <given-names>PE</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>Y</given-names>
            <surname>Gil</surname>
          </string-name>
          . “
          <article-title>Quantifying Reproducibility in Computational Biology: The Case of the Tuberculosis Drugome</article-title>
          .”
          <source>PLoS ONE</source>
          <volume>8</volume>
          (
          <issue>11</issue>
          ),
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Garijo et al 2014a]
          <string-name>
            <surname>Garijo</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corcho</surname>
            <given-names>O</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gil</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Braskie</surname>
            <given-names>MN</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hibar</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hua</surname>
            <given-names>X</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jahanshad</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thompson</surname>
            <given-names>P</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>AW</given-names>
            <surname>Toga</surname>
          </string-name>
          . “
          <article-title>Workflow Reuse in Practice: A Study of Neuroimaging Pipeline Users</article-title>
          .”
          <source>Proceedings of the 10th IEEE International Conference on e-Science</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Garijo et al 2014b]
          <string-name>
            <surname>Garijo</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alper</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belhajjame</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corcho</surname>
            <given-names>O</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gil</surname>
            <given-names>Y</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>C</given-names>
            <surname>Goble</surname>
          </string-name>
          . “
          <article-title>Common Motifs in Scientific Workflows: An Empirical Analysis</article-title>
          .”
          <source>Future Generation Computer Systems</source>
          <volume>36</volume>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Garijo et al 2014c]
          <string-name>
            <surname>Garijo</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corcho</surname>
            <given-names>O</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gil</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gutman</surname>
            <given-names>BA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dinov</surname>
            <given-names>ID</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thompson</surname>
            <given-names>P</given-names>
          </string-name>
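          , and
          <string-name>
            <given-names>AW</given-names>
            <surname>Toga</surname>
          </string-name>
          . “
          <article-title>FragFlow: Automated Fragment Detection in Scientific Workflows</article-title>
          .”
          <source>Proceedings of the 10th IEEE International Conference on e-Science</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Taylor et al 2006]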
          <string-name>
            <surname>Taylor</surname>
            <given-names>IJ</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deelman</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gannon</surname>
            <given-names>DB</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>M</given-names>
            <surname>Shields</surname>
          </string-name>
          . “
          <article-title>Workflows for e-Science: scientific workflows for grids</article-title>
          .” Springer,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Van Noorden 2015]
          <string-name>
            <surname>Van Noorden</surname>
            <given-names>R</given-names>
          </string-name>
          . “
          <article-title>Sluggish data sharing hampers reproducibility effort</article-title>
          .”
          <source>Nature</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Xie 2015]
          <string-name>
            <given-names>Y</given-names>
            <surname>Xie</surname>
          </string-name>
          . “
          <article-title>Dynamic Documents with R and knitr</article-title>
          .” CRC Press,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Zhang et al 2014]
          <string-name>
            <surname>Zhang</surname>
            <given-names>B</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>X</given-names>
          </string-name>
          , et al. “
          <article-title>Proteogenomic Characterization of Human Colon and Rectal Cancer</article-title>
          .”
          <source>Nature</source>
          <volume>513</volume>
          (
          <issue>7518</issue>
          ):
          <fpage>382</fpage>
          -
          <lpage>87</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>