<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Best practices for workflow design: how to prevent workflow decay</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kristina Hettne</string-name>
          <email>k.m.hettne@lumc.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katy Wolstencroft</string-name>
          <email>katherine.wolstencroft@manchester.ac.uk</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Khalid Belhajjame</string-name>
          <email>khalid.belhajjame@cs.man.ac.uk</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carole Goble</string-name>
          <email>carole.goble@manchester.ac.uk</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eleni</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harish Dharuri</string-name>
          <email>h.k.dharuri@lumc.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lourdes Verdes-Montenegro</string-name>
          <email>Lourdes@iaa.es</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julian Garrido</string-name>
          <email>jgarrido@iaa.es</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David de</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roure</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Roos</string-name>
          <email>m.roos@lumc.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Leiden University Medical Center</institution>
          ,
          <addr-line>Leiden</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Oxford e-Research Centre, University of Oxford</institution>
          ,
          <addr-line>Oxford</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Manchester</institution>
          ,
          <addr-line>Manchester</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this position paper we present a set of best practices for workflow design to prevent workflow decay and increase reuse and re-purposing of scientific workflows. MyExperiment provides access to a large number of scientific workflows. However, scientists find it difficult to reuse or re-purpose these workflows for mainly two reasons: workflows suffer from decay over time and lack sufficient metadata to understand their purpose. We argue that good workflow design is a prerequisite for repairing a workflow, or redesigning an equivalent workflow pattern with new components. We present a set of best practices for workflow design and the semantic tooling that is being developed in the Workflow4Ever (Wf4Ever) project to support these best practices.</p>
      </abstract>
      <kwd-group>
        <kwd>3Instituto de Astrofísica de Andalucía</kwd>
        <kwd>Granada</kwd>
        <kwd>Spain</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Workflows are increasingly used in life sciences as a means to capture, share, and
publish the steps of a computational analysis. Tools such as Taverna [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Galaxy
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] are widely adopted tools for creating and executing workflows, which can be
published and shared using myExperiment [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the largest public repository of
scientific workflows. While myExperiment provides access to a large number of
workflows from different domains such as genomics, biodiversity and astronomy,
scientists find it often difficult (or impossible) to exploit those workflows by reusing
or re-purposing them for their analyses [
        <xref ref-type="bibr" rid="ref19 ref4">4, 19</xref>
        ]. Deciding factors in re-purposing or
not would be the functionality of the workflow that the user is after, and if the existing
workflow provides a fnctionality that can be used as a building block (i.e., as a
subworkflow) in the new workflow. Regarding workflow decay, in a recent study [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] of
Taverna workflows on myExperiment, the authors found that as much as 80% failed
to be either executed, or to produce the same results. The main reasons for the
workflows to break were the following: volatile third-party resources, missing
example data that can help the scientist understand the main purpose of the workflow,
missing execution environment, and insufficient metadata. We believe that all of these
issues, with a possible exception of the first one since it is not under the direct control
of the designer, can be prevented by following a minimum set of guidelines at the
workflow design stage. Unfortunately, to our knowledge no such guidelines exist, this
hence leading us to define here 10 Best Practices for designing workflows.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Proposed Best Practices for Workflow Design</title>
      <p>
        The proposed best practices for workflow design are based on combining the
principles of the scientific method and the best practices for software development
and data management [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Therefore, the next 10 steps allow the creation of higher
quality workflows, as required in the scientific discourse.
      </p>
      <p>
        1. Make an abstract workflow: A workflow sketch provides a reference to the
main task(s) of the workflow through its implementation process. A
workflow can be compared to a scientific protocol, so sketching out the
method helps when designing the experiment. We also anticipate that a
workflow sketch will help in communication with for example supervisors
and colleagues, while at the same time promoting sharing between computer
and human generated systems due to its non-explicit nature.
2. Use modules: One of the main strengths of workflows is the possibility of
plugging in and reusing parts and also swapping broken parts, plugging in
different methods and comparing them. Implementing all the executable
components of a workflow in such a way that they can be run as separate
subworkflows would facilitate the understanding, maintenance, re-use and
separate testing and validation of the workflow.
3. Think about the output: What is the output intended for? Is it supposed to
be used as input to another workflow, stored in a database, or be presented to
the end user? Should it be a graph, a table or text? Thinking about the output
of the workflow at the design stage is easier than trying to adjust a finished
workflow and will drive the design of the next steps. Also, a workflow has
the potential to produce masses of data that need to be visualized and
managed properly.
4. Provide input and output examples: Inputs and output examples are
crucial for: the understanding of the workflow, validation, maintenance
purposes, as well as to be able to use them as tools for training or tutorials.
5. Annotate: Careful annotation of a workflow helps to record all steps and
assumptions hidden in the workflow, what is not only needed for a
publication later on but also crucial for the scientific method. It also
facilitates use and re-use of workflows. There is no accepted standard for
annotating a workflow. We propose to choose meaningful names for the
workflow title, inputs, outputs, and for the processes that constitute the
workflow as well as for the interconnections between the components, so
that annotations are not only a collection of static tags but capture the
dynamics of the workflow. A high-level functional annotation should be
included (for example similar to the functional units suggested in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]), as
well as a description of the resource, keeping in mind a scenario where it
may disappear or change at some time in the future.
6. Make it executable from outside the local environment: This best practice
leads to portability of the workflow which potentially increases its
reproducibility and reuse. It can for example be realized either by using
remote Web services, or platform independent code/plugins. However, if
there is need to use local services, library or tools, then the workflow should
be annotated in order to define its dependencies i.e. which local tool, version
or operating system is required, where to find it, if it is licensed or any other
particular restriction e.g. the application has to be called with a particular
name.
7. Choose services carefully: One of the major reasons that cause workflows
to break are volatile third-party services. The status, reliability and stability
of a Web Service as well as the reputation of the service provider are often
the deciding factors for choosing a service.
8. Reuse existing workflows: Reuse is important for many reasons. It fights
redundancy, and perpetuates “tried and tested” and published methods
conveying good scientific practice. It will also help the workflow developer
get ideas on methods and workflow patterns. It is also beneficial when
repairing workflows: repairing a given workflow may entail repairing the
workflows in which it is used as a subworkflow.
9. Test and validate: Defining test cases and implementing validation
mechanisms facilitates maintenance, decay identification and guarantee the
correctness of the results by, for example including components in the
workflow whose function is checking assertions that must be true.
10. Advertise and maintain: It is a duty of science to share your results. It also
helps progress by letting others build on your work without reinventing it.
Workflow maintenance is expected to increase the longevity of the
workflows. Frequent testing, monitoring services used, communication with
other users all represent ways to maintain a workflow.
2.1
      </p>
      <sec id="sec-2-1">
        <title>Technology Supporting the Best Practices</title>
        <p>
          Research Objects (ROs) are semantically rich aggregations of resources that bring
together data, methods and people in scientific investigations [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Their goal is to
create a class of artifacts that can encapsulate digital knowledge and provide a
mechanism for sharing and discovering assets of reusable research and scientific
knowledge. In the EU Wf4Ever project [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] the focus is on those ROs whose methods
are implemented as scientific workflows. A (workflow-centric) RO is an artifact that
bundles one or several workflows, the provenance of the results obtained by their
enactments, other digital objects that are relevant to the experiment (papers, datasets,
etc.), and annotations that semantically describe all these objects. A model that can be
used to describe these workflow-centric research objects is proposed in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. This
model is implemented as a suite of lightweight ontologies or vocabularies, building
upon existing work from related communities. The RO manager tool [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] was
designed to prepare a workflow-centric RO and its related data and metadata.
3
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Use case: Next-generation text mining for functional annotation of Single Nucleotide Polymorphisms (SNPs)</title>
      <p>
        We followed the 10 best practices for workflow design when implementing a set of
workflows in Taverna that perform a sophisticated text mining procedure referred to
as concept profile matching [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] to functionally annotate SNPs. The set of workflows
were published on myExperiment as a pack [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. We describe point by point below
how the RO model supported the design and implementation process.
1) Make an abstract workflow: We made workflow sketch using flowchart symbols,
since there is no accepted standard for creating such a sketch. The sketch tries to
capture the whole experiment at a higher level. It is aggregated in the RO as a
wf4ever:Document. A file description was added using the rdfs:comment property.
2) Use modules: All executable components were implemented as separate, runnable
workflows. These were described using the following properties of the RO model:
class: wfdesc:Workflow and relation: wfdesc:hasSubWorkflow. This indeed
facilitated independent testing and validation of the execution of each of the
individual components as well as the main nested workflow.
3) Think about the output: The outputs of the workflows were implemented as
output boxes in Taverna. They can be saved to disc from Taverna, for example using
the save to Excel option. Although the output can be saved from Taverna in this way,
the limited export options give a scattered impression and it can be difficult to relate
the different outputs to each other. The outputs after a workflow run are annotated
with the wfprov:wasOutputFrom relation.
4) Provide example inputs and outputs: All workflows have example inputs and
outputs. The example outputs match the example inputs. However, there is no
standard for how to do this if the example is a large data file that does not fit in the
example window of Taverna. We solved this by providing the example files in the RO
and use the wfdesc:Input and wfdesc:Output properties to annotate the example files.
5) Annotate: All elements of the workflows including inputs and outputs, processes
(for example Web services) and subworkflows (nested components) were annotated
using the description and example fields in Taverna. We used the Taverna description
fields and example fields for workflow-related annotation since a workflow developer
is expected to provide the annotations continuously while developing the workflow.
We used the Wf4Ever tool scufl2-wfdesc [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] to extract a wfdesc-compliant
workflow description from a Taverna workflow.
      </p>
      <sec id="sec-3-1">
        <title>6) Make it executable from outside the local environment: Since our workflows</title>
        <p>
          only use public Web services we experienced no problems when testing them outside
the local environment. We also successfully executed them using t2web (e.g.
http://workflow.biosemantics.org/t2web/workflow/2972), a web tool that uses the
Taverna server [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
7) Choose services carefully: The workflows use public Web services listed on
BioCatalogue [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Service availability in BioCatalogue is indicated using a simple
‘traffic light’ mechanism, whereby green means the service is active, yellow means it
has one or more unresolved issues, and red means it is currently unavailable. This
green light, together with very limited history on BioCatalogue, is the only available
trust-metric for a Web service. More effort to develop and implement reliability
statistics for Web services is needed.
8) Reuse existing workflows: We made our workflows modular and noticed that the
modularity made it possible to use one subworkflow twice in a nested workflow. This
relates to ongoing work on Taverna workflow components, which we expect will
make these types of implementations easier in the future.
9) Test and validate: We did notice that Taverna does not provide test mechanisms
for the workflows and the nested workflows. For this reason, we used a well-known
and small sample for testing (it matches with the sample data provided in the
annotations).
10) Advertise and maintain: The workflows were put on myExperiment, both as
separate workflows and as a pack [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. We propose to use these links when referring
to them from scientific publications. Other ways to advertise workflows include
making them available through the Galaxy and Genome Space [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] environments.
One major advantage of myExperiment is that it connects users to the creators of the
workflow. We respond to e-mails from users regarding Web services being down and
causing the workflow to break. Our maintenance plan is to perform monthly tests of
the workflows by running them with their example values. A schedule for this in
myExperiment with built-in reminders (for example, automatically generated e-mails
alerting the developer that the workflow needs to be run) would help the maintenance.
4
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Concluding remarks</title>
      <p>
        When following the 10 best practices for workflow design we were helped by the
available models and tooling, but we also noticed several technology gaps where
specific guidelines or tooling would be of great help (see section 2). We suggest that
something similar to unit testing would be useful for testing workflows and support
workflow maintenance. The lack of testing support makes it even more important to
integrate validation mechanisms in the workflow. We propose that myExperiment and
Taverna draw from the same linked sources (the ROs), showing appropriate fields at
appropriate times. The Wf4Ever project is working towards the release of an
ROenabled myExperiment. The pack functionality in myExperiment will be extended
according to the RO model, which will constitute the interpretation layer between
Taverna and myExperiment. Ongoing work on nanopublications [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] might provide
means to refer to different parts of a workflow or an RO in a more fine-grained way.
The question on how to convince users that the best practices are beneficial in the
long run remains.
      </p>
      <p>Acknowledgements. The research reported in this paper is supported by the EU
Wf4Ever project (270129) funded under EU FP7 (ICT-2009.4.1).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Taverna. http://www.taverna.org.uk.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Galaxy. http://galaxy.psu.edu.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. MyExperiment. http://www.myexperiment.org.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Goderis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>Seven bottlenecks to workflow reuse and repurposing</article-title>
          .
          <source>In: Proceedings of the 4th International Semantic Web Conference</source>
          , Galway, Ireland,
          <fpage>6</fpage>
          -
          <lpage>10</lpage>
          November 2005
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>Why workflows break - Understanding and combating decay in Taverna workflows, accepted for the 8th</article-title>
          <source>IEEE International Conference on eScience 2012</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Aruliah</surname>
            ,
            <given-names>D. A.</given-names>
          </string-name>
          , et al.:
          <article-title>Best Practices for Scientific Computing</article-title>
          .
          <source>(Submitted on 1 Oct</source>
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Missier</surname>
            <given-names>P.</given-names>
          </string-name>
          , et al.:
          <article-title>Functional Units: Abstractions for Web Service Annotations</article-title>
          .
          <source>In: SERVICES 2010; 05 Jul</source>
          <year>2010</year>
          -05 Jul 2010; Miami, USA. (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Bechhofer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , et al.:
          <article-title>Why linked data is not enough for scientists. Future Generation Computer Systems</article-title>
          . In Press.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Wf4Ever project. http://www.wf4ever-project.
          <source>org.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Belhajjame</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , et al.:
          <article-title>Workflow-Centric Research Objects: First Class Citizens in Scholarly Discourse</article-title>
          . In: SEPUBLICA:
          <article-title>Future of scholarly communication in the Semantic Web</article-title>
          .
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. RO manager. https://github.com/wf4ever/ro-manager.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Jelier</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , et al.:
          <article-title>Text-derived concept profiles support assessment of DNA microarray data for acute myeloid leukemia and for androgen receptor stimulation</article-title>
          .
          <source>BMC Bioinformatics</source>
          .
          <volume>8</volume>
          ,
          <issue>14</issue>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. Workflow pack on myExperiment. http://www.myexperiment.org/packs/282.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>14. Scufl2-wfdesc. https://github.com/wf4ever/scufl2-wfdesc.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. Taverna server. http://www.taverna.org.uk/download/server/.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>16. BioCatalogue. http://www.biocatalogue.org.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>17. GenomeSpace. http://www.genomespace.org.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Nanopub</surname>
          </string-name>
          . http://nanopub.org.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Hettne</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , et al.:
          <article-title>Best practices for workflows</article-title>
          .
          <source>Presented at 2nd BioVeL Workshop on taxonomic and phylogenetic workflows. Gothenburg</source>
          , Sweden.
          <year>2012</year>
          . http://www.biovel.eu/images/events/MS6WorkshopPix/presentations/hettne. ppt
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>