Scaling bio-analyses from computational clusters to grids

Heorhiy Byelas, Martijn Dijkstra, Pieter Neerincx, Freerk van Dijk, Alexandros Kanterakis, Patrick Deelen, Morris Swertz
Genomics Coordination Center, Department of Genetics, University Medical Center Groningen, University of Groningen, The Netherlands
Email: h.v.byelas@med.umcg.nl, m.a.swertz@rug.nl

Abstract—Life sciences have moved rapidly into big data thanks to new parallel methods for gene expression, genome-wide association, proteomics and whole-genome DNA sequencing. The scale of these methods is growing faster than predicted by Moore's law. This has introduced new challenges and needs: specifying computation protocols for, e.g., Next-Generation Sequencing (NGS) and genome-wide association study (GWAS) imputation analyses, and running them on a large scale, is a complicated task due to the many steps involved, long runtimes, heterogeneous computational resources and large files. The process becomes error-prone when dealing with hundreds of samples, as in genomic analysis facilities, if it is performed without an integrated workflow framework and data management system. From recent projects we learnt that bioinformaticians do not want to invest much time in learning advanced grid or cluster scheduling tools; they prefer to concentrate on their analyses, stay close to old-fashioned shell scripts that they can fully control, and have automatic mechanisms take care of all submission and monitoring details. We present a lightweight workflow declaration and execution system to address these needs, built on top of the MOLGENIS framework for data tracking. We describe lessons learnt when scaling NGS and imputation analyses from computational clusters to grids and show the application of our solution, in particular in the nation-wide "Genome of the Netherlands" project (GoNL, 700 TB of data and about 200,000 computing hours).

I. INTRODUCTION

High-throughput analysis methods have created exciting new possibilities for unraveling genotype-to-phenotype relationships. However, these experiments are heavily dependent on large computational analyses for their success. For instance, next-generation sequencing analyses typically involve about 30 computational steps, such as alignment, duplicate marking, single-nucleotide polymorphism (SNP) calling, annotation and many re-indexing and quality control steps [1]. Similarly, GWAS data typically requires batching for imputation. Genomic analysis facilities routinely face running many different versions of such computational pipelines on hundreds or thousands of samples. This quickly becomes a nightmare of data file logistics (raw, intermediate, result, quality control and log data) and of computational scheduling (large, small, short and long jobs that may have error states and different computational back-ends). Furthermore, different cluster and grid middleware do not provide all the operations necessary to execute bio-workflows. To address these challenges we present a practical software system that combines computational and data management for the routine running of large bioinformatics analyses.

In this work, our goal is to run different workflows in a unified way and to make workflow adaptation to different back-ends, such as cluster and grid environments, more standard and easier. Hence, users can choose a computational back-end with less load to run their analyses, run workflows in different back-end environments and achieve identical analysis results, and, if needed, combine results together. We do not want to add another middleware layer on top of the back-end services, such as job schedulers, but prefer to minimize the overhead of running shell scripts in different environments.

In this paper, we consider the principles of the MOLGENIS software suite [2], [3] from a new perspective, where it can be used as a separate module to generate all the instruments necessary for tracing data and computations and for collecting logging information from different computational environments. We allow users to decide how they want to orchestrate these instruments in different middleware systems, although we supply a default management solution, which includes a database and a "pilot-job" framework to run analysis jobs. In addition, we show how the system can be applied to run NGS and imputation workflows in cluster and grid environments.

This paper is structured as follows. Section II reviews related work in the context of the conventions used in MOLGENIS Compute and of attempts to deploy workflow management systems to different computational back-ends. Section III describes the workflow design model used for workflow generation and the generation process. Section IV presents the workflow deployment and the design of the "pilot-job" framework used for workflow execution. Section V explains implementation details of creating analysis protocols that are suitable for execution on different back-ends. Section VI provides details on the NGS and imputation workflows. Section VII discusses our practical experiences and the benefits and drawbacks of using the system. Section VIII offers our conclusions.
II. RELATED WORK

We divide the related work into two sub-domains that describe work in (1) workflow design and generation and (2) workflow execution and deployment.

A. Workflow generation and design

Software generators are systems that generate other software; software specifications described in a model language play the role of input for the generation. Our workflow model is described in Section III and used as input for the workflow generation. As output, we expect a generated workflow management system that is able to run workflows in a distributed computational environment.

We have gained a lot of experience in generating software infrastructures to manage and process large bioinformatics datasets [4], [5], [6]. The MOLGENIS toolkit [7] provides bioinformaticians with a simple language to model biological data structures and user interfaces. The MOLGENIS generator suite automatically translates these models into a feature-rich, ready-to-use web application including a database, user interfaces, exchange formats, and scriptable interfaces. This "model-driven" method ensures re-use of best practices and improves quality, because the modelling language and generators are shared between all MOLGENIS applications, so that errors are found quickly and improvements are shared easily by re-generation. A plug-in mechanism ensures that both the generator suite and the generated product can be customized just as much as hand-written software.

In our previous work on workflow management, we aimed to combine computational and data management in a single system: MOLGENIS Compute [8]. We then extended the initial solution with a meta-model more specific to NGS analysis, which allowed us to specify data provenance and workflow execution logic efficiently [9]. In our NGS-specific solution, we consider the computational cluster as the back-end where all computation takes place. Users communicate with the system through the generated web interface. Some logging information is usually hidden in the user interface, which makes the view on the data more compact and easier to comprehend. However, in some cases, such as when debugging a new workflow or deploying it in a new environment, users may need this hidden logging information from the operational system or the analysis tools. Expert users would like to have direct and easy access to all log files produced, and should not be obliged to obtain them through the web interface, which requires extra implementation effort. This time we want to provide both a command-line and a web interface to the workflow management.
B. Workflow execution and deployment

Whereas in previous work we focused on generating compute management software for PBS clusters [10], here we want to run workflows in different back-end environments. There are a number of projects, such as MOTEUR [11] and TavernaPBS [12], which deploy the Taverna workflow management system (WMS) [13] to grid and cluster environments, respectively. Furthermore, adding the BioMoby plug-in [14] to Taverna can make running biological workflows and tracking data provenance even more structured. Another workflow system, Galaxy [15], which specialises in running bioinformatics workflows, can be configured to run analyses on Portable Batch System (PBS) or Sun Grid Engine (SGE) clusters, but we are not aware of any project that connects Galaxy to grid middleware.

Besides WMSs, bioinformatics analyses can be executed in distributed computational environments using web-based portals, such as the P-GRADE Portal [16], which supports development and submission of distributed applications executed on the computational resources of various distributed computing infrastructures (DCIs), including clusters and service grids, via a web browser. The portal architecture consists of several layers, including a presentation layer with a web interface, a middle layer with different services, and an architecture layer with the available cluster/grid middleware connectors. In our approach, we also want to communicate with the system through a generated web presentation layer, but we minimise the extra middleware layer between cluster/grid services and our system and keep the implementation that communicates with the back-end middleware minimal.

For the workflow deployment, instead of the service-based strategies proposed in Taverna, where the actual analysis tools are wrapped into web services or other "black-box" components, we would like to use the tools directly in both cluster and grid settings. This allows us to use analysis tools without losing any functionality that could be hidden behind a wrapper interface. We want to automate the tool installation process in a way that suits, where possible, all environments that we include in our back-end compute infrastructure.

To summarize, we (1) re-use best practices from the MOLGENIS database generator to generate experiment data and workflow models for users and (2) establish a direct connection to cluster/grid infrastructures for bioinformatics.
III. WORKFLOW DESIGN AND GENERATION

A. Workflow design

In this new effort, we would like to model bioinformatics workflows so that they can run in a unified way in different computational environments. We can divide the modelling task into two sub-tasks:

• workflow modelling, which covers describing workflows so that they can run on different computational back-ends
• data modelling, which covers the different bioinformatics analyses (e.g. NGS or imputations)

1) Workflow modelling: The core of our workflow design remains stable [8], [9]. Our main goal is to keep the model as simple as possible and avoid abstract entities. The model is shown in Figure 1.

Fig. 1: Core of the computing model

The Workflow element includes all individual analysis WorkflowElements and represents the succession of the analysis. The core of the model are the analysis script templates stored in Protocols. A Protocol has Parameters. Parameters can be of different types, such as workflow parameters, environment parameters and user parameters. During generation, the template parameters are filled with actual Values. The resulting script is then combined with supplementary headers and footers, which are specific to the different back-ends, and stored as an analysis Job. Generation using templates is discussed next.

2) Data modelling: Our NGS model is described in detail in [9]. It is based on the laboratory process around an Illumina HiSeq 2000 machine, which produces the input data for our analysis workflows. It consists of elements such as Projects, Samples, DNA Libraries, Flowcells, Barcodes, etc. We introduce one common Analysis Target interface to enable us to refer uniformly to the subjects of different analyses. Analysis Targets can have complex relations between themselves. For instance, in our NGS workflow, some analyses are executed per Barcode being the Analysis Target, but others are executed per Project or Sample being the Analysis Target, which includes all Barcodes. Hence, it is very important to specify the Analysis Targets and their relations for every analysis element in the workflow.
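To make the relations in Figure 1 concrete, the following sketch renders the core entities as plain Python classes. The class and field names mirror the model (Workflow, WorkflowElement, Protocol, Parameter, Job), but the code is only an illustration of the model; it is not the generated MOLGENIS implementation.

```python
# Illustrative sketch of the computing model in Figure 1 (not the MOLGENIS code,
# which is generated from the model itself).
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Parameter:
    name: str                 # e.g. "indexfile" or "leftbarcodefqgz"
    kind: str = "user"        # workflow, environment or user parameter


@dataclass
class Protocol:
    name: str                 # e.g. "BwaAlign"
    template: str             # analysis script template with ${...} placeholders
    parameters: List[Parameter] = field(default_factory=list)


@dataclass
class WorkflowElement:
    name: str
    protocol: Protocol
    previous: List["WorkflowElement"] = field(default_factory=list)  # ordering constraints


@dataclass
class Workflow:
    name: str
    elements: List[WorkflowElement] = field(default_factory=list)


@dataclass
class Job:
    name: str
    script: str                                            # filled-in template plus back-end header/footer
    values: Dict[str, str] = field(default_factory=dict)   # Parameter name -> Value
```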
B. Analysis jobs generation

In general, an analysis intended for execution on a cluster or grid consists of many chunks, and these chunks are arranged into workflows. The workflow structure is essential to run the analysis parts in the correct order. Usually, all workflow parts can be generated at once for later execution. An exception is the case where the number of outputs (i.e. output files) is not known beforehand and these files should be used as inputs in further steps of the workflow.

We support two implementations of the workflow generation: (1) a database version and (2) a command-line version, which uses files instead of a database. In the database version, a workflow is described in the MySQL database, which is generated from the model (Sec. III-A1). In the command-line version, a workflow is described in files:

• workflow.csv: a file with the sequence of workflow elements,
• parameters.csv: a file with the workflow parameters, and
• protocols: a folder with the templates of the workflow protocols.

These files match the database model and can be imported into and exported from the database system. Analysis Targets can likewise be either selected as database records through the generated web interface or listed in the input parameters file. The workflow description and the Analysis Targets are the actual input for generating analysis jobs.

In the database version, jobs are generated as records in the database. The generation scheme of the command-line version is presented in Figure 2. Here, jobs are generated as script files in a specified directory. These script files are ready for submission to the back-end specified at generation time. For this, we provide supplementary headers and footers for the different computational back-ends. These headers/footers are written as templates and stored as files; they are filled with parameters during generation. Headers/footers are used to specify e.g. the job wall-time, memory requirements and job logging. All that is needed to generate jobs for a new back-end is to add new header/footer templates for it. The generator source code stays the same. In addition, we provide the possibility for users to submit single analysis jobs, or jobs with dependencies, to a specified back-end.

Fig. 2: Generation procedure using the MOLGENIS job generator (command-line version)

In the database version, we use "pilot-jobs" to run and monitor analysis jobs. This solution requires a web server and a database, which are discussed next.
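The command-line generation step can be pictured as follows: fill the protocol template with the Values for one Analysis Target and envelope the result with the back-end specific header and footer. The sketch below uses Python's string.Template as a stand-in for the real template engine; the template fragments, header syntax and parameter values are illustrative only.

```python
# Sketch of job generation for the command-line version: template + values +
# back-end header/footer -> submit-ready script. string.Template only stands in
# for the real template engine; all values below are illustrative.
from string import Template


def generate_job(protocol_template: str, values: dict,
                 header_template: str, footer_template: str) -> str:
    """Return the script for one analysis Job."""
    header = Template(header_template).safe_substitute(values)
    body = Template(protocol_template).safe_substitute(values)
    footer = Template(footer_template).safe_substitute(values)
    return "\n".join([header, body, footer])


bwa_protocol = (
    "module load bwa/${bwaVersion}\n"
    "bwa aln ${indexfile} ${leftbarcodefqgz} -t ${bwaaligncores} -f ${leftbwaout}"
)
values = {
    "bwaVersion": "0.5.9",                 # illustrative version
    "indexfile": "human_g1k_v37.fa",
    "leftbarcodefqgz": "lane8_CAACCT_1.fq.gz",
    "bwaaligncores": "4",
    "leftbwaout": "lane8_CAACCT_1.bwa_align.human_g1k_v37.sai",
}
pbs_header = "#PBS -l walltime=15:00:00,nodes=1:ppn=${bwaaligncores},mem=6gb"
pbs_footer = "echo job finished"

print(generate_job(bwa_protocol, values, pbs_header, pbs_footer))
```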
IV. WORKFLOW DEPLOYMENT AND EXECUTION IN DIFFERENT COMPUTATIONAL ENVIRONMENTS

A. Workflow deployment

We define a workflow deployment method that allows us to easily add a new computational back-end and run workflows there. Here, by workflow deployment, we mean the possibility to execute all workflow analysis steps on all available computational resources, such as clusters and grids, without requiring changes in the analysis structure or protocols. This is very important to ensure the reproducibility of workflow results, even if the analysis is executed in a new environment. Previously, we executed workflows only on computational clusters; hence, it was sufficient to set up the same execution settings in all clusters to ensure reproducibility.

In this new effort, we reuse the environment modules package [17], which provides dynamic modification of a user's environment via module files. Two actions are needed to install a new software module into the system: (1) the software should be deployed on an execution site and (2) a module file, which describes the software, should be added to the system. For a module deployment, we submit deploy scripts as simple cluster/grid jobs to a cluster/grid scheduler. An example of a deploy job that installs the GATK software via the gLite-WMS scheduler can be found in the Dutch biobank project repository [18]. We define two types of deployment scripts: (A) pre-built and (B) on-site build.

The (A) pre-built deployment has the following logical steps (a code sketch of the automated part follows at the end of this subsection):

1) (manually) create a module file that contains all changes to environment variables required to make the software work
2) (manually) build the software binary files
3) (manually) compress the built software and module file and upload them to an online repository
4) (cluster/grid job) download the packaged software from the repository and decompress it
5) (cluster/grid job) move the software to the right directory and ensure the permissions are right
6) (cluster/grid job) move the module file to the right directory
7) (cluster/grid job) clean up, check the environment variables and send the deployment report

The (B) on-site build deployment has a few differences. A cluster/grid job downloads the source code of the software from the online repository and builds the software binary on the cluster/grid site. This approach is needed when the software to be installed has dependencies on system libraries, e.g. the R software for statistical computing. When deployment is completed, the deployed modules can be initialized using the following statement in analysis scripts:

module load your_software_module/version

The software loaded via the module can then be invoked directly without specifying its classpath. Hence, we ensure that the same software is used for all analyses and that the analysis scripts for different back-ends contain exactly the same listings to access the tools used in the analysis. As a result, the generated analysis scripts (Sec. III-B) will be executed on all computational sites in the same way and produce identical results.
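The sketch below spells out the automated part of the pre-built deployment (steps 4-7 above). In practice these steps run as a shell script submitted to the cluster/grid scheduler; the repository URL, directory layout and module file name are placeholders, not the actual deployment scripts from [18].

```python
# Sketch of the pre-built module deployment job (steps 4-7 of the list above).
# All paths and URLs are placeholders; the real deploy jobs are shell scripts.
import os
import shutil
import tarfile
import urllib.request

REPO_URL = "https://example.org/modules"      # placeholder online repository
SOFTWARE_ROOT = "/apps/software"              # placeholder deploy directory
MODULEFILE_ROOT = "/apps/modulefiles"         # placeholder module file directory


def deploy_prebuilt(name: str, version: str) -> None:
    package = f"{name}-{version}.tar.gz"
    # 4) download the packaged software and module file, then decompress
    urllib.request.urlretrieve(f"{REPO_URL}/{package}", package)
    with tarfile.open(package) as tar:
        tar.extractall("build")
    # 5) move the software to the right directory and fix the permissions
    target = os.path.join(SOFTWARE_ROOT, name, version)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    shutil.move(os.path.join("build", name, version), target)
    for root, _, files in os.walk(target):
        for f in files:
            os.chmod(os.path.join(root, f), 0o755)
    # 6) move the module file to the right directory
    os.makedirs(os.path.join(MODULEFILE_ROOT, name), exist_ok=True)
    shutil.move(os.path.join("build", "modulefile"),
                os.path.join(MODULEFILE_ROOT, name, version))
    # 7) clean up and report back
    os.remove(package)
    shutil.rmtree("build", ignore_errors=True)
    print(f"deployed module {name}/{version}")
```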
B. Workflow execution

We aim to run workflows on different computational back-ends in a unified way. Different back-ends can have different scheduling middleware and we want to minimize the effort needed to switch to new ones. Furthermore, we have learnt that distributed scheduling systems are not always fully reliable in practice. Hence, we want to minimize our dependency on the back-end middleware when running large bioinformatics analyses.

One possible solution is to use a "pilot-job" framework. Our proposed "pilot-job" framework is less exhaustive than the one proposed by Luckow et al. [19] and does not cover all components of a complete "pilot-job" framework, such as "pilot-data" or "pilot-resource".

In our scenario, the "pilot-job" (further: Pilot) is sent to a computational back-end from the MOLGENIS web server, which also hosts the database with the actual generated analysis jobs (Sec. III-B). The back-end scheduler puts the Pilot into a queue for execution. When the Pilot starts on an execution node, it calls back to the MOLGENIS web server using cURL and asks for an available analysis job to execute. If a job is available, it is given to the Pilot. The Pilot starts the actual analysis by launching the analysis script in the background and continues in the main thread to send notifications and updates about the analysis job status back to the web server. When the analysis is finished, its log files are sent back from the analysis script to the database, also using cURL. The analysis job is counted as completed if both the Pilot, as a monitor, and the job itself report a successful completion. Otherwise, the job is counted as failed. The whole execution process using the proposed "pilot-job" scenario is shown in Figure 3.

Fig. 3: Job execution using the MOLGENIS "pilot-job"

This approach works for both the gLite-WMS grid and the PBS/SGE cluster middleware. The only differences are the "pilot-job" itself and the command line used to submit the "pilot-job" to the scheduler. In the current implementation, "pilot-jobs" are submitted via the pilots dashboard (Fig. 4), which is web-generated for every user@backend combination.

Fig. 4: Example of the web MOLGENIS pilots dashboard.

The MOLGENIS database contains a number of back-end records, which include information such as the back-end address, the back-end type (e.g. gLite grid, PBS cluster), the username and the path to the file used for pilot submission (e.g. .jdl, .sh files). The dashboard user can generate jobs for submission to a specific back-end and then submit and monitor their execution after entering his password. In the future, we are planning to create separate dashboards for every analysis run. A resubmission option becomes visible in the dashboard if failed jobs exist.

The actual "pilot-jobs" are simple shell scripts. They are defined per back-end middleware and embody the allocation of the back-end resources (e.g. wall-time, memory). The "pilot-job" source code is available in the MOLGENIS repository [20]. The dashboard user can choose specific "pilot-jobs" for submission of long, short, CPU-demanding or memory-demanding jobs.

To summarize, we propose a solution that makes it trivial to use new middleware in our system and does not require any deep study of the new middleware's technical details.
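The callback protocol of Figure 3 can be summarised in a few lines. The real Pilot is a shell script that talks to the MOLGENIS server with cURL; in the sketch below the server URL, the endpoint paths and the JSON fields are invented for illustration only.

```python
# Minimal sketch of the Pilot's main loop. The endpoint paths and JSON fields
# are hypothetical and only illustrate the callback protocol of Fig. 3.
import json
import subprocess
import time
import urllib.request

SERVER = "https://molgenis.example.org/api/pilot"   # placeholder server URL


def post_json(url: str, payload: dict) -> dict:
    data = json.dumps(payload).encode()
    req = urllib.request.Request(url, data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def run_pilot(backend: str) -> None:
    # ask the web server for an available analysis job
    job = post_json(f"{SERVER}/take", {"backend": backend})
    if not job:
        return                                       # nothing to do, pilot exits
    # start the actual analysis script in the background
    proc = subprocess.Popen(["bash", "-c", job["script"]])
    # the main thread keeps sending a "pulse" with the job status
    while proc.poll() is None:
        post_json(f"{SERVER}/status", {"job": job["id"], "status": "running"})
        time.sleep(60)
    status = "completed" if proc.returncode == 0 else "failed"
    post_json(f"{SERVER}/status", {"job": job["id"], "status": status})
```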
V. IMPLEMENTATION DETAILS ON ANALYSIS PROTOCOL TEMPLATES

As discussed above, the actual scripts are generated from templates stored in Protocol elements or protocol files, where the template parameters are filled in with actual values (Sec. III-A1). Here, we present the listing of the protocol template of the BwaAlign operation as a protocol example.

    //header
    #MOLGENIS walltime=15:00:00 \
        nodes=1 cores=4 mem=6
    #FOREACH leftbarcodefqgz

    //tool management
    module load bwa/${bwaVersion}

    //data management
    getFile ${indexfile}
    getFile ${leftbarcodefqgz}

    //template of the actual analysis
    bwa aln \
        ${indexfile} \
        ${leftbarcodefqgz} \
        -t ${bwaaligncores} \
        -f ${leftbwaout}

    //data management
    putFile ${leftbwaout}

The template consists of several parts. Let us look at them in detail. The script header is used only in the command-line version of MOLGENIS Compute (Sec. III-B) for workflow generation from files. In the command-line version, the final analysis commands are enveloped with the header and footer of supplementary commands for a specific back-end. The #MOLGENIS header is used for the generation of these back-end specific headers/footers, which contain the job-specific parameters used for job scheduling, such as the job wall-time, the amount of required memory and the number of cores to run the analysis. The #FOREACH line specifies the Analysis Target used in this protocol (Sec. III-A1).

Next comes the tool management section, which is discussed in Section IV-A; the bwa module is loaded in this example. Further, there are the data management and actual analysis sections. The data management ensures the transfer of the files required for the analysis to the location (computational node) in the grid or cluster where the actual computations take place. The functions getFile and putFile differ per back-end type. For example, for the grid back-end, these functions implement file transfer using srm-commands from and to the central storage used in the National Computing Grid for Life Sciences [21], where we run our computations. This transfer is needed to keep the analysis results after the analysis is finished. The listings of the data transfer files are also available in the MOLGENIS repository [20].

The actual analysis section contains the template of the actual analysis command, which is filled with values during generation (Sec. III-B). Variables in curly brackets are Parameters in our model. Some Parameters are combinations of others and, in turn, are constructed using templates. For example, Parameter leftbwaout has the following value:

    ${leftfile}.bwa_align.${indexfileID}.sai

So, a variable can be a complex combination of other variables. The generated command of the BwaAlign example is listed below:

    bwa aln \
        human_g1k_v37.fa \
        121128_SN163_0484_AC1D3HACXX_L8_CAACCT_1.fq.gz \
        -t 4 \
        -f 121128_SN163_0484_AC1D3HACXX_L8_CAACCT_1.bwa_align.human_g1k_v37.sai

Hiding the supplementary operations for tool and data management, which are specific per back-end, outside the generated scripts allows us to run the same scripts on different back-ends. Using the module system to access the analysis software and additional data needed for the analysis, such as large reference sets and genome builds, guarantees identical analysis results when running the analysis on different back-ends.

VI. EXAMPLES OF AVAILABLE WORKFLOWS FOR THE GLITE-WMS GRID AND CLUSTERS

We created a library of several publicly available workflows [22]. Two of them are fully tested in the National Computing Grid for Life Sciences, which uses gLite-WMS as its scheduling middleware, on PBS/SGE/BSUB clusters and on a local machine.

A. NGS alignment

The first workflow is the NGS alignment and SNP calling workflow, which is composed of best-practice open-source software packages used in multiple institutions, leading to 23 analysis steps. It consists of four major parts:

(1) Alignment: The alignment is performed using the Burrows-Wheeler Aligner (BWA). The produced SAM file is converted to a binary format using Picard and sorted afterwards.

(2) Realignment: In this part of the workflow, duplicate reads are marked using Picard. Afterwards, realignment is performed around known insertions and deletions using the Genome Analysis ToolKit (GATK). If reads are re-aligned, the fix-mates step updates the coordinates of each read's mate.

(3) Quality score recalibration: The base quality scores of the reads are re-calibrated using covariate information. This method takes several covariates, such as cycle, dinucleotides and readgroup, into account and recalculates the quality scores, leading to reads being re-written with better empirical quality scores.

(4) Variant calling: The last part of the workflow performs indel and SNP calling using the GATK. The output of the pipeline is two VCF files, one with indels and one containing SNPs, ready for downstream analysis.

The expected time to run this workflow per Sample is about two days, depending on whether we are doing exome or whole-genome sequencing.

B. Imputation

The second workflow is imputation of genome-wide association study (GWAS) data. This workflow consists of four steps:

(1) Preparing the reference: Here, the reference is created from the Variant Call Format (VCF) file of the genome reference (this step is executed per reference set).

(2) Preparing and checking the quality of the study data: In this step, all alleles are aligned to the reference genome and quality control is performed.

(3) Phasing: Here, the data is phased using SHAPEIT2 [23]; this is done per chromosome. The phasing can also be done once for a specific study.

(4) Imputation: This step consists of imputing the phased data with IMPUTE2 [24] and concatenating the results per chromosome.

Normally, we split the dataset into regions of 5 megabases and batches of 500 samples for one imputation job. The number of imputation jobs can be calculated using Formula 1. This step is executed per study.

$$ \mathit{Chunks} = \sum_{chr=1}^{22} \frac{\mathit{Length}_{chr}}{5\,\mathrm{Megabases}} \times \frac{\mathit{Samples}}{500} \qquad (1) $$

For genome build 37 and 500 samples this yields 589 chunks. The execution time depends mostly on the reference set used. We observed an average of 50 minutes to impute one chunk when running the workflow in the National Computing Grid for Life Sciences [21].
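Formula 1 expressed in code: the chromosome lengths are an input (e.g. taken from the build 37 reference), and rounding up per chromosome and per sample batch is our reading of the batching rule, so this is an illustrative sketch rather than the generator's actual code.

```python
# Formula 1 as code: the number of imputation chunks for one study.
# chrom_lengths_bp maps chromosome name -> length in base pairs (e.g. for
# genome build 37); rounding up per chromosome/batch is an assumption here.
import math

REGION_SIZE_BP = 5_000_000     # 5-megabase regions
SAMPLES_PER_JOB = 500          # samples per imputation job


def imputation_chunks(chrom_lengths_bp: dict, n_samples: int) -> int:
    regions = sum(math.ceil(length / REGION_SIZE_BP)
                  for length in chrom_lengths_bp.values())
    sample_batches = math.ceil(n_samples / SAMPLES_PER_JOB)
    return regions * sample_batches
```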
VII. DISCUSSION

A. Scaling bio-analysis

Figure 5 summarizes our view of the bio-analysis scaling process.

Fig. 5: Scaling computational environment

Scaling can be split up into two environment spaces: (1) analysis tools and (2) analysis data. Both tools and data should be available on the computational back-ends where we want to run analyses. The module deployment approach (Sec. IV-A) solves the software availability problem in both the cluster and the grid environment; for grid deployment, the deployment scripts should be run on all sites of the grid. Data availability in the grid can be solved by uploading the input data and intermediate analysis results to the central grid srm storage, which is available to all grid sites. We have found it good practice to define a root directory in both the grid and the clusters used, and to keep the directory structures below that root identical in all back-ends. Also, automating the file transfer between the srm grid storage and local clusters outside the grid can be very useful.

B. Pilot implementation

We tried to keep the modelling and implementation of our "pilot-job" approach straightforward and therefore robust. We found that sending the "pulse", i.e. the status of the actual analysis, is an important functionality of the "pilot-job". It is very useful because we intend to run long (2-10 hour) analysis jobs and we want to know about a possible analysis failure as soon as possible. We can improve the usage of the resources allocated to a "pilot-job" by sending several analysis jobs to one "pilot-job", if the analysis jobs fit within the "pilot-job" wall-time. We meet the conditions of "pilot-data" through proper workflow design, where we divide the analysis data into chunks if needed. Still, it can be interesting to try out other, third-party "pilot" solutions.

C. Ensuring correctness when running NGS and imputations in the distributed environment

Running analyses in the grid is even less reliable than on a cluster, since it involves extra file transfers and less control over the remote execution nodes. Furthermore, with large datasets to analyse, we do not want to rely only on the content of the log files produced by the NGS and imputation analysis tools.

For NGS analysis, we keep the number of DNA sequence reads constant from the input workflow files to the final results of the analysis. We do not remove any reads during filtering; instead we just mark them. So, the number of reads stays constant during the whole analysis, and checking this postcondition after the analysis guarantees its correct completion. The number of reads in sequence data files can be calculated using the Genome Analysis Toolkit GATK [25]. We perform the read counting only at the end of the pipeline because of its computational overhead. For imputation with minimac, the number of samples in the input and output files is a good indication of a successful analysis completion. The number of samples is equal to the number of lines in these files; it should be equal in the input and output files and can be easily checked.
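A minimal sketch of the imputation postcondition check described above: the number of samples, read as the number of lines, must match between the input and output files. File paths and formats are the caller's concern; gzip handling is left out for brevity.

```python
# Sketch of the sample-count postcondition for imputation (Sec. VII-C):
# the number of lines (samples) must be the same in the input and output files.
def count_lines(path: str) -> int:
    with open(path) as handle:
        return sum(1 for _ in handle)


def samples_preserved(input_file: str, output_file: str) -> bool:
    n_in, n_out = count_lines(input_file), count_lines(output_file)
    if n_in != n_out:
        print(f"postcondition failed: {n_in} input vs {n_out} output samples")
        return False
    return True
```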
D. Workflow modelling and visualization

Having the database and the web user interface generated from the model, it is not surprising that we chose a simple table to show the workflow structure. An example workflow visualization in MOLGENIS Compute is shown in Figure 6. Every line of the table shows a WorkflowElement, its Protocol and the previous WorkflowElements, if they exist. Generated Jobs, their Parameters and Values are presented in separate tables in the same style.

Fig. 6: Showing workflow structure in the MOLGENIS framework [8]

Many other established WMSs visualize workflows as graphs, where the nodes are analysis steps and the edges are interactions and constraints between analysis steps. Still, the proposed visualizations do not cover all analysis aspects needed to optimize workflow execution. As one of our next development directions, we want to apply visual analytics methods used in the software field, such as combined UML (unified modeling language) diagrams and "quality" metrics, to enhance understanding and sharing of workflows and to ease workflow analysis and design [26].

VIII. CONCLUSION

We proposed a lightweight collaborative analysis environment for bioinformatics service teams, in which the analysis to be computed in different back-end environments, such as computational clusters and grids, is specified in a unified way. For this, we use one model that is suitable for different workflows, such as NGS alignment and SNP imputation. All analysis jobs can be generated in the database and executed using a "pilot-job" framework, or they can be generated as shell scripts with back-end specific headers and footers.

We would next like to create a workflow importer/exporter to interchange workflows between different workflow management tools, such as Galaxy or Taverna. This would let us re-use more externally defined workflows and easily share the workflows defined in our group. A second essential direction is to study the practical effectiveness of running workflows in different computational back-ends, using the visual analytics methods described in [26] and learning from best practices and experiences.

ACKNOWLEDGMENT

The authors would like to thank the BBMRI-NL Rainbow Project 2 (http://www.bbmri.nl), the Target project (http://www.rug.nl/target) and the BigGrid eBioGrid project (http://www.ebiogrid.nl). Also, the authors thank Erwin Winder for implementing the pilots dashboard generator.

REFERENCES

[1] M. DePristo and M. Daly, "A framework for variation discovery and genotyping using next-generation DNA sequencing data," Nature Genetics, vol. 43(5), pp. 491–498, 2011.
[2] M. Swertz and R. Jansen, "Beyond standardization: dynamic software infrastructures for systems biology," Nature Reviews Genetics, vol. 8:3, pp. 235–43, 2007.
[3] ——, "The MOLGENIS toolkit: rapid prototyping of biosoftware at the push of a button," BMC Bioinformatics, vol. 11:12, 2010.
[4] ——, "XGAP: a uniform and extensible data model and software platform for genotype and phenotype experiments," Genome Biology, vol. 11:27, 2010.
[5] Y. Li and R. Jansen, "Global genetic robustness of the alternative splicing machinery in Caenorhabditis elegans," Genetics, vol. 186(1), pp. 405–10, 2010.
[6] Y. Li and M. Swertz, "designGG: an R-package and web tool for the optimal design of genetical genomics," BMC Bioinformatics, vol. 10:188, 2009.
[7] Genomics Coordination Center, Groningen, "MOLGENIS web-site," 2011, http://www.molgenis.org.
[8] H. Byelas and M. Swertz, "Towards a MOLGENIS-based computational framework," in Proceedings of the 19th EUROMICRO International Conference on Parallel, Distributed and Network-Based Computing, pp. 331–339, 2011.
[9] ——, "Introducing data provenance and error handling for NGS workflows within the MOLGENIS computational framework," in Proceedings of the BIOSTEC BIOINFORMATICS-2012 conference, pp. 42–50, 2012.
[10] Adaptive Computing, "TORQUE resource manager," 2012, http://www.adaptivecomputing.com/products/open-source/torque/.
[11] T. Glatard, J. Montagnat, D. Lingrand, and X. Pennec, "Flexible and efficient workflow deployment of data-intensive applications on grids with MOTEUR," Int. J. High Perform. Comput. Appl., vol. 22, no. 3, pp. 347–360, 2008.
[12] University of Virginia, Center for Public Health Genomics, "TavernaPBS," 2012, http://cphg.virginia.edu/mackey/projects/sequencing-pipelines/tavernapbs/.
[13] T. Oinn and M. Greenwood, "Taverna: lessons in creating a workflow environment for the life sciences," Concurrency and Computation: Practice and Experience, vol. 18:10, pp. 1067–1100, 2005.
[14] E. Kawas and M. Wilkinson, "BioMoby extensions to the Taverna workflow management and enactment software," BMC Bioinformatics, vol. 7:523, 2006.
[15] D. Blankenberg and J. Taylor, "A framework for collaborative analysis of ENCODE data: making large-scale analyses biologist-friendly," Genome Res., vol. 17:6, pp. 960–4, 2007.
[16] P. Kacsuk, "P-GRADE portal family for grid infrastructures," Concurrency and Computation: Practice and Experience, vol. 23:3, pp. 235–245, 2012.
[17] J. L. Furlani and P. W. Osel, "Abstract yourself with modules," in Proceedings of the Tenth Large Installation Systems Administration Conference (LISA '96), pp. 193–204, 1996.
[18] BBMRI.NL, "The Dutch project for biobanks," 2012, http://www.bbmriwiki.nl/svn/ebiogrid/modules/GATK/1.0.5069/.
[19] A. Luckow, "SAGA BigJob: an extensible and interoperable pilot-job abstraction for distributed applications and systems," in Proceedings of the 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010.
[20] MOLGENIS team, "The MOLGENIS GitHub repository, pilot sources," 2012, http://github.com/molgenis/molgenis_apps-legacy/tree/testing/modules/compute/pilots/.
[21] BiG Grid, "The Dutch e-science grid," 2010, http://www.biggrid.nl.
[22] MOLGENIS team, "The MOLGENIS workflow repository," 2013, http://github.com/molgenis/molgenis-pipelines/tree/master/compute4.
[23] O. Delaneau and J. Marchini, "Improved whole-chromosome phasing for disease and population genetic studies," Nature Methods, vol. 10, pp. 5–6, 2013.
[24] B. Howie and J. Marchini, "A flexible and accurate genotype imputation method for the next generation of genome-wide association studies," PLoS Genet, vol. 5, p. e1000529, 2009.
[25] The Genome Analysis Toolkit, Broad Institute, 2011, http://www.broadinstitute.org/.
[26] H. Byelas and M. Swertz, "Visualization of bioinformatics workflows for ease of understanding and design activities," in Proceedings of the BIOSTEC BIOINFORMATICS-2013 conference, pp. 117–123, 2013.