Scaling bio-analyses from computational clusters to grids

Heorhiy Byelas, Martijn Dijkstra, Pieter Neerincx, Freerk van Dijk, Alexandros Kanterakis, Patrick Deelen, Morris Swertz
Genomics Coordination Center, Department of Genetics, University Medical Center Groningen, University of Groningen, The Netherlands
Email: h.v.byelas@med.umcg.nl, m.a.swertz@rug.nl

Abstract—Life sciences have moved rapidly into big data thanks to new parallel methods for gene expression, genome-wide association, proteomics and whole-genome DNA sequencing. The scale of these methods is growing faster than predicted by Moore's law. This has introduced new challenges and needs: specifying computation protocols for, e.g., Next-Generation Sequencing (NGS) and genome-wide association study (GWAS) imputation analyses, and running them on a large scale, is a complicated task due to the many steps involved, long runtimes, heterogeneous computational resources and large files. The process becomes error-prone when dealing with hundreds of samples, as in genomic analysis facilities, if it is performed without an integrated workflow framework and data management system. From recent projects we learnt that bioinformaticians do not want to invest much time in learning advanced grid or cluster scheduling tools; they prefer to concentrate on their analyses, stay close to old-fashioned shell scripts that they can fully control, and have automatic mechanisms take care of all submission and monitoring details. We present a lightweight workflow declaration and execution system to address these needs, built on top of the MOLGENIS framework for data tracking. We describe lessons learnt when scaling NGS and imputation analyses from computational clusters to grids and show the application of our solution, in particular in the nation-wide "Genome of the Netherlands" project (GoNL, 700 TB of data and about 200,000 computing hours).

I. INTRODUCTION

High-throughput analysis methods have created exciting new possibilities for unraveling genotype-to-phenotype relationships. However, these experiments are heavily dependent on large computational analyses for their success. For instance, next-generation sequencing analyses typically involve about 30 computational steps, such as alignment, duplicate marking, single-nucleotide polymorphism (SNP) calling, annotation and many re-indexing and quality control steps [1]. Similarly, GWAS data typically requires batching for imputation. Genomic analysis facilities routinely face running many different versions of such computational pipelines on hundreds or thousands of samples. This quickly becomes a nightmare of data file logistics (raw, intermediate, result, quality control and log data) and of computational scheduling (large, small, short and long jobs that may have error states and different computational back-ends). Furthermore, different cluster and grid middleware do not provide all the operations necessary to execute bio-workflows. To address these challenges we present a practical software system that combines computational and data management for the routine running of large bioinformatics analyses.

In this work, our goal is to run different workflows in a unified way and to make workflow adaptation to different back-ends, such as cluster and grid environments, more standard and easier. Hence, users can choose a computational back-end with less load to run their analyses, run workflows in different back-end environments and achieve identical analysis results, and, if needed, combine results together. We do not want to add another middleware layer on top of the back-end services, such as job schedulers, but prefer to minimize the overhead of running shell scripts in different environments.

In this paper, we consider the principles of the MOLGENIS software suite [2], [3] from a new perspective, where it can be used as a separate module to generate all the instruments necessary for tracing data and computations and for collecting logging information from different computational environments. We allow users to decide how they want to orchestrate these instruments in different middleware systems, although we supply a default management solution, which includes a database and a "pilot-job" framework to run analysis jobs. In addition, we show how the system can be applied to run NGS and imputation workflows in cluster and grid environments.

This paper is structured as follows. Section II reviews related work in the context of the conventions used in MOLGENIS Compute and of attempts to deploy workflow management systems to different computational back-ends. Section III describes the workflow design model used for workflow generation and the generation process. Section IV presents the workflow deployment and the design of the "pilot-job" framework used for workflow execution. Section V explains implementation details of creating analysis protocols that are suitable for execution on different back-ends. Section VI provides details on the NGS and imputation workflows. Section VII discusses our practical experiences and the benefits and drawbacks of using the system. Section VIII offers our conclusions.
II. RELATED WORK

We divide the related work into two sub-domains that describe work in (1) workflow design and generation and (2) workflow execution and deployment.

A. Workflow generation and design

Software generators are systems that generate other software; software specifications described in a model language play the role of input for the generation. Our workflow model is described in Section III and used as input for the workflow generation. As output, we expect a generated workflow management system that is able to run workflows in a distributed computational environment.

We have gained a lot of experience in generating software infrastructures to manage and process large bioinformatics datasets [4], [5], [6]. The MOLGENIS toolkit [7] provides bioinformaticians with a simple language to model biological data structures and user interfaces. The MOLGENIS generator suite automatically translates these models into a feature-rich, ready-to-use web application including a database, user interfaces, exchange formats, and scriptable interfaces. This "model-driven" method ensures re-use of best practices and improves quality, because the modelling language and generators are shared between all MOLGENIS applications, so that errors are found quickly and improvements are shared easily by re-generation. A plug-in mechanism ensures that both the generator suite and the generated product can be customized just as much as hand-written software.

In our previous work on workflow management, we aimed to combine computational and data management in a single system: MOLGENIS Compute [8]. We then extended the initial solution with a meta-model more specific to NGS analysis, which allowed us to specify data provenance and workflow execution logic efficiently [9]. In our NGS-specific solution, we consider the computational cluster as the back-end where all computation takes place. Users communicate with the system through the generated web interface. Some logging information is usually hidden in the user interface, which makes the view on the data more compact and easier to comprehend. However, in some cases, such as when debugging a new workflow or deploying it in a new environment, users may need this hidden logging information from the operational system or the analysis tools. Expert users would like to have direct and easy access to all log files produced, and should not be obliged to obtain them through the web interface, which requires extra implementation effort. This time we want to provide both a command-line and a web interface to the workflow management.
B. Workflow execution and deployment

Whereas in previous work we focused on generating compute management software for PBS clusters [10], here we want to run workflows in different back-end environments. There are a number of projects, such as MOTEUR [11] and TavernaPBS [12], which deploy the Taverna workflow management system (WMS) [13] to grid and cluster environments, respectively. Furthermore, adding the BioMoby plug-in [14] to Taverna can make running biological workflows and tracking data provenance even more structured. Another workflow system, Galaxy [15], which specialises in running bioinformatics workflows, can be configured to run analyses on Portable Batch System (PBS) or Sun Grid Engine (SGE) clusters, but we are not aware of any project that connects Galaxy to grid middleware.

Besides WMSs, bioinformatics analyses can be executed in distributed computational environments using web-based portals, such as the P-GRADE Portal [16], which supports development and submission of distributed applications executed on the computational resources of various distributed computing infrastructures (DCIs), including clusters and service grids, via a web browser. The portal architecture consists of several layers, including a presentation layer with a web interface, a middle layer with different services, and an architecture layer with the available cluster/grid middleware connectors. In our approach, we also want to communicate with the system through a generated web presentation layer, but we minimise the extra middleware layer between cluster/grid services and our system and keep the implementation that communicates with the back-end middleware minimal.

For the workflow deployment, instead of the service-based strategies proposed in Taverna, where the actual analysis tools are wrapped into web services or other "black-box" components, we would like to use the tools directly in both cluster and grid settings. This allows us to use analysis tools without losing any functionality that could be hidden behind a wrapper interface. We want to automate the tool installation process in a way that suits, where possible, all environments that we include in our back-end compute infrastructure.

To summarize, we (1) re-use best practices from the MOLGENIS database generator to generate experiment data and workflow models for users and (2) establish a direct connection to cluster/grid infrastructures for bioinformatics.
III. WORKFLOW DESIGN AND GENERATION

A. Workflow design

In this new effort, we would like to model bioinformatics workflows so that they can run in a unified way in different computational environments. We can divide the modelling task into two sub-tasks:

• workflow modelling, which covers describing workflows so that they can run on different computational back-ends
• data modelling, which covers the different bioinformatics analyses (e.g. NGS or imputations)

1) Workflow modelling: The core of our workflow design remains stable [8], [9]. Our main goal is to keep the model as simple as possible and avoid abstract entities. The model is shown in Figure 1.

Fig. 1: Core of the computing model

The Workflow element includes all individual analysis WorkflowElements and represents the succession of the analysis. The core of the model are the analysis script templates stored in Protocols. A Protocol has Parameters. Parameters can be of different types, such as workflow parameters, environment parameters and user parameters. During generation, the template parameters are filled with actual Values. The resulting script is then combined with supplementary headers and footers, which are specific to the different back-ends, and stored as an analysis Job. Generation using templates is discussed next.

2) Data modelling: Our NGS model is described in detail in [9]. It is based on the laboratory process around an Illumina HiSeq 2000 machine, which produces the input data for our analysis workflows. It consists of elements such as Projects, Samples, DNA Libraries, Flowcells, Barcodes, etc. We introduce one common Analysis Target interface to enable us to refer uniformly to the subjects of different analyses. Analysis Targets can have complex relations between themselves. For instance, in our NGS workflow, some analyses are executed per Barcode being the Analysis Target, but others are executed per Project or Sample being the Analysis Target, which includes all Barcodes. Hence, it is very important to specify the Analysis Targets and their relations for every analysis element in the workflow.
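To make the relations in Figure 1 concrete, the following sketch renders the core entities as plain Python classes. The class and field names mirror the model (Workflow, WorkflowElement, Protocol, Parameter, Job), but the code is only an illustration of the model; it is not the generated MOLGENIS implementation.

```python
# Illustrative sketch of the computing model in Figure 1 (not the MOLGENIS code,
# which is generated from the model itself).
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Parameter:
    name: str                 # e.g. "indexfile" or "leftbarcodefqgz"
    kind: str = "user"        # workflow, environment or user parameter


@dataclass
class Protocol:
    name: str                 # e.g. "BwaAlign"
    template: str             # analysis script template with ${...} placeholders
    parameters: List[Parameter] = field(default_factory=list)


@dataclass
class WorkflowElement:
    name: str
    protocol: Protocol
    previous: List["WorkflowElement"] = field(default_factory=list)  # ordering constraints


@dataclass
class Workflow:
    name: str
    elements: List[WorkflowElement] = field(default_factory=list)


@dataclass
class Job:
    name: str
    script: str                                            # filled-in template plus back-end header/footer
    values: Dict[str, str] = field(default_factory=dict)   # Parameter name -> Value
```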
B. Analysis jobs generation

In general, an analysis intended for execution on a cluster or grid consists of many chunks, and these chunks are arranged into workflows. The workflow structure is essential to run the analysis parts in the correct order. Usually, all workflow parts can be generated at once for later execution. An exception is the case where the number of outputs (i.e. output files) is not known beforehand and these files should be used as inputs in further steps of the workflow.

We support two implementations of the workflow generation: (1) a database version and (2) a command-line version, which uses files instead of a database. In the database version, a workflow is described in the MySQL database, which is generated from the model (Sec. III-A1). In the command-line version, a workflow is described in files:

• workflow.csv: a file with the sequence of workflow elements,
• parameters.csv: a file with the workflow parameters, and
• protocols: a folder with the templates of the workflow protocols.

These files match the database model and can be imported into and exported from the database system. Analysis Targets can likewise be either selected as database records through the generated web interface or listed in the input parameters file. The workflow description and the Analysis Targets are the actual input for generating analysis jobs.

In the database version, jobs are generated as records in the database. The generation scheme of the command-line version is presented in Figure 2. Here, jobs are generated as script files in a specified directory. These script files are ready for submission to the back-end specified at generation time. For this, we provide supplementary headers and footers for the different computational back-ends. These headers/footers are written as templates and stored as files; they are filled with parameters during generation. Headers/footers are used to specify e.g. the job wall-time, memory requirements and job logging. All that is needed to generate jobs for a new back-end is to add new header/footer templates for it. The generator source code stays the same. In addition, we provide the possibility for users to submit single analysis jobs, or jobs with dependencies, to a specified back-end.

Fig. 2: Generation procedure using the MOLGENIS job generator (command-line version)

In the database version, we use "pilot-jobs" to run and monitor analysis jobs. This solution requires a web server and a database, which are discussed next.
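The command-line generation step can be pictured as follows: fill the protocol template with the Values for one Analysis Target and envelope the result with the back-end specific header and footer. The sketch below uses Python's string.Template as a stand-in for the real template engine; the template fragments, header syntax and parameter values are illustrative only.

```python
# Sketch of job generation for the command-line version: template + values +
# back-end header/footer -> submit-ready script. string.Template only stands in
# for the real template engine; all values below are illustrative.
from string import Template


def generate_job(protocol_template: str, values: dict,
                 header_template: str, footer_template: str) -> str:
    """Return the script for one analysis Job."""
    header = Template(header_template).safe_substitute(values)
    body = Template(protocol_template).safe_substitute(values)
    footer = Template(footer_template).safe_substitute(values)
    return "\n".join([header, body, footer])


bwa_protocol = (
    "module load bwa/${bwaVersion}\n"
    "bwa aln ${indexfile} ${leftbarcodefqgz} -t ${bwaaligncores} -f ${leftbwaout}"
)
values = {
    "bwaVersion": "0.5.9",                 # illustrative version
    "indexfile": "human_g1k_v37.fa",
    "leftbarcodefqgz": "lane8_CAACCT_1.fq.gz",
    "bwaaligncores": "4",
    "leftbwaout": "lane8_CAACCT_1.bwa_align.human_g1k_v37.sai",
}
pbs_header = "#PBS -l walltime=15:00:00,nodes=1:ppn=${bwaaligncores},mem=6gb"
pbs_footer = "echo job finished"

print(generate_job(bwa_protocol, values, pbs_header, pbs_footer))
```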
IV. WORKFLOW DEPLOYMENT AND EXECUTION IN DIFFERENT COMPUTATIONAL ENVIRONMENTS

A. Workflow deployment

We define a workflow deployment method that allows us to easily add a new computational back-end and run workflows there. Here, by workflow deployment, we mean the possibility to execute all workflow analysis steps on all available computational resources, such as clusters and grids, without requiring changes in the analysis structure or protocols. This is very important to ensure the reproducibility of workflow results, even if the analysis is executed in a new environment. Previously, we executed workflows only on computational clusters; hence, it was sufficient to set up the same execution settings in all clusters to ensure reproducibility.

In this new effort, we reuse the environment modules package [17], which provides dynamic modification of a user's environment via module files. Two actions are needed to install a new software module into the system: (1) the software should be deployed on an execution site and (2) a module file, which describes the software, should be added to the system. For a module deployment, we submit deploy scripts as simple cluster/grid jobs to a cluster/grid scheduler. An example of a deploy job that installs the GATK software via the gLite-WMS scheduler can be found in the Dutch biobank project repository [18]. We define two types of deployment scripts: (A) pre-built and (B) on-site build.

The (A) pre-built deployment has the following logical steps (a code sketch of the automated part follows at the end of this subsection):

1) (manually) create a module file that contains all changes to environment variables required to make the software work
2) (manually) build the software binary files
3) (manually) compress the built software and module file and upload them to an online repository
4) (cluster/grid job) download the packaged software from the repository and decompress it
5) (cluster/grid job) move the software to the right directory and ensure the permissions are right
6) (cluster/grid job) move the module file to the right directory
7) (cluster/grid job) clean up, check the environment variables and send the deployment report

The (B) on-site build deployment has a few differences. A cluster/grid job downloads the source code of the software from the online repository and builds the software binary on the cluster/grid site. This approach is needed when the software to be installed has dependencies on system libraries, e.g. the R software for statistical computing. When deployment is completed, the deployed modules can be initialized using the following statement in analysis scripts:

module load your_software_module/version

The software loaded via the module can then be invoked directly without specifying its classpath. Hence, we ensure that the same software is used for all analyses and that the analysis scripts for different back-ends contain exactly the same listings to access the tools used in the analysis. As a result, the generated analysis scripts (Sec. III-B) will be executed on all computational sites in the same way and produce identical results.
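The sketch below spells out the automated part of the pre-built deployment (steps 4-7 above). In practice these steps run as a shell script submitted to the cluster/grid scheduler; the repository URL, directory layout and module file name are placeholders, not the actual deployment scripts from [18].

```python
# Sketch of the pre-built module deployment job (steps 4-7 of the list above).
# All paths and URLs are placeholders; the real deploy jobs are shell scripts.
import os
import shutil
import tarfile
import urllib.request

REPO_URL = "https://example.org/modules"      # placeholder online repository
SOFTWARE_ROOT = "/apps/software"              # placeholder deploy directory
MODULEFILE_ROOT = "/apps/modulefiles"         # placeholder module file directory


def deploy_prebuilt(name: str, version: str) -> None:
    package = f"{name}-{version}.tar.gz"
    # 4) download the packaged software and module file, then decompress
    urllib.request.urlretrieve(f"{REPO_URL}/{package}", package)
    with tarfile.open(package) as tar:
        tar.extractall("build")
    # 5) move the software to the right directory and fix the permissions
    target = os.path.join(SOFTWARE_ROOT, name, version)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    shutil.move(os.path.join("build", name, version), target)
    for root, _, files in os.walk(target):
        for f in files:
            os.chmod(os.path.join(root, f), 0o755)
    # 6) move the module file to the right directory
    os.makedirs(os.path.join(MODULEFILE_ROOT, name), exist_ok=True)
    shutil.move(os.path.join("build", "modulefile"),
                os.path.join(MODULEFILE_ROOT, name, version))
    # 7) clean up and report back
    os.remove(package)
    shutil.rmtree("build", ignore_errors=True)
    print(f"deployed module {name}/{version}")
```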
B. Workflow execution

We aim to run workflows on different computational back-ends in a unified way. Different back-ends can have different scheduling middleware and we want to minimize the effort needed to switch to new ones. Furthermore, we have learnt that distributed scheduling systems are not always fully reliable in practice. Hence, we want to minimize our dependency on the back-end middleware when running large bioinformatics analyses.

One possible solution is to use a "pilot-job" framework. Our proposed "pilot-job" framework is less exhaustive than the one proposed by Luckow et al. [19] and does not cover all components of a complete "pilot-job" framework, such as "pilot-data" or "pilot-resource".

In our scenario, the "pilot-job" (further: Pilot) is sent to a computational back-end from the MOLGENIS web server, which also hosts the database with the actual generated analysis jobs (Sec. III-B). The back-end scheduler puts the Pilot into a queue for execution. When the Pilot starts on an execution node, it calls back to the MOLGENIS web server using cURL and asks for an available analysis job to execute. If a job is available, it is given to the Pilot. The Pilot starts the actual analysis by launching the analysis script in the background and continues in the main thread to send notifications and updates about the analysis job status back to the web server. When the analysis is finished, its log files are sent back from the analysis script to the database, also using cURL. The analysis job is counted as completed if both the Pilot, as a monitor, and the job itself report a successful completion. Otherwise, the job is counted as failed. The whole execution process using the proposed "pilot-job" scenario is shown in Figure 3.

Fig. 3: Job execution using the MOLGENIS "pilot-job"

This approach works for both the gLite-WMS grid and the PBS/SGE cluster middleware. The only differences are the "pilot-job" itself and the command line used to submit the "pilot-job" to the scheduler. In the current implementation, "pilot-jobs" are submitted via the pilots dashboard (Fig. 4), which is web-generated for every user@backend combination.

Fig. 4: Example of the web MOLGENIS pilots dashboard.

The MOLGENIS database contains a number of back-end records, which include information such as the back-end address, the back-end type (e.g. gLite grid, PBS cluster), the username and the path to the file used for pilot submission (e.g. .jdl, .sh files). The dashboard user can generate jobs for submission to a specific back-end and then submit and monitor their execution after entering his password. In the future, we are planning to create separate dashboards for every analysis run. A resubmission option becomes visible in the dashboard if failed jobs exist.

The actual "pilot-jobs" are simple shell scripts. They are defined per back-end middleware and embody the allocation of the back-end resources (e.g. wall-time, memory). The "pilot-job" source code is available in the MOLGENIS repository [20]. The dashboard user can choose specific "pilot-jobs" for submission of long, short, CPU-demanding or memory-demanding jobs.

To summarize, we propose a solution that makes it trivial to use new middleware in our system and does not require any deep study of the new middleware's technical details.
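The callback protocol of Figure 3 can be summarised in a few lines. The real Pilot is a shell script that talks to the MOLGENIS server with cURL; in the sketch below the server URL, the endpoint paths and the JSON fields are invented for illustration only.

```python
# Minimal sketch of the Pilot's main loop. The endpoint paths and JSON fields
# are hypothetical and only illustrate the callback protocol of Fig. 3.
import json
import subprocess
import time
import urllib.request

SERVER = "https://molgenis.example.org/api/pilot"   # placeholder server URL


def post_json(url: str, payload: dict) -> dict:
    data = json.dumps(payload).encode()
    req = urllib.request.Request(url, data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def run_pilot(backend: str) -> None:
    # ask the web server for an available analysis job
    job = post_json(f"{SERVER}/take", {"backend": backend})
    if not job:
        return                                       # nothing to do, pilot exits
    # start the actual analysis script in the background
    proc = subprocess.Popen(["bash", "-c", job["script"]])
    # the main thread keeps sending a "pulse" with the job status
    while proc.poll() is None:
        post_json(f"{SERVER}/status", {"job": job["id"], "status": "running"})
        time.sleep(60)
    status = "completed" if proc.returncode == 0 else "failed"
    post_json(f"{SERVER}/status", {"job": job["id"], "status": status})
```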
V. IMPLEMENTATION DETAILS ON ANALYSIS PROTOCOL TEMPLATES

As discussed above, the actual scripts are generated from templates stored in Protocol elements or protocol files, where the template parameters are filled in with actual values (Sec. III-A1). Here, we present the listing of the protocol template of the BwaAlign operation as a protocol example.

    //header
    #MOLGENIS walltime=15:00:00 \
        nodes=1 cores=4 mem=6
    #FOREACH leftbarcodefqgz

    //tool management
    module load bwa/${bwaVersion}

    //data management
    getFile ${indexfile}
    getFile ${leftbarcodefqgz}

    //template of the actual analysis
    bwa aln \
        ${indexfile} \
        ${leftbarcodefqgz} \
        -t ${bwaaligncores} \
        -f ${leftbwaout}

    //data management
    putFile ${leftbwaout}

The template consists of several parts. Let us look at them in detail. The script header is used only in the command-line version of MOLGENIS Compute (Sec. III-B) for workflow generation from files. In the command-line version, the final analysis commands are enveloped with the header and footer of supplementary commands for a specific back-end. The #MOLGENIS header is used for the generation of these back-end specific headers/footers, which contain the job-specific parameters used for job scheduling, such as the job wall-time, the amount of required memory and the number of cores to run the analysis. The #FOREACH line specifies the Analysis Target used in this protocol (Sec. III-A1).

Next comes the tool management section, which is discussed in Section IV-A; the bwa module is loaded in this example. Further, there are the data management and actual analysis sections. The data management ensures the transfer of the files required for the analysis to the location (computational node) in the grid or cluster where the actual computations take place. The functions getFile and putFile differ per back-end type. For example, for the grid back-end, these functions implement file transfer using srm-commands from and to the central storage used in the National Computing Grid for Life Sciences [21], where we run our computations. This transfer is needed to keep the analysis results after the analysis is finished. The listings of the data transfer files are also available in the MOLGENIS repository [20].

The actual analysis section contains the template of the actual analysis command, which is filled with values during generation (Sec. III-B). Variables in curly brackets are Parameters in our model. Some Parameters are combinations of others and, in turn, are constructed using templates. For example, Parameter leftbwaout has the following value:

    ${leftfile}.bwa_align.${indexfileID}.sai

So, a variable can be a complex combination of other variables. The generated command of the BwaAlign example is listed below:

    bwa aln \
        human_g1k_v37.fa \
        121128_SN163_0484_AC1D3HACXX_L8_CAACCT_1.fq.gz \
        -t 4 \
        -f 121128_SN163_0484_AC1D3HACXX_L8_CAACCT_1.bwa_align.human_g1k_v37.sai

Hiding the supplementary operations for tool and data management, which are specific per back-end, outside the generated scripts allows us to run the same scripts on different back-ends. Using the module system to access the analysis software and additional data needed for the analysis, such as large reference sets and genome builds, guarantees identical analysis results when running the analysis on different back-ends.

VI. EXAMPLES OF AVAILABLE WORKFLOWS FOR THE GLITE-WMS GRID AND CLUSTERS

We created a library of several publicly available workflows [22]. Two of them are fully tested in the National Computing Grid for Life Sciences, which uses gLite-WMS as its scheduling middleware, on PBS/SGE/BSUB clusters and on a local machine.

A. NGS alignment

The first workflow is the NGS alignment and SNP calling workflow, which is composed of best-practice open-source software packages used in multiple institutions, leading to 23 analysis steps. It consists of four major parts:

(1) Alignment: The alignment is performed using the Burrows-Wheeler Aligner (BWA). The produced SAM file is converted to a binary format using Picard and sorted afterwards.

(2) Realignment: In this part of the workflow, duplicate reads are marked using Picard. Afterwards, realignment is performed around known insertions and deletions using the Genome Analysis ToolKit (GATK). If reads are re-aligned, the fix-mates step updates the coordinates of each read's mate.

(3) Quality score recalibration: The base quality scores of the reads are re-calibrated using covariate information. This method takes several covariates, such as cycle, dinucleotides and readgroup, into account and recalculates the quality scores, leading to reads being re-written with better empirical quality scores.

(4) Variant calling: The last part of the workflow performs indel and SNP calling using the GATK. The output of the pipeline is two VCF files, one with indels and one containing SNPs, ready for downstream analysis.

The expected time to run this workflow per Sample is about two days, depending on whether we are doing exome or whole-genome sequencing.

B. Imputation

The second workflow is imputation of genome-wide association study (GWAS) data. This workflow consists of four steps:

(1) Preparing the reference: Here, the reference is created from the Variant Call Format (VCF) file of the genome reference (this step is executed per reference set).

(2) Preparing and checking the quality of the study data: In this step, all alleles are aligned to the reference genome and quality control is performed.

(3) Phasing: Here, the data is phased using SHAPEIT2 [23]; this is done per chromosome. The phasing can also be done once for a specific study.

(4) Imputation: This step consists of imputing the phased data with IMPUTE2 [24] and concatenating the results per chromosome.

Normally, we split the dataset into regions of 5 megabases and batches of 500 samples for one imputation job. The number of imputation jobs can be calculated using Formula 1. This step is executed per study.

$$ \mathit{Chunks} = \sum_{chr=1}^{22} \frac{\mathit{Length}_{chr}}{5\,\mathrm{Megabases}} \times \frac{\mathit{Samples}}{500} \qquad (1) $$

For genome build 37 and 500 samples this yields 589 chunks. The execution time depends mostly on the reference set used. We observed an average of 50 minutes to impute one chunk when running the workflow in the National Computing Grid for Life Sciences [21].
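Formula 1 expressed in code: the chromosome lengths are an input (e.g. taken from the build 37 reference), and rounding up per chromosome and per sample batch is our reading of the batching rule, so this is an illustrative sketch rather than the generator's actual code.

```python
# Formula 1 as code: the number of imputation chunks for one study.
# chrom_lengths_bp maps chromosome name -> length in base pairs (e.g. for
# genome build 37); rounding up per chromosome/batch is an assumption here.
import math

REGION_SIZE_BP = 5_000_000     # 5-megabase regions
SAMPLES_PER_JOB = 500          # samples per imputation job


def imputation_chunks(chrom_lengths_bp: dict, n_samples: int) -> int:
    regions = sum(math.ceil(length / REGION_SIZE_BP)
                  for length in chrom_lengths_bp.values())
    sample_batches = math.ceil(n_samples / SAMPLES_PER_JOB)
    return regions * sample_batches
```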
VII. DISCUSSION

A. Scaling bio-analysis

Figure 5 summarizes our view of the bio-analysis scaling process.

Fig. 5: Scaling computational environment

Scaling can be split up into two environment spaces: (1) analysis tools and (2) analysis data. Both tools and data should be available on the computational back-ends where we want to run analyses. The module deployment approach (Sec. IV-A) solves the software availability problem in both the cluster and the grid environment; for grid deployment, the deployment scripts should be run on all sites of the grid. Data availability in the grid can be solved by uploading the input data and intermediate analysis results to the central grid srm storage, which is available to all grid sites. We have found it good practice to define a root directory in both the grid and the clusters used, and to keep the directory structures below that root identical in all back-ends. Also, automating the file transfer between the srm grid storage and local clusters outside the grid can be very useful.

B. Pilot implementation

We tried to keep the modelling and implementation of our "pilot-job" approach straightforward and therefore robust. We found that sending the "pulse", i.e. the status of the actual analysis, is an important functionality of the "pilot-job". It is very useful because we intend to run long (2-10 hour) analysis jobs and we want to know about a possible analysis failure as soon as possible. We can improve the usage of the resources allocated to a "pilot-job" by sending several analysis jobs to one "pilot-job", if the analysis jobs fit within the "pilot-job" wall-time. We meet the conditions of "pilot-data" through proper workflow design, where we divide the analysis data into chunks if needed. Still, it can be interesting to try out other, third-party "pilot" solutions.

C. Ensuring correctness when running NGS and imputations in the distributed environment

Running analyses in the grid is even less reliable than on a cluster, since it involves extra file transfers and less control over the remote execution nodes. Furthermore, with large datasets to analyse, we do not want to rely only on the content of the log files produced by the NGS and imputation analysis tools.

For NGS analysis, we keep the number of DNA sequence reads constant from the input workflow files to the final results of the analysis. We do not remove any reads during filtering; instead we just mark them. So, the number of reads stays constant during the whole analysis, and checking this postcondition after the analysis guarantees its correct completion. The number of reads in sequence data files can be calculated using the Genome Analysis Toolkit GATK [25]. We perform the read counting only at the end of the pipeline because of its computational overhead. For imputation with minimac, the number of samples in the input and output files is a good indication of a successful analysis completion. The number of samples is equal to the number of lines in these files; it should be equal in the input and output files and can be easily checked.
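A minimal sketch of the imputation postcondition check described above: the number of samples, read as the number of lines, must match between the input and output files. File paths and formats are the caller's concern; gzip handling is left out for brevity.

```python
# Sketch of the sample-count postcondition for imputation (Sec. VII-C):
# the number of lines (samples) must be the same in the input and output files.
def count_lines(path: str) -> int:
    with open(path) as handle:
        return sum(1 for _ in handle)


def samples_preserved(input_file: str, output_file: str) -> bool:
    n_in, n_out = count_lines(input_file), count_lines(output_file)
    if n_in != n_out:
        print(f"postcondition failed: {n_in} input vs {n_out} output samples")
        return False
    return True
```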
D. Workflow modelling and visualization

Having the database and the web user interface generated from the model, it is not surprising that we chose a simple table to show the workflow structure. An example workflow visualization in MOLGENIS Compute is shown in Figure 6. Every line of the table shows a WorkflowElement, its Protocol and the previous WorkflowElements, if they exist. Generated Jobs, their Parameters and Values are presented in separate tables in the same style.

Fig. 6: Showing workflow structure in the MOLGENIS framework [8]

Many other established WMSs visualize workflows as graphs, where the nodes are analysis steps and the edges are interactions and constraints between analysis steps. Still, the proposed visualizations do not cover all analysis aspects needed to optimize workflow execution. As one of our next development directions, we want to apply visual analytics methods used in the software field, such as combined UML (unified modeling language) diagrams and "quality" metrics, to enhance understanding and sharing of workflows and to ease workflow analysis and design [26].

VIII. CONCLUSION

We proposed a lightweight collaborative analysis environment for bioinformatics service teams, in which the analysis to be computed in different back-end environments, such as computational clusters and grids, is specified in a unified way. For this, we use one model that is suitable for different workflows, such as NGS alignment and SNP imputation. All analysis jobs can be generated in the database and executed using a "pilot-job" framework, or they can be generated as shell scripts with back-end specific headers and footers.

We would next like to create a workflow importer/exporter to interchange workflows between different workflow management tools, such as Galaxy or Taverna. This would let us re-use more externally defined workflows and easily share the workflows defined in our group. A second essential direction is to study the practical effectiveness of running workflows in different computational back-ends, using the visual analytics methods described in [26] and learning from best practices and experiences.

ACKNOWLEDGMENT

The authors would like to thank the BBMRI-NL Rainbow Project 2 (http://www.bbmri.nl), the Target project (http://www.rug.nl/target) and the BigGrid eBioGrid project (http://www.ebiogrid.nl). Also, the authors thank Erwin Winder for implementing the pilots dashboard generator.

REFERENCES

[1] M. DePristo and M. Daly, "A framework for variation discovery and genotyping using next-generation DNA sequencing data," Nature Genetics, vol. 43(5), pp. 491–498, 2011.
[2] M. Swertz and R. Jansen, "Beyond standardization: dynamic software infrastructures for systems biology," Nature Reviews Genetics, vol. 8:3, pp. 235–43, 2007.
[3] ——, "The MOLGENIS toolkit: rapid prototyping of biosoftware at the push of a button," BMC Bioinformatics, vol. 11:12, 2010.
[4] ——, "XGAP: a uniform and extensible data model and software platform for genotype and phenotype experiments," Genome Biology, vol. 11:27, 2010.
[5] Y. Li and R. Jansen, "Global genetic robustness of the alternative splicing machinery in Caenorhabditis elegans," Genetics, vol. 186(1), pp. 405–10, 2010.
[6] Y. Li and M. Swertz, "designGG: an R-package and web tool for the optimal design of genetical genomics," BMC Bioinformatics, vol. 10:188, 2009.
[7] Genomics Coordination Center, Groningen, "MOLGENIS web-site," 2011, http://www.molgenis.org.
[8] H. Byelas and M. Swertz, "Towards a MOLGENIS-based computational framework," in Proceedings of the 19th EUROMICRO International Conference on Parallel, Distributed and Network-Based Computing, pp. 331–339, 2011.
[9] ——, "Introducing data provenance and error handling for NGS workflows within the MOLGENIS computational framework," in Proceedings of the BIOSTEC BIOINFORMATICS-2012 conference, pp. 42–50, 2012.
[10] Adaptive Computing, "TORQUE resource manager," 2012, http://www.adaptivecomputing.com/products/open-source/torque/.
[11] T. Glatard, J. Montagnat, D. Lingrand, and X. Pennec, "Flexible and efficient workflow deployment of data-intensive applications on grids with MOTEUR," Int. J. High Perform. Comput. Appl., vol. 22, no. 3, pp. 347–360, 2008.
[12] University of Virginia, Center for Public Health Genomics, "TavernaPBS," 2012, http://cphg.virginia.edu/mackey/projects/sequencing-pipelines/tavernapbs/.
[13] T. Oinn and M. Greenwood, "Taverna: lessons in creating a workflow environment for the life sciences," Concurrency and Computation: Practice and Experience, vol. 18:10, pp. 1067–1100, 2005.
[14] E. Kawas and M. Wilkinson, "BioMoby extensions to the Taverna workflow management and enactment software," BMC Bioinformatics, vol. 7:523, 2006.
[15] D. Blankenberg and J. Taylor, "A framework for collaborative analysis of ENCODE data: making large-scale analyses biologist-friendly," Genome Res., vol. 17:6, pp. 960–4, 2007.
[16] P. Kacsuk, "P-GRADE portal family for grid infrastructures," Concurrency and Computation: Practice and Experience, vol. 23:3, pp. 235–245, 2012.
[17] J. L. Furlani and P. W. Osel, "Abstract yourself with modules," in Proceedings of the Tenth Large Installation Systems Administration Conference (LISA '96), pp. 193–204, 1996.
[18] BBMRI.NL, "The Dutch project for biobanks," 2012, http://www.bbmriwiki.nl/svn/ebiogrid/modules/GATK/1.0.5069/.
[19] A. Luckow, "SAGA BigJob: an extensible and interoperable pilot-job abstraction for distributed applications and systems," in Proceedings of the 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010.
[20] MOLGENIS team, "The MOLGENIS GitHub repository, pilot sources," 2012, http://github.com/molgenis/molgenis_apps-legacy/tree/testing/modules/compute/pilots/.
[21] BiG Grid, "The Dutch e-science grid," 2010, http://www.biggrid.nl.
[22] MOLGENIS team, "The MOLGENIS workflow repository," 2013, http://github.com/molgenis/molgenis-pipelines/tree/master/compute4.
[23] O. Delaneau and J. Marchini, "Improved whole-chromosome phasing for disease and population genetic studies," Nature Methods, vol. 10, pp. 5–6, 2013.
[24] B. Howie and J. Marchini, "A flexible and accurate genotype imputation method for the next generation of genome-wide association studies," PLoS Genet, vol. 5, p. e1000529, 2009.
[25] The Genome Analysis Toolkit, Broad Institute, 2011, http://www.broadinstitute.org/.
[26] H. Byelas and M. Swertz, "Visualization of bioinformatics workflows for ease of understanding and design activities," in Proceedings of the BIOSTEC BIOINFORMATICS-2013 conference, pp. 117–123, 2013.