
workflows to changing infrastructures

Ninon De Mecquenem


Supervised by Ulf Leser

leser@informatik.hu-berlin.de

Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, DE

Keywords: Data Analysis Workflows, Bioinformatics, Distributed infrastructures, Portability

Scientific workflows are increasingly popular for large-scale data analyses as they promise better documentation, increased reproducibility, and easier scalability of complex analysis pipelines. However, reproducibility is severely reduced when a given workflow is optimized for a specific infrastructure, as it would require other scientists to access the same computing environment. Hence, it is important to develop techniques that automatically adapt a given workflow to changes in the underlying infrastructure or characteristics of the analyzed data, for instance, by using different data partitions or different tools for individual steps of the analysis. Automatic workflow adaptation requires a cost model setting properties of different tools, data set sizes, and characteristics of the given infrastructure into perspective. As a first step in this direction, we here study in detail the performance of an important analysis in genomics, namely RNAseq, in different settings. We experimentally measured the runtime of different RNAseq workflows implemented in Nextflow on different infrastructures (stand-alone or distributed), composed of different tool chains, using different data set sizes. As different tools also lead to (slightly) different outputs, we additionally compared the output of different workflow variants. We show that workflow variants designed for a given infrastructure perform much worse in other settings and that rewritings sometimes keep and sometimes change the output, even when tools are only replaced by others with the same purpose. We see these experiments as an important first step toward automatically adapting workflows to different infrastructures.

1. Introduction

Data Analysis Workflows (DAWs) are used to solve a specific data analysis problem using a chain of tools connected by input/output dependencies. In bioinformatics, the usage of DAWs is critical to perform reproducible analyses [1]. However, porting DAWs to different infrastructures or using them for different input data sizes can [...] infrastructure has fewer resources, the workflow can crash due to insufficient memory or to timeouts, as computations take longer than anticipated at workflow design time. On [...] problems can affect the runtime of the analysis [2, 3, 4]. In our research, we hypothesize that knowledge of the infrastructure, the input, the DAW itself, and the particular tools it is made of can be used to automatically adapt a given DAW such that it also performs gracefully in a new environment.

The rewriting of chains of interdependent commands has a long tradition, especially in [...] rewriting DAWs for scientific data analysis differs from these settings in two regards. First, DAWs are typically designed and executed in a black box model both for the data and the operations - the executing infrastructure cannot make any assumptions regarding the functionality of the operations nor the format of the data [7]. Second, DAWs for complex scientific data analysis consist of many steps that are heuristics, which means that the “correct” result of an analysis actually is not known and that different DAWs for the same purpose on the same data might produce diverging results [8]. Therefore, DAW [...] operations, such as (1) the replacement of a tool of the DAW by another one with the same purpose, (2) the change of tool/DAW parameters, (3) the modification of the DAW [...], and (4) [...]tions. Before implementing such functionalities, it is crucial to understand the impact of a given adaptation for a given setting on the workflow runtime and the output.

© 2023 Copyright for this paper by its authors. Use permitted under Creative [...]. CEUR Workshop Proceedings (http://ceur-ws.org), ISSN 1613-0073.

In this paper, we study this problem for DAWs performing an RNA sequencing (RNAseq) analysis. RNAseq is particularly interesting as it is a widespread type of analysis used to understand gene expression and regulation under [...] DAWs take a large set of short strings as input, which are sequenced fractions of mRNA, the transient molecules generated during gene expression as an intermediate step to protein sequences. DAWs next map each string to a reference genome to then cluster sets of strings stemming from the same transcript. Real-life DAWs also include further steps, such as data pre-processing, quality filtering, or computation of different quality metrics. Each task of these DAWs can be performed by several tools that serve the same purpose, but use different heuristics leading to different resource requirements and results. A particularly complex step is the mapping, for which a plethora of possible tools exist [8].

Here, we studied the behaviour of three RNAseq DAWs on two different infrastructures using two different data sets. We created the DAW tool chains based on the tools' popularity and compatibility with each other. Each DAW was implemented in two versions: one designed for a stand-alone server and one designed for a distributed infrastructure. The adaptation consists of splitting the load of resource-demanding tasks across several nodes of the cluster by splitting the input files - which is not supported equally well by all tools. We ran the two versions of these three workflows on two different infrastructures and measured the runtime and output differences between the workflow versions depending on several parameters. We consider this work as a base to better understand the impact of DAW rewritings. Ultimately, we aim at abstracting these findings into a set of rules that lead to an automatic DAW rewriting according to a given input and context specifications.

2. Related work

[...] the field of AutoML [13]. The main differences are that in AutoML (1) pipelines are linear and only exchanges of tasks are considered and (2) the results of different workflow variants may also vary, but typically a notion of “best” is defined (e.g. highest accuracy on a test data set), which often is not the case in scientific data analysis.

As described in Figure 1, we created two RNAseq workflows, RS1 and RS2, which perform the same operations but are optimized for different infrastructures. In both, the central and most time-consuming task is the alignment, which matches short stretches of genomic sequences to a reference genome. The alignment is also fundamental in other bioinformatics workflows studying genomic sequences [14]. RS1 follows a pipeline structure, which we assume fits better to a stand-alone server. RS2 is optimized for a distributed infrastructure, as it splits the input to allow for a distributed computation of the alignment step across several nodes of a cluster. From each workflow, we furthermore created three variants according to the specific tool used to compute alignments, i.e., STAR [15], Salmon [16], and Hisat2 [17]. We implemented all DAWs using Nextflow [18], a workflow engine of increasing popularity in the bioinformatics community. Nextflow workflows are implemented in a specific DSL which allows for automatic parallelization and distributed execution of tasks. For local execution, Nextflow uses its own execution engine; for a distributed setup, it can work together with Kubernetes resource managers. The goal of this experiment is [...]

Data and Infrastructures

Two paired RNAseq input datasets of different sizes were considered. Both data sets were obtained by sequencing the transcriptome of Drosophila melanogaster. The difference in size allows us to better understand the impact of the input size on the decision to rewrite the workflow. Dataset 1 (D1) consists of two paired files of 13 GB each, and Dataset 2 (D2) consists of two files of 48 GB.
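The text does not detail how the distributed workflow version splits its paired input files. As a minimal sketch only, assuming uncompressed FASTQ input with four lines per read, the following hypothetical helper `split_fastq_pair` cuts both files at the same read boundaries so that mates stay in chunks with the same index. The function name, the chunk file names, and the `reads_per_chunk` parameter are illustrative assumptions, not part of the studied workflows.

```python
from itertools import islice
from pathlib import Path


def split_fastq_pair(r1_path, r2_path, out_dir, reads_per_chunk):
    """Split two paired FASTQ files into chunks with matching read ranges.

    Assumption: plain-text FASTQ with exactly four lines per read.
    Returns a list of (chunk_r1, chunk_r2) paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    lines_per_chunk = 4 * reads_per_chunk
    chunks = []
    with open(r1_path) as r1, open(r2_path) as r2:
        index = 0
        while True:
            # Read the same number of lines from both mate files.
            block1 = list(islice(r1, lines_per_chunk))
            block2 = list(islice(r2, lines_per_chunk))
            if not block1 and not block2:
                break
            if len(block1) != len(block2):
                raise ValueError("paired files differ in number of reads")
            # Hypothetical naming scheme for the chunk files.
            out1 = out_dir / f"chunk{index:04d}_R1.fastq"
            out2 = out_dir / f"chunk{index:04d}_R2.fastq"
            out1.write_text("".join(block1))
            out2.write_text("".join(block2))
            chunks.append((out1, out2))
            index += 1
    return chunks
```

Each returned pair could then be aligned as an independent task on a different node. Whether the merged per-chunk results match those of an unsplit run depends on the tool, which is the kind of effect the output comparison in this study looks at.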

The DAWs were run on two different infrastructures: one stand-alone server and one cluster. The stand-alone server (Infra1) consists of 32 Intel Xeon CPU E5-2667 v2 (octa-core), with 387 GB of memory and a 1.9 TiB SATA SSD RAID 5. The cluster (Infra2) consists of 10 homogeneous nodes, each with a quad-core Intel Xeon CPU E3-1230 V2 at 3.30 GHz, 16 GB of memory, and 3x 1 TB disks, connected by a 2x 1 GBit network. The stand-alone server has considerably more resources than the cluster. Therefore, we expect all runtimes to be shorter on Infra1.

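[Table 1: Runtimes of the RS1 and RS2 versions of the DAWs (STAR, Hisat2, Salmon) on both infrastructures for datasets D1 and D2. Only the column headers of this table are preserved.]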

4. Results

Runtime comparison

Table 1 shows the runtimes of the pipeline version (RS1) and the distributed version (RS2) of the DAWs on the two infrastructures with both datasets. Note that each value displayed in this table was obtained from a single run. We are currently generating duplicate runs to acquire more robust values. However, as we had exclusive usage of the infrastructures during the measurements, we are confident that our experiments were not perturbed by other computations. Furthermore, the duplicate runs that were already computed are consistent with the results presented in the table.

We observe notable runtime differences that show tendencies but not an entirely consistent picture. In general, RS1 and RS2 show similar runtimes on the stand-alone server Infra1, with the notable exception of Hisat2 on the large dataset D2. For this case, the time reduction is almost 50%, while runtimes for the smaller dataset D1 are very similar. We attribute this behaviour to the low resource usage of Hisat2. In a non-distributed setting, Nextflow parallelizes the tasks over the different CPUs available, which makes the runtime on Infra1 overall smaller. Almost no difference is observed for STAR on Infra1 as it requires a lot of RAM to run over a single chunk of the input data. On Infra2, runtimes differ considerably. In almost all cases, RS2 (designed for distributed computation) achieves much lower runtimes than RS1, with reductions of up to 66%. Again, there is one exception: Salmon actually takes longer with RS2 than with RS1. This runtime difference is due to the task splitting the input files.

Interestingly, the DAWs require very different runtimes depending on which tool was used for the alignment step. On Infra1 with D2, Salmon is approx. 20 times faster than Hisat2 with RS1 and five times faster with RS2. In almost all cases, workflows benefited from the additional resources available in the distributed Infra2 when switching from RS1 to RS2, but to varying degrees.

However, recall that our intention is not to find the best DAW for a given infrastructure, but to develop algorithms that can rewrite a given DAW developed for a setting A to adapt it to a new setting B - which might simply have a slow network, such as Infra2. For instance, imagine a researcher developed RS1 on Infra2 using Hisat2 for data sets of the size of D1. Now, she wants to run it on larger datasets yet avoid the 5-fold increase in runtime. An adaptation an optimizer could propose is to rewrite the workflow into RS2, which would only lead to a 2-fold increase in runtime. Or imagine another user who wants to reuse this workflow, but is forced to use STAR as aligner because it is the lab-internal standard. Runtime would be doubled, or even increased by a factor of 13 when also switching to larger files. An optimizer could recognize that switching to RS2 would decrease the expected increase by 65%.

Quality comparison

In scientific data analysis, different DAWs for the same problem often lead to (slightly) different results due to the usage of different heuristics for solving complex subproblems. Sometimes, avoiding such changes can be mandatory, for instance, when a certain analysis method is defined as an organizational standard. However, such changes are often acceptable, for instance, in the early phases of a data analysis project in which different tradeoffs are explored, such as runtime, result quality, analysis cost, etc. In any case, users need to be informed about the expected degree of changes a DAW rewriting would incur. To this end, we compared the results of the different DAW versions to understand how much the DAG structure modification impacts analysis results. We measured the similarity between the results of the RS1 and RS2 versions of each DAW in Table 2. Clearly, the DAWs using Hisat2 and STAR are very robust to this rewriting, while the one using Salmon produces largely different results.

Table 2: Similarity between the results of the RS1 and RS2 versions of each DAW. Only the STAR row of this table is preserved.

          Dataset 1   Dataset 2
STAR      100 %       100 %
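The text does not state how the similarity between the RS1 and RS2 results was computed. One possible measure, shown here purely as an assumption, is the percentage of genes for which both workflow versions report the same expression value, optionally within a relative tolerance. The function name `result_similarity`, the dictionary input format, and the gene names and counts in the usage lines are hypothetical.

```python
def result_similarity(counts_a, counts_b, tolerance=0.0):
    """Percentage of genes with matching values in two result tables.

    counts_a, counts_b: dicts mapping a gene identifier to a numeric value.
    A gene missing from one table is treated as having the value 0.
    tolerance: allowed relative deviation (0.0 means exact equality).
    """
    genes = set(counts_a) | set(counts_b)
    if not genes:
        return 100.0
    matching = 0
    for gene in genes:
        a = counts_a.get(gene, 0.0)
        b = counts_b.get(gene, 0.0)
        if abs(a - b) <= tolerance * max(abs(a), abs(b)):
            matching += 1
    return 100.0 * matching / len(genes)


# Hypothetical values, not taken from the experiments.
rs1_counts = {"geneA": 120, "geneB": 30, "geneC": 0}
rs2_counts = {"geneA": 120, "geneB": 33, "geneC": 0}
exact = result_similarity(rs1_counts, rs2_counts)
relaxed = result_similarity(rs1_counts, rs2_counts, tolerance=0.1)
```

With exact matching, two of the three hypothetical genes agree; with a 10% tolerance, all three do. Which tolerance is acceptable depends on the downstream analysis.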

5. Conclusion and future work

We presented the results of an initial study on the impact of DAW adaptations to different infrastructures, considering both the replacement of central tools and changes to the workflow structure. The main purpose is to show that such rewritings impact performance considerably, that certain variants are more suitable for certain infrastructures, and that suitability also depends on the input size. Relationships are overall complex and will certainly vary with different analysis problems, different DAWs for solving them, and different infrastructures. We are consolidating these results with experiment replicates and more workflows and dataset sizes. In future work, we will focus on languages to provide descriptions of core aspects of infrastructures, methods to derive properties of tools on different infrastructures, annotation schemes to describe the equivalence of tools in genomics, and a cost model as a basis for a rule-based DAW adaptation algorithm that takes these properties into account. We will then develop an automatic DAW rewriting that implements this algorithm.
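To make the idea of a cost-model-driven adaptation more concrete, the following sketch uses a plain lookup table of measured runtimes as a stand-in for a cost model and selects the fastest admissible workflow variant for a target setting. This is an illustrative assumption, not the algorithm planned by the authors. The function names and all runtime values are hypothetical placeholders and are not taken from Table 1.

```python
def relative_reduction(new_runtime, old_runtime):
    """Runtime reduction in percent when moving from old to new."""
    return 100.0 * (old_runtime - new_runtime) / old_runtime


def pick_variant(runtimes, infra, dataset, allowed_tools=None):
    """Return the (version, tool) pair with the lowest known runtime.

    runtimes: dict mapping (version, tool, infra, dataset) to a runtime.
    allowed_tools: optional set restricting the aligner, e.g. when a lab
    standard prescribes a specific tool.
    """
    candidates = {
        (version, tool): runtime
        for (version, tool, i, d), runtime in runtimes.items()
        if i == infra
        and d == dataset
        and (allowed_tools is None or tool in allowed_tools)
    }
    if not candidates:
        raise KeyError("no measurement for this setting")
    return min(candidates, key=candidates.get)


# Hypothetical runtimes in minutes, for illustration only.
measured = {
    ("RS1", "STAR", "Infra2", "D2"): 300.0,
    ("RS2", "STAR", "Infra2", "D2"): 120.0,
    ("RS1", "Salmon", "Infra2", "D2"): 40.0,
    ("RS2", "Salmon", "Infra2", "D2"): 55.0,
}
best_any = pick_variant(measured, "Infra2", "D2")
best_star = pick_variant(measured, "Infra2", "D2", allowed_tools={"STAR"})
```

A real cost model would have to predict runtimes for unmeasured settings and account for output changes. The lookup table only shows where such predictions would plug into a selection rule.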

Acknowledgements

Funded by the Deutsche Forschungsgemeinschaft – Project-ID 414984028 – SFB 1404 FONDA.

[1] Wratten, Wilm, Göke, Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers, Nat Methods (2021).

[2] Schiefer, Bux, Brandt, et al., Portability of Scientific Workflows in NGS Data Analysis: A Case Study, arXiv:2006.03104 (2020).

[3] Lehmann, Frantz, Becker, et al., FORCE on Nextflow: Scalable Analysis of Earth Observation data on Commodity Clusters, CIKM Workshops (2021).

[4] Hanussek, Bartusch, Krüger, Performance and scaling behavior of bioinformatics applications in virtualization environments to create awareness for the efficient use of compute resources, PLOS Computational Biology 17(7): e1009244 (2021).

[5] Garcia-Molina, J. D. Ullman, Widom, Database systems - the complete book, Pearson (2009).

[6] Rheinländer, Heise, Hueske, et al., Sofa: An extensible logical optimizer for udf-heavy data flows, Information Systems 52 (2015) 96-125.

[7] Mork, Martin, et al., Contemporary challenges for data-intensive scientific workflow management systems, Workshop on Workflows in Support of Large-Scale Science (2015).

[8] Schaarschmidt, Fischer, Zuther, et al., Evaluation of seven different rna-seq alignment tools based on experimental data from the model plant arabidopsis thaliana, Int J Mol Sci (2020).

[9] Quan, L. Xu-Bin, J. Wen-Rui, et al., Survey of MapReduce frame operation in bioinformatics, Briefings in Bioinformatics 15 (2013) 637-647.

[10] A.-A. Zaid, W. Saiyi, M. Hamid, Sparkra: Enabling big data scalability for the gatk rna-seq pipeline with apache spark, Genes 11 (2020).

[11] Yakneen, Waszak, Gertz, et al., Butler enables rapid cloud-based analysis of thousands of human genomes, Nat Biotechnol (2020).

[12] Roy, Diao, Evani, et al., Massively parallel processing of whole genome sequence data: An in-depth performance study, SIGMOD (2017).

[13] X. He, Zhao, X. Chu, Automl: A survey of the state-of-the-art, Knowledge-Based Systems 12 (2021).

[14] Musich, Cadle-Davidson, Osier, Comparison of short-read sequence aligners indicates strengths and weaknesses for biologists to consider, Front Plant Sci. (2021).

[15] Dobin, C. A. Davis, Schlesinger, et al., STAR: ultrafast universal RNA-seq aligner, Bioinformatics 29 (2012) 15-21.

[16] Patro, G. Duggal, Love, Salmon provides fast and bias-aware quantification of transcript expression, Nat Methods (2017).

[17] Kim, Paggi, Park, Graph-based genome alignment and genotyping with hisat2 and hisatgenotype, Nat Biotechnol (2019).

[18] P. D. Tommaso, Chatzou, E. Floden, Nextflow enables reproducible computational workflows, Nat Biotechnol (2017).