=Paper=
{{Paper
|id=Vol-3379/phdworkshop_poster5
|storemode=property
|title=Adapting scientific workflows to changing infrastructures
|pdfUrl=https://ceur-ws.org/Vol-3379/PhDWorkshop_2023_deMecquenem-paper.pdf
|volume=Vol-3379
|authors=Ninon De Mecquenem
|dblpUrl=https://dblp.org/rec/conf/edbt/Mecquenem23
}}
==Adapting scientific workflows to changing infrastructures==
Ninon De Mecquenem¹, supervised by Ulf Leser¹
¹ Humboldt-Universität zu Berlin, Unter den Linden 6, Berlin, DE 10099
Abstract
Scientific workflows are increasingly popular for large-scale data analyses as they promise better documentation, increased
reproducibility, and easier scalability of complex analysis pipelines. However, reproducibility is severely reduced when a
given workflow is optimized for a specific infrastructure, as it would require other scientists to access the same computing
environment. Hence, it is important to develop techniques that automatically adapt a given workflow to changes in the
underlying infrastructure or characteristics of the analyzed data, for instance, by using different data partitions or different
tools for individual steps of the analysis. Automatic workflow adaptation requires a cost model that relates the properties of different
tools, data set sizes, and characteristics of the given infrastructure to one another. As a first step in this direction, we
study in detail the performance of an important analysis in genomics, namely RNAseq, in different settings. We experimentally
measured the runtime of different RNAseq workflows implemented in Nextflow on different infrastructures (stand-alone or
distributed), composed of different tool chains, using different data set sizes. As different tools also lead to (slightly) different
outputs, we additionally compared the output of different workflow variants. We show that workflow variants designed for a
given infrastructure perform much worse in other settings and that rewritings sometimes keep and sometimes change the
output, even when tools are only replaced by others with the same purpose. We see these experiments as an important first
step toward automatically adapting workflows to different infrastructures.
Keywords
Data Analysis Workflows, Bioinformatics, Distributed infrastructures, Portability
Published in the Workshop Proceedings of the EDBT/ICDT 2023 Joint Conference (March 28-March 31, 2023, Ioannina, Greece)
Email: mecquenn@informatik.hu-berlin.de (N. D. Mecquenem); leser@informatik.hu-berlin.de (U. Leser)
ORCID: 0000-0003-3052-6129 (N. D. Mecquenem); 0000-0003-2166-9582 (U. Leser)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
Ninon De Mecquenem et al., CEUR Workshop Proceedings, pp. 1–4

1. Introduction

Data Analysis Workflows (DAWs) are used to solve a specific data analysis problem using a chain of tools connected by input/output dependencies. In bioinformatics, the usage of DAWs is critical to perform reproducible analyses [1]. However, porting DAWs to different infrastructures or using them for different input data sizes can cause severe problems. For example, if the new infrastructure has fewer resources, the workflow can crash due to insufficient memory or timeouts, as computations take longer than anticipated at workflow design time. Conversely, even with more resources, scaling problems can affect the runtime of the analysis [2, 3, 4]. In our research, we hypothesize that knowledge of the infrastructure, the input, the DAW itself, and the particular tools it is made of can be used to automatically adapt a given DAW such that it also performs gracefully in a new environment.

The rewriting of chains of interdependent commands has a long tradition, especially in the database [5] and the big data world [6]. However, rewriting DAWs for scientific data analysis differs from these settings in two regards. First, DAWs are typically designed and executed in a black box model both for the data and the operations: the executing infrastructure cannot make any assumptions regarding the functionality of the operations or the format of the data [7]. Second, DAWs for complex scientific data analysis consist of many steps that are heuristics, which means that the "correct" result of an analysis is actually not known and that different DAWs for the same purpose on the same data might produce diverging results [8]. Therefore, DAW adaptation may consider a wide range of valid primitive operations, such as: the replacement of a tool of the DAW by another one with the same purpose (1), the change of tool/DAW parameters (2), the modification of the DAW structure (3), or the adjustment of the sizes of data partitions (4). Before implementing such functionalities, it is crucial to understand the impact of a given adaptation for a given setting on the workflow runtime and the output.

In this paper, we study this problem for DAWs performing an RNA sequencing (RNAseq) analysis. RNAseq is particularly interesting as it is a widespread type of analysis used to understand gene expression and regulation under certain conditions or diseases such as cancer. RNAseq DAWs take a large set of short strings as input, which are sequenced fractions of mRNA, the transient molecules generated during gene expression as an intermediate step to protein sequences. DAWs next map each string to a reference genome and then cluster sets of strings stemming from the same transcript. Real-life DAWs also include further steps, such as data pre-processing, quality filtering, or computation of different quality metrics. Each task of these DAWs can be performed by several tools that serve the same purpose, but use different heuristics
leading to different resource requirements and results. A particularly complex step is the mapping, for which a plethora of possible tools exist [8].

Here, we studied the behaviour of three RNAseq DAWs on two different infrastructures using two different data sets. We created the DAWs' tool-chains based on the tools' popularity and compatibility with each other. Each DAW was implemented in two versions: one is designed for a stand-alone server, and another one is designed for a distributed infrastructure. Adaptation consists of splitting the load of resource-demanding tasks across several nodes of the cluster by splitting the input files, which is not supported equally well by all tools. We ran the two versions of these three workflows on two different infrastructures and measured the runtime and output differences between the workflow versions depending on several parameters. We consider this work as a basis to better understand the impact of DAW rewritings. Ultimately, we aim at abstracting these findings into a set of rules that lead to an automatic DAW rewriting according to a given input and context specifications.

2. Related work

Several strategies have already been used to optimize DAW execution in a given context. For instance, [9] surveys genomic workflows for the Map-Reduce infrastructure. Al-Ars et al. created a version of the popular RNAseq GATK workflow adapted to Spark [10]. Yakneen et al. describe a large-scale variant caller optimized for execution on a commercial cloud [11]. Roy et al. studied the influence of different Hadoop parameters on a specific genomics DAW [12]. However, all these works focused on optimization for a specific target infrastructure; in contrast, we research methods that can adapt DAWs to any execution infrastructure.

Another line of related research is concerned with optimization of (relational) queries with user-defined functions, especially in big data processing pipelines [6]. These, however, typically are built on the paradigm that (a) individual operations (tasks in a workflow setting) have pre-defined semantics, (b) data follows a relational model, and (c) all rewritings preserve exactly the results of a query. These assumptions do not hold in the realm of DAWs for scientific analysis: data can have arbitrary formats, tasks are typically exchanged as binaries without any guarantees, and some operations are computationally so complex that they can only be approached using heuristics, leading to different results for different concrete physical implementations. Accordingly, we envision that DAW rewriting makes up for these more complex settings by relying on knowledge provided by workflow designers that want to support portability and re-usability of their workflows.

Finally, there is some commonality to recent studies in the field of AutoML [13]. The main differences are that in AutoML (1) pipelines are linear and only exchanges of tasks are considered and (2) the results of different workflow variants may also vary, but typically a notion of "best" is defined (e.g. highest accuracy on a test data set), which often is not the case in scientific data analysis.

Figure 1: Design of the experiments. RS1 was created with three tool-chains (Salmon, STAR, Hisat2). A variation of these workflows (RS2) was created. We ran them on two infrastructures (Infra1 and Infra2) and two datasets (D1 and D2). Their runtime was measured, and their output was compared.

3. Experiments

As described in Figure 1, we created two RNAseq workflows performing the same operations but optimized for different infrastructures. Both have as central and most time-consuming task the alignment, which matches short stretches of genomic sequences to a reference genome. The alignment is also fundamental in other bioinformatics workflows studying genomic sequences [14]. RS1 follows a pipeline structure, which we assume is a better fit for a stand-alone server. RS2 is optimized for a distributed infrastructure, as it splits the input to allow for a distributed computation of the alignment step across several nodes of a cluster. From each workflow, we furthermore created three variants according to the specific tool used to compute alignments, i.e., STAR [15], Salmon [16], and Hisat2 [17]. We implemented all DAWs using Nextflow [18], a workflow engine of increasing popularity in the bioinformatics community. Nextflow workflows are implemented in a specific DSL which allows for automatic parallelization and distributed execution of tasks. For local execution, Nextflow uses its own execution engine; for a distributed setup, it can work together with Kubernetes resource managers. The goal of this experiment is
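
The structural difference between the pipeline version (RS1) and the input-splitting version (RS2) can be illustrated with a minimal Python sketch. It is not the Nextflow implementation used in the study: `toy_align` is a hypothetical stand-in for an aligner that simply checks whether a read occurs verbatim in a transcript, and the chunk size, worker count, and all function names are assumptions made for illustration only.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunk_reads(reads, chunk_size):
    """Scatter step: yield successive lists of at most chunk_size reads."""
    it = iter(reads)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def toy_align(chunk, transcripts):
    """Hypothetical aligner stand-in: assign a read to the first transcript containing it."""
    counts = Counter()
    for read in chunk:
        for name, seq in transcripts.items():
            if read in seq:
                counts[name] += 1
                break
    return counts


def run_pipeline(reads, transcripts):
    """RS1-like variant: one alignment task over the whole input."""
    return toy_align(list(reads), transcripts)


def run_scatter_gather(reads, transcripts, chunk_size=2, workers=4):
    """RS2-like variant: split the input, align chunks in parallel, merge the counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = pool.map(
            lambda chunk: toy_align(chunk, transcripts),
            chunk_reads(reads, chunk_size),
        )
        # Gather step: sum the per-chunk transcript counts.
        for counts in partial_counts:
            total.update(counts)
    return total


if __name__ == "__main__":
    transcripts = {"t1": "ACGTACGT", "t2": "TTTTGGGG"}
    reads = ["ACGT", "TTTT", "GGGG", "CGTA", "AAAA"]
    print(dict(run_pipeline(reads, transcripts)))
    print(dict(run_scatter_gather(reads, transcripts)))
```

Because the toy aligner treats every read independently, both variants return identical counts. Real tools do not necessarily behave this way, which is why the output of the two workflow versions is compared in addition to their runtime.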
Data and Infrastructures

Two paired-end RNAseq input datasets of different sizes were considered. Both data sets were obtained by sequencing the transcriptome of Drosophila melanogaster. The difference in size allows us to better understand the impact of the input size on the decision to rewrite the workflow. Dataset 1 (D1) consists of two paired files of 13 GB each, and Dataset 2 (D2) consists of two files of 48 GB.

The DAWs were run on two different infrastructures: one stand-alone server and one cluster. The stand-alone server (Infra1) consists of 32 Intel Xeon CPU E5-2667 v2 Octa Core, with a memory of 387 GB and a SATA SSD 1.9 TiB Raid 5. The cluster (Infra2) in our experiments consists of 10 homogeneous nodes, each with a Quadcore Intel Xeon CPU E3-1230 V2 3.30 GHz; Memory: 16 GB; Disks: 3x 1 TB, connected by a network of 2x 1 GBit. The stand-alone server has far more resources than the cluster. Therefore, we expect all runtimes to be shorter on Infra1.

DAW     Dataset  Infrastructure  RS1      RS2
STAR    D1       Infra1            58 m    48 m
STAR    D1       Infra2           364 m   134 m
STAR    D2       Infra1           401 m   321 m
STAR    D2       Infra2          2383 m   720 m
Hisat2  D1       Infra1            60 m    47 m
Hisat2  D1       Infra2           175 m    85 m
Hisat2  D2       Infra1           569 m   232 m
Hisat2  D2       Infra2           935 m   437 m
Salmon  D1       Infra1             8 m    15 m
Salmon  D1       Infra2            68 m    51 m
Salmon  D2       Infra1            32 m    51 m
Salmon  D2       Infra2           186 m   270 m

Table 1: Runtimes (in minutes) of the three RNAseq DAWs, compared across DAW versions and datasets for both infrastructures. The DAWs are named by the aligner used in their tool-chain. In bold, we highlight large reductions in runtime from RS1 to RS2.
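
The relative runtime changes between RS1 and RS2 that the Results section discusses can be recomputed directly from the Table 1 values. The following is a minimal Python sketch of that arithmetic; the names `RUNTIMES_MIN`, `relative_change`, and `summarize` are illustrative and not part of the original study, and the values are single-run measurements as reported in the table.

```python
# Runtimes from Table 1 in minutes: (DAW, dataset, infrastructure) -> (RS1, RS2)
RUNTIMES_MIN = {
    ("STAR", "D1", "Infra1"): (58, 48),
    ("STAR", "D1", "Infra2"): (364, 134),
    ("STAR", "D2", "Infra1"): (401, 321),
    ("STAR", "D2", "Infra2"): (2383, 720),
    ("Hisat2", "D1", "Infra1"): (60, 47),
    ("Hisat2", "D1", "Infra2"): (175, 85),
    ("Hisat2", "D2", "Infra1"): (569, 232),
    ("Hisat2", "D2", "Infra2"): (935, 437),
    ("Salmon", "D1", "Infra1"): (8, 15),
    ("Salmon", "D1", "Infra2"): (68, 51),
    ("Salmon", "D2", "Infra1"): (32, 51),
    ("Salmon", "D2", "Infra2"): (186, 270),
}


def relative_change(rs1, rs2):
    """Relative runtime change when moving from RS1 to RS2 (negative means faster)."""
    return (rs2 - rs1) / rs1


def summarize(runtimes):
    """Return one row per setting with the RS1-to-RS2 change in percent."""
    rows = []
    for (daw, dataset, infra), (rs1, rs2) in sorted(runtimes.items()):
        pct = round(100 * relative_change(rs1, rs2), 1)
        rows.append((daw, dataset, infra, rs1, rs2, pct))
    return rows


if __name__ == "__main__":
    for daw, dataset, infra, rs1, rs2, pct in summarize(RUNTIMES_MIN):
        print(f"{daw:7s} {dataset} {infra}: {rs1:5d} m -> {rs2:4d} m ({pct:+.1f} %)")
```

A negative percentage means that the rewriting to RS2 shortened the runtime in that setting; a positive one means that the rewriting made the workflow slower.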
4. Results

Runtime comparison

Table 1 shows the runtimes of the pipeline version (RS1) and the distributed version (RS2) of the DAWs on the two infrastructures with both datasets. Note that each value displayed in this table was obtained from a single run. We are currently generating duplicate runs to acquire more robust values. However, as we had exclusive usage of the infrastructures during the measurements, we are confident that our experiments were not perturbed by other computations. Furthermore, the duplicated runs that were already computed are consistent with the results presented in the table.

We observe notable runtime differences that show tendencies but not an entirely consistent picture. In general, RS1 and RS2 show similar runtimes on the stand-alone server Infra1, with the notable exception of Hisat2 on the large dataset D2. For this case, the time reduction is almost 60% (569 m to 232 m), while runtimes for the smaller dataset D1 are very similar. We attribute this behaviour to the low resource usage of Hisat2. In a non-distributed setting, Nextflow parallelizes the tasks over the different CPUs available, which makes the runtime on Infra1 overall smaller. Almost no difference is observed for STAR on Infra1 as it requires a lot of RAM to run over a single chunk of the input data. On Infra2, runtimes differ considerably. In almost all cases, RS2 (designed for distributed computation) achieves much lower runtimes than RS1, with reductions of up to 70% (STAR on D2). Again, there is one exception: Salmon on D2 actually takes longer with RS2 than with RS1. This runtime difference is due to the task that splits the input files.

Interestingly, the DAWs require very different runtimes depending on which tool was used for the alignment step. On Infra1 with D2, Salmon is approx. 20 times faster than Hisat2 on RS1 and five times faster on RS2. In almost all cases, workflows profited from the additional resources in the distributed Infra2 when switching from RS1 to RS2, but to varying degrees.

However, recall that our intention is not to find the best DAW for a given infrastructure, but to develop algorithms that can rewrite a given DAW developed for a setting A to adapt it to a new setting B, which might simply have a slow network, such as Infra2. For instance, imagine a researcher developed RS1 on Infra2 using Hisat2 for data sets of the size of D1. Now, she wants to run it on larger datasets yet avoid the resulting 5-fold increase in runtime. An adaptation an optimizer could propose is to rewrite the workflow into RS2, which would only lead to a 2-fold runtime. Or imagine another user who wants to reuse this workflow, but is forced to use STAR as aligner because it is the lab-internal standard. Runtime would be doubled, or even increased by a factor of 13 when also switching to larger files. An optimizer could recognize that switching to RS2 would decrease the expected increase by 65%.

Quality comparison

In scientific data analysis, different DAWs for the same problem often lead to (slightly) different results due to the usage of different heuristics for solving complex sub-problems. Sometimes, avoiding such changes can be mandatory, for instance, when a certain analysis method is defined as an organizational standard. However, often such changes are acceptable, for instance, in the early phases of a data analysis project in which different trade-offs are explored, such as runtime, result quality, analysis cost etc. In any case, users need to be informed about the
expected degree of changes a DAW rewriting would incur. To this end, we compared the results of the different DAW versions to understand how much the DAG structure modification impacts analysis results. Table 2 reports the similarity between the results of the RS1 and RS2 versions of each DAW. Clearly, the DAWs using Hisat2 and STAR are very robust to this rewriting, while the one using Salmon produces substantially different results.

           STAR    Salmon   Hisat2
Dataset 1  100 %   82.06 %  97.39 %
Dataset 2  100 %   71.71 %  99.65 %

Table 2: Percentage of similarity of transcript counts between the pipeline version (RS1) and the scatter/gather version (RS2) of each DAW.

5. Conclusion and future work

We presented the results of an initial study on the impact of DAW adaptations to different infrastructures, considering both the replacement of central tools and changes to the workflow structure. The main purpose is to show that such rewritings impact performance considerably, that certain variants are more suitable for certain infrastructures, and that suitability also depends on the input size. Relationships are overall complex and certainly will vary with different analysis problems, different DAWs for solving them, and different infrastructures. We are consolidating these results with experiment replicates and more workflows and dataset sizes. In future work, we will focus on languages to provide descriptions of core aspects of infrastructures, methods to derive properties of tools on different infrastructures, annotation schemes to describe the equivalence of tools in genomics, and a cost model as a basis for a rule-based DAW adaptation algorithm that takes these properties into account. We will then develop an automatic DAW rewriting that implements this algorithm.

Acknowledgements

Funded by the Deutsche Forschungsgemeinschaft – Project-ID 414984028 – SFB 1404 FONDA

References

[1] L. Wratten, A. Wilm, J. Göke, Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers, Nat Methods (2021).
[2] C. Schiefer, M. Bux, J. Brandt, et al., Portability of Scientific Workflows in NGS Data Analysis: A Case Study, arXiv:2006.03104 (2020).
[3] F. Lehmann, D. Frantz, S. Becker, et al., FORCE on Nextflow: Scalable Analysis of Earth Observation data on Commodity Clusters, CIKM Workshops (2021).
[4] M. Hanussek, F. Bartusch, J. Krüger, Performance and scaling behavior of bioinformatics applications in virtualization environments to create awareness for the efficient use of compute resources, PLOS Computational Biology 17(7): e1009244 (2021).
[5] H. Garcia-Molina, J. D. Ullman, J. Widom, Database systems - the complete book, Pearson (2009).
[6] A. Rheinländer, A. Heise, F. Hueske, et al., SOFA: An extensible logical optimizer for UDF-heavy data flows, Information Systems 52 (2015) 96–125.
[7] R. Mork, P. Martin, et al., Contemporary challenges for data-intensive scientific workflow management systems, Workshop on Workflows in Support of Large-Scale Science (2015).
[8] A. Schaarschmidt, A. Fischer, E. Zuther, et al., Evaluation of seven different RNA-seq alignment tools based on experimental data from the model plant Arabidopsis thaliana, Int J Mol Sci (2020).
[9] Q. Zou, X.-B. Li, W.-R. Jiang, et al., Survey of MapReduce frame operation in bioinformatics, Briefings in Bioinformatics 15 (2013) 637–647.
[10] Z. Al-Ars, S. Wang, H. Mushtaq, SparkRA: Enabling big data scalability for the GATK RNA-seq pipeline with Apache Spark, Genes 11 (2020).
[11] S. Yakneen, S. Waszak, M. Gertz, et al., Butler enables rapid cloud-based analysis of thousands of human genomes, Nat Biotechnol (2020).
[12] A. Roy, Y. Diao, U. Evani, et al., Massively parallel processing of whole genome sequence data: An in-depth performance study, SIGMOD (2017).
[13] X. He, K. Zhao, X. Chu, AutoML: A survey of the state-of-the-art, Knowledge-Based Systems 212 (2021).
[14] R. Musich, L. Cadle-Davidson, M. Osier, Comparison of short-read sequence aligners indicates strengths and weaknesses for biologists to consider, Front Plant Sci. (2021).
[15] A. Dobin, C. A. Davis, F. Schlesinger, et al., STAR: ultrafast universal RNA-seq aligner, Bioinformatics 29 (2012) 15–21.
[16] R. Patro, G. Duggal, M. Love, Salmon provides fast and bias-aware quantification of transcript expression, Nat Methods (2017).
[17] D. Kim, J. Paggi, C. Park, Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype, Nat Biotechnol (2019).
[18] P. D. Tommaso, M. Chatzou, E. Floden, Nextflow enables reproducible computational workflows, Nat Biotechnol (2017).