=Paper=
{{Paper
|id=Vol-3379/phdworkshop_poster5
|storemode=property
|title=Adapting scientific workflows to changing infrastructures
|pdfUrl=https://ceur-ws.org/Vol-3379/PhDWorkshop_2023_deMecquenem-paper.pdf
|volume=Vol-3379
|authors=Ninon De Mecquenem
|dblpUrl=https://dblp.org/rec/conf/edbt/Mecquenem23
}}
==Adapting scientific workflows to changing infrastructures==
<pdf width="1500px">https://ceur-ws.org/Vol-3379/PhDWorkshop_2023_deMecquenem-paper.pdf</pdf>
<pre>
Adapting scientific workflows to changing infrastructures
Ninon De Mecquenem (1), supervised by Ulf Leser (1)
(1) Humboldt-Universität zu Berlin, Unter den Linden 6, Berlin, DE 10099


                                             Abstract
                                             Scientific workflows are increasingly popular for large-scale data analyses as they promise better documentation, increased
                                             reproducibility, and easier scalability of complex analysis pipelines. However, reproducibility is severely reduced when a
                                             given workflow is optimized for a specific infrastructure, as it would require other scientists to access the same computing
                                             environment. Hence, it is important to develop techniques that automatically adapt a given workflow to changes in the
                                             underlying infrastructure or characteristics of the analyzed data, for instance, by using different data partitions or different
                                             tools for individual steps of the analysis. Automatic workflow adaptation requires a cost model setting properties of different
                                             tools, data set sizes, and characteristics of the given infrastructure into perspective. As a first step in this direction, we here
                                             study in detail the performance of an important analysis in genomics, namely RNAseq, in different settings. We experimentally
                                             measured the runtime of different RNAseq workflows implemented in Nextflow on different infrastructures (stand-alone or
                                             distributed), composed of different tool chains, using different data set sizes. As different tools also lead to (slightly) different
                                             outputs, we additionally compared the output of different workflow variants. We show that workflow variants designed for a
                                             given infrastructure perform much worse in other settings and that rewritings sometimes keep and sometimes change the
                                             output, even when tools are only replaced by others with the same purpose. We see these experiments as an important first
                                             step toward automatically adapting workflows to different infrastructures.

                                             Keywords
                                             Data Analysis Workflows, Bioinformatics, Distributed infrastructures, Portability


Published in the Workshop Proceedings of the EDBT/ICDT 2023 Joint Conference (March 28-March 31, 2023,
Ioannina, Greece)
Email: mecquenn@informatik.hu-berlin.de (N. De Mecquenem); leser@informatik.hu-berlin.de (U. Leser)
ORCID: 0000-0003-3052-6129 (N. De Mecquenem); 0000-0003-2166-9582 (U. Leser)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
Ninon De Mecquenem et al. CEUR Workshop Proceedings 1–4


1. Introduction

Data Analysis Workflows (DAWs) are used to solve a specific data analysis problem using a chain of
tools connected by input/output dependencies. In bioinformatics, the usage of DAWs is critical to
perform reproducible analyses [1]. However, porting DAWs to different infrastructures or using them
for different input data sizes can cause severe problems. For example, if the new infrastructure has
fewer resources, the workflow can crash due to insufficient memory or time out because computations
take longer than anticipated at workflow design time. On the other hand, even with more resources,
scaling problems can affect the runtime of the analysis [2, 3, 4]. In our research, we hypothesize
that knowledge of the infrastructure, the input, the DAW itself, and the particular tools it is made
of can be used to automatically adapt a given DAW such that it performs gracefully also in a new
environment. The rewriting of chains of interdependent commands has a long tradition, especially in
the database [5] and the big data world [6]. However, rewriting DAWs for scientific data analysis
differs from these settings in two regards. First, DAWs are typically designed and executed in a
black box model both for the data and the operations - the executing infrastructure cannot make any
assumptions regarding the functionality of the operations nor the format of the data [7]. Second,
DAWs for complex scientific data analysis consist of many steps that are heuristics, which means
that the "correct" result of an analysis actually is not known and that different DAWs for the same
purpose on the same data might produce diverging results [8]. Therefore, DAW adaptation may consider
a wide range of valid primitive operations, such as: the replacement of a tool of the DAW by another
one with the same purpose (1), the change of tool/DAW parameters (2), the modification of the DAW
structure (3), or the adjustment of the sizes of data partitions (4). Before implementing such
functionalities, it is crucial to understand the impact of a given adaptation for a given setting on
the workflow runtime and the output.

In this paper, we study this problem for DAWs performing an RNA sequencing (RNAseq) analysis. RNAseq
is particularly interesting as it is a widespread type of analysis used to understand gene
expression and regulation under certain conditions or diseases such as cancer. RNAseq DAWs take a
large set of short strings as input, which are sequenced fractions of mRNA, the transient molecules
generated during gene expression as an intermediate step to protein sequences. DAWs next map each
string to a reference genome to then cluster sets of strings stemming from the same transcript.
Real-life DAWs also include further steps, such as data pre-processing, quality filtering, or
computation of different quality metrics. Each task of these DAWs can be performed by several tools
that serve the same purpose, but use different heuristics leading to different resource requirements
and results. A particularly complex step is the mapping, for which a plethora of possible tools
exist [8].

Here, we studied the behaviour of three RNAseq DAWs on two different infrastructures using two
different data sets. We created the DAWs' tool chains based on the tools' popularity and
compatibility with each other. Each DAW was implemented in two versions: one is designed for a
stand-alone server, and another one is designed for a distributed infrastructure. Adaptation
consists of splitting the load of resource-demanding tasks across several nodes of the cluster by
splitting the input files - which is not supported equally well by all tools. We ran the two
versions of these three workflows on two different infrastructures and measured the runtime and
output differences between the workflow versions depending on several parameters. We consider this
work as a base to better understand the impact of DAW rewritings. Ultimately, we aim at abstracting
these findings into a set of rules that lead to an automatic DAW rewriting according to a given
input and context specifications.


2. Related work

Several strategies have already been used to optimize DAW execution in a given context. For
instance, [9] surveys genomic workflows for the Map-Reduce infrastructure. Zaid Al-Ars et al.
created a version of the popular RNAseq GATK workflow adapted to Spark [10]. Yakneen et al. describe
a large-scale variant caller optimized for execution on a commercial cloud [11]. Roy et al. studied
the influence of different Hadoop parameters on a specific genomics DAW [12]. However, all these
works focused on optimization for a specific target infrastructure; in contrast, we research methods
that can adapt DAWs to any execution infrastructure. Another line of related research is concerned
with optimization of (relational) queries with user-defined functions, especially in big data
processing pipelines [6]. These, however, typically are built on the paradigm that (a) individual
operations (tasks in a workflow setting) have pre-defined semantics, (b) data follows a relational
model, and (c) all rewritings preserve exactly the results of a query. These assumptions do not hold
in the realm of DAWs for scientific analysis - data can have arbitrary formats, tasks are typically
exchanged as binaries without any guarantees, and operations are partly computationally so complex
that they can only be approached using heuristics, leading to different results for different
concrete physical implementations. Accordingly, we envision that DAW rewriting makes up for these
more complex settings by relying on knowledge provided by workflow designers that want to support
portability and re-usability of their workflows. Finally, there is some commonality to recent
studies in the field of AutoML [13]. The main differences are that in AutoML (1) pipelines are
linear and only exchanges of tasks are considered, and (2) although the results of different
workflow variants may also vary, a notion of "best" is typically defined (e.g., highest accuracy on
a test data set), which often is not the case in scientific data analysis.

Figure 1: Design of the experiments. RS1 was created with three tool-chains (Salmon, STAR, Hisat2).
A variation of these workflows (RS2) was created. We ran them on two infrastructures (Infra1 and
Infra2) and two datasets (D1 and D2). Their runtime was measured, and their output was compared.


3. Experiments

As described in Figure 1, we created two RNAseq workflows performing the same operations but
optimized for different infrastructures. In both, the central and most time-consuming task is the
alignment, which matches short stretches of genomic sequences to a reference genome. The alignment
is also fundamental in other bioinformatics workflows studying genomic sequences [14]. RS1 follows a
pipeline structure, which we assume is better suited to a stand-alone server. RS2 is optimized for a
distributed infrastructure, as it splits the input to allow for a distributed computation of the
alignment step across several nodes of a cluster. From each workflow, we furthermore created three
variants according to the specific tool used to compute alignments, i.e., STAR [15], Salmon [16],
and Hisat2 [17]. We implemented all DAWs using Nextflow [18], a workflow engine of increasing
popularity in the bioinformatics community. Nextflow workflows are implemented in a specific DSL
which allows for automatic parallelization and distributed execution of tasks. For local execution,
Nextflow uses its own execution engine; for a distributed setup, it can work together with
Kubernetes resource managers. The goal of this experiment is
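The input splitting that distinguishes RS2 from RS1 can be illustrated with a small sketch. The paper does not show how the split is implemented in the Nextflow workflows, so the Python code below is only an assumption-laden illustration: the function names (read_records, split_paired) and the chunk size are hypothetical, and the sketch merely shows the property any such split has to preserve for paired input, namely that both mate files are cut at the same record boundaries.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, ...]  # the four lines of one FASTQ record


def read_records(lines: Iterable[str]) -> Iterator[Record]:
    """Group a stream of FASTQ lines into 4-line records."""
    it = iter(lines)
    while True:
        rec = tuple(islice(it, 4))
        if not rec:
            return
        if len(rec) != 4:
            raise ValueError("truncated FASTQ record")
        yield rec


def split_paired(
    mate1: Iterable[str], mate2: Iterable[str], chunk_size: int
) -> Iterator[Tuple[List[Record], List[Record]]]:
    """Yield (mate-1 chunk, mate-2 chunk) pairs of at most chunk_size records.

    Both files are cut at the same record boundaries, so every chunk can be
    aligned independently on a different node (hypothetical helper).
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    it1, it2 = read_records(mate1), read_records(mate2)
    while True:
        chunk1 = list(islice(it1, chunk_size))
        chunk2 = list(islice(it2, chunk_size))
        if len(chunk1) != len(chunk2):
            raise ValueError("paired files have different record counts")
        if not chunk1:
            return
        yield chunk1, chunk2
```

Each yielded pair of chunks would correspond to one alignment task; the per-chunk results then need to be merged again, which is the step that not all tools support equally well.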




Data and Infrastructures

Two paired RNAseq input datasets of different sizes were considered. Both data sets were obtained by
sequencing the transcriptome of Drosophila melanogaster. The difference in size allows us to
understand better the impact of the input size on the decision to rewrite the workflow. Dataset 1
(D1) consists of two paired files of 13 GB each, and Dataset 2 (D2) consists of two files of 48 GB.

The DAWs were run on two different infrastructures: one stand-alone server and one cluster. The
stand-alone server (Infra1) consists of 32 Intel Xeon CPU E5-2667 v2 Octa Core, with a memory of
387 GB and a SATA SSD 1.9 TiB RAID 5. The cluster (Infra2) in our experiments consists of 10
homogeneous nodes, each with a Quadcore Intel Xeon CPU E3-1230 V2 3.30GHz; Memory: 16 GB; Disks:
3x1TB, connected by a network of 2x 1GBit. The stand-alone server has considerably more resources
than the cluster. Therefore, we expect all runtimes to be shorter on Infra1.


4. Results

Runtime comparison

Table 1 shows the runtimes of the pipeline version (RS1) and the distributed version (RS2) of the
DAWs on the two infrastructures with both datasets. Note that each value displayed in this table was
obtained from a single run. We are currently generating duplicate runs to acquire more robust
values. However, as we had exclusive usage of the infrastructures during the measurements, we are
confident about our experiments not being perturbed by other computations. Furthermore, the
duplicated runs that were already computed are consistent with the results presented in the table.

    DAW      Dataset   Infrastructure   RS1        RS2
    STAR     D1        Infra1             58 m      48 m
    STAR     D1        Infra2            364 m     134 m
    STAR     D2        Infra1            401 m     321 m
    STAR     D2        Infra2           2383 m     720 m
    Hisat2   D1        Infra1             60 m      47 m
    Hisat2   D1        Infra2            175 m      85 m
    Hisat2   D2        Infra1            569 m     232 m
    Hisat2   D2        Infra2            935 m     437 m
    Salmon   D1        Infra1              8 m      15 m
    Salmon   D1        Infra2             68 m      51 m
    Salmon   D2        Infra1             32 m      51 m
    Salmon   D2        Infra2            186 m     270 m

Table 1
Runtimes of the three RNAseq DAWs, compared across DAW versions and datasets for both
infrastructures. The DAWs are named by the aligner used in their tool-chain. In bold, we highlight
large reductions in runtimes from RS1 to RS2.

We observe notable runtime differences that show tendencies but not an entirely consistent picture.
In general, RS1 and RS2 show similar runtimes on the stand-alone server Infra1, with the notable
exception of Hisat2 on the large dataset D2. For this case, time reduction is almost 50%, while
runtimes for the smaller dataset D1 are very similar. We attribute this behaviour to the low
resource usage of Hisat2. In a non-distributed setting, Nextflow parallelizes the tasks over the
different CPUs available, which makes the runtime on Infra1 overall smaller. Almost no difference is
observed for STAR on Infra1 as it requires a lot of RAM to run over a single chunk of the input
data. On Infra2, runtimes differ considerably. In almost all cases, RS2 (designed for distributed
computation) achieves much lower runtimes than RS1, with reductions up to 66%. Again, there is one
exception: Salmon actually takes longer with RS2 than with RS1. This runtime difference is due to
the task splitting the input files.

Interestingly, the DAWs require very different runtimes depending on which tool was used for the
alignment step. On Infra1 with D2, Salmon is approx. 20 times faster than Hisat2 on RS1 and five
times faster on RS2. In almost all cases, workflows profited from the additional resources of the
distributed Infra2 when switching from RS1 to RS2, but to varying degrees.

However, recall that our intention is not to find the best DAW for a given infrastructure, but to
develop algorithms that can rewrite a given DAW developed for a setting A to adapt it to a new
setting B - which might simply have a slow network, such as Infra2. For instance, imagine a
researcher developed RS1 on Infra2 using Hisat2 for data sets of the size of D1. Now, she wants to
run it on larger datasets yet avoid the 5-fold increase in runtime. An adaptation an optimizer could
propose is to rewrite the workflow into RS2, which would only lead to a 2-fold runtime. Or imagine
another user who wants to reuse this workflow, but is forced to use STAR as aligner because it is
the lab-internal standard. Runtime would be doubled, or even increased by a factor of 13 when also
switching to larger files. An optimizer could recognize that switching to RS2 would decrease the
expected increase by 65%.

Quality comparison

In scientific data analysis, different DAWs for the same problem often lead to (slightly) different
results due to the usage of different heuristics for solving complex sub-problems. Sometimes,
avoiding such changes can be mandatory, for instance, when a certain analysis method is defined as
an organizational standard. However, often such changes are acceptable, for instance, in the early
phases of a data analysis project in which different trade-offs are explored, such as runtime,
result quality, analysis cost etc. In any case, users need to be informed about the expected degree
of changes a DAW rewriting would incur. To this end, we compared the results of the different DAW
versions to understand how much the DAG structure modification impacts analysis results. We measured
the similarity between the results of the RS1 and RS2 versions of each DAW in Table 2. Clearly, the
DAWs using Hisat2 and STAR are very robust to this rewriting, while the one using Salmon produces
largely different results.

                 STAR     Salmon     Hisat2
    Dataset 1    100 %    82.06 %    97.39 %
    Dataset 2    100 %    71.71 %    99.65 %

Table 2
Percentage of similarity of transcript counts between the pipeline version (RS1) and the
scatter/gather version (RS2) of each DAW.


5. Conclusion and future work

We presented the results of an initial study on the impact of DAW adaptations to different
infrastructures, considering both the replacement of central tools as well as changing the workflow
structure. The main purpose is to show that such rewritings impact performance considerably, that
certain variants are more suitable for certain infrastructures, and that suitability also depends on
the input size. Relationships are overall complex and certainly will vary with different analysis
problems, different DAWs for solving them, and different infrastructures. We are consolidating these
results with experiment replicates and more workflows and dataset sizes. In future work, we will
focus on languages to provide descriptions of core aspects of infrastructures, methods to derive
properties of tools on different infrastructures, annotation schemes to describe the equivalence of
tools in genomics, and a cost model as a basis for a rule-based DAW adaptation algorithm that takes
these properties into account. We will then develop an automatic DAW rewriter that implements this
algorithm.


Acknowledgements

Funded by the Deutsche Forschungsgemeinschaft – Project-ID 414984028 – SFB 1404 FONDA


References

 [1] L. Wratten, A. Wilm, J. Göke, Reproducible, scalable, and shareable analysis pipelines with
     bioinformatics workflow managers, Nat Methods (2021).
 [2] C. Schiefer, M. Bux, J. Brandt, et al., Portability of Scientific Workflows in NGS Data
     Analysis: A Case Study, arXiv:2006.03104 (2020).
 [3] F. Lehmann, D. Frantz, S. Becker, et al., FORCE on Nextflow: Scalable Analysis of Earth
     Observation data on Commodity Clusters, CIKM Workshops (2021).
 [4] M. Hanussek, F. Bartusch, J. Krüger, Performance and scaling behavior of bioinformatics
     applications in virtualization environments to create awareness for the efficient use of
     compute resources, PLOS Computational Biology 17(7): e1009244 (2021).
 [5] H. Garcia-Molina, J. D. Ullman, J. Widom, Database systems - the complete book, Pearson (2009).
 [6] A. Rheinländer, A. Heise, F. Hueske, et al., Sofa: An extensible logical optimizer for
     UDF-heavy data flows, Information Systems 52 (2015) 96–125.
 [7] R. Mork, P. Martin, et al., Contemporary challenges for data-intensive scientific workflow
     management systems, Workshop on Workflows in Support of Large-Scale Science (2015).
 [8] A. Schaarschmidt, A. Fischer, E. Zuther, et al., Evaluation of seven different RNA-seq
     alignment tools based on experimental data from the model plant Arabidopsis thaliana,
     Int J Mol Sci (2020).
 [9] Z. Quan, L. Xu-Bin, J. Wen-Rui, et al., Survey of MapReduce frame operation in bioinformatics,
     Briefings in Bioinformatics 15 (2013) 637–647.
[10] Z. Al-Ars, S. Wang, H. Mushtaq, SparkRA: Enabling big data scalability for the GATK RNA-seq
     pipeline with Apache Spark, Genes 11 (2020).
[11] S. Yakneen, S. Waszak, M. Gertz, et al., Butler enables rapid cloud-based analysis of thousands
     of human genomes, Nat Biotechnol (2020).
[12] A. Roy, Y. Diao, U. Evani, et al., Massively parallel processing of whole genome sequence data:
     An in-depth performance study, SIGMOD (2017).
[13] X. He, K. Zhao, X. Chu, AutoML: A survey of the state-of-the-art, Knowledge-Based Systems 12
     (2021).
[14] R. Musich, L. Cadle-Davidson, M. Osier, Comparison of short-read sequence aligners indicates
     strengths and weaknesses for biologists to consider, Front Plant Sci. (2021).
[15] A. Dobin, C. A. Davis, F. Schlesinger, et al., STAR: ultrafast universal RNA-seq aligner,
     Bioinformatics 29 (2012) 15–21.
[16] R. Patro, G. Duggal, M. Love, Salmon provides fast and bias-aware quantification of transcript
     expression, Nat Methods (2017).
[17] D. Kim, J. Paggi, C. Park, Graph-based genome alignment and genotyping with HISAT2 and
     HISAT-genotype, Nat Biotechnol (2019).
[18] P. Di Tommaso, M. Chatzou, E. Floden, Nextflow enables reproducible computational workflows,
     Nat Biotechnol (2017).
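As a supplementary illustration of the runtime comparison in Section 4, the relative changes discussed there can be recomputed directly from the values in Table 1. The Python sketch below is not part of the original study: the helper names (relative_reduction, faster_variant) are hypothetical, and the unit "m" is read as minutes, which is an assumption (the ratios do not depend on the unit).

```python
# Runtimes copied from Table 1, assumed to be minutes:
# (DAW, dataset, infrastructure) -> (RS1, RS2)
RUNTIMES = {
    ("STAR", "D1", "Infra1"): (58, 48),
    ("STAR", "D1", "Infra2"): (364, 134),
    ("STAR", "D2", "Infra1"): (401, 321),
    ("STAR", "D2", "Infra2"): (2383, 720),
    ("Hisat2", "D1", "Infra1"): (60, 47),
    ("Hisat2", "D1", "Infra2"): (175, 85),
    ("Hisat2", "D2", "Infra1"): (569, 232),
    ("Hisat2", "D2", "Infra2"): (935, 437),
    ("Salmon", "D1", "Infra1"): (8, 15),
    ("Salmon", "D1", "Infra2"): (68, 51),
    ("Salmon", "D2", "Infra1"): (32, 51),
    ("Salmon", "D2", "Infra2"): (186, 270),
}


def relative_reduction(rs1: float, rs2: float) -> float:
    """Fraction of the RS1 runtime saved by RS2 (negative if RS2 is slower)."""
    return (rs1 - rs2) / rs1


def faster_variant(daw: str, dataset: str, infra: str) -> str:
    """Name of the variant with the lower measured runtime in one setting."""
    rs1, rs2 = RUNTIMES[(daw, dataset, infra)]
    return "RS1" if rs1 <= rs2 else "RS2"


if __name__ == "__main__":
    for (daw, dataset, infra), (rs1, rs2) in RUNTIMES.items():
        change = relative_reduction(rs1, rs2)
        print(f"{daw:7s} {dataset} {infra}: RS1 -> RS2 changes runtime by {-change:+.0%}")
```

A lookup such as faster_variant is of course only valid for the measured settings; the cost model envisioned in Section 5 would have to predict such values for settings that were never run.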

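The output comparison summarized in Table 2 can be sketched in a similar way. The paper does not state how the "percentage of similarity of transcript counts" is computed, so the definition below is purely an assumption for illustration: the share of transcripts whose counts agree between the two workflow versions, with transcripts missing from one result treated as count 0. The function name count_similarity and the tolerance parameter are hypothetical.

```python
from typing import Dict


def count_similarity(
    counts_a: Dict[str, float], counts_b: Dict[str, float], tol: float = 0.0
) -> float:
    """Percentage of transcripts whose counts agree within tol in both results.

    Assumed metric, not taken from the paper. Transcripts that appear in only
    one result are compared against a count of 0.
    """
    transcript_ids = set(counts_a) | set(counts_b)
    if not transcript_ids:
        return 100.0
    agreeing = sum(
        1
        for tid in transcript_ids
        if abs(counts_a.get(tid, 0.0) - counts_b.get(tid, 0.0)) <= tol
    )
    return 100.0 * agreeing / len(transcript_ids)
```

With a tolerance of zero this is a strict measure; a rewriting optimizer might instead expose the tolerance to the user, since small deviations can be acceptable in early, exploratory phases of an analysis.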


</pre>