<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>workflows to changing infrastructures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ninon De Mecquenem</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Supervised by Ulf Leser</string-name>
          <email>leser@informatik.hu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data Analysis Workflows</institution>
          ,
          <addr-line>Bioinformatics, Distributed infrastructures, Portability</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Humboldt-Universität zu Berlin</institution>
          ,
          <addr-line>Unter den Linden 6, Berlin, DE 10099</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Workshop Proceedings</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Scientific workflows are increasingly popular for large-scale data analyses as they promise better documentation, increased reproducibility, and easier scalability of complex analysis pipelines. However, reproducibility is severely reduced when a given workflow is optimized for a specific infrastructure, as other scientists would need access to the same computing environment. Hence, it is important to develop techniques that automatically adapt a given workflow to changes in the underlying infrastructure or in the characteristics of the analyzed data, for instance, by using different data partitions or different tools for individual steps of the analysis. Automatic workflow adaptation requires a cost model that relates the properties of different tools, data set sizes, and characteristics of the given infrastructure. As a first step in this direction, we study in detail the performance of an important analysis in genomics, namely RNAseq, in different settings. We experimentally measured the runtime of different RNAseq workflows implemented in Nextflow on different infrastructures (stand-alone or distributed), composed of different tool chains, using different data set sizes. As different tools also lead to (slightly) different outputs, we additionally compared the output of different workflow variants. We show that workflow variants designed for a given infrastructure perform much worse in other settings and that rewritings sometimes keep and sometimes change the output, even when tools are only replaced by others with the same purpose. We see these experiments as an important first step toward automatically adapting workflows to different infrastructures.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Data Analysis Workflows (DAWs) are used to solve a
specific data analysis problem using a chain of tools
connected by input/output dependencies. In bioinformatics,
the usage of DAWs is critical to perform reproducible
analyses [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, porting DAWs to different
infrastructures or using them for different input data sizes can
[...] infrastructure has fewer resources, the workflow can crash
due to insufficient memory or time-outs, as computations
take longer than anticipated at workflow design time. On
[...] problems can affect the runtime of the analysis [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ]. In
our research, we hypothesize that knowledge of the
infrastructure, the input, the DAW itself, and the particular
tools it is made of can be used to automatically adapt
a given DAW such that it also performs gracefully in a
new environment. The rewriting of chains of
interdependent commands has a long tradition, especially in
[...] rewriting DAWs for scientific data analysis differs from
these settings in two regards.
      </p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative [...] CEUR Workshop Proceedings (http://ceur-ws.org), ISSN 1613-0073.</p>
      <p>
        First, DAWs are typically
designed and executed in a black box model both for the
data and the operations: the executing infrastructure
cannot make any assumptions regarding the functionality
of the operations nor the format of the data [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
Second, DAWs for complex scientific data analysis consist of
many steps that are heuristics, which means that the
“correct” result of an analysis actually is not known and that
different DAWs for the same purpose on the same data
might produce diverging results [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Therefore, DAW
[...] operations, such as: the replacement of a tool of the DAW
by another one with the same purpose (1), the change of
tool/DAW parameters (2), the modification of the DAW
[...]tions (4). Before implementing such functionalities, it is
crucial to understand the impact of a given adaptation for
a given setting on the workflow runtime and the output.
      </p>
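      <p>As a minimal sketch of rewriting operations (1) and (2) named above, a DAW can be modelled as a set of tasks connected by input dependencies. The class Task, the functions replace_tool and set_param, the task names, and the placeholder tools some_filter_tool and some_count_tool are assumptions made for illustration; only the aligner names STAR and Hisat2 come from this paper.</p>
      <preformat>
```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Task:
    """One DAW step: `purpose` says what it does, `tool` says how."""
    name: str
    purpose: str
    tool: str
    params: tuple = ()   # sorted (key, value) pairs passed to the tool
    inputs: tuple = ()   # names of the upstream tasks (the DAG edges)


def replace_tool(daw, task_name, new_tool):
    """Rewriting (1): swap the tool of one task; purpose and edges stay."""
    task = daw[task_name]
    rewritten = dict(daw)
    # Parameters are tool-specific, so they are reset for the new tool.
    rewritten[task_name] = replace(task, tool=new_tool, params=())
    return rewritten


def set_param(daw, task_name, key, value):
    """Rewriting (2): change one parameter of one task."""
    task = daw[task_name]
    params = dict(task.params)
    params[key] = value
    rewritten = dict(daw)
    rewritten[task_name] = replace(task, params=tuple(sorted(params.items())))
    return rewritten


# Hypothetical three-step RNAseq DAW; only the aligner names come from the text.
daw = {
    "filter": Task("filter", "quality filtering", "some_filter_tool"),
    "align": Task("align", "alignment", "STAR", inputs=("filter",)),
    "count": Task("count", "quantification", "some_count_tool", inputs=("align",)),
}

variant = set_param(replace_tool(daw, "align", "Hisat2"), "align", "threads", 8)
print(variant["align"])
```
      </preformat>
      <p>Both functions return a new DAW and leave the original untouched, so that the runtime and output of the two variants can be compared side by side.</p>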
      <p>
        In this paper, we study this problem for DAWs
performing an RNA sequencing (RNAseq) analysis. RNAseq is
particularly interesting as it is a widespread type of analysis
used to understand gene expression and regulation under
[...] DAWs take a large set of short strings as input, which are
sequenced fragments of mRNA, the transient molecules
generated during gene expression as an intermediate step
to protein sequences. Next, the DAWs map each string to a
reference genome and then cluster sets of strings stemming
from the same transcript. Real-life DAWs also include
further steps, such as data pre-processing, quality
filtering, or computation of different quality metrics. Each
task of these DAWs can be performed by several tools
that serve the same purpose, but use different heuristics
leading to different resource requirements and results.
A particularly complex step is the mapping, for which a
plethora of possible tools exist [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>Here, we studied the behaviour of three RNAseq DAWs
on two different infrastructures using two different data
sets. We created the DAW tool chains based on the tools’
popularity and compatibility with each other. Each DAW
was implemented in two versions: one designed for
a stand-alone server and one designed for a
distributed infrastructure. The adaptation consists of
splitting the load of resource-demanding tasks across several
nodes of the cluster by splitting the input files, which
is not supported equally well by all tools. We ran the
two versions of these three workflows on two different
infrastructures and measured the runtime and output
differences between the workflow versions depending
on several parameters. We consider this work a basis
for better understanding the impact of DAW rewritings.
Ultimately, we aim to abstract these findings into a set of
rules that lead to an automatic DAW rewriting according
to a given input and context specification.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        [...] the field of AutoML [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The main differences are that
in AutoML (1) pipelines are linear and only exchanges of
tasks are considered and (2) the results of different
workflow variants may also vary, but that typically a notion of
“best” is defined (e.g., highest accuracy on a test data set),
which often is not the case in scientific data analysis.
      </p>
      <p>
        As described in Figure 1, we created two RNAseq
workflows performing the same operations but optimized for
different infrastructures. Both have as central and most
time-consuming task the alignment, which matches short
stretches of genomic sequences to a reference genome.
The alignment is also fundamental in other
bioinformatics workflows studying genomic sequences [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. RS1
follows a pipeline structure, which we assume fits
better to a stand-alone server. RS2 is optimized for a
distributed infrastructure, as it splits the input to allow for
a distributed computation of the alignment step across
several nodes of a cluster. From each workflow, we
furthermore created three variants according to the
specific tool used to compute alignments, i.e., STAR [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],
Salmon [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], and Hisat2 [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. We implemented all DAWs
using Nextflow [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], a workflow engine of increasing
popularity in the bioinformatics community. Nextflow
workflows are implemented in a specific DSL which
allows for automatic parallelization and distributed
execution of tasks. For local execution, Nextflow uses its own
execution engine; for a distributed setup, it can work
together with resource managers such as Kubernetes. The goal
of this experiment is [...]
      </p>
      <sec id="sec-2-1">
        <title>Data and Infrastructures</title>
        <p>Two paired RNAseq input datasets of different sizes were
considered. Both data sets were obtained by sequencing
the transcriptome of Drosophila melanogaster. The
difference in size allows us to understand better the impact
of the input size on the decision to rewrite the workflow.
Dataset 1 consists of two paired files of 13 GB each, and
Dataset 2 consists of two files of 48 GB.</p>
        <p>The DAWs were run on two different infrastructures:
one stand-alone server and one cluster. The stand-alone
server (Infra1) consists of 32 Intel Xeon CPU E5-2667 v2
Octa Core, with a memory of 387 GB and a SATA SSD
1.9 TiB RAID 5. The cluster (Infra2) in our experiments
consists of 10 homogeneous nodes, each with a quad-core
Intel Xeon CPU E3-1230 V2 at 3.30 GHz, 16 GB of memory, and
3x1 TB disks, connected by a 2x 1 GBit network. The
stand-alone server has considerably more resources than the
cluster. Therefore, we expect all runtimes to be shorter on
Infra1.</p>
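        <p>The input splitting that the distributed version relies on can be sketched as follows, assuming uncompressed FASTQ files in which every read occupies four lines. The function name split_fastq_pair, the chunk file names, and the chunk size are hypothetical; the workflows studied here may split their input differently. Both mate files are cut at the same read boundaries so that each chunk pair can be aligned independently on one node.</p>
        <preformat>
```python
from itertools import islice
from pathlib import Path


def split_fastq_pair(r1_path, r2_path, out_dir, reads_per_chunk):
    """Split two paired FASTQ files into aligned chunks.

    Returns a list of (chunk_r1, chunk_r2) paths. A FASTQ record is four
    lines, and both files are cut at the same record boundaries so that
    mates stay paired within a chunk.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    lines_per_chunk = 4 * reads_per_chunk
    chunks = []
    with open(r1_path) as r1, open(r2_path) as r2:
        index = 0
        while True:
            block1 = list(islice(r1, lines_per_chunk))
            block2 = list(islice(r2, lines_per_chunk))
            if not block1 and not block2:
                break  # both files are exhausted
            if len(block1) != len(block2):
                raise ValueError("paired files differ in number of reads")
            p1 = out_dir / f"chunk{index:04d}_R1.fastq"
            p2 = out_dir / f"chunk{index:04d}_R2.fastq"
            p1.write_text("".join(block1))
            p2.write_text("".join(block2))
            chunks.append((p1, p2))
            index += 1
    return chunks
```
        </preformat>
        <p>With 10 cluster nodes, one possible choice is a chunk size that yields a small multiple of 10 chunk pairs, so that every node receives work while the per-chunk memory footprint stays within the 16 GB available per node.</p>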
        <p>[Table 1: runtimes of the pipeline version (RS1) and the distributed version (RS2) of the STAR, Hisat2, and Salmon DAWs on both infrastructures, for datasets D1 and D2.]</p>
      </sec>
      <sec id="sec-2-2">
        <title>4. Results</title>
        <sec id="sec-2-2-1">
          <title>Runtime comparison</title>
          <p>Table 1 shows the runtimes of the pipeline version (RS1)
and the distributed version (RS2) of the DAWs on the two
infrastructures with both datasets. Note that each value
displayed in this table was obtained from a single run.
We are currently generating duplicate runs to acquire
more robust values. However, as we had exclusive usage
of the infrastructures during the measurements, we are
confident that our experiments were not perturbed by
other computations. Furthermore, the duplicated runs
that were already computed are consistent with the
results presented in the table.</p>
          <p>We observe notable runtime differences that show
tendencies but not an entirely consistent picture. In general,
RS1 and RS2 show similar runtimes on the stand-alone
server Infra1, with the notable exception of Hisat2 on the
large dataset D2. For this case, time reduction is almost
50%, while runtimes for the smaller dataset D1 are very
similar. We attribute this behaviour to the low resource
usage of Hisat2. In a non-distributed setting, Nextflow
parallelizes the tasks over the different CPUs available,
which makes the runtime on Infra1 overall smaller.
Almost no difference is observed for STAR on Infra1 as it
requires a lot of RAM to run over a single chunk of the
input data. On Infra2, runtimes differ considerably. In
almost all cases, RS2 (designed for distributed computation)
achieves much lower runtimes than RS1, with reductions
up to 66%. Again, there is one exception: Salmon
actually takes longer with RS2 than with RS1. This runtime
difference is due to the task splitting the input files.</p>
          <p>Interestingly, the DAWs require very different
runtimes depending on which tool was used for the
alignment step. On Infra1 with D2, Salmon is
approx. 20 times faster than Hisat2 on RS1 and five times
faster on RS2. In almost all cases, workflows profited
from the plus in resources in the distributed Infra2 when
switching from RS1 to RS2, but to varying degrees.
However, recall that our intention is not to find the best
DAW for a given infrastructure, but to develop algorithms
that can rewrite a given DAW developed for a setting A
to adapt it to a new setting B, which might simply have
a slow network, such as Infra2. For instance, imagine a
researcher developed RS1 on Infra2 using Hisat2 for data
sets of the size of D1. Now, she wants to run it on larger
datasets yet avoid that 5-fold increase in runtime. An
adaptation an optimizer could propose is to rewrite the
workflow into RS2, which would only lead to a 2-fold
runtime. Or imagine another user who wants to reuse this
workflow, but is forced to use STAR as aligner because it
is the lab-internal standard. Runtime would be doubled,
or even increased by a factor of 13 when also switching to
larger files. An optimizer could recognize that switching
to RS2 would decrease the expected increase by 65%.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Quality comparison</title>
          <p>[Table 2: similarity between the results of the RS1 and RS2 versions of each DAW, per dataset. Recoverable entry: STAR, 100 % for Dataset 1 and 100 % for Dataset 2.]</p>
          <p>In scientific data analysis, different DAWs for the same
problem often lead to (slightly) different results due to
the usage of different heuristics for solving complex
subproblems. Sometimes, avoiding such changes can be
mandatory, for instance, when a certain analysis method
is defined as an organizational standard. However, often
such changes are acceptable, for instance, in the early
phases of a data analysis project in which different
tradeoffs are explored, such as runtime, result quality, analysis
cost etc. In any case, users need to be informed about the
expected degree of changes a DAW rewriting would incur.
To this end, we compared the results of the different DAW
versions to understand how much the DAG structure
modification impacts analysis results. We measured the
similarity between the results of the RS1 and RS2 versions
of each DAW in Table 2. Clearly, the DAWs using Hisat2
and STAR are very robust to this rewriting, while the one
using Salmon produces largely different results.</p>
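          <p>The text does not state how the similarity in Table 2 was computed. As one possible reading, the sketch below assumes that each workflow version yields a mapping from gene identifier to read count and reports the percentage of genes with identical counts. The function name result_similarity, the gene identifiers, and the counts are hypothetical.</p>
          <preformat>
```python
def result_similarity(counts_a, counts_b):
    """Percentage of genes, over the union of both results, with identical counts."""
    genes = set(counts_a) | set(counts_b)
    if not genes:
        return 100.0  # two empty results are trivially identical
    # A gene missing from one result never matches a gene present in the other.
    same = sum(1 for g in genes if counts_a.get(g) == counts_b.get(g))
    return 100.0 * same / len(genes)


# Made-up example counts, not data from this study.
rs1 = {"g1": 10, "g2": 0, "g3": 7, "g4": 3}
rs2 = {"g1": 10, "g2": 0, "g3": 8, "g4": 3}
print(result_similarity(rs1, rs2))
```
          </preformat>
          <p>An exact-match measure is strict. A tolerance on the counts or a rank correlation would be alternative choices if small numerical deviations should not count as differences.</p>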
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion and future work</title>
      <p>We presented the results of an initial study on the impact
of DAW adaptations to different infrastructures,
considering both the replacement of central tools and
changes to the workflow structure. The main purpose is
to show that such rewritings impact performance
considerably, that certain variants are more suitable for
certain infrastructures, and that suitability also depends
on the input size. Relationships are overall complex and
will certainly vary with different analysis problems,
different DAWs for solving them, and different
infrastructures. We are consolidating these results with
experiment replicates and more workflows and dataset sizes.
In future work, we will focus on languages to provide
descriptions of core aspects of infrastructures, methods
to derive properties of tools on different infrastructures,
annotation schemes to describe the equivalence of tools
in genomics, and a cost model as a basis for a rule-based
DAW adaptation algorithm that takes these properties
into account. We will then develop an automatic DAW
rewriting that implements this algorithm.</p>
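      <p>To illustrate the kind of decision such an algorithm would take, the following sketch selects a workflow variant from a table of known runtimes. It is a simple lookup, not the cost model envisioned above. The functions relative_change and choose_variant are hypothetical, and the runtimes are made-up placeholders rather than measurements from this study.</p>
      <preformat>
```python
def relative_change(old_runtime, new_runtime):
    """Relative runtime change in percent; negative values are reductions."""
    return 100.0 * (new_runtime - old_runtime) / old_runtime


def choose_variant(measurements, infra, dataset, allowed_tools=None):
    """Pick the (structure, tool) pair with the lowest known runtime.

    measurements maps (infra, dataset, structure, tool) to a runtime in minutes.
    allowed_tools restricts the choice, e.g. to a lab-internal standard aligner.
    """
    candidates = {
        (structure, tool): runtime
        for (i, d, structure, tool), runtime in measurements.items()
        if i == infra and d == dataset
        and (allowed_tools is None or tool in allowed_tools)
    }
    if not candidates:
        raise LookupError("no measurement for this setting")
    return min(candidates, key=candidates.get)


# Made-up placeholder runtimes in minutes, NOT values from Table 1.
measurements = {
    ("Infra2", "D2", "RS1", "STAR"): 300.0,
    ("Infra2", "D2", "RS2", "STAR"): 120.0,
    ("Infra2", "D2", "RS1", "Salmon"): 40.0,
    ("Infra2", "D2", "RS2", "Salmon"): 50.0,
}

print(choose_variant(measurements, "Infra2", "D2"))
print(choose_variant(measurements, "Infra2", "D2", allowed_tools={"STAR"}))
print(relative_change(300.0, 120.0))
```
      </preformat>
      <p>A real optimizer would replace the lookup table with predictions for settings that were never measured, and would additionally weigh the expected change in output against the runtime gain.</p>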
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>Funded by the Deutsche Forschungsgemeinschaft – Project-ID 414984028 – SFB 1404 FONDA.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wratten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wilm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Göke</surname>
          </string-name>
          ,
          <article-title>Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers</article-title>
          ,
          <source>Nat Methods</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Schiefer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Brandt</surname>
          </string-name>
          , et al.,
          <source>Portability of Scientific Workflows in NGS Data Analysis: A Case Study</source>
          , arXiv:2006.03104
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Frantz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Becker</surname>
          </string-name>
          , et al.,
          <source>FORCE on Nextflow: Scalable Analysis of Earth Observation data on Commodity Clusters, CIKM Workshops</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hanussek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bartusch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Krüger</surname>
          </string-name>
          ,
          <article-title>Performance and scaling behavior of bioinformatics applications in virtualization environments to create awareness for the efficient use of compute resources</article-title>
          ,
          <source>PLOS Computational Biology</source>
          <volume>17</volume>
          (
          <issue>7</issue>
          ): e1009244 (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Ullman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Widom</surname>
          </string-name>
          ,
          <article-title>Database systems - the complete book</article-title>
          ,
          <source>Pearson</source>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rheinländer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Heise</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hueske</surname>
          </string-name>
          , et al.,
          <article-title>SOFA: An extensible logical optimizer for UDF-heavy data flows</article-title>
          ,
          <source>Information Systems</source>
          <volume>52</volume>
          (
          <year>2015</year>
          )
          <fpage>96</fpage>
          -
          <lpage>125</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          , et al.,
          <article-title>Contemporary challenges for data-intensive scientific workflow management systems</article-title>
          ,
          <source>Workshop on Workflows in Support of Large-Scale Science</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Schaarschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zuther</surname>
          </string-name>
          , et al.,
          <article-title>Evaluation of seven different RNA-seq alignment tools based on experimental data from the model plant Arabidopsis thaliana</article-title>
          ,
          <source>Int J Mol Sci</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Quan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu-Bin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wen-Rui</surname>
          </string-name>
          , et al.,
          <article-title>Survey of MapReduce frame operation in bioinformatics</article-title>
          ,
          <source>Briefings in Bioinformatics</source>
          <volume>15</volume>
          (
          <year>2013</year>
          )
          <fpage>637</fpage>
          -
          <lpage>647</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.-A.</given-names>
            <surname>Zaid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Saiyi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hamid</surname>
          </string-name>
          ,
          <article-title>SparkRA: Enabling big data scalability for the GATK RNA-seq pipeline with Apache Spark</article-title>
          ,
          <source>Genes</source>
          <volume>11</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yakneen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Waszak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gertz</surname>
          </string-name>
          , et al.,
          <article-title>Butler enables rapid cloud-based analysis of thousands of human genomes</article-title>
          ,
          <source>Nat Biotechnol</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Diao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Evani</surname>
          </string-name>
          , et al.,
          <article-title>Massively parallel processing of whole genome sequence data: An in-depth performance study</article-title>
          ,
          <source>SIGMOD</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <article-title>AutoML: A survey of the state-of-the-art</article-title>
          ,
          <source>Knowledge-Based Systems</source>
          <volume>12</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Musich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cadle-Davidson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Osier</surname>
          </string-name>
          ,
          <article-title>Comparison of short-read sequence aligners indicates strengths and weaknesses for biologists to consider</article-title>
          ,
          <source>Front Plant Sci</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dobin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schlesinger</surname>
          </string-name>
          , et al.,
          <article-title>STAR: ultrafast universal RNA-seq aligner</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>29</volume>
          (
          <year>2012</year>
          )
          <fpage>15</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Patro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Duggal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Love</surname>
          </string-name>
          ,
          <article-title>Salmon provides fast and bias-aware quantification of transcript expression</article-title>
          ,
          <source>Nat Methods</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Paggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <article-title>Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype</article-title>
          ,
          <source>Nat Biotechnol</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Di Tommaso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chatzou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Floden</surname>
          </string-name>
          ,
          <article-title>Nextflow enables reproducible computational workflows</article-title>
          ,
          <source>Nat Biotechnol</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>