<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Adapting scientific workflows to changing infrastructures</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Ninon</forename><surname>De Mecquenem</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Humboldt-Universität zu Berlin</orgName>
								<address>
									<addrLine>Unter den Linden 6</addrLine>
									<postCode>10099</postCode>
									<settlement>Berlin</settlement>
									<region>DE</region>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ulf</forename><surname>Leser</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Humboldt-Universität zu Berlin</orgName>
								<address>
									<addrLine>Unter den Linden 6</addrLine>
									<postCode>10099</postCode>
									<settlement>Berlin</settlement>
									<region>DE</region>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Adapting scientific workflows to changing infrastructures</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">778AE15A7C2B111E8390F7B16013FFD9</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-04-29T06:29+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Data Analysis Workflows</term>
					<term>Bioinformatics</term>
					<term>Distributed infrastructures</term>
					<term>Portability</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Scientific workflows are increasingly popular for large-scale data analyses as they promise better documentation, increased reproducibility, and easier scalability of complex analysis pipelines. However, reproducibility is severely reduced when a given workflow is optimized for a specific infrastructure, as it would require other scientists to access the same computing environment. Hence, it is important to develop techniques that automatically adapt a given workflow to changes in the underlying infrastructure or characteristics of the analyzed data, for instance, by using different data partitions or different tools for individual steps of the analysis. Automatic workflow adaptation requires a cost model setting properties of different tools, data set sizes, and characteristics of the given infrastructure into perspective. As a first step in this direction, we here study in detail the performance of an important analysis in genomics, namely RNASeq, in different settings. We experimentally measured the runtime of different RNAseq workflows implemented in Nextflow on different infrastructures (stand-alone or distributed), composed of different tool chains, using different data set sizes. As different tools also lead to (slightly) different outputs, we additionally compared the output of different workflow variants. We show that workflow variants designed for a given infrastructure perform much worse in other settings and that rewritings sometimes keep and sometimes change the output, even when tools are only replaced by others with the same purpose. We see these experiments as an important first step toward automatically adapting workflows to different infrastructures.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Data Analysis Workflows (DAWs) are used to solve a specific data analysis problem using a chain of tools connected by input/output dependencies. In bioinformatics, the usage of DAWs is critical to perform reproducible analyses <ref type="bibr" target="#b0">[1]</ref>. However, porting DAWs to different infrastructures or using them for different input data sizes can cause severe problems. For example, if the new infrastructure has fewer resources, the workflow can crash due to insufficient memory or time out because computations take longer than anticipated at workflow design time. On the other hand, even with more resources, scaling problems can affect the runtime of the analysis <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4]</ref>. In our research, we hypothesize that knowledge of the infrastructure, the input, the DAW itself, and the particular tools it is made of can be used to automatically adapt a given DAW such that it also performs gracefully in a new environment. The rewriting of chains of interdependent commands has a long tradition, especially in the database <ref type="bibr" target="#b4">[5]</ref> and the big data world <ref type="bibr" target="#b5">[6]</ref>. However, rewriting DAWs for scientific data analysis differs from these settings in two regards. First, DAWs are typically designed and executed in a black box model both for the data and the operations: the executing infrastructure cannot make any assumptions regarding the functionality of the operations or the format of the data <ref type="bibr" target="#b6">[7]</ref>. 
Second, DAWs for complex scientific data analysis consist of many steps that are heuristics, which means that the "correct" result of an analysis actually is not known and that different DAWs for the same purpose on the same data might produce diverging results <ref type="bibr" target="#b7">[8]</ref>. Therefore, DAW adaptation may consider a wide range of valid primitive operations, such as: the replacement of a tool of the DAW by another one with the same purpose (1), the change of tool/DAW parameters (2), the modification of the DAW structure (3), or the adjustment of the sizes of data partitions (4). Before implementing such functionalities, it is crucial to understand the impact of a given adaptation for a given setting on the workflow runtime and the output. In this paper, we study this problem for DAWs performing an RNA sequencing (RNAseq) analysis. RNAseq is particularly interesting as it is a widespread type of analysis used to understand gene expression and regulation under certain conditions or diseases such as cancer. RNAseq DAWs take a large set of short strings as input, which are sequenced fractions of mRNA, the transient molecules generated during gene expression as an intermediate step to protein sequences. DAWs next map each string to a reference genome to then cluster sets of strings stemming from the same transcript. Real-life DAWs also include further steps, such as data pre-processing, quality filtering, or computation of different quality metrics. Each task of these DAWs can be performed by several tools that serve the same purpose but use different heuristics, leading to different resource requirements and results. A particularly complex step is the mapping, for which a plethora of possible tools exist <ref type="bibr" target="#b7">[8]</ref>.</p><p>Here, we studied the behaviour of three RNAseq DAWs on two different infrastructures using two different data sets. 
We created the DAWs' tool chains based on the tools' popularity and compatibility with each other. Each DAW was implemented in two versions: one is designed for a stand-alone server, and another one is designed for a distributed infrastructure. Adaptation consists of splitting the load of resource-demanding tasks across several nodes of the cluster by splitting the input files, which is not supported equally well by all tools. We ran the two versions of these three workflows on two different infrastructures and measured the runtime and output differences between the workflow versions depending on several parameters. We consider this work as a base to better understand the impact of DAW rewritings. Ultimately, we aim at abstracting these findings into a set of rules that lead to an automatic DAW rewriting according to a given input and context specifications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related work</head><p>Several strategies have already been used to optimize DAW execution in a given context. For instance, <ref type="bibr" target="#b8">[9]</ref> surveys genomic workflows for the Map-Reduce infrastructure. Al-Ars et al. created a version of the popular RNAseq GATK workflow adapted to Spark <ref type="bibr" target="#b9">[10]</ref>. Yakneen et al. describe a large-scale variant caller optimized for execution on a commercial cloud <ref type="bibr" target="#b10">[11]</ref>. Roy et al. studied the influence of different Hadoop parameters on a specific genomics DAW <ref type="bibr" target="#b11">[12]</ref>. However, all these works focused on optimization for a specific target infrastructure; in contrast, we research methods that can adapt DAWs to any execution infrastructure. Another line of related research is concerned with optimization of (relational) queries with user-defined functions, especially in big data processing pipelines <ref type="bibr" target="#b5">[6]</ref>. These, however, typically are built on the paradigm that (a) individual operations (tasks in a workflow setting) have pre-defined semantics, (b) data follows a relational model, and (c) all rewritings preserve exactly the results of a query. These assumptions do not hold in the realm of DAWs for scientific analysis: data can have arbitrary formats, tasks are typically exchanged as binaries without any guarantees, and some operations are computationally so complex that they can only be approached using heuristics, leading to different results for different concrete physical implementations. Accordingly, we envision that DAW rewriting makes up for these more complex settings by relying on knowledge provided by workflow designers who want to support portability and re-usability of their workflows. Finally, there is some commonality to recent studies in the field of AutoML <ref type="bibr" target="#b12">[13]</ref>. The main differences are that in AutoML (1) pipelines are linear and only exchanges of tasks are considered and (2) the results of different workflow variants may also vary, but typically a notion of "best" is defined (e.g. highest accuracy on a test data set), which often is not the case in scientific data analysis.</p><p>Figure <ref type="figure">1</ref>: Design of the experiments. 
RS1 was created with three tool-chains (Salmon, STAR, Hisat2). A variation of these workflows (RS2) was created. We ran it on two infrastructures (infra1 and infra2) and two datasets (D1 and D2). Their runtime was measured, and their output was compared.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experiments</head><p>As described in Figure <ref type="figure">1</ref>, we created two RNAseq workflows performing the same operations but optimized for different infrastructures. Both have as central and most time-consuming task the alignment, which matches short stretches of genomic sequences to a reference genome. The alignment is also fundamental in other bioinformatics workflows studying genomic sequences <ref type="bibr" target="#b13">[14]</ref>. RS1 follows a pipeline structure, which we assume fits better to a stand-alone server. RS2 is optimized for a distributed infrastructure, as it splits the input to allow for a distributed computation of the alignment step across several nodes of a cluster. From each workflow, we furthermore created three variants according to the specific tool used to compute alignments, i.e., STAR <ref type="bibr" target="#b14">[15]</ref>, Salmon <ref type="bibr" target="#b15">[16]</ref>, and Hisat2 <ref type="bibr" target="#b16">[17]</ref>. We implemented all DAWs using Nextflow <ref type="bibr" target="#b17">[18]</ref>, a workflow engine of increasing popularity in the bioinformatics community. Nextflow workflows are implemented in a specific DSL which allows for automatic parallelization and distributed execution of tasks. For local execution, Nextflow uses its own execution engine; for a distributed setup, it can work together with Kubernetes resource managers. The goal of this experiment is</p></div>
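<div xmlns="http://www.tei-c.org/ns/1.0"><p>The input splitting that distinguishes RS2 from RS1 can be illustrated with a small Python sketch. It is not the Nextflow implementation used in the study: the function names, the chunk size, and the synthetic reads are assumptions made for illustration only. The sketch relies on the standard FASTQ layout of four lines per read and cuts both mate files at the same read boundaries, so that each chunk pair could be aligned independently on a different node.</p>
```python
from itertools import islice

LINES_PER_READ = 4  # a FASTQ record is four lines: header, sequence, '+', qualities


def chunk_fastq(lines, reads_per_chunk):
    """Yield consecutive lists of FASTQ lines holding at most reads_per_chunk reads."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, LINES_PER_READ * reads_per_chunk))
        if not chunk:
            return
        yield chunk


def chunk_paired(mate1_lines, mate2_lines, reads_per_chunk):
    """Pair chunk i of mate 1 with chunk i of mate 2 so read pairs stay together."""
    return list(zip(chunk_fastq(mate1_lines, reads_per_chunk),
                    chunk_fastq(mate2_lines, reads_per_chunk)))


def toy_reads(prefix, n):
    """Build n synthetic FASTQ records (hypothetical data for this demo only)."""
    lines = []
    for i in range(n):
        lines += [f"@{prefix}{i}", "ACGT", "+", "IIII"]
    return lines


# Five read pairs split into chunks of two reads give three chunk pairs.
pairs = chunk_paired(toy_reads("r", 5), toy_reads("r", 5), reads_per_chunk=2)
print([len(mate1) // LINES_PER_READ for mate1, _ in pairs])
```
<p>Whether the per-chunk results can be merged without changing the output depends on the aligner, which is one reason why splitting is not supported equally well by all tools.</p></div>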
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Data and Infrastructures</head><p>Two paired-end RNAseq input datasets of different sizes were considered. Both data sets were obtained by sequencing the transcriptome of Drosophila melanogaster. The difference in size allows us to better understand the impact of the input size on the decision to rewrite the workflow. Dataset 1 (D1) consists of two paired files of 13 GB each, and Dataset 2 (D2) consists of two files of 48 GB.</p><p>The DAWs were run on two different infrastructures: one stand-alone server and one cluster. The stand-alone server (Infra1) consists of 32 Intel Xeon CPU E5-2667 v2 Octa Core, with a memory of 387 GB and a SATA SSD 1.9 TiB RAID 5. The cluster (Infra2) consists of 10 homogeneous nodes, each with a quad-core Intel Xeon CPU E3-1230 V2 3.30 GHz; Memory: 16 GB; Disks: 3x1 TB, connected by a network of 2x 1 GBit. The stand-alone server has considerably more resources than the cluster. Therefore, we expect all runtimes to be shorter on Infra1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Runtime comparison</head><p>Table <ref type="table" target="#tab_0">1</ref> shows the runtimes of the pipeline version (RS1) and the distributed version (RS2) of the DAWs on the two infrastructures with both datasets. Note that each value displayed in this table was obtained from a single run. We are currently generating replicate runs to acquire more robust values. However, as we had exclusive usage of the infrastructures during the measurements, we are confident that our experiments were not perturbed by other computations. Furthermore, the replicate runs that were already computed are consistent with the results presented in the table.</p><p>We observe notable runtime differences that show tendencies but not an entirely consistent picture. In general, RS1 and RS2 show similar runtimes on the stand-alone server Infra1, with the notable exception of Hisat2 on the large dataset D2. For this case, time reduction is almost 50%, while runtimes for the smaller dataset D1 are very similar. We attribute this behaviour to the low resource usage of Hisat2. In a non-distributed setting, Nextflow parallelizes the tasks over the different CPUs available, which makes the runtime on Infra1 overall shorter. Almost no difference is observed for STAR on Infra1 as it requires a lot of RAM to run over a single chunk of the input data. On Infra2, runtimes differ considerably. In almost all cases, RS2 (designed for distributed computation) achieves much lower runtimes than RS1, with reductions up to 66%. Again, there is one exception: Salmon actually takes longer with RS2 than with RS1. This runtime difference is due to the task splitting the input files.</p><p>Table 1: Runtimes of the three RNAseq DAWs, compared across DAW versions and datasets for both infrastructures. The DAWs are named by the aligner used in their tool-chain. In bold, we highlight large reductions in runtimes from RS1 to RS2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Interestingly, the DAWs require very different runtimes depending on which tool was used for the alignment step. On Infra1 with D2, Salmon is approx. 20 times faster than Hisat2 with RS1 and five times faster with RS2. In almost all cases, workflows profited from the additional resources in the distributed Infra2 when switching from RS1 to RS2, but to varying degrees.</p><p>However, recall that our intention is not to find the best DAW for a given infrastructure, but to develop algorithms that can rewrite a given DAW developed for a setting A to adapt it to a new setting B, which might simply have a slow network, such as Infra2. For instance, imagine a researcher developed RS1 on Infra2 using Hisat2 for data sets of the size of D1. Now, she wants to run it on larger datasets yet avoid the 5-fold increase in runtime. An adaptation an optimizer could propose is to rewrite the workflow into RS2, which would only lead to a 2-fold runtime. Or imagine another user who wants to reuse this workflow, but is forced to use STAR as aligner because it is the lab-internal standard. Runtime would be doubled, or even increased by a factor of 13 when also switching to larger files. An optimizer could recognize that switching to RS2 would decrease the expected increase by 65%.</p></div>
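<div xmlns="http://www.tei-c.org/ns/1.0"><p>The fold changes quoted in these scenarios can be re-derived from the Infra2 runtimes in Table 1. The following Python sketch is illustrative only: the dictionary layout and the function name fold_change are assumptions, not part of the study, and the STAR values are taken from the RS1 entries of the table as we read them.</p>
```python
# Runtimes in minutes on Infra2 (the cluster), copied from Table 1.
# Keys are (aligner, workflow version, dataset).
# Assumption: the STAR cells are read in RS1 order from the table.
RUNTIME_MIN = {
    ("Hisat2", "RS1", "D1"): 175,
    ("Hisat2", "RS1", "D2"): 935,
    ("Hisat2", "RS2", "D2"): 437,
    ("STAR", "RS1", "D1"): 364,
    ("STAR", "RS1", "D2"): 2383,
}


def fold_change(baseline, candidate):
    """Return how many times longer `candidate` runs than `baseline`."""
    return RUNTIME_MIN[candidate] / RUNTIME_MIN[baseline]


# Starting point of both scenarios: Hisat2, pipeline version, small dataset.
baseline = ("Hisat2", "RS1", "D1")
for candidate in RUNTIME_MIN:
    if candidate != baseline:
        print(candidate, round(fold_change(baseline, candidate), 1))
```
<p>Under these values, moving to D2 costs a factor of about 5.3 with RS1 but about 2.5 with RS2, and replacing Hisat2 by STAR costs a factor of about 2.1 on D1 and about 13.6 on D2. A cost model for automatic rewriting would have to predict such ratios rather than look them up.</p></div>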
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Quality comparison</head><p>In scientific data analysis, different DAWs for the same problem often lead to (slightly) different results due to the usage of different heuristics for solving complex subproblems. Sometimes, avoiding such changes can be mandatory, for instance, when a certain analysis method is defined as an organizational standard. However, such changes are often acceptable, for instance, in the early phases of a data analysis project in which different tradeoffs are explored, such as runtime, result quality, and analysis cost. In any case, users need to be informed about the expected degree of changes a DAW rewriting would incur.</p><p>To this end, we compared the results of the different DAW versions to understand how much the DAG structure modification impacts analysis results. We measured the similarity between the results of the RS1 and RS2 versions of each DAW in Table <ref type="table">2</ref>. Clearly, the DAWs using Hisat2 and STAR are very robust to this rewriting, while the one using Salmon produces largely different results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion and future work</head><p>We presented the results of an initial study on the impact of DAW adaptations to different infrastructures, considering both the replacement of central tools as well as changing the workflow structure. The main purpose is to show that such rewritings impact performance considerably and that certain variants are more suitable for certain infrastructures and that suitability also depends on the input size. Relationships are overall complex and certainly will vary with different analysis problems, different DAWs for solving them, and different infrastructures. We are consolidating these results with experiment replicates and more workflows and dataset sizes.</p><p>In future work, we will focus on languages to provide descriptions of core aspects of infrastructures, methods to derive properties of tools on different infrastructures, annotation schemes to describe the equivalence of tools in genomics and a cost model as a basis for a rule-based DAW adaptation algorithm that takes these properties into account. We will then develop an automatic DAW rewriting that implements this algorithm.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>surveys genomic workflows for the Map-Reduce infrastructure. Zaid Al-Ars et al. created a version of the popular RNAseq GATK workflow adapted to Spark [10]. Yakeen et al. describe a large-scale variant caller optimized for execution on a commercial cloud [11]. Roy et al. studied the influence of different Hadoop parameters on a specific genomics DAW</figDesc></figure>
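<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc></figDesc><table><row><cell>DAW</cell><cell>Dataset</cell><cell>Infrastructure</cell><cell>RS1</cell><cell>RS2</cell></row><row><cell>STAR</cell><cell>D1</cell><cell>Infra1</cell><cell>58 m</cell><cell>48 m</cell></row><row><cell>STAR</cell><cell>D1</cell><cell>Infra2</cell><cell>364 m</cell><cell>134 m</cell></row><row><cell>STAR</cell><cell>D2</cell><cell>Infra1</cell><cell>401 m</cell><cell>321 m</cell></row><row><cell>STAR</cell><cell>D2</cell><cell>Infra2</cell><cell>2383 m</cell><cell>720 m</cell></row><row><cell>Hisat2</cell><cell>D1</cell><cell>Infra1</cell><cell>60 m</cell><cell>47 m</cell></row><row><cell>Hisat2</cell><cell>D1</cell><cell>Infra2</cell><cell>175 m</cell><cell>85 m</cell></row><row><cell>Hisat2</cell><cell>D2</cell><cell>Infra1</cell><cell>569 m</cell><cell>232 m</cell></row><row><cell>Hisat2</cell><cell>D2</cell><cell>Infra2</cell><cell>935 m</cell><cell>437 m</cell></row><row><cell>Salmon</cell><cell>D1</cell><cell>Infra1</cell><cell>8 m</cell><cell>15 m</cell></row><row><cell>Salmon</cell><cell>D1</cell><cell>Infra2</cell><cell>68 m</cell><cell>51 m</cell></row><row><cell>Salmon</cell><cell>D2</cell><cell>Infra1</cell><cell>32 m</cell><cell>51 m</cell></row><row><cell>Salmon</cell><cell>D2</cell><cell>Infra2</cell><cell>186 m</cell><cell>270 m</cell></row></table></figure>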
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>Funded by the Deutsche Forschungsgemeinschaft, Project-ID 414984028, SFB 1404 FONDA.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers</title>
		<author>
			<persName><forename type="first">L</forename><surname>Wratten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Wilm</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Göke</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nat Methods</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Schiefer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Brandt</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2006.03104</idno>
		<title level="m">Portability of Scientific Workflows in NGS Data Analysis: A Case Study</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">F</forename><surname>Lehmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Frantz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Becker</surname></persName>
		</author>
		<title level="m">FORCE on Nextflow: Scalable Analysis of Earth Observation data on Commodity Clusters</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note>CIKM Workshops</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Performance and scaling behavior of bioinformatics applications in virtualization environments to create awareness for the efficient use of compute resources</title>
		<author>
			<persName><forename type="first">M</forename><surname>Hanussek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Bartusch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Krüger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PLOS Computational Biology</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="issue">7</biblScope>
			<biblScope unit="page">e1009244</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Garcia-Molina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Ullman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Widom</surname></persName>
		</author>
		<title level="m">Database systems -the complete book</title>
				<imprint>
			<publisher>Pearson</publisher>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Sofa: An extensible logical optimizer for udf-heavy data flows</title>
		<author>
			<persName><forename type="first">A</forename><surname>Rheinländer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Heise</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hueske</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Systems</title>
		<imprint>
			<biblScope unit="volume">52</biblScope>
			<biblScope unit="page" from="96" to="125" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Contemporary challenges for data-intensive scientific workflow management systems</title>
		<author>
			<persName><forename type="first">R</forename><surname>Mork</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Martin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Workshop on Workflows in Support of Large-Scale Science</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Evaluation of seven different RNA-seq alignment tools based on experimental data from the model plant Arabidopsis thaliana</title>
		<author>
			<persName><forename type="first">A</forename><surname>Schaarschmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fischer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Zuther</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Int J Mol Sci</title>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Survey of MapReduce frame operation in bioinformatics</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Zou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X.-B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-R</forename><surname>Jiang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Briefings in Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="637" to="647" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">SparkRA: Enabling big data scalability for the GATK RNA-seq pipeline with Apache Spark</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Al-Ars</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Mushtaq</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Genes</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Butler enables rapid cloud-based analysis of thousands of human genomes</title>
		<author>
			<persName><forename type="first">S</forename><surname>Yakneen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Waszak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gertz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nat Biotechnol</title>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Massively parallel processing of whole genome sequence data: An in-depth performance study</title>
		<author>
			<persName><forename type="first">A</forename><surname>Roy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Diao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Evani</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2017">2017</date>
			<publisher>SIGMOD</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">AutoML: A survey of the state-of-the-art</title>
		<author>
			<persName><forename type="first">X</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Chu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Knowledge-Based Systems</title>
		<imprint>
			<biblScope unit="volume">212</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Comparison of short-read sequence aligners indicates strengths and weaknesses for biologists to consider</title>
		<author>
			<persName><forename type="first">R</forename><surname>Musich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Cadle-Davidson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Osier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Front Plant Sci</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">STAR: ultrafast universal RNA-seq aligner</title>
		<author>
			<persName><forename type="first">A</forename><surname>Dobin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">A</forename><surname>Davis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Schlesinger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="page" from="15" to="21" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Salmon provides fast and bias-aware quantification of transcript expression</title>
		<author>
			<persName><forename type="first">R</forename><surname>Patro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Duggal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Love</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nat Methods</title>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype</title>
		<author>
			<persName><forename type="first">D</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Paggi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Park</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nat Biotechnol</title>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Nextflow enables reproducible computational workflows</title>
		<author>
			<persName><forename type="first">P</forename><surname>Di Tommaso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chatzou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Floden</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nat Biotechnol</title>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
