<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Personalised cloud-computed genomics at health-system-relevant scale</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><roleName>Dr</roleName><forename type="first">Denis</forename><forename type="middle">C</forename><surname>Bauer</surname></persName>
							<email>denis.bauer@csiro.au</email>
							<affiliation key="aff1">
								<orgName type="department">Preventative Health Flagship</orgName>
								<orgName type="institution">CSIRO</orgName>
								<address>
									<addrLine>North Ryde</addrLine>
									<postCode>2113</postCode>
									<region>NSW</region>
									<country key="AU">Australia</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="department">Computational Informatics</orgName>
								<orgName type="institution">CSIRO</orgName>
								<address>
									<addrLine>North Ryde</addrLine>
									<postCode>2113</postCode>
									<region>NSW</region>
									<country key="AU">Australia</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Piotr</forename><surname>Szul</surname></persName>
							<affiliation key="aff3">
								<orgName type="department">Computational Informatics</orgName>
								<orgName type="institution">CSIRO</orgName>
								<address>
									<postCode>2122</postCode>
									<settlement>Marsfield</settlement>
									<region>NSW</region>
									<country key="AU">Australia</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fabian</forename><forename type="middle">A</forename><surname>Buske</surname></persName>
							<affiliation key="aff4">
								<orgName type="department" key="dep1">Cancer Epigenetics Program</orgName>
								<orgName type="department" key="dep2">Cancer Research Division</orgName>
								<orgName type="department" key="dep3">Kinghorn Cancer Centre</orgName>
								<orgName type="institution">Garvan Institute of Medical Research</orgName>
								<address>
									<postCode>2010</postCode>
									<settlement>Sydney</settlement>
									<region>NSW</region>
									<country key="AU">Australia</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Personalised cloud-computed genomics at health-system-relevant scale</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">CBD631A6C1754AC9942820E33944E5FE</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T17:15+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Genomic information is increasingly incorporated into medical practice for diagnosis and personalised treatment. However, processing genomic information at a scale relevant for the health system remains challenging due to computational requirements as well as high demands on data reproducibility and data provenance. Here, we present Next Generation Sequencing Analysis for Enterprises (NGSANE), a Linux-based, High Performance Computing (HPC) framework for production informatics, tailored to the demands and fast pace of personalised medicine, which is available as an on-demand virtual cluster in Amazon's Elastic Compute Cloud.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>INTRODUCTION</head><p>Unprecedented computational capabilities and high-throughput data collection methods promise a new era of personalised, evidence-based healthcare, utilising individual genetic or genomic testing to tailor health management, as demonstrated by recent successes in rare genetic disorders <ref type="bibr" target="#b0">1</ref>,<ref type="bibr" target="#b1">2</ref> or stratified cancer treatments <ref type="bibr" target="#b2">3</ref>. Processing whole-exome sequencing data into fully annotated genomic variants can take up to 4633 CPU hours per sample (see Figure <ref type="figure" target="#fig_0">1A</ref>, CPU-single-threaded). This time, especially in the mapping stage, can be substantially reduced (7-fold) by utilising multithreading on High Performance Computing (HPC) clusters, where parallelisation between and within sample analyses can be easily implemented (see Figure <ref type="figure" target="#fig_0">1A</ref>, CPU-multi-threaded).</p><p>To minimise the delay between analysis tasks (i.e. mapping, recalibration, variant calling, annotation), workflows are commonly automated by means of software 'pipelines'. While high demands are placed on the data provenance and reproducibility of these pipelines, individual analysis components become outdated rapidly due to evolving technology and analysis methods, often rendering entire versions of production informatics pipelines obsolete. Furthermore, the necessary parallelisation requires a large investment in compute hardware and IT personnel, which is a barrier to entry for small laboratories and difficult to maintain at peak times for larger institutes. This hampers the creation of time-reliable production informatics environments for clinical genomics. 
Commercial cloud computing frameworks, such as Amazon Web Services (AWS), provide an economical alternative to in-house compute clusters, as they allow computation to be outsourced to third-party providers while retaining software and compute flexibility.</p><p>To cater for this resource-hungry, fast-paced yet sensitive environment of personalised medicine, we developed NGSANE, a Linux-based, HPC-enabled framework that minimises the overhead of setting up and processing new projects, yet maintains the full flexibility of custom scripting and data provenance when processing raw sequencing data either on a local cluster or on Amazon's Elastic Compute Cloud (EC2).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>DESCRIPTION</head><p>Unlike currently available tools such as Galaxy <ref type="bibr" target="#b3">4</ref>, Bpipe <ref type="bibr" target="#b4">5</ref>, SeqWare <ref type="bibr" target="#b5">6</ref> or Atlas2 <ref type="bibr" target="#b6">7</ref>, NGSANE constructs pipelines from Linux bash commands, which enables the use of hot-swappable, modular components as opposed to the more rigid program-call wrapping by higher-level languages or web-based services. NGSANE separates project-specific files from reference data, scripts, and software suites that are common to multiple projects. Access to confidential data is transparently handled via the underlying Linux permission system. A project-specific configuration file, which defines the compute environment as well as the analysis tasks to perform, mediates the interaction between project and framework. A full audit trail is generated, recording the tasks performed, the reference data used, timestamps, software versions and HPC log files, including any errors.</p><p>denis.bauer@csiro.au</p></div>
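<div xmlns="http://www.tei-c.org/ns/1.0"><p>The role of the project-specific configuration file and the audit trail can be illustrated with a minimal sketch. It is written in Python for brevity, whereas NGSANE itself is implemented in bash, and it does not reproduce NGSANE's actual configuration format: the keys (REFERENCE, SAMPLES, TASKS, MODULE_*), the module file names and the helper functions are all hypothetical. The sketch only shows the idea that editing the configuration, not the code, changes which module runs for a task, and that each planned job can be logged with a timestamp.</p>
<p><code lang="python"><![CDATA[
```python
from datetime import datetime, timezone


def parse_config(text):
    """Parse KEY=VALUE lines; blank lines and '#' comments are skipped."""
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config


def plan_jobs(config):
    """Return one job record per (task, sample) pair named in the config."""
    jobs = []
    for task in config["TASKS"].split():
        for sample in config["SAMPLES"].split():
            jobs.append({
                "task": task,
                "sample": sample,
                # The module is looked up in the config, so swapping a
                # module or version needs no change to this code.
                "module": config["MODULE_" + task],
                "reference": config["REFERENCE"],
            })
    return jobs


def audit_entry(job, status, now=None):
    """Build one audit-trail record (timestamp, status, job details)."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    return {"time": stamp, "status": status, **job}


# Hypothetical project configuration; all names are illustrative only.
EXAMPLE_CONFIG = """
# project configuration
REFERENCE=/refs/genome.fasta
SAMPLES=sampleA sampleB
TASKS=mapping variantcall
MODULE_mapping=mapper_v1.sh
MODULE_variantcall=caller_v2.sh
"""

if __name__ == "__main__":
    for job in plan_jobs(parse_config(EXAMPLE_CONFIG)):
        print(audit_entry(job, "planned"))
```
]]></code></p>
<p>Changing the MODULE_mapping line in the example configuration to a different script name would alter every planned mapping job without touching the planning code, which is the behaviour the configuration-driven design aims for.</p></div>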
<div xmlns="http://www.tei-c.org/ns/1.0"><head>3-4 April 2014 | Melbourne</head><p>Individual task blocks (e.g. read mapping) are packaged into bash script modules, which can be executed locally or on data subsets to test module code, submission parameters and the compute environment in stages, thereby mitigating the lack of debugging support in higher-level languages and submission frameworks. During production, NGSANE automatically submits a separate module call for each individual data set to the HPC queue. This modular design allows different existing modules, parameter settings, or software versions to be executed by changing the project-specific configuration file rather than the software code (hot swapping). NGSANE gracefully recovers from unsuccessfully executed jobs, whether due to failed commands, missing or incorrect input, or under-resourced HPC jobs, by enabling a clean restart from the most recent successfully executed checkpoint. Workflows can be fully automated by utilising NGSANE's control over HPC queuing systems and by leveraging the customisable interfaces between modules when submitting multiple dependent stages at once. NGSANE supports the generation of a high-level summary (Project Card) to enable informed decisions about the experimental success. This interactive HTML report provides an access point for new lab members or collaborators, as well as a gold standard that can be used for testing purposes in a continuous integration server framework.</p><p>NGSANE is available as an Amazon Machine Image (AMI), which can be deployed to Amazon's EC2 by using, for example, MIT's StarCluster framework (http://star.mit.edu/cluster/) to launch a virtual cluster on demand (see Figure <ref type="figure" target="#fig_0">1B</ref>). In addition to regular on-demand instances, whose availability is guaranteed at a fixed price, StarCluster also offers command line-based sourcing of Spot Instances, where prices are based on current supply and demand. 
While Spot Instances can be acquired at a substantially lower price, their availability is not guaranteed. Hence, NGSANE's checkpoint recovery is critical in such an unstable, competitive environment. Finally, NGSANE's HPC job partitioning and submission structure is independent of the program calls, thereby allowing new technologies (e.g. Hadoop) to be incorporated.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>CONCLUSION</head><p>NGSANE is a flexible HPC framework for NGS data analysis that is specifically tailored to the demands and issues of personalised genomics. NGSANE is implemented in bash and publicly available under the BSD (3-Clause) licence via GitHub at https://github.com/BauerLab/ngsane. Currently implemented workflows include those for adapter trimming, read mapping, peak calling, motif discovery, transcript assembly, variant calling and chromatin conformation analysis.</p><p>NGSANE is available for local cluster installation or as an AMI to be deployed as an on-demand cluster on Amazon's EC2. This facilitates production-scale processing of large sample numbers and enables research at population scale to produce insights into individual disease risk and stratify treatment for common diseases with impact on the health system.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1.</head><label>1</label><figDesc>Figure 1. A) Resource consumption of the four steps involved in exon capture genomic data analysis. The average per sample is plotted in hours for CPU usage (single- and multi-threaded) and in gigabytes for RAM usage. B) Schematic of a nine-node on-demand cluster on the EC2 service, launched by StarCluster with the NGSANE AMI deployed on every node.</figDesc><graphic coords="2,143.65,458.50,342.00,106.00" type="bitmap" /></figure>
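<div xmlns="http://www.tei-c.org/ns/1.0"><p>The restart from the most recent successfully executed checkpoint, described in the DESCRIPTION section, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not NGSANE's bash implementation: the function run_with_checkpoints, the JSON checkpoint file and the stage names are hypothetical. Each stage is recorded as done only after it finishes, so a failure (for example a reclaimed Spot Instance) leaves earlier checkpoints intact and a rerun resumes at the failed stage.</p>
<p><code lang="python"><![CDATA[
```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoints(stages, checkpoint_file):
    """Run (name, function) stages in order, skipping completed ones.

    Returns the names of the stages executed during this call.
    """
    path = Path(checkpoint_file)
    done = json.loads(path.read_text()) if path.exists() else []
    executed = []
    for name, func in stages:
        if name in done:
            continue  # finished in an earlier run
        func()  # an exception here leaves earlier checkpoints intact
        done.append(name)
        path.write_text(json.dumps(done))  # record success before moving on
        executed.append(name)
    return executed


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_recalibration():
        # Fails on the first call to imitate a lost compute node.
        attempts["count"] += 1
        if attempts["count"] == 1:
            raise RuntimeError("node lost")

    stages = [
        ("mapping", lambda: None),
        ("recalibration", flaky_recalibration),
        ("variantcall", lambda: None),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = Path(tmp) / "checkpoints.json"
        try:
            run_with_checkpoints(stages, ckpt)
        except RuntimeError:
            pass  # first attempt stops at recalibration
        # The rerun skips mapping and resumes at the failed stage.
        print(run_with_checkpoints(stages, ckpt))
```
]]></code></p>
<p>Writing the checkpoint only after a stage succeeds is the design choice that matters here: a job killed midway is simply repeated on restart, while completed stages are never rerun.</p></div>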
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Whole-genome sequencing for optimized patient management</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">N</forename><surname>Bainbridge</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Sci Transl Med</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">87</biblScope>
			<biblScope unit="page">87re3</biblScope>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Clinical diagnosis by whole-genome sequencing of a prenatal sample</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Talkowski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">N Engl J Med</title>
		<imprint>
			<biblScope unit="volume">367</biblScope>
			<biblScope unit="issue">23</biblScope>
			<biblScope unit="page" from="2226" to="2232" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Genetic and lifestyle influence on telomere length and subsequent risk of colon cancer in a case control study</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">J</forename><surname>Pellatt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Int J Mol Epidemiol Genet</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="184" to="194" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences</title>
		<author>
			<persName><forename type="first">J</forename><surname>Goecks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nekrutenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Taylor</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Genome Biol</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page">R86</biblScope>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Bpipe: a tool for running and managing bioinformatics pipelines</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">P</forename><surname>Sadedin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Pope</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Oshlack</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="issue">11</biblScope>
			<biblScope unit="page" from="1525" to="1526" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">SeqWare Query Engine: storing and searching sequence data in the cloud</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">D</forename><surname>O'Connor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Merriman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">F</forename><surname>Nelson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BMC Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page">S2</biblScope>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
	<note>Suppl 12</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Atlas2 Cloud: a framework for personal genome analysis in the cloud</title>
		<author>
			<persName><forename type="first">U</forename><forename type="middle">S</forename><surname>Evani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BMC Genomics</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page">S19</biblScope>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
	<note>Suppl 6</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
