<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Personalised cloud-computed genomics at health-system-relevant scale</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Denis C. Bauer</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Piotr Szul</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabian A. Buske</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cancer Epigenetics Program, Cancer Research Division, Kinghorn Cancer Centre, Garvan Institute of Medical Research</institution>
          ,
          <addr-line>Sydney, NSW, 2010</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Computational Informatics, CSIRO</institution>
          ,
          <addr-line>Marsfield, NSW, 2122</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Computational Informatics, CSIRO</institution>
          ,
          <addr-line>North Ryde, NSW, 2113</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Preventative Health Flagship, CSIRO</institution>
          ,
          <addr-line>North Ryde, NSW, 2113</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <title>Summary</title>
        <p>Genomic information is increasingly incorporated into medical practice for diagnosis and personalised treatment. However, processing genomic information at a scale relevant to the health system remains challenging because of the computational requirements and the high demands on data reproducibility and provenance. Here, we present Next Generation Sequencing Analysis for Enterprises (NGSANE), a Linux-based High Performance Computing (HPC) framework for production informatics, tailored to the demands and fast pace of personalised medicine and available as an on-demand virtual cluster on Amazon's Elastic Compute Cloud (EC2).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <p>Individual task blocks (e.g. read mapping) are packaged into bash script modules. These can be executed locally or on data subsets to test module code, submission parameters and the compute environment in stages, which mitigates the lack of debugging support in higher-level languages and submission frameworks. During production, NGSANE automatically submits a separate module call for each data set to the HPC queue. Switching between existing modules, parameter settings or software versions requires only a change to the project-specific configuration file, not to the software code (hot swapping).</p>
      <p>NGSANE recovers gracefully from unsuccessful jobs, whether caused by failed commands, missing or incorrect input, or under-resourced HPC jobs, by enabling a clean restart from the most recent successfully completed checkpoint. Workflows can be fully automated by using NGSANE’s control over HPC queuing systems and the customisable interfaces between modules to submit multiple dependent stages at once.</p>
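      <p>The checkpoint-based restart described above can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the bash used by NGSANE, and every name in it (run_with_checkpoints, the ".done" marker files, the step names) is a hypothetical stand-in, not NGSANE’s actual interface. The assumption is that each step records a marker on success, so that a rerun skips completed steps and resumes at the first unfinished one.</p>
      <preformat><![CDATA[
```python
import tempfile
from pathlib import Path
from typing import Callable


def run_with_checkpoints(
    steps: list[tuple[str, Callable[[], None]]], checkpoint_dir: Path
) -> list[str]:
    """Run steps in order, skipping any whose marker file already exists.

    Returns the names of the steps that were executed in this call.
    """
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for name, action in steps:
        marker = checkpoint_dir / f"{name}.done"  # hypothetical marker layout
        if marker.exists():
            continue  # completed in an earlier run, so it is not repeated
        action()  # an exception here leaves no marker, so a rerun retries this step
        marker.touch()
        executed.append(name)
    return executed


def demo() -> tuple[list[str], list[str]]:
    """Simulate a job that fails once and is then restarted."""
    attempts = {"mapping": 0}
    log = []

    def trim() -> None:
        log.append("trim")

    def mapping() -> None:
        attempts["mapping"] += 1
        if attempts["mapping"] == 1:
            raise RuntimeError("simulated under-resourced job")
        log.append("mapping")

    def variants() -> None:
        log.append("variants")

    steps = [("trim", trim), ("mapping", mapping), ("variants", variants)]
    with tempfile.TemporaryDirectory() as tmp:
        checkpoint_dir = Path(tmp)
        try:
            run_with_checkpoints(steps, checkpoint_dir)
        except RuntimeError:
            pass  # the first attempt stops at "mapping"; "trim" is checkpointed
        resumed = run_with_checkpoints(steps, checkpoint_dir)
    return log, resumed


if __name__ == "__main__":
    print(demo())
```
]]></preformat>
      <p>In this sketch the second call executes only the mapping and variant steps, because the trimming step already left its marker. A production system would also need to guard against partially written output, which the sketch does not attempt.</p>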
      <p>NGSANE supports the generation of a high-level summary (Project Card) that enables informed decisions about the success of an experiment. This interactive HTML report serves as an access point for new lab members or collaborators and as a gold standard for testing in a continuous integration server framework.</p>
      <p>NGSANE is available as an Amazon Machine Image (AMI), which can be deployed to Amazon’s EC2 using, for example, MIT’s StarCluster framework (http://star.mit.edu/cluster/) to launch a virtual cluster on demand (see Figure 1B). In addition to regular on-demand instances, whose availability is guaranteed at a fixed price, StarCluster offers command-line sourcing of Spot Instances, whose prices follow current supply and demand. Spot Instances can be acquired at a substantially lower price, but their availability is not guaranteed, so NGSANE’s checkpoint recovery is critical in such an unstable, competitive environment. Finally, NGSANE’s HPC job partitioning and submission structure is independent of the program calls, which allows new technologies (e.g. Hadoop) to be incorporated.</p>
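      <p>The separation between job partitioning and program calls can be sketched as follows. This is an illustrative Python outline under stated assumptions, not NGSANE’s bash implementation: the Job record, plan_jobs, submit_all and the placeholder commands map_reads and call_variants are all hypothetical. The planner creates one job per sample and stage with a dependency on the preceding stage, and the backend that receives the jobs is an interchangeable function.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Job:
    job_id: str
    command: str  # the program call, treated as opaque by the planner
    depends_on: Optional[str]  # job that must finish first, if any


def plan_jobs(samples: list[str], stages: dict[str, str]) -> list[Job]:
    """Create one job per sample and stage.

    Within a sample, each stage depends on the stage listed before it.
    """
    jobs = []
    for sample in samples:
        previous = None
        for stage, template in stages.items():
            job_id = f"{stage}:{sample}"
            jobs.append(Job(job_id, template.format(sample=sample), previous))
            previous = job_id
    return jobs


def submit_all(jobs: list[Job], backend: Callable[[Job], None]) -> None:
    """Hand every job to a backend; changing the backend leaves the commands untouched."""
    for job in jobs:
        backend(job)


if __name__ == "__main__":
    # Placeholder commands for illustration only; they are not real tools.
    stages = {
        "mapping": "map_reads {sample}.fastq",
        "variants": "call_variants {sample}.bam",
    }
    planned = plan_jobs(["s1", "s2"], stages)
    # A stand-in backend that only prints; a queue or Hadoop adapter would go here.
    submit_all(planned, lambda job: print(job.job_id, job.depends_on, job.command))
```
]]></preformat>
      <p>Because the planner never inspects the command strings, a different execution technology would only require a different backend function in this sketch.</p>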
      <p>Figure 1. (A) Resource consumption of the four steps involved in exon capture genomic data analysis. The per-sample average is plotted in hours for CPU usage (single- and multithreaded) and in gigabytes for RAM usage. (B) Schematic of a nine-node on-demand cluster on the EC2 service, launched by StarCluster, with the NGSANE AMI deployed on every node.</p>
    </sec>
    <sec id="sec-2">
      <title>Conclusion</title>
      <p>NGSANE is a flexible HPC framework for NGS data analysis that is specifically tailored to the demands and issues of personalised genomics. It is implemented in bash and publicly available under the BSD (3-Clause) licence via GitHub at https://github.com/BauerLab/ngsane. Currently implemented workflows include adapter trimming, read mapping, peak calling, motif discovery, transcript assembly, variant calling and chromatin conformation analysis.</p>
      <p>NGSANE is available for local cluster installation or as an AMI to be deployed as an on-demand cluster on Amazon’s EC2. This facilitates production-scale processing of large sample numbers and enables population-scale research that yields insights into individual disease risk and stratifies treatment for common diseases with an impact on the health system.</p>
    </sec>
  </body>
  <back>
  </back>
</article>