<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Resolving ambiguity in genome assembly using high performance computing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mahtab Mirmomeni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Conway</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthias Reumann</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Justin Zobel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computing and Information Systems, The University of Melbourne</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IBM Research</institution>
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>IBM Research Zurich</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>36</fpage>
      <lpage>37</lpage>
      <abstract>
        <p>DNA sequencing has revolutionised medicine and biology by providing insight into the nature of living organisms. High-throughput shotgun sequencing produces massive numbers of reads in a short period of time, and de novo assembly attempts to reconstruct the original sequence as closely as possible from these reads. Longer reconstructed sequences (contigs) shed more light on the underlying organism's biology, but repetitive sequences in the DNA create ambiguities in the assembly that result in shorter fragments. In this project, we explore the search space of assembly graph construction using the high performance computing capability of an IBM Blue Gene/Q and develop an algorithm that improves assembly quality through a deeper search for valid longer sequences around repeat areas. Our results show that we can increase the N50 of contigs by 4% and the number of contigs over 1000 bp by up to 7%; however, this extension comes at the cost of a great deal of computing power.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Mahtab Mirmomeni is a software engineer at IBM Research
Australia. She previously studied for a Master of Science
(Computer Science) at the University of Melbourne.</p>
      <p>Given that the assembly supergraph for our human Gnerre dataset contains over 86 million contigs, we estimate that processing this dataset requires over
87 GB of memory. In addition, over 13 million tangles have to be expanded. To tackle this problem, we divided the supergraph into smaller partitions and
used high performance computing (HPC) to process the partitions in parallel in a reasonable amount of time. The cost of an exhaustive search that expands
all tangles in the supergraph is exponential and therefore requires an infeasible amount of computing power. Thus, instead of searching exhaustively for the best set of
tangle expansions in a partition of the supergraph, we implemented a heuristic search that randomly expands the tangles in that partition over a number of rounds and
records the lengths of the contigs produced. Our algorithm ran on 512 CPUs for 50 hours. Our results show that it is possible to create longer contigs; however,
gaining this improvement required around 8 times more computing power than the assembly algorithm itself.</p>
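      <p>The randomised search described above can be illustrated with a minimal sketch. It is not the implementation used in this project: the tangle representation (a repeat length plus lists of incoming and outgoing contig identifiers), the function names <monospace>n50</monospace>, <monospace>expand_once</monospace> and <monospace>random_search</monospace>, and the use of N50 as the score for a round are all assumptions made for illustration. The sketch also assumes that each contig enters at most one tangle and leaves at most one tangle, and it does not check whether an expansion is supported by the reads, which a real assembler would have to do.</p>
      <preformat>
```python
import random


def n50(lengths):
    """Return the N50: the largest length L such that contigs of length >= L
    together cover at least half of the total assembled length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def expand_once(contig_len, tangles, rng):
    """Randomly expand every tangle once and return the merged contig lengths.

    contig_len: dict mapping contig id -> length (hypothetical toy format).
    tangles: list of (repeat_len, incoming_ids, outgoing_ids) tuples.
    Assumes each contig is incoming to at most one tangle and outgoing
    from at most one tangle, so the joins form simple chains or cycles.
    """
    succ = {}
    for repeat_len, ins, outs in tangles:
        shuffled = list(outs)
        rng.shuffle(shuffled)  # one random pairing of incoming to outgoing
        for src, dst in zip(ins, shuffled):
            succ[src] = (dst, repeat_len)

    has_pred = {dst for dst, _ in succ.values()}
    # Walk chains from their heads first; leftover contigs lie on cycles.
    heads = [c for c in contig_len if c not in has_pred]
    seen = set()
    merged = []
    for start in heads + list(contig_len):
        if start in seen:
            continue
        seen.add(start)
        length = contig_len[start]
        node = start
        while node in succ and succ[node][0] not in seen:
            nxt, repeat_len = succ[node]
            # The merged contig passes through one copy of the repeat.
            length += repeat_len + contig_len[nxt]
            seen.add(nxt)
            node = nxt
        merged.append(length)
    return merged


def random_search(contig_len, tangles, rounds, seed=0):
    """Keep the best of `rounds` random expansions of one partition, by N50."""
    rng = random.Random(seed)
    best_lengths = list(contig_len.values())  # baseline: no tangle expanded
    best_score = n50(best_lengths)
    for _ in range(rounds):
        lengths = expand_once(contig_len, tangles, rng)
        score = n50(lengths)
        if score > best_score:
            best_score, best_lengths = score, lengths
    return best_score, best_lengths
```
      </preformat>
      <p>As a toy example, take four contigs of lengths 100, 200, 300 and 400 joined by a single tangle whose repeat is 50 long, with the first two contigs incoming and the last two outgoing. One pairing yields merged contigs of 450 and 650, the other yields two contigs of 550, so a search scored by N50 keeps the first. Because each partition is searched independently in this sketch, a call to <monospace>random_search</monospace> per partition could be handed to a separate worker process.</p>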
      <sec id="sec-1-1">
        <title>Conclusion</title>
        <p>In this project, we explored the possibility of producing longer, more meaningful contigs by extending contigs around repeat regions instead of breaking them into
separate contigs. The repeat regions create complex structures, called tangles, in our assembly supergraph. We used the structure of the graph and searched more
deeply in the assembly supergraph produced by Gossamer <xref ref-type="bibr" rid="ref1">1</xref> to find the best set of expansions for the tangles. Because of the size of the Gnerre dataset, we had to
partition its supergraph and use high performance computing to process different parts of the graph concurrently.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Conway</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wazny</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bromage</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zobel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Beresford-Smith</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <article-title>Gossamer - a resource-efficient de novo assembler</article-title>
          .
          <source>Bioinformatics</source>
          (Oxford, England),
          <volume>28</volume>
          ,
          <issue>14</issue>
          (
          <year>2012</year>
          ),
          <fpage>1937</fpage>
          -
          <lpage>1938</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Gnerre</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>MacCallum</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Przybylski</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ribeiro</surname>
            ,
            <given-names>F. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burton</surname>
            ,
            <given-names>J. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walker</surname>
            ,
            <given-names>B. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharpe</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shea</surname>
            ,
            <given-names>T. P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sykes</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berlin</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aird</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Costello</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daza</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nicol</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gnirke</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nusbaum</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lander</surname>
            ,
            <given-names>E. S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Jaffe</surname>
            ,
            <given-names>D. B.</given-names>
          </string-name>
          <article-title>High-quality draft assemblies of mammalian genomes from massively parallel sequence data</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>