
Resolving ambiguity in genome assembly using high performance computing

Software Engineer, IBM Research Australia

Mahtab Mirmomeni


Thomas Conway

Matthias Reumann

Justin Zobel

Department of Computing and Information Systems, The University of Melbourne; IBM Research Australia; IBM Research Zurich

2014

pp. 36-37

SUMMARY DNA sequencing has revolutionised medicine and biology by providing insight into the nature of living organisms. High-throughput shotgun sequencing creates massive numbers of reads in a short period of time, and de novo assembly attempts to reconstruct the original sequence from these reads as closely as possible. The longer the pieces reconstructed by an assembly, the more light they shed on the underlying organism's biology. Repetitive sequences in the DNA create ambiguities in the assembly, which result in shorter fragments. In this project, we explore the search space of the assembly graph construction using the high performance computing capability of an IBM Blue Gene/Q, and develop an algorithm that improves assembly quality through a deeper search for valid longer sequences around repeat areas. Our results show that we can increase the N50 of contigs by 4% and the number of contigs over 1000 bp by up to 7%; however, this extension comes at the cost of a great deal of computing power.
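For readers unfamiliar with the N50 metric quoted above, the following is a minimal sketch of its standard definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembled length. The function name `n50` is our own choice for illustration and is not taken from Gossamer or from the authors' code.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Standard definition: sort contigs from longest to shortest and
    return the length of the contig at which the cumulative length
    first reaches half of the total length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("N50 is undefined for an empty collection")


# Toy example: total length is 54, so half is 27; 10 + 9 + 8 reaches 27.
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
example_n50 = n50(example_lengths)
```

Because N50 depends on the whole length distribution, joining a few contigs around repeats can raise it even when the total assembled length is unchanged.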

Mahtab Mirmomeni is a software engineer at IBM Research Australia. She previously studied for a Master of Science (Computer Science) at the University of Melbourne.

The assembly supergraph for our human Gnerre dataset contains over 86 million contigs, and we estimate that over 87 GB of memory is required for this dataset. In addition, over 13 million tangles have to be expanded. To tackle this problem, we divided the supergraph into smaller partitions and used high performance computing (HPC) to process the partitions in parallel in a reasonable amount of time. The cost of an exhaustive search of the supergraph to expand all tangles is exponential, and therefore requires an infeasible amount of computing power. Thus, instead of searching exhaustively for the best set of tangle expansions in a partition of the supergraph, we implemented a heuristic search that randomly expands the tangles in that partition for a number of rounds and records the lengths of the contigs produced. Our algorithm ran on 512 CPUs for 50 hours. Our results show that it is possible to create longer contigs; however, gaining this improvement took around 8 times as much additional computing power as the assembly algorithm itself.
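The randomised search described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes a heavily simplified, hypothetical model in which a partition is a list of tangles, each tangle offers a fixed list of candidate expansions, and each candidate is just the list of contig lengths it would produce. Validity checks on expansions and interactions between neighbouring tangles are left out, and the names `random_expansion_search` and `count_long_contigs` are invented for this example.

```python
import random


def count_long_contigs(lengths, threshold=1000):
    """Example score: number of contigs longer than `threshold` bp."""
    return sum(1 for length in lengths if length > threshold)


def random_expansion_search(tangles, rounds, score, seed=0):
    """Randomly expand every tangle for `rounds` rounds and keep the best.

    tangles: list of tangles; each tangle is a list of candidate
             expansions; each candidate is a list of contig lengths
             (a hypothetical toy representation).
    score:   function mapping a list of contig lengths to a number,
             where higher is better.
    Returns (best_choice, best_score); best_choice[i] is the index of
    the expansion picked for tangle i in the best round seen.
    """
    rng = random.Random(seed)
    best_choice, best_score = None, -1
    for _ in range(rounds):
        # Pick one candidate expansion per tangle, uniformly at random.
        choice = [rng.randrange(len(candidates)) for candidates in tangles]
        # Record the contig lengths this set of expansions produces.
        lengths = [
            length
            for candidates, picked in zip(tangles, choice)
            for length in candidates[picked]
        ]
        round_score = score(lengths)
        if round_score > best_score:
            best_choice, best_score = choice, round_score
    return best_choice, best_score


# Toy partition with two tangles and two candidate expansions each.
toy_partition = [
    [[600, 700], [1300]],       # tangle 0: keep split, or join into one contig
    [[400, 400, 400], [1200]],  # tangle 1: keep split, or join into one contig
]
best_choice, best_score = random_expansion_search(
    toy_partition, rounds=200, score=count_long_contigs
)
```

In this toy model an exhaustive search would have to score the product of the candidate counts over all tangles, which grows exponentially with the number of tangles; sampling a fixed number of rounds bounds the work per partition, at the price of possibly missing the best combination. Since each partition is searched independently, the calls for different partitions could be distributed across processors.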

CONCLUSION In this project, we explored the possibility of producing longer, more meaningful contigs by extending contigs around repeat regions instead of breaking them into separate contigs. Repeat regions create complex structures, called tangles, in our assembly supergraph. We used the structure of the graph and searched more deeply in the assembly supergraph produced by Gossamer [1] to find the best set of expansions for the tangles. Because of the size of the Gnerre dataset, we had to partition its supergraph and use high performance computing to process different parts of the graph concurrently.

1. Conway, T., Wazny, J., Bromage, A., Zobel, J. and Beresford-Smith, B. Gossamer - a resource-efficient de novo assembler. Bioinformatics (Oxford, England), 28, 14 (2012), 1937-1938.

2. Gnerre, S., MacCallum, I., Przybylski, D., Ribeiro, F. J., Burton, J. N., Walker, B. J., Sharpe, T., Hall, G., Shea, T. P., Sykes, S., Berlin, A. M., Aird, D., Costello, M., Daza, R., Williams, L., Nicol, R., Gnirke, A., Nusbaum, C., Lander, E. S. and Jaffe, D. B. High-quality draft assemblies of mammalian genomes from massively parallel sequence data.