=Paper=
{{Paper
|id=Vol-1149/bd2014_mirmomeni
|storemode=property
|title=Resolving ambiguity in genome assembly using high performance computing
|pdfUrl=https://ceur-ws.org/Vol-1149/bd2014_mirmomeni.pdf
|volume=Vol-1149
}}
==Resolving ambiguity in genome assembly using high performance computing==
Resolving ambiguity in genome assembly using high performance computing

Mahtab Mirmomeni a,c, Thomas Conway a, Matthias Reumann b, Justin Zobel c
a IBM Research Australia; b IBM Research Zurich; c Department of Computing and Information Systems, The University of Melbourne

Mahtab Mirmomeni, Software Engineer, IBM Research Australia, mahtabm@au1.ibm.com. Mahtab Mirmomeni is a software engineer at IBM Research Australia. She previously studied for a Master of Science (Computer Science) at the University of Melbourne.

SUMMARY
DNA sequencing has revolutionised medicine and biology by providing insight into the nature of living organisms. High-throughput shotgun sequencing creates massive numbers of reads in a short period of time, and de novo assembly attempts to reconstruct the original sequence, as closely as possible, from these reads. The longer the pieces reconstructed by an assembly, the more light they shed on the underlying organism's biology. Repetitive sequences in the DNA create ambiguities in the assembly, which result in shorter fragments. In this project, we explore the search space of assembly graph construction using the high performance computing capability of an IBM Blue Gene/Q and develop an algorithm that improves assembly quality through a deeper search for valid longer sequences around repeat areas. Our results show that we can increase the N50 of contigs by 4% and the number of contigs over 1000bp by up to 7%; however, this extension comes at the cost of a great deal of computing power.

INTRODUCTION
Genome sequencing has become an indispensable part of biomedical research. It enables researchers to understand the structure of genomes and the relationships between species. High throughput sequencing technologies produce large numbers of reads from the DNA in a short period of time. These reads need to be merged to rebuild the original DNA sequence. This process is called assembly; de novo assembly is a type of assembly in which the reads are the only information used in the reconstruction. The programs that perform assembly are called assemblers. Many assemblers create a graph (such as the de Bruijn graph) from the reads and traverse the graph to output unambiguous fragments called contigs. To construct a de Bruijn graph for assembly, reads are first decomposed into k-mers, strings of k nucleotides where k is a positive integer, which become the nodes of the graph. Two nodes are connected if the suffix of the source node shares an exact match of length k - 1 with the prefix of the destination node (a minimal sketch of this construction is given below). Each read induces a path in the graph, and the process of assembly is mapped to finding an Eulerian path in the graph. Assemblers have limitations when dealing with repetitive regions of DNA, since it is not clear how many copies of a repetitive sequence exist and how the graph should be traversed. In ambiguous situations, assemblers rely on heuristics or break the contigs at points of uncertainty. The most common heuristics in assembly programs exploit local graph topology in a greedy manner. In this project, we explore the opportunity to extend the length of contigs by searching the assembly graph further, performing a semi-global optimisation to identify valid sequences and produce more meaningful assemblies.
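To make the graph construction described in the introduction concrete, the following minimal Python sketch builds a node-centric de Bruijn graph from a set of reads. It is purely illustrative: the function name and data structures are our own, not Gossamer's, and a real assembler would use compressed representations to cope with billions of k-mers.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Build a de Bruijn graph whose nodes are k-mers.

    An edge u -> v exists when the (k-1)-suffix of u equals the
    (k-1)-prefix of v, i.e. when the two k-mers overlap by k - 1
    characters along some read.
    """
    graph = defaultdict(set)
    for read in reads:
        # Decompose the read into its overlapping k-mers.
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        # Consecutive k-mers in a read overlap by k - 1, so each read
        # induces a path through the graph.
        for u, v in zip(kmers, kmers[1:]):
            graph[u].add(v)
    return graph

if __name__ == "__main__":
    reads = ["ACGTACG", "GTACGTT"]
    for node, succs in sorted(build_de_bruijn(reads, k=4).items()):
        print(node, "->", sorted(succs))
```

Assembly then amounts to traversing such a graph to read off unambiguous paths as contigs; this sketch only illustrates the adjacency rule.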
DESCRIPTION
In order to investigate the effect of an extensive search in the assembly graph around repetitive regions on the length of the contigs reported by the assembler, we studied the assembly graph produced by Gossamer [1] using the human genome reads published by Gnerre et al. [2]. The assembly graph in Gossamer, called the supergraph, is the result of identifying the Eulerian paths in the de Bruijn graph. The edges of the supergraph are the contigs, which are the output of the assembly. Repetitive elements create complexities: for example, a repeated sequence represented by contig x can be preceded by contigs y1 and y2 and followed by contigs z1 and z2. These situations are called tangles. When Gossamer encounters a tangle, it outputs x, y1, y2, z1, and z2 separately. Given that x can be followed by either z1 or z2, both x.z1 (x followed by z1) and x.z2 are correct sequences in the underlying genome. Similarly, y1.x and y2.x are both correct sequences. Reporting all four combinations can result in over-estimating the number of instances of x in the genome. We thus 'expand' x, either from the left, reporting y1.x, y2.x, z1, and z2 as contigs, or from the right, reporting y1, y2, x.z1, and x.z2 (a small illustrative sketch of this expansion appears at the end of this abstract).

Given that the assembly supergraph for the Gnerre human dataset contains over 86 million contigs, we estimate that the amount of memory required for this dataset is over 87GB. In addition, over 13 million tangles have to be expanded. To tackle this problem, we divided the supergraph into smaller partitions and used high performance computing (HPC) to process each partition in parallel, in a reasonable amount of time. The cost of an exhaustive search of the supergraph to expand all tangles is exponential, and would therefore require an infeasible amount of computing power. Thus, instead of an exhaustive search to find the best set of tangle expansions in a partition of the supergraph, we implemented a heuristic search, randomly expanding the tangles in that partition over a number of rounds and recording the lengths of the produced contigs. Our algorithm ran on 512 CPUs for 50 hours. Our results show that it is possible to create longer contigs; however, we used around 8 times as much computing power as the assembly algorithm to gain this improvement.

CONCLUSION
In this project, we explored the possibility of producing longer, more meaningful contigs by extending contigs around repeat regions instead of breaking them into separate contigs. The repeat regions create complex structures in our assembly supergraph called tangles. We used the structure of the graph and searched more deeply in the assembly supergraph produced by Gossamer [1] to find the best set of expansions for the tangles. Because of the size of the Gnerre dataset, we had to partition its supergraph and use high performance computing to process different parts of the graph concurrently.

REFERENCES
1. Conway, T., Wazny, J., Bromage, A., Zobel, J. and Beresford-Smith, B. Gossamer -- a resource-efficient de novo assembler. Bioinformatics (Oxford, England), 28, 14 (2012), 1937-1938.
2. Gnerre, S., MacCallum, I., Przybylski, D., Ribeiro, F. J., Burton, J. N., Walker, B. J., Sharpe, T., Hall, G., Shea, T. P., Sykes, S., Berlin, A. M., Aird, D., Costello, M., Daza, R., Williams, L., Nicol, R., Gnirke, A., Nusbaum, C., Lander, E. S. and Jaffe, D. B. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences, 108, 4 (2011), 1513-1518.
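As a concrete illustration of the tangle expansion and the randomised heuristic described in the DESCRIPTION, the Python sketch below models a tangle as the tuple (x, predecessors, successors) and scores each random round by the longest contig it produces. The data structures, function names and scoring choice are illustrative assumptions, not Gossamer's implementation, which records contig lengths and reports statistics such as N50.

```python
import random

def expand_tangle(tangle, side):
    """Expand the repeated contig x of a tangle to one side.

    A tangle is modelled as (x, predecessors, successors), for example
    ('x', ['y1', 'y2'], ['z1', 'z2']).  Expanding left reports y1.x,
    y2.x, z1 and z2; expanding right reports y1, y2, x.z1 and x.z2.
    """
    x, preds, succs = tangle
    if side == "left":
        return [p + "." + x for p in preds] + list(succs)
    return list(preds) + [x + "." + s for s in succs]

def heuristic_search(tangles, contig_len, rounds=50, seed=0):
    """Randomly choose an expansion side for every tangle over several
    rounds, keeping the round whose longest contig is longest (an assumed
    stand-in score for the contig-length statistics recorded in the paper)."""
    rng = random.Random(seed)
    best_contigs, best_score = [], -1
    for _ in range(rounds):
        contigs = []
        for tangle in tangles:
            contigs.extend(expand_tangle(tangle, rng.choice(["left", "right"])))
        score = max(contig_len(c) for c in contigs)
        if score > best_score:
            best_contigs, best_score = contigs, score
    return best_contigs

if __name__ == "__main__":
    tangles = [("x", ["y1", "y2"], ["z1", "z2"])]
    # Toy length function; a real partition would use actual sequence lengths.
    print(heuristic_search(tangles, lambda c: len(c.replace(".", ""))))
```

In the approach described above, each supergraph partition would be searched independently in this way, with the partitions processed in parallel on the HPC nodes.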