=Paper=
{{Paper
|id=Vol-1146/paper7
|storemode=property
|title=Compressed Spaced Suffix Arrays
|pdfUrl=https://ceur-ws.org/Vol-1146/paper7.pdf
|volume=Vol-1146
|dblpUrl=https://dblp.org/rec/conf/icabd/GagieMV14
}}
==Compressed Spaced Suffix Arrays==
Compressed Spaced Suffix Arrays Travis Gagie1,⇤ , Giovanni Manzini2 , Daniel Valenzuela1 1 Department of Computer Science, University of Helsinki, Finland 2 University of Eastern Piedmont, Italy ⇤ Travis.Gagie@cs.helsinki.fi Abstract Spaced seeds are important tools for similarity search in bioinformatics, and using several seeds together often significantly improves their performance. With existing approaches, however, for each seed we keep a separate linear-size data structure, either a hash table or a spaced suffix array (SSA). In this paper we show how to compress SSAs relative to normal suffix arrays (SAs) and still support fast random access to them. We first prove a theoretical upper bound on the space needed to store an SSA when we already have the SA. We then present experiments indicating that our approach works even better in practice. 1 Introduction For the problem of similarity search, we are given two texts and asked to find each sufficiently long substring of the first text that is within a certain Hamming distance of some substring of the second text. Similarity search has many applications in bioinformatics — e.g., ortholog detection, structure prediction or determining rearrangements — and has been extensively studied (see, e.g., [19]). Researchers used to first look for short substrings of the first text that occur unchanged in the second text, called seeds, then try to extend these short, exact matches in either direction to obtain longer, approximate matches. This approach is called, naturally enough, “seed and extend”. The substrings’ exact matches are found using either a hash table of the substrings with the right length, or an index structure such as a suffix array (SA). Around the turn of the millenium, Burkhardt and Kärkkäinen [5] and Ma, Tromp and Li [15] independently proposed looking for short subsequences of the first text that have a certain shape and occur unchanged in the second text, and trying to extend those. A binary string encoding the shape of a subsequence, with Copyright c by the paper’s authors. Copying permitted only for private and academic purposes. In: Costas S. Iliopoulos, Alessio Langiu (eds.): Proceedings of the 2nd International Conference on Algorithms for Big Data, Palermo, Italy, 7-9 April 2014, published at http://ceur-ws.org/ 37 Compressed Spaced Suffix Arrays 1s indicating positions where the characters must match and 0s indicating posi- tions where they need not, is called a spaced seed. The total number of bits in the binary string is called the seed’s length, and the number of 1s is called its weight. The subsequences’ exact matches are found using either a hash table of the subsequences with the right shape, or a kind of modified SA called a spaced suffix array [12] (SSA). Burkhardt and Kärkkäinen, Ma et al. and subsequent authors have shown that using spaced seeds significantly improves the performance of seeding and extend- ing. Many papers have been written about how to design spaced seeds to minimize the number of errors (see, e.g., [4, 8, 11] and references therein), with the specifics depending on the model of sequence similarity and the acceptable numbers of false positives (for which the characters indicated by 1s all match but the substrings are not similar) and false negatives (for which those do not all match but the sub- strings are still similar) for the application in question. Regardless of the particular application, however, researchers have consistently observed that the best results are obtained using more than one seed at a time. A set of spaced seeds used in combination is called a multiple seed. Multiple seeds are now a popular and powerful tool for similarity search, but they have a lingering flaw: we keep a hash table or SSA for each seed, and each instance of these data structures takes linear space. For example, SHRiMP2’s [7] index for the human genome takes 16 GB for each seed. In contrast, Bowtie 2’s [14] compressed SA for that genome takes only 2.5 GB. This is because a normal SA (which supports only substring matching) can be compressed such that the number of bits per character is only slightly greater than the empirical entropy of the text. Unfortunately, the techniques for compressing normal SAs do not seem to apply directly to SSAs. In this paper we show how to compress SSAs relative to normal SAs and still support fast random access to them. Whereas the normal SA for a text lists the starting points of the suffixes of that text by those suffixes’ lexicographic order, the SSA for a text and a spaced seed lists the starting points of the subsequences with the right shape by those subsequences’ lexicographic order. Intuitively, if the seed starts with many 1s, the SSA will be similar to the SA. In Section 2 we formalize this intuition and prove a theoretical upper bound on the space needed to store an SSA when we already have the SA, in terms of the text’s length, the alphabet’s size, and the seed’s length and weight. In Section 3 we present experiments showing that our approach works even better in practice. That is, even when we implement our data structures using simpler, theoretically sub-optimal components, we achieve better compression than our upper bound predicts. In fact, in practice we can even successfully apply our approach in some cases when the assumptions underlying our upper bounds are violated. However, we still want to improve our compression and random-access times for seeds with low weight-to-length ratios. We recently learned that Peterlongo et al. [17] and Crochemore and Tischler [6] independently defined SSAs, under the names “bi-factor arrays” and “gapped suffix arrays”, for the special case in which the spaced seed has the form 1a 0b 1c . Russo and Tischler [18] showed how to represent such an SSA in asymptotically succinct 38 Compressed Spaced Suffix Arrays space such that we can support random access to it in time logarithmic in the length of the text. We note, however, that the spaced seeds used for most applications do not have this form. We also recently learned that Battaglia et al. [2] used an idea similar to that of spaced seeds in an algorithm for finding motifs with don’t-care symbols. It seems possible our results could be useful in reducing their algorithm’s memory usage. 2 Theory Suppose we want to store an SSA for a text T [0..n 1] over an alphabet of size and a spaced seed S with length ` and weight w. For i < n, let Ti be the subsequence of T [i..n 1] that contains T [j] if and only if i j and S[j i] = 1. Let Ti0 be the subsequence of T [i..n 1] that contains T [j] if and only if S[j i] = 0. Let SSA be the permutation on {0, . . . , n 1} in which i precedes i0 if either Ti Ti0 , or Ti = Ti0 and T [i..n 1] T [i0 ..n 1]. For example, if T = abracadabra and S = 101 then T0 = ar T6 = db T00 = b T60 = a T1 = ba T7 = ar T10 = r T70 = b T2 = rc T8 = ba T20 = a T80 = r T3 = aa T9 = r T30 = c T90 = a T4 = cd T10 = a T40 = a T5 = aa T50 = d and so SSA = [10, 3, 5, 7, 0, 8, 1, 4, 6, 9, 2], while SA = [10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]. If Ti Ti0 and Ti0 Ti00 , then i precedes i0 in both SSA and SA. In particular, if Ti = Ti0 or Ti0 = Ti00 , then i and i0 have the same relative order in SSA and SA. In our example, T3 = T5 = aa, so 3 precedes 5 in both SSA and SA; T20 = T60 = T90 = a, so 6 precedes 9 and 9 precedes 2 in both SSA and SA. If we partition SSA into subsequences such that i and i0 are in the same subse- quence if and only if Ti = Ti0 , then we can partition SA into the same subsequences. Since there are at most w + w distinct strings Ti , our partitions each consist of at most w + w subsequences. Similarly, if we partition based on Ti0 and Ti00 , then our partitions each consist of at most ` w + ` w subsequences. For our example, we can partition both SSA and SA into [4, 6, 9, 2], for Ti0 = a; [7, 0], for Ti0 = b; [3], for Ti0 = c; [5], for Ti0 = d; [8, 1], for Ti0 = r; and [10], for Ti0 = ✏. In this particular case, however, we could just as well partition both SSA and SA into only two common subsequences: e.g., [10, 7, 0] and [3, 5, 8, 1, 4, 6, 9, 2]. Consider the permutation SA 1 SSA, which maps elements’ positions in SSA to their positions in SA, and let ⇢ be the minimum number of increasing subsequences into which SA 1 SSA can be partitioned. Since any subsequence common to SSA and SA corresponds to an increasing subsequence in SA 1 SSA, we have ⇢ min( w +w, ` w +` w). In our example, SA 1 SSA = [0, 3, 4, 1, 2, 5, 6, 7, 8, 9, 10] and ⇢ = 2. Supowit [20] gave a simple algorithm that partitions SA 1 SSA into ⇢ increas- ing subsequences in O(n lg ⇢) ✓ O(n min(w, ` w) lg ) time. When applied to 39 Compressed Spaced Suffix Arrays SA 1 SSA in our example, Supowit’s algorithm partitions it into [0, 3, 4] and [1, 2, 5, 6, 7, 8, 9, 10]. Barbay et al. [1] showed how, given a partition of SA 1 SSA into ⇢ increasing subsequences, we can store it in (2 + o(1))n lg ⇢ (2 + o(1))n min(w, ` w) lg bits and support random access to it in O(lg lg ⇢) time. Combining their ideas with later work by Belazzougui and Navarro [3], we can keep the same space bound and improve the time bound to O(1). To do this, for i ⇢, we replace each element in the ith subsequence in SA 1 SSA by a character ai , and store the resulting string R such that we can support random access to it and partial rank queries on it. We then permute R according to SA 1 SSA and store the resulting string R0 such that we can support fast select queries on it. In our example, R = a1 a2 a2 a1 a1 a2 a2 a2 a2 a2 a2 and R0 = a1 a1 a1 a2 a2 a2 a2 a2 a2 a2 a2 . The partial rank query R.p rank(i) returns the number of copies of R[i] in R[0..i], and the select query R0 .selecta (i) returns the position of the ith copy of a in R0 . Barbay et al. noted that, for i < n, (SA 1 SSA)[i] = R0 .selectR[i] (R.p rank(i)) . Belazzougui and Navarro showed how we can store R in (1 + o(1))n lg ⇢ bits and support random access to it and partial rank queries on it in O(1) time, and store R0 in (1 + o(1))n lg ⇢ bits and support select queries on it in O(1) time. In summary, we can store SA 1 SSA in (2 + o(1))n min(w, ` w) lg bits such that we can support random access to it in O(1) time. We will give a longer explanation in the full version of this paper. Since SSA = SA (SA 1 SSA), this gives us the following result: Theorem 2.1 Let T [0..n 1] be a text over an alphabet of size and let S be a spaced seed with length ` and weight w. If we have already stored the suffix array SA for T such that we can support random access to SA in time tSA , then we can store a spaced suffix array SSA for T and S in (2 + o(1))n min(w, ` w) lg bits such that we can support random access to SSA in tSA + O(1) time. 3 Practice Theorem 2.1 says we can store SSAs for the human Y-chromosome chrY.fa in FASTA format (about 60 million characters over an alphabet of size 5) and SHRiMP2’s three default spaced seeds — i.e., 11110111101111, 1111011100100001111 and 1111000011001101111 — in about 560 MB, in addition to the SA, whereas storing the SSAs naı̈vely would take about 720 MB. Storing the SSAs packed such that each entry takes dlg 60 000 000e = 26 bits would reduce this to about 580 MB. To test our approach, we built the SSAs as described in Section 2; computed SA 1 SSA, R and R0 for each SSA; and stored each copy of R or R0 as a wavelet tree. We chose wavelet trees because they are simple to use and often more practical than the theoretically smaller and faster data structures mentioned in Section 2. We ran all our tests described in this section on a computer with a quad-core Intel 40 Compressed Spaced Suffix Arrays Xeon CPU with 32 GB of RAM, running Ubuntu 12.04. We used a wavelet-tree implementation from https://github.com/fclaude/libcds and compiled it with GNU g++ version 4.4.3 with optimization flag -O3. The uncompressed SA took 226 MB, and the six wavelet trees took a total of 215 MB and performed 10 000 random accesses each in 7.67 microseconds per access. That is, we compressed the SSAs into about 60% of the space it would take to store them naı̈vely and, although our accesses were much slower than direct memory accesses, they were fast compared to disk accesses. Thus, our approach seems likely to be useful when a set of SSAs is slightly larger than the memory and fits only when compressed. Using the same test setup, we then compressed SSAs for the ten spaced seeds BFAST [9, Table S3] uses for 36-base-pair Illumina reads, which all have weight 18: 1. 111111111111111111 2. 11110100110111101010101111 3. 11111111111111001111 4. 1111011101100101001111111 5. 11110111000101010000010101110111 6. 1011001101011110100110010010111 7. 1110110010100001000101100111001111 8. 1111011111111111111 9. 11011111100010110111101101 10. 111010001110001110100011011111 . Since the first seed consists only of 1s, the SSA we would build for it is the same as the SA. The uncompressed SA again took 226 MB and the 18 wavelet trees for the other nine seeds took a total of 649 MB — so instead of 2.26 GB, we used 875 MB (about 39%) for all ten seeds — and together performed 10 000 random accesses to each of the ten SSAs in about 7 microseconds per access. The left side of the top half of Figure 1 shows how many bits per character (bpc) of the text each SSA took, and the average time per access to each SSA. We also compressed the SSAs for the ten spaced seeds BFAST uses for 50-base- pair Illumina reads, which all have weight 22: 1. 1111111111111111111111 2. 1111101110111010100101011011111 3. 1011110101101001011000011010001111111 4. 10111001101001100100111101010001011111 5. 11111011011101111011111111 6. 111111100101001000101111101110111 7. 11110101110010100010101101010111111 8. 111101101011011001100000101101001011101 9. 1111011010001000110101100101100110100111 41 Compressed Spaced Suffix Arrays 10. 1111010010110110101110010110111011 . Again, the first seed consists only of 1s. This time, the 18 wavelet trees for the other nine seeds took a total of 712 MB; each access took about 8 microseconds. The left side of the bottom half of Figure 1 shows how many bit per character of the text each SSA took, and the average access time per access to each SSA. If we have a permutation ⇡1 on {0, . . . , n 1} stored and ⇡2 is any other permu- tation on {0, . . . , n 1}, then we can store ⇡2 relative to ⇡1 using the ideas from Section 2. For example, we can store SSAs relative to other SSAs. Suppose we consider the size of each SSA (except the SA) when compressed relative to each other SSA (including the SA), build a minimum spanning tree rooted at the SA, and compress each SSA relative to its parent in the tree. This can reduce our space usage at the cost of increasing the random-access time, as shown for the BFAST seeds on the right side of Figure 1. There are other circumstances in which we can ignore SSAs’ semantics and consider them only as permutations. For example, spaced seeds can be generalized to subset seeds [13], such as ternary strings in which 1s indicate positions where the characters must match, 0s indicate positions where they need not, and Ts indicate positions where characters must fall within the same equivalence class (such as the pyrimidines C and T and the purines A and G). It is not difficult to generalize Theorem 2.1 to subset seeds — we will do so in the full version of this paper — but it is also not necessary to obtain practical results. The Iedera tool (available at http://bioinfo.lifl.fr/yass/iedera.php) generates good subset seeds. A more challenging change is from fixed-length seeds to repetitive seeds [12]. A repetitive seed is a string in whose repetition the digits indicate which characters must match and how. For example, with respect to the repetitive spaced seed 10110, ATCGATCGGT matches ACCGTTGGGA but not ACCGTTGAGA. Repetitive seeds are useful when looking for approximate matches of substrings that have been extended until they become sufficiently infrequent. It is not clear how or if we can extend Theorem 2.1 to repetitive seeds. Nevertheless, the LAST tool (available at http://last.cbrc.jp) generates SSAs for repetitive spaced or subset seeds, which we can still try to compress in practice; see also [10, 16]. Our current goal is to achieve reasonable compression and access times for a set of repetitive subset seeds that we received from Martin Frith, which have av- erage length 19.85 and average weight about 10.44, counting “same equivalence class” digits as 0.5. Unfortunately, at the moment we use nearly 24 bits per en- try in the corresponding SSAs (including the overhead for the uncompressed SA), which is only marginally better than the 26 bits we would use with simple pack- ing. Meanwhile, random accesses take about 12 microseconds on average, which is significantly slower than access to a packed array. On the other hand, these seeds have an unusually low average weight-to-length ratio. We used Iedera and LAST to generate SSAs for a set of eight repetitive subset seeds, with average length 17.875 and average weight 12. For these, we used only 20.15 bits per entry, with random accesses taking about 10 microseconds on average. 42 Compressed Spaced Suffix Arrays space time space time seed (bpc) (µs) seed reference (bpc) (µs) 1 32.00 0 1 - 32.00 0 2 11.29 9 2 8 9.71 11 3 4.41 4 3 1 4.41 4 4 9.75 8 4 8 9.22 10 5 11.54 9 5 4 9.23 19 6 13.77 11 6 8 12.27 14 7 13.14 10 7 3 12.58 14 8 3.85 3 8 1 3.85 3 9 10.10 7 9 1 10.10 7 10 13.91 11 10 7 12.59 26 space time space time seed (bpc) (µs) seed reference (bpc) (µs) 1 32.00 0 1 - 32.00 0 2 9.03 8 2 1 9.03 6 3 12.30 10 3 1 12.30 10 4 13.86 11 4 2 12.59 18 5 8.13 7 5 1 8.13 6 6 10.80 9 6 1 10.80 8 7 11.14 8 7 1 11.14 8 8 11.09 8 8 1 11.09 9 9 11.77 9 9 8 11.34 18 10 12.54 10 10 8 8.94 17 Figure 1: The space usage of the SSAs of the spaced seeds BFAST uses for Illumina reads, in bits per character of the text, and the average time for a random access. On top, the seeds are for 36-base-pair reads; on the bottom, the seeds are for 50- base-pair reads. On the left, all the SSAs are compressed relative to the SA; on the right, some of the SSAs are compressed relative to other SSAs. Acknowledgments Many thanks to Francisco Claude, Maxime Crochemore, Matei David, Martin Frith, Costas Iliopoulos, Juha Kärkkäinen, Gregory Kucherov, Bin Ma, Ian Munro, Taku Onodera, Gonzalo Navarro, Luis Russo, German Tischler and the anonymous reviewers. References [1] J. Barbay, F. Claude, T. Gagie, G. Navarro, and Y. Nekrich. Efficient fully- compressed sequence representations. Algorithmica. To appear. [2] G. Battaglia, D. Cangelosi, R. Grossi, and N. Pisanti. Masking patterns in sequences: A new class of motif discovery with don’t cares. Theoretical 43 Compressed Spaced Suffix Arrays Computer Science, 410:4327–4340, 2009. [3] D. Belazzougui and G. Navarro. Alphabet-independent compressed text in- dexing. ACM Transactions on Algorithms. To appear. [4] D. G. Brown. Bioinformatics Algorithms: Techniques and Applications, chap- ter A survey of seeding for sequence alignment, pages 126–152. Wiley- Interscience, 2008. [5] S. Burkhardt and J. Kärkkäinen. Better filtering with gapped q-grams. Fun- damenta Informicae, 56:51–70, 2003. [6] M. Crochemore and G. Tischler. The gapped suffix array: A new index struc- ture for fast approximate matching. In Proceedings of the 17th Symposium on String Processing and Information Retrieval (SPIRE), pages 359–364, 2010. [7] M. David, M. Dzamba, D. Lister, L. Ilie, and M. Brudno. SHRiMP2: Sensitive yet practical short read mapping. Bioinformatics, 27:1011–1012, 2011. [8] L. Egidi and G. Manzini. Better spaced seeds using quadratic residues. Journal of Compututer and System Sciences, 79:1144–1155, 2013. [9] N. Homer, B. Merriman, and S. F. Nelson. BFAST: An alignment tool for large scale genome resequencing. PLOS One, 4:e7767, 2009. [10] P. Horton, S. M. Kielbasa, and M. C. Frith. DisLex: A tranformation for discontiguous suffix array construction. In Proceedings of the Workshop on Knowledge, Language, and Learning in Bioinformatics (KLLBI), pages 1–11, 2008. [11] L. Ilie, S. Ilie, S. Khoshraftar, and A. Mansouri Bigvand. Seeds for e↵ective oligonucleotide design. BMC Genomics, 12:280, 2011. [12] S. M. Kielbasa, R. Wan, K. Sato, P. Horton, and M. C. Frith. Adaptive seeds tame genomic sequence comparison. Genome Research, 21:487–493, 2011. [13] G. Kucherov, L. Noé, and M. A. Roytberg. A unifying framework for seed sensitivity and its application to subset seeds. Journal of Bioinformatics and Computational Biology, 4:553–570, 2006. [14] B. Langmeand and S. L. Salzberg. Fast gapped-read alignment with Bowtie 2. Nature Methods, 9:357–359, 2012. [15] B. Ma, J. Tromp, and M. Li. PatternHunter: faster and more sensitive ho- mology search. Bioinformatics, 18:440–445, 2002. [16] T. Onodera and T. Shibuya. An index structure for spaced seed search. In Proceedings of the 22nd International Symposium on Algorithms and Compu- tation (ISAAC), pages 764–772, 2011. 44 Compressed Spaced Suffix Arrays [17] P. Peterlongo, N. Pisanti, F. Boyer, and M.-F. Sagot. Lossless filter for finding long multiple approximate repetitions using a new data structure, the bi- factor array. In Proceedings of the 12th Symposium on String Processing and Information Retrieval (SPIRE), pages 179–190, 2005. [18] L. M. S. Russo and G. Tischler. Succinct gapped suffix arrays. In Proceed- ings of the 17th Symposium on String Processing and Information Retrieval (SPIRE), pages 290–294, 2011. [19] Y. Sun and J. Buhler. Designing multiple simultaneous seeds for DNA simi- larity search. Journal of Computational Biology, 12:847–861, 2005. [20] K. J. Supowit. Decomposing a set of points into chains, with applications to permutation and circle graphs. Information Processing Letters, 21:249–252, 1985. 45