         Compressed Spaced Suffix Arrays
         Travis Gagie1,*, Giovanni Manzini2, Daniel Valenzuela1

        1
            Department of Computer Science, University of Helsinki, Finland
                       2
                         University of Eastern Piedmont, Italy
                              * Travis.Gagie@cs.helsinki.fi




                                         Abstract

Spaced seeds are important tools for similarity search in bioinformatics, and
using several seeds together often significantly improves their performance. With
existing approaches, however, for each seed we keep a separate linear-size data
structure, either a hash table or a spaced suffix array (SSA). In this paper we show
how to compress SSAs relative to normal suffix arrays (SAs) and still support
fast random access to them. We first prove a theoretical upper bound on the
space needed to store an SSA when we already have the SA. We then present
experiments indicating that our approach works even better in practice.




1    Introduction
For the problem of similarity search, we are given two texts and asked to find each
sufficiently long substring of the first text that is within a certain Hamming distance
of some substring of the second text. Similarity search has many applications
in bioinformatics — e.g., ortholog detection, structure prediction or determining
rearrangements — and has been extensively studied (see, e.g., [19]). Researchers
used to first look for short substrings of the first text that occur unchanged in the
second text, called seeds, then try to extend these short, exact matches in either
direction to obtain longer, approximate matches. This approach is called, naturally
enough, “seed and extend”. The substrings’ exact matches are found using either
a hash table of the substrings with the right length, or an index structure such as
a suffix array (SA).
    Around the turn of the millennium, Burkhardt and Kärkkäinen [5] and Ma,
Tromp and Li [15] independently proposed looking for short subsequences of the
first text that have a certain shape and occur unchanged in the second text, and
trying to extend those. A binary string encoding the shape of a subsequence, with

Copyright © by the paper’s authors. Copying permitted only for private and academic purposes.
In: Costas S. Iliopoulos, Alessio Langiu (eds.): Proceedings of the 2nd International Conference
on Algorithms for Big Data, Palermo, Italy, 7-9 April 2014, published at http://ceur-ws.org/




1s indicating positions where the characters must match and 0s indicating
positions where they need not, is called a spaced seed. The total number of bits in
the binary string is called the seed’s length, and the number of 1s is called its
weight. The subsequences’ exact matches are found using either a hash table of
the subsequences with the right shape, or a kind of modified SA called a spaced
suffix array [12] (SSA).
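To make the seed terminology concrete, here is a minimal Python sketch of matching under a spaced seed; the function name spaced_match is ours for illustration and does not come from any of the tools cited:

```python
def spaced_match(seed, a, b):
    """Check whether equal-length strings a and b match under a spaced seed:
    characters at '1' positions must be equal; '0' positions are ignored."""
    assert len(a) == len(b) == len(seed)
    return all(x == y for s, x, y in zip(seed, a, b) if s == "1")

# The seed 101 requires positions 0 and 2 to match; position 1 is free.
print(spaced_match("101", "act", "agt"))  # True
print(spaced_match("101", "act", "acg"))  # False

# Length and weight of one of SHRiMP2's default seeds.
seed = "1111011100100001111"
length, weight = len(seed), seed.count("1")  # 19 and 12
```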
   Burkhardt and Kärkkäinen, Ma et al. and subsequent authors have shown that
using spaced seeds significantly improves the performance of seeding and
extending. Many papers have been written about how to design spaced seeds to minimize
the number of errors (see, e.g., [4, 8, 11] and references therein), with the specifics
depending on the model of sequence similarity and the acceptable numbers of false
positives (for which the characters indicated by 1s all match but the substrings
are not similar) and false negatives (for which those do not all match but the
substrings are still similar) for the application in question. Regardless of the particular
application, however, researchers have consistently observed that the best results
are obtained using more than one seed at a time. A set of spaced seeds used in
combination is called a multiple seed.
   Multiple seeds are now a popular and powerful tool for similarity search, but
they have a lingering flaw: we keep a hash table or SSA for each seed, and each
instance of these data structures takes linear space. For example, SHRiMP2’s [7]
index for the human genome takes 16 GB for each seed. In contrast, Bowtie 2’s [14]
compressed SA for that genome takes only 2.5 GB. This is because a normal SA
(which supports only substring matching) can be compressed such that the number
of bits per character is only slightly greater than the empirical entropy of the text.
Unfortunately, the techniques for compressing normal SAs do not seem to apply
directly to SSAs.
   In this paper we show how to compress SSAs relative to normal SAs and still
support fast random access to them. Whereas the normal SA for a text lists the
starting points of the suffixes of that text by those suffixes’ lexicographic order, the
SSA for a text and a spaced seed lists the starting points of the subsequences with
the right shape by those subsequences’ lexicographic order. Intuitively, if the seed
starts with many 1s, the SSA will be similar to the SA. In Section 2 we formalize
this intuition and prove a theoretical upper bound on the space needed to store an
SSA when we already have the SA, in terms of the text’s length, the alphabet’s
size, and the seed’s length and weight.
   In Section 3 we present experiments showing that our approach works even
better in practice. That is, even when we implement our data structures using
simpler, theoretically sub-optimal components, we achieve better compression than
our upper bound predicts. In fact, in practice we can even successfully apply our
approach in some cases when the assumptions underlying our upper bounds are
violated. However, we still want to improve our compression and random-access
times for seeds with low weight-to-length ratios.
   We recently learned that Peterlongo et al. [17] and Crochemore and Tischler [6]
independently defined SSAs, under the names “bi-factor arrays” and “gapped suffix
arrays”, for the special case in which the spaced seed has the form 1^a 0^b 1^c. Russo
and Tischler [18] showed how to represent such an SSA in asymptotically succinct



space such that we can support random access to it in time logarithmic in the length
of the text. We note, however, that the spaced seeds used for most applications do
not have this form. We also recently learned that Battaglia et al. [2] used an idea
similar to that of spaced seeds in an algorithm for finding motifs with don’t-care
symbols. It seems possible our results could be useful in reducing their algorithm’s
memory usage.

2    Theory
Suppose we want to store an SSA for a text T[0..n − 1] over an alphabet of size σ
and a spaced seed S with length ℓ and weight w. For i < n, let Ti be the subsequence
of T[i..n − 1] that contains T[j] if and only if i ≤ j and S[j − i] = 1. Let T′i be the
subsequence of T[i..n − 1] that contains T[j] if and only if i ≤ j and S[j − i] = 0. Let
SSA be the permutation on {0, . . . , n − 1} in which i precedes i′ if either Ti ≺ Ti′ ,
or Ti = Ti′ and T[i..n − 1] ≺ T[i′..n − 1].
   For example, if T = abracadabra and S = 101 then

            T0   =    ar       T6    =   db             T′0   =   b      T′6   =   a
            T1   =    ba       T7    =   ar             T′1   =   r      T′7   =   b
            T2   =    rc       T8    =   ba             T′2   =   a      T′8   =   r
            T3   =    aa       T9    =   r              T′3   =   c      T′9   =   a
            T4   =    cd      T10    =   a              T′4   =   a
            T5   =    aa                                T′5   =   d

and so SSA = [10, 3, 5, 7, 0, 8, 1, 4, 6, 9, 2], while SA = [10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2].
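The definition above can be checked with a short Python sketch; this is a naïve, quadratic-time construction for illustration only, and the name build_ssa is ours:

```python
def build_ssa(T, S):
    """Spaced suffix array: sort positions by the seed-selected subsequence
    T_i, breaking ties by the full suffix T[i:]."""
    n = len(T)

    def T_i(i):  # characters T[j] with S[j - i] == '1', truncated at the text end
        return "".join(T[i + k] for k, s in enumerate(S) if s == "1" and i + k < n)

    return sorted(range(n), key=lambda i: (T_i(i), T[i:]))

T, S = "abracadabra", "101"
SSA = build_ssa(T, S)
SA = sorted(range(len(T)), key=lambda i: T[i:])
print(SSA)  # [10, 3, 5, 7, 0, 8, 1, 4, 6, 9, 2]
print(SA)   # [10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
```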
    If Ti ⪯ Ti′ and T′i ⪯ T′i′ , then i precedes i′ in both SSA and SA. In particular, if
Ti = Ti′ or T′i = T′i′ , then i and i′ have the same relative order in SSA and SA. In
our example, T3 = T5 = aa, so 3 precedes 5 in both SSA and SA; T′2 = T′6 = T′9 = a,
so 6 precedes 9 and 9 precedes 2 in both SSA and SA.
    If we partition SSA into subsequences such that i and i′ are in the same subse-
quence if and only if Ti = Ti′ , then we can partition SA into the same subsequences.
Since there are at most σ^w + w distinct strings Ti , our partitions each consist of
at most σ^w + w subsequences. Similarly, if we partition based on T′i and T′i′ , then
our partitions each consist of at most σ^(ℓ−w) + ℓ − w subsequences.
    For our example, we can partition both SSA and SA into [4, 6, 9, 2], for T′i = a;
[7, 0], for T′i = b; [3], for T′i = c; [5], for T′i = d; [8, 1], for T′i = r; and [10], for
T′i = ε. In this particular case, however, we could just as well partition both SSA
and SA into only two common subsequences: e.g., [10, 7, 0] and [3, 5, 8, 1, 4, 6, 9, 2].
    Consider the permutation SA⁻¹ SSA, which maps elements’ positions in SSA to
their positions in SA, and let ρ be the minimum number of increasing subsequences
into which SA⁻¹ SSA can be partitioned. Since any subsequence common to SSA
and SA corresponds to an increasing subsequence in SA⁻¹ SSA, we have ρ ≤
min(σ^w + w, σ^(ℓ−w) + ℓ − w). In our example, SA⁻¹ SSA = [0, 3, 4, 1, 2, 5, 6, 7, 8, 9, 10]
and ρ = 2.
    Supowit [20] gave a simple algorithm that partitions SA⁻¹ SSA into ρ increasing
subsequences in O(n lg ρ) ⊆ O(n min(w, ℓ − w) lg σ) time. When applied to



SA⁻¹ SSA in our example, Supowit’s algorithm partitions it into [0, 3, 4] and
[1, 2, 5, 6, 7, 8, 9, 10].
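A minimum partition into increasing subsequences can be sketched with a patience-sorting-style greedy; this is a stand-in illustrating the idea rather than a transcription of Supowit's algorithm, and it finds some minimum partition, not necessarily the one shown above:

```python
from bisect import bisect_left

def partition_increasing(pi):
    """Greedily cover a sequence by increasing subsequences: append each
    element to the chain whose last element is the largest one still below it,
    opening a new chain when no chain qualifies. The number of chains is the
    minimum possible (the length of the longest decreasing subsequence)."""
    tops, chains = [], []  # tops kept ascending; tops[k] = last element of chains[k]
    for x in pi:
        k = bisect_left(tops, x) - 1  # chain with the largest top < x, or -1
        if k < 0:
            tops.insert(0, x)
            chains.insert(0, [x])
        else:
            tops[k] = x
            chains[k].append(x)
    return chains

print(partition_increasing([0, 3, 4, 1, 2, 5, 6, 7, 8, 9, 10]))
# [[1, 2], [0, 3, 4, 5, 6, 7, 8, 9, 10]] -- two chains, so rho = 2
```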
    Barbay et al. [1] showed how, given a partition of SA⁻¹ SSA into ρ increasing
subsequences, we can store it in (2 + o(1))n lg ρ ≤ (2 + o(1))n min(w, ℓ − w) lg σ bits
and support random access to it in O(lg lg ρ) time. Combining their ideas with
later work by Belazzougui and Navarro [3], we can keep the same space bound and
improve the time bound to O(1).
    To do this, for i ≤ ρ, we label each element of the ith subsequence of SA⁻¹ SSA
with a character ai and let R[j] be the label of the element j, storing R such that
we can support random access to it and partial rank queries on it. We then permute
R according to SA⁻¹ SSA and store the resulting string R′ such that we can support
fast select queries on it. In our example, R = a1 a2 a2 a1 a1 a2 a2 a2 a2 a2 a2 and
R′ = a1 a1 a1 a2 a2 a2 a2 a2 a2 a2 a2 .
    The partial rank query R.p_rank(i) returns the number of copies of R[i] in
R[0..i], and the select query R′.select_a(i) returns the position of the ith copy of a
in R′. Barbay et al. noted that, for i < n,

                    (SA⁻¹ SSA)[i] = R′.select_{R[i]}(R.p_rank(i)) .

Belazzougui and Navarro showed how we can store R in (1 + o(1))n lg ρ bits and
support random access to it and partial rank queries on it in O(1) time, and store
R′ in (1 + o(1))n lg ρ bits and support select queries on it in O(1) time.
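The identity above can be checked on the running example with naïve, linear-scan stand-ins for the succinct rank/select structures (the real structures answer these queries in O(1) time; these Python helpers are for illustration only):

```python
def p_rank(R, i):
    """Number of copies of R[i] in R[0..i] (a partial rank query)."""
    return R[: i + 1].count(R[i])

def select(Rp, a, k):
    """Position of the k-th copy (1-based) of symbol a in Rp."""
    return [j for j, c in enumerate(Rp) if c == a][k - 1]

# SA^{-1} SSA for the running example; R[j] is the label of the chain
# containing the value j, and R' is R permuted by SA^{-1} SSA.
pi = [0, 3, 4, 1, 2, 5, 6, 7, 8, 9, 10]
chain1 = {0, 3, 4}                                  # first increasing chain
R = [1 if j in chain1 else 2 for j in range(len(pi))]  # [1,2,2,1,1,2,...]
Rp = [R[pi[i]] for i in range(len(pi))]                # [1,1,1,2,2,2,...]

# (SA^{-1} SSA)[i] = R'.select_{R[i]}(R.p_rank(i)) for every i
for i in range(len(pi)):
    assert pi[i] == select(Rp, R[i], p_rank(R, i))
```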
   In summary, we can store SA⁻¹ SSA in (2 + o(1))n min(w, ℓ − w) lg σ bits
such that we can support random access to it in O(1) time. We will give a longer
explanation in the full version of this paper. Since SSA = SA ∘ (SA⁻¹ SSA), this
gives us the following result:

Theorem 2.1 Let T[0..n − 1] be a text over an alphabet of size σ and let S be a
spaced seed with length ℓ and weight w. If we have already stored the suffix array
SA for T such that we can support random access to SA in time tSA , then we can
store a spaced suffix array SSA for T and S in (2 + o(1))n min(w, ℓ − w) lg σ bits
such that we can support random access to SSA in tSA + O(1) time.

3   Practice
Theorem 2.1 says we can store SSAs for the human Y-chromosome chrY.fa
in FASTA format (about 60 million characters over an alphabet of size
5) and SHRiMP2’s three default spaced seeds — i.e., 11110111101111,
1111011100100001111 and 1111000011001101111 — in about 560 MB, in addition
to the SA, whereas storing the SSAs naïvely would take about 720 MB. Storing
the SSAs packed such that each entry takes ⌈lg 60 000 000⌉ = 26 bits would reduce
this to about 580 MB.
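The arithmetic behind these estimates can be sketched as follows, assuming 32-bit entries for the naïve layout and decimal megabytes (the rounding conventions are our assumption):

```python
from math import ceil, log2

n, sigma = 60_000_000, 5
seeds = ["11110111101111", "1111011100100001111", "1111000011001101111"]

naive = 3 * n * 4 / 1e6            # three SSAs of 32-bit entries: 720 MB
bits = ceil(log2(n))               # 26 bits suffice per packed entry
packed = 3 * n * bits / 8 / 1e6    # about 585 MB

# Theorem 2.1 bound, ignoring the o(1) term:
# (2 + o(1)) n min(w, l - w) lg sigma bits per SSA
bound = sum(2 * n * min(s.count("1"), len(s) - s.count("1")) * log2(sigma)
            for s in seeds) / 8 / 1e6
print(round(naive), round(packed), round(bound))  # roughly 720, 585, 557
```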
   To test our approach, we built the SSAs as described in Section 2; computed
SA 1 SSA, R and R0 for each SSA; and stored each copy of R or R0 as a wavelet
tree. We chose wavelet trees because they are simple to use and often more practical
than the theoretically smaller and faster data structures mentioned in Section 2.
We ran all our tests described in this section on a computer with a quad-core Intel



Xeon CPU with 32 GB of RAM, running Ubuntu 12.04. We used a wavelet-tree
implementation from https://github.com/fclaude/libcds and compiled it with
GNU g++ version 4.4.3 with optimization flag -O3.
    The uncompressed SA took 226 MB, and the six wavelet trees took a total
of 215 MB and performed 10 000 random accesses each in 7.67 microseconds per
access. That is, we compressed the SSAs into about 60% of the space it would take
to store them naïvely and, although our accesses were much slower than direct
memory accesses, they were fast compared to disk accesses. Thus, our approach
seems likely to be useful when a set of SSAs is slightly larger than the memory and
fits only when compressed.
    Using the same test setup, we then compressed SSAs for the ten spaced seeds
BFAST [9, Table S3] uses for 36-base-pair Illumina reads, which all have weight
18:
 1. 111111111111111111
 2. 11110100110111101010101111
 3. 11111111111111001111
 4. 1111011101100101001111111
 5. 11110111000101010000010101110111
 6. 1011001101011110100110010010111
 7. 1110110010100001000101100111001111
 8. 1111011111111111111
 9. 11011111100010110111101101
10. 111010001110001110100011011111.
   Since the first seed consists only of 1s, the SSA we would build for it is the same
as the SA. The uncompressed SA again took 226 MB and the 18 wavelet trees for
the other nine seeds took a total of 649 MB — so instead of 2.26 GB, we used
875 MB (about 39%) for all ten seeds — and together performed 10 000 random
accesses to each of the ten SSAs in about 7 microseconds per access. The left side
of the top half of Figure 1 shows how many bits per character (bpc) of the text
each SSA took, and the average time per access to each SSA.
   We also compressed the SSAs for the ten spaced seeds BFAST uses for 50-base-
pair Illumina reads, which all have weight 22:
 1. 1111111111111111111111
 2. 1111101110111010100101011011111
 3. 1011110101101001011000011010001111111
 4. 10111001101001100100111101010001011111
 5. 11111011011101111011111111
 6. 111111100101001000101111101110111
 7. 11110101110010100010101101010111111
 8. 111101101011011001100000101101001011101
 9. 1111011010001000110101100101100110100111



10. 1111010010110110101110010110111011.


   Again, the first seed consists only of 1s. This time, the 18 wavelet trees for the
other nine seeds took a total of 712 MB; each access took about 8 microseconds.
The left side of the bottom half of Figure 1 shows how many bits per character of
the text each SSA took, and the average time per access to each SSA.
   If we have a permutation π1 on {0, . . . , n − 1} stored and π2 is any other
permutation on {0, . . . , n − 1}, then we can store π2 relative to π1 using the ideas from
Section 2. For example, we can store SSAs relative to other SSAs. Suppose we
consider the size of each SSA (except the SA) when compressed relative to each
other SSA (including the SA), build a minimum spanning tree rooted at the SA,
and compress each SSA relative to its parent in the tree. This can reduce our space
usage at the cost of increasing the random-access time, as shown for the BFAST
seeds on the right side of Figure 1.
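The tree-building step can be sketched in Python, using the chain count ρ of one permutation relative to another as the edge cost and Prim's algorithm for the spanning tree; the names rho and spanning_parents, and the use of ρ itself as the cost rather than the measured compressed size, are our simplifying assumptions:

```python
from bisect import bisect_left

def rho(p1, p2):
    """Minimum number of increasing subsequences covering p1^{-1} p2,
    a proxy for the cost of storing p2 relative to p1 (cf. Section 2)."""
    inv = [0] * len(p1)
    for pos, v in enumerate(p1):
        inv[v] = pos
    tops = []  # greedy chain cover, as in Section 2
    for x in (inv[v] for v in p2):
        k = bisect_left(tops, x) - 1
        if k < 0:
            tops.insert(0, x)
        else:
            tops[k] = x
    return len(tops)

def spanning_parents(perms):
    """Grow a spanning tree rooted at perms[0] (the SA), repeatedly attaching
    the cheapest outside permutation to one already in the tree."""
    n = len(perms)
    parent, in_tree = [None] * n, {0}
    while len(in_tree) < n:
        u, v = min(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: rho(perms[e[0]], perms[e[1]]))
        parent[v] = u
        in_tree.add(v)
    return parent
```

On the running example, rho(SA, SSA) is 2, and with only those two permutations the SSA is simply compressed relative to the SA.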
   There are other circumstances in which we can ignore SSAs’ semantics and
consider them only as permutations. For example, spaced seeds can be generalized
to subset seeds [13], such as ternary strings in which 1s indicate positions where the
characters must match, 0s indicate positions where they need not, and Ts indicate
positions where characters must fall within the same equivalence class (such as
the pyrimidines C and T and the purines A and G). It is not difficult to generalize
Theorem 2.1 to subset seeds — we will do so in the full version of this paper —
but it is also not necessary to obtain practical results. The Iedera tool (available
at http://bioinfo.lifl.fr/yass/iedera.php) generates good subset seeds.
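Matching under a subset seed can be sketched in the same style as before; the ternary encoding with the letter T for the equivalence-class digit, and the function name subset_match, are our illustrative choices:

```python
# Equivalence classes for the class digit: purines (R) and pyrimidines (Y).
CLASS = {"A": "R", "G": "R", "C": "Y", "T": "Y"}

def subset_match(seed, a, b):
    """Match under a ternary subset seed: '1' demands equal characters,
    'T' demands the same equivalence class, '0' is unconstrained."""
    for s, x, y in zip(seed, a, b):
        if s == "1" and x != y:
            return False
        if s == "T" and CLASS[x] != CLASS[y]:
            return False
    return True

print(subset_match("1T0", "ACG", "ATA"))  # True: C and T are both pyrimidines
print(subset_match("1T0", "ACG", "AGA"))  # False: C and G differ in class
```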
   A more challenging change is from fixed-length seeds to repetitive seeds [12]. A
repetitive seed is a string that is conceptually repeated to span the strings being
compared, with its digits indicating which characters must match and how. For
example, with respect to the repetitive spaced seed
10110, ATCGATCGGT matches ACCGTTGGGA but not ACCGTTGAGA. Repetitive seeds
are useful when looking for approximate matches of substrings that have been
extended until they become sufficiently infrequent. It is not clear how or if we
can extend Theorem 2.1 to repetitive seeds. Nevertheless, the LAST tool (available
at http://last.cbrc.jp) generates SSAs for repetitive spaced or subset seeds,
which we can still try to compress in practice; see also [10, 16].
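The repetitive-seed example above can be verified with a short sketch that cycles the seed over the strings (repetitive_match is our name for this check):

```python
from itertools import cycle

def repetitive_match(seed, a, b):
    """Match two equal-length strings under a repetitive spaced seed:
    the seed is repeated (and truncated) to the strings' length."""
    return all(x == y for s, x, y in zip(cycle(seed), a, b) if s == "1")

print(repetitive_match("10110", "ATCGATCGGT", "ACCGTTGGGA"))  # True
print(repetitive_match("10110", "ATCGATCGGT", "ACCGTTGAGA"))  # False
```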
   Our current goal is to achieve reasonable compression and access times for a
set of repetitive subset seeds that we received from Martin Frith, which have
average length 19.85 and average weight about 10.44, counting “same equivalence
class” digits as 0.5. Unfortunately, at the moment we use nearly 24 bits per entry
in the corresponding SSAs (including the overhead for the uncompressed SA),
which is only marginally better than the 26 bits we would use with simple packing.
Meanwhile, random accesses take about 12 microseconds on average, which is
significantly slower than access to a packed array. On the other hand, these seeds
have an unusually low average weight-to-length ratio. We used Iedera and LAST to
generate SSAs for a set of eight repetitive subset seeds, with average length 17.875
and average weight 12. For these, we used only 20.15 bits per entry, with random
accesses taking about 10 microseconds on average.



             space    time                                      space   time
      seed   (bpc)    (µs)                   seed   reference   (bpc)   (µs)

         1    32.00      0                     1       -        32.00     0
         2    11.29      9                     2       8         9.71    11
         3     4.41      4                     3       1         4.41     4
         4     9.75      8                     4       8         9.22    10
         5    11.54      9                     5       4         9.23    19
         6    13.77     11                     6       8        12.27    14
         7    13.14     10                     7       3        12.58    14
         8     3.85      3                     8       1         3.85     3
         9    10.10      7                     9       1        10.10     7
        10    13.91     11                    10       7        12.59    26


             space    time                                      space   time
      seed   (bpc)    (µs)                   seed   reference   (bpc)   (µs)

         1    32.00      0                     1       -        32.00     0
         2     9.03      8                     2       1         9.03     6
         3    12.30     10                     3       1        12.30    10
         4    13.86     11                     4       2        12.59    18
         5     8.13      7                     5       1         8.13     6
         6    10.80      9                     6       1        10.80     8
         7    11.14      8                     7       1        11.14     8
         8    11.09      8                     8       1        11.09     9
         9    11.77      9                     9       8        11.34    18
        10    12.54     10                    10       8         8.94    17

Figure 1: The space usage of the SSAs of the spaced seeds BFAST uses for Illumina
reads, in bits per character of the text, and the average time for a random access.
On top, the seeds are for 36-base-pair reads; on the bottom, the seeds are for 50-
base-pair reads. On the left, all the SSAs are compressed relative to the SA; on
the right, some of the SSAs are compressed relative to other SSAs.
Acknowledgments
Many thanks to Francisco Claude, Maxime Crochemore, Matei David, Martin
Frith, Costas Iliopoulos, Juha Kärkkäinen, Gregory Kucherov, Bin Ma, Ian Munro,
Taku Onodera, Gonzalo Navarro, Luis Russo, German Tischler and the anonymous
reviewers.

References
 [1] J. Barbay, F. Claude, T. Gagie, G. Navarro, and Y. Nekrich. Efficient fully-
     compressed sequence representations. Algorithmica. To appear.

 [2] G. Battaglia, D. Cangelosi, R. Grossi, and N. Pisanti. Masking patterns
     in sequences: A new class of motif discovery with don’t cares. Theoretical



    Computer Science, 410:4327–4340, 2009.

 [3] D. Belazzougui and G. Navarro. Alphabet-independent compressed text
     indexing. ACM Transactions on Algorithms. To appear.

 [4] D. G. Brown. Bioinformatics Algorithms: Techniques and Applications,
     chapter A survey of seeding for sequence alignment, pages 126–152. Wiley-
     Interscience, 2008.

 [5] S. Burkhardt and J. Kärkkäinen. Better filtering with gapped q-grams.
     Fundamenta Informaticae, 56:51–70, 2003.

 [6] M. Crochemore and G. Tischler. The gapped suffix array: A new index
     structure for fast approximate matching. In Proceedings of the 17th Symposium
     on String Processing and Information Retrieval (SPIRE), pages 359–364, 2010.

 [7] M. David, M. Dzamba, D. Lister, L. Ilie, and M. Brudno. SHRiMP2: Sensitive
     yet practical short read mapping. Bioinformatics, 27:1011–1012, 2011.

 [8] L. Egidi and G. Manzini. Better spaced seeds using quadratic residues. Journal
     of Computer and System Sciences, 79:1144–1155, 2013.

 [9] N. Homer, B. Merriman, and S. F. Nelson. BFAST: An alignment tool for
     large scale genome resequencing. PLOS One, 4:e7767, 2009.

[10] P. Horton, S. M. Kielbasa, and M. C. Frith. DisLex: A transformation for
     discontiguous suffix array construction. In Proceedings of the Workshop on
     Knowledge, Language, and Learning in Bioinformatics (KLLBI), pages 1–11,
     2008.

[11] L. Ilie, S. Ilie, S. Khoshraftar, and A. Mansouri Bigvand. Seeds for effective
     oligonucleotide design. BMC Genomics, 12:280, 2011.

[12] S. M. Kielbasa, R. Wan, K. Sato, P. Horton, and M. C. Frith. Adaptive seeds
     tame genomic sequence comparison. Genome Research, 21:487–493, 2011.

[13] G. Kucherov, L. Noé, and M. A. Roytberg. A unifying framework for seed
     sensitivity and its application to subset seeds. Journal of Bioinformatics and
     Computational Biology, 4:553–570, 2006.

[14] B. Langmead and S. L. Salzberg. Fast gapped-read alignment with Bowtie
     2. Nature Methods, 9:357–359, 2012.

[15] B. Ma, J. Tromp, and M. Li. PatternHunter: faster and more sensitive
     homology search. Bioinformatics, 18:440–445, 2002.

[16] T. Onodera and T. Shibuya. An index structure for spaced seed search. In
     Proceedings of the 22nd International Symposium on Algorithms and
     Computation (ISAAC), pages 764–772, 2011.



[17] P. Peterlongo, N. Pisanti, F. Boyer, and M.-F. Sagot. Lossless filter for finding
     long multiple approximate repetitions using a new data structure, the bi-
     factor array. In Proceedings of the 12th Symposium on String Processing and
     Information Retrieval (SPIRE), pages 179–190, 2005.
[18] L. M. S. Russo and G. Tischler. Succinct gapped suffix arrays. In
     Proceedings of the 18th Symposium on String Processing and Information
     Retrieval (SPIRE), pages 290–294, 2011.
[19] Y. Sun and J. Buhler. Designing multiple simultaneous seeds for DNA
     similarity search. Journal of Computational Biology, 12:847–861, 2005.
[20] K. J. Supowit. Decomposing a set of points into chains, with applications to
     permutation and circle graphs. Information Processing Letters, 21:249–252,
     1985.



