Pruned BNDM: Extending the Bit-Parallel Suffix Automaton to Long Strings ⋆ ⋆⋆

Simone Faro and Stefano Scafiti

University of Catania, Department of Mathematics and Computer Science, Italy
{simone.faro,stefano.scafiti}@unict.it

⋆ This work has been supported by G.N.C.S., Istituto Nazionale di Alta Matematica Francesco Severi and by Programma Ricerca di Ateneo UNICT 2020-22 linea 2.
⋆⋆ Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Automata have always played a very important role in the design of efficient solutions for the exact string matching problem. Among these, a special mention is deserved by those algorithms exploiting the bit-parallelism technique to efficiently simulate the suffix automaton of a string. However, the bit-parallel encoding requires one bit for each character, and thus performance degrades quickly as the length of the searched pattern grows beyond the machine word size w. In this paper, we present a novel technique for exploiting the bit-parallel representation of a non-deterministic suffix automaton also in the case of strings of length m > w. Our approach consists in searching for a pruned version of the original pattern, whose automaton can be represented with a reduced number of bits, thus allowing to retain the performance of the original approach also in the case of patterns exceeding the word size. Experimental results show that, in the case of very long patterns, our method scales better than existing approaches.

1 Introduction

The string matching problem consists in finding all the occurrences of a pattern P of length m in a text T of length n, both defined over an alphabet Σ of size σ. It is a basic problem of computer science that has been studied for more than 40 years [6]. The first linear-time solution to the problem was given by the Knuth-Morris-Pratt algorithm (KMP) [9], whereas the Boyer-Moore (BM) algorithm provided the first sublinear solution on average. Subsequently, the BDM algorithm reached the optimal O(n logσ(m)/m) time complexity on the average [3].

Automata play a very important role in the design of efficient string matching algorithms. For instance, both the KMP and the BDM algorithms are based on finite automata; in particular, they simulate, respectively, a deterministic automaton for the language Σ*P and a deterministic suffix automaton for the language of the suffixes of P. The efficiency of such solutions is strictly influenced by the encoding used for simulating the underlying automata.

A successful technique which has been extensively used for the efficient simulation of non-deterministic automata is bit-parallelism [1]. Such an approach is at the base, for instance, of the well known Shift-Or [1] and BNDM [10] algorithms. The first is based on the simulation of the non-deterministic version of the KMP automaton, while the second is a very fast variant of the BDM algorithm, based on the bit-parallel simulation of the non-deterministic suffix automaton.

Specifically, the bit-parallel encoding takes advantage of the intrinsic parallelism of the bitwise operations inside a computer word, allowing to cut down the number of operations that an algorithm performs by a factor up to ω, where ω is the number of bits in the computer word. However, one bit per pattern symbol is required for representing the states of the full automaton, for a total of ⌈m/ω⌉ words.
Thus, as long as a pattern fits in a computer word, bit-parallel algorithms are extremely fast, otherwise their performance degrades considerably as ⌈m/ω⌉ grows. Although such a limitation is intrinsic, several techniques have been developed which retain good performance also in the case of long patterns [11,2,4,5].

In this paper, we introduce a novel approach for the bit-parallel simulation of the suffix automaton in the case of long patterns. The idea is to first search for a pruned pattern, and to subsequently verify each occurrence of the original pattern. The pruned pattern is constructed by preserving only the occurrences of a single pivot character c; the remaining characters are implicitly associated with the special wildcard symbol "∗" and are allowed to match any character of the alphabet, except for c. It turns out that the suffix automaton of a pruned pattern can be encoded using k bits, where k is the number of occurrences of c in P. Though in the worst case k = m, on the average it is much smaller than m, and, when the size of the alphabet is large enough, it is also smaller than ω even for very long patterns. From our experimental results it turns out that, under suitable conditions, our approach is able to represent patterns which far exceed the word size w.

The paper is organized as follows. Section 2 presents the basic notions which we use along the paper. In Section 3 we review the previous solutions known in the literature which make use of the bit-parallelism approach to efficiently encode the non-deterministic suffix automaton of a string. In Section 4, we present and describe in detail the new algorithm. In Section 5 we perform an experimental evaluation comparing the new algorithm with existing solutions. Finally, we draw our conclusions in Section 6.

2 Basic notions and definitions

Given a finite alphabet Σ, we denote by Σ^m, with m ≥ 0, the set of strings of length m over Σ and put Σ* = ⋃_{m∈N} Σ^m. We regard a string P ∈ Σ^m as an array P[0..m−1] of characters and denote its length by |P| = m (in particular, denoting by ε the empty string, we have |ε| = 0). Thus, P[i] is the i-th character of P, for 0 ≤ i < m, and P[i..j] is the substring of P starting at the i-th position and ending at the j-th position, for 0 ≤ i ≤ j < m. Moreover, for each character c ∈ Σ, we denote by ρP(c) the number of its occurrences in P (we refer to such value simply as ρ(c) when the pattern P is clear from the context).

For any two strings P and P′, we say that P′ is a suffix of P if P′ = P[i..m−1] for some 0 ≤ i < m, and write Suff(P) for the set of all suffixes of P. Similarly, P′ is a prefix of P if P′ = P[0..i], for some 0 ≤ i < m. In addition, we write P · P′, or more simply P P′, for the concatenation of P and P′, and P^r for the reverse of the string P, i.e. P^r = P[m−1]P[m−2]...P[0]. For a string P ∈ Σ^m, the suffix automaton of P is an automaton which recognizes the language Suff(P) of the suffixes of P.

Finally, we recall the notation of some bitwise infix operators on computer words, namely the bitwise and "&", the bitwise or "|", the left shift "≪" operator (which shifts to the left its first argument by a number of bits equal to its second argument), and the unary bitwise not operator "∼".
3 Related Results

In this section we review existing solutions for the exact online string matching problem which make use of a suffix automaton for searching for all occurrences of a pattern in a text, focusing on those algorithms which implement specific solutions to encode the automata for long patterns. Most of them are filtering based solutions, which means that they use the suffix automaton, constructed over an approximate version of the input string, for finding candidate occurrences of the pattern that must subsequently be verified for an exact occurrence by running an additional verification phase.

All algorithms reviewed in this section are variants of the well known Backward-DAWG-Matching (BDM) algorithm [3], one of the first applications of the suffix automaton to obtain pattern matching algorithms which are optimal on the average. Such an algorithm moves a window of size m on the text. For each new position of the window, the automaton of the reverse of P is used to search for a factor of P from the right to the left of the window. The basic idea of the BDM algorithm is that if the backward search fails on a letter c after the reading of a word u, then cu is not a factor of P and moving the beginning of the window just after c is safe. If a suffix of length m is recognized, then an occurrence of the pattern itself has been found. However, one of the side effects of the BDM algorithm lies in the use of the deterministic variant of the suffix automaton, since the workload required to manage the individual transitions may be not negligible and, although its construction is linear in the size of the string, the proportionality factor hidden in the asymptotic notation is particularly high, making its construction prohibitive in the case of long patterns [6].

The BNDM algorithm [10] simulates the suffix automaton for P^r by bit-parallelism. The bit-parallel representation of a suffix automaton uses an array B of |Σ| bit-vectors, each of size m, where the i-th bit of B[c] is set iff P[i] = c, for c ∈ Σ, 0 ≤ i < m. The configurations of the automaton are then encoded as a bit-vector D of m bits, where the i-th bit corresponds to state si of the suffix automaton (the initial state does not need to be represented, as it is always active): the i-th bit of D is set iff the corresponding state is active. D is initialized to 1^m and the first transition on character c is implemented as D ← (D & B[c]). Any subsequent transition on character c can be implemented as D ← ((D ≪ 1) & B[c]). A search ends when either D becomes zero (i.e., when no further prefixes of P can be found) or the algorithm has performed m iterations (i.e., when a match has been found).
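To make the description above concrete, the following is a minimal sketch, in C, of the classical BNDM scheme for the case m ≤ 64, i.e. when the whole pattern fits in a single 64-bit word. It is not the implementation evaluated in this paper: the function name bndm64, the byte alphabet and the chosen bit order (first pattern character mapped to the highest bit) are assumptions of this sketch, and other equivalent conventions are possible.

    #include <stdint.h>
    #include <stdio.h>

    /* Minimal BNDM sketch for patterns of length m <= 64: B[c] keeps one bit
     * per pattern position, D keeps the set of active states of the
     * non-deterministic suffix automaton of the reversed pattern. */
    void bndm64(const unsigned char *P, int m, const unsigned char *T, int n)
    {
        uint64_t B[256] = {0};
        for (int i = 0; i < m; i++)                  /* first char -> highest bit */
            B[P[i]] |= 1ULL << (m - 1 - i);

        int j = 0;
        while (j <= n - m) {
            int i = m - 1, last = m;
            uint64_t D = (m == 64) ? ~0ULL : ((1ULL << m) - 1);   /* D = 1^m */
            while (i >= 0 && D != 0) {
                D &= B[T[j + i]];                    /* transition on T[j+i] */
                i--;
                if (D & (1ULL << (m - 1))) {         /* a prefix of P recognized */
                    if (i >= 0) last = i + 1;        /* remember it for the shift */
                    else printf("match at %d\n", j); /* whole window matched */
                }
                D <<= 1;
            }
            j += last;                               /* safe shift */
        }
    }

For example, bndm64((const unsigned char *) "abra", 4, (const unsigned char *) "abracadabra", 11) reports matches at positions 0 and 7; the variable last records the position of the last recognized prefix of P and yields the safe shift applied at the end of each attempt.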
When the pattern size m is larger than ω, the configuration bit-vector and all auxiliary bit-vectors need to be split over ⌈m/ω⌉ multiple words. For this reason the performance of the BNDM algorithm degrades considerably as ⌈m/ω⌉ grows. A common approach to overcome this problem consists in constructing an automaton for a substring of the pattern fitting in a single computer word, to filter possible candidate occurrences of the pattern. When an occurrence of the selected substring is found, a subsequent naive verification phase allows to establish whether this belongs to an occurrence of the whole pattern. However, besides the cost of the additional verification phase, a drawback of this approach is that, in the case of the BNDM algorithm, the maximum possible shift length cannot exceed ω, which could be much smaller than m.

Peltola and Tarhio presented in [11] an efficient approach for simulating the suffix automaton using bit-parallelism for long patterns. Specifically, the algorithm (called LBNDM) works by partitioning the pattern into ⌊m/k⌋ consecutive substrings, each of k = ⌊(m − 1)/ω⌋ + 1 characters. The m − k⌊m/k⌋ remaining characters are left at either end of the pattern. Then the algorithm constructs a superimposed pattern P′ of length ⌊m/k⌋, where P′[i] is a class of characters including all characters in the i-th substring, for 0 ≤ i < ⌊m/k⌋. The idea is to first search for the superimposed pattern in the text, so that only every k-th character of the text is examined. This filtration phase is done with the standard BNDM algorithm, where only the k-th characters of the text are inspected. When an occurrence of the superimposed pattern is found, the occurrence of the original pattern must be verified. The time of the verification phase grows proportionally to m/ω, so there is a threshold after which the performance of the algorithm degrades significantly.

Durian et al. presented in [4] another efficient algorithm for simulating the suffix automaton in the case of long patterns. The algorithm is called BNDM with eXtended Shift (BXS). The idea is to cut the pattern into ⌈m/ω⌉ consecutive substrings of length ω, except for the rightmost piece which may be shorter. Then the substrings are superimposed, getting a superimposed pattern of length ω. In each position of the superimposed pattern a character from any piece (in the corresponding position) is accepted. Then a modified version of BNDM is used for searching consecutive occurrences of the superimposed pattern, using bit vectors of length ω but still shifting the pattern by up to m positions. The main modification in the automaton simulation consists in moving the rightmost bit, when set, to the first position of the bit array, thus simulating a circular automaton. Like in the case of the LBNDM algorithm, the BXS algorithm works as a filter, thus an additional verification phase is needed when a candidate occurrence has been located.

Cantone et al. presented in [2] an alternative technique, still suitable for bit-parallelism, to encode the non-deterministic suffix automaton of a given string in a more compact way. Their encoding is based on factorizations of strings in which no character occurs more than once in any factor. It turns out that the non-deterministic automaton can be encoded with k bits, where k is the size of the factorization. Though in the worst case k = m, on the average k is much smaller than m, making it possible to encode large automata in a single or few computer words. As a consequence, the resulting Factorized BNDM algorithm (FBNDM) tends to be faster in the case of sufficiently long patterns.
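As an illustration of the superimposition step shared by the LBNDM and BXS filters described above, the following sketch builds the bit-vector table of a superimposed pattern following the BXS-style cut of P into ⌈m/ω⌉ pieces of length ω. Only the construction of the class table is shown; the modified (circular) search loop and the verification phase are omitted. The 64-bit word size, the byte alphabet and the name build_superimposed are assumptions of this sketch.

    #include <stdint.h>

    #define WORD 64   /* assumed word size for this sketch */

    /* Cut P into ceil(m/WORD) pieces of length WORD (the last one possibly
     * shorter) and superimpose them: bit i of B[c] is set iff character c
     * occurs at position i of at least one piece, so position i of the
     * superimposed pattern accepts a class of characters. */
    void build_superimposed(const unsigned char *P, int m, uint64_t B[256])
    {
        for (int c = 0; c < 256; c++)
            B[c] = 0;
        for (int q = 0; q * WORD < m; q++)                     /* each piece */
            for (int i = 0; i < WORD && q * WORD + i < m; i++)
                B[P[q * WORD + i]] |= 1ULL << i;               /* accept P[q*WORD+i] at position i */
    }

The more pieces are superimposed, the more characters are accepted at each position, which is why this kind of filter loses sensitivity when the alphabet is small or the pattern is very long, as discussed in Section 5.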
4 The Pruned BNDM Algorithm

As we noticed in the previous section, the efficiency of an algorithm simulating the non-deterministic suffix automaton by bit-parallelism is heavily influenced by the length of the pattern and by the size of the resulting automaton: on the one hand, as we pointed out, the performance of such solutions degrades as m grows; on the other hand, automata constructed over longer patterns lead to larger shifts during the searching phase, when a backward scan of the window is performed. Hence the need for an efficient bit-parallel encoding able to keep as low as possible the number of words involved in the encoding and able to preserve, at the same time, the length of the pattern.

In this section, we present a new algorithm for the online exact string matching problem based on a suffix automaton constructed over an approximate version of the pattern P, which we simply call pruned pattern, where some specifically selected characters are replaced with don't care symbols. We will then show how to efficiently simulate the suffix automaton constructed over the pruned version of the pattern using bit-parallelism.

4.1 The Pruned Version of a Pattern

Let P be a pattern of size m and let T be a text of size n, both strings over a common alphabet Σ of size σ. In addition let c ∈ Σ be a character of the alphabet occurring in P, which we refer to as the pivot character. A pruned version of P over the pivot character c is a string Pc obtained by preserving in P all the occurrences of c, while the remaining positions are allowed to match any character belonging to Σ \ {c}. In other words the pruned pattern Pc is obtained from P by replacing any character in Σ \ {c} by a don't care symbol. More formally, for each c ∈ Σ, the pruned string Pc is a string of length m defined over the alphabet Σc = {c, ∗} where, for i = 0, 1, ..., m − 1:

    Pc[i] = c   if P[i] = c
    Pc[i] = ∗   otherwise

Fig. 1. The pruned version Pc, for a given pattern P, over a generic character c, and its implicit encoding. Here ρ(c) is the absolute frequency of the pivot character c in P.

The pruned string Pc of a string P is then encoded as a sequence ⟨d0, d1, ..., dρ(c)⟩, of length ρ(c) + 1, over the alphabet ΣP = {0, 1, 2, ..., m − 1}, where each element represents the number of consecutive ∗ symbols between two successive occurrences of the pivot character c, or located at the two extremities of the string (see Figure 1).

For instance, if we assume that P = abbacbbcac is a pattern of length m = 10 over the alphabet Σ = {a, b, c} and that a is the pivot character, then we have that Pa = a∗∗a∗∗∗∗a∗, where ∗ is the don't care symbol. Similarly we have Pb = ∗bb∗∗bb∗∗∗.

The string matching problem allowing for don't care symbols is a well known approximate variant of the exact matching problem [8,12], also known as string matching on indeterminate strings. It is also well known that the bit-parallel simulation of the suffix automaton of an indeterminate pattern can be easily constructed by allowing states corresponding to don't care characters to be activated by any character in the alphabet [1]. However the resulting automaton has a number of states equal to the number of characters in the pattern, inheriting the same problems as any other solution based on this technique.
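For concreteness, the following is a minimal sketch of the pruning step defined above, assuming a byte alphabet and rendering the don't care symbol as '*'; the function name prune is illustrative and this is not the implementation evaluated in Section 5.

    #include <stdlib.h>

    /* Build the pruned version P_c of P over the pivot character c: occurrences
     * of c are preserved, every other character is replaced by the don't care
     * symbol, here rendered as '*'.  The caller owns the returned buffer. */
    char *prune(const char *P, int m, char c)
    {
        char *Pc = malloc((size_t)m + 1);
        if (Pc == NULL)
            return NULL;
        for (int i = 0; i < m; i++)
            Pc[i] = (P[i] == c) ? c : '*';
        Pc[m] = '\0';
        return Pc;
    }

For example, prune("abbacbbcac", 10, 'a') returns "a**a****a*", i.e. the pruned pattern Pa of the example above.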
Here we show how the suffix automaton of a pruned pattern can be simulated using a number of bits proportional to the number of occurrences of the pivot character, leading to a filtration algorithm which may be particularly efficient for very long patterns.

Specifically, let ρ(c) be the absolute frequency of the pivot character c in P. Then the pruned string Pc of a string P can be encoded as a sequence ⟨d0, d1, ..., dρ(c)⟩ of length ρ(c) + 1 over the alphabet ΣP = {0, 1, 2, ..., m − 1}, where each element represents the number of consecutive ∗ symbols between two successive occurrences of the pivot character c, or located at the two extremities of the string. More formally, let ⟨p0, p1, ..., pρ(c)−1⟩ be the sequence of all positions in P where the pivot character occurs, with 0 ≤ p0 < p1 < ... < pρ(c)−1 < m. Then we have, for 0 ≤ i ≤ ρ(c):

    di = p0                  if i = 0
    di = pi − pi−1 − 1       if 0 < i < ρ(c)
    di = m − pρ(c)−1 − 1     if i = ρ(c)

We refer to such a representation of the pruned string Pc as its implicit encoding and we denote it by P̂c (see Figure 1). It trivially turns out that di ≤ m for 0 ≤ i ≤ ρ(c). More precisely we have

    m = ρ(c) + ∑_{i=0}^{ρ(c)} di

For instance, given the string P = banana, we have Pa = *a*a*a, Pb = b*****, Pn = **n*n*, while P̂a = ⟨1, 1, 1, 0⟩, P̂b = ⟨0, 5⟩ and P̂n = ⟨2, 1, 1⟩.

In the next sections we describe the preprocessing and the searching phases of our new algorithm, which we call the Pruned BNDM (PBNDM) algorithm, and which solves the exact string matching problem by making use of the suffix automaton of the pruned pattern.

4.2 The Preprocessing Phase

During the preprocessing phase of the PBNDM algorithm, a character c occurring in P is elected to be the pivot character. Since such a choice is arbitrary, the pivot character is selected as the character with maximum absolute frequency not exceeding the word size w, if any. If such a choice is not possible, we truncate the pattern at its longest prefix that contains at least one character with an absolute frequency not exceeding the word size w. It is easy to observe that the selection of the pivot character can be performed in O(m) time, by computing the frequencies of all the characters appearing in P. Without loss of generality we can assume that such a pivot character can be selected on the pattern P.

Let P̂c = ⟨d0, d1, ..., dρ(c)⟩ be the implicit encoding of Pc. It is a string of length ρ(c) + 1 over the alphabet Σ̂ = {0, 1, 2, ..., m} of size m + 1. The bit-parallel representation of the suffix automaton of P̂c^r is computed by means of an array B of m + 1 bit-vectors, each of size ρ(c) + 1, where the i-th bit of B[d] is set iff P̂[i] = d, for 0 ≤ d ≤ m and 0 ≤ i ≤ ρ(c). However, since the last transition of the automaton is allowed for any value greater than or equal to d0, the first bit of each bit-vector B[d] is set for any value d ≥ d0. More formally, for 0 ≤ i ≤ ρ(c) and 0 ≤ d ≤ m, B[d][i] is defined as follows:

    B[d][i] = 1   if (i = 0 and d ≥ d0) or (i > 0 and d = di)
    B[d][i] = 0   otherwise

A separate discussion should be made for the first transition performed on the automaton. Since it is admitted that the first transition can start from any position of the pattern, it is necessary to allow that, at the first transition, each i-th state can be activated by values lower than or equal to di. For this purpose an auxiliary set of m + 1 bit-vectors, called S, is defined, which is used for the simulation of the first transition on the automaton. More formally, for 0 ≤ i ≤ ρ(c) and 0 ≤ d ≤ m, S[d][i] is defined as follows:

    S[d][i] = 1   if (i = 0 and d ≥ d0) or (i > 0 and d ≤ di)
    S[d][i] = 0   otherwise
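Before turning to the pseudocode of Figure 2 below, the following is a minimal C sketch of how the implicit encoding ⟨d0, d1, ..., dρ(c)⟩ defined above can be computed with a single left-to-right scan of P; in the actual preprocessing phase the same values are obtained on the fly, scanning P from right to left through the auxiliary procedure Read. The function name implicit_encoding is an assumption of this sketch and error handling is minimal.

    #include <stdlib.h>

    /* Compute the implicit encoding <d_0, ..., d_rho(c)> of the pruned pattern
     * P_c: d_i is the length of the i-th run of non-pivot characters, i.e. the
     * gap before the first occurrence of c, between consecutive occurrences,
     * and after the last one.  The number of elements (rho(c)+1) is stored in
     * *len; the caller owns the returned array. */
    int *implicit_encoding(const char *P, int m, char c, int *len)
    {
        int *d = malloc((size_t)(m + 1) * sizeof *d);   /* at most m+1 elements */
        if (d == NULL)
            return NULL;
        int k = 0, gap = 0;
        for (int i = 0; i < m; i++) {
            if (P[i] == c) { d[k++] = gap; gap = 0; }   /* close the current gap */
            else gap++;
        }
        d[k++] = gap;                                    /* gap after the last pivot */
        *len = k;                                        /* k = rho(c) + 1 */
        return d;
    }

For P = "banana" and pivot 'a' the function returns the array {1, 1, 1, 0} with *len = 4, matching the encoding P̂a given above.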
    PBNDM(P, m, T, n)
      (c, k) ← SelectChar(P, m)
      (B, S, d0, dmax) ← Preprocess(P, m, c, k)
      j ← 0
      while j ≤ n − m do
        prfx ← 0
        D ← 1^k
        i ← m − 1
        l ← Max(i − dmax, 0)
        (w, i) ← Read(T[j..j+m−1], i, l, c)
        if w > dmax then
          j ← j + m − d0
          continue
        D ← D & S[w]
        if D & 10^(k−1) ≠ 0^k then
          prfx ← w + 1
          if prfx = m then
            if P = T[j..j+m−1] then
              output j
            prfx ← m − 1
        D ← D ≪ 1
        s ← w + 1
        while i ≥ 0 and D ≠ 0^k do
          i ← i − 1
          l ← Max(i − dmax, 0)
          (w, i) ← Read(T[j..j+m−1], i, l, c)
          D ← D & B[w]
          if D & 10^(k−1) ≠ 0^k then
            prfx ← d0 + s
            if prfx = m then
              if P = T[j..j+m−1] then
                output j
              prfx ← m − 1
          s ← s + w + 1
          D ← D ≪ 1
        j ← j + m − prfx

    Read(S, i, l, c)
      j ← i
      while j ≥ l and S[j] ≠ c do
        j ← j − 1
      return (i − j, j)

    Preprocess(P, m, c, k)
      ▷ Initialize the bit-vectors B and S
      for i ← 0 to m do
        B[i] ← 0
        S[i] ← 0
      ▷ Compute B, d0 and dmax
      s ← 1
      d0 ← −1
      dmax ← −1
      i ← m − 1
      while i ≥ 0 do
        (d, i) ← Read(P, i, 0, c)
        B[d] ← B[d] | s
        s ← s ≪ 1
        i ← i − 1
        if d0 = −1 then d0 ← d
        dmax ← max(d, dmax)
      ▷ Compute the bit-vectors S
      for i ← m − 2 downto 0 do
        S[i] ← B[i] | S[i + 1]
      ▷ Set the first bits for d ≥ d0
      for i ← d0 to dmax do
        B[i] ← B[i] | 10^(k−1)
        S[i] ← S[i] | 10^(k−1)
      return (B, S, d0, dmax)

Fig. 2. The pseudocode of the PBNDM algorithm and its auxiliary procedures.

The pseudocode of the preprocessing phase of the algorithm is shown in Figure 2. It makes use of an auxiliary procedure Read which performs a scan of the string S starting at position j = i and proceeds from right to left until a given position l ≤ i is reached or an occurrence of the pivot character c is found. It returns the pair of integers (i − j, j). The implicit encoding of Pc is computed gradually, during the initialization of the table B, through procedure Read, which computes the next element of the implicit encoding of Pc. The table S is then computed from the table B in a single-pass for loop. The time complexity of the preprocessing phase is O(m). Since di ≤ m, the space overhead to store B and S is O(m) too. Apart from the tables encoding the transitions of the suffix automaton, B and S, the preprocessing phase returns two additional integers, d0 and dmax = max{di : 0 ≤ i ≤ ρ(c)}, which are used during the searching phase.

4.3 The Searching Phase

The searching phase of the PBNDM algorithm acts using a filtering method. Specifically, it first searches for all the occurrences of the pruned pattern Pc in the text. When an occurrence of Pc is found, starting at position j of the text, the algorithm naively checks for a whole occurrence of the pattern, i.e. it checks whether P = T[j..j + m − 1].

As in the original BNDM algorithm, a window W of length m is shifted over the text, starting from the left end of the text and sliding from left to right. At each iteration of the algorithm an alignment of the window W is attempted by scanning its characters from right to left and performing the corresponding transitions on the automaton. During the backward scan, Ŵc = ⟨w0, w1, ..., wl⟩ is computed on the fly by procedure Read, and the automaton configuration, represented as a bit-vector D of ρ(c) + 1 bits, is updated accordingly.
If wl > dmax, then the prefix of Pc of size d0 has been recognized, so the window is shifted without performing any transition. Otherwise, the first transition is performed by setting D ← D & S[wl], so that all states si such that di ≥ wl are kept active. The transition on any subsequent wi, for 0 < i ≤ l, is implemented as D ← D & B[wi]. Moreover, by the definition of B and S, s0 is kept active during the i-th transition if wi ≥ d0. When, after performing the i-th transition, state s0 is active, then a prefix of Pc of size d0 + ∑_{j=i+1}^{k} wj has been recognized. Apart from the case where Pc is recognized, each attempt ends when either D becomes zero or it is established that wi > dmax while proceeding in the backward scan.

Like the original BNDM algorithm, the PBNDM algorithm has an O(nm) worst-case time complexity and an O(σ + m) space complexity.

5 Experimental Results

In this section, we compare our new algorithm against other suffix automaton based solutions, focusing on those which make use of bit-parallelism for solving the problem with long strings. In particular we included the following algorithms:

– BNDM: the Backward-Nondeterministic-DAWG-Matching algorithm [10];
– SBNDM: the Simplified BNDM algorithm [11];
– LBNDM: the Long BNDM algorithm [11];
– BXS: the BNDM algorithm [10] with eXtended Shift [4];
– FBNDM: the Factorized variant [2] of the BNDM algorithm [10];
– PBNDM: our PBNDM algorithm, presented in Section 4.

All algorithms have been implemented in the C programming language and have been tested using the Smart tool [7]. Experiments have been executed locally on a computer running Linux Ubuntu 20.04.1 with an Intel Core i5 3.40 GHz processor and 8GB RAM. Our tests have been run on a genome sequence, a protein sequence, and an English text, each of size 5MB. Such sequences are provided by the Smart research tool and are available online for download (additional details on the sequences can be found in Faro et al. [7]). In our implementations the value of the word size¹ has been fixed to w = 32, and patterns of length m were randomly extracted from the sequences, with m ranging over the set of values {2^i | 6 ≤ i ≤ 16}.

¹ The value of the word size has been chosen in order to better emphasize the scaling problems of the several bit-parallel algorithms.

Average shifts on a genome sequence
m       64    128   256   512   1024  2048  4096  8192  16384 32768 65536
BNDM    29    29    29    29    29    29    29    29    29    29    29
SBNDM   30    30    30    30    30    30    30    30    30    30    30
LBNDM   61    112   106   27    32    64    128   256   512   1024  2048
BXS     59    117   219   1     1     1     1     1     1     1     1
FBNDM   62    66    67    67    66    67    67    67    67    67    67
PBNDM   60    123   142   139   137   130   130   125   125   122   122
Gain    -3%   +5%   -35%  +107% +107% +94%  +1%   -51%  -75%  -88%  -94%

Average shifts on a protein sequence
m       64    128   256   512   1024  2048  4096  8192  16384 32768 65536
BNDM    31    31    31    31    31    31    31    31    31    31    31
SBNDM   31    31    31    31    31    31    31    31    31    31    31
LBNDM   63    126   248   470   713   353   206   286   526   1027  2048
BXS     62    125   252   502   984   1710  1635  1     1     1     1
FBNDM   63    126   146   143   141   145   144   143   144   145   144
PBNDM   56    118   244   492   979   1928  3022  3015  2942  2910  2871
Gain    -11%  -6%   -3%   -2%   -1%   +13%  +85%  +954% +459% +183% +41%

Average shifts on a natural language sequence
m       64    128   256   512   1024  2048  4096  8192  16384 32768 65536
BNDM    30    30    30    30    30    30    30    30    30    30    30
SBNDM   31    31    31    31    31    31    31    31    31    31    31
LBNDM   63    126   247   471   824   1035  797   702   910   1467  2609
BXS     62    125   252   505   1010  2008  3910  7076  10410 11015 8533
FBNDM   63    127   156   157   156   156   156   155   156   156   157
PBNDM   58    122   245   493   982   1970  3940  7784  15438 30706 60803
Gain    -7%   -3%   -2%   -2%   -3%   -2%   +1%   +10%  +48%  +178% +613%

Table 1. Average shifts achieved by bit-parallel algorithms on a genome sequence (on top), a protein sequence (in the middle) and a natural language text (on bottom).
For each length, the mean search speed over 500 runs (expressed in Gigabytes per second) and the average shift advancement have been computed. Algorithms have been compared in terms of running times and average shift advancements. Table 1 and Table 2 report, respectively, the average shift achieved by each algorithm during the searching phase and the search speed (expressed in Gigabytes per second) observed in our experimental evaluations. In both tables best results have been boldfaced to ease their localization. Table 1 also reports the gain (expressed as a percentage) with respect to the best result obtained by the previous algorithms. If the shift advancement is lower, the gain is expressed as a negative value.

Experimental results on a genome sequence
m       64    128   256   512   1024  2048  4096  8192  16384 32768 65536
BNDM    1.78  1.79  1.78  1.81  1.78  1.80  1.76  1.76  1.76  1.74  1.73
SBNDM   1.86  1.82  1.88  1.88  1.84  1.86  1.86  1.85  1.84  1.84  1.82
LBNDM   1.74  1.91  0.90  0.20  0.20  0.22  0.23  0.24  0.25  0.25  0.25
BXS     1.44  1.67  1.31  -     -     -     -     -     -     -     -
FBNDM   2.21  2.29  2.28  2.26  2.29  2.27  2.28  2.25  2.26  2.22  2.21
PBNDM   1.26  1.83  1.92  1.90  1.89  1.86  1.88  1.83  1.78  1.82  1.82

Experimental results on a protein sequence
m       64    128   256   512   1024  2048  4096  8192  16384 32768 65536
BNDM    2.24  2.23  2.21  2.23  2.17  2.21  2.19  2.19  2.21  2.21  2.16
SBNDM   2.33  2.31  2.36  2.31  2.30  2.34  2.33  2.29  2.34  2.30  2.26
LBNDM   2.34  2.56  2.70  2.81  2.63  1.32  0.60  0.47  0.45  0.46  0.47
BXS     2.18  2.36  2.61  2.52  2.49  2.16  -     -     -     -     -
FBNDM   2.48  2.71  2.67  2.71  2.68  2.68  2.67  2.67  2.68  2.65  2.38
PBNDM   1.27  1.69  2.18  2.47  2.56  2.70  2.68  2.68  2.69  2.66  2.63

Experimental results on a natural language text
m       64    128   256   512   1024  2048  4096  8192  16384 32768 65536
BNDM    1.87  1.88  1.89  1.89  1.86  1.88  1.86  1.90  1.87  1.88  1.82
SBNDM   1.94  1.95  1.93  1.95  1.92  1.94  1.95  1.92  1.91  1.94  1.90
LBNDM   2.15  2.44  2.65  2.81  2.77  2.52  1.71  1.05  0.77  0.65  0.58
BXS     1.91  2.25  2.54  2.54  2.74  2.84  2.70  2.50  0.95  0.29  -
FBNDM   2.28  2.60  2.57  2.56  2.60  2.61  2.58  2.60  2.53  2.57  2.54
PBNDM   1.07  1.59  2.14  2.38  2.45  2.76  2.84  3.64  4.93  6.18  7.40

Table 2. Experimental results (search speed in Gigabytes per second) on a genome sequence (on top), a protein sequence (in the middle) and a natural language text (on bottom).

Regarding the average shift advancement, our experimental results clearly show that PBNDM achieves the best performance in most practical cases and always provides the largest advancements in the case of long patterns. This suggests that our automaton encoding scales better as the size of the pattern increases and is thus more suitable to handle long strings. The only exception is the case of very small alphabets, where PBNDM is second to LBNDM, which is the clear winner in terms of shift advancements. In this case the shifts proposed by PBNDM can be more than 10 times shorter than those proposed by LBNDM. However, it is important to note that LBNDM searches simultaneously for m/w substrings of the pattern, each of length w (see Section 3), and that the overall progress due to the shift is always counterbalanced by the number of checks that must be performed, which corresponds to the number of superimposed substrings.
This results in poor performance in the case of very long patterns, especially when the size of the alphabet is small. In other words, the advantage obtained in the advancements is not exploited in the general performance of the algorithm. Lower performance is also shown by the BNDM and SBNDM algorithms, due to the fact that the length of the pattern is always limited to the size of the word w.

Also the BXS algorithm uses an encoding based on a superimposed pattern, and this causes it to have a degenerate behavior in the case of very long patterns, failing to scale well with respect to the length of the pattern. Specifically, it does not work well when the superimposed pattern is not sensitive enough, i.e. when too many different characters are accepted at the same position. This happens when the alphabet is too small or the pattern is too long. We can observe, indeed, that although the size of the pattern increases, the length of the superimposed pattern remains fixed at w. As a result the BXS algorithm, although particularly fast for patterns of moderate length (up to 128 or 256), is particularly slow for very long patterns, going beyond the execution time limit set for our tests, especially in the case of small alphabets. Its behavior, on the other hand, improves considerably as the size of the alphabet increases, becoming the best alternative to PBNDM for natural language texts.

The FBNDM algorithm is in general the one that performs best among the previous solutions, managing to scale the length of the shift quite well as the length of the pattern increases. The length of any factor of the pattern with no repeated characters (see Section 3) is always limited by the size of the alphabet and, as the pattern length increases, this imposes a relatively limited number of factors which can be represented by a single computer word. As a consequence this limits the shift advancement to the overall length of the factors of the pattern taken into account. Despite this limitation, FBNDM is in most cases the best alternative to PBNDM, especially in the case of small alphabets, and this translates into the best running times in most of the cases where PBNDM is not the best alternative.

On the whole we notice that the loss relative to the advancement of the shift never goes beyond 11% (with the exception of LBNDM, as discussed above), while the gain obtained in increasing the shift, in the case of longer patterns, is impressive and reaches up to 954%. This aspect is clearly reflected in the execution times, in which PBNDM is always among the first two best alternatives, becoming the fastest among all the algorithms for very long patterns. Although the gains remain moderate for small alphabets, the speed-up achieved as the alphabet size increases is such that PBNDM is up to 3 times faster than the best alternative for natural language texts.

6 Conclusions

In this paper, we introduced a new algorithm, called PBNDM, based on a novel encoding of the suffix automaton of a string, suitable for patterns exceeding the word size w. Our algorithm is based on a pruned version of the pattern, whose automaton can be encoded in an implicit form using few bits. From our experimental results our solution turns out to be competitive for searching long strings when compared against existing bit-parallel algorithms.
We observe that, although in this paper we focused on the application of the new encoding to the exact pattern matching problem, the encoding turns out to be flexible enough to be applied in all those solutions that make use of such a data structure, even in the case of non-standard and approximate pattern matching.

In our future work we intend to tune the algorithm in order to make it competitive with the most efficient algorithms in practical cases also for small alphabets, a condition in which our algorithm still suffers due to the reduced length of the shifts. This includes the use of fast loops for further improving running times, and the use of condensed alphabets for representing longer and longer patterns inside a single word.

References

1. Ricardo A. Baeza-Yates and Gaston H. Gonnet. A new approach to text searching. Commun. ACM, 35(10):74–82, 1992. doi:10.1145/135239.135243.
2. Domenico Cantone, Simone Faro, and Emanuele Giaquinta. A compact representation of nondeterministic (suffix) automata for the bit-parallel approach. In Amihood Amir and Laxmi Parida, editors, Combinatorial Pattern Matching, 21st Annual Symposium, CPM 2010, New York, NY, USA, June 21-23, 2010, Proceedings, volume 6129 of Lecture Notes in Computer Science, pages 288–298. Springer, 2010. doi:10.1007/978-3-642-13509-5_26.
3. Maxime Crochemore and Wojciech Rytter. Text Algorithms. Oxford University Press, 1994. URL: http://www-igm.univ-mlv.fr/%7Emac/REC/B1.html.
4. Branislav Durian, Hannu Peltola, Leena Salmela, and Jorma Tarhio. Bit-parallel search algorithms for long patterns. In Paola Festa, editor, Experimental Algorithms, 9th International Symposium, SEA 2010, Ischia Island, Naples, Italy, May 20-22, 2010, Proceedings, volume 6049 of Lecture Notes in Computer Science, pages 129–140. Springer, 2010. doi:10.1007/978-3-642-13193-6_12.
5. Simone Faro and Thierry Lecroq. A fast suffix automata based algorithm for exact online string matching. In Nelma Moreira and Rogério Reis, editors, Implementation and Application of Automata - 17th International Conference, CIAA 2012, Porto, Portugal, July 17-20, 2012, Proceedings, volume 7381 of Lecture Notes in Computer Science, pages 149–158. Springer, 2012. doi:10.1007/978-3-642-31606-7_13.
6. Simone Faro and Thierry Lecroq. The exact online string matching problem: A review of the most recent results. ACM Comput. Surv., 45(2):13:1–13:42, 2013. doi:10.1145/2431211.2431212.
7. Simone Faro, Thierry Lecroq, Stefano Borzi, Simone Di Mauro, and Alessandro Maggio. The string matching algorithms research tool. In Jan Holub and Jan Zdárek, editors, Proceedings of the Prague Stringology Conference 2016, Prague, Czech Republic, August 29-31, 2016, pages 99–111. Department of Theoretical Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2016. URL: http://www.stringology.org/event/2016/p09.html.
8. M. J. Fischer and M. S. Paterson. String matching and other products. In R. Karp, editor, Complexity of Computation (SIAM-AMS Proceedings 7), volume 7, pages 113–125, USA, 1974. Massachusetts Institute of Technology.
9. Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323–350, 1977. doi:10.1137/0206024.
10. Gonzalo Navarro and Mathieu Raffinot. A bit-parallel approach to suffix automata: Fast extended string matching. In Martin Farach-Colton, editor, Combinatorial Pattern Matching, 9th Annual Symposium, CPM 98, Piscataway, New Jersey, USA, July 20-22, 1998, Proceedings, volume 1448 of Lecture Notes in Computer Science, pages 14–33. Springer, 1998. doi:10.1007/BFb0030778.
11. Hannu Peltola and Jorma Tarhio. Alternative algorithms for bit-parallel string matching. In Mario A. Nascimento, Edleno Silva de Moura, and Arlindo L. Oliveira, editors, String Processing and Information Retrieval, 10th International Symposium, SPIRE 2003, Manaus, Brazil, October 8-10, 2003, Proceedings, volume 2857 of Lecture Notes in Computer Science, pages 80–94. Springer, 2003. doi:10.1007/978-3-540-39984-1_7.
12. Ron Y. Pinter. Efficient string matching with don't-care patterns. In Alberto Apostolico and Zvi Galil, editors, Combinatorial Algorithms on Words, pages 11–29, Berlin, Heidelberg, 1985. Springer Berlin Heidelberg.