Bit-Layers Text Representation for Efficient Text Processing∗ †

Domenico Cantone, Simone Faro, and Stefano Scafiti
Università di Catania, Viale A. Doria n.6, 95125 Catania, Italy
domenico.cantone@unict.it, faro@dmi.unict.it, stefano.scafiti@studium.unict.it

Abstract. Textual data still remains the main format for storing information, which justifies why text processing is among the most relevant topics in computer science. However, although the capability to store information is growing fast, the amount and complexity of textual data grow faster than storage capacities. In many cases, the problem is not due to the size or complexity of the text, but rather to the representations (or data structures) employed for carrying out the needed processing. In this paper, we show the potential and the benefits of a straightforward text representation, referred to as the Bit-Layers text representation, which turns out to be particularly suitable for fast text searching, while still retaining the standard efficiency in the rest of the basic text processing tasks. To show the advantages of the Bit-Layers representation, we also present a family of simple algorithms, tuned to it, for solving some classical and non-classical string-matching problems. Such algorithms turn out to be particularly suitable for implementation in modern hardware, and very fast in practice. Preliminary experimental results show that in some cases these algorithms are by far faster than their counterparts based on the standard text representation.

Keywords: Text processing · representation · experimental algorithms

1 Introduction

Text processing is one of the most relevant topics in computer science. It includes, among other problems, exact and approximate string matching, which are still among the most fundamental problems in computer science. Textual data remains indeed the main form for storing information, even though data are stored in different ways.
Hence the need for ever faster solutions to text processing problems. However, it turns out that the amount of available textual data grows faster than storage capacities and, as a consequence, performing text processing in the main memory of present-day computers is becoming more and more arduous. On the other hand, since processing texts in main memory is by far faster than on disks,1 operating in main memory is crucial for carrying out efficient text processing. In many cases, the problem is not the size of the texts stored, but rather the data structures that must be built on the texts in order to efficiently carry out the processing [16]. The representation of a DNA sequence is representative of this problem: to encode a human genome, which is about 3.3 billion bases, slightly less than 800 MB are required, if one uses only 2 bits per character. On the other hand, the standard text representation uses 8 bits per character, requiring a total of more than 2 GB, whereas the corresponding suffix tree requires at least 10 bytes per base, that is more than 30 GB, too large for many practical applications.

Several works appeared in recent years aiming at finding a tradeoff between the space requirements needed to encode huge texts and complex data structures and the time efficiency of the text searching algorithms designed to work with such representations. Among the most relevant approaches in this direction, we mention compressed text searching, in which processing takes place online directly on compressed data in order to speed up searching with reduced extra space.

∗ We gratefully acknowledge support from "Università degli Studi di Catania, Piano della Ricerca 2016/2018 Linea di intervento 2".
† Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
This problem has been widely investigated in the last few years and, although efficient solutions exist for searching under standard compression schemes,2 they still require significant implementation efforts and turn out to be not very flexible, being designed for specific tasks. A more suitable solution is compact text processing, which aims at storing information using succinct text representations, allowing, at the same time, fast in-place text processing with no decompression. The concept of succinct data representations and succinct data structures was originally introduced by Jacobson [15] to encode bit vectors, trees, and graphs, and has recently been brought back to attention [16] for the reasons discussed above.

Improving results in this direction, we discuss in this paper the advantages of using the Bit-Layers text representation, a succinct and quite basic string representation, which turns out to be particularly suitable for fast text searching, while retaining standard efficiency in the rest of the basic text processing applications. We also present a family of simple algorithms for solving some classical and non-classical string-matching problems, tuned to our proposed text representation, which turn out to be particularly suitable for implementation in modern hardware and very fast in practice. Preliminary experimental results show that in some cases these algorithms are by far faster than their counterparts based on the standard text representation.

1 In present-day computers, accessing a text in main memory is about 10^5 times faster than on secondary memory.
2 The best solutions for compressed string matching use less than 70% extra space for the text and are twice as fast in searching than standard online string matching algorithms.
2 Standard and Succinct Text Representations

In this section we briefly review the most relevant text representations that are generally used to code a string y in main memories with a word size of w bits. We shall assume that the block size w is fixed, so that all references to a string will only be to entire blocks of w bits. For simplicity, we refer to a w-bit block as a byte, though values larger than w = 8 could be supported as well.

Let y be a string of length n > 0 of characters from a finite alphabet Σ of size σ, such that ℓ := ⌈log(σ)⌉ ≤ w. In the standard text representation, the string y is coded as an array Sy of n blocks, each of size w, that can be read and written at arbitrary positions, and where the i-th block of Sy contains the binary representation of the i-th character of y. We denote by ρ(c) the binary representation of c ∈ Σ, so that ρ(c)[j], with 0 ≤ j < ℓ, is the j-th bit of ρ(c), bit 0 being the least significant one. As long as ℓ ≤ w, each character can be read and processed in a single operation, otherwise ⌈log(σ)/w⌉ operations are required.

Any array A of n blocks of size w can be regarded as a virtual bit array Â of nw bits, where each bit can be processed at the cost of a single operation. Conversely, any bit string B̂ of length m can be seen as an array B of ⌈m/w⌉ blocks. Thus we have that B̂[i] is the j-th bit of B[⌊i/w⌋], where j := i mod w. Since the length m of a binary string need not necessarily be a multiple of w, the last block may be only partially defined.

This approach turns out to be very simple and fast, as read and write operations of a given character can be done in constant time, by direct access to the position of the array where the character is stored or needs to be written. However, such a simple representation may waste a lot of space, since wn bits are used, where just ℓn would suffice.

A succinct representation of the string y has been discussed in [16].
It allocates an array C of ⌈ℓn/w⌉ blocks, which is enough to encode n elements of ℓ bits. We regard the ℓn bits stored in C as a virtual bit array Ĉ of ℓn bits, where each character y[i] is stored at Ĉ[iℓ .. (i + 1)ℓ − 1]. Also in this case any character y[i] can be accessed in constant time, although a single access may require reading more than one word. Solutions tuned to this succinct text representation exist [17, 8], aiming at speeding up text searching; however, it turns out that the gain is quite poor in practice, and is obtained only when ℓ divides w exactly.

For completeness, we mention other succinct text representations, based on variable-length encodings, which are related to our approach since they provide the enviable feature of giving direct access to the text. We mention those based on sampling [11], the Elias-Fano-based representation [6, 7], interpolative coding [18], the wavelet tree [12], and Direct Addressable Codes (DACs) [3]. Such text representations are specifically designed for succinctness and are not competitive for text processing tasks, if compared against standard text representations. Notably, our proposed bit-layers representation is strongly related to Direct Addressable Codes (DACs) where, as in our approach, the bits of each encoding are stored in different bit sequences. However, DACs use variable-length encodings (whereas our representation encodes all characters of the alphabet with the same number of bits, thus enhancing text processing performance) and maintain additional information about the structure of the layers.

3 Bit-Layers Text Representation

Assume again that y is a string of length n over an alphabet Σ of size σ. As in the case of the succinct text representation described above, let us suppose that each character in Σ is represented by ℓ := ⌈log(σ)⌉ bits.
The bit-layers text representation codes the string y as an ordered collection of ℓ binary strings of length n, ⟨B̂0, B̂1, ..., B̂ℓ−1⟩ (the reference to y has been omitted for conciseness), where the i-th binary string B̂i is the sequence of the i-th bits of the characters in y, in the order in which they appear in y. We refer to the bit vectors B̂0, B̂1, ..., B̂ℓ−1 induced by such a representation as the bit layers of the encoding. More formally, letting c ↦ ρ(c) be the encoding map, for c ∈ Σ, then

  B̂i := ⟨ρ(y[0])[i], ρ(y[1])[i], ..., ρ(y[n − 1])[i]⟩, for i = 0, ..., ℓ − 1.

Thus, each layer B̂i can be regarded as an array Bi of ⌈n/w⌉ blocks of size w.

Example 1. Let y = abfefdgabaadefcc be a string of 16 characters over the alphabet Σ = {a, b, c, d, e, f, g}, with σ = 7, so that each character can be represented by ⌈log(σ)⌉ = 3 bits. Assume therefore that ρ(a) = 000, ρ(b) = 001, ρ(c) = 010, ρ(d) = 011, ρ(e) = 100, ρ(f) = 101 and ρ(g) = 110. According to our bit-layers text representation, the string y can be represented by the following sequence of 3 binary vectors, each one stored in ⌈n/w⌉ = 2 bytes:

  B0 = 108.148    B̂0 = 01101100.10010100
  B1 = 006.019    B̂1 = 00000110.00010011
  B2 = 058.012    B̂2 = 00111010.00001100

Since the bit B̂i[j] is stored at the (j mod w)-th most significant bit position of the block Bi[⌊j/w⌋], the character y[i] can readily be retrieved by the following formula: y[i] = ⟨B̂ℓ−1[i], B̂ℓ−2[i], ..., B̂0[i]⟩.

Procedure read in Figure 1 accesses character y[i] using the bit-layers text representation for coding y in Θ(ℓ) time, which is worse than the constant time needed for reading a character under the standard or the succinct representations. However, our experimental results show that, in practical cases, the access speed to text characters using the bit-layers text representation is as fast as with other representations (see Section 4).
The bit-layers representation arranges textual data in a two-dimensional structure, so that text processing may proceed both horizontally, by reading bits (or bit blocks) along a binary vector in a given layer, and vertically, by moving from a given layer to the next one while reconstructing the encoding of a character (or of a group of characters).

  readbit(B, j): return (B ≫ (w − 1 − j)) & 1

  read(i):
    c ← 0
    for h ← 0 to ℓ − 1 do
      c ← c | (readbit(Bh[⌊i/w⌋], i mod w) ≪ h)
    return c

Fig. 1. Procedure read for retrieving the i-th character of a text coded using the bit-layers text representation. We assume that each character of the alphabet can be represented by ℓ = ⌈log(σ)⌉ ≤ w bits.

Thanks to its two-dimensional structure, the bit-layers text representation boasts many favourable and interesting features. First of all, it naturally allows parallel computation on textual data. Since a string is partitioned into ℓ independent binary vectors, it is straightforward to entrust the processing of each bit vector to a different processor, provided that the corresponding ℓ outputs are then blended.

The bit-layers representation is also well suited for parallel accessing of multiple data. A single computer word maintains, indeed, partial information about w contiguous characters. Although we can access only a fraction (namely one ℓ-th) of a character encoding, such information can be processed in constant time, also by exploiting the intrinsic parallelism of bitwise operations, which allows one to cut down the horizontal processing by a factor up to w.

In addition, it also allows adaptive accesses to textual data. This means that processing may not always need to go all the way in depth along the layers of the representation, as in favourable cases accesses can stop already at the initial layers of the representation of a given character.
For instance, assume we are interested in searching for all occurrences of the character a in a given text y, where ρ(a) = 000. If, while inspecting character y[i], it is discovered that B̂0[i] = 1, then such a position can immediately be marked as a mismatch. This feature may allow one, under suitable conditions, to cut down the vertical data processing by a factor up to ℓ.

Finally, the bit-layers representation turns out to be also well suited for cache-friendly accesses to textual data. Indeed, the cache-hit rate is crucial for good performance, as each cache miss results in fetching data from primary memory (or, worse, from secondary memory), which takes considerable time.3 In comparison, reading data from the cache typically takes only a handful of cycles. According to the principle of spatial locality, if a particular vector location is referenced at a particular time, then it is expected that nearby positions will be referenced in the near future. Thus, in present-day computers, it is common to attempt to guess the size of the area around the current position for which it is worthwhile to prepare for faster accesses for subsequent references. Assume, for instance, that such a size is of k positions, so that, after a cache miss, when a certain block of a given array is accessed, the next k blocks are also stored in cache memory. Then, under the standard text representation, we have at most k supplementary text accesses before the next cache miss, whereas in our bit-layers representation such a number may grow by a factor up to ℓ, resulting, under suitable conditions, in faster accesses to text characters.

In the following section, we shall present some preliminary evidence that the bit-layers text representation is particularly appropriate for fast text processing.

3 Data fetching takes hundreds of cycles from the primary memory and tens of millions of cycles from secondary memory.
Specifically, we shall provide basic solutions to some standard and non-standard search problems, comparing the results obtained against those achieved under the standard representation.

4 Text Searching Using Bit-Layers Representation

Text processing, and especially text searching, can be counted among the most significant topics in computer science. Many times over the years, standard approaches to text searching have been influenced, or even transformed, by particularly innovative studies. Among them, we mention the Boyer-Moore algorithm [4, 13], the first practical approach to text searching, and bit-parallelism [2], which focused, respectively, on the search strategy and on the method of building or representing effective data structures to speed up the search process, even by carrying it out in parallel.

The bit-layers text representation presented in this paper aims at improving text searching by moving enhancements to the very bottom of the problem, namely the encoding level, allowing a natural and more significant parallel computation, particularly suitable for hardware implementation. For these reasons, we believe that such a representation is a valuable contribution to any application dealing with text searching.

Although the representation can inspire several technical and creative approaches to text searching, in this first paper we just show a few straightforward solutions tuned to the proposed representation. Specifically, we present some solutions based on the following two approaches: horizontal word parallelism and vertical adaptive parallelism. In horizontal word parallelism, textual data are read in chunks of w bits, from all the layers of the representation, and partial information is processed in parallel. In the vertical adaptive parallelism approach, we proceed much in the same way; however, one does not always need to go all the way in depth along the layers of the representation.
In the following sections, we present solutions to some basic text processing tasks and report the related experimental data that allow us to compare our proposed algorithms against those tuned for the standard text representation.

In our experiments, all algorithms have been implemented in the C programming language, using 32-bit words (i.e. w = 32), and have been tested with the Smart research tool [10], which is available online at the URL http://www.dmi.unict.it/~faro/smart/.4 All experiments have been executed locally on a MacBook Pro with 4 cores, a 2 GHz Intel Core i7 processor, 16 GB RAM 1600 MHz DDR3, 256 KB of L2 cache, and 6 MB of L3 cache. In addition, all algorithms have been compared in terms of their running times, excluding preprocessing times, on the following real data sets: a genome sequence, a protein sequence, and an English text. Each of the three sequences has a length of 15 MB, and can be downloaded from the Smart tool. In the experimental evaluation, patterns of length m have been randomly extracted from the sequences, with m ranging over the set of values {2^i | 2 ≤ i ≤ 5}. For each experiment, the mean over the processing speed (expressed in billions of characters per second) of 500 runs has been reported.

4.1 Simple Text Scanning

Simple text scanning is a basic task in text processing. It consists in reading the characters of a given text one after the other in order to extract some specific features from the textual data. In this section, we focus on the following straightforward tasks: (a) computing the absolute frequencies of all the characters occurring in a text, and (b) computing the absolute frequency of a specific character in a text.

Assume y is a text of length n over an alphabet Σ of size σ. The first task consists in counting the occurrences in y of each character of the alphabet.
By using the standard representation, this problem can be solved by a simple iterative cycle that sweeps all characters of the text while increasing the corresponding counters. With the bit-layers representation, task (a) can be accomplished by scanning the text by way of horizontal word parallelism. Initially, a table T of size 2^8 is precomputed, where, for each 8-bit vector B̂, T[B̂] is an array of 8 characters (which can be regarded altogether as a 64-bit vector) such that, for 0 ≤ k < 8,

  T[B̂][k] := if B̂[k] = 0 then 00000000 else 00000001 endif.

Then, during the scanning phase, for each 0 ≤ j < ⌈n/8⌉, a 64-bit vector

  X := Σ_{i=0}^{ℓ−1} (T[Bi[j]] ≪ i)

is computed and, by regarding the vector X as an 8-character array, the counter of each character X[k] is increased, for 0 ≤ k < 8.

Task (b) consists in counting the occurrences of a given input character c ∈ Σ in the string y. By using the standard representation, this task can be carried out through a simple iterative cycle that, while sweeping all characters of the text, increases a counter by 1 whenever an occurrence of c is found. With the bit-layers representation, task (b) can be accomplished through a scan of the text based on vertical adaptive parallelism. Initially, ℓ binary vectors Pi (each of size w), for 0 ≤ i < ℓ, are precomputed, where

  Pi := if ρ(c)[i] = 1 then 0^w else 1^w endif.

4 The C code of all tested algorithms is available at http://www.dmi.unict.it/~faro/BLE/

                 Frequencies Count           Occurrence Count
              genome  protein  english    genome  protein  english
  Standard     0.782    0.880    0.880     1.766    1.552    1.572
  Bit-Layers   1.043    1.032    1.024     13.88    4.854    4.065

Table 1. Speeds (in billions of characters per second) of the frequency and occurrence count tasks under the standard and the bit-layers representations. Best results have been reported in bold.

In the subsequent scanning phase, the bit layers of the string y are read in blocks of w bits.
Specifically, for 0 ≤ j < ⌈n/w⌉, a sequence of ℓ bit vectors ⟨Xi : 0 ≤ i < ℓ⟩ is computed, where X0 = B0[j] xor P0 and Xi = Xi−1 and (Bi[j] xor Pi), for 1 ≤ i < ℓ, provided that all the Xi's are nonnull. (Note that the vector Bi[j] xor Pi has a 1 exactly at the positions whose i-th layer bit agrees with ρ(c)[i], so the accumulation is performed with the bitwise and.) It can easily be checked that y[jw + k] = c ⟺ Xℓ−1[k] = 1 holds. Thus, if Xℓ−1 ≠ 0, the counter is increased by bitcount(Xℓ−1) units. Plainly, such a procedure is adaptive since, during each iteration, say the j-th one with 0 ≤ j < ⌈n/w⌉, as soon as Xi = 0 for any 0 ≤ i < ℓ, the iteration is aborted and the execution resumes with the (j + 1)-st iteration.

Table 1 reports the experimental results relative to the above two tasks with both the standard and the bit-layers representations, where speeds are expressed in billions of characters per second. As for the frequency count task (a), the bit-layers representation allows for a slightly faster access to textual data than the standard representation, especially in the case of small alphabets, characterized by a modest number of layers. When the size of the alphabet increases, our approach unavoidably gets slower. Concerning the occurrence count task (b), the vertical adaptive parallel approach based on the bit-layers representation achieves by far the best results, as it is from 2.5 times (for English texts) up to 8 times (for genome sequences) faster than the standard approach.

4.2 Exact String Matching

The exact string matching problem is another basic task in text processing. It consists in finding all the (possibly overlapping) occurrences of an input pattern x of length m within a text y of length n, both strings over a common alphabet Σ of size σ. More formally, the problem aims at finding all positions j in y[0 .. n−m] such that y[j .. j + m − 1] = x. A huge number of solutions has been devised since the 1980s [9] and, despite such a wide literature, much work has still been produced in the last few years, demonstrating that the need for efficient solutions remains high.
We designed the following three basic solutions to the exact string matching problem, tuned to the bit-layers text representation.

Brute-force algorithm based on horizontal word parallelism (Bf∗H): For each position j in y[0 .. n − m], the chunk B̂i[j .. j + m − 1] from the i-th layer of y, for each 0 ≤ i < ℓ, is compared with the corresponding layer of the pattern, D̂i[0 .. m − 1]. If B̂i[j .. j + m − 1] = D̂i[0 .. m − 1] for all 0 ≤ i < ℓ, then a match is reported at position j. The algorithm Bf∗H works adaptively since, as soon as B̂i[j .. j + m − 1] ≠ D̂i[0 .. m − 1] for some 0 ≤ i < ℓ, the iteration for position j stops, and a new iteration starts from position j + 1 in y[0 .. n − m], if any.

               genome                   protein                  english
  m        4    8    16   32        4    8    16   32        4    8    16   32
  Bf     0.26 0.26 0.26 0.26      0.52 0.52 0.53 0.52      0.47 0.50 0.50 0.50
  Sa     0.55 0.54 0.55 0.55      0.50 0.50 0.50 0.52      0.50 0.50 0.50 0.50
  Hor    0.27 0.35 0.40 0.40      0.45 0.68 0.90 1.05      0.43 0.65 0.86 1.05
  Wfr    0.49 0.91 1.29 1.71      0.74 1.03 1.28 1.64      0.72 1.03 1.30 1.65
  Bf∗H   0.62 0.94 0.96 0.97      0.54 0.84 0.87 0.86      0.47 0.82 0.86 0.86
  Pfx∗   0.51 1.55 1.64 1.65      0.45 1.48 1.46 1.44      0.42 1.34 1.45 1.46
  Sfx∗   0.79 1.70 3.10 5.20      0.69 3.35 2.64 4.50      0.66 1.66 2.85 4.67

Table 2. Experimental results relative to the comparison of some string matching algorithms under the standard representation and the bit-layers representation. Speed is expressed in billions of characters per second and the best results are boldfaced.

Prefix-based algorithm based on vertical adaptive parallelism (Pfx∗): It is a prefix-based improvement of the algorithm Bf∗H described above. A prefix table π : {0, ..., 2^k − 1} → {1, ..., k}, with k = min(m, 8), is precomputed, where, for any given binary vector B̂ of length k (which can be regarded as an integer in {0, ..., 2^k − 1}), we have:

  π(B̂) := min({s : 1 ≤ s < k and B̂[s .. k − 1] = B̂[0 .. k − s − 1]} ∪ {k}).
At the end of each iteration of the searching phase, the current window is shifted by π(B̂0[j .. j + k − 1]) positions to the right.

Suffix-based algorithm based on vertical adaptive parallelism (Sfx∗): It is another improvement (suffix-based) of the brute-force algorithm described earlier. A suffix table ψ : {0, ..., 2^k − 1} → {1, ..., k}, with k = min(m, 8), is precomputed, where, for any given binary vector B̂ of length k (which as before can be regarded as an integer in {0, ..., 2^k − 1}), we have:

  ψ(B̂) := min({s : 1 ≤ s < k and D̂[m − s − k .. m − s − 1] = B̂[0 .. k − 1]} ∪ {k}).

At the end of each iteration of the searching phase, the current window is shifted by ψ(B̂0[j + m − k .. j + m − 1]) positions to the right.

Table 2 shows the experimental results relative to the above algorithms (tuned to the bit-layers representation) against the following known solutions to the exact string matching problem (tuned to the standard representation), which are considered among the fastest algorithms in practical cases [5], namely the brute-force algorithm (Bf), the Boyer-Moore-Horspool algorithm (Hor) [13], the Shift-And algorithm (Sa) based on bit-parallelism [2], and the Weak-Factor-Recognition algorithm (Wfr) implemented with q-grams (1 ≤ q ≤ 4).

From Table 2, it turns out that algorithm Bf∗H is always faster (up to 3 times) than its counterpart Bf and, in most cases, it is even faster than the Horspool algorithm (Hor). The suffix-based algorithm Sfx∗, tuned to the bit-layers representation, is in almost all cases the fastest algorithm, even faster (up to 3 times) than the algorithm Wfr, showing a sub-linear behaviour which rapidly improves as the length of the pattern increases.

4.3 String Matching with Mismatches

Approximate string matching has several variations.
In this section we consider the δ-mismatches variation, where the task is to find all the occurrences of x with at most δ mismatches, where 0 ≤ δ < m.5

Shift-Add [2] was the first practical algorithm for solving the δ-mismatches problem. It is based on bit-parallelism, where a vector of m states is used to represent the state of the search. A field of log(m) + 1 bits is used for representing each of the m states. When a mismatch is detected, the corresponding state is increased accordingly. Thus, a match is detected at a given position when the last state has a value less than or equal to δ. As in the case of other bit-parallel solutions, when m(log(m) + 1) is greater than w, multiple words need to be involved in the computation.

In the context of the bit-layers representation, we implemented the following algorithm:

Brute-force algorithm based on vertical adaptive parallelism (Bf∗V): Given as before a text y of length n and a pattern of length m ≤ n, for each position j in y[0 .. n − m], the chunk B̂i[j .. j + m − 1] from the i-th layer of y, for 0 ≤ i < ℓ, is compared with the corresponding layer D̂i[0 .. m − 1] of the pattern, by way of the xor bitwise operation. The result is a bit vector R̂i of size m, such that R̂i[k] = 1 ⟺ B̂i[j + k] ≠ D̂i[k], for 0 ≤ k < m. Let R̂(0,i) := R̂0 or R̂1 or ... or R̂i. If bitcount(R̂(0,ℓ−1)) ≤ δ, then an occurrence of the pattern is reported at position j. The algorithm Bf∗V works adaptively since, as soon as bitcount(R̂(0,i)) > δ for some i < ℓ, the iteration at position j stops, and the subsequent iteration starts from position j + 1 in y[0 .. n − m], if any.

5 In this context, when δ = 0 the approximate problem just reduces to the exact string matching problem.

Table 3 reports the experimental results relative to the comparison of three algorithms for the approximate string matching problem allowing for at most δ mismatches, with δ ∈ {1, 3}.
Specifically, we tested a brute-force algorithm (Bf) and the Shift-Add (Sa) bit-parallel algorithm [2] (both tuned to the standard text representation) and our brute-force algorithm (Bf∗V) described above, tuned to our novel bit-layers representation.

                     genome                   protein                  english
        m        4    8    16   32        4    8    16   32        4    8    16   32
  δ=1   Bf     0.13 0.12 0.12 0.12      0.19 0.19 0.18 0.18      0.19 0.18 0.19 0.18
        Sa     0.49 0.49 0.22 0.14      0.43 0.43 0.19 0.12      0.43 0.43 0.18 0.13
        Bf∗V   0.16 0.32 0.39 0.32      0.13 0.28 0.34 0.29      0.12 0.27 0.34 0.28
  δ=3   Bf     0.19 0.08 0.08 0.08      0.17 0.11 0.11 0.11      0.17 0.11 0.11 0.11
        Sa     0.50 0.50 0.22 0.14      0.43 0.43 0.19 0.12      0.43 0.43 0.19 0.12
        Bf∗V   0.17 0.15 0.36 0.33      0.05 0.12 0.31 0.28      0.05 0.12 0.32 0.29

Table 3. Experimental results relative to the comparison of three algorithms for the approximate string matching problem allowing for at most δ mismatches, with δ ∈ {1, 3}. Speed is expressed in billions of characters per second. Best results are boldfaced.

For short patterns, our algorithm Bf∗V turns out to be slower than the Shift-Add bit-parallel algorithm, and it gets slower as the bound δ increases. This is due to the need, when the bound gets larger and larger, to go all the way in depth along the layers of the representation. However, thanks to its horizontal word parallelism, our solution shows a marked sub-linear behaviour, getting faster and faster (up to 2 times faster) than the Shift-Add algorithm in the case of patterns of length greater than or equal to 16. In fact, as the pattern length increases, the Shift-Add algorithm gets slower and slower because of the need to involve more and more words in its computation, widening the performance gap with our algorithm Bf∗V.
5 Conclusions and Future Work

In this paper we introduced the bit-layers text representation, a novel succinct string representation in which textual data are arranged into a two-dimensional structure, which allows text processing to proceed both horizontally (by handling multiple data in constant time) and vertically (by moving in depth along the layers of the representation). The bit-layers text representation aims at improving text searching solutions, and we believe that it may represent a valuable contribution to applications dealing with text searching. To substantiate this point, we showed how to solve some basic text processing tasks using the bit-layers representation, and presented the results of an experimental comparison of our solutions tailored to the bit-layers representation against those tuned for the standard text representation. Our solutions take particular advantage of the horizontal word parallelism and of the vertical adaptive parallelism, intrinsic in the two-dimensionality of the bit-layers representation, which could kick off several technical and creative approaches to text searching.

We plan to study other, more effective solutions, tuned to the bit-layers representation, to further basic text processing problems, and also to investigate other non-standard searching problems amenable to fast solutions under our proposed text representation.

References

1. V. N. Anh and A. Moffat. Inverted index compression using word-aligned binary codes. Information Retrieval, 8, 151–166 (2005)
2. R. Baeza-Yates and G. H. Gonnet. A new approach to text searching. Commun. ACM, 35(10), 74–82 (1992)
3. N. R. Brisaboa, S. Ladra, G. Navarro. DACs: Bringing direct access to variable-length codes. Information Processing and Management, 49, 392–404 (2013)
4. R. S. Boyer, J. S. Moore. A fast string searching algorithm. Commun. ACM, 20(10), 762–772 (1977)
5. D. Cantone, S. Faro, A.
Pavone. Speeding up string matching by weak factor recognition. In Proc. of Stringology 2017, pp. 42–50 (2017)
6. P. Elias. Efficient storage and retrieval by content and address of static files. Journal of the ACM, 21, 246–260 (1974)
7. R. Fano. On the number of bits required to implement an associative memory. Memo 61, Computer Structures Group, Project MAC, Massachusetts (1971)
8. S. Faro, T. Lecroq. An efficient matching algorithm for encoded DNA sequences and binary strings. In Proc. of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM 2009), Lecture Notes in Computer Science, vol. 5577, Springer-Verlag, pp. 106–115 (2009)
9. S. Faro, T. Lecroq. The exact online string matching problem: a review of the most recent results. ACM Computing Surveys (CSUR), 45(2), article 13 (2013)
10. S. Faro, T. Lecroq, S. Borzì, S. Di Mauro, and A. Maggio. The String Matching Algorithms Research Tool. In Proc. of Stringology, pp. 99–111 (2016)
11. P. Ferragina and R. Venturini. A simple storage scheme for strings achieving entropy bounds. In Proc. 18th Symp. on Discrete Algorithms (SODA), pp. 690–696 (2007)
12. R. Grossi, A. Gupta and J. Vitter. High-order entropy-compressed text indexes. In Proc. 14th Symp. on Discrete Algorithms (SODA), pp. 841–850 (2003)
13. R. N. Horspool. Practical fast searching in strings. Software: Practice & Experience, 10(6), 501–506 (1980)
14. D. A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the Institute of Radio Engineers (IRE), 40(9), 1098–1101 (1952)
15. G. J. Jacobson. Succinct Static Data Structures. Ph.D. thesis, Carnegie Mellon University, Pittsburgh, PA (1988)
16. G. Navarro. Compact Data Structures: A Practical Approach. Cambridge University Press, New York, NY, USA (2016)
17. H. Peltola, J. Tarhio. On string matching in chunked texts. In Proc. of CIAA 2007, Lecture Notes in Computer Science, vol. 4783, Springer-Verlag, pp. 157–167 (2007)
18. J. Teuhola.
Interpolative coding of integer sequences supporting log-time random access. Information Processing and Management, 47, 742–761 (2011)
19. H. E. Williams and J. Zobel. Compressing integers for fast file access. Computer Journal, 42(3), 193–201 (1999)
20. M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In Proc. 22nd International Conference on Data Engineering (ICDE), p. 59, IEEE Computer Society, Washington, DC, USA (2006)