<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bit-Layers Text Representation for Efficient Text Processing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Domenico Cantone</string-name>
          <email>domenico.cantone@unict.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Faro</string-name>
          <email>faro@dmi.unict.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Scafiti</string-name>
          <email>stefano.scafiti@studium.unict.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università di Catania</institution>
          ,
          <addr-line>Viale A.Doria n.6, 95125 Catania</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>24</lpage>
      <abstract>
        <p>Textual data still remains the main format for storing information, which justifies why text processing is among the most relevant topics in computer science. However, although storage capacity is growing fast, the amount and complexity of textual data grow faster still. In many cases, the problem is not due to the size or complexity of the text, but rather to the representations (or data structures) employed for carrying out the needed processing. In this paper, we show the potential and the benefits of a straightforward text representation, referred to as the Bit-Layers text representation, which turns out to be particularly suitable for fast text searching, while still retaining the standard efficiency in the other basic text processing tasks. To show the advantages of the Bit-Layers representation, we also present a family of simple algorithms, tuned to it, for solving some classical and non-classical string-matching problems. Such algorithms turn out to be particularly suitable for implementation on modern hardware, and very fast in practice. Preliminary experimental results show that in some cases these algorithms are by far faster than their counterparts based on the standard text representation.</p>
      </abstract>
      <kwd-group>
        <kwd>Text processing</kwd>
        <kwd>representation</kwd>
        <kwd>experimental algorithms</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Text processing is one of the most relevant topics in computer science. It includes, among other problems, exact and approximate string matching, which are still among the most fundamental problems in the field. Textual data indeed remains the main form for storing information, even though data are memorized in different ways; hence the need for ever faster solutions to text processing problems. However, it turns out that the amount of available textual data grows faster than storage capacities and, as a consequence, performing text processing in the main memory of present-day computers is becoming more and more arduous. On the other hand, since processing texts in main memory is by far faster than on disks,¹ operating in main memory is crucial for carrying out efficient text processing. In many cases, the problem is not the size of the texts stored, but rather the data structures that must be built on the texts in order to efficiently carry out the processing [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>The representation of a DNA sequence is representative of this problem: to encode a human genome, which consists of about 3.3 billion bases, slightly less than 800 MB are required, if one uses only 2 bits per character. On the other hand, the standard text representation uses 8 bits per character, requiring a total of more than 2 GB, whereas the corresponding suffix tree requires at least 10 bytes per base, that is more than 30 GB, too large for many practical applications.</p>
      <p>Several works have appeared in recent years aiming at finding a trade-off between the space requirements needed to encode huge texts and complex data structures and the time efficiency of the text searching algorithms designed to work with such representations. Among the most relevant approaches in this direction, we mention compressed text searching, in which processing takes place online directly on compressed data in order to speed up searching with reduced extra space. This problem has been widely investigated in the last few years and, although efficient solutions exist for searching under standard compression schemes,² they still require significant implementation efforts and turn out to be not very flexible, being designed for specific tasks.</p>
      <p>A more suitable solution is compact text processing, which aims at storing information using succinct text representations, allowing, at the same time, fast in-place text processing with no decompression.</p>
      <p>
        The concept of succinct data representation and succinct data structures was
originally introduced by Jacobson [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] to encode bit vectors, trees, and graphs,
and has been recently brought back to the top [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] for the reasons discussed
above.
      </p>
      <p>Improving on results in this direction, we discuss in this paper the advantages of using the Bit-Layers text representation, a succinct and quite basic string representation, which turns out to be particularly suitable for fast text searching, while retaining the standard efficiency in the other basic text processing applications.</p>
      <p>We also present a family of simple algorithms for solving some classical and non-classical string-matching problems, tuned to our proposed text representation, which turn out to be particularly suitable for implementation on modern hardware and very fast in practice. Preliminary experimental results show that in some cases these algorithms are by far faster than their counterparts based on the standard text representation.
¹ In present-day computers, accessing a text in main memory is about 10⁵ times faster than on secondary memory.
² The best solutions for compressed string matching use less than 70% extra space for the text and are twice as fast in searching as standard online string matching algorithms.</p>
    </sec>
    <sec id="sec-1b">
      <title>Standard and Succinct Text Representations</title>
      <p>In this section we briefly review the most relevant text representations that are generally used to encode a string y in main memory with a word size of w bits. We shall assume that the block size w is fixed, so that all references to a string will only be to entire blocks of w bits. For simplicity, we refer to a w-bit block as a byte, though values larger than w = 8 could be supported as well.</p>
      <p>Let y be a string of length n &gt; 0 of characters from a finite alphabet Σ of size σ, such that ℓ := ⌈log(σ)⌉ ≤ w. In the standard text representation, the string y is coded as an array Sy of n blocks, each of size w, that can be read and written at arbitrary positions, and where the i-th block of Sy contains the binary representation of the i-th character of y. We denote by φ(c) the binary representation of c ∈ Σ, so that φ(c)[j], with 0 ≤ j &lt; ℓ, is the j-th most significant bit in φ(c). As long as ℓ ≤ w, each character can be read and processed in a single operation, otherwise ⌈log(σ)/w⌉ operations are required.</p>
      <p>Any array A of n blocks of size w can be regarded as a virtual bit array Â of nw bits, where each bit can be processed at the cost of a single operation. Conversely, any bit string B̂ of length m can be seen as an array B of ⌈m/w⌉ blocks. Thus we have that B̂[i] is the j-th bit of B[⌊i/w⌋], where j := i mod w. Since the length m of a binary string need not be a multiple of w, the last block may be only partially defined. This approach turns out to be very simple and fast, as read and write operations on a given character can be done in constant time, by direct access to the position of the array where the character is stored or needs to be written. However, such a simple representation may waste a lot of space, since wn bits are used, where just ℓn would suffice.</p>
      <p>
        A succinct representation of the string y has been discussed in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. It allocates an array C of ⌈ℓn/w⌉ blocks, which is enough to encode n elements of ℓ bits. We regard the ℓn bits stored in C as a virtual bit array Ĉ of ℓn bits, where each character y[i] is stored at Ĉ[iℓ .. (i + 1)ℓ − 1]. Also in this case any character y[i] can be accessed in constant time, although it may require more than one word access for a single character. Solutions tuned to this succinct text representation exist [
        <xref ref-type="bibr" rid="ref17 ref8">17, 8</xref>
        ], aiming at speeding up text searching; however, it turns out that the gain is quite poor in practice, and is obtained only when ℓ divides w exactly.
      </p>
      <p>
        For completeness, we mention other succinct text representations, based on variable-length encodings, which are related to our approach since they provide the enviable feature of giving direct access to the text. We mention those based on sampling [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the Elias-Fano-based representation [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], Interpolative coding [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], the Wavelet tree [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and Direct Addressable Codes (DACs) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Such text representations are specifically designed for succinctness and are not competitive for text processing tasks, if compared against standard text representations.
      </p>
      <p>Notably, our proposed bit-layers representation is strongly related to Direct Addressable Codes (DACs), where, as in our approach, the bits of each encoding are stored in different bit sequences. However, DACs use variable-length encodings (whereas our representation encodes all characters of the alphabet with the same number of bits, thus enhancing text processing performance) and maintain additional information about the structure of the layers.</p>
    </sec>
    <sec id="sec-2">
      <title>Bit-Layers Text Representation</title>
      <p>Assume again that y is a string of length n over an alphabet Σ of size σ. As in the case of the succinct text representation described above, let us suppose that each character in Σ is represented by ℓ := ⌈log(σ)⌉ bits.</p>
      <p>The bit-layers text representation codes the string y as an ordered collection of ℓ binary strings of length n, ⟨B̂0, B̂1, …, B̂ℓ−1⟩ (the reference to y has been omitted for conciseness), where the i-th binary string B̂i is the sequence of the i-th bits of the characters in y, in the order in which they appear in y. We refer to the bit vectors B̂0, B̂1, …, B̂ℓ−1 induced by such a representation as the bit layers of the encoding. More formally, letting c ↦ φ(c) be the encoding map, for c ∈ Σ, then</p>
      <p>
        B̂i := ⟨φ(y[0])[i], φ(y[1])[i], …, φ(y[n − 1])[i]⟩, for i = 0, …, ℓ − 1. Thus, each layer B̂i can be regarded as an array Bi of ⌈n/w⌉ blocks of size w.
      </p>
      <p>Example 1. Let y = abfefdgabaadefcc be a string of 16 characters over the alphabet Σ = {a, b, c, d, e, f, g}, with σ = 7, so that each character can be represented by ⌈log(σ)⌉ = 3 bits. Assume therefore that φ(a) = 000, φ(b) = 001, φ(c) = 010, φ(d) = 011, φ(e) = 100, φ(f) = 101 and φ(g) = 110.</p>
      <p>According to our bit-layers text representation, the string y can be represented by the following sequence of 3 binary vectors, each one stored on ⌈n/w⌉ = 2 bytes (here w = 8):
B̂0 = 00111010 00001100
B̂1 = 00000110 00010011
B̂2 = 01101100 10010100</p>
      <p>Since the bit B̂i[j] is stored at the r-th most significant bit of Bi[⌊j/w⌋], where r = (j mod w) + 1, the character y[i] can readily be retrieved by the following formula: y[i] = ⟨B̂0[i], B̂1[i], …, B̂ℓ−1[i]⟩.</p>
      <p>Procedure read in Figure 1 accesses character y[i] under the bit-layers text representation in Θ(ℓ) time, which is worse than the constant time needed for reading a character under the standard or the succinct representations. However, our experimental results show that, in practical cases, the access speed to the text's characters using the bit-layers text representation is as fast as with the other representations (see Section 4).</p>
      <p>The bit-layers representation arranges textual data in a two-dimensional structure, so that text processing may proceed both horizontally, by reading bits (or bit blocks) along a binary vector in a given layer, and vertically, by moving from a given layer to the next one while reconstructing the encoding of a character (or of a group of characters).</p>
      <p>Figure 1. Procedures for accessing character y[i] under the bit-layers representation:
readbit(B, i): return (B &gt;&gt; i) &amp; 1
read(i):
    c ← 0
    for h ← 0 to ℓ − 1 do
        c ← (c &lt;&lt; 1) | readbit(Bh[⌊i/w⌋], i mod w)
    return c</p>
      <p>Thanks to its two-dimensional structure, the bit-layers text representation
counts many favourable and interesting features.</p>
      <p>First of all, it naturally allows parallel computation on textual data. Since
a string is partitioned in ` independent binary vectors, it is straightforward to
entrust the processing of each bit vector to a di erent processor, provided that
the corresponding ` outputs are then blended.</p>
      <p>The bit-layers representation is also well suited for parallel accessing of multiple data. A single computer word maintains, indeed, partial information about w contiguous characters. Although we can access only a fraction (namely 1/ℓ) of a character encoding, such information can be processed in constant time, also by exploiting the intrinsic parallelism of bitwise operations, which allows one to cut down the horizontal processing by a factor of up to w.</p>
      <p>In addition, it also allows adaptive accesses to textual data. This means that processing may not always need to go all the way in depth along the layers of the representation, as in favourable cases accesses can stop already at the initial layers of the representation of a given character. For instance, assume we are interested in searching for all occurrences of the character a in a given text y, where φ(a) = 000. If, while inspecting character y[i], it is discovered that B̂0[i] = 1, then such a position can immediately be marked as a mismatch. This feature may allow, under suitable conditions, cutting down the vertical data processing by a factor of up to ℓ.</p>
      <p>Finally, the bit-layers representation turns out to be also well suited for cache-friendly accesses to textual data. Indeed, the cache-hit rate is crucial for good performance, as each cache miss results in fetching data from primary memory (or, worse, from secondary memory), which takes considerable time.³ In comparison, reading data from the cache typically takes only a handful of cycles. According to the principle of spatial locality, if a particular vector location is referenced at a particular time, then it is expected that nearby positions will be referenced in the near future. Thus, in present-day computers, it is common to attempt to guess the size of the area around the current position for which it is worthwhile to prepare for faster accesses for subsequent references. Assume, for instance, that such a size is of k positions, so that, after a cache miss, when a certain block of a given array is accessed, the next k blocks are also stored in cache memory. Then, under the standard text representation, we have at most k supplementary text accesses before the next cache miss, whereas in our bit-layers representation such a number may grow by a factor of up to ℓ, resulting, under suitable conditions, in faster accesses to text characters.
³ Data fetching takes hundreds of cycles from the primary memory and tens of billions of cycles from secondary memory.</p>
      <p>In the following section, we shall present some preliminary evidence that the bit-layers text representation is particularly appropriate for fast text processing. Specifically, we shall provide basic solutions to some standard and non-standard search problems, comparing the results obtained against those achieved under the standard representation.</p>
    </sec>
    <sec id="sec-2b">
      <title>Text Searching Using Bit-Layers Representation</title>
      <p>
        Text processing, and especially text searching, can be included among the most significant topics in computer science. Many times over the years, standard approaches to text searching have been influenced, or even transformed, by particularly innovative studies. Among them, we mention the Boyer-Moore algorithm [
        <xref ref-type="bibr" rid="ref13 ref4">4, 13</xref>
        ], the first practical approach to text searching, and Bit-Parallelism [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which respectively focused mainly on the search strategy and on the method to build or represent effective data structures to speed up the search process, even by carrying it out in parallel.
      </p>
      <p>The bit-layers text representation presented in this paper aims at improving text searching by moving enhancements to the very bottom of the problem, namely the encoding level, allowing a natural and more significant parallel computation, particularly suitable for hardware implementation. For these reasons, we believe that such a representation is a valuable contribution to any application dealing with text searching.</p>
      <p>Although the representation can inspire several technical and creative approaches to text searching, in this first paper we just show a few straightforward solutions tuned to the proposed representation. Specifically, we present some solutions based on the following two approaches: horizontal word parallelism and vertical adaptive parallelism.</p>
      <p>In the horizontal word parallelism, textual data are read in chunks of w bits,
from all the layers of the representation, and partial information is processed in
parallel. In the vertical adaptive parallelism approach, we proceed much in the
same way; however one does not always need to go all the way in depth along
the layers of the representation.</p>
      <p>In the following sections, we present solutions to some basic text processing
tasks and we report the related experimental data that allow us to compare our
proposed algorithms against those tuned for standard text representation.</p>
      <p>
        In our experiments, all algorithms have been implemented in the C programming language, using 32-bit words (i.e., w = 32), and have been tested with the Smart research tool [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which is available online at the URL http://www.dmi.unict.it/~faro/smart/.⁴ All experiments have been executed locally on a MacBook Pro with 4 cores, a 2 GHz Intel Core i7 processor, 16 GB RAM 1600 MHz DDR3, 256 KB of L2 cache, and 6 MB of L3 cache. In addition, all algorithms have been compared in terms of their running times, excluding preprocessing times, on the following real data sets: a genome sequence, a protein sequence, and an English text. Each of the three sequences has a length of 15 MB, and can be downloaded with the Smart tool. In the experimental evaluation, patterns of length m have been randomly extracted from the sequences, with m ranging over the set of values {2^i | 2 ≤ i ≤ 5}. For each experiment, the mean processing speed (expressed in billions of characters per second) over 500 runs has been reported.
      </p>
      <sec id="sec-2-1">
        <title>Simple Text Scanning</title>
        <p>Simple text scanning is a basic task in text processing. It consists in reading the characters of a given text one after the other in order to extract some specific features from the textual data. In this section, we focus on the following straightforward tasks: (a) computing the absolute frequencies of all the characters occurring in a text, and (b) computing the absolute frequency of a specific character in a text.</p>
        <p>Assume y is a text of length n over an alphabet Σ of size σ. The first task consists in counting the occurrences in y of each character of the alphabet. Using the standard representation, this problem can be solved by a simple iterative cycle that sweeps over all characters of the text while increasing the corresponding counters. With the bit-layers representation, task (a) can be accomplished by scanning the text by way of horizontal word parallelism. Initially, a table T of size 2⁸ is precomputed, where, for each 8-bit vector B̂, T[B̂] is an array of 8 characters (which can be regarded altogether as a 64-bit vector) such that, for 0 ≤ k &lt; 8,</p>
        <p>T[B̂][k] := if B̂[k] = 0 then 00000000 else 00000001 endif.
Then, during the scanning phase, for each 0 ≤ j &lt; ⌈n/8⌉, a 64-bit vector X := Σ_{i=0..ℓ−1} (T[Bi[j]] &lt;&lt; i) is computed and, by regarding the vector X as an 8-character array, the counter of each character X[k] is increased, for 0 ≤ k &lt; 8.</p>
        <p>Task (b) consists in counting the occurrences of a given input character c ∈ Σ in the string y. Using the standard representation, this task can be carried out through a simple iterative cycle that, while sweeping over all characters of the text, increases a counter by 1 whenever an occurrence of c is found. With the bit-layers representation, task (b) can be accomplished through a scan of the text based on vertical adaptive parallelism. Initially, ℓ binary vectors Pi (each of size w), for 0 ≤ i &lt; ℓ, are precomputed, where</p>
        <p>Pi := if φ(c)[i] = 1 then 0^w else 1^w endif.
⁴ The C codes of all tested algorithms are available at http://www.dmi.unict.it/~faro/BLE/</p>
        <p>Table 1. Processing speeds (billions of characters per second) for the frequencies count task (a) and the occurrence count task (b):
Frequencies Count — genome: Standard 0.782, Bit-Layers 1.043; protein: Standard 0.880, Bit-Layers 1.032; english: Standard 0.880, Bit-Layers 1.024.
Occurrence Count — genome: Standard 1.766, Bit-Layers 13.88; protein: Standard 1.552, Bit-Layers 4.854; english: Standard 1.572, Bit-Layers 4.065.</p>
        <p>In the subsequent scanning phase, the bit layers of the string y are read in blocks of w bits. Specifically, for 0 ≤ j &lt; ⌈n/w⌉, a sequence of ℓ bit vectors ⟨Xi : 0 ≤ i &lt; ℓ⟩ is computed, where X0 = B0[j] xor P0 and Xi = Xi−1 and (Bi[j] xor Pi), for 1 ≤ i &lt; ℓ, provided that all the Xi's are nonnull. It can easily be checked that y[jw + k] = c ⟺ Xℓ−1[k] = 1 holds. Thus, if Xℓ−1 ≠ 0, the counter is increased by bitcount(Xℓ−1) units. Plainly, such a procedure is adaptive, since, during each iteration, say the j-th one with 0 ≤ j &lt; ⌈n/w⌉, as soon as Xi = 0 for some 0 ≤ i &lt; ℓ, the iteration is aborted and the execution resumes with the (j + 1)-st iteration.</p>
        <p>Table 1 reports the experimental results relative to the above two tasks under both the standard and the bit-layers representations, where speeds are expressed in billions of characters per second. As for the frequency count task (a), the bit-layers representation allows for a slightly faster access to textual data than the standard representation, especially in the case of small alphabets, characterized by a modest number of layers. When the size of the alphabet increases, our approach unavoidably gets slower. Concerning the occurrence count task (b), the vertical adaptive parallel approach based on the bit-layers representation achieves by far the best results, as it is from 2.5 times (for English texts) up to 8 times (for genome sequences) faster than the standard approach.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Exact String Matching</title>
        <p>
          The exact string matching problem is another basic task in text processing. It consists in finding all the (possibly overlapping) occurrences of an input pattern x of length m within a text y of length n, both strings over a common alphabet Σ of size σ. More formally, the problem aims at finding all positions j in y[0 .. n − m] such that y[j .. j + m − 1] = x. A huge number of solutions has been devised since the 1980s [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and, despite such a wide literature, still much work has been produced in the last few years, demonstrating that the need for efficient solutions is currently high.
        </p>
        <p>We designed the following three basic solutions to the exact string matching
problem, tuned to the bit-layers text representation.</p>
        <p>Brute-force algorithm based on horizontal word parallelism (BfH): For each position j in y[0 .. n − m], the chunk B̂i[j .. j + m − 1] from the i-th layer of y, for each 0 ≤ i &lt; ℓ, is compared with the corresponding layer of the pattern, D̂i[0 .. m − 1]. If B̂i[j .. j + m − 1] = D̂i[0 .. m − 1] for all 0 ≤ i &lt; ℓ, then a match is reported at position j. The algorithm BfH works adaptively, since as soon as B̂i[j .. j + m − 1] ≠ D̂i[0 .. m − 1] for some 0 ≤ i &lt; ℓ, the iteration for position j stops, and a new iteration starts from position j + 1 in y[0 .. n − m], if any.
Prefix-based algorithm based on vertical adaptive parallelism (Pfx): It is a prefix-based improvement of the algorithm BfH described above. A prefix table π : {0, …, 2^k − 1} → {1, …, k}, with k = min(m, 8), is precomputed, where, for any given binary vector B̂ of length k (which can be regarded as an integer in {0, …, 2^k − 1}), we have π(B̂) := min{s : 1 ≤ s &lt; k and B̂[s .. k − 1] = D̂0[0 .. k − s − 1]}, setting π(B̂) := k when no such s exists. At the end of each iteration during the searching phase, the current window is shifted by π(B̂0[j .. j + k − 1]) positions to the right.</p>
        <p>Suffix-based algorithm based on vertical adaptive parallelism (Sfx): It is another improvement (suffix-based) of the brute-force algorithm described earlier. A suffix table γ : {0, …, 2^k − 1} → {1, …, k}, with k = min(m, 8), is precomputed, where, for any given binary vector B̂ of length k (which as before can be regarded as an integer in {0, …, 2^k − 1}), we have γ(B̂) := min{s : 1 ≤ s &lt; k and D̂0[m − s − k .. m − s − 1] = B̂[0 .. k − 1]}, setting γ(B̂) := k when no such s exists. At the end of each iteration in the searching phase, the current window is shifted by γ(B̂0[j + m − k .. j + m − 1]) positions to the right.</p>
        <p>
          Table 2 shows the experimental results relative to the above algorithms (tuned to the bit-layers representation) against the following known solutions to the exact string matching problem (tuned to the standard representation), which are considered among the fastest algorithms in practical cases [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], namely the brute-force algorithm (Bf), the Boyer-Moore-Horspool algorithm (Hor) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], the Shift-And algorithm (Sa) based on bit-parallelism [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], and the Weak-Factor-Recognition algorithm (Wfr) implemented with q-grams (1 ≤ q ≤ 4).
        </p>
        <p>From Table 2, it turns out that algorithm BfH is always faster (up to 3 times) than its counterpart Bf and, in most cases, it is even faster than the Horspool algorithm (Hor). The suffix-based algorithm Sfx, tuned to the bit-layers representation, is in almost all cases the fastest algorithm, even faster (up to 3 times) than the algorithm Wfr, showing a sub-linear behaviour which rapidly improves as the length of the pattern increases.</p>
      </sec>
      <sec id="sec-2-3">
        <title>String Matching with Mismatches</title>
        <p>Approximate string matching has several variations. In this section we consider the δ-mismatches variation, where the task is to find all the occurrences of x with at most δ mismatches, where 0 ≤ δ &lt; m.⁵</p>
        <p>
          Shift-Add [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] was the first practical algorithm for solving the δ-mismatches problem. It is based on bit-parallelism, where a vector of m states is used to represent the state of the search. A field of log(m) + 1 bits is used for representing each of the m states. When a mismatch is detected, the corresponding state is increased accordingly. Thus, a match is detected at a given position when the last state has a value less than or equal to δ. As in the case of other bit-parallel solutions, when m(log(m) + 1) is greater than w, multiple words need to be involved in the computation.
        </p>
        </p>
        <p>In the context of the bit-layers representation, we implemented the following algorithm:
Brute-force algorithm based on vertical adaptive parallelism (BfV): Given as before a text y of length n and a pattern of length m ≤ n, for each position j in y[0 .. n − m], the chunk B̂i[j .. j + m − 1] from the i-th layer of y, for 0 ≤ i &lt; ℓ, is compared with the corresponding layer D̂i[0 .. m − 1] of the pattern, by way of the xor bitwise operation. The result is a bit vector R̂i of size m, such that R̂i[k] = 1 ⟺ B̂i[j + k] ≠ D̂i[k], for 0 ≤ k &lt; m. Let R̂(0,i) := R̂0 or R̂1 or … or R̂i. If bitcount(R̂(0,ℓ−1)) ≤ δ, then an occurrence of the pattern is reported at position j. The algorithm BfV works adaptively, since as soon as bitcount(R̂(0,i)) &gt; δ, for some i &lt; ℓ, the iteration at position j stops, and the subsequent iteration starts from position j + 1 in y[0 .. n − m], if any.</p>
        <p>
          Table 3 reports the experimental results relative to the comparison of three algorithms for the approximate string matching problem allowing for at most δ mismatches, with δ ∈ {1, 3}. Specifically, we tested a brute-force algorithm (Bf) and the Shift-Add (Sa) bit-parallel algorithm [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] (both tuned to the standard text representation) and our brute-force algorithm (BfV) described above, tuned to our novel bit-layers representation.
⁵ In this context, when δ = 0 the approximate problem just reduces to the exact string matching problem.
        </p>
        <p>Table 3. Comparison of the algorithms Bf, Sa and BfV, for δ ∈ {1, 3}, on the genome, protein and English sequences, with patterns of length m ∈ {4, 8, 16, 32}.</p>
        <p>For short patterns, our algorithm BfV turns out to be slower than the Shift-Add bit-parallel algorithm, and it gets slower as the bound δ increases. This is due to the need, when the bound gets larger and larger, to go all the way in depth along the layers of the representation. However, thanks to its horizontal word parallelism, our solution shows a marked sub-linear behaviour, getting faster and faster (up to 2 times faster) than the Shift-Add algorithm for patterns of length greater than or equal to 16. In fact, as the pattern length increases, the Shift-Add algorithm gets slower and slower because of the need to involve more and more words in its computation, widening the performance gap with our algorithm BfV.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusions and Future Works</title>
      <p>In this paper we introduced the bit-layers text representation, a novel succinct
string representation in which textual data are arranged into a two-dimensional
structure, which allows text processing to proceed both horizontally (by
handling multiple data in constant time) and vertically (by moving in depth along
the layers of the representation). The bit-layers text representation aims at
improving text searching solutions, and we believe that it may represent a valuable
contribution to applications dealing with text searching. To substantiate this
point, we showed how to solve some basic text processing tasks using the
bit-layers representation, and presented the results of an experimental comparison
of our solutions tailored to the bit-layers representation against those tuned for
the standard text representation. Our solutions take particular advantage of the
horizontal word parallelism and of the vertical adaptive parallelism intrinsic
in the two-dimensionality of the bit-layers representation, which could kick off
several technical and creative approaches to text searching.</p>
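      <p>As a toy illustration of this two-dimensional arrangement (an illustrative sketch, not the tuned implementation evaluated above), a text over an alphabet of at most 2^b symbols can be stored as b binary layers, where layer j collects bit j of every character:</p>

```python
def to_bit_layers(text, bits=8):
    """Pack bit j of every character into layer j (a Python int used as a
    bit-vector), so that word-sized chunks of each layer can be processed
    in parallel, and deeper layers are visited only when needed."""
    layers = [0] * bits
    for i, c in enumerate(text):
        code = ord(c)
        for j in range(bits):
            if (code >> j) & 1:
                layers[j] |= 1 << i
    return layers

def from_bit_layers(layers, n):
    """Reassemble the first n characters from the layers (the inverse map)."""
    return "".join(
        chr(sum(((layers[j] >> i) & 1) << j for j in range(len(layers))))
        for i in range(n)
    )
```

      <p>With 8 layers this covers an ASCII text; position i of layer j holds bit j of the i-th character, so the two maps are exact inverses.</p>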
      <p>We plan to study other, more effective solutions, tuned to the bit-layers
representation, to further basic text processing problems, and also to investigate
non-standard searching problems amenable to fast solutions under our proposed
text representation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>V. N.</given-names>
            <surname>Anh</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Moffat</surname>
          </string-name>
          .
          <article-title>Inverted index compression using word-aligned binary codes</article-title>
          .
          <source>Information Retrieval</source>
          ,
          <volume>8</volume>
          , pp.
          <fpage>151</fpage>
          -
          <lpage>166</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          and
          <string-name>
            <given-names>G. H.</given-names>
            <surname>Gonnet</surname>
          </string-name>
          .
          <article-title>A new approach to text searching</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>35</volume>
          (
          <issue>10</issue>
          ):
          <fpage>74</fpage>
          -
          <lpage>82</lpage>
          , (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>N. R.</given-names>
            <surname>Brisaboa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ladra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Navarro</surname>
          </string-name>
          .
          <article-title>DACs: Bringing direct access to variable-length codes</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>49</volume>
          , pp.
          <fpage>392</fpage>
          -
          <lpage>404</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>R.S.</given-names>
            <surname>Boyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.S.</given-names>
            <surname>Moore</surname>
          </string-name>
          .
          <article-title>A fast string searching algorithm</article-title>
          .
          <source>Commun. ACM</source>
          <volume>20</volume>
          (
          <issue>10</issue>
          ),
          <fpage>762</fpage>
          -
          <lpage>772</lpage>
          (
          <year>1977</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>D.</given-names>
            <surname>Cantone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Faro</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <article-title>Pavone: Speeding Up String Matching by Weak Factor Recognition</article-title>
          .
          <source>Stringology</source>
          <year>2017</year>
          , pp.
          <volume>42</volume>
          {
          <issue>50</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>P.</given-names>
            <surname>Elias</surname>
          </string-name>
          .
          <article-title>Efficient storage and retrieval by content and address of static files</article-title>
          .
          <source>Journal of the ACM</source>
          ,
          <volume>21</volume>
          ,
          <fpage>246</fpage>
          -
          <lpage>260</lpage>
          (
          <year>1974</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>R.</given-names>
            <surname>Fano</surname>
          </string-name>
          .
          <article-title>On the number of bits required to implement an associative memory</article-title>
          .
          <source>Memo</source>
          <volume>61</volume>
          , Computer Structures Group, Project MAC, Massachusetts
          (
          <year>1971</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>S.</given-names>
            <surname>Faro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lecroq</surname>
          </string-name>
          ,
          <article-title>An efficient matching algorithm for encoded DNA sequences and binary strings</article-title>
          .
          <source>In Proc. of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM 2009). Lecture Notes In Computer Science</source>
          , Vol.
          <volume>5577</volume>
          , Springer-Verlag, pp.
          <fpage>106</fpage>
          -
          <lpage>115</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>S.</given-names>
            <surname>Faro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lecroq</surname>
          </string-name>
          ,
          <article-title>The Exact Online String Matching Problem: a Review of the Most Recent Results</article-title>
          .
          <source>ACM Computing Surveys (CSUR)</source>
          , vol.
          <volume>45</volume>
          (
          <issue>2</issue>
          ), article
          <fpage>13</fpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>S.</given-names>
            <surname>Faro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lecroq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borzì</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Di Mauro</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Maggio</surname>
          </string-name>
          .
          <article-title>The String Matching Algorithms Research Tool</article-title>
          .
          <source>In Proc. of Stringology</source>
          , pp.
          <fpage>99</fpage>
          -
          <lpage>111</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>P.</given-names>
            <surname>Ferragina</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Venturini</surname>
          </string-name>
          .
          <article-title>A simple storage scheme for strings achieving entropy bounds</article-title>
          .
          <source>In Proc. 18th symp. on discrete alg</source>
          .
          <source>(SODA)</source>
          , pp.
          <fpage>690</fpage>
          -
          <lpage>696</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>R.</given-names>
            <surname>Grossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Vitter</surname>
          </string-name>
          .
          <article-title>High-order entropy-compressed text indexes</article-title>
          .
          <source>In Proc. 14th symp. on discrete alg</source>
          .
          <source>(SODA)</source>
          , pp.
          <fpage>841</fpage>
          -
          <lpage>850</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>R. N.</given-names>
            <surname>Horspool</surname>
          </string-name>
          ,
          <article-title>Practical fast searching in strings</article-title>
          ,
          <source>Software: Practice &amp; Experience</source>
          <volume>10</volume>
          (
          <issue>6</issue>
          ), pp.
          <fpage>501</fpage>
          -
          <lpage>506</lpage>
          (
          <year>1980</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Huffman</surname>
          </string-name>
          .
          <article-title>A method for the construction of minimum-redundancy codes</article-title>
          .
          <source>Proceedings of the Institute of Radio Engineers (IRE)</source>
          ,
          <volume>40</volume>
          (
          <issue>9</issue>
          ), pp.
          <fpage>1098</fpage>
          -
          <lpage>1101</lpage>
          (
          <year>1952</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Jacobson</surname>
          </string-name>
          .
          <article-title>Succinct static data structures (Ph.D.)</article-title>
          . Pittsburgh, PA: Carnegie Mellon University (
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>G.</given-names>
            <surname>Navarro</surname>
          </string-name>
          .
          <article-title>Compact Data Structures: A Practical Approach</article-title>
          . Cambridge University Press New York, NY, USA (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>H.</given-names>
            <surname>Peltola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tarhio</surname>
          </string-name>
          .
          <article-title>On String Matching in Chunked Texts</article-title>
          .
          <source>In Proc. of CIAA'07, Lecture Notes in Computer Science</source>
          , vol.
          <volume>4783</volume>
          , Springer Verlag, pp.
          <fpage>157</fpage>
          -
          <lpage>167</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>J.</given-names>
            <surname>Teuhola</surname>
          </string-name>
          .
          <article-title>Interpolative coding of integer sequences supporting log-time random access</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>47</volume>
          , pp.
          <fpage>742</fpage>
          -
          <lpage>761</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>H. E.</given-names>
            <surname>Williams</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          .
          <article-title>Compressing integers for fast file access</article-title>
          .
          <source>Computer Journal</source>
          ,
          <volume>42</volume>
          (
          <issue>3</issue>
          ), pp.
          <fpage>193</fpage>
          -
          <lpage>201</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>M.</given-names>
            <surname>Zukowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Heman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Boncz</surname>
          </string-name>
          .
          <article-title>Super-scalar ram-cpu cache compression</article-title>
          .
          <source>In Proc. 22nd international conference on data engineering (ICDE)</source>
          (pp.
          <fpage>59</fpage>
          ). Washington, DC, USA: IEEE Computer Society (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>