<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bit-Layers Text Representation for Efficient Text Processing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Domenico Cantone</string-name>
          <email>domenico.cantone@unict.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Faro</string-name>
          <email>faro@dmi.unict.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Scafiti</string-name>
          <email>stefano.scafiti@studium.unict.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università di Catania</institution>
          ,
          <addr-line>Viale A.Doria n.6, 95125 Catania</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>24</lpage>
      <abstract>
        <p>Textual data still remains the main format for storing information, which justifies why text processing is among the most relevant topics in computer science. However, although storage capacity is growing fast, the amount and complexity of textual data grow faster still. In many cases, the problem is not due to the size or complexity of the text, but rather to the representations (or data structures) employed for carrying out the needed processing. In this paper, we show the potential and the benefits of a straightforward text representation, referred to as the Bit-Layers text representation, which turns out to be particularly suitable for fast text searching, while still retaining the standard efficiency in the other basic text processing tasks. To show the advantages of the Bit-Layers representation, we also present a family of simple algorithms, tuned to it, for solving some classical and non-classical string-matching problems. Such algorithms turn out to be particularly suitable for implementation on modern hardware, and very fast in practice. Preliminary experimental results show that in some cases these algorithms are by far faster than their counterparts based on the standard text representation.</p>
      </abstract>
      <kwd-group>
        <kwd>Text processing</kwd>
        <kwd>representation</kwd>
        <kwd>experimental algorithms</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Text processing is one of the most relevant topics in computer science. It includes, among other problems, exact and approximate string matching, which are still among the most fundamental problems in the field. Textual data indeed remains the main form for storing information, even though data are memorized in different ways; hence the need for ever faster solutions to text processing problems. However, it turns out that the amount of available textual data grows faster than storage capacities and, as a consequence, performing text processing in the main memory of present-day computers is becoming more and more arduous. On the other hand, since processing texts in main memory is by far faster than on disks,¹ operating in main memory is crucial for carrying out efficient text processing. In many cases, the problem is not the size of the texts stored, but rather the data structures that must be built on the texts in order to efficiently carry out the processing [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>The representation of a DNA sequence is representative of this problem: to encode a human genome, which consists of about 3.3 billion bases, slightly less than 800 MB are required, if one uses only 2 bits per character. On the other hand, the standard text representation uses 8 bits per character, requiring a total of more than 2 GB, whereas the corresponding suffix tree requires at least 10 bytes per base, that is more than 30 GB, too large for many practical applications.</p>
      <p>Several works have appeared in recent years aiming at finding a trade-off between the space requirements needed to encode huge texts and complex data structures and the time efficiency of the text searching algorithms designed to work with such representations. Among the most relevant approaches in this direction, we mention compressed text searching, in which processing takes place online directly on compressed data in order to speed up searching with reduced extra space. This problem has been widely investigated in the last few years and, although efficient solutions exist for searching under standard compression schemes,² they still require significant implementation efforts and turn out to be not very flexible, being designed for specific tasks.</p>
      <p>A more suitable solution is compact text processing, which aims at storing information using succinct text representations, allowing, at the same time, fast in-place text processing with no decompression.</p>
      <p>
        The concept of succinct data representation and succinct data structures was
originally introduced by Jacobson [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] to encode bit vectors, trees, and graphs,
and has been recently brought back to the top [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] for the reasons discussed
above.
      </p>
      <p>Improving on results in this direction, we discuss in this paper the advantages of using the Bit-Layers text representation, a succinct and quite basic string representation, which turns out to be particularly suitable for fast text searching, while retaining the standard efficiency in the other basic text processing applications.</p>
      <p>We also present a family of simple algorithms for solving some classical and non-classical string-matching problems, tuned to our proposed text representation, which turn out to be particularly suitable for implementation on modern hardware and very fast in practice. Preliminary experimental results show that in some cases these algorithms are by far faster than their counterparts based on the standard text representation.
¹ In present-day computers, accessing a text in main memory is about 10⁵ times faster than on secondary memory.
² The best solutions for compressed string matching use less than 70% extra space for the text and are twice as fast in searching as standard online string matching algorithms.</p>
    </sec>
    <sec id="sec-1b">
      <title>Standard and Succinct Text Representations</title>
      <p>In this section we briefly review the most relevant text representations that are generally used to encode a string y in main memory with a word size of w bits. We shall assume that the block size w is fixed, so that all references to a string will only be to entire blocks of w bits. For simplicity, we refer to a w-bit block as a byte, though values larger than w = 8 could be supported as well.</p>
      <p>Let y be a string of length n &gt; 0 of characters from a finite alphabet Σ of size σ, such that ℓ := ⌈log(σ)⌉ ≤ w. In the standard text representation, the string y is coded as an array Sy of n blocks, each of size w, that can be read and written at arbitrary positions, and where the i-th block of Sy contains the binary representation of the i-th character of y. We denote by φ(c) the binary representation of c ∈ Σ, so that φ(c)[j], with 0 ≤ j &lt; ℓ, is the j-th most significant bit in φ(c). As long as ℓ ≤ w, each character can be read and processed in a single operation, otherwise ⌈log(σ)/w⌉ operations are required.</p>
      <p>Any array A of n blocks of size w can be regarded as a virtual bit array Â of nw bits, where each bit can be processed at the cost of a single operation. Conversely, any bit string B̂ of length m can be seen as an array B of ⌈m/w⌉ blocks. Thus we have that B̂[i] is the j-th bit of B[⌊i/w⌋], where j := i mod w. Since the length m of a binary string need not be a multiple of w, the last block may be only partially defined. This approach turns out to be very simple and fast, as read and write operations on a given character can be done in constant time, by direct access to the position of the array where the character is stored or needs to be written. However, such a simple representation may waste a lot of space, since wn bits are used, where just ℓn would suffice.</p>
      <p>
        A succinct representation of the string y has been discussed in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. It allocates an array C of ⌈ℓn/w⌉ blocks, which is enough to encode n elements of ℓ bits. We regard the ℓn bits stored in C as a virtual bit array Ĉ of ℓn bits, where each character y[i] is stored at Ĉ[iℓ .. (i + 1)ℓ − 1]. Also in this case any character y[i] can be accessed in constant time, although it may require more than one word access for a single character. Solutions tuned to this succinct text representation exist [
        <xref ref-type="bibr" rid="ref17 ref8">17, 8</xref>
        ], aiming at speeding up text searching; however, it turns out that the gain is quite poor in practice, and is obtained only when ℓ divides w exactly.
      </p>
      <p>
        For completeness, we mention other succinct text representations, based on variable-length encodings, which are related to our approach since they provide the enviable feature of giving direct access to the text. We mention those based on sampling [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the Elias-Fano-based representation [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], Interpolative coding [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], the Wavelet tree [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and Direct Addressable Codes (DACs) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Such text representations are specifically designed for succinctness and are not competitive for text processing tasks, if compared against standard text representations.
      </p>
      <p>Notably, our proposed bit-layers representation is strongly related to Direct Addressable Codes (DACs), where, as in our approach, the bits of each encoding are stored in different bit sequences. However, DACs use variable-length encodings (whereas our representation encodes all characters of the alphabet with the same number of bits, thus enhancing text processing performance) and maintain additional information about the structure of the layers.</p>
    </sec>
    <sec id="sec-2">
      <title>Bit-Layers Text Representation</title>
      <p>Assume again that y is a string of length n over an alphabet Σ of size σ. As in the case of the succinct text representation described above, let us suppose that each character in Σ is represented by ℓ := ⌈log(σ)⌉ bits.</p>
      <p>The bit-layers text representation codes the string y as an ordered collection of ℓ binary strings of length n, ⟨B̂0, B̂1, …, B̂ℓ−1⟩ (the reference to y has been omitted for conciseness), where the i-th binary string B̂i is the sequence of the i-th bits of the characters in y, in the order in which they appear in y. We refer to the bit vectors B̂0, B̂1, …, B̂ℓ−1 induced by such a representation as the bit layers of the encoding. More formally, letting c ↦ φ(c) be the encoding map, for c ∈ Σ, then</p>
      <p>
        B̂i := ⟨φ(y[0])[i], φ(y[1])[i], …, φ(y[n − 1])[i]⟩, for i = 0, …, ℓ − 1. Thus, each layer B̂i can be regarded as an array Bi of ⌈n/w⌉ blocks of size w.
      </p>
      <p>Example 1. Let y = abfefdgabaadefcc be a string of 16 characters over the alphabet Σ = {a, b, c, d, e, f, g}, with σ = 7, so that each character can be represented by ⌈log(σ)⌉ = 3 bits. Assume therefore that φ(a) = 000, φ(b) = 001, φ(c) = 010, φ(d) = 011, φ(e) = 100, φ(f) = 101 and φ(g) = 110.</p>
      <p>According to our bit-layers text representation, the string y can be represented by the following sequence of 3 binary vectors, each one stored on ⌈n/w⌉ = 2 bytes (here w = 8):
B̂0 = 00111010 00001100
B̂1 = 00000110 00010011
B̂2 = 01101100 10010100</p>
      <p>Since the bit B̂i[j] is stored at the r-th most significant bit of Bi[⌊j/w⌋], where r = (j mod w) + 1, the character y[i] can readily be retrieved by the following formula: y[i] = ⟨B̂0[i], B̂1[i], …, B̂ℓ−1[i]⟩.</p>
      <p>Procedure read in Figure 1 accesses character y[i] under the bit-layers text representation in Θ(ℓ) time, which is worse than the constant time needed for reading a character under the standard or the succinct representations. However, our experimental results show that, in practical cases, the access speed to the text's characters using the bit-layers text representation is as fast as with the other representations (see Section 4).</p>
      <p>The bit-layers representation arranges textual data in a two-dimensional structure, so that text processing may proceed both horizontally, by reading bits (or bit blocks) along a binary vector in a given layer, and vertically, by moving from a given layer to the next one while reconstructing the encoding of a character (or of a group of characters).</p>
      <p>Figure 1. Procedures for accessing character y[i] under the bit-layers representation:
readbit(B, i): return (B &gt;&gt; i) &amp; 1
read(i):
    c ← 0
    for h ← 0 to ℓ − 1 do
        c ← (c &lt;&lt; 1) | readbit(Bh[⌊i/w⌋], i mod w)
    return c</p>
      <p>Thanks to its two-dimensional structure, the bit-layers text representation
counts many favourable and interesting features.</p>
      <p>First of all, it naturally allows parallel computation on textual data. Since
a string is partitioned in ` independent binary vectors, it is straightforward to
entrust the processing of each bit vector to a di erent processor, provided that
the corresponding ` outputs are then blended.</p>
      <p>The bit-layers representation is also well suited for parallel accessing of multiple data. A single computer word maintains, indeed, partial information about w contiguous characters. Although we can access only a fraction (namely 1/ℓ) of a character encoding, such information can be processed in constant time, also by exploiting the intrinsic parallelism of bitwise operations, which allows one to cut down the horizontal processing by a factor of up to w.</p>
      <p>In addition, it also allows adaptive accesses to textual data. This means that processing may not always need to go all the way in depth along the layers of the representation, as in favourable cases accesses can stop already at the initial layers of the representation of a given character. For instance, assume we are interested in searching for all occurrences of the character a in a given text y, where φ(a) = 000. If, while inspecting character y[i], it is discovered that B̂0[i] = 1, then such a position can immediately be marked as a mismatch. This feature may allow, under suitable conditions, cutting down the vertical data processing by a factor of up to ℓ.</p>
      <p>Finally, the bit-layers representation turns out to be also well suited for cache-friendly accesses to textual data. Indeed, the cache-hit rate is crucial for good performance, as each cache miss results in fetching data from primary memory (or, worse, from secondary memory), which takes considerable time.³ In comparison, reading data from the cache typically takes only a handful of cycles. According to the principle of spatial locality, if a particular vector location is referenced at a particular time, then it is expected that nearby positions will be referenced in the near future. Thus, in present-day computers, it is common to attempt to guess the size of the area around the current position for which it is worthwhile to prepare for faster accesses for subsequent references. Assume, for instance, that such a size is of k positions, so that, after a cache miss, when a certain block of a given array is accessed, the next k blocks are also stored in cache memory. Then, under the standard text representation, we have at most k supplementary text accesses before the next cache miss, whereas in our bit-layers representation such a number may grow by a factor of up to ℓ, resulting, under suitable conditions, in faster accesses to text characters.
³ Data fetching takes hundreds of cycles from the primary memory and tens of billions of cycles from secondary memory.</p>
      <p>In the following section, we shall present some preliminary evidence that the bit-layers text representation is particularly appropriate for fast text processing. Specifically, we shall provide basic solutions to some standard and non-standard search problems, comparing the results obtained against those achieved under the standard representation.</p>
    </sec>
    <sec id="sec-2b">
      <title>Text Searching Using Bit-Layers Representation</title>
      <p>
        Text processing, and especially text searching, can be included among the most significant topics in computer science. Many times over the years, standard approaches to text searching have been influenced, or even transformed, by particularly innovative studies. Among them, we mention the Boyer-Moore algorithm [
        <xref ref-type="bibr" rid="ref13 ref4">4, 13</xref>
        ], the first practical approach to text searching, and Bit-Parallelism [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which respectively focused mainly on the search strategy and on the method to build or represent effective data structures to speed up the search process, even by carrying it out in parallel.
      </p>
      <p>The bit-layers text representation presented in this paper aims at improving text searching by moving enhancements to the very bottom of the problem, namely the encoding level, allowing a natural and more significant parallel computation, particularly suitable for hardware implementation. For these reasons, we believe that such a representation is a valuable contribution to any application dealing with text searching.</p>
      <p>Although the representation can inspire several technical and creative approaches to text searching, in this first paper we just show a few straightforward solutions tuned to the proposed representation. Specifically, we present some solutions based on the following two approaches: horizontal word parallelism and vertical adaptive parallelism.</p>
      <p>In the horizontal word parallelism, textual data are read in chunks of w bits,
from all the layers of the representation, and partial information is processed in
parallel. In the vertical adaptive parallelism approach, we proceed much in the
same way; however one does not always need to go all the way in depth along
the layers of the representation.</p>
      <p>In the following sections, we present solutions to some basic text processing
tasks and we report the related experimental data that allow us to compare our
proposed algorithms against those tuned for standard text representation.</p>
      <p>
        In our experiments, all algorithms have been implemented in the C programming language, using 32-bit words (i.e., w = 32), and have been tested with the Smart research tool [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which is available online at the URL http://www.dmi.unict.it/~faro/smart/.⁴ All experiments have been executed locally on a MacBook Pro with 4 cores, a 2 GHz Intel Core i7 processor, 16 GB RAM 1600 MHz DDR3, 256 KB of L2 cache, and 6 MB of L3 cache. In addition, all algorithms have been compared in terms of their running times, excluding preprocessing times, on the following real data sets: a genome sequence, a protein sequence, and an English text. Each of the three sequences has a length of 15 MB, and can be downloaded with the Smart tool. In the experimental evaluation, patterns of length m have been randomly extracted from the sequences, with m ranging over the set of values {2^i | 2 ≤ i ≤ 5}. For each experiment, the mean processing speed (expressed in billions of characters per second) over 500 runs has been reported.
      </p>
      <sec id="sec-2-1">
        <title>Simple Text Scanning</title>
        <p>Simple text scanning is a basic task in text processing. It consists in reading the characters of a given text one after the other in order to extract some specific features from the textual data. In this section, we focus on the following straightforward tasks: (a) computing the absolute frequencies of all the characters occurring in a text, and (b) computing the absolute frequency of a specific character in a text.</p>
        <p>Assume y is a text of length n over an alphabet Σ of size σ. The first task consists in counting the occurrences in y of each character of the alphabet. Using the standard representation, this problem can be solved by a simple iterative cycle that sweeps over all characters of the text while increasing the corresponding counters. With the bit-layers representation, task (a) can be accomplished by scanning the text by way of horizontal word parallelism. Initially, a table T of size 2⁸ is precomputed, where, for each 8-bit vector B̂, T[B̂] is an array of 8 characters (which can be regarded altogether as a 64-bit vector) such that, for 0 ≤ k &lt; 8,</p>
        <p>T[B̂][k] := if B̂[k] = 0 then 00000000 else 00000001 endif.
Then, during the scanning phase, for each 0 ≤ j &lt; ⌈n/8⌉, a 64-bit vector X := Σ_{i=0..ℓ−1} (T[Bi[j]] &lt;&lt; i) is computed and, by regarding the vector X as an 8-character array, the counter of each character X[k] is increased, for 0 ≤ k &lt; 8.</p>
        <p>Task (b) consists in counting the occurrences of a given input character c ∈ Σ in the string y. Using the standard representation, this task can be carried out through a simple iterative cycle that, while sweeping over all characters of the text, increases a counter by 1 whenever an occurrence of c is found. With the bit-layers representation, task (b) can be accomplished through a scan of the text based on vertical adaptive parallelism. Initially, ℓ binary vectors Pi (each of size w), for 0 ≤ i &lt; ℓ, are precomputed, where</p>
        <p>Pi := if φ(c)[i] = 1 then 0^w else 1^w endif.
⁴ The C codes of all tested algorithms are available at http://www.dmi.unict.it/~faro/BLE/</p>
        <p>Table 1. Processing speeds (billions of characters per second) for the frequencies count task (a) and the occurrence count task (b):
Frequencies Count — genome: Standard 0.782, Bit-Layers 1.043; protein: Standard 0.880, Bit-Layers 1.032; english: Standard 0.880, Bit-Layers 1.024.
Occurrence Count — genome: Standard 1.766, Bit-Layers 13.88; protein: Standard 1.552, Bit-Layers 4.854; english: Standard 1.572, Bit-Layers 4.065.</p>
        <p>In the subsequent scanning phase, the bit layers of the string y are read in blocks of w bits. Specifically, for 0 ≤ j &lt; ⌈n/w⌉, a sequence of ℓ bit vectors ⟨Xi : 0 ≤ i &lt; ℓ⟩ is computed, where X0 = B0[j] xor P0 and Xi = Xi−1 and (Bi[j] xor Pi), for 1 ≤ i &lt; ℓ, provided that all the Xi's are nonnull. It can easily be checked that y[jw + k] = c ⟺ Xℓ−1[k] = 1 holds. Thus, if Xℓ−1 ≠ 0, the counter is increased by bitcount(Xℓ−1) units. Plainly, such a procedure is adaptive, since, during each iteration, say the j-th one with 0 ≤ j &lt; ⌈n/w⌉, as soon as Xi = 0 for some 0 ≤ i &lt; ℓ, the iteration is aborted and the execution resumes with the (j + 1)-st iteration.</p>
        <p>Table 1 reports the experimental results relative to the above two tasks under both the standard and the bit-layers representations, where speeds are expressed in billions of characters per second. As for the frequency count task (a), the bit-layers representation allows for a slightly faster access to textual data than the standard representation, especially in the case of small alphabets, characterized by a modest number of layers. When the size of the alphabet increases, our approach unavoidably gets slower. Concerning the occurrence count task (b), the vertical adaptive parallel approach based on the bit-layers representation achieves by far the best results, as it is from 2.5 times (for English texts) up to 8 times (for genome sequences) faster than the standard approach.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Exact String Matching</title>
        <p>
          The exact string matching problem is another basic task in text processing. It consists in finding all the (possibly overlapping) occurrences of an input pattern x of length m within a text y of length n, both strings over a common alphabet Σ of size σ. More formally, the problem aims at finding all positions j in y[0 .. n − m] such that y[j .. j + m − 1] = x. A huge number of solutions has been devised since the 1980s [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and, despite such a wide literature, still much work has been produced in the last few years, demonstrating that the need for efficient solutions is currently high.
        </p>
        <p>We designed the following three basic solutions to the exact string matching
problem, tuned to the bit-layers text representation.</p>
        <p>Brute-force algorithm based on horizontal word parallelism (BfH): For each position j in y[0 .. n − m], the chunk B̂i[j .. j + m − 1] from the i-th layer of y, for each 0 ≤ i &lt; ℓ, is compared with the corresponding layer of the pattern, D̂i[0 .. m − 1]. If B̂i[j .. j + m − 1] = D̂i[0 .. m − 1] for all 0 ≤ i &lt; ℓ, then a match is reported at position j. The algorithm BfH works adaptively, since as soon as B̂i[j .. j + m − 1] ≠ D̂i[0 .. m − 1] for some 0 ≤ i &lt; ℓ, the iteration for position j stops, and a new iteration starts from position j + 1 in y[0 .. n − m], if any.
Prefix-based algorithm based on vertical adaptive parallelism (Pfx): It is a prefix-based improvement of the algorithm BfH described above. A prefix table π : {0, …, 2^k − 1} → {1, …, k}, with k = min(m, 8), is precomputed, where, for any given binary vector B̂ of length k (which can be regarded as an integer in {0, …, 2^k − 1}), we have π(B̂) := min{s : 1 ≤ s &lt; k and B̂[s .. k − 1] = D̂0[0 .. k − s − 1]}, setting π(B̂) := k when no such s exists. At the end of each iteration during the searching phase, the current window is shifted by π(B̂0[j .. j + k − 1]) positions to the right.</p>
        <p>Suffix-based algorithm based on vertical adaptive parallelism (Sfx): It is another improvement (suffix-based) of the brute-force algorithm described earlier. A suffix table γ : {0, …, 2^k − 1} → {1, …, k}, with k = min(m, 8), is precomputed, where, for any given binary vector B̂ of length k (which as before can be regarded as an integer in {0, …, 2^k − 1}), we have γ(B̂) := min{s : 1 ≤ s &lt; k and D̂0[m − s − k .. m − s − 1] = B̂[0 .. k − 1]}, setting γ(B̂) := k when no such s exists. At the end of each iteration in the searching phase, the current window is shifted by γ(B̂0[j + m − k .. j + m − 1]) positions to the right.</p>
        <p>
          Table 2 shows the experimental results relative to the above algorithms (tuned to the bit-layers representation) against the following known solutions to the exact string matching problem (tuned to the standard representation), which are considered among the fastest algorithms in practical cases [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], namely the brute-force algorithm (Bf), the Boyer-Moore-Horspool algorithm (Hor) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], the Shift-And algorithm (Sa) based on bit-parallelism [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], and the Weak-Factor-Recognition algorithm (Wfr) implemented with q-grams (1 ≤ q ≤ 4).
        </p>
        <p>From Table 2, it turns out that algorithm BfH is always faster (up to 3 times) than its counterpart Bf and, in most cases, it is even faster than the Horspool algorithm (Hor). The suffix-based algorithm Sfx, tuned to the bit-layers representation, is in almost all cases the fastest algorithm, even faster (up to 3 times) than the algorithm Wfr, showing a sub-linear behaviour which rapidly improves as the length of the pattern increases.</p>
      </sec>
      <sec id="sec-2-3">
        <title>String Matching with Mismatches</title>
        <p>Approximate string matching has several variations. In this section we consider the δ-mismatches variation, where the task is to find all the occurrences of x with at most δ mismatches, where 0 ≤ δ &lt; m.⁵</p>
        <p>
          Shift-Add [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] was the first practical algorithm for solving the δ-mismatches problem. It is based on bit-parallelism, where a vector of m states is used to represent the state of the search. A field of log(m) + 1 bits is used for representing each of the m states. When a mismatch is detected, the corresponding state is increased accordingly. Thus, a match is detected at a given position when the last state has a value less than or equal to δ. As in the case of other bit-parallel solutions, when m(log(m) + 1) is greater than w, multiple words need to be involved in the computation.
        </p>
        </p>
        <p>In the context of the bit-layers representation, we implemented the following algorithm:
Brute-force algorithm based on vertical adaptive parallelism (BfV): Given as before a text y of length n and a pattern of length m ≤ n, for each position j in y[0 .. n − m], the chunk B̂i[j .. j + m − 1] from the i-th layer of y, for 0 ≤ i &lt; ℓ, is compared with the corresponding layer D̂i[0 .. m − 1] of the pattern, by way of the xor bitwise operation. The result is a bit vector R̂i of size m, such that R̂i[k] = 1 ⟺ B̂i[j + k] ≠ D̂i[k], for 0 ≤ k &lt; m. Let R̂(0,i) := R̂0 or R̂1 or … or R̂i. If bitcount(R̂(0,ℓ−1)) ≤ δ, then an occurrence of the pattern is reported at position j. The algorithm BfV works adaptively, since as soon as bitcount(R̂(0,i)) &gt; δ, for some i &lt; ℓ, the iteration at position j stops, and the subsequent iteration starts from position j + 1 in y[0 .. n − m], if any.</p>
        <p>
          Table 3 reports the experimental results relative to the comparison of three algorithms for the approximate string matching problem allowing for at most δ mismatches, with δ ∈ {1, 3}. Specifically, we tested a brute-force algorithm (Bf) and the Shift-Add (Sa) bit-parallel algorithm [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] (both tuned to the standard text representation) and our brute-force algorithm (BfV) described above, tuned to our novel bit-layers representation.
⁵ In this context, when δ = 0 the approximate problem just reduces to the exact string matching problem.
        </p>
        <p>Table 3. Comparison of the algorithms Bf, Sa and BfV, for δ ∈ {1, 3}, on the genome, protein and English sequences, with patterns of length m ∈ {4, 8, 16, 32}.</p>
        <p>For short patterns, our algorithm BfV turns out to be slower than the Shift-Add bit-parallel algorithm, and it gets slower as the bound δ increases. This is due to the need, when the bound gets larger and larger, to go all the way in depth along the layers of the representation. However, thanks to its horizontal word parallelism, our solution shows a marked sub-linear behaviour, getting faster and faster (up to 2 times faster) than the Shift-Add algorithm for patterns of length greater than or equal to 16. In fact, as the pattern length increases, the Shift-Add algorithm gets slower and slower because of the need to involve more and more words in its computation, widening the performance gap with our algorithm BfV.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusions and Future Works</title>
      <p>In this paper we introduced the bit-layers text representation, a novel succinct
string representation in which textual data are arranged into a two-dimensional
structure, which allows text processing to proceed both horizontally (by
handling multiple data in constant time) and vertically (by moving in depth along
the layers of the representation). The bit-layers text representation aims at
improving text searching solutions, and we believe that it may represent a valuable
contribution to applications dealing with text searching. To substantiate this
point, we showed how to solve some basic text processing tasks using the
bit-layers representation, and presented the results of an experimental comparison
of our solutions tailored to the bit-layers representation against those tuned for
the standard text representation. Our solutions take particular advantage of the
horizontal word parallelism and of the vertical adaptive parallelism intrinsic
in the two-dimensionality of the bit-layers representation, which could kick off
several technical and creative approaches to text searching.</p>
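      <p>As a toy illustration of this two-dimensional arrangement (an illustrative sketch, not the tuned implementation evaluated above), a text over an alphabet of at most 2^b symbols can be stored as b binary layers, where layer j collects bit j of every character:</p>

```python
def to_bit_layers(text, bits=8):
    """Pack bit j of every character into layer j (a Python int used as a
    bit-vector), so that word-sized chunks of each layer can be processed
    in parallel, and deeper layers are visited only when needed."""
    layers = [0] * bits
    for i, c in enumerate(text):
        code = ord(c)
        for j in range(bits):
            if (code >> j) & 1:
                layers[j] |= 1 << i
    return layers

def from_bit_layers(layers, n):
    """Reassemble the first n characters from the layers (the inverse map)."""
    return "".join(
        chr(sum(((layers[j] >> i) & 1) << j for j in range(len(layers))))
        for i in range(n)
    )
```

      <p>With 8 layers this covers an ASCII text; position i of layer j holds bit j of the i-th character, so the two maps are exact inverses.</p>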
      <p>We plan to study other, more effective solutions, tuned to the bit-layers
representation, to further basic text processing problems, and also to investigate
non-standard searching problems amenable to fast solutions under our proposed
text representation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>V. N.</given-names>
            <surname>Anh</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Moffat</surname>
          </string-name>
          .
          <article-title>Inverted index compression using word-aligned binary codes</article-title>
          .
          <source>Information Retrieval</source>
          ,
          <volume>8</volume>
          , pp.
          <fpage>151</fpage>
          -
          <lpage>166</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          and
          <string-name>
            <given-names>G. H.</given-names>
            <surname>Gonnet</surname>
          </string-name>
          .
          <article-title>A new approach to text searching</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>35</volume>
          (
          <issue>10</issue>
          ):
          <fpage>74</fpage>
          -
          <lpage>82</lpage>
          , (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>N. R.</given-names>
            <surname>Brisaboa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ladra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Navarro</surname>
          </string-name>
          .
          <article-title>DACs: Bringing direct access to variable-length codes</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>49</volume>
          , pp.
          <fpage>392</fpage>
          -
          <lpage>404</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>R.S.</given-names>
            <surname>Boyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.S.</given-names>
            <surname>Moore</surname>
          </string-name>
          .
          <article-title>A fast string searching algorithm</article-title>
          .
          <source>Commun. ACM</source>
          <volume>20</volume>
          (
          <issue>10</issue>
          ),
          <fpage>762</fpage>
          -
          <lpage>772</lpage>
          (
          <year>1977</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>D.</given-names>
            <surname>Cantone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Faro</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <article-title>Pavone: Speeding Up String Matching by Weak Factor Recognition</article-title>
          .
          <source>Stringology</source>
          <year>2017</year>
          , pp.
          <volume>42</volume>
          {
          <issue>50</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>P.</given-names>
            <surname>Elias</surname>
          </string-name>
          .
          <article-title>Efficient storage and retrieval by content and address of static files</article-title>
          .
          <source>Journal of the ACM</source>
          ,
          <volume>21</volume>
          ,
          <fpage>246</fpage>
          -
          <lpage>260</lpage>
          (
          <year>1974</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>R.</given-names>
            <surname>Fano</surname>
          </string-name>
          .
          <article-title>On the number of bits required to implement an associative memory</article-title>
          .
          <source>Memo</source>
          <volume>61</volume>
          , Computer Structures Group, Project MAC, Massachusetts
          (
          <year>1971</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>S.</given-names>
            <surname>Faro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lecroq</surname>
          </string-name>
          ,
          <article-title>An efficient matching algorithm for encoded DNA sequences and binary strings</article-title>
          .
          <source>In Proc. of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM 2009). Lecture Notes In Computer Science</source>
          , Vol.
          <volume>5577</volume>
          , Springer-Verlag, pp.
          <fpage>106</fpage>
          -
          <lpage>115</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>S.</given-names>
            <surname>Faro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lecroq</surname>
          </string-name>
          ,
          <article-title>The Exact Online String Matching Problem: a Review of the Most Recent Results</article-title>
          .
          <source>ACM Computing Surveys (CSUR)</source>
          , vol.
          <volume>45</volume>
          (
          <issue>2</issue>
          ), article
          <fpage>13</fpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>S.</given-names>
            <surname>Faro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lecroq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borzì</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Di Mauro</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Maggio</surname>
          </string-name>
          .
          <article-title>The String Matching Algorithms Research Tool</article-title>
          .
          <source>In Proc. of Stringology</source>
          , pp.
          <fpage>99</fpage>
          -
          <lpage>111</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>P.</given-names>
            <surname>Ferragina</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Venturini</surname>
          </string-name>
          .
          <article-title>A simple storage scheme for strings achieving entropy bounds</article-title>
          .
          <source>In Proc. 18th symp. on discrete alg</source>
          .
          <source>(SODA)</source>
          , pp.
          <fpage>690</fpage>
          -
          <lpage>696</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>R.</given-names>
            <surname>Grossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Vitter</surname>
          </string-name>
          .
          <article-title>High-order entropy-compressed text indexes</article-title>
          .
          <source>In Proc. 14th symp. on discrete alg</source>
          .
          <source>(SODA)</source>
          , pp.
          <fpage>841</fpage>
          -
          <lpage>850</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>R. N.</given-names>
            <surname>Horspool</surname>
          </string-name>
          ,
          <article-title>Practical fast searching in strings</article-title>
          ,
          <source>Software: Practice &amp; Experience</source>
          <volume>10</volume>
          (
          <issue>6</issue>
          ), pp.
          <fpage>501</fpage>
          -
          <lpage>506</lpage>
          (
          <year>1980</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Huffman</surname>
          </string-name>
          .
          <article-title>A method for the construction of minimum-redundancy codes</article-title>
          .
          <source>Proceedings of the Institute of Radio Engineers (IRE)</source>
          ,
          <volume>40</volume>
          (
          <issue>9</issue>
          ), pp.
          <fpage>1098</fpage>
          -
          <lpage>1101</lpage>
          (
          <year>1952</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Jacobson</surname>
          </string-name>
          .
          <article-title>Succinct static data structures (Ph.D.)</article-title>
          . Pittsburgh, PA: Carnegie Mellon University (
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>G.</given-names>
            <surname>Navarro</surname>
          </string-name>
          .
          <article-title>Compact Data Structures: A Practical Approach</article-title>
          . Cambridge University Press New York, NY, USA (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>H.</given-names>
            <surname>Peltola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tarhio</surname>
          </string-name>
          .
          <article-title>On String Matching in Chunked Texts</article-title>
          .
          <source>In Proc. of CIAA'07, Lecture Notes in Computer Science</source>
          , vol.
          <volume>4783</volume>
          , Springer Verlag, pp.
          <fpage>157</fpage>
          -
          <lpage>167</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>J.</given-names>
            <surname>Teuhola</surname>
          </string-name>
          .
          <article-title>Interpolative coding of integer sequences supporting log-time random access</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>47</volume>
          , pp.
          <fpage>742</fpage>
          -
          <lpage>761</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>H. E.</given-names>
            <surname>Williams</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          .
          <article-title>Compressing integers for fast file access</article-title>
          .
          <source>Computer Journal</source>
          ,
          <volume>42</volume>
          (
          <issue>3</issue>
          ), pp.
          <fpage>193</fpage>
          -
          <lpage>201</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>M.</given-names>
            <surname>Zukowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Heman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Boncz</surname>
          </string-name>
          .
          <article-title>Super-scalar ram-cpu cache compression</article-title>
          .
          <source>In Proc. 22nd international conference on data engineering (ICDE)</source>
          (pp.
          <fpage>59</fpage>
          ). Washington, DC, USA: IEEE Computer Society (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>