Enhancing Characters Distance Text Sampling
              by Condensed Alphabets

            Simone Faro1 , Francesco Pio Marino1 , Arianna Pavone2
                   1
                      Dipartimento di Matematica e Informatica,
          Università di Catania, viale A.Doria n.6, 95125, Catania, Italia
                                faro@dmi.unict.it
                        2
                          Dipartimento di Scienze Cognitive,
         Università di Messina, via Concezione n.6, 98122, Messina, Italia
                                 apavone@unime.it


      Abstract. Sampled string matching is an efficient approach to the string
      matching problem introduced in order to overcome the prohibitive space
      requirements of indexed matching, on the one hand, and drastically re-
      duce searching time for the online solutions, on the other hand. Known
      solutions to sampled string matching are able to speed up the searching
      up to 9 times while using less than 11% of the text size. However, they
      appear to work efficiently only in the case of natural language texts or,
      in general, when searching on input sequences over large alphabets.
      In this paper we extend sampled-string matching to the case of small al-
      phabets obtaining a new efficient solution which turns out to be feasible
      also for searching biological data like genome or protein sequences. Our
      solution extends a recent approach called Character Distance Sampling
      by using condensed characters in order to enlarge the size of the alpha-
      bet and speed up the searching process. From our experimental results
      it turns out that our solution significantly reduces the searching time
      when compared against previous sampled string matching algorithms.
      In particular on biological data it leads to reduce the space consumption
      up to 80% and to speed up the searching up to 96%, thus improving the
      standard online string matching algorithm up to 99.6% using less than
      1% of the original text size.


1    Introduction

Exact string matching is a fundamental problem in computer science and in the
wide domain of text processing. It consists in finding all the occurrences of a
given pattern x, of length m, in a large text y, of length n, where both sequences
are composed by characters drawn from an alphabet Σ of size σ.
    It is a critical problem in computational molecular biology and plays a very
important role in biological sequences analysis, mainly due to the constantly
growing amount of molecular data extracted from living organisms. For this
reason sequence matching techniques play a very important role in various ap-
plications in computational biology for data analysis.
2       S. Faro, F.P. Marino and A. Pavone

    As the size of data increases the space needed to store it is constantly increas-
ing too, for this reasons the need for new efficient approaches to the problem.
    Applications require two kinds of solutions: online and offline string match-
ing. Solutions based on the first approach assume that the text is not prepro-
cessed and thus they need to scan the input sequence online, when searching.
Their worst case time complexity is Θ(n), and was achieved for the first time
by the well known Knuth-Morris-Pratt (KMP) algorithm [13], while the opti-
mal average time complexity of the problem is Θ(n logσ m/m) [16], achieved for
example by the Backward-Dawg-Matching (BDM) algorithm [5].
   Many string matching solutions have been also developed in order to obtain
sub-linear performance in practice [7]. Among them the Boyer-Moore-Horspool
(BMH) algorithm [1,12] deserves a special mention, since it has inspired much
work.
    Memory requirements of this class of algorithms are very low and gener-
ally limited to a precomputed table of size O(mσ) or O(σ 2 ) [7]. However their
searching time is always proportional to the length of the text and thus their
performances may stay poor in many practical cases, especially for huge texts
and short patterns.
    Solutions based on the second approach try to drastically speed up searching
by preprocessing the text and building a data structure that allows searching
in time proportional to the length of the pattern. This method is called indexed
searching [7,1,11]. However, despite their optimal time performances, space re-
quirements of such data structures are from 4 to 20 times the size of the text,
which may be too large for many practical applications.
    Leaving aside other different approaches, like those based on compressed
string matching [14,2], which turn out to be theoretically efficient and are hardly
feasible in practical cases, an alternative solution to the problem is sampled string
matching, introduced in 1991 by Vishkin [15], which consists in the construction
of a succinct sampled version of the text (which must be maintained together
with the original text) and in the application of any online string matching al-
gorithm directly on the sampled sequence. Although any candidate occurrence
of the pattern may be found more efficiently, the drawback of this approach is
that any occurrence reported in the sampled-text requires to be verified in the
original text. Apart from this point a sampled-text approach may have a lot
of good features: it may be easy to implement if compared with other succint
matching approaches, it may require very small extra space and may allow fast
searching. Additionally it may also allow fast updates of the data structure.
    Apart the theoretical result of Vishkin, the first practical solution to sam-
pled string matching has been introduced by Claude et al. [4] and is based on
an alphabet reduction. Their solution has an extra space requirement which is
only 14% of text size and turns out to be up to 5 times faster than standard
online string matching on English texts. In this paper we refer to this algorithm
as Occurrence Text Sampling (OTS).
                                  Title Suppressed Due to Excessive Length       3

    More recently Faro et al. presented a more effective sampling approach based
on character distance sampling (CDS) [10,9], obtaining in practice a speed up
by a factor of up to 9 on English texts, using limited additional space whose
amount goes from 11% to 2.8% of the text size, with a gain in searching time up
to 50% if compared against the previous solution. However the previous sampling
approaches to exact string matching prove to work efficiently only in the case
of natural language texts or, in general, when searching on input sequences over
large alphabets, while their performances degrade when the size of the underlying
alphabets decreases.
    In this paper, we present an extension of the approach proposed by Faro et
al. [10] to small alphabets which turns out to be much more feasible in the case
of biological data like genome or protein sequences and, in general, in the case
of small alphabets. Our proposed approach makes use of condensed characters
in order to enlarge the size of the underlying alphabet and, as a result, speed up
the searching process and reduce the space consumption of the resulting sam-
pled text. From our experimental results it turns out that the use of condensed
alphabets leads to reduce the space consumption up to 80% and to speed up
the searching process up to 95%, significantly improving the results obtained
by the original text sampling technique. In addition, the new approach, while
designed to work well with text on small alphabets, is also particularly effec-
tive on large alphabets. Our experimental results prove how the approach based
on condensed characters significantly improves performance even in the case of
natural language texts.
    The paper is organized as follows. In Section 2 we present the Characters
Distance Sampling approach introduced by Faro et al. [10] and we extend it to
condensed alphabets in Section 2.1. In Section 3 we describe the three procedures
used for searching a pattern in a text using the sampled text extended with
condensed alphabets. Then in Section 4 we present experimental results and
draw our conclusions in Section 5.


2   CDS and Condensed Alphabets

Let y be the input text, of length n, and let x be the input pattern, of length m,
both over an alphabet Σ of size σ. We assume that all strings can be treated as
vectors starting at position 1. Thus we refer to x[i] as the i-th character of the
string x, for 1 ≤ i ≤ m, where m is the size of x.
    We elect a set C ⊆ Σ to be the set of pivot characters. Given this set of pivot
characters we sample the text y by taking into account the distances between
consecutive positions of any pivot characters c ∈ C in y. More formally our
sampling approach is based on the following definition of position sampling of a
text.

Definition 1 (Position Sampling). Let y be a text of length n, let C ⊆ Σ be
the set of pivot characters and let nC be the number of occurrences of any c ∈ C
in the input text y.
4         S. Faro, F.P. Marino and A. Pavone

    First we define the position function, δ : {1, .., nC } → {1, .., n}, where δ(i) is
the position of the i-th occurrence of any character of C in y. Formally we have

      (i) 1 ≤ δ(i) < δ(i + 1) ≤ n                          for each 1 ≤ i ≤ nC − 1
     (ii) y[δ(i)] ∈ C                                      for each 1 ≤ i ≤ nC
    (iii) y[δ(i) + 1..δ(i + 1) − 1] contains no chars of C for each 0 ≤ i ≤ nC

where in (iii) we assume that δ(0) = 0 and δ(nC + 1) = n + 1.
    Then the position sampled version of y, indicated by ẏ, is a numeric sequence,
of length nC , defined as

                               ẏ = hδ(1), δ(2), .., δ(nC )i.                      (1)

Example 1. Suppose y = “agaacgcagtata” is a dna sequence of length 13, over
the alphabet Σ = {a,c,g,t}. Let C = {a} be the set of pivot characters. Thus
the position sampled version of y is ẏ = h1, 3, 4, 8, 11, 13i. Specifically the first
occurrence of character c ∈ C is at position 1 (y[1] = a), its second occurrence
is at position 3 (y[3] = a), and so on.

Definition 2 (Characters Distance Sampling). Let C ⊆ Σ be the set of
pivot characters, let nC ≤ n be the number of occurrences of any pivot character
in the text y and let δ be the position function of y. We define the characters
distance function ∆(i) = δ(i + 1) − δ(i), for 1 ≤ i ≤ nC − 1, as the distance
between two consecutive occurrences of any pivot character in y.
    Then the characters-distance sampled version of the text y is a numeric se-
quence, indicated by ȳ, of length nC − 1 defined as

                        ȳ = h∆(1), ∆(2), .., ∆(nC − 1)i
                                                                                   (2)
                           = hδ(2) − δ(1), .., δ(nC ) − δ(nC − 1)i

      Plainly we have
                                   C −1
                                  nX
                                          ∆(i) ≤ n − 1.
                                   i=1


Example 2. Let y = “agaacgcagtata” be a text of length 13, over the alphabet
Σ = {a,c,g,t}. Let C = {a} be the set of pivot characters. Thus the character
distance sampling version of y is ȳ = h2, 1, 4, 3, 2i. Specifically ȳ[1] = ∆(1) =
δ(2) − δ(1) = 3 − 1 = 2, while ȳ[3] = ∆(3) = δ(4) − δ(3) = 8 − 4 = 4, and so on.

Definition 3 (Rank of a character). Let x be a pattern of length m, and let
c ∈ Σ. We define φ : Σ → {0..m} as the function which associates any character
of the text with the number of its occurrences in x. The rank of the character c
is the position of c in the alphabet Σ, if we assume that all characters are sorted
by their φ(c) values in decreasing order. More formally the rank of c is given by
the cardinality of the set {k ∈ Σ | φ(k) > φ(c)} + 1
                                    Title Suppressed Due to Excessive Length       5

2.1   Extension to Condensed Alphabets
Let y be an input sequence, of length n, over an alphabet Σ of size σ. Given a
                                                                           (q)
constant parameter q, with 1 ≤ q < n, we define the condensed alphabet Σy ,
related to y, as

            {c ∈ Σ q | c = y[i..i + q − 1] for some 1 ≤ i ≤ n − q + 1}
                      (q)
Roughly speaking Σy is the set of all different subsequences of length q (or
q-grams) appearing in y. We define the q-condensed version of y as follows.
Definition 4 (q-Condensed Sequence). Let y be a text of length n over an
                               (q)
alphabet Σ of size σ and let Σy be the condensed alphabet, related to y, for a
given constant parameter q. We define the q-condensed version of the sequence y
as the sequence, of length n−q+1, of all consecutive (and overlapping) substrings
of length q appearing in y. More formally

             y (q) = hy[1..q], y[2..q + 1], y[3..q + 2], .., y[n − q + 1..n]i

Example 3. Assume y = “agtagcgcagt” is a dna sequence of length 11, over the
alphabet Σ = {a,c,g,t}. Then we have

               y (2) = hag, gt, ta, ag, gc, cg, gc, ca, ag, gti
               y (3) = hagt, gta, tag, agc, gcg, cgc, gca, cag, agti
               y (4) = hagta, gtag, tagc, agcg, gcgc, cgca, gcag, cagti
                                                                            (q)
Definition 5 (q-Characters Distance Sampling). Let C ⊆ Σy be the set
of pivot characters, let nC ≤ n be the number of occurrences of any pivot char-
acter in the text y (q) and let δ be the position function of y (q) . We define the
q-characters distance function ∆(q) as the distance between two consecutive oc-
currences of any pivot character in y (q) , where ∆(q) (i), for 1 ≤ i ≤ nC − 1, is
the distance between the (i + 1)-th and the i-th occurrence of any occurrences of
any pivot character in y (q) .
    Then the q-characters-distance sampled version of the sequence y is a nu-
meric sequence of length nC − 1, indicated by ȳ (q) and defined as

                    ȳ (q) = h∆(q) (1), ∆(q) (2), .., ∆(q) (nC − 1)i.             (3)

Example 4. As in the previous Example 3 assume y = “agtagcgcagtagta” is a
dna sequence of length 15, over the alphabet Σ = {a,c,g,t}. If we suppose q = 2
and C = “ag 00 is the set of pivot characters, then we have

                                  ẏ (2) = h1, 4, 9, 12i
                                  ȳ (2) = h3, 5, 3i.

Similarly, if we suppose q = 3 and C = {“agt” } is the set of pivot characters,
then we have
                                ẏ (3) = h1, 9, 12i
                                ȳ (3) = h8, 3i.
6       S. Faro, F.P. Marino and A. Pavone

3    The Sampled Text Searching Algorithm

Let y be an input text of length n over an alphabet Σ of size σ, let q > 1 and
      (q)
let Σy be the condensed alphabet over Σ. In addition let C ⊆ Σ (q) be the set
of pivot characters.
    In this paper we do not go into the way for a correct selection of the set
of pivot characters, and even we leave the details of an analysis about what
is the best subset to be chosen. However in our experimental evaluation (see
Section 4) we will show how it is enough to put a single character in the set of
pivot characters. Such character is selected on the basis of its rank value, where
we remember that the rank of a character c corresponds to its position in the
alphabet Σ when we assume that all characters are sorted by their frequencies
inside the text (see Definition 3).
    During the preprocessing phase the algorithm performs a scanning of the text
y and builds the corresponding position sampled text ẏ (q) . Assuming that the
maximum distance between two consecutive occurrences of the pivot character
is bounded by γ, by Definition 5, the sequences ẏ (q) and ȳ (q) require (nC ) log(n)
and (nC ) log(γ) bits, respectively, to be maintained.
    Let now x be an input pattern of length m and let mC be the number of oc-
currences of any pivot character in x. The searching phase can be then divided in
three different subroutines, depending on the value of mC . All searching proce-
dures work using a filtering approach. The idea behind such searching procedures
is to take advantage of the sampled text ẏ (q) computed during the preprocessing
phase in order to quickly locate any candidate substring s of the original text
which may include an occurrence of the pattern.
    If such candidate substring s has length m then the algorithm simply per-
forms a character-by-character comparison between the pattern and the sub-
string. Otherwise if the candidate substring s has length greater than m, then a
searching procedure is called, based on a standard exact online string matching
algorithm, for searching the pattern x in s.
    We can prove that if we suppose the underlying algorithm to be characterized
by a linear worst case time complexity, as in the case of the KMP algorithm [13],
then our solution achieves the same complexity in the worst case. Similarly if we
implement the underlying searching procedure using an optimal average string
matching algorithm, like the BDM algorithm [12], the resulting solution achieves
an optimal O(n logσ m/m) average time complexity.
    In what follows we describe in details the three different searching procedures
which are applied when mC = 0, mC = 1 and mC > 1, respectively. We will
refer to the pseudo-codes of procedures Search-0, Search-1 and Search-2+
as shown in Figure 1.

Case 1: mC = 0
If the pattern contains no occurrence of any pivot characters, we have that mC
is equal to 0. Under this assumption the algorithm searches for the pattern x
in all substrings of the original text which do not contain the pivot characters.
                                             Title Suppressed Due to Excessive Length   7

Search-0(x, ẏ (q) , y, q)
1. m ← len(x)
2. nC ← len(ẏ)
3. ẏ (q) [0] ← 0
3. ẏ (q) [nC + 1] ← n − q + 2
4. for i ← 1 to nC + 1 do
5.         if (ẏ (q) [i] − ẏ (q) [i − 1] + q − 2 ≥ m) then
6.                 l ← ẏ (q) [i − 1] + 1
7.                 r ← ẏ (q) [i] + q − 2
8.                 search for x in y[l..r]

Search-1(x, ẏ (q) , y, q)
1. m ← len(x)
2. nC ← len(ẏ)
3. α ← min{i : x(q) [i] ∈ C}
4. ẏ (q) [0] ← 0
5. ẏ (q) [nC + 1] ← n − q + 2
6. for i ← 1 to nC + 1 do
7.         if (ẏ (q) [i − 1] − ẏ (q) [i − 2] > α − 1 and
                   ẏ (q) [i] − ẏ (q) [i − 1] > m − α) then
8.                 l ← ẏ (q) [i − 1] − α + 1
9.                 r ← ẏ (q) [i − 1] − α + m
10.                compare x and y[l..r]

Search-2+ (x, ẏ (q) , y, q)
1. m ← len(x)
2. (x̄(q) , m̄) ← Compute-Distance-Sampling(x, m, C)
3. search for x̄(q) in ȳ (q) :
4.       for each i such that x̄(q) = ȳ (q) [i..i + m̄ − 1] do
5.              l ← ẏ (q) [i] − ẏ (q) [0]
6.              r ← ẏ (q) [i] + m − 1
7.              compare x and y[l..r]

Fig. 1. The pseudocode of procedures Search-0, Search-1 and Search-2+ , respec-
tively, for the sampled string matching problem.


Specifically such substrings are identified in the original text by the intervals
[δ (q) (i) + 1..δ (q) (i + 1) + q − 2], for each 0 ≤ i ≤ nC , assuming δ (q) (0) = 0 and
δ (q) (nC + 1) = n − q + 2.

      Specifically, for each 1 ≤ i ≤ nC + 1, the algorithm checks if the value
ẏ (q) (i) − ẏ (q) (i − 1) + q − 2 is greater or equal to m. In such a case the algorithm
searches for x in the substring of the text y[ẏ (q) [i−1]+1..ẏ (q) [i]+q −2] using any
standard string matching algorithm. Otherwise the substring is skipped, since
no occurrence of the pattern could be found at such position.
8        S. Faro, F.P. Marino and A. Pavone

Case 2: mC = 1
If the pattern x contains a single occurrence of any character of the set C, then
the length of the sampled version of the pattern is still equal to 0. However also
in this case the algorithm is able to efficiently take advantage of the information
precomputed in ẏ (q) using the positions of the pivot character in y (q) as an anchor
to locate all candidate occurrences of x.
      Specifically, let α be the unique position in x which contains the pivot char-
acter, i.e. we assume that x[α..α + q − 1] = c and that both x[1..α − 1] and
x[α + 1..m] do not contain any pivot character. Then, for each 0 ≤ i ≤ nC − 1,
the algorithm checks if the value ẏ (q) (i − 1) − ẏ (q) (i − 2) is greater than α − 1
and if the value ẏ (q) (i) − ẏ (q) (i − 1) is greater than m − α.
      In such a case the algorithm merely checks if the substring of the text
y[ẏ (q) [i − 1] − α + 1..ẏ (q) [i − 1] − α + m] is equal to the pattern. Otherwise the
substring is skipped. As before we assume that ẏ(0) = 0 and ẏ(nC + 1) = n + 1.

Case 3: mC ≥ 2
If the number of occurrences of any pivot character in C is greater than 1 then
the algorithm uses the sampled text ẏ (q) to compute on the fly the sampled
version ȳ (q) of y (q) and use it to search for any occurrence of x̄(q) . This is used
as a filtering phase for locating in y any candidate occurrence of x.
      First the character distance sampled version x̄ of x is computed. Then the
algorithm searches for x̄ in ȳ using any exact online string matching algorithm.
Notice that ȳ can be efficiently retrieved online from the sampled text ẏ, using
relation given in (2).
      For each candidate occurrence i of x̄ located in ȳ, an additional procedure
must be run to check if such occurrence corresponds to a match of the whole
pattern x in y. For this purpose the algorithm checks if the substring of the text
y[ẏ (q) [i] − ẏ (q) [0]..ẏ (q) [i] + m − 1] is equal to x, where ẏ (q) [0] is the position of
the first occurrence of the pivot character into the pattern.
      Regarding the time complexity we can prove that, assuming an underlying
auxiliary string matching algorithm with a linear worst case and a O(n log m/m))
average case time complexity, the resulting algorithm based on Character Dis-
tance Sampling achieves an optimal O(n) time complexity in the worst-case and
a O(n log m/m)) time complexity in the average case.


4    Experimental Results

In this section we briefly present experimental results obtained by comparing
the new proposed algorithms (with values of q ranging between 2 and 4) against
the standard distance sampling algorithm (obtained with q set to 1). We also
compare our algorithms against the Occurrence Text Sampling algorithms im-
plemented using q-grams. Also in this case we used values of q ranging between
2 and 4.
   Following the same lines of previous papers on sampled string matching [4,10]
we tested all sampling solutions in combination with the Boyer-Moore-Horspool
                                   Title Suppressed Due to Excessive Length          9

algorithm [12] for the implementation of the underlying standard searching pro-
cedure. As a consequence, in our comparison we also included the Boyer-Moore-
Horspool string matching algorithm (in its standard implementation) in order to
understand how much the proposed sampling approach contributes to speed-up
a standard online string matching solution.3 Results are compared in terms of
space consumption and searching speed.
    All algorithms have been implemented using the C programming language,
and have been tested using a variant of the Smart4 tool [8] properly tuned for
testing string matching algorithms based on a text-sampling. Tests have been
executed on a MacBook Pro with 4 Cores, a 2.7 GHz Intel Core i7 processor, 16
GB RAM 2133 MHz LPDDR3, 256 KB of L2 Cache and 8 MB of Cache L3.
    The algorithms have been tested on two 5MB text buffers containing real
biological sequences. Specifically we used a genome sequence containing base
pairs of Escherichia coli (with σ = 4). from the Large Canterbury Corpus (http:
//www.data-compression.info/Corpora/CanterburyCorpus/). The sequence
is also available within the Smart tool.
    In our implementation we selected the pivot character on the basis of its rank
value, where we remember that the rank of a character c is the position of c in
the alphabet Σ, if we assume that all characters are sorted by their frequencies
inside the text (see Definition 3).
    Then we evaluated the behaviour of our algorithms for different values of
the rank r of the selected pivot character and specifically for r ranging between
1 (the most frequent character) and 16. Observe that if σ is the size of the
original alphabet Σ, then σ q is the size of the condensed alphabet Σ (q) . As a
consequence, in the case of experimental tests on genome sequences and q = 1,
the value of the rank r is limited in the range between 1 and 4, since 4 is the
size of the alphabet.


4.1   Space Requirements

In the context of text-sampling string-matching space requirement is one of the
most significant parameter to take into account. It indicates how much additional
space, with regard to the size of the original input sequences, is required by a
given solution to solve the problem.
    Text-sampling algorithms require to store the whole text together with the
additional sampled-text which is used to speed-up the searching phase. In this
context they are much more similar to online string-matching solutions and,
although they have the additional good property to allow a direct access to the
3
  Although there exists many other searching algorithms able to show better practical
  performances on biological data (see for instance [3,6]) this kind of comparison goes
  beyond the objectives of this paper. We expect that the proposed approach is able
  to enhance the performances of different string matching algorithms with different,
  though similar, rates.
4
  The Smart tool is available online for download at http://www.dmi.unict.it/
  ~faro/smart/ or at https://github.com/smart-tool/smart.
10                 S. Faro, F.P. Marino and A. Pavone


                                                        0.2%

     CD Sampling
                   2%                         q=1                                  q=3
                                              q=2                                  q=4

                   1%                                   0.1%


                         2   4   6   8 10 12 14 16             2   4   6   8 10 12 14 16
     OT Sampling


                                                        0.8%
                   2%
                                                        0.6%
                                                        0.4%
                   1%
                                                        0.2%
                                                         0%
                         2   4   6   8 10 12 14 16             2   4   6   8 10 12 14 16


Fig. 2. Space consumption of the text sampled algorithms for different pivot characters
with rank ranging from 1 to 16 and for different values of the parameter q, ranging
from 1 to 4. Data are reported in terms of percentage of memory used relative to
the original text size. On the top: memory space required in the case of Character
Distance Sampling using q-grams. On the bottom: memory space required in the case
of Occurrence Text Sampling using q-grams.


input text, to be of any practical interest they should require as little extra space
as possible.
     Fig. 2 shows the space consumption of the newly proposed text-sampling
algorithms for different values of q in terms of percentage of memory used relative
to the original text size. In the case of CDS memory space consumption is plotted
on variations of the rank of the pivot character, while in the case of OTS it
is plotted on variations of the size of the set of sampled characters. In both
cases such value ranges from 2 to 16. As expected, the function which describes
memory requirements follows a decreasing trend while the rank of the pivot
character increases. Similarly space consumption drastically decreases when the
size of q increases.
     Data reported in Fig. 2 show that, when compared against the standard sam-
pling algorithm (obtained with q = 1), the benefit in space consumption obtained
by the algorithms based on condensed alphabets is impressive. Specifically the
gain ranges from 72% (for r = 1 and q = 2) to 95% (for r = 16 and q = 4).
In addition we can observe a sensible gain in the space consumption also in
comparison with the OTS algorithms implemented using condensed alphabets.
     For the sake of completeness we would like to point out that standard algo-
rithms for the online string matching problem require an amount of space which
is, in general, proportional to the length of the pattern and/or to the size of the
alphabet. In this particular case (a 5MB text buffer) the Boyer-Moore-Horspool
                                   Title Suppressed Due to Excessive Length       11

algorithm requires only 1.24 KB of memory for implementing the occurrence
heuristic (equivalent to a O(σ)-space complexity), while some among the most
effective algorithms (for instance Wfrq [3], Skipq [6]) are implemented by means
of a hash table of size 65536, requiring 0.2 MB of additional space. Thus it turns
out that, under particular conditions (texts of moderate lengths), the practical
space requirements of our proposed sampling algorithms are comparable with
those of standard online string matching solutions.


4.2   Searching Time

In this section we compare the different text-sampling string matching algorithms
in terms of searching times. In this context we refer to the searching time as the
time needed to perform the searching of the pattern on both sampled and original
texts, including any preprocessing of the underlying searching algorithm. In our
analysis the searching time doesn’t include the preprocessing time needed to
construct the partial index.
     Fig. 3 and Fig. 4 shows the resulting searching times of all tested algorithms
when they were used for searching on a genome sequence and on an English
text, respectively. As stated at the beginning of this section we also included
the standard Boyer-Moore-Horspool string matching algorithm (identified by a
dashed gray line), in order to put on evidence the speed-up proposed by our
sampling algorithms.
     From experimental results it turns out that in almost all cases the best results
are obtained by the variants based on condensed alphabets and specifically for
q = 4. When using a value of q greater than 1, the speed up obtained by CDS is
always greater than 50% and reaches the value of 90% under suitable conditions,
i.e. for q = 4 and long patterns.
     In general the behaviour of CDS algorithms follow an increasing trend for
increasing rank values. Thus in most cases the better choice is to use the most
frequent element as the pivot character. Observe indeed that, when the rank
of the pivot character is greater than a given threshold, the performances of
the algorithms based on q-grams sensibly degrades. Specifically this threshold
is approximately equal to 6 for short pattens (m = 8), while it increases for
moderate patterns.
     Going into details of the improvement in terms of running times we observe
that the original CDS algorithm (q = 1) leads to improvements which are in
percentage between 74% (in the case of short patterns) and 77% (in the case
of long patterns) if compared with the underlying standard string matching
algorithm. The new CDS algorithms based on condensed alphabets give instead
much more evident improvements which range from 96% (for short patterns)
and 99.6% (in the case of long patterns) compared with the same algorithm.
This improvements translate into a gain up to 70% for short patterns and up to
96% in the case of long patterns, if compared with sampling solutions with no
condensed alphabets.
12       S. Faro, F.P. Marino and A. Pavone

                        m=8                                    m = 16

     30 GB/s                               30 GB/s                        hor
                                                                          q=1
                                                                          q=2
                                                                          q=3
     20 GB/s                                                              q=4
                                           20 GB/s                        ots


     10 GB/s                               10 GB/s


     0 GB/s                                   0 GB/s

                2 4 6 8 10 12 14 16                    2 4 6 8 10 12 14 16
                     m = 32                                 m = 64

                                           40 GB/s

     30 GB/s
                                           30 GB/s

     20 GB/s
                                           20 GB/s


     10 GB/s                               10 GB/s


     0 GB/s                                   0 GB/s

                2 4 6 8 10 12 14 16                    2 4 6 8 10 12 14 16


Fig. 3. Searching speed of the text sampling algorithms based on condensed alphabets
for searching on a genome sequence. We used values q, with 1 ≤ q ≤ 4, and pattern
lengths m, with 8 ≤ m ≤ 256. The dashed gray line represents the searching time
of the standard Boyer-Moore-Horspool algorithm on the original text while the black
solid line represents the best searching time of the OTS solution implemented with
q-grams and q = 4. Running times (in the y axis) are represented in GB/s. The x axis
represents the rank r of the pivot character in the case of the sampling algorithms,
with 1 ≤ r ≤ 16.
                                    Title Suppressed Due to Excessive Length         13

                           m=8                                    m = 16

                                            120 GB/s                           hor
                                                                               q=1
 80 GB/s                                                                       q=2
                                            100 GB/s
                                                                               q=3
                                             80 GB/s                           q=4
 60 GB/s                                                                       ots
                                             60 GB/s
 40 GB/s
                                             40 GB/s
 20 GB/s
                                             20 GB/s

  0 GB/s                                      0 GB/s

              2    4   6    8 10 12 14 16                 2 4 6 8 10 12 14 16
                           m = 32                              m = 64

                                            120 GB/s

                                            100 GB/s
 150 GB/s

                                             80 GB/s

 100 GB/s
                                             60 GB/s

                                             40 GB/s
  50 GB/s
                                             20 GB/s

    0 GB/s                                    0 GB/s

                  2 4 6 8 10 12 14 16                     2 4 6 8 10 12 14 16


Fig. 4. Searching speed of the CDS and the best version of OTS algorithms based
on condensed alphabets for searching on an English text. We used values q, with
1 ≤ q ≤ 4, and short patterns of lengths m, with m = 8 and 16. The dashed gray line
represents the searching time of the standard Boyer-Moore-Horspool algorithm on the
original text. Running times (in the y axis) are represented in GB per second. The x
axis represents the rank r of the pivot character (for CDS) and the size of the set of
sampled characters (for OTS), with 1 ≤ r ≤ 16.
14      S. Faro, F.P. Marino and A. Pavone

5    Conclusions

In this paper we have presented an extension of a text sampling approach, called
Character Distance, to the case of texts over small alphabets, as in the case of
biological sequences. This extension was carried out using condensed alphabets
in which consecutive groups of q characters are assimilated to a single element
of the alphabet, significantly extending the size of it. The result obtained by this
new approach was to significantly lower the execution time in the search phase
while keeping the space used by the index below the space used by the previous
approaches. Although our tests were limited to the exact string matching prob-
lem, obtaining excellent results, we believe that the approach can be effectively
generalized even to non-standard string matching. Our future studies will focus
in this direction in order to apply sampled string matching to other problems
related to text processing.


References

 1. Robert S. Boyer and J. Strother Moore. A fast string searching algorithm. Com-
    mun. ACM, 20(10):762–772, 1977. doi:10.1145/359842.359859.
 2. Domenico Cantone, Simone Faro, and Emanuele Giaquinta. Adapting boyer-
    moore-like algorithms for searching huffman encoded texts. Int. J. Found. Comput.
    Sci., 23(2):343–356, 2012. doi:10.1142/S0129054112400163.
 3. Domenico Cantone, Simone Faro, and Arianna Pavone. Linear and efficient string
    matching algorithms based on weak factor recognition. ACM J. Exp. Algorithmics,
    24(1):1.8:1–1.8:20, 2019. doi:10.1145/3301295.
 4. Francisco Claude, Gonzalo Navarro, Hannu Peltola, Leena Salmela, and Jorma
    Tarhio. String matching with alphabet sampling. J. Discrete Algorithms, 11:37–
    50, 2012. doi:10.1016/j.jda.2010.09.004.
 5. Maxime Crochemore, Artur Czumaj, Leszek Gasieniec, Stefan Jarominek, Thierry
    Lecroq, Wojciech Plandowski, and Wojciech Rytter. Speeding up two string-
    matching algorithms.       Algorithmica, 12(4/5):247–267, 1994.      doi:10.1007/
    BF01185427.
 6. Simone Faro. A very fast string matching algorithm based on condensed alpha-
    bets. In Riccardo Dondi, Guillaume Fertin, and Giancarlo Mauri, editors, Al-
    gorithmic Aspects in Information and Management - 11th International Confer-
    ence, AAIM 2016, Bergamo, Italy, July 18-20, 2016, Proceedings, volume 9778 of
    Lecture Notes in Computer Science, pages 65–76. Springer, 2016. doi:10.1007/
    978-3-319-41168-2\_6.
 7. Simone Faro and Thierry Lecroq. The exact online string matching problem: A
    review of the most recent results. ACM Comput. Surv., 45(2):13:1–13:42, 2013.
    doi:10.1145/2431211.2431212.
 8. Simone Faro, Thierry Lecroq, Stefano Borzi, Simone Di Mauro, and Alessandro
    Maggio. The string matching algorithms research tool. In Jan Holub and Jan
    Zdárek, editors, Proceedings of the Prague Stringology Conference 2016, Prague,
    Czech Republic, August 29-31, 2016, pages 99–111. Department of Theoretical
    Computer Science, Faculty of Information Technology, Czech Technical Univer-
    sity in Prague, 2016. URL: http://www.stringology.org/event/2016/p09.html.
                                  Title Suppressed Due to Excessive Length       15

 9. Simone Faro and Francesco Pio Marino. Reducing time and space in indexed string
    matching by characters distance text sampling. In Jan Holub and Jan Zdárek,
    editors, Prague Stringology Conference 2020, Prague, Czech Republic, August 31 -
    September 2, 2020, pages 148–159. Czech Technical University in Prague, Faculty
    of Information Technology, Department of Theoretical Computer Science, 2020.
    URL: http://www.stringology.org/event/2020/p13.html.
10. Simone Faro, Francesco Pio Marino, and Arianna Pavone. Efficient online string
    matching based on characters distance text sampling. Algorithmica, 82(11):3390–
    3412, 2020. doi:10.1007/s00453-020-00732-4.
11. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM,
    52(4):552–581, 2005. doi:10.1145/1082036.1082039.
12. R. Nigel Horspool. Practical fast searching in strings. Softw. Pract. Exp.,
    10(6):501–506, 1980. doi:10.1002/spe.4380100608.
13. Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern
    matching in strings. SIAM J. Comput., 6(2):323–350, 1977. doi:10.1137/0206024.
14. Gonzalo Navarro and Jorma Tarhio. Lzgrep: a boyer-moore string matching tool
    for ziv-lempel compressed text. Softw. Pract. Exp., 35(12):1107–1130, 2005. doi:
    10.1002/spe.663.
15. Uzi Vishkin. Deterministic sampling - A new technique for fast pattern matching.
    SIAM J. Comput., 20(1):22–40, 1991. doi:10.1137/0220002.
16. Andrew Chi-Chih Yao. The complexity of pattern matching for a random string.
    SIAM J. Comput., 8(3):368–387, 1979. doi:10.1137/0208029.