<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1016/J.TCS.2022.12.034</article-id>
      <title-group>
        <article-title>Improving Sampled Matching through Character Context Sampling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Simone Faro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thierry Lecroq</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Pio Marino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arianna Pavone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Scafiti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ICTCS'24: Italian Conference on Theoretical Computer Science</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Univ Rouen Normandie, INSA Rouen Normandie, Université Le Havre Normandie</institution>
          ,
          <addr-line>Normandie Univ, LITIS UR 4108, CNRS NormaSTIC FR 3638, IRIB, Rouen F-76000</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università di Catania</institution>
          ,
          <addr-line>viale A. Doria n.6, 95125, Catania</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Università di Palermo</institution>
          ,
          <addr-line>via Archirafi n.34, 90123, Palermo</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>291</volume>
      <fpage>30</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>Sampled String Matching is a hybrid approach to the string matching problem, blending online and offline solutions. Among various sampling methods, Character Distance Sampling (CDS) is one of the fastest and most versatile techniques. In optimal conditions, CDS can achieve a speedup of up to 100 times, while requiring minimal additional space, ranging from 10% to as little as 0.8% of the original text size. Furthermore, CDS is adaptable, effectively handling non-classical string matching problems and computing structural properties of strings such as periods and coverages. As with all sampling-based matching algorithms, a verification phase on the original text is necessary after the initial matching on the sampled strings. Often, the computational effort required for this verification phase can be substantial. In this article, we introduce a novel sampling method that tracks the context around each sampled location rather than the distances to those locations. This approach aims to reduce the number of candidate occurrences and the subsequent verification effort. Our experimental results indicate that the proposed method outperforms CDS, particularly for short patterns, achieving a speedup of between 15% and 40%.</p>
      </abstract>
      <kwd-group>
        <kwd>String Matching</kwd>
        <kwd>Sampled String Matching</kwd>
        <kwd>Text Processing</kwd>
        <kwd>Contextual Information</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The string matching problem involves identifying all instances of a pattern x of length m within a text y of length n, both defined over an alphabet Σ of size σ. This task is fundamental in text processing and underpins various software implementations across multiple operating systems. Its importance is highlighted by the continued prevalence of text as the primary medium for information exchange, despite diverse data storage formats. This is particularly evident in linguistics, which relies on extensive corpora and dictionaries, and in computer science, where large amounts of data are stored in linear files.
Applications of string matching rely on two main approaches: online and offline string matching. The online approach handles unprocessed text, necessitating real-time analysis during the search operation. Its worst-case time complexity is Θ(n), first achieved by the renowned Knuth-Morris-Pratt (KMP) algorithm [1]. Its average time complexity is Θ(n log_σ(m)/m), initially realized by the Backward-Dawg-Matching (BDM) algorithm [2]. Numerous solutions have been devised to attain sub-linear performance in practical scenarios [3], with the Boyer-Moore-Horspool algorithm [4, 5] standing out, having influenced subsequent research. In contrast, the offline approach aims to expedite searches through text preprocessing, creating data structures that facilitate search operations. Known as indexed searching, this method includes several efficient solutions. Prominent examples include suffix trees [6], which offer O(m + occ) worst-case search time, suffix arrays [7] with a time complexity of O(m + log n + occ), where occ is the number of occurrences of the searched pattern, and the FM-index [8], a compressed structure derived from the Burrows-Wheeler transform that combines input compression with efficient substring queries. However, these full-text indexes require additional storage space, ranging from four to twenty times the size of the text.
A hybrid technique in the literature is Sampled String Matching, introduced by Vishkin in 1991 [9]. This method involves creating a partial index of the text and applying online string matching algorithms to this index. While this approach accelerates the detection of candidate pattern occurrences, each match in the sampled text must be verified within the original text. The sampled-text methodology offers several benefits: it generally requires straightforward implementation and minimal additional space, and it enables fast search and update operations. Beyond Vishkin’s theoretical contributions, a more practical solution was presented by Claude et al. [10], who developed an alphabet reduction technique. Their method requires an extra space of 14% of the original text size and achieves up to a fivefold increase in search speed compared to traditional online string matching algorithms on English texts. They also introduced an indexed version of the sampled text, modifying the suffix array to index sampled positions. Recently, Faro et al. have introduced several algorithms in the sampling field, notably their Character Distance Sampling (CDS) approach [11, 12, 13, 14, 15, 16]. By sampling absolute positions of specific characters, referred to as pivot characters, their method has achieved up to a ninefold increase in speed on English texts, while requiring only 11% to 2.8% additional space relative to the text size. This method provides a 50% reduction in search times compared to previous text sampling techniques [12].
In this paper, we introduce a novel sampling method called Character Context Sampling (CCS), which is designed to compactly track the context surrounding pivot characters identified within the text. This method involves storing the hash of the substring (or a portion thereof) located between two consecutive occurrences of the pivot character. Our experimental results demonstrate that this technique significantly reduces the number of verifications required to identify matches, thereby substantially decreasing search times, while maintaining the same amount of space required for storing the partial index. The paper is organized as follows. In Section 2 we briefly review the CDS method and the most recent results achieved in this field. Section 3 introduces the new sampling method based on the use of context around pivot characters. In Section 4 we provide some experimental results, and in Section 5 we draw our conclusions.
</p>
      <p>2. The Character Distance Sampling Method</p>
      <p>This section outlines the methodology employed to build a partial index using the Character Distance Sampling (CDS) technique. These concepts are relevant for the description of the new sampling method introduced in this paper. Consider an input text y of length n and an input pattern x of length m, both defined over an alphabet Σ of size σ. We treat all strings as vectors starting at position 0. Thus, y[i] refers to the (i + 1)-th character of the string y, for 0 ≤ i &lt; n. The algorithm selects a sub-alphabet C ⊆ Σ to serve as the set of pivot characters. Using these designated pivots, the text y is sampled by calculating the distances between consecutive occurrences of any pivot character c ∈ C within y. Formally, this sampling methodology is based on the concept of position sampling within the text. Let n_C denote the number of occurrences of pivot characters in y. Define δ : {0, . . . , n_C − 1} → {0, . . . , n − 1}, where δ(i) represents the position of the (i + 1)-th occurrence of a pivot character c ∈ C in y. The position-sampled version of y, denoted by ẏ, is a numerical sequence of length n_C defined as ẏ = ⟨δ(0), δ(1), . . . , δ(n_C − 1)⟩. Define the Character Distance Function Δ : {0, . . . , n_C − 2} → {0, . . . , n − 1}, where Δ(i) = δ(i + 1) − δ(i) represents the distance between two consecutive occurrences of any pivot character in y. The character-distance sampled version of the text y, denoted by ȳ, is a numerical sequence of length n_C − 1 defined as:</p>
      <p>¯ = ⟨Δ(0), Δ(1), . . . , Δ( − 1)⟩ = ⟨ (1) −  (0),  (2) −  (1), . . . ,  ( − 1) −  ( − 2)⟩.
92 Example 1. Let  = "agaacgcagtata" be a text of length 13, over the alphabet Σ = {a,c,g,t}. Let
93  = {a} be the set of pivot characters. The position-sampled version of  is ˙ = ⟨0, 2, 3, 7, 10, 12⟩.
94 Specifically, the first occurrence of character "a" is at position 0 ([0] = "a"), its second occurrence
95 is at position 2 ([2] = "a"), and so on. In addition, the character-distance sampled version
96 of  is ¯ = ⟨2, 1, 4, 3, 2⟩. Specifically, ¯[0] = Δ(0) =  (1) −  (0) = 2 − 0 = 2, while
97 ¯[2] = Δ(2) =  (3) −  (2) = 7 − 3 = 4, and so on.
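The definitions above can be illustrated with a short sketch (an illustration of ours in Python; the paper's actual implementation is in C, and the function names here are not taken from it):

```python
def position_sample(text, pivots):
    # Positions of all pivot-character occurrences: the sequence denoted y-dot.
    return [i for i, c in enumerate(text) if c in pivots]

def distance_sample(positions):
    # Distances between consecutive pivot occurrences: the sequence denoted y-bar.
    return [positions[i + 1] - positions[i] for i in range(len(positions) - 1)]

pos = position_sample("agaacgcagtata", {"a"})
print(pos)                    # [0, 2, 3, 7, 10, 12]
print(distance_sample(pos))   # [2, 1, 4, 3, 2]
```

Running it on the text of Example 1 reproduces the sampled sequences ⟨0, 2, 3, 7, 10, 12⟩ and ⟨2, 1, 4, 3, 2⟩.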
The sampled string matching approach using CDS maintains a partial index, represented by the position-sampled version ẏ of the text. The size of this index is 32 bits per sampled position, assuming the index resides in memory and is readily available for any search operation on the text. When searching for a pattern x of length m within y, a preprocessing step computes its sampled version x̄. It can be proven that an occurrence of x in y corresponds to an occurrence of x̄ in ȳ. Thus, any string matching algorithm can be used to locate occurrences of x̄ in ȳ to solve the problem. However, the reverse is not necessarily true. Therefore, each occurrence of x̄ in ȳ, termed a candidate occurrence, requires a validation check in y. Given that the validation process takes O(m) time, the entire search operation consumes O(nm) time in the worst case. Nevertheless, modifications to the fundamental procedure can ensure that the overall search remains linear in time (see [12] for further details), and the approach can be implemented using any online algorithm studied in the literature [17].
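To see why the converse fails, consider the hypothetical pattern "acga" (our illustrative choice) over the text of Example 1: its distance sequence ⟨3⟩ occurs in ȳ, yet the pattern itself does not occur in y. A sketch of ours (not the authors' code) showing candidate filtering and verification:

```python
def find_candidates(text, pattern, pivots):
    # CDS search sketch: match the pattern's distance sequence against the
    # text's, then verify each candidate in the original text. Assumes the
    # pattern contains at least two pivot occurrences.
    pos = [i for i, c in enumerate(text) if c in pivots]
    y_bar = [pos[i + 1] - pos[i] for i in range(len(pos) - 1)]
    p_pos = [i for i, c in enumerate(pattern) if c in pivots]
    x_bar = [p_pos[i + 1] - p_pos[i] for i in range(len(p_pos) - 1)]
    lam = p_pos[0]                  # offset of the first pivot in the pattern
    w = len(x_bar)
    candidates, matches = [], []
    for j in range(len(y_bar) - w + 1):
        if y_bar[j:j + w] == x_bar:          # sampled (candidate) occurrence
            s = pos[j] - lam                 # candidate start in the text
            candidates.append(s)
            if s >= 0 and text[s:s + len(pattern)] == pattern:
                matches.append(s)            # verified real occurrence
    return candidates, matches

cand, real = find_candidates("agaacgcagtata", "acga", {"a"})
print(cand, real)   # [7] [] : one candidate, but no real occurrence
```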
An important aspect of the CDS-based approach is that it does not explicitly maintain the character-distance sampled version ȳ of the text. Instead, it keeps the position-sampled version ẏ. (In practical applications, particularly when dealing with large alphabets, the set of pivot characters may include only one character. For simplicity, we often refer to the pivot character in the singular form.) Since ȳ retains only distances between pivot characters, without direct ties to their original positions, direct verification of every candidate occurrence is impracticable. This is resolved by maintaining ẏ and computing ȳ on the fly during the search. The i-th element of ȳ can be computed in constant time as ȳ(i) = ẏ(i + 1) − ẏ(i). The CDS-based sampled string matching approach has proven highly effective in practical applications, significantly reducing search times by up to 40 times compared to standard online exact string matching techniques. This improvement comes at a relatively low cost, requiring the construction of a partial index of only 2% of the text size.
Moreover, the sampled string matching method has shown exceptional flexibility, making it suitable for various text searching challenges, including approximate searches. Notably, Faro et al. [15] recently introduced run-length text sampling, which is tailored for approximate searches in texts, proving useful for tasks such as Order Preserving pattern matching [18, 19]. In addition to its space and time efficiency, the sampled string matching approach offers other advantages, such as ease of programming and the ability to adapt to text variations. Minor alterations in the text, like character deletions or insertions, can be easily reflected in the index. However, this method is not without its challenges. One such challenge is the performance variability based on the choice of pivot character. Strategic selection of the pivot character is crucial to balance partial index size and execution times. Research suggests that in the English language, the pivot character ranked 8th by frequency often provides the best performance. Another consideration is that if the pattern is very short and lacks occurrences of the pivot character, a standard string search within the text may be necessary. Additionally, the method may not yield significant benefits for texts with small alphabets, as space efficiency gains may not be realized. However, studies by Faro et al. [13] have demonstrated the effectiveness of techniques that use condensed alphabets to expand the alphabet size and improve performance.
A recent study introduced significant advancements in space and time efficiency through the introduction of fake samples [16], slightly increasing the number of elements in the partial index. Paradoxically, this results in a substantial three-quarters reduction in the overall space required to represent the data structure while maintaining algorithmic correctness. The idea lies in storing distances between pivot characters rather than their absolute positions within the text, which reduces the space used but introduces the challenge of directly addressing positions within the original text. The resulting fake-distance representation of CDS leverages a clever balance between adding minimal redundancy and achieving significant space reduction. By interspersing fake samples within the pivot character set, the method ensures that the partial index retains its efficiency in identifying potential matches. This leads to quicker verification processes and overall faster search times.
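One plausible reading of the fake-sample idea can be sketched as follows; this is an assumption-laden illustration of ours (one byte per distance, gaps capped at 255), not the actual encoding of [16]:

```python
def encode_distances(positions, cap=255):
    # One byte per sample; a gap larger than `cap` is split by inserting
    # "fake" samples, which correspond to no real pivot occurrence.
    out, fake, prev = [], [], 0
    for p in positions:
        gap = p - prev
        while gap > cap:
            out.append(cap)
            fake.append(len(out) - 1)   # remember which entries are fake
            gap -= cap
        out.append(gap)
        prev = p
    return bytes(out), fake

def decode_position(encoded, idx):
    # Absolute position of entry idx is the prefix sum of stored distances.
    return sum(encoded[: idx + 1])

enc, fake = encode_distances([0, 2, 3, 7, 300])
print(list(enc), fake)          # [0, 2, 1, 4, 255, 38] [4]
print(decode_position(enc, 5))  # 300
```

The prefix-sum decoding illustrates the direct-addressing challenge mentioned above: recovering an absolute position from stored distances requires extra bookkeeping, which the mapping table described later mitigates.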
For the sake of completeness, we emphasize that Lecroq and Marino have recently proven that the CDS representation can accelerate not only string matching algorithms but also other types of algorithms, such as those that compute structural properties of strings, including finding periodicities and covers [20]. This broadens the scope of CDS beyond traditional string matching, showcasing its adaptability and effectiveness in a wider range of computational problems.
</p>
      <p>3. Character Context Sampling</p>
      <p>As introduced in the previous sections, sampled string matching stands as a hybrid method that allows a practical speed-up in the searching phase with minimal costs required for space and pre-processing time. This technique leverages the benefits of sampling to reduce the overall search complexity, making it highly efficient for large datasets. On the other hand, a crucial requirement of all sampled string matching algorithms is the necessity of a verification phase after any candidate match is identified in the sampled version of the text. Although such a verification phase can be easily implemented in O(m) time by comparing all the pattern characters with the candidate substring in the text, in the worst-case scenario, where false matches occur at every sampled position, it would be necessary to perform a verification at each sampled position of the text. This results in a complexity of O(nm), which is sub-optimal compared with the linear time complexity achieved by several well-known algorithms. This trade-off highlights the importance of optimizing the sampling strategy to balance the speed-up in the search phase with the cost of running the verification phases. While the information provided by the CDS approach is enough to compute positions in the original text and execute the searching phase using various online algorithms [17], it is not well designed for avoiding (or reducing) false matches between the sampled version of the text and the pattern. To obtain this result it is instead necessary to store additional data. In this section, we introduce the Character Context Sampling (CCS) approach, an enhanced variant of the CDS method which stores information computed on the context around each pivot character instead of the distances stored by the CDS method. At the core of this idea is the necessity to incorporate contextual information about the position of each pivot character within the partial index used for the search. This approach aims to reduce the number of false positives, thereby minimizing the number of verifications required during the search phase. When we refer to a context, we mean a substring of fixed length k within the vicinity of the pivot character. Figure 1 schematically shows the idea on which the context-based sampling model is based. The figure compares the CDS and CCS methods when the same pivot character is used. Although the context size is set to k = 2, when two occurrences of the pivot character are closer than 2 characters apart, the context is reduced accordingly.
[Figure 1: a schematic comparison of CDS and CCS sampling of the same text using pivot character "a": CDS stores the distances between consecutive pivot occurrences, while CCS stores the context following each occurrence.]</p>
      <p>However, we immediately notice that memorizing entire substrings is far more expensive than memorizing individual positions. For this reason, due to the potentially high memory cost of storing the entire set of substrings representing the set of contexts, our method stores the context in a compact and approximate form by computing a fingerprint of the substrings, specifically a hash value.</p>
      <p>More formally, let hash : Σ* × {0, 1, . . . , n − 1} × {0, 1, . . . , n − 1} be a function which, taking as input a string w ∈ Σ* and two indices i and j such that 0 ≤ i &lt; j ≤ |w| − 1, computes a hash value of the substring w[i..j]. In addition, let k be an integer parameter such that k ≥ 2, and let ẏ be the position-sampled version of the text y. Then, the CCS version of a string y is defined as ỹ = ⟨ccs(0), ccs(1), . . . , ccs(n_C − 1)⟩, where ccs(i) corresponds to the context value of the i-th pivot character, defined by ccs(i) = hash(y, ẏ[i] + 1, ẏ[i + 1] − 1) if ẏ[i + 1] − (ẏ[i] + 1) ≤ k, and ccs(i) = hash(y, ẏ[i] + 1, ẏ[i] + k) otherwise.</p>
      <p>In other words, the CCS version ỹ of the text y is the sequence of hash values computed on the substrings starting at the positions just after each pivot character in the text, whose length is at most k and which do not include the next pivot character. These hash values encapsulate contextual information that can be used during the search phase to reduce the number of verifications.</p>
      <p>We also notice that, in order to locate the candidate occurrences in the original text for the verification phase, the CCS approach necessitates storing both the hash values and the exact positions of each pivot character. In the worst-case scenario, storing the complete set of positions may lead to a hybrid method consuming a significant amount of additional space. To address this issue, an alternative approach involves using a mapping table M, as demonstrated in the OTS algorithm [10]. In their method, a mapping position is stored at regular intervals; specifically, every r pivots, the exact position of the pivot is recorded. When a sampled match is found at position i, the verification phase commences at the position M[⌊i/r⌋] and continues until the next pivot character (see [10] for more details).</p>
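A direct transcription of the ccs(i) definition above (the hash function here is an arbitrary one-byte fingerprint of ours, standing in for the actual functions of Figure 3; pivot i is assumed not to be the last one):

```python
def context_hash(text, i, j):
    # Illustrative one-byte fingerprint of text[i..j] (inclusive); an
    # arbitrary stand-in, not the authors' hash of Figure 3.
    h = 0
    for c in text[i : j + 1]:
        h = (h * 31 + ord(c)) % 256
    return h

def ccs(text, pos, i, k):
    # Context value of the i-th pivot (assumed not to be the last): hash at
    # most k characters after it, stopping before the next pivot occurrence.
    start = pos[i] + 1
    if k >= pos[i + 1] - start:     # next pivot within k characters
        return context_hash(text, start, pos[i + 1] - 1)
    return context_hash(text, start, pos[i] + k)

text = "agaacgcagtata"
pos = [0, 2, 3, 7, 10, 12]          # pivot positions for pivot "a"
print([ccs(text, pos, i, 2) for i in range(len(pos) - 1)])
```

On the text of Example 1 with k = 2, the second pivot (at position 2) is immediately followed by another pivot, so its context is empty, while the third pivot (at position 3) hashes the two characters "cg".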
      <p>The searching procedure is divided into two distinct phases: the pre-matching phase, also known as sample-matching, and the verification phase. The pseudo-code of the searching procedure is described in Figure 2 (on the right).</p>
      <p>Let y be a text of length n, let x be a pattern of length m, both strings over an alphabet Σ, and let C ⊆ Σ be the set of pivot characters. In order to retrieve the exact position in the text, an additional position λ must be stored. This position λ corresponds to the distance between the first pivot and the initial position of the pattern. Specifically, λ = min(i : x[i] ∈ C). Thus λ = 0 if the pattern starts with a pivot character, otherwise λ > 0.</p>
      <p>Both phases are straightforward and similar in nature. The first phase checks whether all the contexts of the sampled pattern x̃ occur in the sampled text ỹ. If a candidate occurrence is found at position j of ỹ, then the second verification phase is initiated to check the presence of a real match. During the verification phase, the starting position is computed. This requires ẏ[j], which is the exact position of the first pivot that was compared, and the distance λ between the first pivot in the pattern and the initial position of the pattern. Consequently, the position s can be computed as s = ẏ[j] − λ. However, we point out that, in the procedure shown in Figure 2, the exact position ẏ[j] is computed by the sub-routine Compute-Position, which uses the mapping table M to identify the value of this position. Once the initial position s in the original text is computed, a verification check is performed to ensure that all characters in the pattern x[0..m − 1] match the corresponding substring y[s..s + m − 1].</p>
      <p>While its worst-case complexity can reach O(nm), necessitating verification of all positions in the text, the algorithm described shows competitive, and often superior, performance compared to other established methods in practical scenarios. The subsequent section will provide supporting evidence in this regard.
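Putting the phases together, the search procedure might be sketched as follows; this is a simplified stand-in of ours for the pseudo-code of Figure 2: it compares raw contexts instead of hash values, stores exact pivot positions instead of the mapping table M, and assumes the pattern contains at least two pivot occurrences:

```python
def context(s, pos, i, k):
    # At most k characters after pivot i, stopping before the next pivot.
    start = pos[i] + 1
    if i + 1 != len(pos) and k >= pos[i + 1] - start:
        return s[start:pos[i + 1]]
    return s[start:start + k]

def search(text, pattern, pivots, k=2):
    # Preprocessing: sample both strings with the same pivot set.
    pos = [i for i, c in enumerate(text) if c in pivots]
    p_pos = [i for i, c in enumerate(pattern) if c in pivots]
    lam = p_pos[0]                  # offset of the first pivot in the pattern
    t_ctx = [context(text, pos, i, k) for i in range(len(pos))]
    p_ctx = [context(pattern, p_pos, i, k) for i in range(len(p_pos) - 1)]
    matches = []
    for j in range(len(t_ctx) - len(p_ctx) + 1):
        # Phase 1 (sample-matching): align the pattern's context sequence.
        if t_ctx[j:j + len(p_ctx)] == p_ctx:
            # Phase 2 (verification): check the candidate in the original text.
            s = pos[j] - lam
            if s >= 0 and text[s:s + len(pattern)] == pattern:
                matches.append(s)
    return matches

print(search("agaacgcagtata", "agta", {"a"}))   # [7]
print(search("agaacgcagtata", "acga", {"a"}))   # []
```

On the text of Example 1, the pattern "agta" is found at position 7, while the false-positive-prone pattern "acga" is filtered out by its context before verification ever succeeds.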
</p>
      <p>4. Experimental Results</p>
      <p>In this section, we present the results of an experimental evaluation of the new text sampling approaches introduced in this article. Our evaluation compares our approaches with the best solution available in the literature, focusing on three key metrics: time efficiency, memory space requirements, and the computational effort needed for verifying candidate occurrences.
Algorithms and implementation. We compared the text sampling approach proposed in this paper (CCS), for values of k ∈ {2, 4, 6}, with the FCDS+ approach, which is currently the most effective CDS variant for natural language texts. We recall that FCDS+ is a CDS variant enhanced with fake samples and a mapping table which maps one element every r elements to its corresponding position within the text. For details of the FCDS+ implementation see [17]. In addition, we report that the CCS algorithm has been implemented using a hash function which maps any substring of k characters to a 1-byte bucket. For both the CCS and FCDS+ algorithms, the mapping table utilizes values of r ∈ {8, 16, 32}, selecting the optimal execution time for each case. The functions used to calculate the hash value in our implementation are those shown in Figure 3. A hash value of 1 byte was selected to ensure that the space requirements of CCS closely match those of FCDS+, thereby enabling a fair comparison between the two solutions. We compared our algorithms using multiple pivot characters, selected according to the rank of their frequency.
Testbed. All the algorithms were implemented in the C programming language and tested using the Smart tool [21]. The experiments were executed on a MacBook Pro with a 2.7 GHz Intel Core i7 processor, 4 cores, 16 GB RAM 2133 MHz LPDDR3, 256 KB L2 Cache, and 8 MB L3 Cache. Compilation was performed with the -O3 optimization flag. We tested all the algorithms on a text buffer of size 100 MB, sourced from the Pizza and Chili dataset [22], which is available for download online. The algorithms were specifically tested using the natural language text from this dataset. In our experiments we limited ourselves to searching for patterns of fixed length m = 2^h, with h ∈ {4, 5, 6, 7}, i.e., m ∈ {16, 32, 64, 128}. For the purpose of our experimental evaluation, 1000 patterns were randomly selected from the text buffer, all algorithms were run to search for each of the patterns in this set, and the average running time was computed over these 1000 runs. The choice of a 100 MB text buffer ensures a substantial and representative sample size for evaluating algorithm performance. The natural language text from the Pizza and Chili dataset provides a realistic testing scenario, reflecting typical use cases in text processing applications.
Analysis of search times. The analysis of search times is particularly important in this context, as sampled matching solutions are designed to achieve significantly superior performance compared to online solutions, with minimal cost in terms of spatial resources. Figure 4 reports the time performance obtained in our experiments, highlighting the average running times of the tested algorithms. The results show that the new variant proposed in this paper achieves superior performance in terms of time compared to the FCDS+ variant, with improvements ranging from 50% to 70%. As expected, the greatest advantages are observed for short to medium-length patterns, where context plays a fundamental role in identifying possible occurrences of the pattern. In our experimental results, a detailed description of the best pivot character selection is not provided. Regarding the selection of the pivot character, superior performance is once again observed with a rank around the value of 8 for both CCS and FCDS+. Finally, we note that FCDS+ becomes the most efficient solution for very long patterns. In this scenario, the context has less influence on identifying candidate occurrences, and sampling based on the distance between pivot characters proves to be the best strategy. (The Smart tool is available online for download at http://www.dmi.unict.it/~faro/smart/ or at https://github.com/smart-tool/smart.)
Table 1 (on the left) presents the experimental results in terms of running times, compared with the search times of the original Horspool algorithm (HOR) [5]. In the table, the execution times of the Horspool algorithm are expressed in milliseconds, while the results for the sampled matching algorithms are expressed in terms of speedup relative to the Horspool algorithm. The best results, in terms of speedup, are highlighted with more intense shades of grey. As clearly shown in Table 1, the CCS method provides superior speedups compared to FCDS+. These speedups range from an impressive 33 times for short patterns (m = 16) to 5 times for very long patterns (m = 128). These results show that solutions based on sampled matching are significantly more effective than those based on the standard approach.
Impact on the number of verifications. In order to provide a more comprehensive understanding of the improvements brought by the proposed new technique, we analyzed the average number of verifications executed by each algorithm under various settings. These settings include different pattern lengths, ranks of the chosen pivot character, and the number of characters used for hashing. The results of this analysis are illustrated in Table 1. The experimental results show that for small patterns, the CDS algorithm requires, in the worst case, more than 30,000 verifications, which is significantly more than the actual number of matches (147). In contrast, under optimal conditions, the CCS algorithm requires only 758 verifications, which is less than 2.5% of the original algorithm's verification count. In the most favorable cases, the new CCS variants significantly reduce the number of verifications to values remarkably close to the actual number of pattern occurrences within the text. Specifically, for m = 64, the CCS4 variant reduces the number of false positives to nearly match the number of actual occurrences (25 versus 13). Furthermore, for m = 128, the CCS6 variant completely eliminates false positives. This demonstrates that the context-based approach manages to drastically reduce the number of verifications required during the search phase, justifying the superior performance in terms of search times discussed previously.
Analysis of the space consumption. Given that the memory required to store the hash values is equivalent to the memory needed to store all the distances using the fake representation, and that the mapping tables are identical, the newly proposed algorithm eliminates the need for additional fake samples. Consequently, the CCS algorithm demands fewer memory resources while providing a speed-up in the searching time. Specifically, Figure 5 shows the space consumption of both the FCDS+ and CCS algorithms for different ranks of the pivot character, ranging from 2 to 10. We observe that the size of the partial index is always below 10% (corresponding to 10 MB) of the text size and reaches 5% (5 MB) for the rank 8 pivot.</p>
      <p>[Figure 5: space consumption of the FCDS+ and CCS algorithms, as a percentage of the text size, for pivot character ranks ranging from 2 to 10.]
</p>
      <p>5. Conclusions and Future Works</p>
      <p>In this paper, we introduced a novel sampling method called Character Context Sampling (CCS), designed to enhance the efficiency of the string matching process. This method tracks the context surrounding each sampled location, rather than just the distances between these locations. Our experimental results demonstrate that CCS significantly reduces the number of verifications required, thereby substantially decreasing search times while maintaining minimal additional space requirements. CCS stands out by outperforming the existing Character Distance Sampling (CDS) method, especially for short patterns, achieving a speedup of between 15% and 40%. This improvement is attributed to the effective use of contextual information, which helps in reducing false positives during the verification phase. Future research could focus on several areas to further enhance the performance and applicability of the CCS method. First, exploring more efficient hashing techniques and investigating their impact on the speed and accuracy of CCS could yield valuable insights. Second, adapting the CCS method for other types of string matching problems, such as approximate matching or order-preserving matching, could broaden its utility. Additionally, integrating CCS with other advanced data structures and algorithms, such as suffix trees, may provide hybrid solutions that combine the strengths of different approaches. Finally, optimizing the selection of pivot characters based on specific text characteristics or application requirements could further improve the efficiency of the CCS method.
</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>