<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1016/J.TCS.2022.12.034</article-id>
      <title-group>
        <article-title>Improving Sampled Matching through Character Context Sampling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Simone Faro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thierry Lecroq</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Pio Marino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arianna Pavone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Scafiti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ICTCS'24: Italian Conference on Theoretical Computer Science</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Univ Rouen Normandie, INSA Rouen Normandie, Université Le Havre Normandie</institution>
          ,
          <addr-line>Normandie Univ, LITIS UR 4108, CNRS NormaSTIC FR 3638, IRIB, Rouen F-76000</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università di Catania</institution>
          ,
          <addr-line>viale A. Doria n.6, 95125, Catania</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Università di Palermo</institution>
          ,
          <addr-line>via Archirafi n.34, 90123, Palermo</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>291</volume>
      <fpage>30</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>Sampled String Matching is a hybrid approach to the string matching problem, blending online and offline solutions. Among various sampling methods, Character Distance Sampling (CDS) is one of the fastest and most versatile techniques. In optimal conditions, CDS can achieve a speedup of up to 100 times, while requiring minimal additional space, ranging from 10% to as little as 0.8% of the original text size. Furthermore, CDS is adaptable, effectively handling non-classical string matching problems and computing structural properties of strings such as periods and coverages. As with all sampling-based matching algorithms, a verification phase on the original text is necessary after the initial matching on the sampled strings. Often, the computational effort required for this verification phase can be substantial. In this article, we introduce a novel sampling method that tracks the context around each sampled location rather than the distances to those locations. This approach aims to reduce the number of candidate occurrences and the subsequent verification effort. Our experimental results indicate that the proposed method outperforms CDS, particularly for short patterns, achieving a speedup of between 15% and 40%.</p>
      </abstract>
      <kwd-group>
        <kwd>String Matching</kwd>
        <kwd>Sampled String Matching</kwd>
        <kwd>Text Processing</kwd>
        <kwd>Contextual Information</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The string matching problem involves identifying all instances of a pattern x of length m within a text y of length n, both defined over an alphabet Σ of size σ. This task is fundamental in text processing and underpins various software implementations across multiple operating systems. Its importance is highlighted by the continued prevalence of text as the primary medium for information exchange, despite diverse data storage formats. This is particularly evident in linguistics, which relies on extensive corpora and dictionaries, and in computer science, where large amounts of data are stored in linear files.
Applications of string matching rely on two main approaches: online and offline string matching. The online approach handles unprocessed text, necessitating real-time analysis during the search operation. Its worst-case time complexity is Θ(n), first achieved by the renowned Knuth-Morris-Pratt (KMP) algorithm [1]. Its average time complexity is Θ(n log_σ(m)/m), initially realized by the Backward-Dawg-Matching (BDM) algorithm [2]. Numerous solutions have been devised to attain sub-linear performance in practical scenarios [3], with the Boyer-Moore-Horspool algorithm [4, 5] standing out, having influenced subsequent research. In contrast, the offline approach aims to expedite searches through text preprocessing, creating data structures that facilitate search operations. Known as indexed searching, this method includes several efficient solutions. Prominent examples include suffix trees [6], which offer O(m + occ) worst-case search time, suffix arrays [7] with a time complexity of O(m + log n + occ), where occ is the number of occurrences of the searched pattern, and the FM-index [8], a compressed structure derived from the Burrows-Wheeler transform that combines input compression with efficient substring queries. However, these full-text indexes require additional storage space, ranging from four to twenty times the size of the text.
A hybrid technique in the literature is Sampled String Matching, introduced by Vishkin in 1991 [9]. This method involves creating a partial index of the text and applying online string matching algorithms to this index. While this approach accelerates the detection of candidate pattern occurrences, each match in the sampled text must be verified within the original text. The sampled-text methodology offers several benefits: it generally requires straightforward implementation and minimal additional space, and it enables fast search and update operations. Beyond Vishkin’s theoretical contributions, a more practical solution was presented by Claude et al. [10], who developed an alphabet reduction technique. Their method requires an extra space of 14% of the original text size and achieves up to a fivefold increase in search speed compared to traditional online string matching algorithms on English texts. They also introduced an indexed version of the sampled text, modifying the suffix array to index sampled positions. Recently, Faro et al. have introduced several algorithms in the sampling field, notably their Character Distance Sampling (CDS) approach [11, 12, 13, 14, 15, 16]. By sampling absolute positions of specific characters, referred to as pivot characters, their method has achieved up to a ninefold increase in speed on English texts, while requiring only 11% to 2.8% additional space relative to the text size. This method provides a 50% reduction in search times compared to previous text sampling techniques [12].
In this paper, we introduce a novel sampling method called Character Context Sampling (CCS), which is designed to compactly track the context surrounding pivot characters identified within the text. This method involves storing the hash of the substring (or a portion thereof) located between two consecutive occurrences of the pivot character. Our experimental results demonstrate that this technique significantly reduces the number of verifications required to identify matches, thereby substantially decreasing search times, while maintaining the same amount of space required for storing the partial index. The paper is organized as follows. In Section 2 we briefly review the CDS method and the most recent results achieved in this field. Section 3 introduces the new sampling method based on the use of context around pivot characters. In Section 4 we provide some experimental results, and in Section 5 we draw our conclusions.
</p>
      <p>2. The Character Distance Sampling Method</p>
      <p>This section outlines the methodology employed to build a partial index using the Character Distance Sampling (CDS) technique. These concepts are relevant for the description of the new sampling method introduced in this paper. Consider an input text y of length n and an input pattern x of length m, both defined over an alphabet Σ of size σ. We treat all strings as vectors starting at position 0. Thus, y[i] refers to the (i + 1)-th character of the string y, for 0 ≤ i &lt; n. The algorithm selects a sub-alphabet C ⊆ Σ to serve as the set of pivot characters. Using these designated pivots, the text y is sampled by calculating the distances between consecutive occurrences of any pivot character c ∈ C within y. Formally, this sampling methodology is based on the concept of position sampling within the text. Let n_C denote the number of occurrences of pivot characters in y. Define δ : {0, . . . , n_C − 1} → {0, . . . , n − 1}, where δ(i) represents the position of the (i + 1)-th occurrence of a pivot character c ∈ C in y. The position-sampled version of y, denoted by ẏ, is a numerical sequence of length n_C defined as ẏ = ⟨δ(0), δ(1), . . . , δ(n_C − 1)⟩. Define the Character Distance Function Δ : {0, . . . , n_C − 2} → {0, . . . , n − 1}, where Δ(i) = δ(i + 1) − δ(i) represents the distance between two consecutive occurrences of any pivot character in y. The character-distance sampled version of the text y, denoted by ȳ, is a numerical sequence of length n_C − 1 defined as:</p>
      <p>¯ = ⟨Δ(0), Δ(1), . . . , Δ( − 1)⟩ = ⟨ (1) −  (0),  (2) −  (1), . . . ,  ( − 1) −  ( − 2)⟩.
92 Example 1. Let  = "agaacgcagtata" be a text of length 13, over the alphabet Σ = {a,c,g,t}. Let
93  = {a} be the set of pivot characters. The position-sampled version of  is ˙ = ⟨0, 2, 3, 7, 10, 12⟩.
94 Specifically, the first occurrence of character "a" is at position 0 ([0] = "a"), its second occurrence
95 is at position 2 ([2] = "a"), and so on. In addition, the character-distance sampled version
96 of  is ¯ = ⟨2, 1, 4, 3, 2⟩. Specifically, ¯[0] = Δ(0) =  (1) −  (0) = 2 − 0 = 2, while
97 ¯[2] = Δ(2) =  (3) −  (2) = 7 − 3 = 4, and so on.
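The definitions above can be illustrated with a short sketch (an illustration of ours in Python; the paper's actual implementation is in C, and the function names here are not taken from it):

```python
def position_sample(text, pivots):
    # Positions of all pivot-character occurrences: the sequence denoted y-dot.
    return [i for i, c in enumerate(text) if c in pivots]

def distance_sample(positions):
    # Distances between consecutive pivot occurrences: the sequence denoted y-bar.
    return [positions[i + 1] - positions[i] for i in range(len(positions) - 1)]

pos = position_sample("agaacgcagtata", {"a"})
print(pos)                    # [0, 2, 3, 7, 10, 12]
print(distance_sample(pos))   # [2, 1, 4, 3, 2]
```

Running it on the text of Example 1 reproduces the sampled sequences ⟨0, 2, 3, 7, 10, 12⟩ and ⟨2, 1, 4, 3, 2⟩.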
The sampled string matching approach using CDS maintains a partial index, represented by the position-sampled version ẏ of the text. The size of this index is 32 bits per sampled position, assuming the index resides in memory and is readily available for any search operation on the text. When searching for a pattern x of length m within y, a preprocessing step computes its sampled version x̄. It can be proven that an occurrence of x in y corresponds to an occurrence of x̄ in ȳ. Thus, any string matching algorithm can be used to locate occurrences of x̄ in ȳ to solve the problem. However, the reverse is not necessarily true. Therefore, each occurrence of x̄ in ȳ, termed a candidate occurrence, requires a validation check in y. Given that the validation process takes O(m) time, the entire search operation consumes O(nm) time in the worst case. Nevertheless, modifications to the fundamental procedure can ensure that the overall search remains linear in time (see [12] for further details), and the approach can be implemented using any online algorithm studied in the literature [17].
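To see why the converse fails, consider the hypothetical pattern "acga" (our illustrative choice) over the text of Example 1: its distance sequence ⟨3⟩ occurs in ȳ, yet the pattern itself does not occur in y. A sketch of ours (not the authors' code) showing candidate filtering and verification:

```python
def find_candidates(text, pattern, pivots):
    # CDS search sketch: match the pattern's distance sequence against the
    # text's, then verify each candidate in the original text. Assumes the
    # pattern contains at least two pivot occurrences.
    pos = [i for i, c in enumerate(text) if c in pivots]
    y_bar = [pos[i + 1] - pos[i] for i in range(len(pos) - 1)]
    p_pos = [i for i, c in enumerate(pattern) if c in pivots]
    x_bar = [p_pos[i + 1] - p_pos[i] for i in range(len(p_pos) - 1)]
    lam = p_pos[0]                  # offset of the first pivot in the pattern
    w = len(x_bar)
    candidates, matches = [], []
    for j in range(len(y_bar) - w + 1):
        if y_bar[j:j + w] == x_bar:          # sampled (candidate) occurrence
            s = pos[j] - lam                 # candidate start in the text
            candidates.append(s)
            if s >= 0 and text[s:s + len(pattern)] == pattern:
                matches.append(s)            # verified real occurrence
    return candidates, matches

cand, real = find_candidates("agaacgcagtata", "acga", {"a"})
print(cand, real)   # [7] [] : one candidate, but no real occurrence
```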
An important aspect of the CDS-based approach is that it does not explicitly maintain the character-distance sampled version ȳ of the text. Instead, it keeps the position-sampled version ẏ. (In practical applications, particularly when dealing with large alphabets, the set of pivot characters may include only one character. For simplicity, we often refer to the pivot character in the singular form.) Since ȳ retains only distances between pivot characters, without direct ties to their original positions, direct verification of every candidate occurrence is impracticable. This is resolved by maintaining ẏ and computing ȳ on the fly during the search. The i-th element of ȳ can be computed in constant time as ȳ(i) = ẏ(i + 1) − ẏ(i). The CDS-based sampled string matching approach has proven highly effective in practical applications, significantly reducing search times by up to 40 times compared to standard online exact string matching techniques. This improvement comes at a relatively low cost, requiring the construction of a partial index of only 2% of the text size.
Moreover, the sampled string matching method has shown exceptional flexibility, making it suitable for various text searching challenges, including approximate searches. Notably, Faro et al. [15] recently introduced run-length text sampling, which is tailored for approximate searches in texts, proving useful for tasks such as Order Preserving pattern matching [18, 19]. In addition to its space and time efficiency, the sampled string matching approach offers other advantages, such as ease of programming and the ability to adapt to text variations. Minor alterations in the text, like character deletions or insertions, can be easily reflected in the index. However, this method is not without its challenges. One such challenge is the performance variability based on the choice of pivot character. Strategic selection of the pivot character is crucial to balance partial index size and execution times. Research suggests that in the English language, the pivot character ranked 8th by frequency often provides the best performance. Another consideration is that if the pattern is very short and lacks occurrences of the pivot character, a standard string search within the text may be necessary. Additionally, the method may not yield significant benefits for texts with small alphabets, as space efficiency gains may not be realized. However, studies by Faro et al. [13] have demonstrated the effectiveness of techniques that use condensed alphabets to expand the alphabet size and improve performance.
A recent study introduced significant advancements in space and time efficiency through the introduction of fake samples [16], slightly increasing the number of elements in the partial index. Paradoxically, this results in a substantial three-quarters reduction in the overall space required to represent the data structure while maintaining algorithmic correctness. The idea lies in storing distances between pivot characters rather than their absolute positions within the text, which reduces the space used but introduces the challenge of directly addressing positions within the original text. The resulting fake-distance representation of CDS leverages a clever balance between adding minimal redundancy and achieving significant space reduction. By interspersing fake samples within the pivot character set, the method ensures that the partial index retains its efficiency in identifying potential matches. This leads to quicker verification processes and overall faster search times.
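One plausible reading of the fake-sample idea can be sketched as follows; this is an assumption-laden illustration of ours (one byte per distance, gaps capped at 255), not the actual encoding of [16]:

```python
def encode_distances(positions, cap=255):
    # One byte per sample; a gap larger than `cap` is split by inserting
    # "fake" samples, which correspond to no real pivot occurrence.
    out, fake, prev = [], [], 0
    for p in positions:
        gap = p - prev
        while gap > cap:
            out.append(cap)
            fake.append(len(out) - 1)   # remember which entries are fake
            gap -= cap
        out.append(gap)
        prev = p
    return bytes(out), fake

def decode_position(encoded, idx):
    # Absolute position of entry idx is the prefix sum of stored distances.
    return sum(encoded[: idx + 1])

enc, fake = encode_distances([0, 2, 3, 7, 300])
print(list(enc), fake)          # [0, 2, 1, 4, 255, 38] [4]
print(decode_position(enc, 5))  # 300
```

The prefix-sum decoding illustrates the direct-addressing challenge mentioned above: recovering an absolute position from stored distances requires extra bookkeeping, which the mapping table described later mitigates.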
For the sake of completeness, we emphasize that Lecroq and Marino have recently proven that the CDS representation can accelerate not only string matching algorithms but also other types of algorithms, such as those that compute structural properties of strings, including finding periodicities and covers [20]. This broadens the scope of CDS beyond traditional string matching, showcasing its adaptability and effectiveness in a wider range of computational problems.
</p>
      <p>3. Character Context Sampling</p>
      <p>As introduced in the previous sections, sampled string matching stands as a hybrid method that allows a practical speed-up in the searching phase with minimal costs required for space and pre-processing time. This technique leverages the benefits of sampling to reduce the overall search complexity, making it highly efficient for large datasets. On the other hand, a crucial requirement of all sampled string matching algorithms is the necessity of a verification phase after any candidate match is identified in the sampled version of the text. Although such a verification phase can be easily implemented in O(m) time by comparing all the pattern characters with the candidate substring in the text, in the worst-case scenario, where false matches occur at every sampled position, it would be necessary to perform a verification at each sampled position of the text. This results in a complexity of O(nm), which is sub-optimal compared with the linear time complexity achieved by several well-known algorithms. This trade-off highlights the importance of optimizing the sampling strategy to balance the speed-up in the search phase with the cost of running the verification phases. While the information provided by the CDS approach is enough to compute positions in the original text and execute the searching phase using various online algorithms [17], it is not well designed for avoiding (or reducing) false matches between the sampled version of the text and the pattern. To obtain this result it is instead necessary to store additional data. In this section, we introduce the Character Context Sampling (CCS) approach, an enhanced variant of the CDS method which stores information computed on the context around each pivot character instead of the distances stored by the CDS method. At the core of this idea is the necessity to incorporate contextual information about the position of each pivot character within the partial index used for the search. This approach aims to reduce the number of false positives, thereby minimizing the number of verifications required during the search phase. When we refer to a context, we mean a substring of fixed length k within the vicinity of the pivot character. Figure 1 schematically shows the idea on which the context-based sampling model is based. The figure compares the CDS and CCS methods when the same pivot character is used. Although the context size is set to k = 2, when two occurrences of the pivot character are closer than 2 characters apart, the context is reduced accordingly.
[Figure 1: a schematic comparison of CDS and CCS sampling of the same text using pivot character "a": CDS stores the distances between consecutive pivot occurrences, while CCS stores the context following each occurrence.]</p>
      <p>However, we immediately notice that memorizing entire substrings is far more expensive than memorizing individual positions. For this reason, due to the potentially high memory cost of storing the entire set of substrings representing the set of contexts, our method stores the context in a compact and approximate form by computing a fingerprint of the substrings, specifically a hash value.</p>
      <p>More formally, let hash : Σ* × {0, 1, . . . , n − 1} × {0, 1, . . . , n − 1} be a function which, taking as input a string w ∈ Σ* and two indices i and j such that 0 ≤ i &lt; j ≤ |w| − 1, computes a hash value of the substring w[i..j]. In addition, let k be an integer parameter such that k ≥ 2, and let ẏ be the position-sampled version of the text y. Then, the CCS version of a string y is defined as ỹ = ⟨ccs(0), ccs(1), . . . , ccs(n_C − 1)⟩, where ccs(i) corresponds to the context value of the i-th pivot character, defined by ccs(i) = hash(y, ẏ[i] + 1, ẏ[i + 1] − 1) if ẏ[i + 1] − (ẏ[i] + 1) ≤ k, and ccs(i) = hash(y, ẏ[i] + 1, ẏ[i] + k) otherwise.</p>
      <p>In other words, the CCS version ỹ of the text y is the sequence of hash values computed on the substrings starting at the positions just after each pivot character in the text, whose length is at most k and which do not include the next pivot character. These hash values encapsulate contextual information that can be used during the search phase to reduce the number of verifications.</p>
      <p>We also notice that, in order to locate the candidate occurrences in the original text for the verification phase, the CCS approach necessitates storing both the hash values and the exact positions of each pivot character. In the worst-case scenario, storing the complete set of positions may lead to a hybrid method consuming a significant amount of additional space. To address this issue, an alternative approach involves using a mapping table M, as demonstrated in the OTS algorithm [10]. In their method, a mapping position is stored at regular intervals; specifically, every r pivots, the exact position of the pivot is recorded. When a sampled match is found at position i, the verification phase commences at the position M[⌊i/r⌋] and continues until the next pivot character (see [10] for more details).</p>
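A direct transcription of the ccs(i) definition above (the hash function here is an arbitrary one-byte fingerprint of ours, standing in for the actual functions of Figure 3; pivot i is assumed not to be the last one):

```python
def context_hash(text, i, j):
    # Illustrative one-byte fingerprint of text[i..j] (inclusive); an
    # arbitrary stand-in, not the authors' hash of Figure 3.
    h = 0
    for c in text[i : j + 1]:
        h = (h * 31 + ord(c)) % 256
    return h

def ccs(text, pos, i, k):
    # Context value of the i-th pivot (assumed not to be the last): hash at
    # most k characters after it, stopping before the next pivot occurrence.
    start = pos[i] + 1
    if k >= pos[i + 1] - start:     # next pivot within k characters
        return context_hash(text, start, pos[i + 1] - 1)
    return context_hash(text, start, pos[i] + k)

text = "agaacgcagtata"
pos = [0, 2, 3, 7, 10, 12]          # pivot positions for pivot "a"
print([ccs(text, pos, i, 2) for i in range(len(pos) - 1)])
```

On the text of Example 1 with k = 2, the second pivot (at position 2) is immediately followed by another pivot, so its context is empty, while the third pivot (at position 3) hashes the two characters "cg".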
      <p>The searching procedure is divided into two distinct phases: the pre-matching phase, also known as sample-matching, and the verification phase. The pseudo-code of the searching procedure is described in Figure 2 (on the right).</p>
      <p>Let y be a text of length n, let x be a pattern of length m, both strings over an alphabet Σ, and let C ⊆ Σ be the set of pivot characters. In order to retrieve the exact position in the text, an additional position λ must be stored. This position λ corresponds to the distance between the first pivot and the initial position of the pattern. Specifically, λ = min(i : x[i] ∈ C). Thus λ = 0 if the pattern starts with a pivot character, otherwise λ > 0.</p>
      <p>Both phases are straightforward and similar in nature. The first phase checks whether all the contexts of the sampled pattern x̃ occur in the sampled text ỹ. If a candidate occurrence is found at position j of ỹ, then the second verification phase is initiated to check the presence of a real match. During the verification phase, the starting position is computed. This requires ẏ[j], which is the exact position of the first pivot that was compared, and the distance λ between the first pivot in the pattern and the initial position of the pattern. Consequently, the position s can be computed as s = ẏ[j] − λ. However, we point out that, in the procedure shown in Figure 2, the exact position ẏ[j] is computed by the sub-routine Compute-Position, which uses the mapping table M to identify the value of this position. Once the initial position s in the original text is computed, a verification check is performed to ensure that all characters in the pattern x[0..m − 1] match the corresponding substring y[s..s + m − 1].</p>
      <p>While its worst-case complexity can reach O(nm), necessitating verification of all positions in the text, the algorithm described shows competitive, and often superior, performance compared to other established methods in practical scenarios. The subsequent section will provide supporting evidence in this regard.
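Putting the phases together, the search procedure might be sketched as follows; this is a simplified stand-in of ours for the pseudo-code of Figure 2: it compares raw contexts instead of hash values, stores exact pivot positions instead of the mapping table M, and assumes the pattern contains at least two pivot occurrences:

```python
def context(s, pos, i, k):
    # At most k characters after pivot i, stopping before the next pivot.
    start = pos[i] + 1
    if i + 1 != len(pos) and k >= pos[i + 1] - start:
        return s[start:pos[i + 1]]
    return s[start:start + k]

def search(text, pattern, pivots, k=2):
    # Preprocessing: sample both strings with the same pivot set.
    pos = [i for i, c in enumerate(text) if c in pivots]
    p_pos = [i for i, c in enumerate(pattern) if c in pivots]
    lam = p_pos[0]                  # offset of the first pivot in the pattern
    t_ctx = [context(text, pos, i, k) for i in range(len(pos))]
    p_ctx = [context(pattern, p_pos, i, k) for i in range(len(p_pos) - 1)]
    matches = []
    for j in range(len(t_ctx) - len(p_ctx) + 1):
        # Phase 1 (sample-matching): align the pattern's context sequence.
        if t_ctx[j:j + len(p_ctx)] == p_ctx:
            # Phase 2 (verification): check the candidate in the original text.
            s = pos[j] - lam
            if s >= 0 and text[s:s + len(pattern)] == pattern:
                matches.append(s)
    return matches

print(search("agaacgcagtata", "agta", {"a"}))   # [7]
print(search("agaacgcagtata", "acga", {"a"}))   # []
```

On the text of Example 1, the pattern "agta" is found at position 7, while the false-positive-prone pattern "acga" is filtered out by its context before verification ever succeeds.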
</p>
      <p>4. Experimental Results</p>
      <p>In this section, we present the results of an experimental evaluation of the new text sampling approaches introduced in this article. Our evaluation compares our approaches with the best solution available in the literature, focusing on three key metrics: time efficiency, memory space requirements, and the computational effort needed for verifying candidate occurrences.
Algorithms and implementation. We compared the text sampling approach proposed in this paper (CCS), for values of k ∈ {2, 4, 6}, with the FCDS+ approach, which is currently the most effective CDS variant for natural language texts. We recall that FCDS+ is a CDS variant enhanced with fake samples and a mapping table which maps one element every r elements to its corresponding position within the text. For details of the FCDS+ implementation see [17]. In addition, we report that the CCS algorithm has been implemented using a hash function which maps any substring of k characters to a 1-byte bucket. For both the CCS and FCDS+ algorithms, the mapping table utilizes values of r ∈ {8, 16, 32}, selecting the optimal execution time for each case. The functions used to calculate the hash value in our implementation are those shown in Figure 3. A hash value of 1 byte was selected to ensure that the space requirements of CCS closely match those of FCDS+, thereby enabling a fair comparison between the two solutions. We compared our algorithms using multiple pivot characters, selected according to the rank of their frequency.
Testbed. All the algorithms were implemented in the C programming language and tested using the Smart tool [21]. The experiments were executed on a MacBook Pro with a 2.7 GHz Intel Core i7 processor, 4 cores, 16 GB RAM 2133 MHz LPDDR3, 256 KB L2 Cache, and 8 MB L3 Cache. Compilation was performed with the -O3 optimization flag. We tested all the algorithms on a text buffer of size 100 MB, sourced from the Pizza and Chili dataset [22], which is available for download online. The algorithms were specifically tested using the natural language text from this dataset. In our experiments we limited ourselves to searching for patterns of fixed length m = 2^h, with h ∈ {4, 5, 6, 7}, i.e., m ∈ {16, 32, 64, 128}. For the purpose of our experimental evaluation, 1000 patterns were randomly selected from the text buffer, all algorithms were run to search for each of the patterns in this set, and the average running time was computed over these 1000 runs. The choice of a 100 MB text buffer ensures a substantial and representative sample size for evaluating algorithm performance. The natural language text from the Pizza and Chili dataset provides a realistic testing scenario, reflecting typical use cases in text processing applications.
Analysis of search times. The analysis of search times is particularly important in this context, as sampled matching solutions are designed to achieve significantly superior performance compared to online solutions, with minimal cost in terms of spatial resources. Figure 4 reports the time performance obtained in our experiments, highlighting the average running times of the tested algorithms. The results show that the new variant proposed in this paper achieves superior performance in terms of time compared to the FCDS+ variant, with improvements ranging from 50% to 70%. As expected, the greatest advantages are observed for short to medium-length patterns, where context plays a fundamental role in identifying possible occurrences of the pattern. In our experimental results, a detailed description of the best pivot character selection is not provided. Regarding the selection of the pivot character, superior performance is once again observed with a rank around the value of 8 for both CCS and FCDS+. Finally, we note that FCDS+ becomes the most efficient solution for very long patterns. In this scenario, the context has less influence on identifying candidate occurrences, and sampling based on the distance between pivot characters proves to be the best strategy. (The Smart tool is available online for download at http://www.dmi.unict.it/~faro/smart/ or at https://github.com/smart-tool/smart.)
Table 1 (on the left) presents the experimental results in terms of running times, compared with the search times of the original Horspool algorithm (HOR) [5]. In the table, the execution times of the Horspool algorithm are expressed in milliseconds, while the results for the sampled matching algorithms are expressed in terms of speedup relative to the Horspool algorithm. The best results, in terms of speedup, are highlighted with more intense shades of grey. As clearly shown in Table 1, the CCS method provides superior speedups compared to FCDS+. These speedups range from an impressive 33 times for short patterns (m = 16) to 5 times for very long patterns (m = 128). These results show that solutions based on sampled matching are significantly more effective than those based on the standard approach.
Impact on the number of verifications. In order to provide a more comprehensive understanding of the improvements brought by the proposed new technique, we analyzed the average number of verifications executed by each algorithm under various settings. These settings include different pattern lengths, ranks of the chosen pivot character, and the number of characters used for hashing. The results of this analysis are illustrated in Table 1. The experimental results show that for small patterns, the CDS algorithm requires, in the worst case, more than 30,000 verifications, which is significantly more than the actual number of matches (147). In contrast, under optimal conditions, the CCS algorithm requires only 758 verifications, which is less than 2.5% of the original algorithm's verification count. In the most favorable cases, the new CCS variants significantly reduce the number of verifications to values remarkably close to the actual number of pattern occurrences within the text. Specifically, for m = 64, the CCS4 variant reduces the number of false positives to nearly match the number of actual occurrences (25 versus 13). Furthermore, for m = 128, the CCS6 variant completely eliminates false positives. This demonstrates that the context-based approach manages to drastically reduce the number of verifications required during the search phase, justifying the superior performance in terms of search times discussed previously.
Analysis of the space consumption. Given that the memory required to store the hash values is equivalent to the memory needed to store all the distances using the fake representation, and that the mapping tables are identical, the newly proposed algorithm eliminates the need for additional fake samples. Consequently, the CCS algorithm demands fewer memory resources while providing a speed-up in the searching time. Specifically, Figure 5 shows the space consumption of both the FCDS+ and CCS algorithms for different ranks of the pivot character, ranging from 2 to 10. We observe that the size of the partial index is always below 10% (corresponding to 10 MB) of the text size and reaches 5% (5 MB) for the rank 8 pivot.</p>
      <p>[Figure 5: space consumption of the FCDS+ and CCS algorithms, as a percentage of the text size, for pivot character ranks ranging from 2 to 10.]
</p>
      <p>5. Conclusions and Future Works</p>
      <p>In this paper, we introduced a novel sampling method called Character Context Sampling (CCS), designed to enhance the efficiency of the string matching process. This method tracks the context surrounding each sampled location, rather than just the distances between these locations. Our experimental results demonstrate that CCS significantly reduces the number of verifications required, thereby substantially decreasing search times while maintaining minimal additional space requirements. CCS stands out by outperforming the existing Character Distance Sampling (CDS) method, especially for short patterns, achieving a speedup of between 15% and 40%. This improvement is attributed to the effective use of contextual information, which helps in reducing false positives during the verification phase. Future research could focus on several areas to further enhance the performance and applicability of the CCS method. First, exploring more efficient hashing techniques and investigating their impact on the speed and accuracy of CCS could yield valuable insights. Second, adapting the CCS method for other types of string matching problems, such as approximate matching or order-preserving matching, could broaden its utility. Additionally, integrating CCS with other advanced data structures and algorithms, such as suffix trees, may provide hybrid solutions that combine the strengths of different approaches. Finally, optimizing the selection of pivot characters based on specific text characteristics or application requirements could further improve the efficiency of the CCS method.
</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>