S. Krajči (ed.): ITAT 2018 Proceedings, pp. 181–187 CEUR Workshop Proceedings Vol. 2203, ISSN 1613-0073, c 2018 Tatiana Jajcayová and Zuzana Majeríková A note on some interesting test results for the Rabin-Karp string matching algorithm Tatiana Jajcayová, Zuzana Majeríková Department of Applied Informatics, Comenius University, Bratislava, Slovakia jajcayova@fmph.uniba.sk z.majerikova19@gmail.com Abstract: While analyzing and testing some of the well known string matching algorithms, we came across rather interesting phenomenon in tests results for the Rabin-Karp algorithm. Testing which moduli work best in the algo- rithm, we realized that on some texts some moduli perform much worse then others. This caught our attention, and we continued testing. In this paper, we summarize tests using different moduli on the same text, tests run on different Figure 1: Computation of the value for a window based on a previous window in a constant time. The first window has a value “232”. We drop English texts (fiction, scientific, law, newspaper), and tests the high-order digit that is “2” and multiply the rest of the window by run on different languages and alphabets (English, Slovak, 10. Then we add the low-order digit that is “8” and we get new value French, German and Russian). We include analysis of the that is “328”. The value of the first window is 11 and the value of the obtained results and state some hypotheses. new window is 3, because all computations are performed with out hash function that is modulo 13. 1 The Rabin-Karp algorithm Similarly, again by the Horner’s Rule, the value t0 can We start with a brief review of the Rabin-Karp algorithm be computed in Θ(m). and its computational complexity. This string match- For every s, the value ts+1 can be then computed from ing algorithm is based on the use of elementary number- the previous value ts : theoretic notions such as the equivalence of two numbers ts+1 = 10(ts − 10m−1 T [s + 1]) + T [s + m + 1]. modulo a third number. Rabin and Karp [5] proposed the algorithm to improve performance of string matching al- For example, to compute the value t1 from t0 for our text gorithms in practice. This algorithm generalizes to other T = “12345” from the previous paragraph and pattern P of algorithms for related problems, such as two dimensional length m = 3, we use the value t0 which is 123. In this pattern matching. computation s = 0. A character at the position T[s+1] is To simplify the explanations, it will be convenient to “1”, and its numerical value is 1, a character at position assume for a moment that characters are decimal digits, T[s+m+1] is “4” and its value is 4. The computation is as so we interpret the characters of strings as both digits and follows: also graphical symbols. Thus, a string of length k can be represented as a decimal number. For instatnce, a string t1 = 10(t0 − 103−1 T [0 + 1]) + T [0 + 3 + 1] s = “12345” of length 5 corresponds to a decimal number 12 345. In general case, we assume to have an alphabet Σ t1 = 10(123 − 102 T [1]) + T [4] which consists of characters that are digits from the num- t1 = 10(123 − 100) + 4 ber system over the base b = |Σ|. Let us have a pattern P[1...m] and text T [1...n]. The t1 = 234 decimal value of pattern P will be denoted as p. Similarly, We see that by computing ts − 10m−1 T [s + 1], we re- the decimal value of every substring of a text T of length move the highest order digit from value ts . By subsequent m starting at a position s will be denoted as ts . We say that multiplication of the remainder of the value by 10, we shift pattern P occurs in text T if and only if one can find at least the number to the left. 10 is used because we work with one ts that has the same decimal value as p, i.e. ts = p and the decimal system. In the general case, where base b is T [s + 1...s + m] = P[1...m], s = 0, 1, . . . , n − m. In that case |Σ|, we would have to shift by b. By adding T [s + m + 1] we call s a valid shift. we add the low-order digit and we get the final value of We can compute decimal value of p in time Θ(m) by ts+1 . (Figure 1) using the Horner’s Rule: If the constant 10m−1 is precomputed, then for each p = P[m] + 10(P[m − 1] + 10(P[m − 2] + 10(P[m − 3] + . . . s = 1, 2, . . . , n − m, the computation of the value ts is done in constant time. Thus the computation of all the remain- +10(P[2] + 10P[1])...) ing values t1 , t2 , t3 , . . ., tn−m is done in Θ(n − m). That 182 Tatiana Jajcayová and Zuzana Majeríková means, that by comparing value of p with every value of In our version of the Rabin-Karp algorithm the pre- ts , we can get all valid shifts in time Θ(m) + Θ(n − m + 1). processing time takes Θ(n). The processing time takes A problem may occur, when value p and values ts Θ((n − m + 1)m) in the worst case, and only Θ(n + 1) in are too large. Some mathematical operations with large the best case of running. The worst case occurs when all numbers may take a long time, and the assumption that characters of pattern P and all characters of text T are the they take only constant time may be unreasonable. One same unique symbol. For example, if we have a pattern possible remedy for this is, what Rabin-Karp algorithm P=“aaa” and a text T=“aaaaaaaaa”, then algorithm consid- does: instead of comparing the original values, it com- ers every possible shift as valid and behaves as the naive putes values p and ts modulo some suitable modulus q, and string matching algorithm. then compare these new modulated numbers. The modu- lated numbers are not bigger than q − 1. Computation of p (mod q) takes again time Θ(m) and computation of all 2 Implementation ts (mod q) values together takes Θ(n − m + 1). The modulus q is usually chosen to be a prime such that In this section we sketch how we implemented the Rabin- the value b · q fits within a computer word. (b is the base, Karp algorithm. The particulars of the implementation can the number of characters in the alphabet Σ). According to become interesting when discussing the test results in the this, the previous equation for computing ts + 1 changes last section of the paper. to: We have chosen Java as the programming language. Java is an object-oriented language, and its biggest ad- ts+1 ≡ (b(ts − T [s + 1]h) + T [S + m + 1]) (mod q), vantage is its platform independence, which means that the Java code can be run without modifying at any plat- where form that supports Java. As a development tool we chose h ≡ d m−1 (mod q) Eclipse Java Oxygen with a WindowBuilder component. It is clear that if p 6≡ ts (mod q) then P does not start WindowBuilder is designed to create Java GUI applica- at position s in T . So if this happens, we know that s is not tions in an easy way. a valid shift. On the other hand, it is possible that p ≡ ts (mod q), and still the pattern P does not start at s. That 2.1 Implementation of the Main Application means that to complete this algorithm, we must not only compare the modulated values and find a match, but also We study and test Rabin-Karp algorithm in a context of test further to see if the substring T [s + 1...s + m] is the other string matching algorithms. To organize the testing same as the pattern P. This additional but indispensable of the implemented algorithms, we needed to create the comparison takes another Θ(m) time. The simple graphi- main application with Graphical User Interface. This ap- cal example of the Rabin-Karp algorithm and modulation plication is build to work with all the studied algorithms of the values of ts is shown in Figure 2. and to manage the input texts, patterns and all of the out- put results. The main components that we need in our ap- plication: • Input pattern - text input field, where the user can write or paste the pattern that he wants to find in the text Input text • Input text - text input field, where the user can write or paste the text where he wants to find all the occur- rences of the pattern from the Input pattern • Load text from file - button, that allows the user to load text from a file to the Input text • Algorithm - selector, where the user can choose which algorithm he wants to test Figure 2: Example of the Rabin-Karp string matching with alphabet Σ • Modulo - number input field, where the user can in which every character is a decimal digit. a) an example of a hashed choose the modulo that will be used in the Rabin- pattern P = 315 that is located in text T . Part b) demonstrates hashing of Karp algorithm every possible substring of T with the length of P. Two substrings that have the same hash value as the pattern are found and they are denoted as valid shifts. Only the substrings with the valid shift are compared • Output text - output text field, where the text is further with the pattern by characters. For this case we have found one rewritten from the Input text, but with highlighted occurrence of pattern P in text T at position 6. occurrences of the pattern from Input pattern A note on some interesting test results for the Rabin-Karp string matching algorithm 183 OpenFile is the class that allows the user to load a text from a file with a JFileChooser. Algorithm is an interface for implemented algorithms. Every algorithm needs to have implemented functions get- TimeOfPreprocessing, getTimeOfProcessing, getOccur- rences, getOtherConsoleLogs and run(text, pattern, alpha- bet, base, modulo). NaiveMatching, RabinKarpMatching, FiniteAu- tomataMatching, OptimalizedFiniteAutomataMatch- ing, KnuthMorrisPrattMatching are the classes where the implemented algorithms that we used are. StringMatching is the class that gets the selected algo- rithm, input text, input pattern and modulo. First, it pro- cesses the text and the pattern with TextHolder and gets the alphabet and the base generated from the text. Then, it creates a new instance of the selected algorithm and calls it’s function run. The review of the implemented application and it’s GUI is shown in Figure 4, where a simple running of the pro- gram with the naive string matching algorithm with pattern Figure 3: Class diagram of the main application generated by The “the” and text of the book “The Project Gutenberg EBook ObjectAid UML Explorer for Eclipse of The Adventures of Sherlock Holmes by Sir Arthur Co- nan Doyle” is demonstrated. • Additional console logs - check field, if it is checked it means that in the Console there will be written some additional logs for the given algorithm along- side with the proccessing and preprocessing time • Console - after running an algorithm, in the console there is shown it’s processing and preprocessing time and when the Additional console logs is checked there are some additional logs for every algorithm • Run - button that runs the program, where with a se- lected Algorithm, search in Input text for an Input pattern Figure 4: The main application, where the user can test the imple- A class diagram is shown in Figure 3. mented algorithms. The application consists of the following classes that work as follows: 2.2 The Rabin-Karp algorithm implementation Main is the main class that runs the application. Its pur- pose is to initialize the contents of the frame of GUI win- For the Rabin-Karp string matching procedure, after it re- dow. When users run the application, the Main class gets ceives the input text, the pattern, the value of d that rep- the value of the input text, the input pattern, the input mod- resents the radix − d notation that is uses in its matching ulo and the selected algorithm. The class passes all this and modulo, we need to initialize the value of the high- values to class StringMatching and calls the function run order digit position of an m-digit window. This is shown of the class StringMatching. After the run is completed, in Listing 1 on line 1. After that the preprocessing of this the class fills the output text and console with the results algorithm follows. We need to compute the value of p and of running. the value of the first substring t0 of the text with length of the pattern. Implementation of this is shown in Listing 1 TextHolder is the class that stores the text and the pat- on lines 3 − 6, where the pattern_hash is our value of p tern that the application got on input. It also generates the and sub_text_hash represents the value of t0 . As a default, alphabet and the base based on the given text and pattern. these variables were preset to value 0. 184 Tatiana Jajcayová and Zuzana Majeríková Listing 1: Calculation of Hashed Values of Value h, p and t0 1 i n t h = ( i n t ) Math . pow ( b a s e , p a t t e r n _ l e n g t h −1) % modulo ; 2 3 f o r ( i n t i = 0 ; i < p a t t e r n _ l e n g t h ; i ++) { 4 p a t t e r n _ h a s h = ( b a s e∗p a t t e r n _ h a s h + p a t t e r n . c h a r A t ( i ) ) % modulo ; 5 s u b _ t e x t _ h a s h = ( b a s e∗s u b _ t e x t _ h a s h + t e x t . c h a r A t ( i ) ) % modulo ; 6 } How the processing is implemented can seen in List- ing 2. It starts by iterating through all possible shifts s in the text. On line 3 there is a check whether value p matches value of ts at shift s where the loop is executed. If so, then the character by character comparison between pattern P and substring of text T [s + 1...s + m] follows. If these two strings are equal, we add shift to the array of all of the occurrences, where we store all shifts in which Figure 5: The Rabin-Karp algorithm on an English text generated by we have found a positive match. The check on line 9 will an online random text generator. Three different colors represent three be true on every ts , except the last one tn−m . That is be- different lengths of a searched pattern P. cause the main loop will execute one more time and in that case we need to compute the value of upcoming ts+1 . The text was again about 500 000 characters long, and This computation is implemented on line 10. In the case we tested with different patterns and different lengths of that the value of ts+1 is smaller that 0, we add q to this patterns. As we see in Figure 6, modulus q = 13 performed value on lines 13 − 15. significantly worse than the other moduli. Listing 2: Processing of the Rabin-Karp String Matching Algorithm 1 f o r ( i n t s = 0 ; s < ( t e x t _ l e n g t h − p a t t e r n _ l e n g t h ) ; s ++) { 2 3 i f ( p a t t e r n _ h a s h == s u b _ t e x t _ h a s h ) { 4 String text_substring = text . substring (s , ( s + pattern_length )); 5 i f ( Objects . equals ( pattern , text_substring )) 6 o c c u r r e n c e s . add ( s ) ; 7 } 8 9 if ( s < ( text_length − pattern_length )) { 10 s u b _ t e x t _ h a s h = ( b a s e ∗( s u b _ t e x t _ h a s h − t e x t . c h a r A t ( s )∗h ) + 11 t e x t . c h a r A t ( s + p a t t e r n _ l e n g t h ) ) % modulo ; 12 13 i f ( sub_text_hash < 0) 14 s u b _ t e x t _ h a s h = ( s u b _ t e x t _ h a s h + modulo ) ; 15 } 16 } 3 Testing moduli in the Rabin-Karp algorithm Figure 6: The Rabin-Karp algorithm on English fiction: The adven- tures of Sherlock Holmes by Sir Arthur Conan Doyle. While testing a performance of the implemented string matching algorithms, we were interested whether a choice of the modulus q in the Rabin-Karp algorithm affects the performance of the algorithm. Recall that for the given base b = |Σ|, the modulus q is usually chosen to be a prime number such that the value b · q fits within a computer word. In this section we summarize results of our testing of moduli in the Rabin-Karp algorithm. We had some preconceived ideas how the test results should look like. These expectations were met while test- ing some of the texts, for an example see Figure 5. The sample text in Figure 5 is an English text. It is an artificial text in the sense that it was generated in an online random text generator, with even some further random changes and additions. The text is 400 000 characters long. Figure 7: The Rabin-Karp algorithm on English fiction: The adven- A suprise came when we tested the Rabin-Karp algo- tures of Sherlock Holmes by Sir Arthur Conan Doyle. rithm on The adventures of Sherlock Holmes by Sir Arthur Conan Doyle. The text was obtained from The Project This caught our attention. We tested the same text for Gutenberg. Release Date: March, 1999 [EBook #1661] the second time, now with the moduli ordered in the de- [Most recently updated: November 29, 2002], Edition: 12, scending order (see Figure 7) to make sure that a possi- Language: English, Character set encoding: ASCII. ble error is not caused in the program (language Java can A note on some interesting test results for the Rabin-Karp string matching algorithm 185 sometimes slow down the performance of the program be- cause of the Garbage Collector). Since we got the two graphs that are exactly mirror images of each other, we see that the garbage collecting in Java is not the reason for the unexpected behavior. Our first hypothesis was that the natural language has different characteristics than randomly generated texts. So we continue testing on many different types of English texts and patterns. In Figure 8, we see the results of the testing on a law text The Common Law, by Oliver Wendell Holmes, Jr., and in Figure 9 on a scientific text from the book Ancient And Modern Physics by Thomas E. Will- son. (Both these text were again obtained from the Project Gutenberg.) As we see the charts for all these texts are Figure 10: The Rabin-Karp algorithm on another ran- different. To definitely show that our first hypothesis does domly generated English text. not hold, in Figure 10 is yet another randomly generated English text that has q = 13 behaving as it is in the fiction in Figure 6. Our another guess was that the performance depending on moduli may somehow be related to the language or to the alphabet of the text. Therefore we tested on other languages, including German (Figure 11), and other lan- guages using other alphabets, including Russian (Figures 12 and 13.) We see that we can get all types of behavior. Figure 8: The Rabin-Karp algorithm on an English law text: The Com- mon Law, by Oliver Wendell Holmes, Jr. Figure 11: The Rabin-Karp algorithm on a randomly gen- erated German text. Figure 9: The Rabin-Karp algorithm on an English scientific text: An- cient And Modern Physics by Thomas E. Willson Figure 12: The Rabin-Karp algorithm on the first Russian text (randomly generated). 186 Tatiana Jajcayová and Zuzana Majeríková Figure 13: The Rabin-Karp algorithm on the second Rus- Figure 16: The Rabin-Karp algorithm on the Slovak trans- sian text (randomly generated). lation of Molier’s Lakomec (L’Avare). We were also interested what is happening in Slovak Here it seems that q = 13 is again behaving worse that language. Below you see the results of three tests. The other moduli. (Why?) Just to make a comparison, we also first one, Figure 14, is on a randomly generated Slovak include the results of the testing on the original French text, the second one, Figure 15, is on the original Slovak text. As we see below, Figure 17, modulus q = 13 is not fiction, Adam Šangala by Ladislav Nádaši-Jégé, and the different from others, while q = 7 and in some extend its third one, Figure 16, is on the Slovak translation of French multiples are now the moduli with the worst performance. Molier’s Lakomec (L’Avare). Figure 17: The Rabin-Karp algorithm on the French text. Figure 14: The Rabin-Karp algorithm on a randomly gen- erated Slovak text. We were hoping that by systematically running tests on different types of texts, on different languages and alpha- bets, with different patterns to search for, we would be able to understand the relationship between the computational time and the modulus for Rabin-Karp algorithm. At this moment we see that this tests are not sufficient to explain this relationship, and that other, more involved (probably statistical) methods would be needed to completely under- stand the phenomenon. Acknowlegment The first author acknowledges the support by VEGA 1/0039/17 and VEGA 1/0719/18 Figure 15: The Rabin-Karp algorithm on the original Slo- References vak fiction, Adam Šangala by Ladislav Nádaši-Jégé [1] Marc GOU, Algorithms for String matching, July 30, 2014 A note on some interesting test results for the Rabin-Karp string matching algorithm 187 [2] Nimisha Singla, Deepak Garg, January 2012, String Match- ing Algorithms and their Applicability in various Applica- tionsl, International Journal of Soft Computing and Engi- neering (IJSCE) ISSN: 2231-2307, Volume-I, Issue-6, Jan- uary 2012 [3] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein, 2009, Introduction to Algorithms: Third Edition, Massachusetts Institute of Technology, Cambridge, Massachusetts 02142 [4] Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman, The Design and Analysis of Computer Algorithms, Addison- Wesley, 1974 [5] Richard M. Karp and Michael O. Rabin, Efficient random- ized pattern-matching algorithms, IBM Journal of Research and Development, 31(2):249-260, 1987 [6] Donald E. Knuth, James H. Morris, Jr., and Vaughan R. Pratt, Fast pattern matching in strings, SIAM Journal on Computing, 6(2):323–350, 1977 [7] Edward M. Reingold, Kenneth J. Urban, and David Gries, K- M-P string matching revisited, Information Processing Let- ters, 64(5):217–223, 1997 [8] Zvi Galil and Joel Seiferas, Time-space-optimal string matching, Journal of Computer and System Sciences, 26(3):280–294, 1983 [9] Vidya SaiKrishna, Prof. Akhtar Rasool and Dr. Nilay Khare, String Matching and its Applications in Diversified Fields, IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 1, No 1, January 2012