=Paper= {{Paper |id=Vol-2203/181 |storemode=property |title=A Note on Some Interesting Test Results for the Rabin-Karp String Matching Algorithm |pdfUrl=https://ceur-ws.org/Vol-2203/181.pdf |volume=Vol-2203 |authors=Tatiana Jajcayova,Zuzana Majeríkova |dblpUrl=https://dblp.org/rec/conf/itat/JajcayovaM18 }} ==A Note on Some Interesting Test Results for the Rabin-Karp String Matching Algorithm== https://ceur-ws.org/Vol-2203/181.pdf
S. Krajči (ed.): ITAT 2018 Proceedings, pp. 181–187
CEUR Workshop Proceedings Vol. 2203, ISSN 1613-0073, c 2018 Tatiana Jajcayová and Zuzana Majeríková



         A note on some interesting test results for the Rabin-Karp string matching
                                          algorithm

                                                    Tatiana Jajcayová, Zuzana Majeríková

                                 Department of Applied Informatics, Comenius University, Bratislava, Slovakia
                                                        jajcayova@fmph.uniba.sk
                                                         z.majerikova19@gmail.com

     Abstract: While analyzing and testing some of the well
     known string matching algorithms, we came across rather
     interesting phenomenon in tests results for the Rabin-Karp
     algorithm. Testing which moduli work best in the algo-
     rithm, we realized that on some texts some moduli perform
     much worse then others. This caught our attention, and we
     continued testing. In this paper, we summarize tests using
     different moduli on the same text, tests run on different             Figure 1: Computation of the value for a window based on a previous
                                                                           window in a constant time. The first window has a value “232”. We drop
     English texts (fiction, scientific, law, newspaper), and tests
                                                                           the high-order digit that is “2” and multiply the rest of the window by
     run on different languages and alphabets (English, Slovak,            10. Then we add the low-order digit that is “8” and we get new value
     French, German and Russian). We include analysis of the               that is “328”. The value of the first window is 11 and the value of the
     obtained results and state some hypotheses.                           new window is 3, because all computations are performed with out hash
                                                                           function that is modulo 13.

     1 The Rabin-Karp algorithm
                                                                             Similarly, again by the Horner’s Rule, the value t0 can
     We start with a brief review of the Rabin-Karp algorithm              be computed in Θ(m).
     and its computational complexity. This string match-                    For every s, the value ts+1 can be then computed from
     ing algorithm is based on the use of elementary number-               the previous value ts :
     theoretic notions such as the equivalence of two numbers
                                                                                  ts+1 = 10(ts − 10m−1 T [s + 1]) + T [s + m + 1].
     modulo a third number. Rabin and Karp [5] proposed the
     algorithm to improve performance of string matching al-                  For example, to compute the value t1 from t0 for our text
     gorithms in practice. This algorithm generalizes to other             T = “12345” from the previous paragraph and pattern P of
     algorithms for related problems, such as two dimensional              length m = 3, we use the value t0 which is 123. In this
     pattern matching.                                                     computation s = 0. A character at the position T[s+1] is
        To simplify the explanations, it will be convenient to             “1”, and its numerical value is 1, a character at position
     assume for a moment that characters are decimal digits,               T[s+m+1] is “4” and its value is 4. The computation is as
     so we interpret the characters of strings as both digits and          follows:
     also graphical symbols. Thus, a string of length k can be
     represented as a decimal number. For instatnce, a string                       t1 = 10(t0 − 103−1 T [0 + 1]) + T [0 + 3 + 1]
     s = “12345” of length 5 corresponds to a decimal number
     12 345. In general case, we assume to have an alphabet Σ                              t1 = 10(123 − 102 T [1]) + T [4]
     which consists of characters that are digits from the num-                                 t1 = 10(123 − 100) + 4
     ber system over the base b = |Σ|.
        Let us have a pattern P[1...m] and text T [1...n]. The                                           t1 = 234
     decimal value of pattern P will be denoted as p. Similarly,              We see that by computing ts − 10m−1 T [s + 1], we re-
     the decimal value of every substring of a text T of length            move the highest order digit from value ts . By subsequent
     m starting at a position s will be denoted as ts . We say that        multiplication of the remainder of the value by 10, we shift
     pattern P occurs in text T if and only if one can find at least       the number to the left. 10 is used because we work with
     one ts that has the same decimal value as p, i.e. ts = p and          the decimal system. In the general case, where base b is
     T [s + 1...s + m] = P[1...m], s = 0, 1, . . . , n − m. In that case   |Σ|, we would have to shift by b. By adding T [s + m + 1]
     we call s a valid shift.                                              we add the low-order digit and we get the final value of
        We can compute decimal value of p in time Θ(m) by                  ts+1 . (Figure 1)
     using the Horner’s Rule:                                                 If the constant 10m−1 is precomputed, then for each
     p = P[m] + 10(P[m − 1] + 10(P[m − 2] + 10(P[m − 3] + . . .            s = 1, 2, . . . , n − m, the computation of the value ts is done
                                                                           in constant time. Thus the computation of all the remain-
                         +10(P[2] + 10P[1])...)                            ing values t1 , t2 , t3 , . . ., tn−m is done in Θ(n − m). That
182                                                                                                             Tatiana Jajcayová and Zuzana Majeríková

      means, that by comparing value of p with every value of                          In our version of the Rabin-Karp algorithm the pre-
      ts , we can get all valid shifts in time Θ(m) + Θ(n − m + 1).                 processing time takes Θ(n). The processing time takes
           A problem may occur, when value p and values ts                          Θ((n − m + 1)m) in the worst case, and only Θ(n + 1) in
      are too large. Some mathematical operations with large                        the best case of running. The worst case occurs when all
      numbers may take a long time, and the assumption that                         characters of pattern P and all characters of text T are the
      they take only constant time may be unreasonable. One                         same unique symbol. For example, if we have a pattern
      possible remedy for this is, what Rabin-Karp algorithm                        P=“aaa” and a text T=“aaaaaaaaa”, then algorithm consid-
      does: instead of comparing the original values, it com-                       ers every possible shift as valid and behaves as the naive
      putes values p and ts modulo some suitable modulus q, and                     string matching algorithm.
      then compare these new modulated numbers. The modu-
      lated numbers are not bigger than q − 1. Computation of
      p (mod q) takes again time Θ(m) and computation of all                        2     Implementation
      ts (mod q) values together takes Θ(n − m + 1).
           The modulus q is usually chosen to be a prime such that                  In this section we sketch how we implemented the Rabin-
      the value b · q fits within a computer word. (b is the base,                  Karp algorithm. The particulars of the implementation can
      the number of characters in the alphabet Σ). According to                     become interesting when discussing the test results in the
      this, the previous equation for computing ts + 1 changes                      last section of the paper.
      to:                                                                              We have chosen Java as the programming language.
                                                                                    Java is an object-oriented language, and its biggest ad-
         ts+1 ≡ (b(ts − T [s + 1]h) + T [S + m + 1]) (mod q),                       vantage is its platform independence, which means that
                                                                                    the Java code can be run without modifying at any plat-
      where                                                                         form that supports Java. As a development tool we chose
                              h ≡ d m−1      (mod q)                                Eclipse Java Oxygen with a WindowBuilder component.
         It is clear that if p 6≡ ts (mod q) then P does not start                  WindowBuilder is designed to create Java GUI applica-
      at position s in T . So if this happens, we know that s is not                tions in an easy way.
      a valid shift. On the other hand, it is possible that p ≡ ts
      (mod q), and still the pattern P does not start at s. That                    2.1 Implementation of the Main Application
      means that to complete this algorithm, we must not only
      compare the modulated values and find a match, but also                       We study and test Rabin-Karp algorithm in a context of
      test further to see if the substring T [s + 1...s + m] is the                 other string matching algorithms. To organize the testing
      same as the pattern P. This additional but indispensable                      of the implemented algorithms, we needed to create the
      comparison takes another Θ(m) time. The simple graphi-                        main application with Graphical User Interface. This ap-
      cal example of the Rabin-Karp algorithm and modulation                        plication is build to work with all the studied algorithms
      of the values of ts is shown in Figure 2.                                     and to manage the input texts, patterns and all of the out-
                                                                                    put results. The main components that we need in our ap-
                                                                                    plication:

                                                                                        • Input pattern - text input field, where the user can
                                                                                          write or paste the pattern that he wants to find in the
                                                                                          text Input text

                                                                                        • Input text - text input field, where the user can write
                                                                                          or paste the text where he wants to find all the occur-
                                                                                          rences of the pattern from the Input pattern

                                                                                        • Load text from file - button, that allows the user to
                                                                                          load text from a file to the Input text

                                                                                        • Algorithm - selector, where the user can choose
                                                                                          which algorithm he wants to test

      Figure 2: Example of the Rabin-Karp string matching with alphabet Σ               • Modulo - number input field, where the user can
      in which every character is a decimal digit. a) an example of a hashed              choose the modulo that will be used in the Rabin-
      pattern P = 315 that is located in text T . Part b) demonstrates hashing of         Karp algorithm
      every possible substring of T with the length of P. Two substrings that
      have the same hash value as the pattern are found and they are denoted
      as valid shifts. Only the substrings with the valid shift are compared            • Output text - output text field, where the text is
      further with the pattern by characters. For this case we have found one             rewritten from the Input text, but with highlighted
      occurrence of pattern P in text T at position 6.                                    occurrences of the pattern from Input pattern
A note on some interesting test results for the Rabin-Karp string matching algorithm                                                              183

                                                                             OpenFile is the class that allows the user to load a text
                                                                             from a file with a JFileChooser.

                                                                             Algorithm is an interface for implemented algorithms.
                                                                             Every algorithm needs to have implemented functions get-
                                                                             TimeOfPreprocessing, getTimeOfProcessing, getOccur-
                                                                             rences, getOtherConsoleLogs and run(text, pattern, alpha-
                                                                             bet, base, modulo).

                                                                             NaiveMatching,     RabinKarpMatching,        FiniteAu-
                                                                             tomataMatching, OptimalizedFiniteAutomataMatch-
                                                                             ing, KnuthMorrisPrattMatching are the classes
                                                                             where the implemented algorithms that we used are.

                                                                             StringMatching is the class that gets the selected algo-
                                                                             rithm, input text, input pattern and modulo. First, it pro-
                                                                             cesses the text and the pattern with TextHolder and gets
                                                                             the alphabet and the base generated from the text. Then, it
                                                                             creates a new instance of the selected algorithm and calls
                                                                             it’s function run.
                                                                                The review of the implemented application and it’s GUI
                                                                             is shown in Figure 4, where a simple running of the pro-
                                                                             gram with the naive string matching algorithm with pattern
      Figure 3: Class diagram of the main application generated by The       “the” and text of the book “The Project Gutenberg EBook
      ObjectAid UML Explorer for Eclipse                                     of The Adventures of Sherlock Holmes by Sir Arthur Co-
                                                                             nan Doyle” is demonstrated.
         • Additional console logs - check field, if it is checked
           it means that in the Console there will be written
           some additional logs for the given algorithm along-
           side with the proccessing and preprocessing time
         • Console - after running an algorithm, in the console
           there is shown it’s processing and preprocessing time
           and when the Additional console logs is checked
           there are some additional logs for every algorithm
         • Run - button that runs the program, where with a se-
           lected Algorithm, search in Input text for an Input
           pattern
                                                                             Figure 4: The main application, where the user can test the imple-
         A class diagram is shown in Figure 3.                               mented algorithms.

        The application consists of the following classes that
      work as follows:
                                                                             2.2 The Rabin-Karp algorithm implementation
      Main is the main class that runs the application. Its pur-
      pose is to initialize the contents of the frame of GUI win-            For the Rabin-Karp string matching procedure, after it re-
      dow. When users run the application, the Main class gets               ceives the input text, the pattern, the value of d that rep-
      the value of the input text, the input pattern, the input mod-         resents the radix − d notation that is uses in its matching
      ulo and the selected algorithm. The class passes all this              and modulo, we need to initialize the value of the high-
      values to class StringMatching and calls the function run              order digit position of an m-digit window. This is shown
      of the class StringMatching. After the run is completed,               in Listing 1 on line 1. After that the preprocessing of this
      the class fills the output text and console with the results           algorithm follows. We need to compute the value of p and
      of running.                                                            the value of the first substring t0 of the text with length of
                                                                             the pattern. Implementation of this is shown in Listing 1
      TextHolder is the class that stores the text and the pat-              on lines 3 − 6, where the pattern_hash is our value of p
      tern that the application got on input. It also generates the          and sub_text_hash represents the value of t0 . As a default,
      alphabet and the base based on the given text and pattern.             these variables were preset to value 0.
184                                                                                                                                                                          Tatiana Jajcayová and Zuzana Majeríková


                       Listing 1: Calculation of Hashed Values of Value h, p and t0
       1     i n t h = ( i n t ) Math . pow ( b a s e , p a t t e r n _ l e n g t h −1) % modulo ;
       2
       3     f o r ( i n t i = 0 ; i < p a t t e r n _ l e n g t h ; i ++) {
       4           p a t t e r n _ h a s h = ( b a s e∗p a t t e r n _ h a s h + p a t t e r n . c h a r A t ( i ) ) % modulo ;
       5           s u b _ t e x t _ h a s h = ( b a s e∗s u b _ t e x t _ h a s h + t e x t . c h a r A t ( i ) ) % modulo ;
       6     }



              How the processing is implemented can seen in List-
           ing 2. It starts by iterating through all possible shifts s
           in the text. On line 3 there is a check whether value p
           matches value of ts at shift s where the loop is executed.
           If so, then the character by character comparison between
           pattern P and substring of text T [s + 1...s + m] follows. If
           these two strings are equal, we add shift to the array of
           all of the occurrences, where we store all shifts in which                                                                        Figure 5: The Rabin-Karp algorithm on an English text generated by
           we have found a positive match. The check on line 9 will                                                                          an online random text generator. Three different colors represent three
           be true on every ts , except the last one tn−m . That is be-                                                                      different lengths of a searched pattern P.
           cause the main loop will execute one more time and in
           that case we need to compute the value of upcoming ts+1 .                                                                            The text was again about 500 000 characters long, and
           This computation is implemented on line 10. In the case                                                                           we tested with different patterns and different lengths of
           that the value of ts+1 is smaller that 0, we add q to this                                                                        patterns. As we see in Figure 6, modulus q = 13 performed
           value on lines 13 − 15.                                                                                                           significantly worse than the other moduli.
                 Listing 2: Processing of the Rabin-Karp String Matching Algorithm
       1     f o r ( i n t s = 0 ; s < ( t e x t _ l e n g t h − p a t t e r n _ l e n g t h ) ; s ++) {
       2
       3           i f ( p a t t e r n _ h a s h == s u b _ t e x t _ h a s h ) {
       4                 String text_substring = text . substring (s , ( s + pattern_length ));
       5                 i f ( Objects . equals ( pattern , text_substring ))
       6                          o c c u r r e n c e s . add ( s ) ;
       7           }
       8
       9            if ( s < ( text_length − pattern_length )) {
      10                 s u b _ t e x t _ h a s h = ( b a s e ∗( s u b _ t e x t _ h a s h − t e x t . c h a r A t ( s )∗h ) +
      11                                                            t e x t . c h a r A t ( s + p a t t e r n _ l e n g t h ) ) % modulo ;
      12
      13                   i f ( sub_text_hash < 0)
      14                         s u b _ t e x t _ h a s h = ( s u b _ t e x t _ h a s h + modulo ) ;
      15           }
      16     }




           3 Testing moduli in the Rabin-Karp
             algorithm
                                                                                                                                             Figure 6: The Rabin-Karp algorithm on English fiction: The adven-
                                                                                                                                             tures of Sherlock Holmes by Sir Arthur Conan Doyle.
           While testing a performance of the implemented string
           matching algorithms, we were interested whether a choice
           of the modulus q in the Rabin-Karp algorithm affects the
           performance of the algorithm. Recall that for the given
           base b = |Σ|, the modulus q is usually chosen to be a prime
           number such that the value b · q fits within a computer
           word. In this section we summarize results of our testing
           of moduli in the Rabin-Karp algorithm.
              We had some preconceived ideas how the test results
           should look like. These expectations were met while test-
           ing some of the texts, for an example see Figure 5. The
           sample text in Figure 5 is an English text. It is an artificial
           text in the sense that it was generated in an online random
           text generator, with even some further random changes and
           additions. The text is 400 000 characters long.                                                                                   Figure 7: The Rabin-Karp algorithm on English fiction: The adven-
              A suprise came when we tested the Rabin-Karp algo-                                                                             tures of Sherlock Holmes by Sir Arthur Conan Doyle.
           rithm on The adventures of Sherlock Holmes by Sir Arthur
           Conan Doyle. The text was obtained from The Project                                                                                 This caught our attention. We tested the same text for
           Gutenberg. Release Date: March, 1999 [EBook #1661]                                                                                the second time, now with the moduli ordered in the de-
           [Most recently updated: November 29, 2002], Edition: 12,                                                                          scending order (see Figure 7) to make sure that a possi-
           Language: English, Character set encoding: ASCII.                                                                                 ble error is not caused in the program (language Java can
A note on some interesting test results for the Rabin-Karp string matching algorithm                                                      185

      sometimes slow down the performance of the program be-
      cause of the Garbage Collector). Since we got the two
      graphs that are exactly mirror images of each other, we
      see that the garbage collecting in Java is not the reason for
      the unexpected behavior.
         Our first hypothesis was that the natural language has
      different characteristics than randomly generated texts. So
      we continue testing on many different types of English
      texts and patterns. In Figure 8, we see the results of the
      testing on a law text The Common Law, by Oliver Wendell
      Holmes, Jr., and in Figure 9 on a scientific text from the
      book Ancient And Modern Physics by Thomas E. Will-
      son. (Both these text were again obtained from the Project
      Gutenberg.) As we see the charts for all these texts are                Figure 10: The Rabin-Karp algorithm on another ran-
      different. To definitely show that our first hypothesis does            domly generated English text.
      not hold, in Figure 10 is yet another randomly generated
      English text that has q = 13 behaving as it is in the fiction
      in Figure 6.                                                               Our another guess was that the performance depending
                                                                              on moduli may somehow be related to the language or
                                                                              to the alphabet of the text. Therefore we tested on other
                                                                              languages, including German (Figure 11), and other lan-
                                                                              guages using other alphabets, including Russian (Figures
                                                                              12 and 13.) We see that we can get all types of behavior.




      Figure 8: The Rabin-Karp algorithm on an English law text: The Com-
      mon Law, by Oliver Wendell Holmes, Jr.




                                                                              Figure 11: The Rabin-Karp algorithm on a randomly gen-
                                                                              erated German text.




      Figure 9: The Rabin-Karp algorithm on an English scientific text: An-
      cient And Modern Physics by Thomas E. Willson




                                                                              Figure 12: The Rabin-Karp algorithm on the first Russian
                                                                              text (randomly generated).
186                                                                                              Tatiana Jajcayová and Zuzana Majeríková




      Figure 13: The Rabin-Karp algorithm on the second Rus-         Figure 16: The Rabin-Karp algorithm on the Slovak trans-
      sian text (randomly generated).                                lation of Molier’s Lakomec (L’Avare).


         We were also interested what is happening in Slovak            Here it seems that q = 13 is again behaving worse that
      language. Below you see the results of three tests. The        other moduli. (Why?) Just to make a comparison, we also
      first one, Figure 14, is on a randomly generated Slovak        include the results of the testing on the original French
      text, the second one, Figure 15, is on the original Slovak     text. As we see below, Figure 17, modulus q = 13 is not
      fiction, Adam Šangala by Ladislav Nádaši-Jégé, and the         different from others, while q = 7 and in some extend its
      third one, Figure 16, is on the Slovak translation of French   multiples are now the moduli with the worst performance.
      Molier’s Lakomec (L’Avare).




                                                                     Figure 17: The Rabin-Karp algorithm on the French text.
      Figure 14: The Rabin-Karp algorithm on a randomly gen-
      erated Slovak text.                                               We were hoping that by systematically running tests on
                                                                     different types of texts, on different languages and alpha-
                                                                     bets, with different patterns to search for, we would be able
                                                                     to understand the relationship between the computational
                                                                     time and the modulus for Rabin-Karp algorithm. At this
                                                                     moment we see that this tests are not sufficient to explain
                                                                     this relationship, and that other, more involved (probably
                                                                     statistical) methods would be needed to completely under-
                                                                     stand the phenomenon.

                                                                     Acknowlegment
                                                                      The first author acknowledges the support by
                                                                     VEGA 1/0039/17 and VEGA 1/0719/18

      Figure 15: The Rabin-Karp algorithm on the original Slo-       References
      vak fiction, Adam Šangala by Ladislav Nádaši-Jégé
                                                                     [1] Marc GOU, Algorithms for String matching, July 30, 2014
A note on some interesting test results for the Rabin-Karp string matching algorithm   187

      [2] Nimisha Singla, Deepak Garg, January 2012, String Match-
          ing Algorithms and their Applicability in various Applica-
          tionsl, International Journal of Soft Computing and Engi-
          neering (IJSCE) ISSN: 2231-2307, Volume-I, Issue-6, Jan-
          uary 2012
      [3] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest,
          Clifford Stein, 2009, Introduction to Algorithms: Third
          Edition, Massachusetts Institute of Technology, Cambridge,
          Massachusetts 02142 
      [4] Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman,
          The Design and Analysis of Computer Algorithms, Addison-
          Wesley, 1974
      [5] Richard M. Karp and Michael O. Rabin, Efficient random-
          ized pattern-matching algorithms, IBM Journal of Research
          and Development, 31(2):249-260, 1987
      [6] Donald E. Knuth, James H. Morris, Jr., and Vaughan R.
          Pratt, Fast pattern matching in strings, SIAM Journal on
          Computing, 6(2):323–350, 1977
      [7] Edward M. Reingold, Kenneth J. Urban, and David Gries, K-
          M-P string matching revisited, Information Processing Let-
          ters, 64(5):217–223, 1997
      [8] Zvi Galil and Joel Seiferas, Time-space-optimal string
          matching, Journal of Computer and System Sciences,
          26(3):280–294, 1983
      [9] Vidya SaiKrishna, Prof. Akhtar Rasool and Dr. Nilay Khare,
          String Matching and its Applications in Diversified Fields,
          IJCSI International Journal of Computer Science Issues, Vol.
          9, Issue 1, No 1, January 2012