Multiple Keyword Pattern Matching using
              Position Encoded Pattern Lattices

                Fritz J. Venter1 , Bruce W. Watson2 , and Derrick G. Kourie1
                    1
                        University of Pretoria, {fritz, dkourie}@fastar.org
                           2
                             Stellenbosch University, bruce@fastar.org


             Abstract. Formal concept analysis is used as the basis for two new
             multiple keyword string pattern matching algorithms. The algorithms ad-
             dressed are built upon a so-called position encoded pattern lattice (PEPL).
             The algorithms presented are in conceptual form only; no experimental
             results are given. The ﬁrst algorithm to be presented is easily understood
             and relies directly on the PEPL for matching. Its worst case complexity
             depends on both the length of the longest keyword, and the length of
             the search text. Subsequently a ﬁnite-automaton-like structure, called a
             PEPL automaton, is deﬁned which is derived from the PEPL, and which
             forms the basis for a second more eﬃcient algorithm. In this case, worst
             case behaviour depends only on the length of the input stream. The
             second algorithm’s worst case performance is the same as the match-
             ing phase of the well-known (advanced) Aho-Corasick multiple-keyword
             pattern matching algorithm—widely regarded as the multiple keyword
             pattern matching algorithm of choice in contexts such as network in-
             trusion detection. The ﬁrst algorithm’s performance is comparable to
             that of the matching phase of the lesser-known failure-function version
             of Aho-Corasick.


      1    Overview

      The (multiple) keyword pattern matching problem (in which the patterns are
      ﬁnite strings, or ‘keywords’) consists of ﬁnding all occurrences (including over-
      lapping ones) of the keywords within an input string. Typically, the input string
      is much larger than the set of keywords and the set of keywords are ﬁxed, mean-
      ing they can be preprocessed to produce data-structures for later use while pro-
      cessing the input string. We also make these assumptions in this paper, as this
      problem variant corresponds to many real-life applications in security, computa-
      tional biology, etc [8]. Several decades of keyword pattern matching research have
      yielded many well-known algorithms, such as Knuth-Morris-Pratt, Boyer-Moore,
      Aho-Corasick, and Commentz-Walter. Overview articles are typically more ac-
      cessible than the original literature—see [3, 4, 7] for comprehensive overviews and
      [10, 2] for taxonomies and correctness proofs of such algorithms.
      In this text, the idea of “position encoding” of a set of patterns is introduced.
      This strategy serves as an alternative to traditional algorithms used to match


c 2012 by the paper authors. CLA 2012, pp. 281–292. Copying permitted only for
  private and academic purposes. Volume published and copyrighted by its editors.
  Local Proceedings in ISBN 978–84–695–5252–0,
  Universidad de Málaga (Dept. Matemática Aplicada), Spain.
282       Fritz J. Venter, Bruce W. Watson and Derrick G. Kourie

      a set of patterns. These algorithms typically rely on common pattern preﬁxes,
      suﬃxes and/or factors in general. The Aho-Corasick [1] algorithm (AC) is prob-
      ably the best known and most widely used of these algorithms. Its best- and
      worst-case performance are both being linear in the size of the input stream and
      independent of the number of patterns to be matched.
      Formal concept analysis (FCA) is used to leverage the potential beneﬁts of po-
      sition encoding. A formal context in which O is the set of objects, A is the
      set of attributes, and I is an incidence relation between the objects and the
      attributes will be denoted by K = hO, A, Ii. The concept lattice, B, derived
      from the context hO, A, Ii will be denoted by B(hO, A, Ii). The extent, intent
      and set of own objects of a concept c in a concept lattice will be denoted by
      extent(c), intent(c), and ownobj(c) respectively. The inﬁmum of a set of con-
      cepts C will be denoted by inf(C). Finally, ⊤(b) denotes the attribute top of
      attribute b—i.e. ⊤(b) is the largest concept whose intent contains b.
      In Section 2, it is shown how FCA can be used to construct a concept lattice from
      a position encoded set of patterns. Such a lattice is called a position encoded pat-
      tern lattice (PEPL). A ﬁrst algorithm, called PMatch, is developed in Section 3,
      which takes such a PEPL together with text stream to be searched as input and
      produces the desired match occurrences as output, albeit in a rather ineﬃcient
      way. As an alternative, a so-called PEPL automaton is deﬁned in Section 4,
      based on the information in a PEPL. A second algorithm given in Section 5 uses
      this automaton and the text stream to be searched as input and also produces
      the desired match occurrences as output. However, in this instance the theoreti-
      cal performance of the algorithm corresponds to that of Aho-Corasick. In a ﬁnal
      section, we reﬂect on the implications of these results.


      2    Position Encoded Pattern Lattices (PEPLs)

      The length of string p will be denoted by |p| and its (i + 1)st element by pi for
      i ∈ [0, |p|). A match occurrence of a single pattern p in target s is a pair hp, ti,
      such that ∀ k ∈ [0, |p|), pk = st+k . The problem of matching a set of patterns P
      on target s can be deﬁned as the requirement to construct the set of all match
      occurrences, denoted by MO in our algorithms.

      Deﬁnition 1 (Position encoding of a set of patterns). The position en-
                                                                        →•
      coding of string w is the set of position-symbol pairs denoted by w and is given
         →•    S
      by w = ( k : k ∈ [1, |w|] : {hk, wk−1 i}).
                                                                   →•                   →•
      The position encoding of a set of strings P is denoted P and is given by P =
       S             →•
      ( w : w ∈P : w)
                                                          →•
      For example, the position encoding of “pack” is pack = {h1, pi, h2, ai, h3, ci, h4, ki},
                                 →•
      and of “packet” it is packet = {h1, pi, h2, ai, h3, ci, h4, ki, h5, ei, h6, ti}. In this
    Mul. Keyword Pattern Matching using Pos. Encoded Pattern Lattices                                       283

case, the position encoding of the set of patterns P = {pack, packet} and of
                                                  →•         →•
“packet” happens to be the same, i.e. P = packet.
Given any set of patterns, we can now constitute a formal context K # along
the following lines. Regard the words in the set of patterns as a set of objects.
Let the position-symbol pairs of the position encoding of the set of patterns
serve as attributes of these objects: a given word has as its attributes all the
position-symbol pairs that make up its position-encoding.
As an example, consider the set of patterns P = {abc, aabc, abcc}. Table 1 shows
the cross table that represents the position encoded formal context derived from
                                                   →•       →•                  →•
P . This context can be denoted by hP, P , I P i, where I P is the incidence re-
lation between objects and attributes depicted in the cross table. The formal
concept lattice to be derived from such a context will be called a Position En-
                                                            →•        →•       →•
coded Pattern Lattice (PEPL), denoted by P (hP, P , I P i) or, more concisely,
    →•
by P . The cover graph of the underlying PEPL is shown in Figure 2.
                                                                                         h1, ai
                                                                                           1

                                                                                h4, ci            h3, cih2, bi
    →•   →•
                h1, ai h2, ai h2, bi h3, bi h3, ci h4, ci
hP, P , I P i                                                                        2              3
abc                  ×          ×             ×                  h2, aih3, bi                     abc
aabc                 ×   ×             ×                                   4               5
abcc                 ×          ×             ×     ×
                                                                       aabc              abcc
                                                                                     6
Fig. 1: Position encoded context for P =
{abc, aabc, abcc}.
                                                                  Fig. 2: Cover graph of PEPL for
                                                                  P = {abc, aabc, abcc}.


If a search of text s is currently at position s[t], and it is found that s[t + n] = a,
then attribute hn, ai is said to positively check against s at t.
It is evident that the intent of a PEPL concept has the following property: If all
the attributes in the intent have been positively checked against a search text s
at position t, then the (one or more) words that are own objects of the concept
match the text, starting at position t. This is clearly the case for concepts 3, 4
and 5 with own objects “abc”, “aabc” and “abcc” respectively.


3    PEPL-based Matching Using PMatch

Algorithm 1 described below is based on the insights of the previous section. Its
                                                                                           →•        →•     →•
top level procedure is called PMatch, which takes as input a PEPL P (hP, P , I P i)
                →•
(or simply P ) and a text, s. It then ﬁnds in s all match occurrences of words
284       Fritz J. Venter, Bruce W. Watson and Derrick G. Kourie

      in P , recording them in MO. The algorithm is articulated in Dijkstra’s guarded
      command language (GCL), widely used for its conciseness and precision [5]. The
      deﬁnition of PMatch assumes constant minlength(P) as the length of the short-
                                                        →•
      est keyword in P . To avoid notational clutter, P , s and MO are assumed to be
      globally accessible to all procedures. A special symbol, nil, is used to designate
      a non-existent concept, speciﬁcally the parent of the top concept.

      PMatch calls matchIntent for each character in s where a match could possibly
      start (i.e. the tail is ignored). The condition in the associated for-loop is intended
      to signify that these probes are from left to right. In each call the intent of the
      top of the lattice and its non-existent parent, nil, are used as parameters.

      matchIntent takes a string position t, and two concepts, c and p, as parameters.
      It is assumed that c is a child of p and the set diﬀerence, ∆, between their intents
      is computed. The special case of ⊤, which has no parent, is catered for. A loop
      checks whether all the attributes in ∆ indicate positional matches in the text
      s as oﬀset by the current search position, t—i.e. the loop removes from ∆ all
      attributes of the form hi, αi such that s[t + i] = α. If this reduces ∆ to the empty
      set, then a match occurrence is considered to have been found for each own
      object at c. Moreover, match(c, t) can be called to investigate whether further
      match occurrences at t can be inferred by considering c’s children. match(p, t), in
      turn, simply sweeps through the children of p, recursively invoking matchIntent
      in each case.

      Algorithm 1 PEPL Based Matching
                      →•
      proc PMatch( P , s)
           MO, j : = ∅, minlength(P );
           { Traverse target string s from left to right }
           for (t ∈ [0, |s| − j + 1)) →
                matchIntent(t, ⊤, nil)
           rof
      corp{ post : MO is the set of match occurrences of P in s }
      proc matchIntent(t, c, p)
           if (p = nil) → ∆ : = intent(c)
           [] (p 6= nil) → ∆ : = intent(c) \ intent(p)
           f i;
           do (∃hi, αi : hi, αi ∈ ∆ : (s[t + i − 1] = α)) →
                ∆ : = ∆ \ {hi, αi}
           od;
           if (∆ = ∅) → MO : = MO ∪ ownobj(c) × {t};
                            for all c′ ∈ children(c) → matchIntent(t, c′ , c) rof
           [] (∆ 6= ∅) → skip
           fi
      corp
  Mul. Keyword Pattern Matching using Pos. Encoded Pattern Lattices                           285

To illustrate how Algorithm 1 works, consider the keywords to match P =
                                                                                         →•   →•
{abc, aabc, abcc} and the target s = aaabcdabccd. The formal context hP, P , I P i
                                                                                    →•
is given in Fig. 1 and the cover graph for the corresponding PEPL, P , is in Fig. 2.
For convenience, the intents and own object sets of each concept are made explicit
in Table 1. Table 2 provides a trace summary of calls to matchIntent. The ﬁrst


                 Id of c intent(c)                                      ownobj(c)
                 1 = ⊤ {h1, ai}
                 2       {h1, ai, h4, ci}
                 3       {h1, ai, h3, ci, h2, bi}                          abc
                 4       {h1, ai, h4, ci, h3, bi, h2, ai}                 aabc
                 5       {h1, ai, h4, ci, h3, ci, h2, bi}                 abcc
                 6 = ⊥ {h1, ai, h4, ci, h3, bi, h3, ci, h2, bi, h2, ai}
               Table 1: Details of concepts of the PEPL in Fig. 2

column shows t, the oﬀset into s from which matching positions are calculated.
The second and third columns show the lattice concept visited and its concept
participating in the call to matchIntent. The fourth column marked ∆ gives the
set diﬀerence between the intent of the child and parent concept. A column per
symbol in the string aaabcdabccd then follows. The last column gives the own
object set to be to be used to update MO when a match has been found. Note
that since minlength(P ) = 3 and |s| = 11, the trace ranges over t ∈ [0, 9). Each


                t c p ∆              a a a b c d a b c c d ownobj(c)
                0 1 nil {h1,ai}      T                     ∅
                0 2 1 {h4,ci}              F
                0 3 1 {h2,bi, h3,ci}   F
                1 1 nil {h1,ai}        T                   ∅
                1 2 1 {h4,ci}                T             ∅
                1 4 2 {h2,ai, h3,bi}     T T               {aabc}
                1 3 1 {h2,bi, h3,ci}     F
                2 1 nil {h1,ai}          T
                2 2 1 {h4,ci}                  F
                2 3 1 {h2,bi, h3,ci}       T T             {abc}
                2 5 3 {h4,ci}                  F
                3 1 nil {h1,ai}            F
                4 1 nil {h1,ai}              F
                5 1 nil {h1,ai}                F
                6 1 nil {h1,ai}                  T         ∅
                6 2 1 {h4,ci}                          T   ∅
                6 4 2 {h2,ai, h3,bi}               F
                6 5 2 {h2,bi, h3,ci}               T T     {abcc}
                6 3 1 {h2,bi, h3,ci}               T T     {abc}
                7 1 nil {h1,ai}                    F
                8 1 nil {h1,ai}                      F
     Table 2: Algorithm 1 trace: matching {abc, aabc, abcc} in aaabcdabccd


row is a matching step of the algorithm—i.e. every row represents a call of the
function matchIntent . As an example, the ﬁrst row indicates that the matching
position t = 0 and the attribute set to match is ∆ = {h1,ai}. The ﬁrst (and only)
286       Fritz J. Venter, Bruce W. Watson and Derrick G. Kourie

      element of the set is hi,αi = h1,ai. This means that position t + i = 1 is checked
      for the symbol α = a, which is indeed the case as indicated by the “T” (for the
      boolean value true) shown in the ﬁrst column for the target string. All “T” entries
      in the table indicate that attributes in ∆ have been successfully matched in the
      do-loop of matchIntent. Once ∆ has been reduced to ∅, MO has to be updated.
      Of course, if the concept has no own object—as is the case for the top concept
      marked 1—then nothing is added to MO (i.e. ownobj(c)×{t} = ∅). Subsequent
      calls to match without updating t, recursively deal with children concepts of the
      one currently under test. The second row of the table therefore logs the results
      the call to matchIntent made via match in respect of concept 2, the leftmost
      child of concept 1. In this case, the intent diﬀerence set is ∆ = {h4,ci}, and since
      ∄hi, αi : {h4,ci} : (s[t + i − 1] = α), (or, more explicitly, s[0 + 4 − 1] = b and not
      c) matchIntent cannot reduce ∆ to ∅. This is indicated by “F” (for false) as an
      entry in the relevant column of the table. Control now returns to match, where
      the next child of concept 1, namely concept 3, is considered. Further rows of
      the table illustrate the execution steps of Algorithm 1 for the rest of the target
      string.
      PMatch eliminates sets of words from P that do not match in s without ever
      backing up in s, i.e. t is monotonically increasing. In this sense PMatch is an
      online algorithm, similar to the AC algorithm. However, PMatch sometimes
      revisits symbols in s. Such revisits are reﬂected by the multiple entries in various
      columns representing symbols in aaabcdabccd in Table 2.
      The execution complexity of the matching process per position checked in s is
      bounded by the size of the PEPL. Table 2 shows how all concepts are visited when
      t = 6. An (rather conservative) upper bound of the complexity of Algorithm 1 is
                 →•
      therefore (| P |× |s|). The advanced AC algorithm is of course more eﬃcient than
      this. Not only does it check every symbol in s exactly once; it also avoids the
      application of the expensive set diﬀerence operator that is applied in matchIntent
      of Algorithm 1. Instead, the advanced AC simply makes an automaton transition
      and considers whether an accepting state has been entered. In the upcoming
      sections, we reﬁne our algorithm to arrive at a PEPL-based algorithm with
      similar performance characteristics to advanced AC.


      4    PEPL Automata

      For PEPL based matching to achieve the same order-of-magnitude performance
      as the advanced AC algorithm, this section deﬁnes a structure called a PEPL
      Automaton.
      Firstly, the position encoded formal context for the set of keywords P is aug-
      mented. This augmented context has additional entries to reﬂect information
      about each keyword p ∈ P whose ﬁrst symbol matches the symbol at the mth
      index of some other keyword, where m > 0.
    Mul. Keyword Pattern Matching using Pos. Encoded Pattern Lattices                   287

Deﬁnition 2 (Augmentation operator). For two strings p and y we define
the operator denoted # as
           (
             {(p)y} if y 6= ε ∧ p 6= ε
    p#y =
             ∅       otherwise

Deﬁnition 3 (Augmentation of a string). We define the augmentation of
string y with respect to string x as
          #
               [
    hx, yi = (     p, s, r, t : ((x = p · s ∧ y = s · r ∧ s 6= ε) ∨ (x = p · y · t)) : p#y )


Thus, for each proper suﬃx3 s of x that is also preﬁx of y, we compute the
singleton set p#y (but possibly the empty set) and add all such singleton sets
into one big set. Note that there may be several such sets. For example, if x = aa
and y = aaaa then the following decompositions of x are relevant :
x = ε · aa; x = a · a; so that
     hx, yi# = ε#aaaa ∪ a#aaaa
             = ∅ ∪ {(a)aaaa}}
             = {(a)aaaa}

Deﬁnition 4 (String-augmentation of a language). We define the string-
augmentation of language V with respect to string w as follows.

   hV, wi# = (    v : v ∈ V : hv, wi# )
              [


Then
               #
     h{x, y}, yi = {(a)aaaa, (aa)aaaa, (aaa)aaaa}
               #
     h{x, y}, xi = {(a)aa, (aa)aa, (aaa)aa}
               #
        h{x}, yi = {(a)aaaa}
        h{y}, xi# = {(a)aa, (aa)aa, (aaa)aa}

Deﬁnition 5. For a set of patterns P we define the augmented patterns as
                                          #
              [
  P# = P ∪ (      p : p ∈ P : hP \ {p}, pi )

Thus,
                                   #            #
     {x, y}# = {x, y} ∪ h{x}, yi ∪ h{y}, xi
             = {aa, aaaa} ∪ {(a)aaaa}{(a)aa, (aa)aa, (aaa)aa}
             = {aa, aaaa, (a)aaaa, (a)aa, (aa)aa, (aaa)aa}
3
    By proper suﬃx, we mean that the empty string is not taken as a suﬃx.
288        Fritz J. Venter, Bruce W. Watson and Derrick G. Kourie

                                K#    h1, ai h2, ai h2, bi h3, ci h4, ci h3, bi h5, ci
                              abc       ×             ×      ×
                              aabc      ×      ×                    ×      ×
                              abcc      ×             ×      ×      ×
                              (a)abc    ×      ×                    ×      ×
                              (a)abcc   ×      ×                    ×      ×      ×
                   Table 3: Context derived by augmenting P = {abc, aabc, abcc}


      Given set of patterns P , we can constitute a formal context denoted K # for a
      PEPL using objects from the augmented set of patterns P # and attributes from
      →•
      P #.

      Example 1. As an example, consider again the set of patterns P = {abc, aabc, abcc}
      augmented with the set of patterns P # derived as follows:
                                               #                              #                       #
      P # = P ∪ h{aabc, abcc}, abci ∪ h{abc, abcc}, aabci                         ∪ h{abc, aabc}, abcci
          = P ∪ {(a)abc} ∪ ∅ ∪ {(a)abcc}
          = P ∪ {(a)abc, (a)abcc}
          = {abc, aabc, abcc, (a)abc, (a)abcc}

      Table 3 depicts the context derived from the set of objects P # and their corre-
      sponding position encoding as attributes.

                                                          •→
      Algorithm 2 discussed in Section 5 uses P to match a set of keywords P against
                                                                                                      •→
      a target string. The algorithm relies on a second pre-processing step after P
                                                                                   •→
      has been generated from its context. This step traverses P to create a special
      automaton, called a “Position Encoded Pattern Lattice Automaton” or simply
      a PEPL Automaton. It is deﬁned as follows.

                                                                    •→
      Deﬁnition 6 (PEPL Automaton). Given P , the PEPL that has been de-
      rived from set of patterns P # , the associated PEPL automaton is a five-tuple
         •→        •→   •→   •→    •→
      hQ P , V P , δ P , q0 P , F P i such that:
           •→                                     •→
       – Q P ⊆ {0} ∪ ( q, c : q = id(c) ∧ c ∈ P : {q}) is regarded as the automa-
                         S
         ton’s set of states, where id(c) simply gives a unique numerical identifier for
         concept c.
              •→        •→
       – V P = P is regarded as the automaton’s alphabet, indicating position-symbol
         pairs.
              •→        •→    •→          •→
       – δ P : Q P × V P 9 Q P × |P # Max | is the automaton’s transition func-
         tion. The mapping is generally determined by the recursive relationship:
                              Mul. Keyword Pattern Matching using Pos. Encoded Pattern Lattices                                                                              289


h5, ci     h3, bi       h2, ai        h4, ci   h1, ai   h3, ci   h2, bi       h5, ci     h3, bi        h2, ai          h4, ci           h1, ai         h3, ci        h2, bi

                                                                                                                                        h1, ai/2


                                                 1                                                                                           1
                                                                                                                                                      h2, bi/3

                                                                                                                 h2, ai/3

                                           2                                                                                  2                  h3, ai/2


                                                                                                                                  h3, bi/1


                                  3                     7                                           h4, ci/5       3         h3, bi/4                  7         h3, ci/4

                                                                                                                           h5, ci/6
                                                                                                                h3, ai/3
                                                                                                                                                     h4, ci/5
                                  4              6                                                                 4                         6


                                           5                                                                                  5


         aabc       (a)abc    (a)abcc           abcc                abc                aabc       (a)abc       (a)abcc                    abcc                              abc

                        (a) Cover Graph for the PEPL for the con-                                     (b) PEPL Automaton superimposed on the
                        text of P #                                                                   Cover Graph

                                                                  Fig. 3: PEPL derived from P #

                                  •→
                              δ P (id(c), hi, αi) = id(c′ )/i′ where variables c′ and i′ are defined as follows:
                                     
                                     
                                      hc, ii                         if A(c, i, α) ∧ B(c, i, α)
                                     
                                     h⊤(h1, αi), 2i                  if A(c, i, α) ∧ ¬B(c, i, α) ∧ ⊤(h1, αi) 6= ⊥
                        hc′ , i′ i =
                                     
                                     
                                      h⊤,  1i                        if A(c, i, α) ∧ ¬B(c, i, α) ∧ ⊤(h1, αi) = ⊥
                                       hinf({c, ⊤(hi, αi)}), i + 1i otherwise
                                     

                                 where A(c, i, α) ≡ inf({c, ⊤(hi, αi)}) = ⊥ and B(c, i, α) ≡ hi − 1, αi ∈
                                 intent(c)
                                 The the first level recursion starts off with c = ⊤. The notation q/x means
                                 that when a transition to state q is made, the automaton produces the addi-
                                 tional value x.
                                      •→
                             – q0 P = ⊤ is the automaton’s start state, which is also the top concept of the
                               PEPL.
                                 •→                                  •→
                             – F P = ( q, c : q = id(c) ∧ c ∈ Q P ∧ |ownobj(c)| > 0 : {q}) is the
                                       S
                               automaton’s set of final states.

                        The above deﬁnition embeds suﬃcient information to derive algorithmically a
                        DFA whose transition diagram can be superimposed on the cover graph of the
                        PEPL.
290       Fritz J. Venter, Bruce W. Watson and Derrick G. Kourie

      It is assumed below that an function, getFA, is available which delivers a PEPL
                     •→                                                            •→
      automaton M P when provided with a PEPL. As an example, getFA( P ) will
                •→
      return M P , the PEPL-Automaton (partially) shown in Fig. 3b, superimposed
                                     •→
      over the cover graph for P . Note that in order to avoid clutter, a number of
      arcs have not been shown in Fig. 3b. For example, the many of transitions to ⊤
      have been left out.


      5    Matching Using a PEPL-Automaton

      Viewed as a DFA, a PEPL automaton could be used to test whether a given
      sequence of its alphabet are in the regular set of patterns that it describes.
      For example, it can easily be seen in Figure 3 that, starting from ⊤, successive
      transitions on elements of the string hh1, ai, h2, bi, h3, cii lead to the ﬁnal state
      6, aﬃrming that this sequence is indeed part of the set of patterns described by
      the automaton, and and since abc is an own object of 6, aﬃrming that abc is in
      the original set of patterns, P .
      However, the PEPL is not primarily intended to be used in this way. Instead,
      the PEPL is used in Algorithm 2 to ﬁnd all match opportunities in P in a text
      s. The algorithm’s do processes symbols of s, updating variables c and i to keep
      track of partial matches in that part of s already processed. This is expressed as
      loop invariant Inv(c, i, t) ≡
       – M O contains all matches in s[0,t−i+1) . (These are the matches already pro-
         cessed.)
       – And s[t−i+1,t) matches the ﬁrst i − 1 characters of all patterns in extent(c).
         (These are the partial matches in progress.)
      To illustrate matching as executed by Algorithm 2, consider the steps logged in
      Table 4 when matching the set of patterns abc,aabc,abcc against the target string
      aaabcdabccd. The ﬁrst ﬁve entries of each row in this table shows the values of
      variables t, s[t], c, i as they have been updated as a result of the statements in the
      body of the main loop in Algorithm 2. The next set of entries in the respective
      row are all empty except for the position where s[t] has been matched. At such a
      position a “T” or “F” is shown depending on the result of the match. The next
      entry in the row shows the size of the intent of c. The last entry in this row gives
      the patterns matched at the step represented by the row.
      Algorithm 2 PEPL Automaton Based Matching
                     •→
      proc PAutMatch( P , s)
           MO : = ∅;
               •→               •→
            M P : = getFA( P );
                           •→                          •→
            hc, ii, t : = hq0 P , 1i, 0; { Recall that q0 P = ⊤ }
  Mul. Keyword Pattern Matching using Pos. Encoded Pattern Lattices                291

       { invariant: Inv(c, i, t) }
                                     •→
     do (t < |s|) → hc, ii, t : = δ P (c, hi, s[t]i), t + 1;
                    if (i = |intent(c)|) → MO : = MO ∪ ({t} × ownobj(c))
                    [] (i 6= |intent(c)|) → skip
                    fi
     od
corp { post : MO is the set of match occurrences of P in s }

As an example we present an explanation of the steps up to the ﬁrst matched
patterns being recorded. Consider the ﬁrst row of Table 4. This row represents
the ﬁrst step of the matching process. After this step, a transition is made from
the start state (c = ⊤) to the same state (c′ = ⊤). The transition is due to the
                      •→
transition function δ P (c, hi, s[t]i) returning the value ⊤/2 for the oﬀset variable
i = 1 and symbol s[t = 0] = a. The last entry in the row is empty as the
top node does not contain any own objects. The next row shows the transition
 •→
δ P (⊤, h2, ai) = 3/3 due to value of the variable t being incremented from its
value in the previous step. This process continues until the patterns aabc, (a)abc
are recorded when the variables i and t are both (coincidentally) equal to 4 and
state 4 is reached in the 5th row of the table. Recall that a match is recorded for a
state that is represented by a concept such that the size of such concept’s intent
(as shown in the second last entry) is the same as the oﬀset variable i.


         co io t s[t] c i a a a b c d a b c c d |intent(c′ )| {Patterns Matched}
         ⊤ 1 0    a ⊤ 2 T                            1                 ∅
         ⊤ 2 1    a 3 3     T                        2                 ∅
         3 3 2    a 3 3       T                      4                 ∅
         3 3 3    b 3 4         T                    4                 ∅
         3 4 4    c 4 5           T                  4          {aabc,(a)abc}
         4 5 5    d ⊤ 1             F                5                 ∅
         ⊤ 1 6    a ⊤ 2               T              5                 ∅
         ⊤ 2 7    b 7 3                 T            3                 ∅
         7 3 8    c 7 4                   T          3               {abc}
         7 4 9    c 6 5                     T        4              {abcc}
         6 5 10 d ⊤ 1                         F      5                 ∅
      Table 4: Positions visited in aaabcdabccd when matching abc,aabc,abcc


We have arrived at a particularly eﬃcient algorithm, thanks to two observations
                                                                             •→
about PAutMatch . Firstly, transitions in the PEPL-Automaton (δ L ) can be
done in constant time using a lookup table. Secondly, the if statement can be
made in constant time, and consists of simple integer arithmetic to advance
through the lattice and target s, and an update of MO (only if a match has
been found). The latter can be done using a precomputed lookup table, as is
done in the advanced AC algorithm. These two characteristics are also found in
the advanced AC algorithm, and is unavoidable in pattern matching algorithms,
giving us the same exact (worst- and best-case) running time of |s|.
292       Fritz J. Venter, Bruce W. Watson and Derrick G. Kourie

      The example in Table 4 illustrates how, in contrast to the example in Table 2,
      each symbol in the string aaabcdabccd is visited exactly once to match the key-
      words {abc, aabc, abcc}.


      6   Conclusion

      The application of FCA in pattern matching was ﬁrst introduced in [9]. There,
      two-dimensional pattern information was encoded into a concept lattice which
      was subsequently used as the basis for traversing a two-dimensional space in
      search of a speciﬁc pattern. Here, by contrast, common information about multi-
      ple keywords is encoded into a PEPL, to form the basis for discovering positional
      information about matching instances of those keywords in a linearly streamed
      text. The two new pattern matching algorithms are shown to have theoretical
      running-time comparable to the Aho-Corasick family of algorithms.
      Ongoing work involves benchmarking the new algorithms against the Aho-Corasick
      and other multiple keyword pattern matching algorithms. We are also ﬁnd-
      ing ways in which FCA can be eﬀectively used in other stringology contexts
      [6].


      References
       1. Alfred V. Aho and Margaret J. Corasick. Eﬃcient string matching: an aid to
          bibliographic search. Communications of the ACM, 18(6):333–340, June 1975.
       2. Loek Cleophas, Bruce W. Watson, and Gerard Zwaan. A new taxonomy of sublin-
          ear right-to-left scanning keyword pattern matching algorithms. Science of Com-
          puter Programming, 75:1095–1112, 2010.
       3. Maxime A. Crochemore and Wojciech Rytter. Text Algorithms. Oxford University
          Press, 1994.
       4. Maxime A. Crochemore and Wojciech Rytter. Jewels of Stringology. World Scien-
          tiﬁc Publishing Company, 2003.
       5. Derrick G. Kourie and Bruce W. Watson. The Correctness-by-Construction Ap-
          proach to Programming. Springer Verlag, 2012.
       6. Derrick G. Kourie, Bruce W. Watson, Loek Cleophas, and Fritz Venter. Failure
          deterministic ﬁnite automata. In Jan Holub, editor, Proceedings of the Prague
          Stringology Conference (PSC). Czech Technical University, August 2012.
       7. William F. Smyth. Computing Patterns in Strings. Addison-Wesley, 2003.
       8. George Varghese. Network Algorithmics: An Interdisciplinary Approach to Design-
          ing Fast Networked Devices. Morgan Kaufmann, 2004.
       9. Fritz Venter, Derrick G. Kourie, and Bruce W. Watson. FCA-based two dimen-
          sional pattern matching. In Proceedings of the 7th International Conference on
          Formal Concept Analysis, 2009.
      10. Bruce W. Watson. Taxonomies and Toolkits of Regular Language Algorithms. PhD
          thesis, Eindhoven University of Technology, September 1995.