Multiple Keyword Pattern Matching using Position Encoded Pattern Lattices Fritz J. Venter1 , Bruce W. Watson2 , and Derrick G. Kourie1 1 University of Pretoria, {fritz, dkourie}@fastar.org 2 Stellenbosch University, bruce@fastar.org Abstract. Formal concept analysis is used as the basis for two new multiple keyword string pattern matching algorithms. The algorithms ad- dressed are built upon a so-called position encoded pattern lattice (PEPL). The algorithms presented are in conceptual form only; no experimental results are given. The first algorithm to be presented is easily understood and relies directly on the PEPL for matching. Its worst case complexity depends on both the length of the longest keyword, and the length of the search text. Subsequently a finite-automaton-like structure, called a PEPL automaton, is defined which is derived from the PEPL, and which forms the basis for a second more efficient algorithm. In this case, worst case behaviour depends only on the length of the input stream. The second algorithm’s worst case performance is the same as the match- ing phase of the well-known (advanced) Aho-Corasick multiple-keyword pattern matching algorithm—widely regarded as the multiple keyword pattern matching algorithm of choice in contexts such as network in- trusion detection. The first algorithm’s performance is comparable to that of the matching phase of the lesser-known failure-function version of Aho-Corasick. 1 Overview The (multiple) keyword pattern matching problem (in which the patterns are finite strings, or ‘keywords’) consists of finding all occurrences (including over- lapping ones) of the keywords within an input string. Typically, the input string is much larger than the set of keywords and the set of keywords are fixed, mean- ing they can be preprocessed to produce data-structures for later use while pro- cessing the input string. We also make these assumptions in this paper, as this problem variant corresponds to many real-life applications in security, computa- tional biology, etc [8]. Several decades of keyword pattern matching research have yielded many well-known algorithms, such as Knuth-Morris-Pratt, Boyer-Moore, Aho-Corasick, and Commentz-Walter. Overview articles are typically more ac- cessible than the original literature—see [3, 4, 7] for comprehensive overviews and [10, 2] for taxonomies and correctness proofs of such algorithms. In this text, the idea of “position encoding” of a set of patterns is introduced. This strategy serves as an alternative to traditional algorithms used to match c 2012 by the paper authors. CLA 2012, pp. 281–292. Copying permitted only for private and academic purposes. Volume published and copyrighted by its editors. Local Proceedings in ISBN 978–84–695–5252–0, Universidad de Málaga (Dept. Matemática Aplicada), Spain. 282 Fritz J. Venter, Bruce W. Watson and Derrick G. Kourie a set of patterns. These algorithms typically rely on common pattern prefixes, suffixes and/or factors in general. The Aho-Corasick [1] algorithm (AC) is prob- ably the best known and most widely used of these algorithms. Its best- and worst-case performance are both being linear in the size of the input stream and independent of the number of patterns to be matched. Formal concept analysis (FCA) is used to leverage the potential benefits of po- sition encoding. A formal context in which O is the set of objects, A is the set of attributes, and I is an incidence relation between the objects and the attributes will be denoted by K = hO, A, Ii. The concept lattice, B, derived from the context hO, A, Ii will be denoted by B(hO, A, Ii). The extent, intent and set of own objects of a concept c in a concept lattice will be denoted by extent(c), intent(c), and ownobj(c) respectively. The infimum of a set of con- cepts C will be denoted by inf(C). Finally, ⊤(b) denotes the attribute top of attribute b—i.e. ⊤(b) is the largest concept whose intent contains b. In Section 2, it is shown how FCA can be used to construct a concept lattice from a position encoded set of patterns. Such a lattice is called a position encoded pat- tern lattice (PEPL). A first algorithm, called PMatch, is developed in Section 3, which takes such a PEPL together with text stream to be searched as input and produces the desired match occurrences as output, albeit in a rather inefficient way. As an alternative, a so-called PEPL automaton is defined in Section 4, based on the information in a PEPL. A second algorithm given in Section 5 uses this automaton and the text stream to be searched as input and also produces the desired match occurrences as output. However, in this instance the theoreti- cal performance of the algorithm corresponds to that of Aho-Corasick. In a final section, we reflect on the implications of these results. 2 Position Encoded Pattern Lattices (PEPLs) The length of string p will be denoted by |p| and its (i + 1)st element by pi for i ∈ [0, |p|). A match occurrence of a single pattern p in target s is a pair hp, ti, such that ∀ k ∈ [0, |p|), pk = st+k . The problem of matching a set of patterns P on target s can be defined as the requirement to construct the set of all match occurrences, denoted by MO in our algorithms. Definition 1 (Position encoding of a set of patterns). The position en- →• coding of string w is the set of position-symbol pairs denoted by w and is given →• S by w = ( k : k ∈ [1, |w|] : {hk, wk−1 i}). →• →• The position encoding of a set of strings P is denoted P and is given by P = S →• ( w : w ∈P : w) →• For example, the position encoding of “pack” is pack = {h1, pi, h2, ai, h3, ci, h4, ki}, →• and of “packet” it is packet = {h1, pi, h2, ai, h3, ci, h4, ki, h5, ei, h6, ti}. In this Mul. Keyword Pattern Matching using Pos. Encoded Pattern Lattices 283 case, the position encoding of the set of patterns P = {pack, packet} and of →• →• “packet” happens to be the same, i.e. P = packet. Given any set of patterns, we can now constitute a formal context K # along the following lines. Regard the words in the set of patterns as a set of objects. Let the position-symbol pairs of the position encoding of the set of patterns serve as attributes of these objects: a given word has as its attributes all the position-symbol pairs that make up its position-encoding. As an example, consider the set of patterns P = {abc, aabc, abcc}. Table 1 shows the cross table that represents the position encoded formal context derived from →• →• →• P . This context can be denoted by hP, P , I P i, where I P is the incidence re- lation between objects and attributes depicted in the cross table. The formal concept lattice to be derived from such a context will be called a Position En- →• →• →• coded Pattern Lattice (PEPL), denoted by P (hP, P , I P i) or, more concisely, →• by P . The cover graph of the underlying PEPL is shown in Figure 2. h1, ai 1 h4, ci h3, cih2, bi →• →• h1, ai h2, ai h2, bi h3, bi h3, ci h4, ci hP, P , I P i 2 3 abc × × × h2, aih3, bi abc aabc × × × 4 5 abcc × × × × aabc abcc 6 Fig. 1: Position encoded context for P = {abc, aabc, abcc}. Fig. 2: Cover graph of PEPL for P = {abc, aabc, abcc}. If a search of text s is currently at position s[t], and it is found that s[t + n] = a, then attribute hn, ai is said to positively check against s at t. It is evident that the intent of a PEPL concept has the following property: If all the attributes in the intent have been positively checked against a search text s at position t, then the (one or more) words that are own objects of the concept match the text, starting at position t. This is clearly the case for concepts 3, 4 and 5 with own objects “abc”, “aabc” and “abcc” respectively. 3 PEPL-based Matching Using PMatch Algorithm 1 described below is based on the insights of the previous section. Its →• →• →• top level procedure is called PMatch, which takes as input a PEPL P (hP, P , I P i) →• (or simply P ) and a text, s. It then finds in s all match occurrences of words 284 Fritz J. Venter, Bruce W. Watson and Derrick G. Kourie in P , recording them in MO. The algorithm is articulated in Dijkstra’s guarded command language (GCL), widely used for its conciseness and precision [5]. The definition of PMatch assumes constant minlength(P) as the length of the short- →• est keyword in P . To avoid notational clutter, P , s and MO are assumed to be globally accessible to all procedures. A special symbol, nil, is used to designate a non-existent concept, specifically the parent of the top concept. PMatch calls matchIntent for each character in s where a match could possibly start (i.e. the tail is ignored). The condition in the associated for-loop is intended to signify that these probes are from left to right. In each call the intent of the top of the lattice and its non-existent parent, nil, are used as parameters. matchIntent takes a string position t, and two concepts, c and p, as parameters. It is assumed that c is a child of p and the set difference, ∆, between their intents is computed. The special case of ⊤, which has no parent, is catered for. A loop checks whether all the attributes in ∆ indicate positional matches in the text s as offset by the current search position, t—i.e. the loop removes from ∆ all attributes of the form hi, αi such that s[t + i] = α. If this reduces ∆ to the empty set, then a match occurrence is considered to have been found for each own object at c. Moreover, match(c, t) can be called to investigate whether further match occurrences at t can be inferred by considering c’s children. match(p, t), in turn, simply sweeps through the children of p, recursively invoking matchIntent in each case. Algorithm 1 PEPL Based Matching →• proc PMatch( P , s) MO, j : = ∅, minlength(P ); { Traverse target string s from left to right } for (t ∈ [0, |s| − j + 1)) → matchIntent(t, ⊤, nil) rof corp{ post : MO is the set of match occurrences of P in s } proc matchIntent(t, c, p) if (p = nil) → ∆ : = intent(c) [] (p 6= nil) → ∆ : = intent(c) \ intent(p) f i; do (∃hi, αi : hi, αi ∈ ∆ : (s[t + i − 1] = α)) → ∆ : = ∆ \ {hi, αi} od; if (∆ = ∅) → MO : = MO ∪ ownobj(c) × {t}; for all c′ ∈ children(c) → matchIntent(t, c′ , c) rof [] (∆ 6= ∅) → skip fi corp Mul. Keyword Pattern Matching using Pos. Encoded Pattern Lattices 285 To illustrate how Algorithm 1 works, consider the keywords to match P = →• →• {abc, aabc, abcc} and the target s = aaabcdabccd. The formal context hP, P , I P i →• is given in Fig. 1 and the cover graph for the corresponding PEPL, P , is in Fig. 2. For convenience, the intents and own object sets of each concept are made explicit in Table 1. Table 2 provides a trace summary of calls to matchIntent. The first Id of c intent(c) ownobj(c) 1 = ⊤ {h1, ai} 2 {h1, ai, h4, ci} 3 {h1, ai, h3, ci, h2, bi} abc 4 {h1, ai, h4, ci, h3, bi, h2, ai} aabc 5 {h1, ai, h4, ci, h3, ci, h2, bi} abcc 6 = ⊥ {h1, ai, h4, ci, h3, bi, h3, ci, h2, bi, h2, ai} Table 1: Details of concepts of the PEPL in Fig. 2 column shows t, the offset into s from which matching positions are calculated. The second and third columns show the lattice concept visited and its concept participating in the call to matchIntent. The fourth column marked ∆ gives the set difference between the intent of the child and parent concept. A column per symbol in the string aaabcdabccd then follows. The last column gives the own object set to be to be used to update MO when a match has been found. Note that since minlength(P ) = 3 and |s| = 11, the trace ranges over t ∈ [0, 9). Each t c p ∆ a a a b c d a b c c d ownobj(c) 0 1 nil {h1,ai} T ∅ 0 2 1 {h4,ci} F 0 3 1 {h2,bi, h3,ci} F 1 1 nil {h1,ai} T ∅ 1 2 1 {h4,ci} T ∅ 1 4 2 {h2,ai, h3,bi} T T {aabc} 1 3 1 {h2,bi, h3,ci} F 2 1 nil {h1,ai} T 2 2 1 {h4,ci} F 2 3 1 {h2,bi, h3,ci} T T {abc} 2 5 3 {h4,ci} F 3 1 nil {h1,ai} F 4 1 nil {h1,ai} F 5 1 nil {h1,ai} F 6 1 nil {h1,ai} T ∅ 6 2 1 {h4,ci} T ∅ 6 4 2 {h2,ai, h3,bi} F 6 5 2 {h2,bi, h3,ci} T T {abcc} 6 3 1 {h2,bi, h3,ci} T T {abc} 7 1 nil {h1,ai} F 8 1 nil {h1,ai} F Table 2: Algorithm 1 trace: matching {abc, aabc, abcc} in aaabcdabccd row is a matching step of the algorithm—i.e. every row represents a call of the function matchIntent . As an example, the first row indicates that the matching position t = 0 and the attribute set to match is ∆ = {h1,ai}. The first (and only) 286 Fritz J. Venter, Bruce W. Watson and Derrick G. Kourie element of the set is hi,αi = h1,ai. This means that position t + i = 1 is checked for the symbol α = a, which is indeed the case as indicated by the “T” (for the boolean value true) shown in the first column for the target string. All “T” entries in the table indicate that attributes in ∆ have been successfully matched in the do-loop of matchIntent. Once ∆ has been reduced to ∅, MO has to be updated. Of course, if the concept has no own object—as is the case for the top concept marked 1—then nothing is added to MO (i.e. ownobj(c)×{t} = ∅). Subsequent calls to match without updating t, recursively deal with children concepts of the one currently under test. The second row of the table therefore logs the results the call to matchIntent made via match in respect of concept 2, the leftmost child of concept 1. In this case, the intent difference set is ∆ = {h4,ci}, and since ∄hi, αi : {h4,ci} : (s[t + i − 1] = α), (or, more explicitly, s[0 + 4 − 1] = b and not c) matchIntent cannot reduce ∆ to ∅. This is indicated by “F” (for false) as an entry in the relevant column of the table. Control now returns to match, where the next child of concept 1, namely concept 3, is considered. Further rows of the table illustrate the execution steps of Algorithm 1 for the rest of the target string. PMatch eliminates sets of words from P that do not match in s without ever backing up in s, i.e. t is monotonically increasing. In this sense PMatch is an online algorithm, similar to the AC algorithm. However, PMatch sometimes revisits symbols in s. Such revisits are reflected by the multiple entries in various columns representing symbols in aaabcdabccd in Table 2. The execution complexity of the matching process per position checked in s is bounded by the size of the PEPL. Table 2 shows how all concepts are visited when t = 6. An (rather conservative) upper bound of the complexity of Algorithm 1 is →• therefore (| P |× |s|). The advanced AC algorithm is of course more efficient than this. Not only does it check every symbol in s exactly once; it also avoids the application of the expensive set difference operator that is applied in matchIntent of Algorithm 1. Instead, the advanced AC simply makes an automaton transition and considers whether an accepting state has been entered. In the upcoming sections, we refine our algorithm to arrive at a PEPL-based algorithm with similar performance characteristics to advanced AC. 4 PEPL Automata For PEPL based matching to achieve the same order-of-magnitude performance as the advanced AC algorithm, this section defines a structure called a PEPL Automaton. Firstly, the position encoded formal context for the set of keywords P is aug- mented. This augmented context has additional entries to reflect information about each keyword p ∈ P whose first symbol matches the symbol at the mth index of some other keyword, where m > 0. Mul. Keyword Pattern Matching using Pos. Encoded Pattern Lattices 287 Definition 2 (Augmentation operator). For two strings p and y we define the operator denoted # as ( {(p)y} if y 6= ε ∧ p 6= ε p#y = ∅ otherwise Definition 3 (Augmentation of a string). We define the augmentation of string y with respect to string x as # [ hx, yi = ( p, s, r, t : ((x = p · s ∧ y = s · r ∧ s 6= ε) ∨ (x = p · y · t)) : p#y ) Thus, for each proper suffix3 s of x that is also prefix of y, we compute the singleton set p#y (but possibly the empty set) and add all such singleton sets into one big set. Note that there may be several such sets. For example, if x = aa and y = aaaa then the following decompositions of x are relevant : x = ε · aa; x = a · a; so that hx, yi# = ε#aaaa ∪ a#aaaa = ∅ ∪ {(a)aaaa}} = {(a)aaaa} Definition 4 (String-augmentation of a language). We define the string- augmentation of language V with respect to string w as follows. hV, wi# = ( v : v ∈ V : hv, wi# ) [ Then # h{x, y}, yi = {(a)aaaa, (aa)aaaa, (aaa)aaaa} # h{x, y}, xi = {(a)aa, (aa)aa, (aaa)aa} # h{x}, yi = {(a)aaaa} h{y}, xi# = {(a)aa, (aa)aa, (aaa)aa} Definition 5. For a set of patterns P we define the augmented patterns as # [ P# = P ∪ ( p : p ∈ P : hP \ {p}, pi ) Thus, # # {x, y}# = {x, y} ∪ h{x}, yi ∪ h{y}, xi = {aa, aaaa} ∪ {(a)aaaa}{(a)aa, (aa)aa, (aaa)aa} = {aa, aaaa, (a)aaaa, (a)aa, (aa)aa, (aaa)aa} 3 By proper suffix, we mean that the empty string is not taken as a suffix. 288 Fritz J. Venter, Bruce W. Watson and Derrick G. Kourie K# h1, ai h2, ai h2, bi h3, ci h4, ci h3, bi h5, ci abc × × × aabc × × × × abcc × × × × (a)abc × × × × (a)abcc × × × × × Table 3: Context derived by augmenting P = {abc, aabc, abcc} Given set of patterns P , we can constitute a formal context denoted K # for a PEPL using objects from the augmented set of patterns P # and attributes from →• P #. Example 1. As an example, consider again the set of patterns P = {abc, aabc, abcc} augmented with the set of patterns P # derived as follows: # # # P # = P ∪ h{aabc, abcc}, abci ∪ h{abc, abcc}, aabci ∪ h{abc, aabc}, abcci = P ∪ {(a)abc} ∪ ∅ ∪ {(a)abcc} = P ∪ {(a)abc, (a)abcc} = {abc, aabc, abcc, (a)abc, (a)abcc} Table 3 depicts the context derived from the set of objects P # and their corre- sponding position encoding as attributes. •→ Algorithm 2 discussed in Section 5 uses P to match a set of keywords P against •→ a target string. The algorithm relies on a second pre-processing step after P •→ has been generated from its context. This step traverses P to create a special automaton, called a “Position Encoded Pattern Lattice Automaton” or simply a PEPL Automaton. It is defined as follows. •→ Definition 6 (PEPL Automaton). Given P , the PEPL that has been de- rived from set of patterns P # , the associated PEPL automaton is a five-tuple •→ •→ •→ •→ •→ hQ P , V P , δ P , q0 P , F P i such that: •→ •→ – Q P ⊆ {0} ∪ ( q, c : q = id(c) ∧ c ∈ P : {q}) is regarded as the automa- S ton’s set of states, where id(c) simply gives a unique numerical identifier for concept c. •→ •→ – V P = P is regarded as the automaton’s alphabet, indicating position-symbol pairs. •→ •→ •→ •→ – δ P : Q P × V P 9 Q P × |P # Max | is the automaton’s transition func- tion. The mapping is generally determined by the recursive relationship: Mul. Keyword Pattern Matching using Pos. Encoded Pattern Lattices 289 h5, ci h3, bi h2, ai h4, ci h1, ai h3, ci h2, bi h5, ci h3, bi h2, ai h4, ci h1, ai h3, ci h2, bi h1, ai/2 1 1 h2, bi/3 h2, ai/3 2 2 h3, ai/2 h3, bi/1 3 7 h4, ci/5 3 h3, bi/4 7 h3, ci/4 h5, ci/6 h3, ai/3 h4, ci/5 4 6 4 6 5 5 aabc (a)abc (a)abcc abcc abc aabc (a)abc (a)abcc abcc abc (a) Cover Graph for the PEPL for the con- (b) PEPL Automaton superimposed on the text of P # Cover Graph Fig. 3: PEPL derived from P # •→ δ P (id(c), hi, αi) = id(c′ )/i′ where variables c′ and i′ are defined as follows:    hc, ii if A(c, i, α) ∧ B(c, i, α)  h⊤(h1, αi), 2i if A(c, i, α) ∧ ¬B(c, i, α) ∧ ⊤(h1, αi) 6= ⊥ hc′ , i′ i =    h⊤, 1i if A(c, i, α) ∧ ¬B(c, i, α) ∧ ⊤(h1, αi) = ⊥ hinf({c, ⊤(hi, αi)}), i + 1i otherwise  where A(c, i, α) ≡ inf({c, ⊤(hi, αi)}) = ⊥ and B(c, i, α) ≡ hi − 1, αi ∈ intent(c) The the first level recursion starts off with c = ⊤. The notation q/x means that when a transition to state q is made, the automaton produces the addi- tional value x. •→ – q0 P = ⊤ is the automaton’s start state, which is also the top concept of the PEPL. •→ •→ – F P = ( q, c : q = id(c) ∧ c ∈ Q P ∧ |ownobj(c)| > 0 : {q}) is the S automaton’s set of final states. The above definition embeds sufficient information to derive algorithmically a DFA whose transition diagram can be superimposed on the cover graph of the PEPL. 290 Fritz J. Venter, Bruce W. Watson and Derrick G. Kourie It is assumed below that an function, getFA, is available which delivers a PEPL •→ •→ automaton M P when provided with a PEPL. As an example, getFA( P ) will •→ return M P , the PEPL-Automaton (partially) shown in Fig. 3b, superimposed •→ over the cover graph for P . Note that in order to avoid clutter, a number of arcs have not been shown in Fig. 3b. For example, the many of transitions to ⊤ have been left out. 5 Matching Using a PEPL-Automaton Viewed as a DFA, a PEPL automaton could be used to test whether a given sequence of its alphabet are in the regular set of patterns that it describes. For example, it can easily be seen in Figure 3 that, starting from ⊤, successive transitions on elements of the string hh1, ai, h2, bi, h3, cii lead to the final state 6, affirming that this sequence is indeed part of the set of patterns described by the automaton, and and since abc is an own object of 6, affirming that abc is in the original set of patterns, P . However, the PEPL is not primarily intended to be used in this way. Instead, the PEPL is used in Algorithm 2 to find all match opportunities in P in a text s. The algorithm’s do processes symbols of s, updating variables c and i to keep track of partial matches in that part of s already processed. This is expressed as loop invariant Inv(c, i, t) ≡ – M O contains all matches in s[0,t−i+1) . (These are the matches already pro- cessed.) – And s[t−i+1,t) matches the first i − 1 characters of all patterns in extent(c). (These are the partial matches in progress.) To illustrate matching as executed by Algorithm 2, consider the steps logged in Table 4 when matching the set of patterns abc,aabc,abcc against the target string aaabcdabccd. The first five entries of each row in this table shows the values of variables t, s[t], c, i as they have been updated as a result of the statements in the body of the main loop in Algorithm 2. The next set of entries in the respective row are all empty except for the position where s[t] has been matched. At such a position a “T” or “F” is shown depending on the result of the match. The next entry in the row shows the size of the intent of c. The last entry in this row gives the patterns matched at the step represented by the row. Algorithm 2 PEPL Automaton Based Matching •→ proc PAutMatch( P , s) MO : = ∅; •→ •→ M P : = getFA( P ); •→ •→ hc, ii, t : = hq0 P , 1i, 0; { Recall that q0 P = ⊤ } Mul. Keyword Pattern Matching using Pos. Encoded Pattern Lattices 291 { invariant: Inv(c, i, t) } •→ do (t < |s|) → hc, ii, t : = δ P (c, hi, s[t]i), t + 1; if (i = |intent(c)|) → MO : = MO ∪ ({t} × ownobj(c)) [] (i 6= |intent(c)|) → skip fi od corp { post : MO is the set of match occurrences of P in s } As an example we present an explanation of the steps up to the first matched patterns being recorded. Consider the first row of Table 4. This row represents the first step of the matching process. After this step, a transition is made from the start state (c = ⊤) to the same state (c′ = ⊤). The transition is due to the •→ transition function δ P (c, hi, s[t]i) returning the value ⊤/2 for the offset variable i = 1 and symbol s[t = 0] = a. The last entry in the row is empty as the top node does not contain any own objects. The next row shows the transition •→ δ P (⊤, h2, ai) = 3/3 due to value of the variable t being incremented from its value in the previous step. This process continues until the patterns aabc, (a)abc are recorded when the variables i and t are both (coincidentally) equal to 4 and state 4 is reached in the 5th row of the table. Recall that a match is recorded for a state that is represented by a concept such that the size of such concept’s intent (as shown in the second last entry) is the same as the offset variable i. co io t s[t] c i a a a b c d a b c c d |intent(c′ )| {Patterns Matched} ⊤ 1 0 a ⊤ 2 T 1 ∅ ⊤ 2 1 a 3 3 T 2 ∅ 3 3 2 a 3 3 T 4 ∅ 3 3 3 b 3 4 T 4 ∅ 3 4 4 c 4 5 T 4 {aabc,(a)abc} 4 5 5 d ⊤ 1 F 5 ∅ ⊤ 1 6 a ⊤ 2 T 5 ∅ ⊤ 2 7 b 7 3 T 3 ∅ 7 3 8 c 7 4 T 3 {abc} 7 4 9 c 6 5 T 4 {abcc} 6 5 10 d ⊤ 1 F 5 ∅ Table 4: Positions visited in aaabcdabccd when matching abc,aabc,abcc We have arrived at a particularly efficient algorithm, thanks to two observations •→ about PAutMatch . Firstly, transitions in the PEPL-Automaton (δ L ) can be done in constant time using a lookup table. Secondly, the if statement can be made in constant time, and consists of simple integer arithmetic to advance through the lattice and target s, and an update of MO (only if a match has been found). The latter can be done using a precomputed lookup table, as is done in the advanced AC algorithm. These two characteristics are also found in the advanced AC algorithm, and is unavoidable in pattern matching algorithms, giving us the same exact (worst- and best-case) running time of |s|. 292 Fritz J. Venter, Bruce W. Watson and Derrick G. Kourie The example in Table 4 illustrates how, in contrast to the example in Table 2, each symbol in the string aaabcdabccd is visited exactly once to match the key- words {abc, aabc, abcc}. 6 Conclusion The application of FCA in pattern matching was first introduced in [9]. There, two-dimensional pattern information was encoded into a concept lattice which was subsequently used as the basis for traversing a two-dimensional space in search of a specific pattern. Here, by contrast, common information about multi- ple keywords is encoded into a PEPL, to form the basis for discovering positional information about matching instances of those keywords in a linearly streamed text. The two new pattern matching algorithms are shown to have theoretical running-time comparable to the Aho-Corasick family of algorithms. Ongoing work involves benchmarking the new algorithms against the Aho-Corasick and other multiple keyword pattern matching algorithms. We are also find- ing ways in which FCA can be effectively used in other stringology contexts [6]. References 1. Alfred V. Aho and Margaret J. Corasick. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6):333–340, June 1975. 2. Loek Cleophas, Bruce W. Watson, and Gerard Zwaan. A new taxonomy of sublin- ear right-to-left scanning keyword pattern matching algorithms. Science of Com- puter Programming, 75:1095–1112, 2010. 3. Maxime A. Crochemore and Wojciech Rytter. Text Algorithms. Oxford University Press, 1994. 4. Maxime A. Crochemore and Wojciech Rytter. Jewels of Stringology. World Scien- tific Publishing Company, 2003. 5. Derrick G. Kourie and Bruce W. Watson. The Correctness-by-Construction Ap- proach to Programming. Springer Verlag, 2012. 6. Derrick G. Kourie, Bruce W. Watson, Loek Cleophas, and Fritz Venter. Failure deterministic finite automata. In Jan Holub, editor, Proceedings of the Prague Stringology Conference (PSC). Czech Technical University, August 2012. 7. William F. Smyth. Computing Patterns in Strings. Addison-Wesley, 2003. 8. George Varghese. Network Algorithmics: An Interdisciplinary Approach to Design- ing Fast Networked Devices. Morgan Kaufmann, 2004. 9. Fritz Venter, Derrick G. Kourie, and Bruce W. Watson. FCA-based two dimen- sional pattern matching. In Proceedings of the 7th International Conference on Formal Concept Analysis, 2009. 10. Bruce W. Watson. Taxonomies and Toolkits of Regular Language Algorithms. PhD thesis, Eindhoven University of Technology, September 1995.