=Paper= {{Paper |id=Vol-1466/paper07 |storemode=property |title=An Aho-Corasick Based Assessment of Algorithms Generating Failure Deterministic Finite Automata |pdfUrl=https://ceur-ws.org/Vol-1466/paper07.pdf |volume=Vol-1466 |dblpUrl=https://dblp.org/rec/conf/cla/NxumaloKCW15 }} ==An Aho-Corasick Based Assessment of Algorithms Generating Failure Deterministic Finite Automata== https://ceur-ws.org/Vol-1466/paper07.pdf
            An Aho-Corasick Based Assessment of
         Algorithms Generating Failure Deterministic
                     Finite Automata

     Madoda Nxumalo1 , Derrick G. Kourie2,3 , Loek Cleophas2,4 , and Bruce W.
                                  Watson2,3
                   1
                  Computer Science, Pretoria University, South Africa
     2
      FASTAR Research, Information Science, Stellenbosch University, South Africa
   3
     Centre for Artificial Intelligence Research, CSIR Meraka Institute, South Africa
 4
    Foundations of Language Processing, Computer Science, Umeå University, Sweden
       {madoda,derrick,loek,bruce}@fastar.org — http://www.fastar.org


          Abstract. The Aho-Corasick algorithm derives a failure deterministic
          finite automaton for finding matches of a finite set of keywords in a
          text. It has the minimum number of transitions needed for this task.
          The DFA-Homomorphic Algorithm (DHA) algorithm is more general,
          deriving from an arbitrary complete deterministic finite automaton a
          language-equivalent failure deterministic finite automaton. DHA takes
          formal concepts of a lattice as input. This lattice is built from a state/out-
          transition formal context that is derived from the complete deterministic
          finite automaton. In this paper, three general variants of the abstract
          DHA are benchmarked against the specialised Aho-Corasick algorithm.
          It is shown that when heuristics for these variants are suitably chosen,
          the minimality attained by the Aho-Corasick algorithm can be closely
          approximated. A published non-lattice-based algorithm is also shown to
          perform well in experiments.


          Keywords: Failure deterministic finite automaton, Aho-Corasick algo-
          rithm



 1       Introduction

 A deterministic finite automaton (DFA) defines a set of strings, called its lan-
 guage. It is represented as a graph with symbol-labelled transitions between
 states. There are efficient algorithms to determine whether an input string be-
 longs to the DFA’s language. An approach to reducing DFA memory require-
 ments is the use of so-called failure deterministic finite automata (FDFAs, also
 defined in Section 2). An FDFA can be used to define the same language as a
 DFA with a reduced number of transitions and hence, reduced space required
 to store transition information. Essentially, this is achieved by replacing certain
 DFA state transitions by so-called failure transitions. A small additional com-
 putational cost is incurred in recognising whether given strings are part of the

c paper author(s), 2015. Published in Sadok Ben Yahia, Jan Konecny (Eds.): CLA
 2015, pp. 87–98, ISBN 978–2–9544948–0–7, Blaise Pascal University, LIMOS
 laboratory, Clermont-Ferrand, 2015. Copying permitted only for private and
 academic purposes.
88       Madoda Nxumalo et al.


language. By using a trie (the term used for a DFA graph that is a tree) and
failure transitions, Aho and Corasick [1] generalised the so-called KMP algo-
rithm [7] to multi-keyword pattern matching. There are two versions of their
algorithm: the so-called optimal one, which we call aco, and a failure one, acf.
aco builds a minimal DFA to find all matches of a given keyword set in a text.
acf builds, in a first step, a trie using all prefixes of words from the set. Each
state therefore represents a string which is the prefix of a keyword. Moreover,
the string of a state is spelled out by transitions that connect the start state
of the trie to that state. In a second step, aco then inserts a failure transition
from each state of the trie to some other state. To briefly illustrate the nature
failure transitions, suppose p is a state in the trie representing the string and
keyword she and suppose q is another state representing the prefix string he of
the keyword hers. Then a transition from p to q would indicate that he is the
longest suffix of she that matches a prefix of some other keyword. With appro-
priate further elaboration (details may be found in [1, 11]) the output of acf is an
FDFA that is language equivalent to the aco one It can also be shown that acf
is minimal in the following sense. No other FDFA that is language-equivalent to
aco can have fewer transitions than acf

An algorithm proposed in [8], called the DFA-Homomorphic Algorithm (DHA),
constructs from any complete DFA a language-equivalent FDFA. As predicted
by the theory [2], the resulting FDFA is not necessarily minimal. The abstract
version of the algorithm described in [8] involves the construction of a concept
lattice as explained in Section 2. The original version of the algorithm leaves a
number of decisions as nondeterministic choices. It also strives for minimality by
following a “greedy” heuristic in respect of information embedded in the lattice.
However, concrete versions of DHA and the effect of this heuristic have not been
tested. Here we propose various alternative concrete versions of the algorithm for
different deterministic choices.The acf FDFAs provide a benchmark for assessing
the performance of these concrete variants of DHA. An aco DFA is used as
input to each of several variants of the DHA and the resulting DHA-FDFAs are
compared against the acf version.

An alternative approach to constructing FDFAs from an arbitrary DFA has been
proposed by Kumar et al [10]5 . Their technique is based on finding the maxi-
mal spanning tree [9] of a suitably weighted nondirected graph that reflects the
structure of the underlying DFA. Two algorithms were proposed. One is based
on a maximal spanning tree, and the other on a redefined maximal spanning
tree. Further details about their algorithms may be found in their original pub-
lication. The original maximal spanning tree based algorithm was included in
our comparative study for its performance assessment using acf as the bench-
mark.

5
     Their research uses different terminology to that given above. They refer to an
     FDFA as a delayed-input DFA (abbreviated to D2 F A) and failure transitions are
     called default transitions.
    An Aho-Corasick Based Assessment of Algorithms Generating Failure DFAs         89


Various other FDFA related research has been conducted for certain limited
contexts. See [5] for an overview. A recent example is discussed in [4], where ideas
from [8] were used to modify construction of a so-called factor oracle automaton
to use failure transitions, saving up to 9% of symbol transitions.
Section 2 provides the formal preliminaries relevant to this research. In Sec-
tion 3 we introduce the deterministic variants of the DHA that are subsequently
benchmarked. Section 4 outlines the investigation’s experimental environment,
the data generated, the methods of assessment and the results. Section 5 draws
conclusions and points to further research work currently underway.


2     Preliminaries

An alphabet is a set of symbols, Σ, of size |Σ|, and Σ ∗ denotes the set of all
sequences over this alphabet, including the empty sequence, denoted by . A
string (or word), s, is an element of Σ ∗ and its length is denoted |s|. Note that
|| = 0. The concatenation of strings p and q is represented as pq. If s = pqw
then q is a substring of s, p and pq are prefixes of s and q and qw are suffixes
of s. Moreover, q is a proper substring iff ¬((p = ) ∨ (w = )). Similarly, pq is a
proper prefix iff w 6=  and qw is a proper suffix iff p 6= .
A deterministic finite automata (DFA) is a quintuple, D = (Q, Σ, δ, F, qs ), where
Q is a finite set of states; Σ is an alphabet; δ ∈ Q × Σ 9 Q is the (possibly
partial) symbol transition function mapping symbol/state pairs to states; qs ∈ Q
is the start state; and F ⊆ Q is a set of final states. If δ is a total function, then
the DFA is called complete. In the case of a complete DFA, the extension of δ is
defined as δ ∗ ∈ Q × Σ ∗ −→ Q where δ ∗ (p, ) = p and if δ(p, a) = q and w ∈ Σ ∗ ,
then δ ∗ (p, aw) = δ ∗ (q, w). A finite string, w, is said to be accepted by the DFA
iff δ ∗ (qs , w) ∈ F . The language of a DFA is the set of accepted strings.
A failure DFA (FDFA) is a six-tuple, (Q, Σ, δ, f, F, qs ), where D = (Q, Σ, δ, F, qs )
is a (not necessarily complete) DFA and f ∈ Q 9 Q is a (possibly partial) failure
transition function. For all a ∈ Σ and p ∈ Q, the functions δ and f are related in
the following way: f(p) = q for some q ∈ Q if δ(p, a) is not defined. The extension
of δ in an FDFA context is similar to its DFA version in that δ ∗ ∈ Q × Σ ∗ −→ Q
and δ ∗ (p, ) = p. However:
                           ∗
                             δ (q, w) if δ(p, a) = q
            δ ∗ (p, aw) =
                           δ ∗ (q, aw) if δ(p, a) is not defined and f(p) = q

An FDFA is said to accept string w ∈ Σ ∗ iff δ ∗ (qs , w) ∈ F . An FDFA’s language
is its set of accepted strings. It can be shown that every complete DFA has a
language-equivalent FDFA and vice-versa. When constructing an FDFA from a
DFA, care must be taken to avoid so-called divergent cycles of failure transitions
because they lead to an infinite sequence of failure traversals in string processing
algorithms. (Details are provided in [8].)
90          Madoda Nxumalo et al.


Subfigure 1a depicts a complete DFA for which Q = {q1 , q2 , q3 } and Σ = {a, b, c}.
Its start state is q1 and q3 is the only final state. Its symbol transitions are de-
picted as solid arrows between states. Subfigure 1b shows a language-equivalent
FDFA where dashed arrows indicate the failure transitions. Note, for example
that ab is in the language of both automata. In the DFA case, δ ∗ (q1 , ab) =
δ ∗ (q1 , b) = δ ∗ (q3 , ) In the FDFA case, δ ∗ (q1 , ab) = δ ∗ (q1 , b) = δ ∗ (q2 , b) =
δ ∗ (q3 , b) = δ ∗ (q3 , ).


                    a                          b                a                        b
                                 b

                           c            b                             f         f
     start         q1            q2           q3   start        q1        q2         q3
                           a
                                 c                                         c
                                a,c                                        f

         (a) D = (Q, Σ, δ, q1 , {q3 })                (b) F = (Q, Σ, δ, f, q1 , {q3 })




             ha, q1 i hb, q3 i hc, q2 i hc, q1 i
       q1      X        X        X
       q2      X        X        X
       q3      X        X                 X                (d) State/out-transition lattice

     (c) State/out-transition context

                  Fig. 1: Example automata and state/out-transition lattice


This text relies on standard formal concept analysis terminology and definitions.
(See, for example [6]). A so-called state/out-transition concept lattice can be
derived from any DFA. The objects of its formal context are DFA states, q ∈ Q.
Each attribute is a pair of the form ha, pi ∈ Σ × Q. A state q is deemed to
have this attribute if δ(q, a) = p, i.e. if q is the source of a transition on a to p.
Subfigure 1c is the state/out-transition context for the DFA in Subfigure 1a and
Subfigure 1d is the line diagram of the associated state/out-transition concept
lattice. The latter subfigure shows two intermediate concepts, each larger than
the bottom concept and smaller than the top, but not commensurate with one
another. The right-hand side intermediate concept depicts the fact that states
q1 and q2 (its extent) are similar in that each as a transition on symbol a to q1 ,
b to q3 and on c to q2 — i.e. the concept’s intent is {ha, q1 i, hb, q3 i, hc,q2 i}
Each concept in a state/out-transition lattice can be characterised by a certain
value, called its arc redundancy. For a concept c it is defined as ar(c) = (|int(c)|−
1)×(|ext(c)|−1), where ext(c) and int(c) denote the extent and intent of concept
    An Aho-Corasick Based Assessment of Algorithms Generating Failure DFAs            91


c respectively. The arc redundancy of a concept represents the number of arcs
that may be saved by doing the following:
 1. singling out one of the states in the concept’s extent;
 2. at all the remaining states in the concept’s extent, removing all out-transitions
    mentioned in the concept’s intent;
 3. inserting a failure arc from each of the states in step 2 to the singled out
    state in step 1.
The expression, |ext(c)| − 1 represents the number of states in step 2 above.
At each such state, |int(c)| symbol transitions are removed and a failure arc is
inserted. Thus, |int(c)| − 1 is the total number of transitions saved at each of
|ext(c)| − 1 states so that ar(c) is indeed the total number of arcs saved by the
above transformation.
The positive arc redundancy (PAR) set consists of all concepts whose arc redun-
dancy is greater than zero.


3     The DHA Variants

For the DFA-Homomorphic Algorithm (DHA) to convert a DFA into a language
equivalent FDFA, a three stage transformation of the DFA is undertaken. Ini-
tially, the DFA is represented as a state/out-transition context. From the derived
concept lattices, the PAR set is extracted to serve as input for the DHA.
The basic DHA proposed in [8] is outlined in Algorithm 1. The variable O is used
to keep track of states that are not the source of any failure transitions. This
is to ensure that a state is never the source of more than one failure transition.
Initially all states qualify. A concept c is selected and removed from PAR set,
so that c is no longer available in subsequent iterations. The initial version of
DHA proposed specifically selecting a concept, c, with maximum arc redundancy.
The specification given here leaves open how the choice will be made.
From c’s extent, one of the states, t, is chosen to be a failure transition target
state. DHA gives no specific criteria for which state in ext(c) to choose. The
remaining set of states in ext(c) is denoted by ext0 (c). Then, for each state s in
ext0 (c) that qualifies to be the source of a failure transition (i.e. that is also in O)
all transitions in int(c) are removed from s and a failure transition is installed
from s to t. Because state s has become a failure transition source state whose
target state is t, it may no longer be the source of any other failure transition,
and so is removed from O. These steps are repeated until it is no longer possible
to install any more failure transitions. It should be noted that in this particular
formulation of the abstract algorithm the PAR set is not recomputed to reflect
changes in arc redundancy as the DFA is progressively transformed into an
FDFA. This does not affect the correctness of the algorithm, but may affect its
optimality. Investigating such effects is not within the scope of this study. The
third and fifth lines of Algorithm 1, namely
92       Madoda Nxumalo et al.


      c := selectAConcept(PAR) and
      t := getAnyState(ext(c)) respectively.
are non-specific in the original formulation of DHA.

Three variants of the algorithm are proposed with respect to the third line. For
convenience we represent each variant by a conjunct on the right hand side of an
assignment where c is the assignment target. This, of course, is a slight abuse of
notation since c is not the outcome of a logical operation, but the selection of a
concept from the PAR set according to a criterion represented by h mar(PAR),
h me(PAR) or h mi(PAR). Each selection option a different greedy heuristic
for choosing concept c from the PAR set. By greedy we mean that an element
from the set is selected, based on some maximal or minimal feature, without
regard to possible opportunities lost in the forthcoming iterations by making
these selections. In addition to these heuristics, a single heuristic is proposed for
the fifth line relating to choosing the target state, t, for the failure transitions.
These choices are illustrated as colour-coded assignment statements shown in
the skeleton Algorithm 2 and are now briefly explained. The rationale for these
heuristics will be discussed a section below.
Algorithm 1                                    Algorithm 2
O := Q; PAR := {c | ar(c) > 0};                O := Q; PAR := {c | ar(c) > 0};
do ((O 6= ∅) ∧ (PAR 6= ∅)) →                   do ((O 6= ∅) ∧ (PAR 6= ∅)) →
   c := SelectConcept(PAR);                       c := h mar(PAR) ∨ h me(PAR) ∨ h mi(PAR)
   PAR : = PAR\{c};                               PAR : = PAR\{c}
   t := getAnyState(ext(c));                      t := ClosestT oRoot(c)
   ext0 (c) := ext(c)\{t};                        ext0 (c) := ext(c)\{t};
   for each (s ∈ ext0 (c) ∩ O) →                  for each (s ∈ ext0 (c) ∩ O) →
        if a failure cycle is not created →            if a failure cycle is not created →
            for each ((a, r) ∈ int(c)) →                   for each ((a, r) ∈ int(c)) →
                  δ : = δ \ {hs, a, ri}                          δ : = δ \ {hs, a, ri}
            rof ;                                          rof ;
            f(s) : = t;                                    f(s) : = t;
            O : = O\{s} ;                                  O : = O\{s} ;
        fi                                             fi
   rof                                            rof
od                                             od




The heuristics for choosing concept c from the PAR set in each iteration are as
follows: The h mar heuristic: c is a PAR concept with a maximum arc redun-
dancy. The h mi heuristic: c is a PAR concept with a maximum intent size. The
h me heuristic: c is a PAR concept with a minimum extent size. Once one of
these heuristics has been applied, the so-called ClosestToRoot heuristic is used
to select a state t in ext(c) to become the target state of failure transitions from
each of the remaining states in ext(c). The heuristic means that t is selected as
the state in ext(c) that is closest6 to aco’s start state. Transition modifications
are subsequently made on the FDFA produced to date, provided that a divergent
failure cycle is not produced.

6
     Since a trie has no cycles, the notion of “closest” here simply means a state with the
     shortest path from the start state to that state.
    An Aho-Corasick Based Assessment of Algorithms Generating Failure DFAs          93


4     The Experiment

The experiments were conducted on an Intel i5 dual core CPU machine, running
Linux Ubuntu 14.4. Code was written in C++ and compiled under the GCC
version 4.8.2 compiler.
It can easily be demonstrated that if there are no overlaps between proper pre-
fixes and proper suffixes of keywords in a keyword set, then the associated acf
FDFA’s failure transitions will all loop back to its start state, and out Clos-
estToRoot heuristic will behave similarly. To avoid keyword sets that lead to
such trivial acf FDFAs, the following keyword set construction algorithm was
devised.
Keywords (also referred to as patterns) are from an alphabet set of size 10. Their
lengths range from 5 to 60 characters. Keyword sets of sizes 5, 10, 15, . . . , 100
respectively are generated. For each of these 20 different set sizes, twelve alter-
native keyword sets are generated. Thus in total 12 × 20 = 240 keyword sets are
available.
To construct a keyword set of size N , an initial N random strings are generated7 .
Each such string has random content taken from the alphabet and random length
in the range 5 and 30. However, for reasons given below, only a limited number of
these N strings, say M , are directly inserted into the keyword set. The set is then
incrementally grown to the desired size, N , by repeating the following:
      Select a prefix of random length, say p, from a randomly selected string
      in the current keyword set. Remove a string, say w, from the set of strings
      not yet in the keyword set. Insert either pw or wp into the keyword set.
Steps are taken to ensure that there is a reasonable representation of each of
these three differently constructed keywords in a given keyword set.
These keyword sets served as input to the SPARE-PARTS toolkit [12] to cre-
ate the associated acf FDFAs and the aco DFAs. A routine was written to ex-
tract state/out-transition contexts from the aco DFAs. These contexts were used
by the lattice construction software package known as FCART (version 0.9)[3]
supplied to us by National Research University Higher School of Economics
(Moscow, Russia). The DHA variants under test used resulting concept lattices
to generate the FDFAs. As previously mentioned, a Kumar et al [10] algorithm
was also implemented to generate FDFAs from the DFAs. These will be refer-
enced as kum FDFAs.
Figures 2 and 3 give several views of the extent to which the DHA-based and
kum FDFAs correspond with acf FDFAs. The experimental data is available
online8 . For notational convenience fijk denotes the set of failure transitions of
7
    Note that all random selections mentioned use a pseudo-random number generator.
8
    The experimental data files can be found at this URL:
    http : //www .fastar .org/wiki/index .php?title = Conference Papers#2015 .
94     Madoda Nxumalo et al.


the FDFA corresponding to k th keyword set of size 5j that was generated by
the algorithm variant i ∈ FA\{aco}, where k ∈ [1, 12], j ∈ [1, 20] and FA =
                                          i
{acf, aco, mar, mi, me, kum}. Similarly, δjk refers to symbol transition sets of the
associated FDFAs and, in this case, also the aco DFAs if i = aco. A dot notation
in the subscript is used for averages. Thus, for some i ∈ FA\{aco} we use |fij. | to
denote the average number of failure transitions in the i-type FDFAs produced
                                                         i
by the 12 keyword sets of size 5j, and similarly |δj.      | represents the average
number of symbol transitions.

                                       100




                                                                        ●       ●    ●   ●   ●   ●    ●   ●   ●   ●    ●   ●   ●    ●
                                       80




                                                   ●   ●   ●   ●    ●       ●
                    Arcs Savings (%)

                                       60




                                                   ●
                                       40




                                                       ●
                                                           ●

                                                               ●
                                                                    ●   ●   ●
                                                                                ●
                                                                                     ●                                              ●
                                                                                         ●   ●                             ●   ●
                                       20




                                                                                                 ●    ●   ●            ●
                                                                                                              ●   ●
                                               ●   mar
                                                   mi
                                                   me
                                               ●   acf
                                                   kum
                                       0




                                               0               20               40               60               80               100

                                                                                Pattern Set Size




                                                             aco         i
                                                           |δj.  | − (|δj. | + |fij. |)
                                             Fig. 2:                  aco |             × 100
                                                                    |δj.



Figure 2 shows how many more transitions aco automata require (as a percentage
of aco) compared to each of the FDFA variants. Note that data has been averaged
over the 12 keyword set samples for each of the set sizes and note that the
FDFA transitions include both symbol and failure transitions. The minimal acf
FDFAs attain an average savings of about 80% over all sample sizes and the
mi, me and kum FDFAs track this performance almost identically. Although not
clearly visible in the graph, the me heuristic shows a slight degradation for larger
set sizes, while the kum FDFAs consistently perform about 1% to 2% worse. By
way of contrast, the mar heuristic barely achieves a 50% savings for small sample
sizes, and drops below a 20% savings for a sample size of about 75, after which
there is some evidence that it might improve slightly.
The fact that the percentage transition savings of the various FDFA variants
closely correspond to that of acf does not mean that the positioning of the failure
and symbol transitions should show a one-to-one matching. The extent to which
the transitions precisely match one another is shown in Figure 3. These box-and-
                                   An Aho-Corasick Based Assessment of Algorithms Generating Failure DFAs                                                                                                                                       95


                                               mi                                                                me                                                       mi                                                                  me

No. of Inequivalent Symbol Arcs




                                                                No. of Inequivalent Symbol Arcs
                                   2.0                                                                                                                  100                                                                     100
                                                                                                  15




                                                                                                                            Matching failure arcs (%)




                                                                                                                                                                                                    Matching failure arcs (%)
                                   1.5                                                                                                                  90                                                                      90
                                                                                                                                                                                        ●
                                                                                                                                                                                                                                                           ●
                                                                                                  10
                                                                                                                                                        80                                                                      80
                                   1.0                                                                                                                                                      ●                                                                  ●


                                                                                                                                                        70                                                                      70
                                                                                                   5
                                   0.5
                                                                                                                                                        60                                                                      60
                                   0.0                                                             0
                                           5
                                          10
                                          15
                                          20
                                          25
                                          30
                                          35
                                          40
                                          45
                                          50
                                          55
                                          60
                                          65
                                          70
                                          75
                                          80
                                          85
                                          90
                                          95
                                         100




                                                                                                          5
                                                                                                         10
                                                                                                         15
                                                                                                         20
                                                                                                         25
                                                                                                         30
                                                                                                         35
                                                                                                         40
                                                                                                         45
                                                                                                         50
                                                                                                         55
                                                                                                         60
                                                                                                         65
                                                                                                         70
                                                                                                         75
                                                                                                         80
                                                                                                         85
                                                                                                         90
                                                                                                         95
                                                                                                        100




                                                                                                                                                                5
                                                                                                                                                               10
                                                                                                                                                               15
                                                                                                                                                               20
                                                                                                                                                               25
                                                                                                                                                               30
                                                                                                                                                               35
                                                                                                                                                               40
                                                                                                                                                               45
                                                                                                                                                               50
                                                                                                                                                               55
                                                                                                                                                               60
                                                                                                                                                               65
                                                                                                                                                               70
                                                                                                                                                               75
                                                                                                                                                               80
                                                                                                                                                               85
                                                                                                                                                               90
                                                                                                                                                               95
                                                                                                                                                              100




                                                                                                                                                                                                                                        5
                                                                                                                                                                                                                                       10
                                                                                                                                                                                                                                       15
                                                                                                                                                                                                                                       20
                                                                                                                                                                                                                                       25
                                                                                                                                                                                                                                       30
                                                                                                                                                                                                                                       35
                                                                                                                                                                                                                                       40
                                                                                                                                                                                                                                       45
                                                                                                                                                                                                                                       50
                                                                                                                                                                                                                                       55
                                                                                                                                                                                                                                       60
                                                                                                                                                                                                                                       65
                                                                                                                                                                                                                                       70
                                                                                                                                                                                                                                       75
                                                                                                                                                                                                                                       80
                                                                                                                                                                                                                                       85
                                                                                                                                                                                                                                       90
                                                                                                                                                                                                                                       95
                                                                                                                                                                                                                                      100
                                         Pattern Set Size                                                Pattern Set Size                                         Pattern Set Size                                                      Pattern Set Size



                                              mar                                                             kum                                                      mar                                                                   kum

                                                                                                                                                        80
No. of Inequivalent Symbol Arcs




                                                            ●   No. of Inequivalent Symbol Arcs   250
                   10000                                                                                                                                                                                                        25




                                                                                                                            Matching failure arcs (%)




                                                                                                                                                                                                    Matching failure arcs (%)
                                  8000                                                            200                                                   60                                                                                          ●
                                                                                                                                                                                                                                                           ●
                                                                                                                                                                                                                                20
                                                                                                                                                                                                                                                               ●
                                  6000           ●                                                150                                                         ●
                                                                                                             ●                                          40
                                                                                                                                                                                                                                15                  ●
                                  4000                                                            100                                                                                                                                               ●

                                                                                                                                                                                                                                                           ●
                                                                                                                                                        20
                                  2000                                                            50                                                          ●
                                                                                                                                                              ●
                                                                                                                                                                                                ●                               10
                                                                                                                                                                      ●             ●
                                                                                                                                                                          ● ●   ●
                                                                                                                                                                          ● ●
                                     0                                                             0                                                      0                                                                       5
                                           5
                                          10
                                          15
                                          20
                                          25
                                          30
                                          35
                                          40
                                          45
                                          50
                                          55
                                          60
                                          65
                                          70
                                          75
                                          80
                                          85
                                          90
                                          95
                                         100




                                                                                                          5
                                                                                                         10
                                                                                                         15
                                                                                                         20
                                                                                                         25
                                                                                                         30
                                                                                                         35
                                                                                                         40
                                                                                                         45
                                                                                                         50
                                                                                                         55
                                                                                                         60
                                                                                                         65
                                                                                                         70
                                                                                                         75
                                                                                                         80
                                                                                                         85
                                                                                                         90
                                                                                                         95
                                                                                                        100




                                                                                                                                                                5
                                                                                                                                                               10
                                                                                                                                                               15
                                                                                                                                                               20
                                                                                                                                                               25
                                                                                                                                                               30
                                                                                                                                                               35
                                                                                                                                                               40
                                                                                                                                                               45
                                                                                                                                                               50
                                                                                                                                                               55
                                                                                                                                                               60
                                                                                                                                                               65
                                                                                                                                                               70
                                                                                                                                                               75
                                                                                                                                                               80
                                                                                                                                                               85
                                                                                                                                                               90
                                                                                                                                                               95
                                                                                                                                                              100




                                                                                                                                                                                                                                        5
                                                                                                                                                                                                                                       10
                                                                                                                                                                                                                                       15
                                                                                                                                                                                                                                       20
                                                                                                                                                                                                                                       25
                                                                                                                                                                                                                                       30
                                                                                                                                                                                                                                       35
                                                                                                                                                                                                                                       40
                                                                                                                                                                                                                                       45
                                                                                                                                                                                                                                       50
                                                                                                                                                                                                                                       55
                                                                                                                                                                                                                                       60
                                                                                                                                                                                                                                       65
                                                                                                                                                                                                                                       70
                                                                                                                                                                                                                                       75
                                                                                                                                                                                                                                       80
                                                                                                                                                                                                                                       85
                                                                                                                                                                                                                                       90
                                                                                                                                                                                                                                       95
                                                                                                                                                                                                                                      100
                                         Pattern Set Size                                                Pattern Set Size                                         Pattern Set Size                                                      Pattern Set Size



                                                   acf       acf
                                             (a) |δjk  | − |δjk     i
                                                                 ∩ δjk |                                                                                                        |fijk ∩ facf
                                                                                                                                                                                         jk |
                                                                                                                                                                  (b)                                                           × 100
                                                                                                                                                                                        |facf
                                                                                                                                                                                          jk |


                                                                                                        Fig. 3: Transition Matches



whisker plots show explicitly the median, 25th and 75th percentiles as well as
outliers of each of the 12 sample keyword sets of a given size. Subfigure 3a shows
the number of symbol transitions in acf FDFAs that do not correspond with
those in mi, me, mar and kum respectively. Subfigure 3b shows the percentage
of acf failure transitions matching those of the FDFAs generated by mi, me, mar
and kum respectively.

The symbol transitions for the mi heuristic are practically identical to those of
acf, differing by at most two. Differences are not significantly related to sample
size. Differences for me are somewhat larger, increasing slightly with larger sam-
ple size, though still relatively modest in relation to the overall number of FDFA
transitions. (There are |Q| − 1 transitions in the underlying trie.) In the cases of
mar and kum, the differences are approximately linearly dependent on the size
of the keyword set, reaching over 9000 and 250 respectively for keywords sets of
size 100.

The failure transition differences in regard to mi and me show a very similar
pattern as keyword size increases. Only in isolated instances do they fully match
those of acf, but the matching correspondence drops from a median of more
than 95% in the case of the smallest keywords sets to a median of about 50%
for the largest keyword sets. The median kum failure transition correspondence
with acf is in a range of about 12-18% for all pattern set sizes. However, in the
case of mar, the degree of correspondence is much worse: at best the median
value is just over 60% for small keyword sets, dropping close to zero for medium
96     Madoda Nxumalo et al.


range keyword set sizes, and then increasing slightly to about 10% for the largest
keyword sets.
Overall, Figures 2 and 3 reveal that there is a variety of ways in which failure
transitions may be positioned in an FDFA, and that lead to very good—in many
cases even optimal—transition savings. It is interesting to note that even in the
kum FDFAs, the total number of transition savings is very close to optimal,
despite relatively large differences in the positioning of the transitions. However,
the figures also show that this flexibility in positioning failure transitions to
achieve good arc savings eventually breaks down, as in the case of the mar
FDFAs.
One of the reasons for differences between acf FDFAs and the others is that some
implementations of the acf algorithm, including the SPARE-PARTS implemen-
tation, inserts a failure arc at every state (except the start state), even if there
is an out-transition on every alphabet symbol from a state. Such a failure arc is
of course redundant. Inspection of the data showed that some of the randomly
generated keyword sets lead to such “useless” failure transitions, but they are
so rare that they do not materially affect the overall observations.
The overall rankings of the output FDFAs of the various algorithms to acf could
broadly be stated as mi > me > kum > mar. This ranking is with respect to
closeness of transition placement to acf. Since the original focus of this study
was to explore heuristics for the DHA algorithm, further comments about the
kum algorithm are reserved for the next section.
The rationale for the mar heuristic is clear: it will cause the maximum savings
in transitions in a given iteration. It was in fact the initial criterion proposed
in [8]. It is therefore somewhat surprising that it did not perform very well in
comparison to other heuristics. It would seem that, in the present context, it is
too greedy—i.e. by selecting a concept whose extent contains the set of states
that can effect maximal savings such that in one iteration, it deleteriously elimi-
nates from consideration concepts whose extent contains some of those states in
subsequent iterations. Note that, being based on the maximum of the product
of extent and intent sizes, it will tend to select concepts in the middle of the
concept lattice.
When early trials in our data showed up mar’s relatively poor performance, the
mi and me heuristics were introduced to prioritise concepts in the top or bottom
regions of the lattice. These latter two heuristics will maximize the number
of symbol transitions to be removed per state when replacing them with failure
transitions, in so far as concepts with large intents tend to have small extents and
vice-versa. Although such a relationship is, of course, data-dependent, random
data tends in that direction, as was confirmed by inspection of our data.
These two heuristics appear to be rather successful at attaining acf-like FDFAs.
However, the ClosestToRoot heuristic has also played a part in this success. Note
that the acf failure transitions are designed to record that a suffix of a state’s
    An Aho-Corasick Based Assessment of Algorithms Generating Failure DFAs         97


string is also a prefix of some other state’s string. Thus, f (q) = p means that a
suffix of state q’s string is also a prefix of state p’s string. However, since there
may be several suffixes of q’s string and several states whose prefixes meet this
criterion, the definition of f requires that the longest possible suffix of q’s string
should be used. This ensures that there is only one possible state, p, in the trie
whose prefix corresponds to that suffix. Thus, on the one hand, acf directs a
failure transition “backwards” towards a state whose depth is less than that of
the current state. On the other hand, acf selects a failure transition’s target state
to be as far as possible from the start state, because the suffix (and therefore
also the prefix) used must be maximal in length.
The ClosestToRoot heuristic approximates the acf action in that it also directs
failure transitions backwards towards the start state. However, by selecting a
failure transition’s target state to be as close as possible from the start state,
it seems to contradict acf actions. It is interesting to note in Subfigure 3b that
both mi and me show a rapid and more or less linear decline in failure transition
matchings with respect to acf when pattern set size reaches about 65. We conjec-
ture that for smaller keyword sizes, theClosestToRoot heuristic does not conflict
significantly with acf’s actions because there are few failure target states from
which to choose. When keyword set sizes become greater, there is likely to be
more failure target states from which to choose, and consequently less correspon-
dence between the failure transitions chosen according to differing criteria.This
is but one of several matters that has been left for further study.


5     Conclusions and Future Agenda

Our ultimate purpose is to investigate heuristics for building FDFAs from gen-
eralised complete DFAs—a domain where optimal behaviour is known a priori
to be computationally hard. The comparison against acf FDFAs outlined above
is a firm but limited starting point. The next step is to construct complete DFAs
from randomly generated FDFAs and examine the extent to which the heuristics
tested out in this study can reconstruct the latter from the former. Because gen-
eralised DFAs can have cycles, the ClosestToRoot heuristic will be generalised
by using Dijktra’s algorithm for calculating the shortest distance from the start
state to each DFA state. It remains to be seen whether mar will perform any
better in the generalised context.
The relatively small alphabet size of 10 was dictated by unavoidable growth in
the size of the associated concept lattices. Even though suitable strategies for
trimming the lattice (for example by not generating concepts with arc redun-
dancy less than 2) are being investigated, it is recognised that use of DHA will
always be constrained by the potential for the associated lattice to grow exponen-
tially. Nevertheless, from a theoretical perspective a lattice-based DHA approach
to FDFA generation is attractive because it encapsulates the solution space in
which a minimal FDFA might be found—i.e. each ordering of its concepts maps
98     Madoda Nxumalo et al.


to a possible language-equivalent FDFA that can be derived from DFA and at
least one such ordering will be a minimal FDFA.
The kum FDFA generation approach is not as constrained by space limitations as
the DHA approach and in the present experiments it has performed reasonably
well. In the original publication, a somewhat more refined version is reported
that attempts to avoid unnecessary chains of failure transitions. Future research
should examine the minimising potential of this refined version using generalised
DFAs as input and should explore more fully the relationship between these kum-
based algorithms and the DHA algorithms.


References
 1. A. V. Aho and M. J. Corasick. Efficient string matching: An aid to bibliographic
    search. Commun. ACM, 18(6):333–340, 1975.
 2. H. Björklund, J. Björklund, and N. Zechner. Compression of finite-state automata
    through failure transitions. Theor. Comput. Sci., 557:87–100, 2014.
 3. A. Buzmakov and A. Neznanov. Practical computing with pattern structures in
    FCART environment. In Proceedings of the International Workshop ”What can
    FCA do for Artificial Intelligence?” (FCA4AI at IJCAI 2013), Beijing, China,
    August 5, 2013., pages 49–56, 2013.
 4. L. Cleophas, D. G. Kourie, and B. W. Watson. Weak factor automata: Comparing
    (failure) oracles and storacles. In J. Holub and J. Žďárek, editors, Proceedings of the
    Prague Stringology Conference 2013, pages 176–190, Czech Technical University in
    Prague, Czech Republic, 2013.
 5. M. Crochemore and C. Hancart. Automata for matching patterns. In S. A. Rozen-
    berg G., editor, Handbook of Formal Languages, volume 2, Linear Modeling: Back-
    ground and Application, pages 399–462. Springer-Verlag, 1997. incollection.
 6. B. Ganter, G. Stumme, and R. Wille, editors. Formal Concept Analysis, Foun-
    dations and Applications, volume 3626 of Lecture Notes in Computer Science.
    Springer, 2005.
 7. D. E. Knuth, J. H. M. Jr., and V. R. Pratt. Fast pattern matching in strings.
    SIAM J. Comput., 6(2):323–350, 1977.
 8. D. G. Kourie, B. W. Watson, L. Cleophas, and F. Venter. Failure deterministic
    finite automata. In J. Holub and J. Žďárek, editors, Proceedings of the Prague
    Stringology Conference 2012, pages 28–41, Czech Technical University in Prague,
    Czech Republic, 2012.
 9. J. B. Kruskal. On the shortest spanning subtree of a graph and the traveling
    salesman problem. Proceedings of the American Mathematical Society, 7(1):48–50,
    1956.
10. S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and J. S. Turner. Algorithms
    to accelerate multiple regular expressions matching for deep packet inspection.
    In Proceedings of the ACM SIGCOMM 2006 Conference on Applications, Tech-
    nologies, Architectures, and Protocols for Computer Communications, Pisa, Italy,
    September 11-15, 2006, pages 339–350, 2006.
11. B. W. Watson. Taxonomies and Toolkits of Regular Language Algorithms. PhD
    thesis, Eindhoven University of Technology, Sept. 1995.
12. B. W. Watson and L. G. Cleophas. SPARE Parts: a C++ toolkit for string pattern
    recognition. Softw., Pract. Exper., 34(7):697–710, 2004.