=Paper= {{Paper |id=Vol-3072/paper5 |storemode=property |title=Ordering Regular Languages: a Danger Zone |pdfUrl=https://ceur-ws.org/Vol-3072/paper5.pdf |volume=Vol-3072 |authors=Davide Martincigh,Giovanna D'Agostino,Alberto Policriti |dblpUrl=https://dblp.org/rec/conf/ictcs/MartincighDP21 }} ==Ordering Regular Languages: a Danger Zone== https://ceur-ws.org/Vol-3072/paper5.pdf
     Ordering Regular Languages: A Danger Zone

         Giovanna D’Agostino, Davide Martincigh, and Alberto Policriti

                                 University of Udine, Italy



        Abstract. Ordering the collection of states of a given automaton start-
        ing from an order of the underlying alphabet is a natural move towards
        a computational treatment of the language accepted by the automaton.
        Along this path, Wheeler graphs have been recently introduced as an ex-
        tension/adaptation of the Burrows-Wheeler Transform (the now famous
        BWT, originally defined on strings) to graphs. These graphs constitute
        an important data-structure for languages, since they allow a very ef-
        ficient storage mechanism for the transition function of an automaton,
        while providing a fast support to all sorts of substring queries. This is pos-
        sible as a consequence of a property—the so-called path coherence—valid
        on Wheeler graphs and consisting in an ordering on nodes that “propa-
        gates” to (collections of) strings. By looking at a Wheeler graph as an
        automaton, the ordering on strings corresponds to the co-lexicographic
        order of the words entering each state. This leads naturally to consider
        the class of regular languages accepted by Wheeler automata, i.e. the
        Wheeler languages.
        It has been shown that, as opposed to the general case, the classic deter-
        minization by powerset construction is polynomial on Wheeler languages.
        As a consequence, most of the classical problems turn out to be “easy”—
        that is, solvable in polynomial time—on Wheeler languages. Moreover,
        deciding whether a DFA is Wheeler and deciding whether a DFA accepts
        a Wheeler language is polynomial.
        Our contribution here is to put an upper bound to easy problems. For
        instance, whenever we generalize by switching to general NFAs or by not
        fixing an order of the underlying alphabet, the above mentioned problems
        become “hard”—that is NP-complete or even PSPACE-complete.



1     Introduction
Adding an order to a class of structures (graphs, groups, monoids...) is a nat-
ural move. Moreover, ordering is a basic data-structuring mechanism, strongly
favoring computational manipulations of the considered class.
   Over the class of finite automata, order has been added e.g. in [12], where
the order must propagate along equally labeled transitions. Much more recently,
    Copyright © 2021 for the individual papers by the papers’ authors. Copyright ©
    2021 for the volume as a collection by its editors. This volume and its papers are
    published under the Creative Commons License Attribution 4.0 International (CC
    BY 4.0).
2       G. D’Agostino et al.

in an effort to find a common denominator to a number of different algorithmic
techniques, Gagie et al. proposed (in [13]) a new simple strategy for enforcing
and using order on a given automaton. Starting from an underlying order of
the alphabet, the order on the states (formally given in Definition 1) must: i)
agree with the order of the labels of their incoming edges, and ii) be coherent
on target/source nodes, for pairs of arcs with equal labels. It turns out that
this kind of automata, called Wheeler automata, (a) admit an efficient index
data structure for searching subpaths labeled with a given query pattern, and
(b) enable a representation of the graph in a space proportional to that of the
edges’ labels since the topology can be encoded with just O(1) bits per node
[13] (as well as enabling more advanced compression mechanisms, see [3,11]).
This is in contrast with the fact that general graphs require a logarithmic (in
the graph’s size) number of bits per edge to be represented, as well as with
recent results showing that in general, the subpath search problem can not be
solved in subquadratic time, unless the strong exponential time hypothesis is
false [4,5,6,7,10].


                                                     c


                                        q1           q2
                                  a            c

                           q0                        f
                                  d
                                               c              f
                                        q4           q3           q5

                                                          c



Fig. 1. A WDFAs A recognizing the language Ld = ac∗ + dc∗ f . Condition (i) of
Definition 1 implies input consistency and induces the partial order q1 < q2 , q3 < q4 <
q5 . From condition (ii) it follows that δ(q1 , c) ≤ δ(q4 , c), thus q2 < q3 . Therefore, the
only order that could make A Wheeler is q0 < q1 < q2 < q3 < q4 < q5 . The reader can
verify that condition (ii) holds for each pair of equally labeled edges.



    In Figure 1 is depicted an example of a Wheeler automaton (see also Defi-
nition 1). Notice that the minimum DFA recognizing L is not input consistent,
hence not Wheeler; we will return on this point later. One is naturally led to
consider the class of Wheeler languages, i.e. the regular languages recognized
by some Wheeler automaton, as well as to raise the question of whether it is
possible to decide, just by looking at a (generic, possibly not Wheeler) DFA,
whether the language it recognizes is Wheeler. Wheeler languages have been
studied extensively in [2], where it is shown that we can decide in polynomial
time whether a DFA recognizes a Wheeler language. Unfortunately, trying to
                                  Ordering Regular Languages: A Danger Zone              3

use the same strategy on a NFA language does not work. In fact, in this paper
we show that the problem for NFAs becomes PSPACE-complete.


                                                    c


                                       q1           q3
                                 a            c

                          q0                        f
                                 b
                                              c              f
                                       q2           q4           q5

                                                         c



Fig. 2. A failed attempt to build a WDFA recognizing the language Lb = ac∗ + bc∗ f .
Condition (i) of Definition 1 induces the partial order q1 < q2 < q3 , q4 < q5 , and
condition (ii) implies δ(q1 , c) ≤ δ(q2 , c), thus q3 < q4 . But if we apply condition (ii)
again, we obtain δ(q2 , c) ≤ δ(q3 , c), a contradiction.



    Consider the example in Figure 2 and confront it with Figure 1. Even tough
apparently we made an insignificant change to the automaton, that is we have
simply swapped character-labels d and b, the new language recognized by the
automaton turns out to be not Wheeler. This example illustrates that the order
of the underlying alphabet plays a significant role when determining whether
an automaton, and even a language, is Wheeler. Notice that swapping d and
b can be seen as considering a different order on Σ: from a ≺ c ≺ d ≺ f to
a ≺ d ≺ c ≺ f . Again a natural question is raised: when presented with an
automaton (or a language) that is not Wheeler, is there a different way to order
the alphabet so that the automaton (language) becomes Wheeler? Being able
to answer in polynomial time to this this question would be very helpful for
applications. Unfortunately, as shown in Section 3, this problem turns out to be
NP-complete even for DFAs.
    WNFAs have the following useful property: turning a WNFA into a DFA
using the (classic) powerset construction actually results in a WDFA with at
most twice the number of states of the original WNFA. In other words, the
blow-up of states that we can observe when converting NFAs into DFAs, does
not occur for Wheeler non-deterministic automata. Nevertheless, it is known (see
[1]) that a blow-up of states can occur even when we switch from the minimum
DFA recognizing a language L to the minimum WDFA recognizing L. As a last
contribution, in this paper we give an answer to a question put forward in [1].
We provide an algorithm to compute the minimum WDFA starting from the
minimum DFA. Our algorithm works in exponential time in the worst case and,
4         G. D’Agostino et al.

since the dimension of the output might be exponential in the dimension of the
input, is optimal for the task.
    Due to space constraints, we present only the sketches of the proofs of our
results. The full proofs, complete with all the missing details, can be found in
the Appendix.


2      Wheeler languages
First of all, we fix some notation. Let Σ denote a finite alphabet endowed with
a total order (Σ, ≺). We denote by Σ ∗ the set of finite words over Σ, with ε
being the empty word, and we extend the order ≺ over Σ to the co-lexicographic
order (Σ ∗ , ≺), where α ≺ β if and only the reverse of α, i.e. α read from the
right to the left, precedes lexicographically the reverse of β. Given two words
α, β ∈ Σ ∗ , we denote by α a β the property that α is a suffix of β. For a language
L ⊆ Σ ∗ , we denote by Pref(L) the set of prefixes of strings in L. We denote by
A = (Q, q0 , δ, F, Σ) a finite automaton (NFA), with Q as set of states, q0 initial
state, δ : Q×Σ → 2Q transition function, and F ⊆ Q final states. The dimension
of A, denoted by |A|, is defined as the number of states of A. An automaton is
deterministic (DFA) if |δ(q, a)| ≤ 1, for all q ∈ Q and a ∈ Σ. As customary, we
extend δ to operate on strings as follows: for all q ∈ Q, a ∈ Σ and α ∈ Σ ∗
                                                    [
                    δ(q, ε) = {q},    δ(q, αa) =         δ(v, a).
                                                         v∈δ(q,α)

We denote by L(A) = {α ∈ Σ ∗ : δ(q0 , α) ∩ F 6= ∅} the language accepted by the
automaton A. We make the assumption that every automaton is basic, that is,
every state is reachable from the initial state and every state can reach at least
one final state. Notice that this assumption is not restrictive, since removing
from a NFA every state not reachable from q0 and every state from which is
impossible to reach a final state can be done in linear time and does not change
the accepted language. It immediately follows that:
    – there might be only one state without incoming edges, namely q0 ;
    – every word that can be read starting from q0 belongs to Pref(L).
Lastly, given a finite set Q, totally ordered by the relation <, we say that I ⊆ Q
is an interval if and only if for all q, q 0 , q 00 ∈ Q with q ≤ q 0 ≤ q 00 , if q, q 00 ∈ I then
q 0 ∈ I. We denote by I := [qmin , qmax ] a generic interval, where qmin (qmax ) is
the minimum (maximum) element of I with respect to <.
     The class of Wheeler automata has been recently introduced in [13]. An
automaton in this class has the property that there exists a total order on its
states that is propagated along equally labeled transition. Moreover, the order
must be compatible with the underlying order of the alphabet:
Definition 1 (Wheeler Automaton). A Wheeler NFA (WNFA) A is a NFA
(Q, q0 , δ, F, Σ) endowed with a binary relation <, such that: (Q, <) is a linear
order having the initial state q0 as minimum, q0 has no in-going edges, and the
                               Ordering Regular Languages: A Danger Zone           5

following two (Wheeler) properties are satisfied. Let v1 ∈ δ(u1 , a1 ) and v2 ∈
δ(u2 , a2 ):
 (i) a1 ≺ a2 → v1 < v2
(ii) (a1 = a2 ∧ u1 < u2 ) → v1 ≤ v2 .
A Wheeler DFA (WDFA) is a WNFA in which the cardinality of δ(u, a) is always
less than or equal to one.
    In Figure 1 is depicted an example of a WDFA.
Remark 1. A consequence of Wheeler property (i) is that A is input-consistent,
that is all transitions entering a given state u ∈ Q have the same label: if
u ∈ δ(v, a) and u ∈ δ(w, b), then a = b. Therefore the function λ : Q → Σ that
associate to each state the unique label of its incoming edges is well defined. For
the state q0 , the only one without incoming edges, we set λ(q0 ) := #.
    In [13] it is shown that WDFAs have a property called path coherence: let
A = (Q, q0 , δ, F, Σ) be a WDFA according to the order (Q, <). Then for every
interval of states I = [qi , qj ] and for all α ∈ Σ ∗ , the set J of states reachable
starting from any state of I by reading α is also an interval. Path coherence
allows us to transfer the order < over the states of Q to the co-lexicographic
order ≺ over the words entering the states: for each state q define the set Iq =
{α : δ(q0 , α) = q)}. With abuse of notation, we extend  to subsets of Σ ∗ as
follows: U  V , with U, V ⊆ Σ ∗ , if and only if α ≺ β for each α ∈ U, β ∈ V
such that {α, β} ∈  / U ∩ V . Then, two states q and p with Iq 6= Ip satisfy q < p
if and only Iq  Ip (again proved in [1]). An immediate consequence of this fact
is that a WDFA admits an unique order of its states that makes it Wheeler and
this order is univocally determined by the co-lexicographic order of any word
entering its states. This result is important for two different reasons. First of
all, it makes possible to decide in polynomial time whether a DFA is Wheeler:
for each state q, pick a word αq entering it and order the states reflecting the
co-lexicographic order of the words {αq : q ∈ Q}; then check if the order satisfies
the Wheeler conditions. Secondly, it is the key to adapt Myhill-Nerode Theorem
to Wheeler automata. We recall the following defintion.
Definition 2 (Myhill-Nerode equivalence). Let L ⊆ Σ ∗ be a language.
Given a word α ∈ Σ ∗ , we define the right context of α as
                           α−1 L := {γ ∈ Σ ∗ : αγ ∈ L},
and we denote by ≡L the Myhill-Nerode equivalence on Pref(L) defined as
                           α ≡L β ⇐⇒ α−1 L = β −1 L.
   The (classic) Myhill-Nerode Theorem, among many other things, establishes
a bijection between equivalence classes of ≡L and the states of the minimum
DFA recognizing L. This minimum automaton is also unique up to isomorphism
and a similar result, fully proved in [2], holds for Wheeler languages as well.
   In order to state such an analogous of Myhill-Nerode Theorem for Wheeler
languages, the equivalence ≡L is replaced by the equivalence ≡cL defined below.
6        G. D’Agostino et al.

Definition 3. The input consistent, convex refinement ≡cL of ≡L is defined as
follows. α ≡cL β if and only if

    – α ≡L β,
    – α and β end with the same character, and 2
    – for all γ ∈ Pref(L), if min(α, β)  γ  max(α, β), then α ≡L γ ≡L β.

    The Myhill-Nerode Theorem for Wheeler languages proves that there exists a
minimum (in the number of states) WDFA recognizing L. As in the classic case,
states of the minimum automaton are, in fact, ≡cL -equivalence classes, this time
consisting of intervals of words. Also such WDFA is unique up to isomorphism.
    A further important consequence, especially for testing Wheelerness, is stated
in the following Lemma (again proved in [2]).

Lemma 1. A regular language L is Wheeler if and only if all monotone se-
quences in (Pref(L), ≺) become eventually constant modulo ≡L . In other words,
for all sequences (αi )i≥0 in Pref(L) with

              α1  α2  . . . αi  . . .   or   α1  α2  · · ·  αi  . . .

there exists an n such that αh ≡L αk , for all h, k ≥ n.

    Lemma 1 shows how it is possible to recognize whether a language L is
Wheeler simply by verifying a property on the words of Pref(L): trying to find a
WDFA that recognizes L is no longer needed to decide the Whelerness of L. As
it turns out, we can verify whether the property depicted in Lemma 1 is satisfied
just by looking at any DFA recognizing L, as shown in Theorem 1 (see [2]).

Theorem 1. Let A be a DFA such that L = L(A), with initial state q0 and
dimension n = |A|.
L is not Wheeler if and only if there exist µ, ν and γ in Σ ∗ , with γ 
                                                                       a µ, ν, such
that:

1. µ 6≡L ν and they label paths from q0 to states u and v, respectively;
2. γ labels two cycles, one starting from u and one starting from v;
3. µ, ν ≺ γ or γ ≺ µ, ν.

The length of the words µ, ν and γ satisfying the above can be bounded:

4. |µ|, |ν| ≤ |γ| ≤ n3 + 2n2 + n + 2.

    Since in this work we make an extensive use of Theorem 1, here is a simple
example on how and why it works. Consider the automata depicted in Figure 3.
As shown in Figure 1 and 2, the language Ld is Wheeler whereas Lb is not, as
can be easily proved using Theorem 1. In fact, consider the automaton Ab . By
setting µ := a, ν := b and γ := c, one can verify conditions 1-3 of the theorem
are satisfied. Notice that condition a 6≡Lb b follows immediately from the fact
that δ(q0 , a) 6= δ(q0 , b) in the minimum DFA Ab .
                               Ordering Regular Languages: A Danger Zone        7


                    c                                         c

                   q1                                        q1
            a                                         a

      q0                                       q0
                    c                                         c
             d                                         b
                   q2             q3                         q2            q3
                           f                                        f
 (a) The minimum DFA Ad recognizing        (b) The minimum DFA Ab recognizing
 Ld = ac∗ + dc∗ f .                        Lb = ac∗ + bc∗ f .

Fig. 3. The minimum DFAs recognizing the languages Ld (Wheeler) and Lb (not
Wheeler).


   If we try to transpose the same reasoning to the automaton Ad by setting
µ = a, ν = d and γ = c, condition 3 of Theorem 1 is no longer satisfied. We can
not find 3 words satisfying conditions 1-3 of Theorem 1, therefore Lb is Wheeler.

    The polynomial bound given by condition 4 of Theorem 1 allows us to design
an algorithm that decides whether a given DFA recognizes a Wheeler language:
using dynamic programming (see [1]), it is possible to keep track of all the
relevant paths and cycles inside the DFA and check, in polynomial time, whether
there exists three words satisfying the conditions of the theorem.
    Things change if, instead of a DFA, we are given a NFA. Trying to exploit
the same idea used for DFAs does not work: the problem of deciding whether
two words µ and ν read by a NFA are Myhill-Nerode equivalent is PSAPCE-
complete, whereas is polynomial on DFAs (simply compute the minimum DFA
using Hopcroft’s algorithm and check whether the two states reached by µ and
ν coincide). Even worse, the straightforward attempt of building the minimum
DFA recognizing the NFA’s language might lead to a blow-up of the sates, re-
sulting in a exponential time (and exponential space) algorithm.
    We show that the problem of deciding whether a NFA recognizes a Wheeler
language is indeed hard, but doesn’t require exponential time to be solved: the
problem turns out to be PSPACE-complete. To show this, we first need to adapt
Theorem 1 to work on NFAs, as described in the following corollary.

Corollary 1. Let A = (Q, q0 , δ, F, Σ) be a NFA of dimension n := |A|. Then
L := L(A) is not Wheeler if and only if there exist three words µ, ν, γ, with
 a µ, ν, such that
γ

1. µγ i 6≡L νγ j for all 0 ≤ i, j ≤ 2n ;
2. γ labels two cycles, one starting from a state p ∈ δ(q0 , µ) and one from a
   state r ∈ δ(q0 , ν);
8       G. D’Agostino et al.

 3. µ, ν ≺ γ or γ ≺ µ, ν.
The length of the words µ, ν and γ satisfying the above can be bounded:
 4. |µ|, |ν| < |γ| ∈ O(n3 · 23n ).

Proof (Sketch). (=⇒) Let D be the minimum DFA recognizing the language L,
with initial state q̂0 . Recall that D might have up to 2n states. From Theorem
1 we know that L is not Wheeler if and only if there exist µ̂, ν̂ and γ̂ satisfying
conditions 1-4 of Theorem 1. Although there is not a perfect correspondence
between cycles in A and cycles in D, whenever we find in D a path µ̂ (respectively,
ν̂) and a cycle γ̂ both ending in the same state û (v̂), we can find in A a path
µ = µ̂γ̂ i (ν = ν̂ γ̂ j ) and a cycle γ1 = γ̂ k1 (γ2 = γ̂ k2 ) both ending in the same
state u (v), for some i, j, k1 , k2 ≤ n. Note that the word γ̂ (n+1)k1 k2 labels two
cycles starting from u and v. Moreover, we have

          |γ̂ (n+1)k1 k2 | ≥ (n + 1)|γ̂ k1 k2 | ≥ (n + 1)|γ̂| ≥ (i + 1)|γ̂| > |µ̂γ̂ i |,

and similarly |γ̂ (n+1)k1 k2 | > |ν̂ γ̂ j |. Since |γ̂| ∈ O(23n ), we have |γ̂ (n+1)k1 k2 | ∈
O(n3 · 23n ) and therefore the words µ, ν and γ := γ̂ (n+1)k1 k2 satisfy conditions
2-4 of this corollary. Finally, condition 1 is satisfied for all i, j ≥ 0 because when
reading µγ i and νγ j on D, they reach different states.
(⇐=) As we did while proving the opposite direction, given three words µ, ν, γ
in A satisfying condition 1-3, we can always find, for some i, j, k ≤ 2n , three
words µ̂ = µγ i , ν̂ = νγ j and γ̂ = γ k in D that almost satisfy condition 1-3 of
Theorem 1. The only requirement that might be missing is that µ̂ 6≡L ν̂. The
upper bound 2n in condition 1 ensures that this requirement is fulfilled, hence
we can apply Theorem 1 to conclude that L is not Wheeler.

    Despite the fact the the bound in condition 4 has become exponential by
switching to NFAs, it is still possible to check in polynomial space (but expo-
nential time) whether there are three words µ, ν and γ satisfying the conditions
of Corollary 1. Thus we can prove the following:

Proposition 1. Given a NFA A, deciding whether the language L := L(A) is
Wheeler is PSPACE-complete.

Proof (Sketch). To prove that the problem is in PSPACE (=NPSPACE), we
guess (using non-determinism) three words µ, ν and γ satisfying the conditions
of Corollary 1. Since this words are too long to be stored, we first guess their
length bit by bit and then we guess their characters one by one, starting from the
last one and proceeding from right to left. Meanwhile, for all α ∈ {µ, ν, γ} and
for all state t of A, we follow backwards the edges of A labeled as the character of
α that we are guessing. This way, we can compute the set of states from which is
possible to reach t by reading α. This allows us to find, for all i, j ≤ 2n , the sets
Qiµ := δ(q0 , µγ i ) and Qjν := δ(q0 , νγ j ). We can then test whether µγ i 6≡L νγ j by
confronting the language accepted by the automaton A with set of initial states
Qiµ and the language accepted by the automaton A with set of initial states Qiν .
                                Ordering Regular Languages: A Danger Zone          9

This can be done in polynomial space since the problem of deciding whether two
NFAs recognize the same language is known to be PSPACE-complete.

    To prove the hardness of the problem, we show a polynomial reduction from
the universality problem for NFA, i.e. the problem of deciding whether the lan-
guage accepted by a NFA A over the alphabet Σ is such that L(A) = Σ ∗ .
Let A = (Q, q0 , δ, F, Σ) be a NFA and let L := L(A). We can assume, without
loss of generality, that q0 ∈ F , otherwise A would not accept the empty word and
we could immediately derive that L =   6 Σ ∗ . Starting from A, we build in constant
time a NFA A00 over the alphabet Σ ∪ {a, b, c} that recognizes the language

                         L00 := a · (Lc)∗ · L + b · (Σ + c)∗ ,

where a, b, c are three characters not in Σ and such that a ≺ b ≺ c (the order of
the characters of Σ is irrelevant in this proof). If we prove that L = Σ ∗ if and
only if L00 is Wheeler, the reduction is complete and the thesis follows. Notice
that, since ε ∈ L, the following property holds:

                       L = Σ ∗ ⇐⇒ (Lc)∗ · L = (Σ + c)∗ .                         (1)

Let us prove that L = Σ ∗ implies that L00 is not Wheeler. If L = Σ ∗ , then by
(1) we have L00 = (a + b)(Σ + c)∗ . The minimum DFA recognizing L00 has only
one cycle, therefore from Theorem 1 it follows that L00 is Wheeler.
We next prove that L00 not Wheeler implies L = Σ ∗ . Notice that a−1 L00 =
(Lc)∗ · L so that, by (1) if L =
                               6 Σ ∗ , then a−1 L00 = (Lc)∗ · L 6= (Σ + c)∗ . However
           −1 00            ∗
we have b L = (Σ + c) , thus a 6≡L00 b. Moreover, it can be proved that, for
all n, both a ≡L00 acn and b ≡L00 bcn hold. Therefore the monotone sequence

                   ac ≺ bc ≺ ac2 ≺ bc2 ≺ · · · ≺ acn ≺ bcn ≺ . . .

is not constant modulo ≡L00 and from Lemma 1 it follows that L(A00 ) is not
Wheeler.


3   Generalized Wheelerness
As we have already pointed out in the introduction, changing the underlying
order of the alphabet might turn a Wheeler language into a not Wheeler one
and vice versa. For instance, consider again the Wheeler languages Ld and the
regular (but not Wheeler) language Lb depicted in Figure 3. If we change the
order of Σ from a ≺ c ≺ d ≺ f to a ≺ d ≺ c ≺ f , the Wheeler language Ld
turns into a non-Wheeler language (isomorphic to Lb under the isomorphism
ϕ between alphabets that fixes characters a, c, f and sends d into ϕ(d) = b).
Hence, by not fixing an apriori order of the alphabet Σ we enlarge the class of
languages.

Definition 4 (Generalized Wheelerness). A NFA A over the alphabet Σ is
called a Generalized Wheeler Automaton (GWNFA) if and only if there exists
10      G. D’Agostino et al.

an ordering of the elements of Σ that makes A Wheeler.
A language L is called generalized Wheeler (for short GW) if and only if there
exists a GWNFA that recognizes L.
    Let A be a WDFA. Then, every word α that labels a cycle in A is primitive
(see [2]), that is there exists no β 6= ε and i > 1 such that α = β i . A direct con-
sequence of this property is that Wheeler languages form a subclass of star-free
languages, i.e. the class of languages that can be defined by a regular expression
not containing the Kleene star. Since star-free expressions, and thus star-free
languages, are closed under permutations of the alphabet, even GW languages
must be a subclass of star-free languages. Here we show that the inclusion is
strict, therefore GW languages must be studied separately.
Proposition 2. If |Σ| ≥ 2, then the set {L ⊆ Σ ∗ : L is GW} is a proper subset
of {L ⊆ Σ ∗ : L is star-free}.
Proof. Let a, b be two distinct characters of Σ, and consider the language L =
a(aba)∗ a + ba(aba)∗ b. It is possible to prove that L is star-free, but L is not GW:
consider the sequence (αi )i≥2 with α2n = a(aba)n and α2n+1 = ba(aba)n . Since,
for all i, the word αi is a prefix of αi+1 , independently from how a and b are
ordered we have
     aaba ≺ baaba ≺ aabaaba ≺ baabaaba ≺ · · · ≺ a(aba)i ≺ ba(aba)i ≺ . . .
Moreover, for all i we have a(aba)i 6≡L ba(aba)i , since a(aba)i · a belongs to L
but ba(aba)i · a does not. We can then apply Lemma 1 to conclude that L is not
Wheeler. Since this result does not depend on the order of the alphabet, L is
not GW.
     We mentioned that we can decide in polynomial time whether a DFA is
Wheeler. On the contrary, deciding whether a NFA is Wheeler is NP-complete,
even when we bound the outdegree of each state of the NFA to be at most 5
(see [8]). As one may expect, deciding whether a NFA is a GWNFA is not easier
than deciding whether a NFA is Wheeler. In fact, we show in Proposition 3 that
the problem is NP-complete. We actually prove a stronger result in Proposition
4: even deciding whether a DFA is a GWNFA is NP-complete. Since the proof
of Proposition 4 is cumbersome, we decided to present also it weaker version,
i.e. Proposition 3, which shows a more natural reduction from the problem of
deciding whether a NFA is Wheeler. Both proofs can be found in the Appendix.
It is worth noticing that the proof of Proposition 3 can be adapted to work
even on DFAs, hence giving an alternative way to prove that deciding whether a
DFA is a GWNFA is NP-complete. Nonetheless, Proposition 4 is still stronger,
since it also proves that deciding whether a DFA recognizes a GW language is
NP-complete.
Proposition 3 (GWNFA hardness). Let A be a NFA. Deciding whether A
is a GWNFA is NP-complete.
Proposition 4 (GWDFA and GW languages hardness). Let L ⊆ Σ ∗ be a
language and A be a DFA. Both the problems of deciding whether A is a GWNFA
and deciding whether L is GW are NP-complete.
                               Ordering Regular Languages: A Danger Zone        11

4   DFA to WDFA
As discussed in Section 1, a Wheeler automaton can be represented more com-
pactly than a generic finite automaton. Therefore, if we are given a Wheeler
language L represented as a DFA A that recognizes L, we may be tempted to
look for a WDFA A0 that recognizes the same language to achieve a better, i.e.
more compact, representation. Unfortunately, this approach might not work in
our favor: in [1], a family (Lm )m≥1 of Wheeler languages with the property that
the dimension of the minimum WDFA recognizing Lm is exponential in the di-
mension of the minimum DFA recognizing Lm is presented. Note that we can
always assume that, whenever we are given a DFA or a WDFA, the automaton
is minimum. In fact, both tasks of minimizing a DFA and minimizing a WDFA
can be done in polynomial time (see [9] for DFAs and [1] for WDFAs). Here we
answer (positively) to the open question put forward in [1] whether there ex-
ists an algorithm to compute the minimum WDFA starting from the minimum
DFA. Our algorithm works in exponential time in the worst case and, since the
dimension of the output might be exponential in the dimension of the input, is
optimal for the task.
    Let L = L(A) be the language recognized by the given (minimum) DFA
A. Recall that there is a 1 to 1 correspondence between the ≡cL -classes and the
states of the minimum WDFA recognizing L. First of all, our algorithm identifies
a representative for each ≡cL -class; to be able do this in exponential time, we
first need to put a bound on the length of such representatives.
Lemma 2. Let A be the minimum DFA recognizing the Wheeler language L =
L(A) over the alphabet Σ, and let C1 , ..., Cm be the pairwise distinct equivalence
classes of ≡cL . Then, for each 1 ≤ i ≤ m, there exists a word αi ∈ Ci such that
|αi | < n + n2 , where n := |A|.
Proof (Sketch). Suppose by contradiction that there exists a class Ci such that
for all α ∈ Ci it holds |α| ≥ n + n2 , and let α ∈ Ci be a word of minimum
length. Since A has n states, there must exists a factor of α that corresponds to
a cycle in the run of α on A. Let β be the word obtained by erasing such factor
from α. Since α and β ends in the same state of A, we have α ≡L β. From the
minimality of α it follows that α 6≡cL β, therefore there must exists a word η that
is not Myhill-Nerode equivalent to α and that is included (co-lexicographically)
between α and β. We can further assume that the length of η is greater than n2 ,
hence we can identify a common factor of α and η that labels two cycles in the
runs of α and η on A. We can then apply Theorem 1 to conclude that L is not
Wheeler, a contradiction.
   The algorithm works as follow: first of all, it generates the list containing
every word of length less than n + n2 that can be read on A. This can be done in
                                                        2
exponential time since the list contains at most |Σ|n+n words. Then, we order
the list co-lexicographically and we apply the following proposition.
Proposition 5 (Minimum DFA to minimum WDFA). Let A be the min-
imum automaton recognizing L = L(A) with |A| = n, over the alphabet Σ with
12      G. D’Agostino et al.

|Σ| = σ. Let d := n + n2 and define

                       Pref(L)≤d := {α ∈ Pref(L) : |α| ≤ d}.

Assume that we are given the elements of Pref(L)≤d in co-lexicographic order,
i.e. Pref(L)≤d = {α1 , ..., αk } with αi ≺ αi+1 . Then it is possible to build the
minimun WDFA recognizing L in O(k + n2 · σ · m log m) time, where m is the
dimension of the output.

Proof (Sketch). From Lemma 2 we know that each ≡cL -class has at least one
representative in Pref(L)≤d . By considering the list [α1 ]L , . . . , [αk ]L we can spot
such representatives: αi represents a ≡cL -class if and only if αi and αi+1 belong
to different ≡L -classes or they belong to the same class but their last character
is different.
    Each representative of a ≡cL -class become a state of our WDFA. To compute
the edges of the automaton, we simply check, for all representative β and for all
c ∈ Σ, the ≡cL -class of β · c.


5    Conclusions
Having an order on the set of states of an automaton that is consistent with a
given order of the underlying alphabet has important implication from a prac-
tical point of view. In fact, deterministic or even non-deterministic finite state
automata enjoying this property—Wheeler automata—can be used to efficiently
analyse the language they accept by standard tools.
    In this work we established a few limitations along the directions that one
can imagine to take in order to extend the nice properties of Wheeler languages.
    More specifically, we proved that adding as a degree of freedom the possibility
to re-order the underlying alphabet produces a significantly more complex (NP-
complete) class of languages. In addition, the polynomial test we have for testing
whether a language is Wheeler when the latter is presented by a deterministic
automaton, turns out much more complex (PSPACE-complete) if the language
is presented by a non-deterministic one. The picture is completed by explicitly
giving an algorithm that turns a DFA into a Wheeler automaton, whenever this
is possible.
    Proving the above limitations should clarify the role played by apparently
secondary aspects of the definition of Wheeler language. This kind of study
should help to better understand the nature of Wheeler languages, with the
ultimate goal of singling out any feature that may admit some sort of extension.
    Consider, in more general terms, the following open problem: is there a class
of automata A properly extending the class of Wheeler automata and such that
their deterministic equivalent DA is such that |DA | ∈ poly(|A|)?
                                      Ordering Regular Languages: A Danger Zone              13

6    Appendix
Proof of Corollary 1. Let D = (Q̂, q̂0 , δ̂, F̂ , Σ) be the minimum DFA recog-
nizing L. Clearly D has at most 2n states.
(⇐=) From condition 2 it follows that µγ ∗ ⊆ Pref(L), so consider the following
list of 2n + 1 states of D:
                                                                               n
                         δ̂(q̂0 , µγ 0 ), δ̂(q̂0 , µγ 1 ), . . . , δ̂(q̂0 , µγ 2 ).

Since D has at most 2n states, there must exist two integers 0 ≤ h < k ≤ 2n
such that δ̂(q̂0 , µγ h ) = δ̂(q̂0 , µγ k ). Therefore γ k−h labels a cycle starting from
                                                                       0   0
δ̂(q̂0 , µγ h ). Similarly, there exist 0 ≤ h0 < k 0 ≤ 2n such that γ k −h labels a cycle
                             0
starting from δ̂(q̂0 , νγ h ). The words

                                    µ̂ := µγ h
                                                 0
                                    ν̂ := νγ h
                                                           0       0       n
                                    γ̂ := γ lcm(k−h,k −h )·2 ,

where the factor 2n in the definition of γ̂ ensures that |µ̂|, |ν̂| ≤ |γ̂|, satisfy
condition 2 of Theorem 1. Condition 1 of Theorem 1 follows automatically from
conditions 1 of this corollary. Lastly, condition 3 of Theorem 1 follows from
conditions 3 of this corollary and the fact that γ      a µ, ν. Thus we can apply
Theorem 1 to conclude that L is not Wheeler.
(=⇒) Since L = L(D) is not Wheeler, let µ̂, ν̂, γ̂ be as in Theorem 1. The DFA
D has at most 2n states, hence the length of γ̂ is bounded by the constant
23n + 2 · 22n + 2n + 2. We have µ̂γ̂ ∗ ⊆ Pref(L), so let t0 = q0 , t1 , . . . , tm be a run
of µ̂γ̂ n over A. We set u := |µ̂| and g := |γ̂|, and consider the list of n + 1 states

                             tu , tu+g , tu+2g , . . . , tu+ng = tm

Since A has n states, there must exist two integers 0 ≤ h < k ≤ n such that
tu+hg = tu+kg . That is, there exists a state p := tu+hg such that p ∈ δ q0 , µ̂γ̂ h
and γ̂ k−h labels a cycle starting from p. We can repeat the same argument for a
                                                                                              0
run of ν̂ γ̂ n over A to find a state r and two integers h0 , k 0 such that r ∈ δ(q0 , ν̂ γ̂ h )
        0     0
and γ̂ k −h labels a cycle starting from r. We can then define the words

                                     µ := µ̂γ̂ h
                                                     0
                                     ν := ν̂ γ̂ h
                                                               0       0
                                     γ := γ̂ lcm(k−h,k −h )·n

which satisfy the conditions 2 and 3.
Condition 4 is satisfied since |γ̂| ≤ 23n +2·22n +2n +2 and lcm(k−h, k 0 −h0 ) < n2 .
Finally, condition 1 is satisfied for all i, j ≥ 0. Indeed, for all l the words µ̂ and
µ̂γ̂ l lead to the same state of D, thus µ̂ ≡L µ̂γ̂ l . Similarly, for all l we also have
ν̂ ≡L ν̂ γ̂ l . Since ∀i ∃si such that µγ i = µ̂γ̂ si , and similarly, ∀j ∃sj such that
νγ j = ν̂ γ̂ sj , the thesis follows from µ̂ 6≡L ν̂.                                    t
                                                                                        u
14      G. D’Agostino et al.

Proof of Proposition 1. First of all we need to prove that the problem
is in PSPACE. We will show instead that its complement is in NPSPACE,
then the thesis follows from Savitch’s Theorem, which states that NPSPACE
= PSPACE, and the fact that PSPACE is closed under complementation. We
prove that checking the conditions in Corollary 1 is in NPSPACE. We can use
non-determinism to guess, bit by bit, the length of µ, ν and γ and store this
guessed information in three counters u, v, g respectively, using O(n) space for
each. We also need to guess the ending states p, r ∈ Q of µ, ν. Then we start
guessing the characters of µ, ν and γ starting from their last one and proceeding
backwards toward their first one, checking condition 3 of Corollary 1 in constant
space. Whenever we guess a character of µ (respectively, ν, γ) we decrease by
one the counter u (v, g), so we know when the guessing stops. If condition 3 is
satisfied, we calculate (we will show later how) the sets δ(q0 , µ), δ(q0 , ν), δ(p, γ),
and δ(r, γ) and check condition 2, that is, whether p ∈ δ(q0 , µ), r ∈ δ(q0 , ν),
p ∈ δ(p, γ), and r ∈ δ(r, γ). If condition 2 is satisfied, we consider condition 1: for
fixed i, j ≤ 2n consider the automata Aiµγ and Ajνγ obtained from the NFA L(A)
by considering as initial states the sets δ(q0 , µγ i ), δ(q0 , νγ j ), respectively. Notice
that we have µγ i 6≡L νγ j if and only if L(Aiµγ ) 6= L(Ajνγ ) and checking whether
L(Aiµγ ) = L(Ajνγ ) can be done in polynomial space, since deciding whether two
NFAs recognize the same language is a well-known PSPACE-complete problem.
    To conclude the proof, we claim that we are able to calculate in polynomial
space, for all q ∈ Q and for all β ∈ {µ, ν, γ}∗ , the set δ(q, β). While guessing
µ character by character, we can compute, for each state q, the set Qµ,q of the
states from which is possible to reach q reading µ. To build Qµ,q we start from
the set Q0 := {q} and we follow backwards the edges entering q and labeled
as the last character of µ. We call Q1 this new set of states and we repeat the
process by following backwards the edges entering each state of Q1 and labeled
as the second to last character of µ. Proceeding inductively, we compute the
sets Q0 , Q1 , . . . Qu = Qµ,q ; notice that to calculate Qk+1 we only need Qk and
the k-th to last character of µ, thus we can update (instead of storing) the set
Qk . We can do the same for ν and γ to compute, for each q ∈ Q, the sets Qν,q
and Qγ,q . Once we have stored, for all q ∈ Q and for all α ∈ {µ, ν, γ}, the sets
Qα,q , our claim follows easily. For instance, to compute δ(q0 , µγ) we build, for
all t ∈ Q, the set
                                                            [
                      Q(t) := {q ∈ Q : t ∈ δ(q, µγ)} =            Qµ,p .
                                                          p∈Qγ,t

Then we have t ∈ δ(q0 , µγ) if an only if q0 ∈ Q(t).

    To prove the completeness of the problem, we will show a polynomial reduc-
tion from the universality problem for NFA, i.e. the problem of deciding whether
the language accepted by a NFA A over the alphabet Σ is such that L(A) = Σ ∗ .
    Let A = (Q, q0 , δ, F, Σ) be a NFA and let L = L(A). We can assume without
loss of generality that q0 ∈ F , otherwise A would not accept the empty word
and we could immediately derive that L =    6 Σ ∗ . Let a, b, c be three characters
                                     Ordering Regular Languages: A Danger Zone      15

not in Σ and such that a ≺ b ≺ c with respect to the lexicographical order
(the order of the characters of Σ is irrelevant in this proof). First, we build the
automaton A0 starting from A by adding an edge (qf , q0 , c) for each final state
qf ∈ F , see the top part of Figure 1. Notice that A0 recognizes the language
L0 = L(A0 ) = (Lc)∗ · L, and it is straightforward to prove that L = Σ ∗ if and
only if L0 = (Σ + c)∗ : if L = Σ ∗ , let α be a word in (Σ + c)∗ containing n
occurrences of c. Then α = α0 c α2 c . . . αn−1 c αn for some α1 , . . . , αn ∈ Σ ∗ .
Hence α ∈ (Σ ∗ c)∗ · Σ ∗ = L0 . On the other hand, if L =6 Σ ∗ let α be a word in
  ∗                     0
Σ \ L. Then α · c ∈ /L.



                                                    c
                 A0              c


                              q0             A              N           A

                      a


           q00

                             b
                                             q1      Σ, c



Fig. 4. The automaton A00 . Every accepting state of A, labeled A in the figure, has a
back edge labeled c connecting it to q0 . Conversly, non-accepting states of A, labeled
N in the figure, do not have such back edges.



    We build a second automaton A00 as depicted in Figure 4. Let L00 = L(A00 )
be the language recognized by A00 . We claim that L = Σ ∗ if and only if L00 is
Wheeler.
(=⇒) If L = Σ ∗ , we have already proved that L0 = (Σ + c)∗ . Hence we have
L00 = (a + b) · (Σ + c)∗ . The minimum DFA recognizing L00 has only one loop,
therefore by Theorem 1 L00 is Wheeler.
(⇐=) If L =6 Σ ∗ , let α be a word in Σ ∗ \ L. Notice that α 6= ε since we assumed
that ε ∈ L. Every possible run of α over A must lead to a non-accepting state,
            / L0 . This implies that for all i ≥ 0 we have a · ci · α · c ∈
hence α · c ∈                                                             / L00 (notice
that the only edge labeled c leaving q0 ends in q0 ). On the other hand, for all
j ≥ 0 we have bcj · α · c ∈ L00 , hence for all i, j ≥ 0 we have aci 6≡L00 bcj . Thus
the following monotone sequence in Pref(L00 )

                      ac ≺ bc ≺ acc ≺ bcc ≺ · · · ≺ acn ≺ bcn ≺ . . .
16       G. D’Agostino et al.

is not eventually constant modulo ≡L00 . From Lemma 1 it follows that L00 is not
Wheeler.                                                                      t
                                                                              u

Proof of Lemma 2. Suppose by contradiction that there exists a class Ci such
that for all α ∈ Ci it holds |α| ≥ n + n2 , and let α ∈ Ci be a word of minimum
length. Consider the first n + 1 states q0 = t0 , ..., tn of A visited by reading the
first n characters of α. Since A has only n states, there must exist 0 ≤ i, j ≤ n
with i < j such that ti = tj . Let α0 be the prefix of α of length i (if i = 0 then
α0 = ε), let δ be the factor of α of length j − i labeling the path ti , ..., tj , and
let ζ be the suffix of α such that α = α0 δζ. By construction, the words α and
β := α0 ζ end in the same state, hence α ≡L β. Moreover, from |β| < |α| and the
minimality of α it follows that α 6≡cL β.
Suppose that α ≺ β, the other case being completely symmetrical. Since α and
β share the same suffix ζ, they end with the same character. This means that
the words α and β, which are Myhill-Nerode equivalent but not ≡cL equivalent,
were not split into two distinct ≡cL -classes due to input-consistency, therefore
there must exists a word η such that α ≺ η ≺ β and η 6≡L α. Formally, assume
by contradiction that for all words η such that α ≺ η ≺ β it holds η ≡L α. Then,
by definition of ≡cL , it would follow α ≡cL β, a contradiction.
Let η be a word such that α ≺ η ≺ β and η 6≡L α. From ζ a α, β it follows that
ζ a η, so we can write η = η 0 ζ for some η 0 ∈ Σ ∗ . Recall that by construction
α = α0 δζ with |α0 δ| ≤ n, hence |ζ| ≥ n2 . Consider the last n2 +1 states r0 , ..., rn2
of A visited by reading the word α, and the last n2 + 1 states p0 , ..., pn2 visited
by reading the word η. Since A has only n states, there must exist 0 ≤ i, j ≤ n2
with i < j such that (ri , pi ) = (rj , pj ). Notice that it can’t be ri = pi , otherwise
from the determinism of A it would follow rn2 = pn2 ; from the minimality of A
it would then follow α ≡L η, a contradiction.
Let ζ 00 be the suffix of ζ of length n2 − j, and let γ be the factor of ζ of length
j − i labeling the path ri , ..., rj . Since |ζ| ≥ n2 , there exists ζ 0 ∈ Σ ∗ such that
ζ = ζ 0 γζ 00 . We can then rewrite α, η and β as

                                    α = α0 δζ = α0 δζ 0 γζ 00
                                    η = η 0 ζ = η 0 ζ 0 γζ 00
                                    β = α0 ζ = α0 ζ 0 γζ 00 .

Let k be an integer such that |γ k | is greater than |α0 δζ 0 | and |η 0 ζ 0 |. Set µ := η 0 ζ 0 ;
from α ≺ η ≺ β it follows that α0 δζ 0 ≺ µ ≺ α0 ζ 0 . If γ k ≺ µ set ν := α0 ζ 0 ,
otherwise set ν := α0 δζ 0 . In both cases, the hypothesis of Theorem 1 are satisfied,
since γ k labels two cycles starting from the states ri and pi , that we have proved
to be distinct. We can conclude that L is not Wheeler, a contradiction, and the
thesis follows.                                                                               t
                                                                                              u

Proof of Proposition 5. Consider the pairwise distinct equivalence classes
C1 , . . . , Cm of the equivalence ≡cL . Clearly, the minimum Wheeler automaton
recognizing L has m states. We can assume without loss of generality that the
                                   Ordering Regular Languages: A Danger Zone         17

equivalence classes are co-lexicographically ordered, i.e. Ci ≺ Ci+1 for all i < m.
For sake of simplicity, given a word α ∈ Σ ∗ we will write [α] to indicate its
equivalence class modulo ≡L , and we will write [α]W to indicate its equivalence
class modulo ≡cL .
Let K := {1, ..., k} and, for all i, let Ki := {j ∈ K : [αj ]W = Ci }. Since
each Ci is convex in Pref(L), each Ki must be convex in K, that is each Ki
is an interval. Therefore the list of equivalence classes [α1 ]W , ..., [αk ]W must be
partitioned in consecutive runs of the same class, each class appearing in one and
only one run. From Lemma 2 we know that each equivalence class has at least
one representative in Pref(L)≤d , hence the list [α1 ]W , ..., [αk ]W must contain
exactly m runs.
    For all 1 ≤ j < k, we have [αj ]W 6= [αj+1 ]W if and only if
                                                                          
                [αj ] 6= [αj+1 ] ∨ [αj ] = [αj+1 ] ∧ last(αj ) 6= last(αj+1 ) ,

where last(α) denotes the last character of α. This means that we can identify
the m runs of the equivalence ≡cL just by looking at the two lists α1 , ..., αk and
[α1 ], ..., [αk ]: whenever last(αj ) 6= last(αj+1 ) or [αj ] 6= [αj+1 ], we know that a
new ≡cL run must start at αj+1 .
In O(k) we are able to determine the m runs and to pick a representative for
each of them, i.e. we can find m indexes i1 , ..., im such that for all 1 ≤ j ≤ m
it holds αij ∈ Ci . We call the set {ai1 , ..., aim } a fingerprint of the language L,
i.e. a set of words that has cardinality m such that distinct elements of the set
belong to distinct ≡cL -classes.
     We show how to build the minimum Wheeler DFA recognizing L, starting
from any fingerprint of L and the standard minimum DFA recognizing L. Let
{β1 , ..., βm } be a fingerprint of L and let A be the minimum DFA recognizing
L. We can assume without loss of generality that β1 ≺ ... ≺ βm . We build the
automaton AW = (Q, β1 , δ, F, Σ), where the set of states is Q = {β1 , ..., βm }
and the set of final states is F = {βj : βj ∈ L}. The transition function δ can
be computed as follow. For all 1 ≤ j ≤ m and for all c ∈ Σ, check whether
βj · c ∈ Pref(L). If βj · c ∈ / Pref(L), there are no edges labeled c that exit from
βj . If instead βj · c ∈ Pref(L), locate βj · c using a binary search. There are three
possible cases.

 1. βj · c ≺ β1 . Then δ(βj , c) = β1 .
 2. βm ≺ βj · c. Then δ(βj , c) = βm .
 3. There exists s such that βs  βj · c  βs+1 . It can not be the case that both
    βj c 6≡L βs and βj c 6≡L βs+1 , since {β1 , ..., βm } is a fingerprint of L. Hence
    we distinguish three cases.
    (a) βs ≡L βj c 6≡L βs+1 . Then δ(βj , c) = βs .
    (b) βs 6≡L βj c ≡L βs+1 . Then δ(βj , c) = βs+1 .
    (c) βs ≡L βj c ≡L βs+1 . Since {β1 , ..., βm } is a fingerprint of L, it is either
         c = last(βj c) = last(βs ), in which case δ(βj , c) = βs , or c = last(βs+1 ),
         in which case δ(βj , c) = βs+1 .
18     G. D’Agostino et al.

                                                                                       t
                                                                                       u

Proof of Proposition 3. The problem is in NP, since we can use non-determinism
to guess the order of the alphabet and then check whether such order makes the
NFA Wheeler.
    To prove the hardness, we show a polynomial reduction from the problem
of deciding whether a NFA is Wheeler. Let A be a NFA with initial state q0 ,
over the alphabet Σ = {a1 . . . , aσ } ordered by the relation a1 ≺ · · · ≺ aσ . We
want to build a new automaton A0 such that A0 is a GWNFA if and only if A
is Wheeler. A0 will be an automaton of size |A| + O(σ) over the alphabet Σ 0 of
size O(σ).
    The automaton A0 will be built starting from A and adding extra states and
transitions. We define the new alphabet as

                       Σ 0 = {a1 , . . . , aσ , x1 , . . . , xσ−1 , e, f },

with xi , e, f ∈
               / Σ, and we add two final states qe and qf . We then build σ − 1
gadgets, one for each pair of consecutive characters (ai , ai+1 ) of Σ, each one
connected to A∪{qe , qf } as depicted in Figure 5. This completes the construction
of the automaton A0 . Notice that A is input-consistent, but in general it is not
deterministic.


                                                                 ai
                                                 qi2                        qi7
                                                            xi
                                                                       xi
                                               xi                             e
                                  qi3                            qi5              qe
                   ai+1

              q0


                                                                              f
                                                                 qi4              qf
                                  xi
                                                       xi              xi

                                                                 ai
                                                 qi1                        qi6



         Fig. 5. The gadget Gi , connected to q0 and to the sinks qe and qf .



    We want to show that A is Wheeler according to the order (Σ, ≺) if and only
if A0 is a GWNFA.
                                 Ordering Regular Languages: A Danger Zone          19

(=⇒) Define the order ≺0 over Σ 0 by setting

                   a1 ≺0 ... ≺0 aσ ≺0 x1 ≺0 ... ≺0 xσ−1 ≺0 e ≺0 f.

We show that A0 is Wheeler according to (Σ 0 , ≺0 ) by ordering its states. Since
A is Wheeler, there already exists an order of its states that makes A Wheeler.
Therefore, we simply need to extend this order to the states of A0 . Recall that
in [1] it has been proved that a WNFA is Wheeler if and only if, for each pair of
states q, p such that Iq 6= Ip , either Iq  Ip or Ip  Iq holds, where by definition
Iq  Ip if and only if, for all α ∈ Iq and for all β ∈ Ip such that {α, β} 6⊆ Iq ∩ Ip ,
we have α ≺ β. Therefore we check, for each pair of states q and p in A0 , that
either Iq  Ip (implying q < p) or Ip  Iq (implying p < q) holds. Note that
when Iq and Ip are disjoint, the condition Iq  Ip translates to the following:
for all α ∈ Iq and for all β ∈ Ip , α ≺ β . In the discussion that follows we never
compare two states that belong to A, since the order between them is already
established.
      First of all, we will order, for each i, the states with incoming edge xi . Since
xi ∈  / Σ, the only states to compare are the one belonging to the gadget Gi , i.e.
qi4 , qi5 , qi6 , qi7 . Consider the languages

                           I4 := Iqi4 = xi (xi ai xi )∗
                           I5 := Iqi5 = ai+1 xi (xi ai xi )∗
                           I6 := Iqi6 = xi xi (ai xi xi )∗
                           I7 := Iqi7 = ai+1 xi xi (ai xi xi )∗ .

Since ai+1 ≺0 xi we have I4 , I5 ≺ I6 , I7 . Moreover, consider any two words
α ∈ I4 and β ∈ I5 . If |α| < |β|, we have α a β hence α ≺ β. If |α| ≥ |β|
instead, then the last |β| characters of α must be ai xi (xi ai xi )m , where m is such
that β = ai+1 xi (xi ai xi )m . From ai ≺0 ai+1 we still get α ≺ β, hence I4 ≺ I5 .
Similarly we can prove that I6 ≺ I7 . It immediately follows that we need to
order the states as follows: qi4 < qi5 < qi6 < qi7 .
    Secondly, for each 1 ≤ i ≤ σ we need to sort the states with incoming edges
labeled ai . Note that the automaton A might contain states with incoming label
ai and we need to consider such states as well. For i = σ, the task is easy. The
                                                                       3
only gadget with a state labeled aσ is Gσ−1 , and such state is qσ−1      . Notice that
Iqσ−1
  3   = {aσ }. Let q be a state of A with λ(q) = aσ . Since every word in Iq ends
with aσ , it trivially follows that Iqσ−1
                                       3    Iq . If Iq 6= {aσ }, we are forced to set
 3                                                          3
qσ−1  < q. If instead Iq = {aσ }, the order of q and qσ−1        does not matter. For
                                3
sake of consistency, we set qσ−1 < q.
For i = 1, the only gadget with states labeled a1 is G1 , and such states are q11
and q12 . Let q be a state of A with λ(q) = a1 and consider the languages

                          I1 := Iq11 = x1 x1 a1 (x1 x1 a1 )∗
                          I2 := Iq12 = a2 x1 x1 a1 (x1 x1 a1 )∗ .
20       G. D’Agostino et al.

For all α ∈ Iq , we have either α = a1 or α = α0 aj a1 for some α0 ∈ Σ ∗ and some
1 ≤ j ≤ σ. In both cases, α must precede co-lexicographically every word of I1
and I2 , thus Iq ≺ I1 , I2 . To compare I1 and I2 , consider any two words α ∈ I1
and β ∈ I2 . If |α| < |β|, we have α a β hence α ≺ β. If |α| ≥ |β| instead, then
the last |β| characters of α must be a1 x1 x1 a1 (x1 x1 a1 )m , where m is such that
β = a2 x1 x1 a1 (x1 x1 a1 )m . From a1 <0 a2 we still get α ≺ β, hence I1 ≺ I2 . It
follows that q11 and q12 must follow q (and all the other states of A with label
a1 ), and we must set q11 < q12 .
                                                        3
Lastly, if 1 < i < σ we should compare the sates qi−1      , qi1 and qi2 with the sates
in Qi := {q ∈ A : λ(q) = ai }. Applying both of the reasoning discussed in
                                                                   3
the cases i = 1 and i = σ, we can conclude that the state qi−1         precedes all the
                                       1       2
states in Qi and that the states qi and qi follow all the states in Qi and must
be ordered as qi1 < qi2 .
    The order of the states of A0 that we described makes A0 Wheeler with
respect to (Σ 0 , ≺0 ), hence making it a GWNFA.
(⇐=) If A0 is a GWNFA, then there exists an order ≺0 over Σ 0 that makes A0
Wheeler. Since A is a sub-automaton of A0 , it follows that even A is Wheeler
according to ≺0 . Let ≺  e be the restriction of ≺0 over the alphabet Σ; we want to
show that ≺ is the same order as ≺. Assume by contradiction that ≺
             e                                                             e=6 ≺. If for
all 1 ≤ i < σ we have ai ≺ai+1 , then ≺ =≺, a contradiction. Hence there exists
                             e             e
1 ≤ i < σ such that ai+1 ≺a   e i . Since ≺0 extends ≺,
                                                     e this implies that ai+1 ≺0 ai .
We will show that A is not Wheeler according to ≺0 , a contradiction. Define
                         0

the words µ := xi , ν := ai+1 xi and γ := xi ai xi . From µ a γ and ai+1 ≺0 ai
we have µ, ν ≺ γ. The word γ labels two cycles in A0 starting from two distinct
states, i.e. qi4 and qi5 . Moreover, µ and ν label two paths that start from the
initial state q0 and end in qi4 and qi5 respectively. Since qi4 and qi5 are not Myhill-
Nerode equivalent, we can apply Theorem 1 to conclude that A0 is not Wheeler
according to ≺0 , a contradiction. Thus ≺    e and ≺ coincide.
We have shown that A is Wheeler according to ≺0 , and that ≺0 extends ≺.
Therefore we can conclude that A is Wheeler according to ≺.                           t
                                                                                      u

Definition 5 (Betweenness). Input: a list Y of n distinct elements Y =
y1 , ..., yn and k < n3 ordered triples (a1 , b1 , c1 ), ..., (ak , bk , ck ), where each el-
ement of each triple belongs to Y . Elements belonging to the same triple are
distinct.
Output: yes/no answer. The answer is “yes” if and only if there exists a total
order < of Y such that, for each k, either ak < bk < ck or ak > bk > ck .

Proof of Proposition 4. We can prove that both problems are in NP using an
argument similar to the one employed in the proof of Proposition 3.
    To prove the hardness, we show a polynomial reduction from the between-
ness problem to both of the problems described; we will use exactly the same
reduction for both problems. We start from an instance I = (Y, K) of the be-
tweenness problem, where Y is the set Y = {y1 . . . , yn } and K ⊆ P(T 3 ) is a
collection of k triples (a1 , b1 , c1 ), . . . , (ak , bk , ck ), for some 1 < k < n3 . We build
a DFA A of size O(n + k), over an alphabet of size O(n + k). The alphabet is
                                       Ordering Regular Languages: A Danger Zone                            21

Σ = Y ∪ {x1 . . . , xk , e, f }, where we introduce a new character xi for each triple
(ai , bi , ci ) ∈ K and two extra “ending” characters e and f . To build G, we start
with the initial state q0 connected with n states q1 , . . . , qn through the edges
(q0 , yj , qj ) for each 1 ≤ j ≤ n. We also add two sinks qe and qf , the only final
                                 1    3     5
states. We add the states qij      , qij , qij (see Figure 6) and the transitions



                                                        5              bi      3
                                                       qij                    qij
                                                                  xi
                                                                             xi
                                                      xi                1         e
                                      qj                               qij                     qe
                             =y
                                j
                        ai
                q0
                        ci =
                               ym                 xi                              f
                                      qm                            2
                                                                   qim                         qf
                                                             xi              xi

                                                       6
                                                                       bi      4
                                                      qim                     qim



                      Fig. 6. The gadget Gi related to the i-th triple.



                      1           1            3              3            5             5            1
       δ(qj , xi ) = qij ,     δ(qij , xi ) = qij ,        δ(qij , bi ) = qij ,       δ(qij , xi ) = qij
        1
and δ(qij , e) = qe . We repeat the same process with ci : given the integer m such
                                    2    4     6
that ci = ym , we add the states qim  , qim , qim and the transitions
                  2             2            4                4            6              6            2
   δ(qm , xi ) = qim ,       δ(qim , xi ) = qim ,          δ(qim , bi ) = qim ,        δ(qim , xi ) = qim
        2
and δ(qim   , f ) = qf . Lastly, we remove the states among q1 , . . . , qn that don’t
have outgoing edges. More formally, we define the sets A := {a1 , . . . , ak } and
C := {c1 , . . . , ck } and we remove from G all the states qj such that yj ∈ / A ∪ C.
We show that the instance I = (Y, K) of the betweennes problem is satisfiable
if and only if A is a GWNFA, if and only if L(A) is GW.
(=⇒) Since I = (Y, K) is satisfiable, there exists an ordering π : Y → {1, ..., n}
of the elements of Y satisfying I. We order Σ as follows:

                     π −1 (1) ≺ ... ≺ π −1 (n) ≺ x1 ≺ · · · ≺ xk ≺ e ≺ f.

This ordering induce a partial order on the states of A, where states with differ-
ent incoming labels are ordered by such labels. Therefore we only need to order
22      G. D’Agostino et al.

the states of A with the same incoming label.
                                                                         1    3     2     4
For each 1 ≤ i ≤ k, the only states of A with incoming label xi are qij    , qij , qim , qim ,
where j and m are integers such that ai = yj and ci = ym . Since, by con-
struction, the order π satisfies the instance I, then only two cases can occur:
either π(ai ) < π(bi ) < π(ci ), or π(ci ) < π(bi ) < π(ai ). In the first case, we set
 1      2      3       4
qij < qim  < qij  < qim  . To realize that this is in fact the correct order of the
states, consider the following languages:
                                  ∗
                       1 = {α ∈ Σ
               I1 := Iqij                          1
                                    : δ(q0 , α) = qij } = ai xi (xi bi xi )∗
               I2 := Iqim
                       2  = ci xi (xi bi xi )∗
                                               ∗
               I3 := Iqij
                       3 = ai xi xi (bi xi xi )


               I4 := Iqim
                       4  = ci xi xi (bi xi xi )∗ .

Since, by construction, we have ai , bi , ci ≺ xi , it follows that I1 , I2 ≺ I3 , I4 .
Moreover, from π(ai ) < π(bi ) < π(ci ) we also have that I1 ≺ I2 and I3 ≺ I4 ,
which completes the ordering. Symmetrically, if π(ci ) < π(bi ) < π(ai ) then we
      2      1      4       3
set qim   < qij < qim   < qij .
We still need to order the states of A whose incoming labels belong to Y . For
                                                                                   5
each 1 ≤ p ≤ n, the states with incoming label yp belong to the sets V5 := {qij       :
                         6
bi = yp } or V6 := {qim : bi = yp } or Vp , where Vp = {qp } if qp is a state of A
(i.e. if yp ∈ A ∪ C) and Vp = ∅ otherwise. If Vp = {qp }, we set qp as the smallest
state. We then sort the states of V5 and V6 by their first subscript; when the first
                                                                          5         6
subscript is equal, i.e. the two states that we want to confront are qij    and qim
                                                 5     6
(with ai = yj and ci = ym ), then we set qij < qim if π(ai ) < π(bi ) < π(ci ) and
           6     5
we set qim    < qij  if π(ci ) < π(bi ) < π(ai ). This can be deduced by confronting
the following languages:
                                                               ∗
                            I5 := Iqij
                                    5 = ai xi xi bi (xi xi bi )


                          I6 := Iq60      = ci0 xi0 xi0 bi0 (xi0 xi0 bi0 )∗ .
                                    i m


If i < i0 then, by construction, we have xi ≺ xi0 , hence I5 ≺ I6 . Symmetrically,
if i0 < i then we have I6 ≺ I5 . Lastly, if i = i0 then the order between I5 and I6
is determined solely by ai and ci0 = ci .
     Since there is only one state with incoming label e and only one state with
incoming label f , we have finished. The order described makes A Wheeler, thus
A is a GWNFA and L(A) is GW.
(⇐=) Assume that the instance I = (Y, K) of the betweenness problem is un-
satisfiable. We prove that L(A) is not GW. Assume by contradiction that L(A)
is GW, then there exists an ordering π 0 of the elements of Σ such that L(A)
is Wheeler according to said order. Recall that Y ⊆ Σ and consider the or-
der π := π 0 |Y . Since (Y, K) is unsatisfiable, π must violate one of the con-
straints, i.e. there exists an 1 ≤ i ≤ k such that either π(ai ), π(ci ) < π(bi ) or
π(bi ) < π(ai ), π(ci ). Define the words µ := ai xi , ν := ci xi and γ := xi bi xi ; then
it is either µ, ν ≺ γ or γ ≺ µ, ν (here the co-lexicographic order ≺ is calculated
                                 Ordering Regular Languages: A Danger Zone            23

with respect to π 0 ). By construction, γ labels two cycles in A starting from two
                  1        2
distinct states, qij and qim , which are not Myhill-Nerode equivalent. Moreover,
                                                                             1
µ and ν label two paths that start from the initial state q0 and end in qij    and
 2
qim  respectively. We can then apply the Theorem 1 to conclude that L(A) is
not Wheeler according to π 0 , a contradiction. Therefore L(A) is not GW, which
automatically implies that A is not a GWNFA.                                     t
                                                                                 u


References

 1. Alanko, J., D’Agostino, G., Policriti, A., Prezza, N.: Regular languages meet pre-
    fix sorting. In: Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algo-
    rithms. pp. 911–930 (2020). https://doi.org/10.1137/1.9781611975994.55, https:
    //epubs.siam.org/doi/abs/10.1137/1.9781611975994.55
 2. Alanko, J., D’Agostino, G., Policriti, A., Prezza, N.: Wheeler Languages. CoRR
    arXiv:2002.10303 (Feb 2020)
 3. Alanko, J., Gagie, T., Navarro, G., Seelbach Benkner, L.: Tunneling on wheeler
    graphs. In: 2019 Data Compression Conference (DCC). pp. 122–131 (2019).
    https://doi.org/10.1109/DCC.2019.00020
 4. Backurs, A., Indyk, P.: Which regular expression patterns are hard to match? In:
    2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS).
    pp. 457–466 (2016). https://doi.org/10.1109/FOCS.2016.56
 5. Equi, M., Grossi, R., Makinen, V.: On the Complexity of Exact Pattern Matching in
    Graphs: Binary Strings and Bounded Degree. In: ICALP 2019 - 46th International
    Colloquium on Automata, Languages and Programming. pp. 1–15. Patras, Greece
    (Jul 2019), https://hal.inria.fr/hal-02338498
 6. Equi, M., Mäkinen, V., Tomescu, A.I.: Graphs cannot be indexed in polynomial
    time for sub-quadratic time string matching, unless seth fails. In: Bureš, T., Dondi,
    R., Gamper, J., Guerrini, G., Jurdziński, T., Pahl, C., Sikora, F., Wong, P.W. (eds.)
    SOFSEM 2021: Theory and Practice of Computer Science. pp. 608–622. Springer
    International Publishing, Cham (2021)
 7. Gibney, D., Hoppenworth, G., Thankachan, S.V.: Simple reductions from
    formula-sat to pattern matching on labeled graphs and subtree isomorphism.
    In: Le, H.V., King, V. (eds.) 4th Symposium on Simplicity in Algorithms,
    SOSA 2021, Virtual Conference, January 11-12, 2021. pp. 232–242. SIAM
    (2021). https://doi.org/10.1137/1.9781611976496.26, https://doi.org/10.1137/
    1.9781611976496.26
 8. Gibney, D., Thankachan, S.V.: On the hardness and inapproximability of rec-
    ognizing wheeler graphs. In: 27th Annual European Symposium on Algorithms,
    ESA 2019, September 9-11, 2019, Munich/Garching, Germany. pp. 51:1–51:16
    (2019). https://doi.org/10.4230/LIPIcs.ESA.2019.51, https://doi.org/10.4230/
    LIPIcs.ESA.2019.51
 9. Hopcroft, J.E.: An n log n algorithm for minimizing states in a finite automaton.
    Tech. rep., Stanford University (January 1971)
10. Potechin, A., Shallit, J.: Lengths of words accepted by nondeter-
    ministic finite automata. Information Processing Letters 162, 105993
    (2020).      https://doi.org/https://doi.org/10.1016/j.ipl.2020.105993,       https:
    //www.sciencedirect.com/science/article/pii/S0020019020300806
24      G. D’Agostino et al.

11. Prezza, N.: On locating paths in compressed tries. In: Proceedings of the Thirty-
    Second Annual ACM-SIAM Symposium on Discrete Algorithms. p. 744–760. So-
    ciety for Industrial and Applied Mathematics, USA (2021)
12. Shyr, H., Thierrin, G.: Ordered automata and associated languages. Tamkang J.
    Math (5), 9–20 (1974)
13. Travis Gagie, Giovanni Manzini e Sirén, J.: Wheeler graphs: A framework for bwt-
    based data structures. Theoretical computer science 698, 67–78 (2017)