1. Introduction

ICTCS

Giuseppa Castiglione

Giovanna D'Agostino

Alberto Policriti

Antonio Restivo

Brian Riccardi

String Matching, Deterministic Finite Automata, Wheeler Languages, Graph Indexing, Co-lexicographical Sorting

0 Dip. di Informatica, Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca , Italy 1 Dip. di Matematica e Informatica, Università di Palermo , Italy 2 Dip. di Scienze Matematiche, Informatiche e Fisiche, Università degli Studi di Udine , Italy

2025

26 10 12

Wheeler languages, introduced to capture a class of regular languages compatible with an ordered and indexable structure, form a well-behaved subclass of the regular languages. In this paper, we study a little-explored property of such languages: closure under complementation. Specifically, we provide a complete characterization of Wheeler languages whose complement is also Wheeler. Our results ofer a deeper understanding of the internal structure of these languages and have both theoretical implications-within the classification of regular languages-and practical applications, particularly in fields leveraging coherent orderings, such as text indexing and genomic data analysis.

1. Introduction

In recent years, interest in Wheeler languages, introduced in [ 1 ], has grown significantly, in part due to their central role in eficient indexable data structures such as Wheeler graphs and the Burrows-Wheeler Transform (BWT). A regular language is said to be Wheeler if it can be recognized by a deterministic automaton equipped with a total order on its states—i.e. sets of strings—that satisfy monotonicity conditions coordinating such ordering with the ordering of its transition labels.

While several studies have investigated the structure, minimization, and verification of the Wheeler property in regular languages (see [ 2, 3, 4 ]), and generalize it to a larger context (see [ 5 ]), the question of closure under complementation remains largely open. It is known that the class of Wheeler languages (a subclass of star-free languages) is not closed under complement in general, which naturally raises the following questions, first addressed in [ 6 ]: which Wheeler languages have the property that their complement is also Wheeler? Can we provide a structural characterization of this subclass?

In this work, we answer these questions afirmatively by presenting a formal and constructive characterization of Wheeler languages that are closed under complement within the Wheeler class. Our analysis relies on a combination of automata-theoretic techniques and combinatorial properties of co-lexicographical orders, extending and strengthening existing results. In addition to its theoretical relevance, our characterization helps to delineate more precisely the computational and expressive boundaries of Wheeler-based data structures.

More precisely, our result builds on a characterization given in [ 7 ] of the languages ℒ for which both ℒ and its complement ℒ are Wheeler with respect to every total order on the alphabet. This universality condition is particularly strong and, as a consequence, the class of such languages is rather limited. In fact, it coincides the union of definite ( DEF) and reverse definite languages ( RDEF). The classes of DEF and RDEF are classical subclasses of regular languages that have been studied since the origins of automata theory [ 8 ] and play a role in classifications of star-free languages [ 9 ]. Starting from this characterization, our goal is to extend the class DEF ∪ RDEF so as to capture exactly those languages that are Wheeler and whose complement is also Wheeler, but with respect to a fixed alphabet order, rather than universally. To this end, we treat the classes DEF and RDEF separately. For the reverse definite languages, we introduce a new operation—parameterized by a fixed total order on the alphabet—that, when applied to RDEF, yields a larger class of languages which are guaranteed to be Wheeler and to have a Wheeler complement for that specific order. For definite languages, we first provide a novel characterization of the class DEF in terms of string intervals. We then extend this notion by allowing infinite periodic strings as interval endpoints. As in the

RDEF case, this generalized

framework allows us to define new languages that are Wheeler and have Wheeler complements for the chosen alphabet order.

The paper is organized as follows. In Section 2 we recall the basic notions regarding strings and regular languages that will be needed to prove our result (see [ 10 ] as a general reference). In Section 3 we will summarize all results on Wheeler automata and Wheeler languages together with some particular constructions on DFA that will be needed in the sequel. Section 4 contains the characterization of the class of regular languages which are Wheeler with a Wheeler complement and, finally, Section 5 contains conclusions and open problems.

2. Basics

Strings and Automata. Letting Σ be a finite alphabet, we denote by Σ* the set of finite words (or strings) over Σ. A language ℒ is a subset of Σ* and Pref(ℒ) is the set of all prefixes of words in deterministic finite automaton (DFA) is a tuple = (, , Σ, , ), where is a finite non-empty set of states, is the initial state (or source), the alphabet is Σ, : × Σ → is the (possibly partial) transition ℒ. A function, and ⊆

is the set of final states . Whenever is not defined on (, ) we write (, ) = ⊥. If is a total function we say that the DFA is complete. When clear from the context, we omit the alphabet Σ. If we are dealing with a single DFA, the symbols , , , always refer to it. Sometimes we shall describe the transition function using triples (labeled edges), where (, , ′) stands for ′ = (, ). We call (, , ′) an -transition and (, , ) an -loop. As customary, we extend to strings: for all ∈ , ∈ Σ and ∈ Σ* , we put (, ) = and (, ) = ( (, ), ), where (⊥, ) = ⊥ . Given (or recognized) by is then defined as ℒ() = ⋃︀ = (, , Σ, , ) and ∈ , we denote by the set { ∈ . Moreover, for any ∈ , we denote by () the ∈ Σ* : (, ) = }. The language accepted set of all letters ∈ Σ that label transitions reaching , i.e., () :− { ∈ Σ : (∃ ∈ )( (, ) = ) }.

A Σ-loop in a DFA is a state such that (, ) = , for all ∈ Σ. By saying that we add a Σ-loop to we mean that we add a new state together with all transitions (, ) = for ∈ Σ (and eventually some other transitions ending in ). A cycle labeled = 1 . . . − 1 ∈ Σ* in a DFA is a sequence of states 1, 2, . . . , = 1 such that (, ) = +1, for all < , and it is simple if 1, . . . , − 1 are pairwise distinct (in particular, a loop is a simple cycle).

We mostly consider trimmed DFAs, that is, automata in which every state is reachable (from the initial state) and useful (can reach at least one final state). This is not restrictive: every automaton can be turned into a trimmed and equivalent one by simply deleting unreachable states as well as states not reaching at least one final state. In a trimmed automata there can be at most one state without incoming transitions, namely , and every string that can be read starting from belongs to the set of prefixes, Pref(ℒ()). as follows (see [ 10 ]): minimum DFA correspond to the classes of the Myhill-Nerode equivalence relation ≡ ℒ

Languages accepted by DFAs form the class of regular languages. Given a regular language ℒ, there exists a unique (up to isomorphism) state-wise minimum complete DFA accepting ℒ. The states of this on Σ* , defined ≡ ℒ ⇐⇒ { ∈ Σ* : ∈ ℒ } = { ∈ Σ* : ∈ ℒ } .

Denoting by ℒ

the minimum complete DFA having as states the classes [ ]ℒ of the Myhill-Nerode

DFA obtained from ℒ

equivalence, we have: = [ ]ℒ, ([ ]ℒ, ) = [ · ]ℒ, and = { [ ] ℒ : ∈ ℒ }. Letting ℒ be the by considering only classes [ ]ℒ with ∈ Pref(ℒ) and the transitions among them, one can easily check that ℒ recognizes ℒ, it is trimmed, and can difer from ℒ for (at most) one non-final Σ-loop. ℒ is the DFA with minimum number of states among (not necessarily complete) DFAs accepting ℒ.

Given a finite set ⊆ Σ* , we define the Prefix-tree of as the DFA having Pref() as states, as (root and) initial state, and whose transitions are (, ) = , for , ∈ Pref(). Every state of this DFA is reached by the single word . The Prefix-tree acceptor of a finite set is the Prefix-tree of having as collection of final states.

The class DEF consists of languages of the form ∪ Σ* , where , ⊆ Σ* are finite. The class RDEF is the class of languages whose reverse is in DEF, that is, languages of the form ∪ Σ* , where , are finite. Notice that, for RDEF languages, we can always assume Pref( ) ∩ = ∅ and prefix-free (meaning that there no strings , ∈ such that is a prefix of ). Moreover, in the minimum trimmed DFA recognizing an infinite language in RDEF the set of states contains a single ifnal Σ-loop and the transitions restricted to ∖ {} build no cycle.

3. Preliminaries

Since in this paper we will be dealing with a finite alphabet Σ and a fixed total order over it, unless otherwise specified, we will always use Σ = {1, . . . , }, for some ∈ N ∖ {0}, with its natural order. We extend such order co-lexicographically to Σ* , that is, for , ∈ Σ* , we have ≤ if and only if either is a sufix of (denoted by ⊣ ), or there exist ′, ′, ∈ Σ* and , ∈ Σ, such that = ′ and = ′ and < . Notice that in the co-lexicographical order every string has an immediate successor, the string 1 , but there are strings without an immediate predecessor, e.g. the string 2. If is a string, we denote by | | its length, and by [] its -th character from the left, if 1 ≤ ≤ | |.

Wheeler Automata and Languages. Wheeler Automata are a special class of DFAs that leverage an a priori fixed order of the alphabet in order to achieve, among other things, eficient compression and indexing (see [ 1 ]). Specifically, the co-lexicographic order is lifted from strings to states of the automaton in such a way that precedes if and only if, for every reaching and reaching , precedes co-lexicographically. Axioms (W1) and (W2) below do the job.

Definition 3.1 (Wheeler Automaton [ 1 ]). A Wheeler DFA (WDFA for brevity) is a trimmed DFA endowed with a total order (, ≤ ) on the set of states such that the initial state has no incoming transitions, it is minimum for ≤ , and the following two Wheeler axioms are satisfied. Let ′ = (, ) and ′ = (, ): (W1) if < , then ′ < ′; (W2) if = , < , and ′ ̸= ′, then ′ < ′.

Notice that we use the same symbol for the order on the alphabet and the states. Moreover, notice that (W1) implies that a WDFA is input consistent, that is, | ()| = 1 for all states ̸= . Any DFA can be transformed into an equivalent input consistent DFA in ︀( || · | Σ|)︀ time by simply creating, for each state ∈ , at most |Σ| copies of , one for each diferent incoming label of . An input consistent DFA can also be pictured as a graph where the labels of the edges are moved to the respective target states (a state is labeled when the transitions reaching are -transitions). The initial state, reached by no transitions, is labeled with #, where # < for all ∈ Σ. An example of a Wheeler state-labeled DFA is depicted in Figure 1. The unique Wheeler order on the DFA is the one where, for , ∈ Σ, if < , then a state labeled precedes a state labeled , and the final state labeled 2 (above in the figure) precedes the state labeled 2 below.

It can be proved (see [ 3, 5 ]) that, if a total order satisfying Definition 3.1 exists, then it is unique and, as we said, allows the lifting of the co-lexicographic order from strings to states. More precisely, given a trimmed, input consistent DFA in which the initial state has no incoming edges, we can define a start # partial order ≤ on by ≤ ′ ⇔ ( = ′) ∨ (∀ ∈ )(∀ ∈ ′ )( < ). Then, it can be proved that ≤ always satisfies properties (W1) and (W2) and the following holds.

Lemma 1 ([ 5 ]). A trimmed, input consistent DFA is Wheeler if and only if the partial order ≤ is total. Remark 1. Since in a WDFA the order ≤ is total, we can decide whether ≤ ′ by simply checking the relative co-lexicographical order among a pair (, ) with ∈ and ∈ ′ . Moreover, if a sequence of words ( )∈N is monotone in the co-lexicographical order of words, the corresponding sequence of states ( (, ))∈N must be monotone in the order ≤ and, since there are only a finite number of states, the sequence ( (, ))∈N must be eventually constant.

A Wheeler language is a language accepted by a Wheeler DFA. We denote by W(<) the class of languages that are Wheeler in the ordered alphabet (Σ, <). In order to recognize if a given regular language is Wheeler, we shall use the following.

Lemma 2 ([ 3 ]). A regular language ℒ is Wheeler if and only if all monotone sequences in (Pref(ℒ), ≤ ) become eventually constant modulo ≡ ℒ. In other words, for all sequences ( )∈ in Pref(ℒ) such that: 1 ≤ 2 ≤ . . . ≤ ≤ . . . or 1 ≥ 2 ≥ . . . ≥ ≥ . . . , there exists an such that ℎ ≡ ℒ , for all ℎ, ≥ .

Notice that there exists a Wheeler language ℒ for which the minimum DFA ℒ is not Wheeler. Consider e.g. the DFA given in Fig. 2. This DFA is (minimum) but not Wheeler by Lemma 1: the strings 124, 1244 reach , the string 324 reaches state , and 124 < 324 < 1244, so that states , are <-incomparable. However the language recognized by this DFA is Wheeler. To prove this, we can use a characterization of (non) Wheeler languages based on the existence of "special" pairs of ℒ-incomparable states. More precisely: Definition 3.2. Let = (, , Σ, , ) be a DFA. If , ∈ then we denote by ▷◁< the fact that , are ≤ ℒ -incomparable, that is:

▷◁< ⇐⇒ ∃, ′ ∈ , ∃, ′ ∈ ( < ) ∧ ( ′ < ′).

Theorem 3 ([ 4 ]). Let ℒ be the minimum trimmed DFA accepting ℒ. Then, ℒ ∈/ W(<) if and only if in ℒ there are two nodes ̸= such that: 1. , are the starting points of two equally labeled cycles; 2. ▷◁< .

Going back to the language of Fig. 2, we notice that ℒ contains only one pair from which two equally labeled cycles start, namely the pair (, ), but <ℒ holds. Hence, the language ℒ() is Wheeler by Theorem 3.

Another consequence of Theorem 3 is that Wheeler languages are star-free. To see this, remember that a counter in a DFA is a sequence of ≥ 2 pairwise distinct states 1, . . . , such that there exists a string ∈ Σ* for which: (i) (, ) = 1, and (ii) (, ) = +1 for all 1 ≤ < . It is known that start

1 3 2 2 a language is star-free if the minimum automaton of the language is counter-free. Hence, Wheeler languages are star-free since one can prove that a counter would always violate the conditions in Theorem 3.

Wheeler Languages with a Wheeler complement

Wheeler languages are not closed under complementation; for example, the language 2* over the ordered alphabet {1, 2} is Wheeler but its complement is not: indeed, the monotone sequence 2 < 12 < 22 < ... < 2 < 12 < 2+1 < ... belongs to Pref(ℒ) but it is not eventually constant modulo ≡ ℒ. Hence, ℒ is not Wheeler by Lemma 2. Remark 2. One can easily find a suficient condition for describing languages which are Wheeler with a Wheeler complement as follows. Consider a language ℒ recognized by a Wheeler complete DFA , that is an automaton satisfying Def. 3.1 even without trimming. Clearly, since the trimmed version of is still a Wheeler DFA, ℒ is Wheeler. Moreover, ℒ is also Wheeler, because if we swap final and non final states in and trim the resulting DFA, we obtain a Wheeler DFA for ℒ. However, this is not a necessary condition for a language to be Wheeler with a Wheeler complement, as the following example shows. Consider Σ = {1, 2} and the language 1Σ* . Then ℒ = { } ∪ 2Σ* . Using Lemma 2 one can easily check that both ℒ and ℒ are Wheeler. However, there cannot be a complete Wheeler DFA recognizing any of them. By Remark 2, in such a DFA any monotone sequence of words should eventually end in the same state. However, the words of the sequence

2 < 12 < 22 < ... < 2 < 12 < 2+1 < ... belong alternatively to ℒ and ℒ, leading to a contradiction. 1

In [ 7 ] it is proved that DEF ∪ RDEF coincides with the class of languages ℒ such that both ℒ and ℒ are Wheeler with respect to every order of the alphabet.

Lemma 4 ([ 7 ]). ℒ, ℒ ∈ W(<) for all total orders < ⇔ ℒ ∈ DEF ∪ RDEF.

What happens if we drop the universality condition above? In this paper, we show that in order to characterize the class of languages ℒ such that ℒ, ℒ ∈ (<) with respect to a fixed order < of Σ, it is suficient to consider a slight extension of the class DEF ∪ RDEF. We start with an example of a language ℒ such that both ℒ and its complement ℒ are Wheeler for a fixed order, but ℒ is not in DEF ∪ RDEF.

Example 4.1. Let Σ = {1, 2} and consider the language ℒ = 1+ over Σ, recognized by the DFA in Fig. 3. This DFA is a complete Wheeler DFA, since it satisfies the conditions of Def. 3.1 even without trimming — just consider the order < 1 < 2 < 3. Hence, both ℒ and ℒ are Wheeler by Remark 2. Moreover, 1 ∈ ℒ, while 21, 12 ̸∈ ℒ, for all > 1: hence, we cannot decide membership 1Notice that this sequence is neither in Pref(ℒ) nor in Pref(ℒ), hence it does not contradict the Wheelerness of ℒ or ℒ. 1 1 1 1 2 2 1 2 2 3 in ℒ by just checking a prefix or sufix of fixed length, proving that ℒ ̸∈ DEF ∪ RDEF. This simple example can also be used to explain another peculiarity of complementation for Wheeler language. If we change the order of the alphabet so that 2 < 1, then we can prove that the language 1+ is still Wheeler but its complement is not. Consider again the complete DFA of Fig. 3: if 2 < 1 states 1, 2 are incomparable with respect to <, since 1 < 121 < 11 and 1, 11 reach 1 while 121 reaches 2. Hence, this DFA is not Wheeler w.r.t. the order 2 < 1 and we cannot count on it to prove that 1+ is Wheeler. However, if we trim we obtain a Wheeler DFA for 1+ (insensitive to the order of the alphabet), thereby proving that 1+ is still Wheeler when 2 < 1. Notice that this trimmed DFA is useless to show the Wheelerness of the complement, which can indeed proved to be a non Wheeler language by Lemma 2, since the sequence 1 < 121 < 11 < 1211 < 111 < 1211 < . . . is monotone in (Pref(ℒ), <) but it is not eventually constant modulo ≡ ℒ. 2

When the order of the alphabet is fixed, a special role is played, as we shall see, by the first symbol of the alphabet. This can be explained as follows. If we fix a symbol of the alphabet, the concatenation to the left with , that is, the function from strings to strings defined by ↦→ is not in general monotone: e.g. 1 < 21 but 31 > 321. However, if = 1 is the minimum of (Σ, <), then, for any ∈ Σ* , 1 is the immediate successor of and, therefore, for < we have: <

1 ≤ < 1.

From this it follows, for example, that < if and only if 1 < 1 . This helps significantly when analyzing Wheelerness of languages.

The following definition characterizes more precisely the kind of application we need from the above considerations.

Definition 4.1. Let ℒ be a regular language and ℓ = sup { ∈ N : 1 ∈ Pref(ℒ) }. If ℓ = ∞, then we define 1↑ℒ = ℒ otherwise, 1↑ℒ = ℒ ∪ ︀{ 1ℓ+ : 1ℓ ∈ ℒ ∧ ∈ N ︀} .

Remark 3. If ℓ = 0, then 1↑ℒ = 1* ℒ. For example, if ℒ = {212, 3} over the alphabet {1, 2, 3}, then 1↑ℒ = 1* {212, 3}. Notice that the operator 1↑ℒ is idempotent, but it is not a closure operator because it is not monotone, e.g. {1, 11} ⊆ { 1, 11, 1111} but 1↑{1, 11} = {1} ∪ 1* 11 = 1+ ̸⊆ 1↑{1, 11, 1111} = {1, 11} ∪ 1* 1111.

We first prove that the operation ℒ ↦→ 1↑ℒ preserves Wheelerness. The intuition is that since 1 is the immediate co-lexicographic successor of , if we add to a language the monotone sequence 1, 11, 111 . . . , with all strings reaching the same state, this cannot create any new monotone sequence violating Lemma 2.

Lemma 5. If ℒ ∈ (<) then 1↑ℒ ∈ (<).

Proof. Let ℒ ∈ (<) and let ℓ = sup { ∈ N : 1 ∈ Pref(ℒ) }. If ℓ = ∞, then 1↑ℒ = ℒ and there is nothing to prove, hence assume ℓ < ∞. Since ℒ ∈ (<), there exists a Wheeler DFA 2Notice that this sequence is not in Pref(ℒ). = (, , , ) recognizing ℒ. We build a Wheeler DFA 1 recognizing 1↑ℒ in two phases. First, we construct a (Wheeler) ′ = (′, ′, ′, ′) with ℒ(′) = ℒ where, for ≤ ℓ, any state reached by ′(′, 1) is only reached by 1. Then, we turn ′ into a Wheeler DFA 1 recognizing 1↑ℒ, so that 1↑ℒ = ℒ(1) ∈ (<).

The automaton ′ = (′, ′, ′, ′) is defined as follows. For any 1 ≤ ≤ ℓ, let ′ be a copy of the state = (, 1). Let ′ = ∪ {1′, . . . , ℓ′}, ′ = , and ′ = ∪ { ′ : ∈ , 1 ≤ ≤ ℓ }. The new transition function ′ is obtained from by just erasing (, 1, 1) and adding (, 1, 1′), (′, 1, ′+1), (′, , ), for 1 ≤ < ℓ, ̸= 1, and (, , ) ∈ (see Fig. 4). 1

j 1 j 1 1 1 1 ℓ . . . 2 1 1 1 1 ′ ℓ . . . ′ 2 ′ 1 j j

j 1

j 1 1 1 1 ℓ . . . 2 1 1 1 1 ′ ℓ . . . ′ 2 ′ 1 j j

j 1

j 1 1 1 1 ℓ . . . 2 1

Notice that both ′(ℓ′, 1) and (ℓ, 1) are undefined. By construction, ′ is a DFA and for any 1 ≤ ≤ ℓ we have that ′ is reached in ′ by 1 only. Moreover, words reaching ∈ in ′ are exactly those reaching in that are diferent from 1, for 1 ≤ ≤ ℓ. This implies that ℒ(′) = ℒ() = ℒ. By possibly erasing non-reachable states we can also assume that ′ is trimmed. Moreover, we can check that the preorder ≤ ′ is total. Since 1 is the minimum letter of the alphabet, we <′ ℓ′ and all the ′’s precede states in ′ ∖ {1′, . . . , ℓ′}, which are ordered have 1′ <′ 2′ <′ · · · as in .

From ′ we obtain 1, recognizing 1↑ℒ, by adding the self-loop (ℓ′, 1, ′ ), that is, 1 = (′, ′, ′ ∪ ℓ {(ℓ′, 1, ℓ′)}, ′) (see Fig. 4). By construction, ′, for 1 ≤ < ℓ, is reached in 1 by 1 only, while ℓ′ is reached by the infintely many words in 1ℓ1* . Moreover, a state ∈ ′ ∖ {1′, . . . , ′} ⊆ is reached in 1 only by the words /∈ 1* and such that one of the following holds: 1. reaches in ′ as well, or 2. = 1ℓ1ℎ with ℎ > 0, 1ℓ reaches in ′, and [ 1 ] ̸= 1.

From this it follows that also the order ≤ 1 is total. Since, for 1 ≤ < ℓ, ′ is reached in 1 only by 1 and ℓ′ is reached by the words in 1ℓ1* , we have 1′ <1 · · · <1 ℓ′. Moreover, since 1 is the minimum in Σ, all states in {1′, . . . , ′} precede states in ′ ∖ {1′, . . . , ′}. Consider now , ′ ∈ ′ ∖ {1′, . . . , ′}, with ̸= ′. Since, as proved above, the order ≤ ′ over the automaton ′ is total, , ′ are comparable in ≤ ′ : say <′ ′. We now prove that this implies that ≤ 1 ′ holds as well. Suppose reaches and ′ reaches ′ in 1. Then < ′ because one of the following cases apply: • , ′ reach , ′ in ′, respectively, hence < ′ because <′ ′. • reaches in ′, ′ = 1ℓ1ℎ with ℎ > 0, 1ℓ reaches ′ in ′, and [ 1 ] ̸= 1; in this case, from <′ ′ it follows < 1ℓ < 1ℓ1ℎ = ′. • = 1ℓ1ℎ with ℎ > 0, 1ℓ reaches in ′, and [ 1 ] ̸= 1, while ′ reaches ′ in ′; in this case, from <′ ′ it follows 1ℓ < ′; moreover, all words in 1ℓ1+ reach in 1 and form the infinite chain of immediate successors of 1ℓ in Σ* ; since ′ ̸∈ 1ℓ1+ ( ′ reaches ′ and not in in 1), it follows that 1ℓ1ℎ < ′. • = 1ℓ1ℎ with ℎ > 0, 1ℓ reaches in ′, and [ 1 ] ̸= 1, ′ = 1ℓ1 ′ with > 0, 1ℓ ′ reaches ′ in ′, and ′[ 1 ] ̸= 1. In this case, reasoning as in the previous points we get 1ℓ < 1ℓ1ℎ = < 1ℓ ′ < 1ℓ1 ′ = ′.

Hence, we proved that the partial order ≤ 1 is, in fact, total. By Lemma 1 we have that 1 is a Wheeler automaton and, since it recognizes 1↑ℒ, we have that 1↑ℒ ∈ (<).

Definition 4.2.

The class 1↑RDEF is the closure of RDEF under the operator 1↑ℒ:

1↑RDEF = RDEF ∪ {︁ 1↑ℒ : ℒ ∈ RDEF }︁

We first prove that this class is closed under boolean operations.

Lemma 6. The class 1↑RDEF is closed under boolean operations. Moreover, 1↑RDEF ⊆ W(<).

Proof. We first prove closure under complementation. If ℒ ∈ RDEF then it is known that ℒ ∈ RDEF. If ℒ ∈ 1↑RDEF ∖ RDEF then there exists ℛ ∈ RDEF such that ℒ = 1↑ℛ and 1↑ℛ ̸= ℛ, that is ℓ = sup{ : 1 ∈ Pref(ℛ)} < ∞. Let , be finite sets such that ℛ = ∪ Σ* , Pref(F) ∩ G = ∅, and is prefix-free. Consider the DFA , recognizing the language ℒ = 1↑ℛ, defined as follows. First, we build the Prefix-tree acceptor of ∪ , and since Pref(F) ∩ G = ∅, we have that no state in leads to a state in . We then add a new Σ-loop and all transitions (, , ), for ∈ , obtaining an automaton recognizing ℛ. Finally, we add the self-loop (1ℓ, 1) = 1ℓ obtaining recognizing ℒ = 1↑ℛ.

In order to obtain a DFA for ℒ, starting from , switch final and non-final states and add a new ifnal Σ-loop ′, reached by all missing transitions. Finally, we trim all states not reaching a final state, obtaining a new ′, easily seen to accept ℒ. Notice that the Σ-loop of will be erased in ′, because it is final in and there are no paths leaving it. Hence is not in ′.

We consider two cases.

If state 1ℓ is erased in ′, then ′ contains only one state reached by infinitely many strings — namely, the Σ-loop ′. In this case ℒ ∈ RDEF ⊆ 1↑RDEF and we are done.

If 1ℓ is in ′, then ′ contains only two states reached by infinitely many strings — namely, ′ and 1ℓ —, and the only simple cycles in ′ are the 1-loop on 1ℓ and the Σ-loop on ′. If we remove the transition (1ℓ, 1) we obtain a DFA ′′ recognizing a language ∈ RDEF such that ℒ = 1↑, proving that ℒ ∈ 1↑RDEF.

As for closure under union, if ℒ1, ℒ2, ∈ 1↑RDEF, let ℒ1 = 1↑ℛ1 and ℒ2 = 1↑ℛ2, with ℛ1, ℛ2 ∈ RDEF. Let ℓ = sup{ : 1 ∈ Pref(ℛ)}, for = 1, 2. If ℓ1 = ℓ2 then 1↑ℛ1 ∪ 1↑ℛ2 = 1↑(ℛ1 ∪ ℛ2) and we are done. If ℓ1 < ℓ2 then consider the language ℛ˜ 1 = ℛ1 ∪ ︀{ 1ℎ : ℓ1 ≤ ℎ ≤ ℓ2, 1ℓ1 ∈ ℛ1 }︀ . Then ℛ˜ 1 ∈ RDEF, because whether a word belongs to ℛ˜ 1 still depends only on the prefix of fixed length of the word. Hence, ℛ˜ 1 ∪ ℛ2 ∈ RDEF and ℒ1 ∪ ℒ2 = 1↑(ℛ˜ 1 ∪ ℛ2).

Being closed under complementation and union, the class 1↑RDEF is closed under boolean operations.

Finally, since 1↑RDEF = RDEF ∪ {︀ 1↑ℒ : ℒ ∈ RDEF }︀ the inclusion 1↑RDEF ⊆ W(<) follows from Lemma 4 and Lemma 5.

In Lemma 6 we proved that 1↑RDEF ⊆ follows.

Lemma 7. Let ℒ be a regular language. If Pref(ℒ) ̸= Σ* then

W(<). We obtain a partial converse of this inclusion as

ℒ ∈ W(<) ⇒ ℒ ∈ 1↑RDEF.

Proof. Suppose that ℒ ∈ W(<). Since we are assuming Pref(ℒ) ̸= Σ* , it follows that in ℒ—minimum DFA accepting ℒ—there is a final Σ-loop . If the only simple cycles in ℒ are the loops over , then ℒ ∈ RDEF and we are done.

Otherwise, consider a simple cycle not visiting . We claim that if is the label of (any) such a simple cycle starting from a state and is such that (, ) = , then = 1 and = 1 for some ∈ N. This implies that such a cycle is unique, namely, a 1-loop on (, 1). We prove our claim showing that, otherwise, we could find three words 1 < 2 < 3 with 1, 3 reaching and 2 reaching , respectively. This would imply ▷◁< and, since is the label of a cycle both from and from (the Σ-loop) , by Theorem 3 we would contradict ℒ ∈ W(<).

Let be a word reaching . We may assume ̸= , otherwise ℒ = Σ* and we are done. Let ′ ∈ Σ* be any word with < ′. Suppose first that ̸∈ 1* . Then, there exists ′ < with | ′| = | | and we can consider 1 = ′ < 2 = < 3 = ′ . Since 1, 3 ∈ , and 2 ∈ , we are done. Hence, ∈ 1* and, furthermore, since ℒ ∈ (<) and, consequently, is star-free, ℒ must be counter-free, implying = 1.

Next, suppose that ̸∈ 1* . Then let ′, ′′ ∈ Σ* be such that | ′′| = | | and ′′ < < ′. Since 1 = ′′, 2 = , 3 = ′ are such that 1 < 2 < 3, we would again contradict ℒ ∈ (<). Therefore, ∈ 1* and the claim is proved.

The only cycle in ℒ not visiting is the 1-labelled self-loop on , with reached by 1 only. Erasing the 1-loop from we then obtain a DFA accepting ℛ ∈ RDEF, thereby proving ℒ = 1↑ℛ ∈ 1↑RDEF.

Corollary 8. If ℒ is a regular language with ℒ ∈ (<) and Pref(ℒ) ̸= Σ* , then ℒ ∈ (<). Proof. Suppose ℒ ∈ (<) and Pref(ℒ) ̸= Σ* . By Lemma 7 we have that ℒ ∈ 1↑RDEF, hence ℒ ∈ 1↑RDEF ⊆ W(<) by Lemma 6.

Corollary 9. If ℒ is a regular language and Pref(ℒ) ̸= Σ* then,

ℒ ∈ (<) ⇔ ℒ ∈ 1↑RDEF.

Proof. Implication from left to right is given in Lemma 7. Implication from right to left follows from Lemma 6.

Thanks to the previous results we now have a complete characterization of the Wheeler Languages with a Wheeler complement in the cases in which either Pref(ℒ) or Pref(ℒ) are diferent from Σ* . Indeed it easily follows from Corollary 9 that this class coincides with 1↑RDEF.

In order to complete our characterization, we need to consider the case in which both ℒ and ℒ are Wheeler and Pref(ℒ) = Pref(ℒ) = Σ* . In this case we will generalize the notion of interval on Σ* . A (classic) interval in (Σ* , <) is a set of the form (, ), [, ), (,

}, [, ) = { ∈ Σ* : ≤ < }, etc...

The words , are called the bound words of the intervals (, ), [, ), . . .

Following the ideas introduced in [ 11 ], we generalize this type of intervals by considering intervals of finite words but whose bound words can be periodic in Σ. Since we will compare finite and innfiite words co-lexicographically, it is convenient to think of infinite words as “growing” to the left. More precisely, an infinite word is depicted as = . . . − 1 . . . 1 0, where ∈ Σ. Finite words are then seen as infinite words with an infinite occurrence of the character # on the left: . . . # # # . . . 0, where # < , for all ∈ Σ. The co-lexicographic order < over Σ* ∪ Σ, extending the co-lexicographic order < over Σ* , can then be defined by: <

⇔ ∃ ( () < () ∧ ∀ < () = ()), where an infinite sequence of # is appended to finite words as explained above. Finally, if , ∈ Σ* we denote by the infinite periodic word 3 . . . . . . .

We can now consider intervals with bounds that are infinite periodic words. E.g. if Σ = {1, 2, 3} we have:

(23, ∞) = { ∈ Σ* : 23 < } = Σ* 32* 3, while [33, ∞) = Σ* 33, and [3321, 31) = Σ* 3321. As we shall prove later, these generalized intervals are always (regular) Wheeler languages. First we characterize the class DEF using usual (classical) intervals.

Lemma 10. A regular language ℒ is in DEF if and only if it is a union of a finite number of intervals in (Σ* , <) having finite words or ∞ as bounds.

Proof. Let Σ = {1, 2, . . . , }. We first consider languages of the form ∈ Σ, we consider two cases: Σ* . If = 1 . . . , with 1. ∈ * ; in this case Σ* = [, ∞). 2. ̸∈ * ; in this case let ℎ be the first index from the left such that ℎ = ̸= , that is, = ℎ− 1 ℎ+1 . . . , with ̸= . Define the finite word ′ as ′ = ( + 1) ℎ+1 . . . . Then Σ* = [, ′).

Suppose ℒ ∈ DEF. Then there are finite sets = { 1, . . . }, = { 1, . . . , } such that ℒ = ∪ Σ* and, thus, ℒ = { 1} ∪ · · · ∪ { } ∪ Σ* 1 ∪ · · · ∪ Σ* . Since { } = [ ], for all , by the previous result ℒ is a union of a finite number of intervals in (Σ* , <) having finite words or ∞ as bounds.

Conversely, if ∈ Σ* , we first prove that [, ∞) ∈ DEF. If = ∈ * , then [, ∞) = Σ* ∈ DEF. If ̸∈ * , then let ′ be defined as in (2) above. Then, starting from , we can reach a word in ∈ * in a finite number of iterations of (· )′, obtaining [, ∞) = [, ′) ∪ [ ′, ′′) ∪ . . . ∪ [ , ) = Σ* ∪ Σ* ′ ∪ · · · ∪ Σ* , proving that [, ∞) ∈ DEF. Since [, ∞) ∈ DEF, its complement [, ) is in DEF as well and the same holds for [, ] = [, ) ∪ { } and (, ∞) = [, ∞) ∖ { }. Finally, if < then (, ) = (, ∞) ∩ [, ) is in DEF, together with all its variations obtained by closing the interval to the left or to the right. Since DEF is closed under finite unions, this proves that a union of a ifnite number of intervals with finite or ∞ as bounds is in DEF.

Remark 4. Notice that the usual definition of DEF does not depend on a fixed order of the alphabet. The previous lemma tells us that if ℒ ∈ DEF then, whatever the order < is, we can decompose ℒ as a union of intervals, but these intervals change if we change the order. E.g. if Σ = {, } and < then Σ* = [, ), while, if < then Σ* = [, ∞).

In the previous lemma we proved that languages in DEF are those of the form ⋃︀ =1 where are open/closed intervals with bounds , ∈ Σ* ∪ {∞}. We call them intervals with finite bounds . We now prove that, by allowing bounds to be also periodic infinite words, we get exactly the class of regular Wheeler languages ℒ with Pref(ℒ) = Σ* .

Definition 4.3. The class DEFlim is the class of regular languages which are finite unions of intervals in (Σ+, <) with finite or periodic words as bounds 4.

By Lemma 10 we have DEF ⊆

DEFlim. 3Using the embedding introduced in [ 11 ] one such word would map to a periodic rational. 4Notice that if = Σ then we can use instead of ∞ as bound of an interval.

Lemma 11. The class DEFlim is closed under boolean operations. Moreover, DEFlim ⊆

Proof. The complement of an interval is either an interval or a union of two intervals. The intersection of a finite number of intervals is an interval. Hence, the complement of a finite union of intervals is a ifnite unions of intervals. Clearly, the union of two sets that are finite union of intervals is still a finite union of intervals, hence DEFlim is closed under boolean operations.

Suppose ℒ = ⋃︀=1 , where the ’s are intervals. We first prove that ℒ is regular. We have already proved that intervals with finite bounds are regular. Consider an interval of type ( , ∞). Then ( , ∞) = { ∈ Σ* : < } = Σ* 1 ∪ Σ* 2 * , with 1 = { ∈ Σ* : | | ≤ | |, < } , 2 = { ∈ Σ* : | | ≤ | |, < }. Hence ( , ∞) is regular. Since any other interval with bounds which are finite or periodic can be constructed from intervals of this form or with finite bounds using boolean operations, the claim follows.

We are now left to prove that the regular language ℒ = ⋃︀ =1 is Wheeler. If ℒ is finite, this is obvious. If ℒ is not finite, then we use Lemma 2. Assume, by way of a contradiction, that there exists a monotone sequence ( )∈ which visit alternatively two diferent states , ′ of the minimum DFA accepting ℒ. Let ∈ Σ* be such that (, ) ∈ , (′, ) ̸∈ (including the case (′, ) = ⊥), and consider the monotone sequence ( )∈. This sequence is in ℒ infinitely often, hence there must be an interval that contains an infinite number of elements of ( )∈. Notice that, by definition, an interval contains all words of Σ* between its bound words. Hence, all elements of the sequence ( )∈ will be eventually in and, as a consequence, in ℒ, contradicting the fact that reaches ′ for infinitely many ’s and (′, ) ̸∈ .

Remark 5. In Lemma 11 we have just proved that any finite union of intervals in Σ* , with finite or periodic words as bounds, is a Wheeler language. It is not possible to generalize this result to intervals in Pref(ℒ), e.g. the language ℒ = 13* 1 ∪ 23* 2 ∪ {2} is the union of two intervals in Pref(ℒ), that is, ℒ = { ∈ Pref(ℒ) : 11 ≤ < 31 } ∪ { ∈ Pref(ℒ) : 2 ≤ < 32 } (notice that e.g. 11 < 21 < 31 but 21 ̸∈ Pref(ℒ)), but ℒ is not Wheeler.

Remark 6. There is an important diference, when Wheelerness is concerned, between intervals with ifnite or ∞ bounds and intervals with periodic bounds. If a language ℒ is a finite union of classical intervals and we change the order, then ℒ is still a finite union of classical intervals w.r.t. the new order (because it is in DEF, see Remark 4). In other words, the class DEF is not sensitive to the underlying order of the alphabet: although our characterization seems to depend from the order, the usual definition of the class does not. This independence does not hold for intervals with periodic bounds. E.g. consider the alphabet Σ = {1, 2, 3} and the language ℒ = Σ* 12* , which is in DEFlim if the order is the standard 1 < 2 < 3 because Σ* 12* = [1, 2). In this order there are two infinite sequences of words converging, from below and above, respectively, to 2—namely, 12 and 32, for ∈ Σ* and ∈ N+.

The two sequences, however, must reach diferent states (one accepting and the other not accepting) in the automaton recognizing ℒ, and cannot be merged into a unique monotone sequence: hence, there is no violation of Lemma 2 . If, instead, we consider the order 1 < 3 < 2, the language is not Wheeler because with this order the two sequence above can be merged into a unique monotone sequence 12 < 32 < 122 < 322 < · · ·

< 12 < 32 < 12+1 < 32+1 . . . violating Lemma 2. Notice that, if the order is 1 < 3 < 2, then Σ* 12* is not an interval in Σ* , because 1, 12 ∈ Σ* 12* while 13 ̸∈ Σ* 12* , although 1 < 13 < 12 holds.

Lemma 12 ([ 11 ]). Suppose Pref(ℒ) = Σ* . Then ℒ ∈ W(<) ⇒ ℒ ∈ DEFlim.

Proof. This result was already proved in [ 11 ]. In this paper it is proved that, given a Wheeler DFA, the set of words ending in any given state constitute an interval of Pref(ℒ) bounded by finite or periodic words. If ℒ is Wheeler and Pref(ℒ) = Σ* , then there exists a WDFA recognizing ℒ and ℒ is a union of the Σ* -interval corresponding to final states.

Using Lemma 7 and Lemma 12 we finally obtain the following.

Theorem 13. Let ℒ be a regular language. Then

ℒ, ℒ ∈ W(<) ⇔ ℒ ∈ 1↑RDEF ∪ DEFlim Proof. (⇒) If either Pref(ℒ) or Pref(ℒ) are diferent from Σ* then ℒ, ℒ are both in 1↑RDEF by Lemma 7 and Lemma 5. If Pref(ℒ) = Pref(ℒ) = Σ* , then ℒ, ℒ are both in DEFlim by Lemma 12 and Lemma 11.

(⇐) If ℒ ∈ 1↑RDEF then ℒ ∈ 1↑RDEF and both ℒ and ℒ are Wheeler according to Lemma 6. If ℒ ∈ DEFlim then ℒ ∈ DEFlim and both ℒ and ℒ are Wheeler by Lemma 11.

Notice that 1↑RDEF ∩ DEFlim is not limited to finite and cofinite languages. Indeed, the language ℒ of Example 4.1 belongs to the intersection 1↑RDEF ∩ DEFlim. In fact, it belongs to 1↑RDEF because ℒ = 1↑({2, 12}{1, 2}* + ) and belongs to DEFlim by Lemma 12, since it is Wheeler and Pref(ℒ) = Σ* . Moreover, notice that Lemma 7 and Lemma 12 fail to give a complete characterization of Wheeler language only for the language ℒ such that ℒ ∈ W(<), ℒ ̸∈ W(<), and Pref(ℒ) ̸= Σ* , Pref(ℒ) = Σ* .

5. Conclusions

In this paper we proved that the class of Wheeler languages ℒ whose complement ℒ is also Wheeler can be fully characterized and shows interesting features. On the one hand, such class is best described starting from the classic classes DEF and RDEF: any ℒ ∈ DEF ∪ RDEF is certainly Wheeler with complement also Wheeler. On the other hand, however, both DEF and RDEF need a “closer look” for a complete characterization: both must be extended to capture (exactly) Wheeler languages with Wheeler complement. For any ℒ ∈ RDEF also 1↑ℒ is Wheeler with Wheeler complement. Moreover, and probably more interestingly, the class DEF must also be extended to a class DEFlim that can be seen as a partition of Σ* into a collection of open/closed intervals. Both cases need further investigation, in particular in view of the possibility of relativizing this analysis to the (much more general) case of partial orderings of states.

Acknowledgments

The authors thank Ruben Becker and Nicola Prezza for helpful discussions and comments. Funding. Giuseppa Castiglione is Supported by Project “ACoMPA” (CUP B73C24001050001) funded by the NextGeneration EU programme PNRR MUR M4 C2 Inv. 1.5 – Project ECS00000017 Tuscany Health Ecosystem (Spoke 6), CUP Master B63C22000680007.

Declaration on Generative AI The authors have not employed any Generative AI tools.

[1]

Gagie , G. Manzini,

Sirén , Wheeler graphs: a framework for BWT-based data structures , Theoretical Computer Science 698 ( 2017 ) 67 - 78 . doi: 10 .1016/j.tcs. 2017 . 06 .016, algorithms, Strings and Theoretical Approaches in the Big Data Era (In Honor of the 60th Birthday of Professor Rafaele Giancarlo) .

[2]

Alanko ,

Cotumaccio ,

Prezza , Linear-time minimization of wheeler dfas , in: 2022 Data Compression Conference (DCC) , 2022 , pp. 53 - 62 . doi: 10 .1109/DCC52660. 2022 . 00013 .

[3]

Alanko , G. D'Agostino , A.

Policriti , N.

Prezza , Wheeler languages, Inf. Comput . 281 ( 2021 ) 104820 . URL: https://doi.org/10.1016/j.ic. 2021 . 104820 . doi: 10 .1016/J.IC. 2021 . 104820 .

[4]

Becker ,

Cenzato ,

S.-H.

Kim ,

Kodric ,

Policriti ,

Prezza , Optimal wheeler language recognition , in: International Symposium on String Processing and Information Retrieval , Springer, 2023 , pp. 62 - 74 .

[5]

Cotumaccio , G. D'Agostino , A.

Policriti , N. Prezza,

Co-lexicographically ordering automata and regular languages - part I, J . ACM 70 ( 2023 ) 27 : 1 - 27 : 73 . doi: 10 .1145/3607471.

[6]

Castiglione ,

Restivo , Completing wheeler automata , in: U. de'Liguoro , M. Palazzo , L. Roversi (Eds.), Proceedings of the 25th Italian Conference on Theoretical Computer Science , Torino, Italy, September 11-13 , 2024 , volume 3811 of CEUR Workshop Proceedings, CEUR-WS.org , 2024 , pp. 120 - 132 . URL: https://ceur-ws. org/ Vol- 3811 /paper060.pdf.

[7]

Becker , G. Castiglione, G. D'Agostino , A.

Policriti , N.

Prezza , A.

Restivo , B.

Riccardi , Universally wheeler languages, CoRR abs/2504 .19537 ( 2025 ). URL: https://doi.org/10.48550/arXiv.2504.19537. doi: 10 .48550/ARXIV.2504.19537. arXiv: 2504 . 19537 .

[8]

S. C.

Kleene , Representation of Events in Nerve Nets and Finite Automata , Princeton University Press, Princeton, 1956 , pp. 3 - 42 . doi:doi:10.1515/ 9781400882618 - 002 .

[9]

J. A.

Brzozowski ,

Li ,

Liu , Syntactic complexities of six classes of star-free languages , J. Autom. Lang. Comb . 17 ( 2012 ) 83 - 105 . doi: 10 .25596/JALC-2012-083.

[10]

Sakarovitch , Elements of automata theory, Cambridge university press, 2009 .

[11]

Manzini ,

Policriti ,

Prezza ,

Riccardi , The rational construction of a wheeler DFA , in: 35th Annual Symposium on Combinatorial Pattern Matching, CPM 2024 , June 25-27, 2024 , Fukuoka, Japan, volume 296 of LIPIcs, Schloss Dagstuhl - Leibniz-Zentrum für Informatik , 2024 , pp. 23 : 1 - 23 : 15 . doi: 10 .4230/LIPICS.CPM. 2024 . 23 .