1. Introduction

Completing Wheeler Automata⋆

Giuseppa Castiglione

Antonio Restivo

0 0 Dipartimento di Matematica e Informatica, Università degli studi di Palermo , Palermo , Italy

We consider the problem of embedding a Wheeler Deterministic Finite Automaton (WDFA, in short) into an equivalent complete WDFA, preserving the order of states and the accepted language. In some cases, such a complete WDFA does not exist. We say that a WDFA is Wheeler-complete (W-complete, in short) if it cannot be properly embedded into an equivalent WDFA. We give an algorithm that, given as input a WDFA , returns the smallest W-complete WDFA containing : it is called the W-completion of . We derive some interesting applications of this algorithm concerning the construction of a WDFA for the union and a WDFA for the complement of Wheeler languages.

Wheeler Automata Complete Automata Boolean Operations

1. Introduction

the one with the minimal number of states is unique. We call it the Wheeler completion of and we denote it by ().

The main contribution of this paper is a completion algorithm that, having as input a WDFA

, returns its Wheeler completion (). In the case () is a complete WDFA we say that is completable.

We further consider some relevant applications of this completion algorithm. In fact, there

are some important constructions in automata theory that require the automata to be complete.

This is the case of boolean operations. According to the fact that WDFAs are not in general complete, the family of Wheeler languages is closed under intersection, but it is not closed neither under complementation nor under union (cf. [9]). In the second part of the paper we use the completion algorithm to construct, under suitable condistions, a WDFA that recognizes the complement of a Wheeler language and a WDFA that recognizes the union of two Wheeler languages. This approach is alternative to the one proposed in [12]. 2. Preliminaries and notations If Σ is a finite alphabet, with Σ* we denote the set of finite words on Σ. If ⊆ Σ* , with Pref ()

we denote the set of all prefixes of words in , Pref () = { ∈ Σ* | ∃ ∈ Σ* s.t. ∈ }. A deterministic finite automaton (DFA) is a quintuple = (, Σ, , , ) where is a finite set of states, Σ is a finite alphabet, : × Σ → is the transition function, eventually a partial function, is the initial state and ⊆ is the set of final states. We denote by * the generalized transition function defined on the words of Σ* . If is a total function the automaton is complete whereas if is a partial function the automaton is incomplete. If (, ) is not defined for some ∈ and ∈ Σ we write (, ) = □ and we say that (, ) is a missing transition.

We denote by () the language accepted by . It is well-known that two automata and ℬ are equivalent if () = (ℬ). If we denote () = { ∈ Σ* | * (, ) ∈ }, a state is a sink state (or empty state) if () = ∅, is a coaccesible state if () ̸= ∅. For any ∈ and ∈ Σ we define () = { ∈ Σ (, ) = , for some ∈ }. In what follows we consider | only automata in which all the states are accessible i.e. can be reached from , and such that () = ∅. An automaton is said input consistent if |()| = 1, for each ∈ ∖ {}.

Let (, ≤ ) a total order on , for any , ∈ we write < if ≤ and ̸= . Two elements , ∈ are consecutive if < and does not exist any ∈ such that < < .

If ⊆ we say that we extend the order ≤ on if we define a total order ≤ ′ on such

that ≤ is the restriction on of ≤ ′. In what follows we will say that (, ≤ ) is a restriction of (, ≤ ′) and (, ≤ ′) is an extension of (, ≤ ).

A Wheeler DFA (WDFA) is an input consistent DFA = (, Σ, , , ), with Σ a total ordered alphabet, () = ∅ and there exists in a total order ≤ such that • the initial state is the minimum; • let 1 = (1, 1) and 2 = (2, 2), 1, 2, 1, 2 ∈ and 1, 2 ∈ Σ; if 1 < 2 then 1 < 2; if 1 = 2 and 1 ≤ 2 then 1 ≤ 2; (a) (b)

As a consequence, if 1 < 2 then, for all , ∈ , if () = { 1} and () = { 2} then < . We say that a regular language is a Wheeler language if it is recognizable by a WDFA. It is well known that Wheeler languages are star-free languages (cf. [9]). In the original definition, all the states of a Wheeler DFA are supposed to be coaccessible. However, since we are interested in completing a WDFA, it is quite natural to add some sink states while maintaining the Wheeler property regarding the order. For this reason, in what follows, we assume that a WDFA can also have non-coaccessible states. Note that, by allowing for non-coaccessible states the automaton remains well-defined.

We conclude the preliminary section with the definition of inclusion among automata which is fundamental in this paper.

Let = = (1, Σ, 1, 1, 1) and ℬ = (2, Σ, 2, 2, 2) two equivalent automata, we say that is included in ℬ (in symbols ⊆ ℬ ) if the transition graph of is a subgraph of the transition graph of ℬ, 1 = 2 and 1 = 2.

3. Completion of Wheeler Automata.

Here we face the problem of completing a Wheeler automaton by an extension of the original order and preserving the language recognized by the automaton. Given a WDFA , the goal is to find, when it exists, an equivalent complete Wheeler automaton ℬ such that ⊆ ℬ and the order of states of ℬ is an extension of the one in . First, recall that any incomplete DFA can be completed by adding an empty state to which all the missing transitions converge. Such an operation does not ensure that we obtain a Wheeler automaton, as showed in the following example.

Example 1. Let us consider the automaton in Figure 1(a), it is a WDFA (with < and < < ) that accepts the language + and it is not complete. By adding a single sink state for each letter as in Figure 1(b) we get the minimal input-consistent complete equivalent automaton. Note that it is not a WDFA indeed if < then < , a contradiction, because () = {} and () = {}. Whereas, if > then > , a contradiction because is the minimum. In Figure 2 an equivalent complete Wheeler automaton is depicted with states < < < < < . Note that the order of the set of states is an extension of the first one, and three sink states have been added.

The previous example emphasizes two issues. On one hand, the classical procedure of

completing an automaton by adding only one sink state (or one for each letter, for the inputconsistency) could produce a DFA that is no more a WDFA. On the other hand, the automaton can, in some cases, be completed by adding more than one empty state in order to preserve the Wheeler property. In some cases it is not possible to complete a Wheeler automaton by maintaining the Wheeler property, as the following example shows.

Example 2. The automaton in Figure 3(a) is a Wheeler automaton (with < and < < ) that accepts the language +. In [9] the authors give such a language as an example of Wheeler language for which there is not any complete Wheeler automaton that recognizes it. To prove this fact they use some properties of the co-lexicographic order on (()).

Given a WDFA , although there does not always exist a complete WDFA that contains ,

nevertheless among the WDFAs that contain there always exists a maximal one. This leads to the following definition.

Definition 1. Let be a Wheeler automaton. is Wheeler-complete (shortly W-complete) if for any equivalent Wheeler automaton ℬ if ⊆ ℬ then = ℬ.

The following theorem gives a characterization of W-complete automata.

(a) (b) Theorem 1. Let = (, Σ, , , ) be a WDFA. is W-complete if for any missing transition (, ), with ∈ , ∈ Σ, there exist , ∈ with < < such that (, ) = (, ) and it is a coaccessible state.

Let be a WDFA. Denote by the set of states of and suppose that all the elements of

are coaccessible. Let ℬ = (, Σ, , , ) be a W-complete WDFA such that ⊆ ℬ . One has that = ∪ , where is the set of sink states of ℬ. We say that two states , are consecutive in (resp. in ∪ ) if < and does not exist any ∈ (resp. ∈ ∪ ) such that < < .

We say that ℬ has a minimal number of sink states, if for any other W-complete WDFA ℬ′ containing , ℬ′ has a number of states greater than or equal to that of ℬ.

Lemma 1. Let be a WDFA with = {1, 2, . . . , } all coaccesible states. Let ℬ = ( ∪ , Σ, , , ) be a W-complete DFA such that ⊆ ℬ with the minimal number of sink states. Two sink states , ∈ are consecutive in ∪ if there exist two consecutive letters < such that (, ) = and (1, ) = .

From the Wheeler property and previous Lemma we can infer as a corollary the following lemma.

Lemma 2. Let and be two consecutive states of ∪ . If (, ) and (, ) are sink states, for some ∈ Σ, then (, ) = (, ).

By using Lemma 1 and Lemma 2 we can prove the following Theorem.

Theorem 2. For each WDFA = (, Σ, , , ) with all states coaccessible there exists a unique W-complete DFA ℬ with minimal number of states such that ⊆ ℬ . We call it the Wheeler completion (shortly W-completion) of and we denote it by ().

Here we give an upper bound on the number of states of the W-completion automaton.

Theorem 3. Let = (, Σ, , , , ≤ ) a WDFA with states and over a letter alphabet. The W-completion () has at most 2 + − 2 states.

Remark that a W-complete automaton is not in general complete. See, for instance, the WDFA

in 3(b) is the W-completion of the automaton in Figure 3(a), but it is not complete. Moreover, we say that a Wheeler automaton is completable if () is complete.

4. An algorithm for the W-completion of a Wheeler DFA

Let = (, Σ, , 1, ) be a solo DFA, where = {1, 2, . . . , } is a totally ordered set of states.

In the sequel it is convenient to represent the transition function of the DFA as transformations of the set of states, i.e. a partial mapping of into itself (cf. for instance [13]). For each ∈ Σ, the transformation is defined for ∈ as () = (, ). If (, ) is not defined , we write () = □ and we say that () is a missing -transition (or a -hole). Hence, an arbitrary partial transformation can be written in the form = 1

2 · · · 1 2 · · · − 1 − 1 )︂ () = * (, ). (1, 2, . . . , ) composed by the elements diferent from □ . where = () and ∈ ∪ {□ }, for 1 ≤ ≤ . We denote by the subsequence of

For each word ∈ Σ* , the transition function defines a transformation of : for all ∈ ,

With this representation, the property that the DFA, over a totally ordered alphabet Σ, is a

WDFA corresponds to the following three conditions:

• For each ∈ Σ, 1 ∈/ ; • For each ∈ Σ, is a non-decreasing sequence; • Denoted by ( ) and ( ) the first and the last element of , respectively. If < , then ( ) < ( ).

The notion of interval of missing transitions (or interval of holes) plays an important role in our construction. For , ∈ with − ≥

2, by , we denote the internal interval , = { ∈ | < < }. Remark that, in our notation, the intervals , do not contain the endpoints and . Further, we denote by 0, and ,+1 the left and right intevals: 0, = { ∈ | < } and ,+1 = { ∈ | > }. We say that , , is an interval of missing -transitions if for each ∈ , , () = □ and (), () ̸= □ . We denote it by (, ). In a similar way, we define the left interval of -holes (0, ) and the right interval of -holes (, + 1). defined by the following transformations: Example 3. Consider the WDFA in Figure 4 over the alphabet {, }. The transition function is ( 0, 3 ) = {1, 2}.

Then = ( 2, 3, 4 ) and = (5, 5, 6, 6) and the interval of missing -transitions (-holes) are ( 2, 5 ) = {3, 4} and (5, 7) = {6}. The (unique) interval of missing -transition (-holes) is

Now we give an algorithm that, receiving as input a WDFA with all coaccessible states

computes (). It works by adding some sink states and transitions as much as possible to fill as much holes as possible by maintaining the Wheeler property. In the execution of the algorithm we will deal with WDFA = (, Σ, , 1, ) in which the set of states is the union = ∪ , where is the set of coaccesible states and is the set of the sink states.

The elements of are denoted by the integers {1, 2, . . . } and the elements of by rational

non-integer numbers with 1 < < + 1. The total order in the elements of = ∪ is the order of the rational numbers. In the input WDFA , we have = ∅ and = .

At every step, in the run of the algorithm, we update only the set and the transition function . The alphabet Σ, the set of coaccessible states and the set of final states remain unchanged. The goal is to replace the missing transitions with proper transitions (maintaining the Wheeler property), hence we add a new sink state, when needed, and replace all the missing transitions in the interval with proper transitions converging to such a sink state. The original transitions remain unchanged.

If we refer to the missing transitions has ’holes’, the above replacements are called filling the holes. More in detail, we distinguish two kinds of interval of holes. For each ∈ Σ, the interval of holes (, ) is said to be of integer type if both () and () are integers, i.e. elements of . Similarly, (0, ) ( (, + 1)) is said to be of of integer type if () (resp. ()) is an integer. The intervals of holes that are not of integer type are said to be of non-integer type.

Finally, an interval of holes (, ) of integer type is said to be blocking if () = (). All other intervals of holes are non-blocking.

The basic operations we consider consist of filling the holes and are described below in detail.

Fill internal and right intervals of -holes of integer type.

Let (, ), or (, + 1), be an interval of -holes of integer type, • Update by creating a new state = () + 0.5:

:= ∪ {}; • Update :

For all < < , () = ;

For all ∈ Σ, () = □ .

Fill left intervals of -holes of integer type.

Let (0, ) be an interval of -holes of integer type, • Update by creating a new state = () − 0.3:

:= ∪ { }; • Update :

For all < , () = ;

For all ∈ Σ, ( ) = □ .

Fill intervals of -holes of non-integer type.

In this case we do not distinguish between internal, right and left intervals of -holes. We denote by this interval. By hypothesis, one of the endpoints (say ) of is such that () is not an integer, i.e. it is an element of (a sink state). In this case, we do not add states to , but we update only the transition function as follows: holes are created, which, in turn, are filled by iterating the procedure.

The procedure stops when either all holes disappear (in such a case we obtain a complete automaton) or only blocking intervals of holes remains (in this second case we obtain a Wcomplete automaton which is not complete) (cf. Theorem 1).

tion functions: Example 4. Let = (, Σ, , 1, , ≤ ) with = {1, 2, . . . , 6} depicted in Figure 4 and transi

By filling ( 2, 5 ), (5, 7) and ( 0, 3 ) three sink states (3.5, 4.5 and 4.7) are added and the transition function is updated as follows: • Update :

For all ∈ , () = ().

By the description of previous operations and by Theorem 1 one can infer the following theorem.

By filling

( 4, 5 ) one more state is added (the sink state 5.5) and by filling all the intervals of non-integer type the transition function is updated as follows: By filling the last interval of non-integer type, we get the following W-complete automaton also depicted in Figure 5.

Theorem 4. Let = (, Σ, , 1, , ≤ ) be a WDFA with all coaccesible states. The Algorithm constructs ().

Figure 5 shows the W-completion where the new sink states and new transitions are dashed.

The W-completion () = ( ∪ , ′, 1, ) contains both coaccessible and sink states, hence the following definition makes sense. We define Dom() as the set of words that can be read by () from the initial state. More formally, Dom() = { ∈ Σ* | ′* (1, ) ̸= □ }. For instance, if is the WDFA of Example 2, Dom() = Pref (++ + +). The following inclusions hold: () ⊆

Pref (()) ⊆

Dom()

Moreover, we have that () is complete if Dom() = Σ* . Such a concept is crucial for defining the operations on Wheeler automata described in the next section.

5. Operations on Wheeler automata. We start this section by recalling some basic constructions in theory of automata.

Let = (, Σ, , , ) be a DFA and let () be the language recognized by . Let = (, Σ, , , ∖ ). If is a complete DFA then recognizes the complement of (), i.e. () = Σ* ∖ ().

Let 1 = (1, Σ, 1, 1, 1) and 2 = (2, Σ, 2, 2, 2) be two DFAs over the same alphabet Σ, recognizing respectively the languages (1) and (2). The cartesian product of 1 and 2 is the DFA 1 × 2 = (, Σ, , , ), where: • = 1 × 2, • = (1, 2), • ((1, 2), ) = ( 1(1, ), 2(2, )), with (1, 2) ∈ and ∈ Σ.

If = 1 × 2 then recognizes the intersection of (1) and (2), i.e () = (1) ∩ (2). Whereas, if = (1 × 2) ∪ (1 × 2) and 1 and 2 are complete DFAs, then recognizes the union of (1) and (2), i.e () = (1) ∪ (2).

Remark 1. The construction of the DFA for the intersection does not require the DFA 1 and 2 to be complete. On the contrary, with the completeness hypothesis the constructions relative to the complement and the union always works.

In the case of Wheeler languages, we are dealing with automata that are not, in general,

complete then the above constructions could fail for WDFAs. Indeed, the class of Wheeler languages is closed under intersection, but it is not closed under union and complementation.

In the following, we give a procedure for the complementation and a procedure for the union

of Wheeler languages. The basic idea in both constructions is the following: first apply to the input WDFA the completion algorithm given in the previous section; then apply to the output of the completion algorithm the classical constructions for the complement and the union.

If the W-completion is a complete WDFA, we are able to construct WDFAs both for the complement and for the union. If not, some special cases are considered.

5.1. The Complement construction a Wheeler language and it is such that Let = (, Σ, , 1, ) a WDFA. We compute the W-completion () = ( ∪ , ′, 1, ). We, then, construct the automaton = ( ∪ , Σ, ′, 1, ( ∪ ) ∖ ). The language () is () = Dom() ∖ ().

If () is complete then () = (). recognizing the language + Example 5. Let us consider the transition function of the Wheeler automaton in Figure 1(a) The W-completion () is the following:

It is the complete automaton (Dom() = Σ* ) in Figure 2 with, = 1, = 2, = 2.5, = 2.7, = 3 and = 3.7. Hence the complement of the Wheeler language + is a Wheeler language. 5.1, point 5, of [9], where the Wheelereness of (()) ∖ () is considered.

If () is not complete (i.e. is not completable) () depends on , () = Dom() ∖ () is a subset of (). Remark that this result extends the one stated in Lemma Example 6. Let us consider the transition function of the Wheeler automaton in Figure 3(a) recognizing the language + 3 )︂ 2

3 )︂ ︂( 1 3 □ 3 )︂ 3

It cannot be completed (cf. Example 2) but it has a W-completion as follows. language, as shown in the following example.

It is known that the Wheeler language + is not recognized by any complete WDFA hence any WDFA recognizing it is not completable. On the other hand, it can occur that an automaton is not completable but recognizes a Wheeler language whose complement is a Wheeler Example 7. Let us consider the language + . It is a Wheeler language because it is finite and its complement is a Wheeler language because it is cofinite. The following is the transition function of a Wheeler automaton that recognizes it and is not completable 5.2. The Union construction (2) of the automaton 2. alphabet Σ, recognizing respectively the languages (1) and (2).

Let 1 = (1, Σ, 1, 1, 1) and 2 = (2, Σ, 2, 2, 2) be two WDFAs over the same

We first construct the W-completion (1) of the automaton 1 and the W-completion Wheeler conditions.

Let ′1 = 1 ∪ 1 the set of states of (1) and ′2 = 2 ∪ 2 the set of states of (2).

We construct the automaton (1) ×

(2), as in the classical way. More precisely, we consider only the accessible and coaccessible part of (1) × (2) i.e. we consider accessible pairs (, ) such that at least one of the two states is coaccessible. Moreover, we choose as set of final states the set of accessible states of (1 × ′2) ∪ (′1 × 2).

If (1) and (2) are complete, the automaton (1) × (2) is a WDFA indeed, given two states (1, 2) and (1, 2) of (1)× (2) we have that 1 ≤ 1 ⇐⇒ 2 ≤ 2, since the co-lexicographic order over the words corresponds to the total order between the states.

By this remark, one can define a total order on the states of (1) × (2) satisfying the If one of the automata (), with ∈ {1, 2}, is not complete, it happens that, for some

state ∈ ′ and some letter

∈ Σ, the transition ′(, ) = □ is a missing transition. We can consider □ as new special sink state, that cannot be placed in order relation with other

In this case, the cartesian product (1) × (2) contains some states of type (, ), but also some states of type (, □ ) and (□ , ) and (□ , □ ). Finally, let us define ((, □ ), ) = ( 1′(, ), □ ), for any ∈ 1 and ∈ Σ and ((□ , ), ) = (□ , 2′(, )), for any ∈ 2 and

If only states of the form (, ) and (′, □ ) appear (in such accessible part), we are able to order the pairs with diferent first component: (, □ ) < (′, □ ) ⇐⇒ < ′ and (, ) < (′, □ ) ⇐⇒ < ′, as showed in the following example. In a similar way we are able to order states of the form (, ) and (□ , ′).

Example 8. Let 1 be the WDFA (with < ) recognizing the language + and (1) its W-completion, as in the Example 5, and let 2 be the WDFA recognizing the language + and (2) its W-completion, as in the Example 6. As Figure 6 shows, the union procedure in this case gives an automaton that contains only comparable pairs hence it is a Wheeler automaton that recognizes + ∪ +.

Theorem 5. Let 1 = (1, Σ, 1, 1, 1) and 2 = (2, Σ, 2, 2, 2) be two WDFA over the same alphabet Σ. If (1) ⊆ Dom(2) and (2) ⊆ Dom(1) then (1) ∪ (2) is a Wheeler language.

Remark that Theorem 5 gives only a suficient condition for the union. Example 8 shows that the construction relative to the union works also under more general conditions. An interesting open problem is to find necessary and suficient conditions for the constructions relative to the union and the complement of Wheeler languages.

Würzburg, Germany, February 25-27, 1993, Proceedings, volume 665 of Lecture Notes in Computer Science, Springer, 1993, pp. 333–342. [5] M. P. Schützenberger, A remark on incompletely specified automata, Inf. Control. 8 (1965) 373–376. [6] T. Gagie, G. Manzini, J. Sirén, Wheeler graphs: A framework for bwt-based data structures,

Theor. Comput. Sci. 698 (2017) 67–78. [7] D. Gibney, S. V. Thankachan, On the complexity of recognizing wheeler graphs, Algorithmica 84 (2022) 784–814. [8] N. Cotumaccio, N. Prezza, On indexing and compressing finite automata, in: D. Marx (Ed.), Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms, SODA 2021, Virtual Conference, January 10 - 13, 2021, SIAM, 2021, pp. 2585–2599. [9] J. Alanko, G. D’Agostino, A. Policriti, N. Prezza, Wheeler languages, Information and

Computation 281 (2021) 104820. [10] N. Cotumaccio, G. D’Agostino, A. Policriti, N. Prezza, Co-lexicographically ordering automata and regular languages - part I, J. ACM 70 (2023) 27:1–27:73. [11] G. D’Agostino, D. Martincigh, A. Policriti, Ordering regular languages and automata:

Complexity, Theor. Comput. Sci. 949 (2023) 113709. [12] L. Egidi, F. A. Louza, G. Manzini, Space eficient merging of de bruijn graphs and wheeler graphs, Algorithmica 84 (2022) 639–669. [13] J. A. Brzozowski, B. Li, D. Liu, Syntactic complexities of six classes of star-free languages, J. Autom. Lang. Comb. 17 (2012) 83–105.

[1]

M.-P.

Béal ,

Lombardy ,

Perrin , Embeddings of local automata, Illinois Journal of Mathematics 54 ( 2010 ) 155 - 174 .

[2]

J. J.

Ashley ,

B. H.

Marcus ,

Perrin ,

Tuncel , Surjective extensions of sliding-block codes , SIAM J. Discret. Math. 6 ( 1993 ) 582 - 611 .

[3]

Ehrenfeucht , G. Rozenberg, Each regular code is included in A maximal regular code , RAIRO Theor. Informatics Appl . 20 ( 1986 ) 89 - 96 .

[4]

Montalbano , Local automata and completion , in: P. Enjalbert , A. Finkel , K. W. Wagner (Eds.), STACS 93, 10th Annual Symposium on Theoretical Aspects of Computer Science,