=Paper=
{{Paper
|id=Vol-1885/40
|storemode=property
|title=On h-Lexicalized
Automata and h-Syntactic Analysis
|pdfUrl=https://ceur-ws.org/Vol-1885/40.pdf
|volume=Vol-1885
|authors=Martin Plátek,Friedrich Otto,František Mráz
|dblpUrl=https://dblp.org/rec/conf/itat/PlatekOM17
}}
==On h-Lexicalized
Automata and h-Syntactic Analysis==
J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 40–47 CEUR Workshop Proceedings Vol. 1885, ISSN 1613-0073, c 2017 M. Plátek, F. Otto, F. Mráz On h-Lexicalized Automata and h-Syntactic Analysis Martin Plátek1 , F. Otto2 , and F. Mráz1 ∗ 1 Charles University, Department of Computer Science Malostranské nám. 25, 118 00 PRAHA 1, Czech Republic martin.platek@mff.cuni.cz, frantisek.mraz@mff.cuni.cz 2 Fachbereich Elektrotechnik/Informatik, Universität Kassel D-34109 Kassel, Germany otto@theory.informatik.uni-kassel.de Abstract: Following some previous studies on list auto- and basic and h-proper languages, this new model is bet- mata and restarting automata, we introduce a generalized ter suited for modeling (i) lexicalized syntactic analysis and refined model – the h-lexicalized restarting list au- (h-syntactic analysis) and specially (ii) (lexicalized) anal- tomaton (LxRLAW). We argue that this model is useful ysis by reduction of natural languages (compare [6, 7]). for expressing transparent variants of lexicalized syntactic Analysis by reduction is a technique for deciding the analysis, and analysis by reduction in computational lin- correctness of a sentence. It is based on a stepwise sim- guistics. We present several subclasses of LxRLAW and plification by reductions preserving the (in)correctness of provide some variants and some extensions of the Chom- the sentence until a short sentence is obtained for which sky hierarchy, including the variant for the lexicalized syn- it is easy to decide its correctness. Restarting automata tactic analysis. We compare the input languages, which were introduced as an automata model for analysis by re- are the languages traditionally considered in automata the- duction. While modeling analysis by reduction, restarting ory, to the so-called basic and h-proper languages. The automata are forced to use very transparent types of com- basic and h-proper languages allow stressing the trans- putations. Nevertheless, they still can recognize a proper parency of h-lexicalized restarting automata for a super- superset of the class of context-free languages (CFL). The class of the context-free languages by the so-called com- first observation, which supports our argumentation, is that plete correctness preserving property. Such a type of trans- LxRLAW allow characterizations of the Chomsky hierar- parency cannot be achieved for the whole class of context- chy for classes of languages and for h-syntactic analysis free languages by traditional input languages. The trans- as well. parency of h-lexicalized restarting automata is illustrated An LxRLAW M is a device with a finite state control by two types of hierarchies which separate the classes of and a read/write window of a fixed size. This window can infinite and the classes of finite languages by the same move in both directions along a tape (that is, a list of items) tools. containing a word delimited by sentinels. The LxRLAW- automaton M uses an input alphabet and a working alpha- bet that contains the input alphabet. Lexicalization of M is 1 Introduction given through a morphism which binds the individual sym- bols from the working alphabet to symbols from the input Chomsky introduced his well-known hierarchy of gram- alphabet. M can decide (in general non-deterministically) mars to formulate the phrase-structure (immediate con- to rewrite the contents of its window: it may delete some stituents) syntax of natural languages. However, the syn- items from the list (moving its window in one or the other tax of most European languages (including English) is of- direction), insert some items into the list, and/or replace ten considered as a lexicalized syntax. In other words, some items. In addition, M can perform a restart opera- the categories of Chomsky are bound to immediate con- tion which causes M to place its window at the left end stituents (phrases), while by lexicalized syntax they are of the tape, so that the first symbol it contains is the left bound to individual word-forms. In order to give a general border marker, and to reenter its initial state. theoretical basis for lexicalized syntax that is comparable In the technical part we adjust some known results to the Chomsky hierarchy, we follow some previous stud- to results on several subclasses of LxRLAW that are ob- ies of list and restarting automata (see [1, 4, 5, 13]) and tained by restricting its set of operations to certain subsets, introduce a generalized and refined model that formalizes and we also provide some variants and extensions of the lexicalization in a natural way – the h-lexicalized restart- Chomsky hierarchy of languages. E.g., an LxRLAI uses an ing list automaton with a look-ahead window (LxRLAW). input alphabet only. We argue that, through the use of restarting operations We recall and newly introduce some constraints that are suitable for restarting automata, and we outline ways ∗ The first and the third author were partially supported by the Czech for new combinations of constraints, and more transparent Science Foundation under the project 15-04960S. computations. On h-Lexicalized Automata and h-Syntactic Analysis 41 Section 2 contains the basic definitions. The charac- 3. A rewrite step (q, u) −→ (q0 , W(v)) assumes that terizations of the Chomsky hierarchy by certain types of (q0 , W(v)) ∈ δ (q, u), |v| = |u|, and the sentinels are LxRLAW is provided in Section 3. at the same positions in u and v (if at all). It causes M In Section 4 we introduce RLWW-automata as a re- to replace the contents u of the read/write window by stricted type of LxRLAW and give several new con- the string v, and to enter state q0 . The head does not straints for them. By the basic and h-proper languages change its position. we show the transparency of h-lexicalized restarting auto- mata through the so-called complete correctness preserv- 4. An S-right step (q, u) −→ (q0 , SR(v)) assumes that ing property. Using the new constraints and the basic and (q0 , SR(v)) ∈ δ (q, u), v is shorter than u, containing h-proper languages, we are able to separate classes of fi- all sentinels in u. It causes M to replace u by v and nite languages in a similar way as the classes of infinite to enter state q0 ; the new position of the head is on languages and establish in this way new hierarchies, which the first item of v (the contents of the window is thus can create a suitable tool for the classification of syntactic ‘completed’ from the right; the positional distance to phenomena for computational linguistics. The paper con- the right sentinel decreases). cludes with Section 5. 5. An S-left step (q, u) −→ (q0 , SL(v)) assumes that (q0 , SL(v)) ∈ δ (q, u), v is shorter than u, containing 2 Definitions all sentinels in u. It causes M to replace u by v, to en- ter state q0 , and to shift the head position by |u| − |v| In what follows, we use ⊂ to denote the proper subset re- items to the left – but to the left sentinel c at most (the lation. Further, we will sometimes use regular expressions contents of the window is ‘completed’ from the left; instead of the corresponding regular languages. Finally, the distance to the left sentinel decreases if the head throughout the paper λ will denote the empty word and position was not already at c). N+ will denote the set of all positive integers. An h-lexicalized two-way restarting list automa- 6. An insert step (q, u) −→ (q0 , I(v)) assumes that ton, LxRLAW for short, is a one-tape machine M = (q0 , I(v)) ∈ δ (q, u), u is a proper subsequence of v (Q, Σ, Γ, c, $, q0 , k, δ , h), where Q is the finite set of states, (keeping the obvious sentinel constraints). It causes Σ is the finite input alphabet, Γ is the finite working al- M to replace u by v (by inserting the relevant items), phabet containing Σ, the symbols c, $ 6∈ Γ are the mark- and to enter state q0 . The head position is shifted by ers for the left and right border of the work space, respec- |v| − |u| to the right (the distance to the left sentinel tively, h : Γ ∪ {c, $} → Σ ∪ {c, $} is a mapping creating a increases). (letter) morphism from cΓ∗ $ to cΣ∗ $ such that, for each 7. A restart step (q, u) −→ Restart assumes that a ∈ Σ ∪ {c, $}, h(a) = a; q0 ∈ Q is the initial state, k ≥ 1 is Restart ∈ δ (q, u). It causes M to place its read/write the size of the read/write window, and window onto the left end of the tape, so that the first δ : Q × PC ≤k → symbol it sees is the left border marker c, and to reen- P((Q × ({MVR, MVL} ∪ {W(v), SR(v), SL(v), I(v)})) ter the initial state q0 . ∪ {Restart, Accept, Reject}) 8. An accept step (q, u) −→ Accept assumes that is the transition relation. Here P(S) denotes the powerset Accept ∈ δ (q, u). It causes M to halt and accept. of a set S, 9. A reject step (q, u) −→ Reject assumes that Reject ∈ PC ≤k := (c · Γ k−1 k ) ∪ Γ ∪ (Γ ≤k−1 · $) ∪ (c · Γ ≤k−2 · $), δ (q, u). It causes M to halt and reject. is the set of possible contents of the read/write window A configuration of M is a string αqβ where q ∈ Q, and of M, and v ∈ PC ≤n , where n ∈ N. either α = λ and β ∈ {c} · Γ∗ · {$} or α ∈ {c} · Γ∗ and β ∈ Γ∗ · {$}; here q represents the current state, αβ is According to the transition relation, if M is in state q and the current contents of the tape, and it is understood that sees the string u in its read/write window, it can perform the head scans the first k symbols of β or all of β when nine different steps, where q0 ∈ Q: |β | < k. A restarting configuration is of the form q0 cw$, 1. A move-right step (q, u) −→ (q0 , MVR) assumes that where w ∈ Γ∗ ; if w ∈ Σ∗ , then q0 cw$ is an initial configura- (q0 , MVR) ∈ δ (q, u) and u 6= $. It causes M to shift tion. We see that any initial configuration is also a restart- the read/write window one position to the right and to ing configuration. Any restart transfers M into a restarting enter state q0 . configuration. In general, the automaton M is non-deterministic, that 2. A move-left step (q, u) −→ (q0 , MVL) assumes that is, there can be two or more steps (instructions) with the (q0 , MVL) ∈ δ (q, u) and u 6∈ c · Γ∗ · {λ , $}. It causes same left-hand side (q, u), and thus, there can be more than M to shift the read/write window one position to the one computation for an input word. If this is not the case, left and to enter state q0 . the automaton is deterministic. 42 M. Plátek, F. Otto, F. Mráz A computation of M is a sequence C = C0 ,C1 , . . . ,C j 2.1 Further Refinements on LxRLAW of configurations, where C0 is an initial configuration and Ci+1 is obtained by a step of M from Ci , for all 0 ≤ i < j. Here we introduce some constrained types of rewriting steps which assume q, q0 ∈ Q and u ∈ PC ≤k . An input word w ∈ Σ∗ is accepted by M, if there is A delete-right step (q, u) → (q0 , DR(v)) is an S-right a computation which starts with the initial configuration step (q, u) → (q0 , SR(v)) such that v is a proper sub- q0 cw$ and ends by executing an Accept instruction. By sequence of u, containing all the sentinels from u (if any). L(M) we denote the language consisting of all input words A delete-left step (q, u) → (q0 , DL(v)) is an S-left step accepted by M; we say that M accepts the input language (q, u) → (q0 , SL(v)) such that v is a proper sub-sequence L(M). of u, containing all the sentinels from u (if any). A basic (or characteristic, or working) word w ∈ Γ∗ is A contextual-left step (q, u) → (q0 , CL(v)) is an S-left accepted by M, if there is a computation which starts with step (q, u) → (q0 , SL(v)), where u = v1 u1 v2 u2 v3 and v = the restarting configuration q0 cw$ and ends by executing v1 v2 v3 for some v1 , u1 , v2 , u2 , v3 , such that v contains all an Accept instruction. By LC (M) we denote the language the sentinels from u (if any). consisting of all basic words accepted by M; we say that A contextual-right step (q, u) → (q0 , CR(v)) is an S- M accepts the basic (characteristic) language LC (M). right-step (q, u) → (q0 , SR(v)), where u = v1 v2 v3 and v Further, we take LhP (M) = { h(w) ∈ Σ∗ | w ∈ LC (M) }, = v1 v3 for some v1 , v2 , v3 , such that v contains all the sen- and we say that M recognizes (accepts) the h-proper lan- tinels from u (if any). guage LhP (M). Note that the contextual-right step is not symmetrical to the contextual-left step. We will use this fact in Section 4. Finally, we take LA (M) = { (h(w), w) | w ∈ LC (M) } and The set OG = {MVL, MVR, W, SL, SR, DL, DR, CL, we say that M determines the h-syntactic analysis LA (M). CR, I, Restart} represents the set of types of steps, We say that, for x ∈ Σ∗ , LA (M, x) = { (x, y) | y ∈ which can be used for characterizations of subclasses of LC (M), h(y) = x } is the h-syntactic analysis for x by M. LxRLAW. This set does not contain the symbols Accept We see that LA (M, x) is non-empty only for x ∈ LhP (M). and Reject, corresponding to halting steps, as they are used In the following we only consider finite computations of for all LxRLAW-automata. Let T ⊆ OG. We denote by LxRLAW-automata, which finish either by an accept or by T-automata the subset of LxRLAW-automata which only a reject operation. use transition steps from the set T ∪ {Accept, Reject}. For An LxRLAI M is an LxRLAW for which the input alpha- example, {MVR, W}-automata only use move-right steps, bet and the working alphabet are equal. W-steps, Accept steps, and Reject steps. Notations. For brevity, the prefix det- will be used to de- Fact 1. (Equality of Languages for LxRLAI-automata). note the property of being deterministic. For any class For each LxRLAI-automaton M, L(M) = LC (M) = A of automata, L (A ) will denote the class of input lan- LhP (M). guages that are recognized by automata from A , LC (A ) will denote the class of basic languages that are recog- – Cycles, tails: Any finite computation of an LxRLAW- nized by automata from A , and LhP (A ) will denote the automaton M consists of certain phases. Each phase starts class of h-proper languages that are recognized by auto- in a restarting configuration. In a phase called a cycle, the mata from A . Moreover, LA (A ) will denote the class window moves along the tape performing non-restarting of h-syntactic analyses that are determined by automata operations until a Restart operation is performed and thus from A . Let us note that we use the more simple notation a new restarting configuration is reached. If after a restart LP (A ) in [15] and other papers in a different sense than configuration no further restart operation is performed, any the denotation LhP (A ) here. finite computation necessarily finishes in a halting config- For a natural number k ≥ 1, L (k-A ) (LC (k-A ) or uration – such a phase is called a tail. LhP (k-A )) will denote the class of input (basic or h- proper) languages that are recognized by those automata – Cycle-rewritings: We use the notation q0 cu$ `cM q0 cv$ from A that use a read/write window of size k. to denote a cycle of M that begins with the restarting con- figuration q0 cu$ and ends with the restarting configura- – Monotonicity of Rewritings. We introduce various no- tion q0 cv$. Through this relation we define the relation of tions of monotonicity as important types of constraints for cycle-rewriting by M. We write u ⇒cM v iff q0 cu$ `cM q0 cv$ computations of LxRLAW-automata. ∗ holds. The relation u ⇒cM v is the reflexive and transitive Let M be an LxRLAW-automaton, and let C = closure of u ⇒cM v. Ck ,Ck+1 , . . . ,C j be a sequence of configurations of M, We point out that the cycle-rewriting is a very important where Ci+1 is obtained by a single transition step of M feature of LxRLAW. from Ci , k ≤ i < j. We say that C is a sub-computation of M. – Reductions: If u ⇒cM v is a cycle-rewriting by M such Let RW ⊆ {W, SR, SL, DR, DL, CR, CL, I}. Then we de- that |u| > |v|, then u ⇒cM v is called a reduction by M. note by W(C, RW ) the maximal (scattered) sub-sequence On h-Lexicalized Automata and h-Syntactic Analysis 43 of C, which contains those configurations from C that cor- word. Since the regular languages are closed under homo- respond to RW -steps (that is, those configurations in which morphisms, we have the following proposition. a transition step of one of the types from the set RW is ap- plied). We say that W(C, RW ) is the working sequence Proposition 3. For X ∈ {L , LC , LhP }, the classes of C determined by RW . X (1-det-{MVR}) and X (1-{MVR}) coincide with the Let C be a sub-computation of an LxRLAW- class REG of regular languages. automaton M, and let Cw = cαqβ $ be a configuration Observe that for LxRLAW-automata with window from C. Then |β $| is the right distance of Cw , which is size 1, the operation SR coincides with the operations DR denoted by Dr (Cw ), and |cα| is the left distance of Cw , and CR, and that the operation SL coincides with the oper- which is denoted by Dl (Cw ). ations DL and CL, and that in this situation all these oper- We say that a working sequence W(C, RW ) = ations just delete the symbol currently inside the window. (C1 ,C2 , . . . ,Cn ) is RW -monotone (or RW -right-monotone) if Dr (C1 ) ≥ Dr (C2 ) ≥ . . . ≥ Dr (Cn ). Proposition 4. For X ∈ {L , LC , LhP }: A sub-computation C of M is RW -monotone if X ({MVR, CL, W}) ⊆ CFL, W(C, RW ) is RW -monotone. If we write (right-)mono- where CFL is the class of context-free languages. tone, we actually mean {W, SR, SL, DR, DL, CR, CL, I}-right-monotone, that is, monotonicity with respect to Proof. Any {MVR, CL, W}-automaton M = (Q, Σ, Γ, c, $, any type of (allowed) rewriting and inserting for the cor- q0 , k, δ , h) can be simulated by a pushdown automaton responding type of automaton. By completely(-right)- (PDA) P which stores the current content of the window monotone, we actually mean monotone with respect to of M in its control unit (as a state). On an input word w, P each configuration of the computation. first tries to read the first k symbols of w and to store them For each of the prefixes X ∈ {RW, λ , completely} we within its control unit. If w is of length less than k, then P say that M is X-monotone if each of its (sub)computations accepts or rejects w upon encountering the right sentinel $. is X-monotone. Otherwise, P continues while preserving the following in- variant: the contents of the pushdown store of P equals the Fact 2. Let M be an {MVR, SR, SL, W, I}-automaton. part the tape of M to the left of its current position, the Then M is completely-monotone. contents of the window of M and its state are both stored in the control unit of P, and the rest of the tape of M to the Remark on PDA. It is not hard to see that a 1- right of its window is the unread part of the input of P. {MVR, SR, SL, W, I}-automaton is a type of normalized Hence, P accepts exactly L(M) by entering an accepting pushdown automaton. The top of the pushdown is repre- state and reading the rest of its input whenever M performs sented by the position of the head, and the content of the an Accept-step. If we include all working symbols of M pushdown is represented by the part of the tape between into the input alphabet of P, we obtain a PDA P0 such that the left sentinel and the position of the head. In fact, in L(P0 ) = LC (M), thus the basic language of M is context- a very similar way the pushdown automaton was intro- free. As CFL is closed under homomorphisms, also the duced by Chomsky. A k-{MVR, SR, SL, W, I}-automaton h-proper language of M is context-free, too. can be interpreted as a pushdown automaton with a k- lookahead, and with a limited look under the top of the Proposition 5. For X ∈ {L , LC , LhP }, the class pushdown at the same time. (det-)PDAs can be simulated X (1-{MVR, CL, W}) coincides with the class CFL of even by (det-)1-{MVR, SL, W}-automata. In the follow- context-free languages. ing, we will consider the automata which fulfill the condi- tion of completely-right-monotonicity for pushdown auto- Proof. According to Proposition 4, each language ac- mata (PDA). cepted by a 1-{MVR, CL, W}-automaton as an input lan- guage or as a basic language is context-free. It remains to proof the opposite inclusion. Let L be a context-free lan- 3 Characterizations of the Chomsky guage. Clearly, the empty language and the language {λ } Hierarchy can be accepted by a 1-{MVR, CL, W}-automaton. W.l.o.g. we can suppose that L \ {λ } is generated by We transform and enhance some results from [1, 5] con- a context-free grammar G = (Π, Σ, S, PG ) in Chomsky nor- cerning input languages to basic and h-proper languages. mal form. We can construct a PDA P accepting the lan- det-{MVR}- and {MVR}-automata work like determin- guage L \ {λ } by empty store. For each nonempty word istic and nondeterministic finite-state automata, respec- w ∈ L, the PDA P can guess and simulate a rightmost tively. The only difference is that such automata can derivation of w according to G in reverse order. That accept or reject without visiting all symbols of an input is, P will perform a bottom-up analysis of w which uses word. Nevertheless, these automata can be simulated by a shift-operation that moves the next input symbol onto deterministic and nondeterministic finite-state automata, the pushdown and reductions according to the rewrite rules respectively, which instead of an Accept-step enter a spe- from PG . The reduction according to a rule of the form cial accepting state in which they scan the rest of the input X → x, for X ∈ Π, x ∈ Σ, consists of popping x from the 44 M. Plátek, F. Otto, F. Mráz top of the pushdown and pushing X onto the pushdown. LA ({MVR, CL, W}) ⊂ LA ({MVR, MVL, W}), The reduction according to a rule of the form X → Y Z, LA ({MVR, MVL, W}) ⊂ LA ({MVR, MVL, W, I}). where X,Y, Z are nonterminals of G, consists of popping In a similar way we can obtain the deterministic variants Y Z (in reversed order) from the pushdown and pushing X of the Chomsky hierarchy. onto the pushdown. The empty word λ can be immediately accepted or re- jected by a 1-{MVR, CL, W}-automaton M when λ ∈ L 4 RLWW-Automata with New Constraints or λ 6∈ L, respectively. For each nonempty word w ∈ Σ∗ , Here we formulate the h-lexicalized two-way restarting each computation of P on w can be simulated by M in such automaton in weak accepting (cyclic) form, and some of a way that the top of the pushdown will be in the window its subclasses. By considering basic and h-proper lan- of M. The rest of the contents of the pushdown of P will guages, this new type of automaton is close to the method be stored on the part of the tape of M to the left of the of analysis by reduction (see [6]), its computations are position of its window. transparent, and it reflects well the structure of its basic A shift-operation of P can be simulated by a MVR-step. and h-proper languages. The reduction according to a rule of the form X → x, for X ∈ Π, x ∈ Σ, will be simulated by the W-step which An h-lexicalized two-way restarting automaton rewrites x in the window of M by X. The reduction ac- in weak accepting form (originally called cyclic cording to a rule of the form X → Y Z, where X,Y, Z ∈ Π, form) M, an hRLWW-automaton for short, is a will be simulated by the CL-step which deletes the tape {MVR, MVL, SL, SR, Restart}-automaton, which uses an cell containing Z and a W-step which rewrites the sym- SL-step or SR-step exactly once in each cycle (only one bol Y in the window of M by X. As P accepts when the of them), and directly accepts only words that fit into its pushdown contains the single symbol S, M will perform window. an Accept-step, when the only symbol on its tape is the An hRLWW-automaton is called an hRLW-automaton initial nonterminal S. Clearly, L(M) = L(P). Addition- if its working alphabet coincides with its input alphabet. ally, M can check that after each MVR-step the symbol Note that in this situation, each restarting configuration is which appears within its window is either the sentinel $ or necessarily an initial configuration. a terminal from Σ. In this way it is ensured that M does An hRLW-automaton is called an hRL-automaton if not accept any word containing a working symbol, hence each of its rewrite steps is a DL- or DR-step. LC (M) = LhP (M) = L. An hRL-automaton is called an hRLC-automaton if each of its rewrite steps is a CL- or CR-step. It is easy to see that a linear-bounded automaton work- An hRLWW-automaton is called an hRLWWC- ing on a tape containing an input word delimited by automaton if each of its rewrite steps is a CL-step or CR- sentinels directly corresponds to a 1-{MVR, MVL, W}- step. automaton. We can also see that a {MVR, MVL, W}- An hRLWW-automaton is called an hRRWW- automaton can be simulated by a 1-{MVR, MVL, W}- automaton if it does not use any MVL-steps. Analogously, automaton. Recall that the class CSL is closed under non- we obtain hRRW-automata, hRR-automata, and hRRC- erasing homomorphisms. Therefore, we have the follow- automata. ing statement. We see that for hRLWW-automata, all cycle-rewritings are reductions. We also have the following simple Proposition 6. For X ∈ {L , LC , LhP }, the class facts, which illustrate the transparency of computations of X ({MVR, MVL, W}) coincides with the class CSL of hRLWW-automata due their basic and h-proper languages. context-sensitive languages. Fact 8. (Complete Correctness Preserving Property). By adding the ability to insert cells within the working Let M be a deterministic hRLWW-automaton, let C = tape to the operations MVR, MVL and W, we can easily C0 ,C1 , · · · ,Cn be a computation of M, and let cui $ be the simulate an arbitrary Turing machine. Hence we have the tape contents of the configuration Ci , 0 ≤ i ≤ n. If ui ∈ following proposition. LC (M) for some i, then u j ∈ LC (M) for all j = 0, 1, . . . , n. Proposition 7. For X ∈ {L , LC , LhP }, the class Fact 9. (Complete Error Preserving Property). X ((det-){MVR,MVL,W,I}) coincides with the class RE Let M be a deterministic hRLWW-automaton, let C = of recursively enumerable languages. C0 ,C1 , · · · ,Cn be a computation of M, and let cui $ be the tape contents of the configuration Ci , 0 ≤ i ≤ n. If ui 6∈ From the previous results we obtain the following vari- LC (M) for some i, then u j 6∈ LC (M) for all j = 0, 1, . . . , n. ant of the Chomsky hierarchy for the classes of the h- syntactic analysis which follows from the corresponding Fact 10. (Prefix Correctness Preserving Property). hierarchies for the classes of h-proper and basic languages. Let M be an hRLWW-automaton, let C = C0 ,C1 , · · · ,Cn be a computation of M, and let cui $ be the tape contents of Corollary 1. We have the following hierarchy: the configuration Ci , 0 ≤ i ≤ n. If ui ∈ LC (M) for some i, LA ({MVR}) ⊂ LA ({MVR, CL, W}), then u j ∈ LC (M) for all j ≤ i. On h-Lexicalized Automata and h-Syntactic Analysis 45 Fact 11. (Suffix Error Preserving Property). If M = (Q, Σ, Γ, c, $, q0 , k, δ , h) is a right-monotone Let M be an hRLWW-automaton, let C = C0 ,C1 , · · · ,Cn be hRLWW-automaton, then its characteristic language a computation of M, and let cui $ be the tape contents of LC (M) is context-free [14]. As LhP (M) = h(LC (M)), and the configuration Ci , 0 ≤ i ≤ n. If ui ∈ / LC (M) for some i, as CFL is closed under morphisms, it follows that LhP (M) then u j 6∈ LC (M) for all j ≥ i. is also context-free. Conversely, assume that L ⊆ Σ∗ is a context-free lan- Corollary 2. (Equality of Languages for hRLW- guage. Without loss of generality we may assume that automata). L does not contain the empty word. Thus, there exists a For each hRLW-automaton M, L(M) = LC (M) = LhP (M). context-free grammar G = (N, Σ, S, P) for L which is in From the last corollary above, the Complete Error and Greibach normal form, that is, each rule of P has the form Complete Correctness Preserving Properties for determin- A → aα for some string α ∈ N ∗ and some letter a ∈ Σ. For istic hRLW-automata, and the Suffix Error Preserving the following construction we assume that the rules of G Property and the Prefix Correctness Preserving Property are numbered from 1 to m. for hRLW-automata follow for their input, basic, and h- From G we construct a new grammar G0 := (N, ∆, S, P0 ), proper languages. where ∆ := { (∇i , a) | 1 ≤ i ≤ m and the i-th rule of G has the form A → aα } is a set of new terminal symbols that are in one-to-one correspondence to the rules of G, and 4.1 On Regular Languages and Correctness Preserving Computations P0 := { A → (∇i , a)α | A → aα is the i-th rule of G, 1 ≤ i ≤ m }. Proposition 12. The class REG is characterized by ba- sic and h-proper languages of deterministic hRLWW- Obviously, a word ω ∈ ∆∗ belongs to L(G0 ) if and only automata which use only the operations MVR, CR, if ω has the form ω = (∇i1 , a1 )(∇i2 , a2 ) · · · (∇in , an ) for Restart, and which use the operation CR only when the some integer n > 0, where a1 , a2 , . . . , an ∈ Σ, i1 , i2 , . . . , in ∈ window is at the position just to the right of the left sen- {1, 2, . . . , m}, and the sequence of these indices describes tinel. We denote such automata by det-Rg2. a (left-most) derivation of w := a1 a2 · · · an from S in G. Proof. We outline only the main ideas of the proof. First Let us take h((∇i , a)) = a for all (∇i , a) ∈ ∆. Then it fol- we show that any basic language and h-proper language of lows that h(L(G0 )) = L(G) = L. From ω this derivation a det-Rg2 is regular. Let Mk be a k-det-Rg2-automaton. can be reconstructed deterministically. In fact, the lan- It is not hard to see that Mk can be simulated by a finite guage L(G0 ) is deterministic context-free. Hence, there automaton Ak+1 with a stack of size k + 1 stored within exists a deterministic right-monotone hRRC-automaton M its finite control, and therefore LC (Mk ) = L(Ak+1 ), and for this language (see [4]). By interpreting the sym- LhP (Mk ) = LhP (Ak+1 ). Since the regular languages are bols of ∆ as auxiliary symbols, we obtain a determin- closed under homomorphisms, LhP (Ak+1 ) is a regular lan- istic right-monotone hRRWWC-automaton M 0 such that guage, too. h(LC (M 0 )) = LhP (M 0 ) = h(L(M)) = h(L(G0 )) = L. Ob- On the other hand, we can see from the pumping lemma serve that the input language L(M 0 ) of M 0 is actually for regular languages that any regular language L ⊆ Σ∗ can empty. be accepted by a det-Rg2-automaton with working (and input) alphabet Σ. Remark. By means of h-lexicalized restarting auto- mata, the above construction corresponds to the linguis- Remark. Since det-Rg2-automata are deterministic tic effort to obtain a set of categories (auxiliary symbols) hRLWW-automata, we see that they are completely cor- that ensures the correctness preserving property for the re- rectness preserving for their basic languages. spective analysis by reduction. Note that in its reductions, the automaton M 0 above only uses delete operations (in a special way). This is highly reminiscent of the basic (el- 4.2 Monotonicity and h-Proper Languages ementary) analysis by reduction learned in (Czech) basic As an hRRWWC-automaton is an hRRWW-automaton schools. which uses only CL-operations instead of SL-operations and CR-operations instead of SR-operations, we can prove 4.3 Hierarchies the following theorem in a similar way as it was done for a stronger version of hRRWW-automata in the technical In this subsection we will show similar results for finite report [15]. and infinite classes of basic and h-proper languages and for Theorem 13. CFL = LhP (det-mon-hRLWWC). classes of h-syntactic analysis given by certain subclasses of hRLWW-automata. At first we introduce some useful Proof. The proof is based mainly on ideas from [15]. Here notions. it is transformed for hRLWW-automata with the constraint For a (sub)class X of hRLWW-automata, by fin(i)-X we of the weak accepting form. denote the subclass of X-automata which can perform at 46 M. Plátek, F. Otto, F. Mráz most i reductions, and by fin-X we denote the union of Remark. The next theorem enables us to classify finite fin(i)-X over all natural numbers i. linguistic observations in a similar way as infinite for- By LC (M, i) we denote the subset of LC (M) contain- mal languages. It refines a part of the Chomsky hier- ing all words accepted by at most i reductions. We take archy and gives a useful tool for classifications of sev- LhP (M, i) = h(LC (M, i)). eral syntactic phenomena of natural languages. In order to show the next theorem we consider the following se- Proposition 14. Let i ≥ 0, and let M be an hRLWW- quences of languages for k ≥ 1: Lc fk+1 = { ai bi·k | i ≥ 1 } automaton. Then there is a fin(i)-hRLWW-automaton M1 and Lcsk+1 = { ai bi ci·k , ai+1 bi ci·k | i ≥ 1 }. such that LC (M, i) = LC (M1 ), and if u ⇒M1 v, then u ⇒M v. If M is deterministic, then M1 is deterministic as well. Theorem 16. For X = hRLWWC, and k ≥ 2 the following holds: Proof. The basic idea of the construction of M1 is to add (1) LhP (k-det-Rg2) ⊂ LhP (k-det-mon-X) ⊂ to the finite control of M the counter of possible cycles on LhP (k-det-X), the starting word by M. While simulating the first cycle (2) LA (k-det-Rg2) ⊂ LA (k-det-mon-X) ⊂ of M, M1 already simulates the next up to i − 1 cycles of M LA (k-det-X), rejecting all words which are not acceptable with at most i (3) LhP (k-det-fin-Rg2) ⊂ LhP (k-det-fin-mon-X) ⊂ cycles by M. 2 LhP (k-det-fin-X), We can see that a similar proposition can also be shown (4) LA (k-det-fin-Rg2) ⊂ LA (k-det-fin-mon-X) ⊂ for hRRWW-automata. LA (k-det-fin-X). For a positive integer k, let the prefix de(k)- denote Proof. We outline the main ideas of the proof. For k ≥ 2 hRLWW-automata which delete at most k symbols in each and X = hRLWWC, the following inclusions follow from reduction. definitions: The next proposition lays the foundation to the desired (1) LhP (k-det-Rg2) ⊆ LhP (k-det-mon-X) ⊆ hierarchies. In order to show the next proposition we LhP (k-det-X), consider the following sequence of languages for k ≥ 1: (2) LA (k-det-Rg2) ⊆ LA (k-det-mon-X) ⊆ Lrgk = { (ak )i | i ≥ 1 }. LA (k-det-X), Proposition 15. LhP (k-de(k)-det-fin(1)-Rg2) r (3) LhP (k-det-fin-Rg2) ⊆ LhP (k-det-fin-mon-X) ⊆ LhP ((k − 1)-hRLWW) 6= 0, / for all k ≥ 2. LhP (k-det-fin-X), (4) LA (k-det-fin-Rg2) ⊆ LA (k-det-fin-mon-X) ⊆ Proof. We outline the basic ideas only. We can con- LA (k-det-fin-X). struct a k-det-Rg2-automaton MRk such that LC (MRk ) = It remains to show that all these inclusions are proper. LhP (MRk ) = Lrgk . For MRk a window of the size k suf- We use the sequence of context-free languages Lc fk to fices, because MRk can first move its window to the right show the first proper inclusion in all four propositions. of the left sentinel. If it sees ak , then it performs a CR-step For any natural number k ≥ 2, it is not hard to which deletes ak . Next, if the window contains only the construct a k-det-mon-hRLC-automaton Mc fk such that right sentinel, the automaton accepts, otherwise it restarts. LC (Mc fk ) = LhP (Mc fk ) = Lc fk . By applying Propo- From Proposition 14 we can see that there is a k-det- sition 14 for i = k, we can construct a k-det-mon- Rg2-automaton MR0k such that LC (MR0k ) = LhP (MR0k ) = fin(k)-hRLC-automaton Mc fk0 such that LC (Mc fk0 ) = LhP (MRk , 1). LhP (Mc fk0 ) = LhP (Mc fk , k). On the other hand, there is no (k − 1)-hRLWW- For a contradiction, let us suppose that there is an k-det- automaton accepting LhP (MR0k ) as its h-proper language. Rg2-automaton Mk such that LhP (Mk ) = LhP (Mc fk , k). Let For a contradiction, let us suppose that LhP (M) = us consider the word ak+1 b(k+1)(k−1) ∈ LhP (Mc fk0 ). As this LhP (MR0k ) for a (k − 1)-hRLWW-automaton M. Then word is longer than k, Mk must use at least one cycle to ac- M accepts a word w ∈ LC (M) such that h(w) = ak ∈ cept a word w such that h(w) = ak+1 b(k+1)(k−1) . Since any LhP (MRk , 1). In an accepting computation on w, automa- k-det-Rg2-automaton can delete only some of the first k ton M must perform at least one reduction, but it can delete symbols we have a contradiction to the complete correct- at most (k − 1) symbols by which it obtains a word w0 of ness preserving property of Mk . Thus, the first inclusion is length between 1 and k − 1, hence h(w0 ) 6∈ LhP (MR0k ) – a proper for all four propositions of the theorem. contradiction to LhP (M) = LhP (MR0k ). 2 Using languages Lcsk we can show the second inclu- sion to be proper in all four propositions. Let k ≥ 2 be Corollary 3. For all types X ∈ {det-Rg2, hRR, hRRC, an integer. It is easy to construct a k-det-hRLC-automaton hRRW, hRRWWC, hRRWW, hRL, hRLC, hRLW, Mcsk such that LC (Mcsk ) = LhP (Mcsk ) = Lcsk . By ap- hRLWWC, hRLWW}, all prefixes pr1 ∈ {λ , f in}, all plying Proposition 14 for i = k, we can construct a k- prefixes prefX ∈ {λ , mon, det, det-mon}, and all k ≥ 2, det-fin(k)-hRLC-automaton Mcs0k such that LC (Mcs0k ) = the following holds: LhP (Mcs0k ) = LhP (Mcsk , k). LhP (k-pr1 -prefX -X) ⊂ LhP ((k + 1)-pr1 -prefX -X) and In order to derive a contradiction, let us assume that LhP (de(k)-pr1 -prefX -X) ⊂ LhP (de(k + 1)-pr1 -prefX -X). there is a k-mon-det-hRLWWC-automaton Mk such that On h-Lexicalized Automata and h-Syntactic Analysis 47 LhP (Mk ) = LhP (Mcsk , k). Let us consider the word References ak+1 bk+1 c(k+1)(k−1) ∈ LhP (Mcs0k ). As this word is longer than 2k, Mk must use at least two cycles to accept a word w [1] Michal P. Chytil, Martin Plátek, Jörg Vogel: A note on such that h(w) = ak+1 bk+1 c(k+1)(k−1) . Because of the cor- the Chomsky hierarchy. Bulletin of the EATCS 27: 23–30 rectness preserving property Mk must reduce w into a word (1985) w1 such that h(w1 ) = ak+1 bk ck·(k−1) , and then it must re- [2] Petr Jančar, František Mráz, Martin Plátek, Jörg Vogel: duce w1 to a word w2 such that h(w2 ) = ak bk ck·(k−1) . But Restarting automata. In: Horst Reichel (ed.): FCT’95, these two reductions violate the condition of monotonic- Proc., pages 283–292, LNCS 965, Springer, Berlin (1995) ity – a contradiction to the monotonicity constraint of Mk . [3] Petr Jančar, František Mráz, Martin Plátek: Forgetting Au- Hence, the second inclusion is proper in all four proposi- tomata and Context-Free Languages. Acta Informatica 33: 409–420 (1996) tions of the theorem. 2 [4] Petr Jančar, František Mráz, Martin Plátek, Jörg Vogel: On monotonic automata with a restart operation. J. Au- tom. Lang. Comb. 4: 287–311 (1999) [5] Petr Jančar, Martin Plátek, Jörg Vogel: Generalized linear 5 Conclusion list automata. ITAT 2004, Univerzita P. J. Šafárika v Koši- ciach, 2005, p. 97–105, ISBN 80-7097-589-X (2005) We have introduced the h-lexicalized restarting list au- [6] Markéta Lopatková, Martin Plátek, Vladislav Kuboň: tomaton (LxRLAW), which yields a formal environment Modeling syntax of free word-order languages: Depen- that is useful for expressing the lexicalized syntax in com- dency analysis by reduction. In: Václav Matoušek, Pavel putational linguistics. We presented a basic variant of the Mautner, Tomáš Pavelka (eds.), TSD 2005, Proceedings, Chomsky hierarchy of LxRLAW-automata, which can be pages 140–147, LNCS 3658, Springer, Berlin (2005) interpreted as a hierarchy of classes of input, basic, and h- [7] Markéta Lopatková, Martin Plátek, Petr Sgall: Towards a proper languages, and a hierarchy of h-syntactic analyses formal model for functional generative description: Anal- as well. ysis by reduction and restarting automata. Prague Bull. In the main part of the paper we have concentrated on h- Math. Linguistics 87: 7–26 (2007) lexicalized (deterministic) hRLWW-automata fulfilling the [8] Solomon Marcus: Contextual grammars and natural lan- constraint of weak accepting form. We have stressed the guages. Handbook of Formal Languages, pages 215–235, transparency of computations of these automata for their Springer, Berlin (1997) basic and h-proper languages due to the complete correct- [9] František Mráz: Lookahead hierarchies of restarting auto- ness preserving property. We believe that the automata mata. J. Autom. Lang. Comb. 6: 493–506 (2001) having the complete correctness preserving property con- [10] František Mráz, Friedrich Otto, Martin Plátek: The de- stitute a meaningful class of automata and that they cover gree of word-expansion of lexicalized RRWW-automata a significant class of languages, including the class CFL. – A new measure for the degree of nondeterminism of (context-free) languages. Theoretical Computer Science The newly added property of weak accepting form is 410: 3530–3538 (2009) particularly important for computational linguistic, since [11] Ladislav Nebeský: On One Formalization of Sentence it allows to use finite observations (languages) for classi- Analysis. Slovo a slovesnost, 104–107, (1962) fications of classes of infinite and finite languages as well, [12] Miroslav Novotný: With Algebra from Language to Gram- as we have shown in Section 4.3. mar and back (in Czech: S algebrou od jazyka ke gramatice hRLWW-automata can be considered as a refined ana- a zpět). Academia, Praha (1988) lytical counterpart to generative Marcus contextual gram- [13] Friedrich Otto: Restarting automata and their relations to mars (see e.g. [8]) and Novotny’s pure grammars (see the Chomsky hierarchy. In: Zoltan Esik, Zoltan Fülöp e.g. [12]). Contextual and pure grammars work with the (eds.): Developments in Language Theory, Proceedings (complete) generative correctness preserving property. We of DLT’2003, pages 55–74, LNCS 2710, Springer, Berlin have applied many useful techniques for the formalization (2003) of syntactic analysis of Czech sentences from Ladislav [14] Martin Plátek: Two-way restarting automata and j- Nebeský (see e.g. [11]). monotonicity. In: LLeszek Pacholski, Peter Ružička (eds.): We plan in the close future to show a transfer from SOFSEM’01, Proc., pages 316–325, LNCS 2234, Springer, the complete correctness preserving monotone hRRWW- Berlin (2001) automaton to the complete correctness preserving mono- [15] Martin Plátek, Friedrich Otto, František Mráz: tone LxRLAW-automaton without any restart, which in On h-Lexicalized Restarting List Automata, fact works like a push-down automaton. Such an automa- Technical report, www.theory.informatik.uni- ton computes in linear time. It is also possible to obtain kassel.de/projekte/RL2016v6.4.pdf, Kassel (2017) similar hierarchies for this type of automata as for the hRRWW-automata from Section 4.3. We also plan to study some relaxations of the complete correctness preserving property.