=Paper= {{Paper |id=Vol-1885/40 |storemode=property |title=On h-Lexicalized Automata and h-Syntactic Analysis |pdfUrl=https://ceur-ws.org/Vol-1885/40.pdf |volume=Vol-1885 |authors=Martin Plátek,Friedrich Otto,František Mráz |dblpUrl=https://dblp.org/rec/conf/itat/PlatekOM17 }} ==On h-Lexicalized Automata and h-Syntactic Analysis== https://ceur-ws.org/Vol-1885/40.pdf
J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 40–47
CEUR Workshop Proceedings Vol. 1885, ISSN 1613-0073, c 2017 M. Plátek, F. Otto, F. Mráz



                              On h-Lexicalized Automata and h-Syntactic Analysis

                                                       Martin Plátek1 , F. Otto2 , and F. Mráz1 ∗
                                                   1
                                                 Charles University, Department of Computer Science
                                              Malostranské nám. 25, 118 00 PRAHA 1, Czech Republic
                                          martin.platek@mff.cuni.cz, frantisek.mraz@mff.cuni.cz
                                               2   Fachbereich Elektrotechnik/Informatik, Universität Kassel
                                                                 D-34109 Kassel, Germany
                                                       otto@theory.informatik.uni-kassel.de

      Abstract: Following some previous studies on list auto-                     and basic and h-proper languages, this new model is bet-
      mata and restarting automata, we introduce a generalized                    ter suited for modeling (i) lexicalized syntactic analysis
      and refined model – the h-lexicalized restarting list au-                   (h-syntactic analysis) and specially (ii) (lexicalized) anal-
      tomaton (LxRLAW). We argue that this model is useful                        ysis by reduction of natural languages (compare [6, 7]).
      for expressing transparent variants of lexicalized syntactic                   Analysis by reduction is a technique for deciding the
      analysis, and analysis by reduction in computational lin-                   correctness of a sentence. It is based on a stepwise sim-
      guistics. We present several subclasses of LxRLAW and                       plification by reductions preserving the (in)correctness of
      provide some variants and some extensions of the Chom-                      the sentence until a short sentence is obtained for which
      sky hierarchy, including the variant for the lexicalized syn-               it is easy to decide its correctness. Restarting automata
      tactic analysis. We compare the input languages, which                      were introduced as an automata model for analysis by re-
      are the languages traditionally considered in automata the-                 duction. While modeling analysis by reduction, restarting
      ory, to the so-called basic and h-proper languages. The                     automata are forced to use very transparent types of com-
      basic and h-proper languages allow stressing the trans-                     putations. Nevertheless, they still can recognize a proper
      parency of h-lexicalized restarting automata for a super-                   superset of the class of context-free languages (CFL). The
      class of the context-free languages by the so-called com-                   first observation, which supports our argumentation, is that
      plete correctness preserving property. Such a type of trans-                LxRLAW allow characterizations of the Chomsky hierar-
      parency cannot be achieved for the whole class of context-                  chy for classes of languages and for h-syntactic analysis
      free languages by traditional input languages. The trans-                   as well.
      parency of h-lexicalized restarting automata is illustrated
                                                                                     An LxRLAW M is a device with a finite state control
      by two types of hierarchies which separate the classes of
                                                                                  and a read/write window of a fixed size. This window can
      infinite and the classes of finite languages by the same
                                                                                  move in both directions along a tape (that is, a list of items)
      tools.
                                                                                  containing a word delimited by sentinels. The LxRLAW-
                                                                                  automaton M uses an input alphabet and a working alpha-
                                                                                  bet that contains the input alphabet. Lexicalization of M is
      1    Introduction                                                           given through a morphism which binds the individual sym-
                                                                                  bols from the working alphabet to symbols from the input
      Chomsky introduced his well-known hierarchy of gram-                        alphabet. M can decide (in general non-deterministically)
      mars to formulate the phrase-structure (immediate con-                      to rewrite the contents of its window: it may delete some
      stituents) syntax of natural languages. However, the syn-                   items from the list (moving its window in one or the other
      tax of most European languages (including English) is of-                   direction), insert some items into the list, and/or replace
      ten considered as a lexicalized syntax. In other words,                     some items. In addition, M can perform a restart opera-
      the categories of Chomsky are bound to immediate con-                       tion which causes M to place its window at the left end
      stituents (phrases), while by lexicalized syntax they are                   of the tape, so that the first symbol it contains is the left
      bound to individual word-forms. In order to give a general                  border marker, and to reenter its initial state.
      theoretical basis for lexicalized syntax that is comparable                    In the technical part we adjust some known results
      to the Chomsky hierarchy, we follow some previous stud-                     to results on several subclasses of LxRLAW that are ob-
      ies of list and restarting automata (see [1, 4, 5, 13]) and                 tained by restricting its set of operations to certain subsets,
      introduce a generalized and refined model that formalizes                   and we also provide some variants and extensions of the
      lexicalization in a natural way – the h-lexicalized restart-                Chomsky hierarchy of languages. E.g., an LxRLAI uses an
      ing list automaton with a look-ahead window (LxRLAW).                       input alphabet only.
      We argue that, through the use of restarting operations                        We recall and newly introduce some constraints that
                                                                                  are suitable for restarting automata, and we outline ways
         ∗ The first and the third author were partially supported by the Czech   for new combinations of constraints, and more transparent
      Science Foundation under the project 15-04960S.                             computations.
On h-Lexicalized Automata and h-Syntactic Analysis                                                                                                          41

         Section 2 contains the basic definitions. The charac-                             3. A rewrite step (q, u) −→ (q0 , W(v)) assumes that
      terizations of the Chomsky hierarchy by certain types of                                (q0 , W(v)) ∈ δ (q, u), |v| = |u|, and the sentinels are
      LxRLAW is provided in Section 3.                                                        at the same positions in u and v (if at all). It causes M
         In Section 4 we introduce RLWW-automata as a re-                                     to replace the contents u of the read/write window by
      stricted type of LxRLAW and give several new con-                                       the string v, and to enter state q0 . The head does not
      straints for them. By the basic and h-proper languages                                  change its position.
      we show the transparency of h-lexicalized restarting auto-
      mata through the so-called complete correctness preserv-                             4. An S-right step (q, u) −→ (q0 , SR(v)) assumes that
      ing property. Using the new constraints and the basic and                               (q0 , SR(v)) ∈ δ (q, u), v is shorter than u, containing
      h-proper languages, we are able to separate classes of fi-                              all sentinels in u. It causes M to replace u by v and
      nite languages in a similar way as the classes of infinite                              to enter state q0 ; the new position of the head is on
      languages and establish in this way new hierarchies, which                              the first item of v (the contents of the window is thus
      can create a suitable tool for the classification of syntactic                          ‘completed’ from the right; the positional distance to
      phenomena for computational linguistics. The paper con-                                 the right sentinel decreases).
      cludes with Section 5.
                                                                                           5. An S-left step (q, u) −→ (q0 , SL(v)) assumes that
                                                                                              (q0 , SL(v)) ∈ δ (q, u), v is shorter than u, containing
      2     Definitions                                                                       all sentinels in u. It causes M to replace u by v, to en-
                                                                                              ter state q0 , and to shift the head position by |u| − |v|
      In what follows, we use ⊂ to denote the proper subset re-                               items to the left – but to the left sentinel c at most (the
      lation. Further, we will sometimes use regular expressions                              contents of the window is ‘completed’ from the left;
      instead of the corresponding regular languages. Finally,                                the distance to the left sentinel decreases if the head
      throughout the paper λ will denote the empty word and                                   position was not already at c).
      N+ will denote the set of all positive integers.
         An h-lexicalized two-way restarting list automa-                                  6. An insert step (q, u) −→ (q0 , I(v)) assumes that
      ton, LxRLAW for short, is a one-tape machine M =                                        (q0 , I(v)) ∈ δ (q, u), u is a proper subsequence of v
      (Q, Σ, Γ, c, $, q0 , k, δ , h), where Q is the finite set of states,                    (keeping the obvious sentinel constraints). It causes
      Σ is the finite input alphabet, Γ is the finite working al-                             M to replace u by v (by inserting the relevant items),
      phabet containing Σ, the symbols c, $ 6∈ Γ are the mark-                                and to enter state q0 . The head position is shifted by
      ers for the left and right border of the work space, respec-                            |v| − |u| to the right (the distance to the left sentinel
      tively, h : Γ ∪ {c, $} → Σ ∪ {c, $} is a mapping creating a                             increases).
      (letter) morphism from cΓ∗ $ to cΣ∗ $ such that, for each                            7. A restart step (q, u) −→ Restart assumes that
      a ∈ Σ ∪ {c, $}, h(a) = a; q0 ∈ Q is the initial state, k ≥ 1 is                         Restart ∈ δ (q, u). It causes M to place its read/write
      the size of the read/write window, and                                                  window onto the left end of the tape, so that the first
                     δ : Q × PC ≤k →                                                          symbol it sees is the left border marker c, and to reen-
       P((Q × ({MVR, MVL} ∪ {W(v), SR(v), SL(v), I(v)}))                                      ter the initial state q0 .
                 ∪ {Restart, Accept, Reject})
                                                                                           8. An accept step (q, u) −→ Accept assumes that
      is the transition relation. Here P(S) denotes the powerset                              Accept ∈ δ (q, u). It causes M to halt and accept.
      of a set S,
                                                                                           9. A reject step (q, u) −→ Reject assumes that Reject ∈
          PC   ≤k
                    := (c · Γ   k−1       k
                                      ) ∪ Γ ∪ (Γ   ≤k−1
                                                          · $) ∪ (c · Γ   ≤k−2
                                                                                 · $),        δ (q, u). It causes M to halt and reject.

      is the set of possible contents of the read/write window                              A configuration of M is a string αqβ where q ∈ Q, and
      of M, and v ∈ PC ≤n , where n ∈ N.                                                 either α = λ and β ∈ {c} · Γ∗ · {$} or α ∈ {c} · Γ∗ and
                                                                                         β ∈ Γ∗ · {$}; here q represents the current state, αβ is
      According to the transition relation, if M is in state q and                       the current contents of the tape, and it is understood that
      sees the string u in its read/write window, it can perform                         the head scans the first k symbols of β or all of β when
      nine different steps, where q0 ∈ Q:                                                |β | < k. A restarting configuration is of the form q0 cw$,
          1. A move-right step (q, u) −→ (q0 , MVR) assumes that                         where w ∈ Γ∗ ; if w ∈ Σ∗ , then q0 cw$ is an initial configura-
             (q0 , MVR) ∈ δ (q, u) and u 6= $. It causes M to shift                      tion. We see that any initial configuration is also a restart-
             the read/write window one position to the right and to                      ing configuration. Any restart transfers M into a restarting
             enter state q0 .                                                            configuration.
                                                                                            In general, the automaton M is non-deterministic, that
          2. A move-left step (q, u) −→ (q0 , MVL) assumes that                          is, there can be two or more steps (instructions) with the
             (q0 , MVL) ∈ δ (q, u) and u 6∈ c · Γ∗ · {λ , $}. It causes                  same left-hand side (q, u), and thus, there can be more than
             M to shift the read/write window one position to the                        one computation for an input word. If this is not the case,
             left and to enter state q0 .                                                the automaton is deterministic.
42                                                                                                                  M. Plátek, F. Otto, F. Mráz

        A computation of M is a sequence C = C0 ,C1 , . . . ,C j     2.1 Further Refinements on LxRLAW
     of configurations, where C0 is an initial configuration and
     Ci+1 is obtained by a step of M from Ci , for all 0 ≤ i < j.    Here we introduce some constrained types of rewriting
                                                                     steps which assume q, q0 ∈ Q and u ∈ PC ≤k .
        An input word w ∈ Σ∗ is accepted by M, if there is
                                                                        A delete-right step (q, u) → (q0 , DR(v)) is an S-right
     a computation which starts with the initial configuration
                                                                     step (q, u) → (q0 , SR(v)) such that v is a proper sub-
     q0 cw$ and ends by executing an Accept instruction. By
                                                                     sequence of u, containing all the sentinels from u (if any).
     L(M) we denote the language consisting of all input words
                                                                        A delete-left step (q, u) → (q0 , DL(v)) is an S-left step
     accepted by M; we say that M accepts the input language
                                                                     (q, u) → (q0 , SL(v)) such that v is a proper sub-sequence
     L(M).
                                                                     of u, containing all the sentinels from u (if any).
       A basic (or characteristic, or working) word w ∈ Γ∗ is           A contextual-left step (q, u) → (q0 , CL(v)) is an S-left
     accepted by M, if there is a computation which starts with      step (q, u) → (q0 , SL(v)), where u = v1 u1 v2 u2 v3 and v =
     the restarting configuration q0 cw$ and ends by executing       v1 v2 v3 for some v1 , u1 , v2 , u2 , v3 , such that v contains all
     an Accept instruction. By LC (M) we denote the language         the sentinels from u (if any).
     consisting of all basic words accepted by M; we say that           A contextual-right step (q, u) → (q0 , CR(v)) is an S-
     M accepts the basic (characteristic) language LC (M).           right-step (q, u) → (q0 , SR(v)), where u = v1 v2 v3 and v
       Further, we take LhP (M) = { h(w) ∈ Σ∗ | w ∈ LC (M) },        = v1 v3 for some v1 , v2 , v3 , such that v contains all the sen-
     and we say that M recognizes (accepts) the h-proper lan-        tinels from u (if any).
     guage LhP (M).                                                     Note that the contextual-right step is not symmetrical to
                                                                     the contextual-left step. We will use this fact in Section 4.
       Finally, we take LA (M) = { (h(w), w) | w ∈ LC (M) } and         The set OG = {MVL, MVR, W, SL, SR, DL, DR, CL,
     we say that M determines the h-syntactic analysis LA (M).       CR, I, Restart} represents the set of types of steps,
        We say that, for x ∈ Σ∗ , LA (M, x) = { (x, y) | y ∈         which can be used for characterizations of subclasses of
     LC (M), h(y) = x } is the h-syntactic analysis for x by M.      LxRLAW. This set does not contain the symbols Accept
     We see that LA (M, x) is non-empty only for x ∈ LhP (M).        and Reject, corresponding to halting steps, as they are used
        In the following we only consider finite computations of     for all LxRLAW-automata. Let T ⊆ OG. We denote by
     LxRLAW-automata, which finish either by an accept or by         T-automata the subset of LxRLAW-automata which only
     a reject operation.                                             use transition steps from the set T ∪ {Accept, Reject}. For
        An LxRLAI M is an LxRLAW for which the input alpha-          example, {MVR, W}-automata only use move-right steps,
     bet and the working alphabet are equal.                         W-steps, Accept steps, and Reject steps.
                                                                     Notations. For brevity, the prefix det- will be used to de-
     Fact 1. (Equality of Languages for LxRLAI-automata).            note the property of being deterministic. For any class
     For each LxRLAI-automaton M, L(M) = LC (M) =                    A of automata, L (A ) will denote the class of input lan-
     LhP (M).                                                        guages that are recognized by automata from A , LC (A )
                                                                     will denote the class of basic languages that are recog-
     – Cycles, tails: Any finite computation of an LxRLAW-           nized by automata from A , and LhP (A ) will denote the
     automaton M consists of certain phases. Each phase starts       class of h-proper languages that are recognized by auto-
     in a restarting configuration. In a phase called a cycle, the   mata from A . Moreover, LA (A ) will denote the class
     window moves along the tape performing non-restarting           of h-syntactic analyses that are determined by automata
     operations until a Restart operation is performed and thus      from A . Let us note that we use the more simple notation
     a new restarting configuration is reached. If after a restart   LP (A ) in [15] and other papers in a different sense than
     configuration no further restart operation is performed, any    the denotation LhP (A ) here.
     finite computation necessarily finishes in a halting config-       For a natural number k ≥ 1, L (k-A ) (LC (k-A ) or
     uration – such a phase is called a tail.                        LhP (k-A )) will denote the class of input (basic or h-
                                                                     proper) languages that are recognized by those automata
     – Cycle-rewritings: We use the notation q0 cu$ `cM q0 cv$
                                                                     from A that use a read/write window of size k.
     to denote a cycle of M that begins with the restarting con-
     figuration q0 cu$ and ends with the restarting configura-       – Monotonicity of Rewritings. We introduce various no-
     tion q0 cv$. Through this relation we define the relation of    tions of monotonicity as important types of constraints for
     cycle-rewriting by M. We write u ⇒cM v iff q0 cu$ `cM q0 cv$    computations of LxRLAW-automata.
                                ∗
     holds. The relation u ⇒cM v is the reflexive and transitive        Let M be an LxRLAW-automaton, and let C =
     closure of u ⇒cM v.                                             Ck ,Ck+1 , . . . ,C j be a sequence of configurations of M,
        We point out that the cycle-rewriting is a very important    where Ci+1 is obtained by a single transition step of M
     feature of LxRLAW.                                              from Ci , k ≤ i < j. We say that C is a sub-computation
                                                                     of M.
     – Reductions: If u ⇒cM v is a cycle-rewriting by M such
                                                                       Let RW ⊆ {W, SR, SL, DR, DL, CR, CL, I}. Then we de-
     that |u| > |v|, then u ⇒cM v is called a reduction by M.
                                                                     note by W(C, RW ) the maximal (scattered) sub-sequence
On h-Lexicalized Automata and h-Syntactic Analysis                                                                                     43

      of C, which contains those configurations from C that cor-      word. Since the regular languages are closed under homo-
      respond to RW -steps (that is, those configurations in which    morphisms, we have the following proposition.
      a transition step of one of the types from the set RW is ap-
      plied). We say that W(C, RW ) is the working sequence           Proposition 3. For X ∈ {L , LC , LhP }, the classes
      of C determined by RW .                                         X (1-det-{MVR}) and X (1-{MVR}) coincide with the
         Let C be a sub-computation of an LxRLAW-                     class REG of regular languages.
      automaton M, and let Cw = cαqβ $ be a configuration                Observe that for LxRLAW-automata with window
      from C. Then |β $| is the right distance of Cw , which is       size 1, the operation SR coincides with the operations DR
      denoted by Dr (Cw ), and |cα| is the left distance of Cw ,      and CR, and that the operation SL coincides with the oper-
      which is denoted by Dl (Cw ).                                   ations DL and CL, and that in this situation all these oper-
         We say that a working sequence W(C, RW ) =                   ations just delete the symbol currently inside the window.
      (C1 ,C2 , . . . ,Cn ) is RW -monotone (or RW -right-monotone)
      if Dr (C1 ) ≥ Dr (C2 ) ≥ . . . ≥ Dr (Cn ).                      Proposition 4. For X ∈ {L , LC , LhP }:
         A sub-computation C of M is RW -monotone if                   X ({MVR, CL, W}) ⊆ CFL,
      W(C, RW ) is RW -monotone. If we write (right-)mono-            where CFL is the class of context-free languages.
      tone, we actually mean {W, SR, SL, DR, DL, CR, CL,
      I}-right-monotone, that is, monotonicity with respect to        Proof. Any {MVR, CL, W}-automaton M = (Q, Σ, Γ, c, $,
      any type of (allowed) rewriting and inserting for the cor-      q0 , k, δ , h) can be simulated by a pushdown automaton
      responding type of automaton. By completely(-right)-            (PDA) P which stores the current content of the window
      monotone, we actually mean monotone with respect to             of M in its control unit (as a state). On an input word w, P
      each configuration of the computation.                          first tries to read the first k symbols of w and to store them
         For each of the prefixes X ∈ {RW, λ , completely} we         within its control unit. If w is of length less than k, then P
      say that M is X-monotone if each of its (sub)computations       accepts or rejects w upon encountering the right sentinel $.
      is X-monotone.                                                  Otherwise, P continues while preserving the following in-
                                                                      variant: the contents of the pushdown store of P equals the
      Fact 2. Let M be an {MVR, SR, SL, W, I}-automaton.              part the tape of M to the left of its current position, the
      Then M is completely-monotone.                                  contents of the window of M and its state are both stored
                                                                      in the control unit of P, and the rest of the tape of M to the
         Remark on PDA. It is not hard to see that a 1-               right of its window is the unread part of the input of P.
      {MVR, SR, SL, W, I}-automaton is a type of normalized
                                                                         Hence, P accepts exactly L(M) by entering an accepting
      pushdown automaton. The top of the pushdown is repre-
                                                                      state and reading the rest of its input whenever M performs
      sented by the position of the head, and the content of the
                                                                      an Accept-step. If we include all working symbols of M
      pushdown is represented by the part of the tape between
                                                                      into the input alphabet of P, we obtain a PDA P0 such that
      the left sentinel and the position of the head. In fact, in
                                                                      L(P0 ) = LC (M), thus the basic language of M is context-
      a very similar way the pushdown automaton was intro-
                                                                      free. As CFL is closed under homomorphisms, also the
      duced by Chomsky. A k-{MVR, SR, SL, W, I}-automaton
                                                                      h-proper language of M is context-free, too.
      can be interpreted as a pushdown automaton with a k-
      lookahead, and with a limited look under the top of the         Proposition 5. For X ∈ {L , LC , LhP }, the class
      pushdown at the same time. (det-)PDAs can be simulated          X (1-{MVR, CL, W}) coincides with the class CFL of
      even by (det-)1-{MVR, SL, W}-automata. In the follow-           context-free languages.
      ing, we will consider the automata which fulfill the condi-
      tion of completely-right-monotonicity for pushdown auto-        Proof. According to Proposition 4, each language ac-
      mata (PDA).                                                     cepted by a 1-{MVR, CL, W}-automaton as an input lan-
                                                                      guage or as a basic language is context-free. It remains to
                                                                      proof the opposite inclusion. Let L be a context-free lan-
      3 Characterizations of the Chomsky                              guage. Clearly, the empty language and the language {λ }
        Hierarchy                                                     can be accepted by a 1-{MVR, CL, W}-automaton.
                                                                         W.l.o.g. we can suppose that L \ {λ } is generated by
      We transform and enhance some results from [1, 5] con-          a context-free grammar G = (Π, Σ, S, PG ) in Chomsky nor-
      cerning input languages to basic and h-proper languages.        mal form. We can construct a PDA P accepting the lan-
         det-{MVR}- and {MVR}-automata work like determin-            guage L \ {λ } by empty store. For each nonempty word
      istic and nondeterministic finite-state automata, respec-       w ∈ L, the PDA P can guess and simulate a rightmost
      tively. The only difference is that such automata can           derivation of w according to G in reverse order. That
      accept or reject without visiting all symbols of an input       is, P will perform a bottom-up analysis of w which uses
      word. Nevertheless, these automata can be simulated by          a shift-operation that moves the next input symbol onto
      deterministic and nondeterministic finite-state automata,       the pushdown and reductions according to the rewrite rules
      respectively, which instead of an Accept-step enter a spe-      from PG . The reduction according to a rule of the form
      cial accepting state in which they scan the rest of the input   X → x, for X ∈ Π, x ∈ Σ, consists of popping x from the
44                                                                                                                M. Plátek, F. Otto, F. Mráz

     top of the pushdown and pushing X onto the pushdown.               LA ({MVR, CL, W}) ⊂ LA ({MVR, MVL, W}),
     The reduction according to a rule of the form X → Y Z,             LA ({MVR, MVL, W}) ⊂ LA ({MVR, MVL, W, I}).
     where X,Y, Z are nonterminals of G, consists of popping
                                                                       In a similar way we can obtain the deterministic variants
     Y Z (in reversed order) from the pushdown and pushing X
                                                                    of the Chomsky hierarchy.
     onto the pushdown.
        The empty word λ can be immediately accepted or re-
     jected by a 1-{MVR, CL, W}-automaton M when λ ∈ L              4    RLWW-Automata with New Constraints
     or λ 6∈ L, respectively. For each nonempty word w ∈ Σ∗ ,
                                                                    Here we formulate the h-lexicalized two-way restarting
     each computation of P on w can be simulated by M in such
                                                                    automaton in weak accepting (cyclic) form, and some of
     a way that the top of the pushdown will be in the window
                                                                    its subclasses. By considering basic and h-proper lan-
     of M. The rest of the contents of the pushdown of P will
                                                                    guages, this new type of automaton is close to the method
     be stored on the part of the tape of M to the left of the
                                                                    of analysis by reduction (see [6]), its computations are
     position of its window.
                                                                    transparent, and it reflects well the structure of its basic
        A shift-operation of P can be simulated by a MVR-step.
                                                                    and h-proper languages.
     The reduction according to a rule of the form X → x, for
     X ∈ Π, x ∈ Σ, will be simulated by the W-step which               An h-lexicalized two-way restarting automaton
     rewrites x in the window of M by X. The reduction ac-          in weak accepting form (originally called cyclic
     cording to a rule of the form X → Y Z, where X,Y, Z ∈ Π,       form) M, an hRLWW-automaton for short, is a
     will be simulated by the CL-step which deletes the tape        {MVR, MVL, SL, SR, Restart}-automaton, which uses an
     cell containing Z and a W-step which rewrites the sym-         SL-step or SR-step exactly once in each cycle (only one
     bol Y in the window of M by X. As P accepts when the           of them), and directly accepts only words that fit into its
     pushdown contains the single symbol S, M will perform          window.
     an Accept-step, when the only symbol on its tape is the           An hRLWW-automaton is called an hRLW-automaton
     initial nonterminal S. Clearly, L(M) = L(P). Addition-         if its working alphabet coincides with its input alphabet.
     ally, M can check that after each MVR-step the symbol          Note that in this situation, each restarting configuration is
     which appears within its window is either the sentinel $ or    necessarily an initial configuration.
     a terminal from Σ. In this way it is ensured that M does          An hRLW-automaton is called an hRL-automaton if
     not accept any word containing a working symbol, hence         each of its rewrite steps is a DL- or DR-step.
     LC (M) = LhP (M) = L.                                             An hRL-automaton is called an hRLC-automaton if
                                                                    each of its rewrite steps is a CL- or CR-step.
        It is easy to see that a linear-bounded automaton work-        An hRLWW-automaton is called an hRLWWC-
     ing on a tape containing an input word delimited by            automaton if each of its rewrite steps is a CL-step or CR-
     sentinels directly corresponds to a 1-{MVR, MVL, W}-           step.
     automaton. We can also see that a {MVR, MVL, W}-                  An hRLWW-automaton is called an hRRWW-
     automaton can be simulated by a 1-{MVR, MVL, W}-               automaton if it does not use any MVL-steps. Analogously,
     automaton. Recall that the class CSL is closed under non-      we obtain hRRW-automata, hRR-automata, and hRRC-
     erasing homomorphisms. Therefore, we have the follow-          automata.
     ing statement.                                                    We see that for hRLWW-automata, all cycle-rewritings
                                                                    are reductions. We also have the following simple
     Proposition 6. For X ∈ {L , LC , LhP }, the class
                                                                    facts, which illustrate the transparency of computations of
     X ({MVR, MVL, W}) coincides with the class CSL of
                                                                    hRLWW-automata due their basic and h-proper languages.
     context-sensitive languages.
                                                                    Fact 8. (Complete Correctness Preserving Property).
        By adding the ability to insert cells within the working    Let M be a deterministic hRLWW-automaton, let C =
     tape to the operations MVR, MVL and W, we can easily           C0 ,C1 , · · · ,Cn be a computation of M, and let cui $ be the
     simulate an arbitrary Turing machine. Hence we have the        tape contents of the configuration Ci , 0 ≤ i ≤ n. If ui ∈
     following proposition.                                         LC (M) for some i, then u j ∈ LC (M) for all j = 0, 1, . . . , n.
     Proposition 7. For X ∈ {L , LC , LhP }, the class              Fact 9. (Complete Error Preserving Property).
     X ((det-){MVR,MVL,W,I}) coincides with the class RE            Let M be a deterministic hRLWW-automaton, let C =
     of recursively enumerable languages.                           C0 ,C1 , · · · ,Cn be a computation of M, and let cui $ be the
                                                                    tape contents of the configuration Ci , 0 ≤ i ≤ n. If ui 6∈
       From the previous results we obtain the following vari-      LC (M) for some i, then u j 6∈ LC (M) for all j = 0, 1, . . . , n.
     ant of the Chomsky hierarchy for the classes of the h-
     syntactic analysis which follows from the corresponding        Fact 10. (Prefix Correctness Preserving Property).
     hierarchies for the classes of h-proper and basic languages.   Let M be an hRLWW-automaton, let C = C0 ,C1 , · · · ,Cn be
                                                                    a computation of M, and let cui $ be the tape contents of
     Corollary 1. We have the following hierarchy:                  the configuration Ci , 0 ≤ i ≤ n. If ui ∈ LC (M) for some i,
       LA ({MVR}) ⊂ LA ({MVR, CL, W}),                              then u j ∈ LC (M) for all j ≤ i.
On h-Lexicalized Automata and h-Syntactic Analysis                                                                                                45

      Fact 11. (Suffix Error Preserving Property).                      If M = (Q, Σ, Γ, c, $, q0 , k, δ , h) is a right-monotone
      Let M be an hRLWW-automaton, let C = C0 ,C1 , · · · ,Cn be     hRLWW-automaton, then its characteristic language
      a computation of M, and let cui $ be the tape contents of      LC (M) is context-free [14]. As LhP (M) = h(LC (M)), and
      the configuration Ci , 0 ≤ i ≤ n. If ui ∈
                                              / LC (M) for some i,   as CFL is closed under morphisms, it follows that LhP (M)
      then u j 6∈ LC (M) for all j ≥ i.                              is also context-free.
                                                                        Conversely, assume that L ⊆ Σ∗ is a context-free lan-
      Corollary 2. (Equality of Languages for hRLW-                  guage. Without loss of generality we may assume that
      automata).                                                     L does not contain the empty word. Thus, there exists a
      For each hRLW-automaton M, L(M) = LC (M) = LhP (M).            context-free grammar G = (N, Σ, S, P) for L which is in
         From the last corollary above, the Complete Error and       Greibach normal form, that is, each rule of P has the form
      Complete Correctness Preserving Properties for determin-       A → aα for some string α ∈ N ∗ and some letter a ∈ Σ. For
      istic hRLW-automata, and the Suffix Error Preserving           the following construction we assume that the rules of G
      Property and the Prefix Correctness Preserving Property        are numbered from 1 to m.
      for hRLW-automata follow for their input, basic, and h-           From G we construct a new grammar G0 := (N, ∆, S, P0 ),
      proper languages.                                              where ∆ := { (∇i , a) | 1 ≤ i ≤ m and the i-th rule of G has
                                                                     the form A → aα } is a set of new terminal symbols that
                                                                     are in one-to-one correspondence to the rules of G, and
      4.1 On Regular Languages and Correctness
          Preserving Computations                                      P0 := { A → (∇i , a)α | A → aα is the i-th rule of G,
                                                                             1 ≤ i ≤ m }.
      Proposition 12. The class REG is characterized by ba-
      sic and h-proper languages of deterministic hRLWW-             Obviously, a word ω ∈ ∆∗ belongs to L(G0 ) if and only
      automata which use only the operations MVR, CR,                if ω has the form ω = (∇i1 , a1 )(∇i2 , a2 ) · · · (∇in , an ) for
      Restart, and which use the operation CR only when the          some integer n > 0, where a1 , a2 , . . . , an ∈ Σ, i1 , i2 , . . . , in ∈
      window is at the position just to the right of the left sen-   {1, 2, . . . , m}, and the sequence of these indices describes
      tinel. We denote such automata by det-Rg2.                     a (left-most) derivation of w := a1 a2 · · · an from S in G.
      Proof. We outline only the main ideas of the proof. First      Let us take h((∇i , a)) = a for all (∇i , a) ∈ ∆. Then it fol-
      we show that any basic language and h-proper language of       lows that h(L(G0 )) = L(G) = L. From ω this derivation
      a det-Rg2 is regular. Let Mk be a k-det-Rg2-automaton.         can be reconstructed deterministically. In fact, the lan-
      It is not hard to see that Mk can be simulated by a finite     guage L(G0 ) is deterministic context-free. Hence, there
      automaton Ak+1 with a stack of size k + 1 stored within        exists a deterministic right-monotone hRRC-automaton M
      its finite control, and therefore LC (Mk ) = L(Ak+1 ), and     for this language (see [4]). By interpreting the sym-
      LhP (Mk ) = LhP (Ak+1 ). Since the regular languages are       bols of ∆ as auxiliary symbols, we obtain a determin-
      closed under homomorphisms, LhP (Ak+1 ) is a regular lan-      istic right-monotone hRRWWC-automaton M 0 such that
      guage, too.                                                    h(LC (M 0 )) = LhP (M 0 ) = h(L(M)) = h(L(G0 )) = L. Ob-
         On the other hand, we can see from the pumping lemma        serve that the input language L(M 0 ) of M 0 is actually
      for regular languages that any regular language L ⊆ Σ∗ can     empty.
      be accepted by a det-Rg2-automaton with working (and
      input) alphabet Σ.                                                Remark. By means of h-lexicalized restarting auto-
                                                                     mata, the above construction corresponds to the linguis-
         Remark. Since det-Rg2-automata are deterministic            tic effort to obtain a set of categories (auxiliary symbols)
      hRLWW-automata, we see that they are completely cor-           that ensures the correctness preserving property for the re-
      rectness preserving for their basic languages.                 spective analysis by reduction. Note that in its reductions,
                                                                     the automaton M 0 above only uses delete operations (in a
                                                                     special way). This is highly reminiscent of the basic (el-
      4.2   Monotonicity and h-Proper Languages
                                                                     ementary) analysis by reduction learned in (Czech) basic
      As an hRRWWC-automaton is an hRRWW-automaton                   schools.
      which uses only CL-operations instead of SL-operations
      and CR-operations instead of SR-operations, we can prove
                                                                     4.3 Hierarchies
      the following theorem in a similar way as it was done for
      a stronger version of hRRWW-automata in the technical          In this subsection we will show similar results for finite
      report [15].                                                   and infinite classes of basic and h-proper languages and for
      Theorem 13. CFL = LhP (det-mon-hRLWWC).                        classes of h-syntactic analysis given by certain subclasses
                                                                     of hRLWW-automata. At first we introduce some useful
      Proof. The proof is based mainly on ideas from [15]. Here      notions.
      it is transformed for hRLWW-automata with the constraint          For a (sub)class X of hRLWW-automata, by fin(i)-X we
      of the weak accepting form.                                    denote the subclass of X-automata which can perform at
46                                                                                                                  M. Plátek, F. Otto, F. Mráz

     most i reductions, and by fin-X we denote the union of             Remark. The next theorem enables us to classify finite
     fin(i)-X over all natural numbers i.                               linguistic observations in a similar way as infinite for-
        By LC (M, i) we denote the subset of LC (M) contain-            mal languages. It refines a part of the Chomsky hier-
     ing all words accepted by at most i reductions. We take            archy and gives a useful tool for classifications of sev-
     LhP (M, i) = h(LC (M, i)).                                         eral syntactic phenomena of natural languages. In order
                                                                        to show the next theorem we consider the following se-
     Proposition 14. Let i ≥ 0, and let M be an hRLWW-                  quences of languages for k ≥ 1: Lc fk+1 = { ai bi·k | i ≥ 1 }
     automaton. Then there is a fin(i)-hRLWW-automaton M1               and Lcsk+1 = { ai bi ci·k , ai+1 bi ci·k | i ≥ 1 }.
     such that LC (M, i) = LC (M1 ), and if u ⇒M1 v, then u ⇒M v.
     If M is deterministic, then M1 is deterministic as well.           Theorem 16. For X = hRLWWC, and k ≥ 2 the following
                                                                        holds:
     Proof. The basic idea of the construction of M1 is to add            (1)    LhP (k-det-Rg2) ⊂ LhP (k-det-mon-X) ⊂
     to the finite control of M the counter of possible cycles on       LhP (k-det-X),
     the starting word by M. While simulating the first cycle             (2)     LA (k-det-Rg2) ⊂ LA (k-det-mon-X) ⊂
     of M, M1 already simulates the next up to i − 1 cycles of M        LA (k-det-X),
     rejecting all words which are not acceptable with at most i          (3) LhP (k-det-fin-Rg2) ⊂ LhP (k-det-fin-mon-X) ⊂
     cycles by M.                                              2        LhP (k-det-fin-X),
        We can see that a similar proposition can also be shown           (4) LA (k-det-fin-Rg2) ⊂ LA (k-det-fin-mon-X) ⊂
     for hRRWW-automata.                                                LA (k-det-fin-X).
        For a positive integer k, let the prefix de(k)- denote          Proof. We outline the main ideas of the proof. For k ≥ 2
     hRLWW-automata which delete at most k symbols in each              and X = hRLWWC, the following inclusions follow from
     reduction.                                                         definitions:
        The next proposition lays the foundation to the desired            (1)      LhP (k-det-Rg2) ⊆ LhP (k-det-mon-X) ⊆
     hierarchies. In order to show the next proposition we              LhP (k-det-X),
     consider the following sequence of languages for k ≥ 1:               (2)       LA (k-det-Rg2) ⊆ LA (k-det-mon-X) ⊆
     Lrgk = { (ak )i | i ≥ 1 }.                                         LA (k-det-X),
     Proposition     15.   LhP (k-de(k)-det-fin(1)-Rg2) r                  (3) LhP (k-det-fin-Rg2) ⊆ LhP (k-det-fin-mon-X) ⊆
     LhP ((k − 1)-hRLWW) 6= 0,
                            / for all k ≥ 2.                            LhP (k-det-fin-X),
                                                                           (4) LA (k-det-fin-Rg2) ⊆ LA (k-det-fin-mon-X) ⊆
     Proof. We outline the basic ideas only. We can con-                LA (k-det-fin-X).
     struct a k-det-Rg2-automaton MRk such that LC (MRk ) =                It remains to show that all these inclusions are proper.
     LhP (MRk ) = Lrgk . For MRk a window of the size k suf-            We use the sequence of context-free languages Lc fk to
     fices, because MRk can first move its window to the right          show the first proper inclusion in all four propositions.
     of the left sentinel. If it sees ak , then it performs a CR-step      For any natural number k ≥ 2, it is not hard to
     which deletes ak . Next, if the window contains only the           construct a k-det-mon-hRLC-automaton Mc fk such that
     right sentinel, the automaton accepts, otherwise it restarts.      LC (Mc fk ) = LhP (Mc fk ) = Lc fk . By applying Propo-
        From Proposition 14 we can see that there is a k-det-           sition 14 for i = k, we can construct a k-det-mon-
     Rg2-automaton MR0k such that LC (MR0k ) = LhP (MR0k ) =            fin(k)-hRLC-automaton Mc fk0 such that LC (Mc fk0 ) =
     LhP (MRk , 1).                                                     LhP (Mc fk0 ) = LhP (Mc fk , k).
        On the other hand, there is no (k − 1)-hRLWW-                      For a contradiction, let us suppose that there is an k-det-
     automaton accepting LhP (MR0k ) as its h-proper language.          Rg2-automaton Mk such that LhP (Mk ) = LhP (Mc fk , k). Let
     For a contradiction, let us suppose that LhP (M) =                 us consider the word ak+1 b(k+1)(k−1) ∈ LhP (Mc fk0 ). As this
     LhP (MR0k ) for a (k − 1)-hRLWW-automaton M. Then                  word is longer than k, Mk must use at least one cycle to ac-
     M accepts a word w ∈ LC (M) such that h(w) = ak ∈                  cept a word w such that h(w) = ak+1 b(k+1)(k−1) . Since any
     LhP (MRk , 1). In an accepting computation on w, automa-           k-det-Rg2-automaton can delete only some of the first k
     ton M must perform at least one reduction, but it can delete       symbols we have a contradiction to the complete correct-
     at most (k − 1) symbols by which it obtains a word w0 of           ness preserving property of Mk . Thus, the first inclusion is
     length between 1 and k − 1, hence h(w0 ) 6∈ LhP (MR0k ) – a        proper for all four propositions of the theorem.
     contradiction to LhP (M) = LhP (MR0k ).                       2       Using languages Lcsk we can show the second inclu-
                                                                        sion to be proper in all four propositions. Let k ≥ 2 be
     Corollary 3. For all types X ∈ {det-Rg2, hRR, hRRC,                an integer. It is easy to construct a k-det-hRLC-automaton
     hRRW, hRRWWC, hRRWW, hRL, hRLC, hRLW,                              Mcsk such that LC (Mcsk ) = LhP (Mcsk ) = Lcsk . By ap-
     hRLWWC, hRLWW}, all prefixes pr1 ∈ {λ , f in}, all                 plying Proposition 14 for i = k, we can construct a k-
     prefixes prefX ∈ {λ , mon, det, det-mon}, and all k ≥ 2,           det-fin(k)-hRLC-automaton Mcs0k such that LC (Mcs0k ) =
     the following holds:                                               LhP (Mcs0k ) = LhP (Mcsk , k).
     LhP (k-pr1 -prefX -X) ⊂ LhP ((k + 1)-pr1 -prefX -X) and               In order to derive a contradiction, let us assume that
     LhP (de(k)-pr1 -prefX -X) ⊂ LhP (de(k + 1)-pr1 -prefX -X).         there is a k-mon-det-hRLWWC-automaton Mk such that
On h-Lexicalized Automata and h-Syntactic Analysis                                                                                        47

      LhP (Mk ) = LhP (Mcsk , k). Let us consider the word             References
      ak+1 bk+1 c(k+1)(k−1) ∈ LhP (Mcs0k ). As this word is longer
      than 2k, Mk must use at least two cycles to accept a word w       [1] Michal P. Chytil, Martin Plátek, Jörg Vogel: A note on
      such that h(w) = ak+1 bk+1 c(k+1)(k−1) . Because of the cor-          the Chomsky hierarchy. Bulletin of the EATCS 27: 23–30
      rectness preserving property Mk must reduce w into a word             (1985)
      w1 such that h(w1 ) = ak+1 bk ck·(k−1) , and then it must re-     [2] Petr Jančar, František Mráz, Martin Plátek, Jörg Vogel:
      duce w1 to a word w2 such that h(w2 ) = ak bk ck·(k−1) . But          Restarting automata. In: Horst Reichel (ed.): FCT’95,
      these two reductions violate the condition of monotonic-              Proc., pages 283–292, LNCS 965, Springer, Berlin (1995)
      ity – a contradiction to the monotonicity constraint of Mk .      [3] Petr Jančar, František Mráz, Martin Plátek: Forgetting Au-
      Hence, the second inclusion is proper in all four proposi-            tomata and Context-Free Languages. Acta Informatica 33:
                                                                            409–420 (1996)
      tions of the theorem.                                      2
                                                                        [4] Petr Jančar, František Mráz, Martin Plátek, Jörg Vogel:
                                                                            On monotonic automata with a restart operation. J. Au-
                                                                            tom. Lang. Comb. 4: 287–311 (1999)
                                                                        [5] Petr Jančar, Martin Plátek, Jörg Vogel: Generalized linear
      5   Conclusion                                                        list automata. ITAT 2004, Univerzita P. J. Šafárika v Koši-
                                                                            ciach, 2005, p. 97–105, ISBN 80-7097-589-X (2005)
      We have introduced the h-lexicalized restarting list au-          [6] Markéta Lopatková, Martin Plátek, Vladislav Kuboň:
      tomaton (LxRLAW), which yields a formal environment                   Modeling syntax of free word-order languages: Depen-
      that is useful for expressing the lexicalized syntax in com-          dency analysis by reduction. In: Václav Matoušek, Pavel
      putational linguistics. We presented a basic variant of the           Mautner, Tomáš Pavelka (eds.), TSD 2005, Proceedings,
      Chomsky hierarchy of LxRLAW-automata, which can be                    pages 140–147, LNCS 3658, Springer, Berlin (2005)
      interpreted as a hierarchy of classes of input, basic, and h-     [7] Markéta Lopatková, Martin Plátek, Petr Sgall: Towards a
      proper languages, and a hierarchy of h-syntactic analyses             formal model for functional generative description: Anal-
      as well.                                                              ysis by reduction and restarting automata. Prague Bull.
         In the main part of the paper we have concentrated on h-           Math. Linguistics 87: 7–26 (2007)
      lexicalized (deterministic) hRLWW-automata fulfilling the         [8] Solomon Marcus: Contextual grammars and natural lan-
      constraint of weak accepting form. We have stressed the               guages. Handbook of Formal Languages, pages 215–235,
      transparency of computations of these automata for their              Springer, Berlin (1997)
      basic and h-proper languages due to the complete correct-         [9] František Mráz: Lookahead hierarchies of restarting auto-
      ness preserving property. We believe that the automata                mata. J. Autom. Lang. Comb. 6: 493–506 (2001)
      having the complete correctness preserving property con-         [10] František Mráz, Friedrich Otto, Martin Plátek: The de-
      stitute a meaningful class of automata and that they cover            gree of word-expansion of lexicalized RRWW-automata
      a significant class of languages, including the class CFL.            – A new measure for the degree of nondeterminism of
                                                                            (context-free) languages. Theoretical Computer Science
         The newly added property of weak accepting form is
                                                                            410: 3530–3538 (2009)
      particularly important for computational linguistic, since
                                                                       [11] Ladislav Nebeský: On One Formalization of Sentence
      it allows to use finite observations (languages) for classi-
                                                                            Analysis. Slovo a slovesnost, 104–107, (1962)
      fications of classes of infinite and finite languages as well,
                                                                       [12] Miroslav Novotný: With Algebra from Language to Gram-
      as we have shown in Section 4.3.
                                                                            mar and back (in Czech: S algebrou od jazyka ke gramatice
         hRLWW-automata can be considered as a refined ana-
                                                                            a zpět). Academia, Praha (1988)
      lytical counterpart to generative Marcus contextual gram-
                                                                       [13] Friedrich Otto: Restarting automata and their relations to
      mars (see e.g. [8]) and Novotny’s pure grammars (see
                                                                            the Chomsky hierarchy. In: Zoltan Esik, Zoltan Fülöp
      e.g. [12]). Contextual and pure grammars work with the                (eds.): Developments in Language Theory, Proceedings
      (complete) generative correctness preserving property. We             of DLT’2003, pages 55–74, LNCS 2710, Springer, Berlin
      have applied many useful techniques for the formalization             (2003)
      of syntactic analysis of Czech sentences from Ladislav           [14] Martin Plátek: Two-way restarting automata and j-
      Nebeský (see e.g. [11]).                                              monotonicity. In: LLeszek Pacholski, Peter Ružička (eds.):
         We plan in the close future to show a transfer from                SOFSEM’01, Proc., pages 316–325, LNCS 2234, Springer,
      the complete correctness preserving monotone hRRWW-                   Berlin (2001)
      automaton to the complete correctness preserving mono-           [15] Martin Plátek, Friedrich Otto, František Mráz:
      tone LxRLAW-automaton without any restart, which in                   On       h-Lexicalized    Restarting     List    Automata,
      fact works like a push-down automaton. Such an automa-                Technical         report,       www.theory.informatik.uni-
      ton computes in linear time. It is also possible to obtain            kassel.de/projekte/RL2016v6.4.pdf, Kassel (2017)
      similar hierarchies for this type of automata as for the
      hRRWW-automata from Section 4.3.
         We also plan to study some relaxations of the complete
      correctness preserving property.