          Towards a formal model of natural language description
         based on restarting automata with parallel DR-structures*

                             Markéta Lopatková, František Mráz, and Martin Plátek

                                     Charles University in Prague, Czech Republic
                                            lopatkova@ufal.mff.cuni.cz
                                            frantisek.mraz@mff.cuni.cz
                                             martin.platek@mff.cuni.cz

Abstract. We provide a formal model of a stratificational dependency approach to natural language description. This formal model is motivated by an elementary method of analysis by reduction, which serves for describing correct sentence analysis. The model is based on enhanced restarting automata that assign a set of parallel dependency structures to every reduction of an input sentence. These structures capture the correspondence of dependency trees on different layers of linguistic description, namely the layer of surface syntax and the layer of language meaning.
The novelty of this contribution consists in (i) the extension of enhanced restarting automata in order to produce tree structures with several interlinked layers and (ii) the application of these automata to the stratificational description of a natural language.

* The paper reports on the research supported by the grants of GAČR No. P202/10/1333, P103/10/0783, and 405/08/0681. It is carried out under the project of MŠMT ČR No. MSM0021620838.


1     Introduction

Formal modeling of the syntactic structure of a natural language, and of its syntactic analysis as well as synthesis, has an important impact on the insight into the characteristic features of the language and into the needs of its explicit description.
    We attempt to provide a formal model for natural language description which adequately reflects linguistic methods, makes it possible to formulate and refine linguistic observations, and thus deepens the understanding of the language.
    The proposed formal model is based on an elementary method of analysis by reduction. Analysis by reduction (RA henceforth, see [1, 2] and Section 1.2) serves for describing correct reductions of natural language sentences (particularly for languages with free word order) on several linguistic layers (see Section 1.1).
    The proposed model is based on the concept of enhanced restarting automata that assign a set of dependency structures (DR-structures) to every reduction of an input sentence; DR-structures can capture a set of dependency trees representing a sentence on different layers of linguistic description in a parallel way. The novelty of this approach consists in the formal presentation of the stepwise parallel composition of tree structures on different language layers.
    In [2], natural language description is modeled as a formal string-to-string translation using a suitable model of restarting automata. [3] introduces a class of enhanced restarting automata with an output consisting of a single DR-tree. Here we discuss a model which is able to represent several parallel dependency structures and thus to capture the relations between syntactic structures on different layers derived from RA.


1.1    Functional Generative Description

The theoretical linguistic basis for our research is provided by the Functional Generative Description (FGD in the sequel, see esp. [4]). FGD is characterized by its stratificational and dependency-based approach to language description.
    The stratificational approaches split language description into layers, each layer providing a complete description of a (disambiguated) sentence and having its own vocabulary and syntax. We use the version of FGD that distinguishes four layers of description:¹

t-layer (tectogrammatical layer) capturing deep syntax, which comprises language meaning in the form of a dependency tree; the core concepts of this layer are dependency, valency, and topic-focus articulation, see esp. [4];
a-layer (analytical layer) capturing surface syntax in the form of a dependency tree (non-projective in general);
m-layer (morphological layer) capturing morphology (a string of triples [word form, lemma, tag]);
w-layer (word layer) capturing individual words and punctuation marks in the form of a simple string.

¹ We adopt the notation of the Prague Dependency Treebank [5], a large corpus of Czech sentences, which uses FGD as its theoretical basis.

    There are one-to-one correspondences between the w- and m-layer (we leave aside small exceptions here)
and between the m- and a-layer; individual symbols of these three layers (surface layers in the sequel) reflect individual 'surface' words and punctuation marks. On the other hand, individual symbols of the t-layer reflect only lexical words (so-called function words, e.g. prepositions and auxiliary verbs, are captured as attributes of lexical words); moreover, surface ellipses (e.g., an elided subject in Czech) are restored as nodes on the t-layer.
    As in other stratificational approaches, see e.g. [6], the layers are ordered, the lowest one being the simplest w-layer and the highest being the most abstract t-layer.
    FGD as a dependency-based approach captures both surface and deep syntactic information in the form of dependency structures. Words (i.e., their a- and t-correlates, respectively) are represented as nodes of the respective trees, each node being a complex unit capturing the lexical, morphological and syntactic features; relations among words are represented by oriented edges [7]. The dependency nature of these representations is important particularly for languages with a relatively high freedom of word order; it also complies with the shift of focus to deep syntactic representation, for which a dependency approach is commonly used.
    The following example illustrates the description of a sentence at the four layers of FGD (slightly simplified). Such a description expresses all necessary linguistic information on a disambiguated sentence.

    Sídlo dnes mohla mít ve státě Texas.
    residence - today - could - have - in - state - Texas
    'She (= elided Sb) could reside in the state of Texas today.'

[Fig. 1. Parallel representation on t-, a-, m- and w-layers of the sample sentence according to FGD.]

Figure 1 shows the deep syntactic tree on the t-layer, the surface non-projective syntactic tree on the a-layer, the string of triples [word form, lemma, tag] on the m-layer and the string of word forms on the w-layer. The dotted lines interconnect corresponding nodes. We focus on the non-trivial relation between the a-layer and the t-layer here: (i) the preposition ve 'in' as well as the noun státě 'state' in the a-tree are linked to the single t-symbol representing the lexical word stát 'state'; (ii) similarly, both the modal verb mohla 'could' and the lexical verb mít 'have' are represented as the single t-node mít 'to have' (the information on the modal verb is preserved as the attribute 'poss'); as a result, the non-projective a-tree is transformed into the projective t-tree; (iii) moreover, the subject elided in the surface sentence is restored in the t-tree (the node with the symbol starting with #PersPron), thus this node has no counterpart on the a-layer.


1.2   Basic principles of analysis by reduction

Analysis by reduction is based on a stepwise simplification of an analyzed sentence. It defines possible sequences of reductions (deletions) in the sentence – each step of RA is represented by (i) deleting at least one word of the input sentence, or (ii) replacing an (in general discontinuous) substring of a sentence by a shorter substring. Consequently, it is possible to derive formal dependency relations between individual sentence members based on the possible order(s) of reductions.
    Using RA we analyze an input sentence (w-layer) enriched with the metalanguage information from the m-, a- and t-layers. Symbols on different layers representing a single word of an input sentence are processed simultaneously.
    The principles of RA can be summed up in the following observations:

 – The fact that a certain word (or a group of words) can be deleted implies that this word (or group of words) depends in RA on one of the words retained in the simplified sentence; the latter being called its governing word(s) in RA.
 – Two words (or groups of words) can be deleted in an arbitrary order if and only if they are mutually independent in RA.
 – In order to ensure adequate modeling of natural language meaning (on the t-layer), certain groups of words have to be deleted in a single step (e.g., valency frame evoking words² and their (valency) complementations [1]); such words are said to constitute a reduction component. Even in such cases it is usual to determine governing-dependent pairs on the layer of surface syntax (a-layer). In such a case, it is necessary to define special rules for particular language phenomena.

² A frame evoking word is a lexical word (verb, noun, adjective or adverb) that requires a set of syntactico-semantic complementations; e.g., the verb to give requires three complementations, namely an actor (ACT, who gives something), a patient (PAT, what is given) and an addressee (ADDR, to whom something is given).

    When simplifying an input sentence, it is necessary to apply certain elementary principles assuring an adequate analysis:

 – principle of correctness: the simplified sentence must be correct; this principle is applied on all layers of language description;
 – principle of completeness: the simplified sentence must be complete with respect to its valency structure, i.e., each frame evoking word must be 'saturated' on the t-layer [1];
 – principle of shortening: at least one 'surface' word (i.e., its correlates on the w-, m- and a-layers) must be deleted in each step of RA;
 – principle of layers: each step of RA must concern all symbols, i.e., symbols from all layers, representing a particular processed word;
 – principle of minimality: each step of RA must be 'minimal' – any potential reduction step concerning fewer symbols in the sentence would violate the principle of completeness.

    These principles imply that in a single reduction step, either (i) item(s) representing a single free modification or (ii) items representing the valency complementations of a single frame evoking word together with their governing word are processed.
    The sentence is simplified until the so-called core predicative structure is reached (typically formed by the sentence predicate and its valency complementations).
    The basic principles of RA are exemplified on several reduction steps of our sample Czech sentence from Section 1.1; they illustrate how the sentence is simplified and how the fragments of its DR-structure (a- and t-trees) are built.

    Sídlo dnes mohla mít ve státě Texas.
    residence - today - could - have - in - state - Texas
    'She (= elided Sb) could reside in the state of Texas today.'

    w-layer   m-layer        a-layer   t-layer
    Sídlo     sídlo.NNNS4    Obj       PAT
                                       ACT
    dnes      dnes.Db---     Adv       TWHEN
    mohla     moci.VpYS-     Pred
    mít       mít.Vf---      Obj       PRED.Frame1.poss
    ve        v-1.RV--6      AuxP
    státě     stát.NNIS6     Adv       LOC
    Texas     Texas.NNIS6    Atr       ID
    .         ..Z:           AuxK

Fig. 2. Representation of the sample sentence at the four layers of FGD (simplified).

Figure 2 presents a (simplified) representation of the sample sentence at the four layers of FGD. Each column captures one layer of language description (w-, m-, a- and t-layer, respectively, see Section 1.1). Rows capture information related to particular words and punctuation marks (one or more rows for an individual word/punctuation mark, depending on its surface and deep word order, see [2]).
    We can see that the sentence contains a verbonominal predicate (the predicate consisting of the light verb mít 'to have' and the noun sídlo 'residence'); this predicate evokes two valency positions, an actor (ACT) and a local modification (LOC); the nominal part of the predicate is analyzed as the patient (PAT) of the verb in FGD (encoded as Frame1 ... ACT PAT LOC).
    There are two possible orders of reductions:
(1) In the first reduction step of RA, the word Texas is reduced – the simplified sentence is grammatically correct and it is complete (i.e., its valency structure is complete). The respective symbols on all layers (interlinked by dotted lines in Figure 1) are processed simultaneously: those on the w- and m-layers are deleted; the a-symbol is analyzed as depending on the a-symbol for the preceding noun stát 'state' (as its syntactic attribute, Atr); further, the t-symbol for Texas is analyzed as depending on the t-symbol for stát 'state' (ID indicates a proper name).
(2) Alternatively, RA may start with the reduction step processing the word dnes 'today'. Again, the respective symbols on the w- and m-layers are deleted; based on the surface syntactic and morphological categories, the a-symbol and t-symbol are included in the a- and t-structures, respectively.
    After processing the words Texas and dnes, RA continues with the predicate and its complementations. As they form a reduction component, they must be processed in a single step on the t-layer (otherwise the principle of completeness on the t-layer would be violated). Thus the sentence is simplified on the surface layers (i.e., the w-, m- and a-layers) first, and only when all valency complementations are reduced there can the frame evoking predicate and its complementations be processed on the t-layer.
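The mutual independence of the words Texas and dnes can also be illustrated directly in code. The following is a minimal sketch (our own illustration, not part of the formal model): it hard-codes the layered symbols of Fig. 2 and checks that the two orders of the first two reduction steps yield the same simplified sentence. The helper reduce_word, the flat-list representation, and the removal of the t-symbol (instead of attaching it to the t-structure) are simplifying assumptions.

    # Each item carries the symbols of one word on the w-, m-, a- and t-layers
    # (cf. Fig. 2); None marks a layer on which the word has no counterpart.
    SENTENCE = [
        ("Sídlo", "sídlo.NNNS4", "Obj",  "PAT"),
        (None,    None,          None,   "ACT"),   # subject restored on the t-layer only
        ("dnes",  "dnes.Db---",  "Adv",  "TWHEN"),
        ("mohla", "moci.VpYS-",  "Pred", None),
        ("mít",   "mít.Vf---",   "Obj",  "PRED.Frame1.poss"),
        ("ve",    "v-1.RV--6",   "AuxP", None),
        ("státě", "stát.NNIS6",  "Adv",  "LOC"),
        ("Texas", "Texas.NNIS6", "Atr",  "ID"),
        (".",     "..Z:",        "AuxK", None),
    ]

    def reduce_word(sentence, word):
        """One RA step: remove the symbols of `word` on all layers
        simultaneously (principle of layers)."""
        return [item for item in sentence if item[0] != word]

    # The two orders of reductions discussed above lead to the same result,
    # i.e. Texas and dnes are mutually independent in RA:
    order1 = reduce_word(reduce_word(SENTENCE, "Texas"), "dnes")
    order2 = reduce_word(reduce_word(SENTENCE, "dnes"), "Texas")
    assert order1 == order2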
Let us describe the analysis of this component in more detail.
    First, the prepositional group ve+státě 'in (the) state' is identified in the sentence – it is processed in a single step on the 'surface' layers (its w- and m-symbols are deleted and its a-symbols are included into the a-structure as an adverbial modification of the verb mít 'to have'). The whole prepositional group will be represented as a single t-node stát, see Figure 1. It will be identified as a local valency complementation LOC on the t-layer. Second, the noun sídlo 'residence' is processed on the surface layers and marked as the valency complementation PAT on the t-layer. Third, the symbol for the elided subject, which is restored on the t-layer, is marked as the valency complementation ACT. All the valency complementations of the predicate are identified now; the valency frame Frame1 is saturated.
    Next, we focus on the modal verb mohla '(she) could'. On the a-layer, this is the governing node of the lexical verb mít; the respective edge is created, which results in a non-projective a-tree (the symbols on the surface layers are deleted). On the other hand, modal verbs are considered functional verbs in FGD and thus represented as attributes of lexical verbs in t-trees (of mít in our case, as the value 'poss' in the respective t-node). Thus the non-projectivity is eliminated in the t-tree.
    Now the predicate with its complementations ACT, PAT and LOC (Frame1) can be identified as the core predicative structure on the t-layer, the relevant edges for the individual valency complementations are created and the simplified sentence is accepted (in the accepting step, the final full stop is processed on the surface layers).


2   Restarting automata

First, we introduce the relevant type of a simple restarting automaton – the sRL-automaton – rather informally. For technical reasons, we do it in a slightly different way than in [8].
    An sRL-automaton M is (in general) a nondeterministic machine with a finite-state control Q, a finite characteristic alphabet Γ, and a head (window of size 1) that works on a flexible tape delimited by the left sentinel ¢ and the right sentinel $ (¢, $ ∉ Γ). For an input w ∈ Γ*, the initial tape inscription is ¢w$. To process this input, M starts in its initial state q0 with its window over the left end of the tape, scanning the left sentinel ¢. According to its transition relation, M performs the following operations in the individual steps:

 – move to the right or to the left – shifts the head one position to the right or to the left;
 – dl – deletes the visited symbol and shifts the head to its right neighbor;
 – wr[b] – rewrites the visited symbol by the symbol b;
 – pb – serves for marking (putting a pebble on) the visited item only; marked items are used as nodes in DR-trees (in any other respect it is an empty operation, see later);
 – accept – halts the computation and accepts the input word.

    Of course, neither the left sentinel ¢ nor the right sentinel $ may be deleted. At the right end of the tape M either halts and accepts, or it halts and rejects, or it restarts, that is, it places its window over the left end of the tape and reenters the initial state. It is required that prior to the first restart step and also between any two restart steps, M executes at least one delete operation. During each step, M can change its internal state.
    We can see that any finite computation of M consists of certain phases. A phase, called a cycle, starts in a restarting configuration; the window is moved along the tape by performing operations until a restart operation is performed and thus a new restarting configuration is reached. If no further restart operation is performed, each finite computation necessarily finishes in a halting configuration – such a phase is called a tail. We assume that no delete and no rewrite operation is executed in a tail computation.
    The notation u ⊢^c_M v denotes a reduction performed during a cycle of M that begins with the tape inscription ¢u$ and ends with the tape inscription ¢v$; the relation ⊢^{c*}_M is the reflexive and transitive closure of ⊢^c_M.
    A string w ∈ Γ* is accepted by M if there is an accepting computation which starts in the restarting configuration with the tape inscription ¢w$ and ends by executing the accept operation. By LC(M) we denote the language consisting of all words accepted by M; we say that M recognizes (accepts) the characteristic language LC(M).
    Further, we will refer to an sRL-automaton M as a tuple M = (Γ, ¢, $, R(M), A(M)), where Γ is a characteristic vocabulary (alphabet), ¢ and $ are sentinels not belonging to Γ, R(M) is a finite set of restarting instructions over Γ and A(M) is a finite set of accepting meta-instructions over Γ.

Remark: sRL-automata are two-way nondeterministic automata which allow checking the whole input sentence prior to any changes. This resembles a linguist who can read the whole sentence first and reduce it in a correct way afterwards. We choose a nondeterministic model to enable various orders of reductions; this can serve for the verification of independence between individual parts of a sentence.
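To make the notions of tape, operations and cycles concrete, here is a minimal sketch in Python (our own simplification, not the formal definition): one cycle of an sRL-automaton is modeled as a batch of operations applied to the tape, with the sentinels written as "¢" and "$".

    def run_cycle(tape, steps):
        """Apply (position, operation) pairs to the tape (a list of symbols),
        then restart.  Operations: ("dl",) deletes the visited symbol,
        ("wr", b) rewrites it by b, ("pb",) only marks it (a no-op here)."""
        assert tape[0] == "¢" and tape[-1] == "$"
        deleted = 0
        for pos, op in sorted(steps, reverse=True):  # right to left keeps positions valid
            assert 0 < pos < len(tape) - 1           # the sentinels must not be touched
            if op[0] == "dl":
                del tape[pos]
                deleted += 1
            elif op[0] == "wr":
                tape[pos] = op[1]
        assert deleted >= 1                          # at least one delete per cycle
        return tape                                  # the next restarting configuration

    # one reduction u ⊢ v: delete one a, one b, and rewrite one c to X
    print(run_cycle(list("¢aabbcc$"), [(2, ("dl",)), (4, ("dl",)), (5, ("wr", "X"))]))
    # ['¢', 'a', 'b', 'X', 'c', '$']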
    Similarly as in [8], we use a restricted type of sRL-automata for which the number of rewrite, delete and pebble operations made per cycle is limited by a constant. Such sRL-automata can be described by (meta-)instructions, which describe the moves of the head and the changes of the states implicitly. Each cycle of M is described by a single restarting instruction over Γ of the following form:

I_R = (¢·E0, [a1]1 o1, E1, [a2]2 o2, E2, ..., E_{s−1}, [as]s os, Es·$, Restart), where

 – E0, E1, ..., Es (s > 0) are regular languages over Γ (usually represented by regular expressions);
 – o1, ..., os ∈ {dl, pb, wr[b]}, where b ∈ Γ;
 – the symbols a1, a2, ..., as ∈ Γ correspond to the symbols on which the corresponding operations o1, ..., os are executed.

    Let us define an auxiliary function o: Γ → Γ ∪ {λ} for each operation o ∈ {dl, pb, wr[b]}:

 – pb(ai) = ai,
 – dl(ai) = λ, and
 – wr[b](ai) = b.

    When trying to execute I_R starting from a tape inscription ¢w$, M will get stuck (and so reject) if w does not admit a factorization of the form w = v0 a1 v1 a2 ... v_{s−1} as vs such that vi ∈ Ei for all i = 0, ..., s. On the other hand, if w admits factorizations of this form, then one of them is chosen nondeterministically, and ¢w$ is transformed (reduced) into ¢v0 o1(a1) v1 ... v_{s−1} os(as) vs$.
    Tails of accepting computations are described by accepting instructions over Γ of the form

I_A = (¢·E0, [a1]1, E1, [a2]2, E2, ..., E_{s−1}, [as]s, Es·$, Accept),

where the individual Ei are regular languages over Γ. For our linguistic application (i.e., modeling FGD), we consider accepting instructions with finite Ei's only.
    A tail performed by the instruction I_A starts with the inscription ¢z$ on the tape; if z ∈ E0 a1 E1 a2 E2 ... E_{s−1} as Es, then M accepts z (and the whole computation as well). Otherwise the computation halts with rejection. This special form of accepting instruction is introduced with regard to the future enhancements of restarting automata.
    The class of all sRL-automata is denoted as sRL.
    A crucial role in our applications is played by the following property of restarting automata.

(Correctness Preserving Property) An sRL-automaton M is correctness preserving if u ∈ LC(M) and u ⊢^{c*}_M v imply that v ∈ LC(M).

    It is already known that all deterministic sRL-automata are correctness preserving. On the other hand, it is easy to design a nondeterministic sRL-automaton which is not correctness preserving (see [8]). In the sequel, we consider only correctness preserving automata in order to model the analysis by reduction adequately.


2.1   Restarting automata enhanced with DR-structures

In this section we introduce enhanced restarting automata, so-called sERL-automata. During their computations, these automata build structures consisting of deleted, rewritten, or marked items and of directed edges between pairs of such items.
    Enhanced restarting automata (sERL-automata) were formally introduced in a restricted form in [3]. Their formal definition is rather long and very technical, so we prefer to use an informal description of the model and concentrate on its possible applications. In contrast to sRL-automata, a so-called DR-structure can be attached to any item of the tape of an sERL-automaton.
    A DR-structure is a slight generalization of the DR-trees used in [7]. It is an oriented graph D = (V, Hd), where V is a finite set of nodes and Hd is a finite set of edges. A node u ∈ V is a tuple u = [i, j, a], where a is a symbol assigned to the node and i, j are natural numbers; i represents the horizontal position of the node u, and j represents the vertical position of u (it is equal to 0 or to the number of nodes with the same horizontal position i from which there are oriented paths to u). An edge h of D is an ordered pair of nodes h = (u, v). We define two types of edges:

 – oblique edge: h = (u, v), where u = [iu, ju, a], v = [iv, jv, b] and iu ≠ iv;
 – vertical edge: h = (u, v), where u = [iu, ju, a], v = [iv, jv, b] and iu = iv, jv = ju + 1.

We say that D = (V, Hd) is a DR-tree if the graph D is an oriented tree (with a single root, in which all maximal paths in D end).
    Each sERL-automaton Me is actually an sRL-automaton with enhanced instructions. An enhanced instruction is a pair Ie = (I, G) consisting of an instruction I of an sRL-automaton and an acyclic graph G representing the required structure for the symbols processed – deleted, rewritten or marked (pebbled) – during the application of the instruction I. The restrictions put on the set of edges of G are described below, together with the application of an enhanced instruction Ie = (I, G), where G = (U, H), on a tape containing ¢w$.
    All the symbols on the tape are stored in so-called items which are actually nodes of a DR-structure, i.e., a symbol x is stored in a node [i, j, x], where i is its horizontal position. Initially, the horizontal position of the n-th input symbol equals n, j = 0, and there are no edges between the items.
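Under these conventions, the node and edge types of a DR-structure can be sketched as follows (a small illustration of the definitions above, not part of the formal model):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Node:
        i: int   # horizontal position
        j: int   # vertical position
        a: str   # the symbol assigned to the node

    def edge_type(u: Node, v: Node) -> str:
        """Classify an edge (u, v) according to the definition above."""
        if u.i != v.i:
            return "oblique"
        if v.j == u.j + 1:
            return "vertical"
        raise ValueError("not a legal DR-structure edge")

    # e.g., the rewriting step [7,0,c] -> [7,1,X] of Example 1 below yields a
    # vertical edge, while an edge between items at different horizontal
    # positions is oblique:
    assert edge_type(Node(7, 0, "c"), Node(7, 1, "X")) == "vertical"
    assert edge_type(Node(3, 0, "a"), Node(1, 0, "a")) == "oblique"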
Restarting instruction. If I is a restarting instruction I = (¢·E0, [a1]1 o1, E1, [a2]2 o2, E2, ..., E_{s−1}, [as]s os, Es·$, Restart), then o1, ..., os ∈ {dl, pb, wr[b]} (b ∈ Γ) are the operations performed on the symbols a1, ..., as. Let o_{i1} = wr[b_{i1}], ..., o_{ir} = wr[b_{ir}], for some r ≥ 0, be all the rewrite operations from {o1, ..., os}. Let G = (U, H), where U = {1, 2, ..., s, i'1, i'2, ..., i'r}, and let the nodes correspond to the symbols a1, ..., as and to the symbols b_{i1}, ..., b_{ir}, respectively. An edge (u, v) ∈ H is of one of the following forms:

 1. deleting: u = i corresponds to a deleted symbol ai (hence oi = dl, 1 ≤ i ≤ s) and v = j for some j ∈ U, j ≠ i. Let us note that a deleting edge is always oblique.
 2. rewriting: u = i corresponds to a rewritten symbol ai (hence oi = wr[b_{ij}], b_{ij} ∈ Γ, 1 ≤ i ≤ s, 1 ≤ j ≤ r) and v = i'j. Let us note that a rewriting edge is always vertical.

    An application of Ie on a word w consists in:

 1. Choosing a factorization of ¢w$ of the form w = v0 a1 v1 a2 ... v_{s−1} as vs such that vi ∈ Ei for all i = 0, ..., s. If w does not admit any factorization of this form, then I cannot be applied on w.
 2. Rewriting the tape containing ¢w$ into the tape containing ¢v0 o1(a1) v1 ... v_{s−1} os(as) vs$.
 3. For each edge e ∈ H a new edge is inserted into the current DR-structure.
    If e = (i, j) is a deleting edge, an oblique edge is inserted from the item containing the deleted ai into the item containing the symbol corresponding to j (either aj, if 1 ≤ j ≤ s, or bj, when j ∈ {i'1, ..., i'r}).
    If e = (i, j) is a rewriting edge, a vertical edge is inserted from the item containing ai into the item containing bj (j ∈ {i'1, ..., i'r}), and the vertical position of the new item containing bj is set to the value q + 1, where q is the vertical position of the item containing ai.
    If there was a DR-structure attached to some of the deleted or rewritten cells, the structure is preserved and combined into a larger graph.

Example 1: Let I1 = ((¢·a*, [a]1 pb, λ, [a]2 dl, λ, [b]3 dl, b*X*, [c]4 wr[X], c*·$, Restart), D1) be an enhanced restarting instruction with the graph D1 of the following form:

[Figure: the graph D1 with the nodes 1 and 4' in the upper row and the nodes 2, 3, 4 in the lower row.]

Then two consecutive applications of I1 on the word aaabbbccc result in the following sequence of tape contents (in the original figure, the tape content of each configuration consists of the items depicted in the upper horizontal row; deleted and rewritten items hang below it):

    [0,0,¢][1,0,a][2,0,a][3,0,a][4,0,b][5,0,b][6,0,b][7,0,c][8,0,c][9,0,c][10,0,$]

    [0,0,¢][1,0,a][2,0,a][5,0,b][6,0,b][7,1,X][8,0,c][9,0,c][10,0,$]
        (with the deleted items [3,0,a], [4,0,b] and the rewritten item [7,0,c] below the tape)

    [0,0,¢][1,0,a][6,0,b][7,1,X][8,1,X][9,0,c][10,0,$]
        (with the further deleted items [2,0,a], [5,0,b] and the rewritten item [8,0,c] below the tape)

Accepting instruction. If I is an accepting instruction I = (¢·E0, [a1]1 pb, E1, [a2]2 pb, E2, ..., E_{s−1}, [as]s pb, Es·$, Accept), then the symbols a1, ..., as are pebbled and they correspond to the nodes in U = {1, 2, ..., s}. All edges (u, v) ∈ H are oblique and have the same properties as the deleting edges in graphs enhancing restarting instructions.
    An application of Ie = (I, G), with G = (U, H), on a word w consists in:

 1. Choosing a factorization of ¢w$ of the form w = v0 a1 v1 a2 ... v_{s−1} as vs such that vi ∈ Ei for all i = 0, ..., s. If w does not admit any factorization of this form, then I cannot be applied on w.
 2. For each edge e = (i, j) in H a new oblique edge is inserted into the current DR-structure. The inserted edge starts from the item containing ai and ends in the item containing the symbol aj.
    If there was a DR-structure attached to some of the connected items, the structure is preserved and combined into the resulting graph.

Example 2: Let I2 = ((¢, [a]1 pb, λ, [b]2 pb, X*, [c]3 pb, $, Accept), D2) be an enhanced accepting instruction with the graph D2 of the following form:

[Figure: the graph D2 with the node 3 drawn above the nodes 1 and 2.]

Then, after applying I2 on the resulting DR-structure from Example 1, we obtain the following final DR-structure:

[Figure: the final DR-structure with [9,0,c] on top; [1,0,a], [6,0,b], [7,1,X], [8,1,X] below it; [2,0,a], [5,0,b], [7,0,c], [8,0,c] on the next level; and [3,0,a], [4,0,b] at the bottom.]
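The tape contents of Example 1 can be reproduced by a short script. The following sketch is our own re-implementation of the instruction semantics, not the authors' code: the factorization ¢·a* [a] λ [a] λ [b] b*X* [c] c*·$ required by I1 is found with a regular expression (regex backtracking plays the role of the nondeterministic choice), and the operations pb, dl and wr[X] are applied to the matched symbols; the DR-structure edges are omitted here.

    import re

    # groups: v0 ∈ a*, a1 = a (pb), a2 = a (dl), a3 = b (dl), v3 ∈ b*X*,
    #         a4 = c (wr[X]), v4 ∈ c*
    I1 = re.compile(r"^(a*)(a)(a)(b)([bX]*)(c)(c*)$")

    def apply_I1(word):
        m = I1.match(word)
        if m is None:
            return None                    # I1 is not applicable: M gets stuck
        v0, a1, a2, a3, v3, a4, v4 = m.groups()
        return v0 + a1 + v3 + "X" + v4     # pb keeps a1; dl drops a2, a3; wr[X] rewrites a4

    w = "aaabbbccc"
    while True:
        print(w)                           # prints aaabbbccc, aabbXcc, abXXc
        nxt = apply_I1(w)
        if nxt is None:
            break
        w = nxt

    # the final word is accepted by the accepting instruction I2 of Example 2:
    assert re.fullmatch(r"abX*c", w)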
Computation of enhanced restarting automata. A (restarting) configuration C = (T, D) of a computation by Me = (Γ, ¢, $, ER(Me), EA(Me)) consists of the set T of items representing the current tape content and of a DR-structure D. By an application of an enhanced instruction I, the tape content is changed and the DR-structure can grow. The initial configuration C0 = (Tw, D∅) for an input word w = x1 ... xn consists of the set of items representing the initial content of the tape, Tw = {[0, 0, ¢]} ∪ {[i, 0, xi] | i = 1, ..., n} ∪ {[n + 1, 0, $]}, and of the empty DR-structure D∅ = (∅, ∅). By an application of an enhanced instruction I, a configuration C is transformed into a new configuration C′.
    An input word w is accepted by Me if there exists a computation starting with the initial configuration for w which ends in a configuration Ca by an application of an accepting enhanced instruction. The result of the computation is the DR-structure of Ca.
    Similarly as for sRL-automata, we define the characteristic language of Me as LC(Me) = {w | w ∈ Γ* and Me accepts w}. Moreover, the DR-language of Me is the set DR(Me) = {D | D is a result of some accepting computation of Me}.


2.2   Enhanced automata with several layers

First, we introduce the technical notion of a projection. Let Σ and Γ (⊃ Σ) be alphabets. The projection from Γ* onto Σ*, denoted as PrΣ, is the morphism defined as a ↦ a (for a ∈ Σ) and A ↦ λ (for A ∈ Γ ∖ Σ). Similarly, we define the projection of languages: PrΣ: P(Γ*) → P(Σ*).
    Similarly as above, we introduce a projection for DR-structures. Let D be a DR-structure from a DR-language over an alphabet Γ; PrΣ(D) denotes the DR-structure over Σ that is obtained from D by removing all nodes with a symbol from Γ ∖ Σ and all edges incident with those nodes. The projection may be extended in an obvious way to a projection of a DR-language over Γ. (Both projections are illustrated by the sketch at the end of this subsection.)
    Let Σ1, ..., Σj, for some j ≥ 1, be a sequence of pairwise disjoint alphabets and Γ = Σ1 ∪ ... ∪ Σj. We say that an sERL-automaton Mj = (Γ, ¢, $, ER(Mj), EA(Mj)) is an enhanced simple sERL-automaton with j layers (a (j)-sERL-automaton for short) if the following assumptions are fulfilled:

 – Mj is correctness preserving;
 – Mj is allowed to rewrite a symbol from Σi (for 1 ≤ i ≤ j) only by a symbol from the same sub-vocabulary Σi. We refer to the symbols from Σi as the symbols on layer i (or i-symbols).
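Both projections may be sketched as follows (a toy illustration with a two-layer alphabet; the representation of DR-structures as sets of node triples and edge pairs follows Section 2.1):

    def pr_string(word, sigma):
        """PrΣ on strings: a -> a for a ∈ Σ; symbols outside Σ map to λ."""
        return [a for a in word if a in sigma]

    def pr_dr(nodes, edges, sigma):
        """PrΣ on DR-structures: drop the nodes whose symbol lies outside Σ,
        together with all incident edges."""
        kept = {u for u in nodes if u[2] in sigma}
        return kept, {(u, v) for (u, v) in edges if u in kept and v in kept}

    # a toy two-layer alphabet: Σ1 = {"a"}, Σ2 = {"A"}
    print(pr_string(["a", "A", "a", "A"], {"a"}))       # ['a', 'a']
    nodes = {(1, 0, "a"), (2, 0, "A"), (3, 0, "a")}
    edges = {((1, 0, "a"), (2, 0, "A")), ((3, 0, "a"), (1, 0, "a"))}
    print(pr_dr(nodes, edges, {"a"}))  # the edge into the removed A-node disappears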
3   FGD as a (4)-sERL-automaton

Our ultimate goal is to model FGD – we consider a (4)-sERL-automaton MFD to be a suitable formal frame for this linguistic theory.
    Let MFD = (Γ, ¢, $, ER(MFD), EA(MFD)); Γ consists of four parts, Γ = Σw ∪ Σm ∪ Σa ∪ Σt, which correspond to the respective layers of FGD (Section 1.1). Recall that the symbols from the individual layers can be quite complex.
    The language of layer ℓ ∈ {w, m, a, t} accepted by MFD is obtained as the projection of the characteristic language onto Σℓ, i.e., Lℓ(MFD) = PrΣℓ(LC(MFD)).
    The characteristic language LC(MFD) contains input sequences (over Σw) interleaved with metalanguage information in the form of symbols from Σm ∪ Σa ∪ Σt. Then, the language of correct sentences of the studied natural language is Lw = PrΣw(LC(MFD)). In our case, it defines the set of correct Czech sentences.
    Similarly, the DR-language of layer ℓ accepted by MFD is obtained as DRℓ(MFD) = PrΣℓ(DR(MFD)), ℓ ∈ {w, m, a, t}. Let us note that DRw(MFD) and DRm(MFD) are empty (Lw and Lm are string languages). Further, DRa(MFD) and DRt(MFD) are languages of DR-trees. Each DR-tree T ∈ DRt(MFD) is projective (with respect to its descendants); that is, for each node n of the DR-tree T, all its descendants (including the node n itself) constitute a contiguous sequence in the horizontal ordering of the nodes of the tree T (see the sketch below). On the other hand, the trees from DRa(MFD) can in general be non-projective.
    The DR-language DRt(MFD) represents the set of meaning descriptions in FGD, whereas DRa(MFD) models the set of (surface) syntactic trees.
    Let us note that Lt(MFD) is designed as a deterministic context-free language. Readers familiar with restarting automata can see that LC(MFD) is a deterministic context-sensitive language and Lw(MFD) is a context-sensitive language.
    We have not mentioned so far those edges of the DR-structures from DR(MFD) which connect nodes (their symbols) on different layers. These edges serve for connecting the corresponding lexical units on different layers (see the dotted lines in Figure 1) and are obtained by applications of extended restarting meta-instructions of MFD.
    Here such an edge connects nodes on neighboring layers only, i.e., these edges connect w-layer nodes to m-layer nodes, m-layer nodes to a-layer nodes, a-layer nodes to t-layer nodes, and nothing else (see Figure 1).
Concluding remarks. In this paper, encouraged by [9, 10], we extend the formal model of natural language description based on FGD so that it outputs neither lists of words nor lists of symbols, but the
so-called reduction language and a set of complex DR-structures. We plan to study the relation between reduction languages and DR-languages more deeply in the near future.
    The novelty of this contribution consists in (i) the extension of enhanced restarting automata in order to produce tree structures with several interlinked layers and (ii) the application of these automata to the stratificational description of a natural language. Moreover, we outline a formalization of the basic methodology of FGD in terms derived from automata theory and from the theory of parsing schemata as well. We envisage that the proposed methodology is not FGD-specific and that a similar approach can be used to obtain a formal frame for other language descriptions, e.g. those presented in [6] and [11].


References
 1. M. Lopatková, M. Plátek, V. Kuboň: Modeling syntax of free word-order languages: dependency analysis by reduction. In: Matoušek, V., Mautner, P., Pavelka, T. (eds), Proceedings of Text, Speech and Dialogue International Conference, TSD 2005, Volume 3658 of LNAI, Berlin Heidelberg, Springer-Verlag, 2005, 140–147.
 2. M. Lopatková, M. Plátek, P. Sgall: Towards a formal model for functional generative description, analysis by reduction and restarting automata. The Prague Bulletin of Mathematical Linguistics, 87, 2007, 7–26.
 3. M. Plátek, F. Mráz, M. Lopatková: Restarting automata with structured output and functional generative description. In: Dediu, A., Fernau, H., Martín-Vide, C. (eds), Proceedings of the Fourth International Conference Language and Automata Theory and Applications, LATA 2010, Berlin Heidelberg, Springer-Verlag, 2010, 500–511.
 4. P. Sgall, E. Hajičová, J. Panevová: The Meaning of the Sentence in Its Semantic and Pragmatic Aspects. Reidel, Dordrecht, 1986.
 5. J. Hajič, E. Hajičová, J. Panevová, P. Sgall, P. Pajas, J. Štěpánek, J. Havelka, M. Mikulová: Prague Dependency Treebank 2.0. Linguistic Data Consortium, Philadelphia, PA, USA, 2006.
 6. I.A. Mel'čuk: Dependency Syntax: Theory and Practice. State University of New York Press, Albany, 1988.
 7. T. Holan, V. Kuboň, K. Oliva, M. Plátek: Two useful measures of word order complexity. In: Polguère, A., Kahane, S. (eds), Processing of Dependency-Based Grammars: Proceedings of the Workshop (COLING/ACL'98), Montreal, ACL, 1998, 21–28.
 8. H. Messerschmidt, F. Mráz, F. Otto, M. Plátek: Correctness preservation and complexity of simple RL-automata. In: Implementation and Application of Automata, Volume 4094 of LNCS, Berlin Heidelberg, Springer-Verlag, 2006, 162–172.
 9. R. Gramatovici, C. Martín-Vide: Sorted dependency insertion grammars. Theor. Comput. Sci., 354(1), 2006, 142–152.
10. S. Bensch, F. Drewes: Millstream systems. Report UMINF 09.21, Umeå University, 2009.
11. J. Kunze: Abhängigkeitsgrammatik. Volume XII of Studia Grammatica. Akademie Verlag, Berlin, 1975.