Linguistics behind the mirror

                                                       Karel Oliva

                                   Institute of the Czech Language AS CR, v. v. i.
                         Letenská 123/4, Praha 1 - Malá Strana, CZ - 118 51, Czech Republic

Abstract. A natural language is usually modelled as a subset of the set T∗ of strings (over
some set T of terminals) generated by some grammar G. Thus, T∗ is divided into two disjoint
classes: grammatical and ungrammatical strings (any string not generated by G is considered
ungrammatical). This approach brings along the following problems:
  – on the theoretical side, it is impossible to rule out clearly unacceptable yet
    “theoretically grammatical” strings (e.g., strings with multiple centre self-embeddings,
    cf. The cheese the lady the mouse the cat the dog chased caught frightened bought
    cost 10 £),
  – on the practical side, it impedes systematic build-up of such computational linguistics
    applications as, e.g., grammar-checkers.
In an attempt to lay a theoretical foundation enabling the solution of these problems, the
paper first proposes a tripartition of the stringset into:
  – clearly grammatical strings,
  – clearly ungrammatical strings,
  – strings with unclear (“on the verge”) grammaticality status
and, based on this, concentrates on
  – techniques for systematic discovery and description of clearly ungrammatical strings,
  – the impact of the approach on the theory of grammaticality,
  – an overview of simple ideas about applications of the above in building grammar-checkers
    and rule-based part-of-speech taggers.

1    Introduction

Apart from deciding on the membership of a particular string σ in a particular language L,
a formal grammar is usually assigned an additional task: to assign each string from the
language L some (syntactic) structure. The idea behind this is that the property of having
a structure differentiates the strings σ ∈ L from all “other” strings ω ∉ L, i.e. having
a structure differentiates sentences from “nonsentences”. Due to this, the task of identifying
whether a string belongs to a language (the set membership) and the task of assigning the
string its structure are often viewed as in fact identical. In other words, the current
approach to syntactic description supposes that any string ω ∈ T∗ which cannot be assigned
a structure by the respective grammar is to be considered (formally) ungrammatical. Closely
linked to this is also the presupposition that the borderline between strings which are
grammatical and those which are ungrammatical is sharp and clear-cut.

Even elementary language practice (e.g., serving as a native-speaker informant for fellow
linguists, or teaching one’s mother tongue) shows that this presupposition does not hold in
reality. The realistic picture is much more like the one in Fig. 1: there are strings which
are considered clearly correct (“grammatical”) by the native speakers, there are other ones
that are doubtless incorrect (out of the language, “informally ungrammatical”, unacceptable
for native speakers), and there is a non-negligible set of strings whose status wrt.
correctness (acceptability, grammaticality) is not really clear and/or where opinions of the
native speakers differ (some possibly tending more in this, others more in the other
direction, etc.).

Assuming the better empirical adequacy of the picture in Fig. 1, the objective of this paper
will be to propose that a syntactic description of (some natural) language L should consist
of:

  – a formal grammar G defining the set L(G) of doubtlessly grammatical strings (L(G) ⊆ L).
    Typically, the individual components of G (rules, principles, constraints, ...) are based
    on a structure assigned to a string, either directly (mentioning, e.g., the constituent
    structure) or indirectly, operating with other syntactically assigned features (such as
    subject, direct object, etc.). Since the description of the “clearly correct” strings via
    such a grammar is fairly standard, it will not be further dealt with here,
  – a formal “ungrammar” U defining the set L(U) of doubtlessly ungrammatical strings.
    Typically, any individual component (“unrule”) of U would be based on lexical
    characteristics only, i.e. it would take recourse neither to any structure of a string
    nor to other syntactic characteristics (such as being a subject etc.), not even
    indirectly.

Unlike the standard approach, such a description allows also for the existence of a non-empty
set of strings which belong to neither clearly grammatical nor clearly ungrammatical strings;
more formally, such a description allows for a nonempty set T∗ \ (L(G) ∪ L(U)). Apart from
this, the explicit knowledge of the set L(U) of ungrammatical strings allows
for straightforward development of important applications (cf. Sect. 4).

Fig. 1. The set T∗ divided into three regions: "clearly" correct strings (sentences),
"clearly" incorrect strings, and strings with uncertain/unclear grammaticality status.

2    The unrules of the ungrammar

The above abstract ideas call for methods for discovering and describing the “unrules” of the
“ungrammar”. In doing this, the following two points can be postulated as starters:
  – grammaticality/ungrammaticality is defined for whole sentences (i.e. not for subparts of
    sentences only, at least not in the general case)
  – ungrammaticality occurs (only) as a result of violation of some linguistic phenomenon or
    phenomena within the sentence.

Since any “clear” error consists of violation of a language phenomenon, it seems reasonable
that the search for incorrect configurations be preceded by an overview and classification of
phenomena fit to the current purpose.

From the viewpoint of the way of their manifestation in the surface string, (syntactic)
phenomena can be divided into three classes:

selection phenomena: in a rather broad understanding, selection (as a generalized notion of
subcategorisation) is the requirement for a certain element (a syntactic category, sometimes
even a single word) E1 to occur in a sentence if another element E2 (or: set of elements
{E2, E3, ..., En}) is present, i.e. if E2 (or: {E2, E3, ..., En}) occur(s) in a string but
E1 does not, the respective instance of selection phenomena is violated and the string is to
be considered ungrammatical.
Example: in English, if a non-imperative finite verb form occurs in a sentence, then also
a word functioning as its subject must occur in the sentence (cf. the contrast in
grammaticality between She is at home. vs. *Is at home.).

(word) order phenomena: word order rules are rules which define the mutual ordering of (two
or more) elements E1, E2, ... occurring within a particular string; if this ordering is not
kept, then the respective word order phenomenon is violated and the string is to be
considered ungrammatical.
Example: in an English do-interrogative sentence consisting of a finite form of the auxiliary
verb do, of a subject position filled in by a noun or a personal pronoun in nominative, of
a base form of a main verb different from be and have, and of the final question mark, the
order must necessarily follow the pattern just used for listing the elements, or, in an echo
question, it must follow the pattern of a declarative sentence. If this order is not kept,
the string is ungrammatical (cf. Did she come?, She did come? vs. *Did come she?, etc.).

agreement phenomena: understood broadly, an agreement phenomenon requires that if two (or
more) elements E1, E2, ... cooccur in a sentence, then some of their morphological
characteristics have to be in a certain systematic relation (most often, identity); if this
relation does not hold, the respective instance of the agreement is violated and the string
is ungrammatical. (The difference to selection phenomena consists thus of the fact that the
two (or more) elements E1, E2, ... need not cooccur at all – that is, the agreement is
violated if they cooccur but do not agree, but it is not violated if only one of the pair (of
the set) occurs, which would, however, be a violation of the selection.)
Example: the string *She does it himself. breaks the agreement relation in gender between the
anaphora and its antecedent (while the sentences She does it herself. and She does it. are
both correct – mind here the difference to selection).

This overview of classes of phenomena suggests that each string violating a certain
phenomenon can be viewed as an extension of some minimal violating
string, i.e. as an extension of a string which contains only the material necessary for the
violation. For example, the ungrammatical string The old woman saw himself in the mirror
yesterday, if considered a case of violation of the anaphora-agreement relation, can be
viewed as an extension of the minimal string The woman saw himself, and in fact as an
extension of the string Woman himself (since for the anaphora-agreement violation, the fact
that some other phenomena are also violated in the string does not play any role).

This means that a minimal violating string can be discovered in each ungrammatical string,
and hence each “unrule” of the “formal ungrammar” can be constructed in two steps:

  – first, by defining an (abstract) minimal violating string, based on a violation of an
    individual phenomenon (or, as the case might be, based on combination of violations of
    a “small number” of phenomena)
  – second, by defining how the (abstract) minimal violating string can be extended into
    a full-fledged (abstract) violating string (or to more such strings, if there are more
    possibilities of the extension), i.e. by defining the material (as to quality and
    positioning) which can be added to the minimal string without making the resulting string
    grammatical (not even contingently).

The approach to discovering/describing ungrammatical strings will be illustrated by the
following example where the sign ‘≺’ will mark sentence beginning (an abstract position in
front of the first word), and ‘≻’ will mark sentence end (i.e. an abstract position “after
the full stop”).

Example: As reasoned already above, the abstract minimal violating string of the string The
old woman saw himself in the mirror yesterday is the following configuration (1) (in the
usual regular expression notation, using feature structures for the individual elements of
the regular expression, ‘∨’ for disjunction, the sign ‘⊕’ for concatenation, and brackets
‘(’ and ‘)’ in the usual way for marking off precedence/grouping).

\[
\prec \;\oplus\;
\left(
\begin{bmatrix} \text{cat: n} \\ \text{gender: fem} \end{bmatrix}
\vee
\begin{bmatrix} \text{cat: pron} \\ \text{pron type: pers} \\ \text{gender: fem} \end{bmatrix}
\right)
\oplus\; \textit{himself} \;\oplus\; \succ
\tag{1}
\]

This configuration states that a string consisting of two elements (the sentential boundaries
do not count), a feminine noun or a feminine personal pronoun followed by the word himself,
can never be a correct sentence of English (cf., e.g., the impossibility of the dialogue Who
turned Io into a cow? *Hera himself.)

Further, such a minimal violating (abstract) string can be generalized into an incorrect
configuration of unlimited length using the following linguistic facts about the anaphoric
pronoun himself in English:

  – a bound anaphora must cooccur with a noun or nominal phrase displaying the same gender
    and number as the pronoun (with the binder of the anaphor); usually, this binder precedes
    the pronoun within the sentence (and then it is a case of a true anaphor) or, rarely, it
    can follow the anaphor (in case of a cataphoric relation: Himself, he bought a book.)
  – occasionally, also an overtly unbound anaphora can occur; apart from imperative sentences
    (Kill yourself!), the anaphor must then closely follow a to-infinitive (The intention was
    only to kill himself.) or a gerund (Killing himself was the only intention.).

Taken together, these points mean that the only way to give the configuration from the
string (1) at least a chance to be grammatical is to extend it with an item which
  – either is in masculine gender and singular number
  – or is an imperative or an infinitive or a gerund and stands to the left of the word
    himself.
This further suggests that, in order to keep the string ungrammatical also after the
extension, no item in masculine gender and singular number may occur within the (extended)
string, and no imperative, infinitive or gerund may appear to the left of the word himself.

This can be captured in a (semi-)formal way (employing the Kleene-star ‘*’ for any number of
repeated occurrences, and ‘¬’ for negation) as follows.

In the first step, the requirement of no singular masculine is to be added (2), in the second
step, the prohibition on occurrence of an imperative or an infinitive (represented by the
infinitival particle to) or a gerund to the left of the word himself will be expressed as in
(3). This is then the final form of description of an abstract violating string. Any
particular string matching this description is guaranteed to be ungrammatical in English.

\[
\begin{aligned}
\prec {}\oplus{} &
\left(\neg \begin{bmatrix} \text{number: sg} \\ \text{gender: masc} \end{bmatrix}\right)^{*}
\oplus
\left(
\begin{bmatrix} \text{cat: n} \\ \text{gender: fem} \end{bmatrix}
\vee
\begin{bmatrix} \text{cat: pron} \\ \text{pron type: pers} \\ \text{gender: fem} \end{bmatrix}
\right) \\
{}\oplus{} &
\left(\neg \begin{bmatrix} \text{number: sg} \\ \text{gender: masc} \end{bmatrix}\right)^{*}
\oplus \textit{himself} \oplus
\left(\neg \begin{bmatrix} \text{number: sg} \\ \text{gender: masc} \end{bmatrix}\right)^{*}
\oplus \succ
\end{aligned}
\tag{2}
\]

\[
\begin{aligned}
\prec {}\oplus{} &
\left(\neg\left(
\begin{bmatrix} \text{number: sg} \\ \text{gender: masc} \end{bmatrix}
\vee
\begin{bmatrix} \text{v form: } (\text{imp} \vee \text{ger}) \end{bmatrix}
\vee
\begin{bmatrix} \text{cat: part} \\ \text{form: to} \end{bmatrix}
\right)\right)^{*}
\oplus
\left(
\begin{bmatrix} \text{cat: n} \\ \text{gender: fem} \end{bmatrix}
\vee
\begin{bmatrix} \text{cat: pron} \\ \text{pron type: pers} \\ \text{gender: fem} \end{bmatrix}
\right) \\
{}\oplus{} &
\left(\neg\left(
\begin{bmatrix} \text{number: sg} \\ \text{gender: masc} \end{bmatrix}
\vee
\begin{bmatrix} \text{v form: } (\text{imp} \vee \text{ger}) \end{bmatrix}
\vee
\begin{bmatrix} \text{cat: part} \\ \text{form: to} \end{bmatrix}
\right)\right)^{*}
\oplus \textit{himself} \oplus
\left(\neg \begin{bmatrix} \text{number: sg} \\ \text{gender: masc} \end{bmatrix}\right)^{*}
\oplus \succ
\end{aligned}
\tag{3}
\]

3    Ungrammar and the theory of grammaticality

An important case – mainly for the theory of grammaticality – of a minimal violating string
is three fi-


nite verbs following each other closely, i.e. the configuration VFin + VFin + VFin. Such
a configuration appears, e.g., in the sentence The mouse the cat the dog chased caught
survived, which is a typical example of a (in its time frequently discussed) case of
a multiple centre self-embedding construction. The important point concerning this
construction is that it became the issue of discussions since

  – on the one hand, this construction is (almost) necessarily licensed by any “reasonable”
    formal grammar of English, due to the necessity of allowing in this grammar for the
    possibility of (recursive) embedding (incl. centre self-embedding) of relative clauses
  – on the other hand, such sentences are unanimously considered unacceptable by native
    speakers of English (with the contingent exception of theoretical linguists :-) ).

The antagonism between the two points is traditionally attributed to (and attempted to be
explained by) a tension between the langue (grammar, grammatical competence) and the parole
(language performance) of the speakers, that is, by postulating that the speakers possess
some internal system of the language but that they use the language in a way which deviates
from this system. Such an assumption is generally a good explanation for such (unintentional)
violations of langue (i.e. of grammaticality) in speech as, e.g., slips of tongue,
hesitations and/or repetitions, etc., but it can hardly be used sensibly in case there are no
extralinguistic factors and, above all, where the sentences in question correspond to the
langue (to the grammatical description). This demonstrates that what is really at stake here
is the correctness of the general understanding of the langue (and not a problem of
a particular grammar of a particular language).

The difference in methods of ruling sentences with multiple centre self-embedding out of the
language leads us to the fact that the standard view of langue (and hence that of a grammar)
and the view advocated in this paper differ considerably:

  – the standard approach to langue, which allows for specification of the set of correct
    strings only (via the grammar), has no means available for ruling out constructions with
    multiple centre self-embedding (short of ruling out recursion of the description of
    relative clauses, which would indeed solve the problem but would also have serious
    negative consequences elsewhere),
  – the approach proposed, by allowing for explicit and most importantly independent
    specifications of the sets of correct and of incorrect strings as two autonomous parts of
    the langue, allows for ruling out constructions involving multiple centre self-embedded
    relative clauses (at least in certain cases); this is achieved without consequences on
    any other part of the grammar and the language described, simply by stating that strings
    where three (or more) finite verbs follow each other immediately belong to the area of
    “clearly incorrect” strings.

By solving the problem of unacceptability of the strings involving three (and more) finite
verbs following each other via the formal ungrammar, the approach proposed enforces
a refinement of perspective of the general description of grammaticality and
ungrammaticality. In particular, from now on Fig. 1 above has to be understood as depicting
the situation in the language (understood as set of strings) only, i.e. without any recourse
to the means of its description (i.e. without any recourse to a grammar and, in particular,
to the coverage of a grammar). The coverage of the two grammar modules introduced above (the
“grammar of the correct strings” and the “ungrammar of the incorrect strings”), i.e. the
stringsets described by the components of the grammar describing the “clearly correct” and
the “clearly incorrect” strings, should be rather described as in Fig. 2.
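
The unrule stated in this section (three or more finite verbs in immediate succession) refers
to lexical characteristics only, so it can be checked over a plain sequence of part-of-speech
tags. The following Python sketch illustrates this under stated assumptions: the tag name
VFIN, the function name and the hand-assigned tag sequences are hypothetical and serve
illustration only; they are not part of the proposal itself.

```python
from typing import Sequence

VFIN = "VFIN"  # hypothetical tag name for a finite verb form


def violates_three_finite_verbs(tags: Sequence[str], limit: int = 3) -> bool:
    """Return True if `limit` or more finite verbs follow each other immediately."""
    run = 0  # length of the current run of adjacent finite verbs
    for tag in tags:
        run = run + 1 if tag == VFIN else 0
        if run >= limit:
            return True
    return False


# Hand-assigned tags (an assumption) for:
# "The mouse the cat the dog chased caught survived"
embedded_twice = ["DET", "N", "DET", "N", "DET", "N", VFIN, VFIN, VFIN]
print(violates_three_finite_verbs(embedded_twice))  # True

# "The mouse the cat chased survived" contains only two adjacent finite verbs.
embedded_once = ["DET", "N", "DET", "N", VFIN, VFIN]
print(violates_three_finite_verbs(embedded_once))  # False
```

Note that the check needs neither a parse nor any notion of relative clause; it inspects
adjacent word classes only, which is what keeps it independent of the grammar of the correct
strings.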

Fig. 2. The set T∗ with three kinds of strings marked: strings (sentences) described by the
"grammar of the correct strings", strings (sentences) described by the "grammar of the
incorrect strings", and strings described by neither of the grammars.



The crucial point is the part of this picture pointed out by the arrow (where dense dots and
vertical bars overlap). This area of the picture is the one representing strings which are
described by both components of the grammar, i.e. strings which are covered both by the
description (grammar) of the correct strings and by the description (ungrammar) of the
incorrect strings. At first glance, this might seem a contradiction (seemingly, some strings
are considered correct and incorrect simultaneously), but it is not one, since the true
situation described in this picture is in fact two independent partitionings of the set of
strings T∗ by two independent set description systems, each of which describes a subset of
T∗. Viewed from this perspective, it should not be surprising that some strings are described
by both of the systems (while others are described by neither of them). The fundamental issue
here is the relation of the two description systems (the grammar and the ungrammar) to the
pretheoretical understanding of the notion of grammaticality as acceptability of a string for
a native speaker of a language. Traditionally, all the strings were considered grammatical
which were described by the grammar of the correct strings. In the light of the current
discussion, and mainly of the evidence provided by the multiple centre self-embedding
relative constructions, this definition of grammaticality should be adjusted by adding the
proviso that strings which are covered by the description of incorrect strings (by the
ungrammar) should not be considered grammatical (not even in case they are simultaneously
covered by the grammar of the correct strings). This changes the perspective (compared to the
standard one), by giving the ungrammar the “veto right” over the grammaticality of a string,
but obviously corresponds to the language reality more closely than the standard approach.

Viewed from the perspective of a grammatical description considered as a model of
a linguistic competence, the previous discussion can be summed up as follows:

  – (formally) grammatical strings are strings described by the grammar but not by the
    ungrammar
  – (formally) ungrammatical strings are strings described by the ungrammar
  – strings whose grammaticality is (formally) undefined are strings which are described
    neither by the grammar nor by the ungrammar.

4    Applications

In the previous sections, rather theoretical issues concerning the general view of
grammaticality and means of description of grammatical/ungrammatical strings were dealt with.
The task of finding the set of strictly ungrammatical strings also has practical importance,
however, since for certain applications it is crucial to know that a particular configuration
of words (or of abstractions over strings of words, e.g., configurations of part-of-speech
information) is guaranteed to be incorrect.

The most prominent (or at least: the most obvious) among such tasks is (automatic)
grammar-checking: the ability to recognize reliably that a string is ungrammatical would
result in grammar-checkers with considerably more user-friendly performance than most of our
present ones display, as they are based predominantly on simple pattern-matching techniques,
and hence they produce a lot of false alarms over correct strings on the one hand while they
leave unflagged many strings whose ungrammaticality is obvious to a human, but which cannot
be detected as incorrect since their inner structure is too complex or does not correspond to
any of the patterns for any other reason.

Another practical task where the knowledge of the ungrammar of a particular language may turn
into the central expertise needed is part-of-speech tagging, i.e. assigning morphological
information (such as part-of-speech, case, number, tense, ...) to words
in running texts. The main problem for (automatic) part-of-speech tagging is morphological
ambiguity, i.e. the fact that words might have different morphological meanings (e.g., the
English wordform can is either a noun (“a food container”) or a modal verb (“to be able to”);
a more typical, and much more frequent, case of ambiguity in English is the noun/verb
ambiguity in such systematic cases as weight, jump, call, ...). The knowledge of
ungrammatical configurations can be employed for the build-up of a part-of-speech tagger
based on the idea of (stepwise) elimination of those individual readings which are
ungrammatical (i.e. impossible) in the context of a given sentence. In particular, each
extended violating string with n constituting members (i.e. a configuration which came into
being by extending a minimal violating string of length n) can be turned into a set of
disambiguation rules by stipulating, for each resulting rule differently, (n − 1)
constituting members of the extended violating string as unambiguous and issuing a deletion
statement for the n-th original element in a string which matches the constituting elements
as well as the extension elements in between them. Thus, each extended violating string
arising from a simple violating string of length n yields n disambiguation rules.

Example: The two-membered minimal violating string ARTICLE + VERB, after being extended into
the configuration (in the usual Kleene-star notation) ARTICLE + ADVERB∗ + VERB, yields the
following two rules:

Rule 1:
find_a_string consisting of (from left to right):

    – a word which is an unambiguous ARTICLE
      (i.e. bears no other tag or tags than ARTICLE)
    – any number of words which bear the tag ADVERB
      (but no other tags)
    – a word bearing the tag VERB

delete_the_tag VERB from the last word of the string

Rule 2:
find_a_string consisting of (from left to right):

    – a word bearing the tag ARTICLE
    – any number of words which bear the tag ADVERB
      (but no other tags)
    – a word which is an unambiguous VERB (i.e. it
      bears only a single tag VERB or it bears more
      than one tag, but all these tags are VERB)

delete_the_tag ARTICLE from the first word of the string

The (linguistic) validity of these rules is based on the fact that any string matching the
pattern part of the rule on each position would be ungrammatical (in English), and hence that
the reading to be deleted can be removed without any harm to any of the grammatical readings
of the input string.

It is important to realize that the proposed approach to the "discovery" of disambiguation
rules yields the expected results, among others rules corresponding to the Constraint Grammar
rules given in standard literature (e.g., it brings the rule for English saying that if an
unambiguous ARTICLE is followed by a word having a potential VERB reading, then this VERB
reading is to be discarded, cf. [1, p. 11], and compare this to the example above). The most
important innovative feature (wrt. the usual ad hoc approach to writing these rules) is thus
the systematic linguistic method of discovering the violating strings, supporting the
development of all possible disambiguation rules, i.e. of truly powerful Constraint Grammars.
It is also worth mentioning that the idea of the method as such is language independent: it
can be used for development of Constraint Grammars for the most diverse languages (even
though the set of the developed rules will be of course language-specific and will depend on
the syntactic regularities of the language in question).

References

1. F. Karlsson, A. Voutilainen, J. Heikkilä, and A. Anttila (eds.) Constraint grammar –
   a language-independent system for parsing unrestricted text. Mouton de Gruyter,
   Berlin & New York, 1995.
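
The derivation of disambiguation rules from an extended violating string, as described in
Sect. 4, can be sketched in Python. This is a minimal sketch under stated assumptions, not
a definitive implementation: the representation of a word as a set of candidate tags, the
names `PATTERN`, `_match` and `disambiguate`, and the example inputs are all hypothetical.
One further assumption goes beyond the text: a reading is deleted only if the word keeps at
least one other reading.

```python
from typing import List, Optional, Set, Tuple

# A pattern element is (tag, is_extension); extension elements carry a Kleene star.
Element = Tuple[str, bool]

# Extended violating string ARTICLE + ADVERB* + VERB from the example in Sect. 4.
PATTERN: List[Element] = [("ARTICLE", False), ("ADVERB", True), ("VERB", False)]


def _match(words: List[Set[str]], pattern: List[Element], target: int,
           i: int, j: int, hit: Optional[int]) -> Optional[int]:
    """Match pattern[j:] at word position i; return the position of the target word."""
    if j == len(pattern):
        return hit
    tag, is_extension = pattern[j]
    if is_extension:
        # Either skip the starred element, or consume one word bearing only this tag.
        found = _match(words, pattern, target, i, j + 1, hit)
        if found is None and i < len(words) and words[i] == {tag}:
            found = _match(words, pattern, target, i + 1, j, hit)
        return found
    if i >= len(words):
        return None
    if j == target:
        # The target word merely bears the tag (assumption: it keeps another reading).
        if tag in words[i] and len(words[i]) > 1:
            return _match(words, pattern, target, i + 1, j + 1, i)
        return None
    # Every other constituting member must be unambiguous.
    if words[i] == {tag}:
        return _match(words, pattern, target, i + 1, j + 1, hit)
    return None


def disambiguate(sentence: List[Set[str]], pattern: List[Element]) -> List[Set[str]]:
    """Apply the n rules derived from `pattern` until no reading can be deleted."""
    words = [set(w) for w in sentence]  # work on a copy of the input
    core = [j for j, (_, is_extension) in enumerate(pattern) if not is_extension]
    changed = True
    while changed:
        changed = False
        for target in core:  # one rule per constituting member
            for start in range(len(words)):
                pos = _match(words, pattern, target, start, 0, None)
                if pos is not None:
                    words[pos].discard(pattern[target][0])
                    changed = True
    return words


# Rule 1 applies: the VERB reading of the last word is deleted.
print(disambiguate([{"ARTICLE"}, {"ADVERB"}, {"NOUN", "VERB"}], PATTERN))
# Rule 2 applies: the ARTICLE reading of the first word is deleted.
print(disambiguate([{"ARTICLE", "PRON"}, {"VERB"}], PATTERN))
```

Each constituting member of the pattern serves once as the deletion target while the
remaining members are required to be unambiguous, so a pattern with n constituting members
gives rise to n rules, in line with the count given in Sect. 4.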