Linguistics behind the mirror

Karel Oliva

Institute of the Czech Language AS CR, v. v. i.
Letenská 123/4, Praha 1 - Malá Strana, CZ - 118 51, Czech Republic

Abstract. A natural language is usually modelled as a subset of the set T* of strings (over some set T of terminals) generated by some grammar G. Thus, T* is divided into two disjoint classes: grammatical and ungrammatical strings (any string not generated by G is considered ungrammatical). This approach brings along the following problems:
– on the theoretical side, it is impossible to rule out clearly unacceptable yet “theoretically grammatical” strings (e.g., strings with multiple centre self-embeddings, cf. The cheese the lady the mouse the cat the dog chased caught frightened bought cost 10 £),
– on the practical side, it impedes the systematic build-up of such computational linguistics applications as, e.g., grammar-checkers.
In an attempt to lay a theoretical foundation enabling the solution of these problems, the paper first proposes a tripartition of the stringset into:
– clearly grammatical strings,
– clearly ungrammatical strings,
– strings with unclear (“on the verge”) grammaticality status
and, based on this, concentrates on
– techniques for systematic discovery and description of clearly ungrammatical strings,
– the impact of the approach on the theory of grammaticality,
– an overview of simple ideas about applications of the above in building grammar-checkers and rule-based part-of-speech taggers.

1 Introduction

Apart from deciding on the membership of a particular string σ in a particular language L, a formal grammar is usually assigned an additional task: to assign each string from the language L some (syntactic) structure. The idea behind this is that the property of having a structure differentiates the strings σ ∈ L from all “other” strings ω ∉ L, i.e. having a structure differentiates sentences from “nonsentences”. Due to this, the task of identifying the appurtenance of a string to a language (the set membership) and the task of assigning the string its structure are often viewed as in fact identical. In other words, the current approach to syntactic description supposes that any string ω ∈ T* which cannot be assigned a structure by the respective grammar is to be considered (formally) ungrammatical. Closely linked to this is also the presupposition that the borderline between strings which are grammatical and those which are ungrammatical is sharp and clear-cut.

Even elementary language practice (e.g., serving as a native-speaker informant for fellow linguists, or teaching one’s mother tongue) shows that this presupposition does not hold in reality. The realistic picture is much more like the one in Fig. 1: there are strings which are considered clearly correct (“grammatical”) by the native speakers, there are other ones that are doubtless incorrect (out of the language, “informally ungrammatical”, unacceptable for native speakers), and there is a non-negligible set of strings whose status wrt. correctness (acceptability, grammaticality) is not really clear and/or where opinions of the native speakers differ (some possibly tending more in this, others more in the other direction, etc.).

Fig. 1. [Figure: the set T* divided into three regions: “clearly” correct strings (sentences), “clearly” incorrect strings, and strings with uncertain/unclear grammaticality status.]

Assuming the better empirical adequacy of the picture in Fig. 1, the objective of this paper will be to propose that a syntactic description of (some natural) language L should consist of:
– a formal grammar G defining the set L(G) of doubtlessly grammatical strings (L(G) ⊆ L). Typically, the individual components of G (rules, principles, constraints, ...) are based on a structure assigned to a string, either directly (mentioning, e.g., the constituent structure) or indirectly, operating with other syntactically assigned features (such as subject, direct object, etc.). Since the description of the “clearly correct” strings via such a grammar is fairly standard, it will not be further dealt with here,
– a formal “ungrammar” U defining the set L(U) of doubtlessly ungrammatical strings. Typically, any individual component (“unrule”) of U would be based on lexical characteristics only, i.e. it would take recourse neither to any structure of a string nor to other syntactic characteristics (such as being a subject etc.), not even indirectly.

Unlike the standard approach, such a description allows also for the existence of a non-empty set of strings which belong neither to the clearly grammatical nor to the clearly ungrammatical strings – more formally, such a description allows for a non-empty set T* \ (L(G) ∪ L(U)). Apart from this, the explicit knowledge of the set L(U) of ungrammatical strings allows for straightforward development of important applications (cf. Sect. 4).

2 The unrules of the ungrammar

The above abstract ideas call for methods for discovering and describing the “unrules” of the “ungrammar”. In doing this, the following two points can be postulated as starters:
– grammaticality/ungrammaticality is defined for whole sentences (i.e. not for subparts of sentences only, at least not in the general case)
– ungrammaticality occurs (only) as a result of violation of some linguistic phenomenon or phenomena within the sentence.

Since any “clear” error consists of violation of a language phenomenon, it seems reasonable that the search for incorrect configurations be preceded by an overview and classification of phenomena fit to the current purpose.

From the viewpoint of the way of their manifestation in the surface string, (syntactic) phenomena can be divided into three classes:

selection phenomena: in a rather broad understanding, selection (as a generalized notion of subcategorisation) is the requirement for a certain element (a syntactic category, sometimes even a single word) E1 to occur in a sentence if another element E2 (or: set of elements {E2, E3, ..., En}) is present, i.e. if E2 (or: {E2, E3, ..., En}) occur(s) in a string but E1 does not, the respective instance of selection phenomena is violated and the string is to be considered ungrammatical.
Example: in English, if a non-imperative finite verb form occurs in a sentence, then also a word functioning as its subject must occur in the sentence (cf. the contrast in grammaticality between She is at home. vs. *Is at home.).

(word) order phenomena: word order rules are rules which define the mutual ordering of (two or more) elements E1, E2, ... occurring within a particular string; if this ordering is not kept, then the respective word order phenomenon is violated and the string is to be considered ungrammatical.
Example: in an English do-interrogative sentence consisting of a finite form of the auxiliary verb do, of a subject position filled in by a noun or a personal pronoun in nominative, of a base form of a main verb different from be and have, and of the final question mark, the order must necessarily follow the pattern just used for listing the elements, or, in an echo question, it must follow the pattern of a declarative sentence. If this order is not kept, the string is ungrammatical (cf. Did she come?, She did come? vs. *Did come she?, etc.).

agreement phenomena: understood broadly, an agreement phenomenon requires that if two (or more) elements E1, E2, ... cooccur in a sentence, then some of their morphological characteristics have to be in a certain systematic relation (most often, identity); if this relation does not hold, the respective instance of the agreement is violated and the string is ungrammatical. (The difference to selection phenomena thus consists in the fact that the two (or more) elements E1, E2, ... need not cooccur at all – that is, the agreement is violated if they cooccur but do not agree, but it is not violated if only one of the pair (of the set) occurs, which would, however, be a violation of the selection.)
Example: the string *She does it himself. breaks the agreement relation in gender between the anaphor and its antecedent (while the sentences She does it herself. and She does it. are both correct – mind here the difference to selection).

This overview of classes of phenomena suggests that each string violating a certain phenomenon can be viewed as an extension of some minimal violating string, i.e. as an extension of a string which contains only the material necessary for the violation. For example, the ungrammatical string The old woman saw himself in the mirror yesterday, if considered a case of violation of the anaphora-agreement relation, can be viewed as an extension of the minimal string The woman saw himself, and in fact as an extension of the string Woman himself (since for the anaphora-agreement violation, the fact that some other phenomena are also violated in the string does not play any role).

This means that a minimal violating string can be discovered in each ungrammatical string, and hence each “unrule” of the “formal ungrammar” can be constructed in two steps:
– first, by defining an (abstract) minimal violating string, based on a violation of an individual phenomenon (or, as the case might be, based on a combination of violations of a “small number” of phenomena)
– second, by defining how the (abstract) minimal violating string can be extended into a full-fledged (abstract) violating string (or into more such strings, if there are more possibilities of the extension), i.e. by defining the material (as to quality and positioning) which can be added to the minimal string without making the resulting string grammatical (not even contingently).

The approach to discovering/describing ungrammatical strings will be illustrated by the following example, where the sign ‘≺’ will mark sentence beginning (an abstract position in front of the first word), and ‘≻’ will mark sentence end (i.e. an abstract position “after the full stop”).

Example: As reasoned already above, the abstract minimal violating string of the string The old woman saw himself in the mirror yesterday is the following configuration (1) (in the usual regular expression notation, using feature structures for the individual elements of the regular expression, ‘∨’ for disjunction, the sign ‘⊕’ for concatenation, and brackets ‘(’ and ‘)’ in the usual way for marking off precedence/grouping).

\[
\prec \;\oplus\;
\left(
\begin{bmatrix} \text{cat: n} \\ \text{gender: fem} \end{bmatrix}
\vee
\begin{bmatrix} \text{cat: pron} \\ \text{pron\_type: pers} \\ \text{gender: fem} \end{bmatrix}
\right)
\oplus\; \textit{himself} \;\oplus\; \succ
\tag{1}
\]

This configuration states that a string consisting of two elements (the sentential boundaries do not count), a feminine noun or a feminine personal pronoun followed by the word himself, can never be a correct sentence of English (cf., e.g., the impossibility of the dialogue Who turned Io into a cow? *Hera himself.)

Further, such a minimal violating (abstract) string can be generalized into an incorrect configuration of unlimited length using the following linguistic facts about the anaphoric pronoun himself in English:
– a bound anaphor must cooccur with a noun or nominal phrase displaying the same gender and number as the pronoun (with the binder of the anaphor); usually, this binder precedes the pronoun within the sentence (and then it is a case of a true anaphor) or, rarely, it can follow the anaphor (in case of a cataphoric relation: Himself, he bought a book.)
– occasionally, also an overtly unbound anaphor can occur; apart from imperative sentences (Kill yourself!), the anaphor must then closely follow a to-infinitive (The intention was only to kill himself.) or a gerund (Killing himself was the only intention.).

Taken together, these points mean that the only way to give the configuration in (1) at least a chance to be grammatical is to extend it with an item which
– either is in masculine gender and singular number
– or is an imperative or an infinitive or a gerund and stands to the left of the word himself.

This further suggests that, in order to keep the string ungrammatical also after the extension, no item in masculine gender and singular number may occur anywhere within the (extended) string, and no imperative, infinitive or gerund may appear to the left of the word himself.

This can be captured in a (semi-)formal way (employing the Kleene-star ‘*’ for any number of repeated occurrences, and ‘¬’ for negation) as follows. In the first step, the requirement of no singular masculine is to be added (2); in the second step, the prohibition on occurrence of an imperative or an infinitive (represented by the infinitival particle to) or a gerund to the left of the word himself will be expressed as in (3). This is then the final form of description of an abstract violating string. Any particular string matching this description is guaranteed to be ungrammatical in English.

\[
\prec \;\oplus\;
\left( \neg \begin{bmatrix} \text{number: sg} \\ \text{gender: masc} \end{bmatrix} \right)^{*}
\oplus
\left(
\begin{bmatrix} \text{cat: n} \\ \text{gender: fem} \end{bmatrix}
\vee
\begin{bmatrix} \text{cat: pron} \\ \text{pron\_type: pers} \\ \text{gender: fem} \end{bmatrix}
\right)
\oplus
\left( \neg \begin{bmatrix} \text{number: sg} \\ \text{gender: masc} \end{bmatrix} \right)^{*}
\oplus\; \textit{himself} \;\oplus
\left( \neg \begin{bmatrix} \text{number: sg} \\ \text{gender: masc} \end{bmatrix} \right)^{*}
\oplus\; \succ
\tag{2}
\]

\[
\prec \;\oplus\;
\left( \neg \left(
\begin{bmatrix} \text{number: sg} \\ \text{gender: masc} \end{bmatrix}
\vee
\begin{bmatrix} \text{v\_form: } (\text{imp} \vee \text{ger}) \end{bmatrix}
\vee
\begin{bmatrix} \text{cat: part} \\ \text{form: to} \end{bmatrix}
\right) \right)^{*}
\oplus
\left(
\begin{bmatrix} \text{cat: n} \\ \text{gender: fem} \end{bmatrix}
\vee
\begin{bmatrix} \text{cat: pron} \\ \text{pron\_type: pers} \\ \text{gender: fem} \end{bmatrix}
\right)
\oplus
\left( \neg \left(
\begin{bmatrix} \text{number: sg} \\ \text{gender: masc} \end{bmatrix}
\vee
\begin{bmatrix} \text{v\_form: } (\text{imp} \vee \text{ger}) \end{bmatrix}
\vee
\begin{bmatrix} \text{cat: part} \\ \text{form: to} \end{bmatrix}
\right) \right)^{*}
\oplus\; \textit{himself} \;\oplus
\left( \neg \begin{bmatrix} \text{number: sg} \\ \text{gender: masc} \end{bmatrix} \right)^{*}
\oplus\; \succ
\tag{3}
\]

3 Ungrammar and the theory of grammaticality

An important case of a minimal violating string, mainly for the theory of grammaticality, is three finite verbs following each other closely, i.e. the configuration VFin + VFin + VFin. Such a configuration appears, e.g., in the sentence The mouse the cat the dog chased caught survived, which is a typical example of the (in its time frequently discussed) case of a multiple centre self-embedding construction. The important point concerning this construction is that it became the issue of discussions since
– on the one hand, this construction is (almost) necessarily licensed by any “reasonable” formal grammar of English, due to the necessity of allowing in this grammar for the possibility of (recursive) embedding (incl. centre self-embedding) of relative clauses
– on the other hand, such sentences are unanimously considered unacceptable by native speakers of English (with the contingent exception of theoretical linguists ☺).

The antagonism between the two points is traditionally attributed to (and attempted to be explained by) a tension between the langue (grammar, grammatical competence) and the parole (language performance) of the speakers, that is, by postulating that the speakers possess some internal system of the language but that they use the language in a way which deviates from this system. Such an assumption is generally a good explanation for such (unintentional) violations of langue (i.e. of grammaticality) in speech as, e.g., slips of the tongue, hesitations and/or repetitions, etc., but it can hardly be used sensibly in case there are no extralinguistic factors and, above all, where the sentences in question correspond to the langue (to the grammatical description). This demonstrates that what is really at stake here is the correctness of the general understanding of the langue (and not a problem of a particular grammar of a particular language).

The difference in methods of ruling sentences with multiple centre self-embedding out of the language leads to the observation that the standard view of langue (and hence that of a grammar) and the view advocated in this paper differ considerably:
– the standard approach to langue, which allows for specification of the set of correct strings only (via the grammar), has no means available for ruling out constructions with multiple centre self-embedding (short of ruling out recursion in the description of relative clauses, which would indeed solve the problem but would also have serious negative consequences elsewhere),
– the approach proposed, by allowing for explicit and, most importantly, independent specifications of the sets of correct and of incorrect strings as two autonomous parts of the langue, allows for ruling out constructions involving multiple centre self-embedded relative clauses (at least in certain cases); this is achieved without consequences for any other part of the grammar and the language described, simply by stating that strings where three (or more) finite verbs follow each other immediately belong to the area of “clearly incorrect” strings.

By solving the problem of unacceptability of the strings involving three (and more) finite verbs following each other via the formal ungrammar, the approach proposed enforces a refinement of perspective of the general description of grammaticality and ungrammaticality. In particular, from now on Fig. 1 above has to be understood as depicting the situation in the language (understood as a set of strings) only, i.e. without any recourse to the means of its description (i.e. without any recourse to a grammar and, in particular, to the coverage of a grammar). The coverage of the two grammar modules introduced above (the “grammar of the correct strings” and the “ungrammar of the incorrect strings”), i.e. the stringsets described by the components of the grammar describing the “clearly correct” and the “clearly incorrect” strings, should rather be described as in Fig. 2.

Fig. 2. [Figure: the set T* with two overlapping regions: strings (sentences) described by the “grammar of the correct strings” and strings (sentences) described by the “grammar of the incorrect strings”; the remainder of T* contains strings described by neither of the grammars. An arrow points to the overlap of the two regions.]

The crucial point is the part of this picture pointed out by the arrow (where dense dots and vertical bars overlap). This area of the picture is the one representing strings which are described by both components of the grammar, i.e. strings which are covered both by the description (grammar) of the correct strings and by the description (ungrammar) of the incorrect strings. At first glance, this might seem a contradiction (seemingly, some strings are considered correct and incorrect simultaneously), but it is not one, since the true situation described in this picture is in fact two independent partitionings of the set of strings T* by two independent set description systems, each of which describes a subset of T*. Viewed from this perspective, it should not be surprising that some strings are described by both of the systems (while others are described by neither of them). The fundamental issue here is the relation of the two description systems (the grammar and the ungrammar) to the pretheoretical understanding of the notion of grammaticality as acceptability of a string for a native speaker of a language. Traditionally, all the strings which were described by the grammar of the correct strings were considered grammatical. In the light of the current discussion, and mainly of the evidence provided by the multiple centre self-embedding relative constructions, this definition of grammaticality should be adjusted by adding the proviso that strings which are covered by the description of incorrect strings (by the ungrammar) should not be considered grammatical (not even in case they are simultaneously covered by the grammar of the correct strings). This changes the perspective (compared to the standard one) by giving the ungrammar the “veto right” over the grammaticality of a string, but it corresponds to the language reality more closely than the standard approach.

Viewed from the perspective of a grammatical description considered as a model of a linguistic competence, the previous discussion can be summed up as follows:
– (formally) grammatical strings are strings described by the grammar but not by the ungrammar
– (formally) ungrammatical strings are strings described by the ungrammar
– strings whose grammaticality is (formally) undefined are strings which are described neither by the grammar nor by the ungrammar.

4 Applications

In the previous sections, rather theoretical issues concerning the general view of grammaticality and means of description of grammatical/ungrammatical strings were dealt with. The task of finding the set of strictly ungrammatical strings also has a practical importance, however, since for certain applications it is crucial to know that a particular configuration of words (or of abstractions over strings of words, e.g., configurations of part-of-speech information) is guaranteed to be incorrect.

The most prominent (or at least: the most obvious) among such tasks is (automatic) grammar-checking: the ability to recognize reliably that a string is ungrammatical would result in grammar-checkers with considerably more user-friendly performance than most of our present ones display. The present ones are based predominantly on simple pattern-matching techniques, and hence they produce a lot of false alarms over correct strings on the one hand, while on the other hand they leave unflagged many strings whose ungrammaticality is obvious to a human but which cannot be detected as incorrect, since their inner structure is too complex or does not correspond to any of the patterns for any other reason.

Another practical task where the knowledge of the ungrammar of a particular language may turn into the central expertise needed is part-of-speech tagging, i.e. assigning morphological information (such as part-of-speech, case, number, tense, ...) to words in running texts. The main problem for (automatic) part-of-speech tagging is morphological ambiguity, i.e. the fact that words might have different morphological meanings (e.g., the English wordform can is either a noun (“a food container”) or a modal verb (“to be able to”); a more typical, and much more frequent, case of ambiguity in English is the noun/verb ambiguity in such systematic cases as weight, jump, call, ...). The knowledge of ungrammatical configurations can be employed for the build-up of a part-of-speech tagger based on the idea of (stepwise) elimination of those individual readings which are ungrammatical (i.e. impossible) in the context of a given sentence. In particular, each extended violating string with n constituting members (i.e. a configuration which came into being by extending a minimal violating string of length n) can be turned into a set of disambiguation rules by stipulating, for each resulting rule differently, (n − 1) constituting members of the extended violating string as unambiguous and issuing a deletion statement for the n-th original element in a string which matches the constituting elements as well as the extension elements in between them. Thus, each extended violating string arising from a simple violating string of length n yields n disambiguation rules.

Example: The two-membered minimal violating string ARTICLE + VERB, after being extended into the configuration (in the usual Kleene-star notation) ARTICLE + ADVERB* + VERB, yields the following two rules:

Rule 1:
find_a_string consisting of (from left to right):
– a word which is an unambiguous ARTICLE (i.e. bears no other tag or tags than ARTICLE)
– any number of words which bear the tag ADVERB (but no other tags)
– a word bearing the tag VERB
delete_the_tag VERB from the last word of the string

Rule 2:
find_a_string consisting of (from left to right):
– a word bearing the tag ARTICLE
– any number of words which bear the tag ADVERB (but no other tags)
– a word which is an unambiguous VERB (i.e. it bears only a single tag VERB or it bears more than one tag, but all these tags are VERB)
delete_the_tag ARTICLE from the first word of the string

The (linguistic) validity of these rules is based on the fact that any string matching the pattern part of the rule on each position would be ungrammatical (in English), and hence that the reading to be deleted can be removed without any harm to any of the grammatical readings of the input string.

It is important to realize that the proposed approach to the “discovery” of disambiguation rules yields the expected results, i.a. rules corresponding to the Constraint Grammar rules given in standard literature (e.g., it brings the rule for English saying that if an unambiguous ARTICLE is followed by a word having a potential VERB reading, then this VERB reading is to be discarded, cf. [1, p. 11], and compare this to the example above). The most important innovative feature (wrt. the usual ad hoc approach to writing these rules) is thus the systematic linguistic method of discovering the violating strings, supporting the development of all possible disambiguation rules, i.e. of truly powerful Constraint Grammars. It is also worth mentioning that the idea of the method as such is language independent: it can be used for development of Constraint Grammars for the most diverse languages (even though the set of the developed rules will of course be language-specific and will depend on the syntactic regularities of the language in question).

References

1. F. Karlsson, A. Voutilainen, J. Heikkilä, and A. Anttila (eds.): Constraint Grammar – a language-independent system for parsing unrestricted text. Mouton de Gruyter, Berlin & New York, 1995.