=Paper= {{Paper |id=None |storemode=property |title=Towards Automatic Detection of Applicable Diatheses |pdfUrl=https://ceur-ws.org/Vol-1003/10.pdf |volume=Vol-1003 |dblpUrl=https://dblp.org/rec/conf/itat/VernerovaL13 }} ==Towards Automatic Detection of Applicable Diatheses== https://ceur-ws.org/Vol-1003/10.pdf
ITAT 2013 Proceedings, CEUR Workshop Proceedings Vol. 1003, pp. 10–17
http://ceur-ws.org/Vol-1003, Series ISSN 1613-0073, © 2013 A. Vernerová, M. Lopatková



                        Towards Automatic Detection of Applicable Diatheses

                                                  Anna Vernerová and Markéta Lopatková

                                                          Charles University in Prague
                                                      Faculty of Mathematics and Physics
                                                  Institute of Formal and Applied Linguistics
                                                                 Czech Republic
                                               {vernerova, lopatkova}@ufal.mff.cuni.cz

Abstract: The valency behavior (argument structure) of lexical items is so varied that it
cannot be described by general rules and must be captured in lexicons separately for each
lexical item. For verbs, lexicons typically describe only unmarked usage (the active form),
while natural languages allow for certain regular changes in the number, type and/or
realization of complementations (e.g. passivization). Thanks to their regularity, such
changes may be described in a separate rule component of the lexicon; however, they are
typically seen in many but not all verbs and their applicability to a given lexical unit
(verb meaning) is not predictable from its valency alone. In this paper, we describe our
initial experiments with using a large morphologically annotated corpus of Czech for
determining which diatheses are applicable to a given lexical unit.

1     Introduction

Valency refers to the argument structure of lexical units.¹ In the Functional Generative
Description (FGD), valency belongs to the so-called tectogrammatical layer [16, 20], i.e.
the layer of linguistically structured meaning. It is captured by so-called valency frames
specifying the valency complementations (arguments that are either required or specifically
permitted by the given lexical unit). For each valency complementation, both its semantics
(in the form of a tectogrammatical functor, which captures a coarse-grained semantic role)
and its syntactic/morphological form must be specified.

    Example 1²

    • vyzývat ‘to appeal, to challenge’
      ACT[Nom] ADDR[Acc] PAT[k+Dat, na+Acc, aby, ať, že]
         vyzvat někoho[ADDR.Acc], aby se uklidnil[PAT.aby-Clause]
            ‘to ask somebody to calm down’
         vyzvat někoho[ADDR.Acc] na souboj[PAT.na+Acc]
            ‘to challenge somebody to a duel’
    • apelovat ‘to appeal’
      ACT[Nom] ADDR[na+Acc] PAT[aby, ať, že]
         apelovat na kolegy[ADDR.na+Acc], aby práci dokončili[PAT.aby-Clause] včas
            ‘to appeal to his colleagues to finish the work in time’
    • apelovat ‘to put emphasis’
      ACT[Nom] PAT[na+Acc]
         v jeho rodině se stále apeluje na morálku[PAT.na+Acc]
            ‘in his family emphasis is always put on morality’

    ¹ Whereas the term lexeme roughly corresponds to a dictionary verb item with all its
meanings, by a lexical unit (LU) we refer to a verb in a given meaning. See Section 3.1 for
more details.
    ² The frames and examples are taken from Vallex 2.6,
http://ufal.mff.cuni.cz/vallex/2.6/data/html.

   The above examples demonstrate how valency behavior varies even among semantically close
lexical units (LUs), both when they belong to the same lexeme and when they belong to
different lexemes. It must therefore be captured for each lexical unit of a verb separately
in the form of a lexical entry listed in the valency lexicon. On the other hand, certain
changes in the valency structure are regular and can be described in the form of rules which
can be specified in a separate component of the lexicon. Such changes are typically seen in
many but not all verbs and their applicability to a given lexical unit is not predictable
from its valency frame alone.
   A lexical entry does not list all of its possible forms but only one (usually the
structure corresponding to the active form of the verb, which is considered to be its
unmarked use) and a list of rules for creating other possible structures (the marked uses).
This description is both economical (less space is needed for storing the information about
all available realizations of the LU) and linguistically adequate (it captures
generalizations which would not be obvious if all possible surface forms were listed).
   Valency lexicons are created with many applications in mind: they help to maintain
consistency of corpus annotation, provide syntactic and morphological information during
parsing and natural language generation, and may even prove useful in word sense
disambiguation and machine translation; moreover, lexicon data is consulted by linguists
during their theoretical research and provides useful information for students of Czech.
All of these tasks involve actual occurrences of the valency patterns in the natural
language, and so the unmarked structures from the lexicon need to be converted into all
structures that may appear in the actual data.
   A rule-based approach to creating derived valency structures has already been used
during the annotation of the


Prague Dependency Treebank³ (PDT) [3, 4]. Frames in            extensively for several decades [12]. In the description of
the valency lexicon PDT-Vallex describe the unmarked           Czech, we follow the classification given by Kettnerová
structure but all possible structures may appear in actual     et al. in [7].
treebank data. During consistency controls, general rules         Here we focus on diatheses—specific relations stem-
were used to generate frames describing the marked va-         ming from the changes in the linking of situational partici-
lency structures; then it was checked whether any of these     pants, valency complementations and surface syntactic po-
marked structures matches the data and the annotation          sitions. Diatheses belong to the group of grammaticalized
in the treebank. (The derived structures carry informa-        alternations: they are realized by the use of specific mor-
tion about the required form of the verb, and the number       phological and/or syntactical means, including the gram-
and type of the valency complementations including their       matical category of voice of the verb and the surface forms
functors, obligatoriness and permitted forms.) The rules       of the complementations. They relate different surface
that were used for the conversions are described in detail     syntactic structures of a single lexical unit of a verb. They
in [23].                                                       also belong to the group of conversive alternations: the
   Because correctness of the underlying PDT data was as-      transformation acts as a permutation on the assignment of
sumed, the rules were allowed to heavily over-generate.        valency complementations to surface syntactic positions,
For example, “passive” frames of the verb mít ‘to have’        typically shifting Actor away from the prominent subject
were generated although, in reality, it does not form a        position and filling it with some other complementation.
passive in Czech. While this is a reasonable strategy for
consistency checks of annotated data, other tasks that utilize
                                                               2.1 Types of grammatical diatheses in Czech
a valency lexicon would benefit from lists of diatheses ap-
plicable to any given lexical unit. Manual annotation pro-     In this section, we summarize the description of Czech
vides a number of examples of lexical units occurring in       diatheses as given by [17] and [6], and comment on some
different types of diatheses; however, the size of the tec-    of the issues that need to be solved and decisions that need
togrammatically annotated PDT data is too small, so we         to be made for their automatic analysis.
cannot make any conclusions from the fact that a lexical
unit does not occur in a diathesis. Therefore, we are trying
to draw evidence from a much larger, automatically mor-        The unmarked member of the diathesis. The unmarked
phologically annotated corpus. We have decided to use          usage is described in the lexical entry in the lexicon. The
SYN, a non-referential corpus of 1,300 million automati-       verb appears in an active form or as an infinitive; the com-
cally morphologically tagged words.                            plementations are realized in the forms specified for them
   For Czech, [21] used simple heuristics for determin-        in the lexicon entry. All complementations specified in
ing which diatheses are applicable to which lexical units      the entry as obligatory are present on the tectogrammat-
(both kinds of passive for verbs with complementations re-     ical layer, although some of them may be elided in the
alized as a prepositionless object, infinitive or dependent    surface realization of the sentence (if their value is either
clause; only reflexive passive for intransitives and verbs     clear from the context or general); inner participants⁴ that
where all complementations are realized as prepositional       are not specified in the lexical entry must not appear as
phrases; no passive for reflexives). For other languages,      arguments of the verb, but free modifications may.
most authors have only studied the applicability of alter-
nations and diatheses to whole lexemes rather than to in-      Diatheses with past participle.
dividual lexical units [11, 14, 15, 19]. We also draw in-
spiration from the work on automatic extraction of whole         1. passive diathesis (periphrastic passive)
frames from corpora, which has been attempted for several            e.g. Neustále jsem byl někým vyzýván, abych se legit-
languages including English [10], Czech [18], and Polish             imoval. ‘All the time - I was - someoneInstr - asked -
[2].                                                                 to show my ID.’ – ‘I kept being asked to show my ID’
                                                                     The form of the verb in this diathesis consists of the
2    Diatheses                                                       past participle of the main verb + the verb být ‘to
                                                                     be’ (in a finite or infinitive form). The subject slot of
Regular changes of the valency structure of a lexical unit,          the passive construction either remains empty, or it
in the English-language literature usually called alterna-           is filled by a complementation which originally filled
tions, typically allow the speaker to express the same situ-         an object slot (typically that of an Accusative object,
ational meaning (i.e., propositional content characterized           but realization through infinitive, clause, genitive, or
by the set of situational participants) in different ways            phrase jako+Acc ‘as something’ is also possible); if
that result in different perspectives from which the situ-           the complementation is expressed as a noun phrase, it
ation is viewed. Alternations have already been studied              is turned into the Nominative case. The Actor (which
    ³ See http://ufal.mff.cuni.cz/pdt2.5/ for information         ⁴ Inner participants are complementations with one of the functors
about the current version.                                     ACTor, PATient, ADDRessee, EFFect or ORIGin.
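
The simple heuristics attributed to [21] in the Introduction can be read as a small decision rule. The following Python sketch is only an illustration under stated assumptions: the function name, the form labels ("object", "infinitive", "clause", "prepositional") and the result labels ("periphrastic", "reflexive") are hypothetical and are not taken from [21] or from either lexicon.

```python
# Hypothetical coarse labels for the surface forms of non-subject complementations.
DIRECT_FORMS = {"object", "infinitive", "clause"}


def applicable_passives(reflexive, forms):
    """Return the kinds of passive the heuristic would allow for one lexical unit.

    reflexive -- True if the verb is reflexive
    forms     -- list of form labels of the non-subject complementations
    """
    if reflexive:
        # No passive for reflexives.
        return set()
    if any(form in DIRECT_FORMS for form in forms):
        # A prepositionless object, infinitive or dependent clause: both kinds.
        return {"periphrastic", "reflexive"}
    # Intransitives, or all complementations are prepositional phrases.
    return {"reflexive"}
```

A rule of this kind looks only at the valency frame, which is exactly the limitation that motivates the use of corpus evidence in this paper.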


        in the active construction fills the subject slot) may be       of the two possible readings of the example sentence
        realized either in the Instrumental case, or as a prepo-        above is considered to be a diathesis:
        sitional phrase od+Gen ‘by/from+Gen’.                           Mamince včera jídlo připravila tetička. Maminka má
                                                                        tedy již jídlo uvařeno. ‘The aunt prepared the
     2. resultative diathesis with the auxiliary verb být ‘to be’
                                                                        food for the mother yesterday. Therefore, the mother
        e.g. Jídlo je uvařeno. ‘The food is cooked.’                   has the food cooked already.’ This case is considered
        This form of resultative diathesis differs from the pe-         to be a diathesis, because the Actor of the first sen-
        riphrastic passive only in meaning, not in the surface          tence (the aunt) has moved away from the subject po-
        form or structure. In many cases, it is not clear which         sition (and is not expressed in the resultative variant
        of the two possible readings the speaker has in mind.           at all).
        For example, the sentence okno je otevřeno may be              Maminka vařila celé dopoledne a nyní již má jídlo
        interpreted as a case of the resultative diathesis, de-         uvařeno. ‘Mother has been cooking all morning and
        scribing a state, i.e. ‘the window is (already) open’,          now she already has the food cooked.’ This case is
        or as a case of the passive diathesis, describing an            not considered to be a diathesis, because the same
        event, i.e. ‘the window is (being) opened’. This am-            complementation is corresponding to the subject in
        biguity is called event–state homonymy in Czech lin-            both cases.
        guistics. Because it is so common, we assume that the           In practice, however, it is often impossible to distin-
        passive diathesis is possible whenever the resultative          guish between the two readings (Panevová et al. [17]
        diathesis is possible and vice versa.                           claim that out of 60 cases of a resultative diathesis
        Moreover, Czech also exhibits a competition between             in the PDT, 23 are ambiguous), and the difference is
        past participles and deverbal adjectives. The kind of           usually only obvious from the context, so it is inac-
        deverbal adjectives that we have in mind are formed             cessible to the kind of naive, syntax-based automatic
        by adding vowel endings to past participles; both the           methods that we are trying to use. Our automatic
        participle and the adjective can then be used to ex-            method does not differentiate between the two read-
        press the resultative meaning, while only the partici-          ings.
        ple can be used to express the passive meaning. On
        one hand, this interchangeability of the “short” (par-       4. recipient passive diathesis
        ticiple) and “long” (adjectival) forms is often used            e.g. Dostal jsem zaplaceno (od šéfa). ‘I got paid (by
        as a guideline in determining whether a given sen-              the boss).’
        tence should be considered resultative: if the partici-         The most visible characteristic of the recipient
        ple can be replaced with the adjective, the resultative         diathesis is the auxiliary verb dostat ‘to get’ and the
        interpretation is valid. On the other hand, participle          past participle. The original frame must contain a
        forms are sometimes used in purely adjectival mean-             complementation in dative or a benefactor; this com-
        ing, such as in the sentence stále ještě nebyl najeden         plementation becomes the subject of the diathetic
        ‘he still was not full’, which features the word form           construction. The actor is expressed in the Instrumen-
        najeden (past participle of the reflexive verb najíst se        tal case, or as a prepositional phrase od+Gen ‘by’.
        ‘satiate oneself, eat so much that one is full’). If we         If there is a semantic patient, it keeps its form and
        were to read this as a diathesis, this would have to            agrees with the participle in gender and case. All
        be a case of periphrastic passive or of the resultative         these conditions together are fairly specific and there-
        diathesis with auxiliary verb být ‘to be’ formed from           fore allow for a fairly accurate search for corpus con-
        the sentence najedl se ‘he ate to be full’. However, it         cordances.
        is not possible that the same complementation would
        fill the subject position in both the active and the pas-
        sive/resultative diathesis. We have to read this sen-       Diatheses with the reflexive particle se.
        tence as a sentence with the adjective najedený ‘full,       5. deagentive diathesis (reflexive passive)
        satiated’.
                                                                        e.g. Vařilo se tu pro emigranty. ‘Cooked - reflexive
     3. possessive resultative diathesis                                - here - for - emigrants.’ – ‘It was cooked here for
                                                                        emigrants.’
        e.g. Maminka má jídlo uvařeno. ‘The mother has the
        food cooked.’                                                   The only surface marks of this diathesis are a verb in
                                                                        the third person (agreeing with the subject in num-
        In this type of construction, auxiliary verb mít ‘to            ber and gender, or singular neuter for subjectless sen-
        have’ is used together with the past participle of the          tences) and the free reflexive morpheme se. The Ac-
        main verb.                                                      tor is not expressed in this kind of construction at all.
        Note that the conversive aspect is crucial for our the-         Rules for forming the deagentive diathesis have al-
        oretical concept of a diathesis. For example, only one          most the same conditions as the rules for forming the


      passive diathesis (and both types can be applied to         3    Treatment of diatheses in Vallex and
      almost any frame), and also the ways in which the                PDT-Vallex
      Patient, Addressee or Effect are moved into the sub-
      ject position are the same. However, the sets of verbs      Our proposal is primarily formulated for the purpose of the
      that allow the two diatheses are different.                 description of valency in valency lexicons of Czech verbs
                                                                  built within the framework of the Functional Generative
  6. dispositional diathesis (mediopassive)                       Description (FGD). We are working with two lexicons,
                                                                  VALLEX 2.6⁵, see [13], and PDT-Vallex 2.0⁶, see [22],
      e.g. Dobře[adverb] se (mi[Dat]) tu hrál tenis.             although this phenomenon is to be solved in any valency
      ‘Well - reflexive - to-me - here - played - tennis.’ –      lexicon. We work with the common format developed for
      ‘For me, this was a good place to play tennis. I en-        the two lexicons by Bejček et al. [1].
      joyed playing tennis here.’                                    Both lexicons are divided into two components: a data
                                                                  component and a rule component.
      A characteristic feature of the dispositional diathesis
      is an evaluative element, usually an adverb such as
      dobře ‘well’, pomalu ‘slowly’. The verb form is the        3.1 The data component
      same as in the deagentive diathesis, i.e. a third per-      The data component consists of word entries correspond-
      son verb agreeing with the subject (or singular neuter      ing to verb lexemes. Lexeme is an abstract twofold
      in subjectless sentences) + reflexive particle se. How-     data structure which associates lexical form(s) and lexi-
      ever, the Actor may be expressed on the surface in the      cal unit(s). Lexical forms are all possible manifestations
      dative case.                                                of a lexeme in an utterance (e.g. a lemma or a group of
      Although people do not have difficulties distinguish-       lemmas,⁷ all morphological forms of these lemmas, and
      ing between the deagentive and the dispositional            their reflexive and irreflexive forms). All lexical forms of
      diathesis, the difference is hard to grasp for an au-       a lexeme are represented by its lemma(s).
      tomatic procedure when the Actor in the dative is              In the lexicon, each lexical unit (a sense of a verb) is
      elided from the sentence (in a sample from the cor-         characterized by a gloss (a verb or a paraphrase roughly
      pus SYN2005 cited in [17], the Actor was only ex-           synonymous with the given sense) and by example(s) (sen-
      pressed in 22 sentences out of 143). For example, the       tence fragment(s) containing the given verb used in the
      following sentence is deagentive, although its surface      given sense). The core information on valency character-
      structure is similar to the example of a dispositional      istics of a lexical unit is encoded as (exactly one) valency
      diathesis given above:                                      frame reflecting the unmarked (active) use of the verb.⁸
      Odpoledne[adverb] se tu hrál tenis.                            In an ideal model of the lexicon, information on the
      ‘In the afternoons - reflexive - here - played - tennis.’   possible application of diatheses is stored in each lexical
      – ‘In the afternoon, tennis was played here.’               unit in a special attribute -diat. This attribute has not
                                                                  been implemented in either of the lexicons yet, but the at-
      Moreover, this diathesis is very rare—according to          tribute -rfl (reflexivity) that is present in Vallex overlaps
      [17], it appears only 8 times in the tectogrammati-         with the proposed attribute -diat to a certain extent. It
      cally annotated part of the PDT. We therefore follow        lists possible syntactic functions of the reflexive morpheme
      the strategy used by Skoumalová [21, p.47] and as-          se/si. The values of the -rfl attribute are pass for reflex-
      sume that any imperfective verb which can form the          ive passives in verbs with accusative complements, pass0
      deagentive diathesis can also form the dispositional        for reflexive passives in intransitive verbs, and cor4 and
      diathesis.                                                  cor3 for cases where se and si fill the position of an object
                                                                  in accusative (cor4) or in dative (cor3), showing that the
                                                                  subject is performing an action on itself. If the verb allows
Diatheses with the reflexive particle si.                         has been exemplified with made-up examples (the annotators
                                                                  simply converted the examples given for the active
  7. causative diathesis                                          diathesis into a passive/reciprocal construction). Many of
                                                                  these
      e.g. Nechal si od Gesy vařit. ‘He let - reflexive - by
      Gesy - cook.’ – ‘He let Gesy cook for him.’

      This verbal form roughly corresponds to the English
      ‘have something done’. For lexicographic purposes,
      we view the causative diathesis as a separate sense of
      the verbs nechat/dát ‘let/give’. One reason for this
      treatment lies in the fact that two Actors appear in the
      construction: the Actor of the verb nechat/dát and
      the Actor of the dependent infinitive.

    ⁵ http://ufal.mff.cuni.cz/vallex/2.6/doc/home.html
    ⁶ This is the version that has been published as part of the Prague Czech-English Dependency
Treebank 2.0; it is available from http://ufal.mff.cuni.cz/pcedt2.0/publications/vallex3.xml
and can be browsed at http://ufal.mff.cuni.cz/lindat/PDT-Vallex.html.
    ⁷ Vallex lexemes comprise perfective, imperfective and iterative variants, as well as spelling
variants, so that the lexicon covers almost twice as many lemmas as lexemes. On the other hand,
there is a one-to-one correspondence between lexemes and lemmas in PDT-Vallex.
    ⁸ See the Introduction for more details about valency frames.
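
The information that Section 3.1 associates with a lexical unit (gloss, examples, exactly one valency frame, the existing -rfl attribute and the proposed -diat attribute) can be pictured as a small record. The Python sketch below is a hypothetical in-memory representation, not the XML format of either lexicon; the class and field names are assumptions, and the -rfl value in the usage note is made up for illustration rather than copied from Vallex.

```python
from dataclasses import dataclass, field

# Values of the -rfl attribute as listed in Section 3.1.
RFL_VALUES = {"pass", "pass0", "cor4", "cor3"}


@dataclass
class LexicalUnit:
    """One sense of a verb (hypothetical record, not the lexicon's own schema)."""
    lemma: str
    gloss: str
    examples: list
    frame: str                                  # exactly one frame: the unmarked (active) use
    rfl: set = field(default_factory=set)       # existing attribute -rfl
    diat: set = field(default_factory=set)      # proposed attribute -diat, not yet implemented

    def __post_init__(self):
        # Reject -rfl values that the text does not mention.
        unknown = self.rfl - RFL_VALUES
        if unknown:
            raise ValueError(f"unknown -rfl value(s): {sorted(unknown)}")
```

For instance, `LexicalUnit("vyzývat", "to appeal, to challenge", ["vyzvat někoho na souboj"], "ACT[Nom] ADDR[Acc] PAT[k+Dat, na+Acc, aby, ať, že]", rfl={"pass"})` builds a record whose frame and example come from Example 1, while the `rfl` value is only a placeholder.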


                       Table 1: Counts                          • provide natural corpus examples of the diathesis to be
                                   Vallex     PDT-Vallex          included in the lexicon,
    lexemes⁷                       2726       7103            and in uncertain cases
    lexical units (LU)             6451       11932             • provide corpus evidence on the basis of which the
                                                                  annotators can quickly make the decision.
                           Lemmas                                Below we describe such a method in some detail. The
    lemmas (L)                     4789       7103            method works by iterating over the frames in three passes.
    LUs (separated by lemma)       11229      11932           The first pass is a negative pass which filters out
    reflexive                      1528       1590            lexical units where the diathesis is not applicable due to
                                   31.9 %     22.4 %          either grammatical concerns or insufficient corpus
    nonreflexive, 0 occurrences    586        783             evidence. The second pass is a positive pass where lexical
    of past participles (tag „Vs“) 12.2 %     11.0 %          units with sufficient evidence for applicability are dealt
    nonreflexive, some             888        758             with. In the final step, corpus evidence is gathered for
    occurrences of Vs, 1 LU        18.5 %     10.7 %          the remaining unclear lexical units. This evidence is then
    nonreflexive, 0 occurrences    125        71              presented to the annotator for a manual decision. If the
    in sentences with “se”         2.6 %      1.0 %           second or third phase yields a large number of examples,
    tagged as “P7-X4-*”                                       the automatic method should also order them so that
                                                              simple, clear examples come first. The method of ordering
                                                              corpus examples
examples do not sound natural. We hope that our methods       used by [8] is well-suited for our purposes.
will provide some more natural corpus examples. More-            Due to the difficulties in distinguishing some of the
over, we intend to cover other diatheses that the current     diatheses mentioned above, the proposed semi-automatic
annotation does not cover.                                    procedure only strives to identify cases of the following
   See Table 1 for counts of lexemes, lemmas and lexi-        diatheses: periphrastic passive, possessive resultative, re-
cal units in both lexicons. Both lexicons are available       cipient, and deagentive (reflexive passive).
in machine-tractable XML format and also as human-
friendly web pages.
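
The three passes of the proposed method (a negative pass, a positive pass, and the collection of evidence for manual decision) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: the `Candidate` record, the `min_positive` threshold and the use of sentence length as a stand-in for "simple, clear examples first" are all hypothetical, and the ordering method of [8] is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """A lexical unit paired with one diathesis (hypothetical record)."""
    lexical_unit: str
    excluded_by_rule: bool                        # e.g. a grammatical reason for exclusion
    examples: list = field(default_factory=list)  # corpus sentences suggesting the diathesis


def three_pass(candidates, min_positive=3):
    """Split candidates into rejected, accepted, and cases left for the annotator.

    min_positive is an assumed threshold for "sufficient evidence".
    """
    rejected, accepted, review = [], [], {}
    for cand in candidates:
        if cand.excluded_by_rule or not cand.examples:
            # Negative pass: grammatical concerns or no corpus evidence.
            rejected.append(cand.lexical_unit)
        elif len(cand.examples) >= min_positive:
            # Positive pass: enough evidence for applicability.
            accepted.append(cand.lexical_unit)
        else:
            # Final step: present the evidence, shortest sentences first
            # (sentence length is only a crude proxy for simplicity).
            review[cand.lexical_unit] = sorted(cand.examples, key=len)
    return rejected, accepted, review
```

Keeping the third result as a mapping from lexical units to ordered examples mirrors the idea that the annotator, not the procedure, makes the final decision in unclear cases.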
                                                              4.1 Negative pass — excluding frames

3.2     The rule component                                    In the negative pass we use various methods for excluding
                                                              inapplicable diatheses. In some cases, we exclude whole
The proposed rule component of the lexicon consists of        lexemes (reflexive verbs and lexemes for which no cor-
a set of formal syntactic rules determining changes in the    pus evidence suggesting a possibility of the diathesis was
mapping of valency complementations onto surface syn-         found); the rule-based exclusion, on the other hand, may
tactic positions. They make it possible to obtain all         exclude some lexical units of a lexeme while others proceed
possible surface syntactic manifestations of lexical units of    into the next phase.
verbs (i.e., number of complementations, their types and
possible morphological forms).
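
To make the idea of a rule that remaps valency complementations more concrete, the following Python sketch applies a deliberately simplified periphrastic-passive rule (ACT: Nom to Instr, PAT: Acc to Nom) to a frame stored as a dictionary. The frame representation, the name `apply_rule` and the rule itself are illustrative assumptions; the actual rules cover many more configurations than this toy rule does.

```python
from typing import Dict, Optional, Tuple

# functor -> morphemic form, e.g. {"ACT": "Nom", "PAT": "Acc"}
Frame = Dict[str, str]
# functor -> (form required in the active frame, form in the diathesis)
Rule = Dict[str, Tuple[str, str]]

# Hypothetical, simplified rule used only for illustration.
PERIPHRASTIC_PASSIVE: Rule = {"ACT": ("Nom", "Instr"), "PAT": ("Acc", "Nom")}


def apply_rule(frame: Frame, rule: Rule) -> Optional[Frame]:
    """Return the remapped frame, or None if a required form is missing."""
    derived = dict(frame)  # complementations not named in the rule are kept
    for functor, (active_form, derived_form) in rule.items():
        if frame.get(functor) != active_form:
            return None  # the structure required by this toy rule is absent
        derived[functor] = derived_form
    return derived
```

For example, `apply_rule({"ACT": "Nom", "PAT": "Acc", "ADDR": "Dat"}, PERIPHRASTIC_PASSIVE)` changes only the forms of ACT and PAT, while a frame whose PAT is realized as `na+Acc` is rejected by this particular toy rule.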
   At present, we use transformational rules formulated for   Reflexives. We assume that none of the diatheses is ap-
the purposes of the description of diatheses in PDT-Vallex,   plicable to a lexeme with a reflexive lemma. These cases
the lexicon of the Prague Dependency Treebank, see [23].      include reflexiva tantum (bát se ‘to fear’) and derived re-
                                                              flexives (šířit se ‘to spread (itself)’). This assumption cov-
                                                              ers 1528 out of 4789 lemmas occurring in Vallex 2.6, and
4      Methodology                                            1590 out of 7116 verb lemmas occurring in PDT-Vallex
                                                              2.0. It can be seen from Table 1 that so far this is the most
Due to the size of the lexicon, it is preferable to mini-     effective step in the negative pass.
mize the necessary manual work involved in augmenting            We are aware of the fact that this assumption is only
the lexicon with information about applicable diatheses.      approximately valid. According to [9, p. 93], derived re-
Moreover, experience suggests that annotators tend to be      flexives do not form passive (neither periphrastic nor re-
positively biased towards assuming the applicability of the   flexive), but some reflexiva tantum do; [21, p. 43] is only
diatheses. Also, examples given by annotators tend to be      aware of two reflexive verbs that form a periphrastic
contrived/unnatural. To address these problems we would       passive, the reflexiva tantum tázat se ‘to ask’ and obávat se
like to have a semiautomatic method which should, where       ‘to fear’, and otherwise assumes that reflexive verbs do
possible                                                      not form passives. While [23, p. 124] discusses the lim-
                                                              ited possibility of forming the reflexive passive of reflexiva
     • automatically decide whether a diathesis is applica-   tantum, she also gives a (made up?) example of a stylisti-
       ble,                                                   cally non-neutral sentence smálo se, až se plakalo ‘it was


laughed so much that it was cried’ with reflexive passive                  list of the necessary structures for each diathesis is com-
of reflexiva tantum.                                                       piled. The effectiveness of this exclusion depends on the
   We have found several other cases where reflexive verbs                 type of diathesis. The diatheses that are formed with the
form a diathesis:                                                          past participle can be applied to almost any structure. This
                                                                           step is a little more useful in the cases where the
  1. To se lehko pamatuje. ‘This is easy to remember. It is                diathesis is formed using the particle se.
     easy to remember it.’ (derived from pamatovat si ‘to
     remember’)
     Na to se lehko zvykne. ‘This is easy to get used to. It               Corpus-based exclusion. We start with a very naive im-
     is easy to get used to it.’ (derived from zvyknout si ‘to             plementation of this step, excluding the applicability of the
     get used to’)                                                         diatheses for whole lexemes. Applicability of the diathe-
     Na všechno se zvykne. ‘Everything gets used to. Peo-                  ses formed with the past participle may be ruled out if
     ple get used to everything.’ (derived from zvyknout si                the past participle is not found in the corpus. Similarly,
     ‘to get used to’)                                                     we may exclude the applicability of the reflexive passive
     This usage is almost idiomatic; the first two exam-                   whenever the verb does not appear in the same sentence as
     ples are cases of the dispositional diathesis, and the                the particle se anywhere in the corpus. Table 1 shows that
     third seems to be derived from it. We expect that fur-                we need to refine these criteria, especially for the exclu-
     ther research will show that this type of construction                sion of the reflexive passive.
     is productive even among reflexive verbs.                                The mere presence of the se token is not necessarily in-
                                                                           dicative of the given diathesis. First of all, the se need
  2. Prezident Václav Havel je lidmiInstr nejméně oblíben                 not be a particle at all, e.g. in the sentence tančil se že-
     od té doby, kdy začal prezidentovat. ‘President                      nou ‘he danced with (his) wife’, the word se ‘with’ is in
     Václav Havel is by-the-people least liked since he                    fact a preposition. (The morphological tagger used to tag
     started presidenting.’ – ‘President Václav Havel’s                    the corpus SYN is accurate enough to overcome this am-
     popularity is the least since he became president.’                   biguity.) But even as a particle, se can be part of a dif-
     (derived from oblíbit si ‘get to like’)                               ferent grammatical structure, e.g. in the sentence snažil se
     The corpus contains many instances of je oblíben ‘is                  tančit ‘he tried reflexive to dance’ the word se belongs to
     liked’ which can be easily analyzed as cases of the                   the reflexivum tantum snažit se ‘to try’, not to the verb
     verbo-nominal predicate být oblíben(ý) ‘to be liked’,                 tančit ‘to dance’. Limiting the search to segments en-
     not as passive. However, this particular sentence also                closed by punctuation might exclude some genuine exam-
     contains the Actor lidmi ‘by the people’ in the Instru-               ples of diatheses: minule se, pokud si pamatuji, tančilo
     mental case, which is typical of a passive construc-                  až do rána ‘the last time reflexive, as far as I remember,
     tion. One option is to claim that lidmi is a valency                  danced until morning’ – ‘as far as I remember, the last
     complementation of the adjective (oblíben kým ‘to be                  time dancing continued until morning’; we do not want to
     liked by whom’). The other option is to admit that                    take the risk of missing some existing evidence already in
     this is a case of a passive construction, possibly re-                this phase. Thus, naive corpus search does not suffice to
     lated to the historical existence of the verb oblíbit ‘get            exclude more than a tiny number of verbs (as can be seen
     to like’ without a reflexive particle (as documented in               from Table 1): auxiliary methods such as (shallow) parsing
     [5], see footnote 9).                                                 or at least clause detection are needed. The Prague
                                                                           Dependency Treebank is too small for the purpose of rejecting
  3. Zdálo se, že toto úsilí už už začne nést ovoce, bylo
                                                                           the applicability of a diathesis, especially if it is rare such
     vděčně povšimnuto čtenáři. ‘It seemed that the effort
                                                                           as the possessive resultative. (E.g., there are only about
     will soon bear fruit, it was noticed by the readers.’
                                                                           70 instances of the possessive resultative in the whole of
     (derived from povšimnout si ‘notice’)
                                                                           PDT.) Corpus SYN, albeit more adequate in size, is not
     Here, the reading as a verbo-nominal predicate seems
                                                                           parsed, so a different, inherently less reliable method must
     even less likely than in the previous example.
                                                                           be used. We could, for example, base our decision as to
The possibility to form passives of reflexive verbs is cer-                whether the se is connected to the relevant verb or not on
tainly an interesting area for further research.                           their distance in the sentence.
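
The distance-based decision mentioned in the discussion of corpus-based exclusion could be sketched in Python as follows. The sketch assumes that a sentence is a list of (word form, lemma, tag) triples and that the reflexive se carries a tag starting with "P7", as in the "P7-X4-*" tags referred to in Table 1. The function name `se_near_verb` and the window `max_distance` are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str, str]  # (word form, lemma, morphological tag)


def se_near_verb(sentence: List[Token], verb_lemma: str,
                 max_distance: int = 3) -> bool:
    """True if a reflexive 'se' occurs within max_distance tokens of the verb.

    max_distance is an assumed parameter; the paper gives no value.
    """
    verb_positions = [i for i, (_, lemma, _) in enumerate(sentence)
                      if lemma == verb_lemma]
    # Only tokens tagged as the reflexive pronoun/particle count; the
    # preposition 'se' ('with') carries a different tag and is ignored.
    se_positions = [i for i, (form, _, tag) in enumerate(sentence)
                    if form.lower() == "se" and tag.startswith("P7")]
    return any(abs(v - s) <= max_distance
               for v in verb_positions for s in se_positions)
```

The heuristic is fallible in both directions: in snažil se tančit the se is adjacent to tančit although it belongs to snažit se, while in minule se, pokud si pamatuji, tančilo až do rána the se is six tokens away from the verb it belongs to, so a small window would miss it.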


Rule-based exclusion. Some of the diatheses require a                      Combination of the rule-based and corpus-based
particular grammatical structure to be applicable. It is                   method. The rules allow us to identify frames describ-
therefore possible to exclude frames where this structure                  ing structures in which a given lexical unit may appear in
is absent. Here we rely on [23] where a machine-readable                   a diathesis. These structures can be turned automatically
                                                                           into patterns for corpus search. In general, no significant
    9 At http://psjc.ujc.cas.cz/, search for oblíbiti gives 60 in-         conclusions can be drawn from the fact that the resulting
stances documented on write-out cards; the relevant entry from the lexi-   search does not produce any results: Czech is a pro-drop
con can be found by searching for oblíbiti si.                             language, so even semantically obligatory elements can


be elided in the actual sentence. Only the dispositional           • The least strict measure is the applicability of a rule
diathesis contains an element that is obligatory on the sur-         for forming the given diathesis. This is a necessary,
face, but the range of possible morphemic realizations of            yet not sufficient condition. The rules have been de-
this evaluative element needs to be further researched.              scribed in detail in [23] and it is known that they
                                                                     heavily overgenerate.
4.2   Positive corpus-based pass                                   • If corpus evidence is found for the applicability of the
                                                                     diathesis, the amount/reliability of this evidence may
In the positive pass we search the corpus for evidence               be just as important (especially if the decision is not
showing the applicability of a given diathesis. Especially           reviewed by an annotator). Even a single corpus oc-
an occurrence of a past participle is indicative of a diathe-        currence provides evidence that it is possible to form
sis (although concerns about the competition between past            the diathesis, yet (if the verb itself is frequent) it also
participles and adjectives need to be addressed). The three          provides evidence that for some reason, that possibil-
kinds of diatheses with past participle forms that we in-            ity is not widely used by the users of the language.
tend to distinguish—periphrastic passive, possessive resul-
tative and recipient passive—moreover differ in the aux-           • Lack of corpus evidence leads to the exclusion of
iliary verbs. Therefore we assume that instances of past             some LUs that pass the first test. We expect to find
participles found in the corpus can be assigned to a diathe-         cases where no corpus evidence of the applicability
sis with a fair amount of certainty. The situation with the          of the diathesis will be found, yet an annotator pre-
passive constructions built with the reflexive particle se is        sented with the LU might still feel that it cannot be
more complex, but the techniques developed for the first             excluded completely. (This is essentially the same
pass will hopefully help here as well.                               case as we discussed in the previous paragraph—
   The automatic method must be able to assign the evi-              a possibility that is exploited only rarely—only this
dence found to a particular diathesis and to a particular lex-       time for diathesis-verb combinations that did not ap-
ical unit (it does not suffice to know that a verb with many         pear in the corpus.) We believe that in such a case,
meanings appears in the passive diathesis in the given sen-          and if the entry has been reviewed by an annotator, it
tence; we are looking for examples which we can                      is best to provide this information to the user of the
disambiguate). Sometimes, the first pass will give us a single       lexicon.
candidate. In other instances, we apply the rules to the
remaining frames, derive the description of the full struc-      Acknowledgments
tures corresponding to a diathesis, and then search the cor-
pus for patterns with elements that are unique to only one       The research reported in this paper was supported by
of the candidates.                                               the grant of the Czech Science Foundation GAČR No.
                                                                 P406/12/0557.
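
For the positive pass (Section 4.2), assigning a past participle to one of the three diatheses by its auxiliary verb could look like the Python sketch below. The auxiliary lemmas (být, mít, dostat) follow standard descriptions of Czech and are not listed in the text above; the "Vs" tag prefix for participles is the one mentioned in Table 1. The function name `classify_participle` and the nearest-auxiliary strategy are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str, str]  # (word form, lemma, morphological tag)

# Assumed mapping from auxiliary lemma to diathesis.
AUXILIARY_TO_DIATHESIS = {
    "být": "periphrastic passive",
    "mít": "possessive resultative",
    "dostat": "recipient passive",
}


def classify_participle(sentence: List[Token], verb_lemma: str) -> Optional[str]:
    """Guess the diathesis of verb_lemma from the nearest auxiliary verb."""
    for i, (_, lemma, tag) in enumerate(sentence):
        if lemma == verb_lemma and tag.startswith("Vs"):
            # Collect (distance, diathesis) for every auxiliary in the sentence.
            candidates = [(abs(i - j), AUXILIARY_TO_DIATHESIS[aux_lemma])
                          for j, (_, aux_lemma, _) in enumerate(sentence)
                          if j != i and aux_lemma in AUXILIARY_TO_DIATHESIS]
            if candidates:
                return min(candidates)[1]
    return None  # no participle of this verb, or no auxiliary found
```

A participle without any auxiliary in the sentence is left unclassified, which matches the cautious treatment of the competition between participles and adjectives.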
                                                                    The first author was partially supported by the grant
4.3   Corpus evidence for manual annotation                      SVV-2013-267314.
Finally, methods similar to those of the second phase will be       This work has been using language resources developed
used, but examples with ambiguous status will be out-            and/or stored and/or distributed by the LINDAT-Clarin
put. We expect that the examples will be automatically           project of the Ministry of Education of the Czech Republic
assigned to a diathesis with high precision. Thus, for each      (project LM2010013).
combination of a lexical unit and a diathesis that remain
undecided after the previous pass, the system will be able       References
to provide the annotator with a selection of sentences that
could be instances of this LU in the given diathesis with        [1] Bejček, E., Kettnerová, V., and Lopatková, M. (2010).
high likelihood. The annotator will then either select a            Advanced searching in the valency lexicons using
couple of examples that demonstrate the applicability of            PML-TQ search engine. In Sojka, P., Horák, A.,
the diathesis, or will decide that the diathesis is not appli-      Kopeček, I., and Pala, K., editors, Text, Speech and
cable to the given LU.                                              Dialogue. 13th International Conference, volume 6231
                                                                    of Lecture Notes in Computer Science, pages 51–58,
                                                                    Berlin / Heidelberg. Masarykova univerzita, Springer.
5     Conclusions
                                                                 [2] Dębowski, Ł. (2009). Valence extraction using EM
                                                                    selection and co-occurrence matrices. Language re-
We introduced a (semi-)automatic method for identify-
                                                                    sources and evaluation, 43(4):301–327.
ing lexical units that undergo individual diatheses, and we
have discussed some of the difficulties that stand in the        [3] Hajič, J. (2006). Complex corpus annotation: The
way of a fully automatic procedure. We have also shown              Prague Dependency Treebank. In Šimková, M., edi-
that the question whether a diathesis is applicable to a lex-       tor, Insight into Slovak and Czech Corpus Linguistics,
ical unit may be answered in several different ways:                pages 54–73. Veda, Bratislava.


[4] Hajič, J., Panevová, J., Hajičová, E., Sgall, P., Pajas,   [15] McCarthy, D. and Korhonen, A. (1998). Detecting
   P., Štěpánek, J., Havelka, J., Mikulová, M., Žabokrt-           verbal participation in diathesis alternations. In Pro-
   ský, Z., and Ševčíková-Razímová, M. (2006). Prague              ceedings of the 36th Annual Meeting of the Association
   Dependency Treebank 2.0. CD-ROM. LDC Catalog                     for Computational Linguistics and 17th International
   No. LDC2006T01.                                                  Conference on Computational Linguistics - Volume 2,
                                                                    ACL ’98, pages 1493–1495, Stroudsburg, PA, USA.
[5] Hujer, O., Smetánka, E., Weingart, M., Havránek, B.,            Association for Computational Linguistics.
   Šmilauer, V., and Získal, A. (1933–1957). Příruční
   slovník jazyka českého. Státní nakladatelství, Státní        [16] Panevová, J. (1994). Valency frames and the mean-
   nakladatelství učebnic, Státní pedagogické nakladatel-          ing of the sentence. In Luelsdorff, P. A., editor, The
   ství, Praha.                                                     Prague School of Structural and Functional Linguis-
                                                                    tics, pages 223–243. John Benjamins Publishing Com-
[6] Kettnerová, V. and Lopatková, M. (2011). The lexico-            pany, Amsterdam, Philadelphia.
   graphic representation of Czech diatheses: Rule based
   approach. In Majchráková, D. and Garabík, R., editors,        [17] Panevová, J. et al. (manuscript). Syntax současné
   Natural Language Processing, Multilinguality, pages              češtiny (na základě anotovaného korpusu). Naklada-
   89–100, Bratislava, Slovakia. Tribun EU.                         telství Karolinum, Praha.

                                                                 [18] Sarkar, A. and Zeman, D. (2000). Automatic extrac-
[7] Kettnerová, V., Lopatková, M., and Bejček, E. (2012).
                                                                    tion of subcategorization frames for Czech. In Proceed-
   The syntax-semantics interface of Czech verbs in the
                                                                    ings of the 18th International Conference on Compu-
   valency lexicon. In Fjeld, R. and Torjusen, J., edi-
                                                                    tational Linguistics, COLING 2000, volume 2, pages
   tors, Proceedings of the 15th EURALEX International
                                                                    691–697, Stroudsburg, PA, USA. Association for Com-
   Congress, pages 434–443, Oslo, Norway. Department
                                                                    putational Linguistics.
   of Linguistics and Scandinavian Studies, University of
   Oslo.                                                         [19] Schulte im Walde, S. (2000). Clustering verbs se-
                                                                    mantically according to their alternation behaviour. In
[8] Kilgarriff, A., Husak, M., McAdam, K., Rundell, M.,
                                                                    Proceedings of the 18th conference on Computational
   and Rychlý, P. (2008). GDEX: Automatically finding
                                                                    linguistics - Volume 2, COLING ’00, pages 747–753,
   good dictionary examples in a corpus. In Bernal, E.
                                                                    Stroudsburg, PA, USA. Association for Computational
   and DeCesaris, J., editors, Proceedings of the 13th EU-
                                                                    Linguistics.
   RALEX International Congress, Barcelona, Spain. In-
   stitut Universitari de Lingüística Aplicada. Universitat      [20] Sgall, P., Bémová, A., Borota, J., Hajičová, E.,
   Pompeu Fabra; Documenta Universitaria.                           Hajičová, I., Jirků, P., Panevová, J., Piťha, P., Plátek, M.,
                                                                    and Vrbová, J. (1986). Úvod do syntaxe a sémantiky.
[9] Kopečný, F. (1962). Základy české skladby. Státní             Academia.
   pedagogické nakladatelství, Praha, 2. edition.
                                                                 [21] Skoumalová, H. (2001). Czech Syntactic Lexicon.
[10] Korhonen, A. (2002). Subcategorization Acquisition.            PhD thesis, Charles University in Prague.
   PhD thesis, University of Cambridge.
                                                                 [22] Urešová, Z. (2011a). Valenční slovník Pražského
[11] Lapata, M. (1999). Acquiring lexical generalizations           závislostního korpusu (PDT-Vallex). Studies in Com-
   from corpora: A case study for diathesis alternations. In        putational and Theoretical Linguistics. Ústav formální
   Proceedings of the 37th annual meeting of the Associ-            a aplikované lingvistiky, Praha, Czech Republic.
   ation for Computational Linguistics on Computational
   Linguistics, pages 397–404. Association for Computa-          [23] Urešová, Z. (2011b). Valence sloves v Pražském
   tional Linguistics.                                              závislostním korpusu. Studies in Computational and
                                                                    Theoretical Linguistics. Ústav formální a aplikované
[12] Levin, B. C. (1993). English Verb Classes and Alter-           lingvistiky, Praha, Czech Republic.
   nations: A Preliminary Investigation. The University
   of Chicago Press, Chicago and London.

[13] Lopatková, M., Žabokrtský, Z., and Kettnerová, V.
   (2008). Valenční slovník českých sloves. Karolinum,
   Praha.

[14] McCarthy, D. (2001). Lexical acquisition at the
   syntax-semantics interface: diathesis alternations, sub-
   categorization frames and selectional preferences. PhD
   thesis, University of Sussex.