=Paper=
{{Paper
|id=None
|storemode=property
|title=Towards Automatic Detection of Applicable Diatheses
|pdfUrl=https://ceur-ws.org/Vol-1003/10.pdf
|volume=Vol-1003
|dblpUrl=https://dblp.org/rec/conf/itat/VernerovaL13
}}
==Towards Automatic Detection of Applicable Diatheses==
ITAT 2013 Proceedings, CEUR Workshop Proceedings Vol. 1003, pp. 10–17
http://ceur-ws.org/Vol-1003, Series ISSN 1613-0073, © 2013 A. Vernerová, M. Lopatková
Anna Vernerová and Markéta Lopatková
Charles University in Prague
Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
Czech Republic
{vernerova, lopatkova}@ufal.mff.cuni.cz
Abstract: The valency behavior (argument structure) of lexical items is so varied that it cannot be described by general rules and must be captured in lexicons separately for each lexical item. For verbs, lexicons typically describe only unmarked usage (the active form), while natural languages allow for certain regular changes in the number, type and/or realization of complementations (e.g. passivization). Thanks to their regularity, such changes may be described in a separate rule component of the lexicon; however, they are typically seen in many but not all verbs and their applicability to a given lexical unit (verb meaning) is not predictable from its valency alone. In this paper, we describe our initial experiments with using a large morphologically annotated corpus of Czech for determining which diatheses are applicable to a given lexical unit.

1 Introduction

Valency refers to the argument structure of lexical units.¹ In the Functional Generative Description (FGD), valency belongs to the so-called tectogrammatical layer [16, 20], i.e. the layer of linguistically structured meaning. It is captured by so-called valency frames specifying the valency complementations (arguments that are either required or specifically permitted by the given lexical unit). For each valency complementation, both its semantics (in the form of a tectogrammatical functor, which captures a coarse-grained semantic role) and its syntactic/morphological form must be specified.

¹ Whereas the term lexeme roughly corresponds to a dictionary verb item with all its meanings, by a lexical unit (LU) we refer to a verb in a given meaning. See Section 3.1 for more details.

Example 1²

• vyzývat ‘to appeal, to challenge’
ACT[Nom] ADDR[Acc] PAT[k+Dat, na+Acc, aby, ať, že]
vyzvat někoho[ADDR.Acc], aby se uklidnil[PAT.aby-Clause]
‘to ask somebody to calm down’
vyzvat někoho[ADDR.Acc] na souboj[PAT.na+Acc]
‘to challenge somebody to a duel’

• apelovat ‘to appeal’
ACT[Nom] ADDR[na+Acc] PAT[aby, ať, že]
apelovat na kolegy[ADDR.na+Acc], aby práci dokončili[PAT.aby-Clause] včas
‘to appeal to his colleagues to finish the work in time’

• apelovat ‘to put emphasis’
ACT[Nom] PAT[na+Acc]
v jeho rodině se stále apeluje na morálku[PAT.na+Acc]
‘in his family emphasis is always put on morality’

The above examples demonstrate how valency behavior varies even among semantically close lexical units (LUs), both when they belong to the same lexeme and when they belong to different lexemes. It must therefore be captured for each lexical unit of a verb separately in the form of a lexical entry listed in the valency lexicon. On the other hand, certain changes in the valency structure are regular and can be described in the form of rules which can be specified in a separate component of the lexicon. Such changes are typically seen in many but not all verbs and their applicability to a given lexical unit is not predictable from its valency frame alone.

A lexical entry does not list all of its possible forms but only one (usually the structure corresponding to the active form of the verb, which is considered to be its unmarked use) and a list of rules for creating other possible structures (the marked uses). This description is both economical (less space is needed for storing the information about all available realizations of the LU) and linguistically adequate (it captures generalizations which would not be obvious if all possible surface forms were listed).

Valency lexicons are created with many applications in mind: they help to maintain consistency of corpus annotation, provide syntactic and morphological information during parsing and natural language generation, and may even prove useful in word sense disambiguation and machine translation; moreover, lexicon data is consulted by linguists during their theoretical research and provides useful information for students of Czech. All of these tasks involve actual occurrences of the valency patterns in the natural language, and so the unmarked structures from the lexicon need to be converted into all structures that may appear in the actual data.
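
The conversion of an unmarked frame into a marked structure can be illustrated with a minimal Python sketch. It encodes two of the frames from Example 1 and applies one simplified rule for the periphrastic passive. The names `Slot`, `Frame`, `periphrastic_passive` and `describe` are hypothetical, and the rule is an illustrative assumption rather than the rule format actually used in Vallex or PDT-Vallex: it re-expresses the Actor as Instr or od+Gen and promotes the first complement that allows a bare accusative, if there is one, to the nominative subject.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Slot:
    functor: str            # tectogrammatical functor, e.g. "ACT", "ADDR", "PAT"
    forms: Tuple[str, ...]  # permitted surface forms, e.g. ("Nom",) or ("aby", "že")


@dataclass(frozen=True)
class Frame:
    lemma: str
    gloss: str
    slots: Tuple[Slot, ...]
    diathesis: str = "active"  # the unmarked use stored in the lexicon


# Two unmarked frames copied from Example 1.
VYZYVAT = Frame("vyzývat", "to appeal, to challenge", (
    Slot("ACT", ("Nom",)),
    Slot("ADDR", ("Acc",)),
    Slot("PAT", ("k+Dat", "na+Acc", "aby", "ať", "že")),
))
APELOVAT = Frame("apelovat", "to appeal", (
    Slot("ACT", ("Nom",)),
    Slot("ADDR", ("na+Acc",)),
    Slot("PAT", ("aby", "ať", "že")),
))


def periphrastic_passive(frame: Frame) -> Frame:
    """Hypothetical, simplified rule that over-generates on purpose.

    ACT is re-expressed as Instr or od+Gen. The first non-ACT slot that
    allows a bare accusative is promoted to the nominative subject; if no
    such slot exists, the subject position simply stays empty.
    """
    new_slots = []
    promoted = False
    for slot in frame.slots:
        if slot.functor == "ACT":
            slot = replace(slot, forms=("Instr", "od+Gen"))
        elif not promoted and "Acc" in slot.forms:
            slot = replace(
                slot,
                forms=tuple("Nom" if f == "Acc" else f for f in slot.forms),
            )
            promoted = True
        new_slots.append(slot)
    return replace(frame, slots=tuple(new_slots), diathesis="periphrastic passive")


def describe(frame: Frame) -> str:
    """Render a frame in the bracket notation used in Example 1."""
    parts = [f"{s.functor}[{', '.join(s.forms)}]" for s in frame.slots]
    return f"{frame.lemma} ({frame.diathesis}): " + " ".join(parts)


if __name__ == "__main__":
    for unmarked in (VYZYVAT, APELOVAT):
        print(describe(unmarked))
        print(describe(periphrastic_passive(unmarked)))
```

For vyzývat the sketch promotes ADDR[Acc] to the nominative subject; for apelovat ‘to appeal’ nothing is promoted and the derived structure is subjectless. Whether a derived structure of this kind is actually used in Czech is not decided by the rule itself, which is the question that the corpus evidence discussed in this paper is meant to answer.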
² The frames and examples are taken from Vallex 2.6, http://ufal.mff.cuni.cz/vallex/2.6/data/html.

A rule-based approach to creating derived valency structures has already been used during the annotation of the Prague Dependency Treebank³ (PDT) [3, 4]. Frames in the valency lexicon PDT-Vallex describe the unmarked structure but all possible structures may appear in actual treebank data. During consistency checks, general rules were used to generate frames describing the marked valency structures; then it was checked whether any of these marked structures matches the data and the annotation in the treebank. (The derived structures carry information about the required form of the verb, and the number and type of the valency complementations including their functors, obligatoriness and permitted forms.) The rules that were used for the conversions are described in detail in [23].

³ See http://ufal.mff.cuni.cz/pdt2.5/ for information about the current version.

Because correctness of the underlying PDT data was assumed, the rules were allowed to heavily over-generate. For example, “passive” frames of the verb mít ‘to have’ were generated although, in reality, it does not form a passive in Czech. While this is a reasonable strategy for consistency checks of annotated data, other tasks that utilize a valency lexicon would benefit from lists of diatheses applicable to any given lexical unit. Manual annotation provides a number of examples of lexical units occurring in different types of diatheses; however, the size of the tectogrammatically annotated PDT data is too small, so we cannot draw any conclusions from the fact that a lexical unit does not occur in a diathesis. Therefore, we are trying to draw evidence from a much larger, automatically morphologically annotated corpus. We have decided to use SYN, a non-referential corpus of 1,300 million automatically morphologically tagged words.

For Czech, [21] used simple heuristics for determining which diatheses are applicable to which lexical units (both kinds of passive for verbs with complementations realized as a prepositionless object, infinitive or dependent clause; only reflexive passive for intransitives and verbs where all complementations are realized as prepositional phrases; no passive for reflexives). For other languages, most authors have only studied the applicability of alternations and diatheses to whole lexemes rather than to individual lexical units [11, 14, 15, 19]. We also draw inspiration from the work on automatic extraction of whole frames from corpora, which has been attempted for several languages including English [10], Czech [18], and Polish [2].

2 Diatheses

Regular changes of the valency structure of a lexical unit, in the English-language literature usually called alternations, typically allow the speaker to express the same situational meaning (i.e., propositional content characterized by the set of situational participants) in different ways that result in different perspectives from which the situation is viewed. Alternations have already been studied extensively for several decades [12]. In the description of Czech, we follow the classification given by Kettnerová et al. in [7].

Here we focus on diatheses, i.e. specific relations stemming from the changes in the linking of situational participants, valency complementations and surface syntactic positions. Diatheses belong to the group of grammaticalized alternations: they are realized by the use of specific morphological and/or syntactical means, including the grammatical category of voice of the verb and the surface forms of the complementations. They relate different surface syntactic structures of a single lexical unit of a verb. They also belong to the group of conversive alternations: the transformation acts as a permutation on the assignment of valency complementations to surface syntactic positions, typically shifting the Actor away from the prominent subject position and filling it with some other complementation.

2.1 Types of grammatical diatheses in Czech

In this section, we summarize the description of Czech diatheses as given by [17] and [6], and comment on some of the issues that need to be solved and decisions that need to be made for their automatic analysis.

The unmarked member of the diathesis. The unmarked usage is described in the lexical entry in the lexicon. The verb appears in an active form or as an infinitive; the complementations are realized in the forms specified for them in the lexicon entry. All complementations specified in the entry as obligatory are present on the tectogrammatical layer, although some of them may be elided in the surface realization of the sentence (if their value is either clear from the context or general); inner participants⁴ that are not specified in the lexical entry must not appear as arguments of the verb, but free modifications may.

⁴ Inner participants are complementations with one of the functors ACTor, PATient, ADDRessee, EFFect or ORIGin.

Diatheses with past participle.

1. passive diathesis (periphrastic passive)
e.g. Neustále jsem byl někým vyzýván, abych se legitimoval. ‘All the time - I was - someone[Instr] - asked - to show my ID.’ – ‘I kept being asked to show my ID’
The form of the verb in this diathesis consists of the past participle of the main verb + the verb být ‘to be’ (in a finite or non-finite form). The subject slot of the passive construction either remains empty, or it is filled by a complementation which originally filled an object slot (typically that of an Accusative object, but realization through infinitive, clause, genitive, or phrase jako+Acc ‘as something’ is also possible); if the complementation is expressed as a noun phrase, it is turned into the Nominative case. The Actor (which
in the active construction fills the subject slot) may be realized either in the Instrumental case, or as a prepositional phrase od+Gen ‘by/from+Gen’.

2. resultative diathesis with the auxiliary verb být ‘to be’
e.g. Jídlo je uvařeno. ‘The food is cooked.’
This form of resultative diathesis differs from the periphrastic passive only in meaning, not in the surface form or structure. In many cases, it is not clear which of the two possible readings the speaker has in mind. For example, the sentence okno je otevřeno may be interpreted as a case of the resultative diathesis, describing a state, i.e. ‘the window is (already) open’, or as a case of the passive diathesis, describing an event, i.e. ‘the window is (being) opened’. This ambiguity is called event–state homonymy in Czech linguistics. Because it is so common, we assume that the passive diathesis is possible whenever the resultative diathesis is possible and vice versa.
Moreover, Czech also exhibits a competition between past participles and deverbal adjectives. The deverbal adjectives that we have in mind are formed by adding vowel endings to past participles; both the participle and the adjective can then be used to express the resultative meaning, while only the participle can be used to express the passive meaning. On the one hand, this interchangeability of the “short” (participle) and “long” (adjectival) forms is often used as a guideline in determining whether a given sentence should be considered resultative: if the participle can be replaced with the adjective, the resultative interpretation is valid. On the other hand, participle forms are sometimes used in purely adjectival meaning, such as in the sentence stále ještě nebyl najeden ‘he still was not full’, which features the word form najeden (past participle of the reflexive verb najíst se ‘satiate oneself, eat so much that one is full’). If we were to read this as a diathesis, this would have to be a case of periphrastic passive or of the resultative diathesis with auxiliary verb být ‘to be’ formed from the sentence najedl se ‘he ate to be full’. However, it is not possible that the same complementation would fill the subject position in both the active and the passive/resultative diathesis. We have to read this sentence as a sentence with the adjective najedený ‘full, satiated’.

3. possessive resultative diathesis
e.g. Maminka má jídlo uvařeno. ‘The mother has the food cooked.’
In this type of construction, the auxiliary verb mít ‘to have’ is used together with the past participle of the main verb.
Note that the conversive aspect is crucial for our theoretical concept of a diathesis. For example, only one of the two possible readings of the example sentence above is considered to be a diathesis:
Mamince včera jídlo připravila tetička. Maminka má tedy již jídlo uvařeno. ‘The aunt prepared the food for the mother yesterday. Therefore, the mother has the food cooked already.’ This case is considered to be a diathesis, because the Actor of the first sentence (the aunt) has moved away from the subject position (and is not expressed in the resultative variant at all).
Maminka vařila celé dopoledne a nyní již má jídlo uvařeno. ‘Mother has been cooking all morning and now she already has the food cooked.’ This case is not considered to be a diathesis, because the same complementation corresponds to the subject in both cases.
In practice, however, it is often impossible to distinguish between the two readings (Panevová et al. [17] claim that out of 60 cases of a resultative diathesis in the PDT, 23 are ambiguous), and the difference is usually only obvious from the context, so it is inaccessible to the kind of naive, syntax-based automatic methods that we are trying to use. Our automatic method does not differentiate between the two readings.

4. recipient passive diathesis
e.g. Dostal jsem zaplaceno (od šéfa). ‘I got paid (by the boss).’
The most visible characteristics of the recipient diathesis are the auxiliary verb dostat ‘to get’ and the past participle. The original frame must contain a complementation in dative or a benefactor; this complementation becomes the subject of the diathetic construction. The Actor is expressed in the Instrumental case, or as a prepositional phrase od+Gen ‘by’. If there is a semantic patient, it keeps its form and agrees with the participle in gender and case. All these conditions together are fairly specific and therefore allow for a fairly accurate search for corpus concordances.

Diatheses with the reflexive particle se.

5. deagentive diathesis (reflexive passive)
e.g. Vařilo se tu pro emigranty. ‘Cooked - reflexive - here - for - emigrants.’ – ‘It was cooked here for emigrants.’
The only surface marks of this diathesis are a verb in the third person (agreeing with the subject in number and gender, or singular neuter for subjectless sentences) and the free reflexive morpheme se. The Actor is not expressed in this kind of construction at all.
Rules for forming the deagentive diathesis have almost the same conditions as the rules for forming the
passive diathesis (and both types can be applied to almost any frame), and also the ways in which the Patient, Addressee or Effect are moved into the subject position are the same. However, the sets of verbs that allow the two diatheses are different.

6. dispositional diathesis (mediopassive)
e.g. Dobře[adverb] se (mi[Dat]) tu hrál tenis. ‘Well - reflexive - to-me - here - played - tennis.’ – ‘For me, this was a good place to play tennis. I enjoyed playing tennis here.’
A characteristic feature of the dispositional diathesis is an evaluative element, usually an adverb such as dobře ‘well’, pomalu ‘slowly’. The verb form is the same as in the deagentive diathesis, i.e. a third person verb agreeing with the subject (or singular neuter in subjectless sentences) + reflexive particle se. However, the Actor may be expressed on the surface in the dative case.
Although people do not have difficulties distinguishing between the deagentive and the dispositional diathesis, the difference is hard to grasp for an automatic procedure when the Actor in the dative is elided from the sentence (in a sample from the corpus SYN2005 cited in [17], the Actor was only expressed in 22 sentences out of 143). For example, the following sentence is deagentive, although its surface structure is similar to the example of a dispositional diathesis given above:
Odpoledne[adverb] se tu hrál tenis. ‘In the afternoons - reflexive - here - played - tennis.’ – ‘In the afternoon, tennis was played here.’
Moreover, this diathesis is very rare: according to [17], it appears only 8 times in the tectogrammatically annotated part of the PDT. We therefore follow the strategy used by Skoumalová [21, p. 47] and assume that any imperfective verb which can form the deagentive diathesis can also form the dispositional diathesis.

Diatheses with the reflexive particle si.

7. causative diathesis
e.g. Nechal si od Gesy vařit. ‘He let - reflexive - by Gesy - cook.’ – ‘He let Gesy cook for him.’
This verbal form roughly corresponds to the English ‘have something done’. For lexicographic purposes, we view the causative diathesis as a separate sense of the verbs nechat/dát ‘let/give’. One reason for this treatment lies in the fact that two Actors appear in the construction: the Actor of the verb nechat/dát and the Actor of the dependent infinitive.

3 Treatment of diatheses in Vallex and PDT-Vallex

Our proposal is primarily formulated for the purpose of the description of valency in valency lexicons of Czech verbs built within the framework of the Functional Generative Description (FGD). We are working with two lexicons, VALLEX 2.6⁵ (see [13]) and PDT-Vallex 2.0⁶ (see [22]), although this phenomenon has to be addressed in any valency lexicon. We work with the common format developed for the two lexicons by Bejček et al. [1].

⁵ http://ufal.mff.cuni.cz/vallex/2.6/doc/home.html
⁶ This is the version that has been published as part of the Prague Czech-English Dependency Treebank 2.0; it is available from http://ufal.mff.cuni.cz/pcedt2.0/publications/vallex3.xml and can be browsed at http://ufal.mff.cuni.cz/lindat/PDT-Vallex.html.

Both lexicons are divided into two components: a data component and a rule component.

3.1 The data component

The data component consists of word entries corresponding to verb lexemes. A lexeme is an abstract twofold data structure which associates lexical form(s) and lexical unit(s). Lexical forms are all possible manifestations of a lexeme in an utterance (e.g. a lemma or a group of lemmas,⁷ all morphological forms of these lemmas, and their reflexive and irreflexive forms). All lexical forms of a lexeme are represented by its lemma(s).

⁷ Vallex lexemes comprise perfective, imperfective and iterative variants, as well as spelling variants, so that the lexicon covers almost twice as many lemmas as lexemes. On the other hand, there is a one-to-one correspondence between lexemes and lemmas in PDT-Vallex.

In the lexicon, each lexical unit (a sense of a verb) is characterized by a gloss (a verb or a paraphrase roughly synonymous with the given sense) and by example(s) (sentence fragment(s) containing the given verb used in the given sense). The core information on valency characteristics of a lexical unit is encoded as (exactly one) valency frame reflecting the unmarked (active) use of the verb.⁸

⁸ See the Introduction for more details about valency frames.

In an ideal model of the lexicon, information on the possible application of diatheses is stored in each lexical unit in a special attribute -diat. This attribute has not been implemented in either of the lexicons yet, but the attribute -rfl (reflexivity) that is present in Vallex overlaps with the proposed attribute -diat to a certain extent. It lists possible syntactic functions of the reflexive morpheme se/si. The values of the -rfl attribute are pass for reflexive passives in verbs with accusative complements, pass0 for reflexive passives in intransitive verbs, and cor4 and cor3 for cases where se and si fill the position of an object in accusative (cor4) or in dative (cor3), showing that the subject is performing an action on itself. If the verb allows for any of these constructions with se/si, the possibility has been exemplified with made-up examples (the annotators simply converted the examples given for the active diathesis into a passive/reciprocal construction). Many of these
examples do not sound natural. We hope that our methods will provide some more natural corpus examples. Moreover, we intend to cover other diatheses that the current annotation does not cover.

See Table 1 for counts of lexemes, lemmas and lexical units in both lexicons. Both lexicons are available in machine-tractable XML format and also as human-friendly web pages.

Table 1: Counts
Count | Vallex | PDT-Vallex
lexemes⁷ | 2726 | 7103
lexical units (LU) | 6451 | 11932
Lemmas: | |
lemmas (L) | 4789 | 7103
LUs (separated by lemma) | 11229 | 11932
reflexive | 1528 (31.9 %) | 1590 (22.4 %)
nonreflexive, 0 occurrences of past participles (tag “Vs”) | 586 (12.2 %) | 783 (11.0 %)
nonreflexive, some occurrences of Vs, 1 LU | 888 (18.5 %) | 758 (10.7 %)
nonreflexive, 0 occurrences in sentences with “se” tagged as “P7-X4-*” | 125 (2.6 %) | 71 (1.0 %)

3.2 The rule component

The proposed rule component of the lexicon consists of a set of formal syntactic rules determining changes in the mapping of valency complementations onto surface syntactic positions. They make it possible to obtain all possible surface syntactic manifestations of lexical units of verbs (i.e., number of complementations, their types and possible morphological forms).

At present, we use transformational rules formulated for the purposes of the description of diatheses in PDT-Vallex, the lexicon of the Prague Dependency Treebank, see [23].

4 Methodology

Due to the size of the lexicon, it is preferable to minimize the necessary manual work involved in augmenting the lexicon with information about applicable diatheses. Moreover, experience suggests that annotators tend to be positively biased towards assuming the applicability of the diatheses. Also, examples given by annotators tend to be contrived/unnatural. To address these problems we would like to have a semi-automatic method which should, where possible
• automatically decide whether a diathesis is applicable,
• provide natural corpus examples of the diathesis to be included in the lexicon,
and in uncertain cases
• provide corpus evidence on the basis of which the annotators can quickly make the decision.

Below we describe such a method in some detail. The method works by iterating over the frames in three passes. The first pass is a negative pass which filters out lexical units where the diathesis is not applicable due to either grammatical concerns or insufficient corpus evidence. The second pass is a positive pass where lexical units with sufficient evidence for applicability are dealt with. In the final step, corpus evidence is gathered for the remaining unclear lexical units. This evidence is then presented to the annotator for a manual decision. If the second or third phase yields a large number of examples, the automatic method should also order them so that simple, clear examples come first. The method of ordering corpus examples used by [8] is well-suited for our purposes.

Due to the difficulties in distinguishing some of the diatheses mentioned above, the proposed semi-automatic procedure only strives to identify cases of the following diatheses: periphrastic passive, possessive resultative, recipient, and deagentive (reflexive passive).

4.1 Negative pass: excluding frames

In the negative pass we use various methods for excluding inapplicable diatheses. In some cases, we exclude whole lexemes (reflexive verbs and lexemes for which no corpus evidence suggesting a possibility of the diathesis was found); the rule-based exclusion, on the other hand, may exclude some lexical units of a lexeme while others proceed into the next phase.

Reflexives. We assume that none of the diatheses is applicable to a lexeme with a reflexive lemma. These cases include reflexiva tantum (bát se ‘to fear’) and derived reflexives (šířit se ‘to spread (itself)’). This assumption covers 1528 out of 4789 lemmas occurring in Vallex 2.6, and 1590 out of 7116 verb lemmas occurring in PDT-Vallex 2.0. It can be seen from Table 1 that so far this is the most effective step in the negative pass.

We are aware of the fact that this assumption is only approximately valid. According to [9, p. 93], derived reflexives do not form a passive (neither periphrastic nor reflexive), but some reflexiva tantum do; [21, p. 43] is only aware of two reflexive verbs that form a periphrastic passive, the reflexiva tantum tázat se ‘to ask’ and obávat se ‘to fear’, and otherwise assumes that reflexive verbs do not form passives. While [23, p. 124] discusses the limited possibility of forming the reflexive passive of reflexiva tantum, she also gives a (made-up?) example of a stylistically non-neutral sentence smálo se, až se plakalo ‘it was
laughed so much that it was cried’ with reflexive passive of reflexiva tantum.

We have found several other cases where reflexive verbs form a diathesis:

1. To se lehko pamatuje. ‘This is easy to remember. It is easy to remember it.’ (derived from pamatovat si ‘to remember’)
Na to se lehko zvykne. ‘This is easy to get used to. It is easy to get used to it.’ (derived from zvyknout si ‘to get used to’)
Na všechno se zvykne. ‘Everything gets used to. People get used to everything.’ (derived from zvyknout si ‘to get used to’)
This usage is almost idiomatic; the first two examples are cases of the dispositional diathesis, and the third seems to be derived from it. We expect that further research will show that this type of construction is productive even among reflexive verbs.

2. Prezident Václav Havel je lidmi[Instr] nejméně oblíben od té doby, kdy začal prezidentovat. ‘President Václav Havel is by-the-people least liked since he started presidenting.’ – ‘President Václav Havel’s popularity is the least since he became president.’ (derived from oblíbit si ‘get to like’)
The corpus contains many instances of je oblíben ‘is liked’ which can be easily analyzed as cases of the verbo-nominal predicate být oblíben(ý) ‘to be liked’, not as passive. However, this particular sentence also contains the Actor lidmi ‘by the people’ in the Instrumental case, which is typical of a passive construction. One option is to claim that lidmi is a valency complementation of the adjective (oblíben kým ‘to be liked by whom’). The other option is to admit that this is a case of a passive construction, possibly related to the historical existence of the verb oblíbit ‘get to like’ without a reflexive particle (as documented in [5]⁹).

⁹ At http://psjc.ujc.cas.cz/, search for oblíbiti gives 60 instances documented on write-out cards; the relevant entry from the lexicon can be found by searching for oblíbiti si.

3. Zdálo se, že toto úsilí už už začne nést ovoce, bylo vděčně povšimnuto čtenáři. ‘It seemed that the effort will soon bear fruit, it was noticed by the readers.’ (derived from povšimnout si ‘notice’)
Here, the reading as a verbo-nominal predicate seems even less likely than in the previous example.

The possibility of forming passives of reflexive verbs is certainly an interesting area for further research.

Rule-based exclusion. Some of the diatheses require a particular grammatical structure to be applicable. It is therefore possible to exclude frames where this structure is absent. Here we rely on [23] where a machine-readable list of the necessary structures for each diathesis is compiled. The effectiveness of this exclusion depends on the type of diathesis. The diatheses that are formed with the past participle can be applied to almost any structure. This step is a little more useful in the cases where the diathesis is formed using the particle se.

Corpus-based exclusion. We start with a very naive implementation of this step, excluding the applicability of the diatheses for whole lexemes. Applicability of the diatheses formed with the past participle may be ruled out if the past participle is not found in the corpus. Similarly, we may exclude the applicability of the reflexive passive whenever the verb does not appear in the same sentence as the particle se anywhere in the corpus. Table 1 shows that we need to refine these criteria, especially for the exclusion of the reflexive passive.

The mere presence of the se token is not necessarily indicative of the given diathesis. First of all, the se need not be a particle at all, e.g. in the sentence tančil se ženou ‘he danced with (his) wife’, the word se ‘with’ is in fact a preposition. (The morphological tagger used to tag the corpus SYN is accurate enough to overcome this ambiguity.) But even as a particle, se can be part of a different grammatical structure, e.g. in the sentence snažil se tančit ‘he tried reflexive to dance’ the word se belongs to the reflexivum tantum snažit se ‘to try’, not to the verb tančit ‘to dance’. Limiting the search to segments enclosed by punctuation might exclude some genuine examples of diatheses: minule se, pokud si pamatuji, tančilo až do rána ‘the last time reflexive, as far as I remember, danced until morning’ – ‘as far as I remember, the last time dancing continued until morning’; we do not want to take the risk of missing some existing evidence already in this phase. Thus, naive corpus search does not suffice to exclude more than a tiny number of verbs (as can be seen from Table 1): auxiliary methods such as (shallow) parsing or at least clause detection are needed. The Prague Dependency Treebank is too small for the purpose of rejecting the applicability of a diathesis, especially if it is rare such as the possessive resultative. (E.g., there are only about 70 instances of the possessive resultative in the whole of PDT.) Corpus SYN, albeit more adequate in size, is not parsed, so a different, inherently less reliable method must be used. We could, for example, base our decision as to whether the se is connected to the relevant verb or not on their distance in the sentence.

Combination of the rule-based and corpus-based method. The rules allow us to identify frames describing structures in which a given lexical unit may appear in a diathesis. These structures can be turned automatically into patterns for corpus search. In general, no significant conclusions can be drawn from the fact that the resulting search does not produce any results: Czech is a pro-drop language, so even semantically obligatory elements can
be elided in the actual sentence. Only the dispositional diathesis contains an element that is obligatory on the surface, but the range of possible morphemic realizations of this evaluative element needs to be further researched.

4.2 Positive corpus-based pass

In the positive pass we search the corpus for evidence showing the applicability of a given diathesis. In particular, an occurrence of a past participle is indicative of a diathesis (although concerns about the competition between past participles and adjectives need to be addressed). The three kinds of diatheses with past participle forms that we intend to distinguish (periphrastic passive, possessive resultative and recipient passive) moreover differ in their auxiliary verbs. We therefore assume that instances of past participles found in the corpus can be assigned to a diathesis with a fair amount of certainty. The situation with the passive constructions built with the reflexive particle se is more complex, but the techniques developed for the first pass will hopefully help here as well.

The automatic method must be able to assign the evidence found to a particular diathesis and to a particular lexical unit (it does not suffice to know that a verb with many meanings appears in the passive diathesis in the given sentence; we are looking for examples which we can disambiguate). Sometimes, the first pass will give us a single candidate. In other instances, we apply the rules to the remaining frames, derive the description of the full structures corresponding to a diathesis, and then search the corpus for patterns with elements that are unique to only one of the candidates.

4.3 Corpus evidence for manual annotation

Finally, methods similar to those in the second phase will be used, but examples with ambiguous status will be output. We expect that the examples will be automatically assigned to a diathesis with high precision. Thus, for each combination of a lexical unit and a diathesis that remains undecided after the previous pass, the system will be able to provide the annotator with a selection of sentences that could, with high likelihood, be instances of this LU in the given diathesis. The annotator will then either select a couple of examples that demonstrate the applicability of the diathesis, or decide that the diathesis is not applicable to the given LU.

5 Conclusions

We introduced a (semi-)automatic method for identifying lexical units that undergo individual diatheses, and we have discussed some of the difficulties that stand in the way of a fully automatic procedure. We have also shown that the question whether a diathesis is applicable to a lexical unit may be answered in several different ways:

• The least strict measure is the applicability of a rule for forming the given diathesis. This is a necessary, yet not sufficient condition. The rules have been described in detail in [23] and it is known that they heavily overgenerate.

• If corpus evidence is found for the applicability of the diathesis, the amount/reliability of this evidence may be just as important (especially if the decision is not reviewed by an annotator). Even a single corpus occurrence provides evidence that it is possible to form the diathesis, yet (if the verb itself is frequent) it also provides evidence that for some reason, that possibility is not widely used by the users of the language.

• Lack of corpus evidence leads to the exclusion of some LUs that pass the first test. We expect to find cases where no corpus evidence of the applicability of the diathesis will be found, yet an annotator presented with the LU might still feel that it cannot be excluded completely. (This is essentially the same case as we discussed in the previous paragraph, a possibility that is exploited only rarely, only this time for diathesis-verb combinations that did not appear in the corpus.) We believe that in such a case, and if the entry has been reviewed by an annotator, it is best to provide this information to the user of the lexicon.

Acknowledgments

The research reported in this paper was supported by the grant of the Czech Science Foundation GAČR No. P406/12/0557.
The first author was partially supported by the grant SVV-2013-267314.
This work has been using language resources developed and/or stored and/or distributed by the LINDAT-Clarin project of the Ministry of Education of the Czech Republic (project LM2010013).

References

[1] Bejček, E., Kettnerová, V., and Lopatková, M. (2010). Advanced searching in the valency lexicons using PML-TQ search engine. In Sojka, P., Horák, A., Kopeček, I., and Pala, K., editors, Text, Speech and Dialogue. 13th International Conference, volume 6231 of Lecture Notes in Computer Science, pages 51–58, Berlin / Heidelberg. Masarykova univerzita, Springer.

[2] Dębowski, Ł. (2009). Valence extraction using EM selection and co-occurrence matrices. Language resources and evaluation, 43(4):301–327.

[3] Hajič, J. (2006). Complex corpus annotation: The Prague Dependency Treebank. In Šimková, M., editor, Insight into Slovak and Czech Corpus Linguistics, pages 54–73. Veda, Bratislava.
[4] Hajič, J., Panevová, J., Hajičová, E., Sgall, P., Pajas, P., Štěpánek, J., Havelka, J., Mikulová, M., Žabokrtský, Z., and Ševčíková-Razímová, M. (2006). Prague Dependency Treebank 2.0. CD-ROM. LDC Catalog No. LDC2006T01.

[5] Hujer, O., Smetánka, E., Weingart, M., Havránek, B., Šmilauer, V., and Získal, A. (1933–1957). Příruční slovník jazyka českého. Státní nakladatelství, Státní nakladatelství učebnic, Státní pedagogické nakladatelství, Praha.

[6] Kettnerová, V. and Lopatková, M. (2011). The lexicographic representation of Czech diatheses: Rule based approach. In Majchráková, D. and Garabík, R., editors, Natural Language Processing, Multilinguality, pages 89–100, Bratislava, Slovakia. Tribun EU.

[7] Kettnerová, V., Lopatková, M., and Bejček, E. (2012). The syntax-semantics interface of Czech verbs in the valency lexicon. In Fjeld, R. and Torjusen, J., editors, Proceedings of the 15th EURALEX International Congress, pages 434–443, Oslo, Norway. Department of Linguistics and Scandinavian Studies, University of Oslo.

[8] Kilgarriff, A., Husak, M., McAdam, K., Rundell, M., and Rychlý, P. (2008). GDEX: Automatically finding good dictionary examples in a corpus. In Bernal, E. and DeCesaris, J., editors, Proceedings of the 13th EURALEX International Congress, Barcelona, Spain. Institut Universitari de Lingüística Aplicada. Universitat Pompeu Fabra; Documenta Universitaria.

[9] Kopečný, F. (1962). Základy české skladby. Státní pedagogické nakladatelství, Praha, 2. edition.

[10] Korhonen, A. (2002). Subcategorization Acquisition. PhD thesis, University of Cambridge.

[11] Lapata, M. (1999). Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pages 397–404. Association for Computational Linguistics.

[12] Levin, B. C. (1993). English Verb Classes and Alternations: A Preliminary Investigation. The University of Chicago Press, Chicago and London.

[13] Lopatková, M., Žabokrtský, Z., and Kettnerová, V. (2008). Valenční slovník českých sloves. Karolinum, Praha.

[14] McCarthy, D. (2001). Lexical acquisition at the syntax-semantics interface: diathesis alternations, subcategorization frames and selectional preferences. PhD thesis, University of Sussex.

[15] McCarthy, D. and Korhonen, A. (1998). Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 2, ACL ’98, pages 1493–1495, Stroudsburg, PA, USA. Association for Computational Linguistics.

[16] Panevová, J. (1994). Valency frames and the meaning of the sentence. In Luelsdorff, P. A., editor, The Prague School of Structural and Functional Linguistics, pages 223–243. John Benjamins Publishing Company, Amsterdam, Philadelphia.

[17] Panevová, J. et al. (manuscript). Syntax současné češtiny (na základě anotovaného korpusu). Nakladatelství Karolinum, Praha.

[18] Sarkar, A. and Zeman, D. (2000). Automatic extraction of subcategorization frames for Czech. In Proceedings of the 18th International Conference on Computational Linguistics, COLING 2000, volume 2, pages 691–697, Stroudsburg, PA, USA. Association for Computational Linguistics.

[19] Schulte im Walde, S. (2000). Clustering verbs semantically according to their alternation behaviour. In Proceedings of the 18th conference on Computational linguistics - Volume 2, COLING ’00, pages 747–753, Stroudsburg, PA, USA. Association for Computational Linguistics.

[20] Sgall, P., Bémová, A., Borota, J., Hajičová, E., Hajičová, I., Jirků, P., Panevová, J., Piťha, P., Plátek, M., and Vrbová, J. (1986). Úvod do syntaxe a sémantiky. Academia.

[21] Skoumalová, H. (2001). Czech Syntactic Lexicon. PhD thesis, Charles University in Prague.

[22] Urešová, Z. (2011a). Valenční slovník Pražského závislostního korpusu (PDT-Vallex). Studies in Computational and Theoretical Linguistics. Ústav formální a aplikované lingvistiky, Praha, Czech Republic.

[23] Urešová, Z. (2011b). Valence sloves v Pražském závislostním korpusu. Studies in Computational and Theoretical Linguistics. Ústav formální a aplikované lingvistiky, Praha, Czech Republic.
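
Illustrative sketch for Section 4.2. The positive pass relies on the observation that the three participial diatheses differ in their auxiliary verbs. The following Python fragment is a minimal sketch of that idea, not the authors' implementation. Several things in it are assumptions that do not come from the paper: the mapping of auxiliary lemmas (být, mít, dostat) to diatheses follows common descriptions of Czech, the `Token` structure and its `is_participle` flag are hypothetical stand-ins for a morphologically tagged corpus, and the fixed search window replaces syntactic parsing with a simple distance heuristic, similar in spirit to the distance-based decision mentioned for the reflexive se.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed mapping from auxiliary lemma to diathesis (not stated in the paper).
AUX_TO_DIATHESIS = {
    "být": "periphrastic passive",
    "mít": "possessive resultative",
    "dostat": "recipient passive",
}


@dataclass
class Token:
    form: str
    lemma: str
    # Hypothetical flag; in practice it would be derived from the corpus tag.
    is_participle: bool = False


def classify_diathesis(tokens: List[Token], verb_lemma: str,
                       window: int = 3) -> Optional[str]:
    """Guess the diathesis of the first participle of `verb_lemma`.

    The nearest known auxiliary within `window` tokens decides the label;
    None is returned when no participle or no auxiliary is found.
    """
    for i, tok in enumerate(tokens):
        if not (tok.is_participle and tok.lemma == verb_lemma):
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        # Collect (distance, lemma) pairs for auxiliaries near the participle.
        candidates = [
            (abs(j - i), tokens[j].lemma)
            for j in range(start, end)
            if j != i and tokens[j].lemma in AUX_TO_DIATHESIS
        ]
        if candidates:
            nearest_lemma = min(candidates)[1]
            return AUX_TO_DIATHESIS[nearest_lemma]
    return None


# Toy usage: "Oběd je uvařen" ('The lunch is cooked').
sentence = [
    Token("Oběd", "oběd"),
    Token("je", "být"),
    Token("uvařen", "uvařit", is_participle=True),
]
label = classify_diathesis(sentence, "uvařit")
```

A sentence without a nearby auxiliary yields `None`, which corresponds to the undecided cases that Section 4.3 passes on to the annotator. The sketch does not address the competition between participles and adjectives or the assignment to a particular lexical unit.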