=Paper= {{Paper |id=Vol-1779/02bejeck |storemode=property |title=Extracting Verbal Multiword Data from Rich Treebank Annotation |pdfUrl=https://ceur-ws.org/Vol-1779/02bejeck.pdf |volume=Vol-1779 |authors=Eduard Bejčeck,Jan Hajič,Pavel Stranak,Zdenka Uresova |dblpUrl=https://dblp.org/rec/conf/tlt/BejcekHSU17 }} ==Extracting Verbal Multiword Data from Rich Treebank Annotation== https://ceur-ws.org/Vol-1779/02bejeck.pdf
                Extracting Verbal Multiword Data
                 from Rich Treebank Annotation
          Eduard Bejček, Jan Hajič, Pavel Straňák and Zdeňka Urešová

 Charles University in Prague, Faculty of Mathematics and Physics, ÚFAL
  {bejcek,hajic,stranak,uresova}@ufal.mff.cuni.cz


                                         Abstract
        The PARSEME Shared Task on automatic identification of verbal multiword
        expressions aims at identifying such expressions in running text. A typology
        of verbal multiword expressions, very detailed annotation guidelines and
        gold-standard data for as many languages as possible will be provided. Since
        the Prague Dependency Treebank includes Czech multiword expression
        annotation, it was natural to attempt to convert the data automatically
        into the Shared Task format. However, since the Czech treebank predates the
        Shared Task annotation guidelines, a prior examination was necessary to
        determine to what extent the conversion can be fully automatic and how much
        manual work remains.
            In this paper, we show that information contained in the Prague Depen-
        dency Treebank is sufficient to extract all of the Shared Task categories of
        verbal multiword expressions relevant for Czech, even if these categories are
        originally annotated differently; nevertheless, some manual checking and an-
        notation would still be necessary, e.g. for distinguishing borderline cases.


1       Motivation
The goal of the PARSEME [11] Shared Task (PST)1 is to develop automatic de-
tection of verbal multiword expressions (VMWEs) for a wide range of languages
from different language families. It includes data preparation for the task par-
ticipants, based on annotation guidelines that were tested on real data for almost
twenty languages [16].2 The training and testing data for the PST (3,500 instances
per language) are being annotated; while manual annotation is necessary for many
languages, reusing existing annotated data is preferred whenever possible.
     This preference led us to explore the Prague Dependency Treebank (PDT,
[1, 4]), which includes quite a rich annotation of MWEs.3 However, the
annotation of the PDT preceded the PARSEME typology of VMWEs and thus it is
understandable that the information encoded there is not straightforwardly
transformable into the PST categories and format. Nevertheless, we hoped that the PDT
annotation did contain all the necessary information. If confirmed, it would prove
that the original scheme of rich annotation was well conceived, and in particular,
that the MWE annotation in PDT in fact followed the principles recommended in
[10].

    1
     http://multiword.sourceforge.net/sharedtask2017
    2
     Also at http://parsemefr.lif.univ-mrs.fr/guidelines-hypertext.
    3
     Some VMWE categories were annotated during the creation of the original PDT 2.0, others
were annotated particularly for PDT 2.5; PDT 3.0 contains all of them.


2    Introduction
We believe that for the Czech language, annotation of VMWEs already encoded
in the data of the Prague Dependency Treebank 3.0 (PDT) [1] presents suitable
material for the PST and satisfies the task needs in both (i) the amount of annotated
data and (ii) the types of VMWEs that correspond to the types proposed in the PST.
    The PARSEME Shared Task identifies six groups of VMWEs: light verb con-
structions (LVC), idioms (ID), verb particle combinations (VPC), inherently reflex-
ive verbs (IReflV), language specific types and other verbal MWEs (OTH).
    All the various types of VMWEs required by the PST are annotated in quite
a number of diverse ways in the PDT and the information is spread across several
layers of annotation. Thus we first had to relate the PDT annotation to the PST
guidelines in order to confirm that the PDT data can be reused for the Shared Task;
only then could the extraction of all types of VMWEs (relevant for Czech) and their
conversion into the PST format take place.
    At the same time (or even more importantly), we were testing the following
four principles for good-quality MWE treebank design published in [10], which
are based on a survey of as many as 23 different treebanks (dependency-based,
constituency-based, HPSG, LFG, mixed):
Principle A: to annotate MWEs as such,
Principle B: to mark MWEs in a distinctive and specific way,
Principle C: to annotate even discontinuous MWEs and MWEs of varying forms,
Principle D: to allow for searching MWEs by their type.
    After a thorough analysis of the PDT we have concluded that Principles A and
B are clearly fulfilled in the PDT due to its explicit MWE annotation. Principle
C is also followed thanks to the explicit links between the PDT’s annotation layers.
Principle D is, from the PST point of view, followed only partially, since the
respective typologies do not match one-to-one.
    A thorough inspection of the PDT annotation scheme resulted in an automatic
conversion procedure with rules formulated for each of the PST types. Manual
checks and some manual annotation are still necessary, even if only for a
fraction of the data.




3        Conversion of Czech data
As already explained, the creation of the Czech language data for the PST takes
advantage of the existing rich annotation of the PDT, including explicit annotation
of VMWEs.
     The treatment of verbal idioms (part of the ID category) and LVCs in the PDT
is related to valency, as the valency formalism allows for morphological, syntactic
and semantic description of VMWEs in the treebank [2, 3, 13]. These VMWEs
are recorded in the related valency lexicon, PDT-Vallex [14], as specific “senses”
of the base lemma. For the annotation of verb-noun idiomatic combinations and
some other types of MWEs in the PDT-style treebanks and in the associated valency
lexicons, see [9, 15]. PDT-Vallex was already available with the original
PDT 2.0 treebank [4]. Afterwards, explicit general annotation of MWEs, including
verbal phrases which now correspond to the ID, LVC and OTH categories, was
carried out (see [12]). The MWE annotation became part of later PDT releases,
including the most recent, PDT 3.0 [1].4 Reflexive verbs (IReflV) are treated as
“words with spaces” on the deep syntactic annotation layer, with the particle being
part of such words.
[Figure 1, image not reproduced. Left: Prague Dependency Treebank 3.0 annotation
(deep syntactic tree rooted in “dostat_se”, PRED). Right: PARSEME Shared Task
format, with the legend of types ID, IReflV, LSpec, LVC, OTH, VPC.]

PST format of the example (token, “nsp” flag, VMWE codes):

    Nevidomý
    se                       1:IReflV;2:LVC
    dostane                  1;2
    do                       2
    styku                    2
    s
    rehabilitačními
    pracovníky        nsp
    ,
    když
    utrpí                    3:ID
    zranění           nsp    3
    .

            Nevidomý se dostane do styku s rehabilitačními pracovníky, když utrpí zranění.
            Blind gets into contact with rehabilitation workers, when sustains injury.
            A blind man gets in touch with physiatrists when he sustains an injury.


Figure 1: Extracting VMWE information: PDT annotation on the left, PST format
with three VMWEs identified on the right (numbers distinguish VMWE occurrences,
“nsp” stands for “no space after”, colours are ours). Only four types are
relevant for Czech: neither VPC nor any language-specific type is used.
     4
         http://hdl.handle.net/11858/00-097C-0000-0023-1AAF-3




    To sum up, different PST types of VMWEs are obtained from various informa-
tion sources available at the different layers of annotation in the PDT. See Figure 1
for an illustration of three of them (the annotation view is simplified only to cover
MWE-related phenomena); an annotation of the non-verbal MWE “rehabilitační
pracovník” (rehabilitation worker) which is not being converted for PST is also
shown.
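    The VMWE column on the right of Figure 1 can be produced along the following lines. This is only a minimal sketch: the function name, the 0-based token indices and the use of an empty string for unannotated tokens are our assumptions for illustration, not part of the PST specification.

```python
def vmwe_codes(n_tokens, vmwes):
    """Return one VMWE-code string per token (hypothetical helper).

    vmwes is a list of (type, token_indices) pairs with 0-based indices.
    """
    columns = [[] for _ in range(n_tokens)]
    for number, (vmwe_type, indices) in enumerate(vmwes, start=1):
        for position, index in enumerate(sorted(indices)):
            # The first token of a VMWE carries "number:type",
            # the remaining tokens carry only the number.
            code = f"{number}:{vmwe_type}" if position == 0 else str(number)
            columns[index].append(code)
    # Codes of overlapping VMWEs on one token are joined by ";".
    return [";".join(codes) for codes in columns]


# The sentence and the three VMWEs from Figure 1.
tokens = ["Nevidomý", "se", "dostane", "do", "styku", "s", "rehabilitačními",
          "pracovníky", ",", "když", "utrpí", "zranění", "."]
vmwes = [("IReflV", [1, 2]), ("LVC", [1, 2, 3, 4]), ("ID", [10, 11])]
codes = vmwe_codes(len(tokens), vmwes)
for token, code in zip(tokens, codes):
    print(f"{token}\t{code}")
```

Numbering the VMWEs in their order of appearance keeps the codes of overlapping expressions (here the IReflV inside the LVC) distinguishable on shared tokens.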
    In this section, we describe the PDT-style annotation of the proposed six types
of VMWEs recognized in the PST as well as their conversion into the common PST
format (Sections 3.1–3.6). Two special aspects are discussed, namely deverbative
variants (Section 3.7) and cases of overlapping annotation (Section 3.8).

3.1        Light Verb Constructions
In the PDT annotation, LVCs consist of two lexical units: a semantically empty
(or “light”) verb and a noun carrying the main lexical meaning of the entire phrase.
The nominal part of the LVCs is labeled by the CPHR functor (Compound PHRase).
For example: to comePRED into forceCPHR , to undertakePRED preparationsCPHR .
LVCs are identified as depicted in Figure 2.

[Figure 2, image not reproduced. Panels: 1. Input text; 2. PDT t-layer (verb
“vstoupit”, PRED, governing “platnost”, CPHR); 3. PDT a-layer (vstoupil, Pred;
v, AuxP; platnost, Obj); 4. Output annotation (the LVC “vstoupil v platnost”,
has come into force).]

Zákon tak vstoupil v platnost.
Law so came into force.
By that the law has come into force.

Figure 2: Identifying an LVC containing a preposition using two layers of PDT
annotation. Deep syntactic layer provides a CPHR node and its governing verb
(step 2). The preposition, in this case a part of the LVC, is represented by a node
between the light verb node and the predicative noun node in the surface syntactic
tree (step 3). The preposition node is (also) referenced from the CPHR node.

    Three more things have to be taken into account:
   1. Prepositions, if they are part of the LVC, must be retrieved from the surface
       syntactic layer, since they are not present on the deep layer. If there is any
       extra node between a node for a predicate and a node for a CPHR, it is part
       of the LVC.
   2. If reflexive particles are part of the verb lemma (see IReflV in Section 3.4),
       they also have to become part of the LVC.
   3. The CPHR functor is also used for a specific type of phrases with the verb
       “to be” (it is necessaryCPHR to leave). These phrases, not assumed by the
       PST guidelines, are excluded by checking the lemma of the verb.
    There are 2496 LVCs in the PDT extracted by the above rules. Minor details
aside, LVCs as defined for the PST can be identified on the basis of the existing
PDT annotation without any additional manual annotation.
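
    The three rules above can be sketched as follows. The node classes and their field names are hypothetical simplifications of the PDT data model, introduced only for this illustration; the sketch is not the conversion script itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ANode:
    """Surface-syntactic node (hypothetical, simplified)."""
    form: str
    order: int                      # word position in the sentence
    parent: Optional[str] = None    # id of the governing a-node


@dataclass
class TNode:
    """Deep-syntactic node (hypothetical, simplified)."""
    lemma: str
    functor: str
    lex: str                                      # id of the a-node with the lexical word
    aux: List[str] = field(default_factory=list)  # ids of auxiliary a-nodes
    parent: Optional["TNode"] = None


def extract_lvcs(t_nodes: List[TNode], a_nodes: Dict[str, ANode]) -> List[List[str]]:
    """Return the word forms of every LVC found via a CPHR node."""
    lvcs = []
    for noun in t_nodes:
        verb = noun.parent
        if noun.functor != "CPHR" or verb is None:
            continue
        if verb.lemma == "být":  # rule 3: skip phrases with the verb "to be"
            continue
        members = {verb.lex, noun.lex}
        # Rule 1: a-nodes on the path from the noun up to the verb
        # (e.g. a preposition) belong to the LVC.
        path, current = [], a_nodes[noun.lex].parent
        while current is not None and current != verb.lex:
            path.append(current)
            current = a_nodes[current].parent
        if current == verb.lex:
            members.update(path)
        # Rule 2: a reflexive particle that is part of the verb lemma.
        if verb.lemma.endswith(("_se", "_si")):
            members.update(a for a in verb.aux
                           if a_nodes[a].form.lower() in ("se", "si"))
        ordered = sorted(members, key=lambda a: a_nodes[a].order)
        lvcs.append([a_nodes[a].form for a in ordered])
    return lvcs


# The sentence from Figure 2: "Zákon tak vstoupil v platnost."
a_nodes = {
    "a1": ANode("Zákon", 1, "a3"),
    "a2": ANode("tak", 2, "a3"),
    "a3": ANode("vstoupil", 3),
    "a4": ANode("v", 4, "a3"),
    "a5": ANode("platnost", 5, "a4"),
}
verb = TNode("vstoupit", "PRED", lex="a3")
noun = TNode("platnost", "CPHR", lex="a5", parent=verb)
print(extract_lvcs([verb, noun], a_nodes))
```

The path is added only if walking up the surface tree actually reaches the verb, so unrelated nodes are not picked up when the two layers disagree.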




3.2      Verbal Idioms
These VMWEs, denoted as ID in the PST guidelines, form quite a large group
that contains more than just traditional idioms. We have to process it in two steps.
    Some of the VMWEs defined as IDs, namely those which are quite fixed idioms,
are understood similarly in the PDT and in the guidelines for the Shared Task, e.g.:
“házet klacky pod nohy” lit. to-throw sticks under feet (= to put obstacles in one’s
way), “brát vítr z plachet” (= to take the wind out of someone’s sails). These
verbal idioms (similarly to LVCs) always consist of two nodes in the PDT: the
governing verb and the dependent node (with the DPHR functor = Dependent
part of PHRaseme). These idioms can thus be easily extracted by looking for the
DPHR functor. The DPHR node represents all other lexical components of the
idiom, should there be more than one (the lemma on the deep syntactic layer is e.g.
“klacky_pod_nohy” or “vítr_z_plachet”), since these are quite fixed expressions in
terms of (the impossibility of) insertion or other modification. Even prepositions
are part of it, so their detection is even easier than with a CPHR. See an example
in Figure 3.

[Figure 3, image not reproduced. Panels: 1. Input text; 2. PDT t-layer (verb
“dát”, PRED, governing odezva, ACT; #Neg, RHEM; na_sebe_čekat, DPHR);
3. Output annotation (the ID “na sebe nedala čekat”, didn't keep us waiting).]

Odezva na sebe nedala čekat.
Reaction on itself not-gave wait.
The reaction didn't keep us waiting.



Figure 3: Identifying a four-word ID using a DPHR node and its governing node.
(The VMWE here is negated: “na sebe nedala čekat” instead of canonical “na sebe
dala čekat”. It does not interfere with the extraction process, since negation is
annotated separately; thus the two phrases themselves look the same.)

     The other group of VMWEs categorized as ID in the PST is not so fixed. VMWEs
from this group do not fulfill the criteria for DPHR annotation in the PDT, but they
still qualify as an ID in the PST. They have been annotated together with all
other MWEs in PDT 3.0 [12]. The problem is that they are marked neither as idioms
nor even as verbal expressions. Moreover, they are recorded on the deep syntactic
layer as a set of nodes (i.e. content words), neglecting auxiliary words.
     Our approach finds a head in the syntactic tree of such a set. If it is a verb, the
MWE is a verbal one (Figure 4). Then other auxiliary nodes (e.g., prepositions)
referred to by the annotated content words are added. (The exception is a conjunc-
tion introducing the whole phrase: it does not belong to the VMWE.) The resulting
VMWE gets the ID mark, unless it overlaps with CPHR or DPHR annotation (see
Section 3.8).
     We have identified 2107 IDs using either the PDT 3.0 MWE or DPHR annota-
tion.
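
     The head-finding step for PDT 3.0 MWEs can be sketched as below. The dictionary keys ("parent", "pos", "word", "aux"), the node ids and the word triples are hypothetical names chosen for this illustration; we also assume here that the subordinating conjunction “když” is referenced from the verb node as an auxiliary word.

```python
def verbal_mwe_tokens(node_ids, t_nodes):
    """Return the surface tokens of a MWE if its head is a verb, else None.

    t_nodes maps a node id to a dict with the hypothetical keys "parent",
    "pos", "word" and "aux"; words are (position, form, kind) triples.
    """
    node_ids = set(node_ids)
    # The head is the node whose parent lies outside the annotated set.
    heads = [n for n in node_ids if t_nodes[n]["parent"] not in node_ids]
    if len(heads) != 1 or t_nodes[heads[0]]["pos"] != "V":
        return None  # not a single-headed verbal MWE
    words = []
    for n in node_ids:
        words.append(t_nodes[n]["word"])
        words.extend(t_nodes[n]["aux"])  # auxiliary words, e.g. prepositions
    words.sort()
    # A conjunction introducing the whole phrase does not belong to the VMWE.
    while words and words[0][2] == "conj":
        words.pop(0)
    return [form for _, form, _ in words]


# Part of the sentence from Figure 1 (positions are 1-based word positions).
t_nodes = {
    "t_utrpet": {"parent": "t_dostat", "pos": "V",
                 "word": (11, "utrpí", "verb"),
                 "aux": [(10, "když", "conj")]},
    "t_zraneni": {"parent": "t_utrpet", "pos": "N",
                  "word": (12, "zranění", "noun"), "aux": []},
    "t_pracovnik": {"parent": "t_styk", "pos": "N",
                    "word": (8, "pracovníky", "noun"),
                    "aux": [(6, "s", "prep")]},
    "t_rehab": {"parent": "t_pracovnik", "pos": "A",
                "word": (7, "rehabilitačními", "adj"), "aux": []},
}
print(verbal_mwe_tokens({"t_utrpet", "t_zraneni"}, t_nodes))
print(verbal_mwe_tokens({"t_pracovnik", "t_rehab"}, t_nodes))
```

The second call returns None because the head of “rehabilitační pracovník” is a noun, so the MWE is not converted for the PST.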




[Figure 4, image not reproduced. Panels: 1. Input text; 2. PDT t-layer (the MWE
lexeme is recorded in the tree root and linked to the verb “kroutit”, PRED, and
its dependent “hlava”, PAT; other nodes: #PersPron, ACT; nevěřícný, MANN;
legislativa, REG); 3. Output annotation (the ID “kroutím hlavou”, shaking my head).]

Nevěřícně kroutím hlavou nad legislativou.
Disbelievingly I-shake head over legislation.
I am shaking my head in disbelief on the legislation.


Figure 4: Identifying an ID using PDT 3.0 annotation. Such an MWE is visualized
here as a hatched area and is in fact recorded in the tree root with links to the
appropriate nodes (red arrows). The dependent node here is marked neither as DPHR
nor as CPHR but as a regular PAT; it is, however, part of the PDT 3.0-annotated MWE.


3.3        Verb-particle Combinations
Verb-particle combinations (VPC) are not present in Czech. A phenomenon similar
to VPCs is in Czech realized by verbal prefixes (the result being another single
lexical unit, i.e., not a MWE).

3.4        Inherently Reflexive Verbs
Inherently Reflexive Verbs (IReflV) contain one of two possible clitics in Czech:
“se” or “si”, e.g. “bát se” (= to be afraid), “hledět si” (= to mind sth). Such a verb
is considered a separate lexical unit (different from the verb appearing without the
particle, if such a verb exists at all). Both its parts are represented by just one node
at the deep syntactic layer of the PDT, and the node’s lemma matches the PDT-Vallex
lexical unit, which includes the appropriate particle as part of the headword
in the lexicon. This annotation was used for exactly the two types qualified as
IReflV in the PST guidelines, namely, for the cases where the non-reflexive counterpart
verb does not exist or where its meaning is markedly changed. Using this
annotation, all IReflVs can be extracted from the PDT texts and converted, see
Figure 5.

[Figure 5, image not reproduced. Panels: 1. Input text; 2. PDT t-layer (týkat_se,
PRED, governing opatření, ACT; zejména, RHEM; domovník, PAT); 3. PDT a-layer
(týká, Pred, governing Opatření, Sb; se, AuxT; domovníků, Obj; zejména, AuxZ);
4. Output annotation (the IReflV “se týká”, involves).]

Opatření se týká zejména domovníků.
The measure involves chiefly housekeepers.



Figure 5: Identifying an IReflV using a lemma on deep syntactic layer (step 2)
together with an analytical function AuxT on a surface syntactic layer (step 3).

    It should also be possible to extract IReflVs without the deep syntactic layer:
the analytical function of an IReflV reflexive particle should be either AuxT or
AuxO on the surface syntactic layer; other values (AuxR, Obj, or Adv) are reserved
for reflexive particles used in other than IReflV contexts, e.g. in passive
constructions. Suspicious cases (705 verb occurrences) in which the information from
the two layers of annotation clashes were detected by looking for a discrepancy
between the lemma and the corresponding analytical function; they were manually
checked and corrected when necessary (330 cases). There are some borderline
cases where the PDT annotation differs from the PST guidelines; however, these
are mainly errors in annotation and not a true difference between the PST and PDT
guidelines.
    By this approach, 10,266 VMWEs of the IReflV type were extracted from
the PDT. The conversion was automatic except for the 705 manually checked oc-
currences.
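
    The cross-layer check can be illustrated with a short sketch. The function name and the three result labels are ours, and testing the lemma suffix “_se”/“_si” is an assumption based on lemmas such as “týkat_se” in Figure 5; the actual detection procedure may differ in detail.

```python
IREFLV_AFUNS = {"AuxT", "AuxO"}        # particle belongs to an IReflV
OTHER_AFUNS = {"AuxR", "Obj", "Adv"}   # particle used in another function


def check_reflexive(t_lemma, particle_afun):
    """Compare the deep-layer lemma with the surface-layer analytical function.

    Returns "IReflV" or "other" when both layers agree, and "check" when
    they clash and the occurrence needs manual inspection.
    """
    if particle_afun not in IREFLV_AFUNS | OTHER_AFUNS:
        raise ValueError(f"unexpected analytical function: {particle_afun}")
    lemma_is_reflexive = t_lemma.endswith(("_se", "_si"))
    afun_is_ireflv = particle_afun in IREFLV_AFUNS
    if lemma_is_reflexive and afun_is_ireflv:
        return "IReflV"
    if not lemma_is_reflexive and not afun_is_ireflv:
        return "other"
    return "check"


print(check_reflexive("týkat_se", "AuxT"))  # both layers agree
print(check_reflexive("týkat_se", "AuxR"))  # layers clash
```

Only the occurrences labelled "check" would have to be passed on to a human annotator.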

3.5         Others
This category (OTH) is specified in the PST guidelines as a VMWE that does not
fit into any of the other categories, as described in the previous sections. Namely,
it applies to “coordinations of verbs, e.g. to drink and drive, and compound verbs,
e.g. to short-circuit, to pretty-print, to voice act”. The second subtype usually re-
sults in a one-word expression in Czech, so we need to search only for coordinated
verbs.
     For this category, the PDT 3.0 MWE annotation [12] is useful again. All
MWEs containing two verbs connected by a coordinating conjunction are marked
as OTH, see Figure 6. This is a very marginal category; we have found only two
OTHs in the data.
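
     The pattern “two coordinated verbs” can be tested on the surface syntactic tree roughly as follows. The dictionary layout, the part-of-speech codes and the analytical function given to the two verbs are hypothetical and serve only this illustration; we assume that the coordinating conjunction carries the analytical function Coord and governs both verbs.

```python
def is_coordinated_verb_mwe(mwe_ids, a_nodes):
    """True if the MWE contains exactly two verbs governed by one Coord node."""
    verbs = [i for i in mwe_ids if a_nodes[i]["pos"] == "V"]
    if len(verbs) != 2:
        return False
    first_parent, second_parent = (a_nodes[v]["parent"] for v in verbs)
    return (first_parent is not None
            and first_parent == second_parent
            and a_nodes[first_parent]["afun"] == "Coord")


# The coordinated verbs from Figure 6 (nodes are keyed by word form for brevity).
a_nodes = {
    "sluší": {"pos": "V", "afun": "Adv", "parent": "a"},
    "a": {"pos": "J", "afun": "Coord", "parent": "jak"},
    "patří": {"pos": "V", "afun": "Adv", "parent": "a"},
    "kroutím": {"pos": "V", "afun": "Pred", "parent": None},
    "hlavou": {"pos": "N", "afun": "Obj", "parent": "kroutím"},
}
print(is_coordinated_verb_mwe(["sluší", "a", "patří"], a_nodes))
print(is_coordinated_verb_mwe(["kroutím", "hlavou"], a_nodes))
```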

[Figure 6, image not reproduced. Panels: 1. Input text; 2. PDT t-layer (být, PRED,
governing doktorand, ACT; student, PAT; the conjunction “a”, CONJ, coordinating
slušet_se, RSTR, and patřit_se, RSTR, with #Gen, ACT); 3. PDT a-layer (pattern:
a conjunction node governing two verb nodes); 4. Output annotation (the OTH
expression, glossed “as he should be”).]

Doktorand je studentem, jak se sluší a patří.
PhD-student is student, as suits and befits.
A PhD student is a student, as he should be.



Figure 6: Identifying an OTH by the pattern “two coordinated verbs” (step 3). (Both
coordinated verbs also qualify as IReflVs, which is not shown in the figure.)



3.6         Language Specific Category
No language specific categories are defined for Czech.

3.7         Deverbative variants
PARSEME Shared Task guidelines also recognize other, non-verbal variants of verbal
MWEs, such as relative clauses (heart which he broke), gerunds (heart breaking),
nominal groups (heart-breaking), or adjectival groups (breaking her heart).
In Czech, nominalization is a common way of verbal MWE variation, see [6, 7, 8].
     There is no nominal group annotated as CPHR in the PDT and thus no LVC
variant. There are several nominal MWEs annotated as DPHR, but only seven of
them are made from verbal MWEs. We have picked them manually. During the
PDT 3.0 MWE annotation project [12], annotators were asked to mark deverbative
variants with the verbal lexicon entry. This annotation, although it is not frequent,
is also used.
     The situation is quite different for IReflV, where many non-verbal lemmas
also contain the reflexive particles “se” or “si”. These cases qualify as
nominal or adverbial variants of inherently reflexive verbs.
     To sum up, there are deverbative MWEs in the PST Czech data, but they
are not frequent.
     We are also preparing other deverbative MWEs using data by an idiom recog-
nizer based on a database, upgraded for deverbatives by Milena Hnátková [5].

3.8     Overlaps
Since the data for the PST are extracted from various pieces of annotation, it can easily
happen that they are duplicated or that they overlap. All these cases have to be resolved
properly, as described below.

3.8.1    Coordination
Part of a VMWE can be coordinated while the other part is used only once, as
in “Ministerstvo poskytuje malým podnikatelům informační služby a poradenskou
činnost.” (The ministry provides information services and counselling activities to
small businesses.), where two LVCs are present: to provide services and to provide
activities. Such a case is correct and both VMWEs should be preserved and marked
in the output data, with the verb “provide” being part of both.

3.8.2    Duplicates due to added nodes in the PDT
Since a large part of the MWE annotation in the PDT is encoded at the deep syntactic
layer, sometimes a VMWE is found that has no direct realization in the surface
form of the sentence, although it is present in its deep structure. For example, The
measure can be taken for six months at most and only for selected items., which
in fact means The measure can be taken for six months at most and the measure
can be taken only for selected items. In the PDT, two light verb constructions are
annotated and both of them are linked to the same words. This would result in duplicate
annotation of the words “measure”, “be” and “taken” in the sentence. Such
duplicates are detected and removed before the data are exported.
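
Duplicate removal of this kind can be sketched in a few lines. The representation of a VMWE as a pair of its type and the positions of its words is an assumption made for this illustration.

```python
def remove_duplicates(vmwes):
    """Keep only the first VMWE for each (type, set of word positions)."""
    seen = set()
    unique = []
    for vmwe_type, positions in vmwes:
        key = (vmwe_type, frozenset(positions))
        if key not in seen:
            seen.add(key)
            unique.append((vmwe_type, positions))
    return unique


# Two LVCs linked to the same words ("measure", "be", "taken") collapse into one.
duplicated = [("LVC", (2, 4, 5)), ("LVC", (2, 4, 5))]
print(remove_duplicates(duplicated))

# Coordinated LVCs sharing only the verb (Section 3.8.1) are both preserved.
coordinated = [("LVC", (2, 6)), ("LVC", (2, 9))]
print(remove_duplicates(coordinated))
```

Comparing whole sets of word positions, rather than single shared words, is what keeps the coordination case of Section 3.8.1 intact.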




3.8.3    Overlapping different types of VMWEs
As described previously, we combine explicit idiomatic annotation (DPHR), explicit
light verb annotation (CPHR) and the verbal MWE annotation from PDT 3.0.
If they overlap, the type of the MWE (ID or LVC) is always determined by the
explicit DPHR/CPHR annotation. If only the PDT 3.0 MWE annotation is present,
the MWE always gets the ID type as the most probable case; however, this could be
checked manually in the future.
     Whenever an IReflV overlaps with any other, usually larger MWE, both are correct
and should remain in the output. Other overlaps of different types of VMWEs
are not possible due to the source data we work with.
     It is yet to be determined what to do with cases where an ID from the DPHR
annotation and an ID from the PDT 3.0 MWE annotation overlap but cover different
word ranges.
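
The type decision for overlapping annotation can be summarized by the following sketch. The source labels "CPHR", "DPHR" and "MWE" and the function name are ours; the open case of IDs with different word ranges is deliberately not handled.

```python
def resolve_type(sources):
    """Decide the PST type from the set of PDT annotation sources of one expression."""
    sources = set(sources)
    if {"CPHR", "DPHR"} <= sources:
        # Not expected in the source data, so we refuse to guess.
        raise ValueError("CPHR and DPHR on the same expression")
    if "CPHR" in sources:
        return "LVC"   # explicit light verb annotation wins
    if "DPHR" in sources:
        return "ID"    # explicit idiomatic annotation wins
    if "MWE" in sources:
        return "ID"    # PDT 3.0 MWE annotation alone: most probable type
    raise ValueError("no VMWE annotation source given")


print(resolve_type({"MWE", "CPHR"}))
print(resolve_type({"MWE"}))
```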

3.9     Results
After removing the overlaps, there are over 14,000 verbal multiword expressions
exported in the PST format. Table 1 presents the numbers of individual types of
VMWEs.

                     VMWE type    Number of instances    Without overlaps
                     ID                  2,107                 1,611
                     LVC                 2,496                 2,437
                     IReflV             10,266                 9,982
                     OTH                     2                     2
                     Total              14,871                14,032

Table 1: Number of VMWEs extracted from the PDT and prepared for PARSEME
Shared Task. The first number is a raw number of VMWEs found, the second one
is after removal of duplicates and overlapped expressions that should not overlap.
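
The totals in Table 1 and the number of expressions removed per type follow directly from the table; the short check below only recomputes them from the figures given there (the variable names are ours).

```python
# (raw count, count without overlaps) per VMWE type, copied from Table 1.
counts = {
    "ID": (2107, 1611),
    "LVC": (2496, 2437),
    "IReflV": (10266, 9982),
    "OTH": (2, 2),
}

raw_total = sum(raw for raw, _ in counts.values())
final_total = sum(final for _, final in counts.values())
# Difference between the two columns for each type.
removed = {vmwe_type: raw - final for vmwe_type, (raw, final) in counts.items()}

print(raw_total, final_total)
print(removed, sum(removed.values()))
```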



4     Conclusions
It can be concluded that due to the well-founded, rich annotation scheme used in the
Prague Dependency Treebank, which also conforms to most of the four PARSEME
MWE annotation principles, we can almost fully automatically transform the original
MWE annotation into the PARSEME Shared Task verbal MWE types. In this way,
we can extract 14,032 VMWEs.
     In the near future, we still want to manually check some borderline cases mentioned
above, e.g. whether an isolated verbal PDT 3.0 MWE should always be
an ID, or how to resolve overlapping annotation of the same type but of a different
range. We will also include deverbative MWEs from separate automatic lexicon-based
annotation.


5   Acknowledgement
The work described here has been supported by the project PARSEME, LD14117,
by the Ministry of Education, Youth and Sports of the Czech Republic, and carried
out within the framework of the project COST IC1207 PARSEME. The project
used data distributed by the LINDAT/CLARIN repository, supported by the Min-
istry of Education, Youth and Sports of the Czech Republic (projects LM2010013,
LM2015071). We also thank our colleague Milena Hnátková, who kindly extracted
deverbative variants of VMWEs using her phraseme database; we are working
on incorporating these data into our outputs.


References
 [1] Eduard Bejček, Eva Hajičová, Jan Hajič, Pavlína Jínová, Václava Kettnerová,
     Veronika Kolářová, Marie Mikulová, Jiří Mírovský, Anna Nedoluzhko,
     Jarmila Panevová, Lucie Poláková, Magda Ševčíková, Jan Štěpánek, and
     Šárka Zikánová. Prague Dependency Treebank 3.0, 2013. Data available
     from LINDAT/CLARIN, http://hdl.handle.net/11858/00-097C-0000-0023-
     1AAF-3.

 [2] Jan Hajič, Jarmila Panevová, Zdeňka Urešová, Alevtina Bémová, Veronika
     Kolářová, and Petr Pajas. PDT-VALLEX: Creating a large-coverage valency
     lexicon for treebank annotation. In Joakim Nivre and Erhard Hinrichs, edi-
     tors, Proceedings of The Second Workshop on Treebanks and Linguistic The-
     ories, volume 9 of Mathematical Modeling in Physics, Engineering and Cog-
     nitive Sciences, pages 57–68, Vaxjo, Sweden, 2003. Vaxjo University Press.

 [3] Jan Hajič and Zdeňka Urešová. Linguistic annotation: from links to cross-
     layer lexicons. In Joakim Nivre and Erhard Hinrichs, editors, Proceedings
     of The Second Workshop on Treebanks and Linguistic Theories, volume 9
     of Mathematical Modeling in Physics, Engineering and Cognitive Sciences,
     pages 69–80, Vaxjo, Sweden, 2003. Vaxjo University Press.

 [4] Jan Hajič, Jarmila Panevová, Eva Hajičová, Petr Sgall, Petr Pajas, Jan
     Štěpánek, Jiří Havelka, Marie Mikulová, Zdeněk Žabokrtský, Magda
     Ševčíková Razímová, and Zdeňka Urešová. Prague Dependency Tree-
     bank 2.0, 2006. LDC2006T01. Philadelphia, PA, USA.

 [5] Milena Hnátková. Značkování frazémů a idiomů v Českém národním kor-
     pusu s pomocí Slovníku české frazeologie a idiomatiky. Slovo a slovesnost,
     2002.




 [6] Veronika Kolářová. Valence deverbativních substantiv v češtině. PhD thesis,
     Univerzita Karlova v Praze, Matematicko-fyzikální fakulta,
     Praha, Czechia, 2005.
 [7] Veronika Kolářová. Valency of Deverbal Nouns in Czech. The Prague Bul-
     letin of Mathematical Linguistics, 86:5–20, 2006.
 [8] Veronika Kolářová. Special valency behavior of Czech deverbal nouns, chap-
     ter 2, pages 19–60. Studies in Language Companion Series, 158. John Ben-
     jamins Publishing Company, Amsterdam, The Netherlands, 2014.
 [9] Adam Przepiórkowski, Jan Hajič, Elżbieta Hajnicz, and Zdeňka Urešová.
     Phraseology in two slavic valency dictionaries: limitations and perspectives.
     International Journal of Lexicography, (1):1–38, 2016.
[10] Victoria Rosén, Koenraad De Smedt, Gyri Losnegaard, Eduard Bejček, Agata
     Savary, and Petya Osenova. MWEs in treebanks: From survey to guidelines.
     In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Marko Grobelnik,
     Bente Maegaard, Joseph Mariani, Asunción Moreno, Jan Odijk, and Ste-
     lios Piperidis, editors, Proceedings of the 10th International Conference on
     Language Resources and Evaluation (LREC 2016), pages 2323–2330, Paris,
     France, 2016. European Language Resources Association.
[11] Agata Savary, Manfred Sailer, Yannick Parmentier, Michael Rosner, Victoria
     Rosén, Adam Przepiórkowski, Cvetana Krstev, Veronika Vincze, Beata Wój-
     towicz, Gyri Smørdal Losnegaard, Carla Parra Escartín, Jakub Waszczuk,
     Matthieu Constant, Petya Osenova, and Federico Sangati. PARSEME –
     PARSing and Multiword Expressions within a European multilingual net-
     work. In 7th Language & Technology Conference: Human Language Tech-
     nologies as a Challenge for Computer Science and Linguistics (LTC 2015),
     Poznań, Poland, November 2015.
[12] Pavel Straňák. Annotation of Multiword Expressions in The Prague Depen-
     dency Treebank. PhD thesis, Charles University in Prague, 2010.
[13] Zdeňka Urešová. Valence sloves v Pražském závislostním korpusu. Studies
     in Computational and Theoretical Linguistics. Ústav formální a aplikované
     lingvistiky, Praha, Czechia, 2011.
[14] Zdeňka Urešová. Valenční slovník Pražského závislostního korpusu (PDT-
     Vallex). Studies in Computational and Theoretical Linguistics. Ústav formální
     a aplikované lingvistiky, Praha, Czechia, 2011.
[15] Zdeňka Urešová, Eva Fučíková, Jan Hajič, and Jana Šindlerová. An analysis
     of annotation of verb-noun idiomatic combinations in a parallel dependency
     corpus. In The 9th Workshop on Multiword Expressions (MWE 2013), pages
     58–63, Atlanta, Georgia, USA, 2013. Association for Computational Linguistics.




[16] Veronika Vincze, Agata Savary, Marie Candito, Carlos Ramisch, and Fabienne
     Cap. Annotation guidelines for the PARSEME shared task on automatic
     detection of verbal multiword expressions, version 6.0, 2016.
     http://typo.uni-konstanz.de/parseme/images/shared-task/guidelines/PARSEME-ST-annotation-guidelines-v6.pdf.



