=Paper=
{{Paper
|id=None
|storemode=property
|title=Automatic Recognition of Composite Verb Forms in Serbian
|pdfUrl=https://ceur-ws.org/Vol-920/p89-djordjevic.pdf
|volume=Vol-920
|dblpUrl=https://dblp.org/rec/conf/bci/Dordevic12
}}
==Automatic Recognition of Composite Verb Forms in Serbian==
<pdf width="1500px">https://ceur-ws.org/Vol-920/p89-djordjevic.pdf</pdf>
<pre>
         Automatic Recognition of Composite Verb Forms in
                             Serbian
                                                                    Bojana Đorđević
                                                                    Faculty of Philology
                                                                     Belgrade, Serbia
                                                               bojana@lingvistika.org


ABSTRACT                                                                               given in those grammar books and their actual usage. Our
In this paper, we will present the work on building a shallow                          approximation was that by parsing using only those “raw” rules,
parser for recognizing composite verb forms in Serbian – the                           we could automatically recognize around 40% of all the CVF,
forms that consist of an auxiliary verb and a main verb. The                           which is not a very satisfying result.
parser is made in Unitex, a corpus processing software, in the                         The problem with the remaining 60% seems to be the following.
form of local grammars that rely on using morphological                                To start with, the possibility of changing the word order is rarely
dictionaries of Serbian. The model was tested on a small corpus of                     mentioned – having the auxiliary verb not before but after the
texts, both written in Serbian and translated into Serbian (total of                   main verb. Also, the verbs that are reflexive have an additional
171 kw), in a few phases. In the current phase, the average result                     component, namely the particle se, which also changes its
of 95,8% of well recognized units is achieved, with the translation                    position due to the formerly mentioned inversion. Inclusion of
of Jules Verne’s Around the world in 80 days giving the best                           those two facts would bring the total sum to 60 or maybe 70%.
results (98,8%), and a short story by Ivo Andrić, A Vacation in                        The rest of the forms are those that have some kind of an insert
the South, giving the worst (91,7%).                                                   between their main components. Those cases are in fact the ones
                                                                                       that call for making a parser. The inserts can be of many types
Keywords                                                                               (simple words, phrases, appositions) and can combine in
Composite verb forms, shallow parsing, Serbian                                         numerous ways.
                                                                                       After making the initial model and applying it to texts, we
                                                                                       searched for unrecognized items and included them in the model.
1. RECOGNIZING COMPOSITE VERB                                                          In the end, we had approximately three different basic sets of
FORMS – THE STARTING POINT                                                             rules for each of the CVF, with each having different types and
                                                                                       combinations of inserts included in them.
1.1 Composite Verb Forms – What Are They?
Under the term Composite Verb Forms (CVF) we consider the                              1.3 Aims
verb forms made of two parts – one being the auxiliary verb                            The aims we had while making the model were:
jesam, biti or hteti and the other the main verb – in the form of
either infinitive or past participle. Most of the CVF are tenses, but                  1. Taking into account all the different word orders
some of them, like Conditional (Potencijal) and Future Perfect                         2. Recognizing CVF of reflexive verbs
(Futur II) are aspects. We looked into all the tenses and aspects in
                                                                                       3. Recognizing inserted clitics and other inserts, with emphasis on
the active voice: Past Tense (sam išla – I went), Future Tense (ću
                                                                                       adverbs and adverbial phrases
ići – I will go), Past Perfect Tense (sam bila otišla/bejah otišla –
I had gone), Future Perfect (budem otišla – I will have gone)                          4. Dealing with elided CVF
and Conditional (bih išla – I would go).
The main idea behind building the shallow parser for CVF is to                         The first two aims were met almost immediately.
make the base to which other segments can later be attached –                          Handling inserted clitics and simple adverbs (simple here meaning
specifically those for recognizing noun and preposition phrases. This                  that they have a single entry in morphological dictionaries –
is just one of the steps, but an initial and, in our opinion, a very                   either in the part with simple or composite forms) was also
important one, towards building a shallow parser for the entire                        quite straightforward. Nevertheless, a significant number of units
Serbian grammar.                                                                       remained unrecognized, so in the next phase we included more
1.2 Theory                                                                             inserts and made recognition of adverbs more complex. The work
                                                                                       on inserts will be presented in more detail in section 2.2.
The starting point for the model was the grammar books
                                                                                       Dealing with ellipsis was the most difficult task and is still open.
used in high school and undergraduate studies [1] [2]. However,
                                                                                       What is meant by ellipsis and how we handled it will be
there is a clear difference between knowing the formation rules
                                                                                       presented in section 2.3.

BCI’12, September 16–20, 2012, Novi Sad, Serbia.
                                                                                       Evaluation of the grammars will be given in section 3.
Copyright © 2012 by the paper’s authors. Copying permitted only for private and
academic purposes. This volume is published and copyrighted by its editors.
Local Proceedings also appeared in ISBN 978-86-7031-200-5, Faculty of Sciences,
University of Novi Sad.
1.4 Corpora                                                                 4.
                                                                            <FUTUR1><AUX>nećemo</AUX><CLIT>ih</CLIT><V>pozvati
The model was tested on four texts/collections of texts: 10                 </V><Vadd>i<V>reći</V></Vadd></FUTUR1>
chapters of Jules Verne’s Around the World in 80 Days (28 kw), a            (we will not call them and tell them)
corpus of newspaper texts published on 03.01.2004 (79 kw),
Early Sorrows by Danilo Kiš (56 kw) and a story by Ivo Andrić,
                                                                            2.2 Modeling the Inserts
A Vacation in the South (8 kw).
                                                                            When modeling the inserts, we started with simple but useful
2. PARSING                                                                  segments like clitics and adverbs. Soon, there was a need for a
                                                                            more complex definition of an adverb so currently, adverb (ADV)
2.1 Background                                                              is a subgraph that recognizes simple adverbs, repetition of
The shallow parser was made in Unitex corpus processing                     adverbs, conjunctions of simple adverbs and present participles
software, version 2.1 [3].¹ All the rules are given in the form of          (V:S – pevajući). We could not take into account the adverbial
local grammars – finite state transducers – whose outputs are               function of certain phrases, such as preposition phrases (PP), so
appropriate XML tags. The model is dependent on using the                   they are not yet included here. The current look of a general insert
morphological dictionaries of Serbian [4], thanks to which we               segment that recognizes adverbs is presented in Figure 2.
were able to use specific morphological forms. Currently, neither
agreement nor any other syntactic relation is modeled, and the
connections between words are established purely on the basis of
word order. An example of one of the local grammars is presented
in Figure 1.


     Figure 1: Local grammar for the Future Tense (Futur 1).


The graph in Figure 1 recognizes all the forms of the Future                             Figure 2: Local grammar for adverbs.
Tense that consist of an auxiliary (AUX) verb (V) hteti in the
Present Tense (P) that comes first, after which there is an optional
insert (here Umetak1). The next element is obligatory and it is a           After this initial phase, other insert segments were included, like
verb (V) in the infinitive form (W). Following that, there is               pronouns (PRO) and particles (PAR). We also made a very
another optional element. This time, it is an elided CVF (here              complex preposition phrase (PPkonstrukcije). Chunks like
Dodatak1). Gray graph boxes denote subgraphs – they are a link              apposition (Apozicija), that we were able to define thanks to
to another graph in which the given element is presented in detail.         commas that appear at its ends, were also included. The noun
Local grammars have XML tags as their outputs. The above graph              phrase (NP) was included last because it was the most difficult
will insert tags <AUX> and </AUX> around the auxiliary verb                 one to model, but its inclusion, apart from ADV grammar,
and <V> and </V> around the main verb. There are appropriate                contributed the most to good recognition results. An example of a
tags for both segments of inserts and segments of elided CVF, but           part of a general insert is presented in Figure 3.
they are placed inside the subgraphs. The entire recognized CVF
has its own tense tag.
Here are some results of applying this graph. The first
example contains only obligatory elements, while the other
examples have either inserts or an elided CVF, or both in the last
one (example 4).
1. <FUTUR1><AUX>neće</AUX><V>doći</V></FUTUR1>
(he will not come)
2. <FUTUR1><AUX>će</AUX><CLIT>im</CLIT><NP>učiteljica</NP><V>reći</V></FUTUR1>
(the teacher will tell them)
3. <FUTUR1><AUX>ću</AUX><V>reći</V><Vadd>i<PP>bez                           Figure 3: Segment of a general insert that recognizes various
<NP>problema</NP></PP><V>potpisati</V></Vadd></FUTUR1>                                            insert elements.
(I will say and sign without a problem)


¹ http://www-igm.univ-mlv.fr/~unitex/

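
The Future Tense grammar of Figure 1 (section 2.1) can be approximated outside Unitex as a small token-level matcher. The following is a minimal Python sketch, not the author's grammar: the word lists are a hypothetical stand-in for the Serbian morphological dictionaries, only clitics and nouns are accepted as inserts, and elided CVF (Dodatak1) are ignored. The names classify, tag_future and max_inserts are likewise assumptions made for illustration.

```python
# Hypothetical stand-ins for dictionary lookups (the real model relies on
# the Serbian morphological dictionaries used by Unitex).
HTETI_PRESENT = {"ću", "ćeš", "će", "ćemo", "ćete",
                 "neću", "nećeš", "neće", "nećemo", "nećete"}
CLITICS = {"mu", "joj", "im", "ga", "ih", "se"}
INFINITIVES = {"doći", "reći", "pozvati", "potpisati", "ići"}
NOUNS = {"učiteljica"}


def classify(word):
    """Return a coarse tag for a word, or None if it is not in the toy lexicon."""
    w = word.lower()
    if w in HTETI_PRESENT:
        return "AUX"
    if w in CLITICS:
        return "CLIT"
    if w in INFINITIVES:
        return "V"
    if w in NOUNS:
        return "NP"
    return None


def tag_future(tokens, max_inserts=3):
    """Wrap AUX + optional inserts + infinitive in <FUTUR1> tags."""
    out = []
    i = 0
    while i < len(tokens):
        if classify(tokens[i]) == "AUX":
            j = i + 1
            inserts = []
            # Optional insert segment: here only clitics and nouns.
            while (j < len(tokens) and len(inserts) < max_inserts
                   and classify(tokens[j]) in ("CLIT", "NP")):
                tag = classify(tokens[j])
                inserts.append(f"<{tag}>{tokens[j]}</{tag}>")
                j += 1
            # Obligatory main verb in the infinitive.
            if j < len(tokens) and classify(tokens[j]) == "V":
                out.append(f"<FUTUR1><AUX>{tokens[i]}</AUX>{''.join(inserts)}"
                           f"<V>{tokens[j]}</V></FUTUR1>")
                i = j + 1
                continue
        # No match starting here: copy the token through unchanged.
        out.append(tokens[i])
        i += 1
    return " ".join(out)


print(tag_future("neće doći".split()))
print(tag_future("će im učiteljica reći".split()))
```

With these toy word lists the two calls reproduce the tagging of examples 1 and 2 above. Like the original grammars, the sketch relies purely on word order and performs no agreement check.
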
2.3 Modeling the Elided CVF                                                   Table 2: Results per Tenses in Around the world in 80 days
                                                                                                       (28 kw)
Elided CVF are the ones that share the auxiliary verb with the
verb before them, to which they are usually connected with the                                                NOT
                                                                                 80 days         MISS                     OK        Total
conjunction i (došao je i seo – he came and sat down). These units                                             OK
were complicated to recognize for two main reasons: there is a               Future Simple          0           0          57        57
high possibility that the verb after the conjunction is followed by          Future Perfect         0           0          0          0
its own auxiliary verb that can but does not have to be adjacent. In          Simple Past           4           2         584        590
that case, there is a danger of falsely recognizing an elided CVF             Past Perfect          0           0          2          2
while it is in fact a regular one (an example of that problem is              Conditional           2           0          40        42
given in section 3). Also, the forms and number of inserts                                          6           2         683        691
                                                                                  Total
between the first CVFs and the elided ones can be very complex                                   (0,9%)      (0,3%)     (98,8%)    (100%)
and require special attention.
Figure 5 gives an example of the forms such as: je došao bez                     Table 3: Results per Tenses in Early Sorrows (56 kw)
pitanja i brzo pitao (he came without a question and quickly
                                                                                                              NOT
asked) and je rekao ili viknuo (he said or shouted).                               Kiš           MISS                     OK        Total
                                                                                                               OK
                                                                             Future Simple          1           3         183        187
                                                                             Future Perfect         0           0          10        10
                                                                              Simple Past          22          19         958        999
                                                                              Past Perfect          1           0          29        30
                                                                              Conditional           0           2         118        120
                                                                                                   24          24         1298      1346
                                                                                  Total
                                                                                                 (1,8%)      (1,8%)     (96,4%)    (100%)

                                                                            Table 4: Results per Tenses in A Vacation in the South (8 kw)
           Figure 5: Elided CVF for the Past Tense.
                                                                                                              NOT
                                                                                 Andrić          MISS                     OK        Total
                                                                                                               OK
These grammars are not yet refined enough to be useful, so we                Future Simple          0           0          5          5
had to exclude them for some tenses because they produced too                Future Perfect         0           0          0          0
much noise.                                                                   Simple Past          11           2         128        141
                                                                              Past Perfect          0           0          1          1
3. EVALUATION                                                                 Conditional           0           0          11        11
                                                                                                   11           2         145        158
3.1 General Data                                                                  Total
                                                                                                  (7%)       (1,3%)     (91,7%)    (100%)
Evaluation was done in the following way – after the initial
recognition and tagging, we manually tagged all the texts, adding           On average, 95,8% of all the CVF are correctly recognized.
an attribute P(ROVERA) (check) with three values. For the units             Elided CVF are not fully included in grammars so we have not
that were recognized well, the value was OK, for the badly                  included them in the MISS results.
recognized, it was NOT OK, and those that were not found at all
were tagged and marked as MISS.
In the tables below, results are presented for each of the tenses in        3.2 MISS Units
each of the four texts.                                                     MISS units were of four main types:
                                                                            1) The ones with a more complicated insert (su vazdušasti oblaci,
 Table 1: Results per Tenses in the Collection of Newspaper                 tečno more i tvrdo kopno, menjajući svako svoja svojstva, izašli
                       Texts (79 kw)                                        – airy clouds, liquid sea and solid ground, each changing their
                                                                            properties, emerged)
                                 NOT
       2004           MISS                  OK           Total              2) Units with a strangely composed insert (se vrlo Paspartuu više
                                  OK
 Future Simple          3          0        174           177               nije dopadalo – Passepartout did not like it at all any more)
 Future Perfect         0          1         4             5                3) CVF with an embedded CVF, as that case is not yet included in
  Simple Past          26         15        951           992               grammars (valjda se dok se igrao okrenuo – I guess that while he
  Past Perfect          0          0         1             1                was playing he turned around)
  Conditional           2          0        110           112               4) CVF embedded in appositions, as they are also not yet included
                       31         16        1240         1287               in grammars (a onda se odjednom – kao da je uvideo da je sleteo
       Total
                     (2,4%)     (1,2%)    (96,4%)       (100%)              na pogrešnu adresu! – dostojanstveno i prezrivo vinuo – and
                                                                            then suddenly – as if he realized he had landed at a wrong
                                                                            address! – he dignifiedly and scornfully flew up)


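
The paper does not state how the overall figure of 95,8% was obtained from Tables 1-4. The short Python calculation below, using only the "Total" rows and the reported OK percentages, suggests that it is the unweighted mean of the four per-text percentages; pooling all units instead would give a slightly higher value. The variable names are illustrative only.

```python
# (MISS, NOT OK, OK) counts copied from the "Total" rows of Tables 1-4.
totals = {
    "newspapers": (31, 16, 1240),
    "80 days": (6, 2, 683),
    "Kiš": (24, 24, 1298),
    "Andrić": (11, 2, 145),
}

# OK percentages as reported under each table.
reported_ok_pct = [96.4, 98.8, 96.4, 91.7]

# Unweighted mean of the per-text percentages (every text counts equally).
macro_avg = sum(reported_ok_pct) / len(reported_ok_pct)

# Pooled percentage (every CVF counts equally, so long texts dominate).
ok_units = sum(counts[2] for counts in totals.values())
all_units = sum(sum(counts) for counts in totals.values())
pooled_avg = 100 * ok_units / all_units

print(f"unweighted mean: {macro_avg:.1f}%")
print(f"pooled over {all_units} units: {pooled_avg:.1f}%")
```

The unweighted mean gives the shortest text (A Vacation in the South, 158 units) the same weight as the newspaper corpus (1287 units), which is why it is lower than the pooled figure.
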
3.3 NOT OK Units                                                              modular and correctly settled so it might not happen yet. That
                                                                              phase would also mean a total rearrangement and division of
The NOT OK units can be divided into two major groups:                        grammars as they are now.
1) The ones in which some other part of speech (usually a noun)               Another interesting future phase, tightly dependent on modularity
gets recognized as a verb (most of the time – past participle) with           of CVF grammars, is incorporating grammars developed by other
which it shares the graphical form. That is how many interesting,             colleagues, made to recognize units such as dates and proper
falsely recognized examples of Past Tense, together with their                names. In order to make that kind of modularity among inserts,
elided CVFs, are produced:                                                    there are a number of alterations that need to be made, and most
     •    crna <PERFEKAT P="NOT OK">je                                        likely, some of the current solutions will have to be rethought.
          svila</PERFEKAT> vlažna od suza (black silk is                      Incorporating those graphs would certainly lead to greater
          wet from tears)                                                     precision and it would be interesting to see at what cost, if any at
                                                                              all.
Here, svila (silk) is recognized as a past participle form of the
verb sviti (to fold).                                                         Current ellipsis grammars need to be further refined. It remains
     •    Jer <PERFEKAT P="NOT OK">vile
                                                                              to be seen how much of the ellipsis can be handled
          se</PERFEKAT> uvek oblače u belo (Because
                                                                              automatically. Those subgrammars are then to be included where
          fairies always wear white)                                          possible and where they do not make too much noise.

Vile (fairies) is recognized as a past participle form of the verb
viti (to flutter). This example was in fact recognized due to lack of
                                                                              5. REFERENCES
                                                                              [1] Stanojčić, Ž., Popović, Lj. 2002. Gramatika srpskoga jezika.
agreement in the model. Namely, the form of the Past Tense made
                                                                                  Zavod za udžbenike i nastavna sredstva, Beograd
with only the past participle and the reflexive particle se is the one
in which the past participle is in 3rd person singular. The form vile         [2] Stevanović, M. 1964. Savremeni srpskohrvatski jezik –
matches 3rd person plural.                                                        gramatički sistemi i književnojezička norma. Naučno delo,
                                                                                  Beograd
     •    Kao što ga <PERFEKAT P="NOT OK">je izdao i
          prošle</PERFEKAT> godine (As he betrayed him                        [3] Paumier, S. 2003.
          last year too)                                                          http://www-igm.univ-mlv.fr/~unitex/UnitexManual2.1.pdf
Prošle (previous) is recognized as a past participle form of proći            [4] Vitas, D., Krstev, C., Obradović I., Popović, Lj., Pavlović-
(to pass). This is an example of false recognition of an elided                   Lažetić, Gordana. 2003. A Processing Serbian Written Texts:
CVF but the reason is the same as in the previous example – lack                  An Overview of Resources and Basic Tools. In Workshop on
of agreement. The elided CVF normally agrees in number and                        Balkan Language Resources and Tools (Thessaloniki,
gender with the main verb of the previous CVF, and here, while                    Greece, November 21, 2003). S. Piperidis and V. Karkaletsis,
izdao is 3rd person singular masculine, prošle is 3rd person                      Eds. 97-104.
plural feminine.                                                              [5] Nenadić, G., Vitas, D., and Krstev, C. 2001. Local Grammars
2) The ones with the full CVF recognized as an ellipsis:                          and Compound Verb Lemmatization in Serbo-Croatian.
                                                                                  Current Issues in Formal Slavic Linguistics. Gerhild
     •    <FUTUR1 P="NOT OK">će joj čestitati i
                                                                                  Zybatow, Uwe Junghanns, Grit Mehlorn, Luka Szuscich,
          reći</FUTUR1> će joj (will congratulate her and
                                                                                  Eds. Peter Lang. Frankfurt am Main, Berlin, Bern ; Bruxelles
          tell her)
                                                                                  ; New York ; Oxford ; Wien, 469-477.
In this case, the auxiliary verb of the second verb, falsely
                                                                              [6] Vitas, D., Krstev, C. 2003. Composite Tense Recognition
recognized as the elided CVF, immediately follows the main verb.
                                                                                  and Tagging in Serbian. In Proceedings of the Workshop on
Cases like this should be the easiest to deal with, once we pay
                                                                                  Morphological Processing of Slavic Languages :
more attention to the segment of elided CVF.
                                                                                  10th Conference of the European Chapter EACL 2003.
                                                                                  (Budapest, Hungary, April 13th, 2003). T. Erjavec, D. Vitas,
4. FURTHER RESEARCH                                                               Eds. 54-61.
There are a few directions in which we plan to take the work on
automatic recognition of CVF. The general direction is towards                [7] Vučković, K., Agić, Ž., Tadić, Marko. 2010. Improving
precision and more grammatical accuracy. There are a few                          Chunking Accuracy on Croatian Texts by Morphosyntactic
technical alterations that still need to be made. Apart from fixing               Chunking Accuracy on Croatian Texts by Morphosyntactic
some remaining problems and including some cases or                               Tagging. In Proceedings of the Seventh International
combinations of inserts that have not yet been included, there is a               European Language Resources Association (Valletta), 1944-
growing need for increasing the modularity of grammars. This                      1949.
applies to all the segments of grammars, but primarily to CVF                 [8] Gross, M. Lemmatization of English Verbs in Compound
parts. There is also a need for going a step further and                          Tenses. Available at: http://infolingu.univ-
incorporating agreement elements between the auxiliary verb and                   mlv.fr/english/Bibliographie/Articles/Lemmatization.pdf
the main verb. This step requires having all the other elements


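
Sections 3.3 and 4 suggest that an agreement check between the main verb and an elided CVF would filter out false hits such as je izdao i prošle. A minimal Python sketch of such a filter follows. It is an illustration only: the reading table and the function name agrees are hypothetical, the number and gender values are those given in the discussion of the examples, and the grammars described in this paper do not yet contain any agreement check.

```python
# Hypothetical (number, gender) readings for a few participle-like forms;
# a real system would take them from the morphological dictionaries.
PARTICIPLE_READINGS = {
    "došao": {("sg", "m")},
    "seo": {("sg", "m")},
    "izdao": {("sg", "m")},
    "prošle": {("pl", "f")},
}


def agrees(first_participle, elided_candidate):
    """True if the two forms share at least one (number, gender) reading."""
    first = PARTICIPLE_READINGS.get(first_participle, set())
    second = PARTICIPLE_READINGS.get(elided_candidate, set())
    # Unknown forms have no readings, so they never agree.
    return bool(first & second)


print(agrees("došao", "seo"))      # candidate kept: došao je i seo
print(agrees("izdao", "prošle"))   # candidate rejected: je izdao i prošle
```

A check of this kind would reject the prošle example, whereas a genuine elided CVF such as došao je i seo would still pass.
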

</pre>