Extraction and Analysis of Proper Nouns in Slovak Texts

                 Radovan Garabík                                                   Radoslav Brída
           Ľ. Štúr Institute of Linguistics                                 Ľ. Štúr Institute of Linguistics
           Slovak Academy of Sciences                                       Slovak Academy of Sciences
                Bratislava, Slovakia                                             Bratislava, Slovakia
     garabik@kassiopeia.juls.savba.sk                                          brida@korpus.sk


1 Introduction

Unknown named entity recognition in inflected languages faces several specific problems. The first and foremost is that the entities themselves are inflected¹ (Dvonč et al., 1966), which makes it hard both to identify word forms as belonging to the same lexeme and to find the correct lemma. In this article we analyse the distribution of word forms for proper nouns in Slovak and describe an algorithm for their automatic extraction and lemmatisation.

    ¹ E.g. for the lemma Galileo, the genitive would be Galilea, the dative Galileovi, etc.

The task of lemmatisation and morphological annotation of flective (and more specifically, Slavic) languages is reasonably well researched and developed (Hajič, 2004). Since we cannot expect a morphological database (data relating lemmata to inflected word forms and their grammatical tags) to cover all or almost all the words present in the corpus (especially proper names, which keep appearing depending on who or what has become a hot topic in the mass media), using a well-tuned guesser can improve the accuracy of lemmatisation and tagging.

Common sense says that named entities (proper names in particular) behave differently from common nouns. Translated into information-theoretic terms, this means that the information about whether a word is a proper name is not independent of the information about its morphological paradigm. We can therefore use the information about proper names to decrease the entropy of inflections, which helps the guesser choose between the possible lemmata and morphological tags.

2 Datasets

We denote the Levenshtein distance (Левенштейн, 1965) between two words l and w by ρ(l, w). Since a typical Slovak noun has up to 12 different word forms (two numbers, six cases; the vocative is rare), and the inflection is mostly realized by changing the suffix and by root vowel alternation, we can expect the overall distance between a lemma and its word forms to be not only bounded from above, but also to have a regular distribution (roughly speaking, the less typical the suffix length, the less likely such a word form is to appear).

We used the morphological database of the Slovak language (Garabík and Šimková, 2012; Karčová, 2008; Garabík, 2007), which contains (at the time of writing) complete morphological information for 35 009 nouns (lemmata), of which 1031 are proper nouns. We randomly divided the database into two parts, the training set and the evaluation set, ensuring that about 90% of both common and proper nouns are present in the training set. The evaluation set contained 101 lemmata and 694 unique word forms for proper nouns.

[Figure 1: two panels, common nouns (top) and proper nouns (bottom); horizontal axis ρ(lemma, word), from 0 to 7; vertical axis n/|N|, from 0 to 1.]
Figure 1: Distribution of word forms according to their Levenshtein distance from lemma.
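The distance ρ and the kind of distribution plotted in Figure 1 can be illustrated with a minimal sketch. The function names and the toy paradigm below are hypothetical, and the normalisation (one global histogram divided by the total number of word forms) is an assumption; the paper does not spell out how n/|N| and the error bars were computed.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein distance rho(a, b)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # replace (free if equal)
            ))
        previous = current
    return previous[-1]


def distance_distribution(paradigms: dict[str, set[str]]) -> dict[int, float]:
    """Fraction of word forms found at each distance from their lemma."""
    counts = Counter(
        levenshtein(lemma, form)
        for lemma, forms in paradigms.items()
        for form in forms
    )
    total = sum(counts.values())
    return {distance: n / total for distance, n in sorted(counts.items())}


# Toy data: an assumed, incomplete paradigm built from the forms in footnote 1.
toy = {"Galileo": {"Galileo", "Galilea", "Galileovi"}}
print(distance_distribution(toy))
```

With a full morphological database in place of `toy`, the same function would yield one histogram per noun class (common or proper).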

Copyright © by the paper’s authors. Copying permitted for private and academic purposes.
In Vito Pirrelli, Claudia Marzi, Marcello Ferro (eds.): Word Structure and Word Usage. Proceedings of the NetWordS Final
Conference, Pisa, March 30-April 1, 2015, published at http://ceur-ws.org
    word form        occurrences
    ...
    Toska †                   10
    Toskala                   33
    Toskalu                   28
    Toskánec                  20
    Toskánska ‡              221
    Toskánske                 11
    Toskánsko ‡              110
    Toskánskom ‡              20
    Toskánsku ‡              304
    Toske †                   15
    Tosky †                   26
    Toso                      16
    ...

Table 1: Alphabetic list of proper noun candidates, with the number of occurrences in the corpus. Note the extracted lemmata/lexemes Toska † (La Tosca) and Toskánsko ‡ (Tuscany), as well as the unrelated Toskala and Toso, and the related Toskánec (inhabitant of Tuscany) and Toskánske (Tuscan, adjective).
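A candidate list like the one in Table 1, and the interval lookup that Section 3 applies to it, might be produced along the following lines. This is a sketch under stated assumptions: the input is taken to be already tokenised into sentences, the "additional filters" mentioned in Section 3 are reduced to a simple alphabetic check, and the accent-insensitive `sort_key` is a guess that happens to reproduce the ordering of Table 1 (the actual collation used is not stated). All function names are hypothetical.

```python
import unicodedata
from collections import Counter


def sort_key(word: str) -> tuple[str, str]:
    """Accent-insensitive key (an assumption; it reproduces the order of Table 1)."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.lower(), word)


def proper_noun_candidates(sentences: list[list[str]]) -> Counter:
    """Count capitalised alphabetic tokens that are not sentence-initial."""
    counts: Counter = Counter()
    for tokens in sentences:
        for token in tokens[1:]:  # skip the first token of each sentence
            if token.isalpha() and token[0].isupper():
                counts[token] += 1
    return counts


def neighbours(ordered: list[str], index: int, width: int) -> list[str]:
    """Candidates inside an interval of about `width` entries around `index`."""
    half = width // 2
    start = max(0, index - half)
    stop = min(len(ordered), index + half + 1)
    return ordered[start:index] + ordered[index + 1:stop]


# Toy input: hypothetical, already tokenised sentences.
sentences = [
    ["Premiéra", "opery", "Toska", "."],
    ["Víno", "z", "Toskánska", "a", "opera", "Toska", "."],
    ["Cesta", "do", "Toskánska", "."],
]
counts = proper_noun_candidates(sentences)
ordered = sorted(counts, key=sort_key)
for i, word in enumerate(ordered):
    print(word, counts[word], neighbours(ordered, i, width=2000))
```

Because each candidate is compared only with a bounded number of neighbours in the sorted list, the work grows linearly with the number of candidates rather than quadratically.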


Fig. 1 displays the distribution of known common (top) and proper (bottom) nouns, summed and normalized over all the nouns in the training set. Vertical error bars display the standard deviation for the given distance of a word form from the lemma. From the graphs we derive several conclusions: proper nouns are “less inflected”, a higher ratio of them is in the basic form (lemma), and the maximum distance is ρ = 7 for common nouns (nouns with greater distance are those with very irregular declension, e.g. človek → ľudia “human/humans”) and ρ = 5 for proper nouns. Distributions of common and proper nouns from the evaluation set match those from the training set, so there appears to be some difference between common and proper nouns globally. However, categorising single nouns using these differences between distributions is not reliable.

3 Extracting Candidates

Our algorithm extracts plausible candidates for proper nouns (those beginning with a capital letter but not at the beginning of a sentence, together with some additional filters) and, for each candidate, considers the set of words with ρ ≤ 5. Done naively, this would require calculating the Levenshtein distance between all pairs of words in the set, and the complexity would be O(n²), which is unacceptable for corpus-sized inputs. Unfortunately, although Levenshtein distance is a metric, it cannot be used to make an ordered set out of a list of words (in particular, it cannot be used to define an ordering binary relation ≤).

However, a trick can be applied: in a lexicographically ordered list of words (see Table 1) we need to look only at some interval around the word; word forms from beyond the interval are very unlikely to belong to the same lexeme. The complexity will then be O(Cn), where C is the (constant) size of the interval. This means that for some of the nouns not all word forms will be covered, especially for the shorter ones, where there is a higher probability that many unrelated words will be within the interval. Empirically, we estimated a reasonable interval width to be 2000 words; increasing it above this number does not improve the accuracy any further, and the speed is acceptable. It should be noted that this interval is not the width of a concordance context: it is an interval in the lexicographically ordered set of proper noun candidates extracted from a given text, e.g. from a novel if we want to extract whole inflectional paradigms of (new, unknown) proper nouns from the novel, or indeed from the whole corpus if we aim to augment a morphological database.

We formally describe a Levenshtein edit operation $e = (o, i_s, i_d)$ as a triple of an operation type $o$, a position $i_s$ in the source string $s$ and a position $i_d$ in the destination string $d$, where the operation type $o$ is one of replace, insert or delete. For replace or insert, the replacement/new character is taken from the destination string $d$.

A sequence of edit operations $q = (e_1, e_2, e_3, \ldots)$, together with the destination string $d$, when applied to a string $s \in S$, defines a mapping $f_{q,d} : S_{q,d} \mapsto S$, where $S_{q,d}$ and $S$ are sets of strings.²

    ² It is not possible to define the function f for every source string, since some of the operations might not be applicable to the given strings.

If we denote by t a morphological tag for a given word form w, then for a lexeme with a lemma l a tuple (w, t) unambiguously refers to one inflected word form and its grammatical categories. We can then construct a sequence of edit operations leading from l to w, denoted by q(l, t).

For each proper noun from the training set, we precompute the functions $f_{q(l,t),l}$ (this can be improved by dividing the nouns into categories based on their declension rules and using only one noun from each category), to get the sequence of operations leading from the lemma to the tuple (w, t) of the word form and morphological tag. Then, for each extracted word, we apply the functions $f_{q(l,t),l}$ to every word from the abovementioned interval, and the word with the greatest coverage (sum of the frequencies of generated word forms within the interval) is declared the lemma of the extracted word. Of course, this maximum can be attained by more than one word, especially if the lexeme is incomplete. We assume that at least the most common inflectional paradigms (used for proper nouns) are present in the training set.

4 Evaluation

We used the algorithm to extract proper nouns from the Slovak National Corpus, version prim-6.1-public-all³, of size 829 million tokens, and evaluated the results on the proper nouns from the evaluation set. The percentage of correctly automatically assigned lemmata is shown in Table 2: 79.2% of word forms were assigned a unique lemma, which was also the correct one, while 18.9% were assigned a unique but incorrect lemma⁴.

    ³ http://korpus.juls.savba.sk/res.html
    ⁴ For 13 word forms (2.5%) the correct lemma was not present in the interval of 2000 words.

                                number of lemmata assigned
                                      per word form
    word forms       [%]          correct         all
           100      18.9             0             1
             4       0.8             0             2
           418      79.2             1             1
             6       1.1             1             2
         Σ 528     100.0

Table 2: Number of automatically assigned lemmata per word form.

[Figure 2: histogram with two series, precision and recall; horizontal axis: precision, recall, from 1 down to 0; vertical axis: frequency, from 0 to 70.]
Figure 2: Histogram of precision and recall on automatically assigned word forms of the lexeme(s) for the evaluation data.

Figure 2 displays the precision and recall on word forms for proper nouns from the evaluation set (i.e. how much of the lexeme has been extracted; the numbers are not weighted by the frequency of word forms in the corpus). About 70 lexemes⁵ have precision ≈ 1; about 40 lexemes have recall ≈ 1, and about 50 lexemes have 0.9 ≳ recall ≳ 0.6, while only a small number of lexemes have lower precision. The lower recall is caused by insufficient data coverage: not all the word forms were present in the analysed corpus. The precision we obtained is excellent and the accuracy of automatic lemma assignment is good.

    ⁵ Since the number of proper nouns in our evaluation set was 101, these numbers are fortuitously almost identical to percentages.

5 Augmenting Morphological Database

The abovementioned process was used to increase the number of proper nouns in the Slovak morphological database. We used the extracted candidates from the prim-6.1-public-all corpus with at least 100 occurrences (counted over all possible word forms derived from a given lemma). We calculate the coverage of word forms for one lemma as r = C(w, t)/C(g), where C(w, t) is the number of generated tuples of word forms and their corresponding morphological tags, and C(g) is the number of grammar categories (usually 7 or 14: 7 cases including the vocative, and one or two grammatical numbers, with many proper nouns present only in the singular).

After removing generated word forms with no corpus evidence, the average coverage of word forms per lemma is r = 0.84 ± 0.23, i.e. 84% of word forms are present in the corpus; 0.23 is the standard deviation of the coverage. Generated word forms still contain a lot of noise, so we also removed those word forms whose contribution to the number of occurrences of the given lemma was less than 1% (it is rare for a grammatical case to have such a low percentage compared to other cases). After this, the coverage changed to r = 0.75 ± 0.24, where again 0.24 is the standard deviation of the coverage. We then manually proofread, corrected and filled in the word forms for the several hundred most frequent lemmata. After adding these words to the morphological database, we iterated the process, re-training the algorithm and generating another list of less frequent proper nouns.

6 Conclusion

The method has been used to improve the coverage of proper nouns in the Slovak morphological database and is used as a part of a morphological guesser, providing candidate lemmata and morphological tags for unknown proper nouns, as part of the morphosyntactic analysis and part-of-speech tagging of the Slovak National Corpus.⁶

    ⁶ http://korpus.juls.savba.sk
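As a rough illustration of the lemma selection step described in Section 3, the following sketch derives edit operations from one training lexeme, applies them to candidate lemmata and picks the candidate with the greatest coverage. Several points are assumptions and not taken from the paper: all names are hypothetical, positions are counted from the end of the word (strings are reversed) so that suffix changes transfer between lemmata of different length, morphological tags are omitted, a candidate is considered only if it generates the extracted word, and the training paradigm of Anna is an illustrative example.

```python
from typing import Optional

Op = tuple[str, int, int]  # (operation type o, position i_s, position i_d)


def editops(src: str, dst: str) -> list[Op]:
    """Levenshtein edit operations turning src into dst."""
    n, m = len(src), len(dst)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(dist[i - 1][j] + 1, dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + (src[i - 1] != dst[j - 1]))
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # trace one optimal path back to the origin
        if i > 0 and j > 0 and src[i - 1] == dst[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("replace", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", i - 1, j))
            i -= 1
        else:
            ops.append(("insert", i, j - 1))
            j -= 1
    return ops[::-1]


def apply_ops(ops: list[Op], src: str, dst: str) -> Optional[str]:
    """Apply q (with characters from d) to src; None where f is undefined."""
    chars = list(src)
    for op, i_s, i_d in reversed(ops):  # right to left keeps positions valid
        if op == "insert":
            if i_s > len(src):
                return None
            chars.insert(i_s, dst[i_d])
        elif i_s >= len(src):
            return None
        elif op == "replace":
            chars[i_s] = dst[i_d]
        else:
            del chars[i_s]
    return "".join(chars)


def paradigm(lemma: str, forms: list[str]) -> list[tuple[list[Op], str]]:
    """Precompute (q, d) pairs for one training lexeme, on reversed strings."""
    return [(editops(lemma[::-1], f[::-1]), f[::-1]) for f in forms]


def generate(candidate: str, rules: list[tuple[list[Op], str]]) -> set[str]:
    """Word forms obtained by applying every (q, d) to a candidate lemma."""
    results = (apply_ops(ops, candidate[::-1], dst) for ops, dst in rules)
    return {r[::-1] for r in results if r is not None}


def best_lemma(word: str, interval: dict[str, int], paradigms: list) -> tuple[Optional[str], int]:
    """Candidate whose generated forms cover the most occurrences in the interval."""
    best, best_coverage = None, 0
    for candidate in interval:
        for rules in paradigms:
            forms = generate(candidate, rules)
            if word in forms:
                coverage = sum(interval.get(f, 0) for f in forms)
                if coverage > best_coverage:
                    best, best_coverage = candidate, coverage
    return best, best_coverage


# Frequencies from Table 1; the training paradigm is an assumed example.
interval = {"Toska": 10, "Toskala": 33, "Toskalu": 28, "Toskánec": 20,
            "Toskánska": 221, "Toskánske": 11, "Toskánsko": 110,
            "Toskánskom": 20, "Toskánsku": 304, "Toske": 15, "Tosky": 26, "Toso": 16}
rules = paradigm("Anna", ["Anna", "Anny", "Anne", "Annu", "Annou"])
print(best_lemma("Tosky", interval, [rules]))  # -> ('Toska', 51)
```

In this toy run the forms Toska, Toske and Tosky are all attested in the interval, so Toska wins over Toske and Tosky, which each account for only two of the three forms. Ties, which the paper notes can occur for incomplete lexemes, are resolved here simply in favour of the first candidate encountered.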


References

[Левенштейн1965] Владимир Иосифович Левенштейн. 1965. Двоичные коды с исправлением выпадений, вставок и замещений символов. Докл. АН СССР, 4(163):845–848.

[Dvonč et al.1966] Ladislav Dvonč, Gejza Horák, František Miko, Jozef Mistrík, Ján Oravec, Jozef Ružička, and Milan Urbančok. 1966. Morfológia slovenského jazyka. Vydavateľstvo SAV, Bratislava, Slovakia, 1st edition. 895 p.

[Garabík and Šimková2012] Radovan Garabík and Mária Šimková. 2012. Slovak Morphosyntactic Tagset. Journal of Language Modelling, 0(1):41–63.

[Garabík2007] Radovan Garabík. 2007. Slovak morphology analyzer based on Levenshtein edit operations. In M. Laclavík, I. Budinská, and L. Hluchý, editors, Proceedings of the WIKT’06 conference, pages 2–5, Bratislava. Institute of Informatics SAS.

[Hajič2004] Jan Hajič. 2004. Disambiguation of Rich Inflection (Computational Morphology of Czech). Karolinum, Charles University Press, Prague, Czech Republic.

[Karčová2008] Agáta Karčová. 2008. Príprava a uskutočňovanie projektu morfologického analyzátora. In Anna Gálisová and Alexandra Chomová, editors, Varia. 15. Zborník materiálov z XV. kolokvia mladých jazykovedcov, pages 286–292, Bratislava. Slovenská jazykovedná spoločnosť pri SAV – Katedra slovenského jazyka a literatúry FHV UMB v Banskej Bystrici.

