Extraction and Analysis of Proper Nouns in Slovak Texts

Radovan Garabík
Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, Bratislava, Slovakia
garabik@kassiopeia.juls.savba.sk

Radoslav Brída
Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, Bratislava, Slovakia
brida@korpus.sk

1 Introduction

Unknown named entity recognition in inflected languages faces several specific problems. First and foremost, the entities themselves are inflected¹ (Dvonč et al., 1966), which leads to the problem of identifying word forms as belonging to the same lexeme, and also to the problem of finding the correct lemma. In this article we analyse the distribution of word forms for proper nouns in Slovak and describe an algorithm for their automatic extraction and lemmatisation.

The task of lemmatisation and morphological annotation of flective (and more specifically, Slavic) languages is reasonably well researched and developed (Hajič, 2004). Since we cannot expect a morphological database (data relating lemmata to inflected word forms and their grammatical tags) to cover all or almost all the words present in the corpus (especially proper names, which keep appearing depending on who or what has become a hot topic in mass media), using a well-tuned guesser can improve the accuracy of lemmatisation and tagging.

Common sense says that named entities (proper names in particular) behave differently from common nouns. Translated into information theory terms, this means that the information about whether a word is a proper name is not independent from the information about its morphological paradigm. We can therefore use the information about proper names to decrease the entropy of inflections, which helps the guesser choose between the possible lemmata and morphological tags.

2 Datasets

We denote the Levenshtein distance (Левенштейн, 1965) between two words l and w by ρ(l, w). Since a typical Slovak noun has up to 12 different word forms (two numbers, six cases; the vocative is rare), and the inflection is mostly realized by changing the suffix and by root vowel alternation, we can expect the overall distance between a lemma and its word forms to be not only bounded from above, but also to have a regular distribution (roughly speaking, the less typical the suffix length, the less likely such a word form is to appear).

We used the morphological database of the Slovak language (Garabík and Šimková, 2012; Karčová, 2008; Garabík, 2007), which contains (at the time of writing) complete morphological information for 35 009 nouns (lemmata), of which 1031 are proper nouns. We randomly divided the database into two parts, the training set and the evaluation set, ensuring that about 90% of both common and proper nouns are present in the training set. The evaluation set contained 101 lemmata and 694 unique word forms for proper nouns.

[Figure 1: two panels, common nouns (top) and proper nouns (bottom); x-axis ρ(lemma, word) from 0 to 7, y-axis n/|N| from 0 to 1.]
Figure 1: Distribution of word forms according to their Levenshtein distance from lemma.

¹ E.g. for the lemma Galileo, the genitive would be Galilea, the dative Galileovi, etc.

Copyright © by the paper's authors. Copying permitted for private and academic purposes. In Vito Pirrelli, Claudia Marzi, Marcello Ferro (eds.): Word Structure and Word Usage. Proceedings of the NetWordS Final Conference, Pisa, March 30-April 1, 2015, published at http://ceur-ws.org

    Word form        Occurrences
    ...
    Toska †                   10
    Toskala                   33
    Toskalu                   28
    Toskánec                  20
    Toskánska ‡              221
    Toskánske                 11
    Toskánsko ‡              110
    Toskánskom ‡              20
    Toskánsku ‡              304
    Toske †                   15
    Tosky †                   26
    Toso                      16
    ...

Table 1: Alphabetic list of proper noun candidates, with number of occurrences in the corpus. Note the extracted lemmata/lexemes Toska† (La Tosca) and Toskánsko‡ (Tuscany), as well as the unrelated Toskala and Toso and the related Toskánec (inhabitant of Tuscany) and Toskánske (Tuscan, adjective).

Fig.
1 displays the distribution of known common (top) and proper (bottom) nouns, summed and normalized over all the nouns in the training set. Vertical error bars display the standard deviation for the given distance of the word form from the lemma. From the graphs, we derive several conclusions: proper nouns are "less inflected", a higher ratio of them is in the basic form (lemma), and the maximum distance is ρ = 7 for common nouns (nouns with greater distance are those with very irregular declension, e.g. človek → ľudia "human/humans") and ρ = 5 for proper nouns. Distributions of common and proper nouns from the evaluation set match those from the training set, so there appears to be some difference between common and proper nouns globally. However, categorising single nouns using these differences between distributions is not reliable.

3 Extracting Candidates

Our algorithm extracts plausible candidates for proper nouns (those beginning with a capital letter but not at the beginning of a sentence, together with some additional filters) and for each candidate, it considers the set of words with ρ ≤ 5. This would require calculating the Levenshtein distance between all pairs of words in the set and the complexity would be $O(n^2)$, which is unacceptable for corpus-sized inputs. Unfortunately, although Levenshtein distance is a metric, it cannot be used to make an ordered set out of a list of words (in particular, it cannot be used to define an ordering binary relation ≤).

However, a trick can be applied: in a lexicographically ordered list of words (see Table 1) we need to look only at some interval around the word; word forms from beyond the interval are very unlikely to belong to the same lexeme. The complexity will be $O(Cn)$, where C is the (constant) size of the interval. This means that for some of the nouns not all word forms will be covered, especially for the shorter ones, where there is a higher probability that many unrelated words will be within the interval. Empirically we estimated the reasonable interval width to be 2000 words; increasing it above this number does not improve the accuracy anymore and the speed is acceptable. It should be noted that this interval is not a width of the context of the concordance. It is an interval in the lexicographically ordered set of proper noun candidates extracted from a given text, e.g. from a novel if we want to extract the whole inflectional paradigms of (new, unknown) proper nouns from the novel, or indeed from the whole corpus, if we aim to augment a morphological database.

We formally describe a Levenshtein edit operation as a triple $e = (o, i_s, i_d)$ of operation type $o$, position $i_s$ in the source string $s$ and position $i_d$ in the destination string $d$, where the operation type $o$ is one of replace, insert or delete. For replace or insert, the replacement/new character is taken from the destination string $d$. A sequence of edit operations $q = (e_1, e_2, e_3, \ldots)$, together with the destination string $d$, when applied to a string $s \in S$ defines a mapping $f_{q,d} : S_{q,d} \mapsto S$, where $S_{q,d}$ and $S$ are sets of strings.²

If we denote by $t$ a morphological tag for a given word form $w$, then for a lexeme with a lemma $l$ a tuple $(w, t)$ unambiguously refers to one inflected word form and its grammatical categories. We can then construct a sequence of edit operations leading from $l$ to $w$, denoted by $q(l, t)$.

For each proper noun from the training set, we precompute the functions $f_{q(l,t),l}$ (this can be improved by dividing the nouns into categories based on their declension rules and using only one noun from each category), to get the sequence of operations leading from the lemma to the tuple $(w, t)$ of the word form and morphological tag. Then, for each extracted word, we apply the functions $f_{q(l,t),l}$ to every word from the abovementioned interval, and the word with the greatest coverage (sum of the frequencies of generated word forms within the interval) is declared the lemma of the extracted word. Of course, this maximum can be attained by more than one word, especially if the lexeme is incomplete. We assume that at least the most common inflectional paradigms (used for proper nouns) are present in the training set.

    Number of               Lemmata assigned per word form
    word forms     [%]      correct      all
    100            18.9     0            1
    4               0.8     0            2
    418            79.2     1            1
    6               1.1     1            2
    Σ 528         100.0

Table 2: Number of automatically assigned lemmata per word form.

4 Evaluation

We used the algorithm to extract proper nouns from the Slovak National Corpus, version prim-6.1-public-all³, of the size 829 million tokens, and evaluated the results on the proper nouns from the evaluation set. The percentage of correctly automatically assigned lemmata is shown in Table 2: 79.2% of word forms were assigned a unique lemma, which was also the correct one, while 18.9% were assigned a unique but incorrect lemma⁴.

[Figure 2: histogram with two series, precision and recall; x-axis precision, recall from 1 down to 0 in steps of 0.1, y-axis frequency from 0 to 70.]
Figure 2: Histogram of precision and recall on automatically assigned word forms of the lexeme(s) for the evaluation data.

Figure 2 displays the precision and recall on word forms for proper nouns (i.e. how much of the lexeme has been extracted; the numbers are not weighted by the frequency of word forms in the corpus) from the evaluation set. We note that about 70 lexemes⁵ have precision ≈ 1; about 40 lexemes have recall ≈ 1, and about 50 lexemes have 0.9 ≳ recall ≳ 0.6, while only a small number of lexemes have lower precision. The lower recall is caused by insufficient data coverage: not all the word forms were present in the analysed corpus. The precision we obtained is excellent, and the accuracy of automatic lemma assignment is good (79.2% of word forms received a unique, correct lemma; see Table 2).

5 Augmenting Morphological Database

The abovementioned process was used to increase the number of proper nouns in the Slovak morphological database. We used the extracted candidates from the prim-6.1-public-all corpus with at least 100 occurrences (counting all possible word forms derived from a given lemma). We calculate the coverage of word forms for one lemma as r = C(w, t)/C(g), where C(w, t) is the number of generated tuples of word forms and their corresponding morphological tags, and C(g) is the number of grammar categories (usually 7 or 14: 7 cases including the vocative and one or two grammatical numbers, with many proper nouns present only in singular).

After removing generated word forms with no corpus evidence, the average coverage of word forms per lemma is r = 0.84 ± 0.23, i.e. 84% of word forms are present in the corpus; 0.23 is the standard deviation of the coverage. Generated word forms still contain a lot of noise, therefore we also removed those word forms whose contribution to the number of occurrences of the given lemma was less than 1% (it is rare for a grammatical case to have such a low percentage compared to other cases). After this, the coverage changed to r = 0.75 ± 0.24, where again 0.24 is the standard deviation of the coverage. Then we manually proofread, corrected and filled in the word forms for the several hundred most frequent lemmata. After adding these words to the morphological database, we iterated the process, re-training the algorithm and generating another list of less frequent proper nouns.

6 Conclusion

The method has been used to improve the coverage of proper nouns in the Slovak morphological database and is used as a part of a morphological guesser, providing candidate lemmata and morphological tags for unknown proper nouns, as part of the morphosyntactic analysis and part-of-speech tagging of the Slovak National Corpus.⁶

² It is not possible to define the function f for every source string, since some of the operations might not be applicable to the given strings.
³ http://korpus.juls.savba.sk/res.html
⁴ For 13 word forms (2.5%) the correct lemma was not present in the interval of 2000 words.
⁵ Since the number of proper nouns in our evaluation set was 101, these numbers are fortuitously almost identical to percentages.
⁶ http://korpus.juls.savba.sk

References

[Левенштейн, 1965] Владимир Иосифович Левенштейн. 1965. Двоичные коды с исправлением выпадений, вставок и замещений символов. Докл. АН СССР, 4(163):845–848.

[Dvonč et al., 1966] Ladislav Dvonč, Gejza Horák, František Miko, Jozef Mistrík, Ján Oravec, Jozef Ružička, and Milan Urbančok. 1966. Morfológia slovenského jazyka. Vydavateľstvo SAV, Bratislava, Slovakia, 1st edition. 895 p.

[Garabík and Šimková, 2012] Radovan Garabík and Mária Šimková. 2012. Slovak Morphosyntactic Tagset. Journal of Language Modelling, 0(1):41–63.

[Garabík, 2007] Radovan Garabík. 2007. Slovak morphology analyzer based on Levenshtein edit operations. In M. Laclavík, I. Budinská, and L. Hluchý, editors, Proceedings of the WIKT'06 conference, pages 2–5, Bratislava. Institute of Informatics SAS.

[Hajič, 2004] Jan Hajič. 2004. Disambiguation of Rich Inflection (Computational Morphology of Czech). Karolinum, Charles University Press, Prague, Czech Republic.

[Karčová, 2008] Agáta Karčová. 2008. Príprava a uskutočňovanie projektu morfologického analyzátora. In Anna Gálisová and Alexandra Chomová, editors, Varia. 15.
Zborník materiálov z XV. kolokvia mladých jazykovedcov, pages 286–292, Bratislava. Slovenská jazykovedná spoločnosť pri SAV – Katedra slovenského jazyka a literatúry FHV UMB v Banskej Bystrici.
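
As an illustration of the candidate search described in Section 3, the following Python sketch picks a lemma for an extracted word by looking only at its neighbourhood in a sorted candidate list and maximising coverage. It is a simplified stand-in, not the authors' implementation: plain suffix-rewrite rules replace the full Levenshtein edit operations, the window is assumed to be symmetric around the word, code-point order replaces proper Slovak collation, and all function names as well as the two training lemmata (Slovensko, Praha) are hypothetical. The frequencies are those of Table 1.

```python
from bisect import bisect_left
from os.path import commonprefix


def suffix_rule(lemma, form):
    """Describe lemma -> form as (trailing chars to strip, suffix to add)."""
    shared = len(commonprefix([lemma, form]))
    return (len(lemma) - shared, form[shared:])


def apply_rule(word, rule):
    """Apply a suffix rule; return None when it is not applicable (cf. footnote 2)."""
    strip, add = rule
    if strip > len(word):
        return None
    return word[:len(word) - strip] + add


def learn_paradigms(training):
    """Turn {lemma: [word forms]} into a set of paradigms (frozensets of rules)."""
    return {frozenset(suffix_rule(lemma, form) for form in forms)
            for lemma, forms in training.items()}


def best_lemma(word, words, freq, paradigms, window=2000):
    """Return (lemma, coverage) for `word`, searching only its sorted neighbourhood.

    `words` must be sorted. Coverage is the summed frequency of the generated
    word forms that occur inside the window.
    """
    i = bisect_left(words, word)
    nearby = words[max(0, i - window // 2): i + window // 2 + 1]
    nearby_set = set(nearby)
    best, best_cov = None, 0
    for cand in nearby:
        for paradigm in paradigms:
            forms = {apply_rule(cand, rule) for rule in paradigm} & nearby_set
            if word not in forms:
                continue  # this candidate/paradigm does not explain the word
            cov = sum(freq[form] for form in forms)
            if cov > best_cov:
                best, best_cov = cand, cov
    return best, best_cov


# Hypothetical training data: two paradigms, singular forms only.
training = {
    "Slovensko": ["Slovensko", "Slovenska", "Slovensku", "Slovenskom"],
    "Praha": ["Praha", "Prahy", "Prahe", "Prahu", "Prahou"],
}
# Candidate frequencies taken from Table 1.
freq = {
    "Toska": 10, "Toskala": 33, "Toskalu": 28, "Toskánec": 20,
    "Toskánska": 221, "Toskánske": 11, "Toskánsko": 110,
    "Toskánskom": 20, "Toskánsku": 304, "Toske": 15, "Tosky": 26, "Toso": 16,
}
words = sorted(freq)  # code-point order, an assumption of this sketch
paradigms = learn_paradigms(training)

print(best_lemma("Toskánsku", words, freq, paradigms))  # ('Toskánsko', 655)
print(best_lemma("Toske", words, freq, paradigms))      # ('Toska', 51)
```

In this toy run the adjective form Toskánske is also generated from Toskánsko by the second paradigm, but that reading covers fewer occurrences and loses. With incomplete lexemes several candidates can reach the same coverage, as the text notes; the sketch simply keeps the first one it finds.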