=Paper=
{{Paper
|id=Vol-2203/126
|storemode=property
|title=Crowdsourcing for Slovak Morphological Lexicon
|pdfUrl=https://ceur-ws.org/Vol-2203/126.pdf
|volume=Vol-2203
|authors=Vladimir Benko
|dblpUrl=https://dblp.org/rec/conf/itat/Benko18
}}
==Crowdsourcing for Slovak Morphological Lexicon==
S. Krajči (ed.): ITAT 2018 Proceedings, pp. 126–129. CEUR Workshop Proceedings Vol. 2203, ISSN 1613-0073, © 2018 Vladimír Benko

Crowdsourcing for the Slovak Morphological Lexicon

Vladimír Benko

UNESCO Chair in Plurilingual and Multicultural Communication, Comenius University in Bratislava, Šafárikovo nám. 6, SK-81499 Bratislava, Slovakia, and Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, Panská 26, SK-81101 Bratislava, Slovakia

Abstract. We present an ongoing experiment aimed at improving the results of Slovak PoS tagging by increasing the size of the morphological lexicon that is used for training the respective tagger(s). The frequency list of out-of-vocabulary (OOV) word forms, along with the tags and lemmas assigned by the guesser, is manually checked, corrected and classified by students in the framework of assignments, so that valid lexical items that are candidates for inclusion into the morphological lexicon can be identified. We expect to improve the lexicon coverage by the most frequent proper names and foreign words, as well as to create an auxiliary lexicon containing the most frequent typos.

===1 Introduction===

"Crowdsourcing" is a relatively recent concept that encompasses many practices. This diversity leads to the blurring of the limits of crowdsourcing, which may be identified with virtually any type of Internet-based collaborative activity, such as co-creation or user innovation [1]. In their paper, the authors define eight characteristics typical for crowdsourcing as follows:

* There is a clearly defined crowd (a)
* There exists a task with a clear goal (b)
* The recompense received by the crowd is clear (c)
* The crowdsourcer is clearly identified (d)
* The compensation to be received by the crowdsourcer is clearly defined (e)
* It is an online assigned process of participative type (f)
* It uses an open call of variable extent (g)
* It uses the Internet (h)

From this perspective, language data annotation performed by students in the framework of end-of-term assignments can well be considered "crowdsourcing", even if only some of the above characteristics apply. It is also worth noting that, according to our experience, students appreciate the feeling that their work may be useful not only as a tool for classification.

===2 The Problem===

Slovak belongs to the languages with more than one system for morphosyntactic annotation available, with two of them being actively used in our work¹. They have been developed (partially independently) in the framework of two different research projects.

The Slovak National Corpus (SNC) [2] uses a system based on the new Czech MorphoDiTa tagger [3, 4] with a custom language model and a tool for guessing lemmas for unrecognized (out-of-vocabulary, OOV) lexical items, while the Aranea Project [5, 6] uses the more traditional TreeTagger [7, 8] with a custom language model, yet without any functionality to guess lemmas for the OOV lexical items. Both systems use the SNC tagset² [9], a fine-grained positional tagset vaguely resembling the popular MULTEXT-East³ tagset utilized for several Slavic languages.

Language models for both systems, however, have been trained on the same source data: the 1.2 M token Manually morphologically annotated corpus⁴ and the SNC Morphology database⁵ covering approx. 100 K lemmas, yielding some 3.2 M inflected forms. This is why, although the two systems do not produce exactly the same output, they are (almost) identical⁶ in the amount of OOV items, which is rather high.

As both Slovak annotation systems explicitly indicate the OOV status of every token within a corpus, an analysis of the situation can be conveniently performed by a corpus manager, such as NoSketch Engine⁷ [10]. In the SNC corpora, the OOV status is indicated by the "XX" value of the "prec" attribute; this value can be observed in 54.5 million cases in the 1.37 Gigatoken prim-8.0-public-sane⁸ main corpus, which is 3.98% of all tokens.

In the web-based Araneum Slovacum Maximum⁹, where the OOV state is indicated by the "0" value of the "ztag" attribute, the situation is even worse: 135.5 million OOVs out of 2.96 Gigatokens, i.e., 4.57%. This can be explained by the rather "low quality" of web data that, despite all efforts in cleaning and filtering the source texts, naturally contains lots of "noise" of different kinds.

¹ We are aware of (at least) two more systems for morphosyntactic annotation of Slovak data that have been independently developed at Masaryk University in Brno and Charles University in Prague, respectively. These two systems, however, were not available for our work at the time of writing this paper.

² https://korpus.sk/morpho_en.html

³ http://nl.ijs.si/ME/V4/

⁴ https://korpus.sk/ver_r(2d)mak.html

⁵ https://korpus.sk/morphology_database.html

⁶ The differences are mainly caused by the fact that the TreeTagger-based system is also using word forms from the training corpus that were not present in the morphological database (mostly proper nouns) to amend the morphological lexicon.

⁷ https://nlp.fi.muni.cz/trac/noske

⁸ https://korpus.sk/prim(2d)8(2e)0.html

⁹ http://aranea.juls.savba.sk/aranea_about

===3 The Task===

The OOV lexical items observed in our corpora are of different nature. Besides the "true neologisms", i.e., words qualifying for inclusion even into a traditional dictionary, and proper nouns (such as personal and geographical names) and their derivates, we can also find items traditionally not considered as "words": various abbreviations, acronyms and symbols, URLs or e-mail addresses, parts of foreign language quotations and, above all, all sorts of "typos" and "errors". Inflected word forms apply to almost all previously mentioned categories, which makes the whole picture even more complex.

In the following text we present an experiment aimed at amending the morphological lexicon used for training the language model(s) by a manually validated list of the most frequent OOV items derived from an annotated web corpus. The annotation is to be performed by graduate students of foreign languages, in the framework of the end-of-term assignment for the "Introduction to Corpus Linguistics" subject.

Having only limited "human power" (two groups with 46 students in total) at hand, we decided to follow the minimal two-fold setup (i.e., each item to be annotated by only two independent annotators) and make the task as simple as possible. This is why the annotators were not expected to check all the morphological categories provided by the respective tags, and they were asked to decide only on two parameters: lemma and word class (part of speech).

===4 The Data===

In the first step, we used data from the Araneum Slovacum Maximum 17.09 web corpus of approx. 3 Gigatokens that has been independently tagged both by the SNC MorphoDiTa and the Aranea TreeTagger pipelines, and subsequently merged into a single vertical file. Then, we converted the original SNC morphological tags to "PoS-only" tags and produced a frequency list of all lexical items indicated as OOV by both taggers. This list has been further filtered to exclude word forms contained in the Czech morphological lexicon¹⁰. After deleting the unused parameters, the resulting lists contained the frequency, word form, lemma assigned by the SNC guesser and PoS information derived from the tag assigned by TreeTagger (aTag, using the AUT¹¹ notation). This decision has been motivated by an observation that TreeTagger is typically more successful in assigning morphological categories for unknown words than MorphoDiTa.

As we naturally could expect to be able to process only a rather small part of the list, after some experimenting with various thresholds, we decided to pass into annotation only items appearing 50 or more times, yielding 77,169 items. This meant that each annotator would process approximately 3,300 items.

An example of the source data (after discarding the frequency information and adding a unique Id) is shown in Table 1.

{| class="wikitable"
|+ Table 1. Source Data
! Id !! Word !! Lemma !! aTag
|-
| sk_11184 || dvojťažiek || dvojťažka || Nn
|-
| sk_11185 || dvojťažiek || dvojťažky || Nn
|-
| sk_11186 || dvojťažka || dvojťažka || Nn
|-
| sk_11187 || Dvojťažka || dvojťažka || Nn
|-
| sk_11188 || Dvojťažka || Dvojťažka || Nn
|-
| sk_11189 || Dvojťažka || dvojťažka || Yx
|-
| sk_11190 || Dvojťažka || Dvojťažka || Yx
|-
| sk_11191 || dvojťažkách || dvojťažke || Nn
|-
| sk_11192 || dvojťažke || dvojťažka || Nn
|-
| sk_11193 || dvojťažkou || dvojťažka || Nn
|-
| sk_11194 || dvojťažku || dvojťažka || Nn
|-
| sk_11195 || dvojťažky || dvojťažka || Nn
|-
| sk_11196 || dvojťažky || dvojťažky || Av
|-
| sk_11197 || dvojťažky || dvojťažky || Nn
|-
| sk_11198 || dvojtisícovku || dvojtisícovka || Nn
|-
| sk_11199 || dvojtlačidlo || dvojtlačidlo || Nn
|-
| sk_11200 || dvojtraktovú || dvojtraktový || Aj
|-
| sk_11201 || dvojumývadlom || dvojumývadlom || Nn
|-
| sk_11202 || dvojumývadlom || dvojumývadlom || Yx
|-
| sk_11203 || dvojzákrutovej || dvojzákrutovej || Aj
|-
| sk_11204 || dvojzákrutovej || dvojzákrutovej || Yx
|-
| sk_11205 || dvojzápasovú || dvojzápasový || Aj
|-
| sk_11206 || dvojzónovú || dvojzónový || Aj
|-
| sk_11207 || dvolezite || dvolezite || Nn
|-
| sk_11208 || dvolezite || dvolezite || Yx
|-
| sk_11209 || Dvonča || Dvonča || Nn
|-
| sk_11210 || Dvonča || Dvonč || Nn
|-
| sk_11211 || Dvončom || Dvonča || Nn
|-
| sk_11212 || Dvončom || Dvonč || Nn
|}

We can observe several phenomena here. The same lexical item is in some cases tagged as "foreign", while in others as "noun" or "adjective", and the lemma form as well as its capitalization is sometimes guessed correctly and sometimes not. It can also be seen that many table items will in fact have to be merged after correcting the annotation, producing a smaller total of correct lines.

The overall task for the annotators was to produce correct data for all lines in the table. To minimize the number of necessary keystrokes and to keep track of the changes, the data have been further modified to contain two newly added columns: Lemmb, used as a template for correcting the value of Lemma (it is expected that most modifications will occur at the end of the respective string only), and bTag (to be filled only in case of a wrong PoS assignment).

As has already been mentioned, each item (line of the table) has to be annotated by two independent annotators. We decided, however, not to split the data in a straightforward way, but to assign each alphabetical segment of the data to three annotators using the following rule: each triple of lines will be split into three tuples containing the first and second, first and third, and second and third lines, respectively. Moreover, the whole lot of data has been split into three parts, so that each annotator could get three different sections of the alphabet in his or her data.

By applying this fairly "sophisticated" assignment scheme, we expected to improve the overall uniformity and quality of the output, as well as to prevent "collaboration" among students, as no two assigned lots were identical.

An excerpt of the data from Table 1 assigned to a single annotator is shown in Table 2.

{| class="wikitable"
|+ Table 2. Data to Annotate
! Id !! Word !! Lemma !! Lemmb !! bTag !! aTag
|-
| sk_11184 || dvojťažiek || dvojťažka || dvojťažka || || Nn
|-
| sk_11185 || dvojťažiek || dvojťažky || dvojťažky || || Nn
|-
| sk_11187 || Dvojťažka || dvojťažka || dvojťažka || || Nn
|-
| sk_11188 || Dvojťažka || Dvojťažka || Dvojťažka || || Nn
|-
| sk_11190 || Dvojťažka || Dvojťažka || Dvojťažka || || Yx
|-
| sk_11191 || dvojťažkách || dvojťažke || dvojťažke || || Nn
|-
| sk_11193 || dvojťažkou || dvojťažka || dvojťažka || || Nn
|-
| sk_11194 || dvojťažku || dvojťažka || dvojťažka || || Nn
|-
| sk_11196 || dvojťažky || dvojťažky || dvojťažky || || Av
|-
| sk_11197 || dvojťažky || dvojťažky || dvojťažky || || Nn
|}

Note that the "missing" every third Id results from the assignment scheme.

¹⁰ https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1836

¹¹ http://aranea.juls.savba.sk/aranea_about/aut.html

===5 The Crowd Annotation===

The split data has been uploaded as Excel spreadsheets to a shared Google disk and assigned randomly to the respective annotators. The task has been assigned in the middle of the semester, after the students had already become acquainted with the basic concepts of corpus morphosyntactic annotation and acquired elementary querying skills.

The instructions for annotating the data were as follows.

(A) Only the Lemmb and bTag columns may be modified.

(B) If both the Lemma and aTag values are correct, nothing has to be done.

(C) If the aTag value is wrong, the correct value should be inserted in bTag.

(D) If the Lemma value is wrong, it should be corrected in Lemmb.

(E) If the word form is an obvious typo (missing or superfluous letter, exchanged letters), or the word does not contain the necessary diacritics, the correct lemma marked by an asterisk should be entered in Lemmb.

(F) If the correct word form cannot be reconstructed by simple editing operations, i.e., cannot be recognized (e.g., part of a word as a result of hyphenation), the value of bTag will be "Er" (error).

(G) If the word form is an obvious foreign word, the value of bTag will be "Yx".

(H) It is not necessary to evaluate whether the word form is "literary"; words of "lower" registers (such as slang) also have "correct" lemmas.

The annotators were also instructed to check all "non-obvious" items by querying the corpus and analyzing the respective contexts. The initial training was performed during one teaching lesson in a computer lab, so that as many of the frequent problems as possible could be explained.

===6 First Results and Problems===

Out of 46 students, 43 managed to complete the assignments in time. Table 3 shows an example of correctly annotated data.

{| class="wikitable"
|+ Table 3. Annotated Data
! Id !! Word !! Lemma !! Lemmb !! bTag !! aTag
|-
| sk_11184 || dvojťažiek || dvojťažka || dvojťažka || || Nn
|-
| sk_11185 || dvojťažiek || dvojťažky || dvojťažka || || Nn
|-
| sk_11187 || Dvojťažka || dvojťažka || dvojťažka || || Nn
|-
| sk_11188 || Dvojťažka || Dvojťažka || dvojťažka || || Nn
|-
| sk_11190 || Dvojťažka || Dvojťažka || dvojťažka || Nn || Yx
|-
| sk_11191 || dvojťažkách || dvojťažke || dvojťažka || || Nn
|-
| sk_11193 || dvojťažkou || dvojťažka || dvojťažka || || Nn
|-
| sk_11194 || dvojťažku || dvojťažka || dvojťažka || || Nn
|-
| sk_11196 || dvojťažky || dvojťažky || dvojťažka || Nn || Av
|-
| sk_11197 || dvojťažky || dvojťažky || dvojťažka || || Nn
|-
| sk_11199 || dvojtlačidlo || dvojtlačidlo || dvojtlačidlo || || Nn
|-
| sk_11200 || dvojtraktovú || dvojtraktový || dvojtraktový || || Aj
|-
| sk_11202 || dvojumývadlom || dvojumývadlom || dvojumývadlo || Nn || Yx
|-
| sk_11203 || dvojzákrutovej || dvojzákrutovej || dvojzákrutový || || Aj
|-
| sk_11205 || dvojzápasovú || dvojzápasový || dvojzápasový || || Aj
|-
| sk_11206 || dvojzónovú || dvojzónový || dvojzónový || || Aj
|-
| sk_11208 || dvolezite || dvolezite || dôležitý* || Aj || Yx
|-
| sk_11209 || Dvonča || Dvonča || Dvonč || || Nn
|-
| sk_11211 || Dvončom || Dvonča || Dvonč || || Nn
|-
| sk_11212 || Dvončom || Dvonč || Dvonč || || Nn
|}

We can see that PoS information was corrected in four cases, lemma form in nine cases and its capitalization in two cases. One lexical item was marked as "error", as it lacked all diacritics and used nonstandard spelling.

The quick analysis, however, revealed that the annotation is much below the expected quality. We will discuss some of the issues. The basic statistics are shown in Table 4.

{| class="wikitable"
|+ Table 4. Results of Annotation
! !! Count !! % (of assigned lines) !! % (of lines annotated twice)
|-
| Assigned lines || 77,169 || 100.00 ||
|-
| Lines annotated at least once || 76,413 || 99.02 ||
|-
| Lines annotated twice || 60,048 || 77.81 || 100.00
|-
| Lines agreed on lemma || 39,469 || 51.15 || 65.73
|-
| Lines agreed on lemma and PoS || 33,371 || 43.24 || 55.57
|}

The rather low values of the raw inter-annotator agreement suggest that the resulting data has to be analyzed thoroughly before the procedure can be used within a similar larger-scale annotation attempt in the future.

The quick analysis revealed some frequent issues: different treatment of (prototypically) proper names written in lowercase, assigning PoS information to symbols and foreign words, incoherent use of asterisks, etc. Some of these issues can be solved by an automated procedure, but some will require more detailed instruction so that a correct annotation can be obtained.

After merging the duplicate "fully agreed" items from the previous table, 27,135 unique lines were obtained. Table 5 shows the word class distribution of the resulting data.

{| class="wikitable"
|+ Table 5. Annotated Data PoS Distribution
! PoS !! Count !! %
|-
| Nn || 20,043 || 73.86
|-
| Aj || 5,174 || 19.07
|-
| Pn || 46 || 0.17
|-
| Nm || 27 || 0.10
|-
| Vb || 464 || 1.71
|-
| Av || 261 || 0.96
|-
| Pp || 8 || 0.03
|-
| Cj || 10 || 0.04
|-
| Ij || 42 || 0.15
|-
| Pt || 24 || 0.09
|-
| Ab || 185 || 0.68
|-
| Xy || 1 || 0.00
|-
| Yx || 490 || 1.81
|-
| Er || 343 || 1.26
|-
| ? || 17 || 0.06
|-
| Total || 27,135 || 100.00
|}

The values in the table basically follow our expectations: most unrecognized items belong to the main content word classes, nouns and adjectives. Moreover, out of the 20,043 words tagged as nouns, 14,190 (70.80%) begin with an uppercase letter, i.e., they are most likely proper nouns.

The rather low value of the "Er" class can be explained by the observation that errors, despite being frequent, rarely behave "paradigmatically", i.e., a single correct word form can produce many different incorrect ones.

===7 Conclusions and Further Work===

There were several goals to be achieved by the annotation. Firstly, we would like to produce a validated list of the most frequent neologisms to be included in the morphological lexicon; at this stage, we do not even expect to generate full paradigms for those lexical items. Secondly, we wanted to get a list of the most frequent typos and other types of errors that could be used as a supplement to that lexicon, but also as source data for a future system for data normalization. And lastly, we also wanted to obtain a list of the most frequent foreign lexical items appearing in Slovak corpus data.

Although the detailed analysis of the annotated data is yet to be performed, some conclusions can be seen already. They can be summarized as follows:

(1) To minimize the consequences of students' failed assignments, a three-fold setup would probably be better.

(2) The Annotation Guidelines must be as precise as possible, showing not only the typical problems and their solutions, but also the seemingly "easy" cases. A one-page instruction, as it was in our case, is definitely not sufficient.

(3) The most common errors were associated with the treatment of proper nouns. An automatic procedure based on frequencies of lower/uppercased word forms would most likely perform better.

(4) The other common issue was the proper form of lemma for adjectives (it should be masculine and nominative singular). As the morphology of Slovak adjectives is fairly regular, a procedure to fix it automatically would be feasible.

(5) One of the fairly frequent PoS ambiguities in our data was the "Nn"/"Yx" (noun/foreign) case. The manually annotated data, however, show that the real number of "foreigns" is rather low, yet it introduces a lot of noise into the annotation process. It would therefore be reasonable to substitute all tags for "foreigns" with that of "nouns" in the future annotation.

In the near future, besides a new round of a similar annotation effort with an improved setup, we would like to combine its results with those obtained in the framework of the ensemble tagging experiment described in our other work [11].

===Acknowledgment===

This work has been, in part, funded by the Slovak KEGA and VEGA Grant Agencies, Project No. K-16-022-00, and 2/0017/17, respectively.

===References===

[1] E. Estellés-Arolas and F. González-Ladrón-de-Guevara. Towards an Integrated Crowdsourcing Definition. Journal of Information Science, 38 (2): 189–200, doi:10.1177/0165551512437638.

[2] M. Šimková and R. Garabík. Slovenský národný korpus (2002–2012): východiská, ciele a výsledky pre výskum a prax. In Jazykovedné štúdie XXXI. Rozvoj jazykových technológií a zdrojov na Slovensku a vo svete (10 rokov Slovenského národného korpusu). Ed. K. Gajdošová – A. Žáková. Bratislava: VEDA 2014, pp. 35–64.

[3] D. "johanka" Spoustová, J. Hajič, J. Raab and M. Spousta. Semi-Supervised Training for the Averaged Perceptron POS Tagger. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pp. 763–771, Athens, Greece, March. Association for Computational Linguistics.

[4] J. Straková, M. Straka and J. Hajič. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 13–18, Baltimore, Maryland, June 2014. Association for Computational Linguistics.

[5] V. Benko. Aranea: Yet Another Family of (Comparable) Web Corpora. In P. Sojka, A. Horák, I. Kopeček and Karel Pala (Eds.): Text, Speech and Dialogue. 17th International Conference, TSD 2014, Brno, Czech Republic, September 8–12, 2014. Proceedings. LNCS 8655. Springer International Publishing Switzerland, 2014.

[6] V. Benko. Two Years of Aranea: Increasing Counts and Tuning the Pipeline. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2016). Portorož: European Language Resources Association (ELRA), 2016, pp. 4245–4248. ISBN 978-2-9517408-9-1.

[7] H. Schmid. Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of International Conference on New Methods in Language Processing, Manchester. 1994.

[8] H. Schmid. Improvements in Part-of-Speech Tagging with an Application to German. Proceedings of the ACL SIGDAT-Workshop, Dublin. 1995.

[9] R. Garabík and M. Šimková. Slovak Morphosyntactic Tagset. In Journal of Language Modeling. Institute of Computer Science PAS, 2012, Vol. 0, No. 1, pp. 41–63.

[10] P. Rychlý. Manatee/Bonito – A Modular Corpus Manager. In 1st Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Masaryk University, 2007. pp. 65–70. ISBN 978-80-210-4471-5.

[11] V. Benko and R. Garabík. Ensemble Tagging Slovak Web Data. Accepted for presentation at the SlaviCorp 2018 Conference, Prague, 24–26 September, 2018. Unpublished.
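The assignment rule from Section 4 (each triple of lines split into the pairs first+second, first+third and second+third) can be illustrated with a minimal Python sketch. The function name `split_into_lots` and the handling of a trailing incomplete triple are assumptions made for illustration; the paper does not describe an implementation, and the additional split of the data into three alphabet sections is not modeled here.

```python
from itertools import combinations


def split_into_lots(items):
    """Split items into three lots so that every item lands in exactly two lots.

    For each consecutive triple (a, b, c), lot 0 receives (a, b),
    lot 1 receives (a, c) and lot 2 receives (b, c).
    A trailing incomplete triple is treated the same way, skipping the
    missing positions (an assumption; the paper does not cover this case).
    """
    pairs = list(combinations(range(3), 2))  # [(0, 1), (0, 2), (1, 2)]
    lots = [[], [], []]
    for start in range(0, len(items), 3):
        triple = items[start:start + 3]
        for lot, positions in zip(lots, pairs):
            lot.extend(triple[p] for p in positions if p < len(triple))
    return lots


# Ids sk_11184 .. sk_11198, i.e. the first five triples of Table 1
ids = [f"sk_{n}" for n in range(11184, 11199)]
lots = split_into_lots(ids)
```

With these Ids, `lots[0]` contains the same ten Ids as the excerpt in Table 2, which is why every third Id appears to be "missing" there.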
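The raw agreement figures of Table 4 can be recomputed from the counts alone. The following Python sketch shows one way to tally agreement between two annotators; the data layout (a mapping from line Id to a lemma/PoS pair) and the names `raw_agreement` and `pct` are hypothetical and not taken from the paper.

```python
def raw_agreement(first, second):
    """Return (lines annotated twice, agreed on lemma, agreed on lemma and PoS).

    `first` and `second` map a line Id to a (lemma, pos) pair given by one
    annotator; an Id missing from either mapping was not annotated twice.
    """
    both = first.keys() & second.keys()
    lemma_ok = sum(first[i][0] == second[i][0] for i in both)
    full_ok = sum(first[i] == second[i] for i in both)
    return len(both), lemma_ok, full_ok


def pct(part, whole):
    """Percentage rounded to two decimals, as printed in Table 4."""
    return round(100 * part / whole, 2)


# Toy annotations (invented for illustration only)
first = {"sk_1": ("dvojťažka", "Nn"), "sk_2": ("Dvonč", "Nn"), "sk_3": ("dvojzónový", "Aj")}
second = {"sk_1": ("dvojťažka", "Nn"), "sk_2": ("Dvonč", "Yx")}
twice, lemma_ok, full_ok = raw_agreement(first, second)

# Counts reported in Table 4
assigned, annotated_twice, agreed_lemma, agreed_full = 77169, 60048, 39469, 33371
table4 = {
    "twice / assigned": pct(annotated_twice, assigned),
    "lemma / twice": pct(agreed_lemma, annotated_twice),
    "lemma+PoS / twice": pct(agreed_full, annotated_twice),
}
```

This measures plain percentage agreement, as in the paper; a chance-corrected coefficient would need the full per-line annotations, which are not given here.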