=Paper= {{Paper |id=Vol-2203/126 |storemode=property |title=Crowdsourcing for Slovak Morphological Lexicon |pdfUrl=https://ceur-ws.org/Vol-2203/126.pdf |volume=Vol-2203 |authors=Vladimir Benko |dblpUrl=https://dblp.org/rec/conf/itat/Benko18 }} ==Crowdsourcing for Slovak Morphological Lexicon== https://ceur-ws.org/Vol-2203/126.pdf
S. Krajči (ed.): ITAT 2018 Proceedings, pp. 126–129
CEUR Workshop Proceedings Vol. 2203, ISSN 1613-0073, © 2018 Vladimír Benko




                                   Crowdsourcing for the Slovak Morphological Lexicon
                                                                    Vladimír Benko
                                           UNESCO Chair in Plurilingual and Multicultural Communication
                                                          Comenius University in Bratislava
                                                  Šafárikovo nám. 6, SK-81499 Bratislava, Slovakia
                                                                          and
                                            Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences
                                                      Panská 26, SK-81101 Bratislava, Slovakia

Abstract. We present an ongoing experiment aimed at improving the results of Slovak PoS tagging by increasing the size of the morphological lexicon used for training the respective tagger(s). The frequency list of out-of-vocabulary (OOV) word forms, along with the tags and lemmas assigned by the guesser, is manually checked, corrected and classified by students in the framework of assignments, so that valid candidate lexical items for inclusion in the morphological lexicon can be identified. We expect to improve the lexicon coverage by adding the most frequent proper names and foreign words, as well as to create an auxiliary lexicon containing the most frequent typos.

1    Introduction

“Crowdsourcing” is a relatively recent concept that encompasses many practices. This diversity blurs the limits of crowdsourcing, which may be identified with virtually any type of Internet-based collaborative activity, such as co-creation or user innovation [1]. In their paper, the authors define eight characteristics typical for crowdsourcing as follows:

• There is a clearly defined crowd (a)
• There exists a task with a clear goal (b)
• The recompense received by the crowd is clear (c)
• The crowdsourcer is clearly identified (d)
• The compensation to be received by the crowdsourcer is clearly defined (e)
• It is an online assigned process of participative type (f)
• It uses an open call of variable extent (g)
• It uses the Internet (h)

From this perspective, language data annotation performed by students in the framework of the end-of-term assignments can well be considered “crowdsourcing”, even if only some of the above characteristics apply. It is also worth noting that, according to our experience, students appreciate the feeling that their work may be useful not only as a basis for grading.

2    The Problem

Slovak is one of the languages for which more than one system for morphosyntactic annotation is available, and two of them are actively used in our work¹. They have been developed (partially independently) in the framework of two different research projects.

The Slovak National Corpus (SNC) [2] uses a system based on the new Czech MorphoDiTa tagger [3, 4] with a custom language model and a tool for guessing lemmas for unrecognized (out-of-vocabulary – OOV) lexical items, while the Aranea Project [5, 6] uses the more traditional TreeTagger [7, 8] with a custom language model, yet without any functionality to guess lemmas for the OOV lexical items. Both systems use the SNC tagset² [9] – a fine-grained positional tagset vaguely resembling the popular MULTEXT-East³ tagset utilized for several Slavic languages.

Language models for both systems, however, have been trained on the same source data – the 1.2 M token Manually morphologically annotated corpus⁴ and the SNC Morphology database⁵ covering approx. 100 K lemmas, yielding some 3.2 M inflected forms. This is why, although the two systems do not produce exactly the same output, they yield (almost) identical⁶ numbers of OOV items, and these numbers are rather high.

As both Slovak annotation systems explicitly indicate the OOV status of every token within a corpus, the situation can be conveniently analyzed with a corpus manager such as NoSketch Engine⁷ [10]. In the SNC corpora, the OOV status is indicated by the “XX” value of the “prec” attribute – this value can be observed in 54.5 million cases in the 1.37 Gigatoken prim-8.0-public-sane⁸ main corpus, which is 3.98% of all tokens.

In the web-based Araneum Slovacum Maximum⁹, where the OOV state is indicated by the “0” value of the “ztag” attribute, the situation is even worse – 135.5 million OOVs out of 2.96 Gigatokens, i.e., 4.57%. This can be explained by the rather “low quality” of web data that, despite all efforts in cleaning and filtering the source texts, naturally contains lots of “noise” of different kinds.

3    The Task

The OOV lexical items observed in our corpora are of different nature. Besides the “true neologisms”, i.e., words qualifying for inclusion even into a traditional dictionary, proper nouns (such as personal and geographical names) and their derivatives, we can also find items traditionally not considered as “words” – various abbreviations, acronyms and symbols, URLs or e-mail addresses, parts of foreign language quotations and – above all – all sorts of “typos” and “errors”. Inflected word forms occur in almost all of the previously mentioned categories, which makes the whole picture even more complex.

¹ We are aware of (at least) two more systems for morphosyntactic annotation of Slovak data that have been independently developed at Masaryk University in Brno and Charles University in Prague, respectively. These two systems, however, were not available for our work at the time of writing this paper.
² https://korpus.sk/morpho_en.html
³ http://nl.ijs.si/ME/V4/
⁴ https://korpus.sk/ver_r(2d)mak.html
⁵ https://korpus.sk/morphology_database.html
⁶ The differences are mainly caused by the fact that the TreeTagger-based system is also using word forms from the training corpus that were not present in the morphological database (mostly proper nouns) to amend the morphological lexicon.
⁷ https://nlp.fi.muni.cz/trac/noske
⁸ https://korpus.sk/prim(2d)8(2e)0.html
⁹ http://aranea.juls.savba.sk/aranea_about
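The OOV proportions quoted in Section 2 are simple ratios of OOV tokens to all tokens. The following Python sketch recomputes them from the rounded counts given in the text; the function name `oov_rate` and the dictionary layout are illustrative assumptions, and because the inputs are rounded, the results can differ from the reported percentages in the last digit.

```python
def oov_rate(oov_tokens: float, total_tokens: float) -> float:
    """Return the share of OOV tokens in a corpus as a percentage."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return 100.0 * oov_tokens / total_tokens


# Rounded counts as quoted in Section 2 (not the exact corpus sizes).
corpora = {
    "prim-8.0-public-sane": (54.5e6, 1.37e9),
    "Araneum Slovacum Maximum": (135.5e6, 2.96e9),
}

for name, (oov, total) in corpora.items():
    print(f"{name}: {oov_rate(oov, total):.2f}% OOV")
```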
In the following text we present an experiment aimed at amending the morphological lexicon used for training the language model(s) by a manually validated list of the most frequent OOV items derived from an annotated web corpus. The annotation is to be performed by graduate students of foreign languages, in the framework of the end-of-term assignment for the “Introduction to Corpus Linguistics” subject.

Having only limited “human power” (two groups with 46 students in total) at hand, we decided to follow the minimal two-fold setup (i.e., each item to be annotated by only two independent annotators) and to make the task as simple as possible. This is why the annotators were not expected to check all the morphological categories provided by the respective tags, and they were asked to decide only on two parameters – lemma and word class (part of speech).

4    The Data

In the first step, we used data from the Araneum Slovacum Maximum 17.09 web corpus of approx. 3 Gigatokens that has been independently tagged by both the SNC MorphoDiTa and the Aranea TreeTagger pipelines, and subsequently merged into a single vertical file. Then, we converted the original SNC morphological tags to “PoS-only” tags and produced a frequency list of all lexical items indicated as OOV by both taggers. This list has been further filtered to exclude word forms contained in the Czech morphological lexicon¹⁰. After deleting the unused parameters, the resulting lists contained the frequency, word form, lemma assigned by the SNC guesser and PoS information derived from the tag assigned by TreeTagger (aTag, using the AUT¹¹ notation). The choice of the TreeTagger tag has been motivated by the observation that TreeTagger is typically more successful in assigning morphological categories to unknown words than MorphoDiTa.

As we could naturally expect to be able to process only a rather small part of the list, after some experimenting with various thresholds, we decided to pass into annotation only items appearing 50 or more times, yielding 77,169 items. This meant that each annotator would process approximately 3,300 items.

An example of the source data (after discarding the frequency information and adding a unique Id) is shown in Table 1.

Table 1. Source Data
Id         Word             Lemma            aTag
sk_11184   dvojťažiek       dvojťažka        Nn
sk_11185   dvojťažiek       dvojťažky        Nn
sk_11186   dvojťažka        dvojťažka        Nn
sk_11187   Dvojťažka        dvojťažka        Nn
sk_11188   Dvojťažka        Dvojťažka        Nn
sk_11189   Dvojťažka        dvojťažka        Yx
sk_11190   Dvojťažka        Dvojťažka        Yx
sk_11191   dvojťažkách      dvojťažke        Nn
sk_11192   dvojťažke        dvojťažka        Nn
sk_11193   dvojťažkou       dvojťažka        Nn
sk_11194   dvojťažku        dvojťažka        Nn
sk_11195   dvojťažky        dvojťažka        Nn
sk_11196   dvojťažky        dvojťažky        Av
sk_11197   dvojťažky        dvojťažky        Nn
sk_11198   dvojtisícovku    dvojtisícovka    Nn
sk_11199   dvojtlačidlo     dvojtlačidlo     Nn
sk_11200   dvojtraktovú     dvojtraktový     Aj
sk_11201   dvojumývadlom    dvojumývadlom    Nn
sk_11202   dvojumývadlom    dvojumývadlom    Yx
sk_11203   dvojzákrutovej   dvojzákrutovej   Aj
sk_11204   dvojzákrutovej   dvojzákrutovej   Yx
sk_11205   dvojzápasovú     dvojzápasový     Aj
sk_11206   dvojzónovú       dvojzónový       Aj
sk_11207   dvolezite        dvolezite        Nn
sk_11208   dvolezite        dvolezite        Yx
sk_11209   Dvonča           Dvonča           Nn
sk_11210   Dvonča           Dvonč            Nn
sk_11211   Dvončom          Dvonča           Nn
sk_11212   Dvončom          Dvonč            Nn

We can observe several phenomena here. The same lexical item is in some cases tagged as “foreign” and in others as “noun” or “adjective”, and the lemma form as well as its capitalization is sometimes guessed correctly and sometimes not. It can also be seen that many table items will in fact have to be merged after correcting the annotation, yielding a smaller total number of correct lines.

The overall task for the annotators was to produce correct data for all lines in the table. To minimize the number of necessary keystrokes and to keep track of the changes, the data have been further modified to contain two newly added columns – Lemmb, used as a template for correcting the value of Lemma (it is expected that most modifications will occur at the end of the respective string only), and bTag (to be filled only in case of wrong PoS assignment).

As has already been mentioned, each item (line of the table) has to be annotated by two independent annotators. We decided, however, not to split the data in a straightforward way, but to assign each alphabetical segment of the data to three annotators using the following rule: each triple of lines is split into three pairs containing the first and second, the first and third, and the second and third lines, respectively. Moreover, the whole lot of data has been split into three parts, so that each annotator could get three different sections of the alphabet in his or her data.

By applying this fairly “sophisticated” assignment scheme, we expected to improve the overall uniformity and quality of the output, as well as to prevent “collaboration” among students, as no two assigned lots were identical.

An excerpt of the data from Table 1 assigned to a single annotator is shown in Table 2.

Table 2. Data to Annotate
Id         Word          Lemma       Lemmb       bTag   aTag
sk_11184   dvojťažiek    dvojťažka   dvojťažka          Nn
sk_11185   dvojťažiek    dvojťažky   dvojťažky          Nn
sk_11187   Dvojťažka     dvojťažka   dvojťažka          Nn
sk_11188   Dvojťažka     Dvojťažka   Dvojťažka          Nn
sk_11190   Dvojťažka     Dvojťažka   Dvojťažka          Yx
sk_11191   dvojťažkách   dvojťažke   dvojťažke          Nn
sk_11193   dvojťažkou    dvojťažka   dvojťažka          Nn
sk_11194   dvojťažku     dvojťažka   dvojťažka          Nn
sk_11196   dvojťažky     dvojťažky   dvojťažky          Av
sk_11197   dvojťažky     dvojťažky   dvojťažky          Nn

Note that every third Id is “missing” as a result of the assignment scheme.
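The triple-splitting rule described in Section 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the names `PAIRINGS`, `split_segment` and `annotation_counts` are hypothetical, the handling of an incomplete final triple is our own choice, and the additional step of giving each annotator three different sections of the alphabet is not modeled.

```python
from collections import Counter
from typing import Dict, List, Sequence

# Positions within a triple handed to each of the three annotators:
# first and second, first and third, second and third.
PAIRINGS = ((0, 1), (0, 2), (1, 2))


def split_segment(lines: Sequence[str]) -> List[List[str]]:
    """Split one alphabetical segment into three annotator lots.

    Assumption: in an incomplete final triple, each annotator simply
    receives those of "their" positions that exist.
    """
    lots: List[List[str]] = [[], [], []]
    for start in range(0, len(lines), 3):
        triple = lines[start:start + 3]
        for lot, positions in zip(lots, PAIRINGS):
            lot.extend(triple[p] for p in positions if p < len(triple))
    return lots


def annotation_counts(lots: List[List[str]]) -> Dict[str, int]:
    """Count how many annotators received each line."""
    return dict(Counter(line for lot in lots for line in lot))


if __name__ == "__main__":
    ids = [f"sk_{n}" for n in range(11184, 11198)]
    for number, lot in enumerate(split_segment(ids), start=1):
        print(number, lot)
```

With the Ids sk_11184 to sk_11197 as input, the first lot contains the same Ids as Table 2, and every line lands in exactly two lots, which is the two-fold setup the scheme is meant to guarantee.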
¹⁰ https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1836
¹¹ http://aranea.juls.savba.sk/aranea_about/aut.html

5    The Crowd Annotation

The split data has been uploaded as Excel spreadsheets to a shared Google Drive and assigned randomly to the respective annotators. The task has been assigned in the middle of the semester, after the students had become acquainted with the basic concepts of corpus morphosyntactic annotation and acquired elementary querying skills.

The instructions for annotating the data were as follows.

(A) Only the Lemmb and bTag columns may be modified.
(B) If both the Lemma and aTag values are correct, nothing has to be done.
(C) If the aTag value is wrong, the correct value should be inserted in bTag.
(D) If the Lemma value is wrong, it should be corrected in Lemmb.
(E) If the word form is an obvious typo (missing or superfluous letter, exchanged letters), or the word does not contain the necessary diacritics, the correct lemma marked by an asterisk should be entered in Lemmb.
(F) If the correct word form cannot be reconstructed by simple editing operations, i.e., cannot be recognized (e.g., part of the word as a result of hyphenation), the value of bTag will be “Er” (error).
(G) If the word form is an obvious foreign word, the value of bTag will be “Yx”.
(H) It is not necessary to evaluate whether the word form is “literary” – words of “lower” registers (such as slang) also have “correct” lemmas.

The annotators were also instructed to check all “non-obvious” items by querying the corpus and analyzing the respective contexts. The initial training was performed during one teaching lesson in a computer lab, so that as many of the frequent problems as possible could be explained.

6    First Results and Problems

Out of 46 students, 43 managed to complete the assignments in time. Table 3 shows an example of the correctly annotated data.
Table 3. Annotated Data
Id         Word             Lemma            Lemmb           bTag   aTag
sk_11184   dvojťažiek       dvojťažka        dvojťažka              Nn
sk_11185   dvojťažiek       dvojťažky        dvojťažka              Nn
sk_11187   Dvojťažka        dvojťažka        dvojťažka              Nn
sk_11188   Dvojťažka        Dvojťažka        dvojťažka              Nn
sk_11190   Dvojťažka        Dvojťažka        dvojťažka       Nn     Yx
sk_11191   dvojťažkách      dvojťažke        dvojťažka              Nn
sk_11193   dvojťažkou       dvojťažka        dvojťažka              Nn
sk_11194   dvojťažku        dvojťažka        dvojťažka              Nn
sk_11196   dvojťažky        dvojťažky        dvojťažka       Nn     Av
sk_11197   dvojťažky        dvojťažky        dvojťažka              Nn
sk_11199   dvojtlačidlo     dvojtlačidlo     dvojtlačidlo           Nn
sk_11200   dvojtraktovú     dvojtraktový     dvojtraktový           Aj
sk_11202   dvojumývadlom    dvojumývadlom    dvojumývadlo    Nn     Yx
sk_11203   dvojzákrutovej   dvojzákrutovej   dvojzákrutový          Aj
sk_11205   dvojzápasovú     dvojzápasový     dvojzápasový           Aj
sk_11206   dvojzónovú       dvojzónový       dvojzónový             Aj
sk_11208   dvolezite        dvolezite        dôležitý*       Aj     Yx
sk_11209   Dvonča           Dvonča           Dvonč                  Nn
sk_11211   Dvončom          Dvonča           Dvonč                  Nn
sk_11212   Dvončom          Dvonč            Dvonč                  Nn
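
Instructions (B) to (E) imply a simple way of reading a finished row such as those in Table 3: the final lemma is the content of Lemmb, the final PoS is bTag if it has been filled in and aTag otherwise, and an asterisk flags a typo. The Python sketch below illustrates this reading; the class `AnnotatedRow`, the function `resolve` and the assumption that the asterisk is always appended at the end of the lemma (as in row sk_11208) are ours and are not part of the original annotation tooling.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class AnnotatedRow:
    id: str
    word: str
    lemma: str   # lemma proposed by the guesser (read-only for annotators)
    lemmb: str   # annotator-editable copy of the lemma
    btag: str    # filled in only when aTag is wrong, otherwise empty
    atag: str    # PoS derived from the TreeTagger tag


def resolve(row: AnnotatedRow) -> Tuple[str, str, bool]:
    """Return (final lemma, final PoS, is_typo) for one annotated row.

    Assumption: a typo is marked by a single trailing asterisk in Lemmb.
    """
    is_typo = row.lemmb.endswith("*")
    final_lemma = row.lemmb[:-1] if is_typo else row.lemmb
    final_pos = row.btag if row.btag else row.atag
    return final_lemma, final_pos, is_typo


if __name__ == "__main__":
    rows = [
        AnnotatedRow("sk_11190", "Dvojťažka", "Dvojťažka", "dvojťažka", "Nn", "Yx"),
        AnnotatedRow("sk_11208", "dvolezite", "dvolezite", "dôležitý*", "Aj", "Yx"),
    ]
    for row in rows:
        print(row.id, resolve(row))
```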


We can see that PoS information was corrected in four cases, lemma form in nine cases and its capitalization in two cases. One lexical item was marked as “error”, as it lacked all diacritics and used nonstandard spelling.

A quick analysis, however, revealed that the annotation is much below the expected quality. We will discuss some of the issues. The basic statistics are shown in Table 4.

Table 4. Results of Annotation
                                 Count   % of assigned   % of twice-annotated
Assigned lines                  77,169          100.00
Lines annotated at least once   76,413           99.02
Lines annotated twice           60,048           77.81                 100.00
Lines agreed on lemma           39,469           51.15                  65.73
Lines agreed on lemma and PoS   33,371           43.24                  55.57

The rather low values of the raw inter-annotator agreement suggest that the resulting data has to be analyzed thoroughly before the procedure can be used in a similar larger-scale annotation attempt in the future.

The quick analysis revealed some frequent issues – different treatment of (prototypically) proper names written in lowercase, assigning PoS information to symbols and foreign words, incoherent use of asterisks, etc. Some of these issues can be solved by an automated procedure, but some will require more detailed instructions so that a correct annotation can be obtained.

After merging the duplicate “fully agreed” items (those agreed on both lemma and PoS in Table 4), 27,135 unique lines were obtained. Table 5 shows the word class distribution of the resulting data.

Table 5. Annotated Data PoS Distribution
PoS       Count        %
Nn       20,043    73.86
Aj        5,174    19.07
Pn           46     0.17
Nm           27     0.10
Vb          464     1.71
Av          261     0.96
Pp            8     0.03
Cj           10     0.04
Ij           42     0.15
Pt           24     0.09
Ab          185     0.68
Xy            1     0.00
Yx          490     1.81
Er          343     1.26
?            17     0.06
Total    27,135   100.00
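
The raw agreement figures in Table 4 can be reproduced from the counts alone, and the counts themselves can be obtained by comparing the two annotators' final decisions line by line. The following Python sketch shows one way to do this; the function names `agreement` and `pct` and the representation of each annotator's output as a dictionary from Id to a (lemma, PoS) pair are illustrative assumptions, and the measure is plain percentage agreement with no correction for chance.

```python
from typing import Dict, Tuple

Decision = Tuple[str, str]  # (final lemma, final PoS)


def agreement(first: Dict[str, Decision],
              second: Dict[str, Decision]) -> Dict[str, int]:
    """Count lines annotated twice and lines on which both annotators agree."""
    both = first.keys() & second.keys()
    lemma_agreed = sum(1 for i in both if first[i][0] == second[i][0])
    fully_agreed = sum(1 for i in both if first[i] == second[i])
    return {"twice": len(both), "lemma": lemma_agreed, "lemma_pos": fully_agreed}


def pct(part: int, whole: int) -> float:
    """Percentage rounded to two decimals, as printed in Table 4."""
    return round(100.0 * part / whole, 2)


if __name__ == "__main__":
    # Toy data: two annotators, three shared lines.
    a = {"x1": ("dvojťažka", "Nn"), "x2": ("Dvonč", "Nn"), "x3": ("dôležitý", "Aj")}
    b = {"x1": ("dvojťažka", "Nn"), "x2": ("Dvonča", "Nn"), "x3": ("dôležitý", "Av")}
    print(agreement(a, b))

    # Percentages recomputed from the counts in Table 4.
    assigned, twice, lemma, lemma_pos = 77169, 60048, 39469, 33371
    print(pct(twice, assigned), pct(lemma, twice), pct(lemma_pos, twice))
```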
The values in Table 5 basically follow our expectations: most unrecognized items belong to the main content word classes – nouns and adjectives. Moreover, out of the 20,043 words tagged as nouns, 14,190 (70.80%) begin with an uppercase letter, i.e., they are most likely proper nouns.

The rather low value of the “Er” class can be explained by the observation that errors, despite being frequent, rarely behave “paradigmatically”, i.e., a single correct word form can produce many different incorrect ones.

7    Conclusions and Further Work

There were several goals to be achieved by the annotation. Firstly, we wanted to produce a validated list of the most frequent neologisms to be included in the morphological lexicon; at this stage, we do not even expect to generate full paradigms for those lexical items. Secondly, we wanted to get a list of the most frequent typos and other types of errors that could be used as a supplement to that lexicon, but also as source data for a future system for data normalization. And lastly, we also wanted to obtain a list of the most frequent foreign lexical items appearing in Slovak corpus data.

Although the detailed analysis of the annotated data is yet to be performed, some conclusions can already be drawn. They can be summarized as follows:

(1) To minimize the consequences of students’ failed assignments, a three-fold setup would probably be better.
(2) The Annotation Guidelines must be as precise as possible, showing not only the typical problems and their solutions, but also the seemingly “easy” cases. A one-page instruction, as it was in our case, is definitely not sufficient.
(3) The most common errors were associated with the treatment of proper nouns. An automatic procedure based on frequencies of lower/uppercased word forms would most likely perform better.
(4) The other common issue was the proper form of lemma for adjectives (it should be masculine and nominative singular). As the morphology of Slovak adjectives is fairly regular, a procedure to fix it automatically would be feasible.
(5) One of the fairly frequent PoS ambiguities in our data was the “Nn”/“Yx” (noun/foreign) case. The manually annotated data, however, show that the real number of “foreigns” is rather low, yet it introduces a lot of noise into the annotation process. It would therefore be reasonable to substitute all tags for “foreigns” with that of “nouns” in the future annotation.

In the near future, besides a new round of a similar annotation effort with an improved setup, we would like to combine its results with those obtained in the framework of the ensemble tagging experiment described in our other work [11].

Acknowledgment

This work has been, in part, funded by the Slovak KEGA and VEGA Grant Agencies, Project No. K-16-022-00, and 2/0017/17, respectively.

References

[1] E. Estellés-Arolas and F. González-Ladrón-de-Guevara. Towards an Integrated Crowdsourcing Definition. Journal of Information Science, 38 (2): 189–200, doi:10.1177/0165551512437638.
[2] M. Šimková and R. Garabík. Slovenský národný korpus (2002–2012): východiská, ciele a výsledky pre výskum a prax. In Jazykovedné štúdie XXXI. Rozvoj jazykových technológií a zdrojov na Slovensku a vo svete (10 rokov Slovenského národného korpusu). Ed. K. Gajdošová – A. Žáková. Bratislava: VEDA 2014, pp. 35–64.
[3] D. “johanka” Spoustová, J. Hajič, J. Raab and M. Spousta. Semi-Supervised Training for the Averaged Perceptron POS Tagger. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pp. 763–771, Athens, Greece, March. Association for Computational Linguistics.
[4] J. Straková, M. Straka and J. Hajič. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 13–18, Baltimore, Maryland, June 2014. Association for Computational Linguistics.
[5] V. Benko. Aranea: Yet Another Family of (Comparable) Web Corpora. In P. Sojka, A. Horák, I. Kopeček and Karel Pala (Eds.): Text, Speech and Dialogue. 17th International Conference, TSD 2014, Brno, Czech Republic, September 8–12, 2014. Proceedings. LNCS 8655. Springer International Publishing Switzerland, 2014.
[6] V. Benko. Two Years of Aranea: Increasing Counts and Tuning the Pipeline. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2016). – Portorož : European Language Resources Association (ELRA), 2016, pp. 4245–4248. ISBN 978-2-9517408-9-1.
[7] H. Schmid. Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of International Conference on New Methods in Language Processing, Manchester. 1994.
[8] H. Schmid. Improvements in Part-of-Speech Tagging with an Application to German. Proceedings of the ACL SIGDAT-Workshop, Dublin. 1995.
[9] R. Garabík and M. Šimková. Slovak Morphosyntactic Tagset. In Journal of Language Modeling. Institute of Computer Science PAS, 2012, Vol. 0, No. 1, pp. 41–63.
[10] P. Rychlý. Manatee/Bonito – A Modular Corpus Manager. In 1st Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Masaryk University, 2007. pp. 65–70. ISBN 978-80-210-4471-5.
[11] V. Benko and R. Garabík. Ensemble Tagging Slovak Web Data. Accepted for presentation at the SlaviCorp 2018 Conference, Prague, 24–26 September, 2018. Unpublished.