=Paper=
{{Paper
|id=Vol-3226/paper14
|storemode=property
|title=Global Variants in the Czech Language
|pdfUrl=https://ceur-ws.org/Vol-3226/paper14.pdf
|volume=Vol-3226
|authors=Jaroslava Hlaváčová,Lukáš Kyjánek,Magda Ševčíková
|dblpUrl=https://dblp.org/rec/conf/itat/HlavacovaKS22
}}
==Global Variants in the Czech Language==
Jaroslava Hlaváčová, Lukáš Kyjánek and Magda Ševčíková

Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Malostranské náměstí 25, 118 00 Prague, Czechia

Contact: hlavacova@ufal.mff.cuni.cz (J. Hlaváčová); kyjanek@ufal.mff.cuni.cz (L. Kyjánek); sevcikova@ufal.mff.cuni.cz (M. Ševčíková)

ITAT'22: Information technologies – Applications and Theory, September 23–27, 2022, Zuberec, Slovakia. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

===Abstract===
Some Czech words can be written in several different ways, e.g., lampion ∼ lampión (lampion). This variability may occur either in some inflectional wordforms only (inflectional variants), cf. hradu ∼ hradě in the locative case of the noun hrad (castle), or across all inflectional wordforms and derivatives (global variants), cf. fantazijní ∼ fantasijní in the adjective derived from the noun fantazie ∼ fantasie (fantasy). It is reasonable to treat global variants as different words but to have formal means that interconnect them in Natural Language Processing systems and resources. In this paper, we describe the identification of global variants in the Czech vocabulary and summarise new changes in the MorfFlex CZ dictionary and the DeriNet lexicon concerning this type of variants. We reviewed several typical patterns within global variants captured in the available resources and combined a set of regular expressions with manual annotation to achieve the highest possible precision of the identification.

Keywords: global and inflectional variant, morphology, word derivation, Czech

===1. Introduction===
The written form is one of the possible representations of languages. Czech speakers must learn and use the relevant script, rules, and regularities of the writing system because of its substantial standardisation and codification. However, some words can still be written in several slightly different ways, in so-called orthographic (spelling) variants, e.g., citron ∼ citrón (lemon), museum ∼ muzeum (museum), peepshow ∼ peepšou ∼ pípšou (peepshow).

The emergence of orthographic variants in Czech is influenced by various factors such as the spoken representation of Czech, language development, and language contact. Some cases of variability are only temporary, lasting until one of the orthographic variants is established and codified as the preferred one (which can take years or decades). However, codified or not, many of the orthographic variants appear in texts produced by speakers, which complicates work with language resources and all sorts of NLP applications.

Adhering to the current decisions on annotating this phenomenon in the corpus PDT-C (cf. its manual in [10]), we distinguish two types of orthographic variants. Inflectional variants are relatively regular variants within a set of wordforms of a given word, e.g., the locative case of some masculine inanimate nouns like obchod (shop): obchodu ∼ obchodě. Global variants address the variability in all the inflectional forms, often also in derived words, e.g., the prothetic v- attached to the noun obchod ∼ vobchod (shop), to the derived verb obchodovat ∼ vobchodovat (to trade), and to the derived adjective obchodní ∼ vobchodní (commercial). (Footnote 1: Global variants are also called full-paradigm variants in the complete description of the morphological annotation of the corpus PDT-C [10, pp. 36–42]. We stick to the term global, as it is shorter, but the two terms are equivalent. Footnote 2: The prothetic v- originates from Common Czech and is not codified. It is sub-standard, but it occurs in the written language.) All those words manifest the same difference in every single wordform of their inflectional paradigms, for instance, in the genitive case of the noun (obchodu ∼ vobchodu) and of the adjective (obchodního ∼ vobchodního). On the other hand, somewhere between the two defined types, there are also several cases in which the variability is limited to a few forms, cf. the infinitive and past participles of the verb myslet ∼ myslit (to think), while the remaining wordforms are identical.

The inflectional variants are captured in the morphological dictionary MorfFlex CZ (hereafter MorfFlex) by means of the 15th position in the morphological tag describing the morphological categories of a given wordform [2]. Until the 2020 edition of MorfFlex, there was no distinction between the description of global and inflectional variants. All the variants were marked at the 15th position of the Prague positional tagset [1]. In the latest version, the global variants are annotated by means of links between them. (Footnote 3: There are more ways to interconnect the global variants, see [4], [5], [6].) In MorfFlex, one word from the 𝑛-tuple of variants is selected as the basic one. All other variants are linked with the basic one by means of additional pieces of information in their lemma. This information contains not only the basic variant, but also the (rough) style of the variant. There is no strict rule for selecting the basic variant, as none of the possible criteria is easy to formulate or check. Usually, the more common (frequent) or standard (over non-standard) variant was taken as the basic one. The final selection of the basic variant depended on the lexicographer's opinion.

Until recently, no special attention was given to the completeness of global variants within the whole dictionary. That is why we reviewed available resources addressing the issue of orthographic variability (Section 2) and extracted typical patterns for global variants from them. We applied these patterns to the set of lemmas from MorfFlex to find as many global variant candidates as possible. We also exploited the DeriNet lexicon [14], which models word-formation relations in Czech, to search for global variants within derivationally related words (Section 3). After manual filtration of the obtained global variants, our work resulted in interconnecting global variants in MorfFlex. They will appear in its next version. They were also partly uploaded to the DeriNet 2.1 lexicon.

The resulting data and categorisation (Sections 4 and 5) have potential in two research directions. First, they lead to the improvement of the Natural Language Processing applications for which the morphological dictionaries serve as the background data, cf. [11], [13], and [12]. Second, they contribute to a wider linguistic discussion on (not only orthographic) variability in Czech, especially in the context of border cases between inflectional and global variants mentioned above and exemplified before the conclusions of this paper.

====1.1. Terminology====
This section sums up the definitions of the basic terms used in this paper, so that readers can easily look them up while reading the more advanced parts of the text.

* Inflectional paradigm is the set of all wordforms derived by means of inflection from a citation wordform (so-called lemma); e.g., the inflectional paradigm of the lemma obchod consists of the wordforms obchod, obchodu, obchodě, …
* Inflectional variants are a pair (or generally an 𝑛-tuple) of wordforms belonging to the same inflectional paradigm of the same lemma and having the same values of all morphological categories, but different spellings; e.g., the singular locative case of the lemma obchod is obchodu ∼ obchodě.
* Global variants are a pair (or generally an 𝑛-tuple) of lemmas whose difference in spelling propagates to all wordforms of their inflectional paradigms and to most of their derivationally related words; e.g., obchod ∼ vobchod → obchodní ∼ vobchodní (apart from the first letter, the inflectional paradigms of these words are identical).
* Basic variant is the representative variant for an 𝑛-tuple of global variants.

===2. Available Resources Containing Variants===
Since any researcher working on a lexical resource must inevitably also process orthographic variants, the existing language resources for Czech contain some annotation of them. However, capturing variants is not the primary goal of any of the resources, so systematic care is much needed in this kind of annotation. The available digitised resources provide a good point of departure for studying orthographic variants, but we see the following two shortcomings in them. First, the number of captured orthographic variants is often relatively low. Second, the orthographic variants are not treated across derivation, e.g., úřad ∼ ouřad (office) is captured but úřadovat ∼ ouřadovat (to officiate) is not.

There are various language resources and grammar books that include this kind of annotation; however, in the following paragraphs, we describe only those that are available and digitised, and thus machine-readable.

MorfFlex 2.0 [2], the lexicon of the inflectional morphology of Czech, already includes annotation of global variants in its currently available version. It classifies them into three types: standard (label DD, e.g., lavor ∼ lavór (pail)), Common Czech/non-standard (label GC, e.g., oprášit ∼ voprášit (to dust down)) and distortion/typo (label DS, e.g., Dominigue ∼ Dominique). However, as the results of the work herein show, we were able to find many more 𝑛-tuples of variants not included in the current version.

VALLEX 3.0 [9] is the valency lexicon of Czech verbs, which also interconnects and labels several orthographic variants of verbs. Their representation is systematic, but the definition of a variant seems to differ from ours. For instance, VALLEX also marks chytit ∼ chytnout (to catch) as variants, although the status of this pair is arguable.

Slovník spisovného jazyka českého (SSJČ) [3], the explanatory dictionary of Czech digitised into a semi-structured format (footnote 4), covers a wide vocabulary and includes annotations of orthographic variants without any additional classification. However, it marks these variants in the glosses of the dictionary in various ways, e.g., by using

* v. = viz (see) in "obepsati v. opsati" (copy),
* a comma in "mysliti, mysleti" (to think), or
* řidč. = řidčeji (rarely) in "zpěvánka, řidč. zpěvanka, zpívánka" ([usually a short] song).

Apart from these language resources, there is also a long tradition of linguistic studies on the topic of orthographic variants, cf. the prefixes s-/se- and z-/ze- and the orthography of loanwords in general in [15, pp. 167–170], morphological variability and its reflection in orthography in [16, pp. 268–276], the orthography of loanwords with the s/z characters, such as analýza ∼ analýsa (analysis), in [17], and terminological issues of the phenomenon in [7]. These studies provide extensive lists of variants or of patterns that are shared across variants. We extracted some patterns from them, which helped us to identify typical general properties of global variants.

===3. Searching for Global Variants===
We developed a semi-automatic procedure that consists of four straightforward consecutive steps to search for Czech global variants. We exploited the available resources mentioned in the previous section and extracted frequent patterns that appear in global variants. We applied these patterns to the set of lemmas from MorfFlex 2.0 to obtain 𝑛-tuples of global variants, such as extremismus ∼ extrémismus ∼ extremizmus ∼ extrémizmus (extremism).

To also get derivationally related global variants that were not identified on the basis of the extracted patterns, we included DeriNet in the process. Thanks to that, we also covered 𝑛-tuples like extremistický ∼ extrémistický (extremist) for the already identified variants of the base word extremism mentioned above. The resulting 𝑛-tuples were manually annotated to eliminate accidentally similar words. In the last step, the data were uploaded as a new type of annotation into DeriNet, and, in parallel, new links were also added to MorfFlex.

====3.1. Extracting Variants from the Existing Resources====
We started with assembling the existing resources listed in Section 2 and extracting variants from them. While the extractions from MorfFlex and VALLEX were easy, as these resources are designed to be processed automatically, we had to use regular expressions to extract candidates for variants from SSJČ because, in its digitised version, it stores data in a semi-structured file format (see Section 2). Consequently, we checked whether the words extracted from this resource are attested in the vocabulary of MorfFlex to filter out incorrectly extracted, non-existent, and archaic words. We also wrote down examples and patterns from the relevant linguistic studies like those cited in the previous section.

When comparing the 𝑛-tuples of variants extracted from the resources, we observed that MorfFlex includes more than two thirds of the variants captured in SSJČ. The remaining third of the variants from SSJČ seems questionable, e.g., zvýhodněný ∼? zvýhodnělý (privileged), in which the variability does not affect individual characters but the affixes (and can thus change the word meaning). The variants extracted from VALLEX cover only verbs; most of them are not included in the other resources.

====3.2. Formalising Regular Patterns====
Having a relatively large number of 𝑛-tuples with global variants, we considered whether to use pattern-matching algorithms or to formalise frequent patterns manually. We chose the latter way as it allowed us to have a better overview of the processed data and to create more complicated patterns that take into account not only character changes but also morpho-syntactic categories of words. For instance, this decision allowed us to avoid interconnecting the masculine animate variant pair česač ∼ česáč (a man harvesting apples) with the masculine inanimate variant pair česač ∼ česáč (an instrument for harvesting fruits of tall trees).

We exploited various types of intersections of the extracted lists of variant 𝑛-tuples and their sorting to be able to infer frequent patterns that occur in global variants. We first looked at the global variants extracted from all three resources, then at those that occurred in at least two resources, and only then at those that were in individual resources but not in the others. More than one hundred observed patterns were formalised as regular expressions, e.g., ^o.* ↔ ^vo.* in obchod ∼ vobchod (shop). The relevant morpho-syntactic categories were also stored with the particular regular expression.

====3.3. Applying Patterns to MorfFlex====
We applied the formalised patterns to all the lemmas from MorfFlex (we did not search for inflectional variants, so we did not have to take wordforms into account).
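The application of a formalised pattern to a list of lemmas can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the prothetic v- rule (^o.* ↔ ^vo.*) and the -ismus ∼ -izmus alternation come from the paper, whereas the toy lexicon, the category labels, the tuple layout of a pattern, and the function name `find_candidates` are assumptions made up for the example.

```python
import re

# Hypothetical pattern records: (regex on one variant, replacement, category
# stored with the pattern). None means the pattern is not restricted by category.
PATTERNS = [
    (re.compile(r"^o"), "vo", None),            # prothetic v-: obchod ~ vobchod
    (re.compile(r"ismus$"), "izmus", "NOUN"),   # -ismus ~ -izmus
]

# Toy lexicon (assumed format): lemma -> morpho-syntactic category.
LEXICON = {
    "obchod": "NOUN", "vobchod": "NOUN",
    "okno": "NOUN", "vokno": "NOUN",
    "mechanismus": "NOUN", "mechanizmus": "NOUN",
    "ovce": "NOUN",  # its rewritten form is not attested in this toy lexicon
}


def find_candidates(lexicon, patterns):
    """Return sorted (lemma, variant) pairs produced by a pattern and attested in the lexicon."""
    pairs = set()
    for lemma, category in lexicon.items():
        for regex, replacement, required in patterns:
            # Skip patterns whose stored category does not match the lemma.
            if required is not None and category != required:
                continue
            variant, hits = regex.subn(replacement, lemma, count=1)
            # Keep the pair only if the rewritten form is an attested lemma
            # of the same category.
            if hits and lexicon.get(variant) == category:
                pairs.add((lemma, variant))
    return sorted(pairs)


if __name__ == "__main__":
    for lemma, variant in find_candidates(LEXICON, PATTERNS):
        print(f"{lemma} ~ {variant}")
```

The pairs found this way are only candidates; in the procedure described in the paper they are still subject to further checks and manual filtering.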
To achieve higher precision, we also exploited the knowledge of morpho-syntactic categories of the candidates for global variants. If these categories, e.g., grammatical gender or animacy, differed between the candidate words, the candidates were excluded, cf. the masculine animate noun car (tsar) ≁ the masculine inanimate cár (shred).

Footnote 4 (digitised SSJČ): https://ssjc.ujc.cas.cz/

In order to obtain a more consistent list of variants, we also took derivational morphology from DeriNet into account. For each identified global variant, we inspected the relevant sub-tree of derivationally related words and tried to identify the same patterns among the derivatives.

The resulting 𝑛-tuples of global variants were also manually filtered. The annotator was provided with lists of 𝑛-tuples of global variant candidates; the task was to go through the lists and exclude those 𝑛-tuples that only accidentally met a derivational pattern but were not variants. For example, the pair fiala (wallflower) ≁ fiála (pinnacle) had to be excluded manually, although the same pattern works well in real variants like neandrtalec ∼ neandrtálec (Neanderthal). During the manual work on the global variant candidates we also identified inflectional variants with a variant lemma; see the example of the pair of verbs myslit, myslet.

Table 1 shows how many variant 𝑛-tuples were annotated in MorfFlex 2.0 and how many were added thanks to the newly found 𝑛-tuples. As can be seen, the main increment is recorded for smaller 𝑛, especially for pairs (𝑛 = 2), triples (𝑛 = 3) and 4-tuples. The counts for bigger values of 𝑛 remain the same.

{| class="wikitable"
|+ Table 1: Number of interlinked variants in MorfFlex 2.0 (the second column) and after addition of the new variant annotation (the third column). The first column lists sizes of 𝑛-tuples; 𝑛 = 2 is for pairs.
! 𝑛 !! MorfFlex 2.0 !! after
|-
| 2 || 31,919 || 49,079
|-
| 3 || 1,227 || 2,089
|-
| 4 || 121 || 264
|-
| 5 || 16 || 18
|-
| 6 || 187 || 187
|-
| 8 || 4 || 4
|-
| 9 || 1 || 1
|-
| 11 || 1 || 1
|-
| 12 || 1 || 1
|}

In the following sections, we present a more detailed analysis of the prototypical cases of global variants (Section 4) and, on the other hand, cases that we do not treat as variants (Section 5).

====3.4. Global Variants into DeriNet====
We uploaded the resulting global variants into the newest version of DeriNet, 2.1, and we intend to do the same for the next version of the inflectional dictionary MorfFlex. The two resources differ in the data structures they use for storing their data, but they share the same set of lemmas.

DeriNet interconnects derivationally related words into so-called derivational families. Each family of words is represented in the form of a rooted tree (in graph theory terminology), in which words are represented as nodes and derivational relations as edges. In other words, each derived word in DeriNet has at most one base word (antecedent), e.g., učitelka (female teacher) ← učitel (teacher) ← učit (to teach).

In the rooted tree data structure, unidentified global variants caused structural inconsistencies. For instance, the adjective citrónový (related to lemon) could be connected to the noun citron, although the noun citrón would be a better antecedent (both are global variants of lemon). To tackle this issue, identifying global variants is crucial.

We considered two possible ways of representing global variants in the current rooted tree data structure of DeriNet. In the first approach (see Fig. 1, part A), the global variants would create parallel branches in the tree, e.g., citron → citronový → citronově parallel to citrón → citrónový → citrónově. The major disadvantage of this approach is that the branches may be disconnected or contain gaps if any variant is missing in the vocabulary.

In the second approach (see Fig. 1, part B), the global variants would be connected to one basic variant to which the derivatives are connected, while the other variants would not have any derivatives connected, e.g., [ citron ∼ citrón ] → [ citronový ∼ citrónový ] → [ citronově ∼ citrónově ]. A missing global variant in this approach does not disconnect any word from the tree. Therefore, we chose this approach for DeriNet.

Figure 1: Two possible ways of representing global variants in the rooted trees; (A) making parallel branches, (B) connecting variants to the basic variant (the latter option implemented in DeriNet 2.1).

The selection of the basic variant followed criteria similar to those applied in MorfFlex. We tried to apply them consistently across the 𝑛-tuples that share the same pattern. The final decision depended on a lexicographer. (Footnote 5: Unfortunately, this task was not coordinated between the MorfFlex and DeriNet projects, but we plan to unify the selections.)

As a result, the data from our experiments with global variants have already been uploaded into DeriNet 2.1. If words are variants in this lexicon, one of the words is selected as the basic one and the other ones are connected directly to it by a special relation that is labelled as Type=Variant. Fig. 2 illustrates words derived from the variant pair úřad ∼ ouřad (office) from DeriNet 2.1; the missing variants of the individual derivatives, such as úřadek ∼ ouřadek, will be connected in the new release.

Figure 2: Simplified record of the global variants of the noun úřad ∼ ouřad (office) and its derivatives from DeriNet 2.1. Variant relations are represented by dark grey dashed arrows that are shorter than the light grey solid arrows, which represent derivational relations. Size of the nodes corresponds to the token frequency of the lemmas in the corpus SYNv4 [8]. Brackets around nodes indicate that the node's derivatives were hidden for spatial reasons.

===4. Prototypical Cases of Global Variants===
In this section, we present the most common types of global variants together with typical examples. (Footnote 6: This overview is by no means complete.) One of the important properties of global variants is that their derivatives can also become global variants.

Example: The pair of verbs lítat ∼ létat (to fly) derives iterative verbs lítávat ∼ létávat, adjectives lítající ∼ létající (flying), and/or verbal nouns lítání ∼ létání (the flying). Derivatives in each of the pairs are also global variants.

====4.1. Long and Short Vowels====
In this type of variants, words vary in the length of a vowel, either in the affix or in the root.

Example: Suffix variation in svíčkař ∼ svíčkář (someone who makes candles), and root variation in kvikat ∼ kvíkat (to oink/squeak).

====4.2. Alveolar vs. Postalveolar/Palatal Consonants====
The consonants alternate in the root; the instances are of different origins.

Example: vlaštovka ∼ vlašťovka (a swallow), student ∼ študent (student), mrazený ∼ mražený (frozen).

====4.3. Soft and Hard Adjectives====
There are two types of adjectives, soft and hard, but some adjectives can vary between the two types. This was quite common in the past, as is visible from the additional information usually attached to one of the variants: it is often marked as archaic or outdated. The basic variant can be soft as well as hard, depending on the lexicographer's decision. At the beginning of our work, this type of variants was not recorded.

Example: Adjectival variation in the pairs námezdný ∼ námezdní (hired), přívodný ∼ přívodní (feed, inflow ... e.g. pipe).

====4.4. Prothetic v-====
Many Czech words starting with the vowel o also exist in a variant with the prothetic v- at their very beginning. Though the latter variant is considered non-standard and is used mainly in spoken Czech, it is very common and is penetrating into written Czech, too.

Example: okno ∼ vokno (window). This type of variants can appear not only at the beginning of words, but also after a prefix that precedes the o; e.g., zotvírat ∼ zvotvírat (to open step by step).

====4.5. Vocalised and Non-vocalised Prefixes====
The prefixes v-, s-, vz-, roz-, od-, pod-, nad-, ob-, před- can be expanded by e (ve-, se-, vze-, roze-, ode-, pode-, nade-, obe-, přede-). Some words can have both spellings, which makes them variants.

Example: střást ∼ setřást (shake off), rozsmutnit ∼ rozesmutnit (make sad), objet ∼ obejet (go around).

====4.6. Stylistic Variants (ú ∼ ou, ý ∼ ej, th ∼ t, s ∼ z)====
This type of variants usually puts standard and non-standard Czech into opposition, be it an archaic, colloquial or other sort of style. The most frequent is the variation between s and z, especially within the suffixes -ismus and -izmus.

Example: mechanismus ∼ mechanizmus (mechanism), vytékat ∼ vytejkat (flow/leak out), úzký ∼ ouzký (narrow), ortopedie ∼ orthopedie (orthopedics).

====4.7. Variants of Foreign Names====
The most frequent foreign geographic names usually have a Czech translation.

Example: The Czech variant of Paris is Paříž, Moscow is Moskva, Berlin is Berlín.

Though both words can appear in Czech texts, they are not considered global variants. Moreover, the original form of the foreign name is usually not inflected.

Person names are typically not translated, but their spelling is often unusual. In addition, errors or typos frequently occur in their spelling. In such cases, they can be considered variants. Sometimes, one of the variants is a spelling adapted to the pronunciation, as the long variant in the following example.

Example: Abdulah ∼ Abdullah ∼ Abduláh.

This does not apply to Slavic names with the ending -ij or -i, which are sometimes translated with the ending -ý. As the variation appears only in the nominative singular (lemma) and vocative singular, we consider this type of variants inflectional.

Example: All three variants of the name Čajkovský (Tchaikovsky), namely Čajkovský ∼ Čajkovskij ∼ Čajkovski, are inflectional variants of the singular nominative and vocative cases. Other cases do not manifest this type of variation. They are not global variants.

Similarly, names of ancient Greeks with the lemma ending -es or -és are not global variants, as this variation appears only in the nominative singular. They are inflectional variants.

Example: Empedokles, Empedoklés.

===5. Non-variants===
The soft–hard type can seemingly be applied to the soft and hard declension of nouns of feminine or masculine gender. In reality, in such cases, we should rather speak about a combined paradigm and merge the two variants into one inflectional paradigm. This has already been done for the masculine declension of soft–hard pairs, both animate and inanimate.

Example: The lemma kužel (cone) can be inflected either as a hard noun (following the traditional masculine inanimate declension class hrad) or as a soft noun (following the traditional masculine inanimate declension class stroj). It is reasonable to join the wordforms of the two inflected sets and to represent the whole set of wordforms by a single lemma. As there is only one lemma, these words cannot be global variants either.

The feminine gender is different, as there the lemmas differ. However, the difference is always within the ending, so according to the definition of global variants, they are not global variants. Though they are often viewed as global variants, there is rather one inflectional paradigm with inflectional variants affecting all the wordforms.
The new pattern for this type of variation should be added and all the wordforms merged into a single inflectional paradigm with inflectional variants even for the lemma.

Example: The lemmas kapuce ∼ kapuca (hood) have different inflectional paradigms, but the wordforms with the same tag (combination of number and grammatical case) differ only in endings.

Similar cases are variants with different genders.

Example: brambora (fem.), brambor (masc. inan.) (both potato); ribstole (fem.), ribstol (masc. inan.) (both wall bars).

Again, the variation manifests itself only in endings, so they cannot be considered global variants. The solution proposed for the nouns with the same gender (merging the inflectional paradigms) cannot be applied here because of the so-called "Principle of morphological differentiation" introduced in [10]. One of its requirements is that the gender of a noun should stay the same within the whole inflectional paradigm. These examples reveal that the Principle is questionable; it would probably be advisable to reconsider it.

During the work on global variants in Czech resources, we came across several peculiarities.

Example: The pair pécéčko ∼? písíčko (a sort of abbreviation of personal computer / PC). Is it a pair of global variants, or not? For the time being, the two lemmas are not interlinked.

Sometimes, we found sets of seeming variants that had a typical variant pattern but were not variants because of their different meanings.

Example: valečka (biol. sort of grass) ≁ válečka (someone [fem.] who rolls something), and/or studenský (adjective to the town of Studená) ≁ studénský (adjective to the town of Studénka).

Nor did we interconnect the onomatopoeic or expressive words.

Example: ďoubnout ≁ ďubnout (expr. to push).

===6. Conclusion===
The paper presented a specialised project of looking for global variants in available resources of Czech lexical data. The main aim was to make an "inventory" of Czech global variants and to annotate them. Special attention was paid to the distinction between global and inflectional variants. This distinction has already been captured in MorfFlex 2.0, in which, however, many pairs of global variants still remained unlinked. In particular, it was necessary to reflect the existence of global variant 𝑛-tuples in DeriNet. The newly identified global variants are now captured in the recent edition of DeriNet, 2.1. Incorporating the new annotation of global variants into MorfFlex is planned for a future edition.

This project included a lot of manual work, as the topic of variants is very diverse and there are no rules for a really strict distinction between what is and what is not a variant. Thus, the manual work uncovered some border cases that had to be decided from scratch. In general, we adopted very strict rules, e.g., we do not consider as variants those words that contain formally different affixes (see the example zvýhodněný ≁ zvýhodnělý (privileged)). All those cases are to be researched in greater detail in the future.

===Acknowledgement===
This work was supported by the Grant No. GA19-14534S of the Czech Science Foundation, the Grant No. START/HUM/010 of Grant schemes at Charles University (reg. No. CZ.02.2.69/0.0/0.0/19_073/0016935), and the LINDAT/CLARIAH-CZ project of the Ministry of Education (LM2015071, LM2018101).

===References===
* [1] Hajič, J. 2004. Disambiguation of Rich Inflection (Computational Morphology of Czech). Nakladatelství Karolinum, Charles University, Czechia.
* [2] Hajič, J.; Hlaváčová, J.; Mikulová, M.; Straka, M.; Štěpánková, B. 2020. MorfFlex CZ 2.0. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, Czechia. URL: http://hdl.handle.net/11234/1-3185.
* [3] Havránek, B. (ed.) 1960–1971. Slovník spisovného jazyka českého. Academia, Prague, Czechia.
* [4] Hlaváčová, J. 2009. Formalizace systému české morfologie s ohledem na automatické zpracování českých textů. Ph.D. thesis, FF UK, 146 pp.
* [5] Hlaváčová, J. 2011. Problém variantních tvarů slov při automatickém zpracování jazyka. In: Information Technologies – Applications and Theory, pp. 75–78.
* [6] Hlaváčová, J. 2019. Aggregates and Variants in Two Czech Morphological Approaches. In: Proceedings of the 19th Conference ITAT 2019: Slovenskočeský NLP workshop (SloNLP 2019), pp. 120–124.
* [7] Hrbáček, J. 1974. Lexikální ekvivalenty, dublety a varianty. Naše řeč 57(1), pp. 28–33.
* [8] Křen, Michal et al. 2016. Corpus SYN, version 4. Prague, Institute of the Czech National Corpus, Faculty of Arts, Charles University; http://www.korpus.cz.
* [9] Lopatková, M.; Kettnerová, V.; Bejček, E.; Vernerová, A.; Žabokrtský, Z. 2016. VALLEX 3.0. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, Czechia. URL: http://hdl.handle.net/11234/1-2307.
* [10] Mikulová, M.; Hajič, J.; Hana, J.; Hanová, H.; Hlaváčová, J.; Jeřábek, E.; Štěpánková, B.; Vidová Hladká, B.; Zeman, D. 2020. Manual for Morphological Annotation. Revision for Prague Dependency Treebank – Consolidated 2020 release. Technical Report TR-2020-64. Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, Czechia. ISSN: 1214-5521. URL: https://ufal.mff.cuni.cz/techrep/tr64.pdf.
* [11] Richter, M.; Straňák, P.; Rosen, A. 2012. Korektor – A System for Contextual Spell-checking and Diacritics Completion. In: Proceedings of the 24th International Conference on Computational Linguistics (Coling 2012), pp. 1–12. Coling 2012 Organizing Committee, Mumbai, India.
* [12] Straka, M.; Straková, J. 2017. Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 88–99. Association for Computational Linguistics, Vancouver, Canada.
* [13] Straková, J.; Straka, M.; Hajič, J. 2014. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 13–18. Association for Computational Linguistics, Baltimore, Maryland.
* [14] Vidra, J.; Žabokrtský, Z.; Kyjánek, L.; Ševčíková, M.; Dohnalová, Š.; Svoboda, E.; Bodnár, J. 2021. DeriNet 2.1. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, Czechia. URL: http://hdl.handle.net/11234/1-3765.
* [15] Mluvnice češtiny 1: Fonetika, Fonologie, Morfonologie a morfemika, Tvoření slov. 1986. Academia, nakladatelství Československé Akademie věd, Prague, Czechia.
* [16] Mluvnice češtiny 2: Tvarosloví. 1986. Academia, nakladatelství Československé Akademie věd, Prague, Czechia.
* [17] Pravopis a výslovnost přejatých slov se s – z. Internetová jazyková příručka [online] (2008–2022). Ústav pro jazyk český AV ČR, Prague, Czechia. Cit. 28. 5. 2022. URL: https://prirucka.ujc.cas.cz