=Paper=
{{Paper
|id=Vol-3226/paper14
|storemode=property
|title=Global Variants in the Czech Language
|pdfUrl=https://ceur-ws.org/Vol-3226/paper14.pdf
|volume=Vol-3226
|authors=Jaroslava Hlaváčová,Lukáš Kyjánek,Magda Ševčíková
|dblpUrl=https://dblp.org/rec/conf/itat/HlavacovaKS22
}}
==Global Variants in the Czech Language==
<pdf width="1500px">https://ceur-ws.org/Vol-3226/paper14.pdf</pdf>
<pre>
Global Variants in the Czech Language
Jaroslava Hlaváčová, Lukáš Kyjánek and Magda Ševčíková
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Malostranské náměstí 25, 118 00 Prague, Czechia


Abstract
There are words written in several different ways in Czech, e.g., lampion ∼ lampión (lampion). This variability may occur in
either some inflectional wordforms (inflectional variants), cf. hradu ∼ hradě in the locative case of the noun hrad (castle), or
across the inflectional wordforms and derivatives (global variants), cf. fantazijní ∼ fantasijní in the adjective derived from
the noun fantazie ∼ fantasie (fantasy). It is reasonable to distinguish the global variants as different words but to have formal
means that interconnect them in the Natural Language Processing systems and resources. In this paper, we describe the
identification of global variants in the Czech vocabulary and summarise new changes in the MorfFlex CZ dictionary and
DeriNet lexicon concerning this type of variants. We reviewed several typical patterns within global variants captured in the
available resources and combined a set of regular expressions with manual annotations to achieve the highest precision of the
identification.

Keywords
global and inflectional variant, morphology, word derivation, Czech


ITAT’22: Information technologies – Applications and Theory, September 23–27, 2022, Zuberec, Slovakia
hlavacova@ufal.mff.cuni.cz (J. Hlaváčová); kyjanek@ufal.mff.cuni.cz (L. Kyjánek); sevcikova@ufal.mff.cuni.cz (M. Ševčíková)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

The written form is one of the possible representations of languages. Czech speakers must learn and use
a relevant script, rules, and regularities of the respective writing system because of its substantial
standardisation and codification. However, some words can still be written in several slightly different
ways, in so-called orthographic (spelling) variants, e.g., citron ∼ citrón (lemon), museum ∼ muzeum
(museum), peepshow ∼ peepšou ∼ pípšou (peepshow).
   The emergence of orthographic variants in Czech is influenced by various aspects like the spoken
representation of Czech, language development, and language contact. Some cases of the variability are
only temporary until the use of one of the orthographic variants is established and codified as the
preferred one (which can take years or decades). However, codified or not, many of the orthographic
variants appear in the texts produced by speakers, which complicates work with language resources and
all sorts of NLP applications.
   Adhering to the current decisions on annotating this phenomenon in the corpus PDT-C (cf. its manual
in [10]), we distinguish two types of orthographic variants. Inflectional variants refer to relatively
regular variants within a set of wordforms of a given word, e.g., the locative case of some masculine
inanimate nouns like obchod (shop): obchodu ∼ obchodě. Global variants¹ address the variability in all
the inflectional forms, often also in derived words, e.g., the prothetic v-² attached to the noun
obchod ∼ vobchod (shop), to the derived verb obchodovat ∼ vobchodovat (to trade), and to the derived
adjective obchodní ∼ vobchodní (commercial). All those words manifest the same difference in every
single wordform of their inflectional paradigms (for instance, the genitive cases of the noun
(obchodu ∼ vobchodu) and the adjective (obchodního ∼ vobchodního)). On the other hand, somewhere between
the two defined types, there are also several cases in which the variability is limited to a few forms,
cf. the infinitive and past participles of the verb myslet ∼ myslit (to think), while the remaining
wordforms are identical.

¹ Global variants are also called full-paradigm variants in the complete description of the morphological
  annotation of the corpus PDT-C [10, pp. 36–42]. We will stick to the term global, as it is shorter, but
  the two terms are equivalent.
² This variation originates from Common Czech and is not codified. It is sub-standard, but it occurs in
  the written language.
³ There are more ways to interconnect the global variants (see [4], [5], [6]).

   The inflectional variants are captured in the morphological dictionary MorfFlex CZ (hereafter
MorfFlex) by means of the 15th position in the morphological tag describing morphological categories of
a given wordform [2]. Until the 2020 edition of MorfFlex, there was no distinction between the
description of global and inflectional variants. All the variants were marked at the 15th position of
the Prague positional tagset [1]. In the last version, the global variants are annotated by means of
links between them.³ In MorfFlex, one word from the 𝑛-tuple of variants is selected as the basic one.
All other variants are linked with the basic one by means of additional pieces of information in their
lemma. This information contains not only the basic variant, but also the (rough)
style of the variant. There is no strict rule for such selection, as none of the possible criteria is
easy to formulate or check. Usually, the more common (frequent) or standard (over non-standard) variant
was taken as the basic one. The final selection of the basic variant depended on the lexicographer’s
opinion.
   Until recently, no special attention was given to the completeness of global variants within the
whole of the dictionary. That is why we reviewed available resources addressing the issue of
orthographic variability (Section 2), and extracted typical patterns for global variants from them. We
applied these patterns to the set of lemmas from MorfFlex to find as many global variant candidates as
possible. We also exploited the DeriNet lexicon [14], which models word-formation relations in Czech, to
search for global variants within derivationally related words (Section 3). After manual filtration of
the obtained global variants, our work resulted in interconnecting global variants in MorfFlex. They
will appear in its next version. They were also partly uploaded to the DeriNet 2.1 lexicon.
   The resulting data and categorisation (Sections 4 and 5) have potential in two research directions.
First, they lead to the improvement of the Natural Language Processing applications for which the
morphological dictionaries serve as the background data, cf. [11], [13], and [12]. Second, they
contribute to a wider linguistic discussion on (not only orthographic) variability in Czech, especially
in the context of border cases between inflectional and global variants mentioned above and exemplified
before the conclusions of this paper.

1.1. Terminology

This section summarises the definitions of the basic terms used in this paper. We set it apart so that
readers can easily recall a definition while reading the more advanced parts of the text.

— Inflectional paradigm
is a set of all wordforms derived by means of inflection from a citation wordform (so-called lemma);
e.g., the inflectional paradigm of the lemma obchod consists of wordforms obchod, obchodu, obchodě, …

— Inflectional variants
are a pair (or generally 𝑛-tuple) of wordforms belonging to the same inflectional paradigm of the same
lemma and having the same values of all morphological categories, but different spellings; e.g., the
singular locative case of the lemma obchod is obchodu ∼ obchodě.

— Global variants
are a pair (or generally 𝑛-tuple) of lemmas whose difference in spelling propagates to all wordforms of
their inflectional paradigms and to most of their derivationally related words; e.g., obchod ∼ vobchod
→ obchodní ∼ vobchodní (apart from the first letter, inflectional paradigms of these words are
identical).

— Basic variant
is the representative variant for an 𝑛-tuple of global variants.

2. Available Resources Containing Variants

Since any researcher working on a lexical resource must inevitably also process orthographic variants,
the existing language resources for Czech contain some annotation of them. However, capturing variants
is not the primary goal of any of these resources, so this kind of annotation still needs systematic
care. The available digitised resources provide a good point of departure for the phenomenon of
orthographic variants, but we see two shortcomings/inconsistencies in them. First, the number of the
captured orthographic variants is often relatively low. Second, the orthographic variants are not
treated across derivation, e.g., úřad ∼ ouřad (office) is captured but úřadovat ∼ ouřadovat (to
officiate) is not.
   There are various language resources and grammar books that include this kind of annotation;
however, in the following paragraphs, we describe only those that are available and digitised, and thus
machine-readable.
   MorfFlex 2.0 [2], the lexicon of the inflectional morphology of Czech, in its currently available
version, already includes annotation of global variants. It classifies them into three types of
variants: standard (label DD, e.g., lavor ∼ lavór (pail)), common Czech/non-standard (label GC, e.g.,
oprášit ∼ voprášit (to dust down)) and distortion/typo (label DS, e.g., Dominigue ∼ Dominique). However,
as the results of the work herein show, we were able to find many more 𝑛-tuples of variants not included
in the current version.
   VALLEX 3.0 [9] is the valency lexicon of Czech verbs that also interconnects and labels several
orthographic variants of verbs. Their representations are systematic, but the definition of being a
variant seems different from ours. For instance, VALLEX also marks chytit ∼ chytnout (to catch) as
variants, although the status of this pair is arguable.
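
The MorfFlex 2.0 labelling described above (an 𝑛-tuple of lemmas, one of them basic, the others carrying
a rough style label) can be pictured with a small data structure. The following is only a minimal sketch
under stated assumptions: the class and function names are hypothetical, the labels are attached to the
non-basic lemmas for illustration, and the choice of basic lemma in each pair is ours. It does not
reproduce the actual MorfFlex lemma format.

```python
from dataclasses import dataclass, field
from enum import Enum


class VariantLabel(Enum):
    """The three MorfFlex 2.0 variant types named in the text."""
    STANDARD = "DD"
    COMMON_CZECH = "GC"
    DISTORTION = "DS"


@dataclass
class VariantTuple:
    """An n-tuple of global variants: one basic lemma plus labelled variants.

    Hypothetical structure; attaching the label to the non-basic lemma
    is an assumption made for this sketch.
    """
    basic: str
    variants: dict[str, VariantLabel] = field(default_factory=dict)

    def lemmas(self) -> list[str]:
        # The basic variant always comes first.
        return [self.basic, *self.variants]


def basic_variant_index(tuples: list[VariantTuple]) -> dict[str, str]:
    """Map every lemma (basic or not) to the basic variant of its n-tuple."""
    index: dict[str, str] = {}
    for t in tuples:
        for lemma in t.lemmas():
            index[lemma] = t.basic
    return index


# Example pairs from the text; which lemma is basic is assumed here.
tuples = [
    VariantTuple("lavor", {"lavór": VariantLabel.STANDARD}),
    VariantTuple("oprášit", {"voprášit": VariantLabel.COMMON_CZECH}),
    VariantTuple("Dominique", {"Dominigue": VariantLabel.DISTORTION}),
]
index = basic_variant_index(tuples)
```

With such an index, any lemma met in a text can be mapped to the basic variant of its 𝑛-tuple, which is
the kind of interconnection the paper argues NLP resources need.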
   Slovník spisovného jazyka českého (SSJČ) [3], the explanatory dictionary of Czech digitised into a
semi-structured format,⁴ covers a wide vocabulary and includes annotations of orthographic variants
without any additional classification. However, it indicates these variants in the glosses of the
dictionary in various ways, e.g., by using

         • v. = viz (see) in “obepsati v. opsati” (copy),
         • comma in “mysliti, mysleti” (to think), or
         • řidč. = řidčeji (rarely) in “zpěvánka, řidč. zpěvanka, zpívánka” ([usually a short] song).

   Apart from these language resources, there is also a long tradition of linguistic studies on the
topic of orthographic variants, cf. prefixes s-/se- and z-/ze- and orthography of loanwords in general
in [15, pp. 167–170], morphological variability and its reflection in orthography in [16, pp. 268–276],
orthography of loanwords with s/z characters, such as analýza ∼ analýsa (analysis) in [17], and
terminological issues of the phenomenon in [7]. They provide extensive lists of variants or patterns
that are shared across variants. We extracted some patterns from the studies, which helped us to
identify typical general properties of global variants.

⁴ https://ssjc.ujc.cas.cz/

3. Searching for Global Variants

We developed a semi-automatic procedure that consists of four straightforward subsequent steps to search
for Czech global variants. We exploited the available resources mentioned in the previous section, and
we extracted frequent patterns that appear in global variants. We applied these patterns to the set of
lemmas from MorfFlex 2.0 to obtain 𝑛-tuples of global variants, such as extremismus ∼ extrémismus ∼
extremizmus ∼ extrémizmus (extremism).
   To also get derivationally related global variants that were not identified on the basis of the
extracted patterns, we included DeriNet in the process. Thanks to that, we also covered 𝑛-tuples like
extremistický ∼ extrémistický (extremist) for the already identified variants of the base word extremism
mentioned above. The resulting 𝑛-tuples were manually annotated to eliminate words that are similar only
by accident. In the last step, the data were uploaded as a new type of annotation into DeriNet, and, in
parallel, new links were also added to MorfFlex.

3.1. Extracting Variants from the Existing Resources

We started with assembling the existing resources listed in Section 2 and extracting variants from them.
While the extractions from MorfFlex and VALLEX were easy, as these resources are designed to be
processed automatically, we had to use regular expressions to extract candidates for variants from SSJČ
because, in its digitised version, it stores data in a semi-structured file format (see Section 2).
Consequently, we checked whether the words extracted from this resource are attested in the vocabulary
of MorfFlex to filter out incorrectly extracted, non-existent, and archaic words. We also wrote down
examples and patterns from the relevant linguistic studies like those cited in the previous section.
   When comparing the 𝑛-tuples of variants extracted from the resources, we observed that MorfFlex
includes more than two thirds of the variants captured in SSJČ. The remaining third of the variants from
SSJČ seems questionable, e.g., zvýhodněný ∼? zvýhodnělý (privileged), in which the variability does not
affect individual characters but the affixes (and thus can change the word meaning). The variants
extracted from VALLEX cover only verbs; most of them are not included in the other resources.

3.2. Formalising Regular Patterns

Having a relatively large number of 𝑛-tuples with global variants, we considered whether to use
pattern-matching algorithms or to formalise frequent patterns manually. We chose the latter way as it
allowed us to have a better overview of the processed data and to create more complicated patterns that
would take into account not only character changes but also morpho-syntactic categories of words. For
instance, this decision allowed us to avoid interconnecting the masculine animate variant pair česač ∼
česáč (a man harvesting apples) with the masculine inanimate variant pair česač ∼ česáč (an instrument
for harvesting fruits of tall trees).
   We exploited various types of intersections of the extracted lists of variant 𝑛-tuples and their
sorting to be able to infer frequent patterns that occur in global variants. We first looked at the
global variants extracted from all three resources, then at those that occurred in at least two
resources, and only then at those that were in individual resources but not in the others. More than
one hundred observed patterns were formalised into the form of regular expressions, e.g.,
^o.* ↔ ^vo.* in obchod ∼ vobchod (shop). The relevant morpho-syntactic categories were also stored with
the particular regular expression.

3.3. Applying Patterns to MorfFlex

We applied the formalised patterns to all the lemmas from MorfFlex (we did not search for inflectional
variants, so we did not have to take wordforms into account). To
achieve higher precision, we also exploited the knowledge of morpho-syntactic categories of the
candidates for global variants. If these categories, e.g., grammatical gender or animacy, differed
between the candidate words, the candidates were excluded, cf. the masculine animate noun car (tsar) ≁
the masculine inanimate cár (shred).
   In order to obtain a more consistent list of variants, we also took derivational morphology from
DeriNet into account. For each identified global variant, we observed the relevant sub-tree of
derivationally related words and tried to identify the same patterns among the derivatives.
   The resulting 𝑛-tuples of global variants were also manually filtered. The annotator was provided
with lists of 𝑛-tuples of global variant candidates; the task was to go through the lists and exclude
those 𝑛-tuples which only accidentally met a derivational pattern but were not variants, e.g., the pair
fiala (wallflower) ≁ fiála (pinnacle) had to be excluded manually, although the same pattern works well
in real variants like neandrtalec ∼ neandrtálec (Neanderthal). During the manual work on the global
variant candidates, we also identified inflectional variants with a variant lemma; see the example of
the pair of verbs myslit, myslet.
   Table 1 shows how many variant 𝑛-tuples were annotated in MorfFlex 2.0, and how many were added
thanks to the newly found 𝑛-tuples. The main increase is recorded for smaller 𝑛, especially for pairs
(𝑛 = 2), triples (𝑛 = 3) and 4-tuples. The counts for the larger values of 𝑛 (𝑛 ≥ 6) remain the same.

      𝑛    MorfFlex 2.0     after
      2          31,919    49,079
      3           1,227     2,089
      4             121       264
      5              16        18
      6             187       187
      8               4         4
      9               1         1
     11               1         1
     12               1         1

Table 1
Number of interlinked variants in MorfFlex 2.0 (the second column) and after addition of the new variant
annotation (the third column). The first column lists sizes of 𝑛-tuples; 𝑛 = 2 is for pairs.

   In the following sections, we present a more detailed analysis of the prototypical cases of global
variants (Section 4) and, on the other hand, cases that we do not treat as variants (Section 5).

3.4. Global Variants into DeriNet

We uploaded the resulting global variants into the newest version of DeriNet, 2.1, and we intend to do
so also for the next version of the inflectional dictionary MorfFlex. The two resources differ in the
data structures they use for storing their data, but they share the same set of lemmas.
   DeriNet interconnects derivationally related words into so-called derivational families. Each family
of words is represented in the form of a rooted tree (in graph theory terminology), in which words are
represented as nodes and derivational relations as edges. In other words, each derived word in DeriNet
has at most one base word (antecedent), e.g., učitelka (female teacher) ← učitel (teacher) ← učit (to
teach).
   In the rooted tree data structure, unidentified global variants caused structural inconsistencies.
For instance, the adjective citrónový (related to lemon) could be connected to the noun citron, although
the noun citrón would be a better antecedent (both are global variants of lemon). To tackle this issue,
identifying global variants is crucial.

[Figure 1 (image not reproduced): derivational trees for citron ∼ citrón (lemon) with the derivatives
citronový ∼ citrónový (lemony), citroník ∼ citróník (little lemon), and citronově ∼ citrónově
(lemonily), shown in panels A and B.]
Figure 1: Two possible ways of representing global variants in the rooted trees; (A) making parallel
branches, (B) connecting variants to the basic variant (the latter option implemented in DeriNet 2.1).

   We considered two possible ways of representing global variants in the current rooted tree data
structure of DeriNet. In the first approach (see Fig. 1, part A), the global variants would create
parallel branches in the tree, e.g., citron → citronový → citronově parallel to citrón →
citrónový → citrónově. The major disadvantage of this approach is that the branches may be disconnected
or contain gaps if any variant is missing in the vocabulary.
   In the second approach (see Fig. 1, part B), the global variants would be connected to one basic
variant to which the derivatives are connected, while the other variants would not have any derivatives
connected, e.g., [ citron ∼ citrón ] → [ citronový ∼ citrónový ] → [ citronově ∼ citrónově ]. The lack
of global variants in this approach does not disconnect word(s) from the tree. Therefore, we chose this
approach for DeriNet.
   The selection of the basic variant followed criteria similar to those applied in MorfFlex. We tried
to do so consistently across the 𝑛-tuples that share the same pattern. The final decision depended on a
lexicographer.⁵
   As a result, the data from our experiments with global variants have already been uploaded into
DeriNet 2.1. If words are variants in this lexicon, one of the words is selected as the basic one and
the other ones are connected directly to it by a special relation that is labelled as Type=Variant.
Fig. 2 illustrates words derived from the variant pair úřad ∼ ouřad (office) from DeriNet 2.1; the
missing variants of the individual derivatives, such as úřadek ∼ ouřadek, will be connected in the new
release.

Figure 2: Simplified record of the global variants of the noun úřad ∼ ouřad (office) and its derivatives
from DeriNet 2.1. Variant relations are represented by dark grey dashed arrows that are shorter than the
light grey solid arrows, which represent derivational relations. Size of the nodes corresponds to the
token frequency of the lemmas in the corpus SYNv4 [8]. Brackets around nodes indicate that the node’s
derivatives were hidden for spatial reasons.

⁵ Unfortunately, this task was not coordinated between the MorfFlex and DeriNet projects, but we plan
  to unify it.

4. Prototypical Cases of Global Variants

In this section, we will present the most common types of global variants together with typical
examples.⁶ One of the important properties of global variants is that their derivatives can also become
global variants.
Example: The pair of verbs lítat ∼ létat (to fly) derives iterative verbs lítávat ∼ létávat, adjectives
lítající ∼ létající (flying), and/or verbal nouns lítání ∼ létání (the flying). Derivatives in each of
the pairs are also global variants.

⁶ This overview is by no means complete.

4.1. Long and Short Vowels

In this type of variants, words vary in the length of a vowel, either in the affix, or in the root.
Example: Suffix variation in svíčkař ∼ svíčkář (someone who makes candles), and root variation in
kvikat ∼ kvíkat (to oink/squeak).
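
The vowel-length type lends itself to a simple candidate test: two lemmas are candidates if they become
identical once vowel length is ignored. The sketch below is only an illustration of that idea, not the
authors' implementation (the paper formalises its patterns as regular expressions stored together with
morpho-syntactic categories). The function names and the `categories` lookup are hypothetical.

```python
import unicodedata

# Long Czech vowels mapped to their short counterparts (ů is a long u).
_SHORTEN = str.maketrans("áéíóúůýÁÉÍÓÚŮÝ", "aeiouuyAEIOUUY")


def shorten_vowels(lemma: str) -> str:
    """Return the lemma with every long vowel replaced by its short counterpart."""
    # NFC makes sure accented letters are single code points before mapping.
    return unicodedata.normalize("NFC", lemma).translate(_SHORTEN)


def differ_only_in_vowel_length(a: str, b: str) -> bool:
    """True if two distinct lemmas are identical once vowel length is ignored."""
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    return a != b and shorten_vowels(a) == shorten_vowels(b)


def is_candidate_pair(a: str, b: str, categories: dict) -> bool:
    """Vowel-length match plus identical morpho-syntactic categories.

    `categories` is a hypothetical lookup from lemma to its categories
    (e.g. part of speech, gender, animacy); lemmas without an entry are
    never accepted.
    """
    cat_a, cat_b = categories.get(a), categories.get(b)
    return (
        differ_only_in_vowel_length(a, b)
        and cat_a is not None
        and cat_a == cat_b
    )
```

Such a test deliberately over-generates. It accepts svíčkař ∼ svíčkář and kvikat ∼ kvíkat, the category
check rejects car ≁ cár when the two nouns differ in animacy, but a pair such as fiala ≁ fiála still
passes the character test, which is why Section 3.3 describes a manual filtering step.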
4.2. Alveolar vs. Postalveolar/Palatal Consonants
The consonants alternate in the root; the instances are of different origins.
Example: vlaštovka ∼ vlašťovka (a swallow), student ∼ študent (student),
mrazený ∼ mražený (frozen).

4.3. Soft and Hard Adjectives
There are two types of adjectives, soft and hard, but some adjectives can vary between the
two types. This was quite common in the past, as is visible from the additional information
usually attached to one of the variants, which is often labelled archaic or outdated. The
basic variant can be soft as well as hard, depending on the lexicographer’s decision. At
the beginning of our work, this type of variants was not recorded.
Example: Adjectival variation in the pairs námezdný ∼ námezdní (hired),
přívodný ∼ přívodní (feed, inflow ... e.g. pipe).

4.4. Prothetic v-
Many Czech words starting with the vowel o also exist in a variant with the prothetic v- at
their very beginning. Though the latter variant is considered non-standard and is used
mainly in spoken Czech, it is very common and is penetrating into written Czech, too.
Example: okno ∼ vokno (window). This type of variants can appear not only at the beginning
of words, but also after a prefix that precedes the o; e.g., zotvírat ∼ zvotvírat (to open
step by step).

4.5. Vocalised and Non-vocalised Prefixes
The prefixes v-, s-, vz-, roz-, od-, pod-, nad-, ob-, před- can be expanded by e (ve-, se-,
vze-, roze-, ode-, pode-, nade-, obe-, přede-). Some words admit both spellings, which
makes them variants.
Example: střást ∼ setřást (shake off), rozsmutnit ∼ rozesmutnit (make sad),
objet ∼ obejet (go around).

4.6. Stylistic Variants (ú ∼ ou, ý ∼ ej, th ∼ t, s ∼ z)
This type of variants usually sets standard Czech against non-standard Czech, be it
archaic, colloquial or some other style. The most frequent is the variation between s and
z, especially within the suffixes -ismus and -izmus.
Example: mechanismus ∼ mechanizmus (mechanism), vytékat ∼ vytejkat (flow/leak out),
úzký ∼ ouzký (narrow), ortopedie ∼ orthopedie (orthopedics).

4.7. Variants of Foreign Names
The most frequent foreign geographic names usually have a Czech translation.
Example: The Czech variant of Paris is Paříž, Moscow is Moskva, Berlin is Berlín.
   Though both words can appear in Czech texts, they are not considered global variants.
Moreover, the original of the foreign name is usually not inflected.
   Person names are typically not translated, but their spelling is often unusual. In
addition, errors or typos frequently occur in their spelling. In such cases, they can be
considered variants. Sometimes, one of the variants is a spelling adapted to the
pronunciation, as the long variant in the following example.
Example: Abdulah ∼ Abdullah ∼ Abduláh.
   This does not apply to Slavic names with the ending -ij or -i, which are sometimes
translated with the ending -ý. As the variation appears only in the nominative singular
(lemma) and vocative singular, we consider this type of variants inflectional.
Example: All three variants of the name Čajkovský (Tchaikovsky), namely
Čajkovský ∼ Čajkovskij ∼ Čajkovski, are inflectional variants of the singular nominative
and vocative cases. Other cases do not manifest this type of variation. They are not global
variants.
   Similarly, names of ancient Greeks with the lemma ending -es or -és are not global
variants, as this variation appears only in the nominative singular. They are inflectional
variants.
Example: Empedokles, Empedoklés.

5. Non-variants
The soft–hard type can seemingly be applied to the soft and hard declension of nouns with
feminine or masculine gender. In reality, in such cases, we should rather speak about a
combined paradigm and merge the two variants into one inflectional paradigm. This has
already been done for the masculine declension of soft–hard pairs, both animate and
inanimate.
Example: The lemma kužel (cone) can be inflected either as a hard noun (following the
traditional masculine inanimate declension class hrad) or as a soft noun (following the
traditional masculine inanimate declension class stroj). It is reasonable to join the
wordforms of the two inflected sets and to represent the whole set of wordforms by a single
lemma. As there is only one lemma, these words cannot be global variants either.
   The feminine gender is different, as there the lemmas differ. However, the difference is
always within the ending, so according to the definition of global variants, they are not
global variants. Though they are often viewed as global variants, there is rather one
inflectional paradigm with inflectional variants affecting all the wordforms.
The new pattern for this type of variation should be added and all the wordforms merged
into a single inflectional paradigm with inflectional variants even for the lemma.
Example: The lemmas kapuce ∼ kapuca (hood) have different inflectional paradigms, but the
individual tags (combinations of number and grammatical case) differ only in endings.
   Similar cases are variants with different genders.
Example: brambora (fem.), brambor (masc. inan.) (both potato); ribstole (fem.), ribstol
(masc. inan.) (both wall bars).
   Again, the variation manifests itself only in endings, so they cannot be considered
global variants. The solution proposed for the nouns with the same gender (merging the
inflectional paradigms) cannot be applied here, because of the so-called “Principle of
morphological differentiation” introduced in [10]. One of its requirements is that the
gender of a noun should stay the same within the whole inflectional paradigm. These
examples reveal that the Principle is questionable; it would probably be advisable to
reconsider it.
   During the work on global variants in Czech resources, we came across several
peculiarities.
Example: The pair pécéčko ∼? písíčko (a sort of abbreviation of personal computer / PC). Is
it a pair of global variants, or not? For the time being, the two lemmas are not
interlinked.
   Sometimes, we found sets of seeming variants that had a typical variant pattern, but
they were not variants because of their different meanings.
Example: valečka (biol. sort of grass) ≁ válečka (someone [fem.] who rolls something),
and/or studenský (adjective to the town of Studená) ≁ studénský (adjective to the town of
Studénka).
   Nor did we interconnect onomatopoeic or expressive words.
Example: ďoubnout ≁ ďubnout (expr. to push).

6. Conclusion
The paper presented a specialised project of looking for global variants in available
resources of Czech lexical data. The main aim was to make an “inventory” of Czech global
variants and to annotate them. Special attention was paid to the distinction between the
global and inflectional ones. This distinction has already been captured in the new edition
of MorfFlex 2.0, in which many pairs of global variants still remained unlinked. In
particular, it was necessary to reflect the existence of global variant n-tuples in
DeriNet. The newly identified global variants are now captured in the recent edition of
DeriNet 2.1. Incorporating the new annotation of global variants into MorfFlex is planned
for a future edition.
   This project included a lot of manual work, since variants are highly diverse and there
are no rules that strictly distinguish what is and what is not a variant. The manual work
thus uncovered some borderline cases that had to be decided from scratch. In general, we
adopted very strict rules, e.g. we do not consider as variants those words that contain
formally different affixes (see the example zvýhodněný ≁ zvýhodnělý (privileged)). All
those cases are to be researched in greater detail in the future.

Acknowledgement
This work was supported by the Grant No. GA19-14534S of the Czech Science Foundation, the
Grant No. START/HUM/010 of Grant schemes at Charles University (reg. No.
CZ.02.2.69/0.0/0.0/19_073/0016935), and the LINDAT/CLARIAH-CZ project of the Ministry of
Education (LM2015071, LM2018101).

References
[1] Hajič, J. 2004. Disambiguation of Rich Inflection (Computational Morphology of Czech).
    Nakladatelství Karolinum, Charles University, Czechia.
[2] Hajič, J.; Hlaváčová, J.; Mikulová, M.; Straka, M.; Štěpánková, B. 2020. MorfFlex CZ
    2.0. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied
    Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, Czechia.
    URL: http://hdl.handle.net/11234/1-3185.
[3] Havránek, B. (ed.) 1960–1971. Slovník spisovného jazyka českého. Academia, Prague,
    Czechia.
[4] Hlaváčová, J. 2009. Formalizace systému české morfologie s ohledem na automatické
    zpracování českých textů. Ph.D. thesis, FF UK, 146 pp.
[5] Hlaváčová, J. 2011. Problém variantních tvarů slov při automatickém zpracování jazyka.
    In: Information Technologies – Applications and Theory, pp. 75–78.
[6] Hlaváčová, J. 2019. Aggregates and Variants in Two Czech Morphological Approaches. In:
    Proceedings of the 19th Conference ITAT 2019: Slovenskočeský NLP workshop (SloNLP
    2019), pp. 120–124.
[7] Hrbáček, J. 1974. Lexikální ekvivalenty, dublety a varianty. Naše řeč 57(1), pp. 28–33.
[8] Křen, Michal et al. 2016. Corpus SYN, version 4. Prague, Institute of the Czech
    National Corpus, Faculty of Arts, Charles University; http://www.korpus.cz.
[9] Lopatková, M.; Kettnerová, V.; Bejček, E.; Vernerová, A.; Žabokrtský, Z. 2016. VALLEX
    3.0. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied
    Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, Czechia.
    URL: http://hdl.handle.net/11234/1-2307.
[10] Mikulová, M.; Hajič, J.; Hana, J.; Hanová, H.; Hlaváčová, J.; Jeřábek, E.;
    Štěpánková, B.; Vidová Hladká, B.; Zeman, D. 2020. Manual for Morphological Annotation.
    Revision for Prague Dependency Treebank – Consolidated 2020 release. Technical Report
    TR-2020-64. Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics
    and Physics, Charles University, Czechia. ISSN: 1214-5521. URL:
    https://ufal.mff.cuni.cz/techrep/tr64.pdf.
[11] Richter, M.; Straňák, P.; Rosen, A. 2012. Korektor – A System for Contextual
    Spell-checking and Diacritics Completion. In: Proceedings of the 24th International
    Conference on Computational Linguistics (Coling 2012), pp. 1–12. Coling 2012 Organizing
    Committee, Mumbai, India.
[12] Straka, M.; Straková, J. 2017. Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0
    with UDPipe. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from
    Raw Text to Universal Dependencies, pp. 88–99. Association for Computational
    Linguistics, Vancouver, Canada.
[13] Straková, J.; Straka, M.; Hajič, J. 2014. Open-Source Tools for Morphology,
    Lemmatization, POS Tagging and Named Entity Recognition. In: Proceedings of 52nd Annual
    Meeting of the Association for Computational Linguistics: System Demonstrations, pp.
    13–18. Association for Computational Linguistics, Baltimore, Maryland.
[14] Vidra, J.; Žabokrtský, Z.; Kyjánek, L.; Ševčíková, M.; Dohnalová, Š.; Svoboda, E.;
    Bodnár, J. 2021. DeriNet 2.1. LINDAT/CLARIAH-CZ digital library at the Institute of
    Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles
    University, Czechia. URL: http://hdl.handle.net/11234/1-3765.
[15] Mluvnice češtiny 1: Fonetika, Fonologie, Morfonologie a morfemika, Tvoření slov. 1986.
    Academia, nakladatelství Československé Akademie věd, Prague, Czechia.
[16] Mluvnice češtiny 2: Tvarosloví. 1986. Academia, nakladatelství Československé Akademie
    věd, Prague, Czechia.
[17] Pravopis a výslovnost přejatých slov se s – z. Internetová jazyková příručka [online]
    (2008–2022). Ústav pro jazyk český AV ČR, Prague, Czechia. Cit. 28. 5. 2022. URL:
    https://prirucka.ujc.cas.cz

</pre>