
Linking Two Lexical Resources: VALLEX and MorfFlex Lexicons

Markéta Lopatková

Jaroslava Hlaváčová

Jiří Mírovský

Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Malostranské nám. 25, Prague, 118 00, Czech Republic

The article focuses on two different lexicons providing complementary information: MorfFlex, covering general Czech morphology, and VALLEX, giving information on the syntax and semantics of Czech verbs. We discuss the different designs of these lexicons, concentrating primarily on variants and homographs in the Czech vocabulary. Within the project, we have verified the theoretical approaches and harmonized the treatment of variants in both lexicons, adopting the clear morphologically based criteria from MorfFlex for distinguishing variants in VALLEX. The two updated lexicons, MorfFlex and VALLEX, with interlinked records represent the project's main outcome.

Keywords: morphology, valency, variants, homographs, linking resources

1. Motivation

Language resources represent a crucial prerequisite for any natural language processing (NLP) task, and this is true not only in the "pre-AI-chat-applications world", but even more so in the world using large language models and their AI processing, for which a large quantity of data is crucial. Less obvious is the usability of high-quality (but necessarily very limited) language resources prepared manually, i.e., data with added expert information based on (not only) linguistic erudition.

Despite these fundamental doubts concerning the usefulness of human-developed data in future NLP, we do not want to abandon and throw away these high-quality language resources yet. Instead, we want to maintain them and connect different types of data as much as possible. Moreover, such high-quality data, offering deep linguistic insight, represent an essential resource for further theoretical and formal linguistics research.

Here we focus on two language resources, two different lexicons providing complementary information: MorfFlex, covering general Czech morphology, and VALLEX, giving information on the syntax and meaning of Czech verbs. Our goal is to merge these two resources simply by interlinking them.

2. MorfFlex

Morphological dictionaries serve as inventories of all wordforms of a natural language, and as such, they represent essential language resources. MorfFlex [1, 2], the morphological dictionary of the Czech language, has two main purposes:

• analysis of wordforms
• generation of wordforms

As for the first purpose, MorfFlex serves as the basis for morphological taggers, which assign a basic wordform (lemma) and a set of its morphological properties (in the form of a morphological tag) to every Czech wordform. In NLP tasks, this helps to reduce data sparsity, as it allows automatic tools to work just with lemmas (instead of individual wordforms), or just with morphological tags. Lemmatization and tagging also allow machines (and human users as well) to create queries for effective searching in language corpora.

As for the second purpose, the morphological dictionary contains all the necessary information for generating wordforms based on their lemma and morphological tag. It is used in various NLP tools, for instance, in machine translation.

MorfFlex is maintained as a set of triplets <wordform, lemma, tag>.

The lemma is a basic (representative) wordform, which usually serves as the headword in dictionaries. In MorfFlex, it can be accompanied by a brief semantic note that serves only human editors. This means that MorfFlex does not contain any information on the syntactic or semantic properties of words in a form that could be used in automatic tools.

ITAT 2023: Information Technologies – Applications and Theory, 2023
lopatkova@ufal.mf.cuni.cz (M. Lopatková); hlavacova@ufal.mf.cuni.cz (J. Hlaváčová); mirovsky@ufal.mf.cuni.cz (J. Mírovský)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

For example, the following triplet

<pískal, pískat, VpYS----R-AAI-->

is the MorfFlex entry belonging to the wordform pískal ‘(he) whistled’. Given the wordform (the first item of the triplet), MorfFlex provides its lemma and its morphological description in the form of a morphological tag¹ (compare the first purpose above). Conversely, given a lemma and a morphological tag (the second and the third item), the particular wordform can be generated.

Putting it differently, MorfFlex associates individual lemmas with the complete set of their wordforms, i.e., with their whole (morphological) paradigms. In other words, MorfFlex provides complete characteristics of the formal part of Czech lexemes (as described in Section 3).
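The two purposes of the triplet representation can be illustrated with a minimal sketch. It assumes the triplets are held in memory as plain Python tuples; the names `build_indexes`, `decode` and `POSITIONS` are hypothetical and not part of any MorfFlex tooling. The position numbers are the ones listed in the article's footnote on the morphological tag.

```python
from collections import defaultdict

# The triplet <wordform, lemma, tag> quoted above; a real lexicon holds
# a whole paradigm of such triplets per lemma.
TRIPLETS = [("pískal", "pískat", "VpYS----R-AAI--")]

# 1-based tag positions named in the tag footnote (other positions omitted).
POSITIONS = {"pos": 1, "gender": 3, "number": 4, "person": 8,
             "tense": 9, "voice": 12, "aspect": 13, "variant": 15}


def build_indexes(triplets):
    """Return (analysis, generation) lookup tables built from triplets."""
    analysis = defaultdict(list)   # wordform -> [(lemma, tag), ...]
    generation = {}                # (lemma, tag) -> wordform
    for wordform, lemma, tag in triplets:
        analysis[wordform].append((lemma, tag))
        # Generation only works if a <lemma, tag> pair names one wordform.
        if generation.get((lemma, tag), wordform) != wordform:
            raise ValueError(f"ambiguous <lemma, tag> pair: {lemma}, {tag}")
        generation[(lemma, tag)] = wordform
    return dict(analysis), generation


def decode(tag):
    """Pick the named positions out of a 15-position tag."""
    if len(tag) != 15:
        raise ValueError(f"expected 15 positions, got {len(tag)}")
    return {name: tag[pos - 1] for name, pos in POSITIONS.items()}


analysis, generation = build_indexes(TRIPLETS)
print(analysis["pískal"])                            # analysis: form -> lemma, tag
print(generation[("pískat", "VpYS----R-AAI--")])     # generation: lemma, tag -> form
print(decode("VpYS----R-AAI--"))
```

Two dictionaries are used here only because analysis and generation query the same triplets from opposite directions; an actual tagger or generator would rely on a far more compact representation.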

MorfFlex contains more than 1 million lemmas, 56,748 of which are verbs.

[Figure 1: The lexeme odpovídat – odpovědět. Verb forms: odpovídali, odpovídán, odpovídat, odpovídáme, odpovídejte, odpovídaje, …; odpovědět, odpověz, odpoví, odpověděl, odpověděvše, … Lexical units: odpovídat/odpovědět-1 ≈ ‘answer’; odpovídat/odpovědět-2 ≈ ‘react’; odpovídat-3 ≈ ‘be responsible’; odpovídat-4 ≈ ‘correspond’.]

One important rule applies in MorfFlex, referred to as the golden rule of morphology in [4]: any pair <lemma, tag> can be associated with only a single wordform, not more. This requirement guarantees that a unique wordform is derived based on the lemma and tag. To put it differently, it cannot happen that a particular lemma together with a particular tag are attached to more than one wordform. This rule is essential for the unambiguous description of variants: every variant has its unique description, i.e., a unique record in MorfFlex. As an illustration of the principle, compare the two variants of the imperative of the verb plavat ‘to swim’, with their tags differing in the last position:²

<plav, plavat, Vi-S---2--A-I-->
<plavej, plavat, Vi-S---2--A-I-1>

A detailed description of MorfFlex and the adopted principles can be found in [2].

As described above, MorfFlex contains complete morphological characteristics of (all) Czech wordforms, including their grouping into paradigms represented by a lemma. As such, it is a highly valuable source of information on morphology for different types of lexicons like VALLEX.

3. VALLEX

One of the key tasks linguists and NLP specialists intensively study is the possibility of representing natural language semantics. On the linguistic side, in the last 20 years, a lot of effort has been devoted to dictionaries of semantic propositions capturing predicate-argument relations or, in other words, the valency potential of predicates: verbs, nouns, adjectives, and adverbs. Let us mention especially the FrameNet project [5], PropBank [6], VerbNet [7], and OntoNotes [8].

VALLEX,³ the Valency Lexicon of Czech Verbs [9, 10], is a collection of linguistically annotated data and documentation. It provides a formal, machine-readable description of verbal predicates (the term verbal predicate refers here to the meaning counterpart of a “morphological” verb), focusing on the valency properties of Czech verbs and their additional syntactic and semantic characteristics. The lexicon covers common senses of the most frequent Czech verbs: in total, it comprises 11,132 verb senses of almost 4,700 verbs, i.e., more than 6,850 verb lexical units (counting perfective and imperfective verbs as forming a single lexeme, see below). If iterative verbs are also counted, the lexicon covers 5,098 verb lemmas. The lexicon design stresses versatile usability both for a human user and for the automatic processing of Czech.

¹ The morphological tag is a label indicating the morphological features of a wordform. In MorfFlex, it is structured as a string of 15 positions, as described in [3]. The first position indicates the part of speech (N for noun, A for adjective, V for verb, etc.); the following positions are most relevant for verbs: 3 - gender (e.g., F feminine, M masculine animate), 4 - number (P plural, S singular), 8 - person (e.g., 1, 2), 9 - tense (e.g., P present, R past, F future), 12 - voice (A active, P passive) and 13 - aspect (I imperfective, P perfective, B biaspectual). The last position distinguishes variants, see also Sect. 4.3.

² These two records are an example of inflectional variants, see Section 4.3 for more details.

As for the VALLEX formal structure, its basic building blocks correspond to individual lexemes [11, 12]. As sketched in Figure 1, a lexeme is understood as a two-fold abstract entity associating:

• a set of all relevant verb forms (the whole verb morphological paradigms) represented by a set of their lemmas, and
• a set of lexical units corresponding to individual meanings, i.e., “complexes with (relatively) stable, discrete semantic properties”, according to [13].
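The two-fold notion of a lexeme (a set of lemmas on the formal side, a set of lexical units on the meaning side) can be made concrete with a small sketch. The class and field names below are hypothetical and do not reflect the actual VALLEX data format; the data are the lemmas and lexical units of odpovídat – odpovědět shown in Figure 1.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Lemma:
    form: str     # infinitive representing a whole paradigm
    aspect: str   # "impf" or "pf" in this example


@dataclass
class LexicalUnit:
    gloss: str
    lemmas: tuple  # the subset of the lexeme's lemmas carrying this meaning


@dataclass
class Lexeme:
    lemmas: tuple
    units: list = field(default_factory=list)

    def add_unit(self, gloss, lemmas):
        # A lexical unit may only refer to lemmas of its own lexeme.
        unknown = set(lemmas) - set(self.lemmas)
        if unknown:
            raise ValueError(f"lemmas not in this lexeme: {unknown}")
        self.units.append(LexicalUnit(gloss, tuple(lemmas)))

    def units_for(self, form):
        """Glosses of the lexical units available for one lemma."""
        return [u.gloss for u in self.units
                if any(lemma.form == form for lemma in u.lemmas)]


impf = Lemma("odpovídat", "impf")
pf = Lemma("odpovědět", "pf")
lexeme = Lexeme(lemmas=(impf, pf))
lexeme.add_unit("answer", [impf, pf])
lexeme.add_unit("react", [impf, pf])
lexeme.add_unit("be responsible", [impf])
lexeme.add_unit("correspond", [impf])

print(lexeme.units_for("odpovídat"))
print(lexeme.units_for("odpovědět"))
```

Keeping the lemma subset on each lexical unit, rather than on the lexeme alone, is what lets the imperfective lemma carry four meanings while the perfective one carries only two.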

³ https://ufal.mf.cuni.cz/vallex. The data can be downloaded from the LINDAT/CLARIAH-CZ Repository, http://hdl.handle.net/11234/1-4756.

In VALLEX, following the tradition of the Functional Generative Description and the valency theory developed within this approach [14, 15], several “morphological” verbs are typically subsumed under a single lexeme, namely: (i) morphological variants (in a strict sense as described below in Section 4.3), marked off by the slash symbol, e.g., vystříhávat / vystřihávat_impf ‘to cut out’, (ii) perfective and imperfective counterparts, distinguished by their aspect in the subscript, e.g., dávat_impf – dát_pf ‘to give’, and (iii) other verbs traditionally considered as the same predicates, e.g., spolknout_pf1 – spolykat_pf2 ‘to swallow’, distinguished by digits following the aspect value.⁴ In these cases, all the forms are conceived as forms of a single verbal predicate at the syntactic and semantic layer of the language description. As exemplified in Figure 2, the predicate odpovídat_impf – odpovědět_pf with the meaning ‘to answer’ is represented by two lemmas (differing in aspect). Similarly, the two lemma variants stéci/stéct_pf ‘to flow down’ are subsumed together with their imperfective counterpart stékat_impf under the single lexeme stékat_impf – stéci/stéct_pf (Figure 3). However, individual lexical units can be associated with just a subset of lemmas, as exemplified in Figure 1, with four lexical units for the imperfective odpovídat_impf and just two lexical units for the perfective odpovědět_pf.

Figure 3: Lexical unit in VALLEX (predicate stékat – stéci/stéct ‘to flow down’).

VALLEX provides rich syntactic and semantic information for each lexical unit. This information includes, first of all, its valency characteristics in the form of a valency frame; supplementary information such as, e.g., types of applicable diatheses or the possibility to express reflexivity and reciprocity is available, as illustrated in Figure 2. On the other hand, the information concerning relevant forms is limited to the list of relevant lemmas, i.e., infinitive forms of verbs representing the whole morphological paradigms, and their aspect (as it is syntactically and semantically relevant).

Interlinking VALLEX with MorfFlex is thus a natural way to add detailed morphological information to VALLEX.

4. Identification of Corresponding Entries

Several differences originating from the different designs of the two lexicons (as sketched above) had to be solved.

First, two groups of lemmas are more or less systematically excluded from VALLEX but covered in MorfFlex: iteratives as, e.g., běhávat ‘to run (repeatedly)’ (only sporadically covered by VALLEX), and non-standard variants as, e.g., voprášit as a variant of the standard oprášit ‘to dust’ (not covered in VALLEX at all). This is not a mistake: VALLEX ignores them as a rule, so we simply do not process them.

Another difference concerns the forms of lemmas stored both in VALLEX and MorfFlex and related to reflexives (Sect. 4.1) and to asymmetries in treating homographs (Sect. 4.2) and variants (Sect. 4.3).

4.1. Reflexive morphemes

MorfFlex. In MorfFlex, every lemma corresponds to a “morphological” word, i.e., to a sequence of letters delimited by spaces.⁵ For example, odpovídat ‘to answer’ represents a single “morphological” word and a “semantic” word as well. However, there are “semantic” words in Czech consisting of two strings of letters, as, e.g., bát se ‘to fear’ or povšimnout si ‘to notice’. In MorfFlex, such cases are split into two entries, one for the verb lemma (bát or povšimnout) and one for the reflexive (se or si), though the verbs cannot be used without reflexives in well-formed sentences (*Karkulka bála vlka., *Karkulka oblíbila vlka.).⁶

⁴ See also footnote 7 in Sect. 4.3.

⁵ The same principle is applied to all wordforms, i.e., analytical forms of verbs are not covered as units in MorfFlex; e.g., the analytical form budou odpovídat ‘(they) will answer’ is covered by two triplets, namely <budou, být, VB-P---3F-AAI--> and <odpovídat, odpovídat, Vf--------A-I-->.

⁶ Unless the reflexive is elided from the surface sentence, as, e.g., in a dialog („Ty se bojíš samoty?“ „Bojím.“ ‘“Are you afraid of being alone?” “Yes, I am (afraid).”’), or shared in more complex usages (Pepovi si o peníze bála říct. ‘She was afraid to ask Pepa for the money.’, where the only reflexive si appears instead of two ones, se belonging to bát and si belonging to říct, cf. haplology [16]).

VALLEX. In VALLEX, on the other hand, “semantic” words are treated as integral units. Thus, the pairs bát se ‘to fear’ or povšimnout si ‘to notice’ form indivisible units considered as verb lemmas. Similarly, the pair dít se ‘to happen’ is treated here as a single unit (in addition to the verb dít ‘to tell’, see the following Section 4.2 dealing with homographs). In total, VALLEX covers 204 such lemmas (called reflexiva tantum).

Further, for some verbs, both non-reflexive and reflexive counterparts appear in VALLEX (they are interlinked, as their meanings are related but their syntactic patterns differ), as, e.g., bavit (někoho) ‘to amuse (sb)’ and bavit se (s někým) ‘to talk (with sb)’. Such pairs share their morphological paradigms (differing only in the absence/presence of the reflexive), thus they are represented as a single non-reflexive verb in MorfFlex. There are 1,490 lemmas in VALLEX that appear both without and with the reflexive.

In addition, VALLEX distinguishes verbs with optional reflexives, i.e., verbs that appear both without the reflexive and with the reflexive, despite having the same syntactic pattern and the same meaning (202 verb lemmas), as, e.g., mrknout (se) ‘to glance’ (compare the following corpus example and its modification, Počkejte minutku, mrknu se, kde by mohly být. (SYN v10) – Počkejte minutku, mrknu, kde by mohly být. ‘Wait a minute, I will have a look at where they might be.’).

Solution. When interlinking the two lexicons, we ignore the potential reflexives se, si, both obligatory and optional ones. As a consequence, both non-reflexive and reflexive counterparts in VALLEX are mapped onto a single lemma in MorfFlex, as, e.g., bít ‘to beat’ and bít se ‘to fight’ are mapped onto the MorfFlex lemma bít.

4.2. Homographs

The second asymmetry between MorfFlex and VALLEX concerns ambiguous verbs, i.e., verbs with the same lemma but (substantially) differing in their meaning, and thus considered independent units from the semantic point of view.

MorfFlex. MorfFlex does not consider meaning, only morphology. Consequently, two words with (even totally) different meanings do not have separate records unless their morphological features differ. In other words, if the paradigms of the two words are identical, they are not distinguished in MorfFlex.

For example, the verb dít has two different meanings, also varying in the aspect value, each of them with its unique paradigm (compare the forms of the 3rd person singular, present tense: dí ‘(he) tells’ vs. děje se ‘(it) happens’), so it is necessary to have two records to distinguish them. Thus the lemma dít-1_(dít_se) ‘to happen’ or ‘to occur’ (imperfective) differs from the lemma dít-2_(říkat) ‘to tell’ (biaspectual).

On the other hand, the verb topit, despite having two clearly different meanings, namely ‘to produce heat’ and ‘to drown’, is represented by a single lemma, as in both meanings, the verb topit has an identical set of wordforms with the same morphological tags.

This simple and consistent criterion based on the morphological paradigm works well at the morphological level. However, it is inappropriate for VALLEX, where we emphasize syntax and semantics over morphology.

VALLEX. In VALLEX, verbs with different meanings but clearly etymologically connected are treated as separate lexical units within a single lexeme, as illustrated, e.g., by the predicate odpovídat_impf – odpovědět_pf in Figure 1.

In the same way, VALLEX treats the imperfective verb nakupovat ‘to buy’ as a counterpart to the perfective verb nakoupit ‘to buy’ in the single lexeme nakupovat_impf – nakoupit_pf, as these two verbs share the same valency patterns. However, the lemma nakupovat can also be treated as the aspectual counterpart to the verb nakupit ‘to heap’. Thus, it is necessary to distinguish them as homographs in VALLEX, nakupovat_impf – nakoupit_pf ‘to buy’ and nakupovat_impf – nakupit_pf ‘to heap’, even though they are subsumed under the single lemma nakupovat in MorfFlex as they represent a single “morphological” verb (having the same paradigm).

Distinguishing homographs as sketched above allows us to follow the theoretical assumption, adopted from the Functional Generative Description (see Sect. 3), that aspectual counterparts form a single lexeme. Thus, VALLEX can cope with asymmetrical cases where a single “morphological” verb characterizes (a formal part of) two or more lexemes.

However, in contrast to the clear technical criterion adopted in MorfFlex, the semantically (or etymologically) based criterion used in VALLEX is somewhat blurry. Consequently, different lexicons differ in their treatment of homographs since experts from time to time disagree in their interpretations (or their analysis shifts in time as the particular meaning becomes more independent in everyday practice). For example, the comprehensive Slovník spisovného jazyka českého (SSJČ) from the sixties [17] treats the verb hradit in a single entry in all its meanings (incl. ‘to fence; to enclose’ and ‘to cover; to reimburse’), while its more recent (and substantially smaller) successor Slovník spisovné češtiny pro školu a veřejnost (SSČ) [18] distinguishes two homographs, hradit ‘to fence; to enclose’ and hradit ‘to cover; to reimburse’ (VALLEX follows the latter solution, distinguishing two homographs).

The predicate odpovídat_impf – odpovědět_pf, see Figure 1, serves as another example of fuzzy boundaries between individual lexemes: all its meanings are traditionally considered as belonging to the same lexeme, despite their substantial distance (at least from the synchronic point of view).

VALLEX (in its latest public version 4.5) distinguishes 245 homographs formed out of 122 lemmas (i.e., when ignoring the homograph marker and the possible reflexive).

Solution. When interlinking the two lexicons, we ignore the homograph marker in VALLEX and distinguish different lemmas only if they are also distinguished in MorfFlex. Consequently, two or more (homographic) verbs in VALLEX may be mapped onto one or more lemmas in MorfFlex.

In most cases (for 35 lemmas), the information on the aspect of the verb (in the morphological tag) makes it possible to detect the appropriate mapping automatically. Rare cases (5 lemmas) where automatic mapping is not possible are checked and resolved manually. For example, VALLEX contains the homographic lemma stát and distinguishes three verbs: stát_impf ‘to cost’, stát_impf ‘to stand; to be located’, and stát se_pf ‘to happen’. On the MorfFlex side, three verbs with the same lemma appear as well: two imperfective verbs, stát-3_^(stojím_stojíš) ‘to stand’ and stát-5_^(sníh) ‘to melt’, and one perfective, stát-2_^(stanu_staneš) ‘to happen’. Since there is a single perfective verb on each side, their mapping is unproblematic. However, both imperfective verbs from VALLEX correspond to the imperfective verb stát-3_^(stojím_stojíš) (and stát-5_^(sníh) ‘to melt’ is not covered by VALLEX at all). Here manual intervention is necessary.

4.3. Morphological variants

The third asymmetry between MorfFlex and VALLEX concerns morphological variants with identical syntactic and semantic characteristics. In VALLEX, several verb lemmas with the same meaning and the same syntactic and semantic characteristics are typically grouped into one entry, as illustrated above.⁷ In MorfFlex, the approach to variants is different. Variants are only described from the morphological or orthographic point of view, regardless of syntax or semantics.

⁷ Traditional Czech lexicography does not provide a testable criterion for distinguishing variants; thus, the concept of variants applied in the older VALLEX versions (3 and 4) is broader than in MorfFlex.

MorfFlex. Since the latest version, MorfFlex CZ 2.0 [2], two types of variants are recognized and consistently marked in the dictionary.

(i) Global variants. Global variants (also called full-paradigm variants) are those variants that relate to all wordforms of a paradigm, and always in the same way, e.g., vystříhávat and vystřihávat (‘to cut out’, imperfective): the whole paradigms of the verbs differ in the alternation -í- vs. -i- in the root.

Each of the two (or more) global variants has its own lemma with a complete paradigm. One of the variant lemmas is declared the basic one, and the other contains a link to the basic one. In this way, the variants are interconnected. In the previous example, vystříhávat is the basic lemma. The lemma vystřihávat contains the link to vystříhávat (vystřihávat -> vystříhávat).

(ii) Inflectional variants. Inflectional variants (also called wordform variants) are those variants that relate only to some wordforms of a paradigm. In that case, (i) the two (or more) variants have the same lemma and (ii) all the values of all morphological categories are identical. For example, the wordforms kopá and kope ‘(he/she) digs’ belong to the paradigm of the verb kopat ‘to dig’; their morphological features are identical (3rd person singular, present tense). As this variation manifests just in this pair, they are considered inflectional variants (not global). The distinction is expressed through numbers in their morphological tags (at the very last position). The triplets, for example, are:

<kopá, kopat, VB-S---3P-AAI-->
<kope, kopat, VB-S---3P-AAI-1>

Inflectional variants of infinitives. There is an inflectional variant concerning the great majority of verbs that manifests itself in their lemmas. It is the common ending -t vs. the archaic ending -ti. The infinitive kopat ‘to dig’ of the previous example has the inflectional variant kopati. There is no need to artificially create two paradigms (kopat, kopati) differing only in the infinitive; they are both subsumed under the lemma kopat.

<kopat, kopat, Vf--------A-I-->
<kopati, kopat, Vf--------A-I-2>

Further, there are several verbs with another pair of infinitive ending variants, namely -ci vs. -ct, for instance, říci vs. říct ‘to tell’. Again, those variants are inflectional since they relate to the infinitives only (subsumed under the lemma ending with -ci).

We should mention one more type of inflectional variants of infinitives, namely the endings -it vs. -et / -ět, as manifested with the verb muset ‘must’ and with the verb chraptět ‘to rasp’.

<muset, muset, Vf--------A-I-->
<musit, muset, Vf--------A-I-1>
<chraptět, chraptět, Vf--------A-I-->
<chraptit, chraptět, Vf--------A-I-1>

In this case, not only the infinitives but also the wordforms of the past tense show the same difference, e.g., -il vs. -el / -ěl. However, the rest of the wordforms are identical, so according to the definition, the variants are inflectional.
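The MorfFlex treatment of variants described in Sect. 4.3 can be sketched on the triplets quoted there. This is an illustrative sketch only: the function names and the `GLOBAL_LINKS` dictionary are hypothetical, and the real MorfFlex data encode the link between global variants in their own format. The sketch checks the golden rule from Sect. 2 (one wordform per <lemma, tag> pair), finds inflectional variants as records that differ only in the last tag position, and follows a global-variant link to the basic lemma.

```python
from collections import defaultdict

# Triplets <wordform, lemma, tag> quoted in Sect. 4.3.
TRIPLETS = [
    ("kopá", "kopat", "VB-S---3P-AAI--"),
    ("kope", "kopat", "VB-S---3P-AAI-1"),
    ("kopat", "kopat", "Vf--------A-I--"),
    ("kopati", "kopat", "Vf--------A-I-2"),
    ("muset", "muset", "Vf--------A-I--"),
    ("musit", "muset", "Vf--------A-I-1"),
]

# Global variants: non-basic lemma -> basic lemma (example from Sect. 4.3).
GLOBAL_LINKS = {"vystřihávat": "vystříhávat"}


def check_golden_rule(triplets):
    """Raise if some <lemma, tag> pair is attached to two different wordforms."""
    seen = {}
    for wordform, lemma, tag in triplets:
        if seen.setdefault((lemma, tag), wordform) != wordform:
            raise ValueError(f"<{lemma}, {tag}> has more than one wordform")
    return seen


def inflectional_variants(triplets):
    """Group wordforms whose lemma and tag agree up to the last (variant) position."""
    groups = defaultdict(list)
    for wordform, lemma, tag in triplets:
        groups[(lemma, tag[:-1])].append(wordform)
    return {key: forms for key, forms in groups.items() if len(forms) > 1}


def basic_lemma(lemma):
    """Follow the global-variant link, if any, to the basic lemma."""
    return GLOBAL_LINKS.get(lemma, lemma)


check_golden_rule(TRIPLETS)
for (lemma, _), forms in inflectional_variants(TRIPLETS).items():
    print(lemma, forms)
print(basic_lemma("vystřihávat"))
```

Masking only the last tag position is enough here because, by definition, inflectional variants share the lemma and all morphological categories.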

VALLEX. As VALLEX registers just lemmas (as the representative forms of the whole paradigm), only those variants that affect lemmas are relevant for the mapping. All such variants should be mapped onto MorfFlex. In particular, VALLEX explicitly keeps all lemmas representing global variants and all lemmas representing inflectional variants where the lemma itself is affected. The lemma variant -t, -ti represents the only systematic exception, where only the first forms are present in VALLEX.

VALLEX 4.5 (the latest released version) contains 134 groups of lemma variants (ignoring possible homograph markers and reflexives). Mostly, there are two variants for a lemma; in 5 cases, there are three variants (e.g., svléci/svléct/svlíct ‘to undress’).

Solution. As a consequence, global lemma variants in VALLEX are mapped onto all (interconnected) MorfFlex lemmas, for example:

VALLEX: vystřihávat/vystříhávat ‘to cut out’ -> MorfFlex: vystřihávat, vystříhávat
VALLEX: oddechnout/oddychnout/oddýchnout ‘to breathe out; to rest’ -> MorfFlex: oddechnout, oddychnout, oddýchnout

On the other hand, inflectional variants affecting lemmas (listed in VALLEX as well)⁸ are mapped onto the basic lemmas in MorfFlex only, for example:

VALLEX: říci / říct ‘to tell’ -> MorfFlex: říci
VALLEX: muset / musit ‘must’ -> MorfFlex: muset

⁸ Naturally, inflectional variants not affecting the lemma are disregarded in VALLEX.

5. Linking the lexicons

Compiling the list of records for interlinking. The primary and obvious task was to compile a list of lemmas covered by both lexicons. First, we collected the set of lemma–aspect pairs from VALLEX (typically several lemmas from a single lexeme), ignoring possible reflexives (Sect. 4.1) and homograph markers (Sect. 4.2). This list of 3,635 lemma–aspect pairs served as the initial repertoire of lemmas that should be processed.

Second, we matched them with the appropriate records in MorfFlex. On the way, we found several verbal lemmas missing from MorfFlex by mistake (e.g., mlet as an inflectional variant of mlít ‘to grind’). The relevant ones were added to MorfFlex (8 lemmas in total), and the rest of them (5 archaic lemmas, as pékat ‘used to bake’) were removed from VALLEX.

Third, we detected additional lemmas marked in MorfFlex as variants of the already matched lemmas (Sect. 4.3) and added them to the list (12 lemmas in total, as, e.g., colloquial oblíct as a variant of obléci ‘to dress’). They were added to VALLEX as well.

Interlinking the records. The lemma candidates selected for interlinking were automatically checked: the unambiguous MorfFlex lemmas with the same aspect value as the VALLEX ones were automatically added to the new VALLEX attribute -morfflex assigned to each lexical unit.

Aspect. The lemma pairs differing in aspect were manually checked (35 cases). Typically, the variance was caused by the diverse classification of these verbs in the source Czech lexicons (as SSJČ or SSČ mentioned in Sect. 4.2) and in their corpus usage reflecting current Czech and its development (as, e.g., dovést ‘to be able’, which obviously moves from imperfective to perfective). In these cases, the VALLEX and MorfFlex aspect values were harmonized and the pairs of records were interlinked.

Homographs. There were also several cases of ambiguous mapping detected by the automatic procedure, i.e., cases of two MorfFlex lemma candidates with the same aspect value identified as possible counterparts of a particular record in VALLEX. These few cases had to be resolved manually (5 lemmas).

As a by-product, several cases of inappropriate homograph splitting were corrected (namely, 4 archaic lemmas were removed from VALLEX based on the MorfFlex evidence).

Variants. The most significant part of the changes concerned the harmonization of variants due to the strictly morphology-based criterion used in MorfFlex for distinguishing variants. On the MorfFlex side, four inflectional variants (e.g., dožnout / dožít ‘to finish mowing’) and 14 global variants (e.g., nadechnout / nadýchnout ‘to inhale’) were added. As for VALLEX, 47 variants (as detected in previous releases) were separated as non-variants. Nevertheless, the particular lemmas were kept in the same lexical units but marked as non-variants in the updated data (as, e.g., plavat / plovat_impf was replaced by plavat_impf1 – plovat_impf2). On the other hand, two pairs of verbs were connected as variants in VALLEX based on MorfFlex (e.g., utvářet / utvářit_impf ‘to create’). Further, as mentioned earlier, several new variants were added, covering mainly colloquial Czech (including 3 new variants with homograph markers).

The updated VALLEX data cover 108 variant groups (ignoring possible homograph markers and reflexives) based on the strict morphological criterion as adopted in MorfFlex.

Reflexives. Finally, the established mapping was propagated to the reflexive verbs as well.
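The automatic part of the linking procedure in Sect. 5 can be summarized in a short sketch. It is not the project's actual code: the function names, the in-memory data layout, and the regular expressions are assumptions made for illustration, and the toy data are limited to examples discussed in Sects. 4.1 and 4.2. VALLEX lemmas are reduced to lemma–aspect pairs with the reflexive stripped, MorfFlex lemma identifiers are reduced to their base form, and a link is made automatically only when exactly one MorfFlex candidate has the same aspect.

```python
import re
from collections import defaultdict

# Toy MorfFlex side: (lemma identifier, aspect). Identifiers follow the
# "base-N_^(note)" shape of the examples in Sect. 4.2.
MORFFLEX = [
    ("bít", "impf"),
    ("stát-3_^(stojím_stojíš)", "impf"),
    ("stát-5_^(sníh)", "impf"),
    ("stát-2_^(stanu_staneš)", "pf"),
]

# Toy VALLEX side: lemma-aspect pairs, homograph markers already dropped.
VALLEX = [("bít", "impf"), ("bít se", "impf"),
          ("stát se", "pf"), ("stát", "impf")]


def strip_reflexive(lemma):
    """Drop a trailing reflexive se/si, including the optional '(se)' notation."""
    return re.sub(r"\s+\(?(?:se|si)\)?$", "", lemma)


def base_form(morfflex_lemma):
    """Cut the homograph number and note off a MorfFlex lemma identifier.

    Simplification: assumes the base form itself contains no '-' or '_'.
    """
    return re.split(r"[-_]", morfflex_lemma, maxsplit=1)[0]


def link(vallex_pairs, morfflex_entries):
    """Return (linked, manual): unique matches and cases left for manual checking."""
    index = defaultdict(list)
    for ident, aspect in morfflex_entries:
        index[(base_form(ident), aspect)].append(ident)
    linked, manual = {}, {}
    for lemma, aspect in vallex_pairs:
        candidates = index.get((strip_reflexive(lemma), aspect), [])
        if len(candidates) == 1:
            linked[(lemma, aspect)] = candidates[0]
        else:  # no candidate or several candidates with the same aspect
            manual[(lemma, aspect)] = candidates
    return linked, manual


linked, manual = link(VALLEX, MORFFLEX)
print(linked)
print(manual)
```

On this toy data, bít and bít se both link to the MorfFlex lemma bít and the perfective stát se links to its single perfective counterpart, while the imperfective stát is left for manual resolution because two imperfective candidates remain.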

6. Outcome: Updated Lexicons with Interlinked Records

The two updated lexicons, MorfFlex and VALLEX, with interlinked records, represent the project’s main outcome. Technically, the new attribute -morfflex was added to each lexical unit of VALLEX, listing all relevant MorfFlex lemmas (as representatives of the whole morphological paradigms). In total, 5,098 lemmas are interlinked.
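To make the content of the -morfflex attribute concrete for variant groups, the following sketch applies the mapping rule from Sect. 4.3: global variants are all listed, while inflectional variants of a lemma are replaced by the basic MorfFlex lemma. The table `INFLECTIONAL_FORMS` and the function name are hypothetical; in the project this information comes from MorfFlex itself, and the actual serialization of the attribute is not shown here.

```python
# Infinitive variants that MorfFlex files under another lemma
# (inflectional variants; examples from Sect. 4.3).
INFLECTIONAL_FORMS = {"říct": "říci", "musit": "muset"}


def morfflex_lemmas(vallex_group):
    """Map a VALLEX variant group such as 'říci / říct' to MorfFlex lemmas."""
    targets = []
    for lemma in (part.strip() for part in vallex_group.split("/")):
        # Global variants have their own MorfFlex lemma and are kept as they are.
        target = INFLECTIONAL_FORMS.get(lemma, lemma)
        if target not in targets:
            targets.append(target)
    return targets


print(morfflex_lemmas("říci / říct"))
print(morfflex_lemmas("vystřihávat/vystříhávat"))
```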

Within the project, we have verified the theoretical approaches and harmonized the treatment of variants in both lexicons, adopting the clear morphologically based criteria from MorfFlex for distinguishing variants in VALLEX. As a secondary benefit, some minor inconsistencies in both lexicons were detected and corrected. In all cases, these imperfections concerned individual lexicon entries (e.g., missing entries were added, unrecognized variants were linked, and lemmas that do not function as morphological variants were separated), while the overall design of the lexicons proved to be suitable for such an endeavor.

The updated VALLEX data are publicly available in the working version.9 The finalized version will be part of its next public release.

Since we concentrated on morphological characteristics, after analyzing the data in both dictionaries, the vast majority of lemmas were linked automatically (the few detected ambiguities were disambiguated manually as it was not worth inventing some heuristics or relying on machine learning methods for such a minor task).

Unfortunately, this does not apply to dictionaries focused on semantic information, as can be exemplified by the ambitious SemLink project for English [19, 20]. Linking such resources requires extensive manual effort to reach a reasonable quality of the result, as well as large manually annotated corpora with individual predicate senses disambiguated by trained linguists. Only such training data make it possible to design elaborate (semi)automatic linking procedures allowing their users to preserve the mapping of such (necessarily constantly changing) resources. Such data are not available for VALLEX yet; however, VALLEX is being integrated into the SynSemClass project [21], which aims to serve as an inter-connecting data resource.

Acknowledgment

The work on the VALLEX and MorfFlex lexicons was supported by and has been using data and tools provided by the LINDAT/CLARIAH-CZ Research Infrastructure (https://lindat.cz), supported by the Ministry of Education, Youth and Sports of the Czech Republic (project No. LM2023062).

[1] Hlaváčová, Mikulová, Štěpánková, Konzistence morfologického slovníku MorfFlex, Jazykovedný časopis / Journal of Linguistics 72 (2021) 855–861.

[2] Hajič, Hlaváčová, Mikulová, Straka, Štěpánková, MorfFlex CZ 2.0, 2020. URL: http://hdl.handle.net/11234/1-3186, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

[3] Mikulová, Hajič, Hana, Hanová, Hlaváčová, Jeřábek, Štěpánková, B. V. Hladká, Zeman, Manual for Morphological Annotation, Revision for the Prague Dependency Treebank - Consolidated 2020 release, Technical Report TR-2020-64, Institute of Formal and Applied Linguistics, Charles University, 2020.

[4] Hlaváčová, Golden rule of morphology and variants of wordforms, Jazykovedný časopis / Journal of Linguistics 68 (2017) 136–144.

[5] C. F. Baker, C. J. Fillmore, J. B. Lowe, The Berkeley FrameNet project, in: COLING-ACL'98: Proceedings of the Conference, Montreal, Canada, 1998, pp. 86–90. doi:10.3115/980845.980860.

[6] Palmer, Gildea, Kingsbury, The Proposition Bank: An Annotated Corpus of Semantic Roles, Computational Linguistics 31 (2005) 71–106. doi:10.1162/0891201053630264.

[7] Kipper-Schuler, VerbNet: a broad-coverage, comprehensive verb lexicon, Ph.D. thesis, Computer and Information Science Department, University of Pennsylvania, Philadelphia, PA, 2005. URL: http://repository.upenn.edu/dissertations/AAI3179808/.

[8] Weischedel, E. H. and Mitchell Marcus, Palmer, Belvin, Pradhan, Ramshaw, Xue, OntoNotes: A Large Training Corpus for Enhanced Processing, Springer New York, NY, Berlin, 2011, pp. 53–63. doi:10.1007/978-1-4419-7713-7.

[9] Lopatková, Kettnerová, Bejček, Vernerová, Žabokrtský, Valenční slovník českých sloves VALLEX, Nakladatelství Karolinum, Praha, 2016.

[10] Lopatková, Kettnerová, Mírovský, Vernerová, Bejček, Žabokrtský, VALLEX 4.5, 2022. URL: http://hdl.handle.net/11234/1-4756, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

[11] Žabokrtský, Lopatková, Valency Information in VALLEX 2.0: Logical Structure of the Lexicon, The Prague Bulletin of Mathematical Linguistics (2007) 41–60.

[12] Filipec, Čermák, Česká lexikologie, Academia, Praha, 1985.

[13] D. A. Cruse, Lexical Semantics, Cambridge University Press, Cambridge, 1986.

[14] Sgall, Hajičová, J. Panevová, The Meaning of the Sentence in Its Semantic and Pragmatic Aspects, Reidel, Dordrecht, 1986.

[15] Panevová, Valency Frames and the Meaning of the Sentence, in: P. A. Luelsdorf (Ed.), The Prague School of Structural and Functional Linguistics, John Benjamins, Amsterdam/Philadelphia, 1994, pp. 223–243.

[16] Rosen, Haplology of Reflexive Clitics in Czech, in: E. Kaczmarska, M. Nomachi (Eds.), Slavic and German in Contact: Studies from Areal and Contrastive Linguistics, volume 26 of Slavic Eurasian Studies, Slavic Research Center, Sapporo, 2014, pp. 97–116.

[17] Havránek, Bělič, Helcl, A. Jedlička (Eds.), Slovník spisovného jazyka českého, Academia, Praha, 1964.

[18] Filipec, Daneš, Machač, V. Mejstřík (Eds.), Slovník spisovné češtiny pro školu a veřejnost, Academia, Praha, 2003.

[19] Palmer, Semlink: Linking PropBank, VerbNet and FrameNet, in: Proceedings of the Generative Lexicon Conference, GenLex 2009, Pisa, 2009, pp. 9–15.

[20] Stowe, Preciado, Conger, S. W. Brown, G. Kazeminejad, Gung, M. Palmer, SemLink 2.0: Chasing lexical resources, in: Proceedings of the 14th International Conference on Computational Semantics (IWCS), Association for Computational Linguistics, Groningen, The Netherlands (online), 2021, pp. 222–227. URL: https://aclanthology.org/2021.iwcs-1.21.

[21] Urešová, Zaczynska, Bourgonje, E. Fučíková, G. Rehm, Hajič, Making a Semantic Event-type Ontology Multilingual, in: Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), European Language Resources Association, Marseille, France, 2022, pp. 1332–1334.