<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">VESUM: A Large Morphological Dictionary of Ukrainian As a Dynamic Tool</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Vasyl</forename><surname>Starko</surname></persName>
							<email>v.starko@ucu.edu.ua</email>
							<affiliation key="aff0">
								<orgName type="institution">Ukrainian Catholic University</orgName>
								<address>
									<addrLine>2a Kozelnytska Str</addrLine>
									<postCode>79026</postCode>
									<settlement>Lviv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Andriy</forename><surname>Rysin</surname></persName>
							<email>arysin@gmail.com</email>
							<affiliation key="aff1">
								<orgName type="institution">Independent researcher</orgName>
								<address>
									<addrLine>104 Hab Tower Pl</addrLine>
									<postCode>27513</postCode>
									<settlement>Cary</settlement>
									<region>NC</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="department">International Conference on Computational Linguistics and Intelligent Systems</orgName>
								<address>
									<addrLine>May 12-13</addrLine>
									<postCode>2022</postCode>
									<settlement>Gliwice</settlement>
									<country key="PL">Poland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">VESUM: A Large Morphological Dictionary of Ukrainian As a Dynamic Tool</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">9BBACE9C6E57AE28638A79D16738D5ED</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T12:55+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Morphological dictionary</term>
					<term>POS dictionary</term>
					<term>Ukrainian</term>
					<term>VESUM</term>
					<term>POS tagging</term>
					<term>morphological analysis</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The paper describes VESUM, a large morphological dictionary of Ukrainian, as a valuable resource for the analysis and synthesis of Ukrainian morphological data. In line with its manifold practical uses, VESUM supplies a rich set of morphological features for more than 400,000 Ukrainian lemmas. Its lexical range extends beyond what is found in Ukrainian monolingual and grammatical dictionaries to cover proper names, abbreviations, alternative spellings, slang, deprecated items, dialectal and archaic words, etc. VESUM's inflectional paradigms include a number of substandard wordforms (marked as such) that occur in texts and need to be recognized by NLP applications. The paper describes VESUM's structure, morphological information it provides, its use in the LanguageTool language checker and in the Lucene search engine, as well as the dynamic tagging component that acts as a complement to the dictionary itself. VESUM's coverage of different text types is also discussed. The dictionary is provided as an open access source via an online repository for the NLP community and is made available online through a web interface in human-readable, searchable format.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Morphology is critical for many downstream NLP tasks, and morphological lexicons have been created for a number of different languages. They are especially useful for highly inflectional languages and are the building blocks of spellcheckers and parsers. The output of a morphological module is often exploited by NLP applications, such as search engines, information extraction systems, and machine translation systems. The high utility and manifold applicability of a large morphological dictionary is convincingly evidenced by such projects as MorfFlex CZ 2.0. Developed stagewise for over 30 years, this Czech lexicon has more than 100,000,000 wordforms representing over 1,000,000 lemmas <ref type="bibr" target="#b4">[5]</ref>, <ref type="bibr" target="#b17">[18]</ref>. Furthermore, grammatical dictionaries are increasingly made available online <ref type="bibr" target="#b23">[23]</ref> to present, in contrast to most traditional lexicographical works, inflectional paradigms fully and explicitly. This format is helpful to different groups of users, from non-native students of the language to teachers to professional linguists.</p><p>Like other Slavic languages, Ukrainian is highly inflectional: one paradigm may consist of 17-19 forms for a typical noun, 27-32 for verbs (excluding analytical forms), and 32-43 for adjectives. 
Ukrainian inflectional morphology includes:</p><p>• number: singular and plural, with occasionally used, even though generally archaic, forms of dual number;</p><p>• gender: masculine, feminine, and neuter;</p><p>• grammatical case: nominative, genitive, dative, accusative, instrumental, locative, and vocative, with a number of words having multiple possible forms for a given case;</p><p>• person: 1st, 2nd, and 3rd;</p><p>• tense: past, present, and future;</p><p>• aspect: imperfective and perfective;</p><p>• mood: indicative, imperative, and subjunctive;</p><p>• degrees of comparison: positive, comparative, and superlative.</p><p>For various reasons, Ukrainian texts abound in spelling and morphological variants and idiosyncrasies, which greatly complicates the task of practical morphological annotation, even when the goal is to handle contemporary texts only. Extending the scope timewise to earlier periods and geographically to Ukrainian diaspora texts necessitates the use of a significantly richer set of morphological devices. There is a distinct need for a large machine-readable dictionary for morphological analysis that would be suitable for various types of texts, able to handle both standard and nonstandard usage, and providing good coverage in Ukrainian corpora. Our goal here is to explain how VESUM meets these challenges, describe the tools and methods underpinning its development, and analyze its practical application.</p><p>The paper is organized in the following way. In Section 2, we review the relevant works in the domain of Ukrainian morphology. Section 3 describes the composition and features of VESUM. Section 4 details the applications of VESUM, while the next section provides an account of how POS tagging is performed using VESUM. In Section 6, we discuss text coverage achieved with the help of VESUM. Finally, conclusions are drawn and prospects for future development are outlined.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related works</head><p>Ukrainian is a synthetic language in that it expresses grammatical meanings through inflections rather than word combinations. It has a significant degree of irregularity, especially with regard to nominal declension and verbal conjugation. For this reason, an adequate morphological description requires a large number of morphological classes and a number of exceptions to be specified. The creation of a morphological lexicon is, thus, a challenging task, and the situation is compounded by the fact that the existing descriptions of morphological paradigms for Ukrainian do not easily lend themselves to use in NLP. Traditionally, such works single out classes based on both inflections and stress patterns <ref type="bibr" target="#b18">[19]</ref> and are, thus, more complicated than is practically necessary for POS tagging, which does not involve accentuation. This approach is adopted in two academic grammatical dictionaries of Ukrainian, <ref type="bibr" target="#b8">[9]</ref> and <ref type="bibr" target="#b15">[16]</ref>. The former comprises 140,000 lemmas (no proper names) and offers an explicit formalized system of morphological codes representing morphological classes, while the latter has over 260,000 lemmas (including proper names) and appears to employ an in-house system of a similar kind. Neither resource has been updated to any substantial degree since its initial release in 2011. Thus, they do not include numerous words that have entered the language over the past decade or more. 
Probably due to their academic character, neither dictionary attempts to cover slang, substandard lexical items, abbreviations, or alternative spellings.</p><p>In the domain of practical NLP applications, a morphological tagset has been developed and implemented for Ukrainian in an open-source project <ref type="bibr" target="#b6">[7]</ref>, <ref type="bibr" target="#b7">[8]</ref> in conformity with a multilingual system that imposes certain restrictions on individual languages. There are also long-running in-house projects <ref type="bibr" target="#b2">[3]</ref>, <ref type="bibr" target="#b14">[15]</ref> that make use of formalized morphological systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methods and Materials</head><p>As a highly inflectional language, Ukrainian requires a morphological lexicon consisting of lemmas and codes to generate declension and conjugation paradigms, i.e., all the wordforms associated with a given lemma. For a long while, no resource of this type was publicly available for Ukrainian. VESUM <ref type="bibr" target="#b11">[12]</ref> was created to fill this gap. In its current version 5.6.0, the dictionary contains over 416,000 lemmas from which over 6.5 million wordforms are generated. VESUM is a non-commercial project: the dictionary data are available under the CC BY-NC-SA 4.0 license, while its software is distributed under GPLv3.</p><p>VESUM has benefited from some of the best lexicographical and morphological resources available for Ukrainian: an academic grammatical dictionary of Ukrainian <ref type="bibr" target="#b8">[9]</ref>, an academic description of Ukrainian morphology <ref type="bibr" target="#b20">[21]</ref>, a comprehensive overview of dynamic processes in the modern Ukrainian lexicon <ref type="bibr" target="#b5">[6]</ref>, an online dictionary collection <ref type="bibr" target="#b12">[13]</ref>, and other dictionaries.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Composition of VESUM</head><p>As is typical for morphological dictionaries <ref type="bibr" target="#b8">[9]</ref>, lemmas in VESUM are grouped into inflection classes rather than traditional parts of speech, even though the two largely overlap. For example, the inflection class of verbs does not include participles because they have adjectival declension paradigms. VESUM divides all lexis into 13 inflection classes. Apart from such groupings as nouns (tagged as noun), verbs (verb), adjectives (adj), adverbs (adv), adverbial participles (advp), numerals (numr), conjunctions (conj), prepositions (prep), particles (part), and interjections (intj), VESUM includes onomatopoeic words (onomat), transliterated foreign words (foreign), and non-inflected words (noninfl) that cannot be categorized otherwise.</p><p>The source lemmas are distributed across several files, which together comprise the lemma lists. Each entry has the format</p><p>lemma morphological code stylistic tag comments</p><p>For example,</p><p>дружба /n10.p1.ko.&lt; :xp1 # людина</p><p>дружба /n10.p1 :xp2 # стосунки</p><p>where the slash sign / indicates that the word is inflected; n10 encodes the specific nominal inflection group; p1 is the code for generating plural forms; ko defines the ending in the vocative case; &lt; is a flag for a person; xp1 and xp2 identify two homonyms; the comments after # specify that the first homonym is a person (groomsman) and the second one refers to a relationship (friendship).</p><p>VESUM handles homonymy differently from typical monolingual dictionaries: what it recognizes are grammatical homonyms, i.e., identically spelled words that differ in their grammatical categories (e.g., gender) and/or their paradigms. Homonymous lemmas are marked by indices. Especially in the case of uninflected words, the syntactic role is taken into account. 
For example, the entry так adv:&amp;pron:dem|part|conj:coord:subord compactly encodes that this Ukrainian word can function as an adverbial pronoun, a particle, or a conjunction, coordinate or subordinate.</p><p>A number of additional tags are conjoined to the basic ones with the ampersand symbol: &amp;adjp (participle), &amp;&amp;adjp (also a participle), &amp;pron (pronoun), &amp;numr (ordinal numeral), &amp;insert (parenthetical word/wordform), and &amp;predic (predicative).</p><p>Eleven additional tags are used to distinguish the types of pronouns. The complete tagset is available online <ref type="bibr" target="#b11">[12]</ref>. More details on the generation of wordforms from lemmas are provided in <ref type="bibr" target="#b16">[17]</ref>.</p><p>Over time, VESUM's range has been extended to cover an increasing number of types that occur in real texts but are outside standard Ukrainian: slang, professional jargon, recent lexical loans, proper names, abbreviations, variant spellings, as well as a reasonable number of dialectal and archaic words. Thus, while VESUM's predominant focus is on morphology, it does supply certain stylistic tags, some of which are utilized by the LanguageTool module to warn users about style: subst (substandard), coll (colloquial), arch (archaic, outdated, or, in some cases, dialectal), slang, vulg (vulgar), alt (alternative spelling), ua_1992 (spelling under the 1992 rules), ua_2019 (spelling under the 2019 rules), var (variant form), bad (erroneous or objectionable lemma/wordform), and rare (for items with markedly low frequencies). Standard neutral vocabulary is left unmarked.</p><p>In addition to purely morphological and stylistic tags, VESUM utilizes several semantic tags. 
These labels are abbr (abbreviation) and prop (proper name) with further specification: prop:lname (last name), prop:fname (first name), prop:pname (patronymic name), prop:geo (geographical name), and prop:abbr (abbreviated proper name).</p><p>VESUM's output is a flat list of wordforms in the format</p><p>wordform lemma positional tag</p><p>For example, here is a fragment of the paradigm for the verb vesty (lead):</p><p>вести verb:imperf:inf</p><p>веди verb:imperf:impr:s:2</p><p>ведім verb:imperf:impr:p:1</p><p>ведімо verb:imperf:impr:p:1</p><p>ведіть verb:imperf:impr:p:2</p><p>Positional tags are strings of individual tags, each of which encodes a morphological category. The tags are separated by colons: verb, imperfective aspect, imperative mood, singular/plural number, infinitive/person in the example above. Unlike in other morphological dictionaries, such as the MorfFlex Dictionary of Czech <ref type="bibr" target="#b17">[18]</ref>, the names of individual tags are mostly shortened English words. This transparency helps even casual users get a better grasp of the morphological annotation. They can then employ individual tags and their combinations to restrict search queries. This format also conveniently expresses morphological features that can be exploited by rules, taggers, and computer models.</p></div>
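<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a minimal sketch of how such positional tags might be consumed downstream, the Python snippet below splits a tag on the colon and filters wordforms by a combination of individual tags. The helper names split_tag and has_tags are hypothetical and are not part of VESUM or LanguageTool; the wordform/tag pairs are copied from the paradigm fragment of вести quoted above.</p>
<p>
```python
# Hypothetical helpers for illustration; not part of VESUM or LanguageTool.

def split_tag(positional_tag):
    """Split a positional tag such as 'verb:imperf:impr:s:2' into individual tags."""
    return positional_tag.split(":")


def has_tags(positional_tag, *wanted):
    """Return True if every wanted individual tag occurs in the positional tag."""
    return set(wanted).issubset(split_tag(positional_tag))


# Wordform/tag pairs copied from the paradigm fragment of вести quoted above.
PARADIGM = [
    ("вести", "verb:imperf:inf"),
    ("веди", "verb:imperf:impr:s:2"),
    ("ведім", "verb:imperf:impr:p:1"),
    ("ведімо", "verb:imperf:impr:p:1"),
    ("ведіть", "verb:imperf:impr:p:2"),
]

# Keep only plural imperatives, the way a tag-restricted search query might.
plural_imperatives = [form for form, tag in PARADIGM if has_tags(tag, "impr", "p")]
print(plural_imperatives)
```
</p>
<p>Because every individual tag is a short mnemonic, a combination such as impr plus p can be tested with a plain set operation, without any knowledge of tag positions.</p></div>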
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Features</head><p>The distinct features of VESUM that are dictated by its practical orientation and set it apart from other morphological dictionaries can be summarized as follows:</p><p>1. Open-source project</p><p>2. Machine-readable format</p><p>3. Large size (bigger than similar resources)</p><p>4. A compact system of inflection codes</p><p>5. Dynamic nature (the dictionary is constantly enlarged with new lemmas and is used together with a dynamic tagging component)</p><p>6. Wide coverage of proper names: over 54,000 lemmas, including all names of populated areas in Ukraine according to the official register; Ukrainian geographical names introduced in the process of decommunization; more than 3,500 first names, 1,000 patronymic names, and 28,000 last names; a number of foreign proper names</p><p>7. Coverage of non-standard vocabulary: 8,000 erroneous lemmas (with replacements), 1,700 most frequent abbreviations, alternative spellings, 1,500 slang words, and over 1,200 archaic words</p><p>8. Inclusion of rare morphological forms, such as the colloquial infinitive forms ending in -t' rather than -ty and the variant ending -a for the accusative case of some singular masculine nouns, e.g., ножа (knife.Acc.sing)</p><p>9. Information on case government, e.g., rv_oru after an adjectival lemma means that it governs a noun in the instrumental case</p><p>10. Suitability for expansion with other types of linguistic information (phonetic, semantic, etc.) to be applied in the course of text processing.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiment</head><p>One of VESUM's defining features is its practical focus and integration with several related projects. The dictionary is geared toward practical application, handling real-life Ukrainian texts with their complexities and irregularities, formalization of morphological information, and availability in machine- and human-readable form. Its origins can be traced back to the ispell-uk project for spellchecking under Linux. The dictionary was later adapted to perform more sophisticated spelling and grammar checking in Pravopysnyk <ref type="bibr" target="#b16">[17]</ref>, the Ukrainian module of the LanguageTool language checker <ref type="bibr" target="#b10">[11]</ref>.</p><p>One of the milestones in VESUM's development came in 2017 when a new Ukrainian morphological analyzer was created for the Apache Lucene search engine and the Ukrainian-language Wikipedia articles were re-indexed. Since then, full-text search based on VESUM's morphological data has also been used in other web projects. The morphological toolkit containing VESUM has been used in the lang-uk project <ref type="bibr" target="#b3">[4]</ref> for lemmatization, POS tagging of the UberText corpus (over 665 million tokens), and building word vectors. An earlier version of VESUM was converted into the OpenCorpora format <ref type="bibr" target="#b1">[2]</ref> and used in the morphological module of the pymorphy2 library and derivative systems for Ukrainian <ref type="bibr" target="#b19">[20]</ref>.</p><p>The most challenging material for VESUM as a dynamic morphological tool has been presented by the General Regionally Annotated Corpus of Ukrainian (GRAC) <ref type="bibr" target="#b13">[14]</ref>, which is the most diverse corpus of Ukrainian, running a total of over 650 million tokens, spanning over two centuries (1816-2021), composed of over 90,000 texts in various genres written by 20,000 authors who used different spelling systems. 
VESUM and GRAC form a dynamic tandem: iteratively, VESUM is used to lemmatize and POS tag each new version of GRAC; a list of unrecognized words, sorted by frequency, is then generated from the tagged corpus; new lemmas are extracted by expert linguists from this list, semi-automatically coded, manually verified, and added to VESUM. Subtle modifications have been made in the grammatical apparatus to enable it to handle irregular forms, archaic words, alternative spellings, etc. Over the years, this approach has been mutually beneficial: GRAC has received increasingly better-fitted POS tagging, while VESUM has grown and improved its coverage by drawing lexical items from a wide variety of textual sources.</p><p>VESUM is highly sensitive to language change: it includes new geographical names that became official in Ukraine as a result of the 2015 decommunization laws, feminine forms of nouns that have gained currency over the past several years, and morphological and spelling changes introduced in the new official spelling rules of 2019.</p><p>While VESUM is primarily intended for machine use, it can also be highly useful to anyone interested in Ukrainian morphology and inflection, to both native speakers and non-native students of Ukrainian. We have developed a web interface for VESUM available at r2u.org.ua/vesum to function as an online grammatical dictionary for a wide audience of human users. Search queries can be adapted (via a checkbox) to focus on lemmas or specific indirect forms. Question marks and asterisks can be used in queries to replace, respectively, one character and zero or more characters. VESUM has evolved over the years to include various forms, such as substandard and archaic forms, that are not normally presented in academic dictionaries of Modern Ukrainian. Figure <ref type="figure" target="#fig_0">1</ref> below illustrates that a typical paradigm for an adjective, such as зелений 'green', includes long forms (зеленая, зеленую, зеленеє, 
зеленії) and the short form зелен, which are missing from other such dictionaries. However, they occur in older texts, in elevated speech, in poetry, and in some other instances. Thus, the text-based approach persistently implemented in the compilation of VESUM for an extended period of time leads to enhanced coverage of real language phenomena as compared to other similar resources. The web interface also provides quick links that let the user look up the word in question in a collection of Russian-Ukrainian and English-Ukrainian dictionaries, in an explanatory dictionary of Ukrainian, and in the GRAC corpus. Moreover, there are links to the full tagset used and statistics.</p><p>Other uses of VESUM include the compilation of various types of dictionaries, linguistic research, NLP research, and so on.</p></div>
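<div xmlns="http://www.tei-c.org/ns/1.0"><p>The wildcard convention described for the web interface (a question mark for exactly one character, an asterisk for zero or more) can be illustrated with a short Python sketch. This is not the implementation behind r2u.org.ua/vesum, which is not described here; it relies on the standard fnmatch module, which follows the same two conventions but additionally interprets square brackets. The function name search is hypothetical, and the wordforms are the forms of зелений mentioned above.</p>
<p>
```python
from fnmatch import fnmatchcase

# Forms of the adjective зелений mentioned in the text above.
FORMS = ["зелений", "зеленая", "зеленую", "зеленеє", "зеленії", "зелен"]


def search(pattern, forms=FORMS):
    """Return the forms matching a pattern with ? (one char) and * (any run)."""
    return [form for form in forms if fnmatchcase(form, pattern)]


print(search("зелен??"))  # forms with exactly two characters after the stem
print(search("зелен*"))   # every form, including the short form
```
</p>
<p>A query such as зелен?ї would narrow the result down to the single form зеленії, since only that form has one arbitrary character between the stem and the final ї.</p></div>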
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results</head><p>A static morphological dictionary can hardly be expected to comprehensively cover the multitude of words, such as hyphenated adjectives (for example, українсько-англійський 'Ukrainian-English') or nominal compounds that are created using active patterns of the language in question. For this reason, VESUM has been supplied with a dynamic tagging component that processes lexical items of this kind in Ukrainian texts. Ukrainian has a number of combining forms, such as бізнес- 'business' and онлайн- 'online', which are joined with other words with a hyphen, e.g., онлайн-магазин 'online store'. The dynamic tagging component recognizes these and several other types of hyphenated nouns, adjectives, and adverbs with 95% accuracy <ref type="bibr" target="#b16">[17]</ref>.</p><p>VESUM comes together with a set of disambiguation rules that remove a limited number of ambiguous forms based on frequency information and context. The overall task of ambiguity resolution is to be dealt with at a separate stage.</p><p>For POS tagging, VESUM is converted into the morfologik format <ref type="bibr" target="#b10">[11]</ref> that compactly encodes sequences of the type wordform/lemma/tags. The search is based on finite state automata and is performed using LanguageTool functions. Java code has been added to LanguageTool to carry out dynamic tagging specifically for Ukrainian. The same morfologik format is used in the Ukrainian module of Lucene/ElasticSearch <ref type="bibr" target="#b22">[22]</ref>, and the content is optimized for search.</p><p>For anyone interested in POS tagging Ukrainian texts using VESUM, the LanguageTool API NLP UK project, available from github.com along with VESUM, provides the TagText utility, along with a tokenizer and a lemmatizer for Ukrainian. 
TagText calls LanguageTool functions and the Ukrainian dynamic tagging module to perform sentence splitting, lemmatization, POS tagging, and disambiguation. Several hundred cases of ambiguity are also resolved. The user has a choice of receiving TagText's output in text form or XML and collecting several types of statistical data. The tagged version of Ukrainian text can then be used as input for disambiguation based on transition probabilities and neural networks.</p><p>Figure <ref type="figure" target="#fig_1">2</ref> shows a fragment from the Constitution of Ukraine POS-tagged by TagText with output in XML format. As can be seen, the tagger has not performed disambiguation for these tokens. Only a handful of disambiguation rules are implemented in the tagger, and work continues on full-fledged disambiguation.</p></div>
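<div xmlns="http://www.tei-c.org/ns/1.0"><p>The idea of dynamic tagging for hyphenated compounds can be sketched in a few lines of Python. The actual component is Java code inside LanguageTool and its rules are not reproduced here; the snippet only assumes that a compound whose left part is a known combining form and whose right part is in the dictionary inherits the analysis of the right part. The names COMBINING_FORMS, LEXICON, and tag_token are hypothetical, the lexicon is a one-word toy, and the tag is simplified to the bare inflection class.</p>
<p>
```python
# Toy data and hypothetical names; the real component is Java code in LanguageTool.
COMBINING_FORMS = {"бізнес", "онлайн"}

# Toy lexicon: wordform mapped to (lemma, simplified tag) pairs.
LEXICON = {
    "магазин": [("магазин", "noun")],
}


def tag_token(token):
    """Return (lemma, tag) analyses for a token, or an empty list if unknown."""
    if token in LEXICON:
        return list(LEXICON[token])
    left, hyphen, right = token.partition("-")
    if hyphen and left in COMBINING_FORMS and right in LEXICON:
        # Assumption: the compound inherits the tag of its right-hand part.
        return [(left + "-" + lemma, tag) for lemma, tag in LEXICON[right]]
    return []


print(tag_token("онлайн-магазин"))
```
</p>
<p>Under these assumptions the dictionary needs an entry only for магазин, while онлайн-магазин and бізнес-магазин are analyzed on the fly; a token whose right part is unknown stays unrecognized.</p></div>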
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Discussion</head><p>A practical NLP researcher is interested in text coverage, i.e., what percentage of tokens in a given text are recognized by a morphological tagger. To this end, we have generated statistics by using the TagText tagger on Ukrainian texts after filtering out Russian words. The results are presented in Table <ref type="table">1</ref>. Five corpora (scientific texts, fiction, news, a random GRAC sample, and the Wikipedia corpus from the lang-uk project) have been processed. On non-encyclopedic texts, VESUM achieves a consistent rate of 97-99% in terms of wordforms. The rates of recognized types exhibit greater variance: 76% for scientific texts and fiction; 82% for the GRAC sample, and 85% for news. The Wikipedia corpus <ref type="bibr" target="#b3">[4]</ref> is special in that it contains a disproportionately large number of low-frequency proper names. Nevertheless, VESUM achieves text coverage of 95% on Wikipedia articles. An analysis of the unrecognized type list for Wikipedia has revealed the following: unrecognized proper names account for 40% of the total type count; 50% of all unrecognized types have frequencies below 10; 34% occur only once in this corpus; the list contains numerous misspellings, words in Latin script, and alphanumeric expressions.</p></div>
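<div xmlns="http://www.tei-c.org/ns/1.0"><p>The difference between coverage in terms of tokens and in terms of types can be made concrete with a small Python sketch. It is not the code used to produce the reported statistics; the function names, the ten-token sample, and the three-word stand-in for the dictionary are all hypothetical.</p>
<p>
```python
from collections import Counter


def coverage(tokens, is_recognized):
    """Return (token coverage, type coverage) as fractions between 0 and 1."""
    counts = Counter(tokens)
    if not counts:
        return 0.0, 0.0
    known = [word for word in counts if is_recognized(word)]
    token_coverage = sum(counts[word] for word in known) / sum(counts.values())
    type_coverage = len(known) / len(counts)
    return token_coverage, type_coverage


def unrecognized_by_frequency(tokens, is_recognized):
    """List unrecognized types with their counts, most frequent first."""
    unknown = Counter(word for word in tokens if not is_recognized(word))
    return unknown.most_common()


# Toy data: a ten-token text and a three-word stand-in for the dictionary.
TOKENS = ["і"] * 5 + ["мова"] * 3 + ["корогв"] * 2
LEXICON = {"і", "мова", "корогов"}

token_cov, type_cov = coverage(TOKENS, lambda word: word in LEXICON)
print(token_cov, type_cov)
print(unrecognized_by_frequency(TOKENS, lambda word: word in LEXICON))
```
</p>
<p>In the toy sample eight of ten tokens but only two of three types are recognized, which mirrors the pattern reported above: frequent words are covered, so token coverage is systematically higher than type coverage.</p></div>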
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Bogdan Babych <ref type="bibr" target="#b0">[1]</ref> presents data on VESUM's coverage of types, rather than tokens, in four corpora (news, fiction, law, and Wikipedia) and develops an algorithm for morphological processing of out-of-vocabulary items. The author reports percentages of unrecognized types that are in agreement with our results for fiction but are higher for news and much more so (by up to 33 percentage points) for the Wikipedia corpus. The higher percentage of unrecognized types in his experiment may be attributed to three factors: 1. The use of an earlier version of VESUM, which has nearly 10% fewer lemmas than the current one. 2. Text quality, such as news scraped from the web, especially with insufficient language filtering that fails to filter out Russian texts. 3. Possible non-use of the dynamic tagging component.</p><p>From our experience of processing various large Ukrainian text corpora, the tail of frequency distribution of unrecognized types is composed predominantly of 1) proper names; 2) misspellings and Russian words (written in Cyrillic, which in many cases makes them graphically indistinguishable from Ukrainian words); 3) foreign (mainly English) words written in Latin script. Fiction texts may also contain a number of archaic words and spellings. Given a sufficiently large corpus, especially collected from the Internet, unrecognized types in group 2 may reach frequencies of 10 or higher. An algorithm for automatic paradigm induction, such as the one suggested in <ref type="bibr" target="#b0">[1]</ref>, would greatly increase effective coverage of the first class, generate paradigms for pseudo-lemmas for the second, and have no effect on the third.</p><p>VESUM is aimed at handling diverse but legitimate Ukrainian vocabulary. 
It does cover thousands of lemmas that are outside the standard language, but no attempt is made to cover outright misspellings, Russian vocabulary, or words in Latin script. That said, VESUM's coverage of proper names can be improved in two ways: by adding the more frequent ones to VESUM and by complementing the lexicon with an algorithm along the lines of <ref type="bibr" target="#b0">[1]</ref> to treat OOV items. Another area of possible improvement involves coverage of ungrammatical but still common forms. For example, the incorrect form корогв (banner.Gen.sing) occurs 11 times in the Wikipedia corpus. It could be added to VESUM and listed alongside the correct form корогов with the tag subst (substandard). This way, the incorrect forms will be recognized during POS tagging and supplied with the correct lemma, increasing utility for NLP applications and human users. Many substandard forms have already been incorporated into VESUM, but their inclusion has been somewhat limited because LanguageTool handles these items and misspellings automatically by applying the minimum edit distance algorithm and suggesting correct forms.</p></div>
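<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers unfamiliar with the minimum edit distance mentioned above, the following Python sketch computes the classic Levenshtein distance and picks the closest candidate from a word list. It does not reproduce LanguageTool's actual suggestion mechanism; the function names and the two-word candidate list are hypothetical.</p>
<p>
```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def closest(word, candidates):
    """Return the candidate with the smallest edit distance to the word."""
    return min(candidates, key=lambda candidate: edit_distance(word, candidate))


# Toy candidate list; a real checker would search the whole dictionary.
print(edit_distance("корогв", "корогов"))
print(closest("корогв", ["корова", "корогов"]))
```
</p>
<p>The incorrect form корогв is a single insertion away from корогов, so a distance-based suggester can propose the correct form even when the incorrect one is absent from the dictionary. Listing it in VESUM with the subst tag would, in addition, give it a lemma during POS tagging.</p></div>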
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusions and Further Development</head><p>In contrast to other resources for Ukrainian, VESUM is a large dictionary with wider coverage of the Ukrainian word stock, including proper names, abbreviations, non-standard wordforms and lemmas, slang, alternative spellings, and dialect and archaic words. VESUM supplies a series of stylistic and semantic labels, and its efficiency is increased with the help of a dynamic tagging module. VESUM has an accompanying toolkit for the morphological analysis of Ukrainian.</p><p>The dynamic tagging module can be enhanced with techniques similar to those suggested in <ref type="bibr" target="#b0">[1]</ref>. This approach may address the issue of new terminology and proper names that appear in texts and are not (yet) covered by VESUM.</p><p>The morphological dictionary can be complemented with a semantic lexicon to enable both morphological and semantic tagging of Ukrainian texts in one pass.</p><p>Overall, VESUM is a powerful open-access source of morphological data for Ukrainian that is already used in several large-scale projects. It achieves high text coverage on various types of texts and can be effectively used in computational linguistics research and NLP applications. It also serves as a rich source of morphological data to a wide range of users via a searchable web interface. With its practical text-oriented approach and growing lemma count, VESUM is a useful and dynamic tool for the evolving Ukrainian language.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: The paradigm of the adjective 'green' in VESUM's web interface at r2u.org.ua/vesum.</figDesc><graphic coords="6,86.20,72.00,404.50,565.69" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Output of the TagText tagger in xml format.</figDesc><graphic coords="7,89.35,444.61,430.49,252.55" type="bitmap" /></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Unsupervised Induction of Ukrainian Morphological Paradigms for the New Lexicon: Extending Coverage for Named Entities and Neologisms Using Inflection Tables and Unannotated Corpora</title>
		<author>
			<persName><forename type="first">B</forename><surname>Babych</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing</title>
				<meeting>the 7th Workshop on Balto-Slavic Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1" to="11" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Chaplinsky</surname></persName>
		</author>
		<ptr target="https://github.com/dchaplinsky/LT2OpenCorpora" />
		<title level="m">LT2OpenCorpora</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Kompiuterna linhvistyka</title>
		<author>
			<persName><forename type="first">N</forename><surname>Darchuk</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Computational Linguistics</title>
				<meeting><address><addrLine>Kyiv</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
		<respStmt>
			<orgName>Kyiv University</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m">Lang-uk</title>
		<author>
			<persName><forename type="first">V</forename><surname>Dyomkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chaplinsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Stegnii</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Marikovskyi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Tykhonov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Petriv</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shekhovtsov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chalyi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kodliuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pavliuchenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Kunikevych</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kh</forename><surname>Skopyk</surname></persName>
		</author>
		<ptr target="https://lang.org.ua/en/corpora/" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">MorfFlex CZ 2.0, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL)</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hajič</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hlaváčová</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mikulová</surname></persName>
		</author>
		<ptr target="http://hdl.handle.net/11234/1-3186" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
		<respStmt>
			<orgName>Faculty of Mathematics and Physics, Charles University</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Dynamichni protsesy v suchasnomu ukrainskomu leksykoni [Dynamic Processes in the Modern Ukrainian Lexicon]</title>
		<author>
			<persName><forename type="first">N</forename><surname>Klymenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Karpilovska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kysliuk</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2008">2008</date>
			<publisher>Dmytro Burago Publishing House</publisher>
			<pubPlace>Kyiv</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Overview of the Ukrainian language resources within the multilingual European MULTEXT-East project</title>
		<author>
			<persName><forename type="first">N</forename><surname>Kotsyba</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SISN</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="122" to="129" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">MULTEXT-East Morphosyntactic Specifications</title>
		<author>
			<persName><forename type="first">N</forename><surname>Kotsyba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Shevchenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Derzhanski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mykulyak</surname></persName>
		</author>
		<ptr target="http://nl.ijs.si/ME/V4/msd/html/msd-uk.html" />
		<imprint>
			<date type="published" when="2010">2010</date>
			<publisher>Ukrainian Specifications</publisher>
		</imprint>
	</monogr>
	<note>version 4. 3.11</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Hramatychnyi slovnyk ukrainskoi literaturnoi movy</title>
		<author>
			<persName><forename type="first">V</forename><surname>Krytska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Nedozym</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Orlova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Puzdyrieva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yu</forename><surname>Romaniuk</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Slovozmina [Grammatical Dictionary of the Ukrainian Literary Language. Inflection]</title>
				<meeting><address><addrLine>Kyiv</addrLine></address></meeting>
		<imprint>
			<publisher>Dmytro Burago Publishing House</publisher>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Developing an Open-Source, Rule-Based Proofreading Tool</title>
		<author>
			<persName><forename type="first">M</forename><surname>Miłkowski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Software: Practice and Experience</title>
		<imprint>
			<biblScope unit="volume">40</biblScope>
			<biblScope unit="issue">7</biblScope>
			<biblScope unit="page" from="543" to="566" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m">Morfologik</title>
		<author>
			<persName><forename type="first">M</forename><surname>Miłkowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weiss</surname></persName>
		</author>
		<ptr target="https://github.com/morfologik/morfologik-stemming" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Large Electronic Dictionary of Ukrainian (VESUM)</title>
		<author>
			<persName><forename type="first">A</forename><surname>Rysin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Starko</surname></persName>
		</author>
		<ptr target="https://github.com/brown-uk/dict_uk" />
	</analytic>
	<monogr>
		<title level="j">Version</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">0</biblScope>
			<biblScope unit="page" from="2005" to="2022" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">O</forename><surname>Telemko</surname></persName>
		</author>
		<ptr target="https://r2u.org.ua" />
		<title level="m">Russian-Ukrainian Dictionaries</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Rysin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Starko</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Yu</forename><surname>Marchenko</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2007">2007-2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Shvedova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Waldenfels</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yarygin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rysin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Starko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Nikolajenko</surname></persName>
		</author>
		<ptr target="http://uacorpus.org/" />
		<title level="m">GRAC: General Regionally Annotated Corpus of Ukrainian</title>
				<meeting><address><addrLine>Kyiv; Lviv, Jena</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017-2022</date>
		</imprint>
	</monogr>
	<note>Electronic resource</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Korpusna linhvistyka</title>
		<author>
			<persName><forename type="first">V</forename><surname>Shyrokov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Corpus Linguistics</title>
				<meeting><address><addrLine>Dovira, Kyiv</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">&quot;Slovnyky Ukrainy&quot; online</title>
		<author>
			<persName><forename type="first">V</forename><surname>Shyrokov</surname></persName>
		</author>
		<idno>2001-2022</idno>
		<ptr target="https://lcorp.ulif.org.ua/dictua/" />
	</analytic>
	<monogr>
		<title level="m">Dictionaries of Ukraine</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Velykyi elektronnyi slovnyk ukrainskoi movy (VESUM) iak zasib NLP dlia ukrainskoi movy [Large Electronic Dictionary of Ukrainian (VESUM) As an NLP Tool for Ukrainian]</title>
		<author>
			<persName><forename type="first">V</forename><surname>Starko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rysin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Halaktyka Slova [Lexical Galaxy]</title>
				<meeting><address><addrLine>Kyiv</addrLine></address></meeting>
		<imprint>
			<publisher>Dmytro Burago Publishing House</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="135" to="141" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">The MorfFlex Dictionary of Czech as a Source of Linguistic Data</title>
		<author>
			<persName><forename type="first">B</forename><surname>Štěpánková</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mikulová</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hajič</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of XIX EURALEX Congress: Lexicography for Inclusion</title>
		<title level="s">Democritus University of Thrace</title>
		<meeting>XIX EURALEX Congress: Lexicography for Inclusion<address><addrLine>Thrace, Greece</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="387" to="392" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Slovozmina ukrainskoi movy [Inflection of the Ukrainian Language]</title>
		<author>
			<persName><forename type="first">O</forename><surname>Taranenko</surname></persName>
		</author>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">System of Intellectual Ukrainian Language Processing</title>
		<author>
			<persName><forename type="first">N</forename><surname>Tmienova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Sus</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ITS</title>
		<imprint>
			<biblScope unit="volume">2019</biblScope>
			<biblScope unit="page" from="199" to="209" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Teoretychna morfolohiia ukrainskoi movy</title>
		<author>
			<persName><forename type="first">I</forename><surname>Vykhovanets</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Horodenska</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Theoretical Morphology of Ukrainian</title>
		<imprint>
			<date type="published" when="2004">2004</date>
			<publisher>Pulsary</publisher>
			<pubPlace>Kyiv</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Weiss</surname></persName>
		</author>
		<ptr target="https://github.com/apache/lucene/tree/2183756f1c8253002bb697bdb8e026e86c4b3db5/lucene/analysis/morfologik/src/java/org/apache/lucene/analysis/uk" />
		<title level="m">Ukrainian Morfologik Analyzer</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">The Online Version of Grammatical Dictionary of Polish</title>
		<author>
			<persName><forename type="first">M</forename><surname>Woliński</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Kieraś</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">LREC 2016</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="2589" to="2594" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
