<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Computer linguistic system architecture for Ukrainian language content processing based on machine learning</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Victoria</forename><surname>Vysotska</surname></persName>
							<email>victoria.a.vysotska@lpnu.ua</email>
							<affiliation key="aff0">
								<orgName type="institution">Lviv Polytechnic National University</orgName>
								<address>
									<addrLine>Stepan Bandera 12</addrLine>
									<postCode>79013</postCode>
									<settlement>Lviv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<address>
									<postCode>2024</postCode>
									<settlement>Lviv-Shatsk</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Computer linguistic system architecture for Ukrainian language content processing based on machine learning</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">898C7B32BC6E66690C348A378B5E15E8</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:25+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>natural language processing</term>
					<term>Ukrainian text</term>
					<term>NLP</term>
					<term>computer linguistics</term>
					<term>machine learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The general architecture of computer linguistic systems (CLS) is developed based on the main processes of information resource processing, such as integration, maintenance and content management, as well as on methods of intellectual and linguistic analysis of the text flow using machine learning technology. The information technology of intellectual analysis of the text flow based on information resource processing has been improved, which made it possible to adapt the typical structure of the content integration, management and support modules to various natural language processing (NLP) problems and to increase the efficiency of CLS functioning by 6-9%. The main NLP methods based on regular expression (RE) pattern matching in the grapheme and morphological analysis of Ukrainian-language texts are described. NLP methods based on pattern-matching regular expressions have been improved, which made it possible to adapt text tokenization and normalization methods to cascades of simple regular expression substitutions and finite state machines. The main valid operations of regular expressions are defined: union and disjunction of symbols/strings/expressions, counting and precedence operators, as well as anchors as special symbols that identify the presence/absence of symbols in an RE. The main stages of tokenization and normalization of Ukrainian text by cascades of simple regular expression substitutions and finite state machines are defined. The morphological analysis (MA) method for Ukrainian-language text, based on word segmentation and normalization, sentence segmentation and a modified Porter stemming algorithm, was improved as an effective means of identifying lemma affixes for marking the analyzed word, which made it possible to increase the accuracy of keyword search by 9%. 
Algorithms for word segmentation and normalization, sentence segmentation, and modified Porter stemming are implemented and described as an effective way of identifying lemma affixes for marking the analyzed word. Unlike the classic Porter algorithm (which does not achieve high accuracy even for English-language texts), the modified one is adapted specifically to the Ukrainian language and gives an accurate result in 85-93% of cases, depending on the quality, style and genre of the text and, accordingly, on the content of the CLS dictionaries. The minimum edit distance between strings of Ukrainian text is described as the minimum number of operations necessary to transform one string into the other.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Let's consider the architectural patterns of CLS design based on supporting the life cycle of the ML model for monitoring/managing the pipeline (information flow) of content (Fig. <ref type="figure" target="#fig_0">1</ref>) <ref type="bibr" target="#b0">[1]</ref>. The standard content processing pipeline implements an iterative process consisting of the stages of creating and deploying the machine learning (ML) process <ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref><ref type="bibr" target="#b3">[4]</ref><ref type="bibr" target="#b4">[5]</ref>. The process of monitoring/managing the content pipeline should also include additional stages that improve the quality and efficiency of NLP problem solving <ref type="bibr" target="#b5">[6]</ref><ref type="bibr" target="#b6">[7]</ref><ref type="bibr" target="#b7">[8]</ref><ref type="bibr" target="#b8">[9]</ref>. At the construction stage, raw integrated content is filtered of noise/duplicates and formatted into a form suitable for further processing/management: conducting experiments on it, transferring it to ML models for classification/clustering/prediction/evaluation, etc. <ref type="bibr" target="#b9">[10]</ref><ref type="bibr" target="#b10">[11]</ref><ref type="bibr" target="#b11">[12]</ref>. At the stage of content analysis and support, the content is deployed to determine the best ML model for making assessments/forecasts that directly affect the regular user and target audience. </p></div>
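As a rough illustration of the pipeline stages described above (integration, filtering of noise/duplicates, transformation into a form suitable for ML models), consider a minimal Python sketch; all function and stage names here are illustrative assumptions, not part of the CLS described in the text.

```python
# Minimal sketch of the content-processing pipeline: integrate -> clean -> vectorize.
# Stage names and functions are illustrative assumptions.

def ingest(raw_docs):
    """Integration: collect raw content from sources."""
    return list(raw_docs)

def clean(docs):
    """Construction: filter noise and duplicates, normalize the form."""
    seen, out = set(), []
    for d in docs:
        d = " ".join(d.split())  # collapse whitespace "noise"
        if d and d not in seen:  # drop empty and duplicate documents
            seen.add(d)
            out.append(d)
    return out

def vectorize(docs):
    """Transform cleaned text into simple bag-of-words count vectors."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        v = [0] * len(vocab)
        for w in d.lower().split():
            v[index[w]] += 1
        vectors.append(v)
    return vocab, vectors

def run_pipeline(raw_docs):
    """Full chain over raw input content."""
    return vectorize(clean(ingest(raw_docs)))

vocab, vecs = run_pipeline(["Київ  столиця", "Київ столиця", ""])
print(vocab)  # duplicate and empty documents have been dropped
print(vecs)
```

The cleaned, vectorized output is what would then be handed to the ML models for classification/clustering/prediction, as the text describes.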
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related works</head><p>Based on feedback and model output, the target audience interacts with the CLS, which facilitates the adaptation of the selected learning model. Five stages of related processes define the basic architectural principles for building a typical CLS. The processes of monitoring, processing and managing content are interaction, formatting/filtering, NLP, machine learning <ref type="bibr" target="#b12">[13]</ref><ref type="bibr" target="#b13">[14]</ref><ref type="bibr" target="#b14">[15]</ref> and data accumulation in the DS. For content analysis and support processes, respectively, these are feature analysis, deployment, prediction, interpretation, and content/result presentation. At the interaction stage, a set of rules for integrating content from multiple reliable sources at certain time intervals is necessary. In parallel, a set of rules for checking the data entered by the CLS user is also required as a preliminary step for the formatting/filtering stage, according to a collection of rules preset by the moderator and content from the DS <ref type="bibr" target="#b15">[16]</ref><ref type="bibr" target="#b16">[17]</ref><ref type="bibr" target="#b17">[18]</ref><ref type="bibr" target="#b18">[19]</ref><ref type="bibr" target="#b19">[20]</ref><ref type="bibr" target="#b20">[21]</ref>. The next stage, NLP, is a preparatory intermediate stage for machine learning and data accumulation. The machine learning stage can take various forms, from SQL queries to various software modules. The support process is easier to implement than the management stage, provided that the latter is implemented correctly, especially during NLP analysis, in which additional lexical resources and artefacts (dictionaries, translators, regular expressions, etc.) are created, on which the effectiveness of CLS functioning directly depends (Fig. 
<ref type="figure" target="#fig_1">2</ref>) <ref type="bibr" target="#b0">[1]</ref><ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref>. The process of transition from raw text to a developed machine-learning model consists of a sequence of additional content transformations. First, the input textual content is transformed into an input corpus as a collection of texts, accumulated and stored in the DS. The incoming content is further grouped, filtered, formatted, linguistically processed, marked, normalized and converted into vectors for further processing. In the final transformation, the model/models (Fig. <ref type="figure" target="#fig_3">3</ref>) are trained on the vector corpus, and a generalized representation of the original content is created for further use in solving a specific NLP problem <ref type="bibr" target="#b0">[1]</ref><ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref><ref type="bibr" target="#b3">[4]</ref><ref type="bibr" target="#b4">[5]</ref><ref type="bibr" target="#b5">[6]</ref>. An ML-based CLS architecture with accelerated or even automatic model generation should support and optimize content transformation with ease of testing and tuning. The process of generating an optimal ML model is a complex cyclic algorithm, the main stages of which are the formation of a collection of features, model selection, and hyperparameter adjustment. After each iteration, the results are evaluated to determine the optimal collection of features, models, and parameters for solving a specific NLP problem with the appropriate input data <ref type="bibr" target="#b0">[1]</ref><ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref><ref type="bibr" target="#b3">[4]</ref><ref type="bibr" target="#b4">[5]</ref><ref type="bibr" target="#b5">[6]</ref>.</p></div>
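The cyclic model-generation process described above (forming a collection of features, selecting a model, adjusting hyperparameters, evaluating after each iteration) can be sketched as a simple exhaustive search over model forms; the toy scoring function and all names below are assumptions for illustration only, not the paper's implementation.

```python
# Sketch of the cyclic model-generation process: enumerate candidate model
# "forms" (feature set + hyperparameter), evaluate each, keep the best.
# The toy train_and_score function is an illustrative assumption.
from itertools import product

def train_and_score(feature_set, alpha, data):
    """Toy stand-in for training and evaluation: the score counts how many
    selected features appear in the data, damped by hyperparameter alpha."""
    hits = sum(1 for f in feature_set if f in data)
    return hits / (1.0 + alpha)

def select_model(feature_sets, alphas, data):
    """One full selection cycle over all model forms; returns the best
    (score, features, hyperparameter) triple found."""
    best = None
    for feats, alpha in product(feature_sets, alphas):  # iterate model forms
        score = train_and_score(feats, alpha, data)
        if best is None or score > best[0]:
            best = (score, feats, alpha)
    return best

data = {"ресурс", "контент", "текст"}
best = select_model(
    feature_sets=[("ресурс", "контент"), ("текст",)],
    alphas=[0.0, 0.5],
    data=data,
)
print(best)
```

In a real CLS the inner call would train and cross-validate an actual model; the structure of the loop (forms in, best trained model out) is the point being illustrated.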
<div xmlns="http://www.tei-c.org/ns/1.0"><p>According to <ref type="bibr" target="#b0">[1]</ref><ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref><ref type="bibr" target="#b3">[4]</ref><ref type="bibr" target="#b4">[5]</ref><ref type="bibr" target="#b5">[6]</ref>, there are three main notions in statistical ML: a class of models, a form of a model, and a trained model. The class of models defines the relationship between the variables and the formed goal (for example, a linear model, a recurrent neural network, etc.). A model form is a specific configuration of a model: a collection of features, an algorithm, or a collection of hyperparameters. A trained model is a model form that has been trained on a specific data set and adapted to make predictions. CLSs consist of many trained models built during model selection, the process that creates and evaluates model forms.</p></div>
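The distinction between a model class, a model form, and a trained model can be illustrated with a minimal Python sketch, assuming a toy linear model trained by stochastic gradient descent; none of this code comes from the paper itself.

```python
# Illustration of the three levels named above: a model *class* (linear model),
# a model *form* (the class plus chosen hyperparameters), and a *trained model*
# (the form fitted to a specific data set). All names are illustrative.

class LinearModel:                       # the model class
    def __init__(self, lr=0.1):          # fixing the hyperparameter gives a model form
        self.lr = lr
        self.w = 0.0
        self.b = 0.0

    def fit(self, xs, ys, epochs=200):   # training the form yields a trained model
        for _ in range(epochs):
            for x, y in zip(xs, ys):
                err = (self.w * x + self.b) - y
                self.w -= self.lr * err * x   # per-sample gradient step
                self.b -= self.lr * err
        return self

    def predict(self, x):
        return self.w * x + self.b

form = LinearModel(lr=0.05)                       # a specific model form
trained = form.fit([0, 1, 2, 3], [1, 3, 5, 7])    # data follows y = 2x + 1
print(trained.predict(4))                         # close to 9 after training
```

Model selection, in these terms, is the loop that proposes many such forms, trains each, and keeps the trained model that evaluates best.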
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Materials and Methods</head><p>Any natural language text is initially a collection of non-random unstructured data serving as input content to the CLS. But usually, the text is formed based on certain linguistic rules that make these data understandable. The purpose of the integration module is to transform this collection of non-random unstructured data into structured/semi-structured fields (records) or markup for convenient interpretation by CLS modules. ML methods (for example, supervised learning) make it possible to train (and retrain) statistical models as the language changes during NLP processes. By generating ML models on context-sensitive corpora, CLSs can apply narrow semantic values to improve accuracy without the need for additional interpretation.</p><p>Formally, the ML model of the Ukrainian language has to supplement an incomplete input phrase with the missing words/phrases that are most likely to complete the content of the statement according to the previous text (context analysis for further guessing/predicting the meaning). Usually, a competently and correctly constructed text is predictable based on its coherence. Calculating the entropy (the degree of uncertainty/unpredictability) of the probability distribution of the Ukrainian language model measures the degree of predictability of the text. Thus, the unfinished phrases Київ - столиця... [Kyyiv - stolytsya...] (Kyiv is the capital...) and сонце сходить на... [sontse skhodytʹ na...] (the sun rises in...) have low entropy, and statistical speech models are highly likely to guess the continuations України [Ukrayiny] (of Ukraine) and сході [skhodi] (the east), respectively. Expressions with high entropy like ми йдемо в гості до... [my ydemo v hosti do...] (we go to visit...) and я зустрів сьогодні... [ya zustriv sʹohodni...] (I met today...) offer many continuation options (parents, friends, neighbours and colleagues are all equally likely without analyzing the previous context). Speech models can make inferences or identify connections between lexemes. Formally, the model uses context to identify a narrow decision space from a small set of options. The application of statistical ML methods (supervised and unsupervised) allows the generation of speech models that extract meaning from texts to support text predictability. First, the characteristic features of the content are identified to predict the goal. Textual data provides many opportunities to extract surface features based on parsing and breaking up sentences/utterances/phrases (e.g. bag of words), as well as based on extracted morphological/syntactic/semantic features. Special attention is paid to linguistic/contextual/structural features.</p><p>1. An example of the analysis of a linguistic feature can be the identification of the predominant gender in a fragment of a news text (the role of gender) in different contexts <ref type="bibr" target="#b0">[1]</ref> to identify gender biases regarding the subject of publications. 
In the gender analysis of the text, words in the feminine and masculine gender are used to form a frequency assessment of gender characteristics, i.e.</p><p>𝑆𝑖𝑛𝑔 𝐺𝑆 =&lt; 𝑋 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 , 𝑊 𝑀𝑎𝑙𝑒 , 𝑊 𝐹𝑒𝑚𝑎𝑙𝑒 , 𝑊 𝑈𝑛𝑘𝑛𝑜𝑤𝑛 , 𝑊 𝐵𝑜𝑡ℎ , 𝑓 𝑔𝑒𝑛𝑑𝑒𝑡𝑖𝑧𝑒 , &gt;,</p><p>(1) where 𝑋 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 is analyzed sentence/expression; 𝑊 𝑀𝑎𝑙𝑒 is a set of words with the sign of a man; 𝑊 𝐹𝑒𝑚𝑎𝑙𝑒 is a set of words with the attribute woman; 𝑊 𝑈𝑛𝑘𝑛𝑜𝑤𝑛 is a set of words with an unknown gender sign; 𝑊 𝐵𝑜𝑡ℎ is a set of words with the sign of a man and a woman; 𝑓 𝑔𝑒𝑛𝑑𝑒𝑡𝑖𝑧𝑒 is the operator for identifying the gender class of a sentence.</p><p>𝑆𝑖𝑛𝑔 𝐺𝑆 = 𝑓 𝑔𝑒𝑛𝑑𝑒𝑡𝑖𝑧𝑒 (𝑋 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 , 𝑊 𝑀𝑎𝑙𝑒 , 𝑊 𝐹𝑒𝑚𝑎𝑙𝑒 , 𝑊 𝑈𝑛𝑘𝑛𝑜𝑤𝑛 , 𝑊 𝐵𝑜𝑡ℎ ), 𝑆𝑖𝑛𝑔 𝐺𝑆 = { 𝑁 𝑀𝑎𝑙𝑒 &gt; 0, 𝑁 𝐹𝑒𝑚𝑎𝑙𝑒 = 0 → 𝑚𝑎𝑙𝑒 𝑁 𝑀𝑎𝑙𝑒 = 0, 𝑁 𝐹𝑒𝑚𝑎𝑙𝑒 &gt; 0 → 𝑓𝑒𝑚𝑎𝑙𝑒 𝑁 𝑀𝑎𝑙𝑒 &gt; 0, 𝑁 𝐹𝑒𝑚𝑎𝑙𝑒 &gt; 0 → 𝑏𝑜𝑡ℎ 𝑢𝑛𝑘𝑛𝑜𝑤𝑛 <ref type="bibr" target="#b1">(2)</ref> where 𝑁 𝑀𝑎𝑙𝑒 is the number of words with the sign of a man in the analyzed sentence 𝑋 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 ; 𝑁 𝐹𝑒𝑚𝑎𝑙𝑒 is the number of words with the sign female in the analyzed sentence 𝑋 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 .</p><p>It is also necessary to determine the frequency of words, gender signs and sentences in the entire publication:</p><formula xml:id="formula_0">𝑆𝑖𝑛𝑔 𝑇𝑆 =&lt; 𝑋 𝑇𝑒𝑥𝑡 , 𝑆 𝑁𝐺 , 𝑁 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 , 𝑊 𝑁𝐺 , 𝑓 𝑐𝑜𝑢𝑛𝑡𝑔𝑒𝑛𝑑𝑒𝑟 &gt;, 𝑆𝑖𝑛𝑔 𝑇𝑆 = 𝑓 𝑐𝑜𝑢𝑛𝑡𝑔𝑒𝑛𝑑𝑒𝑟 (𝑁 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 , 𝑆 𝑁𝐺 , 𝑊 𝑁𝐺 , 𝑓 𝑔𝑒𝑛𝑑𝑒𝑡𝑖𝑧𝑒 (𝑋 𝑇𝑒𝑥𝑡 , 𝑋 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 )),<label>(3)</label></formula><p>where 𝑋 𝑇𝑒𝑥𝑡 is analyzed publication text; 𝑆 𝑁𝐺 is a set of numbers of 𝑋 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 sentences of the analyzed text 𝑋 𝑇𝑒𝑥𝑡 marked by gender; 𝑁 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 is the number of sentences in the analyzed text 𝑋 𝑇𝑒𝑥𝑡 ; 𝑊 𝑁𝐺 is the set of the number of words of each gender characteristic for each marked sentence 𝑋 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 ; 𝑓 𝑐𝑜𝑢𝑛𝑡𝑔𝑒𝑛𝑑𝑒𝑟 is an operator of identification and classification/marking of all sentences of the analyzed text 𝑋 𝑇𝑒𝑥𝑡 by gender based on 𝑓 𝑔𝑒𝑛𝑑𝑒𝑡𝑖𝑧𝑒 .</p><formula xml:id="formula_1">𝑆𝑖𝑛𝑔 𝑇𝑆 = [ 𝑆 𝑁𝐺 [𝑆𝑖𝑛𝑔 𝐺𝑆 ]+= 1 𝑊 𝑁𝐺 [𝑆𝑖𝑛𝑔 𝐺𝑆 ]+= 𝑙𝑒𝑛(𝑋 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 ) 1 𝑁 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 (4)</formula><p>For gender identification, 
it is necessary to parse the original text of publications with the subsequent marking of sentences and words based on the NLTK library: </p><formula xml:id="formula_2">𝑆𝑖𝑛𝑔 𝑇𝑃 =&lt; 𝑋 𝑇𝑒𝑥𝑡 ,</formula><p>where 𝑁 𝑤𝑜𝑟𝑑 is the number of words in the analysed text 𝑋 𝑇𝑒𝑥𝑡 ; 𝑁 𝐺𝑒𝑛𝑑𝑒𝑟 is the number of classifications by gender (in this particular case -4); 𝑊 𝑁𝐺 𝑘 is the number of words in sentences of a certain gender sign; 𝑆 𝑁𝐺 is the set of the number of sentences in the analyzed text of a certain gender sign; 𝑝𝑐𝑒𝑛𝑡 𝑘 is the percentage of publication text belonging to a certain gender sign; 𝑆𝑖𝑛𝑔 𝐺𝑆 𝑘 is a specific gender sign; 𝑁 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 𝑘 is the number of sentences in the analyzed text of a specific gender sign; 𝑆 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 is a set of sentences identified by parsing in the analysed text 𝑋 𝑇𝑒𝑥𝑡 ; 𝑊 𝑊𝑜𝑟𝑑 is a collection of sets of words identified by parsing in each sentence of the analyzed text 𝑋 𝑇𝑒𝑥𝑡 ; 𝑊 𝑊𝑜𝑟𝑑 is the set of all words of the text 𝑋 𝑇𝑒𝑥𝑡 ; 𝑡𝑜𝑡𝑎𝑙 is the number of all words in the analysed text 𝑋 𝑇𝑒𝑥𝑡 . Such a deterministic mechanism demonstrates how the content/frequency of use of words/phrases (especially stereotypical ones) affects the predictability of the content according to the previous context (the gender sign is built directly into the Ukrainian language -every noun has a gender). But speech signs are not always decisive, for example, plural and time are used to analyse language/processes/actions/events in time.</p><p>2. An example of the analysis of a contextual feature can be the analysis of moods or sentiment analysis of a text (emotional colouring when discussing a specific topic by a relevant group of people). Usually used in complex analysis of feedback from users, for example, e-commerce, the polarity of messages or reactions to events/phenomena, in social networks or in political/economic discussions/forums, etc. 
In superficial sentiment analysis, a mechanism similar to the gender classification above (positive/negative/neutral colouring of a word) is usually used. For example, positive - чудовий [chudovyy] (wonderful), прекрасний [prekrasnyy] (beautiful), правдивий [pravdyvyy] (true); negative - лінивий [linyvyy] (lazy), поганий [pohanyy] (bad), дратівливий [drativlyvyy] (annoying); and neutral - білий [bilyy] (white), сонячний [sonyachnyy] (sunny), космічний [kosmichnyy] (cosmic). But mood is not a feature of the language itself and depends on the meaning of the words/phrases in the surrounding context of the text. For example, the word кумедний [kumednyy] (funny) has several interpretations of conveyed mood, in particular, positive - смішний клоун [smishnyy kloun] (funny clown), negative - кумедний одяг [kumednyy odyah] (funny clothes), and neutral - кумедний кіт [kumednyy kit] (funny cat) or кумедна іграшка [kumedna ihrashka] (funny toy). The word гострий [hostryy] (sharp) next to перець [peretsʹ] (pepper) or ніж [nizh] (knife) has a positive meaning when shopping, but next to біль [bilʹ] (pain) or ніж [nizh] (knife) in a criminal case, it has a negative meaning. 
Also, negation turns the meaning of a positive text with positive words into a negative one and vice versa, for example, ми дуже багато очікували від відпочинку на морі сонячними гарними днями, але обіцяна курортна база відпочинку все спаскудила [my duzhe bahato ochikuvaly vid vidpochynku na mori sonyachnymy harnymy dnyamy, ale obitsyana kurortna baza vidpochynku vse spaskudyla] (we expected a lot from a vacation at the sea on sunny, beautiful days, but the promised holiday resort spoiled everything), where one negative word спаскудила [spaskudyla] (spoiled) outweighs all the previous positive ones, or дощ, прохолода та вітер не стали перепонами гарно відпочити в чудовій компанії [doshch, prokholoda ta viter ne staly pereponamy harno vidpochyty v chudoviy kompaniyi] (rain, coolness and the wind did not become an obstacle to a good rest in a wonderful company). Only thanks to machine learning is it possible in such cases to achieve text predictability and reveal the emotional colouring according to the context. An a priori deterministic/structural approach loses the flexibility of context and meaning, so most speech models take into account the location of words in context, using ML methods for prediction. The main method of developing simple speech models is the bag of words as the frequency of co-occurrence of words in a narrow, limited context (Fig. <ref type="figure" target="#fig_4">4</ref>). Such evaluation helps to determine the probable neighbourhood of words and their meaning from small fragments of text. Next, using statistical inference methods, word order can be predicted. This is quite simple for English texts, where words are not inflected. For Ukrainian-language texts, it is better to use not a bag of words but a bag of word stems (bases). 
For example, for 12 word combinations as 3-grams (36 words) without taking declension into account, we will get a matrix of size 20×20, and with declension, gender and person taken into account (analysis of word bases only), a matrix of size 15×15. Moreover, for the Ukrainian language, the position of bases within a 3-gram is usually not important and often has an unambiguous probability of compatibility in terms of content, for example, інформаційний ресурс (інформ ресурс) [informatsiynyy resurs (inform resurs)] (information resource (inform resource)) and ресурс інформації (ресурс інформ) [resurs informatsiyi (resurs inform)] (information resource (inform resource)). The bag-of-words/stems model is also extended by analyzing the co-occurrence of stable phrases and fragments of expressions, which are of great importance for identifying the meaning of the text. The expressions зелений край скатертини (межа) [zelenyy kray skatertyny (mezha)] (green edge of the tablecloth (border)) and зелений край батьківщини (місцевість) [zelenyy kray batʹkivshchyny (mistsevistʹ)] (green land of the homeland (locality)) in the form of 3-grams carry different meanings. That is, there are several interpretations of the word край (edge) alone (the boundary of an object, a piece, the end of an action/state, a special area, a place of residence, an administrative-territorial unit). Statistical analysis of n-grams makes it possible to distinguish patterns of context. Speech models based on the analysis of n-gram contexts require the ability to explore the relationship of the text to some target variable. The application of the analysis of linguistic and contextual features contributes to the formation of the general predictability of the text. 
However, their identification and further use require the ability to parse/identify the linguistic units of the language.</p><formula xml:id="formula_4">
0 електр
0 0 інтелект
0 0 0 інформ
0 0 2 0 комерц
0 1 0 0 0 комп'ютер
0 0 0 0 0 0 контент
2 0 0 0 0 0 0 лінгвіст
2 0 0 0 0 1 1 0 мов
1 0 0 0 0 0 0 0 0 опрацюв
0 0 0 1 0 0 1 0 1 0 пошук
0 0 1 1 0 0 1 0 0 0 0 природ
1 0 0 0 0 0 0 0 2 1 0 0 ресурс
0 0 0 1 0 0 0 0 0 1 0 0 0 систем
0 1 1 1 1 1 0 1 0 0 0 0 0 0 текст
2 0 0 0 0 0 3 1 0 1 1 0 0 0 0 1)
</formula><p>3. An example of the analysis of a structural feature can be the construction of an ontology for the implementation of IIS. Along with linguistic and contextual features, it is then necessary to identify and process high-level language units to define a vocabulary of operations for the text corpus. Different units of language are processed at different levels, and the correct implementation of NLP methods based on ML is important for the operational and correct identification of the linguistic context (semantic relationship structure). Based on a typical pattern of utterances (a statement or simple phrase) of the form subject → verb → object → object definition (subject → predicate → complement), ontologies are constructed that define specific relationships between entities. They make it possible to solve the problem of the lack of a mandatory word order in a Ukrainian sentence when identifying its semantics. This approach is advisable for tasks where large volumes of text data must be processed continuously and there is long-term resource support for the project. Semantic analysis consists not only in identifying the content of the text but also in generating data structures to which logical reasoning can be applied. Thematic Meaning Representations (TMR) are used to encode sentences in the form of predicate structures based on first-order logic or lambda calculus (λ-calculus). 
Network/graph structures are used to encode interactions of predicates of relevant text features. Then a traversal is implemented to analyze the centrality of terms or subjects and the reasons for the relationships between elements. Graph analysis is usually not a complete semantic analysis, but it helps to form part of important logical decisions or conclusions. Semantics, syntax and morphology make it possible to enrich simple text strings with linguistic meaning and to generate new meaningful text content. Nowadays, natural language is one of the most commonly used forms of content. Its analysis makes it possible to increase the usefulness of data applications and make them an integral part of everyday life. Scalable analysis and machine learning of text primarily require up-to-date knowledge and text corpora of the relevant subject area (SA). For example, in the field of finance, a CLS needs to identify financial terms, stock abbreviations and company names. Therefore, documents in the SA corpus must contain these entities. That is, the development of any CLS begins with obtaining textual data of the appropriate type and forming a corpus with the structural and contextual features of the SA.</p></div>
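The bag-of-stems and n-gram co-occurrence analysis discussed in this section can be sketched in a few lines of Python; the crude truncation "stemmer" below is an assumption standing in for a real Ukrainian stemmer such as the modified Porter algorithm mentioned in the abstract, and all names are illustrative.

```python
# Sketch of bag-of-stems co-occurrence counting over 3-gram windows, in the
# spirit of the stem matrix above. The naive stem() is an assumption, not the
# paper's stemming algorithm.
from collections import Counter
from itertools import combinations

def stem(word):
    """Crude stand-in for a Ukrainian stemmer: lowercase + truncate."""
    return word.lower()[:6]

def cooccurrence(tokens, n=3):
    """Count unordered stem pairs that occur inside the same n-gram window,
    reflecting that base order inside a 3-gram is usually unimportant."""
    stems = [stem(t) for t in tokens]
    counts = Counter()
    for i in range(len(stems) - n + 1):
        window = stems[i:i + n]
        for a, b in combinations(sorted(set(window)), 2):
            counts[(a, b)] += 1
    return counts

tokens = "інформаційний ресурс опрацювання інформаційного ресурсу".split()
pairs = cooccurrence(tokens)
# The inflected forms інформаційний/інформаційного and ресурс/ресурсу map to
# the same stems, so their pair is counted from every window it appears in.
print(pairs)
```

Counting on stems rather than surface forms is exactly what shrinks the matrix (20×20 down to 15×15 in the example above), since declined variants collapse into one row.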
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments, results and discussions</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Method of grapheme analysis of the Ukrainian language</head><p>For the grapheme analysis (GA) of text strings, it is best to use regular expressions (RE) as algebraic notations describing sets of character strings. REs are commonly used in the development/maintenance of every type of computer language (programming, communication protocols, data markup, specification, and design), in text editors and word-processing software, and especially with IIS template collections or SA text corpora. Identification/search of a fragment/string by pattern in a sequence of character strings is implemented to find either all matches or the first one. The patterns use the special characters [, ], ^, \, -, ?, *, +, ., $, |, (, ), _, {, }, etc., including /, but the latter is not part of the RE itself; it marks its boundaries. The simplest RE is a tuple of plain characters (Table <ref type="table" target="#tab_3">1</ref>) that recognizes the first or all pattern-like occurrences of character sequences. Square brackets [] denote ranges of characters (RE 13-15); inside [], the caret ^ denotes negation of the range (RE 16-17). The question mark special character ? in RE rules 20-21 marks optional characters in the searched string. This is useful when a character may be either present or absent in a certain sequence, something [] alone cannot express. Inside [] one can indicate the absence of a specific symbol from the range of possible ones, but not the absence of any symbol at all, which distinguishes [] from ?. The dot special character . in RE rules 22-23 marks the position of an arbitrary symbol in the analyzed string. While ? marks the absence or presence of a single symbol, the special symbol * (RE 26-29) marks zero or more consecutive occurrences of the preceding symbol or RE in the recognized line, i.e. the result can also be a line without this symbol. Hence, to find at least one symbol of a possible sequence of two identical ones, see RE 29, and for two different ones, RE 30. The + special character in RE rules 30-31 marks one or more occurrences of the immediately preceding symbol/RE. {} (RE 32) indicates an exact quantity (for example, exactly 2 times). The dot special character . is often used together with * to denote any string of characters (RE 33).</p><p>An anchor is a special symbol (for example, the caret ^ or the dollar sign $) specifying the location of the RE within the character string. The caret ^ can mark the beginning of a line (RE 34). The dollar sign $ recognizes the end of a line (RE 35-36). The backslash \ makes it possible to recognize special characters in the character string of the input text (RE 37-38). The anchors \b and \B identify the presence and absence of word boundaries, respectively (RE 39-42). A word is any tuple of digits, underscores or letters (without special characters).</p><p>To organize a choice between alternatives, for example between synonyms, the disjunction operation based on the special symbol | is used (RE 43-46). Combining | inside () arranges disjunction recognition only for a specific part of the pattern, taking into account different inflexions/prefixes (RE 44). The special characters () are also used to organize counters of type * (RE 46). 
The difference is that * alone applies to one character, while () lets it apply to a whole sequence.</p><p>For complex disjunctive RE operators, when grouping different special symbols, the concept of priority is used (Table <ref type="table" target="#tab_7">2</ref>): () → *, +, ?, {} → string, ^, $ → | from the highest to the lowest. Greedy RE patterns of the type /[а-я]*/ recognize zero or more letters, including no match at all, expanding the identification to cover as long a string as they can. Non-greedy REs based on *? and +? find the smallest possible text. An RE of the type /˽*/ is used to indicate the absence or presence of a certain number of spaces, since there can always be additional spaces around. There are aliases for common ranges that can be used primarily to preserve the grapheme type (Table <ref type="table" target="#tab_8">3</ref>). Correctly constructed REs avoid errors of assumption (over-recognition) and negation (accidental misses). Reducing the overall error rate for GA implies two antagonistic conditions for generating a collection of REs: increasing recall (minimizing false ignores) and increasing precision (minimizing false recognitions). RE /{9}/ recognizes exactly 9 occurrences of the previous symbol/expression, RE /а.{3}я/ recognizes five-character sequences beginning with а and ending with я, RE /{3,12}/ recognizes from 3 to 12 occurrences of the previous symbol/expression, RE /{5,}/ at least 5 occurrences of the preceding character/expression, and RE /{,13}/ up to 13 occurrences of the preceding character/expression. The special character s before an RE allows the matched expression to be replaced with a pattern. The special character \k indicates the location of a character/phrase/expression as a duplicate of the k-th capture group, i.e. the pattern in (), where k is the number of the brackets or capture group. Thus, the special characters () have a double function in RE: to group conditions and to determine the order of application of operators. 
For grouping without fixing the received template in the register, an RE of the form (?:pattern) is used as a group that does not capture the expression. When applying RE, the rank of use in the queue is determined. An RE of the type (?=pattern) is a positive assertion (RE 23).</p><p>The (?=pattern) operator is a positive lookahead identifying a zero-width pattern, i.e. the match pointer is not advanced. The (?!pattern) operator is a negative lookahead: it succeeds if the pattern does not match, is zero-width, and the cursor does not advance. Negative assertions are usually used in the analysis of a complex content model when a special case needs to be removed (RE 24).</p><p>Grapheme analysis is the preliminary processing and transformation of the text into a certain marked and compressed format for the following NLP processes (Fig. <ref type="figure" target="#fig_6">5</ref>): extracting content → extracting paragraphs → extracting sentences within a paragraph → extracting tokens within a sentence → marking tokens with tags for MA as part-of-speech marking.</p></div>
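The behaviour of the special characters described above can be checked directly with Python's re module; the sketch below uses illustrative strings only (it does not reproduce RE-rules 23-46 themselves) and demonstrates counters, anchors, greedy vs non-greedy matching and lookahead assertions:

```python
import re

# Kleene star *: zero or more occurrences of the preceding symbol
assert re.search(r"колос*я", "колося")
# {n} / {n,m}: exact or bounded repetition; . matches any character
assert re.fullmatch(r"а.{3}я", "абвгя")          # а, any 3 chars, я
# Anchors: ^ start of line, \b word boundary
assert re.search(r"^мова\b", "мова калинова")
# Disjunction | inside (), covering different inflexions of one base
assert re.search(r"книг(а|и|ою)", "з книгою")
# Greedy vs non-greedy counters
s = "<a><b>"
assert re.match(r"<.*>", s).group() == "<a><b>"  # greedy: longest match
assert re.match(r"<.*?>", s).group() == "<a>"    # non-greedy: shortest
# Zero-width lookahead: (?=...) positive, (?!...) negative
assert re.search(r"Львів(?=ська)", "Львівська область")
assert re.search(r"Львів(?!ська)", "Львівщина")
```

Each assertion holds because the match pointer behaves exactly as described in the text: the lookahead variants test the context without consuming it.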
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Grapheme segmentation and labeling Saving content</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Marked content</head><p>At the first stages of integrating content from various sources, it is necessary to implement the processes of filtering, access and calculation of text sizes based on the standard API of pre-grapheme processing for the division of documents, through the execution of the following sequence of NLTK methods: </p></div>
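The paragraph → sentence → word cascade behind this sequence of methods can be approximated by a minimal pure-Python sketch; the operator names mirror those used in the text (𝑓 𝑟𝑎𝑤, 𝑓 𝑝𝑎𝑡𝑎𝑠, 𝑓 𝑠𝑒𝑛𝑡𝑠, 𝑓 𝑤𝑜𝑟𝑑𝑠), while the tokenization REs are simplified assumptions, not the system's actual rules:

```python
import re

def f_raw(text: str) -> str:
    """Return the raw text of a document (identity stand-in)."""
    return text

def f_paras(text: str):
    """Paragraphs: blocks of text separated by a blank line."""
    for block in re.split(r"\n\s*\n", f_raw(text)):
        if block.strip():
            yield block.strip()

def f_sents(text: str):
    """Sentences within each paragraph (naive split after . ! ?)."""
    for para in f_paras(text):
        for sent in re.split(r"(?<=[.!?])\s+", para):
            if sent:
                yield sent

def f_words(text: str):
    """Word tokens within each sentence; punctuation kept as tokens."""
    for sent in f_sents(text):
        yield re.findall(r"\w+|[^\w\s]", sent, re.UNICODE)

doc = "Перший абзац. Друге речення!\n\nДругий абзац."
assert len(list(f_paras(doc))) == 2
assert len(list(f_sents(doc))) == 3
```

Each stage is a generator, so large corpora can be streamed without loading the whole document tree into memory.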
<div xmlns="http://www.tei-c.org/ns/1.0"><head>WWW</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Information resource</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Grapheme analysis</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Paragraphs</head><formula xml:id="formula_5"> 𝑓 𝑟𝑎𝑤 ()</formula><formula xml:id="formula_6">𝑓 𝑚𝑎𝑟𝑘 (𝑓 𝑡𝑜𝑘𝑒𝑛𝑠 (𝑓 𝑤𝑜𝑟𝑑𝑠 (𝑓 𝑠𝑒𝑛𝑡𝑠 (𝑓 𝑝𝑎𝑡𝑎𝑠 (𝑓 ℎ𝑡𝑚𝑙 (𝑓 𝑟𝑎𝑤 ())))))),<label>(6)</label></formula><p>and, if necessary, additional methods, such as adding tags or parsing sentences, converting annotated text into tree-like data structures, or extracting individual XML elements. To identify and extract the main content from an information resource with an undefined structure and high variability of documents from different sources, 𝑓 ℎ𝑡𝑚𝑙 () based on the Python readability-lxml library is used, which removes all anomalous artefacts, leaving only the text. When processing HTML text, 𝑓 ℎ𝑡𝑚𝑙 () uses a collection of formal REs to identify and remove navigation menus, advertisements, script tags, and CSS, then creates a new content object model tree, extracts the text from the source tree, and embeds it into the newly created tree.</p><p>Vectorization, feature extraction, and ML tasks rely heavily on CLS's ability to efficiently break textual content down into its constituent components while preserving the original structure. The accuracy and sensitivity of ML models depend on the efficiency of identifying the connections of tokens with the corresponding context in the text. Paragraphs contain complete ideas of context and are the structural unit of content. Based on NLTK, the 𝑓 𝑝𝑎𝑡𝑎𝑠 () operator is implemented as a paragraph generator, where paragraphs are defined as blocks of text separated by two newline characters. The 𝑓 𝑝𝑎𝑡𝑎𝑠 () operator scans all files and passes each HTML text to the RE constructor, indicating that parsing of the HTML markup should be done through the lxml HTML parser. The resulting object maintains a tree structure that can be navigated using native HTML tags and elements.</p><p>If paragraphs are structural units of content, then sentences are semantic units. 
Just as a paragraph expresses a single idea, a sentence contains a complete thought that the author has formulated and expressed in words. Grapheme segmentation is the division of text into sentences for further processing by marking words with parts of speech in MA. The operator 𝑓 𝑠𝑒𝑛𝑡𝑠 (), calling 𝑓 𝑝𝑎𝑡𝑎𝑠 () and returning an iterator (generator), sorts all sentences from all paragraphs.</p><p>The 𝑓 𝑠𝑒𝑛𝑡𝑠 () operator bypasses all paragraphs selected by the 𝑓 𝑝𝑎𝑡𝑎𝑠 () operator and uses the 𝑓 𝑤𝑜𝑟𝑑𝑠 () operator to perform the actual grapheme segmentation. Internally, the 𝑓 𝑡𝑜𝑘𝑒𝑛𝑠 () operator uses 𝑓 𝑚𝑎𝑟𝑘 (), a model pre-trained with RE recognition/identification rules for various kinds of tokens, punctuation marks, abbreviations, geographical names, and other marks that serve as sentence start/end or tab marks. Punctuation marks do not always have an unambiguous interpretation: a period, for example, is a sign of the end of a sentence, but periods are also present in dates, abbreviations, ellipses, etc. Determining sentence boundaries is therefore not always an easy task. Punctuation is crucial for identifying word boundaries (commas, spaces, colons) and for identifying certain aspects of meaning (question marks, exclamation marks, quotation marks). For some tasks, such as tagging parts of speech and analyzing or synthesizing speech, it is sometimes necessary to treat punctuation marks as if they were separate words. When analyzing speech, punctuation marks stand in for pauses, accents, and changes in intonation dynamics. Lexemization is the process of obtaining lexemes (syntactically encoded strings of symbols); to implement it, the operator 𝑓 𝑤𝑜𝑟𝑑𝑠 () based on RE is used, which selects, through 𝑓 𝑚𝑎𝑟𝑘 (), markers for spaces and punctuation marks and returns a list of alphabetic and non-alphabetic characters. 
Like delimiting sentences, lexeme recognition is not always an easy task: the presence of punctuation marks inside a lexeme, punctuation marks as independent lexemes, lexemes with and without hyphens, and lexemes as shortened forms of words (one or more words) all occur. Different marker selection tools are chosen for these cases. Any statement is a speech correlate of a sentence. The presence of lexemes of the dysfluency type (loss of speech tempo, for example, a longer pause when thinking) carries not so much a semantic load as an emotional one. Exclamations such as мммм, ох, ах [mmmm, ohh, ah], etc. are fillers or filled pauses and are also emotionally, but not semantically, coloured. An unfinished word with further repetition and its ending, or simply with repetition, is a fragment that carries no semantic load, only an emotional one. Therefore, when conducting PHA, depending on the goal of solving a specific problem through CLS, it is important to take into account (mark accordingly) or ignore some types of punctuation (ellipsis, exclamation points, etc.), dysfluencies, double fragments, exclamations, etc. If CLS performs just a transcription of speech, then such phenomena should be ignored so as not to lose speech tempo. But they make it possible to determine the speaker's psychological and emotional state and to identify the peculiarity of the speaker's authorial speech when the tone of the voice changes; they are also relevant in predicting the next word, because they signal that the speaker is restarting the statement/idea, and therefore, for speech recognition, they are treated as ordinary tokens. Marking a lexeme as a lemma (a set of lexical forms having the same stem, the same main part of speech and the same word sense) or as a word form (a fully inflected or derived form of a word) makes a significant difference for conducting the next stage of MA as lemmatization or stemming, i.e. identification of word bases. 
For many NLP tasks in the English language, it is enough to mark the corresponding lexemes as word forms, but for the Ukrainian language this is not sufficient: the bases of the words must still be identified (for example, based on the analysis of inflexion according to the tree of endings).</p><p>There are two ways to count words with punctuation ignored: as types (the number of different words |V| in the set of words of the corpus, i.e. the cardinality of the alphabet of the corpus, where an element of the alphabet/dictionary is a unique word) and as tokens (the total number N of words of the analyzed corpus), i.e. |V| ≤ N. The largest Google N-grams corpus contains 13 million types among words appearing ≥ 40 times, so the true number is much larger.</p><p>The ratio between the number of types |V| and the number of tokens N is called Herdan's law <ref type="bibr">(Herdan, 1960)</ref> or Heaps' law <ref type="bibr">(Heaps, 1978)</ref>: |𝑉| = 𝑘𝑁 𝑥 , where 𝑘 and 𝑥 are positive constants with 0 &lt; 𝑥 &lt; 1. The value of x depends on the size of the corpus and the genre; for large corpora x varies within [0.67; 0.75], so the size of the dictionary grows faster than the square root of the length of the text. Another measure of the number of words in a language is the number of lemmas rather than word types (for example, the Oxford English Dictionary has over 615,000 entries).</p></div>
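The type/token distinction and Herdan's/Heaps' law can be illustrated numerically; the constants k and x below are illustrative values from the commonly cited range, not measurements on the corpus described here:

```python
def herdan_heaps(k: float, x: float, n_tokens: int) -> float:
    """Expected vocabulary size |V| = k * N**x (Herdan/Heaps law)."""
    return k * n_tokens ** x

def type_token_stats(tokens):
    """Return (|V|, N): distinct lowercased types vs total tokens."""
    types = {t.lower() for t in tokens}
    return len(types), len(tokens)

tokens = "мова наша мова як і будь яка інша мова".split()
v, n = type_token_stats(tokens)
assert v <= n                      # |V| <= N always holds
# With illustrative constants k = 44, x = 0.49, a 1-million-token
# corpus predicts a vocabulary of roughly 38,000 types.
est = herdan_heaps(44, 0.49, 1_000_000)
assert 30_000 < est < 50_000
```

Tracking how the fitted x drifts as the corpus grows is one way to detect genre shift in the integrated content.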
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Method of morphological analysis of the Ukrainian language</head><p>Morphology identifies the shape of things, and in textual analysis, the shape of individual words/tokens. Lexemes are both words and punctuation marks, allowing you to conduct the next SYA (syntactic analysis) more clearly. Word structure helps determine plural, gender, tense, person, declension, etc. MA is a difficult task, as most languages have many exceptions to the rules and special cases. The main task of MA is to identify parts of words to assign them to certain classes (tags) of parts of speech. For example, sometimes it is important to understand whether a noun is singular or plural, or is a proper name. It is also often necessary to know whether the verb has an indefinite form, past tense, or is an adjective. The resulting parts of speech are then used to generate larger structures (fragments/phrases), or whole word trees, which are then used to build semantic reasoning data structures. After GA (grapheme analysis), we have access to tokens in sentences in paragraphs of integrated content texts, which makes it possible to apply MA to mark words from the collection of tokens with parts of speech (e.g., verbs, nouns, prepositions, adjectives) that indicate the role of the word in the context of the sentence. In the Ukrainian language, the same word can usually take on different roles, depending on the inflexions. Part-of-speech tagging based on MA rules consists of adding a corresponding tag to each word from a collection of tokens that contains information about the definition of the word and its role in the current context. MA rules are used for the development of modules/subsystems for keyword identification, text classification (Fig. <ref type="figure" target="#fig_4">4</ref>.6), machine translation, and error correction, as well as for human psychological analysis, semantic analysis, etc. 
When identifying words for further classification, the rub_id attribute describes the rubric to which a specific keyword belongs (Table <ref type="table" target="#tab_10">4</ref>). The flag attribute defines the properties of this keyword (the part of speech to which it belongs). In thematic dictionaries, each word has its property, for example, a, b, c, d, o for different types of nouns, A for verbs, V for adjectives (Fig. <ref type="figure" target="#fig_9">7</ref>). To compare the complexity: in English thematic dictionaries (23 rules in total), each word also has a property, where the numbers 1-23 are the numbers of rules of the PFX type (prefixes, rules 1-7) and SFX type (suffixes and endings, rules 8-23) that describe the modification of some nouns for English words (Fig. <ref type="figure" target="#fig_10">8</ref>). The letters e and y near the suffixes are decision markers. A file of affixes (parts of words that attach to the root and carry grammatical or word-forming meaning, elements of word formation, for example, prefix, suffix, postfix, inflexion) has the *.aff file type and may contain additional attributes, i.e. the rules of reduction to the base of the word (Fig. <ref type="figure" target="#fig_11">9</ref>). The SET notation usually identifies the character encoding used for the affix and dictionary files. REP forms a lookup table of multi-character replacements for words. TRY identifies the sequence of characters to try when correcting words. SFX and PFX identify the types of suffixes and prefixes that mark word affixes. The flag attribute determines the type of word, the mask attribute shows the ending identification rule, the value of the find attribute is the ending of the word in the nominative case, and the value of the repl attribute is the ending of the word in a non-nominative case. Exceptions to the rules are given in square brackets. 
For example, the first line (ordering 26) describes a specific example of recognizing nouns of group a with the alternation of і/о and the nominative inflection -ін passing into the instrumental case (inflection -оном), and the next entry (ordering 27) covers the same nouns, but in the locative case (inflection -оні), while not recognizing other rules of that group or other groups in the dative case, inflections -онові and -ону (Fig. <ref type="figure" target="#fig_12">10</ref>). The third record (ordering 28) recognizes nouns with the alternation і/о and the nominative inflection -іг passing into the dative case, inflection -огу, but does not recognize other rules of the same group (rules 29-31 do this, respectively): -огові (dat./loc.), -огом (instr.), -озі (loc.). Keeping a list of words blocked by the moderator (Fig. <ref type="figure" target="#fig_29">12</ref>), in particular those that cannot be keywords, makes it possible to reduce the amount of verification during text classification (Fig. <ref type="figure" target="#fig_29">13</ref>). To identify keywords, it is important to correctly recognize adjectives in any case, gender and number (Fig. <ref type="figure" target="#fig_15">14</ref>). c. Group V for nouns marked with the flag p: masculine (m.) and feminine (f.) patronymics in the singular (s.) and plural (pl.) of male names. For further SYA, it is appropriate to recognize verbs correctly (Fig. <ref type="figure" target="#fig_16">15</ref>). Let us describe in more detail each marked class of the set of noun recognition rules, indicating their total number N (Table <ref type="table" target="#tab_11">5</ref>). In total, about 1,300 rules for processing suffixes and endings are used for MA of Ukrainian-language nouns, taking into account the alternation of letters.  
It is quite difficult to generate terminal chains in English (although there are far fewer MA rules than in Ukrainian), because the presence of articles and the connection of noun groups with each other through the corresponding preposition makes the tree longer and wider. The generation of terminal chains in the Ukrainian language is complicated by case and gender differences in the inflexions of the term used in the context. To identify keywords, it is not enough to recognize nouns (about 1300 RE-rules); it is also necessary to identify adjectives, a total of 99 RE-rules for Ukrainian texts (Table <ref type="table" target="#tab_10">4</ref>.6-Table <ref type="table" target="#tab_10">4</ref>.7). For correct SYA and SEM, including ontology construction, it is necessary to recognize verbs based on more than 800 RE-rules.  </p></div>
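The find/repl mechanics of such SFX records can be sketched as follows; the rule table is a hypothetical fragment for the і/о alternation discussed above, not the actual *.aff dictionary:

```python
# Each rule: (flag, find, repl) — strip the inflected ending `find`
# and restore `repl` to recover the nominative/base form.
# Illustrative entries only, not the real dictionary contents.
SFX_RULES = [
    ("a", "оном", "ін"),   # instrumental -оном -> nominative -ін
    ("a", "оні",  "ін"),   # locative     -оні  -> nominative -ін
    ("v", "огу",  "іг"),   # dative       -огу  -> nominative -іг
    ("v", "озі",  "іг"),   # locative     -озі  -> nominative -іг
]

def to_base(word: str):
    """Return candidate (flag, base) pairs from matching SFX rules."""
    out = []
    for flag, find, repl in SFX_RULES:
        if word.endswith(find):
            out.append((flag, word[: -len(find)] + repl))
    return out

assert to_base("рогу") == [("v", "ріг")]   # ріг -> рогу (і/о alternation)
assert to_base("розі") == [("v", "ріг")]
```

A real rule set would also carry the exception lists given in square brackets in the *.aff records.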
<div xmlns="http://www.tei-c.org/ns/1.0"><head>99</head><p>CLS marks the words of the input text as parts of speech (clarifying, after GA, the tagged/marked lexemes as words) based on RE-rules and analysis of inflexions: singular nouns of the corresponding gender and case, plural nouns of the corresponding case, adjectives, adverbs, verbs, personal pronouns, etc. (each with a collection of features).</p><p>The MA module returns a collection of paragraph lists, each of which is a list of sentences, which are lists of tokens, including words marked by parts of speech. Periodic interim analysis of the input/integrated textual content makes it possible to assess how the thematic corpus changes over time. In the process of analysis, we count the number of paragraphs, sentences and words, and also save each unique lexeme in an additional intermediate dictionary. If a lexeme/word did not exist in the dictionary of lexemes/word bases, we mark it as new and store it in the intermediate dictionary for analysis by the moderator. We count the number of content items and categories in the corpus of incoming text content and form a dictionary with a statistical summary of the corpus, which contains: the total number of integrated content items and categories; the total number of paragraphs, sentences and words; the number of unique tokens; lexical diversity as the ratio of the number of unique lexemes to their total number; the average number of paragraphs per content item; the average number of sentences per paragraph; and the total processing time.</p><p>Since the corpus grows as new data is collected, pre-processed and compressed, the MA method allows us to calculate these features and analyze the dynamics of their change. It is an important content monitoring tool for identifying possible problems in CLS; for example, in an ML model, a significant change in lexical diversity and in the number of paragraphs per content item affects the quality of the model. 
That is, the MA and GA methods, in addition to the identification of tokens and the direct marking of words by parts of speech, are used to collect additional information for determining the amount of change in the corpus, in order to start further vectorization and restructuring of the ML model in a timely manner. The main stage of the MA method is the identification of the bases of words (stemming) without taking into account inflexions (suffixes and endings) and, in some cases, prefixes. From the content of the inflexions, the word's part of speech is identified (Fig. <ref type="figure" target="#fig_29">16</ref>).</p></div>
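A corpus-summary dictionary of the kind described above can be sketched as follows; the field names are assumptions for illustration, not the CLS's actual schema:

```python
def corpus_summary(corpus):
    """corpus: list of documents; a document is a list of paragraphs,
    a paragraph is a list of sentences, a sentence is a list of tokens."""
    n_docs = len(corpus)
    n_paras = n_sents = n_words = 0
    vocab = set()
    for doc in corpus:
        n_paras += len(doc)
        for para in doc:
            n_sents += len(para)
            for sent in para:
                n_words += len(sent)
                vocab.update(t.lower() for t in sent)
    return {
        "documents": n_docs,
        "paragraphs": n_paras,
        "sentences": n_sents,
        "words": n_words,
        "unique_tokens": len(vocab),
        # lexical diversity: unique lexemes / total words
        "lexical_diversity": len(vocab) / n_words if n_words else 0.0,
        "avg_paragraphs_per_doc": n_paras / n_docs if n_docs else 0.0,
        "avg_sentences_per_paragraph": n_sents / n_paras if n_paras else 0.0,
    }

doc = [[["Наша", "мова"], ["мова", "калинова"]]]   # 1 para, 2 sentences
stats = corpus_summary([doc])
assert stats["words"] == 4 and stats["unique_tokens"] == 3
assert stats["lexical_diversity"] == 0.75
```

Comparing successive summaries (e.g. a drop in lexical diversity) is the monitoring signal for retraining described in the text.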
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 16: An example of identification of forms of inflexion according to part of speech</head><p>For the next SYA, it is not enough to mark the word only as a part of speech; it is still necessary to determine, for example, gender, number, etc., for a noun/adjective. The classic Porter stemmer algorithm works by sequentially cutting off endings and suffixes. For English-language texts, this is not a problem, as there are very few inflexions. For Ukrainian words, a modified (extended) Porter stemmer algorithm should be applied, with a check both of additional inflexions depending on the part of speech (according to the tree of endings) and of the obtained word bases against a dictionary of bases to identify an existing word (Fig. <ref type="figure" target="#fig_17">17</ref>).</p><p>The growth in the volume of MA RE-rules increases the load on CLS geometrically, solely due to the recognition of inflexions and the bases of word forms. For English-language texts, the complexity is lower for several reasons; for example, nouns have only two number forms with the plural inflexions (s|es). For the German language, the complexity increases: there are 4 cases (but inflexions almost do not change, only the articles change), phrases of ≥ 2 words are written together as compounds, etc. In the Ukrainian language, there are 7 cases of nouns, each of which changes its inflexion depending on gender and plural/singular, and some words have different endings in some cases (for example, втручання [vtruchannya] (intervention) in the locative case has two options: втручанню, втручанні); in addition, there is often alternation of letters.</p></div>
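The dictionary-checked cutting of inflexions can be sketched as follows; the ending list and base dictionary are tiny illustrative stand-ins for the endings tree and thematic dictionaries of bases used by the actual system:

```python
# Illustrative data only: a real CLS holds ~1300 noun rules, an
# endings tree, and full thematic dictionaries of word bases.
ENDINGS = ["ання", "анню", "анні", "ою", "ів", "и", "і", "а", "у", "я"]
BASES = {"втруч", "мов", "книг"}

def uk_stem(word: str) -> str:
    """Cut the longest known inflexion whose removal leaves a known
    base; otherwise return the word unchanged (the dictionary check
    that distinguishes the modified stemmer from classic Porter)."""
    w = word.lower()
    for end in sorted(ENDINGS, key=len, reverse=True):
        if w.endswith(end):
            base = w[: -len(end)]
            if base in BASES:          # check against base dictionary
                return base
    return w

# втручання / втручанню / втручанні all reduce to the same base
assert {uk_stem(w) for w in ("втручання", "втручанню", "втручанні")} == {"втруч"}
assert uk_stem("вітер") == "вітер"     # no valid inflexion: left intact
```

The dictionary check is what prevents false cuts such as вітер → віт, which a purely rule-based Porter pass would produce.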
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 18: Classes of linguistic features of inflexions of morphological analysis</head><p>Therefore, for Ukrainian words, Porter's simple classic stemming algorithm (reducing the word to the base root by cutting off inflexions) is not suitable. It is better to combine such an algorithm with a search/check of the obtained intermediate results against a tree of inflexions (so as not to go through all possible inflexions) and with the content of thematic dictionaries of bases with a set of RE-rules for the identification of features (classification by parts of speech). For text rubrication based on keyword identification alone, it is enough to conduct MA only for some noun groups (adjectives with nouns and nouns with nouns) without analyzing words of other parts of speech (by the tree of inflexions: if a token is neither an adjective nor a noun, it is ignored; in addition, a keyword phrase may contain at most one preposition, and only between nouns). It is enough to identify the bases of nouns/adjectives/abbreviations in the text and analyze the probability of their clustering in different parts of the content relative to the total volume.</p><p>The classic stemming algorithm, Porter's stemmer, does not use dictionaries of word bases but only applies a set of RE-rules for cutting off inflexions in a sequence according to the specifics of a specific language. The algorithm works with individual words without analyzing or taking into account the context. Linguistic features such as features of word formation (prefix, suffix, etc.) and parts of speech (noun, verb, etc.) are not taken into account. The basis is the following techniques for words:</p><p> cutting off the inflexion from the analyzed word (for Ukrainian words, it can be implemented by checking the obtained bases and inflexions against analogues in the DB). 
 the word has an invariable inflexion (the condition is impossible for most Ukrainian words, but it is possible to identify particles, conjunctions, prepositions, some nouns of foreign origin, abbreviations, etc.).</p><p> the inflexion changes in declension due to dropping/alternating letters.  the change of word inflexion and word formation corresponds to a specific RE-rule, for example, when forming words from some verb groups:</p><p>(ов)*ува(ти|нню|нням|нні|ння|ли|ло|ла|вшись|вши|в|вся|всь|лися|лись|тися|тись) [(ov)*uva(ty|nnyu|nnyam|nni|nnya|ly|lo|la|vshysʹ|vshy|v|vsya|vsʹ|lysya|lysʹ|tysya|tysʹ)].</p><p> changing the inflexion of the word as an exception to the RE-rules.</p><p> the ending of the word coincides with an envelope RE-rule for the identification of inflexion, but the word itself has no inflexion: вітер [viter] (wind), but відер [vider] (of buckets).  most short words are invariable (a stop-word dictionary is sufficient).</p><p>Such techniques significantly complicate the stemming algorithm for Ukrainian words. Therefore, the most widespread inflections are analyzed first, for example, of 1 letter: ц (34), щ (110), ф (214), б (281), п (341), ж (353), з (581), г (636), л (754), с (914), ч (959), д (1038), н (2531), р (2709), or of 1-4 letters (Table <ref type="table" target="#tab_7">2</ref>.2). Inflexions of length ≥ 5 (for example, max(йтесь)=6837, max(ванням)=4656) are significantly rarer among keywords; therefore, for the speed/efficiency of the solution, some CLS NLP tasks ignore them, but SYA/SEM do not allow this. Many NLP tasks do not require full implementation of all NLP processes from grapheme to pragmatic analysis. For example, to identify keywords, it is enough to provide grapheme and morphological analysis (algorithm 4.2). But before almost any NLP process, the text must be normalized. 
sequentially marking each sequence of non-alphabetic characters as tokens and recognizing alphabetic sequences between spaces and other special characters (e.g. numbers and punctuation) according to RE-rules as word tokens, forming a list 𝑆 of identified alphabetic tokens as words 𝑤 𝑖 .</p><p>Step 1.3. Sort the list 𝑆 of identified tokens 𝑤 𝑖 alphabetically, counting occurrences of identical chains and forming an alphabetic-frequency dictionary 𝐷 𝑎 , each record of which has the form occurrence count - word. Step 1.4. Convert all uppercase letters to lowercase and recalculate the occurrences of word tokens in the alphabetic-frequency dictionary 𝐷 𝐴 → 𝐷 𝑎 .</p><p>Step 1.5. Sort and save the dictionary 𝐷 𝑎 → 𝐷 𝑁 of identified words 𝑤 𝑖 by decreasing frequency of appearance (in Germanic languages, the top will be occupied by articles, pronouns, adjectives and conjunctions, while in Slavic languages most words with the same base and different inflexions will occupy different lines of the list, which significantly distorts the picture of the real distribution of words in texts). Stage 2. Segmentation/tokenization of words of the analyzed text content.</p><p>Step 2.1. Word segmentation based on dictionaries, metrics such as the probability of an error in a word, and statistical sequence models pre-trained on segmented text corpora (between spaces, punctuation, etc.).</p><p>Step 2.2. Tokenization based on RE-rules of marked tokens: sequences of non-alphabetic characters as tokens (dates, prices, URLs, hashtags, e-mail addresses, etc.), punctuation (as the end of a sentence or the boundary of a subordinate clause), mixed tokens of alphabetic and non-alphabetic characters (abbreviations, complex hyphenated words, words with an apostrophe, etc.), strings with uppercase characters (such as the beginning of a sentence, geographical names, proper names, abbreviations) and their normalization if necessary (for example, к.т.н. 
→ ктн (PhD) as a separate word token, or ML as машинне навчання [mashynne navchannya] (machine learning)).</p><p>Step 2.3. Analysis of tokens with uppercase characters (except when only the first letter is capitalized) for labelling, based on the RE-rules of finite automata, as either an abbreviation or a transfer of emotion.</p><p>Step 2.4. Marking of unidentified tokens 𝐷 𝑥 and ambiguities (e.g. an apostrophe as part of a word, etc.). Stage 3. Lemmatization of the set of recognized and labelled alphabetic tokens of the text as lemmas, identified as words of the analyzed text.</p><p>Step 3.1. Normalization of tokens based on the identification of affixes from the endings tree as stems of marked word tokens (reducing the word to its initial form based on MA RE-rules for identifying roots and affixes through Algorithm 1, the modified Porter stemmer), i.e. determining whether the analyzed tokens have the same root and differ only in inflexion, with sequential identification of the part of speech of the analyzed words and subsequent marking of them as lemmas with all accompanying linguistic features. Step 3.2. Regrouping and recalculation of word frequencies in the alphabetic-frequency dictionary 𝐷 𝑁 → 𝐷 𝑙 taking into account the words normalized in step 3.1. Stage 4. Additional analysis of unidentified tokens 𝐷 𝑥 by iteratively combining frequent character/string pairs within word tokens (for example, whether tokens between spaces or other punctuation marks, such as контент-аналіз [kontent-analiz] (content analysis), Web-сайт</p><formula xml:id="formula_7">[Web-sayt] (Web-site), контент-моніторинг [kontent-monitorynh] (content monitoring)</formula><p>or Web-resource [Web-resource] (Web-resource), are one word or two) through byte-pair encoding, or BPE, based on text compression for further possible identification of words, their labelling and normalization.</p><p>Step 4.1. Formation of a set of symbols equal to the collection of characters in 𝐷 𝑥 . 
We present each word as a sequence of characters plus a special character at the end of the word, or a special character, such as a dash, within a token (for example, контент- or Web-). We denote 𝑖 = 0. Step 4.9. We denote 𝑖 = 𝑖 + 1. If ℎ &gt; 0, 𝐷 𝑥 ≠ ∅ for 𝑛 𝑙 &gt; 1 and 𝐷′ 𝑥 contains at least 1 marked 𝑏 𝑖 , then go to step 4.4; otherwise (ℎ = 0 and 𝐷 𝑥 = ∅, or 𝐷 𝑥 ≠ ∅ at 𝑛 𝑙 = 1, or 𝑠 𝑘 𝑥 has no non-unique pair 𝑠 𝑗 𝑥 for the formation of 𝑎 𝑖 = (𝑠 𝑘 𝑥 , 𝑠 𝑗 𝑥 )) go to step 5.</p><p>Stage 5. Segmentation of sentences in the analysed content.</p></div>
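The iterative pair merging of stage 4 is standard byte-pair encoding; a compact sketch with simplified stopping conditions (compared with steps 4.1-4.9) is:

```python
from collections import Counter

def bpe_merges(words, n_merges):
    """Byte-pair encoding sketch: each word becomes a symbol sequence
    with an end-of-word marker "_", then the most frequent adjacent
    pair is merged repeatedly (cf. stage 4 of the algorithm)."""
    seqs = [list(w) + ["_"] for w in words]
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:               # no non-unique pair left -> stop
            break
        merges.append(a + b)
        for s in seqs:             # apply the merge in place
            i = 0
            while i < len(s) - 1:
                if s[i] == a and s[i + 1] == b:
                    s[i : i + 2] = [a + b]
                else:
                    i += 1
    return merges, seqs

merges, seqs = bpe_merges(["веб", "веб", "вебсайт"], 3)
assert merges[0] in ("ве", "еб")   # a most frequent pair is merged first
```

After enough merges, recurring fragments such as веб survive as single symbols, which is what lets unidentified hyphenated tokens be split into known words.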
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Method of lexical analysis of the Ukrainian language</head><p>The process of lexical analysis of the Ukrainian-language text 𝐶′ consists in the parsing, segmentation and tokenization of each sentence separately; Ukrainian is characterized not by a strict order of words, but at the same time by a constant arrangement of individual linguistic units. In a complete simple Ukrainian sentence with direct word order, the structural scheme is conditionally fixed. The main lexical categories of the corresponding sentence are the noun and verb groups. Type 0 grammar according to N. Chomsky's classification is not appropriate for such sentences due to the complexity of implementation. With context-dependent grammar, specific restrictions are applied, in particular, to the structure of a Ukrainian-language sentence with some set of variations. Based on the syntactic rules of generating Ukrainian-language sentences with partial word order (for example, there is no strict order for the subject and predicate in the sentence, but the adjective usually comes before the noun or another adjective, unless it is a poetic passage; also, the lexical units of the noun group are placed around the subject, etc.), we derive the lexical scheme for the noun group 𝑆 ̃ based on regular expressions:</p><formula xml:id="formula_8">𝑆 ̃= ([𝐴]{0, 𝑛}[𝑆]{1, 𝑚}|[𝑃]),<label>(7)</label></formula><p>where 𝐴 = 𝑎 1 𝑎 2 𝑎 3 … 𝑎 𝑁−1 𝑎 𝑁 is a sequence of adjectives, and the entry [𝐴]{0, 𝑛} is a selection of from 0 to 𝑛 adjectives from 𝑎 1 𝑎 2 𝑎 3 … 𝑎 𝑁−1 𝑎 𝑁 , with 𝑛 ≤ 𝑁; 𝑆 = 𝑠 1 𝑠 2 𝑠 3 … 𝑠 𝑀−1 𝑠 𝑀 is a sequence of nouns, and the entry [𝑆]{1, 𝑚} is a selection of from 1 to 𝑚 nouns from 𝑠 1 𝑠 2 𝑠 3 … 𝑠 𝑀−1 𝑠 𝑀 , with 𝑚 ≤ 𝑀; 𝑃 = 𝑝 1 𝑝 2 𝑝 3 … 𝑝 𝐾−1 𝑝 𝐾 is a sequence of pronouns, and the entry [𝑃] is the choice of 1 pronoun from 𝑝 1 𝑝 2 𝑝 3 … 𝑝 𝐾−1 𝑝 𝐾 ; the entry (𝑥|𝑦) is a choice of either 𝑥 or 𝑦; the values of 𝑎 𝑖 and 𝑠 𝑗 agree in gender, number and case. 
Accordingly, for the verb group, the lexical scheme based on RE-expressions is:</p><formula xml:id="formula_9">𝑉 ̃= ([𝑉]{1, 𝑛}[𝑆 ′ ̃]{0, 𝑚}|[𝑆 ′ ̃]{0, 𝑚}[𝑉]{1, 𝑛}),<label>(8)</label></formula><p>where 𝑉 = 𝑣 1 𝑣 2 𝑣 3 … 𝑣 𝑁−1 𝑣 𝑁 is a sequence of verbs, and the entry [𝑉]{1, 𝑛} is a choice of from 1 to 𝑛 verbs from 𝑣 1 𝑣 2 𝑣 3 … 𝑣 𝑁−1 𝑣 𝑁 , with 𝑛 ≤ 𝑁; 𝑆′ ̃= 𝑆 ̃1𝑆 ̃2𝑆 ̃3 … 𝑆 ̃𝑀−1 𝑆 ̃𝑀 is a sequence of noun groups, and the entry [𝑆′ ̃]{0, 𝑚} is a choice of from 0 to 𝑚 noun groups from 𝑆 ̃1𝑆 ̃2𝑆 ̃3 … 𝑆 ̃𝑀−1 𝑆 ̃𝑀, with 𝑚 ≤ 𝑀; the entry (𝑥|𝑦) is a choice of either 𝑥 or 𝑦; agreement between 𝑣 𝑖 and 𝑆 ̃𝑗 is carried out by person, gender and number. The lexical scheme of a Ukrainian sentence based on RE-expressions:</p><formula xml:id="formula_10">𝑅 = ([𝑆′ ̃]{0,1}[𝑉′ ̃]{0,1}|[𝑉′ ̃]{0,1}[𝑆′ ̃]{0,1}),<label>(9)</label></formula><p>where 𝑉′ ̃= 𝑉 ̃1𝑉 ̃2𝑉 ̃3 … 𝑉 ̃𝑁−1 𝑉 ̃𝑁 is a sequence of verb groups, and the entry [𝑉′ ̃]{0,1} is a selection of from 0 to 1 verb groups from 𝑉 ̃1𝑉 ̃2𝑉 ̃3 … 𝑉 ̃𝑁−1 𝑉 ̃𝑁 with the presence of a predicate; 𝑆′ ̃= 𝑆 ̃1𝑆 ̃2𝑆 ̃3 … 𝑆 ̃𝑀−1 𝑆 ̃𝑀 is a sequence of noun groups, and the entry [𝑆′ ̃]{0,1} is a selection of from 0 to 1 noun groups from 𝑆 ̃1𝑆 ̃2𝑆 ̃3 … 𝑆 ̃𝑀−1 𝑆 ̃𝑀 with the presence of a subject; the entry (𝑥|𝑦) is a choice of 𝑥 or 𝑦; agreement between 𝑉 ̃𝑖 and 𝑆 ̃𝑗 is carried out by person, gender and number.</p><p>The main lexical features of the verb group are tense, number and person. 
For comparison, the lexical scheme of the noun group based on the RE-expression for an English-language sentence:</p><formula xml:id="formula_11">𝑆 ̃= (𝑎𝑟𝑡𝑖𝑐𝑙𝑒[𝐴]{0, 𝑛}[𝑆]/𝑜𝑓[𝐴]{0, 𝑛}[𝑆]/{0, 𝑚}|[𝑃]).</formula><p>(10) The lexical scheme of the English verb group based on the RE-expression:</p><formula xml:id="formula_12">𝑉 ̃= [𝑉][𝑆 ′ ̃]{0, 𝑚}.<label>(11</label></formula><p>) Lexical scheme for an English-language sentence based on the RE-expression:</p><formula xml:id="formula_13">𝑅 = [𝑆′ ̃][𝑉′ ̃].<label>(12)</label></formula><p>The agreement of cases between the lexical units of the Ukrainian-language sentence affects the further syntactic and semantic analysis of the content:</p><formula xml:id="formula_14">1. 𝑅 → 𝑅𝑌 𝑖 𝑥 𝑖 ′ , 2. 𝑥 𝑖 ′ 𝑌 𝑗 → 𝑌 𝑗 𝑥 𝑖 ′ , 3. 𝑅𝑌 𝑖 → 𝑥 𝑖 𝑅, 4. 𝑅 → 𝑞. } 𝑖, 𝑗 = 1,2,3,<label>(13)</label></formula><p>where 𝑥 𝑖 , 𝑥 𝑖 ′ , 𝑞 are the main lexical units; 𝑅, 𝑌 𝑖 are auxiliary lexical units; 𝑅 is the initial symbol as an indicator of the type of sentence chain generation.</p><p>Stages of lexical formation of a chain of tokens </p><formula xml:id="formula_15">𝑥 2 𝑥 1 𝑥 1 𝑥 3 𝑞𝑥 2 ′ 𝑥 1 ′ 𝑥 1 ′ 𝑥 3 ′ : 1. 𝑅 6. (2)𝑅𝑌 2 𝑌 1 𝑥 2 ′ 𝑥 1 ′ 𝑌 1 𝑥 1 ′ 𝑌 3 𝑥 3 ′ 2. (<label>1</label></formula><formula xml:id="formula_16">) 𝑅𝑌 2 𝑥 2 ′ 𝑌 1 𝑥 1 ′ 𝑌 1 𝑥 1 ′ 𝑌 3 𝑥 3 ′ 16. (4) 𝑥 2 𝑥 1 𝑥 1 𝑥 3 𝑞𝑥 2 ′ 𝑥 1 ′ 𝑥 1 ′ 𝑥 3 ′</formula><p>An example of lexical generation of the type {𝑥𝑞𝑥 ′ }: Саша</p><p>,поет 𝑑 , … respectively, where𝑥 (𝑎𝑏𝑐𝑑. . . ) is a sequence of proper names, 𝑥 ′ (𝑎 ′ 𝑏 ′ 𝑐 ′ 𝑑 ′ . . . ) is a sequence of professions agreed with proper names; 𝑞 is a dash. Any verb has the ability to act as a complement: моя дитина вподобала книгочитання [moya dytyna vpodobala knyhochytannya] (My child liked reading books). 
This process can theoretically be repeated an unlimited number of times: він книгочитанняцікаводумає про книгочитанняцікавість [vin knyhochytannyatsikavodumaye pro knyhochytannyatsikavistʹ] (he book-reading-interestingly-thinks about book-reading-interest), i.e.</p></div>
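The RE-based lexical schemes (8)-(12) above can be prototyped directly as regular expressions over strings of POS tags. Below is a minimal Python sketch; the single-letter tags (A = adjective, S = noun, P = pronoun, V = verb) and the omission of person/gender/number agreement checks are simplifying assumptions for illustration, not the paper's notation.

```python
import re

# A toy rendering of lexical schemes (8)-(9) over strings of POS tags rather
# than real Ukrainian morphology (an assumption for illustration):
# A = adjective, S = noun, P = pronoun, V = verb; agreement checks omitted.
NOUN_GROUP = r"(?:A*S|P)"                             # simplified noun group
VERB_GROUP = rf"(?:V+{NOUN_GROUP}*|{NOUN_GROUP}*V+)"  # cf. scheme (8)
SENTENCE = rf"(?:{NOUN_GROUP}?{VERB_GROUP}?|{VERB_GROUP}?{NOUN_GROUP}?)"  # cf. (9)

def is_sentence(tags: str) -> bool:
    """Return True if the POS-tag string fits the simplified scheme (9)."""
    return re.fullmatch(SENTENCE, tags) is not None

# "моя дитина вподобала книгочитання" tagged roughly as A S V S:
print(is_sentence("ASVS"))  # True: noun group "AS", then verb group "VS"
print(is_sentence("AA"))    # False: bare adjectives form no group
```

The alternation in both schemes mirrors the (𝑥|𝑦) choice operator of the RE notation used in the text.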
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Він книго</head><formula xml:id="formula_18">⏞ 𝑎 читання ⏞ 𝑏 цікавість ⏞ 𝑐 − думає про − книго ⏞ 𝑎 ′ читання ⏞ 𝑏 ′ цікавість ⏞ 𝑐 ′ . A language consisting of strings of the form 𝑎𝑏𝑐𝑑. . . 𝑑 ′ 𝑐 ′ 𝑏 ′ 𝑎 ′ (composed of symbols 𝑎 1 , 𝑎 2 , 𝑎 3 , 𝑎 1 ′ , 𝑎 2 ′ , 𝑎 3 ′</formula><p>) is generated by a grammar of 6 rules:</p><formula xml:id="formula_19">𝐼 → 𝑎 𝑖 𝐼𝑎 𝑖 ′ 𝐼 → 𝑎 𝑖 𝑎 𝑖 ′ } 𝑖 = 1,2,3.<label>(14)</label></formula><p>Such grammars do not provide, for example, a natural description of the so-called non-projective constructions with breaks, i.e. crossing or framing directions of syntactic dependence (Fig. <ref type="figure" target="#fig_29">19</ref>).</p><p>Ukrainian. Наша мова, як і будь-яка інша, посідає унікальне місце. (Our language, like any other, occupies a unique place.)</p></div>
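The mirror language 𝑎𝑏𝑐𝑑. . . 𝑑′𝑐′𝑏′𝑎′ generated by grammar (14) is context-free rather than regular, which a stack-based recognizer makes explicit. A minimal Python sketch (marking primed symbols with a trailing apostrophe is an assumption for illustration):

```python
# Grammar (14), I -> a_i I a_i' | a_i a_i', generates "mirror" strings
# abc...c'b'a'. A stack check recognizes this language, reflecting the fact
# that it is context-free and cannot be captured by a finite automaton.
def is_mirror(tokens):
    """Accept x1..xn xn'..x1' where each xk' is xk plus an apostrophe."""
    if len(tokens) < 2 or len(tokens) % 2 != 0:
        return False
    half = len(tokens) // 2
    stack = list(tokens[:half])
    for tok in tokens[half:]:
        if not stack or tok != stack.pop() + "'":
            return False
    return True

# книго-читання-цікавість ... цікавість'-читання'-книго' from the example:
print(is_mirror(["a", "b", "c", "c'", "b'", "a'"]))  # True
print(is_mirror(["a", "b", "a'", "b'"]))             # False
```

The stack models the nesting of the rule 𝐼 → 𝑎 𝑖 𝐼𝑎 𝑖 ′ : each opening symbol must be matched, in reverse order, by its primed counterpart.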
<div xmlns="http://www.tei-c.org/ns/1.0"><head>English.</head><p>A theorem is stated which describes the properties of this function.</p><p>German. ... die Tatsache, daß die Menschen die Fähigkeit besitzen, Verhältnisse der objektiven Realität in Aussagen wiederzuspiegeln. (... the fact that people possess the ability to reflect relations of objective reality in statements.)</p><p>French.</p><p>3. Sequential subordination (Fig. <ref type="figure" target="#fig_22">20</ref>): досить повiльно рухлива черепаха (a rather slowly moving turtle) or очень быстро бегущий олень (Russian: a very fast running deer). Only with the correct identification and recognition of non-projective constructions can a grammatical and syntactic analysis of Ukrainian sentences be carried out to build dependency trees of the components of these sentences.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">The method of syntactic analysis of the Ukrainian language</head><p>Syntax is a set of relational rules for the formation of sentences/phrases, usually defined by the grammar. Sentences are linguistic units of language for generating meaning and encoding information. The purpose of syntactic analysis (SYA) is to demonstrate meaningful relationships between words based on the division of a sentence into parts, or between tokens in a tree-like structure 𝐶′. Syntax is a necessary basis for reasoning about a system of concepts or semantics because it is an important tool for determining the degree to which words influence each other in the generation of phrases. For example, SYA identifies the prepositional phrase в потяг (into the train).</p><p>There are cases in the textual content when not only the right but also the left sequential subordination has an unlimited depth of derivation, for example, due to subordinate clauses with the relative words which, what, when, etc. (тваринка, яку врятувала Софія [tvarynka, yaku vryatuvala Sofiya] (the animal that Sofia saved)). Fig. <ref type="figure" target="#fig_3">23</ref> illustrates a phrase with a derivation depth of 22 that is completely grammatically correct (as is its Ukrainian version). Moreover, nothing prevents continuing the phrase to the left: на волю в обійми зеленої пахучої трави [na volyu v obiymy zelenoyi pakhuchoyi travy] (freely into the embrace of green, fragrant grass). The Ukrainian language allows generating phrases with an unlimited number of left-to-right sequentially subordinated constructions of the type 𝑌 1 𝑌 2 . . . 𝑌 𝑖 . .. (unlimited right subordination), and at the same time, unlimited left subordination is possible in each of the constructions 𝑋 𝑖 as a sequence of chains . . . 𝑌 𝑖𝑗 . . . 𝑌 𝑖3 𝑌 𝑖2 𝑌 𝑖1 ; however, within the sequence 𝑌 𝑖𝑗 further unlimited expansion is impossible. 
According to the rules of the Ukrainian language, 𝑌 𝑖 are interpreted as simple sentences, each of which is an additional determiner of the previous one, and 𝑌 𝑖𝑗 are interpreted as prepositive adjective inflexions.</p><p>The grammar 𝐺 ′ = ⟨𝐷 ′ , 𝐷 1 ′ , 𝐼 ′ , 𝑅 ′ ⟩ has a basic dictionary of 𝐷 ′ = 𝑁 1 , 𝑁 2 , . . . , 𝑁 𝑛 symbols and rules of the form 𝑅 ′ = {𝑌 → 𝑍𝑁 𝑖 , 𝑋 → 𝑁 𝑖 }, where 𝑌 ∈ 𝐷 1 ′ and 𝑍 ∈ 𝐷 1 ′ . Each of 𝑁 𝑖 corresponds to some regular grammar 𝐺 𝑖 ′ = ⟨𝐷, 𝐷 1 𝑖 , 𝑁 𝑖 , 𝑅 𝑖 ⟩, where 𝐷 is the main dictionary of 𝐺 𝑖 ′ , 𝐷 1 𝑖 is the auxiliary dictionary with 𝐷 1 𝑖 ∩ 𝐷 ′ = 𝑁 𝑖 and 𝐷 1 𝑖 ∩ 𝐷 1 ′ = 𝑁 𝑖 ; 𝑁 𝑖 is the initial symbol; the scheme has rules of the form 𝑅 𝑖 = {𝐶 → 𝑒𝐸, 𝐶 → 𝑐} (uppercase Latin characters are non-terminal, and lowercase characters are terminal). The non-terminal dictionaries of the grammars 𝐺 𝑖 ′ are pairwise disjoint. Their union:</p><formula xml:id="formula_21">𝐺 = 𝐺 ′ ∪ 𝐺 1 ′ ∪ 𝐺 2 ′ ∪ … ∪ 𝐺 𝑛 ′ ,<label>(15)</label></formula><p>where the main dictionary 𝐷 is the same in all 𝐺 𝑖 ′ , and the auxiliary dictionary and rule scheme are:</p><formula xml:id="formula_22">𝐷 1 = 𝐷 ′ ∪ 𝐷 1 ′ ∪ 𝐷 1 1 ∪ 𝐷 1 2 ∪. . .∪ 𝐷 1 𝑛 , 𝑅 = 𝑅 ′ ∪ 𝑅 1 ∪ 𝑅 2 ∪ … ∪ 𝑅 𝑛<label>(16)</label></formula><p>The grammar 𝐺 is special and equivalent to a finite-state (automaton) grammar, for example: To analyze the syntactic structure of a sentence is to identify the order of words according to the syntactic structure and relations, each element determined by the analysis of its neighbours and of what is derived/secondary. It is advisable to modify the grammar so that both parts of the predicate (Fig. <ref type="figure" target="#fig_26">24</ref>) are trees of syntactic relations. Lines with subscripts describe syntactic relations of various types; the symbols 𝐴, 𝐵, 𝐶, . .. are syntactic categories. As a result, the syntactic structures (rather than phrases) of the language are obtained as part of the generative grammar. 
Another part of this grammar is the calculus for the Ukrainian language, with mandatory consideration of the logical derivation of linear sequences of words, which solves the problem of discontinuous constituents.</p><formula xml:id="formula_23">𝑅 ′ = { 𝐼 → 𝐵𝑁 1 𝐵 → 𝐶𝑁 1 𝐶 → 𝐵𝑁 2 𝐶 → 𝐸𝑁 3 𝐸 → 𝐸𝑁 4 𝐸 → 𝑁 2 , 𝑅 1 = { 𝑁 1 → 𝑏𝑃 1 𝑃 1 → 𝑎𝑄 1 𝑄 1 → 𝑎𝑄 1 𝑄 1 → 𝑐 , 𝑅 2 = {𝑁 2 → 𝑑, 𝑅 3 = { 𝑁 3 → 𝑎𝑃 3 𝑁 3 → 𝑏𝑄 3 𝑁 3 → 𝑐𝑊 3 𝑃 3 → 𝑎 𝑄 3 → 𝑏 𝑊 3 → 𝑑𝑊 3 𝑊 3 → 𝑒𝑊 3 𝑊 3 → 𝑑 , 𝑅 4 = { 𝑁 4 → 𝑐𝑃 4 𝑃 4 → 𝑏 ,<label>(17)</label></formula></div>
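The component rule sets in (17) are regular, and each behaves like a finite automaton: every non-terminal is a state and every rule a transition. The sketch below simulates 𝑅 1 = {𝑁 1 → 𝑏𝑃 1 , 𝑃 1 → 𝑎𝑄 1 , 𝑄 1 → 𝑎𝑄 1 , 𝑄 1 → 𝑐} from (17); the dictionary encoding is an assumption for illustration, not the paper's implementation.

```python
# Right-linear rules from R1 in (17) encoded as a transition table:
# N1 -> bP1, P1 -> aQ1, Q1 -> aQ1 | c. None marks a terminating rule.
RULES = {
    "N1": [("b", "P1")],
    "P1": [("a", "Q1")],
    "Q1": [("a", "Q1"), ("c", None)],
}

def derives(nonterminal, word):
    """Return True if `word` is derivable from `nonterminal` under RULES."""
    if not word:
        return False
    for terminal, nxt in RULES[nonterminal]:
        if word[0] == terminal:
            if nxt is None:
                if len(word) == 1:
                    return True
            elif derives(nxt, word[1:]):
                return True
    return False

print(derives("N1", "baac"))  # True: N1 -> bP1 -> baQ1 -> baaQ1 -> baac
print(derives("N1", "bc"))    # False
```

The loop rule 𝑄 1 → 𝑎𝑄 1 gives the unbounded expansion within one component, while the two-level split (𝑅 ′ over the 𝑁 𝑖 , then each 𝑅 𝑖 ) mirrors the right-unbounded/left-unbounded structure described above.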
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5.">The method of semantic analysis of the Ukrainian language</head><p>Semantic analysis consists not only in identifying the content of the text but also in generating data structures to which logical reasoning can be applied. Thematic Meaning Representations (TMR) are used to encode sentences in the form of predicate structures based on first-order logic or lambda calculus (λ-calculus). Network/graph structures are used to encode interactions of predicates of relevant text features. A traversal is then implemented to analyze the centrality of terms or subjects and the reasons for the relationships between elements.</p><p>Analysis of graphs, including the ontology О, is usually not a complete SEM, but it helps to form some important logical decisions/conclusions based on the taxonomy of concepts 𝑋: 𝑂: 𝑈𝐿𝑆𝑅 𝐶𝑜𝑛𝑐𝑒𝑝𝑡𝑠. (18)</p><p>The result of SEM based on the ontological model О of the syntax rules of the Ukrainian language is a set of weighted oriented graphs of the semantics of the text:</p><formula xml:id="formula_24">𝑂 = &lt; 𝐶𝑜𝑛𝑐𝑒𝑝𝑡𝑠, 𝑅𝑒𝑙𝑎𝑡𝑖𝑜𝑛𝑠ℎ𝑖𝑝𝑠, 𝐹𝑢𝑛𝑐𝑡𝑖𝑜𝑛𝑠 &gt;,<label>(19)</label></formula><p>where 𝑅𝑒𝑙𝑎𝑡𝑖𝑜𝑛𝑠ℎ𝑖𝑝𝑠 is a tuple of relationships between SA concepts of the Ukrainian language; 𝐶𝑜𝑛𝑐𝑒𝑝𝑡𝑠 is a tuple of SA concepts describing the rules of the Ukrainian language; 𝐹𝑢𝑛𝑐𝑡𝑖𝑜𝑛𝑠 is a tuple of functions for the interpretation of concepts/rules of the Ukrainian language.</p><p>The taxonomy of concepts sets the syntax of the language as the root concept of the ontology:</p><formula xml:id="formula_26">𝐶𝑜𝑛𝑐𝑒𝑝𝑡𝑠  : &lt; 𝑅 𝑆𝑛𝑡 &gt; 𝐶′  . 
(20)</formula><p>The optimal definition of the tuple of relations between these concepts and of the tuple of the rules of the Ukrainian language, formalized by description logic (DL), will allow effective processing of Ukrainian texts: 𝐶𝑜𝑛𝑐𝑒𝑝𝑡𝑠 =&lt; 𝑅 𝑀𝑟𝑝 , 𝑅 𝑃𝑛𝑐 , 𝑅 𝑆𝑡𝑟 , 𝑅 𝑆𝑛𝑡 , 𝑅 𝑆𝑚𝑛 &gt;, (21) where 𝑅 𝑀𝑟𝑝 , 𝑅 𝑃𝑛𝑐 , 𝑅 𝑆𝑡𝑟 , 𝑅 𝑆𝑛𝑡 (Fig. <ref type="figure" target="#fig_32">25</ref>) and 𝑅 𝑆𝑚𝑛 are the tuples of concepts of morphology, punctuation, structure, syntax and semantics, respectively.</p><p>In SEM, to identify the set of semes of the corresponding text and their relationships, first, based on the results of SYA, a semantic graph of the relations of linguistic units is built, taking into account the parts of speech of the words:</p><formula xml:id="formula_27">𝐶′  = (𝐶  , 𝐷  , 𝑅  , 𝐶𝑜𝑛𝑐𝑒𝑝𝑡𝑠  ), 𝐶𝑜𝑛𝑐𝑒𝑝𝑡𝑠  =&lt; 𝐶 𝑊𝑟𝑑𝐶𝑚𝑏 , 𝐶 𝑆𝑛𝑡𝐶𝑚𝑏 &gt;,<label>(22)</label></formula><p>where 𝐶 𝑊𝑟𝑑𝐶𝑚𝑏 is a tuple of word formation concepts; 𝐶 𝑆𝑛𝑡𝐶𝑚𝑏 is a tuple of sentence generation concepts in the Ukrainian language (Fig. <ref type="figure" target="#fig_8">26</ref>).</p><p>The tuple 𝐶 𝑊𝑟𝑑𝐶𝑚𝑏 is formed according to the rules of Ukrainian syntax (Fig. <ref type="figure" target="#fig_8">26</ref>). Similarly, tuples are formed to identify the members of the sentence 𝑆𝑔𝑛 𝑆𝑛𝑀𝑏 𝑆𝑛𝑡 (Fig. <ref type="figure" target="#fig_10">28</ref>-Fig. <ref type="figure" target="#fig_11">29</ref>) and the complex sentence 𝑆𝑔𝑛 𝐶𝑙𝑆𝑡 𝐼𝐼𝐼 (Fig. <ref type="figure" target="#fig_27">30</ref>). The process of extracting data from Ukrainian-language text based on the syntax ontology makes it possible to supplement the weighted concept graphs of the content.</p><formula xml:id="formula_28">𝐶 𝑊𝑟𝑑𝐶𝑚𝑏 =&lt; 𝑆𝑔𝑛</formula></div>
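The ontology tuple (19), 𝑂 = &lt;𝐶𝑜𝑛𝑐𝑒𝑝𝑡𝑠, 𝑅𝑒𝑙𝑎𝑡𝑖𝑜𝑛𝑠ℎ𝑖𝑝𝑠, 𝐹𝑢𝑛𝑐𝑡𝑖𝑜𝑛𝑠&gt;, can be mirrored by a simple data structure. In the Python sketch below, the field names, the relation label "relates_to" and the string concept identifiers are assumptions for illustration, not the paper's implementation:

```python
from dataclasses import dataclass, field

# A minimal data-structure sketch of the ontology tuple (19).
@dataclass
class Ontology:
    concepts: set = field(default_factory=set)
    relationships: set = field(default_factory=set)   # (source, label, target)
    functions: dict = field(default_factory=dict)     # concept -> interpreter

o = Ontology()
root = "R_Snt"                                  # syntax as root concept, per (20)
o.concepts.add(root)
for c in ["R_Mrp", "R_Pnc", "R_Str", "R_Smn"]:  # rest of the tuple (21)
    o.concepts.add(c)
    o.relationships.add((root, "relates_to", c))

print(len(o.concepts), len(o.relationships))  # 5 4
```

Representing relationships as labeled edge triples keeps the structure directly traversable as the weighted oriented graph the method produces.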
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6.">The method of pragmatic analysis of the Ukrainian language</head><p>Pragmatics examines the dependence of meaning on the context of the author's textual content and takes into account the author's prior knowledge, intentions, purpose, etc., in contrast to semantics, which analyzes the meaning itself depending on the results of GA, MA, LA and SYA within a particular text. Pragmatics continues SEM by taking into account the peculiarities of the context of the analysed text and the ambiguity of its statements, drawing on the features of the author's statements in previous similar texts and on the time, place, method, purpose and other circumstances of the conversation.</p><p>In PA, when resolving the ambiguity of the author's speech in a specific analyzed text, taking into account the features of the author's speech in previous similar utterances, it is best to use word prediction models, for example, N-gram Language Models (LM). Each speaker, as a person with unique life experience, has not only a personal dictionary of thematic words but also a unique "handwriting" in the use of these words and their sequences in a certain context of the relevant thematic area. In the expression «лінгвістична система опрацьовує …» [linhvistychna systema opratsʹovuye …] (the linguistic system processes ...) the next word depends not only on the context but also on the so-called speech handwriting of the author of the text: текст, контент, текстовий контент, вхідні дані, вхідну інформацію, інтегровані дані, авторський контент, публікації [tekst, kontent, tekstovyy kontent, vkhidni dani, vkhidnu informatsiyu, intehrovani dani, avtorsʹkyy kontent, publikatsiyi] (text, content, text content, input data, input information, integrated data, author content, publications), etc. 
The phrase «включіть свою виконану лабораторну роботу ...» [vklyuchitʹ svoyu vykonanu laboratornu robotu ...] (include your completed lab work...), as opposed to «додайте свою виконану лабораторну роботу ...» [dodayte svoyu vykonanu laboratornu robotu ...] (add your completed lab work...), has a broader meaning and depends significantly not only on the context but also on the speaker (include can mean, for example, loading the developed software onto the computer, or adding it as an item to some list, etc.). Dialogue participants intuitively understand the content based on their experience of communicating with the author of the phrase. Pragmatic analysis requires the introduction of models that determine the probability of each subsequent word. They are also intended for assigning the probability of the target utterance for correct machine translation, identification/correction of grammatical and stylistic errors, and handwriting or language recognition. Each language has special statistical parameters, and the analysis of the probability of the appearance of letters alone and their combinations as N-grams of the corresponding language makes it possible to identify the language itself or the style of the author (Fig. <ref type="figure" target="#fig_29">31</ref>: with greater probability, the author of the reference passage also wrote Excerpt 1).</p><p>Figure <ref type="figure" target="#fig_29">31</ref>: Probability of appearance of letters in the standard and analyzed passages.</p><p>For Ukrainian texts, the statistical parameters of styles are the probabilities of vowels, consonants, and gaps between words, as well as of soft and sonorous groups of consonants. Probability is also important for enhancing communication. Physicist Stephen Hawking used simple movements to select words from a menu for speech synthesis. For such information systems (IS), it is appropriate to use word prediction to generate a list of likely words for the menu. 
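Word prediction for such a selection menu can be sketched as ranking candidate next words by bigram frequency. In the Python sketch below, the candidate words echo the continuations of «лінгвістична система опрацьовує …» quoted above, but the counts are invented for illustration:

```python
from collections import Counter

# Hypothetical bigram counts for continuations of "... опрацьовує"; the
# numbers are assumptions, not statistics from the paper.
observed = Counter({
    ("опрацьовує", "контент"): 5,
    ("опрацьовує", "текст"): 3,
    ("опрацьовує", "дані"): 2,
})

def suggest(prev_word, k=2):
    """Return the top-k most frequent next words after prev_word."""
    ranked = [(w, n) for (p, w), n in observed.items() if p == prev_word]
    return [w for w, _ in sorted(ranked, key=lambda x: -x[1])[:k]]

print(suggest("опрацьовує"))  # ['контент', 'текст']
```

The same ranking, trained per author, would capture the "speech handwriting" the text describes.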
One of the most widespread LMs, and the easiest to implement for English-language texts, is the N-gram model, which assigns probabilities to sentences or sequences of words. For Ukrainian-language texts, it is better to apply such an LM to the sequence of word bases without taking inflexions into account (otherwise incorrect PA results will be obtained), calculating 𝑃(𝑏|𝑎), the probability of the appearance of the word base 𝑏 after the sequence of bases 𝑎. Taking whole words into account in the N-grams of the LM is appropriate for identifying grammatical errors in Ukrainian-language texts: 𝑃(систем|комп ′ ютер лінгвіст), 𝑃(системи|комп′ютерні лінгвістичні), 𝑃(систему|комп ′ ютерну лінгвістичну). (37)</p><p>One of the best ways to calculate such a probability is to conduct a statistical analysis of large corpora of texts of the relevant author or the relevant thematic direction from reliable Internet sources:</p><formula xml:id="formula_29">𝑃(систем|комп ′ ютер лінгвіст) = 𝑁(комп ′ ютер лінгвіст систем) / 𝑁(комп ′ ютер лінгвіст).<label>(38)</label></formula><p>This gives a probabilistic result for a certain period because the language is creative, not homogeneous, and the vocabulary is updated and develops constantly, both in general and for a specific speaker, the author of the text. To analyze the corresponding random linguistic event 𝐴 𝑖 = комп ′ ют, 𝑃(𝐴 𝑖 ) is found, and the probability of the appearance of a certain sequence of linguistic events is calculated by the chain rule (general product rule) of probability: 𝑃(𝐴 1 𝐴 2 … 𝐴 𝑛 ) = 𝑃(𝐴 1 )𝑃(𝐴 2 |𝐴 1 ) … 𝑃(𝐴 𝑛 |𝐴 1 𝑛−1 ). (41)</p><p>The chain rule reflects the relationship between the overall probability of the appearance of a specific sequence of bases and the conditional probability of the appearance of a word base given the specific previous word bases in this sequence. Taking into account the entire dynamics of the occurrence of all word bases in the text relative to sequences of other word bases is a redundant/inefficient process due to the variability of language/speech over time. 
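The count-based estimate 𝑃(𝑏|𝑎) = 𝑁(𝑎 𝑏)/𝑁(𝑎) can be computed directly from unigram and bigram counts over word stems. A minimal Python sketch; the tiny stem "corpus" below is invented for illustration, not taken from the paper's data:

```python
from collections import Counter

# MLE estimate P(b|a) = N(a b) / N(a) over word stems, cf. (37)-(38).
corpus = ("комп'ютер лінгвіст систем опрацьов текст "
          "комп'ютер лінгвіст систем аналіз текст "
          "комп'ютер наук").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(b, a):
    """MLE conditional probability of stem b following stem a."""
    return bigrams[(a, b)] / unigrams[a] if unigrams[a] else 0.0

print(round(p("лінгвіст", "комп'ютер"), 3))  # 0.667 (2 of 3 occurrences)
print(p("систем", "лінгвіст"))               # 1.0
```

Stemming the corpus first, as the text recommends for Ukrainian, collapses inflected variants such as системи/систему into one base and keeps the counts from fragmenting.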
Prediction in the 2-gram model consists of approximating the dynamics of the appearance of only the last few word bases in a given sequence:</p><p>For example, for three sentences of a mini-corpus (conditionally, the &lt;p&gt; &lt;/p&gt; tags mark the boundaries of one sentence), we calculate, under the Markov assumption, the 2-gram occurrence probabilities of word bases. With each subsequent multiplication, the probability decreases. Applying the logarithm of probabilities (log probabilities) makes it possible to operate with values that are not vanishingly small, preserving calculation accuracy. (48)</p><p>The resulting matrices will in most cases be sparse. Consider the phrase and its variations (plural/singular and case forms) система електронної контент-комерції [systema elektronnoyi kontent-komertsiyi] (electronic content commerce system):</p><p>𝑃(систем електрон контент комерц) = 𝑃(електрон|систем)𝑃(контент|електрон)𝑃(комерц|контент) = 0,124 ⋅ 0,81 ⋅ 0,179 = 0,01797876.</p></div>
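The worked example above can be checked numerically, in both plain and log space. The three bigram probabilities below are the values quoted in the text; everything else is an illustrative sketch:

```python
import math

# The probability of the stem chain "систем електрон контент комерц" as the
# product of the bigram probabilities quoted in the text, plus the same
# computation in log space to avoid numerical underflow on long chains.
p_bigrams = [0.124, 0.81, 0.179]  # P(електрон|систем), P(контент|електрон), P(комерц|контент)

prob = math.prod(p_bigrams)
log_prob = sum(math.log(x) for x in p_bigrams)

print(round(prob, 8))                          # 0.01797876
print(math.isclose(math.exp(log_prob), prob))  # True
```

Summing log probabilities instead of multiplying raw ones is exactly the trick the text mentions: the sum stays in a comfortable numeric range even as the chain grows.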
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions</head><p>The general architecture of computer linguistic systems is developed based on the main processes of processing information resources, such as integration, maintenance and content management, as well as on methods of intellectual and linguistic analysis of the text flow using machine learning technology. The IT of intellectual analysis of the text flow based on the processing of information resources has been improved, which made it possible to adapt the typical structure of content integration, management and support modules to solve various NLP problems and increase the efficiency of CLS functioning by 6-9%. This became possible thanks to the combination of linguistic analysis methods adapted to the Ukrainian language, improved IT processing of information resources, ML and a set of metrics for evaluating the effectiveness of CLS functioning. The main principle of building such CLS is modularity, which facilitates their construction according to the requirements for the availability of appropriate processes for solving a specific NLP problem. The main NLP methods based on regular expression matching with patterns in grapheme and morphological analyses of Ukrainian-language texts are described. NLP methods based on pattern-matching regular expressions have been improved, which made it possible to adapt methods of text tokenization and normalization by cascades of simple substitutions of regular expressions and finite state machines. The main valid operations of regular expressions are defined: concatenation and disjunction of symbols/strings/expressions, counting and precedence operators, as well as anchors as special symbols for identifying the presence/absence of symbols in an RE. The main stages of tokenization and normalization of the Ukrainian text by cascades of simple substitutions of regular expressions and finite state machines are defined. 
The MA method for Ukrainian-language text, based on word segmentation and normalization, sentence segmentation and a modified Porter stemming algorithm, was improved as an effective means of identifying lemma affixes for marking the analysed word, which made it possible to increase the accuracy of keyword searches by 9%. The corresponding algorithms for word segmentation and normalization, sentence segmentation, and modified Porter stemming are implemented and described. Unlike the classic Porter algorithm (it does not have high accuracy even for English-</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: CLS content pipeline monitoring/management scheme</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Scheme of processing the CLS content pipeline</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: The process of forming and optimizing a machine-learning model</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Frequency matrix of co-occurrence of words</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head></head><label></label><figDesc>[a-я]*|$)/ to item 5, the possibility of meeting the word analysis at the beginning or the end of the line is added, when no character exists in these positions 7 /[0-9]+ (\$|грн\.|EU)/ the integer value of the price in грн. (UAH), or US/EU currency 8 /[0-9]+\,[0-9][0-9] грн\./ the actual value of the price in грн. (UAH) 9 /(^|\W)[0-9]+(\,[0-9][0-9])? (\$|грн\.|EU)?\b/ the actual value of the price in the currency of Ukraine/USA/EU at the level of a word in a sentence/utterance/phrase 10 /(^|\W)[0-9]{0,5}(\,[0-9][0-9])? (\$|грн\.|EU)?\b/ the actual value of the price in the currency of Ukraine/USA/EU at the word level, taking into account the limitation of the number of digits before the comma 11 /\b[6-9]+˽*(UAH|₴|грн\.| [Гг]грив(ня|ні|ень))\b/ lines with a price value &gt; 5 in the currency of Ukraine, taking into account various options for designations and abbreviations 12 /\b[0-9]+(\,[0-9]+)?˽* (UAH|₴|грн\.?)\b/ lines with the valid value of the price in the currency of Ukraine, taking into account the presence/absence of various options for designations and abbreviations</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Content partitioning, grapheme segmentation and labelling</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head></head><label></label><figDesc>with prefixes: re-(rule PFX 1), de-(rule PFX 2), dis-(rule PFX 3), con-(rule PFX 4), in-( PFX rule 5), pro-(PFX rule 6) and un-(PFX rule 7).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: The relation of keywords in the CLS database of text rubrics</figDesc><graphic coords="17,99.18,500.86,396.65,181.19" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Noun classification dictionaries for Ukrainian words</figDesc><graphic coords="18,179.60,85.05,235.80,171.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_10"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: Noun classification dictionaries for English words</figDesc><graphic coords="18,102.20,443.81,130.03,233.90" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_11"><head>Figure 9 :</head><label>9</label><figDesc>Figure 9: Rules for reduction to the base of a word of the noun type</figDesc><graphic coords="19,91.25,414.65,412.45,179.90" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_12"><head>Figure 10 :</head><label>10</label><figDesc>Figure 10: An example of the rules of morphological analysis of Ukrainian nouns</figDesc><graphic coords="20,106.25,185.16,382.47,453.60" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_13"><head>Figure 11 :</head><label>11</label><figDesc>Figure 11: An example of rules for identifying the negative form of Ukrainian words</figDesc><graphic coords="21,126.50,267.12,341.40,122.24" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_14"><head>Figure 12 :Figure 13 :</head><label>1213</label><figDesc>Figure 12: The ratio of words blocked by the moderator</figDesc><graphic coords="21,140.00,425.84,314.30,169.83" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_15"><head>Figure 14 :</head><label>14</label><figDesc>Figure 14: An example of the rules of morphological analysis of Ukrainian adjectives</figDesc><graphic coords="22,100.13,85.05,394.75,347.75" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_16"><head>Figure 15 :</head><label>15</label><figDesc>Figure 15: An example of the rules of morphological analysis of Ukrainian verbs</figDesc><graphic coords="23,94.25,196.77,406.15,398.28" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_17"><head>Figure 17 :</head><label>17</label><figDesc>Figure 17: Modified stemming algorithm</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_18"><head>Algorithm 4 . 2 .Stage 1 .</head><label>421</label><figDesc>Abbreviated naive processing of textual content Rough tokenization (or grapheme analysis) of special characters of the input text.Step 1.1. Reading the text and removing repeated consecutive spaces and tags if they are present (if the text is integrated from a Web resource), sequentially marking the service characters of the beginning/end of the paragraph/heading/text, etc. Step 1.2. Grapheme parsing and segmentation between service characters or tags of the input text 𝑋,</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_19"><head>) 𝑅𝑌 3 𝑥 3 ′ 7 . 3 ′ 12 . ( 3 ) 3 ′ 4 . ( 1 ) 3 ′ 13 .</head><label>373123341313</label><figDesc>-11. .......... (2; 5 times)3. (1) 𝑅𝑌1 𝑥 1 ′ 𝑌 3 𝑥 𝑥 2 𝑅𝑌 1 𝑌 1 𝑌 3 𝑥 2 ′ 𝑥 1 ′ 𝑥 1 ′ 𝑥 𝑅𝑌 1 𝑥 1 ′ 𝑌 1 𝑥 1 ′ 𝑌 3 𝑥-15. .......... (3; 3 times) 5. (1</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_20"><head>Figure 19 : 2 .</head><label>192</label><figDesc>Figure 19: Examples of natural description for so-called non-design constructions</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_21"><head></head><label></label><figDesc>витяг з протоколу звiтування з наукової дiяльностi заступника завiдувача кафедри IСМ iнституту IКНI Нацiонального унiверситету "Львiвська полiтехнiка" мiста Львова країни Українa or жена сына заместителя председателя второй секции эклектики совета по прикладной мистике при президиуме Академии наук королевства Myрак</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_22"><head>Figure 20 :</head><label>20</label><figDesc>Figure 20: Examples of natural description of sequential subordination</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_23"><head>1 .Figure 21 : 1 Example 2 .</head><label>12112</label><figDesc>Figure 21: The process of deriving the Ukrainian-language chain for example 1 Example 2. 𝑃 = {𝑆 ̃𝑥,𝑦,𝑧 → 𝑆 𝑥,𝑦,𝑧 𝑆 ̃𝑥′ ,𝑦 ′ ,𝑝 , 𝑆 ̃𝑥,𝑦,𝑧 → 𝐴 ̃𝑥,𝑦,𝑧 𝑆 𝑥,𝑦,𝑧 , 𝑆 ̃𝑥,𝑦,𝑧 → 𝑆 𝑥,𝑦,𝑧 , 𝐴 ̃𝑥,𝑦,𝑧 → {дуже, досить, точно, просто, суттєво, . . . }𝐴 𝑥,𝑦,𝑧 , 𝐴 ̃𝑥,𝑦,𝑧 → 𝐴 𝑥,𝑦,𝑧 , 𝑆 ж,𝑦,𝑧 → школа 𝑦,𝑧 , . .., 𝑆 ч,𝑦,𝑧 → сміх 𝑦,𝑧 , школяр 𝑦,𝑧 , Львів 𝑦,𝑧 , . ..,</figDesc><graphic coords="36,168.13,430.52,258.75,188.25" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_24"><head>Figure 22 :Figure 23 :</head><label>2223</label><figDesc>Figure 22: The process of deriving the Ukrainian-language chain for example 2</figDesc><graphic coords="37,175.63,146.97,243.75,210.45" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_25"><head>Algorithm 4 . 3 .Stage 1 .</head><label>431</label><figDesc>Algorithm of sentence syntactic analysis. An unconstrained generated sequence is generated to the right by 𝑁 𝑖 as a syntactic group or sentence based on the rules of 𝑅 ′ . Stage 2. Any of 𝑁 𝑖 based on 𝑅 𝑖 is expanded indefinitely in the form of a tree (Fig.24) from right to left -into a chain of terminal symbols as words.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_26"><head>Figure 24 :</head><label>24</label><figDesc>Figure 24: Rules for building a tree</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_27"><head>Figure 30 :</head><label>30</label><figDesc>Figure 30: Class diagram for the Complex Sentence type hierarchy</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_28"><head></head><label></label><figDesc>𝑃(систем|комп ′ ютер лінгвіст) = 𝑁(комп ′ ютер лінгвіст систем) 𝑁(комп ′ ютер лінгвіст) , 𝑃(систем|комп ′ ютер лінгвіст) = 𝑃(комп ′ ютер лінгвіст систем) 𝑃(комп ′ ют лінгвіст)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_29"><head>𝑃(𝐴 1</head><label>1</label><figDesc>𝐴 2 … 𝐴 𝑛 ) = 𝑃(𝐴 1 )𝑃(𝐴 2 |𝐴 1 )𝑃(𝐴 3 |𝐴 1 2 ) … 𝑃(𝐴 𝑛 |𝐴 1 𝑛−1 ), 𝑃(𝐴 1 𝐴 2 … 𝐴 𝑛 ) = ∏ 𝑃(𝐴 𝑖 |𝐴 1 𝑖−1 ). Probability of appearance of the letters н, и, в, т, е, р, с, м, к, л, д, у, п, я, з, б, ч, г, ю, х, ц, ж, й, ш, щ, ф, … in the standard and analyzed passages.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_30"><head></head><label></label><figDesc>𝑃(систем|лінгвіст) = 𝑁(лінгвіст систем) / 𝑁(лінгвіст), 𝑃(систем|лінгвіст) = 𝑃(лінгвіст систем) / 𝑃(лінгвіст). (42) To forecast the conditional probability of the following word base, we use the Markov assumption (the probability of the word depends only on the previous one): 𝑃(𝑥 𝑛 |𝑥 1 𝑛−1 ) ≈ 𝑃(𝑥 𝑛 |𝑥 𝑛−1 ). (43) To predict the conditional probability of the next word base in the N-gram based on the Maximum Likelihood Estimation (MLE) metric we calculate: 𝑃(𝑥 𝑛 |𝑥 1 𝑛−1 ) ≈ 𝑃(𝑥 𝑛 |𝑥 𝑛−𝑘+1 𝑛−1 ). (44) Based on this, we calculate the probability of a complete sequence of word stems: 𝑃(𝑥 1 𝑛 ) ≈ ∏ 𝑃(𝑥 𝑖 |𝑥 𝑖−1 ). We find the MLE estimate for the parameters of the N-gram model by statistically analyzing the corresponding text corpus and normalizing the frequency of occurrences of word bases and their sequences within [0;1]: 𝑃(𝑥 𝑛 |𝑥 𝑛−1 ) = 𝑁(𝑥 𝑛−1 𝑥 𝑛 ) / ∑ 𝑥 𝑁(𝑥 𝑛−1 𝑥) = 𝑁(𝑥 𝑛−1 𝑥 𝑛 ) / 𝑁(𝑥 𝑛−1 ).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_31"><head>&lt;p&gt; 3 ; 2 .Algorithm 4 . 4 .Stage 1 .</head><label>32441</label><figDesc>CLS опрацьовує текстовий контент на основі NLP-процесів &lt;/p&gt; &lt;p&gt; Інтеграція текстового контенту є одним із основних процесів CLS &lt;/p&gt; &lt;p&gt; CLS розв'язує конкретну NLP-задачу для відповідного контенту&lt;/p&gt; 𝑃(𝐶𝐿𝑆| &lt; 𝑝 &gt;) = 2 Estimation of the MLE parameter for the N-gram model as a relative frequency: Algorithm for the analysis of MLE-parameter estimates for the N-gram model. Parse the input text and break it into separate phrases (sentences)𝑅 1 𝑅 2 … 𝑅 𝑚 , marking each start-end with a corresponding &lt;p&gt; &lt;/p&gt; tag. Eliminate all non-alphabetic characters. Convert uppercase letters to lowercase. Remove service words if necessary (for certain NLP tasks). Stage 2. Apply Porter's stemming to obtain the sequence of word bases 𝑥 𝑖1 𝑥 𝑖2 … 𝑥 𝑖𝑛 𝑖 of word bases 𝑅 𝑖 taking into account word normalization. Stage 3. Receive input requests𝑄 1 𝑄 2 … 𝑄 𝑘 as a sequence of words of the searched data. Find 𝑄 𝑗 for each word 𝑦 𝑗1 𝑦 𝑗2 … 𝑦 𝑗𝑘 𝑗 basis by stemming.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_32"><head>Stage 5 .</head><label>5</label><figDesc>метод та засіб опрац інформ ресурс систем електрон контент комерц Find the probability of occurrence of 2-grams in the analyzed text. In each row, the value is divided by 𝑦 𝑗𝑖 , where 𝑖 is the row number after normalization.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="30,113.00,85.05,368.92,243.15" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>𝑆 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 , 𝑁 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 , 𝑊 𝑊𝑜𝑟𝑑 , 𝑁 𝑤𝑜𝑟𝑑 , 𝑓 𝑝𝑎𝑟𝑠𝑒𝑔𝑒𝑛𝑑𝑒𝑟 , 𝑓 𝑝𝑐𝑒𝑛𝑡 &gt;, 𝑆𝑖𝑛𝑔 𝑇𝑃 = 𝑓 𝑝𝑐𝑒𝑛𝑡 (𝑓 𝑝𝑎𝑟𝑠𝑒𝑔𝑒𝑛𝑑𝑒𝑟 (𝑁 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 , 𝑊 𝑊𝑜𝑟𝑑 , 𝑁 𝑤𝑜𝑟𝑑 , 𝑓 𝑐𝑜𝑢𝑛𝑡𝑔𝑒𝑛𝑑𝑒𝑟 (𝑆 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 )),</figDesc><table><row><cell>𝑆𝑖𝑛𝑔 𝑇𝑆 =</cell><cell>𝑁 𝐺𝑒𝑛𝑑𝑒𝑟</cell><cell>𝑁 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 𝑘 = 𝑆 𝑁𝐺 [𝑆𝑖𝑛𝑔 𝐺𝑆 𝑘 ] 𝑝𝑐𝑒𝑛𝑡 𝑘 = ( 𝑡𝑜𝑡𝑎𝑙 )  *  100 𝑊 𝑁𝐺 𝑘</cell></row><row><cell cols="3">[ 𝑝𝑟𝑖𝑛𝑡(𝑝𝑐𝑒𝑛𝑡 𝑘 , 𝑆𝑖𝑛𝑔 𝐺𝑆 𝑘 , 𝑁 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 𝑘 ) 𝑡𝑜𝑡𝑎𝑙 = ∑ 𝑊 𝑁𝐺 𝑖 𝑘=1 𝑁 𝐺𝑒𝑛𝑑𝑒𝑟 , 𝑁 𝑆 𝑆 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 = ⋃ 𝑋 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 𝑖 𝑖 𝑖=1 𝑋 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 𝑖 = ⋃ 𝑊 𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 𝑖 𝑁 𝑊𝑆 𝑖 , 𝑊 𝑊𝑜𝑟𝑑 = ⋃ 𝑊 𝑊𝑜𝑟𝑑 𝑖 𝑡𝑜𝑡𝑎𝑙 𝑖</cell><cell>,</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 1</head><label>1</label><figDesc>Regular expressions of GA texts in the Ukrainian language for recognition of all characters</figDesc><table><row><cell>N</cell><cell>RE</cell><cell>Recognition</cell><cell>Example and result</cell></row><row><cell>1</cell><cell>/контент/</cell><cell>the exact sequence of substring</cell><cell>Структурна схема лінгвістичного аналізу</cell></row><row><cell></cell><cell></cell><cell>characters, taking into account the case</cell><cell>текстового контенту</cell></row><row><cell>2</cell><cell>/к/</cell><cell>a specific character, taking into account</cell><cell>Контент-аналіз застосовують для</cell></row><row><cell></cell><cell></cell><cell>the case</cell><cell>аналізу потоків контенту</cell></row><row><cell>3</cell><cell>/-/</cell><cell>specific special character</cell><cell>Контент-аналіз застосовують</cell></row><row><cell>4</cell><cell>/[кК]онтент/</cell><cell>exact sequence of characters without</cell><cell>Контент-аналіз застосовують для</cell></row><row><cell></cell><cell></cell><cell>taking into account the case of the 1st</cell><cell>аналізу потоків контенту</cell></row><row><cell></cell><cell></cell><cell>character</cell><cell></cell></row><row><cell>5</cell><cell>/[онві]/</cell><cell>or о, or н, or в, or і</cell><cell>Контент-аналіз застосовують</cell></row><row><cell>6</cell><cell>/[0123456789]/</cell><cell>Any number in a string sequence</cell><cell>RE чутливі до регістру-правила 1, 2 та 4</cell></row><row><cell></cell><cell></cell><cell></cell><cell>дають різні результати</cell></row><row><cell>7</cell><cell>/[0123]/</cell><cell>or 0, or 1, or 2, or 3</cell><cell>RE чутливі до регістру-правила 1, 2 та 4</cell></row><row><cell></cell><cell></cell><cell></cell><cell>дають різні результати</cell></row><row><cell>8</cell><cell>/[0-9]/</cell><cell>Any number in a string sequence</cell><cell>RE чутливі до регістру-правила 1, 2 та 
4</cell></row><row><cell></cell><cell></cell><cell></cell><cell>дають різні результати</cell></row><row><cell>9</cell><cell>/[а-я]/</cell><cell>Any lowercase letter of the Ukrainian</cell><cell>Контент-аналіз застосовують</cell></row><row><cell></cell><cell></cell><cell>alphabet</cell><cell></cell></row><row><cell>10</cell><cell>/[А-Я]/</cell><cell>Any uppercase letter of the Ukrainian</cell><cell>Контент-аналіз застосовують</cell></row><row><cell></cell><cell></cell><cell>alphabet</cell><cell></cell></row><row><cell>11</cell><cell>/[А-Яа-я]/</cell><cell>Any letter of the Ukrainian alphabet,</cell><cell>Контент-аналіз застосовують</cell></row><row><cell></cell><cell></cell><cell>regardless of case</cell><cell></cell></row><row><cell>12</cell><cell>/[A-Z]/</cell><cell>Any uppercase letter of the English</cell><cell>RE чутливі до регістру-правила 1, 2 та 4</cell></row><row><cell></cell><cell></cell><cell>alphabet</cell><cell>дають різні результати</cell></row><row><cell>13</cell><cell>/[^А-Я]/</cell><cell>Any character other than an uppercase</cell><cell>Контент-аналіз застосовують для</cell></row><row><cell></cell><cell></cell><cell>letter of the Ukrainian alphabet</cell><cell>аналізу потоків контенту</cell></row><row><cell>14</cell><cell>/[^Кк]/</cell><cell>Any character except the letters К and к</cell><cell>Контент-аналіз застосовують для</cell></row><row><cell></cell><cell></cell><cell></cell><cell>аналізу потоків контенту</cell></row><row><cell>15</cell><cell>/[^\.]/</cell><cell>Any character except the dot character</cell><cell>Контент-аналіз застосовують</cell></row><row><cell>16</cell><cell>/[к^]/</cell><cell>or к, or ^</cell><cell>аналіз потоків контенту</cell></row><row><cell>17</cell><cell>/x^y/</cell><cell>String pattern x^y</cell><cell>функція x^y</cell></row><row><cell>18</cell><cell>/^[А-Я]/</cell><cell>Any uppercase letter of the Ukrainian</cell><cell>Контент-аналіз застосовують для</cell></row><row><cell></cell><cell></cell><cell>alphabet at the 
beginning of a line</cell><cell>аналізу потоків контенту в CLS</cell></row><row><cell>19</cell><cell>/^а/</cell><cell>The letter а at the beginning of the line</cell><cell>Контент-аналіз застосовують</cell></row></table></figure>
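The case-sensitivity behaviour of rules 1, 2, 4 and 9-11 in Table 1 can be checked directly with Python's `re` module; this is an illustrative sketch, not the paper's implementation. One caveat worth noting: the specifically Ukrainian letters і, ї, є and ґ lie outside the base Cyrillic ranges а-я/А-Я, so a fuller character class has to list them explicitly.

```python
import re

text = "Контент-аналіз застосовують для аналізу потоків контенту"

# Rule 1: /контент/ is case-sensitive, so the capitalized "Контент" is missed
exact = re.findall(r"контент", text)

# Rule 4: /[кК]онтент/ matches the first letter in either case
both_cases = re.findall(r"[кК]онтент", text)

# Rules 9-11: base Cyrillic ranges; і/ї/є/ґ must be added explicitly
letters = re.findall(r"[А-Яа-яІіЇїЄєҐґ]", text)
```

Running this, `exact` finds only the lowercase occurrence, while `both_cases` also catches the sentence-initial capitalized form.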
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>or ж or т or л or д or ч or з Віддалено ллється на ланах нашого життя беззмінне збіжжя знання як обличчя особистого досвіду!</head><label></label><figDesc></figDesc><table><row><cell>/[нжтлдчз]/</cell><cell>or н or ж or т or л or д or ч or з</cell><cell>Віддалено ллється на ланах нашого</cell></row><row><cell></cell><cell></cell><cell>життя беззмінне збіжжя знання як</cell></row><row><cell></cell><cell></cell><cell>обличчя особистого досвіду!</cell></row><row><cell>/[0-9]*/</cell><cell>or none or an arbitrary number of one</cell><cell></cell></row><row><cell></cell><cell>element from the range 0-9</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>RE чутливі до регістру-правила 1, 2 та 4 дають різні результати</head><label></label><figDesc></figDesc><table><row><cell>/[0-9][0-9]*/</cell><cell>One digit from the range 0-9 is</cell><cell>RE чутливі до регістру-правила 1, 2 та 4</cell></row><row><cell></cell><cell>required, the other is not, but if there is</cell><cell>дають різні результати</cell></row><row><cell></cell><cell>-any number of one of 0-9</cell><cell></cell></row><row><cell>/[0-9]+/</cell><cell>any number of different digits from 0-9</cell><cell>Спецсимвол знаку питання ? для RE-</cell></row><row><cell></cell><cell></cell><cell>правил 20-21</cell></row><row><cell>/[нжтлдчз]+/</cell><cell>one or н or ж or т or л or д or ч or з or</cell><cell>Віддалено ллється на ланах нашого</cell></row><row><cell></cell><cell>several, or any combination thereof</cell><cell>життя беззмінне збіжжя знання як</cell></row><row><cell></cell><cell></cell><cell>обличчя особистого досвіду!</cell></row><row><cell>/[нжтлдчз]{2}/</cell><cell>exactly two or н or ж or т or л or д or ч</cell><cell>Віддалено ллється на ланах нашого</cell></row><row><cell></cell><cell>or з</cell><cell>життя беззмінне збіжжя знання як</cell></row><row><cell></cell><cell></cell><cell>обличчя особистого досвіду!</cell></row><row><cell>/аналіз.*аналіз/</cell><cell>String identification using a double</cell><cell>Контент-аналіз застосовують для</cell></row><row><cell></cell><cell>word аналіз</cell><cell>аналізу потоків контенту в CLS</cell></row><row><cell>/^В/</cell><cell>В at the beginning of the line</cell><cell>В наш час в Інтернет все є.</cell></row><row><cell>/^Контент-</cell><cell>recognition of a specific phrase</cell><cell>Контент-аналіз.˽</cell></row><row><cell>аналіз$/</cell><cell></cell><cell></cell></row><row><cell>/˽$/</cell><cell>marking a space at the end of a line</cell><cell>Контент-аналіз ˽ 
застосовують˽</cell></row><row><cell>/^Контент-</cell><cell>recognition of a specific phrase with a</cell><cell>Контент-аналіз.˽</cell></row><row><cell>аналіз\. $/</cell><cell>period and a space at the end of the</cell><cell></cell></row><row><cell></cell><cell>line</cell><cell></cell></row><row><cell>/^/[А-Я]\. $/</cell><cell>recognition of all possible sentences</cell><cell>В</cell></row></table></figure>
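The quantifier and anchor patterns in the preceding rows map directly onto Python `re` syntax; a small sketch under that assumption, reusing the table's own example strings (˽ in the table marks a literal space):

```python
import re

# /[0-9]+/ — one or more digits
years = re.findall(r"[0-9]+", "вже 2022 рік")

# /[нжтлдчз]{2}/ — exactly two consecutive characters from the class
doubles = re.findall(r"[нжтлдчз]{2}", "беззмінне збіжжя знання")

# /^В/ — В anchored at the beginning of the line
starts = re.match(r"^В", "В наш час в Інтернет все є.")

# / $/ — a space at the end of the line
trailing = re.search(r" $", "Контент-аналіз ")
```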
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>наш час в Інтернет все є.˽</head><label></label><figDesc></figDesc><table><row><cell cols="3">implements the disjunction of the values upon matching. RE-rule 6 recognizes any number</cell></row><row><cell cols="3">in a sequence of string characters. The dash special character -in the middle [] for RE-rules</cell></row><row><cell cols="3">8-12 allows not to list all characters but indicates any character in the corresponding range.</cell></row><row><cell cols="3">For example, Pattern /[3-6]/ indicates any of the characters 3, 4, 5, or 6, and /[в-ж]/</cell></row><row><cell cols="3">indicates one of the characters в, г, д, or ж in the grapheme analysis of the input test. The</cell></row><row><cell cols="3">caret or circumflex character ^ inside [] for RE-rules 13-18 carries a different content load</cell></row><row><cell cols="3">depending on the location. If at the beginning immediately after [ means, all characters after</cell></row><row><cell cols="3">it are rejected in the parsed character string (RE 13-15). 
The caret ^ has 3 purposes: to</cell></row><row><cell cols="3">indicate the beginning of a line (not inside [] -RE 18-19); to indicate negation within [] (RE</cell></row><row><cell>/\bаналіз\b/</cell><cell>recognition of a specific set of symbols</cell><cell>Контент-аналіз застосовують для</cell></row><row><cell></cell><cell>(words) taking into account boundaries</cell><cell>аналізу потоків контенту</cell></row><row><cell>/\b19\b/</cell><cell>recognizing a word as a number</cell><cell>Йому виповнилось 19 в 2019.</cell></row><row><cell>/\b3\b/</cell><cell>word recognition within limits</cell><cell>Ціна -3$ за 13 одиниць.</cell></row><row><cell>/\b5\b/</cell><cell>word recognition within limits</cell><cell>Ціна -5Є за 5 одиниць.</cell></row><row><cell>/ML|МН/</cell><cell>recognition of abbreviations ML or МН</cell><cell>Реалізація CLS на основі МЛ</cell></row><row><cell>/контент(у|ний)/</cell><cell>recognition of words with different</cell><cell>Контентний аналіз застосовують до</cell></row><row><cell></cell><cell>inflections</cell><cell>великих потоків контенту</cell></row><row><cell>/№˽[0-9]+˽*/</cell><cell>1 digit with any number of spaces</cell><cell>В˽колонці ˽ №˽3˽˽˽˽˽˽</cell></row><row><cell>/(№˽[1-9]+˽*)*/</cell><cell>recognition of arbitrary sequence</cell><cell>В ˽ колонках ˽ №˽1˽ та ˽ №˽3˽, але не</cell></row><row><cell></cell><cell>number № and any number</cell><cell>в №˽13˽</cell></row><row><cell cols="3">RE is case-sensitive -rules 1, 2 and 4 give different results. Using the special characters</cell></row><row><cell cols="3">[ and ] solves the case-sensitivity problem of RE. The string of characters in the middle of []</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 2</head><label>2</label><figDesc>Regular expressions to recognize keywords, stop words and tokens</figDesc><table><row><cell>N</cell><cell>RE</cell><cell>Recognition</cell></row><row><cell>1</cell><cell>/але/</cell><cell>simple (but incorrect) pattern -it also matches the character sequence</cell></row><row><cell></cell><cell>/аналіз/</cell><cell>inside other words in the input string</cell></row><row><cell>2</cell><cell>/[аА]ле/</cell><cell>case-tolerant for the first letter, but unfortunately still matches inside</cell></row><row><cell></cell><cell>/[аА]наліз/</cell><cell>other words, such as малеча or каналізація</cell></row><row><cell>3</cell><cell>/\b[аА]ле\b/</cell><cell>takes word boundaries into account (no letters, underscores or digits on</cell></row><row><cell></cell><cell>/\b[аА]наліз\b/</cell><cell>either side) -works well for але, but now misses inflected forms of</cell></row><row><cell></cell><cell></cell><cell>аналіз</cell></row><row><cell>4</cell><cell></cell><cell></cell></row></table><note>/[^а-яА-Я][аА]наліз[а-я]/ No letter of either case before аналіз, and an arbitrary lowercase letter of the Ukrainian alphabet after it 5 /\b[аА]наліз[а-я]*/ No letter, underscore or digit before аналіз, followed by any lowercase letter of the Ukrainian alphabet or none 6 /(^|\b[аА]ле\b/ /(^|\b[аА]наліз(</note></figure>
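The progression in Table 2 from the naive pattern to the boundary-aware one can be demonstrated in a few lines; this is an illustrative sketch assuming Python `re` semantics, with a toy sentence:

```python
import re

text = "Але малеча грає, але аналіз триває"

# Rule 1: /але/ — also fires inside "малеча" and misses the capitalized "Але"
naive = re.findall(r"але", text)

# Rule 3: /\b[аА]ле\b/ — case-tolerant first letter plus word boundaries
bounded = re.findall(r"\b[аА]ле\b", text)
```

The naive pattern returns two hits, one of them a false positive inside "малеча"; the bounded pattern returns exactly the two standalone occurrences of the stop word.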
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Table 3</head><label>3</label><figDesc>Basic RE aliases for general GA ranges</figDesc><table><row><cell>N</cell><cell>Range</cell><cell>RE</cell><cell>Recognition</cell><cell>Example</cell></row><row><cell>1</cell><cell>[˽\n\t\f\r]</cell><cell>\s</cell><cell>any spaces and tabs</cell><cell>аналіз˽контенту</cell></row><row><cell>2</cell><cell>[^\s]</cell><cell>\S</cell><cell>no spaces or tabs</cell><cell>аналіз˽контенту</cell></row><row><cell>3</cell><cell>[0-9]</cell><cell>\d</cell><cell>any number from the range</cell><cell>14˽лютого˽2005</cell></row><row><cell>4</cell><cell>[^0-9]</cell><cell>\D</cell><cell>no digit from the range</cell><cell>14˽лютого˽2005</cell></row><row><cell>5</cell><cell>[a-яА-Я0-9_]</cell><cell>\w</cell><cell>any letter, number and underscore</cell><cell>контент-аналіз</cell></row><row><cell>6</cell><cell>[^\w]</cell><cell>\W</cell><cell>no letter, number or underscore</cell><cell>контент-аналіз</cell></row><row><cell>7</cell><cell>\b[0-9]*\b</cell><cell>*</cell><cell>none or several previous REs</cell><cell>вже 22 рік</cell></row><row><cell>8</cell><cell>\b[0-9]+\b</cell><cell>+</cell><cell>one or more previous RE</cell><cell>вже 2022 рік</cell></row><row><cell>9</cell><cell>\b[0-9]?\b</cell><cell>?</cell><cell>definitely absent or present once</cell><cell>22 рік 2 століття</cell></row><row><cell>10</cell><cell>\b[0-9]{2}\b</cell><cell>{n}</cell><cell>a certain number of repetitions</cell><cell>22 рік 2 тисячоліття</cell></row><row><cell>11</cell><cell>\b[0-9]{1,2}\b</cell><cell>{n,m}</cell><cell cols="2">in the range of a certain number of repetitions 22 рік 2 тисячоліття</cell></row><row><cell>12</cell><cell>\b[0-9]{2,}\b</cell><cell>{n,}</cell><cell>at least a certain number of repetitions</cell><cell>22 рік 2 тисячоліття</cell></row><row><cell>13</cell><cell>\b[0-9]{,2}\b</cell><cell>{,m}</cell><cell>to a certain number of repetitions</cell><cell>22 рік 2 
тисячоліття</cell></row></table></figure>
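Table 3's aliases correspond to Python's shorthand classes; note that in Python 3 `\w` and `\b` are Unicode-aware, so Cyrillic letters already count as word characters (the explicit letter range in row 5 is only an approximation of this). A sketch under those assumptions:

```python
import re

s = "контент-аналіз 14 лютого 2005"

digits = re.findall(r"\d+", s)   # \d — any digit run
tokens = re.split(r"\s+", s)     # \s — whitespace separators
words = re.findall(r"\w+", s)    # \w — Unicode-aware: includes Cyrillic letters

# {n,m} repetition (row 11): one- or two-digit numbers as whole words
short_nums = re.findall(r"\b\d{1,2}\b", "22 рік 2 тисячоліття")
```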
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_9"><head></head><label></label><figDesc>𝑓_raw() is the organization of access to previously unprocessed text; 𝑓_html() is the elimination of non-text content, scripts and style tags; 𝑓_paras() is the identification of individual paragraphs in the content text; 𝑓_sents() is the identification of individual sentences in the content text; 𝑓_tokens() is the identification of individual tokens in the content text; 𝑓_mark() is the grapheme labelling of identified tokens based on RE; 𝑇_marked = 𝑓_mark(𝑓_tokens(𝑓_sents(𝑓_paras(𝑓_html(𝑓_raw(𝑋_content))))))</figDesc><table /></figure>
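The chain 𝑇_marked = 𝑓_mark(𝑓_tokens(𝑓_sents(𝑓_paras(𝑓_html(𝑓_raw(𝑋_content)))))) is plain function composition. The sketch below keeps the paper's function names, but every body is a simplified assumption (the paper gives roles, not implementations):

```python
import re

def f_raw(source):
    # access to the raw, previously unprocessed text
    return source

def f_html(text):
    # eliminate script/style blocks and remaining markup (crude illustration)
    text = re.sub(r"(?s)<(script|style)[^>]*>.*?</\1>", " ", text)
    return re.sub(r"<[^>]+>", " ", text)

def f_paras(text):
    # individual paragraphs: split on blank lines
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]

def f_sents(paras):
    # individual sentences: split after terminal punctuation
    return [s.strip() for p in paras
            for s in re.split(r"(?<=[.!?])\s+", p) if s.strip()]

WORD = r"[А-Яа-яІіЇїЄєҐґ'’]+"

def f_tokens(sents):
    # individual tokens: Ukrainian words, numbers, single other symbols
    return [re.findall(WORD + r"|\d+|\S", s) for s in sents]

def f_mark(token_sents):
    # grapheme labelling of identified tokens based on RE
    def label(t):
        if re.fullmatch(r"\d+", t):
            return (t, "NUM")
        if re.fullmatch(WORD, t):
            return (t, "WORD")
        return (t, "OTHER")
    return [[label(t) for t in sent] for sent in token_sents]

T_marked = f_mark(f_tokens(f_sents(f_paras(f_html(f_raw(
    "<p>Контент-аналіз застосовують.</p>\n\nВже 2022 рік!"))))))
```

On the toy input this yields two labelled sentences, with "2022" tagged NUM and the punctuation tagged OTHER.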
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_10"><head>Table 4</head><label>4</label><figDesc>Examples of Ukrainian and English words/flags for identifying keywords</figDesc><table><row><cell>N</cell><cell>Ukrainian</cell><cell>English</cell><cell>N</cell><cell>Ukrainian</cell><cell>English</cell></row><row><cell>1</cell><cell>курсорний/V</cell><cell>cursoriness/17,13</cell><cell cols="2">39 буферизувати/ABGH</cell><cell>buffer/18,9,13,17,10,23</cell></row><row><cell>2</cell><cell></cell><cell>cursorily</cell><cell cols="2">40 відформатувати/AB</cell><cell>format/1,20,17</cell></row><row><cell>3</cell><cell></cell><cell>cursor/9,13,17,10</cell><cell>41</cell><cell>кодувати/ABGH</cell><cell>code/17,2,23,10,12,18,9</cell></row><row><cell>4</cell><cell></cell><cell>cursory/16</cell><cell>42</cell><cell>кешувати/ABGH</cell><cell>cache/9,17,18,10,13</cell></row><row><cell>5</cell><cell>кирилічний/V</cell><cell>Cyrillic</cell><cell>43</cell><cell>кука/ab</cell><cell>hook/10,23,9,18,13,17</cell></row><row><cell>6</cell><cell>кілобітовий/V</cell><cell>kilobit/17</cell><cell>44</cell><cell>клавіатурний/V</cell><cell>keyboard/18,9,13,23,10,17</cell></row><row><cell>7</cell><cell>кілобіт/efg</cell><cell></cell><cell>45</cell><cell>клавіатура/ab</cell><cell></cell></row><row><cell cols="2">8 кілобайтовий/V</cell><cell>kilobyte/17</cell><cell>46</cell><cell>кодосумісний/V</cell><cell>code/17,2,23,10,12,18,9</cell></row><row><cell>9</cell><cell>кілобайт/efg</cell><cell></cell><cell>47</cell><cell></cell><cell>code 
compatible</cell></row><row><cell>10</cell><cell>кодек/efg</cell><cell>coder/2,13</cell><cell>48</cell><cell></cell><cell>compatible/17,5</cell></row><row><cell>11</cell><cell>кодер/efg</cell><cell></cell><cell>49</cell><cell></cell><cell>compatibleness/13</cell></row><row><cell>12</cell><cell>консольний/V</cell><cell>consoled/7</cell><cell>50</cell><cell></cell><cell>compatibility/5,13,17</cell></row><row><cell>13</cell><cell></cell><cell>consoler/13</cell><cell>51</cell><cell></cell><cell>compatibly/5</cell></row><row><cell>14</cell><cell>консоль/ij</cell><cell>console/23,8,10</cell><cell>52</cell><cell>кодогенератор/efg</cell><cell>code/17,2,23,10,12,18,9</cell></row><row><cell>15</cell><cell>Кобол/e</cell><cell>COBOL</cell><cell>53</cell><cell></cell><cell>generators/1</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_11"><head>Table 5</head><label>5</label><figDesc>Basic MA rules for marking nouns when marking a part of speech</figDesc><table><row><cell cols="2">Class flag</cell><cell>N</cell><cell>Features of MA-rules</cell></row><row><cell>І</cell><cell>а</cell><cell cols="2">248 For the singular:</cell></row><row><cell></cell><cell></cell><cell></cell><cell>1 declension: feminine, masculine and neuter nouns.</cell></row></table><note> 2 declensions: masculine in -ар, -ир, stressed (mixed group in -ар, -ир).</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_12"><head>Table 6</head><label>6</label><figDesc>Basic MA rules for marking adjectives as parts of speech</figDesc><table><row><cell cols="2">flag N</cell><cell>Peculiarities of MA rules for recognizing adjectives</cell></row><row><cell>V</cell><cell>83 </cell><cell>singular ending in -ий;</cell></row><row><cell></cell><cell></cell><cell>the short form singular changes to -ен in the same way as the full form (ясен -ясний...);</cell></row><row><cell></cell><cell></cell><cell>ending in -лиций;</cell></row><row><cell></cell><cell></cell><cell>ending in -ій/-їй;</cell></row><row><cell></cell><cell></cell><cell>plurals ending in -ій/-їй;</cell></row><row><cell></cell><cell></cell><cell>possessives from nouns of the 1st declension -names of people in -ин;</cell></row><row><cell></cell><cell></cell><cell>possessives from nouns of the 2nd declension in -ів (solid group);</cell></row><row><cell></cell><cell></cell><cell>possessives from nouns of the 2nd declension in-їв.</cell></row><row><cell>U</cell><cell>13 </cell><cell>soft group of possessives ending in-ів -&gt; -ев;</cell></row><row><cell></cell><cell></cell><cell>plurals ending in -ів.</cell></row><row><cell>W</cell><cell cols="2">3 the formation of an adverb from an adjective, the neuter gender of the comparative form of</cell></row><row><cell></cell><cell cols="2">adjectives corresponds to the corresponding adverb in the comparative form (міцніший -міцніше).</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_13"><head>Table 7</head><label>7</label><figDesc>Basic SFX-type RE of Ukrainian adjectives based on goroh.pp.ua</figDesc><table><row><cell>N</cell><cell cols="3">Flag Genus F1</cell><cell>F2</cell><cell>RE</cell><cell>Numeric</cell><cell>Sign</cell><cell>Example 1</cell><cell>Example 2</cell><cell>Case</cell><cell>N</cell></row><row><cell>1</cell><cell>V</cell><cell>ч</cell><cell>ий</cell><cell>ого</cell><cell>[^ц]ий</cell><cell>одн</cell><cell>in -ий</cell><cell>текстовий</cell><cell>текстового</cell><cell>Р.З.</cell><cell>1</cell></row><row><cell>2</cell><cell></cell><cell></cell><cell></cell><cell>ому</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>текстовому</cell><cell>Д.М.</cell><cell>2</cell></row><row><cell>3</cell><cell></cell><cell></cell><cell></cell><cell>им</cell><cell>ий</cell><cell></cell><cell></cell><cell></cell><cell>текстовим</cell><cell>О.Мн:Д.</cell><cell>3</cell></row><row><cell>4</cell><cell></cell><cell></cell><cell></cell><cell>ім</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>текстовім</cell><cell>М.</cell><cell>4</cell></row><row><cell>5</cell><cell></cell><cell>ж</cell><cell></cell><cell>а</cell><cell>[^ц]ий</cell><cell></cell><cell></cell><cell></cell><cell>текстова</cell><cell>Н.</cell><cell>5</cell></row><row><cell>6</cell><cell></cell><cell></cell><cell></cell><cell>ої</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>текстової</cell><cell>Р.</cell><cell>6</cell></row><row><cell>7</cell><cell></cell><cell></cell><cell></cell><cell>ій</cell><cell>ий</cell><cell></cell><cell></cell><cell></cell><cell>текстовій</cell><cell>Д.</cell><cell>7</cell></row><row><cell>8</cell><cell></cell><cell></cell><cell></cell><cell>у</cell><cell>[^ц]ий</cell><cell></cell><cell></cell><cell></cell><cell>текстову</cell><cell>З.</cell><cell>8</cell></row><row><cell>9</cell><cell></cell><cell></cell><cell></cell><cell>ою</cell><cell></cell><cell
></cell><cell></cell><cell></cell><cell>текстовою</cell><cell>О.</cell><cell>9</cell></row><row><cell>10</cell><cell></cell><cell>с</cell><cell></cell><cell>е</cell><cell>ий</cell><cell></cell><cell></cell><cell></cell><cell>текстове</cell><cell>Н.</cell><cell></cell></row><row><cell>11</cell><cell></cell><cell>-</cell><cell></cell><cell>і</cell><cell></cell><cell>мн</cell><cell></cell><cell></cell><cell>текстові</cell><cell></cell><cell></cell></row><row><cell>12</cell><cell></cell><cell></cell><cell></cell><cell>их</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>текстових</cell><cell>Р.</cell><cell></cell></row><row><cell>13</cell><cell></cell><cell></cell><cell></cell><cell>ими</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>текстовими</cell><cell>О.</cell><cell></cell></row><row><cell>14</cell><cell></cell><cell>ч</cell><cell></cell><cell>ього</cell><cell>[^у]ций</cell><cell>одн</cell><cell>in -лиций</cell><cell>білолиций</cell><cell>білолицього</cell><cell>Р.З.</cell><cell></cell></row><row><cell>15</cell><cell></cell><cell></cell><cell></cell><cell>ьому</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>білолицьому</cell><cell>Д.М.</cell><cell></cell></row><row><cell>16</cell><cell></cell><cell>ж</cell><cell></cell><cell>я</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>білолиця</cell><cell>Н.</cell><cell></cell></row><row><cell>17</cell><cell></cell><cell></cell><cell></cell><cell>ьої</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>білолицьої</cell><cell>Р.</cell><cell></cell></row><row><cell>18</cell><cell></cell><cell></cell><cell></cell><cell>ю</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>білолицю</cell><cell>З.</cell><cell></cell></row><row><cell>19</cell><cell></cell><cell></cell><cell></cell><cell>ьою</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>білолицьою</cell><cell>О.</cell><cell></cell></row><row><cell>20</cell><cell></cell><cell>ч</cel
l><cell></cell><cell>ого</cell><cell>уций</cell><cell></cell><cell></cell><cell>куций</cell><cell>куцого</cell><cell>Р.З.</cell><cell></cell></row><row><cell>21</cell><cell></cell><cell></cell><cell></cell><cell>ому</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>куцому</cell><cell>Д.М.</cell><cell></cell></row><row><cell>22</cell><cell></cell><cell>ж</cell><cell></cell><cell>а</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>куца</cell><cell>Н.</cell><cell></cell></row><row><cell>23</cell><cell></cell><cell></cell><cell></cell><cell>ої</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>куцої</cell><cell>Р.</cell><cell></cell></row><row><cell>24</cell><cell></cell><cell></cell><cell></cell><cell>у</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>куцу</cell><cell>З.</cell><cell></cell></row><row><cell>25</cell><cell></cell><cell></cell><cell></cell><cell>ою</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>куцою</cell><cell>О.</cell><cell></cell></row><row><cell></cell><cell></cell><cell>ч</cell><cell>ій</cell><cell>ього</cell><cell>ій</cell><cell></cell><cell>in 
-ій/-їй</cell><cell>крайній</cell><cell>крайнього</cell><cell>Р.</cell><cell></cell></row><row><cell>27</cell><cell></cell><cell></cell><cell></cell><cell>ьому</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>крайньому</cell><cell>Д.</cell><cell></cell></row><row><cell>28</cell><cell></cell><cell></cell><cell></cell><cell>ім</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>крайнім</cell><cell>О.Мн.:Д.</cell><cell></cell></row><row><cell>29</cell><cell></cell><cell>ж</cell><cell></cell><cell>я</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>крайня</cell><cell>Н.</cell><cell></cell></row><row><cell>30</cell><cell></cell><cell></cell><cell></cell><cell>ьої</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>крайньої</cell><cell>Р.</cell><cell></cell></row><row><cell>31</cell><cell></cell><cell></cell><cell></cell><cell>ю</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>крайню</cell><cell>Д.</cell><cell></cell></row><row><cell>32</cell><cell></cell><cell></cell><cell></cell><cell>ьою</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>крайньою</cell><cell>О.</cell><cell></cell></row><row><cell>33</cell><cell></cell><cell>с</cell><cell></cell><cell>є</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>крайнє</cell><cell>Н.</cell><cell></cell></row><row><cell>34</cell><cell></cell><cell>-</cell><cell>й</cell><cell>-</cell><cell>[їі]й</cell><cell>мн</cell><cell></cell><cell></cell><cell>крайні</cell><cell>Н.</cell><cell></cell></row><row><cell>35</cell><cell></cell><cell></cell><cell></cell><cell>х</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>крайніх</cell><cell>Р.</cell><cell></cell></row><row><cell>36</cell><cell></cell><cell></cell><cell></cell><cell>ми</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>крайніми</cell><cell>О.</cell><cell></cell></row><row><cell>37</cell><cell></cell><cell>ч</cell><cell>їй</cell><cell>його</cell><cell>їй</cell
><cell>одн</cell><cell></cell><cell>безкраїй</cell><cell>безкрайого</cell><cell>Р.З.</cell><cell></cell></row><row><cell>38</cell><cell></cell><cell></cell><cell></cell><cell>йому</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>безкрайому</cell><cell>Д.</cell><cell></cell></row><row><cell>39</cell><cell></cell><cell></cell><cell></cell><cell>їм</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>безкраїм</cell><cell>О.М.Мн.:Д.</cell><cell></cell></row><row><cell>40</cell><cell></cell><cell>ж</cell><cell></cell><cell>я</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>безкрая</cell><cell>Н.</cell><cell></cell></row><row><cell>41</cell><cell></cell><cell></cell><cell></cell><cell>йої</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>безкрайої</cell><cell>Р.</cell><cell></cell></row><row><cell>42</cell><cell></cell><cell></cell><cell></cell><cell>ю</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>безкраю</cell><cell>З.</cell><cell></cell></row><row><cell>43</cell><cell></cell><cell></cell><cell></cell><cell>йою</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>безкрайою</cell><cell>О.</cell><cell></cell></row><row><cell>44</cell><cell></cell><cell>с</cell><cell></cell><cell>є</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>безкрає</cell><cell>Н.</cell><cell></cell></row><row><cell>45</cell><cell></cell><cell>ч</cell><cell>-</cell><cell>ого</cell><cell>[їи]н</cell><cell></cell><cell>possessives from</cell><cell>мамин</cell><cell>маминого</cell><cell>Р.</cell><cell></cell></row><row><cell>46 47 48 49</cell><cell></cell><cell>ж</cell><cell></cell><cell>ому им ім а</cell><cell></cell><cell></cell><cell>nouns of the 1st declension -names of people on -ин</cell><cell></cell><cell>маминому маминим маминім мамина</cell><cell>Д. О. Мн:Д. М. 
Н.</cell><cell></cell></row><row><cell>50</cell><cell></cell><cell></cell><cell></cell><cell>ої</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>маминої</cell><cell>Р.</cell><cell></cell></row><row><cell>51</cell><cell></cell><cell></cell><cell></cell><cell>ій</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>маминій</cell><cell>Д.М.</cell><cell></cell></row><row><cell>52</cell><cell></cell><cell></cell><cell></cell><cell>у</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>мамину</cell><cell>З.</cell><cell></cell></row><row><cell>53</cell><cell></cell><cell></cell><cell></cell><cell>ою</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>маминою</cell><cell>О.</cell><cell></cell></row><row><cell>54</cell><cell></cell><cell>с</cell><cell></cell><cell>е</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>мамине</cell><cell>Н.</cell><cell></cell></row><row><cell>55</cell><cell></cell><cell></cell><cell></cell><cell>і</cell><cell></cell><cell>мн</cell><cell></cell><cell></cell><cell>мамині</cell><cell>Н.</cell><cell></cell></row><row><cell>56</cell><cell></cell><cell></cell><cell></cell><cell>их</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>маминих</cell><cell>Р.</cell><cell></cell></row><row><cell>57</cell><cell></cell><cell></cell><cell></cell><cell>ими</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>маминими</cell><cell>О.</cell><cell></cell></row><row><cell>58</cell><cell></cell><cell>ч</cell><cell>ів</cell><cell>ового</cell><cell>ів</cell><cell>одн</cell><cell>possessives from</cell><cell>татів</cell><cell>татового</cell><cell>Р.</cell><cell></cell></row><row><cell>59</cell><cell></cell><cell></cell><cell></cell><cell>овому</cell><cell></cell><cell></cell><cell>nouns of the 2nd</cell><cell></cell><cell>татовому</cell><cell>Д.</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_14"><head></head><label></label><figDesc>Go to the end of the word 𝑤 𝑠 . Recognize the inflection 𝑓 1 𝑖 in 𝑤 𝑠 from all possible ones (Fig. 4.16; the longest one is chosen: for example, in 𝑤 𝑠 =текстова we choose the ending 𝑓 1 𝑖 =ова, not 𝑓 1 𝑖 =а) from the RE word type 𝑅 𝑎𝑑𝑗𝑒𝑐𝑡𝑖𝑣𝑎𝑙 , 𝑅 𝑛𝑜𝑢𝑛 or 𝑅 𝑣𝑒𝑟𝑏 , and remove the inflection 𝑓 1 𝑖 if it is present (Fig. 18). Find the removed inflection 𝑓 1 𝑖 in the tree of inflections 𝑇 𝑓𝑙𝑒𝑐𝑡𝑖𝑜𝑛 (the longest one is chosen), store it in 𝑓 𝑖 = 𝑓 and delete it from 𝑤 𝑠 . Stage 7. Check the obtained base 𝑤 𝑠 of the initial word 𝑤 𝑖 against the base dictionary 𝐷 𝑤 𝑠 of Ukrainian words. If there is no match, save &lt; 𝑤 𝑖 , 𝑤 𝑠 &gt; in the additional temporary intermediate dictionary 𝐷 &lt;𝑤 𝑖 ,𝑤 𝑠 &gt; for the moderator and proceed to stage 1; otherwise proceed to stage 4. Stage 8. Analyze the inflection and the presence/absence of letter alternation in the base/inflection of the words &lt; 𝑤 𝑖 , 𝑤 𝑠 &gt; and of the analogous word base in 𝐷 𝑤 𝑠 according to the relevant MA RE-rule, to identify additional features of the analyzed word 𝑤 𝑖 . Stage 9. Add the identified linguistic features of the recognized part of speech to the tag of the word 𝑤 𝑖 of type 𝑚 𝑎𝑑𝑗𝑒𝑐𝑡𝑖𝑣𝑎𝑙</figDesc><table><row><cell cols="3">Stage 6. Check the contents of the subtree 𝑇 𝑓𝑙𝑒𝑐𝑡𝑖𝑜𝑛 𝑓 1</cell><cell>with the existing word ending 𝑓 2 𝑖 (𝑓 = 𝑓 2 𝑖 + 𝑓 1 𝑖 ). If 𝑤 𝑠</cell></row><row><cell cols="3">ends in 𝑓 2 𝑖 and has a counterpart in 𝑇 𝑓𝑙𝑒𝑐𝑡𝑖𝑜𝑛 𝑓 1</cell></row><row><cell>𝑤 𝑖</cell><cell>, 𝑚 𝑛𝑜𝑢𝑛 𝑤 𝑖</cell><cell cols="2">or 𝑚 𝑣𝑒𝑟𝑏 𝑤 𝑖 respectively. Save the results in the corresponding</cell></row><row><cell cols="3">dictionary 𝐷 𝑤 𝑖 of the analyzed text.</cell></row><row><cell cols="4">Stage 4. Preserve the inflection 𝑓 1 𝑖 in the tag of the word 𝑤 𝑖 . Stage 5. Mark 𝑤 𝑖 as type 𝑚 𝑎𝑑𝑗𝑒𝑐𝑡𝑖𝑣𝑎𝑙 𝑤 𝑖 , 𝑚 𝑛𝑜𝑢𝑛 𝑤 𝑖 or 𝑚 𝑣𝑒𝑟𝑏 𝑤 𝑖 respectively.</cell></row></table><note>4.1. Modified Porter stemmer algorithm. Stage 1. Identify the next token as the word 𝑤 𝑖 (𝑤 𝑠 = 𝑤 𝑖 ). Stage 2. Check against the dictionary of stop words 𝐷 𝑤 𝑠𝑤 whether 𝑤 𝑠 is a service word. If yes, then 𝑖 = 𝑖 + 1 and go to stage 1, otherwise go to stage 3. Stage 3.</note></figure>
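The stages of the modified Porter stemmer can be sketched as a dictionary-driven longest-suffix stripper. The stop-word list, the inflection inventory standing in for 𝑇 𝑓𝑙𝑒𝑐𝑡𝑖𝑜𝑛 and the stem dictionary standing in for 𝐷 𝑤 𝑠 below are toy assumptions, not the paper's actual resources:

```python
# Toy stand-ins for the paper's dictionaries (assumptions for illustration only)
STOP_WORDS = {"але", "для", "в"}                              # service words
FLECTIONS = ["ового", "овому", "ова", "ої", "ою", "у", "а"]   # fragment of the inflection tree
STEM_DICT = {"текст", "контент"}                              # base dictionary of Ukrainian stems

def stem(word):
    w = word.lower()
    if w in STOP_WORDS:                        # Stage 2: skip service words
        return None
    # Stage 3: recognize the LONGEST matching inflection (текстова -> ова, not а)
    for f in sorted(FLECTIONS, key=len, reverse=True):
        if w.endswith(f) and len(w) > len(f):
            base = w[: -len(f)]
            # Stage 7: accept the base only if it is in the base dictionary;
            # otherwise the pair would be deferred to the moderator's dictionary
            if base in STEM_DICT:
                return (base, f)
    return (w, "")
```

For example, `stem("текстова")` strips the longest ending ова rather than а, exactly as the figure prescribes.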
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_15"><head></head><label></label><figDesc>Step 4.2. Calculate the number 𝑛 𝑙 of occurrences of each pair of characters/strings (𝑠 𝑘 𝑥 , 𝑠 𝑗 𝑥 ) as word stems in the input text when {𝑠 𝑘 𝑥 ∈ 𝐷 𝑥 , 𝑠 𝑗 𝑥 ∈ 𝐷 𝑙 } or {𝑠 𝑘 𝑥 ∈ 𝐷 𝑙 , 𝑠 𝑗 𝑥 ∈ 𝐷 𝑥 }, where the two stand next to each other separated by a dash (compound words), a period (date), a comma (real number) and/or a space, or a combination of these, but not by punctuation marks, digits or other special characters. Step 4.3. Form the alphabetic-frequency dictionary 𝐷′ 𝑥 based on (𝑠 𝑘 𝑥 , 𝑠 𝑗 𝑥 ) and determine the number of unique lexemes in it: ℎ = |𝐷′ 𝑥 |. Step 4.4. Find 𝑛 𝑙 = 𝑚𝑎𝑥 for the most frequent pair 𝑎 𝑖 = (𝑠 𝑘 𝑥 , 𝑠 𝑗 𝑥 ) in 𝐷′ 𝑥 , where (𝑠 𝑘 𝑥 , 𝑠 𝑗 𝑥 ) ∈ 𝐷′ 𝑥 and {𝑠 𝑘 𝑥 ∈ 𝐷 𝑥 , 𝑠 𝑗 𝑥 ∈ 𝐷 𝑙 } or {𝑠 𝑘 𝑥 ∈ 𝐷 𝑙 , 𝑠 𝑗 𝑥 ∈ 𝐷 𝑥 }. Step 4.7. Calculate the number of occurrences in the input text of 𝑏 𝑖 , and of 𝑠 𝑘 𝑥 and 𝑠 𝑗 𝑥 (at 𝑠 𝑘 𝑥 ∈ 𝐷 𝑙 and/or 𝑠 𝑗 𝑥 ∈ 𝐷 𝑙 respectively) when they are used separately (not next to each other). Step 4.8. Include in 𝐷 𝑙 the value 𝑏 𝑖 and its frequency of occurrence; overwrite the frequency values in 𝐷 𝑙 for 𝑠 𝑘 𝑥 and 𝑠 𝑗 𝑥 at 𝑠 𝑘 𝑥 ∈ 𝐷 𝑙 and/or 𝑠 𝑗 𝑥 ∈ 𝐷 𝑙 respectively.</figDesc><table><row><cell>Step 4.5. Replace 𝑎 𝑖 with a new combined/merged character/string 𝑏 𝑖 = 𝑠 𝑘 𝑥 𝑠 𝑗 𝑥 .</cell></row><row><cell>Step 4.6. Remove from 𝐷′ 𝑥 the value 𝑠 𝑘 𝑥 𝑠 𝑗 𝑥 and from 𝐷 𝑥 the values 𝑠 𝑘 𝑥 or 𝑠 𝑗 𝑥 respectively.</cell></row></table></figure>
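Steps 4.2-4.6 amount to counting adjacent stem pairs and merging the most frequent one into a single lexeme 𝑏 𝑖 = 𝑠 𝑘 𝑥 𝑠 𝑗 𝑥 . A minimal sketch on a toy token sequence (the corpus, the separator handling and the dictionaries 𝐷 𝑥 , 𝐷 𝑙 are simplified assumptions):

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Steps 4.2-4.3: frequency dictionary D'_x of adjacent pairs
    pairs = Counter(zip(tokens, tokens[1:]))
    # Step 4.4: the pair a_i with the maximal count n_l
    return pairs.most_common(1)[0]

def merge_pair(tokens, pair):
    # Step 4.5: replace each occurrence of (s_k, s_j) with the merged string b_i
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ["контент", "аналіз", "потік", "контент", "аналіз", "контент"]
pair, n = most_frequent_pair(tokens)
tokens2 = merge_pair(tokens, pair)
```

Here the pair (контент, аналіз) occurs twice, so it is merged wherever it appears while the standalone occurrence of контент is left intact, as step 4.7 requires.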
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_16"><head></head><label></label><figDesc>...   </figDesc><table><row><cell>la</cell><cell>guerre,</cell><cell cols="2">dont</cell><cell>la</cell><cell cols="2">France</cell><cell cols="3">portait</cell><cell>encore</cell><cell>les</cell><cell>blessures...</cell></row><row><cell>Hungarian. Azt</cell><cell cols="2">hisszem,</cell><cell cols="2">hogy</cell><cell cols="5">késedelmemmel</cell><cell>sikerült</cell><cell>bebizonyítani.</cell></row><row><cell cols="3">Serbo-Croatian. Regulacija</cell><cell cols="3">procesa</cell><cell cols="2">jedan</cell><cell>je</cell><cell>od</cell><cell>najstarjih</cell><cell>oblika</cell><cell>regulacije.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_18"><head></head><label></label><figDesc>Class diagrams for the Simple sentence, Sentence members and Circumstance type hierarchies. Simple sentence signs: uncommon/common; uncomplicated/complicated (with uncomplicated parts of the sentence, with separated parts of the sentence, with appeals, with built-in components, with embedded components); noun/verb. Figure 27: Sentence members — main sentence members (subject: simple, composite; predicate: simple, composite) and secondary sentence members (attribute: coordinated, uncoordinated; object: direct, indirect; adverbial modifier). Figure 28: Adverbial modifier — of purpose, of time, of manner, of place, of cause, of condition, of concession. Figure 29: The composite sentence — conjunctive sentences (the compound sentence, the complex sentence with attributive, subject or adverbial clauses, or with several subjects), unconjunctive sentences (with homogeneous or heterogeneous members of the sentence), and complex syntactic constructions (with coordinated and subordinate connection, with conjunctive and unconjunctive connection).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_19"><head></head><label></label><figDesc>To analyze a sequence of N word bases x_1 x_2 … x_n (written x_1^n, with x_1^{n-1} = x_1 x_2 … x_{n-1}), where A_1 = x_1, A_2 = x_2, A_3 = x_3, …, A_n = x_n, calculate: P(x_1 x_2 … x_n) = P(x_1^n) = P(x_1) P(x_2|x_1) P(x_3|x_1^2) … P(x_n|x_1^{n-1}) = ∏_{i=1}^{n} P(x_i|x_1^{i-1}) (40). The accompanying chart compares the relative frequencies of the most common letters (о, а, н, и, в, т, е, р, с) in the benchmark text with those in Excerpts 1 and 2.</figDesc></figure>
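Formula (40) factors the joint probability of a stem sequence by the chain rule; in practice the history is truncated. A minimal sketch, assuming a bigram approximation P(x_i | x_1^{i-1}) ≈ P(x_i | x_{i-1}) with add-one smoothing (the corpus and the smoothing choice are illustrative, not the paper's):

```python
from collections import Counter

def bigram_chain_probability(corpus_stems, sequence):
    """Approximate P(x_1 … x_n) = P(x_1) * prod P(x_i | x_{i-1})
    with bigram counts from the corpus and add-one smoothing."""
    unigrams = Counter(corpus_stems)
    bigrams = Counter(zip(corpus_stems, corpus_stems[1:]))
    vocab = len(unigrams)
    p = unigrams[sequence[0]] / len(corpus_stems)      # P(x_1)
    for prev, cur in zip(sequence, sequence[1:]):
        # smoothed conditional P(cur | prev)
        p *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
    return p
```

Longer histories (trigram and beyond) follow the same pattern with larger count tables and correspondingly sparser statistics.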
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_20"><head>Stage 4.</head><label>4</label><figDesc>Conduct a statistical analysis of the occurrence of word bases and of sequences of query word bases in the analyzed text. For example, for the search phrase Q_j:</figDesc><table><row><cell>Query word bases of Q_j</cell><cell>y_j1 метод</cell><cell>y_j2 та</cell><cell>y_j3 засіб</cell><cell>y_j4 опрац</cell><cell>y_j5 інформ</cell><cell>y_j6 ресурс</cell><cell>y_j7 систем</cell><cell>y_j8 електрон</cell><cell>y_j9 контент</cell><cell>y_j10 комерц</cell></row><row><cell>Occurrences in the text</cell><cell>58</cell><cell>190</cell><cell>25</cell><cell>62</cell><cell>122</cell><cell>83</cell><cell>170</cell><cell>89</cell><cell>408</cell><cell>300</cell></row><row><cell>Bases of words of the analyzed text</cell><cell>x_i1</cell><cell>x_i2</cell><cell>x_i3</cell><cell>x_i4</cell><cell>x_i5</cell><cell>x_i6</cell><cell>x_i7</cell><cell>x_i8</cell><cell>x_i9</cell><cell>x_i10</cell></row></table></figure>
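The occurrence statistics of Stage 4 can be sketched as prefix matching of query word bases against the tokenized text; the prefix test below is a crude stand-in for proper Ukrainian stemming, used here only to make the counting concrete:

```python
import re

def stem_occurrences(text, query_stems):
    """For each query word base, count how many tokens of the
    analyzed text start with that base (crude stemming stand-in)."""
    tokens = re.findall(r"\w+", text.lower())   # \w matches Cyrillic too
    return {stem: sum(tok.startswith(stem) for tok in tokens)
            for stem in query_stems}

counts = stem_occurrences(
    "Інформаційні ресурси систем електронного контенту",
    ["інформ", "ресурс", "контент"])
print(counts)   # each base occurs once in this sample sentence
```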
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>language texts), the modified one is adapted specifically to the Ukrainian language and gives an accurate result in 85-93% of cases, depending on the quality, style and genre of the text and, accordingly, on the content of the CLS dictionaries. The minimum edit distance between strings of Ukrainian text is defined as the minimum number of operations needed to transform one string into the other.</p></div>
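The minimum edit distance described above is the classic Levenshtein distance; a minimal dynamic-programming sketch, with insertion, deletion and substitution each costing one operation:

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to transform string a into string b (Levenshtein)."""
    prev = list(range(len(b) + 1))          # distances from "" to b[:j]
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

print(edit_distance("кіт", "кит"))   # 1: a single substitution
```

Python strings are Unicode, so the same code handles Cyrillic input without any special casing.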
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Applied text analysis with Python: Enabling language-aware data products with machine learning</title>
		<author>
			<persName><forename type="first">B</forename><surname>Bengfort</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bilbro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ojeda</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2018">2018</date>
			<publisher>O&apos;Reilly Media, Inc</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Deep Learning Architectures for Sequence Processing</title>
		<author>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Martin</surname></persName>
		</author>
		<ptr target="https://web.stanford.edu/~jurafsky/slp3/9.pdf" />
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Naive Bayes and Sentiment Classification</title>
		<author>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Martin</surname></persName>
		</author>
		<ptr target="https://web.stanford.edu/~jurafsky/slp3/4.pdf" />
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Logistic Regression</title>
		<author>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</author>
		<ptr target="https://web.stanford.edu/~jurafsky/slp3/5.pdf" />
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Martin</surname></persName>
		</author>
		<ptr target="https://web.stanford.edu/~jurafsky/slp3/7.pdf" />
		<title level="m">Neural Networks and Neural Language Models</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Modern State and Prospects of Information Technologies Development for Natural Language Content Processing</title>
		<author>
			<persName><forename type="first">V</forename><surname>Vysotska</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">CEUR Workshop Proceedings</title>
		<imprint>
			<biblScope unit="volume">3368</biblScope>
			<biblScope unit="page" from="198" to="234" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">The text classification based on Big Data analysis for keyword definition using stemming</title>
		<author>
			<persName><forename type="first">A</forename><surname>Berko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Matseliukh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ivaniv</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Chyrun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Schuchmann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE 16th International conference on computer science and information technologies, CSIT-2021</title>
				<meeting>the IEEE 16th International conference on computer science and information technologies, CSIT-2021<address><addrLine>Lviv, Ukraine</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021-09-25">22-25 September 2021</date>
			<biblScope unit="page" from="184" to="188" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">The method for detecting plagiarism in a collection of documents</title>
		<author>
			<persName><forename type="first">N</forename><surname>Shakhovska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Shvorob</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on Computer Sciences and Information Technologies</title>
				<meeting>the International Conference on Computer Sciences and Information Technologies</meeting>
		<imprint>
			<publisher>CSIT</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="142" to="145" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Intellectual Analysis System Project for Ukrainian-language Artistic Works to Determine the Text Authorship Attribution Probability</title>
		<author>
			<persName><forename type="first">R</forename><surname>Romanchuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vysotska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Andrunyk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Chyrun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chyrun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Brodyak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 18th IEEE International Conference on Computer Science and Information Technologies, CSIT 2023</title>
				<meeting>the 18th IEEE International Conference on Computer Science and Information Technologies, CSIT 2023<address><addrLine>Lviv, Ukraine</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2023-10-19">19-21 October 2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Identification and Correction of Grammatical Errors in Ukrainian Texts Based on Machine Learning Technology</title>
		<author>
			<persName><forename type="first">V</forename><surname>Lytvyn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Pukach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vysotska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vovk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kholodna</surname></persName>
		</author>
		<idno type="DOI">10.3390/math11040904</idno>
		<ptr target="https://doi.org/10.3390/math11040904" />
	</analytic>
	<monogr>
		<title level="j">Mathematics</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page">904</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">An approach for a next-word prediction for Ukrainian language</title>
		<author>
			<persName><forename type="first">K</forename><surname>Shakhovska</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Wireless Communications and Mobile Computing</title>
		<imprint>
			<biblScope unit="page" from="1" to="9" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Ukrainian Language Chatbot for Sentiment Analysis and User Interests Recognition based on Data Mining</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kubinska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Holoshchuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Holoshchuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Chyrun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">CEUR Workshop Proceedings</title>
		<imprint>
			<biblScope unit="volume">3171</biblScope>
			<biblScope unit="page" from="315" to="327" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Development of the Speech-to-Text Chatbot Interface Based on Google API</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">N</forename><surname>Shakhovska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Basystiuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Shakhovska</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">CEUR Workshop Proceedings</title>
		<imprint>
			<biblScope unit="volume">2386</biblScope>
			<biblScope unit="page" from="212" to="221" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Information System for Recommendation List Formation of Clothes Style Image Selection According to User&apos;s Needs Based on NLP and Chatbots</title>
		<author>
			<persName><forename type="first">V</forename><surname>Husak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Lozynska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Karpov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Peleshchak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chyrun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vysotskyi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">CEUR workshop proceedings</title>
		<imprint>
			<biblScope unit="volume">2604</biblScope>
			<biblScope unit="page" from="788" to="818" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Ukrainian Participles Formation by the Generative Grammars Use</title>
		<author>
			<persName><forename type="first">V</forename><surname>Vysotska</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">CEUR workshop proceedings</title>
		<imprint>
			<biblScope unit="volume">2604</biblScope>
			<biblScope unit="page" from="407" to="427" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">A comparative analysis for English and Ukrainian texts processing based on semantics and syntax approach</title>
		<author>
			<persName><forename type="first">V</forename><surname>Vysotska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Holoshchuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Holoshchuk</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">2870</biblScope>
			<biblScope unit="page" from="311" to="356" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Precision automated phonetic analysis of speech signals for information technology of text-dependent authentication of a person by voice</title>
		<author>
			<persName><forename type="first">O</forename><surname>Bisikalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Boivan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Khairova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Kovtun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kovtun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">2853</biblScope>
			<biblScope unit="page" from="276" to="288" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Linguistic analysis method of Ukrainian commercial textual content for data mining</title>
		<author>
			<persName><forename type="first">O</forename><surname>Bisikalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vysotska</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">CEUR Workshop Proceedings</title>
		<imprint>
			<biblScope unit="volume">2608</biblScope>
			<biblScope unit="page" from="224" to="244" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">The chi-square test and data clustering combined for author identification</title>
		<author>
			<persName><forename type="first">I</forename><surname>Khomytska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Bazylevych</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Teslyuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Karamysheva</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE XVIIIth Scientific and Technical Conference on Computer Science and Information Technologies, CSIT 2023</title>
				<meeting>the IEEE XVIIIth Scientific and Technical Conference on Computer Science and Information Technologies, CSIT 2023<address><addrLine>Lviv, Ukraine</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023-10-21">19-21 October 2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">The Multifactor Method Applied for Authorship Attribution on the Phonological Level</title>
		<author>
			<persName><forename type="first">I</forename><surname>Khomytska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Teslyuk</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">CEUR workshop proceedings</title>
		<imprint>
			<biblScope unit="volume">2604</biblScope>
			<biblScope unit="page" from="189" to="198" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Development of methods, models, and means for the author attribution of a text</title>
		<author>
			<persName><forename type="first">I</forename><surname>Khomytska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Teslyuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Holovatyy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Morushko</surname></persName>
		</author>
		<idno type="DOI">10.15587/1729-4061.2018.132052</idno>
	</analytic>
	<monogr>
		<title level="j">Eastern-European Journal of Enterprise Technologies</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="41" to="46" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
