=Paper=
{{Paper
|id=Vol-3723/paper9
|storemode=property
|title=Computer linguistic system architecture for Ukrainian language content processing based on machine learning
|pdfUrl=https://ceur-ws.org/Vol-3723/paper9.pdf
|volume=Vol-3723
|authors=Victoria Vysotska
|dblpUrl=https://dblp.org/rec/conf/modast/Vysotska24
}}
==Computer linguistic system architecture for Ukrainian language content processing based on machine learning==
Victoria Vysotska
Lviv Polytechnic National University, Stepan Bandera 12, 79013 Lviv, Ukraine
Abstract
The general architecture of computer linguistic systems (CLS) is developed based on the main
processes of processing information resources such as integration, maintenance and content
management, as well as using methods of intellectual and linguistic analysis of text flow using
machine learning technology. The information technology (IT) of intellectual analysis of the text flow based on the processing of information resources has been improved, which made it possible to adapt the generally typical structure of content integration, management and support modules to solve various natural language processing (NLP) problems and increase the efficiency of CLS functioning by 6-9%. The main NLP methods based on regular expression (RE) matching with
patterns in grapheme and morphological analyses of Ukrainian-language texts are described. NLP
methods based on pattern-matching regular expressions have been improved, which made it
possible to adapt methods of text tokenization and normalization by cascades of simple
substitutions of regular expressions and finite state machines. The main valid operations of
regular expressions are defined as union and disjunction of symbols/strings/expressions,
number and precedence operators, as well as anchors as special symbols for identifying the
presence/absence of symbols in RE. The main stages of tokenization and normalization of the
Ukrainian text by cascades of simple substitutions of regular expressions and finite state
machines are defined. The morphological analysis (MA) method for Ukrainian-language text, based on word segmentation and normalization, sentence segmentation and a modified Porter stemming algorithm, was improved as an effective means of identifying lemma affixes so that the analyzed word can be marked, which made it possible to increase the accuracy of keyword searches by 9%. Algorithms for word segmentation and normalization, sentence segmentation, and the modified Porter stemming are implemented and described as an effective way of identifying lemma affixes for marking the analyzed word. Unlike the classic
Porter algorithm (it does not have high accuracy even for English-language texts), the modified
one is adapted specifically for the Ukrainian language and gives an accurate result in 85-93% of
cases, depending on the quality, style, genre of the text and, accordingly, the content of CLS
dictionaries. The minimum edit distance algorithm for strings of Ukrainian text is described as the minimum number of edit operations necessary to transform one string into another.
Keywords
natural language processing, Ukrainian text, NLP, computer linguistics, machine learning
MoDaST-2024: 6th International Workshop on Modern Data Science Technologies, May, 31 - June, 1, 2024, Lviv-
Shatsk, Ukraine
victoria.a.vysotska@lpnu.ua (V. Vysotska)
0000-0001-6417-3689 (V. Vysotska)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
1. Introduction
Let's consider the architectural patterns of CLS design based on supporting the life cycle of
the ML model for monitoring/managing the pipeline (information flow) of content (Fig. 1)
[1]. The standard content processing pipeline implements an iterative process consisting of
the stages of creating and deploying the machine learning (ML) process [2-5]. The process of monitoring/managing the content pipeline should additionally include stages that improve the quality and efficiency of NLP problem solving [6-9]. At the construction
stage, raw integrated content is filtered from noise/duplicates and formatted into a suitable
form for further processing/management, conducting experiments on it, transferring it to
ML models for classification/clustering/prediction/evaluation, etc. [10-12]. At the stage of
content analysis and support, the content is deployed to determine the best ML model for
making assessments/forecasts that directly affect the regular user and target audience.
[Figure 1 depicts two process groups connected through the CLS website, data storage and user requests: processes of monitoring, development and management of content (interaction, formatting/filtering, NLP, machine learning, accumulation of content; integration, transformation, normalization, classification, analysis of features) and content analysis and support processes (presentation, interpretation, prognostication, deployment; feedback, API, assessment, modeling), transforming input content into relevant content.]
Figure 1: CLS content pipeline monitoring/management scheme
2. Related works
Based on feedback and model output, the target audience interacts with the CLS, which facilitates the adaptation of the selected learning model. Five stages of related processes define the basic architectural principles for building a typical CLS. The processes of monitoring, processing and managing content are interaction, formatting/filtering, NLP, machine learning [13-15] and data accumulation in data storage (DS). For content analysis and support
processes, respectively, these are feature analysis, deployment, prediction, interpretation,
and content/result presentation. At the interaction stage, a set of rules for integrating
content from multiple reliable sources at certain time intervals is necessary. Also, in
parallel, a set of rules for checking the data entered by the CLS user is required as a
preliminary stage for the formatting/filtering stage according to a collection of rules pre-
set by the moderator and content from DS [16-21]. The next stage of NLP is a preparatory
intermediate stage for machine learning and data accumulation. The machine learning stage
can take many forms, from SQL queries to various software modules. The support process
is easier to implement than the management stage, provided that the latter is implemented
correctly, especially during NLP analysis, in which additional lexical resources and artefacts
(dictionaries, translators, regular expressions, etc.) are created, on which the effectiveness
of CLS functioning directly depends (Fig. 2) [1-3].
[Figure 2 depicts the processes of monitoring, development and management of content (interaction, formatting/filtering, linguistic processing, content marking, model training; integration, transformation, normalization, vectorization, calculations) over the storages of the CLS (content archive, text collection, marked corpus, lexical resources, model repository), together with content collection, content selection, corpus analysis, prognostication and modeling, connected through the CLS website.]
Figure 2: Scheme of processing the CLS content pipeline
The process of transition from raw text to a developed machine-learning model consists
of a sequence of additional content transformations. First, the input textual content is
transformed into an input corpus as a collection of texts, accumulated and stored in the DS.
The incoming content is further grouped, filtered, formatted, linguistically processed,
marked, normalized and converted into vectors for further processing. In the final
transformation, the model/models (Fig. 3) are trained on the vector corpus, and a
generalized presentation of the original content is created for further use in solving a
specific NLP problem [1-6]. An ML-based CLS architecture with accelerated or even
automatic model generation should support and optimize content transformation with ease
of testing and tuning. The process of generating an optimal ML model is a complex cyclic
algorithm, the main stages of which are the formation of a collection of features, model
selection, and hyperparameter adjustment. After each iteration, the results are evaluated to
determine the optimal collection of features, models, and parameters for solving a specific
NLP problem with the appropriate input data [1-6].
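The cyclic generation of an optimal model form described above (a collection of features combined with a model and hyperparameter choice, scored after each iteration) can be sketched in a few lines of Python. The scoring stub, candidate sets and toy validation corpus below are hypothetical placeholders, not the paper's implementation:

```python
from itertools import product

def evaluate(feature_set, alpha, validation):
    """Toy scoring stub: fraction of validation documents containing
    any selected feature, weighted by the hyperparameter alpha."""
    hits = sum(1 for doc in validation if feature_set & set(doc.split()))
    return alpha * hits / len(validation)

def select_model_form(feature_sets, alphas, validation):
    """Enumerate candidate model forms and keep the best-scoring one."""
    best_form, best_score = None, float("-inf")
    for feats, alpha in product(feature_sets, alphas):
        score = evaluate(feats, alpha, validation)
        if score > best_score:
            best_form, best_score = (feats, alpha), score
    return best_form, best_score

best_form, best_score = select_model_form(
    [{"контент"}, {"контент", "пошук"}],   # candidate feature collections
    [0.5, 1.0],                            # candidate hyperparameter values
    ["аналіз контент", "пошук даних"],     # toy validation corpus
)
```

In a real CLS the stub would be replaced by cross-validated training of each candidate model, but the loop structure stays the same.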
[Figure 3 depicts the process of generating an optimal machine learning model from processed content: monitoring, processing, generation of the ML model, data management and learning the ML model; data transformation, forming the features set, analysis of features and parameters, testing of the ML model; choice of the ML model, adjustment of parameters, model control, model settings and choosing the optimal model, over the content archive, set of content, model repository, content repository and CLS cloud storage (optimization of the ML model).]
Figure 3: The process of forming and optimizing a machine-learning model
According to [1-6], there are three main notions in statistical ML: a model class, a model form, and a trained model. The model class defines the relationship between the variables and the formed goal (for example, a linear model, a recurrent neural network, etc.). A model form is a specific configuration of a model: a collection of features, an algorithm, or a collection of hyperparameters. A trained model is a model form that has been trained on a specific data set and adapted to make predictions. CLSs consist of many trained models built during model selection, which creates and evaluates candidate model forms.
3. Materials and Methods
Any natural language text is initially a collection of non-random unstructured data as input
content to the CLS. But usually the text is formed according to certain linguistic rules, which make these data understandable. The purpose of the integration module is to transform this collection of non-random unstructured data into structured/semi-structured fields (records) or markup for convenient interpretation by CLS modules. ML methods (for example, supervised learning) allow statistical models to be trained (and retrained) as the language changes during NLP processes. By generating ML models on context-sensitive corpora, CLSs can apply narrow semantic senses to improve accuracy without the need for additional interpretation.
Formally, the ML model of the Ukrainian language has to supplement the input
incomplete phrase with missing words/phrases that are most likely to complete the content
of the statement according to the previous text (context analysis for further
guessing/predicting the meaning). Usually, a competently and correctly constructed text is
predictable based on its coherence. Calculation of the entropy (degree of
uncertainty/unpredictability) of the probability distribution of the model of the Ukrainian
language measures the degree of predictability of the text. Thus, unfinished phrases Київ -
столиця... [Kyyiv - stolytsya...] (Kyiv - the capital...) and сонце сходить на... [sontse
skhodytʹ na...] (the sun rises on...) have low entropy, and statistical language models are highly
likely to guess the continuation of України [Ukrayiny] (Ukraine) and сході [skhodi] (the
east), respectively. And expressions with high entropy like ми йдемо в гості до... [my
ydemo v hosti do...] (we go to visit...) and я зустрів сьогодні... [ya zustriv sʹohodni...] (I met
today...) offer many continuation options (parents, friends, neighbours, colleagues are
equally likely without analyzing the previous context). Language models can draw inferences and identify connections between lexemes. Formally, the model uses context to narrow the decision space to a small set of options. The application of statistical ML methods (supervised and unsupervised) allows language models to be generated that extract meaning from texts and support their predictability. First, the characteristic features of the content are identified to predict the target. Textual data provides many opportunities to extract surface features based on parsing and segmenting sentences/utterances/phrases (e.g. bag of words), as well as extracted morphological/syntactic/semantic features. Special attention is paid to linguistic, contextual and structural features.
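The entropy measure of predictability discussed above can be illustrated with a short Python sketch; the continuation counts are invented for illustration and are not from the paper:

```python
import math
from collections import Counter

def entropy(continuations):
    """Shannon entropy (in bits) of the empirical next-word distribution."""
    counts = Counter(continuations)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Almost deterministic continuation, as in "Київ - столиця ...":
low = entropy(["України"] * 99 + ["міст"])

# Near-uniform continuations, as in "ми йдемо в гості до ...":
high = entropy(["батьків", "друзів", "сусідів", "колег"])
```

The near-deterministic phrase yields entropy close to zero, while four equally likely continuations yield exactly 2 bits.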
1. An example of the analysis of a linguistic feature can be the identification of the
predominant gender in a fragment of the news text (the role of gender) in different contexts
[1] to identify gender biases regarding the subject of publications. In the gender analysis of
the text, words in the feminine and masculine gender are used to form a frequency
assessment of gender characteristics, i.e.
$Sing_{GS} = \langle X_{Sentence}, W_{Male}, W_{Female}, W_{Unknown}, W_{Both}, f_{gendetize} \rangle$,  (1)
where 𝑋𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 is analyzed sentence/expression; 𝑊𝑀𝑎𝑙𝑒 is a set of words with the sign of a
man; 𝑊𝐹𝑒𝑚𝑎𝑙𝑒 is a set of words with the attribute woman; 𝑊𝑈𝑛𝑘𝑛𝑜𝑤𝑛 is a set of words with
an unknown gender sign; 𝑊𝐵𝑜𝑡ℎ is a set of words with the sign of a man and a woman;
𝑓𝑔𝑒𝑛𝑑𝑒𝑡𝑖𝑧𝑒 is the operator for identifying the gender class of a sentence.
$Sing_{GS} = f_{gendetize}(X_{Sentence}, W_{Male}, W_{Female}, W_{Unknown}, W_{Both})$,  (2)
$Sing_{GS} = \begin{cases} male, & N_{Male} > 0,\ N_{Female} = 0 \\ female, & N_{Male} = 0,\ N_{Female} > 0 \\ both, & N_{Male} > 0,\ N_{Female} > 0 \\ unknown, & \text{otherwise} \end{cases}$
where 𝑁𝑀𝑎𝑙𝑒 is the number of words with the sign of a man in the analyzed sentence
𝑋𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 ; 𝑁𝐹𝑒𝑚𝑎𝑙𝑒 is the number of words with the sign female in the analyzed sentence
𝑋𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 .
It is also necessary to determine the frequency of words, gender signs and sentences in
the entire publication:
$Sing_{TS} = \langle X_{Text}, S_{NG}, N_{Sentence}, W_{NG}, f_{countgender} \rangle$,  (3)
$Sing_{TS} = f_{countgender}(N_{Sentence}, S_{NG}, W_{NG}, f_{gendetize}(X_{Text}, X_{Sentence}))$,
where 𝑋𝑇𝑒𝑥𝑡 is analyzed publication text; 𝑆𝑁𝐺 is a set of numbers of 𝑋𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 sentences of
the analyzed text 𝑋𝑇𝑒𝑥𝑡 marked by gender; 𝑁𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 is the number of sentences in the
analyzed text 𝑋𝑇𝑒𝑥𝑡 ; 𝑊𝑁𝐺 is the set of the number of words of each gender characteristic for
each marked sentence 𝑋𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 ; 𝑓𝑐𝑜𝑢𝑛𝑡𝑔𝑒𝑛𝑑𝑒𝑟 is an operator of identification and
classification/marking of all sentences of the analyzed text 𝑋𝑇𝑒𝑥𝑡 by gender based on
𝑓𝑔𝑒𝑛𝑑𝑒𝑡𝑖𝑧𝑒 .
$Sing_{TS} = \left[ \begin{matrix} S_{NG}[Sing_{GS}] \mathrel{+}= 1 \\ W_{NG}[Sing_{GS}] \mathrel{+}= len(X_{Sentence}) \end{matrix} \right]_{i=1}^{N_{Sentence}}$  (4)
For gender identification, it is necessary to parse the original text of publications with
the subsequent marking of sentences and words based on the NLTK library:
$Sing_{TP} = \langle X_{Text}, S_{Sentence}, N_{Sentence}, W_{Word}, N_{word}, f_{parsegender}, f_{pcent} \rangle$,  (5)
$Sing_{TP} = f_{pcent}(f_{parsegender}(N_{Sentence}, W_{Word}, N_{word}, f_{countgender}(S_{Sentence})))$,
$Sing_{TS} = \left[ \begin{matrix} pcent_k = (W_{NG\,k}/total) \cdot 100 \\ N_{Sentence\,k} = S_{NG}[Sing_{GS\,k}] \\ print(pcent_k, Sing_{GS\,k}, N_{Sentence\,k}) \end{matrix} \right]_{k=1}^{N_{Gender}}$
$total = \sum_{i=1}^{N_{Gender}} W_{NG\,i}, \quad S_{Sentence} = \bigcup_{i=1}^{N_S} X_{Sentence\,i}, \quad X_{Sentence\,i} = \bigcup_{j=1}^{N_{WS}} W_{Sentence\,ij}, \quad W_{Word} = \bigcup_{i=1}^{total} W_{Word\,i}$
where 𝑁𝑤𝑜𝑟𝑑 is the number of words in the analysed text 𝑋𝑇𝑒𝑥𝑡 ; 𝑁𝐺𝑒𝑛𝑑𝑒𝑟 is the number of
classifications by gender (in this particular case – 4); 𝑊𝑁𝐺 𝑘 is the number of words in
sentences of a certain gender sign; 𝑆𝑁𝐺 is the set of the number of sentences in the analyzed
text of a certain gender sign; 𝑝𝑐𝑒𝑛𝑡𝑘 is the percentage of publication text belonging to a
certain gender sign; 𝑆𝑖𝑛𝑔𝐺𝑆 𝑘 is a specific gender sign; 𝑁𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 𝑘 is the number of sentences
in the analyzed text of a specific gender sign; 𝑆𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒 is a set of sentences identified by
parsing in the analysed text 𝑋𝑇𝑒𝑥𝑡 ; 𝑊𝑊𝑜𝑟𝑑 is a collection of sets of words identified by
parsing in each sentence of the analyzed text 𝑋𝑇𝑒𝑥𝑡 ; 𝑊𝑊𝑜𝑟𝑑 is the set of all words of the text
𝑋𝑇𝑒𝑥𝑡 ; 𝑡𝑜𝑡𝑎𝑙 is the number of all words in the analysed text 𝑋𝑇𝑒𝑥𝑡 . Such a deterministic
mechanism demonstrates how the content/frequency of use of words/phrases (especially
stereotypical ones) affects the predictability of the content according to the previous
context (the gender sign is built directly into the Ukrainian language – every noun has a
gender). But linguistic features are not always decisive; for example, plural and tense forms are used to analyse language/processes/actions/events over time.
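A minimal Python sketch of formulas (2) and (4) follows; the marker word sets are tiny hypothetical stand-ins for the CLS dictionaries, and tokenization is reduced to whitespace splitting:

```python
W_MALE = {"він", "чоловік", "хлопець"}     # assumed male-marker words
W_FEMALE = {"вона", "жінка", "дівчина"}    # assumed female-marker words

def gendertize(sentence):
    """Classify one sentence by counting gender-marker words (formula 2)."""
    words = sentence.lower().split()
    n_male = sum(w in W_MALE for w in words)
    n_female = sum(w in W_FEMALE for w in words)
    if n_male > 0 and n_female == 0:
        return "male"
    if n_male == 0 and n_female > 0:
        return "female"
    if n_male > 0 and n_female > 0:
        return "both"
    return "unknown"

def count_gender(sentences):
    """Accumulate per-class sentence and word counts (formula 4)."""
    s_ng = {"male": 0, "female": 0, "both": 0, "unknown": 0}
    w_ng = dict(s_ng)
    for s in sentences:
        label = gendertize(s)
        s_ng[label] += 1               # S_NG[Sing_GS] += 1
        w_ng[label] += len(s.split())  # W_NG[Sing_GS] += len(X_Sentence)
    return s_ng, w_ng

s_ng, w_ng = count_gender(["він йде додому", "вона читає", "сонце сходить"])
```

The percentage per gender class from formula (5) then follows directly from `w_ng` divided by the total word count.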
2. An example of the analysis of a contextual feature can be the analysis of moods or
sentiment analysis of a text (emotional colouring when discussing a specific topic by a
relevant group of people). Usually used in complex analysis of feedback from users, for
example, e-commerce, the polarity of messages or reactions to events/phenomena, in social
networks or in political/economic discussions/forums, etc. In superficial sentiment
analysis, a mechanism analogous to the gender classification above (positive/negative/neutral coloured words) is usually used. For example, positive – чудовий [chudovyy] (wonderful), прекрасний [prekrasnyy] (beautiful), правдивий [pravdyvyy] (true); negative – лінивий [linyvyy] (lazy), поганий [pohanyy] (bad), дратівливий [drativlyvyy] (annoying); and neutral – білий [bilyy] (white), сонячний [sonyachnyy] (sunny), космічний [kosmichnyy] (cosmic). But the
mood is not a feature of the language and depends on the meaning of the words/phrases
according to the surrounding context of the text, for example, the word кумедний
[kumednyy] (funny) has several interpretations of conveying the mood, in particular,
positive – смішний клоун [smishnyy kloun] (funny clown), negative – кумедний одяг
[kumednyy odyah] (funny clothes), and neutral – кумедний кіт [kumednyy kit] (funny cat)
or кумедна іграшка [kumedna ihrashka] (funny toy). The word гострий [hostryy] (sharp)
from the word перець [peretsʹ] (pepper) or ніж [nizh] (knife) has a positive meaning when
buying, but from the word біль [bilʹ] (pain) and ніж [nizh] (knife) in a criminal case, it has
a negative meaning. Also, negation turns the meaning of a positive text with positive words
into a negative one and vice versa, for example, ми дуже багато очікували від відпочинку
на морі сонячними гарними днями, але обіцяна курортна база відпочинку все
спаскудила [my duzhe bahato ochikuvaly vid vidpochynku na mori sonyachnymy harnymy
dnyamy, ale obitsyana kurortna baza vidpochynku vse spaskudyla] (we expected a lot from
a vacation at the sea on sunny, beautiful days, but the promised holiday resort spoiled
everything) (one negative word спаскудила [spaskudyla] (spoiled) all the previous positive
ones) or дощ, прохолода та вітер не стали перепонами гарно відпочити в чудовій
компанії [doshch, prokholoda ta viter ne staly pereponamy harno vidpochyty v chudoviy kompaniyi] (rain, coolness and the wind did not become an obstacle to a good rest in a
wonderful company). Only with machine learning is it possible in such cases to obtain the predictability of the text and reveal its emotional colouring from the context. An a priori deterministic/structural approach loses the flexibility of context and meaning, so most language models take into account the location of words in context, using ML methods for prediction. The main method of developing simple language models is the bag of words, based on the frequency of co-occurrence of words in a narrow, limited context (Fig. 4).
1) інтелектуальна інформаційна система → інтелект інформ систем
2) інтелектуальний інформаційний пошук → інтелект інформ пошук
3) опрацювання інформаційних ресурсів → опрацюв інформ ресурс
4) система електронної комерції → систем електр комерц
5) комп’ютерна лінгвістична система → комп’ютер лінгвіст систем
6) аналіз природної мови → аналіз природ мов
7) опрацювання природної мови → опрацюв природ мов
8) опрацювання текстового контенту → опрацюв текст контент
9) аналіз текстового контенту → аналіз текст контент
10) пошук текстового контенту → пошук текст контент
11) лінгвістичний аналіз контенту → лінгвіст аналіз контент
12) лінгвістичний аналіз тексту → лінгвіст аналіз текст
(lower-triangular co-occurrence counts; columns follow the same order as the rows: аналіз, електр, інтелект, інформ, комерц, комп’ютер, контент, лінгвіст, мов, опрацюв, пошук, природ, ресурс, систем, текст)
аналіз 0
електр 0 0
інтелект 0 0 0
інформ 0 0 2 0
комерц 0 1 0 0 0
комп’ютер 0 0 0 0 0 0
контент 2 0 0 0 0 0 0
лінгвіст 2 0 0 0 0 1 1 0
мов 1 0 0 0 0 0 0 0 0
опрацюв 0 0 0 1 0 0 1 0 1 0
пошук 0 0 1 1 0 0 1 0 0 0 0
природ 1 0 0 0 0 0 0 0 2 1 0 0
ресурс 0 0 0 1 0 0 0 0 0 1 0 0 0
систем 0 1 1 1 1 1 0 1 0 0 0 0 0 0
текст 2 0 0 0 0 0 3 1 0 1 1 0 0 0 0
Figure 4: Frequency matrix of co-occurrence of words
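Such a co-occurrence matrix over stemmed 3-gram phrases can be built with a short sketch; only the first three stemmed combinations from the list above are used, and pairs are treated as unordered, matching the symmetric matrix:

```python
from collections import defaultdict
from itertools import combinations

phrases = [
    "інтелект інформ систем",   # stemmed combination 1
    "інтелект інформ пошук",    # stemmed combination 2
    "опрацюв інформ ресурс",    # stemmed combination 3
]

def cooccurrence(stemmed_phrases):
    """Count unordered co-occurrences of stems within each phrase."""
    counts = defaultdict(int)
    for phrase in stemmed_phrases:
        for a, b in combinations(phrase.split(), 2):
            counts[frozenset((a, b))] += 1
    return dict(counts)

cooc = cooccurrence(phrases)
```

The pair (інтелект, інформ) receives count 2, matching the corresponding cell of the matrix in Figure 4.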
Such evaluation helps to determine the probable neighbourhood of words and their meaning from small fragments of text. Next, using statistical inference methods, word order can be predicted. This is quite simple for English texts, where words are hardly inflected. For Ukrainian-language texts it is better to use not a bag of words but a bag of word stems. For example, for the 12 word combinations as 3-grams (36 words), without taking declension into account we get a matrix of size 20×20, while analysing only the stems of words (collapsing declension, gender and person) gives 15×15. Moreover, for the Ukrainian language the location of stems in a 3-gram is usually not important and often has an unambiguous probability of compatibility in terms of content, for example, інформаційний
ресурс (інформ ресурс) [informatsiynyy resurs (inform resurs)] (information resource
(inform resource)) and ресурс інформації (ресурс інформ) [resurs informatsiyi (resurs
inform)] (information resource (inform resource)). The bag-of-words/stems model is also
extended by analyzing the co-occurrence of stable phrases and fragments of expressions
that are of great importance for identifying the meaning of the text. The expressions
зелений край скатертини (межа) [zelenyy kray skatertyny (mezha)] (green edge of the
tablecloth (border)) and зелений край батьківщини (місцевість) [zelenyy kray
batʹkivshchyny (mistsevistʹ)] (green edge of the homeland (locality)) in the form of a 3-
gram carry a different meaning. That is, there are several interpretations only for the word
edge (the boundary of an object, a piece, the end of an action/state, a special area, a place of
residence, an administrative-territorial unit). Statistical analysis of n-grams makes it possible to distinguish patterns of context. Language models based on the analysis of n-gram contexts require the ability to explore the relationship of the text to some target variable. The application of the analysis of linguistic and contextual features contributes to the formation of the general predictability of the text. However, their identification and further use require the ability to parse and identify the linguistic units of the language.
3. An example of the analysis of a structural feature can be the construction of an ontology for the implementation of an intelligent information system (IIS). Along with linguistic and contextual features, it is then necessary to identify and process high-level language units to define a vocabulary of operations for the text corpus. Different units of language are processed at different levels, and the correct implementation of NLP methods based on ML is important for fast and correct identification of the linguistic context (the structure of semantic relationships). Based on a typical pattern of utterances (a statement or simple phrase) of the form subject + verb + object + object attribute (subject, predicate, complement), ontologies are constructed that define specific relationships between entities. They make it possible to overcome the lack of a fixed word order in Ukrainian sentences when identifying their semantics. This approach is advisable for tasks where large volumes of text data must be processed continuously and long-term resource support for the project is available. Semantic
analysis consists not only in identifying the content of the text but also in generating data
structures to which logical reasoning can be applied. Thematic Meaning Representations
(TMR) are used to encode sentences in the form of predicate structures based on first-order
logic or lambda calculus (λ-calculus). Network/graph structures are used to encode
interactions of predicates of relevant text features. Then a traversal is implemented to
analyze the centrality of terms or subjects and the reasons for the relationships between
elements. Graph analysis is usually not a complete semantic analysis, but it helps to form part of important logical decisions or conclusions. Semantics, syntax and morphology allow linguistic meaning to be added to simple text strings and new meaningful text content to be generated. Nowadays, natural language is one of the most commonly used forms of content. Its analysis makes it possible to increase the usefulness of data applications and make them an integral part of everyday life. Scalable analysis and machine learning of text primarily require up-to-date knowledge and text corpora of the relevant subject area (SA).
For example, in the field of finance, CLS needs to identify financial terms, stock
abbreviations and company names. Therefore, documents in the SA corpus must contain
these entities. That is, the development of any CLS begins with obtaining textual data of the
appropriate type and forming a corpus with structural and contextual features of SA.
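The ontology-based representation discussed above can be sketched as a store of subject-predicate-object triples; the triples and the query helper below are illustrative assumptions, not the paper's ontology:

```python
# Each utterance is reduced to a (subject, predicate, object) triple,
# so the free word order of a Ukrainian sentence no longer matters.
triples = {
    ("Київ", "є_столицею", "Україна"),
    ("Львів", "розташований_у", "Україна"),
}

def objects_of(subject, predicate, store):
    """Return all objects linked to the subject by the given predicate."""
    return {o for s, p, o in store if s == subject and p == predicate}
```

However the sentence is phrased, Київ - столиця України or столицею України є Київ, it maps to the same triple and answers the same query.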
4. Experiments, results and discussions
4.1. Method of grapheme analysis of the Ukrainian language
For the grapheme analysis (GA) of text strings, it is best to use regular expressions (RE) as algebraic notations for describing sets of character strings. They are commonly used in the development and maintenance of every type of computer language (programming, communication protocols, data markup, specification and design), in text editors, and in word-processing software, especially with IIS template collections or SA text corpora. Identification/search of a fragment/string by pattern in a sequence of character strings is implemented to find either all matches or only the first one. The patterns use special characters [, ], ^, \, -, ?, *, +, ., $, |, (, ), _, {, }, etc.; the slash / is not part of the RE itself but marks its boundaries. The simplest RE is a tuple of plain characters (Table 1) that recognizes the first or all pattern-matching occurrences in character sequences.
Table 1
Regular expressions of GA texts in the Ukrainian language for recognition of all characters
N RE Recognition Example and result
1 /контент/ the exact sequence of substring Структурна схема лінгвістичного аналізу
characters, taking into account the case текстового контенту
2 /к/ a specific character, taking into account Контент-аналіз застосовують для
the case аналізу потоків контенту
3 /-/ specific special character Контент-аналіз застосовують
4 /[кК]онтент/ exact sequence of characters without Контент-аналіз застосовують для
taking into account the case of the 1st аналізу потоків контенту
character
5 /[онві]/ or о, or н, or в, or і Контент-аналіз застосовують
6 /[0123456789]/ Any number in a string sequence RE чутливі до регістру– правила 1, 2 та 4
дають різні результати
7 /[0123]/ or 0, or 1, or 2, or 3 RE чутливі до регістру– правила 1, 2 та 4
дають різні результати
8 /[0-9]/ Any number in a string sequence RE чутливі до регістру– правила 1, 2 та 4
дають різні результати
9 /[а-я]/ Any lowercase letter of the Ukrainian Контент-аналіз застосовують
alphabet
10 /[А-Я]/ Any uppercase letter of the Ukrainian Контент-аналіз застосовують
alphabet
11 /[А-Яа-я]/ Any letter of the Ukrainian alphabet, Контент-аналіз застосовують
regardless of case
12 /[A-Z]/ Any uppercase letter of the English RE чутливі до регістру– правила 1, 2 та 4
alphabet дають різні результати
13 /[^А-Я]/ Any character other than an uppercase Контент-аналіз застосовують для
letter of the Ukrainian alphabet аналізу потоків контенту
14 /[^Кк]/ Any character except the letters К and к Контент-аналіз застосовують для
аналізу потоків контенту
15 /[^\.]/ Any character except the dot character. Контент-аналіз застосовують
16 /[к^]/ or к, or ^ аналіз потоків контенту
17 /x^y/ String pattern x^y функція x^y
18 /^[А-Я]/ Any uppercase letter of the Ukrainian Контент-аналіз застосовують для
alphabet at the beginning of a line аналізу потоків контенту в CLS
19 /^а/ The letter а at the beginning of the line Контент-аналіз застосовують
20 /контенту?/ Presence/absence of the optional y Структурна схема лінгвістичного аналізу
character in the substring текстового контенту
21 /зв’?зок/ The apostrophe ’ is optional for Структурна ознака описує зв’язок між
searching and is often omitted лінгвістичними лексемами.
22 /лін.вістика/ Designation of any symbol лінгвістика або лінґвістика
23 /б.гу/ Designation of any symbol Зараз змагання з бігу, тому я біжу. Я
бігун, тому біжу естафету
24 /ї*/ Any line without ї or any number of ї MA лексеми провадять на основі її
особистої множини ознак
25 /її*/ Any string with one or more ї MA лексеми провадять на основі її
особистої множини ознак
26 /[нжтлдчз]*/ or without or any number or н or ж or т Віддалено ллється на ланах нашого
or л or д or ч or з життя беззмінне збіжжя знання як
обличчя особистого досвіду!
27 /[нжтлдчз]/ or н or ж or т or л or д or ч or з Віддалено ллється на ланах нашого
життя беззмінне збіжжя знання як
обличчя особистого досвіду!
28 /[0-9]*/ or none or an arbitrary number of one RE чутливі до регістру– правила 1, 2
element from the range 0-9 та 4 дають різні результати
29 /[0-9][0-9]*/ One digit from the range 0-9 is RE чутливі до регістру– правила 1, 2 та 4
required, the other is not, but if there is дають різні результати
- any number of one of 0-9
30 /[0-9]+/ any number of different digits from 0-9 Спецсимвол знаку питання ? для RE-
правил 20-21
31 /[нжтлдчз]+/ one or н or ж or т or л or д or ч or з or Віддалено ллється на ланах нашого
several, or any combination thereof життя беззмінне збіжжя знання як
обличчя особистого досвіду!
32 /[нжтлдчз]{2}/ exactly two or н or ж or т or л or д or ч Віддалено ллється на ланах нашого
or з життя беззмінне збіжжя знання як
обличчя особистого досвіду!
33 /аналіз.*аналіз/ String identification using a double Контент-аналіз застосовують для
word аналіз аналізу потоків контенту в CLS
34 /^В/ В at the beginning of the line В наш час в Інтернет все є.
35 /^Контент- recognition of a specific phrase Контент-аналіз.˽
аналіз$/
36 /˽$/ marking a space at the end of a line Контент-аналіз ˽ застосовують˽
37 /^Контент- recognition of a specific phrase with a Контент-аналіз.˽
аналіз\. $/ period and a space at the end of the
line
38 /^/[А-Я]\. $/ recognition of all possible sentences В наш час в Інтернет все є.˽
39 /\bаналіз\b/ recognition of a specific set of symbols Контент-аналіз застосовують для
(words) taking into account boundaries аналізу потоків контенту
40 /\b19\b/ recognizing a word as a number Йому виповнилось 19 в 2019.
41 /\b3\b/ word recognition within limits Ціна -3$ за 13 одиниць.
42 /\b5\b/ word recognition within limits Ціна -5Є за 5 одиниць.
43 /ML|МН/ recognition of abbreviations ML or МН Реалізація CLS на основі МЛ
44 /контент(у|ний)/ recognition of words with different Контентний аналіз застосовують до
inflections великих потоків контенту
45 /№˽[0-9]+˽*/ 1 digit with any number of spaces В˽колонці ˽ №˽3˽˽˽˽˽˽
46 /(№˽[1-9]+˽*)*/ recognition of arbitrary sequence В ˽ колонках ˽ №˽1˽ та ˽ №˽3˽, але не
number № and any number в №˽13˽
RE is case-sensitive – rules 1, 2 and 4 give different results. Using the special characters [ and ] solves the case-sensitivity problem of RE: the string of characters inside [] implements a disjunction of values when matching. RE-rule 6 recognizes any digit in a sequence of string characters. The dash - inside [] in RE-rules 8-12 avoids listing all characters and instead denotes any character in the corresponding range. For example, the pattern /[3-6]/ denotes any of the characters 3, 4, 5 or 6, and /[в-ж]/ one of the characters в, г, д or ж in the grapheme analysis of the input text. The caret (circumflex) character ^ in RE-rules 13-18 carries a different meaning depending on its location. If it stands immediately after [, all characters listed after it are excluded when parsing the character string (RE 13-15). Thus the caret ^ has three purposes: to mark the beginning of a line (outside [] – RE 18-19); to express negation inside [] (RE 13-15); and simply to denote the caret character itself (RE 16-17). The question mark ? in RE-rules 20-21 marks optional characters in the searched string. This is useful where a character in a certain sequence may be either present or absent, which [] cannot express: inside [] one can specify the exclusion of specific characters from a range, but not the optional presence of a character, which is exactly what ? provides. The dot . in RE-rules 22-23 marks the position of an arbitrary character in the analyzed string. While ? denotes the absence or presence of exactly one character, repetition is expressed by the special character * (RE 24-29), which matches zero or more consecutive occurrences of the preceding character or RE in the recognized line; the result can therefore be a line without that character at all. Hence, to require at least one occurrence, the character is doubled before * (RE 25 and 29), while the special character + in RE-rules 30-31 expresses the same one-or-more repetition of the immediately preceding character or RE more compactly. {} (RE 32) specifies the exact number of repetitions (for example, exactly 2 times). The dot . is often used together with * to denote any string of characters (RE 33).
An anchor is a special symbol (for example, the caret ^ or the dollar sign $) specifying the location of the RE in the character string. Outside [], the caret ^ marks the beginning of a line (RE 34). The dollar sign $ recognizes the end of a line (RE 35-36). The backslash \ allows special characters to be recognized literally in the character string of the input text (RE 37-38). The anchors \b and \B identify the presence and absence of word boundaries, respectively (RE 39-42). A word here is any tuple of digits, underscores or letters (without special characters).
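The anchor and boundary rules above can be checked directly with Python's re module (a minimal sketch; the sentence strings are illustrative examples, not corpus data):

```python
import re

sentence = "В наш час в Інтернет все є."
# ^ anchors the match at the beginning of the line (cf. RE 34);
# \. matches a literal period and $ anchors at the end (cf. RE 37-38)
starts_upper = re.search(r"^В", sentence) is not None
ends_period = re.search(r"є\.$", sentence) is not None

# \b marks word boundaries (cf. RE 39): "аналіз" is found after the
# hyphen of "Контент-аналіз" but not inside the inflected "аналізу"
text = "Контент-аналіз застосовують для аналізу контенту"
bounded = re.findall(r"\bаналіз\b", text)
```

Note that the hyphen is a non-word character, so \b fires on both sides of it, which is why the embedded "аналіз" still matches.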
To organize the selection of alternatives, for example between synonyms, the disjunction operation based on the special symbol | is used (RE 43-46). Combining | with () allows disjunction recognition to be arranged only for a specific part of the pattern, taking into account different inflexions/prefixes (RE 44). The special characters () are also used to apply counters of the type * to groups (RE 46); the difference is that a bare * applies to one character, not to a whole sequence.
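The grouped disjunction of rule 44 can be sketched in Python; (?:...) is used here so that findall returns whole matches rather than group contents, a detail of Python's API that is not part of the table's notation:

```python
import re

text = "Контентний аналіз застосовують до великих потоків контенту"
# () limits the disjunction to the inflexion only (cf. RE 44); the first
# letter is handled with the character class [кК]
forms = re.findall(r"[кК]онтент(?:у|ний)", text)

# top-level | chooses between whole alternatives (cf. RE 43)
abbrev = re.search(r"ML|МН", "Реалізація CLS на основі МН").group()
```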
For complex RE with several operators grouped from different special symbols, the concept of priority is used (Table 2): parentheses (), then the counters *, +, ?, {}, then sequences and the anchors ^ and $, and finally disjunction |, from the highest to the lowest priority. Greedy RE patterns of the type /[а-я]*/ recognize zero or more letters, expanding the identification to cover as long a string as they can. Non-greedy RE based on *? and +? find the smallest possible text. An RE of the type /˽*/ is used to indicate the absence or presence of a certain number of spaces, since there can always be additional spaces around. There are aliases for general ranges that can be used primarily to preserve the grapheme type (Table 3). Correctly constructed REs avoid errors of commission (over-recognition) and omission (accidental misses). Reducing the overall error rate for GA implies two antagonistic conditions for generating a collection of REs: increasing recall (minimizing false omissions) and increasing precision (minimizing false recognitions).
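Greedy versus non-greedy behaviour, and the space-tolerant /˽*/ idiom, can be demonstrated with an illustrative tagged string (a sketch, not text from the corpus):

```python
import re

s = "<b>аналіз</b> <i>контент</i>"
# Greedy * expands to cover as long a string as it can
greedy = re.search(r"<.*>", s).group()   # spans from the first < to the last >
# Non-greedy *? stops at the first possible match
lazy = re.search(r"<.*?>", s).group()    # only the first tag

# /˽*/ tolerates a variable number of spaces (here: normalizing after №)
norm = re.sub(r"№ *", "№ ", "№   3")
```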
Table 2
Regular expressions to recognize keywords, stop words and tokens
N RE Recognition
1 /але/, /аналіз/ simple (but incorrect) patterns - they also match other possible variants of the character sequence in the input string
2 /[аА]ле/, /[аА]наліз/ case-insensitive for the first letter, but unfortunately still matches other contexts, such as малеча or каналізація
3 /\b[аА]ле\b/, /\b[аА]наліз\b/ taking into account word boundaries (no letters, underscores or digits on either side) - good for але, but inflected forms of аналіз are now ignored
4 /[^а-яА-Я][аА]наліз[а-я]/ before аналіз there is no letter of either case, and after it an arbitrary lowercase letter of the Ukrainian alphabet
5 /\b[аА]наліз[а-я]*/ before аналіз there is no letter, underscore or digit, and it is followed by any number of lowercase Ukrainian letters or none
6 /(^|\b)[аА]ле\b/, /(^|\b)[аА]наліз([а-я]*|$)/ to item 5, the possibility of meeting the word at the beginning or the end of the line is added, when no character exists in these positions
7 /[0-9]+ (\$|грн\.|EU)/ the integer value of the price in грн. (UAH) or US/EU currency
8 /[0-9]+\,[0-9][0-9] грн\./ the real value of the price in грн. (UAH)
9 /(^|\W)[0-9]+(\,[0-9][0-9])? (\$|грн\.|EU)?\b/ the real value of the price in the currency of Ukraine/USA/EU at the level of a word in a sentence/utterance/phrase
10 /(^|\W)[0-9]{0,5}(\,[0-9][0-9])? (\$|грн\.|EU)?\b/ the real value of the price in the currency of Ukraine/USA/EU at the word level, taking into account the limit on the number of digits before the comma
11 /\b[6-9]+˽*(UAH|₴|грн\.|[Гг]рив(ня|ні|ень))\b/ lines with a price value > 5 in the currency of Ukraine, taking into account various designation options and abbreviations
12 /\b[0-9]+(\,[0-9]+)?˽*(UAH|₴|грн\.?)\b/ lines with the real value of the price in the currency of Ukraine, taking into account the presence/absence of various designation options and abbreviations
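A price rule in the spirit of RE-rule 12 can be transcribed into Python syntax (the ˽ marker becomes \s*; the trailing \b is dropped here because a word boundary cannot reliably follow the optional period — a small adaptation, not the table's exact rule):

```python
import re

# Sketch of RE-rule 12: a real-valued price in Ukrainian currency with
# several designation variants (UAH, ₴, грн, грн.)
price = re.compile(r"\b[0-9]+(?:,[0-9]+)?\s*(?:UAH|₴|грн\.?)")

found_uah = price.findall("Ціна 1250,50 грн. за 5 одиниць")
found_sym = price.findall("Вартість 99 UAH або 99 ₴")
```

Only non-capturing groups are used, so findall returns the full matched price strings.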
Table 3
Basic RE aliases for general GA ranges
N Range RE Recognition Example
1 [˽\n\t\f\r] \s any spaces and tabs аналіз˽контенту
2 [^\s] \S no spaces or tabs аналіз˽контенту
3 [0-9] \d any digit from the range 14˽лютого˽2005
4 [^0-9] \D no digit from the range 14˽лютого˽2005
5 [а-яА-Я0-9_] \w any letter, digit or underscore контент-аналіз
6 [^\w] \W no letter, digit or underscore контент-аналіз
7 \b[0-9]*\b * none or several occurrences of the previous RE вже 22 рік
8 \b[0-9]+\b + one or more occurrences of the previous RE вже 2022 рік
9 \b[0-9]?\b ? absent or present exactly once 22 рік 2 століття
10 \b[0-9]{2}\b {n} a certain number of repetitions 22 рік 2 тисячоліття
11 \b[0-9]{1,2}\b {n,m} within a range of repetitions 22 рік 2 тисячоліття
12 \b[0-9]{2,}\b {n,} at least a certain number of repetitions 22 рік 2 тисячоліття
13 \b[0-9]{,2}\b {,m} up to a certain number of repetitions 22 рік 2 тисячоліття
14 [0-9]{1,}\*[0-9]{1,} \* special designation of the character * значення 5*93
15 1[0-9]{1}\.0[0-9]{1} \. special notation for the dot sign дата 14.02
16 [а-я]\? \? special designation of the question mark контент-аналіз?
17 [а-я]\n[а-я] \n special notation for the newline character контент-аналіз контент-моніторинг
18 [б-я]\t[а-я] \t special notation for the tab character а) б) с)
19 s/текст/контент/ s/x/y/ replacement/clarification of a word by another текст → контент
20 s/([0-9]+)/<\1>/ s/R/R'/ replacement/clarification of an expression with a template 27 → <27>
21 /x(.*)y\1z/ /(.*)\1/ repeating a certain line/expression twice xAyAz
22 /x(.*)y(.*)z\1w\2u/ /()()\1\2/ duplicates of two expressions in certain places /xAyBzAwBu/
23 /(?:x|y)(z|)text x\1/ /(?:|)(|)\1/ grouping without capturing the template x w text x w
24 /(?![яЯ])[А-Яа-я]+/ /(?!x)y/ any string that does not begin with я контент-аналіз
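The aliases from Table 3 map directly onto Python's re character classes, which are Unicode-aware by default in Python 3, so \w and \W also cover Cyrillic letters (a small sketch with illustrative strings):

```python
import re

date = "14 лютого 2005"
# \d (alias 3) matches any digit; {n,m} (alias 11) bounds the repetitions
numbers = re.findall(r"\d{1,4}", date)

# \W (alias 6) is "no letter, digit or underscore": the hyphen splits
parts = re.split(r"\W+", "контент-аналіз")
```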
RE /{9}/ recognizes exactly 9 occurrences of the previous symbol/expression; RE /а.{3}я/ matches а, then any three symbols, then я; RE /{3,12}/ matches from 3 to 12 occurrences of the previous symbol/expression; RE /{5,}/ at least 5 occurrences; and RE /{,13}/ up to 13 occurrences of the preceding character/expression. The special character s before an RE allows the matched expression to be replaced according to a pattern. The special character \k indicates the location of a character/phrase/expression as a duplicate of the k-th element in the capture groups, i.e. the pattern in (), where k is the number of the bracket pair (capture group). Thus, the special characters () have a double function in RE: to group conditions and to determine the order of application of operators. For grouping without capturing the matched template, an RE of the form (?:pattern) is used as a group that does not capture the expression and does not affect backreference numbering (RE 23).
The (?=pattern) operator is a positive lookahead: it identifies a zero-width pattern, i.e. the match pointer is not advanced. The (?!pattern) operator is a negative lookahead: it succeeds if the pattern does not match, is zero-width, and the cursor does not advance. Negative assertions are usually used in the analysis of a complex content model when a special case needs to be excluded (RE 24).
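Both zero-width assertions can be sketched in Python (the word list is illustrative):

```python
import re

# Positive lookahead (?=...) checks the context without advancing the cursor
head = re.search(r"контент(?=-аналіз)", "контент-аналіз").group()

# Negative lookahead (?!...) rejects strings beginning with я/Я (cf. RE 24)
words = ["яблуко", "контент-аналіз", "Ялта", "мова"]
kept = [w for w in words if re.match(r"(?![яЯ])\S+", w)]
```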
Grapheme analysis is the preliminary processing and transformation of the text into a marked and compressed format for the following NLP processes (Fig. 5): extracting content → extracting paragraphs → extracting sentences within a paragraph → extracting tokens within a sentence → marking tokens with part-of-speech tags for MA.
[Figure 5 diagram: information resources (WWW) pass through grapheme analysis (HTML tags → paragraphs → sentences → lexemes) into marked content, which is then saved to a repository of text corpora.]
Figure 5: Content partitioning, grapheme segmentation and labelling
At the first stages of the integration of content from various sources, it is necessary to
implement the processes of filtering, access and calculation of text sizes based on the
application of the standard API of pre-grapheme processing of the division of documents
through the execution of the following sequence of NLTK methods:
𝑓𝑟𝑎𝑤 () is the organization of access to previously unprocessed text;
𝑓ℎ𝑡𝑚𝑙 () is the elimination of non-text content, scripts and style tags;
𝑓𝑝𝑎𝑡𝑎𝑠 () is the identification of individual paragraphs from the content text;
𝑓𝑠𝑒𝑛𝑡𝑠 () is the identification of individual sentences from the content text;
𝑓𝑡𝑜𝑘𝑒𝑛𝑠 () is the identification of individual tokens from the content text;
𝑓𝑚𝑎𝑟𝑘 () is grapheme labelling of identified tokens based on RE;
𝑇𝑚𝑎𝑟𝑘𝑒𝑑 = 𝑓𝑚𝑎𝑟𝑘 (𝑓𝑡𝑜𝑘𝑒𝑛𝑠 (𝑓𝑠𝑒𝑛𝑡𝑠 (𝑓𝑝𝑎𝑡𝑎𝑠 (𝑓ℎ𝑡𝑚𝑙 (𝑓𝑟𝑎𝑤 (𝑋𝑐𝑜𝑛𝑡𝑒𝑛𝑡 )))))), (6)
and if necessary, additional methods, such as adding tags or parsing sentences, converting
annotated text into tree-like data structures, or extracting individual XML elements. To
identify and extract the main content from an information resource with an undefined
structure and high variability of documents from different sources, 𝑓ℎ𝑡𝑚𝑙 () based on the
Python readability-lxml library is used, which removes all anomalous artefacts, leaving only
the text. When processing HTML text, 𝑓ℎ𝑡𝑚𝑙 () uses a collection of formal REs to identify and
remove navigation menus, declarations, script tags, and CSS, then creates a new content
object model tree, extracts the text from the source tree, and embeds it into the newly
created tree.
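Formula (6) composes the stages into one pipeline. The sketch below is stdlib-only; the real CLS uses NLTK and readability-lxml, so every f_* function here is a deliberate simplification of its namesake, not the actual implementation:

```python
import re

def f_html(raw):
    # strip scripts/styles, then all remaining tags (readability-lxml stand-in)
    raw = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ", raw, flags=re.S)
    return re.sub(r"<[^>]+>", " ", raw)

def f_patas(text):
    # paragraphs = blocks of text separated by blank lines
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]

def f_sents(par):
    # naive sentence split on ., !, ? followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", par) if s.strip()]

def f_tokens(sent):
    # lexemes: hyphenated words, plain words, or single punctuation marks
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sent)

raw = "<html><body><p>Контент-аналіз застосовують. Все є.</p></body></html>"
marked = [[f_tokens(s) for s in f_sents(p)] for p in f_patas(f_html(raw))]
```

The nested comprehension mirrors the function composition in (6): paragraphs of sentences of tokens.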
Vectorization, feature extraction, and ML tasks rely heavily on CLS's ability to efficiently
break down textual content into its constituent components while preserving the original
structure. The accuracy and sensitivity of ML models depend on the efficiency of identifying
the connections of tokens with the corresponding context in the text. Paragraphs contain
complete ideas of context and are the structural unit of content. Based on NLTK, the 𝑓𝑝𝑎𝑡𝑎𝑠 ()
operator is implemented as a paragraph generator, which is defined as blocks of text
separated by two newline characters. The 𝑓𝑝𝑎𝑡𝑎𝑠 () operator scans all files and passes
each HTML text to the RE constructor, indicating that parsing of the HTML markup should
be done through the lxml HTMLparser. The resulting object maintains a tree structure that
can be navigated using native HTML tags and elements.
If paragraphs are structural units of content, then sentences are semantic units. As a
paragraph expressing a single idea, a sentence contains a complete thought that the author
has formulated and expressed in many words. Grapheme segmentation is the division of
text into sentences for further processing by marking words with parts of speech in MA. The
operator 𝑓𝑠𝑒𝑛𝑡𝑠 (), calling 𝑓𝑝𝑎𝑡𝑎𝑠 () and returning an iterator (generator), iterates over all sentences from all paragraphs.
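The lazy-evaluation design described above, where each stage yields items instead of materializing lists, can be sketched as follows; the function names mirror the operators in the text, but the bodies are simplified placeholders:

```python
def f_patas(text):
    # paragraph generator: blocks of text separated by two newline characters
    for block in text.split("\n\n"):
        if block.strip():
            yield block.strip()

def f_sents(text):
    # iterate sentences from all paragraphs without building a full list
    for par in f_patas(text):
        for sent in par.split(". "):
            if sent.strip():
                yield sent.strip().rstrip(".") + "."

doc = "Перший абзац. Друга думка.\n\nДругий абзац."
sents = list(f_sents(doc))
```

Because both stages are generators, a corpus larger than memory can still be streamed sentence by sentence.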
The 𝑓𝑠𝑒𝑛𝑡𝑠 () operator bypasses all paragraphs selected by the 𝑓𝑝𝑎𝑡𝑎𝑠 () operator and uses
the 𝑓𝑤𝑜𝑟𝑑𝑠 () operator to perform the actual grapheme segmentation. Internally, the
𝑓𝑡𝑜𝑘𝑒𝑛𝑠 () operator uses 𝑓𝑚𝑎𝑟𝑘 (), a model pre-trained with RE recognition/identification rules for various kinds of tokens: punctuation marks, abbreviations, geographical names and other marks that serve as sentence start/end or tab marks. Punctuation marks do not always have an unambiguous interpretation: a period, for example, is a sign of the end of a sentence, but periods also appear in dates, abbreviations and ellipses, so determining sentence boundaries is not always an easy task. Punctuation is crucial for
identifying word boundaries (commas, spaces, colons) and for identifying certain aspects of
meaning (question marks, exclamation marks, quotation marks). For some tasks, such as
tagging parts of speech, and analyzing or synthesizing speech, it is sometimes necessary to
treat punctuation marks as if they were separate words. When analyzing speech,
punctuation marks replace pauses, accents, and changes in intonation dynamics.
Lexemization is the process of obtaining lexemes (syntactically encoded strings of symbols)
and for its implementation, the operator 𝑓𝑤𝑜𝑟𝑑𝑠 () based on RE is used, which is selected
through 𝑓𝑚𝑎𝑟𝑘 () markers for spaces and punctuation marks and returns a list of alphabetic
and non-alphabetic characters. Like delimiting sentences, lexeme recognition is not always
an easy task: the presence of punctuation marks in a lexeme, punctuation marks as
independent lexemes, lexemes with and without hyphens, and lexemes as shortened forms
of words (one or more words). Different marker selection tools are chosen for these cases.
Any statement is a speech correlate of a sentence. The presence of lexemes of the dysfluency
type (loss of speech speed, for example, a longer pause when thinking) carries not so much
a semantic load as an emotional one. Exclamations such as мммм, ох, ах [mmmm, ohh, ah],
etc. are fillers or filled pauses and are also emotionally coloured, but not semantically
coloured. An unfinished word with further repetition and its ending or simply with
repetition is a fragment that does not carry a semantic load, but only an emotional one.
Therefore, when conducting PHA, depending on the goal of solving a specific problem
through CLS, it is important to take into account (mark accordingly) or ignore some types
of punctuation (ellipsis, exclamation points, etc.), dysfluencies, double fragments,
exclamations, etc. If CLS is just a transcription of speech, then such phonemes should be
ignored to avoid loss of speech rate. But they make it possible to determine the
psychological state of the speaker and his emotional state, to identify the peculiarity of the
speaker's authorial speech when the tone of the voice changes, they are relevant in
predicting the future word, because they signal that the speaker is restarting the
statement/idea, and therefore, for speech recognition, ordinary tokens are considered as
phonemes. Marking a lexeme as a lemma (a set of lexical forms having the same base, the same major part of speech and the same word sense) or as a word form (a fully inflected or derived form of a word) makes a significant difference for the next stage of MA, lemmatization or stemming, i.e. the identification of word bases. For many NLP tasks in English it is enough to mark the corresponding lexemes as word forms, but for the Ukrainian language it is not: the bases of the words still have to be identified (for example, based on the analysis of inflexion according to the tree of endings).
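An ending-based lookup for Ukrainian bases can be sketched with a flat table; the endings below are a toy subset chosen for illustration, not the actual ending tree from the paper's dictionaries:

```python
# Toy subset of Ukrainian inflexional endings, longest tried first
ENDINGS = sorted(["ами", "ові", "ого", "ому", "ів", "ом", "у", "а", "и", "і"],
                 key=len, reverse=True)

def stem(word, min_base=3):
    # strip the first matching ending, keeping at least min_base letters
    for end in ENDINGS:
        if word.endswith(end) and len(word) - len(end) >= min_base:
            return word[: -len(end)]
    return word
```

A real implementation would also handle letter alternations and exception lists, which a plain suffix strip cannot express.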
There are two ways to count words with punctuation ignored: as types (the number of different words |V| in the set of words of the corpus, i.e. the cardinality of the alphabet/dictionary of the corpus, where an element of the alphabet/dictionary is a unique word) and as tokens (the total number N of running words of the analyzed corpus), so that |V| ≤ N. The largest Google N-grams corpus contains 13 million types among the words that appear at least 40 times, so the true number of types is much larger.
The ratio between the number of types |V| and the number of tokens N is called Herdan's law (Herdan, 1960) or Heaps' law (Heaps, 1978): |V| = kN^x, where k and x are positive constants with 0 < x < 1. The value of x depends on the size of the corpus and the genre; for large corpora x varies within [0.67; 0.75], so the vocabulary of a text grows noticeably faster than the square root of its length in words. Another measure of the number of words in a language is the number of lemmas rather than word types (for example, the Oxford English Dictionary has over 615,000 entries).
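Herdan's/Heaps' law is easy to evaluate numerically; k = 30 and x = 0.7 below are assumed illustrative constants inside the cited range, not values estimated from a real corpus:

```python
def heaps_vocab(n_tokens, k=30.0, x=0.7):
    # |V| = k * N^x: expected number of distinct types for N running words
    return k * n_tokens ** x

small, large = heaps_vocab(10_000), heaps_vocab(1_000_000)
```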
4.2. Method of morphological analysis of the Ukrainian language
Morphology identifies the shape of things, and in textual analysis, the shape of individual
words/tokens. Lexemes are both words and punctuation marks, allowing you to conduct
the next SYA (syntactic analysis) more clearly. Word structure helps determine plural,
gender, tense, person, declension, etc. MA is a difficult task, as most languages have many
exceptions to the rules and special cases. The main task of MA is to identify parts of words
to assign them to certain classes (tags) of parts of speech. For example, sometimes it is
important to understand whether a noun is singular or plural, or is a proper name. It is also
often necessary to know whether the verb has an indefinite form, past tense, or is an
adjective. The resulting parts of speech are then used to generate larger structures
(fragments/phrases), or whole word trees, which are then used to build semantic reasoning
data structures. After GA (grapheme analysis), we have access to tokens in sentences in
paragraphs of integrated content texts, which makes it possible to apply MA to mark words
from the collection of tokens with parts of speech (e.g., verbs, nouns, prepositions,
adjectives) that indicate the role of the word in the context of the sentence. In the Ukrainian
language, the same word can usually take on different roles, depending on the inflexions.
Part-of-speech tagging based on MA rules consists of adding a corresponding tag to each
word from a collection of tokens that contains information about the definition of the word
and its role in the current context. MA rules are used for the development of
modules/subsystems for keyword identification, text classification (Fig. 6), machine
translation, and error correction, as well as for human psychological analysis, semantic
analysis, etc. When identifying words for further classification, the rub_id attribute
describes the rubric to which a specific keyword belongs (Table 4).
Table 4
Examples of Ukrainian and English words/flags for identifying keywords
N Ukrainian English N Ukrainian English
1 курсорний/V cursoriness/17,13 39 буферизувати/ABGH buffer/18,9,13,17,10,23
2 cursorily 40 відформатувати/AB format/1,20,17
3 cursor/9,13,17,10 41 кодувати/ABGH code/17,2,23,10,12,18,9
4 cursory/16 42 кешувати/ABGH cache/9,17,18,10,13
5 кирилічний/V Cyrillic 43 кука/ab hook/10,23,9,18,13,17
6 кілобітовий/V kilobit/17 44 клавіатурний/V keyboard/18,9,13,23,10,17
7 кілобіт/efg 45 клавіатура/ab
8 кілобайтовий/V kilobyte/17 46 кодосумісний/V code/17,2,23,10,12,18,9
9 кілобайт/efg 47 code compatible
10 кодек/efg coder/2,13 48 compatible/17,5
11 кодер/efg 49 compatibleness/13
12 консольний/V consoled/7 50 compatibility/5,13,17
13 consoler/13 51 compatibly/5
14 консоль/ij console/23,8,10 52 кодогенератор/efg code/17,2,23,10,12,18,9
15 Кобол/e COBOL 53 generators/1
16 Cobol/13 54 generator/17,13
17 кілобод/efg kilobaud/13 55 конфігуратор/efg configuration/1,17,13
18 копілефт/e Copyleft/19,18,17 56 configure/1,10,17,9,8
19 хакер/efg hacker/13 57 криптозахищений/V crypto-protected/7,21
20 хеш/e hash/1,10,17,9 58 криптографічний/V cryptographic
21 таймер/efg timer/13 59 cryptographically
22 стек/efgo stack/13 60 cryptography/13,17
23 спам/e spam/13 61 копірайт/e copyright/13,17,18,9,10,23
24 смайл/ef smile/10,13,9,17,18 62 комутований/V switch/10,8,23,13,18,17,9,12
25 сайт/ef site/9,17,12,13 63 конкатенація/ab concatenate/22,17,9,10
26 рестарт/ef restart/8 64 комбосписок/ab combo/13,17
27 рекурсія/ab recursion/13 65 box/9,18,17,12,23,10,13
28 процесор/efg processor/13,17 66 list/12,13,18,9,15,10,23,22,17
29 проксі proxy/17,13 67 крос-компілятор/efg сross/13
30 принтер/efg printer/1,13 68 compilable/7
31 подкаст/e podcast/13 69 compilation/17,1,13
32 плотер/efg plotter/13,9,17,10 70 compile/1,17,9,2,10
33 піксель/efg pixel/17,13 71 compiler/2,17
34 опція/ab option/10,9,13,17 72 compiler's
35 оффлайн/e offline/13 73 крос-асемблер/efg cross-assembler/3,13,17
36 онлайн/e online/13 74 фрейм/efg frame/17,18,9,12,10,13,23
37 модем/efg modem/17,13 75 файл/ef file/6,9,18,17,10,13,23
38 сплайн/efg spline/13,17,9 76 сигнатура/ab signature/13,17
The flag attribute defines the properties of the keyword (the part of speech to which it belongs). In thematic dictionaries, each word has its property: for example, a, b, c, d, o mark different types of nouns, A marks verbs, and V marks adjectives (Fig. 7). To compare the complexity, in thematic dictionaries (23 rules in total) each English word also has a property: the numbers 1-23 are the numbers of rules of the PFX type (prefixes, rules 1-7) and the SFX type (suffixes and endings, rules 8-23), and they describe some noun modifications for English words (Fig. 8). For example, PFX-type rules describe the modification of some nouns for English words with the prefixes: re- (rule PFX 1), de- (rule PFX 2), dis- (rule PFX 3), con- (rule PFX 4), in- (rule PFX 5), pro- (rule PFX 6) and un- (rule PFX 7).
Figure 6: The relation of keywords in the CLS database of text rubrics
Figure 7: Noun classification dictionaries for Ukrainian words
Figure 8: Noun classification dictionaries for English words
SFX-type rules describe how some nouns are modified for English words with suffixes or endings (Fig. 8):
-able [^aeiou], -able ee, -able [^aeiou]e (rule SFX 8),
-d e, -ied [^aeiou]y, -ed [^ey], -ed [aeiou]y (rule SFX 9),
-ing e, -ing [^e] (rule SFX 10) and -ieth y, -th [^y] (rule SFX 11),
-ings e, -ings [^e] (rule SFX 12) and -'s (rule SFX 13),
-ment (rule SFX 14) and -ion e, -ication y, -en [^ey] (rule SFX 15),
-iness [^aeiou]y, -ness [aeiou]y, -ness [^y] (rule SFX 16),
-ies [^aeiou]y, -s [aeiou]y, -es [sxzh], -s [^sxzhy] (rule SFX 17),
-r e, -ier [^aeiou]y, -er [aeiou]y, -er [^ey] (rule SFX 18),
-st e, -iest [^aeiou]y, -est [aeiou]y, -est [^ey] (rule SFX 19),
-ive e, -ive [^e] (rule SFX 20) and -ly (rule SFX 21),
-ions e, -ications y, -ens [^ey] (rule SFX 22),
-rs e, -iers [^aeiou]y, -ers [aeiou]y, -ers [^ey] (rule SFX 23).
The letters e and y near the suffixes are decision markers.
A file of affixes (parts of words that attach to the root and bring grammatical or word-
forming meaning, elements of word formation, for example, prefix, suffix, postfix, inflexion)
has the *.aff file type and may contain additional attributes - the rules of reduction to the
base of the word (Fig. 9). The notation SET is usually used to identify the sequence of parts
of affixes and directories. REP forms a lookup table to correct multiple characters for words.
TRY identifies sequences to replace. SFX and PFX identify the types of suffixes and prefixes
that are marked by word affixes.
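The mask/find/repl mechanics described above can be sketched with a single hypothetical rule; the (mask, find, repl) triple below mirrors the attributes of the aff-file entries and imitates the "ordering 26" example (alternation і/о: nominative -ін vs instrumental -оном), but it is an illustration, not a rule copied from the actual dictionary:

```python
import re

# Hypothetical rule table in the spirit of the aff-file attributes:
# mask selects candidate word forms, find is stripped, repl is appended
RULES = [("оном$", "оном", "ін")]

def to_base(word):
    # reduce an inflected form to its dictionary base
    for mask, find, repl in RULES:
        if re.search(mask, word):
            return word[: -len(find)] + repl
    return word
```

For example, to_base("гоном") yields "гін", while words matched by no rule are returned unchanged.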
Figure 9: Rules for reduction to the base of a word of the noun type
The flag attribute determines the type of word, the mask attribute shows the ending identification rule, the value of the find attribute is the ending of the word in the nominative case, and the value of the repl attribute is the ending of the word in a non-nominative case. Exceptions to the rules are given in square brackets. For example, the first
line (ordering 26) describes a specific example of recognizing nouns of group a with the
alternation of -і -о and the inflection -ін of the nominative case in the instrumental case
(inflection -оном), and the next entry (ordering 27) is the same nouns, but in the local case
(inflection -оні), but does not recognize other rules of that group or other groups in the
dative case - inflections -онові and -ону (Fig. 10). The third record (ordering 28) already
recognizes nouns with alternation -і -о with inflections v of the nominative case in the dative
case - inflection -огу, but does not recognize other rules of the same group and (rules 29-31
do this, respectively): -огові (Д.М.), -огом (О.), -озі (М.).
Figure 10: An example of the rules of morphological analysis of Ukrainian nouns
The ninth entry (ordering 34) already recognizes nouns with inflexions on -[^л]ід of the
nominative case not after -л in the instrumental case with the inflexion -одом, but does not
recognize other rules of the same group and (according to rules 32-33 and 35 ): -[^л]оду
(Д.Р.), -[^л]одові (Д.), -[^л]оді (М.). REP defines a substitution table for correcting several
characters in Ukrainian words [535], for example REP 5; REP сч щ; REP уюч увальн; REP
ююч ювальн; REP ємн містк; REP обез зне. Negative form prefix (Fig. 11):
Adjectives ending in -ий;
Adjectives of the short form change to -ен in the same way as the full form (ясен -
ясний...).
The presence of a ratio of words blocked by the moderator (Fig. 12), in particular those
that cannot be key, allows to reduce the amount of verification during text classification
(Fig. 13). To identify keywords, it is important to correctly recognize adjectives in any case,
gender and number (Fig. 14).
Figure 11: An example of rules for identifying the negative form of Ukrainian words
Figure 12: The ratio of words blocked by the moderator
Figure 13: Relationship of rubrics
Figure 14: An example of the rules of morphological analysis of Ukrainian adjectives
Let us describe each marked class of the set of MA noun rules:
Class I for nouns marked with flag as a, b, c, d or o:
a. 1st declension: feminine, masculine and neuter nouns;
b. 2nd declension: masculine nouns ending in -ар, -ир, stressed (mixed group in -ар, -ир);
c. 2nd declension: masculine nouns with alternation -і, -о;
d. numerals -ять, -сят, -сто;
Class II for nouns marked with flag as e, f, g or h:
a. 2nd declension masculine nouns with a zero ending;
b. 2nd declension masculine nouns ending in -о;
Class III for nouns marked with flag as i, j or k:
a. 3rd declension without alternation;
b. 2nd declension neuter ending in -о, -а, -я;
c. 2nd declension on a consonant without the ending -і in the locative case;
Class IV for nouns marked with flag as l, m or n:
a. 3rd declension with alternation;
b. 4th declension of the neuter ending in -а, -я;
Class V for nouns marked with the flag as p: masculine (m.) and feminine (f.) patronymics of the singular (s.) and plural (pl.) of male names.
For further SYA, it is appropriate to recognize the verbs correctly (Fig. 15).
Let us describe in more detail each marked class of the set of noun recognition rules, indicating their total number N (Table 5). In total, about 1,300 rules for processing suffixes and endings are used for the MA of Ukrainian-language nouns, taking into account the alternation of letters.
Figure 15: An example of the rules of morphological analysis of Ukrainian verbs
Table 5
Basic MA rules for marking nouns when marking a part of speech
Class flag N Features of MA-rules
І а 248 For the singular:
1 declension: feminine, masculine and neuter nouns.
2 declensions: masculine in -ар, -ир, stressed (mixed group in -ар, -ир).
2 declensions: masculine nouns with alternating -і and -о.
numerals -ять, -сят, -сто.
І b 384 For the plural:
1 declension: feminine, masculine and neuter nouns.
2 declensions: masculine in -ар, -ир, stressed (mixed group in -ар, -ир).
2 declensions: masculine nouns with alternating -і and -о.
plural nouns ending in na -и.
І c 54 2 declension, in gen. singular case with the ending -а/-я, namely the meaning:
beings and persons: студента, моря, Любомира;
items that can be counted: зошита, ножа, олівця;
own settlements: Ужгорода, Тернополя;
water bodies with a pronounced inflexion: Дніпра;
measurements: квадрата, міліметра (but віку, року);
definitions: відмінка;
architecture: парника, коридора, гаража.
І d 44 vocative case;
first declension (for endings [ая]);
2 declensions (ending [рнгдблвк] with alternation о-і and dropping е, о).
І o 53 For the plural:
1 declension: feminine/masculine/neuter gender with alternation of о/і and the appearance of о(е) in the genitive;
2 declensions: neuter in -о with alternating о/і in gen. plural.
ІІ e 19 For the singular:
a solid group of nouns ending in -о;
a solid group of nouns with a zero ending;
a solid group of nouns with a zero ending in a sibilant;
a mixed group with a zero ending in sibilants;
a soft group ending in -й, -ій or -ь.
ІІ f 25 For the plural:
a solid group of nouns with a zero ending;
a solid group of nouns with zero ending in sibilant;
mixed group with zero ending in sibilants;
a solid group of nouns ending in -о;
a soft group ending in -й, -ій or -ь;
a group of nouns ending in -ття, -ттів, incl. from group i;
coincides with the gen. singular case;
nouns ending in -ок and dropping о are transferred to group a.
ІІ g 3 genitive case of the second declension in -а.
ІІ h 5 second declension (required for endings in consonants except ЙЖЧШЩ);
nouns of the second declension of the masculine gender with a zero ending;
coincides with the dative.
ІІІ i 47 For the singular:
nouns of the third declension of the feminine gender with a zero ending;
ending in a sibilant, except -ь;
nouns of the second declension of the neuter ending in -о, -а or -я;
soft group in-е, except for sibilants;
mixed group on sibilant before -е;
from adjectival nouns.
ІІІ j 66 For the plural:
nouns of the third declension of the feminine gender with a zero ending;
ending in a sibilant, except -ь;
nouns of the second declension of the neuter ending in -о, -а or -я;
a soft group on -е, except for sibilants or a mixed group;
mixed group on sibilant before -е.
ІІІ k 8 vocative;
nouns of the third declension of the feminine gender with a zero ending;
feminine singular of adjectival nouns.
ІV l 40 For the singular:
nouns of the third declension with alternation;
nouns of the fourth declension of the middle gender ending in -а or -я;
2 masculine declensions in -о[дв]ець with dropout of е and alternation of о-і;
mixed group 2 declensions in -яр;
2 masculine declensions in -ар/-ир, stressed (mixed group in - ар/-ир).
ІV m 66 For the plural:
nouns of the third declension with alternation;
nouns of the fourth declension of the middle gender ending in -а/-я;
2 masculine declensions in -о[дв]ець with dropout of е and alternation of о-і;
mixed group 2 declensions in -яр;
2 masculine declensions in -ар/-ир, stressed (mixed group in -ар/-ир).
ІV n 9 with alternation і е or with alternation і о;
ending in a sibilant, except -ь;
nouns of the third declension of the feminine gender with a zero ending;
mixed group of 2 declensions in -яр soft group in -ар/-ир.
ІV q 2 mixed group 2 declensions in -яр;
soft group in -ар/-ир (accented endings in declension).
V p 222 masculine/feminine singular and plural patronymics from male names.
It is quite difficult to generate terminal chains in English (although there are far fewer MA rules than in Ukrainian), because the presence of articles and the linking of groups of nouns with each other by the corresponding prepositions make the tree longer and wider. The generation of terminal chains in the Ukrainian language is complicated by case and gender differences in the inflexions of the term used in the context. To identify keywords, it is not enough to recognize nouns (about 1,300 RE-rules); it is also necessary to identify adjectives - a total of 99 RE-rules for Ukrainian texts (Table 6-Table 7). For correct SYA and SEM, including ontology construction, it is necessary to recognize verbs based on more than 800 RE rules.
Table 6
Basic MA rules for marking adjectives as parts of speech
flag N Peculiarities of MA rules for recognizing adjectives
V 83 singular ending in -ий;
the short form singular changes to -ен in the same way as the full form (ясен - ясний...);
ending in -лиций;
ending in -ій/-їй;
plurals ending in -ій/-їй;
possessives from nouns of the 1st declension - names of people in -ин;
possessives from nouns of the 2nd declension in -ів (solid group);
possessives from nouns of the 2nd declension in -їв.
U 13 soft group of possessives ending in -ів -> -ев;
plurals ending in -ів.
W 3 the formation of an adverb from an adjective, the neuter gender of the comparative form of
adjectives corresponds to the corresponding adverb in the comparative form (міцніший - міцніше).
Table 7
Basic SFX-type RE of Ukrainian adjectives based on goroh.pp.ua
N Flag Genus F1 F2 RE Numeric Sign Example 1 Example 2 Case N
1 V ч ий ого [^ц]ий одн in -ий текстовий текстового Р.З. 1
2 ому текстовому Д.М. 2
3 им ий текстовим О.Мн:Д. 3
4 ім текстовім М. 4
5 ж а [^ц]ий текстова Н. 5
6 ої текстової Р. 6
7 ій ий текстовій Д. 7
8 у [^ц]ий текстову З. 8
9 ою текстовою О. 9
10 с е ий текстове Н. 10
11 - і мн текстові 11
12 их текстових Р. 12
13 ими текстовими О. 13
14 ч ього [^у]ций одн in -лиций білолиций білолицього Р.З. 14
15 ьому білолицьому Д.М. 15
16 ж я білолиця Н. 16
17 ьої білолицьої Р. 17
18 ю білолицю З. 18
19 ьою білолицьою О. 19
20 ч ого уций куций куцого Р.З. 20
21 ому куцому Д.М. 21
22 ж а куца Н. 22
23 ої куцої Р. 23
24 у куцу З. 24
25 ою куцою О. 25
26 ч ій ього ій in -ій/-їй крайній крайнього Р. 26
27 ьому крайньому Д. 27
28 ім крайнім О.Мн.:Д. 28
29 ж я крайня Н. 29
30 ьої крайньої Р. 30
31 ю крайню Д. 31
32 ьою крайньою О. 32
33 с є крайнє Н. 33
34 - й - [їі]й мн крайні Н. 34
35 х крайніх Р. 35
36 ми крайніми О. 36
37 ч їй його їй одн безкраїй безкрайого Р.З. 37
38 йому безкрайому Д. 38
39 їм безкраїм О.М.Мн.:Д. 39
40 ж я безкрая Н. 40
41 йої безкрайої Р. 41
42 ю безкраю З. 42
43 йою безкрайою О. 43
44 с є безкрає Н. 44
45 ч - ого [їи]н possessives from nouns of the 1st declension - names of people in -ин мамин маминого Р. 45
46 ому маминому Д. 46
47 им маминим О. Мн:Д. 47
48 ім маминім М. 48
49 ж а мамина Н. 49
50 ої маминої Р. 50
51 ій маминій Д.М. 51
52 у мамину З. 52
53 ою маминою О. 53
54 с е мамине Н. 54
55 і мн мамині Н. 55
56 их маминих Р. 56
57 ими маминими О. 57
58 ч ів ового ів одн possessives from nouns of the 2nd declension in -ів, solid group татів татового Р. 58
59 овому татовому Д. 59
60 овим татовим О. Мн:Д. 60
61 овім татовім М. 61
62 ж ова татова Н. 62
63 ової татової Р. 63
64 овій татовій Д.М. 64
65 ову татову З. 65
66 овою татовою О. 66
67 с ове татове Н. 67
68 - ові мн татові Н. 68
69 ових татових Р. 69
70 овими татовими О. 70
71 ч їв євого їв одн possessives from nouns of the 2nd declension in -їв, hard group Вереміїв Веремієвого Р. 71
72 євому Веремієвому Д. 72
73 євим Веремієвим О. Мн:Д. 73
74 євім Веремієвім М. 74
75 ж єва Веремієва Н. 75
76 євої Веремієвої Р. 76
77 євій Веремієвій Д.М. 77
78 єву Веремієву З. 78
79 євою Веремієвою О. 79
80 с єве Веремієве Н. 80
81 - єві мн Веремієві Н. 81
82 євих Веремієвих Р. 82
83 євими Веремієвими О. 83
1 U ч ів евого ів одн soft group of possessives on -ів, -ев вчителів вчителевого Р. 84
2 евому вчителевому Д. 85
3 евим вчителевим О. Мн:Д. 86
4 евім вчителевім М. 87
5 ж ева вчителева Н. 88
6 евої вчителевої Р. 89
7 евій вчителевій Д.М. 90
8 еву вчителеву З. 91
9 евою вчителевою О. 92
10 с еве вчителеве Н. 93
11 - еві мн вчителеві Н. 94
12 евих вчителевих Р. 95
13 евими вчителевими О. 96
1 W ий о [^жчшщ]ий - adverb надісланий надіслано - 97
2 ій ьо ій синій синьо 98
3 їй йо їй безкраїй безкрайо 99
CLS marks the words of the input text as parts of speech (after GA it refines the
tagged/marked lexemes as words) based on RE-rules and analysis of inflexions: singular
nouns of the corresponding gender and case, plural nouns of the corresponding case,
adjectives, adverbs, verbs, personal pronouns, etc. (each with a collection of features).
The MA module returns a collection of paragraph lists, each of which is a list of sentences,
which in turn are lists of tokens, including words marked by parts of speech. Periodic
interim analysis of the input/integrated textual content makes it possible to assess how
the thematic corpus changes over time. In the process of analysis, we count the number of
paragraphs, sentences and words, and also save each unique lexeme in an additional
intermediate dictionary. If a lexeme/word base did not exist in the dictionary of
lexemes/word bases, we mark it as new and store it in the intermediate dictionary for
analysis by the moderator. We count the number of content items and categories in the
corpus of incoming text content and form a dictionary with a statistical summary of the
corpus, which contains: the total number of integrated content items and categories; the
total number of paragraphs, sentences and words; the number of unique tokens; lexical
diversity as the ratio of the number of unique lexemes to their total number; the average
number of paragraphs per content item; the average number of sentences per paragraph;
the total processing time.
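The statistical summary described above can be sketched as follows; this is a minimal illustration in which the nested list structure mirrors the MA module's output (content items → paragraphs → sentences → tokens), and all names are illustrative rather than taken from the CLS implementation:

```python
from collections import Counter

def corpus_summary(corpus):
    """corpus: list of content items; each item is a list of paragraphs,
    each paragraph is a list of sentences, each sentence a list of tokens."""
    n_paragraphs = n_sentences = 0
    tokens = []
    for item in corpus:
        n_paragraphs += len(item)
        for paragraph in item:
            n_sentences += len(paragraph)
            for sentence in paragraph:
                tokens.extend(sentence)
    lexemes = Counter(tokens)          # unique lexemes and their counts
    n_words = len(tokens)
    return {
        "content_items": len(corpus),
        "paragraphs": n_paragraphs,
        "sentences": n_sentences,
        "words": n_words,
        "unique_lexemes": len(lexemes),
        # lexical diversity: unique lexemes relative to total word count
        "lexical_diversity": len(lexemes) / n_words if n_words else 0.0,
        "avg_paragraphs_per_content": n_paragraphs / len(corpus) if corpus else 0.0,
        "avg_sentences_per_paragraph": n_sentences / n_paragraphs if n_paragraphs else 0.0,
    }
```

Tracking these values between corpus updates gives the change signal used to trigger re-vectorization of the ML model.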
Since the corpus grows as new data is collected, pre-processed and compressed, the MA
method allows these features to be calculated and their dynamics analyzed. This is an
important content-monitoring tool for identifying possible problems in CLS: for example,
in an ML model, a significant change in lexical diversity and in the number of paragraphs
per content item affects the quality of the model. That is, the MA and GA methods, in
addition to identifying tokens and directly marking words by parts of speech, are used to
collect additional information for determining the amount of change in the corpus, so that
further vectorization and restructuring of the ML model can be started in time. The main
stage of the MA method is the identification of word bases (stemming) without taking into
account inflexions (suffixes and endings) and, in some cases, prefixes. From the content of
the inflexions, the part of speech of the word is identified (Fig. 16).
Figure 16: An example of identification of forms of inflexion according to part of speech
Marking a word only as a part of speech is not enough for the subsequent SYA; it is still
necessary to determine, for example, gender/declension, etc., for a noun/adjective. The
classic Porter stemmer algorithm works by sequentially cutting off endings and suffixes.
For English-language texts this is not a problem, as there are very few inflexions. For
Ukrainian words, a modified (extended) Porter stemmer algorithm should be applied,
which checks both additional inflexions depending on the part of speech (according to the
tree of endings) and the obtained word bases against a dictionary of bases to identify an
existing word (Fig. 17).
Algorithm 4.1. Modified Porter stemmer algorithm
Stage 1. Identify the next token as the word 𝑤𝑖 (𝑤𝑠 = 𝑤𝑖 ).
Stage 2. Check against the dictionary of stop words 𝐷𝑠𝑤 whether 𝑤𝑠 is a service word. If yes, then
𝑖 = 𝑖 + 1 and go to stage 1, otherwise go to stage 3.
Stage 3. Go to the end of the word 𝑤𝑠 . Recognize the inflection 𝑓1𝑖 in 𝑤𝑠 from all possible ones (Fig.
16 - the longest one is chosen; for example, in 𝑤𝑠 = текстова we choose the ending 𝑓1𝑖 = ова, not
𝑓1𝑖 = а) from the RE word type 𝑅𝑎𝑑𝑗𝑒𝑐𝑡𝑖𝑣𝑎𝑙 , 𝑅𝑛𝑜𝑢𝑛 or 𝑅𝑣𝑒𝑟𝑏 , and if present remove
the inflexion 𝑓1𝑖 (Fig. 18).
Stage 4. Preservation of the inflection 𝑓1𝑖 in the tag of the word 𝑤𝑖 .
Stage 5. Mark 𝑤𝑖 as type 𝑚^{𝑤𝑖}_{𝑎𝑑𝑗𝑒𝑐𝑡𝑖𝑣𝑎𝑙}, 𝑚^{𝑤𝑖}_{𝑛𝑜𝑢𝑛} or 𝑚^{𝑤𝑖}_{𝑣𝑒𝑟𝑏} respectively.
Stage 6. Find the deleted inflection 𝑓1𝑖 in the tree of inflexions 𝑇𝑓𝑙𝑒𝑐𝑡𝑖𝑜𝑛 (the longest one is chosen).
Check the contents of the subtree 𝑇^{𝑓1}_{𝑓𝑙𝑒𝑐𝑡𝑖𝑜𝑛} for an existing word ending 𝑓2𝑖 (𝑓 = 𝑓2𝑖 + 𝑓1𝑖 ). If 𝑤𝑠
ends in 𝑓2𝑖 and has a counterpart in 𝑇^{𝑓1}_{𝑓𝑙𝑒𝑐𝑡𝑖𝑜𝑛}, then we store it in 𝑓𝑖 = 𝑓 and delete it in 𝑤𝑠 .
Stage 7. We check the obtained base 𝑤𝑠 of the initial word 𝑤𝑖 against the content of the base dictionary
𝐷𝑤𝑠 of Ukrainian words. If there is no match, we save < 𝑤𝑖 , 𝑤𝑠 > in the additional temporary
intermediate dictionary 𝐷<𝑤𝑖 ,𝑤𝑠> for the moderator and proceed to stage 1, otherwise proceed to
stage 8.
Stage 8. Analysis of inflexion and the presence/absence of alternation of letters in the
base/inflexions of the words < 𝑤𝑖 , 𝑤𝑠 > and the analogue of the base of the word in 𝐷𝑤𝑠 according
to the relevant MA RE-rule to identify additional features of the analyzed word 𝑤𝑖 .
Stage 9. Addition of the identified linguistic features of the recognized part of speech to the tag of the
word 𝑤𝑖 of type 𝑚^{𝑤𝑖}_{𝑎𝑑𝑗𝑒𝑐𝑡𝑖𝑣𝑎𝑙}, 𝑚^{𝑤𝑖}_{𝑛𝑜𝑢𝑛} or 𝑚^{𝑤𝑖}_{𝑣𝑒𝑟𝑏} respectively. Saving the results in the corresponding
dictionary 𝐷𝑤𝑖 of the analyzed text.
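The core of the algorithm (longest-inflexion matching plus a dictionary check of the remaining base) can be sketched as follows; the tiny dictionaries here are illustrative stand-ins for the real inflexion tree and base dictionary:

```python
# Hypothetical mini-dictionaries for illustration only.
STOP_WORDS = {"і", "та", "про", "в"}
ADJ_INFLEXIONS = ["ова", "ої", "ій", "ий", "а"]   # tried longest-first (Stage 3)
STEM_DICTIONARY = {"текст"}

def stem(word, unknown_log):
    """Sketch of Algorithm 4.1: cut the longest known inflexion and
    verify the remaining base against a dictionary of stems (Stage 7)."""
    if word in STOP_WORDS:
        return None                      # service word, skip (Stage 2)
    for suffix in sorted(ADJ_INFLEXIONS, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            base = word[: -len(suffix)]
            if base in STEM_DICTIONARY:
                return {"stem": base, "inflexion": suffix, "pos": "adjective"}
    unknown_log.append(word)             # unknown base goes to the moderator
    return None
```

For текстова the longest-first ordering picks ова rather than а, exactly as required in Stage 3.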
Figure 17: Modified stemming algorithm
The growth in the number of MA RE-rules increases the load on CLS geometrically, solely
due to the recognition of inflexions and bases of word forms. For English-language texts
the complexity is lower for several reasons: for nouns there are 2 cases and 2 plural
inflexions (s|es). For German the complexity increases: 4 cases (though inflexions hardly
change, only articles do), two-word phrases are written together, etc. In Ukrainian there
are 7 noun cases, each of which changes its inflexion depending on gender and number,
and some words have different endings in certain cases (for example, втручання
[vtruchannya] (intervention) has two options in the locative case – втручанню,
втручанні); in addition, alternation of letters is frequent.
Figure 18: Classes of linguistic features of inflexions of morphological analysis
Therefore, for Ukrainian words, Porter's simple classic stemming algorithm (reducing a
word to its base by cutting off inflexions) is not suitable. It is better to combine such an
algorithm with a search/check of the obtained intermediate results against a tree of
inflexions (so as not to go through all possible inflexions) and against the content of
thematic dictionaries of bases with a set of RE-rules for identifying features
(classification by parts of speech). For text rubrication based on word identification
alone, it is enough to conduct MA only for some noun groups (adjectives with nouns and
nouns with nouns) without analyzing words of other parts of speech (by the tree of
inflexions: if a token is neither an adjective nor a noun, it is ignored; in addition, in key
phrases there can sometimes be one preposition, and only between nouns). It is enough to
identify the bases of nouns/adjectives/abbreviations in the text and analyze the
probability of their clustering in different parts of the content relative to the total volume.
The classic stemming algorithm - Porter's Stemmer - does not use dictionaries of word
bases but only applies a set of RE-rules for cutting off inflexions in sequence according to
the specifics of a specific language. The algorithm works with individual words without
analyzing and taking into account the context. Linguistic features such as features of word
formation (prefix, suffix, etc.) and parts of speech (noun, verb, etc.) are not taken into
account. It is based on the following techniques for words:
• cutting off the inflexion from the analyzed word (for Ukrainian words, it can be
implemented with a check of the obtained bases and inflexions against analogues in the DB);
• the word has an invariable inflexion (the condition is impossible for most Ukrainian
words, but it can identify particles, conjunctions, prepositions, some nouns of foreign
origin, abbreviations, etc.);
• the inflexion changes in declension due to dropping/alternating letters;
• the change of word inflexion and word formation corresponds to a specific RE-rule,
for example, when forming words from some verb groups:
(ов)*ува(ти|нню|нням|нні|ння|ли|ло|ла|вшись|вши|в|вся|всь|лися|лись|тися|тись)
[(ov)*uva(ty|nnyu|nnyam|nni|nnya|ly|lo|la|vshysʹ|vshy|v|vsya|vsʹ|lysya|lysʹ|tysya|tysʹ)];
• the inflexion of the word changes as an exception to the RE-rules;
• the ending of the word coincides with an enveloping RE-rule for identifying an
inflexion, but the word itself has no inflexion: вітер [viter] (wind), but відер [vider]
(of buckets);
• most short words are invariable (a stop-word dictionary is sufficient).
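The verb-formation RE-rule quoted above can be applied directly; in this sketch the trailing `$` is an addition that anchors the suffix to the end of the token:

```python
import re

# The RE-rule from the text for word formation from some verb groups;
# the trailing $ (added for this sketch) requires the suffix at the word end.
VERB_RE = re.compile(
    r"(ов)*ува(ти|нню|нням|нні|ння|ли|ло|ла|вшись|вши|в|вся|всь|лися|лись|тися|тись)$"
)

def is_derived_verb_form(token: str) -> bool:
    """True if the token ends with one of the verb-group suffixes."""
    return VERB_RE.search(token) is not None
```

Because the alternation is anchored with `$`, the regex engine backtracks to the alternative that consumes the whole ending, so the order of the alternatives does not matter here.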
Such techniques significantly complicate the stemming algorithm for Ukrainian words.
Therefore, widespread inflexions are analyzed first, for example, 1-letter ones: ц (34), щ
(110), ф (214), б (281), п (341), ж (353), з (581), г (636), л (754), с (914), ч (959), д (1038),
н (2531), р (2709), or 1-4-letter ones (Table 2.2). Inflexions of 5+ letters (for example,
max(йтесь)=6837, max(ванням)=4656) are significantly less frequent among keywords;
therefore, for the speed/efficiency of the solution they are ignored in some CLS NLP tasks,
but SYA/SEM do not allow this. Many NLP tasks do not require the full implementation of
all NLP processes from grapheme to pragmatic analysis. For example, to identify keywords,
it is enough to perform grapheme and morphological analysis (Algorithm 4.2). But before
almost any NLP process, the text must be normalized.
Algorithm 4.2. Abbreviated naive processing of textual content
Stage 1. Rough tokenization (or grapheme analysis) of special characters of the input text.
Step 1.1. Reading the text and removing repeated consecutive spaces and tags if they are present (if
the text is integrated from a Web resource), sequentially marking the service characters of the
beginning/end of the paragraph/heading/text, etc.
Step 1.2. Grapheme parsing and segmentation between service characters or tags of the input text 𝑋,
sequentially marking each sequence of non-alphabetic characters as tokens and recognizing
alphabetic sequences between spaces and other special characters (eg numbers and
punctuation) according to RE rules as token words to form a list 𝑆 of identified alphabetic
tokens as words 𝑤𝑖 .
Step 1.3. Sort the list 𝑆 → 𝑆𝐴 of identified tokens 𝑤𝑖 alphabetically, counting occurrences of identical
chains and forming an alphabetic-frequency dictionary 𝐷𝑎 , whose records have the form:
number of occurrences – word.
Step 1.4. Transfer all uppercase letters to lowercase and recalculate the occurrences of
word-tokens in the alphabetic-frequency dictionary 𝐷𝑎 → 𝐷𝐴 .
Step 1.5. Sort and save the dictionary 𝐷𝐴 → 𝐷𝑁 of identified words 𝑤𝑖 by decreasing frequency of
appearance (in Germanic languages, the top will be articles, pronouns, adjectives and
conjunctions, while in Slavic languages most words with the same base and different inflexions
will occupy different lines of the list, which significantly distorts the picture of the real
distribution of words in texts).
Stage 2. Segmentation/tokenization of words of the analyzed text content.
Step 2.1. Word segmentation based on dictionaries, metrics such as the probability of an error in a
word, and statistical sequence models pre-trained from segmented text corpora (between
spaces, punctuation, etc.).
Step 2.2. Tokenization based on RE-rules of marked tokens of the sequence type of non-alphabetic
characters as tokens (dates, prices, URLs, hashtags, e-mail addresses, etc.), punctuation (as the
end of a sentence or the boundary of a subordinate clause), mixed tokens of alphabetic-non-
alphabetic characters (abbreviations, complex hyphenated words, with an apostrophe, etc.),
lines with uppercase characters (such as the beginning of a sentence, geographical names,
proper names, abbreviations) and their normalization if necessary (for example, к.т.н. → ктн
(PhD) as a separate word-token, or ML as машинне навчання [mashynne navchannya]
(machine learning)).
Step 2.3. Analysis of tokens with uppercase characters (except when only the first letters are
capitalized) for labelling based on the RE-rules of finite automata or as an abbreviation or
emotion transfer.
Step 2.4. Marking of unidentified 𝐷𝑥 tokens and ambiguities (e.g. apostrophe as part of a word, etc.).
Stage 3. Lemmatization of a set of recognized and labelled alphabetic tokens of the text as lemmas,
identified as words of the analyzed text.
Step 3.1. Normalization of tokens based on the identification of affixes from the ending tree as
stems of marked word-tokens (reducing the word to its initial form based on the MA RE-rules
for identifying roots and affixes via Algorithm 4.1, the modified Porter stemmer), i.e.
determining whether the analyzed tokens have the same root and differ only in inflexion,
with sequential identification of the part of speech of the analyzed words and
subsequent marking of them as lemmas with all accompanying linguistic features.
Step 3.2. Regrouping and recalculation of word frequencies in the alphabetic-frequency dictionary
𝐷𝑁 → 𝐷𝑙 , taking into account the words normalized in step 3.1.
Stage 4. Additional analysis of unidentified tokens 𝐷𝑥 by iteratively combining frequent
character/string pairs within word-tokens (for example, whether tokens between spaces or
other punctuation marks such as контент-аналіз [kontent-analiz] (content analysis),
Web-сайт [Web-sayt] (Web site), контент-моніторинг [kontent-monitorynh] (content
monitoring) or Web-ресурс [Web-resurs] (Web resource) are one word or two) through
byte-pair encoding (BPE) based on text compression, for further possible identification of
words, their labelling and normalization.
Step 4.1. Form a set of symbols equal to the collection of tokens in 𝐷𝑥 . We present
each word as a sequence of characters plus a special character at the end of the word, or a
special character, such as a dash, within a token (for example, контент- or Web-). We set
𝑖 = 0.
Step 4.2. Calculate the number 𝑛𝑙 of each pair of characters/strings (𝑠𝑘𝑥 , 𝑠𝑗𝑥 ) as occurrences of
word stems in the input text when {𝑠𝑘𝑥 ∈ 𝐷𝑥 , 𝑠𝑗𝑥 ∈ 𝐷𝑙 } or {𝑠𝑘𝑥 ∈ 𝐷𝑙 , 𝑠𝑗𝑥 ∈ 𝐷𝑥 }, which are next to
each other and separated by a special character: a dash (compound words), a period (dates),
a comma (real numbers) and/or a space, or their combination, but not punctuation marks,
numbers and other special characters.
Step 4.3. Form the alphabetic-frequency dictionary 𝐷′𝑥 based on (𝑠𝑘𝑥 , 𝑠𝑗𝑥 ). Determine
the number of unique lexemes in 𝐷′𝑥 : ℎ = |𝐷′𝑥 |.
Step 4.4. Find 𝑛𝑙 = 𝑚𝑎𝑥 of the most frequent pair 𝑎𝑖 = (𝑠𝑘𝑥 , 𝑠𝑗𝑥 ) in 𝐷′𝑥 , where (𝑠𝑘𝑥 , 𝑠𝑗𝑥 ) ∈ 𝐷′𝑥 ,
{𝑠𝑘𝑥 ∈ 𝐷𝑥 , 𝑠𝑗𝑥 ∈ 𝐷𝑙 } or {𝑠𝑘𝑥 ∈ 𝐷𝑙 , 𝑠𝑗𝑥 ∈ 𝐷𝑥 }.
Step 4.5. Replace 𝑎𝑖 with a new combined/merged character/string 𝑏𝑖 = 𝑠𝑘𝑥 𝑠𝑗𝑥 .
Step 4.6. Remove from 𝐷′𝑥 the value 𝑠𝑘𝑥 𝑠𝑗𝑥 and from 𝐷𝑥 the values 𝑠𝑘𝑥 or 𝑠𝑗𝑥 respectively.
Step 4.7. Calculate the number of occurrences in the input text of 𝑏𝑖 , and of 𝑠𝑘𝑥 and 𝑠𝑗𝑥 at
𝑠𝑘𝑥 ∈ 𝐷𝑙 and/or 𝑠𝑗𝑥 ∈ 𝐷𝑙 respectively, when they are used separately (not next to each other).
Step 4.8. Include in 𝐷𝑙 the value of 𝑏𝑖 and its frequency of occurrence. Overwrite the frequency
values in 𝐷𝑙 for 𝑠𝑘𝑥 and 𝑠𝑗𝑥 at 𝑠𝑘𝑥 ∈ 𝐷𝑙 and/or 𝑠𝑗𝑥 ∈ 𝐷𝑙 respectively.
Step 4.9. Set 𝑖 = 𝑖 + 1. If ℎ > 0, 𝐷𝑥 ≠ ∅, 𝑛𝑙 > 1 and 𝐷′𝑥 contains at least one marked 𝑏𝑖 , then go to
step 4.4; otherwise (ℎ = 0 and 𝐷𝑥 = ∅, or 𝐷𝑥 ≠ ∅ at 𝑛𝑙 = 1, or 𝑠𝑘𝑥 has no non-unique pair 𝑠𝑗𝑥 to
form 𝑎𝑖 = (𝑠𝑘𝑥 , 𝑠𝑗𝑥 )) – go to stage 5.
Stage 5. Segmentation of sentences in the analysed content.
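Stage 4's pair merging is essentially byte-pair-encoding-style compression. A simplified sketch over already-segmented tokens, which ignores the 𝐷𝑥/𝐷𝑙 bookkeeping of steps 4.2-4.9 and keeps only the merge loop:

```python
from collections import Counter

def merge_frequent_pairs(token_sequences, min_count=2):
    """BPE-style merging sketch: repeatedly replace the most frequent
    adjacent pair of segments with their concatenation."""
    seqs = [list(seq) for seq in token_sequences]
    while True:
        pairs = Counter()
        for seq in seqs:                       # count adjacent pairs (step 4.2)
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), n = pairs.most_common(1)[0]    # most frequent pair (step 4.4)
        if n < min_count:                      # stop when no pair repeats
            break
        merged = a + b                         # merge b_i = s_k s_j (step 4.5)
        for seq in seqs:
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i : i + 2] = [merged]
                i += 1
    return seqs
```

For instance, segment lists such as ["контент", "-", "аналіз"] and ["контент", "-", "моніторинг"] share the pair ("контент", "-"), which gets merged into the single segment "контент-".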
4.3. Method of lexical analysis of the Ukrainian language
The process of lexical analysis of the Ukrainian-language text 𝐶′ consists in parsing,
segmentation and tokenization of each sentence separately, which is characterized not by a
strict order of words, but at the same time by a constant arrangement of individual linguistic
units. In a complete simple Ukrainian sentence with direct word order, the structural
scheme is conditionally fixed. The main lexical categories of the corresponding sentence are
noun and verb groups. Type 0 grammar according to N. Chomsky's classification is not
appropriate for such sentences due to the complexity of implementation. With context-
dependent grammar, specific restrictions are applied, in particular, to the structure of a
Ukrainian-language sentence with some set of variations. Based on the syntactic rules of
generating Ukrainian-language sentences with partial word order (for example, there is no
strict order for the subject and predicate in the sentence, but the adjective is usually before
the noun or another adjective, if it is not a poetic passage, also the lexical units of the noun
group are placed around the subject, etc.), we derive the lexical scheme for the noun group
𝑆̃ based on regular expressions:
𝑆̃ = ([𝐴]{0, 𝑛}[𝑆]{1, 𝑚}|[𝑃]), (7)
where 𝐴 = 𝑎1 𝑎2 𝑎3 … 𝑎𝑁−1 𝑎𝑁 is a sequence of adjectives, and the entry [𝐴]{0, 𝑛} is a
selection of from 0 to 𝑛 adjectives from 𝑎1 𝑎2 𝑎3 … 𝑎𝑁−1 𝑎𝑁 , at 𝑛 ≤ 𝑁; 𝑆 = 𝑠1 𝑠2 𝑠3 … 𝑠𝑀−1 𝑠𝑀 is a
sequence of nouns, and the entry [𝑆]{1, 𝑚} is a selection of from 1 to 𝑚 nouns from
𝑠1 𝑠2 𝑠3 … 𝑠𝑀−1 𝑠𝑀 , at 𝑚 ≤ 𝑀; 𝑃 = 𝑝1 𝑝2 𝑝3 … 𝑝𝐾−1 𝑝𝐾 is a sequence of pronouns, and the entry
[𝑃] is the choice of one pronoun from 𝑝1 𝑝2 𝑝3 … 𝑝𝐾−1 𝑝𝐾 ; the record (𝑥|𝑦) is a choice of either 𝑥 or
𝑦; the values of 𝑎𝑖 and 𝑠𝑗 agree in gender, number and case. Accordingly, for the verb group,
the lexical scheme based on RE-expressions:
the lexical scheme based on RE-expressions:
𝑉̃ = ([𝑉]{1, 𝑛}[𝑆̃′ ]{0, 𝑚}|[𝑆̃′ ]{0, 𝑚}[𝑉]{1, 𝑛}), (8)
where 𝑉 = 𝑣1 𝑣2 𝑣3 … 𝑣𝑁−1 𝑣𝑁 is a sequence of verbs, and the entry [𝑉]{1, 𝑛} is a choice of from
1 to 𝑛 verbs from 𝑣1 𝑣2 𝑣3 … 𝑣𝑁−1 𝑣𝑁 , at 𝑛 ≤ 𝑁; 𝑆̃′ = 𝑆̃1 𝑆̃2 𝑆̃3 … 𝑆̃𝑀−1 𝑆̃𝑀 is a sequence of noun
groups, and the entry [𝑆̃′]{0, 𝑚} is a choice of from 0 to 𝑚 noun groups from 𝑆̃1 𝑆̃2 𝑆̃3 … 𝑆̃𝑀−1 𝑆̃𝑀 ,
at 𝑚 ≤ 𝑀; the entry (𝑥|𝑦) is a choice of either 𝑥 or 𝑦; agreement between 𝑣𝑖 and 𝑆̃𝑗 is carried out
by person, gender and number. The lexical scheme of a Ukrainian sentence based on RE-
expressions:
𝑅 = ([𝑆̃′]{0,1}[𝑉̃′]{0,1} | [𝑉̃′]{0,1}[𝑆̃′]{0,1}), (9)
where 𝑉̃′ = 𝑉̃1 𝑉̃2 𝑉̃3 … 𝑉̃𝑁−1 𝑉̃𝑁 is a sequence of verb groups, and the entry [𝑉̃′]{0,1} is a
selection of from 0 to 1 verb groups from 𝑉̃1 𝑉̃2 𝑉̃3 … 𝑉̃𝑁−1 𝑉̃𝑁 with the presence of a predicate;
𝑆̃′ = 𝑆̃1 𝑆̃2 𝑆̃3 … 𝑆̃𝑀−1 𝑆̃𝑀 is a sequence of noun groups, and the entry [𝑆̃′]{0,1} is a selection
of from 0 to 1 noun groups from 𝑆̃1 𝑆̃2 𝑆̃3 … 𝑆̃𝑀−1 𝑆̃𝑀 with the presence of a subject; the record (𝑥|𝑦)
is a choice of 𝑥 or 𝑦; agreement between 𝑉̃𝑖 and 𝑆̃𝑗 is carried out by person, gender and
number.
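Schemes (7)-(9) translate directly into regular expressions over one-letter part-of-speech tags; the tag encoding (A = adjective, S = noun, P = pronoun, V = verb) is an assumption of this sketch, not part of the CLS implementation:

```python
import re

# One letter per token: A=adjective, S=noun, P=pronoun, V=verb (illustrative encoding).
NOUN_GROUP = re.compile(r"A*S+|P")                        # scheme (7)
VERB_GROUP = re.compile(r"V+(?:A*S+|P)*|(?:A*S+|P)*V+")   # scheme (8)
```

Running `fullmatch` over a tag string then answers whether a token sequence forms a noun or verb group, before the agreement checks (gender, number, case) are applied.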
The main lexical features of the verb group are tense, number, person. For comparison,
the lexical scheme of the noun group based on the RE-expression for an English-language
sentence:
𝑆̃ = (𝑎𝑟𝑡𝑖𝑐𝑙𝑒[𝐴]{0, 𝑛}[𝑆]/𝑜𝑓[𝐴]{0, 𝑛}[𝑆]/{0, 𝑚}|[𝑃]). (10)
The lexical scheme of the English verb group based on the RE-expression:
𝑉̃ = [𝑉][𝑆̃′ ]{0, 𝑚}. (11)
Lexical scheme for an English-language sentence based on the RE-expression:
̃ ][𝑉′
𝑅 = [𝑆′ ̃ ]. (12)
The agreement of cases between the lexical units of a Ukrainian-language sentence
affects the further syntactic and semantic analysis of the content:
1. 𝑅 → 𝑅𝑌𝑖 𝑥𝑖′ , (13)
2. 𝑥𝑖′ 𝑌𝑗 → 𝑌𝑗 𝑥𝑖′ , 𝑖, 𝑗 = 1,2,3,
3. 𝑅𝑌𝑖 → 𝑥𝑖 𝑅,
4. 𝑅 → 𝑞,
where 𝑥𝑖 , 𝑥𝑖′ , 𝑞 are the main lexical units; 𝑅, 𝑌𝑖 are auxiliary lexical units; 𝑅 is the initial
symbol as an indicator of the type of sentence chain generation.
Stages of lexical formation of the chain of tokens 𝑥2 𝑥1 𝑥1 𝑥3 𝑞𝑥2′ 𝑥1′ 𝑥1′ 𝑥3′ :
1. 𝑅
2. (1) 𝑅𝑌3 𝑥3′
3. (1) 𝑅𝑌1 𝑥1′ 𝑌3 𝑥3′
4. (1) 𝑅𝑌1 𝑥1′ 𝑌1 𝑥1′ 𝑌3 𝑥3′
5. (1) 𝑅𝑌2 𝑥2′ 𝑌1 𝑥1′ 𝑌1 𝑥1′ 𝑌3 𝑥3′
6. (2) 𝑅𝑌2 𝑌1 𝑥2′ 𝑥1′ 𝑌1 𝑥1′ 𝑌3 𝑥3′
7.–11. (2; 5 times)
12. (3) 𝑥2 𝑅𝑌1 𝑌1 𝑌3 𝑥2′ 𝑥1′ 𝑥1′ 𝑥3′
13.–15. (3; 3 times)
16. (4) 𝑥2 𝑥1 𝑥1 𝑥3 𝑞𝑥2′ 𝑥1′ 𝑥1′ 𝑥3′
An example of lexical generation of the type {𝑥𝑞𝑥′}: Саша, Софія, Катя, Данило, … –
спортсмен, співачка, художниця, поет, … respectively, where 𝑥 = (𝑎𝑏𝑐𝑑 …) is a sequence of
proper names, 𝑥′ = (𝑎′𝑏′𝑐′𝑑′ …) is a sequence of professions agreed with the proper names; 𝑞 is
a dash. Any verbal noun can act as a complement: моя дитина вподобала
книгочитання [moya dytyna vpodobala knyhochytannya] (my child liked book-reading).
This process can theoretically be repeated an unlimited number of times: він
книгочитанняцікаводумає про книгочитанняцікавість [vin
knyhochytannyatsikavodumaye pro knyhochytannyatsikavistʹ] (he book-reading-curiously
thinks about book-reading-curiosity), i.e.
Він книго(𝑎) читання(𝑏) цікавість(𝑐) − думає про − книго(𝑎′) читання(𝑏′) цікавість(𝑐′).
A language consisting of strings of the form 𝑎𝑏𝑐𝑑 . . . 𝑑′ 𝑐′ 𝑏′ 𝑎′ (composed of the symbols 𝑎1 ,
𝑎2 , 𝑎3 , 𝑎1′ , 𝑎2′ , 𝑎3′ ) is generated by a grammar of 6 rules:
𝐼 → 𝑎𝑖 𝐼𝑎𝑖′ , 𝐼 → 𝑎𝑖 𝑎𝑖′ , 𝑖 = 1,2,3. (14)
Such grammars do not provide, for example, a natural description for the so-called non-
projective constructions with breaks, with crossing or framing directions of syntactic
dependence (Fig. 19).
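Grammar (14) can be sketched as a recursive generator that produces the nested strings a b c … c′ b′ a′:

```python
import random

SYMBOLS = ["a1", "a2", "a3"]

def generate(depth):
    """Grammar (14): I -> a_i I a_i' | a_i a_i'.
    Yields nested (mirror-symmetric) strings of the form a b c ... c' b' a'."""
    i = random.choice(SYMBOLS)       # pick a_i for this nesting level
    if depth <= 1:
        return [i, i + "'"]          # terminal rule I -> a_i a_i'
    inner = generate(depth - 1)      # recursive rule I -> a_i I a_i'
    return [i] + inner + [i + "'"]
```

Each level wraps the inner chain in a matching pair, so the first symbol is always mirrored by the last primed one, which is exactly the nesting pattern the grammar describes.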
Ukrainian. Наша мова, як і будь-яка інша, посідає унікальне місце.
English. A theorem is stated which describes the properties of this function.
German. ... die Tatsache, daß die Menschen die Fähigkeit besitzen, Verhältnisse der objektiven Realität in Aussagen wiederzuspiegeln.
French. ... la guerre, dont la France portait encore les blessures...
Hungarian. Azt hisszem, hogy késedelmemmel sikerült bebizonyítani.
Serbo-Croatian. Regulacija procesa jedan je od najstarjih oblika regulacije.
Figure 19: Examples of natural description for so-called non-projective constructions
To describe such sentence constructions, the following are used:
1. Right subordination: назва курсу, лист бумаги, une regle stricte, give him.
2. Left subordination: основний курс, белый лист, cette regle, good advice.
3. Sequential subordination (Fig. 20):
досить повiльно рухлива черепаха or очень быстро бегущий олень.
витяг з протоколу звiтування з наукової дiяльностi заступника завiдувача кафедри
IСМ iнституту IКНI Нацiонального унiверситету "Львiвська полiтехнiка"
мiста Львова країни Українa
or жена сына заместителя председателя второй секции эклектики совета по
прикладной мистике при президиуме Академии наук королевства Myрак
Figure 20: Examples of natural description of sequential subordination
Only with the correct identification and recognition of non-projective constructions can a
grammatical and syntactic analysis of Ukrainian sentences be carried out to build
dependency trees of the components of these sentences.
4.4. The method of syntactic analysis of the Ukrainian language
Syntax is a set of relational rules for forming sentences/phrases, usually defined by the
grammar. Sentences are linguistic units of language for generating meaning and
encoding information. The purpose of SYA is to demonstrate meaningful relationships
between words based on the division of a sentence into parts, or between tokens in a tree-
like structure 𝐶′ . Syntax is a necessary basis for reasoning about a system of concepts or
semantics because it is an important tool for determining the degree to which words
influence each other in the generation of phrases. For example, SYA identifies the
prepositional phrase в потяг [v potyah] (into the train) and the noun phrase чемодан в
потяг [chemodan v potyah] (the suitcase into the train) as constituents of the verb phrase
заніс чемодан в потяг [zanis chemodan v potyah] (carried the suitcase into the train). For
any derivable terminal chain (Fig. 21-22), such a derivation in each sentence occupies
the 𝑘 last positions from the right. A set of requirements must be fulfilled that leads to a
derivation of the sequential or nested type:
Example 1. 𝑃 = {𝑆̃𝑥,𝑦,𝑧 → 𝑆𝑥,𝑦,𝑧 𝑆̃𝑥 ′ ,𝑦′ ,𝑝 , 𝑆̃𝑥,𝑦,𝑧 → 𝐴̃𝑥,𝑦,𝑧 𝑆̃𝑥,𝑦,𝑧 , 𝑆̃𝑥,𝑦,𝑧 → 𝑆𝑥,𝑦,𝑧 ,
𝐴̃𝑥,𝑦,𝑧 → {дуже, досить, точно, просто, суттєво, . . . }𝐴𝑥,𝑦,𝑧 , 𝐴̃𝑥,𝑦,𝑧 → 𝐴𝑥,𝑦,𝑧 ,
𝑆ж,𝑦,𝑧 → система𝑦,𝑧 , . .., Ах,𝑦,𝑧 → інформаційнийх,𝑦,𝑧 , простийх,𝑦,𝑧 , . ..,
𝑆ч,𝑦,𝑧 → запит𝑦,𝑧 , користувач𝑦,𝑧 , ресурс𝑦,𝑧 , бізнес𝑦,𝑧 , . . . }
Figure 21: The process of deriving the Ukrainian-language chain for example 1
Example 2. 𝑃 = {𝑆̃𝑥,𝑦,𝑧 → 𝑆𝑥,𝑦,𝑧 𝑆̃𝑥 ′ ,𝑦′ ,𝑝 , 𝑆̃𝑥,𝑦,𝑧 → 𝐴̃𝑥,𝑦,𝑧 𝑆𝑥,𝑦,𝑧 , 𝑆̃𝑥,𝑦,𝑧 → 𝑆𝑥,𝑦,𝑧 ,
𝐴̃𝑥,𝑦,𝑧 → {дуже, досить, точно, просто, суттєво, . . . }𝐴𝑥,𝑦,𝑧 , 𝐴̃𝑥,𝑦,𝑧 → 𝐴𝑥,𝑦,𝑧 ,
𝑆ж,𝑦,𝑧 → школа𝑦,𝑧 , . .., 𝑆ч,𝑦,𝑧 → сміх𝑦,𝑧 , школяр𝑦,𝑧 , Львів𝑦,𝑧 , . ..,
𝑆с,𝑦,𝑧 → місто𝑦,𝑧 , . .., Ах,𝑦,𝑧 → веселийх,𝑦,𝑧 , запальнийх,𝑦,𝑧 , дитячийх,𝑦,𝑧 . . . }
Another derivation is to use more memory, such as starting derivation with 𝑆̃ч,од,н
дуже 𝐴ч,од,н 𝐴ч,од,н 𝑆ч,од,н 𝑆ч,од,р 𝑆ж,од,р 𝑆с,од,р 𝑆ч,од,р.
Figure 22: The process of deriving the Ukrainian-language chain for example 2
Figure 23: An example from H. Feher’s short story – A Humorous Toast
There are cases in the textual content when not only the right but also the left sequential
subordination has an unlimited depth of derivation, for example, due to subordinate clauses
with the operative word which, what, when, etc. (тваринка, яку врятувала Софія
[tvarynka, yaku vryatuvala Sofiya] - the animal that Sofia saved). Fig. 23 illustrates a phrase
with a depth of 22 and is completely grammatically correct (as is its Ukrainian version).
Moreover, nothing prevents you from continuing the phrase to the left на волю в обійми
зеленої пахучої трави [na volyu v obiymy zelenoyi pakhuchoyi travy] (freely into the
embrace of green, fragrant grass). The Ukrainian language allows generating phrases
with an unlimited number of constructions 𝑌1 𝑌2 . . . 𝑌𝑖 . .. sequentially subordinated from
left to right (unlimited right subordination), while unlimited left subordination is also
possible in each of the constructions 𝑋𝑖 as a sequence of chains . . . 𝑌𝑖𝑗 . . . 𝑌𝑖3 𝑌𝑖2 𝑌𝑖1 ;
however, within the sequence 𝑌𝑖𝑗 further unlimited expansion is impossible. According to
the rules of the Ukrainian language, the 𝑌𝑖 are interpreted as simple sentences, each of
which is an additional determiner of the previous one, and the 𝑌𝑖𝑗 are interpreted as
prepositive adjectival attributes.
The grammar 𝐺′ = ⟨𝐷′ , 𝐷1′ , 𝐼′ , 𝑅′ ⟩ has a basic dictionary 𝐷′ = {𝑁1 , 𝑁2 , . . . , 𝑁𝑛 } of symbols and
rules of the form 𝑅′ = {𝑌 → 𝑍𝑁𝑖 , 𝑋 → 𝑁𝑖 }, where 𝑌 ∈ 𝐷1′ and 𝑍 ∈ 𝐷1′ . Each 𝑁𝑖 corresponds to
some regular grammar 𝐺𝑖′ = ⟨𝐷, 𝐷1𝑖 , 𝑁𝑖 , 𝑅𝑖 ⟩, where 𝐷 is the main dictionary of 𝐺𝑖′ , 𝐷1𝑖 is the
auxiliary dictionary with 𝐷1𝑖 ∩ 𝐷′ = {𝑁𝑖 } and 𝐷1𝑖 ∩ 𝐷1′ = {𝑁𝑖 }; 𝑁𝑖 is the initial symbol; the
scheme has rules of the form 𝑅𝑖 = {𝐶 → 𝑒𝐸, 𝐶 → 𝑐} (uppercase Latin characters are
non-terminal, lowercase ones are terminal). The non-terminal dictionaries of the grammars 𝐺𝑖′
are pairwise disjoint. The union:
𝐺 = 𝐺′ ∪ 𝐺1′ ∪ 𝐺2′ ∪ … ∪ 𝐺𝑛′ , (15)
where the main dictionary is 𝐷 in all 𝐺𝑖′ , and the auxiliary dictionary and the scheme are:
𝐷1 = 𝐷′ ∪ 𝐷1′ ∪ 𝐷11 ∪ 𝐷12 ∪ . . . ∪ 𝐷1𝑛 , 𝑅 = 𝑅′ ∪ 𝑅1 ∪ 𝑅2 ∪ … ∪ 𝑅𝑛 . (16)
The grammar 𝐺 is special and equivalent to an automatic one, for example:
𝑅′ = {𝐼 → 𝐵𝑁1, 𝐵 → 𝐶𝑁1, 𝐶 → 𝐵𝑁2, 𝐶 → 𝐸𝑁3, 𝐸 → 𝐸𝑁4, 𝐸 → 𝑁2},
𝑅1 = {𝑁1 → 𝑏𝑃1, 𝑃1 → 𝑎𝑄1, 𝑄1 → 𝑎𝑄1, 𝑄1 → 𝑐}, 𝑅2 = {𝑁2 → 𝑑}, (17)
𝑅3 = {𝑁3 → 𝑎𝑃3, 𝑁3 → 𝑏𝑄3, 𝑁3 → 𝑐𝑊3, 𝑃3 → 𝑎, 𝑄3 → 𝑏, 𝑊3 → 𝑑𝑊3, 𝑊3 → 𝑒𝑊3, 𝑊3 → 𝑑},
𝑅4 = {𝑁4 → 𝑐𝑃4, 𝑃4 → 𝑏}.
Algorithm 4.3. Algorithm of sentence syntactic analysis.
Stage 1. Based on the rules of 𝑅′, a sequence of 𝑁𝑖 (syntactic groups or sentences) expanding without bound to the right is generated.
Stage 2. Each 𝑁𝑖 is expanded, based on 𝑅𝑖, without bound in the form of a tree (Fig. 24) from right to left, into a chain of terminal symbols, i.e. words.
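The two stages above can be sketched in Python using the example rules (17). The dictionary encoding of productions and the random choice among alternatives are illustrative assumptions, not the paper's implementation; the sketch only demonstrates the two-phase derivation (skeleton from 𝑅′, then expansion of each 𝑁𝑖 by its regular grammar 𝑅𝑖).

```python
import random

# Rules (17): the skeleton grammar R' derives a right-expanding sequence of
# Ni symbols; each Ni then expands through its own regular grammar Ri.
R_prime = {
    "I": [["B", "N1"]],
    "B": [["C", "N1"]],
    "C": [["B", "N2"], ["E", "N3"]],
    "E": [["E", "N4"], ["N2"]],
}
R_i = {
    "N1": [["b", "P1"]], "P1": [["a", "Q1"]], "Q1": [["a", "Q1"], ["c"]],
    "N2": [["d"]],
    "N3": [["a", "P3"], ["b", "Q3"], ["c", "W3"]],
    "P3": [["a"]], "Q3": [["b"]],
    "W3": [["d", "W3"], ["e", "W3"], ["d"]],
    "N4": [["c", "P4"]], "P4": [["b"]],
}

def derive(symbol, rules, rng):
    """Leftmost derivation: expand non-terminals with randomly chosen
    productions until only symbols absent from `rules` remain."""
    out, queue = [], [symbol]
    while queue:
        s = queue.pop(0)
        if s in rules:
            queue = list(rng.choice(rules[s])) + queue
        else:
            out.append(s)
    return out

def generate_sentence(rng):
    # Stage 1: derive the skeleton ...Ni... from R' (unlimited to the right).
    skeleton = derive("I", R_prime, rng)
    # Stage 2: expand every Ni into a chain of terminal symbols (words).
    words = []
    for ni in skeleton:
        words.extend(derive(ni, R_i, rng))
    return words

print("".join(generate_sentence(random.Random(7))))
```

Because 𝑅′ is recursive (𝐵 ↔ 𝐶, 𝐸 → 𝐸𝑁4), each run yields a skeleton of a different, unbounded length, mirroring the unlimited right subordination described above.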
To analyze the syntactic structure of a sentence is to identify the order of words according to their syntactic relationships, determining for each pair of neighbours which element is the head and which is derived/secondary. It is advisable to modify the grammar so that both parts of the predicate (Fig. 24) are trees of syntactic relations. Letters with subscripts denote syntactic relations of various types; the symbols 𝐴, 𝐵, 𝐶, … are syntactic categories.
Figure 24: Rules for building a tree (analogues of type-0 grammar rules and of context-free rules)
As a result, the syntactic structures (rather than phrases) of the language are obtained as one part of the generative grammar. The other part of this grammar is a calculus for the Ukrainian language that must take into account the logical derivation of linear word sequences, solving the problem of discontinuous constituents.
4.5. The method of semantic analysis of the Ukrainian language
Semantic analysis consists not only in identifying the content of the text but also in
generating data structures to which logical reasoning can be applied. Thematic Meaning
Representations (TMR) are used to encode sentences in the form of predicate structures
based on first-order logic or lambda calculus (λ-calculus). Network/graph structures are
used to encode interactions of predicates of relevant text features. Then a traversal is
implemented to analyze the centrality of terms or subjects and the reasons for the
relationships between elements.
Analysis of graphs, including ontology О, is usually not a complete SEM, but helps to form
part of important logical decisions/conclusions based on the taxonomy of concepts 𝑋:
𝑂: 𝑈𝐿𝑆𝑅 → 𝐶𝑜𝑛𝑐𝑒𝑝𝑡𝑠. (18)
The result of SEM based on the ontological model of the rules of the syntax of the
Ukrainian language О are weighted oriented graphs of the semantics of the text:
𝑂 = < 𝐶𝑜𝑛𝑐𝑒𝑝𝑡𝑠, 𝑅𝑒𝑙𝑎𝑡𝑖𝑜𝑛𝑠ℎ𝑖𝑝𝑠, 𝐹𝑢𝑛𝑐𝑡𝑖𝑜𝑛𝑠 >, (19)
where 𝑅𝑒𝑙𝑎𝑡𝑖𝑜𝑛𝑠ℎ𝑖𝑝𝑠 is a tuple of relationships between SA concepts of the Ukrainian
language; 𝐶𝑜𝑛𝑐𝑒𝑝𝑡𝑠 is a tuple of SA concepts describing the rules of the Ukrainian language;
𝐹𝑢𝑛𝑐𝑡𝑖𝑜𝑛𝑠 is a tuple of functions for the interpretation of concepts/rules of the Ukrainian
language.
The taxonomy of concepts sets the syntax of the language as the root concept of the
ontology:
𝐶𝑜𝑛𝑐𝑒𝑝𝑡𝑠: < 𝑅𝑆𝑛𝑡 > → 𝐶′. (20)
The optimal definition of the tuple of relations between these concepts and of the tuple of rules of the Ukrainian language, formalized in description logic (DL), enables effective processing of Ukrainian texts:
𝐶𝑜𝑛𝑐𝑒𝑝𝑡𝑠 = < 𝑅𝑀𝑟𝑝, 𝑅𝑃𝑛𝑐, 𝑅𝑆𝑡𝑟, 𝑅𝑆𝑛𝑡, 𝑅𝑆𝑚𝑛 >, (21)
where 𝑅𝑀𝑟𝑝, 𝑅𝑃𝑛𝑐, 𝑅𝑆𝑡𝑟, 𝑅𝑆𝑛𝑡 (Fig. 25) and 𝑅𝑆𝑚𝑛 are the tuples of morphology, punctuation, structure, syntax and semantics concepts, respectively.
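The ontology of (19) and (21) can be sketched as plain data structures. This is a minimal Python illustration under stated assumptions: the attribute layout, the dictionary of weighted relations, and the sample rule strings are hypothetical, chosen only to show how the three tuples of (19) fit together.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Sketch of Concepts = <R_Mrp, R_Pnc, R_Str, R_Snt, R_Smn> from (21).
@dataclass
class Concepts:
    R_Mrp: List[str] = field(default_factory=list)  # morphology rules
    R_Pnc: List[str] = field(default_factory=list)  # punctuation rules
    R_Str: List[str] = field(default_factory=list)  # structure rules
    R_Snt: List[str] = field(default_factory=list)  # syntax rules
    R_Smn: List[str] = field(default_factory=list)  # semantics rules

# Sketch of O = <Concepts, Relationships, Functions> from (19).
@dataclass
class Ontology:
    concepts: Concepts
    # weighted oriented edges between concepts: (source, target) -> weight
    relationships: Dict[Tuple[str, str], float] = field(default_factory=dict)
    # interpretation functions attached to rules by name
    functions: Dict[str, Callable[[str], str]] = field(default_factory=dict)

o = Ontology(
    concepts=Concepts(R_Snt=["NP -> Adj N", "S -> NP VP"]),
    relationships={("S", "NP"): 1.0, ("NP", "Adj"): 0.5},
)
print(len(o.concepts.R_Snt), len(o.relationships))  # prints: 2 2
```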
In SEM, to identify the set of semes of the corresponding text and their relationships, a semantic graph of relations between linguistic units is first built from the results of SYA, taking into account the words' parts of speech:
𝐶′ = (𝐶, 𝐷, 𝑅, 𝐶𝑜𝑛𝑐𝑒𝑝𝑡𝑠), 𝐶𝑜𝑛𝑐𝑒𝑝𝑡𝑠 = < 𝐶𝑊𝑟𝑑𝐶𝑚𝑏, 𝐶𝑆𝑛𝑡𝐶𝑚𝑏 >, (22)
where 𝐶𝑊𝑟𝑑𝐶𝑚𝑏 is a tuple of word formation concepts; 𝐶𝑆𝑛𝑡𝐶𝑚𝑏 is a tuple of sentence generation concepts in the Ukrainian language (Fig. 26).
Tuple 𝐶𝑊𝑟𝑑𝐶𝑚𝑏 according to the rules of the Ukrainian language syntax (Fig. 26):
𝐶𝑊𝑟𝑑𝐶𝑚𝑏 =< 𝑆𝑔𝑛1𝑊𝑟𝑑 , 𝑆𝑔𝑛2𝑊𝑟𝑑 , 𝑆𝑔𝑛3𝑊𝑟𝑑 , 𝑆𝑔𝑛4𝑊𝑟𝑑 >, (23)
where 𝑆𝑔𝑛𝑖𝑊𝑟𝑑 is a tuple of phrase generation properties.
Figure 25: Class diagram for the Syntax Phrase type hierarchy (signs of phrases and sentences: lexical/syntactic; nominal, adjectival, numeral, pronoun, verb, adverb; simple/compound/complex; connection types: dividing, connective, controversial, coordination, management, adjoining)
The tuple 𝑆𝑔𝑛1𝑊𝑟𝑑 according to the rules of the Ukrainian language syntax (Fig. 26):
𝑆𝑔𝑛1𝑊𝑟𝑑 = < 𝑆𝑔𝑛𝐿𝑥𝑐𝐼, 𝑆𝑔𝑛𝑆𝑛𝑡𝐼 >, (24)
where 𝑆𝑔𝑛𝐿𝑥𝑐𝐼 is a tuple of lexical features of phrase generation; 𝑆𝑔𝑛𝑆𝑛𝑡𝐼 is a tuple of syntactic signs of phrase generation.
Figure 26: Class diagram for the sentence type hierarchy (signs: declarative/interrogative/imperative; emotionally neutral/emotionally coloured; affirmative/negative; simple/complex; main and minor members of the sentence)
𝑆𝑔𝑛2𝑊𝑟𝑑 = < 𝑆𝑔𝑛𝑁𝑜𝑢𝐼𝐼, 𝑆𝑔𝑛𝐴𝑑𝑐𝐼𝐼, 𝑆𝑔𝑛𝑁𝑚𝑟𝐼𝐼, 𝑆𝑔𝑛𝑃𝑟𝑛𝐼𝐼, 𝑆𝑔𝑛𝑉𝑟𝑏𝐼𝐼, 𝑆𝑔𝑛𝐴𝑑𝑣𝐼𝐼 >, (25)
where 𝑆𝑔𝑛𝑁𝑜𝑢𝐼𝐼 is a tuple of noun properties; 𝑆𝑔𝑛𝐴𝑑𝑐𝐼𝐼 is a tuple of adjectival properties; 𝑆𝑔𝑛𝑁𝑚𝑟𝐼𝐼 is a tuple of numeral properties; 𝑆𝑔𝑛𝑃𝑟𝑛𝐼𝐼 is a tuple of pronominal properties; 𝑆𝑔𝑛𝑉𝑟𝑏𝐼𝐼 is a tuple of verb properties; 𝑆𝑔𝑛𝐴𝑑𝑣𝐼𝐼 is a tuple of adverbial properties;
𝑆𝑔𝑛3𝑊𝑟𝑑 = < 𝑆𝑔𝑛𝐶𝑟𝑑𝐼𝐼𝐼, 𝑆𝑔𝑛𝐼𝑛𝑓𝐼𝐼𝐼 >, (26)
where 𝑆𝑔𝑛𝐶𝑟𝑑𝐼𝐼𝐼 is a tuple of coordinating properties and 𝑆𝑔𝑛𝐼𝑛𝑓𝐼𝐼𝐼 is a tuple of subordinating properties;
𝑆𝑔𝑛4𝑊𝑟𝑑 = < 𝑆𝑔𝑛𝑆𝑚𝑊𝑑𝐼𝑉, 𝑆𝑔𝑛𝐶𝑚𝑊𝑑𝐼𝑉 >, (27)
where 𝑆𝑔𝑛𝑆𝑚𝑊𝑑𝐼𝑉 is a tuple of simple properties and 𝑆𝑔𝑛𝐶𝑚𝑊𝑑𝐼𝑉 is a tuple of complex properties. The tuple 𝑆𝑔𝑛𝐶𝑟𝑑𝐼𝐼𝐼 describes the component properties of the coordinate connection:
𝑆𝑔𝑛𝐶𝑟𝑑𝐼𝐼𝐼 = < 𝑆𝑔𝑛𝐴𝑑𝐶𝑚𝐶𝑟𝑑, 𝑆𝑔𝑛𝐶𝑛𝐶𝑚𝐶𝑟𝑑, 𝑆𝑔𝑛𝐷𝑣𝐶𝑚𝐶𝑟𝑑 >, (28)
where 𝑆𝑔𝑛𝐴𝑑𝐶𝑚𝐶𝑟𝑑 is a tuple of properties of separating connections, 𝑆𝑔𝑛𝐶𝑛𝐶𝑚𝐶𝑟𝑑 of connecting connections and 𝑆𝑔𝑛𝐷𝑣𝐶𝑚𝐶𝑟𝑑 of opposite connections;
𝑆𝑔𝑛𝐼𝑛𝑓𝐼𝐼𝐼 = < 𝑆𝑔𝑛𝐶𝑡𝐶𝑚𝐼𝑛𝑓, 𝑆𝑔𝑛𝑀𝑔𝐶𝑚𝐼𝑛𝑓, 𝑆𝑔𝑛𝐴𝑔𝐶𝑚𝐼𝑛𝑓 >, (29)
where 𝑆𝑔𝑛𝐶𝑡𝐶𝑚𝐼𝑛𝑓 is a tuple of coordination (agreement) properties; 𝑆𝑔𝑛𝑀𝑔𝐶𝑚𝐼𝑛𝑓 is a tuple of management (government) properties; 𝑆𝑔𝑛𝐴𝑔𝐶𝑚𝐼𝑛𝑓 is a tuple of adjoining properties.
A tuple of sentence generation concepts in the Ukrainian language (Fig. 26):
𝐶𝑆𝑛𝑡𝐶𝑚𝑏 = < 𝑆𝑔𝑛1𝑆𝑛𝑡, 𝑆𝑔𝑛2𝑆𝑛𝑡, 𝑆𝑔𝑛3𝑆𝑛𝑡, 𝑆𝑔𝑛𝑆𝑛𝑀𝑏𝑆𝑛𝑡 >, (30)
where 𝑆𝑔𝑛𝑖𝑆𝑛𝑡 are tuples of sentence generation properties in the Ukrainian language; 𝑆𝑔𝑛𝑆𝑛𝑀𝑏𝑆𝑛𝑡 is a tuple of properties for identifying sentence members;
𝑆𝑔𝑛1𝑆𝑛𝑡 = < 𝑆𝑔𝑛𝑁𝑟𝑆𝑛𝐼, 𝑆𝑔𝑛𝑃𝑟𝑆𝑛𝐼, 𝑆𝑔𝑛𝐼𝑛𝑆𝑛𝐼 >, (31)
where 𝑆𝑔𝑛𝑁𝑟𝑆𝑛𝐼 is a tuple of properties for generating declarative (narrative) sentences; 𝑆𝑔𝑛𝑃𝑟𝑆𝑛𝐼 is a tuple of properties for generating interrogative sentences; 𝑆𝑔𝑛𝐼𝑛𝑆𝑛𝐼 is a tuple of properties for generating imperative (incentive) sentences;
𝑆𝑔𝑛2𝑆𝑛𝑡 = < 𝑆𝑔𝑛𝐸𝑚𝑁𝑡𝐼𝐼, 𝑆𝑔𝑛𝐸𝑚𝐶𝑙𝐼𝐼 >, (32)
where 𝑆𝑔𝑛𝐸𝑚𝑁𝑡𝐼𝐼 is a tuple of properties for generating emotionally neutral sentences; 𝑆𝑔𝑛𝐸𝑚𝐶𝑙𝐼𝐼 is a tuple of properties for generating emotionally coloured sentences;
𝑆𝑔𝑛3𝑆𝑛𝑡 = < 𝑆𝑔𝑛𝑆𝑙𝑆𝑡𝐼𝐼𝐼, 𝑆𝑔𝑛𝐶𝑙𝑆𝑡𝐼𝐼𝐼 >, (33)
where 𝑆𝑔𝑛𝑆𝑙𝑆𝑡𝐼𝐼𝐼 and 𝑆𝑔𝑛𝐶𝑙𝑆𝑡𝐼𝐼𝐼 are tuples of concepts for forming simple and complex sentences, respectively;
𝑆𝑔𝑛𝑆𝑛𝑀𝑏𝑆𝑛𝑡 = < 𝑆𝑔𝑛𝑀𝑛𝑆𝑡𝑀𝑏𝑆𝑛𝑀𝑏, 𝑆𝑔𝑛𝑆𝑑𝑆𝑡𝑀𝑏𝑆𝑛𝑀𝑏 >, (34)
where 𝑆𝑔𝑛𝑀𝑛𝑆𝑡𝑀𝑏𝑆𝑛𝑀𝑏 is a tuple of properties identifying the main members of the sentence; 𝑆𝑔𝑛𝑆𝑑𝑆𝑡𝑀𝑏𝑆𝑛𝑀𝑏 is a tuple of properties identifying the secondary members of the sentence;
𝑆𝑔𝑛𝑁𝑟𝑆𝑛𝐼 = < 𝑆𝑔𝑛𝐴𝑓𝑆𝑡𝑁𝑟𝑆𝑛, 𝑆𝑔𝑛𝑁𝑔𝑆𝑡𝑁𝑟𝑆𝑛 >, (35)
where 𝑆𝑔𝑛𝐴𝑓𝑆𝑡𝑁𝑟𝑆𝑛 is a tuple of affirmative sentence generation properties; 𝑆𝑔𝑛𝑁𝑔𝑆𝑡𝑁𝑟𝑆𝑛 is a tuple of negative sentence generation properties.
To generate a simple sentence 𝑆𝑔𝑛𝑆𝑙𝑆𝑡𝐼𝐼𝐼, the following signs are analyzed (Fig. 27):
𝑆𝑔𝑛𝑆𝑙𝑆𝑡𝐼𝐼𝐼 = < 𝑆𝑔𝑛1𝑆𝑙𝑆𝑡, 𝑆𝑔𝑛2𝑆𝑙𝑆𝑡, 𝑆𝑔𝑛3𝑆𝑙𝑆𝑡, 𝑆𝑔𝑛4𝑆𝑙𝑆𝑡, 𝑆𝑔𝑛5𝑆𝑙𝑆𝑡, 𝑆𝑔𝑛6𝑆𝑙𝑆𝑡, 𝑆𝑔𝑛7𝑆𝑙𝑆𝑡, 𝑆𝑔𝑛8𝑆𝑙𝑆𝑡 >, (36)
where 𝑆𝑔𝑛𝑖𝑆𝑙𝑆𝑡 is a tuple of simple sentence generation properties.
Similarly, tuples are formed to identify the members of the sentence 𝑆𝑔𝑛𝑆𝑛𝑀𝑏𝑆𝑛𝑡 (Fig. 28, Fig. 29) and the complex sentence 𝑆𝑔𝑛𝐶𝑙𝑆𝑡𝐼𝐼𝐼 (Fig. 30).
Figure 27: Class diagram for a hierarchy of the type Simple sentence (signs: uncommon/common; uncomplicated/complicated; simple noun/verb; with separated parts of the sentence; with appeals; with built-in components; with embedded components)
Figure 28: Class diagram for a hierarchy of the type Sentence Members (main members: simple/composite subject and predicate; secondary members: coordinated/uncoordinated adjunct, direct/indirect object, adverbial modifier)
Figure 29: Class diagram for the Circumstance (adverbial modifier) type hierarchy (of purpose, time, manner, place, cause, condition, concession)
Figure 30: Class diagram for the Complex Sentence type hierarchy (conjunctive sentences: the compound sentence with connective/opposite coordination, the complex sentence with homogeneous/heterogeneous members and attributive/subject/adverbial clauses; unconjunctive sentences; complex syntactic constructions with combined conjunctive and unconjunctive connections or with several subjects)
The process of extracting data from Ukrainian-language text based on the syntax ontology makes it possible to supplement the weighted concept graphs of the content.
4.6. The method of pragmatic analysis of the Ukrainian language
Pragmatics examines the dependence of meaning on the context of the author's textual content and takes into account the author's prior knowledge, intentions, purpose, etc., in contrast to semantics, which analyzes meaning itself from the results of GA, MA, LA and SYA within a particular text. Pragmatics is a continuation of SEM that accounts for the context of the analysed text and the ambiguity of its statements, drawing on the author's statements in previous similar texts and on the time, place, method, purpose and other circumstances of the conversation.
In PA, when resolving the ambiguity of the author's speech in a specific analyzed text while taking into account the features of the author's speech in previous similar texts, it is best to use word prediction models, for example, N-gram language models (LM).
Each speaker, as a person with a unique life experience, has not only his dictionary of
thematic words but also a unique handwriting of the use of these words and their sequence
in a certain context of the relevant thematic direction. In the expression «лінгвістична
система опрацьовує …» [linhvistychna systema opratsʹovuye …] (the linguistic system
processes ...) the next word depends not only on the context but also on the so-called speech
handwriting of the author of the text: текст, контент, текстовий контент, вхідні дані,
вхідну інформацію, інтегровані дані, авторський контент, публікації [tekst, kontent,
tekstovyy kontent, vkhidni dani, vkhidnu informatsiyu, intehrovani dani, avtorsʹkyy
kontent, publikatsiyi] (text, content, text content, input data, input information, integrated
data, author content, publications), etc. The phrase «включіть свою виконану
лабораторну роботу ...» [vklyuchitʹ svoyu vykonanu laboratornu robotu ...] (include your
completed lab work...) as opposed to «додайте свою виконану лабораторну роботу ...»
[dodayte svoyu vykonanu laboratornu robotu ...] (add your completed lab work...) has a
broader meaning and depends significantly not only on the context but also on the speaker
(include can mean like download the developed software on the computer or in the sense
of adding it as an item to some list, etc.). Dialogue participants intuitively understand the
content based on their experience of communicating with the author of the phrase.
Pragmatic analysis requires the introduction of models that determine the probability for
each subsequent word. They are also intended for assigning the probability of the target
utterance for correct machine translation, identification/correction of grammatical and
stylistic errors, and handwriting or language recognition. Each language has special
statistical parameters, and the analysis of the probability of the appearance of only letters
and their combinations as N-grams of the corresponding language makes it possible to
identify the language itself or the style of the author (Fig. 31 - with greater probability, the
author of the benchmark wrote Excerpt 1).
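The letter-frequency comparison just described can be sketched in a few lines of Python. The mini-texts standing in for the benchmark and the two excerpts are hypothetical, and the L1 distance between frequency profiles is one simple choice of similarity measure among several; the point is only that the excerpt whose letter distribution lies closer to the benchmark is the more likely match of language or authorial style.

```python
from collections import Counter

def letter_profile(text):
    """Relative frequency of each alphabetic character in the text."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    n = len(letters)
    return {c: counts[c] / n for c in counts}

def profile_distance(p, q):
    """L1 distance between two letter-frequency profiles."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical mini-texts standing in for the benchmark and the excerpts.
benchmark = "комп'ютерна лінгвістична система опрацьовує контент"
excerpt_1 = "лінгвістична система опрацьовує текстовий контент"
excerpt_2 = "zebra quartz jazz fjord quickly vexing"

d1 = profile_distance(letter_profile(benchmark), letter_profile(excerpt_1))
d2 = profile_distance(letter_profile(benchmark), letter_profile(excerpt_2))
# The excerpt with the smaller distance is the more probable match.
print(d1 < d2)  # prints: True
```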
Figure 31: Probability of appearance of letters in the benchmark and analyzed passages (series: Benchmark, Excerpt 1, Excerpt 2; horizontal axis: the word gap <> and the letters о, а, н, и, в, т, е, р, с, м, к, л, д, у, п, я, з, б, ч, г, ю, х, ц, ж, й, ш, щ, ф, …)
For Ukrainian texts, the statistical parameters of styles are the probabilities of vowels,
consonants, and gaps between words, as well as soft and sonorous groups of consonants.
Probability is also important for enhancing communication. Physicist Stephen Hawking
used simple movements to select words from a menu for speech synthesis. For such IS, it is
appropriate to use word prediction to generate suggestions for a list of likely words for the
menu. One of the most widespread and easiest LMs to implement for English-language texts is the N-gram model, which assigns probabilities to sentences or sequences of words. For Ukrainian-language texts, it is better to apply such an LM to the sequence of word bases without inflexions (otherwise incorrect PA results are obtained), calculating 𝑃(𝑏|𝑎), the probability of the appearance of the word base 𝑏 after the sequence of bases 𝑎. Taking words into account in the N-grams of an LM for Ukrainian-language texts is appropriate for identifying grammatical errors:
𝑃(систем|комп′ютер лінгвіст), 𝑃(системи|комп′ютерні лінгвістичні), 𝑃(систему|комп′ютерну лінгвістичну). (37)
One of the best ways to calculate such a probability is to conduct a statistical analysis on large corpora of texts of the relevant author or relevant thematic direction from reliable Internet sources:
𝑃(систем|комп′ютер лінгвіст) = 𝑁(комп′ютер лінгвіст систем) / 𝑁(комп′ютер лінгвіст), (38)
𝑃(систем|комп′ютер лінгвіст) = 𝑃(комп′ютер лінгвіст систем) / 𝑃(комп′ютер лінгвіст).
This gives a probabilistic result for a certain period because the language is creative, not
homogeneous, and the vocabulary is updated and constantly develops both in general and
for a specific speaker - the author of the text. To analyze the corresponding random
linguistic event 𝐴𝑖 = комп′ ют, 𝑃(𝐴𝑖 ) is found to calculate the probability of the appearance
of a certain sequence of linguistic events based on the chain rule or the general product rule
(chain rule of probability):
𝑃(𝐴1𝐴2 … 𝐴𝑛) = 𝑃(𝐴1)𝑃(𝐴2|𝐴1)𝑃(𝐴3|𝐴1^2) … 𝑃(𝐴𝑛|𝐴1^(𝑛−1)) = ∏_{𝑖=1}^{𝑛} 𝑃(𝐴𝑖|𝐴1^(𝑖−1)). (39)
To analyze a sequence of 𝑁 word bases 𝑥1𝑥2 … 𝑥𝑛, denoted 𝑥1^𝑛 (so 𝑥1𝑥2 … 𝑥𝑛−1 = 𝑥1^(𝑛−1)), with 𝐴1 = 𝑥1, 𝐴2 = 𝑥2, …, 𝐴𝑛 = 𝑥𝑛, calculate:
𝑃(𝑥1𝑥2 … 𝑥𝑛) = 𝑃(𝑥1^𝑛) = 𝑃(𝑥1)𝑃(𝑥2|𝑥1)𝑃(𝑥3|𝑥1^2) … 𝑃(𝑥𝑛|𝑥1^(𝑛−1)), (40)
𝑃(𝑥1^𝑛) = ∏_{𝑖=1}^{𝑛} 𝑃(𝑥𝑖|𝑥1^(𝑖−1)). (41)
The chain rule reflects the relationship between the overall probability of the
appearance of a specific sequence of bases and the conditional probability of the appearance
of a word base by specific previous word bases in this sequence. Taking into account the
entire dynamics of the occurrence of all word bases in the text to sequences of other word
bases is a redundant/inefficient process due to the variability of language/speech over time.
Prediction of the 2-gram model consists of approximating the dynamics of the appearance
of only the last few bases of words in a given sequence:
𝑃(систем|лінгвіст) = 𝑁(лінгвіст систем) / 𝑁(лінгвіст), 𝑃(систем|лінгвіст) = 𝑃(лінгвіст систем) / 𝑃(лінгвіст). (42)
To forecast the conditional probability of the following base of the word, we use the
Markov assumption (the probability of the word depends only on the previous one):
𝑃(𝑥𝑛|𝑥1^(𝑛−1)) ≈ 𝑃(𝑥𝑛|𝑥𝑛−1). (43)
To predict the conditional probability of the next base of the word in the N-gram based
on the metric of Maximum (greatest) Likelihood Estimation (MLE) we calculate:
𝑃(𝑥𝑛|𝑥1^(𝑛−1)) ≈ 𝑃(𝑥𝑛|𝑥𝑛−𝑘+1^(𝑛−1)). (44)
Based on this, we calculate the probability of a complete sequence of word stems:
𝑃(𝑥1^𝑛) ≈ ∏_{𝑖=1}^{𝑛} 𝑃(𝑥𝑖|𝑥𝑖−1). (45)
We find the MLE estimate for the parameters of the N-gram model by statistically
analyzing the corresponding text corpus and normalizing the frequency of occurrences of
word bases and their sequences within [0;1]:
𝑃(𝑥𝑛|𝑥𝑛−1) = 𝑁(𝑥𝑛−1𝑥𝑛) / ∑𝑥 𝑁(𝑥𝑛−1𝑥) = 𝑁(𝑥𝑛−1𝑥𝑛) / 𝑁(𝑥𝑛−1). (46)
For example, for the three sentences of the mini-corpus below (conditionally, the <p> … </p> tags mark the boundaries of one sentence), we will calculate the 2-gram occurrence of word bases under the Markov assumption:
CLS опрацьовує текстовий контент на основі NLP-процесів
Інтеграція текстового контенту є одним із основних процесів CLS
CLS розв’язує конкретну NLP-задачу для відповідного контенту
𝑃(𝐶𝐿𝑆|<𝑝>) = 2/3; 𝑃(інтегр|<𝑝>) = 1/3; 𝑃(опрац|𝐶𝐿𝑆) = 1/3;
𝑃(</𝑝>|контент) = 1/3; 𝑃(контент|текст) = 2/3; 𝑃(задач|𝑁𝐿𝑃) = 1/2.
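The MLE bigram estimates for this mini-corpus can be reproduced directly from (46). In the sketch below the sentences are pre-stemmed by hand for illustration (a real pipeline would apply the modified Porter stemmer first, and the hyphenated compounds NLP-процесів, NLP-задачу are split into two bases); exact fractions are used so the results match the values above.

```python
from collections import Counter
from fractions import Fraction

# The three mini-corpus sentences, pre-stemmed by hand for illustration.
corpus = [
    ["CLS", "опрац", "текст", "контент", "на", "основ", "NLP", "процес"],
    ["інтегр", "текст", "контент", "є", "одн", "із", "основн", "процес", "CLS"],
    ["CLS", "розв'яз", "конкретн", "NLP", "задач", "для", "відповідн", "контент"],
]

# Wrap every sentence in <p> ... </p> boundary tags and count 2-grams.
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<p>"] + sent + ["</p>"]
    unigrams.update(toks[:-1])              # possible left contexts
    bigrams.update(zip(toks[:-1], toks[1:]))

def p(word, prev):
    """MLE bigram estimate P(word | prev) = N(prev word) / N(prev), eq. (46)."""
    return Fraction(bigrams[(prev, word)], unigrams[prev])

print(p("CLS", "<p>"))     # prints: 2/3
print(p("опрац", "CLS"))   # prints: 1/3
print(p("задач", "NLP"))   # prints: 1/2
```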
Estimation of the MLE parameter for the N-gram model as a relative frequency:
𝑃(𝑥𝑛|𝑥𝑛−𝑘+1^(𝑛−1)) = 𝑁(𝑥𝑛−𝑘+1^(𝑛−1)𝑥𝑛) / 𝑁(𝑥𝑛−𝑘+1^(𝑛−1)). (47)
Algorithm 4.4. Algorithm for the analysis of MLE-parameter estimates for the N-gram
model.
Stage 1. Parse the input text and break it into separate phrases (sentences) 𝑅1𝑅2 … 𝑅𝑚, marking each start and end with the corresponding <p> … </p> tags. Eliminate all non-alphabetic characters. Convert uppercase letters to lowercase. Remove service (stop) words if necessary (for certain NLP tasks).
Stage 2. Apply Porter's stemming to obtain the sequence of word bases 𝑥𝑖1𝑥𝑖2 … 𝑥𝑖𝑛𝑖 of each sentence 𝑅𝑖, taking into account word normalization.
Stage 3. Receive input queries 𝑄1𝑄2 … 𝑄𝑘 as sequences of words of the searched data. For each query 𝑄𝑗, find the base of each word 𝑦𝑗1𝑦𝑗2 … 𝑦𝑗𝑘𝑗 by stemming.
For example, for the search phrase 𝑄𝑗 (with the occurrence count of each base in the analyzed text):
𝑦𝑗1 … 𝑦𝑗10: метод, та, засіб, опрац, інформ, ресурс, систем, електрон, контент, комерц
𝑁(𝑦𝑗𝑖): 58, 190, 25, 62, 122, 83, 170, 89, 408, 300
Stage 4. Conduct a statistical analysis of the occurrence of word bases and sequences of query word
bases in the analyzed text.
Basics of words of 𝑥𝑖1 𝑥𝑖2 𝑥𝑖3 𝑥𝑖4 𝑥𝑖5 𝑥𝑖6 𝑥𝑖7 𝑥𝑖8 𝑥𝑖9 𝑥𝑖10
analyzed text метод та засіб опрац інформ ресурс систем електрон контент комерц
𝑥𝑖1 метод 0 8 0 6 0 0 0 0 1 0
𝑥𝑖2 та 2 0 5 1 7 0 2 0 0 1
𝑥𝑖3 засіб 0 2 0 14 0 0 0 0 0 0
𝑥𝑖4 опрац 0 0 0 0 46 0 0 1 3 4
𝑥𝑖5 інформ 0 0 0 0 0 64 9 0 0 0
𝑥𝑖6 ресурс 0 7 0 0 0 0 0 1 0 0
𝑥𝑖7 систем 0 8 0 1 0 0 0 21 0 0
𝑥𝑖8 електрон 0 0 0 0 0 0 0 0 72 10
𝑥𝑖9 контент 0 10 0 0 0 0 0 0 0 73
𝑥𝑖10 комерц 0 6 0 0 0 0 0 0 176 0
Stage 5. Find the probability of occurrence of 2-grams in the analyzed text: for normalization, each value in row 𝑖 is divided by the occurrence count 𝑦𝑗𝑖 of that row's word base.
Basics of words 𝑥𝑖1 𝑥𝑖2 𝑥𝑖3 𝑥𝑖4 𝑥𝑖5 𝑥𝑖6 𝑥𝑖7 𝑥𝑖8 𝑥𝑖9 𝑥𝑖10
of analyzed text метод та засіб опрац інформ ресурс систем електрон контент комерц 𝑦𝑗𝑖
𝑥𝑖1 метод 0 0.18 0 0.1 0 0 0 0 0.02 0 58
𝑥𝑖2 та 0.01 0 0.03 0.005 0.035 0 0.01 0 0 0.005 190
𝑥𝑖3 засіб 0 0.08 0 0.16 0 0 0 0 0 0 25
𝑥𝑖4 опрац 0 0 0 0 0.74 0 0 0.016 0.048 0.064 62
𝑥𝑖5 інформ 0 0 0 0 0 0.52 0.074 0 0 0 122
𝑥𝑖6 ресурс 0 0.084 0 0 0 0 0 0.012 0 0 83
𝑥𝑖7 систем 0 0.047 0 0.006 0 0 0 0.124 0 0 170
𝑥𝑖8 електрон 0 0 0 0 0 0 0 0 0.81 0.112 89
𝑥𝑖9 контент 0 0.025 0 0 0 0 0 0 0 0.179 408
𝑥𝑖10 комерц 0 0.02 0 0 0 0 0 0 0.053 0 300
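Stage 5's row-wise normalization can be sketched as follows. The two rows below are a hypothetical excerpt of the Stage 4 count table, and dividing each count by the row base's occurrence count follows the text; this is an illustration, not the paper's implementation.

```python
# Row-wise normalization of 2-gram counts (Stage 5): each count N(x_i x_j)
# is divided by the occurrence count of the row's word base.
counts = {
    "метод": {"опрац": 6, "контент": 1},                     # excerpt of row x_i1
    "опрац": {"інформ": 46, "електрон": 1, "контент": 3},    # excerpt of row x_i4
}
occurrences = {"метод": 58, "опрац": 62}  # the y_ji column

probs = {
    base: {nxt: n / occurrences[base] for nxt, n in row.items()}
    for base, row in counts.items()
}
print(round(probs["опрац"]["інформ"], 2))  # prints: 0.74, as in the table
```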
With each subsequent multiplication, the probability decreases. Applying the logarithm of probabilities (log probabilities) makes it possible to operate with values that are not vanishingly small, preserving calculation accuracy:
∏_{𝑖=1}^{𝑛} 𝑃𝑖 = 𝑒^(∑_{𝑖=1}^{𝑛} log 𝑃𝑖). (48)
The resulting matrices will in most cases be sparse. For the phrase and its variations (plural/singular and cases) система електронної контент-комерції [systema elektronnoyi kontent-komertsiyi] (electronic content commerce system):
𝑃(систем електрон контент комерц) = 𝑃(електрон|систем) 𝑃(контент|електрон) 𝑃(комерц|контент) = 0.124 ⋅ 0.81 ⋅ 0.179 = 0.01797876.
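This chain product and the log-probability identity (48) can be checked in a few lines; the three probabilities are read from the Stage 5 table above.

```python
import math

# Bigram probabilities from the Stage 5 table:
# P(електрон|систем), P(контент|електрон), P(комерц|контент)
chain = [0.124, 0.81, 0.179]

# Direct product of the chain.
direct = 1.0
for p in chain:
    direct *= p

# Equation (48): multiply small probabilities by summing their logs instead.
via_logs = math.exp(sum(math.log(p) for p in chain))

print(round(direct, 8))                  # prints: 0.01797876
print(math.isclose(direct, via_logs))    # prints: True
```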
5. Conclusions
The general architecture of computer linguistic systems is developed based on the main
processes of processing information resources such as integration, maintenance and
content management, as well as using methods of intellectual and linguistic analysis of text
flow using machine learning technology. The IT of intellectual analysis of the text flow based
on the processing of information resources has been improved, which made it possible to
adapt the generally typical structure of content integration, management and support
modules to solve various NLP problems and increase the efficiency of CLS functioning by 6-
9%. This became possible thanks to the combination of linguistic analysis methods adapted
to the Ukrainian language, improved IT processing of information resources, ML and a set
of metrics for evaluating the effectiveness of CLS functioning. The main principle of building
such CLS is modularity, which facilitates their construction according to the requirements
for the availability of appropriate processes for solving a specific NLP problem. The main
NLP methods based on regular expression matching with patterns in grapheme and
morphological analyses of Ukrainian-language texts are described. NLP methods based on
pattern-matching regular expressions have been improved, which made it possible to adapt
methods of text tokenization and normalization by cascades of simple substitutions of
regular expressions and finite state machines. The main valid operations of regular
expressions are defined as union and disjunction of symbols/strings/expressions, number
and precedence operators, as well as anchors as special symbols for identifying the
presence/absence of symbols in RE. The main stages of tokenization and normalization of
the Ukrainian text by cascades of simple substitutions of regular expressions and finite state
machines are defined. The MA method for Ukrainian-language text, based on word segmentation and normalization, sentence segmentation and a modified Porter stemming algorithm, was improved as an effective means of identifying lemma affixes for marking the analysed word, which made it possible to increase the accuracy of keyword searches by 9%. The corresponding algorithms for word segmentation and normalization, sentence segmentation, and modified Porter stemming are implemented and described.
Unlike the classic Porter algorithm (it does not have high accuracy even for English-
language texts), the modified one is adapted specifically for the Ukrainian language and
gives an accurate result in 85-93% of cases, depending on the quality, style, genre of the text
and, accordingly, the content of CLS dictionaries. The minimum edit distance algorithm for strings of Ukrainian text is also described: the minimum number of edit operations necessary to transform one string into another.