<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Mathematical model of a decision support system for identification and correction of errors in Ukrainian texts based on machine learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rostyslav Fedchuk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Victoria Vysotska</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Systems and Networks Department, Lviv Polytechnic National University</institution>
          ,
          <addr-line>12 Bandera Str., 79013 Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This research presents a mathematical model for a decision support system aimed at identifying and correcting errors in Ukrainian-language texts. The system combines two key components: error detection as token-level multi-class classification and error correction as context-aware text generation. Special attention is given to the structural and grammatical complexity of the Ukrainian language. Probabilistic models and machine learning techniques are used to classify errors and suggest corrections. The model accounts for dependencies between tokens and linguistic features specific to Ukrainian. Methods for text vectorization are mathematically justified based on morphological and syntactic characteristics. The correction module generates accurate replacements for erroneous tokens using contextual modeling. This approach enables high accuracy and adaptability in processing Ukrainian texts. The proposed system provides a universal foundation for automated Ukrainian text editing.</p>
      </abstract>
      <kwd-group>
        <kwd>Error identification</kwd>
        <kwd>error correction</kwd>
        <kwd>NLP</kwd>
        <kwd>GEC</kwd>
        <kwd>Ukrainian language</kwd>
        <kwd>machine learning</kwd>
        <kwd>text generation</kwd>
        <kwd>probabilistic model</kwd>
        <kwd>text processing</kwd>
        <kwd>morphology</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the rapid growth of text content, the problem of automatic text verification and correction is
becoming particularly urgent. For English, many studies have been conducted, and
significant progress has been made on the grammatical error correction (GEC) problem [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, for Ukrainian
there is still a lack of high-quality research that considers the peculiarities
of the language: its morphological structure, dialect diversity, syntactic variety,
and context-dependent relations between words. Errors in texts can be of different kinds (spelling, grammatical,
punctuation, semantic, and others), and correcting them is necessary to improve the quality of
information and to ensure that users perceive it correctly.
      </p>
      <p>The main approach to solving the GEC problem includes two interrelated stages: identification
of errors in the text and generation of corrections. Historically, traditional methods were used to
solve these problems, but their effectiveness is significantly inferior to modern machine learning
models.</p>
      <p>
        Traditional GEC systems are typically built on a set of linguistic rules [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] formulated by experts
and dictionary databases. Such systems check spelling, agreement in number, gender, or case between
words, and even basic punctuation. For example, they reliably detect incorrect word
forms (“великий книга”, a masculine adjective with a feminine noun) or words wrongly written separately (“на приклад” → “наприклад”).
However, the complex and varied grammatical dependencies of the Ukrainian
language impose numerous limitations on such systems. They do not consider contexts in which
rules change their interpretation, for example, in colloquial or literary language, where verbal
abbreviations, non-standard constructions, or stylistic variations occur. In addition, creating and
maintaining an extensive set of rules requires significant human and time
resources. As a result, traditional systems perform poorly in real-world conditions
and adapt badly to new data and text styles.
      </p>
      <p>More sophisticated systems based on statistical machine translation (SMT) were the next step in
the development of GEC and offered a more flexible approach to text correction. Such
models are trained on large parallel corpora in which a correct and
an incorrect version of each sentence are known. Using statistical regularities, the models suggest the most likely
correction for each erroneous fragment. Although this approach improved on
the rule-based one, it remained limited by its dependence on the quality of the training corpus.
In addition, SMT models do not sufficiently consider the global context of the text, which leads to
many stylistic and grammatical errors in complex constructions.</p>
      <p>The modern approach to the
GEC problem relies on machine learning, which has significant advantages over
traditional approaches. Machine learning models, such as transformer architectures, can
consider not only local but also global dependencies between words in the text. They provide
context analysis, which is critical for the classification and generation tasks within the GEC problem. A
classification approach is used to identify errors: each token of the text is analyzed and
assigned a corresponding label. For the Ukrainian language, this stage faces significant
difficulties associated with morphological complexity, since most words can take different forms
depending on case, number, or gender. Also, grammatical dependencies are often determined
by the interaction of several tokens located at a considerable distance from each other in the text,
for example, when the subject and predicate must agree in a complex sentence. The complexity of error
classification is compounded by the polysemy of lexemes, where the meaning of a word changes
depending on the semantic context.</p>
      <p>
        Error correction, on the other hand, is a text generation task in which the system has to predict
the appropriate replacement for each erroneous token or fragment of text. Here, the key is not only
grammatical accuracy, but also the consistency of the corrected text with a style that matches the
context of the input text. Generative models, such as mT5 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], have proven their high efficiency in
solving this problem due to the contextual analysis of each token and the ability to work with
multi-level dependencies. For the Ukrainian language, this is especially important due to the
significant semantic load of dependencies between words and syntactic variations, which can
complicate the generation of corrections even for standard grammatical constructions. The use of
machine learning methods has significant advantages. Models built on such architectures are
adaptive in working with real texts and can qualitatively consider the context. They are devoid of
the strict limitations inherent in linguistic rules and can be trained on large data corpora, gradually
improving their predictions. Transformer models, such as BERT for identification and T5 for
correction, allow combining both stages within a single system that can work with complex error
cases.
      </p>
      <p>The purpose of the research is to develop a mathematical model of a decision support system for
identifying and correcting errors in Ukrainian-language texts using machine learning methods. The
main tasks are to formalize the processes of text analysis, build a model that considers the
morphological, syntactic and contextual specifics of the Ukrainian language, build data flow diagrams (DFD),
and create algorithms for detecting and correcting errors.</p>
      <p>The object of the research is the processes of automatic identification and correction of errors in
Ukrainian texts, considering their morphological and syntactic complexity. The subject of the
research is a mathematical model and machine learning algorithms that allow solving classification
and generation problems for building an effective text processing system.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Statement of the problem</title>
      <p>The problem of correcting errors in a text belongs to the category of GEC tasks, which include two
main subtasks: error identification and error correction.</p>
      <p>Today, several factors complicate the solution of this problem for
Ukrainian-language texts. First, most traditional automatic text checking systems are designed for
grammatically less complex languages, such as English. Applying these
systems to the Ukrainian language turns out to be ineffective, since they do not consider the
specifics of its grammar and syntax. For example, the variability of word forms (cases, numbers,
genders, declension) complicates the task of identifying erroneous tokens. In addition, the
Ukrainian language has significant context-dependent lexical-semantic relations.</p>
      <p>Another challenge is the variety of errors that can occur in texts. Spelling errors are relatively
straightforward to process automatically, while grammatical, punctuation, and even stylistic errors
require detailed analysis of the context and interactions between words within a sentence or even a
paragraph. For example, in the sentence "Мама купив хліб" ("Mom bought bread"), the error (masculine "купив" instead of
feminine "купила") requires modeling the agreement between the subject and the predicate. Identifying
such complex dependencies and creating mechanisms to correct them remains an open task.</p>
      <p>In addition, the task is complicated by the lack of large open corpora of training data for the
Ukrainian language that would provide sufficient quality for training machine learning models. The
dominance of English-language content in the field of NLP creates an uneven distribution of
resources and methodologies, which requires adapting existing technologies to the specifics of the
Ukrainian language.</p>
      <p>The GEC task is based on two key stages, which are performed sequentially: error identification
and correction. The error identification stage is responsible for identifying erroneous tokens in the
text and the type of error they contain. The work process at this stage includes the analysis of each
token and an assessment of the probability of an error, considering the context of the surrounding
words.</p>
      <p>To solve the GEC problem successfully, a system is needed that integrates the
identification and correction processes into a single model-oriented approach and addresses the
following challenges of the Ukrainian language: contextual ambiguity, morphological variation
(the seven cases of Ukrainian words, as well as their number, gender, and form, must be taken into account),
syntactic complexity, and the lack of training corpora in the Ukrainian language.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Related works</title>
      <p>
        Modern automatic text correction tools play an important role in overcoming various
types of spelling, grammar, punctuation and stylistic errors. In this context, the best-known
and most effective solutions are software applications and platforms integrated into word processors,
online tools and machine learning systems. Grammarly [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is one of the most popular tools for
automatic text checking. Unfortunately, Grammarly does not currently support the Ukrainian
language for checking grammar, spelling or stylistics. The main functionality of the service is
focused on the English language. However, the company is actively working on the development of
the Ukrainian language in the field of computational linguistics. Grammarly created [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and
published an open-access annotated GEC corpus of the Ukrainian language [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which contains
almost 34,000 sentences. This resource is intended for scientific and practical study of the language,
as well as for training and evaluation of grammar correction programs.
      </p>
      <p>
        GPTTools.ai [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is a Ukrainian desktop application based on GPT-4 artificial intelligence,
designed to effectively work with texts in Ukrainian and more than 70 other languages. The tool
provides the ability to generate texts, edit, correct stylistic, spelling and grammatical errors, as well
as translate. A key feature is support for individual prompts – users can create their own templates
to automate routine tasks. The program is suitable for students, scientists, copywriters and anyone
who works with large amounts of text information, ensuring convenience and accuracy of data
processing. LanguageTool [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which supports a multilingual format, including Ukrainian, is
currently one of the most functional tools for text verification. LanguageTool works based on rules
and currently has 1219 rules [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This platform is available as an online tool, browser extension or
desktop application. It can work with grammatical, spelling and contextual errors, offering
recommendations for their correction. LanguageTool integration is possible through the
LanguageTool API, which supports the Ukrainian language and is called NLP-uk [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Through this
integration, the user gets access to normalization, tokenization, lemmatization, POS-tagging of
texts, as well as tools for solving the problem of ambiguities in the Ukrainian language.
      </p>
      <p>The word processors Microsoft Word and Google Docs also have a text proofing function,
including basic support for the Ukrainian language. Microsoft Word allows you to correct spelling
errors using dictionary databases and linguistic rules. Google Docs offers a similar basic proofing,
but its functionality is less developed. For complex text analysis, these platforms can use additional
integrations, such as LanguageTool.</p>
      <p>
        Among the specialized Ukrainian tools, OnlineCorrector [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] should be highlighted, which is
focused on checking spelling, punctuation, and style in texts. This online tool allows you to upload
texts for checking and receive a result with corrected errors. It pays special attention to working
with texts of various formats, considering the specifics of the Ukrainian language.
      </p>
      <p>
        Modern machine learning models, particularly GECToR [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], open up new opportunities for
automating text correction processes. This model is built on a transformer architecture and is
available for adaptation to any language, including Ukrainian. GECToR uses an attention
mechanism to consider the context in text correction, which allows achieving high accuracy and
efficiency.
      </p>
      <p>
        Today, platforms such as Hugging Face [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] are also available to developers, which offer
pretrained models, including multilingual versions of BART, BERT, GPT, or T5. These platforms allow
for the creation of customized solutions for GEC and other natural language processing tasks by
adapting the models to the specifics of the language and style of the text.
      </p>
      <p>Modern GPT models, including ChatGPT, are a good example of interactive multilingual tools:
they can not only check and correct texts, but also adapt to complex stylistic features, taking
the context of the text into account. These models can work with any type of text, providing both basic checking
and context-based corrections.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], the study examines the effectiveness of using GPT-3.5 for grammatical error correction
(GEC) in a multilingual context in various scenarios, including zero-shot learning, fine-tuning, and
using the model to rank correction hypotheses generated by other models. The authors evaluate
the performance of GPT-3.5 through automated evaluation methods, such as grammaticality
assessment using language models (LM), Scribendi test, and semantic embedding analysis. The
results demonstrate that for English, GPT-3.5 shows a high level of consistency in corrections,
preserving the semantic integrity of the text, due to which the model corrects errors well and
generates smooth sentences. However, in other languages, including Ukrainian, Czech, German,
Russian, and Spanish, GPT-3.5 often significantly modifies the source sentences, which sometimes
leads to changes in their semantics and creates difficulties for evaluation. The main feature of this
work is the detection of GPT-3.5's tendency to overcorrect, as well as the analysis of its limitations
in correcting punctuation errors, tense agreement, syntactic dependencies, and lexical
compatibility. Despite the powerful capabilities of the model, the authors emphasize the problems
of its adaptation for GEC tasks in a multilingual environment, especially for languages with a rich
grammatical structure, such as Ukrainian.
      </p>
      <p>In [15], the authors proposed a new approach to multilingual grammatical error correction
(GEC), which is based on the use of pre-trained machine translation models. A feature of their
approach is the integration of neural machine translation methods into the GEC task, which made
it possible to effectively adapt the models to work with texts in different languages, in particular
low-resource ones. The authors managed to overcome the limitations of traditional GEC models by
optimizing pre-trained transformers, which ensured high accuracy of correction,
context-awareness, and naturalness of the corrected text. The main difference of their work is that they
adapted translation models, in particular transformers, for multilingual correction, while
maintaining their universality and ability to scale to new language sets. This allows not only to
correct errors with high accuracy, but also to effectively work with languages for which limited
training resources are available.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Mathematical model of a decision support system</title>
      <p>The decision support system (DSS) for identifying and correcting errors in Ukrainian-language texts is
based on natural language processing (NLP) and machine learning methods [16-24]. The
main goal of the DSS is to automatically identify errors in the text and either correct them or provide
recommendations for correction. The system consists of three main modules: an error identification
module, an error correction module, and a machine learning module. This modular structure allows
you to easily update the system, adapt it to different tasks, and increase the accuracy of error
recognition and correction. An important aspect is the ability to integrate additional components,
such as lexical databases, semantic networks, and algorithms for increasing the accuracy of text
analysis.</p>
      <p>The error identification module analyzes the text to detect incorrect words and grammatical,
spelling, and stylistic errors. Its main tasks are:</p>
      <list list-type="bullet">
        <list-item><p>Text segmentation: dividing the input text into individual sentences.</p></list-item>
        <list-item><p>Stop word removal: removal of punctuation marks, as well as other language constructs
that do not carry a semantic load and will only slow down the identification of errors.</p></list-item>
        <list-item><p>Text tokenization: splitting the input text into individual words or phrases for further
analysis.</p></list-item>
        <list-item><p>Lemmatization, stemming and morphological analysis: determining the basic form of
words, their part of speech and grammatical characteristics.</p></list-item>
        <list-item><p>Classification of words into correct and potentially incorrect: application of machine
learning methods to assess the correctness of words in the context of a sentence.</p></list-item>
        <list-item><p>Contextual analysis: assessment of the probability of an error based on surrounding words
(n-gram models, transformers, rules).</p></list-item>
      </list>
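      <p>The first preprocessing tasks listed above (segmentation, tokenization, and stop word removal) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the stop-word list is a hypothetical toy set, sentence boundaries are guessed from terminal punctuation only, and lemmatization and morphological analysis are left out because they require a dedicated Ukrainian analyzer.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# Hypothetical, deliberately tiny stop-word list; a real system would use a curated one.
STOP_WORDS = {"і", "та", "в", "на", "що"}

def segment(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Return lowercase word tokens; the word pattern is Unicode-aware, so Cyrillic works."""
    return re.findall(r"\w+", sentence.lower())

def preprocess(text):
    """Segment, tokenize, and drop stop words; returns one token list per sentence."""
    return [
        [tok for tok in tokenize(sent) if tok not in STOP_WORDS]
        for sent in segment(text)
    ]

result = preprocess("Мама купила хліб. Тато читає книгу та газету!")
```
]]></preformat>
      <p>Punctuation disappears here as a side effect of tokenization. Note that this simple word pattern would split Ukrainian words containing an apostrophe, so a production tokenizer would need to handle that case explicitly.</p>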
      <p>Based on these stages, a list of potential errors is formed, which is transferred to the correction
module. After identifying errors, the system should offer appropriate corrections. The main
functions of the error correction module are as follows: generation of corrections, contextual
checking, ranking of correction options, calculation of the model’s degree of confidence in the
correction.</p>
      <p>Depending on the settings, the system can either automatically make corrections or prompt the
user to choose the most appropriate option. In some cases, the option of interactive training by the
user is possible.</p>
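      <p>The choice between automatic correction and prompting the user can be illustrated with a small ranking routine. This is only a sketch: the <monospace>Candidate</monospace> structure, the function name, and the threshold value of 0.9 are hypothetical and do not come from the described system.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str          # proposed replacement
    confidence: float  # model confidence in [0, 1]

def choose_correction(candidates, auto_threshold=0.9):
    """Rank candidates by confidence; auto-apply the best one only if it clears the threshold."""
    ranked = sorted(candidates, key=lambda c: c.confidence, reverse=True)
    if ranked and ranked[0].confidence >= auto_threshold:
        return "auto", ranked[0].text, ranked
    # Otherwise hand the ranked list to the user to pick from.
    return "ask_user", None, ranked

mode, chosen, ranked = choose_correction(
    [Candidate("купили", 0.03), Candidate("купила", 0.95), Candidate("купило", 0.02)]
)
```
]]></preformat>
      <p>Lowering the threshold makes the system more autonomous at the cost of more unreviewed changes; raising it shifts more decisions to the user.</p>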
      <p>The machine learning module is critical for system adaptation and improvement. Its main
functions include model training, parameter updating, and feedback collection.</p>
      <p>In the process of building a mathematical model of a decision support system (DSS) for
identifying and correcting errors in Ukrainian-language texts, an important stage is the
formalization of input and output parameters. This allows you to clearly determine what data the
system processes and what results it should generate.</p>
      <p>The system receives text in various forms as input, which may contain spelling, grammatical,
punctuation, and stylistic errors. The input data is formalized as follows:</p>
      <p><disp-formula><label>(1)</label><tex-math><![CDATA[X = \{x_1, x_2, \ldots, x_n\},]]></tex-math></disp-formula>
where X is the input text array consisting of n tokens (words, symbols, sentences), and x<sub>i</sub> is a
separate token or word in the text.</p>
      <p>Additionally, auxiliary parameters can be used: text format (plain text, HTML, XML, JSON);
language (the system is focused exclusively on the Ukrainian language), contextual metadata (text
genre (scientific, journalistic, colloquial), style (formal/informal)).</p>
      <p>During processing, the text goes through several stages of analysis. The main
intermediate parameters are linguistic features (POS tags P(x<sub>i</sub>), morphological features M(x<sub>i</sub>),
syntactic dependencies S(x<sub>i</sub>, x<sub>j</sub>)); probabilistic characteristics (the estimated probability that a word
contains an error in a given context, and the model's confidence); correction candidates; and error labels.</p>
      <p>The result of the system is the corrected text and correction quality metrics. Formally, the
output data can be presented as a set:</p>
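      <p><disp-formula><label>(2)</label><tex-math><![CDATA[\hat{X} = \{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n\},]]></tex-math></disp-formula>
where <inline-formula><tex-math><![CDATA[\hat{X}]]></tex-math></inline-formula> is the corrected text and <inline-formula><tex-math><![CDATA[\hat{x}_i]]></tex-math></inline-formula> is the corrected version of the word or token x<sub>i</sub>.</p>
      <p>The output parameters include the corrected text, a list of errors found, and correction evaluation
metrics, including precision, recall, F1-measure, and confidence score.</p>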
      <p>In the process of developing a decision support system for identifying and correcting errors in
Ukrainian-language texts, it is important to consider several limitations that affect its functionality,
as well as clearly define the performance criteria that allow assessing the quality of its work. These
aspects are of decisive importance for building a reliable and accurate mathematical model.</p>
      <p>One of the key limitations is the linguistic specificity of the texts with which the system works.
The Ukrainian language has a rich morphological structure, a significant number of grammatical
rules, and numerous exceptions, which complicates the process of automated analysis. An
important challenge is the processing of polysemy – cases when one word can have several
meanings depending on the context. In addition, the system may encounter difficulties when
processing neologisms, professional vocabulary, or dialect features that are not always included in
the training corpora.</p>
      <p>Technological constraints also play an important role in the system's performance. One critical
aspect is the performance of the algorithms used to analyze and correct text. Some modern natural
language processing models, such as neural network transformers, require significant computing
resources, which can create problems when used in real time or on devices with limited computing
power. In addition, the length of the text being analyzed can impose limitations on the system's
performance, since large fragments must be divided into smaller parts, which can reduce the
accuracy of contextual analysis.</p>
      <p>Defining performance criteria is an important stage in assessing the quality of the system. The
main indicator is the accuracy of error detection and correction, which is evaluated using metrics
such as precision, recall, and F1-measure. High precision means that the system detects only those
errors that are present in the text, while high recall indicates the ability to recognize as
many errors as possible without omissions. The optimal option is a balance between these
indicators, since an excessive number of detected errors without sufficient accuracy can lead to
erroneous corrections. Formally, the correction criteria can be defined as:</p>
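      <p><disp-formula><tex-math><![CDATA[\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},]]></tex-math></disp-formula></p>
      <p><disp-formula><tex-math><![CDATA[\mathrm{Quality} = \frac{TP}{TP + FP + FN},]]></tex-math></disp-formula></p>
      <p><disp-formula><tex-math><![CDATA[\mathrm{ErrorRate} = \frac{FP}{TP + FP},]]></tex-math></disp-formula></p>
      <p><disp-formula><tex-math><![CDATA[\mathrm{Recall} = \frac{TP}{TP + FN},]]></tex-math></disp-formula></p>
      <p><disp-formula><tex-math><![CDATA[F_1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},]]></tex-math></disp-formula></p>
      <p>where TP is the number of correctly identified errors, TN is the number of correct tokens that were correctly left unflagged,
FP is the number of incorrectly identified errors, FN is the number of missed errors, and Precision = TP / (TP + FP) = 1 − ErrorRate.</p>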
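      <p>As a worked illustration, the criteria above can be computed from the four confusion counts. The function name and the convention of returning 0.0 when a denominator is zero are assumptions made for this sketch; the counts in the example are invented for illustration and are not experimental results.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def gec_metrics(tp, tn, fp, fn):
    """Compute the correction criteria from token-level confusion counts.

    Assumption: any ratio with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "quality": tp / (tp + fp + fn) if tp + fp + fn else 0.0,
        "error_rate": fp / (tp + fp) if tp + fp else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Illustrative counts only: 1000 tokens, 90 true errors, 80 of them found, 10 false alarms.
m = gec_metrics(tp=80, tn=900, fp=10, fn=10)
```
]]></preformat>
      <p>With these counts, Accuracy is 0.98 while Quality is only 0.80, which shows why Accuracy alone is a weak indicator when most tokens are correct.</p>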
      <p>In addition to accuracy, an important criterion is the speed of the system. Text processing time
should be minimal, especially in interactive applications where the user expects quick feedback.
Performance can be measured as the number of words or sentences that the system
processes per second. Optimizing the computational algorithms reduces the load on
the processor and memory, which is especially important for integrating the model into mobile
applications or web services. In such measurements, T<sub>i</sub> denotes the processing time of the i-th sentence and N the total number of sentences in the
test corpus.</p>
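      <p>One plausible way to turn the per-sentence times T<sub>i</sub> into a speed criterion is sketched below. The exact formula is not given in the text, so two common summaries are assumed here: the mean processing time, (1/N) Σ T<sub>i</sub>, and the throughput, N / Σ T<sub>i</sub>, in sentences per second. The <monospace>process</monospace> argument is a hypothetical stand-in for the system under test.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import time

def timing_summary(times):
    """Mean per-sentence time and throughput (sentences per second) from measured T_i."""
    n = len(times)
    total = sum(times)
    return {
        "mean_time": total / n if n else 0.0,
        "throughput": n / total if total > 0 else 0.0,
    }

def measure(process, sentences):
    """Time a hypothetical `process` callable on each sentence with a monotonic clock."""
    times = []
    for sentence in sentences:
        start = time.perf_counter()
        process(sentence)
        times.append(time.perf_counter() - start)
    return times

# Illustrative timings in seconds, not measurements of any real system.
summary = timing_summary([0.02, 0.03, 0.05])
```
]]></preformat>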
      <p>Another important aspect is resource consumption, which can be estimated from
random access memory (RAM) usage and processor (CPU) time. For example, the average resource usage is
obtained by averaging R<sub>i</sub>, the amount of memory used while processing the i-th text fragment, over all fragments.</p>
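      <p>A rough way to obtain the values R<sub>i</sub> and their average in Python is shown below, using the standard <monospace>tracemalloc</monospace> module. This is a sketch under assumptions: it takes R<sub>i</sub> to be the peak memory allocated by the Python interpreter while one fragment is processed, which excludes memory held by native libraries or a GPU, and <monospace>process</monospace> is a hypothetical stand-in for the system under test.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import tracemalloc

def average_peak_memory(process, fragments):
    """Average peak Python-heap usage in bytes over fragments, i.e. the mean of R_i."""
    peaks = []
    for fragment in fragments:
        tracemalloc.start()
        process(fragment)
        _, peak = tracemalloc.get_traced_memory()  # returns (current, peak)
        tracemalloc.stop()
        peaks.append(peak)
    return sum(peaks) / len(peaks) if peaks else 0.0

# Stand-in workload: builds a large token list for each fragment.
avg = average_peak_memory(lambda s: s.split() * 1000, ["один два три", "чотири п'ять"])
```
]]></preformat>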
      <p>The quality of the corrected text is equally important. Even if the system correctly
identifies errors, its corrections must be relevant and correspond to the style and semantics of the
original text. This parameter is assessed using metrics that compare the corrected text with the
reference text, such as BLEU [17] or METEOR, as well as through manual assessment by experts or
feedback from users. Corrections should not violate the logic or content of the text, which is
especially critical for automated correction systems in professional areas, for example, in scientific
articles or legal documents.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Mathematical model of the error identification in Ukrainian texts</title>
      <p>Error identification in texts is one of the fundamental tasks of natural language processing (NLP),
which is of particular importance for languages such as Ukrainian, which are characterized by high
morphological complexity. The identification task consists in identifying fragments of text that
may contain spelling, grammatical, lexical-semantic or syntactic errors, as well as in determining
the type of these errors. This is a key stage in automatic text correction systems, where the
effectiveness of correction directly depends on the accuracy and quality of identification.</p>
      <p>Formally, the task can be formulated as follows. Let X = {x<sub>1</sub>, x<sub>2</sub>, …, x<sub>N</sub>} be the input text
presented as a sequence of tokens (where a token can be a word, symbol or sentence). The task
consists in assigning to each x<sub>i</sub> a label y<sub>i</sub> corresponding to a certain class of error from the set of
classes C. A typical set of classes is C = {c<sub>0</sub>, c<sub>1</sub>, …, c<sub>K</sub>}, where c<sub>0</sub> denotes "no error" and
c<sub>1</sub>, …, c<sub>K</sub> denote specific types of errors (e.g., spelling error, grammatical error, punctuation
error, etc.).</p>
      <p>The probability that a token x<sub>i</sub> belongs to a particular class c<sub>k</sub> is defined as
<disp-formula><label>(10)</label><tex-math><![CDATA[P(y_i = c_k \mid x_i, \mathrm{context}) = f_{id}(x_i, \mathrm{context}),]]></tex-math></disp-formula>
where f<sub>id</sub> is a recognition (identification) function that considers both the token x<sub>i</sub> and its
context. The main task of the identification model is to learn an accurate estimate of f<sub>id</sub>,
so that it correctly determines the type of error or its absence.</p>
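      <p>In neural classifiers, a function such as f<sub>id</sub> is commonly realized by applying a softmax to per-class scores (logits) produced for each token. The sketch below assumes that realization, which the text does not specify; the class names and the example logits are illustrative only.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math

# Illustrative class set c_0 ... c_K; the real label inventory is an assumption here.
CLASSES = ["no_error", "spelling", "grammar", "punctuation"]

def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_token(logits):
    """Return (predicted class, probability distribution over classes) for one token."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs

# Made-up logits for a single token.
label, probs = label_token([0.1, 0.3, 2.5, -1.0])
```
]]></preformat>
      <p>The predicted label y<sub>i</sub> is the class with the highest probability, and the probability itself can serve as a confidence score for the later correction stage.</p>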
      <p>The Ukrainian language has several specific features that significantly complicate the effective
identification of errors. The first is morphological complexity, that is, the richness of
inflected forms. For example, adjectives, verbs or nouns can change in gender, number and case,
which makes it difficult to detect both spelling and grammatical errors. The second is the contextual
ambiguity of Ukrainian words: words with the same spelling can have different meanings or
functions depending on the context. For example, the word "пішли" can be a past-tense plural
verb form ("they went") or act colloquially as an invitation to move ("let's go"). Therefore, the identification system
should consider not only the local analysis of the token, but also syntactic and semantic
connections within the entire sentence.</p>
      <p>In addition, the variability of regional dialects, borrowings, and colloquial and
informal constructions characteristic of the Ukrainian language complicates identification. For
example, colloquial forms such as "шо" (standard "що") or "всьо" (standard "все") are rarely found in formal texts but are
actively used in informal speech. Because of this linguistic diversity,
classification models need to be adapted to datasets that include examples of both standard language and colloquial
forms.</p>
      <p>The identification of errors in texts is also closely related to
the type of error being processed. Spelling errors can often be detected using lexical
dictionaries (e.g., checking for unrecognized words), while grammatical or punctuation errors
depend on complex structural dependencies between the words in a sentence.</p>
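      <p>The contrast between the two kinds of errors can be illustrated with a dictionary lookup. The lexicon below is a toy set chosen for this sketch; a real system would load a full dictionary of Ukrainian word forms.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# Toy lexicon of valid word forms (hypothetical; for illustration only).
LEXICON = {"мама", "купила", "купив", "хліб"}

def unknown_tokens(text, lexicon=LEXICON):
    """Return (position, token) pairs for tokens that are absent from the lexicon."""
    tokens = re.findall(r"\w+", text.lower())
    return [(i, tok) for i, tok in enumerate(tokens) if tok not in lexicon]

spelling_flags = unknown_tokens("Мама купила хлиб")   # misspelled noun
agreement_flags = unknown_tokens("Мама купив хліб")   # wrong verb gender
```
]]></preformat>
      <p>The misspelled "хлиб" is flagged, whereas "купив" passes the lookup because it is a valid word form on its own; the agreement error only becomes visible when the subject is taken into account.</p>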
      <p>Thus, the task of identifying errors in Ukrainian-language text is a multi-level problem
that requires mathematical models that can consider both contextual
dependencies between words and features of morphology and syntax. The effectiveness of such
models largely depends on the quality of training, which is based on large corpora of texts
with clearly annotated errors. Since the models used in this problem are mostly based on machine
learning algorithms and deep neural networks, the text must be converted into a form that
algorithms can compute on. To this end, the text is represented through mathematically
formalized data that considers both the tokens themselves and their context.</p>
      <p>To work with modern deep learning methods, the text must be presented as multidimensional
vectors, which are numerical representations of the text. The format of the input data stream is
represented as:</p>
      <p>$$X = \{w_1, w_2, \dots, w_N\}, \tag{11}$$
where $w_i$ is the vector representation of the token $x_i$.</p>
      <p>Text vectorization involves converting words or tokens into multidimensional numerical
vectors (embeddings), which store semantic or contextual information about the words. Two main
approaches to vectorization are used: static and dynamic. In the static vector representation
approach, each word in the text has a constant vector that does not depend on its context. Methods
such as Word2Vec or FastText represent words based on their occurrence in the text. For example,
if each word $x_i$ is converted into a vector $w_i$ of dimension $d$, then the text matrix looks like this:
$$W = [w_1, w_2, \dots, w_N]^{\top}, \tag{12}$$
where $W \in \mathbb{R}^{N \times d}$ is a matrix of such representations.</p>
      <p>For example, the FastText method considers subwords (character n-grams), which makes it possible to
build vectors for misspelled or unfamiliar tokens.</p>
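      <p>The subword idea can be illustrated with a minimal sketch. It is not the FastText implementation: the boundary marker "#", the n-gram range, and the helper names are assumptions chosen only for illustration. It shows why a misspelled token still shares most of its n-grams with the correct form.</p>

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Return character n-grams of a word wrapped in boundary markers.

    The "#" marker and the 3..4 range are illustrative assumptions.
    """
    marked = "#" + word + "#"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams


def shared_ngram_ratio(a, b):
    """Jaccard overlap between the n-gram sets of two words."""
    grams_a, grams_b = set(char_ngrams(a)), set(char_ngrams(b))
    return len(grams_a.intersection(grams_b)) / len(grams_a.union(grams_b))


# The misspelling "книгі" overlaps with "книги" but not with "морозиво".
print(shared_ngram_ratio("книги", "книгі"))
print(shared_ngram_ratio("книги", "морозиво"))
```

      <p>In a subword model, the vector of an unseen token is composed from the vectors of such n-grams, so the overlap above translates into nearby vectors.</p>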
      <p>Dynamic contextual vector representations are a more modern approach that considers the
context in which a word is used. Popular models such as BERT, XLM-RoBERTa, or mBERT
generate such dynamic vectors, where for each token $x_i$ in the text, $w_i = f(x_i, \text{context})$, where $f$ is
a transformer model that takes into account the context within the entire sentence (or even the text).
In this approach, the vector $w_i$ not only conveys the meaning of the word but also reflects its
dependencies on other tokens. This is critical for the problem of error identification, since
some errors can only be identified within a broader context.</p>
      <p>To identify errors, it is not enough to analyze individual words or tokens. It is also necessary to
consider their position in the sentence and their relationship to neighboring elements. Context is
modeled in the following ways: a sequence of tokens is processed by recurrent or transformer neural
networks that store information about previous and next tokens; and special mechanisms are used,
such as the attention mechanism, which highlights the parts of the text that are significant for a
particular token. Mathematically, the context can be represented as a combination of tokens in a
window of length $m$ around the analyzed token $x_i$:
$$\text{context}(x_i) = \{x_{i-m}, \dots, x_{i-1}, x_{i+1}, \dots, x_{i+m}\}, \tag{13}$$
where $m$ is the length of the context window. In transformer architectures, the
context can cover the entire sentence or even several sentences. In a formalized description of the
identification problem, the input data can be structured as two
interconnected components: a matrix of word vectors and error labels.</p>
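      <p>The window from Eq. (13) can be sketched as follows. The function name and the truncation at sentence boundaries are illustrative assumptions; the source does not state how windows are handled at the edges of a sentence.</p>

```python
def context_window(tokens, i, m):
    """Return up to m tokens on each side of tokens[i], excluding tokens[i].

    The window is simply truncated at the sentence boundaries (an assumption).
    """
    left = tokens[max(0, i - m):i]
    right = tokens[i + 1:i + 1 + m]
    return left + right


sentence = ["Мій", "котик", "завжди", "їсть", "морозиво"]
print(context_window(sentence, 3, 2))  # neighbours of "їсть"
```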
      <p>The matrix of word vectors $W$ has the form $W \in \mathbb{R}^{N \times d}$, where $N$ is the number of tokens in
the text, and $d$ is the dimension of the vector representations. The error labels (ground truth) are
represented as a label vector $Y = \{y_1, y_2, \dots, y_N\}$, where each $y_i \in C = \{c_0, c_1, \dots, c_K\}$
corresponds to an error class or its absence.</p>
      <p>The formalization of the input data should consider the specifics of the Ukrainian language,
including declension, conjugation, and agreement in gender and number. Ukrainian texts are also
characterized by abbreviations ("д/з") and regionalisms, which are important to consider. It is
also necessary to ensure adaptability to the different varieties of the Ukrainian language that can be
found in spoken and literary texts.</p>
      <p>In the process of building a system for identifying errors in Ukrainian-language texts, the
central task is to develop a mathematical classification model capable of determining whether a
text fragment (token or sentence) contains an error and, if so, classifying it by type. Since error
classification is a multi-class classification problem, the mathematical model must determine the
probability of each fragment belonging to a certain class $c_k$.</p>
      <p>Identifying errors in text is a complex process that includes several stages of processing. The
key task is to find those tokens or segments of text that may potentially contain errors (the
so-called candidates for correction), based on formal criteria, language rules and statistical features.
The algorithm of the identification system consists of several mandatory stages:</p>
      <p>1. Pre-processing of the text: normalization, tokenization, and lemmatization (or stemming).
Normalization includes removing unnecessary spaces, tabs, and special characters that can be
noise; converting the text to lowercase; and expanding abbreviations.</p>
      <p>2. Analysis of the syntactic and morphological structure of the text. At this stage, the text is
processed to isolate its grammatical and syntactic dependencies. For this, POS tagging and
dependency parsing are performed. A syntactic structure is created for the sentence that
models the interdependencies between words. This makes it possible to detect syntactic
errors, for example, incorrect agreement between the subject and the predicate.</p>
      <p>3. Creating a language model for context analysis. After syntactic analysis, the text is
segmented into context windows that make it possible to consider neighboring tokens when
identifying errors. Context windows are defined as subsets of tokens:
$\text{context}(x_i) = \{x_{i-m}, \dots, x_i, \dots, x_{i+m}\}$, where $m$ is the width of the window. For example, in
the sentence "Мій котик завжди їсть морозиво", for the token "їсть" the context window
with width $m = 2$ is {"котик", "завжди", "їсть", "морозиво"}; only one token follows "їсть", so the
window is truncated at the end of the sentence. Context is critically
important, since many errors cannot be detected without considering grammatical and
semantic dependencies between words.</p>
      <p>4. Selecting candidate tokens for correction. At this stage, the system identifies tokens that
may potentially contain errors. This is achieved using a combination of rules, lexical
knowledge, and statistical features. Candidate tokens can be words, phrases, or symbols
that satisfy one or more criteria that signal a possible error. The following criteria are used:</p>
      <p>Lexical analysis. A word is considered a candidate if it is not included in the dictionary
of standard words of the Ukrainian language (for example, "пирог" → "пиріг"). The
morpheme structure is also analyzed: incorrect suffixes or prefixes signal possible errors
in the word, such as "зробить" → "зроблять".</p>
      <p>Grammatical agreement analysis. Incorrect agreement in gender, number, or case
between words. For example: "У мене була дві книжки" → "було дві книжки".</p>
      <p>Punctuation analysis. Violation of the rules for constructing complex sentences, or
missing or extra punctuation marks. For example: "Прийшов додому і почав читати
книгу" → "Прийшов додому, і почав читати книгу".</p>
      <p>Statistical frequency. If a word or phrase is rarely found (or is absent altogether) in a
corpus of standard texts, it becomes a candidate.</p>
      <p>Contextual inconsistency. Tokens that do not match the context. This can be detected
using pre-trained models, such as BERT, which predict whether a given word is typical
in a given context.</p>
      <p>5. Candidate filtering. After the initial selection of potential errors, the system filters candidate
tokens to exclude false positives. For example, it removes technical terms that are absent from
the dictionary only because of their specificity, and variants that
formally comply with language rules (certain rarely used constructions). This reduces the
load on the next stage, the correction module, where only probable errors are processed.</p>
      <p>As a result of this algorithm, a subset of tokens that are candidates for correction is determined.
They are passed to the next level of the system, which makes a specific decision on
correction.</p>
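      <p>The lexical, frequency, and filtering criteria above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the dictionary, the frequency table, the threshold `min_count`, and the allowlist of technical terms are all hypothetical inputs, and the grammatical and contextual criteria are left out.</p>

```python
def select_candidates(tokens, dictionary, frequencies, min_count=2,
                      allowlist=frozenset()):
    """Return (index, token) pairs that look like possible errors.

    A token becomes a candidate if it is missing from the dictionary or is
    rarer than min_count in the reference corpus; allowlisted terms are
    filtered out to reduce false positives.
    """
    candidates = []
    for index, token in enumerate(tokens):
        word = token.lower()
        if word in allowlist:
            continue  # candidate filtering: known technical term
        unknown = word not in dictionary
        rare = min_count > frequencies.get(word, 0)
        if unknown or rare:
            candidates.append((index, token))
    return candidates


# Hypothetical toy resources.
dictionary = {"я", "люблю", "пиріг"}
frequencies = {"я": 100, "люблю": 40, "пиріг": 7}
tokens = ["Я", "люблю", "пирог", "PyTorch"]
print(select_candidates(tokens, dictionary, frequencies,
                        allowlist={"pytorch"}))
```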
      <p>To visualize the algorithm, a data flow diagram (DFD) was used, which is a schematic tool for visualizing
data flows in the system. The DFD shows the main data flows in the system and
helps to present the error identification process clearly and structurally.</p>
      <p>At DFD level 0, the system is represented as a single block that interacts with external entities.
The user enters text into the system, and the system returns the output: a list of tokens
with error labels and error types.</p>
      <p>DFD level 1 details the data flow: the system is divided into its main
functional blocks.</p>
      <p>DFD level 2 is an extended view of the algorithm. This level displays all the operations that occur within
each block.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Mathematical model of the error correction in Ukrainian texts</title>
      <p>Error correction in Ukrainian-language texts can be formulated as a text generation task, where the
model receives text with errors as input, analyzes the context, determines the correct option
instead of the erroneous fragment, and returns the corrected text.</p>
      <p>Let the input text be given as a sequence of tokens $X = \{x_1, x_2, \dots, x_N\}$ and a sequence of labels
$Y = \{y_1, y_2, \dots, y_N\}$, where $x_i$ is a separate token, and $y_i \in C = \{c_0, c_1, \dots, c_K\}$ is the error type
for the corresponding token $x_i$: $c_0$ means "no error", and $\{c_1, \dots, c_K\}$ are the types of errors (e.g., spelling,
grammar, punctuation). The correction task is to transform the text $X$ into the text
$\hat{X} = \{\hat{x}_1, \hat{x}_2, \dots, \hat{x}_N\}$, where each corrected token $\hat{x}_i$ is either an error-free form that is
consistent with the context, or the unchanged token if there is no error ($y_i = c_0$).</p>
      <p>The main goal of the model is to generate a text $\hat{X}$ that complies with the grammatical,
stylistic and lexical rules of the Ukrainian language and is consistent with the context of the given
text fragment.</p>
      <p>From a mathematical point of view, error correction for each token $x_i$ in the text can be
described as the task of finding the optimal candidate $\hat{x}_i$:
$$\hat{x}_i = \arg\max_{x'_i \in D} P(x'_i \mid x_i, \text{context}), \tag{14}$$
where $D$ is the dictionary set of all valid tokens for the Ukrainian language, and $P(x'_i \mid x_i, \text{context})$
is the conditional probability that the token $x'_i$ is the correct correction of $x_i$. For each token $x_i$, the
model estimates the probability of different correction options $x'_i$, given the erroneous token and the
context around it. The probability is modeled as follows:
$$P(x'_i \mid x_i, x_{i-1}, x_{i+1}, \dots, x_{i-m}, x_{i+m}) = f_{cor}(x_i, \text{context}), \tag{15}$$
where context is all nearby tokens in a window of width $2m$, and $f_{cor}$ is the correction function
implemented by the text generation model. Error correction requires considering semantic and
grammatical context. Semantic context helps to consider consistency with neighboring words,
while grammatical context ensures that a word agrees in gender, number, tense, case, etc.</p>
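      <p>The arg max formulation can be illustrated with a minimal sketch. It assumes that some model has already produced a raw score for each candidate; the scores below are hypothetical, and the softmax normalization is one common way to turn them into probabilities, not necessarily the one used by the authors.</p>

```python
import math


def correction_probabilities(scores):
    """Normalize raw candidate scores into probabilities with a softmax."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {cand: math.exp(s - peak) for cand, s in scores.items()}
    total = sum(exps.values())
    return {cand: e / total for cand, e in exps.items()}


def best_correction(scores):
    """Return the candidate with the highest probability (the arg max)."""
    probabilities = correction_probabilities(scores)
    return max(probabilities, key=probabilities.get)


# Hypothetical scores for the erroneous token "книгі" in its context.
scores = {"книги": 2.1, "книга": 0.9, "книзі": 0.3}
print(best_correction(scores))
```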
      <p>The error correction algorithm works based on input data received from the identification
module, which has identified potentially erroneous tokens and classified them by error types. Error
correction consists in finding optimal replacements for erroneous tokens consistent with the
context, grammatical rules, and stylistic features of the Ukrainian language, and in forming a
corrected text.</p>
      <p>Description of the error correction algorithm:</p>
      <p>1. Initialization of the data structure. At this initial stage, the correction module
receives the input tokens $X$ and error labels $Y$, and creates an empty list for corrected
tokens: $\hat{X} = [\,]$.</p>
      <p>2. Processing each token. At this stage, for each $x_i$ in the text $X$, the error label $y_i$ is checked.
If $y_i = c_0$, then $\hat{x}_i = x_i$, and the token is added to the list $\hat{X}$ without changes. If $y_i \neq c_0$ (the
token is identified as erroneous), the correction process is initiated according to the
error type $y_i$.</p>
      <p>3. Generation of candidates for correction. For each erroneous token $x_i$, a set of possible
correction options $D = \{d_1, d_2, \dots, d_M\}$ is determined using the Ukrainian dictionary,
grammatical analysis, and a language model. The generation of correct text can be
implemented using several approaches: greedy search, beam search, and sampling with
temperature. Greedy search sequentially selects the most probable option for
each token, but this method does not always guarantee global optimality. Beam
search keeps several of the most probable options, from which the best one is selected
based on a global assessment of the context. Sampling with temperature
generates options from a distribution adjusted by a temperature parameter to increase the variability of
corrections. For the problem of generating correct text, the objective function of the
correction model optimizes the probability of the correct corrected text. For the multilevel
evaluation problem, the cross-entropy loss function is used:
$$Loss = -\sum_{i=1}^{N} \log\big(P(\hat{x}_i \mid x_i, \text{context})\big). \tag{16}$$
For spelling errors, variants are generated that are as similar as possible to $x_i$ using
the Levenshtein distance. For example, for the token "книга", the set
of variants can be $V$ = {"книга", "книги", "книзі"}. For grammatical errors, the system uses
morphological analysis in accordance with the rules of Ukrainian grammar.
This limits the list of candidates to words that match the required form. Language models,
in turn, determine context-sensitive candidates using pre-trained models that consider the
context of neighboring words when generating corrections.</p>
      <p>4. Candidate evaluation. For each candidate $v_j \in D$, the system calculates the probability of
matching the context. The evaluation is carried out using language models (BERT, GPT) or
grammar rules. The models calculate which replacement $v_j$ is the best in the context of the
entire sentence. In the sentence "Я люблю читати книгі", the probability $P$("книги" | "Я
люблю читати") will be higher than for "книзі". The grammar rules impose additional
constraints on the choice of words.</p>
      <p>5. Choosing the best candidate. For an erroneous token $x_i$, the candidate $\mathring{v}_j$ with the
maximum probability score is chosen:
$$\mathring{v}_j = \arg\max_{v_j \in D} P(v_j \mid \text{context}). \tag{17}$$
For "книгі", the value is $\mathring{v}_j$ = "книги", since this is the most likely option in the given
context.</p>
      <p>6. Correction of punctuation errors. If the error type $y_i$ indicates a punctuation violation, then
instead of the token $x_i$, the correct punctuation mark defined by the syntax rules is inserted.
For the text "Не знаю як пояснити", the system will correct it to "Не знаю, як пояснити".</p>
      <p>7. Formation of the final text. After processing each token, the output text is formed. If the
token was not changed, it is added to the final list $\hat{X}$ in its original form. If the token was
changed, the corrected version $\mathring{v}_j$ is added to the list $\hat{X}$.</p>
      <p>8. Checking consistency. For the entire sentence, a final check of consistency between tokens
is performed: checking the grammatical agreement between the subject
and the predicate, nouns and adjectives, verbs and their arguments; and checking the semantic
coherence of the entire construction.</p>
      <p>9. Returning the corrected text. The system returns the corrected text $\hat{X}$, and can also provide
a list of changes made with an indication of the type of each error.</p>
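      <p>Candidate generation by Levenshtein distance, mentioned in the generation step above, can be sketched as follows. The toy dictionary and the threshold `max_distance` are illustrative assumptions; a production system would use an indexed dictionary rather than a linear scan.</p>

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def spelling_candidates(token, dictionary, max_distance=1):
    """Return dictionary words within max_distance edits of the token."""
    scored = [(levenshtein(token, word), word) for word in dictionary]
    return sorted(word for distance, word in scored if max_distance >= distance)


# Hypothetical toy dictionary.
dictionary = {"книга", "книги", "книзі", "морозиво"}
print(spelling_candidates("книгі", dictionary))
```

      <p>All three forms are one edit away from "книгі", so the distance alone cannot choose between them; the contextual evaluation step is needed to select "книги".</p>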
      <p>To visualize the algorithm and processes of the error correction module in Ukrainian-language
texts, DFD diagrams were constructed.</p>
      <p>Example of the algorithm with the input text "Я люблю читати книгі й розкажу друзі про цю
книгу.":</p>
      <p>Labels from the identification module: {"Я": no error, "люблю": no error, "читати": no
error, "книгі": spelling, "й": no error, "розкажу": no error, "друзі": grammatical,
"про": no error, "цю": no error, "книгу": no error}. Correction: "книгі" → "книги",
"друзі" → "друзям".</p>
      <p>Corrected text: "Я люблю читати книги й розкажу друзям про цю книгу."</p>
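      <p>The token-by-token loop of the correction algorithm can be traced on this example. In the sketch below the `correct` callback is a hypothetical stand-in for the correction model: it is a plain lookup table, used only so that the control flow is visible.</p>

```python
NO_ERROR = "no error"


def apply_corrections(tokens, labels, correct):
    """Copy error-free tokens and replace the others via correct(token, label).

    Returns the corrected tokens and a list of (original, replacement, label).
    """
    corrected, changes = [], []
    for token, label in zip(tokens, labels):
        if label == NO_ERROR:
            corrected.append(token)
            continue
        replacement = correct(token, label)
        corrected.append(replacement)
        changes.append((token, replacement, label))
    return corrected, changes


tokens = "Я люблю читати книгі й розкажу друзі про цю книгу".split()
labels = ([NO_ERROR] * 3 + ["spelling", NO_ERROR, NO_ERROR, "grammatical"]
          + [NO_ERROR] * 3)
lookup = {"книгі": "книги", "друзі": "друзям"}  # stand-in for the model
corrected, changes = apply_corrections(
    tokens, labels, lambda token, label: lookup.get(token, token))
print(" ".join(corrected))
print(changes)
```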
    </sec>
    <sec id="sec-7">
      <title>7. Mathematical model of machine learning</title>
      <p>The text in its original form is not suitable for processing by machine learning algorithms that
work with numerical data. Therefore, the main task of this stage is to convert the text into a format
on which models can be trained effectively. Before vector representations of the text are formed,
tokenization is performed, which must consider the
peculiarities of Ukrainian grammar, such as the processing of compound words, particles, and quotes.
After tokenization, each token must be converted into a numerical format suitable for machine
processing. Depending on the task and the methods used to process the text, each token can be
represented in different formats: as a number in a dictionary corresponding to a specific token
(integer encoding), as a sparse binary vector (one-hot encoding), or as a word
embedding. The first two methods do not scale to large corpora and do not capture the
relationships between tokens, which makes them a poor choice for this task.</p>
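      <p>The first two formats can be shown on a toy example. The helper names are illustrative. The sketch also makes the scaling problem visible: the length of every one-hot vector equals the vocabulary size, and any two different tokens are equally far apart.</p>

```python
def build_vocabulary(tokens):
    """Assign each distinct token the next free integer id."""
    vocabulary = {}
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)
    return vocabulary


def one_hot(index, size):
    """Return a vector of zeros with a single 1 at the given index."""
    vector = [0] * size
    vector[index] = 1
    return vector


tokens = ["я", "люблю", "читати", "я"]
vocabulary = build_vocabulary(tokens)
ids = [vocabulary[token] for token in tokens]
print(ids)
print([one_hot(i, len(vocabulary)) for i in ids])
```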
      <p>Word embeddings are multidimensional vectors that capture semantic and syntactic
relationships between words and are built based on statistical models. The most common methods
are Word2Vec, FastText, and BERT. Word2Vec is a statistical method for constructing vectors
through training on text corpora. Vectors of similar words are closer to each other in a
multidimensional space. For example: "собака" ≈ [0.2, 0.8, -0.3], "вовк" ≈ [0.1, 0.7, -0.4]. FastText
is an extension of Word2Vec that considers sub-word elements (character n-grams, which often
coincide with morphemes) within a token. This
is important for the Ukrainian language due to its high morphological variability. BERT, in turn,
generates contextual vectors that depend on the position of the word in the sentence. For example,
"сльози" in the phrase "сльози на очах" will have a different vector than in "сльози природи".</p>
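      <p>Closeness of vectors is usually measured with cosine similarity. The sketch below applies it to the two illustrative three-dimensional vectors from the text; real embeddings have hundreds of dimensions, and the choice of cosine similarity here is an assumption, since the text does not name a metric.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


dog = [0.2, 0.8, -0.3]   # "собака"
wolf = [0.1, 0.7, -0.4]  # "вовк"
print(round(cosine_similarity(dog, wolf), 3))
```

      <p>For these two vectors the similarity is about 0.98, which is what "closer to each other" means in practice.</p>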
      <p>Each token $x_i$ from the text $X$ can be represented as a vector $w_i$. The training sample represents
the data on which the model will learn to recognize errors and suggest corrections. When working
with Ukrainian-language texts, it is important to consider not only the general principles of data
collection, but also the specifics of the Ukrainian language, which include rich morphology,
complex syntactic and grammatical structure, as well as the presence of regionalisms, borrowings
and colloquial speech.</p>
      <p>The main tasks of this stage are: creating a representative corpus of texts covering different
aspects of the language (formal, colloquial and literary styles); introducing artificial errors that
simulate real types of errors in the text, or using corpora with errors marked manually; balancing
the data to avoid the dominance of one type of error and ensure uniform training of the model;
and building a data structure that combines the original text, the modified text with errors, error labels and
the correct version of the text.</p>
      <p>To train the model, a corpus of texts containing marked language errors is required. Sources of
such corpora are real texts (student essays, scientific papers with errors, social networks,
blogs, personal messages, online forums) and existing annotated corpora for the
Ukrainian language (UA-GEC, GRAC [16] or similar), where the texts have already been annotated
by experts for correction tasks. Error-free texts are also important for building the training sample.
They are used to introduce artificial errors that simulate real scenarios and to train the model to
correctly "read" the context and identify non-problematic areas of the text. Here, the sources can be
books, formal documents, news; Wikipedia and other open text databases; and literary works in the
Ukrainian language.</p>
      <p>The training sample should represent a variety of error types. The main categories include
spelling (randomly added letters, replacing one letter with another, omitted or extra letters),
grammatical (incorrect word agreement, declension or verb conjugation errors, incorrect word
order), punctuation, lexical and contextual errors. The question of increasing the size of the
training sample often arises. To do this, errors can be generated automatically using rules or
statistical algorithms, or using ready-made language models (GPT, mT5) to generate texts with typical errors.
Automatic error generation can use an algorithm for random error insertion: replacing
letters based on the frequency of real errors (for example, "с" → "з"), deleting or adding characters
in words, omitting commas or adding unnecessary punctuation marks. Other common techniques are
replacing the correct case with a random one ("до школи" → "до школою") and randomly adding words
out of context ("Я граю в футбол." → "Я гамак в футбол.").</p>
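      <p>Rule-based error insertion of this kind can be sketched as follows. The confusion table, the corruption rate, and the function names are illustrative assumptions; only letter substitution, letter deletion, and comma omission are covered, and a seeded generator is used so that the generated corpus is reproducible.</p>

```python
import random


def swap_letter(word, rng, confusions):
    """Replace one confusable letter, e.g. "с" with "з"."""
    positions = [i for i, ch in enumerate(word) if ch in confusions]
    if not positions:
        return word
    i = rng.choice(positions)
    return word[:i] + confusions[word[i]] + word[i + 1:]


def drop_letter(word, rng):
    """Delete one random letter from words of at least two characters."""
    if 2 > len(word):
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def drop_commas(sentence):
    """Simulate a punctuation error by removing all commas."""
    return sentence.replace(",", "")


def corrupt(sentence, rng, confusions, rate=0.3):
    """Corrupt each word with probability `rate` using a random operation."""
    noisy = []
    for word in sentence.split():
        if rate > rng.random():
            if rng.choice(["swap", "drop"]) == "swap":
                word = swap_letter(word, rng, confusions)
            else:
                word = drop_letter(word, rng)
        noisy.append(word)
    return " ".join(noisy)


rng = random.Random(0)
print(corrupt("Я граю в футбол сьогодні", rng, {"с": "з"}))
print(drop_commas("Не знаю, як пояснити"))
```

      <p>Each corrupted sentence is stored together with its clean original, which gives the source and target pair needed for training.</p>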
      <p>To ensure high-quality training of the model, it is necessary to create a balanced sample, where
each type of error has enough examples and the ratio of incorrect texts to
correct ones is not excessively large (for example, 40% of incorrect texts to 60% of correct ones). To
create a high-quality sample, the texts should be annotated manually: each error is
marked with the appropriate type, the exact positions of the incorrect tokens are indicated, and the
correct version of the text is added next to the errors. As a result, each text becomes an object of the
format: {"Я іду чтати книгу."} → {[1, {"іду", "grammatical error", "йду"}], [2, {"чтати", "spelling
error", "читати"}]}. The corpus of texts is then divided into three parts: a training sample for
building the model (~70% of the data), a validation sample for checking the quality during training
(~15% of the data), and a test sample for assessing accuracy on texts that are new to the model
(~15% of the data).</p>
      <p>The process of training a machine learning model for the problem of error identification and
correction includes setting up the model architecture, defining the objective function, optimizing
the parameters, as well as checking and validating the obtained results. The model should maximize the
probabilities:
$$P(Y \mid X) = \prod_{i=1}^{N} P(y_i \mid x_1, x_2, \dots, x_N), \tag{18}$$
$$P(\hat{X} \mid X, Y) = \prod_{i=1}^{N} P(\hat{x}_i \mid x_1, x_2, \dots, x_N, y_1, y_2, \dots, y_N), \tag{19}$$
where $P(y_i \mid X)$ is the probability that the token $x_i$ belongs to the class $y_i$, and
$P(\hat{x}_i \mid X, Y)$ is the probability of the correction $\hat{x}_i$ of the token $x_i$.</p>
      <p>Training involves the use of modern text processing methods
based on neural language models. For the problem of error identification and correction, two
architectures are most effective: recurrent neural networks (RNN, LSTM, GRU) and the Transformer.
Recurrent neural networks are used for sequence analysis. Their main advantage is
that they preserve the context of previous tokens. For example, LSTM (Long Short-Term Memory)
makes it possible to consider long-term dependencies between tokens:
$$h_t = \mathrm{LSTM}(x_t, h_{t-1}), \tag{20}$$
where $h_t$ is the hidden state at step $t$, which contains information about the current token and
the previous ones.</p>
      <p>Transformer models such as BERT, XLM-RoBERTa or mT5 are the most effective for identifying
and correcting errors in texts. Thanks to the attention mechanism, dependencies between all
tokens in the text are considered, regardless of their positional distance. BERT is used for error
identification, and mT5 or GPT can be used for error correction.</p>
      <p>The objective function determines how well the model recognizes errors and suggests correct
corrections. For the error identification problem, the loss function is based on cross-entropy:
$$Loss_{id} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=0}^{K} y_{(i,k)} \log\big(\hat{y}_{(i,k)}\big), \tag{21}$$
where $y_{(i,k)}$ is the label of token $x_i$ belonging to class $c_k$, and $\hat{y}_{(i,k)}$ is the probability predicted by
the model for this class. For the error correction problem, the loss function estimates the difference
between the corrected text and the correct text:
$$Loss_{cor} = -\frac{1}{N} \sum_{i=1}^{T} \log\big(P(\hat{x}_i \mid X, Y)\big), \tag{22}$$
where $P(\hat{x}_i \mid X, Y)$ is calculated using the transformer model and $T$ is the number of tokens in the
corrected text (in the token-level formulation above, $T = N$). The overall objective function of
the model looks like this:
$$Loss = \lambda_1 \cdot Loss_{id} + \lambda_2 \cdot Loss_{cor}, \tag{23}$$
where $\lambda_1$ and $\lambda_2$ are the weighting factors for balancing the tasks.</p>
      <p>The learning algorithm can be described as follows:</p>
      <p>1. Model initialization. At this stage, the model architecture is selected and hyperparameters
are configured (number of layers, number of neurons, embedding size, etc.).</p>
      <p>2. Data preparation. The texts are converted into vector representations, and the training
sample is divided into three subsets: training, validation, and test (70% / 15% / 15%).</p>
      <p>3. The learning process. Here, tokens from the sample are fed to the model input, and the
predicted error classes $\hat{Y}$ and corrected texts $\hat{X}$ are calculated. Next, the
loss value $Loss$ is calculated. At the end of this stage, the model parameters
are optimized using gradient descent algorithms (Adam or AdamW).</p>
      <p>4. Validation. After each epoch, the learning process is checked on validation data. The model
is evaluated using the following metrics: Accuracy, Precision, Recall, F1-score, and BLEU to
assess the quality of corrected texts.</p>
      <p>5. Testing. The model is evaluated on the remaining test data to obtain the final metrics.</p>
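      <p>The data preparation step can be sketched with a reproducible 70% / 15% / 15% split. The function name, the fixed seed, and the use of integer percentages are illustrative assumptions; the source specifies only the proportions.</p>

```python
import random


def split_corpus(samples, seed=42, train_pct=70, validation_pct=15):
    """Shuffle the samples and split them into train, validation and test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_validation = len(shuffled) * validation_pct // 100
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return train, validation, test


train, validation, test = split_corpus(range(100))
print(len(train), len(validation), len(test))
```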
      <p>After training, the model becomes capable of accurately classifying text tokens by error type,
offering high-quality corrections that are consistent with the context, and generating corrected text
with minimal stylistic and grammatical violations.</p>
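      <p>The objective used in the learning process, a weighted sum of the identification and correction losses, can be computed by hand on toy numbers. The sketch below assumes one-hot ground-truth labels, so the inner sum over classes reduces to the log-probability of the true class; the probabilities and the weights are hypothetical.</p>

```python
import math


def identification_loss(true_classes, predicted):
    """Mean cross-entropy; predicted[i][k] is the probability of class k."""
    total = sum(math.log(probs[k]) for k, probs in zip(true_classes, predicted))
    return -total / len(true_classes)


def correction_loss(token_probabilities):
    """Mean negative log-probability assigned to the correct tokens."""
    total = sum(math.log(p) for p in token_probabilities)
    return -total / len(token_probabilities)


def total_loss(loss_id, loss_cor, lambda_id=1.0, lambda_cor=1.0):
    """Weighted sum of the two task losses."""
    return lambda_id * loss_id + lambda_cor * loss_cor


# Two tokens, two classes (0 = no error, 1 = spelling); values are made up.
loss_id = identification_loss([0, 1], [[0.5, 0.5], [0.25, 0.75]])
loss_cor = correction_loss([0.9, 0.6])
print(total_loss(loss_id, loss_cor, lambda_id=0.5, lambda_cor=0.5))
```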
    </sec>
    <sec id="sec-8">
      <title>8. Model fine-tuning</title>
      <p>In this study, the 'facebook/mbart-large-50' model was chosen for the task of automatic error
correction in Ukrainian-language texts. The choice of this transformer architecture is due to its
multilingual nature, which allows it to work effectively with lower-resource languages,
in particular Ukrainian. Due to pre-training on large multilingual text corpora, mBART has a high potential for
generating grammatically and stylistically correct sentences in Seq2Seq tasks. In addition, the
model performs well in other text transformation tasks, in particular paraphrasing and machine
translation, which are close in nature to error correction.</p>
      <p>The UA-GEC corpus was used for training. It contains examples of sentences with typical
errors in the Ukrainian language and their corrections. Data pre-processing was carried out using
the Stanza library: the data were divided into source (input, erroneous) sentences and target (corrected)
sentences, which corresponds to the Seq2Seq training format.</p>
      <p>The training was performed in the Kaggle environment with an available GPU and limited
RAM. Considering the resource constraints, a compromise set of hyperparameters was selected:
batch_size = 4, max_token_length = 64, learning_rate = 5e-5, n_epochs = 6. The training was
performed in two stages: first 3 epochs with the basic settings, and then additional training for
another 3 epochs with a lower learning rate. Initially, the learning rate was 5e-5, which contributed
to the rapid assimilation of the main patterns, while reducing the rate to 3e-5 in the later stage
helped to avoid overfitting and allowed for more precise adjustment of the weights.</p>
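      <p>The two-stage schedule can be written down as a small configuration object. The class and field names below are illustrative and do not come from the authors' code; only the numeric values are taken from the text.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters reported in the text; field names are assumptions."""
    model_name: str = "facebook/mbart-large-50"
    batch_size: int = 4
    max_token_length: int = 64
    first_stage_epochs: int = 3
    second_stage_epochs: int = 3
    first_stage_lr: float = 5e-5
    second_stage_lr: float = 3e-5


def learning_rate_for_epoch(config, epoch):
    """Return the learning rate for a 1-based epoch number."""
    total = config.first_stage_epochs + config.second_stage_epochs
    if 1 > epoch or epoch > total:
        raise ValueError("epoch out of range")
    if config.first_stage_epochs >= epoch:
        return config.first_stage_lr
    return config.second_stage_lr


config = FineTuneConfig()
print([learning_rate_for_epoch(config, e) for e in range(1, 7)])
```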
      <p>As part of the quantitative analysis of the quality of the model for correcting errors in Ukrainian
texts, several metrics were used that allow an objective assessment of the learning dynamics. In
particular, the values of the text-level metrics sacreBLEU, BLEU, and METEOR were recorded, as
well as the losses on the training and validation samples during three epochs of learning.</p>
      <p>The initial quality of the model, recorded in the first epoch, turned out to be expectedly low.
The value of the BLEU metric was zero, sacreBLEU was only 0.004, and METEOR was 0.011. This
indicates the almost complete inability of the model at the start to generate any corrections that at
least partially coincided with the reference ones. Such a result is typical for transformer models
until they begin to form meaningful correspondences between the input and target sequences.</p>
      <p>However, a significant breakthrough is observed already in the second epoch. The BLEU value
increases sharply to 0.477, sacreBLEU to 47.76, and METEOR to 0.584. This indicates that the model
has begun to effectively learn the dependencies between incorrect and correct sentences,
reproducing a significant part of the target responses either with an exact match or with a high
degree of semantic similarity. It is also important to note the drop in the value of the loss function:
the training loss decreased from 4.16 to 2.71, and the validation loss from 3.70 to 0.99, which is a
good indicator that the learning is going in the right direction, and the model not only remembers
but also generalizes the patterns.</p>
      <p>In the third epoch, the positive dynamics continue: BLEU reaches 0.659, sacreBLEU 65.92, and
METEOR 0.687. At the same time, the losses are reduced even more significantly: the training
loss drops to 0.43, and the validation loss to 0.51. This indicates the stability of the learning process and
the absence of signs of overfitting, which is especially important for the text correction task,
where the generative ability of the model must be flexible, not highly specialized.</p>
      <p>Starting from the fourth epoch, training continued at a lower learning rate: learning_rate was
reduced from 5e-5 to 3e-5. This allowed the model to refine the already formed
correspondences more accurately, but the quality gain was insignificant: BLEU increased to 0.662, METEOR to
0.692, and sacreBLEU to 66.16. Such stability of the metrics indicates that the model has reached a
plateau: it has already acquired the basic regularities and now only slightly refines the
predictions. At the same time, the validation loss began to slowly increase (from 0.51 to 0.57),
which may indicate the beginning of overfitting. Thus, reducing learning_rate made it possible to
avoid sharp fluctuations, but did not provide a significant improvement; the best results
were obtained at epochs 3 and 4.</p>
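      <p>The reported values can be collected into a small table to make the choice of checkpoint explicit. The numbers below are the ones quoted in the text for epochs 1 to 4; the selection helper is an illustrative sketch, not part of the authors' pipeline.</p>

```python
history = [
    {"epoch": 1, "bleu": 0.0, "sacrebleu": 0.004, "meteor": 0.011, "val_loss": 3.70},
    {"epoch": 2, "bleu": 0.477, "sacrebleu": 47.76, "meteor": 0.584, "val_loss": 0.99},
    {"epoch": 3, "bleu": 0.659, "sacrebleu": 65.92, "meteor": 0.687, "val_loss": 0.51},
    {"epoch": 4, "bleu": 0.662, "sacrebleu": 66.16, "meteor": 0.692, "val_loss": 0.57},
]


def best_epoch(rows, key, higher_is_better):
    """Return the epoch whose value for `key` is best."""
    chooser = max if higher_is_better else min
    return chooser(rows, key=lambda row: row[key])["epoch"]


print(best_epoch(history, "val_loss", higher_is_better=False))
print(best_epoch(history, "bleu", higher_is_better=True))
```

      <p>Selecting by validation loss points to epoch 3, while selecting by BLEU points to epoch 4, which is consistent with the observation that the best results fall at epochs 3 and 4.</p>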
      <p>Summarizing these graphs, we can conclude that the chosen approach to training the model is
effective. The high values of sacreBLEU and METEOR demonstrate not only lexical accuracy,
but also stylistic adequacy of the generated variants. The rapid growth of the metrics between the first
and second epochs indicates that even in a short training time the model can master the basics of
language correction. In general, this gives grounds to expect a potential
improvement in quality with further training, expansion of the training corpus, and enrichment of
the types of language errors in the input examples. The quality of the model was also assessed by
manual testing on a set of sentences with typical errors that often occur in
Ukrainian-language texts. In total, dozens of sentences were tested, covering various types of language errors,
including punctuation, spelling, lexical, and morphological. Analysis of the results allows us to
draw conclusions about the current level of linguistic generalization in the model, identify
its strengths, and identify areas where linguistic competence is lacking.</p>
      <p>The model performed best in correcting punctuation errors. In most cases, it confidently
recognized and restored missing commas in complex constructions. For example, the
sentence “Я бачив як вона грає на піанино” was successfully converted to “Я бачив, як вона
грає на піанино”, and in the sentence “Коли я приїхав у Львів мені сподобалось атмосфера” a
comma was added after the subordinate clause, which complies with the norms of Ukrainian
syntax. Similarly, in a sentence with direct speech, the model used the correct punctuation: “не
хай так і буде сказав дмитро” was corrected to “Не хай так і буде, – сказав дмитро”, which
shows that the model captures the specifics of direct speech and how it is set off
in writing, although it did not add quotation marks.</p>
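      <p>One way to inspect such corrections during manual testing is a token-level diff between the input and the model output. The sketch below uses Python's standard difflib module and whitespace tokenization; the helper name token_edits is an assumption introduced here for illustration, and the sentence pair is the first example quoted above.</p>
      <preformat>
```python
import difflib
from typing import List, Tuple


def token_edits(source: str, corrected: str) -> List[Tuple[str, str, str]]:
    """List (operation, source_tokens, corrected_tokens) for every changed span.

    Tokens are whitespace-separated, so a comma attached to a word shows up
    as a replacement of that word. This helper is illustrative only.
    """
    src = source.split()
    tgt = corrected.split()
    matcher = difflib.SequenceMatcher(a=src, b=tgt, autojunk=False)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            edits.append((op, " ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return edits


# Example pair quoted in the text above.
source = "Я бачив як вона грає на піанино"
corrected = "Я бачив, як вона грає на піанино"
for edit in token_edits(source, corrected):
    print(edit)  # prints ('replace', 'бачив', 'бачив,')
```
      </preformat>
      <p>An empty list means the model returned the sentence unchanged, which makes uncorrected errors easy to spot when many test sentences are reviewed.</p>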
      <p>However, the system only partially coped with spelling errors. Simple cases, such as a word
erroneously split by a space ("Близ ько") or a lowercase function word at the beginning of a sentence,
were successfully corrected. But in more complex situations, where correction requires a deeper
analysis of the dictionary form, the system failed. For example, the word "ней мовірно" remained
unchanged, although it should have been corrected to "неймовірно". This indicates that the
current version of the model is not always sufficiently sensitive to spelling,
especially when errors lack clear contextual clues and require checking
against the dictionary norm.</p>
      <p>A similar situation is observed with morphological errors. In the sentence "Коли я приїхав у
Львів, мені сподобалось атмосфера" the model did not detect the agreement conflict between the neuter
verb form and the feminine noun, leaving the erroneous form unchanged. This indicates that the model
does not yet sufficiently capture agreement between grammatical categories when processing the input text.</p>
      <p>In rare cases, the model demonstrates the ability to perform syntactic transformations, changing
the structure of a sentence from incorrect to grammatically correct. For example, the sentence “я
тебе звати” was transformed into “Я тебе зватиму”, indicating the formation of a basic
understanding of predicative constructions and verb agreement. However, most examples of this
type remain beyond the scope of successful correction, and the transformational ability of the
model currently appears limited.</p>
    </sec>
    <sec id="sec-9">
      <title>9. Conclusions</title>
      <p>This work developed a mathematical model of a decision support system for
identifying and correcting errors in Ukrainian-language texts based on
machine learning approaches. The research formulated a mathematical basis for
solving these tasks, including the formalization of the data flow, the arrangement of system
components, and the representation of texts in a form suitable for machine processing.</p>
      <p>The main mathematical structure of the system was defined; it consists of two key
modules, one for identifying errors and one for correcting them. Both modules interact within the system, ensuring
the correct processing of the text sequence. A separate mathematical model, based on a probabilistic
approach, was built for the error identification problem. The modeling emphasizes
preserving the context and considering dependencies between tokens to
increase the accuracy of identification. For error correction, a mathematical model was built that calculates
the conditional probability of choosing the best candidate for correcting an erroneous token within
a given text context. The approach to data preparation through text vectorization, formation of a
training sample, and organization of the model training process based on large text corpora was
considered. The importance of building a representative corpus of training data, which includes
texts with real errors as well as artificially simulated errors that follow their
typological distribution, was emphasized.</p>
      <p>The model was trained on the UA_GEC corpus and demonstrated encouraging results in
correcting punctuation and basic spelling errors, especially within simple and clearly structured
sentences. At the same time, it is not effective enough in detecting lexical and morphological
errors, and still copes poorly with deeper syntactic rearrangements. The results outline clear
directions for further improvement of the system, namely expanding the training corpus with
examples of morphological and lexical errors and introducing additional processing
mechanisms that would compensate for the limited orthographic competence of the model. In the
future, improving these aspects will make it possible to create a more reliable and comprehensive tool for
automatic checking of Ukrainian-language texts.</p>
      <p>As a result, the developed mathematical model is a universal approach to processing Ukrainian
texts, which allows solving the problems of identifying and correcting errors within a single
system. The formal aspects of the interaction between the model components defined in this work
create the basis for its effective training and further implementation in practical tasks.</p>
      <sec>
        <title>Acknowledgements</title>
        <p>The research was carried out with the grant support of the National Research Fund of Ukraine,
"Information system development for automatic detection of misinformation sources and
inauthentic behaviour of chat users", project registration number 33/0012 from 3/03/2025
(2023.04/0012). We also thank the reviewers for their precise and concise
recommendations, which improved the presentation of the results.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4 for grammar and spelling
checking. After using this tool, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.
</p>
    </sec>
  </body>
  <back>
  </back>
</article>