=Paper=
{{Paper
|id=Vol-3861/paper9
|storemode=property
|title=Computational intelligence technology for Ukrainian language textual content processing based on big data analysis, NLP and machine learning
|pdfUrl=https://ceur-ws.org/Vol-3861/paper9.pdf
|volume=Vol-3861
|authors=Victoria Vysotska
|dblpUrl=https://dblp.org/rec/conf/ciaw/Vysotska24
}}
==Computational intelligence technology for Ukrainian language textual content processing based on big data analysis, NLP and machine learning==
<pdf width="1500px">https://ceur-ws.org/Vol-3861/paper9.pdf</pdf>
<pre>

                                Victoria Vysotska

                                Lviv Polytechnic National University, Stepan Bandera 12, 79013 Lviv, Ukraine

                                                Abstract
                                                The work aims to develop models, methods, and means of analysis and synthesis of computer linguistic
                                                systems (CLS) based on new and improved methods of processing Ukrainian-language textual content to
                                                solve natural language processing problems (NLP). The scientific novelty of the obtained results lies in
                                                solving an important scientific and applied problem of analysis and synthesis of CLS for solving various
                                                tasks of processing Ukrainian-language textual content based on developing new and improving known
                                                models, methods and means of NLP. The following new scientific results were obtained: – A model of
                                                intellectual analysis of the text flow, which, unlike the existing ones, is based on the processing of
                                                information resources, NLP and machine learning and defines the typical structures of the content
                                                integration, management and support modules; – Methods of information resource processing adapted for
                                                Ukrainian-language text, which take into account the needs of the permanent target audience based on the
                                                analysis of the history of that audience's activity on the CLS web resource; this made it possible to form
                                                a set of metrics and indicators of the effectiveness of CLS functioning for solving various NLP tasks; –
                                                A model of linguistic processing of text based on improved grapheme, morphological, lexical and syntactic
                                                analyses, which, unlike the existing ones, are adapted for processing Ukrainian-language text
                                                through regular expressions and machine learning; this made it possible to adapt the processing of
                                                Ukrainian-language text content and increase the accuracy of the obtained results depending on the specific
                                                NLP task; – A method of identifying keywords in Ukrainian-language texts based on grapheme and
                                                morphological analysis of word bases through regular expressions and N-grams was developed, which
                                                made it possible to increase the accuracy of searching for keywords, search for stable word combinations
                                                and categorize content; – A method of determining the style of the author of thematic Ukrainian-language
                                                text content was developed based on the keywords, stable word combinations, N-grams analysis, which
                                                made it possible to determine the stylistic contribution of each of the authors and increase the accuracy of
                                                the attribution of a scientific and technical publication; – A method was developed for calculating the degree
                                                of verification of the author of a Ukrainian-language text from a set of possible ones based on a comparative
                                                analysis of the styles of potential authors, which made it possible to increase the accuracy of classification
                                                based on the similarity of style; – Methods of analysis and synthesis of CLS were developed based on the
                                                creation of a general typical structure of the text content processing CLS in the Ukrainian language through
                                                support for modularity, modelling of the interaction of main processes and components, which made it
                                                possible to expand the collection of solutions to various typical tasks of the NLP by implementing typical
                                                software of such systems; – NLP methods, which, unlike the existing ones, are implemented on the basis of
                                                developed regular expressions for grapheme and morphological analysis of Ukrainian-language text and a
                                                modified Porter stemming algorithm as an effective means of identifying lemma affixes in order to
                                                delimit the analysed word, which made it possible to optimize the process and improve the accuracy
                                                of normalization of Ukrainian words and sentences; – Text tokenization and normalization methods, which, in
                                                contrast to the existing ones, use cascades of simple substitutions of developed regular expressions of
                                                matching with templates based on production rules, finite automata and the ontological model of the rules
                                                of the Ukrainian language syntax.

                                                Keywords
                                                Computer linguistic systems, NLP, Ukrainian-language, textual content, machine learning


                                CIAW-2024: Computational Intelligence Application Workshop, October 10-12, 2024, Lviv, Ukraine
                                   Victoria.A.Vysotska@lpnu.ua (V. Vysotska)
                                   0000-0001-6417-3689 (V. Vysotska)
                                           © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


                                CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction
The active development of information technologies (IT) is at the intersection of globalization and
informatization. The rapid rate of growth of society's informatization is directly related to the rate
of development and implementation of computer linguistic systems (CLS), the development of which
is based on models and methods of natural language processing (NLP) [1-3]. The complexity of
developing models, techniques, and tools of NLP lies in solving non-typical NLP problems and
adapting these models, methods, and tools to a specific natural language [4-6]. Each natural language
is unique, with its own rules, history, grammar, exceptions, and ways of generating
linguistic units to convey meaning, which complicates the development of a CLS.
    Usually, each successful CLS development project is designed for a specific task (for example,
machine translation [7-9], identification of plagiarism/rewriting [10-12], text rubrication [13-14], text
attribution analysis [15-21], information retrieval [22-28], referencing/abstracting [29-30], voice
assistants [31-33], intelligent chatbots [34-39], etc.) and is both one-off and closed (for example,
Amazon Alexa, Google Assistant, Facebook, Voice Mate, Bixby, Siri, ABBYY Lingvo, Microsoft Cortana,
Microsoft Word, Grammarly, Google Translate, PROMT, CuneiForm, Trados, OmegaT, Wordfast,
Dragon, IBM ViaVoice, Speereo, FineReader, Tesseract, OCRopus, etc.), so interested IT specialists
cannot examine its internals. In rare cases, the developers provide open access to
such CLS projects and the opportunity to get acquainted with their structure and content. The
development of an NLP application for any of the more than 7000 natural languages
and dialects is based on studying large monolingual/parallel text corpora of that language,
containing hundreds of millions of words, together with other linguistic resources. Research results
on such corpora are known for only about 20 natural languages (English, Chinese, Western European
languages, Japanese, etc.), which makes it possible to develop CLS of various complexity for these languages.
Unfortunately, the Ukrainian language is currently regarded by the international
scientific community as an exotic, low-resource language, i.e., it does not have enough
training, research and processed data to develop modern NLP applications. Such
applications are used to build CLS in cyber security (detection of fakes and propaganda, so-called
trolls/bots in social networks), sociology (analysis of the dynamics of changes in public opinion
on thematic issues), philology (automatic research of large data sets of various thematic orientations
and different periods), psychology (analysis of the psychological portrait of a person, identification
of post-traumatic stress disorder of participants in hostilities or occupation), national security
(information warfare), jurisprudence (criminology and court case), social communications (analysis
of community posts in social networks) and other important branches of modern Ukraine. The above
determines the relevance of the topic of the dissertation research.
    Scientific research by N. Chomsky, V.M. Glushkov, A.V. Hladkoy, D.V. Lande, V.A. Shyrokov,
N.V. Sharonova, N.F. Khairova, O.V. Bisikalo, S.N. Buk, N.P. Darchuk, Z.V. Partyka, A.V. Anisimova,
Yu.D. Apresyan, O.O. Marchenko, I.M. Kulchytskyi, A.O. Nikonenko, M. Gross, A. Lanten, V.H.
Yngve, S. Sharoff, Yu.A. Schrader, D. Jurafsky, B. Bengfort, J.H. Martin, L. Tesniere, T. Ojeda, P.M.
Postal, D.G. Hays, T.A. van Dijk, S. Marcus, J. Lyons, L.W. Tosh, Y. Bar-Hillel, D.G. Bobrow, G. Lakoff,
R. Bilbro, N. Kotsyba, A.Yu. Berko, Yu.M. Shcherbyna, V.Yu. Velychko, V.F. Starko and many others
make it possible to understand the basic principles of linguistic processing of the text depending on
the features of a specific natural language. More than 80% of such studies concern the processing of
English-language texts. There are fewer studies on Slavic languages, particularly the low-resource
Ukrainian language. In particular, there are no publications regarding the development
recommendations, functional requirements, general structure, or typical architecture of the CLS for
processing Ukrainian-language textual content. Directly applying the English language's models,
methods, algorithms, and IT processing to Ukrainian-language textual content does not yield positive
results. Already at the level of morphological analysis, a significant conflict arises between the
methods developed for English-language text and their use for Ukrainian-language text. For
example, the simple Porter stemming algorithm, without appropriate modification, does not correctly
separate the stem of a word from its inflexion, which leads to inaccurate identification of key
phrases, which, in turn, affects the solution of any NLP problem where it is necessary to quickly
identify a set of keywords (categorization, search, annotation, etc.). Determining the main features and
processes of linguistic analysis of Ukrainian-language texts will significantly facilitate the stages of
processing the text flow of information, such as integration, support and content management. In
turn, the adaptation of the processes of intellectual analysis of text content with the identification of
functional requirements for the relevant modules of the CLS will lead to the possibility of developing
its typical architecture based on the principle of modularity (adding components depending on the
content of the NLP task and the purpose of the CLS).
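
    The stemming mismatch described above can be illustrated with a minimal sketch. It is not the
modified Porter algorithm of this work: the ending list ENDINGS, the threshold MIN_STEM and both
function names are hypothetical and chosen only for illustration. The sketch shows that English-style
suffix rules leave a Ukrainian word form untouched, while even a tiny regular expression over
Ukrainian endings maps several inflected forms to one stem.

```python
import re

# Hypothetical, deliberately small list of Ukrainian inflectional endings.
# A real system would need a much larger, linguistically validated set.
ENDINGS = ["ами", "ями", "ого", "ому", "ими", "ові", "еві", "ою", "ею",
           "ий", "ій", "ах", "ях", "ам", "ям", "ом", "ем", "ів",
           "а", "я", "у", "ю", "и", "і", "о", "е"]

# Longest endings are listed first so that "ами" is preferred over "и".
_ENDING_RE = re.compile(
    "(" + "|".join(sorted(ENDINGS, key=len, reverse=True)) + ")$"
)

MIN_STEM = 3  # assumed minimum stem length, protects short function words


def strip_english_suffixes(word: str) -> str:
    """English-style suffix stripping; it has no effect on Cyrillic endings."""
    return re.sub(r"(ing|ed|es|s)$", "", word.lower())


def stem_uk(word: str) -> str:
    """Strip one known Ukrainian ending if the remaining stem is long enough."""
    w = word.lower()
    m = _ENDING_RE.search(w)
    if m and m.start() >= MIN_STEM:
        return w[:m.start()]
    return w


if __name__ == "__main__":
    for form in ["мова", "мови", "мову", "мовою", "мовами"]:
        print(form, strip_english_suffixes(form), stem_uk(form))
```

    Sorting the endings by length and requiring a minimum stem length are simple design choices that
keep the sketch predictable; they do not handle alternations, prefixes or homonymous endings.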
    The above testifies to the relevance of research in solving the significant scientific and applied
problem of analysis and synthesis of CLS for solving various tasks of processing Ukrainian-language
textual content, which will make it possible to increase the level of resourcefulness of the natural
Ukrainian language based on the development of new and improvement of known models, methods
and means of NLP.
    The work aims to develop models, methods, and means of analysis and synthesis of computer
linguistic systems based on new and improved methods of processing Ukrainian-language
textual content to solve problems of natural language processing. Achieving this aim requires
the following tasks to be performed:

   1.   To analyse the specifics of the construction of the CLS by systematizing the processes of their
        implementation and functioning, which will provide an opportunity to distinguish a class of
        systems whose functional properties allow to perform a quantitative assessment of the
        expected effects of the implementation of a typical CLS of processing Ukrainian-language
        textual content for solving various tasks of the NLP;
   2.   To develop information technology for the construction of CLS for the processing of
        Ukrainian-language text, which will make it possible to determine their basic structure,
        functional requirements, the sequence of setting and training the system, and general design
        principles;
   3.   To offer IT processing of information resources as integration, management and support of
        Ukrainian-language content based on the improvement of linguistic analysis of text content
        for the development of metrics for evaluating the effectiveness of the functioning of the CLS
        for solving various tasks of the NLP;
   4.   To develop methods of processing Ukrainian-language textual content for solving various
        problems of NLP to increase the accuracy of the obtained results;
   5.   To develop methods and means of intellectual analysis of textual content to increase the
        efficiency of solving various tasks of NLP;
   6.   To create software modules for processing Ukrainian-language textual content for solving
        various tasks of NLP and conducting experiments;
   7.   To test the obtained results by building and implementing applied CLS to process Ukrainian-
        language textual content.

   The object of research is the processes of analysis and synthesis of computer linguistic systems
for processing Ukrainian-language textual content.
   The research subject is models, methods, and means of processing Ukrainian-language textual
content to solve various problems of NLP.
   The following research methods were used to achieve the goal: the theory of formal grammars
and automata, the theory of sets, the theory of data and knowledge models, the theory of probability
and mathematical statistics, the theory of models, algorithms, and logical-linguistic numbers,
information theory, graph theory, and knowledge presentation methods for modelling the processes
of processing Ukrainian-language textual content and developing machine learning modules; models
and methods of processing and analysing textual content for the implementation of the processes of
solving various problems of NLP; methods of object-oriented and system analysis and design - for
design and development of CLS; the theory of relational databases, methods of artificial intelligence,
object-oriented programming - for the software implementation of the Ukrainian-language textual
content processing system for the solution of various NLP tasks. The practical significance of the
obtained results lies in the fact that they can be used to build applied CLS for processing Ukrainian-
language textual content. In particular, the following results are practically valuable:

   •   Applying the method of identifying stable word combinations when identifying
       keywords in Ukrainian-language scientific texts of a technical profile
       increases the accuracy of keyword search by 6-9% and highlights thematic terms
       in the text for further classification of the publication;
   •   A formal approach to the design of a content monitoring module for
       identifying keywords in Ukrainian-language texts based on web data mining, NLP and
       linguistic analysis of defined words of text content made it possible to develop the
       general structure of a typical CLS and increase the effectiveness of CLS functioning by 6-9%,
       depending on the specific NLP problem being solved;
   •   Applying the method of calculating the degree of verification of the author of a
       Ukrainian-language text based on the analysis of the styles of potential authors made it
       possible to increase the accuracy of identification by 6-12% and to decompose
       the method through the study of stylistic coefficients such as the coherence of speech, the
       degree of syntactic complexity, linguistic diversity, and the indices of concentration and
       exclusivity of the text;
   •   A content monitoring module that identifies a potential author of a text from a
       set of possible ones by comparing the analysis results of an author's reference
       text with those of the text under study reduces the candidate set to 9-34% of the
       total number of project participants, depending on the subject and time range of the scientific
       and technical publications, as well as the frequency of this author's publications in this
       period on a specific topic;
   •   Experimental testing of the method of identifying the author's style in Ukrainian-language
       texts based on web data mining and linguistic analysis of defined stop words shows that it
       allows the selection of content potentially similar in style from a set of a potential
       author's publications.
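
   The idea of comparing author styles through stop words can be sketched as follows. This is only an
illustration under stated assumptions, not the method evaluated in this work: the stop-word list
STOP_WORDS is a hypothetical short sample, and cosine similarity is swapped in here as a generic
similarity measure.

```python
import math
import re
from collections import Counter

# Hypothetical short sample of Ukrainian function words (an assumption).
STOP_WORDS = ["і", "та", "в", "у", "на", "що", "як", "до", "з", "не", "це", "для"]


def style_profile(text: str) -> list:
    """Relative frequency of each stop word among all tokens of the text."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(t for t in tokens if t in STOP_WORDS)
    total = len(tokens) or 1
    return [counts[w] / total for w in STOP_WORDS]


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equally long vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(query: str, candidates: dict) -> list:
    """Sort candidate authors by similarity of their reference text to the query."""
    q = style_profile(query)
    scores = {name: cosine(q, style_profile(text)) for name, text in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

   In practice the reference texts would need to be much longer than a sentence for such frequency
profiles to be stable; the ranking only narrows the candidate set and does not prove authorship.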

2. Related works
Determining the main processes and features of the linguistic analysis of Ukrainian-language texts
will significantly facilitate the stages of processing the text flow of content such as integration,
support and content management (Fig. 1). Adaptation of the processes of intellectual analysis of text
content with the identification of functional requirements for the relevant modules of the CLS will
lead to the possibility of developing a typical structure of similar systems based on the principle of
modularity (adding components depending on the content of the NLP task and the purpose of the
CLS). The application of the specified IT/methods/models in the typical structure of the CLS, adapted
for any process of processing Ukrainian-language textual content, is a necessary prerequisite for the
successful implementation of the CLS project for solving a specific task of the NLP, which requires
the use of an appropriate set of standard libraries, utilities and software with open source, which will
solve specialized functions of the project according to the needs of the end user. The state of the CLS
is determined by the tuple of its main properties at a specific moment in time or during the activity of the
corresponding NLP process: $s_i = (p_{i1}, p_{i2}, \ldots, p_{im})$, $i = 1, \ldots, n$, where $s_i$ is the i-th state
at a specific moment in time $t$ from the set $S$ with power $|S| = n$, and $p_{ij}$ is the ij-th property
of the state from the set $P$ with power $|P| = m$, which determines the behaviour of the CLS as
$p_{ij} = (r_{ij1}, r_{ij2}, \ldots, r_{ijk})$, $j = 1, \ldots, m$, where $r_{ijk}$ is a parameter of the specific property $p_{ij}$
for the state $s_i$. For any CLS, the state $s_i$ is one of the NLP processes, for example, the identification
of keywords and/or stable phrases, which precedes the next state $s_{i+1}$ of the system, such as the
rubrication of a text array of data. Accordingly, the properties $p_{ij}$ of the state $s_i$ are morphological,
lexical and syntactic; some NLP tasks may also have semantic ones, etc. Then, for each property $p_{ij}$, a set
of parameters is determined for the corresponding text analysis, depending on the specific task of NLP [40-50].
According to these parameters, the strategy of the CLS operation at the moment of time $t$ is specified
for:


[Figure 1 diagram. Blocks: Internet; Web site; client, server and technological subsystems; content
management, content integration and content support modules; module of linguistic analysis of
Ukrainian-language textual content; module for solving a specific NLP problem of Ukrainian-language
textual content; machine learning module; knowledge base; DB profiles; content data repository.]

Figure 1: Generalized structure of the computer linguistic system

   •   the parameters $r$ of the morphological property are N-grams and morphemes (roots,
       endings, affixes), the grammatical categories of the different parts of speech, word length,
       word placement in a sentence, number of syllables in a word, number of word
       occurrences, ratio of consonants to vowels, etc.;
   •   the parameters of the lexical property are the location of the sentence in the text, the
       location of the word in the sentence, the weight of the word, the weight of the
       sentence, the base of the word, the inflexion of the word, etc.;
   •   the parameters of the syntactic property are the depth of the word in the dependency tree of
       the sentence, the location of the word in the sentence, the number of occurrences of the
       word, the number of words per sentence, the numbers of words and sentences,
       whether the word is capitalized, hyphenated or compound, etc.;
   •   the parameters of the semantic property are the number of word occurrences, the depth of
       the word in the dependency tree, the size of paragraphs, the placement of paragraphs,
       etc.
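
   The state, property and parameter hierarchy listed above can be sketched as a small data structure.
The class and function names (State, Property, morphological_property, word_state) are hypothetical,
and only a few morphological parameters are computed; counting one syllable per vowel letter is a
simplifying assumption.

```python
from dataclasses import dataclass, field

VOWELS = set("аеєиіїоуюя")  # Ukrainian vowel letters


@dataclass
class Property:
    """One property p of a state: a named tuple of parameters r."""
    name: str
    params: dict = field(default_factory=dict)


@dataclass
class State:
    """One CLS state s: an NLP process described by its properties."""
    name: str
    properties: list = field(default_factory=list)


def morphological_property(word: str, position: int) -> Property:
    """Compute a few morphological parameters of a single word."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    vowels = sum(ch in VOWELS for ch in letters)
    # The soft sign "ь" is neither a vowel nor a consonant.
    consonants = sum(1 for ch in letters if ch not in VOWELS and ch != "ь")
    return Property("morphological", {
        "length": len(letters),
        "position_in_sentence": position,
        "syllables": vowels,  # assumption: one syllable per vowel letter
        "consonant_vowel_ratio": consonants / vowels if vowels else None,
    })


def word_state(word: str, position: int) -> State:
    """Build a keyword-identification state that holds only the morphological property."""
    return State("keyword_identification", [morphological_property(word, position)])
```

   Lexical, syntactic and semantic properties would be added to the same State as further Property
objects once the corresponding analysers are available.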

   Depending on the tuple of properties $p_{ij}$ of the state $s_i$, the behaviour of the CLS is determined, that is,
the implementation of a set of rules (activation of actions or events) for implementing a specific NLP
process depending on the input text data. Accordingly, an event $o$ is the change of one property to
another, $o: p \to p'$, upon the fulfilment of certain conditions $U$ for the input analyzed text $X$ and
the intermediate processed text $C$: $p' = o(p, U, X, C)$. An action $d$ is the process of activation of an
event $o_2$ by another event $o_1$ in the CLS: $C' = d(o_1 \circ o_2)$. The more complex the language (morphology,
syntax, etc.), the more difficult it is to process the corresponding texts in natural language. In
addition, for such low-resource languages as Ukrainian, there are no standardized rules and
dictionaries for processing texts in natural language to solve the relevant tasks of NLP. Many
scientific linguistic schools and IT specialists are working on creating Ukrainian dictionaries, text
corpora and rules for processing Ukrainian texts. However, these are usually linguists and
philologists unfamiliar with the features of specific modern tools, such as programming languages,
ML methods, big data analysis, etc. There is a colossal gap between the research results of philologists
and applied linguists, on the one hand, and IT specialists, on the other, which hinders the development
of tools for Ukrainian-language texts. Today, very few NLP tools for languages such as Ukrainian have
been made available for general access.
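
   The notions of event and action introduced in this section can be sketched as composable text
transformations. This is one possible reading, not the implementation used in this work: an event is
modelled as a regular-expression substitution applied to the intermediate text C when a condition U
on the input text X holds, and an action chains events in order. The helper names make_event and
action and the concrete rules are hypothetical.

```python
import re
from functools import reduce


def make_event(pattern, repl, condition=lambda x, c: True):
    """Build an event o that rewrites the intermediate text C if condition U(X, C) holds."""
    rx = re.compile(pattern)

    def event(x, c):
        return rx.sub(repl, c) if condition(x, c) else c

    return event


def action(*events):
    """Build an action d that applies the given events from left to right to the input X."""
    def run(x):
        return reduce(lambda c, o: o(x, c), events, x)

    return run


# A small normalization cascade (illustrative rules only).
normalize = action(
    make_event(r"[’`ʼ]", "'"),      # unify apostrophe variants
    make_event(r"\s+", " "),        # collapse runs of whitespace
    make_event(r"^\s+|\s+$", ""),   # trim leading and trailing spaces
    make_event(r".+", lambda m: m.group().lower(),
               condition=lambda x, c: not x.isupper()),  # keep all-caps input as is
)
```

   For example, normalize("  Комп’ютерна   лінгвістика ") yields "комп'ютерна лінгвістика", while an
all-uppercase input such as "НЛП" is left in upper case because the condition of the last event fails.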

3. Material and methods
The developed typical structure $S_{CLS}$ of a CLS consists of modules for solving a specific NLP task
$M_{NLP}$, content support $M_{sup}$, content integration $M_{int}$, content management $M_{man}$, linguistic
analysis $M_{ling}$ and intelligent analysis of textual content flows (IATCF) $M_{IATCF}$ [48]:

$$S_{CLS} = \langle M_{NLP}, M_{sup}, M_{int}, M_{man}, M_{ling}, M_{IATCF} \rangle. \quad (1)$$
   Accordingly, the module for solving a specific NLP problem $M_{NLP}$ is

$$M_{NLP} = \langle N, S_{conv}, S_{ord}, S_{goal}, S_{ROI}, P_{new}, I_{new} \rangle, \quad (2)$$
    where $S_{conv}$ is the average conversion rate, $S_{ord}$ is the average cost of orders, $S_{goal}$ is the average
cost or utility of the purpose of the visit, $S_{ROI}$ is the average return on investment (ROI),
$P_{new}$ is the percentage of profit from new visitors, and $I_{new}$ is the index of new buyers/customers at the
first visit.
    The presence of the 𝑀         text content support module reduces costs for moderators/analysts
who collect/analyze statistical data on the dynamics of the CLS functioning, the activity of the
permanent target audience as a reaction to website content changes, and the formation of rules for
the analysis of user information portraits and thematic content plots:

            𝑀      =< 𝐼       ,𝐾        ,𝑃        ,𝑃        ,𝑆    ,𝐼        ,𝑃        ,𝑃    ,𝐾        ,𝑃    >,    (3)
   where 𝐼    is the advertising quality index; 𝐾 is a brand recognition factor; 𝑃       and 𝑃     are
% of new/repeated customers and users; 𝑆       is average 𝑃ROI by type of advertising; 𝐼     and 𝑃
are index and % conversion of goals by type of advertising; 𝑃      and 𝑃    are % of visits by type of
media advertisement; 𝐾      is the conversion rate of goals by type of means.

                                       𝑃 (𝑤)             𝑁    +𝑁                                (4)
                          𝐼    (𝑤) =            ,𝐾    =               ,
                                       𝑃 (𝑤)             𝑁    +𝑁
   where 𝑃 (𝑤) is a function for determining % of visits from advertisement w; 𝑃 (𝑤) is a
function for determining % conversion of goals for visits from w; 𝐼 (𝑤) is a function for
determining the index of advertising quality w; 𝑁           is the total number of user queries of
intellectual and informational search (IIS) by keywords; 𝑁       is the number of direct visits to the
website; 𝑁     is the number of IIS requests with brand name.
   The presence of the 𝑀        text content integration module reduces the costs of CLS moderators
and content authors, automating/implementing some of their work/functions such as content
collection from several different reliable sources, its recognition, filtering, saving, formatting,
analysis, annotation, classification, etc.:

           𝑀     =< 𝑃 , 𝑃          ,𝑃    ,𝐾        ,𝐾        ,𝑃        ,𝑃        ,𝑆    ,𝑃    ,𝑆        ,𝑆    >,   (5)
   where 𝑃 , 𝑃 and 𝑃 are % of repeat visits of the user from the previous visit >𝑡 , within
[𝑡 ; 𝑡 ] when 𝑡 <𝑡 and <𝑡 days, respectively; 𝐾     is a brand recognition factor; 𝑃 and 𝑃 are
% of new/repeated visitors and interest; 𝑆 is the average number of clicks on advertising for 𝑁
visits; 𝑃  is the bounce rate for one web page; 𝑆    is the average number of web page views per
visit; 𝑆 is the average length of stay on the web page.

                     𝑁              𝑁                       𝑁                𝑁                  (6)
             𝑃    =       , 𝑆     =       ⋅𝑁 , 𝐾         =          , 𝑃 =         .
                      𝑁             𝑁                       𝑁                𝑁
   where 𝑁 is the number of direct web page visits; 𝑁             is the number of one-page visits to a
web page; 𝑁      is the number of visits for analysis; 𝑁      is the total number of visits; 𝑁   is the
average number of clicks on advertising; 𝑁      is the total number of actions on the page; 𝑁      and
𝑁    are the total number of all and interested users.
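
   Ratios of this kind are straightforward to compute from visit counters. The sketch below is only
illustrative: the function and argument names are hypothetical, and the formulas are the commonly
used web-analytics definitions (for example, bounce rate as the share of one-page visits among all
visits), assumed here rather than taken from equation (6).

```python
def percent(part: float, whole: float) -> float:
    """Share of `part` in `whole` as a percentage; 0.0 when `whole` is zero."""
    return 100.0 * part / whole if whole else 0.0


def integration_metrics(total_visits: int, one_page_visits: int, page_views: int,
                        all_users: int, interested_users: int) -> dict:
    """Compute a few visit-based indicators from raw counters."""
    return {
        # assumed definition: one-page visits / all visits
        "bounce_rate_pct": percent(one_page_visits, total_visits),
        # assumed definition: page views / visits
        "avg_views_per_visit": page_views / total_visits if total_visits else 0.0,
        # assumed definition: interested users / all users
        "interested_users_pct": percent(interested_users, all_users),
    }


if __name__ == "__main__":
    print(integration_metrics(total_visits=200, one_page_visits=50,
                              page_views=600, all_users=120, interested_users=30))
```

   Guarding every division against a zero denominator keeps the indicators defined for newly launched
resources that have not yet recorded any visits.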
   The presence of a text content management module reduces costs for moderators/administrators
who update the website and create rules for caching/searching popular information blocks:

         𝑀       =< 𝐾    ,𝑃     ,𝑃    ,𝑃     ,𝑃      ,𝑃    ,𝑃        ,𝑃    ,𝑃       ,𝐾   ,𝑆   >,   (7)
   where 𝐾 is an indicator of internal IIS; 𝑃      is % edition of the page with an error; 𝑃       and
𝑃 are % of mobile users with a high-speed Internet connection; 𝑃 and 𝑃 are % of users with
low/medium/high display resolution and with a specific operating system; 𝑃          and 𝑃 are % of
users with a specific browser and with English and/or Ukrainian language support; 𝐾              is an
indicator of the number of users, views and page visits. The 𝑆    indicator is the base of the content
management module:

                                𝑆     =< 𝑁      ,𝑁        ,𝑁    ,𝑁        >,                       (8)
   where 𝑁 and 𝑁 are the average number of page views per visit and for a specific time 𝑡;
𝑁 is the average number of unique users for a specific time 𝑡; 𝑁 is the average number of
visits for a specific time 𝑡. The indicator of internal search on the site:
                        𝐾 =< 𝑁 , 𝑃 , 𝑃 , 𝑃 , 𝑃 , 𝑃 , 𝑃 , 𝑆 , 𝑃 ,
   𝑃 ,𝑃 ,𝑃 ,𝑆 ,𝑇 ,𝑃 ,𝑃 ,𝐾                        >,
   where 𝑁       is the number of zero search results; 𝑃 and 𝑃 are % of users who were on the
page for > 𝑡 time and viewed > 𝑘 pages after the search; 𝑃 and 𝑃 are % of purchases made and
% of buyers among users using search; 𝑃           is % of rejections after visiting one page as a search
result; 𝑃 is % conversion from users using search; 𝑃 and 𝑃 are % of users who do not use and
use search; 𝑆      is the average number of pages viewed by visitors after a search; 𝑇 is the average
time spent on the site for a visit after a search; 𝑃 and 𝑃 are % of visitors who conduct several
searches during the visit and who left the site after viewing the search results; 𝑆 is the average
number of search results; 𝑃 is % of visits with search; 𝑃 is % of zero search results, in particular,

                                𝑁              𝑁              𝑁                             (9)
                        𝑃       =    , 𝑃 =           , 𝐾    =      ,
                                𝑁              𝑁              𝑁
   where 𝑁 , 𝑁 and 𝑁 are the numbers of all viewed pages, pages issued with an error and pages
viewed with a search, respectively; 𝑁 is the number of zero search results; 𝑁 and 𝑁 are the
numbers of visits without search and with search.
   The presence of a module for intellectual analysis of text streams of content reduces the
time/costs/personnel/resources for the timely and prompt acquisition of relevant, unique, current
content, which leads to an increase in the volume of the target audience of CLS, in particular,
contributes to the growth of the economic effect of the implementation:

                            𝑀       =< 𝑆   ,𝑆     ,𝑆           ,𝑃    ,𝑃        >,                  (10)
   where 𝑆   is the average conversion rate; 𝑆   is the average length of visit; 𝑆     is the average
number of views per visit; 𝑃   is % of unique customers/visitors/users; 𝑃       is % of new website
customers.
   Based on the tracking of 𝐾 events and interaction with the 𝐾 site, the following is analysed:

           𝐾 = 𝛼(𝐾 , 𝐾 ) =< 𝑃 , 𝑃 , 𝑃 , 𝐼 >, 𝐼 = (𝑅 + 𝑅 )/𝑁 ,                                   (11)
   where 𝑃      is % interaction with the site (for example, commenting, voting, registration,
authorization, subscription, etc.); 𝑃 is % of users who activate various events (for example, clicking
on an ad, starting a function, pausing, etc.); 𝑃 is % of users interacting with different types of
content presentation (viewing the next communication, panning, zooming, etc.); 𝐼        is the value of
the measure of usefulness, respectively, of the page/site/CLS/content; 𝑁      is the number of unique
page views; 𝑅      is profit from e-business; 𝑅     is the value of the utility measure of user visits
(based on transactions) and the purpose of user visits (based on the utility of goals).
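
   As a minimal illustration of the usefulness measure in (11), the sketch below assumes that 𝐼 is
the sum of the e-business profit and the goal-based utility divided by the number of unique page
views. The function and parameter names are hypothetical and do not come from the paper.

```python
def usefulness(ebusiness_profit: float, goal_utility: float, unique_page_views: int) -> float:
    """Usefulness per unique page view, read from (11) as I = (R + R) / N.

    Assumption: the two R terms are the transaction-based profit and the
    goal-based utility; N is the number of unique page views.
    """
    if unique_page_views <= 0:
        raise ValueError("unique_page_views must be positive")
    return (ebusiness_profit + goal_utility) / unique_page_views


# Illustrative numbers only: 300 units of profit, 100 units of goal utility, 200 views.
example_usefulness = usefulness(300.0, 100.0, 200)
```

   The value can be computed per page, per site or per content item, depending on which set of
page views is counted.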
   Analysis of success/effectiveness/operational search on the site:

           𝐾       =< 𝑃        ,𝑅       ,𝑆       ,𝑃     ,𝑃       ,𝑁    ,𝑅    ,𝑅        ,𝑁       ,𝑁     ,𝐼   >,             (12)
   where 𝑃      is the value of the usefulness of visiting 𝑃     site/page; 𝑅 is conversion rating in
e-business for CLS corresponding to the NLP task; 𝑆         is the value of average utility; 𝑃  is the
value of e-business profit for the CLS of the corresponding NLP task; 𝑃 is the value of the achieved
conversion of visits to the site/page of the CLS:

   𝑃     =                ,𝑅        =            ∙ 100%, 𝑆        =               ,𝑃        =𝑅        +𝑅      ,𝑃     =       ∙
                                                                 100%,
    where 𝑁      is the number of visits; 𝑅   is the usefulness of e-business; 𝑅  is the utility of the
goal; 𝑁     is the number of transactions; 𝑁      is the number of conversions.
    To attract new visitors and increase the volume of the permanent target audience, the impact 𝐼
of the IIS on the income of the site is calculated:

                                             𝐼       = (𝑅        −𝑅    )∙𝑁    ,                                            (13)
   where 𝑁 is the number of visits from the IIS; 𝑅     and 𝑅 are the utility of visits without and
with IIS.
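
   Formula (13) can be sketched as follows. Because the subscripts are not recoverable here, the
sketch assumes the difference is taken as the utility of a visit with IIS minus the utility of a
visit without it, so that a positive result means a gain. All names are hypothetical.

```python
def iis_income_impact(utility_with_iis: float, utility_without_iis: float, iis_visits: int) -> float:
    """Impact of the IIS on income: (utility difference per visit) * (visits from the IIS).

    Assumption: the sign convention is "with IIS minus without IIS".
    """
    if iis_visits < 0:
        raise ValueError("iis_visits must be non-negative")
    return (utility_with_iis - utility_without_iis) * iis_visits


# Illustrative numbers only.
example_impact = iis_income_impact(utility_with_iis=2.5, utility_without_iis=2.0, iis_visits=1000)
```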
   The topic of a set of keywords is one of the main indicators of IIS for identifying the specific
content of a page. Optimize investment for sets of keywords that increase conversion values. The
return on investment value (𝑃ROI ) must be positive (𝑁 > 𝑁 ), i.e.:

       𝑃 = (𝑁 − 𝑁 )/𝑁 ∙ 100% > 0, 𝑃 = ((𝑁 ∙ 𝐴 )/100 − 𝑁 )/𝑁 ∙ 100%,                            (14)
   where 𝑁 is expenses; 𝑁 is profit; 𝐴 is the amount of profit. It is then determined what share
q% of funds can be spent on a specific keyword in advertising without the risk of getting 𝑃ROI < 0.
The amount of funds for attracting users is calculated as:

                     𝐶 = ((𝑁 ∙ 𝐴 )/100) / (𝑃 /100 + 1), 𝐶 = 𝐶 ∙ 𝑅 /100.                          (15)
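
   The relationship between (14) and (15) can be checked with a short sketch. It assumes the
standard reading of return on investment, ROI = (profit − expenses)/expenses ∙ 100%, and solves it
for the expenses that yield a target ROI. The expected return is passed in as a single number (in
(15) it appears as the product 𝑁 ∙ 𝐴 /100); the function names are hypothetical.

```python
def roi_percent(profit: float, expenses: float) -> float:
    """Return on investment in percent: (profit - expenses) / expenses * 100."""
    if expenses <= 0:
        raise ValueError("expenses must be positive")
    return (profit - expenses) / expenses * 100.0


def max_spend(expected_return: float, target_roi_percent: float) -> float:
    """Largest spend that still reaches the target ROI.

    Obtained by solving the ROI formula for expenses:
    expenses = expected_return / (target_roi / 100 + 1).
    """
    denominator = target_roi_percent / 100.0 + 1.0
    if denominator <= 0:
        raise ValueError("target ROI must be greater than -100%")
    return expected_return / denominator


# Illustrative numbers only: spending max_spend(...) reproduces the target ROI.
budget = max_spend(expected_return=150.0, target_roi_percent=50.0)
achieved = roi_percent(profit=150.0, expenses=budget)
```

   With a target ROI of 0% the admissible spend equals the expected return, which is the
break-even boundary implied by the condition 𝑃ROI > 0.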
   The method of determining the effectiveness/quality of the CLS site for solving the NLP problem:
   Stage 1. Formulation and identification of usefulness according to the goals of the target audience
according to the input data from the tuple 𝑋.
   Stage 2. Activation of reports of the operation of the CLS from the tuple 𝑌 of the initial data:
   Step 1. Define an unlimited number of goals (4 goals for each target audience profile).
   Step 2. Identify the optimal volume of visits/time of the end user/customer for a successful
conversion.
   Step 3. Analyse the volume of the contribution of each goal to the total profit.
   Step 4. Combine goals by categories/directions/species.
   Step 5. Form separate sets of transactions as appropriate for the purposes.
   Stage 3. Support various marketing campaigns/customers through 𝑀              .
   Stage 4. Support for processing the service content of the site with the 𝑀         module.
   Stage 5. Updating the profiles of the target audience according to feedback support through the
𝑀      module, and analyzing user actions through the 𝑀            module.
   Stage 6. Integrating content from different sources through 𝑀 according to the achieved goals
and processing it through the 𝑀       module.
   Stage 7. Periodic checks are performed to see whether the goals are being achieved and whether
the profit is growing according to the goals. If the profit declines, go to stage 1. Otherwise, go to stage 2.
   A classified list of the input stream of content 𝑋 with a set of relevant properties demarcates
project participants through their typification and restriction of access rights depending on the
content: regular users, potential visitors, linguists, statistical analysts, administrators, content/rules
moderators, authors of unique content, information resource as content source etc. The typed
structure of the content input stream template with a set of relevant properties helps to define the
main functional requirements for the site/CLS and its typical structure and delineate the non-
functional capabilities, classify the sources, calculate the frequencies and the corresponding
restrictions/conditions of integration from the usual source:

                 𝑋 =< 𝑋 , 𝑋 , 𝑋 , 𝑋 , 𝑋 , 𝑋 , 𝑋 , 𝑋 , 𝑋 , 𝑋 , 𝑋 , 𝑋 , 𝑋 , 𝑋 >,                     (16)
    where 𝑋 is URL addresses of sources for databases (DB) of CLS filters; 𝑋 is content as a result
of integration from different 𝑋 sources according to a predetermined list of URLs without a
predetermined structure according to relevant thematic requests; 𝑋 is thematic requests of
visitors/users of the CLS site in the form of a set of keywords or persistent phrases; 𝑋 is actual data
of permanent users/profiles and a set of rules of permitted actions within the corresponding type of
user of the CLS; 𝑋 is statistical data of actions/ events/ phenomena of the subjects/objects of the
CLS for the solution of the corresponding NLP task and the rules for collecting/saving/analysing
statistics in specific time intervals of the CLS operation; 𝑋 is statistical data on the functioning of
the CLS; 𝑋 is contents of the DB/DS of content/rules/filters/annotations, etc. of the CLS; 𝑋 is
different types of linguistic dictionaries depending on the purpose of the CLS for solving a specific
NLP problem; 𝑋 is a set of personalized/anonymous reviews and comments of users to the relevant
content of CLS; 𝑋 is a tuple of the results of personalized/anonymous votes of regular/potential
users regarding the content of CLS; 𝑋 is statistical personalized individual actions of users of the
CLS; 𝑋 is set of external/internal advertising of thematic content; 𝑋 is thematic stickers of
information content (exchange rates, announcements, digests, weather, anecdotes, horoscope, etc.);
𝑋 is a tuple of options for setting up and changing the CLS/site configurations.
    Filling the tuple of the output data stream 𝑌 according to the purpose of the CLS for solving a
specific NLP problem directly depends on the content of the input classified stream of content 𝑋 with
a predetermined set of properties depending on the interaction with the site of the corresponding
types of project participants:

                           𝑌 =< 𝑌 , 𝑌 , 𝑌 , 𝑌 , 𝑌 , 𝑌 , 𝑌 , 𝑌 , 𝑌 , 𝑌 >,                           (17)
    where 𝑌 is text content as an information product or the result of providing an appropriate
information service for solving a specific NLP task on the CLS website; 𝑌 is a set of meaningfully
generated/cached pages as a result of thematic requests/IIS of users/visitors of the CLS site; 𝑌 is
annotations/digests/abstracts on textual thematic content; 𝑌 is a tuple of statistics of user/visitor
interaction with the site; 𝑌 is a tuple of the content of the profiles of regular users of the CLS
according to the personalized statistics 𝑌 for the corresponding generation of an individual portrait
of the user/audience at certain time intervals; 𝑌 is a tuple of meaningful recommended site content,
personalized for a specific regular user according to the profile/actions/interaction with the CLS in
certain time intervals; 𝑌 is a set of content topics/headings with the possibility of renewal according
to the results of the latest IIS/requests from regular site users; 𝑌 is a scheme of interrelationships of
textual thematic content according to the appropriate classification (current, relevant, author's,
outdated, popular, similar, last-viewed, often-viewed, consecutively by a certain most viewed, longer
viewed, most viewed from search engines or internal IIS, viewed by a typical group of users, etc.); 𝑌
is the set of content rating results on a predetermined scale within the corresponding ranking
classification; 𝑌 is a set of marked evaluation and ranking of user comments as the degree of
permission to publish on the site/page, if necessary, with a prohibition mark for a specific contributor
to write further comments and ranking by the degree of trust of all contributors. The list of the
output flow of content, its main features, the corresponding classification, and IT
generation/support/analysis contributes to the definition of precise general functional requirements
for implementing the CLS to solve any NLP problem.
     The model of the process of linguistic analysis of the Ukrainian-language text 𝑀 is presented

   𝑀      =< 𝑋, 𝑊, 𝐶, 𝐾, 𝑌, 𝐷, 𝑆                , 𝑆 , 𝑆 , 𝑆 , 𝑆 , 𝑆 , 𝑆 , 𝑆 , 𝑆 , 𝑆 , ,  ,  ,  ,  ,  ,  ,  >,
     where 𝑋 is the input data in the CLS from various sources of information 𝑊; 𝑌 is the original
relevant content from the CLS as a result of the IIS according to the requests of users/visitors; 𝑆 is
the process of linguistic analysis of content as a component of the IATCF subsystem 𝑆 ; 𝑆 is the
process of generation/modification of the rules of operation of all modules by the moderator of the
CLS; 𝑆 is the process of filling an unstructured database with integrated content 𝑋; S𝑆 is the
filling module of the structured database based on the processed integrated content 𝐶; 𝑆 and 𝑆
are processes of generating results according to the requests of visitors and users; 𝑆 is a cache
processing process for generating reports on popular requests from CLS users; 𝑆 is cache
filling/modification process; 𝑆 is the process of generating statistical results of the functioning of
the CLS/modules and the activities of users 𝐷;  is the operator of generation/modification of the
rules of operation of all modules from the moderator of the CLS;  is the operator of filling an
unstructured database with integrated content 𝑋;  is the operator of filling the structured database
based on the processed, integrated content of 𝐶;  and  are operators for generating results
according to the requests of visitors and users;  is a cache processing operator for generating
reports 𝑌 on popular requests from users;  is cache filling/modification operator with 𝐾 data;  is
an operator for generating statistical results of the functioning of the CLS/modules and user
activities:

         𝑆     =< 𝑋, 𝑌, 𝐶, 𝐷, 𝑅, , , , , , , , ,  > , 𝑌 =  ∘  ∘  ∘  ∘  ∘  ∘  ∘  ∘ ,                       (18)
   where 𝑋 is the input text data array; 𝑌 is a tuple of the original processed text according to the
purpose of the CLS; 𝐶 is a set of intermediate content, which is processed at the appropriate level in
the CLS; 𝐷 is auxiliary dictionaries; 𝑅 is a set of processing rules;  is grapheme analysis operator
(GA);  is morphological analysis operator (MA);  is lexical analysis operator (LA);  is operator of
syntactic analysis (SA);  is semantic analysis operator (SEM);  is ontological analysis operator;  is
reference analysis operator;  is structural analysis operator;  is operator pragmatic analysis (PA).
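
   The composition in (18) can be pictured as a chain of functions in which each analysis level
consumes the output of the previous one. The sketch below is only an illustration of that idea: the
three toy stages stand in for the nine operators, and their bodies are placeholders rather than the
analysis methods of the paper.

```python
from functools import reduce
from typing import Any, Callable


def compose(*stages: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Chain stages so that the first stage listed is applied first."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)


# Placeholder stages (assumptions for illustration, not the paper's operators).
def grapheme_stage(text: str) -> list[str]:
    return text.split()                          # split the text into chains


def morphological_stage(tokens: list[str]) -> list[str]:
    return [t.lower() for t in tokens]           # stand-in for normalization


def lexical_stage(tokens: list[str]) -> list[str]:
    return [t.strip(".,!?") for t in tokens]     # stand-in for token clean-up


pipeline = compose(grapheme_stage, morphological_stage, lexical_stage)
result = pipeline("Весела посмішка.")
```

   Writing the pipeline as a composition makes it straightforward to drop or reorder optional
levels, for example when reference or structural analysis is folded into SEM.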
   The primary process of linguistic analysis of textual content is presented:

              𝑌 = (𝐶 , 𝐷 , 𝑅 , (𝐶 , 𝐷 , 𝑅 , (𝐶 , 𝐷 , 𝑅 , (𝐶 , 𝐷 , 𝑅 , , (𝐶 , 𝐷 , 𝑅 , (𝐶 , 𝐷 , 𝑅 ,
                         (𝐶 , 𝐷 , 𝑅 , (𝐶 , 𝐷 , 𝑅 , (𝐶 , 𝐷 , 𝑅 , 𝑋))))))))),                                     (19)
   where the content sets 𝐶 = {𝐶 , 𝐶 , 𝐶 , 𝐶 , 𝐶 , 𝐶 , 𝐶 , 𝐶 , 𝐶 }, linguistic dictionaries
𝐷 = {𝐷 , 𝐷 , 𝐷 , 𝐷 , 𝐷 , 𝐷 , 𝐷 , 𝐷 , 𝐷 } and sets of production/association rules
𝑅 = {𝑅 , 𝑅 , 𝑅 , 𝑅 , 𝑅 , 𝑅 , 𝑅 , 𝑅 , 𝑅 }.
   The primary linguistic process of processing textual Ukrainian-language information to solve a
specific task of the NLP consists of nine stages:
   Stage 1. Grapheme analysis  of textual Ukrainian-language information 𝑋:

                 𝐶 = (𝑋, 𝐷 , 𝑅 ), 𝐶 =  ∘  ∘  ∘  ∘  ∘  ∘  ,                         (20)
   where 𝑋 is the input text data array;  is GA operator; 𝐶 is grapheme structure of the input text;
𝐷 is grapheme dictionaries and libraries; 𝑅 is GA rules;  is an optical character recognition
operator;  is grapheme parsing operator of the input text 𝑋 into sections, paragraphs and
sentences;  is grapheme analysis operator of linguistic chains into separate words;  is the
operator for forming a set of unrecognized chains;  is the operator of identification and marking
of unrecognized chains as numbers, dates, set phrases, abbreviations, proper and geographical
names, etc.;  is the operator for marking non-text strings as special symbols, formulas, figures,
tables, etc.;  is an operator for generating a marked linear sequence of words 𝐶 with official signs
and connections.
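
   A minimal sketch of the grapheme-analysis steps listed above (splitting into paragraphs and
sentences, splitting sentences into chains, and marking chains as words, numbers, dates or other
symbols) is given below. The regular expressions are simplified assumptions chosen for
illustration; they are not the regular expressions developed in the paper.

```python
import re

# Simplified patterns (assumptions): dd.mm.yyyy dates, integers/decimals, letter sequences.
DATE = re.compile(r"\d{1,2}\.\d{1,2}\.\d{4}")
NUMBER = re.compile(r"\d+(?:[.,]\d+)?")
WORD = re.compile(r"[^\W\d_]+(?:['’\-][^\W\d_]+)*")
TOKEN = re.compile(rf"{DATE.pattern}|{NUMBER.pattern}|{WORD.pattern}|\S")


def label(token: str) -> str:
    """Mark a chain as a date, number, word or other (non-text) symbol."""
    if DATE.fullmatch(token):
        return "date"
    if NUMBER.fullmatch(token):
        return "number"
    if WORD.fullmatch(token):
        return "word"
    return "other"


def grapheme_analysis(text: str) -> list[list[list[tuple[str, str]]]]:
    """Return paragraphs -> sentences -> (chain, label) pairs."""
    paragraphs = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        paragraphs.append(
            [[(tok, label(tok)) for tok in TOKEN.findall(s)] for s in sentences if s]
        )
    return paragraphs


parsed = grapheme_analysis("Київ засновано давно. Дата: 24.08.1991, ціна 12,5 грн!")
```

   Abbreviations, proper names and formulas would need additional dictionaries and rules; the
sketch only shows how marked chains can be produced as a linear sequence for the next stage.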
   Stage 2. Morphological analysis  of text content 𝐶 consists in the identification, analysis and
determination of the form and structure of words, in particular:

                 𝐶 = (𝐶 , 𝐷 , 𝑅 ), 𝐶 =  ∘  ∘  or 𝐶 =  ∘  ∘  ,                     (21)
    where  is the morphological segmentation operator of the grapheme-recognized chain of
symbols (words/tokens);  is a token lemmatization operator;  is the operator for marking parts
of speech for segmented words;  is the word stemming operator.
    Production rules for identification/generation of Ukrainian participles [51]:
    I. Formation of grammatical meanings: {𝐷 → 𝐷 (𝑥, 𝑦)}, where 𝑥 = (𝑎𝑐𝑡/𝑝𝑎𝑠); 𝑦 = (𝑝𝑟𝑒𝑠/
𝑝𝑎𝑠𝑡), for example, {𝐷 → 𝐷 (𝑝𝑎𝑠, 𝑝𝑟𝑒𝑠); 𝐷 → 𝐷 (𝑎𝑐𝑡, 𝑝𝑟𝑒𝑠), …}.
    II. Analysis of morphemes: {𝐷 (𝑎𝑐𝑡, 𝑝𝑟𝑒𝑠) → 𝑂 𝑡, 𝑑, 𝑎 𝐶(𝑎𝑐𝑡, 𝑝𝑟𝑒𝑠, 𝑎 )𝛷; 𝐷 (𝑎𝑐𝑡, 𝑝𝑎𝑠𝑡) →
𝑂 (𝑡̄, 𝑑, 𝑎 )𝐶(𝑎𝑐𝑡, 𝑝𝑎𝑠𝑡, 𝑎 )𝛷;                 𝐷 (𝑝𝑎𝑠, 𝑝𝑟𝑒𝑠) → 𝑂 𝑡, 𝑑 − 𝑑, 𝑎 𝐶(𝑝𝑎𝑠, 𝑝𝑟𝑒𝑠, 𝑎 )𝛷;
𝐷 (𝑝𝑎𝑠, 𝑝𝑎𝑠𝑡) → 𝑂 𝑡, 𝑑 − 𝑑, 𝑎 𝐶(𝑝𝑎𝑠, 𝑝𝑎𝑠𝑡, 𝑎 )𝛷}, where 𝑂, 𝐶, 𝛷 are designation of various
morphemes without description.
   III. Decomposition    of    the   verb   stem:   {𝑂 𝑎𝑡𝑒𝑚 → 𝑂 𝑎𝑡𝑒𝑚 𝑇;           𝑂 𝑑, ∅ 𝐶(𝑥, 𝑦) →
𝑂 𝑑, ∅ 𝐶 𝐶(𝑥, 𝑦, 𝐼); 𝑂 (𝑎𝑡𝑒𝑚) → 𝑂(𝑎𝑡𝑒𝑚)}, where 𝑇 is thematic element (TE) -и(і,ї)-/-а(я)-/-
ол(р)о-; 𝑎𝑡𝑒𝑚 is attribute value 𝑎 different from 𝑎𝑡𝑒𝑚, i.e (𝑎/𝑖/𝑎/𝚤̃/𝑜), 𝐶 is verb suffix; ∅ is any
attribute value other than ∅; 𝑥 and 𝑦 must satisfy the following condition: at 𝑥 = 𝑝𝑎𝑠 it is necessary
that 𝑦 = 𝑝𝑟𝑒𝑠.
    IV. TE identification: {(𝑎)𝑇𝛼 → 𝑂(𝑎)𝜁; 𝑂(𝚤̃)𝑇𝛼 → 𝑂(𝚤̃)𝜁; 𝑂(𝑎)𝑇 → 𝑂(𝑎)𝑎 +; 𝑂(𝑖)𝑇 → 𝑂(𝑖)𝑖 +;
𝑂(𝑜)𝑇 → 𝑂(𝑜)𝑜 +,;             𝑂 𝑑, 𝐼𝐼, 𝑎 𝑇𝐶(𝑎𝑐𝑡, 𝑝𝑟𝑒𝑠) → 𝑂 𝑑, 𝐼𝐼, 𝑎 𝑎 + 𝐶(𝑎𝑐𝑡, 𝑝𝑟𝑒𝑠);          𝑂 𝑑−
𝑑, 𝐼, 𝑎 𝑇𝐶(𝑝𝑎𝑠, 𝑝𝑟𝑒𝑠) → 𝑂 𝑑 − 𝑑, 𝐼, 𝑎 𝑎 + 𝐶(𝑝𝑎𝑠, 𝑝𝑟𝑒𝑠);       𝑂 𝑑 − 𝑑, 𝐼, 𝑖 𝑇𝐶(𝑝𝑎𝑠, 𝑝𝑟𝑒𝑠) → 𝑂 𝑑 −
𝑑, 𝐼, 𝑖 + 𝐶(𝑝𝑎𝑠, 𝑝𝑟𝑒𝑠); (𝑎, 𝐼𝐼)𝑇𝛽 → 𝑂(𝑎, 𝐼𝐼)𝑎 + 𝜉; 𝑂(𝚤̃, 𝐼)𝑇𝛽 → 𝑂(𝚤̃, 𝐼) + 𝜉}, where 𝜁 and 𝜉 are
arbitrary vowel and consonant; + is boundary between morphemes.
    V. Forming verbs with the appropriate morpheme: {𝑂(І, 𝑦)𝐶 → 𝑂(І, 𝑦)ува +; 𝑂(І, 𝑦)𝐶 →
𝑂(І, 𝑦)oва +;                𝑂(𝑦)𝐶 → 𝑂(𝑦);                𝑂 𝑡, 𝑑, н 𝐶 → 𝑂 𝑡, 𝑑, н + 𝐶(𝑝𝑎𝑠, 𝑝𝑎𝑠𝑡);
𝑂(𝑡, 𝑑, н)𝐶 𝐶(𝑝𝑎𝑠, 𝑝𝑟𝑒𝑠) → 𝑂(𝑡, 𝑑, н)ну + 𝐶(𝑝𝑎𝑠, 𝑝𝑟𝑒𝑠)}.
    VI. Suffix   identification:   {𝐶(𝑎𝑐𝑡, 𝑝𝑎𝑠𝑡, 𝐼 − 𝐼𝐼) → л +;     𝑂(𝑎𝑡𝑒𝑚)𝐶(𝑎𝑐𝑡, 𝑝𝑟𝑒𝑠, 𝐼) → уч +;
𝑂 𝑎𝑡𝑒𝑚 𝑌𝐶(𝑎𝑐𝑡, 𝑝𝑟𝑒𝑠, 𝐼) → юч +;                                    𝑂(𝑎𝑡𝑒𝑚)𝐶(𝑎𝑐𝑡, 𝑝𝑟𝑒𝑠, 𝐼𝐼) → ач +;
𝑂 𝑎𝑡𝑒𝑚 𝑌𝐶(𝑎𝑐𝑡, 𝑝𝑟𝑒𝑠, 𝐼𝐼) → яч +; 𝐶(𝑝𝑎𝑠, 𝑝𝑟𝑒𝑠/𝑝𝑎𝑠𝑡, 𝐼 − 𝐼𝐼) → н +; 𝐶(𝑝𝑎𝑠, 𝑝𝑟𝑒𝑠/𝑝𝑎𝑠𝑡, 𝐼 −
𝐼𝐼) → т +; 𝑂(𝑎𝑡𝑒𝑚)𝐶(𝑝𝑎𝑠, 𝑝𝑟𝑒𝑠/𝑝𝑎𝑠𝑡, 𝐼 − 𝐼𝐼) → ен +; 𝑂 𝑎𝑡𝑒𝑚 𝑌𝐶(𝑝𝑎𝑠, 𝑝𝑟𝑒𝑠/𝑝𝑎𝑠𝑡, 𝐼 − 𝐼𝐼) →
єн +;      𝑂(𝑎𝑡𝑒𝑚)𝐶(𝑝𝑎𝑠, 𝑝𝑟𝑒𝑠/𝑝𝑎𝑠𝑡, 𝐼 − 𝐼𝐼) → ува +;        𝑂 𝑎𝑡𝑒𝑚 𝑌𝐶(𝑝𝑎𝑠, 𝑝𝑟𝑒𝑠/𝑝𝑎𝑠𝑡, 𝐼 − 𝐼𝐼) →
юва +;       𝐶(𝑝𝑎𝑠, 𝑝𝑟𝑒𝑠/𝑝𝑎𝑠𝑡, 𝐼 − 𝐼𝐼) → овува +;      𝑂(𝑎𝑡𝑒𝑚)𝐶(𝑝𝑎𝑠, 𝑝𝑟𝑒𝑠/𝑝𝑎𝑠𝑡, 𝐼 − 𝐼𝐼) → ова +;
𝑂 𝑎𝑡𝑒𝑚 𝑌𝐶(𝑝𝑎𝑠, 𝑝𝑟𝑒𝑠/𝑝𝑎𝑠𝑡, 𝐼 − 𝐼𝐼) → йова +;              𝑂 𝑎𝑡𝑒𝑚 𝑋 𝐶(𝑝𝑎𝑠, 𝑝𝑟𝑒𝑠/𝑝𝑎𝑠𝑡, 𝐼 − 𝐼𝐼) →
𝑋 ьова+}, where 𝑌 is any suffix/TE; 𝑋 is soft consonant, 𝑋 is arbitrary consonant.

  Step (rule)   Derivation
  1.            R
  2. (I)        # Sж,од,н,3 Vод,тепер,3 #
  3. (III.1)    # Sж,од,н,3 Vод,тепер,3 Sч,од,зн,1 Sсер,од,ор,3 #
  4. (II.1)     # Sж,од,н,3 Sч,од,р,3 Vод,тепер,3 Sч,од,зн,1 Sсер,од,ор,3 #
  5. (II.2)     # Аж,од,н Sж,од,н,3 Sч,од,р,3 Vод,тепер,3 Sч,од,зн,1 Sсер,од,ор,3 #
  6. (II.2)     # Аж,од,н Sж,од,н,3 Ач,од,р Sч,од,р,3 Vод,тепер,3 Sч,од,зн,1 Sсер,од,ор,3 #
  7. (II.2)     # Аж,од,н Sж,од,н,3 Ач,од,р Sч,од,р,3 Vод,тепер,3 Sч,од,зн,1 Асер,од,ор Sсер,од,ор,3 #
  8-9.          ...
  10. (II.4)    # Аж,од,н Sж,од,н Ач,од,р Sч,од,р Vод,тепер,3 Sч,од,зн,1 Асер,од,ор Sсер,од,ор #
  11. (II.3)    # Аж,од,н Sж,од,н Ач,од,р Sч,од,р Vод,тепер,3 Sчзайм,од,зн,1 Асер,од,ор Sсер,од,ор #
  12-18.        ...
  Rules:          IV.6    IV.2      IV.6   IV.1  IV.7      IV.4  IV.6       IV.3
  Words:        # весела  посмiшка  твого  сина  наповнює  мене  безмежним  щастям #
Figure 2: An example of building a tree for parsing the dependencies of sentence words

  VII. Selection of the verb form (𝑓/𝑓) and inflection: {𝛷 → 𝛷(𝑓); 𝛷(𝑓, 𝑠) → ого, им, ому;
𝛷(𝑓, 𝑚) → ий; 𝛷(𝑓, 𝑤) → ою, ої; 𝛷(𝑓, 𝑠̄ ) → им, ими, их; 𝛷 → 𝛷 𝑓 ; 𝛷 𝑓, 𝑤 → а, 𝑦; 𝛷 𝑓, 𝑘 → е;
𝛷 𝑓, 𝑠̄ → 𝑖,; 𝐶(𝑝𝑎𝑠)𝛷 𝑓 → о}.
   VIII. Dictionary-based                    stem              identification:                  {𝑂 𝑡 − 𝑡̄, 𝑑 − 𝑑, I, 𝑎𝑡𝑒𝑚, у →
aвтоматиз+, буд+, мал +, . ..; 𝑂 𝑡 − 𝑡̄, 𝑑, I, 𝑎𝑡𝑒𝑚, ∅ → вес+, . ..; 𝑂 𝑡, 𝑑 − 𝑑, IІ, 𝚤̃, ∅ →
втрач+, . ..;   𝑂 𝑡̄, 𝑑, I, 𝑎, ∅ → втруч+, . ..;       𝑂 𝑡, 𝑑 − 𝑑, I, 𝚤̃, у → дослідж+, . ..;
𝑂(𝑡̄, 𝑑, I, 𝚤̃, у) → запізн+, . ..;            𝑂 𝑡, 𝑑, I, 𝑎, ∅ → кох+, . ..;                       𝑂 𝑡, 𝑑, IІ, і, ∅ → люб+, . ..;
𝑂 𝑡, 𝑑, I, 𝑎𝑡𝑒𝑚, ∅ → нес+, . ..;                    𝑂(𝑡, 𝑑, IІ, і, ∅) → поділ+, . ..;                        𝑂(𝑡, 𝑑, I, 𝑎𝑡𝑒𝑚, ∅) →
привес+, . ..;          𝑂(𝑡, 𝑑, I, 𝑎𝑡𝑒𝑚, у) → побуд+, розфарб+, . ..;                                            𝑂 𝑡̄, 𝑑, I, 𝑎, ∅ →
смі𝑗+, стогн+, . ..;        𝑂 𝑡, 𝑑, I, 𝑎, ∅ → спит+, . ..;       𝑂(𝑡̄, 𝑑, I, 𝑎𝑡𝑒𝑚, н) → усміх+, . ..;
𝑂 𝑡, 𝑑, I, 𝑎𝑡𝑒𝑚, 𝑦 → фарб+, . ..; 𝑂(𝑡, 𝑑, I, о, ∅) → мол+, . ..; 𝑂(𝑡̄, 𝑑, I, і, ∅) → змарн+, . ..;
…}.
    IX. Basic morphological rules: {𝛼 +→ 𝛼 + 𝑗𝛼 ; 𝑗 + и → і; о𝑍 + 𝐶(𝑝𝑎𝑠, 𝑝𝑟𝑒𝑠) + 𝛷 → 𝑎𝑍 +
𝐶(𝑎𝑐𝑡, 𝑝𝑟𝑒𝑠) + 𝛷; с' + W → ш + W; в' + W → вл' + W; б' + W → бл' + W; д' + W → дж' + W; т'+ W
→ ч + W; …; д + W → д' + W; с + W → с' + W; …; нн + Ф → н + о}, where 𝛼 and 𝛼 are arbitrary
vowels; 𝑗 is sound designation [j] (йот); Z is any sequence not longer than 3 characters; W = -е(є)н-
, -у(ю)ва-, -ова-, -овува-.
     X. Graphical and orthographic rules: {𝑗 + 𝑎 → я, 𝑗𝑎 → я; 𝑗 + у → ю, 𝑗у → ю; 𝑗 + е → є, 𝑗е → є;
…; Х + 𝑎 → Х + я; Х + у → Х + ю; Х + и → Х + і; Х + і → Х +; Х + е → Х + є}.
   XI. Erasure of the boundary indicator between morphemes: {𝐴 + 𝐵 → 𝐴𝐵}, where 𝐴 and 𝐵 are
any morphemes that none of the rules of groups IX-X apply to 𝐴 + 𝐵.
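
   Rule groups X and XI behave like a cascade of string rewrites: orthographic substitutions are
applied first, and the remaining boundary markers are erased afterwards. The sketch below
illustrates this with a small subset of group X; the input string is a toy example, the Latin
letter j stands for the sound [j], and the function names are hypothetical.

```python
# Toy subset of rule group X: [j] followed by a vowel is written as a single letter.
ORTHOGRAPHIC_RULES = [
    ("j+а", "я"), ("jа", "я"),
    ("j+у", "ю"), ("jу", "ю"),
    ("j+е", "є"), ("jе", "є"),
]


def apply_orthography(form: str) -> str:
    """Apply the orthographic substitutions in the listed order."""
    for pattern, replacement in ORTHOGRAPHIC_RULES:
        form = form.replace(pattern, replacement)
    return form


def erase_boundaries(form: str) -> str:
    """Rule group XI: remove the '+' markers that no earlier rule consumed."""
    return form.replace("+", "")


def surface_form(form: str) -> str:
    """Run group X first, then group XI."""
    return erase_boundaries(apply_orthography(form))


example_surface = surface_form("сміj+а+ти")
```

   Keeping the boundary marker until the last step lets the earlier rule groups refer to morpheme
edges, which is why erasure is the final rule group.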
   Stage 3. Lexical analysis  of the text content 𝐶 in the intermediate stage of the analysis of the
lexeme sequence to generate a parsing tree at the SA level:

               𝐶 = (𝐶 , 𝐷 , 𝑅 ), 𝐶′ =  ∘  , 𝐶′ =  ∘  ∘  or 𝐶′ =  ∘  ,                           (22)
   where        is a speech segmentation operator for identification/clarification of
words/phrases/tokens after MA;  is speech recognition or speech-to-text operator;  is optical
character recognition operator as the second part after GA and MA for clarifying incorrect moments
of recognition, taking into account the recognized adjacent tokens;  is the word
tokenization/segmentation operator as data preparation for building a parsing tree at SA;  is text-
to-speech.
   Stage 4. The syntactic analysis  of text content 𝐶 consists in building a tree for parsing word
dependencies (Fig. 2) in a sequence of lexemes based on their categories:

                                   𝐶 = (𝐶 , 𝐷 , 𝑅 ), 𝐶 =  ∘  ∘  ,                                     (23)
    where  is grammar induction implementation operator;  is the operator of
identification/elimination of boundary ambiguity or sentence violation;  is operator of syntactic
parsing of phrases/sentences for building a SA tree. Rules for formulating Ukrainian phrases:
    I. Choice of structure: {𝑅 → #𝑆 , ,н, 𝑉 ,тепер, #}, where 𝑉 is verb group, 𝑆 is noun group, 𝑥
is gender, 𝑦 is singular/од, or plural/мн; 𝑧 is the case, 𝑤 is the person.
    II. Noun group: {𝑉 , , , → 𝑆 , , , 𝑆 , ,р, ; 𝑆 , , , → 𝐴 , , 𝑆 , , , ; 𝐾 𝑆 , , , 𝐾 →
     , , , 𝐾 , 𝐾  𝐴 , , , 𝐾  𝑆 ; 𝑆 , , , → 𝑆 , , }.
𝐾 𝑆 займ
   III. Verb              group:              {𝑉 ,тепер, → 𝑉 ,тепер, 𝑆   ,   ,зн,   𝑆    ,     ,ор,   ;   𝑉 ,тепер, →
𝑉 ,тепер, 𝑆    ,   ,ор,   𝑆   ,    ,зн,   ;            𝑉 ,тепер, → 𝑉 ,тепер, 𝑆      ,   ,зн,      ;       𝑉 ,тепер, →
𝑉 ,тепер, 𝑆 , ,ор, }.
    IV. Substitution of words: {𝑆ч, , → син , , . ..; 𝑆ж, , → посмішкау, , . ..; 𝑆сер,у, → щастя , , . . . ;
 займ                   займ
𝑆х,од, , →я ;          𝑆х,од, , → ти ;          𝑉у,тепер, → наповнити ,тепер, , . ..;                        Ах,у, →
веселийх, , , безмежнийх, , , м𝑖йх, , , тв𝑖йх, , , . . . }.
    Stage 5. Semantic analysis  of the Ukrainian-language text 𝐶 consists of

                                     𝐶 = (𝐶 , 𝐷 , 𝑅 ), 𝐶 =  ∘  ,                                       (24)
   where  is the identification operator of lexical semantics with the generation of a collection of
values of each lexeme of the text;  is the relational semantics identification operator of the
interdependencies of the content of the lexemes of the text.
   Stage 6. Reference analysis  identifies inter-phrase units 𝐶 .

                                         𝐶 = (𝐶 , 𝐷 , 𝑅 ).                                   (25)
   Reference analysis is often part of SEM. For Ukrainian texts, when analysing large corpora, it
is best carried out as a separate stage (for example, when analysing the correspondence of a social
group/community in social networks or other dialogues, to identify logical, meaningful connections
between the posts of different participants, given the subjectivity of each participant's speech).
   Stage 7. Structural analysis  of the Ukrainian-language text 𝐶 is based on the degree of
coincidence of lexical and terminological units of text fragments. For short texts/messages, it is
often part of SEM or not used at all. For large corpora of texts, it serves as an additional stage
for eliminating the inaccuracy marked in SEM.
                                 𝐶 = (𝐶 , 𝐷 , 𝑅 ) or 𝐶 = (𝐶 , 𝐷 , 𝑅 ).                               (26)
   Stage 8. Ontological analysis of  text content 𝐶 on the basis or part of the results of SEM and
reference/structural analyses if necessary:

                  𝐶 = (𝐶 , 𝐷 , 𝑅 ), 𝐶 = (𝐶 , 𝐷 , 𝑅 ) or 𝐶 = (𝐶 , 𝐷 , 𝑅 ).                       (27)
   Stage 9. Pragmatic analysis of  text content 𝐶 is used to determine the text's structure by
considering the context of sentences when forming paragraphs, sections, and dialogues. PA is an
essential addition to SEM, reference, and structural analyses if these do not eliminate the
marked inaccuracy.

                          𝑌 = (𝐶 , 𝐷 , 𝑅 , 𝐶 , [𝐶 , 𝐶 , 𝐶 ], ), 𝑌 =  ∘  ,                            (28)
    where  is a semantics identification operator outside individual sentences/phrases;  is the
operator of text processing through higher-level NLP applications, for example, to simulate
intelligent behaviour and an apparent understanding of natural language.
    A general scheme/model of the CLS operation pipeline has been developed based on improved
methods of processing information resources (content integration, maintenance and management) and
on improved methods of intellectual and linguistic analysis of text flow using machine learning
technology (Fig. 3) [52-58]. The target audience interacts with the CLS through user feedback and
the output data of the ML model, which contributes to the adaptation of the selected learning
model. Five stages of relevant processes determine the basic architectural principles of building
a typical CLS. The stages of monitoring, developing and managing content are interaction,
formatting/filtering, NLP, ML and data accumulation in the DS. The content analysis and support
processes comprise feature analysis, deployment, prediction, interpretation, and content/result
presentation. At the interaction stage, a set of rules for integrating content from multiple
reliable sources at certain intervals is developed. In parallel, a set of rules for checking the
data entered by the CLS user is created as a preliminary step for the formatting/filtering stage,
which follows a collection of rules and content from the DS set in advance by the moderator. The
subsequent NLP stage is an intermediate stage preceding ML and data accumulation. The ML stage is
implemented through SQL queries and modules. The support process is easier to implement than the
management stage, especially when analysing the NLP results, during which additional lexical
resources and artefacts (dictionaries, translators, regular expressions, etc.) are created, on
which the effectiveness of the CLS functioning directly depends (Fig. 4) [52-58].

[Figure 3 (diagram). Processes of monitoring, development and management of content: Interaction,
Formatting/filtering, NLP, Machine learning, Accumulation of content/analysis of features, with the
operations Integration, Transformation, Normalization, Classification. Content analysis and support
processes: Presentation, Interpretation, Prognostication, Deployment, with Feedback, API,
Assessment, Modeling. Other labels: Input content, Relevant content, User requests, CLS website,
Computer linguistic system, Data storage.]

Figure 3: Scheme of the pipeline of the CLS operation

   The transition from raw text to a deployed ML model consists of additional content
transformations. First, the input text content is transformed into the input corpus as a collection
of texts, accumulated and stored in the DS. The incoming content is then grouped, filtered,
formatted, linguistically processed, marked, normalized and converted into vectors for further
processing. In the final transformation (Fig. 5) [52-60], a model is trained on the vector corpus to
create a generalized representation of the original content for further use in solving a specific
NLP problem.
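
   The "normalize, then convert into vectors" part of this transition can be sketched with a
simple bag-of-words count representation. This technique is swapped in purely for illustration,
since the vectorization method is not specified at this point; the corpus and all names are
assumptions.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lower-case the text and keep only letter sequences as tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def build_vocabulary(corpus: list[list[str]]) -> list[str]:
    """Sorted list of all distinct tokens in the tokenized corpus."""
    return sorted({token for document in corpus for token in document})


def vectorize(tokens: list[str], vocabulary: list[str]) -> list[int]:
    """Count vector of a document over the fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


raw_corpus = ["Текст про мову", "Мова і текст"]           # toy input collection
tokenized = [normalize(document) for document in raw_corpus]
vocabulary = build_vocabulary(tokenized)
vectors = [vectorize(document, vocabulary) for document in tokenized]
```

   In a full pipeline the tokens would first pass through the linguistic processing and marking
stages, so the vocabulary would consist of lemmas or stems rather than raw word forms.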

[Figure 4 (diagram). Processes of monitoring, development and management of content: Interaction,
Formatting/filtering, Linguistic processing, Content marking, Model training, with the operations
Integration, Transformation, Normalization, Vectorization, Calculations. Stores: Content archive,
Text collection, Marked case, Lexical resources, Model repository. Support processes: Content
collection, Content selection, Corpus analysis, Prognostication, Modeling. Other labels: Input
content, CLS website, Computer linguistic system.]

Figure 4: Scheme of the pipeline for processing Ukrainian-language textual content

[Figure 5 (diagram). The process of generating an optimal machine learning model: Monitoring,
Processing, Generation of the ML model, Data management, Learning the ML model, with the operations
Data collection, Transformation, Forming features set, Analysis of signs and parameters, Testing of
the ML model, Choice of ML model. Stores: Content archive, Set of content, Adjustment of
parameters, Model repository, Content repository. Optimization of the ML model: Data filtering,
Content selection, Model control, Choosing the optimal model, Model settings. Other labels:
Processed content, CLS cloud storage.]

Figure 5: Machine learning pipeline process

    NLP methods have been improved based on the developed 82 regular expressions (RGs) for pattern
matching in GA and more than 2000 RGs for the morphological analysis of Ukrainian-language texts.
The primary admissible RG operations are the concatenation and disjunction of symbols/chains/expressions,
counters and precedence operators, and anchors for the presence/absence of symbols in regular
expressions. The main stages of tokenization and normalization of Ukrainian text by cascades of
simple RG substitutions and finite automata are determined. Algorithms for word segmentation
and normalization, sentence segmentation, and modified Porter stemming are implemented and
described as an effective way of identifying lexeme affixes so that the analysed word can be marked.
The modified Porter stemming algorithm checks the obtained intermediate results against the tree of
inflexions (so as not to go through all possible inflexions) and against the content of thematic
dictionaries of bases, with a set of RG-rules for the identification of features
(classification by parts of speech).
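
    The tokenization and sentence segmentation described above can be illustrated with a minimal
Python sketch. The two patterns below are hypothetical stand-ins, not the paper's RGs (the 82 pattern
matching expressions and the morphological rules are not reproduced here), and the function names are
assumptions made for the example.

```python
import re

# Hypothetical patterns for illustration only.
# A sentence ends at ., !, ? or … followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")
# A word is a run of lowercase Ukrainian letters, optionally joined by an apostrophe or hyphen.
UK_LETTERS = "а-щьюяґєії"
WORD = re.compile(rf"[{UK_LETTERS}]+(?:['ʼ’-][{UK_LETTERS}]+)*")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences at terminal punctuation."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def tokenize(sentence: str) -> list[str]:
    """Lowercase the sentence and keep only alphabetic word tokens."""
    return WORD.findall(sentence.lower())


if __name__ == "__main__":
    for sentence in split_sentences("Метод опрацювання текстів. Це п'ятий приклад!"):
        print(tokenize(sentence))
```

    This simple splitter also breaks after abbreviations that end in a full stop, which is one reason
a cascade of substitution rules is needed in practice.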
    Stage 1. Identify the next lexeme as the word 𝑤 (𝑤 = 𝑤 ).
    Stage 2. Check against the stop word dictionary 𝐷 whether 𝑤 is a service word. If yes, then
𝑖 = 𝑖 + 1 and go to stage 1. Otherwise, go to stage 3.
    Stage 3. Go to the end of the word 𝑤 . Recognize the inflexion 𝑓 in 𝑤 from all possible ones
(the longest one is chosen; for example, in 𝑤 = текстова we choose the ending 𝑓 = ова, not 𝑓 = а)
using the RG of the word type 𝑅 , 𝑅 , or 𝑅 and, if it is present, delete the inflexion 𝑓 .
    Stage 4. Save the inflexion 𝑓 in the tag of the word 𝑤 .
    Stage 5. Label 𝑤 as type 𝑚 , 𝑚 or 𝑚 , respectively.
    Stage 6. Find the deleted inflexion 𝑓 in the tree of inflexions 𝑇 (the longest one is
chosen). Check the contents of the subtree 𝑇 against the existing word ending 𝑓 (𝑓 = 𝑓 + 𝑓 ).
If 𝑤 ends in 𝑓 and has a counterpart in 𝑇 , then store it in 𝑓 = 𝑓 and delete it from 𝑤 .
    Stage 7. Check the obtained base 𝑤 of the initial word 𝑤 against the content of the dictionary
of bases 𝐷 of words of the Ukrainian language. If there is no match, store < 𝑤 , 𝑤 > in
the additional temporary intermediate dictionary 𝐷 for the moderator and proceed to stage
1. Otherwise, proceed to stage 4.
    Stage 8. Analyse the inflexion and the presence/absence of alternation of letters in the
base/inflexions of the words < 𝑤 , 𝑤 > and the analogue of the base of the word in 𝐷 according
to the corresponding RG-rule of MA to identify additional features of the analysed word 𝑤 .
    Stage 9. Add the identified linguistic features of the recognized part of speech to the tag of the
word 𝑤 of the type 𝑚 , 𝑚 or 𝑚 , respectively. Save the results in the
corresponding dictionary 𝐷 of the analysed text.
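
    The core of the stages above (skip stop words, strip the longest matching inflexion, check the
base against a dictionary) can be sketched as follows. This is a simplified illustration, not the
implemented algorithm: the stop word set, the flat inflexion list (standing in for the tree of
inflexions) and the tiny dictionary of bases are hypothetical, and alternation of letters is ignored.

```python
# Hypothetical miniature resources; the real system uses a tree of inflexions,
# thematic dictionaries of bases and RG-rules per part of speech.
STOP_WORDS = {"та", "і", "в", "на"}
INFLEXIONS = ["ова", "ові", "ого", "ий", "ів", "а", "и", "у"]
BASES = {"текст": "noun", "метод": "noun", "систем": "noun"}


def stem(word: str):
    """Return (base, inflexion, part_of_speech), or None for a stop word.

    Inflexions are tried from longest to shortest; a candidate is accepted
    only if the remaining base is found in the dictionary of bases.
    """
    w = word.lower()
    if w in STOP_WORDS:
        return None
    for inflexion in sorted(INFLEXIONS, key=len, reverse=True):
        if w.endswith(inflexion) and len(w) > len(inflexion):
            base = w[: -len(inflexion)]
            if base in BASES:
                return base, inflexion, BASES[base]
    if w in BASES:
        return w, "", BASES[w]
    # Unknown base: in the paper such words go to a temporary dictionary for the moderator.
    return w, "", None


if __name__ == "__main__":
    print(stem("текстова"))  # the longer ending "ова" is preferred over "а"
```

    Unlike the stages above, this sketch falls back to a shorter inflexion when the base left by
the longest one is not in the dictionary; that choice is an assumption made to keep the example short.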
   Unlike the classic Porter algorithm, the modified one is adapted specifically for the Ukrainian
language and gives an accurate result in 85-93% of cases, depending on the quality, style and genre of
the text and, accordingly, the content of the CLS dictionaries. In total, about 1,300 RG-rules for
processing suffixes and endings, considering the alternation of letters, have been implemented for the
MA of Ukrainian-language nouns, 99 RG-rules for adjectives, and more than 800 RG-rules for verbs. The
algorithm for the minimum edit distance between strings of Ukrainian text is described, where the
distance is the minimum number of operations required to transform one string into another. Also, an
algorithm for calculating the maximum likelihood estimate for the 2-gram and 3-gram models based on
the analysis of word bases was developed to identify stable word combinations as keywords. To forecast
the conditional probability of the next word base, we use the Markov assumption (the probability of a
word depends only on the previous one).
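
   The minimum edit distance mentioned above can be sketched with the standard dynamic-programming
(Levenshtein) recurrence. Unit costs for insertion, deletion and substitution are an assumption; the
operation costs used in the paper are not stated.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("метод", "методи"))
```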
   Moreover, if the keywords are a set of nouns or an adjective with a noun, then other words, such as
verbs, participles, etc., are treated as additional separators, like punctuation marks, that
demarcate stable phrases as potential keywords. The order of bases is
not crucial for the Ukrainian language.
   Stage 1. Process the input text and break it into separate phrases (sentences) 𝑅₁ 𝑅₂ … 𝑅ₙ, marking
the start and end of each with the corresponding <p> </p> tag. Eliminate all non-alphabetic characters.
Convert uppercase letters to lowercase. Remove service words if necessary (for certain NLP tasks).
   Stage 2. Apply Porter's stemming to each phrase 𝑅 to obtain the sequence of word stems
𝑥₁ 𝑥₂ … 𝑥ₘ, taking into account word normalization.
   Stage 3. Receive input queries 𝑄₁ 𝑄₂ … 𝑄ₖ as sequences of words of the searched data. For each
query 𝑄, find the stems 𝑦₁ 𝑦₂ … of its words by stemming.
   For example, for the search phrase 𝑄 (translation: Method and tools for information systems
processing in electronic content commerce systems):

   Index   𝑦₁     𝑦₂    𝑦₃     𝑦₄     𝑦₅      𝑦₆      𝑦₇      𝑦₈        𝑦₉       𝑦₁₀
   Stem    метод  та    засіб  опрац  інформ  ресурс  систем  електрон  контент  комерц
   Count   58     190   25     62     122     83      170     89        408      300
   Stage 4. Conduct a statistical analysis of the occurrence of word stems and sequences of query
word stems in the analyzed text.
   Rows give the first stem of a 2-gram and columns give the stem that follows it.

                     𝑥₁     𝑥₂    𝑥₃     𝑥₄     𝑥₅      𝑥₆      𝑥₇      𝑥₈        𝑥₉       𝑥₁₀
                     метод  та    засіб  опрац  інформ  ресурс  систем  електрон  контент  комерц
    𝑥₁   метод       0      8     0      6      0       0       0       0         1        0
    𝑥₂   та          2      0     5      1      7       0       2       0         0        1
    𝑥₃   засіб       0      2     0      14     0       0       0       0         0        0
    𝑥₄   опрац       0      0     0      0      46      0       0       1         3        4
    𝑥₅   інформ      0      0     0      0      0       64      9       0         0        0
    𝑥₆   ресурс      0      7     0      0      0       0       0       1         0        0
    𝑥₇   систем      0      8     0      1      0       0       0       21        0        0
    𝑥₈   електрон    0      0     0      0      0       0       0       0         72       10
    𝑥₉   контент     0      10    0      0      0       0       0       0         0        73
    𝑥₁₀  комерц      0      6     0      0      0       0       0       0         176      0
     Stage 5. Find the probability of the appearance of 2-grams in the analysed text. In each row, the
value is divided by 𝑦ᵢ (last column), where 𝑖 is the row number.

                     𝑥₁     𝑥₂    𝑥₃     𝑥₄     𝑥₅      𝑥₆      𝑥₇      𝑥₈        𝑥₉       𝑥₁₀      𝑦ᵢ
                     метод  та    засіб  опрац  інформ  ресурс  систем  електрон  контент  комерц
    𝑥₁   метод       0      0.18  0      0.1    0       0       0       0         0.02     0        58
    𝑥₂   та          0.01   0     0.03   0.005  0.035   0       0.01    0         0        0.005    190
    𝑥₃   засіб       0      0.08  0      0.16   0       0       0       0         0        0        25
    𝑥₄   опрац       0      0     0      0      0.74    0       0       0.016     0.048    0.064    62
    𝑥₅   інформ      0      0     0      0      0       0.52    0.074   0         0        0        122
    𝑥₆   ресурс      0      0.08  0      0      0       0       0       0.012     0        0        83
    𝑥₇   систем      0      0.05  0      0.006  0       0       0       0.124     0        0        170
    𝑥₈   електрон    0      0     0      0      0       0       0       0         0.81     0.112    89
    𝑥₉   контент     0      0.03  0      0      0       0       0       0         0        0.179    408
    𝑥₁₀  комерц      0      0.02  0      0      0       0       0       0         0.053    0        300


   The resulting matrices will, in most cases, be sparse. The probability of a phrase and its
variations (plural/singular and cases), for example 𝑃(система електронної контент комерції), is:
                𝑃(електрон|систем) · 𝑃(контент|електрон) · 𝑃(комерц|контент) =
                                = 0.124 · 0.81 · 0.179 = 0.01797876.
   The SEM method has been improved based on the taxonomy of concepts, which specifies the
syntax of the Ukrainian language as the root concept of the ontology: 𝐶𝑜𝑛𝑐𝑒𝑝𝑡𝑠 : < 𝑅 > 𝐶′.
   In SEM, to identify the set of semes of the corresponding Ukrainian-language text and their
relationships, a semantic graph of the relations of linguistic units is first built from the results
of SA, taking into account the parts of speech of the words:

                 𝐶′ = (𝐶 , 𝐷 , 𝑅 , 𝐶𝑜𝑛𝑐𝑒𝑝𝑡𝑠 ), 𝐶𝑜𝑛𝑐𝑒𝑝𝑡𝑠 =< 𝐶                        ,𝐶           >,
   where 𝐶          is a tuple of concepts of phrase formation; 𝐶         is a tuple of sentence
generation concepts in the Ukrainian language. Tuple 𝐶       is given as:

                           𝐶              =< 𝑆𝑔𝑛       , 𝑆𝑔𝑛     , 𝑆𝑔𝑛         , 𝑆𝑔𝑛        >,
   where 𝑆𝑔𝑛        is a tuple of phrase generation properties:

                                   𝑆𝑔𝑛    =< 𝑆𝑔𝑛                , 𝑆𝑔𝑛   >,
                  𝑆𝑔𝑛       =< 𝑆𝑔𝑛   , 𝑆𝑔𝑛 , 𝑆𝑔𝑛                 , 𝑆𝑔𝑛 , 𝑆𝑔𝑛 , 𝑆𝑔𝑛                  >,
                 𝑆𝑔𝑛       =< 𝑆𝑔𝑛 , 𝑆𝑔𝑛    >, 𝑆𝑔𝑛                  =< 𝑆𝑔𝑛    , 𝑆𝑔𝑛                   >,
   where 𝑆𝑔𝑛       is a tuple of lexical signs of phrase generation; 𝑆𝑔𝑛 is a tuple of syntactic signs
of phrase generation; 𝑆𝑔𝑛            is a tuple of named properties; 𝑆𝑔𝑛          is a tuple of adjectival
properties; 𝑆𝑔𝑛       is a tuple of properties of numerals; 𝑆𝑔𝑛       is a tuple of pronominal properties;
𝑆𝑔𝑛     is a tuple of verb properties; 𝑆𝑔𝑛        is a tuple of adverbial properties; 𝑆𝑔𝑛      is a tuple of
consecutive properties and 𝑆𝑔𝑛 is a tuple of subordinate properties; 𝑆𝑔𝑛               is a tuple of ordinal
properties and 𝑆𝑔𝑛           is a tuple of subordinate properties.
    The tuple 𝑆𝑔𝑛         describes the component properties of a relation clause:

                             𝑆𝑔𝑛      =< 𝑆𝑔𝑛         , 𝑆𝑔𝑛       , 𝑆𝑔𝑛       >,
   where 𝑆𝑔𝑛       is a tuple of the properties of a separating connection, 𝑆𝑔𝑛       is a tuple of the
properties of a connecting connection, and 𝑆𝑔𝑛           is a tuple of the properties of an opposing
connection.

                             𝑆𝑔𝑛      =< 𝑆𝑔𝑛        , 𝑆𝑔𝑛        , 𝑆𝑔𝑛       >,
   where 𝑆𝑔𝑛          is a tuple of matching properties; 𝑆𝑔𝑛             is a tuple of control properties;
𝑆𝑔𝑛        is a tuple of adjacency properties. A tuple of sentence generation concepts: 𝐶                =<
𝑆𝑔𝑛 , 𝑆𝑔𝑛 , 𝑆𝑔𝑛 , 𝑆𝑔𝑛              >, where sentence generation properties are grouped in 𝑆𝑔𝑛
are a tuple of sentence generation properties; 𝑆𝑔𝑛    is a tuple of clause identification properties;

            𝑆𝑔𝑛   =< 𝑆𝑔𝑛  , 𝑆𝑔𝑛              , 𝑆𝑔𝑛   >,𝑆𝑔𝑛   =< 𝑆𝑔𝑛      , 𝑆𝑔𝑛                 >,
              𝑆𝑔𝑛  =< 𝑆𝑔𝑛 , 𝑆𝑔𝑛                >,𝑆𝑔𝑛    =< 𝑆𝑔𝑛      , 𝑆𝑔𝑛                 >,
    where 𝑆𝑔𝑛        is a tuple of narrative sentence generation properties; 𝑆𝑔𝑛              is a tuple of
properties for generating interrogative sentences; 𝑆𝑔𝑛          is a tuple of prompt sentence generation
properties; 𝑆𝑔𝑛        is a tuple of properties for generating emotionally neutral sentences; 𝑆𝑔𝑛
is a tuple of properties for generating emotional sentences; a tuple of concepts for the formation of
𝑆𝑔𝑛      simple and 𝑆𝑔𝑛        complex sentences; 𝑆𝑔𝑛             is a tuple of properties identifying the
main members of the sentence; 𝑆𝑔𝑛               is a tuple of the properties of the identification of the
secondary members of the sentence; 𝑆𝑔𝑛               =< 𝑆𝑔𝑛         , 𝑆𝑔𝑛      >; 𝑆𝑔𝑛        is a tuple of
properties for generating affirmative sentences; 𝑆𝑔𝑛          is a tuple of negative sentence generation
properties. To generate a simple sentence 𝑆𝑔𝑛          features are analyzed:

        𝑆𝑔𝑛      =< 𝑆𝑔𝑛       , 𝑆𝑔𝑛    , 𝑆𝑔𝑛     , 𝑆𝑔𝑛       , 𝑆𝑔𝑛   , 𝑆𝑔𝑛    , 𝑆𝑔𝑛    , 𝑆𝑔𝑛        >,
   where 𝑆𝑔𝑛        is a tuple of simple sentence generation properties.

4. Experiments, results and discussion
This section analyses the results of the experimental approbation of the developed methods and means of
linguistic and intellectual analysis of texts in the Ukrainian language, based on the development of
methods for identifying keywords, determining stable word combinations, thematic
classification of the text and detecting duplication of text. Let us consider the peculiarities of the
syntactic analysis of Ukrainian-language textual content aimed at identifying significant
keywords of input texts. Having determined the role and formal features of the syntactic analyser in
identifying keywords of the content topic, the procedures of the proposed method
were decomposed into two stages (Table 1), where A is the total number of keywords identified with a
given word weight, B is the number of significant words generated without pronouns and verbs, C is the
number of words coinciding with the author's list, D is the accuracy of the coincidence of identified
keywords with the author's list, and E is the number of additionally defined keywords not determined by
the author of the publication. In stage 1, the research for step 1 (analysis of full articles) and
step 2 (articles without metadata such as abstract, author keywords and list of references) was carried
out without the application of ML, and in stage 2 with ML. The method of article analysis without
metadata achieves the best results according to the density criterion. The author of the article often
defines a larger number of words (𝐴 ) and a smaller number of keywords (𝐴 ) than are present in the
text of the scientific and technical publication (Fig. 6). Unlike known parsers, the proposed method
provides self-improvement and self-learning of the keyword definition module through a mechanism that
identifies significant statistical parameters within the limits defined by the moderator. A system has
been developed on the Victana website, which allows users to choose from a list of languages of the
analysed text
(http://victana.lviv.ua/index.php/kliuchovi-slova). The value of 𝐴 differs from the value of 𝐴 by
0.69 (by number, but not by content); 𝐴 from 𝐴 by 1.74; 𝐴 from 𝐴 by 2.66; 𝐴 from 𝐴 by 3.58.
The value of 𝐴 differs from the value of 𝐴 by 4.36; respectively, 𝐴 from 𝐴 by 3.31; 𝐴 from 𝐴 by
2.39; 𝐴 from 𝐴 by 1.47. Adaptively changing the parameters/rules of the module almost doubles
the collection of identified keywords (for example, the value of 𝐴 is greater than 𝐴 by 1.144654; 𝐴
by 1.750524; 𝐴 by 1.557652; 𝐴 by 1.36478). The total increase in value obtained depending on the
moderation of dictionaries is, respectively, for 𝐴 is 14.46541; 𝐴 is 36.47799; 𝐴 is 55.7652; 𝐴 is
75.05241. When comparing 𝐴 is greater than 𝐴 ÷ 𝐴 and we have a chain of such values as 1.7985;
1.5084; 1.3217; and 1.176.

Table 1
Statistical data of the study of the content of scientific and technical publications
          Name     Word     Stage 1                               Stage 2
                   weight   A      B      C      D      E         A      B      C      D      E
          Step 1   ≥1       5.46   3.92   2.51   2.08   1.74      7.43   7.03   3.27   3      4.18
                   ≥2       1.08   0.88   0.63   0.59   0.26      2.67   2.64   1.65   1.54   1.12
                   ≥3       0.41   0.38   0.22   0.21   0.16      1.21   1.2    0.85   0.79   0.41
                   ≥4       0.15   0.13   0.09   0.09   0.04      0.46   0.45   0.33   0.31   0.15
                   ≥5       0      0      0      0      0         0      0      0      0      0
          Step 2   ≥1       6.51   5.02   2.68   2.23   2.37      8.35   7.78   3.25   2.91   4.99
                   ≥2       1.34   1.11   0.74   0.72   0.39      3.12   3.07   1.81   1.67   1.43
                   ≥3       0.51   0.45   0.29   0.27   0.17      1.42   1.4    0.93   0.85   0.54
                   ≥4       0.19   0.17   0.12   0.12   0.05      0.73   0.72   0.45   0.42   0.31
                   ≥5       0.11   0.1    0.06   0.06   0.04      0.33   0.32   0.25   0.23   0.1

[Charts a) and b), vertical axis from 0 to 10. Legend: author's keywords, number of words, Stage 1
Step 1, Stage 1 Step 2, Stage 2 Step 1, Stage 2 Step 2.]
Figure 6: Results of the analysis of more than 300 scientific and technical publications

   For different stages and steps of the experiment of processing the primary text, the average
coincidence of the lists of discovered keywords with the author's keywords varies in the range of
52.6-68.5%. The accuracy of matching keywords with the author's keywords ranges from 43.6 to
62.9%. The average match of meaningful keywords compared to all found by the system ranges from
38.9 to 75.8%, depending on the stages of analysis of article texts. The accuracy of matching keywords
compared to all found by the system varies between 34.3 and 71.9%, depending on the stages of analysis
of article texts. For 𝐴 , the module most often identified the number of keywords {5, 7, 3} (10),
although the distribution of found keywords was within [1;18] words (except 17).
   For 𝐴 , the module most often identified the number of keywords also {5, 7, 3}, although the
distribution of found keywords is within [1;18] (except 17), the number of identified words increased,
and the highest reliability index was achieved. For 𝐴 , the module most often identified the number
of keywords {7, 6, 5, 10, 8}, although the distribution of found keywords was within [2;14] (the range
narrowed significantly). For 𝐴 , the module most often identified the number of keywords {8, 5, 7,
10}, the distribution of identified keywords within [3;16] (accuracy improved). The accuracy of the
definition of keywords increases during the moderation of dictionaries and the ML module. The
difference between the number of keywords defined by the author and identified by the module at
𝐴 is 44.39919% (difference in %). Accuracy improves with 𝐴 is 33.70672%, significantly improving
with 𝐴 is 24.33809%, and with 𝐴 is 14.96945%.
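
   The kind of comparison reported above can be illustrated with a small Python sketch. The exact
formulas behind the paper's coincidence and accuracy figures are not given, so the definitions below
(share of the author's keywords that were found, share of found keywords that are the author's, and
the count of additional keywords) are assumptions, as are the function and key names.

```python
def keyword_overlap(found: set[str], author: set[str]) -> dict[str, float]:
    """Compare system keywords with the author's list (both given as sets of stems)."""
    common = found & author
    return {
        # Assumed definition: fraction of the author's keywords that the system found.
        "coincidence": len(common) / len(author) if author else 0.0,
        # Assumed definition: fraction of the system's keywords that are the author's.
        "share_of_found": len(common) / len(found) if found else 0.0,
        # Keywords found by the system but not listed by the author.
        "additional": float(len(found - author)),
    }


if __name__ == "__main__":
    found = {"метод", "систем", "контент", "комерц"}
    author = {"контент", "комерц", "ресурс"}
    print(keyword_overlap(found, author))
```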

[Four charts, a) to d). Horizontal axis: word weight (1 to 4 in a) and c), 1 to 5 in b) and d)).
Series: total words, meaningful words, coincidence with author's, match accuracy, additional words.]
Figure 7: Obtaining meaningful words at the stage: a) 1.1, b) 1.2, c) 2.1 and d) 2.2

[Four charts, a) to d), vertical axis from 0 to 10. Series: author's keywords, number of words, defined
by the system, meaningful words, coincidence with author's, match accuracy, additional words.]

Figure 8: Arithmetic mean occurrence of words at the stage: a) 1.1, b) 1.2, c) 2.1 and d) 2.2

   Analysis was performed for filtered texts without metadata and for unfiltered texts. The average
values obtained for filtered texts (𝑃𝑒𝑟 = 0.28) and unfiltered texts (𝑃𝑒𝑟 = 0.19) show that filtering
scientific articles improves keyword density by 1.48 times or 47.83% (Fig. 9a).

[Two charts, vertical axis from 0 to 2: a) filtered text and primary text; b) filtered text and general
text.]
Figure 9: Results of checking articles without specifying the thematic dictionary

    The values obtained for the texts, 𝑃𝑒𝑟 = 0.34 and 𝑃𝑒𝑟 = 0.25, taking into account the
refinement of the thematic dictionary through ML and the replenishment of blocked words, show
that filtering with simultaneous moderation of the thematic dictionary improves keyword density
by 1.35 times or by 35.44% (Fig. 9b). A comparison of the values in the original author's text, 𝑃𝑒𝑟 =
0.19 and 𝑃𝑒𝑟 = 0.25 without/with the refinement of the thematic dictionary, respectively,
demonstrates the effectiveness of moderating the thematic dictionary for the initial text: the
density of keywords increases 1.34 times or by 34.33% (Fig. 10a). A comparison of the values in the
filtered author's text, 𝑃𝑒𝑟 = 0.28 and 𝑃𝑒𝑟 = 0.34 without/with the refinement of the thematic
dictionary, respectively, demonstrates the effectiveness of moderating the thematic dictionary
for the filtered text, as the density of keywords increases 1.23 times or by 23.14% (Fig. 10b).
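
    The improvement factors quoted above are ratios of mean keyword densities. A minimal sketch of
that arithmetic follows; the function name is an assumption. Because the means are quoted to two
decimal places, ratios recomputed from them differ somewhat from the reported factors, which were
presumably computed from unrounded averages.

```python
def relative_gain(before: float, after: float) -> tuple[float, float]:
    """Return (ratio, percentage increase) of keyword density after a change."""
    ratio = after / before
    return ratio, (ratio - 1.0) * 100.0


if __name__ == "__main__":
    # Mean densities as quoted in the text (rounded to two decimals).
    comparisons = {
        "filtering, no dictionary refinement": (0.19, 0.28),
        "filtering, refined dictionary": (0.25, 0.34),
        "refinement, original text": (0.19, 0.25),
        "refinement, filtered text": (0.28, 0.34),
    }
    for name, (before, after) in comparisons.items():
        ratio, percent = relative_gain(before, after)
        print(f"{name}: x{ratio:.2f} (+{percent:.1f}%)")
```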

[Two charts, vertical axis from 0 to 1: a) general text with a detailed dictionary and general text
without a specified dictionary; b) filtered text with refined vocabulary and filtered text without
dictionary refinement.]


Figure 10: Results of analysis of articles with different dictionaries

    So, the experimental study confirmed the method's reliability - for different stages of processing
the primary text, the average coincidence of the lists of identified keywords with the author's
keywords varies in the range of 52.6-68.5% (by 9%). The accuracy of matching keywords with the
author's keywords ranges from 43.6 to 62.9%. The average match of meaningful keywords compared
to all found by the system ranges from 38.9-75.8%, depending on the stages of analysis of article texts.
The accuracy of matching keywords compared to all found by the system varies between 34.3-71.9%,
depending on the stages of analysis of article texts. A method of determining stable word
combinations when identifying textual content keywords in reference passages of the author's text
has been developed. The method uses Zipf's law to form stable word combinations as keywords,
taking into account the following rules of preliminary linguistic processing of the text: remove all
stop words; form bigrams only between punctuation marks and only from words that are not verbs or
pronouns (verbs and pronouns are treated as punctuation marks); identify verbs by their inflexions;
form bigrams from word bases without taking their inflexions into account; identify adjectives by
their inflexions and assume that, in Ukrainian-language texts, an adjective can only occupy the first
place in a bigram. A module has been developed to identify stable phrases as keywords in textual
content. An approach to developing linguistic content analysis software for determining stable word
combinations when identifying keywords of Ukrainian-language and English-language textual content is
proposed. The peculiarity of the approach is the adaptation of the linguistic and statistical analysis
of lexical units to the peculiarities of the constructions of Ukrainian and English words/texts. The
results of the experimental approbation of the proposed method of content analysis of English- and
Ukrainian-language texts for determining stable word combinations when identifying keywords of
technical texts were studied.
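
    The bigram formation rules listed above can be sketched in Python as follows. This is an
illustration under stated assumptions, not the developed module: the stop word and pronoun sets and
the verb and adjective endings are tiny hypothetical samples, stemming is left as an optional
callable (identity by default), and the Zipf-based selection step is reduced to frequency counting.

```python
import re
from collections import Counter
from typing import Callable

# Hypothetical miniature resources for illustration only.
STOP_WORDS = {"та", "і", "в", "у", "для", "на"}
PRONOUNS = {"він", "вона", "воно", "вони", "це", "цей", "ця"}
VERB_ENDINGS = ("ти", "ться", "ють", "ує")          # crude verb detection by inflexion
ADJ_ENDINGS = ("ий", "ій", "ого", "ої", "ова", "ове", "ові")

WORD = re.compile(r"[^\W\d_]+(?:['ʼ’-][^\W\d_]+)*")  # letters with inner apostrophe/hyphen


def is_separator(word: str) -> bool:
    """Verbs and pronouns are treated like punctuation marks."""
    return word in PRONOUNS or word.endswith(VERB_ENDINGS)


def is_adjective(word: str) -> bool:
    return word.endswith(ADJ_ENDINGS)


def candidate_bigrams(text: str, stem: Callable[[str], str] = lambda w: w) -> Counter:
    """Count candidate bigrams that never cross punctuation, verbs or pronouns."""
    counts: Counter = Counter()
    for segment in re.split(r"[.,;:!?()\n]+", text.lower()):
        prev = None
        for word in WORD.findall(segment):
            if word in STOP_WORDS:
                continue          # stop words are removed before bigrams are formed
            if is_separator(word):
                prev = None       # a verb or pronoun ends the current run
                continue
            # An adjective may only stand in the first place of a bigram.
            if prev is not None and not is_adjective(word):
                counts[(stem(prev), stem(word))] += 1
            prev = word
    return counts


if __name__ == "__main__":
    sample = ("Електронна комерція використовує контент. "
              "Система електронної комерції та текстова колекція.")
    for bigram, n in candidate_bigrams(sample).most_common():
        print(bigram, n)
```

    Passing a stemming function as `stem` makes the counts accumulate over word bases, so that
different case forms of the same phrase are counted together.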
    A method of identifying the style of the author of a text based on the analysis of linguistic
speech coefficients in a standard (reference text) has been developed. The technique consists of a
comparative study of the author's statistically processed work (the standard) and an
arbitrarily analysed passage. The method evaluates the probability that the text of an article belongs
to the author of the standard by analysing the relevant coefficients of lexical speech: the
concentration of the text 𝐼 , the coherence of the speech 𝐾𝑧, the uniqueness of the text 𝐼 , the
syntactic complexity of the speech 𝐾𝑠 and the linguistic diversity of the speech 𝐾𝑙. The degree of
speech coherence 𝐾𝑧 does not decrease significantly. In 2001, it changed within [0.5; 1.2], and in
2021 within [0.4; 0.9] (Fig. 11). Moreover, the method works under the condition that the author's
standard has already been researched; the task of NLP is to form the author's frequency dictionary,
including service/stop words.
    An algorithm for determining the stop words of text content based on linguistic analysis has been
developed. The markers of the individual style of the author's text are service/stop
words (for example, particles, conjunctions, prepositions, filler words, slang, etc.) unrelated
to the article's topic. The absolute and relative frequencies of stop words were analysed and compared
with the reference values for each excerpt. Applying the method of reference words thus identifies
which of the studied passages most likely belongs to the author of the standard.
Other results also confirm the effectiveness of the keyword method in author attribution of texts.
The proposed assumption about the insignificance of the influence of the share as a parameter of the
process on the results led to a decrease in the correlation coefficients but placed the probability of
belonging to the standard for passages in the correct order (Table 2). Excerpt 4 most likely belongs
to the author of the standard (although there is no significant difference between results 4 and 2: if
they were written in the same period, they do not belong to the author of the standard; if in different
periods from the standard, the probability of belonging to this author increases).

[Line chart of 𝐾𝑙, 𝐾𝑠 and 𝐾𝑧; horizontal axis from 1 to 218, vertical axis from 0 to 2.]
Figure 11: Analysis of the distribution of speech style parameters 𝐾𝑙, 𝐾𝑠 and 𝐾𝑧

Table 2
Correlation coefficients for stop words
   New numbering   Article number   Re–U     Particle   Conjunction   Preposition   Re–U
         1               4          0.7326   0.9594     0.9544        0.5639        0.6905
         2               2          0.7066   0.9580     0.5714        0.4928        0.4913
         3               1          0.6076   1          0.79          0.72          0.6900
         4               3          0.2810   0.8800     0.1624        0.1517        0.2254
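
    A comparison of this kind can be sketched as the correlation between the relative stop word
frequencies of a passage and those of the standard. The paper does not state which correlation
coefficient is used, so Pearson's coefficient is an assumption here, as are the function names and the
sample stop word list.

```python
from collections import Counter
from math import sqrt


def relative_frequencies(tokens: list[str], stop_words: list[str]) -> list[float]:
    """Relative frequency of each stop word (in the given order) among all tokens."""
    counts = Counter(t for t in tokens if t in stop_words)
    total = len(tokens)
    return [counts[w] / total if total else 0.0 for w in stop_words]


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


if __name__ == "__main__":
    stop_words = ["і", "та", "в", "на"]          # hypothetical sample list
    standard = "і метод та засіб і система в тексті".split()
    passage = "і контент та ресурс і колекція на сайті".split()
    print(pearson(relative_frequencies(standard, stop_words),
                  relative_frequencies(passage, stop_words)))
```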

    An algorithm for the linguistic analysis of Ukrainian-language texts and a syntactic analyser of
text content have been developed. The algorithm adapts the morphological
and syntactic analysis of lexical units to the peculiarities of the constructions of Ukrainian
words/texts. Algorithms for identifying significant stop words in Ukrainian-language text based on
regular expressions were tested. When parsing words belonging to a part of speech, declension within
this part of speech was taken into account. For this purpose, word inflexions were analysed for
classification, selection of the base and formation of the corresponding alphabetic-frequency
dictionaries. The dictionaries' contents were subsequently taken into account in the next steps of
determining the text's authorship by calculating the parameters and coefficients of the author's
speech. Software has been implemented for solving some NLP problems, namely the study of:

    • keywords (https://victana.lviv.ua/kliuchovi-slova);
    • stable phrases (https://victana.lviv.ua/nlp/stiiki-slovospoluchennia);
    • classification of textual content (https://victana.lviv.ua/kliuchovi-slova);
    • quantitative evaluations of speech (https://victana.lviv.ua/nlp/linhvometriia);
    • the author's style based on calculations of stylometry coefficients and their comparison with
      the corresponding coefficients in the standard text (https://victana.lviv.ua/nlp/stylemetriia);
    • differences in text signs (https://victana.lviv.ua/nlp/hlotokhronolohiia);
    • features of the style of texts based on N-grams (https://victana.lviv.ua/nlp/n-grams).
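
The frequency-dictionary step described above can be illustrated with a minimal Python sketch. It is
not the author's implementation (the system itself is reported to be written in PHP); the word lists
PREPOSITIONS and CONJUNCTIONS are hypothetical, deliberately tiny samples, and the tokenising regular
expression is an assumption about what counts as a word.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny samples of Ukrainian function words;
# a real system would load complete lists from its dictionaries.
PREPOSITIONS = {"в", "у", "на", "з", "до", "для", "про", "від"}
CONJUNCTIONS = {"і", "й", "та", "але", "або", "що", "як"}

# Assumption: a word is a run of letters, optionally joined by an
# apostrophe, as in "п'ять".
WORD_RE = re.compile(r"[^\W\d_]+(?:['’ʼ][^\W\d_]+)*")


def frequency_dictionary(text):
    """Return a Counter mapping each lower-cased word form to its count."""
    return Counter(WORD_RE.findall(text.lower()))


def function_word_counts(freq):
    """Count preposition and conjunction tokens in a frequency dictionary."""
    return {
        "prepositions": sum(freq[w] for w in PREPOSITIONS),
        "conjunctions": sum(freq[w] for w in CONJUNCTIONS),
    }


sample = "Система і портал та система в мережі."
freq = frequency_dictionary(sample)
counts = function_word_counts(freq)
```

Here `freq` holds the word-form counts of the sample sentence and `counts` the number of preposition
and conjunction tokens. The sketch counts surface forms only; stem selection and inflexion analysis
are left out.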

   The results of the experimental approbation of the proposed content monitoring method for
determining the author of Ukrainian-language scientific texts of a technical profile were studied.
More than 300 single-author technical works by 100 different authors from 2001–2021 were compared
to determine whether and how the text diversity coefficients of these authors change over different
periods. A method of identifying the potential (probable) author of a Ukrainian-language text based
on the analysis of the author's linguistic speech coefficients in a reference passage of the author's
text has been developed. The method was decomposed based on the analysis of such speech coefficients
as speech coherence, degree of syntactic complexity, linguistic diversity, and the indices of
concentration and exclusivity of the text. In parallel, the following parameters of the author's style
were analysed: the number of words in a specific text, the total number of words in this text, the
number of sentences, the number of prepositions, the number of conjunctions, the number of words with
a frequency of 1 and the number of words with a frequency of 10 and more, as well as keywords and
3-grams. For example, the 3-grams of three articles [61-63] (Ukrainian versions) were analysed. For
the most frequently used letters, the frequencies of 3-grams beginning with those letters have an
almost identical distribution across the articles (peak values in Fig. 12a), whereas for other letters
they do not. Therefore, to determine the degree to which a text belongs to a given author, it is
expedient to study only 3-grams whose initial letters occur less often in texts of the language in
question (for example, Fig. 12b). According to these graphs, Articles (1,2) are more likely to have
been written by the same author. The graphs alone would also admit a common author for Articles (1,3),
although this is not in fact the case. Articles (2,3) were written by different authors.
Applying linguistic and statistical analysis of 3-grams to a set of articles makes it possible to form
a subset of publications similar in linguistic characteristics. Imposing additional conditions in the
form of further linguistic and statistical analyses (a set of keywords, stable word combinations
(Table 3), stylometric and linguometric indicators, etc.) significantly reduces this subset and refines
the list of works most likely to be by the same author. Thus, analysing the content and frequency of
occurrence of function (service) words alone separates Articles (1,3) into different subsets while
leaving Articles (1,2) in one. 78.4814% of 3-grams were analysed for Article 1, 72.6332% for Article 2,
and 84.1271% for Article 3. The difference in the use of the corresponding 3-grams is R12 = 56.5254%
between Articles (1,2), R23 = 69.4271% between Articles (2,3), and R13 = 62.9839% between Articles
(1,3). Accordingly, Articles (1,2) are more similar to each other by 6–12% than Articles (1,3) and
(2,3): R23 > R12 by 12.9017%, R23 > R13 by 6.4432%, and R13 > R12 by 6.4585%, i.e. R23 > R13 > R12.
The smaller Rij is, the more likely it is that Articles i and j were written by the same author.
Articles (1,2) are therefore more likely to have been written by one author/team than Articles (1,3)
and (2,3).
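
The 3-gram comparison can be sketched in Python as follows. The text does not state the formula used
for Rij, so the measure below (total variation distance between relative 3-gram frequencies, expressed
as a percentage) is an assumption chosen for illustration and is not expected to reproduce the
reported values. The function names and the optional `initial_letters` filter, which mimics the
restriction to rarely used initial letters, are hypothetical.

```python
import re
from collections import Counter

# Assumption: 3-grams are taken inside words, never across word boundaries.
WORD_RE = re.compile(r"[^\W\d_]+")


def trigram_profile(text, initial_letters=None):
    """Relative frequencies of character 3-grams taken inside words.

    If initial_letters is given, only 3-grams starting with one of those
    letters are kept (for example, letters that are rare in the language).
    """
    counts = Counter()
    for word in WORD_RE.findall(text.lower()):
        for i in range(len(word) - 2):
            gram = word[i:i + 3]
            if initial_letters is None or gram[0] in initial_letters:
                counts[gram] += 1
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()} if total else {}


def profile_difference(p, q):
    """Total variation distance between two profiles, as a percentage."""
    grams = set(p) | set(q)
    return 50.0 * sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in grams)
```

With this measure, two texts with identical 3-gram distributions differ by 0% and two texts with no
3-gram in common differ by 100%, so a smaller value points to a closer style, in line with the
interpretation of Rij above.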

Table 3
List by frequency rating of stable phrases for Article 1
FREQ                                        t-test                                  LR                                     χ²
Phrase                       AF  RF         Phrase                       t          Phrase                       logL      Phrase                       χ²
система електронний           4  0.088889   система електронний          1.822222   інформаційний технологія     5.03e-1   прийняття рішення            45.000000
інформаційний система         4  0.088889   електронний контент-комерція 1.578091   інтелектуальний система      2.13e-1   система електронний          45.000000
електронний контент-комерція  3  0.066667   розділ науковий              1.319933   інформаційний система        8.36e-2   електронний контент-комерція 32.946429
розділ науковий               2  0.044444   інформаційний система        1.222222   портал науковий              5.58e-2   розділ науковий              29.302326
портал науковий               1  0.022222   прийняття рішення            0.977778   курс технологія              3.31e-2   курс технологія              21.988636
інтелектуальний система       1  0.022222   курс технологія              0.955556   сховище дані                 3.31e-2   сховище дані                 21.988636
прийняття рішення             1  0.022222   сховище дані                 0.955556   прийняття рішення            8.27e-3   портал науковий              14.318182
курс технологія               1  0.022222   портал науковий              0.933333   розділ науковий              1.89e-3   інформаційний система        5.848550
сховище дані                  1  0.022222   інтелектуальний система      0.777778   електронний контент-комерція 1.55e-4   інтелектуальний система      3.579545
інформаційний технологія      1  0.022222   інформаційний технологія     0.688889   система електронний          1.37e-6   інформаційний технологія     1.890409
Figure 12: Graph of the frequency distribution of 1-gram and 3-gram occurrences in Articles 1–3
(blue for Article 1 [61], orange for Article 2 [62] and grey for Article 3 [63])
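
Two of the association measures in Table 3 can be illustrated with the textbook formulas for bigram
collocations. This is a sketch under assumptions, not the author's code: the function names are
hypothetical, and the marginal word counts used in the example are inferred rather than given in the
text. Under the assumption that the sample contains N = 45 bigrams and that each word of a phrase
occurs only inside that phrase, the formulas give t ≈ 0.977778 and χ² = 45 for a phrase with absolute
frequency 1, and t ≈ 1.822222 and χ² = 45 for a phrase with absolute frequency 4, which agrees with
the corresponding table entries.

```python
import math


def t_score(c12, c1, c2, n):
    """Student's t for a bigram.

    c12: bigram count; c1, c2: counts of the first and second word;
    n: total number of bigrams. The sample variance is approximated
    by the observed bigram probability.
    """
    observed = c12 / n
    expected = (c1 / n) * (c2 / n)
    return (observed - expected) / math.sqrt(observed / n)


def chi_square(c12, c1, c2, n):
    """Pearson's chi-square for the 2x2 contingency table of a bigram."""
    o11 = c12                 # first word followed by second word
    o12 = c1 - c12            # first word followed by another word
    o21 = c2 - c12            # another word followed by second word
    o22 = n - c1 - c2 + c12   # neither word in its position
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator


# Assumed counts: the bigram and both of its words occur once among 45 bigrams.
t_single = t_score(1, 1, 1, 45)
chi2_single = chi_square(1, 1, 1, 45)
```

Higher values of either statistic indicate that the two words co-occur more often than independence
would predict, which is why the rankings in the t-test and χ² columns differ from the plain frequency
ranking.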

[Figure 13 panels: a) legend Collective 1–4, vertical axis 0–100%; b) legend Algorithm 1–4, vertical
axis 0–10; c) legend Algorithm 1–4, vertical axis 0–100%]

Figure 13: Style analysis: a – according to the developed algorithms 1-4; b – taking into account the
signs of speech; c – for analysed collective works 1-4 and the average value


Figure 14: Study of style at stage 2 for the text with the construction of a frequency dictionary: a –
complete with 100 words; b – the main one of 100 words; c – complete with 200 words; d – the main
one of 200 words; e – complete with 50 words; f – the main one of 50 words

    When identifying the author of a text, it is assumed that the text reflects the author's style of
writing, which makes it possible to distinguish that author from others. To compare texts with each
other, it is necessary to compare numerical characteristics of the text that are close for texts by
the same author and differ significantly for works by different authors. One such characteristic is
the density of the distribution of letter combinations of three consecutive symbols (3-grams). During
experimental testing, four different algorithms were developed for calculating the degree of
verification of the author of a Ukrainian-language text from a set of possible ones. The values
obtained confirm that the style of the authors numbered x and y is quite close (more than 90%) to the
style of collective works 1–4, respectively. Also, the share of authors with a similar speech style
was reduced from 42.02% to 34.04% of the 100 project participants (with more than 300 articles).
Figure 13 presents graphs of the results obtained when applying the algorithms of the method developed
to determine the author's style.
   Further, an analysis of the stop words and keywords of the authors' works was used to determine the
author's style, since 34.04% of the authors still fell into the similar-style subset. Each individual
has their own vocabulary for conveying thought, including so-called "parasitic" words (that is,
therefore, although, etc.) and service words (and, but, although, etc.). Figure 14 presents an example
of the analysis of the author's style in the second stage, which analyses the frequency of occurrence
of service words and keywords under various filters. Thus, a method of determining the style of the
author of thematic Ukrainian-language textual content was developed based on the analysis of keywords,
stable word combinations, N-grams, linguometry and stylometry, which made it possible to determine the
stylistic contribution of each of the authors and increase the accuracy of attribution of a scientific
and technical publication by 6%. A method for calculating the degree of verification of the author of
a Ukrainian-language text from a set of possible ones based on a comparative analysis of the styles of
potential authors has also been developed, which made it possible to increase the accuracy of
classification by style similarity by 7%.
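
The linguometric coefficients used in the method can be computed from a handful of counts. The sketch
below is illustrative only: this section names the coefficients and their input counts but does not
give the formulas, so the definitions in the code follow forms commonly used in linguometry and are
assumptions, as are the function and parameter names.

```python
def speech_coefficients(n_total, n_distinct, n_sentences,
                        n_prepositions, n_conjunctions,
                        n_hapax, n_frequent):
    """Return assumed linguometric coefficients of a text.

    n_total: total number of words; n_distinct: number of distinct words;
    n_hapax: words with frequency 1; n_frequent: words with frequency >= 10.
    All formulas are assumptions made for illustration.
    """
    return {
        # share of distinct words among all words
        "lexical_diversity": n_distinct / n_total,
        # fewer sentences per word means longer, more complex sentences
        "syntactic_complexity": 1 - n_sentences / n_total,
        # connective words per sentence, scaled by a factor of 3
        "speech_coherence": (n_prepositions + n_conjunctions) / (3 * n_sentences),
        # share of words used exactly once
        "exclusivity_index": n_hapax / n_total,
        # share of words used ten times or more
        "concentration_index": n_frequent / n_total,
    }


# Hypothetical counts for a 1000-word passage.
coefficients = speech_coefficients(
    n_total=1000, n_distinct=400, n_sentences=50,
    n_prepositions=90, n_conjunctions=60,
    n_hapax=250, n_frequent=20,
)
```

Comparing such a dictionary of coefficients for a disputed text with the one computed for a reference
passage of a candidate author gives a simple numerical basis for the style comparison described above.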

5. Conclusions
The work solves an important scientific and applied problem of analysis and synthesis of CLS for
solving various problems of processing Ukrainian-language textual content based on the
development of new and improvement of known models, methods and tools of NLP:

   1.   An analysis of the current state and prospects of IT development of natural language
        processing was carried out, which made it possible to define the problem and research tasks,
        as well as to form general research directions in the absence of non-commercial open-source
        software as CLS for processing Ukrainian-language textual content and a standardized design
        approach.
   2.   The relevance of solving the problem of analysis and synthesis of CLS based on the
        development of the general structure of the system for processing Ukrainian-language
        textual content is substantiated. The interaction of the main processes/components of the
        IS with methods of linguistic processing of textual content adapted to the Ukrainian language
        (based on grapheme, morphological, lexical, syntactic, semantic, structural, ontological and
        pragmatic analysis) made it possible to improve the IT of intellectual analysis of the text
        flow for solving a specific NLP task. This ensured the adaptation of NLP processes to the
        analysis of Ukrainian-language textual content and, on that basis, increased the accuracy of
        the obtained results by 6-48%, depending on the specific NLP task. For example, for the task
        of determining the keywords of a Ukrainian-language text, the density of keywords increases
        by a factor of [1.23; 1.48], i.e. by [23.14; 47.83]%, depending on the quality/accuracy with
        which the thematic dictionary is filled through machine learning.
   3.   The methods of processing information resources, such as integration, management and
        support of Ukrainian-language content, were improved, which made it possible to adapt the
        process of intellectual analysis of the text flow and develop metrics of the effectiveness of the
        CLS functioning for the solution of various tasks of the NLP. The developed methods and
        tools make it possible to build a CLS for processing Ukrainian-language text content
        according to the needs of the permanent/potential target audience based on the analysis of
        the history of actions of website users.
   4.   The NLP methods based on regular expressions of pattern matching were improved, which
        made it possible to adapt the methods of tokenization and text normalization by cascades of
        simple substitutions of regular expressions and finite state machines.
   5.   The MA method for Ukrainian-language text, based on word segmentation and
        normalization, sentence segmentation and a modified Porter stemming algorithm, was
        improved as an effective tool for identifying lemma affixes so that the analysed word can
        be tagged, which made it possible to increase keyword search accuracy by 9%.
   6.   The IT of the intellectual analysis of the text flow was improved based on the processing of
        information resources, which made it possible to adapt the general structure of modules for
        integration, management and support of content to solve various tasks of the NLP and
        increase the efficiency of the operation of the CLS by 6-9%. It became possible thanks to the
        combination of methods of linguistic analysis adapted to the Ukrainian language, improved
        IT processing of information resources, ML, and a set of metrics for evaluating the
        effectiveness of the CLS's functioning. The main principle of building such CLS is modularity,
        which facilitates their construction by requiring the availability of appropriate processes for
        solving a specific NLP problem.
   7.   A method of determining the author of Ukrainian-language texts has been developed based
        on the analysis of the coefficients of the author’s lexical speech in a reference passage of
        the author’s text. It relies on the study of a collection of keywords, stable phrases,
        indicators of linguometry and stylometry, and the results of N-gram analysis, which compares
        differences in 2-gram and 3-gram usage (in the range of [6; 7]% for publications similar in
        style and more than 12% for clearly dissimilar ones). This made it possible to determine
        a set of potential authors of publications from more than one author (up to [9; 34]% of the
        total number of project participants) and develop a method for identifying the author's style.
   8.   A method of determining stable word combinations was developed based on the
        identification of keywords of the Ukrainian-language text and the analysis of the linguistic
        speech coefficients of the author of the text in reference excerpts of the content, which made
        it possible to improve the accuracy of the method of determining the style of the author of
        the text by 9% based on statistical linguistics.
   9.   Relevant materials confirm the reliability of the scientific and practical results of the
        dissertation research through a comparison of the practical results obtained on
        different samples of reliable input data. The CLS was developed on the information resource
        http://victana.lviv.ua using CMS Joomla! (for designing the e-framework of articles), PHP
        (for implementing text content processing methods), HTML (for implementing page
        markup), CSS (for describing page styles), and MySQL (for storing data and dictionaries). The
        experimental study confirmed the reliability of the method of identifying keywords: for
        different algorithms for processing the primary text, the average match between the lists of
        identified keywords and the author's keywords varies in the 52.6-68.5% range. The accuracy
        of matching keywords with the author's keywords ranges from 43.6 to 62.9%. The average
        match of meaningful keywords compared to all found by the system ranges from 38.9 to 75.8%,
        depending on the stages of analysis of article texts. The accuracy of matching keywords
        compared to all found by the system varies between 34.3 and 71.9%, depending on the stages
        of analysis of article texts.

Acknowledgements
The research was carried out with the grant support of the National Research Fund of Ukraine,
"Information system development for automatic detection of misinformation sources and inauthentic
behaviour of chat users", project registration number 187/0012 from 1/08/2024 (2023.04/0012). Also,
we would like to thank the reviewers for their precise and concise recommendations that improved
the presentation of the results obtained.

References
[1] I. Lauriola, A. Lavelli, F. Aiolli, An introduction to deep learning in natural language processing:
    Models, techniques, and tools, Neurocomputing 470 (2022) 443-456.
[2] Y. Kang, Z. Cai, C. W. Tan, Q. Huang, H. Liu, Natural language processing (NLP) in management
     research: A literature review, Journal of Management Analytics 7(2) (2020) 139-172.
[3] L. Hickman, S. Thapa, L. Tay, M. Cao, P. Srinivasan, Text preprocessing for text mining in
     organizational research: Review and recommendations, Organizational Research Methods 25(1)
     (2022) 114-146.
[4] D. Hu, An introductory survey on attention mechanisms in NLP problems, in: Proceedings of
     the Intelligent Systems Conference on Intelligent Systems and Applications 2 (2020) 432-448.
[5] M. Gardner, W. Merrill, J. Dodge, M. E. Peters, A. Ross, S. Singh, N. A. Smith, Competency
     problems: On finding and removing artifacts in language data, arXiv preprint arXiv:2104.08646,
     2021.
[6] L. Wu, et al., Graph neural networks for natural language processing: A survey, Foundations
     and Trends in Machine Learning 16(2) (2023) 119-328.
[7] E. Fedorov, O. Nechyporenko, Linguistic Constructions Translation Method Based on Neural
     Networks, CEUR Workshop Proceedings 3396 (2023) 295-306.
[8] M.-A. Lefer, N. Grabar, Super-creative and over bureaucratic: A cross-genre corpus based study
     on the use and translation of evaluative prefixation in ted talks and EU parliamentary debates,
     Across Languages and Cultures 16(2) (2015) 187–208.
[9] M. Konyk, V. Vysotska, S. Goloshchuk, R. Holoshchuk, S. Chyrun, I. Budz, Technology of
     Ukrainian-English Machine Translation Based on Recursive Neural Network as LSTM, CEUR
     Workshop Proceedings 3387 (2023) 357-370.
[10] N. Shakhovska, I. Shvorob, The method for detecting plagiarism in a collection of documents,
     in: Proceedings of the International Conference on Computer Sciences and Information
     Technologies, CSIT, 2015, pp. 142-145.
[11] O. Karnalim, G. Kurniawati, Programming Style on Source Code Plagiarism and Collusion
     Detection, International Journal of Computing 19(1) (2020) 27-38.
[12] V. Vysotska, Y. Burov, V. Lytvyn, A. Demchuk, Defining Author's Style for Plagiarism Detection
     in Academic Environment, in: Proceedings of the International Conference on Data Stream
     Mining and Processing, DSMP, 2018, pp. 128-133.
[13] O. Barkovska, V. Kholiev, A. Havrashenko, D. Mohylevskyi, A. Kovalenko, A Conceptual Text
     Classification Model Based on Two-Factor Selection of Significant Words, CEUR Workshop
     Proceedings 3396 (2023) 244-255.
[14] A. Berko, Y. Matseliukh, Y. Ivaniv, L. Chyrun, V. Schuchmann, The text classification based on
     Big Data analysis for keyword definition using stemming, in: Proceedings of the IEEE 16th
     International Conference on Computer Science and Information Technologies, Lviv, Ukraine,
     22–25 September, 2021, pp. 184–188.
[15] V. Lytvyn, V. Vysotska, I. Budz, Y. Pelekh, N. Sokulska, R. Kovalchuk, L. Dzyubyk, O.
     Tereshchuk, M. Komar, Development of the quantitative method for automated text content
     authorship attribution based on the statistical analysis of N-grams distribution, Eastern-
     European Journal of Enterprise Technologies, 6(2(102)) (2019) 28–51. doi:10.15587/1729-
     4061.2019.186834.
[16] I. Khomytska, I. Bazylevych, V. Teslyuk, I. Karamysheva, The chi-square test and data clustering
     combined for author identification, in: Proceedings of the IEEE XVIIIth Scientific and Technical
     Conference on Computer Science and Information Technologies, 2023.
[17] I. Khomytska, V. Teslyuk, The Multifactor Method Applied for Authorship Attribution on the
     Phonological Level, CEUR workshop proceedings 2604 (2020) 189-198.
[18] R. Romanchuk, V. Vysotska, V. Andrunyk, L. Chyrun, S. Chyrun, O. Brodyak, Intellectual
     Analysis System Project for Ukrainian-language Artistic Works to Determine the Text
     Authorship Attribution Probability, in: Proceedings of the International Scientific and Technical
     Conference on Computer Sciences and Information Technologies, 2023.
[19] I. Khomytska, V. Teslyuk, A. Holovatyy, O. Morushko, Development of methods, models, and
     means for the author attribution of a text, Eastern-European Journal of Enterprise Technologies
     3(2(93)) (2018) 41–46. doi: 10.15587/1729-4061.2018.132052.
[20] I. Khomytska, V. Teslyuk, Authorship and Style Attribution by Statistical Methods of Style
     Differentiation on the Phonological Level, Advances in Intelligent Systems and Computing 871
     (2019) 105–118. doi: 10.1007/978-3-030-01069-0_8.
[21] R. Nazarchuk, S. Albota, Tweets about Ukraine during the russian-Ukrainian War: Quantitative
     Characteristics and Sentiment Analysis, CEUR Workshop Proceedings 3426 (2023) 551-560.
[22] A. Taran, Terminology of Computational Linguistics in Terms of Indexing and Information
     Retrieval in the System "iSybislaw", CEUR Workshop Proceedings 2870 (2021) 225-234.
[23] N. Kunanets, H. Matsiuk, Use of the Smart City Ontology for Relevant Information Retrieval,
     CEUR Workshop Proceedings 2362 (2019) 322-333.
[24] K. Nataliia, M. Halyna, Application of Saaty Method While Choosing Thesaurus View Model of
     the "Smart city" Subject Domain for the Improvement of Information Retrieval Efficiency, in:
     Proceedings of the IEEE 13th International Scientific and Technical Conference on Computer
     Sciences and Information Technologies, CSIT, vol. 2, 2018, pp. 21-25. doi:10.1109/STC-
     CSIT.2018.8526656.
[25] Y. Burov, V. Vysotska, L. Chyrun, Y. Ushenko, D. Uhryn, Z. Hu, Intelligent Network Architecture
     Development for E-Business Processes Based on Ontological Models, International Journal of
     Information Engineering and Electronic Business 16(5) (2024) 1-54. doi:10.5815/ijieeb.2024.05.01.
[26] P. Zweigenbaum, S.J. Darmoni, N. Grabar, The contribution of morphological knowledge to
     French MeSH mapping for information retrieval, in: Proceedings of the Annual AMIA
     Symposium, 2001, pp. 796–800.
[27] É. Bigeard, F. Thiessard, N. Grabar, Detecting drug non-compliance in internet fora using
     information retrieval and machine learning approaches, Studies in Health Technology and
     Informatics 264 (2019) 30–34.
[28] V. Claveau, T. Hamon, S. Le Maguer, N. Grabar, Health consumer-oriented information
     retrieval, Studies in Health Technology and Informatics 210 (2015) 80–84.
[29] V. Lytvyn, Y. Burov, V. Vysotska, Y. Pukach, O. Tereshchuk, I. Shakleina, Abstracting Text
     Content Based on Weighing the TF-IDF Measure by the Subject Area Ontology, in: Proceedings
     of the IEEE International Conference on Smart Information Systems and Technologies (SIST),
     Nur-Sultan, Kazakhstan, 2021. URL: https://ieeexplore.ieee.org/document/9465978.
[30] A. Périnet, T. Hamon, Distributional analysis applied to specialized texts. Reduction of data
     sparseness by context abstractions, Traitement Automatique des Langues 56(2) (2015) 77–102.
[31] V. Trysnyuk, Y. Nagornyi, K. Smetanin, I. Humeniuk, T. Uvarova, A method for user
     authenticating to critical infrastructure objects based on voice message identification, Advanced
     Information Systems 4(3) (2020) 11–16. doi:10.20998/2522-9052.2020.3.02.
[32] O. Bisikalo, O. Boivan, N. Khairova, O. Kovtun, V. Kovtun, Precision automated phonetic
     analysis of speech signals for information technology of text-dependent authentication of a
     person by voice, CEUR Workshop Proceedings 2853 (2021) 276–288.
[33] A. Sartiukova, O. Markiv, V. Vysotska, I. Shakleina, N. Sokulska, I. Romanets, Remote Voice
     Control of Computer Based on Convolutional Neural Network, in: Proceedings of the IEEE 12th
     International Conference on Intelligent Data Acquisition and Advanced Computing Systems:
     Technology and Applications (IDAACS), Dortmund, Germany, 7 September 2023, pp. 1058-1064.
[34] S. Kubinska, R. Holoshchuk, S. Holoshchuk, L. Chyrun, Ukrainian Language Chatbot for
     Sentiment Analysis and User Interests Recognition based on Data Mining, CEUR Workshop
     Proceedings 3171 (2022) 315-327.
[35] V. Husak, O. Lozynska, I. Karpov, I. Peleshchak, S. Chyrun, A. Vysotskyi, Information System
     for Recommendation List Formation of Clothes Style Image Selection According to User’s Needs
     Based on NLP and Chatbots, CEUR Workshop Proceedings 2604 (2020) 788-818.
[36] A. Medvedyk, M. Lohoida, Z. Rybchak, O. Kulyna, IT Slang: Development of Telegram Chatbot,
     CEUR Workshop Proceedings 3396 (2023) 152-162.
[37] O. Romanovskyi, N. Pidbutska, A. Knysh, Elomia Chatbot: The Effectiveness of Artificial
     Intelligence in the Fight for Mental Health, CEUR Workshop Proceedings 2870 (2021) 1215-1224.
[38] A. Yarovyi, D. Kudriavtsev, Method of Multi-Purpose Text Analysis Based on a Combination of
     Knowledge Bases for Intelligent Chatbot, CEUR Workshop Proceedings 2870 (2021) 1238-1248.
[39] N. Shakhovska, O. Basystiuk, K. Shakhovska, Development of the Speech-to-Text Chatbot
     Interface Based on Google API, CEUR Workshop Proceedings 2386 (2019) 212-221.
[40] T. Basyuk, A. Vasyliuk, Peculiarities of an Information System Development for Studying
     Ukrainian Language and Carrying out an Emotional and Content Analysis, CEUR Workshop
     Proceedings 3396 (2023). URL: https://ceur-ws.org/Vol-3396/paper23.pdf.
[41] V. Vysotska, S. Holoshchuk, R. Holoshchuk, A Comparative Analysis for English and Ukrainian
     Texts Processing Based on Semantics and Syntax Approach, CEUR Workshop Proceedings 2870
     (2021) 311-356.
[42] A. Dmytriv, S. Holoshchuk, L. Chyrun, R. Holoshchuk, Comparative Analysis of Using Different
     Parts of Speech in the Ukrainian Texts Based on Stylistic Approach, CEUR Workshop
     Proceedings 3171 (2022) 546-560.
[43] S. Yevseiev, et al., Development of a Method for Determining the Indicators of Manipulation
     Based on Morphological Synthesis, Eastern-European Journal of Enterprise Technologies 117(9)
     (2022).
[44] O. Cherednichenko, O. Kanishcheva, O. Yakovleva, D. Arkatov, Collection and Processing of a
     Medical Corpus in Ukrainian, CEUR Workshop Proceedings 2604 (2020) 272-282.
[45] A. Dmytriv, V. Vysotska, M. Bublyk, The Speech Parts Identification for Ukrainian Words Based
     on VESUM and Horokh Using, in: Proceedings of the 16th International Conference on
     Computer Sciences and Information Technologies (CSIT), vol. 2, 2021, September, pp. 21-33.
[46] V. Vysotska, S. Mazepa, L. Chyrun, O. Brodyak, I. Shakleina, V. Schuchmann, NLP Tool for
     Extracting Relevant Information from Criminal Reports or Fakes/Propaganda Content, in:
     Proceedings of the IEEE 17th International Conference on Computer Sciences and Information
     Technologies (CSIT), 2022, November, pp. 93-98.
[47] M. Lupei, O. Mitsa, V. Sharkan, S. Vargha, N. Lupei, Analyzing Ukrainian Media Texts by Means
     of Support Vector Machines: Aspects of Language and Copyright, in: Proceedings of the
     International Conference on Computer Science, Engineering and Education Applications, 2023,
     March, pp. 173-182.
[48] V. Vysotska, Analytical Method for Social Network User Profile Textual Content Monitoring
     Based on the Key Performance Indicators of the Web Page and Posts Analysis, CEUR Workshop
     Proceedings 3171 (2022) 1380-1402.
[49] K. Shakhovska, I. Dumyn, N. Kryvinska, M. K. Kagita, An approach for a next-word prediction
     for Ukrainian language, Wireless Communications and Mobile Computing 2021 (2021) 1-9.
[50] I. Demydov, Architecture of the Computer-linguistic System for Processing of Specialized Web-
     communities’ Educational Content. URL: https://ceur-ws.org/Vol-2616/paper1.pdf.
[51] V. Vysotska, Ukrainian participles formation by the generative grammars use, CEUR Workshop
     Proceedings 2604 (2020) 407–427.
[52] B. Bengfort, R. Bilbro, T. Ojeda, Applied text analysis with Python: Enabling language-aware
     data products with machine learning, O'Reilly Media, Inc., 2018.
[53] D. Jurafsky, J. H. Martin, Speech and Language Processing. URL:
     https://web.stanford.edu/~jurafsky/slp3/ed3book_sep212021.pdf.
[54] D. Jurafsky, J. H. Martin, Regular Expressions, Text Normalization, Edit Distance. URL:
     https://web.stanford.edu/~jurafsky/slp3/2.pdf.
[55] D. Jurafsky, J. H. Martin, Deep Learning Architectures for Sequence Processing. URL:
     https://web.stanford.edu/~jurafsky/slp3/9.pdf.
[56] D. Jurafsky, J. H. Martin, Naive Bayes and Sentiment Classification. URL:
     https://web.stanford.edu/~jurafsky/slp3/4.pdf.
[57] D. Jurafsky, J. H. Martin, Logistic Regression. URL:
     https://web.stanford.edu/~jurafsky/slp3/5.pdf.
[58] D. Jurafsky, J. H. Martin, Neural Networks and Neural Language Models. URL:
     https://web.stanford.edu/~jurafsky/slp3/7.pdf.
[59] I. Khomytska, V. Teslyuk, N. Kryvinska, I. Bazylevych, Software-based approach towards
     automated authorship acknowledgement-chi-square test on one consonant group, Electronics
     (Switzerland) 9(7) (2020) 1–11.
[60] A. R. Sydor, V. M. Teslyuk, P. Y. Denysyuk, Recurrent expressions for reliability indicators of
     compound electropower systems, Technical Electrodynamics 4 (2014) 47–49.
[61] V. Lytvyn, et al., Development of the linguometric method for automatic identification of the
     author of text content based on statistical analysis of language diversity coefficients,
     Eastern-European Journal of Enterprise Technologies 5(2(95)) (2018) 16–28.
     doi: 10.15587/1729-4061.2018.142451.
[62] V. Lytvyn, et al., Development of the system to integrate and generate content considering the
     cryptocurrent needs of users, Eastern-European Journal of Enterprise Technologies 1(2(97))
     (2019) 18–39. doi: 10.15587/1729-4061.2019.154709.
[63] P. Kravets, The Game Method for Orthonormal Systems Construction, in: Proceedings of the
     9th International Conference - The Experience of Designing and Applications of CAD Systems
     in Microelectronics, 2007. doi: 10.1109/cadsm.2007.4297555.

</pre>