<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Journal of Information Engineering and Electronic Business 16(5) (2024) 1</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1007/978-3-030-01069-0_8</article-id>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>CIAW-2024: Computational Intelligence Application Workshop</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>Stepan Bandera 12, 79013 Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>2</volume>
      <fpage>105</fpage>
      <lpage>118</lpage>
      <abstract>
        <p>The work aims to develop models, methods, and means of analysis and synthesis of computer linguistic systems (CLS) based on new and improved methods of processing Ukrainian-language textual content to solve natural language processing (NLP) problems. The scientific novelty of the obtained results lies in solving an important scientific and applied problem of the analysis and synthesis of CLS for various tasks of processing Ukrainian-language textual content, based on developing new and improving known models, methods, and means of NLP. The following new scientific results were obtained: a model of intellectual analysis of the text flow which, unlike existing ones, is based on the processing of information resources, NLP, and machine learning, and which defines the typical structures of the content integration, management, and support modules; methods of adapted processing of information resources for Ukrainian-language text that take into account the needs of the permanent target audience based on the analysis of the history of the target audience's activity on the CLS web resource, which made it possible to form a set of metrics and indicators of the effectiveness of CLS functioning for various NLP tasks; a model of linguistic processing of text based on improved grapheme, morphological, lexical, and syntactic analyses which, unlike existing ones, are adapted for processing Ukrainian-language text through regular expressions and machine learning, which made it possible to adapt the processing of Ukrainian-language textual content and to increase the accuracy of the obtained results depending on the specific NLP task; a method of identifying keywords in Ukrainian-language texts based on grapheme and morphological analysis of word bases through regular expressions and N-grams, which made it possible to increase the accuracy of keyword search, to find stable word combinations, and to categorize content; a method of determining the style of the author of thematic Ukrainian-language textual content based on the analysis of keywords, stable word combinations, and N-grams, which made it possible to determine the stylistic contribution of each author and to increase the accuracy of the attribution of a scientific and technical publication; a method for calculating the degree of verification of the author of a Ukrainian-language text from a set of possible authors based on a comparative analysis of the styles of potential authors, which made it possible to increase the accuracy of classification based on stylistic similarity; methods of analysis and synthesis of CLS based on a general typical structure of a CLS for processing Ukrainian-language textual content, with support for modularity and the modelling of the interaction of the main processes and components, which made it possible to expand the collection of solutions to typical NLP tasks by implementing typical software for such systems; NLP methods which, unlike existing ones, are implemented on the basis of developed regular expressions for the grapheme and morphological analysis of Ukrainian-language text and a modified Porter stemming algorithm that effectively identifies affixes in order to demarcate the analysed word, which made it possible to optimize the process and to improve the accuracy of the normalization of Ukrainian words and sentences; and text tokenization and normalization methods which, in contrast to existing ones, use cascades of simple substitutions of developed regular expressions matched against templates based on production rules, finite automata, and an ontological model of the rules of Ukrainian syntax.</p>
      </abstract>
      <kwd-group>
        <kwd>Computer linguistic systems</kwd>
        <kwd>NLP</kwd>
        <kwd>Ukrainian-language</kwd>
        <kwd>textual content</kwd>
        <kwd>machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The active development of information technologies (IT) is at the intersection of globalization and
informatization. The rapid rate of growth of society's informatization is directly related to the rate
of development and implementation of computer linguistic systems (CLS), the development of which
is based on models and methods of natural language processing (NLP) [1-3]. The complexity of
developing models, techniques, and tools of NLP lies in solving non-typical NLP problems and
adapting these models, methods, and tools to a specific natural language [4-6]. Each natural language
is unique, with its own rules, history, grammar, exceptions, and peculiarities of generating
linguistic units to convey meaning, which complicates the development of a CLS.</p>
      <p>Usually, each successful CLS development project is designed for a specific task (for example,
machine translation [7-9], identification of plagiarism/rewriting [10-12], text rubrication [13-14], text
attribution analysis [15-21], information retrieval [22-28], referencing/abstracting [29-30], voice
assistants [31-33], intelligent chatbots [34-39], etc.) and is both one-time and closed (for example,
Amazon Alexa, Google Assistant, Facebook, Voice Mate, Bixby, Siri, Abby Lingvo, Microsoft Cortana,
Microsoft Word, Grammarly, Google Translation, PROMT, CuneiForm, Trados, OmegaT, Wordfast,
Dragon, IBM ViaVoice, Speereo, FineReader, Tesseract, OCRopus, etc.), with its structure and
content inaccessible to interested IT professionals/specialists. In rare cases, the developers provide
open access to such CLS projects and the opportunity to get acquainted with their structure and
content. The development of any NLP application for an arbitrary natural language, out of the more
than 7,000 languages and dialects, is based on studying large monolingual/parallel text corpora of
that language, containing hundreds of millions of words, and on linguistic resources. Only for about
20 natural languages (English, Chinese, Western European languages, Japanese, etc.) are the results
of research on such corpora known, making it possible to develop CLS of various complexity for these languages.
Unfortunately, in modern realities, the Ukrainian language is considered in the international
scientific community to be an exotic language with a low resource index, i.e., it does not have enough
educational, research, or processed data to develop modern applied NLP applications. Such
applications are used to build CLS in cyber security (detection of fakes, propaganda, and so-called
trolls/bots in social networks), sociology (analysis of the dynamics of changes in public opinion
on thematic issues), philology (automatic research of large data sets of various thematic orientations
and different periods), psychology (analysis of the psychological portrait of a person, identification
of post-traumatic stress disorder in participants of hostilities or occupation), national security
(information warfare), jurisprudence (criminology and court cases), social communications (analysis
of community posts in social networks), and other important branches of modern Ukraine. The above
determines the relevance of the topic of the dissertation research.</p>
      <p>Scientific research by N. Chomsky, V.M. Glushkov, A.V. Hladkoy, D.V. Lande, V.A. Shyrokov,
N.V. Sharonova, N.F. Khairova, O.V. Bisikalo, S.N. Buk, N.P. Darchuk, Z.V. Partyka, A.V. Anisimova,
Yu.D. Apresyan, O.O. Marchenko, I.M. Kulchytskyi, A.O. Nikonenko, M. Gross, A. Lanten, V.H.
Yngve, S. Sharoff, Yu.A. Schrader, D. Jurafsky, B. Bengfort, J.H. Martin, L. Tesniere, T. Ojeda, P.M.
Postal, D.G. Hays, T.A. van Dijk, S. Marcus, J. Lyons, L.W. Tosh, Y. Bar-Hillel, D.G. Bobrow, G. Lakoff,
R. Bilbro, N. Kotsyba, A.Yu. Berko, Yu.M. Shcherbyna, V.Yu. Velychko, V.F. Starko and many others
make it possible to understand the basic principles of linguistic processing of the text depending on
the features of a specific natural language. More than 80% of such studies concern the processing of
English-language texts. There are fewer studies on Slavic languages, particularly the low-resource
Ukrainian language. In particular, there are no publications regarding the development
recommendations, functional requirements, general structure, or typical architecture of the CLS for
processing Ukrainian-language textual content. Directly applying the English language's models,
methods, algorithms, and IT processing to Ukrainian-language textual content does not yield positive
results. Already at the level of morphological analysis, a significant conflict arises between the
methods developed for English-language text and their use for Ukrainian-language text. For
example, the basic Porter stemming algorithm, without appropriate modification, cannot correctly
separate the base of a word from its inflexion, which leads to inaccurate identification of key
phrases and, in turn, affects the solution of any NLP problem that requires quickly
identifying a set of keywords (categorization, search, annotation, etc.). Determining the main features and
processes of linguistic analysis of Ukrainian-language texts will significantly facilitate the stages of
processing the text flow of information, such as integration, support and content management. In
turn, the adaptation of the processes of intellectual analysis of text content with the identification of
functional requirements for the relevant modules of the CLS will lead to the possibility of developing
its typical architecture based on the principle of modularity (adding components depending on the
content of the NLP task and the purpose of the CLS).</p>
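      <p>As a minimal illustration of the stemming conflict described above, the following sketch strips a few common Ukrainian inflexional endings longest-first. This is an illustrative assumption, not the modified Porter algorithm developed in the work, and the ending list is deliberately incomplete.</p>
      <preformat>
```python
# Toy sketch: longest-first stripping of a handful of Ukrainian endings.
# The ending list is an assumption for illustration only.
UK_ENDINGS = sorted(
    ["ами", "ями", "ого", "ому", "ої", "ів", "ам", "ах", "и", "і", "а", "у", "о"],
    key=len, reverse=True)

def naive_uk_stem(word: str) -> str:
    """Return the word base after removing the longest matching ending."""
    for ending in UK_ENDINGS:
        # Keep at least three characters so the base remains recognizable.
        if word.endswith(ending) and len(word) - len(ending) >= 3:
            return word[:-len(ending)]
    return word

print(naive_uk_stem("книгами"))  # "книг" - an unmodified English stemmer would leave it intact
```
      </preformat>
      <p>Even this naive cascade shows why rules tuned to English suffixes cannot be reused directly: the inflexional inventory itself must be replaced.</p>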
      <p>The above testifies to the relevance of research in solving the significant scientific and applied
problem of analysis and synthesis of CLS for solving various tasks of processing Ukrainian-language
textual content, which will make it possible to increase the level of resourcefulness of the natural
Ukrainian language based on the development of new and improvement of known models, methods
and means of NLP.</p>
      <p>The work aims to develop models, methods, and means of analysis and synthesis of computer
linguistic systems based on new and improved known methods of processing Ukrainian-language
textual content to solve problems of natural language processing. Achieving this aim requires
solving the following tasks:</p>
      <p>To analyse the specifics of the construction of the CLS by systematizing the processes of their
implementation and functioning, which will provide an opportunity to distinguish a class of
systems whose functional properties allow a quantitative assessment of the
expected effects of implementing a typical CLS for processing Ukrainian-language
textual content to solve various NLP tasks;
To develop information technology for the construction of CLS for the processing of
Ukrainian-language text, which will make it possible to determine their basic structure,
functional requirements, the sequence of setting and training the system, and general design
principles;
To offer IT processing of information resources as integration, management and support of
Ukrainian-language content based on the improvement of linguistic analysis of text content
for the development of metrics for evaluating the effectiveness of the functioning of the CLS
for solving various tasks of the NLP;
To develop methods of processing Ukrainian-language textual content for solving various
problems of NLP to increase the accuracy of the obtained results;
To develop methods and means of intellectual analysis of textual content to increase the
efficiency of solving various tasks of NLP;
Create software modules for processing Ukrainian-language textual content for solving
various tasks of NLP and conducting experiments;
To test the obtained results by building and implementing applied CLS to process
Ukrainian-language textual content.</p>
      <p>The object of research is the processes of analysis and synthesis of computer linguistic systems
for processing Ukrainian-language textual content.</p>
      <p>The research subject is models, methods, and means of processing Ukrainian-language textual
content to solve various problems of NLP.</p>
      <p>The following research methods were used to achieve the goal: the theory of formal grammars
and automata, the theory of sets, the theory of data and knowledge models, the theory of probability
and mathematical statistics, the theory of models, algorithms, and logical-linguistic numbers,
information theory, graph theory, and knowledge presentation methods for modelling the processes
of processing Ukrainian-language textual content and developing machine learning modules; models
and methods of processing and analysing textual content for the implementation of the processes of
solving various problems of NLP; methods of object-oriented and system analysis and design - for
design and development of CLS; the theory of relational databases, methods of artificial intelligence,
object-oriented programming - for the software implementation of the Ukrainian-language textual
content processing system for the solution of various NLP tasks. The practical significance of the
obtained results lies in the fact that they can be used to build applied CLS for processing
Ukrainian-language textual content. In particular, the following results are practically valuable:</p>
      <p>The application of the method of identification of persistent word combinations in the
identification of keywords in Ukrainian-language scientific texts of a technical profile allows
an increase in the accuracy of the search for keywords by 6-9% and highlights thematic terms
from the text for further classification of the publication;
Development of a formal approach to the design of a content monitoring module for
identifying keywords in Ukrainian-language texts based on web data mining, NLP and
linguistic analysis of defined words of text content, which made it possible to develop the
general structure of typical CLS and increase the effectiveness of CLS functioning by 6-9%
depending on the solution of a specific NLP problem;
The application of the method of calculating the degree of verification of the author of the
Ukrainian-language text based on the analysis of the styles of potential authors made it
possible to increase the accuracy of identification by 6-12% and carry out the decomposition
of the method through the study of stylistic coefficients such as the coherence of speech, the
degree of syntactic complexity, linguistic diversity, indices of concentration and exclusivity
of the text;
Development of a content monitoring module to identify a potential author of a text from a
set of possible ones based on a comparison of the results of the analysis of a template author’s
text with the researched one to reduce the volume of the corresponding set to [9;34]% of the
total number of project participants, depending on the subject and the time range of scientific
writing - technical publications, as well as the frequency of publications of this author in this
period on a specific topic;
Experimental testing of the method of identifying the author’s style in Ukrainian-language
texts based on web data mining and linguistic analysis of defined stop words allows the
selection of content potentially similar in style from a set of potential author’s publications.</p>
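      <p>The keyword and stable-word-combination results above rest on N-gram counting. A minimal sketch of that idea, under the simplifying assumption that a "stable combination" is just an adjacent word pair repeated at least a threshold number of times (the dissertation's actual criteria are richer):</p>
      <preformat>
```python
from collections import Counter

def stable_bigrams(tokens, min_count=2):
    # Count adjacent word pairs; pairs occurring at least min_count times
    # approximate "stable word combinations" under this toy frequency criterion.
    pairs = Counter(zip(tokens, tokens[1:]))
    return sorted(pair for pair, n in pairs.items() if n >= min_count)

toks = "обробка природної мови та аналіз природної мови".split()
print(stable_bigrams(toks))  # [('природної', 'мови')]
```
      </preformat>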
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>Determining the main processes and features of the linguistic analysis of Ukrainian-language texts
will significantly facilitate the stages of processing the text flow of content such as integration,
support and content management (Fig. 1). Adaptation of the processes of intellectual analysis of text
content with the identification of functional requirements for the relevant modules of the CLS will
lead to the possibility of developing a typical structure of similar systems based on the principle of
modularity (adding components depending on the content of the NLP task and the purpose of the
CLS). The application of the specified IT/methods/models in the typical structure of the CLS,
adapted for any process of processing Ukrainian-language textual content, is a necessary
prerequisite for the successful implementation of a CLS project for solving a specific NLP task;
it requires an appropriate set of standard libraries, utilities, and open-source software that
implement the specialized functions of the project according to the needs of the end user. The state
of the CLS is determined by a tuple of its main properties at a specific moment in time or by the
activity of the corresponding NLP process: S = (s1, s2, …, sn), i = 1..n, where si is the i-th state
at a specific moment in time ti from the set with power |S| = n. Each state is characterized by
properties P = (p1, p2, …, pm), j = 1..m, from the set with power |P| = m, which determine the
behaviour of the CLS; each property pj has a corresponding set of parameters for the state si. For
any CLS, a state si is one of the NLP processes, for example, the identification of keywords and/or
stable phrases, followed by the next state si+1 of the system, such as the rubrication of a text
array. Accordingly, the properties of the state si are morphological, lexical, and syntactic; some
NLP tasks also add semantic properties. For each property, a set of parameters is then determined
for the corresponding text analysis, depending on the specific NLP task [40-50]. According to these
parameters, the strategy of the CLS operation at the moment of time ti is specified for:</p>
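      <p>The state/property tuples described above can be rendered as a small data structure. This is an illustrative sketch only; all names here are assumptions, not the authors' notation:</p>
      <preformat>
```python
from dataclasses import dataclass, field

@dataclass
class CLSState:
    name: str                                        # an NLP process, e.g. keyword identification
    properties: dict = field(default_factory=dict)   # property name -> list of its parameters

# Two consecutive states s_i and s_{i+1}: keyword identification, then rubrication.
s1 = CLSState("keyword_identification",
              {"morphological": ["roots", "affixes", "endings"],
               "lexical": ["word_weight", "sentence_weight"]})
s2 = CLSState("rubrication", {"syntactic": ["dependency_depth"]})
pipeline = [s1, s2]  # S = (s1, s2, ..., sn): states the CLS passes through
print([s.name for s in pipeline])  # ['keyword_identification', 'rubrication']
```
      </preformat>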
      <p>[Fig. 1 appeared here as an image: the typical CLS structure, comprising a client subsystem (web site, DB profiles, content support module), a server subsystem (content management module, module of linguistic analysis of Ukrainian-language textual content, module for solving a specific NLP problem of Ukrainian-language textual content, machine learning module, knowledge base), and a technological subsystem (content integration module, content, data, repository), connected to the Internet.]</p>
      <p>the parameters of the morphological property are N-grams and morphemes (roots, endings, affixes), the grammatical categories of the different parts of speech, word length, word placement in a sentence, the number of syllables in a word, the number of word meanings, the ratio of consonants to vowels, etc.; the parameters of the lexical property are the location of the sentence in the text, the location of the word in the sentence, the weight of the word, the weight of the sentence, the base of the word, the inflexion of the word, etc.; the parameters of the syntactic property are the depth of the word in the dependency tree, the location of the word in the sentence, the number of word meanings, the number of words per sentence, the numbers of words and sentences, and whether the word is capitalized, hyphenated, or compound, etc.; the parameters of the semantic property are the number of word meanings, the depth of the word in the dependency tree, the size of paragraphs, and the placement of paragraphs.</p>
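      <p>One of the morphological parameters listed above, the ratio of consonants to vowels, is simple enough to sketch directly (the vowel set is the standard Ukrainian one; the function itself is an illustration, not the dissertation's implementation):</p>
      <preformat>
```python
UK_VOWELS = set("аеєиіїоуюя")

def consonant_vowel_ratio(word: str) -> float:
    # Morphological parameter: ratio of consonants to vowels in a word.
    w = word.lower()
    vowels = sum(ch in UK_VOWELS for ch in w)
    consonants = sum(ch.isalpha() and ch not in UK_VOWELS for ch in w)
    return consonants / vowels if vowels else float("inf")

print(consonant_vowel_ratio("мова"))  # 2 consonants / 2 vowels -> 1.0
```
      </preformat>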
      <p>Depending on the tuple of states and properties, the behaviour of the CLS is determined, that is, the implementation of a set of rules (the activation of actions or events) for a specific NLP process depending on the input text data. Accordingly, an event is the change of one property to another, or of one state to the next, according to the fulfilment of certain conditions for the input analysed text and the intermediate processed text. An action is the process of activation of one event by another event in the CLS. The more complex the language (morphology, syntax, etc.), the more difficult it is to process the corresponding texts in natural language. In addition, for low-resource languages such as Ukrainian, there are no standardized rules and dictionaries for processing natural-language texts to solve the relevant NLP tasks. Many scientific linguistic schools and IT specialists are working on creating Ukrainian dictionaries, text corpora, and rules for processing Ukrainian texts. However, these are usually linguists and philologists unfamiliar with the features of specific modern tools, such as programming languages, ML methods, big data analysis, etc. There is a colossal gap between the research results of philologists and applied linguists, on the one hand, and IT specialists, on the other, in developing Ukrainian-language texts. Today, very few NLP tools for languages such as Ukrainian have been implemented for general access.</p>
      <p>The presence of the text content support module reduces costs for moderators/analysts who collect/analyse statistical data on the dynamics of CLS functioning, the activity of the permanent target audience as a reaction to website content changes, and the formation of rules for the analysis of user information portraits and thematic content plots. The module is defined as a tuple of indicators: the cost or utility of the purpose of the visit; the average ROI, or average return on investment; the percentage of profit from new visitors; the index of new buyers/customers at the first visit; the advertising quality index; a brand-recognition factor; the index and percentage conversion of goals by type of advertising; and the conversion rate of goals by type of means.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Material and methods</title>
      <sec id="sec-3-1">
        <title>The developed typical structure of CLS</title>
        <p>The developed typical structure of the CLS consists of modules for solving a specific task of NLP, content support, content integration, content management, linguistic analysis, and intelligent analysis of textual content flows (IATCF) [48]. Each module is defined as a tuple of its performance indicators. For the module that solves a specific NLP problem, these indicators include: a function for determining the percentage of visits from advertisement w; a function for determining the percentage conversion of goals for visits from w; a function for determining the index of the advertising quality of w; the total number of user queries of intellectual and informational search (IIS) by keywords; the number of direct visits to the website; and the number of IIS requests with the brand name.</p>
        <p>The presence of the text content integration module reduces the costs of CLS moderators and content authors by automating some of their work/functions, such as content collection from several different reliable sources, its recognition, filtering, saving, formatting, analysis, annotation, classification, etc. The module is defined as a tuple of indicators: the percentage of repeat visits by a user since the previous visit within given time windows; a brand-recognition factor; the percentage of new/repeated visitors and their interest; the average number of clicks on advertising per visit; the bounce rate for a single web page; the average number of web-page views per visit; and the average length of stay on a web page.</p>
        <p>The indicator of internal search on the site is likewise defined as a tuple: the average number of page views per visit and for a specific time t; the average number of unique users for a specific time t; the average number of visits for a specific time t; the number of zero search results; the percentage of users who remained on a page longer than a threshold time and viewed more than a threshold number of pages after the search; the percentage of buyers among users who use search; the percentage of rejections after visiting one page as a search result; the percentage conversion from users who use search; the percentages of users who do and do not use search; the average number of pages viewed by visitors after a search; the average time spent on the site per visit after a search; the percentages of visitors who conduct several searches during a visit and who left the site after viewing the search results; the average number of search results; the percentage of visits with search; and the percentage of zero search results. These indicators are computed from the number of direct web-page visits, the number of one-page visits, the number of visits selected for analysis, the total number of visits, the average number of clicks on advertising, the total number of actions on the page, and the total numbers of all users and of interested users.</p>
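        <p>One internal-search indicator from the list above, the percentage of zero-result searches, can be sketched as follows (the log format is an assumption for illustration):</p>
        <preformat>
```python
def zero_result_rate(search_logs):
    # search_logs: list of (query, number_of_results) pairs.
    # Returns the % of searches that returned zero results.
    if not search_logs:
        return 0.0
    zero = sum(1 for _query, n_results in search_logs if n_results == 0)
    return 100.0 * zero / len(search_logs)

logs = [("нлп", 12), ("корпус", 0), ("стемінг", 3), ("лема", 0)]
print(zero_result_rate(logs))  # 50.0
```
        </preformat>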
        <p>The presence of a text content management module reduces costs for moderators/administrators who update the website and create rules for caching/searching popular information blocks. Its tuple of indicators includes: an indicator of internal IIS; the percentage of page editions issued with an error; the percentages of mobile users and of users with a high-speed Internet connection; the percentages of users with low/medium/high display resolution and with a specific operating system; the percentages of users with a specific browser and with English and/or Ukrainian language support; and an indicator of the numbers of users, views, and page visits, which is the base of the content management module. The latter is computed from the numbers of pages issued with an error and of viewed pages, the number of zero search results, and the numbers of visits with and without search.</p>
        <p>The presence of a module for the intellectual analysis of text streams of content reduces the time/costs/personnel/resources needed for the timely and prompt acquisition of relevant, unique, current content, which increases the volume of the target audience of the CLS and, in particular, contributes to the growth of the economic effect of its implementation. Its tuple of indicators includes: the average conversion rate; the average length of a visit; the average number of views per visit; the percentage of unique customers/visitors/users; the percentage of new website customers; the percentage of interaction with the site (for example, commenting, voting, registration, authorization, subscription, etc.); the percentage of users who activate various events (for example, clicking on an ad, starting a function, pausing, etc.); the percentage of users interacting with different types of content presentation (viewing the next communication, panning, zooming, etc.); the value of the measure of usefulness of the page/site/CLS/content; the number of unique page views; the profit from e-business; and the value of the utility measure of user visits (based on transactions) and of the purpose of user visits (based on the utility of goals).</p>
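        <p>The percentage of visits with at least one site interaction, one of the IATCF-module indicators above, can be sketched like this (the event-log shape is an assumed simplification):</p>
        <preformat>
```python
def interaction_rate(events, total_visits):
    # events: list of (visit_id, action) pairs, e.g. commenting, voting, registering.
    # Returns the % of visits during which at least one interaction occurred.
    if not total_visits:
        return 0.0
    interacted = {visit_id for visit_id, _action in events}
    return 100.0 * len(interacted) / total_visits

events = [(1, "comment"), (1, "vote"), (3, "register")]
print(interaction_rate(events, 4))  # 2 interacting visits out of 4 -> 50.0
```
        </preformat>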
        <p>The analysis of the success/effectiveness of operational search on the site is likewise defined as a tuple whose components are: the value of the usefulness of visiting the site/page; the conversion rating in e-business for the CLS corresponding to the NLP task; the value of average utility; the value of e-business profit for the CLS of the corresponding NLP task; and the value of the achieved conversion of visits to the site/page of the CLS. According to the tracking of events and interaction with the site, these indicators are analysed against the input data from the tuple of module indicators.</p>
        <p>The method of determining the effectiveness/quality of the CLS site for solving the NLP problem:</p>
        <p>Stage 1. Formulation and identification of usefulness according to the goals of the target audience.</p>
        <p>Stage 2. Activation of reports on the operation of the CLS from the tuple of initial data: Step 1. Define a set of goals (4 goals for each target-audience profile). Step 2. Identify the optimal volume of visits/time of the end user/customer for a successful conversion. Step 3. Analyse the volume of the contribution of each goal to the total profit. Step 4. Combine goals by categories/directions/types. Step 5. Form separate sets of transactions as appropriate for the goals.</p>
        <p>To attract new visitors and increase the volume of the permanent target audience, the calculation of the impact of the IIS on site income is used, based on the number of visits from the IIS and the utility of visits without and with the IIS. The topic of a set of keywords is one of the main indicators of the IIS for identifying the specific content of a page. Investment is optimized for the sets of keywords that increase conversion values. The return on investment must be positive, ROI = (profit − cost)/cost · 100% &gt; 0. They then find what share q% of funds can be spent on a specific keyword in advertising without the risk of obtaining ROI &lt; 0, and use this to calculate the amount of funds for attracting users.</p>
        <p>Stage 3. Support for various marketing campaigns/customers.</p>
        <p>Stage 4. Support for the processing of the service content of the site with the content support module.</p>
        <p>Stage 5. Updating the profiles of the target audience according to feedback, and analysing user actions, through the corresponding modules.</p>
        <p>Stage 6. Integrating content from different sources through the content integration module according to the achieved goals and processing it through the corresponding module.</p>
        <p>Stage 7. Periodic checks are performed to see whether the goals are being achieved and whether the profit is growing accordingly. If it subsides, go to Stage 1; otherwise, go to Stage 2.</p>
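        <p>The ROI criterion used above is the standard return-on-investment formula; a keyword advertising spend is considered worthwhile only while the value stays positive. A direct sketch:</p>
        <preformat>
```python
def roi_percent(profit, cost):
    # ROI = (profit - cost) / cost * 100%; spend on a keyword only while ROI > 0.
    return (profit - cost) / cost * 100.0

print(roi_percent(1500.0, 1000.0))  # 50.0: the keyword campaign pays off
print(roi_percent(800.0, 1000.0))   # -20.0: spending should be reduced
```
        </preformat>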
<p>A classified list of the input stream of content  with a set of relevant properties demarcates project participants through their typification and restriction of access rights depending on the content: regular users, potential visitors, linguists, statistical analysts, administrators, content/rules moderators, authors of unique content, information resources as content sources, etc. The typed structure of the content input stream template with a set of relevant properties helps to define the main functional requirements for the site/CLS and its typical structure, delineate the non-functional capabilities, classify the sources, and calculate the frequencies and the corresponding restrictions/conditions of integration from the usual source:
 =&lt;  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  &gt;, (16)
where</p>
<p>is the URL addresses of sources for the databases (DB) of CLS filters;  is content as a result of integration from different sources according to a predetermined list of URLs, without a predetermined structure, according to relevant thematic requests;  is the thematic requests of visitors/users of the CLS site in the form of a set of keywords or persistent phrases;  is the actual data of permanent users/profiles and the set of rules of permitted actions within the corresponding type of CLS user;  is statistical data of actions/events/phenomena of the subjects/objects of the CLS for the solution of the corresponding NLP task and the rules for collecting/saving/analysing statistics in specific time intervals of the CLS operation;  is statistical data on the functioning of the CLS;  is the contents of the DB/DS of content/rules/filters/annotations, etc., of the CLS;  is different types of linguistic dictionaries depending on the purpose of the CLS for solving a specific</p>
      </sec>
      <sec id="sec-3-2">
        <title>NLP problem;</title>
<p>is a set of personalized/anonymous reviews and comments of users on the relevant content of the CLS;  is a tuple of the results of personalized/anonymous votes of regular/potential users regarding the content of the CLS;  is statistical data on the personalized individual actions of CLS users;  is a set of external/internal advertising of thematic content;  is thematic stickers of information content (exchange rates, announcements, digests, weather, anecdotes, horoscope, etc.);  is a tuple of options for setting up and changing the CLS/site configurations.</p>
<p>Filling the tuple of the output data stream  according to the purpose of the CLS for solving a specific NLP problem directly depends on the content of the input classified stream of content  with a predetermined set of properties, depending on the interaction of the corresponding types of project participants with the site:
 =&lt;  ,  ,  ,  ,  ,  ,  ,  ,  ,  &gt;, (17)
where  is text content as an information product or the result of providing an appropriate information service for solving a specific NLP task on the CLS website;  is a set of meaningfully generated/cached pages resulting from thematic requests/IIS of users/visitors of the CLS site;  is annotations/digests/abstracts on textual thematic content;  is a tuple of statistics of user/visitor interaction with the site;  is a tuple of the content of the profiles of regular users of the CLS according to the personalized statistics  for the corresponding generation of an individual portrait of the user/audience at certain time intervals;  is a tuple of meaningful recommended site content, personalized for a specific regular user according to the profile/actions/interaction with the CLS in certain time intervals;  is a set of content topics/headings with the possibility of renewal according to the results of the latest IIS/requests from regular site users;  is a scheme of interrelationships of textual thematic content according to the appropriate classification (current, relevant, author's, outdated, popular, similar, last-viewed, often-viewed, most viewed, longest viewed, most viewed from search engines or internal IIS, viewed by a typical group of users, etc.);  is the set of content rating results on a predetermined scale within the corresponding ranking classification;</p>
<p>is a set of marked evaluations and rankings of user comments, used as the degree of permission to publish on the site/page, if necessary with a prohibition mark forbidding a specific contributor from writing further comments, and with a ranking of all contributors by degree of trust. The list of the output flow of content, its main features, the corresponding classification, and the IT for its generation/support/analysis contribute to defining precise general functional requirements for implementing the CLS to solve any NLP problem.</p>
<p>The model of the process of linguistic analysis of the Ukrainian-language text  is presented as
 =&lt;  ,  ,  ,  ,  ,  ,  ,  ,  ,   ,   ,   ,   ,   ,   ,  , ,  ,  ,  ,  ,  ,  ,  &gt;,
where  is the input data in the CLS from various sources of information  ;  is the original relevant content from the CLS as a result of the IIS according to the requests of users/visitors;  is the process of linguistic analysis of content as a component of the IATCF subsystem;  is the process of generation/modification of the rules of operation of all modules by the moderator of the CLS;  is the process of filling an unstructured database with integrated content  ;  is the process of filling the structured database based on the processed integrated content  ;  and  are the processes of generating results according to the requests of visitors and users;  is the cache processing process for generating reports on popular requests from CLS users;  is the cache filling/modification process;  is the process of generating statistical results of the functioning of the CLS/modules and the activities of users  ;  is the operator of generation/modification of the rules of operation of all modules from the moderator of the CLS;  is the operator of filling an unstructured database with integrated content  ;  is the operator of filling the structured database based on the processed, integrated content  ;  and  are operators for generating results according to the requests of visitors and users;  is the cache processing operator for generating reports  on popular requests from users;  is the cache filling/modification operator with  data;  is the operator for generating statistical results of the functioning of the CLS/modules and user activities:
 =&lt;  ,  ,  ,  ,  , , , , , , , , ,  &gt;,  =  ∘  ∘  ∘  ∘  ∘  ∘  ∘  ∘ , (18)
where  is the input text data array;  is a tuple of the original processed text according to the purpose of the CLS;  is a set of intermediate content, which is processed at the appropriate level in the CLS;  is auxiliary dictionaries;  is a set of processing rules;  is the grapheme analysis operator (GA);  is the morphological analysis operator (MA);  is the lexical analysis operator (LA);  is the syntactic analysis operator (SA);  is the semantic analysis operator (SEM);  is the ontological analysis operator;  is the reference analysis operator;  is the structural analysis operator;  is the pragmatic analysis operator (PA).</p>
<p>The primary process of linguistic analysis of textual content is thus presented as the sequence of the grapheme, morphological, lexical, syntactic, semantic, ontological, reference, structural and pragmatic analyses.</p>
<p>and the sets of production/association rules are applied step by step: the derivation invokes rules III.1, II.1-II.4 and IV.1-IV.7 over tagged word forms such as Sж,од,н,3, Sч,од,р,3, Аж,од,н and Ач,од,р.</p>
        <p># весела посмішка твого сина наповнює мене безмежним щастям (a cheerful smile of your son fills me with boundless happiness)</p>
<p>IX. Basic morphological rules: { +  →  + ;  + и → і; о + ( , ) +  →  + ( , ) + ; с' + W → ш + W; в' + W → вл' + W; б' + W → бл' + W; д' + W → дж' + W; т' + W → ч + W; …; д + W → д' + W; с + W → с' + W; …; нн + Ф → н + о}, where  and  are arbitrary vowels;  is the designation of the sound [j] (йот); Z is any sequence not longer than 3 characters; W = -е(є)н-, -у(ю)ва-, -ова-, -овува-.</p>
        <p> = ( ,  ,  ),   =  ∘  ∘  , (23)
where  is grammar induction implementation operator;  is the operator of
identification/elimination of boundary ambiguity or sentence violation;  is operator of syntactic
parsing of phrases/sentences for building a SA tree. Rules for formulating Ukrainian phrases:</p>
<p>I. Choice of structure: { → # , ,н,   ,тепер, #}, where  is the verb group,  is the noun group,  is gender,  is singular (од) or plural (мн),  is the case, and  is the person.</p>
        <p>II. Noun group: { , , ,  →  , , ,   , ,р, ;  , , ,  →  , ,   , , , ;  , , ,  →  за,й,м,   , , ,  ;  , , ,  →  , , }.</p>
        <p>III. Verb group: { ,тепер,  →  ,тепер,   , ,зн,   , ,ор, ;  ,тепер,   , ,ор,   , ,зн, ;  ,тепер,  →  ,тепер,   , ,зн,  ;  ,тепер,   , ,ор, }.</p>
        <p>IV. Substitution of words: { ч, ,  → син , , …;  ж, ,  → посмішка , , …;  сер,у,  → щастя , , …;  хз,аойдм, ,  → я;  хз,аойдм, ,  → ти;  у,тепер,  → наповнити ,тепер, , …; веселий х, , , безмежний х, , , мій х, , , твій х, , , …}.</p>
<p>Stage 5. Semantic analysis  of the Ukrainian-language text  consists of
  = ( ,  ,  ),   =  ∘  , (24)
where  is the identification operator of lexical semantics with the generation of a collection of values of each lexeme of the text;  is the relational semantics identification operator of the interdependencies of the content of the lexemes of the text.</p>
<p>Stage 6. Reference analysis  is the identification of inter-phrase units  .</p>
        <p> = ( ,  ,  ). (25)</p>
<p>Reference analysis is often part of SEM. For Ukrainian texts, when analysing large corpora, it is best carried out as a separate stage (for example, when analysing the correspondence of a social group/community in social networks or other dialogues, to identify logical, meaningful connections between the posts of different participants, given the subjectivity of each participant's speech).</p>
<p>Stage 7. Structural analysis  of the Ukrainian-language text  is based on the degree of coincidence of lexical and terminological units across text fragments. It is often part of SEM for short texts/messages, or not used at all; for large corpora of texts it serves as an additional stage for eliminating inaccuracies flagged in SEM.</p>
        <p> = ( ,  ,  ) or   = ( ,  ,  ). (26)</p>
<p>Stage 8. Ontological analysis  of the text content  is performed on the basis of, or as part of, the results of SEM and the reference/structural analyses, if necessary:</p>
        <p> = ( ,  ,  ),   = ( ,  ,  ) or   = ( ,  ,  ). (27)</p>
<p>Stage 9. Pragmatic analysis  of the text content  is used to determine the text's structure by considering the context of sentences when forming paragraphs, sections, and dialogues. PA is an essential addition to SEM and to the reference and structural analyses when they do not suffice to eliminate flagged inaccuracies.</p>
        <p> = ( ,  ,  ,  , [ ,  ,  ], ),  =  ∘  , (28)
where  is a semantics identification operator outside individual sentences/phrases;  is the operator of text processing through higher-level NLP applications, for example, to simulate intelligent behaviour and an apparent understanding of natural language.</p>
        <p>A general scheme/model of the pipeline of the CLS operation has been developed based on
improved methods of processing information resources such as integration, maintenance and
content management, as well as the development of improved methods of intellectual and linguistic
analysis of text flow using machine learning technology (Fig. 3) [52-58]. Based on feedback from the
user and output data of the ML model, the target audience interacts with the CLS, which contributes
to the adaptation of the selected learning model. Five stages of relevant processes determine the basic
architectural principles of building a typical CLS. The methods of monitoring, developing and
managing content are interaction, formatting/filtering, NLP, ML and data accumulation in DS.
Content and support processes feature analysis, deployment, prediction, interpretation, and
content/result presentation. At the interaction stage, a set of rules for integrating content from
multiple reliable sources at certain intervals is developed. Also, in parallel, a set of rules for checking
the data entered by the user of the CLS was created as a preliminary stage for the formatting/filtering
stage according to a collection of rules and content from the DS set in advance by the moderator.
The next stage of NLP is an intermediate stage for ML and data accumulation. The ML stage is
implemented through SQL queries and modules. The support process is easier to implement than the management stage, especially when analysing the results of the NLP, in which additional lexical resources and artefacts (dictionaries, translators, regular expressions, etc.) are created, on which the effectiveness of the CLS functioning directly depends (Fig. 4) [52-58].
</p>
        <p>(Figs. 3 and 4 depict the CLS pipeline: input content, user requests and feedback flow through the CLS website; the processes of monitoring, development and management of content comprise interaction/integration/presentation, formatting/filtering (transformation, interpretation, API), NLP (normalization, prognostication, assessment), machine learning (classification, deployment, modeling), and accumulation of content/analysis of features in the data storage, matched by the content analysis and support processes of the computer linguistic system.)</p>
<p>The transition process from the raw text to the expanded ML model consists of additional content transformations. First, the input text content is transformed into the input corpus as a collection of texts, accumulated and stored in the DS. The incoming content is further grouped, filtered, formatted, linguistically processed, marked, normalized and converted into vectors for further processing. In the final transformation (Fig. 5, the process of generating an optimal machine learning model) [52-60], the model is trained on the vector corpus to create a generalized representation of the original content for further use in solving a specific NLP problem.</p>
<p>NLP methods have been improved based on the developed 82 regular expressions (RGs) for pattern matching in GA and more than 2000 RGs for the morphological analysis of Ukrainian-language texts. The primary admissible operations of RGs are the union and disjunction of symbols/chains/expressions, number and precedence operators, and anchors for the presence/absence of symbols in regular expressions. The main stages of tokenization and normalization of Ukrainian text by cascades of simple RG substitutions and finite automata are determined. Algorithms for word segmentation and normalization, sentence segmentation, and a modified Porter stemming are implemented and described as an effective way of identifying lemma affixes so that the analysed word can be marked. The modified Porter stemming algorithm is based on searching/checking the obtained intermediate results against the tree of inflexions (so as not to go through all possible inflexions) and against the content of thematic dictionaries of bases with a set of RG-rules for the identification of features (classification by parts of speech).</p>
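<p>The grapheme-level tokenization of Ukrainian text by regular expressions can be sketched as follows. This is a minimal illustration with one hand-written pattern (an assumption), whereas the actual system uses 82 GA and over 2000 MA regular expressions:</p>

```python
import re

# Ukrainian letters plus the apostrophe that may occur inside a word (м'яч)
WORD = re.compile(r"[А-ЩЬЮЯҐЄІЇа-щьюяґєії]+(?:['’][А-ЩЬЮЯҐЄІЇа-щьюяґєії]+)*")

def tokenize(text: str) -> list[str]:
    """Extract word tokens and normalize them to lowercase."""
    return [t.lower() for t in WORD.findall(text)]

print(tokenize("Весела посмішка твого сина!"))
# ['весела', 'посмішка', 'твого', 'сина']
```

<p>Punctuation and other non-alphabetic characters simply fall outside the pattern, so no separate stripping pass is needed.</p>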
<p>Stage 1. Identify the next lexeme as the word  ( =  ).</p>
        <p>Stage 2. Check with the stop-word dictionary whether  is a service word. If yes, then  =  + 1 and go to stage 1. Otherwise, go to stage 3.</p>
        <p>Stage 3. Go to the end of the word  . Recognize the inflexion  in  from all possible ones (the longest one is chosen; for example, in  =текстова we choose the ending  =ова, not  =а) from the RG of the word type  ,  , or  , and, if present, delete the inflexion  .</p>
        <p>(Fig. 5: generation of the ML model — forming the feature set, choice of the ML model, adjustment of parameters, model control, lexical resources.)</p>
        <p>Stage 4. Saving the inflexion  in the word tag  .</p>
      </sec>
      <sec id="sec-3-3">
        <title>Stage 5. Label  as type</title>
<p> ,  or  , respectively.</p>
<p>Stage 6. Finding the deleted inflexion  in the tree of inflexions  (the longest one is chosen). Checking the contents of the subtree  against the existing word ending  ( =  +  ). If  ends in  and has a counterpart in  , then we store it in  =  and delete it in  .</p>
<p>Stage 7. We check the obtained base  of the initial word  against the content of the dictionary of bases  of words of the Ukrainian language. If there is no match, we store &lt;  ,  &gt; in the additional temporary intermediate dictionary  for the moderator and proceed to stage</p>
      </sec>
      <sec id="sec-3-4">
        <title>1. Otherwise, proceed to stage 4.</title>
<p>Stage 8. Analysis of the inflexion and the presence/absence of alternation of letters in the base/inflexions of the words &lt;  ,  &gt; and the analogue of the word base in  according to the corresponding RG-rule of MA, to identify additional features of the analysed word  .</p>
<p>Stage 9. Adding the identified linguistic features of the recognized part of speech to the tag of the word  of the type  ,  , or  , respectively. Saving the results in the corresponding dictionary  of the analysed text.</p>
<p>Unlike the classic Porter algorithm, the modified one is adapted specifically for the Ukrainian language and gives an accurate result in 85-93% of cases, depending on the quality, style and genre of the text and, accordingly, on the content of the CLS dictionaries. In total, about 1,300 rules for processing suffixes and endings, taking the alternation of letters into account, have been implemented for the MA of Ukrainian-language nouns, along with 99 RG-rules for adjectives and more than 800 RG-rules for verbs. The algorithm for the minimum edit distance between lines of Ukrainian texts is described as the minimum number of operations required to transform one string into another. Also, an algorithm for calculating the maximum-likelihood metric for the 2-gram and 3-gram models based on the analysis of word bases was developed to identify stable word combinations as keywords. To forecast the conditional probability of the next word base, we use the Markov assumption (the probability of a word depends on the previous one).</p>
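<p>The minimum edit distance described above matches the classic dynamic-programming (Levenshtein) computation; a compact sketch:</p>

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to transform string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[len(b)]

print(edit_distance("кіт", "кит"))  # -> 1 (one substitution)
```

<p>Only two rows of the table are kept, so the memory cost is linear in the length of the shorter string.</p>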
<p>Moreover, suppose the keywords are a set of nouns or an adjective with a noun. In that case, other words, such as verbs, participles, etc., are treated as additional separators, like punctuation marks that demarcate persistent phrases as potential keywords. The order of bases is not crucial for the Ukrainian language.</p>
<p>Stage 1. Process the input text and break it into separate phrases (sentences)  , marking each start/end with the corresponding &lt;p&gt; &lt;/p&gt; tag. Eliminate all non-alphabetic characters. Convert uppercase letters to lowercase. Remove service words if necessary (for certain NLP tasks).</p>
<p>Stage 2. Apply Porter's stemming to obtain the sequence of word stems   …  of the words  …  , taking word normalization into account, respectively.</p>
      </sec>
      <sec id="sec-3-5">
        <title>Stage 3. Receive input queries</title>
<p> …  as a sequence of words of the searched data. Find the basis  of each word   …  by stemming.</p>
      </sec>
      <sec id="sec-3-6">
        <title>For example, for the search phrase  :</title>
        <p>Translation - Method and tools for information systems processing in electronic content commerce
systems</p>
<p>ресурс, метод, та, засіб, опрац, інформ, систем, електрон, контент, комерц,</p>
        <p>where  =&lt;  ,  ,  ,  ,  ,  ,  &gt; is a tuple of simple sentence generation properties.</p>
<p>where  is a tuple of lexical signs of phrase generation;  is a tuple of syntactic signs of phrase generation;  is a tuple of noun properties;  is a tuple of adjectival properties;  is a tuple of properties of numerals;  is a tuple of pronominal properties;  is a tuple of verb properties;  is a tuple of adverbial properties;  is a tuple of coordinate properties;  is a tuple of subordinate properties;  is a tuple of ordinal properties, where  is a tuple of the properties of a disjunctive connection,  is a tuple of the properties of a conjunctive connection, and  is a tuple of the properties of an adversative connection;  is a tuple of agreement properties;  is a tuple of government properties;  is a tuple of adjacency properties. A tuple of sentence generation concepts:  =&lt;  ,  ,  ,  &gt;, where  is a tuple of sentence generation properties;  is a tuple of clause identification properties;  is a tuple of narrative sentence generation properties;  is a tuple of properties for generating interrogative sentences;  is a tuple of imperative sentence generation properties;  is a tuple of properties for generating emotionally neutral sentences;  is a tuple of properties for generating emotional sentences; a tuple of concepts covers the formation of  simple and  complex sentences;  is a tuple of properties identifying the main members of the sentence;  is a tuple of properties identifying the secondary members of the sentence;  =&lt;  ,  &gt;;  is a tuple of properties for generating affirmative sentences;  is a tuple of negative sentence generation properties. To generate a simple sentence  , the following features are analyzed:</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments, results and discussion</title>
<p>We analyse the results of the experimental approbation of the developed methods and means of the linguistic and intellectual analysis of Ukrainian-language texts, based on the developed methods for identifying keywords, determining persistent word combinations, thematically classifying text and detecting text duplication. Let us consider the peculiarities of the
process of syntactic analysis of Ukrainian-language textual content aimed at identifying significant
keywords of input texts. Having determined the role and formal features of the syntactic analyser in
the process of identifying keywords of the content topic, the procedures of the proposed method
were decomposed into two stages (Table 1), where A (total keywords identified with a given word
weight), B (generated significant words without pronoun and verbs), C (coincidence of words with
the author's list), D (accuracy of the coincidence of identified keywords with the author's list), E
(additionally defined keywords, but not determined by the author of the publication). In stage 1, the
research for step 1 (analysis of full articles) and step 2 (articles without metadata such as abstract,
author keywords and list of references) was carried out without the application of ML, and in stage
2 - with ML. The method of article analysis without metadata achieves the best results according to
the density criterion. The author of the article often defines a more significant number of words ( )
and a smaller number of keywords ( ) than are present in the text of the scientific and technical
publication (Fig. 6). Unlike known parsers, the proposed method provides self-improvement and
self-learning of the keyword definition module due to the identification mechanism of significant
statistical parameters within the limits defined by the moderator. A system has been developed on
the Victana website, which allows users to choose from a list of languages of the analysed text
(http://victana.lviv.ua/index.php/kliuchovi-slova). The value of  differs from the value of  by
0.69 (by number, but not by content);  from  by 1.74;  from  by 2.66;  from  by 3.58.
The value of  differs from the value of  by 4.36; respectively,  from  by 3.31;  from  by
2.39;  from  by 1.47. Adaptively changing the parameters/rules of the module almost doubles
the collection of identified keywords (for example, the value of  is greater than  by 1.144654; 
by 1.750524;  by 1.557652;  by 1.36478). The total increase in value obtained depending on the
moderation of dictionaries is, respectively, for  is 14.46541;  is 36.47799;  is 55.7652;  is
75.05241. When comparing  is greater than  ÷  and we have a chain of such values as 1.7985;
1.5084; 1.3217; and 1.176.</p>
      <p>For different stages and steps of the experiment of processing the primary text, the average
coincidence of the lists of discovered keywords with the author's keywords varies in the range of
52.6-68.5%. The accuracy of matching keywords with the author's keywords ranges from 43.6 to
62.9%. The average match of meaningful keywords compared to all found by the system ranges from
38.9-75.8%, depending on the stages of analysis of article texts. The accuracy of matching keywords
compared to all found by the system varies between 34.3-71.9%, depending on the stages of analysis
of article texts. For  , the module most often identified the number of keywords {5, 7, 3} (10),
although the distribution of found keywords was within [1;18] words (except 17).</p>
<p>For  , the module most often identified the number of keywords also {5, 7, 3}, although the distribution of found keywords is within [1;18] (except 17); the number of identified words increased,</p>
        <p>(Figs. 6-8: for each keyword weight from 1 to 5, the panels a)-d) plot, for both the author's keywords and the words defined by the system, the total words, meaningful words, coincidence with the author's keywords, match accuracy, and additional words.)</p>
<p>and the highest reliability index was achieved. For  , the module most often identified the number of keywords {7, 6, 5, 10, 8}, although the distribution of found keywords was within [2;14] (the range narrowed significantly). For  , the module most often identified the number of keywords {8, 5, 7, 10}, with the distribution of identified keywords within [3;16] (accuracy improved). The accuracy of keyword identification increases with the moderation of the dictionaries and the ML module: the difference between the number of keywords defined by the author and identified by the module is 44.39919%, improving to 33.70672% and then, significantly, to 24.33809%. Analysis was performed for filtered texts without metadata and for unfiltered texts. The average values obtained for filtered texts ( = 0.28) and unfiltered texts ( = 0.19) show that filtering scientific articles improves keyword density by 1.48 times, or 47.83% (Fig. 9a). The values obtained for the texts ( = 0.34 and  = 0.25), taking into account the refinement of the thematic dictionary through ML and the replenishment of blocked words, show that filtering with simultaneous moderation of the thematic dictionary improves keyword density by 1.35 times, or by 35.44% (Fig. 9b). A comparison of the values in the original author's text  =</p>
<p>= 0.25 without/with the refinement of the thematic dictionary, respectively, demonstrates the effectiveness of the moderation of the thematic dictionary on the initial text: the density of keywords increases 1.34 times, or by 34.33% (Fig. 10a). A comparison of the values in the filtered author's text,  = 0.28 and  = 0.34 without/with the refinement of the thematic dictionary, respectively, demonstrates the effectiveness of the moderation of the thematic dictionary on the filtered text: the density of keywords increases 1.23 times, or by 23.14% (Fig. 10b).</p>
<p>(Fig. 10 legend: general text with/without the refined dictionary; filtered text with/without dictionary refinement.)</p>
<p>So, the experimental study confirmed the method's reliability: for the different stages of processing the primary text, the average coincidence of the lists of identified keywords with the author's keywords varies in the range of 52.6-68.5% (by 9%). The accuracy of matching keywords with the author's keywords ranges from 43.6 to 62.9%. The average match of meaningful keywords compared to all found by the system ranges from 38.9-75.8%, depending on the stages of analysis of the article texts. The accuracy of matching keywords compared to all found by the system varies between 34.3-71.9%, depending on the stages of analysis of the article texts. A method of determining stable word combinations when identifying textual content keywords in reference passages of the author's text has been developed. The process applies Zipf's law to the formation of stable word combinations as keywords, taking into account the following rules of preliminary linguistic processing of the text: remove all stop words; form bigrams only within the limits of punctuation marks and of words that are not verbs or pronouns (the latter are treated as punctuation marks); determine verbs by their inflexions; form bigrams from word bases without taking their inflexions into account; determine adjectives by their inflexions and require that an adjective occupy only the first place in a bigram from Ukrainian-language texts. A module has been developed to identify persistent phrases as keywords in textual content. An approach to developing linguistic content-analysis software for determining stable word combinations when identifying keywords of Ukrainian-language and English-language textual content is proposed. The peculiarity of the approach is the adaptation of the linguistic, statistical analysis of lexical units to the peculiarities of the constructions of Ukrainian and English words/texts. The results of the experimental approbation of the proposed method of content analysis of English- and Ukrainian-language texts for determining stable word combinations when identifying keywords of technical texts were studied.</p>
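<p>The 2-gram maximum-likelihood estimate over word bases, with bigrams formed only inside separator-delimited segments, can be sketched as follows. Stemming and separator handling are simplified here, and the segment data are invented for illustration:</p>

```python
from collections import Counter

def bigram_scores(segments: list[list[str]]) -> dict[tuple[str, str], float]:
    """Maximum-likelihood P(w2 | w1) = count(w1, w2) / count(w1),
    computed only inside segments (bigrams never cross separators)."""
    unigrams, bigrams = Counter(), Counter()
    for seg in segments:
        unigrams.update(seg)
        bigrams.update(zip(seg, seg[1:]))
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

# Word bases of two phrases; verbs and punctuation already split the segments.
segments = [["систем", "електрон", "контент", "комерц"],
            ["систем", "електрон", "комерц"]]
scores = bigram_scores(segments)
print(scores[("систем", "електрон")])  # -> 1.0: a candidate stable combination
```

<p>Bigrams whose conditional probability stays high across the corpus are promoted to stable word combinations and, hence, candidate keywords.</p>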
      <p>A method of identifying the style of the author of a text based on the analysis of linguistic
speech coefficients in a standard (reference) text has been developed. The technique consists of a
comparative study of the author's attribution in the author's statistically processed work (the
standard) against an arbitrarily chosen analysed passage. The method evaluates the probability that
the text of an article belongs to the author of the benchmark by analysing the relevant coefficients
of lexical speech: the concentration of the text, the coherence of the speech, the uniqueness of the
text, the syntactic complexity of the speech and the linguistic diversity of the speech. The degree
of speech connectivity does not decrease significantly: in 2001 it varied within [0.5; 1.2], and in
2021 within [0.4; 0.9] (Fig. 11). Moreover, the method works under the condition that the author's
standard has already been studied; the NLP task is to form the author's frequency dictionary,
including service/stop words.</p>
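      <p>The lexical coefficients named above can be illustrated with common textbook-style definitions (the dissertation's exact formulas may differ):
```python
from collections import Counter

def speech_coefficients(text):
    """Illustrative lexical speech coefficients; assumed variants of the
    quantities named in the method, not its exact formulas."""
    words = text.lower().split()
    n = len(words)
    freq = Counter(words)
    unique = len(freq)
    hapax = sum(1 for c in freq.values() if c == 1)      # words used once
    frequent = sum(1 for c in freq.values() if c >= 10)  # high-frequency words
    return {
        "diversity": unique / n,        # linguistic diversity of speech
        "exclusivity": hapax / n,       # uniqueness of the text
        "concentration": frequent / n,  # concentration of the text
    }
```
Comparing these values for a questioned passage against the author's standard yields the probability estimate the method relies on.</p>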
      <p>An algorithm for determining stop words of textual content based on linguistic analysis has
been developed. For the author's individual style, the markers are service/stop words (for example,
particles, conjunctions, prepositions, filler words, slang, etc.) unrelated to the article's topic.
The absolute and relative frequencies of stop words were analysed and compared with the reference
values for each excerpt. Applying the method of reference words therefore gives the following result:
finding, among the studied passages, the one that most likely belongs to the standard. Other results
also confirm the effectiveness of the keyword method in the author attribution of texts. The proposed
assumption that the share, as a parameter of the process, has an insignificant influence on the
results led to a decrease in the correlation coefficients but placed the probabilities of belonging
to the standard for the passages in the correct order (Table 2). Excerpt 4 most likely belongs to the
author of the template (although there is no significant difference between results 4 and 2: if they
were written in the same period, they do not belong to the author of the template; if in different
periods with the template, the probability of belonging to this author increases).</p>
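      <p>The comparison of an excerpt's stop-word frequency profile with the author's reference values, as described above, can be sketched as follows (the Pearson correlation and the sample stop words are illustrative assumptions):
```python
from collections import Counter
from math import sqrt

def stopword_profile(text, stopwords):
    # Relative frequency of each stop word in the text, in a fixed order.
    words = text.lower().split()
    n = len(words)
    counts = Counter(w for w in words if w in stopwords)
    return [counts[s] / n for s in sorted(stopwords)]

def correlation(xs, ys):
    # Pearson correlation between two frequency profiles.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sqrt(sum((x - mx) ** 2 for x in xs))
    vy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy) if vx * vy else 0.0
```
The excerpt whose profile correlates most strongly with the standard's profile is the one most likely belonging to the reference author.</p>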
      <p>An algorithm for the linguistic analysis of Ukrainian-language texts and a syntactic analyser of
text content has been developed. The features of the algorithm are the adaptation of morphological
and syntactic analysis of lexical units to the peculiarities of constructions of Ukrainian words/texts.
Algorithms are tested to identify significant stopwords in Ukrainian-language text based on regular
expressions. When parsing words belonging to a part of speech, declension within this part of speech
was taken into account. For this purpose, word inflexions were analysed for classification, selection
of the basis and formation of the corresponding alphabetic-frequency dictionaries. The dictionaries
contents were subsequently taken into account in the next steps of determining the text's authorship
by calculating the parameters and coefficients of the author's speech. Software implementations for
solving several NLP problems were developed, for the study of:
keywords (https://victana.lviv.ua/kliuchovi-slova);
stable phrases (https://victana.lviv.ua/nlp/stiiki-slovospoluchennia);
classification of textual content (https://victana.lviv.ua/kliuchovi-slova);
quantitative evaluations of speech (https://victana.lviv.ua/nlp/linhvometriia);
the author's style, based on calculations of stylometry coefficients and their comparison with
the corresponding coefficients in the standard text (https://victana.lviv.ua/nlp/stylemetriia);
differences in text signs (https://victana.lviv.ua/nlp/hlotokhronolohiia);
features of the style of texts based on N-grams (https://victana.lviv.ua/nlp/n-grams).</p>
      <p>The results of the experimental approbation of the proposed content monitoring method for
determining the author in Ukrainian-language scientific texts of a technical profile were studied. A
comparison of the results of more than 300 single-author works of a technical direction by 100 different
authors for 2001–2021 was carried out to determine whether and how the coefficients of text
diversity of these authors change in different periods. A method of identifying the potential
(probable) author of a Ukrainian-language text based on the analysis of the author's linguistic speech
coefficients in a reference passage of the author's text has been developed. Decomposition of the
method of determining the author was carried out based on the analysis of such speech coefficients
as speech coherence, degree of syntactic complexity, linguistic diversity, indices of concentration
and exclusivity of the text. In parallel, such parameters of the author's style were analysed as the
number of words in a specific text, the total number of words in this text, the number of sentences,
the number of prepositions, the number of conjunctions, the number of words with a frequency of 1
and the number of words with a frequency of 10 or more, as well as keywords and 3-grams. For
example, 3-grams
of 3 articles were analysed [61-63] (Ukrainian versions). For the most frequently used letters, the
frequency of appearance of 3-grams with such initial letters will have an almost identical distribution
(peak values in Fig. 12a), but not for other letters. Therefore, it is expedient to study only 3-grams
for initial letters that occur less often in the texts of a specific language to determine the degree
to which a text belongs to the corresponding author (for example, Fig. 12b). According to these graphs,
it appears that Articles (1,2) are most likely written by the same author, although the same author
could also have written Articles (1,3) (but this is not the case). Articles (2,3) are written by
different authors.
Applying linguistic, statistical analysis of 3-grams to a set of articles makes it possible to form a
subset of publications similar in terms of linguistic characteristics. Imposing additional conditions in
the form of linguistic, statistical analyses (a set of keywords, stable word combinations (Table 3),
stylometric, linguometric, etc.) will significantly reduce the subset, clarifying the list of more likely
authors' works. Thus, the analysis of the content and frequency of appearance of only official words
separates Articles (1,3) into different subsets, leaving Articles (1,2) in one. 78.4814% of 3-grams were
analysed for Article 1, 72.6332% for Article 2, and 84.1271% for Article 3. The difference in the use of
the corresponding 3-grams between Articles (1,2) is R12=56.5254%, between Articles (2,3) –
R23=69.4271%, between Articles (1,3) – R13=62.9839%. Accordingly, Articles (1,2) are more similar by
[6-12]% (R23&gt;R12 by 12.9017%, R23 &gt; R13 by 6.4432%, R13&gt; R12 by 6.4585%, i.e. R23&gt;R13&gt;R12) than
Articles (1,3) and (2,3). The smaller the Rij, the more likely it is that the same author wrote the
articles. Hence Articles (1,2) are more likely to have been written by one author/team than Articles
(2,3) and (1,3), respectively.</p>
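      <p>The pairwise difference Rij between two articles' 3-gram usage can be sketched as follows (the exact formula is not given in the text, so a total-variation-style distance over trigram distributions is assumed):
```python
from collections import Counter

def trigram_distribution(text):
    # Relative frequencies of letter 3-grams in the text.
    s = "".join(ch for ch in text.lower() if ch.isalpha())
    grams = Counter(s[i:i + 3] for i in range(len(s) - 2))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def difference(p, q):
    """Share of trigram mass used differently by two texts, in percent.
    Smaller values suggest the same author, per the method above."""
    keys = set(p) | set(q)
    return 100.0 * 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```
Computed over all article pairs, the measure orders candidates exactly as the Rij values in the text are compared.</p>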
      <p>When identifying the author of a text, it is assumed that the text reflects the author's style of
writing, which makes it possible to distinguish them from others. To compare texts with each other,
it is necessary to compare numerical characteristics of the text that would be close for texts by the
same author and differ significantly for the works of different authors. Such a
characteristic can be the density of the distribution of letter combinations of three consecutive
symbols (3-grams). During the experimental testing based on the developed four different algorithms
for calculating the degree of verification of the author of the Ukrainian-language text from a set of
possible values, values were obtained that confirm that the style of the authors numbered x and y is
quite close (more than 90%) to the style of collective works 1–4, respectively. In addition, the set
of candidate authors with a similar speech style was significantly reduced (from 42.02% to 34.04% of
the total 100 project participants across more than 300 articles). Figure 13 presents graphs of the results
obtained when applying algorithms to analyse the method developed to determine the author's style.</p>
      <p>Further, for the 34.04% of authors remaining after the first stage, an analysis of the stop
words and keywords of their works was used to determine the author's style. Each individual has their
own vocabulary for conveying thought, including so-called "parasitic" words ("that is", "therefore",
"although", etc.) and service words (conjunctions and particles such as "and", "but", "although", etc.).
Figure 14 presents an example of the analysis of the author's style in the second stage by analysing
the frequency of appearance of service words and keywords, considering various filters.
Therefore, a method of determining the style of the author of thematic Ukrainian-language textual
content was developed based on the analysis of keywords, stable word combinations, N-grams,
linguometry and stylometry, which made it possible to determine the stylistic contribution of each of
the authors and increase the accuracy of attribution of a scientific and technical publication by 6%.
A method for calculating the degree of verification of the author of a Ukrainian-language text from
a set of possible ones based on a comparative analysis of the styles of potential authors has also been
developed, which made it possible to increase the accuracy of classification by style similarity by 7%.</p>
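      <p>The comparative analysis of the styles of potential authors can be sketched as a nearest-reference search over stylometric feature vectors (the feature set and the use of cosine similarity are assumptions for illustration):
```python
from math import sqrt

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu * nv else 0.0

def most_likely_author(questioned, references):
    # references: author name mapped to a stylometric feature vector
    # (e.g. diversity, exclusivity, concentration, syntactic complexity).
    return max(references, key=lambda a: cosine(questioned, references[a]))
```
The questioned text is attributed to the reference author whose stylometric vector it resembles most.</p>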
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>The work solves an important scientific and applied problem of analysis and synthesis of CLS for
solving various problems of processing Ukrainian-language textual content based on the
development of new and improvement of known models, methods and tools of NLP:</p>
      <p>An analysis of the current state and prospects of IT development of natural language
processing was carried out, which made it possible to define the problem and research tasks,
as well as to form general research directions, given the absence both of non-commercial
open-source CLS software for processing Ukrainian-language textual content and of a standardized
design approach.</p>
      <p>The relevance of solving the problem of analysis and synthesis of CLS based on developing the
general structure of a system for processing Ukrainian-language textual content is substantiated.
The interaction of the main IS processes/components with methods of linguistic processing of textual
content adapted to the Ukrainian language (grapheme, morphological, lexical, syntactic, semantic,
structural, ontological and pragmatic analysis) made it possible to improve the IT of intellectual
analysis of the text flow for a specific NLP task. It ensured the adaptation of NLP processes to the
analysis of Ukrainian-language textual content and, on that basis, increased the accuracy of the
obtained results by 6-48%, depending on the specific task. For example, for the NLP
task of determining the Ukrainian-language text keywords, the density of keywords
increases in the range [1.23; 1.48] times or by [23.14; 47.83]% depending on filling the
thematic dictionary quality/accuracy through machine learning.</p>
      <p>The methods of processing information resources, such as integration, management and
support of Ukrainian-language content, were improved, which made it possible to adapt the
process of intellectual analysis of the text flow and develop metrics of the effectiveness of the
CLS functioning for the solution of various tasks of the NLP. The developed methods and
tools make it possible to build a CLS for processing Ukrainian-language text content
according to the needs of the permanent/potential target audience based on the analysis of
the history of actions of website users.</p>
      <p>The NLP methods based on regular expressions of pattern matching were improved, which
made it possible to adapt the methods of tokenization and text normalization by cascades of
simple substitutions of regular expressions and finite state machines.</p>
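      <p>A cascade of simple regular-expression substitutions for tokenization and normalization, as described, might look like this (the particular patterns are illustrative):
```python
import re

# Each step is a simple substitution; applied in order, they form a cascade.
CASCADE = [
    (re.compile(r"https?://\S+"), " "),      # strip URLs
    (re.compile(r"[0-9]+"), " NUM "),        # normalize numbers to one token
    (re.compile(r"[^\w\s'-]"), " "),         # drop punctuation
    (re.compile(r"\s+"), " "),               # collapse whitespace
]

def normalize(text):
    text = text.lower()
    for pattern, repl in CASCADE:
        text = pattern.sub(repl, text)
    return text.strip().split()
```
Because each stage is a plain substitution, the whole cascade is equivalent to a composition of finite state transducers, which is what makes this approach efficient.</p>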
      <p>The MA (morphological analysis) method for Ukrainian-language text, based on word
segmentation and normalization, sentence segmentation and a modified Porter stemming algorithm,
was improved as an effective tool for identifying lemma affixes so that the analysed word can be
marked up, which made it possible to increase the accuracy of keyword searches by 9%.
The IT of the intellectual analysis of the text flow was improved based on the processing of
information resources, which made it possible to adapt the general structure of modules for
integration, management and support of content to solve various tasks of the NLP and
increase the efficiency of the operation of the CLS by 6-9%. It became possible thanks to the
combination of methods of linguistic analysis adapted to the Ukrainian language, improved
IT processing of information resources, ML, and a set of metrics for evaluating the
effectiveness of the CLS's functioning. The main principle of building such CLS is modularity,
which facilitates their construction by requiring the availability of appropriate processes for
solving a specific NLP problem.</p>
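      <p>The modified Porter-style stemming mentioned above can be sketched as iterative suffix stripping (the suffix list is a small illustrative subset, not the dissertation's full inventory):
```python
# A few common Ukrainian inflexional suffixes, matched longest first
# (illustrative subset only).
SUFFIXES = ["ться", "ами", "ові", "ого", "ому", "ись", "ння", "ий", "ій",
            "ою", "ти", "ах", "ів", "а", "и", "і", "о", "у", "ю", "я"]

def stem(word, min_stem=3):
    """Strip one matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            return word[:len(word) - len(suf)]
    return word
```
The minimum-stem-length guard plays the role of Porter's measure condition: it prevents short function words from being stripped to nothing.</p>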
      <p>A method of determining the author in Ukrainian-language texts has been developed based
on the analysis of the coefficients of the author’s lexical speech in the reference passage of
the author’s text, which is based on the study of a collection of keywords, persistent phrases,
indicators of linguometry and stylometry, as well as the results of N-gram analysis (based on
comparisons of differences in 2-gram and 3-gram usage: for publications similar in style the
difference lies in the range of [6; 7]%, and for clearly dissimilar ones it exceeds 12%), which made
it possible to determine a set of potential authors of publications from more than one candidate
author (up to [9; 34]% of the
total number of project participants) and develop a method for identifying the author's style.
A method of determining stable word combinations was developed based on the
identification of keywords of the Ukrainian-language text and the analysis of the linguistic
speech coefficients of the author of the text in reference excerpts of the content, which made
it possible to improve the accuracy of the method of determining the style of the author of
the text by 9% based on statistical linguistics.</p>
      <p>Relevant materials confirm the reliability of the scientific and practical results of the
dissertation research by comparing the practical results obtained on different samples of reliable
input data. The CLS was developed on the information resource http://victana.lviv.ua using CMS
Joomla (for designing the e-framework of articles), PHP
(for implementing text content processing methods), HTML (for implementing page
markup), CSS (for describing page styles), and MySQL (for storing data and dictionaries). The
experimental study confirmed the reliability of the method of identifying keywords - for
different algorithms for processing the primary text, the average match between the lists of
identified keywords and the author's keywords varies in the 52.6-68.5% range. The accuracy
of matching keywords with the author's keywords ranges from 43.6 to 62.9%. The average
match of meaningful keywords compared to all found by the system ranges from 38.9-75.8%,
depending on the stages of analysis of article texts. The accuracy of matching keywords
compared to all found by the system varies between 34.3-71.9%, depending on the stages of
analysis of article texts.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>The research was carried out with the grant support of the National Research Fund of Ukraine,
"Information system development for automatic detection of misinformation sources and inauthentic
behaviour of chat users ", project registration number 187/0012 from 1/08/2024 (2023.04/0012). Also,
we would like to thank the reviewers for their precise and concise recommendations that improved
the presentation of the results obtained.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[1] I. Lauriola, A. Lavelli, F. Aiolli, An introduction to deep learning in natural language processing:</p>
      <p>Models, techniques, and tools, Neurocomputing 470 (2022) 443-456.
[2] Y. Kang, Z. Cai, C. W. Tan, Q. Huang, H. Liu, Natural language processing (NLP) in management
research: A literature review, Journal of Management Analytics 7(2) (2020) 139-172.
[3] L. Hickman, S. Thapa, L. Tay, M. Cao, P. Srinivasan, Text preprocessing for text mining in
organizational research: Review and recommendations, Organizational Research Methods 25(1)
(2022) 114-146.
[4] D. Hu, An introductory survey on attention mechanisms in NLP problems, in: Proceedings of
the Intelligent Systems Conference on Intelligent Systems and Applications 2 (2020) 432-448.
[5] M. Gardner, W. Merrill, J. Dodge, M. E. Peters, A. Ross, S. Singh, N. A. Smith, Competency
problems: On finding and removing artifacts in language data, arXiv preprint arXiv:2104.08646,
2021.
[6] L. Wu, et al., Graph neural networks for natural language processing: A survey, Foundations
and Trends in Machine Learning 16(2) (2023) 119-328.
[7] E. Fedorov, O. Nechyporenko, Linguistic Constructions Translation Method Based on Neural</p>
      <p>Networks, CEUR Workshop Proceedings 3396 (2023) 295-306.
[8] M.-A. Lefer, N. Grabar, Super-creative and over bureaucratic: A cross-genre corpus based study
on the use and translation of evaluative prefixation in ted talks and EU parliamentary debates,
Across Languages and Cultures 16(2) (2015) 187–208.
[9] M. Konyk, V. Vysotska, S. Goloshchuk, R. Holoshchuk, S. Chyrun, I. Budz, Technology of
Ukrainian-English Machine Translation Based on Recursive Neural Network as LSTM, CEUR
Workshop Proceedings 3387 (2023) 357-370.
[10] N. Shakhovska, I. Shvorob, The method for detecting plagiarism in a collection of documents,
in: Proceedings of the International Conference on Computer Sciences and Information
Technologies, CSIT, 2015, pp. 142-145.
[11] O. Karnalim, G. Kurniawati, Programming Style on Source Code Plagiarism and Collusion</p>
      <p>Detection, International Journal of Computing 19(1) (2020). 27-38.
[12] V. Vysotska, Y. Burov, V. Lytvyn, A. Demchuk, Defining Author's Style for Plagiarism Detection
in Academic Environment, in: Proceedings of the International Conference on Data Stream
Mining and Processing, DSMP, 2018, pp. 128-133.
[13] O. Barkovska, V. Kholiev, A. Havrashenko, D. Mohylevskyi, A. Kovalenko, A Conceptual Text
Classification Model Based on Two-Factor Selection of Significant Words, CEUR Workshop
Proceedings 3396 (2023) 244-255.
[14] A. Berko, Y. Matseliukh, Y. Ivaniv, L. Chyrun, V. Schuchmann, The text classification based on
Big Data analysis for keyword definition using stemming, in: Proceedings of the IEEE 16th
International conference on computer science and information technologies on Computer
science and information technologies, Lviv, Ukraine, 22–25 September, 2021, pp. 184–188.
[15] V. Lytvyn, V. Vysotska, I. Budz, Y. Pelekh, N. Sokulska, R. Kovalchuk, L. Dzyubyk, O.</p>
      <p>Tereshchuk, M. Komar, Development of the quantitative method for automated text content
authorship attribution based on the statistical analysis of N-grams distribution,
Eastern-European Journal of Enterprise Technologies 6(2(102)) (2019) 28–51.
doi: 10.15587/1729-4061.2019.186834.
[16] I. Khomytska, I. Bazylevych, V. Teslyuk, I. Karamysheva, The chi-square test and data clustering
combined for author identification, in: Proceedings of the IEEE XVIIIth Scientific and Technical
Conference on Computer Science and Information Technologies, 2023.
[17] I. Khomytska, V. Teslyuk, The Multifactor Method Applied for Authorship Attribution on the</p>
      <p>Phonological Level, CEUR workshop proceedings 2604 (2020) 189-198.
[18] R. Romanchuk, V. Vysotska, V. Andrunyk, L. Chyrun, S. Chyrun, O. Brodyak, Intellectual
Analysis System Project for Ukrainian-language Artistic Works to Determine the Text
Authorship Attribution Probability, in: Proceedings of the International Scientific and Technical
Conference on Computer Sciences and Information Technologies, 2023.
[19] I. Khomytska, V. Teslyuk, A. Holovatyy, O. Morushko, Development of methods, models, and
means for the author attribution of a text, Eastern-European Journal of Enterprise Technologies
3(2(93)) (2018) 41–46. doi: 10.15587/1729-4061.2018.132052.
[38] A. Yarovyi, D. Kudriavtsev, Method of Multi-Purpose Text Analysis Based on a Combination of</p>
      <p>Knowledge Bases for Intelligent Chatbot, CEUR Workshop Proceedings 2870 (2021) 1238-1248.
[39] N. Shakhovska, O. Basystiuk, K. Shakhovska, Development of the Speech-to-Text Chatbot</p>
      <p>Interface Based on Google API, CEUR Workshop Proceedings 2386 (2019) 212-221.
[40] T. Basyuk, A. Vasyliuk, Peculiarities of an Information System Development for Studying
Ukrainian Language and Carrying out an Emotional and Content Analysis, CEUR Workshop
Proceedings 3396 (2023). URL: https://ceur-ws.org/Vol-3396/paper23.pdf.
[41] V. Vysotska, S. Holoshchuk, R. Holoshchuk, A Comparative Analysis for English and Ukrainian
Texts Processing Based on Semantics and Syntax Approach, CEUR Workshop Proceedings 2870
(2021) 311-356.
[42] A. Dmytriv, S. Holoshchuk, L. Chyrun, R. Holoshchuk, Comparative Analysis of Using Different
Parts of Speech in the Ukrainian Texts Based on Stylistic Approach, CEUR Workshop
Proceedings 3171 (2022) 546-560.
[43] S. Yevseiev, et al., Development of a Method for Determining the Indicators of Manipulation
Based on Morphological Synthesis, Eastern-European Journal of Enterprise Technologies 117(9)
(2022).
[44] O. Cherednichenko, O. Kanishcheva, O. Yakovleva, D. Arkatov, Collection and Processing of a</p>
      <p>Medical Corpus in Ukrainian, CEUR Workshop Proceedings 2604 (2020) 272-282.
[45] A. Dmytriv, V. Vysotska, M. Bublyk, The Speech Parts Identification for Ukrainian Words Based
on VESUM and Horokh Using, in: Proceedings of the 16th International Conference on
Computer Sciences and Information Technologies (CSIT), vol. 2, 2021, September, pp. 21-33.
[46] V. Vysotska, S. Mazepa, L. Chyrun, O. Brodyak, I. Shakleina, V. Schuchmann, NLP Tool for
Extracting Relevant Information from Criminal Reports or Fakes/Propaganda Content, in:
Proceedings of the IEEE 17th International Conference on Computer Sciences and Information
Technologies (CSIT), 2022, November, pp. 93-98.
[47] M. Lupei, O. Mitsa, V. Sharkan, S. Vargha, N. Lupei, Analyzing Ukrainian Media Texts by Means
of Support Vector Machines: Aspects of Language and Copyright, in: Proceedings of the
International Conference on Computer Science, Engineering and Education Applications, 2023,
March, pp. 173-182.
[48] V. Vysotska, Analytical Method for Social Network User Profile Textual Content Monitoring
Based on the Key Performance Indicators of the Web Page and Posts Analysis, CEUR Workshop
Proceedings 3171 (2022) 1380-1402.
[49] K. Shakhovska, I. Dumyn, N. Kryvinska, M. K. Kagita, An approach for a next-word prediction
for Ukrainian language, Wireless Communications and Mobile Computing 2021 (2021) 1-9.
[50] I. Demydov, Architecture of the Computer-linguistic System for Processing of Specialized
Web-communities’ Educational Content. URL: https://ceur-ws.org/Vol-2616/paper1.pdf.
[51] V. Vysotska, Ukrainian participles formation by the generative grammars use, CEUR Workshop</p>
      <p>Proceedings 2604 (2020) 407–427.
[52] B. Bengfort, R. Bilbro, T. Ojeda, Applied text analysis with Python: Enabling language-aware
data products with machine learning, O'Reilly Media, Inc., 2018.
[53] D. Jurafsky, J. H. Martin, Speech and Language Processing. URL:
https://web.stanford.edu/~jurafsky/slp3/ed3book_sep212021.pdf.
[54] D. Jurafsky, J. H. Martin, Regular Expressions, Text Normalization, Edit Distance. URL:
https://web.stanford.edu/~jurafsky/slp3/2.pdf.
[55] D. Jurafsky, J. H. Martin, Deep Learning Architectures for Sequence Processing. URL:
https://web.stanford.edu/~jurafsky/slp3/9.pdf.
[56] D. Jurafsky, J. H. Martin, Naive Bayes and Sentiment Classification. URL:
https://web.stanford.edu/~jurafsky/slp3/4.pdf.
[57] D. Jurafsky, J. H. Martin, Logistic Regression. URL:
https://web.stanford.edu/~jurafsky/slp3/5.pdf.
[58] D. Jurafsky, J. H. Martin, Neural Networks and Neural Language Models. URL:
https://web.stanford.edu/~jurafsky/slp3/7.pdf.
[59] I. Khomytska, V. Teslyuk, N. Kryvinska, I. Bazylevych, Software-based approach towards
automated authorship acknowledgement-chi-square test on one consonant group, Electronics
(Switzerland) 9(7) (2020) 1–11.
[60] A. R. Sydor, V. M. Teslyuk, P. Y. Denysyuk, Recurrent expressions for reliability indicators of
compound electropower systems, Technical Electrodynamics 4 (2014) 47–49.
[61] V. Lytvyn, et al., Development of the linguometric method for automatic identification of the
author of text content based on statistical analysis of language diversity coefficients,
Eastern-European Journal of Enterprise Technologies 5(2(95)) (2018) 16–28. doi:
10.15587/1729-4061.2018.142451.
[62] V. Lytvyn, et al., Development of the system to integrate and generate content considering the
cryptocurrent needs of users, Eastern-European Journal of Enterprise Technologies 1(2(97))
(2019) 18–39. doi: 10.15587/1729-4061.2019.154709.
[63] P. Kravets, The Game Method for Orthonormal Systems Construction, in: Proceedings of the
9th International Conference - The Experience of Designing and Applications of CAD Systems
in Microelectronics, 2007. doi: 10.1109/cadsm.2007.4297555.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>