Victoria Vysotska victoria.a.vysotska@lpnu.ua Lviv Polytechnic National University

Stepan Bandera 12 79013 Lviv Ukraine

ISSN 1613-0073

Keywords: computer linguistic systems, NLP, Ukrainian-language textual content, machine learning

The work aims to develop models, methods, and means of analysis and synthesis of computer linguistic systems (CLS) based on new and improved methods of processing Ukrainian-language textual content to solve natural language processing (NLP) problems. The scientific novelty of the obtained results lies in solving an important scientific and applied problem: the analysis and synthesis of CLS for various tasks of processing Ukrainian-language textual content, based on developing new and improving known NLP models, methods and means. The following new scientific results were obtained:
- A model of intellectual analysis of the text flow which, unlike existing ones, is based on information resource processing, NLP and machine learning, and which defines the typical structures of the content integration, management and support modules;
- Methods of information resource processing adapted to Ukrainian-language text, which take into account the needs of the permanent target audience based on an analysis of the history of that audience's activity on the CLS web resource; this made it possible to form a set of metrics and indicators of the effectiveness of CLS functioning for various NLP tasks;
- A model of linguistic text processing based on improved grapheme, morphological, lexical and syntactic analyses which, unlike existing ones, are adapted to Ukrainian-language text through regular expressions and machine learning; this made it possible to adapt the processing of Ukrainian-language text content and to increase the accuracy of the results depending on the specific NLP task;
- A method of identifying keywords in Ukrainian-language texts based on grapheme and morphological analysis of word bases (stems) through regular expressions and N-grams, which made it possible to increase the accuracy of keyword search, find stable word combinations and categorize content;
- A method of determining the author's style of thematic Ukrainian-language text content based on the analysis of keywords, stable word combinations and N-grams, which made it possible to determine the stylistic contribution of each author and increase the accuracy of attribution of a scientific and technical publication;
- A method for calculating the degree of verification of the author of a Ukrainian-language text from a set of possible authors, based on a comparative analysis of the styles of potential authors, which made it possible to increase the accuracy of classification by similarity of style;
- Methods of analysis and synthesis of CLS based on a general typical structure of a CLS for processing Ukrainian-language text content, with support for modularity and modelling of the interaction of the main processes and components; this made it possible to expand the collection of solutions to typical NLP tasks by implementing typical software for such systems;
- NLP methods which, unlike existing ones, are implemented on the basis of the developed regular expressions for grapheme and morphological analysis of Ukrainian-language text and a modified Porter stemming algorithm as an effective means of identifying lemma affixes in order to delimit the analysed word; this made it possible to optimize the process and improve the accuracy of normalizing Ukrainian words and sentences;
- Text tokenization and normalization methods which, in contrast to existing ones, use cascades of simple substitutions of the developed regular expressions for template matching, based on production rules, finite automata and an ontological model of the rules of Ukrainian syntax.

Introduction

The active development of information technologies (IT) is at the intersection of globalization and informatization. The rapid growth of society's informatization is directly related to the pace of development and implementation of computer linguistic systems (CLS), which are built on models and methods of natural language processing (NLP) [1][2][3]. The complexity of developing NLP models, techniques, and tools lies in solving non-typical NLP problems and adapting these models, methods, and tools to a specific natural language [4][5][6]. Each natural language is unique, with its own rules, history, grammar, exceptions, and peculiarities of generating linguistic units for conveying meaning, which complicates the development of a CLS.

Usually, each successful CLS development project is designed for a specific task (for example, machine translation [7][8][9], identification of plagiarism/rewriting [10][11][12], text rubrication [13][14], text attribution analysis [15][16][17][18][19][20][21], information retrieval [22][23][24][25][26][27][28], referencing/abstracting [29][30], voice assistants [31][32][33], intelligent chatbots [34][35][36][37][38][39], etc.) and is both one-off and closed (for example, Amazon Alexa, Google Assistant, Facebook, Voice Mate, Bixby, Siri, ABBYY Lingvo, Microsoft Cortana, Microsoft Word, Grammarly, Google Translate, PROMT, CuneiForm, Trados, OmegaT, Wordfast, Dragon, IBM ViaVoice, Speereo, FineReader, Tesseract, OCRopus, etc.), so interested IT professionals cannot examine its internals. Only in rare cases do developers provide open access to such CLS projects and the opportunity to study their structure and content. The development of any NLP application for an arbitrary natural language, out of more than 7000 languages and dialects, is based on studying large monolingual/parallel text corpora of that language, containing hundreds of millions of words, together with linguistic resources. The results of research on such corpora are known for only about 20 natural languages (English, Chinese, Western European languages, Japanese, etc.), which makes it possible to develop CLS of various complexity for these languages. Unfortunately, the international scientific community currently regards Ukrainian as an exotic, low-resource language, i.e., one that lacks sufficient training, research and processed data for developing modern NLP applications. Such applications are used to build CLS in cyber security (detection of fakes, propaganda, and so-called trolls/bots in social networks), sociology (analysis of the dynamics of changes in public opinion on thematic issues), philology (automatic study of large data sets of various thematic orientations and periods), psychology (analysis of a person's psychological portrait, identification of post-traumatic stress disorder in participants in hostilities or people who lived under occupation), national security (information warfare), jurisprudence (criminology and court cases), social communications (analysis of community posts in social networks) and other important sectors of modern Ukraine. The above determines the relevance of the topic of the dissertation research.

Scientific research by N. Chomsky, V.M. Glushkov, A.V. Hladkoy, D.V. Lande, V.A. Shyrokov, N.V. Sharonova, N.F. Khairova, O.V. Bisikalo, S.N. Buk, N.P. Darchuk, Z.V. Partyka, A.V. Anisimova, Yu.D. Apresyan, O.O. Marchenko, I.M. Kulchytskyi, A.O. Nikonenko, M. Gross, A. Lanten, V.H. Yngve, S. Sharoff, Yu.A. Schrader, D. Jurafsky, B. Bengfort, J.H. Martin, L. Tesniere, T. Ojeda, P.M. Postal, D.G. Hays, T.A. van Dijk, S. Marcus, J. Lyons, L.W. Tosh, Y. Bar-Hillel, D.G. Bobrow, G. Lakoff, R. Bilbro, N. Kotsyba, A.Yu. Berko, Yu.M. Shcherbyna, V.Yu. Velychko, V.F. Starko and many others makes it possible to understand the basic principles of linguistic text processing depending on the features of a specific natural language. More than 80% of such studies concern the processing of English-language texts. There are fewer studies on Slavic languages, particularly the low-resource Ukrainian language. In particular, there are no publications on development recommendations, functional requirements, general structure, or typical architecture of a CLS for processing Ukrainian-language textual content. Directly applying models, methods, algorithms, and processing technologies developed for English to Ukrainian-language textual content does not yield positive results. Already at the level of morphological analysis, a significant conflict arises between the methods developed for English-language text and their use on Ukrainian-language text. For example, the plain Porter stemming algorithm, without appropriate modification, does not correctly separate the base of the word from the inflexion. This leads to inaccurate identification of key phrases, which in turn affects the solution of any NLP problem that requires quickly identifying a set of keywords (categorization, search, annotation, etc.).
Determining the main features and processes of linguistic analysis of Ukrainian-language texts will significantly facilitate the stages of processing the text flow of information, such as integration, support and content management. In turn, the adaptation of the processes of intellectual analysis of text content with the identification of functional requirements for the relevant modules of the CLS will lead to the possibility of developing its typical architecture based on the principle of modularity (adding components depending on the content of the NLP task and the purpose of the CLS).
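The stemming mismatch described above can be illustrated with a minimal sketch. It is not the modified Porter algorithm of this work: both suffix lists below are hypothetical and deliberately tiny, and `strip_suffix` is an illustrative helper. The sketch only shows that English-oriented suffix rules leave the inflected forms of a Ukrainian word unmerged, whereas even a small list of Ukrainian endings maps them to a single base.

```python
# Hypothetical, deliberately tiny suffix inventories (illustration only).
ENGLISH_SUFFIXES = ("ing", "ed", "es", "s")
UKRAINIAN_ENDINGS = ("ами", "ями", "ого", "ому", "ою", "ів", "ах", "ях",
                     "а", "у", "и", "і", "я", "ю", "е", "о")


def strip_suffix(word, suffixes, min_stem=3):
    """Remove the longest matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


# Four inflected forms of the same noun ("system").
forms = ["система", "системи", "системою", "системами"]

# English rules never fire on Cyrillic endings, so every form stays distinct.
english_rules = {strip_suffix(w, ENGLISH_SUFFIXES) for w in forms}

# Ukrainian endings collapse all four forms to one base.
ukrainian_rules = {strip_suffix(w, UKRAINIAN_ENDINGS) for w in forms}
```

A keyword counter built on `english_rules` would treat the four forms as four different terms, which is the kind of inaccuracy in key-phrase identification that the paragraph above refers to.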

The above testifies to the relevance of research aimed at solving the significant scientific and applied problem of analysis and synthesis of CLS for various tasks of processing Ukrainian-language textual content. Solving it will make it possible to raise the resource level of the Ukrainian language through the development of new and the improvement of known NLP models, methods and means.

The work aims to develop models, methods, and means of analysis and synthesis of computer linguistic systems based on new and improved known methods of processing Ukrainian-language textual content to solve natural language processing problems. Achieving this aim requires the following tasks:
1. To analyse the specifics of CLS construction by systematizing the processes of their implementation and functioning, which will make it possible to single out a class of systems whose functional properties allow a quantitative assessment of the expected effects of implementing a typical CLS for processing Ukrainian-language textual content for various NLP tasks;
2. To develop an information technology for constructing CLS for processing Ukrainian-language text, which will make it possible to determine their basic structure, functional requirements, the sequence of configuring and training the system, and general design principles;
3. To propose information technologies for processing information resources (integration, management and support of Ukrainian-language content) based on improved linguistic analysis of text content, in order to develop metrics for evaluating the effectiveness of CLS functioning for various NLP tasks;
4. To develop methods of processing Ukrainian-language textual content for various NLP problems to increase the accuracy of the obtained results;
5. To develop methods and means of intellectual analysis of textual content to increase the efficiency of solving various NLP tasks;
6. To create software modules for processing Ukrainian-language textual content for various NLP tasks and for conducting experiments;
7. To test the obtained results by building and deploying applied CLS for processing Ukrainian-language textual content.

The object of research is the processes of analysis and synthesis of computer linguistic systems for processing Ukrainian-language textual content.

The research subject is models, methods, and means of processing Ukrainian-language textual content to solve various problems of NLP.

The following research methods were used to achieve the goal: the theory of formal grammars and automata, set theory, the theory of data and knowledge models, probability theory and mathematical statistics, the theory of models, algorithms, and logical-linguistic numbers, information theory, graph theory, and knowledge representation methods, for modelling the processes of processing Ukrainian-language textual content and developing machine learning modules; models and methods of processing and analysing textual content, for implementing the processes of solving various NLP problems; methods of object-oriented and system analysis and design, for the design and development of CLS; the theory of relational databases, methods of artificial intelligence, and object-oriented programming, for the software implementation of the Ukrainian-language textual content processing system for various NLP tasks. The practical significance of the obtained results lies in the fact that they can be used to build applied CLS for processing Ukrainian-language textual content.
In particular, the following results are practically valuable:
- The application of the method of identifying persistent word combinations when identifying keywords in Ukrainian-language scientific texts of a technical profile increases the accuracy of keyword search by 6-9% and extracts thematic terms from the text for further classification of the publication;
- The development of a formal approach to the design of a content monitoring module for identifying keywords in Ukrainian-language texts based on web data mining, NLP and linguistic analysis of defined words of text content, which made it possible to develop the general structure of a typical CLS and increase the effectiveness of CLS functioning by 6-9%, depending on the specific NLP problem;
- The application of the method of calculating the degree of verification of the author of a Ukrainian-language text based on the analysis of the styles of potential authors, which made it possible to increase the accuracy of identification by 6-12% and to decompose the method through the study of stylistic coefficients such as the coherence of speech, the degree of syntactic complexity, linguistic diversity, and the indices of concentration and exclusivity of the text;
- The development of a content monitoring module to identify a potential author of a text from a set of possible ones, based on comparing the analysis results of a template author's text with the text under study, in order to reduce the corresponding set to [9;34]% of the total number of project participants, depending on the subject and the time range of writing scientific and technical publications, as well as the frequency of this author's publications on a specific topic in that period;
- Experimental testing of the method of identifying the author's style in Ukrainian-language texts based on web data mining and linguistic analysis of defined stop words, which allows content potentially similar in style to be selected from a set of a potential author's publications.

Related works

Determining the main processes and features of the linguistic analysis of Ukrainian-language texts will significantly facilitate the stages of processing the text flow of content, such as integration, support and content management (Fig. 1). Adapting the processes of intellectual analysis of text content, together with identifying functional requirements for the relevant CLS modules, will make it possible to develop a typical structure of such systems based on the principle of modularity (adding components depending on the content of the NLP task and the purpose of the CLS). Applying the specified information technologies, methods and models within the typical CLS structure, adapted to any process of handling Ukrainian-language textual content, is a necessary prerequisite for the successful implementation of a CLS project for a specific NLP task. Such a project requires an appropriate set of standard libraries, utilities and open-source software that implement the specialized functions of the project according to the needs of the end user. The state of the CLS is determined by the tuple of its main properties at a specific moment in time or during the activity of the corresponding NLP process: $s_i = (p_{i1}, p_{i2}, \ldots, p_{im})$, $i = \overline{1,n}$, where $s_i$ is the $i$-th state at a specific moment in time $t$ from the set of states $S$ with cardinality $|S| = n$, and $p_{ij}$ is the $ij$-th property of the state from the set $P$ with cardinality $|P| = m$, which determines the behaviour of the CLS as $p_{ij} = (r_{ij1}, r_{ij2}, \ldots, r_{ijk})$, $j = \overline{1,m}$, where $r_{ijk}$ is the corresponding parameter of the specific property $p_{ij}$ for the state $s_i$. For any CLS, the state $s_i$ is one of the NLP processes, for example, the identification of keywords and/or stable phrases, which feeds the next state $s_{i+1}$ of the system, such as the rubrication of a text data array. Accordingly, the properties $p_{ij}$ of the state $s_i$ are morphological, lexical and syntactic; some NLP tasks may also have semantic properties, etc.
Then, for each property $p$, a set of parameters $r$ is determined for the corresponding text analysis, depending on the specific NLP task [40][41][42][43][44][45][46][47][48][49][50]. According to these parameters, the strategy of CLS operation at the moment of time $t$ is specified as follows:
- the parameters of the morphological property are N-grams and morphemes (roots, endings, affixes), grammatical categories of different parts of speech, word length, word placement in a sentence, number of syllables in a word, number of word contents, ratio of consonants and vowels, etc.;
- the parameters of the lexical property are the location of the sentence in the text, the location of the word in the sentence, the weight of the word, the weight of the sentence, the base of the word, the inflexion of the word, etc.;
- the parameters of the syntactic property are the depth of the word in the dependency tree of the sentence, the location of the word in the sentence, the number of contents of the word, the number of words per sentence, the number of words and of sentences, whether the word is capitalized, hyphenated or compound, etc.;
- the parameters of the semantic property are the number of word contents, the depth of the word in the dependency tree, the size of paragraphs, the placement of paragraphs, etc.
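A few of the morphological parameters listed above (word length, number of syllables, consonant-to-vowel ratio, character N-grams) can be computed directly from a word form. The following is a minimal sketch under stated assumptions: the names `MorphParams`, `morph_params` and `char_ngrams` are illustrative, the syllable count is approximated by the number of vowel letters, and the soft sign and apostrophe are excluded from the consonant count.

```python
from dataclasses import dataclass

UKRAINIAN_VOWELS = set("аеєиіїоуюя")
# Assumption: the soft sign and apostrophe variants denote no separate sound.
NON_SOUND_SIGNS = set("ь'’ʼ")


@dataclass(frozen=True)
class MorphParams:
    length: int                    # word length in characters
    syllables: int                 # approximated by the number of vowel letters
    consonant_vowel_ratio: float   # consonant letters per vowel letter


def morph_params(word):
    """Compute a few morphological parameters of a single word form."""
    letters = [ch for ch in word.lower()
               if ch.isalpha() and ch not in NON_SOUND_SIGNS]
    vowels = sum(ch in UKRAINIAN_VOWELS for ch in letters)
    consonants = len(letters) - vowels
    ratio = consonants / vowels if vowels else float("inf")
    return MorphParams(length=len(word), syllables=vowels,
                       consonant_vowel_ratio=ratio)


def char_ngrams(word, n):
    """Return all contiguous character N-grams of the word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]
```

For example, `morph_params("текст")` reports one syllable and four consonants per vowel, and `char_ngrams("мова", 2)` yields the three bigrams of the word.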

Depending on the tuple of properties $p$ of the state $s$, the behaviour of the CLS is determined, that is, the implementation of a set of rules (activation of actions or events) for carrying out a specific NLP process depending on the input text data. Accordingly, an event $o$ is the change of one property into another, $p \to p'$, i.e. $o: p \to p'$, subject to the fulfilment of certain conditions $U$ for the input analysed text $X$ and the intermediate processed text $C$: $p' = o(p, U, X, C)$. An action $d$ is the activation of an event $o$ by another event $o'$ in the CLS: $C' = d(o \circ o')$. The more complex the language (morphology, syntax, etc.), the more difficult it is to process the corresponding natural-language texts. In addition, for low-resource languages such as Ukrainian, there are no standardized rules and dictionaries for processing natural-language texts to solve the relevant NLP tasks. Many scientific linguistic schools and IT specialists are working on creating Ukrainian dictionaries, text corpora and rules for processing Ukrainian texts. However, these are usually linguists and philologists unfamiliar with the features of specific modern tools, such as programming languages, ML methods, big data analysis, etc. There is a colossal gap between the research results of philologists and applied linguists, on the one hand, and IT specialists, on the other, in the processing of Ukrainian-language texts. Today, very few publicly accessible NLP tools have been implemented for languages such as Ukrainian.

Material and methods

The developed typical structure of 𝑆 CLS consists of modules for solving a specific task of NLP 𝑀 , content support 𝑀 , content integration 𝑀 , content management 𝑀 , linguistic 𝑀 and intelligent analysis of textual content flows (IATCF) 𝑀 [48]:

𝑆 =< 𝑀 , 𝑀 , 𝑀 , 𝑀 , 𝑀 , 𝑀 >.(1)

Accordingly, the solution module of a specific NLP problem 𝑀 :

𝑀 =< 𝑁 , 𝑆 , 𝑆 , 𝑆 , 𝑆 , 𝑃 , 𝐼 >, (2)

where 𝑆 is the average conversion rate, 𝑆 is the average cost of orders, 𝑆 is the average cost or utility of the purpose of the visit, 𝑆 is the average 𝑃 ROI or the average return on investment, 𝑃 is the percentage (%) of profit from new visitors, 𝐼 is the new buyers/customers index at the first visit.

The presence of the 𝑀 text content support module reduces costs for moderators/analysts who collect/analyze statistical data on the dynamics of the CLS functioning, the activity of the permanent target audience as a reaction to website content changes, and the formation of rules for the analysis of user information portraits and thematic content plots:

𝑀 =< 𝐼 ,

where 𝑃 (𝑤) is a function for determining % of visits from advertisement w; 𝑃 (𝑤) is a function for determining % conversion of goals for visits from w; 𝐼 (𝑤) is a function for determining the index of advertising quality w; 𝑁 is the total number of user queries of intellectual and informational search (IIS) by keywords; 𝑁 is the number of direct visits to the website; 𝑁 is the number of IIS requests with brand name. The presence of the 𝑀 text content integration module reduces the costs of CLS moderators and content authors, automating/implementing some of their work/functions such as content collection from several different reliable sources, its recognition, filtering, saving, formatting, analysis, annotation, classification, etc.:

𝑀 =< 𝑃 , 𝑃 , 𝑃 , 𝐾 , 𝐾 , 𝑃 , 𝑃 , 𝑆 , 𝑃 , 𝑆 , 𝑆 >,

where 𝑃 , 𝑃 and 𝑃 are % of repeat visits of the user from the previous visit >𝑡 , within

where 𝐾 is an indicator of internal IIS; 𝑃 is % edition of the page with an error; 𝑃 and 𝑃 are % of mobile users with a high-speed Internet connection; 𝑃 and 𝑃 are % of users with low/medium/high display resolution and with a specific operating system; 𝑃 and 𝑃 are % of users with a specific browser and with English and/or Ukrainian language support; 𝐾 is an indicator of the number of users, views and page visits. The 𝑆 indicator is the base of the content management module:

𝑆 =< 𝑁 , 𝑁 , 𝑁 , 𝑁 >,(8)

where 𝑁 and 𝑁 are the average number of page views per visit and for a specific time 𝑡; 𝑁 is the average number of unique users for a specific time 𝑡; 𝑁 is the average number of visits for a specific time 𝑡. The indicator of internal search on the site:

𝐾 =< 𝑁 , 𝑃 , 𝑃 , 𝑃 , 𝑃 , 𝑃 , 𝑃 , 𝑆 , 𝑃 , 𝑃 , 𝑃 , 𝑃 , 𝑆 , 𝑇 , 𝑃 , 𝑃 , 𝐾 >,

where 𝑁 is the number of zero search results; 𝑃 and 𝑃 are % of users who were on the page for > 𝑡 time and viewed > 𝑘 pages after the search; 𝑃 and 𝑃 are % of purchases made and % of buyers among users using search; 𝑃 is % of rejections after visiting one page as a search result; 𝑃 is % conversion from users using search; 𝑃 and 𝑃 are % of users who do not use and use search; 𝑆 is the average number of pages viewed by visitors after a search; 𝑇 is the average time spent on the site for a visit after a search; 𝑃 and 𝑃 are % of visitors who conduct several searches during the visit and who left the site after viewing the search results; 𝑆 is the average number of search results; 𝑃 is % of visits with search; 𝑃 is % of zero search results, in particular,

𝑃 = 𝑁 𝑁 , 𝑃 = 𝑁 𝑁 , 𝐾 = 𝑁 𝑁 ,(9)

where 𝑁 , 𝑁 and 𝑁 are the numbers of all viewed pages, pages issued with an error, and pages viewed with a search, respectively; 𝑁 is the number of zero search results; 𝑁 and 𝑁 are the numbers of visits without search and with search.

The presence of a module for intellectual analysis of text streams of content reduces the time/costs/personnel/resources for the timely and prompt acquisition of relevant, unique, current content, which leads to an increase in the volume of the target audience of CLS, in particular, contributes to the growth of the economic effect of the implementation:

𝑀 =< 𝑆 , 𝑆 , 𝑆 , 𝑃 , 𝑃 >, (10)

where 𝑆 is the average conversion rate; 𝑆 is the average length of visit; 𝑆 is the average number of views per visit; 𝑃 is % of unique customers/visitors/users; 𝑃 is % of new website customers.

Based on event tracking 𝐾 and interaction with the site 𝐾 , the following is analysed:

𝐾 = 𝛼(𝐾 , 𝐾 ) =< 𝑃 , 𝑃 , 𝑃 , 𝐼 >, 𝐼 = 𝑅 + 𝑅 𝑁 ,(11)

where 𝑃 is % interaction with the site (for example, commenting, voting, registration, authorization, subscription, etc.); 𝑃 is % of users who activate various events (for example, clicking on an ad, starting a function, pausing, etc.); 𝑃 is % of users interacting with different types of content presentation (viewing the next communication, panning, zooming, etc.); 𝐼 is the value of the measure of usefulness, respectively, of the page/site/CLS/content; 𝑁 is the number of unique page views; 𝑅 is profit from e-business; 𝑅 is the value of the utility measure of user visits (based on transactions) and the purpose of user visits (based on the utility of goals).

Analysis of success/effectiveness/operational search on the site:

𝐾 =< 𝑃 , 𝑅 , 𝑆 , 𝑃 , 𝑃 , 𝑁 , 𝑅 , 𝑅 , 𝑁 , 𝑁 , 𝐼 >, (12)

where 𝑃 is the value of the usefulness of visiting 𝑃 site/page; 𝑅 is conversion rating in e-business for CLS corresponding to the NLP task; 𝑆 is the value of average utility; 𝑃 is the value of e-business profit for the CLS of the corresponding NLP task; 𝑃 is the value of the achieved conversion of visits to the site/page of the CLS:

𝑃 = , 𝑅 = • 100%, 𝑆 = , 𝑃 = 𝑅 + 𝑅 , 𝑃 = • 100%,

where 𝑁 is the number of visits; 𝑅 is the usefulness of e-business; 𝑅 is the utility of the goal; 𝑁 is the number of transactions; 𝑁 is the number of conversions. To attract new visitors and increase the volume of the permanent target audience, the impact of the IIS on the income of the site, 𝐼 , is calculated:

𝐼 = (𝑅 − 𝑅 ) • 𝑁 , (13)

where 𝑁 is the number of visits from the IIS; 𝑅 and 𝑅 are the utility of visits without and with IIS.
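Formula (13) can be evaluated as in the short sketch below. The function and parameter names are illustrative, and the order of subtraction (utility with IIS minus utility without IIS) is an assumption, since the subscripts of the two utilities are not distinguishable in the formula as printed.

```python
def search_income_impact(utility_with_search, utility_without_search,
                         visits_from_search):
    """Impact of on-site search on income, in the sense of formula (13).

    Assumption: the difference is taken as (with search) - (without search),
    so a positive value means that search adds income.
    """
    return (utility_with_search - utility_without_search) * visits_from_search


# Example: a visit with search is worth 2.5 units, without search 1.0 unit,
# and 200 visits came through the search.
impact = search_income_impact(2.5, 1.0, 200)
```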

The topic of a set of keywords is one of the main indicators of IIS for identifying the specific content of a page. Investment should be optimized for sets of keywords that increase conversion values. The return on investment value ($P_{ROI}$) must be positive ($N_p > N_e$), i.e.:

$$P_{ROI} = \frac{N_p - N_e}{N_e} \cdot 100\% > 0, \qquad P_{ROI} = \frac{(N_p \cdot A)/100 - N_e}{N_e} \cdot 100\%, \quad (14)$$

where $N_e$ is expenses; $N_p$ is profit; $A$ is the amount of profit. Next, the share of funds (>q%) that can be spent on a specific keyword in advertising without the risk of obtaining $P_{ROI} < 0$ is determined. The amount of funds for attracting users is calculated as:

$$C = \frac{(N_p \cdot A)/100}{P_{ROI}/100 + 1}, \qquad C' = C \cdot \frac{R}{100}. \quad (15)$$
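Under one reading of formulas (14) and (15), the spending limit in (15) is obtained by solving the second expression in (14) for the expenses at a given target return on investment. The sketch below follows that reading. The function and parameter names are illustrative, and treating the amount of profit as a percentage applied to revenue is an assumption.

```python
def roi_percent(profit, expenses):
    """Return on investment in %, first expression of formula (14)."""
    if expenses <= 0:
        raise ValueError("expenses must be positive")
    return (profit - expenses) / expenses * 100.0


def max_spend(revenue, margin_percent, target_roi_percent):
    """Largest spend that still reaches the target ROI, formula (15).

    Assumption: margin_percent is the share of revenue that is profit,
    so the profit is revenue * margin_percent / 100.
    """
    gross_profit = revenue * margin_percent / 100.0
    return gross_profit / (target_roi_percent / 100.0 + 1.0)


# Example: revenue of 1000 at a 30% margin and a target ROI of 50%.
budget = max_spend(1000, 30, 50)            # spending limit
achieved = roi_percent(1000 * 30 / 100, budget)  # ROI when exactly that is spent
```

Spending exactly `budget` reproduces the target return, which is a quick consistency check between the two formulas: any larger spend on the keyword set pushes the return below the target.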

The method of determining the effectiveness/quality of the CLS site for solving the NLP problem:

Stage 1. Formulation and identification of usefulness according to the goals of the target audience, based on the input data from the tuple 𝑋.

Stage 2. Activation of reports on the operation of the CLS from the tuple 𝑌 of the output data:

Step 1. Define an unlimited number of goals (4 goals for each target audience profile).

Step 2. Identify the optimal volume of visits/time of the end user/customer for a successful conversion.

Step 3. Analyse the contribution of each goal to the total profit.

Step 4. Combine goals by categories/directions/types.

Step 5. Form separate sets of transactions corresponding to the goals.

Stage 3. Support of various marketing campaigns/customers through 𝑀 .

Stage 4. Support for processing the service content of the site with the 𝑀 module.

Stage 5. Updating the profiles of the target audience according to feedback support through the 𝑀 module, and analysing user actions through the 𝑀 module.

Stage 6. Integrating content from different sources through 𝑀 according to the achieved goals and processing it through the 𝑀 module.

Stage 7. Periodic checks of whether the goals are being achieved and whether the profit is growing according to the goals. If the profit declines, go to Stage 1; otherwise, go to Stage 2.

A classified list of the input stream of content 𝑋 with a set of relevant properties demarcates project participants through their typification and restriction of access rights depending on the content: regular users, potential visitors, linguists, statistical analysts, administrators, content/rules moderators, authors of unique content, information resource as content source etc. The typed structure of the content input stream template with a set of relevant properties helps to define the main functional requirements for the site/CLS and its typical structure and delineate the nonfunctional capabilities, classify the sources, calculate the frequencies and the corresponding restrictions/conditions of integration from the usual source:

𝑋 =< 𝑋 , 𝑋 , 𝑋 , 𝑋 , 𝑋 , 𝑋 , 𝑋 , 𝑋 , 𝑋 , 𝑋 , 𝑋 , 𝑋 , 𝑋 , 𝑋 >, (16)

where 𝑋 is URL addresses of sources for databases (DB) of CLS filters; 𝑋 is content as a result of integration from different 𝑋 sources according to a predetermined list of URLs without a predetermined structure according to relevant thematic requests; 𝑋 is thematic requests of visitors/users of the CLS site in the form of a set of keywords or persistent phrases; 𝑋 is actual data of permanent users/profiles and a set of rules of permitted actions within the corresponding type of user of the CLS; 𝑋 is statistical data of actions/ events/ phenomena of the subjects/objects of the CLS for the solution of the corresponding NLP task and the rules for collecting/saving/analysing statistics in specific time intervals of the CLS operation; 𝑋 is statistical data on the functioning of the CLS; 𝑋 is contents of the DB/DS of content/rules/filters/annotations, etc. of the CLS; 𝑋 is different types of linguistic dictionaries depending on the purpose of the CLS for solving a specific NLP problem; 𝑋 is a set of personalized/anonymous reviews and comments of users to the relevant content of CLS; 𝑋 is a tuple of the results of personalized/anonymous votes of regular/potential users regarding the content of CLS; 𝑋 is statistical personalized individual actions of users of the CLS; 𝑋 is set of external/internal advertising of thematic content; 𝑋 is thematic stickers of information content (exchange rates, announcements, digests, weather, anecdotes, horoscope, etc.); 𝑋 is a tuple of options for setting up and changing the CLS/site configurations.

Filling the tuple of the output data stream 𝑌 according to the purpose of the CLS for solving a specific NLP problem directly depends on the content of the input classified stream of content 𝑋 with a predetermined set of properties depending on the interaction with the site of the corresponding types of project participants:

𝑌 =< 𝑌 , 𝑌 , 𝑌 , 𝑌 , 𝑌 , 𝑌 , 𝑌 , 𝑌 , 𝑌 , 𝑌 >, (17)

where 𝑌 is text content as an information product or the result of providing an appropriate information service for solving a specific NLP task on the CLS website; 𝑌 is a set of meaningfully generated/cached pages as a result of thematic requests/IIS of users/visitors of the CLS site; 𝑌 is annotations/digests/abstracts on textual thematic content; 𝑌 is a tuple of statistics of user/visitor interaction with the site; 𝑌 is a tuple of the content of the profiles of regular users of the CLS according to the personalized statistics 𝑌 for the corresponding generation of an individual portrait of the user/audience at certain time intervals; 𝑌 is a tuple of meaningful recommended site content, personalized for a specific regular user according to the profile/actions/interaction with the CLS in certain time intervals; 𝑌 is a set of content topics/headings with the possibility of renewal according to the results of the latest IIS/requests from regular site users; 𝑌 is a scheme of interrelationships of textual thematic content according to the appropriate classification (current, relevant, author's, outdated, popular, similar, last-viewed, often-viewed, consecutively by a certain most viewed, longer viewed, most viewed from search engines or internal IIS, viewed by a typical group of users, etc.); 𝑌 is the set of content rating results on a predetermined scale within the corresponding ranking classification; 𝑌 is a set of marked evaluation and ranking of user comments as the degree of permission to publish on the site/page, if necessary, with a prohibition mark for a specific contributor to write further comments and ranking by the degree of trust of all contributors. The list of the output flow of content, its main features, the corresponding classification, and IT generation/support/analysis contributes to the definition of precise general functional requirements for implementing the CLS to solve any NLP problem.

The model of the process of linguistic analysis of the Ukrainian-language text 𝑀 is presented as:

𝑀 =< 𝑋, 𝑊, 𝐶, 𝐾, 𝑌, 𝐷, 𝑆 , 𝑆 , 𝑆  , 𝑆  , 𝑆  , 𝑆  , 𝑆  , 𝑆  , 𝑆  , 𝑆  , ,  ,  ,  ,  ,  ,  ,  >,

where 𝑋 is the input data in the CLS from various sources of information 𝑊; 𝑌 is the original relevant content from the CLS as a result of the IIS according to the requests of users/visitors; 𝑆 is the process of linguistic analysis of content as a component of the IATCF subsystem 𝑆 ; 𝑆  is the process of generation/modification of the rules of operation of all modules by the moderator of the CLS; 𝑆  is the process of filling an unstructured database with integrated content 𝑋; S𝑆  is the filling module of the structured database based on the processed integrated content 𝐶; 𝑆  and 𝑆  are processes of generating results according to the requests of visitors and users; 𝑆  is a cache processing process for generating reports on popular requests from CLS users; 𝑆  is cache filling/modification process; 𝑆  is the process of generating statistical results of the functioning of the CLS/modules and the activities of users 𝐷;  is the operator of generation/modification of the rules of operation of all modules from the moderator of the CLS;  is the operator of filling an unstructured database with integrated content 𝑋;  is the operator of filling the structured database based on the processed, integrated content of 𝐶;  and  are operators for generating results according to the requests of visitors and users;  is a cache processing operator for generating reports 𝑌 on popular requests from users;  is cache filling/modification operator with 𝐾 data;  is an operator for generating statistical results of the functioning of the CLS/modules and user activities:

𝑆 =< 𝑋, 𝑌, 𝐶, 𝐷, 𝑅, , , , , , , , ,  > , 𝑌 =  ∘  ∘  ∘  ∘  ∘  ∘  ∘  ∘ , (18)

where 𝑋 is the input text data array; 𝑌 is a tuple of the original processed text according to the purpose of the CLS; 𝐶 is a set of intermediate content, which is processed at the appropriate level in the CLS; 𝐷 is auxiliary dictionaries; 𝑅 is a set of processing rules;  is grapheme analysis operator (GA);  is morphological analysis operator (MA);  is lexical analysis operator (LA);  is operator of syntactic analysis (SA);  is semantic analysis operator (SEM);  is ontological analysis operator;  is reference analysis operator;  is structural analysis operator;  is operator pragmatic analysis (PA).

The primary process of linguistic analysis of textual content is presented:

𝑌 = (𝐶  , 𝐷  , 𝑅  , (𝐶  , 𝐷  , 𝑅  , (𝐶  , 𝐷  , 𝑅  , (𝐶  , 𝐷  , 𝑅  , , (𝐶  , 𝐷  , 𝑅  , (𝐶  , 𝐷  , 𝑅  , (𝐶  , 𝐷  , 𝑅  , (𝐶  , 𝐷  , 𝑅  , (𝐶  , 𝐷  , 𝑅  , 𝑋))))))))),(19)

where the content sets are $C = \{C_1, C_2, C_3, C_4, C_5, C_6, C_7, C_8, C_9\}$, the linguistic dictionaries are $D = \{D_1, D_2, D_3, D_4, D_5, D_6, D_7, D_8, D_9\}$ and the sets of production/association rules are $R = \{R_1, R_2, R_3, R_4, R_5, R_6, R_7, R_8, R_9\}$, with one element per stage of the analysis.

The primary linguistic process of processing textual Ukrainian-language information to solve a specific NLP task consists of nine stages:

Stage 1. Grapheme analysis  of textual Ukrainian-language information 𝑋:

𝐶  = (𝑋, 𝐷  , 𝑅  ), 𝐶  =  ∘  ∘  ∘  ∘  ∘  ∘  , (20)

where 𝑋 is the input text data array;  is GA operator; 𝐶  is grapheme structure of the input text; 𝐷  is grapheme dictionaries and libraries; 𝑅  is GA rules;  is an optical character recognition operator;  is grapheme parsing operator of the input text 𝑋 into sections, paragraphs and sentences;  is grapheme analysis operator of linguistic chains into separate words;  is the operator for forming a set of unrecognized chains;  is the operator of identification and marking of unrecognized chains as numbers, dates, constant returns, abbreviations, proper and geographical names, etc.;  is the operator for marking non-text strings as special symbols, formulas, figures, tables, etc.;  is an operator for generating a marked linear sequence of words 𝐶  with official signs and connections.

Stage 2. Morphological analysis  of text content 𝐶  consists in the identification, analysis and determination of the form and structure of words, in particular:

𝐶  = (𝐶  , 𝐷  , 𝑅  ), 𝐶  =  ∘  ∘  or 𝐶  =  ∘  ∘  , (21)

where  is the morphological segmentation operator of the grapheme-recognized chain of symbols (words/tokens);  is a token lemmatization operator;  is the operator for marking parts of speech for segmented words;  is the word stemming operator.

Production rules for identification/generation of Ukrainian participles [51]:

X. Graphical and orthographic rules: {𝑗 + 𝑎 → я, 𝑗𝑎 → я; 𝑗 + у → ю, 𝑗у → ю; 𝑗 + е → є, 𝑗е → є; …; Х + 𝑎 → Х + я; Х + у → Х + ю; Х + и → Х + і; Х + і → Х +; Х + е → Х + є}.

XI. Erasure of the boundary indicator between morphemes: {𝐴 + 𝐵 → 𝐴𝐵}, where 𝐴 and 𝐵 are any morphemes such that none of the rules of groups IX-X applies to 𝐴 + 𝐵.

Stage 3. Lexical analysis  of the text content 𝐶  in the intermediate stage of the analysis of the lexeme sequence to generate a parsing tree at the SA level:

𝐶  = (𝐶  , 𝐷  , 𝑅  ), 𝐶′  =  ∘  , 𝐶′  =  ∘  ∘  or 𝐶′  =  ∘  , (22)

where  is a speech segmentation operator for identification/clarification of words/phrases/tokens after MA;  is speech recognition or speech-to-text operator;  is optical character recognition operator as the second part after GA and MA for clarifying incorrect moments of recognition, taking into account the recognized adjacent tokens;  is the word tokenization/segmentation operator as data preparation for building a parsing tree at SA;  is textto-speech.

Stage 4. The syntactic analysis  of text content 𝐶  consists in building a tree for parsing word dependencies (Fig. 2) in a sequence of lexemes based on their categories:

𝐶  = (𝐶  , 𝐷  , 𝑅  ), 𝐶  =  ∘  ∘  , (23)

where  is grammar induction implementation operator;  is the operator of identification/elimination of boundary ambiguity or sentence violation;  is operator of syntactic parsing of phrases/sentences for building a SA tree. Rules

Stage 5. Semantic analysis (SEM) of the text content:

𝐶  = (𝐶  , 𝐷  , 𝑅  ), 𝐶  =  ∘  , (24)

where  is the identification operator of lexical semantics with the generation of a collection of values of each lexeme of the text;  is the relational semantics identification operator of the interdependencies of the content of the lexemes of the text. Stage 6. Reference analysis  identification of interphase units 𝐶  .

𝐶  = (𝐶  , 𝐷  , 𝑅  ). (25)

Reference analysis is often part of SEM. For Ukrainian texts, when analysing large corpora, it is best carried out as a separate stage, for example, when analysing the correspondence of a social group/community in social networks or other dialogues to identify logical, meaningful connections between the posts of different participants, given the subjectivity of each participant's speech.

Stage 7. Structural analysis of the Ukrainian-language text 𝐶 is based on the degree of coincidence of lexical and terminological units across text fragments. For short texts/messages it is often part of SEM or not used at all. For large corpora of texts it serves as an additional stage for eliminating the marked inaccuracy of SEM.

𝐶  = (𝐶  , 𝐷  , 𝑅  ) or 𝐶  = (𝐶  , 𝐷  , 𝑅  ).(26)

Stage 8. Ontological analysis of  text content 𝐶  on the basis or part of the results of SEM and reference/structural analyses if necessary:

𝐶  = (𝐶  , 𝐷  , 𝑅  ), 𝐶  = (𝐶  , 𝐷  , 𝑅  ) or 𝐶  = (𝐶  , 𝐷  , 𝑅  ).(27)

Stage 9. Pragmatic analysis of  text content 𝐶  is used to determine the text's structure by considering the context of sentences when forming paragraphs, sections, and dialogues. PA is an essential addition to SEM, reference, and structural analyses if it does not contribute to eliminating marked inaccuracy.

𝑌 = (𝐶  , 𝐷  , 𝑅  , 𝐶  , [𝐶  , 𝐶  , 𝐶  ], ), 𝑌 =  ∘  , (28)

where  is a semantics identification operator outside individual sentences/phrases;  is the operator of text processing through higher-level NLP applications, for example, to simulate intelligent behaviour and an apparent understanding of natural language.
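
The nested application in (19), where each stage receives the content produced by the previous stage together with its own dictionary and rules, can be sketched as a simple sequential pipeline. This is a minimal illustration under stated assumptions: the function `run_pipeline`, the three toy stages and the sample stop-word dictionary are hypothetical and stand in for the nine real analysis operators.

```python
from typing import Any, Callable, Dict, List, Tuple

# A stage maps (content, dictionary, rules) to new content,
# mirroring the pattern "new content = operator(previous content, D, R)".
Stage = Callable[[Any, dict, list], Any]


def run_pipeline(x: Any,
                 stages: List[Tuple[str, Stage]],
                 dictionaries: Dict[str, dict],
                 rules: Dict[str, list]) -> Any:
    """Apply the stages in order, feeding each result into the next stage."""
    content = x
    for name, stage in stages:
        content = stage(content, dictionaries.get(name, {}), rules.get(name, []))
    return content


# Hypothetical toy stages (assumptions, not the operators of the source).
def grapheme(text, d, r):
    return text.split()                      # split the text into chains


def morphological(tokens, d, r):
    return [t.lower() for t in tokens]       # crude normalization


def lexical(tokens, d, r):
    stop = d.get("stop", set())
    return [t for t in tokens if t not in stop]


stages = [("GA", grapheme), ("MA", morphological), ("LA", lexical)]
result = run_pipeline("Система і контент", stages, {"LA": {"stop": {"і"}}}, {})
print(result)
```

Keeping the stages in an ordered list makes it easy to skip optional stages (for example, reference or structural analysis for short texts) without changing the remaining ones.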

A general scheme/model of the CLS operation pipeline has been developed based on improved methods of processing information resources (content integration, maintenance and management) and on improved methods of intellectual and linguistic analysis of the text flow using machine learning technology (Fig. 3) [52][53][54][55][56][57][58]. The target audience interacts with the CLS on the basis of user feedback and the output data of the ML model, which contributes to the adaptation of the selected learning model. Five stages of relevant processes determine the basic architectural principles of building a typical CLS. The stages of monitoring, developing and managing content are interaction, formatting/filtering, NLP, ML and data accumulation in the DS. The content support processes comprise feature analysis, deployment, prediction, interpretation, and content/result presentation. At the interaction stage, a set of rules for integrating content from multiple reliable sources at certain intervals is developed. In parallel, a set of rules for checking the data entered by the CLS user is created as a preliminary step for the formatting/filtering stage, which follows a collection of rules and content from the DS set in advance by the moderator. The NLP stage is an intermediate stage for ML and data accumulation. The ML stage is implemented through SQL queries and modules. The support process is easier to implement than the management stage, especially when analysing the NLP results, during which additional lexical resources and artefacts (dictionaries, translators, regular expressions, etc.) are created; the effectiveness of the CLS functioning directly depends on them (Fig. 4) [52][53][54][55][56][57][58]. The transition from the raw text to the expanded ML model consists of additional content transformations.
First, the input text content is transformed into the input corpus, a collection of texts accumulated and stored in the DS. The incoming content is then grouped, filtered, formatted, linguistically processed, marked, normalized and converted into vectors for further processing. In the final transformation (Fig. 5) [52][53][54][55][56][57][58][59][60], the models are trained on the vector corpus to create a generalized representation of the original content for further use in solving a specific NLP problem.

[Figure 5 blocks: The process of generating an optimal machine learning model; Content archive; Content repository; Set of content; Analysis of signs and parameters; Choice of ML model; Learning the ML model; Testing of the ML model; Optimization of the ML model; Model settings; Model repository]

Figure 5: Machine learning pipeline process

NLP methods have been improved based on 82 developed regular expressions (RGs) for pattern matching in GA and more than 2000 RGs for the morphological analysis of Ukrainian-language texts. The primary admissible RG operations are the union and disjunction of symbols/chains/expressions, number and precedence operators, and anchors for the presence/absence of symbols in regular expressions. The main stages of tokenization and normalization of Ukrainian text by cascades of simple RG substitutions and finite automata are determined. Algorithms for word segmentation and normalization, sentence segmentation, and a modified Porter stemming are implemented and described as an effective way of identifying lemma affixes so that the analysed word can be marked. The modified Porter stemming algorithm checks the obtained intermediate results against the tree of inflexions (so as not to go through all possible inflexions) and against the content of thematic dictionaries of bases, using a set of RG-rules for the identification of features (classification by parts of speech).

Step 1. Identify the next lexeme as the word 𝑤 (𝑤 = 𝑤 ).

Step 2. Check against the stop-word dictionary 𝐷 whether 𝑤 is a service word. If yes, then 𝑖 = 𝑖 + 1 and go to step 1. Otherwise, go to step 3.

Step 3. Go to the end of the word 𝑤 . Recognize the inflexion 𝑓 in 𝑤 from all possible ones; the longest one is chosen (for example, in 𝑤 = текстова we choose the ending 𝑓 = ова, not 𝑓 = а).

Step 7. Check the obtained base 𝑤 of the initial word 𝑤 against the content of the dictionary of bases 𝐷 of words of the Ukrainian language. If there is no match, store < 𝑤 , 𝑤 > in the additional temporary intermediate dictionary 𝐷 for the moderator and proceed to step 1. Otherwise, proceed to step 4.

Step 8. Analyse the inflexion and the presence/absence of alternation of letters in the base/inflexion of the word < 𝑤 , 𝑤 > and in the analogous base in 𝐷 according to the corresponding RG-rule of MA to identify additional features of the analysed word 𝑤 .

Step 9. Add the identified linguistic features of the recognized part of speech to the tag of the word 𝑤 of the type 𝑚 , 𝑚 or 𝑚 , respectively. Save the results in the corresponding dictionary 𝐷 of the analysed text.
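
The tokenization, normalization and stemming steps described above can be sketched as follows. This is a simplified illustration, not the implemented algorithm: the substitution cascade, the regular expressions, the sample stop words, inflexions and bases are all assumptions, and a length-sorted list of inflexions replaces the tree of inflexions used in the source.

```python
import re

# Cascade of simple substitutions applied before tokenization (assumed examples).
NORMALIZATION_CASCADE = [
    (re.compile(r"[’ʼ`]"), "'"),   # unify apostrophe variants
    (re.compile(r"\s+"), " "),     # collapse runs of whitespace
]
# A word: letters, optionally joined by an apostrophe or a hyphen.
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")
# Sentence boundary: end punctuation, whitespace, then an uppercase letter.
SENTENCE_END_RE = re.compile(r"(?<=[.!?…])\s+(?=[A-ZА-ЯҐЄІЇ])")

STOP_WORDS = {"і", "та", "в", "на"}                       # sample stop-word dictionary
INFLEXIONS = sorted(["ова", "ого", "ий", "а", "и", "у"],  # sample inflexions,
                    key=len, reverse=True)                # longest tried first
BASES = {"текст": "noun", "систем": "noun"}               # sample dictionary of bases


def normalize(text):
    for pattern, replacement in NORMALIZATION_CASCADE:
        text = pattern.sub(replacement, text)
    return text.strip()


def sentences(text):
    return SENTENCE_END_RE.split(normalize(text))


def tokens(sentence):
    return [w.lower() for w in WORD_RE.findall(sentence)]


def strip_inflexion(word):
    """Return (base, inflexion), choosing the longest matching inflexion."""
    for f in INFLEXIONS:
        if word.endswith(f) and len(word) > len(f):
            return word[:-len(f)], f
    return word, ""


def stem_words(words):
    """Tag words whose base is known; collect the rest for the moderator."""
    tagged, unknown = {}, []
    for w in words:
        if w in STOP_WORDS:          # service word: skip it
            continue
        base, f = strip_inflexion(w)
        if base in BASES:
            tagged[w] = (base, f, BASES[base])
        else:
            unknown.append((w, base))
    return tagged, unknown


parts = sentences("Текстова  система і контенту. Комп’ютерна лінгвістика!")
tagged, unknown = stem_words(tokens(parts[0]))
print(tagged, unknown)
```

Words whose base is missing from the dictionary are returned separately, which corresponds to storing them in the temporary dictionary for later review by the moderator.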

Unlike the classic Porter algorithm, the modified one is adapted specifically for the Ukrainian language and gives an accurate result in 85-93% of cases, depending on the quality, style and genre of the text and, accordingly, on the content of the CLS dictionaries. In total, about 1,300 RG-rules for processing suffixes and endings, taking into account the alternation of letters, have been implemented for the MA of Ukrainian-language nouns, 99 RG-rules for adjectives, and more than 800 RG-rules for verbs. The algorithm for the minimum edit distance between strings of Ukrainian text is described, where the distance is the minimum number of operations required to transform one string into the other. Also, an algorithm for calculating the maximum likelihood estimate for the 2-gram and 3-gram models based on the analysis of word bases was developed to identify stable word combinations as keywords. To estimate the conditional probability of the next word base, we use the Markov assumption (the probability of a word depends only on the previous one).

Moreover, if the keywords are assumed to be a set of nouns or an adjective with a noun, then other words, such as verbs, participles, etc., are treated as additional separators that, like punctuation marks, delimit stable phrases as potential keywords. The order of bases is not crucial for the Ukrainian language.
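
The minimum edit distance mentioned above can be computed with the standard dynamic-programming (Levenshtein) scheme. The sketch below assumes unit costs for insertion, deletion and substitution; the source does not state which costs its implementation uses.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))           # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]                            # deleting the first i characters of a
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1      # assumed unit substitution cost
            cur.append(min(prev[j] + 1,      # deletion
                           cur[j - 1] + 1,   # insertion
                           prev[j - 1] + cost))
        prev = cur
    return prev[-1]


print(edit_distance("система", "систему"))
print(edit_distance("текст", "текстова"))
```

Only two rows of the table are kept in memory, so the space cost is linear in the length of the second string.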

Stage 1. Process the input text and break it into separate phrases (sentences) $R_1 R_2 \ldots R_n$, marking the start and end of each with the corresponding <p> </p> tags. Eliminate all non-alphabetic characters. Convert uppercase letters to lowercase. Remove service words if necessary (for certain NLP tasks).

The resulting matrices will, in most cases, be sparse. For a phrase and its variations (plural/singular and cases), for example 𝑃(система електронної контент комерції):

$$P(\text{електрон} \mid \text{систем}) \cdot P(\text{контент} \mid \text{електрон}) \cdot P(\text{комерц} \mid \text{контент}) = 0.124 \cdot 0.81 \cdot 0.179 = 0.01797876.$$

The SEM method has been improved based on the taxonomy of concepts, which specifies the syntax of the Ukrainian language as the root concept of the ontology: 𝐶𝑜𝑛𝑐𝑒𝑝𝑡𝑠  : < 𝑅 > 𝐶′  .
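
The bigram probability of a phrase under the Markov assumption can be sketched as follows. The function names and the toy corpus of word bases are illustrative assumptions; the last lines only re-check the arithmetic of the worked example given in the text.

```python
from collections import Counter
from math import isclose, prod


def bigram_counts(sentences):
    """Count unigrams and bigrams over sentences padded with <p> and </p>."""
    unigrams, bigrams = Counter(), Counter()
    for bases in sentences:
        padded = ["<p>"] + list(bases) + ["</p>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def conditional(prev, cur, unigrams, bigrams):
    """Maximum likelihood estimate P(cur | prev) = C(prev cur) / C(prev)."""
    return bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0


def phrase_probability(bases, unigrams, bigrams):
    """Product of the conditional probabilities of consecutive bases."""
    return prod(conditional(a, b, unigrams, bigrams)
                for a, b in zip(bases, bases[1:]))


# Toy corpus of word bases (an assumption, not the data of the source).
corpus = [["систем", "електрон", "контент", "комерц"], ["систем", "контент"]]
uni, bi = bigram_counts(corpus)
p = phrase_probability(["систем", "електрон", "контент", "комерц"], uni, bi)
print(p)

# Arithmetic of the worked example from the text.
example = 0.124 * 0.81 * 0.179
print(round(example, 8))
```

Because most pairs of bases never occur together, the bigram table is sparse, which is why a `Counter` (a dictionary of observed pairs only) is used instead of a dense matrix.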

In SEM, to identify the set of semes of the corresponding Ukrainian-language text and their relationships, a semantic graph of the relations between linguistic units is first built from the results of SA, taking into account the parts of speech of the words:

𝐶′  = (𝐶  , 𝐷  , 𝑅  , 𝐶𝑜𝑛𝑐𝑒𝑝𝑡𝑠  ), 𝐶𝑜𝑛𝑐𝑒𝑝𝑡𝑠  =< 𝐶 , 𝐶 >,

where 𝐶 is a tuple of concepts of phrase formation; 𝐶 is a tuple of sentence generation concepts in the Ukrainian language. Tuple 𝐶 is given as:

𝐶 =< 𝑆𝑔𝑛 , 𝑆𝑔𝑛 , 𝑆𝑔𝑛 , 𝑆𝑔𝑛 >,

where 𝑆𝑔𝑛 is a tuple of phrase generation properties:

𝑆𝑔𝑛 =< 𝑆𝑔𝑛 , 𝑆𝑔𝑛 >, 𝑆𝑔𝑛 =< 𝑆𝑔𝑛 , 𝑆𝑔𝑛 , 𝑆𝑔𝑛 , 𝑆𝑔𝑛 , 𝑆𝑔𝑛 , 𝑆𝑔𝑛 >, 𝑆𝑔𝑛 =< 𝑆𝑔𝑛 , 𝑆𝑔𝑛 >, 𝑆𝑔𝑛 =< 𝑆𝑔𝑛 , 𝑆𝑔𝑛 >,

author keywords and list of references) was carried out without the application of ML, and in stage 2 with ML. The method of article analysis without metadata achieves the best results according to the density criterion. The author of an article often defines a larger number of words (𝐴 ) and a smaller number of keywords (𝐴 ) than are present in the text of the scientific and technical publication (Fig. 6). Unlike known parsers, the proposed method provides self-improvement and self-learning of the keyword definition module through a mechanism for identifying significant statistical parameters within the limits defined by the moderator. A system has been developed on the Victana website, which allows users to choose the language of the analysed text from a list (http://victana.lviv.ua/index.php/kliuchovi-slova). For different stages and steps of the experiment of processing the primary text, the average coincidence of the lists of discovered keywords with the author's keywords varies in the range of 52.6-68.5%. The accuracy of matching keywords with the author's keywords ranges from 43.6% to 62.9%. The average match of meaningful keywords compared to all found by the system ranges from 38.9% to 75.8%, depending on the stages of analysis of article texts. The accuracy of matching keywords compared to all found by the system varies between 34.3% and 71.9%, depending on the stages of analysis of article texts. For 𝐴 , the module most often identified the number of keywords {5, 7, 3} (10), although the distribution of found keywords was within [1;18] words (except 17).

For 𝐴 , the module also most often identified the number of keywords {5, 7, 3}; although the distribution of found keywords is within [1;18] (except 17), the number of identified words increased, and the highest reliability index was achieved. For 𝐴 , the module most often identified the number of keywords {7, 6, 5, 10, 8}, although the distribution of found keywords was within [2;14] (the range narrowed significantly). For 𝐴 , the module most often identified the number of keywords {8, 5, 7, 10}, with the distribution of identified keywords within [3;16] (accuracy improved). The accuracy of keyword identification increases with the moderation of the dictionaries and of the ML module. The difference between the number of keywords defined by the author and identified by the module at 𝐴 is 44.39919%. It decreases to 33.70672% with 𝐴 , to 24.33809% with 𝐴 , and to 14.96945% with 𝐴 . Analysis was performed for filtered texts without metadata and for unfiltered texts. The average values obtained for filtered texts, 𝑃𝑒𝑟 = 0.28, and unfiltered texts, 𝑃𝑒𝑟 = 0.19, show that filtering scientific articles improves keyword density by 1.48 times or 47.83% (Fig. 9a). The values obtained for the texts, 𝑃𝑒𝑟 = 0.34 and 𝑃𝑒𝑟 = 0.25, taking into account the refinement of the thematic dictionary through ML and the replenishment of blocked words, show that filtering with simultaneous moderation of the thematic dictionary improves keyword density by 1.35 times or by 35.44% (Fig. 9b). A comparison of the values in the original author's text, 𝑃𝑒𝑟 = 0.19 and 𝑃𝑒𝑟 = 0.25 without/with the refinement of the thematic dictionary, respectively, demonstrates the effectiveness of the moderation of the thematic dictionary in the initial text: the density of keywords increases 1.34 times or by 34.33% (Fig. 10a).
A comparison of the values in the filtered author's text, 𝑃𝑒𝑟 = 0.28 and 𝑃𝑒𝑟 = 0.34 without/with the refinement of the thematic dictionary, respectively, demonstrates the effectiveness of the moderation of the thematic dictionary in the filtered text: the density of keywords increases 1.23 times or by 23.14% (Fig. 10b).


Figure 10: Results of analysis of articles with different dictionaries

So, the experimental study confirmed the method's reliability: for different stages of processing the primary text, the average coincidence of the lists of identified keywords with the author's keywords varies in the range of 52.6-68.5% (by 9%). The accuracy of matching keywords with the author's keywords ranges from 43.6% to 62.9%. The average match of meaningful keywords compared to all found by the system ranges from 38.9% to 75.8%, depending on the stages of analysis of article texts. The accuracy of matching keywords compared to all found by the system varies between 34.3% and 71.9%, depending on the stages of analysis of article texts.

A method of determining stable word combinations when identifying keywords of textual content in reference passages of the author's text has been developed. The method uses Zipf's law in forming stable word combinations as keywords, taking into account the following rules of preliminary linguistic processing of the text: remove all stop words; form bigrams only within the limits set by punctuation marks and only from words that are not verbs or pronouns (the latter are treated as punctuation marks); identify verbs by their inflexions; form bigrams from word bases without taking their inflexions into account; identify adjectives by their inflexions and assume that, in bigrams from Ukrainian-language texts, an adjective can only be in the first position. A module has been developed to identify stable phrases as keywords in textual content. An approach to developing linguistic content analysis software for determining stable word combinations when identifying keywords of Ukrainian-language and English-language textual content is proposed. The peculiarity of the approach is the adaptation of the linguistic and statistical analysis of lexical units to the peculiarities of the constructions of Ukrainian and English words/texts.
The results of the experimental approbation of the proposed method of content analysis of English- and Ukrainian-language texts to determine stable word combinations in identifying keywords of technical texts were studied.
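
The preprocessing rules for forming candidate bigrams can be sketched as follows. This is a minimal illustration under stated assumptions: the sample stop words, pronouns, verb endings and inflexions are hypothetical, the adjective-position rule and the Zipf-based selection are omitted, and the sketch only counts how often each bigram of bases occurs.

```python
import re
from collections import Counter

STOP_WORDS = {"і", "та", "на", "для"}              # sample stop words
PRONOUNS = {"він", "вона", "вони", "ми"}           # sample pronouns
VERB_ENDINGS = ("ує", "ють", "ти")                 # sample verb inflexions
INFLEXIONS = sorted(["ого", "ої", "ії", "ий", "а", "у", "и"],
                    key=len, reverse=True)         # longest tried first
WORD_RE = re.compile(r"[^\W\d_]+(?:['’ʼ-][^\W\d_]+)*")


def base(word):
    """Strip the longest matching inflexion, keeping at least three letters."""
    for f in INFLEXIONS:
        if word.endswith(f) and len(word) > len(f) + 2:
            return word[:-len(f)]
    return word


def is_separator(word):
    """Verbs (recognized by inflexion) and pronouns act like punctuation."""
    return word in PRONOUNS or word.endswith(VERB_ENDINGS)


def stable_bigrams(text):
    counts = Counter()
    for segment in re.split(r"[.,;:!?()]+", text.lower()):
        run = []                                   # bases between two separators
        for w in WORD_RE.findall(segment):
            if w in STOP_WORDS:
                continue                           # stop words are removed
            if is_separator(w):
                counts.update(zip(run, run[1:]))
                run = []
            else:
                run.append(base(w))
        counts.update(zip(run, run[1:]))
    return counts


text = ("Система електронної комерції аналізує контент. "
        "Вони аналізують систему електронної комерції.")
counts = stable_bigrams(text)
print(counts.most_common())
```

Bigrams that recur across segments, such as the pair of bases for "система електронної" here, are the natural candidates for stable word combinations.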

A method of identifying the style of the author of a text based on the analysis of linguistic speech coefficients in a standard (reference text) has been developed. The technique consists of a comparative study of author attribution between the author's statistically processed work (the standard) and an arbitrarily analysed passage. The method evaluates the probability that the text of an article belongs to the author of the standard by analysing the relevant coefficients of lexical speech: the concentration of the text 𝐼 , the coherence of the speech 𝐾 , the uniqueness of the text 𝐼 , the syntactic complexity of the speech 𝐾 and the linguistic diversity of the speech 𝐾 . The degree of speech coherence 𝐾 does not decrease significantly: in 2001 it varied within [0.5; 1.2], and in 2021 within [0.4; 0.9] (Fig. 11). Moreover, the method works only if the author's standard has already been studied; the NLP task is then to form the author's frequency dictionary, including service/stop words.
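
The coefficients listed above can be computed from simple counts over a text. The source does not give their formulas in this section, so the sketch below assumes definitions commonly used in linguometry: diversity as distinct words over all words, syntactic complexity as one minus sentences over words, coherence as prepositions plus conjunctions over three times the number of sentences, and the uniqueness and concentration indices as the shares of words occurring once and ten or more times. The word lists are small hypothetical samples.

```python
import re
from collections import Counter

PREPOSITIONS = {"в", "у", "на", "з", "до", "для", "від"}   # sample list
CONJUNCTIONS = {"і", "й", "та", "але", "або", "що"}        # sample list
WORD_RE = re.compile(r"[^\W\d_]+(?:['’ʼ-][^\W\d_]+)*")


def speech_coefficients(text):
    """Return assumed stylometric coefficients for one text."""
    sentences = [s for s in re.split(r"[.!?…]+", text) if s.strip()]
    words = WORD_RE.findall(text.lower())
    freq = Counter(words)
    n_words, n_sent = len(words), len(sentences)
    if n_words == 0 or n_sent == 0:
        return {}
    n_prep = sum(freq[w] for w in PREPOSITIONS)
    n_conj = sum(freq[w] for w in CONJUNCTIONS)
    once = sum(1 for c in freq.values() if c == 1)
    frequent = sum(1 for c in freq.values() if c >= 10)
    return {
        "diversity": len(freq) / n_words,
        "syntactic_complexity": 1 - n_sent / n_words,
        "coherence": (n_prep + n_conj) / (3 * n_sent),
        "uniqueness": once / n_words,
        "concentration": frequent / n_words,
    }


coeffs = speech_coefficients("Система аналізує текст і контент. Метод працює на сайті.")
print(coeffs)
```

Comparing such a dictionary for the standard with the one for an analysed passage gives the per-coefficient differences on which an attribution decision can be based.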

[Figure legend: General text with a detailed dictionary; General text without a specified dictionary; Filtered text with refined vocabulary; Filtered text without dictionary refinement]

An algorithm for determining stop words of text content based on linguistic analysis of text content has been developed. For the individual style of the author's text, the markers are service/stop words (for example, particles, conjunctions, prepositions, parasite words, slang, etc.) unrelated to the article's topic. The absolute and relative frequencies of stop words were analysed and compared with the reference values for each excerpt. Applying the method of reference words therefore identifies which of the studied passages most likely belongs to the standard. Other results also confirm the effectiveness of the keyword method in author attribution of texts. The proposed assumption about the insignificance of the influence of the share as a parameter of the process on the results led to a decrease in the correlation coefficients but placed the probability of belonging to the standard for passages in the correct order (Table 2). Excerpt 4 most likely belongs to the author of the standard, although there is no significant difference between the results for Excerpts 4 and 2: if they were written in the same period as the standard, they do not belong to its author; if in different periods, the probability of belonging to this author increases.

An algorithm for the linguistic analysis of Ukrainian-language texts and a syntactic analyser of text content have been developed. The feature of the algorithm is the adaptation of morphological and syntactic analysis of lexical units to the peculiarities of the constructions of Ukrainian words/texts. Algorithms for identifying significant stop words in Ukrainian-language text based on regular expressions were tested.
When parsing words belonging to a part of speech, declension within this part of speech was taken into account. For this purpose, word inflexions were analysed for classification, selection of the base and formation of the corresponding alphabetic-frequency dictionaries. The contents of the dictionaries were subsequently taken into account in the next steps of determining the text's authorship by calculating the parameters and coefficients of the author's speech. Software has been implemented for solving several NLP problems, namely the study of:

- keywords (https://victana.lviv.ua/kliuchovi-slova);
- stable phrases (https://victana.lviv.ua/nlp/stiiki-slovospoluchennia);
- classification of textual content (https://victana.lviv.ua/kliuchovi-slova);
- quantitative evaluations of speech (https://victana.lviv.ua/nlp/linhvometriia);
- the author's style based on calculations of stylometry coefficients and their comparison with the corresponding coefficients in the standard text (https://victana.lviv.ua/nlp/stylemetriia);
- differences in text signs (https://victana.lviv.ua/nlp/hlotokhronolohiia);
- features of the style of texts based on N-grams (https://victana.lviv.ua/nlp/n-grams).

The results of the experimental approbation of the proposed content monitoring method for determining the author of Ukrainian-language scientific texts of a technical profile were studied. More than 300 single-author technical works by 100 different authors from 2001-2021 were compared to determine whether and how the coefficients of text diversity of these authors change in different periods. A method of identifying the potential (probable) author of a Ukrainian-language text based on the analysis of the author's linguistic speech coefficients in a reference passage of the author's text has been developed. The method of determining the author was decomposed on the basis of the analysis of such speech coefficients as speech coherence, degree of syntactic complexity, linguistic diversity, and indices of concentration and exclusivity of the text. In parallel, the following parameters of the author's style were analysed: the number of words in a specific text, the total number of words in this text, the number of sentences, the number of prepositions, the number of conjunctions, the number of words with a frequency of 1 and the number of words with a frequency of 10 and more, as well as keywords and 3-grams. For example, the 3-grams of 3 articles were analysed [61][62][63] (Ukrainian versions). For the most frequently used letters, the frequency of appearance of 3-grams with such initial letters has an almost identical distribution (peak values in Fig. 12a), but this does not hold for other letters. Therefore, to determine the degree to which a text belongs to the corresponding author, it is expedient to study only 3-grams whose initial letters occur less often in the texts of a specific language (for example, Fig. 12b). According to these graphs, Articles (1,2) are more likely to be written by the same author, although Articles (1,3) could also appear to share an author (which is not the case). Articles (2,3) are written by different authors.
Applying linguistic and statistical analysis of 3-grams to a set of articles makes it possible to form a subset of publications similar in terms of linguistic characteristics. Imposing additional conditions in the form of further linguistic and statistical analyses (a set of keywords, stable word combinations (Table 3), stylometric, linguometric, etc.) significantly reduces the subset, clarifying the list of the more likely works of the author. Thus, the analysis of the content and frequency of appearance of service words alone separates Articles (1,3) into different subsets, leaving Articles (1,2) in one. 78.4814% of 3-grams were analysed for Article 1, 72.6332% for Article 2, and 84.1271% for Article 3. The difference in the use of the corresponding 3-grams between Articles (1,2) is R12 = 56.5254%, between Articles (2,3) R23 = 69.4271%, and between Articles (1,3) R13 = 62.9839%. Accordingly, Articles (1,2) are more similar by 6-12% (R23 > R12 by 12.9017%, R23 > R13 by 6.4432%, R13 > R12 by 6.4585%, i.e. R23 > R13 > R12) than Articles (1,3) and (2,3). The smaller the Rij, the more likely it is that the articles were written by the same author. In this case, Articles (1,2) are more likely to be written by one author/team than Articles (2,3) and (1,3), respectively. When identifying the author of a text, it is assumed that the text reflects the author's style of writing, which makes it possible to distinguish the author from others. To compare texts with each other, it is necessary to compare numerical characteristics of the text that are close for texts of the same author and differ significantly for the works of different authors. One such characteristic is the density of the distribution of letter combinations of three consecutive symbols (3-grams).
During the experimental testing of the four developed algorithms for calculating the degree of verification of the author of a Ukrainian-language text from a set of possible ones, values were obtained that confirm that the style of the authors numbered x and y is quite close (more than 90%) to the style of collective works 1-4, respectively. Also, the number of authors with a similar speech style was significantly reduced (from 42.02% to 34.04% of the 100 participants in the project, from more than 300 articles). Figure 13 presents graphs of the results obtained when applying the algorithms of the method developed to determine the author's style.
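
A comparison of texts by the distribution of letter 3-grams can be sketched as follows. The source does not define how the difference Rij is calculated, so this sketch assumes one possible measure: half the sum of absolute differences between the relative 3-gram frequencies of two texts, expressed as a percentage (0 for identical profiles, 100 for profiles with no 3-grams in common).

```python
import re
from collections import Counter


def trigram_profile(text):
    """Relative frequencies of letter 3-grams taken inside words."""
    counts = Counter()
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        counts.update(word[i:i + 3] for i in range(len(word) - 2))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def difference_percent(p, q):
    """Assumed difference measure between two 3-gram profiles, in percent."""
    grams = set(p) | set(q)
    return 50.0 * sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in grams)


a = trigram_profile("текст текст")
b = trigram_profile("система")
c = trigram_profile("тексти")
print(difference_percent(a, a), difference_percent(a, b), difference_percent(a, c))
```

As in the text, a smaller value indicates more similar 3-gram usage; restricting the profile to 3-grams with rarer initial letters would only require filtering `counts` before normalizing.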

Further, an analysis of stop words and keywords of the authors' works was used to determine the author's style for the 34.04% of authors that remained in this subset. Each individual has their own vocabulary for conveying thought, including so-called "parasitic" words (that is, therefore, although, etc.) and service words (the several Ukrainian forms of "and", as well as "but", "although", etc.). Figure 14 presents an example of the analysis of the author's style in the second stage, based on the frequency of appearance of service words and keywords under various filters. Therefore, a method of determining the style of the author of thematic Ukrainian-language textual content was developed based on the analysis of keywords, stable word combinations, N-grams, linguometry and stylometry, which made it possible to determine the stylistic contribution of each of the authors and increase the accuracy of attribution of a scientific and technical publication by 6%. A method for calculating the degree of verification of the author of a Ukrainian-language text from a set of possible ones based on a comparative analysis of the styles of potential authors has also been developed, which made it possible to increase the accuracy of classification by style similarity by 7%.
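
The comparison of stop-word frequencies between an excerpt and the standard can be sketched as follows. The stop-word list and the texts are hypothetical, and the Pearson correlation is used here as one plausible similarity measure; the source mentions correlation coefficients without specifying how they are computed.

```python
import re
from collections import Counter
from math import sqrt

WORD_RE = re.compile(r"[^\W\d_]+(?:['’ʼ-][^\W\d_]+)*")


def stopword_profile(text, stop_words):
    """Relative frequency of each stop word, in the order of the given list."""
    words = WORD_RE.findall(text.lower())
    counts = Counter(w for w in words if w in set(stop_words))
    n = len(words) or 1
    return [counts[s] / n for s in stop_words]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if undefined)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


stop_words = ["і", "на", "та"]                          # sample list
standard = stopword_profile("і на і текст", stop_words)
excerpt = stopword_profile("і на і метод", stop_words)
print(standard, pearson(standard, excerpt))
```

A higher correlation between the profile of an excerpt and that of the standard suggests a closer match in the use of service words.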

Conclusions

The work solves an important scientific and applied problem of analysis and synthesis of CLS for solving various problems of processing Ukrainian-language textual content based on the development of new and improvement of known models, methods and tools of NLP:

1. An analysis of the current state and prospects of IT development of natural language processing was carried out, which made it possible to define the problem and research tasks, as well as to form general research directions in the absence of non-commercial open-source CLS software for processing Ukrainian-language textual content and of a standardized design approach.
2. The relevance of solving the problem of analysis and synthesis of CLS is substantiated. The general structure of the system for processing Ukrainian-language textual content was developed based on the interaction of the main processes/components of IS and methods of linguistic processing of textual content adapted to the Ukrainian language (grapheme, morphological, lexical, syntactic, semantic, structural, ontological and pragmatic analysis), which made it possible to improve the IT of intellectual analysis of the text flow for solving a specific NLP task. This ensured the adaptation of NLP processes to the analysis of Ukrainian-language textual content and increased the accuracy of the obtained results by 6-48%, depending on the specific NLP task. For example, for the task of identifying the keywords of a Ukrainian-language text, the density of keywords increases by a factor of [1.23; 1.48], or by [23.14; 47.83]%, depending on the quality/accuracy with which the thematic dictionary is filled through machine learning.
3. The methods of processing information resources, such as integration, management and support of Ukrainian-language content, were improved, which made it possible to adapt the process of intellectual analysis of the text flow and develop metrics of the effectiveness of the CLS functioning for solving various NLP tasks. The developed methods and tools make it possible to build a CLS for processing Ukrainian-language text content according to the needs of the permanent/potential target audience based on the analysis of the history of actions of website users.
4. The NLP methods based on regular-expression pattern matching were improved, which made it possible to adapt the methods of tokenization and text normalization through cascades of simple regular-expression substitutions and finite state machines.
5. The MA method of the Ukrainian-language text, based on word segmentation and normalization, sentence segmentation and a modified Porter's stemming algorithm, was improved as an effective tool for identifying lemma affixes in order to mark the analysed word, which made it possible to increase the accuracy of keyword search by 9%.
6. The IT of the intellectual analysis of the text flow was improved based on the processing of information resources, which made it possible to adapt the general structure of modules for integration, management and support of content to solve various NLP tasks and increase the efficiency of the operation of the CLS by 6-9%. This became possible thanks to the combination of methods of linguistic analysis adapted to the Ukrainian language, improved IT for processing information resources, ML, and a set of metrics for evaluating the effectiveness of the CLS's functioning. The main principle of building such a CLS is modularity, which facilitates its construction by requiring only the processes needed for solving a specific NLP problem.
7. A method of determining the author of Ukrainian-language texts was developed based on the analysis of the coefficients of the author's lexical speech in a reference passage of the author's text. It relies on the study of a collection of keywords, stable phrases, indicators of linguometry and stylometry, and the results of N-gram analysis based on comparing differences in the usage of 2-grams and 3-grams (in the range of [6; 7]% for publications similar in style, and more than 12% for clearly dissimilar ones). This made it possible to determine a set of potential authors of publications with more than one author (up to [9; 34]% of the total number of project participants) and to develop a method for identifying the author's style.
8. A method of determining stable word combinations was developed based on the identification of keywords of the Ukrainian-language text and the analysis of the linguistic speech coefficients of the author of the text in reference excerpts of the content, which made it possible to improve the accuracy of the method of determining the style of the author of the text by 9% based on statistical linguistics.
9. The reliability of the scientific and practical results of the dissertation research is confirmed by relevant materials and by comparing the practical results obtained on different samples of reliable input data. The CLS was developed using CMS Joomla on the information resource http://victana.lviv.ua (for designing the e-framework of articles), PHP (for implementing text content processing methods), HTML (for implementing page markup), CSS (for describing page styles), and MySQL (for storing data and dictionaries). The experimental study confirmed the reliability of the method of identifying keywords: for different algorithms for processing the primary text, the average match between the lists of identified keywords and the author's keywords varies in the 52.6-68.5% range. The accuracy of matching keywords with the author's keywords ranges from 43.6 to 62.9%. The average match of meaningful keywords compared to all keywords found by the system ranges from 38.9 to 75.8%, depending on the stages of analysis of article texts. The accuracy of matching keywords compared to all keywords found by the system varies between 34.3 and 71.9%, depending on the stages of analysis of article texts.
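To make the reported percentages easier to interpret, the following Python sketch shows one way such figures could be computed. The exact definitions used in the experiments are not given in this section, so the two measures below (share of author keywords found, and share of identified keywords that match) are assumptions, and the function names are hypothetical.

```python
def keyword_match(identified, author):
    """Compare system-identified keywords with the author's keyword list.

    Returns (coverage, precision) in percent:
    coverage  = share of the author's keywords found by the system,
    precision = share of identified keywords that are author keywords.
    Both definitions are assumptions made for this illustration.
    """
    identified_set = {k.lower().strip() for k in identified}
    author_set = {k.lower().strip() for k in author}
    common = identified_set & author_set
    coverage = 100.0 * len(common) / len(author_set) if author_set else 0.0
    precision = 100.0 * len(common) / len(identified_set) if identified_set else 0.0
    return coverage, precision


def relative_increase(ratio):
    """Convert a growth factor into a percentage increase."""
    return (ratio - 1.0) * 100.0
```

For instance, a growth factor of 1.2314 corresponds to `relative_increase(1.2314)`, that is, an increase of about 23.14%, which is how the interval "[1.23; 1.48] times" relates to "[23.14; 47.83]%".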

Figure 1: Generalized structure of the computer linguistic system

Figure 2: An example of building a tree for parsing the dependencies of sentence words

Figure 3: Scheme of the pipeline of the CLS operation

Figure 4: Scheme of the pipeline for processing Ukrainian-language textual content

Stage 2. Apply Porter's stemming to obtain the sequence of word stems 𝑥 𝑥 … 𝑥 of word stems 𝑅 taking into account word normalization, respectively.
Stage 3. Receive input queries 𝑄 𝑄 … 𝑄 as a sequence of words of the searched data. Find 𝑄 for each word 𝑦 𝑦 … 𝑦 basis by stemming. For example, for the search phrase 𝑄 (in translation): "Method and tools for information systems processing in electronic content commerce systems".
Stage 4. Conduct a statistical analysis of the occurrence of word stems and sequences of query word stems in the analyzed text.
Stage 5. Find the probability of the appearance of 2-grams in the analyzed text. In each row, the value is divided by 𝑦ᵢ, where 𝑖 is the row number after normalization.
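The 2-gram probability step described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the paper's implementation: the input is assumed to be already stemmed, the function name is hypothetical, and each row of bigram counts is divided by the number of times the row's stem opens a bigram, so that every row sums to 1.

```python
from collections import Counter, defaultdict


def bigram_probabilities(stems):
    """Estimate P(next stem | current stem) from a sequence of word stems.

    Each row of bigram counts is divided by the count of the row's stem
    in first position (an assumption about the normalization step).
    """
    first_counts = Counter(stems[:-1])         # stems that start a bigram
    bigram_counts = Counter(zip(stems, stems[1:]))
    probabilities = defaultdict(dict)
    for (first, second), count in bigram_counts.items():
        probabilities[first][second] = count / first_counts[first]
    return dict(probabilities)
```

For the stem sequence `["метод", "систем", "метод", "систем", "контент"]`, the row for `"систем"` contains two continuations, each with probability 0.5.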

Figure 6: Results of the analysis of more than 300 scientific and technical publications

Figure 7: Obtaining meaningful words at the stage: a) 1.1, b) 1.2, c) 2.1 and d) 2.2

Figure 9: Results of checking articles without specifying the thematic dictionary

Figure 11: Analysis of the distribution of speech style parameters 𝐾 , 𝐾 and 𝐾

Figure 12: Graph of the frequency distribution of 1-gram and 3-gram occurrences in Articles 1-3 (blue for Article 1 [61], orange for Article 2 [62] and grey for Article 3 [63])

𝑆 is the average length of stay on the web page.visits; 𝑃 is the bounce rate for one web page; 𝑆 is the average number of web page views per visit;

𝑃=𝑁 𝑁, 𝑆=𝑁 𝑁⋅ 𝑁 , 𝐾=𝑁 𝑁, 𝑃 =𝑁 𝑁. (6)

where 𝑁 is the number of direct web page visits; 𝑁 is the number of one-page visits to a web page; 𝑁 is the number of visits for analysis; 𝑁 is the total number of visits; 𝑁 is the average number of clicks on advertising; 𝑁 is the total number of actions on the page; 𝑁 and 𝑁 are the total number of all and interested users.

The presence of a text content management module reduces costs for moderators/administrators who update the website and create rules for caching/searching popular information blocks:

𝑀=< 𝐾 , 𝑃 , 𝑃 , 𝑃 , 𝑃 , 𝑃 , 𝑃 , 𝑃 , 𝑃 , 𝐾, 𝑆>,

[𝑡 ; 𝑡 ] when 𝑡 <𝑡 and <𝑡 days, respectively; 𝐾 is a brand recognition factor; 𝑃 and 𝑃 are % of new/repeated visitors and interest; 𝑆 is the average number of clicks on advertising for 𝑁

the RG of the word type 𝑅 , 𝑅 , or 𝑅 and in the presence of deletion of the inflection 𝑓 . Stage 4. Save the inflection 𝑓 in the word tag 𝑤 . Stage 5. Label 𝑤 as type 𝑚 , 𝑚 or 𝑚 , respectively. Stage 6. Find the deleted inflection 𝑓 in the tree of inflections 𝑇 (the longest one is chosen). Check the contents of the subtree 𝑇 against the existing word ending 𝑓 (𝑓 = 𝑓 + 𝑓 ). If 𝑤 ends in 𝑓 and has a counterpart in 𝑇 , then store it in 𝑓 = 𝑓 and delete it from 𝑤 .
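The idea of removing the longest matching inflection and keeping it as a tag of the word can be sketched in Python. This is a simplified illustration, not the algorithm of the paper: a flat tuple of endings replaces the tree of inflections, the endings in `INFLECTIONS` are a small hypothetical sample, and the minimum stem length is an assumed safeguard.

```python
# Hypothetical sample of Ukrainian endings; the full tree of inflections
# used in the paper is not reproduced here.
INFLECTIONS = ("ами", "ого", "ий", "ів", "ою", "а", "и", "у", "і")


def strip_inflection(word, inflections=INFLECTIONS, min_stem=2):
    """Return (stem, inflection) for a word.

    The longest listed ending that leaves at least `min_stem` characters
    is removed and returned as a tag; if none matches, the word is
    returned unchanged with an empty tag.
    """
    for ending in sorted(inflections, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= min_stem:
            return word[: -len(ending)], ending
    return word, ""
```

With this sample list, `strip_inflection("системами")` yields the stem `"систем"` with the tag `"ами"`, while a short word such as `"та"` is left intact.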

The value of 𝐴 differs from the value of 𝐴 by 0.69 (by number, but not by content); 𝐴 from 𝐴 by 1.74; 𝐴 from 𝐴 by 2.66; 𝐴 from 𝐴 by 3.58. The value of 𝐴 differs from the value of 𝐴 by 4.36; respectively, 𝐴 from 𝐴 by 3.31; 𝐴 from 𝐴 by 2.39; 𝐴 from 𝐴 by 1.47. Adaptively changing the parameters/rules of the module almost doubles the collection of identified keywords (for example, the value of 𝐴 is greater than 𝐴 by 1.144654; 𝐴 by 1.750524; 𝐴 by 1.557652; 𝐴 by 1.36478). The total increase in value obtained depending on the moderation of dictionaries is, respectively, for 𝐴 is 14.46541; 𝐴 is 36.47799; 𝐴 is 55.7652; 𝐴 is 75.05241. When comparing 𝐴 is greater than 𝐴 ÷ 𝐴 and we have a chain of such values as 1.7985; 1.5084; 1.3217; and 1.176.

Table 1: Statistical data of the study of the content of scientific and technical publications

| Name | Word weight | Stage 1 A | Stage 1 B | Stage 1 C | Stage 1 D | Stage 1 E | Stage 2 A | Stage 2 B | Stage 2 C | Stage 2 D | Stage 2 E |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Step 1 | ≥ 1 | 5.46 | 3.92 | 2.51 | 2.08 | 1.74 | 7.43 | 7.03 | 3.27 | 3 | 4.18 |
| | ≥ 2 | 1.08 | 0.88 | 0.63 | 0.59 | 0.26 | 2.67 | 2.64 | 1.65 | 1. | |

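Table 2: Correlation coefficients for stop words

| New numbering | Article number | Re-U | Participle | Conjunction | Preposition | Re-U |
|---|---|---|---|---|---|---|
| 1 | 4 | 0.7326 | 0.9594 | 0.9544 | 0.5639 | 0.6905 |
| 2 | 2 | 0.7066 | 0.9580 | 0.5714 | 0.4928 | 0.4913 |
| 3 | 1 | 0.6076 | 1 | 0.79 | 0.72 | 0.6900 |
| 4 | 3 | 0.2810 | 0.8800 | 0.1624 | 0.1517 | 0.2254 |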

Table 3: List by frequency rating of stable phrases for Article 1

| FREQ: Phrase | AF | RF | t-test: Phrase | t | LR: Phrase | logL | χ²: Phrase | χ² |
|---|---|---|---|---|---|---|---|---|
| система електронний | 4 | 0.088889 | система електронний | 1.822222 | інформаційний технологія | 5.03e-1 | прийняття рішення | 45.000000 |
| інформаційний система | 4 | 0.088889 | електронний контент-комерція | 1.578091 | інтелектуальний система | 2.13e-1 | система електронний | 45.000000 |
| електронний контент-комерція | 3 | 0.066667 | розділ науковий | 1.319933 | інформаційний система | 8.36e-2 | електронний контент-комерція | 32.946429 |
| розділ науковий | 2 | 0.044444 | інформаційний система | 1.222222 | портал науковий | 5.58e-2 | розділ науковий | 29.302326 |
| портал науковий | 1 | 0.022222 | прийняття рішення | 0.977778 | курс технологія | 3.31e-2 | курс технологія | 21.988636 |
| інтелектуальний система | 1 | 0.022222 | курс технологія | 0.955556 | сховище дані | 3.31e-2 | сховище дані | 21.988636 |
| прийняття рішення | 1 | 0.022222 | сховище дані | 0.955556 | прийняття рішення | 8.27e-3 | портал науковий | 14.318182 |
| курс технологія | 1 | 0.022222 | портал науковий | 0.933333 | розділ науковий | 1.89e-3 | інформаційний система | 5.848550 |
| сховище дані | 1 | 0.022222 | інтелектуальний система | 0.777778 | електронний контент-комерція | 1.55e-4 | інтелектуальний система | 3.579545 |
| інформаційний технологія | 1 | 0.022222 | інформаційний технологія | 0.688889 | система електронний | 1.37e-6 | інформаційний технологія | 1.890409 |
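The t-test and χ² columns of Table 3 can be related to the standard collocation statistics with the short Python sketch below. It assumes the textbook formulas (t-score with the sample variance approximated by the observed rate, and Pearson's χ² for a 2×2 contingency table); the paper does not state its formulas in this section, and the function and argument names are hypothetical.

```python
import math


def t_score(pair_count, first_count, second_count, total):
    """t-score of a 2-gram: (observed - expected) / sqrt(observed / total)."""
    if pair_count == 0:
        return 0.0
    observed = pair_count / total
    expected = (first_count / total) * (second_count / total)
    return (observed - expected) / math.sqrt(observed / total)


def chi_square(pair_count, first_count, second_count, total):
    """Pearson chi-square for the 2x2 table of a 2-gram."""
    o11 = pair_count                                      # first word, second word
    o12 = first_count - pair_count                        # first word, other word
    o21 = second_count - pair_count                       # other word, second word
    o22 = total - first_count - second_count + pair_count # neither word
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denominator == 0:
        return 0.0
    return total * (o11 * o22 - o12 * o21) ** 2 / denominator
```

Under the assumption that the sample contains 45 2-grams (consistent with RF = AF/45) and that both words of a phrase occur only inside that phrase, a phrase seen once gives `t_score(1, 1, 1, 45)` of about 0.977778 and `chi_square(1, 1, 1, 45)` of 45, and a phrase seen four times gives a t-score of about 1.822222, in line with the corresponding entries of Table 3. The marginal word counts are an assumption made for this illustration, not data from the paper.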

Acknowledgements

The research was carried out with the grant support of the National Research Fund of Ukraine, "Information system development for automatic detection of misinformation sources and inauthentic behaviour of chat users", project registration number 187/0012 from 1/08/2024 (2023.04/0012). We would also like to thank the reviewers for their precise and concise recommendations, which improved the presentation of the results obtained.

Experiments, results and discussion

We analyse the results of the experimental approbation of the developed methods and means of linguistic, intellectual analysis of texts in the Ukrainian language based on the development of methods for identifying keywords, determining stable word combinations, thematic classification of the text and detecting duplication of text. Let us consider the peculiarities of the process of syntactic analysis of Ukrainian-language textual content aimed at identifying significant keywords of input texts. Having determined the role and formal features of the syntactic analyser in the process of identifying keywords of the content topic, the procedures of the proposed method were decomposed into two stages (Table 1), where A is the total number of keywords identified with a given word weight, B is the number of generated significant words without pronouns and verbs, C is the coincidence of words with the author's list, D is the accuracy of the coincidence of identified keywords with the author's list, and E is the number of additionally defined keywords that were not determined by the author of the publication. In stage 1, the research for step 1 (analysis of full articles) and step 2 (articles without metadata such as abstract,

References

[1] I. Lauriola, A. Lavelli, F. Aiolli, An introduction to deep learning in natural language processing: Models, techniques, and tools, Neurocomputing 470 (2022).
[2] Y. Kang, Z. Cai, C. W. Tan, Q. Huang, H. Liu, Natural language processing (NLP) in management research: A literature review, Journal of Management Analytics 7(2) (2020).
[3] L. Hickman, S. Thapa, L. Tay, M. Cao, P. Srinivasan, Text preprocessing for text mining in organizational research: Review and recommendations, Organizational Research Methods 25(1) (2022).
[4] D. Hu, An introductory survey on attention mechanisms in NLP problems, in: Proceedings of the Intelligent Systems Conference on Intelligent Systems and Applications, volume 2, 2020.
[5] M. Gardner, W. Merrill, J. Dodge, M. E. Peters, A. Ross, S. Singh, N. A. Smith, Competency problems: On finding and removing artifacts in language data, arXiv preprint arXiv:2104.08646 (2021).
[6] L. Wu, Graph neural networks for natural language processing: A survey, Foundations and Trends in Machine Learning 16(2) (2023).
[7] E. Fedorov, O. Nechyporenko, Linguistic Constructions Translation Method Based on Neural Networks, CEUR Workshop Proceedings 3396 (2023).
[8] M.-A. Lefer, N. Grabar, Super-creative and over bureaucratic: A cross-genre corpus based study on the use and translation of evaluative prefixation in ted talks and EU parliamentary debates, Across Languages and Cultures 16(2) (2015).
[9] M. Konyk, V. Vysotska, S. Goloshchuk, R. Holoshchuk, S. Chyrun, I. Budz, Technology of Ukrainian-English Machine Translation Based on Recursive Neural Network as LSTM, CEUR Workshop Proceedings 3387 (2023).
[10] N. Shakhovska, I. Shvorob, The method for detecting plagiarism in a collection of documents, in: Proceedings of the International Conference on Computer Sciences and Information Technologies, CSIT, 2015.
[11] O. Karnalim, G. Kurniawati, Programming Style on Source Code Plagiarism and Collusion Detection, International Journal of Computing 19(1) (2020).
[12] V. Vysotska, Y. Burov, V. Lytvyn, A. Demchuk, Defining Author's Style for Plagiarism Detection in Academic Environment, in: Proceedings of the International Conference on Data Stream Mining and Processing, DSMP, 2018.
[13] O. Barkovska, V. Kholiev, A. Havrashenko, D. Mohylevskyi, A. Kovalenko, A Conceptual Text Classification Model Based on Two-Factor Selection of Significant Words, CEUR Workshop Proceedings 3396 (2023).
[14] A. Berko, Y. Matseliukh, Y. Ivaniv, L. Chyrun, V. Schuchmann, The text classification based on Big Data analysis for keyword definition using stemming, in: Proceedings of the IEEE 16th International Conference on Computer Science and Information Technologies, Lviv, Ukraine, 22-25 September, 2021.
[15] V. Lytvyn, V. Vysotska, I. Budz, Y. Pelekh, N. Sokulska, R. Kovalchuk, L. Dzyubyk, O. Tereshchuk, M. Komar, Development of the quantitative method for automated text content authorship attribution based on the statistical analysis of N-grams distribution, Eastern-European Journal of Enterprise Technologies 6(2) (2019). doi:10.15587/1729-4061.2019.186834.
[16] I. Khomytska, I. Bazylevych, V. Teslyuk, I. Karamysheva, The chi-square test and data clustering combined for author identification, in: Proceedings of the IEEE XVIIIth Scientific and Technical Conference on Computer Science and Information Technologies, 2023.
[17] I. Khomytska, V. Teslyuk, The Multifactor Method Applied for Authorship Attribution on the Phonological Level, CEUR Workshop Proceedings 2604 (2020).
[18] R. Romanchuk, V. Vysotska, V. Andrunyk, L. Chyrun, S. Chyrun, O. Brodyak, Intellectual Analysis System Project for Ukrainian-language Artistic Works to Determine the Text Authorship Attribution Probability, in: Proceedings of the International Scientific and Technical Conference on Computer Sciences and Information Technologies, 2023.
[19] I. Khomytska, V. Teslyuk, A. Holovatyy, O. Morushko, Development of methods, models, and means for the author attribution of a text, Eastern-European Journal of Enterprise Technologies 3(2) (2018). doi:10.15587/1729-4061.2018.132052.
[20] I. Khomytska, V. Teslyuk, Authorship and Style Attribution by Statistical Methods of Style Differentiation on the Phonological Level, Advances in Intelligent Systems and Computing 871 (2019). doi:10.1007/978-3-030-01069-0_8.
[21] R. Nazarchuk, S. Albota, Tweets about Ukraine during the russian-Ukrainian War: Quantitative Characteristics and Sentiment Analysis, CEUR Workshop Proceedings 3426 (2023).
[22] A. Taran, Terminology of Computational Linguistics in Terms of Indexing and Information Retrieval in the System "iSybislaw", CEUR Workshop Proceedings 2870 (2021).
[23] N. Kunanets, H. Matsiuk, Use of the Smart City Ontology for Relevant Information Retrieval, CEUR Workshop Proceedings 2362 (2019).
[24] K. Nataliia, M. Halyna, Application of Saaty Method While Choosing Thesaurus View Model of the "Smart city" Subject Domain for the Improvement of Information Retrieval Efficiency, in: Proceedings of the IEEE 13th International Scientific and Technical Conference on Computer Sciences and Information Technologies, CSIT, volume 2, 2018. doi:10.1109/STC-CSIT.2018.8526656.
[25] Y. Burov, V. Vysotska, L. Chyrun, Y. Ushenko, D. Uhryn, Z. Hu, Intelligent Network Architecture Development for E-Business Processes Based on Ontological Models, International Journal of Information Engineering and Electronic Business 16(5) (2024). doi:10.5815/ijieeb.2024.05.01.
[26] P. Zweigenbaum, S. J. Darmoni, N. Grabar, The contribution of morphological knowledge to French MeSH mapping for information retrieval, in: Proceedings of the Annual AMIA Symposium, 2001.
[27] É. Bigeard, F. Thiessard, N. Grabar, Detecting drug non-compliance in internet fora using information retrieval and machine learning approaches, Studies in Health Technology and Informatics 264 (2019).
[28] V. Claveau, T. Hamon, S. Le Maguer, N. Grabar, Health consumer-oriented information retrieval, Studies in Health Technology and Informatics 210 (2015).
[29] V. Lytvyn, Y. Burov, V. Vysotska, Y. Pukach, O. Tereshchuk, I. Shakleina, Abstracting Text Content Based on Weighing the TF-IDF Measure by the Subject Area Ontology, in: Proceedings of the IEEE International Conference on Smart Information Systems and Technologies (SIST), Nur-Sultan, Kazakhstan, 2021.
[30] A. Périnet, T. Hamon, Distributional analysis applied to specialized texts. Reduction of data sparseness by context abstractions, Traitement Automatique des Langues 56(2) (2015).
[31] V. Trysnyuk, Y. Nagornyi, K. Smetanin, I. Humeniuk, T. Uvarova, A method for user authenticating to critical infrastructure objects based on voice message identification, Advanced Information Systems 4(3) (2020). doi:10.20998/2522-9052.2020.3.02.
[32] O. Bisikalo, O. Boivan, N. Khairova, O. Kovtun, V. Kovtun, Precision automated phonetic analysis of speech signals for information technology of text-dependent authentication of a person by voice, CEUR Workshop Proceedings 2853 (2021).
[33] A. Sartiukova, O. Markiv, V. Vysotska, I. Shakleina, N. Sokulska, I. Romanets, Remote Voice Control of Computer Based on Convolutional Neural Network, in: Proceedings of the IEEE 12th International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), Dortmund, Germany, 7 September 2023.
[34] S. Kubinska, R. Holoshchuk, S. Holoshchuk, L. Chyrun, Ukrainian Language Chatbot for Sentiment Analysis and User Interests Recognition based on Data Mining, CEUR Workshop Proceedings 3171 (2022).
[35] V. Husak, O. Lozynska, I. Karpov, I. Peleshchak, S. Chyrun, A. Vysotskyi, Information System for Recommendation List Formation of Clothes Style Image Selection According to User's Needs Based on NLP and Chatbots, CEUR Workshop Proceedings 2604 (2020).
[36] A. Medvedyk, M. Lohoida, Z. Rybchak, O. Kulyna, IT Slang: Development of Telegram Chatbot, CEUR Workshop Proceedings 3396 (2023).
[37] O. Romanovskyi, N. Pidbutska, A. Knysh, Elomia Chatbot: The Effectiveness of Artificial Intelligence in the Fight for Mental Health, CEUR Workshop Proceedings 2870 (2021).
[38] A. Yarovyi, D. Kudriavtsev, Method of Multi-Purpose Text Analysis Based on a Combination of Knowledge Bases for Intelligent Chatbot, CEUR Workshop Proceedings 2870 (2021).
[39] N. Shakhovska, O. Basystiuk, K. Shakhovska, Development of the Speech-to-Text Chatbot Interface Based on Google API, CEUR Workshop Proceedings 2386 (2019).
[40] T. Basyuk, A. Vasyliuk, Peculiarities of an Information System Development for Studying Ukrainian Language and Carrying out an Emotional and Content Analysis, CEUR Workshop Proceedings 3396 (2023).
[41] V. Vysotska, S. Holoshchuk, R. Holoshchuk, A Comparative Analysis for English and Ukrainian Texts Processing Based on Semantics and Syntax Approach, CEUR Workshop Proceedings 2870 (2021).
[42] A. Dmytriv, S. Holoshchuk, L. Chyrun, R. Holoshchuk, Comparative Analysis of Using Different Parts of Speech in the Ukrainian Texts Based on Stylistic Approach, CEUR Workshop Proceedings 3171 (2022).
[43] S. Yevseiev, Development of a Method for Determining the Indicators of Manipulation Based on Morphological Synthesis, Eastern-European Journal of Enterprise Technologies 117(9) (2022).
[44] O. Cherednichenko, O. Kanishcheva, O. Yakovleva, D. Arkatov, Collection and Processing of a Medical Corpus in Ukrainian, CEUR Workshop Proceedings 2604 (2020).
[45] A. Dmytriv, V. Vysotska, M. Bublyk, The Speech Parts Identification for Ukrainian Words Based on VESUM and Horokh Using, in: Proceedings of the 16th International Conference on Computer Sciences and Information Technologies (CSIT), volume 2, September 2021.
[46] V. Vysotska, S. Mazepa, L. Chyrun, O. Brodyak, I. Shakleina, V. Schuchmann, NLP Tool for Extracting Relevant Information from Criminal Reports or Fakes/Propaganda Content, in: Proceedings of the IEEE 17th International Conference on Computer Sciences and Information Technologies (CSIT), November 2022.
[47] M. Lupei, O. Mitsa, V. Sharkan, S. Vargha, N. Lupei, Analyzing Ukrainian Media Texts by Means of Support Vector Machines: Aspects of Language and Copyright, in: Proceedings of the International Conference on Computer Science, Engineering and Education Applications, March 2023.
[48] V. Vysotska, Analytical Method for Social Network User Profile Textual Content Monitoring Based on the Key Performance Indicators of the Web Page and Posts Analysis, CEUR Workshop Proceedings 3171 (2022).
[49] K. Shakhovska, I. Dumyn, N. Kryvinska, M. K. Kagita, An approach for a next-word prediction for Ukrainian language, Wireless Communications and Mobile Computing 2021 (2021).
[50] I. Demydov, Architecture of the Computer-linguistic System for Processing of Specialized Webcommunities' Educational Content.
[51] V. Vysotska, Ukrainian participles formation by the generative grammars use, CEUR Workshop Proceedings 2604 (2020).
[52] B. Bengfort, R. Bilbro, T. Ojeda, Applied text analysis with Python: Enabling language-aware data products with machine learning, O'Reilly Media, Inc, 2018.
[53] D. Jurafsky, J. H. Martin, Speech and Language Processing.
[54] D. Jurafsky, J. H. Martin, Regular Expressions, Text Normalization, Edit Distance.
[55] D. Jurafsky, J. H. Martin, Deep Learning Architectures for Sequence Processing.
[56] D. Jurafsky, J. H. Martin, Naive Bayes and Sentiment Classification.
[57] D. Jurafsky, J. H. Martin, Logistic Regression.
[58] D. Jurafsky, J. H. Martin, Neural Networks and Neural Language Models.
[59] I. Khomytska, V. Teslyuk, N. Kryvinska, I. Bazylevych, Software-based approach towards automated authorship acknowledgement-chi-square test on one consonant group, Electronics 9(7) (2020).
[60] A. R. Sydor, V. M. Teslyuk, P. Y. Denysyuk, Recurrent expressions for reliability indicators of compound electropower systems, Technical Electrodynamics 4 (2014).
[61] V. Lytvyn, Development of the linguometric method for automatic identification of the author of text content based on statistical analysis of language diversity coefficients, Eastern-European Journal of Enterprise Technologies 5(2) (2018). doi:10.15587/1729-4061.2018.142451.
[62] V. Lytvyn, Development of the system to integrate and generate content considering the cryptocurrent needs of users, Eastern-European Journal of Enterprise Technologies 1(2) (2019). doi:10.15587/1729-4061.2019.154709.
[63] P. Kravets, The Game Method for Orthonormal Systems Construction, in: Proceedings of the 9th International Conference - The Experience of Designing and Applications of CAD Systems in Microelectronics, 2007. doi:10.1109/cadsm.2007.4297555.