1. Introduction

Linguistic intellectual analysis methods for Ukrainian textual content processing

Victoria Vysotska

Lviv Polytechnic National University, Stepan Bandera 12, 79013 Lviv, Ukraine


The peculiarities of a method of syntactic analysis of Ukrainian-language text content, aimed at the automatic detection of significant keywords in input texts, are considered. The role and formal features of the parser in identifying the keywords of the content topic are defined, and the procedures of the proposed method are decomposed into 4 stages. Compared to well-known parsers, the proposed method provides self-improvement and self-learning of the automated keyword identification system through a mechanism for identifying significant statistical parameters within limits defined by the moderator. The experimental study confirmed the reliability of the method: for various methods of processing the primary text, the average coincidence of the lists of identified keywords with the authors' lists varies in the range of 52.6-68.5%. The accuracy of matching

Keywords: computer linguistics system, NLP, Ukrainian language, information resource, system modelling

1. Introduction

The identification of keywords of the text content  ( , , , , ) → ′ is a mapping of the input text content into a new state ′, which, unlike the previous one, is supplemented with a set of keywords as the main markers of the text content. For this purpose, the multi-level linear (sequential) [1-3] and, if necessary, hierarchical/network (interconnected) structure of the text is linguistically investigated as symbols, N-grams, morphological features, weights of words and phrases, features of sentences and interconnected units (Fig. 1) [4-9].

2. Models and methods

2.1. Peculiarities of defining keywords of the Ukrainian-language text

Web Mining technology is based on the use of methods of intellectual analysis of the flow of information content to identify patterns in the Internet or Web-site [10-12]. The main technology of Web Mining is Text Mining, which is used to extract structured/unstructured data from Web-pages, Web-sites, link structures, etc. [13-15].

Algorithm 1. Content keyword identification based on Web Mining

Stage 1. Integration/downloading of textual content for further analysis.

Stage 2. Grapheme analysis of textual content .

Step 1. Formatting of incoming text content, for example, unifying the apostrophes in Ukrainian text.

Step 2. Removal of the service part of the content, such as tags.

Step 3. Removal of the non-character part of the content, such as dates, numbers, financial symbols, mathematical formulas, images, etc. Removal of special characters that are not included in the alphabet, except for service ones such as the space and the apostrophe.

Step 4. Analysis of acronyms and abbreviations of content . If an abbreviation is used in the text but is not in the dictionary , then go to step 5, otherwise to step 6.

Step 5. If necessary, edit the thematic dictionary , for example, add new acronyms or abbreviations.

Step 6. Segmentation of the input array of text into sentences and paragraphs with appropriate marking of the corresponding boundaries.

Step 7. Segmentation of the sequence of symbols of sentences of content into tokens.

Stage 3. Morphological analysis of the Ukrainian-language text .

Step 1. Selection of bases (word forms without inflexions).

Step 2. Analysis of the resulting inflexion to determine the part of speech.

Step 3. Marking the word with the appropriate part of speech.

Step 4. Marking of word forms with a collection of morphological features (case, gender, declension, singular/plural, person, etc.).

Step 5. If the word is a noun, mark it as a potential keyword. If the word is an adjective, mark it and the next word (if that word is a noun) as a phrase that could potentially be a keyword.

Step 6. Formation of a linear chain of labelled structures.

Stage 4. Lexical analysis of the Ukrainian text .

Step 1. Search for the base in the base dictionary for further normalization, taking into account the part of speech used in a specific place of the text .

Step 2. Normalization of marked morphological structures.

Step 3. Segmentation and analysis of a chain of normalized tokens of content into tokens and word types taking into account marked sentence boundaries.

Step 4. Formation of collections of tokens (sequences of symbols according to appropriate templates) as lexemes with further identification of their types, taking into account their interrelationships in the textual content .

Step 5. If the dimensionality of the text content is  N1, then step 9, otherwise step 5.

Stage 5. Syntactic analysis of textual content .

Step 1. Selection of tokens 1 ∈ for text content .

Step 2. Identification of a sequence of tokens as an expression or sentence.

Step 3. Identification of the nominal group of the expression based on the dictionary of word bases .

Step 4. Definition of the verb group of the sentence based on the dictionary of word bases .

Step 5. Formation of a left-to-right parsing tree of linguistic variables.

Step 6. Analysis of noun phrase group for textual content .

Step 7. Analysis of the verb group of the sentence for textual content .

Step 8. Study of syntactic categories by word forms.

Step 9. If not the end of content , then go to step 2, otherwise go to step 9.

Stage 6. Semantic analysis of the Ukrainian text .

Step 1. Expression tokens are compared with the semantic classes of the dictionary .

Step 2. Definition of morpho-semantic analogues for a specific sentence.

Step 3. Combining tokens into a common structure.

Step 4. Generating a tuple of superpositions of lexical functions and semantic classes.

Stage 7. Referential analysis for determining inter-phrase unities of the text .

Step 1. Contextual analysis of content for identification of local references (which, this, his) and selection of utterances - kernels of unity.

Step 2. Thematic analysis to highlight the thematic structure.

Step 3. Identification of the identity of references; synonymizing, duplication and re-nomination of tokens; implications based on situational connections.

Stage 8. Structural analysis of textual content .

Step 1. Identification of the basic tuple of rhetorical connections between entities.

Step 2. Construction of a nonlinear network of units.

Stage 9. Identifying a set of content keywords  ( , , , , ) → ′.

Step 1. Formation of an alphabetic-frequency dictionary = ( , , ).

Step 2. Identification of terms ( ∈ 1)( ∈ ) as nouns, noun phrases, an adjective with a noun, or abbreviations.

Step 3. Formation of a shortened list of words whose frequencies correspond to the conditions of formation of potential keywords –  .

Step 4. Determination of the level of uniqueness  ( ), ∈ .

Step 5. Calculation (number of characters without spaces) for ∈ at ≥ 80.

Step 6. Calculation of (keyword usage frequency). For terms with ≤ 2000 the frequency ∈ (6; 8]%, with 2000 < < 3000 the frequency ∈ [4; 6]%, with ≥ 3000 the frequency ∈ [2; 4)%.

Step 7. Calculation of the probability of using the keywords (at the beginning of the text), (in the middle of the text content) and (at the end of the text content).

Step 8. Comparison of , and values for keyword prioritization under the condition ≫ ≫ .

Step 9. Sorting keywords according to defined priorities.

Step 10. Comparison of  content with the ℎ ∈ list.

Step 11. Formation of a new list of =  ℎ tokens.

Step 12. Formation of the collection of keywords ′ with ∈ , = { , ≥ 80, , , , , }.
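Parts of Stage 2 (grapheme analysis) and Stage 9 (frequency-based selection) of Algorithm 1 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function names, the regular expressions, the toy blocked-word list and the use of closed frequency intervals are hypothetical and are not the implementation of the described IS; stemming and part-of-speech filtering are omitted.

```python
import re
from collections import Counter

APOSTROPHES = "’`ʼ"  # apostrophe variants unified to "'" (Stage 2, Step 1)
BLOCKED = {"і", "та", "в", "на", "до", "за", "що"}  # toy blocked-word list (assumption)


def grapheme_clean(text: str) -> str:
    """Stage 2, Steps 1-3: unify apostrophes, drop tags and non-letter symbols."""
    for ch in APOSTROPHES:
        text = text.replace(ch, "'")
    text = re.sub(r"<[^>]+>", " ", text)              # service part such as tags
    text = re.sub(r"[^\w\s'.!?-]|[\d_]", " ", text)   # digits, currency signs, etc.
    return text


def tokenize(text: str) -> list[str]:
    """Stage 2, Step 7: split the cleaned text into lower-case word tokens."""
    return re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", text.lower())


def frequency_band(n_chars: int) -> tuple[float, float]:
    """Stage 9, Step 6: frequency band in % by text length (closed intervals assumed)."""
    if n_chars <= 2000:
        return 6.0, 8.0
    if n_chars < 3000:
        return 4.0, 6.0
    return 2.0, 4.0


def candidate_keywords(text: str) -> dict[str, float]:
    """Return words whose relative frequency (in %) lies inside the band."""
    tokens = [t for t in tokenize(grapheme_clean(text)) if t not in BLOCKED]
    n_chars = sum(len(t) for t in tokens)  # letters only, approximates "without spaces"
    low, high = frequency_band(n_chars)
    counts = Counter(tokens)
    total = len(tokens)
    return {w: 100 * c / total for w, c in counts.items()
            if low <= 100 * c / total <= high}
```

In a real system the candidates would additionally pass through the morphological, lexical and syntactic stages before being accepted as keywords.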

2.2. Method of identifying keywords of Ukrainian-language content

The analysis of the text flow of content for the identification of keywords is usually based on Zipf's law and reduced to the selection of words with an average frequency of occurrence [16-18]. This is easy to implement for English-language texts, but it does not work for Ukrainian-language texts. It is therefore necessary to adapt the parser and stemming algorithms to the Ukrainian language based on thematic frequency dictionaries of word bases [19-27].

Algorithm 2. Adaptation of parser/stemming algorithms of Ukrainian texts.

Stage 1. Based on the parser, a set of words with a frequency of occurrence within a certain limit is identified, for example, 4-6% for texts of ≤ 2000 characters without spaces.

Stage 2. Based on the parser and stemming, a subset of frequently used semantically loaded words is generated by extracting/marking words from the blocked dictionary, such as prepositions, conjunctions, pronouns, verbs, particles, etc.

Stage 3. If the keyword is an adjective (the inflexion of the normalized word is ий [yy]), then all bases to the right of it are found in the text and a frequency dictionary is built for them. Those phrases that are used more often than the corresponding threshold value (but less often than the adjective itself) are keywords. The threshold value is determined by the moderator. Repeat for multiple keywords.

Stage 4. If the keyword is a noun (the inflexion of the word is not ий [yy]), then all bases and their inflexions on both sides of it are examined.

Step 1. All words to the left of the noun are analysed for the presence of inflexions ий [yy] and compared with the frequency dictionary. A set of words that are used most often above the threshold value is identified - these are new keywords.

Step 2. All bases and their inflexions on the right are analysed - without inflexion ий [yy] and inflexions of other parts of speech, except nouns, are compared with the frequency dictionary, which determines the set of keywords.

Stage 5. The new subset is compared with the thematic dictionary of Ukrainian word bases to form a set of keywords.

Stage 6. If there is no analogue of the word, add it to the thematic dictionary of word bases through the buffer dictionary (edited by the moderator) to accumulate statistics for various stylistic text content.
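Stage 3 of Algorithm 2 can be sketched in Python as follows. This is a hedged illustration only: the function names are hypothetical, the input is assumed to be a list of already normalized words, and testing for the ending ий is a crude stand-in for a real morphological analysis.

```python
from collections import Counter

ADJ_INFLEXION = "ий"  # inflexion of a normalized adjective, as in Algorithm 2


def is_adjective(base_form: str) -> bool:
    """Crude test from Algorithm 2: a normalized word ending in 'ий'."""
    return base_form.endswith(ADJ_INFLEXION)


def adjective_phrases(tokens: list[str], keyword: str, threshold: int) -> dict[str, int]:
    """Stage 3: count the words directly to the right of an adjective keyword.

    A phrase is kept when it occurs more often than `threshold`
    (set by the moderator) but less often than the adjective itself.
    """
    if not is_adjective(keyword):
        return {}
    keyword_freq = tokens.count(keyword)
    right = Counter(
        nxt for cur, nxt in zip(tokens, tokens[1:])
        if cur == keyword and not is_adjective(nxt)
    )
    return {f"{keyword} {word}": count for word, count in right.items()
            if threshold < count < keyword_freq}
```

For a noun keyword (Stage 4) the same counting would be applied to the neighbours on both sides.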

3. Experiments, results and discussion

3.1. Content keyword identification based on Web Mining technology

The experimental base for the research comprised 100 scientific articles from the "Information Systems and Networks" series of the Bulletin of "Lviv Polytechnic" NU (http://science.lp.edu.ua/sisn), taken from two issues, 783 (http://science.lp.edu.ua/SISN/SISN-2014) and 805 (http://science.lp.edu.ua/sisn/vol-cur-8052014-2). To achieve the goal of the research, an information system (IS) was developed (Fig. 2) and placed on the Victana resource (http://victana.lviv.ua/index.php/kliuchovi-slova) using the following tools: CMS Joomla! for the IS framework, PHP for algorithm implementation, MySQL for storing data and dictionaries, HTML for Web-page markup and CSS for Web-page styles.

The developed IS has the following main components.

1. A user-friendly dialogue web interface on the web page of the Ключові слова [Klyuchovi slova] (Keywords) menu with the following sections (Fig. 2):
• Вибрати мову контенту [Vybraty movu kontentu] (Select the content language) – one/several languages of the analysed text.
• Мін. вага слова, % [Min. vaha slova, %] (Min. word weight, %) – the percentage of the weight of the keyword relative to the total number of words of the text, above which the keywords will be selected; format - ХХ.ХХ, within [00.01 - 99.99]; mandatory field.
• Help – short instructions in Ukrainian on a separate web page.
• Контент [Kontent] (Content) – field for the analysed text content.
• Ключові слова [Klyuchovi slova] (Keywords) – field in which the IS displays the set of keywords.
• Генерувати [Heneruvaty] (Generate) – starts the keyword identification process.
• Очистити [Ochystyty] (Clear) – clears the input field Контент [Kontent] (Content).
• Повторюваність слів, раз [Povtoryuvanistʹ sliv, raz] (Repetition of words, times) – the number of repetitions of the keyword in the text.
• Рекомендовані рубрики [Rekomendovani rubryky] (Recommended headings) – a list of thematic headings according to keywords.

2. The main DB relations: word bases; blocked words; rubrics; and rules for bringing a word to its base.

3. PHP functions for processing text content:
• get_keywords() – creates a list of keywords;
• get_word() – a record of the rules for bringing the word to its base;
• explode_str_on_words() – clears the received content of blocked words, special characters, etc.;
• blocked_words() – forms a list of blocked words depending on the selected language of the content;
• count_words() – calculates keyword frequencies;
• set_keywords() – writes keywords to the DB if they are not yet there;
• recommend_rubric() – creates a list of recommended rubrics;
• error() – processes errors and sends a letter to the IS administrator.

The dynamics of the module for determining the collection of keywords from 100 scientific and technical articles were studied in two stages, with analysis based on:
• the content of the thematic dictionary and the set of blocked words;
• the content of the thematic dictionary and the set of blocked words refined on the basis of ML, since with each subsequent verification of a text by the module, an additional collection of unknown words (absent from both the blocked list and the thematic dictionary) is potentially generated.

At each stage, the module verifies the text of the articles in two steps: analysis of the entire article (Fig. 3a) and analysis without metadata (information about authors, title, author keywords and annotations in several languages, references list, etc.) (Fig. 3b), in order to analyse the accuracy error of generating a collection of keywords in the presence of information noise.

3.2. Experimental study results of the Ukrainian-language content keywords identification

The statistical analysis was based on a comparison of the sets of keywords defined by the authors of the articles and by the module at two different stages with different word weights within [1,5] (in the option *Мін.вага слова, % [*Min.vaha slova, %] (*Min. word weight, %)) for full and abbreviated texts of works (Table 1). The arithmetic mean number of author's keywords is 4.77, and these consist of approximately 9-10 words. Table 2 contains the following notations: A (total identified keywords at a given word weight), B (formed significant words without pronouns and verbs), C (coincidence of words with the author's list), D (accuracy of the coincidence of identified keywords with the author's list), E (additional keywords identified by the module, but not defined by the author of the publication). Known keyword identification IS process texts within [100 ÷ 1000] words [28-32].

The disadvantage of these IS is the inaccurate and incorrect processing of Ukrainian-language texts in the absence of competently constructed morphological dictionaries, dictionaries of bases and blocked words. Also, the main drawback of most such IS is the limited volume of text content that they process, [100 ÷ 1000] words (Fig. 4). The best IS for processing Ukrainian-language textual content is [33] (Fig. 5), but it does not identify the set of keywords, only the frequency of use of words, phrases and parts of words. It does not work with word bases at all (ключових [klyuchovykh] (keywords) and ключові [klyuchovi] (keywords) are treated as different words). The developed resource works with word bases and is focused on Ukrainian/English texts (Fig. 1). For [20] in Ukrainian, the frequencies of keywords identified on Victana are: слово [slovo] (word) – 120; ключовий [klyuchovyy] (key) – 49; контент [kontent] (content) – 46; аналіз [analiz] (analysis) – 39; Chomsky – 37; система [systema] (system) – 37. The authors identified the keywords: текст [tekst] (text), україномовний [ukrayinomovnyy] (Ukrainian), алгоритм [alhorytm] (algorithm), синтаксичний аналіз [syntaksychnyy analiz] (syntactic analysis), породжувальні граматики [porodzhuvalʹni hramatyky] (generative grammars), лінгвістичний аналіз [linhvistychnyy analiz] (linguistic analysis), контент-моніторінг [kontent-monitorinh] (content monitoring), ключові слова [klyuchovi slova] (keywords), інформаційна лінгвістична система [informatsiyna linhvistychna systema] (informational linguistic system), структурна схема речення [strukturna skhema rechennya] (sentence structure scheme). Authors usually define more keywords than the Zipf-law pattern of word frequency distribution suggests.

The author of an article almost always chooses the number and content of the keyword set at their own discretion, in the range of 2 to 10 word combinations (usually 3-5). The developed module defines a different number of words (from 0 to several dozen), depending on the writing style of the author, the volume of the article, the genre, the topic, and the frequency of use of the corresponding words. The coincidence of the sets of found keywords with the author's, without taking into account the extra words defined by the authors (repetition > 30 for a text volume of more than 4800 words), is, respectively, for [33] - 83%; [32] - 57%; [31] - 35%; %; http://victana.lviv.ua/kliuchovi-slova - 90% (Fig. 6). Fig. 7 demonstrates the features of generating a set of probable keywords compared to the author's set. The author of the article often defines a larger number of words ( 2) and a smaller number of keywords ( 1) than are present in the text. Fig. 7b shows the distribution of text density in articles, where the numbers denote: 1 – pages, 2 – paragraphs, 3 – lines, 4 – words, 5 – characters, 6 – spaces and characters, 7 – words per page, 8 – characters per page, 9 – spaces and characters per page.

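Marking | Chart column name | Explanation | Value (arithmetic average number of keywords)
1 | Author's keywords | defined by the author | 4.77
2 | Number of words | contained in the author's keywords | 9.82
3 | Stage 1, Step 1 | probable keywords found by the module at stage X and step Y (Fig. 8-Fig. 9) | 5.46
4 | Stage 1, Step 2 | probable keywords found by the module at stage X and step Y (Fig. 8-Fig. 9) | 6.51
5 | Stage 2, Step 1 | probable keywords found by the module at stage X and step Y (Fig. 8-Fig. 9) | 7.43
6 | Stage 2, Step 2 | probable keywords found by the module at stage X and step Y (Fig. 8-Fig. 9) | 8.35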

The value of 3 differs from the value of 1 by 0.69 (in number, but not in content); respectively, 4 differs from 1 by 1.74; 5 from 1 by 2.66; 6 from 1 by 3.58. The value of 2 differs from the value of 3 by 4.36; respectively, 2 from 4 by 3.31; 2 from 5 by 2.39; 2 from 6 by 1.47. Adaptively changing the parameters/rules of the module almost doubles the collection of identified keywords relative to the author's (the value of 3 exceeds the value of 1 by a factor of 1.144654; 4 by 1.36478; 5 by 1.557652; 6 by 1.750524). The total increase in the value obtained depending on the moderation of dictionaries is, respectively, 14.46541% for 3; 36.47799% for 4; 55.7652% for 5; 75.05241% for 6. The value of 2 exceeds the values of 3 ÷ 6 by factors of 1.7985; 1.5084; 1.3217; 1.176, respectively. For different stages and steps of the experiment of processing the primary text, the average coincidence of the lists of identified keywords with the author's keywords varies in the range of 52.6-68.5%. The accuracy of matching keywords with the author's keywords ranges from 43.6 to 62.9%. The average match of meaningful keywords compared to all keywords found by the system varies between 38.9 and 75.8%, depending on the stage of analysis of the article texts. The accuracy of matching keywords compared to all keywords found by the system ranges from 34.3 to 71.9%, depending on the stage of analysis of the article texts.
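The differences and ratios quoted in this paragraph follow directly from the six average values in the table above. The short Python check below recomputes them; the dictionary keys simply mirror the markings 1-6, and the variable names are illustrative.

```python
# Average numbers of keywords from the table above (markings 1-6).
avg = {1: 4.77, 2: 9.82, 3: 5.46, 4: 6.51, 5: 7.43, 6: 8.35}

for k in (3, 4, 5, 6):
    diff_author = avg[k] - avg[1]           # absolute difference from value 1
    ratio_author = avg[k] / avg[1]          # how many times value k exceeds value 1
    growth_pct = (ratio_author - 1) * 100   # relative increase over value 1, in %
    diff_words = avg[2] - avg[k]            # absolute difference from value 2
    ratio_words = avg[2] / avg[k]           # how many times value 2 exceeds value k
    print(f"{k}: diff={diff_author:.2f} ratio={ratio_author:.6f} "
          f"growth={growth_pct:.5f}% diff2={diff_words:.2f} ratio2={ratio_words:.4f}")
```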

For 3, the module most often identified the number of keywords {5, 7, 3} (10), although the distribution of found keywords was within [1;18] words (except 17). For 4, the IS also most often identified the number of keywords {5, 7, 3}, although the distribution of found keywords was within [1;18] (except 17); the number of identified words increased and the highest reliability index was achieved. For 5, the module most often identified the number of keywords {7, 6, 5, 10, 8}, although the distribution of found keywords was within [2;14] (the range narrowed significantly). For 6, the module most often identified the number of keywords {8, 5, 7, 10}, with the distribution of identified keywords within [3;16] (accuracy improved). The accuracy of keyword definition increases in the process of moderating the dictionaries and the ML module. The difference between the number of keywords defined by the author and identified by the module at 3 is 44.39919%.

Descriptive statistical data of keyword identification in experiments

Name: Average; Standard error; Median/Mode; Standard deviation; Sampling variance; Excess; Asymmetry; Interval; Minimum/Maximum; Sum; Score; Biggest(1)/Smallest(1); Reliability level (95.0%)

Statistical data of histogram construction for 3 and 3 6 (Fig. 10): bins N from 1 to 18 and More; values 72.73, 85.86, 90.91, 97.98, 85.86, 89.90.

[Figure legend: specifying the thematic dictionary (blue - filtered text, orange - general text)]

Accuracy improves: the difference is 33.70672% for 4, improves significantly to 24.33809% for 5, and is 14.96945% for 6 (Table 4). Table 5 shows data from research articles when generating sets of keywords (Fig. 10). Analysis was performed for 100 filtered texts without metadata and for the unfiltered texts. The average values obtained for the 100 filtered texts (0.28) and the unfiltered texts (0.19) show that such filtering of scientific articles improves the density of keywords by 1.48 times or by 47.83% (Fig. 11a). The average values obtained for the 100 filtered texts (0.34) and the unfiltered texts (0.25), taking into account the refinement of the thematic dictionary due to the addition of blocked words, show that filtering with simultaneous moderation of the thematic dictionary improves keyword density by 1.35 times or by 35.44% (Fig. 11b).

A comparison of the values for the original author's text, 0.19 and 0.25 without/with the refinement of the thematic dictionary, respectively, demonstrates the effectiveness of moderating the thematic dictionary for the initial text: the density of keywords increases 1.34 times or by 34.33% (Fig. 12a). A comparison of the values for the filtered author's text, 0.28 and 0.34 without/with the refinement of the thematic dictionary, respectively, demonstrates the effectiveness of moderating the thematic dictionary for the filtered text: the density of keywords increases 1.23 times or by 23.14% (Fig. 12b).

3.3. Analysis of methods for identifying stable phrases as keywords

The identification of stable phrases consists of the following stages: morphological analysis (MA), syntactic analysis (SYA), selection of keywords and analysis of key phrases for stability (Fig. 13) [34-37].

[Fig. 13 labels: Lexical analysis (LA); Finite automata; Syntactic analysis (SYA); Context-free grammars; Contextual analysis; Attributive grammars; Output generation; Syntax-controlled translation; Output optimization; Flow analysis; Definition of stable phrases; Decision tables; Token flow; Parse tree; Attributive tree; Intermediate form (prefix, postfix, triples, etc.); Object module; Object tables; Diagnostics]

For Ukrainian-language texts, it is best to use a combination of procedural, tabular, and statistical stemming approaches. In the procedural approach to MA, emphasis is placed on the use of ready-made dictionaries of bases and dictionaries of ready-made forms (DRF) in the analysis of words. The MA algorithm then consists of the following steps: search in the DRF, base selection, and base search in the dictionary. The basis of most MAs of the Ukrainian language is a tree or a finite-state automaton (FSA) (Fig. 14).

[Fig. 14: a) tree and b) finite-state automaton used for morphological analysis]
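The three steps of the procedural MA approach described above (search in the dictionary of ready-made forms, base selection, base search in the dictionary stored as a tree) can be sketched in Python. The dictionaries, the inflexion list and all names below are toy assumptions chosen for illustration; a real system would use full morphological dictionaries.

```python
READY_FORMS = {"і": "conjunction", "за": "preposition"}              # toy DRF (assumption)
BASES = {"систем": "noun", "ключов": "adjective", "контент": "noun"}  # toy base dictionary
INFLEXIONS = ["ими", "ого", "ий", "ів", "ам", "а", "и", "у", "і", ""]  # toy list, longest first


def build_trie(words):
    """Store the dictionary bases in a character tree (nested dicts)."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-base marker
    return root


def in_trie(trie, word):
    """Walk the tree character by character; succeed only on a complete base."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node


BASE_TRIE = build_trie(BASES)


def analyse(word):
    """Return (base, part_of_speech), or None if the word is unknown."""
    word = word.lower()
    if word in READY_FORMS:                      # 1. search in the DRF
        return word, READY_FORMS[word]
    for infl in INFLEXIONS:                      # 2. base selection
        if infl and not word.endswith(infl):
            continue
        base = word[: len(word) - len(infl)]
        if base and in_trie(BASE_TRIE, base):    # 3. base search in the dictionary
            return base, BASES[base]
    return None
```

Under these toy dictionaries, системи and ключовими are reduced to the bases систем and ключов, while a word absent from both dictionaries yields None and would be a candidate for the buffer dictionary.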

The type of word is determined by the form of its inflexions (Fig. 13). The algorithm works with individual words, so the context of the word is not taken into account. Parts of speech (adjective, noun, etc.) and morphological categories (stem, suffix, etc.) are also unavailable. Variants of the rules for the stemming of Ukrainian words are as follows: short words remain unchanged; the word changes during stemming (an exception); the word does not change during stemming (an exception); the word corresponds to a regular expression; the ending is changed; the word has an unchanged ending; or the inflexion is cut off from the word. All this significantly complicates the keyword identification algorithm. Therefore, first of all, it is necessary to analyse widespread inflexions.

Syntax is the set of rules for combining words into correct expressions, i.e. word combinations and sentences (compare: programming language syntax). The task of the SYA (parser) is to construct the syntactic structure of the input sentence. Aspects of SYA implementation are dictionaries (information about individual language units), formal rules, and interaction with neighbouring processing levels (morphological analysis, semantic analysis). Context-free grammar (CFG) rules are most often used in SYA. A CFG is a tuple <N, T, X, R>, where N is a set of non-terminal symbols, T is a set of terminal symbols ($N \cap T = \emptyset$), X is the axiom ($X \in N$), and R is a set of transformation (substitution) rules of type $A \to \alpha$, where $A \in N$ and $\alpha$ is a list of terminal and non-terminal symbols. CFG example:

N = { , , , , , }, X = S, T = {система, рубрикувати, україномовний, контент, за, ключовий, слово} [T = {systema, rubrykuvaty, ukrayinomovnyy, kontent, za, klyuchovyy, slovo}] (T = {system, categorize, Ukrainian-language, content, by, key, word}), R = { → , → , → , → , → , → система, → рубрикувати, → україномовний, → ключовий, → контент, → слово, → за}.

The disadvantage of using CFG is the periodic appearance of ambiguity in SYA, for example, in "The system categorizes Ukrainian-language content by keywords" (Fig. 15). Examples of well-known SYA systems for English texts are "Machinese Phrase Tagger" (Fig. 16) and VISL. There is no information resource available online for the SYA of Ukrainian texts. "Ontology Matcher Demo" uses Machinese metadata to find ontology objects in the text (Fig. 17).
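The ambiguity mentioned above can be made concrete with a small CKY chart that counts parse trees. The grammar below is a hypothetical one written for this illustration (the non-terminal names S, NP, VP, PP, V, A, N, P and the rules are assumptions, not the rule set of the original example); it uses the normalized terminals of the example sentence. The phrase за ключовий слово can attach either to the verb group or to the noun group, which gives two parse trees.

```python
from collections import defaultdict

# Hypothetical grammar in binary form; the non-terminal names are assumptions.
RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # prepositional phrase attached to the verb group
    ("NP", "A", "N"),
    ("NP", "NP", "PP"),   # prepositional phrase attached to the noun group
    ("PP", "P", "NP"),
]
LEXICON = {
    "система": {"N", "NP"},
    "рубрикувати": {"V"},
    "україномовний": {"A"},
    "контент": {"N", "NP"},
    "за": {"P"},
    "ключовий": {"A"},
    "слово": {"N", "NP"},
}


def count_parses(words, start="S"):
    """CKY chart that counts the distinct parse trees of `words`."""
    n = len(words)
    chart = defaultdict(lambda: defaultdict(int))  # (i, j) -> symbol -> tree count
    for i, word in enumerate(words):
        for symbol in LEXICON[word]:
            chart[i, i + 1][symbol] = 1
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for head, left, right in RULES:
                    chart[i, j][head] += chart[i, k][left] * chart[k, j][right]
    return chart[0, n][start]


sentence = "система рубрикувати україномовний контент за ключовий слово".split()
print(count_parses(sentence))  # 2: one reading per attachment of the phrase with "за"
```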

[Fig. 15. Parse tree of the sentence Система рубрикує україномовний контент за ключовими словами with the nodes S, NP, VP, V, A, N and P]

hat)) → (the boy) with (the hat)) of Ukrainian-language texts for the identification of stable word combinations when defining keywords is presented in Fig. 19.

To select stable word combinations in the analysed texts and carry out their comparative analysis, we use 4 different methods: FREG (frequency + morphological patterns, i.e. direct counting of the number of words); the t-test; the $\chi^2$ statistic; and the likelihood ratio (LR).

A collocation is a word combination that forms a semantically and syntactically coherent linguistic unit, where one part is chosen according to meaning and the other depends on the first (for example, ставити умови [stavyty umovy] (to set conditions): the choice of the verb ставити [stavyty] (to set) is determined by tradition and depends on the noun умови [umovy] (conditions); with the word пропозицію [propozytsiyu] (offer) there will be another verb, вносити [vnosyty] (to enter)). This is a limited (selective) combination of words: phraseological units, idioms, proper names and trademarks. Collocations often include complex names (for example, крейсер москва [kreyser moskva] (moscow cruiser), руський корабль [rusʹkyy korablʹ] (russian ship), безпілотник Байрактар [bezpilotnyk Bayraktar] (Bayraktar drone), від’ємний наступ [vidʺyemnyy nastup] (negative attack), німецькі леопарди [nimetsʹki leopardy] (German leopards), жест доброї волі [zhest dobroyi voli] (goodwill gesture), etc.). Another name for the same phenomenon is stable phrases, N-grams. Examples of collocations:
• Грати роль [hraty rolʹ] (to play a role), мати значення [maty znachennya] (to have a meaning), впливати [vplyvaty] (to influence), справляти враження [spravlyaty vrazhennya] (to make an impression);
• Засоби масової… [zasoby masovoyi…] (means of mass...), зброя масової… [zbroya masovoyi…] (weapons of mass...), вищий навчальний …. [vyshchyy navchalʹnyy ….] (higher education);
• глибокий старець [hlybokyy staretsʹ] (deep old man) vs. поверхневий/мілкий невеликий юнак [poverkhnevyy/milkyy nevelykyy yunak] (superficial/shallow little young man);
• міцний чай [mitsnyy chay] (strong tea) vs. сильний чай [sylʹnyy chay] (powerful tea);
• Кока-кола [Koka-kola] (Coca-Cola), Microsoft Windows;
• Гола Пристань [Hola Prystanʹ] (Hola Prystan), Нова Каховка [Nova Kakhovka] (Nova Kakhovka), Володимир Волинський [Volodymyr Volynsʹkyy] (Volodymyr Volynsky), Володимир Зеленський [Volodymyr Zelensʹkyy] (Volodymyr Zelensky), Нью Йорк [Nʹyu York] (New York), Стив Джобс [Styv Dzhobs] (Steve Jobs).

1. The FREG method is a direct calculation of the frequency of use of pairs (triples) of words. For example, FREG for the sentence В літературі описано декілька підходів до автоматичного виділення стійких словосполучень [V literaturi opysano dekilʹka pidkhodiv do avtomatychnoho vydilennya stiykykh slovospoluchenʹ] (In the literature, several approaches to the automatic selection of stable word combinations are described) gives: в літературі [v literaturi] (in the literature); літературі описано [literaturi opysano] (described in the literature); описано декілька [opysano dekilʹka] (several are described); декілька підходів [dekilʹka pidkhodiv] (several approaches); підходів до [pidkhodiv do] (approaches to); до автоматичного [do avtomatychnoho] (to automatic); автоматичного виділення [avtomatychnoho vydilennya] (automatic selection); виділення стійких [vydilennya stiykykh] (selection of stable); стійких словосполучень [stiykykh slovospoluchenʹ] (stable phrases). Unfortunately, when this method is used on large volumes of text, we get so-called "garbage" due to the high frequency of service words. The method therefore also requires consideration of the frequency of occurrence and of patterns of word combinations.
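The direct counting behind FREG can be sketched in a few lines of Python on the example sentence. The function names and the two-word list of service words are assumptions made for illustration; the morphological patterns mentioned above are not modelled here.

```python
from collections import Counter

SERVICE_WORDS = {"в", "до"}  # toy list of service words (assumption)


def bigrams(tokens):
    """Adjacent word pairs in reading order."""
    return list(zip(tokens, tokens[1:]))


def freg(tokens, service_words=frozenset()):
    """Count bigrams, optionally dropping pairs that contain a service word."""
    return Counter(
        pair for pair in bigrams(tokens)
        if not (set(pair) & service_words)
    )


sentence = ("В літературі описано декілька підходів до автоматичного "
            "виділення стійких словосполучень")
tokens = sentence.lower().split()
all_pairs = freg(tokens)                     # the nine pairs listed above
content_pairs = freg(tokens, SERVICE_WORDS)  # pairs without service words
```

Dropping pairs with service words is one simple way to reduce the "garbage" that raw frequency counting produces on large texts.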

2. The t-test method consists of statistical hypothesis testing and an MA statistical model using the null hypothesis $H_0$: the words met by chance, i.e. $P(w_1 w_2) = P(w_1) P(w_2)$. It takes into account the use frequency not only of pairs but also of the individual words that make up a pair: $t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}$, where $\bar{x}$ is the empirical average, $\mu$ is the theoretical average, $s^2$ is the empirical dispersion, and $N$ is the empirical sample size. The method is not completely correct for language, but it gives usable results in practice. For example, for the stable phrase контент аналіз [kontent analiz] (content analysis) in [37], with $P(\text{контент}) = 85/4338$ and $P(\text{аналіз}) = 53/4338$, the null hypothesis gives $H_0$: $P(\text{контент аналіз}) = P(\text{контент}) P(\text{аналіз}) \approx 2.39 \cdot 10^{-4}$. In the Bernoulli scheme, $s^2 = p(1 - p) \approx p$; at $\bar{x} = 18/4338$ this gives $t \approx 3.997955$.
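The worked example for контент аналіз can be reproduced with a short Python function. The function and parameter names are illustrative; the variance uses the same approximation $s^2 \approx \bar{x}$ as the text, so the result should be close to the value of about 3.998 reported above.

```python
import math


def t_score(c1: int, c2: int, c12: int, n: int) -> float:
    """t statistic for a bigram from word counts c1, c2 and pair count c12."""
    mu = (c1 / n) * (c2 / n)   # expected pair probability under H0 (independence)
    x_bar = c12 / n            # observed pair probability
    s2 = x_bar                 # Bernoulli variance p(1 - p), approximated by p
    return (x_bar - mu) / math.sqrt(s2 / n)


# контент: 85 occurrences, аналіз: 53, контент аналіз: 18, text length: 4338 words
t = t_score(85, 53, 18, 4338)
print(round(t, 3))
```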

3. Pearson's $\chi^2$ method is applied to 2x2 tables (Table 6). Normality is not assumed in the calculations.

Table 6. An example of using Pearson's $\chi^2$ method

 | $w_1$ = контент | $w_1 \neq$ контент
$w_2$ = аналіз | 18 (контент аналіз) | 35 (e.g., статистичний аналіз)
$w_2 \neq$ аналіз | 67 (including, контент моніторинг) | 4218 (including, статистичний моніторинг)

4. The LR method consists of comparing two hypotheses ($p_1 \gg p_2$): $H_1$: $P(w_2|w_1) = p = P(w_2|\neg w_1)$ and $H_2$: $P(w_2|w_1) = p_1 \neq p_2 = P(w_2|\neg w_1)$, where $p = \frac{c_2}{N}$; $p_1 = \frac{c_{12}}{c_1}$; $p_2 = \frac{c_2 - c_{12}}{N - c_1}$, and $c_1$, $c_2$, $c_{12}$ are the numbers of occurrences of $w_1$, $w_2$ and the pair $w_1 w_2$. Then, using the binomial distribution $b(k; n, x) = \binom{n}{k} x^k (1 - x)^{n-k}$, we get the LR likelihood ratio $\lambda$, where $-2 \log \lambda$ is asymptotically distributed as $\chi^2$.

The term extraction experiment was conducted on 3 articles from different SAs. The templates for the experiment are: [Adjective + Noun], [Adjective + Noun], [Noun + Noun, genitive case], [Noun + Noun, instrumental case], [Noun + '-' + Noun]. During the experiment, 6 methods were used: manually determined by the authors of the articles (A); determined by the Victana.lviv.ua system, taking into account Zipf's law (B); frequency + morphological patterns, FREG (C); t-test (D); likelihood ratio, LR (F); $\chi^2$ statistic (G). An analysis of 3 articles in Ukrainian and translated into English was conducted (Table A - Table B of Appendix). Keywords that occur in the results of all methods are highlighted in bold, those that occur only in methods B-G in italics, and those that occur in methods A and C-G are underlined. When conducting the linguistic analysis, the following features were used to form alphabetic-frequency dictionaries of two words each:

• Bigrams were formed within the boundaries of punctuation marks (if there was any punctuation mark between the words, these words were not considered a bigram);
• An alphabetic-frequency dictionary of two-word combinations (bigrams) was formed based on the word bases and on content analysis of these bigrams;
• Verbs, identified by analysing word inflexions, were not taken into account when forming the bigram alphabetic-frequency dictionary (verbs were treated as punctuation marks);
• Before the linguistic analysis of the texts, all stop words (particles, adverbs, conjunctions) and pronouns were removed.
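The three association measures described above can be checked numerically on the контент аналіз counts quoted in the text. The following is only a minimal sketch: the function names are illustrative and not part of the described system, and the t-score uses the same approximation $s^2 \approx \bar{x}$ as the worked example.

```python
import math

# Counts from the worked example: corpus size and occurrences of
# "контент", "аналіз" and the pair "контент аналіз".
N, C1, C2, C12 = 4338, 85, 53, 18


def t_score(c1, c2, c12, n):
    """t-test for a word pair under H0: P(w1 w2) = P(w1) P(w2)."""
    x_bar = c12 / n                # observed pair probability
    mu = (c1 / n) * (c2 / n)       # expected probability under H0
    s2 = x_bar                     # Bernoulli variance p(1 - p), approximated by p
    return (x_bar - mu) / math.sqrt(s2 / n)


def chi_square(c1, c2, c12, n):
    """Pearson's chi-square statistic for the 2x2 contingency table."""
    o11, o12 = c12, c1 - c12
    o21, o22 = c2 - c12, n - c1 - c2 + c12
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den


def log_binom(k, n, x):
    """Log of x^k (1 - x)^(n - k); the binomial coefficient cancels in the ratio.

    Assumes 0 < x < 1.
    """
    return k * math.log(x) + (n - k) * math.log(1 - x)


def log_likelihood_ratio(c1, c2, c12, n):
    """-2 log(lambda) for H1 (independence) against H2 (dependence)."""
    p = c2 / n
    p1 = c12 / c1
    p2 = (c2 - c12) / (n - c1)
    log_lambda = (log_binom(c12, c1, p) + log_binom(c2 - c12, n - c1, p)
                  - log_binom(c12, c1, p1) - log_binom(c2 - c12, n - c1, p2))
    return -2 * log_lambda


if __name__ == "__main__":
    print("t     =", t_score(C1, C2, C12, N))
    print("chi^2 =", chi_square(C1, C2, C12, N))
    print("LR    =", log_likelihood_ratio(C1, C2, C12, N))
```

In practice such scores are computed for every candidate bigram and used to sort the candidates rather than to accept or reject a single hypothesis.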

Statistical methods make it possible to take into account the frequency of individual words. The subtleties lie in applying the methods to different data volumes and probability ranges: Pearson's $\chi^2$ is better than the t-test for larger p, where normality is violated, and for small volumes the likelihood ratio is better approximated by the $\chi^2$ distribution than the statistic for 2x2 tables. These methods are more often used for ranking candidate phrases than for accepting or rejecting hypotheses. For comparison with the obtained results, we use Google's Word2Vec library, which has proven itself as an alternative to TF-IDF (А1 - Table C of the Appendix), as well as the built-in Python methods for searching for word combinations. Word2Vec did not work well on these datasets because it needs huge corpora. Its most interesting property is that, after each word of the corpus is mapped into a space whose dimension is set by the user, words can be combined arithmetically, for example, 'король' + 'жінка' - 'чоловік' = 'королева' ('king' + 'woman' - 'man' = 'queen').

After translation into a space of a certain dimension, each word becomes a vector, so basic operations such as addition, subtraction and multiplication can be applied to words. We also consider the analysis through bigrams (А2 – Table C of the Appendix) and skip-grams (А3 – Table C of the Appendix). The results are better than those of Word2Vec; the best were obtained by analysing skip-grams with a value of 3 combined with the removal of English stop words (А4 – Table C of the Appendix). However, these results are still quite far from those given in Table A of the Appendix. They are worsened by ignoring punctuation marks and by treating stop words as meaningful during the linguistic analysis.
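To illustrate why punctuation boundaries and stop words matter for pair extraction, the sketch below counts bigrams (k = 0) and skip-grams (k > 0) only within punctuation-delimited segments and after stop-word removal. It is an illustration under stated assumptions, not the implementation used in the experiment: the stop-word set is a hypothetical placeholder and the function names are invented for this example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the real dictionaries are not given in the text.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "is"}


def segments(text):
    """Split text at punctuation marks so that no pair crosses a mark."""
    parts = re.split(r"[^\w\s'’-]+", text.lower())
    result = []
    for part in parts:
        tokens = [w for w in part.split()
                  if w not in STOP_WORDS and any(c.isalpha() for c in w)]
        if tokens:
            result.append(tokens)
    return result


def skipgrams(tokens, k=0):
    """Ordered word pairs with at most k tokens skipped between the two words."""
    pairs = []
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + k, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs


def pair_counts(text, k=0):
    """Frequency dictionary of pairs formed inside punctuation boundaries."""
    counts = Counter()
    for tokens in segments(text):
        counts.update(skipgrams(tokens, k))
    return counts


if __name__ == "__main__":
    sample = "Content analysis of the text, text mining."
    print(pair_counts(sample, k=0).most_common())
    print(pair_counts(sample, k=3).most_common())
```

With k = 0 the function yields plain bigrams; a value of 3 corresponds to the skip-gram setting mentioned above.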

3.4. Parametric classification of the text in Ukrainian

When classifying the text, the definition of the grammatical meta-data of the word is implemented based on grapheme/morphological analysis (Fig. 20, algorithm 3) [38-41].

Algorithm 3. Thematic classification of Ukrainian-language content

Stage 1. Splitting the Ukrainian-language text С into parts (sections, paragraphs, etc.).

Step 1. Loading the text С into the text tree generation module.

Step 2. Formation of a new array of strings in the structure.

Step 3. Parsing of strings of symbols of parts of the text С .

Step 4. If the period marks the end of a sentence rather than part of an abbreviation, go to step 5; otherwise store it in the array and go to step 3.

Step 5. If the end-of-text character is identified, go to step 6; otherwise mark the end of a part of the text and go to step 2.

Step 6. Saving the tree of parts of text as a structure ∈ .

Stage 2. Splitting the part into expressions while preserving the structure of the text С3.

Step 1. Analysis of the new structure of part of the text ∈ . Formation of the structure of the expression (paragraph/sentence, etc.) ∈ with the ID_part key of type n-to-1 with the structure of text parts С .

Step 2. Formation of a new array in the structure of sentences ∈ .

Step 3. Parsing characters up to the next punctuation mark.

Step 4. If the sequence matches the regular expression for an abbreviation or a special entry (date, money, etc.), mark it accordingly and go to step 5; otherwise save it in the structure ∈ and go to step 2.

Step 5. If the end of the text part is reached, mark it and go to step 6; otherwise go to step 2.

Step 6. Saving a tree of sentences in the form of a ∈ structure.

Step 7. If the end of the text is reached, go to Stage 3; otherwise go to step 1.

Stage 3. Splitting sentences into lexemes while preserving the connection with the corresponding sentence ∈ and, accordingly, the number of the position in the sentence.

Step 1. Formation of the lexeme structure ∈ with the fields ID_lex, ID_sent, N_lex, T_lex as a description of the lexeme meta-data.

Step 2. Analysis of the sentence lexeme with ∈ .

Step 3. Formation of a new lexeme in the lexeme structure ∈ .

Step 4. Parsing characters up to the first character not from the Ukrainian alphabet or an apostrophe and saving tokens in the structure.

Step 5. If the end-of-sentence character, then go to step 6, otherwise go to step 3.

Step 6. Syntax analysis based on Algorithm 2.

Step 7. Morphological analysis based on received lexeme chains.

Stage 4. Identification of the topic of the Ukrainian-language text ∈ .

Step 1. Identification of the hierarchical structure of features ∈ of each semantically significant lexeme from the noun group, except for pronouns.

Step 2. Generating a dictionary with a hierarchy of token property types.

Step 3. Unification, if necessary, of similar tokens.

Step 4. Identification of a set of key words of the text С′ = ( (С , ), ) with = { 1, 2, 3, 4}, where is a collection of classification conditions, 1 is a set of thematic keywords, 2 is a set of frequencies of occurrence of keywords, 3 is dependencies , = || − || = ∑ =( 1, )| ( ) − ( )|. , = || , − , ||. 1−

The text "x" belongs to the author whose distance to the density of the N-gram distribution is the smallest. When solving the classification problem, the data set was not divided into test and training sets. Weighted average distribution densities of N-grams were constructed over the entire set of content of one author. The distance from content i to a specific author y was calculated so that the density of the distribution of N-grams of content i is excluded from the average density of the distribution of N-grams of that author. The Web resource for analysing N-grams has the following fields (Fig. 41b):
• Вибрати мову тексту [Vybraty movu tekstu] (Select the language of the text) – the language of the text for analysis (research). The default is "Ukrainian".
• Число грами [Chyslo hramy] (Number of grams) – the number of characters in a gram. It can be set to 1, 2, 3 or 4. The default is 3.
• Limitation of text in characters.
• Текст [Tekst] (Text) – the field into which the researched text is pasted from the clipboard.
• Генерувати [Heneruvaty] (Generate) – starts the generation of N-grams.
• Очистити [Ochystyty] (Clear) – clears the entered data.
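The nearest-author rule described above can be sketched as follows. This is an illustration under assumptions that the text does not fix: N-grams are taken inside words only, the author profile is an unweighted average of per-text densities (the weighting scheme is not recoverable from the text), and the distance is the sum of absolute differences of densities. All function names are hypothetical.

```python
import re
from collections import Counter


def ngram_density(text, n=3):
    """Relative frequency of each character n-gram, taken inside words only."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(w[i:i + n] for w in words for i in range(len(w) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def distance(p, q):
    """Sum of absolute differences between two n-gram densities."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))


def author_density(texts, n=3, exclude=None):
    """Average density over an author's texts; `exclude` leaves one text out."""
    profiles = [ngram_density(t, n) for i, t in enumerate(texts) if i != exclude]
    merged = Counter()
    for profile in profiles:
        for gram, value in profile.items():
            merged[gram] += value / len(profiles)
    return dict(merged)


def nearest_author(text, corpus, n=3):
    """Return the author whose average density is closest to the text."""
    p = ngram_density(text, n)
    return min(corpus, key=lambda a: distance(p, author_density(corpus[a], n)))


if __name__ == "__main__":
    corpus = {
        "author_a": ["мама мила раму", "мама мила маму"],
        "author_b": ["тато читав книгу", "тато писав листа"],
    }
    print(nearest_author("мама мила раму вранці", corpus))
```

When a text from the corpus itself is classified, passing its index as `exclude` to `author_density` keeps that text out of its own author's profile, which mirrors the exclusion described above.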

We compare three scientific and technical publications [53, 54, 55] with each other based on linguistic statistical analysis of 3-grams. Articles 1 and 2 were written by the same team [53, 54], and Article 3 was written by another author [55] (Table 17). The language of the text is Ukrainian (the alphabet has 33 letters, so there are $33^3 = 35937$ possible 3-grams).

Table 17: Parameter values for analysed articles 1–3

When comparing the articles, we take into account only those 3-grams that appear at least once in all three articles. For this particular example, there are 2147 such 3-grams. That is, we analyse 78.4814% of the 3-grams of Article 1, 72.6332% of Article 2 and 84.1271% of Article 3. Accordingly, the difference in the use of the corresponding 3-grams between Articles 1 and 2 is R12 = 56.5254%, between Articles 2 and 3 it is R23 = 69.4271%, and between Articles 1 and 3 it is R13 = 62.9839%. These indicators alone show that Articles 1 and 2 are more similar to each other than Articles 1 and 3 or Articles 2 and 3 (R23 > R12 by 12.9017%, R23 > R13 by 6.4432%, R13 > R12 by 6.4585%, i.e. R23 > R13 > R12). The smaller the Rij, the more likely the articles were written by the same author. In this case, Articles 1 and 2 are more likely to have been written by the same author/team than Articles 2 and 3 or Articles 1 and 3. Next, we analyse the use of individual 3-gram clusters in the corresponding articles and compare the obtained results (Table 18).

Table 18: The value of the parameters of the appearance of 3-grams for the analysed articles 1–3 (columns: 3-gram; average value of 1 appearance for Articles 1, 2, 3; match for article pairs 1–2, 2–3, 1–3, %; discrepancy for article pairs 1–2, 1–3, 2–3, %; rows: 3-gram clusters by initial letter а_ _, б_ _, в_ _, г_ _, д_ _, е_ _, є_ _, ж_ _, з_ _, и_ _, і_ _, ї_ _, й_ _, к_ _, л_ _, м_ _, н_ _, о_ _, п_ _, р_ _, ...)

According to Table 19 and Fig. 42a some of the letters in the Ukrainian language are used most often, others are much less common. For the most frequently used letters, the frequency of appearance of 3-grams with such initial letters will have an almost identical distribution (peak values on the graph Fig. 42a), but not for other letters.

Therefore, it is advisable to study only trigrams for initial letters that are less common in the texts of a specific language to determine the degree of belonging of the text to the corresponding author (for example, Fig. 42-Fig. 43).

According to these graphs, Article 1 and Article 2 were most likely written by the same author, although Article 1 and Article 3 could also appear to have been written by the same author (which is not the case). Articles 2 and 3, however, were written by different authors. Applying linguistic statistical analysis of 3-grams to a set of articles makes it possible to form a subset of publications that are similar in their linguistic characteristics. Imposing additional conditions on this subset in the form of further linguistic statistical analyses (set of keywords, stable phrases, stylometric and linguometric indicators, etc.) makes it possible to reduce this subset significantly, refining the list of the author's more likely works. For instance, an analysis of the content and frequency of appearance of service words alone separates Articles 1 and 3 into different subsets, leaving Articles 1 and 2 in one.

3.7. Analysis of the developed method of quantitative assessment of the potential author identification of a scientific and technical publication

The method consists of six algorithms for the analysis of Ukrainian-language texts.

Algorithm I. Pre-processing of data based on content analysis (parsing, segmentation and tokenization of text, as well as linguistic analysis of text).

Algorithm II. Calculation and analysis of the features of the author's speech style (frequency of word usage, volume of punctuation marks, sentences, symbols, words and the ratio of the number of marks and sentences).

Algorithm III. Calculation and analysis of the parameters of the author's speech style (speech coherence, syntactic complexity, lexical diversity, degree of concentration and exclusivity of the text).

Algorithm IV. Classification by parameters and lexical features of the textual content of other publications (application of classifiers such as fuzzy, SVM and a combination of the two).

Algorithm V. Performance analysis based on the obtained results to determine the accuracy of each classifier.

Algorithm VI. Determining a subset of potential authors by filtering the set of all researched authors through the analysis of features and style parameters (algorithms VIII–XI).

A lexer-type system (tokenizer, segmenter) has been developed as part of a text analyser based on tokenization (Fig. 44a). Tokens are extracted during the operation of the parser rules and are immediately checked for compliance with the conditions in the syntax rules to avoid generating absurdity (Fig. 44b).

Figure 44: a) the lexer analyses the text; b) the lexer analyses the text together with the parser

The rules help to solve several tasks, increasing the efficiency of the grammar engine, which loads the compiled rules during text parsing without wasting time on syntax parsing (Algorithm 12).

Algorithm 12 (VІІ). Text content segmenter

Step 1. Word recognition.

Step 2. Definition of token boundaries.

Step 3. Definition of complete word forms.

Step 4. Identification of indivisible tokens that contain dots, blanks, etc.

Step 5. Splitting the text into sentences.
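The segmenter steps above can be sketched with regular expressions. This is a simplified illustration rather than the described lexer: the abbreviation list is a hypothetical placeholder, words may contain an apostrophe or a hyphen, and a period ends a sentence only when it does not close a listed abbreviation.

```python
import re

# Hypothetical list of indivisible tokens that end with a dot.
ABBREVIATIONS = {"напр.", "рис.", "табл.", "ін."}

# A word: letters, optionally joined by an apostrophe or a hyphen.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’ʼ-][^\W\d_]+)*")


def tokenize(sentence):
    """Steps 1-3: recognise words and their boundaries."""
    return WORD_RE.findall(sentence)


def split_sentences(text):
    """Steps 4-5: split at '.', '!' or '?' unless the dot closes an abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]+(?=\s|$)", text):
        last_chunk = text[start:match.end()].split()[-1].lower()
        if last_chunk in ABBREVIATIONS:
            continue  # indivisible token, not a sentence boundary
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    text = ("Лексер аналізує текст, напр. речення. "
            "Сміх моєї донечки наповнює мене безмежним щастям!")
    for sentence in split_sentences(text):
        print(tokenize(sentence))
```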

In addition to defining the boundaries of tokens, the lexer also performs preliminary recognition of the morphological attributes of words, turning tokens into lexemes. When constructing Ukrainian-language sentences with direct word order, a distinction is made between the noun group Ñ and the verb group Ř (Fig. 45, Fig. 46).

Figures 45–46: Step-by-step derivation of the sentence сміх моєї донечки наповнює мене безмежним щастям [smikh moiei donechky napovniuie mene bezmezhnym shchastiam] (my little daughter's laughter fills me with boundless happiness) from the start symbol S. Legible intermediate steps:

7. (II.2) # Aч,од,н Ñч,од,н,3 Aж,од,р Ñж,од,р,3 Rод,тп,3 Nч,од,з,1 Ас,од,о Nс,од,о,3 #

10. (II.4) # Aч,од,н Nч,од,н Aж,од,р Nж,од,р,3 Rод,тп,3 Nч,од,з,1 Ас,од,о Nс,од,о #

Terminal rules applied, in order: IV.2 IV.6 IV.1 IV.7 IV.4 IV.6 IV.3, giving # сміх моєї донечки наповнює мене безмежним щастям #

We get a constituent tree, that is, the syntactic structure of the analysed sentence (Fig. 47). For dictionary lexemes, the dictionary article whose form is the lexeme is also defined. In alphabetic-frequency dictionaries, the characteristics of each word are defined (Fig. 48).

Figure 47: Constituent tree of the analysed sentence

Figure 48: a) The base of rules of the alphabetic-frequency dictionary of parts of speech, where A is a verb, other capital letters are additional characteristics of a verb, V is an adjective, and small letters of the English alphabet are characteristics of a noun; b) regular expressions of morphological analysis of nouns

The database stores regular expressions for bringing the word to the base (Fig. 49a-b), where the flag is the rule for identifying the type of word (for example, noun group, singular), mask – inflexions of the word (exceptions in square brackets), find – inflexions of the word in the nominative case, repl – inflexions of the word during declension (Fig. 49c).

Figure 49: Determining the basis of a word
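One possible reading of the flag/mask/find/repl fields is sketched below. The two rules are hypothetical examples invented for illustration; they are not taken from the database shown in Fig. 49, and the lookup order and matching strategy are assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    flag: str  # type of word the rule identifies
    mask: str  # regular expression for the inflexions the rule covers
    find: str  # inflexion in the nominative case
    repl: str  # inflexion that occurs during declension


# Hypothetical rules, for illustration only.
RULES = [
    Rule("noun group, plural, instrumental", r"ами$", "а", "ами"),
    Rule("noun group, singular, genitive", r"[^аеиіоу]ки$", "ка", "ки"),
]


def to_base(word):
    """Return (nominative form, flag) for the first matching rule.

    Words that match no rule are returned unchanged with flag None.
    """
    for rule in RULES:
        if re.search(rule.mask, word) and word.endswith(rule.repl):
            stem = word[: len(word) - len(rule.repl)]
            return stem + rule.find, rule.flag
    return word, None


if __name__ == "__main__":
    for w in ("книгами", "донечки", "щастя"):
        print(w, "->", to_base(w))
```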

The database (Fig. 49b) also contains a dictionary of service words, that is, words that serve as additional parameters for analysing the features of the author's speech style; taking them into account during the analysis of texts significantly affects the final result.

We will determine the optimal developed algorithm out of four (VIII-XI) for identifying the style of the author of the publication based on the analysis of his collective works.

Algorithm VIII. Filtering a set of analysed author's styles

int i=0, j=0;
while (i<4){
  int c1=0, c2=0, cc2=0;
  while (j<94){
    int s=0;
    while (l<12){
      if (((K[i][l]+abs(F[l]-K[i][l]))>A[j][l]) && ((K[i][l]-abs(F[l]-K[i][l]))<A[j][l])) s+=1;
      if ((l>6) && ((K[i][l]+abs(F[l]-K[i][l]))>A[j][l]) && ((K[i][l]-abs(F[l]-K[i][l]))<A[j][l])) cc2+=s;

Array K[i][l] contains the parameters and coefficients of style for 4 collective works (Table 20 and Table E of the Appendix, highlighted in yellow), some of whose authors are numbered 6 and 30 (highlighted in blue). Array A[j][l] contains the style features of 94 authors. Array F[l] contains the average values of the style features over the 94 authors. The algorithm determines whether the values of the parameters and coefficients of the speech style of the j-th author fall within the deviation limits [x_i − x_avg; x_i + x_avg] of the parameter and coefficient values of the style of the i-th collective work. Arrays A2 (authors for whom the values of most parameters and coefficients are similar to the style of collective work i) and A3 (authors for whom the values of most coefficients are similar to the style of collective work i) are filled through the filters. Next, a new subset of authors whose styles are more similar to that of the i-th collective work is formed from these arrays by superimposing a new filter.

Table 20: The result of the algorithm for analysing the style of a publication author on Victana [16]: 94 authors, more than 300 individual publications for the period 2001–2021

As a result, we get the values given in Table 21 (algorithm VIII). Column A is the result of analysing all the values of the vectors of coefficients and speech parameters of the authors from Table 20. Column B is the result of analysing only the last 5 columns of Table 20. With this algorithm, the results do not support the claim that the listed authors of these works actually wrote them: even the best results (highlighted in red) are not enough to state that they are the authors of more than 50% of these collective works. On the other hand, the algorithm is useful for reducing the number of candidate authors at the first stage of authorship determination (to 34.04% of the total number of project participants). This is necessary for further filtering through the analysis of service words (prepositions and conjunctions) and keywords, features of semantics and vocabulary when constructing sentences, etc.

Table 21: Experimental testing of algorithms I–IV on the Victana Web resource [16] (columns: Average value, A, Filter 2, B, IX, X, XI)

Next, we analyse algorithm IX. It differs from the previous one only in the condition in the third loop:

if (((K[i][l]+V[l])>A[j][l]) && ((K[i][l]-V[l])<A[j][l])) s+=1

where V[l] is an array of average absolute deviations of the data points from the average value. The results are given in Table 21 (algorithm IX). They are slightly improved, but not enough to claim that the authors numbered 6 and 30 are the real authors of collective works 1–4, although they wrote them. On the other hand, the number of authors with a similar style of speech increased slightly (to 38.56% of the total number of project participants). Now let us analyse algorithm X. In algorithm VIII, we again replace the condition in the third loop, this time with the following:

if (abs(A[j][l]-K[i][l])>abs(K[i][l]-F[l])) s+=1

As a result, we get the values given in Table 21 (algorithm X). The obtained values show that the style of the authors numbered 6 and 30 is quite close (more than 75–100%) to the style of collective works 1–4, respectively (positive results are highlighted in red). However, the number of authors with similarities in speech style increased significantly (to 42.02% of the total number of project participants). Moreover, many authors who were not selected at the previous stages of the study entered the list, while some who were selected at both previous stages dropped out of it. Now let us try to reduce the total number by applying algorithm XI to the initial data, that is, the parameters and speech coefficients of the 94 project participants. In algorithm X, we refine the condition in the third loop:

if (((abs(A[j][l]-K[i][l])>abs(K[i][l]-F[l])) && (abs(A[j][l]-F[l])>abs(K[i][l]-F[l]))) || ((abs(A[j][l]-K[i][l])<abs(K[i][l]-F[l])) && (abs(A[j][l]-F[l])<abs(K[i][l]-F[l])))) s+=1

As a result, we get the values given in Table 21 (algorithm XI). The obtained values also confirm that the style of the authors numbered 6 and 30 is quite close (more than 75–100%) to the style of collective works 1–4, respectively (positive results are highlighted in red). The number of authors with similarities in speech style was also significantly reduced (to 38.03% of the total number of project participants). Fig. 50 provides detailed graphs of the results obtained when applying algorithms VIII–XI (numbered 1–4, respectively) for the analysis of the method of determining the author's style developed by us.
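The four per-feature conditions of algorithms VIII–XI can be written as interchangeable predicates, which makes their differences easier to compare. The sketch below follows the conditions as they are stated in the text; the helper names, the toy data and the `min_share` acceptance threshold are assumptions introduced for illustration only.

```python
def cond_viii(a, k, f, v):
    """Author value a lies within K +/- |F - K|."""
    return k - abs(f - k) < a < k + abs(f - k)


def cond_ix(a, k, f, v):
    """Author value a lies within K +/- V (mean absolute deviation)."""
    return k - v < a < k + v


def cond_x(a, k, f, v):
    """Condition of algorithm X, as stated in the text."""
    return abs(a - k) > abs(k - f)


def cond_xi(a, k, f, v):
    """Condition of algorithm XI, as stated in the text."""
    d = abs(k - f)
    return ((abs(a - k) > d and abs(a - f) > d)
            or (abs(a - k) < d and abs(a - f) < d))


def feature_means(A):
    """F[l]: average value of each style feature over all authors."""
    return [sum(col) / len(col) for col in zip(*A)]


def mean_abs_deviation(A):
    """V[l]: average absolute deviation of each feature from its mean."""
    F = feature_means(A)
    return [sum(abs(x - f) for x in col) / len(col) for col, f in zip(zip(*A), F)]


def score(cond, K_i, A_j, F, V):
    """Number of features of author j that satisfy the condition for work i."""
    return sum(cond(a, k, f, v) for a, k, f, v in zip(A_j, K_i, F, V))


def shortlist(cond, K_i, A, F, V, min_share=0.5):
    """Indices of authors for whom at least min_share of features match."""
    return [j for j, A_j in enumerate(A)
            if score(cond, K_i, A_j, F, V) >= min_share * len(K_i)]


if __name__ == "__main__":
    A = [[9.0, 1.0], [20.0, 5.0], [10.0, 3.0]]   # toy features of three authors
    K_i = [10.0, 2.0]                            # toy features of one collective work
    F, V = feature_means(A), mean_abs_deviation(A)
    for cond in (cond_viii, cond_ix, cond_x, cond_xi):
        print(cond.__name__, shortlist(cond, K_i, A, F, V))
```

Keeping the condition as a parameter lets the same filtering loop be reused for all four variants, so only the predicate changes between runs.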

Further, to determine the author's style, an analysis of service words (prepositions and conjunctions) and keywords of the authors' works was used, since 38.03% of the participants remained in the candidate subset. Each individual has a particular vocabulary for conveying an opinion, including so-called "parasitic" words (тобто, отже, хоча [tobto, otzhe, khocha] (that is, therefore, although), etc.) and service words (і, та, й, але, хоч би [i, ta, y, ale, khoch by] (and, and, and, but, although), etc.).

Figure 50: Results of applying algorithms VIII–XI (Algorithm 1 – Algorithm 4) to the collective works (panels: Collective 1, Collective 2, Collective 3, Collective 4)

4. Conclusions

A method of determining stable word combinations was developed based on the identification of keywords of the Ukrainian-language text and analysis of the lexical speech coefficients of the author of the text in reference excerpts of the content, which made it possible to improve the accuracy of the method of determining the style of the author of the text by 9% based on statistical linguistics. The method uses Zipf's law in the formation of stable word combinations as keywords, taking into account the following rules of preliminary linguistic processing of the text: all stop words are removed; bigrams are formed only within the limits of punctuation marks; verbs and pronouns are treated as punctuation marks; verbs are identified by their inflexions; bigrams are formed from word bases without taking inflexions into account; adjectives are identified by their inflexions, and in bigrams from Ukrainian-language texts an adjective may occupy only the first position. A set of programs has been developed to identify stable phrases as keywords. An approach to the development of linguistic content analysis software for the determination of stable word combinations in the identification of keywords of Ukrainian-language and English-language textual content is proposed. The peculiarity of the approach is the adaptation of the linguistic statistical analysis of lexical units to the peculiarities of the constructions of Ukrainian and English words/texts. The proposed method of content analysis of English- and Ukrainian-language texts was tested experimentally for the determination of stable word combinations in the identification of keywords of technical texts.

A method of determining the author in Ukrainian-language texts has been developed based on the analysis of the coefficients of the author's lexical speech in a reference passage of the author's text. It relies on the analysis of a collection of keywords, stable phrases, indicators of linguometry and stylometry, as well as the results of the analysis of N-grams based on comparisons of differences in the use of 2-grams and 3-grams (for publications similar in style the differences are within [6;7]%, and for dissimilar ones they exceed 12%). This made it possible to identify a set of potential authors of publications with more than one author (up to [9;34]% of the total number of project participants) and to develop a method for identifying the author's style.

A method of identifying the style of the author of the text based on the analysis of the features of the author's speech style in a template passage of the author's text has been developed. The method consists of a comparative analysis of the author's attribution in a statistically processed work of the author (standard) with an arbitrary analysed passage. The method evaluates the degree of text belonging to the template of the author's style with the analysis of the corresponding coefficients of the lexical author's speech. Moreover, the method works under the condition that the template of the author's style is generated on reliable data. An analysis of reference words was used for attribution, the obtained results are presented in the form of correlation coefficients. Separately, we will mention the evolution of the significance of one of the parameters of the text - in the author's attribution of the texts.

An algorithm for identifying service words based on linguistic analysis of text content has been developed. For each of the passages, the absolute and relative frequencies of stop words were analysed and compared with the reference values. Therefore, the application of the method of reference words gives the following results: finding among the studied passages what most likely belongs to the standard. Other results confirm the effectiveness of the reference words method in the authorial attribution of texts. The proposed assumption about the insignificance of the influence of the share as a parameter of the method on the results led to a decrease in the correlation coefficients. However, to confirm or refute the fact that fractions are not a determining factor in the author's style, more thorough research must be performed. An algorithm for the lexical analysis of Ukrainian-language texts and an algorithm for syntactic analysis of text content has been developed. The peculiarities of the algorithms are the adaptation of the morphological and syntactic analysis of word forms to the peculiarities of the construction of Ukrainian words/texts. Belonging to a part of speech and declension within this part of speech were taken into account based on the analysis of inflexions and word bases according to regular expressions.

A comparison of the results of content monitoring on a set of 300 single-author works of a technical direction by 100 different authors for the period 2001–2021 was carried out to determine whether and how the coefficients of text diversity of these authors change in different periods. The best results according to the density criterion are achieved by the article analysis method without initial mandatory information such as abstracts and keywords in different languages, as well as a list of references. The method of identifying a potential author is decomposed based on the analysis of speech style parameters such as speech coherence, degree of syntactic complexity, lexical diversity, degree of concentration and exclusivity. Characteristics of the author's style were also analysed, such as the total number of words in the text, the number of unique words, the number of conjunctions/prepositions, the number of sentences, and the number of words with a frequency of 1 and ≥10. As an example, the 3-grams of 3 articles were analysed: 78.4814% of 3-grams for Article 1, 72.6332% for Article 2, and 84.1271% for Article 3. Accordingly, the difference in the use of the corresponding 3-grams between Articles 1 and 2 is R12 = 56.5254%, between Articles 2 and 3 it is R23 = 69.4271%, and between Articles 1 and 3 it is R13 = 62.9839%. These indicators alone show that Articles 1 and 2 are more similar to each other than Articles 1 and 3 or Articles 2 and 3 (R23 > R12 by 12.9017%, R23 > R13 by 6.4432%, R13 > R12 by 6.4585%, i.e., R23 > R13 > R12). The smaller the Rij, the more likely the articles were written by the same author. In this case, Articles 1 and 2 are more likely to have been written by the same author than Articles 2 and 3 or Articles 1 and 3.

This work solved an important scientific and applied problem of CLS analysis and synthesis for solving various problems of processing Ukrainian-language textual content based on the development of new and improvement of known NLP models, methods and tools.

During the execution of the work, the following results were obtained: 1. An analysis of the current state and prospects for IT development of natural language processing was carried out, which made it possible to define the problem and research objectives, as well as to form general research directions in the absence of non-commercial CLS with open source for processing Ukrainian-language textual content and a standardized design approach.

2. The relevance of solving the problem of analysis and synthesis of CLS based on the development of the general structure of the system for processing Ukrainian-language textual content is substantiated. The interaction of the main processes/components of IS with methods of linguistic processing of textual content adapted to the Ukrainian language (grapheme, morphological, lexical, syntactic, semantic, structural, ontological and pragmatic analysis) made it possible to improve the IT of intellectual analysis of the text flow for solving a specific NLP problem. This ensured the adaptation of NLP processes for the analysis of Ukrainian-language textual content and, based on them, increased the accuracy of the obtained results by 6-48%, depending on the specific NLP task. For example, for the NLP task of determining the keywords of the Ukrainian-language text, the density of keywords increases in the range [1.23; 1.48] times, or by [23.14; 47.83]%, depending on the quality/accuracy of filling the thematic dictionary through machine learning.

3. The methods of processing information resources such as integration, management and support of Ukrainian-language content have been improved, which made it possible to adapt the process of intellectual analysis of the text flow and develop metrics for the effectiveness of CLS functioning for the solution of various NLP tasks. The developed methods and tools make it possible to build CLS processing of Ukrainian-language text content according to the needs of the permanent/potential target audience based on the analysis of the history of actions of website users.

4. NLP methods based on pattern-matching regular expressions were improved, which made it possible to adapt the methods of tokenization and normalization of text by cascades of simple substitutions of regular expressions and finite state machines.

5. The MA method of the Ukrainian-language text based on word segmentation and normalization, sentence segmentation and a modified Porter's stemming algorithm was improved as an effective means of identifying the affixes of lemmas for marking the analysed word, which made it possible to increase the accuracy of keyword searches by 9%.

6. The IT of intellectual analysis of the text flow was improved based on the processing of information resources, which made it possible to adapt the generally typical structure of modules for integration, management and support of content to solve various NLP problems and increase the efficiency of CLS functioning by 6-9%. This became possible thanks to the combination of linguistic analysis methods adapted to the Ukrainian language, improved IT processing of information resources, ML and a set of metrics for evaluating the effectiveness of CLS functioning. The main principle of building such CLS is modularity, which facilitates their construction according to the requirements for the presence of appropriate processes for solving a specific NLP problem.

7. A method of determining the author in Ukrainian-language texts has been developed based on the analysis of the coefficients of the author's lexical speech in a reference passage of the author's text. It relies on the analysis of a collection of keywords, stable phrases, indicators of linguometry and stylometry, as well as the results of the analysis of N-grams based on comparisons of differences in the use of 2-grams and 3-grams (for publications similar in style the differences are within [6;7]%, and for dissimilar ones they exceed 12%). This made it possible to identify a set of potential authors of publications with more than one author (up to [9;34]% of the total number of project participants) and to develop a method for identifying the author's style.

8. A method of determining stable word combinations was developed based on the identification of keywords of the Ukrainian-language text and analysis of the lexical speech coefficients of the author of the text in reference excerpts of the content, which made it possible to improve the accuracy of the method of determining the style of the author of the text by 9% based on statistical linguistics.

9. The reliability of the scientific and practical results is confirmed by materials on the implementation of the dissertation research and by comparison of the practical results obtained on different samples of reliable input data. The CLS was developed on the information resource http://victana.lviv.ua using CMS Joomla! (for developing the e-framework of articles), PHP (for implementing text content processing methods), HTML (for page markup), CSS (for page styles), and MySQL (for storing data and dictionaries). The experimental study confirmed the reliability of the keyword identification method: across different algorithms for processing the primary text, the average coincidence of the lists of identified keywords with the authors' lists varies in the range of 52.6-68.5%. The accuracy of matching keywords with the authors' keywords ranges from 43.6 to 62.9%. The average match of meaningful keywords compared to all keywords found by the system ranges from 38.9 to 75.8%, depending on the stage of analysis of the article texts. The accuracy of matching keywords compared to all keywords found by the system varies between 34.3 and 71.9%, depending on the stage of analysis of the article texts.

References

[11] A. Rejeb, K. Rejeb, A. Appolloni, H. Treiblmaier, M. Iranmanesh, Exploring the impact of ChatGPT on education: A web mining and machine learning approach, The International Journal of Management Education 22(1) (2024) 100932.
[12] V. Kayser, E. Shala, Scenario development using web mining for outlining technology futures, Technological Forecasting and Social Change 156 (2020) 120086.
[13] M. Karp, N. Kunanets, Y. Kucher, Meiosis and litotes in The Catcher in the Rye by Jerome David Salinger: Text Mining, CEUR Workshop Proceedings 2870 (2021) 166-178.
[14] S. Kumar, A. K. Kar, P. V. Ilavarasan, Applications of text mining in services management: A systematic literature review, International Journal of Information Management Data Insights 1(1) (2021) 100008.
[15] L. Hickman, et al., Text preprocessing for text mining in organizational research: Review and recommendations, Organizational Research Methods 25(1) (2022) 114-146.
[16] Z. Yang, Z. Xiangyi, The Applicability of Zipf's Law in Report Text, Lecture Notes on Language and Literature 6(10) (2023) 57-64.
[17] Z. Wang, M. Ren, D. Gao, Z. Li, A Zipf's law-based text generation approach for addressing imbalance in entity extraction, Journal of Informetrics 17(4) (2023) 101453.
[18] A. Koshevoy, H. Miton, O. Morin, Zipf's law of abbreviation holds for individual characters across a broad range of writing systems, Cognition 238 (2023) 105527.
[19] C. Boyer, L. Dolamic, N. Grabar, Automated Detection of Health Websites' HONcode Conformity: Can N-gram Tokenization Replace Stemming?, Studies in Health Technology and Informatics 216 (2015) 1064.
[20] O. Bisikalo, V. Vysotska, Linguistic analysis method of Ukrainian commercial textual content for data mining, CEUR Workshop Proceedings 2608 (2020) 224-244.
[21] V. Vysotska, P. Pukach, V. Lytvyn, D. Uhryn, Y. Ushenko, Z. Hu, Intelligent Analysis of Ukrainian-language Tweets for Public Opinion Research based on NLP Methods and Machine Learning Technology, International Journal of Modern Education and Computer Science (IJMECS) 15(3) (2023) 70-93.
[22] V. Starko, A. Rysin, VESUM: A Large Morphological Dictionary of Ukrainian As a Dynamic Tool, CEUR Workshop Proceedings 3171 (2022) 61-70.
[23] V. Lytvyn, P. Pukach, V. Vysotska, M. Vovk, N. Kholodna, Identification and Correction of Grammatical Errors in Ukrainian Texts Based on Machine Learning Technology, Mathematics 11(4) (2023) 904.
[24] V. Starko, O. Synchak, Feminine Personal Nouns in Ukrainian: Dynamics in a Corpus, CEUR Workshop Proceedings 3396 (2023) 407-425.
[25] O. Synchak, V. Starko, Ukrainian Feminine Personal Nouns in Online Dictionaries and Corpora, CEUR Workshop Proceedings 3171 (2022) 775-790.
[26] V. Starko, Implementing Semantic Annotation in a Ukrainian Corpus, CEUR Workshop Proceedings 2870 (2021) 435-447.
[27] V. Starko, Semantic Annotation for Ukrainian: Categorization Scheme, Principles, and Tools, CEUR Workshop Proceedings 2604 (2020) 239-248.
[28] Keygeneratortext. URL: http://msurf.ru/tools/keygeneratortext/.
[29] Keygeneratorurl. URL: http://webmasta.org/tools/keygeneratorurl/.
[30] Keywordstext. URL: http://www.keywordstext.therealist.ru/.
[31] Keygeneratortext. URL: http://syn1.ru/tools/keygeneratortext/.
[32] Terminology extraction. URL: http://labs.translated.net/terminology-extraction/.
[33] Advego. URL: http://advego.ru/text/seo/.
[34] V. Vysotska, S. Holoshchuk, R. Holoshchuk, A comparative analysis for English and Ukrainian texts processing based on semantics and syntax approach, CEUR Workshop Proceedings 2870 (2021) 311-356.
[35] V. Vysotska, O. Markiv, S. Teslia, Y. Romanova, I. Pihulechko, Correlation Analysis of Text Author Identification Results Based on N-Grams Frequency Distribution in Ukrainian Scientific and Technical Articles, CEUR Workshop Proceedings 3171 (2022) 277-314.
[36] V. Vysotska, S. Mazepa, L. Chyrun, O. Brodyak, I. Shakleina, V. Schuchmann, NLP tool for extracting relevant information from criminal reports or fakes/propaganda content, in: Proceedings of the 2022 IEEE 17th International Conference on Computer Sciences and Information Technologies (CSIT), November 2022, pp. 93-98.
[37] V. Lytvyn, et al., Analysis of statistical methods for stable combinations determination of keywords identification, Eastern-European Journal of Enterprise Technologies 2/2(92) (2018) 23-37. doi: 10.15587/1729-4061.2018.126009.
[38] N. Kholodna, V. Vysotska, O. Markiv, S. Chyrun, Machine Learning Model for Paraphrases Detection Based on Text Content Pair Binary Classification, CEUR Workshop Proceedings 3312 (2022) 283-306.
[39] Y. Stepaniak, V. Vysotska, O. Markiv, L. Chyrun, S. Chyrun, L. Pohreliuk, Technology of Text Content Topic Classification Based on Machine Learning Methods, in: Proceedings of the IEEE 5th International Conference on Advanced Information and Communication Technologies (AICT), 2023, pp. 121-126.
[40] Y. Hlavcheva, O. Kanishcheva, M. Vovk, M. Glavchev, Using Topic Modeling for Automation Search to Reviewer, CEUR Workshop Proceedings 3171 (2022) 81-90.
[41] N. Khairova, A. Kolesnyk, O. Mamyrbayev, G. Ybytayeva, Y. Lytvynenko, Automatic Multilingual Ontology Generation Based on Texts Focused on Criminal Topic, CEUR Workshop Proceedings 2870 (2021) 108-117.
[42] V. I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals, Soviet Physics Doklady 10(8) (1966) 707-710.
[43] R. Bellman, R. Kalaba, Dynamic programming and statistical communication theory, Proceedings of the National Academy of Sciences of the United States of America 43(8) (1957) 749.
[44] R. Bellman, R. Kalaba, On the role of dynamic programming in statistical communication theory, IRE Transactions on Information Theory 3(3) (1957) 197-203.
[45] R. Bellman, Dynamic programming, Princeton University Press, Princeton, New Jersey, 1957.
[46] R. Bellman, On the approximation of curves by line segments using dynamic programming, Communications of the ACM 4(6) (1961) 284.
[47] R. A. Wagner, M. J. Fischer, The string-to-string correction problem, Journal of the ACM (JACM) 21(1) (1974) 168-173.
[48] D. Gusfield, Algorithms on strings, trees, and sequences: Computer science and computational biology, ACM SIGACT News 28(4) (1997) 41-60.
[49] G. D. Forney, The Viterbi algorithm, Proceedings of the IEEE 61(3) (1973) 268-278.
[50] V. Motyka, Y. Stepaniak, M. Nasalska, V. Vysotska, Lexical Diversity Parameters Analysis for Author's Styles in Scientific and Technical Publications, CEUR Workshop Proceedings 3403 (2023) 595-617.
[51] R. Romanchuk, V. Vysotska, V. Andrunyk, L. Chyrun, S. Chyrun, O. Brodyak, Intellectual Analysis System Project for Ukrainian-language Artistic Works to Determine the Text Authorship Attribution Probability, in: Proceedings of the 2023 IEEE 18th International Conference on Computer Sciences and Information Technologies, CSIT-2023, Lviv, 19-21 October 2023.
[52] V. Lytvyn, et al., Development of the quantitative method for automated text content authorship attribution based on the statistical analysis of N-grams distribution, Eastern-European Journal of Enterprise Technologies 6(2-102) (2019) 28-51. doi: 10.15587/1729-4061.2019.186834.
[53] V. Lytvyn, et al., Development of the linguometric method for automatic identification of the author of text content based on statistical analysis of language diversity coefficients, Eastern-European Journal of Enterprise Technologies 5(2(95)) (2018) 16-28. doi: 10.15587/1729-4061.2018.142451.
[54] V. Lytvyn, et al., Development of the system to integrate and generate content considering the cryptocurrent needs of users, Eastern-European Journal of Enterprise Technologies 1(2(97)) (2019) 18-39. doi: 10.15587/1729-4061.2019.154709.
[55] P. Kravets, The Game Method for Orthonormal Systems Construction, in: Proceedings of the 9th International Conference - The Experience of Designing and Applications of CAD Systems in Microelectronics, 2007. doi: 10.1109/cadsm.2007.4297555.

Appendices

Table A List by frequency rating of stable word combinations for 3 random articles

№ Author's Q A Victana.lviv.ua (according to FREG, t-test, Zipf's law) B

1 Стиль автора; Стоп-слово
2 Статистичний аналіз; Метод визначення
3 Лінгвістичний аналіз; Визначення стилю
4, 5 Квантитативна лінгвістика, Авторська атрибуція; Стиль автора, Аналіз уривку
6 Визначення стилю; Частота появи
7 Україномовні тексти; Автор тексту
8 Технологія лінгвометрії; Уривок тексту
9 Технологія стилеметрії; Коефіцієнт кореляції
10 Технологія глоттохронології; Дослідження тексту

F Коефіцієнт кореляції; Відносна частота; Частота появи; Стопове слово; Україномовний текст; Стиль автора; Поява слова; Авторська атрибуція; Визначення стилю; Слова уривку

2 G Коефіцієнт кореляції Відносна частота Частота появи Авторська атрибуція Стиль автора Україномовний текст Стопове слово Визначення стилю Поява слова Слова уривку Текстовий контент Ключове слово Тематичний словник Текстовий контент Тематичний словник Ключове слово Формування системою

Web Mining Лінгвістичний аналіз

Слова контенту Метод визначення

Текстовий контент Визначення слів Слов’янськомовні тексти Технологія NLP Аналіз статистики Ключове словосполучення Множина слів Інформаційний ресурс

Контент-аналіз Information resource

Content analysis Диспозиції особистості Соціальна мережа Ключове словосполучення Слова контенту Множина слів Формування системою Контент-аналіз Психологічна особистість Контент-аналіз Марковане слово Психологічний зріз Стан особистості Формування зрізу Зріз стану Зріз особистості In the work[1] in English

Reference fragment Words fragment Syntactic words Frequency fragment Swadesh list Stop words Author style Recognition author Author’s text

Anchor words In the work[2] in English

Text content Web mining Keywords text Keywords defined Analysis text Keywords content Content monitoring Content analysis Stop word

Author’s keywords In the work[3] in English

Content analysis Psychological personality Psychological state Social networks Marked words State personality Based analysis Psychological base State based

Based content Контент-аналіз Лінгвістичний аналіз Морфологічний аналіз Соціальна мережа Формування зрізу Зріз розуміння Розуміння особистості Україномовні тексти Big-Five Style of the author Statistical analysis Linguistic analysis Quantitative linguistics Author’s attribution Recognition of style Ukrainian texts Linguometry technology Stylemetry technology Glottochronology technology Web Mining Content monitoring Content analysis Porter stemmer Linguistic analysis Determining the keywords Slavic language Slavic texts Method for determining Web technology Content analysis Linguistic analysis Morphological analysis Social network Status of personality Personality understanding Formation of the status Stop words Method of formation Стоп- слово Тематичний словник Пости користувача Повідомлення користувача Користувач мережі Стан особистості Аналізована особистість Соціальна мережа Reference fragment Author’s style Author’s text Syntactic words Stop words Formatted fragments Anchor words Author’s language Method of anchor Frequency dictionary Text content Content analysis Analysis of statistics Defined systematically Stop word Potential keywords Content monitoring Author’s keywords Keywords content Direct word Psychological state Personality analysis Personality disposition Psychological analysis Personality model Stop words Psychological disposition Content monitoring

Social network

Table B Differences in methods according to the rating list of 100 stable word combinations

Q

G Контент-моніторінг

Контент-моніторінг Формування системою Web Mining Слова контенту Психологічна особистість Психологічний стан Формування зрізу Стан особистості Марковане слово Психологічний зріз Контент-аналіз Зріз стану Аналізована особистість Соціальна мережа Words fragment Reference fragment Stop words Swadesh list Recognition author Syntactic words Frequency fragment Author’s text Anchor words Author style Web mining Text content Keywords content Keywords text Keywords defined Stop word Analysis text Author’s keywords Content monitoring Content analysis Psychological personality Psychological state Content analysis Based analysis State personality Psychological base Social networks Marked words State based Psychological base Формування системою Web Mining Визначення слів Слова контенту Психологічна особистість Психологічний стан Формування зрізу Зріз стану Марковане слово Контент-аналіз Психологічний зріз Стан особистості Соціальна мережа Аналізована особистість Words fragment Reference fragment Recognition author Stop words Swadesh list Syntactic words Frequency fragment Author’s text Author style Anchor words Web mining Text content Keywords content Analysis text Keywords text Keywords defined Stop word Content monitoring Content analysis Author’s keywords Content analysis Psychological personality Psychological state Based analysis Psychological base State personality Social networks Psychological base Marked words State based A B C D F G ('контент_моніторінгу', 13) ENG ('swadesh_list', 18) ('based_on', 15) А2

UA ENG

ENG А3

UA А4

UA
('тематичного_словника', 11) ('слов_янськомовних', 10)
('based_on', 20) ('slavic_language', 15) ('author_s', 13)
(('ключових', 'слів'), 72) (('текстового', 'контенту'), 21) (('на', 'етапі'), 17) (('визначення', 'ключових'), 16) (('крок', '1'), 16) (('крок', '2'), 16) (('web', 'mining'), 15) (('слів', 'в'), 14) (('тематичного', 'словника'), 11) (('для', 'визначення'), 10)
(('of', 'the'), 134) (('in', 'the'), 61) (('by', 'the'), 45) (('analysis', 'of'), 39) (('of', 'a'), 31) (('the', 'text'), 30) (('the', 'system'), 30) (('to', 'the'), 29) (('of', 'keywords'), 28) (('text', 'content'), 27)
(('ключових', 'слів'), 74) (('слів', 'в'), 24) (('web', 'mining'), 22) (('текстового', 'контенту'), 21) (('на', '2'), 20) (('визначення', 'ключових'), 19) (('ключових', 'в'), 19) (('визначення', 'слів'), 18) (('слів', 'для'), 18) (('на', 'крок'), 18)
(('of', 'the'), 258) (('the', 'of'), 235) (('of', 'of'), 137) (('the', 'the'), 122) (('of', 'keywords'), 72) (('in', 'the'), 71) (('a', 'of'), 70) (('and', 'of'), 69) (('by', 'the'), 64) (('of', 'content'), 63)
(('text', 'content'), 30) (('web', 'mining'), 24) (('keywords', 'text'), 23) (('keywords', 'defined'), 22) (('stage', '1'), 20) (('analysis', 'text'), 18) (('step', '2'), 18) (('keywords', 'content'), 17) (('content', 'monitoring'), 17) (('step', '1'), 17)

Work [3]
('психологічного_стану', 16) ('формування_зрізу', 12) ('sfx_a', 12) ('структурну_схему', 7) ('відкритість_досвіду', 6) ('зрізу_психологічного', 2)
('based_on', 35) ('psychological_state', 26) ('social_networks', 22) ('his_her', 11) ('following_structural', 8) ('big_five', 7) ('let_us', 7) ('structural_scheme', 4)
(('на', 'основі'), 21) (('психологічного', 'стану'), 18) (('контент', 'аналізу'), 16) (('маркованих', 'слів'), 15) (('зрізу', 'психологічного'), 14) (('стану', 'особистості'), 14) (('формування', 'зрізу'), 12) (('особистості', 'на'), 12) (('sfx', 'a'), 12) (('основі', 'контент'), 11)
(('of', 'the'), 134) (('is', 'the'), 117) (('the', 'content'), 45) (('of', 'a'), 43) (('analysis', 'of'), 37) (('based', 'on'), 35) (('on', 'the'), 34) (('in', 'the'), 33) (('content', 'analysis'), 30) (('the', 'process'), 27)
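The listings above have the shape of (word pair, count) tuples. As a minimal sketch of how counts of this shape could be produced, the following Python fragment counts adjacent word pairs in a text. It is an illustration only, not the implementation used in the CLS (which is written in PHP): the function names `tokenize` and `bigram_counts`, the regular-expression tokenizer and the sample sentence are all assumptions, and the sketch is not expected to reproduce the figures in the table.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of letters/digits, optionally joined by an apostrophe
# (so that a word such as "слов’янськомовні" stays a single token).
TOKEN_RE = re.compile(r"[^\W_]+(?:['’ʼ][^\W_]+)*")


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return TOKEN_RE.findall(text.lower())


def bigram_counts(text, top=10):
    """Return the `top` most frequent adjacent word pairs as (pair, count) tuples.

    Pairs are taken over the whole token stream, so a pair may span a
    sentence boundary; no stop-word filtering is applied.
    """
    tokens = tokenize(text)
    pairs = zip(tokens, tokens[1:])
    return Counter(pairs).most_common(top)


if __name__ == "__main__":
    sample = ("Визначення ключових слів текстового контенту. "
              "Метод визначення ключових слів.")
    for pair, count in bigram_counts(sample, top=3):
        print(pair, count)
```

Filtering stop words before pairing, or restricting pairs to a single sentence, would change the ranking; which of these choices the original experiments made is not stated here.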

Column headers: Stop word; RF in fragment; RF

Stop word (count): але (1), в (2), для (3), до (1), з (1), і (14), й (1), мов (1), не (2), про (2), та (2), що (1)

Stop word (count): а (2), в (3), від (1), до (1), ж (1), з (2), за (1), і (2), й (2), на (1), над (1), не (2), ні (1), ось (1), от (1), се (1), хіба (1), хоч (1), що (2), як (1)

Stop word (RF): а (0.0116), але (0.0074), без (0.0008), бо (0.0012), в (0.0140), від (0.0034), ж (0.0033), з (0.0129), за (0.0053), і (0.0300), й (0.0038), на (0.0159), навіть (0.0011), не (0.0237), під (0.0011), таки (0.0004), тож (0.0001), у (0.0088), що (0.0206), щоб (0.0028), як (0.0060)

Stop word (RF): адже (0.0011), але (0.0074), би (0.0033), в (0.0140), ж (0.0033), з (0.0129), за (0.0053), і (0.0300), мов (0.0022), на (0.0159), не (0.0237), отсе (0.0003), при (0.0018), про (0.0040), се (0.0074), у (0.0088), чи (0.0027), що (0.0206), щоб (0.0028), як (0.0060)

Table E The result of the algorithm of analysis of the author's style of the publication

№: 1 2 3 4 5 6 7 8
Iwt: 0.76 0.74 0.78 0.74 0.74 0.75 0.75 0.74
Ikt: 0.015 0.012 0.016 0.019 0.012 0.015 0.019 0.013

Table F Frequencies of appearance of letters in the standard and the studied passages

Letter; Fragment 1: AF, RF; Fragment 2: AF, RF
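The letter-frequency columns of Table F can be illustrated with a short Python sketch. It assumes that AF and RF stand for absolute and relative frequency, which the table header does not spell out; the function name `letter_frequencies` and the sample fragment are hypothetical and are not taken from the study.

```python
from collections import Counter


def letter_frequencies(fragment):
    """Map each letter to (AF, RF) for one text fragment.

    AF is the number of occurrences of the letter; RF is AF divided by the
    total number of letters. Case is ignored and non-letters are skipped.
    """
    letters = [ch for ch in fragment.lower() if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {ch: (af, af / total) for ch, af in sorted(counts.items())}


if __name__ == "__main__":
    # Hypothetical fragment, used only to show the output format.
    for letter, (af, rf) in letter_frequencies("Стиль автора тексту").items():
        print(letter, af, round(rf, 4))
```

Computing the same mapping for the standard passage and for a studied passage gives two RF columns that can be compared letter by letter.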

[1] Y. H. Hu, C. T. Tai, K. E. Liu, C. F. Cai, Identification of highly-cited papers using topic-model-based and bibliometric features: The consideration of keyword popularity, Journal of Informetrics 14(1) (2020) 101004.
[2] Cheikhrouhou, et al., Multi-task learning for simultaneous script identification and keyword spotting in document images, Pattern Recognition 113 (2021) 107832.
[3] Kumar, Mahrishi, Meena, A comprehensive review of recent automatic speech summarization and keyword identification techniques, AI in Industrial Applications: Approaches to Solve the Intrinsic Industrial Optimization Problems, 2022, pp. 111-126.
[4] Kenekayoro, Author and keyword bursts as indicators for the identification of emerging or dying research trends, J. Sci. Res. 9(2) (2020) 120-126.
[5] Berko, Matseliukh, Ivaniv, Chyrun, Schuchmann, The text classification based on Big Data analysis for keyword definition using stemming, in: Proceedings of IEEE 16th International conference on computer science and information technologies, Lviv, Ukraine, 22-25 September, 2021, pp. 184-188.
[6] Taran, The Role of Keyword Language in the Database of World Slavic linguistics "iSybislaw", CEUR Workshop Proceedings 3171 (2022) 266-276.
[7] Bondarchuk, et al., Keyword-based Study of Thematic Vocabulary in British Weather News, CEUR Workshop Proceedings 3171 (2022) 451-460.
[8] O.V. Bisikalo, Wójcik, O.V. Yahimovich, Smailova, Method of determining of keywords in English texts based on the DKPro Core, in: Proceedings of SPIE - The International Society for Optical Engineering, 2016, 10031.
[9] Campos, et al., YAKE! Keyword extraction from single documents using multiple local features, Information Sciences 509 (2020) 257-289.
[10] P. S. Sharma, Yadav, R. N. Thakur, Web page ranking using web mining techniques: a comprehensive survey, Mobile Information Systems 2022(1) (2022) 7519573.