<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Linguistic intellectual analysis methods for Ukrainian textual content processing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Victoria Vysotska</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>Stepan Bandera 12, 79013 Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>3</fpage>
      <lpage>44</lpage>
      <abstract>
        <p>This paper considers the peculiarities of a method for the syntactic analysis of Ukrainian-language text content aimed at the automatic detection of significant keywords in input texts. The role and formal features of the parser in identifying the keywords of the content topic are defined, and the procedures of the proposed method are decomposed into 4 stages. Compared to well-known parsers, the proposed method provides self-improvement and self-learning of the automated keyword identification system through a mechanism that identifies significant statistical parameters within limits defined by the moderator. The experimental study confirmed the reliability of the method: for various methods of processing the primary text, the average coincidence of the lists of identified keywords with the authors' lists varies in the range of 52.6-68.5%. The accuracy of matching</p>
      </abstract>
      <kwd-group>
        <kwd>computer linguistics</kwd>
        <kwd>system</kwd>
        <kwd>NLP</kwd>
        <kwd>Ukrainian language</kwd>
        <kwd>information resource</kwd>
        <kwd>system modelling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The identification of keywords of the text content  ( ,  ,  ,  ,  ) →  ′ is a mapping of the
input text content  into the new state  ′, which, unlike the previous one, is supplemented with
a set of keywords as the main markers of the text content. For this purpose, the multi-level linear
(sequence) structure [1-3] and, if necessary, the hierarchical/network (interconnection) structure of the
text are investigated linguistically in terms of symbols, N-grams, morphological features, weights of words
and phrases, features of sentences and interconnected units (Fig. 1) [4-9].</p>
    </sec>
    <sec id="sec-2">
      <title>2. Models and methods</title>
      <sec id="sec-2-1">
        <title>2.1. Peculiarities of defining keywords of the Ukrainian-language text</title>
        <p>Web Mining technology applies methods of intellectual analysis to the flow of
information content in order to identify patterns on the Internet or on a Web-site [10-12]. The main
technology of Web Mining is Text Mining, which is used to extract structured/unstructured data
from Web-pages, Web-sites, link structures, etc. [13-15].</p>
        <p>Algorithm 1. Content keyword identification based on Web Mining</p>
        <p>Stage 1. Integration/downloading of textual content for further analysis.</p>
        <p>Stage 2. Grapheme analysis of textual content  .</p>
        <p>Step 1. Formatting of incoming text content, for example, unifying the apostrophes used in Ukrainian text.</p>
        <p>Step 2. Removal of the service part of the  content, such as tags.</p>
        <p>Step 3. Removal of the non-character part of  content, such as dates, numbers, financial symbols,
mathematical formulas, images, etc. Removal of special characters that are not included in the
alphabet, except for service ones such as space, and apostrophe.</p>
        <p>Step 4. Analysis of abbreviations and acronyms of content  . If one is used in the text but is absent from the
dictionary  , then go to Step 5, otherwise go to Step 6.</p>
        <p>Step 5. If necessary, edit the thematic dictionary  , for example, add new abbreviations or acronyms.</p>
        <p>Step 6. Segmentation of the input array of text  into sentences and paragraphs with appropriate marking
of the corresponding boundaries.</p>
        <p>Step 7. Segmentation of the sequence of symbols of sentences of content  into tokens.</p>
        <p>Stage 3. Morphological analysis of the Ukrainian-language text  .</p>
        <p>Step 1. Selection of bases (word forms without inflexions).</p>
        <p>Step 2. Analysis of the resulting inflexion to determine the part of speech.</p>
        <p>Step 3. Marking the word with the appropriate part of speech.</p>
        <p>Step 4. Marking of word forms with a collection of morphological features (case, gender, declension,
singular/plural, person, etc.).</p>
        <p>Step 5. If the part of speech of the word is a noun, mark it as a potential keyword. If the part of speech of the
word is an adjective, mark it and the next word (if it is a noun) as a phrase that could potentially
be a keyword.</p>
        <p>Step 6. Formation of a linear chain of labelled structures.</p>
        <p>Stage 4. Lexical analysis of the Ukrainian text  .</p>
        <p>Step 1. Search for the base in the dictionary of bases for further normalization, taking into account the part of
speech used in a specific place of the text  .</p>
        <p>Step 2. Normalization of marked morphological structures.</p>
        <p>Step 3. Segmentation and analysis of a chain of normalized tokens of content  into tokens and word
types taking into account marked sentence boundaries.</p>
        <p>Step 4. Formation of collections of tokens (sequences of symbols according to appropriate templates) as
lexemes with further identification of their types, taking into account their interrelationships in
the textual content  .</p>
        <p>Step 5. If the dimensionality of the text content is  N1, then step 9, otherwise step 5.</p>
        <p>Stage 5. Syntactic analysis of textual content  .</p>
        <p>Step 1. Selection of tokens  1 ∈  for text content  .</p>
        <p>Step 2. Identification of a sequence of tokens as an expression or sentence.</p>
        <p>Step 3. Identification of the nominal group of the expression based on the dictionary of word bases  .</p>
        <p>Step 4. Definition of the verb group of the sentence based on the dictionary of word bases  .</p>
        <p>Step 5. Formation of a left-to-right parsing tree of linguistic variables.</p>
        <p>Step 6. Analysis of noun phrase group for textual content  .</p>
        <p>Step 7. Analysis of the verb group of the sentence for textual content  .</p>
        <p>Step 8. Study of syntactic categories by word forms.</p>
        <p>Step 9. If not the end of content  , then go to step 2, otherwise go to step 9.</p>
        <p>Stage 6. Semantic analysis of the Ukrainian text  .</p>
        <p>Step 1. Expression tokens are compared with the semantic classes of the dictionary  .</p>
        <p>Step 2. Definition of morpho-semantic analogues for a specific sentence.</p>
        <p>Step 3. Combining tokens into a common structure.</p>
        <p>Step 4. Generating a tuple of superpositions of lexical functions and semantic classes.</p>
        <p>Stage 7. Referential analysis for determining interphrase unities of the text  .</p>
        <p>Step 1. Contextual analysis of  content for identification of local references (which, this, his) and
selection of utterances - kernels of unity.</p>
        <p>Step 2. Thematic analysis to highlight the thematic structure.</p>
        <p>Step 3. Identification of the identity of references; synonymizing, duplication and re-nomination of
tokens; implications based on situational connections.</p>
        <p>Stage 8. Structural analysis of textual content  .</p>
        <p>Step 1. Identification of the basic tuple of rhetorical connections between entities.</p>
        <p>Step 2. Construction of a nonlinear network of units.</p>
        <p>Stage 9. Identifying a set of content keywords  ( ,  ,  ,  ,  ) →  ′.</p>
        <p>Step 1. Formation of an alphabetic-frequency dictionary  = ( ,  ,  ).</p>
        <p>Step 2. Identification of terms ( ∈  1)( ∈  ) as nouns, noun phrases, an adjective with
a noun, or abbreviations.</p>
        <p>Step 3. Formation of a shortened list of words whose frequencies correspond to the conditions of
formation of potential keywords –   .</p>
        <p>Step 4. Determination of the level of uniqueness   ( ),  ∈  .</p>
        <p>Step 5.    calculation (number of characters without spaces) for  ∈  at  ≥ 80.</p>
        <p>Step 6. Calculation of   (keyword usage frequency). For terms with    ≤ 2000 the frequency is
  ∈ (6; 8]%, with 2000 &lt;    &lt; 3000 the frequency is   ∈ [4; 6]%, and with    ≥ 3000
the frequency is   ∈ [2; 4)%.</p>
        <p>Step 7. Calculation of the probability of using the keywords   (at the beginning of the text),   (in
the middle of the text content) and   (at the end of the text content).</p>
        <p>Step 8. Comparison of the   ,   and   values for keyword prioritization under the condition   ≫
  ≫   .</p>
        <p>Step 9. Sorting keywords according to defined priorities.</p>
        <p>Step 10. Comparison of   content with the  ℎ ∈  list.</p>
        <p>Step 11. Formation of a new list of   =   ℎ tokens.</p>
        <p>Step 12. Formation of the collection of keywords  ′ with  ∈  ,  = { ,
 ≥ 80,    ,   ,  ,   ,   }.</p>
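        <p>The grapheme-analysis steps of Stage 2 and the frequency bands of Stage 9, Step 6 can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the function names, the set of apostrophe variants and the treatment of all band limits as inclusive are assumptions made only for illustration.</p>
        <preformat>
```python
import re
from collections import Counter

# Assumed apostrophe variants to be unified with the ASCII apostrophe.
APOSTROPHES = "\u2019\u02bc\u0060"


def grapheme_clean(text):
    """Stage 2, Steps 1-3: unify apostrophes, drop tags and non-letter symbols."""
    for ch in APOSTROPHES:
        text = text.replace(ch, "'")
    # \x3c and \x3e are the opening and closing angle brackets of a markup tag.
    text = re.sub(r"\x3c[^\x3e]*\x3e", " ", text)
    # Keep letters, whitespace and apostrophes; drop digits and other symbols.
    return re.sub(r"[^\w\s']|[\d_]", " ", text)


def tokenize(text):
    """Stage 2, Step 7: split cleaned text into lower-case word tokens."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text)]


def frequency_band(n_chars):
    """Stage 9, Step 6: admissible keyword frequency (in %) by text length."""
    if n_chars >= 3000:
        return (2.0, 4.0)
    if n_chars > 2000:
        return (4.0, 6.0)
    return (6.0, 8.0)


def candidate_keywords(text):
    """Return tokens whose relative frequency falls inside the length-based band."""
    cleaned = grapheme_clean(text)
    tokens = tokenize(cleaned)
    if not tokens:
        return []
    n_chars = len(re.sub(r"\s", "", cleaned))  # characters without spaces
    low, high = frequency_band(n_chars)
    counts = Counter(tokens)
    total = len(tokens)
    return sorted(w for w, c in counts.items() if high >= 100.0 * c / total >= low)
```
        </preformat>
        <p>The sketch covers only cleaning, tokenization and frequency filtering; the morphological, lexical, syntactic and semantic stages of Algorithm 1 depend on dictionaries that are not reproduced here.</p>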
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Method of identifying keywords of Ukrainian-language content</title>
        <p>The analysis of the text flow of  content for the identification of keywords is usually
based on Zipf's law and reduced to the selection of words with an average frequency of
occurrence [16-18]. This is easy to implement for English-language texts but does not work directly for
Ukrainian-language texts: the parser and stemming algorithms must first be adapted to the
Ukrainian language based on thematic frequency dictionaries of word bases [19-27].</p>
        <p>Algorithm 2. Adaptation of parser/stemming algorithms of Ukrainian texts.</p>
        <p>Stage 1. Based on the parser, a set of words with a frequency of occurrence within a certain limit is
identified, for example, 4-6% for texts of ≤ 2000 characters without spaces.</p>
        <p>Stage 2. Based on the parser and stemming, a subset of frequently used semantically loaded words is
generated by removing/marking words found in the blocked dictionary, such as
prepositions, conjunctions, pronouns, verbs, particles, etc.</p>
        <p>Stage 3. If the keyword is an adjective (the inflexion of the normalized word is ий [yy]), then all bases to the
right of it are found in the text and a frequency dictionary is built for them. Phrases that
are used more often than the corresponding threshold value (but less often than the adjective itself) are
keywords. The threshold value is determined by the moderator. Repeat for multiple keywords.</p>
        <p>Stage 4. If the keyword is a noun (the inflexion of the word is not ий [yy]), then all bases and their
inflexions on both sides of it are examined.</p>
        <p>Step 1. All words to the left of the noun are analysed for the presence of the inflexion ий [yy] and compared
with the frequency dictionary. The set of words whose frequency of use exceeds the threshold value
is identified; these are new keywords.</p>
        <p>Step 2. All bases and their inflexions to the right are analysed: those without the inflexion ий [yy] and without inflexions of
other parts of speech (i.e. nouns only) are compared with the frequency dictionary, which
determines the set of keywords.</p>
        <p>Stage 5. The new subset is compared with the thematic dictionary of Ukrainian word bases to
form a set of keywords.</p>
        <p>Stage 6. If there is no analogue of the word, it is added to the thematic dictionary of word bases through the
buffer dictionary (edited by the moderator) to accumulate statistics for text content of various
styles.</p>
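        <p>Stage 3 of Algorithm 2 can be sketched in Python as follows. The sketch assumes that the input is already a list of normalized word forms, that only the immediate right-hand neighbour of an adjective is examined, and that a word ending in ий is an adjective; the function name and these simplifications are illustrative assumptions rather than the procedure implemented in the study.</p>
        <preformat>
```python
from collections import Counter

ADJ_INFLEXION = "ий"  # inflexion of a normalized adjective, as in Algorithm 2


def adjective_phrases(lemmas, threshold):
    """Return adjective + right-neighbour phrases used more often than `threshold`
    but less often than the adjective itself (Stage 3 of Algorithm 2)."""
    word_freq = Counter(lemmas)
    pair_freq = Counter(
        (left, right)
        for left, right in zip(lemmas, lemmas[1:])
        # Assumption: the right neighbour must not itself be an adjective.
        if left.endswith(ADJ_INFLEXION) and not right.endswith(ADJ_INFLEXION)
    )
    return {
        " ".join(pair): n
        for pair, n in pair_freq.items()
        if n > threshold and word_freq[pair[0]] > n
    }
```
        </preformat>
        <p>In this sketch the threshold plays the role of the moderator-defined value; calling the function with different thresholds shows how the moderator's choice widens or narrows the set of phrase keywords.</p>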
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments, results and discussion</title>
      <sec id="sec-3-1">
        <title>3.1. Content keyword identification based on Web Mining technology</title>
        <p>The experimental base for the research comprised 100 scientific articles from two issues of the "Information Systems and
Networks" series of the Bulletin of Lviv Polytechnic National University (http://science.lp.edu.ua/sisn): No. 783
(http://science.lp.edu.ua/SISN/SISN-2014) and No. 805
(http://science.lp.edu.ua/sisn/vol-cur-8052014-2). To achieve the goal of
the research, an information system (IS) was developed (Fig. 2) and placed on the Victana resource
(http://victana.lviv.ua/index.php/kliuchovi-slova) using the following tools: CMS Joomla! as the IS
e-framework, PHP for algorithm implementation, MySQL for storing data and dictionaries,
HTML for Web-page markup and CSS for Web-page styles.</p>
        <p>The developed IS has the following main components.</p>
        <p>1. A user-friendly dialogue web interface on the web page of the Ключові слова [Klyuchovi
slova] (Keywords) menu with the following sections (Fig. 2):
• Вибрати мову контенту [Vybraty movu kontentu] (Select the content language) –
one/several languages of the analysed text.
• Мін. вага слова, % [Min. vaha slova, %] (Min. word weight, %) – the percentage of the
weight of the keyword relative to the total number of words of the text, above which
keywords are selected; format ХХ.ХХ, within [00.01 - 99.99]; mandatory field.
• Help – short instructions in Ukrainian on a separate web page.
• Контент [Kontent] (Content) – field for the analysed text content.
• Ключові слова [Klyuchovi slova] (Keywords) – field in which the IS displays the set of keywords.
• Генерувати [Heneruvaty] (Generate) – starts the keyword identification process.
• Очистити [Ochystyty] (Clear) – clears the input field Контент [Kontent] (Content).
• Повторюваність слів, раз [Povtoryuvanistʹ sliv, raz] (Repetition of words, times) – the
number of repetitions of the keyword in the text.
• Рекомендовані рубрики [Rekomendovani rubryky] (Recommended headings) – a list of
thematic headings according to keywords.</p>
        <p>2. The main relations of the DB: word bases; prohibited words; rubrics; and rules for
reducing a word to its base.</p>
        <p>3. PHP functions for processing text content:
• get_keywords() – creates a list of keywords.
• get_word() – the record of the rules for reducing a word to its base.
• explode_str_on_words() – clears the received content of blocked words, special
characters, etc.
• blocked_words() – forms a list of blocked words depending on the selected language of
the content.
• count_words() – calculates keyword frequencies.
• set_keywords() – writes keywords to the DB if they are not yet present.
• recommend_rubric() – creates a list of recommended rubrics.
• error() – processes errors and sends a letter to the IS administrator.</p>
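        <p>The interplay between the blocked-word list and the Min. word weight, % field can be sketched as follows. The sketch is written in Python rather than PHP, and the function keywords_by_weight is hypothetical: it does not reproduce the PHP functions listed above, only the filtering idea described for the interface (weight of a word as a percentage of the total number of words of the text).</p>
        <preformat>
```python
from collections import Counter


def keywords_by_weight(words, blocked, min_weight_percent):
    """Return (word, count, weight %) for non-blocked words whose weight,
    relative to the total number of words, reaches the minimum weight."""
    total = len(words)
    if total == 0:
        return []
    counts = Counter(w for w in words if w not in blocked)
    return [
        (word, n, round(100.0 * n / total, 2))
        for word, n in counts.most_common()
        if 100.0 * n / total >= min_weight_percent
    ]
```
        </preformat>
        <p>For the five-word input система і контент і система with і blocked, a minimum weight of 25% keeps only система (weight 40%), while a minimum weight of 20% also keeps контент.</p>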
        <p>The dynamics of the module for determining the collection of keywords from
100 scientific and technical articles were studied in two stages, analysing:
• the content of the thematic dictionary and the set of blocked words;
• the content of the thematic dictionary and the set of blocked words refined based on ML,
since each subsequent verification of a text through the corresponding module
potentially generates an additional collection of unknown words (absent from both the list of
blocked words and the thematic dictionary).</p>
        <p>At each stage, the module verifies the text of the articles in two steps:
analysis of the entire article (Fig. 3a) and of the article without metadata (information about authors, title,
author keywords and annotations in several languages, reference list, etc.) (Fig. 3b), in order to analyse
the accuracy error of generating a collection of keywords in the presence of information noise.</p>
        <p>3.2. Results of an experimental study of Ukrainian-language content keyword identification</p>
        <p>The statistical analysis was based on a comparison of the sets of keywords defined by
the authors of the articles and those defined by the module at two different stages with different word
weights within [1,5] (the option *Мін.вага слова, % [*Min.vaha slova, %] (*Min. word weight,
%)) for full and abbreviated texts of the works (Table 1). The author-defined lists contain on average
4.77 keywords, which consist of approximately 9-10 words. Table 2 contains the
following notations: A (total identified keywords at a given word weight), B (formed significant
words without pronouns and verbs), C (coincidence of words with the author's list), D (accuracy
of the coincidence of identified keywords with the author's list), E (additional keywords identified by the module
but not defined by the author of the publication). Known keyword-identification IS are limited to texts within
[100 ÷ 1000] words [28-32].</p>
        <p>The disadvantage of these IS is the inaccurate and incorrect processing of Ukrainian-language
texts in the absence of competently constructed morphological dictionaries, dictionaries of
bases and blocked words. The main drawback of most such IS is also the limited volume of text content
they can process, [100 ÷ 1000] words (Fig. 4). The best IS for processing Ukrainian-language
textual content is [33] (Fig. 5); however, it does not identify the set of keywords but only the frequency
of use of words, phrases and parts of words. It does not work with word bases at all (ключових
[klyuchovykh] (keywords) and ключові [klyuchovi] (keywords) are treated as different words). The developed
resource works with word bases and is focused on Ukrainian/English texts (Fig. 1).
For the Ukrainian-language article [20], Victana reports the following keyword frequencies: слово [slovo] (word) – 120;
ключовий [klyuchovyy] (key) – 49; контент [kontent] (content) – 46; аналіз [analiz] (analysis)
– 39; Chomsky – 37; система [systema] (system) – 37. The keywords identified by the authors are: текст
[tekst] (text), україномовний [ukrayinomovnyy] (Ukrainian), алгоритм [alhorytm] (algorithm),
синтаксичний аналіз [syntaksychnyy analiz] (syntactic analysis), породжувальні граматики
[porodzhuvalʹni hramatyky] (generative grammars), лінгвістичний аналіз [linhvistychnyy
analiz] (linguistic analysis), контент-моніторінг [kontent-monitorinh] (content monitoring),
ключові слова [klyuchovi slova] (keywords), інформаційна лінгвістична система
[informatsiyna linhvistychna systema] (informational linguistic system), структурна схема
речення [strukturna skhema rechennya] (sentence structure scheme). Authors usually define
keywords that go beyond the Zipf-law patterns of word frequency distribution.</p>
        <p>The author of an article almost always chooses the number and content of the
keyword set at their own discretion, in the range of 2 to 10 word combinations (usually 3-5). The developed module
identifies a different number of words (from 0 to several dozen), depending on the writing style of the corresponding
author, the volume of the article, the genre, the topic, and the frequency of use of the
corresponding words. The coincidence of the sets of found keywords
with the author's, without taking into account the extra words defined by the authors (repetition
&gt; 30 for a text volume of more than 4800 words), is 83% for [33]; 57% for [32]; 35% for [31];
and 90% for http://victana.lviv.ua/kliuchovi-slova (Fig. 6). Fig. 7 demonstrates the features
of generating a set of probable keywords compared to an author set. The author of the article
often defines a larger number of words ( 2) and a smaller number of keywords ( 1) than are
present in the text. Fig. 7b shows the distribution of text density in articles, where the numbered series
denote the number of: 1 – pages, 2 – paragraphs, 3 – lines, 4 – words, 5 – characters, 6 – spaces and characters, 7 –
words per page, 8 – characters per page, 9 – spaces and characters per page.</p>
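        <table-wrap>
          <table>
            <thead>
              <tr><th>Marking</th><th>Chart column name</th><th>Explanation</th><th>Arithmetic average number of keywords (value)</th></tr>
            </thead>
            <tbody>
              <tr><td>1</td><td>Author's keywords</td><td>defined by the author</td><td>4.77</td></tr>
              <tr><td>2</td><td>Number of words</td><td>contained in the author's keywords</td><td>9.82</td></tr>
              <tr><td>3</td><td>Stage 1, Step 1</td><td rowspan="4">probable keywords found by the module at stage X and step Y (Fig. 8-Fig. 9)</td><td>5.46</td></tr>
              <tr><td>4</td><td>Stage 1, Step 2</td><td>6.51</td></tr>
              <tr><td>5</td><td>Stage 2, Step 1</td><td>7.43</td></tr>
              <tr><td>6</td><td>Stage 2, Step 2</td><td>8.35</td></tr>
            </tbody>
          </table>
        </table-wrap>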
        <p>The value of  3 differs from the value of  1 by 0.69 (in number, but not in content);
respectively,  4 differs from  1 by 1.74,  5 from  1 by 2.66, and  6 from  1 by 3.58. The value of  2 differs
from the value of  3 by 4.36; respectively,  2 differs from  4 by 3.31,  2 from  5 by 2.39, and  2 from  6
by 1.47. Adaptively changing the parameters/rules of the module increases the collection
of identified keywords by up to 1.75 times relative to  1 (the ratio to  1 is 1.144654 for  3,
1.36478 for  4, 1.557652 for  5 and 1.750524 for  6). The corresponding total increase depending
on the moderation of dictionaries is 14.46541% for  3, 36.47799% for  4, 55.7652% for  5 and
75.05241% for  6. The ratio of  2 to each of  3 ÷  6 gives the chain of values
1.7985; 1.5084; 1.3217; 1.176. For different stages and steps of the experiment of processing
the primary text, the average coincidence of the lists of identified keywords with the author's
keywords varies in the range of 52.6-68.5%. The accuracy of matching keywords with the
author's keywords ranges from 43.6 to 62.9%. The average match of meaningful keywords
compared to all keywords found by the system varies between 38.9 and 75.8%, depending on the stage of
analysis of the article texts. The accuracy of matching keywords compared to all keywords found by
the system ranges from 34.3 to 71.9%, depending on the stage of analysis of the article texts.</p>
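        <p>The differences, ratios and percentage increases quoted above follow directly from the six average values in the preceding table. The short Python check below recomputes them; the dictionary keys 1 to 6 stand for the six markings of the table, and the function names are introduced here only for illustration.</p>
        <preformat>
```python
# Average numbers of keywords from the table, keyed by marking index.
AVERAGES = {1: 4.77, 2: 9.82, 3: 5.46, 4: 6.51, 5: 7.43, 6: 8.35}


def difference(a, b):
    """Absolute difference between two averages, rounded as in the text."""
    return round(abs(AVERAGES[a] - AVERAGES[b]), 2)


def ratio(a, b, digits=6):
    """Ratio of average `a` to average `b`."""
    return round(AVERAGES[a] / AVERAGES[b], digits)


def growth_percent(a, b):
    """Percentage increase of average `a` over average `b`."""
    return round((AVERAGES[a] / AVERAGES[b] - 1.0) * 100.0, 5)


for k in (3, 4, 5, 6):
    print(k, difference(k, 1), ratio(k, 1), growth_percent(k, 1),
          difference(2, k), ratio(2, k, 4))
```
        </preformat>
        <p>Each printed row lists, for markings 3 to 6, the difference from marking 1, the ratio to marking 1, the percentage increase over marking 1, the difference from marking 2 and the ratio of marking 2 to the given marking.</p>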
        <p>For  3, the module most often identified {5, 7, 3} keywords (10), although
the distribution of found keywords was within [1;18] words (except 17). For  4, the IS also most often identified
{5, 7, 3} keywords; the distribution of found keywords remained
within [1;18] (except 17), but the number of identified words increased and the highest reliability
index was achieved. For  5, the module most often identified {7, 6, 5,
10, 8} keywords, and the distribution of found keywords was within [2;14] (the range narrowed
significantly). For  6, the module most often identified {8, 5, 7, 10} keywords, and the
distribution of identified keywords was within [3;16] (accuracy improved). The accuracy of
keyword identification increases in the course of the moderation of dictionaries and of the
ML module. The difference between the number of words in the author-defined keywords ( 2) and the number of keywords identified
by the module at  3 is 44.39919%.</p>
        <p>Descriptive statistical data of keyword identification in experiments</p>
        <p>[Table rows: Name; Average; Standard error; Median/Mode; Standard deviation; Sampling variance; Excess; Asymmetry; Interval; Minimum/Maximum; Sum; Score; Biggest(1)/Smallest(1); Reliability level (95.0%).]</p>
        <p>Statistical data of histogram construction for  3 and  3 6 (Fig. 10)</p>
        <p>[Table: bins N = 1 to 18 and More; surviving values: 72.73; 85.86; 90.91; 97.98; 85.86; 89.90.]</p>
        <p>[Figure: horizontal axis, article number 1 to 100; caption fragment: specifying the thematic dictionary (blue: filtered text, orange: general text).]</p>
        <p>The difference decreases to 33.70672% for  4, to 24.33809% for  5, and
to 14.96945% for  6, i.e. accuracy improves (Table 4). Table 5 shows data from research articles when generating sets
of keywords (Fig. 10). The analysis was performed for 100 filtered texts without metadata and for the
unfiltered texts. The average keyword density obtained for the 100 filtered texts (0.28) and for the unfiltered
texts (0.19) shows that such filtering of scientific articles improves the density of keywords by
1.48 times, or by 47.83% (Fig. 11a). The average values obtained for the 100 texts (0.34 for filtered and
0.25 for unfiltered texts), taking into account the refinement of the thematic dictionary due to the addition
of blocked words, show that filtering with simultaneous moderation of the thematic dictionary
improves keyword density by 1.35 times, or by 35.44% (Fig. 11b).</p>
        <p>A comparison of the values for the original author's text, 0.19 without and 0.25
with the refinement of the thematic dictionary, demonstrates the
effectiveness of moderation of the thematic dictionary in the initial text: the density of
keywords increases 1.34 times, or by 34.33% (Fig. 12a). A comparison of the values for the filtered author's
text, 0.28 without and 0.34 with the refinement of the thematic dictionary,
demonstrates the effectiveness of the moderation of the thematic dictionary in the
filtered text: the density of keywords increases 1.23 times, or by 23.14% (Fig. 12b).</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Analysis of methods for identifying stable phrases as keywords</title>
        <p>The identification of stable phrases consists of the following stages: morphological analysis
(MA), syntactic analysis (SYA), selection of keywords and analysis of key phrases for stability (Fig. 13) [34-37].</p>
        <p>[Fig. 13, block labels: lexical analysis (LA); finite automata; syntactic analysis (SYA); context-free grammars; contextual analysis; attributive grammars; output generation; syntax-controlled translation; output optimization; flow analysis; definition of stable phrases; decision tables; token flow; parse tree; attributive tree; intermediate form (prefix, postfix, triples, etc.); object module; object tables; diagnostics.]</p>
        <p>For Ukrainian-language texts, it is best to use a combination of procedural, tabular, and
statistical stemming approaches. In the procedural approach to MA, emphasis is placed on the use
of ready-made dictionaries of bases and dictionaries of ready-made forms (DRF) in the analysis
of words. The MA algorithm then consists of the following steps: search in the DRF, base
selection, and base search in the dictionary. Most MAs of the Ukrainian language are based on
a tree or a finite-state automaton (FSA) (Fig. 14).</p>
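        <p>The three steps of the procedural MA algorithm (search in the dictionary of ready-made forms, base selection, base search in the dictionary) can be sketched with a prefix tree. The class BaseTrie, the function analyse and the longest-prefix strategy for base selection are assumptions chosen for illustration; the dictionaries of the developed IS are not reproduced here.</p>
        <preformat>
```python
class BaseTrie:
    """Prefix tree over word bases; each node maps a character to a child node."""

    def __init__(self):
        self.children = {}
        self.is_base = False

    def insert(self, base):
        node = self
        for ch in base:
            node = node.children.setdefault(ch, BaseTrie())
        node.is_base = True

    def longest_base(self, word):
        """Return the longest dictionary base that is a prefix of `word`."""
        node, best = self, ""
        for i, ch in enumerate(word):
            node = node.children.get(ch)
            if node is None:
                break
            if node.is_base:
                best = word[: i + 1]
        return best


def analyse(word, ready_forms, bases):
    """Return (base, inflexion) for a word form."""
    if word in ready_forms:            # step 1: dictionary of ready-made forms
        return ready_forms[word], ""
    base = bases.longest_base(word)    # steps 2-3: base selection and lookup
    return base, word[len(base):]
```
        </preformat>
        <p>With the bases систем and ключов inserted, the word form системи is split into the base систем and the inflexion и, and ключовий into ключов and ий; a word with no known base is returned with an empty base, which is where the buffer dictionary of Algorithm 2 would come into play.</p>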
        <p>The type of word is determined by the form of its inflexion (Fig. 13). The algorithm works with
individual words, so the content of the word is not taken into account. Parts of speech (adjective,
noun, etc.) and categories of morphology (stem, suffix, etc.) are likewise not available to it. The
stemming rules for Ukrainian words cover the following cases: a short word remains unchanged; a word changes during
stemming (as an exception); a word does not change during stemming (as an exception); a word matches a
regular expression; the ending changes; the ending remains unchanged; or the inflexion is cut off from
the word. All this significantly complicates the keyword identification algorithm. Therefore, first
of all, it is necessary to analyse widespread inflexions. Syntax is the set of rules for combining words into
correct expressions, i.e. word combinations and sentences (compare: programming language
syntax). The task of the SYA (parser) is to construct the syntactic structure of the input sentence.
The aspects of SYA implementation are dictionaries (information about individual language units),
formal rules, and interaction with neighbouring processing levels (morphological analysis,
semantic analysis). Context-free grammar (CFG) rules are most often used in SYA. A CFG is a tuple &lt;N, T, X, R&gt;,
where N is a set of non-terminal symbols, T is a set of terminal symbols (N ∩ T = ∅), X is the axiom
(X ∈ N), and R is a set of transformation (substitution) rules of type A → α, where A ∈ N and α is a list
of terminal and non-terminal symbols. CFG example:</p>
        <p>= { ,  ,  ,  ,  ,  }, S,  ={система, рубрикувати, україномовний, контент, за,
ключовий, слово} [ ={systema, rubrykuvaty, ukrayinomovnyy, kontent, za, klyuchovyy,
slovo}] ( ={system, categorize, Ukrainian-language, content, by, key, word}),
 = { →  ,  →  ,  →  ,  →  ,  →  ,  → система, 
→ рубрикувати,  → україномовний,  → ключовий,  → контент, 
→ слово,  → за}.</p>
        <p>The disadvantage of using CFG is the periodic appearance of ambiguity in SYA, for example, for the sentence
"The system categorizes Ukrainian-language content by keywords" (Fig. 15). Examples of
well-known SYA systems for English texts are "Machinese Phrase Tagger" (Fig. 16) and VISL. No
online information resource is available for SYA of Ukrainian texts. "Ontology Matcher Demo"
uses Machinese metadata to find ontology objects in the text (Fig. 17).</p>
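        <p>The ambiguity mentioned above can be made concrete by counting the parse trees that a CFG assigns to the example sentence. The Python sketch below uses an assumed grammar: the non-terminal names S, NP, VP, PP, N, V, A and P are read off the node labels of Fig. 15, while the two attachment rules NP → NP PP and VP → VP PP are added here only to expose the ambiguity and are not claimed to be part of the grammar of the study.</p>
        <preformat>
```python
from functools import lru_cache

# Assumed binary rules: (left child, right child) -> possible parents.
BINARY = {
    ("NP", "VP"): ["S"],
    ("A", "N"): ["NP"],
    ("NP", "PP"): ["NP"],   # the prepositional phrase attaches to the noun phrase
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],   # the prepositional phrase attaches to the verb phrase
    ("P", "NP"): ["PP"],
}

# Assumed lexicon for the word forms of the example sentence.
LEXICON = {
    "система": {"N", "NP"},
    "рубрикує": {"V"},
    "україномовний": {"A"},
    "контент": {"N"},
    "за": {"P"},
    "ключовими": {"A"},
    "словами": {"N"},
}


def count_parses(tokens, goal="S"):
    """Count the distinct parse trees deriving `tokens` from `goal`."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(symbol, start, end):
        if end - start == 1:
            return 1 if symbol in LEXICON.get(tokens[start], ()) else 0
        total = 0
        for (left, right), parents in BINARY.items():
            if symbol in parents:
                for mid in range(start + 1, end):
                    total += count(left, start, mid) * count(right, mid, end)
        return total

    return count(goal, 0, len(tokens))
```
        </preformat>
        <p>Under these assumptions the sentence система рубрикує україномовний контент за ключовими словами receives two parse trees, one for each attachment of the prepositional phrase, whereas the same sentence without за ключовими словами receives exactly one.</p>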
        <p>[Fig. 15, parse tree: node labels S, NP, VP, V, A, N, P, PP over the sentence Система рубрикує україномовний контент за ключовими словами.]</p>
        <p>… hat)) → (the boy) with (the hat)) of Ukrainian-language texts for the identification of stable
word combinations when defining keywords is presented in Fig. 19.</p>
        <p>To select stable word combinations in the analysed texts and carry out their comparative
analysis, we will use 4 different methods: FREG (frequency + morphological patterns, i.e. direct
counting of the number of words); the t-test; the χ² statistic; and the likelihood ratio (LR).</p>
        <p>A collocation is a word combination that forms a semantically and syntactically coherent linguistic unit, in which
one part is chosen according to meaning and the other depends on the first (for example, in
ставити умови [stavyty umovy] (to set conditions) the choice of the verb ставити [stavyty]
(to set) is determined by tradition and depends on the noun умови [umovy] (conditions);
with the word пропозицію [propozytsiyu] (offer) a different verb is used, вносити [vnosyty]
(to enter)). This is a limited (selective) combination of words: phraseological units, idioms,
proper names and trademarks. Collocations often include complex names (for example, крейсер
москва [kreyser moskva] (moscow cruiser), руський корабль [rusʹkyy korablʹ] (russian ship),
безпілотник Байрактар [bezpilotnyk Bayraktar] (Bayraktar drone), від’ємний наступ
[vidʺyemnyy nastup] (negative attack), німецькі леопарди [nimetsʹki leopardy] (German
leopards), жест доброї волі [zhest dobroyi voli] (goodwill gesture), etc.). Other names for the
same phenomenon are stable phrases and N-grams. Examples of collocations:
• Грати роль [hraty rolʹ] (to play a role), мати значення [maty znachennya] (to have a meaning),
впливати [vplyvaty] (to influence), справляти враження [spravlyaty vrazhennya] (to make an
impression);
• Засоби масової… [zasoby masovoyi…] (means of mass...), зброя масової… [zbroya masovoyi…]
(weapons of mass...), вищий навчальний …. [vyshchyy navchalʹnyy ….] (higher education);
• глибокий старець [hlybokyy staretsʹ] (deep old man) vs. поверхневий/мілкий невеликий юнак
[poverkhnevyy/milkyy nevelykyy yunak] (superficial/shallow little young man);
• міцний чай [mitsnyy chay] (strong tea) vs. сильний чай [sylʹnyy chay] (strong tea);
• Кока-кола [Koka-kola] (Coca-Cola), Microsoft Windows;
• Гола Пристань [Hola Prystanʹ] (Hola Prystan), Нова Каховка [Nova Kakhovka] (Nova Kakhovka),
Володимир Волинський [Volodymyr Volynsʹkyy] (Volodymyr Volynsky), Володимир Зеленський
[Volodymyr Zelensʹkyy] (Volodymyr Zelensky), Нью Йорк [Nʹyu York] (New York), Стив Джобс
[Styv Dzhobs] (Steve Jobs).</p>
        <p>1. The FREG method is a direct calculation of the frequency of use of word pairs (triples). For
example, FREG for the sentence В літературі описано декілька підходів до автоматичного
виділення стійких словосполучень [V literaturi opysano dekilʹka pidkhodiv do
avtomatychnoho vydilennya stiykykh slovospoluchenʹ] (In the literature, several approaches to
the automatic selection of stable word combinations are described) gives: в літературі
[v literaturi] (in the literature); літературі описано [literaturi opysano] (described in
the literature); описано декілька [opysano dekilʹka] (several are described); декілька підходів
[dekilʹka pidkhodiv] (several approaches); підходів до [pidkhodiv do] (approaches to); до
автоматичного [do avtomatychnoho] (to automatic); автоматичного виділення
[avtomatychnoho vydilennya] (automatic selection); виділення стійких [vydilennya
stiykykh] (allocation of persistent); стійких словосполучень [stiykykh slovospoluchenʹ]
(stable phrases). Unfortunately, applying this method to large volumes of text produces
so-called "garbage" due to the high frequency of service words. The method therefore also requires
consideration of the frequency of occurrence and the patterns of word combinations.</p>
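        <p>The direct counting in FREG can be sketched in Python as follows. This is a minimal illustration, not the system described here: the function name freg_bigrams and the word pattern are assumptions, and no morphological patterns or service-word filtering are applied.</p>
        <preformat><![CDATA[
```python
import re
from collections import Counter

# Assumed word pattern: runs of letters, optionally joined by a hyphen or apostrophe.
WORD = re.compile(r"[^\W\d_]+(?:[-'’ʼ][^\W\d_]+)*")

def freg_bigrams(sentence):
    """Count adjacent word pairs (bigrams) in one sentence."""
    words = WORD.findall(sentence.lower())
    return Counter(zip(words, words[1:]))

sentence = ("В літературі описано декілька підходів до "
            "автоматичного виділення стійких словосполучень")
counts = freg_bigrams(sentence)
for pair, freq in counts.most_common():
    print(" ".join(pair), freq)
```
]]></preformat>
        <p>For the example sentence, each of the nine adjacent pairs listed above is counted once; on a larger text the same counter accumulates frequencies across sentences.</p>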
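        <p>2. The t-test method consists of statistical hypothesis testing and an MA statistical model using
<inline-formula><tex-math>H_0</tex-math></inline-formula>: the words met by chance, i.e. <inline-formula><tex-math>P(w_1 w_2) = P(w_1)P(w_2)</tex-math></inline-formula>. It takes into account not only the frequency of pairs but
also the frequency of the individual words that make up a pair: <inline-formula><tex-math>t = \frac{\bar{x} - \mu}{\sqrt{s^2/N}}</tex-math></inline-formula>, where <inline-formula><tex-math>\bar{x}</tex-math></inline-formula> is the sample
average, <inline-formula><tex-math>\mu</tex-math></inline-formula> is the theoretical average, <inline-formula><tex-math>s^2</tex-math></inline-formula> is the empirical dispersion, and <inline-formula><tex-math>N</tex-math></inline-formula> is the empirical sample size. The
method is not completely correct for language data, but it gives usable results in practice. For
example, for the stable phrase контент аналіз [kontent analiz]
(content analysis) in [37] with <inline-formula><tex-math>P(\text{контент}) = 85/4338</tex-math></inline-formula> and <inline-formula><tex-math>P(\text{аналіз}) = 53/4338</tex-math></inline-formula>, the hypothesis is
<inline-formula><tex-math>H_0: P(\text{контент аналіз}) = P(\text{контент})P(\text{аналіз}) \approx 2.39 \cdot 10^{-4}</tex-math></inline-formula>. In the Bernoulli scheme, <inline-formula><tex-math>s^2 = p(1 - p) \approx p</tex-math></inline-formula>,
and at <inline-formula><tex-math>\bar{x} = 18/4338</tex-math></inline-formula> we get <inline-formula><tex-math>t \approx 3.997955</tex-math></inline-formula>.</p>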
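        <p>3. Pearson's <inline-formula><tex-math>\chi^2</tex-math></inline-formula> method is applied to 2x2 tables (Table 6). Normality is not expected in the
calculations. For a 2x2 table of observed counts <inline-formula><tex-math>O_{ij}</tex-math></inline-formula> with total <inline-formula><tex-math>N</tex-math></inline-formula>, the statistic is
<inline-formula><tex-math>\chi^2 = \frac{N (O_{11} O_{22} - O_{12} O_{21})^2}{(O_{11} + O_{12})(O_{11} + O_{21})(O_{12} + O_{22})(O_{21} + O_{22})}</tex-math></inline-formula>.</p>
        <table-wrap>
          <label>Table 6</label>
          <caption><p>An example of using Pearson's χ² method</p></caption>
          <table>
            <thead>
              <tr><th/><th>w1 = контент</th><th>w1 ≠ контент</th></tr>
            </thead>
            <tbody>
              <tr><td>w2 = аналіз</td><td>18 (контент аналіз)</td><td>35 (e.g., статистичний аналіз)</td></tr>
              <tr><td>w2 ≠ аналіз</td><td>67 (including контент моніторинг)</td><td>4218 (including статистичний моніторинг)</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>4. The LR method consists of comparing two hypotheses (<inline-formula><tex-math>p_1 \gg p_2</tex-math></inline-formula>):
<inline-formula><tex-math>H_1: P(w_2 \mid w_1) = p = P(w_2 \mid \neg w_1)</tex-math></inline-formula> and <inline-formula><tex-math>H_2: P(w_2 \mid w_1) = p_1 \neq p_2 = P(w_2 \mid \neg w_1)</tex-math></inline-formula>,
where <inline-formula><tex-math>p = \frac{c_2}{N}</tex-math></inline-formula>; <inline-formula><tex-math>p_1 = \frac{c_{12}}{c_1}</tex-math></inline-formula>; <inline-formula><tex-math>p_2 = \frac{c_2 - c_{12}}{N - c_1}</tex-math></inline-formula>. Then, using the binomial distribution
<inline-formula><tex-math>b(k, n, x) = \binom{n}{k} x^k (1 - x)^{n - k}</tex-math></inline-formula>, we get the LR likelihood ratio
<disp-formula><tex-math>\log \lambda = \log \frac{b(c_{12}, c_1, p)\, b(c_2 - c_{12}, N - c_1, p)}{b(c_{12}, c_1, p_1)\, b(c_2 - c_{12}, N - c_1, p_2)},</tex-math></disp-formula>
where <inline-formula><tex-math>-2 \log \lambda</tex-math></inline-formula> is asymptotically distributed as <inline-formula><tex-math>\chi^2</tex-math></inline-formula>.</p>
        <p>The term extraction experiment was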
conducted on 3 articles from different SAs. The templates for the experiment are: [Adjective +
Noun], [Adjective + Noun], [Noun + Noun in the genitive case], [Noun + Noun in the instrumental
case], [Noun + '-' + Noun]. During the experiment, 6 methods were used: keywords manually
determined by the authors of the articles (A); determined by the Victana.lviv.ua system, taking
into account Zipf's law (B); frequency + morphological patterns FREG (C); t-test (D); likelihood
ratio LR (F); χ² statistic (G). An analysis of 3 articles in Ukrainian and translated into English was
conducted (Table A–Table B of the Appendix). Keywords that occur in the results of all methods are
highlighted in bold, those found only by methods B–G in italics, and those found by methods A and C–G are underlined. When
conducting the linguistic analysis, the following rules were used to form alphabetic-frequency
dictionaries of two-word combinations:
• Bigrams were formed within the boundaries of punctuation marks (if there was any
punctuation mark between two words, these words were not considered a bigram);
• An alphabetic-frequency dictionary of two-word combinations (bigrams) was formed from the
word bases, followed by content analysis of these bigrams;
• Verbs, identified by analysing the inflexions of the words, were not taken into account
when forming the bigram alphabetic-frequency dictionary (verbs were treated as
punctuation marks);
• Before the linguistic analysis of the texts, all stop words (particles, adverbs,
conjunctions) and pronouns were removed.</p>
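        <p>The first and last of these rules can be sketched in Python as follows. The stop-word set is a short illustrative stand-in, not the dictionary used in the experiment, and the function name is hypothetical. Reduction of words to their bases and the treatment of verbs as boundaries are omitted, since they require the morphological analysis described elsewhere in the paper.</p>
        <preformat><![CDATA[
```python
import re
from collections import Counter

# Illustrative stop words only (assumption); the real list comes from a dictionary.
STOP_WORDS = {"і", "та", "й", "але", "до", "в", "на", "що"}
WORD = re.compile(r"[^\W\d_]+(?:[-'’ʼ][^\W\d_]+)*")

def bigrams_within_punctuation(text, stop_words=STOP_WORDS):
    """Count bigrams that do not cross any punctuation mark."""
    counts = Counter()
    # Every punctuation mark ends a segment, so no bigram spans it.
    for segment in re.split(r'[.,;:!?()«»"]+', text.lower()):
        words = [w for w in WORD.findall(segment) if w not in stop_words]
        counts.update(zip(words, words[1:]))
    return counts

demo = bigrams_within_punctuation(
    "Контент аналіз тексту, аналіз тексту та контент аналіз.")
print(demo.most_common(3))
```
]]></preformat>
        <p>In the demo string the pair (тексту, аналіз) is never formed, because a comma separates the two words.</p>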
        <p>Statistical methods make it possible to take the usage of individual words into account. The
subtleties are associated with applying the methods to different data volumes and probability ranges
(the χ² test is better than the t-test for larger p, where normality is violated; for small volumes the
likelihood ratio is better approximated by the χ² distribution than the statistic computed from 2x2
tables). These methods are more often used not for accepting/rejecting hypotheses,
but for ranking candidate phrases. For comparison with the obtained results, we use the
Word2Vec library from Google, which has proven itself as an alternative to TF-IDF (А1 - Table C
of Appendix), together with the built-in methods for searching for word combinations in
Python. Word2Vec did not work very well on these datasets, because it needs huge corpora to work
well. Its most interesting property is that it maps each word of
the corpus into a vector space whose dimension is set by the user, which allows operations such as
'король' + 'жінка' - 'чоловік' = 'королева' ('king' +
'woman' - 'man' = 'queen')</p>
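        <p>To illustrate how the three statistics rank one candidate phrase, the sketch below recomputes them for контент аналіз from the counts given above (N = 4338, 85 and 53 word occurrences, 18 joint occurrences). The function names are hypothetical; the t-score uses the approximation s² ≈ x̄, and the likelihood ratio assumes that none of the estimated probabilities is 0 or 1.</p>
        <preformat><![CDATA[
```python
import math

# c1 = count of контент, c2 = count of аналіз, c12 = count of the pair
N, c1, c2, c12 = 4338, 85, 53, 18

def t_score(c1, c2, c12, n):
    x_bar = c12 / n                       # observed pair probability
    mu = (c1 / n) * (c2 / n)              # expected probability under H0
    return (x_bar - mu) / math.sqrt(x_bar / n)   # s^2 is approximated by x_bar

def chi_square(c1, c2, c12, n):
    o11, o12 = c12, c2 - c12              # row "w2 = аналіз"
    o21, o22 = c1 - c12, n - c1 - c2 + c12    # row "w2 != аналіз"
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den

def log_l(k, n, x):
    # log of x^k (1 - x)^(n - k); binomial coefficients cancel in the ratio
    return k * math.log(x) + (n - k) * math.log(1 - x)

def likelihood_ratio(c1, c2, c12, n):
    p, p1, p2 = c2 / n, c12 / c1, (c2 - c12) / (n - c1)
    log_lambda = (log_l(c12, c1, p) + log_l(c2 - c12, n - c1, p)
                  - log_l(c12, c1, p1) - log_l(c2 - c12, n - c1, p2))
    return -2 * log_lambda                # asymptotically chi-square distributed

scores = {
    "t": t_score(c1, c2, c12, N),
    "chi2": chi_square(c1, c2, c12, N),
    "LR": likelihood_ratio(c1, c2, c12, N),
}
print(scores)
```
]]></preformat>
        <p>Computing the same three scores for every candidate bigram and sorting by each score gives the rankings that are compared between methods.</p>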
        <p>After mapping into a space of a certain dimension, each word becomes a vector, so basic
operations such as addition, subtraction and multiplication can be performed on words. We also
consider the analysis through bigrams (А2 – Table C of Appendix) and skip-grams (А3 – Table C of
Appendix). These results are better than those of Word2Vec; the best were obtained by the analysis of skip-grams with a value
of 3 combined with the cleaning of stop words in English (А4 – Table C of Appendix).
However, these results are still quite far from those obtained in Table A of the Appendix. The result
is worsened by not taking punctuation marks into account and by treating stop words as meaningful
in the linguistic analysis.</p>
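        <p>A skip-gram generalises a bigram by allowing a gap between the two words. The sketch below is one possible reading, in which the skip value is the maximum number of words that may be skipped between the members of a pair; this interpretation of "a value of 3" and the function name are assumptions.</p>
        <preformat><![CDATA[
```python
from collections import Counter

def skip_bigrams(words, max_skip):
    """Yield ordered word pairs with at most `max_skip` words between them."""
    for i, left in enumerate(words):
        # The slice covers gaps of 0, 1, ..., max_skip words.
        for right in words[i + 1:i + 2 + max_skip]:
            yield (left, right)

tokens = ["a", "b", "c", "d"]
pairs = list(skip_bigrams(tokens, max_skip=1))
print(Counter(pairs))
```
]]></preformat>
        <p>With max_skip=0 the function returns ordinary bigrams, so the same counting code can serve both analyses.</p>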
      </sec>
      <sec id="sec-3-3">
        <title>3.4. Parametric classification of the text in Ukrainian</title>
        <p>When classifying the text, the definition of the grammatical meta-data of the word is
implemented based on grapheme/morphological analysis (Fig. 20, algorithm 3) [38-41].</p>
        <p>Algorithm 3. Thematic classification of Ukrainian-language content
Stage 1. Splitting the Ukrainian-language text С into parts (paragraphs, etc.).
Step 1. Loading the text С into the tree generation module.</p>
        <p>Step 2. Formation of a new array of tapes in the structure.</p>
        <p>Step 3. Parsing of strings of symbols of parts of the text С .</p>
        <p>Step 4. Identify the period as the end of a sentence, not part of the contraction and go to step 5, otherwise
store it in an array and go to step 3.</p>
        <p>Step 5. Identification of the end-of-text character and go to step 6, otherwise mark the end of a part of
the text and go to step 2.</p>
        <p>Step 6. Saving the tree of parts of text   as a structure    ∈   .</p>
        <p>Content classification module
Stage 2. Splitting the part into expressions while preserving the structure of the text С3.
Step 1. Analysis of the new structure of part of the text    ∈   . Formation of the structure of the
expression (paragraph/sentence, etc.)    ∈   with the ID_part key of type n-to-1 with the
structure of text parts С .</p>
        <p>Step 2. Formation of a new array in the structure of sentences    ∈   .</p>
        <p>Step 3. Parsing characters to the next punctuation mark.</p>
        <p>Step 4. If the abbreviation or special entry (date, money, etc.) is according to the regular expression, then
the corresponding marking of this sequence and the transition to step 5, otherwise, saving in the
structure    ∈   and transition to step 2.</p>
        <p>Step 5. If the end of the text part, then mark and go to step 6, otherwise go to step 2.
Step 6. Saving a tree of sentences in the form of a    ∈   structure.</p>
        <p>Step 7. If the end of the text, then go to Stage 3, otherwise go to step 1.</p>
        <p>Stage 3. Splitting sentences into lexemes while preserving the connection with the corresponding
sentence    ∈   and, accordingly, the number of the position in the sentence.</p>
        <p>Step 1. Formation of the lexeme structure    ∈   with the fields ID_lex, ID_sent, N_lex, T_lex as a
description of the lexeme meta-data.</p>
        <p>Step 2. Analysis of the sentence lexeme with    ∈   .</p>
        <p>Step 3. Formation of a new lexeme in the lexeme structure    ∈   .</p>
        <p>Step 4. Parsing characters up to the first character not from the Ukrainian alphabet or an apostrophe and
saving tokens in the structure.</p>
        <p>Step 5. If the end-of-sentence character, then go to step 6, otherwise go to step 3.</p>
        <p>Step 6. Syntax analysis based on Algorithm 2.</p>
        <p>Step 7. Morphological analysis based on received lexeme chains.</p>
        <p>Stage 4. Identification of the topic of the Ukrainian-language text    ∈   .</p>
        <p>Step 1. Identification of the hierarchical structure of features    ∈   of each semantically significant
lexeme from the noun group, except for pronouns.</p>
        <p>Step 2. Generating a dictionary with a hierarchy of token property types.</p>
        <p>Step 3. Unification, if necessary, of similar tokens.</p>
        <p>Step 4. Identification of a set of key words    of the text С′ =   (  (С ,   ),   ) with   =
{  1,   2,   3,   4}, where   is a collection of classification conditions,   1 is a set of
thematic keywords,   2 is a set of frequencies of occurrence of keywords,    3 is dependencies
  , = ||  −   || = ∑ =( 1, )|  ( ) −   ( )|.   , = ||  , −  ,  ||.
1−</p>
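        <p>Stages 1–3 of Algorithm 3 (parts, sentences, lexemes) can be sketched in Python as follows. This is a simplified illustration rather than the implemented module: the abbreviation list is invented for the example, parts are assumed to be separated by blank lines, and only the lexeme fields ID_lex, ID_sent, N_lex and T_lex named in Stage 3 are modelled.</p>
        <preformat><![CDATA[
```python
import re
from dataclasses import dataclass

@dataclass
class Lexeme:
    id_lex: int     # ID_lex: running identifier of the lexeme
    id_sent: int    # ID_sent: identifier of the parent sentence
    n_lex: int      # N_lex: position of the lexeme in its sentence
    t_lex: str      # T_lex: text of the lexeme

ABBREVIATIONS = {"рис", "табл", "ін", "див"}   # illustrative list (assumption)
# Letters of the Ukrainian alphabet plus apostrophe variants.
WORD = re.compile(r"[а-щьюяєіїґ'’ʼ]+", re.IGNORECASE)

def split_sentences(part):
    """Treat a period as a sentence end unless it follows an abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]", part):
        before = part[start:m.start()].split()
        if m.group() == "." and before and before[-1].lower() in ABBREVIATIONS:
            continue
        sentences.append(part[start:m.end()].strip())
        start = m.end()
    tail = part[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

def build_lexemes(text):
    lexemes, id_lex, id_sent = [], 0, 0
    for part in (p for p in text.split("\n\n") if p.strip()):      # Stage 1
        for sentence in split_sentences(part):                      # Stage 2
            id_sent += 1
            for n_lex, token in enumerate(WORD.findall(sentence), start=1):  # Stage 3
                id_lex += 1
                lexemes.append(Lexeme(id_lex, id_sent, n_lex, token))
    return lexemes

lexemes = build_lexemes("Див. рис. 20. Це текст.")
for lex in lexemes:
    print(lex)
```
]]></preformat>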
        <p>The text "x" is attributed to the author whose N-gram distribution density is at the smallest
distance from that of the text. When solving the classification problem, the data set was not divided into
test and training sets. Weighted average distribution densities of N-grams were constructed
over the entire set of content of one author. The distance from content i to a specific author y
was calculated by a formula that excludes the density
of the distribution of N-grams of content i from the average density of the distribution of N-grams
of that author. The Web resource for analysing N-grams has the following fields (Fig. 41b):
• Вибрати мову тексту [Vybraty movu tekstu] (Select the language of the text): the
language of the text for analysis (research). The default is "Ukrainian".
• Число грами [Chyslo hramy] (Number of grams): the number of characters in a gram. It can be
set to 1, 2, 3 or 4. The default is 3.
• Limitation of text in characters.
• Текст [Tekst] (Text): the field into which the researched text is pasted from the clipboard.
• Генерувати [Heneruvaty] (Generate): starts the generation of N-grams.
• Очистити [Ochystyty] (Clear): clears the entered data.</p>
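        <p>The attribution rule (smallest distance between N-gram distribution densities) can be sketched in Python as follows. The sketch makes several assumptions: N-grams are taken over letters only, the distance is the sum of absolute differences of relative frequencies, and each author is represented by a single concatenated text rather than a weighted average with the analysed content excluded. All function names are hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import Counter

def ngram_density(text, n=3):
    """Relative frequencies of character n-grams (letters only, lower-cased)."""
    letters = "".join(ch for ch in text.lower() if ch.isalpha())
    grams = Counter(letters[i:i + n] for i in range(len(letters) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()} if total else {}

def l1_distance(f_x, f_y):
    """Sum of absolute differences between two n-gram densities."""
    return sum(abs(f_x.get(g, 0.0) - f_y.get(g, 0.0)) for g in set(f_x) | set(f_y))

def closest_author(text, author_texts, n=3):
    """Return the author whose density is nearest to that of `text`."""
    f_x = ngram_density(text, n)
    return min(author_texts,
               key=lambda a: l1_distance(f_x, ngram_density(author_texts[a], n)))

authors = {
    "A": "мама мила раму мама мила раму",
    "B": "тато читав книгу тато читав книгу",
}
print(closest_author("мама мила", authors))
```
]]></preformat>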
        <p>We compare three scientific and technical publications [53, 54, 55] with each other based on
linguistic statistical analysis of 3-grams. Articles 1 and 2 were written by the same team [53, 54],
and Article 3 was written by another author [55] (Table 17). The language of the text is Ukrainian
(the alphabet has 33 letters, so there are 33³ = 35937 possible 3-grams).
Table 17: Parameter values for analysed articles 1–3</p>
        <p>When comparing the articles, we take into account only those 3-grams that appear
at least once in all three articles. For this particular example,
there are 2147 such 3-grams. That is, for Article 1 we analyse 78.4814% of 3-grams, for Article 2
72.6332% and for Article 3 84.1271%. Accordingly, the difference in the use of the corresponding
3-grams between Articles 1 and 2 is R12 = 56.5254%, between Articles 2 and 3 R23 = 69.4271%, and between
Articles 1 and 3 R13 = 62.9839%. These indicators alone show that Articles
1 and 2 are more similar to each other than Articles 1 and 3 or Articles 2 and 3
(R23 &gt; R12 by 12.9017%, R23 &gt; R13 by 6.4432%, R13 &gt; R12 by 6.4585%, i.e.
R23 &gt; R13 &gt; R12). The smaller the Rij, the
more likely it is that the articles were written by the same author. Therefore, Articles 1
and 2 are more likely to have been written by the same author/team than Articles 2 and 3 or Articles 1 and 3.
Next, we analyse the use of individual 3-gram clusters in the corresponding articles
and compare the obtained results (Table 18).
Table 18: The value of the parameters of the appearance of 3-grams for the analysed articles 1–3</p>
        <p>[Table 18 columns: 3-gram; the average value of 1 appearance (Articles 1, 2, 3); match for articles, % (1–2, 2–3, 1–3); discrepancy for articles, % (1–2, 1–3, 2–3). Rows: 3-grams grouped by initial letter, а_ _, б_ _, в_ _, г_ _, д_ _, е_ _, є_ _, ж_ _, з_ _, и_ _, і_ _, ї_ _, й_ _, к_ _, л_ _, м_ _, н_ _, о_ _, п_ _, р_ _.]</p>
        <p>According to Table 19 and Fig. 42a some of the letters in the Ukrainian language are used
most often, others are much less common. For the most frequently used letters, the frequency
of appearance of 3-grams with such initial letters will have an almost identical distribution (peak
values on the graph Fig. 42a), but not for other letters.</p>
        <p>Therefore, it is advisable to study only trigrams for initial letters that are less common in the
texts of a specific language to determine the degree of belonging of the text to the
corresponding author (for example, Fig. 42-Fig. 43).</p>
        <p>According to these graphs, Article 1 and Article 2 were most likely written by
the same author, although Article 1 and Article 3 could also appear to have been written by the same author
(which is not the case). Articles 2 and 3, however, were written by different authors. Applying
linguistic statistical analysis of 3-grams to a set of articles allows the formation of a subset
of publications similar in terms of linguistic characteristics. Imposing additional conditions
on this subset in the form of further linguistic statistical analyses (set of keywords, stable phrases,
stylometric, linguometric, etc.) significantly reduces this subset, narrowing the list of
the author's more likely works. Thus, an analysis of the content and frequency of appearance of
service words alone will separate Articles 1 and 3 into different subsets, leaving Articles 1 and 2 in one.</p>
        <p>3.7. Analysis of the developed method of quantitative assessment of the potential
author identification of a scientific and technical publication</p>
        <p>The method consists of six algorithms for the analysis of Ukrainian-language texts.
Algorithm I. Pre-processing of data based on content analysis (parsing, segmentation and tokenization of
text, as well as linguistic analysis of text).</p>
        <p>Algorithm II. Calculation and analysis of the features of the author's speech style (frequency of word
usage, volume of punctuation marks, sentences, symbols, words and the ratio of the number of
marks and sentences).</p>
        <p>Algorithm III. Calculation and analysis of the parameters of the author's speech style (speech coherence,
syntactic complexity, lexical diversity, degree of concentration and exclusivity of the text).
Algorithm IV. Classification by parameters and lexical features of the textual content of other publications
(application of classifiers such as fuzzy, SVM and a combination of the previous two).
Algorithm V. Performance analysis based on the obtained results to determine each classifier accuracy.
Algorithm VI. Determining a subset of potential authors based on filtering from the set of all researched
through the analysis of features and style parameters (algorithms VIII–XI).</p>
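        <p>The style features counted in Algorithm II can be sketched in Python as follows. The feature set follows the quantities named in the text (words, unique words, sentences, punctuation marks, words with frequency 1 and at least 10). The type-token ratio is included only as a simple stand-in for lexical diversity, since the formulas for the coefficients of Algorithm III are not reproduced here; the function name and dictionary keys are assumptions.</p>
        <preformat><![CDATA[
```python
import re
from collections import Counter

WORD = re.compile(r"[^\W\d_]+(?:[-'’ʼ][^\W\d_]+)*")

def style_features(text):
    """Count basic quantitative features of an author's speech style."""
    words = WORD.findall(text.lower())
    freq = Counter(words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    marks = re.findall(r"[.,;:!?]", text)
    return {
        "words": len(words),
        "unique_words": len(freq),
        "sentences": len(sentences),
        "punctuation_marks": len(marks),
        "freq_1": sum(1 for c in freq.values() if c == 1),
        "freq_10_plus": sum(1 for c in freq.values() if c >= 10),
        "marks_per_sentence": len(marks) / len(sentences) if sentences else 0.0,
        "type_token_ratio": len(freq) / len(words) if words else 0.0,
    }

features = style_features("Кіт спить. Кіт їсть, а пес спить.")
print(features)
```
]]></preformat>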
        <p>A lexer-type system (tokenizer, segmenter) has been developed as part of a text analyser
based on tokenization (Fig. 44a). Tokens are extracted while the parser rules are being applied
and are immediately checked against the conditions in the syntax rules, which avoids
generating meaningless token sequences (Fig. 44b).</p>
        <p>[Fig. 44: a) лексер аналізує текст (the lexer analyses the text); b) лексер аналізує текст разом із парсером (the lexer analyses the text together with the parser). Candidate tokens shown: я, ям, яма, мал, мала, лад, ладонь, ад, дон, донька, кат, ката, тата, татам, там, мого, ого.]</p>
        <p>The rules help to solve several tasks and increase the efficiency of the grammar engine, which
loads the compiled rules during text parsing without wasting time on syntax parsing (Algorithm 12).</p>
        <p>Algorithm 12 (VІІ). Text content segmenter
Step 1. Word recognition.</p>
        <p>Step 2. Definition of token boundaries.</p>
        <p>Step 3. Definition of complete word forms.</p>
        <p>Step 4. Identification of indivisible tokens that contain dots, blanks, etc.</p>
        <p>Step 5. Splitting the text into sentences.</p>
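        <p>The steps of the segmenter can be sketched in Python with a single regular expression whose alternatives are tried in order, so that indivisible tokens containing dots are recognised before ordinary words and punctuation. The token classes chosen here (dates, numbers, words, punctuation marks) and the function names are assumptions made for the illustration.</p>
        <preformat><![CDATA[
```python
import re

TOKEN = re.compile(r"""
    \d{1,2}\.\d{1,2}\.\d{4}          # indivisible token: a date such as 24.08.1991
  | \d+(?:[.,]\d+)?                  # integers and decimal numbers
  | [^\W\d_]+(?:[-'’ʼ][^\W\d_]+)*    # words, including hyphen and apostrophe
  | [.!?,;:]                         # punctuation marks
""", re.VERBOSE)

def tokenize(text):
    """Steps 1-4: recognise words and indivisible tokens and their boundaries."""
    return TOKEN.findall(text)

def split_into_sentences(tokens):
    """Step 5: close a sentence at every terminal punctuation token."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

tokens = tokenize("Україна стала незалежною 24.08.1991. Це п'ять слів!")
sentences = split_into_sentences(tokens)
print(sentences)
```
]]></preformat>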
        <p>In addition to defining the boundaries of tokens, the lexer also performs preliminary
recognition of the morphological attributes of words, turning lexemes into tokens. When
constructing Ukrainian-language sentences with direct word order, a distinction is made
between the noun group Ñ and the verb group Ř (Fig. 45, Fig. 46).</p>
        <p>[Fig. 45–46: step-by-step derivation (steps 1–20) of the sentence сміх моєї донечки наповнює мене безмежним щастям [smikh moyeyi donechky napovnyuye mene bezmezhnym shchastyam] (my daughter's laughter fills me with boundless happiness) from the start symbol S. Legible steps:
1. S
2–6. ...
7.(II.2) # Aч,од,н Ñч,од,н,3 Aж,од,р Ñж,од,р,3 Rод,тп,3 Nч,од,з,1 Ас,од,о Nс,од,о,3 #
8–9. ...
10.(II.4) # Aч,од,н Nч,од,н Aж,од,р Nж,од,р,3 Rод,тп,3 Nч,од,з,1 Ас,од,о Nс,од,о #
11.(II.3) # Aч,од,н Nч,од,н Aж,од,р Nж,од,р,3 Rод,тп,3 Nч,од,з,1 Ас,од,о Nс,од,о #
12–20. ...
Rules applied to the words: сміх (IV.2), моєї (IV.6), донечки (IV.1), наповнює (IV.7), мене (IV.4), безмежним (IV.6), щастям (IV.3).]</p>
        <p>We get the constituent tree, i.e. the syntactic structure of the analysed sentence (Fig. 47). For
dictionary lexemes, the dictionary entry whose form is the lexeme is also defined. In
alphabetic-frequency dictionaries, the characteristics are determined for each word (Fig. 48).</p>
        <p>[Fig. 47: constituent tree of the analysed sentence with root S and the nodes Ñч,од,н,3, Aч,од,н, Nч,од,н, Aж,од,р, Ñж,од,р,3, Nж,од,р, Nч,од,з,1 and Nс,од,о over the words of the sentence (… мене безмежним щастям).]</p>
        <p>Figure 48: a) The rule base of the alphabetic-frequency dictionary of parts of speech, where
A is a verb, other capital letters are additional characteristics of a verb, V is an adjective, and small
letters of the English alphabet are characteristics of a noun; b) regular expressions of
morphological analysis of nouns</p>
        <p>The database stores regular expressions for reducing a word to its base (Fig. 49a-b), where
flag is the rule for identifying the type of word (for example, noun group, singular), mask is the
set of inflexions of the word (exceptions in square brackets), find is the inflexion of the word in the
nominative case, and repl is the inflexion of the word during declension (Fig. 49c).
[Fig. 49: determining the base of a word]</p>
        <p>The database (Fig. 49b) also contains a dictionary of service words, that is, words that serve as
additional parameters for analysing the features of the author's speech style; taking them into
account during the analysis of texts significantly affects the final result.</p>
        <p>We will determine which of the four developed algorithms (VIII-XI) is optimal for identifying the
style of the author of a publication based on the analysis of his collective works.</p>
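        <p>Algorithm VIII. Filtering a set of analysed author's styles</p>
        <preformat>int i = 0;
while (i &lt; 4) {                       /* collective works */
    int c1 = 0, c2 = 0, cc2 = 0;
    int j = 0;
    while (j &lt; 94) {                  /* authors */
        int s = 0, l = 0;
        while (l &lt; 12) {              /* style features */
            double d = fabs(F[l] - K[i][l]);   /* half-width of the interval around K[i][l] */
            if ((K[i][l] + d &gt; A[j][l]) &amp;&amp; (K[i][l] - d &lt; A[j][l])) {
                s += 1;                  /* feature l of author j lies inside the interval */
                if (l &gt; 6) cc2 += s;  /* last 5 features only, as in the original listing */
            }
            l++;
        }
        /* the rest of the listing (use of c1, c2 and cc2 in the filters) is not shown in the source */
        j++;
    }
    i++;
}</preformat>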
        <p>Array K[i][l] holds the style parameters and coefficients for 4 collective works (Table 20 and Table
E of the Appendix, highlighted in yellow), some of whose authors are numbered 6 and 30
(highlighted in blue). Array A[j][l] holds the style features for 94 authors. Array F[l] holds the average values of
the style features over the 94 authors. The algorithm determines whether the values of the parameters
and coefficients of the speech style of the j-th author fall within the interval [x_i − x_avg; x_i + x_avg]
around the parameter values and speech coefficients of the style of the i-th collective work. Arrays A2
(authors for whom the values of most parameters and coefficients are similar to the style of team i)
and A3 (authors for whom the values of most of the coefficients are similar to the style of team i) are
filled through the filters. Next, a new subset of authors (whose styles are more similar to that of the
i-th collective work) is formed from these arrays by superimposing a new
filter.
Table 20: The result of the algorithm for analysing the style of a publication author on Victana [16] for 94
authors and more than 300 individual publications for the period 2001–2021</p>
        <p>As a result, we get the values given in Table 21 (algorithm VIII). Columns A are the results
of the analysis of all the values of the coefficient vectors and speech parameters of the authors
from Table 20. Column B is the result of analysing only the last 5 columns in Table 20.
Unfortunately, according to the results of this algorithm the listed authors of these works are
unlikely to have written them (the best results are highlighted in red, and they are not
enough to claim that these authors wrote more than 50% of these collective works). On
the other hand, the algorithm is useful for reducing the number of candidate authors at the first
stage of authorship determination (to 34.04% of the total number of project participants).
This is necessary for further filtering through the analysis of root words (prepositions and
conjunctions) and keywords, features of semantics and vocabulary when constructing
sentences, etc.
Table 21: Experimental testing of algorithms I–IV on the Victana Web resource [16]</p>
        <p>[Table 21 column labels: Average value (A, B); Filter (2, B); IX; X; XI.]</p>
        <p>Next, we analyse algorithm IX. It differs from the previous one only in the condition in the
third loop: if ((K[i][l]+V[l])&gt;A[j][l]) &amp;&amp; ((K[i][l]-V[l])&lt;A[j][l]) s+=1, where V[l] is an
array of mean absolute deviations of the data points from the average value. The results are given
in Table 21 (algorithm IX). They are slightly improved, but not enough to identify authors numbered 6 and
30 as the real authors of collective works 1–4, although they did write them. On the other hand,
the number of authors with a similar style of speech increased slightly (to 38.56% of the total
number of project participants). Now let's analyse algorithm X. In algorithm VIII, we
replace the condition in the third loop with the following:</p>
        <p>if (abs(A[j][l]- K[i][l])&gt;abs(K[i][l]-F[l])) s+=1</p>
        <p>As a result, we get the values given in Table 21 (algorithm X). The
obtained values show that the style of authors numbered 6 and 30 is quite close (a
75–100% match) to the style of collective works 1–4, respectively (positive results are highlighted
in red). However, the number of authors with a similar speech style has increased significantly (to 42.02% of the total number of project
participants). In addition,
many authors who were not selected at the previous stages of the study appear in this
list, while some who were selected at the previous two stages have dropped out.
Now let's try to reduce the total number by applying algorithm XI to the
initial data, i.e. the parameters and speech coefficients of the 94 project participants. In algorithm X, we
refine the condition in the third loop:
if ((abs(A[j][l]-K[i][l])&gt;abs(K[i][l]-F[l])) &amp;&amp; (abs(A[j][l]-F[l])&gt;abs(K[i][l]-F[l])))
|| ((abs(A[j][l]-K[i][l])&lt;abs(K[i][l]-F[l])) &amp;&amp; (abs(A[j][l]-F[l])&lt;abs(K[i][l]-F[l]))) s+=1</p>
        <p>As a result, we get the values given in Table 21 (algorithm XI). The obtained values also
confirm that the style of authors numbered 6 and 30 is quite close (a 75–100% match) to the
style of collective works 1–4, respectively (positive results are highlighted in red). The number
of authors with a similar speech style is also significantly reduced (to 38.03% of the total number of project
participants). Fig. 50 provides detailed graphs of the results
obtained when applying algorithms VIII–XI (numbered 1–4, respectively) for the analysis of the
method of determining the author’s style developed by us.</p>
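        <p>The four conditions of algorithms VIII–XI differ only in the test applied to one feature, so they can be expressed as interchangeable predicates. The Python sketch below follows the conditions as printed above. The helper names, the toy data and the acceptance threshold min_share are assumptions, since the filters that build the final subsets are not given in full.</p>
        <preformat><![CDATA[
```python
def cond_viii(a, k, f, v):
    d = abs(f - k)                 # interval half-width taken from the average F
    return k - d < a < k + d

def cond_ix(a, k, f, v):
    return k - v < a < k + v       # half-width is the mean absolute deviation V

def cond_x(a, k, f, v):
    return abs(a - k) > abs(k - f)

def cond_xi(a, k, f, v):
    far = abs(a - k) > abs(k - f) and abs(a - f) > abs(k - f)
    near = abs(a - k) < abs(k - f) and abs(a - f) < abs(k - f)
    return far or near

def column_stats(A):
    """Average F[l] and mean absolute deviation V[l] of every feature column."""
    n = len(A)
    F = [sum(col) / n for col in zip(*A)]
    V = [sum(abs(x - f) for x in col) / n for col, f in zip(zip(*A), F)]
    return F, V

def count_matches(cond, A_j, K_i, F, V):
    """Counter s: features of author j that satisfy `cond` for collective work i."""
    return sum(1 for a, k, f, v in zip(A_j, K_i, F, V) if cond(a, k, f, v))

def filter_authors(cond, A, K_i, F, V, min_share=0.5):
    # min_share is a hypothetical acceptance threshold
    return [j for j, A_j in enumerate(A)
            if count_matches(cond, A_j, K_i, F, V) / len(K_i) > min_share]

A = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]    # toy data: 3 authors, 2 features
K_i = [1.2, 12.0]                              # toy collective work
F, V = column_stats(A)
print(filter_authors(cond_viii, A, K_i, F, V))
```
]]></preformat>
        <p>Swapping cond_viii for cond_ix, cond_x or cond_xi reruns the same filter with the other conditions, which makes the shares of selected authors directly comparable.</p>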
        <p>Further, to determine the author's style among the remaining 38.03% of authors, an analysis of root words
(prepositions and conjunctions) and keywords of the authors' works was used. Each
individual has a particular vocabulary for conveying an opinion, including so-called "parasitic" words
(тобто, отже, хоча [tobto, otzhe, khocha] (that is, therefore, although) etc.) and service
words (і, та, й, але, хоч би [i, ta, y, ale, khoch by] (and, and, and, but, although) etc.).</p>
        <p>[Fig. 50 panels: Collective 1, Collective 2, Collective 3, Collective 4; Algorithm 1, Algorithm 2, Algorithm 3, Algorithm 4.]</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>A method of determining stable word combinations was developed based on the identification
of keywords of the Ukrainian-language text and analysis of the lexical speech coefficients of the
author of the text in reference excerpts of the content, which made it possible to improve the
accuracy of the method of determining the style of the author of the text by 9% based on
statistical linguistics. The method uses Zipf's law to select stable word
combinations as keywords, applying the following rules of preliminary linguistic processing
of the text: remove all stop words; form bigrams only within the limits of punctuation
marks; treat verbs and pronouns as punctuation marks; identify verbs by their
inflexions; form bigrams from word bases without taking inflexions into account;
identify adjectives by their inflexions and assume that in Ukrainian-language texts an adjective
can only occupy the first place in a bigram. A set of programs has been developed
to identify stable phrases as keywords. An approach to the development of linguistic content
analysis software for the determination of stable word combinations in the identification of
keywords of Ukrainian-language and English-language textual content is proposed. The
peculiarity of the approach is the adaptation of the linguistic statistical analysis of lexical units
to the peculiarities of the constructions of Ukrainian and English words/texts. The results of the
experimental approbation of the proposed method of content analysis of English- and
Ukrainian-language texts for the determination of stable word combinations in the identification
of keywords of technical texts were studied.</p>
      <p>A method of determining the author of Ukrainian-language texts has been developed based
on the analysis of the coefficients of the author’s lexical speech in a reference passage of the
author’s text. It relies on the analysis of a collection of keywords, stable phrases,
indicators of linguometry and stylometry, as well as the results of the analysis of N-grams based on
comparing differences in the use of 2-grams and 3-grams (within [6;7]% for publications similar in
style, and &gt;12% for those that are not). This made it possible to identify a set
of potential authors of publications with more than one author (up to [9;34]% of the total
number of project participants) and to develop a method for identifying the author's style.</p>
      <p>A method of identifying the style of the author of the text based on the analysis of the
features of the author's speech style in a template passage of the author's text has been
developed. The method consists of a comparative analysis of a
statistically processed work of the author (the standard) with an arbitrary analysed passage. The
method evaluates the degree to which a text belongs to the template of the author's style through the
analysis of the corresponding coefficients of the author's lexical speech. The method
works under the condition that the template of the author's style is generated from reliable data.
An analysis of reference words was used for attribution; the obtained results are presented in
the form of correlation coefficients. Separately, we note the change in the significance
of one of the parameters of the text in the author's attribution of the texts.</p>
      <p>An algorithm for identifying service words based on linguistic analysis of text content has
been developed. For each of the passages, the absolute and relative frequencies of stop words
were analysed and compared with the reference values. The application of the
method of reference words identifies, among the studied passages, the one that
most likely belongs to the author of the standard. Other results confirm the effectiveness of the reference
words method in the authorial attribution of texts. Applying the assumption that
particles have an insignificant influence as a parameter of the method led to a
decrease in the correlation coefficients. However, to confirm or refute that particles are
not a determining factor in the author's style, more thorough research must be performed. An
algorithm for the lexical analysis of Ukrainian-language texts and an algorithm for the syntactic
analysis of text content have been developed. The peculiarity of the algorithms is the
adaptation of the morphological and syntactic analysis of word forms to the peculiarities of the
construction of Ukrainian words/texts. Belonging to a part of speech and declension within this
part of speech were taken into account based on the analysis of inflexions and word bases
according to regular expressions.</p>
      <p>Content monitoring results were compared on a set of 300 single-author works in technical
fields by 100 different authors for the period 2001–2021 to determine whether and how the text
diversity coefficients of these authors change across periods. By the density criterion, the
best results are achieved when an article is analysed without its initial mandatory information,
such as abstracts and keywords in different languages, and without the list of references. The
method of identifying a potential author is decomposed based on the analysis of speech style
parameters: speech coherence, degree of syntactic complexity, lexical diversity, degree of
concentration and exclusivity. Further characteristics of the author's style were also analysed:
the total number of words in the text, the number of unique words, the number of
conjunctions/prepositions, the number of sentences, and the number of words with a frequency of
1 and of ≥10. For example, the 3-grams of 3 articles were analysed: 78.4814% of 3-grams for
Article 1, 72.6332% for Article 2, and 84.1271% for Article 3. The difference in the use of the
corresponding 3-grams is R12 = 56.5254% between Articles 1 and 2, R23 = 69.4271% between
Articles 2 and 3, and R13 = 62.9839% between Articles 1 and 3. These indicators show that
Articles 1 and 2 are more similar to each other than Articles 1 and 3 or Articles 2 and 3
(R23 &gt; R12 by 12.9017%, R23 &gt; R13 by 6.4432%, R13 &gt; R12 by 6.4585%, i.e.,
R23 &gt; R13 &gt; R12). The smaller Rij is, the more likely it is that articles i and j were
written by the same author. In this case, Articles 1 and 2 are therefore more likely to have
been written by the same author than Articles 1 and 3 or Articles 2 and 3.</p>
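      <p>The pairwise comparison of 3-gram usage can be sketched as follows. The exact definition of Rij is not given in this section, so the sketch assumes a simple stand-in: the percentage of distinct word 3-grams that the two texts do not share. The function names are hypothetical.</p>
      <preformat>
```python
import re


def word_ngrams(text, n=3):
    """Return the set of word n-grams (as tuples) found in the text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_difference(text_a, text_b, n=3):
    """Assumed measure: percentage of distinct n-grams not shared by both texts."""
    a, b = word_ngrams(text_a, n), word_ngrams(text_b, n)
    union = a | b
    if not union:
        return 0.0  # neither text is long enough to contain an n-gram
    return 100.0 * (1 - len(a.intersection(b)) / len(union))


def most_similar_pair(texts, n=3):
    """Return (R, i, j) for the pair of named texts with the smallest difference."""
    names = sorted(texts)
    pairs = [
        (ngram_difference(texts[i], texts[j], n), i, j)
        for k, i in enumerate(names)
        for j in names[k + 1:]
    ]
    return min(pairs)
```
      </preformat>
      <p>With this stand-in measure, the pair returned by most_similar_pair plays the role of Articles 1 and 2 in the example: the smallest difference marks the pair most likely to share an author.</p>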
      <p>This work solved an important scientific and applied problem of CLS analysis and synthesis
for solving various problems of processing Ukrainian-language textual content, based on the
development of new NLP models, methods and tools and the improvement of known ones.</p>
      <p>During the execution of the work, the following results were obtained:</p>
      <p>1. The current state of, and prospects for, the development of natural language processing
IT were analysed. This made it possible to define the problem and the research objectives and to
form general research directions, given the absence of non-commercial open-source CLS for
processing Ukrainian-language textual content and of a standardized design approach.</p>
      <p>2. The relevance of the problem of CLS analysis and synthesis is substantiated on the basis
of the developed general structure of a system for processing Ukrainian-language textual
content. The interaction of the main IS processes/components with methods of linguistic
processing of textual content adapted to the Ukrainian language (grapheme, morphological,
lexical, syntactic, semantic, structural, ontological and pragmatic analysis) made it possible
to improve the IT of intellectual analysis of the text flow for solving a specific NLP problem.
This ensured the adaptation of NLP processes to the analysis of Ukrainian-language textual
content and increased the accuracy of the obtained results by 6–48%, depending on the specific
NLP task. For example, for the NLP task of determining the keywords of a Ukrainian-language
text, the density of keywords increases by a factor in the range [1.23; 1.48], that is, by
[23.14; 47.83]%, depending on the quality/accuracy of filling the thematic dictionary through
machine learning.</p>
      <p>3. The methods of processing information resources, namely the integration, management and
support of Ukrainian-language content, have been improved. This made it possible to adapt the
process of intellectual analysis of the text flow and to develop metrics for the effectiveness
of CLS functioning in solving various NLP tasks. The developed methods and tools make it
possible to build CLS for processing Ukrainian-language text content according to the needs of
the permanent/potential target audience, based on the analysis of the history of actions of
website users.</p>
      <p>4. NLP methods based on pattern-matching regular expressions were improved, which made it
possible to adapt the methods of text tokenization and normalization that rely on cascades of
simple regular-expression substitutions and finite-state machines.</p>
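      <p>A cascade of simple regular-expression substitutions followed by pattern-based tokenization can be sketched as below. The three normalization rules and the token pattern are illustrative assumptions, not the rules of the described system.</p>
      <preformat>
```python
import re

# Ordered cascade of (pattern, replacement) pairs; each rule is an assumption
# chosen for illustration and is applied to the output of the previous one.
NORMALIZATION_CASCADE = [
    (re.compile(r"[\u2019\u02BC`]"), "'"),  # unify apostrophe variants
    (re.compile(r"[\u2010-\u2015]"), "-"),  # unify hyphen and dash variants
    (re.compile(r"\s+"), " "),              # collapse runs of whitespace
]

# A token is a run of word characters, optionally joined by ' or - inside it.
TOKEN_PATTERN = re.compile(r"\w+(?:['-]\w+)*")


def normalize(text):
    """Apply every substitution of the cascade in order, then lowercase."""
    for pattern, replacement in NORMALIZATION_CASCADE:
        text = pattern.sub(replacement, text)
    return text.strip().lower()


def tokenize(text):
    """Split normalized text into word tokens."""
    return TOKEN_PATTERN.findall(normalize(text))
```
      </preformat>
      <p>Keeping the rules in an ordered list makes the cascade easy to extend: a new substitution is one more pair, and its position in the list fixes when it runs.</p>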
      <p>5. The MA method for Ukrainian-language text, based on word segmentation and normalization,
sentence segmentation and a modified Porter stemming algorithm, was improved as an effective
means of identifying lemma affixes so that the analysed word can be tagged, which made it
possible to increase the accuracy of keyword searches by 9%.</p>
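      <p>The idea of stripping affixes to bring word forms to a common stem can be shown with a toy suffix stripper. This is not the modified Porter algorithm referred to above; the list of endings and the minimum stem length are hypothetical and far from complete.</p>
      <preformat>
```python
# Hypothetical, deliberately small list of Ukrainian inflexional endings.
# Longer endings are tried first so that the longest match wins.
ENDINGS = sorted(
    ["ого", "ому", "ими", "ами", "ові", "ий", "ій", "им", "их", "ою", "ів",
     "ах", "а", "у", "и", "і", "е", "о", "я", "ю"],
    key=len,
    reverse=True,
)
MIN_STEM = 3  # assumed lower bound so that short words are left untouched


def stem(word):
    """Strip the longest known ending if enough of the word remains."""
    word = word.lower()
    for ending in ENDINGS:
        if word.endswith(ending) and len(word) - len(ending) >= MIN_STEM:
            return word[:-len(ending)]
    return word
```
      </preformat>
      <p>In this sketch the forms "словник" and "словника" reduce to the same stem, which is the property a keyword search relies on when it counts different inflected forms as one term.</p>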
      <p>6. The IT of intellectual analysis of the text flow was improved based on the processing of
information resources, which made it possible to adapt the typical structure of modules for
content integration, management and support to various NLP problems and to increase the
efficiency of CLS functioning by 6–9%. This became possible through the combination of
linguistic analysis methods adapted to the Ukrainian language, improved IT for processing
information resources, ML, and a set of metrics for evaluating the effectiveness of CLS
functioning. The main principle of building such CLS is modularity, which makes it easier to
assemble a system from the processes required to solve a specific NLP problem.</p>
      <p>7. A method of determining the author of Ukrainian-language texts has been developed. It is
based on the analysis of the coefficients of the author’s lexical speech in a reference passage
of the author’s text, which in turn relies on a collection of keywords, stable word
combinations, indicators of linguometry and stylometry, and the results of N-gram analysis
(comparing differences in the use of 2-grams and 3-grams: within [6;7]% for publications similar
in style and &gt;12% for those that are not). This made it possible to identify a set of
potential authors of publications by more than one author (up to [9;34]% of the total number of
project participants) and to develop a method for identifying the author's style.</p>
      <p>8. A method of determining stable word combinations was developed based on identifying the
keywords of a Ukrainian-language text and analysing the lexical speech coefficients of the
author of the text in reference excerpts of the content. This made it possible to improve, by
9%, the accuracy of the method of determining the style of the author of the text based on
statistical linguistics.</p>
      <p>9. The reliability of the scientific and practical results is confirmed by the relevant
materials on the implementation of the dissertation research, as well as by comparing the
practical results obtained on different samples of reliable input data. The CLS was developed on
the information resource http://victana.lviv.ua using CMS Joomla! (for developing the
e-framework of articles), PHP (for implementing text content processing methods), HTML (for
implementing page markup), CSS (for describing page styles), and MySQL (for storing data and
dictionaries). The experimental study confirmed the reliability of the method of determining
keywords: for different algorithms for processing the primary text, the average coincidence of
the lists of identified keywords with the authors' lists varies in the range of 52.6–68.5%. The
accuracy of matching keywords with the author's keywords ranges from 43.6% to 62.9%. The average
match of meaningful keywords compared to all keywords found by the system ranges from 38.9% to
75.8%, and the accuracy of matching keywords compared to all keywords found by the system ranges
from 34.3% to 71.9%, depending on the stages of analysis of the article texts.</p>
      <p>[11] A. Rejeb, K. Rejeb, A. Appolloni, H. Treiblmaier, M. Iranmanesh, Exploring the impact of ChatGPT on education: A web mining and machine learning approach, The International Journal of Management Education 22(1) (2024) 100932.</p>
      <p>[12] V. Kayser, E. Shala, Scenario development using web mining for outlining technology futures, Technological Forecasting and Social Change 156 (2020) 120086.</p>
      <p>[13] M. Karp, N. Kunanets, Y. Kucher, Meiosis and litotes in The Catcher in the Rye by Jerome David Salinger: Text Mining, CEUR Workshop Proceedings 2870 (2021) 166-178.</p>
      <p>[14] S. Kumar, A. K. Kar, P. V. Ilavarasan, Applications of text mining in services management: A systematic literature review, International Journal of Information Management Data Insights 1(1) (2021) 100008.</p>
      <p>[15] L. Hickman, et al., Text preprocessing for text mining in organizational research: Review and recommendations, Organizational Research Methods 25(1) (2022) 114-146.</p>
      <p>[16] Z. Yang, Z. Xiangyi, The Applicability of Zipf's Law in Report Text, Lecture Notes on Language and Literature 6(10) (2023) 57-64.</p>
      <p>[17] Z. Wang, M. Ren, D. Gao, Z. Li, A Zipf's law-based text generation approach for addressing imbalance in entity extraction, Journal of Informetrics 17(4) (2023) 101453.</p>
      <p>[18] A. Koshevoy, H. Miton, O. Morin, Zipf’s law of abbreviation holds for individual characters across a broad range of writing systems, Cognition 238 (2023) 105527.</p>
      <p>[19] C. Boyer, L. Dolamic, N. Grabar, Automated Detection of Health Websites' HONcode Conformity: Can N-gram Tokenization Replace Stemming?, Studies in Health Technology and Informatics 216 (2015) 1064.</p>
      <p>[20] O. Bisikalo, V. Vysotska, Linguistic analysis method of Ukrainian commercial textual content for data mining, CEUR Workshop Proceedings 2608 (2020) 224-244.</p>
      <p>[21] V. Vysotska, P. Pukach, V. Lytvyn, D. Uhryn, Y. Ushenko, Z. Hu, Intelligent Analysis of Ukrainian-language Tweets for Public Opinion Research based on NLP Methods and Machine Learning Technology, International Journal of Modern Education and Computer Science (IJMECS) 15(3) (2023) 70-93.</p>
      <p>[22] V. Starko, A. Rysin, VESUM: A Large Morphological Dictionary of Ukrainian As a Dynamic Tool, CEUR Workshop Proceedings 3171 (2022) 61-70.</p>
      <p>[23] V. Lytvyn, P. Pukach, V. Vysotska, M. Vovk, N. Kholodna, Identification and Correction of Grammatical Errors in Ukrainian Texts Based on Machine Learning Technology, Mathematics 11(4) (2023) 904.</p>
      <p>[24] V. Starko, O. Synchak, Feminine Personal Nouns in Ukrainian: Dynamics in a Corpus, CEUR Workshop Proceedings 3396 (2023) 407-425.</p>
      <p>[25] O. Synchak, V. Starko, Ukrainian Feminine Personal Nouns in Online Dictionaries and Corpora, CEUR Workshop Proceedings 3171 (2022) 775-790.</p>
      <p>[26] V. Starko, Implementing Semantic Annotation in a Ukrainian Corpus, CEUR Workshop Proceedings 2870 (2021) 435-447.</p>
      <p>[27] V. Starko, Semantic Annotation for Ukrainian: Categorization Scheme, Principles, and Tools, CEUR Workshop Proceedings 2604 (2020) 239-248.</p>
      <p>[28] Keygeneratortext. URL: http://msurf.ru/tools/keygeneratortext/.</p>
      <p>[29] Keygeneratorurl. URL: http://webmasta.org/tools/keygeneratorurl/.</p>
      <p>[30] Keywordstext. URL: http://www.keywordstext.therealist.ru/.</p>
      <p>[31] Keygeneratortext. URL: http://syn1.ru/tools/keygeneratortext/.</p>
      <p>[32] Terminology extraction. URL: http://labs.translated.net/terminology-extraction/.</p>
      <p>[33] Advego. URL: http://advego.ru/text/seo/.</p>
      <p>[34] V. Vysotska, S. Holoshchuk, R. Holoshchuk, A comparative analysis for English and Ukrainian texts processing based on semantics and syntax approach, CEUR Workshop Proceedings 2870 (2021) 311-356.</p>
      <p>[35] V. Vysotska, O. Markiv, S. Teslia, Y. Romanova, I. Pihulechko, Correlation Analysis of Text Author Identification Results Based on N-Grams Frequency Distribution in Ukrainian Scientific and Technical Articles, CEUR Workshop Proceedings 3171 (2022) 277-314.</p>
      <p>[36] V. Vysotska, S. Mazepa, L. Chyrun, O. Brodyak, I. Shakleina, V. Schuchmann, NLP tool for extracting relevant information from criminal reports or fakes/propaganda content, in: Proceedings of the 2022 IEEE 17th International Conference on Computer Sciences and Information Technologies (CSIT), November 2022, pp. 93-98.</p>
      <p>[37] V. Lytvyn, et al., Analysis of statistical methods for stable combinations determination of keywords identification, Eastern-European Journal of Enterprise Technologies 2/2(92) (2018) 23-37. doi: 10.15587/1729-4061.2018.126009.</p>
      <p>[38] N. Kholodna, V. Vysotska, O. Markiv, S. Chyrun, Machine Learning Model for Paraphrases Detection Based on Text Content Pair Binary Classification, CEUR Workshop Proceedings 3312 (2022) 283-306.</p>
      <p>[39] Y. Stepaniak, V. Vysotska, O. Markiv, L. Chyrun, S. Chyrun, L. Pohreliuk, Technology of Text Content Topic Classification Based on Machine Learning Methods, in: Proceedings of the IEEE 5th International Conference on Advanced Information and Communication Technologies (AICT), 2023, pp. 121-126.</p>
      <p>[40] Y. Hlavcheva, O. Kanishcheva, M. Vovk, M. Glavchev, Using Topic Modeling for Automation Search to Reviewer, CEUR Workshop Proceedings 3171 (2022) 81-90.</p>
      <p>[41] N. Khairova, A. Kolesnyk, O. Mamyrbayev, G. Ybytayeva, Y. Lytvynenko, Automatic Multilingual Ontology Generation Based on Texts Focused on Criminal Topic, CEUR Workshop Proceedings 2870 (2021) 108-117.</p>
      <p>[42] V. I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals, Soviet Physics Doklady 10(8) (1966) 707-710.</p>
      <p>[43] R. Bellman, R. Kalaba, Dynamic programming and statistical communication theory, Proceedings of the National Academy of Sciences of the United States of America 43(8) (1957) 749.</p>
      <p>[44] R. Bellman, R. Kalaba, On the role of dynamic programming in statistical communication theory, IRE Transactions on Information Theory 3(3) (1957) 197-203.</p>
      <p>[45] R. Bellman, Dynamic Programming, Princeton University Press, Princeton, New Jersey, 1957.</p>
      <p>[46] R. Bellman, On the approximation of curves by line segments using dynamic programming, Communications of the ACM 4(6) (1961) 284.</p>
      <p>[47] R. A. Wagner, M. J. Fischer, The string-to-string correction problem, Journal of the ACM (JACM) 21(1) (1974) 168-173.</p>
      <p>[48] D. Gusfield, Algorithms on strings, trees, and sequences: Computer science and computational biology, ACM SIGACT News 28(4) (1997) 41-60.</p>
      <p>[49] G. D. Forney, The Viterbi algorithm, Proceedings of the IEEE 61(3) (1973) 268-278.</p>
      <p>[50] V. Motyka, Y. Stepaniak, M. Nasalska, V. Vysotska, Lexical Diversity Parameters Analysis for Author's Styles in Scientific and Technical Publications, CEUR Workshop Proceedings 3403 (2023) 595-617.</p>
      <p>[51] R. Romanchuk, V. Vysotska, V. Andrunyk, L. Chyrun, S. Chyrun, O. Brodyak, Intellectual Analysis System Project for Ukrainian-language Artistic Works to Determine the Text Authorship Attribution Probability, in: Proceedings of the 2023 IEEE 18th International Conference on Computer Sciences and Information Technologies, CSIT-2023, Lviv, 19-21 October 2023.</p>
      <p>[52] V. Lytvyn, et al., Development of the quantitative method for automated text content authorship attribution based on the statistical analysis of N-grams distribution, Eastern-European Journal of Enterprise Technologies 6(2-102) (2019) 28-51. doi: 10.15587/1729-4061.2019.186834.</p>
      <p>[53] V. Lytvyn, et al., Development of the linguometric method for automatic identification of the author of text content based on statistical analysis of language diversity coefficients, Eastern-European Journal of Enterprise Technologies 5(2(95)) (2018) 16-28. doi: 10.15587/1729-4061.2018.142451.</p>
      <p>[54] V. Lytvyn, et al., Development of the system to integrate and generate content considering the cryptocurrent needs of users, Eastern-European Journal of Enterprise Technologies 1(2(97)) (2019) 18-39. doi: 10.15587/1729-4061.2019.154709.</p>
      <p>[55] P. Kravets, The Game Method for Orthonormal Systems Construction, in: Proceedings of the 9th International Conference - The Experience of Designing and Applications of CAD Systems in Microelectronics, 2007. doi: 10.1109/cadsm.2007.4297555.</p>
    </sec>
    <sec id="sec-5">
      <title>Appendices</title>
      <p>Table A
List by frequency rating of stable word combinations for 3 random articles
№ Author's
Q A</p>
      <p>Victana.lviv.ua (according to FREG, t-test
Zipf's law)</p>
      <p>B
1 Стиль автора Стоп-слово
2 Статистичний аналіз Метод визначення
3 Лінгвістичний аналіз Визначення стилю
4 Квантитативна лінгвістика Стиль автора
5 Авторська атрибуція Аналіз уривку
6 Визначення стилю Частота появи
7 Україномовні тексти Автор тексту
8 Технологія лінгвометрії Уривок тексту
9 Технологія стилеметрії Коефіцієнт кореляції
10 Технологія глоттохронології Дослідження тексту
F
Коефіцієнт кореляції
Відносна частота
Частота появи
Стопове слово
Україномовний текст
Стиль автора
Поява слова
Авторська атрибуція
Визначення стилю
Слова уривку</p>
      <p>2
G
Коефіцієнт
кореляції
Відносна частота
Частота появи
Авторська атрибуція
Стиль автора
Україномовний текст
Стопове слово
Визначення стилю
Поява слова
Слова уривку
Текстовий контент
Ключове слово
Тематичний словник
Текстовий контент
Тематичний словник
Ключове слово
Формування системою</p>
      <p>Web Mining
Лінгвістичний аналіз</p>
      <p>Слова контенту
Метод визначення</p>
      <p>Текстовий контент
Визначення слів
Слов’янськомовні
тексти
Технологія NLP
Аналіз статистики
Ключове словосполучення
Множина слів
Інформаційний ресурс</p>
      <p>Контент-аналіз
Information resource</p>
      <p>Content analysis
Диспозиції особистості
Соціальна мережа
Ключове словосполучення
Слова контенту
Множина слів
Формування системою
Контент-аналіз
Психологічна особистість
Контент-аналіз
Марковане слово
Психологічний зріз
Стан особистості
Формування зрізу
Зріз стану
Зріз особистості
In the work[1] in English</p>
      <p>Reference fragment
Words fragment
Syntactic words
Frequency fragment
Swadesh list
Stop words
Author style
Recognition author
Author’s text</p>
      <p>Anchor words
In the work[2] in English</p>
      <p>Text content
Web mining
Keywords text
Keywords defined
Analysis text
Keywords content
Content monitoring
Content analysis
Stop word</p>
      <p>Author’s keywords
In the work[3] in English</p>
      <p>Content analysis
Psychological personality
Psychological state
Social networks
Marked words
State personality
Based analysis
Psychological base
State based</p>
      <p>Based content
Контент-аналіз
Лінгвістичний аналіз
Морфологічний аналіз
Соціальна мережа
Формування зрізу
Зріз розуміння
Розуміння особистості
Україномовні тексти
Big-Five
Style of the author
Statistical analysis
Linguistic analysis
Quantitative linguistics
Author’s attribution
Recognition of style
Ukrainian texts
Linguometry technology
Stylemetry technology
Glottochronology
technology
Web Mining
Content monitoring
Content analysis
Porter stemmer
Linguistic analysis
Determining the
keywords
Slavic language
Slavic texts
Method for determining
Web technology
Content analysis
Linguistic analysis
Morphological analysis
Social network
Status of personality
Personality
understanding
Formation of the status
Stop words
Method of formation
Стоп- слово
Тематичний словник
Пости користувача
Повідомлення користувача
Користувач мережі
Стан особистості
Аналізована особистість
Соціальна мережа
Reference fragment
Author’s style
Author’s text
Syntactic words
Stop words
Formatted fragments
Anchor words
Author’s language
Method of anchor
Frequency dictionary
Text content
Content analysis
Analysis of statistics
Defined systematically
Stop word
Potential keywords
Content monitoring
Author’s keywords
Keywords content
Direct word
Psychological state
Personality analysis
Personality disposition
Psychological analysis
Personality model
Stop words
Psychological disposition
Content monitoring</p>
      <p>Social network</p>
      <p>Table B
Differences in methods according to the rating list of 100 stable word combinations
Q</p>
      <p>F</p>
      <p>G</p>
      <p>G</p>
      <p>F</p>
      <p>G
Контент-моніторінг</p>
      <p>Контент-моніторінг
Формування
системою
Web Mining
Слова контенту
Психологічна
особистість
Психологічний стан
Формування зрізу
Стан особистості
Марковане слово
Психологічний зріз
Контент-аналіз
Зріз стану
Аналізована
особистість
Соціальна мережа
Words fragment
Reference fragment
Stop words
Swadesh list
Recognition author
Syntactic words
Frequency fragment
Author’s text
Anchor words
Author style
Web mining
Text content
Keywords content
Keywords text
Keywords defined
Stop word
Analysis text
Author’s keywords
Content monitoring
Content analysis
Psychological
personality
Psychological state
Content analysis
Based analysis
State personality
Psychological base
Social networks
Marked words
State based
Psychological base
Формування
системою
Web Mining
Визначення слів
Слова контенту
Психологічна
особистість
Психологічний стан
Формування зрізу
Зріз стану
Марковане слово
Контент-аналіз
Психологічний зріз
Стан особистості
Соціальна мережа
Аналізована
особистість
Words fragment
Reference fragment
Recognition author
Stop words
Swadesh list
Syntactic words
Frequency fragment
Author’s text
Author style
Anchor words
Web mining
Text content
Keywords content
Analysis text
Keywords text
Keywords defined
Stop word
Content monitoring
Content analysis
Author’s keywords
Content analysis
Psychological
personality
Psychological state
Based analysis
Psychological base
State personality
Social networks
Psychological base
Marked words
State based
A
B
C
D
F
G
('контент_моніторінгу', 13)
ENG
('swadesh_list', 18)
('based_on', 15)
А2</p>
      <p>UA
ENG</p>
      <p>ENG
А3</p>
      <p>UA
А4</p>
      <p>UA
('тематичного_словника', 11)
('слов_янськомовних', 10)
('based_on', 20)
('slavic_language', 15)
('author_s', 13)
(('ключових', 'слів'), 72)
(('текстового', 'контенту'), 21)
(('на', 'етапі'), 17)
(('визначення', 'ключових'), 16)
(('крок', '1'), 16)
(('крок', '2'), 16)
(('web', 'mining'), 15)
(('слів', 'в'), 14)
(('тематичного', 'словника'), 11)
(('для', 'визначення'), 10)
(('of', 'the'), 134)
(('in', 'the'), 61)
(('by', 'the'), 45)
(('analysis', 'of'), 39)
(('of', 'a'), 31)
(('the', 'text'), 30)
(('the', 'system'), 30)
(('to', 'the'), 29)
(('of', 'keywords'), 28)
(('text', 'content'), 27)
(('ключових', 'слів'), 74)
(('слів', 'в'), 24)
(('web', 'mining'), 22)
(('текстового', 'контенту'), 21)
(('на', '2'), 20)
(('визначення', 'ключових'), 19)
(('ключових', 'в'), 19)
(('визначення', 'слів'), 18)
(('слів', 'для'), 18)
(('на', 'крок'), 18)
(('of', 'the'), 258)
(('the', 'of'), 235)
(('of', 'of'), 137)
(('the', 'the'), 122)
(('of', 'keywords'), 72)
(('in', 'the'), 71)
(('a', 'of'), 70)
(('and', 'of'), 69)
(('by', 'the'), 64)
(('of', 'content'), 63)
(('text', 'content'), 30)
(('web', 'mining'), 24)
(('keywords', 'text'), 23)
(('keywords', 'defined'), 22)
(('stage', '1'), 20)
(('analysis', 'text'), 18)
(('step', '2'), 18)
(('keywords', 'content'), 17)
(('content', 'monitoring'), 17)
(('step', '1'), 17)</p>
      <p>Work [3]
('психологічного_стану', 16)
('формування_зрізу', 12)
('sfx_a', 12)
('структурну_схему', 7)
('відкритість_досвіду', 6)
('зрізу_психологічного', 2)
('based_on', 35)
('psychological_state', 26)
('social_networks', 22)
('his_her', 11)
('following_structural', 8)
('big_five', 7)
('let_us', 7)
('structural_scheme', 4)
(('на', 'основі'), 21)
(('психологічного', 'стану'), 18)
(('контент', 'аналізу'), 16)
(('маркованих', 'слів'), 15)
(('зрізу', 'психологічного'), 14)
(('стану', 'особистості'), 14)
(('формування', 'зрізу'), 12)
(('особистості', 'на'), 12)
(('sfx', 'a'), 12)
(('основі', 'контент'), 11)
(('of', 'the'), 134)
(('is', 'the'), 117)
(('the', 'content'), 45)
(('of', 'a'), 43)
(('analysis', 'of'), 37)
(('based', 'on'), 35)
(('on', 'the'), 34)
(('in', 'the'), 33)
(('content', 'analysis'), 30)
(('the', 'process'), 27)</p>
      <p>Stop
word
але
в
для
до
з
і
й
мов
не
про
та
що
а
в
від
до
ж
з
за
і
й
на
над
не
ні
ось
от
се
хіба
хоч
що
як
1
2
3
1
1
14
1
1
2
2
2
1
2
3
1
1
1
2
1
2
2
1
1
2
1
1
1
1
1
1
2
1
RF in fragment</p>
      <p>RF in fragment
RF
0.0116
0.0074
0.0008
0.0012
0.0140
0.0034
0.0033
0.0129
0.0053
0.0300
0.0038
0.0159
0.0011
0.0237
0.0011
0.0004
0.0001
0.0088
0.0206
0.0028
0.0060
0.0011
0.0074
0.0033
0.0140
0.0033
0.0129
0.0053
0.0300
0.0022
0.0159
0.0237
0.0003
0.0018
0.0040
0.0074
0.0088
0.0027
0.0206
0.0028
0.0060
Stop
word
а
але
без
бо
в
від
ж
з
за
і
й
на
навіть
не
під
таки
тож
у
що
щоб
як
адже
але
би
в
ж
з
за
і
мов
на
не
отсе
при
про
се
у
чи
що
щоб
як
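</p>
      <table-wrap>
        <label>Table E</label>
        <caption>
          <p>The result of the algorithm of analysis of the author's style of the publication</p>
        </caption>
        <table>
          <thead>
            <tr><th>№</th><th>Iwt</th><th>Ikt</th></tr>
          </thead>
          <tbody>
            <tr><td>1</td><td>0.76</td><td>0.015</td></tr>
            <tr><td>2</td><td>0.74</td><td>0.012</td></tr>
            <tr><td>3</td><td>0.78</td><td>0.016</td></tr>
            <tr><td>4</td><td>0.74</td><td>0.019</td></tr>
            <tr><td>5</td><td>0.74</td><td>0.012</td></tr>
            <tr><td>6</td><td>0.75</td><td>0.015</td></tr>
            <tr><td>7</td><td>0.75</td><td>0.019</td></tr>
            <tr><td>8</td><td>0.74</td><td>0.013</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Table F
Frequencies of appearance of letters in the standard and the studied passages
Letter</p>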
      <p>Fragment 1
AF RF</p>
      <p>Fragment 2
AF RF</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y. H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Tai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. E.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. F.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <article-title>Identification of highly-cited papers using topic-model-based and bibliometric features: The consideration of keyword popularity</article-title>
          ,
          <source>Journal of Informetrics</source>
          <volume>14</volume>
          (
          <issue>1</issue>
          ) (
          <year>2020</year>
          )
          <fpage>101004</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cheikhrouhou</surname>
          </string-name>
          , et al.,
          <article-title>Multi-task learning for simultaneous script identification and keyword spotting in document images</article-title>
          ,
          <source>Pattern Recognition</source>
          <volume>113</volume>
          (
          <year>2021</year>
          )
          <fpage>107832</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mahrishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Meena</surname>
          </string-name>
          ,
          <article-title>A comprehensive review of recent automatic speech summarization and keyword identification techniques</article-title>
          ,
          <source>AI in Industrial Applications: Approaches to Solve the Intrinsic Industrial Optimization Problems</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>111</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kenekayoro</surname>
          </string-name>
          ,
          <article-title>Author and keyword bursts as indicators for the identification of emerging or dying research trends</article-title>
          ,
          <source>J. Sci. Res.</source>
          <volume>9</volume>
          (
          <issue>2</issue>
          ) (
          <year>2020</year>
          )
          <fpage>120</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Berko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matseliukh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ivaniv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chyrun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Schuchmann</surname>
          </string-name>
          ,
          <article-title>The text classification based on Big Data analysis for keyword definition using stemming</article-title>
          ,
          <source>in: Proceedings of IEEE 16th International conference on computer science and information technologies</source>
          , Lviv, Ukraine,
          22-25 September,
          <year>2021</year>
          , pp.
          <fpage>184</fpage>
          -
          <lpage>188</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Taran</surname>
          </string-name>
          ,
          <article-title>The Role of Keyword Language in the Database of World Slavic linguistics "iSybislaw"</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          <volume>3171</volume>
          (
          <year>2022</year>
          )
          <fpage>266</fpage>
          -
          <lpage>276</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Bondarchuk</surname>
          </string-name>
          , et al.,
          <article-title>Keyword-based Study of Thematic Vocabulary in British Weather News</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          <volume>3171</volume>
          (
          <year>2022</year>
          )
          <fpage>451</fpage>
          -
          <lpage>460</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>O.V.</given-names>
            <surname>Bisikalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wójcik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.V.</given-names>
            <surname>Yahimovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Smailova</surname>
          </string-name>
          ,
          <article-title>Method of determining of keywords in English texts based on the DKPro Core</article-title>
          ,
          <source>in: Proceedings of SPIE - The International Society for Optical Engineering</source>
          ,
          <year>2016</year>
          ,
          <volume>10031</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Campos</surname>
          </string-name>
          , et al.,
          <article-title>YAKE! Keyword extraction from single documents using multiple local features</article-title>
          ,
          <source>Information Sciences</source>
          <volume>509</volume>
          (
          <year>2020</year>
          )
          <fpage>257</fpage>
          -
          <lpage>289</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yadav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. N.</given-names>
            <surname>Thakur</surname>
          </string-name>
          ,
          <article-title>Web page ranking using web mining techniques: a comprehensive survey</article-title>
          ,
          <source>Mobile Information Systems</source>
          <volume>2022</volume>
          (
          <issue>1</issue>
          ) (
          <year>2022</year>
          )
          <fpage>7519573</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>