<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Conference on Computer Sciences and Information Technologies, Lviv</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/CSIT52700.2021.9648682</article-id>
      <title-group>
        <article-title>Correlation Analysis of Text Author Identification Results Based on N-Grams Frequency Distribution in Ukrainian Scientific and Technical Articles</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Victoria Vysotska</string-name>
          <email>victoria.a.vysotska@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oksana Markiv</string-name>
          <email>oksana.o.markiv@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sofiia Teslia</string-name>
          <email>sofiia.teslia.sa.2019@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yeva Romanova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Inesa Pihulechko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>S. Bandera Street, 12, Lviv, 79013</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Osnabrück University</institution>
          ,
          <addr-line>Friedrich-Janssen-Str. 1, Osnabrück, 49076</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>85</volume>
      <fpage>1</fpage>
      <lpage>11</lpage>
      <abstract>
        <p>The results of experimental approbation of the proposed content-monitoring method for determining author style in Ukrainian scientific texts of technical profile have been studied. Authorship identification systems typically rely on plagiarism and rewrite metrics, i.e. on determining whether a work has been borrowed fully or partially; the situation in which the work has not yet been published is therefore not covered. Quantitative content analysis of scientific and technical texts uses the advantages of content monitoring and of text analysis based on NLP, Web-Mining and stylometry methods to identify the set of authors whose speech styles are similar to the studied passages. This narrows the search space for subsequent stylometric methods that estimate the degree to which the analyzed text belongs to a particular author. The author-determination method has been decomposed into the analysis of such speech coefficients as lexical diversity, the degree (measure) of syntactic complexity, speech coherence, and the text exclusivity and concentration indices. In parallel, parameters of the author style have been analyzed: the numbers of words, sentences, prepositions and conjunctions in the text, and the number of words with a frequency of 1, 10 or more.</p>
        <p>Keywords: N-grams, NLP, correlation analysis, authorship definition, Ukrainian text, distribution function density, exponential and median smoothing, linguometry, stylometric analysis.</p>
        <p>COLINS-2022: 6th International Conference on Computational Linguistics and Intelligent Systems, May 12-13, 2022, Gliwice, Poland.</p>
        <p>ORCID: 0000-0001-6417-3689 (V. Vysotska); 0000-0002-1691-1357 (O. Markiv); 0000-0002-2591-2431 (S. Teslia); 0000-0003-0522-0806 (Y. Romanova); 0000-0003-2789-2902 (I. Pihulechko).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the increasing availability and distribution of text documents in electronic form, the
importance of automatic methods for analyzing document content has grown [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">1-3</xref>
        ].
Text-analysis tasks include the classification and clustering of documents [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8">4-7</xref>
        ] by
various criteria, such as genre, writing format (novel, essay), emotional coloring and speech style, as well
as the task of text author identification [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref15 ref9">8-14</xref>
        ].
      </p>
      <p>With simplified access to various data and the growing ability to search, copy and distribute
data over networks, the task of identifying an author has become urgent. Questions related to the determination
of authorship are also important in linguistic, historical and forensic research. The general availability
of electronic devices makes it possible to move author recognition by large panels of experts into the
background, speeding up and simplifying this process through automation.</p>
      <p>
        Author identification is defined as the process of identifying an author from the
set of general and particular features of the text that constitute the author style [
        <xref ref-type="bibr" rid="ref10 ref9">8-9</xref>
        ].
      </p>
      <p>© 2022 Copyright for this paper by its authors.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        Statistical methods based on the search for an "author invariant" are popular in existing systems for
determining text authorship. The "author invariant" characterizes linguistic features of the text (lexical,
grammatical, phraseological and others). The invariant can be, for example, the share of vowels
or consonants, the frequency of use of certain parts of speech, the probabilities of transitions from one part
of speech to another, "favorite" words, information entropy, etc. The authors proposed a statistical method
for determining the author and genre of a text based on the frequency distribution of letter combinations
(n-grams) [
        <xref ref-type="bibr" rid="ref11 ref12 ref13">10-12</xref>
        ]. This method has shown decent results for works in Slavic-language research publications.
Unfortunately, the accuracy of statistical authorship-determination methods depends on the specifics
of the data: the language, style and length of the studied texts [
        <xref ref-type="bibr" rid="ref14 ref15 ref16 ref17 ref18 ref19 ref20 ref21 ref22 ref23 ref24">13-28</xref>
        ]. Because of this, it
is difficult to draw conclusions about the accuracy of such an approach on data of a different nature. For this reason,
the aim of this work is to analyze the application of the distribution of letter combinations
in different languages to the problem of establishing the authorship of texts
of different lengths written in different language styles. The chosen topic, the relative frequency of
n-grams, is only beginning to gain popularity in Ukraine. Several literature sources
describing what n-grams are and what they are used for have been found.
      </p>
      <p>An n-gram is a sequence of n elements [29]. From a semantic point of view, it can be a sequence of
sounds, syllables, words or letters. In practice, n-grams most often appear as series of words or stable
phrases called collocations. A sequence of two consecutive elements is called a bigram and a
sequence of three elements a trigram; both are present in the studied dataset.
Sequences of four or more elements are denoted as N-grams, with N replaced by the number of consecutive
elements. N-grams are used across a wide range of sciences: for example, in
theoretical mathematics, biology, cartography, and music. The most common uses
of N-grams include extracting data for clustering series of satellite images of the Earth,
deciding which specific parts of the Earth appear in an image, searching for genetic
sequences, data compression, and indexing data (usually sound-related) in search engines.
In natural language processing, n-grams are used mainly for prediction
based on probabilistic models: the n-gram model calculates the probability of the last word of the
n-gram given all the previous ones. When modeling language this way, it is assumed
that the appearance of each word depends only on the previous words [30]. Another application of n-grams
is plagiarism detection: if the text is divided into several small fragments represented by
n-grams, they are easy to compare with each other, which yields a degree of similarity between the controlled
documents [31]. N-grams are also used successfully to categorize text and identify language. In addition, they
can serve as features for gaining knowledge from textual data; for example, using n-grams,
one can efficiently find replacement candidates for words with spelling mistakes. Google Research
has used n-gram models in a wide range of research and development projects, such
as statistical translation from one language to another, language recognition, spelling correction,
information retrieval, and more. These projects used text corpora containing several trillion words,
and Google decided to create its own training corpus. The project
is called the Google tera corpus and it contains 1,024,908,267,229 words collected from public websites
[32].</p>
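<p>The word-level n-gram extraction and the conditional-probability estimate described above can be sketched as follows; this is an illustrative toy example (the sample sentence and helper names are assumptions, not from the paper):</p>

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-element tuples of consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bigram_prob(tokens, prev, word):
    """Estimate P(word | prev) from bigram and unigram counts."""
    bigrams = Counter(ngrams(tokens, 2))
    unigrams = Counter(tokens)
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

tokens = "the cat sat on the mat the cat ran".split()
print(ngrams(tokens, 3)[:2])                # first two trigrams
print(bigram_prob(tokens, "the", "cat"))    # 2/3
```

<p>The same sliding-window idea applies to letter-level 3-grams used later in the paper.</p>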
      <p>
        For a long time, the decryption of cryptograms has been aided by frequency analysis, the essence of which is the
study of statistical patterns in the appearance of symbols and their combinations in original and encrypted
messages [
        <xref ref-type="bibr" rid="ref2">1</xref>
        ]. To complicate frequency analysis, ciphers have appeared in cryptography that
lead to a uniform distribution of characters in the cryptogram. The principles of frequency analysis are
widely used in password-cracking programs and can reduce the search time by several orders of magnitude
[
        <xref ref-type="bibr" rid="ref3">2</xref>
        ], as well as in the classification and clustering [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8">4-7</xref>
        ] of documents by various criteria, such as genre, epoch,
format (novel, essay), emotional coloring and speech style, and in the task of determining the text author
[
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref15 ref16 ref9">8-15</xref>
        ]. Obviously, frequency analysis first of all requires the reference repetition frequencies of the letters
of the alphabet in which the open texts are written, as well as the repetition frequencies of n-grams. For Ukrainian,
English and almost all European languages, the average repetition frequencies of letters, bigrams and trigrams
can be found in the literature [
        <xref ref-type="bibr" rid="ref17 ref18 ref19 ref20 ref21 ref22 ref23">16-22</xref>
        ].
      </p>
      <p>
        Unfortunately, for the Ukrainian language only the repetition frequencies of letters are given in the
literature [
        <xref ref-type="bibr" rid="ref24">23-25</xref>
        ]. Therefore, the purpose of this work is to investigate the repetition frequencies of letters
and letter combinations of the Ukrainian language on the basis of randomly selected Ukrainian texts
of scientific and technical orientation. The analysis of the obtained data confirms that for the Ukrainian
language, as for other European languages, the alternation of vowels and consonants is inherent.
Other texts may differ in the frequencies of the given letters, which is
explained, firstly, by the length of the studied text and, secondly, by its subject matter. For example,
the letter ф, rare in general use, can become quite common in technical texts, because it is used in such words
as function, differential, diffusion, coefficient, etc. Even greater deviations from the traditional use of
individual letters are observed in some works of art, especially in poems.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods and Materials</title>
      <p>
        Modern systems for determining text authorship use different approaches from
mathematical statistics, pattern recognition and probability theory, cluster-analysis algorithms, neural
networks and others [33-45]. The systems differ in the method of author identification, the means of
text analysis, the required amount of text, and accuracy [
        <xref ref-type="bibr" rid="ref13">12</xref>
        ]. Methods of text-authorship identification
based on the calculation of various text characteristics (auxiliary parts of speech, prepositions, conjunctions,
particles, independent parts of speech, nouns, verbs, adjectives, word lengths, sentence lengths) also
differ in how frequencies are compared across textual content for different tasks [46-63]. The most
commonly used measures for comparing texts are the following: information entropy, Fisher
information, the chi-squared test and the Kullback-Leibler divergence [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13">9-12</xref>
        ].
      </p>
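<p>Three of the comparison measures named above can be sketched directly over relative-frequency vectors; this is a minimal illustration (the toy distributions and the epsilon guard are assumptions; Fisher information is omitted for brevity):</p>

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a frequency distribution; zero entries are skipped."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q); eps guards against division by zero."""
    return sum(x * math.log2(x / max(y, eps)) for x, y in zip(p, q) if x > 0)

def chi_squared(obs, exp, eps=1e-12):
    """Chi-squared statistic between observed and expected counts."""
    return sum((o - e) ** 2 / max(e, eps) for o, e in zip(obs, exp))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(round(entropy(p), 4))
print(round(kl_divergence(p, q), 4))
```

<p>Any of these measures can serve as the "distance" between two texts' frequency profiles; the choice affects which stylistic differences are emphasized.</p>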
      <p>
        When identifying the author of a text, it is assumed that the text reflects the individual style of its
author, which makes it possible to distinguish it from others. To compare texts with each other, it is necessary
to associate each text with some numerical characteristic that is close for texts of the same author
and differs for the works of various authors. As such a characteristic of the author, the article [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13">9-12</xref>
        ]
uses the distribution function density (DFD) of letter combinations of three consecutive characters
(3-grams). The DFD is defined as the set of empirical occurrence frequencies of letters or their combinations. The
analysis of the text with the help of the DFD does not take into account punctuation
marks, spaces or numbers.
      </p>
      <p>The task of identifying the author of an unknown text in terms of the DFD is formulated as follows.</p>
      <p>Given is a set of texts that contains works of known authors. Let N<sub>a</sub> be the number of works by the
a-th author and V<sub>a,i</sub> the number of symbols in the i-th work of the a-th author, i = 1, ..., N<sub>a</sub>. All texts in this
set are represented in the form of DFDs. The DFD of a text whose volume equals V<sub>a,i</sub> is given
as the set of values p<sub>a,i</sub>(j) = m<sub>j</sub>/V<sub>a,i</sub>, where m<sub>j</sub> is the number of occurrences of the n-gram with index j. The argument j =
1, ..., M(S, n) corresponds to the index of the letter combination (n-gram) in alphabetical order, where S is the power
of the alphabet of the language in which the text is written and n is the order of the n-grams, i.e. the number
of characters in the letter combination. M(S, n) = S<sup>n</sup> is the number of n-grams over this alphabet.</p>
      <p>Each author is identified with his weighted average DFD, which is given by formula (1):</p>
      <p>p<sub>a</sub>(j) = (1/V<sub>a</sub>) ∑<sub>i=1..N<sub>a</sub></sub> V<sub>a,i</sub> p<sub>a,i</sub>(j),  V<sub>a</sub> = ∑<sub>i=1..N<sub>a</sub></sub> V<sub>a,i</sub>.  (1)</p>
      <p>
        These DFDs will play the role of author standards [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13">9-12</xref>
        ]. To compare two texts, or a text
and an author standard, it is necessary to specify the distance between the corresponding distribution
functions. The norm in the space of summable functions is used as the distance metric. For example, the
distance ρ<sub>0,a</sub> between the DFD p<sub>0</sub> of the unknown text and any author DFD p<sub>a</sub> is calculated by
formula (2):
      </p>
      <p>ρ<sub>0,a</sub> = ‖p<sub>0</sub> − p<sub>a</sub>‖ = ∑<sub>j=1..M</sub> |p<sub>0</sub>(j) − p<sub>a</sub>(j)|.  (2)</p>
      <p>Accordingly, the text «0» is attributed to the author for whom the distance to the DFD is the shortest.</p>
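<p>The nearest-author rule based on the L1 distance between distribution function densities can be sketched as follows; the two toy author profiles are illustrative assumptions:</p>

```python
def l1_distance(p, q):
    """Sum of absolute differences between two DFDs (the summable-function norm)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def nearest_author(text_dfd, author_dfds):
    """Attribute the text to the author whose reference DFD is closest."""
    return min(author_dfds, key=lambda name: l1_distance(text_dfd, author_dfds[name]))

# Toy 3-bin profiles standing in for full n-gram DFDs.
authors = {"A": [0.5, 0.3, 0.2], "B": [0.2, 0.3, 0.5]}
print(nearest_author([0.45, 0.35, 0.2], authors))  # "A"
```
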
      <p>When solving the classification problem, the data set was not explicitly divided into test and training
sets. The weighted average DFDs were built on the whole set of books by one author. The distance from the
i-th book to "its" a-th author is calculated by formula (3):</p>
      <p>ρ<sub>a,i</sub> = ‖p<sub>a,i</sub> − p<sub>a</sub>‖ / (1 − V<sub>a,i</sub>/V<sub>a</sub>).  (3)</p>
      <p>
        The denominator in formula (3) excludes the contribution of the DFD of the i-th document to the average DFD of
"its" author [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13">9-12</xref>
        ]. Smoothing the 3-gram distribution functions analytically
is impossible because the function is too complex. Only an algorithmic approach is available, for
which we can rely on the main methods: the simple (ordinary)
moving average, the weighted moving average, exponential smoothing, and median smoothing. In our case, we
believe that the best choice is the moving-average method, which is also
known as the filtering method. Its application reduces the variability of the data. This fits our
analytically chosen tactic of ignoring extreme data, highs and lows. The degree of smoothing should
be fixed in advance by a criterion that ensures maximum smoothing while still retaining information.
      </p>
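<p>A minimal sketch of the simple moving average applied to a frequency sequence; the window size and the sample values are illustrative assumptions (shorter windows are used at the sequence edges):</p>

```python
def moving_average(seq, window=3):
    """Simple (centered) moving average; the window shrinks at the edges."""
    half = window // 2
    out = []
    for i in range(len(seq)):
        chunk = seq[max(0, i - half):i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Toy 3-gram frequency sequence with sharp highs and lows.
smoothed = moving_average([0.01, 0.09, 0.02, 0.08, 0.03])
print([round(v, 4) for v in smoothed])
```

<p>As the text notes, the filtering effect suppresses extreme values: the peak 0.09 shrinks after smoothing.</p>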
      <p>In our specific case, we believe that correlation analysis of the 3-gram data sequences of any two of the
three selected articles can help to determine their relationship, and thus help to answer the question of
how similar the topics of the articles are. To do this, the values of the first studied article can be denoted by
the variable x and the set of values of the second article by the variable y, and a correlation analysis
of the paired sequence XY performed. The task of correlation is not only to assess connectivity, but also to
reduce the target score to a numerical expression. The method of studying 3-gram sequences allows a
significant reduction in the number of variables that are taken into account as important. Combining
the metric-related groups forms a new cluster, which compares the closeness metrics with others,
and it is possible to end up with a fairly clear structure of the data set. The quantitative method of
identifying the potential author of a text from the set of possible ones, on the basis of the results of comparing
the reference text with the researched one, is based on the technique of linguometry.</p>
      <p>Linguometry is a branch of applied linguistics that detects, measures and analyzes the quantitative
characteristics of units of different levels of language or speech [33]. Using the apparatus of mathematical
statistics, linguometry is involved in solving such problems of linguistics as:
• dictionaries (including frequency and statistical ones) and comparisons
• automatic dictionaries and thesauri
• shorthand systems
• methods and means of automatic language detection
• methods and means of information retrieval, etc.</p>
      <p>Each language has its own statistical parameters, and knowledge of the occurrence frequencies of
letters and their combinations (2-grams, 3-grams, 4-grams) allows a language to be identified automatically. For
example, for Ukrainian texts [34-37] it was found that statistical parameters of styles can take into account the
frequencies of vowels, consonants, spaces between words, as well as soft and sonorous groups of
consonants [33]. We will show how to evaluate the speech of a particular author on a particular passage
of his work [36] using a certain standard, for example, the Ukrainian letter frequencies. Consider
two passages of technical text in Ukrainian, presented in a format where the letters are arranged in
descending order of their frequency of appearance (the frequencies are given in Table 1); no distinction
between lowercase and uppercase letters has been made. The correlation between the letter frequencies
of the passages [35] and the standard [36] has been investigated. The results that confirm the
conclusions have been presented, in particular, graphically.</p>
      <p>For convenience, Table 1 contains the following data: the frequencies of Ukrainian
letters in general use, and the absolute and relative frequencies of the letters used in the studied Passage 1 (Article 1) [35] and
Passage 2 (Article 2) [36]. Passage 1 contains 556 characters; Passage 2 contains 541 characters. The
category "other" in the letters column contains letters authentic to the Ukrainian language (ї, є, ґ,
і), which are rarely used in most technical texts. This allows some independence in the
analysis to be achieved. Fig. 1 illustrates the obtained results graphically.</p>
      <p>The graphical representation of the relative letter frequencies in the passages gives a convincing
answer to the question of which passage was written by which author.</p>
      <p>The distribution of 1-grams in the works differs. The optimal indicators for studying the texts come from the
analysis of 3-grams [38-44]; we will check this in the next stages of the study. There is a sharp jump
in the relative frequency of occurrence of the letter "е" for Passage 2 relative to the reference values of
Standard 1 [36] (Fig. 2), so we assume that Standard 1 was more likely written by the author
of Passage 1 [35]. We also give the numerical values of the correlation between the letter frequencies of the
passages and the standard. We find two correlation coefficients: for the standard and Passage 1 [35] and
for the standard and Passage 2 [37]; the coefficient closer to 1 indicates the passage more
likely to belong to the standard. Calculation of the correlation coefficient for the standard and Passage 1
gives Re-У1=0.962716, and the correlation coefficient for the standard and Passage 2 is Re-У2=0.909958.
Similarly, the values of the relative frequencies in Standard 2 and Passages 1 and 2 in Fig. 3 differ significantly,
so it is likely that the author of Standard 2 [34] is not the author of Passages 1 and 2.</p>
      <p>[Figure: relative frequencies (scale 0-0.4) of the most common symbols - space (Пропуск), о, а, н, и, в, т, е, р, с - in the passages and the standard.]</p>
      <p>Step. 7. Determine the number of connectors S.</p>
      <p>Step. 8. Calculate the coefficients of author speech.</p>
      <p>Step. 9. Output the results to the end user (Table 2, Fig. 1).
The Web-resource for stylistic analysis has the following fields (Fig. 5):
• Select Passage 1 (2, 3) opens access to the excerpts. Access to the next passage is granted only after
activating access to the previous one; access is opened sequentially from a smaller number to a
larger one.
• Reference text is the field into which the reference text is copied from the buffer.
• "The text you enter must be at least 100 characters long. (Now 0)": after the calculation starts,
the actual number of characters of each passage is calculated and displayed separately.
• Passage 1 (2, 3) is the field into which the corresponding excerpt text is copied from the buffer.
• Calculate starts the calculation.
• Clear clears the entered data.</p>
      <p>Algorithm 2. Stylometric analysis of the text to determine authorship.</p>
      <p>Step. 1. Check the lengths of the standard text and the selected passages and reduce the length of the
reference text to the minimum of those checked.</p>
      <p>Step. 2. Clean the reference text from special characters, etc.</p>
      <p>Step. 3. Determine the number of words in the text of the standard.</p>
      <p>Step. 4. Determine the number of stop words (prepositions + conjunctions + particles) in the text of
the standard (Fig. 5-6).</p>
      <p>Step. 5. Ensure that the length of Passage 1 is not more than the minimum text length.</p>
      <p>Step. 6. Clear Passage 1 from special characters, etc.</p>
      <p>Step. 7. Determine the number of words W1 for Passage 1.</p>
      <p>Step. 8. Determine the number of stop words (prepositions + conjunctions + particles) in the text.
Step. 9. Prepare individual arrays (excerpt and standard) to calculate the correlation coefficient (Fig.
Step. 10. Call the function to calculate the correlation coefficient.</p>
      <p>Step. 11. Form an array for the graphical representation of the relative frequency of stop words in
Passage 1 and in the standard.</p>
      <p>Step. 12. Call the function to calculate the relative frequency distribution graph (Fig. 6).</p>
      <p>Step. 13. Call the function to calculate the correlation coefficient of Passage 2 (3) for each of the
service words.</p>
      <p>Step. 14. Form the words of the Swadesh list from the reference book and determine the number of
words from the Swadesh list in the text of the passage.</p>
      <p>Step. 15. Form the common word set for the Standard, Passages 1-3 and the Swadesh list.</p>
      <p>Step. 16. The results of the study are displayed on the screen.</p>
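<p>The word- and stop-word-counting core of Algorithm 2 (steps 2-4 and 6-8) can be sketched as follows. The tiny stop-word list and the sample sentence are illustrative assumptions; the real reference list of prepositions, conjunctions and particles would be far larger:</p>

```python
import re

# Hypothetical mini stop-word list (prepositions, conjunctions, particles).
STOP_WORDS = {"і", "та", "в", "на", "з", "до", "не", "що", "як", "же"}

def clean(text):
    """Steps 2/6: keep only letters and whitespace, lowercase the text."""
    return re.sub(r"[^а-щьюяєіїґa-z\s]", " ", text.lower())

def stop_word_profile(text):
    """Steps 3-4/7-8: total word count and per-stop-word relative frequencies."""
    words = clean(text).split()
    total = len(words)
    return total, {sw: words.count(sw) / total for sw in sorted(STOP_WORDS)}

total, profile = stop_word_profile("Аналіз і синтез, як і раніше, виконано на основі даних")
print(total, profile["і"])
```

<p>The resulting per-stop-word frequency arrays for a passage and the standard are exactly what steps 9-13 feed into the correlation-coefficient function.</p>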
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <p>When identifying the author of a text, it is assumed that the text reflects the individual style of
the author's writing, which distinguishes him from others. In order to compare texts with each other, it is
necessary to associate each text with numerical characteristics that would be close for texts of
the same author and significantly different for the works of different authors.</p>
      <p>The Web-resource for the analysis of N-grams has the following fields (Fig. 7):
• Number of grams - the number of characters in a gram. The default is 3; it can be changed to 1,
2, 3 or 4.
• Choice of the text language - the language of the text for analysis (research). The default is
"Ukrainian".
• Text - the field into which the studied text is copied from the buffer.
• Restriction of the text length in characters.
• Generation - starts generating the N-grams.
• Clearance - clears the entered data.</p>
      <p>Algorithm 3. Linguistic and statistical analysis of the N-grams of a text is the following:
Step. 1. Purify the studied text (remove numbers and special symbols).</p>
      <p>Step. 2. Calculate the number of words in the text.</p>
      <p>Step. 3. All words of the text are converted to lower case.</p>
      <p>Step. 4. Remove the spaces.</p>
      <p>Step. 5. Depending on the selected language, the corresponding alphabet is substituted.</p>
      <p>Step. 6. Depending on the set number of grams, the corresponding function is started, which calculates all
possible variants of grams and saves them in an array.</p>
      <p>Step. 7. The function counting the number of occurrences of each gram is started.</p>
      <p>Here we calculate the relative occurrence frequency and store in the array the ordinal number
of the gram, the gram itself, the number of occurrences of this gram, and its relative occurrence
frequency.</p>
      <p>Step. 8. The next function forms the array received from the previous function for export to a
CSV file. This file is stored on the server. It can be downloaded to the computer of the user (researcher)
via a link, which becomes accessible after the form with the results of the study is generated.</p>
      <p>Step. 9. The results of the study are displayed on the screen (only those grams that are found in the
text).</p>
      <p>Step. 10. Access the export file.</p>
      <p>Step. 11. The generalized results are output:
• the N-grams found, with repetitions
• the N-grams found, without repetitions
• the total number of N-grams
• the number of characters in the fully cleaned text
• the number of characters in the text with spaces
• the number of words in the text
• the size of the alphabet.</p>
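<p>The core of Algorithm 3 (cleaning, gram generation, counting, relative frequencies, CSV export) can be sketched as follows; the sample text and the CSV column layout are illustrative assumptions:</p>

```python
import csv
import io
from collections import Counter

ALPHABET = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"  # 33 Ukrainian letters

def char_ngram_frequencies(text, n=3):
    """Steps 1-7: lowercase, drop non-letters and spaces, count the n-grams found."""
    cleaned = "".join(ch for ch in text.lower() if ch in ALPHABET)
    grams = [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return {g: (c, c / total) for g, c in counts.items()}

def export_csv(freqs):
    """Step 8: rows of ordinal number, gram, occurrences, relative frequency."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for i, (gram, (count, rel)) in enumerate(sorted(freqs.items()), start=1):
        writer.writerow([i, gram, count, round(rel, 6)])
    return buf.getvalue()

freqs = char_ngram_frequencies("аналіз тексту автора")
print(len(freqs), export_csv(freqs).splitlines()[0])
```

<p>Only grams that actually occur in the text are stored, matching step 9 of the algorithm.</p>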
      <p>Three publications of scientific and technical orientation have been compared on the basis of linguistic and statistical
analysis of 3-grams. Articles 1 and 2 have been written by one team, Article 3 has
been written by another author (Table 3). The language of the texts is Ukrainian (33 letters in the alphabet,
35937 possible 3-grams).</p>
      <p>When comparing the articles, only those 3-grams that occur at least once in all three
articles simultaneously have been taken into account. For this particular example, there are 2147 such
3-grams. That is, 78.4814% of the 3-grams have been analyzed for Article 1, 72.6332% for Article 2 and
84.1271% for Article 3. Accordingly, the difference in use of the relevant 3-grams is R12=56.5254% between
Articles 1 and 2, R23=69.4271% between Articles 2 and 3, and R13=62.9839% between Articles 1 and
3. These indicators already show that the characteristics of Articles 1 and 2 are
more similar (R23&gt;R12 by 12.9017%, R23&gt;R13 by 6.4432%, R13&gt;R12 by 6.4585%, that is, R23&gt;R13&gt;R12)
than the characteristics of Articles 1-3 and 2-3. The smaller the Rij, the more likely it is that the
articles are written by the same author. Then Articles 1 and 2 are more likely to have been written
by one author/team than Articles 2-3 and Articles 1-3, respectively. But we will also analyze the use of
individual clusters of 3-grams in the relevant articles and compare the results.</p>
      <p>Fig. 8 presents the results of the analysis of the use in Articles 1-3 of 3-grams starting with the letter а
(their share in Articles 1-3 is in the range of 6.1125-6.7087%). Most often the curves for Articles
1-2 (4.2322%) and Articles 1-3 (4.197%) coincide or approach each other (the average discrepancies are
0.02713% and 0.0269%, respectively). But there is not always a coincidence with Articles 2-3 (4.6322%),
where there are significant differences (the average difference is 0.02969%). If only such
3-grams are analyzed, it turns out that all three articles are more likely written by one author. This is due to the fact that
this letter is one of the most commonly used in the formation of Ukrainian words.</p>
      <p>[Figure: curves ||p1-p2||, ||p2-p3||, ||p1-p3|| for 3-grams starting with а (ааб ... ачу), frequency scale up to 0.006.]</p>
      <p>Fig. 9 presents the results of the analysis of the use in Articles 1-3 of 3-grams starting with the letter б (letter
b in English) (their share in Articles 1-3 is in the range of 0.48884-0.77738%). Most often the curves
for Articles 1-2 (0.594%), as opposed to Articles 1-3 (0.7072%) and Articles 2-3 (1.1208%), coincide or
approach each other. But the trajectories of the curves of Article 1 and Article 3 also often coincide (most likely these articles
are written by one author; the average discrepancy is 0.01809%, while for Articles 1-2 it is 0.0261% and for
Articles 2-3 0.02866%). If only such 3-grams (which are less common) are analyzed, it turns out that
Articles 1-2 are more likely written by one author, and Article 3 by another. This is due to the fact
that this letter is rare in the formation of Ukrainian words, and some authors use such words
more often out of habit and/or because of the subject matter of their publications (this requires
further research).</p>
      <p>[Figure residue: axis values, the 1-gram frequency ranking (о, н, а, т, і, и, в, с, е, р, к, л, у, д, м, п, я, з, ь, ч, г, х, б, ж, й, ц, є, ф) and the legend ||p1-p2||, ||p2-p3||, ||p1-p3||.]</p>
      <p>[Figure: curves ||p1-p2||, ||p2-p3||, ||p1-p3|| for 3-grams starting with б (баг ... бчи), frequency scale up to 0.004.]</p>
      <p>According to Table 4 and Fig. 10-12, some letters of the Ukrainian language are used most often,
others much less often. For the most frequently used letters, the occurrence frequencies of
3-grams with such initial letters have almost the same distribution (top values in the graph of Fig.
12), while for other letters they do not.</p>
      <p>0
0,004
0,002
0</p>
      <p>Therefore, to determine the degree to which a text belongs to an author, it is advisable to study only
the trigrams with initial letters that are less common in the texts of a particular language (for example,
Fig. 12). Thus, for 3-grams beginning with the letter є (occurrence in Articles 1-3 in the range
0.2517-0.707%), the curves for Articles 1-2 (0.2508%) coincide or approach each other most often, in
contrast to Articles 1-3 (0.6077%) and Articles 2-3 (0.5443%). The trajectories of the curves for
Article 1 and Article 2 often coincide (the articles were most likely written by one author: the average
discrepancy is 0.0114%, while for Articles 2-3 it is 0.02478% and for Articles 1-3 it is 0.02762%, more
than twice as high).</p>
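      <p>The per-letter comparison above rests on the average discrepancy between two articles' relative trigram frequencies for a fixed initial letter. A minimal sketch of this metric (the sample texts and the helper names trigram_freqs and mean_discrepancy are ours, not the paper's):</p>

```python
def trigram_freqs(text):
    """Relative frequencies of letter 3-grams in a text (spaces ignored)."""
    letters = [c for c in text.lower() if c.isalpha()]
    grams = ["".join(letters[i:i + 3]) for i in range(len(letters) - 2)]
    total = len(grams)
    freqs = {}
    for g in grams:
        freqs[g] = freqs.get(g, 0) + 1
    return {g: n / total for g, n in freqs.items()}

def mean_discrepancy(p1, p2, initial):
    """Average |p1(g) - p2(g)| over trigrams starting with `initial`
    (the ||p1-p2|| curves in the figures)."""
    grams = {g for g in list(p1) + list(p2) if g.startswith(initial)}
    if not grams:
        return 0.0
    return sum(abs(p1.get(g, 0.0) - p2.get(g, 0.0)) for g in grams) / len(grams)

# Illustrative texts, not the studied articles.
p1 = trigram_freqs("єдність єдиного європейського єства")
p2 = trigram_freqs("єдине єство єднає європейські єдності")
print(round(mean_discrepancy(p1, p2, "є"), 5))
```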
      <p>Table 4 shows the frequencies of letter appearance in the standard and the studied passages. Fig. 14
shows histograms of the relative frequency of n-grams in Articles 1-3. Low-frequency (noise) values are
the most common and form the main volume of the data; we can ignore them (Fig. 15).</p>
      <p>All graphs of the distribution of 3-gram frequencies in the articles show a clearly noticeable
gradation of 3-grams into underused (noise-like) values and widely used peak values. The specific
examples of the three articles show that, to reduce the amount of information analyzed, it is desirable
to proceed to the analysis of the distribution function from a certain threshold frequency value while
still covering the main information content. To compare the distribution function across the three
studied articles, it is necessary to compare the clearly expressed average values. After analyzing the
most commonly used 3-grams, we conclude that they are caused by the stylistics or grammar of the
Ukrainian language and are not relevant for determining the specific topic of the articles. The most
used 3-grams are, in Article:
• 1: ння [nnya] 0.008476, енн [enn] 0.007175, ого [oho] 0.005473.
• 2: ння [nnya] 0.006448, ист [yst] 0.006356, ува [uva] 0.006233.</p>
      <p>• 3: ння [nnya] 0.008769, ого [oho] 0.007717, мет [met] 0.006314.</p>
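      <p>The most-used 3-grams per article, as listed above, can be obtained with a simple frequency count. A minimal sketch on an illustrative text (the function name and sample text are ours):</p>

```python
from collections import Counter

def top_trigrams(text, k=3):
    """Return the k most frequent letter 3-grams with their relative frequencies."""
    letters = [c for c in text.lower() if c.isalpha()]
    grams = ["".join(letters[i:i + 3]) for i in range(len(letters) - 2)]
    counts = Counter(grams)
    total = len(grams)
    return [(g, n / total) for g, n in counts.most_common(k)]

print(top_trigrams("дослідження значення оцінювання", k=3))
```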
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>In the algorithmic approach, the trend is obtained by means of various algorithms that practically
implement smoothing procedures. These procedures provide the researcher with an algorithm for
calculating the new value of the time series at any given time t. These methods can be classified as
simple (ordinary) moving average (Fig. 16), weighted moving average, exponential smoothing and
median smoothing. In this part of the work, the relative frequency of occurrence of 3-grams in the three
texts has been smoothed by the moving average, exponential smoothing and median smoothing
methods.</p>
      <p>The moving average method is one of the oldest known methods of smoothing a time series. It is
based on the transition from the initial values of the series to their average values over a time interval
whose length is selected in advance. The selected time interval slides along the series. Moving averages
can smooth out both random and periodic fluctuations and identify existing trends in the process, and
therefore serve as an important tool for filtering time series components. The moving average method
estimates the average level over a period of time: the longer the time interval to which the average
belongs, the smoother the level will be, but the less accurately the trend of the original time series will
be described. In all figures, the gray graph shows the initial relative frequency, and the red graph shows
the smoothed relative frequency data.</p>
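      <p>The simple moving average described above can be sketched as follows. A minimal sketch on made-up frequency values, not the article's data:</p>

```python
def moving_average(series, w):
    """Centered simple moving average with window size w (odd);
    at the edges the window shrinks to whatever part of it fits."""
    half = w // 2
    out = []
    for i in range(len(series)):
        window = series[max(0, i - half):i + half + 1]
        out.append(sum(window) / len(window))
    return out

# Illustrative relative 3-gram frequencies (made up).
freqs = [0.001, 0.009, 0.002, 0.008, 0.001, 0.007, 0.002]
print([round(v, 4) for v in moving_average(freqs, 3)])
```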
      <p>At small values of the interval size w, the efficiency in terms of the smoothing effect is not very
high, as can be seen in Figures 16-18 for Article 1 (the data are smoothed using interval sizes w = 3, 5,
7, 9, 11, 13, 15).</p>
      <p>The data are smoothed using interval size w = 3 (Fig. 19); the obtained data are then smoothed
again with w = 5, and smoothing continues with w = 7 and so on up to w = 15 (Fig. 20-21).</p>
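      <p>The repeated smoothing procedure (smooth with w = 3, then re-smooth the result with w = 5, and so on up to w = 15) might look like this; the input series is illustrative, not the article's data:</p>

```python
def moving_average(series, w):
    """Centered moving average; the window shrinks at the edges."""
    half = w // 2
    out = []
    for i in range(len(series)):
        window = series[max(0, i - half):i + half + 1]
        out.append(sum(window) / len(window))
    return out

def cascade_smooth(series, widths=(3, 5, 7, 9, 11, 13, 15)):
    """Smooth with w = 3, then re-smooth the result with w = 5,
    and so on up to w = 15, as described in the text."""
    for w in widths:
        series = moving_average(series, w)
    return series

# Illustrative noisy series (made up).
noisy = [0.0, 1.0] * 16
print([round(v, 3) for v in cascade_smooth(noisy)[:5]])
```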
      <p>At small values of the interval size w, the efficiency in terms of the smoothing effect decreases,
which can be seen in the following figures. The smoothing method that reuses pre-smoothed series,
however, smooths the data very effectively.</p>
      <p>When the data for Article 2 are smoothed using interval sizes w = 3, 5, 7, 9, 11, 13, 15, the moving
average shows the trend in the interval better than for Article 1 (Fig. 22-24).</p>
      <p>The data for Article 2 are smoothed using interval size w = 3 (Fig. 25-27); the smoothed data are
then smoothed again with w = 5, and smoothing continues with w = 7 and so on up to w = 15.</p>
      <p>When the data for Article 3 are smoothed using interval sizes w = 3, 5, 7, 9, 11, 13, 15, the moving
average smooths the data with about the same efficiency as for Article 1 (Fig. 28-30).</p>
      <p>The data for Article 3 are smoothed using interval size w = 3 (Fig. 31); the smoothed data are then
smoothed again with w = 5, and smoothing continues with w = 7 and so on up to w = 15 (Fig. 32-33).</p>
      <p>Exponential smoothing. The main parameter of exponential smoothing is the parameter α, which
takes values in the range 0.1-0.3. It is necessary to smooth the same series with the parameter values
α = 0.1, 0.15, 0.2, 0.25, 0.3 and in each case to find the correlation coefficients between the original
values and the smoothed ones. An essential feature of the use of exponential averages is the
substantiation of the value of the smoothing parameter α: the smaller it is, the more strongly the levels
of the analyzed series are smoothed, since the specific weight of earlier observations increases.</p>
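      <p>A minimal sketch of exponential smoothing with the standard recurrence s[t] = α·x[t] + (1 − α)·s[t − 1], run for the α values listed above (the input series is made up, not the article's data):</p>

```python
def exponential_smoothing(series, alpha):
    """s[0] = x[0]; s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]."""
    s = [series[0]]
    for x in series[1:]:
        s.append(alpha * x + (1 - alpha) * s[-1])
    return s

# Illustrative relative frequencies (made up).
freqs = [0.002, 0.008, 0.001, 0.009, 0.002, 0.007]
for alpha in (0.1, 0.15, 0.2, 0.25, 0.3):
    print(alpha, [round(v, 4) for v in exponential_smoothing(freqs, alpha)])
```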
      <p>For the 3-gram frequency data, exponential smoothing for all three texts did not give a "delay".
Exponential smoothing softened these data only slightly, and it is harder to see the general trend. Also,
the correlation coefficients of the data are very low (Fig. 34-42).
Figure 24: Exponential smoothing of Article 2 for α = 0.15, 0.2, 0.25.
Figure 27: Exponential smoothing of Article 3 for α = 0.15, 0.2, 0.25.</p>
      <p>Median smoothing. In this case, the same smoothing interval sizes and operations are used as in
paragraph 1. A characteristic feature of median smoothing is that it leaves monotonic parts of the data
sequence and sharp jumps unchanged, while for non-monotonic areas within the sliding smoothing
interval it leaves only a centered value equal to their median, i.e. it effectively eliminates those levels
that violate monotonicity. It completely eliminates single extreme or anomalous values of levels that
are at least half the smoothing interval apart, maintains sharp changes in trends (which the moving
average and exponential smoothing smear), and effectively eliminates single levels with very large or
very small values that are random and stand out sharply among the other levels. These characteristics
of median smoothing were confirmed during the median smoothing of the relative frequency in
Article 1, where the smoothing effect is lower than that of the moving average. We smooth the data
using the smoothing interval sizes w = 3, 5, 7, 9, 11, 13, 15 (Fig. 43-45). The graphs are arranged in the
following order: Median w = 3 (for intervals 0-700, 700-1400, 1400-2148), Median w = 5, 7, 9, 11, 13,
Median w = 15 (for intervals 0-700, 700-1400, 1400-2148).
Figure 31: Median smoothing of Article 1 for w = 3, 5, 7, 9, 11, 13.</p>
      <p>The data are smoothed using interval size w = 3; the smoothed data are then smoothed again with
w = 5, and smoothing continues with w = 7 and so on up to w = 15 (Fig. 46-48).</p>
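      <p>Median smoothing as described (a sliding-window median that preserves monotonic runs and removes single outliers) can be sketched as follows; the input values are illustrative:</p>

```python
def median_smoothing(series, w):
    """Sliding-window median; keeps monotonic runs and sharp level shifts,
    removes single outliers within the window."""
    half = w // 2
    out = []
    for i in range(len(series)):
        window = sorted(series[max(0, i - half):i + half + 1])
        out.append(window[len(window) // 2])
    return out

# A single outlier (0.090) among small frequencies is eliminated.
freqs = [0.002, 0.002, 0.090, 0.002, 0.003, 0.003, 0.003]
print(median_smoothing(freqs, 3))
```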
    </sec>
    <sec id="sec-6">
      <title>6. Discussions</title>
      <p>A graphical representation of the relationship between two studied sequences is called a correlation
field or scatter plot. The graphical method provides a visual representation of the form of the
relationship between these sequences. So, correlation fields are constructed for Articles 1 and 2
(Fig. 49), Articles 1 and 3 (Fig. 50), and Articles 2 and 3 (Fig. 51).</p>
      <p>Visually assessing the nature of the relationship, it can be stated that there is a linear relationship in
all three fields. Evaluating the fields visually, we see that correlation is present, so we can assume that
these Ukrainian articles may have been written by one author or are based on one topic. But visual
assessment is not enough, so it is worth finding the value of the correlation coefficient for more accurate
research results. The correlation coefficient characterizes the degree of closeness of the linear
dependence. The correlation coefficients are therefore calculated for Articles 1 and 2 (correlation
coefficient 0.575, coefficient of determination 33%); for Articles 1 and 3 (correlation coefficient
0.63023, coefficient of determination 40%); and for Articles 2 and 3 (correlation coefficient 0.49038,
coefficient of determination 24%). Correlation coefficients that are less than 0.7 but greater than 0.5 in
modulus indicate a relationship of medium strength (the coefficients of determination are less than 50%
but more than 25%). It is worth noting that in the first two cases we obtained a connection of medium
strength, and in the third case a weak connection very close to medium, so it can also be attributed to
the medium category. Obviously, with three different Ukrainian articles, a 100% correlation is unlikely.
So, given the medium connection, the assumption that these articles may have been written by the same
author or are based on similar topics has been confirmed. When the hypothesis of a linear form of the
pairwise statistical dependence is rejected, the correlation coefficient loses its meaning as a characteristic
of the degree of closeness of the connection; in this case, a measure of connection such as the correlation
ratio is used. Since there is a linear relationship between the pairs of studied features, the correlation
ratio does not need to be calculated.</p>
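      <p>The correlation coefficient and the coefficient of determination for a pair of articles can be computed as follows; the frequency vectors here are hypothetical placeholders for the articles' 3-gram distributions, not the paper's data:</p>

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical per-trigram relative frequencies for two articles.
a1 = [0.0085, 0.0072, 0.0055, 0.0012, 0.0008, 0.0031]
a2 = [0.0064, 0.0064, 0.0062, 0.0010, 0.0011, 0.0029]
r = pearson(a1, a2)
print(f"r = {r:.3f}, coefficient of determination = {r * r:.0%}")
```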
      <p>The autocorrelation function is the correlation of a function with itself, shifted by a certain amount
of the independent variable. Autocorrelation is used to find patterns in a data series, such as periodicity.</p>
      <p>The graph of the autocorrelation function is also called the correlogram (Fig. 52).</p>
      <p>Fig. 52 shows that the studied series are not stationary, since for stationary time series the graph of
the autocorrelation function should decrease rapidly after the first few values.</p>
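      <p>A sketch of the sample autocorrelation function used to build the correlogram (the input series is illustrative):</p>

```python
def autocorrelation(series, max_lag):
    """Sample autocorrelation r(k) for lags 0..max_lag (a correlogram)."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    acf = []
    for k in range(max_lag + 1):
        cov = sum((series[t] - mean) * (series[t + k] - mean)
                  for t in range(n - k))
        acf.append(cov / var)
    return acf

# Illustrative alternating frequency series (made up).
freqs = [0.002, 0.008, 0.003, 0.007, 0.002, 0.009, 0.001, 0.008]
print([round(r, 2) for r in autocorrelation(freqs, 3)])
```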
      <p>The sequence of relative frequencies of Article 1 is divided into three equal parts of 715 values.
For convenience, the data are placed in a separate table (Fig. 53). The correlation matrix is a square
table in which the correlation coefficient between the corresponding parameters is located at the
intersection of the corresponding row and column. The correlation matrix for the column divided into
3 parts has been constructed, and the results obtained are correlation coefficients that are less than 0.5
in absolute value (modulus), indicating a weak relationship. The correlation matrix shows that all values
are close to 0, so we can conclude that there is no connection at all. This is quite an expected result, as
the data do not depend on each other and have different values. We then find the coefficients of multiple
correlation (Fig. 54-55).</p>
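      <p>Splitting the frequency sequence into equal parts and building the pairwise correlation matrix, as described above, might be sketched as follows (the 2145-value series is synthetic, not the article's data):</p>

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(series, parts=3):
    """Split the sequence into equal parts and correlate every pair."""
    size = len(series) // parts
    chunks = [series[i * size:(i + 1) * size] for i in range(parts)]
    return [[pearson(a, b) for b in chunks] for a in chunks]

random.seed(1)
freqs = [random.random() * 0.01 for _ in range(2145)]  # 3 parts of 715 values
for row in correlation_matrix(freqs):
    print([round(v, 3) for v in row])
```

On independent random parts the off-diagonal coefficients come out close to 0, matching the observation above that the three segments of one article show almost no mutual correlation.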
      <p>According to these graphs, Article 1 and Article 2 were more likely to have been written by one
author, although Article 1 and Article 3 could also have been written by one author (but this is not true).
Articles 2-3, however, were definitely written by different authors. Applying linguistic and statistical
analysis of 3-grams to a set of articles makes it possible to form a subset of linguistically similar
publications. Imposing additional conditions on this subset in the form of statistical and quantitative
analyses (sets of keywords, stable phrases, stylistic, linguometric, etc.) will significantly reduce this
subset, clarifying the list of more likely authorial works. Thus, the analysis of the content and frequency
of occurrence of only business words separates Articles 1 and 3 into different subsets and places
Articles 1 and 2 in the same one. This study does not address the problem of identifying the author in
full, because the difference in authorial traits is subjective and depends on the limitations imposed on
the creative process of the author. However, as a result, a system that implements such methods is able
to give recommendations on the degree to which a text belongs to a particular author. Further
experimental research is needed to test the proposed method for determining the author's style in other
categories of texts: scientific humanities, art, journalism and more. Therefore, we compare the
frequencies of all trigrams that begin with a particular letter (Fig. 56).</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>The article dwells upon completed scientific research in the field of information technology in the
part concerning computer linguistics, artificial intelligence and machine learning. A correlation analysis
of text author identification results based on n-grams in Ukrainian technical and scientific texts has
been performed. Three articles have been compared and the results have been obtained. Quantitative
content analysis of textual scientific and technical content has been studied, based on the fact that text
authorship determination systems typically reuse plagiarism and rewriting identification metrics fully
or partially. The article presents a method of determining the author by decomposition on the basis of
the analysis of such speech coefficients as lexical diversity, degree of syntactic complexity, speech
coherence, and indices of exclusivity and concentration of the text. Also, the parameters of the author's
style, such as the numbers of words, sentences, prepositions and conjunctions and the quantity of words
with defined frequencies, have been analyzed. It is highlighted that smoothing procedures are widely
used in the algorithmic approach, so the relative frequency of 3-gram occurrences in the studied texts
has been smoothed by the moving average, exponential and median smoothing methods. It is proposed
to analyze the reference text in several stages for high-quality and effective analysis of content in
determining the degree of text authorship. To achieve the research goal, a system with the ability to
select the language / languages of the analyzed content has been developed and implemented on the
Victana Web-resource. In order to compare texts with each other, it is necessary to associate each text
with some numerical characteristic that is close for texts of the same author and differs for the works
of various authors; here the density of the distribution function of letter combinations of three
consecutive characters is used. The rapid spread of text documents in electronic form has made
automatic methods of content analysis important, including the necessity of document classification
and clustering by various criteria.
8. References
[24] V. I. Perebyynis, M. P. Muravytska, N. P. Darchuk, Chastotni slovnyky ta yikh vykorystannya
[Frequency dictionaries and their use], Naukova dumka [Scientific opinion], 1983.
[25] I. Khomytska, V. Teslyuk, A. Holovatyy, O. Morushko, Development of methods, models, and
means for the author attribution of a text, Eastern-European Journal of Enterprise Technologies.
3(2(93)) (2018) 41–46. doi: 10.15587/1729-4061.2018.132052.
[26] I. Khomytska, V. Teslyuk, Authorship and Style Attribution by Statistical Methods of Style
Differentiation on the Phonological Level, Advances in Intelligent Systems and Computing 871
(2019) 105–118. doi: 10.1007/978-3-030-01069-0_8.
[27] I. Khomytska, V. Teslyuk, N. Kryvinska, I. Bazylevych, Software-Based Approach Towards
Automated Authorship Acknowledgement – Chi‐Square Test on One Consonant Group,
Electronics 9(7) (2020) 1138. doi: 10.3390/electronics9071138.
[28] I. Khomytska, V. Teslyuk, I. Bazylevych, I. Shylinska, Approach for Minimization of Phoneme
Groups in Authorship Attribution, International Journal of Computing 19(1) (2020)
55–62. doi: 10.47839/IJC.19.1.1693.
[29] D. Jurafsky, J. H. Martin, N-gram Language Models. URL:
https://web.stanford.edu/~jurafsky/slp3/3.pdf.
[30] D. Jurafsky, J. H. Martin, Speech and Language Processing. URL:
https://web.stanford.edu/~jurafsky/slp3/ed3book_sep212021.pdf.
[31] D. Jurafsky, J. H. Martin, Regular Expressions, Text Normalization, Edit Distance. URL:
https://web.stanford.edu/~jurafsky/slp3/2.pdf.
[32] O. S. Goh, C. C. Fung, A. Depickere, Domain knowledge query conversation bots in instant
messaging (IM), Knowledge-Based Systems 21(7). (2008) 681-691.
[33] S. Buk, Osnovy statystychnoy lingvistyky [Fundamentals of statistical linguistics], LNU n. I. Franko Publishing House, 2008.
[34] V. Vysotska, V. B. Fernandes, V. Lytvyn, M. Emmerich, M. Hrendus, Method for Determining
Linguometric Coefficient Dynamics of Ukrainian Text Content Authorship, Advances in
Intelligent Systems and Computing 871 (2018) 132–151. doi: 10.1007/978-3-030-01069-0_10.
[35] P. Kravets, The control agent with fuzzy logic, in: Proceedings of the International Conference on</p>
      <p>Perspective Technologies and Methods in MEMS Design, Lviv, Ukraine, 2010, pp. 40–41.
[36] P. Kravets, The Game Method for Orthonormal Systems Construction, in: Proceedings of the 2007
9th International Conference - The Experience of Designing and Applications of CAD Systems in
Microelectronics, Lviv, Ukraine, 2007. doi: https://doi.org/10.1109/cadsm.2007.4297555.
[37] P. Kravets, Game Model of Dragonfly Animat Self-Learning, in: Proceedings of the International</p>
      <p>Conference on Perspective Technologies and Methodsin MEMS Design, Lviv, 2016, 195–201.
[38] V. Lytvyn, V. Vysotska, I. Budz, Y. Pelekh, N. Sokulska, R. Kovalchuk, L. Dzyubyk, O.</p>
      <p>Tereshchuk, M. Komar, Development of the quantitative method for automated text content
authorship attribution based on the statistical analysis of N-grams distribution, Eastern-European
Journal of Enterprise Technologies 6(2-102) (2019) 28–51.
[39] I. Balush, V. Vysotska, S. Albota, Recommendation System Development Based on Intelligent</p>
      <p>Search NLP and Machine Learning Methods, CEUR WorkshopProceedings 2917 (2021) 584–617.
[40] A. Berko, Y. Matseliukh, Y. Ivaniv, L. Chyrun, V. Schuchmann, The Text Classification Based on
Big Data Analysis for Keyword Definition Using Stemming, in: Proceedings of the 2021 IEEE
16th International Conference on Computer Sciences and Information Technologies (CSIT), 1,
Lviv, Ukraine, 2021, pp. 184–188. doi: 10.1109/CSIT52700.2021.9648764.
[41] N. Shakhovska, K. Shakhovska, The Method of Text Tonality Classification, in: Proceedings of
the Computer Sciences and Information Technologies (CSIT), 1, Lviv, Ukraine, 2020, pp. 19–23.
[42] I. Khomytska, V. Teslyuk, L. Bordyuk, The Kolmogorov-Smirnov’s Test for Authorship
Attribution on the Phonological Level, in: Proceedings of the Computer Sciences and Information
Technologies (CSIT), 1, 2020, pp. 259–262. doi: 10.1109/CSIT49958.2020.9322042.
[43] Y. Hlavcheva, V. Bobicev, O. Kanishcheva, Language-independent features for authorship
attribution on Ukrainian texts, CEUR Workshop Proceedings Vol-2833 (2021) 134–143.
[44] O. Bisikalo, O. Boivan, N. Khairova, O. V. Kovtun, V. Kovtun, Precision Automated Phonetic
Analysis of Speech Signals for Information Technology of Text-dependent Authentication of a
Person by Voice, CEUR Workshop Proceedings Vol-2853 (2021) 276–288.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref2">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S. O.</given-names>
            <surname>Sushko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Y.</given-names>
            <surname>Fomychova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. S.</given-names>
            <surname>Barsukov</surname>
          </string-name>
          ,
          <article-title>Chastoty povtoryuvanosti bukv i bihram u vidkrytykh tekstakh ukrayinsʹkoyu movoyu [Frequency of repetition of letters and digrams in open texts in Ukrainian]</article-title>
          .
          <source>Zakhyst informatsiyi [Protection of information] 12</source>
          (
          <issue>3</issue>
          (
          <issue>48</issue>
          )) (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [2] P. S. Dyurdeva, Authorship definition based on the frequency distribution of letter combinations, 2015. URL: https://www.math.spbu.ru/SD_AIS/documents/2015-12-441/2015-12-b-07.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Likasa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vlassisb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Verbeekb</surname>
          </string-name>
          ,
          <article-title>The global k-means clustering algorithm</article-title>
          ,
          <source>Pattern Recognition</source>
          <volume>36</volume>
          (
          <issue>2</issue>
          ) (
          <year>2003</year>
          )
          <fpage>451</fpage>
          -
          <lpage>461</lpage>
          . URL: https://www.cs.uoi.gr/~arly/papers/PR2003.pdf
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [4] A. P. Reynolds, G. Richards, V. J. Rayward-Smith, The Application of K-medoids and PAM to the Clustering of Rules, Lecture Notes in Computer Science 3177 (2004). doi: 10.1007/978-3-540-28651-6_25.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [5] S. Babichev, M. A. Taif, V. Lytvynenko, V. Osypenko, Criterial analysis of gene expression sequences to create the objective clustering inductive technology, in: Proceedings of the International Conference on Electronics and Nanotechnology, ELNANO, 2017, pp. 244–248. doi: 10.1109/ELNANO.2017.7939756.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6] S. Babichev, B. Durnyak, I. Pikh, V. Senkivskyy, An Evaluation of the Objective Clustering Inductive Technology Effectiveness Implemented Using Density-Based and Agglomerative Hierarchical Clustering Algorithms, Advances in Intelligent Systems and Computing 1020 (2020) 532–553. doi: 10.1007/978-3-030-26474-1_37.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7] S. A. Babichev, A. Gozhyj, A. I. Kornelyuk, V. I. Lytvynenko, Objective clustering inductive technology of gene expression profiles based on SOTA clustering algorithm, Biopolymers and Cell 33(5) (2017) 379–392. doi: 10.7124/bc.000961.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8] K-Means Clustering. URL: https://people.revoledu.com/kardi/tutorial/kMean/#google_vignette.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9] A. S. Romanov, Methodology and software package for identifying the author of an unknown text, 2010. URL: https://www.dissercat.com/content/metodika-i-programmnyi-kompleks-dlyaidentifikatsii-avtora-neizvestnogo-teksta.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Borisov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yu. N.</given-names>
            <surname>Orlov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Osminin</surname>
          </string-name>
          ,
          <article-title>Identification of the author of the text by the frequency distribution of letter combinations</article-title>
          ,
          <source>Applied Informatics</source>
          <volume>26</volume>
          (
          <issue>2</issue>
          ) (
          <year>2013</year>
          )
          <fpage>95</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Yu. N.</given-names>
            <surname>Orlov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Osminin</surname>
          </string-name>
          ,
          <article-title>Determining the genre and author of a literary work by statistical methods</article-title>
          ,
          <year>2010</year>
          . URL: https://keldysh.ru/papers/2013/prep2013_27.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Yu. N.</given-names>
            <surname>Orlov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Osminin</surname>
          </string-name>
          ,
          <source>Methods of statistical analysis of literary texts</source>
          , LIBROKOM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V. A.</given-names>
            <surname>Pavlov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Dyurdeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Shalymov</surname>
          </string-name>
          ,
          <article-title>Clustering of Russian Manuscripts Based on the Feature Relationship Graph</article-title>
          ,
          <source>Computer tools in education</source>
          <volume>1</volume>
          (
          <year>2016</year>
          )
          <fpage>24</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T. V.</given-names>
            <surname>Batura</surname>
          </string-name>
          .
          <article-title>Formal methods for determining the authorship of texts</article-title>
          ,
          <source>NSU Bulletin, Series: Information Technologies</source>
          <volume>10</volume>
          (
          <issue>4</issue>
          )
          (
          <year>2012</year>
          ). URL: https://cyberleninka.ru/article/n/formalnye-metodyopredeleniya-avtorstva-tekstov
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Romanyshyn</surname>
          </string-name>
          ,
          <article-title>Intro to Natural Language Processing</article-title>
          . Grammarly, Inc.,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Babash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. P.</given-names>
            <surname>Shankin</surname>
          </string-name>
          , <source>Cryptography</source>, SOLON-R,
          <year>2002</year>
          . URL: https://pub.flowpaper.com/docs/https://book.edu-lib.net/books1/Babash_Kriprografiya_1.pdf, https://pub.flowpaper.com/docs/https://book.edu-lib.net/books1/Babash_Kriprografiya_2.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Alferov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Yu.</given-names>
            <surname>Zubov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Kuzmin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Cheryomushkin</surname>
          </string-name>
          , <source>Fundamentals of cryptography</source>, Helios,
          <year>2002</year>
          . URL: https://studfile.net/preview/6311470/.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Yaglom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. M.</given-names>
            <surname>Yaglom</surname>
          </string-name>
          ,
          <source>Probability and information</source>
          , Nauka, Phys.-Math. lit.,
          <year>1973</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Piotrovsky</surname>
          </string-name>
          , <source>Information measurements of language</source>, Nauka,
          <year>1968</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>I. M.</given-names>
            <surname>Yaglom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Dobrushin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Yaglom</surname>
          </string-name>
          ,
          <source>Information theory and linguistics</source>
          ,
          <source>Questions of linguistics 1</source>
          (
          <year>1960</year>
          )
          <fpage>100</fpage>
          -
          <lpage>110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Lebedev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. A.</given-names>
            <surname>Garmash</surname>
          </string-name>
          ,
          <article-title>On the possibility of increasing the speed of transmission of telegraph messages</article-title>
          ,
          <source>Telecommunications</source>
          <volume>1</volume>
          (
          <year>1958</year>
          )
          <fpage>68</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Shannon</surname>
          </string-name>
          ,
          <article-title>Prediction and entropy of printed English</article-title>
          ,
          <year>1951</year>
          . URL: https://www.princeton.edu/~wbialek/rome/refs/shannon_51.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>O. V.</given-names>
            <surname>Verbitskyy</surname>
          </string-name>
          , <source>Vstup do kryptolohiyi [Introduction to cryptology]</source>
          , Vydavnytstvo Naukovo-tekhnichnoyi literatury [Publishing House of Scientific and Technical Literature], Lviv,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>