<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Conference on Computer Sciences and Information Technologies, Lviv</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/CSIT52700.2021.9648682</article-id>
      <title-group>
        <article-title>Correlation Analysis of Text Author Identification Results Based on N-Grams Frequency Distribution in Ukrainian Scientific and Technical Articles</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Victoria Vysotska</string-name>
          <email>victoria.a.vysotska@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oksana Markiv</string-name>
          <email>oksana.o.markiv@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sofiia Teslia</string-name>
          <email>sofiia.teslia.sa.2019@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yeva Romanova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Inesa Pihulechko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>S. Bandera Street, 12, Lviv, 79013</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Osnabrück University</institution>
          ,
          <addr-line>Friedrich-Janssen-Str. 1, Osnabrück, 49076</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>85</volume>
      <fpage>1</fpage>
      <lpage>11</lpage>
      <abstract>
        <p>The results of experimental approbation of the proposed content-monitoring method for determining author style in Ukrainian scientific texts of technical profile have been studied. Authorship identification systems typically rely on plagiarism and rewrite metrics, i.e. on determining whether a work has been borrowed fully or partially; the situation in which the work has not yet been published is therefore not covered. Quantitative content analysis of scientific and technical texts uses the advantages of content monitoring and of text analysis based on NLP, Web-Mining and stylometry methods to identify the set of authors whose speech styles are similar to the studied passages. This narrows the search space for subsequent stylometric methods that estimate the degree to which the analyzed text belongs to a particular author. The author-determination method has been decomposed into the analysis of such speech coefficients as lexical diversity, the degree (measure) of syntactic complexity, speech coherence, and the text exclusivity and concentration indices. In parallel, parameters of the author style have been analyzed: the numbers of words, sentences, prepositions and conjunctions in the text, and the number of words with a frequency of 1, 10 or more.</p>
        <p>Keywords: N-grams, NLP, correlation analysis, authorship definition, Ukrainian text, distribution function density, exponential and median smoothing, linguometry, stylometric analysis.</p>
        <p>COLINS-2022: 6th International Conference on Computational Linguistics and Intelligent Systems, May 12-13, 2022, Gliwice, Poland.</p>
        <p>ORCID: 0000-0001-6417-3689 (V. Vysotska); 0000-0002-1691-1357 (O. Markiv); 0000-0002-2591-2431 (S. Teslia); 0000-0003-0522-0806 (Y. Romanova); 0000-0003-2789-2902 (I. Pihulechko).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the increasing availability and distribution of text documents in electronic form, the
importance of automatic methods for analyzing document content has grown [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">1-3</xref>
        ].
Text-analysis tasks include the classification and clustering of documents [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8">4-7</xref>
        ] by
various criteria, such as genre, writing format (novel, essay), emotional coloring and speech style, as well
as the task of text author identification [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref15 ref9">8-14</xref>
        ].
      </p>
      <p>With simplified access to various data and the growing ability to search, copy and distribute
data over networks, the task of identifying an author has become urgent. Questions related to the determination
of authorship are also important in linguistic, historical and forensic research. The general availability
of electronic devices makes it possible to move author recognition by large panels of experts into the
background, speeding up and simplifying this process through automation.</p>
      <p>
        Author identification is defined as the process of identifying an author from the
set of general and particular features of the text that constitute the author style [
        <xref ref-type="bibr" rid="ref10 ref9">8-9</xref>
        ].
      </p>
      <p>© 2022 Copyright for this paper by its authors.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        Statistical methods based on the search for an "author invariant" are popular in existing systems for
determining text authorship. The "author invariant" characterizes linguistic features of the text (lexical,
grammatical, phraseological and others). The invariant can be, for example, the share of vowels
or consonants, the frequency of use of certain parts of speech, the probabilities of transitions from one part
of speech to another, "favorite" words, information entropy, etc. The authors proposed a statistical method
for determining the author and genre of a text based on the frequency distribution of letter combinations
(n-grams) [
        <xref ref-type="bibr" rid="ref11 ref12 ref13">10-12</xref>
        ]. This method has shown decent results for works in Slavic-language research publications.
Unfortunately, the accuracy of statistical authorship-determination methods depends on the specifics
of the data: the language, style and length of the studied texts [
        <xref ref-type="bibr" rid="ref14 ref15 ref16 ref17 ref18 ref19 ref20 ref21 ref22 ref23 ref24">13-28</xref>
        ]. Because of this, it
is difficult to draw conclusions about the accuracy of such an approach on data of a different nature. For this reason,
the aim of this work is to analyze the application of the distribution of letter combinations
in different languages to the problem of establishing the authorship of texts
of different lengths written in different language styles. The chosen topic, the relative frequency of
n-grams, is only beginning to gain popularity in Ukraine. Several literature sources
describing what n-grams are and what they are used for have been found.
      </p>
      <p>An n-gram is a sequence of n elements [29]. From a semantic point of view, it can be a sequence of
sounds, syllables, words or letters. In practice, n-grams most often appear as series of words or stable
phrases called collocations. A sequence of two consecutive elements is called a bigram and a
sequence of three elements a trigram; both are present in the studied dataset.
Sequences of four or more elements are denoted as N-grams, with N replaced by the number of consecutive
elements. N-grams are used across a wide range of sciences: for example, in
theoretical mathematics, biology, cartography, and music. The most common uses
of N-grams include extracting data for clustering series of satellite images of the Earth,
deciding which specific parts of the Earth appear in an image, searching for genetic
sequences, data compression, and indexing data (usually sound-related) in search engines.
In natural language processing, n-grams are used mainly for prediction
based on probabilistic models: the n-gram model calculates the probability of the last word of the
n-gram given all the previous ones. When modeling language this way, it is assumed
that the appearance of each word depends only on the previous words [30]. Another application of n-grams
is plagiarism detection: if the text is divided into several small fragments represented by
n-grams, they are easy to compare with each other, which yields a degree of similarity between the controlled
documents [31]. N-grams are also used successfully to categorize text and identify language. In addition, they
can serve as features for gaining knowledge from textual data; for example, using n-grams,
one can efficiently find replacement candidates for words with spelling mistakes. Google Research
has used n-gram models in a wide range of research and development projects, such
as statistical translation from one language to another, language recognition, spelling correction,
information retrieval, and more. These projects used text corpora containing several trillion words,
and Google decided to create its own training corpus. The project
is called the Google tera corpus and it contains 1,024,908,267,229 words collected from public websites
[32].</p>
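<p>The word-level n-gram extraction and the conditional-probability estimate described above can be sketched as follows; this is an illustrative toy example (the sample sentence and helper names are assumptions, not from the paper):</p>

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-element tuples of consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bigram_prob(tokens, prev, word):
    """Estimate P(word | prev) from bigram and unigram counts."""
    bigrams = Counter(ngrams(tokens, 2))
    unigrams = Counter(tokens)
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

tokens = "the cat sat on the mat the cat ran".split()
print(ngrams(tokens, 3)[:2])                # first two trigrams
print(bigram_prob(tokens, "the", "cat"))    # 2/3
```

<p>The same sliding-window idea applies to letter-level 3-grams used later in the paper.</p>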
      <p>
        For a long time, the decryption of cryptograms has been aided by frequency analysis, the essence of which is the
study of statistical patterns in the appearance of symbols and their combinations in original and encrypted
messages [
        <xref ref-type="bibr" rid="ref2">1</xref>
        ]. To complicate frequency analysis, ciphers have appeared in cryptography that
lead to a uniform distribution of characters in the cryptogram. The principles of frequency analysis are
widely used in password-cracking programs and can reduce the search time by several orders of magnitude
[
        <xref ref-type="bibr" rid="ref3">2</xref>
        ], as well as in the classification and clustering [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8">4-7</xref>
        ] of documents by various criteria, such as genre, epoch,
format (novel, essay), emotional coloring and speech style, and in the task of determining the text author
[
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref15 ref16 ref9">8-15</xref>
        ]. Obviously, frequency analysis first of all requires the reference repetition frequencies of the letters
of the alphabet in which the open texts are written, as well as the repetition frequencies of n-grams. For Ukrainian,
English and almost all European languages, the average repetition frequencies of letters, bigrams and trigrams
can be found in the literature [
        <xref ref-type="bibr" rid="ref17 ref18 ref19 ref20 ref21 ref22 ref23">16-22</xref>
        ].
      </p>
      <p>
        Unfortunately, for the Ukrainian language only the repetition frequencies of letters are given in the
literature [
        <xref ref-type="bibr" rid="ref24">23-25</xref>
        ]. Therefore, the purpose of this work is to investigate the repetition frequencies of letters
and letter combinations of the Ukrainian language on the basis of randomly selected Ukrainian texts
of scientific and technical orientation. The analysis of the obtained data confirms that for the Ukrainian
language, as for other European languages, the alternation of vowels and consonants is inherent.
Other texts may differ in the frequencies of the given letters, which is
explained, firstly, by the length of the studied text and, secondly, by its subject matter. For example,
the letter ф, rare in general use, can become quite common in technical texts, because it is used in such words
as function, differential, diffusion, coefficient, etc. Even greater deviations from the traditional use of
individual letters are observed in some works of art, especially in poems.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods and Materials</title>
      <p>
        Modern systems for determining text authorship use different approaches from
mathematical statistics, pattern recognition and probability theory, cluster-analysis algorithms, neural
networks and others [33-45]. The systems differ in the method of author identification, the means of
text analysis, the required amount of text, and accuracy [
        <xref ref-type="bibr" rid="ref13">12</xref>
        ]. Methods of text-authorship identification
based on the calculation of various text characteristics (auxiliary parts of speech, prepositions, conjunctions,
particles, independent parts of speech, nouns, verbs, adjectives, word lengths, sentence lengths) also
differ in how frequencies are compared across textual content for different tasks [46-63]. The most
commonly used measures for comparing texts are the following: information entropy, Fisher
information, the chi-squared test and the Kullback-Leibler divergence [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13">9-12</xref>
        ].
      </p>
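<p>Three of the comparison measures named above can be sketched directly over relative-frequency vectors; this is a minimal illustration (the toy distributions and the epsilon guard are assumptions; Fisher information is omitted for brevity):</p>

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a frequency distribution; zero entries are skipped."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q); eps guards against division by zero."""
    return sum(x * math.log2(x / max(y, eps)) for x, y in zip(p, q) if x > 0)

def chi_squared(obs, exp, eps=1e-12):
    """Chi-squared statistic between observed and expected counts."""
    return sum((o - e) ** 2 / max(e, eps) for o, e in zip(obs, exp))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(round(entropy(p), 4))
print(round(kl_divergence(p, q), 4))
```

<p>Any of these measures can serve as the "distance" between two texts' frequency profiles; the choice affects which stylistic differences are emphasized.</p>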
      <p>
        When identifying the author of a text, it is assumed that the text reflects the individual style of its
author, which makes it possible to distinguish it from others. To compare texts with each other, it is necessary
to associate each text with some numerical characteristic that is close for texts of the same author
and differs for the works of various authors. As such a characteristic of the author, the article [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13">9-12</xref>
        ]
uses the distribution function density (DFD) of letter combinations of three consecutive characters
(3-grams). The DFD is defined as the set of empirical occurrence frequencies of letters or their combinations. The
analysis of the text with the help of the DFD does not take into account punctuation
marks, spaces or numbers.
      </p>
      <p>The task of identifying the author of an unknown text in terms of the DFD is formulated as follows.</p>
      <p>Given is a set of texts that contains works of known authors. Let N<sub>a</sub> be the number of works by the
a-th author and V<sub>a,i</sub> the number of symbols in the i-th work of the a-th author, i = 1, ..., N<sub>a</sub>. All texts in this
set are represented in the form of DFDs. The DFD of a text whose volume equals V<sub>a,i</sub> is given
as the set of values p<sub>a,i</sub>(j) = m<sub>j</sub>/V<sub>a,i</sub>, where m<sub>j</sub> is the number of occurrences of the n-gram with index j. The argument j =
1, ..., M(S, n) corresponds to the index of the letter combination (n-gram) in alphabetical order, where S is the power
of the alphabet of the language in which the text is written and n is the order of the n-grams, i.e. the number
of characters in the letter combination. M(S, n) = S<sup>n</sup> is the number of n-grams over this alphabet.</p>
      <p>Each author is identified with his weighted average DFD, which is given by formula (1):</p>
      <p>p<sub>a</sub>(j) = (1/V<sub>a</sub>) ∑<sub>i=1..N<sub>a</sub></sub> V<sub>a,i</sub> p<sub>a,i</sub>(j),  V<sub>a</sub> = ∑<sub>i=1..N<sub>a</sub></sub> V<sub>a,i</sub>.  (1)</p>
      <p>
        These DFDs will play the role of author standards [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13">9-12</xref>
        ]. To compare two texts, or a text
and an author standard, it is necessary to specify the distance between the corresponding distribution
functions. The norm in the space of summable functions is used as the distance metric. For example, the
distance ρ<sub>0,a</sub> between the DFD p<sub>0</sub> of the unknown text and any author DFD p<sub>a</sub> is calculated by
formula (2):
      </p>
      <p>ρ<sub>0,a</sub> = ‖p<sub>0</sub> − p<sub>a</sub>‖ = ∑<sub>j=1..M</sub> |p<sub>0</sub>(j) − p<sub>a</sub>(j)|.  (2)</p>
      <p>Accordingly, the text «0» is attributed to the author for whom the distance to the DFD is the shortest.</p>
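<p>The nearest-author rule based on the L1 distance between distribution function densities can be sketched as follows; the two toy author profiles are illustrative assumptions:</p>

```python
def l1_distance(p, q):
    """Sum of absolute differences between two DFDs (the summable-function norm)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def nearest_author(text_dfd, author_dfds):
    """Attribute the text to the author whose reference DFD is closest."""
    return min(author_dfds, key=lambda name: l1_distance(text_dfd, author_dfds[name]))

# Toy 3-bin profiles standing in for full n-gram DFDs.
authors = {"A": [0.5, 0.3, 0.2], "B": [0.2, 0.3, 0.5]}
print(nearest_author([0.45, 0.35, 0.2], authors))  # "A"
```
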
      <p>When solving the classification problem, the data set was not explicitly divided into test and training
sets. The weighted average DFDs were built on the whole set of books by one author. The distance from the
i-th book to "its" a-th author is calculated by formula (3):</p>
      <p>ρ<sub>a,i</sub> = ‖p<sub>a,i</sub> − p<sub>a</sub>‖ / (1 − V<sub>a,i</sub>/V<sub>a</sub>).  (3)</p>
      <p>
        The denominator in formula (3) excludes the contribution of the DFD of the i-th document to the average DFD of
"its" author [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13">9-12</xref>
        ]. Smoothing the 3-gram distribution functions analytically
is impossible because the function is too complex. Only an algorithmic approach is available, for
which we can rely on the main methods: the simple (ordinary)
moving average, the weighted moving average, exponential smoothing, and median smoothing. In our case, we
believe that the best choice is the moving-average method, which is also
known as the filtering method. Its application reduces the variability of the data. This fits our
analytically chosen tactic of ignoring extreme data, highs and lows. The degree of smoothing should
be fixed in advance by a criterion that ensures maximum smoothing while still retaining information.
      </p>
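<p>A minimal sketch of the simple moving average applied to a frequency sequence; the window size and the sample values are illustrative assumptions (shorter windows are used at the sequence edges):</p>

```python
def moving_average(seq, window=3):
    """Simple (centered) moving average; the window shrinks at the edges."""
    half = window // 2
    out = []
    for i in range(len(seq)):
        chunk = seq[max(0, i - half):i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Toy 3-gram frequency sequence with sharp highs and lows.
smoothed = moving_average([0.01, 0.09, 0.02, 0.08, 0.03])
print([round(v, 4) for v in smoothed])
```

<p>As the text notes, the filtering effect suppresses extreme values: the peak 0.09 shrinks after smoothing.</p>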
      <p>In our specific case, we believe that correlation analysis of the 3-gram data sequences of any two of the
three selected articles can help to determine their relationship, and thus help to answer the question of
how similar the topics of the articles are. To do this, the values of the first studied article can be denoted by
the variable x and the set of values of the second article by the variable y, and a correlation analysis
of the paired sequence XY performed. The task of correlation is not only to assess connectivity, but also to
reduce the target score to a numerical expression. The method of studying 3-gram sequences allows a
significant reduction in the number of variables that are taken into account as important. Combining
the metric-related groups forms a new cluster, which compares the closeness metrics with others,
and it is possible to end up with a fairly clear structure of the data set. The quantitative method of
identifying the potential author of a text from the set of possible ones, on the basis of the results of comparing
the reference text with the researched one, is based on the technique of linguometry.</p>
      <p>Linguometry is a branch of applied linguistics that detects, measures and analyzes the quantitative
characteristics of units of different levels of language or speech [33]. Using the apparatus of mathematical
statistics, linguometry is involved in solving such problems of linguistics as:
• dictionaries (including frequency and statistical ones) and comparisons
• automatic dictionaries and thesauri
• shorthand systems
• methods and means of automatic language detection
• methods and means of information retrieval, etc.</p>
      <p>Each language has its own statistical parameters, and knowledge of the occurrence frequencies of
letters and their combinations (2-grams, 3-grams, 4-grams) allows a language to be identified automatically. For
example, for Ukrainian texts [34-37] it was found that statistical parameters of styles can take into account the
frequencies of vowels, consonants, spaces between words, as well as soft and sonorous groups of
consonants [33]. We will show how to evaluate the speech of a particular author on a particular passage
of his work [36] using a certain standard, for example, the Ukrainian letter frequencies. Consider
two passages of technical text in Ukrainian, presented in a format where the letters are arranged in
descending order of their frequency of appearance (the frequencies are given in Table 1); no distinction
between lowercase and uppercase letters has been made. The correlation between the letter frequencies
of the passages [35] and the standard [36] has been investigated. The results that confirm the
conclusions have been presented, in particular, graphically.</p>
      <p>For convenience, Table 1 contains the following data: the frequencies of Ukrainian
letters in general use, and the absolute and relative frequencies of the letters used in the studied Passage 1 (Article 1) [35] and
Passage 2 (Article 2) [36]. Passage 1 contains 556 characters; Passage 2 contains 541 characters. The
category "other" in the letters column contains letters authentic to the Ukrainian language (ї, є, ґ,
і), which are rarely used in most technical texts. This allows some independence in the
analysis to be achieved. Fig. 1 illustrates the obtained results graphically.</p>
      <p>The graphical representation of the relative letter frequencies in the passages gives a convincing
answer to the question of which passage was written by which author.</p>
      <p>The distribution of 1-grams in the works differs. The optimal indicators for studying the texts come from the
analysis of 3-grams [38-44]; we will check this in the next stages of the study. There is a sharp jump
in the relative frequency of occurrence of the letter "е" for Passage 2 relative to the reference values of
Standard 1 [36] (Fig. 2), so we assume that Standard 1 was more likely written by the author
of Passage 1 [35]. We also give the numerical values of the correlation between the letter frequencies of the
passages and the standard. We find two correlation coefficients: for the standard and Passage 1 [35] and
for the standard and Passage 2 [37]; the coefficient closer to 1 indicates the passage more
likely to belong to the standard. Calculation of the correlation coefficient for the standard and Passage 1
gives Re-У1=0.962716, and the correlation coefficient for the standard and Passage 2 is Re-У2=0.909958.
Similarly, the values of the relative frequencies in Standard 2 and Passages 1 and 2 in Fig. 3 differ significantly,
so it is likely that the author of Standard 2 [34] is not the author of Passages 1 and 2.</p>
      <p>[Figure: relative frequencies (scale 0-0.4) of the most common symbols - space (Пропуск), о, а, н, и, в, т, е, р, с - in the passages and the standard.]</p>
      <p>Step. 7. Determine the number of connectors S.</p>
      <p>Step. 8. Calculate the coefficients of author speech.</p>
      <p>Step. 9. Output the results to the end user (Table 2, Fig. 1).
The Web-resource for stylistic analysis has the following fields (Fig. 5):
• Select Passage 1 (2, 3) opens access to the excerpts. Access to the next passage is granted only after
activating access to the previous one; access is opened sequentially from a smaller number to a
larger one.
• Reference text is the field into which the reference text is copied from the buffer.
• "The text you enter must be at least 100 characters long. (Now 0)": after the calculation starts,
the actual number of characters of each passage is calculated and displayed separately.
• Passage 1 (2, 3) is the field into which the corresponding excerpt text is copied from the buffer.
• Calculate starts the calculation.
• Clear clears the entered data.</p>
      <p>Algorithm 2. Stylometric analysis of the text to determine authorship.</p>
      <p>Step. 1. Check the lengths of the standard text and the selected passages and reduce the length of the
reference text to the minimum of those checked.</p>
      <p>Step. 2. Clean the reference text from special characters, etc.</p>
      <p>Step. 3. Determine the number of words in the text of the standard.</p>
      <p>Step. 4. Determine the number of stop words (prepositions + conjunctions + particles) in the text of
the standard (Fig. 5-6).</p>
      <p>Step. 5. Ensure that the length of Passage 1 is not more than the minimum text length.</p>
      <p>Step. 6. Clear Passage 1 from special characters, etc.</p>
      <p>Step. 7. Determine the number of words W1 for Passage 1.</p>
      <p>Step. 8. Determine the number of stop words (prepositions + conjunctions + particles) in the text.
Step. 9. Prepare individual arrays (excerpt and standard) to calculate the correlation coefficient (Fig.
Step. 10. Call the function to calculate the correlation coefficient.</p>
      <p>Step. 11. Form an array for the graphical representation of the relative frequency of stop words in
Passage 1 and in the standard.</p>
      <p>Step. 12. Call the function to calculate the relative frequency distribution graph (Fig. 6).</p>
      <p>Step. 13. Call the function to calculate the correlation coefficient of Passage 2 (3) for each of the
service words.</p>
      <p>Step. 14. Form the words of the Swadesh list from the reference book and determine the number of
words from the Swadesh list in the text of the passage.</p>
      <p>Step. 15. Form the common word set for the Standard, Passages 1-3 and the Swadesh list.</p>
      <p>Step. 16. The results of the study are displayed on the screen.</p>
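<p>The word- and stop-word-counting core of Algorithm 2 (steps 2-4 and 6-8) can be sketched as follows. The tiny stop-word list and the sample sentence are illustrative assumptions; the real reference list of prepositions, conjunctions and particles would be far larger:</p>

```python
import re

# Hypothetical mini stop-word list (prepositions, conjunctions, particles).
STOP_WORDS = {"і", "та", "в", "на", "з", "до", "не", "що", "як", "же"}

def clean(text):
    """Steps 2/6: keep only letters and whitespace, lowercase the text."""
    return re.sub(r"[^а-щьюяєіїґa-z\s]", " ", text.lower())

def stop_word_profile(text):
    """Steps 3-4/7-8: total word count and per-stop-word relative frequencies."""
    words = clean(text).split()
    total = len(words)
    return total, {sw: words.count(sw) / total for sw in sorted(STOP_WORDS)}

total, profile = stop_word_profile("Аналіз і синтез, як і раніше, виконано на основі даних")
print(total, profile["і"])
```

<p>The resulting per-stop-word frequency arrays for a passage and the standard are exactly what steps 9-13 feed into the correlation-coefficient function.</p>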
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <p>When identifying the author of a text, it is assumed that the text reflects the individual style of
the author's writing, which distinguishes him from others. In order to compare texts with each other, it is
necessary to associate each text with numerical characteristics that would be close for texts of
the same author and significantly different for the works of different authors.</p>
      <p>The Web-resource for the analysis of N-grams has the following fields (Fig. 7):
• Number of grams - the number of characters in a gram. The default is 3; it can be changed to 1,
2, 3 or 4.
• Choice of the text language - the language of the text for analysis (research). The default is
"Ukrainian".
• Text - the field into which the studied text is copied from the buffer.
• Restriction of the text length in characters.
• Generation - starts generating the N-grams.
• Clearance - clears the entered data.</p>
      <p>Algorithm 3. Linguistic and statistical analysis of the N-grams of a text is the following:
Step. 1. Purify the studied text (remove numbers and special symbols).</p>
      <p>Step. 2. Calculate the number of words in the text.</p>
      <p>Step. 3. All words of the text are converted to lower case.</p>
      <p>Step. 4. Remove the spaces.</p>
      <p>Step. 5. Depending on the selected language, the corresponding alphabet is substituted.</p>
      <p>Step. 6. Depending on the set number of grams, the corresponding function is started, which calculates all
possible variants of grams and saves them in an array.</p>
      <p>Step. 7. The function counting the number of occurrences of each gram is started.</p>
      <p>Here we calculate the relative occurrence frequency and store in the array the ordinal number
of the gram, the gram itself, the number of occurrences of this gram, and its relative occurrence
frequency.</p>
      <p>Step. 8. The next function forms the array received from the previous function for export to a
CSV file. This file is stored on the server. It can be downloaded to the computer of the user (researcher)
via a link, which becomes accessible after the form with the results of the study is generated.</p>
      <p>Step. 9. The results of the study are displayed on the screen (only those grams that are found in the
text).</p>
      <p>Step. 10. Access the export file.</p>
      <p>Step. 11. The generalized results are output:
• the N-grams found, with repetitions
• the N-grams found, without repetitions
• the total number of N-grams
• the number of characters in the fully cleaned text
• the number of characters in the text with spaces
• the number of words in the text
• the size of the alphabet.</p>
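<p>The core of Algorithm 3 (cleaning, gram generation, counting, relative frequencies, CSV export) can be sketched as follows; the sample text and the CSV column layout are illustrative assumptions:</p>

```python
import csv
import io
from collections import Counter

ALPHABET = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"  # 33 Ukrainian letters

def char_ngram_frequencies(text, n=3):
    """Steps 1-7: lowercase, drop non-letters and spaces, count the n-grams found."""
    cleaned = "".join(ch for ch in text.lower() if ch in ALPHABET)
    grams = [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return {g: (c, c / total) for g, c in counts.items()}

def export_csv(freqs):
    """Step 8: rows of ordinal number, gram, occurrences, relative frequency."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for i, (gram, (count, rel)) in enumerate(sorted(freqs.items()), start=1):
        writer.writerow([i, gram, count, round(rel, 6)])
    return buf.getvalue()

freqs = char_ngram_frequencies("аналіз тексту автора")
print(len(freqs), export_csv(freqs).splitlines()[0])
```

<p>Only grams that actually occur in the text are stored, matching step 9 of the algorithm.</p>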
      <p>Three publications of scientific and technical orientation have been compared on the basis of linguistic and statistical
analysis of 3-grams. Articles 1 and 2 have been written by one team, Article 3 has
been written by another author (Table 3). The language of the texts is Ukrainian (33 letters in the alphabet,
35937 possible 3-grams).</p>
      <p>When comparing the articles, only those 3-grams that occur at least once in all three
articles simultaneously have been taken into account. For this particular example, there are 2147 such
3-grams. That is, 78.4814% of the 3-grams have been analyzed for Article 1, 72.6332% for Article 2 and
84.1271% for Article 3. Accordingly, the difference in use of the relevant 3-grams is R12=56.5254% between
Articles 1 and 2, R23=69.4271% between Articles 2 and 3, and R13=62.9839% between Articles 1 and
3. These indicators already show that the characteristics of Articles 1 and 2 are
more similar (R23&gt;R12 by 12.9017%, R23&gt;R13 by 6.4432%, R13&gt;R12 by 6.4585%, that is, R23&gt;R13&gt;R12)
than the characteristics of Articles 1-3 and 2-3. The smaller the Rij, the more likely it is that the
articles are written by the same author. Then Articles 1 and 2 are more likely to have been written
by one author/team than Articles 2-3 and Articles 1-3, respectively. But we will also analyze the use of
individual clusters of 3-grams in the relevant articles and compare the results.</p>
      <p>Fig. 8 presents the results of the analysis of the use in Articles 1-3 of 3-grams starting with the letter а
(their share in Articles 1-3 is in the range of 6.1125-6.7087%). Most often the curves for Articles
1-2 (4.2322%) and Articles 1-3 (4.197%) coincide or approach each other (the average discrepancies are
0.02713% and 0.0269%, respectively). But there is not always a coincidence with Articles 2-3 (4.6322%),
where there are significant differences (the average difference is 0.02969%). If only such
3-grams are analyzed, it turns out that all three articles are more likely written by one author. This is due to the fact that
this letter is one of the most commonly used in the formation of Ukrainian words.</p>
      <p>[Figure: curves ||p1-p2||, ||p2-p3||, ||p1-p3|| for 3-grams starting with а (ааб ... ачу), frequency scale up to 0.006.]</p>
      <p>Fig. 9 presents the results of the analysis of the use in Articles 1-3 of 3-grams starting with the letter б (letter
b in English) (their share in Articles 1-3 is in the range of 0.48884-0.77738%). Most often the curves
for Articles 1-2 (0.594%), as opposed to Articles 1-3 (0.7072%) and Articles 2-3 (1.1208%), coincide or
approach each other. But the trajectories of the curves of Article 1 and Article 3 also often coincide (most likely these articles
are written by one author; the average discrepancy is 0.01809%, while for Articles 1-2 it is 0.0261% and for
Articles 2-3 0.02866%). If only such 3-grams (which are less common) are analyzed, it turns out that
Articles 1-2 are more likely written by one author, and Article 3 by another. This is due to the fact
that this letter is rare in the formation of Ukrainian words, and some authors use such words
more often out of habit and/or because of the subject matter of their publications (this requires
further research).</p>
      <p>[Figure residue: axis values, the 1-gram frequency ranking (о, н, а, т, і, и, в, с, е, р, к, л, у, д, м, п, я, з, ь, ч, г, х, б, ж, й, ц, є, ф) and the legend ||p1-p2||, ||p2-p3||, ||p1-p3||.]</p>
      <p>[Figure: curves ||p1-p2||, ||p2-p3||, ||p1-p3|| for 3-grams starting with б (баг ... бчи), frequency scale up to 0.004.]</p>
      <p>According to Table 4 and Fig. 10-12, some letters of the Ukrainian language are used most often,
others much less often. For the most frequently used letters, the occurrence frequencies of
3-grams with such initial letters have almost the same distribution (top values in the graph of Fig.
12), while for other letters they do not.</p>
      <p>0
0,004
0,002
0</p>
      <p>Therefore, to determine the degree to which a text belongs to an author, it is advisable to study only
the trigrams with initial letters that are less common in the texts of a particular language (for example,
Fig. 12). Thus, for 3-grams beginning with the letter є (occurrence in Articles 1-3 in the range
0.2517-0.707%), the curves for Articles 1-2 (0.2508%) coincide or approach each other most often, in
contrast to Articles 1-3 (0.6077%) and Articles 2-3 (0.5443%). The trajectories of the curves for
Article 1 and Article 2 often coincide (the articles were most likely written by one author: the average
discrepancy is 0.0114%, while for Articles 2-3 it is 0.02478% and for Articles 1-3 it is 0.02762%, more
than twice as high).</p>
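      <p>The per-letter comparison above rests on the average discrepancy between two articles' relative trigram frequencies for a fixed initial letter. A minimal sketch of this metric (the sample texts and the helper names trigram_freqs and mean_discrepancy are ours, not the paper's):</p>

```python
def trigram_freqs(text):
    """Relative frequencies of letter 3-grams in a text (spaces ignored)."""
    letters = [c for c in text.lower() if c.isalpha()]
    grams = ["".join(letters[i:i + 3]) for i in range(len(letters) - 2)]
    total = len(grams)
    freqs = {}
    for g in grams:
        freqs[g] = freqs.get(g, 0) + 1
    return {g: n / total for g, n in freqs.items()}

def mean_discrepancy(p1, p2, initial):
    """Average |p1(g) - p2(g)| over trigrams starting with `initial`
    (the ||p1-p2|| curves in the figures)."""
    grams = {g for g in list(p1) + list(p2) if g.startswith(initial)}
    if not grams:
        return 0.0
    return sum(abs(p1.get(g, 0.0) - p2.get(g, 0.0)) for g in grams) / len(grams)

# Illustrative texts, not the studied articles.
p1 = trigram_freqs("єдність єдиного європейського єства")
p2 = trigram_freqs("єдине єство єднає європейські єдності")
print(round(mean_discrepancy(p1, p2, "є"), 5))
```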
      <p>Table 4 shows the frequencies of letter appearance in the standard and the studied passages. Fig. 14
shows histograms of the relative frequency of n-grams in Articles 1-3. Low-frequency (noise) values are
the most common and form the main volume of the data; we can ignore them (Fig. 15).</p>
      <p>All graphs of the distribution of 3-gram frequencies in the articles show a clearly noticeable
gradation of 3-grams into underused (noise-like) values and widely used peak values. The specific
examples of the three articles show that, to reduce the amount of information analyzed, it is desirable
to proceed to the analysis of the distribution function from a certain threshold frequency value while
still covering the main information content. To compare the distribution function across the three
studied articles, it is necessary to compare the clearly expressed average values. After analyzing the
most commonly used 3-grams, we conclude that they are caused by the stylistics or grammar of the
Ukrainian language and are not relevant for determining the specific topic of the articles. The most
used 3-grams are, in Article:
• 1: ння [nnya] 0.008476, енн [enn] 0.007175, ого [oho] 0.005473.
• 2: ння [nnya] 0.006448, ист [yst] 0.006356, ува [uva] 0.006233.</p>
      <p>• 3: ння [nnya] 0.008769, ого [oho] 0.007717, мет [met] 0.006314.</p>
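      <p>The most-used 3-grams per article, as listed above, can be obtained with a simple frequency count. A minimal sketch on an illustrative text (the function name and sample text are ours):</p>

```python
from collections import Counter

def top_trigrams(text, k=3):
    """Return the k most frequent letter 3-grams with their relative frequencies."""
    letters = [c for c in text.lower() if c.isalpha()]
    grams = ["".join(letters[i:i + 3]) for i in range(len(letters) - 2)]
    counts = Counter(grams)
    total = len(grams)
    return [(g, n / total) for g, n in counts.most_common(k)]

print(top_trigrams("дослідження значення оцінювання", k=3))
```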
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>In the algorithmic approach, the trend is obtained by means of various algorithms that practically
implement smoothing procedures. These procedures provide the researcher with an algorithm for
calculating the new value of the time series at any given time t. These methods can be classified as
simple (ordinary) moving average (Fig. 16), weighted moving average, exponential smoothing and
median smoothing. In this part of the work, the relative frequency of occurrence of 3-grams in the three
texts has been smoothed by the moving average, exponential smoothing and median smoothing
methods.</p>
      <p>The moving average method is one of the oldest known methods of smoothing a time series. It is
based on the transition from the initial values of the series to their average values over a time interval
whose length is selected in advance. The selected time interval slides along the series. Moving averages
can smooth out both random and periodic fluctuations and identify existing trends in the process, and
therefore serve as an important tool for filtering time series components. The moving average method
estimates the average level over a period of time: the longer the time interval to which the average
belongs, the smoother the level will be, but the less accurately the trend of the original time series will
be described. In all figures, the gray graph shows the initial relative frequency, and the red graph shows
the smoothed relative frequency data.</p>
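      <p>The simple moving average described above can be sketched as follows. A minimal sketch on made-up frequency values, not the article's data:</p>

```python
def moving_average(series, w):
    """Centered simple moving average with window size w (odd);
    at the edges the window shrinks to whatever part of it fits."""
    half = w // 2
    out = []
    for i in range(len(series)):
        window = series[max(0, i - half):i + half + 1]
        out.append(sum(window) / len(window))
    return out

# Illustrative relative 3-gram frequencies (made up).
freqs = [0.001, 0.009, 0.002, 0.008, 0.001, 0.007, 0.002]
print([round(v, 4) for v in moving_average(freqs, 3)])
```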
      <p>At small values of the interval size w, the efficiency in terms of the smoothing effect is not very
high, as can be seen in Figures 16-18 for Article 1 (the data are smoothed using interval sizes w = 3, 5,
7, 9, 11, 13, 15).</p>
      <p>The data are smoothed using interval size w = 3 (Fig. 19); the obtained data are then smoothed
again with w = 5, and smoothing continues with w = 7 and so on up to w = 15 (Fig. 20-21).</p>
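      <p>The repeated smoothing procedure (smooth with w = 3, then re-smooth the result with w = 5, and so on up to w = 15) might look like this; the input series is illustrative, not the article's data:</p>

```python
def moving_average(series, w):
    """Centered moving average; the window shrinks at the edges."""
    half = w // 2
    out = []
    for i in range(len(series)):
        window = series[max(0, i - half):i + half + 1]
        out.append(sum(window) / len(window))
    return out

def cascade_smooth(series, widths=(3, 5, 7, 9, 11, 13, 15)):
    """Smooth with w = 3, then re-smooth the result with w = 5,
    and so on up to w = 15, as described in the text."""
    for w in widths:
        series = moving_average(series, w)
    return series

# Illustrative noisy series (made up).
noisy = [0.0, 1.0] * 16
print([round(v, 3) for v in cascade_smooth(noisy)[:5]])
```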
      <p>At small values of the interval size w, the efficiency in terms of the smoothing effect decreases,
which can be seen in the following figures. The smoothing method that reuses pre-smoothed series,
however, smooths the data very effectively.</p>
      <p>When the data for Article 2 are smoothed using interval sizes w = 3, 5, 7, 9, 11, 13, 15, the moving
average shows the trend in the interval better than for Article 1 (Fig. 22-24).</p>
      <p>The data for Article 2 are smoothed using interval size w = 3 (Fig. 25-27); the smoothed data are
then smoothed again with w = 5, and smoothing continues with w = 7 and so on up to w = 15.</p>
      <p>When the data for Article 3 are smoothed using interval sizes w = 3, 5, 7, 9, 11, 13, 15, the moving
average smooths the data with about the same efficiency as for Article 1 (Fig. 28-30).</p>
      <p>The data for Article 3 are smoothed using interval size w = 3 (Fig. 31); the smoothed data are then
smoothed again with w = 5, and smoothing continues with w = 7 and so on up to w = 15 (Fig. 32-33).</p>
      <p>Exponential smoothing. The main parameter of exponential smoothing is the parameter α, which
takes values in the range 0.1-0.3. It is necessary to smooth the same series with the parameter values
α = 0.1, 0.15, 0.2, 0.25, 0.3 and in each case to find the correlation coefficients between the original
values and the smoothed ones. An essential feature of the use of exponential averages is the
substantiation of the value of the smoothing parameter α: the smaller it is, the more strongly the levels
of the analyzed series are smoothed, since the specific weight of earlier observations increases.</p>
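      <p>A minimal sketch of exponential smoothing with the standard recurrence s[t] = α·x[t] + (1 − α)·s[t − 1], run for the α values listed above (the input series is made up, not the article's data):</p>

```python
def exponential_smoothing(series, alpha):
    """s[0] = x[0]; s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]."""
    s = [series[0]]
    for x in series[1:]:
        s.append(alpha * x + (1 - alpha) * s[-1])
    return s

# Illustrative relative frequencies (made up).
freqs = [0.002, 0.008, 0.001, 0.009, 0.002, 0.007]
for alpha in (0.1, 0.15, 0.2, 0.25, 0.3):
    print(alpha, [round(v, 4) for v in exponential_smoothing(freqs, alpha)])
```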
      <p>For the 3-gram frequency data, exponential smoothing for all three texts did not give a "delay".
Exponential smoothing softened these data only slightly, and it is harder to see the general trend. Also,
the correlation coefficients of the data are very low (Fig. 34-42).
Figure 24: Exponential smoothing of Article 2 for α = 0.15, 0.2, 0.25.
Figure 27: Exponential smoothing of Article 3 for α = 0.15, 0.2, 0.25.</p>
      <p>Median smoothing. In this case, the same smoothing interval sizes and operations are used as in
paragraph 1. A characteristic feature of median smoothing is that it leaves monotonic parts of the data
sequence and sharp jumps unchanged, while for non-monotonic areas within the sliding smoothing
interval it leaves only a centered value equal to their median, i.e. it effectively eliminates those levels
that violate monotonicity. It completely eliminates single extreme or anomalous values of levels that
are at least half the smoothing interval apart, maintains sharp changes in trends (which the moving
average and exponential smoothing smear), and effectively eliminates single levels with very large or
very small values that are random and stand out sharply among the other levels. These characteristics
of median smoothing were confirmed during the median smoothing of the relative frequency in
Article 1, where the smoothing effect is lower than that of the moving average. We smooth the data
using the smoothing interval sizes w = 3, 5, 7, 9, 11, 13, 15 (Fig. 43-45). The graphs are arranged in the
following order: Median w = 3 (for intervals 0-700, 700-1400, 1400-2148), Median w = 5, 7, 9, 11, 13,
Median w = 15 (for intervals 0-700, 700-1400, 1400-2148).
Figure 31: Median smoothing of Article 1 for w = 3, 5, 7, 9, 11, 13.</p>
      <p>The data are smoothed using interval size w = 3; the smoothed data are then smoothed again with
w = 5, and smoothing continues with w = 7 and so on up to w = 15 (Fig. 46-48).</p>
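      <p>Median smoothing as described (a sliding-window median that preserves monotonic runs and removes single outliers) can be sketched as follows; the input values are illustrative:</p>

```python
def median_smoothing(series, w):
    """Sliding-window median; keeps monotonic runs and sharp level shifts,
    removes single outliers within the window."""
    half = w // 2
    out = []
    for i in range(len(series)):
        window = sorted(series[max(0, i - half):i + half + 1])
        out.append(window[len(window) // 2])
    return out

# A single outlier (0.090) among small frequencies is eliminated.
freqs = [0.002, 0.002, 0.090, 0.002, 0.003, 0.003, 0.003]
print(median_smoothing(freqs, 3))
```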
    </sec>
    <sec id="sec-6">
      <title>6. Discussions</title>
      <p>A graphical representation of the relationship between two studied sequences is called a correlation
field or scatter plot. The graphical method provides a visual representation of the form of the
relationship between these sequences. So, correlation fields are constructed for Articles 1 and 2
(Fig. 49), Articles 1 and 3 (Fig. 50), and Articles 2 and 3 (Fig. 51).</p>
      <p>Visually assessing the nature of the relationship, it can be stated that there is a linear relationship in
all three fields. Evaluating the fields visually, we see that correlation is present, so we can assume that
these Ukrainian articles may have been written by one author or are based on one topic. But visual
assessment is not enough, so it is worth finding the value of the correlation coefficient for more accurate
research results. The correlation coefficient characterizes the degree of closeness of the linear
dependence. The correlation coefficients are therefore calculated for Articles 1 and 2 (correlation
coefficient 0.575, coefficient of determination 33%); for Articles 1 and 3 (correlation coefficient
0.63023, coefficient of determination 40%); and for Articles 2 and 3 (correlation coefficient 0.49038,
coefficient of determination 24%). Correlation coefficients that are less than 0.7 but greater than 0.5 in
modulus indicate a relationship of medium strength (the coefficients of determination are less than 50%
but more than 25%). It is worth noting that in the first two cases we obtained a connection of medium
strength, and in the third case a weak connection very close to medium, so it can also be attributed to
the medium category. Obviously, with three different Ukrainian articles, a 100% correlation is unlikely.
So, given the medium connection, the assumption that these articles may have been written by the same
author or are based on similar topics has been confirmed. When the hypothesis of a linear form of the
pairwise statistical dependence is rejected, the correlation coefficient loses its meaning as a characteristic
of the degree of closeness of the connection; in this case, a measure of connection such as the correlation
ratio is used. Since there is a linear relationship between the pairs of studied features, the correlation
ratio does not need to be calculated.</p>
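      <p>The correlation coefficient and the coefficient of determination for a pair of articles can be computed as follows; the frequency vectors here are hypothetical placeholders for the articles' 3-gram distributions, not the paper's data:</p>

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical per-trigram relative frequencies for two articles.
a1 = [0.0085, 0.0072, 0.0055, 0.0012, 0.0008, 0.0031]
a2 = [0.0064, 0.0064, 0.0062, 0.0010, 0.0011, 0.0029]
r = pearson(a1, a2)
print(f"r = {r:.3f}, coefficient of determination = {r * r:.0%}")
```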
      <p>The autocorrelation function is the correlation of a function with itself, shifted by a certain amount
of the independent variable. Autocorrelation is used to find patterns in a data series, such as periodicity.</p>
      <p>The graph of the autocorrelation function is also called the correlogram (Fig. 52).</p>
      <p>Fig. 52 shows that the studied series are not stationary, since for stationary time series the graph of
the autocorrelation function should decrease rapidly after the first few values.</p>
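      <p>A sketch of the sample autocorrelation function used to build the correlogram (the input series is illustrative):</p>

```python
def autocorrelation(series, max_lag):
    """Sample autocorrelation r(k) for lags 0..max_lag (a correlogram)."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    acf = []
    for k in range(max_lag + 1):
        cov = sum((series[t] - mean) * (series[t + k] - mean)
                  for t in range(n - k))
        acf.append(cov / var)
    return acf

# Illustrative alternating frequency series (made up).
freqs = [0.002, 0.008, 0.003, 0.007, 0.002, 0.009, 0.001, 0.008]
print([round(r, 2) for r in autocorrelation(freqs, 3)])
```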
      <p>The sequence of relative frequencies of Article 1 is divided into three equal parts of 715 values.
For convenience, the data are placed in a separate table (Fig. 53). The correlation matrix is a square
table in which the correlation coefficient between the corresponding parameters is located at the
intersection of the corresponding row and column. The correlation matrix for the column divided into
3 parts has been constructed, and the results obtained are correlation coefficients that are less than 0.5
in absolute value (modulus), indicating a weak relationship. The correlation matrix shows that all values
are close to 0, so we can conclude that there is no connection at all. This is quite an expected result, as
the data do not depend on each other and have different values. We then find the coefficients of multiple
correlation (Fig. 54-55).</p>
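      <p>Splitting the frequency sequence into equal parts and building the pairwise correlation matrix, as described above, might be sketched as follows (the 2145-value series is synthetic, not the article's data):</p>

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(series, parts=3):
    """Split the sequence into equal parts and correlate every pair."""
    size = len(series) // parts
    chunks = [series[i * size:(i + 1) * size] for i in range(parts)]
    return [[pearson(a, b) for b in chunks] for a in chunks]

random.seed(1)
freqs = [random.random() * 0.01 for _ in range(2145)]  # 3 parts of 715 values
for row in correlation_matrix(freqs):
    print([round(v, 3) for v in row])
```

On independent random parts the off-diagonal coefficients come out close to 0, matching the observation above that the three segments of one article show almost no mutual correlation.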
      <p>According to these graphs, Article 1 and Article 2 were more likely to have been written by one
author, although Article 1 and Article 3 could also have been written by one author (but this is not true).
Articles 2-3, however, were definitely written by different authors. Applying linguistic and statistical
analysis of 3-grams to a set of articles makes it possible to form a subset of linguistically similar
publications. Imposing additional conditions on this subset in the form of statistical and quantitative
analyses (sets of keywords, stable phrases, stylistic, linguometric, etc.) will significantly reduce this
subset, clarifying the list of more likely authorial works. Thus, the analysis of the content and frequency
of occurrence of only business words separates Articles 1 and 3 into different subsets and places
Articles 1 and 2 in the same one. This study does not address the problem of identifying the author in
full, because the difference in authorial traits is subjective and depends on the limitations imposed on
the creative process of the author. However, as a result, a system that implements such methods is able
to give recommendations on the degree to which a text belongs to a particular author. Further
experimental research is needed to test the proposed method for determining the author's style in other
categories of texts: scientific humanities, art, journalism and more. Therefore, we compare the
frequencies of all trigrams that begin with a particular letter (Fig. 56).</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>The article dwells upon completed scientific research in the field of information technology in the
part concerning computer linguistics, artificial intelligence and machine learning. A correlation analysis
of text author identification results based on n-grams in Ukrainian technical and scientific texts has
been performed. Three articles have been compared and the results have been obtained. Quantitative
content analysis of textual scientific and technical content has been studied, based on the fact that text
authorship determination systems typically reuse plagiarism and rewriting identification metrics fully
or partially. The article presents a method of determining the author by decomposition on the basis of
the analysis of such speech coefficients as lexical diversity, degree of syntactic complexity, speech
coherence, and indices of exclusivity and concentration of the text. Also, the parameters of the author's
style, such as the numbers of words, sentences, prepositions and conjunctions and the quantity of words
with defined frequencies, have been analyzed. It is highlighted that smoothing procedures are widely
used in the algorithmic approach, so the relative frequency of 3-gram occurrences in the studied texts
has been smoothed by the moving average, exponential and median smoothing methods. It is proposed
to analyze the reference text in several stages for high-quality and effective analysis of content in
determining the degree of text authorship. To achieve the research goal, a system with the ability to
select the language / languages of the analyzed content has been developed and implemented on the
Victana Web-resource. In order to compare texts with each other, it is necessary to associate each text
with some numerical characteristic that is close for texts of the same author and differs for the works
of various authors; here the density of the distribution function of letter combinations of three
consecutive characters is used. The rapid spread of text documents in electronic form has made
automatic methods of content analysis important, including the necessity of document classification
and clustering by various criteria.
8. References
[24] V. I. Perebyynis, M. P. Muravytska, N. P. Darchuk, Chastotni slovnyky ta yikh vykorystannya
[Frequency dictionaries and their use], Naukova dumka [Scientific opinion], 1983.
[25] I. Khomytska, V. Teslyuk, A. Holovatyy, O. Morushko, Development of methods, models, and
means for the author attribution of a text, Eastern-European Journal of Enterprise Technologies.
3(2(93)) (2018) 41–46. doi: 10.15587/1729-4061.2018.132052.
[26] I. Khomytska, V. Teslyuk, Authorship and Style Attribution by Statistical Methods of Style
Differentiation on the Phonological Level, Advances in Intelligent Systems and Computing 871
(2019) 105–118. doi: 10.1007/978-3-030-01069-0_8.
[27] I. Khomytska, V. Teslyuk, N. Kryvinska, I. Bazylevych, Software-Based Approach Towards
Automated Authorship Acknowledgement – Chi‐Square Test on One Consonant Group,
Electronics 9(7) (2020) 1138. doi: 10.3390/electronics9071138.
[28] I. Khomytska, V. Teslyuk, I. Bazylevych, I. Shylinska, Approach for Minimization of Phoneme
Groups in Authorship Attribution, International Journal of Computing 19(1) (2020)
55–62. doi: 10.47839/IJC.19.1.1693.
[29] D. Jurafsky, J. H. Martin, N-gram Language Models. URL:
https://web.stanford.edu/~jurafsky/slp3/3.pdf.
[30] D. Jurafsky, J. H. Martin, Speech and Language Processing. URL:
https://web.stanford.edu/~jurafsky/slp3/ed3book_sep212021.pdf.
[31] D. Jurafsky, J. H. Martin, Regular Expressions, Text Normalization, Edit Distance. URL:
https://web.stanford.edu/~jurafsky/slp3/2.pdf.
[32] O. S. Goh, C. C. Fung, A. Depickere, Domain knowledge query conversation bots in instant
messaging (IM), Knowledge-Based Systems 21(7). (2008) 681-691.
[33] S. Buk, Osnovy statystychnoy lingvistyky [Fundamentals of statistical linguistics], LNU n. I. Franko Publishing House, 2008.
[34] V. Vysotska, V. B. Fernandes, V. Lytvyn, M. Emmerich, M. Hrendus, Method for Determining
Linguometric Coefficient Dynamics of Ukrainian Text Content Authorship, Advances in
Intelligent Systems and Computing 871 (2018) 132–151. doi: 10.1007/978-3-030-01069-0_10.
[35] P. Kravets, The control agent with fuzzy logic, in: Proceedings of the International Conference on</p>
      <p>Perspective Technologies and Methods in MEMS Design, Lviv, Ukraine, 2010, pp. 40–41.
[36] P. Kravets, The Game Method for Orthonormal Systems Construction, in: Proceedings of the 2007
9th International Conference - The Experience of Designing and Applications of CAD Systems in
Microelectronics, Lviv, Ukraine, 2007. doi: https://doi.org/10.1109/cadsm.2007.4297555.
[37] P. Kravets, Game Model of Dragonfly Animat Self-Learning, in: Proceedings of the International</p>
      <p>Conference on Perspective Technologies and Methodsin MEMS Design, Lviv, 2016, 195–201.
[38] V. Lytvyn, V. Vysotska, I. Budz, Y. Pelekh, N. Sokulska, R. Kovalchuk, L. Dzyubyk, O.</p>
      <p>Tereshchuk, M. Komar, Development of the quantitative method for automated text content
authorship attribution based on the statistical analysis of N-grams distribution, Eastern-European
Journal of Enterprise Technologies 6(2-102) (2019) 28–51.
[39] I. Balush, V. Vysotska, S. Albota, Recommendation System Development Based on Intelligent</p>
      <p>Search NLP and Machine Learning Methods, CEUR WorkshopProceedings 2917 (2021) 584–617.
[40] A. Berko, Y. Matseliukh, Y. Ivaniv, L. Chyrun, V. Schuchmann, The Text Classification Based on
Big Data Analysis for Keyword Definition Using Stemming, in: Proceedings of the 2021 IEEE
16th International Conference on Computer Sciences and Information Technologies (CSIT), 1,
Lviv, Ukraine, 2021, pp. 184–188. doi: 10.1109/CSIT52700.2021.9648764.
[41] N. Shakhovska, K. Shakhovska, The Method of Text Tonality Classification, in: Proceedings of
the Computer Sciences and Information Technologies (CSIT), 1, Lviv, Ukraine, 2020, pp. 19–23.
[42] I. Khomytska, V. Teslyuk, L. Bordyuk, The Kolmogorov-Smirnov’s Test for Authorship
Attribution on the Phonological Level, in: Proceedings of the Computer Sciences and Information
Technologies (CSIT), 1, 2020, pp. 259–262. doi: 10.1109/CSIT49958.2020.9322042.
[43] Y. Hlavcheva, V. Bobicev, O. Kanishcheva, Language-independent features for authorship
attribution on Ukrainian texts, CEUR Workshop Proceedings Vol-2833 (2021) 134–143.
[44] O. Bisikalo, O. Boivan, N. Khairova, O. V. Kovtun, V. Kovtun, Precision Automated Phonetic
Analysis of Speech Signals for Information Technology of Text-dependent Authentication of a
Person by Voice, CEUR Workshop Proceedings Vol-2853 (2021) 276–288.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref2">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S. O.</given-names>
            <surname>Sushko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Y.</given-names>
            <surname>Fomychova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. S.</given-names>
            <surname>Barsukov</surname>
          </string-name>
          ,
          <article-title>Chastoty povtoryuvanosti bukv i bihram u vidkrytykh tekstakh ukrayinsʹkoyu movoyu [Frequency of repetition of letters and digrams in open texts in Ukrainian]</article-title>
          .
          <source>Zakhyst informatsiyi [Protection of information] 12</source>
          (
          <issue>3</issue>
          (
          <issue>48</issue>
          )) (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [2] P. S. Dyurdeva, Authorship definition based on the frequency distribution of letter combinations, 2015. URL: https://www.math.spbu.ru/SD_AIS/documents/2015-12-441/2015-12-b-07.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Likasa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vlassisb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Verbeekb</surname>
          </string-name>
          ,
          <article-title>The global k-means clustering algorithm</article-title>
          ,
          <source>Pattern Recognition</source>
          <volume>36</volume>
          (
          <issue>2</issue>
          ) (
          <year>2003</year>
          )
          <fpage>451</fpage>
          -
          <lpage>461</lpage>
          . URL: https://www.cs.uoi.gr/~arly/papers/PR2003.pdf
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [4] A. P. Reynolds, G. Richards, V. J. Rayward-Smith, The Application of K-medoids and PAM to the Clustering of Rules, Lecture Notes in Computer Science 3177 (2004). doi: 10.1007/978-3-540-28651-6_25.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [5] S. Babichev, M. A. Taif, V. Lytvynenko, V. Osypenko, Criterial analysis of gene expression sequences to create the objective clustering inductive technology, in: Proceedings of the International Conference on Electronics and Nanotechnology, ELNANO, 2017, pp. 244–248. doi: 10.1109/ELNANO.2017.7939756.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6] S. Babichev, B. Durnyak, I. Pikh, V. Senkivskyy, An Evaluation of the Objective Clustering Inductive Technology Effectiveness Implemented Using Density-Based and Agglomerative Hierarchical Clustering Algorithms, Advances in Intelligent Systems and Computing 1020 (2020) 532–553. doi: 10.1007/978-3-030-26474-1_37.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7] S. A. Babichev, A. Gozhyj, A. I. Kornelyuk, V. I. Lytvynenko, Objective clustering inductive technology of gene expression profiles based on SOTA clustering algorithm, Biopolymers and Cell 33(5) (2017) 379–392. doi: 10.7124/bc.000961.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8] K-Means Clustering. URL: https://people.revoledu.com/kardi/tutorial/kMean/#google_vignette.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9] A. S. Romanov, Methodology and software package for identifying the author of an unknown text, 2010. URL: https://www.dissercat.com/content/metodika-i-programmnyi-kompleks-dlyaidentifikatsii-avtora-neizvestnogo-teksta.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Borisov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yu. N.</given-names>
            <surname>Orlov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Osminin</surname>
          </string-name>
          ,
          <article-title>Identification of the author of the text by the frequency distribution of letter combinations</article-title>
          ,
          <source>Applied Informatics</source>
          <volume>26</volume>
          (
          <issue>2</issue>
          ) (
          <year>2013</year>
          )
          <fpage>95</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Yu. N.</given-names>
            <surname>Orlov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Osminin</surname>
          </string-name>
          ,
          <article-title>Determining the genre and author of a literary work by statistical methods</article-title>
          ,
          <year>2010</year>
          . URL: https://keldysh.ru/papers/2013/prep2013_27.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Yu. N.</given-names>
            <surname>Orlov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Osminin</surname>
          </string-name>
          ,
          <source>Methods of statistical analysis of literary texts</source>
          , LIBROKOM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V. A.</given-names>
            <surname>Pavlov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Dyurdeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Shalymov</surname>
          </string-name>
          ,
          <article-title>Clustering of Russian Manuscripts Based on the Feature Relationship Graph</article-title>
          ,
          <source>Computer tools in education</source>
          <volume>1</volume>
          (
          <year>2016</year>
          )
          <fpage>24</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T. V.</given-names>
            <surname>Batura</surname>
          </string-name>
          .
          <article-title>Formal methods for determining the authorship of texts</article-title>
          ,
          <source>NSU Bulletin, Series: Information Technologies</source>
          <volume>10</volume>
          (
          <issue>4</issue>
          )
          (
          <year>2012</year>
          ). URL: https://cyberleninka.ru/article/n/formalnye-metodyopredeleniya-avtorstva-tekstov
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Romanyshyn</surname>
          </string-name>
          ,
          <article-title>Intro to Natural Language Processing</article-title>
          . Grammarly, Inc.,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Babash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. P.</given-names>
            <surname>Shankin</surname>
          </string-name>
          , <source>Cryptography</source>, SOLON-R,
          <year>2002</year>
          . URL: https://pub.flowpaper.com/docs/https://book.edu-lib.net/books1/Babash_Kriprografiya_1.pdf, https://pub.flowpaper.com/docs/https://book.edu-lib.net/books1/Babash_Kriprografiya_2.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Alferov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Yu.</given-names>
            <surname>Zubov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Kuzmin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Cheryomushkin</surname>
          </string-name>
          , <source>Fundamentals of cryptography</source>, Helios,
          <year>2002</year>
          . URL: https://studfile.net/preview/6311470/.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Yaglom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. M.</given-names>
            <surname>Yaglom</surname>
          </string-name>
          ,
          <source>Probability and information</source>
          , Nauka, Phys.-Math. lit.,
          <year>1973</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Piotrovsky</surname>
          </string-name>
          , <source>Information measurements of language</source>, Nauka,
          <year>1968</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>I. M.</given-names>
            <surname>Yaglom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Dobrushin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Yaglom</surname>
          </string-name>
          ,
          <source>Information theory and linguistics</source>
          ,
          <source>Questions of linguistics 1</source>
          (
          <year>1960</year>
          )
          <fpage>100</fpage>
          -
          <lpage>110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Lebedev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. A.</given-names>
            <surname>Garmash</surname>
          </string-name>
          ,
          <article-title>On the possibility of increasing the speed of transmission of telegraph messages</article-title>
          ,
          <source>Telecommunications</source>
          <volume>1</volume>
          (
          <year>1958</year>
          )
          <fpage>68</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Shannon</surname>
          </string-name>
          ,
          <article-title>Prediction and entropy of printed English</article-title>
          ,
          <year>1951</year>
          . URL: https://www.princeton.edu/~wbialek/rome/refs/shannon_51.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>O. V.</given-names>
            <surname>Verbitskyy</surname>
          </string-name>
          , <source>Vstup do kryptolohiyi [Introduction to cryptology]</source>
          , Vydavnytstvo Naukovo-tekhnichnoyi literatury [Publishing House of Scientific and Technical Literature], Lviv,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>