Correlation Analysis of Text Author Identification Results Based on N-Grams Frequency Distribution in Ukrainian Scientific and Technical Articles

Victoria Vysotska victoria.a.vysotska@lpnu.ua Lviv Polytechnic National University

S. Bandera Street, 12 79013 Lviv Ukraine

Osnabrück University

Friedrich-Janssen-Str. 1 49076 Osnabrück Germany

Oksana Markiv oksana.o.markiv@lpnu.ua Lviv Polytechnic National University

S. Bandera Street, 12 79013 Lviv Ukraine

Sofiia Teslia sofiia.teslia.sa.2019@lpnu.ua Lviv Polytechnic National University

S. Bandera Street, 12 79013 Lviv Ukraine

Yeva Romanova yeva.romanova.sa.2019@lpnu.ua Lviv Polytechnic National University

S. Bandera Street, 12 79013 Lviv Ukraine

Inesa Pihulechko inesa.pihulechko.sa.2019@lpnu.ua Lviv Polytechnic National University

S. Bandera Street, 12 79013 Lviv Ukraine

International Conference on Computational Linguistics and Intelligent Systems

May 12-13 2022 Gliwice Poland

Keywords: N-grams, NLP, correlation analysis, authorship definition, Ukrainian text, distribution function density, exponential and median smoothing, linguometry, stylometric analysis

This paper studies the results of experimental approbation of the proposed content monitoring method for determining the author's style in Ukrainian scientific texts of a technical profile. Authorship identification systems typically rely on plagiarism and rewrite metrics, because the usual need is to establish whether a work has been borrowed fully or partially. As a result, the case in which the work has not yet been published is not taken into consideration. Quantitative content analysis of scientific and technical texts combines content monitoring with text analysis based on NLP, Web-Mining and stylometry methods to identify a set of authors whose speech styles are similar to the studied passages. This narrows the search for the subsequent stylometric methods that determine the degree to which the analyzed text belongs to a particular author. The method of determining the author is decomposed on the basis of an analysis of speech coefficients: lexical diversity, degree (measure) of syntactic complexity, speech coherence, and the indices of text exclusivity and concentration. In parallel, parameters of the author's style are analyzed, such as the numbers of words, sentences, prepositions and conjunctions in the text and the number of words with a frequency of 1 and of 10 or more.

Introduction

Due to the increasing availability and distribution of text documents in electronic form, the importance of automatic methods for analyzing document content has increased [1][2][3]. The tasks of text analysis include the classification and clustering of documents [4][5][6][7] by various criteria, such as genre, writing format (novel, essay), emotional coloring and speech style, as well as the task of text author identification [8-14].

With simpler access to various data and a growing ability to search, copy and distribute data over networks, the task of identifying the author becomes urgent. Issues related to the determination of authorship are also important in linguistic, historical and forensic research. The general availability of electronic devices makes it possible to move author recognition that involves a large number of experts into the background, and to speed up and simplify this process through automation.

Author identification is defined as the process of determining the author of a text from the set of general and particular textual features that constitute the author's style [8][9].

Related Works

Statistical methods based on the search for an "author invariant" are popular in existing systems for determining text authorship. The author invariant characterizes the linguistic features of a text (lexical, grammatical, phraseological and others). It can be, for example, the share of vowels or consonants, the frequency of use of a certain part of speech, the probability of transitions from one part of speech to another, "favorite" words, or information entropy. A statistical method for determining the author and genre of a text based on the frequency distribution of letter combinations (n-grams) was proposed in [10][11][12]. This method has shown decent results for Slavic-language research publications. Unfortunately, the accuracy of statistical authorship methods depends on the specifics of the data: the language, style and length of the studied texts [13][14][15][16][17][18][19][20][21][22][23][24][25][26][27][28]. Because of this, it is difficult to draw conclusions about the accuracy of such an approach on data of a different nature. For this reason, the aim of this work is to analyze the application of the distribution of letter combinations, for different languages, to the problem of establishing the authorship of texts of different lengths written in different language styles. The chosen topic, relative n-gram frequencies, is only beginning to gain popularity in Ukraine. Several literature sources describing what n-grams are and what they are used for have been found.

An N-gram is a sequence of n elements [29]. From a semantic point of view, it can be a sequence of sounds, syllables, words or letters. In practice, N-grams most commonly appear as series of words, that is, stable phrases called collocations. A sequence of two consecutive elements is often called a bigram and a sequence of three elements a trigram; both are present in the studied dataset. Sequences of four or more elements are denoted as N-grams, where N is replaced by the number of consecutive elements. N-grams are used in a wide range of sciences, for example in theoretical mathematics, biology, cartography and music. The most common uses of N-grams include extracting features for clustering series of satellite images of the Earth and deciding which specific part of the Earth an image shows, searching for genetic sequences, computer data compression, and indexing data in search engines, where N-grams are usually used to index sound-related data. In natural language processing, N-grams are used mainly for prediction based on probabilistic models. The N-gram model calculates the probability of the last word of the N-gram given that all the previous ones are known. When this approach is used to model language, it is assumed that the appearance of each word depends only on the preceding words [30]. Another application of N-grams is the detection of plagiarism. If a text is divided into several small fragments represented by N-grams, the fragments are easy to compare with each other, which gives a degree of similarity of the checked documents [31]. N-grams are often used successfully to categorize text and language. In addition, they can be used to create functions that extract knowledge from textual data. With N-grams, candidates for replacing misspelled words can be found effectively.

Google research centers have used N-gram models for a wide range of research and development, including statistical translation from one language to another, language recognition, spelling correction, information retrieval and more. These projects used text corpora containing several trillion words. Google decided to create its own training corpus. The project is called the Google tera corpus and contains 1,024,908,267,229 words collected from public websites [32].

Cryptogram decryption has long been aided by frequency analysis, the essence of which is the study of statistical patterns in the appearance of symbols and their combinations in original and encrypted messages [1]. To complicate frequency analysis, ciphers that lead to a uniform distribution of characters in the cryptogram have appeared in cryptography. The principles of frequency analysis are widely used in password recovery programs and reduce the search time by several orders of magnitude [2]. They are also used in the classification and clustering [4][5][6][7] of documents by various criteria, such as genre, epoch, format (novel, essay), emotional coloring and speech style, as well as in the task of determining the text author [8][9][10][11][12][13][14][15]. Frequency analysis requires, first of all, reference frequencies of the letters of the alphabet in which the plaintexts are written, together with reference frequencies of N-grams. For Ukrainian, English and almost all European languages, the average frequencies of letters, bigrams and trigrams can be found in the literature [16][17][18][19][20][21][22].

Unfortunately, for the Ukrainian language only the frequencies of letters are given in the literature [23][24][25]. Therefore, the purpose of this work is to investigate the frequencies of letters and letter combinations of the Ukrainian language on the basis of randomly selected Ukrainian texts of scientific and technical orientation. The analysis of the obtained data confirms that the alternation of vowels and consonants is inherent in Ukrainian, as in other European languages. In other texts the frequencies of a given letter may differ, which is explained, firstly, by the length of the studied text and, secondly, by its subject matter. For example, the generally rare letter F can become quite common in technical texts, because it is used in such words as function, differential, diffusion, coefficient, etc. Even greater deviations from the usual use of individual letters are observed in some literary works, especially in poems.

Methods and Materials

Modern systems for determining text authorship use different approaches from mathematical statistics, pattern recognition and probability theory, cluster analysis algorithms, neural networks and others [33][34][35][36][37][38][39][40][41][42][43][44][45]. The systems differ in the method of author identification, the means of text analysis, the required amount of text and the accuracy [12]. Methods of authorship identification based on the calculation of text characteristics (function parts of speech such as prepositions, conjunctions and particles; independent parts of speech such as nouns, verbs and adjectives; word lengths; sentence lengths) also differ in how they compare frequencies across different textual content for different tasks [46][47][48][49][50][51][52][53][54][55][56][57][58][59][60][61][62][63]. The most commonly used measures for comparing texts are information entropy, Fisher information, the chi-squared test and the Kullback-Leibler divergence [9][10][11][12].

When identifying the author of a text, it is assumed that the text reflects the individual style of its author, which makes it possible to distinguish it from other texts. To compare texts with each other, each text must be associated with a numerical characteristic that is close for texts of the same author and differs between works of different authors. In [9][10][11][12], this characteristic is the distribution function density (DFD) of letter combinations of three consecutive characters (3-grams). The DFD is defined as the set of empirical frequencies of occurrence of letters or their combinations. The analysis of a text with the help of the DFD does not take into account punctuation marks, spaces and numbers.

The task of identifying the author of an unknown text in terms of the DFD is formulated as follows.

Let there be a set of texts containing works of known authors. Let $L_t$ be the number of works by the $t$-th author and $N_{i,t}$ the number of symbols in the $i$-th work of the $t$-th author, $i = 1, \dots, L_t$. All texts in this set are represented in the form of a DFD. The DFD of a text of volume $N_{i,t}$ is given as the set of values $p_{i,t}(j) = l_j / N_{i,t}$, where $l_j$ is the number of occurrences of the N-gram with number $j$. The argument $j = 1, \dots, f(n, M)$ indexes the letter combinations (n-grams) in alphabetical order, where $M$ is the size of the alphabet of the language in which the text is written and $n$ is the order of the N-grams, i.e. the number of characters in a letter combination. $f(n, M) = M^n$ is the number of possible N-grams over this alphabet.

Each author is identified with his weighted average DFD which is given by formula (1):

$$P_t = \frac{1}{N_t} \sum_{i=1}^{L_t} p_{i,t} N_{i,t}, \qquad N_t = \sum_{i=1}^{L_t} N_{i,t} \tag{1}$$

These DFDs play the role of author standards [9][10][11][12]. To compare two texts, or a text and an author standard, it is necessary to specify the distance between the corresponding distribution functions. The norm in the space of summable functions (the L1 norm) is used as the distance metric. For example, the distance $w_{0,t}$ between the DFD $p_0$ of the unknown text and the DFD of any author standard is calculated by formula (2):

$$w_{0,t} = \| p_0 - P_t \| = \sum_{j=1}^{f(n,M)} \left| p_0(j) - P_t(j) \right| \tag{2}$$

Accordingly, the text «0» is attributed to the author whose standard DFD lies at the shortest distance from it.
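The definitions in formulas (1) and (2) can be illustrated with a minimal Python sketch. The function names (`dfd`, `author_standard`, `l1_distance`, `attribute`) are hypothetical and are not taken from the cited works. The sketch stores only the n-grams that actually occur instead of all $M^n$ slots, and it treats every alphabetic character as a letter; both are simplifying assumptions.

```python
from collections import Counter


def dfd(text, n=3):
    """Return (DFD, N): n-gram counts divided by the number of symbols N.

    Punctuation, spaces and digits are dropped, as in the description above.
    """
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)  # N: number of symbols kept
    grams = Counter("".join(letters[i:i + n]) for i in range(total - n + 1))
    return {g: c / total for g, c in grams.items()}, total


def author_standard(texts, n=3):
    """Weighted average DFD of one author's works, as in formula (1)."""
    acc, n_total = Counter(), 0
    for text in texts:
        p, size = dfd(text, n)
        n_total += size
        for gram, value in p.items():
            acc[gram] += value * size  # p_i weighted by N_i
    return {gram: value / n_total for gram, value in acc.items()}


def l1_distance(p, q):
    """L1 distance between two DFDs, as in formula (2)."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))


def attribute(text, standards, n=3):
    """Pick the author whose standard is closest to the text's DFD."""
    p0, _ = dfd(text, n)
    return min(standards, key=lambda author: l1_distance(p0, standards[author]))
```

With `standards = {name: author_standard(works) for name, works in corpus.items()}`, a call to `attribute(unknown_text, standards)` returns the name with the smallest distance $w_{0,t}$.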

When solving the classification problem, the data set was not explicitly divided into test and training sets. The weighted average DFDs were built on the whole set of books by each author. The distance from book $i$ to "its" $t$-th author is calculated by formula (3):

$$w_{i,t} = \frac{\| p_{i,t} - P_t \|}{1 - N_{i,t} / N_t} \tag{3}$$
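As a check on formula (3), the following sketch compares it numerically with the distance from a work to the weighted average built from the author's other works only. The two agree because $p_{i,t} - P_t^{(-i)} = (p_{i,t} - P_t) / (1 - N_{i,t}/N_t)$, where $P_t^{(-i)}$ denotes the average without work $i$ (notation introduced here for illustration). The function names and the toy numbers are hypothetical; DFDs are represented as plain lists of equal length.

```python
def l1(p, q):
    """L1 distance between two DFD vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(p, q))


def weighted_average(dfds, sizes):
    """Weighted average DFD as in formula (1)."""
    n_t = sum(sizes)
    dim = len(dfds[0])
    return [sum(p[j] * n for p, n in zip(dfds, sizes)) / n_t for j in range(dim)]


def leave_one_out_distance(i, dfds, sizes):
    """Formula (3): distance of work i to its own author's standard,
    corrected for the work's own contribution.

    Requires at least two works, otherwise the denominator is zero.
    """
    n_t = sum(sizes)
    standard = weighted_average(dfds, sizes)
    return l1(dfds[i], standard) / (1.0 - sizes[i] / n_t)


# Toy data: three works of one author over a three-slot DFD.
dfds = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]]
sizes = [100, 300, 600]
direct = l1(dfds[0], weighted_average(dfds[1:], sizes[1:]))
corrected = leave_one_out_distance(0, dfds, sizes)
```

Here `direct` and `corrected` coincide up to rounding, so formula (3) avoids rebuilding the author standard for every work.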

Formula (3) excludes the contribution of the DFD of the $i$-th document (article) to the average DFD of "its" author [9][10][11][12].

Smoothing the 3-gram distribution functions by an analytical approach is impossible because the function is too complex. Only an algorithmic approach remains, for which the main methods are the simple (ordinary) moving average, the weighted moving average, exponential smoothing and median smoothing. In our case, we believe that the moving average method, also known as the filtering method, is the most suitable. Its application reduces the variability of the data. This fits our chosen tactic of ignoring extreme data, highs and lows. The degree of smoothing should be tied in advance to a criterion that ensures maximum smoothing while still retaining information.

In our specific case, we believe that a correlation analysis of the 3-gram data sequences of any two of the three selected articles can help to determine the relationship between them, and thus to answer the question of how similar the topics of the articles are. To do this, the set of values for the first studied article is denoted by the variable x and the set of values for the second article by the variable y, and a correlation analysis is performed on the two sequences. The task of correlation analysis is not only to assess the relationship but also to reduce the target score to a numerical expression. Studying 3-gram sequences significantly reduces the number of variables that have to be treated as important. Combining groups with related metrics forms a new cluster whose closeness metrics are compared with the others, so that a fairly clear structure of the data set can be obtained. The quantitative method of identifying the potential author of a text from a set of possible ones, based on comparing the reference text with the studied one, relies on the technique of linguometry.
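A minimal sketch of such a correlation analysis is given below, assuming that the Pearson coefficient is the intended measure and that each article is represented as a mapping from 3-gram to relative frequency. Restricting the comparison to 3-grams present in both articles is also an assumption of this sketch; the names `pearson` and `correlate_articles` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def correlate_articles(freq_x, freq_y):
    """Correlate the relative frequencies of the 3-grams shared by two articles."""
    common = sorted(set(freq_x) & set(freq_y))
    return pearson([freq_x[g] for g in common], [freq_y[g] for g in common])
```

A coefficient close to 1 means that the shared 3-grams are used in nearly the same proportions in both articles.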

Linguometry is a branch of applied linguistics that detects, measures and analyzes the quantitative characteristics of units at different levels of language or speech [33]. Using the apparatus of mathematical statistics, linguometry is involved in solving such problems of linguistics as the following:

• dictionaries (including frequency and statistical ones) and comparisons

• automatic dictionaries, thesauri

• shorthand systems

• methods and means of automatic language detection

• methods and means of information retrieval, etc.

Each language has its own statistical parameters, and knowledge of the frequency of occurrence of letters and their combinations (2-grams, 3-grams, 4-grams) makes it possible to identify the language automatically. For example, for Ukrainian texts [34][35][36][37] it was found that the frequencies of vowels, consonants and spaces between words, as well as of the soft and sonorous groups of consonants, can serve as statistical parameters of styles [33]. We will show how to evaluate the speech of a particular author on a particular passage of his work [36] using a certain standard, for example the frequencies of the letters of the Ukrainian language. Consider two passages of technical text in Ukrainian presented in a format where the letters are arranged in descending order of frequency of appearance (the frequencies are given in Table 1); no distinction is made between lowercase and uppercase letters. The correlation between the letter frequencies of the passages [35] and the standard [36] has been investigated. The results that confirm the conclusions are also presented graphically. For convenience, Table 1 contains the following data: the frequencies of the letters of the Ukrainian language, and the absolute and relative frequencies of the letters used in the studied Passage 1 (Article 1) [35] and Passage 2 (Article 2) [36]. Passage 1 contains 556 characters; Passage 2 contains 541 characters. The entry "other" in the column of letters groups letters specific to the Ukrainian language (ї, є, г, і), which are rarely used in most technical texts. This allows some independence to be achieved in the analysis. Fig. 1 illustrates the obtained results graphically.

Figure 1: The relative frequencies of letters in the standard and the studied passages

The graphical representation of the relative frequencies of letters in the passages gives a convincing answer to the question of which passage was written by which author.

The distribution of 1-grams differs between the works. According to [38][39][40][41][42][43][44], the analysis of 3-grams is the optimal indicator for studying texts. We will check this in the next stages of the study. There is a sharp jump in the relative frequency of occurrence of the letter "e" for Passage 2 relative to the reference values of Standard 1 [36] (Fig. 2), so we assume that Standard 1 was more likely written by the author of Passage 1 [35]. We also give the numerical values of the correlation between the letter frequencies in the passages and in the standard. We find two correlation coefficients, one for the standard and Passage 1 [35] and one for the standard and Passage 2 [37]; the coefficient closer to 1 indicates the passage that is more likely to belong to the author of the standard. The correlation coefficient for the standard and Passage 1 is Re-У1 = 0.962716, and for the standard and Passage 2 it is Re-У2 = 0.909958. Similarly, the values of the relative frequencies in Standard 2 and Passages 1 and 2 in Fig. 3 differ significantly, so it is likely that the author of Standard 2 [34] is not the author of Passages 1 and 2. The obtained values of the coefficients, together with the analysis of the graphical results, allow us to state that the probability that Passage 1 [35] belongs to Standard 1 [36] is higher than for Passage 2 [34].

To achieve the research goal, a system with the ability to select the language or languages of the analyzed content has been developed and implemented on the Victana web-resource [63]. For a high-quality and effective analysis of content when determining the degree of authorship of a particular person, we propose to analyze the reference text and the studied text in several stages:

• Linguometric analysis of the coefficients of diversity of the author's speech (Fig. 4, Alg. 1);

• Stylometric analysis (Fig. 5);

• Analysis of stable phrases (Fig. 6);

• Linguistic and statistical analysis through N-grams (Fig. 7).

The Web-resource for linguometric analysis has the following fields (Fig. 4):

• Content: the field into which the studied text is pasted from the clipboard;

• Signs: the limit on the size of the content (the entered text must contain at least 100 and at most 10000 characters);

• Calculation: starts the calculation;

• Clearance: clears the entered data.

Algorithm 1. Linguometric analysis of the text to determine authorship.

Step 1. Check the length of the text; any excess is cut off.

Step 2. Determine the number of sentences.

Step 3. Clean the studied text (remove numbers and special symbols).

Step 4. Determine the total number of words in the text N.

Step 5. Determine the number of words W.

Step 6. Determine the number of prepositions Z.

Step 7. Determine the number of conjunctions S.

Step 8. Calculate the coefficients of the author's speech.

Step 9. Output the results to the end user (Table 2, Fig. 1).

Figure 4: The example of linguistic analysis application result

The Web-resource for stylistic analysis has the following fields (Fig. 5):

• Select Passage 1 (2, 3): opens access to the passages. Access to the next passage is available only after access to the previous one has been activated; access is opened sequentially from the smaller number to the larger one.

• Reference text: the field into which the reference text is pasted from the clipboard.

• "The text you enter must be at least 100 characters long. (Now 0)": after the calculation starts, the actual number of characters of each passage is calculated and displayed separately.

• Passage 1 (2, 3): the field into which the corresponding passage is pasted from the clipboard.

• Calculate: starts the calculation.

• Clear: clears the entered data.

Algorithm 2. Stylometric analysis of the text to determine authorship.

Step 1. Check the lengths of the standard text and the selected passages, and reduce the length of the reference text to the minimum of the checked lengths.

Step 2. Clean the reference text of special characters, etc.

Step 3. Determine the number of words in the text of the standard.

Step 4. Determine the number of stop words (prepositions + conjunctions + particles) in the text of the standard (Fig. 5-6).

Step 5. Ensure that the length of Passage 1 does not exceed the minimum text length.

Step 6. Clean Passage 1 of special characters, etc.

Step 7. Determine the number of words W1 in Passage 1.

Step 8. Determine the number of stop words (prepositions + conjunctions + particles) in the text.

Step 9. Prepare the individual arrays (passage and standard) for calculating the correlation coefficient (Fig. 6).

Step 10. Call the function that calculates the correlation coefficient.

Step 11. Form an array for the graphical representation of the relative frequency of stop words in Passage 1 and in the standard.

Step 12. Call the function that builds the relative frequency distribution graph (Fig. 6).

Step 13. Call the function that calculates the correlation coefficient of Passage 2 (3) for each of the function words.

Step 14. Form the words of the Swadesh list from the reference book and determine the number of words from the Swadesh list in the text of the passage.

Step 15. Form common for the Standard, Passages 1-3 and the Swadesh list.

Step 16. Display the results of the study on the screen.

Figure 5: Example of data entry for stylometric analysis

Figure 6: Example of stylometric analysis application results

Experiment

When identifying the author of a text, it is assumed that the text reflects the individual writing style of the author, which distinguishes him from others. To compare texts with each other, each text must be associated with numerical characteristics that are close for texts of the same author and significantly different for works of different authors.

The Web-resource for the analysis of N-grams has the following fields (Fig. 7):

• Number of grams: the number of characters in a gram. The default is 3; it can be set to 1, 2, 3 or 4.

• Choice of the text language: the language of the text for analysis (research). The default is "Ukrainian".

• Text: the field into which the studied text is pasted from the clipboard.

• Restriction of the text length in characters.

• Generation: starts generating N-grams.

• Clearance: clears the entered data.

Algorithm 3. Linguistic and statistical analysis of the N-grams of a text.

Step 1. Clean the studied text (remove numbers and special symbols).

Step 2. Calculate the number of words in the text.

Step 3. Convert all words of the text to lower case.

Step 4. Remove the spaces.

Step 5. Depending on the selected language, substitute the corresponding alphabet.

Step 6. Depending on the set number of grams, start the corresponding function, which generates all possible variants of grams and saves them in an array.

Step 7. Start the function that counts the number of occurrences of each gram. Here the relative frequency of occurrence is also calculated, and the following are stored in the array: the ordinal number of the gram, the gram itself, the number of occurrences of this gram, and the relative frequency of occurrence of this gram.

Step 8. The next function prepares the array obtained in the previous step for export to a CSV file. This file is stored on the server. The user (researcher) can download it via a link that becomes available once the form with the results of the study has been generated.

Step 9. Display the results of the study on the screen (only those grams that are found in the text).

Step 10. Access the export file.

Figure 7: Example of N-gram text analysis application
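The steps of Algorithm 3 can be sketched in Python as follows. The sketch makes several assumptions that are not stated in the source: the ordinal number of a gram is its rank in the alphabetical enumeration of all $M^n$ grams (computed directly instead of generating the full array), the relative frequency is the count divided by the total number of grams in the text, and the CSV is returned as a string rather than stored on a server. The word count of Step 2 is omitted, and the function names are hypothetical.

```python
import csv
import io
from collections import Counter

UKRAINIAN_ALPHABET = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"  # 33 letters


def gram_number(gram, alphabet=UKRAINIAN_ALPHABET):
    """1-based rank of a gram in the alphabetical list of all possible grams."""
    rank = 0
    for ch in gram:
        rank = rank * len(alphabet) + alphabet.index(ch)
    return rank + 1


def ngram_table(text, n=3, alphabet=UKRAINIAN_ALPHABET):
    """Rows (number, gram, count, relative frequency) for grams found in the text."""
    # Lower-case the text and keep only letters of the chosen alphabet;
    # this also removes spaces, digits and special symbols.
    cleaned = [ch for ch in text.lower() if ch in alphabet]
    counts = Counter("".join(cleaned[i:i + n]) for i in range(len(cleaned) - n + 1))
    total = sum(counts.values())
    rows = [(gram_number(g, alphabet), g, c, c / total) for g, c in counts.items()]
    return sorted(rows)  # ordered by ordinal number


def to_csv(rows):
    """Serialize the table to CSV text for export."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["number", "gram", "count", "relative_frequency"])
    writer.writerows(rows)
    return buffer.getvalue()
```

For example, `to_csv(ngram_table(article_text, n=3))` yields one line per 3-gram that occurs in `article_text`, where `article_text` stands for any Ukrainian text supplied by the user.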

Three publications of scientific and technical orientation have been compared on the basis of linguistic and statistical analysis of 3-grams. Articles 1 and 2 were written by one team, and Article 3 was written by another author (Table 3). The language of the texts is Ukrainian (33 letters in the alphabet, 33^3 = 35937 possible 3-grams). When comparing the articles, only those 3-grams that occur at least once in all three articles have been taken into account. For this particular example, there are 2147 such 3-grams. That is, 78.4814% of the 3-grams have been analyzed for Article 1, 72.6332% for Article 2 and 84.1271% for Article 3. Accordingly, the difference in the use of the relevant 3-grams is R12 = 56.5254% between Articles 1 and 2, R23 = 69.4271% between Articles 2 and 3, and R13 = 62.9839% between Articles 1 and 3. The smaller Rij is, the more likely it is that the two articles were written by the same author. These indicators alone show that the characteristics of Articles 1 and 2 are more similar than those of Articles 1 and 3 or Articles 2 and 3: R23 exceeds R12 by 12.9017 percentage points, R23 exceeds R13 by 6.4432 and R13 exceeds R12 by 6.4585, that is, R23 > R13 > R12. Articles 1 and 2 are therefore more likely to have been written by one author or team than Articles 2 and 3 or Articles 1 and 3. Next, we analyze the use of individual clusters of 3-grams in the articles and compare the results.

Fig. 8 presents the results of the analysis of the use in Articles 1-3 of 3-grams starting with the letter a (their share in Articles 1-3 lies in the range 6.1125-6.7087%). Most often the curves for Articles 1-2 (4.2322%) and Articles 1-3 (4.197%) coincide or approach each other (the average discrepancy is 0.02713% and 0.0269%, respectively). The curves for Articles 2-3 (4.6322%) do not always coincide, and there are significant differences (the average difference is 0.02969%). If only such 3-grams are analyzed, it appears that all three articles were more likely written by one author. This is because this letter is one of the most commonly used in the formation of Ukrainian words.

Figure 8: The use of 3-grams starting with the letter a (Article 1: blue, Article 2: red, Article 3: green)

Fig. 9 presents the results of the analysis of the use in Articles 1-3 of 3-grams starting with the letter б (b in English) (their share in Articles 1-3 lies in the range 0.48884-0.77738%). Most often the curves for Articles 1-2 (0.594%) coincide or approach each other, in contrast to Articles 1-3 (0.7072%) and Articles 2-3 (1.1208%). However, the trajectories of the curves of Article 1 and Article 3 often coincide, which would suggest a single author: the average discrepancy is 0.01809%, compared with 0.0261% for Articles 1-2 and 0.02866% for Articles 2-3. If only such (less common) 3-grams are analyzed, it appears that Articles 1-2 were more likely written by one author and Article 3 by another. This is because this letter is rare in the formation of Ukrainian words. Some authors use such words more often out of habit and/or because of the subject matter of their publications (this requires further research).

Figure 9: The use of 3-grams starting with the letter б (Article 1: blue, Article 2: red, Article 3: green)

According to Table 4 and Fig. 10-12, some letters of the Ukrainian language are used very often, and others much less often. For the most frequently used letters, the frequencies of occurrence of 3-grams with such initial letters have almost the same distribution (top values in the graph of Fig. 12), which is not the case for other letters. Therefore, to determine the degree to which a text belongs to an author, it is advisable to study only the trigrams whose initial letters are less common in texts of the given language (for example, Fig. 12). Thus, for 3-grams starting with the letter є (their share in Articles 1-3 lies in the range 0.2517-0.707%), the curves for Articles 1-2 (0.2508%) most often coincide or approach each other, in contrast to Articles 1-3 (0.6077%) and Articles 2-3 (0.5443%). The trajectories of the curves of Article 1 and Article 2 often coincide, so these articles were most likely written by one author: the average discrepancy is 0.0114%, while for Articles 2-3 (0.02478%) and Articles 1-3 (0.02762%) this value is more than twice as high.

Figure 12: The use of 3-grams starting with the letter є (Article 1: blue, Article 2: red, Article 3: green)

Table 4 shows the frequencies of appearance of letters in the standard and the studied passages. Fig. 14 shows histograms of the relative frequency of n-grams in Articles 1-3. Low-frequency (noise) values are the most common and form the main volume of the data; they can be ignored (Fig. 15). All graphs of the frequency distribution of 3-grams in the articles show a clearly noticeable division of 3-grams into rarely used (noise-like) values and widely used peak values. The examples of the three articles thus show that, to reduce the amount of information analyzed, it is desirable to analyze the distribution function only above a certain frequency threshold, which still covers the main information content. To compare the distribution functions of the three studied articles, it is necessary to compare clearly expressed average values. After analyzing the most commonly used 3-grams, we conclude that they are determined by the stylistics or grammar of the Ukrainian language and are not relevant for determining the specific topic of the articles. The most used 3-grams in Article:

Results

In the algorithmic approach, the trend is obtained by means of various algorithms that implement smoothing procedures in practice. These procedures provide the researcher only with an algorithm for calculating the new value of the time series at any given time t. The methods can be classified as follows: simple (ordinary) moving average (Fig. 16), weighted moving average, exponential smoothing and median smoothing. In this part of the work, the relative frequency of use of 3-grams in the three texts has been smoothed by the moving average, exponential smoothing and median smoothing methods.
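Two of the procedures named above, exponential and median smoothing, can be sketched in their textbook form. The paper does not state its exact variants or parameters, so the smoothing factor `alpha`, the centered odd window `w` and the handling of the series edges (values without a full window are dropped) are assumptions of this sketch.

```python
from statistics import median


def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s_0 = x_0, s_k = alpha*x_k + (1-alpha)*s_{k-1}."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    smoothed = []
    for x in series:
        if not smoothed:
            smoothed.append(float(x))  # the first value starts the recursion
        else:
            smoothed.append(alpha * x + (1.0 - alpha) * smoothed[-1])
    return smoothed


def median_smoothing(series, w):
    """Replace each value by the median of a centered window of odd size w."""
    if w < 1 or w % 2 == 0:
        raise ValueError("window size w must be a positive odd number")
    half = w // 2
    return [median(series[i - half:i + half + 1])
            for i in range(half, len(series) - half)]
```

Median smoothing removes isolated peaks entirely, whereas exponential smoothing only damps them, which matters for frequency data dominated by a few high values.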

The moving average method is one of the oldest known methods of smoothing a time series. It is based on replacing the initial values of the series with their average values over a time interval whose length is selected in advance. The selected interval slides along the series. Moving averages can smooth out both random and periodic fluctuations and reveal existing trends in the process, and therefore serve as an important tool for filtering time series components. The moving average method estimates the average level over a period of time. The longer the interval over which the average is taken, the smoother the level will be, but the less accurately the trend of the original time series will be described. In all figures, the gray graph shows the initial relative frequency data and the red graph shows the smoothed relative frequency data.
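A minimal sketch of the sliding-interval average is given below. It assumes a centered window of odd size w and leaves the edge points, where a full window does not fit, unchanged; the source does not say how the edges were treated, so that choice is an assumption.

```python
def moving_average(series, w):
    """Centered simple moving average with an odd window size w.

    Points closer than w // 2 to either end are copied unchanged
    (an assumption; other edge treatments are possible).
    """
    if w < 1 or w % 2 == 0:
        raise ValueError("w must be a positive odd integer")
    half = w // 2
    out = [float(x) for x in series]
    for i in range(half, len(series) - half):
        out[i] = sum(series[i - half:i + half + 1]) / w
    return out


print(moving_average([3, 0, 3, 0, 3], 3))
```

Increasing w averages over more neighbours, which flattens the curve further at the cost of detail.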

At small values of the smoothing interval size w, the smoothing effect is not very strong, as can be seen in Figures 16-18 for Article 1, where the data are smoothed with interval sizes w = 3, 5, 7, 9, 11, 13, 15. Next, the data are smoothed with the interval size w = 3 (Fig. 19), the result is smoothed again with w = 5, then with w = 7, and so on up to w = 15 (Fig. 20-21). At small values of w the smoothing effect is again weaker, which can be seen in the figures. Smoothing applied to already smoothed series smooths the data very effectively.

For Article 2, the data are smoothed with the interval sizes w = 3, 5, 7, 9, 11, 13, 15; here the moving average shows the trend in the interval better than for Article 1 (Fig. 22-24). The data for Article 2 are then smoothed with w = 3 (Fig. 25-27), the result is smoothed again with w = 5, then with w = 7, and so on up to w = 15.

The same successive procedure is applied starting from w = 3 (Fig. 31): the smoothed data are smoothed again with w = 5, then with w = 7, and so on up to w = 15 (Fig. 32-33).

For the 3-gram frequency data, exponential smoothing did not produce a "delay" for any of the three texts. Exponential smoothing softened these data only slightly, and the general trend is harder to see. The correlation coefficients of the data are also very low (Fig. 34-42).

Median smoothing.
In this case, the same smoothing interval sizes and the same operations are used as in paragraph 1. A characteristic feature of median smoothing is that it leaves monotonic parts of the data sequence and sharp changes unchanged, while for nonmonotonic parts within the sliding smoothing interval it keeps only a centered value equal to their median, i.e. it effectively eliminates those levels that violate monotonicity. Median smoothing completely eliminates single extreme or anomalous levels that are separated from each other by at least half the smoothing interval, preserves sharp changes in trend (moving average and exponential smoothing blur them), and effectively eliminates single levels with very large or very small values that are random and stand out sharply among the other levels. These characteristics were confirmed by the median smoothing of the relative frequency in Article 1. [...] lower than the moving average. The data are smoothed with the interval sizes w = 3, 5, 7, 9, 11, 13, 15 (Fig. 43-45). The graphs are arranged in the following order: median w = 3 (for the intervals 0-700, 700-1400, 1400-2148), median w = 5, 7, 9, 11, 13, and median w = 15 (for the intervals 0-700, 700-1400, 1400-2148). Then the data are smoothed with w = 3, the result is smoothed again with w = 5, then with w = 7, and so on up to w = 15 (Fig. 46-48).
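The running median and the successive smoothing with growing intervals (w = 3, then 5, and so on up to 15) can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: the window is centered, edge points are left unchanged, and the helper name `cascade` is hypothetical.

```python
from statistics import median


def median_smoothing(series, w):
    """Centered running median with an odd window size w.

    Points closer than w // 2 to either end are copied unchanged (assumption).
    """
    if w < 1 or w % 2 == 0:
        raise ValueError("w must be a positive odd integer")
    half = w // 2
    out = list(series)
    for i in range(half, len(series) - half):
        out[i] = median(series[i - half:i + half + 1])
    return out


def cascade(series, smoother, widths=(3, 5, 7, 9, 11, 13, 15)):
    """Apply `smoother` repeatedly, feeding each result into the next width."""
    result = list(series)
    for w in widths:
        result = smoother(result, w)
    return result


data = [1, 1, 9, 1, 1, 5, 5, 5]  # one isolated spike, then a sharp step
print(median_smoothing(data, 3))
print(cascade(data, median_smoothing, widths=(3, 5)))
```

In this toy series the isolated spike is removed while the step from 1 to 5 stays sharp, which is the behaviour described above. Any smoother with the same `(series, w)` signature can be passed to `cascade`.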

Discussions

A graphical representation of the relationship between two studied sequences is called a correlation field or scatter plot. The graphical method gives a visual impression of the form of the relationship between these sequences. Correlation fields are therefore constructed for Articles 1 and 2 (Fig. 49), Articles 1 and 3 (Fig. 50), and Articles 2 and 3 (Fig. 51). Assessing the nature of the relationship visually, it can be stated that there is a linear relationship in all three fields. The fields also show that a correlation is present, so we can assume that these Ukrainian articles may have been written by one author or are based on one topic. Visual assessment alone is not enough, however, so the value of the correlation coefficient is calculated for more accurate results. The correlation coefficient characterizes the degree of closeness of the linear dependence. The calculated values are: for Articles 1 and 2, a correlation coefficient of 0.575 (coefficient of determination 33%); for Articles 1 and 3, 0.63023 (40%); for Articles 2 and 3, 0.49038 (24%). Correlation coefficients whose modulus is less than 0.7 but greater than 0.5 indicate a relationship of medium strength (the coefficients of determination are less than 50% but more than 25%). In the first two cases the relationship is of medium strength, and in the third case it is weak but very close to medium, so it can also be treated as medium. With three different Ukrainian articles, a 100% correlation is obviously unlikely. Given the medium-strength relationship, the assumption that these articles may have been written by the same author or are based on similar topics has been confirmed.
When the pairwise statistical dependence deviates from a linear one, the correlation coefficient loses its meaning as a measure of the closeness of the relationship. In that case, the correlation ratio is used as the measure of association. Since there is a linear relationship between the pair of studied features, the correlation ratio does not need to be calculated.
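The correlation coefficient and the coefficient of determination discussed above can be sketched as follows. The Pearson formula is standard; the verbal scale follows the thresholds quoted in the text, while the label "strong" for values of 0.7 and above and the treatment of exactly 0.5 as medium are assumptions.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def strength(r):
    """Verbal scale: below 0.5 weak, below 0.7 medium, otherwise strong (assumed)."""
    magnitude = abs(r)
    if magnitude < 0.5:
        return "weak"
    if magnitude < 0.7:
        return "medium"
    return "strong"


# Coefficients reported in the text for the three pairs of articles.
for pair, r in {"1-2": 0.575, "1-3": 0.63023, "2-3": 0.49038}.items():
    print(pair, strength(r), f"R^2 = {r * r:.0%}")
```

Squaring the three reported coefficients reproduces the coefficients of determination given in the text (33%, 40% and 24%).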

The autocorrelation function is the correlation of a function with a copy of itself shifted by a certain amount of the independent variable. Autocorrelation is used to find patterns in a data series, such as periodicity.

The graph of the autocorrelation function is also called the correlogram (Fig. 52).

Figure 35: Correlogram

Fig. 52 shows that the studied series are not stationary, since for a stationary time series the autocorrelation function should decrease rapidly after the first few values.
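The values plotted in a correlogram can be computed with a short sketch. It uses the common sample estimator normalized by the lag-0 sum of squares; the source does not say which estimator was used, so this choice is an assumption.

```python
def autocorrelation(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag.

    r[k] = sum_t (x[t] - m) * (x[t + k] - m) / sum_t (x[t] - m) ** 2,
    where m is the mean of the whole series.
    """
    n = len(series)
    if not 0 <= max_lag < n:
        raise ValueError("max_lag must satisfy 0 <= max_lag < len(series)")
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]


# A strictly alternating series: the sign of the autocorrelation flips with each lag.
print(autocorrelation([1, -1, 1, -1, 1, -1], 2))
```

Values that stay large over many lags, instead of dropping quickly towards zero, are the sign of non-stationarity referred to above.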

The sequence of relative frequencies for Article 1 is divided into three equal parts of 715 values each. For convenience, the data are placed in a separate table (Fig. 53). A correlation matrix is a square table in which the correlation coefficient between two parameters is located at the intersection of the corresponding row and column. The correlation matrix for the column divided into three parts has been constructed. Correlation coefficients whose absolute value (modulus) is less than 0.5 indicate a weak relationship. In the correlation matrix all values are close to 0, so we can conclude that there is no relationship at all. This is quite an expected result, as the data do not depend on each other and have different values. We then find the coefficients of multiple correlation (Fig. 54-55).
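Splitting one sequence into three equal parts and building their correlation matrix can be sketched as follows. The helper names are hypothetical, and dropping the leftover values at the end of the sequence is an assumption; with the 2147 values reported in Table 5, integer division gives three parts of 715 values, which matches the text.

```python
from math import sqrt


def split_equal(series, parts=3):
    """Split into `parts` consecutive chunks of equal length.

    Leftover values at the end of the series are dropped (assumption).
    """
    size = len(series) // parts
    return [series[i * size:(i + 1) * size] for i in range(parts)]


def correlation(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Square table: entry [i][j] is the correlation of column i with column j."""
    return [[correlation(a, b) for b in columns] for a in columns]


columns = split_equal([1, 2, 4, 2, 1, 0, 5, 7, 6], 3)  # toy data, not the paper's
for row in correlation_matrix(columns):
    print([round(value, 3) for value in row])
```

The diagonal is always 1 and the matrix is symmetric, so only the off-diagonal entries carry information about how the parts relate to each other.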
We then compare the frequencies of all trigrams that begin with a particular letter (Fig. 56). According to these graphs, Article 1 and Article 2 are the most likely to have been written by one author; Article 1 and Article 3 could also appear to have been written by one author, although this is not actually the case. Articles 2 and 3 were definitely written by different authors. Applying linguistic and statistical analysis of 3-grams to a set of articles makes it possible to form a subset of linguistically similar publications. Imposing additional conditions on this subset in the form of statistical and quantitative analyses (sets of keywords, stable phrases, stylistic and linguometric characteristics, etc.) will significantly reduce this subset and refine the list of works most likely to belong to the author.

Thus, analysis of the content and frequency of occurrence of business words alone will place Articles 1 and 3 in different subsets and Articles 1 and 2 in the same subset.

This study does not fully address the problem of identifying the author, because differences in authorial traits are subjective and depend on the constraints imposed on the author's creative process. Nevertheless, a system that implements such methods is able to give recommendations on the degree to which a text belongs to a particular author. Further experimental research is needed to test the proposed method of determining the author's style on other categories of texts, such as scientific humanities, art, journalism and others.

Conclusions

The article presents completed research in the field of information technology, in the areas of computational linguistics, artificial intelligence and machine learning. A correlation analysis of text author identification results based on n-grams in Ukrainian scientific and technical texts has been carried out. Three articles have been compared and the results obtained. Quantitative content analysis of scientific and technical texts has been studied, given that authorship determination systems typically rely on plagiarism and rewrite metrics to identify full or partial borrowing. The article presents a method of determining the author, decomposed on the basis of the analysis of such speech coefficients as lexical diversity, degree of syntactic complexity, speech coherence, and the indices of exclusivity and concentration of the text. The parameters of the author's style, such as the numbers of words, sentences, prepositions and conjunctions and the number of words with given frequencies, have also been analyzed. Smoothing procedures are widely used in the algorithmic approach, so the relative frequency of use of 3-grams in the studied texts has been smoothed by the moving average, exponential and median smoothing methods. It is proposed to analyze the reference text in several stages for high-quality and effective content analysis when determining the degree of text authorship. To achieve the research goal, a system with the ability to select the language or languages of the analyzed content has been developed and implemented on the Victana Web-resource. To compare texts with each other, each text must be associated with a numerical characteristic that is close for texts by the same author and differs between works of different authors; the density of the distribution function of letter combinations of three consecutive characters is used for this purpose.
Thus, the rapid spread of text documents in electronic form has made automatic methods of content analysis important, including the classification and clustering of documents by various criteria.

Figure 2: The relative frequencies of occurrence of the ten most frequent symbols in Standard 1 and the studied Excerpts 1, 2, including omission

Figure 11: The generalized results are deduced:
• total N-grams found (with repetitions)
• total N-grams found (without repetitions)
• total N-grams
• number of characters in the fully cleaned text
• number of characters in the text with spaces
• number of words in the text
• size of the alphabet


Figure 10: Relative frequency for Articles 1-3

Figure 14: Histogram of the relative frequency of N-grams in Articles 1-3

Most used 3-grams (relative frequency):
• Article 1: ння [nnya] 0.008476, енн [enn] 0.007175, ого [oho] 0.005473.
• Article 2: ння [nnya] 0.006448, ист [yst] 0.006356, ува [uva] 0.006233.
• Article 3: ння [nnya] 0.008769, ого [oho] 0.007717, мет [met] 0.006314.

Figure 2: Moving Average of Article 1 for w=3 for the intervals 0-700, 700-1400 and 1400-2100

Figure 4: Moving Average of Article 1 for w=15 for the intervals 0-700, 700-1400 and 1400-2100

Figure 5: Moving Average of Article 1 for w=3 for the intervals 0-700, 700-1400 and 1400-2100

Figure 6: Moving Average of Article 1 for w = 5, 7, 9, 11, 13

Figure 7: Moving Average of Article 1 for w=3 for the intervals 0-700, 700-1400 and 1400-2100

Figure 8: Moving Average of Article 2 for w=3 for the intervals 0-700, 700-1400 and 1400-2100

Figure 9: Moving Average of Article 2 for w = 5, 7, 9, 11, 13

Figure 11: Moving Average of Article 2 for w=3 for the intervals 0-700, 700-1400 and 1400-2100

Figure 13: Moving Average of Article 2 for w=3 for the intervals 0-700, 700-1400 and 1400-2100

Figure 14: Moving Average of Article 3 for w=3 for the intervals 0-700, 700-1400 and 1400-2100

Figure 15: Moving Average of Article 3 for w = 5, 7, 9, 11, 13

Figure 17: Moving Average of Article 3 for w=3 for the intervals 0-700, 700-1400 and 1400-2100

Figure 19: Moving Average of Article 3 for w=3 for the intervals 0-700, 700-1400 and 1400-2100

Figure 20: Exponential smoothing a=0.1 of Article 1 for the intervals 0-700, 700-1400, 1400-2148

Figure 23: Exponential smoothing a=0.1 of Article 2 for the intervals 0-700, 700-1400, 1400-2148

Figure 26: Exponential smoothing a=0.1 of Article 3 for the intervals 0-700, 700-1400, 1400-2148

Figure 29: Median smoothing w = 3 of Article 1 for the intervals 0-700, 700-1400, 1400-2148

Figure 32: Median smoothing w = 3 of Article 1 for the intervals 0-700, 700-1400, 1400-2148

Figure 34: Median smoothing w = 15 of Article 1 for the intervals 0-700, 700-1400, 1400-2148

Figure 49: Correlation field for Articles 1 and 2

Figure 53: The column divided into 3 equal parts and the correlation matrix

Figure 36: Autocorrelation

Figure 56: The use of 3-grams that start with a specific letter (Article 1: blue, Article 2: red, Article 3: green)

[Plot: initial letters г to я on the horizontal axis; series ||p1-p2||, ||p1-p3||, ||p2-p3||]

Table 1: Frequencies of letters appearance in the standard and the studied passages

| Letter | Frequency of use of Ukrainian language letters | Absolute frequency of letters in Passage 1 | Absolute frequency of letters in Passage 2 | Relative frequency of letters in Passage 1 | Relative frequency of letters in Passage 2 |
|---|---|---|---|---|---|
| ф | 0.003 | 1 | 0 | 0.00 | 0.00 |
| щ | 0.004 | 3 | 1 | 0.01 | 0.00 |

Table 2: Example of author speech coefficients calculations

| Coefficient | Incoming data | Calculation |
|---|---|---|
| Lexical diversity: Kl=W/N | W=184; N=295 | Kl=0.6237 |
| Speech connectivity: Kz=(Z+S)/(3*P) | Z=20; S=28; P=18 | Kz=0.8889 |
| Syntactic complexity: Ks=1-P/W | P=18; W=184 | Ks=0.9022 |
| Concentration index: Ikt=W10/W | W10=2; W=184 | Ikt=0.0109 |
| Exclusivity index: Iwt=W1/W | W1=141; W=184 | Iwt=0.7663 |
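The calculations in Table 2 can be reproduced with a few lines of code. The formulas and input values are taken from the table; the function name is hypothetical, and the meanings given to the symbols in the comments are presumed from the abstract rather than stated in the table.

```python
def speech_coefficients(W, N, Z, S, P, W1, W10):
    """Author speech coefficients using the formulas of Table 2.

    Presumed meanings (assumptions, not defined in the table):
    W   - number of distinct words, N - total number of words,
    Z, S - numbers of conjunctions and prepositions,
    P   - number of sentences,
    W1  - number of words with frequency 1,
    W10 - number of words with frequency 10 or more.
    """
    return {
        "Kl": W / N,                # lexical diversity
        "Kz": (Z + S) / (3 * P),    # speech connectivity
        "Ks": 1 - P / W,            # syntactic complexity
        "Ikt": W10 / W,             # concentration index
        "Iwt": W1 / W,              # exclusivity index
    }


coefficients = speech_coefficients(W=184, N=295, Z=20, S=28, P=18, W1=141, W10=2)
for name, value in coefficients.items():
    print(f"{name} = {value:.4f}")
```

Rounded to four decimal places, the results agree with the values in the Calculation column of Table 2.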

Table 3: Values of parameters for the analyzed Articles 1-3

| Parameters | Article 1 | Article 2 | Article 3 |
|---|---|---|---|
| Total characters in plain text | 29967 | 32570 | 37062 |
| Total characters in the raw text | 39792 | 39663 | 47084 |
| Total words | 5475 | 5358 | 6060 |
| Total N-grams found (with repetitions) | 29494 | 29862 | 36383 |
| Total N-grams found (without repetitions) | 4354 | 4377 | 3890 |
| Total N-grams | 35937 | 35937 | 35937 |

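Table 4: Distribution of frequencies of 1-grams in Articles 1-3

| 1-gram | N1 | P1 | N2 | P2 | N3 | P3 |
|---|---|---|---|---|---|---|
| о | 2824 | 0.094240 | 2472 | 0.075898 | 3870 | 0.103601 |
| н | 2471 | 0.082460 | 2370 | 0.072766 | 2888 | 0.077312 |
| а | 2255 | 0.075252 | 2698 | 0.082837 | 2491 | 0.066685 |
| т | 2102 | 0.070146 | 1956 | 0.060055 | 2141 | 0.057315 |
| і | 1789 | 0.059701 | 1967 | 0.060393 | 2250 | 0.060233 |
| и | 1732 | 0.057799 | 1852 | 0.056862 | 2036 | 0.054504 |
| в | 1654 | 0.055196 | 1590 | 0.048818 | 1915 | 0.051265 |
| с | 1549 | 0.051692 | 1327 | 0.040743 | 1384 | 0.037050 |
| е | 1404 | 0.046853 | 1453 | 0.044612 | 2090 | 0.055950 |
| р | 1335 | 0.044550 | 1722 | 0.052871 | 1893 | 0.050676 |
| к | 1279 | 0.042682 | 1110 | 0.034080 | 1453 | 0.038897 |
| л | 1116 | 0.037242 | 927 | 0.028462 | 906 | 0.024254 |
| у | 987 | 0.032937 | 960 | 0.029475 | 1195 | 0.031990 |
| д | 859 | 0.028666 | 939 | 0.028830 | 1319 | 0.035310 |
| м | 808 | 0.026964 | 976 | 0.029966 | 1399 | 0.037451 |
| п | 647 | 0.021591 | 825 | 0.025330 | 1138 | 0.030464 |
| я | 647 | 0.021591 | 681 | 0.020909 | 864 | 0.023129 |
| з | 623 | 0.020790 | 644 | 0.019773 | 946 | 0.025325 |
| ь | 498 | 0.016619 | 418 | 0.012834 | 613 | 0.016410 |
| ч | 459 | 0.015317 | 289 | 0.008873 | 574 | 0.015366 |
| г | 408 | 0.013615 | 373 | 0.011452 | 651 | 0.017427 |
| х | 355 | 0.011847 | 384 | 0.011790 | 482 | 0.012903 |
| б | 284 | 0.009477 | 569 | 0.017470 | 428 | 0.011458 |
| ж | 246 | 0.008209 | 210 | 0.006448 | 176 | 0.004712 |
| й | 239 | 0.007976 | 260 | 0.007983 | 265 | 0.007094 |
| ц | 224 | 0.007475 | 334 | 0.010255 | 299 | 0.008004 |
| є | 188 | 0.006274 | 165 | 0.005066 | 347 | 0.009289 |
| ф | 179 | 0.005973 | 209 | 0.006417 | 137 | 0.003668 |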

Table 5: Frequencies of letters appearance in the standard and the studied passages

| Statistic | Article 1 | Article 2 | Article 3 |
|---|---|---|---|
| Mean | 0.000366529 | 0.000339199 | 0.000392978 |
| Standard error | 1.28793E-05 | 1.24565E-05 | 1.53165E-05 |
| Median | 0.000167 | 0.000154 | 0.000162 |
| Mode | 0.000033 | 0.000031 | 0.000027 |
| Standard deviation | 0.000596773 | 0.00057718 | 0.000709699 |
| Sample variance | 3.56138E-07 | 3.33136E-07 | 5.03673E-07 |
| Kurtosis | 37.42530062 | 32.63050249 | 29.5089837 |
| Skewness | 4.881688545 | 4.62453506 | 4.54877741 |
| Range | 0.008443 | 0.006417 | 0.008742 |
| Minimum | 0.000033 | 0.000031 | 0.000027 |
| Maximum | 0.008476 | 0.006448 | 0.008769 |
| Sum | 0.786938 | 0.72826 | 0.843723 |
| Count | 2147 | 2147 | 2147 |
| Confidence level (95.0%) | 2.52573E-05 | 2.4428E-05 | 3.00366E-05 |
