=Paper=
{{Paper
|id=Vol-3396/paper26
|storemode=property
|title=Automated Identification of Authorial Styles
|pdfUrl=https://ceur-ws.org/Vol-3396/paper26.pdf
|volume=Vol-3396
|authors=Iryna Khomytska,Vasyl Teslyuk,Iryna Bazylevych,Iryna Karamysheva
|dblpUrl=https://dblp.org/rec/conf/colins/KhomytskaTBK23
}}
==Automated Identification of Authorial Styles==
Iryna Khomytska 1, Vasyl Teslyuk 1, Iryna Bazylevych 2 and Iryna Karamysheva 1

1 Lviv Polytechnic National University, Lviv, 79013, Ukraine

2 Ivan Franko National University of Lviv, Lviv, 79000, Ukraine

===Abstract===
Improving the software for authorial style identification is a topical problem that requires new approaches. The proposed approach uses efficient classical and machine learning methods which ensure reliable data with a test validity of 95%. These methods are the chi-square test and the discriminant analysis. The methods have been applied on the level of letters of the Cyrillic alphabet, which proved to be appropriate for an author identification task. Typical statistical characteristics have been established for Ukrainian authors’ styles. With the help of these characteristics, the author can be identified. The proposed structure of the software system is the novelty of the research. The developed software system is based on a modular principle. The author identification algorithm is implemented in the Python programming language. The level of automation is high.

Keywords: Chi-square test, Discriminant analysis, Ukrainian authors’ styles, Author identification, Modular principle of software system.

===1. Introduction===
Information technologies for text differentiation and author identification have been widely used recently. These technologies are aimed at establishing the characteristics typical of a certain author. However, the authorial features never occur alone, separated from the other text features. Different text features related to a functional style, genre and topic are combined and make authorship attribution complex. The problem of separating the authorial features lies at the crux of author identification. 
For every particular text, a certain part of vocabulary is typical of a certain topic and can occur in a text of any author. This vocabulary cannot identify a particular author. Therefore, some specific layer of vocabulary should be identified. If the author’s specificity is clearly expressed, that is an easy case of characterizing the author. If the author’s distinctive features are minimal, it is hard to draw a demarcation line between the general text specificity (functional style, genre) and the authorial specificity. In any case, the vocabulary characteristics of a certain style and topic should be identified as a preparatory stage of the author identification. For the purpose of characterizing the specificity of vocabulary of a certain topic, frequency dictionaries can be compiled. Such dictionaries list the most commonly used words for a particular sphere of communication. The authorial specific vocabulary can be separated from the layer of commonly used words. However, in documents and formal papers, the standards and formalities prevail over a free expression of a thought. Consequently, the authorial features can hardly be noticed. In this case, the text features should be thoroughly studied and viewed from all possible sides. In our research, to avoid the ambiguity caused by the mentioned difficulties, we compare the texts with clearly expressed authorial specificity. These are the texts from emotive prose, which is rich in specific expressive means. Expressive and emotional specificity of authorial styles is reflected in the frequency of occurrence of language units. The texts from emotive prose by Ukrainian writers are researched in this paper. 
The developed program system uses the statistical tests (the chi-square test and the discriminant analysis) which are the most appropriate for the task of authorship attribution on the chosen language level (letters of the Cyrillic alphabet, stop words, punctuation marks, spaces). The purpose of the research is to prove that the chi-square test and the discriminant analysis combined are efficient for the task of author identification. The novelty of the research is the proposed structure of the program system based on a modular principle and a combination of the chi-square test and the discriminant analysis applied in the Ukrainian language.

COLINS-2023: 7th International Conference on Computational Linguistics and Intelligent Systems, April 20–21, 2023, Kharkiv, Ukraine. EMAIL: Iryna.khomytska@ukr.net (I. Khomytska); vasyl.m.teslyuk@lpnu.ua (V. Teslyuk); i_bazylevych@yahoo.com (I. Bazylevych); iryna.d.karamysheva@lpnu.ua (I. Karamysheva). ORCID: 0000-0003-3470-7191 (I. Khomytska); 0000-0002-5974-9310 (V. Teslyuk). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

===2. Related Works===
The difficulties related to separating the author-specific features from the other language features make author identification a hard problem to solve. Different approaches have been tried and a lot of methods have been applied over the last decades. The problem has been studied on nearly all language levels. Nevertheless, a perfect solution has not been found yet and the problem is still topical. In recent research [1 – 4], machine learning methods were applied to recognize the author of a given text. For some classifiers, the classification algorithms recognized the author of a text with an accuracy of 92%. The authors were identified for texts in the Portuguese language. 
The extracted stylometric features (text relevant attributes) suggested that the applied technique was effective in distinguishing the author or the ghost writer of a given text [1]. In our research, a classical approach is used. The significance level is 0.05 and all the results have been obtained with the test validity of 95%. The approaches to authorship attribution comprise the attempts to find the best ways of separating the distinctive authorial style features from the rest of the characteristics of the researched text. Among the most efficient approaches are the following: text distortion for identifying the distinctive features of the authorial style [5]; leveraging the discourse information [6]; the use of orthogonal similarity relations [7]; the use of topic models [8]. Stylistic features of poetry and other styles are often determined on the basis of stylometric analysis [9 – 11]. Authentication of misinformation generated by some dubious sources is a task of great importance. This task was approached with the following machine learning methods: logistic regression and naive Bayes algorithms [12]. In the pre-processing phase, stop words and punctuation marks were removed. The texts were tokenized and stemmed. This way, certain Twitter-specific features were extracted. The highest precision of 91.1% was obtained using the method of logistic regression [12]. The method applied in this paper involves the use of the chi-square test and the discriminant analysis. These two tests have been applied for characterizing the authorial styles of Ukrainian writers. The two tests ensure higher precision than the method of logistic regression. A similarity metric was used to compare the pieces of a text with the most relevant words [13]. According to this approach, the words, corresponding to the nodes, were to be taken into account in order to enhance representation of a text with complex networks. 
The applied method involved constructing a co-occurrence network for a text, obtaining dissimilarity matrices, joining them and analyzing the obtained data with a standard supervised learning algorithm. In most cases, the precision rates were above 90% and the maximum value was 98.75% [13]. Our research ensures a classical level of accuracy (95%). Another attempt to apply the neural network method was made for a sentiment analysis in English newspapers. The method was used as a public opinion influences identification tool [14]. This approach may also be used in an emotion recognition system project of English newspapers. A quantitative approach was used in a textual semantic analysis to highlight some important issues of semantics [15]. Quantitative parameters of linking words in political speeches of Bill Clinton were analyzed using Python [16]. For solving the task of authorship attribution in Arabic tweets, the support vector machine as a supervised learning algorithm was used for classification of relevant text features. The Bag of Words (BOW) approach proved to be efficient. The performance of different classifiers was tested. Different feature sets were created and combined. The combination of feature sets improved the results [17]. As our research has shown, it is recommended to combine a machine learning method with a classical one, as the latter gives more reliable results. The random forest approach, using the WEKA 3.8 tool, was tested for authenticating Arabic poems. This approach was chosen because of a higher accuracy for decision trees. The dataset was tested on the basis of twelve features. The overall precision was 76.4%. The research included four stages: data collecting, data cleansing, feature extracting and classifying. The method was applied on the level of letters, words and word length [18]. 
The twelve linguistic features chosen for that research are quite a sufficient number, and the results might be higher if some powerful classical statistical methods were combined with the NLP methods. The analysis of the related works has shown that, in most cases, for authorship attribution, the machine learning methods give results with an accuracy of 70%–90%. A similarity metric that involves constructing a co-occurrence network for a text ensures a higher accuracy, up to 98.75% [13]. In our research we apply a combination of classical (the chi-square test) and machine learning (the discriminant analysis) methods. The chosen classical level of significance of 5% makes it possible to obtain the results with a precision of 95%.

===3. Methods and Software===
====3.1. The Proposed Combination of Methods====
The mathematical support of this research is based on two statistical tests: a powerful classical method (the chi-square test) and a machine learning method (the discriminant analysis). Combined, these two methods ensure reliability of the results (the first method) and simplicity in use (the second method). Both methods have proved to be efficient on the level of letters of the Cyrillic alphabet. The methods have been tested on the material of texts from Ukrainian emotive prose.

The algorithm for the chi-square test is the following:
1. Prepare samples of 51000 letters for the comparison.
2. Form portions of 1000 letters for the samples that are to be compared.
3. Obtain the values of frequencies of occurrence of letters in each portion.
4. Obtain the values of frequencies of occurrence of letters in each sample.
5. Obtain the values of relative frequencies of occurrence of letters in each portion.
6. Obtain the values of relative frequencies of occurrence of letters in each sample.
7. Use the chi-square test for two compared samples [19, 20].
8. Analyze the obtained results.
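Steps 2 to 7 of this algorithm can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the 33-letter alphabet string and the decision to drop an incomplete final portion are assumptions made for the example. The statistic is computed in the closed form of formula (3) of this section, n(Σ ν_ij² / (n_j ν_i·) − 1), with (s − 1)(k − 1) degrees of freedom.

```python
from collections import Counter

# Assumption: the 33 letters of the modern Ukrainian alphabet; the apostrophe is not counted.
UKRAINIAN_LETTERS = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"


def letters_only(text):
    """Return the lowercase Ukrainian letters of a text, in order."""
    return [ch for ch in text.lower() if ch in UKRAINIAN_LETTERS]


def split_into_portions(letters, portion_size=1000):
    """Split a letter sequence into full portions (an incomplete tail is dropped)."""
    full = len(letters) - len(letters) % portion_size
    return [letters[i:i + portion_size] for i in range(0, full, portion_size)]


def relative_frequencies(letters):
    """Relative frequency of occurrence of each letter in a portion or sample."""
    counts = Counter(letters)
    total = len(letters)
    return {ch: count / total for ch, count in counts.items()}


def chi_square_homogeneity(samples):
    """Chi-square homogeneity statistic for k samples of letters.

    Returns (statistic, degrees_of_freedom). Only letters that occur in at
    least one sample are used as categories.
    """
    counts = [Counter(sample) for sample in samples]
    sizes = [len(sample) for sample in samples]
    n = sum(sizes)
    categories = sorted(set().union(*counts))
    total = 0.0
    for ch in categories:
        row_total = sum(c[ch] for c in counts)  # frequency of the letter over all samples
        for c, n_j in zip(counts, sizes):
            total += c[ch] ** 2 / (n_j * row_total)
    statistic = n * (total - 1.0)
    dof = (len(categories) - 1) * (len(samples) - 1)
    return statistic, dof
```

The homogeneity hypothesis would then be rejected when the statistic exceeds the chi-square quantile of order 0.95 for the returned degrees of freedom, which can be obtained, for example, with `scipy.stats.chi2.ppf(0.95, dof)`.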
Using the chi-square test, we verify the hypothesis <math>H_0</math> that the compared samples are observations of the same random variable (the homogeneity hypothesis). We use the statistic

<math>X_n^2(p) = \sum_{i=1}^{s} \sum_{j=1}^{k} \frac{(\nu_{ij} - n_j p_i)^2}{n_j p_i}.</math> (1)

Here <math>\nu_{ij}</math> is the observed frequency of the i-th linguistic unit in the j-th sample, <math>n_j</math> is the size of the j-th sample, s is the number of distinct units and k is the number of compared samples. To estimate the unknown parameters <math>p_1, \ldots, p_s</math>, we use the maximum likelihood method:

<math>L(p) = c \prod_{i,j} p_i^{\nu_{ij}} = c \prod_{i} p_i^{\nu_{i\cdot}}, \quad \nu_{i\cdot} = \sum_{j=1}^{k} \nu_{ij}.</math> (2)

The method of Lagrange multipliers is employed to obtain the estimates <math>\hat{p}_i</math> of the parameters <math>p_i</math>: <math>\hat{p}_i = \nu_{i\cdot} / n, \; i = 1, \ldots, s</math>, where <math>n = n_1 + \ldots + n_k = \sum_{i,j} \nu_{ij}</math> is the total number of observations. As a result, the formula for the criterion statistic is the following [21, 22]:

<math>X_n^2(\hat{p}) = n \sum_{i=1}^{s} \sum_{j=1}^{k} \frac{(\nu_{ij} - n_j \nu_{i\cdot} / n)^2}{n_j \nu_{i\cdot}} = n \left( \sum_{i=1}^{s} \sum_{j=1}^{k} \frac{\nu_{ij}^2}{n_j \nu_{i\cdot}} - 1 \right).</math> (3)

The hypothesis <math>H_0</math> is rejected if the value t of the statistic (3) satisfies the inequality <math>t > \chi^2_{1-\alpha,\,(s-1)(k-1)}</math>.

The next step is the algorithm for the discriminant analysis:
9. Obtain the mean values of frequencies of occurrence of letters, stop words, punctuation marks and spaces for each sample.
10. Construct the vectors.
11. Write the regression equations for the obtained data.
12. Obtain the coefficients for the regression equations.
13. Employ the formula of Mahalanobis distance [23]:

<math>D^2(x \mid G_k) = (n - g)(x - \bar{x}_k)^T W^{-1} (x - \bar{x}_k), \quad k = 1, \ldots, g,</math> (4)

where <math>G_k</math> stands for a set of authors, x stands for an object having p variables, n is the number of the researched literary works, g is the number of the chosen authors, <math>W^{-1}</math> stands for an inverse covariance matrix, and <math>\bar{x}_k</math> stands for the vector of the mean values of the variables from the k-th group of objects.

====3.2. The Developed Software====
A topical issue of computational linguistics is the development of information technologies for an automated identification of the authorial style. Every automated information system characterized by a certain technology is aimed at transforming the input data into some expected information. 
Therefore, the structuring of the information technology involves the development of a classification and a coding system, an organization of collecting and transferring information and different methods to access the data [24]. The developed information technology processes the researched texts by different authors using chosen statistical methods. The obtained data are statistical characteristics typical of a certain authorial style. These data form the author’s statistical parameters. The Python programming language has been used for automated identification of the authorial style. The developed structure of information technology for automated authorship attribution is shown in Figure 1.

[Figure 1: The structure of information technology for automated authorship attribution. Blocks: a text in Ukrainian; program system for authorship attribution; methods for authorship attribution; data base of statistical parameters of texts; statistical parameters of the authorial styles in Ukrainian.]

The automated author identification has been done using Python. The algorithm of performing the chosen mathematical tests involves standard functions and libraries of Python. The tools of Python were used for work with different protocols. The process of automated identification of the authorial style consists of two main stages: the first stage is preparatory before the statistical calculations, and the second stage is the stage of statistical calculations. At the first stage, we make the following changes: all the letters in uppercase are changed into the letters in lowercase, only one space is left between the words, a space is put at the beginning of a text. Then, we sort the linguistic units. For calculations, we have chosen letters, stop words, punctuation marks and spaces. Differentiation of authorial styles and author identification is done using the chi-square test and the discriminant analysis. 
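The preparatory stage described above can be sketched with Python's standard `re` module. This is an illustrative sketch under stated assumptions, not the code of the developed system: the function names, the removal of trailing whitespace and the exact set of characters treated as punctuation marks are hypothetical choices.

```python
import re
import string

# Assumption: ASCII punctuation plus guillemets, dashes and the ellipsis sign.
PUNCTUATION_MARKS = set(string.punctuation) | set("«»\u2014\u2013\u2026")


def normalize_text(text):
    """Lowercase the text, keep one space between words, put a space at the start."""
    lowered = text.lower()
    single_spaced = re.sub(r"\s+", " ", lowered).strip()
    return " " + single_spaced


def count_units(text):
    """Count letters, spaces and punctuation marks in a normalized text."""
    return {
        "letters": sum(ch.isalpha() for ch in text),
        "spaces": text.count(" "),
        "punctuation": sum(ch in PUNCTUATION_MARKS for ch in text),
    }
```

For example, `normalize_text("  Садок   вишневий,\nколо ХАТИ. ")` yields `" садок вишневий, коло хати."`, on which `count_units` reports 21 letters, 4 spaces and 2 punctuation marks.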
The algorithm of the software system functioning includes the following steps: text files uploading, sample formation, sample division into portions, calculations of frequencies of occurrence of linguistic units in each portion and sample, application of the chi-square test and the discriminant analysis and analysis of the data obtained. The algorithm is presented in Figure 2. The developed structure of the software system for author identification is shown in Figure 3. The program is based on a modular principle. The main modules are: a module of file opening, a module of sample setting, a module of text analysis, applying the chi-square test and the discriminant analysis, a module of results visualization, a module of data storing. The module “data storing” gives access to the database. The module “interface” ensures a connection between the user and the software system. The interface is written with the help of the PyQt5 library, which comes with Qt Designer. One of the largest PyQt5 modules is QtWidgets, which provides tables, lists and other means of results visualization. Quick visualization relies on NumPy and the Qt Graphics View Framework. For more efficient work, a module MainWindowUI has been developed. It shows the program main window, imports all the subsequent modules and a file containing the code of the program interface. Modules StatisticLetters, StatisticStopWords, StatisticPunctuationMarks, StatisticSpaces are involved in the statistical analysis of letters, stop words, punctuation marks and spaces. Module StatisticTests is responsible for text differentiation and author identification by the chi-square test and the discriminant analysis. Module re is responsible for editing a text before processing. Module GraphCreate has functions of graphical presentation of the obtained data. In every tab of the interface, there are options for building the graphs that show the results of the statistical analysis [24].

===4. Results of the Study===
The authorship attribution has been done on the material of Ukrainian emotive prose. The statistical parameters of the literary works by I. Franko, O. Honchar and L. Hlibov have been obtained by the chi-square test and the discriminant analysis. The chi-square test has been performed on the level of letters of the Cyrillic alphabet. The relative frequencies of occurrence of letters have been calculated as a stage of the algorithm of the chi-square test. The highest values of the relative frequencies of occurrence of letters for the literary works by L. Hlibov (Text 1, Text 2) are given in Table 1. The values show that letters А, О, В are the most frequently used in this sample. The results of the chi-square test (given below) confirm that the literary works by L. Hlibov are written by one author, as the homogeneity hypothesis is not rejected. This shows that the chi-square test is powerful for identifying authorial styles. Application of the method of discriminant analysis allowed us to identify the authorial styles of I. Franko, O. Honchar and L. Hlibov. The method is efficient on the level of letters, stop words, punctuation marks and spaces. The results of calculations of the average number of the mentioned linguistic units are presented in Figure 4. The analysis of the average number of words and punctuation marks shows that the number of punctuation marks may be considerably greater even if the number of words is not much greater. This is the case with L. Hlibov’s literary works: for the average number of words of 34.46, the average number of punctuation marks is 6.53 (in one literary work), and for the average number of words of 37.21, the average number of punctuation marks is 10.07 (in another literary work). This may be considered a characteristic feature of L. Hlibov’s manner of writing. 
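The comparison of the two works by L. Hlibov can be made explicit with a short calculation. The sketch below only restates the averages reported above; the variable and function names are illustrative assumptions.

```python
# Averages reported in the text for two literary works by L. Hlibov.
works = [
    {"words": 34.46, "punctuation": 6.53},
    {"words": 37.21, "punctuation": 10.07},
]


def punctuation_per_word(work):
    """Average number of punctuation marks per word in one literary work."""
    return work["punctuation"] / work["words"]


densities = [punctuation_per_word(work) for work in works]
word_growth = works[1]["words"] / works[0]["words"]
punctuation_growth = works[1]["punctuation"] / works[0]["punctuation"]
```

The punctuation density is roughly 0.19 marks per word in the first work and 0.27 in the second: the average number of words grows by about 8%, while the average number of punctuation marks grows by about 54%.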
[Figure 2: A flow chart of the algorithm for author identification by the chi-square test and the discriminant analysis. Steps: start; download Ukrainian texts; form samples, changing uppercase to lowercase; divide samples into portions; calculate relative frequencies of occurrence of letters for portions and samples; perform the chi-square test; calculate mean frequencies of occurrence of letters, stop words, punctuation marks, spaces for portions and samples; build vectors; write regression equations; calculate coefficients for the regression equations; determine the distance between the authors using the formula of Mahalanobis distance; compare the results obtained by the two methods; end.]

[Figure 3: Graphical presentation of the structure diagram of the software system for author identification. Blocks: construction of statistical profiles of authorial styles; file opening; sample setting; text analysis; interface; results visualization; data storing; data base.]

In Figure 5, we see the results of the discriminant analysis determined by the squared Mahalanobis distances. The distances between the literary works of the same author are small compared with the distances between the literary works of different authors. For the literary works by I. Franko, we have obtained the distances: 3.84, 1.64, 3.05, 3.09; for the works by O. Honchar: 0.60, 3.04, 2.36, 4.83; for the works by L. Hlibov: 2.84, 4.51, 2.78, 3.37. If we compare the literary works by I. Franko and L. Hlibov, the distance is much greater: 89.51. Consequently, the discriminant analysis is an efficient method for authorship attribution. 
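The squared Mahalanobis distance of formula (4) can be sketched in pure Python as follows. This is a hedged illustration, not the authors' code: all function names are hypothetical, and W is assumed here to be the pooled within-group scatter matrix, so that W / (n − g) is the pooled covariance matrix. The source describes W only as a covariance matrix, so this reading is an assumption.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equally long feature vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def pooled_within_scatter(groups):
    """Sum over all groups of (x - mean_k)(x - mean_k)^T (assumed meaning of W)."""
    p = len(groups[0][0])
    scatter = [[0.0] * p for _ in range(p)]
    for rows in groups:
        center = mean_vector(rows)
        for x in rows:
            d = [xi - ci for xi, ci in zip(x, center)]
            for a in range(p):
                for b in range(p):
                    scatter[a][b] += d[a] * d[b]
    return scatter


def solve(matrix, rhs):
    """Solve matrix * y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("scatter matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * y[c] for c in range(r + 1, n))
        y[r] = (aug[r][n] - tail) / aug[r][r]
    return y


def squared_mahalanobis(x, groups, k):
    """D^2(x | G_k) = (n - g) (x - mean_k)^T W^{-1} (x - mean_k)."""
    n = sum(len(rows) for rows in groups)  # number of literary works
    g = len(groups)                        # number of authors
    d = [xi - mi for xi, mi in zip(x, mean_vector(groups[k]))]
    y = solve(pooled_within_scatter(groups), d)  # y = W^{-1} d without inverting W
    return (n - g) * sum(di * yi for di, yi in zip(d, y))
```

In this sketch each group holds the feature vectors of one author's works (for instance, mean frequencies of letters, stop words, punctuation marks and spaces), and a work is attributed to the author whose group gives the smallest distance.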
{| class="wikitable"
|+ Table 1: The highest values of relative frequencies of occurrence of letters
! Letter !! Text 1 !! Text 2
|-
| А || 10% || 7%
|-
| В || 6% || 8%
|-
| Е || 5% || 8%
|-
| И || 6% || 5%
|-
| І || 5% || 5%
|-
| К || 3% || 5%
|-
| Л || 5% || 3%
|-
| Н || 5% || 5%
|-
| О || 8% || 7%
|-
| П || 4% || 3%
|-
| Р || 4% || 4%
|-
| С || 4% || 4%
|-
| Т || 6% || 5%
|-
| У || 3% || 5%
|}

The analysis of the obtained results shows that the chi-square test combined with the discriminant analysis is an efficient combination for characterizing the authorial styles and performing the author identification.

[Figure 4: The average numbers of letters, stop words, punctuation marks and spaces]

[Figure 5: The distances between literary works by the researched authors]

===5. Discussions===
This research is a continuation of testing the classical and machine learning methods for efficiency in authorship attribution. In our earlier research, the statistical tests were tested on different language levels (phonological, lexical, syntactic) in two languages, English and Ukrainian. Compared with our previous research, we can state that the combination of the classical statistical method (the chi-square test) and the machine learning method (the discriminant analysis) is efficient for author identification in the Ukrainian language.

In our earlier research, we tested the chi-square test in a combination with other classical methods: the Student’s t-test and the Kolmogorov-Smirnov test. In this combination, it was more powerful than the Student’s t-test, but less powerful than the Kolmogorov-Smirnov test on the phonological level. In this research, the chi-square test is applied on the levels of letters and words, showing good results. The previously applied classical methods, the Lehmann-Rosenblatt test and the Wilcoxon test, were tested on the levels of phonemes and word length. These methods were less powerful than the chi-square test. For the mentioned methods, the level of test validity was 95%. 
The machine learning methods – the data clustering method and the method of discriminant analysis were previously tested in a combination with the chi-square test and the Student’s t-test on the levels of words and phonemes. In this combination, the classical methods were more powerful [25]. According to the results of our earlier research, the method of discriminant analysis is more powerful than the method of data clustering. In this research, the method of discriminant analysis has shown good results obtained with the help of the squared Mahalanobis distances. The distances between the researched literary works by one author are small. This proves that the works have similar linguistic characteristics, typical of a certain authorial style. The distances for the works by I. Franko are: 3.84, 1.64, 3.05, 3.09; for the works by O. Honchar: 0.60, 3.04, 2.36, 4.83; for the works by L. Hlibov: 2.84, 4.51, 2.78, 3.37. Consequently, the method of discriminant analysis is rightly chosen for the language levels of letters and words, as it has given good results and solved the task of author identification. In our research, we have developed a structure of the software system for author identification. The program is based on a modular principle, which allows us to quickly modify the program. The structure of the software system includes the following modules: a module of file opening, a module of sample setting, a module of text analysis, applying the chi-square test and the discriminant analysis, a module of results visualization, a module of data storing. To improve the efficiency of the work, a module MainWindowUI has been developed. It imports all next modules and a file containing a code of the program interface. For the statistical analysis of letters, stop words, punctuation marks and spaces, the modules StatisticLetters, StatisticStopWords, StatisticPunctuationMarks, StatisticSpaces have been developed. 
The module StatisticTests is used for text differentiation and author identification by the chi-square test and the discriminant analysis. The developed software system ensures quick and efficient work. The data obtained are reliable and can be used in our further research. We consider it expedient to use the combination of the chi-square test and the method of discriminant analysis for authorship attribution on other language levels and in other languages. The level of test validity for the chi-square test is high (95%). It is recommended to apply this test in a combination with other machine learning methods.

===6. Conclusions===
The purpose of the research has been achieved: the efficiency of the chi-square test and the discriminant analysis has been proved on the levels of letters, stop words, punctuation marks and spaces in Ukrainian. The novel approach of the research consists in application of the developed structure of the software system based on a modular principle. The modular principle allows us to quickly modify the program system. The module StatisticTests, based on the use of the chi-square test and the discriminant analysis, has been applied for text differentiation and author identification. The results obtained by the chi-square homogeneity test with a test validity of 95% show that the authorship of I. Franko, O. Honchar and L. Hlibov has been established for the researched literary works. For the comparisons of the literary works by each of the mentioned Ukrainian writers, the homogeneity hypothesis has not been rejected. This means for each author that the literary works were written by the same author. Therefore, for author identification, it is expedient to use the chi-square test, either alone or in a combination with other classical or machine learning methods. The combination of the chi-square test with the discriminant analysis in this research has given good results. 
For the discriminant analysis, using the squared Mahalanobis distances, we have obtained the distances between the literary works by I. Franko, O. Honchar and L. Hlibov. The distances are small: for the literary works by I. Franko, the distances are: 3.84, 1.64, 3.05, 3.09; for the works by O. Honchar – 0.60, 3.04, 2.36, 4.83 and for the works by L. Hlibov – 2.84, 4.51, 2.78, 3.37. The established small distances testify that the researched literary works reflect the same authorial style, the linguistic features of the same manner of writing. Consequently, the discriminant analysis is an efficient method for author identification. In the developed software system, standard functions and libraries of Python were used in the algorithm of performing the chosen mathematical tests. The Python tools were employed for different protocols. Two main stages of the process of automated identification of the authorial style included: the preparatory stage before the statistical calculations, and the stage of the statistical calculations. The following changes were made on the first stage: all the letters in uppercase were changed into the letters in lowercase, only one space was left between the words, a space was put at the beginning of a text. Then, all the linguistic units were sorted. The letters, stop words, punctuation marks and spaces were calculated. The chi-square test and the discriminant analysis were performed for differentiation of the authorial styles and the author identification. In the algorithm of the software system functioning, there are the following steps: text files uploading, sample formation, sample division into portions, calculations of frequencies of occurrence of linguistic units in each portion and sample, application of the chi-square test and the discriminant analysis and analysis of the data obtained. 
The structure of the developed software system includes a module MainWindowUI which shows the program main window, imports all the subsequent modules and a file containing the code of the program interface. Modules StatisticLetters, StatisticStopWords, StatisticPunctuationMarks, StatisticSpaces are responsible for the statistical analysis of letters, stop words, punctuation marks and spaces. Module StatisticTests is involved in the text differentiation and the author identification by the chi-square test and the discriminant analysis. The obtained results can be used in our future research aimed at testing statistical methods or their combinations for their efficiency in the author identification.

===7. References===
[1] M. A. da Rocha, P. S. G. de Morais, D. M. da Silva Barros, J. P. Q. dos Santos, S. Dias-Trindade, R. A. de Medeiros Valentim, A text as unique as a fingerprint: Text analysis and authorship recognition in a Virtual Learning Environment of the Unified Health System in Brazil, Expert Systems with Applications: An International Journal, vol. 203, issue C, Oct 2022. https://doi.org/10.1016/j.eswa.2022.117280. (2022).
[2] M. Kestemont, M. Tschuggnall, E. Stamatatos, W. Daelemans, G. Specht, B. Stein, M. Potthast, Overview of the author identification task at PAN-2018: cross-domain authorship attribution and style change detection, in: Working Notes Papers of the CLEF 2018 Evaluation Labs, CEUR Workshop Proceedings, vol. 2125, pp. 1–25. (2018).
[3] L. Muttenthaler, G. Lucas, J. Amann, Authorship Attribution in Fan-Fictional Texts Given Variable Length Character and Word N-Grams, Notebook for PAN at CLEF 2019, 9-12 September 2019, Lugano, Switzerland, vol. 2380, Paper 49. (2019).
[4] J. Bevendorff, B. Ghanem, A. Giachanou, M. Kestemont, E. Manjavacas, M. Potthast, F. Rangel, P. Rosso, G. Specht, E. Stamatatos, B. Stein, M. Wiegmann, E. Zangerle, Shared Tasks on Authorship Analysis at PAN 2020, in: Advances in Information Retrieval, 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, April 14–17, 2020, Proceedings, Part II, pp. 508–516. DOI: 10.1007/978-3-030-45442-5_66. (2020).
[5] E. Stamatatos, Authorship attribution using text distortion, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, vol. 1, 2017, pp. 1138–1149. (2017).
[6] E. Ferracane, S. Wang, R. Mooney, Leveraging discourse information effectively for authorship attribution, in: Proceedings of the Eighth International Joint Conference on Natural Language Processing, vol. 1, 2017, pp. 584–593. (2017).
[7] U. Sapkota, T. Solorio, M. Montes-y Gomez, P. Rosso, The use of orthogonal similarity relations in the prediction of authorship, in: Computational Linguistics and Intelligent Text Processing, Springer, 2013, pp. 463–475. (2013).
[8] Y. Seroussi, I. Zukerman, F. Bohnert, Authorship attribution with topic models, Computational Linguistics, vol. 40, no. 2, pp. 269–310, 2014. (2014).
[9] K. Sundararajan, D. Woodard, What represents style in authorship attribution?, in: Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 2814–2822. (2018).
[10] P. Plecháč, K. Bobenhausen, B. Hammerich, Versification and authorship attribution. A pilot study on Czech, German, Spanish, and English poetry, Studia Metrica et Poetica, vol. 5, no. 2, pp. 29–54, 2018. (2018).
[11] R. Hou, C.-R. Huang, Robust stylometric analysis and author attribution based on tones and rimes, Natural Language Engineering, 2019, pp. 1–23. (2019).
[12] O. Aborisade, M. Anwar, Classification for authorship of tweets by comparing logistic regression and Naive Bayes classifiers, in: 2018 IEEE International Conference on Information Reuse and Integration, IEEE, 2018, pp. 269–276. (2018).
[13] C. Akimushkin, D.R. Amancio, O.N. Oliveira, On the role of words in the network structure of texts: Application to authorship attribution, Physica A: Statistical Mechanics and its Applications, vol. 495, 2018, pp. 49–58, 10.1016/j.physa.2017.12.054. (2018).
[14] S. Voloshyn, V. Vysotska, O. Markiv, I. Dyyak, I. Budz, V. Schuchmann, Sentiment Analysis Technology of English Newspapers Quotes Based on Neural Network as Public Opinion Influences Identification Tool, 2022 IEEE 17th International Conference on Computer Sciences and Information Technologies (CSIT), 2022, pp. 83–88. (2022).
[15] S. Albota, Modelling the impact of the pandemic on online communication: textual semantic analysis, in: CEUR Workshop Proceedings, Vol. 3171: Computational Linguistics and Intelligent Systems 2022: Proceedings of the 6th International Conference on Computational Linguistics and Intelligent Systems (COLINS 2022), Vol. 1: Main conference, Gliwice, Poland, May 12-13, 2022, pp. 471–486. (2022).
[16] M. Karp, A. Burtnyk, I. Bekhta, N. Kunanets, O. Melnychuk, I. Shainer, Study of linking words in political speeches of Bill Clinton using Python, in: IEEE 17th International Conference on Computer Science and Information Technologies, CSIT 2022, November 10-12, 2022, Lviv, Ukraine, pp. 77–82/83. (2022).
[17] M. Al-Ayyoub, Y. Jararweh, A. Rabab’ah, M. Aldwairi, Feature extraction and selection for Arabic tweets authorship authentication, Journal of Ambient Intelligence and Humanized Computing, 8 (3), 2017, pp. 383–393. (2017).
[18] S. Alanazi, Classical Arabic Authorship Attribution Using Simple Features, Project: Natural Language Processing, Jouf University, Saudi Arabia, September. (2018).
[19] P. C. Gomez, Statistical Methods in Language and Linguistic Research, University of Murcia, Spain. (2013).
[20] A. Kornai, Mathematical Linguistics, Springer. (2008).
[21] R. Bhattacharya, E. C. Waymire, A Basic Course in Probability Theory, Springer, 2nd ed., 2016 edition, February 16. (2017).
[22] V. Turchyn, Matematychna statystyka. Navch. Posib., Vydavnychyj tsentr “Akademia”, Kyiv, Ukraine. (1999). (in Ukrainian).
[23] V. Fetisov, Paket statystychnoho analizu danyh STATISTICA, Nizhyn: NDU im. M. Hoholia, 2018, 114 s. (2018).
[24] V. Teslyuk, I. Kazymyra, Yu. Kordiiaka, I. Rybak, Modeli ta zasoby avtomatychnoho vyznachennia statystychnoho profiliu ukrainomovnyh tekstiv, Ukrainskyy zhurnal informatsiynyh tehnologiy, Tom 4, № 1, 2022, ss. 37–43. (2022).
[25] I. Khomytska, V. Teslyuk, I. Bazylevych, Yu. Kordiiaka, Machine learning and classical methods combined for text differentiation, in: Proceedings of the 6th International Conference on Computational Linguistics and Intelligent Systems (COLINS 2022), Vol. I: Main Conference, Gliwice, Poland, May 12-13, 2022, CEUR Workshop Proceedings, Vol. 3171, CEUR-WS.org, 2022, pp. 1107–1116. (2022).