
The chi-square test and the Student's t-test used for authorial style characterization

Vasyl Teslyuk

vasyl.m.teslyuk@lpnu.ua 1

Iryna Khomytska

Iryna Bazylevych

i_bazylevych@yahoo.com 0

Valentyna Holtvian

valentyna.i.holtvian@lpnu.ua 1

Olena Durytska

olena.durytska@lnu.edu.ua 1

0 Ivan Franko National University of Lviv, Lviv, 79000, Ukraine

1 Lviv Polytechnic National University, Lviv, 79013, Ukraine

In this research, we combine two classical statistical tests for author identification: the chi-square test and the Student's t-test. The novelty of the research is the application of these tests to the analysis of the distribution of parts of speech. The research was conducted on the material of the belles-lettres and scientific styles. It shows that the chosen statistical tests give good results in determining the specificity of parts of speech distribution and phoneme distribution. The results allow us to identify the style differentiating capability of each part of speech. Authors and styles are differentiated by the parts of speech which ensure statistically significant results. The calculations were carried out in Java. The structure of the developed software is based on the modular principle. The test validity of the obtained results is 95%. The results can be applied in authorship attribution.

Keywords: Chi-square test, Student's t-test, Distribution of parts of speech, Phoneme distribution, Belles-lettres style, Scientific style, Authorship attribution

1. Introduction

the patterns the author follows in the manner of writing. In most cases, researchers analyze the author’s word stock, that is, the distribution of the most frequently and the least frequently used words. Here, however, we deal with the syntactic and the phonological levels and analyze the distribution of parts of speech and phonemes in the researched text. The difference between authorial styles is the difference between the individual patterns used by the authors. This difference is established by various methods and techniques. The most efficient are those that ensure a high level of test validity (95% – 99%); the 95% level is considered classical and is applied in most cases. Powerful classical statistical tests (the Student’s t-test, the chi-square test, the Lehmann-Rosenblatt test, the Wilcoxon test) allow us to obtain results with high accuracy. Data clustering and discriminant analysis also give good results. The statistical tests can be checked for efficiency on the phonological, lexical and syntactic levels. The reliability of the results can be enhanced by the use of several tests. The purpose of this research is to prove that the chi-square test and the Student’s t-test are efficient statistical tests to differentiate texts by parts of speech distribution and phoneme distribution. The text differentiation by parts of speech distribution is a novel approach of this research.

2. Related works

The analysis of recent research has shown that machine learning and classical methods are often applied for authorship attribution. In most cases, the content of the researched texts is emotionally colored [1]. Thus, an attempt was made to detect aggression in social media using deep learning models. The models were tested on the Cyber-Troll dataset and gave an F1 score of 97% [2]. Convolutional neural networks (CNN) gave good results for author identification. The applied algorithm of this research was classical [3]. For fake news detection, the use of feature stacking gave a result of 93.39%. In that research, random forest and extra tree models were used for bagging [4, 5]. The textual semantic analysis of Reddit statements was conducted with the help of the software toolbox LIWC-22 (Linguistic Inquiry and Word Count). On the basis of the analysis, two cognitive sub-models with linguistic psychological and social apprehension were developed [6]. The individual authorial conceptualization was characterised by quantitative markers [7]. An intellectual analysis system aimed at determining the text authorship attribution probability for Ukrainian-language artistic works was developed [8 – 10]. For the analysis of Ukrainian tweets, algorithms using the Levenshtein distance, namely fuzz sort and fuzz set, ensured good results. The best result is a fingerprint similarity reaching 70% [11]. The research presented in this paper has proved that the chi-square test and the Student’s t-test are powerful statistical tests for text differentiation by parts of speech distribution and phoneme distribution. Statistically significant results have been obtained with a high level of test validity, 95%. Consequently, the results are reliable and may be used for further research or practically applied in author identification.

3. Methods and software

3.1. The proposed combination of methods

In this research, we combine the chi-square test and the Student’s t-test. The two tests were used in our previous research in different combinations: with the Lehmann-Rosenblatt test, the Wilcoxon test, data clustering and discriminant analysis [12, 13]. The tests were efficient in each combination. The algorithm of text differentiation in this research is given below.

1. Choose the texts from J. K. Rowling’s works.

2. Choose the texts from K. Ashley’s works.

3. Determine the most frequently used parts of speech for each author.

4. Let the sample size be equal for the texts compared.

5. Calculate the absolute, mean and relative frequencies of occurrence of parts of speech and phonemes for the two samples.

6. Use Pearson’s normality test for the two samples:

$\hat{\chi}^2 = \sum_{i=1}^{k} (n p_i)^{-1} (n_i - n p_i)^2$, (1)

where $k$ is the number of intervals, $n_i$ is the observed frequency in the $i$-th interval and $n p_i$ is the expected frequency [14 – 16].

7. Use the Student’s t-test:

$t = (\bar{x} - \bar{y}) / \left( s \sqrt{1/n + 1/m} \right) \geq t_{\alpha;(n+m-2)}$, (2)

where $\bar{x}$ and $\bar{y}$ are the values of mean frequencies of occurrence of parts of speech and phoneme groups for the two samples of sizes $n$ and $m$, and $s$ is the pooled standard deviation [17 – 19].

8. Use the chi-square test.

3.2. The developed software

The text differentiation program is developed in the Java programming language [20]. The structure of the program is based on the modular principle and consists of the following modules:

1. Module of data input.

2. Module of forming samples of parts of speech.

3. Module of determining the most frequently used parts of speech.

4. Module of calculating the relative frequencies of occurrence of parts of speech.

5. Module of forming samples of English phonemes.

6. Module of calculating the mean frequencies of occurrence of phonemes.

7. Module of carrying out Pearson’s test.

8. Module of carrying out the Student’s t-test.

9. Module of carrying out the chi-square test.

10. Module of data output.

The software has the following structure of classes: Main, SampleProcessor, PartsOfSpeechProcessor, PhonemeProcessor, PartsOfSpeechUtils, PhonemeUtils, StatisticProcessor.

The researched text files are loaded in the class Main.

The texts are transcribed in the class SampleProcessor.

The samples of parts of speech are formed in the class PartsOfSpeechProcessor. The samples of phonemes are formed in the class PhonemeProcessor.

The relative frequencies of occurrence of parts of speech are calculated in the class PartsOfSpeechUtils.

The mean frequencies of occurrence of phonemes are calculated in the class PhonemeUtils.

The Pearson’s test, the Student’s t-test and the chi-square test are carried out in the class StatisticProcessor.
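As an illustration of the frequency-calculation step, the following minimal Java sketch counts part-of-speech tags and converts the counts to relative frequencies. It is not the PartsOfSpeechUtils class of the developed software: the class name TagFrequencySketch, its methods and the word_TAG token format are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the software described above. */
final class TagFrequencySketch {

    /** Counts tags in a text whose tokens are assumed to look like "word_TAG". */
    static Map<String, Integer> countTags(String taggedText) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : taggedText.trim().split("\\s+")) {
            int sep = token.lastIndexOf('_');
            if (sep < 0 || sep == token.length() - 1) {
                continue; // token carries no tag, skip it
            }
            counts.merge(token.substring(sep + 1), 1, Integer::sum);
        }
        return counts;
    }

    /** Divides every absolute frequency by the total number of tagged tokens. */
    static Map<String, Double> relativeFrequencies(Map<String, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        Map<String, Double> relative = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            relative.put(e.getKey(), total == 0 ? 0.0 : (double) e.getValue() / total);
        }
        return relative;
    }

    public static void main(String[] args) {
        // Invented sample sentence, tagged by hand for the example.
        Map<String, Integer> counts = countTags("The_DT boy_NN ran_VBD to_TO the_DT door_NN");
        System.out.println(counts);
        System.out.println(relativeFrequencies(counts));
    }
}
```

Keeping the absolute counts and the relative frequencies in separate maps mirrors the split between forming the samples and calculating the frequencies in the module list.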

4. Results of the study

"RBR",  # Adverb, comparative
"RBS",  # Adverb, superlative
"RP",   # Particle
"SYM",  # Symbol
"TO",   # to
"UH",   # Interjection
"VB",   # Verb, base form
"VBD",  # Verb, past tense
"VBG",  # Verb, gerund or present participle
"VBN",  # Verb, past participle
"VBP",  # Verb, non-3rd person singular present
"VBZ",  # Verb, 3rd person singular present
"WDT",  # Wh-determiner
"WP",   # Wh-pronoun
"WP$",  # Possessive wh-pronoun
"WRB"   # Wh-adverb

In Figure 1, we present a fragment of the tagged text “Harry Potter and the Philosopher’s Stone” by J. K. Rowling.

Two samples were used for the calculations.

For “Harry Potter and the Philosopher’s Stone” by J. K. Rowling: 111, 1182, 22, 0, 1350, 599, 34, 15, 9, 272, 1302, 577, 355, 5, 15, 78, 1282, 302, 849, 4, 1, 120, 0, 260, 5, 532, 942, 157, 279, 233, 113, 53, 44, 0, 82.

For “Sebring” by K. Ashley: 145, 979, 15, 0, 1159, 422, 35, 8, 0, 351, 1202, 548, 407, 1, 11, 6, 1710, 448, 759, 4, 2, 99, 0, 517, 0, 847, 971, 132, 249, 268, 146, 63, 61, 0, 95.

The application of the chi-square test has proved that the homogeneity hypothesis is rejected and the differences between the compared texts are statistically significant. The critical value for the 0.95 level and 34 degrees of freedom was obtained in R:

    por_zn=qchisq(0.95,34)
    > por_zn
    [1] 48.60237
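The comparison can be reproduced approximately with the following Java sketch. It assumes that the homogeneity statistic has the usual Pearson form for a 2 × k table of counts, $\chi^2 = \sum_{i}\sum_{j} (O_{ij} - E_{ij})^2 / E_{ij}$, where $E_{ij}$ is computed from the row and column totals; the exact formula used in the software is not shown in the text, and the class and method names are hypothetical. The two arrays are the samples listed above, and the critical value 48.60237 is the one reported in the text.

```java
/** Hypothetical sketch of a chi-square homogeneity test for two samples of counts. */
final class ChiSquareHomogeneitySketch {

    // The two samples of part-of-speech counts given in the text.
    static final long[] ROWLING = {111, 1182, 22, 0, 1350, 599, 34, 15, 9, 272,
        1302, 577, 355, 5, 15, 78, 1282, 302, 849, 4,
        1, 120, 0, 260, 5, 532, 942, 157, 279, 233,
        113, 53, 44, 0, 82};
    static final long[] ASHLEY = {145, 979, 15, 0, 1159, 422, 35, 8, 0, 351,
        1202, 548, 407, 1, 11, 6, 1710, 448, 759, 4,
        2, 99, 0, 517, 0, 847, 971, 132, 249, 268,
        146, 63, 61, 0, 95};

    /** Pearson chi-square statistic for a 2 x k table (assumed form of the test). */
    static double statistic(long[] a, long[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("samples must have the same number of categories");
        }
        long totalA = 0;
        long totalB = 0;
        for (int i = 0; i < a.length; i++) {
            totalA += a[i];
            totalB += b[i];
        }
        if (totalA == 0 || totalB == 0) {
            throw new IllegalArgumentException("each sample must contain at least one observation");
        }
        double grand = totalA + totalB;
        double chi2 = 0.0;
        for (int i = 0; i < a.length; i++) {
            long column = a[i] + b[i];
            if (column == 0) {
                continue; // category absent from both samples contributes nothing
            }
            double expectedA = column * totalA / grand;
            double expectedB = column * totalB / grand;
            chi2 += Math.pow(a[i] - expectedA, 2) / expectedA
                  + Math.pow(b[i] - expectedB, 2) / expectedB;
        }
        return chi2;
    }

    public static void main(String[] args) {
        double critical = 48.60237; // value reported in the text for level 0.95 and 34 degrees of freedom
        double chi2 = statistic(ROWLING, ASHLEY);
        System.out.printf("chi-square = %.2f, critical = %.5f, homogeneity rejected = %b%n",
                chi2, critical, chi2 > critical);
    }
}
```

Categories that are empty in both samples are skipped in the sum. The 34 degrees of freedom follow the text (35 categories minus one); whether the empty categories should reduce that number is left open in this sketch.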

The style differentiation has been carried out by the Student’s t-test on the material of Shaw’s drama and the scientific style (classical mechanics). Three cases of style differentiation were considered: 1 – any position in the word; 2 – the beginning of the word; 3 – the end of the word. Statistically significant differences were obtained in position 1 for all but two groups of phonemes and in positions 2 and 3 for all but one group of phonemes. The results prove the Student’s t-test efficiency. The data are given in Tables 1 – 3.
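A minimal Java sketch of the two-sample Student’s t-statistic with pooled variance and n + m − 2 degrees of freedom is given below. The frequency values in main are invented for illustration and are not taken from Tables 1 – 3. The class name PooledTTestSketch is hypothetical, and the critical value is supplied by the caller, since the Java standard library has no quantile function for the t-distribution.

```java
/** Hypothetical sketch of the pooled two-sample Student's t-test. */
final class PooledTTestSketch {

    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    /** Sum of squared deviations of the observations from their mean. */
    static double sumOfSquares(double[] x, double mean) {
        double ss = 0.0;
        for (double v : x) {
            ss += (v - mean) * (v - mean);
        }
        return ss;
    }

    /** t = (mean(x) - mean(y)) / (s * sqrt(1/n + 1/m)), with s^2 the pooled variance. */
    static double statistic(double[] x, double[] y) {
        int n = x.length;
        int m = y.length;
        if (n < 2 || m < 2) {
            throw new IllegalArgumentException("each sample needs at least two observations");
        }
        double meanX = mean(x);
        double meanY = mean(y);
        double pooledVariance = (sumOfSquares(x, meanX) + sumOfSquares(y, meanY)) / (n + m - 2);
        return (meanX - meanY) / Math.sqrt(pooledVariance * (1.0 / n + 1.0 / m));
    }

    /** Two-sided decision: the difference is significant if |t| reaches the critical value. */
    static boolean isSignificant(double t, double criticalValue) {
        return Math.abs(t) >= criticalValue;
    }

    public static void main(String[] args) {
        // Invented frequencies of one phoneme group in two styles (illustration only).
        double[] drama = {12.0, 15.0, 14.0};
        double[] scientific = {20.0, 22.0, 19.0};
        double t = statistic(drama, scientific);
        double critical = 2.776; // two-sided 5% critical value for 4 degrees of freedom
        System.out.println("t = " + t + ", significant = " + isSignificant(t, critical));
    }
}
```

The sketch would be applied once per phoneme group and per position in the word, with the critical value chosen for the actual sample sizes.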

In Tables 1 – 6, we use the following designations: GP – the group of phonemes; SD – Shaw’s drama; SC – the scientific style (classical mechanics); L – labials; D – dorsals; C – coronals; V – velars; N – nasals; S – sonorous; F – fricatives; T – stops; $s^2$ is the value of dispersion; $t$ is the Student’s statistic; $2\alpha$ is the level of significance; $\bar{x}$ is the mean value of frequencies of phoneme groups; $\sum (x_i - \bar{x})^2$ is a sum of squares of difference of the value of middle of the interval and the mean value of frequencies of phoneme groups; $\bar{x}_1 - \bar{x}_2$ is the value of difference between the researched samples.

Table 1. The results of the calculations for the comparison between Shaw’s drama and the scientific style in an unidentified position

In Table 1 (continuation), we see the style differentiating capability of groups of phonemes. In the groups of dorsals, coronals, velars, sonorous and fricatives, the differences between the researched texts are statistically significant.

In Table 2, we give the data of a sum of squares of difference of the value of middle of the interval and the mean value of frequencies of phoneme groups for Shaw’s drama and the scientific style in the position at the beginning of a word.

In Table 2 (continuation), we can see the essential differences revealed in the position at the beginning of a word for the groups of labials, dorsals, coronals, velars, nasals, sonorous and stops.

In Table 3, we give the data of a sum of squares of difference of the value of middle of the interval and the mean value of frequencies of phoneme groups for Shaw’s drama and the scientific style in the position at the end of a word.

In Table 3 (continuation), we see the style differentiating capability of the groups of labials, dorsals, velars, sonorous, fricatives and stops for the comparison of Shaw’s drama and the scientific style in the position at the end of a word.

Table 2. The results of the calculations for the comparison between Shaw’s drama and the scientific style at the beginning of a word

Table 2 (continuation). The essential differences between Shaw’s drama and the scientific style at the beginning of a word

Table 3. The results of the calculations for the comparison between Shaw’s drama and the scientific style at the end of a word

GP L D C V N S F T
18,5 125,8 – 10,1 37,0 54,2 56,5 43,3

Table 3 (continuation). The essential differences between Shaw’s drama and the scientific style at the end of a word

GP L D C V N S F T
142,5 – 15,4 35,2 60,7 49,3 74,4
4,70 7,85 4,22 6,71 9,20 7,47 7,40

The results obtained for the comparison of Shaw’s drama and the scientific style have shown that in three cases of phoneme’s position in a word the differences between the compared texts are statistically significant for almost all groups of phonemes. Consequently, the Student’s t-test is efficient for solving a text differentiation task. In another comparison, we have obtained statistically significant differences between Byron’s emotive prose and the scientific style. In Tables 4 – 6, we see the data for three cases of phoneme’s position in a word.

Byron’s emotive prose differs essentially from the scientific style in an unidentified position for the groups of labials, dorsals, nasals, sonorous and fricatives (Table 4 (continuation)).

Table 4. The results of the calculations for the comparison between Byron’s emotive prose and the scientific style in an unidentified position

194,1 10770,24 12960,31 186,8 211,1

Table 4 (continuation). The essential differences between Byron’s emotive prose and the scientific style in an unidentified position

In Table 5, you can see the data of a sum of squares of difference of the value of middle of the interval and the mean value of frequencies of phoneme groups for Byron’s emotive prose and the scientific style in the position at the beginning of a word.

Table 5. The results of the calculations for the comparison between Byron’s emotive prose and the scientific style at the beginning of a word

GP L D C V N S F T
12,5 17,79 3,50 8,69 10,11 14,10 15,05

At the beginning of a word, statistically significant differences have been obtained for the groups of labials, dorsals, velars, nasals, sonorous and fricatives (Table 5 (continuation)).

Table 5 (continuation). The essential differences between Byron’s emotive prose and the scientific style at the beginning of a word

5,29 4,96 1,91 0,00 3,08 2,79 4,54

BE $\sum (x_i - \bar{x})^2$: 1058,00 13225,44 49,99 2169,56 382,39 2865,56 10265,51 1875,44

In Table 6, we present the data of a sum of squares of difference of the value of middle of the interval and the mean value of frequencies of phoneme groups for Byron’s emotive prose and the scientific style in the position at the end of a word.

Table 6. The results of the calculations for the comparison between Byron’s emotive prose and the scientific style at the end of a word

Byron’s emotive prose differs essentially from the scientific style in the case of the end of a word for the groups of labials, dorsals, nasals, sonorous, fricatives and stops (Table 6 (continuation)).

Table 6 (continuation). The essential differences between Byron’s emotive prose and the scientific style at the end of a word

1,34 7,60 3,04 7,69 14,38 9,17 0,59 4,97 4,93 9,21 3,81 0,82
7,41 3,39 1,58 3,56 2,62 2,90
50%

In this research, the Student’s t-test is efficient for style differentiation. Statistically significant differences have been revealed in comparisons of the belles-lettres style (Shaw’s drama; Byron’s emotive prose) and the scientific style (classical mechanics) for the three cases of phoneme’s position in a word.

The analysis of the results obtained by the chi-square test in this research has shown that this test is efficient for authorship attribution on the syntactic level. The Student’s t-test has given good results on the phonological level for style differentiation. The results have been obtained with the test validity of 95%.

5. Discussions

The chi-square test in this research has been used on the syntactic level for author identification. In our previous research, we used the test on the phonological and lexical-semantic levels [12, 13]. The test was efficient on these levels. In this paper, we have proved the efficiency of the chi-square test on the syntactic level. Consequently, the chi-square test ensures reliable data (the level of test validity is 95%) on the phonological, lexical-semantic and syntactic levels.

The Student’s t-test in this research has been used for style differentiation. The results of testing have shown statistically significant differences between the belles-lettres style (Shaw’s drama, Byron’s emotive prose) and the scientific style (classical mechanics). The level of test validity is 95%.

According to the analysis of similar research, the authorial style was identified by deep learning models in an attempt to detect aggression in social media. The models were tested on the Cyber-Troll dataset and ensured an F1 score of 97% [2]. In another study, random forest and extra tree models were used for fake news detection. The use of feature stacking gave a result of 93.39% [4, 5]. The algorithms using the Levenshtein distance for the analysis of Ukrainian tweets ensured reliable results. The best result is a fingerprint similarity of 70% [11].

Having analyzed the results obtained in our research with the help of the chi-square test and the Student’s t-test, we can state that this combination of tests is efficient for style differentiation and author identification on three language levels: the phonological, lexical-semantic and syntactic. As the test validity of the results is high (95%), it is recommended to apply this combination of tests for solving the tasks of authorship attribution.

6. Conclusions

It is topical in modern research to propose a new approach to authorial style identification. The novelty of the research is the application of the chi-square test to the analysis of the distribution of parts of speech on the material of American emotive prose.

The chi-square test was performed on the material of the belles-lettres style (“Harry Potter and the Philosopher’s Stone” by J. K. Rowling and “Sebring” by K. Ashley). For text differentiation, the two texts were tagged by parts of speech (POS) in natural language processing (NLP). The task of an authorial style differentiation has been solved with a level of test validity of 95%.

The Student’s t-test was performed on the material of the belles-lettres style (Shaw’s drama, Byron’s emotive prose) and the scientific style (classical mechanics). Statistically significant differences were obtained in three cases of style differentiation: 1 – any position in the word; 2 – the beginning of the word; 3 – the end of the word. The style differentiating capability of phoneme groups (labials, dorsals, coronals, velars, nasals, sonorous, fricatives and stops) was revealed in position 1 for all but two groups of phonemes and in positions 2 and 3 for all but one group of phonemes. The results prove the Student’s t-test efficiency. The calculations were carried out in Java. The structure of the developed software is based on the modular principle. The test validity of the obtained results is 95%.

The goal of this research has been attained. The research has proved that the chi-square test and the Student’s t-test are efficient statistical tests to differentiate texts by parts of speech distribution and phoneme distribution.

The practical application of this research involves author identification and style differentiation. In our future research, we will choose some other syntactic features for authorial style differentiation.

References

[1] Hajibabaee, Malekzadeh, Ahmadi, Heidari, Esmaeilzadeh, Abdolazimi, J. H. J. Jones, Offensive language detection on social media based on text classification, in: Proceedings of the IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 2022, pp. 0092-0098. doi: 10.1109/CCWC54503.2022.9720804.

[2] Khan, Rizwan, G. Atteai, M. M. Jamjoom, N. A. Samee, Aggression detection in social media from textual data using deep learning models, Applied Sciences 12 (10) 5083 (2022). doi: 10.3390/app12105083.

[3] Mohades Delami, Sadr, M. Nazari, Using machine learning-based models for personality recognition, Big Data and Computing Visions 1 (3) (2022) 128-139. doi: 10.22105/bdcv.2021.142588.

[4] Lina, Fua, Jianga, Fake news detection in the Urdu language using CharCNNRoBERTa, CEUR Workshop Proceedings, vol. 2826/T3-2, 2020.

[5] Shoaib Farooq, Naseem, Rustam, I. Ashraf, Fake news detection in the Urdu language using machine learning, PeerJ Computer Science 9: e1353 (2023). doi: 10.7717/peerj-cs.1353.

[6] Albota, Creating a model of war and pandemic apprehension: textual semantic analysis, in: Proceedings of the 7th International Conference on Computational Linguistics and Intelligent Systems. Vol. II: Computational Linguistics Workshop, Kharkiv, Ukraine, April 20-21, 2023, pp. 228-243.

[7] Levchenko, Dilai, Qualitative and Quantitative Markers of Individual Authorial Conceptualization, in: Proceedings of the 7th International Conference on Computational Linguistics and Intelligent Systems. Vol. II: Computational Linguistics Workshop, Kharkiv, Ukraine, April 20-21, 3396, 2023, pp. 1-19.

[8] Romanchuk, Vysotska, Andrunyk, Chyrun, Brodyak, Intellectual Analysis System Project for Ukrainian-language Artistic Works to Determine the Text Authorship Attribution Probability, in: Proceedings of the 18th International Scientific and Technical Conference on Computer Sciences and Information Technologies, CSIT 2023, Lviv, Ukraine, 19-21 October, 2023. doi: 10.1109/CSIT61576.2023.10324012.

[9] M. Kestemont, M. Tschuggnall, E. Stamatatos, W. Daelemans, G. Specht, B. Stein, M. Potthast, Overview of the author identification task at PAN-2018: cross-domain authorship attribution and style change detection, in: Working Notes Papers of the CLEF 2018 Evaluation Labs, CEUR Workshop Proceedings, vol. 2125, 2018, pp. 1–25.

[10] R. Hou, C.-R. Huang, Robust stylometric analysis and author attribution based on tones and rimes, Natural Language Engineering 26 (1) (2020) 49–71. doi: 10.1017/S135132491900010X.

[11] O. Prokipchuk, V. Vysotska, Ukrainian Language Tweets Analysis Technology for Public Opinion Dynamics Change Prediction Based on Machine Learning, Radio Electronics, Computer Science, Control 2 (2023) 103. doi: 10.15588/1607-3274-2023-2-11.

[12] I. Khomytska, V. Teslyuk, K. Prysyazhnyk, N. Hrytsiv, The Lehmann-Rosenblatt test applied for determination of statistical parameters of Charles Dickens's authorial style, in: Proceedings of IEEE XVIth Scientific and Technical Conference on Computer Science and Information Technologies, CSIT 2021, Lviv, Ukraine, 22–25 September, vol. 2, 2021, pp. 64–67. doi: 10.1109/CSIT52700.2021.9648789.

[13] I. Khomytska, V. Teslyuk, I. Bazylevych, Yu. Kordiiaka, Machine learning and classical methods combined for text differentiation, in: Proceedings of the 6th International Conference on Computational Linguistics and Intelligent Systems. Vol. I: Main Conference, Gliwice, Poland, May 12-13, CEUR Workshop Proceedings, vol. 3171, 2022, pp. 1107-1116.

[14] Th. S. Gries, Statistics for Linguistics with R: A Practical Introduction (Trends in Linguistics: Studies & Monographs), Mouton de Gruyter, 2009, p. 348.

[15] R. Bhattacharya, E. C. Waymire, A Basic Course in Probability Theory (2nd ed.), Springer, 2016 edition, February 16, 2017.

[16] V. S. Perebyjnis, Statystychni metody dlia lingvistiv, Nova Knyha, Vinnytsia, Ukraine, 2013.

[17] P. C. Gomez, Statistical Methods in Language and Linguistic Research, University of Murcia, Spain, 2013.

[18] A. Kornai, Mathematical Linguistics, Springer, 2008.

[19] V. M. Turchyn, Matematychna statystyka, Navch. Posib., Vydavnychyj tsentr “Akademia”, Kyiv, Ukraine, 1999.

[20] A. Batyuk, V. Voityshyn, V. Verhun, Software Architecture Design of the Real-Time Processes Monitoring Platform, in: Proceedings of the IEEE Second International Conference on Data Stream Mining & Processing, DSMP 2018, Lviv, Ukraine, 2018, pp. 98-101. doi: 10.1109/DSMP.2018.8478589.