Text Attribution in Case of Sampling Imbalance Ьу the Method of Constructing an EnsemЫe of Classifiers Based оп Decision Тrees * Alexander Rogov 1 , Roman Abramov 1 , Alexander Lebedevl [оооо-0001-9939-9389], Kirill Kulakovl [оооо-0002-озо5-419Х], and Nikolai Mosk in1 [оооо-0001-5556-5349] Petrozavodsk State University, Petrozavodsk , Russia rogov@petrsu.ru, monset008@gmail.com, perevodchik88@yandex.ru, kulakov@cs.karelia.ru, moskin@petrsu.ru https:/ /petrsu.ru/ Abstract. When solving the attribution proЫem, the question of de­ termining the author's style of а writer who created а smaller number of texts (both quantitatively and in terms of the total number of words) in comparison with other analyzed authors arises. ln this paper we consider possiЫe solutions to this proЫem Ьу the example of determining the style of Apollon Grigoriev. As а method for constructing an ensemЫe of clas­ sifiers we use Bagging (Bootstrap aggregating). The SMALT information system ("Statistical methods for analyzing literary texts") was used to determine the frequency characteristics of the texts and Python 3.6 was used to build decision trees. As а result of calculations we сап assume that the relative frequency of the "particle-adjective" Ьigram more than 6.5 is а distinctive feature of the journalistic style of Apollon Grigoriev. There also was а study of the article "Poems Ьу А. S. Khomyakov", which confirms the previously conclusion that there is по reason to consider it as belonging to Apollon Grigoriev. Keywords: Text attribution · F. М. Dostoevsky · Apollon Grigoriev · Poems Ьу А. S. Khomyakov · sampling imbalance · decision tree · software complex "SMALT". 1 Introduction Authorship identification of anonymous texts ( attribution of texts) is one of most urgent proЫem for the philological community; however, there are no universal mechanisms for its solution [10]. Specialists in study of literature use methods that are often somewhat unusual for the humanitarian sphere to answer such questions, including mathematical methods of analysis. One of the issues, which is far from its final decision, is the affiliation of anonymous articles puЬlished in the magazines "Time" and "Epoch" (1861-1865). The authorship of some * Supported Ьу the Russian Foundation for Basic Research, project по. 18-012-90026. Copyright © 2020 for this paper Ьу its authors. Use permitted under Creative Commons License Attribution 4.0 Intemational (СС ВУ 4.0). 326 of these articles has been estaЫished, while the authorship of other materials causes а lot of controversy and discussion in the philological field [6]. The solution to this proЫem is additionally hampered Ьу the uneven amount of availaЫe textual material: there are many articles owned Ьу F. М. Dostoevsky, while the remaining authors puЬlished in these journals (for example, А. Grigoriev, N. N. Strakhov, Уа. Р. Polonsky, etc.), don't have so many texts that are uniquely attributed to them. The following mathematical methods are used to estaЫish authorship of works: neural networks, QSUM method, decision trees, support vector machine (SVG), k-means method, Bayesian classifier, Markov chains, principal compo­ nent analysis, discriminant analysis, genetic algorithms, statistical criteria (х 2 test, Student's t-test, Kolmogorov-Smirnov criterion), etc. Among other meth­ ods of data mining, decision trees are distinguished Ьу the fact that they are easy to understand and interpret and also do not require special preliminary data processing. Note some authors who used mathematical methods to solve the proЬlem of text attribution: Morton А. Q., Mendenhall Т. С., Farringdon J. М., Efron В., Thisted R., Teahan W. J., Chaski С. Е., Stamatatos Е., Juola Р., Peng R. D., Joachims Т., Diederich J. J., Apte С., Lowe D., Matthews R., Tweedie F. J., de Vel О., Argamon S., Levitan S., Zheng R. [3], [5], [11], [13]. It should Ье noted that Russian language differs significantly from English, so the methods of analysis of texts in English is often not suitaЫe for Russian language. When solving the proЬlem of classification into two classes, the proЬlem of sampling imbalance often arises, i.e. when the number of objects of one class significantly exceeds the number of objects of another class. In this case the first class is called the majority class and the second class is called the minority class. ln such samplings classifiers are configured for objects of the majority class, i.e. high accuracy of the classifier can Ье oЬtained without selecting objects of the minority class. When solving the attribution proЬlem, the question of determin­ ing the author's style of а writer who created а smaller number of texts (both quantitatively and in terms of the total number of words) in comparison with other analyzed authors arises. Let's consider possiЫe solutions to this proЬlem Ьу the example of determining the style of Apollon Grigoriev. The authors do not know any analogs of such research of Russian-language texts except for the works of G. Kjetsaa and М. А. Marusenko [4], [10]. 2 Construction and Analyzing Decision Тrees An overview of the types of sampling imbalance and the methods used in such cases can Ье found in [8]. In this work we will use sampling, namely Undeтsam­ pling. ln this method the balance of sampling elements is achieved Ьу removing objects of the majority class. The authors think that this method is more ap­ propriate for the task than Oveтsampling (the sampling balance is achieved Ьу duplicating objects of the minority class) or SMOTE (Ьу generating new objects of the minority class). 327 As а method for constructing an ensemЫe of classifiers we use Bagging (Boot­ strap aggregating) [2]. The idea of this method is to train several models on ran­ dom subsamples of the original sample (using Bootstrap) with further averaging. The authors believe that it meets the meaning of the task better than Boosting. During previous studies in determining the features of the journalistic style of F. М. Dostoyevsky we found that the constructed decision trees based on Ьigrams well refiect the author's style. ln the experiments the best results were shown Ьу decision trees with а fragment size of 1000 words. The optimal step size for choosing the beginning of the next fragment is 100 words. The same parameters were used in this work. The SMALT information system (" Statistical meth­ ods for analyzing literary texts") developed at Petrozavodsk State University was used to determine the frequency characteristics [9]. Specialists in philology carried out grammatical markup of texts, which took into account 14 parts of speech (noun, adjective, numeral, pronoun, adverb, category of state, verb, par­ ticiple, gerund, preposition, conjunction, particle, modal word, interjection) and also allowed to mark the quotes, foreign words, introductory words, abbreviated words and non-linguistic symbols. А set of data for training was compiled (118 fragments - Apollon Grigoriev, 899 - the rest). The texts from which the data were prepared are presented in ТаЫе 1. In this case fragments of the texts of Apollon Grigoriev are objects of the minority class and all the others are from the majority class. The text size is quite small (from 2000 to 7000 words). ТаЫе 1. Source texts for analysis. Name Author Pismo k redaktoru У. Р. Polonsky Zhukovskij i romantizm F. М. Dostoevsky Literaturnaya isterika F. М. Dostoevsky Odin iz proektov chudesnago obogasheniya Rossii 1. N. Shill Pismo k izdatelyu "Vremeni" У. Р. Polonsky Podpiska na 1863 god М. М. Dostoevsky Ryad statej о russkoj literature. Vvedenie F. М. Dostoevsky Slavyanofily, chernogorcy i zapadniki F. М. Dostoevsky Ryad statej о russkoj literature. G. -bov i vopros оЬ iskusstve F. М. Dostoevsky Knizhnost i gramotnost. Statya pervaya F. М. Dostoevsky Knizhnost i gramotnost. Statya vtoraya F. М. Dostoevsky Poslednie literaturnye yavleniya. Gazeta "Den" F. М. Dostoevsky Neobhodimoe literaturnoe obyasnenie, ро povodu ra... F. М. Dostoevsky Politicheskoe obozrenie А. А. Golovachev Lermontov i ego napravlenie. Statya vtoraya А. Grigoriev Oppoziciya zastoya. Cherty iz istorii mrakobesiya А. Grigoriev Nashi domashnie dela А. U. Poretsky Durnye priznaki N. N. Strakhov Eshe о Peterburgskoj literature N. N. Strakhov Vsyo-li na Rusi tak ploho, kak kazhetsya? V. Р. Mesherskij 328 Python 3.6 was used to build decision trees (libraries: scikit-learn - for tree implementation, pandas - for data reading). The original data set was divided into 7 parts. In each part all fragments of Apollon Grigoriev were taken as а class with а label "1", the sаше number of fragшents of other authors were taken randomly as а class with а label "О". Repetitions of fragments of other authors were not allowed. А decision tree was trained on each part of data. The training continued until accuracy reached 100% (tree depth). The fragшent of one of the trained trees is shown in Fig. 1. All trees formed an еnsешЫе. The decision was accepted Ьу а шajority vote. Accuracy was calculated on the entire data set using the following formula: TP+TN Accuracy = --------­ (1) TP + TN +FP+FN' where ТР is true-positive, ТN is true-negative, F Р is false-positive and F N is false-negative predicted class. The experimental results are presented in ТаЫе 2. ТаЫе 2. Classifier accuracy Depth Accuracy 1 0,8628 2 0,9592 3 0,9841 4 0,9891 5 0,992 6 0,9901 In total 7 decision trees were built. А fragment of one of the trees is shown in Fig. 1. Note that on the third level there are two leaves that contain а small number of fragшents (summary from 12 to 27, on average less than 8%). You should take into account the possiЫe inaccuracy of the source data. The texts of Apollon Grigoriev could Ье edited Ьу F. М. Dostoevsky. In addition there is а slight volatility in the paraшeters of the author's style depending on external factors (such as шооd, health status, etc). Therefore, when solving the рrоЫеш of text attribution, you should limit yourself to the first level or at most the first two levels of decision trees. As you can see from ТаЫе 2, the accuracy of the еnsешЫе at the second level already falls into the generally accepted 5% significance level. Analyzing the decision trees contained in the еnsешЫе, it can Ье noted that in 4 of them the first attribute was the "particle-adjective" Ьigram less than or equal to 6.5. In two cases the sаше attribute is found, but with а different threshold (less than or equal to 7.5). Only one tree had а different first attribute ("adjective-particle") less than or equal to 2.5. We can assume 329 that the relative frequency of the "particle-adjective" Ьigram more than 6.5 is а distinctive feature of the journalistic style of Apollon Grigoriev. The proposed algorithm allows to solve the proЫem of text attribution. P1111icle Adjecti\'e s7.5 gini = 0.5 samples = 236 value = [11В, 118I class = Other T,ue False AdjecU\·e lntcrje<:Dofl S 0.5 Conj1mc1ion Particlc S 11.5 gini = 0.23 gini = 0.121 samples = 128 samples = 10В value = (111, 17] value = [7.1011 class = Other class = ApoUon Grigoricv gini = О.О samples = 8 giпi = О.О value = [О, BJ samples = б class = ApoUon Grigoricv value = [6, О] Abbrcтiatcd \\'ord Abbrcтiaтed ,vord S 2.5 Prqюsitioп Nou.п s: 76.5 1 89 О class = 0tl1c-r gini = 0.139 gini = 0.019 1-89-1 О 13 1 samples = 120 samples = 102 1-89-2 0-37-9 value = [111, 9] value = [1, 101] class = Other Свэ-з class = АроUоп Grigoric-,r 0-39-1 1 -89 -60 О -76 -29 Свэ-61 0-77-25 Свэ-62 О-В7-32 1-89-63 gini = О.О NotщGm.u.1d s5.5 Pa.t1iclc Nощ1 s9.5 samples = 101 gini = 0.019 gini = 0.49 value = [О, 1011 samples = 106 samples = 14 class = АроUоп Grigorit"'.• value = [105, 1] value = fб, 81 1 89 4 class = Other class = ApoUon Grigorie,,• 1-89-5 1-89-6 1-В9- 7 gini = О.О 1-В9-8 samples = 105 1-89-9 1 -В9 -10 gini = О.О value = [105, О] gini = О.О samples = 1 class = Other gini = О.О gini = О.О samples = 8 1-89-11 1-89-12 value = [1, О] О 23 О samples = 1 samples = 6 value = ro, 81 class = 0t11c-r 0-23-1 value = [О, 11 value = [6, О] class = АроUоп Grigoric-,r 1-В9-13 1-89-14 0_92_44 0 -23 -11 class = Apolloп Grigoriev class = 0tl1c-r 1 89 24 0-23-12 1_89_58 О 37 5 1-89-25 1-89-15 0-23-24 О 42 -23 1-89-30 1-В9-16 0-23-50 0-42-98 1-89-36 0-23-52 О 42 -101 1-89-37 0=23=53 0 43 2 1-89-38 О -7 7 -80 1-89-39 1=89=44 Fig. 1. А fragment of one of the trees The influence of the universally accepted methods for processing unbalanced data "UpSampling", "UnderSampling", "SMOTE" on the accuracy of classifi­ cation of works Ьу Apollon Grigoriev was analyzed. The availaЫe data set was divided into test (42 - Apollon Grigoriev, 310 - Other) and training samples. The training sample was subjected to the tech­ niques listed above to confront class imbalance. Then the accuracy ("Accuracy", "roc-auc" curve) was calculated on а test sample, which was the same for all three techniques. The results of the experiment are shown in ТаЫе 3. This analysis showed approximately the same accuracy of all three methods. UpSampling looks worse. The advantage of UnderSampling is that it is easier to explain. Therefore, the authors decided to focus on it. 330 ТаЫе 3. Experimental results Depth Accuracy (test) roc-auc (test) Accuracy (training) roc-auc (training) UnderSampling 1 0,838068182 0,87718894 0,927711 0,927710843 2 0,920454545 0,923963134 0,975904 0,975903614 3 0,934659091 0,942319508 0,993976 0,993975904 4 0,946022727 0,948771121 1 1 UpSampling 1 0,838068182 0,87718894 0,90938 0,909379968 2 0,90625 0,874731183 0,961844 0,961844197 3 0,931818182 0,889247312 0,980922 0,980922099 4 0,940340909 0,894086022 0,992846 0,992845787 5 0,963068182 0,906989247 0,99682 0,99682035 6 0,96875 0,910215054 0,999205 0,999205087 7 0,965909091 0,898310292 1 1 SMOTE 1 0,838068182 0,87718894 0,90938 0,909379968 2 0,931818182 0,930414747 0,964229 0,964228935 3 0,980113636 0,957834101 0,983307 0,983306836 4 0,988636364 0,983256528 0,996025 0,996025437 5 0,980113636 0,957834101 0,99841 0,998410175 6 0,985795455 0,971351767 1 1 3 Analysis of "Poems Ьу А. S. Khomyakov" When discussing the affiliation of certain articles to certain authors, it should Ье noted, that in some cases there is no unequivocal evidence relating this article to а particular author. In particular, one of the controversial and still unresolved issues is the article "Poems Ьу А. S. Khomyakov" а discussion about whose authorship in the literary criticism continues over the past twenty years. The work of "Poems Ьу А. S. Khomyakov" has long been attributed to Apol­ lon Grigoriev. However, recently it has been considered the copyright text of F. М. Dostoevsky [14]. It was interesting to check where our classifier will take it. The text will Ье attributed to the author that most of the text fragments belong to. Fig. 2 shows one of the resulting decision trees. If we take the classification on the first node, then 6 of the 7 decision trees classify it as "Other", i.e. as not the text of Apollon Grigoriev. Only on one tree, there was an equality (5 frag­ ments "for belonging" and 5 "against"). During the split on the second level 3 "for belonging", 3 "against" and in one rejection of the classification. Our study confirms the earlier conclusion [14] that there is no reason to consider the article "Poems Ьу А. S. Khomyakov" as belonging to Apollon Grigoriev. The combination of parts of the speech "Particle" + "Adjective" that is so often encountered in two texts precisely belonging to Apollon Grigoriev (in transliteration from Russian "Lermontov i ego napravlenie. Statya vtoraya" and 331 "Oppoziciya zastoya. Cherty iz istorii mrakobesiya"), almost does not appear in the text of the controversial article "Poems Ьу А. S. Khomyakov". The author repeatedly uses this comЬination in the two indicated articles, then in the desired article it occurs only 10 times (the text consists of 2031 words), in six cases of which it is а "ne" particle, and in three cases - а "dazhe" particle; over large parts of text, such comЬinations of parts of speech could not Ье found (while in other articles belonging to А. Grigoriev, such comЬination is found more often and more diverse in terms of emerging types of particles - not only "ne" and "dazhe", but also "tolko", "to", "vse-taki", "zhe" followed Ьу the adjective. Of course, this observation alone is not enough to douЬt А. Grigoryev's text attribution, however, the application of methods based on decision trees can help with comprehensive analysis of texts in general, and the article "Poems Ьу А. S. Khomyakov" in the context of the issue of the attribution of journalistic texts. Partick Adjccti..-c: S 6.5 gini =0.5 samples = 10 value = [9, 1] class = Other True False Adj�tiп: Partick :s; 5.5 Conjщ1ctio11 Partick :s; 12.5 gini =0.104 gini =0.2GB samples =7 samples =3 value = (7, О] value = (2, 11 class = Othcr class = Apollon Grigoric:\' gini =О.О P1·ф0sitio11 Аdп!'Ь..:::; 3.5 Adjc:cti..-c: Nш11c:!'11l :s; 1.5 gini =О.О samples = 2 gini = 0.055 gini = 0.082 samples =О value = f2. 01 samples = 7 samples = 1 value = [О, О] class = Other value = f7, О] value = (О, 1) class = Apollon Grigoric:Y 1 124 7 class = 0tl1ei· class = Apollo11 Grigol'ic:Y 1=124=8 Participlc: Modal "'ord :s; 0.5 Nшпeral Modal \\·ord :s; 0.5 gini = О.О gini = О.О gini =0.019 gini =0.018 samples = О samples = О samples =7 samples = 1 value = ro. Ol value = [О, OJ value = [7, О] value = [О, 1] class = Apollo11 Grigoric:Y c1ass = Otl1er class = Other class = ApoUou Grigorie,.· gir1i = О.О samples =7 value = [7. О] ctass = Otl1er Prqюsitiш1 Nш11eral � 3.0 gini =О.О gini = О.О 1 124 О gini =0.5 samples =1 samples = О 1-124-1 samples =О value = Ю, 11 value = [О, О] С124-2 value = [О, OJ class = Apollon GrigoritY class = Other 1-124-4 class = Other 1_124_3 С124-5 1-124-6 С124=9 gini =О.О samples =О value = [О, О] class = Apollon Grigoritv Fig. 2. One of the trees of the classifier in the analysis of the work "Poems Ьу А. S. Khomyakov". 332 4 SMALT Information System Specialized software is required for research in the field of text attribution. As an example, we note several software tools that are described in more detail in [1]: "Stileanalizator" (graphematic and statistical analysis, work with marked texts); "Аvtoroved" (graphematic, morphological and statistical analysis); "Atributor" (statistical analysis); "Lingvoanalizator" (graphematical and statistical analysis). The SMALT information system developed at Petrozavodsk State Univer­ sity [7], [9], [12] is designed for the collective work of various specialists with texts. The information system can Ье divided into three sections (see Fig. 3): import of new texts, verification of texts Ьу philologists and the use of various analysis methods both on а single text and for а group of texts. Text fragment lmport Database Fig. 3. The architecture scheme of the SMALT software system. As part of the text import process, the text is divided into sections, para­ graphs, sentences and words, as well as matching each word with its morpholog­ ical analysis. If the task of text separation is typical, then the task of comparing the morphological analysis is rather complicated. The proЬlem is both in the wide variety of spelling of the word (using pre-revolutionary graphics, а more flexiЫe dictionary allowing different spelling ofthe word), and in the need to take into account the context of the use of the word. At different times, algorithms for finding the first possiЫe variant, а frequently used variant and an algorithm based on n-grams were used to select the semantic analysis of the word. The latter has а great prospect due to the small number of subsequent corrections. As part of the text verification process, philologists perform correction oftext analysis (for example, combining or separating words), correction of morpholog­ ical analysis of а word, or creation of а new analysis. Using the web interface allows several specialists to work on the text at the same time. 333 During the analysis process, the SMALT information system provides re­ searchers with access to the accumulated database in various sections. For ex­ ample, one of the popular statistical characteristics is Kjetsaa metrics [15]. The SMALT information system calculates the characteristics of both а single work and а group of texts. Another objective of the analysis is to identify the causes of the results. For example, to identify the reasons for the separation of text fragments between different nodes of the decision tree. The SMALT informa­ tion system allows you to access the source data of the required fragment for subsequent linguistic analysis. 5 Conclusion When solving the proЫem of determining the author's style of Apollon Grigoriev, the proЫem of sampling imbalance often arises, i.e. when the number of objects of one class significantly exceeds the number of objects of another class (in this case, the objects are the texts of the analyzed authors). As а method for con­ structing an ensemЫe of classifiers we use Bagging (Bootstrap aggregating). The idea of this method is to train several models on random subsamples of the orig­ inal sample (using Bootstrap) with further averaging. The authors believe that it meets the meaning of the task better than Boosting. Analyzing decision trees built using Python 3.6 (libraries: scikit-learn-tree implementation, pandas-data reading), we can assume that the relative frequency of the "particle-adjective" Ьigram more than 6.5 is а distinctive feature of the journalistic style of Apollon Grigoriev. The oЬtained knowledge was used to study the authorship of the article "Po­ ems Ьу А. S. Khomyakov", а discussion about whose authorship in the literary criticism continues over the past twenty years. If we take the classification on the first node, then 6 of the 7 decision trees classify it as "Other", i.e. as not the text of Apollon Grigoriev. The obtained results were presented for further consideration to the spe­ cialists of the Department of Russian Language and the Department of Classic Philology, Russian Literature and Journalism (Petrozavodsk State University). Acknowledgements. This work was supported Ьу the Russian Foundation for Basic Research, project no. 18-012-90026. References 1. Batura, Т. V.: Formal methods for determining the authorship of texts. Novosiblrsk State University Bulletin. Series "Information Technology". Novosiblrsk 10(4), 81- 94 (2012) 2. Biihlmann, Р.: Bagging, Boosting and EnsemЫe Methods. In: Gentle J., Hardle W., Mori У. (eds) Handbook of Computational Statistics. Springer Handbooks of Computational Statistics. Springer, Berlin, Heidelberg (2012). https://doi.org/10.1007/978-3-642-21551-3_33 334 3. Calle-Martin, J., Miranda-Garcia, А.: Stylometry and Authorship Attribu­ tion: lntroduction to the Special lssue. English Studies 93(3), 251-258 (2012) https://doi.org/10.1080/0013838Х.2012.668788 4. Gurova, Е. 1.: Methods of Authorship Attribution in Contemporary National Philol­ ogy. The New Philological Bulletin 3(38), 29-44 (2016). 5. Farringdon, J. М.: Analyzing for Authorship / J. М. Farringdon with contributions Ьу Morton А. Q., Farringdon М. G., Baker М. D. Cardiff, University of Wales Press (1996). 6. Kjetsaa, G.: Attributed to Dostoevsky: The ProЫem of attributing to Dostoevsky anonymous articles in Time and Epoch. Oslo: Solum Forlag А. S. (1986) 7. Kotov, А. А., Mineeva, Z. 1., Rogov, А. А., Sedov, А. V., Sidorov, У. V.: Linguistic Corpuses. Petrozavodsk: PetrSU РuЫ. (2014) 8. Krawczyk, В.: Learning from imbalanced data: open challenges and fu­ ture directions. Progress in Artificial lntelligence 5(4), 221-232 (2016). https://doi.org/10.1007/s13748-016-0094-0 9. Rogov, А., Kulakov, К., Moskin, N.: Software support in solving the proЬ!em of text attribution. Software engineering 10(5), 234-240 (2019) https://doi.org/10.17587/prin.10.234-240 10. Rogov, А., Sedov, А., Sidorov, У., Surovceva, Т.: Mathematical methods for text attribution. Petrozavodsk, PetrSU РuЫ. (2014) 11. Romanov, А. S.: Methodology and software complex for identifying the author of an unknown text. Tomsk (2010) 12. Sidorov, У. V.: Mathematical and informational support of literary text processing methods based on formal grammatical parameters. Petrozavodsk (2002) 13. Stamatatos, Е.: А Survey of Modern Authorship Attribution Methods. Journal of the American Society for lnformation Science and Technology 60(3), 538-556 (2009) https://doi.org/10.1002/asi.21001 14. Zakharov, V.: Question about Khomyakov. ln: Zakharov, V. The name of the author is Dostoevsky. Essay on creativity. Moscow, lndrik, 231-247 (2013) 15. Zakharov, V.N., Rogov, А.А., Sidorov, У. V.: The proЫem of Dostoevsky gram­ matical constants search and anonymous and pseudonymous articles, puЬlished in "Time" and "Epoch" magazines (1861-1865) attribution. Works and Materials of "Russian Language Historical Destiny and the Present" lnternational Congress. Moscow, MSU, 404-405 (2001) 335