
Text Attribution in Case of Sampling Imbalance by the Method of Constructing an Ensemble of Classifiers Based on Decision Trees *

Alexander Rogov

rogov@petrsu.ru

Roman Abramov

Alexander

Petrozavodsk State University, Petrozavodsk, Russia


When solving the attribution problem, the question arises of how to determine the style of a writer who created fewer texts (both in number and in total word count) than the other analyzed authors. In this paper we consider possible solutions to this problem using the example of determining the style of Apollon Grigoriev. As a method for constructing an ensemble of classifiers we use Bagging (Bootstrap aggregating). The SMALT information system ("Statistical methods for analyzing literary texts") was used to determine the frequency characteristics of the texts, and Python 3.6 was used to build decision trees. As a result of the calculations, we can assume that a relative frequency of the "particle-adjective" bigram of more than 6.5 is a distinctive feature of the journalistic style of Apollon Grigoriev. We also studied the article "Poems by A. S. Khomyakov"; the study confirms the earlier conclusion that there is no reason to consider it as belonging to Apollon Grigoriev.

Text attribution · F. M. Dostoevsky · Apollon Grigoriev · "Poems by A. S. Khomyakov" · sampling imbalance · decision tree · software complex "SMALT"

Authorship identification of anonymous texts (text attribution) is one of the most urgent problems for the philological community; however, there are no universal mechanisms for its solution [10]. To answer such questions, literary scholars use methods that are often somewhat unusual for the humanities, including mathematical methods of analysis. One issue that is still far from a final resolution is the attribution of anonymous articles published in the magazines "Time" and "Epoch" (1861-1865). The authorship of some of these articles has been established, while the authorship of other materials causes a lot of controversy and discussion in the philological field [6]. The solution to this problem is additionally hampered by the uneven amount of available textual material: there are many articles written by F. M. Dostoevsky, while the other authors published in these journals (for example, A. Grigoriev, N. N. Strakhov, Ya. P. Polonsky, etc.) do not have as many texts that are unambiguously attributed to them.

* Supported by the Russian Foundation for Basic Research, project no. 18-012-90026.

The following mathematical methods are used to establish the authorship of works: neural networks, the QSUM method, decision trees, support vector machines (SVM), the k-means method, Bayesian classifiers, Markov chains, principal component analysis, discriminant analysis, genetic algorithms, statistical criteria (the χ² test, Student's t-test, the Kolmogorov-Smirnov criterion), etc. Among other data mining methods, decision trees are distinguished by the fact that they are easy to understand and interpret and do not require special preliminary data processing. Authors who have used mathematical methods to solve the problem of text attribution include Morton A. Q., Mendenhall T. C., Farringdon J. M., Efron B., Thisted R., Teahan W. J., Chaski C. E., Stamatatos E., Juola P., Peng R. D., Joachims T., Diederich J. J., Apte C., Lowe D., Matthews R., Tweedie F. J., de Vel O., Argamon S., Levitan S., Zheng R. [3], [5], [11], [13]. It should be noted that the Russian language differs significantly from English, so methods developed for the analysis of English texts are often not suitable for Russian.

When solving the problem of classification into two classes, the problem of sampling imbalance often arises, i.e. the number of objects of one class significantly exceeds the number of objects of the other class. In this case the first class is called the majority class and the second the minority class. On such samples classifiers become tuned to the objects of the majority class, i.e. high accuracy can be obtained without recognizing any objects of the minority class. When solving the attribution problem, the question arises of how to determine the style of a writer who created fewer texts (both in number and in total word count) than the other analyzed authors. We consider possible solutions to this problem using the example of determining the style of Apollon Grigoriev. The authors are not aware of any comparable studies of Russian-language texts except for the works of G. Kjetsaa and M. A. Marusenko [4], [10].
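To make the effect concrete, one can take the fragment counts reported later in this paper (118 fragments of the minority class and 899 of the majority class). A trivial classifier that assigns every fragment to the majority class, and therefore recognizes no minority fragment at all, would already reach

$$\mathrm{Accuracy} = \frac{899}{118 + 899} = \frac{899}{1017} \approx 0.884 .$$

This is only an illustration of why raw accuracy is a weak guide on imbalanced samples; it is not a result of the study.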

Construction and Analyzing Decision Trees

An overview of the types of sampling imbalance and the methods used in such cases can be found in [8]. In this work we use resampling, namely Undersampling. In this method the balance of the sample is achieved by removing objects of the majority class. The authors think that this method is more appropriate for the task than Oversampling (the balance is achieved by duplicating objects of the minority class) or SMOTE (the balance is achieved by generating new objects of the minority class).
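The difference between the first two strategies can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names and fragment identifiers are hypothetical, and only the class sizes (118 and 899) are taken from the paper.

```python
import random

def undersample(minority, majority, rng):
    """Balance by discarding random majority items (no item is used twice)."""
    return list(minority), rng.sample(majority, k=len(minority))

def oversample(minority, majority, rng):
    """Balance by adding random duplicates of minority items."""
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return list(minority) + extra, list(majority)

# Hypothetical fragment identifiers; only the class sizes come from the paper.
minority = [f"grigoriev_{i}" for i in range(118)]
majority = [f"other_{i}" for i in range(899)]

rng = random.Random(0)
under_min, under_maj = undersample(minority, majority, rng)
over_min, over_maj = oversample(minority, majority, rng)
print(len(under_min), len(under_maj))  # both classes have 118 items
print(len(over_min), len(over_maj))    # both classes have 899 items
```

Undersampling keeps only original fragments, which is one reason it is easy to explain; oversampling repeats minority fragments, so the same fragment can influence a tree several times.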

As a method for constructing an ensemble of classifiers we use Bagging (Bootstrap aggregating) [2]. The idea of this method is to train several models on random subsamples of the original sample (using Bootstrap) with subsequent averaging. The authors believe that it suits the task better than Boosting. In previous studies of the features of the journalistic style of F. M. Dostoevsky we found that decision trees built on bigrams reflect the author's style well. In those experiments the best results were shown by decision trees with a fragment size of 1000 words. The optimal step size for choosing the beginning of the next fragment is 100 words. The same parameters were used in this work. The SMALT information system ("Statistical methods for analyzing literary texts") developed at Petrozavodsk State University was used to determine the frequency characteristics [9]. Specialists in philology carried out grammatical markup of the texts, which took into account 14 parts of speech (noun, adjective, numeral, pronoun, adverb, category of state, verb, participle, gerund, preposition, conjunction, particle, modal word, interjection) and also made it possible to mark quotes, foreign words, introductory words, abbreviated words and non-linguistic symbols. A training data set was compiled (118 fragments by Apollon Grigoriev and 899 by the other authors). The texts from which the data were prepared are presented in Table 1. In this case the fragments of the texts of Apollon Grigoriev are objects of the minority class and all the others belong to the majority class. The texts are quite short (from 2000 to 7000 words).

Table 1. Source texts for analysis.
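The fragmentation scheme described above (windows of 1000 words whose starting points are 100 words apart) and the counting of part-of-speech bigrams can be sketched as follows. This is an assumption-laden illustration, not the SMALT implementation: the function names are hypothetical, the input is assumed to be a flat list of part-of-speech tags, and the feature is assumed to be the raw count of a bigram per fragment.

```python
from collections import Counter

def fragments(tags, size=1000, step=100):
    """Yield overlapping windows of `size` tags whose starts are `step` apart."""
    for start in range(0, len(tags) - size + 1, step):
        yield tags[start:start + size]

def bigram_counts(fragment):
    """Count adjacent part-of-speech pairs inside one fragment."""
    return Counter(zip(fragment, fragment[1:]))

# Toy tag sequence (hypothetical): a repeating 4-tag pattern, 1300 tags long.
pattern = ["particle", "adjective", "noun", "verb"]
tags = [pattern[i % 4] for i in range(1300)]

windows = list(fragments(tags))
first = bigram_counts(windows[0])
print(len(windows), first[("particle", "adjective")])  # prints: 4 250
```

With a step of 100 words a text of 2000 to 7000 words yields roughly 11 to 61 overlapping fragments, which is how a small number of texts produces a training set of several hundred objects.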

Python 3.6 was used to build decision trees (libraries: scikit-learn for the tree implementation, pandas for data reading). The original data set was divided into 7 parts. In each part all fragments of Apollon Grigoriev were taken as the class with label "1", and the same number of fragments of other authors, chosen at random, were taken as the class with label "0". Repetitions of fragments of other authors were not allowed.
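One possible reading of this partitioning is sketched below. It assumes that "repetitions were not allowed" means the majority fragments are disjoint across the 7 parts (7 × 118 = 826 of the 899 available); the function name and the fragment identifiers are hypothetical.

```python
import random

def balanced_parts(minority, majority, n_parts=7, seed=0):
    """Pair the full minority class with disjoint random majority chunks."""
    k = len(minority)
    if n_parts * k > len(majority):
        raise ValueError("not enough majority items for disjoint parts")
    rng = random.Random(seed)
    shuffled = rng.sample(majority, k=len(majority))  # shuffled copy
    parts = []
    for p in range(n_parts):
        chunk = shuffled[p * k:(p + 1) * k]
        items = list(minority) + chunk
        labels = [1] * k + [0] * k  # 1 = Apollon Grigoriev, 0 = other authors
        parts.append((items, labels))
    return parts

# Hypothetical identifiers; the class sizes follow the paper.
minority = [("grigoriev", i) for i in range(118)]
majority = [("other", i) for i in range(899)]
parts = balanced_parts(minority, majority)
print(len(parts), len(parts[0][0]))  # prints: 7 236
```

Each part is balanced (118 fragments per class), so a tree trained on it cannot reach high accuracy simply by ignoring the minority class.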

A decision tree was trained on each part of the data. Training continued until accuracy reached 100% (this determined the tree depth). A fragment of one of the trained trees is shown in Fig. 1. All trees formed an ensemble, and the decision was made by majority vote. Accuracy was calculated on the entire data set using the following formula:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad (1)$$

where TP is the number of true-positive, TN of true-negative, FP of false-positive and FN of false-negative predictions. The experimental results are presented in Table 2.

Table 2. Classifier accuracy.

Depth   Accuracy
1       0.8628
2       0.9592
3       0.9841
4       0.9891
5       0.992
6       0.9901
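The procedure (one tree per balanced part, majority vote, accuracy by Eq. (1) on the whole data set) can be sketched with scikit-learn, the library named in the paper. Everything else here is an assumption made for illustration: the data are synthetic Poisson counts, the class sizes (40 and 280) and the depth limit of 2 are arbitrary, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical): 3 count-like features,
# 40 minority rows (label 1) and 280 majority rows (label 0).
X_min = rng.poisson(lam=[9.0, 4.0, 5.0], size=(40, 3))
X_maj = rng.poisson(lam=[4.0, 4.0, 5.0], size=(280, 3))
X = np.vstack([X_min, X_maj])
y = np.array([1] * 40 + [0] * 280)

# One tree per balanced part: all minority rows plus a disjoint majority chunk.
order = rng.permutation(280)
trees = []
for p in range(7):
    idx = order[p * 40:(p + 1) * 40]
    X_part = np.vstack([X_min, X_maj[idx]])
    y_part = np.array([1] * 40 + [0] * 40)
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    trees.append(tree.fit(X_part, y_part))

# Majority vote: predict 1 when more than half of the trees say 1.
votes = np.sum([t.predict(X) for t in trees], axis=0)
y_pred = (votes > len(trees) / 2).astype(int)

# Accuracy from the confusion counts, as in Eq. (1).
tp = int(np.sum((y_pred == 1) & (y == 1)))
tn = int(np.sum((y_pred == 0) & (y == 0)))
fp = int(np.sum((y_pred == 1) & (y == 0)))
fn = int(np.sum((y_pred == 0) & (y == 1)))
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"accuracy = {accuracy:.4f}")
```

Repeating the loop with `max_depth` set to 1, 2, 3 and so on would give an accuracy-by-depth table of the same shape as Table 2, although the numbers for synthetic data have no relation to the reported ones.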

In total, 7 decision trees were built. A fragment of one of the trees is shown in Fig. 1. Note that on the third level there are two leaves that contain a small number of fragments (in total from 12 to 27, on average less than 8%). The possible inaccuracy of the source data should be taken into account: the texts of Apollon Grigoriev could have been edited by F. M. Dostoevsky. In addition, the parameters of an author's style vary slightly depending on external factors (such as mood, health, etc.). Therefore, when solving the problem of text attribution, one should limit oneself to the first level or at most the first two levels of the decision trees. As can be seen from Table 2, the accuracy of the ensemble at the second level (0.9592) already corresponds to an error rate below the generally accepted 5% significance level. Analyzing the decision trees in the ensemble, we note that in 4 of them the first attribute was the "particle-adjective" bigram with the condition "less than or equal to 6.5". In two cases the same attribute is found, but with a different threshold (less than or equal to 7.5). Only one tree had a different first attribute ("adjective-particle" less than or equal to 2.5). We can assume that a relative frequency of the "particle-adjective" bigram of more than 6.5 is a distinctive feature of the journalistic style of Apollon Grigoriev. The proposed algorithm thus makes it possible to solve the problem of text attribution.
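Thresholds such as 6.5, 7.5 and 2.5 are what one would expect if the attributes are integer bigram counts and the tree places each split halfway between two neighbouring observed values, as CART-style implementations such as scikit-learn do. The following pure-Python sketch of a single Gini-based split shows the mechanism; the data are a toy example, not values from the study, and the function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best 'value <= threshold' split.

    Candidate thresholds are midpoints between adjacent distinct values.
    """
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        thr = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= thr]
        right = [l for v, l in zip(values, labels) if v > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (thr, score)
    return best

# Toy bigram counts per fragment (hypothetical), label 1 = Apollon Grigoriev.
counts = [3, 4, 5, 6, 6, 7, 8, 9, 10, 12]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
threshold, impurity = best_threshold(counts, labels)
print(threshold, impurity)  # prints: 6.5 0.0
```

Under this reading, "less than or equal to 6.5" simply separates fragments with at most 6 occurrences of the bigram from fragments with 7 or more.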

[Fig. 1. Fragment of one of the trained decision trees. Recoverable node: "Abbreviated word - Abbreviated word ≤ 2.5", gini = 0.139, samples = 120, value = [111, 9], class = Other.]

The influence of the commonly used methods for processing unbalanced data ("UpSampling", "UnderSampling", "SMOTE") on the accuracy of classification of works by Apollon Grigoriev was analyzed.

The available data set was divided into a test sample (42 for Apollon Grigoriev, 310 for Other) and a training sample. The training sample was subjected to each of the techniques listed above to counter class imbalance. Then the quality measures ("Accuracy" and "roc-auc", the area under the ROC curve) were calculated on the test sample, which was the same for all three techniques. The results of the experiment are shown in Table 3.

This analysis showed approximately the same accuracy for all three methods, although UpSampling looks worse. The advantage of UnderSampling is that it is easier to explain; therefore, the authors decided to focus on it.
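The two measures used in this comparison can be sketched in plain Python. Accuracy depends on a decision threshold, whereas ROC AUC is computed from the scores directly: it equals the probability that a randomly chosen positive object receives a higher score than a randomly chosen negative one. The labels and scores below are a toy example, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Share of objects whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy test sample (hypothetical): 2 minority and 4 majority objects.
y_true = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [int(s >= 0.5) for s in scores]

print(round(accuracy(y_true, y_pred), 4))  # prints: 0.6667
print(roc_auc(y_true, scores))             # prints: 0.875
```

Because the test sample itself is imbalanced (42 against 310), reporting ROC AUC next to accuracy guards against a classifier that scores well only by favouring the majority class.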

Table 3. Experimental results (columns: Accuracy (test), roc-auc (test), Accuracy (training), roc-auc (training)).

When discussing the attribution of certain articles to certain authors, it should be noted that in some cases there is no unequivocal evidence relating an article to a particular author. In particular, one of the controversial and still unresolved issues is the article "Poems by A. S. Khomyakov", whose authorship has been debated in literary criticism for the past twenty years.

The article "Poems by A. S. Khomyakov" has long been attributed to Apollon Grigoriev. Recently, however, it has been considered a text authored by F. M. Dostoevsky [14]. It was interesting to check how our classifier would classify it. A text is attributed to the author to whom most of its fragments are assigned. Fig. 2 shows one of the resulting decision trees. If we take the classification at the first node, then 6 of the 7 decision trees classify the article as "Other", i.e. as not a text of Apollon Grigoriev. Only one tree produced a tie (5 fragments "for belonging" and 5 "against"). At the second-level split, 3 trees vote "for belonging", 3 "against", and one tree gives no classification. Our study confirms the earlier conclusion [14] that there is no reason to consider the article "Poems by A. S. Khomyakov" as belonging to Apollon Grigoriev.
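The decision rule described here has two stages: each tree labels every fragment of the disputed text and takes the majority label, and the verdicts of the trees are then tallied. A minimal sketch follows; the per-tree predictions are hypothetical toy values (only the 5-to-5 tie mirrors a case mentioned in the text), and the function name is an assumption.

```python
from collections import Counter

def tree_verdict(fragment_labels):
    """Majority label among one tree's fragment predictions, or 'tie'."""
    counts = Counter(fragment_labels)
    grig, other = counts["Grigoriev"], counts["Other"]
    if grig == other:
        return "tie"
    return "Grigoriev" if grig > other else "Other"

# Hypothetical predictions of three trees for 10 fragments of one text.
per_tree = [
    ["Other"] * 9 + ["Grigoriev"] * 1,
    ["Other"] * 8 + ["Grigoriev"] * 2,
    ["Other"] * 5 + ["Grigoriev"] * 5,
]
verdicts = [tree_verdict(p) for p in per_tree]
summary = Counter(verdicts)
print(verdicts)  # prints: ['Other', 'Other', 'tie']
```

Reporting ties separately, rather than forcing them into one class, keeps the tally honest when a tree gives no clear answer.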

The combination of the parts of speech "Particle" + "Adjective", which is so often encountered in two texts that definitely belong to Apollon Grigoriev (in transliteration from Russian, "Lermontov i ego napravlenie. Statya vtoraya" and "Oppoziciya zastoya. Cherty iz istorii mrakobesiya"), almost does not appear in the text of the controversial article "Poems by A. S. Khomyakov". While the author repeatedly uses this combination in the two indicated articles, in the disputed article it occurs only 10 times (the text consists of 2031 words); in six of these cases the particle is "ne" and in three cases "dazhe". Over large parts of the text no such combinations of parts of speech could be found, whereas in other articles belonging to A. Grigoriev the combination occurs more often and with a greater variety of particles: not only "ne" and "dazhe", but also "tolko", "to", "vse-taki" and "zhe" followed by an adjective. Of course, this observation alone is not enough to doubt the attribution of the text to A. Grigoriev; however, methods based on decision trees can help with the comprehensive analysis of texts in general, and of the article "Poems by A. S. Khomyakov" in particular, in the context of the attribution of journalistic texts.
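These counts can be related to the threshold found by the decision trees. Assuming that the threshold of 6.5 refers to the number of occurrences per 1000-word fragment, the average rate in the disputed article is

$$\frac{10}{2031} \times 1000 \approx 4.9 \ \text{occurrences per 1000 words} < 6.5 .$$

This is an average over the whole text; individual overlapping fragments may contain more or fewer occurrences, which is consistent with most, but not all, fragments being classified as "Other".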

[Fig. 2. One of the resulting decision trees for the fragments of the article "Poems by A. S. Khomyakov". Recoverable root node: "Particle-Adjective ≤ 6.5", samples = 10, value = [9, 1], class = Other.]

SMALT Information System

Specialized software is required for research in the field of text attribution. As an example, we note several software tools that are described in more detail in [1]: "Stileanalizator" (graphematic and statistical analysis, work with marked texts); "Avtoroved" (graphematic, morphological and statistical analysis); "Atributor" (statistical analysis); "Lingvoanalizator" (graphematic and statistical analysis).

The SMALT information system developed at Petrozavodsk State University [7], [9], [12] is designed for the collective work of various specialists with texts. The information system can be divided into three sections (see Fig. 3): import of new texts, verification of texts by philologists, and the use of various analysis methods both on a single text and for a group of texts.

[Fig. 3. Sections of the SMALT information system (diagram labels: Text fragment, Import, Database).]

As part of the text import process, the text is divided into sections, paragraphs, sentences and words, and each word is matched with its morphological analysis. While the task of text separation is typical, the task of matching the morphological analysis is rather complicated. The problem lies both in the wide variety of spellings of a word (the use of pre-revolutionary orthography requires a more flexible dictionary that allows different spellings of the word) and in the need to take into account the context in which the word is used. At different times, algorithms for finding the first possible variant or a frequently used variant, as well as an algorithm based on n-grams, were used to select the semantic analysis of the word. The latter is very promising because it requires only a small number of subsequent corrections.
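The three selection strategies mentioned above can be contrasted in a short sketch. It is not the SMALT code: it assumes that each word comes with a list of candidate part-of-speech tags, that the n-gram variant uses bigrams of tags, and that frequency tables have been collected from already verified texts. All names and counts are hypothetical.

```python
from collections import Counter

def first_variant(candidates):
    """Strategy 1: take the first analysis listed for the word."""
    return candidates[0]

def most_frequent(candidates, tag_freq):
    """Strategy 2: take the analysis that is most frequent overall."""
    return max(candidates, key=lambda tag: tag_freq[tag])

def by_bigram(candidates, prev_tag, bigram_freq):
    """Strategy 3: take the analysis that most often follows the previous tag."""
    return max(candidates, key=lambda tag: bigram_freq[(prev_tag, tag)])

# Hypothetical counts collected from already verified texts.
tag_freq = Counter({"noun": 50, "verb": 30, "adjective": 20})
bigram_freq = Counter({
    ("particle", "adjective"): 12,
    ("particle", "noun"): 3,
    ("particle", "verb"): 7,
})

candidates = ["noun", "adjective"]  # ambiguous word with two possible tags
print(first_variant(candidates))                       # prints: noun
print(most_frequent(candidates, tag_freq))             # prints: noun
print(by_bigram(candidates, "particle", bigram_freq))  # prints: adjective
```

Only the third strategy uses the context of the word, which is a plausible reason why it would need fewer manual corrections afterwards.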

As part of the text verification process, philologists correct the text analysis (for example, by combining or separating words), correct the morphological analysis of a word, or create a new analysis. The web interface allows several specialists to work on a text at the same time.

During the analysis process, the SMALT information system provides researchers with access to the accumulated database in various sections. For example, one of the popular statistical characteristics is the Kjetsaa metrics [15]. The SMALT information system calculates the characteristics of both a single work and a group of texts. Another objective of the analysis is to identify the causes of the results, for example the reasons for the separation of text fragments between different nodes of the decision tree. The SMALT information system gives access to the source data of the required fragment for subsequent linguistic analysis.

Conclusion

When solving the problem of determining the author's style of Apollon Grigoriev, the problem of sampling imbalance arises, i.e. the number of objects of one class significantly exceeds the number of objects of the other class (in this case, the objects are the texts of the analyzed authors). As a method for constructing an ensemble of classifiers we use Bagging (Bootstrap aggregating). The idea of this method is to train several models on random subsamples of the original sample (using Bootstrap) with subsequent averaging. The authors believe that it suits the task better than Boosting. Analyzing decision trees built using Python 3.6 (libraries: scikit-learn for the tree implementation, pandas for data reading), we can assume that a relative frequency of the "particle-adjective" bigram of more than 6.5 is a distinctive feature of the journalistic style of Apollon Grigoriev.

The obtained knowledge was used to study the authorship of the article "Poems by A. S. Khomyakov", whose authorship has been debated in literary criticism for the past twenty years. If we take the classification at the first node, then 6 of the 7 decision trees classify it as "Other", i.e. as not a text of Apollon Grigoriev.

The obtained results were presented for further consideration to the specialists of the Department of Russian Language and the Department of Classic Philology, Russian Literature and Journalism (Petrozavodsk State University).

Acknowledgements. This work was supported by the Russian Foundation for Basic Research, project no. 18-012-90026.

1. Batura, T. V.: Formal methods for determining the authorship of texts. Novosibirsk State University Bulletin. Series "Information Technology". Novosibirsk 10(4), 81-94 (2012)

2. Bühlmann, P.: Bagging, Boosting and Ensemble Methods. In: Gentle J., Härdle W., Mori Y. (eds) Handbook of Computational Statistics. Springer Handbooks of Computational Statistics. Springer, Berlin, Heidelberg (2012). https://doi.org/10.1007/978-3-642-21551-3_33

3. Calle-Martin, J., Miranda-Garcia, A.: Stylometry and Authorship Attribution: Introduction to the Special Issue. English Studies 93(3), 251-258 (2012). https://doi.org/10.1080/0013838X.2012.668788

4. Gurova, E. I.: Methods of Authorship Attribution in Contemporary National Philology. The New Philological Bulletin 3(38), 29-44 (2016)

5. Farringdon, J. M.: Analyzing for Authorship / J. M. Farringdon with contributions by Morton A. Q., Farringdon M. G., Baker M. D. Cardiff, University of Wales Press (1996)

6. Kjetsaa, G.: Attributed to Dostoevsky: The Problem of attributing to Dostoevsky anonymous articles in Time and Epoch. Oslo: Solum Forlag A. S. (1986)

7. Kotov, A. A., Mineeva, Z. I., Rogov, A. A., Sedov, A. V., Sidorov, Y. V.: Linguistic Corpuses. Petrozavodsk: PetrSU Publ. (2014)

8. Krawczyk, B.: Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence 5(4), 221-232 (2016). https://doi.org/10.1007/s13748-016-0094-0

9. Rogov, A., Kulakov, K., Moskin, N.: Software support in solving the problem of text attribution. Software Engineering 10(5), 234-240 (2019). https://doi.org/10.17587/prin.10.234-240

10. Rogov, A., Sedov, A., Sidorov, Y., Surovceva, T.: Mathematical methods for text attribution. Petrozavodsk, PetrSU Publ. (2014)

11. Romanov, A. S.: Methodology and software complex for identifying the author of an unknown text. Tomsk (2010)

12. Sidorov, Y. V.: Mathematical and informational support of literary text processing methods based on formal grammatical parameters. Petrozavodsk (2002)

13. Stamatatos, E.: A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology 60(3), 538-556 (2009). https://doi.org/10.1002/asi.21001

14. Zakharov, V.: Question about Khomyakov. In: Zakharov, V. The name of the author is Dostoevsky. Essay on creativity. Moscow, Indrik, 231-247 (2013)

15. Zakharov, V. N., Rogov, A. A., Sidorov, Y. V.: The problem of Dostoevsky grammatical constants search and anonymous and pseudonymous articles, published in "Time" and "Epoch" magazines (1861-1865) attribution. Works and Materials of "Russian Language Historical Destiny and the Present" International Congress. Moscow, MSU, 404-405 (2001)