=Paper= {{Paper |id=Vol-2790/paper29 |storemode=property |title= Text Attribution in Case of Sampling Imbalance by the Method of Constructing an Ensemble of Classifiers Based on Decision Trees |pdfUrl=https://ceur-ws.org/Vol-2790/paper29.pdf |volume=Vol-2790 |authors=Alexander Rogov,Roman Abramov,Alexander Lebedev,Kirill Kulakov,Nikolai Moskin |dblpUrl=https://dblp.org/rec/conf/rcdl/RogovALKM20 }} == Text Attribution in Case of Sampling Imbalance by the Method of Constructing an Ensemble of Classifiers Based on Decision Trees == https://ceur-ws.org/Vol-2790/paper29.pdf
Text Attribution in Case of Sampling Imbalance
Ьу the Method of Constructing an EnsemЫe of
     Classifiers Based оп Decision Тrees *

                  Alexander Rogov 1 , Roman Abramov 1 , Alexander
      Lebedevl [оооо-0001-9939-9389], Kirill Kulakovl [оооо-0002-озо5-419Х], and
                        Nikolai Mosk in1 [оооо-0001-5556-5349]
                Petrozavodsk State University, Petrozavodsk , Russia
        rogov@petrsu.ru, monset008@gmail.com, perevodchik88@yandex.ru,
                    kulakov@cs.karelia.ru, moskin@petrsu.ru
                                https:/ /petrsu.ru/



        Abstract. When solving the attribution proЫem, the question of de­
        termining the author's style of а writer who created а smaller number of
        texts (both quantitatively and in terms of the total number of words) in
        comparison with other analyzed authors arises. ln this paper we consider
        possiЫe solutions to this proЫem Ьу the example of determining the style
        of Apollon Grigoriev. As а method for constructing an ensemЫe of clas­
        sifiers we use Bagging (Bootstrap aggregating). The SMALT information
        system ("Statistical methods for analyzing literary texts") was used to
        determine the frequency characteristics of the texts and Python 3.6 was
        used to build decision trees. As а result of calculations we сап assume
        that the relative frequency of the "particle-adjective" Ьigram more than
        6.5 is а distinctive feature of the journalistic style of Apollon Grigoriev.
        There also was а study of the article "Poems Ьу А. S. Khomyakov", which
        confirms the previously conclusion that there is по reason to consider it
        as belonging to Apollon Grigoriev.

         Keywords: Text attribution · F. М. Dostoevsky · Apollon Grigoriev
         · Poems Ьу А. S. Khomyakov · sampling imbalance · decision tree ·
         software complex "SMALT".


1      Introduction

Authorship identification of anonymous texts ( attribution of texts) is one of most
urgent proЫem for the philological community; however, there are no universal
mechanisms for its solution [10]. Specialists in study of literature use methods
that are often somewhat unusual for the humanitarian sphere to answer such
questions, including mathematical methods of analysis. One of the issues, which
is far from its final decision, is the affiliation of anonymous articles puЬlished
in the magazines "Time" and "Epoch" (1861-1865). The authorship of some
* Supported Ьу the Russian Foundation for Basic Research, project по. 18-012-90026.



    Copyright © 2020 for this paper Ьу its authors. Use permitted under Creative
    Commons License Attribution 4.0 Intemational (СС ВУ 4.0).




                                           326
of these articles has been estaЫished, while the authorship of other materials
causes а lot of controversy and discussion in the philological field [6]. The solution
to this proЫem is additionally hampered Ьу the uneven amount of availaЫe
textual material: there are many articles owned Ьу F. М. Dostoevsky, while the
remaining authors puЬlished in these journals (for example, А. Grigoriev, N. N.
Strakhov, Уа. Р. Polonsky, etc.), don't have so many texts that are uniquely
attributed to them.
    The following mathematical methods are used to estaЫish authorship of
works: neural networks, QSUM method, decision trees, support vector machine
(SVG), k-means method, Bayesian classifier, Markov chains, principal compo­
nent analysis, discriminant analysis, genetic algorithms, statistical criteria (х 2
test, Student's t-test, Kolmogorov-Smirnov criterion), etc. Among other meth­
ods of data mining, decision trees are distinguished Ьу the fact that they are
easy to understand and interpret and also do not require special preliminary
data processing. Note some authors who used mathematical methods to solve
the proЬlem of text attribution: Morton А. Q., Mendenhall Т. С., Farringdon
J. М., Efron В., Thisted R., Teahan W. J., Chaski С. Е., Stamatatos Е., Juola
Р., Peng R. D., Joachims Т., Diederich J. J., Apte С., Lowe D., Matthews R.,
Tweedie F. J., de Vel О., Argamon S., Levitan S., Zheng R. [3], [5], [11], [13]. It
should Ье noted that Russian language differs significantly from English, so the
methods of analysis of texts in English is often not suitaЫe for Russian language.
    When solving the proЬlem of classification into two classes, the proЬlem of
sampling imbalance often arises, i.e. when the number of objects of one class
significantly exceeds the number of objects of another class. In this case the first
class is called the majority class and the second class is called the minority class.
ln such samplings classifiers are configured for objects of the majority class, i.e.
high accuracy of the classifier can Ье oЬtained without selecting objects of the
minority class. When solving the attribution proЬlem, the question of determin­
ing the author's style of а writer who created а smaller number of texts (both
quantitatively and in terms of the total number of words) in comparison with
other analyzed authors arises. Let's consider possiЫe solutions to this proЬlem
Ьу the example of determining the style of Apollon Grigoriev. The authors do
not know any analogs of such research of Russian-language texts except for the
works of G. Kjetsaa and М. А. Marusenko [4], [10].


2    Construction and Analyzing Decision Тrees

An overview of the types of sampling imbalance and the methods used in such
cases can Ье found in [8]. In this work we will use sampling, namely Undeтsam­
pling. ln this method the balance of sampling elements is achieved Ьу removing
objects of the majority class. The authors think that this method is more ap­
propriate for the task than Oveтsampling (the sampling balance is achieved Ьу
duplicating objects of the minority class) or SMOTE (Ьу generating new objects
of the minority class).




                                        327
    As а method for constructing an ensemЫe of classifiers we use Bagging (Boot­
strap aggregating) [2]. The idea of this method is to train several models on ran­
dom subsamples of the original sample (using Bootstrap) with further averaging.
The authors believe that it meets the meaning of the task better than Boosting.
During previous studies in determining the features of the journalistic style of F.
М. Dostoyevsky we found that the constructed decision trees based on Ьigrams
well refiect the author's style. ln the experiments the best results were shown
Ьу decision trees with а fragment size of 1000 words. The optimal step size for
choosing the beginning of the next fragment is 100 words. The same parameters
were used in this work. The SMALT information system (" Statistical meth­
ods for analyzing literary texts") developed at Petrozavodsk State University
was used to determine the frequency characteristics [9]. Specialists in philology
carried out grammatical markup of texts, which took into account 14 parts of
speech (noun, adjective, numeral, pronoun, adverb, category of state, verb, par­
ticiple, gerund, preposition, conjunction, particle, modal word, interjection) and
also allowed to mark the quotes, foreign words, introductory words, abbreviated
words and non-linguistic symbols. А set of data for training was compiled (118
fragments - Apollon Grigoriev, 899 - the rest). The texts from which the data
were prepared are presented in ТаЫе 1. In this case fragments of the texts of
Apollon Grigoriev are objects of the minority class and all the others are from
the majority class. The text size is quite small (from 2000 to 7000 words).


                        ТаЫе 1. Source texts for analysis.


  Name                                                            Author
  Pismo k redaktoru                                               У. Р. Polonsky
  Zhukovskij i romantizm                                          F. М. Dostoevsky
  Literaturnaya isterika                                          F. М. Dostoevsky
  Odin iz proektov chudesnago obogasheniya Rossii                 1. N. Shill
  Pismo k izdatelyu "Vremeni"                                     У. Р. Polonsky
  Podpiska na 1863 god                                            М. М. Dostoevsky
  Ryad statej о russkoj literature. Vvedenie                      F. М. Dostoevsky
  Slavyanofily, chernogorcy i zapadniki                           F. М. Dostoevsky
  Ryad statej о russkoj literature. G. -bov i vopros оЬ iskusstve F. М. Dostoevsky
  Knizhnost i gramotnost. Statya pervaya                          F. М. Dostoevsky
  Knizhnost i gramotnost. Statya vtoraya                          F. М. Dostoevsky
  Poslednie literaturnye yavleniya. Gazeta "Den"                  F. М. Dostoevsky
  Neobhodimoe literaturnoe obyasnenie, ро povodu ra...            F. М. Dostoevsky
  Politicheskoe obozrenie                                         А. А. Golovachev
  Lermontov i ego napravlenie. Statya vtoraya                     А. Grigoriev
  Oppoziciya zastoya. Cherty iz istorii mrakobesiya               А. Grigoriev
  Nashi domashnie dela                                            А. U. Poretsky
  Durnye priznaki                                                 N. N. Strakhov
  Eshe о Peterburgskoj literature                                 N. N. Strakhov
  Vsyo-li na Rusi tak ploho, kak kazhetsya?                       V. Р. Mesherskij




                                       328
    Python 3.6 was used to build decision trees (libraries: scikit-learn - for tree
implementation, pandas - for data reading). The original data set was divided
into 7 parts. In each part all fragments of Apollon Grigoriev were taken as а
class with а label "1", the sаше number of fragшents of other authors were
taken randomly as а class with а label "О". Repetitions of fragments of other
authors were not allowed.
    А decision tree was trained on each part of data. The training continued until
accuracy reached 100% (tree depth). The fragшent of one of the trained trees is
shown in Fig. 1. All trees formed an еnsешЫе. The decision was accepted Ьу а
шajority vote. Accuracy was calculated on the entire data set using the following
formula:

                                      TP+TN
                      Accuracy = --------­                                      (1)
                                 TP + TN +FP+FN'
   where ТР is true-positive, ТN is true-negative, F Р is false-positive and
F N is false-negative predicted class. The experimental results are presented in
ТаЫе 2.

                           ТаЫе 2. Classifier accuracy


                                 Depth Accuracy
                                 1     0,8628
                                 2     0,9592
                                 3     0,9841
                                 4     0,9891
                                 5     0,992
                                 6     0,9901




    In total 7 decision trees were built. А fragment of one of the trees is shown
in Fig. 1. Note that on the third level there are two leaves that contain а small
number of fragшents (summary from 12 to 27, on average less than 8%). You
should take into account the possiЫe inaccuracy of the source data. The texts
of Apollon Grigoriev could Ье edited Ьу F. М. Dostoevsky. In addition there is
а slight volatility in the paraшeters of the author's style depending on external
factors (such as шооd, health status, etc). Therefore, when solving the рrоЫеш
of text attribution, you should limit yourself to the first level or at most the
first two levels of decision trees. As you can see from ТаЫе 2, the accuracy of
the еnsешЫе at the second level already falls into the generally accepted 5%
significance level. Analyzing the decision trees contained in the еnsешЫе, it can
Ье noted that in 4 of them the first attribute was the "particle-adjective" Ьigram
less than or equal to 6.5. In two cases the sаше attribute is found, but with а
different threshold (less than or equal to 7.5). Only one tree had а different
first attribute ("adjective-particle") less than or equal to 2.5. We can assume




                                       329
that the relative frequency of the "particle-adjective" Ьigram more than 6.5 is а
distinctive feature of the journalistic style of Apollon Grigoriev. The proposed
algorithm allows to solve the proЫem of text attribution.



                                                                                                          P1111icle Adjecti\'e s7.5
                                                                                                                  gini = 0.5
                                                                                                              samples = 236
                                                                                                            value = [11В, 118I
                                                                                                                class = Other
                                                                                                     T,ue                       False

                                                                                  AdjecU\·e lntcrje<:Dofl S 0.5               Conj1mc1ion Particlc S 11.5
                                                                                       gini = 0.23                                  gini = 0.121
                                                                                    samples = 128                                 samples = 10В
                                                                                   value = (111, 17]                              value = [7.1011
                                                                                    class = Other                             class = ApoUon Grigoricv



                                                                                    gini = О.О
                                                                                   samples = 8                                                                            giпi = О.О
                                                                                  value = [О, BJ                                                                        samples = б
                                                                              class = ApoUon Grigoricv                                                                  value = [6, О]
                           Abbrcтiatcd \\'ord Abbrcтiaтed ,vord S 2.5                                                          Prqюsitioп Nou.п s: 76.5
                                                                                     1 89 О                                                                             class = 0tl1c-r
                                    gini = 0.139                                                                                   gini = 0.019
                                                                                     1-89-1                                                                                О 13 1
                                  samples = 120                                                                                   samples = 102
                                                                                     1-89-2                                                                                0-37-9
                                  value = [111, 9]                                                                               value = [1, 101]
                                  class = Other
                                                                                          Свэ-з                                class = АроUоп Grigoric-,r
                                                                                                                                                                           0-39-1
                                                                                         1 -89 -60                                                                        О -76 -29
                                                                                         Свэ-61                                                                           0-77-25
                                                                                         Свэ-62                                                                           О-В7-32
                                                                                         1-89-63



                                                                                                                            gini = О.О
                 NotщGm.u.1d s5.5                           Pa.t1iclc Nощ1 s9.5                                          samples = 101
                 gini = 0.019                                 gini = 0.49                                                value = [О, 1011
               samples = 106                                samples = 14                                              class = АроUоп Grigorit"'.•
               value = [105, 1]                             value = fб, 81                                                   1 89 4
                class = Other                        class = ApoUon Grigorie,,•                                              1-89-5
                                                                                                                             1-89-6
                                                                                                                             1-В9- 7
   gini = О.О                                                                                                                1-В9-8
samples = 105                                                                                                                1-89-9
                                                                                                                            1 -В9 -10                        gini = О.О
value = [105, О]                                                                   gini = О.О                                                               samples = 1
class = Other                 gini = О.О                 gini = О.О               samples = 8                               1-89-11
                                                                                                                            1-89-12                         value = [1, О]
    О 23 О                  samples = 1                samples = 6               value = ro, 81                                                             class = 0t11c-r
    0-23-1                  value = [О, 11             value = [6, О]        class = АроUоп Grigoric-,r                     1-В9-13
                                                                                                                            1-89-14                           0_92_44
   0 -23 -11           class = Apolloп Grigoriev       class = 0tl1c-r             1 89 24
   0-23-12                    1_89_58                     О 37 5                   1-89-25                                  1-89-15
   0-23-24                                               О 42 -23                   1-89-30                                 1-В9-16
   0-23-50                                               0-42-98                    1-89-36
   0-23-52                                              О 42 -101                   1-89-37
   0=23=53                                                0 43 2                    1-89-38
                                                         О -7 7 -80                1-89-39
                                                                                   1=89=44




                                              Fig. 1. А fragment of one of the trees




    The influence of the universally accepted methods for processing unbalanced
data "UpSampling", "UnderSampling", "SMOTE" on the accuracy of classifi­
cation of works Ьу Apollon Grigoriev was analyzed.
   The availaЫe data set was divided into test (42 - Apollon Grigoriev, 310
- Other) and training samples. The training sample was subjected to the tech­
niques listed above to confront class imbalance. Then the accuracy ("Accuracy",
"roc-auc" curve) was calculated on а test sample, which was the same for all three
techniques. The results of the experiment are shown in ТаЫе 3.
   This analysis showed approximately the same accuracy of all three methods.
UpSampling looks worse. The advantage of UnderSampling is that it is easier to
explain. Therefore, the authors decided to focus on it.




                                                                                  330
                           ТаЫе 3. Experimental results


Depth         Accuracy (test) roc-auc (test) Accuracy (training) roc-auc (training)
UnderSampling
1             0,838068182     0,87718894 0,927711                0,927710843
2             0,920454545     0,923963134 0,975904               0,975903614
3             0,934659091     0,942319508 0,993976               0,993975904
4             0,946022727     0,948771121 1                      1
UpSampling
1             0,838068182     0,87718894 0,90938                 0,909379968
2             0,90625         0,874731183 0,961844               0,961844197
3             0,931818182     0,889247312 0,980922               0,980922099
4             0,940340909     0,894086022 0,992846               0,992845787
5             0,963068182     0,906989247 0,99682                0,99682035
6             0,96875         0,910215054 0,999205               0,999205087
7             0,965909091     0,898310292 1                      1
SMOTE
1             0,838068182     0,87718894 0,90938                 0,909379968
2             0,931818182     0,930414747 0,964229               0,964228935
3             0,980113636     0,957834101 0,983307               0,983306836
4             0,988636364     0,983256528 0,996025               0,996025437
5             0,980113636     0,957834101 0,99841                0,998410175
6             0,985795455     0,971351767 1                      1



3    Analysis of "Poems Ьу А. S. Khomyakov"

When discussing the affiliation of certain articles to certain authors, it should Ье
noted, that in some cases there is no unequivocal evidence relating this article
to а particular author. In particular, one of the controversial and still unresolved
issues is the article "Poems Ьу А. S. Khomyakov" а discussion about whose
authorship in the literary criticism continues over the past twenty years.
    The work of "Poems Ьу А. S. Khomyakov" has long been attributed to Apol­
lon Grigoriev. However, recently it has been considered the copyright text of F.
М. Dostoevsky [14]. It was interesting to check where our classifier will take it.
The text will Ье attributed to the author that most of the text fragments belong
to. Fig. 2 shows one of the resulting decision trees. If we take the classification
on the first node, then 6 of the 7 decision trees classify it as "Other", i.e. as not
the text of Apollon Grigoriev. Only on one tree, there was an equality (5 frag­
ments "for belonging" and 5 "against"). During the split on the second level 3
"for belonging", 3 "against" and in one rejection of the classification. Our study
confirms the earlier conclusion [14] that there is no reason to consider the article
"Poems Ьу А. S. Khomyakov" as belonging to Apollon Grigoriev.
    The combination of parts of the speech "Particle" + "Adjective" that is
so often encountered in two texts precisely belonging to Apollon Grigoriev (in
transliteration from Russian "Lermontov i ego napravlenie. Statya vtoraya" and




                                        331
"Oppoziciya zastoya. Cherty iz istorii mrakobesiya"), almost does not appear in
the text of the controversial article "Poems Ьу А. S. Khomyakov". The author
repeatedly uses this comЬination in the two indicated articles, then in the desired
article it occurs only 10 times (the text consists of 2031 words), in six cases of
which it is а "ne" particle, and in three cases - а "dazhe" particle; over large
parts of text, such comЬinations of parts of speech could not Ье found (while
in other articles belonging to А. Grigoriev, such comЬination is found more
often and more diverse in terms of emerging types of particles - not only "ne"
and "dazhe", but also "tolko", "to", "vse-taki", "zhe" followed Ьу the adjective.
Of course, this observation alone is not enough to douЬt А. Grigoryev's text
attribution, however, the application of methods based on decision trees can
help with comprehensive analysis of texts in general, and the article "Poems Ьу
А. S. Khomyakov" in the context of the issue of the attribution of journalistic
texts.




                                                                                                    Partick Adjccti..-c: S 6.5
                                                                                                          gini =0.5
                                                                                                       samples = 10
                                                                                                       value = [9, 1]
                                                                                                        class = Other
                                                                                             True                           False

                                                                            Adj�tiп: Partick :s; 5.5                    Conjщ1ctio11 Partick :s; 12.5
                                                                                gini =0.104                                    gini =0.2GB
                                                                                samples =7                                     samples =3
                                                                               value = (7, О]                                 value = (2, 11
                                                                                class = Othcr                             class = Apollon Grigoric:\'



                                                                                                                                                                             gini =О.О
                                       P1·ф0sitio11 Аdп!'Ь..:::; 3.5                                                       Adjc:cti..-c: Nш11c:!'11l :s; 1.5
                                                                                  gini =О.О                                                                                samples = 2
                                            gini = 0.055                                                                        gini = 0.082
                                                                                samples =О                                                                                 value = f2. 01
                                           samples = 7                                                                          samples = 1
                                                                                value = [О, О]                                                                            class = Other
                                           value = f7, О]                                                                      value = (О, 1)
                                                                            class = Apollon Grigoric:Y                                                                       1 124 7
                                           class = 0tl1ei·                                                                 class = Apollo11 Grigol'ic:Y
                                                                                                                                                                             1=124=8



     Participlc: Modal "'ord :s; 0.5                                                                        Nшпeral Modal \\·ord :s; 0.5
                                                          gini = О.О                                                                                              gini = О.О
           gini =0.019                                                                                           gini =0.018
                                                         samples = О                                                                                             samples = О
           samples =7                                                                                           samples = 1
                                                        value = ro. Ol                                                                                          value = [О, OJ
          value = [7, О]                                                                                        value = [О, 1]
                                                    class = Apollo11 Grigoric:Y                                                                                c1ass = Otl1er
         class = Other                                                                                       class = ApoUou Grigorie,.·



      gir1i = О.О
    samples =7
    value = [7. О]
    ctass = Otl1er              Prqюsitiш1 Nш11eral � 3.0                                   gini =О.О
                                                                                                                              gini = О.О
      1 124 О                          gini =0.5                                          samples =1
                                                                                                                            samples = О
       1-124-1                       samples =О                                           value = Ю, 11
                                                                                                                            value = [О, О]
       С124-2                       value = [О, OJ                                   class = Apollon GrigoritY
                                                                                                                            class = Other
       1-124-4                     class = Other                                            1_124_3
       С124-5
       1-124-6
       С124=9



                      gini =О.О
                     samples =О
                    value = [О, О]
                class = Apollon Grigoritv




Fig. 2. One of the trees of the classifier in the analysis of the work "Poems Ьу А. S.
Khomyakov".




                                                                                    332
4     SMALT Information System
Specialized software is required for research in the field of text attribution. As an
example, we note several software tools that are described in more detail in [1]:
     "Stileanalizator" (graphematic and statistical analysis, work with marked
     texts);
     "Аvtoroved" (graphematic, morphological and statistical analysis);
     "Atributor" (statistical analysis);
     "Lingvoanalizator" (graphematical and statistical analysis).
    The SMALT information system developed at Petrozavodsk State Univer­
sity [7], [9], [12] is designed for the collective work of various specialists with
texts. The information system can Ье divided into three sections (see Fig. 3):
import of new texts, verification of texts Ьу philologists and the use of various
analysis methods both on а single text and for а group of texts.



       Text
    fragment                   lmport




                                                    Database




               Fig. 3. The architecture scheme of the SMALT software system.


    As part of the text import process, the text is divided into sections, para­
graphs, sentences and words, as well as matching each word with its morpholog­
ical analysis. If the task of text separation is typical, then the task of comparing
the morphological analysis is rather complicated. The proЬlem is both in the
wide variety of spelling of the word (using pre-revolutionary graphics, а more
flexiЫe dictionary allowing different spelling ofthe word), and in the need to take
into account the context of the use of the word. At different times, algorithms
for finding the first possiЫe variant, а frequently used variant and an algorithm
based on n-grams were used to select the semantic analysis of the word. The
latter has а great prospect due to the small number of subsequent corrections.
    As part of the text verification process, philologists perform correction oftext
analysis (for example, combining or separating words), correction of morpholog­
ical analysis of а word, or creation of а new analysis. Using the web interface
allows several specialists to work on the text at the same time.




                                           333
    During the analysis process, the SMALT information system provides re­
searchers with access to the accumulated database in various sections. For ex­
ample, one of the popular statistical characteristics is Kjetsaa metrics [15]. The
SMALT information system calculates the characteristics of both а single work
and а group of texts. Another objective of the analysis is to identify the causes
of the results. For example, to identify the reasons for the separation of text
fragments between different nodes of the decision tree. The SMALT informa­
tion system allows you to access the source data of the required fragment for
subsequent linguistic analysis.


5    Conclusion

When solving the proЫem of determining the author's style of Apollon Grigoriev,
the proЫem of sampling imbalance often arises, i.e. when the number of objects
of one class significantly exceeds the number of objects of another class (in this
case, the objects are the texts of the analyzed authors). As а method for con­
structing an ensemЫe of classifiers we use Bagging (Bootstrap aggregating). The
idea of this method is to train several models on random subsamples of the orig­
inal sample (using Bootstrap) with further averaging. The authors believe that
it meets the meaning of the task better than Boosting. Analyzing decision trees
built using Python 3.6 (libraries: scikit-learn-tree implementation, pandas-data
reading), we can assume that the relative frequency of the "particle-adjective"
Ьigram more than 6.5 is а distinctive feature of the journalistic style of Apollon
Grigoriev.
    The oЬtained knowledge was used to study the authorship of the article "Po­
ems Ьу А. S. Khomyakov", а discussion about whose authorship in the literary
criticism continues over the past twenty years. If we take the classification on
the first node, then 6 of the 7 decision trees classify it as "Other", i.e. as not the
text of Apollon Grigoriev.
    The obtained results were presented for further consideration to the spe­
cialists of the Department of Russian Language and the Department of Classic
Philology, Russian Literature and Journalism (Petrozavodsk State University).

Acknowledgements. This work was supported Ьу the Russian Foundation for
Basic Research, project no. 18-012-90026.


References

1. Batura, Т. V.: Formal methods for determining the authorship of texts. Novosiblrsk
   State University Bulletin. Series "Information Technology". Novosiblrsk 10(4), 81-
   94 (2012)
2. Biihlmann, Р.: Bagging, Boosting and EnsemЫe Methods. In: Gentle J.,
   Hardle W., Mori У. (eds) Handbook of Computational Statistics. Springer
   Handbooks of Computational Statistics. Springer, Berlin, Heidelberg (2012).
   https://doi.org/10.1007/978-3-642-21551-3_33




                                        334
3. Calle-Martin, J., Miranda-Garcia, А.: Stylometry and Authorship Attribu­
   tion: lntroduction to the Special lssue. English Studies 93(3), 251-258 (2012)
   https://doi.org/10.1080/0013838Х.2012.668788
4. Gurova, Е. 1.: Methods of Authorship Attribution in Contemporary National Philol­
   ogy. The New Philological Bulletin 3(38), 29-44 (2016).
5. Farringdon, J. М.: Analyzing for Authorship / J. М. Farringdon with contributions
   Ьу Morton А. Q., Farringdon М. G., Baker М. D. Cardiff, University of Wales Press
   (1996).
6. Kjetsaa, G.: Attributed to Dostoevsky: The ProЫem of attributing to Dostoevsky
   anonymous articles in Time and Epoch. Oslo: Solum Forlag А. S. (1986)
7. Kotov, А. А., Mineeva, Z. 1., Rogov, А. А., Sedov, А. V., Sidorov, У. V.: Linguistic
   Corpuses. Petrozavodsk: PetrSU РuЫ. (2014)
8. Krawczyk, В.: Learning from imbalanced data: open challenges and fu­
   ture directions. Progress in Artificial lntelligence 5(4), 221-232 (2016).
   https://doi.org/10.1007/s13748-016-0094-0
9. Rogov, А., Kulakov, К., Moskin, N.: Software support in solving the
   proЬ!em of text attribution. Software engineering 10(5), 234-240 (2019)
   https://doi.org/10.17587/prin.10.234-240
10. Rogov, А., Sedov, А., Sidorov, У., Surovceva, Т.: Mathematical methods for text
   attribution. Petrozavodsk, PetrSU РuЫ. (2014)
11. Romanov, А. S.: Methodology and software complex for identifying the author of
   an unknown text. Tomsk (2010)
12. Sidorov, У. V.: Mathematical and informational support of literary text processing
   methods based on formal grammatical parameters. Petrozavodsk (2002)
13. Stamatatos, Е.: А Survey of Modern Authorship Attribution Methods. Journal
   of the American Society for lnformation Science and Technology 60(3), 538-556
   (2009) https://doi.org/10.1002/asi.21001
14. Zakharov, V.: Question about Khomyakov. ln: Zakharov, V. The name of the
   author is Dostoevsky. Essay on creativity. Moscow, lndrik, 231-247 (2013)
15. Zakharov, V.N., Rogov, А.А., Sidorov, У. V.: The proЫem of Dostoevsky gram­
   matical constants search and anonymous and pseudonymous articles, puЬlished in
   "Time" and "Epoch" magazines (1861-1865) attribution. Works and Materials of
   "Russian Language Historical Destiny and the Present" lnternational Congress.
   Moscow, MSU, 404-405 (2001)




                                         335