<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using a Decision Tree to Identify Non-uniform Fragments in a Text ⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>r Rogov</string-name>
          <email>rogov@petrsu.ru</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kirill Kulakov</string-name>
          <email>kulakov@cs.karelia.ru</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikolai Moskin</string-name>
          <email>moskin@petrsu.ru</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ITMO University</institution>
          ,
          <addr-line>Saint Petersburg, Russia https://itmo.ru/</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Petrozavodsk State University</institution>
          ,
          <addr-line>Petrozavodsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>402</fpage>
      <lpage>410</lpage>
      <abstract>
        <p>This article discusses the problem of searching for non-uniform text fragments. Each such fragment consists of several paragraphs or separate sentences that differ significantly from the rest of the text in terms of a set of characteristics. The problem of finding non-uniform fragments and interpreting them arises in the study of the pre-revolutionary magazines “Time” (1861-1863) and “Epoch” (1864-1865) and the weekly “Citizen” (1873-1874). It is known that F. M. Dostoevsky was their editor, which means that he could have made his own edits to the texts of articles written by other authors. In our research the texts were divided into separate parts. For every part, the frequency of n-grams (encoded sequences of parts of speech) was determined. The analysis was then carried out using decision trees that classified texts by author. In particular, the texts of F. M. Dostoevsky and V. P. Meshchersky were subjected to this analysis.</p>
      </abstract>
      <kwd-group>
        <kwd>Text attribution</kwd>
        <kwd>non-uniform fragment</kwd>
        <kwd>n-gram</kwd>
        <kwd>F. M. Dostoevsky</kwd>
        <kwd>V. P. Meshchersky</kwd>
        <kwd>decision tree</kwd>
        <kwd>software complex “SMALT”</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        A fragment of text that consists of several paragraphs or even separate sentences
will be called non-uniform if it differs significantly from the rest of the text in
terms of a set of characteristics. Each characteristic should be poorly controlled
by the author of the work and should statistically separate two or more
authors. For example, to distinguish the texts of F. M. Dostoevsky,
A. Grigoriev and V. Dahl, the frequency of different sequences of parts of speech
found in fragments was used [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Based on the calculation of chi-square
statistics for the selected fragments and the remaining parts of speech, the question
of the uniformity of the fragments was solved.
      </p>
      <p>
        At the same time, it is important to select the most informative features that
allow you to identify non-uniform fragments [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. For example, in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
experiments were carried out using the FCBF (Fast Correlation-Based Filter), which
was proposed by Lei Yu and Huan Liu [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The method does not target a
specific machine learning model. It does not use classical correlations (for example,
Pearson’s), but it is based on information theory. As a result of its application,
a subset of characteristics is formed by searching and sequential exclusion of
uninformative features. Note that to solve the plagiarism search problem, the
authors combine FCBF with such mathematical methods as the support vector
machine (SVM) and the cumulative sum method (QSUM) [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ].
      </p>
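      <p>FCBF, as described by Yu and Liu [16], ranks features by symmetrical uncertainty, a normalized form of information gain. The sketch below computes only this measure for discrete features; it is not a full FCBF implementation, and the function names and toy data are assumptions made for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)); 0 means independent, 1 means fully dependent."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    joint = entropy(list(zip(x, y)))
    mutual_information = hx + hy - joint
    return 2 * mutual_information / (hx + hy)

# Toy data (invented): a feature identical to the class label and an unrelated one.
labels = [0, 1, 0, 1]
print(symmetrical_uncertainty([0, 1, 0, 1], labels))
print(symmetrical_uncertainty([0, 0, 1, 1], labels))
```
]]></preformat>
      <p>A feature with a value close to 1 carries almost the same information as the class label, while a value close to 0 marks a candidate for exclusion.</p>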
      <p>
        Based on a corpus of prose texts by Russian writers of the 18th-20th
centuries (215 texts by 50 authors) and scientific articles on philology, history, law,
economics and other social sciences and humanities (500 texts written without
co-authorship), the authors conclude that the method is quite accurate. Character-level
units of the text, elements of grammar, and idiosyncratic and special
features of the text were used as characteristics to be compared, including [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]:
– features suggested by Morton: sentence length (in words) and a combination
of words starting with a vowel letter and short words of two to four letters;
– sets of bigrams and trigrams of symbols, separated by frequency;
– sets of words and word combinations, divided by frequency;
– grammatical classes of words and combinations of grammatical classes;
– dictionaries of relevant scientific disciplines;
– dictionaries of male and female characters of the text, etc.
      </p>
      <p>
        A closely related task is the identification of artificial texts [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Such a text may be written
by any author or group of authors, or be the product of a software algorithm. To
compare artificial and natural texts (written by a person), the following numerical
characteristics were used: the number of sentences in the text, the number of
function words, the average word length, the mention of certain words, the number
of short words, and the number of long words [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], an invariant of artificial
texts is proposed, which is a set of values of text characteristics, which allows
us to classify texts according to the method of their creation. A method is also
proposed for determining artificial texts based on the calculation of the measure
of the input text belonging to the invariants (using the Mahalanobis distance),
which makes it possible to reach a decision about the origin of the text. We also
note the following works on this topic [
        <xref ref-type="bibr" rid="ref1 ref3">1, 3</xref>
        ].
      </p>
      <p>A significant disadvantage of the considered algorithms is the poor
interpretability of their results, which is very important when solving the problem
of attribution. In addition, modern neural classifiers require a large
training sample for their construction. Decision trees and forests perform better in
this respect.</p>
      <p>The work consists of four sections and a conclusion. The introduction
describes the problem of detecting non-uniform fragments in the text and the
existing solutions. The second section shows how, based on the frequency of
occurrence of certain n-grams, decision trees are constructed for the problem of
attribution of texts from the magazines ”Time” (1861-1863), ”Epoch” (1864-1865)
and the weekly ”Citizen” (1873-1874). The third part provides a
mathematical model used for attribution of a text fragment. The fourth section describes
the developed tools for visualizing research results (highlighting text fragments,
highlighting n-grams, coloring text), implemented in the SMALT information
system.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Description of the Method Based on the Decision Tree</title>
      <p>
The problem of finding non-uniform fragments and their interpretation arises
in the study of the pre-revolutionary magazines “Time” (1861-1863), “Epoch”
(1864-1865) and the weekly “Citizen” (1873-1874). It is known that the
famous Russian writer F. M. Dostoevsky was their editor, which means that he
could have made his own edits to the texts of articles by other authors. A
number of works were also published without the author’s name (or
under a pseudonym), which allows specialists in the field of literary studies
to formulate attributive hypotheses.</p>
      <p>
        Since any hypothesis needs convincing proof, mathematical methods that
complement philological research are becoming more and more popular.
The authors have previously used the decision tree method, and it has shown
good results. For example, in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], it was found that the relative frequency of the
bigram ”particle-adjective” greater than 6.5 is a distinctive feature of the
journalistic style of Apollon Grigoriev, who published his articles in these journals.
In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the analysis of the strong positions of texts (i.e. fragments located at the
beginning or end of the text) using decision trees demonstrates the possibility of
stylistic edits that F. M. Dostoevsky made to the texts of the original authors.
Using such trees as an example, one can easily show how the set of texts that
fall into a certain node is divided into two subsets, one of which contains objects
that satisfy a certain rule and the other one does not.
      </p>
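      <p>The splitting of a node described above can be illustrated by a minimal sketch. The feature name, the threshold and the toy values are hypothetical; they only mimic a rule of the kind “relative frequency of a bigram greater than 6.5” and are not taken from the authors' data.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def split_node(texts, feature, threshold):
    """Split texts into those satisfying `feature > threshold` and the rest."""
    satisfied = [t for t in texts if t["features"][feature] > threshold]
    rest = [t for t in texts if t["features"][feature] <= threshold]
    return satisfied, rest

# Toy data: names and frequency values are invented for illustration only.
texts = [
    {"name": "text A", "features": {"particle-adjective": 7.2}},
    {"name": "text B", "features": {"particle-adjective": 4.1}},
    {"name": "text C", "features": {"particle-adjective": 6.9}},
]
satisfied, rest = split_node(texts, "particle-adjective", 6.5)
print([t["name"] for t in satisfied])  # texts that satisfy the rule
print([t["name"] for t in rest])       # texts that do not
```
]]></preformat>
      <p>A full decision tree applies such a split recursively to each of the two subsets until a stopping criterion is met.</p>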
      <p>
        Note that the first works using decision trees appeared in the 1960s,
and since then decision trees have often been used by data mining specialists.
There are classification trees and regression trees [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In the first case, the
method predicts if an object belongs to a particular class. In the second case, the
predicted result is a real number. The problem of obtaining an optimal decision
tree is NP-complete, so heuristic algorithms are needed. Currently, there are
the following methods for training decision trees: ID3, C4.5 (improved version of
ID3), C5.0, CART (and its modifications IndCART, DB-CART), NewId, ITrule,
CHAID, CN2, etc.
      </p>
      <p>
        In order to carry out such an analysis, it is necessary to split the source
texts into fragments, setting the left and right borders of each. Then grammatical
markup is carried out, which takes into account 14 parts of speech (noun,
adjective, numeral, pronoun, adverb, category of state, verb, participle, gerund,
preposition, conjunction, particle, modal word, interjection) and also makes it possible to mark
quotes, foreign words, introductory words, abbreviated words and non-linguistic
symbols. Sequences of parts of speech of different lengths can be represented
as n-grams (n is the number of elements in the
sequence). There are bigrams (2-grams), trigrams (3-grams),
4-grams, 5-grams, etc. When we talk about unigrams, or in other words n-grams
consisting of one element, we mean the words themselves. N-grams are widely
used in natural language processing: for example, for suggesting the next word
in a search string, searching for plagiarism, correcting errors, etc. Consider
a well-known example of determining the authorship of the novel “The
Quiet Don”. Some critics argued that M. Sholokhov was too young at the time
of writing the novel (23 years old), while the novel was very mature. There
was a hypothesis that the author was Fyodor Kryukov, who also wrote about
the life of Cossacks. To test this hypothesis, the frequency of combinations of
different classes of words at the level of bigrams, trigrams and tetragrams was
studied [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In another study of “The Quiet Don”, informative features were formed
from the frequencies of character 3-grams built from the space character
and the 33 letters of the Russian alphabet [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Other examples of using n-grams for
text attribution can be found in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
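      <p>As a minimal sketch of how such features might be computed (an illustration, not the SMALT implementation), the Python fragment below counts part-of-speech n-grams in a tagged fragment and converts the counts to relative frequencies. The tag names, the function name and the normalization by the total number of n-grams are assumptions made for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter

def ngram_frequencies(tags, n=2):
    """Return relative frequencies of part-of-speech n-grams in one fragment.

    tags: list of part-of-speech labels, one per word (hypothetical labels).
    The result maps each n-gram (a tuple of labels) to count / total n-grams.
    """
    grams = [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    counts = Counter(grams)
    return {gram: count / total for gram, count in counts.items()}

# Toy fragment: labels are illustrative, not the SMALT part-of-speech codes.
tags = ["particle", "adjective", "noun", "verb", "particle", "adjective", "noun"]
freqs = ngram_frequencies(tags, n=2)
print(freqs[("particle", "adjective")])  # 2 of the 6 bigrams in the toy fragment
```
]]></preformat>
      <p>The resulting dictionary can serve as one row of a feature table, with one column per n-gram, on which a classifier is then trained.</p>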
      <p>
        The studies carried out by the authors have shown that bigrams are suitable
for solving the problems of attribution of texts from the journals “Time”
(1861-1863), “Epoch” (1864-1865) and the weekly “Citizen” (1873-1874) [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. The
text is divided into fragments of 1000 words in increments of 100 words. With
reference texts of F. M. Dostoevsky and V. P. Meshchersky chosen, a decision
tree was constructed that separates the texts of these authors.
      </p>
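      <p>The division into overlapping fragments can be sketched as follows. The fragment size of 1000 words and the step of 100 words come from the text; the function name and the handling of the tail of the text (words not covered by a full fragment are simply dropped) are assumptions of this sketch.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def sliding_fragments(words, size=1000, step=100):
    """Split a list of words into overlapping fragments.

    Returns (left, right, fragment) triples, where left is the index of the
    first word and right is the index of the last word of the fragment.
    A text shorter than `size` yields no fragments in this sketch.
    """
    fragments = []
    for left in range(0, len(words) - size + 1, step):
        right = left + size - 1
        fragments.append((left, right, words[left:right + 1]))
    return fragments

# Toy text of 1250 placeholder words.
words = [f"w{i}" for i in range(1250)]
frags = sliding_fragments(words)
print([(left, right) for left, right, _ in frags])
```
]]></preformat>
      <p>With this scheme, a word in the interior of a long text falls into ten consecutive fragments, which is what later allows several classifications to be combined for the same stretch of text.</p>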
      <p>After that, text No. 160, “Dvoryanin, zhelayushchij byt’ krest’yaninom”
(“A nobleman who wants to be a peasant”), published
in the magazine “Time” (1861, volume VI, No. 12, pp. 117-123), was analyzed. However, it is
difficult for specialists in philology to analyze the tree in this representation. In
order to simplify this process, an algorithm for attributing text fragments was
developed.
      </p>
    </sec>
    <sec id="sec-3">
      <title>The Attribution Model of a Text Fragment</title>
      <p>
The method of attribution of a text fragment proposed in this paper is based
on an ensemble of classifiers. Select the fragment of the text that needs to
be attributed and denote it by x. With each fragment x we associate a binary
variable z, which takes the value 0 if the text belongs to one author and the value
1 if it belongs to the other. Let the fragment x be contained in j fragments
y<sub>1</sub>, ..., y<sub>j</sub> whose classifications are known, and denote these
classifications by v<sub>1</sub>, ..., v<sub>j</sub>. The overall classification of
all these fragments is denoted by f(v<sub>1</sub>, ..., v<sub>j</sub>).</p>
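      <p>As an example, you can use:
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math><![CDATA[f(v_1, \ldots, v_j) = \frac{1}{j} \sum_{k=1}^{j} v_k]]></tex-math>
        </disp-formula>
        <disp-formula id="eq2">
          <label>(2)</label>
          <tex-math><![CDATA[f(v_1, \ldots, v_j) = \begin{cases} 1, & \text{if } \sum_{k=1}^{j} v_k \geq \frac{j}{2} \\ 0, & \text{otherwise} \end{cases}]]></tex-math>
        </disp-formula>
      </p>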
      <p>If f(v<sub>1</sub>, ..., v<sub>j</sub>) is a binary function, then the obvious solution is z =
f(v<sub>1</sub>, ..., v<sub>j</sub>). If we have not one tree but a decision forest, then the function
f(v<sub>1</sub>, ..., v<sub>j</sub>) may have a different form.</p>
      <p>When it is required to take into account a possible rejection of the
classification (for example, when j is an even number and the votes are tied), the variable z must take
three values: in addition to 0 and 1, the value 1/2 is introduced to denote a rejection of the
classification.</p>
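      <p>Then the formula (2) will take the form:
        <disp-formula id="eq3">
          <label>(3)</label>
          <tex-math><![CDATA[f(v_1, \ldots, v_j) = \begin{cases} 1, & \text{if } \sum_{k=1}^{j} v_k > \frac{j}{2} \\ 1/2, & \text{if } \sum_{k=1}^{j} v_k = \frac{j}{2} \\ 0, & \text{otherwise} \end{cases}]]></tex-math>
        </disp-formula>
      </p>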
      <p>Note that this technique can be generalized to the case of evaluating the
degree of belonging of a given fragment to the author (high, medium, low and
excluded) or evaluating the probability of belonging.</p>
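      <p>The three-valued vote of formula (3) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the function names, the representation of classified fragments as (left, right, label) triples and the toy labels are all hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from fractions import Fraction

def vote(votes):
    """Three-valued majority vote over binary labels v_1..v_j.

    Returns 1 if more than half of the votes are 1, Fraction(1, 2) on an
    exact tie (rejection of the classification), and 0 otherwise.
    """
    j = len(votes)
    if j == 0:
        raise ValueError("at least one vote is required")
    ones = sum(votes)
    if 2 * ones > j:
        return 1
    if 2 * ones == j:
        return Fraction(1, 2)
    return 0

def covering_votes(position, windows):
    """Collect labels of the windows (left, right, label) containing a word index."""
    return [label for left, right, label in windows if left <= position <= right]

# Toy classified fragments: borders and labels are invented for illustration.
windows = [(0, 999, 1), (100, 1099, 0), (200, 1199, 1), (300, 1299, 0)]
print(vote(covering_votes(250, windows)))   # votes [1, 0, 1]
print(vote(covering_votes(350, windows)))   # votes [1, 0, 1, 0], a tie
print(vote(covering_votes(1250, windows)))  # votes [0]
```
]]></preformat>
      <p>Comparing 2 * ones with j avoids floating-point division, so a tie is detected exactly.</p>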
      <p>Let’s consider an example of the proposed algorithm based on the constructed
decision tree. In our case, the size of the fragment x is equal to the split step. As
a criterion we used the formula (3). The results of the algorithm are presented in
Table 1. Here, the first author is F. M. Dostoevsky and the second author is V. P.
Meshchersky. The text in question is text No. 160, “Dvoryanin,
zhelayushchij byt’ krest’yaninom” (“A nobleman who wants to be a peasant”).
The unit of measurement is the word. Based on the source data, the SMALT
system builds a text that is colored in accordance with the attributed author
(this is described in more detail in the next section).</p>
      <p>The authors also applied this approach with a transformer model as the
classifier. The results are presented at http://smalt.karelia.ru/. A
significant limitation of a classifier built on a transformer model
is the need for a large training sample.</p>
      <p>Due to the absence of labeled texts with known foreign fragments, it is not possible
to assess the accuracy of the proposed method quantitatively. However, philological experts
rated it highly: their assessment of some fragments coincided with the
assessment obtained using the method described in this paper.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Software Support for Text Analysis Methods and Algorithms</title>
      <p>
A large number of routine operations is required to identify non-uniform
fragments: marking up texts, calculating statistics and analyzing the results. Usually
such tasks are performed by a team of specialists, so the problem of
interaction and exchange of the obtained results arises. The information system
“Statistical methods of literary text analysis” (SMALT) speeds up these
operations by means of computer technology. SMALT has a modular
structure [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which allows automatic text markup, markup correction by specialists
and calculation of statistical characteristics of a separate text for research. The
SMALT system is available at http://smalt.karelia.ru/shower. The algorithms
and methods presented in this article are focused on obtaining characteristics of
groups of texts. Thus, the researcher performs the following process:
– obtaining marked up texts from SMALT;
– running the required algorithms on groups of texts;
– analysis of the obtained results.
      </p>
      <p>The SMALT information system provides the marked-up text in the form
of a table in xls (Microsoft Excel), csv or ods (LibreOffice Calc) format. The
table contains three columns:
– source word;
– initial form;
– part of speech code.</p>
      <p>Sentences are separated by an empty line, and paragraphs are separated by two
empty lines. If the loaded text contains punctuation marks, a punctuation mark
is displayed in the first column instead of an empty line. The results of text
processing are submitted in JSON format as an array of objects containing the
following fields:
– id : the sequence number of the fragment;
– left border : the number of the first word (numbering from zero);
– right border : the number of the last word;
– pros: assessment of the degree of belonging to the first author;
– cons: assessment of the degree of belonging to the second author.</p>
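      <p>A file of this kind might be read as in the sketch below. The exact key spellings (for instance left_border with an underscore), the interpretation of pros and cons as vote counts and all values are assumptions made for illustration; the source only names the fields.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import json

# Hypothetical payload: key spellings and values are assumed for illustration.
payload = """
[
  {"id": 0, "left_border": 0,   "right_border": 99,  "pros": 7, "cons": 3},
  {"id": 1, "left_border": 100, "right_border": 199, "pros": 5, "cons": 5},
  {"id": 2, "left_border": 200, "right_border": 299, "pros": 2, "cons": 8}
]
"""

def summarize(fragments):
    """Label each fragment by comparing the two degree-of-belonging scores."""
    labels = {}
    for frag in fragments:
        if frag["pros"] > frag["cons"]:
            labels[frag["id"]] = "first author"
        elif frag["pros"] == frag["cons"]:
            labels[frag["id"]] = "undecided"
        else:
            labels[frag["id"]] = "second author"
    return labels

fragments = json.loads(payload)
print(summarize(fragments))
```
]]></preformat>
      <p>The three outcomes mirror the three values of formula (3): a clear majority for either author, or a tie that is treated as a rejection of the classification.</p>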
      <p>To visualize the obtained results, SMALT provides the following tools:
– selection of a fragment of text;
– highlighting n-gram;
– coloring text.</p>
      <p>Displaying fragments of text allows you to select the required fragment in
a text. To view text fragments, go to the view form through the
“Research” → “Text Fragments” menu. The form consists of fields for defining the
selection parameters (fragment size, fragment number, indent size) and a block
for displaying the selection results. A fragment is selected by its size (sample size)
and its offset relative to the beginning of the text. The offset is
calculated from the indent number and the indent size (for example, indent number 3 with size
10 gives an offset of 30 words). The indent number is the fragment number.</p>
      <p>The system allows you to specify several fragments using the generally
accepted notation of enumeration and range. For example, for a fragment size of
15 and an indent size of 10, specifying fragment numbers in the form ”1-3,5,7”
results in the selection of the following word ranges: from 10 to 45, from 50 to
65, and from 70 to 85. N-grams are highlighted in the same form.
To do this, select the required parts of speech in the fields “Part No. 1”, “Part No. 2” and “Part No. 3”
(see Fig. 1). You can also show only matches
at the beginning of sentences by setting the appropriate flag. You can search
for unigrams, bigrams and trigrams.</p>
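      <p>The selection arithmetic from the example above (fragment size 15, indent size 10, fragment numbers “1-3,5,7”) can be reproduced with a short sketch. The function names are hypothetical, and whether the end of each range is inclusive in SMALT is not stated in the source; the sketch simply reports offset and offset plus size, merging ranges that overlap.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def parse_numbers(spec):
    """Expand a specification such as "1-3,5,7" into [1, 2, 3, 5, 7]."""
    numbers = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-")
            numbers.extend(range(int(first), int(last) + 1))
        else:
            numbers.append(int(part))
    return numbers

def word_ranges(spec, size, indent):
    """Return merged (start, end) word ranges for the selected fragments.

    Fragment k starts at offset k * indent and spans `size` words;
    overlapping or touching ranges are merged.
    """
    merged = []
    for k in sorted(set(parse_numbers(spec))):
        start, end = k * indent, k * indent + size
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(word_ranges("1-3,5,7", size=15, indent=10))
```
]]></preformat>
      <p>For the parameters above the sketch yields the ranges 10 to 45, 50 to 65 and 70 to 85, in agreement with the example in the text.</p>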
      <p>Coloring text allows you to visually highlight text fragments. The color
is chosen in accordance with the formula (3). There are three options for
coloring:
– yellow: f(v<sub>1</sub>, ..., v<sub>j</sub>) = 1;
– green: f(v<sub>1</sub>, ..., v<sub>j</sub>) = 1/2;
– white: f(v<sub>1</sub>, ..., v<sub>j</sub>) = 0.</p>
      <p>Quotations inserted in the text do not participate in the coloring;
they are marked in italics. Text coloring pages are stored in the information
system database. To load a coloring, you must specify the text, the
description of the experiment and the JSON file with the distribution of votes by
fragments in the appropriate form.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>This article addresses the problem of searching for and interpreting non-uniform text
fragments based on the material of the pre-revolutionary journals “Time”
(1861-1863), “Epoch” (1864-1865) and the weekly “Citizen” (1873-1874). A model
of attribution of text fragments based on a heuristic algorithm using decision
trees has been developed. The features used were the frequencies of occurrence
of certain n-grams (encoded sequences of parts of speech). To analyze the
obtained results, tools for highlighting text fragments, highlighting n-grams and coloring text have
been implemented in the SMALT information system (http://smalt.karelia.ru/).</p>
      <p>Acknowledgements. This work was supported by the Russian Foundation for
Basic Research, project no. 18-012-90026.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bakhteev</surname>
            ,
            <given-names>O. Yu.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsova</surname>
            ,
            <given-names>M. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Romanov</surname>
            ,
            <given-names>A.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chekhov</surname>
            ,
            <given-names>Yu. V.</given-names>
          </string-name>
          :
          <article-title>About one method of detecting artificial and non-scientific texts in an extensive collection of documents</article-title>
          .
          <source>Electronic libraries 20(5)</source>
          ,
          <fpage>298</fpage>
          -
          <lpage>304</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Breiman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>J. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olshen</surname>
            ,
            <given-names>R. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stone</surname>
            ,
            <given-names>C. J.</given-names>
          </string-name>
          :
          <article-title>Classification and regression trees</article-title>
          .
          <source>Wadsworth</source>
          , Belmont, Ca (
          <year>1984</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Grechnikov</surname>
            ,
            <given-names>E. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gusev</surname>
            ,
            <given-names>G. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kustarev</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raigorodsky</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          :
          <article-title>Search for unnatural texts</article-title>
          .
          <source>Russian Conference on Digital Libraries, Petrozavodsk</source>
          ,
          <fpage>306</fpage>
          -
          <lpage>308</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Iskhakova</surname>
            ,
            <given-names>A. O.</given-names>
          </string-name>
          <article-title>Method and software tool for determining artificially created texts</article-title>
          .
          <source>Tomsk</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kjetsaa</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gustavsson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beckman</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gil</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          : <source>Who wrote “The Quiet Don”?</source> Moscow (
          <year>1989</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kolesnikova</surname>
            ,
            <given-names>S. I.</given-names>
          </string-name>
          :
          <article-title>Methods of analyzing the informativeness of different types of attributes</article-title>
          .
          <source>Tomsk State University Journal of Control and Computer Science</source>
          <volume>1</volume>
          (
          <issue>6</issue>
          ),
          <fpage>69</fpage>
          -
          <lpage>80</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Rogov</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abramov</surname>
            ,
            <given-names>R. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lebedev</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kulakov</surname>
            ,
            <given-names>K. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moskin</surname>
            ,
            <given-names>N. D.</given-names>
          </string-name>
          :
          <article-title>Text Attribution in Case of Sampling Imbalance by the Method of Constructing an Ensemble of Classifiers Based on Decision Trees</article-title>
          .
          <source>Data Analytics and Management in Data Intensive Domains: Supplementary Proceedings of the 22nd International Conference DAMDID/RCDL'2020</source>
          (October 13-16,
          <year>2020</year>
          , Voronezh, Russia).
          <source>CEUR Workshop Proceedings</source>
          ,
          <fpage>319</fpage>
          -
          <lpage>328</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Rogov</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lebedev</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abramov</surname>
            ,
            <given-names>R. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moskin</surname>
            ,
            <given-names>N. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kulakov</surname>
            ,
            <given-names>K. A.</given-names>
          </string-name>
          :
          <article-title>Application of decision trees for analyzing the strong positions of the text in the problem of attribution of works by F. M. Dostoevsky</article-title>
          .
          <source>Computer Linguistics and Computing Ontologies</source>
          . Vol.
          <volume>4</volume>
          (Proceedings of the XXIII International Joint Scientific Conference ”Internet and Modern Society”, IMS-2020, St. Petersburg, June 17-20,
          <year>2020</year>
          ). St. Petersburg: ITMO University,
          <fpage>118</fpage>
          -
          <lpage>127</lpage>
          (
          <year>2020</year>
          ) https://doi.org/10.17586/0000-0000-2020-4-118-127
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Romanov</surname>
            ,
            <given-names>A. S.</given-names>
          </string-name>
          :
          <article-title>Modification of the method of accumulative sums for checking the uniformity of the text and detecting plagiarism</article-title>
          .
          <source>Materials of reports of the International scientific-practical conference “Electronic means and control systems” 2</source>
          ,
          <fpage>30</fpage>
          -
          <lpage>38</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Romanov</surname>
            ,
            <given-names>A. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meshcheryakov</surname>
            ,
            <given-names>R. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rezanova</surname>
            ,
            <given-names>Z. I.</given-names>
          </string-name>
          :
          <article-title>Plagiarism detection and text homogeneity checking technique based on one-class support machine and fast correlation-based filter</article-title>
          .
          <source>Proceedings of TUSUR University</source>
          <volume>2</volume>
          (
          <issue>32</issue>
          ),
          <fpage>264</fpage>
          -
          <lpage>269</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Sedov</surname>
            ,
            <given-names>A. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rogov</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          :
          <article-title>Program system to detect heterogeneity in texts</article-title>
          .
          <source>Proceedings of the VI International Scientific and Practical Conference “Information environment of the university of the XXI century”</source>
          . Petrozavodsk,
          <fpage>135</fpage>
          -
          <lpage>139</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Shumskaya</surname>
            ,
            <given-names>A. O.</given-names>
          </string-name>
          :
          <article-title>Choice of parameters for identification of artificial texts</article-title>
          .
          <source>Proceedings of TUSUR University</source>
          <volume>2</volume>
          (
          <issue>28</issue>
          ),
          <fpage>126</fpage>
          -
          <lpage>128</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Shumskaya</surname>
            ,
            <given-names>A. O.</given-names>
          </string-name>
          :
          <article-title>Method of the artificial texts identification based on the calculation of the belonging measure to the invariants</article-title>
          .
          <source>SPIIRAS Proceedings</source>
          <volume>6</volume>
          (
          <issue>49</issue>
          ),
          <fpage>104</fpage>
          -
          <lpage>121</lpage>
          (
          <year>2016</year>
          ) https://doi.org/10.15622/sp.49.6
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>A Survey of Modern Authorship Attribution Methods</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>60</volume>
          (
          <issue>3</issue>
          ),
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          (
          <year>2009</year>
          ) https://doi.org/10.1002/asi.21001
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Usmanov</surname>
            ,
            <given-names>Z. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kosimov</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          :
          <article-title>About metrization of works of fiction</article-title>
          .
          <source>New information technologies in automated systems 21</source>
          ,
          <fpage>183</fpage>
          -
          <lpage>186</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution</article-title>
          .
          <source>Proceedings of the Twentieth International Conference on Machine Learning (ICML-03)</source>
          . Washington DC,
          <fpage>856</fpage>
          -
          <lpage>863</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>