<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Building a High-Precision Classifier on Unbalanced Sets of Short Text Messages</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>National Research University "Moscow Power Engineering Institute"</institution>
          ,
          <addr-line>111250, Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1979</year>
      </pub-date>
      <abstract>
        <p>This paper considers building a classifier on unbalanced training sets that consist of very short text messages. We use machine learning for automatic classifier construction. High quality of classification is the main requirement for the classifier (we use precision and recall as the criteria of quality). Imbalance in the sets means that the number of texts belonging to one class is significantly lower than the number belonging to the other classes. The development of classifiers from unbalanced samples is a well-known, complex problem in Text Mining, and much theoretical and experimental research has been devoted to finding a solution. This is especially the case for relatively short text documents, for which the known approaches are not always effective. We give a brief review of methods and argue for using the Gradient Boosting Machine (GBM) to classify short text messages in the case of unbalanced training sets. We diagnose situations in which the classifier fails, analyze them, form hypotheses on how to improve text processing, and configure the settings of the classifier. We also investigate and compare two preprocessing schemes: Bag of Words and Word2Vec. The final section is devoted to experimental results. Experiments were carried out on datasets collected during the daily operation of one of the branches of a large bank. These real data have a rather significant imbalance. In the case of the unbalanced training set, our ensemble approach (GBM) improves performance considerably.</p>
      </abstract>
      <kwd-group>
        <kwd>Text Mining</kwd>
        <kwd>Text Classification</kwd>
        <kwd>Bag of words</kwd>
        <kwd>Word2Vec</kwd>
        <kwd>Gradient Boosting Machine</kwd>
        <kwd>Preprocessing of text documents</kwd>
        <kwd>Feature extraction</kwd>
        <kwd>Precision</kwd>
        <kwd>Recall</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>With the rapid growth of unstructured information on the Internet, classification has become
one of the key techniques for handling and organizing text data. Methods of
classification are used to analyze documents (scientific articles, news, e-mails, WWW-pages,
product reviews, etc.) and divide them into classes (or categories). This problem
arises in various fields of human activity. For example, large corporations (banks,
governmental organizations, e-commerce offices) frequently need to
automatically identify the type of messages sent by customers (or users), correctly
recognize their categories, and direct them to the proper department for processing.</p>
      <p>In this paper, we deal with the problem of automatically dividing service
requests that come through the corporate network of the Bank. The purpose is to distribute
requests based on their content and send them for execution to the relevant department.
These requests may have the following form: "Please replace the cartridge in office
#12, printer – HP Color LaserJet CM 1312 MFP, suitable time – from 10 to 12 a.m.".
To solve such a real-world problem it is necessary to develop a classifier
that will distribute the requests to the classes (departments) efficiently and
automatically. By doing so we want to exclude an operator from the process of
distributing requests while maintaining a high quality of classification (precision and recall).
Thus, the processing of requests should take place without any human participation
(more precisely, with no or minimal human effort).</p>
      <p>It is important to emphasize that the precision and recall of the classifier should not
be lower than the precision and recall of an operator who has instructions and rules for
directing requests to the departments. Moreover, it is even possible to exceed
the accuracy shown by the operator by eliminating human factors (errors occurring
because of fatigue, negligence, loss of concentration, etc.).</p>
      <p>The main approach to creating such classifiers is machine learning – the construction
of a classification function (rules) from examples. To develop such classifiers, we
need labeled data (documents and their corresponding categories – labels). The labels
are assigned by experts (supervised learning). A set of labeled data is used to predict
the labels of unseen data. In this research, the number of classes is six, and each request
can belong to only one category.</p>
      <p>
        A variety of methods have been proposed to categorize unstructured information.
The most widely used classifiers are Naive Bayes, Logistic Regression, Nearest
Neighbor, Decision Trees, Neural Networks, Support Vector Machines, and others [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1,2,3</xref>
        ].
Choosing the most appropriate algorithm is still an open problem and depends on the
dataset and domain area. A major drawback of the algorithms mentioned above
is their relatively low quality of classification. In most cases, these approaches
do not achieve the quality of classification provided by the operator.
      </p>
      <p>
        As described above, standard classifiers have a bias towards classes that have more
texts and tend to predict the majority class. As a result, we get misclassification of the
minority class (in comparison with the majority class). That is why conventional
approaches to classifying texts demonstrate rather poor performance when faced with
imbalanced datasets. There are two effective ways to operate with unbalanced sets
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>First, we can change the samples (data preprocessing). The goal is either to increase the
number of documents in the minority class or to reduce the number of documents in the
majority class. This is done to obtain approximately the same number of instances for
both classes. The most popular techniques are random undersampling (randomly
eliminating majority-class text examples) and oversampling (increasing the number of text
instances in the minority class, for instance through simulation). Our experimental studies show
that the disadvantages of these approaches outweigh the advantages, and the quality of the
classification changes only slightly.</p>
      <p>Second, we can improve the classifiers to make them appropriate for imbalanced
datasets. Such studies are conducted as part of the construction of ensemble classifiers,
which improve the performance of single classifiers by voting (and aggregating their
predictions).</p>
      <p>
        Many researchers have investigated the technique of combining the predictions of
multiple classifiers to produce a single solution [
        <xref ref-type="bibr" rid="ref5 ref6">5,6</xref>
        ]. Nowadays the most popular and
effective techniques for creating ensembles of classifiers are Random Forests,
AdaBoost, and Gradient Boosting Machine – GBM [
        <xref ref-type="bibr" rid="ref2 ref7 ref8">2,7,8</xref>
        ].
      </p>
      <p>
        In this article, we apply and explore Gradient Boosting Machine for the
categorization of requests. The arguments in favor of using GBM can be formulated in the
following way:
 In comparison with other ensemble classifiers, GBM generally performs better than
Random Forests and AdaBoost (although there is a price for that: GBM has a few
hyper-parameters to tune, while Random Forest and AdaBoost are almost
tuning-free). The possibility of tuning gives researchers more options to achieve better
results on a particular dataset.
 The Kaggle competitions (the most popular testing ground for machine learning
techniques) demonstrate the high universality and efficiency of GBM for a wide range of real
applications (https://www.kaggle.com/).
 The availability of software implementations and modifications of the method for
solving various practical problems (for example, the algorithm XGBoost – eXtreme Gradient
Boost – a variant of classical GBM with some additional heuristics [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]).
 GBM meets the requirements for speed of categorization and allows
parallelization of computations. So this method is computationally cheap and easy to
implement.
      </p>
      <p>This paper is organized as follows. In the first section, we consider pre-processing
tasks (extracting text, tokenization, stop-word removal, normalization, named entity
recognition) and give descriptions of two models of text representation (Bag of Words
and Word2Vec). In the second section, we briefly explain Gradient Boosting Machine,
tune its parameters on the samples, and compare the quality of classification using Bag
of Words and Word2Vec.</p>
    </sec>
    <sec id="sec-2">
      <title>Preprocessing and Feature Extraction</title>
      <p>
        For the learner to compute a classification function, it needs to understand the
document. For the learner, the document is merely a string of text. Hence, there is a need to
represent the document text in a structured manner. The most common technique to
represent text is the Bag-Of-Words (BOW) model [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In this technique, the text is
broken down into words. Each word represents a feature. This process is also referred
to as “tokenization”, since the document is broken down into tokens (individual words).
The set of features extracted in this way forms a feature vector for the document. Note that
in such a model, the exact order of word occurrence is ignored. Since this vector
becomes too large, there are several ways to prune it. Techniques like stop-word
removal and stemming are commonly applied. Stop-word removal involves removing
words that add no significant value to the document. However, when dealing with
shorter text messages, traditional techniques will not perform as well as they would
on larger texts. This is primarily because there are few word
occurrences, and hence it is difficult to capture the semantics of such messages (the word
occurrences are too few to offer sufficient knowledge about the text itself). That
is why short text messages are harder to classify than larger corpora of text.
      </p>
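      <p>As a minimal sketch of the BOW model (assuming a recent version of scikit-learn;
the sample requests are invented for illustration), the following code tokenizes short
messages and builds the count-based feature vectors described above.</p>
      <preformat>
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample requests, for illustration only.
requests = [
    "please replace the cartridge in office 12",
    "printer in office 12 does not print",
    "please reset my CRM password",
]

# Tokenization: each distinct word becomes a feature; word order is discarded.
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(requests)

print(vectorizer.get_feature_names_out())  # the extracted vocabulary
print(features.toarray())                  # one count vector per request
      </preformat>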
      <p>
        The other model of text representation is Word2Vec [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Word2vec takes as its input a large corpus of text and produces a vector space,
typically of several hundred dimensions, with each unique word in the corpus being
assigned a corresponding vector in the space. Word vectors are positioned in the vector
space such that words that share common contexts in the corpus are located close to
one another in the space [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>Word2vec has two main learning algorithms: CBOW (Continuous Bag of Words)
and Skip-gram. CBOW is a model architecture that
predicts the current word based on its surrounding context. The Skip-gram architecture
works the other way: it uses the current word to predict the surrounding words. The
user of Word2vec can switch and choose between the algorithms. The order of context
words does not affect the result in either of these algorithms. The coordinate
representations of the word vectors obtained at the output allow us to calculate the
"semantic distance" between words. Word2vec makes its predictions based on
the contextual proximity of these words. Since the Word2vec tool is based on neural
network training, large datasets are needed to train it for its most efficient
operation. This improves the quality of its predictions.</p>
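      <p>A small sketch of this choice, assuming gensim 4 or later and a toy corpus: the sg
flag switches between CBOW and Skip-gram, and the cosine similarity of the trained
vectors serves as the "semantic distance" between words.</p>
      <preformat>
from gensim.models import Word2Vec

# Toy corpus of tokenized sentences, for illustration only.
sentences = [
    ["replace", "cartridge", "printer", "office"],
    ["printer", "cartridge", "empty", "office"],
    ["reset", "password", "crm", "account"],
]

# sg=0 selects CBOW (predict a word from its context);
# sg=1 would select Skip-gram (predict the context from a word).
model = Word2Vec(sentences, sg=0, vector_size=50, window=3, min_count=1, epochs=50)

# Cosine similarity between word vectors acts as the "semantic distance".
print(model.wv.similarity("printer", "cartridge"))
print(model.wv.most_similar("printer", topn=2))
      </preformat>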
      <p>To apply the models, the original samples must first be specially processed, taking into
account the peculiarities of the language, to identify the most informative features
(Fig. 1).</p>
      <p>This processing included (a sketch of these steps follows the list):
 Removing punctuation marks and service characters using the Re library and regular
expressions.
 Reducing words to lowercase.
 Splitting the text into separate tokens on whitespace.
 Reducing words to their normal form. Individual tokens were normalized using the
pymorphy2 library.
 Attaching the negative particle to the next word. The meaning of some requests
depends on the negative particle, and it affects the meaning of the entire request.
Therefore, it is important to preserve the negation.
 Removing stop words and template phrases (clichés). This was performed using two
dictionaries. Experts compiled the first one manually; it contained a list of words
directly related to the automated system. The second dictionary was the standard
dictionary from the NLTK library (Natural Language Toolkit).
 Named entity recognition for further linking.
The task of named entity recognition generally involves finding named entities in the
text and determining their types, as well as linking a named entity to an object. The
latter is important since identical words/phrases can refer to completely different
objects. For example, the word “Lena” in Russian can be:
 The name of a person
 The name of a river
 The name of a town
 The name of a highway
The types of named entities in this work included:
 Dates – 07.02.2019, 28 November 2018, 1995-2003;
 Numbers – 1, 3.5, 547899.00;
 Addresses – Novosibirskaya street, 23;
 Time – 9:30, from 13:00 to 14:00;
 Currency units and money amounts – 5000 rubles, $10;
 Names of automated systems – SAP, CRM;
 Positions – manager, engineer, senior analyst.
To recognize named entities in the text, both the specificity of their occurrence and
dictionary resources are used: dictionaries of names, geographical names, monetary units,
occupations, etc. In this work, regular expressions are used to recognize named entities,
in addition to special dictionaries supplemented with the domain vocabulary of the problem.</p>
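      <p>The sketch below condenses these steps, assuming the re, pymorphy2, and NLTK
libraries named above; the domain stop-word list is a purely illustrative stand-in for
the expert dictionary.</p>
      <preformat>
import re

import pymorphy2
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

morph = pymorphy2.MorphAnalyzer()
domain_stopwords = {"заявка", "просьба"}  # hypothetical expert dictionary
stop_words = set(stopwords.words("russian")) | domain_stopwords

def preprocess(text):
    # Remove punctuation marks and service characters, reduce to lowercase.
    text = re.sub(r"[^\w\s]", " ", text).lower()
    tokens = text.split()  # split into tokens on whitespace
    # Reduce each token to its normal (dictionary) form.
    lemmas = [morph.parse(t)[0].normal_form for t in tokens]
    # Attach the negative particle to the next word to preserve negation.
    merged, skip = [], False
    for cur, nxt in zip(lemmas, lemmas[1:] + [""]):
        if skip:
            skip = False
        elif cur == "не" and nxt:
            merged.append("не_" + nxt)
            skip = True
        else:
            merged.append(cur)
    # Remove stop words and template words.
    return [t for t in merged if t not in stop_words]

print(preprocess("Не работает принтер в офисе №12!"))
      </preformat>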
      <p>It is well known that the quality of classification with GBM is significantly
influenced by the dimension of the vectors and the terms used to describe documents. Therefore, it
is important to select the best representation of the documents. In this paper, we conduct
a comparative analysis of two approaches to feature extraction: BoW and W2V. Based
on our experiments, we identify the approach that provides a higher quality of
classification with Gradient Boosting Machine. We draw our conclusion from an analysis of the
values of precision and recall (as well as their combination – the F1-measure).</p>
    </sec>
    <sec id="sec-3">
      <title>Text Categorization with Gradient Boosting Machine</title>
      <p>
        Classification is a supervised data mining technique that involves assigning a label to a
set of unlabeled input objects. Based on the number of classes present, there are two
types of classification: binary classification (classifying texts into one of two classes)
and multi-class classification (classifying texts into one of multiple classes). In our
work, we consider the more complicated case of multi-class classification. We need to
build a classifier that can correctly assign a document to one class based on its content.
      </p>
      <p>
        As noted earlier, for building the classifier we use the boosting technique, which
creates a highly accurate prediction rule by combining many weak and inaccurate
rules. Each classifier is serially trained with the goal of correctly classifying, in
every round, the texts that were incorrectly classified in the previous round. After each round,
more focus is given to texts that are harder to classify. The amount of focus is measured
by a weight, which is initially equal for all instances. The most common representatives
of the boosting family are Adaptive Boosting (AdaBoost), Gradient Boosting (GBM),
and XGBoost (eXtreme Gradient Boost). In Gradient Boosting Machine the principal
idea behind the algorithm is to construct the new base-learners to be maximally
correlated with the negative gradient of the loss function associated with the whole ensemble
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Gradient Boosting is an approach to approximate a function $f^{*}: \mathbb{R}^d \to \mathbb{R}$ by a
function of the form
$$f(x) = \sum_{i=1}^{M} \gamma_i h_i(x) \qquad (1)$$
where $\gamma_i \in \mathbb{R}$ are real-valued weights, $h_i: \mathbb{R}^d \to \mathbb{R}$ are weak learners, and $x \in \mathbb{R}^d$ is
the input vector [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
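      <p>To make formula (1) concrete, the following toy sketch (squared loss, regression
stumps, and a constant shrinkage factor in place of the weights $\gamma_i$; not the setup used
in this work) fits each new weak learner to the negative gradient of the current
ensemble, which for squared loss equals the residuals.</p>
      <preformat>
# Toy illustration of formula (1): gradient boosting with regression stumps
# under squared loss, where the negative gradient equals the residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1   # plays the role of the weights gamma_i
F = np.zeros_like(y)  # the ensemble prediction f(x), initialized to zero
learners = []

for _ in range(100):
    residuals = y - F                       # negative gradient of squared loss
    h = DecisionTreeRegressor(max_depth=1)  # weak learner h_i (a stump)
    h.fit(X, residuals)
    F += learning_rate * h.predict(X)       # f(x) = sum of gamma_i * h_i(x)
    learners.append(h)

print("training MSE:", np.mean((y - F) ** 2))
      </preformat>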
      <p>
        Gradient Boosting can be seen as a generalization of AdaBoost. The latter is
designed to minimize the exponential loss with stump weak learners $h_i: \mathbb{R}^d \to \{+1, -1\}$,
while Gradient Boosting can make use of real-valued weak learners and can minimize
other loss functions [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Gradient Boosting has shown significant performance
improvements over classic AdaBoost in many classification problems [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>Let us discuss in more detail the description of the short text documents (requests
for service) and the training samples. A request that must be classified consists of two
parts: “Information” – a brief description of the problem written by the user – and
“Options”, which may contain additional information about the request, for example, the user’s
department, system information, etc. By combining these two parts, we
obtain a document for classification.</p>
      <p>To perform the experiments we downloaded and processed 9000 user requests. After
that, the corpus was divided into two parts – a training set and a testing set – in the ratio
of 67 to 33 percent, respectively. Each class in the dataset has a different volume, as
shown in Table 1.</p>
      <p>To achieve the best classification result we need to tune the parameters of the
Gradient Boosting Machine (http://xgboost.readthedocs.io). For some parameters we
used the developers' recommendations. But for the main parameters that have the
greatest impact on the accuracy indicators, a special adjustment was carried out with the
GridSearchCV module, which performs an exhaustive search over specified parameter
values for an estimator using k-fold cross-validation (k = 5); a sketch of this tuning
procedure follows the parameter descriptions below. The next table shows the
parameters with their description, range, and selected value.</p>
      <sec id="sec-3-1">
        <title>Parameter Settings</title>
        <p>The main tuned parameters were the following. Step size shrinkage (eta) is used in
the update to prevent overfitting: after each boosting step, we can directly get the
weights of new features, and eta shrinks the feature weights to make the boosting
process more conservative. Maximum depth of a tree (max_depth): increasing this
value makes the model more complex and more likely to overfit; 0 indicates no limit,
and a limit is required for the depth-wise grow policy. The objective parameter
specifies the learning task and the corresponding learning objective.</p>
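        <p>A sketch of this tuning procedure, assuming the xgboost and scikit-learn packages;
the synthetic data and the grid values are illustrative stand-ins for the real corpus and
for the ranges in the table.</p>
        <preformat>
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Synthetic stand-in for the request-feature matrix and the six class labels.
X_train, y_train = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=6, random_state=0
)

# Illustrative grid; the actual ranges come from the parameter table.
param_grid = {
    "learning_rate": [0.05, 0.1, 0.3],  # eta: step size shrinkage
    "max_depth": [3, 6, 9],             # maximum depth of a tree
}

model = XGBClassifier(objective="multi:softmax")
search = GridSearchCV(model, param_grid, scoring="f1_macro", cv=5)  # k = 5
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
        </preformat>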
        <p>To perform feature extraction, we first break sentences into words. At this step we
dealt with 469880 words; after preprocessing, our training corpus contained
230905 informative words, which we fed to the input of BoW and W2V.</p>
        <p>To perform word extraction with Word2vec using the CBoW algorithm we applied the
following parameters (a sketch of the corresponding call follows this list):
 Learning algorithm - hierarchical softmax.
 The minimum number of repetitions of a word in all sentences - min_word_count = 40.
 The dimension of the resulting space - num_features = 300 (default value).
 The number of words "around" a particular word considered as context - context = 10
(commonly used value).
 The "sampling" parameter, which makes the model choose very frequent words less often
during training - sampling = 1e-3 (default value).</p>
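        <p>In gensim terms (assuming gensim 4 or later, whose parameter names differ slightly
from those listed above), these settings map onto the following call; the loader
function is hypothetical.</p>
        <preformat>
from gensim.models import Word2Vec

# `load_preprocessed_requests` is a hypothetical loader returning the
# preprocessed requests as a list of token lists.
corpus = load_preprocessed_requests()

model = Word2Vec(
    corpus,
    sg=0,             # CBoW learning algorithm
    hs=1,             # hierarchical softmax
    min_count=40,     # minimum number of repetitions of a word (min_word_count)
    vector_size=300,  # dimension of the resulting space (num_features)
    window=10,        # number of context words "around" a word (context)
    sample=1e-3,      # down-sample very frequent words (sampling)
)
model.save("requests_w2v.model")
        </preformat>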
        <p>The performance of a classification algorithm is evaluated with the confusion
matrix, which contains information about the actual and the predicted classes.</p>
        <sec id="sec-3-1-1">
          <title>Actual</title>
        </sec>
        <sec id="sec-3-1-2">
          <title>Positive class</title>
        </sec>
        <sec id="sec-3-1-3">
          <title>Negative class</title>
        </sec>
        <sec id="sec-3-1-4">
          <title>Predicted</title>
        </sec>
        <sec id="sec-3-1-5">
          <title>Negative class</title>
          <p>False Negative (FN)</p>
          <p>True Negative (TN)


=</p>
          <p>+
=</p>
          <p>+
 1 − 
= 2(
(
∗
+
)
)
(2)
(3)
(4)
The results of classification with GBM and two methods of feature extraction are
presented in tables 4 and 52. The results are obtained on the testing set.</p>
        </sec>
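        <p>These measures can be computed directly from predictions, for example with
scikit-learn; the labels below are invented, and macro averaging treats the six classes
equally.</p>
        <preformat>
from sklearn.metrics import classification_report, confusion_matrix

# Invented labels for six classes (0..5), for illustration only.
y_true = [0, 0, 1, 1, 2, 2, 3, 4, 5, 5]
y_pred = [0, 1, 1, 1, 2, 0, 3, 4, 5, 2]

print(confusion_matrix(y_true, y_pred))
# Per-class precision, recall and F1 (formulas (2)-(4)) plus macro averages.
print(classification_report(y_true, y_pred, digits=2))
        </preformat>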
      </sec>
      <sec id="sec-3-2">
        <title>Classification Results</title>
        <p>Class Precision Recall F1
C1 0.92 0.89 0.90
C2 0.91 0.92 0.89
C3 0.89 0.90 0.90
C4 0.95 0.97 0.93
C5 0.88 0.86 0.85</p>
        <p>
          As we can see, prediction with GBM using Bag of Words provides a slightly better
result than Word2vec, but it is still not good enough to replace a human operator.
According to the Bank's experience, human operators perform this task
with approximately f-score = 0.92, so our goal is to achieve an average f-score of at least 0.93
[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Small classes are identified rather poorly by the model, but the results on the
biggest class C4 let us expect good results if we train our model on an extended corpus.
        </p>
        <p>Although the BoW model provides better results on average, it also shows the
worst result in the C5 category. The results of the W2V model are smoother than those of
the BoW model, so we can note that the W2V model is more resistant to the size of the
dictionary.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>
        Currently, the models and methods considered in the article are being tested in one of the
branches of a large bank in Russia. During testing, we focused on the analysis of the
causes of errors. The main conclusion is that most of the errors are due to incorrect
filling of requests, in particular the use of slang and abbreviations that correspond to the
vocabulary of modern messengers and social networks. We plan to improve the prediction
quality through the following:
─ as the number of requests increases, expand the training set to achieve more
precise tuning of the parameters for Word2vec, and also balance our sample
by extending small classes and discarding some objects of large classes;
─ try the W2V skip-gram algorithm; it is more computationally expensive than CBoW but
performs better in models using large corpora and a high number of
dimensions [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ];
─ perform more detailed customization of the XGBoost model on an extended corpus;
─ develop formal rules and guidelines for filling in service requests.
      </p>
      <p>At the same time, the conducted testing and the discussion of its results with the Bank's
specialists allowed us to identify several similar processes that can be automated based on
Text Mining techniques.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Manning C.D.</surname>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schutze</surname>
            <given-names>H.</given-names>
          </string-name>
          : Introduction to Information Retrieval. Cambridge University Press (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bobryakov</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuryliov</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mokhov</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stefantsov</surname>
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Approaches to automation processing of user requests in a multi-level support service using featured models</article-title>
          .
          <source>Proceedings of the 30th DAAAM International Symposium</source>
          , pp.
          <fpage>0936</fpage>
          -
          <lpage>0944</lpage>
          , B.
          <string-name>
            <surname>Katalinic</surname>
          </string-name>
          (Ed.),
          <source>Published by DAAAM International, ISBN 978-3-902734-22-8, ISSN 1726-9679</source>
          , Vienna, Austria. DOI: 10.2507/30th.daaam.proceedings.130 (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Batura</surname>
            <given-names>T.V.</given-names>
          </string-name>
          <article-title>Metodi avtomaticheskoy klassifikacii tekstov [Methods for automatic classification of texts]</article-title>
          . Software &amp; Systems, №
          <volume>1</volume>
          (
          <issue>30</issue>
          ), pp.
          <fpage>85</fpage>
          -
          <lpage>89</lpage>
          . DOI: 10.15827/0236-235X.117.085-099 (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Krawczyk</surname>
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Learning from imbalanced data: open challenges and future directions</article-title>
          .
          <source>Progress in Artificial Intelligence</source>
          <volume>5</volume>
          (
          <issue>4</issue>
          ), pp.
          <fpage>221</fpage>
          -
          <lpage>232</lpage>
          . DOI: 10.1007/s13748-016-0094-0 (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Onan</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Korukoglu</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bulut</surname>
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Ensemble of keyword extraction methods and classifiers in text classification</article-title>
          .
          <source>In: Expert Systems with Applications</source>
          .
          <volume>57</volume>
          (
          <issue>15</issue>
          ), pp.
          <fpage>232</fpage>
          -
          <lpage>247</lpage>
          . DOI: 10.1016/j.eswa.2016.03.045 (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Daroczy</surname>
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Siklois</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palovics</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benczur</surname>
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Text Classification Kernels for Quality Prediction over the C3 Data Set</article-title>
          .
          <source>In: Proceedings of the 24th International Conference on World Wide Web</source>
          , pp.
          <fpage>1441</fpage>
          -
          <lpage>1446</lpage>
          . DOI: 10.1145/2740908.2778847 (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Friedman</surname>
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Greedy function approximation: a gradient boosting machine</article-title>
          .
          <source>In: Annals of statistics. 29</source>
          ,
          <fpage>1189</fpage>
          -
          <lpage>1232</lpage>
          , (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Friedman</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hastie</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tibshirani</surname>
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Additive logistic regression: a statistical view of boosting</article-title>
          .
          <source>Annals of statistics. 28</source>
          ,
          <fpage>337</fpage>
          -
          <lpage>407</lpage>
          , (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Chen</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guestrin</surname>
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>XGBoost: A Scalable Tree Boosting System</article-title>
          .
          <source>In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , pp.
          <fpage>785</fpage>
          -
          <lpage>794</lpage>
          . DOI: 10.1145/2939672.2939785 (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mikolov</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            <given-names>J</given-names>
          </string-name>
          :
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          . In: Proceedings of Workshop at ICLR, (
          <year>2013</year>
          ) (arXiv:1301.3781).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Natekin</surname>
            <given-names>A</given-names>
          </string-name>
          and
          <string-name>
            <surname>Knoll</surname>
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Gradient boosting machines, a tutorial</article-title>
          .
          <source>In: Front. Neurorobot</source>
          .
          <volume>7</volume>
          (
          <issue>21</issue>
          ). DOI: 10.3389/fnbot.2013.00021 (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Becker</surname>
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rigamonti</surname>
            <given-names>R.</given-names>
          </string-name>
          : Kernel Boost:
          <article-title>Supervised Learning of Image Features For Classification</article-title>
          .
          <source>School of Computer and Communication Sciences Swiss Federal Institute of Technology, Lausanne. Technical Report</source>
          .
          <year>2013</year>
          , p.
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Hastie</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tibshirani</surname>
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Friedman</surname>
            <given-names>J.</given-names>
          </string-name>
          :
          <source>The Elements of Statistical Learning</source>
          . Springer (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Bolshakova</surname>
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vorontsov</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Efremova</surname>
            <given-names>N.</given-names>
          </string-name>
          et al.:
          <article-title>Avtomaticheskaya obrabotka tekstov na estestvennom yazike i analiz dannikh [Automatic processing of natural language texts and data analysis]</article-title>
          . NRU “Higher School of Economics”,
          <source>ISBN 978-5-9909752-1-7</source>
          (
          <year>2017</year>
          ) (In Russ.).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>