<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Methods for improving the accuracy of multi-class classification on imbalanced data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leonid A. Sevastianov</string-name>
          <email>sevastianov-la@rudn.ru</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eugene Yu. Shchetinin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Financial University under the Government of the Russian Federation</institution>
          ,
          <addr-line>49, Leningradsky pr., Moscow, 117538</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Peoples' Friendship University of Russia (RUDN University)</institution>
          ,
          <addr-line>6, Miklukho-Maklaya St., Moscow, 117198</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1856</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Class imbalance, characterized by a disproportionate ratio of observations in each class, is one of the significant problems in machine learning. Class imbalance arises in many areas, including medical diagnostics, spam filtering, and fraud detection. Most machine learning algorithms work best when the number of samples in each class is approximately the same, because most algorithms are designed to maximize accuracy and reduce error. Under class imbalance, however, the model may be overfitted, which leads to incorrect classification of objects. Thus, in order to avoid this phenomenon and achieve better results, it is necessary to research methods for working with imbalanced data, as well as to develop effective algorithms for classifying them.</p>
      </abstract>
      <kwd-group>
        <kwd>multiclass classification</kwd>
        <kwd>imbalanced classes</kwd>
        <kwd>machine learning</kwd>
        <kwd>SMOTE</kwd>
        <kwd>ADASYN</kwd>
        <kwd>Random Forest</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Classification tasks are among the most popular in data analysis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Supervised machine
learning is most often used as the method for determining whether an object belongs to a
particular class. The main idea of this approach is to inductively infer a function from
labeled training data. This means that the success of using a machine learning
classification algorithm depends largely on the selection of objects that the algorithm “learns”
from. Most of these algorithms require the researcher to include a comparable number of
examples for each of the classes, but it is often not possible to make balanced data sets due to a
number of factors. Often the number of examples of some class in the dataset greatly exceeds
the number of examples of the minor class (the minor class will be called the minority class,
and the other, prevailing one, the majority class). The key factors are the specificity of the
target area (balancing the data can lower its representativeness) and the different cost of
errors of the first and second types when classifying. Such trends are clearly visible, for
example, in credit scoring, medicine and marketing [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
      </p>
      <p>Workshop on information technology and scientific computing in the framework of the X International Conference. CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.</p>
      <p>This leads to the problem of training the model on imbalanced data (data whose
distribution is skewed, so that the mode and the mean do not coincide). Most algorithms rest
on two basic assumptions: the goal of training is to maximize the proportion
of correct decisions among all decisions made, and the training data and the general
population follow the same distribution. On an imbalanced sample, a model built under these
assumptions may classify no better than a trivial
model that completely ignores the less represented class and assigns every object to the
majority class.</p>
      <p>On the other hand, it is possible to build an overly complex model that includes a large set
of rules, each covering only a small number of objects. Such a classifier may be ineffective:
the model overfits and its predictions are estimated incorrectly. It should be noted
that the consequences of erroneous classification may also differ: an incorrect
classification of a minority-class example usually costs many times more than an erroneous
classification of an object from the majority class. The correct selection of features may be more
important than reducing data processing time or improving classification accuracy. For example,
in medicine, finding the minimum set of features that is optimal for the classification task may
be a prerequisite for making a diagnosis. Thus, to avoid overfitting and achieve a good
result, it is necessary to research methods for working with imbalanced data.</p>
      <p>In this paper, we study methods for overcoming class imbalance in order to achieve
higher classification accuracy than when classification algorithms are applied directly
to imbalanced classes. To improve the accuracy of classification, we propose a
scheme that combines classification algorithms with the feature selection
methods RFE, Random Forest and Boruta, preceded by class balancing with random
sampling, SMOTE and ADASYN.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Basic algorithms for balancing classes</title>
      <p>
        One approach to solving this problem is to use various sampling strategies, which can be divided
into two groups: random and special [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In the first case, a certain number of examples
of the majority class are deleted (undersampling); in the second, the number of examples of the
minority class is increased (oversampling).
      </p>
      <sec id="sec-2-1">
        <title>2.1. The exclusion of examples of the majority class. Algorithm for random sampling of the majority class (random undersampling)</title>
        <p>To do this, we calculate K, the number of majority examples that must be removed to achieve
the required ratio of the different classes. Then K majority examples are randomly selected and
removed. For the data studied here, methods that increase the minority class are the natural choice.
Let’s move on to the consideration of such strategies.</p>
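        <p>The removal step can be sketched in a few lines of Python. This is only an illustration on toy data, not the authors' implementation: the function name <monospace>random_undersample</monospace>, its parameters and the toy labels are assumptions introduced here.</p>
        <preformat><![CDATA[
```python
import random
from collections import Counter


def random_undersample(X, y, majority_label, k, seed=0):
    """Remove k randomly chosen examples of the majority class."""
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(y) if label == majority_label]
    if k > len(majority_idx):
        raise ValueError("k exceeds the number of majority examples")
    drop = set(rng.sample(majority_idx, k))
    X_new = [x for i, x in enumerate(X) if i not in drop]
    y_new = [label for i, label in enumerate(y) if i not in drop]
    return X_new, y_new


# Toy data (hypothetical): 8 majority examples (label 0), 3 minority examples (label 1).
X = [[float(i)] for i in range(11)]
y = [0] * 8 + [1] * 3
# For a 1:1 ratio, K = 8 - 3 = 5 majority examples must be removed.
K = y.count(0) - y.count(1)
X_bal, y_bal = random_undersample(X, y, majority_label=0, k=K)
print(Counter(y_bal))
```
]]></preformat>
        <p>All minority examples are kept; only the majority class shrinks, which is why information about that class may be lost.</p>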
      </sec>
      <sec id="sec-2-2">
        <title>2.2. The increase in the minority class. Duplicate examples of a minority class (oversampling). Random naive sampling</title>
        <p>The easiest way to increase the number of examples of a minority class is to randomly select
observations from it and add them to the general dataset until a balance is reached between
the majority and minority classes. The number of random records to duplicate is chosen
according to the class ratio that is needed. One of the problems with random naive sampling is
that it simply duplicates existing data. The advantages of random sampling include its simplicity,
ease of implementation and the ability to change the balance in any desired direction. The
disadvantages should be discussed separately for each sampling strategy:
although both undersampling and oversampling change the overall size of the data in order to find a balance, their
application has different consequences. In the case of undersampling, deleting data may cause
the class to lose important information and, as a result, lower its representativeness.</p>
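        <p>A minimal sketch of random naive oversampling is given here for illustration. The helper <monospace>random_oversample</monospace> and the toy data are assumptions, not part of the original study; the point is only that every added record is an exact copy of an existing minority example.</p>
        <preformat><![CDATA[
```python
import random
from collections import Counter


def random_oversample(X, y, minority_label, n_copies, seed=0):
    """Append n_copies randomly chosen duplicates of minority examples."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    picks = [rng.choice(minority_idx) for _ in range(n_copies)]
    X_new = list(X) + [list(X[i]) for i in picks]
    y_new = list(y) + [minority_label] * n_copies
    return X_new, y_new


# Toy data (hypothetical): 8 majority examples (label 0), 3 minority examples (label 1).
X = [[float(i)] for i in range(11)]
y = [0] * 8 + [1] * 3
X_bal, y_bal = random_oversample(X, y, minority_label=1, n_copies=5)
print(Counter(y_bal))

# Every added record duplicates one of the original minority examples.
original_minority = [x for x, label in zip(X, y) if label == 1]
added = X_bal[len(X):]
```
]]></preformat>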
        <p>
          In turn, the use of oversampling can lead to overfitting [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Random duplication
is not always effective at restoring balance, so a special method was proposed to increase the number of examples of
a minority class: the SMOTE algorithm (Synthetic Minority Oversampling Technique) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The
SMOTE algorithm is based on the idea of generating a certain number of artificial examples that
are “similar” to those in the minority class, but do not duplicate them. To create a new record,
find the difference $d = X_b - X_a$, where $X_a$, $X_b$ are the feature vectors of “neighboring” examples $a$ and
$b$ from the minority class. They are found using the nearest neighbors algorithm (KNN). In this
case, it is necessary and sufficient to get a set of $k$ neighbors for example $b$, from which the
entry $a$ will be selected later. The remaining steps of the KNN algorithm are not required. Then
$\tilde{d}$ is obtained from $d$ by multiplying each of its elements by a random number in the interval $(0, 1)$.
The feature vector of the new example is calculated by adding $X_a$ and $\tilde{d}$. The SMOTE algorithm
allows you to set the number of records to be artificially generated. The degree of similarity of
examples $a$ and $b$ can be adjusted by changing the value of $k$ (the number of nearest neighbors).
The SMOTE algorithm is illustrated in Figure 2.
        </p>
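        <p>The generation step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a reference implementation: the function names, the toy points and the use of a brute-force neighbor search are ours, and <monospace>random.random()</monospace> draws from [0, 1) rather than the open interval.</p>
        <preformat><![CDATA[
```python
import math
import random


def k_nearest(point, candidates, k):
    """Return the k candidates closest to point (Euclidean distance)."""
    return sorted(candidates, key=lambda c: math.dist(point, c))[:k]


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority examples by interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x_b = rng.choice(minority)
        others = [p for p in minority if p is not x_b]
        # Example a is picked among the k nearest minority neighbors of b.
        x_a = rng.choice(k_nearest(x_b, others, k))
        # d = X_b - X_a; each element gets its own random factor.
        d = [b - a for a, b in zip(x_a, x_b)]
        d_tilde = [d_j * rng.random() for d_j in d]
        synthetic.append([a + dt for a, dt in zip(x_a, d_tilde)])
    return synthetic


# Toy minority class (hypothetical values).
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [3.0, 3.0]]
new_points = smote(minority, n_new=5, k=2)
for point in new_points:
    print(point)
```
]]></preformat>
        <p>Because each new point is built from two existing minority examples, every coordinate stays within the range already covered by the minority class.</p>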
        <p>
          SMOTE solves many problems that are inherent to the random sampling method, and actually
enlarges the initial data set in such a way that the model is trained much more efficiently [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
However, this algorithm has its drawbacks, the main one being that it ignores the majority class.
This may result in a highly sparse distribution of minority-class objects relative to the majority
class, where the data sets are “mixed”, i.e. arranged in such a way that it is very difficult to
separate objects of one class from another.
        </p>
        <p>An example of this phenomenon is when an object of a different class is located between an
object and the neighbor used to generate a new instance. As a result, the synthetically
created object will be closer to the opposite class than to the class of its parents. In addition, the
number of instances generated using SMOTE is set in advance, which reduces the ability to
change the balance and the flexibility of the method.</p>
        <p>
          It is important to note the significant limitations of the SMOTE algorithm [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Since it works by
interpolating between rare examples, it can only generate examples inside the body of available
examples, never outside. Formally, SMOTE can only fill in the convex hull of existing minority
examples, but not create new external areas for them. The main advantage of SMOTE over
traditional random naive oversampling is that, because it creates synthetic observations instead of
reusing existing ones, the classifier is less likely to be overfitted. At the same time, it is
always necessary to make sure that the observations created by SMOTE are realistic.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Adaptive synthetic sampling algorithm and its generalizations</title>
        <p>
          This method is based on synthetic sampling algorithms, the main ones being Borderline-SMOTE
and Adaptive Synthetic Sampling (ADASYN) [
          <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
          ]. Borderline-SMOTE imposes restrictions
on the selection of objects of the minority class that new instances are generated from. This
happens as follows: for each object of a minority class, a set of k nearest neighbors is determined,
then it is calculated how many instances of this set belong to the majority class (this number is
taken as m).
        </p>
        <p>After this, we select those objects of the minority class for which the inequality $k/2 \le m &lt; k$
is true. The resulting set represents instances of the minority class located on the distribution
boundary, and they are the ones that are more likely to be incorrectly classified than the others.
Note that the inequality excludes the case $m = k$, in
which all $k$ neighbors belong to the majority class: such instances are
located in the “mixing” zone of two classes, and only objects that distort the model learning
process can be generated on their basis. They are therefore declared noise and are ignored
by the algorithm.</p>
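        <p>The selection rule of Borderline-SMOTE can be illustrated with a short sketch. The function <monospace>borderline_minority</monospace>, the brute-force neighbor search and the one-dimensional toy data are assumptions made for this example; only the inequality on the number of majority neighbors m comes from the text.</p>
        <preformat><![CDATA[
```python
import math


def borderline_minority(X, y, minority_label, k):
    """Indices of minority examples with k/2 <= m < k majority neighbors."""
    selected = []
    for i, (x_i, y_i) in enumerate(zip(X, y)):
        if y_i != minority_label:
            continue
        # k nearest neighbors of x_i among all other examples.
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(x_i, X[j]))[:k]
        m = sum(1 for j in nearest if y[j] != minority_label)
        if k / 2 <= m < k:
            selected.append(i)
    return selected


# Toy one-dimensional data (hypothetical); label 1 is the minority class.
X = [[0.0], [0.2], [0.4], [0.85], [1.2], [1.4], [1.6], [5.0], [4.9], [5.1]]
y = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0]
print(borderline_minority(X, y, minority_label=1, k=2))  # prints [3]
```
]]></preformat>
        <p>In this toy set the point at 0.85 has one majority and one minority neighbor and is selected, while the point at 5.0 has only majority neighbors (m = k) and is skipped as noise.</p>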
        <p>
          The ADASYN algorithm, in turn, is based on a systematic method that allows adaptive
generation of different amounts of data in accordance with their distributions [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The input of
the algorithm is a training data set $D_{tr}$ with $m$ samples $\{x_i, y_i\}$, $i = 1, \dots, m$, where $x_i$ is an
$n$-dimensional vector in the feature space and $y_i$ is the label of the corresponding class. Let $m_s$ and
$m_l$ be the numbers of samples of the minority and majority classes, respectively, such that $m_s \ll m_l$
and $m_s + m_l = m$. The algorithm’s pseudocode looks like this:
1. Calculate the proportion of classes $d = m_s / m_l$;
2. If $d &lt; d_{th}$ (where $d_{th}$ is the specified threshold for the maximum allowable class imbalance):
a) find the number of synthetic samples of the minority class to generate, $G = (m_l - m_s) \times \beta$,
where $\beta$ is the parameter used to determine the desired balance level ($\beta = 1$ indicates full class
balance);
b) for each $x_i$ in the minority class, find the $K$ nearest neighbors using the Euclidean distance and
calculate $r_i = \Delta_i / K$, where $\Delta_i$ is the number of majority-class examples among these neighbors;
c) normalize $\hat{r}_i = r_i / \sum_i r_i$ so that $\hat{r}_i$ becomes a density distribution;
d) calculate $g_i = \hat{r}_i \times G$, the number of synthetic samples to be generated for each example $x_i$ of the minority class,
where $G$ is the total number of synthetic examples;
e) for each minority-class example $x_i$, create $g_i$ synthetic examples
by repeating the following steps $g_i$ times:
(i) randomly select one minority example $x_{zi}$ from the $K$ nearest neighbors of $x_i$;
(ii) create a synthetic example $s_i = x_i + (x_{zi} - x_i) \times \lambda$, where $(x_{zi} - x_i)$ is an $n$-dimensional
vector of Euclidean space and $\lambda$ is a random number, $\lambda \in [0, 1]$.
        </p>
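        <p>The steps of the pseudocode can be sketched in plain Python as shown here. This is an illustrative simplification, not the published implementation: the function name, the default parameter values and the toy data are assumptions; the number of samples per point is rounded to an integer, so the total may differ slightly from the requested amount; and interpolation partners are drawn from the nearest minority neighbors.</p>
        <preformat><![CDATA[
```python
import math
import random


def adasyn(X, y, minority_label, beta=1.0, K=3, d_th=1.0, seed=0):
    """Generate synthetic minority examples following steps 1 and 2(a)-(e)."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    m_s, m_l = len(minority), len(X) - len(minority)
    if m_s / m_l >= d_th:          # steps 1-2: imbalance within the threshold
        return []
    G = (m_l - m_s) * beta         # (a) total number of synthetic samples
    r = []
    for x_i in minority:           # (b) share of majority neighbors
        others = [j for j in range(len(X)) if X[j] is not x_i]
        nearest = sorted(others, key=lambda j: math.dist(x_i, X[j]))[:K]
        delta_i = sum(1 for j in nearest if y[j] != minority_label)
        r.append(delta_i / K)
    total = sum(r)
    if total == 0:
        return []                  # no minority example has majority neighbors
    synthetic = []
    for x_i, r_i in zip(minority, r):
        g_i = round(r_i / total * G)   # (c)-(d), rounded to an integer
        # Assumption: partners come from the K nearest minority neighbors.
        partners = sorted((p for p in minority if p is not x_i),
                          key=lambda p: math.dist(x_i, p))[:K]
        for _ in range(g_i):       # (e) interpolate towards a random partner
            x_z = rng.choice(partners)
            lam = rng.random()
            synthetic.append([a + (b - a) * lam for a, b in zip(x_i, x_z)])
    return synthetic


# Toy data (hypothetical): 6 majority points (label 0), 3 minority points (label 1).
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 0.0], [2.0, 1.0],
     [1.5, 0.5], [6.0, 6.0], [6.0, 7.0]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
new_points = adasyn(X, y, minority_label=1)
print(len(new_points))
```
]]></preformat>
        <p>Minority points surrounded by majority examples receive a larger share of the synthetic samples, which is the adaptive part of the method.</p>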
        <p>The main difference between SMOTE and ADASYN is how synthetic samples are created
for the minority class. ADASYN uses the density distribution computed in step c) to determine the number of synthetic
samples that will be created for a specific point, whereas SMOTE gives all
minority points the same weight.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Research data: description and characteristics</title>
      <p>In this paper, a data set on skin diseases was used for testing and comparative analysis of
the methods described above for eliminating class imbalance. Diagnosis of erythemato-squamous
diseases is a serious problem in dermatology, and modern principles of diagnosis
and treatment are based on the earliest possible detection of the disease. All of these diseases share
clinical features with very small differences. Another difficulty for diagnosis is that a disease
may show signs of another disease at the initial stage and may have its characteristic signs only in
subsequent stages.</p>
      <p>
        The study data was created by Nielsen in 1998 and contains 366 observations forming 6
classes that can be characterized by 34 features [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The classes are: psoriasis (class 1), 112
cases; seborrheic dermatitis (class 2), 72 cases; lichen planus (class 3), 61 cases; pityriasis rosea
(class 4), 49 cases; chronic dermatitis (class 5), 52 cases; pityriasis rubra pilaris (class 6), 20 cases.
A full description of the data is given in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
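      <p>To make the degree of imbalance concrete, the class sizes listed above can be checked with a few lines of Python. The variable names are ours; the counts are those given in the text.</p>
      <preformat><![CDATA[
```python
# Class sizes as listed in the text (class number -> number of cases).
counts = {1: 112, 2: 72, 3: 61, 4: 49, 5: 52, 6: 20}

total = sum(counts.values())
shares = {c: n / total for c, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

print(total)                      # 366 observations in all
print(round(shares[6], 3))        # share of the smallest class
print(round(imbalance_ratio, 1))  # largest class vs smallest class
```
]]></preformat>
      <p>The largest class is 5.6 times the size of the smallest one, which accounts for only about 5.5% of the observations.</p>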
    </sec>
    <sec id="sec-4">
      <title>4. Computer experiments</title>
      <sec id="sec-4-1">
        <title>Data studies were performed using the following algorithm:</title>
        <p>1. Data pre-processing: filling the gaps in the data and encoding the features.
2. Balancing classes using the sampling algorithms described above.
3. Selecting features based on their importance.
4. Classification using logistic regression and the support vector machine.
5. Assessment of classification quality.</p>
        <p>
          In this paper, the selection of features based on their importance and informativeness was
carried out by the following methods: a) recursive feature elimination (RFE) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]; b) Random Forest
(RF) [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]; c) Boruta [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>The Random Forest algorithm is an ensemble of numerous classification algorithms (decision
trees). Each of these classifiers is built on a random subset of objects and a random subset of
features. Let the training sample consist of $N$ examples, the dimension of the feature space be
equal to $M$, and an additional parameter $m$ be set. All trees are built independently of each other
using the following procedure:
1. Generate a random sub-sample of size $N$ with replacement from the training sample.
2. Build a decision tree that classifies the examples of this sub-sample; during
the creation of each node of the tree, the feature on which the
partition is made is selected not from all $M$ features, but only from $m$ randomly selected ones.
3. The tree is built until the sub-sample is completely exhausted and is not subjected to
pruning.</p>
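        <p>The two sources of randomness in this procedure, the sub-sample drawn with replacement and the random subset of features tried at a node, can be sketched as follows. Tree construction itself is omitted; the helper names and the numeric values are assumptions chosen for illustration (34 matches the number of features in the data set, the other values are arbitrary).</p>
        <preformat><![CDATA[
```python
import random


def bootstrap_indices(n_samples, rng):
    """Step 1: draw n_samples indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def candidate_features(n_features, n_candidates, rng):
    """Step 2: pick distinct features to consider at one tree node."""
    return rng.sample(range(n_features), n_candidates)


rng = random.Random(0)
n_samples, n_features, n_candidates = 10, 34, 5

sample = bootstrap_indices(n_samples, rng)
# Objects never drawn are "out of bag" for the tree built on this sample.
out_of_bag = sorted(set(range(n_samples)) - set(sample))
features = candidate_features(n_features, n_candidates, rng)

print(sample)
print(out_of_bag)
print(features)
```
]]></preformat>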
        <p>
          Object classification is carried out by voting: each tree of the ensemble assigns the object to be
classified to one of the classes, and the class that the largest number of trees voted for wins. To
use Random Forest in the task of evaluating the importance of features, it is necessary to train
the algorithm on the sample and calculate the out-of-bag error for each example of the training
sample [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
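        <p>The voting rule amounts to taking the most frequent label among the trees' predictions. A minimal sketch, with a hypothetical list of votes:</p>
        <preformat><![CDATA[
```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions of six trees for one object.
votes = [2, 1, 2, 3, 2, 1]
print(majority_vote(votes))  # prints 2
```
]]></preformat>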
        <p>Let $X_t$ be the bootstrapped sample of the $t$-th tree $b_t$, $t = 1, \dots, T$. Bootstrapping is the selection of $\ell$ objects
from the sample with replacement, as a result of which some objects are selected several times,
and some never. Placing multiple copies of the same object in a bootstrapped sample
corresponds to setting a weight for this object: the corresponding term is included in
the loss functional several times, and therefore the error penalty on it is greater. Let $L(y, z)$
be the loss function, and $y_i$ be the response on the $i$-th object $x_i$ of the training sample; then the
out-of-bag error is calculated using the following formula:
\[ \mathrm{OOB} = \sum_{i=1}^{\ell} L\left( y_i,\ \frac{\sum_{t=1}^{T} [x_i \notin X_t]\, b_t(x_i)}{\sum_{t=1}^{T} [x_i \notin X_t]} \right), \]
where $[\cdot]$ equals 1 if the condition holds and 0 otherwise, so that only trees whose bootstrapped sample does not contain $x_i$ contribute.</p>
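        <p>A small numerical sketch may help to read the out-of-bag formula. Here the "trees" are stand-in Python functions and the bootstrapped samples are given as sets of object indices; all names and values are hypothetical. For each object, only the trees that did not see it are averaged.</p>
        <preformat><![CDATA[
```python
def oob_error(X, y, trees, bootstrap_sets, loss):
    """Sum the loss of the averaged out-of-bag prediction over all objects."""
    total = 0.0
    for i, (x_i, y_i) in enumerate(zip(X, y)):
        # Predictions of trees whose bootstrapped sample does not contain object i.
        oob_preds = [tree(x_i) for tree, used in zip(trees, bootstrap_sets)
                     if i not in used]
        if not oob_preds:
            continue  # object i was drawn by every tree and is skipped
        total += loss(y_i, sum(oob_preds) / len(oob_preds))
    return total


def squared_loss(y_true, y_hat):
    return (y_true - y_hat) ** 2


# Toy regression-style example (hypothetical): two stand-in "trees".
X = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 1.0, 2.0, 3.0]
trees = [lambda x: x, lambda x: x + 1.0]
bootstrap_sets = [{0, 1}, {1, 2}]  # indices each tree was trained on

err = oob_error(X, y, trees, bootstrap_sets, squared_loss)
print(err)  # 1.0 (object 0) + 0.0 (object 2) + 0.25 (object 3) = 1.25
```
]]></preformat>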
        <p>Then, for each object, this error is averaged across the entire random forest. To evaluate the
importance of a feature, its values are shuffled across all objects in the training sample, and the out-of-bag
error is computed again. The importance of the feature is estimated by averaging the difference
in out-of-bag errors across all trees before and after shuffling the values. The values of such errors
are normalized by the standard deviation.</p>
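        <p>The idea of shuffling a feature and measuring the change in error can be sketched for a single model as follows. This is a simplification of the procedure described above: it uses one stand-in predictor instead of per-tree out-of-bag errors and omits the normalization by the standard deviation. All names and data are hypothetical.</p>
        <preformat><![CDATA[
```python
import random


def permutation_importance(predict, X, y, feature, loss, seed=0):
    """Increase in total loss after shuffling one feature column."""
    rng = random.Random(seed)
    base = sum(loss(y_i, predict(x_i)) for x_i, y_i in zip(X, y))
    column = [x_i[feature] for x_i in X]
    rng.shuffle(column)
    X_perm = [x_i[:feature] + [v] + x_i[feature + 1:]
              for x_i, v in zip(X, column)]
    permuted = sum(loss(y_i, predict(x_i)) for x_i, y_i in zip(X_perm, y))
    return permuted - base


def squared_loss(y_true, y_hat):
    return (y_true - y_hat) ** 2


# Stand-in model (hypothetical): it uses feature 0 and ignores feature 1.
def predict(x):
    return x[0]


X = [[float(i), float(10 - i)] for i in range(6)]
y = [float(i) for i in range(6)]

imp_used = permutation_importance(predict, X, y, feature=0, loss=squared_loss)
imp_unused = permutation_importance(predict, X, y, feature=1, loss=squared_loss)
print(imp_used, imp_unused)
```
]]></preformat>
        <p>Shuffling a feature the model does not use leaves the error unchanged, so its importance is zero; shuffling a feature the model relies on cannot decrease the error in this example.</p>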
        <p>
          Boruta is a heuristic algorithm for selecting significant features based on the use of Random
Forest [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. At each iteration, features that have a Z-measure less than the maximum Z-measure
among the added features are removed. To get the Z-measure of a feature, you need to calculate
the feature’s importance obtained using the built-in algorithm in Random Forest, and divide it
by the standard deviation of the feature importance. The added features are obtained as follows:
the features that are present in the sample are copied, and then each copy is filled in
by shuffling its values. In order to get statistically significant results, this procedure is repeated
several times, and the shuffled variables are generated independently at each iteration.
        </p>
        <p>Let’s write down the Boruta algorithm step by step:
1. Add copies of all features to the data. In what follows, the copies are called shadow features.
2. Randomly shuffle each shadow feature.
3. Run Random Forest and get the Z-measure of all features.
4. Find the maximum Z-measure among the shadow features.
5. Delete features that have a Z-measure smaller than the one found in the previous step.
6. Remove all shadow features.
7. Repeat all the steps until the Z-measure of every remaining feature is greater than the maximum
Z-measure of the shadow features.</p>
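        <p>One pass through steps 1 to 6 can be sketched as shown here. To keep the example self-contained, the Random Forest Z-measure is replaced by a stand-in score (absolute correlation with the target); this substitution, the function names and the toy data are assumptions made purely for illustration and do not reproduce the actual Boruta implementation.</p>
        <preformat><![CDATA[
```python
import math
import random


def boruta_step(X, feature_names, scores_fn, seed=0):
    """One iteration: add shuffled copies, score, keep features that beat them."""
    rng = random.Random(seed)
    n = len(feature_names)
    columns = [[row[j] for row in X] for j in range(n)]
    # Steps 1-2: a shuffled copy of every column.
    copies = []
    for col in columns:
        copy = col[:]
        rng.shuffle(copy)
        copies.append(copy)
    X_ext = [list(values) for values in zip(*(columns + copies))]
    # Step 3: scores for the real features and for the shuffled copies.
    z = scores_fn(X_ext)
    # Steps 4-6: threshold at the best copy, then drop the copies.
    threshold = max(z[n:])
    return [name for name, z_j in zip(feature_names, z[:n]) if z_j > threshold]


def abs_correlation_with(y):
    """Stand-in for the Random Forest Z-measure (an assumption of this sketch)."""
    def scores(X_ext):
        result = []
        mean_y = sum(y) / len(y)
        s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
        for col in zip(*X_ext):
            mean_c = sum(col) / len(col)
            s_c = math.sqrt(sum((a - mean_c) ** 2 for a in col))
            cov = sum((a - mean_c) * (b - mean_y) for a, b in zip(col, y))
            result.append(abs(cov) / (s_c * s_y) if s_c and s_y else 0.0)
        return result
    return scores


# Toy data (hypothetical): "signal" equals the target, "constant" carries no information.
y = [float(i) for i in range(8)]
X = [[y_i, 1.0] for y_i in y]
names = ["signal", "constant"]
kept = boruta_step(X, names, abs_correlation_with(y))
print(kept)
```
]]></preformat>
        <p>A feature is kept only if it scores strictly higher than every shuffled copy, so an uninformative feature such as the constant column is removed.</p>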
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and discussion</title>
      <p>
        To solve the problem of multiclass classification on imbalanced data, two machine learning
algorithms were chosen: logistic regression and the support vector machine with a linear kernel
(Linear SVM). All calculations were implemented in Python; their results, data, and
program codes are placed in the repository of the authors of this article [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and some algorithms
in [
        <xref ref-type="bibr" rid="ref13 ref14 ref15">13, 14, 15</xref>
        ]. Some fragments are presented in the Program Code appendix. Three metrics
were used to compare classification results: accuracy, recall, and F1-measure. The results of the
research are presented in Table 2 and Table 3.
      </p>
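      <p>For reference, the three metrics can be computed from true and predicted labels as sketched here. Macro averaging (the unweighted mean over classes) is used in this sketch as one of several possible averaging modes; the function name and the toy labels are hypothetical.</p>
      <preformat><![CDATA[
```python
def classification_scores(y_true, y_pred):
    """Return accuracy, macro-averaged recall and macro-averaged F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    recalls, f1s = [], []
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return accuracy, sum(recalls) / len(labels), sum(f1s) / len(labels)


# Toy labels (hypothetical) for three classes.
y_true = [1, 1, 1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 1, 2, 2, 2, 3, 1]
accuracy, macro_recall, macro_f1 = classification_scores(y_true, y_pred)
print(accuracy, macro_recall, round(macro_f1, 4))
```
]]></preformat>
      <p>On imbalanced data, macro-averaged recall and F1 weight every class equally, so errors on a small class are not hidden by a large one, as they can be with accuracy alone.</p>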
      <p>The first column of Table 2 lists the sampling methods used. The second column shows the
methods used for selecting features, and the third column shows the number of selected features.
The remaining columns show the values of quality metrics obtained as a result of applying the
support vector machine (SVM) to the transformed data. Table 3 is constructed similarly and
contains the results of classification using logistic regression.</p>
      <p>
        The results shown in Table 2 and Table 3 indicate
that in all cases the use of sampling methods gave higher classification accuracy
than on imbalanced data. Within the framework of the scheme described in this paper, the best
classification accuracy was achieved by applying the ADASYN class balancing algorithm and
then selecting features using the Random Forest algorithm. For comparison, in the works of
other researchers who conducted similar studies, for example, [
        <xref ref-type="bibr" rid="ref11 ref5">5, 11</xref>
        ], the classification accuracy
reached only 93%.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we propose a scheme for improving the accuracy of classification on imbalanced
data using algorithms for class balancing and feature selection, such as RFE, Boruta, Random
Forest, and others. The results of computational experiments have shown the effectiveness
of its application to this problem. In particular, with the ADASYN algorithm
classification accuracy reached up to 98%, higher than with the other algorithms. In conclusion, it is worth
noting that the problem discussed in this paper is still relevant, and existing methods can be
improved. Recently, new trends have emerged in data mining, notably deep learning,
which develops deep neural networks as a tool for solving various classification problems.
We hope to apply them in our future research on the classification of imbalanced classes.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <sec id="sec-7-1">
        <title>The work is partially supported by RFBR grants No 18-07-00567.</title>
        <p>A. Program Code: Deep CNN model
# Assumes X_train, X_test (pandas DataFrames) and y_train, y_test are already prepared.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score

SEED = 42  # assumed value; the original listing does not define SEED


def report_metrics(y_true, y_pred):
    """Print accuracy plus F1 and recall for each averaging mode."""
    print('Test Accuracy {}'.format(accuracy_score(y_true, y_pred)))
    for avg in ('macro', 'micro', 'weighted'):
        print('f1_score ({})'.format(avg), f1_score(y_true, y_pred, average=avg))
    for avg in ('macro', 'micro', 'weighted'):
        print('recall_score ({})'.format(avg), recall_score(y_true, y_pred, average=avg))


# Fit logistic regression to all features
clf = LogisticRegression()
clf.fit(X_train, y_train)
lr_pred = clf.predict(X_test)
# Test accuracy
print('Test Accuracy {}'.format(accuracy_score(y_test, lr_pred)))
# Plot confusion matrix
cm = confusion_matrix(y_test, lr_pred)
sns.heatmap(cm, fmt='d', cmap='GnBu', cbar=False, annot=True)

# ===== Recursive Feature Elimination =====
from sklearn.feature_selection import RFECV
rfe = RFECV(estimator=LogisticRegression(), cv=4, scoring='accuracy')
rfe = rfe.fit(X_train, y_train)
# Select variables and calculate test metrics
cols = X_train.columns[rfe.support_]
lr_pred = rfe.estimator_.predict(X_test[cols])
print('Number of features selected: {}'.format(rfe.n_features_))
report_metrics(y_test, lr_pred)
# Plot number of features vs CV scores
# (cv_results_ replaces grid_scores_, which was removed in scikit-learn 1.2)
cv_scores = rfe.cv_results_['mean_test_score']
plt.figure()
plt.xlabel('k')
plt.ylabel('CV accuracy')
plt.plot(np.arange(1, cv_scores.size + 1), cv_scores)
plt.show()

# ===== Feature importances =====
from sklearn.ensemble import RandomForestClassifier
# Feature importance values from Random Forests
rf = RandomForestClassifier(n_jobs=-1, random_state=SEED)
rf.fit(X_train, y_train)
feat_imp = rf.feature_importances_
# Select features and fit logistic regression
cols = X_train.columns[feat_imp &gt;= 0.01]
est_imp = LogisticRegression()
est_imp.fit(X_train[cols], y_train)
lr_pred = est_imp.predict(X_test[cols])
print('Number of features selected: {}'.format(len(cols)))
report_metrics(y_test, lr_pred)

# ===== Boruta =====
from boruta import BorutaPy
# Random Forests for Boruta
rf_boruta = RandomForestClassifier(n_jobs=-1, random_state=SEED)
# Perform Boruta
boruta = BorutaPy(rf_boruta, n_estimators='auto', verbose=2)
boruta.fit(X_train.values, y_train.values.ravel())
# Select the confirmed features and fit logistic regression
# (this step is missing from the original listing, which uses est_boruta and cols without defining them)
cols = X_train.columns[boruta.support_]
est_boruta = LogisticRegression()
est_boruta.fit(X_train[cols], y_train)
lr_pred = est_boruta.predict(X_test[cols])
print('Number of features selected: {}'.format(len(cols)))
report_metrics(y_test, lr_pred)</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Patterson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gibson</surname>
          </string-name>
          ,
          <article-title>Deep Learning: A Practitioner's Approach</article-title>
          , O'Reilly Media
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <article-title>Learning from imbalanced data</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>21</volume>
          (
          <year>2009</year>
          )
          <fpage>1263</fpage>
          -
          <lpage>1284</lpage>
          .
          doi:10.1109/TKDE.2008.239.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Japkowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Stephen</surname>
          </string-name>
          ,
          <article-title>The class imbalance problem: A systematic study</article-title>
          ,
          <source>Intelligent Data Analysis</source>
          <volume>6</volume>
          (
          <year>2002</year>
          )
          <fpage>429</fpage>
          -
          <lpage>449</lpage>
          .
          doi:10.3233/IDA-2002-6504.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N. V.</given-names>
            <surname>Chawla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. W.</given-names>
            <surname>Bowyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. O.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. P.</given-names>
            <surname>Kegelmeyer</surname>
          </string-name>
          ,
          <article-title>SMOTE: Synthetic minority over-sampling technique</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>16</volume>
          (
          <year>2002</year>
          )
          <fpage>321</fpage>
          -
          <lpage>357</lpage>
          . URL: http://dx.doi.org/10.1613/jair.953.
          doi:10.1613/jair.953.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>A support vector machine-recursive feature elimination feature selection method based on artificial contrast variables and mutual information</article-title>
          ,
          <source>Journal of Chromatography B: Analytical Technologies in the Biomedical and Life Sciences</source>
          <volume>910</volume>
          (
          <year>2012</year>
          )
          <fpage>149</fpage>
          -
          <lpage>155</lpage>
          .
          doi:10.1016/j.jchromb.2012.05.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Abdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hashemi</surname>
          </string-name>
          ,
          <volume>28</volume>
          (
          <year>2016</year>
          )
          <fpage>238</fpage>
          -
          <lpage>251</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>ADASYN: Adaptive synthetic sampling approach for imbalanced learning</article-title>
          ,
          <source>in: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence)</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>1322</fpage>
          -
          <lpage>1328</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.-H.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <article-title>Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning</article-title>
          , volume
          <volume>3644</volume>
          ,
          <year>2005</year>
          , pp.
          <fpage>878</fpage>
          -
          <lpage>887</lpage>
          .
          doi:10.1007/11538059_91.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P. M.</given-names>
            <surname>Murphy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Aha</surname>
          </string-name>
          ,
          <article-title>UCI repository of machine learning databases</article-title>
          , Irvine: University of California, Department of Information and Computer Science,
          <year>1998</year>
          . URL: https://www.ics.uci.edu/mlearn/MLRepository.html.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <source>Dermatology-article</source>
          ,
          <year>2020</year>
          . URL: https://github.com/riviera2015/Dermatology-article.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Tuv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Borisov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Runger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Torkkola</surname>
          </string-name>
          ,
          <article-title>Feature selection with ensembles, artificial variables, and redundancy elimination</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>10</volume>
          (
          <year>2009</year>
          )
          <fpage>1341</fpage>
          -
          <lpage>1366</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kursa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Rudnicki</surname>
          </string-name>
          ,
          <article-title>Feature selection with Boruta package</article-title>
          ,
          <source>Journal of Statistical Software</source>
          <volume>36</volume>
          (
          <year>2010</year>
          )
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
          doi:10.18637/jss.v036.i11.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lyubin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shchetinin</surname>
          </string-name>
          ,
          <article-title>Fast two-dimensional smoothing with discrete cosine transform</article-title>
          , volume
          <volume>678</volume>
          ,
          <year>2016</year>
          , pp.
          <fpage>646</fpage>
          -
          <lpage>656</lpage>
          .
          doi:10.1007/978-3-319-51917-3_55.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>E. Y.</given-names>
            <surname>Shchetinin</surname>
          </string-name>
          ,
          <article-title>Cluster-based energy consumption forecasting in smart grids</article-title>
          , in:
          <string-name>
            <given-names>V. M.</given-names>
            <surname>Vishnevskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. V.</given-names>
            <surname>Kozyrev</surname>
          </string-name>
          (Eds.),
          <source>Distributed Computer and Communication Networks</source>
          , volume
          <volume>919</volume>
          , Springer International Publishing, Cham,
          <year>2018</year>
          , pp.
          <fpage>445</fpage>
          -
          <lpage>456</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Sevastianov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. Y.</given-names>
            <surname>Shchetinin</surname>
          </string-name>
          ,
          <article-title>On methods for improving the accuracy of multiclass classification on imbalanced data</article-title>
          ,
          <source>Informatics and Applications</source>
          <volume>14</volume>
          (
          <year>2020</year>
          )
          <fpage>63</fpage>
          -
          <lpage>70</lpage>
          . doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          # Select features and fit Logistic Regression
          cols = X_train.columns[boruta.support_]
          est_boruta = LogisticRegression()
          est_boruta.fit(X_train[cols], y_train)
          from imblearn.over_sampling import RandomOverSampler
          # random oversampling
          ros = RandomOverSampler(random_state=0)
          X_resampled, y_resampled = ros.fit_resample(X_train, y_train)
          # using Counter to display results of naive oversampling
          from collections import Counter
          print(sorted(Counter(y_resampled).items()))
          from imblearn.over_sampling import SMOTE
          # applying SMOTE to our data and checking the class counts
          X_resampled1, y_resampled1 = SMOTE().fit_resample(X_train, y_train)
          print(sorted(Counter(y_resampled1).items()))
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>