
Methods for improving the accuracy of multi-class classification on imbalanced data

Leonid A. Sevastianov

sevastianov-la@rudn.ru 1

Eugene Yu. Shchetinin

0 Financial University under the Government of the Russian Federation, 49, Leningradsky pr., Moscow, 117538, Russia
1 Peoples' Friendship University of Russia (RUDN University), 6, Miklukho-Maklaya St., Moscow, 117198, Russia


Class imbalance, characterized by a disproportionate ratio of observations across classes, is one of the significant problems in machine learning. It arises in many areas, including medical diagnostics, spam filtering, and fraud detection. Most machine learning algorithms work best when the number of samples in each class is approximately the same, because they are designed to maximize overall accuracy and reduce error. Under class imbalance, however, the model may overfit, which leads to incorrect classification estimates. To avoid this and achieve better results, it is necessary to study methods for working with imbalanced data and to develop effective algorithms for classifying such data.

Keywords: multiclass classification, imbalanced classes, machine learning, SMOTE, ADASYN, Random Forest

1. Introduction

Classification tasks are among the most popular in data analysis [1]. Supervised machine learning is most often used to determine whether an object belongs to a particular class. The main idea of this approach is to inductively infer a function from labeled training data. The success of a machine learning classification algorithm therefore depends largely on the selection of objects that the algorithm “learns” from. Most of these algorithms require the researcher to include a comparable number of examples for each class, but it is often not possible to build balanced data sets due to a number of factors. Situations often arise in which the number of examples of one class in the dataset greatly exceeds the number of examples of the minor class (this class will be called the minority class, and the other, prevailing one, the majority class). The key factors are the specifics of the target domain (balancing the data can lower its representativeness) and the different costs of errors of the first and second kind in classification. Such trends are clearly visible, for example, in credit scoring, medicine and marketing [2, 3].

Workshop on information technology and scientific computing in the framework of the X International Conference
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

This leads to the problem of training a model on imbalanced data (data whose distribution is skewed, so that the mode and the mean are not equal). According to the basic assumptions built into most algorithms, the goal of training is to maximize the proportion of correct decisions among all decisions made, and the training data and the general population follow the same distribution. Under these assumptions, however, an imbalanced sample can yield a model that classifies no better than a trivial model that completely ignores the less represented class and assigns every object to the majority class.

On the other hand, it is possible to build an overly complex model that includes a large set of rules, each of which covers only a small number of objects. Such a classifier may be ineffective: the model overfits and produces incorrect forecast estimates. It should be noted that the consequences of erroneous classification may also differ. An incorrect classification of a minority class example usually costs many times more than an erroneous classification of an object from the majority class. The correct selection of features may be more important than reducing data processing time or improving classification accuracy. For example, in medicine, finding the minimum set of features that is optimal for the classification task may be a prerequisite for making a diagnosis. Thus, to avoid these problems and achieve a good result, it is necessary to study methods for working with imbalanced data.

In this paper, we study methods for overcoming class imbalance in order to achieve higher classification accuracy than is obtained by applying classification algorithms directly to imbalanced classes. To improve accuracy, we propose a scheme that combines classification algorithms with the feature selection methods RFE, Random Forest and Boruta, preceded by class balancing with random sampling, SMOTE and ADASYN.

2. Basic algorithms for balancing classes

One approach to solving this problem is to use various sampling strategies, which can be divided into two groups: random and special [3]. A sampling strategy either deletes a certain number of examples of the majority class (undersampling) or increases the number of examples of the minority class (oversampling).

2.1. The exclusion of examples of the majority class. Algorithm for random sampling of the majority class (random undersampling)

To do this, we calculate K, the number of majority examples that must be removed to achieve the required ratio between the classes. Then K majority examples are randomly selected and removed. For the data studied here, methods that increase the minority class are the natural choice, so we now turn to such strategies.
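The random undersampling step described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `random_undersample`, its parameters and the toy data are assumptions made for the example and are not taken from the authors' code.

```python
import random
from collections import Counter


def random_undersample(X, y, majority_label, target_count, seed=0):
    """Randomly drop majority-class rows until `target_count` of them remain."""
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(y) if label == majority_label]
    # K = number of majority examples that must be removed
    k = max(len(majority_idx) - target_count, 0)
    to_remove = set(rng.sample(majority_idx, k))
    keep = [i for i in range(len(y)) if i not in to_remove]
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (hypothetical): 8 majority examples (label 0), 3 minority examples (label 1)
X = [[float(i)] for i in range(11)]
y = [0] * 8 + [1] * 3
X_res, y_res = random_undersample(X, y, majority_label=0, target_count=3)
print(sorted(Counter(y_res).items()))
```

Here K = 8 - 3 = 5 majority rows are removed, so both classes end up with three examples.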

2.2. The increase in the minority class. Duplicate examples of a minority class (oversampling). Random naive sampling

The easiest way to increase the number of examples of a minority class is to randomly select observations from it and add them to the dataset until a balance between the majority and minority classes is reached. The number of random records to duplicate is chosen according to the required class ratio. One problem with random naive sampling is that it simply duplicates existing data. The advantages of random sampling include its simplicity, ease of implementation and the ability to change the balance in any desired direction. The disadvantages depend on which sampling strategy is used: although both undersampling and oversampling change the overall size of the data in order to reach a balance, their consequences differ. In the case of undersampling, deleting data may cause the class to lose important information and, as a result, reduce its representativeness.
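A minimal sketch of random naive oversampling for several classes is given below. It assumes that every class is grown to the size of the largest one; the helper name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - count):
            j = rng.choice(idx)  # sampling with replacement: rows are duplicated
            X_out.append(X[j])
            y_out.append(label)
    return X_out, y_out


# Toy data (hypothetical): class 0 has four examples, classes 1 and 2 have one each
X = [[0.0], [0.1], [0.2], [0.3], [5.0], [9.0]]
y = [0, 0, 0, 0, 1, 2]
X_res, y_res = random_oversample(X, y)
print(sorted(Counter(y_res).items()))  # [(0, 4), (1, 4), (2, 4)]
```

Every added row is an exact copy of an existing one, which is why this strategy adds no new information to the minority classes.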

In turn, oversampling can lead to overfitting [3]. This approach to restoring balance is not always effective, so a special method was proposed to increase the number of examples of a minority class: the SMOTE algorithm (Synthetic Minority Oversampling Technique) [4]. SMOTE is based on the idea of generating a certain number of artificial examples that are “similar” to those in the minority class but do not duplicate them. To create a new record, find the difference $d = X_b - X_a$, where $X_a$, $X_b$ are the feature vectors of “neighboring” examples $a$ and $b$ from the minority class. The neighbors are found using the nearest neighbors algorithm (KNN). In this case it is necessary and sufficient to obtain a set of $k$ neighbors, from which the example $b$ is later selected; the remaining steps of the KNN algorithm are not required. Then, by multiplying each element of $d$ by a random number in the interval (0, 1), we get $\tilde{d}$. The feature vector of the new example is calculated by adding $X_a$ and $\tilde{d}$. The SMOTE algorithm lets you set the number of records to be artificially generated. The degree of similarity of examples $a$ and $b$ can be adjusted by changing the value of $k$ (the number of nearest neighbors). The SMOTE algorithm is illustrated in Figure 2.
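The generation step can be sketched in a few lines of standard-library Python. This is an illustrative simplification, not the reference implementation: it assumes a single random factor per synthetic example (so the new point lies on the segment between its two parents), a brute-force neighbor search, and hypothetical names `k_nearest` and `smote`.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return the k candidates closest to `point` (Euclidean distance), excluding the point itself."""
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: math.dist(point, c))[:k]


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority examples by interpolating between neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x_a = rng.choice(minority)                     # parent example a
        x_b = rng.choice(k_nearest(x_a, minority, k))  # neighbor b among the k nearest
        lam = rng.random()                             # random factor in [0, 1)
        # new example = X_a + lam * (X_b - X_a)
        synthetic.append([a + lam * (b - a) for a, b in zip(x_a, x_b)])
    return synthetic


# Toy minority class (hypothetical values)
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]]
new_points = smote(minority, n_new=5)
for p in new_points:
    print([round(v, 3) for v in p])
```

Increasing `k` lets `x_b` be chosen from more distant neighbors, which makes the synthetic examples less similar to their parent.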

SMOTE solves many problems inherent to random sampling and enlarges the initial data set in such a way that the model is trained much more efficiently [5]. However, this algorithm has drawbacks, the main one being that it ignores the majority class. This may result in a highly sparse distribution of minority class objects relative to the majority class, where the data sets are “mixed”, i.e. arranged in such a way that it is very difficult to separate objects of one class from another.

An example of this phenomenon is when an object of a different class lies between an object and the neighbor used to generate a new instance. As a result, the synthetic object will be closer to the opposite class than to the class of its parents. In addition, the number of instances generated by SMOTE is set in advance, which reduces the flexibility of the method and the ability to change the balance.

It is important to note a significant limitation of the SMOTE algorithm [6]. Since it works by interpolating between rare examples, it can only generate examples inside the body of available examples, never outside. Formally, SMOTE can only fill in the convex hull of existing minority examples; it cannot create new regions outside it. The main advantage of SMOTE over traditional random naive oversampling is that, because it creates synthetic observations instead of reusing existing ones, the classifier is less likely to overfit. At the same time, it is always necessary to make sure that the observations created by SMOTE are realistic.
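The convex-hull limitation follows directly from the interpolation step. As a short worked note, assume a single random factor $\lambda$ per synthetic example (an assumption made here for clarity), and let $X_a$ and $X_b$ denote the feature vectors of two neighboring minority examples:

```latex
x_{\text{new}} = X_a + \lambda\,(X_b - X_a) = (1 - \lambda)\,X_a + \lambda\,X_b,
\qquad \lambda \in (0, 1).
```

Both weights $(1 - \lambda)$ and $\lambda$ are non-negative and sum to one, so $x_{\text{new}}$ is a convex combination of two existing minority examples and therefore lies on the segment between them, inside the convex hull of the minority class.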

2.3. Adaptive synthetic sampling algorithm and its generalizations

This method is based on synthetic sampling algorithms, the main ones being Borderline-SMOTE and Adaptive Synthetic Sampling (ADASYN) [ 7, 8 ]. Borderline-SMOTE imposes restrictions on the selection of objects of the minority class that new instances are generated from. This happens as follows: for each object of a minority class, a set of k nearest neighbors is determined, then it is calculated how many instances of this set belong to the majority class (this number is taken as m).

After this, we select those objects of the minority class for which the inequality $k/2 \leqslant m < k$ holds. The resulting set consists of minority class instances located on the distribution boundary, which are more likely to be misclassified than the others. The inequality deliberately excludes the case in which all $k$ neighbors belong to the majority class: such instances lie in the “mixing” zone of the two classes, and only objects that distort the model learning process can be generated from them. For this reason they are treated as noise and ignored by the algorithm.
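The selection rule can be illustrated with a small sketch that counts, for each minority object, how many of its k nearest neighbors belong to the majority class (the number m) and keeps the objects with k/2 <= m < k. The function name `borderline_points` and the toy coordinates are assumptions made for the example.

```python
import math


def borderline_points(minority, majority, k=3):
    """Return the minority points whose k nearest neighbors contain m majority points, k/2 <= m < k."""
    labeled = [(p, 0) for p in majority] + [(p, 1) for p in minority]
    danger = []
    for x in minority:
        neighbors = sorted(
            (item for item in labeled if item[0] is not x),
            key=lambda item: math.dist(x, item[0]),
        )[:k]
        m = sum(1 for _, label in neighbors if label == 0)  # majority neighbors
        if k / 2 <= m < k:
            danger.append(x)
    return danger


# Toy data (hypothetical): three "safe" minority points, one on the border, one isolated
minority = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [1.0, 1.0], [3.0, 3.0]]
majority = [[1.1, 1.0], [1.0, 1.2], [2.0, 2.0], [3.0, 3.1], [3.1, 3.0], [2.9, 3.1]]
danger = borderline_points(minority, majority)
print(danger)  # [[1.0, 1.0]]
```

In this toy set the point (1, 1) has two majority neighbors out of three and is selected, while (3, 3) is surrounded only by majority points (m = k) and is discarded as noise.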

The ADASYN algorithm, in turn, is based on a systematic method that adaptively generates different amounts of synthetic data in accordance with the data distribution [7]. The input of the algorithm is a training data set $D_{tr}$ with $m$ samples $\{x_i, y_i\}$, $i = 1, \ldots, m$, where $x_i$ is an $n$-dimensional vector in the feature space and $y_i$ is the corresponding class label. Let $m_s$ and $m_l$ be the numbers of samples of the minority and majority classes, respectively, such that $m_s \ll m_l$ and $m_s + m_l = m$. The algorithm proceeds as follows:

1. Calculate the proportion of classes $d = m_s / m_l$;
2. If $d < d_{th}$ (where $d_{th}$ is the specified threshold for the maximum allowable class imbalance):
a) find the number of synthetic samples of the minority class to be generated, $G = (m_l - m_s) \times \beta$, where $\beta$ is the parameter used to determine the desired balance level ($\beta = 1$ indicates full class balance);
b) for each minority example $x_i$, find the $K$ nearest neighbors using the Euclidean distance and calculate $r_i = \Delta_i / K$, where $\Delta_i$ is the number of those neighbors that belong to the majority class;
c) normalize $\hat{r}_i = r_i / \sum_{i=1}^{m_s} r_i$ so that $\hat{r}_i$ becomes a density distribution;
d) calculate $g_i = \hat{r}_i \times G$, the number of synthetic samples to be generated for each minority example $x_i$, where $G$ is the total number of synthetic examples;
e) for each minority example $x_i$, generate $g_i$ synthetic examples by repeating the following steps in a loop from 1 to $g_i$:
(i) randomly select one minority example $x_{zi}$ from the $K$ nearest neighbors of $x_i$;
(ii) create a synthetic example $s_i = x_i + (x_{zi} - x_i) \times \lambda$, where $(x_{zi} - x_i)$ is the difference vector in the $n$-dimensional Euclidean space and $\lambda \in [0, 1]$ is a random number.
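The steps above can be sketched as follows. This is a simplified two-class illustration using only the standard library, with several assumptions that are not part of the description: the per-example counts are rounded to integers, the partner for interpolation is drawn from the nearest minority neighbors, and the names `adasyn`, `beta`, `d_th` and the toy data are hypothetical.

```python
import math
import random


def adasyn(minority, majority, beta=1.0, k=3, d_th=1.0, seed=0):
    """Generate synthetic minority examples, more of them near majority-dominated regions."""
    rng = random.Random(seed)
    if len(minority) / len(majority) >= d_th:  # step 1-2: imbalance below threshold?
        return []
    G = (len(majority) - len(minority)) * beta  # step a: total number to generate
    labeled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    r = []
    for x in minority:  # step b: share of majority points among the k nearest neighbors
        nbrs = sorted((it for it in labeled if it[0] is not x),
                      key=lambda it: math.dist(x, it[0]))[:k]
        r.append(sum(1 for _, lab in nbrs if lab == 0) / k)
    total = sum(r)
    if total == 0:  # no minority point has majority neighbors
        return []
    synthetic = []
    for x, r_i in zip(minority, r):
        g_i = round(r_i / total * G)  # steps c-d (rounding is an assumption of this sketch)
        min_nbrs = sorted((p for p in minority if p is not x),
                          key=lambda p: math.dist(x, p))[:k]
        for _ in range(g_i):  # step e: interpolate towards a minority neighbor
            x_z = rng.choice(min_nbrs)
            lam = rng.random()
            synthetic.append([a + lam * (b - a) for a, b in zip(x, x_z)])
    return synthetic


# Toy data (hypothetical): 3 minority and 7 majority points
minority = [[0.0, 0.0], [0.5, 0.0], [2.0, 2.0]]
majority = [[2.1, 2.0], [2.0, 2.2], [1.9, 2.1], [3.0, 3.0],
            [3.2, 3.1], [3.1, 2.9], [0.3, 0.3]]
synthetic = adasyn(minority, majority)
print(len(synthetic))
```

Minority points whose neighborhoods are dominated by the majority class receive a larger share of the synthetic examples, which is the adaptive part of the method.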

The main difference between SMOTE and ADASYN lies in how synthetic samples are created for the minority class. ADASYN uses a density function to determine the number of synthetic samples created for a specific point, whereas SMOTE applies a single uniform weight to all minority points.

3. Research data: description and characteristics

In this paper, a data set on skin diseases was used for testing and comparative analysis of the methods described above for eliminating class imbalance. Diagnosis of erythemato-squamous diseases is a serious problem in dermatology, and modern principles of diagnosis and treatment are based on the earliest possible detection of the disease. These diseases all share common clinical features with very small differences. Another difficulty for diagnosis is that a disease may show signs of another disease at the initial stage and develop its characteristic signs only in subsequent stages.

The study data was created by Nielsen in 1998 and contains 366 observations forming 6 classes that are characterized by 34 features [9]. The classes are: psoriasis (class 1), 112 cases; seborrheic dermatitis (class 2), 72 cases; lichen planus (class 3), 61 cases; pityriasis rosea (class 4), 49 cases; chronic dermatitis (class 5), 52 cases; pityriasis rubra pilaris (class 6), 20 cases. A full description of the data is given in [10].
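The degree of imbalance in this data set can be checked with a few lines of arithmetic. The sketch below uses the class sizes listed above; the imbalance ratio (largest class size divided by smallest) is a common summary chosen here for illustration and is not a quantity reported by the authors.

```python
# Class sizes as listed in the text (class number -> number of cases)
class_counts = {1: 112, 2: 72, 3: 61, 4: 49, 5: 52, 6: 20}

total = sum(class_counts.values())
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

print(total)                      # 366
print(round(imbalance_ratio, 1))  # 5.6
for label, count in class_counts.items():
    print(f"class {label}: {count / total:.1%}")
```

The largest class is 5.6 times the size of the smallest one, so a classifier can reach a high overall accuracy while performing poorly on class 6.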

4. Computer experiments

Data studies were performed using the following algorithm:

1. Data pre-processing: filling the gaps in the data and encoding the features.
2. Balancing classes using the sampling algorithms described above.
3. Selecting features based on their importance.
4. Classification using logistic regression and the support vector machine.
5. Assessment of classification quality.

In this paper, features were selected on the basis of their importance and informativeness using the following methods: a) recursive feature elimination (RFE) [5]; b) Random Forest (RF) feature importance [11]; c) Boruta [12].

The Random Forest algorithm is an ensemble of numerous classification algorithms (decision trees). Each of these classifiers is built on a random subset of objects and a random subset of features. Let the training sample consist of $n$ examples, let the dimension of the feature space be $M$, and let an additional parameter $m$ be set. All trees are built independently of each other using the following procedure:

1. Generate a random sub-sample with replacement of size $n$ from the training sample.
2. Build a decision tree that classifies the examples of this sub-sample; when creating each node of the tree, select the feature used for the split not from all $M$ features, but only from $m$ randomly selected ones.
3. Grow the tree until the sub-sample is completely exhausted, without pruning its branches.
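The two sources of randomness in this procedure, the sub-sample of objects and the subset of features considered at a split, can be sketched as follows. The helper names and the values of n, M and m are illustrative assumptions (M = 34 matches the number of features in the data set used in this paper; m = 5 is arbitrary).

```python
import random


def bootstrap_sample(n, rng):
    """Draw n row indices with replacement; also return the rows that were never drawn."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    not_drawn = sorted(set(range(n)) - set(in_bag))
    return in_bag, not_drawn


def feature_subset(M, m, rng):
    """Pick m distinct feature indices out of M as candidates for one split."""
    return rng.sample(range(M), m)


rng = random.Random(0)
n, M, m = 10, 34, 5
in_bag, not_drawn = bootstrap_sample(n, rng)
split_features = feature_subset(M, m, rng)
print("rows used by the tree:", in_bag)
print("rows never drawn:     ", not_drawn)
print("candidate features:   ", split_features)
```

Because rows are drawn with replacement, some indices repeat in `in_bag` while others do not appear at all.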

Object classification is carried out by voting: each tree of the ensemble assigns the object to one of the classes, and the class that the largest number of trees voted for wins. To use Random Forest for evaluating the importance of features, it is necessary to train the algorithm on the sample and calculate the out-of-bag error for each example of the training sample [11].
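The voting rule itself is simple to express. In the sketch below the list `votes` stands for the class labels predicted by seven hypothetical trees for one object; the function name `majority_vote` is an assumption made for the example.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Predictions of seven hypothetical trees for a single object
votes = [3, 1, 3, 2, 3, 1, 3]
print(majority_vote(votes))  # 3
```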

Let $X_t$ be the bootstrapped sample of the $t$-th tree $b_t$, $t = 1, \ldots, T$. Bootstrapping is the selection of $\ell$ objects from the sample with replacement, as a result of which some objects are selected several times and some never. Placing multiple copies of the same object in a bootstrapped sample corresponds to assigning a weight to this object: the corresponding term is included in the functional several times, and therefore the error penalty on it is greater. Let $L(y, z)$ be the loss function and $y_i$ be the response on the $i$-th object of the training sample; then the out-of-bag error is calculated using the following formula:

$$\mathrm{OOB} = \sum_{i=1}^{\ell} L\left(y_i,\ \frac{\sum_{t=1}^{T} [x_i \notin X_t]\, b_t(x_i)}{\sum_{t=1}^{T} [x_i \notin X_t]}\right).$$

Then, for each object, this error is averaged across the entire random forest. To evaluate the importance of a feature, its values are shuffled across all objects in the training sample, and the out-of-bag error is computed again. The importance of the feature is estimated by averaging, across all trees, the difference in out-of-bag errors before and after shuffling the values. These values are normalized by the standard deviation.
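The idea of measuring importance by shuffling a feature can be illustrated with a deliberately simplified sketch. It does not build a forest or use out-of-bag samples: a fixed hypothetical predictor stands in for the trained model, the error is the misclassification rate on a small toy set, and the names `error_rate`, `permutation_importance` and `predict` are assumptions made for the example.

```python
import random


def error_rate(predict, X, y):
    """Share of misclassified rows."""
    return sum(predict(x) != t for x, t in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, feature, n_repeats=20, seed=0):
    """Average increase in error after shuffling the values of one feature."""
    rng = random.Random(seed)
    base = error_rate(predict, X, y)
    increases = []
    for _ in range(n_repeats):
        column = [x[feature] for x in X]
        rng.shuffle(column)
        X_perm = [x[:feature] + [v] + x[feature + 1:] for x, v in zip(X, column)]
        increases.append(error_rate(predict, X_perm, y) - base)
    return sum(increases) / n_repeats


def predict(x):
    """Hypothetical stand-in for a trained model: it looks only at feature 0."""
    return int(x[0] > 0.5)


X = [[0.1, 0.9], [0.2, 0.1], [0.3, 0.8], [0.4, 0.2],
     [0.6, 0.7], [0.7, 0.3], [0.8, 0.6], [0.9, 0.4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
imp0 = permutation_importance(predict, X, y, feature=0)
imp1 = permutation_importance(predict, X, y, feature=1)
print(imp0, imp1)
```

Shuffling feature 1 cannot change the predictions of this stand-in model, so its importance is exactly zero, whereas shuffling feature 0 breaks the link between the feature and the label and raises the error.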

Boruta is a heuristic algorithm for selecting significant features based on Random Forest [12]. At each iteration, features whose Z-measure is less than the maximum Z-measure among the added features are removed. To get the Z-measure of a feature, calculate the feature’s importance using the built-in algorithm in Random Forest and divide it by the standard deviation of the feature importance. The added features are obtained as follows: the features present in the sample are copied, and then each copy is filled in by shuffling its values. To get statistically significant results, this procedure is repeated several times, and the added variables are generated independently at each iteration.

The Boruta algorithm, step by step:

1. Add copies of all features to the data. From here on, the copies are called hidden features.
2. Randomly shuffle each hidden feature.
3. Run Random Forest and get the Z-measure of all features.
4. Find the maximum Z-measure among the hidden features.
5. Delete features that have a Z-measure smaller than the one found in the previous step.
6. Remove all hidden features.
7. Repeat all the steps until the Z-measure of every feature is greater than the maximum Z-measure of the hidden features.
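A single pass of this procedure can be sketched as follows. To keep the example self-contained, the Random Forest Z-measure is replaced by a much simpler stand-in score, the absolute Pearson correlation with the label; this substitution, the function names and the toy data are assumptions of the sketch and not part of Boruta itself.

```python
import math
import random


def abs_correlation(column, y):
    """Stand-in importance score (NOT the Random Forest Z-measure used by Boruta)."""
    n = len(y)
    mean_c, mean_y = sum(column) / n, sum(y) / n
    cov = sum((c - mean_c) * (t - mean_y) for c, t in zip(column, y))
    var_c = sum((c - mean_c) ** 2 for c in column)
    var_y = sum((t - mean_y) ** 2 for t in y)
    return abs(cov) / math.sqrt(var_c * var_y)


def boruta_iteration(columns, y, rng):
    """One pass: keep the features that score higher than the best shuffled copy."""
    hidden = {}
    for name, col in columns.items():
        copy = list(col)   # step 1: copy the feature
        rng.shuffle(copy)  # step 2: shuffle the copy
        hidden[name] = copy
    threshold = max(abs_correlation(c, y) for c in hidden.values())  # steps 3-4
    return [name for name, col in columns.items()
            if abs_correlation(col, y) > threshold]                  # step 5


# Toy data (hypothetical): one informative feature and one unrelated feature
y = [0, 1] * 10
columns = {
    "signal": [float(t) for t in y],
    "noise": [float((i * 7) % 5) for i in range(20)],
}
selected = boruta_iteration(columns, y, random.Random(0))
print(selected)
```

In the full algorithm this pass is repeated with freshly shuffled copies, and a feature is kept or rejected according to how consistently it beats the best hidden feature.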

5. Results and discussion

To solve the problem of multiclass classification on imbalanced data, two machine learning algorithms were chosen: logistic regression and the support vector machine with a linear kernel (Linear SVM). All calculations were implemented in Python; the results, data, and program code are available in the authors' repository [10], and some algorithms in [13, 14, 15]. Some fragments are presented in the Program Code appendix. Three metrics were used to compare classification results: accuracy, recall, and F1-measure. The results are presented in Table 2 and Table 3.
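The three metrics can be computed by hand as in the sketch below. It assumes macro averaging over classes for recall and F1 (one of several averaging modes) and uses a hypothetical toy example in which a trivial model assigns every object to the majority class.

```python
def macro_scores(y_true, y_pred):
    """Return (accuracy, macro-averaged recall, macro-averaged F1)."""
    labels = sorted(set(y_true))
    recalls, f1s = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, sum(recalls) / len(labels), sum(f1s) / len(labels)


# Trivial model on imbalanced toy data: everything is labeled as the majority class 0
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10
accuracy, recall, f1 = macro_scores(y_true, y_pred)
print(accuracy, recall, round(f1, 3))  # 0.8 0.5 0.444
```

The accuracy of 0.8 looks acceptable, but the macro-averaged recall and F1 reveal that the minority class is never recognized, which is why accuracy alone is not a sufficient quality measure on imbalanced data.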

The first column of Table 2 lists the sampling methods used. The second column shows the feature selection methods, and the third column shows the number of selected features. The remaining columns show the values of the quality metrics obtained by applying the support vector machine (SVM) to the transformed data. Table 3 is constructed similarly and contains the results of classification using logistic regression.

The results in Table 2 and Table 3 show that in all cases the use of sampling methods yielded higher classification accuracy than on the imbalanced data. Within the scheme described in this paper, the best classification accuracy was achieved by applying the ADASYN class balancing algorithm and then selecting features with the Random Forest algorithm. For comparison, in the works of other researchers who conducted similar studies, for example [5, 11], the classification accuracy reached only 93%.

6. Conclusion

In this paper, we propose a scheme for improving the accuracy of classification on imbalanced data using algorithms for class balancing and feature selection, such as RFE, Boruta, Random Forest, and others. The results of computational experiments have shown the effectiveness of this scheme. In particular, the ADASYN algorithm raised classification accuracy to as much as 98%, higher than the other algorithms. The problem discussed in this paper remains relevant, and existing methods can be improved. A recent trend in data mining is so-called deep learning, which develops deep neural networks as a tool for solving various classification problems. We hope to apply them in our future research on the classification of imbalanced classes.

Acknowledgments

The work is partially supported by RFBR grant No. 18-07-00567.

A. Program Code

    # Assumes X_train, X_test (pandas DataFrames) and y_train, y_test are already prepared.
    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns
    from boruta import BorutaPy
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score

    SEED = 0  # assumed value; the listing uses SEED without defining it


    def print_scores(y_true, y_pred):
        # Print macro-, micro- and weighted-averaged F1 and recall
        for name, metric in (('f1_score', f1_score), ('recall_score', recall_score)):
            for average in ('macro', 'micro', 'weighted'):
                print('{} ({})'.format(name, average), metric(y_true, y_pred, average=average))


    # Fit logistic regression on all features
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    # Test accuracy
    lr_pred = lr.predict(X_test)
    acc = accuracy_score(y_test, lr_pred)
    print('Test Accuracy {}'.format(acc))
    # Plot confusion matrix
    cm = confusion_matrix(y_test, lr_pred)
    sns.heatmap(cm, fmt='d', cmap='GnBu', cbar=False, annot=True)

    # ==========================================
    # Recursive Feature Selection
    # ==========================================
    rfe = RFECV(estimator=LogisticRegression(), cv=4, scoring='accuracy')
    rfe = rfe.fit(X_train, y_train)
    # Select variables and calculate test accuracy
    cols = X_train.columns[rfe.support_]
    lr_pred = rfe.estimator_.predict(X_test[cols])
    acc = accuracy_score(y_test, lr_pred)
    print('Number of features selected: {}'.format(rfe.n_features_))
    print('Test Accuracy {}'.format(acc))
    # Plot number of features vs CV scores
    # (scikit-learn versions before 1.2 exposed these scores as rfe.grid_scores_)
    cv_scores = rfe.cv_results_['mean_test_score']
    plt.figure()
    plt.xlabel('k')
    plt.ylabel('CV accuracy')
    plt.plot(np.arange(1, cv_scores.size + 1), cv_scores)
    plt.show()
    print_scores(y_test, lr_pred)

    # ==========================================
    # Feature importances
    # ==========================================
    # Feature importance values from Random Forests
    rf = RandomForestClassifier(n_jobs=-1, random_state=SEED)
    rf.fit(X_train, y_train)
    feat_imp = rf.feature_importances_
    # Select features and fit Logistic Regression
    cols = X_train.columns[feat_imp >= 0.01]
    est_imp = LogisticRegression()
    est_imp.fit(X_train[cols], y_train)
    # Test accuracy
    lr_pred = est_imp.predict(X_test[cols])
    acc = accuracy_score(y_test, lr_pred)
    print('Number of features selected: {}'.format(len(cols)))
    print('Test Accuracy {}'.format(acc))
    print_scores(y_test, lr_pred)

    # ==========================================
    # Boruta
    # ==========================================
    # Random Forests for Boruta
    rf_boruta = RandomForestClassifier(n_jobs=-1, random_state=SEED)
    # Perform Boruta
    boruta = BorutaPy(rf_boruta, n_estimators='auto', verbose=2)
    boruta.fit(X_train.values, y_train.values.ravel())
    # Select features and fit Logistic Regression
    cols = X_train.columns[boruta.support_]
    est_boruta = LogisticRegression()
    est_boruta.fit(X_train[cols], y_train)
    # Test accuracy
    lr_pred = est_boruta.predict(X_test[cols])
    acc = accuracy_score(y_test, lr_pred)
    print('Number of features selected: {}'.format(len(cols)))
    print('Test Accuracy {}'.format(acc))
    print_scores(y_test, lr_pred)

[1] Patterson, Gibson, Deep Learning: A Practitioner's Approach, O'Reilly Media, 2017.
[2] He, Garcia, Learning from imbalanced data, IEEE Transactions on Knowledge and Data Engineering 21 (2009) 1263-1284. doi:10.1109/TKDE.2008.239.
[3] Japkowicz, Stephen, The class imbalance problem: A systematic study, Intelligent Data Analysis 6 (2002) 429-449. doi:10.3233/IDA-2002-6504.
[4] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002) 321-357. URL: http://dx.doi.org/10.1613/jair.953. doi:10.1613/jair.953.
[5] Lin, Yang, Yin, Kong, Xing, Wu, Jia, Wang, Xu, A support vector machine-recursive feature elimination feature selection method based on artificial contrast variables and mutual information, Journal of Chromatography B: Analytical Technologies in the Biomedical and Life Sciences 910 (2012) 149-155. doi:10.1016/j.jchromb.2012.05.
[6] Abdi, S. Hashemi, 28 (2016) 238-251.
[7] He, Bai, E. A. Garcia, Li, ADASYN: Adaptive synthetic sampling approach for imbalanced learning, in: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), 2008, pp. 1322-1328.
[8] Han, W.-Y. Wang, B.-H. Mao, Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning, volume 3644, 2005, pp. 878-887. doi:10.1007/11538059_91.
[9] P. M. Murphy, D. W. Aha, UCI repository of machine learning databases, Irvine: University of California, Department of Information and Computer Science, 1998. URL: https://www.ics.uci.edu/mlearn/MLRepository.html.
[10] Dermatology-article, 2020. URL: https://github.com/riviera2015/Dermatology-article.
[11] Tuv, Borisov, G. Runger, Torkkola, Feature selection with ensembles, artificial variables, and redundancy elimination, Journal of Machine Learning Research 10 (2009) 1341-1366.
[12] Kursa, W. Rudnicki, Feature selection with Boruta package, Journal of Statistical Software 36 (2010) 1-13. doi:10.18637/jss.v036.i11.
[13] Lyubin, E. Shchetinin, Fast two-dimensional smoothing with discrete cosine transform, volume 678, 2016, pp. 646-656. doi:10.1007/978-3-319-51917-3_55.
[14] E. Y. Shchetinin, Cluster-based energy consumption forecasting in smart grids, in: V. M. Vishnevskiy, D. V. Kozyrev (Eds.), Distributed Computer and Communication Networks, volume 919, Springer International Publishing, Cham, 2018, pp. 445-456.
[15] L. A. Sevastianov, E. Y. Shchetinin, On methods for improving the accuracy of multiclass classification on imbalanced data, Informatics and Applications 14 (2020) 63-70. doi:10.

    # Assumes X_train, y_train and the fitted BorutaPy selector `boruta` are already available.
    from collections import Counter

    from imblearn.over_sampling import SMOTE, RandomOverSampler
    from sklearn.linear_model import LogisticRegression

    # Select features and fit Logistic Regression
    cols = X_train.columns[boruta.support_]
    est_boruta = LogisticRegression()
    est_boruta.fit(X_train[cols], y_train)

    # Random oversampling
    ros = RandomOverSampler(random_state=0)
    X_resampled, y_resampled = ros.fit_resample(X_train, y_train)
    # Using Counter to display results of naive oversampling
    print(sorted(Counter(y_resampled).items()))

    # Applying SMOTE to our data and checking the class counts
    X_resampled1, y_resampled1 = SMOTE().fit_resample(X_train, y_train)
    print(sorted(Counter(y_resampled1).items()))