<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Influence of Data Dimension Reduction, Feature Scaling and Activation Function on Machine Learning Performance</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Grzegorz Słowiński</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Technology and Economics</institution>
          ,
          <addr-line>ul. Jagiellońska 82f, 03-301 Warsaw</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A dataset containing over 13k samples of dry bean geometric features is analysed using machine learning (ML) and deep learning (DL) techniques with the goal of automatically classifying the bean species. The obtained geometrical data has considerable redundancy: many features are strongly correlated. This work analyses the influence of data dimension reduction (DDR), i.e. elimination of excess, strongly correlated features, and of feature scaling (FS), often called normalization, on machine learning performance (measured in terms of accuracy and approximate training time). Additionally, the influence of the activation function (sigmoid vs. ReLU) on artificial neural network performance has been checked.</p>
      </abstract>
      <kwd-group>
        <kwd>machine learning</kwd>
        <kwd>deep learning</kwd>
        <kwd>data dimension reduction</kwd>
        <kwd>feature scaling</kwd>
        <kwd>activation function</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Classification of dry beans is of economic importance, and manual classification is labour
intensive. Over 13k samples of dry beans of 7 various species were photographed and their
geometry was measured via computer vision techniques in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Then the set was analysed via several
machine learning (or data science) and deep learning (artificial neural network) techniques. The
overall accuracy obtained was 87.92-93.13%, depending on the technique used.
      </p>
      <p>
        The dataset used in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] has been published in the UCI machine learning repository [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In this
work, the bean dataset was used as material for investigating how the machine learning process is
influenced by the following factors: 1) data dimension reduction, 2) feature scaling (or data
normalization) and 3) in the case of neural networks, the activation function used (ReLU vs. sigmoid).
      </p>
      <p>
        The research question examined in this work is: how do data dimension reduction, feature scaling
and activation function influence machine learning performance? This question is related to
concurrency, specification and programming in the following way: among the topics of CS&amp;P 2021 one
can find model checking and testing (this work checks different ML models), knowledge discovery
and data mining (machine learning belongs to this field), and soft computing (artificial neural
networks are categorized as a kind of soft computing).
      </p>
      <sec id="sec-1-1">
        <title>1.1. Data dimension reduction</title>
        <p>
          In work [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] the data dimension was not reduced, although many features are strongly correlated. This
work investigates the effect of data dimension reduction on performance (computing time and
accuracy).
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>1.2. Feature scaling</title>
      <p>
        In the handbook [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], page 72, Aurelien Geron states: "One of the most important transformations
you need to apply to your data is feature scaling. With few exceptions, Machine Learning algorithms
don’t perform well when the input numerical attributes have very different scales." This work verifies
this statement and investigates which ML methods really need feature scaling.
      </p>
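      <p>
        For illustration, a minimal sketch of feature scaling with scikit-learn follows. The choice of
StandardScaler is an assumption; the handbook [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] discusses both min-max scaling and standardization, and the paper does not name the scaler used.
      </p>
      <preformat>
# Minimal sketch: standardize features to zero mean and unit variance.
# StandardScaler is an assumed choice, not necessarily the one used here;
# the sample values are illustrative (e.g. Area, Perimeter of a bean).
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[28395.0, 610.3],
              [28734.0, 638.0],
              [29380.0, 624.1]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # fit statistics and transform in one step
      </preformat>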
    </sec>
    <sec id="sec-3">
      <title>1.3. Activation Function</title>
      <p>In this work two activation functions are compared: ReLU and sigmoid.</p>
    </sec>
    <sec id="sec-4">
      <title>2. Tools</title>
      <p>
        All computations were performed in a Google Colab notebook [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] on a runtime with a few GB of RAM and no graphical processing unit (GPU) acceleration.
      </p>
      <p>Majority of experiments performed were shallow learning that do not need GPU support. As the dry
beans dataset is relatively simple, the artificial neural network (ANN) applied was also rather simple
and GPU support was not crucial for ANN training. Training times were in range from
milliseconds to
a few minutes.</p>
    </sec>
    <sec id="sec-5">
      <title>3. Data</title>
      <p>The dataset under study consists of 13611 samples. A sample amounts to 16 geometrical features
and a label identifying the species of the bean. The species are: Barbunya, Bombay, Cali, Dermason,
Horoz, Seker, and Sira. The features are: Area, Perimeter, MajorAxisLength, MinorAxisLength,
AspectRatio, Eccentricity, ConvexArea, EquivDiameter, Extent, Solidity, Roundness, Compactness,
ShapeFactor1, ShapeFactor2, ShapeFactor3, and ShapeFactor4. A detailed explanation of how the
features were calculated is presented in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].</p>
      <p>
        Correlation analysis (see table 1) has shown that several of the features are strongly (positively or
negatively) correlated. This is due to the fact that essentially all of them are geometric measures. In
the original work [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] the issue of strong correlation between features was not addressed. Strongly correlated
features (correlation coefficient over 0.9) bring little extra information, so their elimination should
reduce computational complexity (speed up training) with little if any loss in classification accuracy.
      </p>
      <p>
        It is also sometimes suggested that feature scaling (often called normalization) can improve
performance [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], pages 72-73. This is also investigated. To give a brief visualisation of the beans dataset, a
pair-plot of selected (less correlated) features has been made, see figure 1.
      </p>
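      <p>
        As an illustration of the dimension-reduction step, the following sketch shows one common way to
drop one feature of each strongly correlated pair with pandas. The file name is hypothetical and the
approach is an assumption; the 0.9 threshold follows the text above, and the actual scripts for this
work are in the Colab notebook [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <preformat>
# Assumed approach: drop one feature of every pair whose absolute
# Pearson correlation exceeds 0.9 (file name is hypothetical).
import numpy as np
import pandas as pd

df = pd.read_csv("Dry_Bean_Dataset.csv")
corr = df.drop(columns=["Class"]).corr().abs()
# Keep only the upper triangle, so each feature pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df.drop(columns=to_drop)   # 16 features reduced to fewer
      </preformat>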
    </sec>
    <sec id="sec-6">
      <title>4. Shallow learning results</title>
      <p>The methods tried were: Naive Bayes Classifier, Decision Tree, Random Forest, and Support
Vector Classifier.</p>
    </sec>
    <sec id="sec-7">
      <title>4.1. Naive Bayes Classifier</title>
      <p>Results for the Gaussian naive Bayes classifier are shown in table 2. One can see that DDR and FS
have a small effect on training time. Using DDR or FS (or both) significantly increased accuracy, from
77.23% to 89.83-91.00%.</p>
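      <p>
        A minimal sketch of this experiment with scikit-learn is shown below. The file name, split ratio and
random seed are assumptions, so the resulting accuracy will differ slightly from table 2.
      </p>
      <preformat>
# Sketch: Gaussian naive Bayes on the dry beans data (assumed setup).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

df = pd.read_csv("Dry_Bean_Dataset.csv")   # hypothetical file name
X, y = df.drop(columns=["Class"]), df["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
clf = GaussianNB().fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
      </preformat>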
    </sec>
    <sec id="sec-8">
      <title>4.2. Decision tree</title>
      <p>Results for the decision tree are shown in table 3. The decision tree applied was limited to 16 leaf
nodes and a maximum depth of 5. One can see that FS has no effect on accuracy and little effect on
training time. This is probably connected with the fact that a DT analyses one feature at a time, so it
does not care about the ratio of one feature's range to another's. DDR shortens training time with a
limited accuracy decrease.</p>
      <p>
        The decision tree is known to be sensitive to data “rotation”, see [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] p. 188. A DT analyses only one
feature at a time. Strongly correlated features give little extra information, but they can present that
information in a slightly different manner, more suitable for the decision tree.
      </p>
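      <p>
        The corresponding scikit-learn call, with the limits quoted above (16 leaf nodes, maximum depth 5),
could look as follows; the train/test split from the naive Bayes sketch is reused, and the random seed is
an assumption.
      </p>
      <preformat>
# Sketch: decision tree limited to 16 leaf nodes and depth 5, as in the text.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

tree = DecisionTreeClassifier(max_leaf_nodes=16, max_depth=5, random_state=42)
tree.fit(X_train, y_train)                 # X_train/y_train as defined above
print("accuracy:", accuracy_score(y_test, tree.predict(X_test)))
      </preformat>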
    </sec>
    <sec id="sec-9">
      <title>4.3. Random Forest Classifier</title>
      <p>Results for the random forest (RF) are shown in table 4. The random forest consisted of 150 trees.
No limits (max leaves, max depth, etc.) were put on the trees. One can observe that training times are
longer than for a single decision tree (which is reasonable, as here we have a set of decision trees). The
accuracies are high. DDR shortened training time and allowed for slightly higher accuracy (0.14-0.18
% points). It is quite interesting that although DDR slightly reduced accuracy for a single tree, it
improved accuracy for the RF. Similarly to the decision tree, FS has practically little effect on training
time.</p>
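      <p>
        A matching sketch for the random forest follows, with the 150 trees and no depth or leaf limits
mentioned above; all other settings are scikit-learn defaults, which is an assumption.
      </p>
      <preformat>
# Sketch: random forest of 150 unrestricted trees, as described in the text.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(n_estimators=150, random_state=42)
rf.fit(X_train, y_train)                   # X_train/y_train as defined above
print("accuracy:", accuracy_score(y_test, rf.predict(X_test)))
      </preformat>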
    </sec>
    <sec id="sec-10">
      <title>4.4. Support Vector Classifier</title>
      <sec id="sec-10-1">
        <title>Accuracy</title>
        <p>93.06%
93.24%
93.10%
93.24%</p>
      </sec>
      <sec id="sec-10-2">
        <title>Approx. training time</title>
        <p>4.69 s
2.69 s
4.79 s
2.59 s</p>
        <p>Results for support vector classifier (SVC) is shown in table 5. Polynomial kernel has been used.
Generally SVC is much more “heavier” model than gaussian classifier, decision tree or random forest.
Training times much longer. One can see that DDR or FS has small effect on SVC accuracy. DDR on
not scaled features reduced training time. Feature scaling significantly increased training time and
increased accuracy a little (about 1% point). The longest training time was observed for DDR and SF
data. The training time was 9 times longer than for DDR and not SF data. The author cannot
explained this effect.</p>
      </sec>
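      <p>
        A sketch of the SVC experiment with the polynomial kernel named above; the degree and other
hyperparameters are scikit-learn defaults here, which may differ from those actually used.
      </p>
      <preformat>
# Sketch: support vector classifier with a polynomial kernel, as in the text.
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

svc = SVC(kernel="poly")                   # default degree=3, C=1.0 assumed
svc.fit(X_train, y_train)                  # X_train/y_train as defined above
print("accuracy:", accuracy_score(y_test, svc.predict(X_test)))
      </preformat>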
    </sec>
    <sec id="sec-11">
      <title>5. Artificial neural network</title>
      <p>
        For an artificial neural network (ANN) the data needs additional treatment. First, the names of the
bean species were labelled with numbers, and then these numbers 0-6 were coded as so-called
“one-hot” vectors. The reason for using “one-hot” encoding is well explained, for example, in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] p. 376 or [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] pp. 190-194.
      </p>
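      <p>
        The label preparation described above can be sketched as follows; to_categorical is a standard
Keras utility, assumed here rather than confirmed by the paper, and df is the data frame loaded in the
earlier sketches.
      </p>
      <preformat>
# Sketch: map species names to integers 0-6, then one-hot encode them.
import numpy as np
from tensorflow.keras.utils import to_categorical

labels = df["Class"].to_numpy()            # species names, as loaded above
classes, y_int = np.unique(labels, return_inverse=True)   # names to 0..6
y_onehot = to_categorical(y_int, num_classes=7)            # shape (n, 7)
      </preformat>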
      <p>
        Three experiments have been performed to analyse: 1) the influence of data dimension reduction,
2) the influence of feature scaling and 3) the influence of the activation function (sigmoid vs. ReLU).
The ANN architecture was kept as similar as possible to the one described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. All ANNs had 3 hidden layers
with 17, 12 and 3 neurons respectively. However, here the ReLU function has been used as the
“default” option. The output layer consisted of 7 neurons with the softmax activation function, one
for each class. Generally, training lasted for 16 epochs; however, as it was obvious that the ANN with
sigmoid activation was undertrained, this net was trained for 48 epochs. The training process is
presented in figure 2. The performance summary is presented in table 6.
It is visible that:
1. Feature scaling (or data normalisation) is very important for ANNs. An attempt to train
without prior data scaling failed: only 55.82% accuracy was obtained. Perhaps a bigger
network could manage this issue by rescaling the data in its first few layers, but this would
affect training time and accuracy.
2. ReLU works significantly better than the sigmoid function as an activation function. The
ReLU network trains faster and reaches better accuracy.
3. Data dimension reduction shortens training time by nearly half and increases accuracy by
about 0.58 % points.
      </p>
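      <p>
        A Keras sketch of the ANN described above: three hidden ReLU layers with 17, 12 and 3 neurons
and a 7-neuron softmax output, trained for 16 epochs. The optimizer and loss function are
assumptions, as the paper does not state them.
      </p>
      <preformat>
# Sketch of the ANN architecture from the text; optimizer/loss are assumed.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input

model = Sequential([
    Input(shape=(8,)),                 # 8 features after dimension reduction
    Dense(17, activation="relu"),
    Dense(12, activation="relu"),
    Dense(3, activation="relu"),
    Dense(7, activation="softmax"),    # one output neuron per bean species
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_scaled, y_onehot, epochs=16)   # 16 epochs, as in the text
      </preformat>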
    </sec>
    <sec id="sec-12">
      <title>6. Conclusions</title>
      <p>The influence of data dimension reduction, data scaling (or normalisation) and activation function
has been investigated. The influence depends on the machine learning technique used.</p>
      <p>Generally, data dimension reduction reduces training time with rather limited influence on
accuracy. Data scaling is a must in the case of artificial neural networks: omitting data scaling
decreased accuracy from about 93% to about 56%. In the case of shallow learning techniques its
influence is smaller; it sometimes helps a little with accuracy, sometimes not.</p>
      <p>Generally, scaling had no effect on decision tree and random forest performance. In the case of the
support vector classifier, scaling resulted in a huge training time increase; the author cannot explain
this effect.</p>
      <p>The highest accuracy observed was 93.24%. It was obtained 3 times, with: 1) the random forest
with 8 features (scaled and not scaled), 2) the ANN, 8 features, scaled, and 3) the SVC, 16 features,
scaled. It is quite intriguing that exactly the same maximum result repeated 3 times.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Murat</given-names>
            <surname>Koklu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ilker Ali</given-names>
            <surname>Ozkan</surname>
          </string-name>
          ,
          <article-title>Multiclass classification of dry beans using computer vision and machine learning techniques</article-title>
          ,
          <source>Computers and Electronics in Agriculture</source>
          <volume>174</volume>
          (
          <year>2020</year>
          )
          <fpage>105507</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] Dry beans dataset at UCI repository: https://archive.ics.uci.edu/ml/datasets/Dry+Bean+Dataset, accessed 23.06.2021
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] Colab notebook containing computation scripts for this work:
          https://colab.research.google.com/drive/1l5lH1QgesDX8CbbkqcnmlbqwcXfksGQB?usp=sharing
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Aurelien</given-names>
            <surname>Geron</surname>
          </string-name>
          ,
          <source>Hands-on Machine Learning with Scikit-Learn, Keras &amp; TensorFlow</source>
          , O'Reilly
          ,
          <year>2019</year>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Jake</given-names>
            <surname>VanderPlas</surname>
          </string-name>
          ,
          <source>Python Data Science Handbook</source>
          , O'Reilly
          ,
          <year>2017</year>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Francois</given-names>
            <surname>Chollet</surname>
          </string-name>
          ,
          <source>Deep Learning with Python</source>
          ,
          <publisher-name>Manning Publications</publisher-name>
          ,
          <year>2018</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>