<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Influence of Data Dimension Reduction, Feature Scaling and Activation Function on Machine Learning Performance</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Grzegorz Słowiński</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Technology and Economics</institution>
          ,
          <addr-line>ul. Jagiellońska 82f, 03-301 Warsaw</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A dataset containing over 13k samples of dry bean geometric features is analysed using machine learning (ML) and deep learning (DL) techniques with the goal of automatically classifying the bean species. The obtained geometrical data has considerable redundancy: many features are strongly correlated. This work analyses the influence of data dimension reduction (DDR), i.e. elimination of excess, strongly correlated features, and of feature scaling (FS), often called normalization, on machine learning performance (measured in terms of accuracy and approximate training time). Additionally, the influence of the activation function (sigmoid vs. ReLU) on artificial neural network performance has been checked.</p>
      </abstract>
      <kwd-group>
        <kwd>machine learning</kwd>
        <kwd>deep learning</kwd>
        <kwd>data dimension reduction</kwd>
        <kwd>feature scaling</kwd>
        <kwd>activation function</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Classification of dry beans is of economic importance, and manual classification is labour
intensive. Over 13k samples of dry beans of 7 various species were photographed and their
geometry was measured via computer vision techniques in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Then the set was analysed via several
machine learning (or data science) and deep learning (artificial neural network) techniques. The
overall accuracy obtained was 87.92-93.13%, depending on the technique used.
      </p>
      <p>
        The dataset used in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] has been published in the UCI machine learning repository [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In this
work, the bean dataset was used as material for investigating how the machine learning process is
influenced by the following factors: 1) data dimension reduction, 2) feature scaling (or data
normalization) and 3) in the case of neural networks, the activation function used (ReLU vs. sigmoid).
      </p>
      <p>
        The research question examined in this work is: how do data dimension reduction, feature scaling
and activation function influence machine learning performance? This question is related to
concurrency, specification and programming in the following way: among the topics of CS&amp;P 2021 one
can find model checking and testing (this work checks different ML models), knowledge discovery
and data mining (machine learning belongs to this field), and soft computing (artificial neural
networks are categorized as a kind of soft computing).
      </p>
      <sec id="sec-1-1">
        <title>1.1. Data dimension reduction</title>
        <p>
          In work [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] the data dimension was not reduced, although many features are strongly correlated. This
work investigates the effect of data dimension reduction on performance (computing time and
accuracy).
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>1.2. Feature scaling</title>
      <p>
        In the handbook [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], page 72, Aurelien Geron states: "One of the most important transformations
you need to apply to your data is feature scaling. With few exceptions, Machine Learning algorithms
don’t perform well when the input numerical attributes have very different scales." This work verifies
this statement and investigates which ML methods really need feature scaling.
      </p>
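      <p>
        For illustration, a minimal sketch of feature scaling with scikit-learn follows. The choice of
StandardScaler is an assumption; the handbook [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] discusses both min-max scaling and standardization, and the paper does not name the scaler used.
      </p>
      <preformat>
# Minimal sketch: standardize features to zero mean and unit variance.
# StandardScaler is an assumed choice, not necessarily the one used here;
# the sample values are illustrative (e.g. Area, Perimeter of a bean).
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[28395.0, 610.3],
              [28734.0, 638.0],
              [29380.0, 624.1]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # fit statistics and transform in one step
      </preformat>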
    </sec>
    <sec id="sec-3">
      <title>1.3. Activation Function</title>
      <p>In this work two activation functions are compared: ReLU and sigmoid.</p>
    </sec>
    <sec id="sec-4">
      <title>2. Tools</title>
      <p>
        All computations were performed in a Google Colab notebook [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] on a runtime with a few GB of RAM and no graphical processing unit (GPU) acceleration.
      </p>
      <p>Majority of experiments performed were shallow learning that do not need GPU support. As the dry
beans dataset is relatively simple, the artificial neural network (ANN) applied was also rather simple
and GPU support was not crucial for ANN training. Training times were in range from
milliseconds to
a few minutes.</p>
    </sec>
    <sec id="sec-5">
      <title>3. Data</title>
      <p>The dataset under study consists of 13611 samples. A sample amounts to 16 geometrical features
and a label identifying the species of the bean. The species are: Barbunya, Bombay, Cali, Dermason,
Horoz, Seker, and Sira. The features are: Area, Perimeter, MajorAxisLength, MinorAxisLength,
AspectRatio, Eccentricity, ConvexArea, EquivDiameter, Extent, Solidity, Roundness, Compactness,
ShapeFactor1, ShapeFactor2, ShapeFactor3, and ShapeFactor4. A detailed explanation of how the
features were calculated is presented in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].</p>
      <p>
        Correlation analysis (see table 1) has shown that several of the features are strongly (positively or
negatively) correlated. This is due to the fact that essentially all of them are geometric measures. In
the original work [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] the issue of strong correlation between features was not addressed. Strongly correlated
features (correlation coefficient over 0.9) bring little extra information, so their elimination should
reduce computational complexity (speed up training) with little if any loss in classification accuracy.
      </p>
      <p>
        It is also sometimes suggested that feature scaling (often called normalization) can improve
performance [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], pages 72-73. This is also investigated. To give a brief visualisation of the beans dataset, a
pair-plot of selected (less correlated) features has been made, see figure 1.
      </p>
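      <p>
        As an illustration of the dimension-reduction step, the following sketch shows one common way to
drop one feature of each strongly correlated pair with pandas. The file name is hypothetical and the
approach is an assumption; the 0.9 threshold follows the text above, and the actual scripts for this
work are in the Colab notebook [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <preformat>
# Assumed approach: drop one feature of every pair whose absolute
# Pearson correlation exceeds 0.9 (file name is hypothetical).
import numpy as np
import pandas as pd

df = pd.read_csv("Dry_Bean_Dataset.csv")
corr = df.drop(columns=["Class"]).corr().abs()
# Keep only the upper triangle, so each feature pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df.drop(columns=to_drop)   # 16 features reduced to fewer
      </preformat>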
    </sec>
    <sec id="sec-6">
      <title>4. Shallow learning results</title>
      <p>The methods tried were: Naive Bayes Classifier, Decision Tree, Random Forest, and Support
Vector Classifier.</p>
    </sec>
    <sec id="sec-7">
      <title>4.1. Naive Bayes Classifier</title>
      <p>Results for the Gaussian naive Bayes classifier are shown in table 2. One can see that DDR and FS
have a small effect on training time. Using DDR or FS (or both) significantly increased accuracy, from
77.23% to 89.83-91.00%.</p>
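      <p>
        A minimal sketch of this experiment with scikit-learn is shown below. The file name, split ratio and
random seed are assumptions, so the resulting accuracy will differ slightly from table 2.
      </p>
      <preformat>
# Sketch: Gaussian naive Bayes on the dry beans data (assumed setup).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

df = pd.read_csv("Dry_Bean_Dataset.csv")   # hypothetical file name
X, y = df.drop(columns=["Class"]), df["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
clf = GaussianNB().fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
      </preformat>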
    </sec>
    <sec id="sec-8">
      <title>4.2. Decision tree</title>
      <p>Results for the decision tree are shown in table 3. The decision tree applied was limited to 16 leaf
nodes and a maximum depth of 5. One can see that FS has no effect on accuracy and little effect on
training time. This is probably connected with the fact that a DT analyses one feature at a time, so it
does not care about the ratio of one feature's range to another's. DDR shortens training time with a
limited accuracy decrease.</p>
      <p>
        The decision tree is known to be sensitive to data “rotation”, see [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] p. 188. A DT analyses only one
feature at a time. Strongly correlated features give little extra information, but they can present that
information in a slightly different manner, more suitable for the decision tree.
      </p>
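      <p>
        The corresponding scikit-learn call, with the limits quoted above (16 leaf nodes, maximum depth 5),
could look as follows; the train/test split from the naive Bayes sketch is reused, and the random seed is
an assumption.
      </p>
      <preformat>
# Sketch: decision tree limited to 16 leaf nodes and depth 5, as in the text.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

tree = DecisionTreeClassifier(max_leaf_nodes=16, max_depth=5, random_state=42)
tree.fit(X_train, y_train)                 # X_train/y_train as defined above
print("accuracy:", accuracy_score(y_test, tree.predict(X_test)))
      </preformat>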
    </sec>
    <sec id="sec-9">
      <title>4.3. Random Forest Classifier</title>
      <p>Results for the random forest (RF) are shown in table 4. The random forest consisted of 150 trees.
No limits (max leaves, max depth, etc.) were put on the trees. One can observe that training times are
longer than for a single decision tree (which is reasonable, as here we have a set of decision trees). The
accuracies are high. DDR shortened training time and allowed for slightly higher accuracy (0.14-0.18
% points). It is quite interesting that although DDR slightly reduced accuracy for a single tree, it
improved accuracy for the RF. Similarly to the decision tree, FS has practically little effect on training
time.</p>
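      <p>
        A matching sketch for the random forest follows, with the 150 trees and no depth or leaf limits
mentioned above; all other settings are scikit-learn defaults, which is an assumption.
      </p>
      <preformat>
# Sketch: random forest of 150 unrestricted trees, as described in the text.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(n_estimators=150, random_state=42)
rf.fit(X_train, y_train)                   # X_train/y_train as defined above
print("accuracy:", accuracy_score(y_test, rf.predict(X_test)))
      </preformat>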
    </sec>
    <sec id="sec-10">
      <title>4.4. Support Vector Classifier</title>
      <sec id="sec-10-1">
        <title>Accuracy</title>
        <p>93.06%
93.24%
93.10%
93.24%</p>
      </sec>
      <sec id="sec-10-2">
        <title>Approx. training time</title>
        <p>4.69 s
2.69 s
4.79 s
2.59 s</p>
        <p>Results for support vector classifier (SVC) is shown in table 5. Polynomial kernel has been used.
Generally SVC is much more “heavier” model than gaussian classifier, decision tree or random forest.
Training times much longer. One can see that DDR or FS has small effect on SVC accuracy. DDR on
not scaled features reduced training time. Feature scaling significantly increased training time and
increased accuracy a little (about 1% point). The longest training time was observed for DDR and SF
data. The training time was 9 times longer than for DDR and not SF data. The author cannot
explained this effect.</p>
      </sec>
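      <p>
        A sketch of the SVC experiment with the polynomial kernel named above; the degree and other
hyperparameters are scikit-learn defaults here, which may differ from those actually used.
      </p>
      <preformat>
# Sketch: support vector classifier with a polynomial kernel, as in the text.
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

svc = SVC(kernel="poly")                   # default degree=3, C=1.0 assumed
svc.fit(X_train, y_train)                  # X_train/y_train as defined above
print("accuracy:", accuracy_score(y_test, svc.predict(X_test)))
      </preformat>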
    </sec>
    <sec id="sec-11">
      <title>5. Artificial neural network</title>
      <p>
        For an artificial neural network (ANN) the data needs additional treatment. First, the names of the
bean species were labelled with numbers, and then these numbers 0-6 were coded as so-called
“one-hot” vectors. The reason for using “one-hot” encoding is well explained, for example, in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] p. 376 or [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] pp. 190-194.
      </p>
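      <p>
        The label preparation described above can be sketched as follows; to_categorical is a standard
Keras utility, assumed here rather than confirmed by the paper, and df is the data frame loaded in the
earlier sketches.
      </p>
      <preformat>
# Sketch: map species names to integers 0-6, then one-hot encode them.
import numpy as np
from tensorflow.keras.utils import to_categorical

labels = df["Class"].to_numpy()            # species names, as loaded above
classes, y_int = np.unique(labels, return_inverse=True)   # names to 0..6
y_onehot = to_categorical(y_int, num_classes=7)            # shape (n, 7)
      </preformat>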
      <p>
        Three experiments have been performed to analyse: 1) the influence of data dimension reduction,
2) the influence of feature scaling and 3) the influence of the activation function (sigmoid vs. ReLU).
The ANN architecture was kept as similar as possible to the one described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. All ANNs had 3 hidden layers
with 17, 12 and 3 neurons respectively. However, here the ReLU function has been used as the
“default” option. The output layer consisted of 7 neurons with the softmax activation function, one
for each class. Generally, training lasted for 16 epochs; however, as it was obvious that the ANN with
sigmoid activation was undertrained, this net was trained for 48 epochs. The training process is
presented in figure 2. The performance summary is presented in table 6.
It is visible that:
1. Feature scaling (or data normalisation) is very important for ANNs. An attempt to train
without prior data scaling failed: only 55.82% accuracy was obtained. Perhaps a bigger
network could manage this issue by rescaling the data in its first few layers, but this would
affect training time and accuracy.
2. ReLU works significantly better than the sigmoid function as an activation function. The
ReLU network trains faster and reaches better accuracy.
3. Data dimension reduction shortens training time by nearly half and increases accuracy by
about 0.58 % points.
      </p>
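      <p>
        A Keras sketch of the ANN described above: three hidden ReLU layers with 17, 12 and 3 neurons
and a 7-neuron softmax output, trained for 16 epochs. The optimizer and loss function are
assumptions, as the paper does not state them.
      </p>
      <preformat>
# Sketch of the ANN architecture from the text; optimizer/loss are assumed.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input

model = Sequential([
    Input(shape=(8,)),                 # 8 features after dimension reduction
    Dense(17, activation="relu"),
    Dense(12, activation="relu"),
    Dense(3, activation="relu"),
    Dense(7, activation="softmax"),    # one output neuron per bean species
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_scaled, y_onehot, epochs=16)   # 16 epochs, as in the text
      </preformat>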
    </sec>
    <sec id="sec-12">
      <title>6. Conclusions</title>
      <p>The influence of data dimension reduction, data scaling (or normalisation) and activation function
has been investigated. The influence depends on the machine learning technique used.</p>
      <p>Generally, data dimension reduction reduces training time with rather limited influence on
accuracy. Data scaling is a must in the case of artificial neural networks: omitting data scaling
decreased accuracy from about 93% to about 56%. In the case of shallow learning techniques its
influence is smaller; it sometimes helps a little with accuracy, sometimes not.</p>
      <p>Generally, scaling had no effect on decision tree and random forest performance. In the case of the
support vector classifier, scaling resulted in a huge training time increase; the author cannot explain
this effect.</p>
      <p>The highest accuracy observed was 93.24%. It was obtained 3 times, with: 1) the random forest
with 8 features (scaled and not scaled), 2) the ANN, 8 features, scaled, and 3) the SVC, 16 features,
scaled. It is quite intriguing that exactly the same maximum result repeated 3 times.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Murat</given-names>
            <surname>Koklu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ilker Ali</given-names>
            <surname>Ozkan</surname>
          </string-name>
          ,
          <article-title>Multiclass classification of dry beans using computer vision and machine learning techniques</article-title>
          ,
          <source>Computers and Electronics in Agriculture</source>
          <volume>174</volume>
          (
          <year>2020</year>
          )
          <fpage>105507</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] Dry beans dataset at UCI repository: https://archive.ics.uci.edu/ml/datasets/Dry+Bean+Dataset, accessed 23.06.2021
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] Colab notebook containing computation scripts for this work:
          https://colab.research.google.com/drive/1l5lH1QgesDX8CbbkqcnmlbqwcXfksGQB?usp=sharing
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Aurelien</given-names>
            <surname>Geron</surname>
          </string-name>
          ,
          <source>Hands-on Machine Learning with Scikit-Learn, Keras &amp; TensorFlow</source>
          , O'Reilly
          ,
          <year>2019</year>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Jake</given-names>
            <surname>VanderPlas</surname>
          </string-name>
          ,
          <source>Python Data Science Handbook</source>
          , O'Reilly
          ,
          <year>2017</year>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Francois</given-names>
            <surname>Chollet</surname>
          </string-name>
          ,
          <source>Deep Learning with Python</source>
          ,
          <publisher-name>Manning Publications</publisher-name>
          ,
          <year>2018</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>