=Paper=
{{Paper
|id=Vol-1853/p03
|storemode=property
|title=Optimum size of feed forward neural network for Iris data set
|pdfUrl=https://ceur-ws.org/Vol-1853/p03.pdf
|volume=Vol-1853
|authors=Wojciech Masarczyk
|dblpUrl=https://dblp.org/rec/conf/system/Masarczyk17
}}
==Optimum size of feed forward neural network for Iris data set==
Wojciech Masarczyk
Faculty of Applied Mathematics
Silesian University of Technology
Gliwice, Poland
Email: wojcmas042@polsl.pl
Abstract—This paper illustrates the process of finding the optimum structure of a neural network used to solve the Iris data set classification problem. This work presents the dependencies between the number of layers, the number of neurons and the efficiency of the network, providing the best configuration for the given data set.

Index Terms—feed forward neural network, Iris data set, backpropagation algorithm

I. INTRODUCTION

Computers can help through automated models implemented to co-work with other devices in service. We can find various applications of intelligent systems in medicine, technology, transport, etc. We expect computers to control possible dangers and advise the best options while assisting humans [3], [4]. However, in these aspects it is necessary to process data of various origins [13]. We have many possible approaches to data analysis. There are solutions devoted to big data [8], where sophisticated methods implemented for knowledge engineering are used.

Computational intelligence mainly assists data processing in discovering knowledge from input data; there are mathematical models implemented to find incomplete information and work with such issues [1], [2]. Similarly, we can find reports on the efficiency of neural networks in processing incoming information [10]. Neural networks are efficient in processing input data of various types, from voice samples [15] to handwriting [11]. However, to improve the efficiency of processing it is important to discuss the optimum size of these architectures, which will be done in this article using the Iris data set as an example.

II. NEURAL NETWORK MODEL

A. Biological inspiration

A neural network is a biologically inspired model of mathematical computations whose structure is based on the architecture of the human brain. In the simplest approach the brain consists of approximately 8.6 · 10^10 neurons [5] that are connected with each other, creating a network of about 10^15 connections through which impulses are sent simultaneously, resulting in a brain processing speed close to 10^18 operations per second.

Why should one even bother applying a neural network to any field of computation? Mostly because artificial neural networks are able to generalize obtained knowledge, which means that after proper training such a network should be able to correctly predict the value of a given example not included in the training data. A second advantage of these models is robustness to random fluctuations or to missing values in the data set. Generally, neural networks are used to solve problems that seem to be incomputable or too complicated to solve by classical algorithms.

B. Artificial neuron

One can observe that an artificial neuron is a strongly simplified version of a human neuron; however, it still keeps the three most important features of a neuron:
- taking input from other neurons (dendrites),
- computing and processing the impulses taken as input,
- sending the computed impulse on to further neurons (axon).

Figure 1. Biological and artificial neuron

Copyright © 2017 held by the authors.
One cycle of computations in a neuron may be described as follows: the input values are multiplied by the appropriate weights and summed together in the cell of the neuron, after which the sum is taken as the argument of the activation function, whose value is the impulse sent as the output of the neuron. The equation describing this process is presented below:

y = f\left(\sum_{i=1}^{n} w_{(x_i)j} \cdot x_i\right),   (1)

where
w_{(x_i)j} − weight between neuron i and neuron j in consecutive layers,
x_i − value of neuron i,
f − activation function.

In (1) it is assumed that the previous layer consists of n neurons.
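As an illustration of (1), a minimal Python sketch of a single artificial neuron; the example weights, inputs and the choice of the logistic activation are arbitrary assumptions, not values taken from the paper:

```python
import math

def logistic(x, beta=1.0):
    """Logistic (sigmoid) activation function."""
    return 1.0 / (1.0 + math.exp(-beta * x))

def neuron_output(inputs, weights, activation=logistic):
    """Single artificial neuron, equation (1): the weighted sum of the
    inputs is passed through the activation function."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return activation(s)

# Example usage with arbitrary values.
print(neuron_output([0.5, 0.1, 0.9], [0.2, -0.4, 0.7]))
```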
C. Activation function

The activation function is an abstract indicator that describes the action taken in a neuron. In the simplest case, once the information is important the neuron outputs 1, otherwise it outputs 0. This model is called a binary step function; despite its simplicity it is able to solve a few problems, however it is insufficient for more complex problems due to the lack of the desired features detailed in the following list:
- A finite range results in stability while gradient-based methods are used for learning.
- Continuous differentiability is necessary for every gradient-based method; because of this feature the binary step function cannot be used in models with gradient-based learning algorithms.
- Behaving like the identity near the origin speeds up the learning process once the initial weights are small numbers.

In this paper results will be obtained using only four different activation functions.

Logistic function:

f(x) = \frac{1}{1 + e^{-\beta x}},   (2)

where
\beta − determines the steepness of the function near the origin.

Hyperbolic tangent:

f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1.   (3)

Arctangent:

f(x) = \tan^{-1}(x).   (4)

Modified hyperbolic tangent (modification proposed by Yann LeCun [6]):

f(x) = 1.7159 \tanh\left(\tfrac{2}{3} x\right).   (5)

Figure 2. Graph of different activation functions tested in this paper
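For reference, the four functions written as a minimal Python sketch following equations (2)-(5); the default β = 1 is an assumption of the example, not a value fixed by the paper:

```python
import math

def logistic(x, beta=1.0):
    # Equation (2): 1 / (1 + exp(-beta * x))
    return 1.0 / (1.0 + math.exp(-beta * x))

def hyperbolic_tangent(x):
    # Equation (3): tanh(x) = 2 / (1 + exp(-2x)) - 1
    return math.tanh(x)

def arctangent(x):
    # Equation (4): arctan(x)
    return math.atan(x)

def modified_tanh(x):
    # Equation (5): LeCun's scaled tanh, 1.7159 * tanh(2/3 * x)
    return 1.7159 * math.tanh(2.0 / 3.0 * x)
```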
D. Neural network

A feed forward neural network is constructed from neurons stacked in rows called layers, each fully connected with the previous and the subsequent layer. The first and last layers are respectively called the input and output layer; each layer between these two is called a hidden layer. The role of the neurons in the input layer is to store the initial data and send it further. The flow of information takes place from left to right, so once the initial data is provided the result will appear at the output layer, which ends one full cycle of computations for the network.

Figure 3. Example of feed forward neural network with 2 hidden layers
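A minimal sketch of one such forward pass in Python; the layer sizes, the random initial weights and the use of the logistic activation are arbitrary assumptions of the example:

```python
import math
import random

def forward_pass(inputs, layers):
    """Propagate an input vector through fully connected layers.
    `layers` is a list of weight matrices; layers[k][j] holds the incoming
    weights of neuron j in layer k. Each neuron applies equation (1)
    with the logistic activation (2)."""
    values = inputs
    for weight_matrix in layers:
        values = [1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(neuron_weights, values))))
                  for neuron_weights in weight_matrix]
    return values

# Example: 4 inputs, one hidden layer of 5 neurons, 3 outputs,
# initialized with small random weights.
random.seed(0)
hidden = [[random.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(5)]
output = [[random.uniform(-0.5, 0.5) for _ in range(5)] for _ in range(3)]
print(forward_pass([5.1, 3.5, 1.4, 0.2], [hidden, output]))
```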
According to (1), the output of an untrained neural network is just a sum of random weights multiplied by the initial values. The only parameters that can be modified in a neural network are the weights, so a learning algorithm is a process of changing the weights in such a way that the output matches the values presented in the data set. In this paper the backpropagation algorithm will be discussed.

E. Backpropagation algorithm

To train the network it is necessary to provide a data set that consists of vectors of input signals (x_1, x_2, ..., x_n) and the corresponding desired output z. Thanks to this it is possible to compute the difference between the output signal y and the desired value z. Let \delta = z - y. The next step is to propagate the \delta error back through every neuron according to the equation:

\delta_j = \sum_{i=1}^{n} w_{(j)i} \delta_i,   (6)

where
w_{(j)i} − weight between neuron i and neuron j in consecutive layers,
\delta_i − value of the error on neuron i in the consecutive layer.
Once the \delta error is computed for each neuron, the following step is to update the weights using the equation below:

w_{(j)i} = w_{(j)i} + \eta \, \delta_i \, \frac{d f_i(x)}{dx} \, y_j,   (7)

where
w_{(j)i} − weight between neuron i and neuron j in consecutive layers,
y_j − value of neuron j,
f_i − activation function of neuron i,
\delta_i − value of the error on neuron i in the consecutive layer,
\eta − coefficient that affects the speed of learning.

Learning is an iterative process that takes place until \delta for the output layer is smaller than the desired precision.
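A minimal sketch of these two steps for a single layer in Python; the weight layout (one list of incoming weights per neuron, as in the forward-pass sketch above) is an assumption of the example, and the default η = 0.02 follows the value used throughout the paper:

```python
def propagate_error(next_layer_deltas, next_layer_weights):
    """Equation (6): delta_j = sum_i w_(j)i * delta_i.
    next_layer_weights[i][j] is the weight from neuron j of this layer
    to neuron i of the next layer."""
    return [sum(row[j] * delta for row, delta in zip(next_layer_weights, next_layer_deltas))
            for j in range(len(next_layer_weights[0]))]

def update_weights(weights, deltas, derivatives, prev_outputs, eta=0.02):
    """Equation (7): w_(j)i += eta * delta_i * f_i'(x) * y_j.
    weights[i][j] connects neuron j of the previous layer to neuron i here;
    derivatives[i] is the activation derivative at neuron i's summed input."""
    for i in range(len(weights)):
        for j in range(len(prev_outputs)):
            weights[i][j] += eta * deltas[i] * derivatives[i] * prev_outputs[j]
    return weights
```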
III. CONVERGENCE OF THE GRADIENT DESCENT ALGORITHM WITH RESPECT TO η

While analyzing equation (7) it is clear that η has a direct impact on the pace of learning of the neural network, since the derivative determines the direction in which the error function descends. It is crucial to select a balanced value of η: too small a value may result in a slow learning process, while a value bigger than necessary will end with divergence of the algorithm. Four different scenarios of η values are presented in the graphical interpretation (Figure 4). Every test in this paper was carried out with η = 0.02, as it appears to be the most suitable value of this parameter.
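The effect can be illustrated with a small self-contained sketch of gradient descent on the simple error function E(w) = w^2; the function and the η values below are illustrative assumptions, not the paper's experiments:

```python
def gradient_descent_steps(eta, w=1.0, steps=10):
    """Minimize E(w) = w**2 using the gradient dE/dw = 2*w.
    Returns the sequence of w values for the chosen learning rate."""
    trajectory = [w]
    for _ in range(steps):
        w = w - eta * 2.0 * w
        trajectory.append(w)
    return trajectory

# Too small: slow progress; balanced: quick convergence; too large: divergence.
for eta in (0.01, 0.3, 1.2):
    print(eta, gradient_descent_steps(eta)[-1])
```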
IV. IRIS DATA SET

The Iris data set is a classic and one of the best known sets for pattern recognition, described in [7]. It consists of 150 instances that are equally divided into three different classes. Each class refers to a different type of iris plant. Every sample is a vector of length four, whose components describe:
1. sepal length [cm]
2. sepal width [cm]
3. petal length [cm]
4. petal width [cm]

In order to create the training data set, 35 samples were chosen from each class, so that 30% of the whole set is left for testing the accuracy of the trained model. The aim for the neural network is to classify an iris, described by these four dimensions, into one of the three given classes.
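A sketch of this split in Python; loading the data through scikit-learn's load_iris and taking the first 35 samples of each class are assumptions of the example, since the paper does not state how the samples were chosen:

```python
from collections import defaultdict
from sklearn.datasets import load_iris

iris = load_iris()                       # 150 samples, 4 features, 3 classes
per_class = defaultdict(list)
for features, label in zip(iris.data, iris.target):
    per_class[label].append(features)

# 35 samples per class for training, the remaining 15 per class for testing.
train, test = [], []
for label, samples in per_class.items():
    train += [(x, label) for x in samples[:35]]
    test += [(x, label) for x in samples[35:]]

print(len(train), len(test))             # 105 45
```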
V. TESTS

For further tests η is equal to 0.02; it will be pointed out once η changes. In every case the desired error is set to be smaller than 10^-5. The maximum number of epochs is set to 20000.

In order to compare the obtained results it is necessary to come up with a general formula determining the number of weights in a single network:

q = 4 \cdot 3 \cdot \prod_{i=1}^{n} l_i,   (8)

where
l_i − number of neurons on the i-th hidden layer,
n − number of hidden layers.

As the batch learning approach is used in this case, every iteration is equal to updating every weight in the whole neural network structure. It therefore seems natural to come up with an equation describing the number of updates done throughout the whole process of learning each network:

Q = q \cdot r,   (9)

where
r − number of iterations,
q − number of weights in the network, equation (8).

Q from equation (9) will be used in the next section to compare the efficiency of different network structures.
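A small helper implementing (8) and (9); the configuration and iteration count used in the usage example are taken from the results table below:

```python
from functools import reduce

def total_weights(hidden_layers):
    """Equation (8): q = 4 * 3 * product of the hidden layer sizes
    (4 input neurons, 3 output neurons for the Iris problem)."""
    return 4 * 3 * reduce(lambda a, b: a * b, hidden_layers, 1)

def total_updates(hidden_layers, iterations):
    """Equation (9): Q = q * r, the total number of weight updates."""
    return total_weights(hidden_layers) * iterations

print(total_weights([10]))           # 120
print(total_updates([10], 10811))    # 1297320, cf. the results table
```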
A. Results

It is clearly visible in the table below that in case the network consists of only the input and output layers (0 hidden layers) it fails to classify properly each of the three types, usually being able to classify correctly two of them while having issues with the third one. This is caused by the fact that one class is linearly separable from the other two, while the latter are not linearly separable from each other.

One can observe that the hyperbolic tangent is the most uniform function while also being the most effective, taking into consideration the number of iterations and the number of neurons needed to obtain the desired precision. Similar behaviour can be seen with the arctangent function, however it fails to work for some networks that were properly trained with the hyperbolic tangent. On the other side, the modified tangent performed below expectations; however it is important to remember that all tests were carried out for a fixed η. Since the steepness of this function is modified and the value of η should be set with respect to the derivatives [6], its underperformance seems understandable.

After analyzing the table it is clear that adding additional hidden layers is not worth the effort, since the optimum is obtained for 1 or 2 hidden layers. Note that each added hidden layer with 10 neurons would have to decrease the number of iterations tenfold to be computationally profitable; unfortunately this does not happen with any function. Moreover, as shown, networks that are too big (5 hidden layers and more) tend to fail this classification task at all. This might be caused by the fact that the Iris data set is relatively small (150 examples), which is not enough to properly adjust so many weights.

Three graphs presented in this paper represent the error function over consecutive iterations. Figure 6 presents a nearly perfect plot that is smooth and monotonic most of the time, in contrast to Figure 5, which presents heavy oscillations of the error function caused by a wrong η coefficient. Exactly the same situation appears in Figure 7 at around iteration 400; luckily it is a situation where η > η_opt and after a few iterations of oscillations it finally converges. The values presented in the table show that the network failed to obtain the desired precision in 20000 iterations.
Figure 4. Graphical interpretation of the η coefficient.

Figure 5. Error function over iterations in the 0 hidden layer network

Figure 6. Error function over iterations in the 1 hidden layer network

Figure 7. Error function over iterations in the 2 hidden layer network
Results for all tested network configurations (F − the network failed to reach the desired precision within the 20000-iteration limit):

Activation function | Number of hidden layers | Neurons on hidden layers | Total weights | Iterations | Weights · iterations
Logistic Function | 0 | 0 | 12 | F | F
Logistic Function | 1 | 5 | 60 | F | F
Logistic Function | 1 | 10 | 120 | 10811 | 1297320
Logistic Function | 2 | 10 - 10 | 1200 | 8467 | 10160400
Logistic Function | 2 | 10 - 5 | 900 | F | F
Logistic Function | 2 | 5 - 10 | 900 | 5909 | 5318100
Logistic Function | 3 | 8 - 8 - 8 | 6144 | F | F
Logistic Function | 3 | 10 - 8 - 6 | 10800 | 3867 | 41763600
Logistic Function | 3 | 6 - 8 - 10 | 10800 | 553 | 5972400
Logistic Function | 4 | 6 - 6 - 6 - 6 | 15552 | F | F
Logistic Function | 4 | 10 - 9 - 8 - 6 | 51840 | F | F
Logistic Function | 4 | 6 - 8 - 9 - 10 | 51840 | F | F
Logistic Function | 5 | 7 - 7 - 7 - 7 - 7 | 201684 | F | F
Logistic Function | 5 | 12 - 10 - 9 - 8 - 7 | 725760 | F | F
Logistic Function | 5 | 7 - 8 - 9 - 10 - 12 | 725760 | F | F
Hyperbolic Tangent | 0 | 0 | 12 | F | F
Hyperbolic Tangent | 1 | 5 | 60 | 259 | 15540
Hyperbolic Tangent | 1 | 10 | 120 | 2047 | 245640
Hyperbolic Tangent | 2 | 10 - 10 | 1200 | 1124 | 1348800
Hyperbolic Tangent | 2 | 10 - 5 | 900 | 1266 | 1139400
Hyperbolic Tangent | 2 | 5 - 10 | 900 | F | F
Hyperbolic Tangent | 3 | 8 - 8 - 8 | 6144 | F | F
Hyperbolic Tangent | 3 | 10 - 8 - 6 | 10800 | 313 | 3380400
Hyperbolic Tangent | 3 | 6 - 8 - 10 | 10800 | 1359 | 14677200
Hyperbolic Tangent | 4 | 6 - 6 - 6 - 6 | 15552 | 1512 | 23514624
Hyperbolic Tangent | 4 | 10 - 9 - 8 - 6 | 51840 | 1566 | 81181440
Hyperbolic Tangent | 4 | 6 - 8 - 9 - 10 | 51840 | 874 | 45308160
Hyperbolic Tangent | 5 | 7 - 7 - 7 - 7 - 7 | 201684 | 1072 | 216205248
Hyperbolic Tangent | 5 | 12 - 10 - 9 - 8 - 7 | 725760 | F | F
Hyperbolic Tangent | 5 | 7 - 8 - 9 - 10 - 12 | 725760 | F | F
Arctangent | 0 | 0 | 12 | F | F
Arctangent | 1 | 5 | 60 | 699 | 41940
Arctangent | 1 | 10 | 120 | 311 | 37320
Arctangent | 2 | 10 - 10 | 1200 | 2824 | 3388800
Arctangent | 2 | 10 - 5 | 900 | 2567 | 2310300
Arctangent | 2 | 5 - 10 | 900 | 4026 | 3623400
Arctangent | 3 | 8 - 8 - 8 | 6144 | F | F
Arctangent | 3 | 10 - 8 - 6 | 10800 | 777 | 8391600
Arctangent | 3 | 6 - 8 - 10 | 10800 | 831 | 8974800
Arctangent | 4 | 6 - 6 - 6 - 6 | 15552 | 1072 | 16671744
Arctangent | 4 | 10 - 9 - 8 - 6 | 51840 | 235 | 12182400
Arctangent | 4 | 6 - 8 - 9 - 10 | 51840 | F | F
Arctangent | 5 | 7 - 7 - 7 - 7 - 7 | 201684 | F | F
Arctangent | 5 | 12 - 10 - 9 - 8 - 7 | 725760 | F | F
Arctangent | 5 | 7 - 8 - 9 - 10 - 12 | 725760 | F | F
Modified Tangent | 0 | 0 | 12 | F | F
Modified Tangent | 1 | 5 | 60 | 248 | 14880
Modified Tangent | 1 | 10 | 120 | 1245 | 1297320
Modified Tangent | 2 | 10 - 10 | 1200 | F | F
Modified Tangent | 2 | 10 - 5 | 900 | F | F
Modified Tangent | 2 | 5 - 10 | 900 | F | F
Modified Tangent | 3 | 8 - 8 - 8 | 6144 | F | F
Modified Tangent | 3 | 10 - 8 - 6 | 10800 | F | F
Modified Tangent | 3 | 6 - 8 - 10 | 10800 | F | F
Modified Tangent | 4 | 6 - 6 - 6 - 6 | 15552 | F | F
Modified Tangent | 4 | 10 - 9 - 8 - 6 | 51840 | F | F
Modified Tangent | 4 | 6 - 8 - 9 - 10 | 51840 | F | F
Modified Tangent | 5 | 7 - 7 - 7 - 7 - 7 | 201684 | F | F
Modified Tangent | 5 | 12 - 10 - 9 - 8 - 7 | 725760 | F | F
Modified Tangent | 5 | 7 - 8 - 9 - 10 - 12 | 725760 | F | F
VI. REMARKS

When designing an artificial neural network there is a list of crucial parameters that need to be set precisely in order to achieve a properly working model. The most important ones, discussed in this paper, are:

• Activation function − as it turned out, it has a huge impact on the capabilities of the network, not only those provided by differentiability, but it can also significantly accelerate the process of learning.

• Learning coefficient − it highly depends on the activation function and is a crucial factor in the convergence of gradient-based methods of learning.

• Number of hidden layers and neurons − in simple problems like the one discussed in this paper, 2 hidden layers are enough.
• Number of iterations and desired precision − again, this strongly depends on the problem one tries to solve; there is no general rule that describes how to set these values, however it is important to set an upper limit on the number of iterations in order to stop the computations at some point once the learning algorithm is unable to achieve the desired precision.
VII. SUMMARY

Among various architectures, neural networks are one of the most efficient processors of information about controlled objects; moreover, these architectures can cooperate with information presented in the form of images, voice samples, network statistics, etc.

Each implemented solution needs an adjusted architecture to exactly fit the model of decision support. Therefore research on the performance of various types of networks can give valuable information that can improve implementations of neural networks.
REFERENCES

[1] P. Artiemjew, "Stability of Optimal Parameters for Classifier Based on Simple Granules of Knowledge," Technical Sciences, vol. 14, no. 1, pp. 57-69, UWM Publisher, Olsztyn, 2011.
[2] P. Artiemjew, P. Gorecki, K. Sopyła, "Categorization of Similar Objects Using Bag of Visual Words and k-Nearest Neighbour Classifier," Technical Sciences, vol. 15, no. 2, pp. 293-305, UWM Publisher, Olsztyn, 2012.
[3] R. Damasevicius, M. Vasiljevas, J. Salkevicius, M. Woźniak, "Human Activity Recognition in AAL Environments Using Random Projections," Computational and Mathematical Methods in Medicine, vol. 2016, pp. 4073584:1-4073584:17, 2016, DOI: 10.1155/2016/4073584.
[4] R. Damasevicius, R. Maskeliunas, A. Venckauskas, M. Woźniak, "Smartphone User Identity Verification Using Gait Characteristics," Symmetry, vol. 8, no. 10, pp. 100:1-100:20, 2016, DOI: 10.3390/sym8100100.
[5] S. Herculano-Houzel, "The human brain in numbers: a linearly scaled-up primate brain," Frontiers in Human Neuroscience, vol. 3, pp. 31:1-31:11, 2009, DOI: 10.3389/neuro.09.031.2009.
[6] Y. LeCun, L. Bottou, G. Orr, K. Muller, "Efficient BackProp," Neural Networks: Tricks of the Trade, 1998.
[7] R. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, pp. 179-188, 1936.
[8] Z. Marszałek, "Novel Recursive Fast Sort Algorithm," Communications in Computer and Information Science, vol. 639, pp. 344-355, 2016, DOI: 10.1007/978-3-319-46254-7_27.
[9] C. Napoli, E. Tramontana, "An object-oriented neural network toolbox based on design patterns," International Conference on Information and Software Technologies, pp. 388-399, 2015, DOI: 10.1007/978-3-319-24770-0_34.
[10] C. Napoli, G. Pappalardo, E. Tramontana, "A mathematical model for file fragment diffusion and a neural predictor to manage priority queues over BitTorrent," Applied Mathematics and Computer Science, vol. 26, no. 1, pp. 147-160, 2016, DOI: 10.1515/amcs-2016-0010.
[11] D. Połap, M. Woźniak, "Flexible Neural Network Architecture for Handwritten Signatures Recognition," International Journal of Electronics and Telecommunications, vol. 62, no. 2, pp. 197-202, 2016, DOI: 10.1515/eletel-2016-0027.
[12] C. Napoli, G. Pappalardo, G. M. Tina, E. Tramontana, "Cooperative strategy for optimal management of smart grids by wavelet RNNs and cloud computing," IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 6, pp. 1672-1685, 2016, DOI: 10.1109/TNNLS.2015.2480709.
[13] D. Połap, M. Woźniak, "Introduction to the Model of the Active Assistance System for Elder and Disabled People," Communications in Computer and Information Science, vol. 639, pp. 392-403, 2016, DOI: 10.1007/978-3-319-46254-7_31.
[14] C. Napoli, G. Pappalardo, E. Tramontana, R. K. Nowicki, J. T. Starczewski, M. Woźniak, "Toward work groups classification based on probabilistic neural network approach," International Conference on Artificial Intelligence and Soft Computing, pp. 79-89, 2016, DOI: 10.1007/978-3-319-19324-3_8.
[15] D. Połap, "Neuro-heuristic voice recognition," 2016 Federated Conference on Computer Science and Information Systems, FedCSIS 2016, Proceedings, 11-14 September, Gdańsk, Poland, IEEE, 2016, pp. 487-490, DOI: 10.15439/2016F128.