=Paper=
{{Paper
|id=Vol-2694/paper4
|storemode=property
|title=Image classification with feed-forward neural networks
|pdfUrl=https://ceur-ws.org/Vol-2694/p4.pdf
|volume=Vol-2694
|authors=Bartłomiej Meller,Kamil Matula,Paweł Chłąd
|dblpUrl=https://dblp.org/rec/conf/system/MellerMC20
}}
==Image classification with feed-forward neural networks==
Bartłomiej Meller, Kamil Matula and Paweł Chłąd
Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44-100 Gliwice

SYSTEM 2020: Symposium for Young Scientists in Technology, Engineering and Mathematics, Online, May 20 2020
bartmel655@student.polsl.pl (B. Meller); kamimat133@student.polsl.pl (K. Matula); pawechl893@student.polsl.pl (P. Chłąd)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract

Artificial neural networks, such as feed-forward networks (FFN), convolutional neural networks (CNN) and recursive neural networks (RNN), are becoming powerful tools that are starting to replace many classical algorithms. It is known that, for image recognition, CNNs are often the best choice in terms of accuracy. In this paper we show that feed-forward networks are capable of achieving comparable performance with a less complicated architecture than CNNs. After presenting the underlying theory of feed-forward networks, we describe the methods that allowed us to get past local minima, and then we present our experiments and the conclusions that followed.

Keywords: Neural networks, Activation function, Images, Classification

1. Introduction

Artificial neural networks such as feed-forward networks (FFN), convolutional neural networks (CNN) and recursive neural networks (RNN) are becoming powerful tools that are starting to replace many classical algorithms. It is known that, for image recognition, CNNs are often the best choice in terms of accuracy. We chose FFNs because they are simpler in structure and, with a sufficient amount of data and a heuristic approach, they can achieve comparable performance while being easier to implement and manage.

One of the most effective learning methods is the backpropagation (BP) algorithm. BP uses gradient calculation to determine the "direction" in which the network should "go". The downside of this strictly mathematical approach is that gradient-based methods often get stuck in local minima of the loss function. Another common problem in many kinds of networks is knowledge generalization; we often train our models on large data sets to avoid fixation of the network on a small set of examples.

We can find many applications of neural networks in image processing. In [1] a CNN was adopted to recognize archaeological sites. We can also find many applications in medicine, where such systems extract bacteria [2] or detect respiratory malfunctions and other pathologies [3, 4, 5]. There are also many applications of such ideas in the movie and advertisement field. Movie scenes can be segmented using background information [6], and predictions about such content can be made with neural networks [7]. We can also find applications of vision support in virtual reality entertainment systems [8], where neural networks are used to improve perception. There are also many examples in which neural networks are used to understand context and emotions in movies. In [9] an adaptive attention model was used to recognize patterns, while emotions in clips were detected by neural networks [10] or complex neuro-fuzzy systems [11].

This paper addresses both of those issues. We show that combining a heuristic with the backpropagation algorithm allows for efficient overcoming of "minima traps", and we present methods that help the network generalize its knowledge. First we introduce the basic theory of feed-forward networks alongside an explanation of the backpropagation algorithm. Then we describe our example model and the series of experiments that were performed on it. At the end of the paper we present the conclusions we have gathered and give performance metrics of our model.
Movie man brain; similarly to one, the ANN is built of neu- scenes can be segmented by using background infor- rons which are parts of layers. Each neuron from each mation [6] or even prediction about such content can layer is connected to all neurons from the previous layer and all neurons of the next layer are connected SYSTEM 2020: Symposium for Young Scientists in Technology, Engineering and Mathematics, Online, May 20 2020 by synapses. These connections have randomly ini- " bartmel655@student.polsl.pl (B. Meller); tialized weights, which are being modified in the learn- kamimat133@student.polsl.pl (K. Matula); ing process. pawechl893@student.polsl.pl (P. Chłąd) The first layer, responsible for receiving input data, © 2020 Copyright for this paper by its authors. Use permitted under Creative is called input layer. Similarly the last one, which re- Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) turns output data, is called output layer. There can be by adjusting synaptic weights in strictly defined way. There are many methods of learning, but the most pop- ular training technique that is used in Feed-forward Neural Network is back-propagation in which modify- ing weights of the connections goes from output layer to input layer sequentially. This direction is opposite to the way inserted information moves in all FFNs. The goal of training process is to minimize the value of loss function, for all elements included in training set (T). The training set consists of input vectors, which are inserted to first layer by input synapses, and expected output vectors, which are compared with gained out- puts every time neural network is fed. The loss func- tion shows the distance between predicted value and the actual one. To calculate it we can use Mean Square Error Loss or Cross-Entropy Loss (also known as Log Figure 1: Bipolar linear activation function Loss). Using MSE, total error of training set can be de- scribed with equation: 𝑛 zero, one or more hidden layers between them. The 𝐸 = ∑ ∑(𝑑𝑖 − 𝑦𝑖 )2 , (1) goal of neural network architect is to find optimal sizes 𝑇 𝑖=1 of layers, to make learning process much more effi- cient. Input neurons’ count depends on number of where 𝑛 is a dimension of output vector (and also count features the analyzing object has and output neurons’ of neurons on output layer), 𝑑𝑖 is predicted value on count on how many classes it can be classified to. 𝑖 𝑡ℎ position of output vector and 𝑦𝑖 is actual value on Every neuron receives value on input, transforms it 𝑖 𝑡ℎ position of output vector. Correction of synaptic using activation function and sends the output signal weights starts in last layer and goes backwards through to the next layer. The input signal of 𝑖 𝑡ℎ neuron of 𝑘 𝑡ℎ all hidden layers until it reaches input layer. The weight layer is: is changed according to equation: 𝑛 𝑤𝑖𝑗𝑘 = 𝑤𝑖𝑗𝑘 + 𝜂∇𝑤𝑖𝑗𝑘 , (2) 𝑠𝑖𝑘 = ∑ 𝑤𝑖𝑗𝑘 𝑦𝑗𝑘−1 , 𝑗=1 where 𝜂 is correcting coefficient commonly called ’Learn- where 𝑤𝑖𝑗𝑘 is a weight of connection between 𝑖 𝑡ℎ neu- ing Rate’ and ∇𝑤𝑖𝑗𝑘 is a value of gradient of synapse’s ron of 𝑘 𝑡ℎ layer and 𝑗 𝑡ℎ neuron of previous layer and weight’s error: 𝑦𝑗𝑘−1 is 𝑗 𝑡ℎ neuron of previous layer’s output signal value. The output signal of 𝑖 𝑡ℎ neuron is: 𝜕𝐸 1 𝜕𝐸 𝜕𝑠𝑖𝑘 ∇𝑤𝑖𝑗𝑘 = = ⋅ ⋅ 2 ⋅ = 2𝛿𝑖𝑘 𝑦𝑗𝑘−1 , (3) 𝑛 𝜕𝑤𝑖𝑗𝑘 2 𝜕𝑠𝑖𝑘 𝜕𝑤𝑖𝑗𝑘 𝑦𝑖𝑘 = 𝑓 (𝑠𝑖𝑘 ) = 𝑓 (∑ 𝑤𝑖𝑗𝑘 𝑦𝑗𝑘−1 ). 
After the artificial neural network has been properly built, it is time to teach it object recognition. An ANN learns by adjusting its synaptic weights in a strictly defined way. There are many learning methods, but the most popular training technique used with feed-forward neural networks is backpropagation, in which the weights of the connections are modified sequentially from the output layer to the input layer. This direction is opposite to the way the inserted information moves in all FFNs.

The goal of the training process is to minimize the value of the loss function for all elements of the training set T. The training set consists of input vectors, which are inserted into the first layer through the input synapses, and expected output vectors, which are compared with the obtained outputs every time the neural network is fed. The loss function shows the distance between the predicted value and the actual one. To calculate it we can use the Mean Square Error loss or the Cross-Entropy loss (also known as Log Loss). Using MSE, the total error over the training set can be described with the equation:

    E = \sum_{T} \sum_{i=1}^{n} (d_i - y_i)^2,    (1)

where n is the dimension of the output vector (and also the number of neurons in the output layer), d_i is the expected value at the i-th position of the output vector and y_i is the value actually produced at the i-th position of the output vector. The correction of the synaptic weights starts in the last layer and goes backwards through all hidden layers until it reaches the input layer. A weight is changed according to the equation:

    w_{ij}^k = w_{ij}^k + \eta \nabla w_{ij}^k,    (2)

where \eta is a correcting coefficient commonly called the "learning rate" and \nabla w_{ij}^k is the gradient of the error with respect to the synapse's weight:

    \nabla w_{ij}^k = \frac{\partial E}{\partial w_{ij}^k} = \frac{1}{2} \cdot \frac{\partial E}{\partial s_i^k} \cdot 2 \cdot \frac{\partial s_i^k}{\partial w_{ij}^k} = 2 \delta_i^k y_j^{k-1},    (3)

where \delta_i^k is the value of the change of the error function with respect to the input signal of the i-th neuron of the k-th layer and y_j^{k-1} is the output signal value of the j-th neuron of the previous layer. On the last, K-th layer, \delta equals:

    \delta_i^K = \frac{1}{2} \cdot \frac{\partial E}{\partial s_i^K} = \frac{1}{2} \cdot \frac{\partial (d_i^K - y_i^K)^2}{\partial s_i^K} = f'(s_i^K) \cdot (d_i^K - y_i^K),    (4)

where f'(s_i^K) is the value of the derivative of the activation function at the input signal of the i-th output neuron. On this layer the value depends mainly on the distance between the predicted and actual values, that is, on the error. The \delta values of the other layers use the numbers calculated in the previous steps:

    \delta_i^k = \frac{1}{2} \cdot \frac{\partial E}{\partial s_i^k} = \frac{1}{2} \sum_{j=1}^{N_{k+1}} \frac{\partial E}{\partial s_j^{k+1}} \cdot \frac{\partial s_j^{k+1}}{\partial s_i^k} = f'(s_i^k) \sum_{j=1}^{N_{k+1}} \delta_j^{k+1} w_{ji}^{k+1},    (5)

where N_{k+1} is the number of neurons on the (k+1)-th layer.

Algorithm 1 shows the full process of training a feed-forward neural network with the backpropagation algorithm. It uses an array called D, which holds the \delta values. This jagged array has L rows and, in each row, as many columns as there are neurons in the layer it corresponds to. All array elements are written as A_{i,j}, which is just an alternative form of A[i][j]. The numerical intervals used in the for-loops are half-open (the value after "to" does not belong to the interval). The symbols LR, s_i^k, y_i^k and w_{ij}^k denote, respectively, the learning rate (\eta), the input and output signals of the i-th neuron of the k-th layer, and the weight of the synapse between the i-th neuron of the k-th layer and the j-th neuron of the previous layer. The symbol f'(\cdot) denotes the derivative of the activation function; for the bipolar linear function

    f(x) = \frac{1 - e^{-\alpha x}}{1 + e^{-\alpha x}}    (6)

the derivative is:

    f'(x) = \frac{2 \alpha e^{-\alpha x}}{(1 + e^{-\alpha x})^2}.    (7)

Algorithm 1: FFN training algorithm.

    Data: EpochsCount, TrainingInputs, ExpectedOutputs (EOUT)
    Result: Higher precision of the neural network
    L := count of the network's layers;
    D := empty jagged array for the delta values;
    for i = 0 to EpochsCount do
        for j = 0 to Training Set's Length do
            Insert the j-th vector from TrainingInputs into the first layer's input synapses;
            for k = 0 to L do
                Calculate s^k on all neurons of the k-th layer by summing the products of the
                synaptic weights and the output values of the previous layer's neurons
                (or of the input synapses if it is the input layer);
                Calculate y^k on all neurons of the k-th layer by applying the activation function;
            end
            Output (OUT) := vector made of the values of the output neurons;
            for n = 0 to output neurons count do
                D_{L-1,n} = (EOUT_{j,n} - OUT_n) * f'(s_n^{L-1});
            end
            for k = L-2 to 0 step -1 do
                for n = 0 to k-th layer's neurons count do
                    D_{k,n} = 0;
                    for m = 0 to (k+1)-th layer's neurons count do
                        D_{k,n} = D_{k,n} + D_{k+1,m} * w_{mn}^{k+1};
                    end
                    D_{k,n} = D_{k,n} * f'(s_n^k);
                end
            end
            for k = L-2 to 0 step -1 do
                for n = 0 to k-th layer's neurons count do
                    for m = 0 to (k-1)-th layer's neurons count do
                        w_{nm}^k = w_{nm}^k + 2 * LR * D_{k,n} * y_m^{k-1};
                    end
                end
            end
        end
    end
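The sketch below is a minimal NumPy rendering of one training step in the spirit of Algorithm 1 and equations (2)-(5): fully connected layers, bipolar activation (repeated here so the sketch is self-contained), output deltas f'(s)(d - y) and the update w += 2*LR*delta*y_prev. It is our illustration, not the authors' code; the class and method names, the learning rate and the value of alpha are assumptions.

```python
import numpy as np

ALPHA = 1.0  # assumed activation steepness

def bipolar(s):
    return 2.0 / (1.0 + np.exp(-ALPHA * s)) - 1.0

def bipolar_derivative(s):
    e = np.exp(-ALPHA * s)
    return 2.0 * ALPHA * e / (1.0 + e) ** 2

class FFN:
    """Fully connected feed-forward network trained with backpropagation (Algorithm 1)."""

    def __init__(self, layer_sizes, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        # weights[k] connects the previous layer (or the input) to layer k; shape (n_k, n_{k-1})
        self.weights = [rng.uniform(-1.0, 1.0, size=(n, m))
                        for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

    def forward(self, x):
        """Return the input signals s^k and output signals y^k of every layer."""
        s_list, y_list = [], [x]
        for w in self.weights:
            s = w @ y_list[-1]
            s_list.append(s)
            y_list.append(bipolar(s))
        return s_list, y_list

    def train_step(self, x, expected, learning_rate=0.01):
        """One backpropagation update for a single (input, expected output) pair."""
        s_list, y_list = self.forward(x)
        deltas = [None] * len(self.weights)
        # output layer: delta = f'(s) * (d - y)                          (eq. 4)
        deltas[-1] = bipolar_derivative(s_list[-1]) * (expected - y_list[-1])
        # hidden layers: delta_k = f'(s_k) * W_{k+1}^T delta_{k+1}       (eq. 5)
        for k in range(len(self.weights) - 2, -1, -1):
            deltas[k] = bipolar_derivative(s_list[k]) * (self.weights[k + 1].T @ deltas[k + 1])
        # weight update: w = w + 2 * LR * delta * y_prev                 (eqs. 2-3)
        for k, w in enumerate(self.weights):
            w += 2.0 * learning_rate * np.outer(deltas[k], y_list[k])

if __name__ == "__main__":
    net = FFN([3750, 800, 200, 50, 6])      # layer sizes taken from the paper's final model
    frame = np.random.default_rng(1).uniform(0.0, 1.0, size=3750)
    target = np.zeros(6); target[2] = 1.0   # hypothetical one-hot movie label
    net.train_step(frame, target, learning_rate=0.01)
```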
3. Example system

Our experimental FFN takes a 50x25x3 image as input and outputs a six-dimensional vector of decision values. Each decision value represents one movie that is associated with the given frame.

3.1. Data preparation

3.1.1. Image preparation

After loading the data set into memory, we cut each image to meet a 2:1 aspect ratio; this value was chosen because our input vector corresponds to an image with a 2:1 aspect ratio. If we took an image whose aspect ratio does not match the input aspect ratio, we would need to stretch or shrink it; that would in turn add some pixels and thus make the results more inaccurate. After cropping to the target aspect ratio, we crop another 1/12 of the width and height from each side to eliminate possible letterboxes. Finally, we shrink the image down to 50x25 pixels. Such dimensions were chosen to cut down on training times. Obviously, to eliminate possible artifacts introduced by the preprocessing steps described above, the image can be filtered [12].

3.1.2. Data labeling

Each image is given a label signifying the movie it belongs to. Then 1/10 of the image samples are redirected into the testing set.
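A possible rendering of this preparation pipeline (centre-crop to a 2:1 aspect ratio, trim 1/12 of the width and height from each side against letterboxes, resize to 50x25, and hold out 1/10 of the samples for testing) is sketched below with Pillow and NumPy. The use of Pillow, the function names and the flattening/normalization step are our assumptions, not details given in the paper.

```python
import random
from PIL import Image
import numpy as np

TARGET_W, TARGET_H = 50, 25   # network input resolution from the paper

def prepare_frame(path):
    """Crop a frame to a 2:1 aspect ratio, trim letterbox margins and resize to 50x25."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    # centre-crop to a 2:1 aspect ratio
    if w > 2 * h:
        new_w = 2 * h
        left = (w - new_w) // 2
        img = img.crop((left, 0, left + new_w, h))
    else:
        new_h = w // 2
        top = (h - new_h) // 2
        img = img.crop((0, top, w, top + new_h))
    # trim 1/12 of the width and height from each side to remove possible letterboxes
    w, h = img.size
    dx, dy = w // 12, h // 12
    img = img.crop((dx, dy, w - dx, h - dy))
    # shrink to the network input size and flatten to a 50*25*3 = 3750-dimensional vector
    img = img.resize((TARGET_W, TARGET_H))
    return np.asarray(img, dtype=np.float32).reshape(-1) / 255.0

def split_dataset(samples, test_fraction=0.1, seed=0):
    """Redirect roughly 1/10 of the (vector, label) samples into the testing set."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    return shuffled[cut:], shuffled[:cut]   # training set, testing set
```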
3.2. Training network

After preparation of the training and testing datasets, all 225 images per movie, converted to vectors, are inserted into the neural network. As stated in the "Feed-Forward Networks Theory" section, the information goes through all layers to the output layer, applying the activation function in every encountered neuron, and then goes back in the backpropagation process, modifying the synaptic weights. This sequence is repeated multiple times and results in an improvement of the network's ability to recognize objects.

The learning methodology was the following. The model was trained until its accuracy had not increased for a few epochs. In the next stage the network's weights were randomized to obtain three different child networks. All four instances were trained in parallel. At the end, the best of them was picked and reproduced by randomizing its weights, and the whole process was repeated. When the network had stalled in its advancement, one class was added to its possible outputs and learning was continued in the same way.

One of the most interesting issues that we experienced in the initial part of the research was a significant drop in accuracy when the class set contained the movie "Shrek 2". After preliminary learning on a dataset that did not contain this movie, the problem disappeared.

4. Experiments

The main problem dealt with, in order to provide optimal learning accuracy and time, was selecting an appropriate architecture that grants relatively good accuracy but, on the other hand, contains as few neurons as required. A minimal neuron count results in minimal computational time.

Initially, when tests were carried out on 4 classes of data, a network that contained only one hidden layer consisting of as few as 50 neurons seemed to be the most efficient and accurate (76%). Nevertheless, it was not capacious enough to accommodate the amount of knowledge needed to recognize four or more movies. Eventually, a model that contained 3 hidden layers with the following neuron counts: 800, 200 and 50 came out to be the optimal solution.

The backpropagation algorithm has one significant defect: it does not use any heuristics that could help it deal with local minima. The solution to this issue is quite simple and easy to implement. In this case, adding a random number from the range [-0.002, 0.002] to the weights of all connections turned out to be extraordinarily efficient. However, it is hard to give exact figures because the method was combined with a gradual increase of the class count. Surprisingly, increasing the number of classes resulted in a leap in the quality of the decisions made by the network. This phenomenon was observed after a few epochs of learning following the addition of a class. Sometimes the network needed the randomization mentioned above for the phenomenon to occur. At the time of writing this article, the absolute accuracy of the predictions made by the network for six classes exceeded the maximal accuracy obtained for four classes. We suspect that this can be explained by the growing generality of the classifier contained in the model as the number of known classes increases.

We also experimented with different weight initialization techniques, described in [4, 13]. We tried the following methods:

- He weight initialization: multiplies a random value from [-1.0, 1.0] by \sqrt{2/size}, where size is the size of the previous layer;
- Xavier weight initialization: similar to He initialization, but with \sqrt{1/size};
- we also used the following technique: \sqrt{2/(size_{l-1} + size_l)}, where size_n is the size of the n-th layer.

Unfortunately, none of those methods worked correctly. In testing we used networks with a significant number of neurons in nearly every layer (3750 neurons in the input layer), which in turn set the initial weight values to minute values.
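The weight perturbation heuristic and the three initialization formulas above can be summarized with the short NumPy sketch below. It reflects our reading of the description (uniform base values in [-1, 1], perturbations drawn from [-0.002, 0.002], three perturbed children per parent); the function names and the uniform distributions are assumptions rather than details taken from the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(n_out, n_in):
    """He-style initialization: uniform [-1, 1] values scaled by sqrt(2 / n_in)."""
    return rng.uniform(-1.0, 1.0, size=(n_out, n_in)) * np.sqrt(2.0 / n_in)

def xavier_init(n_out, n_in):
    """Xavier-style initialization: uniform [-1, 1] values scaled by sqrt(1 / n_in)."""
    return rng.uniform(-1.0, 1.0, size=(n_out, n_in)) * np.sqrt(1.0 / n_in)

def combined_init(n_out, n_in):
    """Third variant from the paper: scaling by sqrt(2 / (n_in + n_out))."""
    return rng.uniform(-1.0, 1.0, size=(n_out, n_in)) * np.sqrt(2.0 / (n_in + n_out))

def nudge_weights(weight_matrices, magnitude=0.002):
    """Local-minimum escape heuristic: add uniform noise from [-magnitude, magnitude]
    to every synaptic weight, producing a slightly perturbed child network."""
    return [w + rng.uniform(-magnitude, magnitude, size=w.shape) for w in weight_matrices]

# Example: spawn three perturbed children of a parent weight set, as in the training methodology.
parent = [he_init(800, 3750), he_init(200, 800), he_init(50, 200), he_init(6, 50)]
children = [nudge_weights(parent) for _ in range(3)]
```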
The results of our research are shown in the tables below. The first table presents the most satisfactory accuracies of the neural network (with 1 hidden layer consisting of 50 neurons) before the experiment. The second one shows the effects of the randomization. The percentages in the right columns are calculated by dividing the number of correctly recognized movies by the size of the testing dataset (which is 25 frames per movie). As the tables show, the results are surprisingly high.

1 hidden layer with 50 neurons:

    Movies count | Highest precision
    4            | 76.0 %
    5            | 71.2 %
    6            | 52.7 %

3 hidden layers: 800, 200 and 50 neurons:

    Movies count | Highest precision
    4            | 82.0 %
    5            | 88.8 %
    6            | 84.6 %

5. Example Predictions

The table below contains example predictions made by the network. The provided images are screenshots from the Netflix platform that were taken independently of the learning and testing datasets. (The Picture column of the original tables holds the screenshots themselves and is not reproduced here.) As can be seen, the accuracy is satisfactorily high. The network makes a few mistakes when it comes to dark and distant frames. Fixing this issue will be our next goal.

    Origin         | Prediction     | Matches | 2nd prediction
    Indiana Jones  | Indiana Jones  | yes     | Shark Tale
    Indiana Jones  | Indiana Jones  | yes     | Shrek 2
    Indiana Jones  | Indiana Jones  | yes     | Shrek 2
    Indiana Jones  | Shrek 2        | no      | The Lego Movie
    The Lego Movie | The Lego Movie | yes     | Shark Tale
    The Lego Movie | The Lego Movie | yes     | Shark Tale
    The Lego Movie | Shark Tale     | no      | The Lego Movie
    The Lego Movie | The Lego Movie | yes     | Shrek 2
    Madagascar     | Madagascar     | yes     | Shark Tale
    Madagascar     | Shrek 2        | no      | The Lego Movie
    Madagascar     | Madagascar     | yes     | Shrek 2
    Madagascar     | Madagascar     | yes     | Shrek 2
    Shark Tale     | Shark Tale     | yes     | The Lego Movie
    Shark Tale     | Shark Tale     | yes     | Indiana Jones
    Shark Tale     | Shark Tale     | yes     | The Lego Movie
    Shark Tale     | Shark Tale     | yes     | The Lego Movie
    Shrek 2        | Shrek 2        | yes     | Shark Tale
    Shrek 2        | Shrek 2        | yes     | Indiana Jones
    Shrek 2        | Shrek 2        | yes     | Madagascar
    Shrek 2        | Shrek 2        | yes     | Shark Tale
    Loving Vincent | Loving Vincent | yes     | Shrek 2
    Loving Vincent | Loving Vincent | yes     | The Lego Movie
    Loving Vincent | Loving Vincent | yes     | Shark Tale
    Loving Vincent | Loving Vincent | yes     | Shrek 2

6. Conclusions

After all the experiments done with this network, we can state the following conclusions:

1. Combining backpropagation with a heuristic approach gave an unprecedented leap in network accuracy. After a number of tests with weight randomization, we suspect that by giving a "nudge" to the weights we push the network out of a local minimum, hence allowing it to learn further.
2. Changing the model in flight allows for more generality. By changing the model topology (e.g. adding an additional output dimension), we have seen an increase in the generality of the model.
3. Training on diversified samples first results in increased generality.

References

[1] M. Woźniak, D. Połap, Soft trees with neural components as image-processing technique for archeological excavations, Personal and Ubiquitous Computing 24 (2020) 363–375.
[2] D. Połap, M. Woźniak, Bacteria shape classification by the use of region covariance and convolutional neural network, in: 2019 International Joint Conference on Neural Networks (IJCNN), IEEE, 2019, pp. 1–7.
[3] M. Woźniak, D. Połap, R. K. Nowicki, C. Napoli, G. Pappalardo, E. Tramontana, Novel approach toward medical signals classifier, in: 2015 International Joint Conference on Neural Networks (IJCNN), IEEE, 2015, pp. 1–7.
[4] G. Capizzi, G. L. Sciuto, C. Napoli, D. Połap, M. Woźniak, Small lung nodules detection based on fuzzy-logic and probabilistic neural network with bio-inspired reinforcement learning, IEEE Transactions on Fuzzy Systems 28 (2019) 1178–1189.
[5] F. Beritelli, G. Capizzi, G. Lo Sciuto, C. Napoli, M. Woźniak, A novel training method to preserve generalization of RBPNN classifiers applied to ECG signals diagnosis, Neural Networks 108 (2018) 331–338.
[6] L.-H. Chen, Y.-C. Lai, H.-Y. M. Liao, Movie scene segmentation using background information, Pattern Recognition 41 (2008) 1056–1065.
[7] Y. Zhou, L. Zhang, Z. Yi, Predicting movie box-office revenues using deep neural networks, Neural Computing and Applications 31 (2019) 1855–1865.
[8] D. Połap, K. Kęsik, A. Winnicka, M. Woźniak, Strengthening the perception of the virtual worlds in a virtual reality environment, ISA Transactions 102 (2020) 397–406.
[9] J. Chen, J. Shao, C. He, Movie fill in the blank by joint learning from video and text with adaptive temporal attention, Pattern Recognition Letters 132 (2020) 62–68.
[10] S. Alghowinem, R. Goecke, M. Wagner, A. Alwabil, Evaluating and validating emotion elicitation using English and Arabic movie clips on a Saudi sample, Sensors 19 (2019) 2218.
[11] T.-L. Nguyen, S. Kavuri, M. Lee, A multimodal convolutional neuro-fuzzy network for emotion understanding of movie clips, Neural Networks 118 (2019) 208–219.
[12] G. Capizzi, S. Coco, G. Lo Sciuto, C. Napoli, A new iterative FIR filter design approach using a Gaussian approximation, IEEE Signal Processing Letters 25 (2018) 1615–1619.
[13] M. Matta, G. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Giardino, A. Nannarelli, M. Re, S. Spanò, A reinforcement learning-based QAM/PSK symbol synchronizer, IEEE Access 7 (2019) 124147–124157.