Choice of Neural Network Architecture when Recognizing Objects that do not Have High-Level Features Gregory Malyshev1, Vyacheslav Andreev2, Olga Andreeva2, Oleg Chistyakov1 and Dmitriy Sveshnikov1 1 JSC «Afrikantov OKBM», Burnakovsky passage 15, Nizhny Novgorod, 603074, Russia 2 Nizhny Novgorod state technical university n.a. R.E. Alekseev, Minina 24, Nizhny Novgorod, 603950, Russia Abstract This article explores the capabilities of pretrained convolutional neural networks in relation to the problem of recognizing defects for which it is impossible to identify any abstract features. The results of training the convolutional neural network AlexNet and the fully connected classifier of the VGG16 network are compared. The efficiency of using a pretrained neural network in the problem of defect recognition is demonstrated. A graph of the change in the proportion of correctly recognized images in the process of training a fully connected classifier is presented. The article attempts to explain the efficiency of a fully connected neural network classifier trained on a critically small training dataset with images of defects. The work of a convolutional neural network with a fully connected classifier is investigated. The classifier allows for classification into five categories: «crack» type defects, «chip» type defects, «hole» type defects, «multi hole» type defects and «defect-free surface». The article provides examples of convolutional network activation channels, visualized for each of the five categories. The signs of defects on which the activation of the network channels takes place are formulated. The classification errors made by the network are analyzed. The article provides predictive probabilities, below which the result of the network operation can be considered doubtful. Practical recommendations for using the trained network are given. Keywords 1 convolutional neural networks, pretrained neural networks, activation channels, image recognition, defects 1. Introduction Currently, the most advanced image recognition tool is convolutional neural networks [1-4]. The use of modern neural network architectures [5-8] and a large training set will certainly allow obtaining a high percentage of correctly recognized images. Such studies are no longer original. At the same time, identifying tasks in which it is quite possible to get by with networks of simple architecture is a hot topic. This article examines the operation of a convolutional neural network, which allows the classification of defects on products into five categories: defects of the "crack" type, defects of the "chip" type, defects of the "single pore" type, defects of the "accumulation of pores" type and "defect- free surface". For each of the five classified categories, the researchers only had 55 images, meaning the entire training set consisted of only 275 images. Seventy five images (15 images per category) were used to validate the network during the training (validation) phase. The set of validation already at the training stage allows us to track the epoch from which the network retraining begins [1, 9]. In addition, there GraphiCon 2021: 31st International Conference on Computer Graphics and Vision, September 27-30, 2021, Nizhny Novgorod, Russia EMAIL gsmalyshev@okbm.nnov.ru (G. Malyshev); vyach.andreev@mail.ru (V. Andreev); andreevaov@gmail.com (O. Andreeva); gsmalyshev@okbm.nnov.ru (O. Chistyakov); budnikov@okbm.nnov.ru (D. Sveshnikov); ORCID: 0000-0002-8147-988X (G. Malyshev); 0000-0002-7557-352X (V. Andreev); 0000-0001-9581-3028 (O. Andreeva); 0000-0002- 6515-9691 (O. Chistyakov); 0000-0002-2152-5346 (D. Sveshnikov); ©️ 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) were 50 test images (10 images per category) available, which were used to test the network performance after it was trained. 2. Choosing a convolutional network architecture When choosing a network architecture, the researchers proceeded from the fact that the initial layers of the convolutional neural network highlight the most generalized (local) features (for example, boundaries and textures) in the image, while deeper layers highlight abstract concepts, that is, high- level features (such as "cat's nose" or "bird's feather") [9]. When it comes to defects, it is rather difficult to talk about any abstract signs of defects, since defects look very amorphous. Thus, the main information about defects should be laid down in low-level features, the most complex of which can be, for example, broken lines (a sign of a crack), darkening on the surface or large accumulations of small dark spots (a sign of pore accumulation), violation of strict geometry at the edges. Products (a sign of chipping), a single spot (a sign of a separate pore), large surface areas with a uniform texture (a sign of a defect-free surface). These features are far from abstract, therefore, to identify such features, it is enough to use a convolutional network [10, 11] with a sufficiently small number of convolutional layers (no more than fifteen). Complex, abstract features (which are not detected in a problem with defect recognition), characteristic of specific classes, are “wired up” in deep layers. Therefore, to solve the problem of recognizing defects, we can remove not only the fully connected classifier [9], but also the deep convolution layers. The parameters of the convolutional basis of the network must be frozen in the process of training a new classifier, that is, only weights and thresholds [1, 9] of a fully connected classifier will be trained. Naturally, you can also retrain the parameters of the convolutional basis or its individual blocks. But this approach leads to significant time costs. Initially, an attempt was made to solve the problem of defect recognition using the AlexNET network [12]. The network was trained from scratch using the functions of the Keras library written in Python. To initialize the weights, a normal distribution with zero mean and a standard deviation of 0,1 was used. The initial thresholds were set to 0.1. The RMSprop method [1] was chosen as the optimization method (gradient descent method). When training the network using Keras library 5 functions, the initial learning rate was set to 2 10 , and the rest of the parameters of the RMSprop algorithm were left by default (these parameters can be found in the description of the Keras library functions, for example, here [13]). Each iteration, five images (minibatch) from the training set were fed to the network input. Already after eight epochs of training, the effect of overfitting began to manifest itself: losses on the validation set began to increase, so training was stopped. The trained network showed poor performance results: the share of correctly recognized images from the test set was only 72%. Such a low recognition accuracy can be explained (in addition to the small volume of the training set) by the too primitive architecture of the network, as well as by ineffective initial initialization of weights and thresholds [14- 16]. In [1], it was shown that even the most successful combinations of the initial initialization of the neural network parameters and the gradient descent algorithm can significantly increase the network learning rate, but will not give a serious gain in the accuracy of image recognition. That is, even using the pre-trained AlexNET network does not guarantee high recognition accuracy after additional training. Subsequently, the problem of recognizing defects was solved using the VGG16 network [17-18]. To avoid the difficulties associated with the initial initialization of the weights and thresholds of the network, it was decided to use the VGG16 network, already trained on a million images (1000 images per category) from the ImageNet training set [9]. The VGG16 model is part of the Keras framework, and the capabilities of this library allow you to modernize the network for your tasks: a fully connected classifier was chosen, which has only one hidden layer of 256 neurons. The RMSprop method was 5 chosen as the optimization method. The initial learning rate was set to 2 10 . The size of the minibatch was five. The fully connected classifier was trained over 40 epochs, after which the overfitting effect began to be observed [1, 9]. After training the network, the classification accuracy of images on the validation set reached 92% (69 images out of 75 validation ones). The change in the percentage of correctly recognized images during training is shown in Figure 1. Figure 1: The process of changing the proportion of correctly recognized images in the learning process Testing on a test set (50 images) showed that the accuracy of the trained network is 90% (45 images out of 50 test ones). The figures in Figure 2 - Figure 6 shows examples of channels [9] of the VGG16 network, which are activated on the most characteristic signs of defects. Figure 2: Activation on broken lines, which are a sign of a crack (channel 188 in the third convolutional layer of the third block of the VGG16 network) Figure 3: Activation on blackouts, which are a sign of multi hole (channel 61 in the third convolutional layer of the third block of the VGG16 network) Figure 4: Activation in areas where the smooth geometry at the edges is broken, which indicates the presence of a chip (channel 182 in the third convolutional layer of the third block of the VGG16 network) Figure 5: Activation on single spots, which are a sign of separate holes (channel 192 in the second convolutional layer of the third block of the VGG16 network) Figure 6: Activation in large areas with a uniform texture, which is a sign of a defect-free surface (channel 240 in the third convolutional layer of the third block of the VGG16 network) 3. Analysis of the results of the trained network Examples of recognized defects are shown in Figure 7 (the most difficult cases are selected). The parentheses in the figures indicate the true categories (class labels), without parentheses, the labels predicted by the network are indicated. The percentages in the figures are the predictive probability [1, 9] of class membership. a) b) c) d) Figure 7: Examples of the trained network operation: a – «crack» type defect; b - defect of the «multi hole» type; c - defect of the «chip» type; d - defect of the «separate hole» type Of particular interest are the faulty verdicts handed down by the network. In a test set of 50 images, only 5 images were incorrectly identified. In the validation set of 75 images, 6 images were incorrectly identified (at the time of the end of training). The two images are poorly classified even by an experienced operator. Another six out of eleven incorrectly recognized images are “uncharacteristic” images for the training sample. Such images should not be presented to the network: the use of image data in network testing is due to the limited availability of test and validation images. Another option for solving the problem is additional training or retraining of the network using "problem" images. Separately, it is necessary to pay attention to three erroneous verdicts of the network, which, upon first examination, may seem rather rude. In Figure 8a shows a defect-free surface that has been classified by the network as a "pore pool". In Figure 8b shows a defect of the “crack” type (the network classified the defective product as a defect-free surface). Nevertheless, the predictive probability for the cases presented in the figures in Figure 8a and Figure 8b is rather low, that is, the network “doubts” its verdict. In Figure 8c shows a clearly visible crack at the edge of the product. However, the network with a probability of 57.63% passed the verdict that the image shows a chip. This problem can be solved by adding images of products with cracks closed to the edges of the product to the training set. а) b) c) Figure 8: Errors made by the network 4. Recommendations for using a trained neural network To formulate recommendations for using the network, you must first of all proceed from the analysis of errors made by the network. Erroneously recognized defects were found only among images with a predictive probability of less than 60% (Figure 8c shows an incorrectly recognized defect, the predictive probability for which is the highest among all incorrectly recognized images). Thus, if an engineer is interested not only in the fact of a defect, but also in its type, then any verdict made by the network with a probability of less than 60% should be considered doubtful. Such "questionable" images must be sent to an experienced professional for a final decision. On the contrary, all images for which the predictive probability exceeded 60% were correctly recognized by the network. This fact allows us to make a rather rough assumption that all verdicts for which the predictive probability exceeds 60% are reliable. Such an assumption, in spite of all its roughness, is quite acceptable for control processes, in which a certain percentage of errors is pre-built. Among all the test and validation images (a total of 125 images), there were only 21 images for which the network delivered a verdict with a predictive probability of less than 60%. That is, a rough estimate based on the analysis of the recognition results of test and validation samples shows that a trained neural network saves 83% (104 images from 125) time for examining samples from silicified graphite. That is, if the operator does not check for image defects, the predictive probability for which exceeds 60%, he will save 83% of the working time. If the researcher is interested only in the fact of the presence of a defect, and not in its type, then the percentage of “doubtful” images should be estimated from Figure 8b. The fact is that among all test and validation images with defects, only one image (Figure 8b) was incorrectly classified as a defect-free surface. Such errors are the most dangerous, since the defect that the network "overlooked" can be harmful. Based on Figure 8b, then we can conclude that all images for which the network delivered a verdict with a predictive probability of less than 40% should be sent to an experienced specialist for additional research. Among all test and validation images, only 6 images were identified that meet this requirement. Thus, if the researcher is only interested in the fact of the presence of a defect on the surface, then the trained network will save 95% of the time spent on examining samples. 5. Conclusions Taking into account the smallness of the training set (when training commercial networks, 1000 images per category are used), the result obtained (90% of correctly recognized images on the test set) indicates the effectiveness of using a pre-trained neural network of a simple architecture. This effect is most likely due to the fact that defects do not need to reveal any abstract features. That is, to recognize objects that do not have high-level features, it is quite sufficient to use pre-trained networks with a simple architecture. Only a fully connected classifier will be trained, which will significantly save time for training. 6. References [1] E. Arkhangelskaya, Deep Learning. Immersion in the world of neural networks, SPb, Peter, St. Peterburg, 2020. [2] Y.-L. Boureau, F. Bach, Y. LeCun, J. Ponce, Learning mid-level features for recognition, in: Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR2010, San Francisco, CA, 2010, pp. 2559–2566. doi: 10.1109/CVPR.2010.5539963. [3] D. Scherer, A. Muller, S. Behnke, Evaluation of pooling operations in convolutional architectures for object recognition, in: K. Diamantaras, W. Duch, L.S. Iliadis (Eds.), Proceedings of the 20th International Conference Artificial Neural Networks – ICANN 2010, Proceedings, Part III, Thessaloniki, Greece, 2010, pp. 92-101. doi: 10.1007/978-3-642-15825-4_10. [4] Y. L. Boureau, J. Ponce, Y. Lecun, A theoretical analysis of feature pooling in visual recognition, in: Proceedings of the 27th International Conference on Machine Learning, ICML 2010 - Proceedings, Haifa, Israel, 2010, pp. 111-118. [5] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015. doi: 10.1109/CVPR.2015.7298594. [6] M. Lin, Q. Chen, S. Yan, Network in Network, 2014. 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, 14-16 April 2014. URL: http://arxiv.org/abs/1312.4400 [7] K. He, X. Zhang, S. Ren J. Sun, Deep Residual Learning for Image Recognition, in: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Las Vegas, NV, USA, 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90. [8] K. He, Х. Zhang, S. Ren, J. Sun, Identity Mappings in Deep Residual Networks, in: B. Leibe, J. Matas, N. Sebe, M. Welling (Eds.), Computer Vision – ECCV 2016. ECCV 2016. Lecture Notes in Computer Science, vol. 9908. Springer, Cham. https://doi.org/10.1007/978-3-319-46493-0_38 [9] F. Scholle, Deep Learning in Python, SPb, Peter, St. Peterburg, 2020. [10] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, Х. Wang, G. Wang, J. Cai, et al. Recent advances in convolutional neural networks, 2015. arXiv 2015, URL: http: arXiv:1512.07108. [11] S. Srinivas, R. K. Sarvadevabhatla, K. R. Mopuri, N. Prabhu, S. S. Kruthiventi and R. V. Babu, A taxonomy of deep convolutional neural nets for computer vision, Frontiers Robot. AI, vol. 2, p.36, Jan. 2016. doi: 10.3389/frobt.2015.00036. [12] A. Krizhevsky, I. Sutskever, G. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS’2012). URL: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.299.205. doi:10.1145/3065386. [13] Keras RMSprop algorithm parameters, 2021. URL: https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/RMSprop?hl=ru. [14] Y. LeCun, L. Bottou, G. Orr, K. Muller, Efficient BackProp, in: G. Orr and M. K., (Eds.), Neural Networks: Tricks of the trade, volume 1524 of Lecture Notes in Computer Science, Springer- Verlag, Berlin, Heidelberg, 1998, pp. 9–50. doi: 10.1007/3-540-49430-8. [15] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, Journal of Machine Learning Research, volume 9, January 2010, pp. 249-256. URL: http://proceedings.mlr.press/v9/glorot10a.html9. [16] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: B. Francis, D. Blei (Eds.), Proceedings of the 32nd International Conference on International Conference on Machine Learning, volume 37 of ICML'15, JMLR.org, Lille France, 2015, pp. 448–456. doi:10.5555/3045118. [17] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014. arXiv preprint arXiv:1409.1556, 2014. URL: https://arxiv.org/abs/1409.1556 [18] K. Greff, R. K. Srivastava, J. Schmidhuber, Training Very Deep Networks. Advances in Neural Information, in Processing Systems (NIPS) 28th NIPS, Cambridge, MA, USA: MIT Press, 2015, pp. 2377–2385, doi: 10.5555/2969442.2969505.