An Introduction to Image Classification and Object Detection using YOLO Detector

Martin Štancel1 [0000-0001-6669-1439] and Michal Hulič1 [0000-0002-2974-8050]

1 Technical University of Košice, Košice, Slovakia
martin.stancel@tuke.sk, michal.hulic@tuke.sk

Abstract. Artificial neural networks have proved to be the best and most widely used solution for image classification and object detection tasks. This paper analyzes them as a tool that significantly improves these computationally demanding tasks. It gives a brief history of their development and describes the object detector we selected for the introductory experiment presented later in the paper. It also introduces the idea for our future research, which will build on the conducted experiment and involve a new methodology for the automated generation of new domain-specific datasets, which are essential in the training phase of neural networks.

Keywords: Artificial neural network, Image classification, Object detection, Dataset, Pattern recognition, Computer vision, Machine learning.

1 Introduction

In the last two decades, computer vision, machine learning and neural networks have grown steadily in popularity among scientists and researchers, because the hardware and software components of today's computers have advanced significantly. This has allowed us to perform extensive algorithmic operations and to work with huge amounts of data.

We analyzed artificial neural networks (neural networks for short), a sub-area of machine learning, which are the most suitable method for image classification and object detection tasks. Neural networks use the methodologies of machine learning and computer vision.
Computer vision deals with image processing, including noise reduction, brightness changes and image enhancement by various techniques. Machine learning, on the other hand, is very flexible, because it can be used in computer vision, image processing and other sectors of computer science.

The paper also describes the history of neural networks and the convolutional neural network, which has become the most popular method for image classification and object detection tasks.

Based on the analyzed facts and the results of our empirical tests, we would like to design and implement an optimized method for the automated generation of domain-specific datasets. Such datasets are essential in the training phase, without which neural networks cannot learn to detect objects in a series of new images.

2 Neural Networks

Many general methods solve a problem in their own way within an acceptable time. Neural networks are one of them, and they have become commercially popular because hardware and software advance significantly every day. Today, they are widely used in many sectors of computer science, from Arduino microcontroller interfaces [1] through authentication [2] to the image classification and object detection that we study.

Neural networks consist of many interconnected groups of nodes called neurons. Input variables derived from the data are passed to these neurons as a multivariable linear combination, in which each input value is multiplied by its own coefficient (i.e. a weight). A non-linearity is then applied to this linear combination, which gives neural networks the ability to model complex non-linear relations. Neural networks can have multiple layers, where the output of one layer is the input of the next.
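The computation described above (a weighted linear combination per neuron, followed by a non-linearity, with layers chained one after another) can be sketched in a few lines of plain Python. This is only an illustration: the sigmoid activation, the bias terms, and the hand-picked weights in `HIDDEN` and `OUTPUT` are assumptions made for the example and do not come from the paper or from YOLO.

```python
import math
from typing import Callable, List, Sequence, Tuple

Layer = Tuple[Sequence[Sequence[float]], Sequence[float]]  # (weights, biases)


def sigmoid(z: float) -> float:
    """Logistic non-linearity; one common choice, assumed here for illustration."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(
    inputs: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
    activation: Callable[[float], float] = sigmoid,
) -> List[float]:
    """Compute one layer: a weighted sum per neuron, then the non-linearity."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        if len(neuron_weights) != len(inputs):
            raise ValueError("each neuron needs exactly one weight per input")
        # Linear combination of the inputs with this neuron's weights.
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs


def forward(inputs: Sequence[float], layers: Sequence[Layer]) -> List[float]:
    """Feed the output of each layer into the next one."""
    activations = list(inputs)
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations


# Hypothetical weights, chosen by hand rather than learned from data.
HIDDEN: Layer = ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0])  # 2 inputs -> 2 neurons
OUTPUT: Layer = ([[1.0, -1.0]], [0.0])                    # 2 inputs -> 1 neuron

if __name__ == "__main__":
    print(forward([1.0, 1.0], [HIDDEN, OUTPUT]))
```

In a trained network the weights and biases would be learned from a dataset instead of being written by hand; the forward computation itself stays the same.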
For the learning and detection processes, neural networks also use training datasets (section 2.2). Today there are many algorithms built on various types of neural networks. Their historical development is described in section 2.1.

2.1 History of the Neural Networks

Simple approaches to building neural networks have existed for several decades. The very first was introduced by Frank Rosenblatt in 1958 [3], who studied how information from the physical world is stored in a biological system so that it can later be used for recognition or to influence behavior.

Models with several successive non-linear layers of neurons followed in the 1960s [4] and 1970s [5]. The gradient descent method for supervised learning [6] in discrete, differentiable networks of arbitrary depth, called backpropagation [7], was applied to a neural network for the first time in 1981. Networks with a large number of layers were too hard to develop at that time, so their development stagnated until the beginning of the 1990s [8], when the unsupervised learning [9] method was implemented.

The 1990s and 2000s brought significant improvements in this field. A new method, reinforcement learning [10], was developed: an agent explores an unknown environment and, by trial and error, learns about its surroundings and improves its actions with every new attempt [11].

In the third millennium, neural networks attracted a large number of researchers thanks to their application in many different sectors [12, 13], where they rank among the best algorithms. Since 2009, neural networks have won many competitions, especially in pattern recognition. Pattern recognition improved significantly when Alex Krizhevsky et al. developed in 2012 a convolutional neural network for the image classification task of the ImageNet challenge [14]. He and his team won the challenge and created a state-of-the-art image classification method that is still used today.

2.2 Datasets

There are many different datasets for machine learning today, but we take a closer look at image datasets, which are essential for image classification and object detection tasks. Creating image datasets is a relatively time-consuming operation, since they become useful only when they contain a huge amount of data. The image datasets used in image classification and object detection are created by labeling objects and accurately locating them with a bounding box. Currently, no tools can perform fully automated object labeling and locating.

We want to direct our research toward domain-specific environments, where a method that automates the generation of such datasets is in demand. We assume that it will be based on a convolutional neural network and an image object detector of the YOLO architecture, which we tested empirically in our two experiments (section 3.2). Our idea is to collect images online that show various types and colors of objects of the same class, have a transparent or single-colored background, and carry an accurate name. We could then extract the individual objects from the images and programmatically adjust their brightness, lighting, shadows, etc. to obtain even more images for the training phase. These objects would be placed on randomly generated backgrounds at random locations and with random overlap, as shown in Fig. 1.

Fig. 1. Randomly generated background with randomly located and overlapping objects.

3 Detector YOLO and the Experiments

YOLO is an object detector created by Redmon et al. [16].
The YOLO authors state [17] that it is a state-of-the-art image object detector that achieves the best results in terms of accuracy and speed, which is why we used it in our research along with its neural network called Darknet.

3.1 Detector

YOLO divides each image into a grid of size S x S, and each cell in the grid predicts B bounding boxes and their confidence. This confidence reflects how reliable and accurate the bounding box that locates and classifies an object is. YOLO defines the confidence of an object as follows:

$$\Pr(\mathrm{Object}) \cdot \mathrm{IOU}_{\mathrm{pred}}^{\mathrm{truth}} \qquad (1)$$

which means that the probability of the detected object is multiplied by the intersection over union (the intersection area divided by the union area of two bounding boxes) between the predicted bounding box and the ground truth box (i.e. the hand-labeled bounding box in the training data).

3.2 Experiments

With the YOLO detector we conducted two experiments using a model pre-trained on the COCO dataset [18]. In the first one, we showed how the detector works on the image shown below (Fig. 2), and in the second one we tested the detector on a series of 500 images to confirm its functionality empirically.

Using the detector on the image in various resolutions. In this experiment we compared image classification and object detection on an Intel Core i7-7700K processor (Table 1) and on a GeForce GTX 1070 graphic card (Table 2), using the same image for both components.

Fig. 2. Image used for this experiment.

By comparing the two tables, we can see that data processing, image classification and object detection are noticeably slower on the processor than on the graphic card (approximately 8x slower). The number of detected objects also increases with the resolution, because the image is clearer and of better quality.

Table 1. Object Detection Testing on the Processor.

Resolution    Objects Detected    Time in ms
378x284       8                   1639.114
756x567       9                   1623.239
1008x756      11
2016x1512     13                  1679.288
4032x3024     14                  1550.013

Table 2. Object Detection Testing on the Graphic Card.

Resolution    Objects Detected    Time in ms
378x284       8                   194.474
756x567       9                   202.065
1008x756      11
2016x1512     13                  198.224
4032x3024     14                  194.799

Using the detector on the series of 500 images. We extended our first experiment to detect objects in a series of 500 images. Based on the results of the previous experiment, we no longer varied the resolution, because it has no effect on the detection time, and using the images in their original resolution yields more detected objects.

For the comparison we chose the images with the fastest and the slowest detection time and the images with the most and the fewest detected objects. We also report the average time and the average number of detected objects over the whole series of images. As in the first experiment, we ran the processing on both the processor and the graphic card. The results are shown in Table 3 and Table 4.

Table 3. Object Detection Testing of 500 Images on the Processor.

Property                 Objects Detected    FLOPS     Time in ms
The fastest detection    20                  65.864    1401.273
The slowest detection    20                  65.864    2188.822
Average                  13.23                         1563.503
The most objects         44                  65.864    1505.41
The least objects        2                   65.864    1572.51

Table 4. Object Detection Testing of 500 Images on the Graphic Card.

Property                 Objects Detected    FLOPS     Time in ms
The fastest detection    20                  65.864    187.003
The slowest detection    20                  65.864    220.742
Average                  13.23                         192.299
The most objects         44                  65.864    189.685
The least objects        2                   65.864    188.625

On the series of 500 images, the detection time ranges from 1401.273 ms to 2188.822 ms with an average of 1563.503 ms on the processor, and from 187.003 ms to 220.742 ms with an average of 192.299 ms on the graphic card.
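The average times reported in Table 3 and Table 4 can be turned into a speed-up factor and a frame rate with simple arithmetic. The short sketch below only restates the two measured averages; the function names and the extrapolation to the total time for a batch of images are illustrative assumptions, not additional measurements.

```python
# Average per-image detection times taken from Table 3 and Table 4.
CPU_AVG_MS = 1563.503  # processor (Intel Core i7-7700K)
GPU_AVG_MS = 192.299   # graphic card (GeForce GTX 1070)


def frames_per_second(ms_per_image: float) -> float:
    """Convert a per-image time in milliseconds to images per second."""
    return 1000.0 / ms_per_image


def speedup(slow_ms: float, fast_ms: float) -> float:
    """How many times faster the second device is than the first."""
    return slow_ms / fast_ms


def batch_seconds(n_images: int, ms_per_image: float) -> float:
    """Estimated total time for a batch, assuming the average holds per image."""
    return n_images * ms_per_image / 1000.0


if __name__ == "__main__":
    print(f"GPU speed-up over CPU: {speedup(CPU_AVG_MS, GPU_AVG_MS):.2f}x")
    print(f"CPU throughput: {frames_per_second(CPU_AVG_MS):.2f} FPS")
    print(f"GPU throughput: {frames_per_second(GPU_AVG_MS):.2f} FPS")
    print(f"500 images on CPU: {batch_seconds(500, CPU_AVG_MS):.1f} s")
    print(f"500 images on GPU: {batch_seconds(500, GPU_AVG_MS):.1f} s")
```

With these averages the graphic card is about 8.1 times faster than the processor and processes roughly 5.2 images per second, which agrees with the approximate figures quoted in the text.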
From the results of this experiment we can conclude that the number of detected objects does not affect the speed of detection: the fastest and the slowest processed images contain the same number of objects, and the detection times of the images with the most and the fewest detected objects are almost identical.

4 Future Research

In the future, we would like to use the YOLO detector to process a huge number of images for the training phase of the automated generation of domain-specific datasets. Based on our results, we will run the processing on a graphic card. The card we used achieved approximately 5 FPS (an average of 192.299 ms per image).

The future research will also aim to design a completely new methodology for the automated generation of domain-specific datasets. We assume that the method will be of great importance in reducing the time cost of creating new datasets, especially in the labeling phase, where each object in an image must be precisely enclosed in a bounding box. Today this task is done by hand, and our approach should remove human intervention from the labeling process entirely.

The method could also be applied to real-time detection and to many other tasks, such as determining the specific species of a certain kind, or in education, to learn specific objects in the same way children learn from their very first moments of life.

Finally, we would like to point out that creating such datasets is a serious problem, since labeling and locating the objects in images is mostly manual work. Our method would help researchers in many different areas to get significantly better results because, as noted in [19, 20], their datasets are often very limited, which can affect the accuracy of the results.
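The placement step of the dataset-generation idea (objects put at random locations with some overlap, section 2.2) can be sketched as follows. This is a minimal illustration under assumptions that are not part of the experiments above, not the planned methodology: the names `Box`, `place_objects` and `to_label_line` are hypothetical, the overlap between objects is capped with the intersection over union used in Eq. (1), and the labels are written as `class x_center y_center width height` normalized to the image size, which is the annotation layout used by Darknet. Pasting the actual object pixels onto a background is left out.

```python
import random
from typing import List, NamedTuple, Sequence, Tuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixels: left, top, right, bottom."""
    x1: int
    y1: int
    x2: int
    y2: int


def iou(a: Box, b: Box) -> float:
    """Intersection area divided by union area of two boxes."""
    inter_w = max(0, min(a.x2, b.x2) - max(a.x1, b.x1))
    inter_h = max(0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = inter_w * inter_h
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def place_objects(
    canvas: Tuple[int, int],
    objects: Sequence[Tuple[int, int, int]],  # (class_id, width, height)
    max_iou: float,
    rng: random.Random,
    max_tries: int = 100,
) -> List[Tuple[int, Box]]:
    """Pick a random position per object, rejecting too-heavy overlaps.

    An object that cannot be placed within max_tries attempts is skipped.
    """
    canvas_w, canvas_h = canvas
    placed: List[Tuple[int, Box]] = []
    for class_id, w, h in objects:
        if w > canvas_w or h > canvas_h:
            raise ValueError("object is larger than the canvas")
        for _ in range(max_tries):
            x = rng.randint(0, canvas_w - w)
            y = rng.randint(0, canvas_h - h)
            candidate = Box(x, y, x + w, y + h)
            if all(iou(candidate, other) <= max_iou for _, other in placed):
                placed.append((class_id, candidate))
                break
    return placed


def to_label_line(class_id: int, box: Box, canvas: Tuple[int, int]) -> str:
    """Format one annotation as normalized center coordinates and size."""
    canvas_w, canvas_h = canvas
    cx = (box.x1 + box.x2) / 2 / canvas_w
    cy = (box.y1 + box.y2) / 2 / canvas_h
    w = (box.x2 - box.x1) / canvas_w
    h = (box.y2 - box.y1) / canvas_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


if __name__ == "__main__":
    canvas = (640, 480)
    objects = [(0, 100, 80), (1, 120, 60), (2, 50, 50)]
    for class_id, box in place_objects(canvas, objects, 0.2, random.Random(0)):
        print(to_label_line(class_id, box, canvas))
```

Because the bounding box of every object is known at the moment it is placed, the label is produced together with the image and no manual annotation step is needed for such synthetic samples.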
With our approach to the automated generation of domain-specific datasets, we could train neural networks on specific environments. This would significantly help to determine not only the class of an object but also its kind and subclass; e.g. a detected flower would be more precisely identified as a forget-me-not, and a detected tree as a baobab.

Acknowledgement

This work was supported by the Faculty of Electrical Engineering and Informatics, Technical University of Košice under the contract No. FEI-2018-59: Semantic Machine of Source-Oriented Transparent Intensional Logic.

References

1. Madoš, B., Ádám, N., Hurtuk, J., Čopjak, M.: Brain-computer interface and Arduino microcontroller family software interconnection solution. In: Proc. of the IEEE 14th International Symposium on Applied Machine Intelligence and Informatics, pp. 217–221. 2016.
2. Vokorokos, L., Danková, E., Ádám, N.: Task scheduling in distributed system for photorealistic rendering. In: Proc. of the IEEE 8th International Symposium on Applied Machine Intelligence and Informatics, pp. 43–47. 2010.
3. Rosenblatt, F.: The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. In: Psychological Review, vol. 65, iss. 6, pp. 386–407. USA. 1958.
4. Ivakhnenko, A. G., Lapa, V. G.: Cybernetic predicting devices. CCM Information Corp, USA. 1965.
5. Werbos, P.: Beyond regression: new tools for prediction and analysis in the behavioral sciences. 1974.
6. Hardt, M., Price, E., Srebro, N.: Equality of Opportunity in Supervised Learning. In: Advances in Neural Information Processing Systems, vol. 29. 2016.
7. Wang, L., Zeng, Y., Chen, T.: Back propagation neural network with adaptive differential evolution algorithm for time series forecasting. In: Expert Systems with Applications, vol. 42, iss. 2, pp. 855–863. 2015.
8. Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. In: IEEE Transactions on Neural Networks, vol. 5, iss. 2, pp. 157–166. 1994.
9. Radford, A., Metz, L., Chintala, S.: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In: International Conference on Learning Representations (ICLR). 2016.
10. Wiering, M., van Otterlo, M.: Reinforcement Learning. ISBN 978-3-642-27645-3. 2012.
11. Chovanec, M., Chovancová, E., Dufala, M.: DIDS based on hybrid detection. In: IEEE International Conference on Emerging eLearning Technologies and Applications (ICETA), Slovakia, pp. 79–83. 2014.
12. Vokorokos, L., Pekár, A., Ádám, N., Daranyi, P.: Yet Another Attempt in User Authentication. In: Acta Polytechnica Hungarica, vol. 10, iss. 3, pp. 37–50. 2013.
13. Hurtuk, J., Baláž, A., Ádám, N.: Security sandbox based on RBAC model. In: Proc. of the 11th International Symposium on Applied Computational Intelligence and Informatics, pp. 75–80. 2016.
14. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet Classification with Deep Convolutional Neural Networks. In: Advances in Neural Information Processing Systems, vol. 25. 2012.
15. Lecun, Y., Bengio, Y., Hinton, G.: Deep learning. In: Nature, pp. 436–444. 2015.
16. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You Only Look Once: Unified, Real-Time Object Detection. In: IEEE Conference on Computer Vision and Pattern Recognition, IEEE Xplore, pp. 779–788. 2016.
17. Redmon, J., Farhadi, A.: YOLOv3: An Incremental Improvement. Tech Report. arXiv:1804.02767. 2018.
18. Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., Dollár, P.: Microsoft COCO: Common Objects in Context. In: European Conference on Computer Vision (ECCV), pp. 740–755. 2014.
19. Barbedo, J. G. A.: Impact of dataset size and variety on the effectiveness of deep learning and transfer learning for plant disease classification. In: Computers and Electronics in Agriculture, vol. 153, pp. 46–53. Elsevier. 2018.
20. Vokorokos, L., Ennert, M., Čajkovský, M., Radušovský, J.: A Survey of parallel intrusion detection on graphical processors. In: Central European Journal of Computer Science, vol. 4, iss. 4, pp. 222–230. Open Computer Science. 2014.