Machine Learning Methods for Computer Vision

Eros Innocenti¹, Alessandro Vizzarri²

¹ Department of Engineering Science, Guglielmo Marconi University, Italy
² Department of Enterprise Engineering, University of Rome Tor Vergata, Italy

Abstract
Over the last years, deep learning methods have proved to outperform previous machine learning techniques, especially in computationally demanding tasks such as computer vision. This review paper aims to provide a preliminary overview of the machine learning tasks in which computer vision is involved. Furthermore, a brief review of their history and of the state-of-the-art techniques is presented for the fields of image classification and object detection.

Keywords
Machine Learning, Computer Vision, Artificial Intelligence, Deep Learning

ICYRIME 2021 @ International Conference of Yearly Reports on Informatics Mathematics and Engineering, online, July 9, 2021
eros@newtechweb.it (E. Innocenti); alessandro.vizzarri@uniroma2.it (A. Vizzarri)
ORCID: 0000-0002-7793-4974 (E. Innocenti); 0000-0002-6274-991X (A. Vizzarri)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073, pp. 85–89

1. Introduction

Nowadays, computer vision (CV) is one of the most studied subfields of artificial intelligence and machine learning. Its applications are many and varied, ranging from industrial and manufacturing applications [1] to healthcare and autonomous vehicles. The main goal of CV is to replicate the capabilities of human vision. Although this kind of task appears fairly simple to our brain, there is a lot of information processing under the hood. Over the years, the field of computer vision has been shifting from a statistical approach, based on hand-crafted methods, to deep learning neural networks. This change of perspective is not driven only by an increasing demand for performance [2]. In fact, deep learning models have proved that they can learn semantic representations of images, thus adapting better to different scenarios without requiring human intervention [3]. In this paper we briefly review the problems that CV can solve and the state-of-the-art technologies developed in the last few years of research. In Section 2 we illustrate how machine learning problems are categorized into different tasks, each one with different goals. Section 3 presents the subtasks specifically related to computer vision; subsequently, Section 4 describes some commonly used classification and object detection techniques. Finally, Section 5 gives an overview of future directions and presents some of the open challenges of the coming years.

2. Machine Learning Tasks

Machine learning includes an extensive set of tasks, which can be classified into three broad categories: Supervised Learning, Unsupervised Learning and Reinforcement Learning. In the next subsections we briefly describe these three categories.

2.1. Supervised learning

In supervised learning the goal is to infer a function starting from a collection of labeled training data. The training data typically consists of a set of image examples annotated with extra information, such as the image class or the position of the depicted object(s). In most cases the training set is annotated by hand, but semi-supervised approaches are available too. This possibility is useful if the training set is small and it is difficult or even impossible to obtain more samples. Moreover, image augmentation techniques (e.g., horizontal and vertical flip, shear, brightness and contrast variations) can be used to artificially increase the training set size, thus achieving better training performance.

The steps required to train a computer vision model using supervised learning can be summarized as follows:

1. Decide the kind of training examples which accurately represent the problem.
2. Collect a sufficient number of examples. In the case of many classes, make sure to balance the number of examples across all of them.
3. Decide an input feature vector which is descriptive for the selected task. The number of features should not be too large, in order to avoid overfitting.
4. Decide the structure of the learning function and pick a loss function which has to be minimized during the training phase.
5. Run the model on the training set, iteratively optimizing its parameters until the target metric (e.g., loss, accuracy, average precision) reaches the target value.
6. Evaluate the trained model on a test set. In order to obtain an unbiased evaluation of the model, it is important that the test set is composed only of unseen examples.

2.2. Unsupervised learning

Unsupervised learning, unlike the supervised one, does not need a labeled training set. Instead, the goal is to infer a function which describes the underlying structure of unlabeled data. It is worth noting that, since the examples are not annotated, it is not possible to evaluate the performance of the model using the methods applied in supervised learning. Unsupervised learning is used in many situations, among them dimensionality reduction, cluster search and data compression. One popular example of unsupervised learning is the k-means clustering algorithm [4].

2.3. Reinforcement learning

Lastly, reinforcement learning differs substantially from the previous categories because it lacks initial training data completely [5]. In this kind of machine learning, the running program (i.e., the agent) interacts with the environment through sensors and actuators, with a certain goal to achieve. The agent receives feedback, in the form of rewards or penalties, based on the actions taken in one or more previous time steps.

In the next sections of this paper we focus mainly on supervised learning. Specifically, we analyze the most frequent computer vision subtasks and the techniques commonly used to solve this kind of problem.

3. Computer Vision Tasks

As stated before, in computer vision we can further split these tasks, mainly into 4 categories:

• Image classification
• Object localization
• Object detection
• Object segmentation

In Figure 1 an example of these categories is depicted.

Figure 1: Computer vision tasks (panels: Image classification, Object Localization, Object Detection, Object Segmentation; example labels: Apple, Pear)

3.1. Image classification

Image classification is probably the most well-known computer vision task. The main goal is to assign an input image to one of a set of predefined categories. The simplest case is binary classification, in which the output of the model consists of only two possible values: true or false. An example is a classifier which, given a picture, returns whether that picture contains a person or not. A more complex version of the same classifier could have more than two categories (e.g., person, cat, dog, car).

3.2. Object localization

Starting from the previous image classification task, we can enrich the output of the neural network by adding information about the location of the object. The common way to describe the location of an object is to define a bounding box which encloses the object in the picture.

3.3. Object detection

Object localization is limited to one object per image. The computer vision task whose goal is to localize multiple objects of different classes in the same picture is called Object Detection. This task introduces major complexities compared with the previous one, and the effort required to scale from Object Localization to Object Detection can be significant. Some of the problems encountered can be difficult even for humans. Some objects may be only partially visible, because they overlap each other or lie partially outside the frame. Moreover, the sizes of objects belonging to the same class can vary noticeably.

3.4. Object segmentation

In the previous localization and detection tasks, the main goal is to place a bounding box (and a class label) over all the objects present in the input image. Segmentation differs from localization and detection because the output is no longer a set of bounding boxes. Instead, in segmentation, the computer vision model tries to annotate every pixel of the image with the class it belongs to, chosen from a set of predefined ones.

Object segmentation can be further divided into two types: semantic segmentation [6, 7] and instance segmentation [8, 9].

The main difference between these two kinds is that semantic segmentation treats multiple objects belonging to the same class as a single entity. On the other hand, instance segmentation treats multiple objects of the same class as individual instances.

3.5. Object tracking

Object tracking applies to a sequence of images instead of a single input; for this reason it has not been listed at the beginning of this section. The purpose of object tracking is to track a moving object over subsequent frames. This kind of functionality is essential for robots and autonomous cars. A straightforward approach to object tracking is to apply object detection techniques to a video sequence and then compare every object instance across frames, in order to determine the direction and the speed of the movement. However, it is worth noting that, in many cases, object tracking does not need to recognize objects of different classes, but can simply rely on motion criteria without being aware of the object classes.

4. Techniques

4.1. Object classification

The emergence of large scale annotated training sets such as ImageNet [10] or COCO [11] required significant computational power and deeper network architectures. In the last few years, high performance parallel computing systems, such as GPUs, opened up new challenges in computer vision that can be solved by means of deep learning. The most representative models of deep learning applied to computer vision are Convolutional Neural Networks (CNNs). The first convolutional neural network appeared in 1998 with LeNet-5 [12], a 7-layer convolutional neural network developed by Yann LeCun. LeNet was used to recognize hand-written digits from the famous MNIST dataset [13], a collection of 28x28 pixel greyscale images, padded to 32x32 pixels for the network input. The architecture was rather simple, mainly because of the computational power constraints of the time.

In 2012, AlexNet [14] won the ILSVRC [15] (ImageNet Large Scale Visual Recognition Challenge) 2012 competition with an architecture similar to LeNet but with more filters and layers, thus becoming one of the first deep neural networks.

The next year, ZFNet [16] won the ILSVRC mostly by tweaking the hyper-parameters of AlexNet, maintaining the same base structure.

In 2014 VGGNet [17] entered the scene, becoming one of the reference architectures for object classification. The first version (i.e., VGG16) had a very uniform architecture, composed of sixteen weight layers: thirteen 3x3 convolutional layers, interleaved with max pooling operations, followed by three fully connected layers. The main drawback of VGG is the number of parameters (i.e., 138 million), which can be challenging to handle. Nevertheless, VGG is still one of the preferred architectures for feature extraction from images.

In 2015, ResNet by Kaiming He et al. [18] introduced a novel CNN architecture called Residual Neural Network. The main difference from the previous architectures is the introduction of skip connections between layers. Such skip connections made it possible to obtain better training results with fewer parameters. ResNet obtained a top-5 error rate of 3.57% on ImageNet, which beats human-level performance (approximately 5%) on the same dataset.

In 2017, MobileNet [19] was presented as a solution for mobile and embedded visual applications. This lightweight network is particularly suited for low power systems [20]. The network is very flexible and can be easily adapted to the specific application by tweaking its hyper-parameters.

Lastly, in 2019 Mingxing Tan and Quoc V. Le [21] studied a novel neural network (i.e., EfficientNet) which can be scaled up as needed in a very efficient way. The main novelty of this method is that the scaling process involves not only the depth of the network, but also its width and the resolution of the input, showing that this compound method obtains better results with fewer parameters.

4.2. Object detection

Deep Neural Networks for Object Detection can be categorized into two different types:

• Region proposal networks
• Single shot detectors

Historically, the first detectors were based on the previously described image classification networks. The basic idea for obtaining object detection is a sliding window approach. Essentially, a fixed size rectangular window crops the image at different positions and a subsequent image classification network is in charge of predicting the object class. At each iteration, the window is moved by a stride value until the whole image has been analyzed. The main drawback of this method is its low speed, because it is computationally expensive. An improvement over the sliding window approach is selective search [22], a hierarchical grouping segmentation algorithm that combines multiple grouping strategies. This algorithm starts with an initial set of regions and at each iteration merges the most similar regions together, until the whole image is represented as a single region. Finally, a set of regions of interest (ROIs) is selected and fed into an image classification network. The resulting object detection network is called Region-based ConvNet (R-CNN) [23, 24]. Although selective search noticeably improved the overall speed of the process, it is still not enough when speed is a key factor. In 2015 two further improvements of region proposal based networks were proposed: Fast R-CNN [25] and, soon after, Faster R-CNN [26]. The main novelty, introduced with Faster R-CNN, was the integration of ROI generation into the neural network itself. In fact, the previous versions used selective search for ROI extraction as a separate process.

In the same year, YOLO (You Only Look Once) [27, 28] revolutionized the object detection scene by presenting an algorithm substantially different from the classical region proposal networks. A new kind of architecture started to emerge, called Single Shot Detectors. Instead of using a ROI extraction phase, single shot detectors divide the image into a grid, giving each cell the task of detecting objects in that region. For each grid cell, multiple predefined boxes (i.e., anchors or priors) are considered. These boxes have multiple sizes and aspect ratios, in order to be able to detect objects of different shapes. Immediately after, Single Shot MultiBox Detectors [29] followed the same approach, obtaining results similar to YOLO in terms of speed and accuracy.

Over the years many variations of these architectures were presented, each one with its particularities and strengths. Although there are exceptions, nowadays region proposal based networks are preferred when accuracy is of main importance and speed is secondary. Moreover, R-CNNs are considered better at detecting small objects.

On the other hand, single shot detectors overtake R-CNNs in real-time tasks and in edge or mobile computing [30]. The inference time of these networks is lower, at the cost of reduced accuracy [31].

5. Conclusions

In this paper, a brief review of commonly used deep learning methods has been made, emphasizing their application in the field of computer vision. In the last years, especially using GPU clusters, we obtained the computational power needed to design deeper neural networks [32]. Moreover, the availability of large datasets such as COCO or ImageNet allowed training accurate models, which can be adapted to a variety of scenarios.

With the increasing importance of mobile devices and edge computing, the high power requirements of the reviewed techniques will inevitably conflict with the low power resources offered by edge devices. Although cloud computing can help, in many situations, such as rural areas, internet access is problematic, which rules out remote processing. Moreover, supervised learning, which is the most commonly used method for computer vision tasks, obtains notable results at the cost of long training times. In the future, self-learning methods should be considered, in order to skip the whole dataset creation and focus on the learning phase, as happens for humans.

References

[1] A. Jaber, R. Bicker, Fault diagnosis of industrial robot bearings based on discrete wavelet transform and artificial neural network, International Journal of Prognostics and Health Management 7 (2016) art. no. 017.
[2] G. Capizzi, G. Lo Sciuto, C. Napoli, E. Tramontana, A multithread nested neural network architecture to model surface plasmon polaritons propagation, Micromachines 7 (2016) 110.
[3] F. Fallucchi, M. Petito, E. De Luca, Analysing and Visualising Open Data Within the Data and Analytics Framework, Communications in Computer and Information Science 846 (2019) pp. 135–146.
[4] Y. Li, H. Wu, A clustering method based on k-means algorithm, Physics Procedia 25 (2012) 1104–1109.
[5] L. Canese, G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Giardino, M. Re, S. Spanò, Multi-agent reinforcement learning: A review of challenges and applications, Applied Sciences 11 (2021) 4948.
[6] C. Napoli, G. Pappalardo, E. Tramontana, An agent-driven semantical identifier using radial basis neural networks and reinforcement learning, volume 1260, 2014.
[7] A. Venckauskas, A. Karpavicius, R. Damasevicius, R. Marcinkevicius, J. Kapociute-Dzikiene, C. Napoli, Open class authorship attribution of lithuanian internet comments using one-class classifier, 2017, pp. 373–382. doi:10.15439/2017F461.
[8] G. De Magistris, S. Russo, P. Roma, J. Starczewski, C. Napoli, An explainable fake news detector based on named entity recognition and stance classification applied to covid-19, Information (Switzerland) 13 (2022). doi:10.3390/info13030137.
[9] C. Napoli, E. Tramontana, G. Lo Sciuto, M. Woźniak, R. Damaševičius, G. Borowik, Authorship semantical identification using holomorphic chebyshev projectors, 2015, pp. 232–237. doi:10.1109/APCASE.2015.48.
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255. doi:10.1109/CVPR.2009.5206848.
[11] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: common objects in context, CoRR abs/1405.0312 (2014). URL: http://arxiv.org/abs/1405.0312. arXiv:1405.0312.
[12] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, in: Proceedings of the IEEE, 1998, pp. 2278–2324.
[13] Y. LeCun, C. Cortes, MNIST handwritten digit database, prova (2010). URL: http://yann.lecun.com/exdb/mnist/.
[14] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, Commun. ACM 60 (2017) 84–90. URL: https://doi.org/10.1145/3065386. doi:10.1145/3065386.
[15] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision (IJCV) 115 (2015) 211–252. doi:10.1007/s11263-015-0816-y.
[16] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, CoRR abs/1311.2901 (2013). URL: http://arxiv.org/abs/1311.2901. arXiv:1311.2901.
[17] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2015. arXiv:1409.1556.
[18] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, 2015. arXiv:1512.03385.
[19] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, Mobilenets: Efficient convolutional neural networks for mobile vision applications, 2017. arXiv:1704.04861.
[20] G. M. Bianco, R. Giuliano, G. Marrocco, F. Mazzenga, A. Mejia-Aguilar, LoRa System for Search and Rescue: Path-Loss Models and Procedures in Mountain Scenarios, IEEE Internet of Things Journal 8 (2021) pp. 1985–1999.
[21] M. Tan, Q. V. Le, Efficientnet: Rethinking model scaling for convolutional neural networks, CoRR abs/1905.11946 (2019). URL: http://arxiv.org/abs/1905.11946. arXiv:1905.11946.
[22] J. Uijlings, K. van de Sande, T. Gevers, A. Smeulders, Selective search for object recognition, International Journal of Computer Vision (2013). URL: http://www.huppelen.nl/publications/selectiveSearchDraft.pdf. doi:10.1007/s11263-013-0620-5.
[23] R. B. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, CoRR abs/1311.2524 (2013). URL: http://arxiv.org/abs/1311.2524. arXiv:1311.2524.
[24] N. Brandizzi, V. Bianco, G. Castro, S. Russo, A. Wajda, Automatic rgb inference based on facial emotion recognition, volume 3092, 2021, pp. 66–74.
[25] R. B. Girshick, Fast R-CNN, CoRR abs/1504.08083 (2015). URL: http://arxiv.org/abs/1504.08083. arXiv:1504.08083.
[26] S. Ren, K. He, R. B. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, CoRR abs/1506.01497 (2015). URL: http://arxiv.org/abs/1506.01497. arXiv:1506.01497.
[27] J. Redmon, S. K. Divvala, R. B. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, CoRR abs/1506.02640 (2015). URL: http://arxiv.org/abs/1506.02640. arXiv:1506.02640.
[28] R. Avanzato, F. Beritelli, M. Russo, S. Russo, M. Vaccaro, Yolov3-based mask and face recognition algorithm for individual protection applications, volume 2768, 2020, pp. 41–45.
[29] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, A. C. Berg, SSD: single shot multibox detector, CoRR abs/1512.02325 (2015). URL: http://arxiv.org/abs/1512.02325. arXiv:1512.02325.
[30] F. Mazzenga, R. Giuliano, F. Vatalaro, FttC-based fronthaul for 5G dense/ultra-dense access network: Performance and costs in realistic scenarios, Future Internet 9 (2017).
[31] A. Simonetta, M. Paoletti, Designing digital circuits in multi-valued logic, International Journal on Advanced Science, Engineering and Information Technology 8 (2018) pp. 1166–1172.
[32] G. Capizzi, F. Bonanno, C. Napoli, Hybrid neural networks architectures for soc and voltage prediction of new generation batteries storage, in: 2011 International Conference on Clean Electrical Power (ICCEP), IEEE, 2011, pp. 341–344.
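
As an illustration of the supervised learning steps listed in Section 2.1 (choosing a loss function, iteratively optimizing the parameters until a target value is reached, and evaluating on unseen examples), the following is a minimal sketch and not a method taken from the reviewed works. It assumes a toy two-feature dataset and a logistic regression model trained by gradient descent; the names `train_logistic`, `accuracy`, `train_x` and `test_x`, as well as all numeric values, are hypothetical choices made only for this example.

```python
import math


def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def score(w, b, x):
    """Linear score of a feature vector under weights w and bias b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_logistic(samples, labels, lr=0.5, epochs=5000, target_loss=0.05):
    """Minimize the mean cross-entropy loss with batch gradient descent.

    Training stops early once the loss reaches target_loss (step 5).
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    n = len(samples)
    loss = float("inf")
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        loss = 0.0
        for x, y in zip(samples, labels):
            p = sigmoid(score(w, b, x))
            # Cross-entropy loss; the small constant avoids log(0).
            loss -= y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12)
            err = p - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        loss /= n
        if loss <= target_loss:
            break
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b, loss


def accuracy(w, b, samples, labels):
    """Fraction of samples whose predicted class matches the label."""
    correct = sum(
        (1 if sigmoid(score(w, b, x)) >= 0.5 else 0) == y
        for x, y in zip(samples, labels)
    )
    return correct / len(samples)


# Hypothetical toy data: two features per example, two balanced classes.
train_x = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.9, 1.0], [1.0, 0.8], [0.8, 0.9]]
train_y = [0, 0, 0, 1, 1, 1]
# Test examples are not part of the training set (step 6).
test_x = [[0.1, 0.1], [0.9, 0.9]]
test_y = [0, 1]

w, b, final_loss = train_logistic(train_x, train_y)
test_accuracy = accuracy(w, b, test_x, test_y)
```

A real computer vision model would replace the hand-picked feature vectors with learned image features, but the train, monitor and evaluate structure stays the same.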
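
To make the k-means clustering algorithm mentioned in Section 2.2 more concrete, the sketch below alternates the two classic steps: assigning each point to its nearest centroid, then moving each centroid to the mean of its members. It is only an illustrative sketch under simplifying assumptions: the centroids are initialized from the first `k` points (real implementations usually use a randomized initialization), and the function name `kmeans` and the sample points are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100):
    """Cluster unlabeled points into k groups.

    Assumption for this sketch: the first k points are the initial centroids.
    """
    centroids = [list(p) for p in points[:k]]
    assignments = None
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for each point.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # Converged: no point changed cluster.
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Hypothetical unlabeled data forming two well separated groups.
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, k=2)
```

No labels are used at any point, which is why the result cannot be scored with the supervised metrics discussed in Section 2.1.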
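
The sliding window approach described in Section 4.2 can be sketched as follows. A fixed size window is moved over the image by a stride value, and every crop is passed to a classifier. The code is a simplified sketch: the image is a plain list of pixel rows, and `all_ones_classifier` is a hypothetical stand-in for the image classification network, used here only so that the example is self-contained.

```python
def sliding_windows(height, width, window, stride):
    """Yield (left, top, right, bottom) boxes covering the image."""
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            yield (left, top, left + window, top + window)


def detect(image, window, stride, classify):
    """Crop the image at every window position and classify each crop."""
    detections = []
    for x0, y0, x1, y1 in sliding_windows(len(image), len(image[0]), window, stride):
        crop = [row[x0:x1] for row in image[y0:y1]]
        label = classify(crop)
        if label is not None:
            detections.append(((x0, y0, x1, y1), label))
    return detections


def all_ones_classifier(crop):
    """Hypothetical stand-in for a classification network."""
    return "object" if all(pixel == 1 for row in crop for pixel in row) else None


# Toy 6x6 binary image containing a single 2x2 "object".
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
detections = detect(image, window=2, stride=1, classify=all_ones_classifier)
window_count = len(list(sliding_windows(6, 6, 2, 1)))
```

Even on this 6x6 image, a 2x2 window with stride 1 requires 25 classifier calls. The count grows quickly with image size and with the number of window sizes tried, which is the cost that region proposals and single shot detectors aim to reduce.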
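
For the single shot detectors of Section 4.2, the predefined boxes (anchors or priors) associated with each grid cell can be generated as in the sketch below. This is an assumption-laden illustration rather than the exact scheme of YOLO or SSD: it assumes a square image, a square grid, anchors centered on each cell, and a parameterization in which every anchor of scale `s` and aspect ratio `r` has width `s * sqrt(r)` and height `s / sqrt(r)`. The function name `generate_anchors` and the numeric values are hypothetical.

```python
import math


def generate_anchors(image_size, grid_size, scales, aspect_ratios):
    """Return (x_min, y_min, x_max, y_max) anchors for every grid cell.

    Each cell receives len(scales) * len(aspect_ratios) boxes centered on it.
    """
    cell = image_size / grid_size
    anchors = []
    for row in range(grid_size):
        for col in range(grid_size):
            cx = (col + 0.5) * cell  # Cell center, x coordinate.
            cy = (row + 0.5) * cell  # Cell center, y coordinate.
            for s in scales:
                for r in aspect_ratios:
                    # Width/height chosen so that the box area stays s * s.
                    w = s * math.sqrt(r)
                    h = s / math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


# Hypothetical configuration: 300x300 image, 3x3 grid, 2 scales, 3 ratios.
anchors = generate_anchors(300, 3, scales=[60, 120], aspect_ratios=[0.5, 1.0, 2.0])
```

In a trained detector, the network predicts for every anchor a class score and a small offset that refines the box. Here only the fixed geometry is shown.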