<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Conference of Yearly Reports on
Informatics Mathematics and Engineering, online, July</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Machine Learning Methods for Computer Vision</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eros Innocenti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Vizzarri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Engineering Science, Guglielmo Marconi University</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Enterprise Engineering, University of Rome Tor Vergata</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>9</volume>
      <issue>2021</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Over the last few years, deep learning methods have proved to outperform previous machine learning techniques, especially in computationally demanding tasks such as computer vision. This review paper aims to provide a preliminary overview of the machine learning tasks in which computer vision is involved. Furthermore, a brief review of their history and of the state-of-the-art techniques is presented for the fields of image classification and object detection.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Learning</kwd>
        <kwd>Computer Vision</kwd>
        <kwd>Artificial Intelligence</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Nowadays, computer vision is one of the most studied subfields of artificial
intelligence and machine learning. Its applications are many and varied, ranging from
industry and manufacturing [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to healthcare and autonomous vehicles. The main goal of CV is to replicate
the capabilities of human vision. Although this kind of task appears fairly simple
to our brain, there is a lot of information processing under the hood. Over the years,
the field of computer vision has been shifting from a statistical approach, based on
hand-crafted methods, to deep learning neural networks. This change of perspective is
driven not only by an increasing performance demand [
        <xref ref-type="bibr" rid="ref3">2</xref>
        ]. In fact, deep learning models have proved that they can learn semantic
representations of images, thus adapting better to different scenarios without
requiring human intervention [
        <xref ref-type="bibr" rid="ref4">3</xref>
        ]. In this paper we present a brief review of the problems that CV can solve
and of the state-of-the-art technologies developed in the last few years of research.
In Section 2 we illustrate how machine learning problems are categorized into
different tasks, each one with different goals. Section 3 presents the subtasks
specifically related to computer vision; subsequently, in Section 4 some of the most
used object detection techniques are described. Eventually, in Section 5 an overview
of future directions is presented, together with some of the open challenges for the
coming years.
      </p>
    </sec>
    <sec id="sec-1-0">
      <title>2. Machine Learning Tasks</title>
      <p>Machine learning includes an extensive set of tasks, which can be classified
into three broad categories: Supervised Learning, Unsupervised Learning and
Reinforcement Learning. In the next subsections we briefly describe these three
categories.</p>
      <sec id="sec-1-0-1">
        <title>2.1. Supervised learning</title>
        <p>In supervised learning the goal is to infer a function starting from a
collection of labeled training data. The training data typically consist of a set of
image examples annotated with extra information such as the image class or the
position of the depicted object(s). The annotation is in most cases hand-made, but
semi-supervised approaches are available too. This possibility is useful if the
training set size is small and it is difficult or even impossible to obtain more
samples. Moreover, image augmentation techniques (e.g., horizontal and vertical flip,
shear, brightness and contrast variations) can be used to artificially increase the
training set size, thus achieving better training performances.</p>
        <p>The steps required to train a computer vision model using supervised learning
can be summarized as follows (a minimal sketch is given after the list):
1. Decide the kind of training examples which accurately represent the problem.
2. Collect a sufficient number of examples. In the case of many classes, make sure to
balance the number of examples across all of them.
3. Decide an input feature vector which is descriptive for the selected task. The
number of features should not be too large, in order to avoid overfitting.
4. Decide the learning function structure and pick a loss function which has to be
minimized during the training phase.
5. Run the model on the training set, iteratively optimizing its parameters until the
target metric (e.g., loss, accuracy, average precision) reaches the target value.
6. Evaluate the trained model on a test set. In order to obtain an unbiased
evaluation of the model, it is important that the test set is composed only of unseen
examples.</p>
      </sec>
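      <p>As an illustration of the steps above, the following minimal sketch (assuming
scikit-learn is available, and using its small digits dataset purely as a stand-in
for an annotated image collection) trains and evaluates a simple classifier:</p>
      <preformat>
# Minimal illustration of the supervised-learning steps above (a sketch, not a
# production pipeline); scikit-learn is assumed to be installed.
from sklearn.datasets import load_digits          # step 2: labeled examples
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

digits = load_digits()                            # 8x8 greyscale digit images
X = digits.images.reshape(len(digits.images), -1) # step 3: flatten to feature vectors
y = digits.target

# step 6 preparation: hold out unseen examples for an unbiased evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# steps 4-5: a linear model trained by minimising the cross-entropy loss
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))  # step 6: metric on the test set
      </preformat>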
      <sec id="sec-1-1">
        <title>2.2. Unsupervised learning</title>
        <p>
          Unsupervised learning, unlike the supervised one, does
not need a labeled training set. Instead, the goal is to
infer a function which describes the underlying structure
from unlabeled data. It is worth noting that since the
examples are not annotated, it is not possible to
evaluate the performance of the model using the methods
applied in supervised learning. Unsupervised learning is
used in many situations, such as dimensionality reduction,
clustering, and data compression. One popular example
of unsupervised learning is the k-means
clustering algorithm [
          <xref ref-type="bibr" rid="ref5">4</xref>
          ].
        </p>
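        <p>As an example of this kind of task, the following minimal sketch (assuming
NumPy and scikit-learn are available) clusters a small set of unlabeled 2-D points
with k-means:</p>
        <preformat>
# A minimal k-means clustering example (a sketch assuming scikit-learn is
# available); the data are unlabeled points and the algorithm infers the groups.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two unlabeled "blobs" of 2-D points
points = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
                    rng.normal(3.0, 0.5, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.cluster_centers_)   # the two inferred cluster centres
print(kmeans.labels_[:5])        # cluster index assigned to each point
        </preformat>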
      </sec>
      <sec id="sec-1-2">
        <title>2.3. Reinforcement learning</title>
        <p>
          Lastly, reinforcement learning substantially differs from
the previous categories because it completely lacks initial
training data [
          <xref ref-type="bibr" rid="ref6">5</xref>
          ]. In this kind of machine learning,
the running program (i.e., the agent) interacts with the
environment through sensors and actuators, with
a certain goal to achieve. The agent receives feedback,
in the form of rewards or penalties, based on the
actions taken in one or more previous time steps.
        </p>
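        <p>The following toy loop illustrates this interaction pattern (purely
illustrative; the one-dimensional environment and the random policy are invented for
the example):</p>
        <preformat>
# A toy agent-environment loop illustrating the reinforcement-learning setting
# described above (illustrative only; no learning actually takes place here).
import random

position, goal = 0, 5          # 1-D world: the agent must reach cell 5

for step in range(20):
    action = random.choice([-1, 1])            # actuator: move left or right
    position += action                          # environment transition
    reward = 1 if position == goal else -0.1    # feedback: reward or small penalty
    print(f"step={step} action={action} position={position} reward={reward}")
    if position == goal:                        # goal achieved, episode ends
        break
        </preformat>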
        <p>In the next sections of this paper we will focus mainly
on supervised learning. Specifically, we will analyze the
most frequent computer vision subtasks and the
techniques commonly used to solve these kinds of problems.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Computer Vision Tasks</title>
      <p>As stated before, computer vision tasks can be further
split into four main categories:
• Image classification
• Object localization
• Object detection
• Object segmentation
In Figure 1 an example of each of these categories is depicted.</p>
      <sec id="sec-2-1">
        <title>3.1. Image classification</title>
        <p>Image classification is probably the most well-known
computer vision task. The main goal is to assign an
input image to one of a set of predefined categories. The
simplest case is binary classification, in which the output
of the model consists of only two possible values: true or
false. An example could be a classifier which, given a
picture, returns whether that picture contains a person or
not. A more complex version of the same classifier could
have more than two categories (e.g., person, cat, dog, car).</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Object localization</title>
        <p>Starting from the previous image classification task, we
could enrich the output of the neural network by adding
information about the location of the object. The
common way to describe the location of an object is to
define a bounding box which encloses the object in the
picture.</p>
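        <p>As a concrete illustration (a sketch, not tied to any specific framework), a
bounding box is typically stored as pixel coordinates, and predicted boxes are
compared with annotated ones through the intersection over union (IoU):</p>
        <preformat>
# A bounding box is commonly stored as (x_min, y_min, x_max, y_max) in pixel
# coordinates; the IoU below is the usual way to compare a predicted box with
# the annotated one (illustrative sketch).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection rectangle
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

print(iou((10, 10, 60, 60), (30, 30, 80, 80)))  # about 0.22
        </preformat>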
      </sec>
      <sec id="sec-2-3">
        <title>3.3. Object detection</title>
        <p>Object localization is limited to one object per image. The
computer vision task whose goal is to localize multiple
objects of different classes in the same picture is called
Object Detection. This task introduces major complexities
compared with the previous one, and the required
effort to scale from Object Localization to Object
Detection can be significant. Some of the problems encountered
can be difficult even for humans. Some objects may be only
partially visible, because they overlap each other or lie
partially outside the frame. Moreover, the sizes of the
objects belonging to the same class can vary noticeably.</p>
      </sec>
      <sec id="sec-2-4">
        <title>3.4. Object segmentation</title>
        <p>In the previous localization and detection tasks, the main
goal is to place a bounding box (and a class label) over
all the objects present in the input image. Segmentation
differs from localization and detection because the output
is no longer a set of bounding boxes. Instead, in segmentation,
the computer vision model tries to annotate every pixel
of the image as belonging (or not) to a specific class from
a set of predefined ones.</p>
        <p>
          Object segmentation can be further divided into two
types: semantic segmentation [
          <xref ref-type="bibr" rid="ref7 ref8">6, 7</xref>
          ] and instance
segmentation [
          <xref ref-type="bibr" rid="ref10 ref11">8, 9</xref>
          ].
        </p>
        <p>The main difference between these two kinds is that
semantic segmentation treats multiple objects belonging
to the same class as a single entity. On the other hand,
instance segmentation treats multiple objects of the same
class as individual instances.</p>
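        <p>A toy sketch of the two output formats (illustrative only, using NumPy arrays)
is the following:</p>
        <preformat>
# Illustration of the output formats discussed above (toy 4x4 image): semantic
# segmentation stores one class id per pixel, while instance segmentation keeps
# one binary mask per object instance.
import numpy as np

# semantic mask: 0 = background, 1 = "person" (two people merge into one class map)
semantic = np.array([[0, 1, 0, 1],
                     [0, 1, 0, 1],
                     [0, 0, 0, 0],
                     [0, 0, 0, 0]])

# instance masks: one separate mask per detected person
instance_1 = semantic.copy(); instance_1[:, 3] = 0   # left person only
instance_2 = semantic.copy(); instance_2[:, 1] = 0   # right person only
print((semantic == 1).sum(), instance_1.sum(), instance_2.sum())  # 4 pixels vs 2 + 2
        </preformat>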
      </sec>
      <sec id="sec-2-5">
        <title>3.5. Object tracking</title>
        <p>Object tracking applies to a sequence of images instead
of a single input; for this reason it has not been
listed at the beginning of this section. The purpose of
object tracking is to follow a moving object over
subsequent frames. This kind of functionality is essential for
robots or autonomous cars. A straightforward approach
to object tracking is to apply object detection
techniques to a video sequence and then compare
the object instances across frames in order to determine
the direction and the speed of the movement. However, it is worth
noting that, in many cases, object tracking does not
need to recognize objects of different classes, but can
simply rely on motion criteria without being aware of
the object classes.</p>
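        <p>A naive version of this frame-to-frame matching idea can be sketched as follows
(illustrative only; the detections are hard-coded, whereas a real system would use
the output of a detector):</p>
        <preformat>
# A naive frame-to-frame tracking sketch along the lines described above:
# detections (boxes) from consecutive frames are matched by proximity of their
# centres, which gives a rough motion vector per object.
def centre(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

frame_t  = [(10, 10, 30, 30), (100, 50, 140, 90)]   # boxes detected at time t
frame_t1 = [(14, 12, 34, 32), (96, 50, 136, 90)]    # boxes detected at time t+1

for box in frame_t:
    cx, cy = centre(box)
    # match each old box to the nearest new box
    nearest = min(frame_t1,
                  key=lambda b: (centre(b)[0] - cx) ** 2 + (centre(b)[1] - cy) ** 2)
    nx, ny = centre(nearest)
    print("motion vector:", (nx - cx, ny - cy))
        </preformat>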
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Techniques</title>
      <sec id="sec-3-1">
        <title>4.1. Object classification</title>
        <p>
          The emergence of large-scale annotated training sets such
as ImageNet [
          <xref ref-type="bibr" rid="ref12">10</xref>
          ] or COCO [11] called for significant
computational power and deeper network architectures. In
the last few years, high-performance parallel
computing systems, such as GPUs, enabled new challenges
in computer vision to be addressed by means of
deep learning. The most representative models of deep
learning applied to computer vision are Convolutional
Neural Networks (i.e., CNNs). The first convolutional
neural network appeared in 1998 with LeNet-5 [12], a
7-layer convolutional neural network developed by Yann
LeCun. LeNet was used to recognize handwritten
digits from the famous MNIST dataset [13], a collection
of 28x28-pixel greyscale images (padded to 32x32 at the
network input). The architecture was fairly simple, mainly
because of the computational power constraints of the time.
        </p>
        <p>In 2012, AlexNet [14] won the ILSVRC [15] (ImageNet
Large Scale Visual Recognition Challenge) 2012
competition, with a similar architecture but with more filters
and layers, thus becoming one of the first deep neural
networks.</p>
        <p>
          The next year, ZFNet [
          <xref ref-type="bibr" rid="ref22">16</xref>
          ] won the ILSVRC, mostly by
tweaking the hyper-parameters of AlexNet while maintaining
the same base structure.
        </p>
        <p>In 2014, VGGNet [17] entered the scene, becoming one
of the reference architectures for object classification. The
first version (i.e., VGG16) had a very uniform architecture,
composed of sixteen weight layers (thirteen 3x3 convolutional
layers and three fully connected layers), with max pooling
operations between the convolutional blocks. The main
drawback of VGG is the number of parameters (i.e., 138 million),
which can be challenging to handle. Nevertheless, VGG is still
one of the preferred architectures for feature extraction from
images.</p>
        <p>
          In 2015, ResNet by Kaiming He et al. [
          <xref ref-type="bibr" rid="ref24">18</xref>
          ] introduced
a novel CNN architecture called Residual Neural
Network. The main difference from previous architectures is the
introduction of skip connections between layers. Such skip
connections made it possible to obtain better training results
with fewer parameters. ResNet obtained a top-5 error
rate of 3.5% on ImageNet, which beats human-level
performance (approximately 5%) on the same dataset.
        </p>
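        <p>A minimal residual block can be sketched as follows (assuming PyTorch is
available; this is an illustration of the skip connection, not the original ResNet
code):</p>
        <preformat>
# A minimal residual block (a sketch assuming PyTorch), showing the skip
# connection that characterises ResNet-style architectures.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)   # skip connection: add the input back

block = ResidualBlock(16)
print(block(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 16, 32, 32])
        </preformat>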
        <p>In 2017, MobileNet [19] was presented as a
solution for mobile and embedded visual applications. This
lightweight network is particularly suited to low-power
systems [20]. The network is very flexible and can be
easily adapted to the specific application by tweaking its
hyper-parameters.</p>
        <p>
          Lastly, in 2019 M. Tan and Q. V. Le [
          <xref ref-type="bibr" rid="ref27">21</xref>
          ] studied
a novel neural network (i.e., EfficientNet) which can be
scaled up as needed in a very efficient way. The main
novelty of this method is that the scaling process
involves not only the depth of the network, but also the
width and the resolution of the input; the authors showed that
this compound method obtains better results with fewer
parameters.
        </p>
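        <p>A small worked example of the compound scaling idea follows (a sketch; the
base coefficients below are the ones reported in the EfficientNet paper and should be
treated as an assumption of this illustration):</p>
        <preformat>
# Compound scaling as described above: depth, width and input resolution are
# scaled together by a single coefficient phi (assumed base coefficients).
alpha, beta, gamma = 1.2, 1.1, 1.15   # depth, width, resolution multipliers

for phi in range(0, 4):
    depth = alpha ** phi
    width = beta ** phi
    resolution = gamma ** phi
    flops_factor = depth * width ** 2 * resolution ** 2   # roughly doubles per step
    print(f"phi={phi}: depth x{depth:.2f}, width x{width:.2f}, "
          f"resolution x{resolution:.2f}, FLOPs x{flops_factor:.2f}")
        </preformat>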
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Object detection</title>
        <p>Deep Neural Networks for Object Detection can be
categorized into two different types:
• Region proposal networks
• Single shot detectors</p>
        <p>
          Historically, the first detectors were based on the
previously described image classification networks. The
basic idea for obtaining object detection relies on a sliding
window approach: a fixed-size rectangular window crops
the image at different positions and a subsequent image
classification network is in charge of predicting the object
class. At each iteration, the window is moved by a stride
value until the whole image has been analyzed (a sketch of
this approach is given after this paragraph). The main
drawback of this method is its low speed, because it is
computationally expensive. An improvement over the sliding
window approach is selective search [
          <xref ref-type="bibr" rid="ref28">22</xref>
          ], a hierarchical
grouping segmentation algorithm that combines
multiple grouping strategies. This algorithm starts with an
initial set of regions and at each iteration merges the
most similar regions together, until the whole image is
represented as a single region. Finally, a set of regions of
interest (ROIs) is selected and fed into an image
classification network. The resulting object detection network
is called Region-based ConvNet (R-CNN) [
          <xref ref-type="bibr" rid="ref30 ref31">23, 24</xref>
          ].
Although selective search noticeably improved the
overall speed of the process, it is still not enough when
speed is a key factor. In 2015 two other improvements
of region proposal based networks were proposed, Fast
R-CNN [
          <xref ref-type="bibr" rid="ref32">25</xref>
          ] and soon after Faster R-CNN [
          <xref ref-type="bibr" rid="ref34">26</xref>
          ]. The main
novelty of these new architectures was the
integration of ROI generation into the neural network itself. In
fact, the previous version of R-CNN used selective search
for ROI extraction as a separate process.
        </p>
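        <p>The sliding-window idea mentioned above can be sketched as follows (assuming
NumPy; the classifier here is a placeholder function, purely illustrative):</p>
        <preformat>
# A sketch of the sliding-window approach described above: a fixed-size window
# is moved across the image by a stride, and an image classifier scores each
# crop (the classifier is a stand-in for a real network).
import numpy as np

def classify_crop(crop):
    # placeholder for a real image-classification network
    return float(crop.mean())          # pretend "score" for the object class

image = np.random.rand(128, 128)
window, stride = 32, 16
detections = []

for y in range(0, image.shape[0] - window + 1, stride):
    for x in range(0, image.shape[1] - window + 1, stride):
        score = classify_crop(image[y:y + window, x:x + window])
        if score > 0.5:                # keep crops the placeholder scores highly
            detections.append((x, y, x + window, y + window, score))

print(len(detections), "candidate boxes")
        </preformat>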
        <p>In the same year, YOLO (You Only Look Once) [27,
28] revolutionized the object detection scene, presenting
an algorithm substantially different from the classical
region proposal networks. A new kind of architecture
started to emerge, called Single Shot Detectors. Instead
of using a ROI extraction phase, single shot detectors
divide the image into a grid, giving each cell the task of
detecting objects in that region. For each grid cell, multiple
predefined boxes (i.e., anchors or priors) are considered.
These boxes have multiple sizes and aspect ratios in order to
detect objects of different shapes. Immediately
after, Single Shot MultiBox Detectors [29] followed the
same approach, obtaining results similar to YOLO in terms
of speed and accuracy.</p>
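        <p>The grid-and-anchor idea can be illustrated with the following toy sketch (the
grid size, anchor shapes and objects are made-up numbers, not taken from any specific
detector):</p>
        <preformat>
# Toy illustration of the grid used by single-shot detectors: the image is split
# into SxS cells and each object is assigned to the cell containing its centre,
# where predefined anchor boxes of different shapes would be refined.
image_size, S = 416, 13                # e.g. a 416x416 input and a 13x13 grid
cell = image_size / S

anchors = [(30, 60), (60, 30), (90, 90)]            # (width, height) priors per cell

objects = [(200, 150, 80, 40), (50, 300, 30, 90)]   # (cx, cy, w, h) ground truths
for cx, cy, w, h in objects:
    col, row = int(cx // cell), int(cy // cell)
    # pick the anchor whose shape is closest to the object (crude match)
    best = min(anchors, key=lambda a: abs(a[0] - w) + abs(a[1] - h))
    print(f"object at ({cx},{cy}): grid cell ({row},{col}), anchor {best}")
        </preformat>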
        <p>Over the years many variations of these architectures
were presented, each one with its particularities and
strengths. Although there are exceptions, nowadays
region proposal based networks are preferred when
accuracy is of main importance and speed is secondary.
Moreover, R-CNNs are considered better in detecting
small objects.</p>
        <p>On the other hand, single shot detectors overtake
R-CNNs in real-time tasks and in edge or mobile computing [30].
The inference time of these networks is lower, at the cost
of reduced accuracy [31].</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusions</title>
      <p>In this paper, a brief review of commonly used deep learning methods has been
presented, emphasizing their application in the field of computer vision. In the last
years, especially using GPU clusters, we obtained the computational power to enable
the design of deeper neural networks [32]. Moreover, the availability of large
datasets such as COCO or ImageNet allowed training accurate models, which can be
adapted to a variety of scenarios.</p>
      <p>With the increasing importance of mobile devices and edge computing, the high
power requirements of the reviewed techniques will inevitably conflict with the low
power resources offered by edge devices. Although cloud computing can help, many
situations, such as rural areas, make internet access problematic, thus invalidating
the remote processing possibility. Moreover, supervised learning, which is the
commonly used method for computer vision tasks, allows obtaining noticeable results
at the cost of long training times. In the future, self-learning methods should be
considered, in order to skip the whole dataset creation step and focus on the
learning phase, as happens for humankind.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] A. Jaber, R. Bicker, Fault diagnosis of industrial robot bearings based on discrete wavelet transform and artificial neural network, International Journal of Prognostics and Health Management 7 (2016) art. no. 017.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[2] G. Capizzi, G. Lo Sciuto, C. Napoli, E. Tramontana, A multithread nested neural network architecture to model surface plasmon polaritons propagation, Micromachines 7 (2016) 110.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[3] F. Fallucchi, M. Petito, E. De Luca, Analysing and visualising open data within the data and analytics framework, Communications in Computer and Information Science 846 (2019) 135-146.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[4] Y. Li, H. Wu, A clustering method based on k-means algorithm, Physics Procedia 25 (2012) 1104-1109.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[5] L. Canese, G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Giardino, M. Re, S. Spanò, Multi-agent reinforcement learning: A review of challenges and applications, Applied Sciences 11 (2021) 4948.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[6] C. Napoli, G. Pappalardo, E. Tramontana, An agent-driven semantical identifier using radial basis neural networks and reinforcement learning, volume 1260, 2014.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[7] A. Venckauskas, A. Karpavicius, R. Damasevicius, R. Marcinkevicius, J. Kapociute-Dzikiene, C. Napoli, Open class authorship attribution of lithuanian internet comments using one-class classifier, 2017, pp. 373-382. doi:10.15439/2017F461.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[8] G. De Magistris, S. Russo, P. Roma, J. Starczewski, C. Napoli, An explainable fake news detector based on named entity recognition and stance classification applied to covid-19, Information (Switzerland) 13 (2022). doi:10.3390/info13030137.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[9] C. Napoli, E. Tramontana, G. Lo Sciuto, M. Woźniak, R. Damaševičius, G. Borowik, Authorship semantical identification using holomorphic chebyshev projectors, 2015, pp. 232-237. doi:10.1109/APCASE.2015.48.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248-255. doi:10.1109/CVPR.2009.5206848.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[11] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: Common objects in context, CoRR abs/1405.0312 (2014). URL: http://arxiv.org/abs/1405.0312. arXiv:1405.0312.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[12] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, in: Proceedings of the IEEE, 1998, pp. 2278-2324.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[13] Y. LeCun, C. Cortes, MNIST handwritten digit database (2010). URL: http://yann.lecun.com/exdb/mnist/.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[14] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, Commun. ACM 60 (2017) 84-90. URL: https://doi.org/10.1145/3065386. doi:10.1145/3065386.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[15] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision (IJCV) 115 (2015) 211-252. doi:10.1007/s11263-015-0816-y.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[16] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, CoRR abs/1311.2901 (2013). URL: http://arxiv.org/abs/1311.2901. arXiv:1311.2901.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[17] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2015. arXiv:1409.1556.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[18] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, 2015. arXiv:1512.03385.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[19] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, MobileNets: Efficient convolutional neural networks for mobile vision applications, 2017. arXiv:1704.04861.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[20] G. M. Bianco, R. Giuliano, G. Marrocco, F. Mazzenga, A. Mejia-Aguilar, LoRa system for search and rescue: Path-loss models and procedures in mountain scenarios, IEEE Internet of Things Journal 8 (2021) 1985-1999.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[21] M. Tan, Q. V. Le, EfficientNet: Rethinking model scaling for convolutional neural networks, CoRR abs/1905.11946 (2019). URL: http://arxiv.org/abs/1905.11946. arXiv:1905.11946.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[22] J. Uijlings, K. van de Sande, T. Gevers, A. Smeulders, Selective search for object recognition, International Journal of Computer Vision (2013). doi:10.1007/s11263-013-0620-5.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[23] R. B. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, CoRR abs/1311.2524 (2013). URL: http://arxiv.org/abs/1311.2524. arXiv:1311.2524.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[24] N. Brandizzi, V. Bianco, G. Castro, S. Russo, A. Wajda, Automatic rgb inference based on facial emotion recognition, volume 3092, 2021, pp. 66-74.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[25] R. B. Girshick, Fast R-CNN, CoRR abs/1504.08083 (2015). URL: http://arxiv.org/abs/1504.08083. arXiv:1504.08083.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[26] S. Ren, K. He, R. B. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, CoRR abs/1506.01497 (2015). URL: http://arxiv.org/abs/1506.01497. arXiv:1506.01497.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[27] J. Redmon, S. K. Divvala, R. B. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, CoRR abs/1506.02640 (2015). URL: http://arxiv.org/abs/1506.02640. arXiv:1506.02640.</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>[28] R. Avanzato, F. Beritelli, M. Russo, S. Russo, M. Vaccaro, Yolov3-based mask and face recognition algorithm for individual protection applications, volume 2768, 2020, pp. 41-45.</mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>[29] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, A. C. Berg, SSD: Single shot multibox detector, CoRR abs/1512.02325 (2015). URL: http://arxiv.org/abs/1512.02325. arXiv:1512.02325.</mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>[30] F. Mazzenga, R. Giuliano, F. Vatalaro, FttC-based fronthaul for 5G dense/ultra-dense access network: Performance and costs in realistic scenarios, Future Internet 9 (2017).</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>[31] A. Simonetta, M. Paoletti, Designing digital circuits in multi-valued logic, International Journal on Advanced Science, Engineering and Information Technology 8 (2018) 1166-1172.</mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>[32] G. Capizzi, F. Bonanno, C. Napoli, Hybrid neural networks architectures for SOC and voltage prediction of new generation batteries storage, in: 2011 International Conference on Clean Electrical Power (ICCEP), IEEE, 2011, pp. 341-344.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>