Machine Learning Methods for Computer Vision
Eros Innocenti (1), Alessandro Vizzarri (2)
(1) Department of Engineering Science, Guglielmo Marconi University, Italy
(2) Department of Enterprise Engineering, University of Rome Tor Vergata, Italy


Abstract
Over the last few years, deep learning methods have proved to outperform earlier machine
learning techniques, especially in computationally intensive tasks such as computer vision.
This review paper aims to provide a preliminary overview of the machine learning tasks in
which computer vision is involved. Furthermore, a brief review of the history and
state-of-the-art techniques is presented for the fields of image classification and object
detection.

Keywords
Machine Learning, Computer Vision, Artificial Intelligence, Deep Learning


ICYRIME 2021 @ International Conference of Yearly Reports on Informatics Mathematics and
Engineering, online, July 9, 2021
Email: eros@newtechweb.it (E. Innocenti); alessandro.vizzarri@uniroma2.it (A. Vizzarri)
ORCID: 0000-0002-7793-4974 (E. Innocenti); 0000-0002-6274-991X (A. Vizzarri)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073, pp. 85–89

1. Introduction

Nowadays, computer vision (CV) is one of the most studied subfields of artificial intelligence
and machine learning. Its applications are many and varied, ranging from industrial
applications and manufacturing [1] to healthcare and autonomous vehicles. The main goal of CV
is to replicate the capabilities of human vision. Although this kind of task appears fairly
simple for our brain, there is a lot of information processing under the hood. Over the years,
the field of computer vision has been shifting from a statistical approach, based on
hand-crafted methods, to deep learning neural networks. This change of perspective is driven
not only by an increasing demand for performance [2]. In fact, deep learning models have proved
that they can learn semantic representations of images, thus adapting better to different
scenarios without requiring human intervention [3]. In this paper we briefly review the
problems that CV can solve and the state-of-the-art technologies developed in the last few
years of research. In Section 2 we illustrate how machine learning problems are categorized
into different tasks, each one with different goals. Section 3 presents the subtasks
specifically related to computer vision; subsequently, in Section 4, some commonly used image
classification and object detection techniques are described. Finally, in Section 5, an
overview of future directions is presented, including some of the open challenges of the
coming years.
   Machine learning includes an extensive set of tasks, which can be classified into three
broad categories: Supervised Learning, Unsupervised Learning and Reinforcement Learning. In
the next subsections we will briefly describe these three categories.

2. Machine Learning Tasks

2.1. Supervised learning

In supervised learning the goal is to infer a function starting from a collection of labeled
training data. The training data typically consist of a set of image examples annotated with
extra information such as the image class or the position of the depicted object(s). In most
cases the training set is annotated by hand, but semi-supervised approaches are available too.
This possibility is useful if the training set is small and it is difficult or even impossible
to obtain more samples. Moreover, image augmentation techniques (e.g., horizontal and vertical
flip, shear, brightness and contrast variations) can be used to artificially increase the
training set size, thus achieving better training performance.
   The steps required to train a computer vision model using supervised learning can be
summarized as follows:

    1. Decide which kind of training examples accurately represents the problem.
    2. Collect a sufficient number of examples. In the case of many classes, make sure to
       balance the number of examples across all of them.
    3. Decide on an input feature vector that is descriptive for the selected task. The
       number of features should not be too large, in order to avoid overfitting.
    4. Decide the structure of the learning function and pick a loss function to be
       minimized during the training phase.
    5. Run the model on the training set, iteratively optimizing its parameters until the
       target metric (e.g., loss, accuracy, average precision) reaches the target value.
    6. Evaluate the trained model on a test set. In order to obtain an unbiased evaluation
       of the model, it is important that the test set is composed only of unseen examples.

2.2. Unsupervised learning

Unsupervised learning, unlike supervised learning, does not need a labeled training set.
Instead, the goal is to infer a function which describes the underlying structure of
unlabeled data. It is worth noting that, since the examples are not annotated, it is not
possible to evaluate the performance of the model using the methods applied in supervised
learning. Unsupervised learning is used in many situations, such as dimensionality reduction,
cluster analysis and data compression. One popular example of unsupervised learning is the
k-means clustering algorithm [4].

2.3. Reinforcement learning

Lastly, reinforcement learning differs substantially from the previous approaches because it
lacks initial training data completely [5]. In this kind of machine learning, the running
program (i.e., the agent) interacts with the environment through sensors and actuators, with
a certain goal to achieve. The agent is provided with feedback, in the form of rewards or
penalties, based on the actions taken in one or more previous time steps.
   In the next sections of this paper we will focus mainly on supervised learning.
Specifically, we will analyze the most frequent computer vision subtasks and the techniques
commonly used to solve this kind of problem.

3. Computer Vision Tasks

As stated before, in computer vision we can further split these tasks, mainly into four
categories:

      • Image classification
      • Object localization
      • Object detection
      • Object segmentation

  Figure 1 depicts an example of these categories.

Figure 1: Computer vision tasks (panels: Image classification, Object Localization, Object
Detection, Object Segmentation; example labels: Apple, Apple, Apple Pear, Apple Pear)

3.1. Image classification

Image classification is probably the most well-known computer vision task. The main goal is
to assign an input image to one of a set of predefined categories. The simplest case is
binary classification, in which the output of the model consists of only two possible values:
true or false. An example could be a classifier which, given a picture, returns whether that
picture contains a person or not. A more complex version of the same classifier could have
more than two categories (e.g., person, cat, dog, car).

3.2. Object localization

Starting from the previous image classification task, we could improve the output of the
neural network by adding information about the location of the object. The common way to
describe the location of an object is to define a bounding box which encloses the object in
the picture.

3.3. Object detection

Object localization is limited to one object per image. The computer vision task whose goal
is to localize multiple objects of different classes in the same picture is called Object
Detection. This task introduces major complexities compared with the previous one, and the
effort required to scale from Object Localization to Object Detection can be significant.
Some of the problems encountered can be difficult even for humans. Some objects could be only
partially visible, because they overlap each other or lie partially outside the frame.
Moreover, the sizes of objects belonging to the same class could vary noticeably.

3.4. Object segmentation

In the previous localization and detection tasks, the main goal is to place a bounding box
(and a class label) over all the objects present in the input image. Segmentation differs
from localization and detection because the output is no longer a set of bounding boxes.
Instead, in segmentation, the computer vision model tries to annotate every pixel of the
image, indicating whether it is part of a specific class from a set of predefined ones.
   Object segmentation can be further divided into two types: semantic segmentation [6, 7]
and instance segmentation [8, 9].
   The main difference between these two kinds is that semantic segmentation treats multiple
objects belonging to the same class as a single entity. On the other hand, instance
segmentation treats multiple objects of the same class as individual instances.
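As a minimal illustration of the two segmentation types described in Section 3.4, the sketch
below builds a tiny hand-made instance mask and collapses it into a semantic mask. The 4x6 toy
image, the class names and the helper functions (`to_semantic`, `count_objects`,
`count_classes`) are assumptions made only for this example; a real segmentation network
would predict such masks from the pixels.

```python
# Toy 4x6 "image": each pixel stores an instance id (0 = background).
# Hypothetical layout: two apples (ids 1 and 2) and one pear (id 3).
INSTANCE_MASK = [
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
    [0, 3, 3, 3, 0, 0],
]

# Class of every instance id (an assumption of this example).
INSTANCE_CLASS = {0: "background", 1: "apple", 2: "apple", 3: "pear"}


def to_semantic(instance_mask, instance_class):
    """Replace each instance id with its class label.

    Objects of the same class become indistinguishable, which is
    what a semantic segmentation output looks like.
    """
    return [[instance_class[i] for i in row] for row in instance_mask]


def count_objects(instance_mask):
    """Number of distinct non-background instances."""
    return len({i for row in instance_mask for i in row if i != 0})


def count_classes(semantic_mask):
    """Number of distinct non-background classes."""
    return len({c for row in semantic_mask for c in row if c != "background"})


semantic_mask = to_semantic(INSTANCE_MASK, INSTANCE_CLASS)
print(count_objects(INSTANCE_MASK))  # instance view: 3 separate objects
print(count_classes(semantic_mask))  # semantic view: 2 classes
```

In the instance mask the two apples keep separate ids, while in the semantic mask their pixels
carry the same label and can no longer be told apart.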


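The six supervised learning steps of Section 2.1 can be sketched in a few lines of code. The
example below is only an illustration under strong assumptions: a synthetic two-feature
dataset stands in for real annotated images, and a single logistic unit trained by gradient
descent stands in for a neural network. All names (`make_examples`, `train`, `accuracy`), the
learning rate and the target loss are hypothetical choices, not part of the reviewed methods.

```python
import math
import random

random.seed(0)  # reproducible toy data


def make_examples(n_per_class):
    """Steps 1-3: balanced, labeled examples with a 2-value feature vector.

    The two features stand in for hand-picked image descriptors
    (hypothetical, e.g. mean brightness and edge density).
    """
    data = []
    for label, center in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            features = [random.gauss(center, 0.5), random.gauss(center, 0.5)]
            data.append((features, label))
    random.shuffle(data)
    return data


def predict(weights, bias, features):
    """Step 4: the learning function, here a single logistic unit."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weights, bias, data):
    """Step 4: binary cross-entropy, the loss to be minimized."""
    eps = 1e-12
    total = 0.0
    for features, label in data:
        p = predict(weights, bias, features)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total / len(data)


def train(data, target_loss=0.1, lr=0.5, max_epochs=1000):
    """Step 5: gradient descent until the target loss (or the epoch limit)."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for features, label in data:
            error = predict(weights, bias, features) - label
            grad_w = [g + error * x for g, x in zip(grad_w, features)]
            grad_b += error
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(data)
        if log_loss(weights, bias, data) <= target_loss:
            break
    return weights, bias


def accuracy(weights, bias, data):
    """Step 6: fraction of correctly classified examples."""
    hits = sum((predict(weights, bias, f) >= 0.5) == bool(y) for f, y in data)
    return hits / len(data)


examples = make_examples(n_per_class=100)
train_set, test_set = examples[:160], examples[160:]  # test set stays unseen
weights, bias = train(train_set)
print(f"train loss: {log_loss(weights, bias, train_set):.3f}")
print(f"test accuracy: {accuracy(weights, bias, test_set):.2f}")
```

The test examples are never passed to `train`, which mirrors the requirement in step 6 that
the evaluation uses only unseen data.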


3.5. Object tracking

Object tracking applies to a sequence of images instead of a single input; for this reason it
was not listed at the beginning of this section. The purpose of object tracking is to track a
moving object over subsequent frames. This kind of functionality is essential for robots and
autonomous cars. A straightforward approach to object tracking is to apply object detection
techniques to each frame of a video and then compare the object instances across frames in
order to determine the direction and the speed of the movement. However, it is worth noting
that, in many cases, object tracking does not need to recognize objects of different classes,
but can simply rely on motion criteria without being aware of the object classes.

4. Techniques

4.1. Object classification

The emergence of large scale annotated training sets such as ImageNet [10] or COCO [11]
required significant computational power and deeper network architectures. In the last few
years, high performance parallel computational systems, such as GPUs, enabled new challenges
in computer vision to be addressed by means of deep learning. The most representative models
of deep learning applied to computer vision are Convolutional Neural Networks (CNNs). The
first convolutional neural network appeared in 1998 with LeNet-5 [12], a 7-layer
convolutional neural network developed by Yann LeCun. LeNet was used to recognize
hand-written digits from the famous MNIST dataset [13], a collection of 28x28 pixel greyscale
images (padded to 32x32 pixels at the LeNet-5 input). The architecture was fairly simple,
mainly because of the computational power constraints of the time.
   In 2012, AlexNet [14] won the 2012 edition of the ILSVRC (ImageNet Large Scale Visual
Recognition Challenge) [15] with a similar architecture but with more filters and layers,
thus becoming one of the first deep neural networks.
   The next year, ZFNet [16] won the ILSVRC mostly by tweaking the hyper-parameters of
AlexNet, while maintaining the same base structure.
   In 2014 VGGNet [17] entered the scene, becoming one of the reference architectures for
object classification. The first version (i.e., VGG16) had a very uniform architecture,
composed of sixteen weight layers (thirteen 3x3 convolutional layers and three fully
connected layers) with max pooling operations in between. The main drawback of VGG is the
number of parameters (i.e., 138 million), which can be challenging to handle. Nevertheless,
VGG is still one of the preferred architectures for feature extraction from images.
   In 2015, ResNet by Kaiming He et al. [18] introduced a novel CNN architecture called
Residual Neural Network. The main difference from the previous architectures is the
introduction of skip connections between layers. Such skip connections made it possible to
obtain better training results with fewer parameters. ResNet obtained a top-5 error rate of
3.57% on ImageNet, which beats human-level performance (approximately 5%) on the same
dataset.
   In 2017, MobileNet [19] was presented as a solution for mobile and embedded vision
applications. This lightweight network is particularly suited for low-power systems [20]. The
network is very flexible and can be easily adapted to the specific application by tweaking
its hyper-parameters.
   Lastly, in 2019 Mingxing Tan and Quoc V. Le [21] proposed a novel neural network (i.e.,
EfficientNet) which can be scaled up as needed in a very efficient way. The main novelty of
this method is that the scaling process involves not only the depth of the network, but also
its width and the resolution of the input, thus showing that this compound scaling method
obtains better results with fewer parameters.

4.2. Object detection

Deep Neural Networks for Object Detection can be categorized into two different types:

      • Region proposal networks
      • Single shot detectors

   Historically, the first detectors were based on the previously described image
classification networks. The basic idea to obtain object detection is based on a sliding
window approach. Essentially, a fixed-size rectangular window crops the image at different
positions and a subsequent image classification network is in charge of predicting the object
class. At each iteration, the window is moved by a stride value until the whole image is
analyzed. The main drawback of this method is its low speed, because it is computationally
expensive. An improvement over the sliding window approach is selective search [22], which
consists of a hierarchical grouping segmentation algorithm that combines multiple grouping
strategies. This algorithm starts with an initial set of regions and at each iteration merges
the most similar regions together, until the whole image is represented as a single region.
Finally, a set of regions of interest (ROIs) is selected and fed into an image classification
network. The resulting object detection network is called Region-based ConvNet (R-CNN)
[23, 24]. Although selective search noticeably improved the overall speed of the process, it
is still not enough when speed is a key factor. In 2015 two further improvements of region
proposal based networks were proposed, Fast R-CNN [25] and soon after Faster R-CNN [26]. The
main
novelty about these new architectures was the integration of ROI generation into the neural
network itself. In fact, the previous version of R-CNN used selective search for ROI
extraction as a separate process.
   In the same year, YOLO (You Only Look Once) [27, 28] revolutionized the object detection
scene by presenting an algorithm substantially different from the classical region proposal
networks. A new kind of architecture started to emerge, called Single Shot Detectors. Instead
of using a ROI extraction phase, single shot detectors divide the image into a grid, giving
each cell the task of detecting objects in that region. For each grid cell, multiple
predefined boxes (i.e., anchors or priors) are considered. These boxes have multiple sizes
and aspect ratios in order to be able to detect objects of different shapes. Immediately
after, the Single Shot MultiBox Detector (SSD) [29] followed the same approach, obtaining
results similar to YOLO in terms of speed and accuracy.
   Over the years many variations of these architectures were presented, each one with its
own particularities and strengths. Although there are exceptions, nowadays region proposal
based networks are preferred when accuracy is of main importance and speed is secondary.
Moreover, R-CNNs are considered better at detecting small objects.
   On the other hand, single shot detectors outperform R-CNNs in real-time tasks and in edge
or mobile computing [30]. The inference time of these networks is lower, at the cost of lower
accuracy [31].

5. Conclusions

In this paper, a brief review of commonly used deep learning methods has been presented,
emphasizing their application in the field of computer vision. In the last years, especially
thanks to GPU clusters, we obtained the computational power to enable the design of deeper
neural networks [32]. Moreover, the availability of large datasets such as COCO or ImageNet
allowed training accurate models, which can be adapted to a variety of scenarios. With the
increasing importance of mobile devices and edge computing, the high power requirements of
the reviewed techniques will inevitably conflict with the limited power resources offered by
edge devices. Although cloud computing can help, in many situations, such as rural areas,
internet access is problematic, thus ruling out remote processing. Moreover, supervised
learning, which is the most commonly used method for computer vision tasks, achieves notable
results at the cost of long training times. In the future, self-learning methods should be
considered, in order to skip the whole dataset creation and focus on the learning phase, as
happens for humankind.

References

 [1] A. Jaber, R. Bicker, Fault diagnosis of industrial robot bearings based on discrete
     wavelet transform and artificial neural network, International Journal of Prognostics
     and Health Management 7 (2016) art. no. 017.
 [2] G. Capizzi, G. Lo Sciuto, C. Napoli, E. Tramontana, A multithread nested neural network
     architecture to model surface plasmon polaritons propagation, Micromachines 7 (2016) 110.
 [3] F. Fallucchi, M. Petito, E. De Luca, Analysing and Visualising Open Data Within the Data
     and Analytics Framework, Communications in Computer and Information Science 846 (2019)
     pp. 135–146.
 [4] Y. Li, H. Wu, A clustering method based on k-means algorithm, Physics Procedia 25 (2012)
     1104–1109.
 [5] L. Canese, G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Giardino, M. Re, S. Spanò,
     Multi-agent reinforcement learning: A review of challenges and applications, Applied
     Sciences 11 (2021) 4948.
 [6] C. Napoli, G. Pappalardo, E. Tramontana, An agent-driven semantical identifier using
     radial basis neural networks and reinforcement learning, volume 1260, 2014.
 [7] A. Venckauskas, A. Karpavicius, R. Damasevicius, R. Marcinkevicius, J. Kapociute-Dzikiene,
     C. Napoli, Open class authorship attribution of lithuanian internet comments using
     one-class classifier, 2017, pp. 373–382. doi:10.15439/2017F461.
 [8] G. De Magistris, S. Russo, P. Roma, J. Starczewski, C. Napoli, An explainable fake news
     detector based on named entity recognition and stance classification applied to
     covid-19, Information (Switzerland) 13 (2022). doi:10.3390/info13030137.
 [9] C. Napoli, E. Tramontana, G. Lo Sciuto, M. Woźniak, R. Damaševičius, G. Borowik,
     Authorship semantical identification using holomorphic chebyshev projectors, 2015,
     pp. 232–237. doi:10.1109/APCASE.2015.48.
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale
     hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern
     Recognition, 2009, pp. 248–255. doi:10.1109/CVPR.2009.5206848.
[11] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona,
     D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: common objects in context, CoRR
     abs/1405.0312 (2014). URL: http://arxiv.org/abs/1405.0312. arXiv:1405.0312.
[12] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document
     recognition, in: Proceedings of the IEEE, 1998, pp. 2278–2324.
[13] Y. LeCun, C. Cortes, MNIST handwritten digit database (2010). URL:
     http://yann.lecun.com/exdb/mnist/.
[14] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep
     convolutional neural networks, Commun. ACM 60 (2017) 84–90. URL:
     https://doi.org/10.1145/3065386. doi:10.1145/3065386.
[15] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
     A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition
     Challenge, International Journal of Computer Vision (IJCV) 115 (2015) 211–252.
     doi:10.1007/s11263-015-0816-y.
[16] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, CoRR
     abs/1311.2901 (2013). URL: http://arxiv.org/abs/1311.2901. arXiv:1311.2901.
[17] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image
     recognition, 2015. arXiv:1409.1556.
[18] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, 2015.
     arXiv:1512.03385.
[19] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto,
     H. Adam, Mobilenets: Efficient convolutional neural networks for mobile vision
     applications, 2017. arXiv:1704.04861.
[20] G. M. Bianco, R. Giuliano, G. Marrocco, F. Mazzenga, A. Mejia-Aguilar, LoRa System for
     Search and Rescue: Path-Loss Models and Procedures in Mountain Scenarios, IEEE Internet
     of Things Journal 8 (2021) pp. 1985–1999.
[21] M. Tan, Q. V. Le, Efficientnet: Rethinking model scaling for convolutional neural
     networks, CoRR abs/1905.11946 (2019). URL: http://arxiv.org/abs/1905.11946.
     arXiv:1905.11946.
[22] J. Uijlings, K. van de Sande, T. Gevers, A. Smeulders, Selective search for object
     recognition, International Journal of Computer Vision (2013). URL:
     http://www.huppelen.nl/publications/selectiveSearchDraft.pdf.
     doi:10.1007/s11263-013-0620-5.
[23] R. B. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate
     object detection and semantic segmentation, CoRR abs/1311.2524 (2013). URL:
     http://arxiv.org/abs/1311.2524. arXiv:1311.2524.
[24] N. Brandizzi, V. Bianco, G. Castro, S. Russo, A. Wajda, Automatic rgb inference based on
     facial emotion recognition, volume 3092, 2021, pp. 66–74.
[25] R. B. Girshick, Fast R-CNN, CoRR abs/1504.08083 (2015). URL:
     http://arxiv.org/abs/1504.08083. arXiv:1504.08083.
[26] S. Ren, K. He, R. B. Girshick, J. Sun, Faster R-CNN: towards real-time object detection
     with region proposal networks, CoRR abs/1506.01497 (2015). URL:
     http://arxiv.org/abs/1506.01497. arXiv:1506.01497.
[27] J. Redmon, S. K. Divvala, R. B. Girshick, A. Farhadi, You only look once: Unified,
     real-time object detection, CoRR abs/1506.02640 (2015). URL:
     http://arxiv.org/abs/1506.02640. arXiv:1506.02640.
[28] R. Avanzato, F. Beritelli, M. Russo, S. Russo, M. Vaccaro, Yolov3-based mask and face
     recognition algorithm for individual protection applications, volume 2768, 2020,
     pp. 41–45.
[29] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, A. C. Berg, SSD: single
     shot multibox detector, CoRR abs/1512.02325 (2015). URL:
     http://arxiv.org/abs/1512.02325. arXiv:1512.02325.
[30] F. Mazzenga, R. Giuliano, F. Vatalaro, FttC-based fronthaul for 5G dense/ultra-dense
     access network: Performance and costs in realistic scenarios, Future Internet 9 (2017).
[31] A. Simonetta, M. Paoletti, Designing digital circuits in multi-valued logic,
     International Journal on Advanced Science, Engineering and Information Technology 8
     (2018) pp. 1166–1172.
[32] G. Capizzi, F. Bonanno, C. Napoli, Hybrid neural networks architectures for soc and
     voltage prediction of new generation batteries storage, in: 2011 International
     Conference on Clean Electrical Power (ICCEP), IEEE, 2011, pp. 341–344.