<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Conference of Yearly Reports on
Informatics Mathematics and Engineering, online, July</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Machine Learning Methods for Computer Vision</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eros Innocenti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Vizzarri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Engineering Science, Guglielmo Marconi University</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Enterprise Engineering, University of Rome Tor Vergata</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>9</volume>
      <issue>2021</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Over the last few years, deep learning methods have proved to outperform previous machine learning techniques, especially in computationally demanding tasks such as computer vision. This review paper aims to provide a preliminary overview of the machine learning tasks in which computer vision is involved. Furthermore, a brief review of their history and of the state-of-the-art techniques is presented for the fields of image classification and object detection.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Learning</kwd>
        <kwd>Computer Vision</kwd>
        <kwd>Artificial Intelligence</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Nowadays, computer vision is one of the most studied subfields of artificial
intelligence and machine learning. Its applications are many and varied, ranging from
industry and manufacturing [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to healthcare and autonomous vehicles. The main goal of CV is to replicate
the capabilities of human vision. Although this kind of task appears fairly simple
to our brain, there is a lot of information processing under the hood. Over the years,
the field of computer vision has been shifting from a statistical approach, based on
hand-crafted methods, to deep learning neural networks. This change of perspective is
driven not only by an increasing performance demand [
        <xref ref-type="bibr" rid="ref3">2</xref>
        ]. In fact, deep learning models have proved that they can learn semantic
representations of images, thus adapting better to different scenarios without
requiring human intervention [
        <xref ref-type="bibr" rid="ref4">3</xref>
        ]. In this paper we present a brief review of the problems that CV can solve
and of the state-of-the-art technologies developed in the last few years of research.
In Section 2 we illustrate how machine learning problems are categorized into
different tasks, each one with different goals. Section 3 presents the subtasks
specifically related to computer vision; subsequently, in Section 4 some of the most
used object detection techniques are described. Eventually, in Section 5 an overview
of future directions is presented, together with some of the open challenges for the
coming years.
      </p>
    </sec>
    <sec id="sec-1-0">
      <title>2. Machine Learning Tasks</title>
      <p>Machine learning includes an extensive set of tasks, which can be classified
into three broad categories: Supervised Learning, Unsupervised Learning and
Reinforcement Learning. In the next subsections we briefly describe these three
categories.</p>
      <sec id="sec-1-0-1">
        <title>2.1. Supervised learning</title>
        <p>In supervised learning the goal is to infer a function starting from a
collection of labeled training data. The training data typically consist of a set of
image examples annotated with extra information such as the image class or the
position of the depicted object(s). The annotation is in most cases hand-made, but
semi-supervised approaches are available too. This possibility is useful if the
training set size is small and it is difficult or even impossible to obtain more
samples. Moreover, image augmentation techniques (e.g., horizontal and vertical flip,
shear, brightness and contrast variations) can be used to artificially increase the
training set size, thus achieving better training performances.</p>
        <p>The steps required to train a computer vision model using supervised learning
can be summarized as follows (a minimal sketch is given after the list):
1. Decide the kind of training examples which accurately represent the problem.
2. Collect a sufficient number of examples. In the case of many classes, make sure to
balance the number of examples across all of them.
3. Decide an input feature vector which is descriptive for the selected task. The
number of features should not be too large, in order to avoid overfitting.
4. Decide the learning function structure and pick a loss function which has to be
minimized during the training phase.
5. Run the model on the training set, iteratively optimizing its parameters until the
target metric (e.g., loss, accuracy, average precision) reaches the target value.
6. Evaluate the trained model on a test set. In order to obtain an unbiased
evaluation of the model, it is important that the test set is composed only of unseen
examples.</p>
      </sec>
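      <p>As an illustration of the steps above, the following minimal sketch (assuming
scikit-learn is available, and using its small digits dataset purely as a stand-in
for an annotated image collection) trains and evaluates a simple classifier:</p>
      <preformat>
# Minimal illustration of the supervised-learning steps above (a sketch, not a
# production pipeline); scikit-learn is assumed to be installed.
from sklearn.datasets import load_digits          # step 2: labeled examples
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

digits = load_digits()                            # 8x8 greyscale digit images
X = digits.images.reshape(len(digits.images), -1) # step 3: flatten to feature vectors
y = digits.target

# step 6 preparation: hold out unseen examples for an unbiased evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# steps 4-5: a linear model trained by minimising the cross-entropy loss
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))  # step 6: metric on the test set
      </preformat>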
      <sec id="sec-1-1">
        <title>2.2. Unsupervised learning</title>
        <p>
          Unsupervised learning, unlike the supervised one, does
not need a labeled training set. Instead, the goal is to
infer a function which describes the underlying structure
from unlabeled data. It is worth noting that since the
examples are not annotated, it is not possible to
evaluate the performance of the model using the methods
applied in supervised learning. Unsupervised learning is
used in many situations, such as dimensionality reduction,
clustering, and data compression. One popular example
of unsupervised learning is the k-means
clustering algorithm [
          <xref ref-type="bibr" rid="ref5">4</xref>
          ].
        </p>
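        <p>As an example of this kind of task, the following minimal sketch (assuming
NumPy and scikit-learn are available) clusters a small set of unlabeled 2-D points
with k-means:</p>
        <preformat>
# A minimal k-means clustering example (a sketch assuming scikit-learn is
# available); the data are unlabeled points and the algorithm infers the groups.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two unlabeled "blobs" of 2-D points
points = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
                    rng.normal(3.0, 0.5, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.cluster_centers_)   # the two inferred cluster centres
print(kmeans.labels_[:5])        # cluster index assigned to each point
        </preformat>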
      </sec>
      <sec id="sec-1-2">
        <title>2.3. Reinforcement learning</title>
        <p>
          Lastly, reinforcement learning substantially differs from
the previous categories because it completely lacks initial
training data [
          <xref ref-type="bibr" rid="ref6">5</xref>
          ]. In this kind of machine learning,
the running program (i.e., the agent) interacts with the
environment through sensors and actuators, with
a certain goal to achieve. The agent receives feedback,
in the form of rewards or penalties, based on the
actions taken in one or more previous time steps.
        </p>
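        <p>The following toy loop illustrates this interaction pattern (purely
illustrative; the one-dimensional environment and the random policy are invented for
the example):</p>
        <preformat>
# A toy agent-environment loop illustrating the reinforcement-learning setting
# described above (illustrative only; no learning actually takes place here).
import random

position, goal = 0, 5          # 1-D world: the agent must reach cell 5

for step in range(20):
    action = random.choice([-1, 1])            # actuator: move left or right
    position += action                          # environment transition
    reward = 1 if position == goal else -0.1    # feedback: reward or small penalty
    print(f"step={step} action={action} position={position} reward={reward}")
    if position == goal:                        # goal achieved, episode ends
        break
        </preformat>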
        <p>In the next sections of this paper we will focus mainly
on supervised learning. Specifically, we will analyze the
most frequent computer vision subtasks and the
techniques commonly used to solve these kinds of problems.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Computer Vision Tasks</title>
      <p>As stated before, computer vision tasks can be further
split into four main categories:
• Image classification
• Object localization
• Object detection
• Object segmentation
In Figure 1 an example of each of these categories is depicted.</p>
      <sec id="sec-2-1">
        <title>3.1. Image classification</title>
        <p>Image classification is probably the most well-known
computer vision task. The main goal is to assign an
input image to one of a set of predefined categories. The
simplest case is binary classification, in which the output
of the model consists of only two possible values: true or
false. An example could be a classifier which, given a
picture, returns whether that picture contains a person or
not. A more complex version of the same classifier could
have more than two categories (e.g., person, cat, dog, car).</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Object localization</title>
        <p>Starting from the previous image classification task, we
could enrich the output of the neural network by adding
information about the location of the object. The
common way to describe the location of an object is to
define a bounding box which encloses the object in the
picture.</p>
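        <p>As a concrete illustration (a sketch, not tied to any specific framework), a
bounding box is typically stored as pixel coordinates, and predicted boxes are
compared with annotated ones through the intersection over union (IoU):</p>
        <preformat>
# A bounding box is commonly stored as (x_min, y_min, x_max, y_max) in pixel
# coordinates; the IoU below is the usual way to compare a predicted box with
# the annotated one (illustrative sketch).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection rectangle
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

print(iou((10, 10, 60, 60), (30, 30, 80, 80)))  # about 0.22
        </preformat>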
      </sec>
      <sec id="sec-2-3">
        <title>3.3. Object detection</title>
        <p>Object localization is limited to one object per image. The
computer vision task whose goal is to localize multiple
objects of different classes in the same picture is called
Object Detection. This task introduces major complexities
compared with the previous one, and the required
effort to scale from Object Localization to Object
Detection can be significant. Some of the problems encountered
can be difficult even for humans. Some objects may be only
partially visible, because they overlap each other or lie
partially outside the frame. Moreover, the sizes of the
objects belonging to the same class can vary noticeably.</p>
      </sec>
      <sec id="sec-2-4">
        <title>3.4. Object segmentation</title>
        <p>In the previous localization and detection tasks, the main
goal is to place a bounding box (and a class label) over
all the objects present in the input image. Segmentation
differs from localization and detection because the output
is no longer a set of bounding boxes. Instead, in segmentation,
the computer vision model tries to annotate every pixel
of the image as belonging (or not) to a specific class from
a set of predefined ones.</p>
        <p>
          Object segmentation can be further divided into two
types: semantic segmentation [
          <xref ref-type="bibr" rid="ref7 ref8">6, 7</xref>
          ] and instance
segmentation [
          <xref ref-type="bibr" rid="ref10 ref11">8, 9</xref>
          ].
        </p>
        <p>The main difference between these two kinds is that
semantic segmentation treats multiple objects belonging
to the same class as a single entity. On the other hand,
instance segmentation treats multiple objects of the same
class as individual instances.</p>
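        <p>A toy sketch of the two output formats (illustrative only, using NumPy arrays)
is the following:</p>
        <preformat>
# Illustration of the output formats discussed above (toy 4x4 image): semantic
# segmentation stores one class id per pixel, while instance segmentation keeps
# one binary mask per object instance.
import numpy as np

# semantic mask: 0 = background, 1 = "person" (two people merge into one class map)
semantic = np.array([[0, 1, 0, 1],
                     [0, 1, 0, 1],
                     [0, 0, 0, 0],
                     [0, 0, 0, 0]])

# instance masks: one separate mask per detected person
instance_1 = semantic.copy(); instance_1[:, 3] = 0   # left person only
instance_2 = semantic.copy(); instance_2[:, 1] = 0   # right person only
print((semantic == 1).sum(), instance_1.sum(), instance_2.sum())  # 4 pixels vs 2 + 2
        </preformat>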
      </sec>
      <sec id="sec-2-5">
        <title>3.5. Object tracking</title>
        <p>Object tracking applies to a sequence of images instead
of a single input; for this reason it has not been
listed at the beginning of this section. The purpose of
object tracking is to follow a moving object over
subsequent frames. This kind of functionality is essential for
robots or autonomous cars. A straightforward approach
to object tracking is to apply object detection
techniques to a video sequence and then compare
the object instances across frames in order to determine
the direction and the speed of the movement. However, it is worth
noting that, in many cases, object tracking does not
need to recognize objects of different classes, but can
simply rely on motion criteria without being aware of
the object classes.</p>
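        <p>A naive version of this frame-to-frame matching idea can be sketched as follows
(illustrative only; the detections are hard-coded, whereas a real system would use
the output of a detector):</p>
        <preformat>
# A naive frame-to-frame tracking sketch along the lines described above:
# detections (boxes) from consecutive frames are matched by proximity of their
# centres, which gives a rough motion vector per object.
def centre(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

frame_t  = [(10, 10, 30, 30), (100, 50, 140, 90)]   # boxes detected at time t
frame_t1 = [(14, 12, 34, 32), (96, 50, 136, 90)]    # boxes detected at time t+1

for box in frame_t:
    cx, cy = centre(box)
    # match each old box to the nearest new box
    nearest = min(frame_t1,
                  key=lambda b: (centre(b)[0] - cx) ** 2 + (centre(b)[1] - cy) ** 2)
    nx, ny = centre(nearest)
    print("motion vector:", (nx - cx, ny - cy))
        </preformat>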
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Techniques</title>
      <sec id="sec-3-1">
        <title>4.1. Object classification</title>
        <p>
          The emergence of large-scale annotated training sets such
as ImageNet [
          <xref ref-type="bibr" rid="ref12">10</xref>
          ] or COCO [11] called for significant
computational power and deeper network architectures. In
the last few years, high-performance parallel
computing systems, such as GPUs, enabled new challenges
in computer vision to be addressed by means of
deep learning. The most representative models of deep
learning applied to computer vision are Convolutional
Neural Networks (i.e., CNNs). The first convolutional
neural network appeared in 1998 with LeNet-5 [12], a
7-layer convolutional neural network developed by Yann
LeCun. LeNet was used to recognize handwritten
digits from the famous MNIST dataset [13], a collection
of 28x28-pixel greyscale images (padded to 32x32 at the
network input). The architecture was fairly simple, mainly
because of the computational power constraints of the time.
        </p>
        <p>In 2012, AlexNet [14] won the ILSVRC [15] (ImageNet
Large Scale Visual Recognition Challenge) 2012
competition, with a similar architecture but with more filters
and layers, thus becoming one of the first deep neural
networks.</p>
        <p>
          The next year, ZFNet [
          <xref ref-type="bibr" rid="ref22">16</xref>
          ] won the ILSVRC, mostly by
tweaking the hyper-parameters of AlexNet while maintaining
the same base structure.
        </p>
        <p>In 2014, VGGNet [17] entered the scene, becoming one
of the reference architectures for object classification. The
first version (i.e., VGG16) had a very uniform architecture,
composed of sixteen weight layers (thirteen 3x3 convolutional
layers and three fully connected layers), with max pooling
operations between the convolutional blocks. The main
drawback of VGG is the number of parameters (i.e., 138 million),
which can be challenging to handle. Nevertheless, VGG is still
one of the preferred architectures for feature extraction from
images.</p>
        <p>
          In 2015, ResNet by Kaiming He et al. [
          <xref ref-type="bibr" rid="ref24">18</xref>
          ] introduced
a novel CNN architecture called Residual Neural
Network. The main difference from previous architectures is the
introduction of skip connections between layers. Such skip
connections made it possible to obtain better training results
with fewer parameters. ResNet obtained a top-5 error
rate of 3.5% on ImageNet, which beats human-level
performance (approximately 5%) on the same dataset.
        </p>
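        <p>A minimal residual block can be sketched as follows (assuming PyTorch is
available; this is an illustration of the skip connection, not the original ResNet
code):</p>
        <preformat>
# A minimal residual block (a sketch assuming PyTorch), showing the skip
# connection that characterises ResNet-style architectures.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)   # skip connection: add the input back

block = ResidualBlock(16)
print(block(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 16, 32, 32])
        </preformat>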
        <p>In 2017, MobileNet [19] was presented as a
solution for mobile and embedded visual applications. This
lightweight network is particularly suited to low-power
systems [20]. The network is very flexible and can be
easily adapted to the specific application by tweaking its
hyper-parameters.</p>
        <p>
          Lastly, in 2019 M. Tan and Q. V. Le [
          <xref ref-type="bibr" rid="ref27">21</xref>
          ] studied
a novel neural network (i.e., EfficientNet) which can be
scaled up as needed in a very efficient way. The main
novelty of this method is that the scaling process
involves not only the depth of the network, but also the
width and the resolution of the input; the authors showed that
this compound method obtains better results with fewer
parameters.
        </p>
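        <p>A small worked example of the compound scaling idea follows (a sketch; the
base coefficients below are the ones reported in the EfficientNet paper and should be
treated as an assumption of this illustration):</p>
        <preformat>
# Compound scaling as described above: depth, width and input resolution are
# scaled together by a single coefficient phi (assumed base coefficients).
alpha, beta, gamma = 1.2, 1.1, 1.15   # depth, width, resolution multipliers

for phi in range(0, 4):
    depth = alpha ** phi
    width = beta ** phi
    resolution = gamma ** phi
    flops_factor = depth * width ** 2 * resolution ** 2   # roughly doubles per step
    print(f"phi={phi}: depth x{depth:.2f}, width x{width:.2f}, "
          f"resolution x{resolution:.2f}, FLOPs x{flops_factor:.2f}")
        </preformat>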
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Object detection</title>
        <p>Deep Neural Networks for Object Detection can be
categorized into two different types:
• Region proposal networks
• Single shot detectors</p>
        <p>
          Historically, the first detectors were based on the
previously described image classification networks. The
basic idea for obtaining object detection relies on a sliding
window approach: a fixed-size rectangular window crops
the image at different positions and a subsequent image
classification network is in charge of predicting the object
class. At each iteration, the window is moved by a stride
value until the whole image has been analyzed (a sketch of
this approach is given after this paragraph). The main
drawback of this method is its low speed, because it is
computationally expensive. An improvement over the sliding
window approach is selective search [
          <xref ref-type="bibr" rid="ref28">22</xref>
          ], a hierarchical
grouping segmentation algorithm that combines
multiple grouping strategies. This algorithm starts with an
initial set of regions and at each iteration merges the
most similar regions together, until the whole image is
represented as a single region. Finally, a set of regions of
interest (ROIs) is selected and fed into an image
classification network. The resulting object detection network
is called Region-based ConvNet (R-CNN) [
          <xref ref-type="bibr" rid="ref30 ref31">23, 24</xref>
          ].
Although selective search noticeably improved the
overall speed of the process, it is still not enough when
speed is a key factor. In 2015 two other improvements
of region proposal based networks were proposed, Fast
R-CNN [
          <xref ref-type="bibr" rid="ref32">25</xref>
          ] and soon after Faster R-CNN [
          <xref ref-type="bibr" rid="ref34">26</xref>
          ]. The main
novelty of these new architectures was the
integration of ROI generation into the neural network itself. In
fact, the previous version of R-CNN used selective search
for ROI extraction as a separate process.
        </p>
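        <p>The sliding-window idea mentioned above can be sketched as follows (assuming
NumPy; the classifier here is a placeholder function, purely illustrative):</p>
        <preformat>
# A sketch of the sliding-window approach described above: a fixed-size window
# is moved across the image by a stride, and an image classifier scores each
# crop (the classifier is a stand-in for a real network).
import numpy as np

def classify_crop(crop):
    # placeholder for a real image-classification network
    return float(crop.mean())          # pretend "score" for the object class

image = np.random.rand(128, 128)
window, stride = 32, 16
detections = []

for y in range(0, image.shape[0] - window + 1, stride):
    for x in range(0, image.shape[1] - window + 1, stride):
        score = classify_crop(image[y:y + window, x:x + window])
        if score > 0.5:                # keep crops the placeholder scores highly
            detections.append((x, y, x + window, y + window, score))

print(len(detections), "candidate boxes")
        </preformat>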
        <p>In the same year, YOLO (You Only Look Once) [27,
28] revolutionized the object detection scene, presenting
an algorithm substantially different from the classical
region proposal networks. A new kind of architecture
started to emerge, called Single Shot Detectors. Instead
of using a ROI extraction phase, single shot detectors
divide the image into a grid, giving each cell the task of
detecting objects in that region. For each grid cell, multiple
predefined boxes (i.e., anchors or priors) are considered.
These boxes have multiple sizes and aspect ratios in order to
detect objects of different shapes. Immediately
after, Single Shot MultiBox Detectors [29] followed the
same approach, obtaining results similar to YOLO in terms
of speed and accuracy.</p>
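        <p>The grid-and-anchor idea can be illustrated with the following toy sketch (the
grid size, anchor shapes and objects are made-up numbers, not taken from any specific
detector):</p>
        <preformat>
# Toy illustration of the grid used by single-shot detectors: the image is split
# into SxS cells and each object is assigned to the cell containing its centre,
# where predefined anchor boxes of different shapes would be refined.
image_size, S = 416, 13                # e.g. a 416x416 input and a 13x13 grid
cell = image_size / S

anchors = [(30, 60), (60, 30), (90, 90)]            # (width, height) priors per cell

objects = [(200, 150, 80, 40), (50, 300, 30, 90)]   # (cx, cy, w, h) ground truths
for cx, cy, w, h in objects:
    col, row = int(cx // cell), int(cy // cell)
    # pick the anchor whose shape is closest to the object (crude match)
    best = min(anchors, key=lambda a: abs(a[0] - w) + abs(a[1] - h))
    print(f"object at ({cx},{cy}): grid cell ({row},{col}), anchor {best}")
        </preformat>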
        <p>Over the years many variations of these architectures
were presented, each one with its particularities and
strengths. Although there are exceptions, nowadays
region proposal based networks are preferred when
accuracy is of main importance and speed is secondary.
Moreover, R-CNNs are considered better in detecting
small objects.</p>
        <p>On the other hand, single shot detectors overtake
R-CNNs in real-time tasks and in edge or mobile computing [30].
The inference time of these networks is lower, at the cost
of reduced accuracy [31].</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusions</title>
      <p>In this paper, a brief review of commonly used deep learning methods has been
presented, emphasizing their application in the field of computer vision. In the last
years, especially using GPU clusters, we obtained the computational power to enable
the design of deeper neural networks [32]. Moreover, the availability of large
datasets such as COCO or ImageNet allowed training accurate models, which can be
adapted to a variety of scenarios.</p>
      <p>With the increasing importance of mobile devices and edge computing, the high
power requirements of the reviewed techniques will inevitably conflict with the low
power resources offered by edge devices. Although cloud computing can help, many
situations, such as rural areas, make internet access problematic, thus invalidating
the remote processing possibility. Moreover, supervised learning, which is the
commonly used method for computer vision tasks, allows obtaining noticeable results
at the cost of long training times. In the future, self-learning methods should be
considered, in order to skip the whole dataset creation step and focus on the
learning phase, as happens for humankind.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] A. Jaber, R. Bicker, Fault diagnosis of industrial robot bearings based on discrete wavelet transform and artificial neural network, International Journal of Prognostics and Health Management 7 (2016) art. no. 017.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[2] G. Capizzi, G. Lo Sciuto, C. Napoli, E. Tramontana, A multithread nested neural network architecture to model surface plasmon polaritons propagation, Micromachines 7 (2016) 110.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[3] F. Fallucchi, M. Petito, E. De Luca, Analysing and visualising open data within the data and analytics framework, Communications in Computer and Information Science 846 (2019) 135-146.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[4] Y. Li, H. Wu, A clustering method based on k-means algorithm, Physics Procedia 25 (2012) 1104-1109.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[5] L. Canese, G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Giardino, M. Re, S. Spanò, Multi-agent reinforcement learning: A review of challenges and applications, Applied Sciences 11 (2021) 4948.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[6] C. Napoli, G. Pappalardo, E. Tramontana, An agent-driven semantical identifier using radial basis neural networks and reinforcement learning, volume 1260, 2014.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[7] A. Venckauskas, A. Karpavicius, R. Damasevicius, R. Marcinkevicius, J. Kapociute-Dzikiene, C. Napoli, Open class authorship attribution of lithuanian internet comments using one-class classifier, 2017, pp. 373-382. doi:10.15439/2017F461.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[8] G. De Magistris, S. Russo, P. Roma, J. Starczewski, C. Napoli, An explainable fake news detector based on named entity recognition and stance classification applied to covid-19, Information (Switzerland) 13 (2022). doi:10.3390/info13030137.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[9] C. Napoli, E. Tramontana, G. Lo Sciuto, M. Woźniak, R. Damaševičius, G. Borowik, Authorship semantical identification using holomorphic chebyshev projectors, 2015, pp. 232-237. doi:10.1109/APCASE.2015.48.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248-255. doi:10.1109/CVPR.2009.5206848.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[11] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: Common objects in context, CoRR abs/1405.0312 (2014). URL: http://arxiv.org/abs/1405.0312. arXiv:1405.0312.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[12] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, in: Proceedings of the IEEE, 1998, pp. 2278-2324.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[13] Y. LeCun, C. Cortes, MNIST handwritten digit database (2010). URL: http://yann.lecun.com/exdb/mnist/.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[14] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, Commun. ACM 60 (2017) 84-90. URL: https://doi.org/10.1145/3065386. doi:10.1145/3065386.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[15] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision (IJCV) 115 (2015) 211-252. doi:10.1007/s11263-015-0816-y.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[16] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, CoRR abs/1311.2901 (2013). URL: http://arxiv.org/abs/1311.2901. arXiv:1311.2901.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[17] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2015. arXiv:1409.1556.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[18] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, 2015. arXiv:1512.03385.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[19] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, MobileNets: Efficient convolutional neural networks for mobile vision applications, 2017. arXiv:1704.04861.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[20] G. M. Bianco, R. Giuliano, G. Marrocco, F. Mazzenga, A. Mejia-Aguilar, LoRa system for search and rescue: Path-loss models and procedures in mountain scenarios, IEEE Internet of Things Journal 8 (2021) 1985-1999.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[21] M. Tan, Q. V. Le, EfficientNet: Rethinking model scaling for convolutional neural networks, CoRR abs/1905.11946 (2019). URL: http://arxiv.org/abs/1905.11946. arXiv:1905.11946.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[22] J. Uijlings, K. van de Sande, T. Gevers, A. Smeulders, Selective search for object recognition, International Journal of Computer Vision (2013). doi:10.1007/s11263-013-0620-5.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[23] R. B. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, CoRR abs/1311.2524 (2013). URL: http://arxiv.org/abs/1311.2524. arXiv:1311.2524.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[24] N. Brandizzi, V. Bianco, G. Castro, S. Russo, A. Wajda, Automatic rgb inference based on facial emotion recognition, volume 3092, 2021, pp. 66-74.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[25] R. B. Girshick, Fast R-CNN, CoRR abs/1504.08083 (2015). URL: http://arxiv.org/abs/1504.08083. arXiv:1504.08083.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[26] S. Ren, K. He, R. B. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, CoRR abs/1506.01497 (2015). URL: http://arxiv.org/abs/1506.01497. arXiv:1506.01497.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[27] J. Redmon, S. K. Divvala, R. B. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, CoRR abs/1506.02640 (2015). URL: http://arxiv.org/abs/1506.02640. arXiv:1506.02640.</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>[28] R. Avanzato, F. Beritelli, M. Russo, S. Russo, M. Vaccaro, Yolov3-based mask and face recognition algorithm for individual protection applications, volume 2768, 2020, pp. 41-45.</mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>[29] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, A. C. Berg, SSD: Single shot multibox detector, CoRR abs/1512.02325 (2015). URL: http://arxiv.org/abs/1512.02325. arXiv:1512.02325.</mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>[30] F. Mazzenga, R. Giuliano, F. Vatalaro, FttC-based fronthaul for 5G dense/ultra-dense access network: Performance and costs in realistic scenarios, Future Internet 9 (2017).</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>[31] A. Simonetta, M. Paoletti, Designing digital circuits in multi-valued logic, International Journal on Advanced Science, Engineering and Information Technology 8 (2018) 1166-1172.</mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>[32] G. Capizzi, F. Bonanno, C. Napoli, Hybrid neural networks architectures for SOC and voltage prediction of new generation batteries storage, in: 2011 International Conference on Clean Electrical Power (ICCEP), IEEE, 2011, pp. 341-344.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>