<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Distance Estimation of Fixed Objects in Driving Environments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giorgio Leporoni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerio Ponzi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Pro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Napoli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer, Control and Management Engineering, Sapienza University of Rome</institution>
          ,
          <addr-line>Via Ariosto 25, Roma, 00185</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute for Systems Analysis and Computer Science, Italian National Research Council</institution>
          ,
          <addr-line>Via dei Taurini 19, Roma, 00185</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>17</fpage>
      <lpage>24</lpage>
      <abstract>
<p>Autonomous driving is a highly relevant topic today, particularly among major car manufacturers competing to lead in technological innovation and to enhance driving safety. An autonomous vehicle must be able to sense its environment and navigate without human intervention, serving as a driver support system and, in some cases, as a substitute for the driver. A crucial aspect is identifying the positions of pedestrians, traffic signs, traffic lights, and other vehicles while computing the distances from them. This enables the vehicle to alert the driver in potentially dangerous situations, such as impending obstacles or driver distraction. In this paper, we introduce an approach for identifying traffic signs and determining the distance from them. Our method uses the YOLOv4 network for identification and a customized network for distance computation. This integration of AI technologies facilitates the timely detection of hazards and enables proactive alert mechanisms, thereby advancing the capabilities of autonomous vehicles and enhancing driving safety.</p>
      </abstract>
      <kwd-group>
<kwd>Machine Learning</kwd>
        <kwd>Deep Learning</kwd>
<kwd>YOLO</kwd>
        <kwd>Autonomous Driving</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The first method uses a depth map computed for each frame, enhancing the accuracy of distance measurements between signs located at the same depth. The second method capitalizes on temporal frame correlation, enhancing the smoothness and consistency of our system and thereby augmenting its overall performance.</p>
      <p>The use of depth maps helps us obtain more accurate measurements between signs that are collocated at the same depth. Temporal frame correlation instead helps us both to filter out some false-positive predictions, by keeping a bounding box if and only if it also appears in the previous and the next frame, and to obtain more stable distance predictions for successive frames.</p>
      <p>The major car manufacturers are at the forefront in this field. Taking Tesla as an example, it uses a huge number of sensors and cameras mounted on its vehicles, which implies that the car must be built that way. With methods like ours, one can instead simply mount a camera, such as a dash cam, inside the vehicle as a driving aid. Furthermore, as in the reference paper, we implemented a method that is not bound to the parameters of the camera used. For example, IPM methods are bound to the height of the camera above the ground, whereas with our method the driver does not have to worry about the position in which the camera is mounted, and the camera can easily be reused on different vehicles, yielding a simple and portable system usable with any camera.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>Inverse Perspective Mapping [<xref ref-type="bibr" rid="ref5">5</xref>] consists of removing the perspective distortion from the road surface, taking the lane lines as reference to compute distances under the assumption that they have a fixed size. In this method, a bird's-eye view of the roadway is computed to establish the correspondence between a pixel dimension and the lane-line size. This correspondence is then used to count the pixels between an object and the vehicle, obtaining the approximate distance. The method has problems in the presence of road curves or when road markings are barely visible or absent. In addition, it is very dependent on the camera parameters.</p>
      <p>Stereo vision [<xref ref-type="bibr" rid="ref6">6</xref>] foresees the use of a stereo camera that generates two images, a left and a right view. From these two images of the same environment, a disparity map is generated using epipolar geometry. With a simple formula, the generated map makes it possible to compute, for each pixel of the 2D image, the z coordinate that gives the depth of the object in that pixel in the real 3D world. The main problem with this method is the high cost of the stereo camera.</p>
      <p>AI-based approach [<xref ref-type="bibr" rid="ref7">7</xref>]: this method applies deep learning to monocular images. Starting from labeled data, a neural network is trained to compute distances from object bounding boxes (DisNet).</p>
      <p>The Geometry approach [<xref ref-type="bibr" rid="ref8">8</xref>] and other papers are based on the assumption of fixed sizes for known objects, such as vehicles; in this way, the known camera parameters can be used in a formula to compute distances [<xref ref-type="bibr" rid="ref9">9</xref>, <xref ref-type="bibr" rid="ref10">10</xref>].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Our approach</title>
      <p>Our approach focuses on Italian road signs. In Italy, for each category of sign there is a most commonly used size, so once we classified the surveyed sign we assumed that its size was the common one.</p>
      <p>To approach the problem, we started by creating our dataset from scratch. To accomplish this task, we used a dash cam mounted on our vehicle and recorded routes around the city, obtaining roughly 3 hours of recordings. We then filtered out all unsuitable videos, and from the remaining ones we extracted about 1500 frames representing the roads around the city. We cropped each frame along the vertical axis because a portion of the vehicle interior was visible, removing useless information.</p>
      <p>For object detection, we needed a fast solution to avoid wasting time in the whole process. We therefore chose YOLOv4 (You Only Look Once) [<xref ref-type="bibr" rid="ref11">11</xref>] because it runs much faster than other methods such as R-CNN [<xref ref-type="bibr" rid="ref12">12</xref>] or methods based on color segmentation [<xref ref-type="bibr" rid="ref13">13</xref>]. We downloaded a pre-trained YOLO network on which we performed transfer learning on a German Traffic Sign dataset, training for 4000 iterations. During the transfer learning phase we also tried some image pre-processing techniques, such as grayscale conversion and histogram equalization, which unfortunately gave bad results. In the end, the network reached an accuracy of about 91%.</p>
      <p>With the YOLO network we obtained the bounding boxes of the traffic signs for each frame, manually discarding all the frames without detected objects or containing wrong detections. To get the ground-truth distance for each bounding box we use the following pinhole-model formula:</p>
      <p>distance = (real sign width × focal length) / sign width in pixels</p>
      <p>The formula is based on the focal length of the camera, which we obtained by taking a picture of an object of known size placed at a known distance and counting the pixels that the object occupies in the image. This is the only camera parameter that was necessary to create the dataset. In particular, the width of the triangular and octagonal signs used is 90 cm, while it is 60 cm for the square and circular ones.</p>
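      <p>As an illustration of this arithmetic, here is a minimal Python sketch (our own, not code from the paper; the function names and the example numbers are hypothetical):</p>
      <preformat>
# Pinhole-model arithmetic described above: one calibration shot of an
# object of known size at a known distance yields the focal length in
# pixels, which then converts bounding-box widths into distances.

def calibrate_focal_length(real_width_m, known_distance_m, width_px):
    """Focal length in pixels from a single calibration picture."""
    return width_px * known_distance_m / real_width_m

def ground_truth_distance(real_width_m, focal_px, width_px):
    """Distance of a sign whose bounding box is width_px pixels wide."""
    return real_width_m * focal_px / width_px

# Example: a 90 cm sign photographed at 10 m appears 45 px wide,
# giving focal_px = 500; the same sign seen 30 px wide is 15 m away.
f = calibrate_focal_length(0.90, 10.0, 45)
print(ground_truth_distance(0.90, f, 30))  # 15.0
      </preformat>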
      <p>Through this process we built a dataset composed of 959 images. After the creation of the dataset we focused on the detection part; for this purpose we used YOLOv4, as mentioned above.</p>
      <p>Figure 1: (a) Predictive model for traffic sign distance computation: the input image with bounding boxes undergoes VGG16 feature extraction, ROI pooling for size standardization, and a three-layer feedforward network for distance prediction using a softplus activation. (b) Enhanced model integrating depth-map information and temporal frame correlation for stabilized predictions: the input image with bounding boxes is processed through VGG16, ROI pooling, and a modified three-layer feedforward network, leading to improved distance accuracy.</p>
      <p>Once the bounding boxes are obtained for an image, it is passed to a specific network for the distance computation. This second network is composed of a CNN (VGG16) [<xref ref-type="bibr" rid="ref14">14</xref>] for feature-map extraction, whose output is combined with the information about the bounding boxes through an ROI pooling layer [<xref ref-type="bibr" rid="ref15">15</xref>]. This layer is necessary because the bounding boxes in a single image can have different sizes, and it standardizes their dimensions. The output of the ROI pooling is finally passed to a feedforward network, composed of 3 layers (2048, 512, 1), that predicts distances using a softplus activation function. The architecture of the network is shown in Figure 1a.</p>
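      <p>Below is a minimal PyTorch sketch of such an architecture, written from the description above: the (2048, 512, 1) head and the softplus output come from the text, while the pooling size, the pre-trained weights, and the 1/32 feature-map scale are our assumptions.</p>
      <preformat>
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import roi_pool

class DistanceNet(nn.Module):
    def __init__(self, pool_size=7):
        super().__init__()
        # VGG16 convolutional backbone for feature-map extraction
        self.backbone = torchvision.models.vgg16(weights="DEFAULT").features
        self.pool_size = pool_size
        # Three-layer feedforward head (2048, 512, 1); softplus keeps
        # the predicted distance non-negative
        self.head = nn.Sequential(
            nn.Linear(512 * pool_size * pool_size, 2048), nn.ReLU(),
            nn.Linear(2048, 512), nn.ReLU(),
            nn.Linear(512, 1), nn.Softplus(),
        )

    def forward(self, images, boxes):
        # images: [B, 3, H, W]; boxes: list of [K_i, 4] tensors, pixel coords
        feats = self.backbone(images)
        # VGG16 downsamples by 32, so map pixel boxes onto the feature map
        rois = roi_pool(feats, boxes, output_size=self.pool_size,
                        spatial_scale=1.0 / 32)
        return self.head(rois.flatten(start_dim=1))  # one distance per box
      </preformat>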
      <p>By testing the entire process on different videos, we noticed that this method was not stable in the predictions made between successive frames: in some cases there was a large variance between the distances predicted for the same traffic sign in two or more successive frames. We tried to improve our results by adding depth-map information and by exploiting the concept of temporal frame correlation.</p>
      <p>Depth map [<xref ref-type="bibr" rid="ref16">16</xref>]: the idea is that traffic signs at the same depth in the real world are more or less at the same distance from the vehicle. Based on this observation, we use a pre-trained network called MiDaS [<xref ref-type="bibr" rid="ref17">17</xref>, <xref ref-type="bibr" rid="ref18">18</xref>] to get the depth map of the image under exam. Once the bounding boxes are detected in the original image and the distances are computed, we project the bounding boxes onto the depth map. For each group of traffic signs at the same depth, allowing a small variance proportional to the maximum depth value in the image, we compute the average of the distances predicted in the original image to obtain a uniform value. At the moment we apply this method after the computation of the distances, but it could also be used in the creation of the dataset, to get more detailed labels, or in the training phase, to directly stabilize the results in the network; a sketch of the grouping rule follows below.</p>
      <p>Figure 2 shows a representation of this method: looking at the traffic signs in the image, it is visible from the depth-map coloration that they are at the same distance, so their predictions are corrected to the same value.</p>
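      <p>The following Python sketch shows our reading of this grouping rule; the median-depth sampling and the tolerance value are assumptions, not details from the paper.</p>
      <preformat>
import numpy as np

def depth_of(det, depth_map):
    """Median MiDaS depth value inside a detection's bounding box."""
    x0, y0, x1, y1 = det["box"]
    return float(np.median(depth_map[y0:y1, x0:x1]))

def smooth_by_depth(detections, depth_map, rel_tol=0.05):
    """detections: list of dicts with 'box' (x0, y0, x1, y1) and 'dist'.
    Signs whose depths differ by less than rel_tol * max depth are
    grouped and assigned the average of their predicted distances."""
    max_depth = float(depth_map.max())
    depths = [depth_of(d, depth_map) for d in detections]
    groups = []  # greedily group detections whose depths nearly coincide
    for i in range(len(detections)):
        for g in groups:
            if abs(depths[i] - depths[g[0]]) &lt; rel_tol * max_depth:
                g.append(i)
                break
        else:
            groups.append([i])
    for g in groups:  # signs at the same depth share one averaged distance
        avg = sum(detections[i]["dist"] for i in g) / len(g)
        for i in g:
            detections[i]["dist"] = avg
    return detections
      </preformat>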
      <p>Temporal frame correlation: we use this technique to give linearity to the distances predicted over a sequence of frames. While applying the base method, we noticed that in some cases the network's predictions were very different for successive frames. To stabilize the predictions, we consider a traffic sign detected in the frame at time t a valid object only if it is also present at times t-1 and t+1, and we take its distance as the average over the 3 frames in sequence. To verify whether the same traffic sign is present in the 3 consecutive frames, we first find the center of its bounding box at time t and the centers of all the traffic signs in the previous and following frames; then we compute the distances between these centers, and if a distance is lower than a certain threshold, we are looking at the same traffic sign (see the sketch below).</p>
      <p>An example of this concept is given in Figure 3, in which there is a wrong detection at frame t (red circle in the top-right image); since this wrong prediction is present neither at frame t-1 nor at frame t+1, it is discarded at frame t as well.</p>
      <p>The architecture of this modified network is represented in Figure 1b.</p>
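      <p>A minimal Python sketch of this filtering follows; the pixel threshold value and the data layout are assumptions, not details from the paper.</p>
      <preformat>
import math

def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def match(det, frame, thresh_px=30):
    """Return the detection in `frame` whose bounding-box center is
    closest to `det`, if it lies within thresh_px pixels; else None."""
    c = center(det["box"])
    best = min(frame, key=lambda d: math.dist(c, center(d["box"])),
               default=None)
    if best and math.dist(c, center(best["box"])) &lt; thresh_px:
        return best
    return None

def filter_frame(prev, curr, nxt):
    """Keep a detection at time t only if it also appears at t-1 and
    t+1; its distance becomes the average over the three frames."""
    kept = []
    for det in curr:
        before, after = match(det, prev), match(det, nxt)
        if before and after:  # present in all three frames: valid object
            det["dist"] = (before["dist"] + det["dist"] + after["dist"]) / 3
            kept.append(det)
    return kept  # false positives missing at t-1 or t+1 are discarded
      </preformat>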
    </sec>
    <sec id="sec-2">
      <title>5. Results</title>
      <sec id="sec-2-1">
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>Regarding the detection part, with YOLO we reach an accuracy of around 91%.</p>
      <p>For the distance prediction network, instead, it is not possible to compute a true accuracy, but we reach a loss of roughly 130, visible in the graph in Figure 4; the graph shows that the loss function has a trend that would keep improving if the network were trained for more epochs.</p>
      <p>As evaluation metric we used the one provided by [<xref ref-type="bibr" rid="ref7">7</xref>]: the RMSE of the predictions, broken down by distance in meters,</p>
      <p>RMSE = √( (1/N) Σ_{i=1}^{N} ‖d_i − d_i*‖² ),</p>
      <p>where d_i is the predicted distance for sample i, d_i* its ground-truth distance, and N the number of predictions.</p>
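      <p>For reference, the metric as a short sketch (our own helper, computing the RMSE over the predictions falling in one meter bucket):</p>
      <preformat>
import numpy as np

def rmse(pred, target):
    """Root-mean-square error between predicted and ground-truth
    distances (arrays of the same length N)."""
    pred, target = np.asarray(pred), np.asarray(target)
    return float(np.sqrt(np.mean((pred - target) ** 2)))
      </preformat>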
      <p>To see how the behavior of the network changes with the distance of the detected object, the results are represented in the graph in Figure 5, compared with the ones obtained by the reference paper. Visibly, the predictions get worse as the distance increases. We notice that the bounding boxes of traffic signs at larger distances do not match their dimensions perfectly, introducing an error.</p>
      <p>Figure 5: (a) our meters-RMSE predictions graph; (b) reference paper meters-RMSE predictions graph.</p>
      <p>Another source of error is probably the fact that we have only a few samples of road signs at large distances.</p>
      <p>Table 1 compares our results with the ones of the reference paper. As visible, the results are similar; ours are slightly better, since lower values represent better predictions. This is because we make predictions only on traffic signs, while they predict on cars, cyclists, and pedestrians, which gives them a larger margin of error than ours.</p>
      <p>To show the method in action, we made some test videos of the network at work, available on YouTube. In particular, we made videos with the following characteristics:</p>
      <p>• Test video using the base network, without depth map and temporal frame correlation (daylight conditions)
• Test video using depth map and temporal frame correlation (daylight conditions)
• Test video using the base network, without depth map and temporal frame correlation, rounded to 5 meters (daylight conditions)
• Test video using depth map and temporal frame correlation, rounded to 5 meters (daylight conditions)</p>
      <p>"Rounded to 5 meters" means that we approximate each prediction to the nearest multiple of 5 meters to get more stable results: for example, 12.4 meters is rounded to 10 meters, while 12.6 meters is rounded to 15 meters.</p>
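      <p>A one-line sketch of this rounding rule (our illustration):</p>
      <preformat>
# Snap each predicted distance to the nearest multiple of 5 meters.
def round_to_5m(d):
    return 5 * round(d / 5)

assert round_to_5m(12.4) == 10 and round_to_5m(12.6) == 15
      </preformat>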
    </sec>
    <sec id="sec-3">
      <title>6. Conclusion</title>
      <sec id="sec-3-1">
        <title>The method seems to work well, there are errors introduced by the labels of our dataset that are not accurate,</title>
        <p>(a) Our meters-RMSE predictions graph
(b) Reference paper meters-RMSE predictions graph
caused by the possible diferent dimensions for each
trafifc sign on the road introducing a small error that then
will propagate throughout the process, even if we tried to
solve it using depth map and temporal frame correlation.
So, the main future step could be using more accurate
labels for the samples inside the dataset. The work is
based on the objects detected and rounded by bounding
boxes but is not always sure that their dimensions match
perfectly the sizes of the trafic signs, so this point
introduces errors in the predictions of the network. As said
at the beginning, in Italy the same trafic signs could be
used up to 3 diferent dimensions, so it could be useful to
infer their dimensions to improve the predicted distances.
As future improvement, there possible extension of the
detected objects also to vehicles and pedestrians.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <title>This work has been developed at is.Lab() Intelligent Sys</title>
        <p>tems Laboratory at the Department of Computer, Control,
and Management Engineering, Sapienza University of
Rome (https:// islab.diag.uniroma1.it). The work has also
been partially supported from Italian Ministerial grant
PRIN 2022 “ISIDE: Intelligent Systems for Infrastructural
Diagnosis in smart-concretE”, n. 2022S88WAY - CUP
B53D2301318, and by the Age-It: Ageing Well in an
ageing society project, task 9.4.1 work package 4 spoke 9,
within topic 8 extended partnership 8, under the National
Recovery and Resilience Plan (PNRR), Mission 4
Component 2 Investment 1.3—Call for tender No. 1557 of
11/10/2022 of Italian Ministry of University and Research
funded by the European Union—NextGenerationEU, CUP
B53C22004090006.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.</given-names>
            <surname>Ponzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wajda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brociek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>Analysis pre and post covid-19 pandemic rorschach test data of using em algorithms and gmm models</article-title>
          , volume
          <volume>3360</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Capizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Paternò</surname>
          </string-name>
          ,
          <article-title>An innovative hybrid neuro-wavelet method for reconstruction of missing data in astronomical photometric surveys 7267 LNAI (</article-title>
          <year>2012</year>
          )
          <fpage>21</fpage>
          -
          <lpage>29</lpage>
          . doi:
          <volume>10</volume>
          .1007/ 978-3-
          <fpage>642</fpage>
          -29347-
          <issue>4</issue>
          _
          <fpage>3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Alfarano</surname>
          </string-name>
          , G. De Magistris,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mongelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Starczewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>A novel convmixer transformer based architecture for violent behavior detection 14126 LNAI (</article-title>
          <year>2023</year>
          )
          <fpage>3</fpage>
          -
          <lpage>16</lpage>
          . doi:
          <volume>10</volume>
          .1007/ 978-3-
          <fpage>031</fpage>
          -42508-
          <issue>0</issue>
          _
          <fpage>1</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Abu-Haimed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-C.</given-names>
            <surname>Lien</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <article-title>Learning Object-specific Distance from a Monocular Image</article-title>
          ,
          <source>Technical Report</source>
          ,
          <year>2019</year>
          . URL: http://arxiv.org/abs/
          <year>1909</year>
          .04182. doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>1909</year>
          .
          <volume>04182</volume>
          , arXiv:
          <year>1909</year>
          .
          <article-title>04182 [cs] type: article.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tuohy</surname>
          </string-name>
          ,
          <string-name>
            <surname>D. O'Cualain</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Glavin</surname>
          </string-name>
          ,
          <article-title>Distance determination for an automobile environment using Inverse Perspective Mapping in OpenCV</article-title>
          ,
          <source>in: IET Irish Signals and Systems Conference (ISSC</source>
          <year>2010</year>
          ),
          <year>2010</year>
          , pp.
          <fpage>100</fpage>
          -
          <lpage>105</lpage>
          . doi:
          <volume>10</volume>
          .1049/cp.
          <year>2010</year>
          .
          <volume>0495</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <source>Distance Measurement System Based on Binocular Stereo Vision, IOP Conference Series: Earth and Environmental Science</source>
          <volume>252</volume>
          (
          <year>2019</year>
          )
          <article-title>052051</article-title>
          . URL: https://doi.org/10.1088/
          <fpage>1755</fpage>
          -1315/ 252/5/052051. doi:
          <volume>10</volume>
          .1088/
          <fpage>1755</fpage>
          -1315/252/5/ 052051.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>[7] DisNet: A novel method for distance estimation from monocular camera</article-title>
          , ???? URL: https: //patrick-llgc.github.io/Learning-Deep-Learning/ paper_notes/disnet.html.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Saleh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khwandah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Heller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mumtaz</surname>
          </string-name>
          ,
          <article-title>Trafic Signs Recognition and Distance Estimation using a Monocular Camera</article-title>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>A comprehensive solution for psychological treatment and therapeutic path planning based on knowledge base and expertise sharing</article-title>
          , volume
          <volume>2472</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Lo Sciuto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>A cloud-based lfexible solution for psychometric tests validation, administration and evaluation</article-title>
          , volume
          <volume>2468</volume>
          ,
          <year>2019</year>
          , doi:10.1007/978-3-
          <fpage>319</fpage>
          -48680-2_
          <fpage>19</fpage>
          . pp.
          <fpage>16</fpage>
          -
          <lpage>21</lpage>
          . [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <string-name>
            <surname>Very Deep</surname>
          </string-name>
          Convo-
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bochkovskiy</surname>
          </string-name>
          , C.-Y. Wang, H.
          <string-name>
            <surname>-Y. M. Liao</surname>
          </string-name>
          ,
          <article-title>YOLOv4: lutional Networks for Large-Scale Image RecogOptimal Speed and Accuracy of Object Detection, nition</article-title>
          ,
          <source>Technical Report</source>
          ,
          <year>2015</year>
          . URL: http://arxiv.
          <source>Technical Report</source>
          ,
          <year>2020</year>
          . URL: http://arxiv.org/abs/ org/abs/1409.1556. doi:
          <volume>10</volume>
          .48550/arXiv.1409.
          <year>2004</year>
          .
          <volume>10934</volume>
          . doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>2004</year>
          .
          <volume>10934</volume>
          ,
          <issue>1556</issue>
          , arXiv:
          <fpage>1409</fpage>
          .1556 [
          <article-title>cs] type: article</article-title>
          . arXiv:
          <year>2004</year>
          .
          <article-title>10934 [cs, eess] type: article</article-title>
          . [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Donahue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          , J. Malik,
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <surname>Fast</surname>
            <given-names>R-CNN</given-names>
          </string-name>
          , in: 2015 IEEE Inter-
          <article-title>Rich feature hierarchies for accurate object detecnational Conference on Computer Vision (ICCV), tion and semantic segmentation</article-title>
          ,
          <source>Technical Report</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1440</fpage>
          -
          <lpage>1448</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICCV.
          <year>2015</year>
          .
          <year>2014</year>
          . URL: http://arxiv.org/abs/1311.2524. doi:
          <volume>10</volume>
          . 169, iSSN:
          <fpage>2380</fpage>
          -
          <lpage>7504</lpage>
          . 48550/arXiv.1311.2524, arXiv:
          <fpage>1311</fpage>
          .2524 [cs]
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Youssef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Albani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bloisi</surname>
          </string-name>
          ,
          <article-title>Fast Trafic type: article. Sign Recognition Using Color Segmentation and</article-title>
          [16]
          <string-name>
            <given-names>C.</given-names>
            <surname>Godard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. Mac</given-names>
            <surname>Aodha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Firman</surname>
          </string-name>
          , G. Brostow, Deep Convolutional Networks, volume
          <volume>10016</volume>
          ,
          <year>2016</year>
          . Digging Into
          <string-name>
            <surname>Self-Supervised Monocular Depth Estimation</surname>
          </string-name>
          ,
          <source>Technical Report</source>
          ,
          <year>2019</year>
          . URL: http://arxiv. org/abs/
          <year>1806</year>
          .01260. doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>1806</year>
          .
          <volume>01260</volume>
          , arXiv:
          <year>1806</year>
          .
          <article-title>01260 [cs, stat] type: article.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ranftl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bochkovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Koltun</surname>
          </string-name>
          ,
          <article-title>Vision Transformers for Dense Prediction</article-title>
          ,
          <source>Technical Report</source>
          ,
          <year>2021</year>
          . URL: http://arxiv.org/abs/2103.13413. doi:
          <volume>10</volume>
          .48550/arXiv.2103.13413, arXiv:
          <fpage>2103</fpage>
          .13413 [
          <article-title>cs] type: article.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ranftl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lasinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hafner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Schindler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Koltun</surname>
          </string-name>
          ,
          <article-title>Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Crossdataset Transfer</article-title>
          ,
          <source>Technical Report</source>
          ,
          <year>2020</year>
          . URL: http: //arxiv.org/abs/
          <year>1907</year>
          .01341. doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>1907</year>
          .
          <volume>01341</volume>
          , arXiv:
          <year>1907</year>
          .
          <article-title>01341 [cs] type: article.</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>