=Paper=
{{Paper
|id=Vol-3695/p03
|storemode=property
|title=Distance Estimation of Fixed Objects in Driving Environments
|pdfUrl=https://ceur-ws.org/Vol-3695/p03.pdf
|volume=Vol-3695
|authors=Giorgio Leporoni,Valerio Ponzi,Francesco Pro,Christian Napoli
|dblpUrl=https://dblp.org/rec/conf/system/LeporoniPP023
}}
==Distance Estimation of Fixed Objects in Driving Environments==
Giorgio Leporoni¹, Valerio Ponzi¹,², Francesco Pro¹ and Christian Napoli¹,²
¹ Department of Computer, Control and Management Engineering, Sapienza University of Rome, Via Ariosto 25, Roma, 00185, Italy
² Institute for Systems Analysis and Computer Science, Italian National Research Council, Via dei Taurini 19, Roma, 00185, Italy
Abstract
Autonomous driving is a highly relevant topic today, particularly among major car manufacturers attempting to lead in
technological innovation and enhance driving safety. An autonomous vehicle must possess the capability to sense its
environment and navigate without human intervention. Thus, it serves as both a driver support system and, in some cases, a
substitute. A crucial aspect involves identifying the positions of pedestrians, traffic signs, traffic lights, and other vehicles while
computing distances from them. This enables the vehicle to emit alerts to the driver in potentially dangerous situations, such
as impending obstacles due to external factors or driver distraction. In this paper, we introduce an approach for identifying
traffic signs and determining the distance from them. Our method utilizes the YOLOv4 network for identification and a
customized network for distance computation. This integration of AI technologies facilitates the timely detection of hazards
and enables proactive alert mechanisms, thereby advancing the capabilities of autonomous vehicles and enhancing driving
safety.
Keywords
Machine Learning, Deep Learning, YOLO, Autonomous Driving
SYSTEM 2023: 9th Scholar's Yearly Symposium of Technology, Engineering and Mathematics, Rome, December 3-6, 2023
leporoni.1944533@studenti.uniroma1.it (G. Leporoni); ponzi@diag.uniroma1.it (V. Ponzi); pro.1944191@studenti.uniroma1.it (F. Pro); cnapoli@diag.uniroma1.it (C. Napoli)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073
Giorgio Leporoni et al., CEUR Workshop Proceedings 17–24

1. Introduction

Road safety is a major global concern, impacting the well-being of individuals and communities worldwide. The development and adoption of advanced technologies, such as driver assistance systems and autonomous vehicles, offers significant potential to further enhance road safety in the long term. This is possible by creating systems based on cameras or sensors mounted on the vehicles that process the acquired images and can identify the typical objects of a road environment, performing computations on them such as estimating their distances. In this way, the vehicle can make quick decisions autonomously in case of necessity. A classical example is a stop sign ahead: if the system detects that the driver is not reducing speed, it can brake the vehicle autonomously or simply alert the driver with acoustic signals.

In recent years, attempts have begun to approach this field of research by exploiting artificial intelligence. Previous methods involved the use of geometry under the assumption of fixed dimensions for objects such as vehicles. Other methods were based on IPM (Inverse Perspective Mapping), using the lines present on the carriageway; these methods all depend on the parameters of the camera used.

One of the main problems in this field of research is the dataset. Since this is a safety-critical area, to be sure of the system's accuracy the dataset should be composed of a huge number of samples representing different objects in very different contexts [1, 2, 3]. What we did, therefore, was record video on short routes from a dash cam mounted on our vehicle, extract frames, compute the ground truth on them in an automated way, and finally build an ad hoc dataset for our needs.

In this paper, we focus on computing the distances between the vehicle and the detected traffic signs using single images captured by a monocular camera. We decided to use this type of camera because it is the most common and affordable. The method foresees two phases: one for the detection of the traffic signs in the captured images, and a second phase for inferring distances from them. For this second phase we built a network based on a recent paper [4] that tries to solve the problem with a purely learning-based approach.

Our main contributions arise from our endeavor to create an automated system tailored to our needs. Initially, we integrated YOLOv4 to produce bounding boxes around traffic signs, facilitating the automatic identification of their positions within images, thus concluding the initial phase of our approach. Subsequently, we directed our efforts towards developing a specialized dataset to address our specific problem, as existing datasets did not fulfill our requirements. Building upon our initial findings, we sought to enhance our system by implementing two stabilization methods for predicted distances. The first method entails generating and utilizing depth maps for each frame, enhancing the accuracy of distance measurements between signs located at the same depth. The second method capitalizes on temporal frame correlation, enhancing the smoothness and consistency of our system, and thereby augmenting its overall performance.

The use of depth maps helps us obtain more accurate measurements between signs that are located at the same depth. Temporal frame correlation instead helps us to filter out some false positive predictions, keeping a bounding box if and only if it appears in both the previous and the next frames, and to get more stable distance predictions for successive frames.

The major car manufacturers are at the forefront in this field. Taking Tesla as an example, it uses a huge number of sensors and cameras mounted on its vehicles, which implies that the car must be manufactured that way. With methods like ours, one can simply mount a camera, such as a dash cam, inside the vehicle as a driving aid. Furthermore, as in the reference paper, we tried to implement a method that is not bound to the parameters of the camera used. For example, IPM methods are bound to the height of the camera from the ground; in our case, instead, the driver does not have to worry about the position in which the camera is mounted, and the same setup can easily be used on different vehicles, building a simple and portable system usable with any camera.

2. Related works

Inverse Perspective Mapping [5] consists of removing the perspective distortion from the road surface, taking the lane lines as reference to compute distances under the assumption that they have a fixed size. In this method, a bird's eye view of the roadway is computed to establish the correspondence between a pixel dimension and the lane line size. This correspondence is then used to count the pixels between an object and the vehicle, obtaining the approximate distance. This method has problems in the presence of road curves, or when lane markings are barely visible or absent. In addition, it is very dependent on the camera parameters.

Stereo vision [6]: This method foresees the use of a stereo camera that generates two images, a left and a right view. From these two images of the same environment a disparity map is generated using epipolar geometry. With a simple formula applied to the generated map it is possible to compute, for each pixel of the 2D image, the z coordinate that gives the depth of the object at that pixel in the real 3D world. The main problem with this method is the high cost of the stereo camera.

AI-based approach [7]: This method applies deep learning to monocular images. Starting from labeled data, a neural network is trained to compute distances from object bounding boxes (DisNet).

Geometry approach [8]: Other papers are based on the assumption of fixed sizes for known objects, such as vehicles. In this way, knowing the camera parameters, a formula can be used to compute distances [9, 10].

3. Our approach

Our approach focuses on Italian road signs. In Italy, for each category of sign there is a most commonly used size, so once we classified the surveyed sign, we assumed that its size was the common one.

To approach the problem, we started by creating our dataset from scratch. To accomplish this task, we used a dash cam mounted on our vehicle, recording routes around the city to obtain roughly 3 hours of footage. We then filtered out all unsuitable videos; from the remaining videos we extracted about 1500 frames representing the roads around the city. We cropped each frame along the vertical axis because a portion of the vehicle interior was visible, removing useless information.

For object detection, we needed a quick solution to avoid wasting time in the whole process, so we chose YOLOv4 (You Only Look Once) [11] because it runs much faster than alternatives such as R-CNN [12] or methods based on color segmentation [13]. We downloaded a pre-trained YOLO network on which we performed transfer learning on a German Traffic Sign dataset, training for 4000 iterations. During the transfer learning phase we also tried some image pre-processing techniques, such as grayscale conversion and histogram equalization, which unfortunately yielded bad results. In the end, the network reached an accuracy of about 91%.

With the YOLO network, we obtained the bounding boxes of the traffic signs for each frame, manually discarding all the frames without detected objects or with wrong detections. To get the ground-truth distance of each bounding box we use the following formula:

distance = (Width_cm · FocalLength) / Width_px

It is based on the focal length of the camera, which we obtained by taking a picture of an object of known size placed at a known distance and counting the pixels that the object occupies in the image. This is the only camera parameter that was necessary to create the dataset. In particular, the width of the triangular and octagonal signs used is 90 cm, while it is 60 cm for the square and circular ones.

Through this process, we built a dataset composed of 959 images.
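The ground-truth computation above can be sketched in a few lines of Python. The function names, the shape-to-width table, and the calibration helper are our own illustrative choices, not code from the paper:

```python
# Sketch of the ground-truth distance formula used above (pinhole model):
# distance = Width_cm * FocalLength / Width_px.
# Names and data layout are illustrative, not taken from the paper's code.

# Common Italian sign widths used in the paper: 90 cm for triangular and
# octagonal signs, 60 cm for square and circular ones.
SIGN_WIDTH_CM = {"triangle": 90.0, "octagon": 90.0, "square": 60.0, "circle": 60.0}

def calibrate_focal_length_px(known_width_cm, known_distance_cm, measured_width_px):
    """Recover the focal length in pixels from a single picture of an object
    of known size placed at a known distance, as described above."""
    return measured_width_px * known_distance_cm / known_width_cm

def sign_distance_cm(shape, bbox_width_px, focal_length_px):
    """Distance to a sign from the width of its bounding box in pixels."""
    return SIGN_WIDTH_CM[shape] * focal_length_px / bbox_width_px
```

For instance, with a focal length of 600 px, a 90 cm triangular sign whose bounding box is 90 px wide is estimated at 600 cm, i.e. 6 m.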
Figure 1: Schematic representations of the comprehensive distance computation system. (a) Predictive model for traffic sign distance computation: the input image with bounding boxes undergoes VGG16 feature extraction, ROI pooling for size standardization, and a three-layer feedforward network that predicts the distance using a softplus activation. (b) Enhanced model integrating depth map information and temporal frame correlation for stabilized predictions: the input image with bounding boxes is processed through VGG16, ROI pooling, and a modified three-layer feedforward network, leading to improved distance accuracy.

After the creation of the dataset, we focused on the detection part. For this purpose, we used YOLOv4 as mentioned above.

After obtaining the bounding boxes for an image, the image is passed to a specific network for the distance computation. This second network is composed of a CNN (VGG16) [14] for feature map extraction; the resulting feature map is then combined with the bounding box information through an ROI pooling layer [15]. This layer is necessary because the bounding boxes in a single image can have different sizes: the layer removes this difference by standardizing their dimensions. The output of the ROI pooling is finally passed to a feedforward network, composed of 3 layers (2048, 512, 1), that predicts distances using a softplus activation function. The architecture of the network is shown in Figure 1a.

By testing the entire process on different videos, we noticed that this method was not stable in the predictions made between successive frames: in some cases there was a large variance between the distances predicted for the same traffic sign in two or more successive frames. We tried to improve our results by adding the use of depth map information and by exploiting the concept of temporal frame correlation.

Depth map [16]: The idea is that traffic signs at the same depth in the real world are more or less at the same distance from the vehicle. Based on this observation, we use a pre-trained network called MiDaS [17, 18] to obtain the depth map of the image under examination. Once the bounding boxes are detected in the original image and the distances are computed, we project the bounding boxes onto the depth map. For each group of traffic signs at the same depth, allowing a small variance based on the maximum depth value inside the image, we compute the average of the distances predicted in the original image to obtain a uniform value. At the moment we apply this method after the computation of the distances, but it could also be used in the creation of the dataset to obtain more detailed labels, or in the training phase to directly stabilize the results of the network.

Figure 2 shows a representation of this method: looking at the traffic signs in the image, it is now visible from the depth map coloration that they are at the same distance.
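A minimal sketch of the two stabilization steps (the depth-based averaging described here and the temporal frame correlation anticipated in the introduction), assuming each detection is a dict carrying a predicted distance and, for the depth step, a depth value sampled from the depth map. The data layout, the relative tolerance, and the pixel threshold are our own assumptions, not values from the paper:

```python
import math

# Depth-based averaging: signs whose depth-map values are close (within a
# tolerance scaled by the maximum depth in the image) get their predicted
# distances replaced by the group average.
def stabilize_with_depth(detections, rel_tol=0.05):
    if not detections:
        return []
    tol = rel_tol * max(d["depth"] for d in detections)
    ordered = sorted(detections, key=lambda d: d["depth"])
    groups, group = [], [ordered[0]]
    for det in ordered[1:]:
        if det["depth"] - group[-1]["depth"] <= tol:
            group.append(det)
        else:
            groups.append(group)
            group = [det]
    groups.append(group)
    out = []
    for g in groups:
        mean = sum(d["distance"] for d in g) / len(g)
        out.extend({**d, "distance": mean} for d in g)
    return out

# Temporal frame correlation: keep a detection at time t only if a box with
# a nearby centre exists at t-1 and t+1, and average the three distances.
def _centre(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def _match(det, frame, threshold):
    cx, cy = _centre(det["box"])
    best, best_d = None, threshold
    for other in frame:
        ox, oy = _centre(other["box"])
        d = math.hypot(cx - ox, cy - oy)
        if d < best_d:
            best, best_d = other, d
    return best

def correlate(prev_frame, cur_frame, next_frame, threshold=30.0):
    kept = []
    for det in cur_frame:
        before = _match(det, prev_frame, threshold)
        after = _match(det, next_frame, threshold)
        if before is None or after is None:
            continue  # not present in both neighbours: likely a false positive
        avg = (before["distance"] + det["distance"] + after["distance"]) / 3.0
        kept.append({**det, "distance": avg})
    return kept
```

In this sketch a detection that appears only at time t is simply dropped, which mirrors the false-positive filtering described in the text; the matching is a nearest-centre search under a fixed pixel threshold.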
Figure 2: Example of a depth map obtained with the MiDaS network.

So, thanks to this, their predictions are now corrected to the same value.

Temporal frame correlation: We use this technique to give linearity to the distance predictions over a sequence of frames. Going through this method, we noticed that in some cases the network's predictions differed considerably between successive frames. To stabilize the predictions, we decided that, given a traffic sign in a frame at time t, if it is also present at times t-1 and t+1, it is a valid object to consider for time t, and its distance is the average over the 3 frames in sequence. To verify whether the same traffic sign is present in the 3 subsequent frames, we first find the center of its bounding box at time t and of all the traffic signs in the previous and following frames. We then compute the distances between these centers: if a distance is lower than a certain threshold, we are looking at the same traffic sign.

An example of this concept is given in Figure 3, in which there is a wrong detection at frame t (red circle in the top right image); since this wrong prediction is present neither at frame t-1 nor at frame t+1, it is discarded at frame t.

The architecture of this modified network is represented in Figure 1b.

4. Training

Regarding the training phase, due to time and resource constraints we were unable to train the networks for long sessions. We trained YOLOv4 for about 4000 iterations using RGB frames from the German Traffic Sign Dataset. For the distance prediction network (DPN), all of its components are trained together; we trained it on our dataset for 560 epochs using RGB frames. As training parameters, we used a learning rate starting from 0.001 with the ADAM optimizer, a minibatch size of 16, and the Smooth L1 loss.

5. Results

Regarding the detection part, with YOLO we reach an accuracy of around 91%. For the distance prediction network, instead, it is not possible to compute a true accuracy, but we reach a loss of more or less 130, visible in the graph in Figure 4, which shows a trend that would continue to improve if the network were trained for more epochs.

As evaluation metrics, we used the ones provided by [7]. In particular, we use the RMSE on predictions binned by meters:

RMSE = sqrt( (1/N) · Σ_{i=1..N} ‖d_i − d*_i‖² )

where d_i is the predicted distance and d*_i the ground-truth one. This shows how the behavior of the network changes with respect to the distance of the detected object. Results are represented in the graph in Figure 5, compared with the ones obtained by the reference paper. Visibly, predictions get worse as distances increase: we notice that the bounding boxes of traffic signs at larger distances do not match their dimensions perfectly, which introduces an error. Another source of error is probably the fact that we have only a few samples of road signs at large distances.

Table 1 compares our results with the ones of the reference paper. The results are similar; ours are a little better, since lower values represent better predictions. This is because we make predictions only on traffic signs, while they predict on cars, cyclists, and pedestrians, which means that they have a larger margin of error than us.

To show the method in action, we made some test videos, available on YouTube, of the network at work. In particular, we made videos with the following characteristics:

• Test video using the base network without depth map and temporal frame correlation (daylight conditions)
• Test video using depth map and temporal frame correlation (daylight conditions)
• Test video using the base network without depth map and temporal frame correlation, rounded on 5 meters (daylight conditions)
• Test video using depth map and temporal frame correlation, rounded on 5 meters (daylight conditions)
• Test video using depth map and temporal frame correlation, rounded on 5 meters (night conditions)

"Rounded on 5 meters" means that we approximate the predictions to the nearest multiple of 5 meters to get more stable results: for example, 12.4 meters is rounded to 10 meters, while 12.6 meters is rounded to 15 meters.

Figure 3: Example of temporal frame correlation in the case of wrong predictions.

Figure 4: Graph of the loss function of the distance prediction network.

Table 1
Comparison of results between our implementation and the one of the paper we take as reference.

Method             Abs Rel   Squa Rel   RMSE    RMSE(log)
Our base model     0.131     0.468      3.126   0.173
Paper base model   0.251     1.844      6.870   0.314

The formulas used in the table are the following:

RMSE(log) = sqrt( (1/N) · Σ_{i=1..N} ‖log(d_i) − log(d*_i)‖² )

Abs Relative Difference = (1/N) · Σ_{i=1..N} |d_i − d*_i| / d*_i

Squared Relative Difference = (1/N) · Σ_{i=1..N} ‖d_i − d*_i‖² / d*_i

Figure 6 shows two examples of predictions on images. In the top image, the distances are predicted without the use of the depth map and temporal frame correlation; the predictions do not seem reliable and appear quite random. The bottom image, instead, is produced using our two variations. As visible, all the detected signs are more or less at the same depth; this is not taken into account in the top image, while in the bottom one, thanks to the depth map, their predictions are adjusted correctly.

6. Conclusion

The method seems to work well; however, some errors are introduced by the labels of our dataset, which are not perfectly accurate.
Figure 5: Graphs relating the predictions at given distances (in meters) to the error with respect to the true values. (a) Our meters-RMSE predictions graph. (b) Reference paper meters-RMSE predictions graph.
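The metrics behind Table 1 and the meters-RMSE graphs, together with the "rounded on 5 meters" post-processing, can be sketched as follows; the function names and the 5 m bin width for the per-distance curves are our own choices:

```python
import math

# Evaluation metrics from the reference paper: RMSE, RMSE(log), absolute
# relative difference and squared relative difference, over predicted
# distances `pred` and ground-truth distances `true`.
def rmse(pred, true):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

def rmse_log(pred, true):
    return math.sqrt(sum((math.log(p) - math.log(t)) ** 2
                         for p, t in zip(pred, true)) / len(pred))

def abs_rel(pred, true):
    return sum(abs(p - t) / t for p, t in zip(pred, true)) / len(pred)

def sq_rel(pred, true):
    return sum((p - t) ** 2 / t for p, t in zip(pred, true)) / len(pred)

# Per-distance-bin RMSE, as in the meters-RMSE graphs: group samples by
# ground-truth distance and compute the RMSE separately in each bin.
def rmse_by_metre_bin(pred, true, bin_width=5.0):
    bins = {}
    for p, t in zip(pred, true):
        bins.setdefault(int(t // bin_width), []).append((p, t))
    return {b * bin_width: rmse([p for p, _ in ps], [t for _, t in ps])
            for b, ps in sorted(bins.items())}

# "Rounded on 5 meters": snap a prediction to the nearest multiple of 5 m,
# so 12.4 m becomes 10 m and 12.6 m becomes 15 m.  Note that Python's
# round() resolves exact midpoints (e.g. 12.5) by rounding to even.
def round_to_step(distance_m, step=5.0):
    return round(distance_m / step) * step
```

The binning choice mirrors the 5 m rounding step, but any bin width can be used when plotting the error against distance.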
These errors are caused by the possibly different dimensions of each traffic sign on the road, which introduce a small error that then propagates throughout the process, even though we tried to mitigate it using the depth map and temporal frame correlation. So, the main future step could be using more accurate labels for the samples inside the dataset. The work is based on the objects detected and enclosed by bounding boxes, but it is not guaranteed that their dimensions always match the sizes of the traffic signs perfectly, and this introduces errors in the predictions of the network. As said at the beginning, in Italy the same traffic sign can be used in up to 3 different dimensions, so it could be useful to infer their dimensions to improve the predicted distances. As a future improvement, the detected objects could also be extended to vehicles and pedestrians.

Figure 6: (Top image) Example of predictions without depth map and temporal frame correlation. (Bottom image) Example of predictions using depth map and temporal frame correlation.

Acknowledgments

This work has been developed at is.Lab() Intelligent Systems Laboratory at the Department of Computer, Control, and Management Engineering, Sapienza University of Rome (https://islab.diag.uniroma1.it). The work has also been partially supported by the Italian Ministerial grant PRIN 2022 "ISIDE: Intelligent Systems for Infrastructural Diagnosis in smart-concretE", n. 2022S88WAY - CUP B53D2301318, and by the Age-It: Ageing Well in an ageing society project, task 9.4.1 work package 4 spoke 9, within topic 8 extended partnership 8, under the National Recovery and Resilience Plan (PNRR), Mission 4 Component 2 Investment 1.3 - Call for tender No. 1557 of 11/10/2022 of the Italian Ministry of University and Research funded by the European Union - NextGenerationEU, CUP B53C22004090006.

References

[1] V. Ponzi, S. Russo, A. Wajda, R. Brociek, C. Napoli, Analysis pre and post COVID-19 pandemic Rorschach test data of using EM algorithms and GMM models, volume 3360, 2022, pp. 55–63.
[2] G. Capizzi, C. Napoli, L. Paternò, An innovative hybrid neuro-wavelet method for reconstruction of missing data in astronomical photometric surveys, 7267 LNAI (2012), pp. 21–29. doi:10.1007/978-3-642-29347-4_3.
[3] A. Alfarano, G. De Magistris, L. Mongelli, S. Russo, J. Starczewski, C. Napoli, A novel ConvMixer transformer based architecture for violent behavior detection, 14126 LNAI (2023), pp. 3–16. doi:10.1007/978-3-031-42508-0_1.
[4] J. Zhu, Y. Fang, H. Abu-Haimed, K.-C. Lien, D. Fu, J. Gu, Learning Object-specific Distance from a Monocular Image, 2019. URL: http://arxiv.org/abs/1909.04182. doi:10.48550/arXiv.1909.04182.
[5] S. Tuohy, D. O'Cualain, E. Jones, M. Glavin, Distance determination for an automobile environment using Inverse Perspective Mapping in OpenCV, in: IET Irish Signals and Systems Conference (ISSC 2010), 2010, pp. 100–105. doi:10.1049/cp.2010.0495.
[6] X. Sun, Y. Jiang, Y. Ji, W. Fu, S. Yan, Q. Chen, B. Yu, X. Gan, Distance Measurement System Based on Binocular Stereo Vision, IOP Conference Series: Earth and Environmental Science 252 (2019) 052051. doi:10.1088/1755-1315/252/5/052051.
[7] DisNet: A novel method for distance estimation from monocular camera. URL: https://patrick-llgc.github.io/Learning-Deep-Learning/paper_notes/disnet.html.
[8] S. Saleh, S. Khwandah, A. Heller, W. Hardt, A. Mumtaz, Traffic Signs Recognition and Distance Estimation using a Monocular Camera, 2019.
[9] S. Russo, C. Napoli, A comprehensive solution for psychological treatment and therapeutic path planning based on knowledge base and expertise sharing, volume 2472, 2019, pp. 41–47.
[10] G. Lo Sciuto, S. Russo, C. Napoli, A cloud-based flexible solution for psychometric tests validation, administration and evaluation, volume 2468, 2019, pp. 16–21.
[11] A. Bochkovskiy, C.-Y. Wang, H.-Y. M. Liao, YOLOv4: Optimal Speed and Accuracy of Object Detection, 2020. URL: http://arxiv.org/abs/2004.10934. doi:10.48550/arXiv.2004.10934.
[12] R. Girshick, Fast R-CNN, in: 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448. doi:10.1109/ICCV.2015.169.
[13] A. Youssef, D. Albani, D. Nardi, D. Bloisi, Fast Traffic Sign Recognition Using Color Segmentation and Deep Convolutional Networks, volume 10016, 2016. doi:10.1007/978-3-319-48680-2_19.
[14] K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, 2015. URL: http://arxiv.org/abs/1409.1556. doi:10.48550/arXiv.1409.1556.
[15] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, 2014. URL: http://arxiv.org/abs/1311.2524. doi:10.48550/arXiv.1311.2524.
[16] C. Godard, O. Mac Aodha, M. Firman, G. Brostow, Digging Into Self-Supervised Monocular Depth Estimation, 2019. URL: http://arxiv.org/abs/1806.01260. doi:10.48550/arXiv.1806.01260.
[17] R. Ranftl, A. Bochkovskiy, V. Koltun, Vision Transformers for Dense Prediction, 2021. URL: http://arxiv.org/abs/2103.13413. doi:10.48550/arXiv.2103.13413.
[18] R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, V. Koltun, Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer, 2020. URL: http://arxiv.org/abs/1907.01341. doi:10.48550/arXiv.1907.01341.