=Paper=
{{Paper
|id=Vol-3695/p03
|storemode=property
|title=Distance Estimation of Fixed Objects in Driving Environments
|pdfUrl=https://ceur-ws.org/Vol-3695/p03.pdf
|volume=Vol-3695
|authors=Giorgio Leporoni,Valerio Ponzi,Francesco Pro,Christian Napoli
|dblpUrl=https://dblp.org/rec/conf/system/LeporoniPP023
}}
==Distance Estimation of Fixed Objects in Driving Environments==
Giorgio Leporoni¹, Valerio Ponzi¹,², Francesco Pro¹ and Christian Napoli¹,²
¹ Department of Computer, Control and Management Engineering, Sapienza University of Rome, Via Ariosto 25, Roma, 00185, Italy
² Institute for Systems Analysis and Computer Science, Italian National Research Council, Via dei Taurini 19, Roma, 00185, Italy
Abstract
Autonomous driving is a highly relevant topic today, particularly among major car manufacturers attempting to lead in
technological innovation and enhance driving safety. An autonomous vehicle must possess the capability to sense its
environment and navigate without human intervention. Thus, it serves as both a driver support system and, in some cases, a
substitute. A crucial aspect involves identifying the positions of pedestrians, traffic signs, traffic lights, and other vehicles while
computing distances from them. This enables the vehicle to emit alerts to the driver in potentially dangerous situations, such
as impending obstacles due to external factors or driver distraction. In this paper, we introduce an approach for identifying
traffic signs and determining the distance from them. Our method utilizes the YOLOv4 network for identification and a
customized network for distance computation. This integration of AI technologies facilitates the timely detection of hazards
and enables proactive alert mechanisms, thereby advancing the capabilities of autonomous vehicles and enhancing driving
safety.
Keywords
Machine Learning, Deep Learning, YOLO, Autonomous Driving
SYSTEM 2023: 9th Scholar's Yearly Symposium of Technology, Engineering and Mathematics, Rome, December 3-6, 2023
leporoni.1944533@studenti.uniroma1.it (G. Leporoni); ponzi@diag.uniroma1.it (V. Ponzi); pro.1944191@studenti.uniroma1.it (F. Pro); cnapoli@diag.uniroma1.it (C. Napoli)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073
Giorgio Leporoni et al., CEUR Workshop Proceedings 17–24

1. Introduction

Road safety is a major global concern, impacting the well-being of individuals and communities worldwide. The development and adoption of advanced technologies, such as driver assistance systems and autonomous vehicles, offers significant potential to further enhance road safety in the long term. This is possible by creating systems based on cameras or sensors mounted on the vehicles that process the acquired images and can identify the typical objects of a road environment, performing computations on them such as estimating their distances. In this way, the vehicle can make quick decisions autonomously in case of necessity. A classical example is a stop sign ahead: if the system detects that the driver is not reducing speed, it can brake the vehicle autonomously or simply alert the driver with acoustic signals.

In recent years, attempts have begun to approach this field of research by exploiting artificial intelligence. Previous methods involved the use of geometry under the assumption of fixed dimensions for objects such as vehicles. Other methods were based on IPM (Inverse Perspective Mapping), using the lines present on the carriageway; these methods all depend on the parameters of the camera used.

One of the main problems in this field of research is the dataset. Since this is a safety-critical area, to be sure of the system's accuracy the dataset should be composed of a huge number of samples representing different objects in very different contexts [1, 2, 3]. What we did, therefore, was record video on short routes from a dash cam mounted on our vehicle, extract frames, compute the ground truth on them in an automated way, and finally build an ad hoc dataset for our needs.

In this paper, we focus on computing the distances between the vehicle and the detected traffic signs using single images captured by a monocular camera. We decided to use this type of camera because it is the most common and affordable. The method foresees two phases: one for the detection of the traffic signs in the captured images, and a second phase for inferring distances from them. For this second phase we built a network based on a recent paper [4] that tries to solve the problem with a purely learning-based approach.

Our main contributions arise from our endeavor to create an automated system tailored to our needs. Initially, we integrated YOLOv4 to produce bounding boxes around traffic signs, facilitating the automatic identification of their positions within images, thus concluding the initial phase of our approach. Subsequently, we directed our efforts towards developing a specialized dataset to address our specific problem, as existing datasets did not fulfill our requirements. Building upon our initial findings, we sought to enhance our system by implementing two stabilization methods for predicted distances. The first method entails generating and utilizing depth maps for each frame, enhancing the accuracy of distance measurements between signs located at the same depth. The second method capitalizes on temporal frame correlation, enhancing the smoothness and consistency of our system, and thereby augmenting its overall performance.

The use of depth maps helps us obtain more accurate measurements between signs that are located at the same depth. Temporal frame correlation instead helps us to filter out some false positive predictions, keeping a bounding box if and only if it appears in both the previous and the next frames, and to get more stable distance predictions for successive frames.

The major car manufacturers are at the forefront in this field. Taking Tesla as an example, it uses a huge number of sensors and cameras mounted on its vehicles, which implies that the car must be manufactured that way. With methods like ours, one can simply mount a camera, such as a dash cam, inside the vehicle as a driving aid. Furthermore, as in the reference paper, we tried to implement a method that is not bound to the parameters of the camera used. For example, IPM methods are bound to the height of the camera from the ground; in our case, instead, the driver does not have to worry about the position in which the camera is mounted, and the same setup can easily be used on different vehicles, building a simple and portable system usable with any camera.

2. Related works

Inverse Perspective Mapping [5] consists of removing the perspective distortion from the road surface, taking the lane lines as reference to compute distances under the assumption that they have a fixed size. In this method, a bird's eye view of the roadway is computed to establish the correspondence between a pixel dimension and the lane line size. This correspondence is then used to count the pixels between an object and the vehicle, obtaining the approximate distance. This method has problems in the presence of road curves, or when lane markings are barely visible or absent. In addition, it is very dependent on the camera parameters.

Stereo vision [6]: This method foresees the use of a stereo camera that generates two images, a left and a right view. From these two images of the same environment a disparity map is generated using epipolar geometry. With a simple formula applied to the generated map it is possible to compute, for each pixel of the 2D image, the z coordinate that gives the depth of the object at that pixel in the real 3D world. The main problem with this method is the high cost of the stereo camera.

AI-based approach [7]: This method applies deep learning to monocular images. Starting from labeled data, a neural network is trained to compute distances from object bounding boxes (DisNet).

Geometry approach [8]: Other papers are based on the assumption of fixed sizes for known objects, such as vehicles. In this way, knowing the camera parameters, a formula can be used to compute distances [9, 10].

3. Our approach

Our approach focuses on Italian road signs. In Italy, for each category of sign there is a most commonly used size, so once we classified the surveyed sign, we assumed that its size was the common one.

To approach the problem, we started by creating our dataset from scratch. To accomplish this task, we used a dash cam mounted on our vehicle, recording routes around the city to obtain roughly 3 hours of footage. We then filtered out all unsuitable videos; from the remaining videos we extracted about 1500 frames representing the roads around the city. We cropped each frame along the vertical axis because a portion of the vehicle interior was visible, removing useless information.

For object detection, we needed a quick solution to avoid wasting time in the whole process, so we chose YOLOv4 (You Only Look Once) [11] because it runs much faster than alternatives such as R-CNN [12] or methods based on color segmentation [13]. We downloaded a pre-trained YOLO network on which we performed transfer learning on a German Traffic Sign dataset, training for 4000 iterations. During the transfer learning phase we also tried some image pre-processing techniques, such as grayscale conversion and histogram equalization, which unfortunately yielded bad results. In the end, the network reached an accuracy of about 91%.

With the YOLO network, we obtained the bounding boxes of the traffic signs for each frame, manually discarding all the frames without detected objects or with wrong detections. To get the ground-truth distance of each bounding box we use the following formula:

distance = (Width_cm · FocalLength) / Width_px

It is based on the focal length of the camera, which we obtained by taking a picture of an object of known size placed at a known distance and counting the pixels that the object occupies in the image. This is the only camera parameter that was necessary to create the dataset. In particular, the width of the triangular and octagonal signs used is 90 cm, while it is 60 cm for the square and circular ones.

Through this process, we built a dataset composed of 959 images.
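The ground-truth computation above can be sketched in a few lines of Python. The function names, the shape-to-width table, and the calibration helper are our own illustrative choices, not code from the paper:

```python
# Sketch of the ground-truth distance formula used above (pinhole model):
# distance = Width_cm * FocalLength / Width_px.
# Names and data layout are illustrative, not taken from the paper's code.

# Common Italian sign widths used in the paper: 90 cm for triangular and
# octagonal signs, 60 cm for square and circular ones.
SIGN_WIDTH_CM = {"triangle": 90.0, "octagon": 90.0, "square": 60.0, "circle": 60.0}

def calibrate_focal_length_px(known_width_cm, known_distance_cm, measured_width_px):
    """Recover the focal length in pixels from a single picture of an object
    of known size placed at a known distance, as described above."""
    return measured_width_px * known_distance_cm / known_width_cm

def sign_distance_cm(shape, bbox_width_px, focal_length_px):
    """Distance to a sign from the width of its bounding box in pixels."""
    return SIGN_WIDTH_CM[shape] * focal_length_px / bbox_width_px
```

For instance, with a focal length of 600 px, a 90 cm triangular sign whose bounding box is 90 px wide is estimated at 600 cm, i.e. 6 m.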
Figure 1: Schematic representations of the comprehensive distance computation system. (a) Predictive model for traffic sign distance computation: the input image with bounding boxes undergoes VGG16 feature extraction, ROI pooling for size standardization, and a three-layer feedforward network that predicts the distance using a softplus activation. (b) Enhanced model integrating depth map information and temporal frame correlation for stabilized predictions: the input image with bounding boxes is processed through VGG16, ROI pooling, and a modified three-layer feedforward network, leading to improved distance accuracy.

After the creation of the dataset, we focused on the detection part. For this purpose, we used YOLOv4 as mentioned above.

After obtaining the bounding boxes for an image, the image is passed to a specific network for the distance computation. This second network is composed of a CNN (VGG16) [14] for feature map extraction; the resulting feature map is then combined with the bounding box information through an ROI pooling layer [15]. This layer is necessary because the bounding boxes in a single image can have different sizes: the layer removes this difference by standardizing their dimensions. The output of the ROI pooling is finally passed to a feedforward network, composed of 3 layers (2048, 512, 1), that predicts distances using a softplus activation function. The architecture of the network is shown in Figure 1a.

By testing the entire process on different videos, we noticed that this method was not stable in the predictions made between successive frames: in some cases there was a large variance between the distances predicted for the same traffic sign in two or more successive frames. We tried to improve our results by adding the use of depth map information and by exploiting the concept of temporal frame correlation.

Depth map [16]: The idea is that traffic signs at the same depth in the real world are more or less at the same distance from the vehicle. Based on this observation, we use a pre-trained network called MiDaS [17, 18] to obtain the depth map of the image under examination. Once the bounding boxes are detected in the original image and the distances are computed, we project the bounding boxes onto the depth map. For each group of traffic signs at the same depth, allowing a small variance based on the maximum depth value inside the image, we compute the average of the distances predicted in the original image to obtain a uniform value. At the moment we apply this method after the computation of the distances, but it could also be used in the creation of the dataset to obtain more detailed labels, or in the training phase to directly stabilize the results of the network.

Figure 2 shows a representation of this method: looking at the traffic signs in the image, it is now visible from the depth map coloration that they are at the same distance.
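A minimal sketch of the two stabilization steps (the depth-based averaging described here and the temporal frame correlation anticipated in the introduction), assuming each detection is a dict carrying a predicted distance and, for the depth step, a depth value sampled from the depth map. The data layout, the relative tolerance, and the pixel threshold are our own assumptions, not values from the paper:

```python
import math

# Depth-based averaging: signs whose depth-map values are close (within a
# tolerance scaled by the maximum depth in the image) get their predicted
# distances replaced by the group average.
def stabilize_with_depth(detections, rel_tol=0.05):
    if not detections:
        return []
    tol = rel_tol * max(d["depth"] for d in detections)
    ordered = sorted(detections, key=lambda d: d["depth"])
    groups, group = [], [ordered[0]]
    for det in ordered[1:]:
        if det["depth"] - group[-1]["depth"] <= tol:
            group.append(det)
        else:
            groups.append(group)
            group = [det]
    groups.append(group)
    out = []
    for g in groups:
        mean = sum(d["distance"] for d in g) / len(g)
        out.extend({**d, "distance": mean} for d in g)
    return out

# Temporal frame correlation: keep a detection at time t only if a box with
# a nearby centre exists at t-1 and t+1, and average the three distances.
def _centre(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def _match(det, frame, threshold):
    cx, cy = _centre(det["box"])
    best, best_d = None, threshold
    for other in frame:
        ox, oy = _centre(other["box"])
        d = math.hypot(cx - ox, cy - oy)
        if d < best_d:
            best, best_d = other, d
    return best

def correlate(prev_frame, cur_frame, next_frame, threshold=30.0):
    kept = []
    for det in cur_frame:
        before = _match(det, prev_frame, threshold)
        after = _match(det, next_frame, threshold)
        if before is None or after is None:
            continue  # not present in both neighbours: likely a false positive
        avg = (before["distance"] + det["distance"] + after["distance"]) / 3.0
        kept.append({**det, "distance": avg})
    return kept
```

In this sketch a detection that appears only at time t is simply dropped, which mirrors the false-positive filtering described in the text; the matching is a nearest-centre search under a fixed pixel threshold.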
Figure 2: Example of a depth map obtained with the MiDaS network.

So, thanks to this, their predictions are now corrected to the same value.

Temporal frame correlation: We use this technique to give linearity to the distance predictions over a sequence of frames. Going through this method, we noticed that in some cases the network's predictions differed considerably between successive frames. To stabilize the predictions, we decided that, given a traffic sign in a frame at time t, if it is also present at times t-1 and t+1, it is a valid object to consider for time t, and its distance is the average over the 3 frames in sequence. To verify whether the same traffic sign is present in the 3 subsequent frames, we first find the center of its bounding box at time t and of all the traffic signs in the previous and following frames. We then compute the distances between these centers: if a distance is lower than a certain threshold, we are looking at the same traffic sign.

An example of this concept is given in Figure 3, in which there is a wrong detection at frame t (red circle in the top right image); since this wrong prediction is present neither at frame t-1 nor at frame t+1, it is discarded at frame t.

The architecture of this modified network is represented in Figure 1b.

4. Training

Regarding the training phase, due to time and resource constraints we were unable to train the networks for long sessions. We trained YOLOv4 for about 4000 iterations using RGB frames from the German Traffic Sign Dataset. For the distance prediction network (DPN), all of its components are trained together; we trained it on our dataset for 560 epochs using RGB frames. As training parameters, we used a learning rate starting from 0.001 with the ADAM optimizer, a minibatch size of 16, and the Smooth L1 loss.

5. Results

Regarding the detection part, with YOLO we reach an accuracy of around 91%. For the distance prediction network, instead, it is not possible to compute a true accuracy, but we reach a loss of more or less 130, visible in the graph in Figure 4, which shows a trend that would continue to improve if the network were trained for more epochs.

As evaluation metrics, we used the ones provided by [7]. In particular, we use the RMSE on predictions binned by meters:

RMSE = sqrt( (1/N) · Σ_{i=1..N} ‖d_i − d*_i‖² )

where d_i is the predicted distance and d*_i the ground-truth one. This shows how the behavior of the network changes with respect to the distance of the detected object. Results are represented in the graph in Figure 5, compared with the ones obtained by the reference paper. Visibly, predictions get worse as distances increase: we notice that the bounding boxes of traffic signs at larger distances do not match their dimensions perfectly, which introduces an error. Another source of error is probably the fact that we have only a few samples of road signs at large distances.

Table 1 compares our results with the ones of the reference paper. The results are similar; ours are a little better, since lower values represent better predictions. This is because we make predictions only on traffic signs, while they predict on cars, cyclists, and pedestrians, which means that they have a larger margin of error than us.

To show the method in action, we made some test videos, available on YouTube, of the network at work. In particular, we made videos with the following characteristics:

• Test video using the base network without depth map and temporal frame correlation (daylight conditions)
• Test video using depth map and temporal frame correlation (daylight conditions)
• Test video using the base network without depth map and temporal frame correlation, rounded on 5 meters (daylight conditions)
• Test video using depth map and temporal frame correlation, rounded on 5 meters (daylight conditions)
• Test video using depth map and temporal frame correlation, rounded on 5 meters (night conditions)

"Rounded on 5 meters" means that we approximate the predictions to the nearest multiple of 5 meters to get more stable results: for example, 12.4 meters is rounded to 10 meters, while 12.6 meters is rounded to 15 meters.

Figure 3: Example of temporal frame correlation in the case of wrong predictions.

Figure 4: Graph of the loss function of the distance prediction network.

Table 1
Comparison of results between our implementation and the one of the paper we take as reference.

Method             Abs Rel   Squa Rel   RMSE    RMSE(log)
Our base model     0.131     0.468      3.126   0.173
Paper base model   0.251     1.844      6.870   0.314

The formulas used in the table are the following:

RMSE(log) = sqrt( (1/N) · Σ_{i=1..N} ‖log(d_i) − log(d*_i)‖² )

Abs Relative Difference = (1/N) · Σ_{i=1..N} |d_i − d*_i| / d*_i

Squared Relative Difference = (1/N) · Σ_{i=1..N} ‖d_i − d*_i‖² / d*_i

Figure 6 shows two examples of predictions on images. In the top image, the distances are predicted without the use of the depth map and temporal frame correlation; the predictions do not seem reliable and appear quite random. The bottom image, instead, is produced using our two variations. As visible, all the detected signs are more or less at the same depth; this is not taken into account in the top image, while in the bottom one, thanks to the depth map, their predictions are adjusted correctly.

6. Conclusion

The method seems to work well; however, some errors are introduced by the labels of our dataset, which are not perfectly accurate.
Figure 5: Graphs relating the predictions at given distances (in meters) to the error with respect to the true values. (a) Our meters-RMSE predictions graph. (b) Reference paper meters-RMSE predictions graph.
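The metrics behind Table 1 and the meters-RMSE graphs, together with the "rounded on 5 meters" post-processing, can be sketched as follows; the function names and the 5 m bin width for the per-distance curves are our own choices:

```python
import math

# Evaluation metrics from the reference paper: RMSE, RMSE(log), absolute
# relative difference and squared relative difference, over predicted
# distances `pred` and ground-truth distances `true`.
def rmse(pred, true):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

def rmse_log(pred, true):
    return math.sqrt(sum((math.log(p) - math.log(t)) ** 2
                         for p, t in zip(pred, true)) / len(pred))

def abs_rel(pred, true):
    return sum(abs(p - t) / t for p, t in zip(pred, true)) / len(pred)

def sq_rel(pred, true):
    return sum((p - t) ** 2 / t for p, t in zip(pred, true)) / len(pred)

# Per-distance-bin RMSE, as in the meters-RMSE graphs: group samples by
# ground-truth distance and compute the RMSE separately in each bin.
def rmse_by_metre_bin(pred, true, bin_width=5.0):
    bins = {}
    for p, t in zip(pred, true):
        bins.setdefault(int(t // bin_width), []).append((p, t))
    return {b * bin_width: rmse([p for p, _ in ps], [t for _, t in ps])
            for b, ps in sorted(bins.items())}

# "Rounded on 5 meters": snap a prediction to the nearest multiple of 5 m,
# so 12.4 m becomes 10 m and 12.6 m becomes 15 m.  Note that Python's
# round() resolves exact midpoints (e.g. 12.5) by rounding to even.
def round_to_step(distance_m, step=5.0):
    return round(distance_m / step) * step
```

The binning choice mirrors the 5 m rounding step, but any bin width can be used when plotting the error against distance.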
These errors are caused by the possibly different dimensions of each traffic sign on the road, which introduce a small error that then propagates throughout the process, even though we tried to mitigate it using the depth map and temporal frame correlation. So, the main future step could be using more accurate labels for the samples inside the dataset. The work is based on the objects detected and enclosed by bounding boxes, but it is not guaranteed that their dimensions always match the sizes of the traffic signs perfectly, and this introduces errors in the predictions of the network. As said at the beginning, in Italy the same traffic sign can be used in up to 3 different dimensions, so it could be useful to infer their dimensions to improve the predicted distances. As a future improvement, the detected objects could also be extended to vehicles and pedestrians.

Figure 6: (Top image) Example of predictions without depth map and temporal frame correlation. (Bottom image) Example of predictions using depth map and temporal frame correlation.

Acknowledgments

This work has been developed at is.Lab() Intelligent Systems Laboratory at the Department of Computer, Control, and Management Engineering, Sapienza University of Rome (https://islab.diag.uniroma1.it). The work has also been partially supported by the Italian Ministerial grant PRIN 2022 "ISIDE: Intelligent Systems for Infrastructural Diagnosis in smart-concretE", n. 2022S88WAY - CUP B53D2301318, and by the Age-It: Ageing Well in an ageing society project, task 9.4.1 work package 4 spoke 9, within topic 8 extended partnership 8, under the National Recovery and Resilience Plan (PNRR), Mission 4 Component 2 Investment 1.3 - Call for tender No. 1557 of 11/10/2022 of the Italian Ministry of University and Research funded by the European Union - NextGenerationEU, CUP B53C22004090006.

References

[1] V. Ponzi, S. Russo, A. Wajda, R. Brociek, C. Napoli, Analysis pre and post COVID-19 pandemic Rorschach test data of using EM algorithms and GMM models, volume 3360, 2022, pp. 55–63.
[2] G. Capizzi, C. Napoli, L. Paternò, An innovative hybrid neuro-wavelet method for reconstruction of missing data in astronomical photometric surveys, 7267 LNAI (2012), pp. 21–29. doi:10.1007/978-3-642-29347-4_3.
[3] A. Alfarano, G. De Magistris, L. Mongelli, S. Russo, J. Starczewski, C. Napoli, A novel ConvMixer transformer based architecture for violent behavior detection, 14126 LNAI (2023), pp. 3–16. doi:10.1007/978-3-031-42508-0_1.
[4] J. Zhu, Y. Fang, H. Abu-Haimed, K.-C. Lien, D. Fu, J. Gu, Learning Object-specific Distance from a Monocular Image, 2019. URL: http://arxiv.org/abs/1909.04182. doi:10.48550/arXiv.1909.04182.
[5] S. Tuohy, D. O'Cualain, E. Jones, M. Glavin, Distance determination for an automobile environment using Inverse Perspective Mapping in OpenCV, in: IET Irish Signals and Systems Conference (ISSC 2010), 2010, pp. 100–105. doi:10.1049/cp.2010.0495.
[6] X. Sun, Y. Jiang, Y. Ji, W. Fu, S. Yan, Q. Chen, B. Yu, X. Gan, Distance Measurement System Based on Binocular Stereo Vision, IOP Conference Series: Earth and Environmental Science 252 (2019) 052051. doi:10.1088/1755-1315/252/5/052051.
[7] DisNet: A novel method for distance estimation from monocular camera. URL: https://patrick-llgc.github.io/Learning-Deep-Learning/paper_notes/disnet.html.
[8] S. Saleh, S. Khwandah, A. Heller, W. Hardt, A. Mumtaz, Traffic Signs Recognition and Distance Estimation using a Monocular Camera, 2019.
[9] S. Russo, C. Napoli, A comprehensive solution for psychological treatment and therapeutic path planning based on knowledge base and expertise sharing, volume 2472, 2019, pp. 41–47.
[10] G. Lo Sciuto, S. Russo, C. Napoli, A cloud-based flexible solution for psychometric tests validation, administration and evaluation, volume 2468, 2019, pp. 16–21.
[11] A. Bochkovskiy, C.-Y. Wang, H.-Y. M. Liao, YOLOv4: Optimal Speed and Accuracy of Object Detection, 2020. URL: http://arxiv.org/abs/2004.10934. doi:10.48550/arXiv.2004.10934.
[12] R. Girshick, Fast R-CNN, in: 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448. doi:10.1109/ICCV.2015.169.
[13] A. Youssef, D. Albani, D. Nardi, D. Bloisi, Fast Traffic Sign Recognition Using Color Segmentation and Deep Convolutional Networks, volume 10016, 2016. doi:10.1007/978-3-319-48680-2_19.
[14] K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, 2015. URL: http://arxiv.org/abs/1409.1556. doi:10.48550/arXiv.1409.1556.
[15] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, 2014. URL: http://arxiv.org/abs/1311.2524. doi:10.48550/arXiv.1311.2524.
[16] C. Godard, O. Mac Aodha, M. Firman, G. Brostow, Digging Into Self-Supervised Monocular Depth Estimation, 2019. URL: http://arxiv.org/abs/1806.01260. doi:10.48550/arXiv.1806.01260.
[17] R. Ranftl, A. Bochkovskiy, V. Koltun, Vision Transformers for Dense Prediction, 2021. URL: http://arxiv.org/abs/2103.13413. doi:10.48550/arXiv.2103.13413.
[18] R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, V. Koltun, Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer, 2020. URL: http://arxiv.org/abs/1907.01341. doi:10.48550/arXiv.1907.01341.