<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Distance Estimation of Fixed Objects in Driving Environments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giorgio Leporoni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerio Ponzi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Pro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Napoli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer, Control and Management Engineering, Sapienza University of Rome</institution>
          ,
          <addr-line>Via Ariosto 25, Roma, 00185</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute for Systems Analysis and Computer Science, Italian National Research Council</institution>
          ,
          <addr-line>Via dei Taurini 19, Roma, 00185</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>17</fpage>
      <lpage>24</lpage>
      <abstract>
<p>Autonomous driving is a highly relevant topic today, particularly among major car manufacturers competing to lead in technological innovation and to enhance driving safety. An autonomous vehicle must be able to sense its environment and navigate without human intervention, serving as a driver support system and, in some cases, as a substitute for the driver. A crucial aspect is identifying the positions of pedestrians, traffic signs, traffic lights, and other vehicles while computing the distances from them. This enables the vehicle to alert the driver in potentially dangerous situations, such as impending obstacles or driver distraction. In this paper, we introduce an approach for identifying traffic signs and determining the distance from them. Our method uses the YOLOv4 network for identification and a customized network for distance computation. This integration of AI technologies facilitates the timely detection of hazards and enables proactive alert mechanisms, thereby advancing the capabilities of autonomous vehicles and enhancing driving safety.</p>
      </abstract>
      <kwd-group>
<kwd>Machine Learning</kwd>
        <kwd>Deep Learning</kwd>
<kwd>YOLO</kwd>
        <kwd>Autonomous Driving</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The first method uses a depth map computed for each frame, enhancing the accuracy of distance measurements between signs located at the same depth. The second method capitalizes on temporal frame correlation, enhancing the smoothness and consistency of our system and thereby augmenting its overall performance.</p>
      <p>The use of depth maps helps us obtain more accurate measurements between signs that are collocated at the same depth. Temporal frame correlation instead helps us both to filter out some false-positive predictions, by keeping a bounding box if and only if it also appears in the previous and the next frame, and to obtain more stable distance predictions for successive frames.</p>
      <p>The major car manufacturers are at the forefront in this field. Taking Tesla as an example, it uses a huge number of sensors and cameras mounted on its vehicles, which implies that the car must be built that way. With methods like ours, one can instead simply mount a camera, such as a dash cam, inside the vehicle as a driving aid. Furthermore, as in the reference paper, we implemented a method that is not bound to the parameters of the camera used. For example, IPM methods are bound to the height of the camera above the ground, whereas with our method the driver does not have to worry about the position in which the camera is mounted, and the camera can easily be reused on different vehicles, yielding a simple and portable system usable with any camera.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>Inverse Perspective Mapping [<xref ref-type="bibr" rid="ref5">5</xref>] consists of removing the perspective distortion from the road surface, taking the lane lines as reference to compute distances under the assumption that they have a fixed size. In this method, a bird's-eye view of the roadway is computed to establish the correspondence between a pixel dimension and the lane-line size. This correspondence is then used to count the pixels between an object and the vehicle, obtaining the approximate distance. The method has problems in the presence of road curves or when road markings are barely visible or absent. In addition, it is very dependent on the camera parameters.</p>
      <p>Stereo vision [<xref ref-type="bibr" rid="ref6">6</xref>] foresees the use of a stereo camera that generates two images, a left and a right view. From these two images of the same environment, a disparity map is generated using epipolar geometry. With a simple formula, the generated map makes it possible to compute, for each pixel of the 2D image, the z coordinate that gives the depth of the object in that pixel in the real 3D world. The main problem with this method is the high cost of the stereo camera.</p>
      <p>AI-based approach [<xref ref-type="bibr" rid="ref7">7</xref>]: this method applies deep learning to monocular images. Starting from labeled data, a neural network is trained to compute distances from object bounding boxes (DisNet).</p>
      <p>The Geometry approach [<xref ref-type="bibr" rid="ref8">8</xref>] and other papers are based on the assumption of fixed sizes for known objects, such as vehicles; in this way, the known camera parameters can be used in a formula to compute distances [<xref ref-type="bibr" rid="ref9">9</xref>, <xref ref-type="bibr" rid="ref10">10</xref>].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Our approach</title>
      <p>Our approach focuses on Italian road signs. In Italy, for each category of sign there is a most commonly used size, so once we classified the surveyed sign we assumed that its size was the common one.</p>
      <p>To approach the problem, we started by creating our dataset from scratch. To accomplish this task, we used a dash cam mounted on our vehicle and recorded routes around the city, obtaining roughly 3 hours of recordings. We then filtered out all unsuitable videos, and from the remaining ones we extracted about 1500 frames representing the roads around the city. We cropped each frame along the vertical axis because a portion of the vehicle interior was visible, removing useless information.</p>
      <p>For object detection, we needed a fast solution to avoid wasting time in the whole process. We therefore chose YOLOv4 (You Only Look Once) [<xref ref-type="bibr" rid="ref11">11</xref>] because it runs much faster than other methods such as R-CNN [<xref ref-type="bibr" rid="ref12">12</xref>] or methods based on color segmentation [<xref ref-type="bibr" rid="ref13">13</xref>]. We downloaded a pre-trained YOLO network on which we performed transfer learning on a German Traffic Sign dataset, training for 4000 iterations. During the transfer learning phase we also tried some image pre-processing techniques, such as grayscale conversion and histogram equalization, which unfortunately gave bad results. In the end, the network reached an accuracy of about 91%.</p>
      <p>With the YOLO network we obtained the bounding boxes of the traffic signs for each frame, manually discarding all the frames without detected objects or containing wrong detections. To get the ground-truth distance for each bounding box we use the following pinhole-model formula:</p>
      <p>distance = (real sign width × focal length) / sign width in pixels</p>
      <p>The formula is based on the focal length of the camera, which we obtained by taking a picture of an object of known size placed at a known distance and counting the pixels that the object occupies in the image. This is the only camera parameter that was necessary to create the dataset. In particular, the width of the triangular and octagonal signs used is 90 cm, while it is 60 cm for the square and circular ones.</p>
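      <p>As an illustration of this arithmetic, here is a minimal Python sketch (our own, not code from the paper; the function names and the example numbers are hypothetical):</p>
      <preformat>
# Pinhole-model arithmetic described above: one calibration shot of an
# object of known size at a known distance yields the focal length in
# pixels, which then converts bounding-box widths into distances.

def calibrate_focal_length(real_width_m, known_distance_m, width_px):
    """Focal length in pixels from a single calibration picture."""
    return width_px * known_distance_m / real_width_m

def ground_truth_distance(real_width_m, focal_px, width_px):
    """Distance of a sign whose bounding box is width_px pixels wide."""
    return real_width_m * focal_px / width_px

# Example: a 90 cm sign photographed at 10 m appears 45 px wide,
# giving focal_px = 500; the same sign seen 30 px wide is 15 m away.
f = calibrate_focal_length(0.90, 10.0, 45)
print(ground_truth_distance(0.90, f, 30))  # 15.0
      </preformat>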
      <p>Through this process we built a dataset composed of 959 images. After the creation of the dataset we focused on the detection part; for this purpose we used YOLOv4, as mentioned above.</p>
      <p>Figure 1: (a) Predictive model for traffic sign distance computation: the input image with bounding boxes undergoes VGG16 feature extraction, ROI pooling for size standardization, and a three-layer feedforward network for distance prediction using a softplus activation. (b) Enhanced model integrating depth-map information and temporal frame correlation for stabilized predictions: the input image with bounding boxes is processed through VGG16, ROI pooling, and a modified three-layer feedforward network, leading to improved distance accuracy.</p>
      <p>Once the bounding boxes are obtained for an image, it is passed to a specific network for the distance computation. This second network is composed of a CNN (VGG16) [<xref ref-type="bibr" rid="ref14">14</xref>] for feature-map extraction, whose output is combined with the information about the bounding boxes through an ROI pooling layer [<xref ref-type="bibr" rid="ref15">15</xref>]. This layer is necessary because the bounding boxes in a single image can have different sizes, and it standardizes their dimensions. The output of the ROI pooling is finally passed to a feedforward network, composed of 3 layers (2048, 512, 1), that predicts distances using a softplus activation function. The architecture of the network is shown in Figure 1a.</p>
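      <p>Below is a minimal PyTorch sketch of such an architecture, written from the description above: the (2048, 512, 1) head and the softplus output come from the text, while the pooling size, the pre-trained weights, and the 1/32 feature-map scale are our assumptions.</p>
      <preformat>
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import roi_pool

class DistanceNet(nn.Module):
    def __init__(self, pool_size=7):
        super().__init__()
        # VGG16 convolutional backbone for feature-map extraction
        self.backbone = torchvision.models.vgg16(weights="DEFAULT").features
        self.pool_size = pool_size
        # Three-layer feedforward head (2048, 512, 1); softplus keeps
        # the predicted distance non-negative
        self.head = nn.Sequential(
            nn.Linear(512 * pool_size * pool_size, 2048), nn.ReLU(),
            nn.Linear(2048, 512), nn.ReLU(),
            nn.Linear(512, 1), nn.Softplus(),
        )

    def forward(self, images, boxes):
        # images: [B, 3, H, W]; boxes: list of [K_i, 4] tensors, pixel coords
        feats = self.backbone(images)
        # VGG16 downsamples by 32, so map pixel boxes onto the feature map
        rois = roi_pool(feats, boxes, output_size=self.pool_size,
                        spatial_scale=1.0 / 32)
        return self.head(rois.flatten(start_dim=1))  # one distance per box
      </preformat>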
      <p>By testing the entire process on different videos, we noticed that this method was not stable in the predictions made between successive frames: in some cases there was a large variance between the distances predicted for the same traffic sign in two or more successive frames. We tried to improve our results by adding depth-map information and by exploiting the concept of temporal frame correlation.</p>
      <p>Depth map [<xref ref-type="bibr" rid="ref16">16</xref>]: the idea is that traffic signs at the same depth in the real world are more or less at the same distance from the vehicle. Based on this observation, we use a pre-trained network called MiDaS [<xref ref-type="bibr" rid="ref17">17</xref>, <xref ref-type="bibr" rid="ref18">18</xref>] to get the depth map of the image under exam. Once the bounding boxes are detected in the original image and the distances are computed, we project the bounding boxes onto the depth map. For each group of traffic signs at the same depth, allowing a small variance proportional to the maximum depth value in the image, we compute the average of the distances predicted in the original image to obtain a uniform value. At the moment we apply this method after the computation of the distances, but it could also be used in the creation of the dataset, to get more detailed labels, or in the training phase, to directly stabilize the results in the network; a sketch of the grouping rule follows below.</p>
      <p>Figure 2 shows a representation of this method: looking at the traffic signs in the image, it is visible from the depth-map coloration that they are at the same distance, so their predictions are corrected to the same value.</p>
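      <p>The following Python sketch shows our reading of this grouping rule; the median-depth sampling and the tolerance value are assumptions, not details from the paper.</p>
      <preformat>
import numpy as np

def depth_of(det, depth_map):
    """Median MiDaS depth value inside a detection's bounding box."""
    x0, y0, x1, y1 = det["box"]
    return float(np.median(depth_map[y0:y1, x0:x1]))

def smooth_by_depth(detections, depth_map, rel_tol=0.05):
    """detections: list of dicts with 'box' (x0, y0, x1, y1) and 'dist'.
    Signs whose depths differ by less than rel_tol * max depth are
    grouped and assigned the average of their predicted distances."""
    max_depth = float(depth_map.max())
    depths = [depth_of(d, depth_map) for d in detections]
    groups = []  # greedily group detections whose depths nearly coincide
    for i in range(len(detections)):
        for g in groups:
            if abs(depths[i] - depths[g[0]]) &lt; rel_tol * max_depth:
                g.append(i)
                break
        else:
            groups.append([i])
    for g in groups:  # signs at the same depth share one averaged distance
        avg = sum(detections[i]["dist"] for i in g) / len(g)
        for i in g:
            detections[i]["dist"] = avg
    return detections
      </preformat>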
      <p>Temporal frame correlation: we use this technique to give linearity to the distances predicted over a sequence of frames. While applying the base method, we noticed that in some cases the network's predictions were very different for successive frames. To stabilize the predictions, we consider a traffic sign detected in the frame at time t a valid object only if it is also present at times t-1 and t+1, and we take its distance as the average over the 3 frames in sequence. To verify whether the same traffic sign is present in the 3 consecutive frames, we first find the center of its bounding box at time t and the centers of all the traffic signs in the previous and following frames; then we compute the distances between these centers, and if a distance is lower than a certain threshold, we are looking at the same traffic sign (see the sketch below).</p>
      <p>An example of this concept is given in Figure 3, in which there is a wrong detection at frame t (red circle in the top-right image); since this wrong prediction is present neither at frame t-1 nor at frame t+1, it is discarded at frame t as well.</p>
      <p>The architecture of this modified network is represented in Figure 1b.</p>
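      <p>A minimal Python sketch of this filtering follows; the pixel threshold value and the data layout are assumptions, not details from the paper.</p>
      <preformat>
import math

def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def match(det, frame, thresh_px=30):
    """Return the detection in `frame` whose bounding-box center is
    closest to `det`, if it lies within thresh_px pixels; else None."""
    c = center(det["box"])
    best = min(frame, key=lambda d: math.dist(c, center(d["box"])),
               default=None)
    if best and math.dist(c, center(best["box"])) &lt; thresh_px:
        return best
    return None

def filter_frame(prev, curr, nxt):
    """Keep a detection at time t only if it also appears at t-1 and
    t+1; its distance becomes the average over the three frames."""
    kept = []
    for det in curr:
        before, after = match(det, prev), match(det, nxt)
        if before and after:  # present in all three frames: valid object
            det["dist"] = (before["dist"] + det["dist"] + after["dist"]) / 3
            kept.append(det)
    return kept  # false positives missing at t-1 or t+1 are discarded
      </preformat>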
    </sec>
    <sec id="sec-2">
      <title>5. Results</title>
      <sec id="sec-2-1">
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>Regarding the detection part, with YOLO we reach an accuracy of around 91%.</p>
      <p>For the distance prediction network, instead, it is not possible to compute a true accuracy, but we reach a loss of roughly 130, visible in the graph in Figure 4; the graph shows that the loss function has a trend that would keep improving if the network were trained for more epochs.</p>
      <p>As evaluation metric we used the one provided by [<xref ref-type="bibr" rid="ref7">7</xref>]: the RMSE of the predictions, broken down by distance in meters,</p>
      <p>RMSE = √( (1/N) Σ_{i=1}^{N} ‖d_i − d_i*‖² ),</p>
      <p>where d_i is the predicted distance for sample i, d_i* its ground-truth distance, and N the number of predictions.</p>
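      <p>For reference, the metric as a short sketch (our own helper, computing the RMSE over the predictions falling in one meter bucket):</p>
      <preformat>
import numpy as np

def rmse(pred, target):
    """Root-mean-square error between predicted and ground-truth
    distances (arrays of the same length N)."""
    pred, target = np.asarray(pred), np.asarray(target)
    return float(np.sqrt(np.mean((pred - target) ** 2)))
      </preformat>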
      <p>To see how the behavior of the network changes with the distance of the detected object, the results are represented in the graph in Figure 5, compared with the ones obtained by the reference paper. Visibly, the predictions get worse as the distance increases. We notice that the bounding boxes of traffic signs at larger distances do not match their dimensions perfectly, introducing an error.</p>
      <p>Figure 5: (a) our meters-RMSE predictions graph; (b) reference paper meters-RMSE predictions graph.</p>
      <p>Another source of error is probably the fact that we have only a few samples of road signs at large distances.</p>
      <p>Table 1 compares our results with the ones of the reference paper. As visible, the results are similar; ours are slightly better, since lower values represent better predictions. This is because we make predictions only on traffic signs, while they predict on cars, cyclists, and pedestrians, which gives them a larger margin of error than ours.</p>
      <p>To show the method in action, we made some test videos of the network at work, available on YouTube. In particular, we made videos with the following characteristics:</p>
      <p>• Test video using the base network, without depth map and temporal frame correlation (daylight conditions)
• Test video using depth map and temporal frame correlation (daylight conditions)
• Test video using the base network, without depth map and temporal frame correlation, rounded to 5 meters (daylight conditions)
• Test video using depth map and temporal frame correlation, rounded to 5 meters (daylight conditions)</p>
      <p>"Rounded to 5 meters" means that we approximate each prediction to the nearest multiple of 5 meters to get more stable results: for example, 12.4 meters is rounded to 10 meters, while 12.6 meters is rounded to 15 meters.</p>
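      <p>A one-line sketch of this rounding rule (our illustration):</p>
      <preformat>
# Snap each predicted distance to the nearest multiple of 5 meters.
def round_to_5m(d):
    return 5 * round(d / 5)

assert round_to_5m(12.4) == 10 and round_to_5m(12.6) == 15
      </preformat>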
    </sec>
    <sec id="sec-3">
      <title>6. Conclusion</title>
      <sec id="sec-3-1">
        <title>The method seems to work well, there are errors introduced by the labels of our dataset that are not accurate,</title>
        <p>(a) Our meters-RMSE predictions graph
(b) Reference paper meters-RMSE predictions graph
caused by the possible diferent dimensions for each
trafifc sign on the road introducing a small error that then
will propagate throughout the process, even if we tried to
solve it using depth map and temporal frame correlation.
So, the main future step could be using more accurate
labels for the samples inside the dataset. The work is
based on the objects detected and rounded by bounding
boxes but is not always sure that their dimensions match
perfectly the sizes of the trafic signs, so this point
introduces errors in the predictions of the network. As said
at the beginning, in Italy the same trafic signs could be
used up to 3 diferent dimensions, so it could be useful to
infer their dimensions to improve the predicted distances.
As future improvement, there possible extension of the
detected objects also to vehicles and pedestrians.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <title>This work has been developed at is.Lab() Intelligent Sys</title>
        <p>tems Laboratory at the Department of Computer, Control,
and Management Engineering, Sapienza University of
Rome (https:// islab.diag.uniroma1.it). The work has also
been partially supported from Italian Ministerial grant
PRIN 2022 “ISIDE: Intelligent Systems for Infrastructural
Diagnosis in smart-concretE”, n. 2022S88WAY - CUP
B53D2301318, and by the Age-It: Ageing Well in an
ageing society project, task 9.4.1 work package 4 spoke 9,
within topic 8 extended partnership 8, under the National
Recovery and Resilience Plan (PNRR), Mission 4
Component 2 Investment 1.3—Call for tender No. 1557 of
11/10/2022 of Italian Ministry of University and Research
funded by the European Union—NextGenerationEU, CUP
B53C22004090006.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.</given-names>
            <surname>Ponzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wajda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brociek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>Analysis pre and post covid-19 pandemic rorschach test data of using em algorithms and gmm models</article-title>
          , volume
          <volume>3360</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Capizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Paternò</surname>
          </string-name>
          ,
          <article-title>An innovative hybrid neuro-wavelet method for reconstruction of missing data in astronomical photometric surveys 7267 LNAI (</article-title>
          <year>2012</year>
          )
          <fpage>21</fpage>
          -
          <lpage>29</lpage>
          . doi:
          <volume>10</volume>
          .1007/ 978-3-
          <fpage>642</fpage>
          -29347-
          <issue>4</issue>
          _
          <fpage>3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Alfarano</surname>
          </string-name>
          , G. De Magistris,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mongelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Starczewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>A novel convmixer transformer based architecture for violent behavior detection 14126 LNAI (</article-title>
          <year>2023</year>
          )
          <fpage>3</fpage>
          -
          <lpage>16</lpage>
          . doi:
          <volume>10</volume>
          .1007/ 978-3-
          <fpage>031</fpage>
          -42508-
          <issue>0</issue>
          _
          <fpage>1</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Abu-Haimed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-C.</given-names>
            <surname>Lien</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <article-title>Learning Object-specific Distance from a Monocular Image</article-title>
          ,
          <source>Technical Report</source>
          ,
          <year>2019</year>
          . URL: http://arxiv.org/abs/
          <year>1909</year>
          .04182. doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>1909</year>
          .
          <volume>04182</volume>
          , arXiv:
          <year>1909</year>
          .
          <article-title>04182 [cs] type: article.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tuohy</surname>
          </string-name>
          ,
          <string-name>
            <surname>D. O'Cualain</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Glavin</surname>
          </string-name>
          ,
          <article-title>Distance determination for an automobile environment using Inverse Perspective Mapping in OpenCV</article-title>
          ,
          <source>in: IET Irish Signals and Systems Conference (ISSC</source>
          <year>2010</year>
          ),
          <year>2010</year>
          , pp.
          <fpage>100</fpage>
          -
          <lpage>105</lpage>
          . doi:
          <volume>10</volume>
          .1049/cp.
          <year>2010</year>
          .
          <volume>0495</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <source>Distance Measurement System Based on Binocular Stereo Vision, IOP Conference Series: Earth and Environmental Science</source>
          <volume>252</volume>
          (
          <year>2019</year>
          )
          <article-title>052051</article-title>
          . URL: https://doi.org/10.1088/
          <fpage>1755</fpage>
          -1315/ 252/5/052051. doi:
          <volume>10</volume>
          .1088/
          <fpage>1755</fpage>
          -1315/252/5/ 052051.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>[7] DisNet: A novel method for distance estimation from monocular camera</article-title>
          , ???? URL: https: //patrick-llgc.github.io/Learning-Deep-Learning/ paper_notes/disnet.html.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Saleh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khwandah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Heller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mumtaz</surname>
          </string-name>
          ,
          <article-title>Trafic Signs Recognition and Distance Estimation using a Monocular Camera</article-title>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>A comprehensive solution for psychological treatment and therapeutic path planning based on knowledge base and expertise sharing</article-title>
          , volume
          <volume>2472</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Lo Sciuto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>A cloud-based lfexible solution for psychometric tests validation, administration and evaluation</article-title>
          , volume
          <volume>2468</volume>
          ,
          <year>2019</year>
          , doi:10.1007/978-3-
          <fpage>319</fpage>
          -48680-2_
          <fpage>19</fpage>
          . pp.
          <fpage>16</fpage>
          -
          <lpage>21</lpage>
          . [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <string-name>
            <surname>Very Deep</surname>
          </string-name>
          Convo-
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bochkovskiy</surname>
          </string-name>
          , C.-Y. Wang, H.
          <string-name>
            <surname>-Y. M. Liao</surname>
          </string-name>
          ,
          <article-title>YOLOv4: lutional Networks for Large-Scale Image RecogOptimal Speed and Accuracy of Object Detection, nition</article-title>
          ,
          <source>Technical Report</source>
          ,
          <year>2015</year>
          . URL: http://arxiv.
          <source>Technical Report</source>
          ,
          <year>2020</year>
          . URL: http://arxiv.org/abs/ org/abs/1409.1556. doi:
          <volume>10</volume>
          .48550/arXiv.1409.
          <year>2004</year>
          .
          <volume>10934</volume>
          . doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>2004</year>
          .
          <volume>10934</volume>
          ,
          <issue>1556</issue>
          , arXiv:
          <fpage>1409</fpage>
          .1556 [
          <article-title>cs] type: article</article-title>
          . arXiv:
          <year>2004</year>
          .
          <article-title>10934 [cs, eess] type: article</article-title>
          . [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Donahue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          , J. Malik,
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <surname>Fast</surname>
            <given-names>R-CNN</given-names>
          </string-name>
          , in: 2015 IEEE Inter-
          <article-title>Rich feature hierarchies for accurate object detecnational Conference on Computer Vision (ICCV), tion and semantic segmentation</article-title>
          ,
          <source>Technical Report</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1440</fpage>
          -
          <lpage>1448</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICCV.
          <year>2015</year>
          .
          <year>2014</year>
          . URL: http://arxiv.org/abs/1311.2524. doi:
          <volume>10</volume>
          . 169, iSSN:
          <fpage>2380</fpage>
          -
          <lpage>7504</lpage>
          . 48550/arXiv.1311.2524, arXiv:
          <fpage>1311</fpage>
          .2524 [cs]
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Youssef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Albani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bloisi</surname>
          </string-name>
          ,
          <article-title>Fast Trafic type: article. Sign Recognition Using Color Segmentation and</article-title>
          [16]
          <string-name>
            <given-names>C.</given-names>
            <surname>Godard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. Mac</given-names>
            <surname>Aodha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Firman</surname>
          </string-name>
          , G. Brostow, Deep Convolutional Networks, volume
          <volume>10016</volume>
          ,
          <year>2016</year>
          . Digging Into
          <string-name>
            <surname>Self-Supervised Monocular Depth Estimation</surname>
          </string-name>
          ,
          <source>Technical Report</source>
          ,
          <year>2019</year>
          . URL: http://arxiv. org/abs/
          <year>1806</year>
          .01260. doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>1806</year>
          .
          <volume>01260</volume>
          , arXiv:
          <year>1806</year>
          .
          <article-title>01260 [cs, stat] type: article.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ranftl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bochkovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Koltun</surname>
          </string-name>
          ,
          <article-title>Vision Transformers for Dense Prediction</article-title>
          ,
          <source>Technical Report</source>
          ,
          <year>2021</year>
          . URL: http://arxiv.org/abs/2103.13413. doi:
          <volume>10</volume>
          .48550/arXiv.2103.13413, arXiv:
          <fpage>2103</fpage>
          .13413 [
          <article-title>cs] type: article.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ranftl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lasinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hafner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Schindler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Koltun</surname>
          </string-name>
          ,
          <article-title>Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Crossdataset Transfer</article-title>
          ,
          <source>Technical Report</source>
          ,
          <year>2020</year>
          . URL: http: //arxiv.org/abs/
          <year>1907</year>
          .01341. doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>1907</year>
          .
          <volume>01341</volume>
          , arXiv:
          <year>1907</year>
          .
          <article-title>01341 [cs] type: article.</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>