<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Distance Estimation of Fixed Objects in Driving Environments</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Giorgio</forename><surname>Leporoni</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer, Control and Management Engineering</orgName>
								<orgName type="institution">Sapienza University of Rome</orgName>
								<address>
									<addrLine>Via Ariosto 25</addrLine>
									<postCode>00185</postCode>
									<settlement>Roma</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Valerio</forename><surname>Ponzi</surname></persName>
							<email>ponzi@diag.uniroma1.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer, Control and Management Engineering</orgName>
								<orgName type="institution">Sapienza University of Rome</orgName>
								<address>
									<addrLine>Via Ariosto 25</addrLine>
									<postCode>00185</postCode>
									<settlement>Roma</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Institute for Systems Analysis and Computer Science</orgName>
								<orgName type="institution">Italian National Research Council</orgName>
								<address>
									<addrLine>Via dei Taurini 19</addrLine>
									<postCode>00185</postCode>
									<settlement>Roma</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Francesco</forename><surname>Pro</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer, Control and Management Engineering</orgName>
								<orgName type="institution">Sapienza University of Rome</orgName>
								<address>
									<addrLine>Via Ariosto 25</addrLine>
									<postCode>00185</postCode>
									<settlement>Roma</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Christian</forename><surname>Napoli</surname></persName>
							<email>cnapoli@diag.uniroma1.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer, Control and Management Engineering</orgName>
								<orgName type="institution">Sapienza University of Rome</orgName>
								<address>
									<addrLine>Via Ariosto 25</addrLine>
									<postCode>00185</postCode>
									<settlement>Roma</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Institute for Systems Analysis and Computer Science</orgName>
								<orgName type="institution">Italian National Research Council</orgName>
								<address>
									<addrLine>Via dei Taurini 19</addrLine>
									<postCode>00185</postCode>
									<settlement>Roma</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Distance Estimation of Fixed Objects in Driving Environments</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">B5FB79457BDDDB705A03AEB13A1F35A6</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:08+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Machine Learning</term>
					<term>Deep Learning</term>
					<term>Yolo</term>
					<term>Autonomous Driving</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Autonomous driving is a highly relevant topic today, particularly among major car manufacturers attempting to lead in technological innovation and enhance driving safety. An autonomous vehicle must possess the capability to sense its environment and navigate without human intervention. Thus, it serves as both a driver support system and, in some cases, a substitute. A crucial aspect involves identifying the positions of pedestrians, traffic signs, traffic lights, and other vehicles while computing distances from them. This enables the vehicle to emit alerts to the driver in potentially dangerous situations, such as impending obstacles due to external factors or driver distraction. In this paper, we introduce an approach for identifying traffic signs and determining the distance from them. Our method utilizes the YOLOv4 network for identification and a customized network for distance computation. This integration of AI technologies facilitates the timely detection of hazards and enables proactive alert mechanisms, thereby advancing the capabilities of autonomous vehicles and enhancing driving safety.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Road safety is a major global concern, impacting the well-being of individuals and communities worldwide. The development and adoption of advanced technologies, such as driver assistance systems and autonomous vehicles, offer significant potential to further enhance road safety in the long term. This is made possible by creating systems based on cameras or sensors mounted on vehicles that process the acquired images, identify the typical objects of a road environment, and perform computations on them, such as estimating their distances. In this way, the vehicle can make quick decisions autonomously in case of necessity. A classical example is when there is a stop sign and the system detects that the driver is not reducing speed: at this point it can brake the vehicle autonomously or simply alert the driver with acoustic signals.</p><p>In recent years, researchers have begun to approach this field by exploiting artificial intelligence. Previous methods involved the use of geometry under the assumption of fixed dimensions for objects such as vehicles. Other methods were based on IPM (Inverse Perspective Mapping) using the lines present on the carriageway; these methods are all dependent on the parameters of the camera used. One of the main problems in this field of research is the dataset. We are dealing with a very delicate area, so to be sure of the system's accuracy the dataset should be composed of a huge number of samples representing different objects in very different contexts <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3]</ref>. 
So, we recorded videos on short routes with a dash cam mounted on our vehicle, extracted frames, and computed their ground truth in an automated way to finally build an ad hoc dataset.</p><p>In this paper, we focus on computing the distances between the vehicle and the detected traffic signs using single images captured by a monocular camera. We chose this type of camera because it is the most common and affordable. The method comprises two phases: one for the detection of the traffic signs in the captured images, and a second one for inferring the distances from them. For this second phase we built a network based on a recent paper <ref type="bibr" target="#b3">[4]</ref> that tackles the problem with a purely artificial-intelligence-based approach.</p><p>Our main contributions arise from our endeavor to create an automated system tailored to our needs. Initially, we integrated YOLOv4 to produce bounding boxes around traffic signs, facilitating the automatic identification of their positions within images, thus concluding the initial phase of our approach. Subsequently, we directed our efforts towards developing a specialized dataset to address our specific problem, as existing datasets did not fulfill our requirements. Building upon our initial findings, we sought to enhance our system by implementing two stabilization methods for the predicted distances. The first method generates and uses a depth map for each frame, improving the accuracy of distance measurements between signs located at the same depth. The second method capitalizes on temporal frame correlation, enhancing the smoothness and consistency of our system and thereby improving its overall performance.</p><p>The use of depth maps helps us obtain more accurate measurements between signs that are located at the same depth. 
Temporal frame correlation instead helps us to filter out some false-positive predictions, keeping a bounding box if and only if it appears in both the previous and the next frame, and to obtain more stable distance predictions across successive frames.</p><p>The major car manufacturers are at the forefront in this field. Taking Tesla as an example, it uses a huge number of sensors and cameras mounted on its vehicles, which implies that the car must be produced that way. With methods like ours, one can simply mount a camera, such as a dash cam, inside the vehicle as a driving aid. Furthermore, as in the reference paper, we tried to implement a method that is not bound to the parameters of the camera used. For example, IPM methods are bound to the height of the camera from the ground; in our case, instead, the driver does not have to worry about the position in which the camera is mounted, and it can easily be used on different vehicles, yielding a simple and portable system usable with any camera.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related works</head><p>Inverse Perspective Mapping <ref type="bibr" target="#b4">[5]</ref> consists of removing the perspective distortion from the road surface, taking the lane lines as a reference to compute distances under the assumption that they have a fixed size. In this method, a bird's-eye view of the roadway is computed to establish the correspondence between a pixel dimension and the lane line size. This correspondence is then used to count the pixels between an object and the vehicle, obtaining an approximate distance. This method has problems in the presence of road curves or when lane markings are barely visible or absent. In addition, it is very dependent on the camera parameters.</p><p>Stereo vision <ref type="bibr" target="#b5">[6]</ref>: this method uses a stereo camera that generates two images, a left and a right view. From these two images of the same environment, a disparity map is generated using epipolar geometry. With a simple formula, from the generated map it is possible to compute, for each pixel of the 2D image, the z coordinate that gives the depth of the object at that pixel in the real 3D world. The main drawback of this method is the high cost of the stereo camera.</p></div>
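As a sketch of the stereo relation just described, depth can be recovered from a disparity map via Z = f·B/d, with the focal length f in pixels and the baseline B in meters; the numeric values below are illustrative, not taken from any cited system:

```python
import numpy as np

def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Stereo depth: Z = f * B / d for each pixel of the disparity map.
    Pixels with zero (unmatched) disparity are mapped to infinity."""
    disparity_px = np.asarray(disparity_px, dtype=float)
    depth = np.full_like(disparity_px, np.inf)
    valid = disparity_px > 0
    depth[valid] = focal_length_px * baseline_m / disparity_px[valid]
    return depth
```

For example, with a 700 px focal length and a 12 cm baseline, a 10 px disparity corresponds to a depth of 8.4 m.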
<div xmlns="http://www.tei-c.org/ns/1.0"><head>AI-based approach [7]</head><p>This method applies deep learning to monocular images. Starting from labeled data, a neural network (DisNet) is trained to compute distances from object bounding boxes.</p><p>Geometry approach <ref type="bibr" target="#b7">[8]</ref>: other papers are based on the assumption of fixed sizes for known objects, such as vehicles. In this way, knowing the camera parameters, a simple formula can be used to compute distances <ref type="bibr">[9, 10]</ref>.</p></div>
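A minimal sketch of this fixed-size geometric idea (the sign width and focal length below are illustrative):

```python
def estimate_distance_m(real_width_cm, focal_length_px, bbox_width_px):
    """Pinhole-camera estimate: distance = real width * focal length / pixel width.
    The result is converted from centimeters to meters."""
    return (real_width_cm * focal_length_px / bbox_width_px) / 100.0
```

A 90 cm sign seen with a 1000 px focal length and a 60 px bounding-box width would be estimated at 15 m.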
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Our approach</head><p>Our approach focused on the use of Italian road signs. In Italy, for each category of sign there is a most commonly used size, so once we classified the surveyed sign, we assumed that its size was the common one.</p><p>To approach the problem, we started creating our dataset from scratch. To accomplish this task, we used a dash cam mounted on our vehicle, recording routes around the city to obtain roughly 3 hours of recordings. We then filtered out all unsuitable videos; from the remaining ones we extracted about 1500 frames representing the roads around the city. We cropped each frame along the vertical axis to remove a visible portion of the vehicle interior, discarding useless information.</p><p>For object detection, we needed a quick solution to avoid wasting time in the whole process. So, we chose YOLOv4 (You Only Look Once) <ref type="bibr" target="#b10">[11]</ref> because it runs much faster than other methods such as R-CNN <ref type="bibr" target="#b11">[12]</ref> or methods based on color segmentation <ref type="bibr" target="#b12">[13]</ref>. We downloaded a pre-trained YOLO network and performed transfer learning on a German Traffic Sign dataset for 4000 iterations. During the transfer learning phase we also tried some image pre-processing techniques, such as grayscale conversion and histogram equalization, unfortunately obtaining bad results. In the end, the network reached an accuracy of about 91%.</p><p>With the YOLO network, we got the bounding boxes of the traffic signs for each frame, manually discarding all the frames without detected objects or with wrong detections. 
To get the ground truth of each bounding box we used the following formula:</p><formula xml:id="formula_0">distance = (Width_cm × Focal length) / Width_px</formula><p>It is based on the focal length of the camera, which we obtained by taking a picture of an object of known size placed at a known distance and counting the pixels the object spans within the image. This is the only camera parameter that was necessary to create the dataset.</p><p>In particular, the width of the triangular and octagonal signs used is 90 cm, while it is 60 cm for the square and circular ones.</p><p>Through this process, we built a dataset composed of 959 images. After the creation of the dataset, we focused on the detection part. For this purpose, we used YOLOv4 as mentioned above.</p><p>After obtaining the bounding boxes for an image, it is passed to a specific network for the distance computation. This second network is composed of a CNN (VGG16) <ref type="bibr" target="#b13">[14]</ref> for feature map extraction, whose output is combined with the information about the bounding boxes through an ROI pooling layer <ref type="bibr" target="#b14">[15]</ref>. This layer is necessary because the bounding boxes in a single image can have different sizes; it standardizes their dimensions. The output of the ROI pooling is finally passed to a feedforward network, composed of 3 layers (2048, 512, 1), that predicts distances using a softplus activation function. The architecture of the network is shown in Figure <ref type="figure" target="#fig_2">1a</ref>.</p><p>By testing the entire process on different videos, we noticed that in our case this method was not stable in the predictions made between successive frames: in some cases there was a large variance between the distances predicted for the same traffic sign in two or more successive frames. 
We tried to improve our results by adding depth map information and exploiting the concept of temporal frame correlation.</p><p>Depth map <ref type="bibr" target="#b15">[16]</ref>: the idea is that traffic signs at the same depth in the real world are more or less at the same distance from the vehicle. Based on this, we use a pre-trained network called MiDaS <ref type="bibr" target="#b16">[17]</ref>  <ref type="bibr" target="#b17">[18]</ref> to get the depth map of the image under exam. Once the bounding boxes are detected in the original image and the distances are computed, we project the bounding boxes onto the depth map. For the traffic signs at the same depth, allowing a small variance based on the maximum depth value in the image, we compute an average of the distances in the original image to obtain a uniform value. At the moment we apply this method after the computation of the distances, but it could also be used in the creation of the dataset to get more detailed labels, or in the training phase to directly stabilize the network's results.</p><p>Figure <ref type="figure" target="#fig_3">2</ref> shows a representation of this method: looking at the traffic signs in the image, it is now visible from the depth map coloration that they are at the same distance. Thanks to this, their predictions are now corrected to the same value.</p><p>Temporal frame correlation: we use this technique to give linearity to the predicted distances across the sequence of frames. We noticed that in some cases the network's predictions were very different for successive frames. To stabilize the predictions, we consider a traffic sign in a frame at time t a valid object only if it is also present at times t-1 and t+1, and its distance is the average over the 3 frames in sequence. 
To verify that the same traffic sign is present in the 3 subsequent frames, we first find the center of its bounding box at time t and of all the traffic signs in the previous and following frames. Then we compute the distances between the centers; if a distance is lower than a certain threshold, we are looking at the same traffic sign.</p><p>An example of this concept is given in Figure <ref type="figure" target="#fig_4">3</ref>, in which there is a wrong detection at frame t (red circle in the top-right image); since this wrong prediction is not present at frames (t-1) and (t+1), it is discarded at frame t.</p><p>The architecture of this modified network is represented in Figure <ref type="figure" target="#fig_2">1b</ref>.</p></div>
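The temporal-frame-correlation filter can be sketched as follows; the (x, y, width, height) box format and the 50 px threshold are assumptions for illustration, not values stated in the paper:

```python
import math

def bbox_center(box):
    # box = (x, y, width, height) in pixels
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def is_valid_detection(box_t, boxes_prev, boxes_next, threshold_px=50.0):
    """Keep a detection at time t only if a box with a nearby center exists
    in both the previous and the next frame."""
    cx, cy = bbox_center(box_t)
    def has_match(boxes):
        return any(math.hypot(cx - px, cy - py) < threshold_px
                   for px, py in map(bbox_center, boxes))
    return has_match(boxes_prev) and has_match(boxes_next)

def smoothed_distance(d_prev, d_t, d_next):
    """Stabilize the prediction by averaging over the 3-frame window."""
    return (d_prev + d_t + d_next) / 3.0
```

A detection with no center match in either neighboring frame is treated as a false positive and dropped, as in the Figure 3 example.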
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Training</head><p>Regarding the training phase, due to time and resource constraints we were unable to train the networks for long sessions. We trained YOLOv4 for about 4000 iterations using RGB frames from the German Traffic Sign Dataset. For the distance prediction network (DPN), all its components are trained together. We trained it with our dataset for 560 epochs using RGB frames. As training parameters, we used a learning rate starting from 0.001 with Adam, a minibatch size of 16, and the Smooth L1 loss.</p></div>
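For reference, the Smooth L1 loss is quadratic for small errors and linear for large ones, making it less sensitive to outlier distance labels; a minimal NumPy sketch with beta = 1 (an assumption, since the paper does not state the value):

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1: 0.5*e^2/beta if |e| < beta, else |e| - 0.5*beta,
    averaged over all elements."""
    e = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    return float(np.where(e < beta, 0.5 * e ** 2 / beta, e - 0.5 * beta).mean())
```

For an error of 0.5 the loss is 0.125 (quadratic branch); for an error of 2.0 it is 1.5 (linear branch).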
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results</head><p>For the detection part, the YOLO network reaches an accuracy of around 91%.</p><p>For the distance prediction network, instead, it is not possible to compute a true accuracy, but we reach a loss of roughly 130, visible in the graph in Figure <ref type="figure" target="#fig_5">4</ref>.</p><p>It shows that the loss function has a trend that would keep improving if trained for more epochs.</p><p>As an evaluation metric, we used the ones provided by <ref type="bibr" target="#b6">[7]</ref>. In particular, we use the RMSE on predictions, binned by meters:</p><formula xml:id="formula_1">RMSE = √( (1/N) Σ_{i=1}^{N} ‖d_i − d*_i‖² )</formula><p>This shows how the behavior of the network changes with respect to the distance from the detected object. Results are represented in the graph in Figure <ref type="figure" target="#fig_8">5</ref>, compared with the ones obtained by the reference paper. As visible, predictions get worse as distance increases. We notice that the bounding boxes of traffic signs at larger distances do not perfectly match their dimensions, introducing an error. Another source of error is probably the fact that we have only a few samples of road signs at large distances. Table <ref type="table" target="#tab_0">1</ref> compares our results with the ones of the reference paper. The results are similar; ours are slightly better, since lower values represent better predictions. This is because we make predictions only on traffic signs, while they predict on cars, cyclists, and pedestrians, which gives them a larger margin of error.</p><p>To show the method in action, we made some test videos, available on YouTube, of the network at work. 
In particular, we made videos with the following characteristics:</p><p>• Test video using the base network without depth map and temporal frame correlation (daylight conditions) • Test video using depth map and temporal frame correlation (daylight conditions) • Test video using the base network without depth map and temporal frame correlation, rounded to 5 meters (daylight conditions) • Test video using depth map and temporal frame correlation, rounded to 5 meters (daylight conditions) • Test video using depth map and temporal frame correlation, rounded to 5 meters (night conditions)</p><p>Rounded to 5 meters means that we approximate the predictions to get more stable results; for example, 12.4 meters is rounded to 10 meters, while 12.6 meters is rounded to 15 meters. Figure <ref type="figure" target="#fig_6">6</ref> shows two examples of predictions in images. In the top image, distances are predicted without the use of the depth map and temporal frame correlation; the predictions do not seem reliable and appear quite random. The bottom image, instead, is produced using our two variations. As visible, all the detected signs are more or less at the same depth; this is not taken into account in the top image, while here, thanks to the depth map, their predictions are adjusted correctly.</p></div>
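The per-bin RMSE metric and the 5-meter rounding used in the test videos can be sketched as:

```python
import numpy as np

def rmse(predicted, true):
    """Root mean squared error between predicted and ground-truth distances."""
    predicted = np.asarray(predicted, dtype=float)
    true = np.asarray(true, dtype=float)
    return float(np.sqrt(np.mean((predicted - true) ** 2)))

def round_to_step(distance_m, step=5.0):
    """Round a predicted distance to the nearest multiple of `step` meters,
    e.g. 12.4 -> 10.0 and 12.6 -> 15.0."""
    return step * round(distance_m / step)
```

In the evaluation, `rmse` is computed separately for predictions grouped by ground-truth distance, which produces the per-meter error curves of Figure 5.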
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>The method seems to work well. However, errors are introduced by the inaccurate labels of our dataset, caused by the possibly different dimensions of each traffic sign on the road; this small error propagates throughout the process, even though we tried to mitigate it using the depth map and temporal frame correlation. So, the main future step could be using more accurate labels for the samples in the dataset. The work is based on the detected objects and their bounding boxes, but it is not guaranteed that their dimensions perfectly match the sizes of the traffic signs, which introduces errors in the predictions of the network. As said at the beginning, in Italy the same traffic sign can come in up to 3 different dimensions, so it could be useful to infer their dimensions to improve the predicted distances. As a future improvement, the detected objects could be extended also to vehicles and pedestrians.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>(a) Predictive model for traffic sign distance computation. Input image with bounding boxes undergoes VGG16 feature extraction, ROI pooling for size standardization, and a three-layer feedforward network for distance prediction using softplus activation. (b) Enhanced model integrating depth map information and temporal frame correlation for stabilized predictions. Input image with bounding boxes processed through VGG16, ROI pooling, and a modified three-layer feedforward network, leading to improved distance accuracy.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Schematic representations of the comprehensive distance computation system.</figDesc><graphic coords="3,91.37,202.28,412.51,156.18" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Example of depth map using the MiDaS network.</figDesc><graphic coords="4,89.29,84.19,416.70,76.36" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Example of temporal frame correlation in case of wrong predictions.</figDesc><graphic coords="5,89.29,84.19,416.70,197.19" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Graph of the loss function of the distance prediction network.</figDesc><graphic coords="5,89.29,402.25,203.37,128.95" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 6</head><label>6</label><figDesc>Figure 6: Two examples of predictions: the top image without the depth map and temporal frame correlation, the bottom image using both variations.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head></head><label></label><figDesc>(a) Our meters-RMSE predictions graph (b) Reference paper meters-RMSE predictions graph</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Graphs relating the predictions at given distances (in meters) to the error with respect to the true values.</figDesc><graphic coords="6,90.33,84.19,206.26,76.32" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Comparison of results between our implementation and the one of the paper we take as reference.</figDesc><table><row><cell>Method</cell><cell cols="4">Abs Rel Squa Rel RMSE RMSE(log)</cell></row><row><cell>Our base model</cell><cell>0.131</cell><cell>0.468</cell><cell>3.126</cell><cell>0.173</cell></row><row><cell>Paper base model</cell><cell>0.251</cell><cell>1.844</cell><cell>6.870</cell><cell>0.314</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work has been developed at the is.Lab() Intelligent Systems Laboratory at the Department of Computer, Control, and Management Engineering, Sapienza University of Rome (https://islab.diag.uniroma1.it). The work has also been partially supported by the Italian Ministerial grant PRIN 2022 "ISIDE: Intelligent Systems for Infrastructural Diagnosis in smart-concretE", n. 2022S88WAY -CUP B53D2301318, and by the Age-It: Ageing Well in an ageing society project, task 9.4.1 work package 4 spoke 9, within topic 8 extended partnership 8, under the National Recovery and Resilience Plan (PNRR), Mission 4 Component 2 Investment 1.3-Call for tender No. 1557 of 11/10/2022 of the Italian Ministry of University and Research, funded by the European Union-NextGenerationEU, CUP B53C22004090006.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Analysis pre and post COVID-19 pandemic Rorschach test data using EM algorithms and GMM models</title>
		<author>
			<persName><forename type="first">V</forename><surname>Ponzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Russo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Wajda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Brociek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Napoli</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">3360</biblScope>
			<biblScope unit="page" from="55" to="63" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">An innovative hybrid neuro-wavelet method for reconstruction of missing data in astronomical photometric surveys</title>
		<author>
			<persName><forename type="first">G</forename><surname>Capizzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Napoli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Paternò</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-642-29347-4_3</idno>
	</analytic>
	<monogr>
		<title level="j">LNAI</title>
		<imprint>
			<biblScope unit="page" from="21" to="29" />
			<date type="published" when="2012">7267. 2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A novel convmixer transformer based architecture for violent behavior detection 14126</title>
		<author>
			<persName><forename type="first">A</forename><surname>Alfarano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>De Magistris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Mongelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Russo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Starczewski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Napoli</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-42508-0_1</idno>
	</analytic>
	<monogr>
		<title level="j">LNAI</title>
		<imprint>
			<biblScope unit="page" from="3" to="16" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Learning Object-specific Distance from a Monocular Image</title>
		<author>
			<persName><forename type="first">J</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Fang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Abu-Haimed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K.-C</forename><surname>Lien</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gu</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.1909.04182</idno>
		<idno type="arXiv">arXiv:1909.04182 [cs]</idno>
		<ptr target="http://arxiv.org/abs/1909.04182" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">Technical Report</note>
	<note>type: article</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Distance determination for an automobile environment using Inverse Perspective Mapping in OpenCV</title>
		<author>
			<persName><forename type="first">S</forename><surname>Tuohy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>O'Cualain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Glavin</surname></persName>
		</author>
		<idno type="DOI">10.1049/cp.2010.0495</idno>
	</analytic>
	<monogr>
		<title level="m">IET Irish Signals and Systems Conference (ISSC)</title>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="100" to="105" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Distance Measurement System Based on Binocular Stereo Vision</title>
		<author>
			<persName><forename type="first">X</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Gan</surname></persName>
		</author>
		<idno type="DOI">10.1088/1755-1315/252/5/052051</idno>
		<ptr target="https://doi.org/10.1088/1755-1315/252/5/052051" />
	</analytic>
	<monogr>
		<title level="j">IOP Conference Series: Earth and Environmental Science</title>
		<imprint>
			<biblScope unit="volume">252</biblScope>
			<biblScope unit="page">052051</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">DisNet: A novel method for distance estimation from monocular camera</title>
		<ptr target="https://patrick-llgc.github.io/Learning-Deep-Learning/paper_notes/disnet.html" />
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Traffic Signs Recognition and Distance Estimation using a Monocular Camera</title>
		<author>
			<persName><forename type="first">S</forename><surname>Saleh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Khwandah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Heller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Hardt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mumtaz</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">A comprehensive solution for psychological treatment and therapeutic path planning based on knowledge base and expertise sharing</title>
		<author>
			<persName><forename type="first">S</forename><surname>Russo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Napoli</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">2472</biblScope>
			<biblScope unit="page" from="41" to="47" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">A cloud-based flexible solution for psychometric tests validation, administration and evaluation</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">Lo</forename><surname>Sciuto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Russo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Napoli</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">2468</biblScope>
			<biblScope unit="page" from="16" to="21" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">YOLOv4: Optimal Speed and Accuracy of Object Detection</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bochkovskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H.-Y</forename><forename type="middle">M</forename><surname>Liao</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2004.10934</idno>
		<idno type="arXiv">arXiv:2004.10934</idno>
		<ptr target="http://arxiv.org/abs/2004.10934" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Fast R-CNN</title>
		<author>
			<persName><forename type="first">R</forename><surname>Girshick</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICCV.2015.169</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Computer Vision (ICCV)</title>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="1440" to="1448" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Fast Traffic Sign Recognition Using Color Segmentation and Deep Convolutional Networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Youssef</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Albani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Nardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bloisi</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-319-48680-2_19</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="volume">10016</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Very Deep Convolutional Networks for Large-Scale Image Recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zisserman</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.1409.1556</idno>
		<idno type="arXiv">arXiv:1409.1556</idno>
		<ptr target="http://arxiv.org/abs/1409.1556" />
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Rich feature hierarchies for accurate object detection and semantic segmentation</title>
		<author>
			<persName><forename type="first">R</forename><surname>Girshick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Donahue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Darrell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Malik</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.1311.2524</idno>
		<idno type="arXiv">arXiv:1311.2524</idno>
		<ptr target="http://arxiv.org/abs/1311.2524" />
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Digging Into Self-Supervised Monocular Depth Estimation</title>
		<author>
			<persName><forename type="first">C</forename><surname>Godard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Mac Aodha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Firman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Brostow</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.1806.01260</idno>
		<idno type="arXiv">arXiv:1806.01260</idno>
		<ptr target="http://arxiv.org/abs/1806.01260" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Vision Transformers for Dense Prediction</title>
		<author>
			<persName><forename type="first">R</forename><surname>Ranftl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bochkovskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Koltun</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2103.13413</idno>
		<idno type="arXiv">arXiv:2103.13413</idno>
		<ptr target="http://arxiv.org/abs/2103.13413" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer</title>
		<author>
			<persName><forename type="first">R</forename><surname>Ranftl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lasinger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hafner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Schindler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Koltun</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.1907.01341</idno>
		<idno type="arXiv">arXiv:1907.01341</idno>
		<ptr target="http://arxiv.org/abs/1907.01341" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
