<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Improvement of Precise Vehicle Location in Urban Areas Using Video-based Photogrammetry</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>André Pinhal</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José Gonçalves</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CIIMAR, Interdisciplinary Centre of Marine and Environmental Research</institution>
          ,
          <addr-line>4450-208 Matosinhos</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Porto</institution>
          ,
          <addr-line>Observatório Astronómico, 4430-146 Vila Nova de Gaia</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper presents a system under development at the University of Porto, which integrates a Septentrio Mosaic GO GNSS receiver, with a helical antenna, associated with a GoPro action camera. Video frames can be processed by a Structure from Motion (SfM) approach to align sequential frames, derive the relative positions of the camera projection centres, and eventually create point clouds of the surrounding environment. The system being developed has the camera and the antenna mounted in a block, which is placed on a rear-view mirror of a car. The camera acquires video in 4K mode, at 60 frames per second. Frames are extracted from the video at a frequency between 2 Hz and 10 Hz, depending on the car speed. The camera collects GNSS data with its own navigation receiver, allowing all extracted video frames to be tagged with GPS time and position. Due to the low accuracy of the camera receiver, its positions are discarded and only GPS time is kept. This time is used to synchronize with the much more accurate positions obtained by the RTK receiver, connected to a CORS network. Obstacles present in urban areas often prevent high-accuracy positioning, so a trajectory made within an urban area will have some frames with very precise positions and others with much larger errors. All frames are photogrammetrically processed. Standard deviations of camera positions are considered in the bundle adjustment, allowing for the improvement of the camera positions of lower accuracy. The SfM processing also generates a point cloud of the surrounding objects, which can be densified. Objects identified on this point cloud can be used to assess the location accuracy of the process. Several experiments carried out with the system in the city of Porto confirmed that a positional accuracy of 10 cm can be achieved.</p>
      </abstract>
      <kwd-group>
<kwd>RTK</kwd>
        <kwd>ambiguity fix</kwd>
        <kwd>action camera</kwd>
        <kwd>MMS</kwd>
        <kwd>structure from motion</kwd>
        <kwd>point cloud</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Precise positioning by GNSS RTK (Real-Time Kinematic) is now very common in surveying and in GIS
data collection for 3D city models. There are many surveying solutions for use in mobile mapping
systems (MMS), integrating GNSS positioning, inertial navigation systems (INS) and data acquisition
sensors, such as cameras or laser scanners [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Popular brands, such as Riegl, Leica Geosystems or Trimble,
among others, provide solutions that allow for the generation of very dense and accurate point clouds
in urban environments, but at costs of several hundred thousand euros, and whose use is restricted to
professional markets. Many users will be interested in cheaper systems, making use of the small,
low-cost GNSS receivers now available, which can perform well in urban areas [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. There
are several development devices, costing less than 1000 euros, which track three frequencies and have
the full capacity for high-precision differential positioning, in real time or in post-processing. Regardless of
the additional sensors associated with the GNSS receiver, it is currently possible to have high-precision
kinematic positioning in a motor vehicle with this type of low-cost receiver.
      </p>
      <p>As inevitably happens with GNSS positioning in urban environments, or generally in situations
of major obstruction to signal propagation, RTK positioning with ambiguity fixing (“fix” solution),
with errors of a few centimetres, will not be possible in significant parts of trajectories made in this
type of environment. “Float” solutions will often result, with precision on the order of a few
decimetres, or even “single point” solutions (SPP), with errors of a few metres. This limitation is
normally overcome with inertial navigation systems, significantly increasing the cost of an MMS. The
system under development aims to solve this problem through the use of sequences of images acquired
by a small video camera. Photogrammetric techniques that determine the relative orientation of the
images can be applied to the image sequences, thus contributing to a more complete positioning
solution.</p>
      <p>
        The system is intended to use cameras classified as “action cameras”. These cameras are compact,
rugged, and versatile enough to capture both video and still images in outdoor conditions. They are very
popular among adventure enthusiasts, who are often also interested in geolocating their images and videos.
For this reason, several models incorporate a GPS navigation receiver, allowing photographs, as well as
video, to be geotagged. That is the case of GoPro cameras, since the launch of the
Hero 5 model. These cameras acquire video in MP4 format, which can accommodate GPS data, to be
later extracted using specific software [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] provided by the camera manufacturer [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In this way, the GPS position and GPS time of every individual video frame can be obtained.
      </p>
      <p>
        Photogrammetry has seen major developments in recent years, through the incorporation of
methodologies derived from computer vision. It was mainly the SIFT (Scale-Invariant Feature
Transform) algorithm developed by David Lowe [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] that made the automatic extraction of common
points between images much easier, even for images with variations in orientation and scale. This
availability of conjugate points is combined with bundle adjustment methods that can now handle
large image blocks. The process is also applicable to cameras for which
only approximate internal orientation parameters are known, by integrating a self-calibration
process in the bundle adjustment. This largely automated methodology of image orientation came to be
designated in some communities as “Structure from Motion” (SfM), allowing the processing of aerial or
terrestrial images.
      </p>
      <p>In the present system a video is acquired, and frames are extracted and oriented (“aligned”, in the
terminology associated with SfM) in relative terms. Provided that the coordinates of the projection
centres are known for some of the images, all images in the block will have their positions determined.
In this way it is possible to complete or correct the camera trajectory.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Description of the hardware implemented</title>
      <p>The system described in this paper incorporates a Septentrio Mosaic GO X5 GNSS receiver, which has a
compact and lightweight design, suitable for integration into platforms such as drones and handheld
devices. It supports multiple satellite constellations, including GPS, GLONASS, Galileo and BeiDou,
and performs phase measurements on three different frequencies. The receiver is mounted in a box, which is
attached to the back of the camera, and a helical antenna is placed on top of the box. A small power
bank is mounted next to the receiver, which, once turned on, starts recording a file in the Septentrio
Binary Format (SBF). When connected to a CORS (Continuously Operating Reference Station) network,
the receiver provides reliable real-time kinematic (RTK) positioning. The receiver was used at 10 Hz,
providing positions with enough density to properly model the curved trajectories of a car moving in a city.</p>
      <p>The operation of the system requires the use of a smartphone connected to the internet.
An application runs on the smartphone, which allows the user to control the receiver, receive NTRIP
corrections from a CORS network, carry out the differential correction and monitor the receiver’s
performance. In the present case an Android smartphone was used, running the SW-Maps application
(http://softwel.com.np/mobile_products), which is free software and includes all those capabilities.
Although positions may be recorded in the app, they are retrieved from the receiver memory, in the
form of the SBF file and, for the RTK positions, in NMEA (National Marine Electronics Association) format.</p>
      <p>The camera has a fixing piece so that it can be mounted, together with the receiver, on the side rear-view mirror
of a car. Once the camera and the receiver are turned on, both can be controlled from inside the vehicle
with the smartphone. Figure 1 shows the block mounted on the right rear-view mirror of a car. The
camera optical axis is horizontal, rotated by 15 degrees to the right of the vehicle axis.</p>
      <p>The camera has a GNSS receiver, which works at a frequency of 18 Hz. After the camera is turned
on, the user should check that a GPS position is available and only then start the video recording. The
receiver must be connected beforehand and should, preferably, have an RTK fixed position when the
camera starts the video. Note that there are no electronics establishing a connection between the two
devices. Synchronization between images and RTK positions is done through the GPS time
recorded by the two devices.</p>
      <p>The camera acquires video in 4K resolution, with a dimension of 3840 by 2160 pixels, at a rate of
60 fps (frames per second). Action cameras are known for having a large field of view, at the cost
of significant radial distortion. Although images can be used in this form, preference was given
to the image mode called “linear”, which consists of the application of a general correction model to
produce images with a regular central projection. The resulting image has an equivalent focal distance
of approximately 1800 pixels, still a very wide angle. Small additional corrections to the focal distance,
principal point position and radial distortion coefficients, characteristic of each camera unit, may be
necessary, but these are handled in the processing step, through a self-calibration incorporated in
the SfM bundle adjustment.</p>
      <p>Videos taken from the car are processed to extract navigation information from the camera, which
is carried out by programs from the GPMF library [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Information is extracted in groups of 60 frames,
that is, every 1.001 seconds of video, and includes the video time of each group, position, speed and
UTC time, in seconds of the day. Table 1 shows an example of this information. Positions result from
the camera navigation receiver and are, in fact, discarded. The main information is the video time,
which is transformed into a frame number, and UTC.</p>
      <p>Frames are extracted from the video, in JPEG format, at a cadence chosen so that the overlaps
between consecutive images are adequate for the alignment process to succeed. For the conditions
under which the system was operated, a 4 Hz cadence proved to be adequate. At higher speeds,
and especially when there is rotation, the cadence may be increased. In cases where the vehicle
is stopped, for example at traffic lights, the corresponding frames may be discarded if the GPMF data
reports a speed lower than a tolerance, for example 0.5 m/s.</p>
      <p>
        The positions collected by the GNSS receiver are retrieved in the NMEA format, essentially through
the GGA and GST messages [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which provide UTC time, position and the quality of the position (Q), with
values of 1 for SPP, 4 for FIX and 5 for FLOAT. Standard deviations are also obtained, which will be of a
few centimetres when Q=4. Surveys were carried out with a connection to the Portuguese permanent
station network ReNEP (“Rede Nacional de Estações Permanentes”). Although post-processing can
be done on the SBF data, in general the RTK positioning could be obtained wherever the
obstacles allowed for it. Table 2 shows a sample of data extracted from the NMEA files, with a loss of
ambiguity fix and a sudden increase in the estimated precision.
      </p>
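      <p>For illustration, the fields used from the two sentences can be read as in the following sketch; checksum verification and talker-ID handling are deliberately omitted.</p>
      <preformat><![CDATA[
# Hedged sketch of extracting UTC, position, quality (Q) and standard
# deviations from GGA and GST sentences (checksums ignored).
def parse_gga(f):
    t = f[1]                                   # hhmmss.ss
    utc = int(t[0:2]) * 3600 + int(t[2:4]) * 60 + float(t[4:])
    lat = int(f[2][:2]) + float(f[2][2:]) / 60.0
    lon = int(f[4][:3]) + float(f[4][3:]) / 60.0
    if f[3] == 'S':
        lat = -lat
    if f[5] == 'W':
        lon = -lon
    q = int(f[6])                              # 1 = SPP, 4 = FIX, 5 = FLOAT
    height = float(f[9])                       # orthometric height, metres
    return utc, lat, lon, height, q

def parse_gst(f):
    # standard deviations of latitude, longitude and height, in metres
    return float(f[6]), float(f[7]), float(f[8])

for line in open('session.nmea'):              # file name is illustrative
    fields = line.strip().split('*')[0].split(',')
    if fields[0].endswith('GGA'):
        utc, lat, lon, height, q = parse_gga(fields)
    elif fields[0].endswith('GST'):
        sd_lat, sd_lon, sd_h = parse_gst(fields)
]]></preformat>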
      <p>At this point, a linear interpolation is carried out in order to calculate the camera position. This
requires a previous calibration of the relation between frame number and UTC, which is described in the next
section. There is also a small offset between the antenna phase centre and the centre of the camera,
of approximately 4 cm in the horizontal component and 4 cm in the vertical. In a first approximation
this is not corrected, since it is smaller than what is initially expected for the system accuracy.
The camera GPS receiver has a frequency of 18 Hz, i.e., 3.3 times smaller than the video frame rate.
Assuming an error of 1.5 frames in the synchronization, i.e., 0.025 seconds, for a car moving at a speed of
30 km/h the error corresponds to a distance of about 20 cm. For that reason, we put the expectations at the
level of 1 or 2 decimetres. The procedure for correction of the small lever arm is nonetheless described in
the calibration section.</p>
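      <p>The interpolation itself is straightforward; a minimal sketch with NumPy is given below, assuming the RTK epochs and frame times are already expressed on the same UTC seconds-of-day scale, and that the array names are placeholders.</p>
      <preformat><![CDATA[
# Sketch: interpolate 10 Hz RTK positions to the UTC of each frame,
# using only ambiguity-fixed (Q=4) epochs. Array names are assumptions.
import numpy as np

def camera_positions(frame_utc, rtk_utc, rtk_enh, rtk_q):
    """frame_utc: (n,) UTC per frame; rtk_enh: (m, 3) E, N, h; rtk_q: (m,)."""
    fixed = rtk_q == 4
    t, enh = rtk_utc[fixed], rtk_enh[fixed]
    pos = np.column_stack([np.interp(frame_utc, t, enh[:, k])
                           for k in range(3)])
    # keep only frames within the span of fixed epochs; a stricter check
    # would also reject frames far from the nearest fixed epoch
    valid = (frame_utc >= t[0]) & (frame_utc <= t[-1])
    return pos, valid
]]></preformat>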
      <p>
        Finally, all the selected frames were aligned in photogrammetric software that applies the SfM
concept, in our case Agisoft Metashape [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Interpolated positions are provided only for
those images that had Q=4. A standard deviation of 0.2 m was considered for the least-squares adjustment.
As a result of the bundle adjustment, the positions of all images are obtained. They are then analysed in an
independent manner.
      </p>
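      <p>For reference, this step can be scripted with Metashape’s Python API along the following lines. The 0.2 m accuracy and the restriction to Q=4 positions follow the text; the variable names (frame list, per-camera references) and the coordinate system choice are assumptions of this sketch, not the authors’ actual script.</p>
      <preformat><![CDATA[
# Sketch of the alignment with camera positions as weighted observations,
# using the Metashape Python API. frame_list and references are assumed
# to exist (list of JPEG paths; per-frame (lon, lat, h, is_fix) tuples).
import Metashape

doc = Metashape.Document()
chunk = doc.addChunk()
chunk.addPhotos(frame_list)
chunk.crs = Metashape.CoordinateSystem("EPSG::4326")  # assumed CRS

for camera, (lon, lat, h, is_fix) in zip(chunk.cameras, references):
    if is_fix:                            # only Q=4 interpolated positions
        camera.reference.location = Metashape.Vector([lon, lat, h])
        camera.reference.accuracy = Metashape.Vector([0.2, 0.2, 0.2])
        camera.reference.enabled = True

chunk.matchPhotos(generic_preselection=True, reference_preselection=True)
chunk.alignCameras()
chunk.optimizeCameras()   # bundle adjustment with self-calibration
doc.save("project.psx")   # project name is illustrative
]]></preformat>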
    </sec>
    <sec id="sec-3">
      <title>3. System calibration</title>
      <p>In order to process the image data and RTK positions collected, some calibration steps are necessary,
especially regarding the times to be assigned to the frames extracted from the video.</p>
      <sec id="sec-3-1">
        <title>3.1. Time calibration</title>
        <p>As seen in Table 2, the UTC times do not correspond to the exact moments of the first frame of each
block of 60 frames, since they are not equally spaced. A linear relationship was therefore established between
the frame number, N<sub>f</sub>, and UTC time, through
<disp-formula id="eq1"><label>(1)</label><tex-math><![CDATA[\mathrm{UTC} = A_0 + A_1 N_f,]]></tex-math></disp-formula>
where A<sub>0</sub> is the time of the first frame (counting from frame zero) and A<sub>1</sub> is the frame
period (the inverse of the frame rate). This allows a more reliable value of the UTC time of the first frame
to be determined.</p>
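        <p>The two coefficients can be estimated by a simple least-squares line fit over all (N<sub>f</sub>, UTC) pairs, as in this minimal sketch.</p>
        <preformat><![CDATA[
# Minimal sketch: estimate A0 (UTC of frame zero) and A1 (frame period)
# by least squares from the per-group (frame number, UTC) pairs.
import numpy as np

def calibrate_time(frame_numbers, utc_seconds):
    A1, A0 = np.polyfit(frame_numbers, utc_seconds, 1)  # slope, intercept
    return A0, A1

def frame_utc(n_f, A0, A1):
    return A0 + A1 * n_f   # equation (1)
]]></preformat>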
        <p>An independent validation of this assessment of the time of the first frame was done with a small
flashlight, connected to the GNSS receiver, fired in front of the camera. A precise time of the flash event
is recorded on the GNSS receiver, and with a few flashes along a video, the calibration can also be done
with the same formula. Very similar results were obtained, with differences in the time of the first frame
of around 20 ms, i.e., only slightly more than one frame. Figure 2 shows an image of the flash in a frame.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Lever arm between receiver and camera</title>
        <p>As referred to before, there is a lever arm between the antenna and the camera, which is of only a few
centimetres and, in a first approach, is not considered. There is no attitude determination, so
the coordinate transfer must be done using the orientation of the trajectory. As the vehicle
moves approximately in a horizontal plane, the azimuth of the trajectory can be estimated and the
corresponding rotation applied to the vector between antenna and camera. In the vertical component
there is only the need to subtract the height difference between the phase centre and the camera. The
actual projection centre position is not known, but since the focal distance of the lens is only 3 mm, it
was assumed to be at the centre of the lens, outside the camera body.</p>
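        <p>A minimal sketch of this reduction is given below; the azimuth is taken from consecutive trajectory points, and the signs of the offset components depend on the actual mounting geometry, which is assumed here.</p>
        <preformat><![CDATA[
# Sketch of the lever-arm reduction: rotate the antenna-to-camera offset
# by the trajectory azimuth and subtract the height difference. The
# along/cross-track signs are assumptions about the mounting geometry.
import numpy as np

ALONG, CROSS = 0.04, 0.0   # horizontal offset components, metres (~4 cm)
DELTA_H = 0.04             # antenna phase centre above camera, metres

def camera_from_antenna(e, n, h, e_next, n_next):
    az = np.arctan2(e_next - e, n_next - n)   # azimuth from the motion
    de = ALONG * np.sin(az) + CROSS * np.cos(az)
    dn = ALONG * np.cos(az) - CROSS * np.sin(az)
    return e + de, n + dn, h - DELTA_H
]]></preformat>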
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Assessment of system performance</title>
      <p>The system was tested in some urban environments, in a first approach in areas without extreme
signal-capture difficulties. A route was taken in an urban residential area of the city of
Porto, without tall buildings, but with trees along the streets. The route began and ended at the same
point, included some crossings, and repeated some sections; it had a total length of 3.2 km. Images
were extracted at a rate of 4 Hz, for a total of 2464 images. Figure 3
shows, on the left side, the path followed, over the Google Maps image base, and on the right, two
examples of frames from the video, one in a more unobstructed area and another with more tree cover.</p>
      <sec id="sec-4-1">
        <title>4.1. Image alignment</title>
        <p>After interpolation of RTK positions for all images, it was observed that 72% were FIX-type positions.
The images were loaded into the Agisoft Metashape program and a first alignment was made, at this
step without coordinates of the projection centres. All were successfully aligned. The coordinates of
the images that had FIX were then inserted, with an a priori accuracy value of 0.2 metres in the three
coordinates. The bundle adjustment was reprocessed, resulting in adjusted coordinates for all images.
The effect of the trajectory quality improvement is observed in places where the positions were not FIX.
Figure 4 shows, on the left, an area where there is an interruption in the FIX positions (blue dots), over a
total of nearly 40 images. After aligning the images, the positions were regularized, resulting in a much
smoother trajectory that was in line with expectations.</p>
        <p>Although there is a visible qualitative improvement in the trajectory, and a very small change in
the sections where there was FIX, this assessment is somewhat subjective, so a numerical evaluation
of the error is preferable.</p>
        <p>The simplest way to make this assessment is through checkpoints, whose coordinates can be
determined photogrammetrically from the images. A total of 14 well-defined points were selected
and identified in the final photogrammetric project, on the images where they are
observed with the highest quality. This yields the coordinates of these points. Subsequently, the
points were surveyed on the terrain, with GNSS, and the corresponding three-dimensional errors were
evaluated. Figure 5 shows the location of the checkpoints and the location of two of them in the images.</p>
        <p>Errors were calculated in the three coordinates (e<sub>x</sub>, e<sub>y</sub>, e<sub>z</sub>), in centimetres, and the corresponding
statistics are presented in Table 3. They are: the average error (AVG), the root mean square error (RMSE)
and the maximum absolute error (MAX), according to
<disp-formula id="eq2"><label>(2)</label><tex-math><![CDATA[\mathrm{AVG} = \frac{\sum e_i}{n}, \qquad \mathrm{RMSE} = \sqrt{\frac{\sum e_i^2}{n}}, \qquad \mathrm{MAX} = \max_i |e_i|, \qquad i = 1, \ldots, n.]]></tex-math></disp-formula></p>
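        <p>As a worked counterpart to equation (2), the three statistics for one error component can be computed as in this short sketch (the individual checkpoint errors themselves are not reproduced here).</p>
        <preformat><![CDATA[
# Statistics of equation (2) for one error component (e.g. ex).
import numpy as np

def error_stats(e):
    e = np.asarray(e, dtype=float)
    avg = e.mean()                    # AVG
    rmse = np.sqrt((e ** 2).mean())   # RMSE
    emax = np.abs(e).max()            # MAX
    return avg, rmse, emax
]]></preformat>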
        <p>Average errors are small, not evidencing systematic trends. The RMSEs are of the order of one
decimetre, in agreement with the initial expectation. These first tests are quite promising.
More performance assessments of the system, under diverse conditions, will be carried out in the near future.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and future work</title>
      <p>A positioning system for the precise location of vehicles in urban environments is under
development at the University of Porto. The system integrates video imagery processed by SfM in
order to contribute to an integrated trajectory solution. This makes it possible to fill the positioning gaps
that occur in the urban environment. Initial tests point to a possible accuracy at the decimetre level. More tests
will be carried out in order to assess system performance in a diversity of environments with strong
obstructions, both in urban areas and in forested environments.</p>
      <p>Several improvements to the system will be developed soon, namely in the temporal
synchronization process, for example using new models of action cameras with higher frame rates. It is
also intended to correctly reduce the GNSS position to the camera projection centre.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was developed within project 4Map4Health (CHIST-ERA/0006/2019), financed by the
Portuguese Foundation for Science and Technology, under the ERA-NET CHIST-ERA programme.</p>
      <p>The GNSS RTK positioning was done with the ReNEP permanent stations of the Directorate General
for Territorial Development (DGT).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Elhashash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Albanwan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <article-title>A Review of Mobile Mapping Systems: From Sensors to Applications</article-title>
          ,
          <source>Sensors</source>
          <volume>22</volume>
          (
          <year>2022</year>
          )
          <fpage>4262</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Hamza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stopar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Sterle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pavlovčič-Prešeren</surname>
          </string-name>
          ,
          <article-title>Low-Cost Dual-Frequency GNSS Receivers and Antennas for Surveying in Urban Areas</article-title>
          ,
          <source>Sensors</source>
          <volume>23</volume>
          (
          <year>2023</year>
          )
          <fpage>2861</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Petroskey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Funk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. A.</given-names>
            <surname>Tibavinsky</surname>
          </string-name>
          ,
          <article-title>Validation of Telemetry Data Acquisition Using GoPro Cameras</article-title>
          ,
          <source>Technical Report, SAE Technical Paper</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] GoPro, Metadata Format - GPMF, processing software available at https://github.com/gopro/gpmf-parser
          <source>(v2.2.1)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Lowe</surname>
          </string-name>
          ,
          <article-title>Distinctive image features from scale-invariant keypoints</article-title>
          ,
          <source>International Journal of Computer Vision 60</source>
          (
          <year>2004</year>
          )
          <fpage>91</fpage>
          -
          <lpage>110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ardalan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Awange</surname>
          </string-name>
          ,
          <article-title>Compatibility of NMEA GGA with GPS receivers implementation</article-title>
          ,
          <source>GPS Solutions 3</source>
          (
          <year>2000</year>
          )
          <fpage>1</fpage>
          -
          <lpage>3</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] Agisoft,
          <article-title>Agisoft Metashape User Manual, Professional Edition, Version 2.1</article-title>
          , available at https://www.agisoft.com/pdf/metashape-pro_2_1_en.pdf,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>