<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>A Review of Event-Based Indoor Positioning and Navigation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chenyang Shi</string-name>
          <email>shicy@buaa.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ningfang Song</string-name>
          <email>Songnf@buaa.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenzhuo Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuzhen Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Boyi Wei</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hanxiao Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jing Jin</string-name>
          <email>jinjing@buaa.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Instrumentation and Opto-electronics Engineering, Beihang University</institution>
          ,
          <addr-line>Beijing, 100191</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>81</volume>
      <issue>98</issue>
      <fpage>2402</fpage>
      <lpage>2412</lpage>
      <abstract>
        <p>Event cameras are neuromorphic vision sensors that work differently from frame-based cameras. Instead of outputting global images of the scene at a fixed frequency, event cameras generate pixel-wise output asynchronously under illumination changes. Event cameras have desirable features that make them suitable for indoor navigation and positioning: high dynamic range, high temporal resolution (and consequently less motion blur) and low power consumption. However, as conventional algorithms are no longer valid for event cameras, they call for new methods to exploit their potential. This paper thus surveys sensors and algorithms for event-based navigation and positioning. We investigate event cameras (also known as Dynamic Vision Sensors), including their working principle, development trends and an overview of recently available sensors. We also summarize event-based algorithms that have maximized the superiority of event sensors in terms of ego-motion estimation, tracking and depth estimation. In the end, we discuss the advantages, challenges, hardware requirements and future of event-camera applications in indoor navigation and positioning.</p>
      </abstract>
      <kwd-group>
        <kwd>Event camera</kwd>
        <kwd>event-based vision</kwd>
        <kwd>indoor positioning</kwd>
        <kwd>indoor navigation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Event cameras are bio-inspired vision sensors. They respond to relative light changes in the natural
world in an asynchronous and sparse way, completely subverting the global-exposure imaging mode of
standard cameras. The event stream they output is fundamentally different from frames, boasting high
temporal resolution, high dynamic range and low power consumption. Thus, in many scenarios, event cameras are
an alternative to traditional cameras. Recent studies have shown that event cameras outperform
standard cameras in challenging positioning and mapping scenarios. The event stream naturally reflects the
edges of scenes while maintaining a low data rate, providing a new option for indoor positioning and navigation
tasks that require high real-time performance. However, there are still many challenges and difficulties
to be solved in practice. Therefore, we conduct a detailed investigation and discussion on the application
of event cameras in positioning and navigation to further tap the potential of event cameras and provide
researchers with ideas to solve the current difficulties encountered in this field.</p>
      <p>
        Currently, the main applications of event cameras are object detection and recognition [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], feature
extraction and tracking [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ], motion estimation [
        <xref ref-type="bibr" rid="ref3 ref5">3, 5</xref>
        ], pose estimation [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], depth estimation [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ],
video interpolation [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ], super-resolution [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ], 3D reconstruction and mapping [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ], etc. The
survey in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] reviewed the main applications and the development of event cameras. Different
from [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], our review focuses on the application of event cameras in visual navigation and positioning,
where the rationale for applying event cameras will be illustrated.
      </p>
      <p>Outline: The rest of the paper is organized as follows. Section.2 introduces the principle of event
cameras. Section.3 reviews algorithms for event-based ego-motion estimation and discusses their
superiority in complex conditions. Section.4 reviews event-based Visual Odometry (VO) and Visual
Inertial Odometry (VIO) for pose estimation and tracking and discusses the development tendency of
these methods. Section.5 discusses methods for event-based mapping, including depth estimation and
3D reconstruction. Section.6 summarizes the datasets for evaluating the performance of these methods.
The paper ends with a discussion (Section.7) and a conclusion (Section.8).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Event Camera</title>
      <p>
        Inspired by biological vision, event cameras have a completely different working mechanism compared
with traditional cameras [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Event cameras, also known as Dynamic Vision Sensors (DVSs), no longer
measure the "absolute" brightness at a constant rate, but asynchronously measure the brightness change
per pixel [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ]. Each pixel works independently. Once the brightness change of a pixel exceeds the
threshold, an event will be output at the pixel location without waiting for global exposure, which
guarantees the low-latency feature. Additionally, this working mechanism fundamentally removes the
constraint of frame rate, leading to faster response to brightness changes (temporal resolution up to 1 MHz) and higher
dynamic range (up to 120 dB), making it capable of imaging extremely fast motion in bright or dark
environments. Moreover, as event cameras only transmit brightness changes, no output is generated
without relative displacement or a change of light between the environment and the camera, which largely
eliminates redundant data and reduces the transmission bandwidth and power consumption.
      </p>
      <p>
        The output of a DVS is an event stream. An event is represented as a tuple e = (x, y, p, t), in which
t represents the time when the brightness change occurs, recorded with microsecond resolution and
high sensitivity; the coordinates (x, y) are the position of the pixel where the brightness change occurs;
polarity p indicates the direction of brightness change [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. If the brightness increase exceeds the
threshold, the polarity is +1 (ON Event). If the brightness decrease exceeds the threshold, the polarity is
–1 (OFF Event).
      </p>
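      <p>The triggering condition can be written compactly as follows (a standard formulation in the event-camera literature; the notation, including the contrast threshold C, is ours rather than taken from a specific cited work):</p>
      <disp-formula>
        <tex-math><![CDATA[
\Delta L(x, y, t) = \log I(x, y, t) - \log I(x, y, t - \Delta t), \qquad
e = (x, y, p, t) \ \text{is emitted when}\ p\,\Delta L(x, y, t) \ge C,\quad p \in \{+1, -1\},
        ]]></tex-math>
      </disp-formula>
      <p>where Δt is the time elapsed since the last event at the same pixel.</p>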
      <p>Specifically, DVSs refer to sensors whose pixels only contain circuits that trigger events, the
mechanism of which is shown in Figure. 1. DAVISs, however, refer to sensors whose pixels can also
carry out global exposure.</p>
      <p>When the change in log intensity reaches the ON threshold (upper boundary) or the OFF threshold
(lower boundary), the comparator outputs an ON or OFF event signal, which is connected with the global
exposure signal and the Address-Event-Representation (AER) handshake circuit through an OR gate. The
row request signal is then output through the handshake circuit. When the row request is answered, the
column request signal is sent, and the column response signal is returned through the decision tree. The
event is read out and the pixel coordinates are obtained through the address encoder.</p>
      <p>In conclusion, the superiorities of the DVS make it especially suitable for intelligent systems such as
Unmanned Aerial Vehicles (UAVs), aircraft, missiles, smart projectiles and high-speed robots carrying out
tasks such as target detection and tracking, motion estimation and autonomous navigation in indoor and
outdoor environments.</p>
      <p>
        With years of development, the dynamic vision sensor has made progress towards higher resolution,
smaller pixel size and higher readout speed [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. At present, the resolution of the mainstream DVS has
reached 1 million pixels, with multiple modes such as grayscale mode, dynamic mode and optical flow mode.
Table.1 compares the parameters of the latest dynamic vision sensors with a traditional
image sensor.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Ego-motion estimation</title>
      <p>Ego-motion estimation recovers, with high accuracy, the state of a vision sensor from its output images of
the scene. Considering the dimensionality, the 2D problem aims at solving 3-DOF (Degree
of Freedom) motion (2-DOF translation plus 1-DOF rotation, or 3-DOF pure rotation), and the 3D problem
tackles the estimation of arbitrary 6-DOF motion.</p>
      <p>Frame-based ego-motion estimation is mostly realized through either filter-based or optimization-based
algorithms. Filter-based algorithms were the earliest applied in positioning and navigation, among
which the most widely used is the Extended Kalman Filter (EKF). They are incremental methods, where
the current camera state is considered relevant only to the camera state one timestamp ahead. This
presumption makes them suitable for small amounts of data yet is rather idealized in real situations. In
contrast, optimization-based algorithms are batch methods that consider all state estimation results
within a preceding interval to estimate the current camera state. They incorporate more information and
have proved to be more robust and accurate.</p>
      <!-- Figure 1 (residue removed): (a) the difference between the imaging mechanisms of a DVS and a standard camera; (b) the event-triggering principle of a DVS. Table 1 (residue removed): parameters of recent dynamic vision sensors (DAVIS346 and DVXplorer by IniVation, DVS Gen4 [22] by Prophesee, a Samsung sensor, the EB sensor [21] and CeleX-V [23] by CelePixel) compared with a traditional image sensor. -->
      <p>Event-based ego-motion estimation is carried out following two event processing patterns: (1)
processing event-by-event and (2) processing on groups of events. Event-by-event-based methods enable
every event to asynchronously update the system state, preserving the inherent high temporal resolution
of event sensors. However, an individual event fails to depict the change of the whole scene and may
suffer from strong noise signals. Therefore, it is reasonable to update the camera state with forms of
event groups, such as event maps (EM), time surfaces (TSs), event frames, voxel grids and so on.</p>
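      <p>As a concrete illustration of two such groupings, the following sketch (our own illustrative code, not taken from any cited work; the grid size and decay constant are assumptions) accumulates a polarity-signed event frame and an exponential-decay time surface from a list of (x, y, p, t) tuples with timestamps in microseconds:</p>
      <preformat><![CDATA[
import numpy as np

def events_to_frame_and_time_surface(events, shape=(480, 640), tau=50e3):
    """Build an event frame (signed polarity accumulation per pixel) and a
    time surface (exponential decay of the most recent timestamp per pixel)."""
    H, W = shape
    frame = np.zeros((H, W))
    last_ts = np.zeros((H, W))
    for x, y, p, t in events:                 # events ordered by timestamp
        frame[int(y), int(x)] += p            # ON events add +1, OFF events add -1
        last_ts[int(y), int(x)] = t           # keep the latest timestamp per pixel
    t_ref = max(t for _, _, _, t in events)   # reference time for the time surface
    time_surface = np.exp(-(t_ref - last_ts) / tau)
    return frame, time_surface
      ]]></preformat>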
      <p>Under these patterns, the estimation problem is usually addressed within three kinds of frameworks:
filter-based, optimization-based and Artificial Neural Network (ANN)-based frameworks. An overview
of recent works on event-based ego-motion estimation can be seen in Table.2.</p>
      <sec id="sec-3-1">
        <title>3.1. Filter-based framework</title>
        <p>Probabilistic (Bayesian) filters, including Kalman filters, EKFs and particle filters (PFs), update the present
camera state from prior states.</p>
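        <p>In generic Bayes-filter form (a standard recursion, stated here for illustration rather than taken from any specific cited work), the posterior over the camera state given all measurements so far is obtained by propagating the previous posterior through the motion model and correcting it with the latest measurement, which in event-by-event processing can be a single event:</p>
        <disp-formula>
          <tex-math><![CDATA[
p(\mathbf{x}_k \mid \mathbf{z}_{1:k}) \;\propto\; p(\mathbf{z}_k \mid \mathbf{x}_k) \int p(\mathbf{x}_k \mid \mathbf{x}_{k-1})\, p(\mathbf{x}_{k-1} \mid \mathbf{z}_{1:k-1})\, d\mathbf{x}_{k-1}.
          ]]></tex-math>
        </disp-formula>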
        <p>Probabilistic filters have grown to be major pose estimation methods in event-by-event processing
scenarios, because they naturally fit the characteristics of events: (1) filters operate on the asynchronous data of
events, preserving high temporal resolution, and (2) filters remain applicable under the limited
computing resources available for event processing. [24] proposed the first 6-DOF high-speed camera tracking algorithm in
random natural scenes. A robust filter combining Bayesian estimation and posterior approximation of a
distribution in the exponential family was put forward, enabling event-by-event pose updates from an
existing photometric map of the scene. This work revealed 6-DOF high-speed tracking capabilities of
event-based methods and freed the tracking algorithm from limitations of scene texture.</p>
        <p>In recent years, the filter-based framework has also become workable for groups of events with the
contribution of event outlier rejection techniques. For instance, [25] presented an EKF that updated the
camera pose for event packets collected within small temporal windows of 100. This was made possible
by an event-to-line matching which validated or discarded events quickly before they were stacked for
estimation.</p>
        <p>To summarize, filter-based methods suit the asynchronous nature of events and are applicable to
both event-by-event and event-group algorithms. Broadly speaking, they appear to be used less than
other methods, especially in complex scenes, due to the considerable resources they consume to
compute and store camera states.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Optimization-based framework</title>
        <p>Optimization is another dominant means of ego-pose estimation, which is mostly carried out on event
groups. In practice, the optimization of the camera pose is realized through the optimization of a loss
function, which takes on different forms for different algorithms and optimization objectives [47].</p>
        <p>For example, [28] tracked an event camera under a maximum likelihood optimization framework from a
photometric 3D map. The optimization objective was the error between the measured intensity change
from event frames and the predicted intensity change calculated from the given photometric 3D map.
[30] presented an enhanced motion tracker that first used a TS-based method in all circumstances, and
then applied an EM-based method to optimize pose parameters when the optimization problem might
degenerate. [33] proposed a 2D translation velocity estimation algorithm which could be seen as the
back-end of a VIO system. The loss function was built on the so-called Continuous Event-Line
Constraint that describes the relationship between line projections from events and the ego-motion of the
event camera. The optimization objective was the geometric distance between the reprojected 3D line
and the events.</p>
        <p>Event-based optimization algorithms depart from conventional frame-based algorithms in that most
involve motion compensation to eliminate noise and motion blur from accumulated event groups. In
motion compensation algorithms, events are assumed to be triggered on the pixels that an edge
moves across. [35] put forward the first unifying framework of motion compensation on the assumption
that the ego-motion is uniform within a small time interval. It made a representative contribution, the
Contrast Maximization (CMax) framework, which produced motion-compensated edge-like event images
for 6-DOF camera pose estimation. It estimated the parameters of the motion that best fit a group of
events by warping events to a reference time and maximizing their alignment, producing a sharp image
of warped events (IWE). This fundamental framework was later refined by several works [36, 37, 38].
[39] was another milestone that proposed the Entropy Minimization (EMin) framework. It estimated
motion directly in 3D space rather than projecting events onto image planes as CMax [35] does. Therefore,
it can solve motion problems in arbitrary dimensions by optimizing a family of entropy loss functions for
the minimal dispersion. [26] addressed ego-motion estimation with a novel probabilistic approach that
modeled event alignment as a spatio-temporal Poisson point process. Camera rotation was estimated by
maximizing the joint probability of events, which achieved higher accuracy than the CMax [35], AEMin [39]
and EMin [39] models in most scenarios.</p>
        <!-- Figure 2 (residue removed): a typical ANN-based estimation pipeline, in which asynchronous events (x, y, p, t) are converted over an integration time into synchronous event frames, features and vectorized descriptors are extracted, and rotation is estimated. -->
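        <p>The core of the CMax idea can be sketched compactly for pure rotational motion (our own simplified sketch under a small-time-interval assumption, not the implementation of [35]; the intrinsics K and image size are assumed to be given):</p>
        <preformat><![CDATA[
import numpy as np

def contrast_of_warped_events(omega, events, t_ref, K, shape):
    """Warp events (rows of x, y, p, t) to a reference time under a constant
    angular velocity omega = (wx, wy, wz) and return the variance (contrast)
    of the resulting image of warped events (IWE)."""
    fx, fy, cx, cy = K
    xs, ys, ts = events[:, 0], events[:, 1], events[:, 3]   # polarity is ignored here
    X, Y = (xs - cx) / fx, (ys - cy) / fy                    # normalized image coordinates
    wx, wy, wz = omega
    # Rotational optical-flow field of a pinhole camera (standard motion-field model).
    u = X * Y * wx - (1.0 + X ** 2) * wy + Y * wz
    v = (1.0 + Y ** 2) * wx - X * Y * wy - X * wz
    dt = ts - t_ref
    xw = np.clip(np.round(xs - fx * u * dt), 0, shape[1] - 1).astype(int)
    yw = np.clip(np.round(ys - fy * v * dt), 0, shape[0] - 1).astype(int)
    iwe = np.zeros(shape)
    np.add.at(iwe, (yw, xw), 1.0)          # accumulate warped events
    return iwe.var()                        # sharper (better aligned) IWE gives higher contrast

# The best-fitting rotation maximizes the contrast, e.g. via
# scipy.optimize.minimize(lambda w: -contrast_of_warped_events(w, ev, t0, K, shape), x0=np.zeros(3)).
      ]]></preformat>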
        <p>In general, optimization-based methods are considered the most widely adopted for event-based
ego-motion estimation. Optimization of the camera pose is implemented by minimizing specific loss
functions with the help of optimizers. Future works in this field may follow similar paths as prior works:
refining objective functions, inventing motion compensation methods for events to better depict the
change of scene, and upgrading existing algorithms to higher dimensions of motion.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. ANN-based framework</title>
        <p>As deep-learning technologies have flourished in recent years, they have been widely applied to ego-motion
estimation. [48] introduced the first deep learning framework to retrieve the 6-DOF camera pose from
a single frame. This groundbreaking work found that, compared to conventional key-point methods,
using a Convolutional Neural Network (CNN) to learn deep features appeared to be more robust in
challenging scenarios such as noisy or unclear images. This conclusion boosted the development of ANN
architectures for 6-DOF pose estimation in computer vision.</p>
        <p>Accordingly, multi-layer ANNs have grown to be another mainstream method for ego-motion estimation
from events; a typical structure is shown in Figure.2. They train their networks to optimize loss functions
that include the state parameters of the camera (also discussed in [47]). One of the representative works using
an event-by-event deep learning method was [40], which applied an event-based on-chip Spiking Neural
Network (SNN) to the estimation of the 2-DOF head pose of the iCub robot. ANN-based methods operating on
event groups are too abundant to list, and most of them were supervised. The works of [41] and [43] both
used event frames as the neural network input, differing in that [41] stacked events in dual channels of
opposite polarities while [43] accumulated events in one single channel. Unlike prior works that only
used a CNN or LSTM to obtain depth and geometry information, the network in [43] was composed of both
a CNN to learn deep features from the event frames and a stack of LSTMs to learn spatial dependencies in
the image feature space, outperforming the state of the art in pose estimation in general and challenging
circumstances with short inference time. In recent years, unsupervised ANNs with loss functions built
without restrictive conditions were also developed to solve event-based ego-motion estimation tasks.
The earliest works that adopted a self-supervised or unsupervised manner still relied on input resources other than
events, like greyscale images [49], or auxiliary assumptions, like the photoconsistency assumption [50].
The most recent works have realized unsupervised ANNs that only take events as input [44, 45].</p>
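        <p>A minimal sketch of such a network (our own illustration of the dual-channel event-frame idea, written in PyTorch; the layer sizes and the axis-angle output parameterization are assumptions, not the architecture of any cited work) regresses a 6-DOF pose from a two-channel ON/OFF event frame:</p>
        <preformat><![CDATA[
import torch
import torch.nn as nn

class EventPoseNet(nn.Module):
    """CNN that regresses a 6-DOF pose (3-D translation + 3-D axis-angle rotation)
    from a 2-channel event frame with ON and OFF polarities stacked separately."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(128, 6)

    def forward(self, event_frame):             # event_frame: (batch, 2, H, W)
        features = self.encoder(event_frame).flatten(1)
        return self.head(features)              # (batch, 6) pose vector

# Example: pose = EventPoseNet()(torch.zeros(1, 2, 180, 240))
      ]]></preformat>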
        <p>To conclude, multi-layer ANNs with architectures like CNNs and SNNs have been shown to perform well
in event-based ego-motion estimation. Either raw events or event groups were fed into ANNs, which
then regressed the camera pose. The birth of unsupervised networks that take pure events as input
has further simplified the problem. It is expected that novel networks will be designed to fully exploit
the advantages of different architectures.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Event-based tracking</title>
      <p>Estimating the pose and trajectory of a rigid body robustly and accurately is the first step towards
positioning and mapping. Vision sensors output refined textures of the scenes for 6-DOF motion
estimation. To achieve that, VO and VIO frameworks were proposed. However, under the restrictions of the
global exposure mechanism of standard cameras, frame-based VO often suffers from motion blur, especially
under rotational, high-speed and highly agile motion. Adding an Inertial Measurement Unit (IMU) to
VO increases the robustness of the system. In a tightly or loosely coupled VIO framework, the triaxial angular
velocity and acceleration output by the IMU provide pose estimates when feature tracking fails, and
visual features correct the drift of the IMU. Unfortunately, when feature tracking fails for a long time, the
drift cannot be corrected. In conclusion, robust visual information output is the key component of a VO or
VIO system.</p>
      <p>Event cameras can output information continuously in high temporal resolution without any motion
blur. Event-based VO and VIO are thus explored to deal with the problem of pose and trajectory estimation
in challenging scenes.</p>
      <sec id="sec-4-1">
        <title>4.1. Event-based visual odometry</title>
        <p>Visual Odometry is a dominant approach to estimating the pose and trajectory using the visual features of
scenes. If the real depth of visual features in the scene is estimated at the same time, a global map can
then be built, namely Simultaneous Localization and Mapping (SLAM). Similar to frame-based VO, two
configuration schemes of event-based VO are generally considered, namely monocular and stereo VO.
The majority of research focuses on monocular event-based VO, because this configuration
is simpler than the stereo scheme and comparable in terms of accuracy.</p>
        <sec id="sec-4-1-1">
          <title>4.1.1. Angular velocity and rotation estimation</title>
          <p>Angular velocity and rotation estimation are fundamental processes of VO. An angular velocity estimation
method was presented in [51], confirming that event cameras are capable of estimating the 3D rotational
motion of a rigid body. Currently, learning-based and optimization methods dominate this field. For the
learning-based methods, SNNs [52] have been introduced to this task and are comparable to ANN-based
methods. For the optimization methods, CMax [53] and Rodrigues' rotation formula [54] are introduced
as objective functions for optimization.</p>
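          <p>For reference, Rodrigues' rotation formula maps a unit rotation axis k and an angle θ to a rotation matrix (we quote the standard form here for completeness; it is not reproduced from [54]):</p>
          <disp-formula>
            <tex-math><![CDATA[
R = I + \sin\theta\,[\mathbf{k}]_\times + (1 - \cos\theta)\,[\mathbf{k}]_\times^{2},
            ]]></tex-math>
          </disp-formula>
          <p>where [k]× denotes the skew-symmetric cross-product matrix of k.</p>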
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. Monocular and stereo visual odometry</title>
          <p>In this task, optimization and filter-based methods are the mainstream. Filter-based methods were the first to
be proposed. In [55], an event-based VO named EVO was presented, which is considered the first
SLAM system that depends only on an event camera. The system transformed event streams into event
frames and tracked the poses by optimizing the error between the event image and a semi-dense map using
the inverse compositional Lucas-Kanade (L-K) method. A semi-dense 3D map was constructed by
Event-based Multi-View Stereo (EMVS), a geometric 3D reconstruction method. Optimization-based methods
are currently the most dominant. The choice of the objective function for optimization is the core
difference among these methods; examples include reprojection error minimization [56, 57], spatiotemporal
registration [58] and CMax [59]. It is worth noting that [59] presented an event-based VO called
ETAM using continuous ray warping and volumetric contrast maximization. It extended CMax into 3D
estimation, in which the target of optimization was maximizing the variance of the volume of warped events,
yielding the sharpest warped event frame. It then built a VO consisting of single-frame optimization
as the front-end based on CMax and a global optimization using a B-spline curve model as the back-end. In
addition, there are methods [32, 60] that utilize Time Surface Maps (TSMs) to build maps and track poses
while performing depth estimation.</p>
          <p>In summary, precise pose estimation and tracking form the front-end of event-based VO, and
optimization forms the back-end. The key step of event-based tracking is motion compensation, and the majority of
the aforementioned event-based VO systems selected optimization methods to achieve it. CMax and nonlinear
optimization have become mainstream in recent years, because filter-based methods inefficiently occupy large
storage for saving the landmarks of a map. Specifically, event-based tracking computes
an image of warped events and sharpens the image by optimization. The sharpness of the warped image
reflects the accuracy of pose tracking. Thus, the objective function and the optimization tool are critical
research topics. However, current event-based VO and VIO still follow the typical frameworks designed
for frame-based VO and VIO, so the unique characteristics of event cameras are still not fully manifested in
current processing.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Event-based visual inertial odometry</title>
        <p>Visual Inertial Odometry is based on VO, adding an IMU as a component of the pose and trajectory estimation
system. VIO surpasses VO in terms of accuracy and robustness on most occasions. Generally, VIO
comprises two parts, namely the front-end and the back-end. The front-end decides the data format
of the visual information, such as event frames or time surfaces, and extracts visual features as the input of
the back-end. The back-end refers to the fusion of visual information and IMU measurements;
the main approaches are filter-based methods [61], probabilistic methods [62] and optimization methods
[63, 64, 65, 66]. Typically, [64] presented an approach for tightly-coupled VIO named UltimateSLAM,
combining events, images and IMU measurements. To synchronize the two vision sensors, this approach
accumulated events into event frames at the same timestamps as the standard frames and motion-compensated
the event frames. It tracked the features of event frames and standard frames using the FAST corner detector
[67] and the L-K tracker [68], respectively. If the features could be triangulated and belonged to key-frames,
it fused these features and IMU measurements with nonlinear optimization, yielding the
pose and trajectory estimates. This pipeline was demonstrated in real time on a light-weight quadrotor
system.</p>
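        <p>The feature-tracking part of such a front-end can be sketched with standard OpenCV primitives (our own illustrative sketch of the detect-and-track step only, not the UltimateSLAM implementation; the detector threshold and corner budget are assumptions):</p>
        <preformat><![CDATA[
import cv2
import numpy as np

def detect_and_track(prev_img, next_img, max_corners=100):
    """Detect FAST corners on one 8-bit frame (an event frame or a standard frame)
    and track them into the next frame with a pyramidal Lucas-Kanade tracker."""
    fast = cv2.FastFeatureDetector_create(threshold=20)
    keypoints = fast.detect(prev_img, None)
    pts = np.float32([kp.pt for kp in keypoints[:max_corners]]).reshape(-1, 1, 2)
    next_pts, status, _err = cv2.calcOpticalFlowPyrLK(prev_img, next_img, pts, None)
    ok = status.ravel() == 1
    return pts[ok], next_pts[ok]    # matched feature pairs for triangulation / optimization
      ]]></preformat>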
        <p>The original intention of adding an IMU was to enhance the robustness of frame-based VO, because the IMU
can still maintain data output when standard cameras suffer from motion blur. The addition of the IMU
has also improved the robustness of event-based VO. However, event cameras can work stably in rotating
and high-speed scenes. Therefore, in theory, event cameras can perform without the addition of an IMU.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Event-based mapping</title>
      <p>Mapping is the final goal of SLAM. In Section.4, the world coordinates of landmarks are obtained to build
a sparse spatial map. However, more information of the scene is needed for diversified applications, which
prompts the birth of semi-dense and dense map. Compared to sparse map, semi-dense or dense map
models more or all of what is captured by the camera instead of only landmarks. These are commonly
used in robot navigation, where routes and obstructions should all be reconstructed. They are also
applied where 3D reconstruction with full texture of the scene or a target object is necessary for realistic
and aesthetic purposes. Driven by practical purposes, this section thus focuses on semi-dense and dense
mapping, which is equivalent to estimating the depth of objects in the scene. Note that mapping in most
cases is preceded by ego-motion estimation, which means that the poses of cameras at all timestamps
are given information.</p>
      <p>Depth estimation in frame-based SLAM is solved in three mainstream approaches: (1) in the case of
adopting a monocular camera, calculating the motion of the camera and then triangulating the depth of
space points; (2) in the case of adopting a stereo camera, triangulating the depth of space points with
the optical parallax between two frames; (3) using a depth estimation setup, for example an RGB-D camera
and lidar, to directly obtain depth information. In comparison with the third approach, the former two
approaches involve significantly more computing resources and are more fragile, but they
are more robust in large-scale outdoor scenes.</p>
      <p>With event cameras emerging, event-based monocular and stereo depth estimation methods have arisen,
inheriting the former two frame-based approaches. Meanwhile, event-based depth estimation
using structured light has been developed and works for both monocular and stereo scenarios.</p>
      <sec id="sec-5-1">
        <title>5.1. Monocular depth estimation</title>
        <p>Table. 3 lists recent event-based monocular depth estimation methods, classified according to different
criteria on method and experiment. Depth estimation from a monocular event camera is a challenging
task because of the hardship of data association. Specifically, the temporal relationship between events cannot
be directly acquired. Therefore, early methods for event-based monocular depth estimation involved
additional information, such as an intensity image, in order to address the data association issue. Works in
recent years have simplified the work of mapping by eliminating those auxiliary conditions.</p>
        <!-- Table 3 (residue removed): recent event-based monocular depth estimation methods (Rebecq [69], Gallego [35], Haessig [73], Chaney [74], Zhu [44], Carrió [75], Baudron [76], Gehrig [77]), classified among other criteria by map density (semi-dense vs. dense). "Own data" refers to a dataset used that is not open source. -->
        <p>[69] did the pioneering work that reconstructed a semi-dense depth map from monocular event streams
without requiring event associations or intensity images. It generalized the space-sweep
algorithm, which estimates 3D structure from frame-based MVS [70] data without traditional data association
[71], to a moving event camera (EMVS). In this work, individual events were considered to back-project
corresponding rays that spanned spatial structures, and events from multiple views split the space up into
disparity space image (DSI) voxels [72]. A ray counter counting the rays that traversed each voxel was
formed to determine the ray density per voxel, and a semi-dense map was obtained by computing voxels
with a local maximum of ray density, which corresponds to a structural point of the scene. [35] solved
the same problem as [69], estimating the depth of 3D structures from multi-view events. It worked under
the optimization-based CMax framework mentioned in Section.3.2, where events are warped into
motion-corrected images, and the correct depth could be found where patches of warped events had the
highest variance.</p>
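        <p>A much-simplified sketch of this ray-counting idea (our own illustration, not the EMVS implementation of [69]; camera poses, intrinsics and the depth discretization are assumed to be given) back-projects each event at a set of candidate depths and votes in a DSI defined in a reference view:</p>
        <preformat><![CDATA[
import numpy as np

def vote_dsi(events, poses, K, ref_pose, depths, dsi_shape):
    """Accumulate ray-counting votes in a disparity space image (DSI).
    events: iterable of (x, y, p, t); poses: per-event (R, t) mapping camera -> world;
    ref_pose: (R_ref, t_ref) of the reference view; depths: candidate depth planes."""
    H, W, D = dsi_shape
    dsi = np.zeros(dsi_shape)
    K_inv = np.linalg.inv(K)
    R_ref, t_ref = ref_pose
    for (x, y, _p, _t), (R, t) in zip(events, poses):
        ray = K_inv @ np.array([x, y, 1.0])          # bearing vector in the event-camera frame
        for z in depths:                              # sample the back-projected ray
            p_world = R @ (ray * z) + t               # 3D hypothesis in world coordinates
            p_ref = R_ref.T @ (p_world - t_ref)       # same point in the reference camera frame
            if p_ref[2] <= 0:
                continue
            u, v, _w = K @ (p_ref / p_ref[2])         # project into the reference view
            ui, vi = int(round(u)), int(round(v))
            k = np.searchsorted(depths, p_ref[2])     # depth bin of the reference view
            if 0 <= ui < W and 0 <= vi < H and k < D:
                dsi[vi, ui, k] += 1.0                 # one vote per traversed voxel
    return dsi   # local maxima of ray density along depth give semi-dense structure
      ]]></preformat>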
        <p>Most works in recent years addressed the mapping problem in an ANN-based fashion. However, due
to the asynchronous nature of event streams, data association appeared to be a major hardship, especially
for deep-learning methods. Attempts were made to achieve better event alignment by applying variable
network architectures, renovating algorithms or adjusting event representation inputs. [73] transplanted
the model of depth-from-defocus estimation to event-based SLAM. This work presented a novel
SNN approach to the depth-from-defocus problem for depth map reconstruction, considering events to be
ideal spike inputs to the SNN. The core of this network was a focus detection network based on
Leaky Integrate-and-Fire neurons, composed of two input neurons for ON and OFF polarity events respectively.
[74] designed an ANN specifically for environments with a ground plane. It was trained to learn the ratio
between the height of a point above the ground plane and its depth in the event camera frame, after which
height and depth information could be decomposed easily given the ground plane calibration.</p>
        <p>Some other works discussed event representations for preserving the spatio-temporal information of
event streams. CNNs [44, 75, 77] have been introduced for this task. [44] constructed a CNN with an unsupervised
encoder-decoder architecture for depth prediction. It took discretized volumes of events as input to
preserve the temporal distribution of the events as well as to remove motion blur. Meanwhile, the Recurrent
Neural Network (RNN) has been introduced to handle the asynchronous data of events combined with frames.
[77] did this by applying an encoder-decoder architecture based on UNet, which maintained an internal state that
was updated asynchronously by event or frame input and could be decoded into a depth estimate
at any timestamp.</p>
        <p>Overall, depth is estimated via the projection and coordinate transformation of features in monocular
SLAM. Monocular depth estimation methods attempted to address the hardship of data association,
which is to recover the temporal association between events. Among these methods, the ones based
on deep learning have shown more robustness, because they can integrate several cues from the event
stream, and thus have drawn great attention from researchers. The rise in the ability of novel methods to
estimate depth from monocular events has also resulted in denser maps, which provide more detailed
information of the scene for more realistic 3D reconstruction and more accurate navigation.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Stereo depth estimation</title>
        <p>It is feasible to use frame-based stereo systems to estimate depth, because the shutters of the two cameras
are triggered synchronously, so feature extraction and matching for the left and right images are directly
performed at the same timestamps. However, in a stereo system composed of two event cameras,
pixel matching between the left and right cameras is difficult. The principles of event co-occurrence and
the epipolar constraint are often used to estimate the depth: the two events triggered by the same edge in
3D space lie on corresponding epipolar lines of the left and right cameras. However, due to the existence
of latency and noise, it is difficult to achieve this at the pixel level. In summary, the key step
of depth estimation in an event-based stereo system is finding the corresponding events of both cameras.</p>
        <!-- Figure 3 caption (displaced residue): the projector casts light patterns onto the scene; an event camera extracts features along the illuminated patterns to generate event streams, which in some works are further aggregated into event frames to depict features more clearly, with green lines representing ON events and red lines representing OFF events. -->
        <p>The most significant theory, hardships and algorithms of event-based stereo depth estimation were
surveyed by [78]. For one thing, it introduced the supporting principle for stereo vision that the disparity
(the horizontal displacement) between the two views of a stereo camera is inversely proportional to the depth. For
another, it outlined the core problem of obtaining disparity, which is to match corresponding events
from the two views, along with the mismatching problem incurred by the high temporal resolution and high
sensitivity of event sensors.</p>
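        <p>This inverse relationship can be stated explicitly for a rectified stereo pair (standard stereo geometry, quoted for completeness; f is the focal length, b the baseline and d the disparity):</p>
        <disp-formula>
          <tex-math><![CDATA[
Z = \frac{f\, b}{d}.
          ]]></tex-math>
        </disp-formula>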
        <p>As was mentioned in [78], a correspondence exists between disparity information and depth
information in stereo problems. [79] thus realized event-based disparity estimation by introducing lifetime
estimation of single events, which can be used for map reconstruction. It raised the accuracy of disparity
estimation by generating sharp gradient images from lifetime matching between corresponding events
from the two sensors. [80] utilized the velocity of the event camera for generating disparity estimates. [81]
developed a disparity mapping network with the stereo framework of [82] as the baseline, keeping the
event embedding and stereo matching sub-networks of the previous study. In the meantime, it made
major architectural modifications to the image reconstruction and feature aggregation sub-networks, integrating
a cross-semantic attention mechanism and modulating event features with reconstructed
image features through a stacked dilated spatially-adaptive denormalization mechanism.</p>
        <p>
          In addition, window-based methods [83, 84], the uniqueness constraint [85] and optimization methods [
          <xref ref-type="bibr" rid="ref7">7, 32, 86, 87,
88</xref>
          ] are feasible for event matching. Furthermore, frame-based deep learning methods [31, 86, 89, 90, 91,
92, 93, 94, 95, 96, 97] were applied to address these problems. The above works took input from a pair
of event sensors. Distinctly, [98, 99] went down a different route, in which the so-called stereo setup
included a frame-based camera and an event-based camera. [98] estimated dense disparity from stereo
frames when they were available, predicted the disparity using odometry information, and tracked the
disparity asynchronously using the optical flow of events between frames.
        </p>
        <p>In summary, depth is estimated via stereo matching using the disparity between two sensors in stereo SLAM.
The accuracy and efficiency of event correspondence between the two cameras are the key criteria for evaluating
stereo mapping algorithms.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Depth Estimation Using Structured Light</title>
        <p>Structured light (SL) is considered the most reliable technique in depth estimation. When applied in
event-based SLAM, the hardware setup of an SL system mostly includes a Digital Light Processing (DLP)
lightcrafter module casting simple or encoded light patterns onto the illuminated scene with a mirror array
reflecting light back, and an event camera or a pair of event cameras receiving the light to generate images.
A common setup for event-based monocular SL can be seen in Figure.3. Its main purpose is to simplify
the extraction of features and facilitate data association between two views. In event-based systems, the
measurement of spatial points using SL is accomplished by calibrating the relative pose between
the lightcrafter and the event camera, followed by triangulation once events of corresponding points
are identified by data association.</p>
        <p>A universal calibration procedure for event-driven DLP-based monocular depth estimation systems
was first proposed by [100]. Its main contribution was a Temporal Matrices Mapping (TMM) calibration
algorithm that calibrates the event camera and the galvanometer of the DLP with two temporal matrices attained
through scanning a front-parallel plane and the corresponding scanning speed.</p>
        <p>As for triangulation, recent works on monocular depth estimation using SL largely focused on adopting
high-frequency light patterns to fit the high temporal resolution of event cameras, such as
frequency-tagged light patterns [101], blinking lights of a pseudo-random pattern [103] and periodic fringe patterns
[91]. [102] projected temporally modulated light of two wavelengths and triggered events by the bispectral
difference induced by the light absorbance difference of a certain medium. The merits of high temporal
resolution and high dynamic range of event cameras were fully exploited to obtain an unaffected bispectral
difference for depth calculation. [104] built a novel formulation comprising a laser point projector and
an event camera. It estimated dense depth by maximizing the spatio-temporal consistency between
data from the projector and the event camera, when interpreted as a stereo system. This work took
advantage of the focusing power of the laser point light source and the data redundancy suppression, high
temporal resolution and HDR of the event camera to produce more robust mapping in high-speed motion.
[105] adopted a similar hardware system to [104] but followed a more adaptive path in SL illumination,
where the density of projected laser light in a certain area depended on the intensity of scene activity in that
area, to reduce power consumption.</p>
        <p>SL can also be integrated with an event-based stereo setup to simplify stereo correspondence. A typical
work on event-based stereo depth estimation using SL was [106], in which a mirror-galvanometer-driven
laser served as the SL projector to generate blobs in space. These blobs triggered events that were
captured by two event cameras and served as the key points for triangulation.</p>
        <p>In general, the integration of SL has made depth features directly accessible to SLAM
systems. Hardware innovations have exploited the attractive properties of events, with diverse light
encoding patterns adapting to the high temporal resolution of event cameras, while laser point
light sources have been widely applied to exploit the HDR merit of event cameras.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Resources</title>
      <p>In this section we summarize the available resources (datasets and simulators) for event-based navigation and
positioning, as listed in Table.4. Most of these resources have been widely applied by researchers to
test the accuracy, robustness and computational efficiency of their event-based SLAM algorithms. The
results serve as benchmarks for the performance of new methods, which has played a significant role in
driving the techniques in this field forward.</p>
      <sec id="sec-6-1">
        <title>6.1. Resources for ego-motion estimation</title>
        <p>One of the features of ego-pose estimation datasets is that they provide existing information, in most cases a
reconstructed depth map. [24, 28] released datasets for event-based camera tracking from an existing
photometric depth map constructed by an RGB-D camera. Event streams were generated by a DVS and
the known photometric depth map was constructed from prior mapping by an RGB-D camera. The latter
further improved the accuracy of the 3D reconstructed map by attaching ElasticFusion poses from a motion
capture system. [107] released the first dataset specifically for ornithopter robot perception in indoor
and outdoor scenarios. This dataset was generated to prove the advantage of event cameras applied to
flapping-wing ornithopters.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Resources for tracking</title>
        <p>Datasets and simulators for VO, VIO and SLAM are too numerous to list. DAVISs, RGB-D cameras or stereo
cameras, external motion capture systems like OptiTrack, and odometry systems on hardware platforms
are commonly used for generating events, depth and ground-truth motion, respectively [108, 109]. One of
the earliest and most classic event-based SLAM resources was [27], which released an event-based
dataset and simulator for pose estimation, visual odometry and SLAM covering a variety of scenes. Later
work [34, 110] boosted the study of event-based positioning and navigation with datasets derived from
aggressive high-speed motions in changeable illumination scenes that were beyond the capabilities
of existing tracking algorithms. In light of real-world SLAM applications, [111] proposed to include
multi-sensor configurations for dealing with motion disturbances and illumination conditions together.</p>
        <p>Catering to the rise of deep learning methods in event-based vision, datasets devised to train and
test the performance of ANNs were released accordingly. [42] published the first annotated DAVIS
driving recordings. This dataset was specially built for end-to-end (E2E) CNN and CNN/RNN networks
in VO/SLAM. Vehicle speed, GPS position and driver steering, throttle, brake captured from the car’s
on-board diagnostics interface were given for computing ground truth. This work was expanded by
[112] in terms of road types, weather and daylight conditions. Following these works, [113] published
the largest event-based dataset with ground truth of independently moving entities. This dataset was
recorded specially for testing deep-learning-based SLAM algorithms targeted for cameras in anomalous
motion, which was made possible by including multiple labeled independently moving entities into the
dataset. [114] released the first event-based dataset which included accurate pixel-wise motion masks,
ego-motion and ground-truth depth for testing learning-based motion segmentation methods.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Resources for mapping</title>
        <p>Resources for mapping so far were mostly generated by synchronised stereo setups originally built for stereo
depth estimation. However, they can also be adapted to monocular depth estimation by only using events
and images from one of the cameras, left or right. [31] released the first and most widely used
event-based stereo depth dataset for driving, which was later improved by [115] into the first high-resolution,
large-scale stereo event dataset in driving scenarios. [94] published synthetic sequences of rotating
3D objects and real-world sequences of fast-rotating objects for testing the ability of algorithms
to operate on non-rigid, rapidly rotating objects.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Discussion</title>
      <sec id="sec-7-1">
        <title>7.1. Current and future application of event-based positioning and navigation</title>
        <p>In general, event cameras have the ability to estimate rotation, depth and pose in complex environments
(whether indoors or outdoors) with low power consumption and without interruption. Meanwhile, they are
very suitable for deployment in navigation and positioning scenarios that frequently involve complex
maneuvers, strict restrictions on power consumption and a high dependency on visual information.
Specifically, UAV autonomous navigation, high-speed object detection and obstacle avoidance are some
examples.</p>
        <p>Researchers will continue to work on event-based navigation and positioning algorithms that are
more efficient and easier to implement on hardware. Faster and more accurate motion compensation
approaches will hopefully be worked out to output high-quality poses for tracking. At the same time,
in parallel pipelines, the depth of the scene can be eficiently estimated, whether monocular or stereo,
and finally realize positioning and mapping in complex environments, improving robustness and speed
while minimizing power consumption.</p>
      </sec>
      <sec id="sec-7-2">
        <title>7.2. Advantages of event camera in indoor positioning and navigation</title>
        <p>For the indoor environments concerned in this paper, event cameras can be incorporated as vision sensors in
positioning and navigation. The high dynamic range (HDR), high temporal resolution and low power
consumption of event cameras better cater to the complex characteristics of indoor environments, ensuring robust
performance of the system.</p>
        <sec id="sec-7-2-1">
          <title>7.2.1. High dynamic range</title>
          <p>Unlike outdoor fields where natural light offers consistent illumination bright enough for cameras to
capture scene features, indoor environments are often dynamic, with complex structures illuminated by
artificial lighting. Event cameras boast a high dynamic range that can reach 140 dB, compared to a common
60 dB for frame-based cameras. This property is especially required for navigation and positioning in
extreme working scenarios, for example in natural open fields over long durations, where illumination
conditions may vary largely within a long period of time. High dynamic range ensures that navigation is
consistent and robust to environmental alterations.</p>
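          <p>The quoted figures follow the usual logarithmic definition of dynamic range as the ratio between the largest and smallest measurable light intensities (a standard definition, stated here for clarity rather than taken from a specific cited sensor datasheet):</p>
          <disp-formula>
            <tex-math><![CDATA[
\mathrm{DR}_{\mathrm{dB}} = 20 \log_{10}\!\left(\frac{I_{\max}}{I_{\min}}\right),
            ]]></tex-math>
          </disp-formula>
          <p>so 120 dB corresponds to roughly six orders of magnitude of illumination.</p>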
        </sec>
        <sec id="sec-7-2-2">
          <title>7.2.2. High temporal resolution</title>
          <p>Indoor scenes are usually limited in space with many obstructions. Therefore, vehicles or robots are
frequently put under rapid maneuver control to avoid crashing into obstacles. In these circumstances, the
high temporal resolution of event cameras is needed for robust SLAM. Event cameras are capable
of outputting event streams at microsecond-level temporal resolution in the lab and sub-millisecond level in
the real world, enabling the navigation system to reconstruct obstacles rapidly and thus vehicles to react
quickly. This also results in less motion blur than in common frame-based cameras, so that events are
generated by actual features within the scene rather than noise caused by high-speed motion.</p>
        </sec>
        <sec id="sec-7-2-3">
          <title>7.2.3. Low power consumption</title>
          <p>The limited scale of indoor environments places restrictions on the volume and power dissipation of
the hardware system for complicated vehicle motion and more durable navigation. In event sensors,
pixels only react to brightness changes that reach a previously defined threshold. While the system-level
power consumption of a traditional camera may be around 1-2 W, that of event cameras can reach lower
than 24 mW. The power-saving feature makes event cameras applicable to indoor onboard navigation and
positioning on compact equipment that may not be able to carry power packs with large batteries.</p>
        </sec>
      </sec>
      <sec id="sec-7-3">
        <title>7.3. Challenges of event cameras in indoor positioning and navigation</title>
        <sec id="sec-7-3-1">
          <title>7.3.1. The lower bound of dynamic range</title>
          <p>Event cameras can sense strong light intensity changes, but are not sensitive enough to weak changes
(around 0.1 lux). In extremely dim scenes, minor changes in lighting can generate large numbers of events which,
in reality, are all noise. The real events are drowned in noise, and this phenomenon is very severe in
low-light scenes. The latest event camera, the Prophesee EVK4, can perceive a minimum light level of 0.08 lux
and has enhanced low-light capability, but the noise problem still cannot be solved. This brings great
challenges to the application of event cameras in indoor scenes that are often dimly lit. In 2021, DARPA
announced that it had begun research on event cameras in the infrared band to enhance the ability of
event-driven sensors to work in low-light conditions, but this still remains on paper.</p>
        </sec>
        <sec id="sec-7-3-2">
          <title>7.3.2. The noise from event stream</title>
          <p>Existing neuromorphic vision sensors suffer from three main types of output noise: background activity
(BA) noise, hot-pixel noise and flicker noise. In a static scene, most noise can be easily removed by
judging the temporal correlation and the flicker frequency within a sliding time window. However, when the
sensor performs complex motion, it is very difficult to remove hot-pixel noise and flicker noise. A
reflective plane is represented as a region, not a sparse point. Events generated by ambient light and reflections in
windows persist over longer time spans due to camera motion, much like events generated by dynamic objects
in the scene. Meanwhile, events triggered by static objects without the flickering effect under the
motion of the camera are temporally consistent. Using methods such as TS, it is easy to distinguish static
objects from dynamic content containing flicker noise, but it is difficult to distinguish flicker noise from real dynamic
objects. By accurately estimating the camera trajectory, optical flow estimation and pixel area matching,
the flicker noise from reflective objects can be judged to a certain extent. But this is still difficult in
practice, because it is hard to effectively extract and track their features. We are likely to regard a mirror
as a dynamic object and ignore it when building a map, which can easily cause collisions.</p>
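          <p>A minimal sketch of the temporal-correlation filtering idea mentioned above (our own illustrative heuristic for background-activity noise, not a specific published filter; the sensor size, neighbourhood radius and time constant are assumptions):</p>
          <preformat><![CDATA[
import numpy as np

def background_activity_filter(events, shape=(480, 640), radius=1, dt_max=5000):
    """Keep an event only if some pixel in its (2*radius+1)^2 neighbourhood fired
    within the last dt_max microseconds; isolated events are treated as BA noise.
    events: list of (x, y, p, t) tuples sorted by timestamp t (in microseconds)."""
    H, W = shape
    last_ts = np.full((H, W), -np.inf)
    kept = []
    for x, y, p, t in events:
        x, y = int(x), int(y)
        x0, x1 = max(0, x - radius), min(W, x + radius + 1)
        y0, y1 = max(0, y - radius), min(H, y + radius + 1)
        if (t - last_ts[y0:y1, x0:x1]).min() <= dt_max:   # a recent neighbour supports this event
            kept.append((x, y, p, t))
        last_ts[y, x] = t                                  # record the event either way
    return kept
      ]]></preformat>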
        </sec>
        <sec id="sec-7-3-3">
          <title>7.3.3. The configuration scheme of sensors</title>
          <p>The hardware configurations of existing event-based systems for navigation include a single event camera,
binocular event cameras, an event camera combined with another visual sensor, and multi-source integration
with an IMU. The above schemes have proven to be feasible. From a complementary perspective, the scheme
combining an event camera and a standard camera can take into account both high-speed and
low-speed scenes. On the one hand, in high-speed motion, the event camera as the main sensor provides an
event stream without motion blur. On the other hand, the standard camera as the main sensor provides fine
scene texture characteristics. This configuration compensates for the lack of information
when the event camera stays at low speed or stationary and for motion blur when the standard camera moves at
high speed. Judging from Section.4, a single event camera can complete feature extraction, tracking
and depth estimation. Although the addition of an IMU has proven to improve the robustness of the system,
we believe that a single event camera can be competent without an IMU. The task of indoor positioning
and navigation is actually very complicated, with many parallel pipelines. If detection, identification and
control (such as obstacle avoidance) are considered part of the navigation task, then the system
requires the addition of standard cameras and an IMU to meet the needs of diverse tasks.</p>
        </sec>
      </sec>
      <sec id="sec-7-4">
        <title>7.4. Necessity of specific event-based hardware for indoor navigation and positioning</title>
        <sec id="sec-7-4-1">
          <title>7.4.1. Necessity of specific event camera for positioning and navigation</title>
          <p>In indoor tasks, sensors with low resolution are sufficient to obtain fine scene information because of the
limited depth of field. A reduction of sensor resolution results in a reduction of data volume and
accordingly of the load on the back-end data processing system, as well as improved computing
speed. Furthermore, the noise of sensors is much higher in a dim environment (more than half of the
data being noise) than in a bright scene. The reduction in sensor resolution also reduces the amount of
noise. For complex indoor environments, this improvement can greatly enhance the responsiveness and
maneuverability of the system.</p>
          <p>Existing sensors output raw event streams without any denoising, which leads to a high event rate in
complex scenes. The back-end system then has to perform denoising first with complex algorithms or dedicated
hardware, which lowers efficiency. At the same time, because the sensor and the computing hardware are
separate, the data transfer process has to be repeated, and this unnecessary step adds considerable latency.
Therefore, an ideal event-based sensor for positioning and navigation should have chip-level or sensor-level
denoising capabilities and output high-quality data at the sensor level, which can significantly reduce the event
rate while maintaining the sparsity of the event stream. To raise the intelligence level of the sensor further,
the output data should also undergo a degree of preprocessing so that the sensor emits features that can be
used directly for tracking, namely events that have already undergone feature extraction. After this
preprocessing, the data output by the sensor can be used directly by the back-end, with the significant
advantages of high efficiency, sparsity and low power consumption.</p>
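          <p>The following is a minimal sketch of the kind of sensor-level denoising argued for above, modelled on the
widely used background-activity idea: an event is kept only if a pixel in its neighbourhood fired recently,
otherwise it is treated as isolated noise. The 5 ms support window, the array shapes and the toy stream are
assumptions for illustration.</p>
          <preformat><![CDATA[
# Minimal sketch of a background-activity-style denoising filter that could,
# in principle, run at chip or sensor level. Parameters are assumptions.
import numpy as np

def background_activity_filter(events, width, height, support_window_s=5e-3):
    """events: iterable of (x, y, t, p), t in seconds, sorted by t.
    Returns the list of events that have recent spatial support."""
    last_activity = np.full((height, width), -np.inf)  # last event time per pixel
    kept = []
    for x, y, t, p in events:
        x0, x1 = max(0, x - 1), min(width, x + 2)
        y0, y1 = max(0, y - 1), min(height, y + 2)
        # Support = some pixel in the 3x3 neighbourhood fired within the window.
        if (t - last_activity[y0:y1, x0:x1]).min() <= support_window_s:
            kept.append((x, y, t, p))
        last_activity[y, x] = t
    return kept

if __name__ == "__main__":
    # Two correlated events on adjacent pixels plus one isolated noise event;
    # only the supported second event of the pair survives the filter.
    stream = [(5, 5, 0.000, 1), (6, 5, 0.001, 1), (40, 40, 0.002, -1)]
    print(background_activity_filter(stream, width=64, height=64))
]]></preformat>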
        </sec>
        <sec id="sec-7-4-2">
          <title>7.4.2. Necessity of specific neuromorphic processing hardware for positioning and navigation</title>
          <p>Signals triggered by event-based sensors are naturally suited to an event-based processing system,
namely SNN computing hardware. Existing computing hardware is generally a CPU, FPGA or GPU, none of
which is designed for events. To use such hardware to handle events, the events must first be converted into
data formats the hardware can consume, but these transformations usually sacrifice the sparse and asynchronous
properties of the events themselves. As a result, only conventional neural-network methods can be adopted,
together with their usually heavy computation, and such methods do not actually exploit the sensing advantages
of the event-driven approach. Consequently, there are still gaps in the performance indicators of practical
applications compared with ordinary visual sensors. So far, no broadly accepted and feasible SNN training
mechanism has emerged; such a mechanism would facilitate the deployment and implementation of SNNs on
hardware and allow the advantages of neuromorphic perception and computation to be truly exploited in visual
navigation and positioning. Ultimately, the neuromorphic sensor and the neuromorphic computing hardware
can be combined into a neuromorphic visual navigation and positioning system that is truly high-speed,
high-dynamic-range and low-power.</p>
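          <p>To make the loss of sparsity concrete, the toy sketch below (our own illustration with assumed shapes
and an assumed toy stream) performs the conversion the paragraph describes, accumulating an asynchronous
event stream into a dense frame so that conventional CPU/GPU pipelines can consume it; the per-event
timestamps and the sparse layout disappear in the process.</p>
          <preformat><![CDATA[
# Minimal sketch: densifying events into a frame for CPU/GPU consumption,
# which is exactly where the sparse, asynchronous structure is discarded.
import numpy as np

def events_to_frame(events, width, height):
    """Accumulate signed event polarities into a dense (height, width) frame."""
    frame = np.zeros((height, width), dtype=np.float32)
    for x, y, _t, p in events:
        frame[y, x] += p          # timestamps are collapsed away here
    return frame

if __name__ == "__main__":
    stream = [(3, 2, 0.0001, +1), (3, 2, 0.0004, +1), (10, 7, 0.0002, -1)]
    frame = events_to_frame(stream, width=16, height=16)
    print(f"{len(stream)} events -> {frame.size}-cell dense frame, "
          f"{np.count_nonzero(frame)} non-zero cells")
]]></preformat>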
        </sec>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion</title>
      <p>Event cameras are a representative achievement of neuromorphic vision, boasting high temporal
resolution, high dynamic range and low latency compared with standard cameras. Their emergence makes
possible applications that traditional cameras cannot handle, bringing a revolution to visual applications,
especially to vision-based navigation and positioning, which is full of challenges and difficulties. In this paper,
we briefly introduce the principle of event cameras. We then overview the research on event-based vision in
navigation and positioning, including ego-motion estimation, event-based tracking, event-based mapping and
datasets for evaluation and analysis. Great challenges remain in existing event-based navigation and positioning
research, but challenges are opportunities. We analyze the advantages of event-based solutions and the possible
improvements and research directions, and make suggestions for neuromorphic hardware specialized for
navigation. Finally, we put forward prospects. We hope that this paper can give researchers inspiration, so that
neuromorphic vision can play a greater role in indoor navigation and positioning and achieve intelligent
perception and computation in complex conditions.</p>
    </sec>
  </body>
</article>